Evaluation of computer-aided detection and computer detection systems

Computer-aided detection (CAD) is gaining clinical acceptance in screening mammography. This article describes how the performance of computer detection systems, and of radiologists who use them, is evaluated.

Dr. Nishikawa is an Associate Professor in the Department of Radiology at The University of Chicago, Chicago, IL.

Computer-aided diagnosis has been defined as a diagnosis made by a radiologist who uses the output of computer analysis in making his or her decision. Computer-aided detection (CAD) specifically applies to situations in which the computer is assisting the radiologist in locating an abnormality, such as in screening mammography. Computer-aided diagnosis applies to situations in which the computer assists the radiologist in classifying an already-detected lesion as being benign or malignant, such as in diagnostic mammography. Another distinction in terminology is between the computer system's findings or output and the computer system's use as an aid to the radiologist. The latter is referred to as CAD, to be consistent with the definition, and the former is referred to as computer detection or computer diagnosis, depending on the application. While there is a growing body of research in computer diagnosis, computer detection is already a clinical reality. This article is restricted to a discussion of computer-aided detection.

As CAD gains clinical acceptance, it is important to understand how CAD and CAD systems are evaluated. There are two levels of evaluation, one being the performance of the computer itself and the other the performance of the radiologist when using the CAD system. In computer-aided detection, the goal is to identify a cancer on the mammogram that a radiologist will overlook. The last requirement is critical: a computer's identification of cancers that a radiologist could identify without the detection aid would produce no benefit to the radiologist. Conversely, the computer could, in theory, detect only a fraction of the cancers present in a screening population, but if the cancers that the computer detects were ones that a radiologist would overlook, then CAD would produce a large benefit.

Performance of computer detection

To evaluate the performance of computer detection, one usually measures the sensitivity (percentage of cancers correctly detected by the computer) versus the average number of areas identified by the computer that do not contain a cancer (number of false marks) per image. A plot of sensitivity versus the average number of false marks per image is known as a free-response receiver operating characteristic (FROC) curve, as shown in figure 1. A FROC curve differs from the more common receiver operating characteristic (ROC) curve in the quantity plotted on the x-axis (figure 2). In a ROC curve, the x-axis is not the number of false marks per image but the false-positive fraction, or 1 minus the specificity (where specificity is the percentage of normal cases correctly called normal). The difference exists because a single image can contain more than one false mark. Thus, the total number of false marks possible is unknown, and as a result, the false-positive fraction is not well defined. One could calculate specificity by arbitrarily counting one or more detections in a normal image as a single false detection. However, this approach cannot distinguish between a computer detection scheme that has only 1 false detection per image and one that has 10 false detections per image.
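The construction of FROC points described above can be sketched in Python. This is a minimal illustration, not any particular system's evaluation code: the function name `froc_points`, the mark scores, and the case counts are all hypothetical, and the sketch assumes each cancer is hit by at most one true-positive mark.

```python
from typing import List, Tuple


def froc_points(
    marks: List[Tuple[float, bool]], n_cancers: int, n_images: int
) -> List[Tuple[float, float]]:
    """Return (false marks per image, sensitivity) pairs, one per score threshold.

    marks holds one (score, hits_a_cancer) entry per computer mark.
    Assumption: each cancer is hit by at most one true-positive mark.
    """
    points = []
    # Sweep the threshold from the strictest score to the most lenient one.
    thresholds = sorted({score for score, _ in marks}, reverse=True)
    for threshold in thresholds:
        kept = [hit for score, hit in marks if score >= threshold]
        true_marks = sum(kept)
        false_marks = len(kept) - true_marks
        points.append((false_marks / n_images, true_marks / n_cancers))
    return points


# Hypothetical marks from a made-up test set of 10 images containing 4 cancers.
example_marks = [
    (0.9, True), (0.8, False), (0.7, True),
    (0.4, False), (0.3, False), (0.2, True),
]
curve = froc_points(example_marks, n_cancers=4, n_images=10)
for false_marks_per_image, sensitivity in curve:
    print(f"{false_marks_per_image:.1f} false marks/image, sensitivity {sensitivity:.2f}")
```

Note that the x-axis value is not bounded by 1: a scheme that places several false marks on every image simply moves further to the right, which is the information an image-level specificity would discard.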

A detection scheme operates at a single point on its FROC curve, and that point depends on how "aggressively" the scheme is designed to operate. Just as there are days when a radiologist may read more aggressively--and thereby find more cancers, but have a higher recall rate--CAD schemes can be tuned to operate with a given sensitivity or a specified false-mark rate, but there is usually a trade-off between sensitivity and false-mark rate. The only way that sensitivity can be increased without a concomitant increase in the number of false marks per image is by operating on a FROC curve that is closer to the upper-left corner of the graph. Thus, one way to compare two different CAD schemes is to look for the one with the higher FROC curve. Unfortunately, there is no statistical method to compare two FROC curves, so it is not possible to say that one curve is statistically better than another, in some cases even when the difference between them is large.
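Tuning a scheme to a specified false-mark rate amounts to choosing one point on its FROC curve. The sketch below shows one plausible way to do that; the function name `pick_operating_point` and the curve values are invented for illustration and do not come from any real system.

```python
from typing import List, Optional, Tuple

FrocPoint = Tuple[float, float]  # (false marks per image, sensitivity)


def pick_operating_point(
    curve: List[FrocPoint], max_false_marks_per_image: float
) -> Optional[FrocPoint]:
    """Return the most sensitive point within the false-mark budget, or None."""
    allowed = [p for p in curve if p[0] <= max_false_marks_per_image]
    if not allowed:
        return None
    # Prefer higher sensitivity; break ties with fewer false marks.
    return max(allowed, key=lambda p: (p[1], -p[0]))


# Hypothetical FROC curve: more false marks buy progressively less sensitivity.
example_curve = [(0.1, 0.60), (0.5, 0.80), (1.0, 0.88), (2.0, 0.92)]

conservative = pick_operating_point(example_curve, 0.5)
aggressive = pick_operating_point(example_curve, 2.0)
print("conservative setting:", conservative)
print("aggressive setting:", aggressive)
```

Moving from the conservative to the aggressive setting in this made-up curve raises sensitivity at the cost of four times as many false marks, which is the trade-off the text describes.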

Performance of computer-aided detection

While it is interesting to measure the performance of a computer detection system, it is more important to know how much improvement in performance a radiologist obtains when using such a system. The ultimate method for showing the benefits of CAD is a clinical study. This requires a comparison of performance--number of cancers detected and callback rate--with and without the computer aid. While simple in concept, such a study can be difficult to conduct carefully. First, two populations of screening cases need to be identified--one that will be read with the computer and one that will be read without the computer. These two populations should be identical in terms of cancer prevalence and difficulty of cases (e.g., same fraction of women with dense breasts). While there is no way to guarantee the composition of two populations, if a large number of women are included in the study, and if the women are drawn either from the same overall population or from similar populations of women, one can expect to get similar groups in the study. Further, each population needs to be read by a group of radiologists, and, ideally, these groups should be identical in terms of ability. The readers can be either two separate groups of radiologists or the same group reading mammograms with and without the computer. If a single group of radiologists is used, there is a potential for bias. If the "with" and "without" readings are sequential (i.e., if radiologists read without the aid for 6 months and then read with the aid for 6 months), the ability of the radiologists can improve over time, particularly for less experienced individuals, and that improvement would be wrongly attributed to the computer aid.
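The two quantities compared in such a study, cancers detected and callback rate, can be tabulated per reading arm as in the sketch below. The class name `ReadingArm` and every count are invented purely for illustration; they are not results from any study.

```python
from dataclasses import dataclass


@dataclass
class ReadingArm:
    """Counts for one arm of a hypothetical screening study."""
    women_screened: int
    cancers_detected: int
    women_recalled: int

    @property
    def detection_rate_per_1000(self) -> float:
        return 1000 * self.cancers_detected / self.women_screened

    @property
    def recall_rate(self) -> float:
        return self.women_recalled / self.women_screened


# Made-up numbers, chosen only to show the arithmetic.
without_cad = ReadingArm(women_screened=10_000, cancers_detected=40, women_recalled=800)
with_cad = ReadingArm(women_screened=10_000, cancers_detected=46, women_recalled=880)

for label, arm in (("without CAD", without_cad), ("with CAD", with_cad)):
    print(
        f"{label}: {arm.detection_rate_per_1000:.1f} cancers per 1000 screened, "
        f"recall rate {arm.recall_rate:.1%}"
    )
```

A comparison like this is only meaningful if the two arms are matched in prevalence, case difficulty, and reader ability, as the paragraph above cautions.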

Another method for determining the effects of the computer on radiologists' ability is to have the radiologist read the film without the aid and render a decision, and then have the radiologist examine the computer result and make another decision. Care must be used in designing such a study to ensure that the interpretation made without the computer aid is truly what the radiologist would have done if he or she were never shown the computer output. Since the final interpretation is not made until after the computer output is seen, it is possible that the radiologist will not give an unbiased reading. For example, asking a radiologist who has already seen the computer output whether he or she would have missed a cancer is open to potential bias. An unbiased method would be to record the radiologist's interpretation before he or she has seen the computer output and then record a second interpretation after the computer output has been reviewed.
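The sequential design, in which the unaided decision is locked in before the computer output is shown, can be summarized with a simple tally. This is a sketch under assumed conventions: the function name, the tuple layout, and the six example cases are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# One entry per case: (has_cancer, recalled_before_cad, recalled_after_cad).
# The "before" decision must be recorded before the computer output is shown.
CaseRecord = Tuple[bool, bool, bool]


def summarize_sequential_reads(cases: Iterable[CaseRecord]) -> Dict[str, int]:
    """Count recalls of cancer and normal cases before and after viewing CAD."""
    summary = {
        "cancers_recalled_unaided": 0,
        "cancers_recalled_aided": 0,
        "normals_recalled_unaided": 0,
        "normals_recalled_aided": 0,
    }
    for has_cancer, recalled_unaided, recalled_aided in cases:
        group = "cancers" if has_cancer else "normals"
        summary[f"{group}_recalled_unaided"] += int(recalled_unaided)
        summary[f"{group}_recalled_aided"] += int(recalled_aided)
    return summary


# Invented example: three cancer cases and three normal cases.
example_cases = [
    (True, True, True),     # cancer found with or without the aid
    (True, False, True),    # cancer found only after viewing the marks
    (True, False, False),   # cancer missed both times
    (False, False, False),  # normal case, never recalled
    (False, False, True),   # normal case recalled because of a false mark
    (False, True, True),    # normal case recalled either way
]
reads = summarize_sequential_reads(example_cases)
print(reads)
```

The difference between the aided and unaided cancer counts is the gain attributed to CAD, and the difference between the normal-case counts is its cost in additional callbacks.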

As a precursor to a clinical trial, observer studies are often performed. In these types of studies, normal and cancer cases are selected retrospectively, and a number of radiologists read each case, once with and once without the computer aid. The method for conducting and analyzing observer studies is well described in the literature. The objective of these studies is to show, under simulated clinical conditions, that radiologists' performance is better when using the computer aid. This is done by comparing ROC curves (figure 2). Statistical methods for testing whether the difference between two ROC curves is significant exist. The limitation of observer studies is that it is extremely difficult to simulate clinical conditions, if for no other reason than that it is not realistic to conduct a study with a prevalence of only 5 cancers in 1000 cases. However, these types of studies can provide valuable insight as to how CAD may perform when implemented clinically.
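One common summary of a ROC curve is the area under it, which equals the probability that a randomly chosen cancer case is rated as more suspicious than a randomly chosen normal case. The sketch below computes that area for a single reader with and without the aid. The ratings are invented, the 1-to-5 suspicion scale is an assumption, and the significance testing mentioned above is not shown.

```python
from typing import Sequence


def roc_area(cancer_ratings: Sequence[float], normal_ratings: Sequence[float]) -> float:
    """Area under the ROC curve from pairwise comparisons (ties count one half)."""
    wins = 0.0
    for cancer in cancer_ratings:
        for normal in normal_ratings:
            if cancer > normal:
                wins += 1.0
            elif cancer == normal:
                wins += 0.5
    return wins / (len(cancer_ratings) * len(normal_ratings))


# Hypothetical suspicion ratings (1 = definitely normal, 5 = definitely cancer)
# given by one reader to the same cancer cases, without and with the aid.
cancers_unaided = [3, 4, 2, 5]
cancers_aided = [4, 4, 3, 5]
normals = [1, 2, 3, 2]  # assumed unchanged by the aid in this toy example

area_unaided = roc_area(cancers_unaided, normals)
area_aided = roc_area(cancers_aided, normals)
print(f"unaided area: {area_unaided:.3f}, aided area: {area_aided:.3f}")
```

A real observer study would involve many readers and cases and would test whether the difference in area is statistically significant rather than simply reporting the two values.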

Comparing CAD systems

It is important to be able to compare different computer detection systems. This, however, is not as straightforward as it seems. The simplest comparison would be to compare the sensitivities and false-mark rates. However, if one system has a higher sensitivity but also more false marks than another, it is not clear which system is better. One way to avoid this possible confusion is to plot FROC curves--the higher the curve, the better the performance (figure 1).

There are caveats to note when comparing CAD systems. First, the performance is strongly dependent on the cases used. A system tested using only cases with palpable lesions is likely to have better performance than another system tested on cases with cancers from an incidence round of screening, even if the latter system has better inherent performance. The appropriate choice of cases for evaluating a CAD system depends on the population of women for which the system will ultimately be used. For example, if the system will be used in North America on screening cases, then a database of consecutive cancers and a second set of consecutive normal cases drawn from several centers in North America would be appropriate. The greater the number of cases used, the more accurately the final numbers will reflect clinical performance. However, as the number of cases increases, the cost and the logistics of performing the evaluation increase, probably exponentially.
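To see why more cases give numbers that reflect clinical performance more accurately, consider the uncertainty in a measured sensitivity. The sketch below uses the normal approximation to the binomial distribution, which is an assumption introduced here for illustration and not a method named in the article; the function name and the sensitivity value of 0.8 are likewise hypothetical.

```python
import math


def sensitivity_half_width(sensitivity: float, n_cancers: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for a measured sensitivity.

    Uses the normal approximation to the binomial, which is rough for
    small case counts or for sensitivities close to 0 or 1.
    """
    return z * math.sqrt(sensitivity * (1.0 - sensitivity) / n_cancers)


# A measured sensitivity of 0.8 becomes more trustworthy as cancer cases are added.
for n_cancers in (25, 100, 400):
    half_width = sensitivity_half_width(0.8, n_cancers)
    print(f"{n_cancers:4d} cancer cases: 0.80 +/- {half_width:.3f}")
```

Under this approximation, halving the uncertainty requires four times as many cancer cases, which is one reason the cost of an evaluation climbs quickly.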

It is also important that the cases used to evaluate or test the performance of a CAD system be different from the cases used to develop and train the system. Most CAD schemes have four stages: i) a prefilter, where the lesion of interest is enhanced while other structures are suppressed; ii) segmentation of the lesion to isolate it from its background; iii) extraction of features characterizing the lesion; and iv) a classifier that uses the features to distinguish actual lesions from nonlesions. There are many ways to accomplish each of these four stages.

In developing a scheme, the researcher decides on the best techniques for performing each of these stages. Training the system involves selecting values for the parameters used in each of the four stages (e.g., training a neural network to recognize microcalcifications). If the same cases are used to develop, train, and test the CAD system, then a bias will exist and the measured performance will be higher than the actual performance. For example, it is possible to train an artificial neural network to have perfect performance on a given set of cases. However, when that network is used on a different set of cases, the results will be less than perfect and often will be surprisingly low. A single set of cases can be used to train and to test a CAD system if a jackknife method of testing and training is employed. Essentially, this and other similar methods divide the cases into training and testing sets. The CAD system is trained on the training set and evaluated on the testing set. Then a new division is made and the process is repeated. The results from each process are accumulated into a single FROC curve.
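The bias from testing on the training cases, and the way a leave-one-out split (one simple member of the jackknife family) avoids it, can be shown with a toy example. Everything here is hypothetical: the single-feature "classifier" is a stand-in for a real CAD scheme, the feature values are invented, and the sketch pools correct decisions into one accuracy figure rather than accumulating a FROC curve.

```python
from typing import List, Tuple

Sample = Tuple[float, bool]  # (feature value, is_actual_lesion)


def train_threshold(samples: List[Sample]) -> float:
    """Toy training step: threshold halfway between the two class means."""
    lesions = [x for x, is_lesion in samples if is_lesion]
    others = [x for x, is_lesion in samples if not is_lesion]
    return (sum(lesions) / len(lesions) + sum(others) / len(others)) / 2


def resubstitution_accuracy(samples: List[Sample]) -> float:
    """Train and test on the same cases (optimistically biased)."""
    threshold = train_threshold(samples)
    correct = sum((x > threshold) == is_lesion for x, is_lesion in samples)
    return correct / len(samples)


def leave_one_out_accuracy(samples: List[Sample]) -> float:
    """Hold out each case in turn, train on the rest, test on the held-out case."""
    correct = 0
    for i, (x, is_lesion) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]
        threshold = train_threshold(training)
        correct += int((x > threshold) == is_lesion)
    return correct / len(samples)


# Invented feature values for three lesions and three nonlesions.
toy_cases = [(4.0, True), (6.0, True), (7.0, True), (1.0, False), (2.0, False), (3.0, False)]

same_cases = resubstitution_accuracy(toy_cases)
held_out = leave_one_out_accuracy(toy_cases)
print(f"tested on training cases: {same_cases:.2f}, leave-one-out: {held_out:.2f}")
```

In this toy data the classifier looks perfect when tested on its own training cases but misclassifies one lesion once that lesion is held out, so the held-out estimate is lower and more realistic.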

One should remember, again, that the detection scheme with the highest sensitivity is not necessarily the system that will provide the most benefit to the radiologist. The system that finds the most cancers that a radiologist will miss has the potential to be the most beneficial--high sensitivity does not necessarily mean high sensitivity for overlooked cancers.
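A small worked example makes the distinction concrete. The two systems, the ten numbered cancers, and the function name below are all invented; the sketch also assumes the radiologist would act on every computer mark, so the counts represent potential benefit only.

```python
from typing import Dict, Set


def summarize_system(
    all_cancers: Set[int], found_by_radiologist: Set[int], marked_by_computer: Set[int]
) -> Dict[str, float]:
    """Compare overall sensitivity with the number of overlooked cancers marked."""
    overlooked = all_cancers - found_by_radiologist
    return {
        "sensitivity": len(marked_by_computer & all_cancers) / len(all_cancers),
        "overlooked_cancers_marked": len(marked_by_computer & overlooked),
    }


# Ten hypothetical cancers; the radiologist finds the first eight unaided.
cancers = set(range(1, 11))
radiologist = {1, 2, 3, 4, 5, 6, 7, 8}

system_a = summarize_system(cancers, radiologist, {1, 2, 3, 4, 5, 6, 7, 8})
system_b = summarize_system(cancers, radiologist, {1, 2, 3, 9, 10})
print("system A:", system_a)
print("system B:", system_b)
```

System A has the higher sensitivity in this made-up case, yet it marks none of the cancers the radiologist overlooked, while the less sensitive system B marks both of them.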

Conclusion

Computer detection systems are beginning to be used clinically. This opens the potential for improving the benefits of screening mammography further. However, at this crucial time in the development of CAD as a field, radiologists play an important role. It is radiologists who will conduct the needed clinical studies to show conclusively the benefits of CAD, and it is radiologists who will purchase and use CAD systems. Radiologists need to understand CAD to fulfill this role.

Ultimately, a rigorous clinical evaluation is needed to measure the benefit of CAD to mammography. Until that time, smaller clinical studies and laboratory experiments will continue to measure the potential of CAD. Not all studies are conducted with the same protocol. It is necessary to understand how the study was conducted to understand how the results could be translated to clinical performance. The greater the number of radiologists with good CAD acumen, the faster the field will grow and the quicker beneficial schemes will be employed clinically. It will be a setback to the field if radiologists start evaluating CAD systems that are not ready for clinical testing.

Disclosure

Robert Nishikawa is a shareholder in R2 Technology, Inc. (Los Altos, CA). It is the University of Chicago conflict-of-interest policy that investigators disclose publicly actual or potential significant financial interests that may appear to be affected by the research activities.
