Computer-aided diagnosis has been defined as a diagnosis made by a radiologist who uses the output of computer analysis in making his or her decision. This article will evaluate these systems.
Dr. Nishikawa is an Associate Professor in the Department of
Radiology at The University of Chicago, Chicago, IL.
Computer-aided diagnosis has been defined as a diagnosis made by
a radiologist who uses the output of computer analysis in making
his or her decision. Computer-aided detection (CAD) specifically
applies to situations in which the computer is assisting the
radiologist in locating an abnormality, such as in screening
mammography. Computer-aided diagnosis applies to situations in
which the computer assists the radiologist in classifying an
already-detected lesion as being benign or malignant, such as in
diagnostic mammography. Another distinction in terminology is
between the computer system's findings or output and the computer
system's use as an aid to the radiologist. The latter is referred
to as CAD, to be consistent with the definition, and the former is
referred to as the computer detection or the computer diagnosis,
depending on the application. While there is a growing body of
research in computer diagnosis, computer detection is a clinical
reality. This article is restricted to a discussion of
computer-aided detection.
As CAD gains clinical acceptance, it is important to understand
how CAD and CAD systems are evaluated. There are two levels of
evaluation, one being the performance of the computer itself and
the other the performance of the radiologist when using the CAD
system. In computer-aided detection, the goal is to identify a
cancer on the mammogram that a radiologist would otherwise overlook.
This last requirement is critical: a computer's identification of
cancers that a radiologist could identify without the detection aid
would produce no benefit to the radiologist. Conversely, the computer
could, in theory, detect only a fraction of the cancers present in
a screening population, but if the cancers that the computer
detects were ones that a radiologist would overlook, then CAD would
produce a large benefit.
Performance of computer detection
To evaluate the performance of computer detection, one usually
measures the sensitivity (percentage of cancers correctly detected
by the computer) versus the average number of areas identified by
the computer that do not contain a cancer (number of false marks)
per image. A plot of sensitivity versus the average number of false
marks per image is known as a free-response receiver operating
characteristic (FROC) curve, as shown in figure 1. A FROC curve
differs from the more common receiver operating characteristic
(ROC) curve by the quantity measured by the x-axis (figure 2). In a
ROC curve, the x-axis is not the number of false marks per image but
the false-positive fraction or 1 minus the specificity (percentage
of normal cases correctly called normal). The difference exists
because a single image can contain more than one false mark. Thus,
the total number of false marks possible is unknown, and as a
result, the false-positive fraction is not well defined. One could
calculate specificity by arbitrarily counting one or more
detections in a normal image as a single false detection. However,
this approach cannot distinguish between a computer detection
scheme that has only 1 false detection per image and one that has
10 false detections per image.
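The FROC construction described above can be sketched in a few lines.
This is only an illustration under stated assumptions: the mark
format (a computer score plus the identifier of the cancer it hits,
or None for a false mark), the function names, and the toy numbers
are all hypothetical and do not come from any particular CAD system.

```python
from typing import List, Optional, Tuple

# Hypothetical mark format: (computer score, id of the cancer hit, or None for a false mark)
Mark = Tuple[float, Optional[str]]


def froc_curve(marks: List[Mark], n_cancers: int, n_images: int) -> List[Tuple[float, float]]:
    """Return (false marks per image, sensitivity) at each score threshold."""
    points = []
    for threshold in sorted({score for score, _ in marks}, reverse=True):
        kept = [cancer for score, cancer in marks if score >= threshold]
        # A cancer counts once, however many marks land on it.
        detected = {cancer for cancer in kept if cancer is not None}
        false_marks = sum(1 for cancer in kept if cancer is None)
        points.append((false_marks / n_images, len(detected) / n_cancers))
    return points


def image_level_specificity(false_marks_per_normal_image: List[int]) -> float:
    """Fraction of normal images with no false marks at all."""
    clean = sum(1 for n in false_marks_per_normal_image if n == 0)
    return clean / len(false_marks_per_normal_image)


# Toy data: 2 cancers, 4 images, 6 computer marks.
marks = [(0.9, "c1"), (0.8, None), (0.7, "c2"), (0.4, None), (0.3, None), (0.2, "c1")]
curve = froc_curve(marks, n_cancers=2, n_images=4)
for false_marks_per_image, sensitivity in curve:
    print(f"{false_marks_per_image:.2f} false marks/image -> sensitivity {sensitivity:.2f}")

# Collapsing marks to one false detection per image hides the difference
# between 1 and 10 false marks per image: both schemes score the same.
print(image_level_specificity([1, 1, 1, 1]), image_level_specificity([10, 10, 10, 10]))
```

The last line shows why the x-axis of a FROC curve is false marks per
image rather than a false-positive fraction: the image-level
specificity is identical for the two toy schemes.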
Depending on how "aggressively" the scheme is designed to
operate, detection schemes can be characterized by a single point
on the FROC curve. Just as there are days when a radiologist may
read more aggressively--and thereby find more cancers, but have a
higher recall rate--CAD schemes can be tuned to operate with a
given sensitivity or a specified false-mark rate, but there is
usually a trade-off between sensitivity and false-mark rate. The
only way that sensitivity can be increased without a concomitant
increase in the number of false marks per image is by operating on
a FROC curve that is closer to the upper-left corner of the graph.
Thus, one way to compare two different CAD schemes is to look for
the one with the higher FROC curve. Unfortunately, there is no
statistical method to compare two FROC curves, so it is not
possible to say that one curve is statistically better than
another, in some cases even when the difference is large.
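As a rough illustration of tuning a scheme to a specified false-mark
rate, the sketch below picks the operating point with the highest
sensitivity that stays within a false-mark budget. The curve values
and the function name are hypothetical, chosen only to show the
trade-off; they are not measurements from any real system.

```python
from typing import List, Optional, Tuple

# Hypothetical operating point: (score threshold, false marks per image, sensitivity)
OperatingPoint = Tuple[float, float, float]


def pick_operating_point(
    curve: List[OperatingPoint], max_false_marks_per_image: float
) -> Optional[OperatingPoint]:
    """Highest-sensitivity point whose false-mark rate stays within the budget."""
    feasible = [p for p in curve if p[1] <= max_false_marks_per_image]
    if not feasible:
        return None  # no point on this curve meets the budget
    # Prefer higher sensitivity; break ties with fewer false marks.
    return max(feasible, key=lambda p: (p[2], -p[1]))


# Illustrative FROC curve, not real data.
curve = [
    (0.9, 0.1, 0.60),
    (0.7, 0.3, 0.75),
    (0.5, 0.6, 0.85),
    (0.3, 1.2, 0.90),
]

print(pick_operating_point(curve, max_false_marks_per_image=0.5))
print(pick_operating_point(curve, max_false_marks_per_image=1.5))
```

Loosening the budget from 0.5 to 1.5 false marks per image moves the
chosen point along the same curve to a higher sensitivity; only a
different, higher curve would raise sensitivity at a fixed budget.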
Performance of computer-aided detection
While it is interesting to measure the performance of a
computer-aided detection system, it is more important to know how
much improvement in performance a radiologist obtains when using
such a system. The ultimate method for showing the benefits of CAD
is in a clinical study. This requires a comparison of
performance--number of cancers detected and callback rate--with and
without the computer aid. While simple in concept, a careful study
can be difficult to conduct. First, two populations of
screening cases need to be identified--one that will be read with
the computer and one that will be read without the computer. These
two populations should be identical in terms of cancer prevalence
and difficulty of cases (e.g., same fraction of women with dense
breasts). While there is no way to guarantee the composition of two
populations, if a large number of women are included in the study,
and if the women are drawn either from the same overall population
or similar populations of women, one can expect to get similar
groups in the study. Further, each population needs to be read by a
group of radiologists, and, ideally, these groups should be
identical in terms of ability. This can be either two separate
groups of radiologists, or the same group reading mammograms with
and without the computer. If a single group of radiologists is
used, there is a potential for a bias. If "with" and "without"
readings are sequential (i.e., if radiologists read without aid for
6 months and then read with aid for 6 months), the ability of
radiologists can improve over time, particularly for less
experienced individuals, leading to a bias.
Another method for determining the effects of the computer on
radiologists' ability is to have the radiologist read the film
without the aid and render a decision, and then have the
radiologist examine the computer result and make another decision.
Care must be used in designing such a study to ensure that the
interpretation made without the computer aid is truly what the
radiologist would have done if he or she were never shown the
computer output. Since the final interpretation is not made until
after the computer output is seen, it is possible that the
radiologist will not give an unbiased reading. For example, asking
a radiologist, after he or she has seen the computer output, whether
he or she missed a cancer is open to potential bias. An unbiased
method would be to record the radiologist's interpretation before
he or she has seen the computer output and then record a second
interpretation after the computer output has been reviewed.
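A minimal sketch of how such paired interpretations might be
tabulated follows. It assumes each case carries the decision recorded
before the computer output was shown and the decision recorded after;
the record layout, field names, and counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Reading:
    has_cancer: bool        # ground truth for the case
    recalled_unaided: bool  # decision recorded BEFORE seeing the computer output
    recalled_aided: bool    # decision recorded AFTER reviewing the computer output


def summarize(readings: List[Reading]) -> Dict[str, Dict[str, float]]:
    """Cancers detected and callback (recall) rate for each reading condition."""
    n_cases = len(readings)

    def stats(field: str) -> Dict[str, float]:
        recalled = [r for r in readings if getattr(r, field)]
        return {
            "cancers_detected": sum(1 for r in recalled if r.has_cancer),
            "callback_rate": len(recalled) / n_cases,
        }

    return {"unaided": stats("recalled_unaided"), "aided": stats("recalled_aided")}


# Toy example: 10 cases, 2 of them cancers.
readings = [
    Reading(True, True, True),     # cancer found with and without the aid
    Reading(True, False, True),    # cancer found only after reviewing the aid
    Reading(False, True, True),    # normal case recalled in both readings
    Reading(False, False, True),   # normal case recalled only with the aid
] + [Reading(False, False, False)] * 6

summary = summarize(readings)
print(summary["unaided"])
print(summary["aided"])
```

In this toy data the aided reading finds one additional cancer at the
cost of a higher callback rate, which is the kind of trade-off a
clinical comparison has to report.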
As a precursor to a clinical trial, observer studies are often
performed. In these types of studies, normal and cancer cases are
selected retrospectively, and a number of radiologists read each
case, once with and once without the computer aid. The method for
conducting and analyzing observer studies is well described in the
literature. The objective of these studies is to show, under
simulated clinical conditions, that radiologists' performance is
better when using the computer aid. This is done by comparing ROC
curves (figure 2). Statistical methods exist for testing whether
the difference between two ROC curves is significant. The
limitation of observer studies is that it is extremely difficult to
simulate clinical conditions, if for no other reason than it is not
realistic to conduct a study with a prevalence of only 5 cancers in
1000 cases.
However, these types of studies can provide valuable insight as to
how CAD may perform when implemented clinically.
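Two pieces of arithmetic behind this paragraph can be sketched
briefly. The first shows why screening prevalence is impractical for
an observer study; the second computes an empirical area under the
ROC curve from confidence ratings, using the standard rank-based
(Mann-Whitney) estimate. The ratings and function names are
hypothetical, and the sketch does not perform the significance test
that a full multi-reader analysis would require.

```python
from typing import List


def cases_needed(target_cancers: int, cancers_per_1000: int = 5) -> int:
    """Cases to read so that the expected number of cancers reaches the target."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-target_cancers * 1000 // cancers_per_1000)


def empirical_auc(cancer_ratings: List[float], normal_ratings: List[float]) -> float:
    """Probability that a cancer case is rated higher than a normal case (ties count half)."""
    wins = 0.0
    for c in cancer_ratings:
        for n in normal_ratings:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer_ratings) * len(normal_ratings))


# At 5 cancers per 1000 cases, 50 cancers would require 10,000 cases.
print(cases_needed(50))

# Hypothetical 5-point confidence ratings from one reader on the same cases.
normal = [1, 2, 3, 2]
auc_unaided = empirical_auc([3, 4, 2, 5], normal)
auc_aided = empirical_auc([4, 5, 3, 5], normal)
print(auc_unaided, auc_aided)
```

A larger area for the aided readings is what an observer study hopes
to show; whether the difference is significant is a separate question
answered by the statistical methods mentioned above.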
Comparing CAD systems
It is important to be able to compare different computer
detection systems. This, however, is not as straightforward as it
seems. The simplest comparison would be to compare the
sensitivities and false-mark rates. However, if a system has higher
sensitivity but more false marks than another system, it is not
clear which system is better. One way to avoid this possible
confusion is to plot FROC curves--the higher the curve, the better
the performance (figure 1).
There are caveats to note when comparing CAD systems. First, the
performance is strongly dependent on the cases used. A system
tested using only cases with palpable lesions is likely to have
better performance than another system tested on cases with cancers
from an incidence round of screening, even if the latter system has
better inherent performance. The appropriate choice of cases for
evaluating a CAD system depends on the population of women for
which the system will ultimately be used. For example, if the
system will be used in North America on screening cases, then a
database of consecutive cancers and a second set of consecutive
normal cases drawn from several centers in North America would be
appropriate. The greater the number of cases used, the more
accurately the final numbers will reflect clinical performance.
However, as the number of cases increases, the cost and the
logistics of performing the evaluation increase, probably
exponentially.
It is also important that the cases used to evaluate or test the
performance of a CAD system be different from the cases used to
develop and train the system. Most CAD schemes have four stages: i)
a prefilter, where
the lesion of interest is enhanced while other structures are
suppressed; ii) segmentation of the lesion to isolate it from its
background; iii) extraction of features characterizing the lesion;
and iv) a classifier that uses the features to distinguish actual
lesions from nonlesions. There are many ways to accomplish each of
these four stages.
In developing a scheme, the researcher decides on the best
techniques for performing each of these stages. Training the system
involves selecting values for the parameters used in each of the
four stages (e.g., training a neural network to
recognize microcalcifications). If the same cases are used to
develop, train, and test the CAD system, then a bias will exist and
the measured performance will be higher than the actual
performance. For example, it is possible to train an artificial
neural network to have perfect performance on a given set of cases.
However, when that network is used on a different set of cases, the
results will be less than perfect and often will be surprisingly
low. A single set of cases can be used to train and to test a CAD
system if a jackknife method of testing and training is employed.
Essentially, this and other similar methods divide the cases into
training and testing sets. The CAD system is trained on the
training set and evaluated on the testing set. Then a new division
is made and the process is repeated. The results from each
repetition are accumulated into a single FROC curve.
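The repeated divide, train, and test cycle can be sketched as
follows. This is an assumption-laden illustration: it uses a
leave-one-out division (one of several possible jackknife-style
schemes), and the "classifier" is a deliberately trivial threshold on
a single made-up feature rather than any real CAD component.

```python
from typing import Callable, List, Tuple

# Hypothetical case format: (feature value, True if the candidate is an actual lesion)
Case = Tuple[float, bool]


def train_threshold(training_set: List[Case]) -> float:
    """Toy training step: midpoint between the mean lesion and mean nonlesion feature."""
    lesions = [x for x, is_lesion in training_set if is_lesion]
    nonlesions = [x for x, is_lesion in training_set if not is_lesion]
    return (sum(lesions) / len(lesions) + sum(nonlesions) / len(nonlesions)) / 2


def leave_one_out(
    cases: List[Case], train: Callable[[List[Case]], float]
) -> List[Tuple[float, bool]]:
    """Score each case with a model trained on all the OTHER cases, then pool the scores."""
    pooled = []
    for i, (feature, is_lesion) in enumerate(cases):
        training_set = cases[:i] + cases[i + 1:]  # the held-out case is never trained on
        threshold = train(training_set)
        pooled.append((feature - threshold, is_lesion))
    return pooled


cases = [(0.9, True), (0.8, True), (0.3, False), (0.1, False)]
pooled_scores = leave_one_out(cases, train_threshold)
for score, is_lesion in pooled_scores:
    print(f"score {score:+.3f}  lesion={is_lesion}")
```

The pooled scores, each produced by a model that never saw the case
being scored, are what would then be thresholded to build a single
FROC curve.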
One should remember, again, that the detection scheme with the
highest sensitivity is not necessarily the system that will provide
the most benefit to the radiologist. The system that finds the most
cancers that a radiologist will miss has the potential to be the
most beneficial--high sensitivity does not necessarily mean high
sensitivity for overlooked cancers.
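The distinction can be made concrete with a small calculation. The
two systems and their counts below are invented purely for
illustration; each cancer is recorded as whether the computer marked
it and whether the radiologist overlooked it when reading unaided.

```python
from typing import List, Optional, Tuple

# Hypothetical record per cancer: (marked by computer, overlooked by radiologist)
CancerRecord = Tuple[bool, bool]


def sensitivities(cancers: List[CancerRecord]) -> Tuple[float, Optional[float]]:
    """Overall sensitivity, and sensitivity restricted to overlooked cancers."""
    overall = sum(1 for marked, _ in cancers if marked) / len(cancers)
    overlooked = [marked for marked, was_overlooked in cancers if was_overlooked]
    on_overlooked = sum(overlooked) / len(overlooked) if overlooked else None
    return overall, on_overlooked


# System A marks 8 of 10 cancers, but none of the 2 the radiologist overlooked.
system_a = [(True, False)] * 8 + [(False, True)] * 2
# System B marks only 6 of 10 cancers, including both overlooked ones.
system_b = [(True, False)] * 4 + [(False, False)] * 4 + [(True, True)] * 2

print(sensitivities(system_a))
print(sensitivities(system_b))
```

System A has the higher overall sensitivity, yet only system B marks
the cancers the radiologist would have missed, so only system B could
change the outcome for those patients.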
Conclusion
Computer detection systems are beginning to be used clinically.
This opens the potential for improving the benefits of screening
mammography further. However, at this crucial time in the
development of CAD as a field, radiologists play an important role.
It is radiologists who will conduct the needed clinical studies to
show conclusively the benefits of CAD, and it is radiologists who
will purchase and use CAD systems. Radiologists need to understand
CAD to fulfill this role.
Ultimately, a rigorous clinical evaluation is needed to measure
the benefit of CAD to mammography. Until that time, smaller
clinical studies and laboratory experiments will continue to
measure the potential of CAD. Not all studies are conducted with
the same protocol. It is necessary to understand how the study was
conducted to understand how the results could be translated to
clinical performance. The greater the number of radiologists with
good CAD acumen, the faster the field will grow and the quicker
beneficial schemes will be employed clinically. It will be a
setback to the field if radiologists start evaluating CAD systems
that are not ready for clinical testing.
Disclosure
Robert Nishikawa is a shareholder in R2 Technology, Inc. (Los
Altos, CA). It is the University of Chicago conflict-of-interest
policy that investigators disclose publicly actual or potential
significant financial interests that may appear to be affected by
the research activities.