Rapid grading of fundus photographs for diabetic retinopathy using crowdsourcing.

Brady CJ, Villanti AC, Pearson JL, Kirchner TR, Gupta OP, Shah CP - J. Med. Internet Res. (2014)

Bottom Line: As screening improves, new methods to deal with screening data may help reduce human resource needs. With the addition of grading categories, time to grade each image increased and the percentage of images graded correctly decreased. In Phase II, the area under the curve (AUC) of the receiver operating characteristic (ROC) indicated that sensitivity and specificity were maximized after 7 graders for ratings of normal versus abnormal (AUC=0.98), but performance was significantly reduced (AUC=0.63) when Turkers were asked to specify the level of severity.


Affiliation: Wills Eye Hospital, Retina Service: Mid Atlantic Retina, Philadelphia, PA, United States. brady@jhmi.edu.

ABSTRACT

Background: Screening for diabetic retinopathy is both effective and cost-effective, but rates of screening compliance remain suboptimal. As screening improves, new methods to deal with screening data may help reduce human resource needs. Crowdsourcing has been used in many contexts to harness distributed human intelligence for the completion of small tasks, including image categorization.

Objective: Our goal was to develop and validate a novel method for fundus photograph grading.

Methods: An interface for fundus photo classification was developed for the Amazon Mechanical Turk crowdsourcing platform. We posted 19 expert-graded images for grading by Turkers, with 10 repetitions per photo, for an initial proof of concept (Phase I). Turkers were paid US $0.10 per image. In Phase II, one prototypical image from each of the four grading categories received 500 unique Turker interpretations. Fifty random draws at each crowd size from 1 to 50 Turkers were then used to estimate the variance in accuracy across samples of increasing crowd size and to determine the minimum number of Turkers needed to produce valid results. In Phase III, the interface was modified in an attempt to improve Turker grading.
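
A minimal sketch of this resampling scheme in Python, under stated assumptions: per-grader accuracy near the 81.3% observed in Phase I's binary task, and simple majority vote as the aggregation rule (the abstract does not specify how crowd labels were combined). All names and values are illustrative, not the authors' code.

```python
import random
import statistics

# Illustrative simulation of the Phase II scheme: from a pool of 500 Turker
# labels for one image, take 50 random draws at several representative crowd
# sizes (the study swept sizes 1-50) and watch the accuracy of the aggregated
# label stabilize as crowds grow. The 81% per-grader accuracy and the
# majority-vote rule are assumptions.

TRUE_LABEL = 1          # 1 = abnormal, 0 = normal
POOL_SIZE = 500         # unique Turker interpretations per image
GRADER_ACCURACY = 0.81  # assumed, roughly Phase I's 81.3%
DRAWS_PER_SIZE = 50

random.seed(0)
pool = [TRUE_LABEL if random.random() < GRADER_ACCURACY else 1 - TRUE_LABEL
        for _ in range(POOL_SIZE)]

def majority(labels):
    """Aggregate one crowd's labels by majority vote (ties read as abnormal)."""
    return 1 if 2 * sum(labels) >= len(labels) else 0

for crowd_size in (1, 3, 5, 7, 10, 25, 50):
    correct = [majority(random.sample(pool, crowd_size)) == TRUE_LABEL
               for _ in range(DRAWS_PER_SIZE)]
    print(f"crowd size {crowd_size:2d}: "
          f"accuracy {statistics.mean(correct):.2f} over {DRAWS_PER_SIZE} draws")
```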

Results: Across 230 grading instances in the normal versus abnormal arm of Phase I, 187 images (81.3%) were correctly classified by Turkers. Average time to grade each image was 25 seconds, including time to review training images. With the addition of grading categories, time to grade each image increased and the percentage of images graded correctly decreased. In Phase II, the area under the curve (AUC) of the receiver operating characteristic (ROC) indicated that sensitivity and specificity were maximized after 7 graders for ratings of normal versus abnormal (AUC=0.98), but performance was significantly reduced (AUC=0.63) when Turkers were asked to specify the level of severity. With improvements to the interface in Phase III, the proportion of images correctly classified by the mean Turker grade in four-category grading increased from 26.3% (5/19 images) to a maximum of 52.6% (10/19 images). Throughout all trials, 100% sensitivity for normal versus abnormal was maintained.
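
The abstract reports AUCs but not how a continuous score was built from the crowd's ratings. One plausible construction, sketched below with assumed parameters, scores each image by the fraction of its graders voting "abnormal" and computes the AUC via the Mann-Whitney formulation:

```python
import random

# Hypothetical sketch: score each simulated image by the fraction of graders
# voting "abnormal", then compute AUC as the probability that a random
# abnormal image outscores a random normal one (ties count half). Grader
# accuracy and image counts are assumed, not taken from the study.

random.seed(1)
GRADER_ACCURACY = 0.81  # assumed per-grader accuracy

def crowd_score(is_abnormal, n_graders):
    """Fraction of n_graders voting 'abnormal' for one simulated image."""
    votes = sum((random.random() < GRADER_ACCURACY) == is_abnormal
                for _ in range(n_graders))
    return votes / n_graders

def auc(pos_scores, neg_scores):
    """Mann-Whitney AUC: P(random abnormal score > random normal score)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

for n_graders in (1, 3, 7, 10):
    pos = [crowd_score(True, n_graders) for _ in range(200)]
    neg = [crowd_score(False, n_graders) for _ in range(200)]
    print(f"{n_graders:2d} graders: AUC = {auc(pos, neg):.2f}")
```

With independent graders of roughly this accuracy, the simulated AUC is already near its ceiling by about 7 graders, consistent with the plateau the study reports.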

Conclusions: With minimal training, the Amazon Mechanical Turk workforce can rapidly and correctly categorize fundus photos of diabetic patients as normal or abnormal, though further refinement of the methodology is needed to improve Turker ratings of the degree of retinopathy. Images were interpreted for a total cost of US $1.10 per eye. Crowdsourcing may offer a novel and inexpensive means to reduce the skilled grader burden and increase screening for diabetic retinopathy.
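
A plausible itemization of the quoted per-eye cost, assuming the 10 repetitions at US $0.10 per image described in the Methods plus Amazon Mechanical Turk's then-standard 10% commission (the abstract does not break the figure down):

    10 graders × US $0.10/image = US $1.00
    US $1.00 + 10% platform commission = US $1.00 + US $0.10 = US $1.10 per eye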


Figure 2: Area under the curve (AUC) of the receiver operating characteristic (ROC) curve for increasing numbers of Turker interpretations of a prototypical image from each severity level. Turkers had low accuracy for the Mild (Panel A) and Severe (Panel C) images but acceptable accuracy for the Moderate image (Panel B). When all four images were analyzed for absence or presence of disease only, Turkers performed well (Panel D), with a highly significant AUC.

Mentions: Results of Phase II likewise indicate that sensitivity and specificity for overall ratings of abnormal versus normal were excellent, producing a highly significant AUC (0.98; Figure 2, Panel D). Turkers were not as accurate when asked to differentiate among four severity levels. Post hoc contrast analyses, however, indicate that Turkers performed well when asked to identify abnormalities that were moderate in severity (AUC=0.85; Figure 2, Panel B). The pattern of results indicates that the lower accuracy in identifying mild (AUC=0.57; Figure 2, Panel A) and severe (AUC=0.73; Figure 2, Panel C) abnormalities was due to a tendency to rate all abnormalities as moderate in severity, rather than a failure to recognize normal versus mild and severe abnormalities more generally. Results also indicate that maximum AUC was usually achieved when crowd size reached a total of between 7 and 10 Turkers, confirming the validity of the crowd sizes used to rate the larger set of images (Figure 2). This affirms that the results of Phases I and III would not have been different had we sought a larger number of Turkers for each HIT.
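
As a toy illustration of that interpretation (every parameter below is an assumption, not a study estimate), a crowd whose members drift toward "moderate" for any abnormality will get the exact severity of mild and severe images wrong while still flagging essentially all abnormal images as abnormal:

```python
import random

# Toy simulation of the bias described above: graders pull mild and severe
# images toward a "moderate" rating, so the modal crowd label misses the
# exact severity while still detecting the presence of disease.

random.seed(2)
CATEGORIES = ["normal", "mild", "moderate", "severe"]
N_GRADERS, N_IMAGES = 10, 200

def grader_rating(true_cat):
    """One simulated rating (0-3): normals mostly rated 0; abnormal images
    pulled to 'moderate' (2) with an assumed 60% probability."""
    if true_cat == 0:
        return 0 if random.random() < 0.9 else random.choice([1, 2, 3])
    return 2 if random.random() < 0.6 else true_cat

for true_cat, name in enumerate(CATEGORIES):
    exact = flagged = 0
    for _ in range(N_IMAGES):
        ratings = [grader_rating(true_cat) for _ in range(N_GRADERS)]
        modal = max(set(ratings), key=ratings.count)
        exact += (modal == true_cat)   # severity exactly right
        flagged += (modal != 0)        # image called abnormal at all
    print(f"{name:>8}: exact severity {exact / N_IMAGES:.0%}, "
          f"rated abnormal {flagged / N_IMAGES:.0%}")
```

Under these assumptions, mild and severe images are almost never assigned their exact severity (the modal label collapses to moderate), yet they are still flagged as abnormal, matching the pattern of a preserved normal-versus-abnormal signal alongside poor severity discrimination.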

