Figure text extraction in biomedical literature.

Kim D, Yu H - PLoS ONE (2011)

Bottom Line: Then we adapted the off-the-shelf OCR tool on the improved text localization for character recognition. FigTExT significantly improved the performance of the off-the-shelf OCR tool we used, which on its own performed with 36.6% precision, 19.3% recall, and 25.3% F1-score for text extraction. In addition, our results show that FigTExT can extract texts that do not appear in figure captions or other associated text, further suggesting the potential utility of FigTExT for improving figure search.


Affiliation: Department of Health Science, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, United States of America. kim48@uwm.edu

ABSTRACT

Background: Figures are ubiquitous in biomedical full-text articles, and they represent important biomedical knowledge. However, the sheer volume of biomedical publications has made it necessary to develop computational approaches for accessing figures. Therefore, we are developing the Biomedical Figure Search engine (http://figuresearch.askHERMES.org) to allow bioscientists to access figures efficiently. Since text frequently appears in figures, automatically extracting such text may assist the task of mining information from figures. Little research, however, has been conducted exploring text extraction from biomedical figures.

Methodology: We first evaluated an off-the-shelf Optical Character Recognition (OCR) tool on its ability to extract text from figures appearing in biomedical full-text articles. We then developed a Figure Text Extraction Tool (FigTExT) to improve the performance of the OCR tool for figure text extraction through the use of three innovative components: image preprocessing, character recognition, and text correction. We first developed image preprocessing to enhance image quality and to improve text localization. Then we adapted the off-the-shelf OCR tool on the improved text localization for character recognition. Finally, we developed and evaluated a novel text correction framework by taking advantage of figure-specific lexicons.
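
As a rough illustration of the lexicon-based text correction idea, the sketch below snaps a noisy OCR token to the closest entry in a small word list. It is not FigTExT's correction framework, whose details are not given in this abstract. The lexicon contents, the `correct_token` helper, and the 0.8 similarity cutoff are all assumptions made for the example, and the matching relies on Python's standard `difflib` module rather than any method named by the authors.

```python
from difflib import get_close_matches

# Hypothetical figure-specific lexicon. How FigTExT builds its lexicons is
# not described here; these entries are made up for illustration.
LEXICON = ["control", "expression", "kinase", "apoptosis", "wild-type"]


def correct_token(token, lexicon=LEXICON, cutoff=0.8):
    """Return the closest lexicon entry, or the token unchanged if none is close."""
    if token in lexicon:
        return token
    # get_close_matches ranks candidates by difflib's similarity ratio and
    # drops any candidate scoring below `cutoff` (an assumed threshold).
    matches = get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


# Simulated OCR output: two near misses and one token with no close match.
ocr_tokens = ["contro1", "kinnase", "xyz"]
corrected = [correct_token(t) for t in ocr_tokens]
print(corrected)
```

Leaving unmatched tokens untouched is a deliberate choice in this sketch: figure text often contains gene names and abbreviations that no lexicon will cover, and forcing a match would introduce errors.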

Results/conclusions: The evaluation on 382 figures (9,643 figure texts in total) randomly selected from PubMed Central full-text articles shows that FigTExT performed with 84% precision, 98% recall, and 90% F1-score for text localization and with 62.5% precision, 51.0% recall and 56.2% F1-score for figure text extraction. When limiting figure texts to those judged by domain experts to be important content, FigTExT performed with 87.3% precision, 68.8% recall, and 77% F1-score. FigTExT significantly improved the performance of the off-the-shelf OCR tool we used, which on its own performed with 36.6% precision, 19.3% recall, and 25.3% F1-score for text extraction. In addition, our results show that FigTExT can extract texts that do not appear in figure captions or other associated text, further suggesting the potential utility of FigTExT for improving figure search.
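
Since the F1-score is the harmonic mean of precision and recall, the reported figures can be checked for internal consistency. The short sketch below recomputes F1 from each precision/recall pair quoted above; the `f1_score` helper and the dictionary labels are names chosen for this example only. The localization values are reported to the nearest percent, so a small rounding difference is expected there.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as percentages)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) as stated in the abstract.
reported = {
    "text localization": (84.0, 98.0, 90.0),
    "figure text extraction": (62.5, 51.0, 56.2),
    "expert-judged important text": (87.3, 68.8, 77.0),
    "off-the-shelf OCR alone": (36.6, 19.3, 25.3),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.1f}%, reported = {f1}%")
```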

Figure 4 (pone-0015338-g004): Figure text localization. (A) Input image. (B) Binary image. (C) Foreground objects. (D) Figure text regions.

Mentions: Once the contrast was enhanced and the gray-level of the figure texts was determined, we adapted Gatos et al.'s approach to first obtain the binary image (Figure 4(b)) of the input image (Figure 4(a)) and then extract foreground objects (Figure 4(c)) according to the gray-level of the figure texts. Rather than identifying regions of foreground objects as others have done [17], [19], we extracted strong edges of foreground objects and then identified a set of connected components. This approach is motivated by the fact that figure text usually has a high contrast with its background due to the contrast stretching transformation in Figure 3(b). To detect character regions, we first applied geometrical constraints (e.g., size and aspect ratio of a character) to remove non-text regions. We then merged adjacent characters into the same text region with a morphological technique (Figure 4(d)). We first evaluated the performance of the text localization prior to applying it for FigTExT. To this end, we manually extracted 2,856 original text regions from 73 figure images randomly selected from the open-access articles deposited in PubMed Central. We then counted the number of correctly detected text regions (Nc), the number of incorrectly detected text regions (Nf), and the number of missed text regions (Nm). Recall is computed as Nc/(Nc+Nm); precision as Nc/(Nc+Nf) and F1-score as the harmonic mean of recall and precision. Our evaluation results showed that our figure text localization attained approximately 84% precision, 98% recall, and 90% F1-score.
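
The last two localization steps described above, filtering candidates by geometric constraints and merging adjacent characters into text regions, can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes the connected components have already been extracted and are given as bounding boxes, every threshold is invented for the example, and the merge uses simple box proximity in place of the morphological technique mentioned in the text. The names `is_character_like` and `merge_into_regions` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) of a connected component


def is_character_like(box: Box, min_size: int = 4, max_size: int = 60,
                      min_aspect: float = 0.1, max_aspect: float = 2.0) -> bool:
    """Geometric constraints on size and aspect ratio (illustrative thresholds)."""
    _, _, w, h = box
    if not (min_size <= h <= max_size and 1 <= w <= max_size):
        return False
    return min_aspect <= w / h <= max_aspect


def merge_into_regions(boxes: List[Box], max_gap: int = 5) -> List[Box]:
    """Greedily merge horizontally adjacent, vertically overlapping boxes.

    Stand-in for the morphological merge: boxes are visited left to right and
    attached to the first region that is close enough and shares rows with them.
    """
    regions: List[Box] = []
    for x, y, w, h in sorted(boxes):
        for i, (rx, ry, rw, rh) in enumerate(regions):
            gap = x - (rx + rw)  # negative when the box overlaps the region
            overlaps_vertically = y < ry + rh and ry < y + h
            if gap <= max_gap and overlaps_vertically:
                x0, y0 = min(rx, x), min(ry, y)
                x1, y1 = max(rx + rw, x + w), max(ry + rh, y + h)
                regions[i] = (x0, y0, x1 - x0, y1 - y0)
                break
        else:
            regions.append((x, y, w, h))
    return regions


# Made-up components: three neighbouring characters, one isolated character,
# a thin axis line, and a large graphic blob.
components = [(10, 10, 8, 12), (20, 10, 8, 12), (30, 11, 8, 12),
              (100, 50, 8, 12), (0, 0, 200, 3), (150, 80, 90, 90)]
characters = [b for b in components if is_character_like(b)]
regions = merge_into_regions(characters)
print(regions)
```

In this toy input the axis line and the large blob fail the size constraints, the three neighbouring characters collapse into one text region, and the isolated character remains a region of its own.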

