Figure text extraction in biomedical literature.

Kim D, Yu H - PLoS ONE (2011)

Bottom Line: Then we adapted the off-the-shelf OCR tool on the improved text localization for character recognition. FigTExT significantly improved the performance of the off-the-shelf OCR tool we used, which on its own performed with 36.6% precision, 19.3% recall, and 25.3% F1-score for text extraction. In addition, our results show that FigTExT can extract texts that do not appear in figure captions or other associated text, further suggesting the potential utility of FigTExT for improving figure search.


Affiliation: Department of Health Science, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, United States of America. kim48@uwm.edu

ABSTRACT

Background: Figures are ubiquitous in biomedical full-text articles, and they represent important biomedical knowledge. However, the sheer volume of biomedical publications has made it necessary to develop computational approaches for accessing figures. Therefore, we are developing the Biomedical Figure Search engine (http://figuresearch.askHERMES.org) to allow bioscientists to access figures efficiently. Since text frequently appears in figures, automatically extracting such text may assist the task of mining information from figures. Little research, however, has been conducted exploring text extraction from biomedical figures.

Methodology: We first evaluated an off-the-shelf Optical Character Recognition (OCR) tool on its ability to extract text from figures appearing in biomedical full-text articles. We then developed a Figure Text Extraction Tool (FigTExT) to improve the performance of the OCR tool for figure text extraction through the use of three innovative components: image preprocessing, character recognition, and text correction. We first developed image preprocessing to enhance image quality and to improve text localization. Then we adapted the off-the-shelf OCR tool on the improved text localization for character recognition. Finally, we developed and evaluated a novel text correction framework by taking advantage of figure-specific lexicons.
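The image preprocessing step can be illustrated with a minimal sketch. The abstract does not say which contrast enhancement or interpolation method FigTExT uses, so the min-max contrast stretch and nearest-neighbour up-sampling below are assumptions chosen for simplicity, and the function names (stretch_contrast, upsample_nearest, preprocess) are hypothetical. Images are represented as plain lists of gray levels to keep the example self-contained.

```python
from typing import List

Image = List[List[int]]  # gray levels in 0..255, stored row by row


def stretch_contrast(img: Image) -> Image:
    """Linearly rescale gray levels so the darkest pixel maps to 0 and the brightest to 255.

    Assumption: a simple min-max stretch stands in for the (unspecified)
    contrast enhancement used by FigTExT.
    """
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in img]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in img]


def upsample_nearest(img: Image, factor: int) -> Image:
    """Enlarge the image by an integer factor using nearest-neighbour replication.

    Assumption: the source only says "up-sampling"; the interpolation
    method is not given.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    out: Image = []
    for row in img:
        wide = [p for p in row for _ in range(factor)]  # repeat each pixel horizontally
        out.extend(wide[:] for _ in range(factor))      # repeat each row vertically
    return out


def preprocess(img: Image, factor: int = 2) -> Image:
    """Hypothetical preprocessing chain: contrast stretch, then up-sampling."""
    return upsample_nearest(stretch_contrast(img), factor)
```

In a real pipeline the enlarged, higher-contrast image would then be passed to the OCR tool; an image library would normally replace the list-based helpers.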

Results/conclusions: The evaluation on 382 figures (9,643 figure texts in total) randomly selected from PubMed Central full-text articles shows that FigTExT performed with 84% precision, 98% recall, and 90% F1-score for text localization and with 62.5% precision, 51.0% recall and 56.2% F1-score for figure text extraction. When limiting figure texts to those judged by domain experts to be important content, FigTExT performed with 87.3% precision, 68.8% recall, and 77% F1-score. FigTExT significantly improved the performance of the off-the-shelf OCR tool we used, which on its own performed with 36.6% precision, 19.3% recall, and 25.3% F1-score for text extraction. In addition, our results show that FigTExT can extract texts that do not appear in figure captions or other associated text, further suggesting the potential utility of FigTExT for improving figure search.
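The reported F1-scores can be checked against the reported precision and recall, assuming the standard definition of F1 as the harmonic mean of the two. The short script below is only a consistency check on the published figures; the dictionary labels are ours.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale, here percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in the abstract, in percent.
reported = {
    "text localization": (84.0, 98.0),
    "figure text extraction": (62.5, 51.0),
    "important texts only": (87.3, 68.8),
    "off-the-shelf OCR alone": (36.6, 19.3),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.1f}%")
```

Each computed value rounds to the F1-score stated in the abstract (90%, 56.2%, 77%, and 25.3%).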

Figure 2: FigTExT (Figure Text Extraction Tool). FigTExT has three components: image preprocessing, character recognition, and text correction. Image preprocessing enhances not only text region detection by improving image contrast and determining the gray-level of figure text, but also image quality by up-sampling. FigTExT incorporates an off-the-shelf OCR tool for character recognition. For text correction, FigTExT first corrects misrecognized characters with a figure-specific lexicon and then refines the corrected result to filter out some spurious corrections.

Mentions: As shown in Figure 2, FigTExT has three components: image preprocessing, character recognition, and text correction. Image preprocessing enhances not only text region detection by improving image contrast and determining the gray level of figure texts, but also image quality by up-sampling. FigTExT adapts an off-the-shelf OCR tool on the improved text localization for character recognition. For text correction, FigTExT first corrects misrecognized characters using a figure-specific lexicon and then refines the corrected result to filter out some spurious corrections.
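The two-stage text correction described above (correct against a lexicon, then filter spurious corrections) might look roughly like the sketch below. This is not the authors' algorithm: the use of Python's difflib for string similarity, the two thresholds, the rule that leaves purely numeric tokens alone, and the names propose, refine, and correct are all assumptions made for illustration.

```python
from difflib import SequenceMatcher, get_close_matches
from typing import List, Optional


def propose(token: str, lexicon: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Stage 1: propose the closest lexicon entry, or None if nothing is similar enough."""
    matches = get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def refine(token: str, candidate: Optional[str], min_ratio: float = 0.8) -> str:
    """Stage 2: accept the proposal only if it is a near match; otherwise keep the OCR token.

    Assumption: numeric tokens (for example axis ticks) are never rewritten,
    and the stricter min_ratio filters out spurious corrections.
    """
    if candidate is None or token.isdigit():
        return token
    ratio = SequenceMatcher(None, token, candidate).ratio()
    return candidate if ratio >= min_ratio else token


def correct(tokens: List[str], lexicon: List[str]) -> List[str]:
    """Apply proposal and refinement to every OCR token."""
    return [refine(t, propose(t, lexicon)) for t in tokens]


# Hypothetical figure-specific lexicon, e.g. built from the caption and article text.
lexicon = ["apoptosis", "control", "kinase"]
ocr_tokens = ["apopt0sis", "contro1", "kin", "42"]
corrected = correct(ocr_tokens, lexicon)
# "apopt0sis" and "contro1" are corrected; "kin" is proposed as "kinase" in stage 1
# but rejected in stage 2 as too dissimilar; "42" is left untouched.
```

Splitting proposal from refinement keeps the first stage permissive, so plausible candidates are not missed, while the second stage controls how aggressively OCR output is overwritten.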

