TreeRipper web application: towards a fully automated optical tree recognition software.

Hughes J - BMC Bioinformatics (2011)

Bottom Line: Optical Character Recognition (OCR) is used to convert the tip labels into text with the freely available tesseract-ocr software. Of the images meeting the prerequisites for TreeRipper, 32% were successfully recognised; the largest tree had 115 leaves. We also provide a dataset of 100 tree images and associated tree files for training and/or benchmarking future software. TreeRipper is an open source project licensed under the GNU General Public Licence v3.


Affiliation: IBAHCM, College of Medical, Veterinary and Life Sciences, University of Glasgow, Graham Kerr Building, University Avenue, Glasgow, G12 8QQ, UK. joseph.hughes@glasgow.ac.uk

ABSTRACT

Background: Relationships between species, genes and genomes have been printed as trees for over a century. Whilst this may have been the best format for exchanging and sharing phylogenetic hypotheses during the 20th century, the World Wide Web now provides faster, automated ways of transferring and sharing phylogenetic knowledge. However, novel software is needed to defrost these published phylogenies for the 21st century.

Results: TreeRipper is a simple website for the fully automated recognition of multifurcating phylogenetic trees (http://linnaeus.zoology.gla.ac.uk/~jhughes/treeripper/). The program accepts input images in PNG, JPG/JPEG or GIF format. The underlying command-line C++ program follows a number of cleaning steps to detect lines, remove node labels, patch up broken lines and corners, and detect line edges. The edge contour is then traced to determine the branch lengths, the tip label positions and the topology of the tree. Optical Character Recognition (OCR) is used to convert the tip labels into text with the freely available tesseract-ocr software. Of the images meeting the prerequisites for TreeRipper, 32% were successfully recognised; the largest tree had 115 leaves.
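To make concrete what a recognised multifurcating tree with branch lengths and tip labels could look like as text, here is a minimal sketch in Python. The abstract does not say which tree file format TreeRipper writes; the Newick format used here is an assumption, and the `Node` class and `to_newick` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A node in a rooted tree; internal nodes may have any number of children."""
    label: str = ""                   # tip label (left empty for internal nodes)
    length: Optional[float] = None    # branch length, e.g. measured in pixels
    children: List["Node"] = field(default_factory=list)


def to_newick(node: Node) -> str:
    """Serialize a subtree recursively (assumed output format: Newick)."""
    if node.children:
        # Multifurcations need no special handling: join all children.
        text = "(" + ",".join(to_newick(child) for child in node.children) + ")"
    else:
        text = node.label
    if node.length is not None:
        text += f":{node.length:g}"
    return text


# A root with one tip (A) and one internal node carrying a three-way split.
tree = Node(children=[
    Node("A", 10.0),
    Node(length=4.0, children=[Node("B", 6.0), Node("C", 6.0), Node("D", 6.0)]),
])
print(to_newick(tree) + ";")  # prints (A:10,(B:6,C:6,D:6):4);
```

Labels containing spaces or punctuation would need quoting in real Newick output; the sketch leaves that out for brevity.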

Conclusions: Although the diversity of ways in which phylogenies have been illustrated makes it difficult to design fully automated tree recognition software, TreeRipper is a step towards automating the digitization of past phylogenies. We also provide a dataset of 100 tree images and associated tree files for training and/or benchmarking future software. TreeRipper is an open source project licensed under the GNU General Public Licence v3.


Figure 2: Architecture of the software design for TreeRipper. The input image is scaled, node labels are removed, branches are smoothed, corners are patched up, and the contour is detected. Tip locations are used to determine the leaf label boxes, in which the text is recognised using Tesseract. TreeRipper summarizes the tree topology and labels in a text file and in an SVG file that shows the contours.
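The step from tip locations to leaf label boxes can be sketched as follows. This is an illustration under stated assumptions, not TreeRipper's actual rule: it assumes a rectangular tree with tips on the right, labels printed to the right of each tip, and label rows that extend halfway to the neighbouring tips. The function name `label_boxes` and the box layout are hypothetical.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def label_boxes(tips: List[Tuple[int, int]], width: int, height: int) -> List[Box]:
    """Return one crop box per tip, ordered from top to bottom.

    Each box starts one pixel right of the tip (x, y) and runs to the
    right edge of the image. Vertically it reaches halfway to the tips
    above and below (assumed heuristic), or to the image border.
    """
    ordered = sorted(tips, key=lambda tip: tip[1])
    boxes: List[Box] = []
    for i, (x, y) in enumerate(ordered):
        top = 0 if i == 0 else (ordered[i - 1][1] + y) // 2
        bottom = height if i == len(ordered) - 1 else (y + ordered[i + 1][1]) // 2
        boxes.append((x + 1, top, width, bottom))
    return boxes


# Three tips given as (x, y) coordinates in a 300 x 60 pixel image.
boxes = label_boxes([(120, 30), (150, 10), (150, 50)], width=300, height=60)
print(boxes)  # [(151, 0, 300, 20), (121, 20, 300, 40), (151, 40, 300, 60)]
```

Each box could then be cropped from the original image and passed to an OCR engine such as Tesseract.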

Mentions: TreeRipper is written in C++ using a set of Standard Template Library algorithms provided by Magick++. The image is first converted to black and white and rescaled so that horizontal lines are on average 2 pixels thick. The image is then cleaned by removing a series of patterns, such as black pixels surrounded by a box of white pixels and horizontal lines that are not connected to vertical lines. Lines and corners are patched up before the contour is traced and the topology detected. The locations of the branch tips are then used to crop the tip labels from the original image, and the tip labels are converted to text using the freely available tesseract-ocr program. The steps in the program are depicted in Figure 2. The web application, written in PHP, enables the visualization of the tracing and allows editing of the labels.
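One of the cleaning patterns named above, a black pixel surrounded by a box of white pixels, can be illustrated with a short Python sketch. TreeRipper itself is written in C++ with Magick++; the code below is not its implementation. It assumes the smallest possible box (the 8 immediate neighbours), treats pixels outside the image as white, and represents the image as a nested list, all of which are choices made only for this example.

```python
from typing import List

Image = List[List[int]]  # 1 = black pixel, 0 = white pixel


def remove_isolated_pixels(img: Image) -> Image:
    """Return a copy of img with isolated black pixels turned white.

    A black pixel counts as isolated when all 8 neighbours are white;
    positions outside the image are treated as white (assumption).
    """
    height = len(img)
    width = len(img[0]) if height else 0

    def is_black(row: int, col: int) -> bool:
        return 0 <= row < height and 0 <= col < width and img[row][col] == 1

    cleaned = [row[:] for row in img]
    for r in range(height):
        for c in range(width):
            if img[r][c] != 1:
                continue
            has_black_neighbour = any(
                is_black(r + dr, c + dc)
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            if not has_black_neighbour:
                cleaned[r][c] = 0
    return cleaned


noisy = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # single speck: removed
    [0, 0, 0, 1, 1],  # two connected pixels: kept
    [0, 0, 0, 0, 0],
]
for row in remove_isolated_pixels(noisy):
    print(row)
```

The neighbour test reads from the original image and writes to a copy, so removing one speck cannot change the decision for another pixel in the same pass.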

