Knowledge-driven geospatial location resolution for phylogeographic models of virus migration.

Weissenbacher D, Tahsin T, Beard R, Figaro M, Rivera R, Scotch M, Gonzalez G - Bioinformatics (2015)

Bottom Line: For detecting and disambiguating locations, our system performed best using the metadata heuristic (0.54 Precision, 0.89 Recall and 0.68 F-score). Our error analysis showed that a noticeable increase in the accuracy of toponym resolution is possible by improving the geospatial location detection. By improving these fundamental automated tasks, our system can be a useful resource to phylogeographers that rely on geospatial metadata of GenBank sequences.


Affiliation: Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA.

Fig. 2 (btv259-F2): Patricia tree used to index the GeoNames database. Black nodes are accepting nodes.

Mentions: To detect toponyms in documents, we built a rule-based module relying primarily on the GeoNames database and the Socrata dataset. Our module detects toponyms in documents by matching entries of GeoNames with the phrases in the documents. Since location names are of arbitrary length, each sequence of tokens in a document should, theoretically, be compared against all entries of GeoNames to find a match, leading to a problem of exponential comparisons. Because GeoNames is very large, we built an index to perform more efficient look-ups. Our system splits all entries of GeoNames into contiguous sequences of tokens and then clusters them according to the first two letters of their first token. Each cluster is then associated with a variant of a Patricia tree, where each edge is associated with a unique token and every node is either accepting or non-accepting. Each token sequence representing an entry of GeoNames is modeled as a distinct branch of a tree, with its last node accepting. In Figure 2, we provide a concrete example for three locations: Abar, Abar abu Hileiqt, Abar abu Hirga. The creation of the index took 7.19 min on an Intel Xeon 2.60 GHz processor with 9 GB of memory and each document was parsed in 6.66 cs on average. The complexity of the search algorithm is in O(log n).
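The indexing scheme described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses a plain token-level trie rather than a compressed Patricia variant, and the names `Node`, `cluster_key`, `build_index` and `find_toponyms`, the lower-casing of tokens, and the longest-match scanning strategy are all assumptions made for illustration. The three entries are the ones shown in Figure 2.

```python
from collections import defaultdict


class Node:
    """One tree node; `accepting` marks the end of a complete gazetteer entry."""
    __slots__ = ("children", "accepting")

    def __init__(self):
        self.children = {}       # token -> Node (each edge carries one token)
        self.accepting = False


def cluster_key(token):
    # Assumption: clusters are keyed by the first two letters, case-insensitively.
    return token[:2].lower()


def build_index(entries):
    """Map each two-letter cluster key to the root of its own token tree."""
    index = defaultdict(Node)
    for entry in entries:
        tokens = entry.split()
        if not tokens:
            continue
        node = index[cluster_key(tokens[0])]
        for tok in tokens:
            node = node.children.setdefault(tok.lower(), Node())
        node.accepting = True    # last node of the branch is accepting
    return index


def find_toponyms(tokens, index):
    """Scan left to right, keeping the longest accepted match at each position."""
    matches = []
    i = 0
    while i < len(tokens):
        node = index.get(cluster_key(tokens[i]))  # .get avoids creating clusters
        end = None
        j = i
        while node is not None and j < len(tokens):
            node = node.children.get(tokens[j].lower())
            if node is None:
                break
            j += 1
            if node.accepting:
                end = j          # remember the longest accepting prefix so far
        if end is None:
            i += 1
        else:
            matches.append((i, end, " ".join(tokens[i:end])))
            i = end
    return matches


# Hypothetical usage with the three entries from Figure 2.
index = build_index(["Abar", "Abar abu Hileiqt", "Abar abu Hirga"])
text = "Samples from Abar abu Hirga and Abar abu were sequenced"
print(find_toponyms(text.split(), index))
# [(2, 5, 'Abar abu Hirga'), (6, 7, 'Abar')]
```

In this sketch "Abar abu" is not reported as a match on its own, because the node reached after "abu" is non-accepting; only "Abar" is returned for the second mention.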

