Knowledge-driven geospatial location resolution for phylogeographic models of virus migration.
Bottom Line: For detecting and disambiguating locations, our system performed best using the metadata heuristic (0.54 Precision, 0.89 Recall and 0.68 F-score). Our error analysis showed that a noticeable increase in the accuracy of toponym resolution is possible by improving the geospatial location detection. By improving these fundamental automated tasks, our system can be a useful resource to phylogeographers that rely on geospatial metadata of GenBank sequences.
Affiliation: Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA.
Mentions: To detect toponyms in documents, we built a rule-based module relying primarily on the GeoNames database and the Socrata dataset. Our module detects toponyms by matching entries of GeoNames against the phrases in the documents. Since location names are of arbitrary length, each sequence of tokens in a document should, theoretically, be compared against all entries of GeoNames to find a match, leading to an exponential number of comparisons. Because GeoNames is considerably large, we built an index to perform more efficient look-ups. Our system splits all entries of GeoNames into contiguous sequences of tokens and then clusters them according to the first two letters of their first token. Each cluster is then associated with a variant of a Patricia tree, where each edge is associated with a unique token and every node is either accepting or non-accepting. Each token sequence representing an entry of GeoNames is modeled as a distinct branch of a tree, with its last node accepting. In Figure 2, we provide a concrete example for three locations: Abar, Abar abu Hileiqt, Abar abu Hirga. The creation of the index took 7.19 min on an Intel Xeon 2.60 GHz processor with 9 GB of memory, and each document was parsed in 6.66 cs on average. The complexity of the search algorithm is in O(log n).
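The index described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a plain, uncompressed token trie rather than a Patricia tree variant, tokenizes on whitespace only, lowercases all tokens, and returns the longest match at each position. The names `Node`, `build_index` and `find_toponyms` are hypothetical and chosen for illustration.

```python
from collections import defaultdict


class Node:
    """One trie node; each outgoing edge is labeled with a single token."""

    def __init__(self):
        self.children = {}      # token -> Node
        self.accepting = False  # True if the path from the root spells a full entry


def build_index(names):
    """Cluster names by the first two letters of their first token,
    keeping one token trie per cluster (assumed simplification)."""
    index = defaultdict(Node)
    for name in names:
        tokens = name.lower().split()
        if not tokens:
            continue
        node = index[tokens[0][:2]]
        for token in tokens:
            node = node.children.setdefault(token, Node())
        node.accepting = True  # last node of the branch marks a complete entry
    return index


def find_toponyms(text, index):
    """Scan left to right and keep the longest gazetteer match at each position."""
    tokens = text.split()
    matches = []
    i = 0
    while i < len(tokens):
        node = index.get(tokens[i].lower()[:2])  # pick the cluster, if any
        end = None
        j = i
        while node is not None and j < len(tokens):
            node = node.children.get(tokens[j].lower())
            j += 1
            if node is not None and node.accepting:
                end = j  # remember the longest accepting prefix so far
        if end is None:
            i += 1
        else:
            matches.append(" ".join(tokens[i:end]))
            i = end
    return matches


# The three locations used in the Figure 2 example.
index = build_index(["Abar", "Abar abu Hileiqt", "Abar abu Hirga"])
found = find_toponyms("Samples from Abar abu Hirga and Abar were sequenced", index)
print(found)
```

In this sketch the node reached after "Abar" is accepting while the node reached after "abu" is not, so a phrase such as "Abar abu Dhabi" would yield only "Abar". A real system would also need punctuation-aware tokenization and handling of alternate names.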