Grounding annotations in published literature with an emphasis on the functional roles used in metabolic models

View Article: PubMed Central

ABSTRACT

Accurate genome annotations in databases are a critical resource available to the scientific community for analysis and research. Inaccurate and inconsistent annotations exist as a result of errors generated by mass automated annotation, and they currently act as a barrier to the application of bioinformatics. The purpose of this effort was to improve the SEED by improving the connection of functional roles to literature references. Direct literature references (DLits), found through searches of PubMed and other online databases such as SwissProt, were attached to protein sequences within the PubSEED to provide literature support for the roughly 2,500 distinct functional roles used to construct metabolic models within the Model SEED. Only DLits in which a researcher asserted the function of a protein were attached to sequences. Starting from a list of 1,072 functional roles that did not previously have DLit support, we were able to connect sequences to literature for 655 functional roles, at least 484 of which were in the original list of unsupported roles. When added to the existing set of sequences having DLits, the resulting set of DLit-sequence pairs (the foundation set) now connects approximately 4,300 DLits to approximately 5,600 distinct protein sequences obtained from approximately 16,000 genes (some of these genes have identical protein sequences). From the foundation set, we constructed projection sets such that each set contains one member of the foundation set and projections of its functional role onto similar genes. The projection sets revealed 120 inconsistent annotations within the SEED. Two types of inconsistencies were corrected through manual annotation in the PubSEED: instances in which two identical protein sequences had been annotated with different functions, and instances in which projected functions contradicted previous annotations. In total, 26,785 changes to gene function assignments, 219 of which were to previously uncharacterized proteins, resulted in a more consistent and accurate set of input data from which to construct revised metabolic models within the Model SEED.

Electronic supplementary material: The online version of this article (doi:10.1007/s13205-011-0039-z) contains supplementary material, which is available to authorized users.




Fig. 1: BBHs. (a) Light arrows denote weak similarity; (b) illustration of the need for a “false positive BBH” filter.

Our sequence similarity criterion is that the region of match between the compared genes must cover at least 80% of the total length of each gene, and that the similarity must be a clear bidirectional best hit. The minimum 80% coverage criterion eliminates spurious hits against single common domains, as well as hits against fused genes. A Bidirectional Best Hit (BBH) signifies that the candidate gene is more similar to the foundation set gene than to any other gene in the foundation set genome, and that the foundation set gene is more similar to the candidate gene than to any other gene in the candidate genome. Figure 1a illustrates the BBH relationship; the heavy double-headed arrow denotes the BBH, while the lighter single-headed arrows denote weaker similarities to other genes. A BBH is said to be a clear BBH if the difference between the percent identity of the BBH and the next highest percent identity between either gene and any gene in the other genome is ≥5%. (Requiring at least a 5% difference in percent identities is sufficient to rule out gene duplications from recently inserted mobile elements or prophages, which often display identity scores close to 100%.)
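The criterion above can be illustrated with a minimal Python sketch. It is not the SEED implementation. The `Hit` record, the `clear_bbhs` function, and the example gene names are hypothetical. The sketch assumes that one record exists per gene pair, that similarity is ranked by percent identity, that the ≥5% margin is measured in percentage points, and that rival hits count toward the margin whether or not they meet the coverage threshold. None of these details is specified in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise similarity between two genomes (hypothetical record)."""
    gene_a: str      # gene in the foundation set genome
    gene_b: str      # gene in the candidate genome
    identity: float  # percent identity, 0-100
    cov_a: float     # fraction of gene_a covered by the match, 0-1
    cov_b: float     # fraction of gene_b covered by the match, 0-1


def clear_bbhs(hits, min_cov=0.80, min_margin=5.0):
    """Return (gene_a, gene_b) pairs that satisfy the coverage and clear-BBH tests."""
    by_a, by_b = {}, {}
    for h in hits:
        by_a.setdefault(h.gene_a, []).append(h)
        by_b.setdefault(h.gene_b, []).append(h)

    pairs = []
    for h in hits:
        # Coverage test: the match must span at least 80% of each gene.
        if h.cov_a < min_cov or h.cov_b < min_cov:
            continue
        # Rival hits: either gene against any other gene in the other genome.
        rivals = [o.identity for o in by_a[h.gene_a] if o is not h]
        rivals += [o.identity for o in by_b[h.gene_b] if o is not h]
        # With no rivals, treat the next highest identity as 0 (assumption).
        runner_up = max(rivals, default=0.0)
        # Beating every rival by the margin also makes h the best hit in
        # both directions, so no separate BBH check is needed.
        if h.identity - runner_up >= min_margin:
            pairs.append((h.gene_a, h.gene_b))
    return pairs


example_hits = [
    Hit("a1", "b1", 90.0, 0.95, 0.92),  # clear BBH
    Hit("a1", "b2", 40.0, 0.85, 0.90),  # weaker similarity to another gene
    Hit("a3", "b3", 98.0, 0.99, 0.99),  # near-identical duplicate pair:
    Hit("a3", "b4", 97.0, 0.99, 0.99),  # margin is below 5, so neither is clear
    Hit("a5", "b5", 80.0, 0.60, 0.95),  # fails the 80% coverage test
]
print(clear_bbhs(example_hits))
```

In this example only the pair `a1`/`b1` passes. The `a3` hits show the duplicate case described in the parenthetical: two candidate genes with nearly identical scores leave no clear best hit.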

