Grounding annotations in published literature with an emphasis on the functional roles used in metabolic models


ABSTRACT

Accurate genome annotations in databases are a critical resource for analysis and research across the scientific community. Inaccurate and inconsistent annotations, produced by errors in mass automated annotation, currently act as a barrier to the application of bioinformatics. The purpose of this effort was to improve the SEED by strengthening the connection of functional roles to literature references. Direct literature references (DLits), found through searches of PubMed and other online databases such as SwissProt, were attached to protein sequences within the PubSEED to provide literature support for the roughly 2,500 distinct functional roles used to construct metabolic models within the Model SEED. Only DLits in which a researcher asserted the function of a protein were attached to sequences. Starting from a list of 1,072 functional roles that did not previously have DLit support, we were able to connect sequences to literature for 655 functional roles, at least 484 of which were in the original list of unsupported roles. When added to the existing set of sequences having DLits, the resulting set of DLit-sequence pairs (the foundation set) now connects approximately 4,300 DLits to approximately 5,600 distinct protein sequences obtained from approximately 16,000 genes (some of these genes have identical protein sequences). From the foundation set, we constructed projection sets such that each set contains one member of the foundation set and projections of its functional role onto similar genes. The projection sets revealed 120 inconsistent annotations within the SEED. Two types of inconsistencies were corrected through manual annotation in the PubSEED: instances in which two identical protein sequences had been annotated with different functions, and instances in which projected functions contradicted previous annotations. In total, 26,785 changes to gene function assignments, 219 of which were to previously uncharacterized proteins, resulted in a more consistent and accurate set of input data from which to construct revised metabolic models within the Model SEED.
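
The first type of inconsistency named in the abstract, identical protein sequences annotated with different functions, can be illustrated with a minimal sketch. This is not the PubSEED tooling: the function name `find_identical_sequence_conflicts` and the `(gene_id, protein_sequence, function)` tuple layout are assumptions made only for illustration.

```python
from collections import defaultdict


def find_identical_sequence_conflicts(genes):
    """Group genes by exact protein sequence and report sequences that
    carry more than one distinct function annotation.

    `genes` is an iterable of (gene_id, protein_sequence, function)
    tuples; this layout is hypothetical, not a SEED data format.
    Returns {sequence: {function: [gene_id, ...]}} for conflicts only.
    """
    by_sequence = defaultdict(lambda: defaultdict(list))
    for gene_id, sequence, function in genes:
        by_sequence[sequence][function].append(gene_id)

    # A sequence is inconsistent when it maps to two or more functions.
    return {
        sequence: dict(by_function)
        for sequence, by_function in by_sequence.items()
        if len(by_function) > 1
    }
```

Each reported group would still need manual review, as in the abstract, to decide which of the competing annotations the literature supports.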

Electronic supplementary material: The online version of this article (doi:10.1007/s13205-011-0039-z) contains supplementary material, which is available to authorized users.

Fig. 2: Gene context is conserved in the upper portion of this illustration, but not in the lower portion

Mentions: Empirically, genes that work together or carry out related functions are often found close to each other on the chromosome, and this proximity is strongly conserved (Fig. 2), a phenomenon known as “chromosomal clustering” (Overbeek et al. 1999a, b; Dandekar et al. 1998). Hence, once the “false-positive” BBHs are removed, we compute for each BBH a projection score (Overbeek and Xia 2011) that takes into account both the number of conserved neighbors and the percent identity of the BLAST-computed similarity (Altschul et al. 1997).
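
One ingredient of the projection score, the number of conserved neighbors, can be sketched as follows. This is a hedged illustration only: the function name, the neighborhood sets, and the `bbh` mapping are hypothetical, and the sketch does not reproduce the projection score of Overbeek and Xia (2011), whose formula is not given in this excerpt.

```python
def count_conserved_neighbors(neighbors_a, neighbors_b, bbh):
    """Count chromosomal neighbors of gene A whose best bidirectional
    hit (BBH) lies among the chromosomal neighbors of gene B.

    neighbors_a: set of gene ids near gene A in the first genome.
    neighbors_b: set of gene ids near gene B in the second genome.
    bbh: dict mapping a gene id in the first genome to its BBH in the
         second genome (assumed to be computed elsewhere).
    """
    return sum(1 for gene in neighbors_a if bbh.get(gene) in neighbors_b)
```

A pair of genes with conserved context, as in the upper portion of Fig. 2, would yield a high count; a pair whose neighborhoods share no BBHs would yield zero.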

