Identification and correction of abnormal, incomplete and mispredicted proteins in public databases.

Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Bányai L, Patthy L - BMC Bioinformatics (2008)

Bottom Line: Despite significant improvements in computational annotation of genomes, sequences of abnormal, incomplete or incorrectly predicted genes and proteins remain abundant in public databases. MisPred works efficiently in identifying errors in predictions generated by the most reliable gene prediction tools such as the EnsEMBL and NCBI's GNOMON pipelines and also guides the correction of errors. We suggest that application of the MisPred approach will significantly improve the quality of gene predictions and the associated databases.


Affiliation: Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, H-1113 Budapest, Hungary. nagya@enzim.hu

ABSTRACT

Background: Despite significant improvements in computational annotation of genomes, sequences of abnormal, incomplete or incorrectly predicted genes and proteins remain abundant in public databases. Since the majority of incomplete, abnormal or mispredicted entries are not annotated as such, these errors seriously affect the reliability of these databases. Here we describe the MisPred approach that may provide an efficient means for the quality control of databases. The current version of the MisPred approach uses five distinct routines for identifying abnormal, incomplete or mispredicted entries based on the principle that a sequence is likely to be incorrect if some of its features conflict with our current knowledge about protein-coding genes and proteins: (i) conflict between the predicted subcellular localization of proteins and the absence of the corresponding sequence signals; (ii) presence of extracellular and cytoplasmic domains and the absence of transmembrane segments; (iii) co-occurrence of extracellular and nuclear domains; (iv) violation of domain integrity; (v) chimeras encoded by two or more genes located on different chromosomes.
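The rule-based principle behind the first three conflicts can be sketched as a few set checks over pre-computed annotations. This is only an illustrative sketch, not the MisPred implementation: the `ProteinFeatures` record, its field names, the localization labels and the `find_conflicts` function are all hypothetical, and Conflict (i) is simplified here to "extracellular domain present but neither a signal peptide nor a transmembrane segment predicted". Conflicts (iv) and (v) need domain-length and genome-mapping data and are left out.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinFeatures:
    """Hypothetical pre-computed annotations for one protein sequence."""
    has_signal_peptide: bool = False
    transmembrane_segments: int = 0
    # Localization classes of the domains found in the sequence, e.g.
    # {"extracellular", "cytoplasmic", "nuclear"} (labels are assumptions).
    domain_classes: set = field(default_factory=set)


def find_conflicts(p: ProteinFeatures) -> list:
    """Return the numbers (1-3) of the conflicts raised by this sequence."""
    conflicts = []
    has_tm = p.transmembrane_segments > 0

    # (i) Simplified: extracellular domain without any predicted signal
    # that could route the protein to the secretory pathway.
    if "extracellular" in p.domain_classes and not (p.has_signal_peptide or has_tm):
        conflicts.append(1)

    # (ii) Extracellular and cytoplasmic domains but no transmembrane segment.
    if {"extracellular", "cytoplasmic"} <= p.domain_classes and not has_tm:
        conflicts.append(2)

    # (iii) Co-occurrence of extracellular and nuclear domains.
    if {"extracellular", "nuclear"} <= p.domain_classes:
        conflicts.append(3)

    return conflicts


if __name__ == "__main__":
    suspect = ProteinFeatures(domain_classes={"extracellular", "cytoplasmic"})
    print(find_conflicts(suspect))
```

A sequence that raises any conflict is only a candidate for an erroneous entry; in this sketch the list of conflict numbers would be passed on for manual or automated follow-up.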

Results: Analyses of predicted EnsEMBL protein sequences of nine deuterostome (Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Fugu rubripes, Danio rerio and Ciona intestinalis) and two protostome species (Caenorhabditis elegans and Drosophila melanogaster) have revealed that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions. Analyses of sequences predicted by NCBI's GNOMON annotation pipeline show that the rates of mispredictions are comparable to those of EnsEMBL. Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON-predicted entries.

Conclusion: MisPred works efficiently in identifying errors in predictions generated by the most reliable gene prediction tools such as the EnsEMBL and NCBI's GNOMON pipelines and also guides the correction of errors. We suggest that application of the MisPred approach will significantly improve the quality of gene predictions and the associated databases.



Figure 4: Error detected by the MisPred routine for Conflict 4: the case of the Swiss-Prot entry EPHA5_RAT. This protein contains a C-terminally truncated SAM_1 domain that deviates significantly from the normal size of this domain family. It is noteworthy that orthologs from mouse, human and chicken contain an intact SAM_1 domain. The sequence of this protein was corrected by a targeted search of the rat genome using the sequences of the full-length orthologs. The alignment shows the C-terminal parts of EPHA5_RAT, EPHA5_RAT_corrected, EPHA5_MOUSE [Swiss-Prot:Q60629], EPHA5_HUMAN [Swiss-Prot:P54756] and EPHA5_CHICK [Swiss-Prot:P54755]. The region of the predicted SAM_1 domain of EPHA5_RAT_corrected that is absent in EPHA5_RAT is underlined and highlighted in yellow.

Mentions: The MisPred routine for Conflict 4 also identified a number of truly erroneous entries. A major source of this type of error is a Swiss-Prot entry that corresponds to an incomplete protein with a truncated domain. For example, the sequence of EPHA5_RAT [Swiss-Prot: P54757] contains only a fragment of a SAM_1 domain because the protein sequence is truncated at the C-terminal end (Figure 4). The error in EPHA5_RAT could be corrected by a targeted search of the rat genome using the sequences of the full-length orthologs.
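A domain-integrity check of the kind described for Conflict 4 can be sketched by comparing how much of a domain family model a hit covers. The sketch below is an assumption-laden illustration rather than the published routine: the `DomainHit` record, the `integrity_report` function and the 0.6 coverage cut-off are hypothetical, and the coordinates in the usage example are invented for demonstration, not taken from EPHA5_RAT.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    """Hypothetical summary of one domain match against a family model."""
    family: str
    model_start: int   # first model position covered by the hit (1-based)
    model_end: int     # last model position covered by the hit (inclusive)
    model_length: int  # full length of the family model


def integrity_report(hit: DomainHit, min_coverage: float = 0.6):
    """Classify a hit as intact or truncated; the cut-off is an assumption."""
    if hit.model_length <= 0:
        raise ValueError("model_length must be positive")
    if not (1 <= hit.model_start <= hit.model_end <= hit.model_length):
        raise ValueError("hit coordinates fall outside the model")

    covered = hit.model_end - hit.model_start + 1
    coverage = covered / hit.model_length
    missing_n = hit.model_start - 1                # model positions lost at the N-terminus
    missing_c = hit.model_length - hit.model_end   # model positions lost at the C-terminus

    if coverage >= min_coverage:
        status = "intact"
    elif missing_c >= missing_n:
        status = "C-terminally truncated"
    else:
        status = "N-terminally truncated"
    return status, coverage


if __name__ == "__main__":
    # Invented coordinates: a hit covering only the first 30 of 64 model positions.
    status, coverage = integrity_report(DomainHit("SAM_1", 1, 30, 64))
    print(status, round(coverage, 2))
```

Reporting which end of the domain is missing is a design choice made here because it points to where a targeted genome search with full-length orthologs should look for the lost coding sequence.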

