Three-level prediction of protein function by combining profile-sequence search, profile-profile search, and domain co-occurrence networks.

Wang Z, Cao R, Cheng J - BMC Bioinformatics (2013)

Bottom Line: Predicting protein function from sequence is useful for biochemical experiment design, mutagenesis analysis, protein engineering, protein design, biological pathway analysis, drug design, disease diagnosis, and genome annotation, because a vast number of protein sequences with unknown function are routinely generated by DNA, RNA and protein sequencing in the genomic era. However, despite significant progress in the last several years, the accuracy of protein function prediction still needs to improve before it can be used effectively in practice, particularly when little or no homology exists between a target protein and proteins with annotated function. These results show that our approach can combine the complementary strengths of the most widely used BLAST-based function prediction methods, the more sensitive profile-profile comparison-based homology detection methods that are rarely used in function prediction, and non-homology-based domain co-occurrence networks, to effectively extend the power of function prediction from high homology, to low homology, to no homology (ab initio cases).


Affiliation: Department of Computer Science, University of Missouri, Columbia, Missouri 65211, USA.

ABSTRACT
Predicting protein function from sequence is useful for biochemical experiment design, mutagenesis analysis, protein engineering, protein design, biological pathway analysis, drug design, disease diagnosis, and genome annotation, because a vast number of protein sequences with unknown function are routinely generated by DNA, RNA and protein sequencing in the genomic era. However, despite significant progress in the last several years, the accuracy of protein function prediction still needs to improve before it can be used effectively in practice, particularly when little or no homology exists between a target protein and proteins with annotated function. Here, we developed a method that integrates profile-sequence alignment, profile-profile alignment, and Domain Co-Occurrence Networks (DCN) to predict protein function at different levels of complexity, ranging from obvious homology, to remote homology, to no homology. We tested the method blindly in the 2011 Critical Assessment of Function Annotation (CAFA). Our experiments demonstrated that our three-level prediction method effectively increased the recall of function prediction while maintaining a reasonable precision. In particular, our method can predict function terms defined by the Gene Ontology more accurately than three standard baseline methods in most situations, handle multi-domain proteins naturally, and make ab initio function predictions when no homology exists. These results show that our approach can combine the complementary strengths of the most widely used BLAST-based function prediction methods, the more sensitive profile-profile comparison-based homology detection methods that are rarely used in function prediction, and non-homology-based domain co-occurrence networks, to effectively extend the power of function prediction from high homology, to low homology, to no homology (ab initio cases).


Figure 1: Precision and recall of our three predictors and three baseline methods when considering the top n predictions, 1 ≤ n ≤ 20, ranked by confidence score.

Figure 1 plots the precision-recall curves of the six predictors. Our methods tend to have higher recall and lower precision, whereas two of the baseline methods, Prior and GOtcha, have lower recall and higher precision. For instance, our best-performing predictor 1 reaches a recall as high as ~0.55, whereas the highest recall of Prior and GOtcha is ~0.27. The baseline methods, particularly the Prior method, which selects the most frequent GO terms in a general protein function database, may achieve high precision but low recall because their top n predictions tend to be more general GO terms that lie closer to the root node of the Gene Ontology but farther from the most specific true GO terms. To illustrate this point, we calculated the average depth (the number of nodes from a GO term to the root node) of the GO terms predicted for the target protein T07719 by each of the six predictors at a recall of ~0.1. The average depths for our predictors 1, 2, and 3 and for Prior, BLAST, and GOtcha are 16.3, 16.3, 18.4, 13.9, 16.1, and 4.8, respectively, which shows that our predictors tended to predict deeper GO terms than Prior and GOtcha.

Although the precision-recall curves of our methods and those of the baseline methods largely occupy two different areas of the plot, where their recalls overlap (~0.22 to ~0.27) our predictors have higher precision than Prior and GOtcha at the same recall. The BLAST baseline reaches a maximum recall of only ~0.18 and a maximum precision of ~0.31, which is clearly worse than our three methods, including our least accurate predictor 3, which is based on PSI-BLAST search alone. This suggests that PSI-BLAST may work better than BLAST for protein function prediction, although other factors, such as how GO terms are ranked based on alignments, cannot be ruled out.
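The two quantities discussed for Figure 1, precision and recall over the top n predictions and the average depth of predicted GO terms, can be illustrated with a minimal Python sketch. It is not the evaluation code used in the study. It assumes plain set overlap between predicted and true terms, with no propagation of terms to their ancestors, and it assumes depth is counted as the number of steps on the shortest path from a term to the root. All function names and the toy ontology are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def precision_recall_at_n(
    ranked_terms: List[Tuple[str, float]],
    true_terms: Set[str],
    n: int,
) -> Tuple[float, float]:
    """Precision and recall of the n highest-scoring predicted terms.

    Assumption: a prediction counts as correct only if it exactly matches
    a true term (no ancestor propagation).
    """
    ordered = sorted(ranked_terms, key=lambda pair: pair[1], reverse=True)
    top = [term for term, _score in ordered[:n]]
    hits = len(set(top) & true_terms)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(true_terms) if true_terms else 0.0
    return precision, recall


def depth(term: str, parents: Dict[str, List[str]]) -> int:
    """Steps on the shortest path from term to a root (root has depth 0).

    Assumption: the text does not say how multiple paths to the root are
    handled; the shortest path is used here purely for illustration.
    """
    frontier = {term}
    seen = {term}
    steps = 0
    while frontier:
        # A term with no parents is treated as a root.
        if any(not parents.get(t) for t in frontier):
            return steps
        next_frontier = set()
        for t in frontier:
            for parent in parents[t]:
                if parent not in seen:
                    seen.add(parent)
                    next_frontier.add(parent)
        frontier = next_frontier
        steps += 1
    raise ValueError(f"no root reachable from {term!r}")


def average_depth(terms: List[str], parents: Dict[str, List[str]]) -> float:
    """Mean depth of a non-empty list of predicted terms."""
    return sum(depth(t, parents) for t in terms) / len(terms)


# Hypothetical toy data: a four-term ontology and one target's predictions.
toy_parents = {"root": [], "A": ["root"], "B": ["A"], "C": ["A", "root"]}
toy_ranked = [("B", 0.9), ("A", 0.8), ("C", 0.3), ("D", 0.1)]
toy_true = {"B", "C"}

if __name__ == "__main__":
    for n in range(1, 5):
        p, r = precision_recall_at_n(toy_ranked, toy_true, n)
        print(f"n={n}: precision={p:.2f} recall={r:.2f}")
    print("average depth:", average_depth(["A", "B", "C"], toy_parents))
```

In this toy case, increasing n raises recall while precision falls once incorrect terms enter the top n, which is the trade-off the curves in Figure 1 trace out. A predictor that returns only shallow, general terms would score a low average depth under this definition.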

