Three-level prediction of protein function by combining profile-sequence search, profile-profile search, and domain co-occurrence networks.

Wang Z, Cao R, Cheng J - BMC Bioinformatics (2013)

Bottom Line: Predicting protein function from sequence is useful for biochemical experiment design, mutagenesis analysis, protein engineering, protein design, biological pathway analysis, drug design, disease diagnosis, and genome annotation, as a vast number of protein sequences with unknown function are routinely being generated by DNA, RNA and protein sequencing in the genomic era. However, despite significant progress in the last several years, the accuracy of protein function prediction still needs to be improved in order to be used effectively in practice, particularly when little or no homology exists between a target protein and proteins with annotated function. These results show that our approach can combine the complementary strengths of the most widely used BLAST-based function prediction methods, the more sensitive profile-profile comparison-based homology detection methods that are rarely used in function prediction, and non-homology-based domain co-occurrence networks, to effectively extend the power of function prediction from high homology, to low homology, to no homology (ab initio cases).


Affiliation: Department of Computer Science, University of Missouri, Columbia, Missouri 65211, USA.

ABSTRACT
Predicting protein function from sequence is useful for biochemical experiment design, mutagenesis analysis, protein engineering, protein design, biological pathway analysis, drug design, disease diagnosis, and genome annotation, as a vast number of protein sequences with unknown function are routinely being generated by DNA, RNA and protein sequencing in the genomic era. However, despite significant progress in the last several years, the accuracy of protein function prediction still needs to be improved in order to be used effectively in practice, particularly when little or no homology exists between a target protein and proteins with annotated function. Here, we developed a method that integrates profile-sequence alignment, profile-profile alignment, and Domain Co-Occurrence Networks (DCN) to predict protein function at different levels of complexity, ranging from obvious homology, to remote homology, to no homology. We tested the method blindly in the 2011 Critical Assessment of Function Annotation (CAFA). Our experiments demonstrated that our three-level prediction method effectively increased the recall of function prediction while maintaining a reasonable precision. In particular, our method can predict function terms defined by the Gene Ontology more accurately than three standard baseline methods in most situations, handle multi-domain proteins naturally, and make ab initio function predictions when no homology exists. These results show that our approach can combine the complementary strengths of the most widely used BLAST-based function prediction methods, the more sensitive profile-profile comparison-based homology detection methods that are rarely used in function prediction, and non-homology-based domain co-occurrence networks, to effectively extend the power of function prediction from high homology, to low homology, to no homology (ab initio cases).


Figure 2: Precision and recall of our three predictors and three baseline methods when considering predictions with a confidence score above a threshold t, 0 ≤ t ≤ 1. Threshold values evenly distributed over the range [0, 1] at a step size of 0.01 were used to calculate precision and recall.

Mentions: Figure 2 shows the precision-recall curves of our three predictors and the three baseline methods. Because all predictions with a confidence score above a threshold, rather than just the top 20 predictions, are selected for evaluation, all the predictors in Figure 2 can reach a higher recall than those in Figure 1. In particular, the Prior method yields the highest recall (0.82) among all predictors because it always predicts 836 GO terms for each target protein, several times more than any other predictor (Table 1). The other two baseline methods (BLAST and GOtcha), which predict more than twice as many GO terms as our methods, have maximum recall values of 0.55 and 0.5, respectively, which are lower than those of our three predictors. It is worth noting that our predictor 1, which predicted ~73 GO terms on average, delivered the second highest recall (~0.68). When recall is higher than ~0.31, our three predictors have higher precision than all three baseline methods at the same recall, and predictor 1 mostly performed better than, or occasionally comparably to, predictors 2 and 3. That predictor 1 mostly performed better than predictor 2 suggests that the sequential combination of the three levels of predictions generated by PSI-BLAST, HHSearch and DCN is more effective than the weighted combination of these predictions. Another interesting observation is that predictor 3, which simply pooled all PSI-BLAST hits to generate GO term predictions, performed better than the baseline BLAST and GOtcha methods throughout the entire recall range. It also worked better than the baseline Prior method when recall is > ~0.15, whereas the latter yielded better precision than all other methods when recall is low (< ~0.15). The better performance of the Prior method in the low recall range may be explained by the fact that its highly common GO term predictions were largely correct, but too far from the specific GO functions of the target proteins. Thus its highest precision (~0.85) may serve as an upper limit that current function prediction methods can aim to achieve. Table 2 shows the break-even values of our predictors and the three baseline methods, that is, the values at which precision equals recall. According to this criterion, our methods performed better than the baseline methods, and predictor 1 yielded the highest break-even value, 0.306.
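
The threshold sweep behind such a precision-recall curve, and the break-even value read from it, can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes CAFA-style averaging (precision averaged over proteins with at least one prediction at the threshold, recall averaged over all annotated proteins), which the text above does not specify, and it treats "above a threshold" as score ≥ t. The function names and the toy GO term labels are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Predictions = Dict[str, Dict[str, float]]  # protein -> {GO term: confidence score}
Annotations = Dict[str, Set[str]]          # protein -> set of true GO terms


def precision_recall_at(preds: Predictions, truth: Annotations,
                        t: float) -> Tuple[float, float]:
    """Average per-protein precision and recall for predictions scoring >= t."""
    precisions: List[float] = []
    recalls: List[float] = []
    for protein, true_terms in truth.items():
        scored = preds.get(protein, {})
        predicted = {term for term, score in scored.items() if score >= t}
        hits = len(predicted & true_terms)
        if predicted:
            # Assumption: proteins with no prediction at this threshold
            # do not contribute to the precision average.
            precisions.append(hits / len(predicted))
        recalls.append(hits / len(true_terms) if true_terms else 0.0)
    precision = sum(precisions) / len(precisions) if precisions else 0.0
    recall = sum(recalls) / len(recalls) if recalls else 0.0
    return precision, recall


def pr_curve(preds: Predictions, truth: Annotations,
             steps: int = 100) -> List[Tuple[float, float, float]]:
    """(threshold, precision, recall) for thresholds 0, 1/steps, ..., 1."""
    return [(i / steps, *precision_recall_at(preds, truth, i / steps))
            for i in range(steps + 1)]


def break_even(curve: List[Tuple[float, float, float]]
               ) -> Optional[Tuple[float, float, float]]:
    """Curve point where precision and recall are closest, ignoring points
    with zero recall (the trivial case where nothing correct is predicted)."""
    candidates = [row for row in curve if row[2] > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda row: abs(row[1] - row[2]))


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy_preds = {"P1": {"GO:A": 0.9, "GO:B": 0.4},
                 "P2": {"GO:A": 0.6, "GO:C": 0.2}}
    toy_truth = {"P1": {"GO:A"}, "P2": {"GO:C", "GO:D"}}
    print(break_even(pr_curve(toy_preds, toy_truth)))
```

With the default of 100 steps, the sweep uses the same 0.01 step size as the Figure 2 caption. A predictor that emits many terms per protein, such as the Prior baseline, can reach high recall at low thresholds, while the break-even point summarizes the whole curve in a single number comparable to the values reported in Table 2.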

