Three-level prediction of protein function by combining profile-sequence search, profile-profile search, and domain co-occurrence networks.

Wang Z, Cao R, Cheng J - BMC Bioinformatics (2013)

Bottom Line: Predicting protein function from sequence is useful for biochemical experiment design, mutagenesis analysis, protein engineering, protein design, biological pathway analysis, drug design, disease diagnosis, and genome annotation as a vast number of protein sequences with unknown function are routinely being generated by DNA, RNA and protein sequencing in the genomic era. However, despite significant progress in the last several years, the accuracy of protein function prediction still needs to be improved in order to be used effectively in practice, particularly when little or no homology exists between a target protein and proteins with annotated function. These results show that our approach can combine complementary strengths of most widely used BLAST-based function prediction methods, rarely used in function prediction but more sensitive profile-profile comparison-based homology detection methods, and non-homology-based domain co-occurrence networks, to effectively extend the power of function prediction from high homology, to low homology, to no homology (ab initio cases).


Affiliation: Department of Computer Science, University of Missouri, Columbia, Missouri 65211, USA.

ABSTRACT
Predicting protein function from sequence is useful for biochemical experiment design, mutagenesis analysis, protein engineering, protein design, biological pathway analysis, drug design, disease diagnosis, and genome annotation as a vast number of protein sequences with unknown function are routinely being generated by DNA, RNA and protein sequencing in the genomic era. However, despite significant progress in the last several years, the accuracy of protein function prediction still needs to be improved in order to be used effectively in practice, particularly when little or no homology exists between a target protein and proteins with annotated function. Here, we developed a method that integrated profile-sequence alignment, profile-profile alignment, and Domain Co-Occurrence Networks (DCN) to predict protein function at different levels of complexity, ranging from obvious homology, to remote homology, to no homology. We tested the method blindly in the 2011 Critical Assessment of Function Annotation (CAFA). Our experiments demonstrated that our three-level prediction method effectively increased the recall of function prediction while maintaining a reasonable precision. In particular, our method can predict function terms defined by the Gene Ontology more accurately than three standard baseline methods in most situations, handle multi-domain proteins naturally, and make ab initio function predictions when no homology exists. These results show that our approach can combine complementary strengths of most widely used BLAST-based function prediction methods, rarely used in function prediction but more sensitive profile-profile comparison-based homology detection methods, and non-homology-based domain co-occurrence networks, to effectively extend the power of function prediction from high homology, to low homology, to no homology (ab initio cases).


Figure 4: Average similarity scores of our three predictors and three baseline methods for top 1st .. 20th ranked predictions.

Mentions: where θ(r1) and θ(r2) denote the number of GO terms on paths r1 and r2, respectively. The numerator is the number of GO terms shared by paths r1 and r2. For a target protein, every predicted GO term was compared with each of the true GO terms to calculate similarity scores, and the highest score was taken as the similarity between that predicted GO term and the true GO terms. Averaging these similarities over all predicted GO terms of a target gave the similarity score for the target, and the average over all target proteins was used as the prediction similarity score of a predictor. We computed the similarity scores of all predictors for the top 1st, 2nd, ..., 20th ranked predicted GO terms. Note that because GO terms with the same confidence score receive the same rank, the set of top 1st to nth ranked GO terms may contain more than n GO terms. The average similarity scores of the predictors are plotted in Figure 4. According to this measure, all three of our predictors performed better than the three baseline methods. Predictor-3 had higher scores than predictor-1 and predictor-2, largely because the latter two more often assigned the same confidence score to several GO terms, so more 1st to nth ranked GO terms were selected for their evaluation. In addition to the average similarity score for each target, we also calculated the best similarity score of a target, i.e. the highest similarity score among all GO term predictions for the target, and averaged the best similarity scores over all targets for each predictor. Figure 5 shows the best similarity scores of the six predictors for the top 1-20 predictions, which also demonstrates the better performance of our three predictors compared with the three baseline methods. Predictor-1 and predictor-2 performed better than predictor-3 according to this measure.
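
The scoring procedure described in this excerpt can be illustrated with a small Python sketch. It is not the authors' code, and several parts are assumptions made only for illustration: the excerpt gives the numerator of the path similarity (shared GO terms) but not the full formula, so the denominator max(θ(r1), θ(r2)) used in `path_similarity` is a placeholder; each GO term is represented by a single path to the root; ties are handled with dense ranking; and the GO identifiers (`GO:A`, `GO:B`, ...) and all function names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# One path = the set of GO terms from a term up to the ontology root.
Path = FrozenSet[str]


def path_similarity(r1: Path, r2: Path) -> float:
    """Shared GO terms divided by the longer path length.

    ASSUMPTION: only the numerator comes from the text; the denominator
    max(theta(r1), theta(r2)) is a placeholder, not the published formula.
    """
    if not r1 or not r2:
        return 0.0
    return len(r1 & r2) / max(len(r1), len(r2))


def top_n_with_ties(confidence: Dict[str, float], n: int) -> List[str]:
    """Terms ranked 1st..nth, where equal confidence scores share a rank.

    ASSUMPTION: dense ranking (1, 1, 2, ...), so the result may hold
    more than n terms.
    """
    distinct = sorted(set(confidence.values()), reverse=True)[:n]
    if not distinct:
        return []
    cutoff = distinct[-1]
    selected = [t for t, s in confidence.items() if s >= cutoff]
    return sorted(selected, key=lambda t: -confidence[t])


def target_scores(predicted: List[Path], true: List[Path]) -> Tuple[float, float]:
    """Return (average, best) similarity score for one target protein."""
    # Each predicted term keeps its highest similarity to any true term.
    per_term = [max(path_similarity(p, t) for t in true) for p in predicted]
    return sum(per_term) / len(per_term), max(per_term)


# Hypothetical toy ontology paths (term -> terms on its path to the root).
paths: Dict[str, Path] = {
    "GO:A": frozenset({"root", "x", "GO:A"}),
    "GO:B": frozenset({"root", "x", "GO:B"}),
    "GO:C": frozenset({"root", "y", "GO:C"}),
    "GO:D": frozenset({"root", "y", "z", "GO:D"}),
}
true_terms = ["GO:A"]
confidence = {"GO:B": 0.9, "GO:C": 0.9, "GO:D": 0.5}

top1 = top_n_with_ties(confidence, 1)  # two terms tie for rank 1
avg_score, best_score = target_scores(
    [paths[t] for t in top1], [paths[t] for t in true_terms]
)
print(top1, round(avg_score, 3), round(best_score, 3))
```

In this toy case the "top 1st" set contains two GO terms because they share a confidence score, which is the effect the excerpt gives as the reason the top 1st to nth set can exceed n terms. Averaging over targets to obtain a predictor-level score would be a plain mean of the per-target values returned by `target_scores`.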

