Three-level prediction of protein function by combining profile-sequence search, profile-profile search, and domain co-occurrence networks.

Wang Z, Cao R, Cheng J - BMC Bioinformatics (2013)

Bottom Line: Predicting protein function from sequence is useful for biochemical experiment design, mutagenesis analysis, protein engineering, protein design, biological pathway analysis, drug design, disease diagnosis, and genome annotation, because a vast number of protein sequences with unknown function are routinely generated by DNA, RNA, and protein sequencing in the genomic era. However, despite significant progress in the last several years, the accuracy of protein function prediction still needs to improve before it can be used effectively in practice, particularly when little or no homology exists between a target protein and proteins with annotated function. These results show that our approach combines the complementary strengths of the most widely used BLAST-based function prediction methods, of profile-profile comparison-based homology detection methods that are rarely used in function prediction but are more sensitive, and of non-homology-based domain co-occurrence networks, extending the power of function prediction from high homology, to low homology, to no homology (ab initio cases).


Affiliation: Department of Computer Science, University of Missouri, Columbia, Missouri 65211, USA.

ABSTRACT
Predicting protein function from sequence is useful for biochemical experiment design, mutagenesis analysis, protein engineering, protein design, biological pathway analysis, drug design, disease diagnosis, and genome annotation, because a vast number of protein sequences with unknown function are routinely generated by DNA, RNA, and protein sequencing in the genomic era. However, despite significant progress in the last several years, the accuracy of protein function prediction still needs to improve before it can be used effectively in practice, particularly when little or no homology exists between a target protein and proteins with annotated function. Here, we developed a method that integrates profile-sequence alignment, profile-profile alignment, and Domain Co-occurrence Networks (DCN) to predict protein function at different levels of complexity, ranging from obvious homology, to remote homology, to no homology. We tested the method blindly in the 2011 Critical Assessment of Function Annotation (CAFA). Our experiments demonstrated that the three-level prediction method effectively increased the recall of function prediction while maintaining reasonable precision. In particular, our method can predict function terms defined by the Gene Ontology more accurately than three standard baseline methods in most situations, handle multi-domain proteins naturally, and make ab initio function predictions when no homology exists. These results show that our approach combines the complementary strengths of the most widely used BLAST-based function prediction methods, of profile-profile comparison-based homology detection methods that are rarely used in function prediction but are more sensitive, and of non-homology-based domain co-occurrence networks, extending the power of function prediction from high homology, to low homology, to no homology (ab initio cases).

Figure 6: An example (CAFA target T30248) showing how the DCN-based "aggregated neighbor-counting" method works. (A) Tertiary structure of target protein T30248 predicted by MULTICOM [35] (magenta: domain SEP; green: domain UBX). (B) Electrostatic potential of protein T30248, generated from the predicted structure in (A) (blue: positive; red: negative). (C) The main Domain Co-occurrence Network (DCN) of Homo sapiens used to make function predictions. (D) Radius-one neighbor domains of the four domains of the target: ubiquitin, UBX, SEP, and FERM_N. The GO terms of the neighboring domains were used as predictions for the target. A detailed discussion can be found in the "Results and Discussion" section.

Mentions: We chose an example to illustrate the effectiveness of the DCN function prediction component when neither profile-sequence alignment (PSI-BLAST [4]) nor profile-profile alignment (HHSearch [34]) can make precise predictions. Figure 6 shows how functions were predicted with the Domain Co-occurrence Network (DCN) for target T30248, a multi-domain protein in Mus musculus (house mouse). Figures 6(A) and 6(B) illustrate the tertiary structure of the protein predicted by MULTICOM [35] and its electrostatic potential (blue: positively charged; red: negatively charged), calculated and visualized with DeepView (http://spdbv.vital-it.ch/). To make a function prediction, the DCN method ran PSI-BLAST to search the target protein against our pre-built protein sequence database containing the proteomes of H. sapiens, S. cerevisiae, C. elegans, D. melanogaster, 15 plants, and 398 bacterial species (detailed organism names can be found at [16]), and identified the most significant homologous hit: an H. sapiens protein (Swiss-Prot ID P68543) with a PSI-BLAST e-value of 3e-95 and a score of 348. It then used the DCN of H. sapiens (human) (Figure 6(C)) to make predictions as follows. First, it used the profile-profile alignment tool HHSearch [34] to search the target protein against the PfamA [33] database, which detected eight domain families with a homology probability of at least 0.80: SEP, UBX, Spt20, ubiquitin, UN_NPL4, Cobl, Rad60-SLD, and FERM_N. The four domains that exist in the H. sapiens proteome (SEP, UBX, ubiquitin, and FERM_N) were then used as central domains in the DCN of H. sapiens to identify domain neighbors. Because the domains SEP, UBX, and FERM_N have no GO functional annotations in the Pfam database, and the ubiquitin domain has only one general GO term annotation (GO:0006464), the precise function of the target could not be inferred directly from these domains.
However, the DCN method was able to use the annotated GO terms of the domains neighboring these four domains to predict function for the target, as shown in Figure 6(D), where red nodes denote the central domains detected for the target and yellow nodes represent their radius-one neighboring domains. The GO terms of the neighboring domains were aggregated and ranked by frequency. The top-ranked GO terms were used as predictions, and their frequencies were used as the confidence scores of the predicted GO terms. Radius-two neighboring domains (not illustrated) were used to generate predictions in the same way.
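
The aggregated neighbor-counting step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`central_domains`, `neighbors_within`, `aggregated_neighbor_counting`), the adjacency-dictionary representation of the DCN, and the toy domain names and GO labels are all hypothetical. The sketch also assumes that a neighbor shared by two central domains contributes its GO terms once per central domain, and that raw frequencies serve directly as confidence scores; the excerpt does not specify either detail.

```python
from collections import Counter


def central_domains(hits, proteome_domains, min_prob=0.80):
    """Keep domain hits at or above the probability cutoff that also
    occur in the reference proteome (and hence in its DCN)."""
    return [d for d, p in hits.items() if p >= min_prob and d in proteome_domains]


def neighbors_within(dcn, start, radius):
    """Domains reachable from `start` in 1..radius edges, excluding `start`."""
    seen = {start}
    frontier = {start}
    for _ in range(radius):
        frontier = {n for d in frontier for n in dcn.get(d, ()) if n not in seen}
        seen |= frontier
    return seen - {start}


def aggregated_neighbor_counting(dcn, centers, domain_go, radius=1, top_k=5):
    """Pool the GO terms of all neighbors of all central domains and rank
    them by frequency (assumption: frequency is the confidence score)."""
    counts = Counter()
    for center in centers:
        for neighbor in neighbors_within(dcn, center, radius):
            counts.update(domain_go.get(neighbor, ()))
    return counts.most_common(top_k)


# Toy data (entirely hypothetical, not taken from the paper).
hits = {"D1": 0.95, "D2": 0.85, "D3": 0.60, "D4": 0.90}
proteome_domains = {"D1", "D2", "N1", "N2", "N3", "N4"}
dcn = {
    "D1": {"N1", "N2"},
    "D2": {"N2", "N3"},
    "N1": {"D1"},
    "N2": {"D1", "D2"},
    "N3": {"D2", "N4"},
    "N4": {"N3"},
}
domain_go = {"N1": ["GO:a"], "N2": ["GO:a", "GO:b"], "N3": ["GO:c"], "N4": ["GO:d"]}

centers = central_domains(hits, proteome_domains)
radius_one = aggregated_neighbor_counting(dcn, centers, domain_go, radius=1)
radius_two = aggregated_neighbor_counting(dcn, centers, domain_go, radius=2, top_k=10)
print(centers, radius_one, radius_two)
```

In the toy data, `D3` is dropped by the probability cutoff and `D4` is dropped because it is absent from the reference proteome, mirroring how four of the eight detected domain families were discarded for T30248. Central domains without GO annotations of their own contribute nothing directly; only their annotated neighbors do.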

