The use of Gene Ontology terms for predicting highly-connected 'hub' nodes in protein-protein interaction networks.

Hsing M, Byler KG, Cherkasov A - BMC Syst Biol (2008)

Bottom Line: Testing the developed hub classifier on external sets of experimental protein interaction data in Methicillin-resistant Staphylococcus aureus (MRSA) 252 and Caenorhabditis elegans demonstrated that our approach can predict hub proteins with a high degree of accuracy. A practical application of the developed bioinformatics method has been illustrated by the effective protein bait selection for large-scale pull-down experiments that aim to map complete protein-protein interaction networks for several species. The successful development of an accurate hub classifier demonstrated that highly-connected proteins tend to share certain relevant functional properties reflected in their Gene Ontology annotations. It is anticipated that the developed bioinformatics hub classifier will represent a useful tool for the theoretical prediction of highly-interacting proteins, the study of cellular network organizations, and the identification of prospective drug targets - even in those organisms that currently lack large-scale protein interaction data.


Affiliation: Faculty of Graduate Studies, Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, Canada. mhsing@interchange.ubc.ca

ABSTRACT

Background: Protein-protein interactions mediate a wide range of cellular functions and responses and have been studied rigorously through recent large-scale proteomics experiments and bioinformatics analyses. One of the most important findings of those endeavours was the observation that 'hub' proteins participate in significant numbers of protein interactions and play critical roles in the organization and function of cellular protein interaction networks (PINs) [1,2]. It has also been demonstrated that such hub proteins may constitute an important pool of attractive drug targets. Thus, it is crucial to be able to identify hub proteins based not only on experimental data but also by means of bioinformatics predictions.

Results: A hub protein classifier has been developed based on the available interaction data and Gene Ontology (GO) annotations for proteins in the Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster and Homo sapiens genomes. In particular, by utilizing the machine learning method of boosting trees we were able to create a predictive bioinformatics tool for the identification of proteins that are likely to play the role of a hub in protein interaction networks. Testing the developed hub classifier on external sets of experimental protein interaction data in Methicillin-resistant Staphylococcus aureus (MRSA) 252 and Caenorhabditis elegans demonstrated that our approach can predict hub proteins with a high degree of accuracy. A practical application of the developed bioinformatics method has been illustrated by the effective protein bait selection for large-scale pull-down experiments that aim to map complete protein-protein interaction networks for several species.

Conclusion: The successful development of an accurate hub classifier demonstrated that highly-connected proteins tend to share certain relevant functional properties reflected in their Gene Ontology annotations. It is anticipated that the developed bioinformatics hub classifier will represent a useful tool for the theoretical prediction of highly-interacting proteins, the study of cellular network organizations, and the identification of prospective drug targets - even in those organisms that currently lack large-scale protein interaction data.


© Copyright Policy - open-access

Figure 3: Distribution of GO annotation terms between the training and testing sets in the four cross-validation samples. Each point on a graph represents the percentage of proteins annotated with a given GO term in the training set (x-axis), and the percentage of proteins annotated with the same GO term in the testing set (y-axis). All four plots were fitted with linear regression lines, with high R² values of 0.998. This indicates an equal distribution of the GO terms between the training and testing sets of the four samples.

Mentions: To train models that classify a protein as a hub or a non-hub, the protein interaction data from the four species were combined into a single data set (90,164 interactions involving 2,069 hubs and 19,715 non-hubs). A four-fold cross-validation strategy was used, in which four non-overlapping testing sets (each 25% of the total protein set) and the four corresponding training sets (each 75% of the total protein set) were used to build the hub classifiers. Each training and testing set maintained the same hub to non-hub ratio (1:9). In addition, the proteins in the training sets maintained the same distribution of GO annotation terms as the proteins in the testing sets. Figure 3 illustrates the distribution of each of the 125 GO terms, represented by the percentage of proteins with this term in the training sets vs. the testing sets of the four cross-validation samples. High R² values of 0.9981 to 0.9983 indicated an equal GO distribution between the training and testing sets. It is also shown that the majority of the GO terms were associated with less than 10% of the proteins in a given data set.
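
The cross-validation check described above can be illustrated with a minimal Python sketch. It is not the authors' code: the data are synthetic, the function names (stratified_folds, term_percentages, r_squared) are hypothetical, and the split is stratified only on the hub/non-hub label, since the text does not say how the GO term distribution was balanced. The sketch builds four folds with a 1:9 hub to non-hub ratio, computes the percentage of proteins carrying each of 125 GO terms in the training and testing sets, and reports the R² of the least-squares line through those points, as in Figure 3.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=4, seed=0):
    """Split protein indices into k folds, dealing each class out evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # round-robin keeps class ratios equal
    return folds


def term_percentages(annotations, indices, n_terms):
    """Percentage of the selected proteins annotated with each GO term."""
    counts = [0] * n_terms
    for i in indices:
        for term in annotations[i]:
            counts[term] += 1
    return [100.0 * c / len(indices) for c in counts]


def r_squared(x, y):
    """R^2 of the simple linear regression of y on x (squared Pearson r)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Synthetic stand-in data (assumption): 200 hubs, 1,800 non-hubs, 125 GO terms.
rng = random.Random(42)
N_TERMS = 125
labels = [1] * 200 + [0] * 1800
term_prob = [rng.uniform(0.01, 0.15) for _ in range(N_TERMS)]
annotations = [
    {t for t in range(N_TERMS) if rng.random() < term_prob[t]} for _ in labels
]

folds = stratified_folds(labels, k=4)
results = []
for f, test_idx in enumerate(folds):
    # The training set is the union of the other three folds (75% of proteins).
    train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
    x = term_percentages(annotations, train_idx, N_TERMS)
    y = term_percentages(annotations, test_idx, N_TERMS)
    hubs_in_test = sum(labels[i] for i in test_idx)
    r2 = r_squared(x, y)
    results.append((len(test_idx), hubs_in_test, r2))
    print(f"fold {f}: test size={len(test_idx)}, hubs={hubs_in_test}, R^2={r2:.4f}")
```

An R² close to 1 in such a check means that each GO term occurs at nearly the same frequency in the training and testing sets. The values obtained on this synthetic data are not comparable to the 0.9981 to 0.9983 reported for the real data set, which is roughly ten times larger.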

