InPrePPI: an integrated evaluation method based on genomic context for predicting protein-protein interactions in prokaryotic genomes.

Sun J, Sun Y, Ding G, Liu Q, Wang C, He Y, Shi T, Li Y, Zhao Z - BMC Bioinformatics (2007)

Bottom Line: After realizing the limited power in the prediction using only one genomic feature, investigators are now moving toward integration. So far, there have been few integration studies for PPI prediction; one failed to yield appreciable improvement of prediction and the others did not conduct performance comparison. It remains unclear whether an integration of multiple genomic features can improve the PPI prediction and, if it can, how to integrate these features.

Affiliation: Virginia Institute for Psychiatric and Behavioral Genetics and Department of Psychiatry, Virginia Commonwealth University, Richmond, VA 23298, USA. jsun@vcu.edu

ABSTRACT

Background: Although many genomic features have been used in the prediction of protein-protein interactions (PPIs), frequently only one is used in a computational method. After realizing the limited power in the prediction using only one genomic feature, investigators are now moving toward integration. So far, there have been few integration studies for PPI prediction; one failed to yield appreciable improvement of prediction and the others did not conduct performance comparison. It remains unclear whether an integration of multiple genomic features can improve the PPI prediction and, if it can, how to integrate these features.

Results: In this study, we first performed a systematic evaluation on the PPI prediction in Escherichia coli (E. coli) by four genomic context based methods: the phylogenetic profile method, the gene cluster method, the gene fusion method, and the gene neighbor method. The number of predicted PPIs and the average degree in the predicted PPI networks varied greatly among the four methods. Further, no method outperformed the others when we tested using three well-defined positive datasets from the KEGG, EcoCyc, and DIP databases. Based on these comparisons, we developed a novel integrated method, named InPrePPI. InPrePPI first normalizes the AC value (an integrated value of the accuracy and coverage) of each method using three positive datasets, then calculates a weight for each method, and finally uses the weight to calculate an integrated score for each protein pair predicted by the four genomic context based methods. We demonstrate that InPrePPI outperforms each of the four individual methods and, in general, the other two existing integrated methods: the joint observation method and the integrated prediction method in STRING. These four methods and InPrePPI are implemented in a user-friendly web interface.
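The three steps described above (normalize AC values, derive a weight per method, combine into one score per protein pair) can be sketched as follows. This is a minimal illustration under stated assumptions, not the formulas used by InPrePPI: the abstract does not give them, so the normalization (dividing each AC value by the per-dataset sum), the weight (the average normalized AC across datasets), and the integrated score (a weighted sum) are all assumed here, and the function names are hypothetical.

```python
from typing import Dict


def method_weights(ac: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Turn AC values, given as ac[dataset][method], into one weight per method.

    ASSUMPTION: AC values are normalized within each dataset by dividing by
    the sum over methods, and a method's weight is its share of the total
    normalized AC across datasets. The source does not state the formula.
    """
    methods = sorted({m for per_method in ac.values() for m in per_method})
    totals = {m: 0.0 for m in methods}
    for per_method in ac.values():
        dataset_sum = sum(per_method.values())
        if dataset_sum == 0:
            continue  # a dataset with no signal contributes nothing
        for m in methods:
            totals[m] += per_method.get(m, 0.0) / dataset_sum
    grand_total = sum(totals.values())
    if grand_total == 0:
        # Fall back to equal weights when no dataset carries any signal.
        return {m: 1.0 / len(methods) for m in methods} if methods else {}
    return {m: t / grand_total for m, t in totals.items()}


def integrated_score(weights: Dict[str, float],
                     pair_scores: Dict[str, float]) -> float:
    """Combine per-method scores for one protein pair into a single score.

    ASSUMPTION: the integrated score is a weighted sum. A method that did
    not predict the pair is simply absent from pair_scores.
    """
    return sum(weights.get(m, 0.0) * s for m, s in pair_scores.items())
```

With weights that sum to 1, a pair supported by every method at score 1.0 receives an integrated score of 1.0, and a pair supported by a single method receives that method's weight. A method that does well on only one of the three datasets therefore still contributes, but less than one that does well on all of them.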

Conclusion: This study evaluates PPI prediction by four genomic context based methods and presents an integrated evaluation method that performs better than each individual method in E. coli.

Figure 1: Comparison of PPI prediction by the four methods using the KEGG, EcoCyc, and DIP datasets. Performance of the prediction was measured by AC value.

Mentions: Figure 1 shows the AC values of the four methods on all three datasets. The results can be summarized in three points. First, the gene fusion method (GFM) had the highest AC value of the four methods on the KEGG dataset but the lowest on the EcoCyc and DIP datasets. Further examination of the KEGG dataset, which included 1,386 E. coli proteins, found a total of 117 pathways, of which 103 were in the metabolism category. This indicates that most proteins in the KEGG dataset are involved in metabolism. The preference of the GFM for metabolic proteins is consistent with Tsoka and Ouzounis' previous report [31] and suggests that the GFM performs well in predicting PPIs involved in metabolism. Second, the gene cluster method (GCM) had the highest AC value on the EcoCyc dataset, which is consistent with the concept that genes in the same operon often encode proteins involved in protein complexes. Third, in contrast to the GFM and GCM, the phylogenetic profile method (PPM) had the highest AC value on the DIP dataset but the lowest on the KEGG dataset. This suggests that the PPM may be suitable for predicting PPIs that reflect protein interactions (as in DIP) but not those that reflect shared pathways (as in KEGG). Overall, no single method outperformed the others across all three datasets.