Discovering functional linkages and uncharacterized cellular pathways using phylogenetic profile comparisons: a comprehensive assessment.

Jothi R, Przytycka TM, Aravind L - BMC Bioinformatics (2007)

Bottom Line: Including most parasitic, pathogenic or vertebrate genomes, or multiple strains of the same species, in the reference set does not necessarily improve sensitivity or accuracy. Interestingly, we also found that the evolutionary histories of individual pathways have a significant effect on the performance of the PPC approach with respect to a particular reference set. For example, to accurately predict functional links in carbohydrate or lipid metabolism, a reference set composed solely of prokaryotic (or bacterial) genomes performed among the best compared to one composed of genomes from all three super-kingdoms; in contrast, for predicting functional links in translation, a reference set composed of prokaryotic (or bacterial) genomes performed the worst.


Affiliation: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA. jothi@ncbi.nlm.nih.gov

ABSTRACT

Background: A widely used approach for discovering functional and physical interactions among proteins involves phylogenetic profile comparisons (PPCs). Here, proteins with similar profiles are inferred to be functionally related, under the assumption that proteins involved in the same metabolic pathway or cellular system are likely to have been co-inherited during evolution.
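As a rough illustration of the idea described in the Background, the sketch below represents each protein's phylogenetic profile as a binary presence/absence vector over a reference set of genomes and scores profile similarity with mutual information, the measure this study thresholds. The protein names, the six-genome reference set and the presence/absence calls are hypothetical, and the function is a minimal sketch under those assumptions rather than the authors' implementation.

```python
import math
from collections import Counter


def mutual_information(profile_a, profile_b):
    """Mutual information (in bits) between two equal-length binary profiles.

    Each profile holds one 0/1 entry per reference genome:
    1 if a homolog of the protein is present in that genome, 0 otherwise.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same reference genomes")
    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))  # counts of (a, b) patterns
    count_a = Counter(profile_a)                # marginal counts for protein A
    count_b = Counter(profile_b)                # marginal counts for protein B
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written in terms of raw counts
        mi += p_ab * math.log2(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical presence/absence calls across six reference genomes.
profiles = {
    "protA": (1, 1, 0, 0, 1, 0),
    "protB": (1, 1, 0, 0, 1, 0),  # same phyletic pattern as protA
    "protC": (1, 0, 1, 0, 1, 0),  # partially overlapping pattern
}

print(mutual_information(profiles["protA"], profiles["protB"]))  # 1.0
print(mutual_information(profiles["protA"], profiles["protC"]))  # smaller, about 0.08
```

Identical profiles score highest here because each is present in exactly half of the genomes; a protein present in all (or none) of the reference genomes carries no information under this measure, which is one reason the composition of the reference set matters.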

Results: Our experiments with E. coli and yeast proteins, using 16 different carefully composed reference sets of genomes, revealed that the phyletic patterns of proteins in prokaryotes alone can be adequate to make reasonably accurate functional linkage predictions. A slight improvement in performance is observed on adding a few eukaryotes to the reference set, but a noticeable drop-off in performance is observed as the number of eukaryotes increases. Including most parasitic, pathogenic or vertebrate genomes, or multiple strains of the same species, in the reference set does not necessarily improve sensitivity or accuracy. Interestingly, we also found that the evolutionary histories of individual pathways have a significant effect on the performance of the PPC approach with respect to a particular reference set. For example, to accurately predict functional links in carbohydrate or lipid metabolism, a reference set composed solely of prokaryotic (or bacterial) genomes performed among the best compared to one composed of genomes from all three super-kingdoms; in contrast, for predicting functional links in translation, a reference set composed of prokaryotic (or bacterial) genomes performed the worst. We also demonstrate that the widely used random model for quantifying the statistical significance of profile similarity is incomplete, which could result in an increased number of false positives.

Conclusion: Contrary to previous proposals, it is not merely the number of genomes but a careful selection of informative genomes in the reference set that influences the prediction accuracy of the PPC approach. We note that the predictive power of the PPC approach, especially in eukaryotes, is heavily influenced by the primary endosymbiosis and subsequent bacterial contributions. The over-representation of parasitic unicellular eukaryotes and vertebrates additionally makes eukaryotes less useful in the reference sets. Reference sets composed of a highly non-redundant set of genomes from all three super-kingdoms fare better with pathways showing considerable vertical inheritance and strong conservation (e.g. the translation apparatus), while reference sets composed solely of prokaryotic genomes fare better for more variable pathways such as carbohydrate metabolism. The differential performance of the PPC approach on various pathways, and a weak positive correlation between functional and profile similarities, suggest that caution should be exercised when interpreting functional linkages inferred from genome-wide, large-scale profile comparisons using a single reference set.


Figure 2: Results from phylogenetic profile comparison of 708,645 pairs of proteins chosen from among a subset of 1,347 E. coli proteins. (a) Predictive power of phylogenetic profile analysis. Each point in this plot represents a specific mutual information threshold at which the measures were recorded. Reference sets with diverse bacterial genomes along with a few archaeal and/or eukaryotic genomes (BA, BAE1, BAE2, BAE3a, BAE3b, NR, NR-3, NR-8, LA, and BAE4) outperform the reference set (B) that comprises only bacterial genomes. The performances of BAE3a and NR are almost the same in the zoomed-in high-specificity region (inset), which suggests that adding redundancy (different strains of the same organism) to the reference set does not improve performance. Removing evolutionarily closely related (uninformative) genomes from the best-performing BAE3a (yielding NR-3 and NR-8) decreases performance, but only to a small extent. (b) Sensitivity versus specificity plot.

Mentions: We examined a total of 708,645 possible functional linkages in E. coli and 635,628 possible functional linkages in yeast, using each of the 16 different reference sets of genomes. We considered two proteins to be functionally related (or linked) if they co-occur in at least one KEGG pathway [42,72]. Two proteins are predicted to be functionally related if their mutual information score is above a certain threshold. For each of the 16 reference sets of genomes, performance measures were recorded at various mutual information thresholds. The overall performances using all 16 reference sets of genomes are depicted in Figures 2 and 3 for E. coli and yeast proteins, respectively.
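The evaluation described above can be sketched as follows: a protein pair counts as truly linked if the two proteins share at least one pathway, and as predicted linked if its mutual information score exceeds the threshold. The pathway labels, protein names and scores below are hypothetical toy values, and the standard definitions of sensitivity, TP/(TP+FN), and specificity, TN/(TN+FP), are an assumption here; the paper's exact performance measures are not given in this excerpt.

```python
def evaluate(scores, pathways, threshold):
    """Return (sensitivity, specificity) at one mutual information threshold.

    scores   : dict mapping a protein pair (a, b) to its profile-similarity score
    pathways : dict mapping each protein to the set of pathways it belongs to
    """
    tp = fp = tn = fn = 0
    for (a, b), score in scores.items():
        predicted = score > threshold            # predicted functional link
        actual = bool(pathways[a] & pathways[b])  # share at least one pathway
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Hypothetical pathway memberships and pairwise scores (not data from the study).
pathways = {
    "p1": {"pathX"},
    "p2": {"pathX"},
    "p3": {"pathY"},
    "p4": {"pathY"},
}
scores = {
    ("p1", "p2"): 0.90,
    ("p1", "p3"): 0.10,
    ("p1", "p4"): 0.60,
    ("p2", "p3"): 0.20,
    ("p2", "p4"): 0.05,
    ("p3", "p4"): 0.40,
}

# Sweeping the threshold traces out one curve, as is done per reference set.
for threshold in (0.5, 0.3):
    print(threshold, evaluate(scores, pathways, threshold))
# 0.5 (0.5, 0.75)
# 0.3 (1.0, 0.75)
```

Repeating the sweep with scores computed from a different reference set of genomes would give a second curve, which is the kind of comparison the sensitivity versus specificity plots summarize.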

