Discovering functional linkages and uncharacterized cellular pathways using phylogenetic profile comparisons: a comprehensive assessment.

Jothi R, Przytycka TM, Aravind L - BMC Bioinformatics (2007)

Bottom Line: Inclusion of most parasitic, pathogenic or vertebrate genomes and multiple strains of the same species in the reference set does not necessarily contribute to improved sensitivity or accuracy. Interestingly, we also found that evolutionary histories of individual pathways have a significant effect on the performance of the PPC approach with respect to a particular reference set. For example, to accurately predict functional links in carbohydrate or lipid metabolism, a reference set solely composed of prokaryotic (or bacterial) genomes performed among the best compared to one composed of genomes from all three super-kingdoms; this is in contrast to predicting functional links in translation, for which a reference set composed of prokaryotic (or bacterial) genomes performed the worst.


Affiliation: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA. jothi@ncbi.nlm.nih.gov

ABSTRACT

Background: A widely used approach for discovering functional and physical interactions among proteins involves phylogenetic profile comparisons (PPCs). Here, proteins with similar profiles are inferred to be functionally related under the assumption that proteins involved in the same metabolic pathway or cellular system are likely to have been co-inherited during evolution.
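As a rough illustration of what a profile comparison involves, the sketch below scores two profiles by mutual information, the similarity measure named in the Figure 6 caption. The binary presence/absence encoding, the base-2 logarithm, and the example profiles `prot_x`, `prot_y` and `prot_z` are assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter


def mutual_information(profile_a, profile_b):
    """Mutual information (in bits) between two equal-length profiles.

    Each profile lists, per reference genome, 1 if a homolog of the
    protein is present and 0 if it is absent (assumed encoding).
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same reference genomes")
    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))  # counts of (a, b) states
    count_a = Counter(profile_a)                # marginal counts for A
    count_b = Counter(profile_b)                # marginal counts for B
    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (n_ab / n) * math.log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical profiles over 8 reference genomes
prot_x = [1, 1, 0, 0, 1, 0, 1, 0]
prot_y = [1, 1, 0, 0, 1, 0, 1, 0]  # same pattern as prot_x
prot_z = [1, 0, 1, 0, 1, 1, 0, 0]  # statistically independent of prot_x

print(mutual_information(prot_x, prot_y))  # 1.0
print(mutual_information(prot_x, prot_z))  # 0.0
```

Note that mutual information is symmetric in presence and absence, so a perfectly complementary profile scores as high as an identical one. The score also depends on which genomes make up the columns, which is why the choice of reference set matters.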

Results: Our experimentation with E. coli and yeast proteins with 16 different carefully composed reference sets of genomes revealed that the phyletic patterns of proteins in prokaryotes alone could be adequate to make reasonably accurate functional linkage predictions. A slight improvement in performance is observed on adding a few eukaryotes to the reference set, but a noticeable drop-off in performance is observed with an increased number of eukaryotes. Inclusion of most parasitic, pathogenic or vertebrate genomes and multiple strains of the same species in the reference set does not necessarily contribute to improved sensitivity or accuracy. Interestingly, we also found that evolutionary histories of individual pathways have a significant effect on the performance of the PPC approach with respect to a particular reference set. For example, to accurately predict functional links in carbohydrate or lipid metabolism, a reference set solely composed of prokaryotic (or bacterial) genomes performed among the best compared to one composed of genomes from all three super-kingdoms; this is in contrast to predicting functional links in translation, for which a reference set composed of prokaryotic (or bacterial) genomes performed the worst. We also demonstrate that the widely used random model to quantify the statistical significance of profile similarity is incomplete, which could result in an increased number of false positives.

Conclusion: Contrary to previous proposals, it is not merely the number of genomes but a careful selection of informative genomes in the reference set that influences the prediction accuracy of the PPC approach. We note that the predictive power of the PPC approach, especially in eukaryotes, is heavily influenced by the primary endosymbiosis and subsequent bacterial contributions. The over-representation of parasitic unicellular eukaryotes and vertebrates additionally makes eukaryotes less useful in the reference sets. Reference sets composed of a highly non-redundant set of genomes from all three super-kingdoms fare better with pathways showing considerable vertical inheritance and strong conservation (e.g. translation apparatus), while reference sets solely composed of prokaryotic genomes fare better for more variable pathways like carbohydrate metabolism. Differential performance of the PPC approach on various pathways, and a weak positive correlation between functional and profile similarities, suggest that caution should be exercised while interpreting functional linkages inferred from genome-wide large-scale profile comparisons using a single reference set.

Figure 6: Sensitivity versus specificity plots for protein pairs in various yeast pathways based on the 2nd level of KEGG orthology. Performances were measured at different mutual information thresholds. (a) Carbohydrate metabolism. (b) Energy metabolism. (c) Lipid metabolism. (d) Nucleotide metabolism. (e) Amino acid metabolism. (f) Metabolism of cofactors and vitamins. (g) Translation. (h) Folding, sorting, and degradation.

Mentions: A KEGG pathway or map is an ensemble of many smaller pathways that are typically centered on a particular metabolite (e.g. DNA or RNA) or a distinct class of related molecules (e.g. carbohydrates or amino acids). It is well known that pathways differ vastly in the conservation patterns of their components. Most genome-wide large-scale functional linkage predictions using PPCs have largely ignored this intrinsic diversity in the behavior of individual biological functional systems. To evaluate the role of this diversity in conservation across different functional systems, and its effects on the accuracy of functional linkages predicted from PPCs, we considered a set of nine such systems as defined in the second level of KEGG orthology [91] (Table 2; seven are seen in both test species, while one each is found only in E. coli and yeast), each with 80 or more protein components. We then repeated the same analysis (as done for the complete protein set) for proteins in each of these KEGG pathways using seven of the 16 reference sets (Figures 5 and 6). For our analysis on individual pathways, we considered a pair of proteins to be a positive if they co-occur in the pathway under consideration. Otherwise, we considered the pair to be a negative (with respect to the pathway under consideration), even though the two proteins may co-occur in some other pathway. This is different from our overall analysis, where a pair of proteins is a positive if they co-occur in any pathway and a negative if they do not co-occur at all.
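The pathway-specific labelling described above, together with the threshold sweep behind a sensitivity versus specificity plot, can be sketched as follows. This is a minimal illustration rather than the authors' code. The function names, the toy proteins `A` to `D`, the example scores, and the use of the standard definitions sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP) are all assumptions; the paper's exact definitions may differ.

```python
from itertools import combinations


def label_pairs(proteins, pathway_members):
    """Label every protein pair with respect to one pathway.

    A pair is positive (True) only if both proteins belong to the
    pathway under consideration; any other pair is negative, even if
    the two proteins share some other pathway.
    """
    return {
        (p, q): (p in pathway_members and q in pathway_members)
        for p, q in combinations(sorted(proteins), 2)
    }


def sensitivity_specificity(scores, labels, threshold):
    """Predict a link when score >= threshold; return (sens, spec)."""
    tp = fp = tn = fn = 0
    for pair, is_positive in labels.items():
        predicted = scores[pair] >= threshold
        if predicted and is_positive:
            tp += 1
        elif predicted:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical data: four proteins, three of them in the pathway
proteins = {"A", "B", "C", "D"}
pathway = {"A", "B", "C"}
labels = label_pairs(proteins, pathway)

# Hypothetical profile-similarity scores (e.g. mutual information)
scores = {
    ("A", "B"): 0.9, ("A", "C"): 0.6, ("A", "D"): 0.7,
    ("B", "C"): 0.2, ("B", "D"): 0.1, ("C", "D"): 0.3,
}

# One (sensitivity, specificity) point per threshold
for t in (0.5, 0.8):
    print(t, sensitivity_specificity(scores, labels, t))
```

Sweeping the threshold over a finer grid yields one curve per pathway and reference set, which is the kind of comparison the figure panels summarize.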

