An integrative genomic approach to uncover molecular mechanisms of prokaryotic traits.

Liu Y, Li J, Sam L, Goh CS, Gerstein M, Lussier YA - PLoS Comput. Biol. (2006)

Bottom Line: We identified 3,711 significant correlations between 1,499 distinct Pfam families and 63 phenotypes, comprising 2,650 correlations and 1,061 anti-correlations. We stratified the most significant 478 predictions and subjected 100 to manual evaluation, of which 60 were corroborated in the literature. We propose that this method elucidates which classes of molecular mechanisms are associated with phenotypes or metaphenotypes and holds promise in facilitating a computable systems biology approach to genomic and biomedical research.


Affiliation: Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America.

ABSTRACT
With the mounting availability of genomic and phenotypic databases, data integration and mining become increasingly challenging. While efforts have been made to analyze prokaryotic phenotypes, current computational technologies either lack the high-throughput capacity needed for genome-scale analysis or are limited in their ability to integrate and mine data across different scales of biology. Consequently, simultaneous analysis of associations among genomes, phenotypes, and gene functions is precluded. Here, we developed a high-throughput computational approach and demonstrated for the first time the feasibility of integrating large quantities of prokaryotic phenotypes with genomic datasets for mining across multiple scales of biology (protein domains, pathways, molecular functions, and cellular processes). Applying this method to 59 fully sequenced prokaryotic species, we identified the genetic basis and molecular mechanisms underlying the phenotypes in bacteria. We identified 3,711 significant correlations between 1,499 distinct Pfam families and 63 phenotypes, comprising 2,650 correlations and 1,061 anti-correlations. Manual evaluation of a random sample of these significant correlations showed a minimal precision of 30% (95% confidence interval: 20%-42%; n = 50). We stratified the most significant 478 predictions and subjected 100 to manual evaluation, of which 60 were corroborated in the literature. We furthermore unveiled 10 significant correlations between phenotypes and KEGG pathways, eight of which were corroborated in the evaluation, and 309 significant correlations between phenotypes and 166 GO concepts, evaluated using a random sample (minimal precision = 72%; 95% confidence interval: 60%-80%; n = 50). Additionally, we conducted a novel large-scale phenomic visualization analysis to provide insight into the modular nature of common molecular mechanisms spanning multiple biological scales and reused by related phenotypes (metaphenotypes). We propose that this method elucidates which classes of molecular mechanisms are associated with phenotypes or metaphenotypes and holds promise in facilitating a computable systems biology approach to genomic and biomedical research.

Figure 3 (pcbi-0020159-g003): False Positive Error Rates Predicted from Random Datasets According to the Uncorrected Hypergeometric Distribution

The false positive error rate represents the ratio of the number of significant correlations from the randomized dataset (control experiment) to the number of significant correlations from the real dataset below a certain p-value. At different p-value cutoffs, we calculated the error rates from a sample of 1,000 random permutations of the relationship vectors within the dataset (permutation resampling method), and the cutoffs for the highest 1% of occurrences for each uncorrected p-value of the hypergeometric distribution (data presented). For uncorrected p-values of 0.002 or less, the correlations between phenotypes and Pfam families are predicted to have an error rate of approximately 5%. This cutoff is applied in this study to identify significant correlations.
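
The ratio defined in this caption can be sketched in a few lines of Python. This is an illustrative reading, not the authors' code: the function names, the toy p-values, and the choice to summarize the permuted runs by their 99th-percentile count (one interpretation of "the highest 1% of occurrences") are all assumptions.

```python
import math


def count_significant(pvalues, cutoff):
    """Number of p-values at or below the cutoff."""
    return sum(1 for p in pvalues if p <= cutoff)


def false_positive_rate(real_pvalues, permuted_runs, cutoff, percentile=99):
    """Ratio of significant calls in the permuted control to those in the real data.

    permuted_runs is a list of p-value lists, one per random permutation.
    The control is summarized by the nearest-rank `percentile` of the
    per-run counts (an assumption; the paper may summarize differently).
    """
    real_hits = count_significant(real_pvalues, cutoff)
    if real_hits == 0:
        return float("nan")  # ratio is undefined when nothing is significant
    null_hits = sorted(count_significant(run, cutoff) for run in permuted_runs)
    rank = max(math.ceil(percentile * len(null_hits) / 100), 1)
    return null_hits[rank - 1] / real_hits


# Toy data only: 20 of 100 real p-values pass the 0.002 cutoff.
toy_real = [0.0001] * 20 + [0.5] * 80
# 100 permuted runs: 99 with a single chance hit, one with five.
toy_permuted = [[0.001, 0.3, 0.7]] * 99 + [[0.001] * 5]

rate = false_positive_rate(toy_real, toy_permuted, cutoff=0.002)
print(rate)  # 0.05 with this toy data
```

Scanning the same function over several cutoffs (for example 0.0001 to 0.05) would give a curve of the kind the figure describes.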

Mentions: To overcome these limitations, we applied an additional method based on statistical simulation, which can stratify predicted correlations as described in detail in Materials and Methods. With this method, we conducted a permutation resampling in which we compared the number of significant correlations inferred from the original data with the number inferred from an experimental control consisting of the distributions of random permutations of the data, at different statistical cutoffs for the hypergeometric distribution. Since this method predicts significant correlations in comparison with randomized samples, its results have less stringent cutoffs and cover more phenotypes. Figure 3 summarizes the results from the control experiment over random data, using uncorrected p-value cutoffs ranging from 0.0001 to 0.05 from the uncorrected hypergeometric test (details in Materials and Methods). As shown in Figure 3, we can expect about 5% of the predictions to be false positives if the uncorrected p-value of the hypergeometric distribution is equal to or less than 0.002. In our unadjusted dataset, using an uncorrected p-value of 0.002 or less, we identified 3,711 significant correlations, of which we expect about 5% to be false positive predictions (data shown in the file Phenotype_Sim_Pfam_mapping.xls at http://phenos.bsd.uchicago.edu/prok_phenotype). By carefully choosing complementary statistical methods for conducting data-mining calculations, we can provide researchers with accurate stratified information on the predictions without filtering out potentially meaningful correlations.
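
A minimal sketch of the kind of calculation this paragraph describes, assuming each Pfam family and each phenotype is stored as a 0/1 presence vector over the genomes. All names are hypothetical, only the upper tail (correlation, not anti-correlation) is computed, and the exact test set-up in the paper's Materials and Methods may differ.

```python
import random
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n of N genomes are drawn and K of the N carry the phenotype."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def pair_pvalue(has_pfam, has_phenotype):
    """Uncorrected p-value for co-occurrence of one Pfam family and one phenotype."""
    N = len(has_pfam)                      # genomes in the dataset
    K = sum(has_phenotype)                 # genomes showing the phenotype
    n = sum(has_pfam)                      # genomes containing the Pfam family
    k = sum(1 for a, b in zip(has_pfam, has_phenotype) if a and b)  # overlap
    return hypergeom_upper_tail(k, N, K, n)


def permuted_pvalues(has_pfam, has_phenotype, n_perm=1000, seed=0):
    """P-values for the same pair after randomly shuffling the phenotype vector."""
    rng = random.Random(seed)
    shuffled = list(has_phenotype)
    results = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        results.append(pair_pvalue(has_pfam, shuffled))
    return results


# Toy vectors over 59 genomes (the number of species in the study).
toy_pfam = [1] * 10 + [0] * 49
toy_phenotype = [1] * 10 + [0] * 49       # perfectly co-occurring with the family

p_real = pair_pvalue(toy_pfam, toy_phenotype)
p_null = permuted_pvalues(toy_pfam, toy_phenotype, n_perm=20, seed=1)
print(p_real <= 0.002, sum(p <= 0.002 for p in p_null))
```

Comparing how many real pairs and how many permuted pairs fall at or below a cutoff such as 0.002 is what yields the expected false positive fraction discussed above.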

