An integrative genomic approach to uncover molecular mechanisms of prokaryotic traits.

Liu Y, Li J, Sam L, Goh CS, Gerstein M, Lussier YA - PLoS Comput. Biol. (2006)

Bottom Line: We identified 3,711 significant correlations between 1,499 distinct Pfam and 63 phenotypes, with 2,650 correlations and 1,061 anti-correlations. We stratified the most significant 478 predictions and subjected 100 to manual evaluation, of which 60 were corroborated in the literature. We propose that this method elucidates which classes of molecular mechanisms are associated with phenotypes or metaphenotypes and holds promise in facilitating a computable systems biology approach to genomic and biomedical research.


Affiliation: Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America.

ABSTRACT
With the mounting availability of genomic and phenotypic databases, data integration and mining become increasingly challenging. While efforts have been put forward to analyze prokaryotic phenotypes, current computational technologies either lack the high-throughput capacity for genomic-scale analysis or are limited in their capability to integrate and mine data across different scales of biology. Consequently, simultaneous analysis of associations among genomes, phenotypes, and gene functions is precluded. Here, we developed a high-throughput computational approach and demonstrated for the first time the feasibility of integrating large quantities of prokaryotic phenotypes along with genomic datasets for mining across multiple scales of biology (protein domains, pathways, molecular functions, and cellular processes). Applying this method to 59 fully sequenced prokaryotic species, we identified the genetic basis and molecular mechanisms underlying the phenotypes in bacteria. We identified 3,711 significant correlations between 1,499 distinct Pfam families and 63 phenotypes, with 2,650 positive correlations and 1,061 anti-correlations. Manual evaluation of a random sample of these significant correlations showed a minimal precision of 30% (95% confidence interval: 20%-42%; n = 50). We stratified the most significant 478 predictions and subjected 100 to manual evaluation, of which 60 were corroborated in the literature. We furthermore unveiled 10 significant correlations between phenotypes and KEGG pathways, eight of which were corroborated in the evaluation, and 309 significant correlations between phenotypes and 166 GO concepts, evaluated using a random sample (minimal precision = 72%; 95% confidence interval: 60%-80%; n = 50). Additionally, we conducted a novel large-scale phenomic visualization analysis to provide insight into the modular nature of common molecular mechanisms spanning multiple biological scales and reused by related phenotypes (metaphenotypes). We propose that this method elucidates which classes of molecular mechanisms are associated with phenotypes or metaphenotypes and holds promise in facilitating a computable systems biology approach to genomic and biomedical research.
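
To make the distinction between a correlation and an anti-correlation concrete, the following is a minimal sketch that scores one protein-domain presence profile against one phenotype profile across a set of species. It uses the phi coefficient (Pearson correlation on binary data) purely as an illustration; the abstract does not state which statistic or significance test the authors used, so the function name `phi_coefficient` and the toy profiles are assumptions, not the published method or data.

```python
from math import sqrt


def phi_coefficient(xs, ys):
    """Phi (Pearson) correlation between two equal-length 0/1 profiles.

    Illustrative only: the statistic actually used in the paper is not
    stated in the abstract.
    """
    if len(xs) != len(ys):
        raise ValueError("profiles must cover the same species")
    # Build the 2x2 contingency table of presence/absence counts.
    n11 = sum(1 for x, y in zip(xs, ys) if x and y)
    n10 = sum(1 for x, y in zip(xs, ys) if x and not y)
    n01 = sum(1 for x, y in zip(xs, ys) if not x and y)
    n00 = sum(1 for x, y in zip(xs, ys) if not x and not y)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # a constant profile carries no correlation signal
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical toy profiles, one entry per species (1 = present / positive).
domain_present = [1, 1, 1, 0, 0, 0]
phenotype_same = [1, 1, 1, 0, 0, 0]      # tracks the domain exactly
phenotype_opposite = [0, 0, 0, 1, 1, 1]  # mirrors the domain

print(phi_coefficient(domain_present, phenotype_same))      # correlation
print(phi_coefficient(domain_present, phenotype_opposite))  # anti-correlation
```

A positive score means the domain tends to be present in species that show the phenotype; a negative score means it tends to be absent from them. Deciding which scores count as significant would require a separate test and multiple-testing correction, neither of which is sketched here.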

List of Bacterial Species with Full Genome Sequences Used in This Study. Bacteria with full genome sequences and laboratory tests used in this study are listed with their phylogenetic tree drawn according to the NCBI taxonomy (the NCBI Taxonomy database is widely used for taxonomy; however, it is not an authoritative source for nomenclature or classification, and an alternate dendrogram could have been constructed using the Species2000 Bacteriology Insight Orienting System, http://www-sp2000ao.nies.go.jp/english/bios/). Many of the taxa of species in the laboratory tests are parents of the strains being fully sequenced. The numbers of Pfam families for all species are also shown.

Mentions: Currently, the availability of more than 208 microbial genome sequences in GenBank provides a rich source of information about the genetic contents of various microorganisms [23]. In addition, functional classification databases, such as COGs [18,19] and Pfam [17], enable us to compare conservation and divergence of functional genes across microorganisms. However, little has been done in the past to correlate the genomic data with phenotypic information. In this study, to uncover the underlying linkages between microorganism phenotypes and their genetic contents, we integrated and analyzed datasets of a microbiological phenotypic database (the MKD) and a genomic protein domains dataset (Pfam). By design, we limited the analysis to complete genome sequences to avoid a selection bias toward genes coming from partial genomes that were preferentially sequenced. Methods to deal with organisms with partial genomic sequences will be explored in a future study. As a result, we selected all 59 species of microorganisms that exist in both the MKD and Pfam databases with fully deduced genome sequences (Figure 1). These species belong to six phyla, including 20 Firmicutes, 17 Proteobacteria, six Actinobacteria, four Spirochaetes, four Bacteroidetes, and one Chlamydiae, representing about 30% of the bacterial species that had been fully sequenced at the time of this study. Detailed information about their taxonomy in comparison with the fully sequenced bacteria at the time of this study is provided in Table 1. Out of the 208 fully sequenced bacteria available at the time of this study, the 59 species used in this study cover approximately 30% to 40% of available fully sequenced bacterial genomes at different taxonomic levels.
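
The selection step described above amounts to intersecting the species present in the phenotype database with those that have complete genomes, then reporting what fraction of all sequenced genomes that intersection covers. The sketch below illustrates this under stated assumptions: the helper names `select_shared_species` and `coverage` and the placeholder species labels are hypothetical, and only the counts 59 and 208 come from the text.

```python
def select_shared_species(phenotype_db, genome_db):
    """Return the species found in both collections, sorted for stable output."""
    return sorted(set(phenotype_db) & set(genome_db))


def coverage(selected, total):
    """Fraction of the available genomes that the selected species represent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return selected / total


# Hypothetical placeholder labels, not real database contents.
phenotype_species = ["species_a", "species_b", "species_c"]
complete_genomes = ["species_b", "species_c", "species_d"]
print(select_shared_species(phenotype_species, complete_genomes))

# Counts reported in the text: 59 selected out of 208 fully sequenced bacteria.
print(f"{coverage(59, 208):.1%}")
```

The ratio 59/208 is roughly 28%, consistent with the "about 30%" quoted in the text; the 30% to 40% range refers to coverage at different taxonomic levels, which this sketch does not model.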

