Characterization of the Gut Microbiome Using 16S or Shotgun Metagenomics.

Jovel J, Patterson J, Wang W, Hotte N, O'Keefe S, Mitchel T, Perry T, Kao D, Mason AL, Madsen KL, Wong GK - Front Microbiol (2016)

Bottom Line: The two main approaches for analyzing the microbiome, 16S ribosomal RNA (rRNA) gene amplicons and shotgun metagenomics, are illustrated with analyses of libraries designed to highlight their strengths and weaknesses. To the extent that fluctuations in the diversity of gut bacterial populations correlate with health and disease, we emphasize various techniques for the analysis of bacterial communities within samples (α-diversity) and between samples (β-diversity). Finally, we demonstrate techniques to infer the metabolic capabilities of a bacterial community from these 16S and shotgun data.


Affiliation: Department of Medicine, University of Alberta, Edmonton, AB, Canada.

ABSTRACT
The advent of next-generation sequencing (NGS) has enabled investigations of the gut microbiome with unprecedented resolution and throughput. This has stimulated the development of sophisticated bioinformatics tools to analyze the massive amounts of data generated. Researchers therefore need a clear understanding of the key concepts required for the design, execution and interpretation of NGS experiments on microbiomes. We conducted a literature review and used our own data to determine which approaches work best. The two main approaches for analyzing the microbiome, 16S ribosomal RNA (rRNA) gene amplicons and shotgun metagenomics, are illustrated with analyses of libraries designed to highlight their strengths and weaknesses. Several methods for taxonomic classification of bacterial sequences are discussed. We present simulations to assess the number of sequences that are required to perform reliable appraisals of bacterial community structure. To the extent that fluctuations in the diversity of gut bacterial populations correlate with health and disease, we emphasize various techniques for the analysis of bacterial communities within samples (α-diversity) and between samples (β-diversity). Finally, we demonstrate techniques to infer the metabolic capabilities of a bacterial community from these 16S and shotgun data.



Figure 2: Precision of taxonomy assignments is affected by highly similar sequences in different taxa. (A) For the 16S libraries described in Figure 1, sequences were clustered into operational taxonomic units (OTUs) using a 97% similarity threshold and taxonomy assignments were performed with the RDP classifier. Sequences from OTUs classified as Bifidobacterium (n = 3), Agrobacterium (n = 3), Streptococcus (n = 3), Lactobacillus (n = 3), Bacteroides (n = 3), Peptostreptococcaceae (n = 4), or Enterobacteriaceae (n = 9) were randomly extracted and aligned to the Greengenes database to extract the closest relative (best hit). In addition, we included Greengenes 16S rRNA gene sequences (in green) from Clostridium difficile and C. botulinum as references for Peptostreptococcaceae, and from Citrobacter freundii and Enterobacter cloacae as references for Enterobacteriaceae. The V4 region of the 16S rRNA gene was cropped from the Greengenes sequences to construct a phylogenetic tree with MEGA-6, using UPGMA hierarchical clustering and 10,000 bootstraps. (B) Sequences from our bacterial populations in Figure 1 were aligned against the NCBI nt and Human Microbiome Project (HMP) databases to identify the most similar reference genome. For each bacterium, a simulated library was created by segmenting the reference genome sequence into 500 nt stretches (250 nt paired ends in a head-to-tail orientation), iterating the process to generate ~1.5 million sequences. This simulated library was aligned back to the reference genome and the taxonomy resolved with MEGAN5. As examples, we show the read classification of Bifidobacterium breve, Bacteroides thetaiotaomicron, and Escherichia coli, which accumulated a large proportion of reads that could be resolved at the species, genus or family levels, respectively. Color-matched bars on the right show the proportion of reads accumulated at each level for these particular examples. S, species; G, genus; F, family; O, order; C, class; P, phylum.
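The simulated library described in panel (B) can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `simulate_paired_reads`, the uniform random choice of fragment start positions, and the reading of "head-to-tail" as two consecutive 250 nt reads on the same strand are all assumptions made here.

```python
import random


def simulate_paired_reads(genome, n_pairs, fragment_len=500, read_len=250, seed=0):
    """Cut random fragments from a genome and split each into two reads.

    Assumption: fragments start at uniformly random positions, and
    "head-to-tail" is read as end1 followed directly by end2 on the
    same strand (no reverse complement is taken).
    """
    if fragment_len < 2 * read_len:
        raise ValueError("fragment must hold two non-overlapping reads")
    if len(genome) < fragment_len:
        raise ValueError("genome is shorter than one fragment")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pairs = []
    for _ in range(n_pairs):
        start = rng.randrange(len(genome) - fragment_len + 1)
        fragment = genome[start:start + fragment_len]
        end1 = fragment[:read_len]   # first 250 nt of the stretch
        end2 = fragment[-read_len:]  # last 250 nt of the stretch
        pairs.append((end1, end2))
    return pairs


if __name__ == "__main__":
    toy_genome = "ACGT" * 500  # hypothetical 2,000 nt stand-in for a real genome
    library = simulate_paired_reads(toy_genome, n_pairs=10)
    print(len(library), len(library[0][0]), len(library[0][1]))
```

With a real reference genome, setting `n_pairs` to about 1.5 million would approximate the library size quoted in the caption; the alignment and MEGAN5 steps are outside the scope of this sketch.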

We conducted taxonomy assignments using end1, end2, or both paired ends. With Illumina chemistry, end1 typically exhibits higher quality than end2 (Supplemental Figure 1); accordingly, end1 provided a somewhat more accurate classification than end2 or paired ends (Supplemental Figure 2). However, the V4 variable region of the 16S rRNA gene is relatively short and in most cases will be covered by either of the two ends (250 nt in this case); as such, these results may only reflect the higher quality of end1. For single ends, the best results were obtained when sequences were clustered with the pick_open_otus.py script from QIIME (Supplemental Figure 2). Chimeric sequences can be generated artifactually when PCR amplification of the 16S region of interest is incomplete and the resulting partial sequences serve as primers that recombine with heterologous molecules containing a similar 3′ moiety. Several bioinformatics approaches have been developed for the detection and removal of chimeric sequences. We used the USEARCH tool (Edgar, 2010) to remove chimeras. However, a preferable approach is to prevent the formation of such chimeras in vitro, using high-fidelity amplification protocols such as LEA-Seq (Faith et al., 2013). Sequences were initially clustered into phylotypes using the Greengenes database of 16S rRNA sequences (DeSantis et al., 2006) as reference, while more dissimilar sequences were clustered de novo into OTUs. Taxonomy was then assigned using the RDP classifier, using the UCLUST method (Wang et al., 2007). RTAX (Soergel et al., 2012), a method embedded in QIIME, and UPARSE (Edgar, 2013) are algorithms specifically designed to take advantage of mate-pair information. For paired-end analysis, the UPARSE pipeline (Edgar, 2013) produced more satisfactory results than the RTAX method (Soergel et al., 2012; Supplemental Figure 2).
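The two-stage clustering described above (reference matching first, de novo clustering for the remainder) can be illustrated with a toy sketch. It is not the QIIME, UCLUST or UPARSE implementation: the function names are hypothetical, sequences are assumed to be pre-aligned and of equal length, and identity is a simple per-position match fraction rather than an alignment score. The 0.97 threshold corresponds to the 97% similarity cut-off used for the OTUs in Figure 2.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def open_reference_otus(reads, references, threshold=0.97):
    """Assign reads to reference OTUs where possible, else cluster de novo.

    `references` maps a reference name to its sequence. Reads that miss
    every reference are clustered greedily: each read joins the first
    de novo centroid it matches, or becomes a new centroid itself.
    """
    otus = {name: [] for name in references}
    centroids = {}  # de novo OTU name -> centroid sequence

    for read in reads:
        # Stage 1: best hit among the reference sequences.
        best_name, best_id = None, 0.0
        for name, ref in references.items():
            ident = identity(read, ref)
            if ident > best_id:
                best_name, best_id = name, ident
        if best_name is not None and best_id >= threshold:
            otus[best_name].append(read)
            continue

        # Stage 2: greedy de novo clustering of the unmatched read.
        for name, centroid in centroids.items():
            if identity(read, centroid) >= threshold:
                otus[name].append(read)
                break
        else:
            name = f"denovo{len(centroids)}"
            centroids[name] = read
            otus[name] = [read]
    return otus
```

Greedy centroid clustering depends on input order, which is one reason real pipelines usually sort reads (for example by abundance) before clustering; that step is omitted here for brevity.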
Irrespective of the method used for clustering, we found a consistent over-representation of sequences in the Clostridium and Lactobacillus genera. These two genera contain sequences that are perfectly complementary to the primers used for amplification, whereas at least one mismatch is found in the remaining genera included in our experimental (mock) bacterial population. This demonstrates how subtle differences in primer binding sites within the 16S rRNA gene sequences lead to biased estimates of relative abundance. Other primers have also been reported to introduce biases; for instance, the primer pair 27F/338R results in underrepresentation of Bifidobacterium (Martínez et al., 2009; Kuczynski et al., 2010). In our study, the detection of some Clostridium, Escherichia and Salmonella sequences was only possible after computationally extracting representative sequences of OTUs and searching them with BLAST against both the nr/nt and the 16S ribosomal RNA databases from NCBI. In general, sequences in the Enterobacteriaceae family and the Clostridiales order were poorly resolved using the 16S V4 or V3-V4 regions (Figure 2A), and this seems to be the case with Enterobacteriaceae for other 16S variable regions as well (Chakravorty et al., 2007).
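The primer-mismatch argument above can be made concrete by counting mismatches between a primer and a candidate binding site. The sketch below is illustrative only: the primer and site sequences are made-up toy strings, not the primers used in the study, and the site is assumed to be given in the same orientation as the primer. Degenerate positions are expanded with the standard IUPAC nucleotide codes.

```python
# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_mismatches(primer, site):
    """Count positions where the site base is not allowed by the primer code."""
    if len(primer) != len(site):
        raise ValueError("primer and binding site must have the same length")
    return sum(base not in IUPAC[code] for code, base in zip(primer, site))


if __name__ == "__main__":
    toy_primer = "ACGTMGGR"  # hypothetical degenerate primer
    toy_sites = {            # hypothetical binding sites per genus
        "genus_1": "ACGTAGGA",
        "genus_2": "ACGTTGGA",
    }
    for genus, site in toy_sites.items():
        print(genus, primer_mismatches(toy_primer, site))
```

A count of zero corresponds to the perfectly matching case described for Clostridium and Lactobacillus, while any positive count marks a template that may amplify less efficiently.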

