Improved OTU-picking using long-read 16S rRNA gene amplicon sequencing and generic hierarchical clustering.

Franzén O, Hu J, Bao X, Itzkowitz SH, Peter I, Bashir A - Microbiome (2015)

Bottom Line: However, clustering of short 16S rRNA gene reads into biologically meaningful OTUs is challenging, in part because nucleotide variation along the 16S rRNA gene is only partially captured by short reads. The recent emergence of long-read platforms, such as single-molecule real-time (SMRT) sequencing from Pacific Biosciences, offers the potential for improved taxonomic and phylogenetic profiling. In simulated data, long-read sequencing was shown to improve OTU quality and decrease variance.


Affiliation: Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA. p.oscar.franzen@gmail.com.

ABSTRACT

Background: High-throughput bacterial 16S rRNA gene sequencing followed by clustering of short sequences into operational taxonomic units (OTUs) is widely used for microbiome profiling. However, clustering of short 16S rRNA gene reads into biologically meaningful OTUs is challenging, in part because nucleotide variation along the 16S rRNA gene is only partially captured by short reads. The recent emergence of long-read platforms, such as single-molecule real-time (SMRT) sequencing from Pacific Biosciences, offers the potential for improved taxonomic and phylogenetic profiling. Here, we evaluate the performance of long- and short-read 16S rRNA gene sequencing using simulated and experimental data, followed by OTU inference using computational pipelines based on heuristic and complete-linkage hierarchical clustering.

Results: In simulated data, long-read sequencing was shown to improve OTU quality and decrease variance. We then profiled 40 human gut microbiome samples using a combination of Illumina MiSeq and Blautia-specific SMRT sequencing, further supporting the notion that long reads can identify additional OTUs. We implemented a complete-linkage hierarchical clustering strategy using a flexible computational pipeline, tailored specifically for PacBio circular consensus sequencing (CCS) data that outperforms heuristic methods in most settings: https://github.com/oscar-franzen/oclust/ .

Conclusion: Our data demonstrate that long reads can improve OTU inference; however, the choice of clustering algorithm and associated clustering thresholds has significant impact on performance.




Fig. 2: Flowchart of the OTU-picking pipeline. 16S rRNA gene primers specific for the Blautia genus were identified with pprospector, and the primer pair was used to generate 16S rRNA gene amplicons. Enriched PCR samples were targeted for SMRT-seq to obtain long CCS reads. CCS reads were taxonomically classified with the QIIME pipeline, and reads classified to the genus Blautia were kept. Genetic distances were computed using pairwise alignments. Hierarchical clustering of the resulting dissimilarity matrix was performed at sequence dissimilarity thresholds of 1 to 6 % in 1 % increments. The final set of OTUs was filtered by requiring at least 10 supporting reads.
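The clustering and filtering steps in the caption can be illustrated with a minimal Python sketch. It is not the oclust implementation: the function names are hypothetical, the distance is a simple mismatch fraction between pre-aligned, equal-length toy reads (standing in for the alignment-derived distances of the pipeline), and the merge loop is a naive complete-linkage procedure that is only practical for small inputs.

```python
from itertools import combinations


def pairwise_distance(a, b):
    """Fraction of mismatching positions between two pre-aligned sequences.

    Assumption: a stand-in for the alignment-based genetic distance.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def complete_linkage(seqs, threshold):
    """Agglomerative clustering; returns clusters as lists of read indices.

    Two clusters are merged only while the LARGEST distance between any
    pair of their members (complete linkage) is <= threshold.
    """
    n = len(seqs)
    dist = {(i, j): pairwise_distance(seqs[i], seqs[j])
            for i, j in combinations(range(n), 2)}
    clusters = [[i] for i in range(n)]

    def linkage(c1, c2):
        return max(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters under complete linkage.
        d, a, b = min((linkage(clusters[a], clusters[b]), a, b)
                      for a, b in combinations(range(len(clusters)), 2))
        if d > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def filter_otus(clusters, min_reads=10):
    """Keep only OTUs supported by at least min_reads reads."""
    return [c for c in clusters if len(c) >= min_reads]


if __name__ == "__main__":
    # Toy data: two well-supported groups and one 2-read group.
    reads = (["A" * 100] * 11 + ["A" * 99 + "T"]
             + ["C" * 10 + "A" * 90] * 11
             + ["G" * 20 + "A" * 80] * 2)
    for percent in range(1, 7):  # 1 to 6 % in 1 % increments
        otus = filter_otus(complete_linkage(reads, percent / 100))
        print(f"{percent} %: {len(otus)} OTUs with >= 10 reads")
```

Complete linkage is the conservative choice here: every pair of reads inside an OTU is guaranteed to lie within the threshold, which is not true for single or average linkage.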

Mentions: An overview of the experiment and analysis strategy is shown in Fig. 2. Based on 40 full-length 16S rRNA gene sequences of the genus Blautia, we designed a new degenerate primer pair to be used for genus-specific amplification (the primer pair is referred to as 404F/1263R). The primer pair produced an amplicon of ~800 bp and was confirmed to be specific by in silico matching against full-length 16S rRNA gene sequences of other genera. 404F/1263R amplifies across five hypervariable regions (V3 to V7), whereas the universal primers span only two hypervariable regions (V3 and V4). Subsequently, PCR amplification of the 44 samples was carried out using barcoded versions of 404F/1263R. The amplicons were pooled and then sequenced with PacBio SMRT-seq on eight SMRT-cells, yielding in total 417,538 PacBio CCS reads (at ≥3 passes). A negative relationship between the CCS read length and number of passes (Spearman’s rho = −0.25, p < 0.01) was observed. The median number of passes per CCS read was 15, and the bottom and top quartiles spanned 3 to 9 and 24 to 153 passes, respectively. The CCS reads displayed a median length of 416 bp (median absolute deviation (MAD) = 336 bp), which indicated possible contamination or non-specific primer binding during the PCR amplification. A screen against the human genome identified 45 % (188,628/417,538) of the CCS reads as contaminants, and these were removed from further analysis. The remaining CCS reads were demultiplexed and then quality-, chimera-, and size-filtered (only keeping sequences ±100 bp of the expected amplicon size), resulting in 79,339 high-quality CCS reads with a median length of 826 bp (MAD = 7 bp). The number of passes for the high-quality CCS reads ranged from 5 to 39 (median = 16). Each sample contained from 206 to 3391 CCS reads (median = 1594 reads; mean = 1652 reads; SD = 725).
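As a small illustration of the size filter and the length statistics quoted above, the following Python sketch computes a median, an unscaled MAD, and a ±100 bp filter. The function names, the toy read lengths, and the expected amplicon size of 800 bp are assumptions for illustration; the text only gives the amplicon as ~800 bp and does not say whether the reported MAD was scaled.

```python
from statistics import median


def median_absolute_deviation(values):
    """Unscaled MAD: the median of absolute deviations from the median."""
    values = list(values)
    centre = median(values)
    return median(abs(v - centre) for v in values)


def size_filter(lengths, expected, tolerance=100):
    """Keep read lengths within +/- tolerance bp of the expected size."""
    return [n for n in lengths if abs(n - expected) <= tolerance]


if __name__ == "__main__":
    # Toy CCS read lengths: short off-target products plus true amplicons.
    lengths = [120, 150, 416, 810, 820, 826, 830, 835, 1500]
    kept = size_filter(lengths, expected=800)  # 800 bp is an assumed value
    print("before:", median(lengths), median_absolute_deviation(lengths))
    print("after: ", median(kept), median_absolute_deviation(kept))
```

A large MAD relative to the median, as in the unfiltered data, is the kind of signal that prompted the contamination screen; after filtering, the lengths concentrate tightly around the amplicon size.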
The number of CCS reads classified to Blautia ranged from 16 to 1029 per sample (median = 255 reads; mean = 334 reads; SD = 234). To assess data quality, we evaluated the sequence diversity along the 16S rRNA gene. PacBio and MiSeq reads were aligned using a hand-curated 16S rRNA covariance model, and Shannon entropy was computed along the 16S rRNA gene (Fig. 3). Two and seven distinct peaks were seen for MiSeq and PacBio, respectively. Peaks colocalized with the known hypervariable regions (V3 and V4 for MiSeq; V3 to V7 for the PacBio CCS), confirming the richer information content in CCS reads.
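The per-position Shannon entropy described above can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the authors' code: the function names are hypothetical, the input is a toy list of already aligned, equal-length reads, gap characters are ignored, and entropy is reported in bits (the base of the logarithm is not stated in the text).

```python
from collections import Counter
from math import log2


def column_entropy(column, ignore="-."):
    """Shannon entropy (bits) of the residues in one alignment column."""
    counts = Counter(c for c in column.upper() if c not in ignore)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # column contains only gaps
    return -sum((n / total) * log2(n / total) for n in counts.values())


def entropy_profile(alignment):
    """Entropy for every column of a list of equal-length aligned reads."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy("".join(col)) for col in zip(*alignment)]


if __name__ == "__main__":
    toy_alignment = ["ACGT", "ACGA", "ACTA", "ACTT"]
    print(entropy_profile(toy_alignment))
```

Conserved columns score 0 and columns with evenly mixed residues score highest, so runs of high-entropy columns along the alignment correspond to the peaks at hypervariable regions.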

