PyroClean: denoising pyrosequences from protein-coding amplicons for the recovery of interspecific and intraspecific genetic variation.

Ramirez-Gonzalez R, Yu DW, Bruce C, Heavens D, Caccamo M, Emerson BC - PLoS ONE (2013)

Bottom Line: Filtering of remaining non-target sequences attributed to PCR error, sequencing error, or numts further reduced unique sequence volume to 0.8% of the original raw reads. PyroClean reduces or eliminates the need for an additional, time-consuming step to cluster reads into Operational Taxonomic Units, which facilitates the detection of intraspecific DNA sequence variation. Comparison of our sequence data to public databases reveals that we are able to successfully recover both interspecific and intraspecific sequence diversity.


Affiliation: The Genome Analysis Centre, Norwich Research Park, Norwich, United Kingdom.

ABSTRACT
High-throughput parallel sequencing is a powerful tool for the quantification of microbial diversity through the amplification of nuclear ribosomal gene regions. Recent work has extended this approach to the quantification of diversity within otherwise difficult-to-study metazoan groups. However, nuclear ribosomal genes present both analytical challenges and practical limitations that are a consequence of their mutational properties. Here we exploit useful properties of protein-coding genes for cross-species amplification and denoising of 454 flowgrams. We first use experimental mixtures of species from the class Collembola to amplify and pyrosequence the 5' region of the COI barcode, and we implement a new algorithm called PyroClean for the denoising of Roche GS FLX pyrosequences. Using parameter values from the analysis of experimental mixtures, we then analyse two communities sampled from field sites on the island of Tenerife. Cross-species amplification success of target mitochondrial sequences in experimental species mixtures is high; however, there is little relationship between template DNA concentrations and pyrosequencing read abundance. Homopolymer error correction and filtering against a consensus reference sequence reduced the volume of unique sequences to approximately 5% of the original unique raw reads. Filtering of remaining non-target sequences attributed to PCR error, sequencing error, or numts further reduced unique sequence volume to 0.8% of the original raw reads. PyroClean reduces or eliminates the need for an additional, time-consuming step to cluster reads into Operational Taxonomic Units, which facilitates the detection of intraspecific DNA sequence variation. PyroCleaned sequence data from field sites in Tenerife demonstrate the utility of our approach for quantifying evolutionary diversity and its spatial structure. Comparison of our sequence data to public databases reveals that we are able to successfully recover both interspecific and intraspecific sequence diversity.


pone-0057615-g001: Heat map of the percentage representation of the 27 Collembola genomes within each of the 5 genomic pools, and within each of the three MID tag pools derived from each of the 5 genomic pools. For visual clarity, percentage representation of 2% or more is presented as maximum representation. See Supplementary Table 1 for the actual percentages of each sequence found, and species names.
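The display rule in the caption (2% or more shown as maximum representation) can be illustrated with a minimal sketch. This is not the authors' plotting code; the function name `cap_for_display` and the choice to scale values to a 0 to 1 intensity are assumptions made here for illustration only.

```python
def cap_for_display(percentages, cap=2.0):
    """Map percentage representation to a 0..1 colour intensity.

    Hypothetical helper: any value at or above `cap` (2% in the caption)
    is drawn at full intensity; smaller values scale linearly.
    """
    return [min(p, cap) / cap for p in percentages]


# Example percentages are made up for illustration, not taken from Table S1.
intensities = cap_for_display([0.0, 1.0, 2.0, 35.5])
```

Capping in this way keeps low-frequency sequences visible in the heat map, at the cost of hiding differences among sequences above the cap, which is why the caption points to the supplementary table for actual values.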

Mentions: After homopolymer read error correction, we excluded singleton sequences and then filtered the remaining sequences against the consensus reference sequence (step 4), allowing 1 mismatch and trimming the alignment to 220 bases after observing that homopolymer error correction was not successful beyond this point. After filtering out sequences that diverged from the consensus reference sequence by more than 1 nucleotide, unique sequence reads represented approximately 1.5% of unique raw reads (Table 2). Computational processing time for individual MID tag pools ranged from 13 to 27 minutes, with longer processing times for more complex genome pools. All expected target sequences were recovered for each of the 5 pools, with only a few instances of an expected target sequence not being represented in all three PCR replicates within a pool (Fig. 1, Table S1). For PCR pool 4 (MID tags 10–12), three unexpected target sequences were recovered at low frequency (Table S1), which we attribute to cross-contamination during preparation of the genomic pools. There is little relationship between the proportional representation of genomic templates and the frequency of the target sequence within an MID tag pool (Fig. 2).
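The filtering step described above (singleton exclusion, trimming to 220 bases, and allowing 1 mismatch to the consensus reference sequence) might be sketched as follows. This is a simplified illustration, not the PyroClean implementation. It assumes that reads are already homopolymer-corrected and aligned to the reference so that positions correspond, that reads shorter than the trim length are discarded, and that an `N` in the reference matches any base. The function name and parameters are hypothetical.

```python
from collections import Counter


def filter_against_consensus(reads, consensus, trim_len=220, max_mismatches=1):
    """Return {trimmed_sequence: read_count} for reads that pass the filter.

    Assumptions (not taken from the paper): reads are pre-aligned strings,
    reads shorter than `trim_len` are dropped, and 'N' in the consensus
    is treated as a wildcard.
    """
    counts = Counter(reads)          # collapse identical reads
    ref = consensus[:trim_len]
    kept = Counter()
    for seq, n in counts.items():
        if n < 2:                    # exclude singleton sequences
            continue
        if len(seq) < trim_len:      # too short to compare over the window
            continue
        trimmed = seq[:trim_len]
        # Count positions that differ from the reference (simple Hamming distance)
        mismatches = sum(1 for a, b in zip(trimmed, ref) if b != "N" and a != b)
        if mismatches <= max_mismatches:
            kept[trimmed] += n       # reads identical after trimming are merged
    return dict(kept)
```

Calling `filter_against_consensus(reads, consensus)` with the default arguments applies the 220-base window and 1-mismatch threshold quoted in the text; both are exposed as parameters because the text reports them as values chosen after inspecting the data.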

