PyroClean: denoising pyrosequences from protein-coding amplicons for the recovery of interspecific and intraspecific genetic variation.

Ramirez-Gonzalez R, Yu DW, Bruce C, Heavens D, Caccamo M, Emerson BC - PLoS ONE (2013)

Bottom Line: Filtering of remaining non-target sequences attributed to PCR error, sequencing error, or numts further reduced unique sequence volume to 0.8% of the original raw reads. PyroClean reduces or eliminates the need for an additional, time-consuming step to cluster reads into Operational Taxonomic Units, which facilitates the detection of intraspecific DNA sequence variation. Comparison of our sequence data to public databases reveals that we are able to successfully recover both interspecific and intraspecific sequence diversity.


Affiliation: The Genome Analysis Centre, Norwich Research Park, Norwich, United Kingdom.

ABSTRACT
High-throughput parallel sequencing is a powerful tool for the quantification of microbial diversity through the amplification of nuclear ribosomal gene regions. Recent work has extended this approach to the quantification of diversity within otherwise difficult-to-study metazoan groups. However, nuclear ribosomal genes present both analytical challenges and practical limitations that are a consequence of their mutational properties. Here we exploit useful properties of protein-coding genes for cross-species amplification and denoising of 454 flowgrams. We first use experimental mixtures of species from the class Collembola to amplify and pyrosequence the 5' region of the COI barcode, and we implement a new algorithm called PyroClean for the denoising of Roche GS FLX pyrosequences. Using parameter values from the analysis of experimental mixtures, we then analyse two communities sampled from field sites on the island of Tenerife. Cross-species amplification success of target mitochondrial sequences in experimental species mixtures is high; however, there is little relationship between template DNA concentrations and pyrosequencing read abundance. Homopolymer error correction and filtering against a consensus reference sequence reduced the volume of unique sequences to approximately 5% of the original unique raw reads. Filtering of remaining non-target sequences attributed to PCR error, sequencing error, or numts further reduced unique sequence volume to 0.8% of the original raw reads. PyroClean reduces or eliminates the need for an additional, time-consuming step to cluster reads into Operational Taxonomic Units, which facilitates the detection of intraspecific DNA sequence variation. PyroCleaned sequence data from field sites in Tenerife demonstrate the utility of our approach for quantifying evolutionary diversity and its spatial structure. Comparison of our sequence data to public databases reveals that we are able to successfully recover both interspecific and intraspecific sequence diversity.


pone-0057615-g003: Neighbour-joining tree of PyroCleaned sequences derived from two forest sampling sites on the island of Tenerife, and 21 taxonomically identified Sanger sequence samples from Tenerife (in bold). Sequence identifiers represent the MID tag, the name of the sequence, and the frequency representation of the sequence. Sequences were assessed for taxonomic identity against the BOLD Identification Database, and we report closest matches for Collembola, and for non-Collembola if matching is higher. Letters in bold represent sequence matches to Collembola of 99% or higher: A – Tetrodontophora bielanensis (88%), Pelosia muscerda (Lepidoptera, 91%); B – Tetrodontophora bielanensis (89%), Myotis ikonnikova (Chiroptera, 91%); C – Tetrodontophora bielanensis (89%), Myotis ikonnikova (Chiroptera, 91%); D – Verhoeffiella sp. (89%), Stenopsyche (Trichoptera, 92%); E – no significant match, Stenopsyche (Trichoptera, 94%); F – Folsomia (94%), Ophiogomphus (Odonata, 95%); G – Protaphorura (99%); H – Protaphorura (100%); I – Xenylla humicola (90%), Carterocephalus silvicola (Lepidoptera, 91%); J – Bourletiella (89%); K – Lepidocyrtus violaceus (88%), Herona marathus (Lepidoptera, 90%); L – Tullbergia sp. (99%); M – Parisotoma notabilis L3 (92%); N – Folsomina yosii (90%); O – Isotomiella (99%); P – Parisotoma notabilis (100%); Q – Parisotoma notabilis (100%); R – Parisotoma notabilis (100%); S – Xenylla humicola (90%), Finlaya (Culicidae, 91%); T – Verhoeffiella sp. (90%); U – Entomobrya atrocincta (100%); V – Entomobrya atrocincta (99%); W – Entomobrya atrocincta (100%); X – Parisotoma notabilis (94%), Phasus (Lepidoptera, 96%); Y – Isotomurus (92%); Z – Desoria sp. (91%), Bathymunida nebulosa (Decapoda, 95%); AA – Entomobryidae (90%), Polytremis pellucida (Lepidoptera, 93%); AB – Isotoma sp. (91%), Neophylax rickeri (Trichoptera, 92%); AC – Isotomurus (86%), Chordate (90%); AD – Neanura muscorum (100%); AE – Ceratophysella sp. (99%); AF – Ceratophysella sp. (100%).
The five sequences with an asterisk are potential numts or PCR errors that exceed the final filter threshold. The two sequences with a filled circle are left in as examples of probable numts.

Mentions: PyroCleaning resulted in an average of 37 unique sequences within each of the 6 MID tag pools, with a total of 88 unique sequences across the 6 MID tag pools (Table 3). As in the test pool data, none of these sequences was found to be of chimeric origin. After removing uncorrected compensated indels, sequences were subjected to a neighbour-joining analysis using p-distances with MEGA5 [29]. The neighbour-joining tree revealed several clusters of closely related presumed numts (Fig. S2). Removal of these sequences reduced the average number of unique sequences within each MID tag pool to 19, representing a total of 39 unique sequences. A neighbour-joining analysis reveals that the 39 unique sequences comprise 24 divergent lineages; the taxonomic identity of 14 of these is revealed by the joint analysis with taxonomically referenced Sanger sequences, with 10 of these representing exact matches to Sanger sequences (Fig. 3). Thirty of the 37 PyroCleaned sequences were assessed for taxonomic identity against the BOLD Identification System, with 13 sequence matches of 99% identity or higher (Fig. 3).
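
The p-distance used for the neighbour-joining analysis is the proportion of aligned sites at which two sequences differ. The sketch below illustrates that definition on toy data; it is not MEGA5's implementation, and the function names, the toy sequences, and the choice to skip sites containing a gap or ambiguous base (pairwise deletion) are assumptions made for illustration only.

```python
from itertools import combinations
from typing import Dict, Tuple


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: sites with a gap or ambiguous base in either
        # sequence are skipped (pairwise deletion).
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    return differing / compared


def distance_matrix(seqs: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Pairwise p-distances for every unordered pair of named sequences."""
    return {
        (x, y): p_distance(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }


# Hypothetical toy alignment, not data from the study.
toy = {
    "seq1": "ACGTACGTAC",
    "seq2": "ACGTACGTAT",
    "seq3": "ACGAACG-AC",
}
matrix = distance_matrix(toy)
for pair, dist in matrix.items():
    print(pair, round(dist, 3))
```

A matrix of this kind is the input a neighbour-joining routine works from; closely related presumed numts would appear as clusters of sequences separated by very small distances.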

