A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE.

Keegan KP, Trimble WL, Wilkening J, Wilke A, Harrison T, D'Souza M, Meyer F - PLoS Comput. Biol. (2012)

Bottom Line: DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in the case of studies that utilize data from multiple sequencing samples. Here, DRISEE is applied to (non amplicon) data sets from both the 454 and Illumina platforms.


Affiliation: Argonne National Laboratory, Argonne, Illinois, United States of America. kkeegan@anl.gov

ABSTRACT
We provide a novel method, DRISEE (duplicate read inferred sequencing error estimation), to assess sequencing quality (alternatively referred to as "noise" or "error") within and/or between sequencing samples. DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in the case of studies that utilize data from multiple sequencing samples. For shotgun metagenomic data, we believe that DRISEE provides estimates of sequencing error that are more accurate and less constrained by technical limitations than existing methods that rely on reference genomes or the use of scores (e.g. Phred). Here, DRISEE is applied to (non amplicon) data sets from both the 454 and Illumina platforms. The DRISEE error estimate is obtained by analyzing sets of artifactual duplicate reads (ADRs), a known by-product of both sequencing platforms. We present DRISEE as an open-source, platform-independent method to assess sequencing error in shotgun metagenomic data, and utilize it to discover previously uncharacterized error in de novo sequence data from the 454 and Illumina sequencing platforms.
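
As a minimal sketch of how a positional error estimate could inform read trimming, the snippet below picks a trim length from a per-position error profile and cuts reads to that length. The profile values, the 5% threshold, and the function names `trim_length` and `trim_reads` are assumptions made for illustration; they are not taken from DRISEE or from the paper.

```python
def trim_length(error_percent, max_error=5.0):
    """Return how many leading positions stay at or below max_error.

    error_percent: per-position error estimates in percent (hypothetical input).
    max_error: illustrative threshold, not a value recommended by the source.
    """
    for pos, err in enumerate(error_percent):
        if err > max_error:
            return pos  # first position whose error exceeds the threshold
    return len(error_percent)  # no position exceeded the threshold


def trim_reads(reads, keep):
    """Cut every read to at most `keep` bases; shorter reads are left as-is."""
    return [read[:keep] for read in reads]


# Made-up positional profile (percent error at positions 0..5).
profile = [0.5, 0.8, 1.2, 2.0, 6.5, 9.0]
keep = trim_length(profile, max_error=5.0)
trimmed = trim_reads(["ACGTAC", "ACG"], keep)
```

With this made-up profile, error first exceeds the threshold at position 4, so reads are cut to their first four bases. A real workflow would choose the threshold per study rather than reuse the value shown here.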


DRISEE performance on simulated and real data. (a) Simulated data sets were generated from real whole genome sequences [12], taken from a single sequenced genome, and randomly fragmented into reads that exhibit length distributions consistent with different sequencing technologies (see Methods). Total DRISEE error rates for each sample (Y-axis) are plotted against the known, artificially introduced error rates (X-axis). The equation and R² values represent a linear regression of displayed data. (b) DRISEE and a conventional reference-genome-based error method were applied to a set of published genomic data sets [12] (see Methods). Cumulative DRISEE errors (Y-axis) are plotted against reference-genome errors determined for the same sample. The equations and R² values represent linear regressions of displayed data. The regression for all samples is plotted as a black line; red lines indicate this regression plus or minus one standard deviation. Red points indicate values further than one standard deviation from the “All Samples” regression. Orange indicates a single point that may disproportionately inflate the observed R². Equations and R² values for the “All Samples” regression are provided as well as for regressions that exclude only the red points or the red and orange points.
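
The kind of analysis described in this caption (a linear regression with its R², plus flagging points that lie more than one standard deviation from the fitted line) can be sketched as follows. This is an illustration only: the caption does not say which standard deviation was used, so the sketch assumes the population standard deviation of the residuals, and the function names and data are hypothetical.

```python
from statistics import mean, pstdev


def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, with R^2."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


def far_points(x, y, slope, intercept):
    """Indices of points whose residual exceeds one residual standard deviation.

    Assumption: "one standard deviation" is read as the population standard
    deviation of the residuals; the source does not specify this.
    """
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    sd = pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > sd]


# Made-up data: reference-based error (x) against a second estimate (y).
x = [0, 1, 2, 3, 4]
y = [0, 1, 2, 3, 10]
slope, intercept, r_squared = linear_fit(x, y)
flagged = far_points(x, y, slope, intercept)
```

Refitting after dropping the flagged indices mirrors the caption's comparison of the "All Samples" regression with regressions that exclude outlying points.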

Mentions: The initial output of a DRISEE analysis is a table, excerpted examples of which are presented as Tables 1 and 2. It indicates the number (Table 1) or percent (Table 2) of sequences (indexed by consensus sequence position) in all considered clusters of ADRs that match or do not match the consensus derived from the ADR cluster to which they belong. DRISEE tables can indicate the match/mismatch counts for a single cluster of prefix-identical reads from a single sequencing sample, for multiple clusters from a single sample (Tables 1 and 2 present one such example), or for multiple clusters collected from a large number of samples that may represent some common trait of interest (e.g. samples produced with the same sequencing technology, that used the same RNA/DNA extraction procedures, that were collected as part of the same sequencing project, etc.). This adaptable tabular format represents the simplest incarnation of a DRISEE error profile; it can be analyzed and visualized in a number of ways (numerous examples are presented below; see Figures 2–5) to garner detailed platform-independent information regarding sequencing error in genomic and metagenomic shotgun sequencing data. A more detailed description of the tabular format is included in the legend for Tables 1 and 2.
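
The tabular output described above can be illustrated with a small sketch that groups reads by an identical prefix, takes a majority-vote consensus at each position, and counts matches and mismatches per position. This is not the DRISEE implementation: the prefix length, the minimum cluster size, the majority-vote consensus, and the position-by-position comparison (substitutions only, no alignment or indel handling) are all simplifying assumptions, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def positional_table(reads, prefix_len=5, min_cluster_size=3):
    """Per-position match/mismatch counts against each cluster's consensus.

    prefix_len and min_cluster_size are illustrative values only.
    """
    # Group reads that share an identical prefix (candidate duplicate clusters).
    clusters = defaultdict(list)
    for read in reads:
        if len(read) >= prefix_len:
            clusters[read[:prefix_len]].append(read)

    table = defaultdict(lambda: {"match": 0, "mismatch": 0})
    for members in clusters.values():
        if len(members) < min_cluster_size:
            continue  # too few reads to form a meaningful consensus
        for pos in range(max(len(r) for r in members)):
            bases = [r[pos] for r in members if pos < len(r)]
            consensus = Counter(bases).most_common(1)[0][0]  # majority base
            matches = sum(1 for b in bases if b == consensus)
            table[pos]["match"] += matches
            table[pos]["mismatch"] += len(bases) - matches
    return dict(sorted(table.items()))


def percent_mismatch(table):
    """Convert counts to percent mismatch per position (percent-style table)."""
    return {
        pos: 100.0 * c["mismatch"] / (c["match"] + c["mismatch"])
        for pos, c in table.items()
    }


# Toy reads: four share the prefix "ACGTA"; one of them differs at position 5.
reads = ["ACGTAGG", "ACGTAGG", "ACGTATG", "ACGTAGG", "TTTTTCC"]
counts = positional_table(reads)
# counts[5] == {"match": 3, "mismatch": 1}
```

Because counts are accumulated per position across clusters, the same function can summarize one cluster, many clusters from one sample, or clusters pooled from several samples simply by changing which reads are passed in.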

