A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE.

Keegan KP, Trimble WL, Wilkening J, Wilke A, Harrison T, D'Souza M, Meyer F - PLoS Comput. Biol. (2012)

Bottom Line: DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in the case of studies that utilize data from multiple sequencing samples. Here, DRISEE is applied to (non amplicon) data sets from both the 454 and Illumina platforms.


Affiliation: Argonne National Laboratory, Argonne, Illinois, United States of America. kkeegan@anl.gov

ABSTRACT
We provide a novel method, DRISEE (duplicate read inferred sequencing error estimation), to assess sequencing quality (alternatively referred to as "noise" or "error") within and/or between sequencing samples. DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in the case of studies that utilize data from multiple sequencing samples. For shotgun metagenomic data, we believe that DRISEE provides estimates of sequencing error that are more accurate and less constrained by technical limitations than existing methods that rely on reference genomes or the use of scores (e.g. Phred). Here, DRISEE is applied to (non amplicon) data sets from both the 454 and Illumina platforms. The DRISEE error estimate is obtained by analyzing sets of artifactual duplicate reads (ADRs), a known by-product of both sequencing platforms. We present DRISEE as an open-source, platform-independent method to assess sequencing error in shotgun metagenomic data, and utilize it to discover previously uncharacterized error in de novo sequence data from the 454 and Illumina sequencing platforms.


pcbi-1002541-g005: DRISEE-calculated errors, separated by error type, for 454 and Illumina metagenomic samples. DRISEE error profiles are displayed for metagenomic data produced by the 454 (65 samples, panel a) and Illumina (127 samples, panel b) platforms. DRISEE-determined errors (Y-axis) are plotted with respect to read position (X-axis). DRISEE errors are displayed as total (black) and separated by type (A_sub = A substitutions, T_sub = T substitutions, C_sub = C substitutions, G_sub = G substitutions, InDel = insertions and deletions).

Mentions: The initial output of a DRISEE analysis is a table, excerpted examples of which are presented as Tables 1 and 2. It indicates the number (Table 1), or percent (Table 2), of sequences (indexed by consensus sequence position) in all considered clusters of ADRs that match or do not match the consensus derived from the ADR cluster to which they belong. DRISEE tables can indicate the match/mismatch counts for a single cluster of prefix-identical reads from a single sequencing sample, for multiple clusters from a single sample (Tables 1 and 2 present one such example), or for multiple clusters collected from a large number of samples that may represent some common trait of interest (e.g. samples produced with the same sequencing technology, that used the same RNA/DNA extraction procedures, that were collected as part of the same sequencing project etc.). This adaptable tabular format represents the simplest incarnation of a DRISEE error profile; it can be analyzed and visualized in a number of ways (numerous examples are presented below – see Figures 2–5) to garner detailed platform-independent information regarding sequencing error in genomic and metagenomic shotgun sequencing data. A more detailed description of the tabular format is included in the legend for Tables 1 and 2.
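
The tabular profile described above can be illustrated with a minimal sketch. This is not the DRISEE implementation: the function names, the prefix length, and the toy reads are all hypothetical, and the comparison is done position by position without any alignment, so it counts only substitutions and would miscount reads that contain insertions or deletions. It groups reads into clusters that share an identical prefix, derives a majority-vote consensus for each cluster, and then tallies, for each consensus position, how many reads match or do not match the consensus.

```python
from collections import Counter, defaultdict

PREFIX_LEN = 5  # hypothetical value, kept small so the toy reads below cluster


def cluster_by_prefix(reads, prefix_len=PREFIX_LEN, min_size=2):
    """Group reads that share an identical prefix (candidate ADR clusters)."""
    clusters = defaultdict(list)
    for read in reads:
        if len(read) >= prefix_len:
            clusters[read[:prefix_len]].append(read)
    # A cluster with a single read carries no information about error.
    return [c for c in clusters.values() if len(c) >= min_size]


def consensus(cluster):
    """Majority base at each position (assumption: no alignment is performed)."""
    length = max(len(r) for r in cluster)
    out = []
    for i in range(length):
        bases = Counter(r[i] for r in cluster if i < len(r))
        out.append(bases.most_common(1)[0][0])
    return "".join(out)


def error_table(clusters):
    """Per-position (match, mismatch) counts summed over all clusters."""
    table = defaultdict(lambda: [0, 0])
    for cluster in clusters:
        cons = consensus(cluster)
        for read in cluster:
            for i, base in enumerate(read):
                table[i][0 if base == cons[i] else 1] += 1
    return [tuple(table[i]) for i in sorted(table)]


def to_percent(table):
    """Convert counts to (percent match, percent mismatch) per position."""
    return [(100.0 * m / (m + x), 100.0 * x / (m + x)) for m, x in table]


# Toy input: two clusters of three reads each, plus one unclustered read.
reads = [
    "ACGTACGTAA", "ACGTACGTAA", "ACGTACCTAA",
    "TTTTTGGGGG", "TTTTTGGGGC", "TTTTTGGGGG",
    "CCCCC",
]
table = error_table(cluster_by_prefix(reads))
for position, (match, mismatch) in enumerate(table):
    print(position, match, mismatch)
```

The count form corresponds in spirit to the layout described for Table 1 and the output of `to_percent` to that of Table 2. Because counts are summed across clusters, the same functions could be applied to one cluster, to all clusters in a sample, or to clusters pooled from many samples.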

