A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE.

Keegan KP, Trimble WL, Wilkening J, Wilke A, Harrison T, D'Souza M, Meyer F - PLoS Comput. Biol. (2012)

Bottom Line: DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in the case of studies that utilize data from multiple sequencing samples. Here, DRISEE is applied to (non-amplicon) data sets from both the 454 and Illumina platforms.

Affiliation: Argonne National Laboratory, Argonne, Illinois, United States of America. kkeegan@anl.gov

ABSTRACT
We provide a novel method, DRISEE (duplicate read inferred sequencing error estimation), to assess sequencing quality (alternatively referred to as "noise" or "error") within and/or between sequencing samples. DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in studies that use data from multiple sequencing samples. For shotgun metagenomic data, we believe that DRISEE provides estimates of sequencing error that are more accurate and less constrained by technical limitations than existing methods that rely on reference genomes or on quality scores (e.g., Phred). Here, DRISEE is applied to (non-amplicon) data sets from both the 454 and Illumina platforms. The DRISEE error estimate is obtained by analyzing sets of artifactual duplicate reads (ADRs), a known by-product of both sequencing platforms. We present DRISEE as an open-source, platform-independent method to assess sequencing error in shotgun metagenomic data, and use it to discover previously uncharacterized error in de novo sequence data from the 454 and Illumina sequencing platforms.
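
The abstract states that the error estimate comes from analyzing sets of artifactual duplicate reads, but it does not describe the procedure. The Python sketch below is only an illustration of the general idea, not the DRISEE implementation. It rests on several assumptions that are not in the text above: candidate duplicates are binned by an identical read prefix (the function name group_by_prefix and the parameter prefix_len are hypothetical), each bin's consensus is taken as a per-position majority vote, and only substitutions are counted, whereas the published profiles also include insertions and deletions.

```python
from collections import Counter, defaultdict


def group_by_prefix(reads, prefix_len):
    """Bin reads that share an identical prefix (assumed proxy for duplicates).

    Bins with a single read are dropped, since they carry no information
    about disagreement between copies.
    """
    bins = defaultdict(list)
    for read in reads:
        if len(read) >= prefix_len:
            bins[read[:prefix_len]].append(read)
    return [group for group in bins.values() if len(group) >= 2]


def error_profile(groups):
    """Return (per-position percent error, global percent error).

    For each group, every position is compared against the majority base
    at that position. Substitutions only; indels are ignored here.
    """
    mismatches = Counter()
    totals = Counter()
    for group in groups:
        # Compare only up to the shortest read in the group.
        length = min(len(read) for read in group)
        for pos in range(length):
            column = [read[pos] for read in group]
            consensus, _ = Counter(column).most_common(1)[0]
            totals[pos] += len(column)
            mismatches[pos] += sum(1 for base in column if base != consensus)
    per_position = {
        pos: 100.0 * mismatches[pos] / totals[pos] for pos in sorted(totals)
    }
    all_bases = sum(totals.values())
    global_error = 100.0 * sum(mismatches.values()) / all_bases if all_bases else 0.0
    return per_position, global_error


if __name__ == "__main__":
    reads = ["ACGTAC", "ACGTAC", "ACGTAA", "TTTTTT"]
    groups = group_by_prefix(reads, prefix_len=4)
    per_position, global_error = error_profile(groups)
    print(per_position)
    print(global_error)
```

In this sketch the per-position values play the role of the positional estimates that could inform trimming, and the single global value plays the role of the whole-sample estimate used to compare samples.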

Figure 4: DRISEE error profiles for metagenomic sequencing data sets. Total (% substitutions + % insertions + % deletions) DRISEE error (Y-axis) as a function of read position (X-axis) for all considered reads. (a) and (b): Phred vs. DRISEE: Total DRISEE (red) and average Phred (blue) derived errors (Q values converted to percent error) for (a) 20 metagenomic 454 samples and (b) 12 metagenomic Illumina samples. (c): DRISEE total error of several Illumina-based sample sets: DRISEE total error profiles are displayed for 5 different Illumina experiments/sample sets. Parentheses indicate the number of samples in each experiment/sample set. (d): DRISEE total error of single samples: DRISEE total error profiles are displayed for two individual samples. The samples represent the lowest and highest averaged DRISEE total errors (averaged across all read positions) observed in Sample Set 3 (see Figure 4c above). Pie charts indicate a summary of MG-RAST-based annotation of the two samples. The upper pie chart was produced from the data set that corresponds to the purple DRISEE profile (average DRISEE error = 45%). The lower pie chart corresponds to annotation of the data set that produced the green DRISEE profile (average DRISEE error = 1%).
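
The caption compares DRISEE error with Phred-derived error "converted to percent error". The standard Phred definition is Q = -10 log10(P), so P = 10^(-Q/10) and the percent error is 100 P. A minimal Python sketch of that conversion follows. The function names are hypothetical, and averaging the converted errors per position (rather than averaging Q first and converting afterwards) is an assumption, since the caption does not say which order was used.

```python
def phred_to_percent_error(q):
    """Convert a Phred quality score Q to an error probability in percent."""
    return 100.0 * 10 ** (-q / 10.0)


def mean_percent_error_by_position(quality_rows):
    """Average the converted percent error at each read position.

    quality_rows is a list of per-read Q-score lists; the rows are assumed
    to be of equal length so that positions line up.
    """
    return [
        sum(phred_to_percent_error(q) for q in column) / len(column)
        for column in zip(*quality_rows)
    ]


if __name__ == "__main__":
    print(phred_to_percent_error(20))
    print(mean_percent_error_by_position([[20, 10], [20, 30]]))
```

Because the conversion is nonlinear, the two averaging orders give different curves: the mean of Q10 and Q30 converted separately is about 5.05%, while their mean score, Q20, converts to 1%.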