A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE.

Keegan KP, Trimble WL, Wilkening J, Wilke A, Harrison T, D'Souza M, Meyer F - PLoS Comput. Biol. (2012)

Bottom Line: DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in studies that use data from multiple sequencing samples. Here, DRISEE is applied to (non-amplicon) data sets from both the 454 and Illumina platforms.


Affiliation: Argonne National Laboratory, Argonne, Illinois, United States of America. kkeegan@anl.gov

ABSTRACT
We provide a novel method, DRISEE (duplicate read inferred sequencing error estimation), to assess sequencing quality (alternatively referred to as "noise" or "error") within and/or between sequencing samples. DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in the case of studies that utilize data from multiple sequencing samples. For shotgun metagenomic data, we believe that DRISEE provides estimates of sequencing error that are more accurate and less constrained by technical limitations than existing methods that rely on reference genomes or the use of scores (e.g. Phred). Here, DRISEE is applied to (non amplicon) data sets from both the 454 and Illumina platforms. The DRISEE error estimate is obtained by analyzing sets of artifactual duplicate reads (ADRs), a known by-product of both sequencing platforms. We present DRISEE as an open-source, platform-independent method to assess sequencing error in shotgun metagenomic data, and utilize it to discover previously uncharacterized error in de novo sequence data from the 454 and Illumina sequencing platforms.
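The idea of inferring error from duplicate reads can be illustrated with a toy sketch. It is not the authors' implementation: it assumes, purely for illustration, that artifactual duplicate reads can be grouped by an identical read prefix, and it counts only substitutions against a per-position majority consensus (no alignment, so insertions and deletions are ignored). The function names and the `prefix_len` and `min_bin_size` parameters are hypothetical.

```python
from collections import Counter, defaultdict


def bin_by_prefix(reads, prefix_len=20, min_bin_size=3):
    """Group reads that share an identical prefix (assumed stand-in for ADR detection)."""
    bins = defaultdict(list)
    for read in reads:
        if len(read) >= prefix_len:
            bins[read[:prefix_len]].append(read)
    # Keep only bins large enough to support a majority consensus.
    return [b for b in bins.values() if len(b) >= min_bin_size]


def positional_error(bins, prefix_len=20):
    """Per-position fraction of bases that disagree with the bin consensus."""
    mismatches = Counter()
    totals = Counter()
    for reads in bins:
        shortest = min(len(r) for r in reads)
        # Positions inside the prefix are identical by construction, so skip them.
        for pos in range(prefix_len, shortest):
            column = [r[pos] for r in reads]
            consensus, _ = Counter(column).most_common(1)[0]
            totals[pos] += len(column)
            mismatches[pos] += sum(1 for base in column if base != consensus)
    return {pos: mismatches[pos] / totals[pos] for pos in sorted(totals)}


def global_error(bins, prefix_len=20):
    """Whole-sample fraction of non-consensus bases over all scored positions."""
    mismatched = scored = 0
    for reads in bins:
        shortest = min(len(r) for r in reads)
        for pos in range(prefix_len, shortest):
            column = [r[pos] for r in reads]
            consensus, _ = Counter(column).most_common(1)[0]
            scored += len(column)
            mismatched += sum(1 for base in column if base != consensus)
    return mismatched / scored if scored else 0.0
```

A positional profile of this kind is what would inform read trimming, while the single global value is what would be compared between samples.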


Figure 1 (pcbi-1002541-g001). (a) Error detection capabilities of Score, Reference-genome, and DRISEE methods.
(1) Simplified procedural diagram of a typical sequencing protocol. Sample collection: the biological sample is collected. Extraction/Initial purification: the RNA/DNA undergoes extraction and initial purification procedures. Pre-sequencing amplification(s): the extracted genetic material may undergo amplification (e.g. whole genome amplification; see main text), followed by additional purifications and/or other processing procedures. "Sequencing": genetic material is placed in the sequencer itself and is sequenced; note that sequencing itself frequently involves additional rounds of amplification. Analyses of sequencing output: sequencer outputs are analyzed.
(2) Given a procedure such as the one in (1), the portion of the procedure over which score/Phred-based methods can detect error is indicated in red.
(3) Given a procedure such as the one in (1), the portion of the procedure over which reference-genome-based methods can detect error is indicated in green. Note that reference-genome-based methods are only applicable to single-genome data; they cannot consider metagenomic data.
(4) Given a procedure such as the one in (1), the portion of the procedure over which DRISEE-based methods can detect error is indicated in blue. Note that DRISEE methods can be applied to metagenomic or genomic data, provided that certain requirements are met. See Methods.
1: BMC Bioinformatics. 2008 Sep 19;9:386. 2: Nat Methods. 2010 May;7(5):335–6. Epub 2010 Apr 11.
(b) DRISEE workflow. The steps in a typical DRISEE workflow are depicted and briefly described (in figure captions). See Text S1 (Supplemental Methods, Typical DRISEE workflow) for a much more detailed description of each depicted step.

Score-based methods use an alternative approach. Sequencer signals are compared with sophisticated, frequently proprietary, probabilistic models that attempt to account for platform-dependent artifacts, generating base calls, each with an associated quality (Phred or Q score) that provides an estimate of error frequency but no information regarding error type. Although score-based methods are applicable to metagenomic data, their inability to consider error type can be a substantial limitation. For example, similarity-based gene annotation is extremely sensitive to frame-shifting insertion/deletion errors but only moderately affected by substitutions [3]. In this context, knowledge of error type, specifically the ratio of insertions and/or deletions to substitutions, provides crucial information that is unattainable with conventional Phred or Q scores. The absence of information regarding error type is an even greater concern in light of documented platform-dependent biases in sequencing error type: Illumina-based sequencing exhibits high substitution rates [14], whereas 454 technologies exhibit a preponderance of insertion/deletion errors [13]; identical Q scores from these two technologies are likely to represent different types of error, rendering ostensibly similar metrics incomparable [13], [15], [16], [17], [18], [19], [20]. The most concerning, but paradoxically least discussed and perhaps least understood, deficit of score-based methods is their implicit disregard of experimental procedure. Typical sequencing efforts employ a host of procedures to extract, amplify, and purify genetic material. These experimental processes necessarily contribute errors (i.e. they introduce non-biological bias in sequence content and/or abundance relative to the original biological template sequences); however, because these errors are introduced before the actual act of sequencing, they cannot be accounted for with score-based methods. Thus, a large portion of experimental error in sequencing is frequently overlooked (Figure 1a); an in-depth literature search revealed no works that directly address this issue.
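To make concrete what a Q score does and does not encode, the following minimal sketch uses the standard Phred convention, Q = -10 log10(P), where P is the estimated probability that a base call is wrong. That formula is the usual definition and is not spelled out in the text above; the function names are hypothetical. The score reduces to a single probability per base, which is why it carries no information about whether an error is a substitution, an insertion, or a deletion.

```python
import math


def phred_to_error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def expected_errors(quals):
    """Expected number of miscalled bases in a read, given its per-base Q scores."""
    return sum(phred_to_error_probability(q) for q in quals)
```

For instance, a read of 100 bases that all carry Q20 has an expected one miscalled base under this convention, regardless of platform and regardless of what kind of error that base would be.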

