Identification of errors introduced during high throughput sequencing of the T cell receptor repertoire.

Nguyen P, Ma J, Pei D, Obert C, Cheng C, Geiger TL - BMC Genomics (2011)

Bottom Line: Filtering for lower quality sequences diminished but did not eliminate sequence errors, which occurred within 1-6% of sequences. Caution is needed in interpreting repertoire data due to potential contamination with mis-sequence reads. However, a high association of errors with phred score, high relatedness of erroneous sequences with the parental sequence, dominance of specific nt substitutions, and skewed ratio of forward to reverse reads among erroneous sequences indicate approaches to filter erroneous sequences from repertoire data sets.


Affiliation: Department of Pathology, St. Jude Children's Research Hospital, 262 Danny Thomas Pl., Memphis, TN 38105, USA.

ABSTRACT

Background: Recent advances in massively parallel sequencing have increased the depth at which T cell receptor (TCR) repertoires can be probed by >3 log10, allowing for saturation sequencing of immune repertoires. The resolution of this sequencing is dependent on its accuracy, and direct assessments of the errors formed during high throughput repertoire analyses are limited.

Results: We analyzed 3 monoclonal TCR from TCR transgenic, Rag-/- mice using Illumina® sequencing. A total of 27 sequencing reactions were performed for each TCR using a trifurcating design in which samples were divided into 3 at significant processing junctures. More than 20 million complementarity determining region (CDR) 3 sequences were analyzed. Filtering for lower quality sequences diminished but did not eliminate sequence errors, which occurred within 1-6% of sequences. Erroneous sequences were predominantly of correct length and contained single nucleotide substitutions. Rates of specific substitutions varied dramatically in a position-dependent manner. Four substitutions, all purine-pyrimidine transversions, predominated. Solid phase amplification and sequencing rather than liquid sample amplification and preparation appeared to be the primary sources of error. Analysis of polyclonal repertoires demonstrated the impact of error accumulation on data parameters.

Conclusions: Caution is needed in interpreting repertoire data due to potential contamination with mis-sequence reads. However, a high association of errors with phred score, high relatedness of erroneous sequences with the parental sequence, dominance of specific nt substitutions, and skewed ratio of forward to reverse reads among erroneous sequences indicate approaches to filter erroneous sequences from repertoire data sets.


Figure 5: Filtering single nt mismatch sequences from repertoire data. To determine the extent to which errors could be purged by filtering sequences with single nt mismatches, we examined the residual percent of erroneous sequences for each sequencing reaction after culling single nt mismatch sequences. Residual erroneous sequences were assessed at multiple cutoff values for the frequency of the mismatch sequence relative to the true 5C.C7 (A), OT-1 (B), or DO11.10 (C) sequence, and the mean + 1 s.d. is plotted. Our data suggest that cutoff values of less than 0.01 are adequate for optimal error reduction. In application, a cutoff would need to be selected that optimizes removal of erroneous sequences while also minimizing inadvertent culling of true sequences.

Mentions: Considering that a majority of erroneous sequences are single nt substitutions, it should be possible to purge erroneous sequences by excluding sequences present at lower frequency and differing by a single nt from an index sequence. To assess this, we calculated the impact of filtering sequences with single nt mismatches compared with the true 5C.C7, OT-1, and DO11.10 TCR CDR3. Cutoff values, indicating the maximum frequency of the culled single nt mismatch sequence relative to the correct CDR3 sequence, were varied (Figure 5a-c). An exclusion cutoff of 0 does not filter any of the sequences. A cutoff of 1 eliminates all single nt mismatches at or below the index sequence's frequency. Residual erroneous sequences at this latter cutoff include sequences with multiple mismatches, or with nt additions or deletions. For each TCR, cutoffs in the range of 0.0001-0.01 (0.01-1% of index frequency) eliminated most erroneous sequences. Indeed, at q = 30 and a cutoff of 0.01, only 0.0086 ± 0.002%, 0.030 ± 0.006%, and 0.057 ± 0.008% of total sequences were erroneous for the 5C.C7, OT-1, and DO11.10 CDR3, respectively. Therefore, filtering single nt mismatch sequences has the potential to dramatically decrease overall error rates in CDR3 acquired by next generation sequencing.
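
The filtering rule described above can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the use of raw read counts, and the toy sequences are assumptions made for illustration only, and the sequences are invented rather than the real 5C.C7, OT-1, or DO11.10 CDR3. The sketch drops any sequence that has the same length as the index sequence, differs from it at exactly one position, and occurs at no more than `cutoff` times the index count. It then reports the residual percent of non-index reads, which in a monoclonal sample are all taken to be erroneous.

```python
from collections import Counter


def is_single_nt_mismatch(seq: str, index_seq: str) -> bool:
    """True if seq has the same length as index_seq and differs at exactly one position."""
    if len(seq) != len(index_seq):
        return False
    return sum(a != b for a, b in zip(seq, index_seq)) == 1


def filter_single_nt_mismatches(counts: Counter, index_seq: str, cutoff: float) -> Counter:
    """Remove single nt mismatch sequences whose count is <= cutoff * index count.

    cutoff = 0 removes nothing; cutoff = 1 removes every single nt mismatch
    that is at or below the index sequence's count.
    """
    threshold = cutoff * counts[index_seq]
    kept = Counter()
    for seq, n in counts.items():
        if is_single_nt_mismatch(seq, index_seq) and n <= threshold:
            continue  # culled as a likely sequencing error
        kept[seq] = n
    return kept


def residual_error_percent(counts: Counter, index_seq: str) -> float:
    """Percent of reads that are not the index sequence (monoclonal sample assumed)."""
    total = sum(counts.values())
    return 100.0 * (total - counts[index_seq]) / total


# Hypothetical toy data: an invented index sequence and invented error variants.
INDEX = "TGTGCCAGC"
toy_counts = Counter({
    INDEX: 10000,
    "TGTGCCAGA": 50,   # one substitution, 0.5% of index count
    "TGTGCCGGC": 150,  # one substitution, 1.5% of index count
    "TGTACCAGA": 5,    # two substitutions, never removed by this filter
    "TGTGCCAG": 3,     # one-nt deletion, never removed by this filter
})

for cutoff in (0, 0.01, 1):
    kept = filter_single_nt_mismatches(toy_counts, INDEX, cutoff)
    print(cutoff, round(residual_error_percent(kept, INDEX), 4))
```

In this toy example, a cutoff of 0.01 removes only the variant at 0.5% of the index count, while a cutoff of 1 removes both single nt variants and leaves the double mismatch and the deletion. This mirrors the residual error classes named in the text. In a polyclonal repertoire the same rule could also cull true low-frequency clones that happen to sit one nt away from an abundant clone, which is the trade-off noted in the figure caption.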

