Identification of errors introduced during high throughput sequencing of the T cell receptor repertoire.

Nguyen P, Ma J, Pei D, Obert C, Cheng C, Geiger TL - BMC Genomics (2011)

Bottom Line: Filtering for lower quality sequences diminished but did not eliminate sequence errors, which occurred within 1-6% of sequences. Caution is needed in interpreting repertoire data due to potential contamination with mis-sequenced reads. However, a high association of errors with phred score, high relatedness of erroneous sequences with the parental sequence, dominance of specific nt substitutions, and a skewed ratio of forward to reverse reads among erroneous sequences indicate approaches to filter erroneous sequences from repertoire data sets.


Affiliation: Department of Pathology, St. Jude Children's Research Hospital, 262 Danny Thomas Pl., Memphis, TN 38105, USA.

ABSTRACT

Background: Recent advances in massively parallel sequencing have increased the depth at which T cell receptor (TCR) repertoires can be probed by >3 log10, allowing for saturation sequencing of immune repertoires. The resolution of this sequencing is dependent on its accuracy, and direct assessments of the errors formed during high throughput repertoire analyses are limited.

Results: We analyzed 3 monoclonal TCRs from TCR transgenic, Rag-/- mice using Illumina® sequencing. A total of 27 sequencing reactions were performed for each TCR using a trifurcating design in which samples were divided into 3 at significant processing junctures. More than 20 million complementarity determining region (CDR) 3 sequences were analyzed. Filtering for lower quality sequences diminished but did not eliminate sequence errors, which occurred within 1-6% of sequences. Erroneous sequences were predominantly of correct length and contained single nucleotide substitutions. Rates of specific substitutions varied dramatically in a position-dependent manner. Four substitutions, all purine-pyrimidine transversions, predominated. Solid phase amplification and sequencing rather than liquid sample amplification and preparation appeared to be the primary sources of error. Analysis of polyclonal repertoires demonstrated the impact of error accumulation on data parameters.

Conclusions: Caution is needed in interpreting repertoire data due to potential contamination with mis-sequenced reads. However, a high association of errors with phred score, high relatedness of erroneous sequences with the parental sequence, dominance of specific nt substitutions, and a skewed ratio of forward to reverse reads among erroneous sequences indicate approaches to filter erroneous sequences from repertoire data sets.


Figure 4: Complementation in error occurrence. An expected frequency of multiple errors was calculated, based on the assumption that each error is independent, using the formula $p = C \cdot \mathrm{SER}^{M}$, where SER = observed single error rate, M = number of mutated nt in the sequence, and C = total number of possible erroneous sequence combinations, $C = \frac{N!}{M!\,(N-M)!}$, where N = number of nucleotides in the sequence. The expected frequency of multiple mutations is plotted against the observed frequency in experimental samples, either for data sets not filtered based on phred score or for data sets filtered at q = 30, and for the presence of between 2 and 10 mutated nt for q = 0 and 2 and 4 for q = 30 (no events were observed with 4-10 mutations for q = 30 filtered data).
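The expected-frequency formula in the caption can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name, the sequence length of 42 nt, and the single error rate of 0.001 are hypothetical values chosen for the example, and SER is assumed here to be a per-nucleotide rate.

```python
from math import comb


def expected_multi_error_frequency(ser: float, n: int, m: int) -> float:
    """Expected frequency of sequences carrying m errors, p = C * SER**M.

    ser: single error rate (assumed per nucleotide for this sketch)
    n:   number of nucleotides in the sequence (N in the caption)
    m:   number of mutated nucleotides (M in the caption)
    """
    if not 0 <= m <= n:
        raise ValueError("m must lie between 0 and n")
    # C = N! / (M! * (N - M)!) counts the ways to place m errors in n positions.
    combinations = comb(n, m)
    return combinations * ser ** m


if __name__ == "__main__":
    # Hypothetical inputs, not taken from the paper.
    sequence_length = 42
    single_error_rate = 0.001
    for mutated in range(2, 5):
        p = expected_multi_error_frequency(single_error_rate, sequence_length, mutated)
        print(f"M = {mutated}: expected frequency = {p:.3e}")
```

Comparing these expected values with the observed frequency of multiply mutated sequences is the comparison the figure plots; an observed frequency well above the expected one argues against independent errors.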

Even at q = 0, only ~1% of total sequences had multiple errors, yet their incidence was greater than would have been anticipated from the single error rate. This indicates complementation in the formation of multiple errors (Figure 4a-c). Screening using q = 30, however, reduced the number of sequences with multiple errors, and multiple error rates more closely reflected those predicted from the single error rate. Therefore, a substantial proportion of sequences acquired using high throughput sequencing, ~1-6% depending on the TCR and the use of phred quality filtering, were erroneous. These erroneous sequences were virtually exclusively of the correct length and carried primarily single nt substitutions. Filtering sequences based on phred scores altered both the quantity and the types of errors observed.
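As a rough illustration of what a q = 30 filter means, the sketch below uses the standard phred definition, Q = -10 log10(P), so q = 30 corresponds to an estimated per-base error probability of 0.001. The filtering rule shown (every base must meet the threshold) is an assumption made for this example; the text above does not state the exact criterion used, and the function names and quality values are hypothetical.

```python
from typing import Sequence


def phred_to_error_probability(q: float) -> float:
    """Convert a phred quality score to an estimated base-call error probability."""
    return 10 ** (-q / 10)


def passes_quality_filter(qualities: Sequence[int], q_min: int) -> bool:
    """Return True if every base quality is at least q_min.

    Assumed rule for this sketch: a read is kept only when all of its
    bases meet the threshold. Other rules (e.g. mean quality) are possible.
    """
    return all(q >= q_min for q in qualities)


if __name__ == "__main__":
    # Hypothetical per-base quality scores for two reads.
    read_a = [38, 37, 36, 35, 34, 33]
    read_b = [38, 37, 22, 35, 34, 33]
    print(phred_to_error_probability(30))
    print(passes_quality_filter(read_a, 30), passes_quality_filter(read_b, 30))
```

With q_min = 0 every read passes, which matches the unfiltered case described above.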

