Identification of errors introduced during high throughput sequencing of the T cell receptor repertoire.

Nguyen P, Ma J, Pei D, Obert C, Cheng C, Geiger TL - BMC Genomics (2011)

Bottom Line: Filtering for lower quality sequences diminished but did not eliminate sequence errors, which occurred within 1-6% of sequences. Caution is needed in interpreting repertoire data due to potential contamination with mis-sequence reads. However, a high association of errors with phred score, high relatedness of erroneous sequences with the parental sequence, dominance of specific nt substitutions, and skewed ratio of forward to reverse reads among erroneous sequences indicate approaches to filter erroneous sequences from repertoire data sets.


Affiliation: Department of Pathology, St. Jude Children's Research Hospital, 262 Danny Thomas Pl., Memphis, TN 38105, USA.

ABSTRACT

Background: Recent advances in massively parallel sequencing have increased the depth at which T cell receptor (TCR) repertoires can be probed by >3 log10, allowing for saturation sequencing of immune repertoires. The resolution of this sequencing is dependent on its accuracy, and direct assessments of the errors formed during high throughput repertoire analyses are limited.

Results: We analyzed 3 monoclonal TCR from TCR transgenic, Rag-/- mice using Illumina® sequencing. A total of 27 sequencing reactions were performed for each TCR using a trifurcating design in which samples were divided into 3 at significant processing junctures. More than 20 million complementarity determining region (CDR) 3 sequences were analyzed. Filtering for lower quality sequences diminished but did not eliminate sequence errors, which occurred within 1-6% of sequences. Erroneous sequences were predominantly of correct length and contained single nucleotide substitutions. Rates of specific substitutions varied dramatically in a position-dependent manner. Four substitutions, all purine-pyrimidine transversions, predominated. Solid phase amplification and sequencing rather than liquid sample amplification and preparation appeared to be the primary sources of error. Analysis of polyclonal repertoires demonstrated the impact of error accumulation on data parameters.

Conclusions: Caution is needed in interpreting repertoire data due to potential contamination with mis-sequence reads. However, a high association of errors with phred score, high relatedness of erroneous sequences with the parental sequence, dominance of specific nt substitutions, and skewed ratio of forward to reverse reads among erroneous sequences indicate approaches to filter erroneous sequences from repertoire data sets.
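The Results describe erroneous reads as mostly correct-length sequences carrying single nucleotide substitutions, dominated by purine-pyrimidine transversions. The following is a minimal sketch of how such substitutions could be listed and classified against a known parental sequence. The function names and the example sequences are hypothetical and are not taken from the article.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_substitution(ref_base, read_base):
    """Return 'transition' or 'transversion' for a single-base change."""
    ref_base, read_base = ref_base.upper(), read_base.upper()
    if ref_base not in BASES or read_base not in BASES:
        raise ValueError("only A, C, G and T are supported")
    if ref_base == read_base:
        raise ValueError("bases are identical, no substitution")
    pair = {ref_base, read_base}
    # Purine<->purine or pyrimidine<->pyrimidine is a transition;
    # any purine<->pyrimidine change is a transversion.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def substitutions(parental, read):
    """List (position, ref, alt, class) for two equal-length sequences."""
    if len(parental) != len(read):
        raise ValueError("sequences must have the same length")
    return [
        (i, p, r, classify_substitution(p, r))
        for i, (p, r) in enumerate(zip(parental, read))
        if p != r
    ]


# Hypothetical parental sequence and a read with one mismatch.
print(substitutions("ACGT", "ACTT"))
```

Tabulating the third and fourth fields of each tuple by position would give the kind of position-dependent substitution profile the abstract refers to.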


Figure 9: Analysis of polyclonal C57BL/6 repertoires. In 2 independent analyses, C57BL/6 splenocytes were sorted into CD4+GFP-Foxp3- and CD4+GFP-Foxp3+ populations and the Vβ8.2 TCR repertoire was analyzed. Frequency of total (A) and unique (B) sequences acquired for each analysis, without or with filtering of sequences at q = 30. For each unique sequence acquired, sequences present at lower frequency with a single nt mismatch were tabulated. For the 20 most frequent sequences in each cohort, the total number of single nt mismatch sequences present at less than the indicated frequency (abscissa) relative to each corresponding high frequency index sequence was tallied. The total numbers of these presumed erroneous sequences for the Foxp3- (C) and Foxp3+ (D) populations, either analyzed without filtering or filtered at q = 30, are plotted (ordinate). Results demonstrate a decreased number of presumed erroneous sequences after applying a q = 30 filter. (E) For each unique sequence, the total number of other unique sequences present at a lower frequency and with a single nt mismatch was tallied. The number of these single mismatch sequences was summed for all sequences within each cohort with or without q = 30 filtering. (F) ACE (abundance-based coverage estimator) values were calculated as estimates of total repertoire diversity in populations either with or without q = 30 filtering.
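The tally described for panel E can be sketched as follows: for each unique sequence, count the other unique sequences that occur at lower frequency and differ by exactly one nucleotide, then sum over the cohort. This is an illustrative sketch only. The function names and the toy counts are assumptions, and the pairwise comparison is quadratic in the number of unique sequences, so a real data set of the size reported would need an indexed approach.

```python
from collections import Counter


def is_single_mismatch(a, b):
    """True if a and b have equal length and differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def single_mismatch_neighbors(counts):
    """Map each unique sequence to the number of lower-frequency
    unique sequences that differ from it by a single nucleotide."""
    seqs = list(counts)
    return {
        s: sum(
            1
            for t in seqs
            if counts[t] < counts[s] and is_single_mismatch(s, t)
        )
        for s in seqs
    }


# Toy read counts (hypothetical, not data from the article).
counts = Counter({"TGTGCC": 100, "TGTGCA": 3, "TGAGCC": 2, "AAAAAA": 50})
per_sequence = single_mismatch_neighbors(counts)
cohort_total = sum(per_sequence.values())
print(per_sequence, cohort_total)
```

Running the same tally on a cohort before and after quality filtering gives the two values compared in the panel.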

Mentions: To examine the impact of sequence filtering on intact repertoires, we assessed TRBV13-2+ sequences of 2 samples each of flow cytometrically purified CD4+Foxp3+ or CD4+Foxp3- T cell TCR from C57BL/6 mice. Total sequence numbers varied from 136,716 to 779,107 and unique sequences from 34,449 to 158,886 in the different analyses in the absence of phred-based filtering (Figure 9a, b). Application of a q = 30 filter reduced the total sequence numbers by 49.4 ± 7.7% and unique sequence numbers by 45.0 ± 8.1%.
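As a rough illustration of phred-based filtering, the sketch below converts a phred score to an error probability (Q = -10 log10 p, so q = 30 corresponds to p = 0.001) and keeps only reads that meet a threshold. The excerpt does not say whether the q = 30 filter was applied to the minimum, the mean, or some other summary of per-base quality; requiring every base to reach the threshold is an assumption made here, and all names and example reads are hypothetical.

```python
def error_probability(q):
    """Per-base error probability implied by a phred score q."""
    return 10 ** (-q / 10)


def passes_filter(qualities, threshold=30):
    """Assumption: a read passes only if every base has phred >= threshold."""
    return bool(qualities) and min(qualities) >= threshold


def filter_reads(reads, threshold=30):
    """Keep (sequence, qualities) pairs that pass the filter."""
    return [(seq, quals) for seq, quals in reads if passes_filter(quals, threshold)]


def percent_reduction(before, after):
    """Percentage of items removed by filtering."""
    return 100.0 * (before - after) / before


# Hypothetical reads: (sequence, per-base phred scores).
reads = [
    ("TGTGCC", [38, 37, 36, 35, 34, 33]),
    ("TGTGCA", [38, 37, 36, 35, 34, 12]),  # low-quality final base
    ("TGAGCC", [38, 37, 30, 35, 34, 33]),
    ("TGTGCC", [25, 37, 36, 35, 34, 33]),  # low-quality first base
]
kept = filter_reads(reads)
print(len(kept), percent_reduction(len(reads), len(kept)))
```

Computing `percent_reduction` separately for total reads and for unique sequences mirrors the two reductions reported in the text.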

