Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform.
Bottom Line: A better knowledge of the error patterns is essential for sequence analysis and vital if we are to draw valid conclusions. Studying true genetic variation in a population sample is fundamental for understanding diseases, evolution and origin. Furthermore, we tested the efficiency of various error correction strategies and identified quality trimming (Sickle) combined with error correction (BayesHammer) followed by read overlapping (PANDAseq) as the most successful approach, reducing substitution error rates on average by 93%.
Affiliation: School of Engineering, University of Glasgow, Glasgow, UK email@example.com.
Mentions: The first column of Figure 3 displays the observed quality scores for all reads. For this data set, we generally encountered very high quality scores for the R1 reads and only slightly lower values for the R2 reads. In the second column of Figure 3, we restricted the boxplot to quality scores associated with substitution errors. Most noticeable is the range of quality scores for substitutions of As and Cs: the average quality score for those error types was only slightly lower than the average quality score observed for the respective base in general. Furthermore, almost all of the quality scores associated with substitutions of C lie between 32 and 35, and 75% of the quality scores associated with substitutions of A are above 32 for the R1 reads (see Figure 3a). The R2 reads showed a larger range for those error types, though a substantial number of errors were still associated with very high quality scores. Erroneous Gs and Ts show, on average, much lower quality values (see Figure 3b). G and T are read by the same laser (green channel), whereas A and C are read on the red channel. Erroneous bases sequenced on the red channel have, on average, very high quality values and therefore cannot be detected based on the reported quality score. We observed the same issue for insertions and deletions. In R1 reads, 75% of the indels showed quality scores of 35 and above. In R2 reads, the same was true for deletions, whereas for insertions the average quality score dropped just below 35. The last column of Figure 3 shows the position-specific substitution quality profiles and suggests a correlation between the position of an error and its quality value: errors occurring at the start and in the middle of a read generally had much higher quality scores, and the quality value decreased towards the end of the reads.
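To illustrate the kind of tabulation behind such a plot, the following minimal Python sketch collects the Phred quality scores found at substitution positions and converts a Phred score Q into its nominal error probability, 10^(-Q/10). It is not the authors' pipeline: the function names `phred_to_error_prob` and `substitution_qualities` are hypothetical, the reads are assumed to be already aligned to the reference without indels, and the toy sequences and scores are made up for illustration.

```python
from collections import defaultdict


def phred_to_error_prob(q):
    """Nominal error probability implied by a Phred score Q: 10^(-Q/10)."""
    return 10 ** (-q / 10)


def substitution_qualities(read, reference, quals):
    """Collect Phred scores at mismatched positions.

    Assumes read, reference and quals have equal length and are aligned
    position by position (no indels). Results are keyed by the pair
    (reference base, called base).
    """
    if not (len(read) == len(reference) == len(quals)):
        raise ValueError("read, reference and quals must have equal length")
    by_substitution = defaultdict(list)
    for ref_base, called_base, q in zip(reference, read, quals):
        if ref_base != called_base:
            by_substitution[(ref_base, called_base)].append(q)
    return dict(by_substitution)


# Toy, made-up data: two substitutions, one with a low and one with a high score.
reference = "ACGTACGT"
read = "ACTTAGGT"
quals = [35, 34, 12, 33, 35, 34, 30, 20]

subs = substitution_qualities(read, reference, quals)
for (ref_base, called_base), scores in sorted(subs.items()):
    probs = [phred_to_error_prob(q) for q in scores]
    print(ref_base, "->", called_base, scores, probs)
```

In this toy example, the C-to-G substitution carries a score of 34, which corresponds to a nominal error probability below 1 in 1,000. This is the situation described above: an erroneous base call that a quality filter based only on the reported score would not remove.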