ConDeTri--a content dependent read trimmer for Illumina data.

Smeds L, Künstner A - PLoS ONE (2011)

Bottom Line: During the last few years, DNA and RNA sequencing have started to play an increasingly important role in biological and medical applications, especially due to the greater amount of sequencing data yielded from the new sequencing machines and the enormous decrease in sequencing costs. ConDeTri is able to trim and remove reads with low quality scores to save computational time and memory usage during de novo assemblies. Low coverage or large genome sequencing projects will especially gain from trimming reads.


Affiliation: Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden.

ABSTRACT

During the last few years, DNA and RNA sequencing have started to play an increasingly important role in biological and medical applications, especially due to the greater amount of sequencing data yielded from the new sequencing machines and the enormous decrease in sequencing costs. In particular, Illumina/Solexa sequencing has had an increasing impact on gathering data from model and non-model organisms. However, accurate and easy-to-use tools for quality filtering have not yet been established. We present ConDeTri, a method for content-dependent read trimming of next-generation sequencing data that uses the quality score of each individual base. The main focus of the method is to remove sequencing errors from reads so that sequencing reads can be standardized. Another aim is to incorporate read trimming into next-generation sequencing data processing and analysis pipelines. The method can process single-end and paired-end sequence data of arbitrary length, and it is independent of sequencing coverage and user interaction. ConDeTri is able to trim and remove reads with low quality scores to save computational time and memory usage during de novo assemblies. Low-coverage or large-genome sequencing projects will especially gain from trimming reads. The method can easily be incorporated into preprocessing and analysis pipelines for Illumina data.

Availability and implementation: Freely available on the web at http://code.google.com/p/condetri.
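
The abstract describes trimming reads on the basis of per-base quality scores and removing reads of low quality, but it does not spell out the procedure. The following is only a minimal sketch of generic quality-based 3' trimming, not ConDeTri's actual algorithm. The function names (`phred_scores`, `trim_3prime`), the thresholds, and the Phred+33 quality encoding are all assumptions made for illustration; older Illumina pipelines used a different offset.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII-encoded FASTQ quality string into integer Phred scores.

    The offset of 33 is an assumption (Sanger-style encoding).
    """
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20, min_length=4, offset=33):
    """Remove low-quality bases from the 3' end of a read.

    Hypothetical helper: bases are dropped from the 3' end for as long as
    their Phred score is below min_quality. Returns the trimmed
    (sequence, quality_string) pair, or None when the trimmed read is
    shorter than min_length and should be discarded.
    """
    if len(sequence) != len(quality_string):
        raise ValueError("sequence and quality string differ in length")
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    if end < min_length:
        return None
    return sequence[:end], quality_string[:end]


if __name__ == "__main__":
    # 'I' encodes Phred 40 and '#' encodes Phred 2 under the assumed offset.
    print(trim_3prime("ACGTACGT", "IIIIII##"))  # ('ACGTAC', 'IIIIII')
    print(trim_3prime("ACGT", "I###"))          # None (too short after trimming)
```

Discarding reads that fall below a minimum length after trimming is one simple way to reduce the amount of data passed to an assembler, in line with the savings in time and memory mentioned in the abstract.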

Figure 1 (pone-0026314-g001): Recall and accuracy for different data sets. The proportion of the genome covered (recall) and the proportion of the assembly mapped onto the genome (accuracy) for untrimmed reads (squares), ConDeTri (circles), Bwa (downward-pointing triangles), SolexaQA (upward-pointing triangles) and Quake (diamonds). Open symbols denote the full data set, and crossed symbols the reduced data. The upper panel shows results for SRR063698, the lower panel for SRR063699.

Mentions: For the ‘higher quality’ D. melanogaster data set (NCBI SRA: SRR063698), the assembly using no trimming yielded the best accuracy (88%) and the best recall (63%, 106.0 Mb of the genome covered). ConDeTri gave quite similar results, with an accuracy of 86% and a recall of 61% (102.8 Mb of the genome covered). Data filtered using Bwa, SolexaQA or Quake gave slightly less accurate assemblies (83%, 81% and 71%, respectively) and also smaller proportions of the genome assembled (recall 59%, 56% and 48%, respectively). Interestingly, the assembly with the longest N50 sizes (Quake) gave the smallest genome assembly. Potentially, mis-assemblies are more common in assemblies with longer N50 sizes. Small assembly mistakes tend to produce continuous sequences from contigs that are not actually located close to each other and can thereby greatly reduce the ability of programs to create longer scaffolds. With the reduced data, the results favored the untrimmed assembly even more strongly (see Fig. 1 and Table 2).
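
The passage above compares assemblies by recall (the proportion of the genome covered), accuracy (the proportion of the assembly mapped onto the genome) and N50. As a hedged illustration of how such figures are computed, and not the authors' evaluation code, the sketch below uses hypothetical function names and made-up toy numbers; N50 follows the common definition of the contig length at which the longest contigs together hold at least half of the assembled bases.

```python
def recall(genome_bases_covered, genome_length):
    """Proportion of the reference genome covered by the assembly."""
    return genome_bases_covered / genome_length


def accuracy(assembly_bases_mapped, assembly_length):
    """Proportion of the assembly that maps onto the reference genome."""
    return assembly_bases_mapped / assembly_length


def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Toy values chosen for illustration only; they are not from the paper.
    print(recall(50, 200))                       # 0.25
    print(accuracy(90, 100))                     # 0.9
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))     # 8
```

Because N50 depends only on contig lengths, a mis-joined contig raises N50 without improving recall or accuracy, which is consistent with the observation that the assembly with the longest N50 covered the least of the genome.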

