TagDust2: a generic method to extract reads from sequencing data.

Lassmann T - BMC Bioinformatics (2015)

Bottom Line: TagDust2 extracts more reads of higher quality compared to other approaches. Taken together, TagDust2 is a feature-rich, flexible and adaptive solution to go from raw to mappable NGS reads in a single step. The ability to recognize and record the contents of raw reads will help to automate and demystify the initial, and often poorly documented, steps in NGS data analysis pipelines.


Affiliation: RIKEN Center for Life Science Technologies (CLST), RIKEN Yokohama Institute, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045, Kanagawa, Japan. timolassmann@gmail.com.

ABSTRACT

Background: Arguably the most basic step in the analysis of next generation sequencing (NGS) data involves the extraction of mappable reads from the raw reads produced by sequencing instruments. The presence of barcodes, adaptors and artifacts, all subject to sequencing errors, makes this step non-trivial.

Results: Here I present TagDust2, a generic approach utilizing a library of hidden Markov models (HMMs) to accurately extract reads from a wide array of possible read architectures. TagDust2 extracts more reads of higher quality compared to other approaches. Processing of multiplexed single-end and paired-end libraries, as well as libraries containing unique molecular identifiers, is fully supported. Two additional post-processing steps are included to exclude known contaminants and filter out low-complexity sequences. Finally, TagDust2 can automatically detect the library type of sequenced data from a predefined selection.

Conclusion: Taken together, TagDust2 is a feature-rich, flexible and adaptive solution to go from raw to mappable NGS reads in a single step. The ability to recognize and record the contents of raw reads will help to automate and demystify the initial, and often poorly documented, steps in NGS data analysis pipelines. TagDust2 is freely available at http://tagdust.sourceforge.net.

Figure 6: Sample-to-sample correlation can be improved by using unique molecular identifiers (UMIs). The right panel shows the correlation between samples amplified using 15 and 25 PCR cycles without using the UMI sequences. The left panel shows the same data after collapsing all reads that map to the same location and contain the same UMI. TagDust correctly identified PCR artifacts based on their UMI sequences.

Mentions: In the second case, each read contains a random 10-nucleotide unique molecular identifier (UMI). Finding the same UMI associated with reads mapping to the same location is a good indicator that these reads are PCR duplicates. TagDust2 automatically recognizes UMI sequences and converts them into a unique number. To understand whether the UMIs actually help in reducing technical noise caused by PCR amplification, I compared libraries amplified using either 15 or 25 PCR cycles. After collapsing reads mapping to the same region with the same UMI, the sample-to-sample correlation could be improved (Figure 6). More importantly, TagDust2 was able to extract the reads using a short command line.
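
The UMI-based collapsing described above can be illustrated with a minimal Python sketch. It is not TagDust2's implementation: the base-4 encoding of the UMI, the function names `umi_to_int` and `collapse_pcr_duplicates`, and the representation of a mapped read as a `(chrom, pos, strand, umi)` tuple are all assumptions made for illustration. UMIs containing ambiguous bases such as N are not handled.

```python
from collections import defaultdict

# Hypothetical 2-bit code per nucleotide (an assumption, not TagDust2's scheme).
_BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def umi_to_int(umi):
    """Encode a UMI string as a unique integer by reading it as a base-4 number."""
    value = 0
    for base in umi.upper():
        value = value * 4 + _BASE_CODE[base]
    return value


def collapse_pcr_duplicates(reads):
    """Count distinct UMIs per mapping location.

    `reads` is an iterable of (chrom, pos, strand, umi) tuples. Reads that
    share both the location and the UMI are treated as PCR duplicates and
    counted once.
    """
    umis_at_location = defaultdict(set)
    for chrom, pos, strand, umi in reads:
        umis_at_location[(chrom, pos, strand)].add(umi_to_int(umi))
    return {loc: len(umis) for loc, umis in umis_at_location.items()}


if __name__ == "__main__":
    example_reads = [
        ("chr1", 100, "+", "ACGTACGTAC"),
        ("chr1", 100, "+", "ACGTACGTAC"),  # same location and UMI: duplicate
        ("chr1", 100, "+", "TTTTTTTTTT"),  # same location, different UMI
        ("chr2", 5, "-", "ACGTACGTAC"),    # same UMI, different location
    ]
    print(collapse_pcr_duplicates(example_reads))
```

Counting distinct UMIs per location rather than raw reads is what removes the dependence on the number of PCR cycles: amplifying a molecule more often adds reads, but not new UMIs.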

