TagDigger: user-friendly extraction of read counts from GBS and RAD-seq data.

Clark LV, Sacks EJ - Source Code Biol Med (2016)

Bottom Line: Pipelines that do not use a reference genome assign arbitrary names to SNPs, making meta-analysis across projects difficult. A third script, barcode_splitter.py, assists with preparing FASTQ data for deposit in a public archive by splitting FASTQ files by barcode and generating MD5 checksums for the resulting files. TagDigger will run on a laptop with any operating system, does not consume hard drive space with intermediate files, and does not require programming skill to use.

Affiliation: Department of Crop Sciences, University of Illinois at Urbana-Champaign, 1201 W. Gregory Drive, Urbana, IL 61802 USA.

ABSTRACT

Background: In genotyping-by-sequencing (GBS) and restriction site-associated DNA sequencing (RAD-seq), read depth is important for assessing the quality of genotype calls and estimating allele dosage in polyploids. However, existing pipelines for GBS and RAD-seq do not provide read counts in formats that are both accurate and easy to access. Additionally, although existing pipelines allow previously-mined SNPs to be genotyped on new samples, they do not allow the user to manually specify a subset of loci to examine. Pipelines that do not use a reference genome assign arbitrary names to SNPs, making meta-analysis across projects difficult.

Results: We created the software TagDigger, which includes three programs for analyzing GBS and RAD-seq data. The first script, tagdigger_interactive.py, rapidly extracts read counts and genotypes from FASTQ files using user-supplied sets of barcodes and tags. Input and output are in CSV format so that they can be opened by spreadsheet software. Tag sequences can also be imported from the Stacks, TASSEL-GBSv2, TASSEL-UNEAK, or pyRAD pipelines, and a separate file can be imported listing the names of markers to retain. A second script, tag_manager.py, consolidates marker names and sequences across multiple projects. A third script, barcode_splitter.py, assists with preparing FASTQ data for deposit in a public archive by splitting FASTQ files by barcode and generating MD5 checksums for the resulting files.

Conclusions: TagDigger is open-source and freely available software written in Python 3. It uses a scalable, rapid search algorithm that can process over 100 million FASTQ reads per hour. TagDigger will run on a laptop with any operating system, does not consume hard drive space with intermediate files, and does not require programming skill to use.

Fig. 1: Graphical representation of a sequence indexing tree generated by TagDigger. Use of the tree to match sequencing reads to known tags is illustrated. The red read does not match any known tags, and it takes two steps (looking at the first two nucleotides of the read) to make this determination. The blue read matches one of the expected tags, and it takes four steps to make the match. In comparison, if every read were compared to every tag, seven steps (one for each possible tag) would be required for every read. The maximum number of steps required to match a read will always be the length of the longest tag, which is advantageous when there are thousands of possible tags that are each 40–80 nucleotides long.

Mentions: Before processing a FASTQ file, tagdigger_interactive.py recursively builds indexing trees of expected barcode and tag sequences (Fig. 1). These indexing trees enable very rapid matching of sequencing reads to the appropriate barcode and tag. They also enable the software to quickly discard reads that do not match any expected barcode or tag. To save processing time, tagdigger_interactive.py ignores sequencing quality scores, on the assumption that a read with an error is unlikely to match a known tag. TASSEL also ignores quality scores, instead distinguishing alleles from errors based on how frequently they appear in the dataset [6, 14]. Most SNP mining software does not use the entire sequencing read: by default, TASSEL uses only the first 64 nucleotides after the barcode, and other software may truncate reads to retain only the highest-quality portion. The TagDigger search algorithm therefore checks whether the read begins with a barcode and tag, and ignores any nucleotides in the read after the end of the tag.

For any given FASTQ read, the algorithm begins at the first nucleotide of the read and moves along the read one nucleotide at a time, navigating through the indexing tree of barcodes. If the read begins with an expected barcode, and also has the expected restriction cut site after the barcode, the algorithm begins navigating the indexing tree of tags, beginning with the first nucleotide in the read after the restriction site (or at the beginning of the restriction site in the case of enzymes such as ApeKI with variable cut sites). If the read contains an expected tag, the software increments the read count for that barcode*tag combination.
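The search described above can be illustrated with a minimal Python sketch. This is not TagDigger's actual code: the function names (`build_tree`, `prefix_matches`, `count_reads`), the nested-dictionary representation of the tree, and the example barcodes, tags, and cut site are all assumptions made for illustration. The sketch builds the trees with a loop rather than recursively, and it assumes a single fixed cut site, so it does not cover enzymes with variable cut sites such as ApeKI.

```python
END = None  # hypothetical terminal key; a read character is never None


def build_tree(seqs):
    """Build a nested-dict indexing tree from a {name: sequence} mapping."""
    root = {}
    for name, seq in seqs.items():
        node = root
        for nt in seq:
            node = node.setdefault(nt, {})
        node[END] = name  # mark the end of a complete barcode or tag
    return root


def prefix_matches(tree, read, start=0):
    """Yield (name, end_position) for each stored sequence that the read
    begins with at `start`. The walk stops at the first nucleotide that
    has no branch, so non-matching reads are discarded early."""
    node, pos = tree, start
    while True:
        if END in node:
            yield node[END], pos
        if pos >= len(read) or read[pos] not in node:
            return
        node = node[read[pos]]
        pos += 1


def count_reads(reads, barcodes, tags, cut_site):
    """Count reads per (barcode, tag) pair, ignoring quality scores and
    any nucleotides after the end of the tag."""
    barcode_tree = build_tree(barcodes)
    tag_tree = build_tree(tags)
    counts = {}
    for read in reads:
        for sample, bc_end in prefix_matches(barcode_tree, read):
            # The cut site must directly follow the barcode.
            if not read.startswith(cut_site, bc_end):
                continue
            tag_start = bc_end + len(cut_site)
            hit = next(prefix_matches(tag_tree, read, tag_start), None)
            if hit is not None:
                key = (sample, hit[0])
                counts[key] = counts.get(key, 0) + 1
                break
    return counts


# Made-up example data: two samples and two alleles of one locus.
barcodes = {"sampleA": "ACGT", "sampleB": "TTAG"}
tags = {"locus1_a": "AAGGTT", "locus1_b": "AAGGTC"}
reads = [
    "ACGTCAGCAAGGTTGGGG",  # sampleA, locus1_a (trailing bases ignored)
    "TTAGCAGCAAGGTC",      # sampleB, locus1_b
    "ACGTCAGCAAGGTT",      # sampleA, locus1_a
    "GGGGCAGCAAGGTT",      # no expected barcode
    "ACGTCAGCAAGGAA",      # barcode and cut site, but no expected tag
]
print(count_reads(reads, barcodes, tags, "CAGC"))
```

Because each lookup follows a single path through the tree, the number of steps per read is bounded by the length of the longest stored sequence rather than by the number of tags, which is the property Fig. 1 illustrates.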

