KvarQ: targeted and direct variant calling from fastq reads of bacterial genomes.

Steiner A, Stucki D, Coscolla M, Borrell S, Gagneux S - BMC Genomics (2014)

Bottom Line: KvarQ loads "testsuites" that define specific SNPs or short regions of interest in a reference genome, and directly synthesizes the relevant results based on the occurrence of these markers in the fastq files. In this article, we demonstrate how KvarQ can be used to successfully detect all main drug resistance mutations and phylogenetic markers in 880 bacterial whole genome sequences. This enables researchers and laboratory technicians with limited bioinformatics expertise to scan and analyze raw sequencing data in a matter of minutes.


Affiliation: Swiss Tropical and Public Health Institute, Socinstrasse 57, Basel 4051, Switzerland. sebastien.gagneux@unibas.ch.

ABSTRACT

Background: High-throughput DNA sequencing produces vast amounts of data, with millions of short reads that usually have to be mapped to a reference genome or newly assembled. Both reference-based mapping and de novo assembly are computationally intensive, generating large intermediary data files, and thus require bioinformatics skills that are often lacking in the laboratories producing the data. Moreover, many research and practical applications in microbiology require only a small fraction of the whole genome data.

Results: We developed KvarQ, a new tool that directly scans fastq files of bacterial genome sequences for known variants, such as single nucleotide polymorphisms (SNPs), bypassing the need to map all sequencing reads to a reference genome or to assemble them de novo. Instead, KvarQ loads "testsuites" that define specific SNPs or short regions of interest in a reference genome, and directly synthesizes the relevant results based on the occurrence of these markers in the fastq files. KvarQ has a versatile command line interface and a graphical user interface. KvarQ currently ships with two "testsuites" for Mycobacterium tuberculosis, but new "testsuites" for other organisms can easily be created and distributed. In this article, we demonstrate how KvarQ can be used to successfully detect all main drug resistance mutations and phylogenetic markers in 880 bacterial whole genome sequences. The average scanning time per genome sequence was two minutes. The variant calls of a subset of these genomes were validated with a standard bioinformatics pipeline and revealed >99% congruency.

Conclusion: KvarQ is a user-friendly tool that directly extracts relevant information from fastq files. This enables researchers and laboratory technicians with limited bioinformatics expertise to scan and analyze raw sequencing data in a matter of minutes. KvarQ is open-source, and pre-compiled packages with a graphical user interface are available at http://www.swisstph.ch/kvarq.


Figure 1: Simplified overview of scanning process and preparation. “Testsuites” are Python source files that define the SNPs of interest (or in this case a 3 base pair long region) as well as other relevant genetic information (in this case the katG gene, in which mutations can confer isoniazid resistance). This information is used to extract a “target sequence” from a reference MTBC genome: on both sides, additional bases (“flanks”) are added to avoid border effects within the sequence of interest. During the scanning process, every read is trimmed depending on its PHRED score (in this case, a quality cutoff of Q = 13 was defined, which corresponds to the ASCII character “/”). After the scanning, the parts of the reads that matched the target sequence and exceeded the minimum quality score (represented with gray bars) are assembled into a “coverage” that indicates the overall coverage depth as well as all detected mutations (green). In a further step, additional information is generated from this coverage (such as the resulting amino acid sequence), and finally a short “result” string is generated that summarizes the outcome of the scanning process.
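The steps named in the caption (target extraction with flanks, quality trimming, and pile-up into a coverage) can be illustrated with a minimal Python sketch. This is not KvarQ's actual code or API: the function names, the toy reference sequence, the one-mismatch tolerance, and the restriction to reads that fall entirely inside the target are all assumptions made for illustration. It also assumes Phred+33 quality encoding, under which a base exceeds the Q = 13 cutoff when its quality character is "/" or higher.

```python
from collections import Counter

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def extract_target(reference, start, length, flank):
    """Return the region of interest plus up to `flank` bases on each side,
    together with the offset of the region inside the returned target."""
    left = max(0, start - flank)
    right = min(len(reference), start + length + flank)
    return reference[left:right], start - left


def trim_read(bases, quals, cutoff):
    """Keep the longest stretch of bases whose PHRED score exceeds `cutoff`."""
    best_start, best_end = 0, 0
    run_start = 0
    for i, ch in enumerate(quals):
        if ord(ch) - PHRED_OFFSET > cutoff:
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = i + 1  # a low-quality base ends the current stretch
    return bases[best_start:best_end]


def pile_up(target, reads, max_mismatches=1):
    """Place each trimmed read at the first position where it fits the target
    with at most `max_mismatches` differences and count bases per position."""
    coverage = [Counter() for _ in target]
    for read in reads:
        for pos in range(len(target) - len(read) + 1):
            window = target[pos:pos + len(read)]
            if sum(a != b for a, b in zip(read, window)) <= max_mismatches:
                for offset, base in enumerate(read):
                    coverage[pos + offset][base] += 1
                break
    return coverage


# Toy data (hypothetical, not a real MTBC sequence).
reference = "GATTACAGGCTAGCTTCAGGAT"
target, region_offset = extract_target(reference, start=9, length=3, flank=5)

raw_reads = [
    ("CAGGCTAGC", "IIIIIIIII"),      # matches the reference
    ("GGCAAGCTT", "IIIIIIIII"),      # carries a T>A change inside the region
    ("GCTAGCTTCNN", "IIIIIIIII##"),  # low-quality tail gets trimmed
]
trimmed = [trim_read(bases, quals, cutoff=13) for bases, quals in raw_reads]
coverage = pile_up(target, trimmed)

# Report depth and observed bases for the 3 bp region of interest.
for i in range(region_offset, region_offset + 3):
    print(target[i], sum(coverage[i].values()), dict(coverage[i]))
```

The per-position counters play the role of the "coverage" in the figure: their totals give the depth, and any base that differs from the target at that position is a candidate mutation. A real scanner would also have to handle reads that only partially overlap the target and reads from the reverse strand, which this sketch leaves out.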


Mentions: The following paragraphs describe the information flow inside KvarQ from the fastq files to the final results that are displayed to the user (Figure 1).

