Identifying structural variation in haploid microbial genomes from short-read resequencing data using breseq.

Barrick JE, Colburn G, Deatherage DE, Traverse CC, Strand MD, Borges JJ, Knoester DB, Reba A, Meyer AG - BMC Genomics (2014)

Bottom Line: Most current software programs for predicting structural variation typically disregard information in reads mapping to repeat sequences, and significant post-processing and manual examination of their output is often required to rule out false-positive predictions and precisely describe mutational events. Finally, breseq combines predictions of new junctions and deleted chromosomal regions to output biologically relevant descriptions of mutations and their effects on genes. In these cases, breseq can discover mutations that may be responsible for important or unintended changes in genomes that might otherwise go undetected.


Affiliation: Department of Molecular Biosciences, Institute for Cellular and Molecular Biology, Center for Systems and Synthetic Biology, Center for Computational Biology and Bioinformatics, The University of Texas at Austin, Austin, TX 78712, USA. jbarrick@cm.utexas.edu.

ABSTRACT

Background: Mutations that alter chromosomal structure play critical roles in evolution and disease, including in the origin of new lifestyles and pathogenic traits in microbes. Large-scale rearrangements in genomes are often mediated by recombination events involving new or existing copies of mobile genetic elements, recently duplicated genes, or other repetitive sequences. Most current software programs for predicting structural variation from short-read DNA resequencing data are intended primarily for use on human genomes. They typically disregard information in reads mapping to repeat sequences, and significant post-processing and manual examination of their output is often required to rule out false-positive predictions and precisely describe mutational events.

Results: We have implemented an algorithm for identifying structural variation from DNA resequencing data as part of the breseq computational pipeline for predicting mutations in haploid microbial genomes. Our method evaluates the support for new sequence junctions present in a clonal sample from split-read alignments to a reference genome, including matches to repeat sequences. Then, it uses a statistical model of read coverage evenness to accept or reject these predictions. Finally, breseq combines predictions of new junctions and deleted chromosomal regions to output biologically relevant descriptions of mutations and their effects on genes. We demonstrate the performance of breseq on simulated Escherichia coli genomes with deletions generating unique breakpoint sequences, new insertions of mobile genetic elements, and deletions mediated by mobile elements. Then, we reanalyze data from an E. coli K-12 mutation accumulation evolution experiment in which structural variation was not previously identified. Transposon insertions and large-scale chromosomal changes detected by breseq account for ~25% of spontaneous mutations in this strain. In all cases, we find that breseq is able to reliably predict structural variation with modest read-depth coverage of the reference genome (>40-fold).

Conclusions: Using breseq to predict structural variation should be useful for studies of microbial epidemiology, experimental evolution, synthetic biology, and genetics when a reference genome for a closely related strain is available. In these cases, breseq can discover mutations that may be responsible for important or unintended changes in genomes that might otherwise go undetected.



Figure 8: Performance of structural variant prediction on simulated Illumina data sets. Data sets with different read lengths and coverage depths were generated according to an Illumina error model from simulated E. coli reference sequences with many examples of a single type of mutation causing structural variation randomly introduced. Results in terms of the sensitivity (or recall) for recovering true-positives (top panels) and the precision, equal to the number of true-positive predictions over the total number of predictions (bottom panels), are graphed as a function of junction skew scores accepted for making predictions. Results are shown for simulated genomes containing only a) deletions with breakpoints in non-repetitive reference genome sequences, b) new insertions of bacterial transposable sequences (IS elements), and c) deletions with one boundary ending on a repetitive IS element. The default junction skew score cutoff used by breseq is 3.0.

Mentions: We evaluated the sensitivity (fraction of true-positive junctions predicted) and precision (number of true-positive predictions divided by the total number of junction predictions) for each simulated data set (Figures 8 and 9). All analyses used the default parameters except for the simulated 400-base 454 reads with the new mobile element insertions, for which the maximum cumulative length of junction candidate sequences was increased to 25% of the reference genome length and the minimum number of candidate junctions tested was increased to 250 so that enough of the longer junction candidates were tested to enable evaluation of all 200 true-positive junctions. Predictions of these mutations by breseq were highly specific and sensitive for all technologies and read lengths when there was at least 20-fold coverage of the reference genome. These results show that the skew score provides a good statistical cutoff at the default significance level (p = 0.001) for predicting the new sequence junctions created by all three categories of mutations.
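
The two metrics defined above can be illustrated with a minimal Python sketch. It is not part of breseq. The `Prediction` type, the junction identifiers, and the example scores are all hypothetical. The sketch also assumes that a junction is accepted when its skew score is at or below the cutoff, a direction the text above does not state explicitly.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Prediction(NamedTuple):
    """Hypothetical record for one candidate junction (not a breseq type)."""
    junction_id: str
    skew_score: float


def sensitivity_precision(
    predictions: Iterable[Prediction],
    true_junctions: Set[str],
    cutoff: float = 3.0,  # default skew score cutoff quoted in the caption
) -> Tuple[float, float]:
    """Return (sensitivity, precision) for junctions accepted at `cutoff`.

    ASSUMPTION: junctions with skew_score <= cutoff are accepted.
    """
    accepted = {p.junction_id for p in predictions if p.skew_score <= cutoff}
    true_positives = len(accepted & true_junctions)
    # Sensitivity: fraction of true junctions that were predicted.
    sensitivity = (
        true_positives / len(true_junctions) if true_junctions else float("nan")
    )
    # Precision: true-positive predictions over all predictions made.
    precision = true_positives / len(accepted) if accepted else float("nan")
    return sensitivity, precision


# Made-up example: four true junctions, five candidate predictions.
truth = {"J1", "J2", "J3", "J4"}
candidates = [
    Prediction("J1", 0.5),
    Prediction("J2", 2.9),
    Prediction("J3", 3.5),
    Prediction("FP1", 1.0),  # false positive with a low skew score
    Prediction("FP2", 6.0),  # false positive with a high skew score
]

# Sweep the cutoff, as on the horizontal axis of the figure panels.
for cutoff in (1.0, 3.0, 10.0):
    sens, prec = sensitivity_precision(candidates, truth, cutoff)
    print(f"cutoff={cutoff:>4}: sensitivity={sens:.2f} precision={prec:.2f}")
```

Under this assumed acceptance rule, raising the cutoff admits more candidates, so sensitivity can only stay the same or increase, while precision may move in either direction depending on how many of the newly accepted junctions are false positives.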

