Identifying structural variation in haploid microbial genomes from short-read resequencing data using breseq.

Barrick JE, Colburn G, Deatherage DE, Traverse CC, Strand MD, Borges JJ, Knoester DB, Reba A, Meyer AG - BMC Genomics (2014)


Affiliation: Department of Molecular Biosciences, Institute for Cellular and Molecular Biology, Center for Systems and Synthetic Biology, Center for Computational Biology and Bioinformatics, The University of Texas at Austin, Austin, TX 78712, USA. jbarrick@cm.utexas.edu.

ABSTRACT

Background: Mutations that alter chromosomal structure play critical roles in evolution and disease, including in the origin of new lifestyles and pathogenic traits in microbes. Large-scale rearrangements in genomes are often mediated by recombination events involving new or existing copies of mobile genetic elements, recently duplicated genes, or other repetitive sequences. Most current software programs for predicting structural variation from short-read DNA resequencing data are intended primarily for use on human genomes. They typically disregard information in reads mapping to repeat sequences, and significant post-processing and manual examination of their output is often required to rule out false-positive predictions and precisely describe mutational events.

Results: We have implemented an algorithm for identifying structural variation from DNA resequencing data as part of the breseq computational pipeline for predicting mutations in haploid microbial genomes. Our method evaluates the support for new sequence junctions present in a clonal sample from split-read alignments to a reference genome, including matches to repeat sequences. Then, it uses a statistical model of read coverage evenness to accept or reject these predictions. Finally, breseq combines predictions of new junctions and deleted chromosomal regions to output biologically relevant descriptions of mutations and their effects on genes. We demonstrate the performance of breseq on simulated Escherichia coli genomes with deletions generating unique breakpoint sequences, new insertions of mobile genetic elements, and deletions mediated by mobile elements. Then, we reanalyze data from an E. coli K-12 mutation accumulation evolution experiment in which structural variation was not previously identified. Transposon insertions and large-scale chromosomal changes detected by breseq account for ~25% of spontaneous mutations in this strain. In all cases, we find that breseq is able to reliably predict structural variation with modest read-depth coverage of the reference genome (>40-fold).

Conclusions: Using breseq to predict structural variation should be useful for studies of microbial epidemiology, experimental evolution, synthetic biology, and genetics when a reference genome for a closely related strain is available. In these cases, breseq can discover mutations that may be responsible for important or unintended changes in genomes that might otherwise go undetected.



Figure 4: Example of assigning coverage evenness scores to candidate junctions. Reads that align to a candidate new junction sequence may start at many different positions relative to the breakpoint. Reads that do not extend across the breakpoint and any overlap or read-only bases (yellow highlighting) do not unambiguously support the new junction (gray arrows) and are not counted toward the evenness score. The two examples have the same number of reads that support the new junction (black arrows), meaning reads that align across the breakpoint and match the junction better than the reference genome. The example in (a) is well supported because these reads start in many different registers with respect to the breakpoint, as would be expected for a normal reference genome location, whereas the example in (b) has reads beginning at a small number of biased positions with respect to the junction. This coverage evenness score is used to calculate a skew p-value to accept or reject a candidate junction, after also accounting for differences in the maximum number of read start positions that can support each candidate junction. In cases of tandem duplications much shorter than the read length, reads must also extend several “continuation” bases past any unique-only or overlap sequence to count as supporting a junction, as illustrated in Figure 5.

Mentions: To determine which merged junction candidates to test further, breseq next assigns a coverage evenness score to each one (Figure 4). This score is equal to the number of distinct start positions for alignments of reads that extend across the breakpoint far enough to unambiguously support the junction and not the reference sequence. That is, the reads must span any overlap or read-only bases in the junction sequence. If the junction is a short deletion of a few bases in the reference sequence, then a read may be required to extend additional bases that are not accounted for in these values (a continuation length) in order to unambiguously support the junction and count toward the actual or possible coverage evenness score (Figure 5). Each read is counted as starting at the position in the reference genome where its first base matches, so two alignments with the same start and end coordinates but on opposite strands each count once toward this evenness score.
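
The counting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the breseq implementation: the `Alignment` class, the function names, and the coordinate convention (1-based, inclusive positions on the candidate junction sequence) are all hypothetical. Two further assumptions are made here: the continuation length is required on both sides of the ambiguous region, and start positions are keyed by strand as well as coordinate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical read alignment to a candidate junction sequence."""
    left: int    # leftmost aligned position (1-based, inclusive)
    right: int   # rightmost aligned position (inclusive)
    strand: str  # "+" or "-"


def supports_junction(aln, ambig_start, ambig_end, continuation=0):
    """True if the read spans the overlap/read-only bases
    [ambig_start, ambig_end] plus at least one base on each side.
    Assumption: `continuation` extra bases are required on both sides."""
    return (aln.left <= ambig_start - 1 - continuation
            and aln.right >= ambig_end + 1 + continuation)


def evenness_score(alignments, ambig_start, ambig_end, continuation=0):
    """Count distinct start positions among supporting reads."""
    starts = set()
    for aln in alignments:
        if not supports_junction(aln, ambig_start, ambig_end, continuation):
            continue
        # A read starts where its first base matches: the left coordinate
        # on the "+" strand, the right coordinate on the "-" strand.
        first_base = aln.left if aln.strand == "+" else aln.right
        starts.add((first_base, aln.strand))  # assumption: keyed by strand too
    return len(starts)


# Overlap bases occupy positions 50..52 of the junction sequence.
# Both read sets contain five supporting reads and one non-spanning read.
even_reads = [
    Alignment(20, 55, "+"), Alignment(25, 60, "+"), Alignment(30, 65, "+"),
    Alignment(35, 70, "-"), Alignment(40, 75, "-"),
    Alignment(10, 45, "+"),  # stops before the breakpoint: not counted
]
biased_reads = (
    [Alignment(20, 55, "+")] * 3 + [Alignment(40, 75, "-")] * 2
    + [Alignment(55, 90, "+")]  # starts after the breakpoint: not counted
)

print(evenness_score(even_reads, 50, 52))    # 5
print(evenness_score(biased_reads, 50, 52))  # 2
```

As in panels (a) and (b) of Figure 4, the two read sets have the same number of supporting reads but very different scores, because the second set piles up on only two start positions. Converting the score into a skew p-value is not covered by this sketch.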

