Identifying structural variation in haploid microbial genomes from short-read resequencing data using breseq.

Barrick JE, Colburn G, Deatherage DE, Traverse CC, Strand MD, Borges JJ, Knoester DB, Reba A, Meyer AG - BMC Genomics (2014)

Bottom Line: Most current software programs for predicting structural variation from short-read DNA resequencing data disregard information in reads mapping to repeat sequences, and significant post-processing and manual examination of their output is often required to rule out false-positive predictions and precisely describe mutational events. breseq combines predictions of new junctions and deleted chromosomal regions to output biologically relevant descriptions of mutations and their effects on genes. When a reference genome for a closely related strain is available, breseq can discover mutations that may be responsible for important or unintended changes in genomes that might otherwise go undetected.


Affiliation: Department of Molecular Biosciences, Institute for Cellular and Molecular Biology, Center for Systems and Synthetic Biology, Center for Computational Biology and Bioinformatics, The University of Texas at Austin, Austin, TX 78712, USA. jbarrick@cm.utexas.edu.

ABSTRACT

Background: Mutations that alter chromosomal structure play critical roles in evolution and disease, including in the origin of new lifestyles and pathogenic traits in microbes. Large-scale rearrangements in genomes are often mediated by recombination events involving new or existing copies of mobile genetic elements, recently duplicated genes, or other repetitive sequences. Most current software programs for predicting structural variation from short-read DNA resequencing data are intended primarily for use on human genomes. They typically disregard information in reads mapping to repeat sequences, and significant post-processing and manual examination of their output is often required to rule out false-positive predictions and precisely describe mutational events.

Results: We have implemented an algorithm for identifying structural variation from DNA resequencing data as part of the breseq computational pipeline for predicting mutations in haploid microbial genomes. Our method evaluates the support for new sequence junctions present in a clonal sample from split-read alignments to a reference genome, including matches to repeat sequences. Then, it uses a statistical model of read coverage evenness to accept or reject these predictions. Finally, breseq combines predictions of new junctions and deleted chromosomal regions to output biologically relevant descriptions of mutations and their effects on genes. We demonstrate the performance of breseq on simulated Escherichia coli genomes with deletions generating unique breakpoint sequences, new insertions of mobile genetic elements, and deletions mediated by mobile elements. Then, we reanalyze data from an E. coli K-12 mutation accumulation evolution experiment in which structural variation was not previously identified. Transposon insertions and large-scale chromosomal changes detected by breseq account for ~25% of spontaneous mutations in this strain. In all cases, we find that breseq is able to reliably predict structural variation with modest read-depth coverage of the reference genome (>40-fold).

Conclusions: Using breseq to predict structural variation should be useful for studies of microbial epidemiology, experimental evolution, synthetic biology, and genetics when a reference genome for a closely related strain is available. In these cases, breseq can discover mutations that may be responsible for important or unintended changes in genomes that might otherwise go undetected.


Figure 6: Missing coverage evidence. a) The censored fit of read depth at sites with unique-only coverage across the reference genome to a negative binomial distribution is shown for one of the E. coli samples from the mutation accumulation evolution experiment. The threshold for extending putative deleted regions of the genome is determined by taking the coverage value that produces a left-tail probability from the fit distribution as described in the text (arrow). b) A missing coverage evidence item is shown for the same E. coli sample to illustrate how its boundaries are determined by extending outward from a seed region with zero coverage of uniquely aligned reads through regions with multiply-mapped reads that match genomic repeat sequences until the coverage of uniquely aligned reads exceeds the calculated propagation threshold. Note that the left and right boundaries both correspond to a range of positions because they fall within repeat regions. In some cases, this type of ambiguity in the extent of the deletion can be resolved by examining new junction evidence matching the endpoints.
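The boundary determination described in panel b can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not breseq's implementation: the function names `extend_deletion` and `_extend`, the per-position arrays `unique_cov` and `repeat_cov`, the zero-based toy coordinates, and the threshold value of 3 are all hypothetical. The rule used to report a boundary as a range (walking back across positions covered by multiply-mapped reads) is a simplification chosen for the example.

```python
def _extend(unique_cov, repeat_cov, start, step, threshold):
    """Walk outward from `start` in direction `step` (-1 = left, +1 = right).

    Returns (outer, inner). `outer` is the farthest position reached before
    unique coverage exceeds `threshold`. `inner` is found by walking back
    toward the seed across positions covered by multiply-mapped reads, so
    outer..inner is the span in which the true boundary may lie.
    """
    pos = start
    while (0 <= pos + step < len(unique_cov)
           and unique_cov[pos + step] <= threshold):
        pos += step
    outer = inner = pos
    while inner != start and repeat_cov[inner] > 0:
        inner -= step
    return outer, inner


def extend_deletion(unique_cov, repeat_cov, seed, threshold):
    """Extend a zero-coverage seed (start, end) to its left and right boundaries.

    Returns ((left_outer, left_inner), (right_inner, right_outer)).
    A boundary with outer == inner is unambiguous.
    """
    seed_start, seed_end = seed
    left_outer, left_inner = _extend(unique_cov, repeat_cov, seed_start, -1, threshold)
    right_outer, right_inner = _extend(unique_cov, repeat_cov, seed_end, +1, threshold)
    return (left_outer, left_inner), (right_inner, right_outer)


# Toy data (hypothetical): a deletion flanked on both sides by repeat sequence.
unique_cov = [30, 28, 0, 0, 0, 0, 0, 0, 0, 0, 31, 29]
repeat_cov = [0, 0, 12, 15, 0, 0, 0, 0, 10, 11, 0, 0]

left, right = extend_deletion(unique_cov, repeat_cov, seed=(5, 6), threshold=3)
print(left, right)  # (2, 4) (7, 9)
```

In this toy case both boundaries come back as ranges because they fall inside positions covered only by multiply-mapped reads, which mirrors the ambiguity noted in the caption.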

Mentions: First, to calibrate whether any given coverage evenness score is unusually low with respect to the expectation for a typical position in the reference genome, breseq fits the distribution of coverage read depth across the genome and determines the average chance that at least one read starts at a given position of the reference sequence. These parameters are estimated from the initial best mappings of reads to the reference genome before resolving candidate junctions. For short-read resequencing data, the depth of read coverage at different positions in the reference genome is fit well by an overdispersed Poisson (negative binomial) distribution [29]. breseq fits this distribution using unique-only reference genome positions (those not matched by any read that maps to multiple reference locations equally well) (Figure 6a). Before fitting, the data are left-censored at half the average coverage to account for positions that are truly deleted in the sample but may retain a small amount of residual coverage, either from incorrect mapping of reads with errors or from cross-contamination when similar genome samples without the deletion are sequenced at the same time. The data are also right-censored at 1.5 times the average coverage. This makes the fit more robust both to cases where failed sequencing reads spuriously map to a small number of genomic locations, creating anomalously high coverage, and to cases where copy number is increased in the sample relative to the reference across a significant portion of the genome. The negative binomial distribution is described by the mean coverage (μcov) and a size parameter (αcov) reflecting the overdispersion. As coverage is tabulated, the number of unique-only positions with no reads beginning there that match a given strand (forward or reverse) is also tracked for each reference sequence. Dividing this total count by twice the number of unique-only positions in a reference sequence gives the average chance that no read will be found to start at a given position extending across a breakpoint, i.e., the chance that a possible position where a read could have started will contribute to the evenness score for a junction.
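The calibration steps in this paragraph can be illustrated with a short, standard-library-only Python sketch. It is a hedged example rather than breseq's code: all function names are hypothetical, `mu` and `size` stand in for μcov and αcov, the censored log-likelihood is the textbook form (censored observations contribute tail probabilities), and the toy depths are invented. The excerpt does not state which left-tail probability sets the propagation threshold, so `tail_prob` is left as a caller-supplied placeholder.

```python
import math


def nb_logpmf(k, mu, size):
    """Log-probability of depth k under a negative binomial (mean mu, size)."""
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(size / (size + mu))
            + k * math.log(mu / (size + mu)))


def nb_cdf(k, mu, size):
    """P(X <= k); 0.0 when k is negative."""
    return sum(math.exp(nb_logpmf(i, mu, size)) for i in range(math.floor(k) + 1))


def censor(depths):
    """Split depths at 0.5x and 1.5x the average coverage."""
    avg = sum(depths) / len(depths)
    lo, hi = 0.5 * avg, 1.5 * avg
    kept = [d for d in depths if lo <= d <= hi]
    n_left = sum(d < lo for d in depths)
    n_right = sum(d > hi for d in depths)
    return lo, hi, n_left, kept, n_right


def censored_loglik(mu, size, lo, hi, n_left, kept, n_right):
    """Log-likelihood where censored depths contribute only tail mass."""
    tiny = 1e-300  # guards against log(0) from floating-point rounding
    ll = sum(nb_logpmf(d, mu, size) for d in kept)
    p_left = nb_cdf(math.ceil(lo) - 1, mu, size)      # P(X < lo)
    p_right = 1.0 - nb_cdf(math.floor(hi), mu, size)  # P(X > hi)
    ll += n_left * math.log(max(p_left, tiny))
    ll += n_right * math.log(max(p_right, tiny))
    return ll


def left_tail_threshold(mu, size, tail_prob):
    """Smallest depth k with P(X <= k) >= tail_prob (tail_prob is an assumption)."""
    k, cum = 0, math.exp(nb_logpmf(0, mu, size))
    while cum < tail_prob:
        k += 1
        cum += math.exp(nb_logpmf(k, mu, size))
    return k


def no_read_start_fraction(n_empty_strand_positions, n_unique_positions):
    """Per-strand empty start positions divided by twice the unique-only positions."""
    return n_empty_strand_positions / (2 * n_unique_positions)


# Toy unique-only depths (hypothetical), fit by a coarse grid search.
depths = [0, 1, 38, 40, 42, 41, 39, 120]
lo, hi, n_left, kept, n_right = censor(depths)
grid = [(mu, size) for mu in (30, 40, 50) for size in (5, 20, 80)]
best_mu, best_size = max(
    grid, key=lambda ms: censored_loglik(ms[0], ms[1], lo, hi, n_left, kept, n_right))
print(best_mu, best_size, left_tail_threshold(best_mu, best_size, tail_prob=0.01))
```

A grid search is used here only to keep the example dependency-free; a numerical optimizer over the same log-likelihood would be the more usual choice in practice.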

