Phylogenetically typing bacterial strains from partial SNP genotypes observed from direct sequencing of clinical specimen metagenomic data.

Sahl JW, Schupp JM, Rasko DA, Colman RE, Foster JT, Keim P - Genome Med (2015)

Bottom Line: Sequence reads from unknown samples are aligned to a reference genome where the allele states of known SNPs are determined. The Whole Genome Focused Array SNP Typing (WG-FAST) pipeline can identify unknown strains with much less read data than is needed for genome assembly. To test WG-FAST, we resampled SNPs from real samples to understand the relationship between low coverage metagenomic data and accurate phylogenetic placement.


Affiliation: Department of Pathogen Genomics, Translational Genomics Research Institute, Flagstaff, AZ, USA; Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ 86011, USA.

ABSTRACT
We describe an approach for genotyping bacterial strains from low coverage genome datasets, including metagenomic data from complex samples. Sequence reads from unknown samples are aligned to a reference genome where the allele states of known SNPs are determined. The Whole Genome Focused Array SNP Typing (WG-FAST) pipeline can identify unknown strains with much less read data than is needed for genome assembly. To test WG-FAST, we resampled SNPs from real samples to understand the relationship between low coverage metagenomic data and accurate phylogenetic placement. WG-FAST can be downloaded from https://github.com/jasonsahl/wgfast.
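
The allele-calling step described in the abstract can be illustrated with a minimal sketch. It is not the WG-FAST implementation: the function name, the `(start, sequence)` read representation, the gapless-alignment simplification, and the `min_depth` and `min_fraction` thresholds are all assumptions made for illustration. Positions with no usable coverage are reported as `-`, so the result is a partial SNP genotype.

```python
from collections import Counter


def call_snp_states(alignments, snp_positions, min_depth=1, min_fraction=0.9):
    """Return the allele state at each known SNP position, or '-' if uncalled.

    alignments: iterable of (start, sequence) pairs with a 0-based reference
                start; reads are assumed to align without indels (hypothetical
                simplification).
    """
    wanted = set(snp_positions)
    pileup = {pos: Counter() for pos in wanted}

    # Count the bases observed at each known SNP position.
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos in wanted and base in "ACGT":
                pileup[pos][base] += 1

    states = {}
    for pos in sorted(wanted):
        counts = pileup[pos]
        depth = sum(counts.values())
        if depth < min_depth:
            states[pos] = "-"  # no coverage: allele state unknown
            continue
        base, n = counts.most_common(1)[0]
        # Call the majority base only if it is sufficiently dominant.
        states[pos] = base if n / depth >= min_fraction else "-"
    return states


reads = [(0, "ACGTAC"), (3, "TACGGA"), (4, "ACGGAT")]
print(call_snp_states(reads, [1, 4, 8, 20]))
# {1: 'C', 4: 'A', 8: 'A', 20: '-'}
```

With only three short reads, three of the four SNP positions in this toy example are called and the fourth stays missing, which is the situation that placement from low coverage data has to tolerate.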




Fig. 4: A phylogeny and associated heatmap showing the placement of read subsamples for 100 iterations of three E. coli isolates. The maximum likelihood phylogeny was inferred by RAxML [20] from a concatenation of approximately 225,000 single nucleotide polymorphisms called against the reference genome, K-12 W3110 [30]. Raw reads were randomly sampled from three genomes at four different depths and reference positions were identified. Each genome, at each level, was then inserted into the phylogeny with the evolutionary placement algorithm (EPA) in RAxML [26]. Duplicate placements were removed from the tree and redundancies were represented as a heatmap.
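
The resample-and-tally logic behind such a heatmap can be sketched as follows. This is a toy illustration under loud assumptions: it samples SNP loci directly rather than raw reads, and it replaces RAxML's EPA with a simple nearest-reference rule (fewest mismatches at the called sites). The function names, the three reference genotypes, and the sampling levels are hypothetical.

```python
import random
from collections import Counter


def nearest_reference(partial, references):
    """Stand-in for EPA (assumption): reference with fewest mismatches at called sites."""
    def mismatches(name):
        return sum(1 for pos, base in partial.items() if references[name][pos] != base)
    # Sorting makes ties deterministic: the first name alphabetically wins.
    return min(sorted(references), key=mismatches)


def placement_frequencies(true_genotype, references, n_loci, iterations=100, seed=0):
    """Resample n_loci SNPs `iterations` times and tally where each subsample lands."""
    rng = random.Random(seed)
    positions = sorted(true_genotype)
    tally = Counter()
    for _ in range(iterations):
        sampled = rng.sample(positions, n_loci)
        partial = {pos: true_genotype[pos] for pos in sampled}
        tally[nearest_reference(partial, references)] += 1
    # One heatmap row: fraction of iterations placed at each reference.
    return {name: tally[name] / iterations for name in sorted(references)}


# Invented genotypes: ref_A and ref_B differ at a single SNP, ref_C is distant.
references = {
    "ref_A": "AAAAAAAAAA",
    "ref_B": "AAAAAAAAAC",
    "ref_C": "CCCCCCCCCC",
}
unknown = dict(enumerate(references["ref_B"]))  # the true source is ref_B

for n_loci in (2, 5, 10):
    print(n_loci, placement_frequencies(unknown, references, n_loci))
```

In this toy setup a subsample that misses the single distinguishing SNP ties between `ref_A` and `ref_B` and is assigned to `ref_A`, a near-miss within a closely related clade. With all ten loci genotyped the unknown is always placed at `ref_B`.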

Mentions: WG-FAST was designed for metagenomic datasets in which the number of reads mapping to a reference genome is variable and difficult to control or predict. Placement accuracy improves with more reads, but it also depends on whether those reads align to known SNP positions so that the allelic state can be determined. We resampled between 200 and 2,000 raw reads from genomes representing each major group in the phylogeny (Additional file 4), aligned them to the reference genome, and determined the SNP allele states at positions with mapped reads. These limited genotypic data were resampled 100 times each and placed with RAxML onto the reference tree, and the frequency of each placement was represented as a heatmap (Fig. 4). The patristic distances of all subsamples from the ‘correct’ placement demonstrate that genotyping additional SNP loci improves the quality of the placement (Additional file 9). Placement accuracy increases with larger read resamples but, as with SNP resampling (Fig. 3), the relationship depends on phylogenetic position. Clades with many closely related genomes complicate exact positioning of an unknown, especially with smaller read datasets, though near-misses are very common and might be sufficient for some studies. As one demonstration of the potential, the O157:H7 Sakai genome could be accurately placed on the tree >95% of the time with as few as 360 SNP loci genotyped from only 50 Illumina MiSeq (2 × 250 bp) read pair alignments (Additional file 4).
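
The patristic distance used above to score a placement is the sum of branch lengths along the path between two nodes of the tree. The sketch below computes it on a small hand-built tree; the parent-pointer representation, the tip names other than Sakai, and all branch lengths are invented for illustration and are not taken from the paper.

```python
def patristic_distance(tree, a, b):
    """Sum of branch lengths on the path between nodes a and b.

    tree maps node -> (parent, branch_length); the root maps to (None, 0.0).
    """
    # Distance from a to each of its ancestors (including a itself).
    up_a = {}
    total, node = 0.0, a
    while node is not None:
        up_a[node] = total
        parent, length = tree[node]
        total += length
        node = parent

    # Climb from b until reaching an ancestor shared with a.
    total, node = 0.0, b
    while node not in up_a:
        parent, length = tree[node]
        total += length
        node = parent
    return total + up_a[node]


# Hypothetical tree with made-up branch lengths.
tree = {
    "root": (None, 0.0),
    "clade_X": ("root", 0.1),
    "Sakai": ("clade_X", 0.02),
    "close_relative": ("clade_X", 0.03),
    "distant_tip": ("root", 0.2),
}

near_miss = patristic_distance(tree, "Sakai", "close_relative")
far_miss = patristic_distance(tree, "Sakai", "distant_tip")
print(near_miss < far_miss)
```

A placement next to a close relative of the correct tip gives a small patristic distance, while a placement in a different clade gives a large one, so the distance separates near-misses from gross misplacements.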

