High throughput sequencing in mice: a platform comparison identifies a preponderance of cryptic SNPs.

Walter NA, Bottomly D, Laderas T, Mooney MA, Darakjian P, Searles RP, Harrington CA, McWeeney SK, Hitzemann R, Buck KJ - BMC Genomics (2009)

Bottom Line: Polymorphisms result in a high incidence of false positive and false negative results in hybridization-based analyses and hinder the identification of the true variation underlying genetically determined differences in physiology and behavior. Using the same templates on both platforms, we compared realignments and single nucleotide polymorphism (SNP) detection with an 80-fold average read depth across platforms and samples. Furthermore, we confirmed 40 missense SNPs and discovered 36 new missense SNPs.

Affiliation: Research and Development Service, Portland VA Medical Center, Portland, OR, USA. waltern@ohsu.edu

ABSTRACT

Background: Allelic variation is the cornerstone of genetically determined differences in gene expression, gene product structure, physiology, and behavior. However, allelic variation, particularly cryptic (unknown or not annotated) variation, is problematic for follow-up analyses. Polymorphisms result in a high incidence of false positive and false negative results in hybridization-based analyses and hinder the identification of the true variation underlying genetically determined differences in physiology and behavior. Given the proliferation of mouse genetic models (e.g., knockout models, selectively bred lines, heterogeneous stocks derived from standard inbred strains and wild mice) and the wealth of gene expression microarray and phenotypic studies using genetic models, understanding the impact of naturally occurring polymorphisms on these data is critical. With the advent of next-generation, high-throughput sequencing, we are now in a position to determine the extent to which polymorphisms are currently cryptic in such models and to assess their impact on downstream analyses.

Results: We sequenced the two most commonly used inbred mouse strains, DBA/2J and C57BL/6J, across a region of chromosome 1 (171.6 – 174.6 megabases) using two next-generation high-throughput sequencing platforms: Applied Biosystems (SOLiD) and Illumina (Genome Analyzer). Using the same templates on both platforms, we compared realignments and single nucleotide polymorphism (SNP) detection with an 80-fold average read depth across platforms and samples. While public datasets currently annotate 4,527 SNPs between the two strains in this interval, thorough high-throughput sequencing identified a total of 11,824 SNPs in the interval, including 7,663 new SNPs. Furthermore, we confirmed 40 missense SNPs and discovered 36 new missense SNPs.

Conclusion: Comparisons using even two of the best-characterized mouse genetic models, DBA/2J and C57BL/6J, indicate that more than half of naturally occurring SNPs remain cryptic. The magnitude of this problem is compounded when using more divergent or poorly annotated genetic models. These findings warrant full genomic sequencing of the mouse strains used as genetic models.
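The "more than half" figure in the Conclusion can be checked directly from the counts reported in the Results. The following is a minimal sketch of that arithmetic only; the variable names are illustrative and are not taken from the paper.

```python
# Counts as reported in the Results paragraph of the abstract.
total_detected = 11_824     # SNPs identified by high-throughput sequencing
new_snps = 7_663            # detected SNPs not previously annotated (cryptic)
confirmed_missense = 40     # previously annotated missense SNPs that were confirmed
new_missense = 36           # newly discovered missense SNPs

# Detected SNPs that were already annotated.
previously_annotated_detected = total_detected - new_snps

# Share of detected SNPs that were cryptic.
cryptic_fraction = new_snps / total_detected

# Share of the reported missense SNPs that were new.
new_missense_fraction = new_missense / (confirmed_missense + new_missense)

print(f"previously annotated among detected: {previously_annotated_detected}")
print(f"cryptic fraction of detected SNPs:   {cryptic_fraction:.1%}")
print(f"new fraction of missense SNPs:       {new_missense_fraction:.1%}")
```

Under these counts, roughly 65% of the SNPs detected in the interval were not previously annotated, which is consistent with the statement that more than half remain cryptic.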

Figure 3: D2 vs. B6 SNPs and B6 realignment discrepancies vs. reference sequence. Custom HTS data for chromosome 1 (171.6 – 174.6 Mb) reveal D2 vs. B6 PARC SNPs and realignment discrepancies when compared to the B6 reference sequence. A) B6 vs. D2 PARC SNPs identified by Illumina's realignment of Genome Analyzer data (blue circle) and/or Applied Biosystems' realignment of SOLiD data (red circle). D2 vs. B6 SNPs annotated in the Mouse Phenome Database (MPD) are also illustrated for comparison (green circle). Of the 11,824 PARC SNPs found using custom HTS by at least one sequencing method, 7,663 were previously unknown or not annotated as D2 vs. B6 SNPs, demonstrating that public data for D2 vs. B6 SNPs are currently far from complete. B) Identification of 12 and 31 provisional discrepancies in the comparison of the B6 reference sequence to Illumina's (blue circle) and Applied Biosystems' (red circle) realignments; and confirmation (i.e., detected by both HTS platforms) of realignment discrepancies at 29 positions when compared to the reference sequence.

Mentions: Although comparison of our custom HTS B6 data to the B6 reference sequence confirmed the high accuracy of the reference sequence (≥ 99.998%), we nonetheless identified a small number of discrepancies in the realignments to the reference sequence. Upon assembly, the Applied Biosystems and Illumina realignments of the B6 sequence reads differed from the B6 reference sequence at 41 and 60 residues (Table 1), respectively; 29 of these differences were consistent in both datasets when compared to the reference sequence (Figure 3). In addition, based upon realignment discrepancies called by only one platform or the other, we detected 43 provisional realignment discrepancies between our HTS data and the reference sequence. Sequencing on each of the SOLiD and Genome Analyzer platforms was completed independently, with each assembly algorithm requiring a minimum depth of 3 reads in order to confirm an allele call. The quality of the nucleotide in question, as well as the quality of the surrounding base calls, was taken into account. Together, these constraints offer high confidence in the B6 allele calls identified by both platforms that differ from the reference sequence. Overall, this indicates an exceptionally low discrepancy rate (only 0.001 – 0.002%) between HTS data and the reference sequence and also demonstrates the high quality of the sequence data generated in our analyses. Importantly, our custom HTS sequence data and the B6 reference strain sequence were both from RPCI-23 B6 BACs, which were generated from a pool of five female B6 mice from the Jackson Laboratory. Thus, it is extremely unlikely that strain or template inconsistency contributed to the discrepancies in the realignment of our sequence to the B6 reference sequence.
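The discrepancy counts and rates quoted above follow from simple set arithmetic. The sketch below reproduces them under one stated assumption: that the rates are taken over the full 3.0 Mb interval (171.6 – 174.6 Mb), which the text does not say explicitly. Platform labels follow the order given in this paragraph (41 for Applied Biosystems, 60 for Illumina); the variable names are illustrative.

```python
# Discrepancies versus the B6 reference sequence, as reported in the text.
applied_biosystems_total = 41   # SOLiD realignment
illumina_total = 60             # Genome Analyzer realignment
called_by_both = 29             # positions consistent in both datasets

# ASSUMPTION: the denominator is the whole 171.6-174.6 Mb interval (3.0 Mb).
interval_bp = 3_000_000

# Positions called by only one platform (the "provisional" discrepancies).
applied_biosystems_only = applied_biosystems_total - called_by_both
illumina_only = illumina_total - called_by_both
provisional = applied_biosystems_only + illumina_only

# Distinct positions flagged by at least one platform.
any_platform = provisional + called_by_both

# Per-platform discrepancy rate, in percent of the interval.
applied_biosystems_rate_pct = 100 * applied_biosystems_total / interval_bp
illumina_rate_pct = 100 * illumina_total / interval_bp

print(f"single-platform calls: {applied_biosystems_only} and {illumina_only}")
print(f"provisional total:     {provisional}")
print(f"any platform:          {any_platform}")
print(f"rates (%):             {applied_biosystems_rate_pct:.4f}, {illumina_rate_pct:.4f}")
```

With this denominator, the two rates come out at about 0.0014% and 0.0020%, in line with the reported 0.001 – 0.002% range, and the higher rate corresponds to the stated reference accuracy of at least 99.998%.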

