High throughput sequencing in mice: a platform comparison identifies a preponderance of cryptic SNPs.

Walter NA, Bottomly D, Laderas T, Mooney MA, Darakjian P, Searles RP, Harrington CA, McWeeney SK, Hitzemann R, Buck KJ - BMC Genomics (2009)

Bottom Line: Polymorphisms result in a high incidence of false positive and false negative results in hybridization-based analyses and hinder the identification of the true variation underlying genetically determined differences in physiology and behavior. Using the same templates on both platforms, we compared realignments and single nucleotide polymorphism (SNP) detection with an 80-fold average read depth across platforms and samples. Furthermore, we confirmed 40 missense SNPs and discovered 36 new missense SNPs.

Affiliation: Research and Development Service, Portland VA Medical Center, Portland, OR, USA. waltern@ohsu.edu

ABSTRACT

Background: Allelic variation is the cornerstone of genetically determined differences in gene expression, gene product structure, physiology, and behavior. However, allelic variation, particularly cryptic (unknown or not annotated) variation, is problematic for follow-up analyses. Polymorphisms result in a high incidence of false positive and false negative results in hybridization-based analyses and hinder the identification of the true variation underlying genetically determined differences in physiology and behavior. Given the proliferation of mouse genetic models (e.g., knockout models, selectively bred lines, heterogeneous stocks derived from standard inbred strains and wild mice) and the wealth of gene expression microarray and phenotypic studies using genetic models, the impact of naturally occurring polymorphisms on these data is critical. With the advent of next-generation, high-throughput sequencing, we are now in a position to determine to what extent polymorphisms are currently cryptic in such models and how they affect downstream analyses.

Results: We sequenced the two most commonly used inbred mouse strains, DBA/2J and C57BL/6J, across a region of chromosome 1 (171.6-174.6 megabases) using two next-generation high-throughput sequencing platforms: Applied Biosystems (SOLiD) and Illumina (Genome Analyzer). Using the same templates on both platforms, we compared realignments and single nucleotide polymorphism (SNP) detection with an 80-fold average read depth across platforms and samples. While public datasets currently annotate 4,527 SNPs between the two strains in this interval, thorough high-throughput sequencing identified a total of 11,824 SNPs in the interval, including 7,663 new SNPs. Furthermore, we confirmed 40 missense SNPs and discovered 36 new missense SNPs.

Conclusion: Comparisons utilizing even two of the best characterized mouse genetic models, DBA/2J and C57BL/6J, indicate that more than half of naturally-occurring SNPs remain cryptic. The magnitude of this problem is compounded when using more divergent or poorly annotated genetic models. This warrants full genomic sequencing of the mouse strains used as genetic models.
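
The "more than half" figure can be checked directly from the counts reported in the Results and in the Figure 5 text. The following is a minimal sketch of that arithmetic, not the authors' analysis code; the variable names are illustrative, and the number of confirmed SNPs is derived here by subtraction rather than quoted from the article.

```python
# Counts quoted in the abstract and accompanying text.
total_snps = 11824      # SNPs identified by high-throughput sequencing in the interval
new_snps = 7663         # SNPs not previously annotated (cryptic)
new_missense = 36       # newly discovered missense SNPs
new_noncoding = 7627    # new SNPs in non-coding regions

# Derived here by subtraction (an assumption, not a figure stated in the article):
# SNPs found by sequencing that were already known.
confirmed_snps = total_snps - new_snps

# Fraction of the sequenced SNPs that were previously cryptic.
cryptic_fraction = new_snps / total_snps

# Consistency check: the new missense and new non-coding SNPs account for all new SNPs.
assert new_missense + new_noncoding == new_snps

print(f"confirmed SNPs: {confirmed_snps}")
print(f"cryptic fraction: {cryptic_fraction:.1%}")
```

Under these counts roughly 65% of the SNPs detected in the interval were previously unannotated, which is consistent with the conclusion that more than half remain cryptic.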

Figure 5: Kcnj9 B6 vs. D2 PARC SNPs. Kcnj9 is represented with the coding region shown in the gray boxes and the 5' and 3' untranslated regions (UTRs) in the white boxes. Introns are illustrated as V-lines, and the upstream region as a horizontal black line. The blue lines indicate SNPs found in MPD that we confirmed by custom HTS (PARC SNPs). The red lines indicate previously cryptic SNPs, i.e., new PARC SNPs. For Kcnj9 (ENSMUST00000062387), we found nine new SNPs in the 3'-UTR and 18 new SNPs in the 2 kb upstream region, and identified one false positive SNP (green line) in the 3'-UTR. In addition, we confirmed six synonymous coding SNPs, eleven 3'-UTR SNPs, one 5'-UTR SNP, and five SNPs within 2 kb upstream of the transcription start site. Additionally, within the introns, we identified 15 new SNPs and confirmed 28 SNPs (not shown).

Mentions: In addition to the new missense SNPs, we identified 7,627 new PARC SNPs in non-coding regions. These SNPs were not previously known or annotated and may regulate gene and/or protein expression. For example, in the Kcnj9 gene, which demonstrates differential transcript expression between the D2 and B6 strains [13], we identified 18 new SNPs within 2 kb upstream of the transcriptional start site as well as nine new SNPs in the 3' untranslated region (Figure 5). We also identified one false-positive SNP annotated in MPD in the 3' untranslated region. Thus, for Kcnj9 and other differentially expressed genes, elucidation of cryptic SNPs could identify nucleotide variation that underlies QTL phenotypic effects.
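
The per-region Kcnj9 counts given in the Figure 5 caption can be tallied as a quick cross-check. This is a hedged sketch: the dictionary keys are illustrative labels, and categories for which the caption reports no count are simply left out, so they contribute nothing to the totals rather than being asserted to be zero.

```python
# Kcnj9 SNP counts as stated in the Figure 5 caption.
# Keys are illustrative labels; a missing "new" or "confirmed" entry means
# the caption does not report that count.
kcnj9_snps = {
    "upstream_2kb":      {"new": 18, "confirmed": 5},
    "5_utr":             {"confirmed": 1},
    "coding_synonymous": {"confirmed": 6},
    "intron":            {"new": 15, "confirmed": 28},
    "3_utr":             {"new": 9, "confirmed": 11},
}
false_positive_3_utr = 1  # MPD-annotated SNP not supported by sequencing

# Sum only the counts that the caption actually reports.
total_new = sum(region.get("new", 0) for region in kcnj9_snps.values())
total_confirmed = sum(region.get("confirmed", 0) for region in kcnj9_snps.values())

print(f"new (previously cryptic) SNPs reported: {total_new}")
print(f"confirmed MPD SNPs reported: {total_confirmed}")
```

On the reported counts, the newly identified SNPs at this one gene approach the number of confirmed annotated SNPs, and most of the new ones fall in the upstream and intronic regions.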

