Reference-free SNP calling: improved accuracy by preventing incorrect calls from repetitive genomic regions.

Dou J, Zhao X, Fu X, Jiao W, Wang N, Zhang L, Hu X, Wang S, Bao Z - Biol. Direct (2012)

Bottom Line: The iML algorithm incorporates a mixed Poisson/normal model to detect composite read clusters and can efficiently prevent incorrect SNP calls resulting from repetitive genomic regions. It thus outperforms the original ML algorithm by achieving much higher genotyping accuracy, and is therefore very useful for accurate de novo SNP genotyping in non-model organisms without a reference genome.


Affiliation: Key Laboratory of Marine Genetics and Breeding, College of Marine Life Sciences, Ocean University of China, 5 Yushan Road, Qingdao, 266003, China.

ABSTRACT

Background: Single nucleotide polymorphisms (SNPs) are the most abundant type of genetic variation in eukaryotic genomes and have recently become the marker of choice in a wide variety of ecological and evolutionary studies. The advent of next-generation sequencing (NGS) technologies has made it possible to efficiently genotype a large number of SNPs in non-model organisms with no or limited genomic resources. Most NGS-based genotyping methods require a reference genome to perform accurate SNP calling. Little effort, however, has yet been devoted to developing or improving algorithms for accurate SNP calling in the absence of a reference genome.

Results: Here we describe an improved maximum likelihood (ML) algorithm called iML, which can achieve high genotyping accuracy for SNP calling in non-model organisms without a reference genome. The iML algorithm incorporates a mixed Poisson/normal model to detect composite read clusters and can efficiently prevent incorrect SNP calls resulting from repetitive genomic regions. Through analysis of simulation and real sequencing datasets, we demonstrate that in comparison with ML or a threshold approach, iML can remarkably improve the accuracy of de novo SNP genotyping and is especially powerful for reference-free genotyping in diploid genomes with high repeat content.

Conclusions: The iML algorithm can efficiently prevent incorrect SNP calls resulting from repetitive genomic regions, and thus outperforms the original ML algorithm by achieving much higher genotyping accuracy. Our algorithm is therefore very useful for accurate de novo SNP genotyping in non-model organisms without a reference genome.
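
The abstract does not spell out the mixed Poisson/normal model, so the following is only a rough sketch of one way read depth could be used to flag composite clusters (clusters that merge reads from several repeat copies and so appear over-covered). It is not the authors' implementation. The function names, the use of the median depth as the expected depth, the cutoff of 30 for switching from a Poisson to a normal tail, and the significance level alpha are all assumptions made for illustration.

```python
import math
from statistics import median


def poisson_upper_tail(depth, lam):
    """P(X >= depth) for X ~ Poisson(lam), computed as 1 - P(X <= depth - 1)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for k in range(depth):
        cdf += term
        term *= lam / (k + 1)  # P(X = k + 1) from P(X = k)
    return max(0.0, 1.0 - cdf)


def normal_upper_tail(depth, mean, sd):
    """P(X >= depth) for X ~ Normal(mean, sd)."""
    return 0.5 * math.erfc((depth - mean) / (sd * math.sqrt(2)))


def flag_composite_clusters(depths, alpha=1e-3, poisson_cutoff=30):
    """Return one boolean per cluster: True if its depth is improbably high.

    Assumption: the median depth approximates the expected depth of a
    single-copy cluster. Below `poisson_cutoff` a Poisson tail is used;
    above it, a normal approximation with variance equal to the mean.
    """
    lam = median(depths)
    flags = []
    for d in depths:
        if lam < poisson_cutoff:
            p = poisson_upper_tail(d, lam)
        else:
            p = normal_upper_tail(d, lam, math.sqrt(lam))
        flags.append(p < alpha)
    return flags


low_cov = flag_composite_clusters([18, 20, 22, 19, 21, 60, 20])
high_cov = flag_composite_clusters([38, 40, 42, 40, 41, 39, 90])
```

In both toy inputs only the strongly over-covered cluster (depth 60 or 90) is flagged. A real pipeline would exclude flagged clusters before genotyping and would need thresholds tuned to the sequencing depth of the dataset.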

Figure 3: Comparison of the performance of three de novo SNP calling approaches based on the simulation datasets of Arabidopsis thaliana (A) and Oryza sativa (B). iML outperforms ML or a threshold approach by improving genotyping accuracy remarkably at the expense of slightly decreased sensitivity. ML_ref, reference-based SNP calling using the ML algorithm; iML_denovo, de novo SNP calling using the iML algorithm; ML_denovo, de novo SNP calling using the ML algorithm; TH_denovo, de novo SNP calling using the threshold approach.

Mentions: For A. thaliana, iML always generated lower false positive rates (FPRs) than the threshold approach or ML, with FPR reductions of 12-19%, 6-11% and 2-4% for read lengths of 35, 50 and 100 bp, respectively, at a 40x sequencing depth, whereas iML generated only slightly higher false negative rates (FNRs, ~1%) in comparison with ML (Figure 3A, Additional file 3: Table S3). For the relatively large rice genome, which has a high repeat content, the advantage of iML is even more pronounced, with FPR reductions of 15-23%, 11-20% and 3-8% for read lengths of 35, 50 and 100 bp, respectively, at a 40x sequencing depth, but less noticeable FNR reductions in comparison with ML (Figure 3B, Additional file 3: Table S3). The threshold approach performed better than ML in terms of FPR reduction, but this was achieved at the expense of substantially decreased sensitivity (e.g., an 11% FNR increase for A. thaliana and 21% for O. sativa at a 40x sequencing depth for 35-bp reads). In all cases, iML improved the accuracy of de novo SNP calling, bringing it close to the level achieved by the reference-based mapping approach.
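
To make the FPR and FNR comparisons above concrete, here is a minimal sketch of how such rates might be computed on a simulated dataset where the true SNPs are known. The excerpt does not state the exact definitions used in the paper, so this assumes FPR is the fraction of called SNPs absent from the truth set and FNR is the fraction of true SNPs that were not called. The function name and the (cluster, position) keys are hypothetical.

```python
def error_rates(called, truth):
    """Return (fpr, fnr) for a set of called SNPs against a known truth set.

    Assumed definitions (not confirmed by the source):
      fpr = called SNPs not in the truth set / all called SNPs
      fnr = true SNPs that were not called   / all true SNPs
    """
    called, truth = set(called), set(truth)
    false_pos = len(called - truth)
    false_neg = len(truth - called)
    fpr = false_pos / len(called) if called else 0.0
    fnr = false_neg / len(truth) if truth else 0.0
    return fpr, fnr


# Toy example: SNPs identified by (cluster id, position within cluster).
truth = {("c1", 10), ("c2", 5), ("c3", 7), ("c4", 2)}
called = {("c1", 10), ("c2", 5), ("c9", 1)}
fpr, fnr = error_rates(called, truth)
```

Under these definitions, a method that filters out calls from repeat-derived clusters lowers `fpr` by removing spurious calls, and raises `fnr` only if it also discards clusters that contain true SNPs.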

