Reference-free SNP calling: improved accuracy by preventing incorrect calls from repetitive genomic regions.

Dou J, Zhao X, Fu X, Jiao W, Wang N, Zhang L, Hu X, Wang S, Bao Z - Biol. Direct (2012)

Affiliation: Key Laboratory of Marine Genetics and Breeding, College of Marine Life Sciences, Ocean University of China, 5 Yushan Road, Qingdao, 266003, China.

ABSTRACT

Background: Single nucleotide polymorphisms (SNPs) are the most abundant type of genetic variation in eukaryotic genomes and have recently become the marker of choice in a wide variety of ecological and evolutionary studies. The advent of next-generation sequencing (NGS) technologies has made it possible to efficiently genotype a large number of SNPs in non-model organisms with no or limited genomic resources. Most NGS-based genotyping methods require a reference genome to perform accurate SNP calling. However, little effort has been devoted to developing or improving algorithms for accurate SNP calling in the absence of a reference genome.

Results: Here we describe an improved maximum likelihood (ML) algorithm called iML, which can achieve high genotyping accuracy for SNP calling in non-model organisms without a reference genome. The iML algorithm incorporates a mixed Poisson/normal model to detect composite read clusters and can efficiently prevent incorrect SNP calls resulting from repetitive genomic regions. Through analysis of simulated and real sequencing datasets, we demonstrate that, in comparison with ML or a threshold approach, iML can substantially improve the accuracy of de novo SNP genotyping and is especially powerful for reference-free genotyping in diploid genomes with high repeat content.

Conclusions: The iML algorithm can efficiently prevent incorrect SNP calls resulting from repetitive genomic regions, and thus outperforms the original ML algorithm by achieving much higher genotyping accuracy. Our algorithm is therefore very useful for accurate de novo SNP genotyping in non-model organisms without a reference genome.

Figure 1: Schematic illustration of the occurrence of a false SNP after de novo clustering of reads derived from repetitive genomic regions. Both ML and iML perform well in the genotyping of SNPs derived from single-copy genomic regions (left), but iML is more effective at identifying and excluding false SNPs resulting from repetitive regions (right).

Single nucleotide polymorphisms (SNPs) are the most abundant type of genetic variation in eukaryotic genomes and have recently become the marker of choice in a wide variety of ecological and evolutionary studies of topics such as local adaptation, population connectivity, and speciation. Many of these studies focus on non-model species, for which the number of SNPs that can be assayed is usually very limited. The advent of next-generation sequencing (NGS) technologies has made it possible to efficiently genotype a large number (e.g., thousands to tens of thousands) of SNPs in non-model organisms with no or limited genomic resources. Several genotyping methods based on NGS platforms have recently been developed [1], most of which use restriction enzymes for genome complexity reduction (GCR) to reduce the total sequencing cost. In particular, RAD (restriction-site associated DNA) has gained popularity among these GCR-based methods, and allows nearly every restriction site in the genome to be screened in parallel [2]. Most SNP calling algorithms depend on a reference-based mapping approach [3], which limits their use in non-model species for which a reference genome is usually not available. Little effort has been devoted to developing or improving algorithms for accurate SNP calling in the absence of a reference genome. Catchen et al. [4] have recently developed a pipeline program called Stacks for de novo assembly and genotyping of RAD tags from a set of individuals. The core component of their program is ustacks, which efficiently builds reference sites de novo by assembling short reads into clusters (i.e., stacks), and applies a maximum likelihood (ML) statistical model to distinguish SNPs from sequencing errors. Building read clusters correctly is a critical step toward accurate SNP calling, but this step is highly sensitive to read length and genome complexity [5].
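To make the idea of an ML test that separates SNPs from sequencing errors concrete, the following is a minimal Python sketch of one common multinomial formulation for a single site in a diploid read cluster. It is an illustration only, not the ustacks implementation: the function names, the fixed `error_rate`, and the likelihood-ratio `threshold` of 3.84 are all assumptions made for this example.

```python
import math

def genotype_log_likelihoods(counts, error_rate=0.01):
    """Log-likelihoods of the homozygous and heterozygous hypotheses.

    counts: dict mapping nucleotide -> read count at one site (hypothetical input).
    error_rate: assumed per-base sequencing error rate (an assumption, held fixed).
    """
    ordered = sorted(counts.values(), reverse=True)
    ordered += [0] * (4 - len(ordered))          # pad to four nucleotides
    n1, n2, n3, n4 = ordered[:4]
    e = error_rate
    # The multinomial coefficient is identical under both hypotheses, so it is omitted.
    log_hom = n1 * math.log(1 - 3 * e / 4) + (n2 + n3 + n4) * math.log(e / 4)
    log_het = (n1 + n2) * math.log(0.5 - e / 4) + (n3 + n4) * math.log(e / 4)
    return log_hom, log_het

def call_site(counts, error_rate=0.01, threshold=3.84):
    """Call a site by likelihood ratio; return 'unknown' when support is weak."""
    log_hom, log_het = genotype_log_likelihoods(counts, error_rate)
    ratio = 2 * abs(log_hom - log_het)
    if ratio < threshold:
        return "unknown"
    return "homozygous" if log_hom > log_het else "heterozygous"
```

Note that a test of this kind looks only at allele counts within one cluster, so it cannot tell whether the reads in that cluster come from a single locus.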
Most eukaryotic genomes contain a substantial proportion of sequences that are repetitive or nearly repetitive, especially at the length scale of short reads. False SNPs can arise and be miscalled from read clusters in which reads carrying different sequence variants are actually derived from distinct genomic locations (i.e., repetitive regions) (Figure 1). In general, such composite read clusters should have greater depth than normal (i.e., non-composite) ones, so depth information can be used to identify composite clusters and exclude them from SNP calling. Herein, we demonstrate that the accuracy of de novo SNP calling can be substantially improved using an improved ML algorithm (hereafter called iML) that incorporates a mixed Poisson/normal model to identify and exclude composite clusters from genotyping, and therefore prevents incorrect SNP calls resulting from repetitive genomic regions (Figure 1). The iML algorithm is especially powerful for accurate de novo SNP calling in diploid genomes with high repeat content.
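The depth argument can be sketched in a few lines of Python. The text here does not give the details of the mixed Poisson/normal model, so this example substitutes a plain Poisson upper-tail test as a stand-in: it assumes single-copy cluster depths are roughly Poisson around the median depth and flags clusters that are improbably deep. The function names and the `alpha` cutoff are hypothetical choices for illustration, not part of iML.

```python
import math
from statistics import median

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed in log space for stability."""
    cdf = 0.0
    for i in range(k):
        cdf += math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
    return max(0.0, 1.0 - cdf)

def flag_composite_clusters(depths, alpha=0.001):
    """Return one boolean per cluster: True if its depth looks too high.

    depths: list of integer read depths, one per cluster (hypothetical input).
    alpha: assumed tail probability below which a cluster is flagged.
    """
    lam = median(depths)   # robust estimate of the typical single-copy depth
    return [poisson_upper_tail(d, lam) < alpha for d in depths]
```

For example, with depths clustered near 20 and one cluster at 41 (about what a collapsed two-copy repeat would produce), only the deep cluster is flagged; flagged clusters would then be excluded before genotyping.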

