SNP calling by sequencing pooled samples.

Raineri E, Ferretti L, Esteve-Codina A, Nevado B, Heath S, Pérez-Enciso M - BMC Bioinformatics (2012)

Affiliation: Centro Nacional de Análisis Genómico, Parc Científic de Barcelona, Barcelona, 08028, Spain. emanuele.raineri@gmail.com

ABSTRACT

Background: Performing high-throughput sequencing on samples pooled from different individuals is a strategy to characterize genetic variability at a small fraction of the cost required for individual sequencing. In certain circumstances some variability estimators even have lower variance than those obtained with individual sequencing. SNP calling and estimating the frequency of the minor allele from pooled samples, though, is a subtle exercise for at least three reasons. First, sequencing errors may matter much more than in individual SNP calling: in individual sequencing their impact can be reduced by requiring a minimum number of reads per allele, but in pools such a restriction would have a strong and undesired effect, because alleles at low frequency in the pool are unlikely to be read many times. Second, the prior allele frequency for heterozygous sites in individuals is usually 0.5 (assuming one is not analyzing sequences coming from, e.g., cancer tissues), but this is not true in pools: under the standard neutral model, singletons (i.e. alleles of minimum frequency) are the most common class of variants because P(f) ∝ 1/f, and they occur more often as the sample size increases. Third, an allele appearing only once in the reads from a pool does not necessarily correspond to a singleton in the set of individuals making up the pool and, vice versa, there can be more than one read (or, more likely, none) from a true singleton.
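
The second and third points can be illustrated numerically. The following is a minimal sketch, not part of snape: it assumes error-free reads, equal representation of every chromosome in the pool, and a fixed depth, and the function names are hypothetical. It computes the probability that a true singleton is covered by zero, one, or more than one read, and the normalized 1/f prior over allele counts.

```python
def singleton_read_probs(n_chromosomes, depth):
    """Probability that a true singleton yields 0, 1, or >1 reads.

    Assumption: each read samples one of the pooled chromosomes uniformly
    at random and is error-free, so the number of reads carrying the
    singleton is Binomial(depth, 1 / n_chromosomes).
    """
    p = 1.0 / n_chromosomes
    p_zero = (1.0 - p) ** depth
    p_one = depth * p * (1.0 - p) ** (depth - 1)
    return p_zero, p_one, 1.0 - p_zero - p_one


def neutral_prior(n_chromosomes):
    """Normalized prior over allele counts k = 1..N-1 with P(k) proportional to 1/k.

    This is the P(f) ∝ 1/f form quoted in the abstract, restricted to
    segregating sites; index 0 of the result corresponds to k = 1.
    """
    weights = [1.0 / k for k in range(1, n_chromosomes)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    p_zero, p_one, p_more = singleton_read_probs(n_chromosomes=50, depth=20)
    print(f"P(0 reads) = {p_zero:.3f}, P(1 read) = {p_one:.3f}, P(>1 reads) = {p_more:.3f}")
    print(f"Prior mass on singletons: {neutral_prior(50)[0]:.3f}")
```

Under these assumptions, with N = 50 chromosomes and depth 20, a true singleton is missed entirely in roughly two thirds of cases, which is why a minimum-read filter is costly in pools.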

Results: To improve upon existing theory and software packages, we have developed a Bayesian approach for minor allele frequency (MAF) computation and SNP calling in pools (and implemented it in a program called snape): the approach takes into account sequencing errors and allows users to choose different priors. We also set up a pipeline which can simulate the coalescence process giving rise to the SNPs, the pooling procedure and the sequencing. We used it to compare the performance of snape to that of other packages.
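
To make the idea of a Bayesian call with a sequencing-error term and a user-chosen prior concrete, here is a minimal sketch. It is not the algorithm implemented in snape: the two-allele symmetric error model, the parameters `error_rate` and `p_segregating`, and the function name are all assumptions introduced for illustration, and sites fixed for the alternative allele are ignored.

```python
from math import comb


def posterior_allele_count(alt_reads, depth, n_chromosomes,
                           error_rate=0.01, p_segregating=0.01):
    """Posterior over the number k of alternative alleles in the pool (k = 0..N-1).

    Assumed model (illustrative only):
      * prior: P(k = 0) = 1 - p_segregating; for k >= 1, P(k) proportional to 1/k
      * a read shows the alternative allele with probability
        f * (1 - e) + (1 - f) * e, where f = k / N and e = error_rate
      * alt_reads ~ Binomial(depth, that probability)
    """
    weights = [1.0 / k for k in range(1, n_chromosomes)]
    total = sum(weights)
    prior = [1.0 - p_segregating] + [p_segregating * w / total for w in weights]

    joint = []
    for k in range(n_chromosomes):
        f = k / n_chromosomes
        p_alt = f * (1.0 - error_rate) + (1.0 - f) * error_rate
        likelihood = (comb(depth, alt_reads)
                      * p_alt ** alt_reads
                      * (1.0 - p_alt) ** (depth - alt_reads))
        joint.append(prior[k] * likelihood)

    evidence = sum(joint)
    return [j / evidence for j in joint]


if __name__ == "__main__":
    post = posterior_allele_count(alt_reads=5, depth=20, n_chromosomes=50)
    print(f"P(site is segregating | data) = {1.0 - post[0]:.4f}")
```

The returned list plays the role of the full posterior distribution of f (on the grid k/N), and one minus its first entry is the posterior probability that the site is segregating under this toy model.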

Conclusions: We present software that helps in calling SNPs in pooled samples: it has good power while retaining a low false discovery rate (FDR). The method also provides the posterior probability that a SNP is segregating and the full posterior distribution of f for every SNP. To test the behaviour of our software, we generated artificial genomes (through simulated coalescence) and computed the effect of a pooled sequencing protocol, followed by SNP calling. In this setting, snape has better power and FDR than the comparable packages samtools, PoPoolation and Varscan: for N = 50 chromosomes, snape has power ≈ 35% and FDR ≈ 2.5%. snape is available at http://code.google.com/p/snape-pooled/ (source code and precompiled binaries).
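
For reference, power and FDR in a simulation of this kind can be computed by comparing the set of truly segregating positions with the set of called positions. The sketch below uses the standard definitions (power = true positives / true SNPs, FDR = false positives / calls); the function name and the toy positions are hypothetical and are not taken from the paper's pipeline or results.

```python
def power_and_fdr(true_snps, called_snps):
    """Return (power, FDR) given sets of true and called SNP positions.

    power = |true ∩ called| / |true|
    FDR   = |called - true| / |called|
    Either quantity is reported as 0.0 when its denominator is empty.
    """
    true_snps = set(true_snps)
    called_snps = set(called_snps)
    true_positives = len(true_snps & called_snps)
    false_positives = len(called_snps - true_snps)
    power = true_positives / len(true_snps) if true_snps else 0.0
    fdr = false_positives / len(called_snps) if called_snps else 0.0
    return power, fdr


if __name__ == "__main__":
    # Toy positions, for illustration only.
    power, fdr = power_and_fdr(true_snps={101, 250, 377, 902},
                               called_snps={101, 250, 640})
    print(f"power = {power:.2f}, FDR = {fdr:.2f}")
```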

Figure 2: Power and false discovery rate (FDR) according to actual depth and minimum allele frequency when using N = 20, 50 and 100 chromosomes (from top to bottom) obtained with different methods (legend on upper-right panel). Average depth was 20X. Left column panels show power as a function of actual depth, middle column is the false discovery rate as a function of actual depth, and right column, power as a function of true minor allele frequency (MAF). Average of 100 replicates.

Mentions: Two main factors affect the accuracy of SNP calling in pools: depth and minimum allele frequency, although their effect varies according to the algorithm used (Figure 2).

