Next Generation Sequencing of Pooled Samples: Guideline for Variants' Filtering


ABSTRACT

Sequencing a large number of individuals, which is often needed for population genetics studies, is still economically challenging despite the falling costs of Next Generation Sequencing (NGS). Pool-seq is a cost- and time-effective alternative in which DNA from several individuals is pooled for sequencing. However, pooling DNA creates new challenges for accurate variant calling and allele frequency (AF) estimation. In particular, sequencing errors are confounded with alleles present at low frequency in the pools, possibly giving rise to false positive variants. We sequenced 996 individuals in 83 pools (12 individuals/pool) in a targeted re-sequencing experiment. We show that Pool-seq AFs are robust and reliable by comparing them with public variant databases and with in-house SNP-genotyping data of the individual subjects in the pools. Furthermore, we propose a simple filtering guideline for the removal of spurious variants based on the Kolmogorov-Smirnov statistical test. We experimentally validated our filters by comparing Pool-seq to individual sequencing data, showing that the filters remove most of the false variants while retaining the majority of true variants. The proposed guideline is fairly generic in nature and could easily be applied in other Pool-seq experiments.




Figure 4: QUAL(ity) score distribution of all variants. The dashed red vertical line denotes the ad-hoc threshold of low quality (QUAL = 100).

Mentions: CRISP generates a quality score for each variant by considering several parameters in a sophisticated multi-step algorithm [12]. Across our entire SNV dataset, the resulting quality score (QUAL) values span a large range, from 20 to over 1 million, distributed as shown in Fig. 4. Around 29% (N = 6862) of the variants have a "low" quality score (QUAL < 100; Fig. 4), and almost all of them are rare variants (AF < 0.01; Supplementary Fig. S9). However, not all rare variants (N = 19139) have low quality values: their QUAL scores span from 20 to 11080 (Supplementary Fig. S10). Comparing the quality distribution of the rare variants reported in any of the 1000 Genomes, dbSNP, ExAC or ESP databases (N in.db = 6359) with that of the rare variants not annotated in any public database (N novel = 12780), we found a disproportionate number of lower-quality variants in the novel rare variant category [Fig. 5(a)]. However, we expect these two distributions to be similar, because the presence or absence of a variant in public databases and the quality score of its variant call are completely independent parameters.
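
The comparison described above, between the QUAL distribution of rare variants found in public databases and that of novel rare variants, can be illustrated with a two-sample Kolmogorov-Smirnov statistic, the test named in the abstract. The following is a minimal sketch only, not the authors' implementation: the function names and the toy QUAL values are hypothetical, and the statistic is computed directly from the empirical distribution functions using the Python standard library.

```python
from bisect import bisect_right


def ks_two_sample_statistic(sample_a, sample_b):
    """Two-sample KS statistic D = max over x of |F_a(x) - F_b(x)|,
    where F_a and F_b are the empirical CDFs of the two samples."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # The empirical CDFs only change at observed values,
    # so it is enough to evaluate the gap at those points.
    for x in set(a) | set(b):
        f_a = bisect_right(a, x) / len(a)
        f_b = bisect_right(b, x) / len(b)
        d = max(d, abs(f_a - f_b))
    return d


def fraction_low_quality(quals, threshold=100):
    """Fraction of variants below the ad-hoc QUAL threshold (QUAL < 100 in the text)."""
    return sum(q < threshold for q in quals) / len(quals)


# Hypothetical toy QUAL values for illustration, NOT data from the study.
in_db_quals = [150, 300, 90, 1200, 450, 800]   # rare variants found in a public database
novel_quals = [25, 40, 60, 95, 130, 700]       # rare variants absent from all databases

d_stat = ks_two_sample_statistic(in_db_quals, novel_quals)
print(f"KS statistic D = {d_stat:.3f}")
print(f"low-quality fraction, in.db: {fraction_low_quality(in_db_quals):.3f}")
print(f"low-quality fraction, novel: {fraction_low_quality(novel_quals):.3f}")
```

A D value near 0 would indicate that the two QUAL distributions are similar, as the text expects if database membership and call quality are independent. A large D points to an excess of low-quality calls in one category. Turning D into a p-value requires the KS distribution, which is available in statistical libraries and is omitted here.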

