Next Generation Sequencing of Pooled Samples: Guideline for Variants' Filtering

View Article: PubMed Central - PubMed

ABSTRACT

Sequencing a large number of individuals, which is often needed for population genetics studies, is still economically challenging despite the falling costs of Next Generation Sequencing (NGS). Pool-seq is a cost- and time-effective alternative in which DNA from several individuals is pooled for sequencing. However, pooling DNA creates new problems for accurate variant calling and allele frequency (AF) estimation. In particular, sequencing errors are confounded with alleles present at low frequency in the pools, which can give rise to false positive variants. We sequenced 996 individuals in 83 pools (12 individuals/pool) in a targeted re-sequencing experiment. We show that Pool-seq AFs are robust and reliable by comparing them with public variant databases and with in-house SNP-genotyping data from the individual subjects in the pools. Furthermore, we propose a simple filtering guideline, based on the Kolmogorov-Smirnov statistical test, for the removal of spurious variants. We experimentally validated our filters by comparing Pool-seq to individual sequencing data, showing that the filters remove most of the false variants while retaining the majority of true variants. The proposed guideline is fairly generic and could easily be applied in other Pool-seq experiments.



Figure 2. Comparison of poolAF with AF of 1000genomes. (a) Histogram of differences between poolAF and the 1000genomes European population [1kg(EUR)]. Minimum: −0.494; 1st Quartile: −0.005; Median: 0.000; Mean: −0.002; 3rd Quartile: 0.005; Maximum: 0.308. (b) Boxplot of differences: left panel, 1000genomes_ALL (delta.kg.all); right panel, 1000genomes_EUR (delta.kg.eur). poolAF is more similar to 1000genomes_EUR than to 1000genomes_ALL, as shown by the smaller IQR and narrower spread of the data.


Mentions: The number of individuals in our samples (12 individuals/pool × 83 pools = 996 individuals) is comparable to that of the 1000genomes database. We compared Pool-seq AF (poolAF) with the AF of the 1000genomes_EUR population. For the 7068 SNVs for which a 1000genomes_EUR frequency was available, poolAF and 1000genomes_EUR AF correlate very strongly (R² = 0.980; Supplementary Fig. S5). The difference between poolAF and 1000genomes_EUR AF shows a very tight distribution centred at zero [median = 0; interquartile range, IQR = 0.01; Fig. 2(a)]. Because our pools are composed of Italian subjects, poolAF is, as expected, more similar to the AF of the 1000genomes_EUR population than to that of the 1000genomes_ALL population (R²_EUR = 0.980 vs. R²_ALL = 0.922; Supplementary Fig. S5). This is further supported by the distribution of differences between poolAF and 1000genomes AF, which shows a smaller IQR and a much narrower spread for the comparison with 1000genomes_EUR than with 1000genomes_ALL [Fig. 2(b)]. In an analysis stratified into rare and common variants, we further show that the relative differences (absolute delta/AF) are small for both groups of variants (Supplementary Fig. S6).
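
The comparison described above (per-variant difference, its median and IQR, the squared correlation, and the relative difference) can be illustrated with a short Python sketch. This is not the authors' code: the function name `compare_allele_frequencies`, the choice of the reference AF as the denominator of the relative difference, and the toy frequencies are all assumptions made for illustration.

```python
import numpy as np


def compare_allele_frequencies(pool_af, ref_af):
    """Summarise agreement between pooled and reference allele frequencies.

    Hypothetical helper, not taken from the paper. Both inputs are
    per-variant allele frequencies in the same variant order.
    """
    pool_af = np.asarray(pool_af, dtype=float)
    ref_af = np.asarray(ref_af, dtype=float)

    # Per-variant difference (delta), as plotted in the histogram and boxplot.
    delta = pool_af - ref_af
    q1, median, q3 = np.percentile(delta, [25, 50, 75])

    # Squared Pearson correlation between the two sets of frequencies.
    r = np.corrcoef(pool_af, ref_af)[0, 1]

    # Relative difference |delta| / AF. Assumption: AF here means the
    # reference AF; variants with a reference AF of 0 are left as NaN.
    relative_diff = np.full_like(delta, np.nan)
    nonzero = ref_af > 0
    relative_diff[nonzero] = np.abs(delta[nonzero]) / ref_af[nonzero]

    return {
        "delta": delta,
        "median": median,
        "iqr": q3 - q1,
        "r_squared": r ** 2,
        "relative_diff": relative_diff,
    }


# Toy values, invented for illustration only (not data from the study).
toy_pool_af = [0.10, 0.25, 0.50, 0.04, 0.80]
toy_ref_af = [0.11, 0.24, 0.50, 0.05, 0.79]
summary = compare_allele_frequencies(toy_pool_af, toy_ref_af)
print(summary["median"], summary["iqr"], summary["r_squared"])
```

Reporting the relative difference alongside the absolute one matters mainly for rare variants, where a small absolute delta can still be a large fraction of the frequency itself.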

