Next Generation Sequencing of Pooled Samples: Guideline for Variants' Filtering

ABSTRACT

Sequencing large numbers of individuals, which is often needed for population genetics studies, remains economically challenging despite the falling costs of Next Generation Sequencing (NGS). Pool-seq is a cost- and time-effective alternative in which DNA from several individuals is pooled for sequencing. However, pooling DNA creates new challenges for accurate variant calling and allele frequency (AF) estimation. In particular, sequencing errors are confounded with alleles present at low frequency in the pools, possibly giving rise to false positive variants. We sequenced 996 individuals in 83 pools (12 individuals/pool) in a targeted re-sequencing experiment. We show that Pool-seq AFs are robust and reliable by comparing them with public variant databases and with in-house SNP-genotyping data from individual subjects in the pools. Furthermore, we propose a simple filtering guideline for the removal of spurious variants based on the Kolmogorov-Smirnov statistical test. We experimentally validated our filters by comparing Pool-seq to individual sequencing data, showing that the filters remove most of the false variants while retaining the majority of true variants. The proposed guideline is fairly generic and could easily be applied in other Pool-seq experiments.

Figure 1: (a) Allele frequency distribution of all variants. (b) Distribution of variants according to the number of pools in which they are found.

CRISP called a total of 29736 variants in our data, of which 27529 were single nucleotide variants (SNVs) and 2207 were insertions and deletions (INDELs). Because INDELs are challenging for any variant calling software, we focused only on SNVs. Of these, 23651 SNVs passed all filters imposed by CRISP (e.g. low depth, strand bias). Figure 1(a) shows the allele frequency (AF) distribution of the SNVs. Most of them (N = 19139, 80.92%) can be classified as rare, with AF below 0.01. Many (N = 10111, 42.75%) are found in only one pool and may be private rare variants, i.e. present in only one individual of that pool [Fig. 1(b)]. These results are expected, since rare variants are abundant in the population [12] and their chance of detection increases with sequencing depth and with the number of individuals sequenced. However, they could also derive from sequencing errors.
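
The classification described above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the `Snv` record, its field names, and the `summarize` helper are hypothetical, and the toy values are made up. The only figures taken from the text are the 0.01 rare-variant cutoff and the reported counts used in the final sanity check.

```python
from dataclasses import dataclass

RARE_AF_THRESHOLD = 0.01  # "rare" cutoff quoted in the text (AF below 0.01)


@dataclass
class Snv:
    """Hypothetical per-variant record; field names are assumptions."""
    allele_frequency: float  # estimated AF across all pools
    n_pools: int             # number of pools in which the variant was called


def summarize(snvs):
    """Return (n_rare, pct_rare, n_single_pool, pct_single_pool)."""
    total = len(snvs)

    def pct(n):
        # Percentage of all SNVs, rounded to two decimals as in the text.
        return round(100.0 * n / total, 2) if total else 0.0

    n_rare = sum(1 for v in snvs if v.allele_frequency < RARE_AF_THRESHOLD)
    n_single = sum(1 for v in snvs if v.n_pools == 1)
    return n_rare, pct(n_rare), n_single, pct(n_single)


# Toy input (made-up values, not data from the study).
toy = [Snv(0.0005, 1), Snv(0.004, 1), Snv(0.008, 3), Snv(0.12, 60)]
print(summarize(toy))

# Sanity check of the percentages reported in the text,
# using the 23651 SNVs that passed CRISP filtering as the denominator.
print(round(100 * 19139 / 23651, 2), round(100 * 10111 / 23651, 2))
```

Both reported percentages (80.92% and 42.75%) are reproduced when 23651 is used as the denominator, which confirms that they refer to the filtered SNV set rather than to all 29736 called variants.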

