Estimating DNA polymorphism from next generation sequencing data with high error rate by dual sequencing applications.

He Z, Li X, Ling S, Fu YX, Hungate E, Shi S, Wu CI - BMC Genomics (2013)

Bottom Line: As the error rate is high and the distribution of errors across sites is non-uniform in next generation sequencing (NGS) data, it has been a challenge to estimate DNA polymorphism (θ) accurately from NGS data. Under the current NGS error rate, sequencing each individual separately offers little advantage unless the coverage per individual is high (>20X). Since errors from the two sequencing applications are usually non-overlapping, it is possible to separate low frequency polymorphisms from sequencing errors.


Affiliation: State Key Laboratory of Biocontrol and Guangdong Key Laboratory of Plant Resources, Sun Yat-sen University, 135 Xingang West Road, Guangzhou 510275, China.

ABSTRACT

Background: As the error rate is high and the distribution of errors across sites is non-uniform in next generation sequencing (NGS) data, it has been a challenge to estimate DNA polymorphism (θ) accurately from NGS data.

Results: By computer simulations, we compare the two methods of data acquisition - sequencing each diploid individual separately and sequencing the pooled sample. Under the current NGS error rate, sequencing each individual separately offers little advantage unless the coverage per individual is high (>20X). We hence propose a new method for estimating θ from pooled samples that have been subjected to two separate rounds of DNA sequencing. Since errors from the two sequencing applications are usually non-overlapping, it is possible to separate low frequency polymorphisms from sequencing errors. Simulation results show that the dual applications method is reliable even when the error rate is high and θ is low.

Conclusions: In studies of natural populations where the sequencing coverage is usually modest (~2X per individual), the dual applications method on pooled samples should be a reasonable choice.
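The idea in the Results can be illustrated with a minimal sketch, assuming per-site read counts are available for each of the two sequencing applications. It keeps only sites flagged independently in both applications and then plugs the count into the standard Watterson estimator. The input layout, the thresholds (`min_freq`, `min_reads`), and all function names are hypothetical, and Watterson's estimator is used only as a familiar stand-in: it ignores the sampling of chromosomes within a pool and is not the estimator derived in the paper.

```python
from typing import Dict, Set, Tuple

# Hypothetical input layout: site position -> (minor_allele_reads, total_reads)
SiteCounts = Dict[int, Tuple[int, int]]


def candidate_sites(counts: SiteCounts, min_freq: float, min_reads: int) -> Set[int]:
    """Sites whose minor allele passes both a read-count and a frequency threshold."""
    return {
        pos
        for pos, (minor, depth) in counts.items()
        if depth > 0 and minor >= min_reads and minor / depth >= min_freq
    }


def dual_application_sites(
    run_a: SiteCounts,
    run_b: SiteCounts,
    min_freq: float = 0.01,  # assumed threshold, not taken from the paper
    min_reads: int = 2,      # assumed threshold, not taken from the paper
) -> Set[int]:
    """Keep only sites flagged independently in both sequencing applications."""
    return candidate_sites(run_a, min_freq, min_reads) & candidate_sites(
        run_b, min_freq, min_reads
    )


def watterson_theta(num_segregating: int, num_chromosomes: int, num_sites: int) -> float:
    """Per-site Watterson estimator: S / (a_n * L), with a_n = sum_{i=1}^{n-1} 1/i."""
    a_n = sum(1.0 / i for i in range(1, num_chromosomes))
    return num_segregating / (a_n * num_sites)


# Toy data: site 10 is seen in both runs; sites 20 and 30 each appear in one run only.
run_a = {10: (30, 1000), 20: (25, 1000), 30: (0, 1000)}
run_b = {10: (28, 1000), 20: (1, 1000), 30: (40, 1000)}

shared = dual_application_sites(run_a, run_b)
theta = watterson_theta(len(shared), num_chromosomes=4, num_sites=len(run_a))
```

In this toy example, sites 20 and 30 are each supported by one run only and are discarded as likely errors, which is the behaviour expected when errors from the two applications rarely coincide.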



Figure 1: Error rate correlation patterns. a) MAF (minor allele frequency) of putative SNPs called by either SOLiD or Illumina GA. b) MAF in two samples (Bangkunsha and Thongnian) sequenced by Illumina HiSeq.

Mentions: Dual platforms - We re-analyzed sequencing data from a mangrove tree species, Sonneratia alba, known to be completely monomorphic within some populations [4]. DNA sequences for 71 genes from one such population were generated on the Illumina GA and SOLiD platforms at depths of ~2500X and ~5400X, respectively. For sites with more than 2000X depth on both platforms, we called variants using criteria more stringent than those of the previous study. As shown in Figure 1a, the Illumina GA and SOLiD systems each call many false SNPs, but few are called by both. Because the sample is known to be monomorphic by Sanger sequencing [4], all detected variants are false SNPs, and these, fortunately, show almost no overlap between platforms. Pearson's correlation coefficient between the error rate distributions of the two platforms is only 0.054.
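The cross-platform comparison described above can be sketched as follows, assuming per-site minor-allele read counts and depths for the same sites on both platforms. Because the sample is monomorphic, the minor allele frequency at a site is treated here as that site's error rate. The function names and input layout are hypothetical, and the 2000X cutoff is the only parameter taken from the text; this is an illustration of the calculation, not the authors' pipeline.

```python
import math
from typing import List, Sequence


def high_depth_indices(
    depth_a: Sequence[int], depth_b: Sequence[int], min_depth: int = 2000
) -> List[int]:
    """Indices of sites with more than min_depth coverage on both platforms."""
    return [
        i
        for i, (da, db) in enumerate(zip(depth_a, depth_b))
        if da > min_depth and db > min_depth
    ]


def per_site_error_rate(minor_reads: Sequence[int], depths: Sequence[int]) -> List[float]:
    """Minor allele frequency per site; equals the error rate in a monomorphic sample."""
    return [m / d for m, d in zip(minor_reads, depths)]


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A coefficient near zero, such as the 0.054 reported above, indicates that a site with a high error rate on one platform is not especially likely to have a high error rate on the other.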

