Estimating DNA polymorphism from next generation sequencing data with high error rate by dual sequencing applications.

He Z, Li X, Ling S, Fu YX, Hungate E, Shi S, Wu CI - BMC Genomics (2013)

Bottom Line: As the error rate is high and the distribution of errors across sites is non-uniform in next generation sequencing (NGS) data, it has been a challenge to estimate DNA polymorphism (θ) accurately from NGS data. Under the current NGS error rate, sequencing each individual separately offers little advantage unless the coverage per individual is high (>20X). Since errors from the two sequencing applications are usually non-overlapping, it is possible to separate low frequency polymorphisms from sequencing errors.


Affiliation: State Key Laboratory of Biocontrol and Guangdong Key Laboratory of Plant Resources, Sun Yat-sen University, 135 Xingang West Road, Guangzhou 510275, China.

ABSTRACT

Background: As the error rate is high and the distribution of errors across sites is non-uniform in next generation sequencing (NGS) data, it has been a challenge to estimate DNA polymorphism (θ) accurately from NGS data.

Results: By computer simulations, we compare the two methods of data acquisition - sequencing each diploid individual separately and sequencing the pooled sample. Under the current NGS error rate, sequencing each individual separately offers little advantage unless the coverage per individual is high (>20X). We hence propose a new method for estimating θ from pooled samples that have been subjected to two separate rounds of DNA sequencing. Since errors from the two sequencing applications are usually non-overlapping, it is possible to separate low frequency polymorphisms from sequencing errors. Simulation results show that the dual applications method is reliable even when the error rate is high and θ is low.
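The reasoning behind the dual applications idea can be illustrated with a small simulation. The following is a minimal sketch, not the authors' simulation procedure: it assumes two alleles per site, errors that are independent between the two runs, and a simple read-count threshold for calling a site. The function names and all parameter values (`depth`, `alt_freq`, `error_rate`, `min_alt`) are hypothetical choices made for illustration.

```python
import random


def call_sites(true_sites, n_sites, depth, alt_freq, error_rate, min_alt, rng):
    """Simulate one sequencing application of a pooled sample.

    Returns the set of site indices with at least `min_alt` non-reference reads.
    """
    called = set()
    for site in range(n_sites):
        p_true = alt_freq if site in true_sites else 0.0
        # A read shows the alternative allele if it carries that allele and is
        # read correctly, or carries the reference allele and is misread
        # (two-allele simplification, an assumption of this sketch).
        p_alt_read = p_true * (1 - error_rate) + (1 - p_true) * error_rate
        alt_reads = sum(rng.random() < p_alt_read for _ in range(depth))
        if alt_reads >= min_alt:
            called.add(site)
    return called


def dual_application_calls(true_sites, n_sites, rng, depth=40, alt_freq=0.1,
                           error_rate=0.01, min_alt=2):
    """Call sites in two independent runs; keep only sites called in both."""
    run1 = call_sites(true_sites, n_sites, depth, alt_freq, error_rate, min_alt, rng)
    run2 = call_sites(true_sites, n_sites, depth, alt_freq, error_rate, min_alt, rng)
    return run1, run1 & run2


if __name__ == "__main__":
    rng = random.Random(1)
    true_sites = set(range(20))  # hypothetical: 20 polymorphic sites out of 2000
    run1, dual = dual_application_calls(true_sites, 2000, rng)
    print("false positives, one run:  ", len(run1 - true_sites))
    print("false positives, both runs:", len(dual - true_sites))
```

Because an error must recur at the same site in both runs to survive the intersection, the dual call set can never contain more false positives than a single run, while a real low-frequency variant has a good chance of appearing in both.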

Conclusions: In studies of natural populations where the sequencing coverage is usually modest (~2X per individual), the dual applications method on pooled samples should be a reasonable choice.


Figure 2: Estimates of θ from simulated pooled-lines samples under three different sequencing error settings. The true θ in the simulations is set to 0.1 or 1 per kb. Singletons are discarded in the dual applications method (S>1). Singletons and doubletons are discarded in the single platform method (S>2). The length of each error bar is 2 times the standard deviation. The means (and the standard deviations) of θ are estimated from 1000 replicates.
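To show how discarding low-count variants can still yield an estimate of θ, here is a minimal sketch of a truncated Watterson-type estimator. It is an illustration under stated assumptions, not necessarily the estimator used in the paper: it assumes the standard neutral model, under which the expected number of sites with i derived copies among n sequences is θL/i, and it assumes derived-allele counts are known. The function name `theta_truncated` and its parameters are hypothetical.

```python
def theta_truncated(derived_counts, n, length, min_count=2):
    """Estimate theta per site using only sites with >= min_count derived copies.

    derived_counts: derived-allele count at each segregating site (1 .. n-1)
    n:              number of sampled sequences
    length:         number of sites surveyed
    min_count:      smallest count retained (2 discards singletons,
                    3 discards singletons and doubletons)
    """
    if not 1 <= min_count < n:
        raise ValueError("min_count must satisfy 1 <= min_count < n")
    # Number of segregating sites that survive the frequency filter.
    kept = sum(1 for c in derived_counts if min_count <= c <= n - 1)
    # Under neutrality E[kept] = theta * length * sum_{i=min_count}^{n-1} 1/i.
    a = sum(1.0 / i for i in range(min_count, n))
    return kept / (a * length)
```

With `min_count=1` this reduces to the ordinary Watterson estimator; raising the threshold trades some loss of information for robustness to errors that masquerade as rare variants.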

Mentions: The simulation procedure is almost the same as that for the single platform, but with data from an additional sequencing application. The means and the standard deviations of θ estimates using different parameters are reported in Table 1. For sequencing data without errors, the dual platform method can accurately estimate θ, although the standard deviations are slightly larger than those obtained by the single platform method. However, as the error rate increases, the advantage of the dual platform method over the other methods becomes obvious (Figure 2). With an error rate of 0.01, the mean estimate of θ is 0.102 per kb when using S>2, which is only 2% higher than the real value (0.1 per kb). This estimate is dramatically better than the corresponding single platform estimate (4.061) or the single line estimate (0.180). This method is also better than the others when the error rate is Beta distributed, as shown in Table 2.
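The comparison quoted above can be restated as relative bias. The short sketch below uses only the three mean estimates given in the text and assumes that all of them are in the same units (per kb) as the true value of 0.1; the helper name `relative_bias` is hypothetical.

```python
def relative_bias(estimate, true_value):
    """Return (estimate - true_value) / true_value as a fraction."""
    return (estimate - true_value) / true_value


TRUE_THETA = 0.1  # per kb, the simulated value quoted in the text

# Mean estimates quoted in the text for an error rate of 0.01
# (assumed here to share the per-kb units of TRUE_THETA).
estimates = {
    "dual applications": 0.102,
    "single platform": 4.061,
    "single line": 0.180,
}

if __name__ == "__main__":
    for method, value in estimates.items():
        print(f"{method}: {relative_bias(value, TRUE_THETA):+.0%}")
```

On these figures the dual applications estimate overshoots by about 2%, the single line estimate by about 80%, and the single platform estimate by roughly a factor of 40.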

