Error correction and diversity analysis of population mixtures determined by NGS.

Wood GR, Burroughs NJ, Evans DJ, Ryabov EV - PeerJ (2014)

Bottom Line: The impetus for this work was the need to analyse nucleotide diversity in a viral mix taken from honeybees. The paper has two findings. A compendium of existing and new diversity analysis tools is also presented, allowing hypotheses about diversity and mean diversity to be tested and associated confidence intervals to be calculated.


Affiliation: Warwick Systems Biology Centre, University of Warwick, Coventry, United Kingdom.

ABSTRACT
The impetus for this work was the need to analyse nucleotide diversity in a viral mix taken from honeybees. The paper has two findings. First, a method for correction of next generation sequencing error in the distribution of nucleotides at a site is developed. Second, a package of methods for assessment of nucleotide diversity is assembled. The error correction method is statistically based and works at the level of the nucleotide distribution rather than the level of individual nucleotides. The method relies on an error model and a sample of known viral genotypes that is used for model calibration. A compendium of existing and new diversity analysis tools is also presented, allowing hypotheses about diversity and mean diversity to be tested and associated confidence intervals to be calculated. The methods are illustrated using honeybee viral samples. Software in both Excel and Matlab and a guide are available at http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/, the Warwick University Systems Biology Centre software download site.




Fig. 6: A geometric view of the correction process, showing the tetrahedron of all nucleotide distributions. When the true nucleotide distribution p is (1, 0, 0, 0), the limiting distribution after next generation sequencing, q, lies on the line from p to the centroid c. Under sampling, we see a nearby empirical distribution q̂, which is corrected to p̂, lying on the extension of the line from c to q̂.

Mentions: A geometric view of the correction process is shown in Fig. 6. The tetrahedron is the locus of all probability mass functions on the four points {A, C, G, T}; each vertex corresponds to all probability mass lying on a single nucleotide, where the diversity is zero. The centroid c = (0.25, 0.25, 0.25, 0.25) corresponds to mass spread equally across the nucleotides, where the diversity takes its maximum value of one. When the true nucleotide distribution is p (e.g., p = (1, 0, 0, 0) in Fig. 6) and the NGS error rate is α, the limiting distribution following NGS (based on sequencing a very large number of nucleotides) is q = Mp = (1 − 4α)p + 4αc, which lies on the line between p and c. In practice we sequence a finite number n of nucleotides, giving an empirical distribution q̂ close to q. The correction of q̂, denoted p̂, reverses the p to q contraction, with p̂ lying along the extension of the line from c to q̂. (If this point lies outside the simplex, p̂ is taken to be the nearest point in the simplex to its projection onto the plane determined by its non-negative components.) In general, the further p is from c, the greater the distance from p to q, and so the greater the distance from q̂ to p̂; the greater the error, the greater the correction. In turn, diversity increases from zero to one along the line from a tetrahedron vertex to the centroid and is concave down, as shown in Fig. 7, so it changes more quickly when the diversity is low and is relatively stable when diversity is high.
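The contraction and its reversal described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' Excel or Matlab software. It assumes the relation q = (1 − 4α)p + 4αc from the text, so the correction is p̂ = c + (q̂ − c)/(1 − 4α). The function names (`ngs_limit`, `correct`, `project_to_simplex`) are hypothetical, and when the corrected point falls outside the simplex the sketch substitutes a standard sort-based Euclidean projection onto the simplex, which may not coincide with the paper's own projection rule in every case.

```python
from typing import List, Sequence

# Centroid of the tetrahedron: equal mass on A, C, G, T.
CENTROID = (0.25, 0.25, 0.25, 0.25)


def ngs_limit(p: Sequence[float], alpha: float) -> List[float]:
    """Limiting distribution after sequencing: q = (1 - 4*alpha)*p + 4*alpha*c."""
    return [(1 - 4 * alpha) * pi + 4 * alpha * ci for pi, ci in zip(p, CENTROID)]


def project_to_simplex(v: Sequence[float]) -> List[float]:
    """Euclidean projection onto the probability simplex (sort-based method).

    Assumption: used here as a stand-in for the paper's projection rule.
    """
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, uk in enumerate(u, start=1):
        cumulative += uk
        t = (cumulative - 1.0) / k
        if uk - t > 0:
            theta = t  # keep the threshold from the largest valid k
    return [max(x - theta, 0.0) for x in v]


def correct(q_hat: Sequence[float], alpha: float) -> List[float]:
    """Reverse the contraction: p_hat = c + (q_hat - c) / (1 - 4*alpha)."""
    if not 0 <= alpha < 0.25:
        raise ValueError("alpha must satisfy 0 <= alpha < 0.25")
    scale = 1.0 / (1.0 - 4.0 * alpha)
    p_hat = [ci + scale * (qi - ci) for qi, ci in zip(q_hat, CENTROID)]
    if min(p_hat) < 0:
        # The corrected point left the simplex; pull it back inside.
        p_hat = project_to_simplex(p_hat)
    return p_hat


if __name__ == "__main__":
    alpha = 0.01                      # illustrative error rate, not from the paper
    q = ngs_limit((1, 0, 0, 0), alpha)
    print(q)                          # contracted towards the centroid
    print(correct(q, alpha))          # close to the vertex (1, 0, 0, 0)
    print(correct((0.9, 0.1, 0.0, 0.0), alpha))  # needs the simplex projection
```

In the last call the raw correction has two slightly negative components, because the empirical distribution contains fewer errors than the model expects, so the projection step is what returns a valid distribution.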

