Accurate and fast methods to estimate the population mutation rate from error prone sequences.

Knudsen B, Miyamoto MM - BMC Bioinformatics (2009)

Bottom Line: These new methods are all accurate and fast according to evolutionary simulations and analyses of a real complex population dataset for the California seahare. In light of these results, we recommend the use of these three new methods for the determination of theta from error prone sequences. In particular, we advocate the new maximum likelihood model as a starting point for the further development of more complex coalescent/mutation models that also account for experimental error and design.


Affiliation: CLC bio, 8200 Arhus N, Denmark. bknudsen@clcbio.com

ABSTRACT

Background: The population mutation rate (theta) remains one of the most fundamental parameters in genetics, ecology, and evolutionary biology. However, its accurate estimation can be seriously compromised when working with error prone data such as expressed sequence tags, low coverage draft sequences, and other such unfinished products. This study is premised on the simple idea that a random sequence error due to a chance accident during data collection or recording will be distributed within a population dataset as a singleton (i.e., as a polymorphic site where one sampled sequence exhibits a unique base relative to the common nucleotide of the others). Thus, one can avoid these random errors by ignoring the singletons within a dataset.

Results: This strategy is implemented under an infinite sites model that focuses on only the internal branches of the sample genealogy where a shared polymorphism can arise (i.e., a variable site where each alternative base is represented by at least two sequences). This approach is first used to derive independently the same new Watterson and Tajima estimators of theta, as recently reported by Achaz [1] for error prone sequences. It is then used to modify the recent, full, maximum-likelihood model of Knudsen and Miyamoto [2], which incorporates various factors for experimental error and design with those for coalescence and mutation. These new methods are all accurate and fast according to evolutionary simulations and analyses of a real complex population dataset for the California seahare.
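
The distinction between singletons and shared polymorphisms can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `classify_sites` and the toy alignment are hypothetical, and the code assumes equal-length, gap-free aligned sequences. A column counts as a shared polymorphism when every base present occurs in at least two sequences, and as a singleton when exactly two bases occur and one of them appears only once.

```python
from collections import Counter


def classify_sites(sequences):
    """Return (singleton_columns, shared_columns) for aligned sequences.

    Hypothetical helper for illustration only; assumes equal-length,
    gap-free sequences.
    """
    singletons, shared = [], []
    for col, bases in enumerate(zip(*sequences)):
        counts = Counter(bases)
        if len(counts) < 2:
            continue  # monomorphic column: not a variable site
        if all(c >= 2 for c in counts.values()):
            # every alternative base is seen in at least two sequences
            shared.append(col)
        elif len(counts) == 2 and min(counts.values()) == 1:
            # one sequence carries a unique base relative to the others
            singletons.append(col)
        # any other pattern (e.g. three bases) is left unclassified here
    return singletons, shared


# Toy alignment of four sequences (made-up data).
toy = ["ACGTA", "ACGTT", "ATGTA", "ATGAA"]
singleton_cols, shared_cols = classify_sites(toy)
```

In this toy alignment, column 1 is a shared polymorphism, while columns 3 and 4 are singletons that the strategy described above would ignore.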

Conclusion: In light of these results, we recommend the use of these three new methods for the determination of theta from error prone sequences. In particular, we advocate the new maximum likelihood model as a starting point for the further development of more complex coalescent/mutation models that also account for experimental error and design.


Figure 2: Representative symmetrical (a) and asymmetrical (b) genealogies for four sampled sequences. In all, there are six symmetrical and 12 asymmetrical labeled genealogies for these four sequences [8,45]. This figure illustrates how the shape of a genealogy affects whether a mutation will lead to a singleton or shared polymorphism. This heterogeneity contributes to the variance of θ for the three new methods. This contribution is in addition to the heterogeneity of the genealogical branch lengths and Poisson mutation process [4,5]. The diamond highlights the common ancestor of the (n - 1) basal group of the asymmetrical genealogy. The dotted and thin solid lines mark the basal branch leading to this ancestor and the external branches, respectively, where a mutation will result in a singleton. The thick solid lines denote the other internal branches where a mutation will lead to a shared polymorphism. Expected branch lengths are given in units of scaled coalescent time next to each internode (those of the symmetrical genealogy are specific for its particular labeled history). Although both genealogies have an expected overall length of 11/3, the total length of the internal branches where a shared polymorphism can arise is 7/3 for (a) but only 1/3 for (b).
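
The branch lengths quoted in the caption can be checked with a short Python sketch. It assumes the standard coalescent scaling in which the expected length of the interval with k lineages is 2/[k(k - 1)], so that the interval T2 has expected length 1; the variable names are illustrative and do not come from the article.

```python
from fractions import Fraction


def expected_interval(k):
    # Assumed scaling: E[T_k] = 2 / (k (k - 1)), so E[T_2] = 1.
    return Fraction(2, k * (k - 1))


T2, T3, T4 = (expected_interval(k) for k in (2, 3, 4))

# Total expected length: k lineages persist during interval T_k.
total_length = 2 * T2 + 3 * T3 + 4 * T4

# (a) symmetrical: the two cherry ancestors span T3 + T2 and T2.
shared_symmetrical = 2 * T2 + T3

# (b) asymmetrical: only the ancestor of the innermost pair (length T3)
# yields shared polymorphisms; the (n - 1) basal branch (length T2)
# yields singletons instead.
shared_asymmetrical = T3

# Weight each shape by its share of the 18 labeled genealogies.
mean_shared = (Fraction(6, 18) * shared_symmetrical
               + Fraction(12, 18) * shared_asymmetrical)
```

Under these assumptions the sketch reproduces the caption's values of 11/3, 7/3, and 1/3, and the shape-weighted expected length of the branches that can carry a shared polymorphism comes out as 1 for four sequences.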

Mentions: In an asymmetrical genealogy with a basal split of one versus (n - 1) sampled sequences, a mutation in the internal branch leading to the common ancestor of the (n - 1) group will result in a singleton within the dataset (i.e., a variable site where the single nonmember sequence exhibits the unique base) (Figure 2). Thus, the length of this (n - 1) basal branch (as weighted by its probability of occurrence) must also be accounted for in the adjustment of the denominator for θ'W. The weighted length of this (n - 1) basal branch can be calculated as follows. First, the chance of this branch is determined as the probability of an (n - 1) asymmetrical topology or equivalently as the probability that either one of the two basal branches for the genealogy remains unbroken to the present. According to Equation (5), the latter probability is 2 [1/(n - 1)]. Second, the unweighted length of this branch is then calculated as the length of the time interval T2, which is 1 [Equation (2)]. Thus, the weighted length of the (n - 1) basal branch is:

$$\frac{2}{n-1} \times 1 = \frac{2}{n-1} \qquad (7)$$
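
As a hedged sketch of how this weighted length enters a singleton-free Watterson-type estimate, the Python below subtracts it, together with the external branches, from the expected total genealogy length. Several ingredients are assumptions not stated in the text above: the expected total length is taken as 2a_n with a_n = 1 + 1/2 + ... + 1/(n - 1), the expected total external branch length is taken as 2, and mutations are taken to arise at rate θ/2 per unit of branch length. All function names are hypothetical.

```python
from fractions import Fraction


def harmonic(n):
    # a_n = sum of 1/i for i = 1 .. n - 1
    return sum(Fraction(1, i) for i in range(1, n))


def weighted_basal_length(n):
    # Probability 2/(n - 1) of a 1 versus (n - 1) basal split,
    # times the expected length of T2, which is 1.
    return Fraction(2, n - 1) * 1


def expected_shared_length(n):
    # Assumptions: total expected length 2 * a_n, external branches 2.
    total = 2 * harmonic(n)
    external = Fraction(2)
    return total - external - weighted_basal_length(n)


def theta_without_singletons(shared_sites, n):
    # Assumption: mutations occur at rate theta / 2 per unit length,
    # so the denominator is half the expected shared-branch length.
    if n < 4:
        raise ValueError("shared polymorphisms require at least 4 sequences")
    return shared_sites / (expected_shared_length(n) / 2)


# Example with made-up numbers: 3 shared polymorphisms among 4 sequences.
example_theta = theta_without_singletons(3, 4)
```

Under these assumptions the denominator simplifies to a_n - n/(n - 1), which equals 1/2 for four sequences, so three shared polymorphisms would give an estimate of 6 for the whole sequence.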

