Limits...
Comparison of methods for estimating the nucleotide substitution matrix.

Oscamou M, McDonald D, Yap VB, Huttley GA, Lladser ME, Knight R - BMC Bioinformatics (2008)

Bottom Line: The nucleotide substitution rate matrix is a key parameter of molecular evolution.Different methods differ in performance by orders of magnitude (ranging from 1 ms to 10 s per matrix), but differences in accuracy of rate matrix reconstruction appear to be relatively small.The method of Barry and Hartigan (1987) can provide somewhat more accuracy, measured as the Euclidean distance between the true and inferred matrices, on long sequences (> 2000 nucleotides) at the expense of substantially longer computation time.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Applied Mathematics, University of Colorado, Boulder, CO, USA. Maribeth.Oscamou@colorado.edu

ABSTRACT

Background: The nucleotide substitution rate matrix is a key parameter of molecular evolution. Several methods for inferring this parameter have been proposed, with different mathematical bases. These methods include counting sequence differences and taking the log of the resulting probability matrices, methods based on Markov triples, and maximum likelihood methods that infer the substitution probabilities that lead to the most likely model of evolution. However, the speed and accuracy of these methods has not been compared.

Results: Different methods differ in performance by orders of magnitude (ranging from 1 ms to 10 s per matrix), but differences in accuracy of rate matrix reconstruction appear to be relatively small. Encouragingly, relatively simple and fast methods can provide results at least as accurate as far more complex and computationally intensive methods, especially when the sequences to be compared are relatively short.

Conclusion: Based on the conditions tested, we recommend the use of method of Gojobori et al. (1982) for long sequences (> 600 nucleotides), and the method of Goldman et al. (1996) for shorter sequences (< 600 nucleotides). The method of Barry and Hartigan (1987) can provide somewhat more accuracy, measured as the Euclidean distance between the true and inferred matrices, on long sequences (> 2000 nucleotides) at the expense of substantially longer computation time. The availability of methods that are both fast and accurate will allow us to gain a global picture of change in the nucleotide substitution rate matrix on a genomewide scale across the tree of life.

Show MeSH

Related in: MedlinePlus

Results considering all matrices. Results for accuracy (left column) and speed (right column) for the different methods (legend in panel a) under different simulation conditions. Error is measured as the mean Euclidean distance between the matrix used to generate the sequences and the inferred matrix (arbitrary units); time is measured in seconds. Figures show effects of DNA length (a, b), branch ratio (c, d) and branch inner length (e, f) on accuracy and speed. The effect of whether the inner branches evolved according to the same or different matrices was negligible, and is not shown.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC2655096&req=5

Figure 2: Results considering all matrices. Results for accuracy (left column) and speed (right column) for the different methods (legend in panel a) under different simulation conditions. Error is measured as the mean Euclidean distance between the matrix used to generate the sequences and the inferred matrix (arbitrary units); time is measured in seconds. Figures show effects of DNA length (a, b), branch ratio (c, d) and branch inner length (e, f) on accuracy and speed. The effect of whether the inner branches evolved according to the same or different matrices was negligible, and is not shown.

Mentions: Figure 2 summarizes the speed and accuracy of the different methods. Each point shows the average for all points where the condition holds, e.g. the point for sequence length 100 averages over all the trials in which the sequence length was 100 nucleotides. Whether the two inner matrices were the same or different had negligible effect on either the speed or accuracy of the methods (data not shown, results shown are averages over both conditions). The Lake method appears to be most accurate under a wide range of conditions, although this is an artifact of its inability to operate on data generated using matrices that produce inaccurate results using all methods (see the next section below). The Gojobori and Goldman methods ran up to 3 orders of magnitude faster than the other methods, and had comparable or better accuracy (specifically, the Goldman method outperformed the others at short sequence lengths, less than 600 nucleotides, and the Gojobori method performed well at longer sequence lengths). In general, there was no clear association between speed and accuracy: the MW method is among the slowest methods but not generally among the more accurate methods under the conditions tested, for example, whereas the Gojobori method is generally the fastest and has accuracy very similar to methods that it outperforms by orders of magnitude. However, the BH method is most accurate on very long sequences (over 2000), although it is much slower than either the Gojobori or Goldman method in this range. One cautionary note about the MW method is that we reduced the maximum number of iterations from 1000 to 100, and increased the tolerance from 10-10 to 10-5, to improve run-time. Using the default parameters might increase the accuracy at an additional substantial runtime penalty, although even under the conditions tested this method is too slow for large-scale applications.


Comparison of methods for estimating the nucleotide substitution matrix.

Oscamou M, McDonald D, Yap VB, Huttley GA, Lladser ME, Knight R - BMC Bioinformatics (2008)

Results considering all matrices. Results for accuracy (left column) and speed (right column) for the different methods (legend in panel a) under different simulation conditions. Error is measured as the mean Euclidean distance between the matrix used to generate the sequences and the inferred matrix (arbitrary units); time is measured in seconds. Figures show effects of DNA length (a, b), branch ratio (c, d) and branch inner length (e, f) on accuracy and speed. The effect of whether the inner branches evolved according to the same or different matrices was negligible, and is not shown.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC2655096&req=5

Figure 2: Results considering all matrices. Results for accuracy (left column) and speed (right column) for the different methods (legend in panel a) under different simulation conditions. Error is measured as the mean Euclidean distance between the matrix used to generate the sequences and the inferred matrix (arbitrary units); time is measured in seconds. Figures show effects of DNA length (a, b), branch ratio (c, d) and branch inner length (e, f) on accuracy and speed. The effect of whether the inner branches evolved according to the same or different matrices was negligible, and is not shown.
Mentions: Figure 2 summarizes the speed and accuracy of the different methods. Each point shows the average for all points where the condition holds, e.g. the point for sequence length 100 averages over all the trials in which the sequence length was 100 nucleotides. Whether the two inner matrices were the same or different had negligible effect on either the speed or accuracy of the methods (data not shown, results shown are averages over both conditions). The Lake method appears to be most accurate under a wide range of conditions, although this is an artifact of its inability to operate on data generated using matrices that produce inaccurate results using all methods (see the next section below). The Gojobori and Goldman methods ran up to 3 orders of magnitude faster than the other methods, and had comparable or better accuracy (specifically, the Goldman method outperformed the others at short sequence lengths, less than 600 nucleotides, and the Gojobori method performed well at longer sequence lengths). In general, there was no clear association between speed and accuracy: the MW method is among the slowest methods but not generally among the more accurate methods under the conditions tested, for example, whereas the Gojobori method is generally the fastest and has accuracy very similar to methods that it outperforms by orders of magnitude. However, the BH method is most accurate on very long sequences (over 2000), although it is much slower than either the Gojobori or Goldman method in this range. One cautionary note about the MW method is that we reduced the maximum number of iterations from 1000 to 100, and increased the tolerance from 10-10 to 10-5, to improve run-time. Using the default parameters might increase the accuracy at an additional substantial runtime penalty, although even under the conditions tested this method is too slow for large-scale applications.

Bottom Line: The nucleotide substitution rate matrix is a key parameter of molecular evolution.Different methods differ in performance by orders of magnitude (ranging from 1 ms to 10 s per matrix), but differences in accuracy of rate matrix reconstruction appear to be relatively small.The method of Barry and Hartigan (1987) can provide somewhat more accuracy, measured as the Euclidean distance between the true and inferred matrices, on long sequences (> 2000 nucleotides) at the expense of substantially longer computation time.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Applied Mathematics, University of Colorado, Boulder, CO, USA. Maribeth.Oscamou@colorado.edu

ABSTRACT

Background: The nucleotide substitution rate matrix is a key parameter of molecular evolution. Several methods for inferring this parameter have been proposed, with different mathematical bases. These methods include counting sequence differences and taking the log of the resulting probability matrices, methods based on Markov triples, and maximum likelihood methods that infer the substitution probabilities that lead to the most likely model of evolution. However, the speed and accuracy of these methods has not been compared.

Results: Different methods differ in performance by orders of magnitude (ranging from 1 ms to 10 s per matrix), but differences in accuracy of rate matrix reconstruction appear to be relatively small. Encouragingly, relatively simple and fast methods can provide results at least as accurate as far more complex and computationally intensive methods, especially when the sequences to be compared are relatively short.

Conclusion: Based on the conditions tested, we recommend the use of method of Gojobori et al. (1982) for long sequences (> 600 nucleotides), and the method of Goldman et al. (1996) for shorter sequences (< 600 nucleotides). The method of Barry and Hartigan (1987) can provide somewhat more accuracy, measured as the Euclidean distance between the true and inferred matrices, on long sequences (> 2000 nucleotides) at the expense of substantially longer computation time. The availability of methods that are both fast and accurate will allow us to gain a global picture of change in the nucleotide substitution rate matrix on a genomewide scale across the tree of life.

Show MeSH
Related in: MedlinePlus