Comparison of methods for estimating the nucleotide substitution matrix.

Oscamou M, McDonald D, Yap VB, Huttley GA, Lladser ME, Knight R - BMC Bioinformatics (2008)

Bottom Line: The nucleotide substitution rate matrix is a key parameter of molecular evolution. Different methods differ in performance by orders of magnitude (ranging from 1 ms to 10 s per matrix), but differences in accuracy of rate matrix reconstruction appear to be relatively small. The method of Barry and Hartigan (1987) can provide somewhat more accuracy, measured as the Euclidean distance between the true and inferred matrices, on long sequences (> 2000 nucleotides) at the expense of substantially longer computation time.


Affiliation: Department of Applied Mathematics, University of Colorado, Boulder, CO, USA. Maribeth.Oscamou@colorado.edu

ABSTRACT

Background: The nucleotide substitution rate matrix is a key parameter of molecular evolution. Several methods for inferring this parameter have been proposed, with different mathematical bases. These methods include counting sequence differences and taking the log of the resulting probability matrices, methods based on Markov triples, and maximum likelihood methods that infer the substitution probabilities that lead to the most likely model of evolution. However, the speed and accuracy of these methods have not been compared.

Results: Different methods differ in performance by orders of magnitude (ranging from 1 ms to 10 s per matrix), but differences in accuracy of rate matrix reconstruction appear to be relatively small. Encouragingly, relatively simple and fast methods can provide results at least as accurate as far more complex and computationally intensive methods, especially when the sequences to be compared are relatively short.

Conclusion: Based on the conditions tested, we recommend the use of the method of Gojobori et al. (1982) for long sequences (> 600 nucleotides), and the method of Goldman et al. (1996) for shorter sequences (< 600 nucleotides). The method of Barry and Hartigan (1987) can provide somewhat more accuracy, measured as the Euclidean distance between the true and inferred matrices, on long sequences (> 2000 nucleotides) at the expense of substantially longer computation time. The availability of methods that are both fast and accurate will allow us to gain a global picture of change in the nucleotide substitution rate matrix on a genome-wide scale across the tree of life.



Figure 1: Overview of the workflow. The overall workflow is as follows. First, one or more random rate matrices (depending on parameter settings) are generated, and sequences are evolved according to a three-taxon tree of known topology. These sequences are generated as an ungapped alignment, so alignment issues do not affect the result (i.e. we assume that the alignment process is perfect). Depending on the algorithm, we then either (a) infer counts that differentiate specific pairs of sequences, then infer the probability matrices for different branches by normalizing the count matrices, or (b) infer the probability matrices for different branches directly from the alignment. Finally, the rate matrices are inferred from the probability matrices (either by the matrix log method or the constrained optimization method), normalized to eliminate the time component, and compared against the original rate matrix or matrices used to generate the sequences. Parameters that we varied during the analysis are marked on the tree and shown in the bottom-left of the figure.
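The count-and-log branch of this workflow can be sketched as follows. This is a simplified illustration, not the authors' implementation: the substitution counting and the eigendecomposition-based matrix logarithm below are our own assumptions, and the logarithm step presumes the probability matrix is diagonalizable with a well-defined log.

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def count_matrix(seq1, seq2):
    """Tally substitutions between two aligned, ungapped sequences."""
    C = np.zeros((4, 4))
    for a, b in zip(seq1, seq2):
        C[BASES[a], BASES[b]] += 1
    return C

def rate_matrix(C):
    """Matrix-log estimate of the rate matrix, trace scaled to -1."""
    P = C / C.sum(axis=1, keepdims=True)   # row-normalize counts -> probability matrix
    w, V = np.linalg.eig(P)                # assumes P is diagonalizable
    Q = (V @ np.diag(np.log(w)) @ np.linalg.inv(V)).real  # matrix logarithm of P
    return Q / -np.trace(Q)                # scale trace to -1 to remove the time component
```

The Euclidean distance between such an estimate and the true (identically scaled) rate matrix can then serve as the accuracy measure described in the text.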

Mentions: The speed and accuracy of each method were tested under the following conditions. The sequence length ranged from 100 to 1000 in steps of 100, then from 1000 to 5000 in steps of 1000 (inclusive). The length of the inner branch ranged from 1% to 10% divergence in steps of 1%, and from 10% to 50% divergence in steps of 5%. The branch ratio, i.e. the ratio of the inner branch to the outer branch, ranged from 0.1 to 1 in steps of 0.1, and from 1 to 10 in steps of 1. The substitution matrix was either kept constant for the two inner branches, or varied between these two branches. We sampled 100 random matrices under every possible combination of these conditions. The overall workflow, including a description of these parameters, is shown in Figure 1 (adapted from [31]). Because we are studying nonstationary processes, we scaled the trace of each rate matrix to -1 (i.e. we did not weight by the equilibrium base frequencies); this procedure is valid for the purposes of comparing the methods and was used consistently for all methods.
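The parameter sweep described above can be enumerated directly. The value lists below are assumptions read from the ranges in this paragraph (overlapping endpoints such as length 1000 and ratio 1 counted once), not taken from the authors' code:

```python
from itertools import product

# Unique parameter values implied by the stated ranges.
lengths = list(range(100, 1001, 100)) + list(range(2000, 5001, 1000))
divergences = [i / 100 for i in range(1, 11)] + [i / 100 for i in range(15, 51, 5)]
branch_ratios = [i / 10 for i in range(1, 11)] + [float(i) for i in range(2, 11)]
matrix_modes = ("constant", "varied")  # same vs. different matrix on the inner branches
replicates = 100                       # random matrices sampled per condition

# Every combination of conditions; each would be run `replicates` times.
conditions = list(product(lengths, divergences, branch_ratios, matrix_modes))
```

This makes the scale of the benchmark explicit: each estimation method is run on every condition times the number of replicates.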

