Limits...
Scoredist: a simple and robust protein sequence distance estimator.

Sonnhammer EL, Hollich V - BMC Bioinformatics (2005)

Bottom Line: The optimal matrix can be found either by an iterative search for the Maximum Likelihood matrix, or by integration to find the Expected Distance.Scoredist proved as accurate as the optimal matrix methods, yet substantially more robust.Scoredist has been incorporated into the Belvu alignment viewer, which is available at ftp://ftp.cgb.ki.se/pub/prog/belvu/.

View Article: PubMed Central - HTML - PubMed

Affiliation: Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius väg 35, 171 77 Stockholm, Sweden. erik.sonnhammer@cgb.ki.se

ABSTRACT

Background: Distance-based methods are popular for reconstructing evolutionary trees thanks to their speed and generality. A number of methods exist for estimating distances from sequence alignments, which often involves some sort of correction for multiple substitutions. The problem is to accurately estimate the number of true substitutions given an observed alignment. So far, the most accurate protein distance estimators have looked for the optimal matrix in a series of transition probability matrices, e.g. the Dayhoff series. The evolutionary distance between two aligned sequences is here estimated as the evolutionary distance of the optimal matrix. The optimal matrix can be found either by an iterative search for the Maximum Likelihood matrix, or by integration to find the Expected Distance. As a consequence, these methods are more complex to implement and computationally heavier than correction-based methods. Another problem is that the result may vary substantially depending on the evolutionary model used for the matrices. An ideal distance estimator should produce consistent and accurate distances independent of the evolutionary model used.

Results: We propose a correction-based protein sequence estimator called Scoredist. It uses a logarithmic correction of observed divergence based on the alignment score according to the BLOSUM62 score matrix. We evaluated Scoredist and a number of optimal matrix methods using three evolutionary models for both training and testing Dayhoff, Jones-Taylor-Thornton, and Muller-Vingron, as well as Whelan and Goldman solely for testing. Test alignments with known distances between 0.01 and 2 substitutions per position (1-200 PAM) were simulated using ROSE. Scoredist proved as accurate as the optimal matrix methods, yet substantially more robust. When trained on one model but tested on another one, Scoredist was nearly always more accurate. The Jukes-Cantor and Kimura correction methods were also tested, but were substantially less accurate.

Conclusion: The Scoredist distance estimator is fast to implement and run, and combines robustness with accuracy. Scoredist has been incorporated into the Belvu alignment viewer, which is available at ftp://ftp.cgb.ki.se/pub/prog/belvu/.

Show MeSH
Stratified accuracy analysis of Scoredist and ML. To illustrate how estimated distance depends on the model, the average deviation is plotted as a function of true distance for two evolutionary models, Dayhoff and Mueller-Vingron. For each evolutionary distance between 1 and 200 PAM, 10 alignments were generated. For each alignment, the deviation was calculated as the difference between the estimated distance and the true distance used for data generation by ROSE [16]. The average of the 10 deviations was plotted using a running average with a window of 10 residues. Note that positive and negative deviations at the same true distance can cancel each other out – the curve only shows the average deviation and not the variability. The values in Table 1 measure the accuracy more correctly by using RMSD of every datapoint. The testset data was created with the matrices given by Dayhoff (A) or Müller-Vingron (B). In both cases, the estimators using the same evolutionary model as the testset data perform well. However, when switching the model in the estimator, Scoredist diverges less than ML, indicating that Scoredist is more robust. The curves show that ML-MV is more different from ML-Dayhoff than Scoredist-MV is from Scoredist-Dayhoff, particularly for the MV dataset in (B). The less difference between estimates using different models, the more robust is the method.
© Copyright Policy
Related In: Results  -  Collection


getmorefigures.php?uid=PMC1131889&req=5

Figure 1: Stratified accuracy analysis of Scoredist and ML. To illustrate how estimated distance depends on the model, the average deviation is plotted as a function of true distance for two evolutionary models, Dayhoff and Mueller-Vingron. For each evolutionary distance between 1 and 200 PAM, 10 alignments were generated. For each alignment, the deviation was calculated as the difference between the estimated distance and the true distance used for data generation by ROSE [16]. The average of the 10 deviations was plotted using a running average with a window of 10 residues. Note that positive and negative deviations at the same true distance can cancel each other out – the curve only shows the average deviation and not the variability. The values in Table 1 measure the accuracy more correctly by using RMSD of every datapoint. The testset data was created with the matrices given by Dayhoff (A) or Müller-Vingron (B). In both cases, the estimators using the same evolutionary model as the testset data perform well. However, when switching the model in the estimator, Scoredist diverges less than ML, indicating that Scoredist is more robust. The curves show that ML-MV is more different from ML-Dayhoff than Scoredist-MV is from Scoredist-Dayhoff, particularly for the MV dataset in (B). The less difference between estimates using different models, the more robust is the method.

Mentions: Figure 1 two shows a more detailed picture of the different distance estimators. The average of 10 estimates from 10 independent simulations at each evolutionary distance is plotted for data generated with the Dayhoff matrices. The variance among the 10 estimates is not shown for clarity; they are however reflected by the RMSD values in Table 1 which may give a slightly different picture. For instance, it is possible that the average deviation is close to zero if the individual estimates have large positive and negative deviations that cancel each other out. Therefore, the RMSD values should be trusted more than the deviation plots when in doubt. Figure 1A shows the dependence on evolutionary model for Scoredist and ML. Testing on the Dayhoff testset, Scoredist-Dayhoff and ML-Dayhoff stayed reasonable accurate in the entire range (below 5% error). In contrast, Scoredist-MV and ML-MV deviated considerably from the true distance. It is however clear that ML is more affected by switching model than Scoredist is. In Figure 1B the testset was generated with the MV model. Again, the corresponding deviation was observed for "wrong model" estimators. Here it is even more pronounced that ML is more dependent on the model, and generalizes poorly. Scoredist was less affected by the change of model – Scoredist-Dayhoff was considerably more accurate on the MV testset than ML-Dayhoff. As expected, when Scoredist and ML had been trained on MV data, the accuracy is very good for both estimators. In conclusion, we observed that although the Scoredist method is very simple compared to the ML method, it is approximately equally accurate when testing and training using the same evolutionary model. However, when testing on a different model, Scoredist is considerably more accurate.


Scoredist: a simple and robust protein sequence distance estimator.

Sonnhammer EL, Hollich V - BMC Bioinformatics (2005)

Stratified accuracy analysis of Scoredist and ML. To illustrate how estimated distance depends on the model, the average deviation is plotted as a function of true distance for two evolutionary models, Dayhoff and Mueller-Vingron. For each evolutionary distance between 1 and 200 PAM, 10 alignments were generated. For each alignment, the deviation was calculated as the difference between the estimated distance and the true distance used for data generation by ROSE [16]. The average of the 10 deviations was plotted using a running average with a window of 10 residues. Note that positive and negative deviations at the same true distance can cancel each other out – the curve only shows the average deviation and not the variability. The values in Table 1 measure the accuracy more correctly by using RMSD of every datapoint. The testset data was created with the matrices given by Dayhoff (A) or Müller-Vingron (B). In both cases, the estimators using the same evolutionary model as the testset data perform well. However, when switching the model in the estimator, Scoredist diverges less than ML, indicating that Scoredist is more robust. The curves show that ML-MV is more different from ML-Dayhoff than Scoredist-MV is from Scoredist-Dayhoff, particularly for the MV dataset in (B). The less difference between estimates using different models, the more robust is the method.
© Copyright Policy
Related In: Results  -  Collection

Show All Figures
getmorefigures.php?uid=PMC1131889&req=5

Figure 1: Stratified accuracy analysis of Scoredist and ML. To illustrate how estimated distance depends on the model, the average deviation is plotted as a function of true distance for two evolutionary models, Dayhoff and Mueller-Vingron. For each evolutionary distance between 1 and 200 PAM, 10 alignments were generated. For each alignment, the deviation was calculated as the difference between the estimated distance and the true distance used for data generation by ROSE [16]. The average of the 10 deviations was plotted using a running average with a window of 10 residues. Note that positive and negative deviations at the same true distance can cancel each other out – the curve only shows the average deviation and not the variability. The values in Table 1 measure the accuracy more correctly by using RMSD of every datapoint. The testset data was created with the matrices given by Dayhoff (A) or Müller-Vingron (B). In both cases, the estimators using the same evolutionary model as the testset data perform well. However, when switching the model in the estimator, Scoredist diverges less than ML, indicating that Scoredist is more robust. The curves show that ML-MV is more different from ML-Dayhoff than Scoredist-MV is from Scoredist-Dayhoff, particularly for the MV dataset in (B). The less difference between estimates using different models, the more robust is the method.
Mentions: Figure 1 two shows a more detailed picture of the different distance estimators. The average of 10 estimates from 10 independent simulations at each evolutionary distance is plotted for data generated with the Dayhoff matrices. The variance among the 10 estimates is not shown for clarity; they are however reflected by the RMSD values in Table 1 which may give a slightly different picture. For instance, it is possible that the average deviation is close to zero if the individual estimates have large positive and negative deviations that cancel each other out. Therefore, the RMSD values should be trusted more than the deviation plots when in doubt. Figure 1A shows the dependence on evolutionary model for Scoredist and ML. Testing on the Dayhoff testset, Scoredist-Dayhoff and ML-Dayhoff stayed reasonable accurate in the entire range (below 5% error). In contrast, Scoredist-MV and ML-MV deviated considerably from the true distance. It is however clear that ML is more affected by switching model than Scoredist is. In Figure 1B the testset was generated with the MV model. Again, the corresponding deviation was observed for "wrong model" estimators. Here it is even more pronounced that ML is more dependent on the model, and generalizes poorly. Scoredist was less affected by the change of model – Scoredist-Dayhoff was considerably more accurate on the MV testset than ML-Dayhoff. As expected, when Scoredist and ML had been trained on MV data, the accuracy is very good for both estimators. In conclusion, we observed that although the Scoredist method is very simple compared to the ML method, it is approximately equally accurate when testing and training using the same evolutionary model. However, when testing on a different model, Scoredist is considerably more accurate.

Bottom Line: The optimal matrix can be found either by an iterative search for the Maximum Likelihood matrix, or by integration to find the Expected Distance.Scoredist proved as accurate as the optimal matrix methods, yet substantially more robust.Scoredist has been incorporated into the Belvu alignment viewer, which is available at ftp://ftp.cgb.ki.se/pub/prog/belvu/.

View Article: PubMed Central - HTML - PubMed

Affiliation: Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius väg 35, 171 77 Stockholm, Sweden. erik.sonnhammer@cgb.ki.se

ABSTRACT

Background: Distance-based methods are popular for reconstructing evolutionary trees thanks to their speed and generality. A number of methods exist for estimating distances from sequence alignments, which often involves some sort of correction for multiple substitutions. The problem is to accurately estimate the number of true substitutions given an observed alignment. So far, the most accurate protein distance estimators have looked for the optimal matrix in a series of transition probability matrices, e.g. the Dayhoff series. The evolutionary distance between two aligned sequences is here estimated as the evolutionary distance of the optimal matrix. The optimal matrix can be found either by an iterative search for the Maximum Likelihood matrix, or by integration to find the Expected Distance. As a consequence, these methods are more complex to implement and computationally heavier than correction-based methods. Another problem is that the result may vary substantially depending on the evolutionary model used for the matrices. An ideal distance estimator should produce consistent and accurate distances independent of the evolutionary model used.

Results: We propose a correction-based protein sequence estimator called Scoredist. It uses a logarithmic correction of observed divergence based on the alignment score according to the BLOSUM62 score matrix. We evaluated Scoredist and a number of optimal matrix methods using three evolutionary models for both training and testing Dayhoff, Jones-Taylor-Thornton, and Muller-Vingron, as well as Whelan and Goldman solely for testing. Test alignments with known distances between 0.01 and 2 substitutions per position (1-200 PAM) were simulated using ROSE. Scoredist proved as accurate as the optimal matrix methods, yet substantially more robust. When trained on one model but tested on another one, Scoredist was nearly always more accurate. The Jukes-Cantor and Kimura correction methods were also tested, but were substantially less accurate.

Conclusion: The Scoredist distance estimator is fast to implement and run, and combines robustness with accuracy. Scoredist has been incorporated into the Belvu alignment viewer, which is available at ftp://ftp.cgb.ki.se/pub/prog/belvu/.

Show MeSH