Scoredist: a simple and robust protein sequence distance estimator.

Sonnhammer EL, Hollich V - BMC Bioinformatics (2005)

Bottom Line: The optimal matrix can be found either by an iterative search for the Maximum Likelihood matrix, or by integration to find the Expected Distance. Scoredist proved as accurate as the optimal matrix methods, yet substantially more robust. Scoredist has been incorporated into the Belvu alignment viewer, which is available at ftp://ftp.cgb.ki.se/pub/prog/belvu/.


Affiliation: Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius väg 35, 171 77 Stockholm, Sweden. erik.sonnhammer@cgb.ki.se

ABSTRACT

Background: Distance-based methods are popular for reconstructing evolutionary trees thanks to their speed and generality. A number of methods exist for estimating distances from sequence alignments, and they often involve some sort of correction for multiple substitutions. The problem is to accurately estimate the number of true substitutions given an observed alignment. So far, the most accurate protein distance estimators have looked for the optimal matrix in a series of transition probability matrices, e.g. the Dayhoff series. The evolutionary distance between two aligned sequences is then estimated as the evolutionary distance of the optimal matrix. The optimal matrix can be found either by an iterative search for the Maximum Likelihood matrix, or by integration to find the Expected Distance. As a consequence, these methods are more complex to implement and computationally heavier than correction-based methods. Another problem is that the result may vary substantially depending on the evolutionary model used for the matrices. An ideal distance estimator should produce consistent and accurate distances independent of the evolutionary model used.
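
To illustrate what a correction for multiple substitutions looks like, the following minimal Python sketch computes the observed fraction of differing positions in a pairwise alignment and applies two classical closed-form corrections: the Jukes-Cantor formula generalized to 20 amino acid states, and Kimura's empirical protein formula. The function names are hypothetical, and these are textbook forms of the corrections, not necessarily the exact variants evaluated in the paper.

```python
import math


def p_distance(s1, s2):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    pairs = [(a, b) for a, b in zip(s1, s2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_protein(p, states=20):
    """Jukes-Cantor correction for an alphabet of `states` equally likely residues."""
    b = (states - 1) / states
    arg = 1.0 - p / b
    # The correction is undefined once the observed divergence reaches saturation.
    return math.inf if arg <= 0 else -b * math.log(arg)


def kimura_protein(p):
    """Kimura's empirical correction for proteins: d = -ln(1 - p - 0.2 p^2)."""
    arg = 1.0 - p - 0.2 * p * p
    return math.inf if arg <= 0 else -math.log(arg)


if __name__ == "__main__":
    p = p_distance("ACDE", "ACDF")
    print(p, jukes_cantor_protein(p), kimura_protein(p))
```

Both corrections return a value larger than the observed divergence, reflecting substitutions hidden by repeated changes at the same position.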

Results: We propose a correction-based protein sequence distance estimator called Scoredist. It uses a logarithmic correction of observed divergence based on the alignment score according to the BLOSUM62 score matrix. We evaluated Scoredist and a number of optimal matrix methods using three evolutionary models for both training and testing (Dayhoff, Jones-Taylor-Thornton, and Muller-Vingron), as well as Whelan and Goldman solely for testing. Test alignments with known distances between 0.01 and 2 substitutions per position (1-200 PAM) were simulated using ROSE. Scoredist proved as accurate as the optimal matrix methods, yet substantially more robust. When trained on one model but tested on another, Scoredist was nearly always more accurate. The Jukes-Cantor and Kimura correction methods were also tested, but were substantially less accurate.
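
A score-based logarithmic correction of this kind might be sketched as follows. The abstract states only that the correction is logarithmic and based on the BLOSUM62 alignment score; the specific normalization used here (subtracting an expected random score from both the observed score and the mean self-score before taking the ratio) is an assumption made for illustration. The scoring function `toy_score` is a hypothetical stand-in and is not BLOSUM62, and `expected_per_site` is a hypothetical parameter for the expected score of a random residue pair.

```python
import math


def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix lookup (NOT BLOSUM62)."""
    return 4 if a == b else -1


def raw_distance(s1, s2, expected_per_site, score=toy_score):
    """Logarithmic correction of a normalized alignment score (illustrative sketch).

    Assumed normalization: both the observed score and the upper-limit
    (mean self-alignment) score are shifted by the score expected for
    random sequences of the same length.
    """
    cols = [(a, b) for a, b in zip(s1, s2) if a != "-" and b != "-"]
    if not cols:
        raise ValueError("no ungapped columns to score")
    observed = sum(score(a, b) for a, b in cols)
    self_score = 0.5 * (sum(score(a, a) for a, _ in cols)
                        + sum(score(b, b) for _, b in cols))
    random_score = expected_per_site * len(cols)
    numerator = observed - random_score
    denominator = self_score - random_score
    # A score at or below the random expectation carries no distance information.
    if numerator <= 0 or denominator <= 0:
        return math.inf
    return -math.log(numerator / denominator)


if __name__ == "__main__":
    print(raw_distance("ACDE", "ACDF", expected_per_site=-1.0))
```

Under this normalization identical sequences give a raw distance of zero, and the value grows without bound as the alignment score approaches what would be expected for unrelated sequences.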

Conclusion: The Scoredist distance estimator is fast to implement and run, and combines robustness with accuracy. Scoredist has been incorporated into the Belvu alignment viewer, which is available at ftp://ftp.cgb.ki.se/pub/prog/belvu/.


Figure 3: Estimation of the calibration factor c in Scoredist. This factor rescales the raw distance dr to optimally fit true evolutionary distances. The plot shows how c is estimated by least-squares fitting of raw distances dr to true distances for 2000 artificially produced sequence alignments, using the Dayhoff matrix series. The linear relationship between the raw distance dr and the true distance of the sequence samples justifies the introduction of the calibration factor c, which was here determined to be cDayhoff = 1.3370 (see Table 2).

Mentions: As seen in Figure 3, dr is linearly related to the true distance, deviating only by a constant factor. The Scoredist evolutionary distance estimate of two sequences is given as the product of the raw distance dr and a calibration factor c.
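
The calibration step can be sketched in a few lines of Python. Because the raw distance is described as deviating from the true distance only by a constant factor, this sketch assumes a least-squares line through the origin; whether the original fit included an intercept is not stated here. The function names and the sample numbers are hypothetical and serve only to show the arithmetic.

```python
def fit_calibration_factor(raw_distances, true_distances):
    """Least-squares slope c of true = c * raw, with no intercept term."""
    if len(raw_distances) != len(true_distances):
        raise ValueError("inputs must have the same length")
    numerator = sum(r * t for r, t in zip(raw_distances, true_distances))
    denominator = sum(r * r for r in raw_distances)
    if denominator == 0:
        raise ValueError("raw distances are all zero")
    return numerator / denominator


def calibrated_distance(raw_distance, c):
    """Final estimate: the raw distance multiplied by the calibration factor."""
    return c * raw_distance


if __name__ == "__main__":
    # Synthetic illustration only: true distances generated as exactly 1.337 * raw.
    raw = [10.0, 20.0, 40.0, 80.0]
    true = [1.337 * r for r in raw]
    c = fit_calibration_factor(raw, true)
    print(c, calibrated_distance(50.0, c))
```

In practice the factor would be fitted once per evolutionary model on simulated alignments with known distances and then reused for all subsequent estimates.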

