Comparison of mode estimation methods and application in molecular clock analysis.

Hedges SB, Shah P - BMC Bioinformatics (2003)

Bottom Line: In those cases, the mode is a better estimator of the overall time of divergence than the mean or median. We determined that bootstrapping reduces the variance of both mode estimators. Application of the different methods to real data sets yielded results that were concordant with the simulations.


Affiliation: NASA Astrobiology Institute and Department of Biology, Pennsylvania State University, 208 Mueller Laboratory, University Park, PA 16802-5301, U.S.A. sbh1@psu.edu

ABSTRACT

Background: Distributions of time estimates in molecular clock studies are sometimes skewed or contain outliers. In those cases, the mode is a better estimator of the overall time of divergence than the mean or median. However, different methods are available for estimating the mode. We compared these methods in simulations to determine their strengths and weaknesses and further assessed their performance when applied to real data sets from a molecular clock study.

Results: We found that the half-range mode and robust parametric mode methods have a lower bias than other mode methods under a diversity of conditions. However, the half-range mode suffers from a relatively high variance and the robust parametric mode is more susceptible to bias by outliers. We determined that bootstrapping reduces the variance of both mode estimators. Application of the different methods to real data sets yielded results that were concordant with the simulations.

Conclusion: Because the half-range mode is a simple and fast method, and produced less bias overall in our simulations, we recommend the bootstrapped version of it as a general-purpose mode estimator and suggest a bootstrap method for obtaining the standard error and 95% confidence interval of the mode.
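The recommendation above can be illustrated with a minimal sketch. It is not the authors' implementation: `half_range_mode` is a simplified reading of the half-range idea (repeatedly keep the densest window whose width is half the current range), ties are broken by taking the first densest window, and the percentile-based 95% interval is an assumption, since the abstract does not say how the interval is constructed. The function names, the seed, and the lognormal test sample are all hypothetical.

```python
import random
import statistics


def half_range_mode(values):
    """Simplified half-range mode (illustrative sketch only).

    Repeatedly keep the densest window whose width is half the current
    range, until at most two points remain, then average them.
    """
    x = sorted(values)
    while len(x) > 2:
        width = (x[-1] - x[0]) / 2.0
        if width == 0:  # all remaining points are identical
            return x[0]
        best_lo, best_hi, best_count = 0, 0, 0
        hi = 0
        for lo in range(len(x)):
            hi = max(hi, lo)
            # Extend the window while it stays within the allowed width.
            while hi + 1 < len(x) and x[hi + 1] - x[lo] <= width:
                hi += 1
            count = hi - lo + 1
            if count > best_count:  # first densest window wins ties
                best_lo, best_hi, best_count = lo, hi, count
        x = x[best_lo:best_hi + 1]
    return sum(x) / len(x)


def bootstrap_mode(values, n_boot=100, seed=0):
    """Mean of bootstrapped modes, with a bootstrap SE and percentile 95% CI."""
    rng = random.Random(seed)
    data = list(values)
    modes = [
        half_range_mode(rng.choices(data, k=len(data)))  # resample with replacement
        for _ in range(n_boot)
    ]
    # Cut points at 2.5%, 5%, ..., 97.5%; the first and last bound the 95% CI.
    cuts = statistics.quantiles(modes, n=40, method="inclusive")
    return {
        "estimate": statistics.fmean(modes),
        "se": statistics.stdev(modes),
        "ci95": (cuts[0], cuts[-1]),
    }


# Hypothetical skewed sample standing in for a set of divergence-time estimates.
sample_rng = random.Random(42)
sample = [sample_rng.lognormvariate(0.0, 0.5) for _ in range(200)]
result = bootstrap_mode(sample)
print(result)
```

Averaging the bootstrapped modes corresponds to the "mean of bootstrapped modes" variant named in the figure caption; the spread of those same bootstrapped modes supplies the standard error and interval at no extra cost.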


Figure 4: Selected results of the simulations for bias of the estimators. Graphs shown here are for an intermediate sample size (n = 100) and normally distributed contamination located at twice the distance between the true mode and the 99th percentile; see Supplementary Data for full results. Each column of graphs represents simulations for a different original distribution: normal (left, a-d), lognormal (middle, e-h), and coarse (right, i-l). The top row (a, e, i) shows bias results for a comparison of three mode estimators with the mean and median, all using 1000 replications. The lower three rows show bias results for comparisons between mode estimators with no bootstrapping (NBM, solid line), mode of 100 bootstrapped modes (BMO, dashed line), and mean of 100 bootstrapped modes (BME, dotted line), for HRM (b, f, j), RPM (c, g, k), and SPM (d, h, l), all using 100 simulation replications (100 bootstrap replications per simulation replication). In each case, the level of contamination applied to the original distribution ranged from zero to 40% (shown on the X-axis). Bias is indicated in absolute units (Y-axis).

Mentions: The mode estimators were not biased by randomly distributed contamination, and the Cauchy-distributed contamination produced results similar to those of the normally distributed contamination, albeit at higher levels of bias. For these reasons, we confine our discussion to results using the normally distributed contamination. Also, simulations using contaminant data located close to the true mode (67th percentile) showed that mode estimators performed poorly under those conditions, as noted elsewhere [2], especially with the normal and coarse original distributions. In those cases, all estimators (mean and modes) showed similar bias, being equally misled by the contamination. On the other hand, most mode estimators performed better (lower bias) than the mean or median in simulations using contamination located at the 99th and twice the 99th percentiles. Because we wished to compare the efficiencies of the various mode estimators, we further confine the discussion to results using contamination centered at twice the 99th percentile (Figs. 1, 4, 5; Table 1).
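The contamination design described here can be sketched as follows, under several stated assumptions: the original distribution is a standard normal (true mode 0), contaminants replace a fixed fraction of the sample rather than being added to it, and contaminants have unit variance. None of these details is given in the text above, and the function names are hypothetical. Only the mean and median are compared; any mode estimator taking a list of numbers could be passed to `bias` in the same way.

```python
import random
import statistics

TRUE_MODE = 0.0  # mode of the standard normal used as the original distribution

# Contamination centre: twice the distance from the true mode to the 99th percentile.
OFFSET = 2.0 * (statistics.NormalDist().inv_cdf(0.99) - TRUE_MODE)


def contaminated_sample(n, fraction, rng):
    """Draw n points, with `fraction` of them replaced by normal contaminants."""
    n_bad = round(n * fraction)
    clean = [rng.gauss(0.0, 1.0) for _ in range(n - n_bad)]
    bad = [rng.gauss(OFFSET, 1.0) for _ in range(n_bad)]  # unit variance is an assumption
    return clean + bad


def bias(estimator, fraction, n=100, reps=1000, seed=1):
    """Average estimate over `reps` simulated samples, minus the true mode."""
    rng = random.Random(seed)
    estimates = [estimator(contaminated_sample(n, fraction, rng)) for _ in range(reps)]
    return statistics.fmean(estimates) - TRUE_MODE


# Bias of the mean and median at contamination levels from zero to 40%.
results = {}
for fraction in (0.0, 0.1, 0.2, 0.3, 0.4):
    results[fraction] = (
        bias(statistics.fmean, fraction),
        bias(statistics.median, fraction),
    )
    print(fraction, results[fraction])
```

Under these assumptions the bias of the mean grows in direct proportion to the contamination fraction, while the median is pulled less far, which is the baseline against which the mode estimators are judged.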

