Comparison of mode estimation methods and application in molecular clock analysis.

Hedges SB, Shah P - BMC Bioinformatics (2003)

Bottom Line: In those cases, the mode is a better estimator of the overall time of divergence than the mean or median. We determined that bootstrapping reduces the variance of both mode estimators. Application of the different methods to real data sets yielded results that were concordant with the simulations.


Affiliation: NASA Astrobiology Institute and Department of Biology, Pennsylvania State University, 208 Mueller Laboratory, University Park, PA 16802-5301, U.S.A. sbh1@psu.edu

ABSTRACT

Background: Distributions of time estimates in molecular clock studies are sometimes skewed or contain outliers. In those cases, the mode is a better estimator of the overall time of divergence than the mean or median. However, different methods are available for estimating the mode. We compared these methods in simulations to determine their strengths and weaknesses and further assessed their performance when applied to real data sets from a molecular clock study.

Results: We found that the half-range mode and robust parametric mode methods have a lower bias than other mode methods under a diversity of conditions. However, the half-range mode suffers from a relatively high variance and the robust parametric mode is more susceptible to bias by outliers. We determined that bootstrapping reduces the variance of both mode estimators. Application of the different methods to real data sets yielded results that were concordant with the simulations.

Conclusion: Because the half-range mode is a simple and fast method, and produced less bias overall in our simulations, we recommend the bootstrapped version of it as a general-purpose mode estimator and suggest a bootstrap method for obtaining the standard error and 95% confidence interval of the mode.
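The bootstrap recommendation above can be illustrated with a minimal sketch. It is not the authors' implementation: `half_range_mode` is a simplified reading of the half-range idea (repeatedly keep the densest window whose width is half the current range, breaking ties by taking the first such window), and `bootstrap_mode` assumes that the bootstrapped estimate is the mean of the replicate estimates, that the standard error is their standard deviation, and that the 95% interval is a simple percentile interval. All function and parameter names are hypothetical.

```python
import random
import statistics


def half_range_mode(values):
    """Simplified half-range mode (an assumption, not the published code).

    Repeatedly keeps the densest window whose width is half the range of
    the current data, then averages the few values that remain.
    """
    data = sorted(values)
    while len(data) > 2:
        width = (data[-1] - data[0]) / 2.0
        if width == 0:
            break  # all remaining values are identical
        best_start, best_count = 0, 0
        end = 0
        # Two-pointer scan: for each start, extend end while inside the window.
        for start in range(len(data)):
            while end < len(data) and data[end] - data[start] <= width:
                end += 1
            count = end - start
            if count > best_count:  # ties: keep the first window found
                best_start, best_count = start, count
        data = data[best_start:best_start + best_count]
    return statistics.mean(data)


def bootstrap_mode(values, n_boot=100, seed=0):
    """Bootstrap the mode estimator; returns (estimate, std. error, 95% CI)."""
    rng = random.Random(seed)
    estimates = sorted(
        half_range_mode(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    point = statistics.mean(estimates)   # assumed: average of replicates
    se = statistics.stdev(estimates)     # assumed: SD of replicates
    lo = estimates[int(0.025 * (n_boot - 1))]  # percentile interval
    hi = estimates[int(0.975 * (n_boot - 1))]
    return point, se, (lo, hi)


if __name__ == "__main__":
    times = [1.0, 5.9, 6.0, 6.0, 6.1, 6.2, 30.0]  # made-up values with outliers
    print(half_range_mode(times))
    print(bootstrap_mode(times))
```

The default of 100 bootstrap iterations mirrors the number used in the simulations; a larger value would give a smoother percentile interval at a higher cost.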



Figure 3: Design of the simulations. Each path from the root to a leaf defines the parameter values for an individual simulation run.

Mentions: We designed 63 simulated data sets to compare the three mode estimation methods (HRM, SPM, and RPM) with the mean and median. Each data set was defined using combinations of the following six parameters, modeled in part after the study of Bickel [13] but with slightly larger variance (standard deviation parameter = 2) to more closely approximate real biological data sets [7,9], additional levels, distributions, and locations of contamination, and other features: (1) Type of original distribution: normal (mode = 6), lognormal (mode = 1, median = 2.72), and coarse (mode = 6.25) distributions (Figure 2a). (2) Type of contamination: normal, Cauchy, and random (Figure 2b). (3) Level of contamination: 0%, 5%, 10%, 15%, 20%, 30%, and 40%. (4) Location of the contaminant: 67th percentile of the original distribution, 99th percentile of the original distribution, and "twice the 99th percentile" (true mode plus twice the distance between the 99th percentile of original distribution and the true mode). (5) Spread of the contaminant: standard deviation = 2.0. (6) Sample size: N = 20, 100, 1000. For simulations testing the non-bootstrap mode methods, 1000 replications were used. Additionally, 100 replicates were used for simulations testing the use of bootstrapping, with 100 bootstrap iterations performed during each replication. An overview of the simulation design is shown in Figure 3.
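As a rough illustration of one branch of this design, the sketch below draws a single data set from the normal original distribution (mode = 6, standard deviation = 2) with normal contamination. It is only a sketch under stated assumptions: the helper names are hypothetical, the contaminated fraction is taken as a fixed count rather than drawn per observation, and the lognormal, coarse, Cauchy, and random cases are not covered.

```python
import random
from statistics import NormalDist

MODE, SD = 6.0, 2.0  # normal original distribution from the design


def contaminant_location(kind, mode=MODE, sd=SD):
    """Centre of the contaminant for the three locations in the design."""
    original = NormalDist(mu=mode, sigma=sd)
    if kind == "67th":
        return original.inv_cdf(0.67)
    if kind == "99th":
        return original.inv_cdf(0.99)
    if kind == "twice_99th":
        # true mode plus twice the distance from the mode to the 99th percentile
        return mode + 2.0 * (original.inv_cdf(0.99) - mode)
    raise ValueError(f"unknown location: {kind}")


def contaminated_sample(n, level, kind, seed=0, mode=MODE, sd=SD):
    """Draw n values, a fraction `level` of which come from the contaminant.

    Assumption: the contaminated count is fixed at round(n * level).
    """
    rng = random.Random(seed)
    centre = contaminant_location(kind, mode, sd)
    n_contam = round(n * level)
    sample = [rng.gauss(mode, sd) for _ in range(n - n_contam)]
    sample += [rng.gauss(centre, sd) for _ in range(n_contam)]  # spread SD = 2.0
    return sample


if __name__ == "__main__":
    sample = contaminated_sample(n=100, level=0.20, kind="99th")
    print(len(sample), min(sample), max(sample))
```

Each replicate of a simulation run would call `contaminated_sample` with a different seed and pass the result to the estimators being compared.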

