Resemblance profiles as clustering decision criteria: Estimating statistical power, error, and correspondence for a hypothesis test for multivariate structure

View Article: PubMed Central - PubMed

ABSTRACT

Clustering data continues to be a highly active area of data analysis, and resemblance profiles are being incorporated into ecological methodologies as a hypothesis testing‐based approach to clustering multivariate data. However, these new clustering techniques have not been rigorously tested to determine the performance variability based on the algorithm's assumptions or any underlying data structures. Here, we use simulation studies to estimate the statistical error rates for the hypothesis test for multivariate structure based on dissimilarity profiles (DISPROF). We concurrently tested a widely used algorithm that employs the unweighted pair group method with arithmetic mean (UPGMA) to estimate the proficiency of clustering with DISPROF as a decision criterion. We simulated unstructured multivariate data from different probability distributions with increasing numbers of objects and descriptors, and grouped data with increasing overlap, overdispersion for ecological data, and correlation among descriptors within groups. Using simulated data, we measured the resolution and correspondence of clustering solutions achieved by DISPROF with UPGMA against the reference grouping partitions used to simulate the structured test datasets. Our results highlight the dynamic interactions between dataset dimensionality, group overlap, and the properties of the descriptors within a group (i.e., overdispersion or correlation structure) that are relevant to resemblance profiles as a clustering criterion for multivariate data. These methods are particularly useful for multivariate ecological datasets that benefit from distance‐based statistical analyses. We propose guidelines for using DISPROF as a clustering decision tool that will help future users avoid potential pitfalls during the application of methods and the interpretation of results.
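The general idea of a profile-based test for multivariate structure can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses Euclidean distance (ecological applications often use other dissimilarity measures), builds the null by shuffling each descriptor independently across objects, and takes the summed absolute deviation between the observed and mean permuted profiles as the test statistic. The function names, the permutation counts, and the seed are all hypothetical choices made for the example.

```python
import random
from itertools import combinations
from math import dist


def profile(data):
    """Return the sorted pairwise Euclidean distances between objects (rows)."""
    return sorted(dist(a, b) for a, b in combinations(data, 2))


def permute_columns(data, rng):
    """Shuffle each descriptor (column) independently across objects."""
    columns = [list(col) for col in zip(*data)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


def disprof_sketch(data, n_mean=100, n_null=99, seed=0):
    """Permutation test for structure based on a dissimilarity profile.

    Assumed parameters (not from the source): n_mean permutations build the
    mean null profile, and n_null further permutations build the null
    distribution of the test statistic.
    """
    rng = random.Random(seed)
    observed = profile(data)

    # Mean profile expected when descriptors are unrelated across objects.
    perm_profiles = [profile(permute_columns(data, rng)) for _ in range(n_mean)]
    mean_profile = [sum(vals) / n_mean for vals in zip(*perm_profiles)]

    def pi_stat(prof):
        # Summed absolute deviation from the mean permuted profile.
        return sum(abs(o - m) for o, m in zip(prof, mean_profile))

    pi_obs = pi_stat(observed)
    null_pis = [pi_stat(profile(permute_columns(data, rng))) for _ in range(n_null)]

    # Permutation p-value, counting the observed statistic itself.
    p_value = (1 + sum(pi >= pi_obs for pi in null_pis)) / (n_null + 1)
    return pi_obs, p_value
```

Calling `disprof_sketch(data)` on a list of equal-length numeric rows returns the statistic and a permutation p-value. Used as a clustering decision criterion, a small p-value would argue for splitting a set of objects further, and a large one for stopping.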



Figure 4. Power of the DISPROF test versus the proportion of group overlap. Statistical power of DISPROF versus Ov for all P tested under Sim 2. Each line plot represents the 50 power values (S = 1,000 datasets at each Ov level) for a given P. The horizontal dashed line at power = 0.8 marks the lower limit of acceptable power values.
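Each plotted power value can be read as one minus the proportion of type II errors among the datasets simulated at a given overlap level. A minimal sketch of that arithmetic follows; the significance level of 0.05 and the function name are assumptions made for illustration, since the excerpt does not state them.

```python
def estimate_power(p_values, alpha=0.05):
    """Estimate power from p-values of tests run on structured datasets.

    Every dataset is assumed to contain real structure, so each failure
    to reject the null hypothesis counts as a type II error.
    alpha=0.05 is an assumed significance level.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    rejections = sum(p < alpha for p in p_values)
    power = rejections / len(p_values)
    type_ii_rate = 1 - power
    return power, type_ii_rate


def is_acceptable(power, threshold=0.8):
    """Compare a power estimate with the 0.8 lower limit drawn in the figure."""
    return power >= threshold
```

With S = 1,000 simulated datasets per overlap level, `p_values` would hold 1,000 entries and the returned power would be one point on one line of the plot.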

The mean power values for each P‐dimension, calculated from the 50 proportions of type II errors estimated for each [N × P × Ov] configuration (S = 1,000), showed that the power of DISPROF to detect multivariate structure increased as the overall dimensionality of the dataset increased (Table 4). A closer look at each P‐dimension's power values (Figure 4) showed that, for P ≤ 10, the statistical power of DISPROF increased asymptotically from unacceptable levels toward 1 as Ov decreased. For all P ≥ 25, the power was estimated to equal 1 at every Ov. Furthermore, for any given Ov, the power increased as P increased.

The average number of groups (Ḡ) per S = 50,000 datasets from all [N × P] configurations across all 50 Ov levels was similar across all P, ranging from a minimum of Ḡ = 1.81 (P = 2) to a maximum of Ḡ = 2.16 (P = 5; Table 4). Closer inspection of each [P × Ov] combination (S = 1,000) revealed that, for DISPROF clustering solutions where P ≤ 3, Ḡ increased as Ov decreased: it rose from Ḡ < 2 and asymptotically approached the mean of Ḡ for all clustering solutions within a given [P × Ov] combination. For all P ≥ 5, Ḡ remained above 2 at every Ov and was much more tightly bound around its respective mean (Figure 5a, Table 4).

The mean correspondence values (mean ARI_HA) for each S = 50,000 datasets from all [N × P] configurations across all Ov increased as P increased (Table 4), and for any single Ov level the mean ARI_HA also increased with P (Figure 5b). A more detailed view within each P‐dimension (Figure 5b) indicated that, for P ≤ 5, the mean ARI_HA values remained below 0.8 for the majority of Ov scenarios but followed a generally increasing trend, eventually reaching high correspondence at low levels of Ov. All P ≥ 10 clustering solutions had mean ARI_HA values that were considerably less variable across all levels of Ov than those for P ≤ 5. These solutions' correspondence values were tightly bound around their respective means (Table 4) and displayed good or excellent correspondence (Figure 5b).
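The correspondence measure compares a clustering solution with the reference partition used to simulate the data. Assuming ARI_HA denotes the Hubert-Arabie adjusted Rand index (the excerpt does not spell this out), it can be computed from pair counts as in the sketch below. The function name and the handling of the degenerate case are illustrative choices.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Hubert-Arabie adjusted Rand index between two partitions.

    Returns 1.0 for identical partitions (up to relabeling) and values
    near 0 when agreement is no better than expected by chance.
    """
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    if n < 2:
        raise ValueError("need at least two objects")

    # Pairs of objects placed together in both partitions.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together within each partition separately.
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = pairs_true * pairs_pred / comb(n, 2)
    max_index = (pairs_true + pairs_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions put every object in one group.
        return 1.0
    return (pairs_both - expected) / (max_index - expected)
```

In a simulation like the one described, `labels_true` would be the group memberships used to generate a dataset and `labels_pred` the groups returned by the clustering procedure; averaging the index over all datasets in a configuration gives the mean correspondence value.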

