Resemblance profiles as clustering decision criteria: Estimating statistical power, error, and correspondence for a hypothesis test for multivariate structure

View Article: PubMed Central - PubMed

ABSTRACT

Clustering data continues to be a highly active area of data analysis, and resemblance profiles are being incorporated into ecological methodologies as a hypothesis testing‐based approach to clustering multivariate data. However, these new clustering techniques have not been rigorously tested to determine the performance variability based on the algorithm's assumptions or any underlying data structures. Here, we use simulation studies to estimate the statistical error rates for the hypothesis test for multivariate structure based on dissimilarity profiles (DISPROF). We concurrently tested a widely used algorithm that employs the unweighted pair group method with arithmetic mean (UPGMA) to estimate the proficiency of clustering with DISPROF as a decision criterion. We simulated unstructured multivariate data from different probability distributions with increasing numbers of objects and descriptors, and grouped data with increasing overlap, overdispersion for ecological data, and correlation among descriptors within groups. Using simulated data, we measured the resolution and correspondence of clustering solutions achieved by DISPROF with UPGMA against the reference grouping partitions used to simulate the structured test datasets. Our results highlight the dynamic interactions between dataset dimensionality, group overlap, and the properties of the descriptors within a group (i.e., overdispersion or correlation structure) that are relevant to resemblance profiles as a clustering criterion for multivariate data. These methods are particularly useful for multivariate ecological datasets that benefit from distance‐based statistical analyses. We propose guidelines for using DISPROF as a clustering decision tool that will help future users avoid potential pitfalls during the application of methods and the interpretation of results.
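The abstract does not name the measure used to compare a clustering solution with the reference partition. As a minimal sketch, assuming the plain Rand index as a stand-in for "correspondence" and the number of distinct groups as "resolution", the comparison could look like this. The function names `rand_index` and `resolution` are hypothetical and are not taken from the article.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two partitions agree.

    A pair "agrees" when both partitions place the two objects in the
    same group, or both place them in different groups.
    Assumption: the Rand index is used only as an illustrative
    correspondence measure; the article may use a different one.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # zero or one object: trivially identical partitions
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def resolution(labels):
    """Number of groups (G) in a partition."""
    return len(set(labels))


# Reference partition used to simulate the data versus a solution
# that splits the groups more finely than the reference.
reference = [0, 0, 0, 1, 1, 1]
solution = [0, 0, 1, 2, 2, 3]
print(resolution(reference), resolution(solution))
print(rand_index(reference, solution))
```

A solution with higher resolution than the reference, as in the usage lines above, illustrates the over-splitting scenario that the simulations are designed to detect.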


Figure 1 (ece32760-fig-0001): Theoretical diagram of the process flow for DISPROF clustering with UPGMA: (1) Data are pretreated and configured. (2) An appropriate resemblance metric is applied to the pretreated dataset. (3) The UPGMA site‐connection linkage is assembled. (4) DISPROF is employed in an iterative process to identify the grouping structure in the data and create breaks in the associated linkage tree. (5) DISPROF settles on a final solution, and a two‐dimensional dendrogram visualization is created.
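Step (4) of the process flow can be illustrated with a minimal sketch of a single profile test, written from the general description of resemblance profiles in Clarke et al. (2008) rather than from the authors' own code. Several choices are assumptions: Bray-Curtis stands in for "an appropriate resemblance metric", the null model shuffles each descriptor independently across objects, the statistic is the summed absolute deviation of the observed sorted profile from the mean permuted profile, and all function names are hypothetical. In a full analysis this test would be repeated at successive nodes of the UPGMA tree; only one test is shown here.

```python
import random


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two non-negative vectors."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den > 0 else 0.0


def profile(data):
    """Sorted pairwise dissimilarities among all objects (rows)."""
    n = len(data)
    return sorted(
        bray_curtis(data[i], data[j])
        for i in range(n)
        for j in range(i + 1, n)
    )


def permute_columns(data, rng):
    """Shuffle each descriptor (column) independently across objects."""
    cols = [list(c) for c in zip(*data)]
    for c in cols:
        rng.shuffle(c)
    return [list(r) for r in zip(*cols)]


def profile_test(data, n_perm=199, seed=0):
    """Permutation test for multivariate structure among the rows.

    Returns (pi_obs, p_value). Assumed simplification of the
    profile test; not the published implementation.
    """
    rng = random.Random(seed)
    observed = profile(data)

    # First set of permutations: build the mean null profile.
    null_profiles = [profile(permute_columns(data, rng))
                     for _ in range(n_perm)]
    mean_profile = [sum(v) / n_perm for v in zip(*null_profiles)]

    def pi(p):
        # Summed absolute deviation from the mean null profile.
        return sum(abs(a - b) for a, b in zip(p, mean_profile))

    pi_obs = pi(observed)

    # Second, independent set: null distribution of the statistic.
    pi_null = [pi(profile(permute_columns(data, rng)))
               for _ in range(n_perm)]
    exceed = sum(1 for v in pi_null if v >= pi_obs)
    return pi_obs, (exceed + 1) / (n_perm + 1)


# Two clearly separated groups of five objects, six descriptors each.
group_a = [[10 + i, 12, 9 + i, 0, 1, 0] for i in range(5)]
group_b = [[0, 1, 0, 10 + i, 11, 9 + i] for i in range(5)]
pi_obs, p_value = profile_test(group_a + group_b, n_perm=99, seed=1)
print(pi_obs, p_value)
```

A small p-value would be read as evidence of structure among the objects, prompting a break in the linkage tree at that node; a large one would leave the objects as a single homogeneous group.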
License: Creative Commons Attribution (CC BY)


Mentions: Clarke et al. (2008) demonstrated the use of SIMPROF in conjunction with agglomerative hierarchical clustering via the unweighted pair group method with arithmetic mean (UPGMA; Figure 1), and they also described two theoretical corollaries to the functional dynamics of their algorithm. They proposed that (1) the test for multivariate structure would become more powerful as the number of descriptors increased and (2) the resolution of any structure identified (i.e., the number of groups, G) might be far finer (i.e., G might be far greater) than can be meaningfully interpreted (Clarke et al., 2008). To our knowledge, these corollaries have yet to be tested empirically with numerical simulations. Given recent inconsistencies in the performance of other permutation‐ and distance‐based hypothesis tests (e.g., ANOSIM and Mantel tests; Anderson & Walsh, 2013; Legendre & Fortin, 2010), we felt that such testing was warranted.

