Resemblance profiles as clustering decision criteria: Estimating statistical power, error, and correspondence for a hypothesis test for multivariate structure


ABSTRACT

Clustering data continues to be a highly active area of data analysis, and resemblance profiles are being incorporated into ecological methodologies as a hypothesis testing‐based approach to clustering multivariate data. However, these new clustering techniques have not been rigorously tested to determine the performance variability based on the algorithm's assumptions or any underlying data structures. Here, we use simulation studies to estimate the statistical error rates for the hypothesis test for multivariate structure based on dissimilarity profiles (DISPROF). We concurrently tested a widely used algorithm that employs the unweighted pair group method with arithmetic mean (UPGMA) to estimate the proficiency of clustering with DISPROF as a decision criterion. We simulated unstructured multivariate data from different probability distributions with increasing numbers of objects and descriptors, and grouped data with increasing overlap, overdispersion for ecological data, and correlation among descriptors within groups. Using simulated data, we measured the resolution and correspondence of clustering solutions achieved by DISPROF with UPGMA against the reference grouping partitions used to simulate the structured test datasets. Our results highlight the dynamic interactions between dataset dimensionality, group overlap, and the properties of the descriptors within a group (i.e., overdispersion or correlation structure) that are relevant to resemblance profiles as a clustering criterion for multivariate data. These methods are particularly useful for multivariate ecological datasets that benefit from distance‐based statistical analyses. We propose guidelines for using DISPROF as a clustering decision tool that will help future users avoid potential pitfalls during the application of methods and the interpretation of results.



Figure 2. Two examples of Euclidean‐dissimilarity profiles. The sort order of the resemblance values increases along the x‐axis, and the sorted pairwise dissimilarity values increase along the y‐axis. (a) A dissimilarity profile for a simulated unstructured dataset drawn from the exponential probability distribution with [N × P] = [50 × 50]. The observed profile lies within the 99% confidence envelope based on 999 permutations of the observed data. (b) A dissimilarity profile for a simulated structured dataset drawn from the normal distribution with two groups having equal variance, [N × P] = [50 × 50], and Ov = 0.01. Many dissimilarity values in the observed profile fall above or below the expected mean permuted profile and its associated 99% confidence envelope, signifying the presence of structure in the dataset.
© Copyright Policy: Creative Commons Attribution (CC BY)

For a set of objects, a similarity profile is created by plotting the rank‐ordered similarity values against each value's rank (Figure 2a). This profile is compared with the mean of the rank‐ordered similarity values from many randomized profiles (i.e., ≥1,000), each created by permuting the original descriptor measurements across objects. The π statistic is the sum of the absolute deviations of the observed profile from this mean permuted profile. Intuitively, if an observed profile has many more high and/or low similarity values than would be expected in the absence of structure, multivariate structure is deemed present (Figure 2b). The null hypothesis (H0) of “no multivariate structure among objects, with respect to the descriptors” in the original dataset is formally tested by examining the position of the observed π statistic relative to the distribution of permuted π statistics. To model this distribution, an additional set of permuted similarity profiles (i.e., ≥1,000 iterations) is created, and their π statistics are calculated with respect to the same mean profile used to calculate the observed π statistic. The p‐value for the observed π statistic is the proportion of permuted π statistics that are at least as large as the observed statistic (Clarke et al., 2008).
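The permutation procedure described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`sorted_profile`, `permute_descriptors`, `disprof_test`), the default of 999 permutations, and the fixed random seed are all hypothetical choices made for illustration. The sketch uses Euclidean dissimilarity, as in Figure 2, and assumes that each descriptor (column) is shuffled independently across objects (rows).

```python
import numpy as np


def sorted_profile(data):
    """Return the rank-ordered pairwise Euclidean dissimilarities among rows."""
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(data.shape[0], k=1)  # each pair of objects once
    return np.sort(dist[upper])


def permute_descriptors(data, rng):
    """Shuffle every descriptor (column) independently across objects (rows)."""
    return np.column_stack([rng.permutation(col) for col in data.T])


def disprof_test(data, n_perm=999, seed=0):
    """Sketch of the profile-based test; returns (pi_observed, p_value)."""
    rng = np.random.default_rng(seed)
    observed = sorted_profile(data)

    # First permutation set: estimate the mean profile expected without structure.
    mean_profile = np.mean(
        [sorted_profile(permute_descriptors(data, rng)) for _ in range(n_perm)],
        axis=0,
    )

    # pi: summed absolute deviation of the observed profile from the mean profile.
    pi_observed = np.abs(observed - mean_profile).sum()

    # Second, independent permutation set: distribution of pi without structure,
    # measured against the same mean profile.
    pi_permuted = np.array([
        np.abs(sorted_profile(permute_descriptors(data, rng)) - mean_profile).sum()
        for _ in range(n_perm)
    ])

    # Proportion of permuted pi values at least as large as the observed one.
    p_value = float(np.mean(pi_permuted >= pi_observed))
    return float(pi_observed), p_value


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    unstructured = rng.exponential(size=(50, 50))  # loosely mirrors Figure 2a
    print(disprof_test(unstructured, n_perm=199))
```

Two separate permutation sets are used so that the profiles that define the mean profile are not reused to build the reference distribution of π. Some implementations also count the observed statistic in both the numerator and the denominator of the p‐value, which keeps it from being exactly zero; the sketch follows the simpler proportion described in the text.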

