Limits...
What to Do When K -Means Clustering Fails: A Simple yet Principled Alternative Algorithm

View Article: PubMed Central - PubMed

ABSTRACT

The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a-priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.

No MeSH data available.


Clustering solution obtained by K-means and MAP-DP for overlapping, synthetic elliptical Gaussian data.All clusters have different elliptical covariances, and the data is unequally distributed across different clusters (30% blue cluster, 5% yellow cluster, 65% orange). The significant overlap is challenging even for MAP-DP, but it produces a meaningful clustering solution where the only mislabelled points lie in the overlapping region. K-means does not produce a clustering result which is faithful to the actual clustering.
© Copyright Policy
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC5036949&req=5

pone.0162259.g006: Clustering solution obtained by K-means and MAP-DP for overlapping, synthetic elliptical Gaussian data.All clusters have different elliptical covariances, and the data is unequally distributed across different clusters (30% blue cluster, 5% yellow cluster, 65% orange). The significant overlap is challenging even for MAP-DP, but it produces a meaningful clustering solution where the only mislabelled points lie in the overlapping region. K-means does not produce a clustering result which is faithful to the actual clustering.

Mentions: Having seen that MAP-DP works well in cases where K-means can fail badly, we will examine a clustering problem which should be a challenge for MAP-DP. The data is generated from three elliptical Gaussian distributions with different covariances and different number of points in each cluster. There is significant overlap between the clusters. MAP-DP manages to correctly learn the number of clusters in the data and obtains a good, meaningful solution which is close to the truth (Fig 6, NMI score 0.88, Table 3). The small number of data points mislabeled by MAP-DP are all in the overlapping region. By contrast, K-means fails to perform a meaningful clustering (NMI score 0.56) and mislabels a large fraction of the data points that are outside the overlapping region. This shows that MAP-DP, unlike K-means, can easily accommodate departures from sphericity even in the context of significant cluster overlap. As the cluster overlap increases, MAP-DP degrades but always leads to a much more interpretable solution than K-means. In this example, the number of clusters can be correctly estimated using BIC.


What to Do When K -Means Clustering Fails: A Simple yet Principled Alternative Algorithm
Clustering solution obtained by K-means and MAP-DP for overlapping, synthetic elliptical Gaussian data.All clusters have different elliptical covariances, and the data is unequally distributed across different clusters (30% blue cluster, 5% yellow cluster, 65% orange). The significant overlap is challenging even for MAP-DP, but it produces a meaningful clustering solution where the only mislabelled points lie in the overlapping region. K-means does not produce a clustering result which is faithful to the actual clustering.
© Copyright Policy
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC5036949&req=5

pone.0162259.g006: Clustering solution obtained by K-means and MAP-DP for overlapping, synthetic elliptical Gaussian data.All clusters have different elliptical covariances, and the data is unequally distributed across different clusters (30% blue cluster, 5% yellow cluster, 65% orange). The significant overlap is challenging even for MAP-DP, but it produces a meaningful clustering solution where the only mislabelled points lie in the overlapping region. K-means does not produce a clustering result which is faithful to the actual clustering.
Mentions: Having seen that MAP-DP works well in cases where K-means can fail badly, we will examine a clustering problem which should be a challenge for MAP-DP. The data is generated from three elliptical Gaussian distributions with different covariances and different number of points in each cluster. There is significant overlap between the clusters. MAP-DP manages to correctly learn the number of clusters in the data and obtains a good, meaningful solution which is close to the truth (Fig 6, NMI score 0.88, Table 3). The small number of data points mislabeled by MAP-DP are all in the overlapping region. By contrast, K-means fails to perform a meaningful clustering (NMI score 0.56) and mislabels a large fraction of the data points that are outside the overlapping region. This shows that MAP-DP, unlike K-means, can easily accommodate departures from sphericity even in the context of significant cluster overlap. As the cluster overlap increases, MAP-DP degrades but always leads to a much more interpretable solution than K-means. In this example, the number of clusters can be correctly estimated using BIC.

View Article: PubMed Central - PubMed

ABSTRACT

The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a-priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.

No MeSH data available.