A highly efficient multi-core algorithm for clustering extremely large datasets.

Kraus JM, Kestler HA - BMC Bioinformatics (2010)

Bottom Line: We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network-based parallelization. Most desktop computers and even notebooks provide at least dual-core processors.


Affiliation: Institute of Neural Information Processing, University of Ulm, 89069 Ulm, Germany.

ABSTRACT

Background: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies, and this demand is likely to increase further. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches to parallelizing algorithms rely on network communication protocols that connect, and require, multiple computers. One answer to this problem is to utilize the intrinsic capabilities of current multi-core hardware to distribute the tasks among the different cores of one computer.

Results: We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms, based on the design principles of transactional memory, for clustering gene expression microarray-type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network-based parallelization.
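The abstract does not reproduce the implementation, but the core idea of a shared-memory parallel k-means step can be sketched in Java. The following is a minimal illustration, not the authors' McKmeans code: the data rows are partitioned across worker threads, each worker assigns its rows to the nearest centroid and accumulates private partial centroid sums, and the partial results are merged in the update step. The transactional-memory machinery mentioned in the paper is omitted here in favour of plain disjoint partitioning.

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative sketch of a shared-memory parallel k-means iteration.
// Each worker handles a disjoint range of rows, so the label array and the
// per-worker partial sums need no locking; only the merge phase is serial.
public class ParallelKMeansStep {

    static double squaredDistance(double[] a, double[] b) {
        double d = 0.0;
        for (int i = 0; i < a.length; i++) { double diff = a[i] - b[i]; d += diff * diff; }
        return d;
    }

    // Per-worker partial result: per-cluster coordinate sums and counts.
    static class Partial {
        final double[][] sums; final int[] counts;
        Partial(int k, int dim) { sums = new double[k][dim]; counts = new int[k]; }
    }

    // One assignment + update step; returns the cluster label of every row.
    public static int[] step(final double[][] data, final double[][] centroids, int nThreads)
            throws InterruptedException, ExecutionException {
        final int n = data.length, k = centroids.length, dim = data[0].length;
        final int[] labels = new int[n];

        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        List<Future<Partial>> futures = new ArrayList<>();
        int chunk = (n + nThreads - 1) / nThreads;

        for (int t = 0; t < nThreads; t++) {
            final int from = t * chunk, to = Math.min(n, (t + 1) * chunk);
            futures.add(pool.submit(() -> {
                Partial p = new Partial(k, dim);
                for (int i = from; i < to; i++) {
                    int best = 0;
                    double bestDist = squaredDistance(data[i], centroids[0]);
                    for (int c = 1; c < k; c++) {
                        double d = squaredDistance(data[i], centroids[c]);
                        if (d < bestDist) { bestDist = d; best = c; }
                    }
                    labels[i] = best;   // disjoint index range, no locking needed
                    p.counts[best]++;
                    for (int j = 0; j < dim; j++) p.sums[best][j] += data[i][j];
                }
                return p;
            }));
        }

        // Merge the partial sums and recompute the centroids.
        double[][] newSums = new double[k][dim];
        int[] totals = new int[k];
        for (Future<Partial> f : futures) {
            Partial p = f.get();
            for (int c = 0; c < k; c++) {
                totals[c] += p.counts[c];
                for (int j = 0; j < dim; j++) newSums[c][j] += p.sums[c][j];
            }
        }
        for (int c = 0; c < k; c++)
            if (totals[c] > 0)
                for (int j = 0; j < dim; j++) centroids[c][j] = newSums[c][j] / totals[c];

        pool.shutdown();
        return labels;
    }
}
```

Because every worker writes only to its own slice of the label array and to its own partial sums, the assignment step runs without synchronization; the threads meet only in the short merge phase, which keeps the parallelization overhead low for large data sets.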

Conclusions: Most desktop computers, and even notebooks, provide at least dual-core processors. Our multi-core algorithms show that, using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer.
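To make the idea of cluster-number estimation by repeated runs concrete, the following hypothetical Java sketch clusters a data set several times per candidate number of clusters with different random seeds and scores each candidate by the mean pairwise Rand index of the resulting partitions. Both the KMeans.cluster entry point and the choice of the Rand index as agreement measure are assumptions made for this example; the paper itself only states that repeated runs with slightly changed parameters are used.

```java
// Hypothetical driver for cluster-number estimation by repeated runs.
// KMeans.cluster(data, k, seed) is an assumed k-means routine returning one
// label per row (e.g. the parallel step above wrapped in a full Lloyd loop).
public class ClusterNumberEstimation {

    // Rand index: fraction of point pairs on which two partitions agree
    // (both same cluster or both different). Naive O(n^2) version, sketch only.
    static double randIndex(int[] a, int[] b) {
        long agree = 0, pairs = 0;
        for (int i = 0; i < a.length; i++)
            for (int j = i + 1; j < a.length; j++) {
                boolean sameA = a[i] == a[j], sameB = b[i] == b[j];
                if (sameA == sameB) agree++;
                pairs++;
            }
        return (double) agree / pairs;
    }

    // Stability score for one candidate k: mean pairwise Rand index over
    // 'runs' clusterings started from different random seeds.
    static double stability(double[][] data, int k, int runs) {
        int[][] labels = new int[runs][];
        for (int r = 0; r < runs; r++)
            labels[r] = KMeans.cluster(data, k, /* seed = */ r);  // hypothetical call
        double sum = 0.0; int count = 0;
        for (int i = 0; i < runs; i++)
            for (int j = i + 1; j < runs; j++) { sum += randIndex(labels[i], labels[j]); count++; }
        return sum / count;
    }
}
```

A candidate number of clusters whose repeated runs keep producing essentially the same partition (stability close to 1) would be preferred over one whose partitions fluctuate from run to run.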



Figure 5: Runtime performance of ParaKMeans, k-means R, and McKmeans on the artificial data sets. Benchmark results for the simulated data sets (no cluster structure imposed, features chosen uniformly from [0, 1]) comparing the runtime of ParaKMeans, k-means R, and McKmeans. For the smaller data set (panel A) the computational overhead of the parallelization negatively affects the runtime. For the larger data set (1 million cases, panel B) an improvement of the runtime by a factor of 10 can be observed. The network-based parallelization algorithm ParaKMeans is significantly slower than McKmeans. Panel C shows the dependency of the runtime on the number of threads used (Kruskal-Wallis test: p = 1.15 × 10⁻⁵) and panel D the number of cores used (Kruskal-Wallis test: p = 4.59 × 10⁻⁶) for a data set of 100000 cases and 500 features. Each box summarizes the results from 10 repeated clusterings (median and interquartile range).

Mentions: We generated data sets without imposing a cluster structure. As the k-means algorithm is guaranteed to converge to a clustering, the median runtime of the algorithm on such data sets was used as a performance measure. We generated three simulated data sets (10000 samples with 100 features, 100000 samples with 500 features, 1000000 samples with 200 features). Each feature is uniformly distributed over the interval [0, 1] to minimize the effect of random initializations. The performance of clustering the data sets into 20 clusters is summarized in Figure 5A. Each box summarizes the results of 10 repeated clusterings (median and interquartile range). In the case of the small data set, the computational overhead of the thread management negatively affects the runtime. For the extremely large data set, an improvement of the runtime by a factor of 10 can be observed (Figure 5B).
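A rough Java sketch of this benchmark protocol (not the authors' actual benchmarking code) is given below: it generates a data set with uniformly distributed features, clusters it into 20 groups ten times, and reports the median runtime. The KMeans.cluster entry point is again a hypothetical stand-in for the clustering routine being timed.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of the benchmark protocol: uniform data without cluster structure,
// 10 repeated clusterings into k = 20 groups, median runtime as the measure.
public class RuntimeBenchmark {
    public static void main(String[] args) {
        int n = 100_000, features = 500, k = 20, repeats = 10;
        Random rng = new Random(42);

        double[][] data = new double[n][features];
        for (double[] row : data)
            for (int j = 0; j < features; j++)
                row[j] = rng.nextDouble();               // uniform on [0, 1]

        double[] seconds = new double[repeats];
        for (int r = 0; r < repeats; r++) {
            long start = System.nanoTime();
            KMeans.cluster(data, k, /* seed = */ r);     // hypothetical k-means call
            seconds[r] = (System.nanoTime() - start) / 1e9;
        }
        Arrays.sort(seconds);
        double median = (seconds[repeats / 2 - 1] + seconds[repeats / 2]) / 2.0; // even number of runs
        System.out.printf("median runtime over %d runs: %.2f s%n", repeats, median);
    }
}
```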

