A highly efficient multi-core algorithm for clustering extremely large datasets.

Kraus JM, Kestler HA - BMC Bioinformatics (2010)

Bottom Line: We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network-based parallelization. Most desktop computers and even notebooks provide at least dual-core processors.


Affiliation: Institute of Neural Information Processing, University of Ulm, 89069 Ulm, Germany.

ABSTRACT

Background: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase further. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches to parallelizing algorithms rely on network communication protocols that connect, and require, multiple computers. One answer to this problem is to utilize the intrinsic capabilities of current multi-core hardware to distribute tasks among the different cores of one computer.

Results: We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms, based on the design principles of transactional memory, for clustering gene expression microarray-type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy, compared to single-core implementations and a recently published network-based parallelization.
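The core idea — parallelizing k-means over the cores of one machine with shared memory rather than over a network — can be sketched in Java, the paper's implementation language. This is a simplified illustration, not the authors' code: the class and method names are invented here, and the transactional-memory-style prototype update described in the paper is omitted, leaving only the embarrassingly parallel assignment step.

```java
import java.util.stream.IntStream;

public class ParallelKMeans {

    // Index of the prototype nearest to point x (squared Euclidean distance).
    static int nearest(double[] x, double[][] proto) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < proto.length; j++) {
            double d = 0.0;
            for (int f = 0; f < x.length; f++) {
                double diff = x[f] - proto[j][f];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = j; }
        }
        return best;
    }

    // Assignment step, parallelized over data points on the cores of one
    // machine. Each worker writes to a distinct slot of 'labels', so no
    // synchronization is needed here; the paper additionally parallelizes
    // the prototype update using a transactional-memory-style scheme.
    static int[] assignAll(double[][] data, double[][] proto) {
        int[] labels = new int[data.length];
        IntStream.range(0, data.length)
                 .parallel()
                 .forEach(i -> labels[i] = nearest(data[i], proto));
        return labels;
    }
}
```

Because the assignment of each point is independent, this step scales almost linearly with the number of cores; the contended part of k-means is the prototype update, which is where the transactional-memory design of the paper comes in.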

Conclusions: Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that, using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on a laboratory computer.


Figure 7: Clustering and stability estimation for HapMap SNP profiles. Cluster number estimation via repeated clustering of profiles/subjects for the HapMap data (210 profiles, 116678 SNPs). For each k ∈ {2, 3, ..., 10}, 1000 repeated cluster runs were performed. For each cluster number, two boxplots are shown, one summarizing the MCA values from the pairwise comparisons of all cluster results (left), and the other one for the results from the random prototype baseline (right). A higher value indicates increased stability.

Mentions: We performed a cluster number estimation for this data (210 rows (profiles), 116678 columns (SNPs)). For each k ∈ {2, 3, ..., 10}, we performed 1000 runs of clustering on a resampled data set of row size 195 (jackknife). The results are illustrated in Figure 7. For each k, two boxplots are shown, one summarizing the MCA values from the pairwise comparisons of all cluster results and the other showing the results of the random baseline. The best clustering (maximal distance between medians) is reported for k = 4 (Mann-Whitney test: p < 1.0 × 10^-16). We then computed 1000 repeated runs of k-means with k = 4. The clustering with the minimal quantization error is given in Table 2. The reported clustering essentially coincides with the different populations. All individuals from CEU form a cluster, as do all individuals from YRI. One individual from CHB is clustered into the group of JPT, and 3 individuals from JPT are clustered into the group of CHB. This gives an overall accuracy of 98.1% for separating the populations by clustering the available SNP data.
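The stability protocol above — repeated runs on resampled data, compared pairwise — needs an agreement score between two clusterings that is invariant to how the cluster labels happen to be numbered. A minimal Java sketch of such a score follows; this is not the paper's MCA implementation — the class name is invented here, and the brute-force search over label permutations is practical only for small k (such as the k ≤ 10 range used above):

```java
public class ClusterStability {

    // Fraction of points on which two clusterings agree, maximized over
    // all relabelings (permutations) of the second clustering's labels.
    // Brute force over k! permutations; fine for small k only.
    static double agreement(int[] a, int[] b, int k) {
        int[] perm = new int[k];
        for (int i = 0; i < k; i++) perm[i] = i;
        return bestOverPermutations(perm, 0, a, b);
    }

    // Enumerate permutations of 'perm' by in-place swaps; at each leaf,
    // score the relabeling perm[b[i]] against a and keep the maximum.
    static double bestOverPermutations(int[] perm, int pos, int[] a, int[] b) {
        if (pos == perm.length) {
            int match = 0;
            for (int i = 0; i < a.length; i++)
                if (a[i] == perm[b[i]]) match++;
            return (double) match / a.length;
        }
        double best = 0.0;
        for (int j = pos; j < perm.length; j++) {
            int t = perm[pos]; perm[pos] = perm[j]; perm[j] = t;  // swap in
            best = Math.max(best, bestOverPermutations(perm, pos + 1, a, b));
            t = perm[pos]; perm[pos] = perm[j]; perm[j] = t;      // swap back
        }
        return best;
    }
}
```

With a score like this, the two boxplots per k in Figure 7 correspond to the distribution of pairwise scores among the 1000 resampled runs versus the same scores computed against a random-prototype baseline; a wide gap between the two medians indicates a stable cluster number.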

