Limits...
Misty Mountain clustering: application to fast unsupervised flow cytometry gating.

Sugár IP, Sealfon SC - BMC Bioinformatics (2010)

Bottom Line: These supervised serial clustering methods are time consuming and frequently different criteria result in different optimal cluster numbers.Misty Mountain is fast, unbiased for cluster shape, identifies stable clusters and is robust to noise.It provides a useful, general solution for multidimensional clustering problems.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Neurology and Center for Translational Systems Biology, Mount Sinai School of Medicine, New York, NY, USA. istvan.sugar@mssm.edu

ABSTRACT

Background: There are many important clustering questions in computational biology for which no satisfactory method exists. Automated clustering algorithms, when applied to large, multidimensional datasets, such as flow cytometry data, prove unsatisfactory in terms of speed, problems with local minima or cluster shape bias. Model-based approaches are restricted by the assumptions of the fitting functions. Furthermore, model based clustering requires serial clustering for all cluster numbers within a user defined interval. The final cluster number is then selected by various criteria. These supervised serial clustering methods are time consuming and frequently different criteria result in different optimal cluster numbers. Various unsupervised heuristic approaches that have been developed such as affinity propagation are too expensive to be applied to datasets on the order of 106 points that are often generated by high throughput experiments.

Results: To circumvent these limitations, we developed a new, unsupervised density contour clustering algorithm, called Misty Mountain, that is based on percolation theory and that efficiently analyzes large data sets. The approach can be envisioned as a progressive top-down removal of clouds covering a data histogram relief map to identify clusters by the appearance of statistically distinct peaks and ridges. This is a parallel clustering method that finds every cluster after analyzing only once the cross sections of the histogram. The overall run time for the composite steps of the algorithm increases linearly by the number of data points. The clustering of 106 data points in 2D data space takes place within about 15 seconds on a standard laptop PC. Comparison of the performance of this algorithm with other state of the art automated flow cytometry gating methods indicate that Misty Mountain provides substantial improvements in both run time and in the accuracy of cluster assignment.

Conclusions: Misty Mountain is fast, unbiased for cluster shape, identifies stable clusters and is robust to noise. It provides a useful, general solution for multidimensional clustering problems. We demonstrate its suitability for automated gating of flow cytometry data.

Show MeSH
Two-dimensional FCM data. 853,674 U937 cells are stained by two florescence dyes, Pacific Blue and APC-Cy7-A. a) The fluorescence intensity of APC-Cy7-A is plotted against the fluorescence intensity of Pacific Blue. b) Result of the cluster analysis by using the Misty Mountain method. Each cluster is marked by a code number. Table in Additional File 2 lists the characteristics of the resulting clusters.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC2967560&req=5

Figure 3: Two-dimensional FCM data. 853,674 U937 cells are stained by two florescence dyes, Pacific Blue and APC-Cy7-A. a) The fluorescence intensity of APC-Cy7-A is plotted against the fluorescence intensity of Pacific Blue. b) Result of the cluster analysis by using the Misty Mountain method. Each cluster is marked by a code number. Table in Additional File 2 lists the characteristics of the resulting clusters.

Mentions: The performance of the Misty Mountain algorithm with a complex flow cytometry dataset consisting varying levels of two fluorophores, APC-Cy7-A and Pacific Blue-A, in 853,674 U937 cells is shown in Figure 3. The dataset in Figure 3a was generated for a barcoding experiment [40] in which different groups of cells were labeled with different concentrations of each fluorophore. The respective optimal histogram that was analyzed contained 52 × 52 bins. The most populated bin contained 4003 data points. Thus during the analysis, 4003 cross sections of the histogram were created. The elapsed CPU time of the cluster analysis was 9.8 sec. The results of the cluster analysis are shown in Figure 3b. The analysis identified 15 large clusters where the reliability of the cluster elements was from 0.75-0.98, and 5 small clusters with about 0.5 reliability. These clusters contained 87% of all the data points. The characteristic properties of the assigned clusters are listed in Table in Additional File 2. The last cluster in the table is a very small one and it is considered as noise (see Sec. Major and Small Peaks of the Histogram). In Additional Files 3 and 4 the analysis of an even more complex 3D barcoding experiment is shown.


Misty Mountain clustering: application to fast unsupervised flow cytometry gating.

Sugár IP, Sealfon SC - BMC Bioinformatics (2010)

Two-dimensional FCM data. 853,674 U937 cells are stained by two florescence dyes, Pacific Blue and APC-Cy7-A. a) The fluorescence intensity of APC-Cy7-A is plotted against the fluorescence intensity of Pacific Blue. b) Result of the cluster analysis by using the Misty Mountain method. Each cluster is marked by a code number. Table in Additional File 2 lists the characteristics of the resulting clusters.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC2967560&req=5

Figure 3: Two-dimensional FCM data. 853,674 U937 cells are stained by two florescence dyes, Pacific Blue and APC-Cy7-A. a) The fluorescence intensity of APC-Cy7-A is plotted against the fluorescence intensity of Pacific Blue. b) Result of the cluster analysis by using the Misty Mountain method. Each cluster is marked by a code number. Table in Additional File 2 lists the characteristics of the resulting clusters.
Mentions: The performance of the Misty Mountain algorithm with a complex flow cytometry dataset consisting varying levels of two fluorophores, APC-Cy7-A and Pacific Blue-A, in 853,674 U937 cells is shown in Figure 3. The dataset in Figure 3a was generated for a barcoding experiment [40] in which different groups of cells were labeled with different concentrations of each fluorophore. The respective optimal histogram that was analyzed contained 52 × 52 bins. The most populated bin contained 4003 data points. Thus during the analysis, 4003 cross sections of the histogram were created. The elapsed CPU time of the cluster analysis was 9.8 sec. The results of the cluster analysis are shown in Figure 3b. The analysis identified 15 large clusters where the reliability of the cluster elements was from 0.75-0.98, and 5 small clusters with about 0.5 reliability. These clusters contained 87% of all the data points. The characteristic properties of the assigned clusters are listed in Table in Additional File 2. The last cluster in the table is a very small one and it is considered as noise (see Sec. Major and Small Peaks of the Histogram). In Additional Files 3 and 4 the analysis of an even more complex 3D barcoding experiment is shown.

Bottom Line: These supervised serial clustering methods are time consuming and frequently different criteria result in different optimal cluster numbers.Misty Mountain is fast, unbiased for cluster shape, identifies stable clusters and is robust to noise.It provides a useful, general solution for multidimensional clustering problems.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Neurology and Center for Translational Systems Biology, Mount Sinai School of Medicine, New York, NY, USA. istvan.sugar@mssm.edu

ABSTRACT

Background: There are many important clustering questions in computational biology for which no satisfactory method exists. Automated clustering algorithms, when applied to large, multidimensional datasets, such as flow cytometry data, prove unsatisfactory in terms of speed, problems with local minima or cluster shape bias. Model-based approaches are restricted by the assumptions of the fitting functions. Furthermore, model based clustering requires serial clustering for all cluster numbers within a user defined interval. The final cluster number is then selected by various criteria. These supervised serial clustering methods are time consuming and frequently different criteria result in different optimal cluster numbers. Various unsupervised heuristic approaches that have been developed such as affinity propagation are too expensive to be applied to datasets on the order of 106 points that are often generated by high throughput experiments.

Results: To circumvent these limitations, we developed a new, unsupervised density contour clustering algorithm, called Misty Mountain, that is based on percolation theory and that efficiently analyzes large data sets. The approach can be envisioned as a progressive top-down removal of clouds covering a data histogram relief map to identify clusters by the appearance of statistically distinct peaks and ridges. This is a parallel clustering method that finds every cluster after analyzing only once the cross sections of the histogram. The overall run time for the composite steps of the algorithm increases linearly by the number of data points. The clustering of 106 data points in 2D data space takes place within about 15 seconds on a standard laptop PC. Comparison of the performance of this algorithm with other state of the art automated flow cytometry gating methods indicate that Misty Mountain provides substantial improvements in both run time and in the accuracy of cluster assignment.

Conclusions: Misty Mountain is fast, unbiased for cluster shape, identifies stable clusters and is robust to noise. It provides a useful, general solution for multidimensional clustering problems. We demonstrate its suitability for automated gating of flow cytometry data.

Show MeSH