A non-parametric Bayesian model for joint cell clustering and cluster matching: identification of anomalous sample phenotypes with random effects.

Dundar M, Akova F, Yerebakan HZ, Rajwa B - BMC Bioinformatics (2014)

Bottom Line: These results demonstrate that anomalous samples can be identified by ASPIRE with almost perfect accuracy without a priori access to samples of anomalous subtypes in the training set. The ASPIRE approach is unique in its ability to form generalizations regarding normal and anomalous states given only very weak assumptions regarding sample characteristics and origin. Thus, ASPIRE could become highly instrumental in providing unique insights about observed biological phenomena in the absence of full information about the investigated samples.

Affiliation: Computer Science Department, IUPUI, 723 W Michigan St, 46037 Indianapolis, IN, US. dundar@cs.iupui.edu.

ABSTRACT

Background: Flow cytometry (FC)-based computer-aided diagnostics is an emerging technique utilizing modern multiparametric cytometry systems. The major difficulty in using machine-learning approaches for classification of FC data arises from limited access to a wide variety of anomalous samples for training. As a consequence, any model trained on an abundance of normal cases and a limited set of specific anomalous cases is biased towards the types of anomalies represented in the training set. Such models do not accurately identify anomalies, whether previously known or unknown, that may exist in future samples. Although one-class classifiers trained using only normal cases would avoid such a bias, robust sample characterization is critical for a generalizable model. Owing to sample heterogeneity and instrumental variability, arbitrary characterization of samples usually introduces feature noise that may lead to poor predictive performance. Herein, we present a non-parametric Bayesian algorithm called ASPIRE (anomalous sample phenotype identification with random effects) that identifies phenotypic differences across a batch of samples in the presence of random effects. Our approach uses probabilistic sampling techniques to simultaneously cluster cellular measurements in individual samples and match the discovered clusters across all samples, thereby recovering global clusters in a systematic way.

Results: We demonstrate the performance of the proposed method in identifying anomalous samples in two different FC data sets, one of which represents a set of samples including acute myeloid leukemia (AML) cases, and the other a generic 5-parameter peripheral-blood immunophenotyping. Results are evaluated in terms of the area under the receiver operating characteristics curve (AUC). ASPIRE achieved AUCs of 0.99 and 1.0 on the AML and generic blood immunophenotyping data sets, respectively.
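To make the evaluation metric concrete, the following is a minimal sketch of how an AUC can be computed from per-sample anomaly scores. It uses the rank interpretation of AUC (the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one). The function name `roc_auc` and the labels and scores are hypothetical illustrations, not values or code from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (anomalous, normal) pairs ranked correctly.

    labels: 1 for anomalous, 0 for normal. Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical anomaly scores for six samples (illustrative only).
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.3, 0.2, 0.6, 0.5, 0.9]
auc = roc_auc(labels, scores)
print(auc)
```

An AUC of 1.0 means every anomalous sample scores above every normal sample; 0.5 corresponds to random ranking.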

Conclusions: These results demonstrate that anomalous samples can be identified by ASPIRE with almost perfect accuracy without a priori access to samples of anomalous subtypes in the training set. The ASPIRE approach is unique in its ability to form generalizations regarding normal and anomalous states given only very weak assumptions regarding sample characteristics and origin. Thus, ASPIRE could become highly instrumental in providing unique insights about observed biological phenomena in the absence of full information about the investigated samples.

Figure 3: Examples of a normal and an anomalous sample in the Purdue data set. 2D scatter plots of cells expressing CD45, CD4, CD8, CD3, and CD19 markers. A. Anomalous sample. B. Normal sample.

Mentions: The sample collection and data acquisition were performed over a number of days. In accordance with standard FC data-analysis procedures, samples were pre-processed by performing linear spectral unmixing (compensation) [34, 35]. For the compensation to return approximate abundances of the labels used, one must employ the correct spillover matrix obtained from single-stained controls run under identical experimental conditions. However, in post-processing, it was discovered that a small subset of samples had been compensated using the wrong controls. These samples are readily identifiable by trained cytometrists (Figure 3). We consider the improperly unmixed samples to be anomalous. The task for the algorithm was to find the anomalous samples automatically. This task mimics a typical data-quality check step performed on a large collection of flow cytometry data.
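The effect of compensating with the wrong controls can be illustrated with a small sketch. It assumes the usual linear mixing model, observed = true × spillover, so compensation amounts to solving that linear system. The function `compensate`, the 3-colour spillover matrices, and the signal values are all hypothetical and chosen only for illustration; they are not taken from the paper or its data.

```python
import numpy as np


def compensate(observed, spillover):
    """Recover approximate label abundances from detector readings.

    Assumes observed = true @ spillover, where spillover[i, j] is the
    fraction of fluorochrome i's signal that appears in detector j.
    """
    observed = np.asarray(observed, dtype=float)
    spillover = np.asarray(spillover, dtype=float)
    # true @ S = observed  is equivalent to  S.T @ true.T = observed.T
    return np.linalg.solve(spillover.T, observed.T).T


# Hypothetical spillover matrices (illustrative values only).
S_correct = np.array([[1.00, 0.15, 0.02],
                      [0.05, 1.00, 0.20],
                      [0.00, 0.08, 1.00]])
S_wrong = np.array([[1.00, 0.02, 0.00],
                    [0.30, 1.00, 0.01],
                    [0.10, 0.25, 1.00]])

# Three hypothetical cells with known label abundances.
true_signal = np.array([[1000.0, 0.0, 0.0],
                        [0.0, 500.0, 0.0],
                        [200.0, 0.0, 800.0]])
observed = true_signal @ S_correct

good = compensate(observed, S_correct)  # recovers true_signal up to rounding
bad = compensate(observed, S_wrong)     # residual cross-talk distorts the values
```

With the correct matrix the abundances are recovered; with the mismatched matrix, populations shift and skew in the scatter plots, which is the kind of distortion that makes such samples stand out to a trained cytometrist.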

