A composite model for subgroup identification and prediction via bicluster analysis.

Chen HC, Zou W, Lu TP, Chen JJ - PLoS ONE (2014)

Bottom Line: The proposed composite model depends neither on any specific biclustering algorithm or bicluster pattern nor on any particular classification algorithm. The proposed approach combines unsupervised biclustering and supervised classification techniques to classify samples into disjoint subgroups based on their associated attributes, such as genotypic factors, phenotypic outcomes, efficacy/safety measures, or responses to treatments. The procedure is useful for identifying unknown species or new biomarkers for targeted therapy.


Affiliation: Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, United States of America; Graduate Institute of Biostatistics and Biostatistics Center, China Medical University, Taichung, Taiwan.

ABSTRACT

Background: A major challenge in the analysis of large and complex biomedical data is to develop an approach for 1) identifying distinct subgroups in the sampled populations, 2) characterizing the relationships among those subgroups, and 3) developing a prediction model that classifies the subgroup memberships of new samples by finding a set of predictors. The subgroups can represent different pathogen serotypes of microorganisms, different tumor subtypes in cancer patients, or different genetic makeups of patients related to treatment response.

Methods: This paper proposes a composite model for subgroup identification and prediction using biclusters. A biclustering technique is first used to identify a set of biclusters from the sampled data. For each bicluster, a subgroup-specific binary classifier is built to determine whether a particular sample falls inside or outside the bicluster. A composite model, which consists of all the binary classifiers, is then constructed to classify samples into several disjoint subgroups. The proposed composite model depends neither on any specific biclustering algorithm or bicluster pattern nor on any particular classification algorithm.
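The procedure described in the Methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the biclusters are taken as given (any biclustering algorithm could supply them), a simple nearest-centroid rule stands in for the binary classifier, and every function name and the toy data are hypothetical. Samples that receive the same string of binary decisions are placed in the same subgroup, which makes the subgroups disjoint.

```python
from collections import defaultdict

def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_binary(X, members, features):
    """Stand-in binary classifier for one bicluster (nearest centroid).

    members: row indices inside the bicluster; features: its column indices.
    Assumes at least one sample lies outside the bicluster.
    """
    member_set = set(members)
    inside = [[X[i][j] for j in features] for i in members]
    outside = [[X[i][j] for j in features]
               for i in range(len(X)) if i not in member_set]
    c_in, c_out = centroid(inside), centroid(outside)

    def predict(sample):
        v = [sample[j] for j in features]
        return 1 if sq_dist(v, c_in) < sq_dist(v, c_out) else 0

    return predict

def fit_composite(X, biclusters):
    """One binary classifier per bicluster; together they form the composite model."""
    return [fit_binary(X, rows, cols) for rows, cols in biclusters]

def pattern(classifiers, sample):
    """Classification pattern: one bit per classifier, e.g. '10'."""
    return "".join(str(clf(sample)) for clf in classifiers)

def assign_subgroups(classifiers, samples):
    """Group sample indices by classification pattern (disjoint subgroups)."""
    groups = defaultdict(list)
    for idx, sample in enumerate(samples):
        groups[pattern(classifiers, sample)].append(idx)
    return dict(groups)

# Hypothetical toy data: 6 samples x 4 features, with two known biclusters.
X = [
    [5.0, 5.1, 0.0, 0.2],
    [4.9, 5.2, 0.1, 0.0],
    [0.1, 0.0, 5.0, 4.8],
    [0.0, 0.2, 5.1, 5.0],
    [0.1, 0.1, 0.0, 0.1],
    [0.0, 0.2, 0.2, 0.0],
]
biclusters = [([0, 1], [0, 1]), ([2, 3], [2, 3])]  # (rows, columns)

model = fit_composite(X, biclusters)
subgroups = assign_subgroups(model, X)
print(subgroups)
```

Because the composite model only consumes the 0/1 decisions, the nearest-centroid rule here could be swapped for any other binary classifier without changing the grouping step.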

Results: The composite model had an overall accuracy of 97.4% on a synthetic dataset consisting of four subgroups. The model was then applied to two datasets in which the samples' subgroup memberships were known. The procedure showed 83.7% accuracy in discriminating the adenocarcinoma and squamous carcinoma subtypes of lung cancer, and identified 5 serotypes and several subtypes with about 94% accuracy in a pathogen dataset.

Conclusion: The composite model presents a novel approach to developing a biclustering-based classification model from unlabeled sampled data. The proposed approach combines unsupervised biclustering and supervised classification techniques to classify samples into disjoint subgroups based on their associated attributes, such as genotypic factors, phenotypic outcomes, efficacy/safety measures, or responses to treatments. The procedure is useful for identification of unknown species or new biomarkers for targeted therapy.


Figure 6 (pone-0111318-g006): Hierarchical cluster analysis of the 14 subgroups identified from the test dataset using the average linkage distance. The 14 subgroups cover the 5 major serotypes, with their subtypes, and the Decoy group: 1. Thompson (0010000000); 2. Typhimurium (0100000000); 3. Decoy (0000000000); 4. Oranienburg (0001000000, 0000100000, 0001100000, 0001110000, 0000110000); 5. Hadar (0000010000, 1000010000); 6. I4,[5],12:i- (1000000000, 1000000100, 1000001000, 0000001000).

Mentions: The SVM model identified 24 classification patterns from the 10 binary classifiers m1–m10. Using n* = 5 as the cutoff, 14 subgroups were identified (Table 5), 13 of which were identical to the 13 subgroups identified in the training data. The additional subgroup consisted of 8 Hadar isolates. The serotypes and their associated binary classifiers were: I4,[5],12:i-: m1, m7, (m1, m7), (m1, m8); Hadar: m6, (m1, m6); Oranienburg: m4, m5, (m4, m5), (m5, m6), (m4, m5, m6); Thompson: m3; Typhimurium: m2.

For the five training serotypes, the sensitivities were similar in the training and test datasets. The overall specificity was lower because the test dataset included 1,000 additional “Decoy” isolates (Table 4 and Table 5). For the “Decoy” serotype, the sensitivity and specificity were 74.7% and 91.7%, respectively. The accuracy was 95.9% when the “Decoy” isolates were excluded from the calculation and 96.1% when they were included.

The relationships among the 14 subgroups were further analyzed by hierarchical clustering with the Euclidean distance function and the average agglomeration method (Figure 6). The 14 subgroups identified all 5 major serotypes and their subtypes, and the “Decoy” serotype: 1. Thompson (0010000000); 2. Typhimurium (0100000000); 3. Decoy (0000000000); 4. Oranienburg contained 5 subtypes (0001000000, 0000100000, 0001100000, 0001110000, 0000110000); 5. Hadar contained 2 subtypes (0000010000, 1000010000); 6. I4,[5],12:i- contained 4 subtypes (1000000000, 1000000100, 1000001000, 0000001000).
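The pattern bookkeeping in this passage can be illustrated with a short Python sketch. It makes several assumptions that the text does not confirm: patterns observed in fewer than n* samples are simply dropped, each subgroup is represented by its 10-bit pattern when clustering (the source does not say which features were clustered), and all function names and toy counts are hypothetical. The 14 bit strings are the ones listed above, and bit i of a pattern is read as the decision of classifier m(i).

```python
from collections import Counter
from itertools import combinations
from math import dist

def select_subgroups(patterns, n_star=5):
    """Keep patterns observed in at least n_star samples (assumed cutoff rule)."""
    counts = Counter(patterns)
    return {p: c for p, c in counts.items() if c >= n_star}

def active_classifiers(pattern):
    """Names of the classifiers that voted 'inside', e.g. '1000010000' -> m1, m6."""
    return [f"m{i}" for i, bit in enumerate(pattern, start=1) if bit == "1"]

def average_linkage(patterns):
    """Agglomerative clustering with Euclidean distance and average linkage.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of pattern strings.
    """
    points = {p: [int(ch) for ch in p] for p in patterns}
    clusters = [(p,) for p in patterns]
    merges = []

    def avg_dist(a, b):
        total = sum(dist(points[x], points[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest average pairwise distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        merges.append((a, b, avg_dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical pattern counts, only to show the cutoff step.
toy = ["10"] * 6 + ["01"] * 5 + ["11"] * 2
kept = select_subgroups(toy, n_star=5)

# The 14 subgroup patterns listed in the text.
subgroup_patterns = [
    "0010000000", "0100000000", "0000000000",
    "0001000000", "0000100000", "0001100000", "0001110000", "0000110000",
    "0000010000", "1000010000",
    "1000000000", "1000000100", "1000001000", "0000001000",
]
merges = average_linkage(subgroup_patterns)
for a, b, d in merges:
    print(f"{d:.3f}  {a} + {b}")
```

With binary patterns, the Euclidean distance between two subgroups is the square root of the number of classifiers on which they disagree, so subgroups that differ in a single classifier are merged first.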

