Optimality driven nearest centroid classification from genomic data.

Dabney AR, Storey JD - PLoS ONE (2007)


Affiliation: Department of Statistics, Texas A&M University, College Station, Texas, United States of America. adabney@stat.tamu.edu

ABSTRACT
Nearest-centroid classifiers have recently been successfully employed in high-dimensional applications, such as in genomics. A necessary step when building a classifier for high-dimensional data is feature selection. Feature selection is frequently carried out by computing univariate scores for each feature individually, without consideration for how a subset of features performs as a whole. We introduce a new feature selection approach for high-dimensional nearest centroid classifiers that instead is based on the theoretically optimal choice of a given number of features, which we determine directly here. This allows us to develop a new greedy algorithm to estimate this optimal nearest-centroid classifier with a given number of features. In addition, whereas the centroids are usually formed from maximum likelihood estimates, we investigate the applicability of high-dimensional shrinkage estimates of centroids. We apply the proposed method to clinical classification based on gene-expression microarrays, demonstrating that the proposed method can outperform existing nearest centroid classifiers.
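The abstract refers to nearest-centroid classification with diagonal covariance matrices and shrunken centroid estimates. The sketch below illustrates the general technique in NumPy; it is not the authors' Clanc implementation, and the soft-thresholding shrinkage shown follows the familiar PAM-style rule rather than the paper's exact estimator. The function names (`fit_centroids`, `predict`) are illustrative.

```python
import numpy as np

def fit_centroids(X, y, shrink=0.0):
    """Fit per-class centroids; optionally soft-threshold each class's
    deviation from the overall centroid (PAM-style shrinkage).
    Illustrative sketch only, not the paper's exact estimator."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    centroids = {}
    for c in classes:
        mu = X[y == c].mean(axis=0)
        d = mu - overall
        # soft-threshold the class-specific deviation by `shrink`
        d = np.sign(d) * np.maximum(np.abs(d) - shrink, 0.0)
        centroids[c] = overall + d
    # per-feature variances, i.e. a diagonal covariance matrix
    var = X.var(axis=0) + 1e-8
    return centroids, var

def predict(X, centroids, var):
    """Assign each sample to the nearest centroid under a
    variance-scaled (diagonal-covariance) squared distance."""
    labels = list(centroids)
    dists = np.stack([((X - centroids[c]) ** 2 / var).sum(axis=1)
                      for c in labels], axis=1)
    return np.array(labels)[dists.argmin(axis=1)]
```

Shrinkage drives small centroid deviations to exactly zero, so features that do not distinguish the classes stop contributing to the distance, which is one way such classifiers cope with high-dimensional data.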


pone-0001002-g001: Results for SRBCT data. Classifiers are identical to those in Tables 3 and 4, with Clanc v1-v4 corresponding to the last four variants reported there, respectively.

The results for the SRBCT data are shown in Figure 1, those for the lymphoma data in Figure 2, and those for the NCI data in Figure 3. The classifiers presented are identical to those in Tables 3 and 4, except that Clanc classifiers with unrestricted covariances are excluded. The Clanc classifiers indicated by “v1-v4” correspond to the last four classifiers reported in Tables 3 and 4. Clanc improves accuracy over the PAM approach using univariate scoring. Shrunken centroids in Clanc improve accuracy in the NCI example but make no difference in the other examples. Diagonal covariance matrices result in greater accuracy overall for these examples. Overall, we interpret these results as indicating that Clanc classifiers with greedy searches guided by (4) can outperform the existing PAM classification method. In particular, the results support the use of shrunken centroids and diagonal covariance matrices, and we have implemented this algorithm in the Clanc software [13].
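The greedy search described above can be sketched as forward selection against a subset-level criterion, in contrast to scoring each feature univariately. Since the paper's criterion (4) is not reproduced in this excerpt, `separation_score` below is a hypothetical stand-in (between-class to within-class spread on the selected subset), not the Clanc objective; both function names are illustrative.

```python
import numpy as np

def greedy_select(X, y, k, score_fn):
    """Forward greedy search: at each step, add the single feature that
    most improves `score_fn` evaluated on the selected subset as a whole.
    `score_fn` is a stand-in for an optimality criterion such as the
    paper's (4), which is not reproduced here."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best_j, best_s = None, -np.inf
        for j in remaining:
            s = score_fn(X[:, selected + [j]], y)
            if s > best_s:
                best_j, best_s = j, s
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

def separation_score(Xs, y):
    """Toy subset criterion: between-class centroid spread relative to
    within-feature variance, summed over the selected features."""
    classes = np.unique(y)
    overall = Xs.mean(axis=0)
    between = sum(((Xs[y == c].mean(axis=0) - overall) ** 2).sum()
                  for c in classes)
    within = Xs.var(axis=0).sum() + 1e-8
    return between / within
```

Because the criterion is evaluated on the subset jointly, adding an uninformative feature inflates the within-class term without adding between-class spread and so lowers the score, which is the behavior that distinguishes this search from univariate ranking.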
