Near-infrared spectroscopy as a diagnostic tool for distinguishing between normal and malignant colorectal tissues.
Bottom Line:
The successive projection algorithm-linear discriminant analysis (SPA-LDA) was used to seek a reduced subset of variables/wavenumbers and build an LDA diagnostic model. Principal component analysis (PCA) was used for a preliminary analysis. The results showed that, compared to PLS-DA, SPA-LDA provided a more parsimonious model that used only three wavenumbers/variables (4065, 4173, and 5758 cm⁻¹) to achieve sensitivities of 84.6%, 92.3%, and 92.3% for the training, validation, and test sets, respectively, and a specificity of 100% for each subset.
View Article:
PubMed Central - PubMed
Affiliation: Yibin University Hospital, Yibin, Sichuan 644000, China; Key Lab of Process Analysis and Control of Sichuan Universities, Yibin University, Yibin, Sichuan 644000, China.
ABSTRACT
Cancer diagnosis is one of the most important tasks of biomedical research and has become a main objective of medical investigations. The present paper proposes an analytical strategy for distinguishing between normal and malignant colorectal tissues by combining near-infrared (NIR) spectroscopy with chemometrics. The successive projection algorithm-linear discriminant analysis (SPA-LDA) was used to seek a reduced subset of variables/wavenumbers and build an LDA diagnostic model. For comparison, partial least squares-discriminant analysis (PLS-DA) based on full-spectrum classification was used as the reference. Principal component analysis (PCA) was used for a preliminary analysis. A total of 186 spectra from 20 patients who had undergone partial colorectal resection were collected and divided into three subsets for training, optimizing, and testing the model. The results showed that, compared to PLS-DA, SPA-LDA provided a more parsimonious model that used only three wavenumbers/variables (4065, 4173, and 5758 cm⁻¹) to achieve sensitivities of 84.6%, 92.3%, and 92.3% for the training, validation, and test sets, respectively, and a specificity of 100% for each subset. This indicates that the combination of NIR spectroscopy and the SPA-LDA algorithm can serve as a potential tool for distinguishing between normal and malignant colorectal tissues.
Mentions: The combination of SPA and LDA is denoted SPA-LDA. Figure 1 gives the flowchart of the SPA-LDA algorithm. SPA-LDA focuses on selecting a subset of variables with minimal collinearity and appropriate discriminating ability for use in classification problems. For this purpose, it is assumed that a training set consisting of N samples with known class labels is available to guide the variable selection. In the case of a spectroscopic dataset, each sample consists of a spectrum with K points (wavenumbers/wavelengths). The SPA-LDA scheme comprises two main phases [27]. In Phase 1, the N training samples/spectra are first centered on the mean of their respective classes and stacked in the form of a matrix X (N × K). Each column of X corresponds to a variable. Projection operations on the columns of X are then carried out to create K chains of L variables each. Because degrees of freedom are lost in calculating the class means, the chain length L is limited to N − C, where C is the number of classes involved in the problem. Each chain is initialized with one of the K available variables. Subsequent variables are added to the chain so that each displays the least collinearity with those already selected. The collinearity is evaluated by the correlation between the respective column vectors of X. In Phase 2, different variable subsets are extracted and evaluated. For each of the K chains formed in Phase 1, a total of L subsets of variables can be extracted by using from one up to L variables in the order in which they were selected. Thus, a total of K × L subsets of variables can be generated. These candidate subsets are assessed in terms of a cost function involving the average risk of misclassification over the validation set.
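The chain-building step of Phase 1 can be sketched in Python with NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual projection formulation of SPA, in which all columns are projected onto the subspace orthogonal to the most recently selected column and the column with the largest remaining norm is taken next. The function names `class_center`, `spa_chain`, and `spa_chains` are hypothetical.

```python
import numpy as np

def class_center(X, y):
    """Center each sample on the mean of its own class."""
    y = np.asarray(y)
    Xc = np.asarray(X, dtype=float).copy()
    for label in np.unique(y):
        mask = y == label
        Xc[mask] -= Xc[mask].mean(axis=0)
    return Xc

def spa_chain(Xc, start, length):
    """Build one chain of column indices, starting from column `start`.

    Assumption: the next variable is the column with the largest norm
    after projecting out the previously selected column.
    """
    P = np.asarray(Xc, dtype=float).copy()
    length = min(length, P.shape[1])  # a chain cannot exceed K variables
    chain = [start]
    for _ in range(length - 1):
        v = P[:, chain[-1]]
        denom = v @ v
        if denom <= 1e-12:  # nothing left to project out
            break
        # Project every column onto the subspace orthogonal to v.
        P = P - np.outer(v, v @ P) / denom
        norms = np.linalg.norm(P, axis=0)
        norms[chain] = -1.0  # never reselect a variable
        chain.append(int(np.argmax(norms)))
    return chain

def spa_chains(X, y, max_len=None):
    """Build K chains, one per starting variable, with length at most N - C."""
    y = np.asarray(y)
    n, k = np.asarray(X).shape
    limit = n - len(np.unique(y))
    length = limit if max_len is None else min(max_len, limit)
    Xc = class_center(X, y)
    return [spa_chain(Xc, start, length) for start in range(k)]
```

Taking the first one, two, and so on up to L entries of each returned chain gives the K × L candidate subsets evaluated in Phase 2.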
The cost function is defined as

$$G_{\text{cost}} = \frac{1}{N_{\text{val}}} \sum_{n=1}^{N_{\text{val}}} g_n, \tag{1}$$

where

$$g_n = \frac{MD^2\left(\mathbf{x}_{\text{val},n}, \bar{\mathbf{x}}_{I_n}\right)}{\min_{I_j \neq I_n} MD^2\left(\mathbf{x}_{\text{val},n}, \bar{\mathbf{x}}_{I_j}\right)}. \tag{2}$$

In (2), the numerator is the squared Mahalanobis distance between the nth validation sample $\mathbf{x}_{\text{val},n}$ and the center $\bar{\mathbf{x}}_{I_n}$ of its true class $I_n$, calculated over the training set by the formula

$$MD^2\left(\mathbf{x}_{\text{val},n}, \bar{\mathbf{x}}_{I_n}\right) = \left(\mathbf{x}_{\text{val},n} - \bar{\mathbf{x}}_{I_n}\right) \mathbf{S}^{-1} \left(\mathbf{x}_{\text{val},n} - \bar{\mathbf{x}}_{I_n}\right)^{T}, \tag{3}$$

where $\mathbf{S}$ is a pooled covariance matrix calculated on the training set, instead of using a separate estimate for each class. To have a well-posed problem, the number of training samples should be larger than the number of variables included in an LDA model; otherwise, the estimated $\mathbf{S}$ will be singular, which makes it impossible to calculate the matrix inverse. The denominator in (2) corresponds to the squared Mahalanobis distance between $\mathbf{x}_{\text{val},n}$ and the mean of the nearest wrong class. A small value of $g_n$ indicates that $\mathbf{x}_{\text{val},n}$ is close to the center of its true class and distant from the centers of all other classes. The cost function is defined as the average of $g_n$ over all samples in the validation set, so minimizing the cost function leads to better separation of samples of different classes.
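As a rough illustration of the cost function described above, the following Python/NumPy sketch computes the per-sample ratio and its average over a validation set for one candidate subset of variables. The function names `pooled_covariance` and `spa_lda_cost` are hypothetical, and the divisor N − C in the pooled covariance is an assumption; since the ratio divides one Mahalanobis distance by another, any positive scaling of the covariance cancels out.

```python
import numpy as np

def pooled_covariance(X, y):
    """Pooled within-class covariance estimated on the training set.

    Assumption: the summed class scatter is divided by N - C.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    S = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        D = X[y == c] - X[y == c].mean(axis=0)
        S += D.T @ D
    return S / (len(y) - len(classes))

def spa_lda_cost(X_train, y_train, X_val, y_val):
    """Average over validation samples of
    (squared distance to true class center) / (squared distance to nearest wrong center).
    """
    X_train = np.asarray(X_train, dtype=float)
    X_val = np.asarray(X_val, dtype=float)
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)
    means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    # Raises LinAlgError if S is singular, e.g. too few training samples.
    S_inv = np.linalg.inv(pooled_covariance(X_train, y_train))
    g = []
    for x, label in zip(X_val, y_val):
        diffs = x - means  # one row per class center
        md2 = np.einsum("ck,kl,cl->c", diffs, S_inv, diffs)
        is_true = classes == label
        g.append(md2[is_true][0] / md2[~is_true].min())
    return float(np.mean(g))
```

In Phase 2, a function like this would be called once per candidate subset, passing only the selected columns of the training and validation matrices, and the subset with the lowest cost would be retained.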