An extension of PPLS-DA for classification and comparison to ordinary PLS-DA.

Telaar A, Liland KH, Repsilber D, Nürnberg G - PLoS ONE (2013)

Bottom Line: For the investigated data sets with weak linear dependency between features/variables, no improvement is shown for PPLS-DA or its extensions compared to PLS-DA. For simulated data with very weak linear dependency and a low proportion of differentially expressed genes, PPLS-DA does not improve on PLS-DA, but our extension shows a lower prediction error. Moreover, we compare these prediction results with those of support vector machines with a linear kernel and of linear discriminant analysis.


Affiliation: Institute for Genetics and Biometry, Department of Bioinformatics and Biomathematics, Leibniz Institute for Farm Animal Biology, Dummerstorf, Germany.

ABSTRACT
Classification studies are widely applied, e.g. in biomedical research, to classify objects/patients into predefined groups. The goal is to find a classification function/rule which assigns each object/patient to a unique group with the greatest possible accuracy, i.e. the lowest classification error. In gene expression experiments in particular, many variables (genes) are often measured for only a few objects/patients. A suitable approach is the well-known method PLS-DA, which searches for a transformation to a lower-dimensional space; the resulting new components are linear combinations of the original variables. An advancement of PLS-DA leads to PPLS-DA, which introduces a so-called 'power parameter' that is optimized to maximize the correlation between the components and the group membership. We introduce an extension of PPLS-DA that optimizes this power parameter towards the final aim, namely a minimal classification error. We compare this new extension with the original PPLS-DA and with ordinary PLS-DA using simulated and experimental data sets. For the investigated data sets with weak linear dependency between features/variables, no improvement is shown for PPLS-DA or its extensions compared to PLS-DA. For simulated data with very weak linear dependency and a low proportion of differentially expressed genes, PPLS-DA does not improve on PLS-DA, but our extension shows a lower prediction error. In contrast, for the data set with strong between-feature collinearity, a low proportion of differentially expressed genes, and a large total number of genes, the prediction error of PPLS-DA and the extensions is clearly lower than that of PLS-DA. Moreover, we compare these prediction results with those of support vector machines with a linear kernel and of linear discriminant analysis.
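
The core idea of the extension, choosing the power parameter by cross-validated classification error rather than by component/class correlation, can be sketched in a few lines. The snippet below is a hedged illustration, not the authors' algorithm: powered_component is a simplified stand-in for the PPLS-DA loading-weight construction (correlation weights raised to a power derived from gamma), and the nearest-centroid classifier, the toy data, and all names are assumptions made for the sake of a runnable example.

```python
# Hedged sketch: choosing a PPLS-DA-style power parameter gamma by minimal
# cross-validated classification error (the extension's aim), instead of by
# maximal component/class correlation (plain PPLS-DA's aim).
import numpy as np

rng = np.random.default_rng(0)

def powered_component(X, y, gamma):
    """One discriminant direction with correlation weights raised to a power.

    Simplified stand-in: w_j ~ sign(rho_j) * |rho_j|**(gamma / (1 - gamma)),
    where rho_j is the correlation of variable j with the class vector y.
    gamma = 0.5 gives plain correlation weights (PLS-like behaviour);
    gamma -> 1 concentrates the weight on the most correlated variables.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    rho = (Xc * yc[:, None]).mean(axis=0) / (Xc.std(axis=0) * yc.std() + 1e-12)
    p = gamma / (1.0 - gamma)
    w = np.sign(rho) * np.abs(rho) ** p
    return w / (np.linalg.norm(w) + 1e-12)

def cv_error(X, y, gamma, k=5):
    """k-fold CV error of a nearest-centroid rule on the single component."""
    idx = rng.permutation(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        w = powered_component(X[train], y[train], gamma)
        s_tr, s_te = X[train] @ w, X[fold] @ w
        c1, c2 = s_tr[y[train] == 1].mean(), s_tr[y[train] == -1].mean()
        pred = np.where(np.abs(s_te - c1) < np.abs(s_te - c2), 1, -1)
        errs.append(np.mean(pred != y[fold]))
    return np.mean(errs)

# Toy data: 40 samples, 500 variables, few differentially expressed genes.
X = rng.normal(size=(40, 500))
y = np.repeat([1, -1], 20)
X[:, :10] += 0.8 * y[:, None]

# The extension: pick gamma by minimal CV classification error.
grid = np.linspace(0.05, 0.95, 19)
best = min(grid, key=lambda g: cv_error(X, y, g))
print(f"selected gamma = {best:.2f}, CV error = {cv_error(X, y, best):.3f}")
```

In practice the gamma search should be nested inside an outer validation loop, so that the reported prediction error is not biased downwards by the parameter selection itself.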


pone-0055267-g004: Plot of the first 50 largest eigenvalues $\lambda_k$ of $\mathrm{cov}(\mathbf{X})$ (bars) and of the absolute covariance between $\mathbf{z}_k$ and $\mathbf{y}$ (dots) for the experimental data sets and for case 3 of the simulated data. The eigenvalues $\lambda_k$, $k = 1, \dots, 50$, are scaled relative to the largest eigenvalue, as are the absolute values of the covariance between the principal component $\mathbf{z}_k$ and the response vector $\mathbf{y}$; here $y_i$ equals 1 if sample $i$ belongs to group $g_1$, otherwise $y_i$ equals $-1$.



Mentions: For a detailed description of the covariance structure of our data, we use two measures, analogous to Sæbø et al. [12]: the condition index $K_k = \sqrt{\lambda_1 / \lambda_k}$, first used in [13], and the absolute value of the covariances between the principal components of $\mathbf{X}$ and the response vector $\mathbf{y}$, as used in [12]. The condition index is a measure of variable dependence, with $\lambda_k$ being the $k$th eigenvalue of $\mathrm{cov}(\mathbf{X})$. It can be assumed that $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p$. The increase of the first five condition indexes ($K_1, \dots, K_5$) reflects the collinearity of the features: a rapid increase means the features have a strong linear dependence, while a weak increase implies a weak dependence. Considering the principal components, as in [12], the relevance of a component is measured by the absolute value of the covariance $|\mathrm{cov}(\mathbf{z}_k, \mathbf{y})|$ between the principal component $\mathbf{z}_k = \mathbf{X}\mathbf{e}_k$ and the class vector $\mathbf{y}$. Here $y_i$ equals 1 if sample $i$ belongs to group $g_1$, otherwise $y_i$ equals $-1$, $i = 1, \dots, n$. The eigenvector belonging to the $k$th largest eigenvalue is denoted by $\mathbf{e}_k$. Helland and Almøy [14] infer that data sets whose relevant components have small eigenvalues are difficult to predict. The condition index is plotted for the first five largest eigenvalues (scaled to the first eigenvalue) in Figure 3. Figure 4 shows the first 50 largest scaled eigenvalues and the corresponding scaled covariances between $\mathbf{z}_k$ and $\mathbf{y}$ for all experimental data sets and one simulated data set (case 3).
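
To make these diagnostics concrete, here is a minimal numpy sketch of both measures, assuming a samples-by-variables matrix $\mathbf{X}$ and a $\pm 1$ class vector $\mathbf{y}$; the toy data and all names are illustrative, not the paper's.

```python
# Minimal sketch of the two covariance-structure diagnostics described above,
# assuming X is an (n samples x p variables) matrix and y the +/-1 class vector.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 200))   # toy expression matrix
y = np.repeat([1, -1], 20)       # y_i = 1 for group g1, -1 otherwise

Xc = X - X.mean(axis=0)          # column-centred data
# Eigendecomposition of cov(X); sort eigenvalues in decreasing order.
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# Condition indexes K_k = sqrt(lambda_1 / lambda_k), first five eigenvalues:
K = np.sqrt(evals[0] / evals[:5])

# |cov(z_k, y)| for the principal components z_k = X e_k (first 50),
# scaled to the largest value, as in Figure 4.
Z = Xc @ evecs[:, :50]
abscov = np.abs([np.cov(Z[:, k], y)[0, 1] for k in range(50)])

print("condition indexes K_1..K_5:", np.round(K, 2))
print("scaled eigenvalues:", np.round(evals[:5] / evals[0], 3))
print("scaled |cov(z_k, y)| (first 5):", np.round(abscov[:5] / abscov.max(), 3))
```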

