A framework for significance analysis of gene expression data using dimension reduction methods.

Gidskehaug L, Anderssen E, Flatberg A, Alsberg BK - BMC Bioinformatics (2007)

Bottom Line: Our results show that underlying biological phenomena and unknown relationships in the data can be detected by a simple visual interpretation of the model parameters. It is also found that measured phenotypic responses may model the expression data more accurately than if the design-parameters are used as input. The presented framework is conceptually and algorithmically simple, and a Matlab toolbox (Mathworks Inc, USA) is supplemented.


Affiliation: Chemometrics and Bioinformatics Group, Department of Chemistry, Norwegian University of Science and Technology, N-7491 Trondheim, Norway. gidskeha@phys.chem.ntnu.no

ABSTRACT

Background: The most popular methods for significance analysis on microarray data are well suited to find genes differentially expressed across predefined categories. However, identification of features that correlate with continuous dependent variables is more difficult using these methods, and long lists of significant genes returned are not easily probed for co-regulations and dependencies. Dimension reduction methods are much used in the microarray literature for classification or for obtaining low-dimensional representations of data sets. These methods have an additional interpretation strength that is often not fully exploited when expression data are analysed. In addition, significance analysis may be performed directly on the model parameters to find genes that are important for any number of categorical or continuous responses. We introduce a general scheme for analysis of expression data that combines significance testing with the interpretative advantages of the dimension reduction methods. This approach is applicable both for explorative analysis and for classification and regression problems.

Results: Three public data sets are analysed. One is used for classification, one contains spiked-in transcripts of known concentrations, and one represents a regression problem with several measured responses. Model-based significance analysis is performed using a modified version of Hotelling's T²-test, and a false discovery rate significance level is estimated by resampling. Our results show that underlying biological phenomena and unknown relationships in the data can be detected by a simple visual interpretation of the model parameters. It is also found that measured phenotypic responses may model the expression data more accurately than if the design-parameters are used as input. For the classification data, our method finds much the same genes as the standard methods, in addition to some extra which are shown to be biologically relevant. The list of spiked-in genes is also reproduced with high accuracy.

Conclusion: The dimension reduction methods are versatile tools that may also be used for significance testing. Visual inspection of model components is useful for interpretation, and the methodology is the same whether the goal is classification, prediction of responses, feature selection or exploration of a data set. The presented framework is conceptually and algorithmically simple, and a Matlab toolbox (Mathworks Inc, USA) is supplemented.



Figure 5: Bridge-PLSR loadings. Loadings from the Bridge-PLSR of the smoker-data are plotted for the first two components. The blue spots represent features that are not found significant by the jack-knife procedure. The green spots are genes that are found significant by both SAM and Bridge-PLSR, while the red spots are called significant by the T²-test but not by SAM. The significant features span mainly the direction of smokers vs. never-smokers, but Bridge-PLSR detects some genes relevant for former smokers as well.

Mentions: The PCA indicates that the group information should be allowed to guide the decomposition if any genes related to smoking are to be found. A minimum of K - 1 independent variables is needed for a linear separation of K classes. For instance, one gene measured for several subjects may in theory be used to differentiate between two classes, while a minimum of two genes is needed for assignment into three classes. Due to co-regulation and inter-dependencies between genes, such single, descriptive features are usually both uninformative and hard to find. However, dimension reduction offers independent linear combinations of genes that may be used instead. If Y is a binary matrix holding the class information for each object, a good linear separation based on the expression data may be obtained with components that span all the Y-related variance in X. In Bridge-PLSR, this variance may be completely described in a minimum number of components.

The score plot from a two-component Bridge-PLSR model is given in figure 4. Here, three groups corresponding to the predefined classes are revealed. This model explains 54% of Y, whereas an ordinary PLSR for comparison explains 45% in two components. Jack-knife is performed based on a full leave-one-out cross-validation, and the significant features are given in the loading plot in figure 5. The green spots correspond to features called significant by both SAM and Bridge-PLSR, whereas the red spots indicate significant genes not found by SAM. Neighbouring genes in the loading plot have similar profiles, whereas genes in opposite quadrants are negatively correlated. Also, the arrays in a specific area of the score plot have a high expression value for the genes in the corresponding area of the loading plot. As the significant genes span mainly the first component, they discriminate well between smokers and never-smokers, according to the score plot.

Along the second component, which separates former smokers from the rest, fewer significant genes are found. This suggests that the correlation along this component is weaker and more susceptible to random variation in the data. It is seen, however, that Bridge-PLSR finds more genes than SAM along this direction. Any significant genes along the second component describe smoking-induced transcriptomic changes among the former smokers.
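
The jack-knife step described above can be illustrated with a minimal sketch in Python. It is not the authors' Matlab toolbox and does not implement Bridge-PLSR or the modified Hotelling's test. As simplifying assumptions, it treats a two-class problem (so Y reduces to a single 0/1 column), uses only the first ordinary PLS weight vector, and replaces the multivariate statistic with its one-component analogue, a jack-knife t-like ratio per gene. The function names (`pls1_weights`, `jackknife_t`) and the toy data are hypothetical.

```python
import math


def pls1_weights(X, y):
    """First PLS weight vector, proportional to Xc^T yc (mean-centred), unit length."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    w = [
        sum((X[i][j] - x_mean[j]) * (y[i] - y_mean) for i in range(n))
        for j in range(p)
    ]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0.0:
        raise ValueError("X carries no covariance with y (is y constant?)")
    return [v / norm for v in w]


def jackknife_t(X, y):
    """Leave-one-out jack-knife of the weight vector.

    Returns the full-model weights and, per gene, the ratio of the
    full-model weight to its jack-knife standard error.
    """
    n, p = len(X), len(X[0])
    full = pls1_weights(X, y)
    # Refit the model n times, each time leaving out one array (sample).
    loo = [pls1_weights(X[:i] + X[i + 1:], y[:i] + y[i + 1:]) for i in range(n)]
    ratios = []
    for j in range(p):
        mean_j = sum(w[j] for w in loo) / n
        # Standard jack-knife variance estimate: (n - 1) / n * sum of squares.
        var_j = (n - 1) / n * sum((w[j] - mean_j) ** 2 for w in loo)
        if var_j == 0.0:
            raise ValueError("zero jack-knife variance for gene %d" % j)
        ratios.append(full[j] / math.sqrt(var_j))
    return full, ratios


# Hypothetical toy data: 6 arrays (rows) x 3 genes (columns).
# Gene 0 is higher in class 1, gene 1 is noise, gene 2 is lower in class 1.
X = [
    [1.0, 5.0, 9.0],
    [1.2, 4.0, 8.8],
    [0.9, 6.0, 9.1],
    [3.0, 5.5, 7.0],
    [3.1, 4.2, 7.2],
    [2.9, 5.8, 6.9],
]
y = [0, 0, 0, 1, 1, 1]  # binary class indicator (the single column of Y)

weights, t_values = jackknife_t(X, y)
for gene, (w, t) in enumerate(zip(weights, t_values)):
    print("gene %d: weight %+.3f, jack-knife ratio %+.2f" % (gene, w, t))
```

On this toy data the two genes that differ between the classes receive large absolute ratios and the noise gene a small one. Turning such ratios into a gene list would still require a significance threshold, which the paper estimates as a false discovery rate by resampling.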

