A framework for significance analysis of gene expression data using dimension reduction methods.

Gidskehaug L, Anderssen E, Flatberg A, Alsberg BK - BMC Bioinformatics (2007)

Bottom Line: Dimension reduction methods are much used in the microarray literature for classification or for obtaining low-dimensional representations of data sets. It is also found that measured phenotypic responses may model the expression data more accurately than if the design-parameters are used as input. The dimension reduction methods are versatile tools that may also be used for significance testing.


Affiliation: Chemometrics and Bioinformatics Group, Department of Chemistry, Norwegian University of Science and Technology, N-7491 Trondheim, Norway. gidskeha@phys.chem.ntnu.no

ABSTRACT

Background: The most popular methods for significance analysis on microarray data are well suited to find genes differentially expressed across predefined categories. However, identification of features that correlate with continuous dependent variables is more difficult using these methods, and long lists of significant genes returned are not easily probed for co-regulations and dependencies. Dimension reduction methods are much used in the microarray literature for classification or for obtaining low-dimensional representations of data sets. These methods have an additional interpretation strength that is often not fully exploited when expression data are analysed. In addition, significance analysis may be performed directly on the model parameters to find genes that are important for any number of categorical or continuous responses. We introduce a general scheme for analysis of expression data that combines significance testing with the interpretative advantages of the dimension reduction methods. This approach is applicable both for explorative analysis and for classification and regression problems.

Results: Three public data sets are analysed. One is used for classification, one contains spiked-in transcripts of known concentrations, and one represents a regression problem with several measured responses. Model-based significance analysis is performed using a modified version of Hotelling's T²-test, and a false discovery rate significance level is estimated by resampling. Our results show that underlying biological phenomena and unknown relationships in the data can be detected by a simple visual interpretation of the model parameters. It is also found that measured phenotypic responses may model the expression data more accurately than if the design-parameters are used as input. For the classification data, our method finds much the same genes as the standard methods, along with some additional genes that are shown to be biologically relevant. The list of spiked-in genes is also reproduced with high accuracy.

Conclusion: The dimension reduction methods are versatile tools that may also be used for significance testing. Visual inspection of model components is useful for interpretation, and the methodology is the same whether the goal is classification, prediction of responses, feature selection or exploration of a data set. The presented framework is conceptually and algorithmically simple, and a Matlab toolbox (Mathworks Inc, USA) is provided as a supplement.
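
For orientation, the classical one-sample Hotelling's T² statistic, on which the test named in the Results is based, can be written as below. This is the textbook form only; the authors' modification is not described on this page and is not reproduced here. The symbols (n observations of a p-dimensional vector, sample mean \bar{x}, sample covariance S, hypothesised mean \mu_0) are notation chosen for this sketch, not taken from the article.

```latex
T^2 = n\,(\bar{x} - \mu_0)^{\top} S^{-1} (\bar{x} - \mu_0),
\qquad
\frac{n - p}{p\,(n - 1)}\, T^2 \sim F_{p,\,n-p}
\quad \text{under } H_0 \text{ (multivariate normal data, } n > p\text{)}.
```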



Figure 1: PCA scores. The scores from the PCA of the smoker data, plotted for the first two components. There is a major source of variation along the first component that does not correspond to the smoking history of the test subjects. These two components explain only 1% of the variance in Y.

Mentions: We first look at a data set that investigates the effects of smoking on human airway transcripts [15]. Data are available from three groups of subjects, defined by their smoking history. The original article reports significant differences both between smokers and subjects who have never smoked, and between former smokers and never smokers. A PCA of the expression data X reveals a large source of variation that does not correspond to the available group information. Figure 1 shows a PCA score plot of the first two components, with the objects coloured according to the classes "Never", "Former", and "Current". Unreported experimental factors, for instance a change in laboratory procedures, may be responsible for the separation of the objects into two clusters along the first component. In the corresponding loading plot in figure 2, the 725 most significant genes from the PCA are shown in green. These genes mainly span the unreported X-variation and are not important for the smoking-induced differences in expression. The Venn diagram in figure 3 confirms that the genes detected by PCA are largely irrelevant for classification of these data.
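
The situation described above, where the leading principal components follow an unreported factor rather than the class labels, can be illustrated with a minimal Python sketch. It is not the authors' Matlab toolbox. The data are synthetic, and the `batch` factor, the effect sizes, the helper names `pca_scores` and `y_variance_explained`, and the dummy coding of Y as class membership are all assumptions made for illustration.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """PCA via SVD: return scores, loadings and explained X-variance fractions."""
    Xc = X - X.mean(axis=0)                       # column-centre the expression matrix
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return scores, loadings, explained

def y_variance_explained(scores, Y):
    """Fraction of the variance in Y captured by a least-squares fit on the scores."""
    Yc = Y - Y.mean(axis=0)
    B, *_ = np.linalg.lstsq(scores, Yc, rcond=None)
    resid = Yc - scores @ B
    return 1.0 - np.sum(resid ** 2) / np.sum(Yc ** 2)

# Synthetic stand-in for the smoker data (NOT the real data set).
rng = np.random.default_rng(0)
n_samples, n_genes = 60, 200
labels = np.repeat(np.arange(3), n_samples // 3)  # 0 = Never, 1 = Former, 2 = Current
batch = rng.integers(0, 2, size=n_samples)        # hypothetical unreported factor
X = rng.normal(size=(n_samples, n_genes))
X[:, :100] += 4.0 * batch[:, None]                # strong batch effect on 100 genes
X[:, 100:105] += 0.5 * labels[:, None]            # weak class effect on 5 genes
Y = np.eye(3)[labels]                             # assumed dummy-coded class membership

scores, loadings, explained = pca_scores(X)
r2_y = y_variance_explained(scores, Y)
batch_corr = abs(np.corrcoef(scores[:, 0], batch)[0, 1])

print("X-variance explained by PC1, PC2:", explained)
print("Fraction of Y-variance explained by PC1-PC2:", r2_y)
print("|corr(PC1, batch)|:", batch_corr)
```

In this toy setting the first component tracks the simulated batch factor, so colouring a score plot by class would show two clusters unrelated to the labels, analogous to what figure 1 reports for the real data.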

