A framework for significance analysis of gene expression data using dimension reduction methods.

Gidskehaug L, Anderssen E, Flatberg A, Alsberg BK - BMC Bioinformatics (2007)

Bottom Line: Dimension reduction methods are much used in the microarray literature for classification or for obtaining low-dimensional representations of data sets. It is also found that measured phenotypic responses may model the expression data more accurately than if the design-parameters are used as input. The dimension reduction methods are versatile tools that may also be used for significance testing.


Affiliation: Chemometrics and Bioinformatics Group, Department of Chemistry, Norwegian University of Science and Technology, N-7491 Trondheim, Norway. gidskeha@phys.chem.ntnu.no

ABSTRACT

Background: The most popular methods for significance analysis on microarray data are well suited to find genes differentially expressed across predefined categories. However, identification of features that correlate with continuous dependent variables is more difficult using these methods, and long lists of significant genes returned are not easily probed for co-regulations and dependencies. Dimension reduction methods are much used in the microarray literature for classification or for obtaining low-dimensional representations of data sets. These methods have an additional interpretation strength that is often not fully exploited when expression data are analysed. In addition, significance analysis may be performed directly on the model parameters to find genes that are important for any number of categorical or continuous responses. We introduce a general scheme for analysis of expression data that combines significance testing with the interpretative advantages of the dimension reduction methods. This approach is applicable both for explorative analysis and for classification and regression problems.

Results: Three public data sets are analysed. One is used for classification, one contains spiked-in transcripts of known concentrations, and one represents a regression problem with several measured responses. Model-based significance analysis is performed using a modified version of Hotelling's T2-test, and a false discovery rate significance level is estimated by resampling. Our results show that underlying biological phenomena and unknown relationships in the data can be detected by a simple visual interpretation of the model parameters. It is also found that measured phenotypic responses may model the expression data more accurately than if the design-parameters are used as input. For the classification data, our method finds much the same genes as the standard methods, in addition to some extra which are shown to be biologically relevant. The list of spiked-in genes is also reproduced with high accuracy.

Conclusion: The dimension reduction methods are versatile tools that may also be used for significance testing. Visual inspection of model components is useful for interpretation, and the methodology is the same whether the goal is classification, prediction of responses, feature selection or exploration of a data set. The presented framework is conceptually and algorithmically simple, and a Matlab toolbox (Mathworks Inc, USA) is supplemented.
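
The Results mention that a false discovery rate significance level is estimated by resampling. The general idea can be sketched as below, under stated assumptions: this is not the authors' procedure or their Matlab toolbox. A simple absolute difference in group means stands in for the modified Hotelling's T2 statistic, and the names `abs_mean_diff` and `estimate_fdr` are hypothetical. Sample labels are permuted, and the average number of genes exceeding the threshold under permutation is divided by the number of genes called significant in the original data.

```python
import random
from statistics import mean


def abs_mean_diff(values, labels):
    """Absolute difference in group means for one gene (a stand-in statistic)."""
    group0 = [v for v, lab in zip(values, labels) if lab == 0]
    group1 = [v for v, lab in zip(values, labels) if lab == 1]
    return abs(mean(group1) - mean(group0))


def estimate_fdr(data, labels, threshold, n_perm=200, seed=0):
    """Estimate the FDR at `threshold` by permuting the sample labels.

    data: list of per-gene expression lists (genes x samples).
    labels: list of 0/1 class labels, one per sample.
    Returns (number of genes called significant, estimated FDR).
    """
    rng = random.Random(seed)
    observed = [abs_mean_diff(row, labels) for row in data]
    n_called = sum(stat >= threshold for stat in observed)
    if n_called == 0:
        return 0, 0.0

    shuffled = list(labels)
    false_calls = []
    for _ in range(n_perm):
        # Shuffling breaks any real link between expression and class.
        rng.shuffle(shuffled)
        false_calls.append(
            sum(abs_mean_diff(row, shuffled) >= threshold for row in data)
        )

    # Expected false calls under the null, relative to the observed calls.
    return n_called, min(1.0, mean(false_calls) / n_called)
```

In practice the threshold would be adjusted until the estimated rate reaches the chosen significance level, such as the 5% used for Figure 6.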


Figure 6: Significant outcomes from Bridge-PLSR. A Venn diagram comparing the significant genes from Bridge-PLSR with those from SAM and Limma. At a significance level of 5%, the T2-test finds 725 significant genes, while SAM and Limma find a total of 668 and 471 features, respectively. The majority of the genes are found by all three methods.

Mentions: A Venn diagram comparing the significant genes found with Bridge-PLSR, SAM and Limma is given in Figure 6. All methods agree on the significance calls for the majority of the genes; however, Bridge-PLSR finds an additional 185 genes that are undetected by the other methods. To verify that these genes are not primarily false positives, the genes are annotated and their biological significance with regard to smoking is evaluated. A comprehensive list of our findings is given as supplementary information (see additional file 2). Of the genes identified as differentially expressed by Bridge-PLSR, only 66 have unknown or poorly known biological function. An additional 14 have well understood biological function often related to the natural system, but with no readily apparent link to smoking or lung damage. Of the remaining genes, 49 are known to be involved in regulation of cell growth, cell cycle, or apoptosis, processes known to be affected by smoking [16,17]. Seven of these genes, as well as another 14 genes on the list not related to cell growth, have been shown to be involved in various forms of lung cancer. As the majority of lung cancer sufferers are smokers or former smokers, these genes can be linked to smoking directly. Another interpretation is that these genes change early in the cancer development process. Seven genes are identified that have been reported to take part in lung development; this is not surprising, as smoking can cause dramatic changes in the airways. In addition to cancer, we identify 17 genes that are related to other diseases in the lungs and bronchi, such as asthma, inflammation, fibrosis, or response to metallic toxins. Another large group of genes belongs to the protein life cycle, particularly the ribosome or the ubiquitin cycle. This could point to an overall change in the rate of protein turnover caused by smoking, which is consistent with the results of Tomfohr et al. [18], who analysed the same data set with gene set enrichment analysis. Five genes are also found that are known to be expressed in lungs, but for which we find no link to cancer development or lung damage in the literature. Overall, it seems that of the genes for which biological background information is available, the majority relate to biological processes that are influenced by smoking. The additional genes found by Bridge-PLSR are therefore unlikely to be false discoveries.
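
The comparison behind a three-way Venn diagram such as Figure 6 comes down to set operations on the gene lists returned by each method. The following is a minimal sketch with made-up gene identifiers; the function name `venn_regions` and the identifiers `g1` to `g7` are illustrative assumptions, not the study's data or code.

```python
def venn_regions(a, b, c):
    """Split three sets into the seven regions of a three-way Venn diagram."""
    return {
        "only_a": a - b - c,
        "only_b": b - a - c,
        "only_c": c - a - b,
        "a_and_b": (a & b) - c,
        "a_and_c": (a & c) - b,
        "b_and_c": (b & c) - a,
        "all_three": a & b & c,
    }


# Hypothetical gene identifiers, for illustration only.
bridge_plsr = {"g1", "g2", "g3", "g4", "g5"}
sam = {"g1", "g2", "g3", "g6"}
limma = {"g1", "g2", "g7"}

regions = venn_regions(bridge_plsr, sam, limma)
counts = {name: len(genes) for name, genes in regions.items()}
print(counts)
```

The `only_a` region corresponds to genes detected by the first method alone, which is the kind of group the text then annotates and checks for biological relevance.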
