Knowledge-guided multi-scale independent component analysis for biomarker identification.

Chen L, Xuan J, Wang C, Shih IeM, Wang Y, Zhang Z, Hoffman E, Clarke R - BMC Bioinformatics (2008)

Bottom Line: Since gene expression levels reflect the joint effect of several underlying biological functions, disease-specific biomarkers may be involved in several distinct biological functions. Finally, disease-specific biomarkers are extracted by their weighted connectivity scores associated with the extracted regulatory modes. The approach has been successfully applied to two expression profiling experiments to demonstrate its improved performance in extracting biologically meaningful and disease-related biomarkers.

Affiliation: Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, USA. lchen06@vt.edu

ABSTRACT

Background: Many statistical methods have been proposed to identify disease biomarkers from gene expression profiles. However, from gene expression profile data alone, statistical methods often fail to identify biologically meaningful biomarkers related to a specific disease under study. In this paper, we develop a novel strategy, namely knowledge-guided multi-scale independent component analysis (ICA), to first infer regulatory signals and then identify biologically relevant biomarkers from microarray data.

Results: Since gene expression levels reflect the joint effect of several underlying biological functions, disease-specific biomarkers may be involved in several distinct biological functions. To identify disease-specific biomarkers that provide unique mechanistic insights, a metadata "knowledge gene pool" (KGP) is first constructed from multiple data sources to provide important information on the likely functions (such as gene ontology information) and regulatory events (such as promoter responsive elements) associated with potential genes of interest. The gene expression and biological metadata associated with the members of the KGP can then be used to guide subsequent analysis. ICA is then applied to multi-scale gene clusters to reveal regulatory modes reflecting the underlying biological mechanisms. Finally, disease-specific biomarkers are extracted by their weighted connectivity scores associated with the extracted regulatory modes. A statistical significance test is used to evaluate the significance of transcription factor enrichment for the extracted gene set based on motif information. We applied the proposed method to yeast cell cycle microarray data and Rsf-1-induced ovarian cancer microarray data. The results show that our knowledge-guided ICA approach can extract biologically meaningful regulatory modes and outperform several baseline methods for biomarker identification.

Conclusion: We have proposed a novel method, namely knowledge-guided multi-scale ICA, to identify disease-specific biomarkers. The goal is to infer knowledge-relevant regulatory signals and then identify corresponding biomarkers through a multi-scale strategy. The approach has been successfully applied to two expression profiling experiments to demonstrate its improved performance in extracting biologically meaningful and disease-related biomarkers. More importantly, the proposed approach shows promise for inferring novel biomarkers for ovarian cancer and extending current knowledge.

Figure 3: Histogram of determined optimal number of clusters in ten-fold cross-validation on yeast cell cycle data set.

Mentions: The resulting average AUC value of ten-fold cross-validation on 104 genes is 0.9206, with a standard deviation of 0.0470. Fig. 3 shows the histogram of the optimal number of clusters determined during the ten-fold cross-validation procedure. From the figure we can see that the most frequent number of clusters is five. We then implemented three baseline methods and evaluated them with ten-fold cross-validation for comparison. For baseline correlation method-2, we chose the optimal cluster number from the multi-scale ICA method for a fair comparison. The ROC curves of ten-fold cross-validation for the two baseline correlation methods, the baseline ICA method, and our multi-scale ICA method are shown in Fig. 4. The ROC curves show that the multi-scale ICA method outperforms the baseline correlation method-2, and that the baseline ICA approach is better than the baseline correlation method-1. Overall, the proposed multi-scale ICA method significantly outperforms all three baseline methods as estimated by the Kolmogorov-Smirnov (K-S) one-sided test (Table 1).
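The per-fold AUC summary described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `auc` and `summarize_folds` and the toy scores are hypothetical, and the use of the sample standard deviation is an assumption, since the excerpt does not say which estimator was used. The AUC is computed from its rank interpretation, the fraction of (positive, negative) pairs in which the positive gene receives the higher score, with ties counted as one half.

```python
from statistics import mean, stdev


def auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Uses the pairwise (Mann-Whitney) identity: AUC is the fraction of
    positive/negative pairs ranked correctly, with ties counted as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("each fold needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarize_folds(folds):
    """Return (mean AUC, sample standard deviation) over cross-validation folds.

    `folds` is a list of (scores, labels) pairs, one per held-out fold.
    """
    aucs = [auc(scores, labels) for scores, labels in folds]
    return mean(aucs), stdev(aucs)


# Hypothetical toy data: three folds instead of ten, four genes per fold.
toy_folds = [
    ([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]),
    ([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]),
    ([0.7, 0.2, 0.5, 0.4], [1, 1, 0, 0]),
]

mean_auc, sd_auc = summarize_folds(toy_folds)
print(f"mean AUC = {mean_auc:.4f}, SD = {sd_auc:.4f}")
```

The pairwise loop is quadratic in fold size, which is acceptable for a gene set of the size reported here (104 genes); a rank-based formulation would scale better for larger sets.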

