Simple integrative preprocessing preserves what is shared in data sources.

Tripathi A, Klami A, Kaski S - BMC Bioinformatics (2008)

Bottom Line: Principal components analysis of all sources combined together is an obvious choice if it is not important to distinguish between data source-specific and shared variation. Canonical Correlation Analysis (CCA) focuses on mutual dependencies and discards source-specific "noise" but it produces a separate set of components for each source. We give the details and implement them in a software tool.


Affiliation: Department of Computer Science, P.O. Box 68, FI-00014, University of Helsinki, Finland. abhishek@cs.helsinki.fi

ABSTRACT

Background: The bioinformatics data analysis toolbox needs general-purpose, fast and easily interpretable preprocessing tools that perform data integration during exploratory data analysis. Our focus is on vector-valued data sources, each consisting of measurements of the same entity but on different variables, and on tasks where source-specific variation is considered noisy or not interesting. Principal components analysis of all sources combined together is an obvious choice if it is not important to distinguish between data source-specific and shared variation. Canonical Correlation Analysis (CCA) focuses on mutual dependencies and discards source-specific "noise" but it produces a separate set of components for each source.

Results: It turns out that components given by CCA can be combined easily to produce a linear and hence fast and easily interpretable feature extraction method. The method fuses together several sources, such that the properties they share are preserved. Source-specific variation is discarded as uninteresting. We give the details and implement them in a software tool. The method is demonstrated on gene expression measurements in three case studies: classification of cell cycle regulated genes in yeast, identification of differentially expressed genes in leukemia, and defining stress response in yeast. The software package is available at http://www.cis.hut.fi/projects/mi/software/drCCA/.

Conclusion: We introduced a method for the task of data fusion for exploratory data analysis, when statistical dependencies between the sources and not within a source are interesting. The method uses canonical correlation analysis in a new way for dimensionality reduction, and inherits its good properties of being simple, fast, and easily interpretable as a linear projection.
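
The idea of fusing sources so that only shared variation survives can be illustrated with a minimal NumPy sketch. It relies on a standard route to generalized CCA: whiten each source separately, concatenate the whitened sources, and run PCA on the result. This is an illustration under that assumption, not the drCCA package itself; the function names `whiten` and `fuse_sources` and the `eps` threshold are hypothetical and do not come from the article or its software.

```python
import numpy as np


def whiten(X, eps=1e-8):
    """Center X (samples x features) and transform it to identity covariance.

    Hypothetical helper: directions whose variance is below eps times the
    largest eigenvalue are dropped to keep the scaling numerically stable.
    """
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    evals, evecs = np.linalg.eigh(cov)
    keep = evals > eps * evals.max()
    # Scale each retained eigen-direction to unit variance.
    W = evecs[:, keep] / np.sqrt(evals[keep])
    return Xc @ W


def fuse_sources(sources, n_components):
    """Project several co-occurring sources onto shared directions.

    Whitening removes within-source covariance structure, so the PCA of the
    concatenated whitened data is driven by between-source dependencies only.
    The result is a single linear projection of the combined data.
    """
    Z = np.hstack([whiten(X) for X in sources])
    # PCA via SVD; Z is already centered because each whitened block is.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T
```

Calling `fuse_sources([X1, X2], n_components=1)` on two matrices with the same rows (samples) returns one combined component per sample. Because each source is whitened first, a high-variance signal that appears in only one source does not dominate the projection the way it would in a plain PCA of the concatenated data.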



Figure 2: Shared and Data-specific variation for leukemia data. Shared (top) and data-specific (bottom) variation retained with CCA (solid line) and PCA (dashed line) as a function of the reduced dimensionality for the leukemia data. The values obtained by random projections (dash-dotted line and dotted confidence intervals) have been included for reference. The suggested dimensionality for the CCA-projection is marked with a tick.

Mentions: The results are presented for each of the data sets in Figures 2, 3, and 4. In all cases it is easily seen that the proposed method retains clearly less data-specific variation than PCA (bottom subfigures), regardless of the dimension. The CCA-based method still keeps more variation than random projections, indicating that it is not purposefully looking for projection directions that would lose more variation than necessary.

