Structure-revealing data fusion.

Acar E, Papalexakis EE, Gürdeniz G, Rasmussen MA, Lawaetz AJ, Nilsson M, Bro R - BMC Bioinformatics (2014)

Bottom Line: Fusing data from multiple sources has already proved useful in many applications in social network analysis, signal processing and bioinformatics. Using numerical experiments, we demonstrate the effectiveness of the proposed approach in terms of identifying shared and unshared components. Furthermore, we measure a set of mixtures with known chemical composition using both LC-MS (Liquid Chromatography - Mass Spectrometry) and NMR (Nuclear Magnetic Resonance) and demonstrate that the structure-revealing data fusion model can (i) successfully capture the chemicals in the mixtures and extract the relative concentrations of the chemicals accurately, (ii) provide promising results in terms of identifying shared and unshared chemicals, and (iii) reveal the relevant patterns in LC-MS by coupling with the diffusion NMR data.


Affiliation: Department of Food Science, Faculty of Science, University of Copenhagen, Frederiksberg C, Denmark. evrim@life.ku.dk.

ABSTRACT

Background: Analysis of data from multiple sources has the potential to enhance knowledge discovery by capturing underlying structures that are otherwise difficult to extract. Fusing data from multiple sources has already proved useful in many applications in social network analysis, signal processing and bioinformatics. However, data fusion is challenging since data from multiple sources often (i) are heterogeneous (i.e., in the form of higher-order tensors and matrices), (ii) are incomplete, and (iii) have both shared and unshared components. To address these challenges, we introduce a novel unsupervised data fusion model based on joint factorization of matrices and higher-order tensors.

Results: While the traditional formulation of coupled matrix and tensor factorizations modeling only shared factors fails to capture the underlying structures in the presence of both shared and unshared factors, the proposed data fusion model has the potential to automatically reveal shared and unshared components through modeling constraints. Using numerical experiments, we demonstrate the effectiveness of the proposed approach in terms of identifying shared and unshared components. Furthermore, we measure a set of mixtures with known chemical composition using both LC-MS (Liquid Chromatography - Mass Spectrometry) and NMR (Nuclear Magnetic Resonance) and demonstrate that the structure-revealing data fusion model can (i) successfully capture the chemicals in the mixtures and extract the relative concentrations of the chemicals accurately, (ii) provide promising results in terms of identifying shared and unshared chemicals, and (iii) reveal the relevant patterns in LC-MS by coupling with the diffusion NMR data.

Conclusions: We have proposed a structure-revealing data fusion model that can jointly analyze heterogeneous, incomplete data sets with shared and unshared components and demonstrated its promising performance as well as potential limitations on both simulated and real data.
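
To make the idea of a joint factorization with weighted components concrete, the following is a minimal Python sketch of a penalized coupled loss for a third-order tensor X and a matrix Y that share a factor matrix A in their first mode. The exact objective is not stated in this record, so the form used here (squared error for both data sets plus an L1 penalty on the weights λ and σ, with unit-norm factor columns) is an assumption, and the names `acmtf_objective` and `beta` are hypothetical.

```python
import numpy as np


def acmtf_objective(X, Y, A, B, C, V, lam, sigma, beta=1e-3):
    """Penalized coupled loss (illustrative form, assumed rather than quoted).

    X: tensor of shape (I, J, K); Y: matrix of shape (I, M).
    A, B, C, V: factor matrices with R columns each (assumed unit norm).
    lam, sigma: length-R component weights for the tensor and the matrix.
    """
    # Tensor model: X_hat[i, j, k] = sum_r lam[r] * A[i, r] * B[j, r] * C[k, r]
    X_hat = np.einsum("r,ir,jr,kr->ijk", lam, A, B, C)
    # Matrix model: Y_hat = A diag(sigma) V^T
    Y_hat = (A * sigma) @ V.T
    fit = np.sum((X - X_hat) ** 2) + np.sum((Y - Y_hat) ** 2)
    # L1 penalty pushes the weights of components absent from a data set to zero
    penalty = beta * (np.sum(np.abs(lam)) + np.sum(np.abs(sigma)))
    return fit + penalty


def unit_columns(F):
    """Scale every column of F to unit Euclidean norm."""
    return F / np.linalg.norm(F, axis=0)


# Synthetic data: component 1 is shared, component 2 is matrix-only,
# component 3 is tensor-only.
rng = np.random.default_rng(0)
A = unit_columns(rng.standard_normal((8, 3)))
B = unit_columns(rng.standard_normal((6, 3)))
C = unit_columns(rng.standard_normal((5, 3)))
V = unit_columns(rng.standard_normal((7, 3)))
lam_true = np.array([1.0, 0.0, 1.0])
sigma_true = np.array([1.0, 1.0, 0.0])
X = np.einsum("r,ir,jr,kr->ijk", lam_true, A, B, C)
Y = (A * sigma_true) @ V.T

beta = 1e-3
value_at_truth = acmtf_objective(X, Y, A, B, C, V, lam_true, sigma_true, beta)
```

At the true factors the fit term vanishes, so only the penalty on the four non-zero weights remains. A zero entry in λ or σ marks a component that is absent from the tensor or the matrix, respectively.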


Fig6: Case 4 - Weights λ and σ as well as the match score for factor matrix A captured by (a) CMTF and (b) ACMTF.

Mentions: The experiments demonstrate a potential problem with the CMTF model: because of uniqueness issues, it fails to identify shared and unshared components. Our structure-revealing model (ACMTF), in contrast, identifies shared and unshared components through sparsity penalties on the component weights. Figures 3, 4, 5, and 6 show the weights, λ and σ, estimated by both models for 100 runs returning the same function value; that is, multiple random starts are used until the minimum function value has been obtained 100 times. For CMTF, λ and σ are estimated by normalizing the columns of the extracted factor matrices. In Figure 3, we expect to recover λ = [1 0 1]^T and σ = [1 1 0]^T; however, the weights captured by CMTF vary across runs, hiding the true underlying structure of the data sets. ACMTF, on the other hand, reveals the true structure, indicating that there is one shared and one unshared component in each data set. The order of the original and extracted components differs because of the permutation ambiguity in the models. For the same reason, all possible permutations of the components are compared for the runs returning the minimum function value, and the results are reported for the best matching permutation.

The bottom plots in Figure 3 show how well the extracted factors match the true columns of factor matrix A. The match score compares the rth column of the factor matrix extracted from the common mode with the corresponding true column, after finding the best matching permutation of the columns. These plots show that not only do the weights indicate shared/unshared components, but the factor vectors are also estimated well by ACMTF.

Similarly, in Figure 4, we expect to see three non-zero weights for the matrix and two non-zero weights for the tensor. However, when the data sets are modeled using CMTF, there is variation for the same function value, particularly in σ, which hides the structure of the data sets and prevents accurate recovery of the factor vectors. ACMTF, on the other hand, identifies shared and unshared components accurately. Unlike in Cases 1 and 2, CMTF performs well in Case 3, where the tensor has all three components and two of them are shared with the matrix (Figure 5).
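
The two post-processing steps described above, estimating weights by normalizing factor columns and scoring factors after resolving the permutation ambiguity, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the match score formula is not recoverable from this record, so absolute cosine similarity is used as a stand-in, and the function names `cmtf_weights` and `best_permutation_match` are hypothetical. The exhaustive search over permutations is only practical for a small number of components.

```python
import itertools

import numpy as np


def column_norms(F):
    """Euclidean norm of every column of F."""
    return np.linalg.norm(F, axis=0)


def cmtf_weights(A, B, C, V):
    """Weights implied by unnormalized factors of X ~ [[A, B, C]] and Y ~ A V^T.

    Pulling the column norms out of the factors gives
    lam[r] = |a_r| |b_r| |c_r| and sigma[r] = |a_r| |v_r|.
    """
    na, nb, nc, nv = (column_norms(F) for F in (A, B, C, V))
    return na * nb * nc, na * nv


def best_permutation_match(A_true, A_est):
    """Match estimated columns to true columns over all permutations.

    Returns (perm, scores), where perm[r] is the index of the estimated
    column matched to true column r and scores[r] is their absolute cosine
    similarity (an assumed stand-in for the match score).
    """
    At = A_true / column_norms(A_true)
    Ae = A_est / column_norms(A_est)
    S = np.abs(At.T @ Ae)  # S[r, s]: true column r versus estimated column s
    R = S.shape[0]
    perm = max(
        itertools.permutations(range(R)),
        key=lambda p: sum(S[r, p[r]] for r in range(R)),
    )
    scores = np.array([S[r, perm[r]] for r in range(R)])
    return perm, scores


# Usage: an "estimate" equal to the truth up to column order, scale, and sign.
rng = np.random.default_rng(0)
A_true = rng.standard_normal((10, 3))
A_est = A_true[:, [2, 0, 1]] * np.array([2.0, -1.0, 0.5])
B = rng.standard_normal((6, 3))
C = rng.standard_normal((5, 3))
V = rng.standard_normal((7, 3))

perm, scores = best_permutation_match(A_true, A_est)
lam, sigma = cmtf_weights(A_est, B, C, V)
```

Because column scaling can be moved freely between the factor matrices of a component, weights obtained this way are only meaningful once every factor has been normalized, which is why the norms of all modes enter each weight.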

