Kernel approaches for differential expression analysis of mass spectrometry-based metabolomics data.
Bottom Line:
One is the distance-based kernel and the other is the stratified kernel. While we initially describe the procedures in the case of single-metabolite analysis, we extend the methods to handle metabolite sets as well. Evaluation based on both simulated data and real data from a liver cancer metabolomics study indicates that our kernel method has a better performance than some existing alternatives.
Affiliation: Department of Statistics, Pennsylvania State University, 325 Thomas Building, University Park, 16802, PA, USA. xyz5074@psu.edu.
ABSTRACT
Background: Data generated from metabolomics experiments are different from other types of "-omics" data. For example, a common phenomenon in mass spectrometry (MS)-based metabolomics data is that the data matrix frequently contains missing values, which complicates some quantitative analyses. One way to tackle this problem is to treat a missing value as indicating that the metabolite is absent. Hence two types of information are available in metabolomics data: the presence or absence of a metabolite, and a quantitative value of its abundance level if it is present. Combining these two layers of information poses challenges to the application of traditional statistical approaches in differential expression analysis.
Results: In this article, we propose a novel kernel-based score test for metabolomics differential expression analysis. To capture both the continuous pattern and the discrete pattern in metabolomics data simultaneously, two new kinds of kernels are designed. One is the distance-based kernel and the other is the stratified kernel. While we initially describe the procedures in the case of single-metabolite analysis, we extend the methods to handle metabolite sets as well.
Conclusions: Evaluation based on both simulated data and real data from a liver cancer metabolomics study indicates that our kernel method has a better performance than some existing alternatives. An implementation of the proposed kernel method in the R statistical computing environment is available at http://works.bepress.com/debashis_ghosh/60/ .
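The two layers of information described in the Background can be illustrated with a minimal sketch. This is not the authors' R implementation; the abundance values and the variable names (`abundance`, `present`, `observed`) are hypothetical and chosen only for illustration, and missing values are assumed to be encoded as NaN.

```python
import numpy as np

# Hypothetical abundance vector for one metabolite across six samples;
# np.nan marks a missing value in the MS data matrix (an assumption).
abundance = np.array([12.3, np.nan, 8.7, np.nan, 15.1, 9.4])

# Layer 1: presence/absence indicator (a missing value is treated as absent).
present = ~np.isnan(abundance)

# Layer 2: quantitative abundance, defined only where the metabolite is present.
observed = abundance[present]

print(present.astype(int))  # presence indicator per sample
print(observed)             # abundances of the samples where it is present
```

A method that uses only `present` ignores the abundance levels, and one that uses only `observed` ignores the presence pattern, which is why combining the two layers is the stated challenge.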
Mentions: Figure 1 presents histograms of the p-values from the different methods for the simulation scenario with a low group effect, 20% missing data and 50% differentially expressed metabolites. In this simulation, a total of 1000 metabolite sets were generated, and the first 500 were truly differentially expressed. Except for the Qual method, the p-value histograms of the other five methods showed the expected distribution, in the sense that about 500 hypotheses were rejected and most of them were true positives (the corresponding number for the Qual method was about 400). Figure 2 shows the ROC curves of the different methods for the same simulation study as in Figure 1. We were interested in the performance of all tests at a low false positive rate, so we considered FDR ≤ 0.05. Based on Figure 2a (left panel), the kernel methods and the Quant, T and Wilcox methods performed better than the Qual method, which supports the result in Figure 1. Figure 2b (right panel) focuses on the region with a true positive rate from 0.9 to 1. Because the Qual method could not achieve a true positive rate greater than 0.8 in Figure 2a, it does not appear in Figure 2b. Based on Figure 2b, the kernel methods had the best performance in terms of the ROC curve, followed by the Wilcoxon signed-rank test. A similar pattern was observed for the other simulation scenarios. One reason for the good performance of the Wilcoxon test is that there were almost no ties in our simulation setting. Recall that the number of metabolites in each set was uniformly distributed on {1,2,…,15}, so most of the 1000 metabolite sets contained more than one metabolite. We performed the Wilcoxon test on the average score of the metabolites in each set. Averaging across metabolites broke many of the ties at zero, so the average scores contained far fewer ties. This explains the fairly good performance of the Wilcoxon test in Figure 2.
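The effect of averaging on tied zeros can be checked with a small simulation. This is a rough sketch under stated assumptions, not the authors' simulation design: absent metabolites are assumed to be coded as zero, abundances are drawn from a lognormal distribution, and the sample count, set size and seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

n_samples = 200      # hypothetical number of samples
set_size = 8         # hypothetical set size (the source draws it from {1,...,15})
missing_rate = 0.2   # matches the 20% missing-data scenario

# Simulate strictly positive abundances, then set about 20% of entries
# to zero to represent absent metabolites (an assumed coding).
values = rng.lognormal(mean=0.0, sigma=1.0, size=(n_samples, set_size))
absent = rng.random((n_samples, set_size)) < missing_rate
values[absent] = 0.0


def tied_zero_fraction(x):
    """Fraction of entries exactly equal to zero, i.e. tied at zero."""
    return float(np.mean(x == 0.0))


single = tied_zero_fraction(values[:, 0])            # one metabolite
averaged = tied_zero_fraction(values.mean(axis=1))   # set-level average score

print(f"zero fraction, single metabolite: {single:.3f}")
print(f"zero fraction, set average:       {averaged:.3f}")
```

A set-level average is zero only when every metabolite in the set is absent for that sample, so the share of tied zeros falls sharply as the set size grows. This is consistent with the explanation above for why a rank-based test loses little to ties on averaged scores.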