Kernel approaches for differential expression analysis of mass spectrometry-based metabolomics data.

Zhan X, Patterson AD, Ghosh D - BMC Bioinformatics (2015)



Affiliation: Department of Statistics, Pennsylvania State University, 325 Thomas Building, University Park, PA 16802, USA. xyz5074@psu.edu.

ABSTRACT

Background: Data generated from metabolomics experiments are different from other types of "-omics" data. For example, a common phenomenon in mass spectrometry (MS)-based metabolomics data is that the data matrix frequently contains missing values, which complicates some quantitative analyses. One way to tackle this problem is to treat a metabolite with a missing value as absent. Hence there are two types of information available in metabolomics data: the presence or absence of a metabolite, and a quantitative value of its abundance level if it is present. Combining these two layers of information poses challenges to the application of traditional statistical approaches in differential expression analysis.
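The two layers of information described above can be illustrated with a minimal sketch. It assumes that a missing value is stored as None and that an absent metabolite is given an abundance of 0.0; both conventions, and the function name split_layers, are hypothetical and are not taken from the article.

```python
def split_layers(intensities):
    """Split one metabolite's measurements into two layers.

    Assumption (not from the article): a missing value is stored as None
    and is read as "metabolite absent" with an abundance of 0.0.
    """
    # Discrete layer: 1 if the metabolite was detected, 0 otherwise.
    present = [0 if x is None else 1 for x in intensities]
    # Continuous layer: the measured abundance, or 0.0 when absent.
    abundance = [0.0 if x is None else float(x) for x in intensities]
    return present, abundance


# Three samples, the second with a missing value.
present, abundance = split_layers([12.5, None, 3.0])
```

A test that uses only the abundance layer treats absence as a tie at zero, while one that uses only the presence layer discards the quantitative values.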

Results: In this article, we propose a novel kernel-based score test for the metabolomics differential expression analysis. In order to simultaneously capture both the continuous pattern and discrete pattern in metabolomics data, two new kinds of kernels are designed. One is the distance-based kernel and the other is the stratified kernel. While we initially describe the procedures in the case of single-metabolite analysis, we extend the methods to handle metabolite sets as well.

Conclusions: Evaluation based on both simulated data and real data from a liver cancer metabolomics study indicates that our kernel method has a better performance than some existing alternatives. An implementation of the proposed kernel method in the R statistical computing environment is available at http://works.bepress.com/debashis_ghosh/60/ .



Figure 4: Significance versus FDR. Number of significantly differentially expressed metabolites versus the estimated FDR on the hepatocellular carcinoma data. The vertical dotted line marks an estimated FDR of 0.05.

Figure 3 illustrates the p-values obtained from the three methods. We checked the distribution of the p-values greater than 0.05 using the QQ-plot in the right panel of Figure 3. The deviations from the straight line were mostly minimal. Hence, the p-values greater than 0.05 were approximately distributed as Uniform(0,1), and it was valid to use Eq. (9) to estimate the FDR.

Figure 4 shows curves of the number of significantly differentially expressed metabolites versus the estimated FDR. Each point on a curve corresponds to a cutoff value c. The y-axis gives the number of features with a p-value smaller than c, and the x-axis gives the FDR estimated using Eq. (9). The range of the cutoff value c was set to (0, 0.05) in Figure 4. Different λ values in Eq. (9) were used, and the results were similar; the one presented in Figure 4 corresponds to λ=0.7.

Based on Figure 4, the distance-based kernel score test had the best performance in that it detected more significant feature-sets at a given estimated FDR level than the other two methods. At an estimated FDR level lower than 0.1, our stratified kernel score test also outperformed the Wilcoxon signed-rank test. At an estimated FDR level of 0.05, 279, 218, and 194 feature-sets were detected as significantly differentially expressed by the distance-based kernel score test, the stratified kernel score test, and the Wilcoxon test, respectively. At an estimated FDR level of 0.01, the corresponding numbers of rejections were 210, 163, and 86. Therefore, our kernel-based method had the best performance in metabolomics differential expression analysis on this HCC data, especially at a low FDR level. In this HCC dataset, 1064 out of 1130 feature-sets contain only one feature. There are many tied zero values in those single-feature feature-sets, and those ties reduce the power of the Wilcoxon signed-rank test.

Moreover, we also performed the grouping based on Spearman’s correlation. The results are shown in Section 2 of Additional file 1. The results obtained from grouping based on Spearman’s correlation are very similar to those using Pearson’s correlation. The same analysis can also be applied to the differential analysis using Spearman’s correlation-based groupings. Our kernel approach for differential expression analysis performs well irrespective of the grouping scheme.
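Eq. (9) is not reproduced in this excerpt. The sketch below therefore assumes a Storey-type estimator, in which the proportion of true nulls is estimated from the p-values above λ and the FDR at cutoff c is π0·m·c divided by the number of rejections. This assumed form may differ from the article's Eq. (9), and the function names estimate_pi0 and fdr_curve are hypothetical.

```python
from bisect import bisect_left


def estimate_pi0(pvalues, lam=0.7):
    """Estimate the proportion of true nulls from the p-values above lam.

    Assumed Storey-type form; the article's Eq. (9) may differ.
    """
    m = len(pvalues)
    n_above = sum(1 for p in pvalues if p > lam)
    # Null p-values are roughly Uniform(0,1), so about pi0 * m * (1 - lam)
    # of them are expected to fall above lam.
    return min(1.0, n_above / (m * (1.0 - lam)))


def fdr_curve(pvalues, cutoffs, lam=0.7):
    """Return (cutoff, number of rejections, estimated FDR) per cutoff."""
    m = len(pvalues)
    pi0 = estimate_pi0(pvalues, lam)
    sorted_p = sorted(pvalues)
    curve = []
    for c in cutoffs:
        # Number of features with a p-value strictly smaller than c.
        n_rej = bisect_left(sorted_p, c)
        # Expected false rejections (pi0 * m * c) over observed rejections.
        fdr = min(1.0, pi0 * m * c / max(n_rej, 1))
        curve.append((c, n_rej, fdr))
    return curve


# Toy p-values (illustrative only, not data from the study).
toy_p = [0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, 0.010,
         0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
toy_curve = fdr_curve(toy_p, [0.01, 0.02, 0.05], lam=0.7)
```

Plotting the second element of each tuple against the third, one curve per test, would give a plot of the kind described for Figure 4.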

