Validation and selection of ODE based systems biology models: how to arrive at more reliable decisions.

Hasdemir D, Hoefsloot HC, Smilde AK - BMC Syst Biol (2015)

Bottom Line: The drawbacks of hold-out validation are usually underestimated. The hold-out validation strategy leads to biased conclusions, since it can lead to different validation and selection decisions when different partitioning schemes are used. Stratified random cross-validation (SRCV) therefore proves to be a promising alternative to the standard hold-out validation strategy.

Affiliation: Biosystems Data Analysis Group, Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, The Netherlands. D.Hasdemir@uva.nl.

ABSTRACT

Background: Most ordinary differential equation (ODE) based modeling studies in systems biology involve a hold-out validation step for model validation. In this framework, a pre-determined part of the data is used as validation data and is therefore not used for estimating the parameters of the model. The model is assumed to be validated if its predictions on the validation dataset show good agreement with the data. Model selection between alternative model structures can also be performed in the same setting, based on the predictive power of the model structures on the validation dataset. However, the drawbacks associated with this approach are usually underestimated.

Results: We have carried out simulations using a recently published High Osmolarity Glycerol (HOG) pathway from S. cerevisiae to demonstrate these drawbacks. We have shown that how the data is partitioned, and which part of the data is used for validation, matters a great deal. The hold-out validation strategy leads to biased conclusions, since it can lead to different validation and selection decisions when different partitioning schemes are used. Furthermore, finding sensible partitioning schemes that would lead to reliable decisions is heavily dependent on the biology and on unknown model parameters, which turns the problem into a paradox. This creates a need for alternative validation approaches that offer flexible partitioning of the data. For this purpose, we have introduced a stratified random cross-validation (SRCV) approach that successfully overcomes these limitations.

Conclusions: SRCV leads to more stable decisions for both validation and selection, and these decisions are not biased by underlying biological phenomena. Furthermore, it is less dependent on the specific noise realization in the data. Therefore, it proves to be a promising alternative to the standard hold-out validation strategy.
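
The contrast between a single pre-determined hold-out partition and a stratified random partition can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how the strata are defined, so the sketch assumes that each data point carries a stratum label (hypothetically, the experiment it came from) and that every fold should receive a near-equal share of each stratum. The function names `stratified_random_folds` and `train_validation_splits` are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_random_folds(strata, k, seed=0):
    """Assign data point indices to k folds, stratified by label.

    strata[i] is the stratum label of data point i. Treating the
    experiment as the stratum is an assumption of this sketch, not
    something taken from the paper.
    """
    rng = random.Random(seed)

    # Group data point indices by their stratum label.
    by_stratum = defaultdict(list)
    for idx, label in enumerate(strata):
        by_stratum[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_stratum):
        members = by_stratum[label]
        rng.shuffle(members)
        # Deal the shuffled members round-robin, starting at a random
        # fold, so each fold gets a near-equal share of every stratum.
        offset = rng.randrange(k)
        for pos, idx in enumerate(members):
            folds[(pos + offset) % k].append(idx)
    return folds


def train_validation_splits(folds):
    """Yield (training, validation) index lists, one pair per fold."""
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(training), sorted(validation)


if __name__ == "__main__":
    # Twelve hypothetical measurements from two hypothetical experiments.
    labels = ["wt"] * 6 + ["mutant"] * 6
    for train, valid in train_validation_splits(stratified_random_folds(labels, k=3)):
        print("train:", train, "validation:", valid)
```

Each fold serves once as the validation set while the remaining folds are used for parameter estimation, so no single partitioning choice determines the validation or selection decision.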

Fig. 8: Fit and predictions obtained on a single realization of data in the Sln1 scheme. Black and red points (connected by lines of the same color) refer to data points which were used for parameter estimation and validation, respectively. In this example, all doses of Sln1 data and the data on the downstream species (protein, mRNA and internal glycerol) were used for parameter estimation. The magenta lines show the profiles obtained (both fit and prediction) by using the true model for the parameter estimation. The orange lines show the profiles obtained with the simplified model structure. All concentrations are given in percentages. The top three rows are for the Hog1PP data. The title of each graph shows the dose and the cell type of the experiment in which the Hog1PP data was collected. The last row of graphs gives the concentration ratios for the downstream species. The associated data was collected in a single experiment with WT cells following a 0.5 M NaCl shock.

Mentions: Firstly, we would like to stress that in all of our simulations, we observed a very good fit of the true model structure to the data. Additionally, our emphasis in this work is on model validation and selection using validation datasets which were excluded from the training set. Therefore, we do not present a detailed analysis of the quality of model fits. We present the model fits together with the predictions in only two examples, Fig. 8 and Additional file 1: Figure S2. We should also mention that the term 'prediction' always refers to predictions on validation datasets throughout the text. Finally, we present our results on both percentage error (PE) and model separation (ΔTS) using a box plot representation. In this representation, each box plot shows the distribution of the associated measure across the 100 different noise realizations. For example, in the case of percentage error, the median of this distribution indicates how high the prediction errors are in general. In addition, the box plots show outliers with relatively high prediction errors as red points outside the boxes.
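
The box plot summary described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the excerpt does not give the formula for the percentage error, so `percentage_error` uses an assumed definition (summed absolute deviation relative to the summed absolute data), and outliers are flagged with the common 1.5 × IQR rule, which may differ from the rule used for the published plots. Both function names are hypothetical.

```python
import statistics


def percentage_error(predicted, observed):
    """Prediction error on a validation set, in percent.

    ASSUMED definition (not taken from the paper): the summed absolute
    deviation between predictions and data, relative to the summed
    absolute data.
    """
    deviation = sum(abs(p - o) for p, o in zip(predicted, observed))
    scale = sum(abs(o) for o in observed)
    return 100.0 * deviation / scale


def box_plot_summary(values):
    """Median, quartiles and outliers of one measure across realizations.

    Outliers are flagged with the 1.5 * IQR convention, which is an
    assumption of this sketch.
    """
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"median": median, "q1": q1, "q3": q3, "outliers": outliers}


if __name__ == "__main__":
    # Hypothetical percentage errors from nine noise realizations.
    errors = [1, 2, 3, 4, 5, 6, 7, 8, 100]
    print(box_plot_summary(errors))
```

In the setting described above, `values` would hold one percentage error per noise realization (100 in total), so the median summarizes the typical prediction error and the flagged points correspond to realizations with unusually poor predictions.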

