Validation and selection of ODE based systems biology models: how to arrive at more reliable decisions.

Hasdemir D, Hoefsloot HC, Smilde AK - BMC Syst Biol (2015)

Bottom Line: The drawbacks of hold-out validation are usually underestimated. The hold-out validation strategy leads to biased conclusions, since it can lead to different validation and selection decisions when different partitioning schemes are used. Stratified random cross-validation (SRCV) therefore proves to be a promising alternative to the standard hold-out validation strategy.


Affiliation: Biosystems Data Analysis Group, Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, The Netherlands. D.Hasdemir@uva.nl.

ABSTRACT

Background: Most ordinary differential equation (ODE) based modeling studies in systems biology involve a hold-out validation step for model validation. In this framework, a pre-determined part of the data is used as validation data and is therefore not used for estimating the parameters of the model. The model is assumed to be validated if its predictions on the validation dataset show good agreement with the data. Model selection between alternative model structures can also be performed in the same setting, based on the predictive power of the model structures on the validation dataset. However, the drawbacks associated with this approach are usually underestimated.

Results: We have carried out simulations using a recently published High Osmolarity Glycerol (HOG) pathway from S. cerevisiae to demonstrate these drawbacks. We have shown that how the data are partitioned, and which part of the data is used for validation, matters greatly. The hold-out validation strategy leads to biased conclusions, since it can lead to different validation and selection decisions when different partitioning schemes are used. Furthermore, finding sensible partitioning schemes that would lead to reliable decisions is heavily dependent on the biology and on unknown model parameters, which turns the problem into a paradox. This creates a need for alternative validation approaches that offer flexible partitioning of the data. For this purpose, we have introduced a stratified random cross-validation (SRCV) approach that successfully overcomes these limitations.

Conclusions: SRCV leads to more stable decisions for both validation and selection, and these decisions are not biased by underlying biological phenomena. Furthermore, SRCV is less dependent on the specific noise realization in the data. Therefore, it proves to be a promising alternative to the standard hold-out validation strategy.


Fig. 5: Partitioning schemes used in the adapted scenarios. Light gray colored boxes show parts of the data which we used as the training set (T) for parameter estimation. Dark gray colored boxes show parts which we used as the validation set (V). Different background colors represent different partitioning schemes and are consistent with the colors used in the graphs in the Results and discussion section. Each partitioning scheme offers the use of twelve subsets of the Hog1PP data as the training set and the remaining six subsets as the validation set. The training set included either (a) Sln1 & Sho1 data, (b) Sln1 & WT data, (c) Sho1 & WT data, (d) the lowest four doses from each cell type or (e) the highest four doses from each cell type.

Mentions: Lastly, we introduce variation in the training sets. We update our first scenario so that each partitioning scheme (Fig. 5a-c) uses data from two cell types for training. All six doses from the remaining cell type can be used as validation data. We make consensus decisions on model validation and selection by considering all the validation subsets. Throughout the manuscript, these schemes are named Sln1/Sho1, Sln1/WT and Sho1/WT, after the pair of training cell types. We update our second scenario so that each partitioning scheme (Fig. 5d-e) uses data from either the four lowest or the four highest doses of each cell type for training. The remaining two doses from each cell type can be used as validation data. As in the first updated scenario, we make consensus decisions using all validation subsets at once. These schemes are named low doses and high doses, after the doses used in the training set. This way, we obtain five schemes (Fig. 5), each of which uses twelve subsets of the phosphorylated Hog1 (Hog1PP) data for training and the remaining six subsets for validation. Therefore, these five schemes can be compared to the stratified cross-validation scheme, which also uses twelve subsets of Hog1PP data in each training set.
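The bookkeeping behind these partitions can be sketched in a few lines of Python. This is only an illustration of the counting described above, not the authors' code: the dose ranks 1 to 6, the function names, and the `(cell type, dose)` labels for the eighteen Hog1PP subsets are all assumptions. In particular, `stratified_random_folds` is one plausible reading of a stratified split (two randomly chosen doses per cell type in each of three folds, so every training set has twelve subsets) and may differ from the SRCV procedure in the paper.

```python
import random

CELL_TYPES = ["Sln1", "Sho1", "WT"]  # the three cell types named in the text
DOSES = [1, 2, 3, 4, 5, 6]           # hypothetical dose ranks, 1 = lowest dose

# Eighteen data subsets, one per (cell type, dose) pair.
ALL_SUBSETS = [(cell, dose) for cell in CELL_TYPES for dose in DOSES]


def holdout_schemes():
    """Return {scheme name: (training subsets, validation subsets)}."""
    schemes = {}

    # Fig. 5a-c: train on two cell types, validate on the remaining one.
    for held_out in CELL_TYPES:
        train_types = [c for c in CELL_TYPES if c != held_out]
        name = "/".join(train_types)  # e.g. "Sln1/Sho1"
        train = [s for s in ALL_SUBSETS if s[0] != held_out]
        valid = [s for s in ALL_SUBSETS if s[0] == held_out]
        schemes[name] = (train, valid)

    # Fig. 5d-e: train on the four lowest or four highest doses of every
    # cell type, validate on the remaining two doses.
    dose_schemes = {"low doses": set(DOSES[:4]), "high doses": set(DOSES[-4:])}
    for name, train_doses in dose_schemes.items():
        train = [s for s in ALL_SUBSETS if s[1] in train_doses]
        valid = [s for s in ALL_SUBSETS if s[1] not in train_doses]
        schemes[name] = (train, valid)

    return schemes


def stratified_random_folds(n_folds=3, seed=0):
    """Assumed stratification: shuffle the doses within each cell type and
    deal them out evenly, so each fold holds two doses per cell type."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cell in CELL_TYPES:
        doses = DOSES[:]
        rng.shuffle(doses)
        for i, dose in enumerate(doses):
            folds[i % n_folds].append((cell, dose))
    return folds


if __name__ == "__main__":
    for name, (train, valid) in holdout_schemes().items():
        print(name, len(train), len(valid))
    for k, fold in enumerate(stratified_random_folds()):
        print("fold", k, sorted(fold))
```

Under these assumptions, each hold-out scheme and each cross-validation round trains on twelve subsets and validates on six, which is what makes the five schemes comparable to the stratified scheme.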

