A tree-like Bayesian structure learning algorithm for small-sample datasets from complex biological model systems.

Yin W, Garimalla S, Moreno A, Galinski MR, Styczynski MP - BMC Syst Biol (2015)


Affiliation: Key Laboratory for Biomedical Engineering of Education Ministry, Department of Biomedical Engineering, Zhejiang University, Hangzhou, P. R. China. wwyin@mail.zju.edu.cn.

ABSTRACT

Background: There are increasing efforts to bring high-throughput systems biology techniques to bear on complex animal model systems, often with a goal of learning about underlying regulatory network structures (e.g., gene regulatory networks). However, complex animal model systems typically have significant limitations on cohort sizes, number of samples, and the ability to perform follow-up and validation experiments. These constraints are particularly problematic for many current network learning approaches, which require large numbers of samples and may predict many more regulatory relationships than actually exist.

Results: Here, we test the idea that by leveraging the accuracy and efficiency of classifiers, we can construct high-quality networks that capture important interactions between variables in datasets with few samples. We start from a previously developed tree-like Bayesian classifier and generalize its network learning approach to allow for arbitrary depth and complexity of tree-like networks. Using four diverse sample networks, we demonstrate that this approach performs consistently better at low sample sizes than the Sparse Candidate Algorithm, chosen for comparison because it is known to generate Bayesian networks with high positive predictive value. We develop and demonstrate a resampling-based approach for identifying a viable root for the learned tree-like network, which is important when the root of a network is not known a priori. We also develop and demonstrate an integrated resampling-based approach to reducing the variable space before network learning. Finally, we demonstrate the utility of this approach by analyzing a transcriptional dataset from a malaria challenge in a non-human primate model system, Macaca mulatta, suggesting the potential to capture indicators of the earliest stages of cellular differentiation during leukopoiesis.
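To make the resampling idea concrete, below is a minimal sketch of one generic way a resampling-based root search could be organized: score every variable as a candidate root on each bootstrap resample, then keep the candidate chosen most often. The function names and the use of total mutual information as the root score are illustrative assumptions, not the authors' actual criterion.

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in MI estimate for two discrete (integer-coded) vectors."""
    xs, x_idx = np.unique(x, return_inverse=True)
    ys, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((xs.size, ys.size))
    np.add.at(joint, (x_idx, y_idx), 1.0)   # co-occurrence counts
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal P(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal P(y)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px * py)[nz])).sum())

def root_score(data, v):
    """Stand-in root score: total MI between candidate v and all other
    variables. The paper's actual scoring criterion is not reproduced."""
    return sum(mutual_information(data[:, v], data[:, u])
               for u in range(data.shape[1]) if u != v)

def select_root_by_resampling(data, n_resamples=100, seed=0):
    """Tally which variable scores best as root across bootstrap
    resamples of the rows; return the most frequently chosen one."""
    rng = np.random.default_rng(seed)
    n_samples, n_vars = data.shape
    votes = np.zeros(n_vars, dtype=int)
    for _ in range(n_resamples):
        idx = rng.integers(0, n_samples, size=n_samples)  # bootstrap rows
        sample = data[idx]
        scores = [root_score(sample, v) for v in range(n_vars)]
        votes[int(np.argmax(scores))] += 1
    return int(np.argmax(votes))

# Toy usage on a random discrete dataset (20 samples x 5 variables):
data = np.random.default_rng(1).integers(0, 3, size=(20, 5))
print(select_root_by_resampling(data))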

Conclusions: We demonstrate that by starting from effective and efficient approaches for creating classifiers, we can identify interesting tree-like network structures with significant ability to capture the relationships in the training data. This approach represents a promising strategy for inferring networks with high positive predictive value under the constraint of small numbers of samples, meeting a need that will only continue to grow as more high-throughput studies are applied to complex model systems.



Fig. 3: Sensitivity analysis of ffc shows a smaller impact than random variability. The Child network was analyzed with 100 samples, 10 times for each ffc value ranging from 0.2 to 0.4. The variability induced by changing ffc (the range of TPR and FPR across all parameter values) is smaller than the variability across the different random datasets used for structure learning (the error bars for any given ffc value). This suggests that there is a broad optimum of ffc values and that the value used in this work is a reasonable (though perhaps not optimal) one. TPR: true positive rate; FPR: false positive rate. Error bars represent one standard deviation.

Mentions: We found that the parameter that most directly affected structure learning results was ffc, which is used to determine which nodes are children of the current bottom-layer nodes. It is a user-defined parameter whose optimal value is possibly data-specific. In our simulations, we used ffc = 0.3 based on our initial explorations and on empirical experience from previously published work [16]. However, it is important to assess the sensitivity of algorithm performance to critical parameters so that over-fitting and inappropriate comparisons are avoided. We assessed the accuracy of TL-BSLA while varying ffc over a factor of 2 (from 0.2 to 0.4; see Fig. 3). For the Child system with 100 samples, we found that the variation induced by different ffc values (the maximum difference in average TPR across ffc values is less than 10 %) was smaller than the variation induced by different datasets (e.g., the TPR across 10 datasets can vary by over 20 %). This ffc-induced variation became even smaller as the sample size increased (Additional file 1: Figure S2). Under the constraint of small sample sizes, then, even substantial changes in ffc had little effect. In fact, 0.3 was not necessarily even the optimal value of ffc (see Fig. 3), but we used it successfully for the four datasets studied here, which are diverse in both size and topology. We also performed a similar parameter sensitivity analysis for SCA to confirm that the results in Fig. 2 were not due to poorly chosen defaults. This analysis, presented in Additional file 1: Figure S3, shows that the performance of SCA was not highly sensitive to its parameters and that changing parameter selections did not allow SCA to perform as well on the Child system as TL-BSLA.
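The comparison behind Fig. 3 reduces to a simple check: is the spread of mean TPR across ffc values smaller than the spread of TPR across random datasets at any fixed ffc? The sketch below illustrates that check, assuming directed edges are scored as ordered (parent, child) pairs against the known network; the edge_recovery_rates helper and the placeholder TPR table are illustrative assumptions, since the TL-BSLA implementation itself is not reproduced here.

```python
import numpy as np

def edge_recovery_rates(true_edges, learned_edges, n_vars):
    """TPR/FPR for directed-edge recovery against a known network.
    Edges are ordered (parent, child) pairs; this scoring convention is
    an assumption, not necessarily the paper's exact definition."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    tp = len(true_edges & learned_edges)          # true edges recovered
    fp = len(learned_edges - true_edges)          # spurious edges reported
    n_possible = n_vars * (n_vars - 1)            # ordered pairs, no self-loops
    return tp / len(true_edges), fp / (n_possible - len(true_edges))

# Toy check: 1 of 2 true edges recovered, 1 spurious edge reported.
print(edge_recovery_rates({(0, 1), (0, 2)}, {(0, 1), (1, 2)}, n_vars=3))

# tpr[i, j]: TPR for ffc_values[i] on random dataset j. Placeholder draws
# stand in here for actual structure learning runs on fresh 100-sample
# datasets simulated from the Child network.
ffc_values = [0.2, 0.25, 0.3, 0.35, 0.4]
rng = np.random.default_rng(0)
tpr = rng.normal(loc=0.7, scale=0.05, size=(len(ffc_values), 10))

spread_across_ffc = np.ptp(tpr.mean(axis=1))   # range of mean TPR over ffc
spread_across_data = tpr.std(axis=1, ddof=1)   # per-ffc std over datasets
print(f"spread across ffc values:    {spread_across_ffc:.3f}")
print(f"mean spread across datasets: {spread_across_data.mean():.3f}")
```

With real runs substituted for the placeholder draws, the two printed summary numbers correspond to the comparison reported in the figure: a broad optimum would show the first number smaller than the second.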

