Cell-type-specific predictive network yields novel insights into mouse embryonic stem cell self-renewal and cell fate.

Dowell KG, Simons AK, Wang ZZ, Yun K, Hibbs MA - PLoS ONE (2013)

Bottom Line: We then integrated these data into a consensus mESC functional relationship network focused on biological processes associated with embryonic stem cell self-renewal and cell fate determination. Computational evaluations, literature validation, and analyses of predicted functional linkages show that our results are highly accurate and biologically relevant. Our mESC network predicts many novel players involved in self-renewal and serves as the foundation for future pluripotent stem cell studies.

View Article: PubMed Central - PubMed

Affiliation: The Jackson Laboratory, Bar Harbor, Maine, USA.

ABSTRACT
Self-renewal, the ability of a stem cell to divide repeatedly while maintaining an undifferentiated state, is a defining characteristic of all stem cells. Here, we clarify the molecular foundations of mouse embryonic stem cell (mESC) self-renewal by applying a proven Bayesian network machine learning approach to integrate high-throughput data for protein function discovery. By focusing on a single stem-cell system, at a specific developmental stage, within the context of well-defined biological processes known to be active in that cell type, we produce a consensus predictive network that reflects biological reality more closely than those made by prior efforts using more generalized, context-independent methods. In addition, we show how machine learning efforts may be misled if the tissue-specific role of mammalian proteins is not defined in the training set and circumscribed in the evidential data. For this study, we assembled an extensive compendium of mESC data: ∼2.2 million data points, collected from 60 different studies, under 992 conditions. We then integrated these data into a consensus mESC functional relationship network focused on biological processes associated with embryonic stem cell self-renewal and cell fate determination. Computational evaluations, literature validation, and analyses of predicted functional linkages show that our results are highly accurate and biologically relevant. Our mESC network predicts many novel players involved in self-renewal and serves as the foundation for future pluripotent stem cell studies. This network can be used by stem cell researchers (at http://StemSight.org) to explore hypotheses about gene function in the context of self-renewal and to prioritize genes of interest for experimental validation.



Importance of Feature Selection in Bayesian Network Machine Learning. A. Networks trained using the same mESC gold standard but different feature sets had markedly different evidence of overfitting. We generated networks using three different feature sets: a minimalist library of 16 datasets composed largely of non-cell-type-specific data from molecular interaction databases, our mESC-specific compendium composed of 164 datasets restricted to mESC data and a small amount of data not specific to any cell type, a superset compendium composed of all mESC training data plus an additional 646 non-tissue-specific mouse microarrays, and a negative control compendium containing all datasets except those with mESC data. Using machine learning metrics, we found that the network trained on a small amount of non-tissue-specific data achieved the lowest ROC curve AUCs and had the least amount of overfitting. The mESC-specific network achieved a higher AUC and showed evidence of minimal overfitting. The superset and negative control networks had the highest AUCs, but also showed extreme overfitting, with a difference of greater than 0.1 between training and test set AUCs. Bootstrapping followed by out-of-bag averaging largely corrected for overfitting in the mESC-specific, superset, and negative control networks. However, network content varied dramatically. B. Overfitting in Networks with Randomly Generated Gold Standards. Networks trained on randomly generated gold standards performed better than random according to standard machine learning metrics, but 4-fold cross-validation revealed that these networks had evidence of overfitting that could be corrected using regularization and bagging techniques. C. Evaluating Network Differences using Positive Gold Standard Posteriors.
A scatterplot of superset versus mESC-only network positive gold standard posterior edges (those with a prior of 1) illustrates that while there is relatively high correlation (Pearson correlation r = 0.6592), there is also a broad range of disparity between the two networks. A scatterplot of negative control versus mESC shows that there is less correlation between the two networks (Pearson correlation r = 0.2311), and reveals the subset of the training gold standard supported by non-mESC data.
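The panel C comparison boils down to computing a Pearson correlation between the posterior probabilities that two networks assign to the same gold-standard edges. A minimal sketch of that calculation, using hypothetical posterior values rather than the study's actual edge data:

```python
def pearson_r(x, y):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical posterior probabilities for the same positive
# gold-standard edges (prior = 1) in two differently trained networks.
mesc_posteriors     = [0.95, 0.80, 0.60, 0.90, 0.40, 0.75]
superset_posteriors = [0.90, 0.70, 0.75, 0.85, 0.55, 0.65]

r = pearson_r(mesc_posteriors, superset_posteriors)
print(round(r, 4))
```

A high r with a wide scatter, as in the superset comparison, means the networks broadly agree on edge rankings while still diverging substantially on individual edges.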


pone-0056810-g005: Importance of Feature Selection in Bayesian Network Machine Learning.

Mentions: As a negative control, we assembled a feature set composed of a large amount of inappropriate data: 656 datasets from a broad range of mouse tissues and cell types, excluding mESCs (Table S6). This feature set was composed largely of microarray data and spanned ∼13,500 experimental conditions. To further explore the impact of combining any type of mouse data, we created a feature "superset" based on a sprawling compendium of all available high-throughput mouse data, including data from our negative control, minimalist set, and mESC-specific datasets: a total of 810 datasets representing ∼14,500 conditions (Table S7). Both the negative control and superset networks achieved higher AUCs than the mESC network (0.88 and 0.86, respectively), but they also exhibited more dramatic evidence of overfitting (Figure 5A). This was not unexpected, as the number of features in these test sets far exceeded the number of genes in the training set. Consequently, the Bayes net was able to find patterns in the noise of the input data that most likely did not reflect real biology, often manifesting as over-inflated results. Overfitting in networks generated using the superset of input data was even more apparent when trained on randomly generated, negative control gold standards. These test networks all achieved AUCs in the mid-to-high 0.80s; however, overfitting was largely mitigated by regularization and bootstrap aggregation, which reduced test AUCs back to the expected random levels (∼0.5) (Figure 5B).
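The overfitting diagnostic used throughout this analysis is the gap between training and held-out ROC AUCs. A minimal sketch of that check, with a rank-based AUC and hypothetical edge scores and labels (not the study's data); the 0.1 gap threshold is the one the figure caption cites:

```python
def roc_auc(scores, labels):
    # AUC via the rank-sum (Mann-Whitney U) statistic: the probability
    # that a random positive example outranks a random negative one.
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def overfit_gap(train_auc, test_auc, threshold=0.1):
    # Flag extreme overfitting when the train/test AUC difference
    # exceeds the threshold (0.1 in the figure caption).
    return (train_auc - test_auc) > threshold

# Hypothetical predicted edge scores with gold-standard labels.
train_auc = roc_auc([0.9, 0.8, 0.7, 0.3, 0.2], [1, 1, 1, 0, 0])
test_auc  = roc_auc([0.6, 0.4, 0.7, 0.5, 0.3], [1, 1, 0, 0, 0])
print(train_auc, test_auc, overfit_gap(train_auc, test_auc))  # → 1.0 0.5 True
```

A near-perfect training AUC paired with a near-random test AUC, as in this toy case, is exactly the signature the superset and negative control networks showed before bootstrap aggregation pulled their test AUCs back toward honest levels.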
