Prediction of tissue-specific cis-regulatory modules using Bayesian networks and regression trees.
Bottom Line:
We develop a Bayesian network approach for identifying cis-regulatory modules likely to regulate tissue-specific expression. Our approach is shown to accurately identify known human liver and erythroid-specific modules. When applied to the prediction of tissue-specific modules in 10 different tissues, the network predicts a number of important transcription factor combinations whose concerted binding is associated with specific expression.
Affiliation: McGill Centre for Bioinformatics, 3775 University Street, room 332, Montreal, Quebec, Canada, H3A 2B4. xchen@cs.washington.edu
ABSTRACT
Background: In vertebrates, a large part of gene transcriptional regulation is carried out by cis-regulatory modules. These modules are believed to regulate much of the tissue-specificity of gene expression. Result: We develop a Bayesian network approach for identifying cis-regulatory modules likely to regulate tissue-specific expression. The network integrates predicted transcription factor binding site information, transcription factor expression data, and target gene expression data. At its core is a regression tree modeling the effect of combinations of transcription factors bound to a module. A new unsupervised EM-like algorithm is developed to learn the parameters of the network, including the regression tree structure. Conclusion: Our approach is shown to accurately identify known human liver and erythroid-specific modules. When applied to the prediction of tissue-specific modules in 10 different tissues, the network predicts a number of important transcription factor combinations whose concerted binding is associated with specific expression.
Mentions: How should we represent the probability that a module is regulatorily active, given the set of transcription factors bound to it, i.e. Pr[R | B1, ..., Bn], where n is the number of transcription factors considered? Given that all variables are boolean, this conditional probability can be represented by a 2^n × 2 conditional probability table (CPT) containing 2^n free parameters. In our application, where the set of transcription factors contains several hundred members, this is not practical, because (1) the CPT would be too large to store, and (2) we would need a huge amount of training data to learn the parameters. We thus use a more compact representation for this CPT, based on regression trees [17]. A regression tree is a rooted tree whose internal nodes are labeled with tests on the value of some variable Bf. See Figure 2 for a small example. For boolean variables (our case here), each node N tests whether some variable takes the value true or false. Each leaf l of the tree is associated with a probability distribution Pr[R | l] and corresponds to the set of variable assignments obtained by following the path from the root to l. Let l(b1, ..., bn) be the leaf reached when B1 = b1, ..., Bn = bn. Then, the regression tree defines a complete conditional probability distribution: Pr[R | B1 = b1, ..., Bn = bn] = p(l(b1, ..., bn)), where p(l) denotes the distribution Pr[R | l] stored at leaf l. When many of the Bi's are irrelevant to R, the representation is much more compact than the standard CPT and can be estimated from less data. We will jointly refer to the tree topology, the node labelings, and the probability distributions at the leaves as the meta-parameter Ψ. Inferring Ψ will be the most significant difficulty of this approach.
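To make the compactness argument concrete, the following is a minimal Python sketch of a regression tree used as a compressed CPT over boolean binding variables. It is an illustration under stated assumptions, not the authors' implementation: the class names `Node` and `Leaf`, the helper functions, the factor names `TF_A` and `TF_B`, and all probability values are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    # Pr[R = true | l] for this leaf (illustrative value, not from the paper)
    p_active: float


@dataclass
class Node:
    # Name of the boolean binding variable Bf tested at this internal node
    factor: str
    if_true: "Tree"
    if_false: "Tree"


Tree = Union[Leaf, Node]


def prob_active(tree: Tree, bound: Dict[str, bool]) -> float:
    """Follow the path selected by the assignment and return Pr[R = true | leaf].

    Assumption of this sketch: factors missing from `bound` count as unbound.
    """
    while isinstance(tree, Node):
        tree = tree.if_true if bound.get(tree.factor, False) else tree.if_false
    return tree.p_active


def count_leaves(tree: Tree) -> int:
    """Number of leaf distributions, i.e. parameters the tree must store."""
    if isinstance(tree, Leaf):
        return 1
    return count_leaves(tree.if_true) + count_leaves(tree.if_false)


# Hypothetical tree: only TF_A and TF_B matter; every other factor is irrelevant.
tree = Node(
    factor="TF_A",
    if_true=Node(factor="TF_B", if_true=Leaf(0.9), if_false=Leaf(0.4)),
    if_false=Leaf(0.05),
)

n_factors = 300  # "several hundred" factors, as an illustrative figure
full_cpt_rows = 2 ** n_factors  # rows in the explicit 2^n x 2 table
tree_parameters = count_leaves(tree)  # 3 leaf distributions in this example
```

In this sketch, an assignment in which TF_A is bound and TF_B is not reaches the leaf holding 0.4, regardless of the values of the remaining factors. The tree stores three leaf probabilities, whereas the explicit table would need one row per joint assignment of all 300 variables.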