A predictive model of the oxygen and heme regulatory network in yeast.

Kundaje A, Xin X, Lan C, Lianoglou S, Zhou M, Zhang L, Leslie C - PLoS Comput. Biol. (2008)

Bottom Line: We used a novel margin-based score to extract significant condition-specific regulators and assemble a global map of the oxygen sensing and regulatory network. In each case, deletion of the candidate regulator resulted in the predicted effect on promoter activity, confirming that several novel regulators identified by MEDUSA are indeed involved in oxygen regulation. Supplemental data are included.


Affiliation: Department of Computer Science, Columbia University, New York, New York, United States of America.

ABSTRACT
Deciphering gene regulatory mechanisms through the analysis of high-throughput expression data is a challenging computational problem. Previous computational studies have used large expression datasets to resolve fine patterns of coexpression, producing clusters or modules of potentially coregulated genes. These methods typically examine promoter sequence information, such as DNA motifs or transcription factor occupancy data, in a separate step after clustering. We needed an alternative and more integrative approach to study the oxygen regulatory network in Saccharomyces cerevisiae using a small dataset of perturbation experiments. Mechanisms of oxygen sensing and regulation underlie many physiological and pathological processes, and only a handful of oxygen regulators have been identified in previous studies. We used a new machine learning algorithm called MEDUSA to uncover detailed information about the oxygen regulatory network using genome-wide expression changes in response to perturbations in the levels of oxygen, heme, Hap1, and Co2+. MEDUSA integrates mRNA expression, promoter sequence, and ChIP-chip occupancy data to learn a model that accurately predicts the differential expression of target genes in held-out data. We used a novel margin-based score to extract significant condition-specific regulators and assemble a global map of the oxygen sensing and regulatory network. This network includes both known oxygen and heme regulators, such as Hap1, Mga2, Hap4, and Upc2, as well as many new candidate regulators. MEDUSA also identified many DNA motifs that are consistent with previously experimentally identified transcription factor binding sites. Because MEDUSA's regulatory program associates regulators with target genes through their promoter sequences, we directly tested the predicted regulators for OLE1, a gene specifically induced under hypoxia, by experimental analysis of the activity of its promoter. In each case, deletion of the candidate regulator resulted in the predicted effect on promoter activity, confirming that several novel regulators identified by MEDUSA are indeed involved in oxygen regulation. MEDUSA can reveal important information from a small dataset and generate testable hypotheses for further experimental analysis. Supplemental data are included.


Figure 2: A schematic flow chart showing the algorithmic steps for learning the oxygen regulatory program with MEDUSA.
(A) The mRNA expression data is discretized into three states, up (over-expressed), down (under-expressed), and baseline (not significantly differentially expressed), and genes are partitioned into potential regulators (transcription factors and signal transducers) and targets. The regulators are also included in the list of target genes so that their transcriptional regulation can be modeled.
(B) The MEDUSA learning algorithm is presented with the promoter sequences of target genes, the discretized expression profiles of the regulators across multiple conditions, and the differentially expressed (up and down) target gene examples from these experiments. Baseline examples are not used to train MEDUSA. In the first stage of training, MEDUSA considers rules based on promoter sequence data and regulator expression states. MEDUSA uses a boosting strategy to avoid overfitting over many rounds of the algorithm. At each iteration i, a motif/regulator rule is chosen based on the current weights on the training examples; this rule predicts that targets whose promoters contain the motif will go up (or down) in experiments where the regulator is over- (or under-) expressed. Before the next iteration, the examples are reweighted to emphasize the ones that are difficult to predict.
(C) To learn the sequence motif, the algorithm agglomerates predictive k-mer sequences to produce candidate PSSMs, and it optimizes both the choice of PSSM and the probabilistic threshold used to determine where the hits of the motif occur.
(D) At the end of each round of training, motif/regulator rules are placed into an alternating decision tree, building a global regulatory program. This regulatory program can be used to predict target gene up/down regulation for gene-experiment examples that were not seen in training. To produce a more stable decision tree, we perform a second pass of the tree-learning algorithm using a stabilized variant of boosting that gives more consistent models over different subsets of the training data. At this stage, both the motifs learned previously by MEDUSA and TF occupancies from ChIP-chip experiments are used as sequence features for the final regulatory program.
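The discretization in panel (A) and a single boosting round from panel (B) can be illustrated with a small Python sketch. This is not the MEDUSA implementation: it substitutes a generic AdaBoost-style update with abstaining rules, and the cutoff `threshold=1.0`, the smoothing term `eps`, and the names `motif_1`, `motif_2`, and `REG_A` are all hypothetical. Motifs are treated as given presence/absence features, so the PSSM learning in panel (C) is not covered.

```python
import math

def discretize(log_ratio, threshold=1.0):
    """Map a log expression ratio to +1 (up), -1 (down), or 0 (baseline).
    The threshold is a hypothetical cutoff, not a value from the paper."""
    if log_ratio >= threshold:
        return 1
    if log_ratio <= -threshold:
        return -1
    return 0

def rule_fires(example, rule):
    """A rule (motif, regulator, state) fires when the target promoter
    contains the motif and the regulator is in that state in the experiment."""
    motif, regulator, state = rule
    return (motif in example["motifs"]
            and example["regulators"].get(regulator, 0) == state)

def boost(examples, candidate_rules, rounds=1, eps=1e-6):
    """AdaBoost-style rounds with rules that abstain when they do not fire."""
    n = len(examples)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        for rule in candidate_rules:
            w_up = w_down = w_abstain = 0.0
            for w, ex in zip(weights, examples):
                if not rule_fires(ex, rule):
                    w_abstain += w
                elif ex["label"] == 1:
                    w_up += w
                else:
                    w_down += w
            # Boosting loss for a rule that predicts only where it fires.
            loss = w_abstain + 2.0 * math.sqrt(w_up * w_down)
            if best is None or loss < best[0]:
                best = (loss, rule, w_up, w_down)
        _, rule, w_up, w_down = best
        # Positive alpha predicts "up" where the rule fires, negative "down".
        alpha = 0.5 * math.log((w_up + eps) / (w_down + eps))
        model.append((rule, alpha))
        # Reweight: examples the rule gets right lose weight, so the
        # remaining hard examples gain relative weight after normalization.
        weights = [w * math.exp(-ex["label"] * alpha) if rule_fires(ex, rule) else w
                   for w, ex in zip(weights, examples)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model, weights

# Toy gene-experiment examples (baseline examples are left out of training).
toy_examples = [
    {"motifs": {"motif_1"}, "regulators": {"REG_A": 1}, "label": 1},
    {"motifs": {"motif_1"}, "regulators": {"REG_A": 1}, "label": 1},
    {"motifs": {"motif_2"}, "regulators": {"REG_A": 1}, "label": -1},
    {"motifs": {"motif_1"}, "regulators": {"REG_A": -1}, "label": -1},
]
toy_rules = [("motif_1", "REG_A", 1), ("motif_2", "REG_A", 1), ("motif_1", "REG_A", -1)]
model, weights = boost(toy_examples, toy_rules)
```

In this toy run the first rule covers the two "up" examples without error, so it is selected, and the two examples it does not explain carry most of the weight going into the next round.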

Figure 2 illustrates the major steps and data used in the MEDUSA learning algorithm. In preprocessing, mRNA expression data is discretized by binning expression values into three states (up, down, and baseline) and partitioning genes into regulators and targets (Figure 2A). In the first stage of training (Figure 2B and 2C), MEDUSA uses the promoter sequences of target genes and the mRNA levels of regulators as inputs to learn a prediction function for the differential expression of targets. MEDUSA uses boosting to iteratively discover motifs whose presence in the promoters of target genes, together with the mRNA levels of regulators across experimental conditions, helps to predict the differential expression of the targets in those conditions. It builds a global regulatory program based on these motifs and regulators (Figure 2D). To produce a regulatory program that is more consistent under variations of the training data, a second pass of the regulatory program building algorithm is performed using a stabilized variant of boosting (see Methods). This second pass integrates the motifs discovered in the first training stage, promoter occupancy data from ChIP-chip analysis, and expression data of regulators and targets, to build a final regulatory program that predicts the up or down regulation of target genes. The regulatory program asks questions such as, “Is the mRNA level of regulator ρ up (or down) in the experiment, and is the motif μ present in the upstream region of the gene (or is a transcriptional regulator bound to the promoter, when ChIP-chip data are available)?” The control logic of the regulatory program is described by an alternating decision tree (Figure 2D and Figure 3), which encodes how the overall up or down prediction score for a target gene in a given experimental condition results from the contribution and interaction of multiple regulators and motifs.
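To make the scoring logic of an alternating decision tree concrete, the following Python sketch sums the contributions of every rule whose question is answered "yes" for a given gene-experiment pair, then takes the sign of the total as the up/down call. The tree layout, node weights, and the names `motif_1`, `REG_A`, and so on are hypothetical, and treating a rule that does not fire as contributing nothing is an assumption of this sketch rather than a detail confirmed by the text.

```python
def adt_score(node, example):
    """Sum prediction values over all prediction nodes reached by the example.
    Rules nested under a node are only consulted if that node is reached."""
    total = node["value"]
    for splitter in node.get("splitters", []):
        motif, regulator, state = splitter["rule"]
        fires = (motif in example["motifs"]
                 and example["regulators"].get(regulator, 0) == state)
        if fires:
            total += adt_score(splitter["yes"], example)
    return total

def predict(node, example):
    """Call 'up' for a positive score and 'down' otherwise."""
    return "up" if adt_score(node, example) > 0 else "down"

# Hypothetical regulatory program: two top-level rules, one nested rule.
tree = {
    "value": 0.0,
    "splitters": [
        {"rule": ("motif_1", "REG_A", 1),
         "yes": {"value": 0.75,
                 "splitters": [
                     {"rule": ("motif_2", "REG_B", -1),
                      "yes": {"value": -0.25}},
                 ]}},
        {"rule": ("motif_3", "REG_C", 1),
         "yes": {"value": -0.5}},
    ],
}

# One target gene in one experiment: motifs in its promoter and the
# discretized states of the regulators in that experiment.
gene_in_experiment = {
    "motifs": {"motif_1", "motif_2"},
    "regulators": {"REG_A": 1, "REG_B": -1},
}
score = adt_score(tree, gene_in_experiment)   # 0.0 + 0.75 - 0.25
call = predict(tree, gene_in_experiment)
```

Because the nested rule is evaluated only when its parent rule fires, the tree can express interactions between regulators as well as their independent contributions.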

