A semi-supervised method for predicting transcription factor-gene interactions in Escherichia coli.

Ernst J, Beg QK, Kay KA, Balázsi G, Oltvai ZN, Bar-Joseph Z - PLoS Comput. Biol. (2008)

Bottom Line: To further demonstrate the utility of our inferred interactions, we generated a new microarray gene expression dataset for the aerobic to anaerobic shift response in E. coli. We used our inferred interactions with the verified interactions to reconstruct a dynamic regulatory network for this response. The network reconstructed when using our inferred interactions was better able to correctly identify known regulators and suggested additional activators and repressors as having important roles during the aerobic-anaerobic shift interface.


Affiliation: Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America.

ABSTRACT
While Escherichia coli has one of the most comprehensive datasets of experimentally verified transcriptional regulatory interactions of any organism, it is still far from complete. This presents a problem when trying to combine gene expression and regulatory interactions to model transcriptional regulatory networks. Using the available regulatory interactions to predict new interactions may lead to better coverage and more accurate models. Here, we develop SEREND (SEmi-supervised REgulatory Network Discoverer), a semi-supervised learning method that uses a curated database of verified transcription factor-gene interactions, DNA sequence binding motifs, and a compendium of gene expression data to make thousands of new predictions about transcription factor-gene interactions, including whether the transcription factor activates or represses the gene. Using genome-wide binding datasets for several transcription factors, we demonstrate that our semi-supervised classification strategy improves the prediction of targets for a given transcription factor. To further demonstrate the utility of our inferred interactions, we generated a new microarray gene expression dataset for the aerobic to anaerobic shift response in E. coli. We used our inferred interactions with the verified interactions to reconstruct a dynamic regulatory network for this response. The network reconstructed when using our inferred interactions was better able to correctly identify known regulators and suggested additional activators and repressors as having important roles during the aerobic-anaerobic shift interface.


Figure 1. Method overview. SEREND takes as input a compendium of expression data [14], a curated set of E. coli TF–gene interactions with direct evidence [1], and scores for TF–gene motif association based on the PWMs present in RegulonDB [2]. SEREND uses a logistic regression ensemble-based classification method in which all non-confirmed targets are initially treated as unregulated by the TF. SEREND then relaxes this assumption using a self-training method. We evaluated the ranked predictions of SEREND using published ChIP-chip data, and by combining SEREND's predictions with a new set of time series gene expression data on the aerobic-anaerobic shift response in E. coli.
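
The self-training idea in the method overview can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single logistic regression in place of the ensemble, a hypothetical relabeling threshold of 0.5, a fixed number of rounds, and made-up two-feature inputs (a motif score and an expression score per gene). All function names are illustrative.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, epochs=300, lr=0.5):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(model, x):
    """Predicted probability that gene x is a target of the TF."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def self_train(X, confirmed, rounds=3, threshold=0.5):
    """Generic self-training loop (threshold and rounds are assumptions).

    Every non-confirmed gene starts as a negative. After each fit, genes
    scoring at or above the threshold are relabeled as positives, and the
    classifier is refit. Confirmed targets always stay positive.
    """
    labels = [1 if i in confirmed else 0 for i in range(len(X))]
    for _ in range(rounds):
        model = fit_logistic(X, labels)
        relabeled = [1 if i in confirmed or predict(model, x) >= threshold else 0
                     for i, x in enumerate(X)]
        if relabeled == labels:
            break
        labels = relabeled
    model = fit_logistic(X, labels)
    return labels, [predict(model, x) for x in X]


# Hypothetical features per gene for one TF: (motif score, expression score).
X = [(1.0, 0.8), (0.8, 1.0), (0.7, 0.7),        # confirmed targets
     (0.9, 0.7), (0.7, 0.9),                    # unconfirmed but target-like
     (-1.0, -0.8), (-0.8, -1.0), (-0.9, -0.9),  # unlikely targets
     (-0.7, -0.9), (-0.9, -0.7), (-1.0, -1.0)]
labels, scores = self_train(X, confirmed={0, 1, 2})
ranking = sorted(range(len(X)), key=lambda i: -scores[i])
```

In this toy setting the two unconfirmed but target-like genes rank above the unlikely targets even though they start out labeled as negatives, which is the kind of ranked prediction the caption describes evaluating against ChIP-chip data.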

Mentions: In this paper we introduce a new method, SEREND (SEmi-supervised REgulatory Network Discoverer), to predict TF-gene regulatory interactions in E. coli (Figure 1). SEREND is an iterative semi-supervised computational prediction method that takes advantage of known regulatory interactions in E. coli and extends them by leveraging TF sequence binding affinities and a compendium of expression data. Similar to other methods [4]–[6], SEREND does not assume that a TF is necessarily transcriptionally regulated. Instead, SEREND uses expression data in the context of known or predicted TF-gene interactions. However, these previous methods assume a fixed set of TF-gene interactions, while the purpose of SEREND is to predict additional TF-gene interactions. These predictions can later be used as input to these other methods, as we demonstrate for one method on a new expression dataset. Other methods have performed iterative analysis, as SEREND does here [18],[19]. However, unlike SEREND, which focuses on classification, the goal of these prior methods was clustering or gene set module identification, leading to different treatment of the features used and different meanings for the resulting sets. Another method [20] used curated interactions and expression data along with Gene Ontology (GO) and phylogenetic similarity to predict additional gene targets, but did not use an iterative or semi-supervised approach or motif information as we do here. We chose not to use GO annotations in generating predictions, which gives us the advantage of being able to use GO for an unbiased assessment of the functional role of predicted targets.
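
As a rough illustration of how a TF's sequence binding affinity can be turned into a per-gene feature, the sketch below scores a promoter sequence against a position weight matrix (PWM). The text does not specify how the motif scores are computed, so everything here is an assumption: the log-odds formulation, the uniform background, the pseudocount, forward-strand-only scanning, and the made-up count matrix.

```python
import math

BASES = "ACGT"


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores.

    counts: list of dicts mapping each base to its count at that position.
    A uniform background and a fixed pseudocount are assumed.
    """
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({b: math.log2(((column[b] + pseudocount) / total) / background)
                    for b in BASES})
    return pwm


def best_site_score(pwm, sequence):
    """Highest log-odds score over all windows of the sequence (forward strand only)."""
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(sum(pwm[i][sequence[start + i]] for i in range(width))
               for start in range(len(sequence) - width + 1))


# Hypothetical count matrix for a 4-bp motif that strongly prefers TGCA.
counts = [
    {"A": 0, "C": 0, "G": 0, "T": 8},
    {"A": 0, "C": 0, "G": 8, "T": 0},
    {"A": 0, "C": 8, "G": 0, "T": 0},
    {"A": 8, "C": 0, "G": 0, "T": 0},
]
pwm = log_odds_pwm(counts)
with_site = best_site_score(pwm, "AATGCAAA")     # contains the preferred site
without_site = best_site_score(pwm, "AAAAAAAA")  # does not
```

A promoter containing the preferred site scores higher than one without it, and a score of this kind could serve as the motif feature for a TF-gene pair alongside the expression data.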

