Limits...
iRegulon: from a gene list to a gene regulatory network using large motif and track collections.

Janky R, Verfaillie A, Imrichová H, Van de Sande B, Standaert L, Christiaens V, Hulselmans G, Herten K, Naval Sanchez M, Potier D, Svetlichnyy D, Kalender Atak Z, Fiers M, Marine JC, Aerts S - PLoS Comput. Biol. (2014)

Bottom Line: Identifying master regulators of biological processes and mapping their downstream gene networks are key challenges in systems biology.Next, we use iRegulon on more challenging types of gene lists, including microRNA target sets, protein-protein interaction networks, and genetic perturbation data.Similarly we map a repressive network with no indication of direct p53 regulation but rather an indirect effect via E2F and NFY.

View Article: PubMed Central - PubMed

Affiliation: Laboratory of Computational Biology, KU Leuven Center for Human Genetics, Leuven, Belgium.

ABSTRACT
Identifying master regulators of biological processes and mapping their downstream gene networks are key challenges in systems biology. We developed a computational method, called iRegulon, to reverse-engineer the transcriptional regulatory network underlying a co-expressed gene set using cis-regulatory sequence analysis. iRegulon implements a genome-wide ranking-and-recovery approach to detect enriched transcription factor motifs and their optimal sets of direct targets. We increase the accuracy of network inference by using very large motif collections of up to ten thousand position weight matrices collected from various species, and linking these to candidate human TFs via a motif2TF procedure. We validate iRegulon on gene sets derived from ENCODE ChIP-seq data with increasing levels of noise, and we compare iRegulon with existing motif discovery methods. Next, we use iRegulon on more challenging types of gene lists, including microRNA target sets, protein-protein interaction networks, and genetic perturbation data. In particular, we over-activate p53 in breast cancer cells, followed by RNA-seq and ChIP-seq, and could identify an extensive up-regulated network controlled directly by p53. Similarly we map a repressive network with no indication of direct p53 regulation but rather an indirect effect via E2F and NFY. Finally, we generalize our computational framework to include regulatory tracks such as ChIP-seq data and show how motif and track discovery can be combined to map functional regulatory interactions among co-expressed genes. iRegulon is available as a Cytoscape plugin from http://iregulon.aertslab.org.

Show MeSH

Related in: MedlinePlus

Regulon detection by rank-based motif discovery and motif2TF.Motif enrichment in iRegulon is measured using a ranking-and-recovery procedure using a large collection of position weight matrices (PWM). In the ranking step (A) all human genes are ranked for each motif by scoring for homotypic motif clusters across ten vertebrate species. In the recovery step (B) each of these gene rankings is tested against the set of input genes by calculating the Area Under the cumulative Recovery Curve (AUC, in pink). The example shown is for the top enriched motif, motif M2. The AUC score is normalized, based on the AUC scores of all motif rankings (distribution is shown as inset), to a normalized enrichment score (NES). A high NES score (≥3.0) indicates a motif that recovers a large proportion of the input genes within the top of its ranking. In parallel, the leading edge of the recovery curve is used to determine the optimal subset of genes that are likely controlled by this motif. In the last step (C) Motif2TF associates the candidate motif with (a number of) TFs by finding possible paths from a motif to a TF, in a motif-TF network based on direct evidence, orthology, and motif-motif similarity. The enriched TF can be from the input genes (e.g. TG5 encoding for TF2). See also Materials and Methods and Figures S1, S4.
© Copyright Policy
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC4109854&req=5

pcbi-1003731-g001: Regulon detection by rank-based motif discovery and motif2TF.Motif enrichment in iRegulon is measured using a ranking-and-recovery procedure using a large collection of position weight matrices (PWM). In the ranking step (A) all human genes are ranked for each motif by scoring for homotypic motif clusters across ten vertebrate species. In the recovery step (B) each of these gene rankings is tested against the set of input genes by calculating the Area Under the cumulative Recovery Curve (AUC, in pink). The example shown is for the top enriched motif, motif M2. The AUC score is normalized, based on the AUC scores of all motif rankings (distribution is shown as inset), to a normalized enrichment score (NES). A high NES score (≥3.0) indicates a motif that recovers a large proportion of the input genes within the top of its ranking. In parallel, the leading edge of the recovery curve is used to determine the optimal subset of genes that are likely controlled by this motif. In the last step (C) Motif2TF associates the candidate motif with (a number of) TFs by finding possible paths from a motif to a TF, in a motif-TF network based on direct evidence, orthology, and motif-motif similarity. The enriched TF can be from the input genes (e.g. TG5 encoding for TF2). See also Materials and Methods and Figures S1, S4.

Mentions: The goal of iRegulon is to enable gene regulatory network mapping directly based on motif enrichment in a co-expressed gene set. As motif discovery method we have chosen to elaborate on the recent ranking-and-recovery methods [28]–[32] (Fig. 1). In the ranking step we generate whole-genome rankings of 22284 human RefSeq genes for a library of PWMs where a PWM is a matrix representation of a regulatory motif (Table 1). For each gene, a regulatory search space (500 bp, 10 kb or 20 kb around the Transcription Start Site (TSS), see Materials and Methods) is scanned for homotypic cis-regulatory modules (CRM) using a Hidden Markov Model [33] (Fig. S1). Starting from a library with N PWMs, N ranked lists of genes are generated, each with the most likely genomic targets of a particular motif at the top of the ranking [28], [29]. Next, orthologous search spaces in ten other vertebrate genomes are determined by UCSC liftover tool [34] and are subsequently scanned with the same PWMs. The rankings for different species are combined by rank aggregation [35] into one final ranking for each PWM in our library. For the PWM libraries we have collected and reformatted most of the available libraries into a “6K collection” (N = 6383 PWMs) and a “10K collection” (N = 9713 PWMs) (Table 1). These libraries contain PWMs from different species and also include candidate PWMs for unknown TFs. The results of the ranking step are N human gene rankings stored in an SQLite database. We also generated similar databases using mouse and Drosophila as reference species, in case the input gene set is derived from mouse or fruit fly.


iRegulon: from a gene list to a gene regulatory network using large motif and track collections.

Janky R, Verfaillie A, Imrichová H, Van de Sande B, Standaert L, Christiaens V, Hulselmans G, Herten K, Naval Sanchez M, Potier D, Svetlichnyy D, Kalender Atak Z, Fiers M, Marine JC, Aerts S - PLoS Comput. Biol. (2014)

Regulon detection by rank-based motif discovery and motif2TF.Motif enrichment in iRegulon is measured using a ranking-and-recovery procedure using a large collection of position weight matrices (PWM). In the ranking step (A) all human genes are ranked for each motif by scoring for homotypic motif clusters across ten vertebrate species. In the recovery step (B) each of these gene rankings is tested against the set of input genes by calculating the Area Under the cumulative Recovery Curve (AUC, in pink). The example shown is for the top enriched motif, motif M2. The AUC score is normalized, based on the AUC scores of all motif rankings (distribution is shown as inset), to a normalized enrichment score (NES). A high NES score (≥3.0) indicates a motif that recovers a large proportion of the input genes within the top of its ranking. In parallel, the leading edge of the recovery curve is used to determine the optimal subset of genes that are likely controlled by this motif. In the last step (C) Motif2TF associates the candidate motif with (a number of) TFs by finding possible paths from a motif to a TF, in a motif-TF network based on direct evidence, orthology, and motif-motif similarity. The enriched TF can be from the input genes (e.g. TG5 encoding for TF2). See also Materials and Methods and Figures S1, S4.
© Copyright Policy
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC4109854&req=5

pcbi-1003731-g001: Regulon detection by rank-based motif discovery and motif2TF.Motif enrichment in iRegulon is measured using a ranking-and-recovery procedure using a large collection of position weight matrices (PWM). In the ranking step (A) all human genes are ranked for each motif by scoring for homotypic motif clusters across ten vertebrate species. In the recovery step (B) each of these gene rankings is tested against the set of input genes by calculating the Area Under the cumulative Recovery Curve (AUC, in pink). The example shown is for the top enriched motif, motif M2. The AUC score is normalized, based on the AUC scores of all motif rankings (distribution is shown as inset), to a normalized enrichment score (NES). A high NES score (≥3.0) indicates a motif that recovers a large proportion of the input genes within the top of its ranking. In parallel, the leading edge of the recovery curve is used to determine the optimal subset of genes that are likely controlled by this motif. In the last step (C) Motif2TF associates the candidate motif with (a number of) TFs by finding possible paths from a motif to a TF, in a motif-TF network based on direct evidence, orthology, and motif-motif similarity. The enriched TF can be from the input genes (e.g. TG5 encoding for TF2). See also Materials and Methods and Figures S1, S4.
Mentions: The goal of iRegulon is to enable gene regulatory network mapping directly based on motif enrichment in a co-expressed gene set. As motif discovery method we have chosen to elaborate on the recent ranking-and-recovery methods [28]–[32] (Fig. 1). In the ranking step we generate whole-genome rankings of 22284 human RefSeq genes for a library of PWMs where a PWM is a matrix representation of a regulatory motif (Table 1). For each gene, a regulatory search space (500 bp, 10 kb or 20 kb around the Transcription Start Site (TSS), see Materials and Methods) is scanned for homotypic cis-regulatory modules (CRM) using a Hidden Markov Model [33] (Fig. S1). Starting from a library with N PWMs, N ranked lists of genes are generated, each with the most likely genomic targets of a particular motif at the top of the ranking [28], [29]. Next, orthologous search spaces in ten other vertebrate genomes are determined by UCSC liftover tool [34] and are subsequently scanned with the same PWMs. The rankings for different species are combined by rank aggregation [35] into one final ranking for each PWM in our library. For the PWM libraries we have collected and reformatted most of the available libraries into a “6K collection” (N = 6383 PWMs) and a “10K collection” (N = 9713 PWMs) (Table 1). These libraries contain PWMs from different species and also include candidate PWMs for unknown TFs. The results of the ranking step are N human gene rankings stored in an SQLite database. We also generated similar databases using mouse and Drosophila as reference species, in case the input gene set is derived from mouse or fruit fly.

Bottom Line: Identifying master regulators of biological processes and mapping their downstream gene networks are key challenges in systems biology.Next, we use iRegulon on more challenging types of gene lists, including microRNA target sets, protein-protein interaction networks, and genetic perturbation data.Similarly we map a repressive network with no indication of direct p53 regulation but rather an indirect effect via E2F and NFY.

View Article: PubMed Central - PubMed

Affiliation: Laboratory of Computational Biology, KU Leuven Center for Human Genetics, Leuven, Belgium.

ABSTRACT
Identifying master regulators of biological processes and mapping their downstream gene networks are key challenges in systems biology. We developed a computational method, called iRegulon, to reverse-engineer the transcriptional regulatory network underlying a co-expressed gene set using cis-regulatory sequence analysis. iRegulon implements a genome-wide ranking-and-recovery approach to detect enriched transcription factor motifs and their optimal sets of direct targets. We increase the accuracy of network inference by using very large motif collections of up to ten thousand position weight matrices collected from various species, and linking these to candidate human TFs via a motif2TF procedure. We validate iRegulon on gene sets derived from ENCODE ChIP-seq data with increasing levels of noise, and we compare iRegulon with existing motif discovery methods. Next, we use iRegulon on more challenging types of gene lists, including microRNA target sets, protein-protein interaction networks, and genetic perturbation data. In particular, we over-activate p53 in breast cancer cells, followed by RNA-seq and ChIP-seq, and could identify an extensive up-regulated network controlled directly by p53. Similarly we map a repressive network with no indication of direct p53 regulation but rather an indirect effect via E2F and NFY. Finally, we generalize our computational framework to include regulatory tracks such as ChIP-seq data and show how motif and track discovery can be combined to map functional regulatory interactions among co-expressed genes. iRegulon is available as a Cytoscape plugin from http://iregulon.aertslab.org.

Show MeSH
Related in: MedlinePlus