TREMOR--a tool for retrieving transcriptional modules by incorporating motif covariance.

Singh LN, Wang LS, Hannenhalli S - Nucleic Acids Res. (2007)

Bottom Line: Since TFs belong to evolutionarily and structurally related families, TF family members often bind to similar DNA motifs and can confound sequence-based approaches to TM identification. A previous approach to TM detection addresses this issue by pre-selecting a single representative from each TF family. This method uses the Mahalanobis distance to assess the validity of a TM and automatically incorporates the inter-TF binding similarity without resorting to pre-selecting family representatives.


Affiliation: Penn Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104, USA.

ABSTRACT
A transcriptional module (TM) is a collection of transcription factors (TFs) that, as a group, co-regulate multiple functionally related genes. Identifying TMs is an important biological challenge. Since TFs belong to evolutionarily and structurally related families, TF family members often bind to similar DNA motifs and can confound sequence-based approaches to TM identification. A previous approach to TM detection addresses this issue by pre-selecting a single representative from each TF family. One problem with this approach is that closely related transcription factors can still target sufficiently distinct genes in a biologically meaningful way, and thus pre-selecting a single family representative may in principle miss certain TMs. Here we report a method, TREMOR (Transcriptional Regulatory Module Retriever). This method uses the Mahalanobis distance to assess the validity of a TM and automatically incorporates the inter-TF binding similarity without resorting to pre-selecting family representatives. The application of TREMOR to human muscle-specific, liver-specific and cell-cycle-related genes reveals TFs and TMs that were validated from the literature and also reveals additional related genes.
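To illustrate why a covariance-aware distance helps when family members bind similar motifs, the sketch below compares the Euclidean and Mahalanobis distances on a toy two-motif example. It is not TREMOR's implementation: the background mean, the covariance matrix, and the function names `mahalanobis` and `euclidean` are assumptions chosen for illustration only.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of score vector x from mean under covariance cov."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    # The pseudo-inverse tolerates near-singular covariance, which can arise
    # when two motifs are almost identical.
    inv_cov = np.linalg.pinv(np.asarray(cov, dtype=float))
    return float(np.sqrt(diff @ inv_cov @ diff))

def euclidean(x, mean):
    """Plain Euclidean distance, which ignores correlation between motifs."""
    return float(np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)))

# Hypothetical background: scores of two motifs from the same TF family,
# strongly correlated (0.9) because the motifs resemble each other.
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])

along = np.array([2.0, 2.0])     # both scores rise together (expected under correlation)
against = np.array([2.0, -2.0])  # scores disagree (unexpected under correlation)

d_e_along, d_e_against = euclidean(along, mean), euclidean(against, mean)
d_m_along, d_m_against = mahalanobis(along, mean, cov), mahalanobis(against, mean, cov)

print(f"Euclidean:   along={d_e_along:.3f}  against={d_e_against:.3f}")
print(f"Mahalanobis: along={d_m_along:.3f}  against={d_m_against:.3f}")
```

The Euclidean distance treats both points as equally far from the background, whereas the Mahalanobis distance discounts the deviation that is explained by motif similarity and emphasizes the one that is not.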


Figure 3: Distributions (probability density functions) of expression coherence (at 95th percentile threshold; see Methods section) in cell cycle data. The plot on the left (green), based on random gene sets, provides a baseline. The plot on the right (black) is for the top 100 target genes identified by each of the 158 significant cell cycle TMs of size 4. The plot in the middle (blue) is based on the target genes identified by the TMs using Euclidean distance.

Mentions: Table 1 summarizes the results, highlighting in particular the known key regulators of the cell cycle: E2F, CREB and NF-Y (26). Some of the differences between TREMOR and other programs may reflect differences in the way each program scores promoter sequences with a PWM. Unlike for single TFs, and to some extent for TF pairs, experimental data for higher order TMs are not available. To validate these higher order TMs, we took the approach in Ref. (24) as follows. For each TM-4, we scored each of the 7452 human gene promoters in the cell cycle dataset using the Mahalanobis vector distance and selected the genes with the top 100 scores. We then computed the ‘expression coherence’ (EC) (see Methods section) of these genes in human cell cycle data (21). EC is a previously described measure that quantifies the similarity in expression profiles among the genes in a set (19). Since there were only 23 genes in common between these 100 genes and the 232 genes used for TM identification, the proposed scheme can be considered an independent validation. Results do not change when we remove the overlap. As a control, we randomly sampled 100 genes and computed their EC. Figure 3 shows the distribution of coherence scores for the top 100 scoring genes for each of the 158 significant TM-4s computed using Mahalanobis distance. We have also presented other distributions of coherence scores as controls. As a baseline expectation, we show the distribution of coherence scores for 100 000 random sets of 100 genes. Using a 95th percentile threshold from the baseline distribution (nominal false-positive rate of 5%), 100% of the Mahalanobis distance-based coherence scores are significant. To highlight the benefits of using Mahalanobis distance, we have also implemented the Euclidean distance measure. We describe this analysis further in the simulation studies section.
The figure also shows the coherence score distribution for the top 100 scoring genes for each of the 77 significant TM-3s computed using Euclidean distance (the algorithm using Euclidean distance did not achieve any improvement beyond size-3 TMs). The figure indicates that the genes detected by significant TMs using Mahalanobis distance have a significantly greater EC in the cell cycle data than the controls.
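The validation logic described above (score a gene set, build a baseline from random gene sets, apply a 95th percentile cutoff) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code. The EC definition used here, the fraction of gene pairs whose Pearson correlation is at least `min_corr`, is a simplifying assumption; the measure actually used is the one defined in the Methods section and Ref. (19). The names `expression_coherence` and `baseline_threshold`, and all data sizes, are hypothetical.

```python
import numpy as np

def expression_coherence(profiles, min_corr=0.8):
    """Simplified, assumed EC: fraction of gene pairs with Pearson r >= min_corr."""
    corr = np.corrcoef(profiles)                 # rows are genes, columns are conditions
    upper = np.triu_indices(corr.shape[0], k=1)  # each unordered pair once
    return float(np.mean(corr[upper] >= min_corr))

def baseline_threshold(expr, set_size, n_sets, rng, percentile=95.0):
    """EC value at the given percentile over randomly sampled gene sets."""
    scores = [
        expression_coherence(expr[rng.choice(expr.shape[0], set_size, replace=False)])
        for _ in range(n_sets)
    ]
    return float(np.percentile(scores, percentile))

# Synthetic stand-in for an expression matrix (the source uses 7452 promoters,
# sets of 100 genes, and 100 000 random sets; sizes here are kept small).
rng = np.random.default_rng(0)
n_genes, n_timepoints = 500, 12
expr = rng.normal(size=(n_genes, n_timepoints))

# Plant a co-expressed group: 20 genes sharing one profile plus small noise.
shared = rng.normal(size=n_timepoints)
expr[:20] = shared + 0.1 * rng.normal(size=(20, n_timepoints))

threshold = baseline_threshold(expr, set_size=20, n_sets=200, rng=rng)
module_ec = expression_coherence(expr[:20])
is_significant = module_ec > threshold

print(f"baseline 95th percentile EC = {threshold:.4f}")
print(f"planted set EC = {module_ec:.4f}, significant = {is_significant}")
```

Applying the same threshold to every candidate gene set gives the fraction of sets exceeding the baseline, which is the kind of summary reported for the TM-4 target sets.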

