Predicting combinatorial binding of transcription factors to regulatory elements in the human genome by association rule mining.

Morgan XC, Ni S, Miranker DP, Iyer VR - BMC Bioinformatics (2007)

Bottom Line: Known true positive motif pairs showed higher association rule support, confidence, and significance than background. Our subsets of high-confidence, high-significance mined pairs of transcription factors showed enrichment for co-citation in PubMed abstracts relative to all pairs, and the predicted associations were often readily verifiable in the literature. Functional elements in the genome where transcription factors bind to regulate expression in a combinatorial manner are more likely to be predicted by identifying statistically and biologically significant combinations of transcription factor binding motifs than by simply scanning the genome for the occurrence of binding sites for a single transcription factor.


Affiliation: Institute for Cellular and Molecular Biology and Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, Texas 78712-0159, USA. morganx@mail.utexas.edu

ABSTRACT

Background: Cis-acting transcriptional regulatory elements in mammalian genomes typically contain specific combinations of binding sites for various transcription factors. Although some cis-regulatory elements have been well studied, the combinations of transcription factors that regulate normal expression levels for the vast majority of the 20,000 genes in the human genome are unknown. We hypothesized that it should be possible to discover transcription factor combinations that regulate gene expression in concert by identifying over-represented combinations of sequence motifs that occur together in the genome. In order to detect combinations of transcription factor binding motifs, we developed a data mining approach based on the use of association rules, which are typically used in market basket analysis. We scored each segment of the genome for the presence or absence of each of 83 transcription factor binding motifs, then used association rule mining algorithms to mine this dataset, thus identifying frequently occurring pairs of distinct motifs within a segment.

Results: Support for most pairs of transcription factor binding motifs was highly correlated across different chromosomes although pair significance varied. Known true positive motif pairs showed higher association rule support, confidence, and significance than background. Our subsets of high-confidence, high-significance mined pairs of transcription factors showed enrichment for co-citation in PubMed abstracts relative to all pairs, and the predicted associations were often readily verifiable in the literature.

Conclusion: Functional elements in the genome where transcription factors bind to regulate expression in a combinatorial manner are more likely to be predicted by identifying statistically and biologically significant combinations of transcription factor binding motifs than by simply scanning the genome for the occurrence of binding sites for a single transcription factor.



Figure 2: Mining without overlap. In order to avoid enriching primarily for TF pairs that bind similar motifs, the genome is mined once for each pair AB. All overlapping motifs between A and B are removed before calculating support, confidence, and P-values, then restored upon subsequent iterations.
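The per-pair procedure in the caption can be illustrated with a minimal sketch. It assumes each binding site is stored as a hypothetical `(segment_id, start, end)` tuple with a half-open interval, and that two sites "overlap" when they share at least one base in the same segment; the paper does not specify its data layout or overlap criterion, so both are assumptions. Overlapping sites are filtered into fresh lists for each pair, so the original site lists are left intact for the next pair, which plays the role of the "restore" step.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Assumed representation: (segment_id, start, end), half-open [start, end).
Site = Tuple[int, int, int]


def overlaps(a: Site, b: Site) -> bool:
    """True when two sites are in the same segment and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def remove_overlaps(
    sites_a: List[Site], sites_b: List[Site]
) -> Tuple[List[Site], List[Site]]:
    """Return copies of both lists with every site that overlaps the other motif dropped."""
    kept_a = [a for a in sites_a if not any(overlaps(a, b) for b in sites_b)]
    kept_b = [b for b in sites_b if not any(overlaps(b, a) for a in sites_a)]
    return kept_a, kept_b


def mine_pairs_without_overlap(
    sites: Dict[str, List[Site]]
) -> Dict[Tuple[str, str], Tuple[Set[int], Set[int]]]:
    """For each motif pair, return the segments still containing each motif
    after A/B-overlapping sites are removed. The input is never mutated."""
    result = {}
    for a, b in combinations(sorted(sites), 2):
        kept_a, kept_b = remove_overlaps(sites[a], sites[b])
        result[(a, b)] = ({s[0] for s in kept_a}, {s[0] for s in kept_b})
    return result


# Toy data: A and B overlap in segment 0 but occur separately in segment 1.
toy_sites = {
    "A": [(0, 10, 16), (1, 5, 11)],
    "B": [(0, 10, 17), (1, 40, 47)],
    "C": [(0, 12, 18)],
}
per_pair = mine_pairs_without_overlap(toy_sites)
```

The pairwise comparison here is quadratic in the number of sites, which is adequate for illustration; a genome-scale run would likely sort sites by position or use an interval index instead.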

Mentions: Straightforward association rule mining that simultaneously considers all motif positions discovers large numbers of transcription factor pairs that bind identical or highly similar motifs. For example, two different transcription factors A and B may both bind to the motif "CACGTG", so the confidence C of the rule A ⇒ B will be 100%. Similarly, if A binds to "CACGTG" and B binds to "CACGTGA", the high overlap between the binding motifs will make the confidence very high, yet the rule is neither interesting nor surprising, although it may still be biologically valid. To avoid discovering enriched overlapping motifs, for each transcription factor pair AB, all overlapping binding sites between A and B were removed before calculating support and confidence (Figure 2). We also calculated a P-value based on the hypergeometric probability of observing the association between A and B by chance. We ensured that associations between transcription factor motifs were not an artifact caused by the presence of repetitive DNA by considering repeat-masked regions separately (Methods and Additional file 2). Furthermore, to rule out the possibility that associations were generated by nucleotide bias, we ascertained that dinucleotide and trinucleotide frequencies of segments containing motif pairs were not significantly different from those of segments containing one member of the pair or from background (data not shown).
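As a rough illustration of the three quantities mentioned above, the sketch below computes support, confidence, and a hypergeometric P-value for one rule A ⇒ B from sets of segment identifiers. The definitions are the standard association-rule ones (support as the fraction of segments containing both motifs, confidence as the fraction of A-containing segments that also contain B). The P-value is taken as the upper tail of the hypergeometric distribution, which is an assumption: the text states only that the P-value is based on the hypergeometric probability. The function names are hypothetical.

```python
from math import comb
from typing import Set, Tuple


def hypergeom_upper_tail(k: int, n_total: int, n_a: int, n_b: int) -> float:
    """P(X >= k) when n_b segments are drawn from n_total, of which n_a contain A.

    X counts the drawn segments that contain A. Exact integer arithmetic is
    used for the binomial coefficients; only the final ratio is a float.
    """
    total = comb(n_total, n_b)
    tail = sum(
        comb(n_a, i) * comb(n_total - n_a, n_b - i)
        for i in range(k, min(n_a, n_b) + 1)
    )
    return tail / total


def rule_stats(
    seg_a: Set[int], seg_b: Set[int], n_segments: int
) -> Tuple[float, float, float]:
    """Return (support, confidence, p_value) for the rule A => B."""
    both = len(seg_a & seg_b)
    support = both / n_segments
    confidence = both / len(seg_a) if seg_a else 0.0
    p_value = hypergeom_upper_tail(both, n_segments, len(seg_a), len(seg_b))
    return support, confidence, p_value


# Toy example: 10 segments, A in 4 of them, B in 3, both in 2.
support, confidence, p_value = rule_stats({0, 1, 2, 3}, {2, 3, 4}, 10)
```

In this toy case the rule has support 0.2 and confidence 0.5, and the chance of seeing two or more shared segments is 40/120, so the association would not be considered significant.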
