Predicting the binding preference of transcription factors to individual DNA k-mers.

Alleyne TM, Peña-Castillo L, Badis G, Talukder S, Berger MF, Gehrke AR, Philippakis AA, Bulyk ML, Morris QD, Hughes TR - Bioinformatics (2008)

Bottom Line: Many TF-binding preferences, however, are unknown or poorly characterized, in part due to the difficulty associated with determining their specificity experimentally, and an incomplete understanding of the mechanisms governing sequence specificity. Nearest neighbour among functionally important residues emerged among the most effective methods. Our results underscore the complexity of TF-DNA recognition, and suggest a rational approach for future analyses of TF families.

Affiliation: Department of Molecular Genetics, Banting and Best Department of Medical Research, University of Toronto, Toronto, ON, Canada.

ABSTRACT

Motivation: Recognition of specific DNA sequences is a central mechanism by which transcription factors (TFs) control gene expression. Many TF-binding preferences, however, are unknown or poorly characterized, in part due to the difficulty associated with determining their specificity experimentally, and an incomplete understanding of the mechanisms governing sequence specificity. New techniques that estimate the affinity of TFs to all possible k-mers provide a new opportunity to study DNA-protein interaction mechanisms, and may facilitate inference of binding preferences for members of a given TF family when such information is available for other family members.

Results: We employed a new dataset consisting of the relative preferences of mouse homeodomains for all eight-base DNA sequences in order to ask how well we can predict the binding profiles of homeodomains when only their protein sequences are given. We evaluated a panel of standard statistical inference techniques, as well as variations of the protein features considered. Nearest neighbour among functionally important residues emerged among the most effective methods. Our results underscore the complexity of TF-DNA recognition, and suggest a rational approach for future analyses of TF families.
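The abstract does not spell out how the nearest-neighbour prediction is computed, so the following is only a minimal sketch under stated assumptions: each homeodomain is represented by a fixed-length string of functionally important residues, distance is the number of mismatched positions, and the predicted 8mer profile is copied unchanged from the closest training protein. The protein names, residue strings and Z-scores are made-up placeholders, not data from the paper.

```python
def mismatches(a, b):
    """Count positions at which two equal-length residue strings differ."""
    if len(a) != len(b):
        raise ValueError("residue strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbour_profile(query, training):
    """Return (name, distance, profile) of the closest training protein.

    training maps a protein name to (residues, profile), where profile
    maps each 8mer to a Z-score. Ties go to the first protein encountered.
    """
    best = min(training, key=lambda name: mismatches(query, training[name][0]))
    residues, profile = training[best]
    return best, mismatches(query, residues), profile


# Hypothetical toy data: 15 residues per protein, two 8mers per profile.
training = {
    "tf_a": ("RKRGRQTYTRYQTLE", {"TAATTAAA": 9.1, "ACGTACGT": 0.2}),
    "tf_b": ("RKRGRQTYTRFQTLE", {"TAATTAAA": 4.0, "ACGTACGT": 6.5}),
    "tf_c": ("KKQARQSFTVYQLLE", {"TAATTAAA": 0.5, "ACGTACGT": 1.1}),
}

name, distance, predicted = nearest_neighbour_profile("RKRGRQTYSRYQTLE", training)
print(name, distance, predicted)
```

Because the prediction is a straight copy, its quality depends entirely on how close the nearest training protein is, which is the dependence the figure below examines.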

Figure 4: Association between top-100 overlap scores for pairs of 8mer profile inference methods. Scatterplots show the top-100 overlap values for 75 homeodomains when Z-score profiles are predicted using one inference method versus another method for the same proteins. All axes range from 0 to 100. The names on the diagonal label the axes. Predictions are made using the 15 homeodomain DNA-contacting residues. Homeodomains are coloured according to whether they have ≥5 (red), 3–4 (blue) or 1–2 (green) mismatches to their nearest sequence neighbour.

Mentions: In general, the 8mer profiles that are difficult for one algorithm to predict are difficult for the other algorithms as well. Figure 4 compares the top-100 overlap for all 75 homeodomains across all prediction methods, using the 15AA feature set. The colours of the points reflect the NN distance. There is a clear relationship between the 15AA distance and the top-100 overlap: the 10 proteins with the greatest distance consistently have overlaps <50. This indicates that, for all methods, the difficulty of learning the 8mer profile for a specific experiment is related to whether there is a similar example in the training set. This trend also holds for other feature sets, and likely explains the success of NN, which does not incorporate any information from more distant profiles.
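The excerpt does not define the top-100 overlap precisely. One plausible reading, consistent with axes that run from 0 to 100, is the number of 8mers shared between the 100 highest-scoring 8mers of the measured profile and the 100 highest-scoring 8mers of the predicted profile. The sketch below assumes that reading, and also encodes the colour bins from the Figure 4 caption. The function names and the toy Z-scores are hypothetical.

```python
def top_n_overlap(measured, predicted, n=100):
    """Count 8mers present in the top n of both Z-score profiles."""
    top_measured = set(sorted(measured, key=measured.get, reverse=True)[:n])
    top_predicted = set(sorted(predicted, key=predicted.get, reverse=True)[:n])
    return len(top_measured & top_predicted)


def mismatch_colour(distance):
    """Map mismatches to the nearest neighbour onto the caption's colours."""
    if distance >= 5:
        return "red"
    if distance >= 3:
        return "blue"
    if distance >= 1:
        return "green"
    return None  # zero mismatches is not covered by the caption


# Made-up Z-scores for five 8mers; n=3 keeps the toy example readable.
measured = {"TAATTAAA": 5.0, "TAATTGAA": 4.0, "CAATTAAA": 3.0,
            "ACGTACGT": 2.0, "GGGGCCCC": 1.0}
predicted = {"TAATTAAA": 4.5, "TAATTGAA": 1.0, "CAATTAAA": 3.5,
             "ACGTACGT": 4.0, "GGGGCCCC": 0.0}

print(top_n_overlap(measured, predicted, n=3))
print(mismatch_colour(4))
```

With real profiles covering all 8mers, the default n=100 would give a score on the same 0 to 100 scale as the figure axes.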

