Predicting the binding preference of transcription factors to individual DNA k-mers.

Alleyne TM, Peña-Castillo L, Badis G, Talukder S, Berger MF, Gehrke AR, Philippakis AA, Bulyk ML, Morris QD, Hughes TR - Bioinformatics (2008)

Bottom Line: Many TF-binding preferences, however, are unknown or poorly characterized, in part due to the difficulty associated with determining their specificity experimentally, and an incomplete understanding of the mechanisms governing sequence specificity. Nearest neighbour among functionally important residues emerged among the most effective methods. Our results underscore the complexity of TF-DNA recognition, and suggest a rational approach for future analyses of TF families.


Affiliation: Department of Molecular Genetics, Banting and Best Department of Medical Research, University of Toronto, Toronto, ON, Canada.

ABSTRACT

Motivation: Recognition of specific DNA sequences is a central mechanism by which transcription factors (TFs) control gene expression. Many TF-binding preferences, however, are unknown or poorly characterized, in part due to the difficulty associated with determining their specificity experimentally, and an incomplete understanding of the mechanisms governing sequence specificity. New techniques that estimate the affinity of TFs to all possible k-mers provide a new opportunity to study DNA-protein interaction mechanisms, and may facilitate inference of binding preferences for members of a given TF family when such information is available for other family members.

Results: We employed a new dataset consisting of the relative preferences of mouse homeodomains for all eight-base DNA sequences in order to ask how well we can predict the binding profiles of homeodomains when only their protein sequences are given. We evaluated a panel of standard statistical inference techniques, as well as variations of the protein features considered. Nearest neighbour among functionally important residues emerged among the most effective methods. Our results underscore the complexity of TF-DNA recognition, and suggest a rational approach for future analyses of TF families.
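As a rough illustration of the nearest-neighbour idea described above, the following minimal sketch predicts a Z-score profile for a query protein from the profiles of its closest training proteins. Several details are assumptions, not taken from the paper: distance is measured as the Hamming distance over pre-aligned "functionally important" residues, tied neighbours are averaged, and the names `tf_a`, `tf_b`, `tf_c`, their residue strings and their Z-scores are hypothetical toy values.

```python
from statistics import mean


def hamming(a, b):
    """Count the positions at which two aligned residue strings differ."""
    if len(a) != len(b):
        raise ValueError("residue strings must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def predict_profile(query_residues, training):
    """Predict a Z-score profile from the nearest neighbour(s) of a query.

    `training` maps a protein name to (residues, z_scores), where every
    z_scores list is ordered over the same k-mers. All training proteins
    at the minimum distance are treated as nearest neighbours and their
    profiles are averaged (an assumption about how ties are handled).
    """
    distances = {
        name: hamming(query_residues, residues)
        for name, (residues, _) in training.items()
    }
    best = min(distances.values())
    neighbours = sorted(name for name, d in distances.items() if d == best)
    profiles = [training[name][1] for name in neighbours]
    # Average the neighbours' Z-scores k-mer by k-mer.
    predicted = [mean(column) for column in zip(*profiles)]
    return neighbours, predicted


# Hypothetical toy data: six selected residues and three k-mer Z-scores each.
training = {
    "tf_a": ("KQRNRA", [1.0, 4.0, -0.5]),
    "tf_b": ("KQRNRT", [3.0, 2.0, 0.5]),
    "tf_c": ("RVWNMK", [-1.0, 0.0, 6.0]),
}

neighbours, predicted = predict_profile("KQRNRS", training)
print(neighbours, predicted)
```

Here `tf_a` and `tf_b` each differ from the query at one position, so both count as nearest neighbours and their profiles are averaged. A substitution-matrix distance could replace the Hamming distance without changing the rest of the sketch.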


Figure 2: Comparison of the accuracy of NN predictions versus experimental replicates. Scatterplots show the measured Z-scores for all 32 896 non-redundant eight-base DNA sequences from one PBM versus a second PBM for the same DBD (top) or versus the Z-score predicted using NN (6AA variant; bottom). Median performance metrics are given. Evx1 has a single NN (Hoxa2); Irx2 has a single NN (Irx3); Lhx1 has two NN (Alx3 and Lhx3).
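The figure of 32 896 non-redundant eight-base sequences is consistent with counting each 8-mer together with its reverse complement as one sequence: of the 4^8 = 65 536 possible 8-mers, 4^4 = 256 are their own reverse complement, which gives (65 536 + 256) / 2 = 32 896. The short sketch below checks this by enumeration. Reading "non-redundant" as merging reverse complements is an assumption, and the function names are illustrative.

```python
from itertools import product

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_kmers(k):
    """Return one representative per k-mer / reverse-complement pair.

    The representative is the lexicographically smaller of the two strands.
    """
    seen = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        seen.add(min(kmer, reverse_complement(kmer)))
    return sorted(seen)


kmers = canonical_kmers(8)
self_complementary = [s for s in kmers if s == reverse_complement(s)]
print(len(kmers), len(self_complementary))  # prints "32896 256"
```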

Mentions: Three major conclusions can be drawn from this analysis. First, results of all algorithms are clearly distinct from random (Table 2, columns 5, 9 and 12). Second, the 15AA and 6AA subsets appear to provide a superior training set relative to the 57AA set. Third, presumably due to the importance of non-linear interactions between amino acid positions in defining DNA-binding specificity, methods that can capture interactions and non-linearities have a clear advantage: there is almost always at least one variant of each non-linear method, i.e. NN, RF and SVM_R, that outperforms every linear method we employed. NN (Fig. 1, right panel), in particular, has a significantly higher mean top-100 overlap than PCR [95% confidence interval (CI) for difference, 3.92–109; Kruskal–Wallis test]. Moreover, NN often shows the greatest difference from random, and has the fewest predicted profiles with a top-100 overlap < 50. In three instances (Evx1, Irx2, and Lhx1), the 15AA NN-predicted Z-score profiles exhibit Spearman correlation, top-100 overlap, or RMSE values that exceed those of the experimental replicates for these proteins. Therefore, it appears that predicted Z-score profiles can, in specific cases, rival experimental replicates in reproducing the Z-score profile of a given homeodomain. Figure 2 shows scatter plots of the Z-scores for Evx1, Irx2 and Lhx1, as compared with predicted and replicate Z-scores.
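The three performance metrics named above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the paper's evaluation code: top-100 overlap is taken to mean the number of k-mers that appear among the 100 highest Z-scores of both profiles, Spearman correlation uses average ranks for ties, and the four-k-mer profiles at the end are hypothetical toy values.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(x, y):
    """Root mean squared error between two equally ordered score lists."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def top_n_overlap(measured, predicted, n=100):
    """Count k-mers ranked in the top n of both profiles (dicts kmer -> Z)."""
    def top(profile):
        return set(sorted(profile, key=profile.get, reverse=True)[:n])
    return len(top(measured) & top(predicted))


# Hypothetical toy profiles over four k-mers.
measured = {"AAAA": 5.0, "AAAC": 3.0, "AAAG": 1.0, "AAAT": -1.0}
predicted = {"AAAA": 4.0, "AAAC": 1.5, "AAAG": 2.0, "AAAT": -2.0}
keys = sorted(measured)
x = [measured[k] for k in keys]
y = [predicted[k] for k in keys]
print(spearman(x, y), rmse(x, y), top_n_overlap(measured, predicted, n=2))
```

With real data, `n` would stay at its default of 100 and each profile would hold one Z-score per non-redundant 8-mer.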

