Improved predictions of transcription factor binding sites using physicochemical features of DNA.

Maienschein-Cline M, Dinner AR, Hlavacek WS, Mu F - Nucleic Acids Res. (2012)

Bottom Line: New features are derived from Gibbs energies of amino acid-DNA interactions and hydroxyl radical cleavage profiles of DNA. Algorithmic modifications consist of inclusion of a feature selection step, use of a nonlinear kernel in the SVM classifier, and use of a consensus-based post-processing step for predictions. We also considered SVM classification based on letter features alone to distinguish performance gains from use of SVM-based models versus use of physicochemical features.


Affiliation: Department of Chemistry, University of Chicago, Chicago, IL 60637, USA.

ABSTRACT
Typical approaches for predicting transcription factor binding sites (TFBSs) involve use of a position-specific weight matrix (PWM) to statistically characterize the sequences of the known sites. Recently, an alternative physicochemical approach, called SiteSleuth, was proposed. In this approach, a linear support vector machine (SVM) classifier is trained to distinguish TFBSs from background sequences based on local chemical and structural features of DNA. SiteSleuth appears to generally perform better than PWM-based methods. Here, we improve the SiteSleuth approach by considering both new physicochemical features and algorithmic modifications. New features are derived from Gibbs energies of amino acid-DNA interactions and hydroxyl radical cleavage profiles of DNA. Algorithmic modifications consist of inclusion of a feature selection step, use of a nonlinear kernel in the SVM classifier, and use of a consensus-based post-processing step for predictions. We also considered SVM classification based on letter features alone to distinguish performance gains from use of SVM-based models versus use of physicochemical features. The accuracy of each of the variant methods considered was assessed by cross validation using data available in the RegulonDB database for 54 Escherichia coli TFs, as well as by experimental validation using published ChIP-chip data available for Fis and Lrp.


Figure 5. Results from verification with ChIP-chip data for Fis and Lrp. Error bars reflect the standard deviation over five independent runs, and thick horizontal bars are the results of five-way consensus analysis. Shown are (A) accuracy (number of ChIP-regions with a predicted TFBS over total number of predicted TFBSs) and (B) the number of predicted TFBSs (in inverted scale). There is no model variability in BvH, so there is no extra consensus-based result for this method. In panel (A), note that the bar for SiteSleuth has zero height and that the bar for SVMR-PMM-FS is actually much taller than depicted. In panel (B), SiteSleuth has 0 predicted TFBSs for all runs, and SVM-LMM and SVMR-PMM-FS have 0 predicted TFBSs in the five-way consensus.


Mentions: We used the trained models for Fis and Lrp to predict TFBSs across the entire E. coli genome and compared the predictions with binding regions from ChIP-chip experiments (43,44). For these data, we defined the accuracy of a motif model as the number of predicted TFBSs in ChIP-chip regions divided by the total number of predicted TFBSs. This approach also allowed us to test the consensus-based approach for identifying predicted TFBSs, wherein we compare the predicted TFBSs from five independently trained models and retain only those sites that are predicted positive by each model; there is no variability in the training procedure for BvH, so the consensus analysis is not performed for this method. Figure 5A gives the accuracy from each method for Fis and Lrp, and Figure 5B gives the number of predicted binding sites from each method. Note that the F-measures for Fis and Lrp are indicated by the boxed dots and circled dots, respectively, in Figure 4. We do not report prediction times for the different methods under consideration, as prediction time is typically dominated by the mapping of test sequences to feature vectors, which is I/O intensive and therefore platform dependent. For the reader's reference, we also give the DNA sequence logos for Fis and Lrp, generated by WebLogo (59,60) from the positive training examples, in Figure S6 of the Supplementary Materials.
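The accuracy measure and the five-way consensus step described above can be illustrated with a short Python sketch. This is not the authors' code: the function names, the representation of a predicted TFBS as a single genome coordinate, the inclusive (start, end) form of a ChIP-chip region, and all numbers in the toy data are assumptions made for illustration. Returning 0.0 when a model predicts no sites is also a convention assumed here, since the ratio is undefined in that case.

```python
from typing import Iterable, List, Set, Tuple

# Assumption: a ChIP-chip binding region is an inclusive (start, end) pair.
Region = Tuple[int, int]


def consensus_sites(runs: Iterable[Set[int]]) -> Set[int]:
    """Keep only the sites predicted positive by every trained model."""
    runs = list(runs)
    if not runs:
        return set()
    return set.intersection(*runs)


def in_any_region(position: int, regions: List[Region]) -> bool:
    """True if the predicted site falls inside at least one region."""
    return any(start <= position <= end for start, end in regions)


def accuracy(predicted: Set[int], regions: List[Region]) -> float:
    """Predicted TFBSs inside ChIP-chip regions over all predicted TFBSs."""
    if not predicted:
        return 0.0  # assumed convention: no predictions gives zero accuracy
    hits = sum(1 for p in predicted if in_any_region(p, regions))
    return hits / len(predicted)


# Toy data (hypothetical): predictions from five independently trained models.
runs = [
    {100, 250, 900},
    {100, 250, 400},
    {100, 250, 900},
    {100, 250},
    {100, 250, 777},
]
chip_regions = [(90, 120), (800, 950)]

consensus = consensus_sites(runs)
print(sorted(consensus))                   # [100, 250]
print(accuracy(consensus, chip_regions))   # 0.5
```

Because the consensus is a set intersection, it can only shrink the number of predicted sites, and it can shrink it to zero, as the Figure 5 caption reports for some methods.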