MCOIN: a novel heuristic for determining transcription factor binding site motif width.

Kilpatrick AM, Ward B, Aitken S - Algorithms Mol Biol (2013)

Bottom Line: However, this is a challenging computational problem as a result of changing model dimensionality at changing motif widths. Using realistic synthetic data and previously characterised prokaryotic data, we show that MCOIN outperforms the current most popular method (E-value of the resulting multiple alignment) as a predictor of motif width, based on mean absolute error. MCOIN is also shown to choose models which better match known sites at higher levels of motif conservation, based on ROC analysis.


Affiliation: School of Informatics, University of Edinburgh, Informatics Forum, 10 Crichton Street, EH8 9AB Edinburgh, Scotland. a.m.kilpatrick@sms.ed.ac.uk.

ABSTRACT

Background: In transcription factor binding site discovery, the true width of the motif to be discovered is generally not known a priori. The ability to compute the most likely width of a motif is therefore a highly desirable property for motif discovery algorithms. However, this is a challenging computational problem because model dimensionality changes with motif width. The problem is further complicated because the model discovered at the true motif width need not be the most statistically significant in a set of candidate motif models. Further, the core motif discovery algorithm used cannot be guaranteed to return the best possible result at each candidate width.

Results: We present MCOIN, a novel heuristic for automatically determining transcription factor binding site motif width, based on motif containment and information content. Using realistic synthetic data and previously characterised prokaryotic data, we show that MCOIN outperforms the current most popular method (E-value of the resulting multiple alignment) as a predictor of motif width, based on mean absolute error. MCOIN is also shown to choose models which better match known sites at higher levels of motif conservation, based on ROC analysis.

Conclusions: We demonstrate the performance of MCOIN as part of a deterministic motif discovery algorithm and conclude that MCOIN outperforms current methods for determining motif width.




Figure 4: E. coli: GntR motif ROC curves. ROC curves (plotted for 0 ≤ sFPR ≤ 0.1) for the most likely E. coli GntR motif, as chosen using the known width (left), MCOIN (centre) and the E-value-based estimator (right). The curve colour illustrates the threshold of $p(Z_{i,j} = 1 \mid X_{i,j}, \theta)$, from 1.0 (red) to 0.0 (blue). All three estimators predict the GntR motif much better than expected, considering the low conservation of the motif and the results of the experiments using realistic synthetic data.

Mentions: Comparing Tables 7 and 9 to Table 5, MCOIN is shown to give excellent classification-based results (particularly on the prokaryotic ChIP datasets) given the overall mean motif conservation and the results on realistic synthetic data. This is due to the conservation of individual positions within each motif: while the conservation of positions in each synthetic motif is uniform and independent, this pattern of conservation is not mirrored in real TFBS motifs. Analysis of the previously characterised motifs used in this study indicates that motifs with low mean conservation may have several positions which are very well or even perfectly conserved. This matches well with previous studies (e.g. [30]), which propose that the conservation of a given motif position is correlated with the conservation of surrounding motif positions, producing clusters of well-conserved positions, which may aid TFBS motif discovery algorithms. This phenomenon is clear in a number of E. coli motifs, particularly GntR (Figure 4), which has a mean conservation of 0.74 bits/col; the synthetic data results suggest relatively low values of sSn and sPPV for this motif. However, the GntR motif has a cluster of reasonably well-conserved positions, with a maximum conservation of 1.61 bits/col, and is discovered well at the known width (sSn = 0.82, sPPV = 0.70, AUC = 0.99), with similar results for both the MCOIN and E-value-based estimators (sSn = 0.71, sPPV = 0.71, AUC = 0.99 and sSn = 0.71, sPPV = 0.48, AUC = 1.00, respectively).
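
The distinction between mean and maximum conservation (in bits/col) can be illustrated with a minimal sketch. It assumes conservation is measured as the Shannon information content of each motif column relative to a uniform background over the four nucleotides; the text above does not state the exact definition used, so this is an assumption. The function names and the toy motif are hypothetical and are not taken from the paper.

```python
import math


def column_information(column, alphabet_size=4):
    """Information content (bits) of one motif column.

    Assumes a uniform background, so the maximum is log2(alphabet_size),
    i.e. 2 bits for DNA. `column` holds the probabilities of A, C, G, T.
    """
    max_bits = math.log2(alphabet_size)
    # Shannon entropy; zero-probability entries contribute nothing.
    entropy = -sum(p * math.log2(p) for p in column if p > 0)
    return max_bits - entropy


def conservation_profile(pwm):
    """Return (per-column bits, mean bits/col, max bits/col) for a motif."""
    per_col = [column_information(col) for col in pwm]
    return per_col, sum(per_col) / len(per_col), max(per_col)


# Hypothetical 4-column motif (one row per position; A, C, G, T probabilities).
toy_pwm = [
    [0.25, 0.25, 0.25, 0.25],  # unconserved position
    [1.00, 0.00, 0.00, 0.00],  # perfectly conserved position
    [0.50, 0.50, 0.00, 0.00],  # partially conserved position
    [0.25, 0.25, 0.25, 0.25],  # unconserved position
]

per_col, mean_ic, max_ic = conservation_profile(toy_pwm)
print(per_col, mean_ic, max_ic)
```

In this toy motif the mean conservation is low even though one position is perfectly conserved, which is the same pattern described for motifs such as GntR: a low mean in bits/col can hide a cluster of well-conserved positions.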

