Discriminative motif discovery in DNA and protein sequences using the DEME algorithm.

Redhead E, Bailey TL - BMC Bioinformatics (2007)

Bottom Line: Using artificial data, we show that DEME is more effective than a non-discriminative approach when there are "decoy" motifs or when a variant of the motif is present in the "negative" sequences. With real data, we show that DEME is as good as, but not better than, non-discriminative algorithms at discovering yeast transcription factor binding motifs. We also show that DEME can find highly informative thermal-stability protein motifs.


Affiliation: Institute for Molecular Bioscience, University of Queensland, Brisbane, Qld, 4072 Australia. e.redhead@imb.uq.edu.au

ABSTRACT

Background: Motif discovery aims to detect short, highly conserved patterns in a collection of unaligned DNA or protein sequences. Discriminative motif finding algorithms aim to increase the sensitivity and selectivity of motif discovery by utilizing a second set of sequences, and searching only for patterns that can differentiate the two sets of sequences. Potential applications of discriminative motif discovery include discovering transcription factor binding site motifs in ChIP-chip data and finding protein motifs involved in thermal stability using sets of orthologous proteins from thermophilic and mesophilic organisms.
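
The discriminative idea described above can be illustrated with a toy sketch. This is not DEME's objective function: it assumes exact string motifs and scores each candidate by the difference in hit rates between the two sets, and the function names and example sequences are hypothetical.

```python
def count_hits(motif, sequences):
    """Count the sequences that contain the exact string motif at least once."""
    return sum(1 for seq in sequences if motif in seq)


def discriminative_score(motif, positives, negatives):
    """Fraction of positive sequences with a hit minus the fraction of negatives with a hit."""
    pos_rate = count_hits(motif, positives) / len(positives)
    neg_rate = count_hits(motif, negatives) / len(negatives)
    return pos_rate - neg_rate


def best_motif(width, positives, negatives):
    """Exhaustively score every width-long substring of the positive set."""
    candidates = {
        seq[i:i + width]
        for seq in positives
        for i in range(len(seq) - width + 1)
    }
    # Sorting makes the result deterministic if several candidates tie.
    return max(sorted(candidates),
               key=lambda m: discriminative_score(m, positives, negatives))


# Hypothetical toy data: "GATTACA" occurs only in the positive set,
# while "ACGT" occurs more often in the negative set.
positives = ["ACGTGATTACA", "TTGATTACAGG", "GATTACACCCC"]
negatives = ["ACGTACGTACG", "TTTTCCCCGGG", "CCCCACGTTTT"]
top = best_motif(7, positives, negatives)
```

A pattern that is common in both sets scores near zero or below under this measure, which is why a second sequence set can help suppress patterns that a non-discriminative search might report.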

Results: We describe DEME, a discriminative motif discovery algorithm for use with protein and DNA sequences. Input to DEME is two sets of sequences: a "positive" set and a "negative" set. DEME represents motifs using a probabilistic model, and uses a novel combination of global and local search to find the motif that optimally discriminates between the two sets of sequences. DEME is unique among discriminative motif finders in that it uses an informative Bayesian prior on protein motif columns, allowing it to incorporate prior knowledge of residue characteristics. We also introduce four synthetic discriminative motif discovery problems that are designed for evaluating discriminative motif finders in various biologically motivated contexts. We test DEME using these synthetic problems and on two biological problems: finding yeast transcription factor binding motifs in ChIP-chip data, and finding motifs that discriminate between groups of thermophilic and mesophilic orthologous proteins.
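
As a rough illustration of what a probabilistic motif model can look like, the sketch below scores sequence sites against a column-wise residue frequency matrix using log-odds relative to a background distribution, and keeps the single best site in a sequence. It is a generic sketch, not DEME's model or search: the matrix values, the uniform background, and all function names are assumptions, and no Bayesian prior is applied.

```python
import math


def log_odds_matrix(freq_matrix, background):
    """Convert per-column residue frequencies into log-odds scores.

    Every frequency must be strictly positive, otherwise log() fails.
    """
    return [
        {residue: math.log(column[residue] / background[residue]) for residue in column}
        for column in freq_matrix
    ]


def best_site(sequence, log_odds):
    """Return (position, score) of the highest-scoring site in one sequence."""
    width = len(log_odds)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    scores = [
        sum(log_odds[j][sequence[i + j]] for j in range(width))
        for i in range(len(sequence) - width + 1)
    ]
    position = max(range(len(scores)), key=scores.__getitem__)
    return position, scores[position]


def column(consensus, alphabet="ACGT", p=0.85):
    """Hypothetical helper: a column that favours one residue with probability p."""
    rest = (1.0 - p) / (len(alphabet) - 1)
    return {a: (p if a == consensus else rest) for a in alphabet}


background = {a: 0.25 for a in "ACGT"}      # assumed uniform background
motif = [column(c) for c in "TAT"]          # assumed three-column motif
lod = log_odds_matrix(motif, background)
position, score = best_site("GGTATCC", lod)
```

Taking one best site per sequence corresponds to assuming a single motif occurrence in each sequence; other occurrence assumptions would need a different aggregation step.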

Conclusion: Using artificial data, we show that DEME is more effective than a non-discriminative approach when there are "decoy" motifs or when a variant of the motif is present in the "negative" sequences. With real data, we show that DEME is as good as, but not better than, non-discriminative algorithms at discovering yeast transcription factor binding motifs. We also show that DEME can find highly informative thermal-stability protein motifs. Binaries for the stand-alone program DEME are free for academic use and are available at http://bioinformatics.org.au/deme/


Figure 3: Effect of the seed prior on DEME accuracy. The plots show the average accuracy of the motifs discovered by DEME on different synthetic discriminative problems as a function of the size of the seed prior, B. The Bayesian motif prior is set to A = 4 for FM problems, and A = 1 for PSFM problems. All results are for DNA sequences, and DEME uses the OOPS data model in all cases. Each data point is the arithmetic mean ± standard error for 100 independent experiments.
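
Each plotted point in the figure is a mean ± standard error over repeated experiments. A minimal sketch of that summary follows, assuming the standard error is the sample standard deviation (n − 1 denominator) divided by the square root of the number of experiments; the caption does not state which estimator was used, and the accuracy values below are made up for illustration.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return (arithmetic mean, standard error of the mean) for a list of numbers."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate a standard error")
    mean = statistics.fmean(values)
    # Assumption: sample standard deviation with the n - 1 denominator.
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean, std_err


# Hypothetical accuracies from five independent experiments.
accuracies = [0.8, 0.9, 1.0, 0.7, 0.6]
mean, std_err = mean_and_standard_error(accuracies)
```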

Mentions: The seed prior weight, B (Eqn. 7), directly affects the objective function optimized by DEME during global search. Thus, B affects which string motif is chosen by DEME for refinement using local search, strongly influencing the final motif chosen by DEME. Fig. 3 illustrates the effect of B on the performance of DEME for the Random Negative and Decoy Motif Problems using both FM and PSFM motifs. In all cases, the training set PC, test set PC and test set ACC measures give similar pictures of DEME's variation in accuracy in response to B. DEME performs well with values of B smaller than 0.25 on both FM and PSFM motifs in the Random Negative Problem, and with PSFM motifs in the Decoy Motif Problem (Fig. 3a, b, d). However, DEME often fails to discover the true motif in the FM Decoy Motif Problem when B is less than 0.5 (Fig. 3c). This is because, when B is small, the global search objective function gives a higher score to sites of the decoy motif than to sites of the true motif. This effect is absent in the PSFM Decoy Motif Problem, where a very small value of B, 0.1, works best. Thus, using B = 0.1 seems to be a good compromise, giving optimal or nearly optimal accuracy for all problems except the FM Decoy Motif Problem. Of course, if the motifs in a real dataset are believed to be FM-like, B = 0.5 would be appropriate.

