Discriminative motif discovery in DNA and protein sequences using the DEME algorithm.

Redhead E, Bailey TL - BMC Bioinformatics (2007)

Bottom Line: Using artificial data, we show that DEME is more effective than a non-discriminative approach when there are "decoy" motifs or when a variant of the motif is present in the "negative" sequences. With real data, we show that DEME is as good as, but not better than, non-discriminative algorithms at discovering yeast transcription factor binding motifs. We also show that DEME can find highly informative thermal-stability protein motifs.


Affiliation: Institute for Molecular Bioscience, University of Queensland, Brisbane, Qld, 4072 Australia. e.redhead@imb.uq.edu.au

ABSTRACT

Background: Motif discovery aims to detect short, highly conserved patterns in a collection of unaligned DNA or protein sequences. Discriminative motif finding algorithms aim to increase the sensitivity and selectivity of motif discovery by utilizing a second set of sequences, and searching only for patterns that can differentiate the two sets of sequences. Potential applications of discriminative motif discovery include discovering transcription factor binding site motifs in ChIP-chip data and finding protein motifs involved in thermal stability using sets of orthologous proteins from thermophilic and mesophilic organisms.

Results: We describe DEME, a discriminative motif discovery algorithm for use with protein and DNA sequences. The input to DEME is two sets of sequences: a "positive" set and a "negative" set. DEME represents motifs using a probabilistic model, and uses a novel combination of global and local search to find the motif that optimally discriminates between the two sets of sequences. DEME is unique among discriminative motif finders in that it uses an informative Bayesian prior on protein motif columns, allowing it to incorporate prior knowledge of residue characteristics. We also introduce four synthetic discriminative motif discovery problems that are designed for evaluating discriminative motif finders in various biologically motivated contexts. We test DEME using these synthetic problems and on two biological problems: finding yeast transcription factor binding motifs in ChIP-chip data, and finding motifs that discriminate between groups of thermophilic and mesophilic orthologous proteins.

Conclusion: Using artificial data, we show that DEME is more effective than a non-discriminative approach when there are "decoy" motifs or when a variant of the motif is present in the "negative" sequences. With real data, we show that DEME is as good as, but not better than, non-discriminative algorithms at discovering yeast transcription factor binding motifs. We also show that DEME can find highly informative thermal-stability protein motifs. Binaries for the stand-alone program DEME are free for academic use and are available at http://bioinformatics.org.au/deme/


Figure 1: The DEME data model. Labels on arcs show the probabilities of choosing the labelled class, C, of a sequence, and the true class, T. When T = 0, sequences are generated using just the background model, θB. When T = 1, sequences contain a motif site, generated by motif model θM, inserted in random sequence generated by θB.

Mentions: DEME models the sequences in its input set as being generated according to the process illustrated in Fig. 1. First, the labeled class, C, is chosen with Pr(C = 1) = γ. Next, the true class of the sequence, T, is chosen. Finally, the sequence itself is generated. If T = 0, a random sequence without a planted motif site is generated using a 0-order Markov process with parameter vector θB, where θB[a] is the probability of observing the letter a at any position in the sequence. If T = 1, a motif site is generated and inserted at a random position in a random sequence generated using θB. The motif site is generated using a PSFM, θM, whose entries, θM[a, i], give the probability of observing letter a at position i in a motif site.
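The generative process described above can be illustrated with a short Python sketch. It is not DEME's implementation, only a reading of the three steps (choose C, choose T, generate the sequence). Several details are assumptions made for illustration: the function name `sample_sequence`, the dictionary `p_true_given_label` (holding a hypothetical Pr(T = 1 | C), since the arc probabilities of Fig. 1 are not given in the text), a fixed sequence length, a uniform choice of site position, and the choice to overwrite background letters with the site rather than lengthen the sequence.

```python
import random

def sample_sequence(length, theta_b, theta_m, gamma, p_true_given_label, rng):
    """Draw (C, T, sequence) following the three steps in the text.

    theta_b: dict letter -> background probability (0-order Markov model).
    theta_m: list of dicts, one per motif column, letter -> probability (PSFM).
    gamma: Pr(C = 1).
    p_true_given_label: hypothetical dict {0: Pr(T=1|C=0), 1: Pr(T=1|C=1)}.
    """
    alphabet = sorted(theta_b)
    # Step 1: choose the labeled class C with Pr(C = 1) = gamma.
    c = 1 if rng.random() < gamma else 0
    # Step 2: choose the true class T given C (assumed parameterization).
    t = 1 if rng.random() < p_true_given_label[c] else 0
    # Step 3: generate background letters independently from theta_b.
    seq = rng.choices(alphabet, weights=[theta_b[a] for a in alphabet], k=length)
    if t == 1:
        # Plant one motif site at a uniformly chosen start position
        # (assumption: the site overwrites background letters).
        start = rng.randrange(length - len(theta_m) + 1)
        for i, column in enumerate(theta_m):
            weights = [column[a] for a in alphabet]
            seq[start + i] = rng.choices(alphabet, weights=weights)[0]
    return c, t, "".join(seq)

# Toy DNA example with a uniform background and a 3-column motif.
theta_b = {a: 0.25 for a in "ACGT"}
theta_m = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
rng = random.Random(0)
sample = sample_sequence(30, theta_b, theta_m, 0.5, {0: 0.1, 1: 0.9}, rng)
```

Keeping C and T as separate draws mirrors the distinction in the text between the label a sequence carries and whether it actually contains a motif site, so a "positive" sequence may lack a site and a "negative" one may contain one.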

