Discriminative motif discovery in DNA and protein sequences using the DEME algorithm.

Redhead E, Bailey TL - BMC Bioinformatics (2007)

Bottom Line: Using artificial data, we show that DEME is more effective than a non-discriminative approach when there are "decoy" motifs or when a variant of the motif is present in the "negative" sequences. With real data, we show that DEME is as good as, but not better than, non-discriminative algorithms at discovering yeast transcription factor binding motifs. We also show that DEME can find highly informative thermal-stability protein motifs.


Affiliation: Institute for Molecular Bioscience, University of Queensland, Brisbane, Qld, 4072 Australia. e.redhead@imb.uq.edu.au

ABSTRACT

Background: Motif discovery aims to detect short, highly conserved patterns in a collection of unaligned DNA or protein sequences. Discriminative motif finding algorithms aim to increase the sensitivity and selectivity of motif discovery by utilizing a second set of sequences, and searching only for patterns that can differentiate the two sets of sequences. Potential applications of discriminative motif discovery include discovering transcription factor binding site motifs in ChIP-chip data and finding protein motifs involved in thermal stability using sets of orthologous proteins from thermophilic and mesophilic organisms.

Results: We describe DEME, a discriminative motif discovery algorithm for use with protein and DNA sequences. Input to DEME is two sets of sequences: a "positive" set and a "negative" set. DEME represents motifs using a probabilistic model, and uses a novel combination of global and local search to find the motif that optimally discriminates between the two sets of sequences. DEME is unique among discriminative motif finders in that it uses an informative Bayesian prior on protein motif columns, allowing it to incorporate prior knowledge of residue characteristics. We also introduce four synthetic discriminative motif discovery problems that are designed for evaluating discriminative motif finders in various biologically motivated contexts. We test DEME on these synthetic problems and on two biological problems: finding yeast transcription factor binding motifs in ChIP-chip data, and finding motifs that discriminate between groups of thermophilic and mesophilic orthologous proteins.

Conclusion: Using artificial data, we show that DEME is more effective than a non-discriminative approach when there are "decoy" motifs or when a variant of the motif is present in the "negative" sequences. With real data, we show that DEME is as good as, but not better than, non-discriminative algorithms at discovering yeast transcription factor binding motifs. We also show that DEME can find highly informative thermal-stability protein motifs. Binaries for the stand-alone program DEME are free for academic use and are available at http://bioinformatics.org.au/deme/
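
To make the idea of a motif that "discriminates between the two sets of sequences" concrete, the following is a minimal Python sketch. It is not DEME's objective function or search procedure, which the abstract does not specify. It assumes a position weight matrix (PWM) as the probabilistic motif model, a uniform DNA background, and a toy score (mean best-site log-odds in the positive set minus that in the negative set). The function names, the toy PWM and the sequences are all hypothetical.

```python
import math

ALPHABET = "ACGT"


def best_site_score(seq, pwm, background=0.25):
    """Highest log-odds score over all windows of the motif's width.

    Returns -inf if the sequence is shorter than the motif.
    """
    width = len(pwm)
    best = -math.inf
    for start in range(len(seq) - width + 1):
        score = 0.0
        for offset, column in enumerate(pwm):
            score += math.log(column[seq[start + offset]] / background)
        best = max(best, score)
    return best


def discrimination(pwm, positives, negatives):
    """Toy objective (an assumption, not DEME's): mean best-site score
    in the positive set minus the mean best-site score in the negative set."""
    pos = sum(best_site_score(s, pwm) for s in positives) / len(positives)
    neg = sum(best_site_score(s, pwm) for s in negatives) / len(negatives)
    return pos - neg


def make_column(preferred, p=0.85):
    """One PWM column that strongly prefers a single base."""
    rest = (1 - p) / 3
    return {base: (p if base == preferred else rest) for base in ALPHABET}


# Hypothetical motif "TGAC" and invented toy sequences.
pwm = [make_column(base) for base in "TGAC"]
positives = ["AATGACTT", "CCTGACGG"]
negatives = ["AAAAAAAA", "CCGGCCGG"]

print(discrimination(pwm, positives, negatives))
```

A motif that occurs equally often in both sets scores near zero under this toy objective, which is the intuition behind using a negative set to suppress "decoy" motifs.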

Figure 6: Running time of the DEME algorithm. The plot shows the CPU time required by DEME on a typical PC using a typical ChIP-chip (positive) dataset, as a function of the number of non-binding probe sequences used as the negative set. The positive set contains 59 probe sequences with an average length of 564 nt that bind GCN4 in rich media (the ChIP-chip datasets compiled by Harbison et al. [5] contain on average 40 probe sequences of length 564 nt). Each negative dataset contains randomly selected non-binding probe sequences. The largest negative set studied contains all non-binding probe sequences.

Mentions: DEME finds highly discriminating motifs when either set of sequences (thermophilic or mesophilic) is used as the positive set. Using the thermophilic sequences as the positive set, DEME finds a motif that identifies a region of the TATA-box binding protein that is differentially conserved between the two environments (Fig. 5a). In other words, the motif corresponds to a location in the multiple alignment of all the proteins where there is a strongly conserved but distinct preference for certain residues in the thermophilic compared with the mesophilic proteins. The motif found by DEME corresponds to the most highly differentially conserved region in the motif found by La et al. [8] (see Fig. 6 in La et al. [8]). The sites identified by the DEME motif show a very strong preference for the salt-bridge-forming residues arginine (R) and lysine (K) in the thermophilic organisms. One or the other of these two residues is the most common residue in five of the 20 positions in the thermophilic sites, and both are known to enhance thermal stability in proteins. The thermophilic sites also show a very strong preference for isoleucine (I) and valine (V), two residues hypothesized by La et al. [8] to promote thermal stability via beta-sheet formation and lower side-chain entropies. These residues (R, K, I and V) are only very weakly preferred in six columns of the sites in the mesophilic sequences; the difference in the prevalence of lysine (K) is particularly noticeable.
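
The kind of column-by-column comparison described above can be sketched in Python as follows. This is an illustrative sketch only, not the analysis performed in the paper. It assumes two sets of equal-length aligned sites and reports, for each column, the fraction of sites whose residue belongs to a chosen group (here R, K, I and V). The function names are hypothetical, and the four-residue toy sites are invented rather than taken from the TATA-box binding protein data.

```python
from collections import Counter


def column_frequencies(sites):
    """Per-column residue frequencies for equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    freqs = []
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        freqs.append({res: n / len(sites) for res, n in counts.items()})
    return freqs


def group_preference(sites, residues):
    """Per column, the fraction of sites whose residue is in `residues`."""
    return [
        sum(column.get(res, 0.0) for res in residues)
        for column in column_frequencies(sites)
    ]


# Invented toy sites, for illustration only (not data from the paper).
thermo_sites = ["KRIV", "KKIV", "RKVV", "KRIA"]
meso_sites = ["AQIV", "KEAV", "SQLT", "AELV"]

thermo_pref = group_preference(thermo_sites, "RKIV")
meso_pref = group_preference(meso_sites, "RKIV")

# Print column index, thermophilic fraction, mesophilic fraction.
for col, (t, m) in enumerate(zip(thermo_pref, meso_pref)):
    print(col, round(t, 2), round(m, 2))
```

Columns where the two fractions differ sharply are the ones a discriminative motif would be expected to emphasize; columns where they agree carry little discriminating information even if they are well conserved.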

