Discriminative motif discovery in DNA and protein sequences using the DEME algorithm.

Redhead E, Bailey TL - BMC Bioinformatics (2007)

Bottom Line: Using artificial data, we show that DEME is more effective than a non-discriminative approach when there are "decoy" motifs or when a variant of the motif is present in the "negative" sequences. With real data, we show that DEME is as good as, but not better than, non-discriminative algorithms at discovering yeast transcription factor binding motifs. We also show that DEME can find highly informative thermal-stability protein motifs.


Affiliation: Institute for Molecular Bioscience, University of Queensland, Brisbane, Qld, 4072 Australia. e.redhead@imb.uq.edu.au

ABSTRACT

Background: Motif discovery aims to detect short, highly conserved patterns in a collection of unaligned DNA or protein sequences. Discriminative motif finding algorithms aim to increase the sensitivity and selectivity of motif discovery by utilizing a second set of sequences, and searching only for patterns that can differentiate the two sets of sequences. Potential applications of discriminative motif discovery include discovering transcription factor binding site motifs in ChIP-chip data and finding protein motifs involved in thermal stability using sets of orthologous proteins from thermophilic and mesophilic organisms.

Results: We describe DEME, a discriminative motif discovery algorithm for use with protein and DNA sequences. Input to DEME is two sets of sequences: a "positive" set and a "negative" set. DEME represents motifs using a probabilistic model, and uses a novel combination of global and local search to find the motif that optimally discriminates between the two sets of sequences. DEME is unique among discriminative motif finders in that it uses an informative Bayesian prior on protein motif columns, allowing it to incorporate prior knowledge of residue characteristics. We also introduce four synthetic discriminative motif discovery problems that are designed for evaluating discriminative motif finders in various biologically motivated contexts. We test DEME using these synthetic problems and on two biological problems: finding yeast transcription factor binding motifs in ChIP-chip data, and finding motifs that discriminate between groups of thermophilic and mesophilic orthologous proteins.
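
To make the idea of a probabilistic motif model scored against two sequence sets concrete, the following is a minimal Python sketch. It is not DEME's objective function or search procedure, which the abstract does not specify. The function names (build_psfm, best_log_odds, mean_score_gap), the uniform DNA background, the pseudocount, and the "difference of mean best scores" contrast are all assumptions chosen for illustration only.

```python
import math

ALPHABET = "ACGT"  # assumption: DNA only; a protein model would use 20 letters


def build_psfm(sites, pseudocount=1.0):
    """Estimate per-column letter frequencies from equal-length aligned sites."""
    width = len(sites[0])
    psfm = []
    for col in range(width):
        counts = {a: pseudocount for a in ALPHABET}
        for site in sites:
            counts[site[col]] += 1.0
        total = sum(counts.values())
        psfm.append({a: counts[a] / total for a in ALPHABET})
    return psfm


def best_log_odds(seq, psfm, background=0.25):
    """Return the highest log-odds score over all motif-width windows of seq."""
    width = len(psfm)
    best = -math.inf
    for start in range(len(seq) - width + 1):
        score = sum(
            math.log(psfm[i][seq[start + i]] / background) for i in range(width)
        )
        best = max(best, score)
    return best


def mean_score_gap(psfm, positives, negatives):
    """Illustrative contrast (hypothetical): mean best score in the positive
    set minus mean best score in the negative set."""
    pos = sum(best_log_odds(s, psfm) for s in positives) / len(positives)
    neg = sum(best_log_odds(s, psfm) for s in negatives) / len(negatives)
    return pos - neg
```

A motif that occurs in the positive sequences but not in the negative ones gives a positive gap under this toy contrast; a motif equally common in both sets gives a gap near zero, which is the intuition behind using a second sequence set.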

Conclusion: Using artificial data, we show that DEME is more effective than a non-discriminative approach when there are "decoy" motifs or when a variant of the motif is present in the "negative" sequences. With real data, we show that DEME is as good as, but not better than, non-discriminative algorithms at discovering yeast transcription factor binding motifs. We also show that DEME can find highly informative thermal-stability protein motifs. Binaries for the stand-alone program DEME are free for academic use and are available at http://bioinformatics.org.au/deme/


Figure 4: Comparison of DEME and MEME on synthetic problems. Each plot shows the accuracy of predicted motifs as measured by the training set PC. Each data point represents the mean (± standard error) PC on 100 independent instantiations of the given problem. Panel a shows results on the FM Random Negative Problem. Panel b shows results for the FM Decoy Motif Problem with zero mutations in the occurrences of the decoy motif as a function of the number of mutations in the target motif sites. Panel c shows results on the FM Variant Motif Problem. The variant motif is Hamming distance four from the target motif and planted instances of the target and variant motifs contain the same number of mutations. Panel d shows results on the width-10 PSFM Impoverished Negative Problem (and width-10 PSFM Random Negative Problem for comparison) as a function of the length of the sequences. In all tests, DEME is run using a positive and negative training set, while MEME is applied to the positive training set only.

Mentions: The FM Random Negative Problem is essentially identical to the well-studied FM Challenge problem [29] for non-discriminative motif discovery. The performance of DEME on this problem for various numbers of mutations in the planted motif of width 15 is shown in Fig. 4a. For comparison, we show the performance of MEME on just the positive sequences. DEME effectively discovers planted sites that contain up to four mutations, whereas MEME fails to discover planted sites that contain more than three mutations. The performance of DEME on the (15, 4) FM Random Negative problem (see Methods section) is comparable to the performance reported for PROJECTION [40] (PC 0.93) and WINNOWER [29] (PC 0.92). These algorithms were specifically designed to solve the FM Challenge problem, as was the branching search algorithm, on which DEME's global search algorithm is based. The performance of both DEME and PROJECTION declines significantly on the (15, 5) problem, where the performance coefficient for these algorithms is 0.03 and 0.018, respectively. It is not surprising that the performance of these algorithms is poor on the (15, 5) problem, since it is expected that the positive dataset contains spurious motifs that are as "strong" as the planted motif [40].
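
The planted-motif setup and the performance coefficient (PC) discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' benchmark code: it reads "(15, 4)" as (motif width, mutations per planted site), takes PC to be the position-level overlap |K ∩ P| / |K ∪ P| between known and predicted site positions as in the FM Challenge literature [29], and plants exactly one site per sequence. The function names and the number and length of sequences are hypothetical parameters.

```python
import random

ALPHABET = "ACGT"


def plant_fm_problem(n_seqs, seq_len, width, n_mutations, rng):
    """Build random sequences, each containing one copy of a random motif
    with exactly n_mutations substituted positions. Returns the motif, the
    sequences, and the start position of the planted site in each sequence."""
    motif = "".join(rng.choice(ALPHABET) for _ in range(width))
    sequences, starts = [], []
    for _ in range(n_seqs):
        background = [rng.choice(ALPHABET) for _ in range(seq_len)]
        site = list(motif)
        for pos in rng.sample(range(width), n_mutations):
            # substitute with a different letter so the site really differs here
            site[pos] = rng.choice([a for a in ALPHABET if a != site[pos]])
        start = rng.randrange(seq_len - width + 1)
        background[start:start + width] = site
        sequences.append("".join(background))
        starts.append(start)
    return motif, sequences, starts


def performance_coefficient(known_starts, predicted_starts, width):
    """Position-level overlap between known and predicted sites.
    A predicted start of None means no site was predicted in that sequence."""
    known, predicted = set(), set()
    for seq_index, start in enumerate(known_starts):
        known.update((seq_index, start + i) for i in range(width))
    for seq_index, start in enumerate(predicted_starts):
        if start is not None:
            predicted.update((seq_index, start + i) for i in range(width))
    union = known | predicted
    return len(known & predicted) / len(union) if union else 0.0
```

Under this definition a perfect prediction scores 1.0, and a width-15 prediction shifted by five positions from the true site scores 10/20 = 0.5, so values such as 0.03 correspond to predictions that almost never overlap the planted sites.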

