Evaluating deterministic motif significance measures in protein databases.

Ferreira PG, Azevedo PJ - Algorithms Mol Biol (2007)

Bottom Line: The combined use of several measures may provide more robust results. We observed that not all the measures are suitable for situations with poorly balanced class information, for instance, when positive data are much scarcer than negative data. This study provides some pertinent insights on the choice of the right set of significance measures for the evaluation of deterministic motifs extracted from protein databases.

Affiliation: Department of Informatics, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal. pedrogabriel@di.uminho.pt

ABSTRACT

Background: Assessing the outcome of motif mining algorithms is an essential task, as the number of reported motifs can be very large. Significance measures play a central role in automatically ranking those motifs, thereby easing the analysis work. Spotting the most interesting and relevant motifs then depends on the choice of the right measures. The combined use of several measures may provide more robust results. However, caution must be taken to avoid spurious evaluations.

Results: The experiments showed that several of the selected significance measures behave very similarly in a wide range of situations, and therefore provide redundant information. Some measures proved more appropriate for ranking highly conserved motifs, while others are more appropriate for weakly conserved ones. Support appears to be a very important feature to consider for correct motif ranking. We observed that not all the measures are suitable for situations with poorly balanced class information, for instance, when positive data are much scarcer than negative data. Finally, a visualization scheme was proposed that, when several measures are applied, enables easy identification of high-scoring motifs.

Conclusion: In this work we have surveyed and categorized 14 significance measures for pattern evaluation. Their ability to rank three types of deterministic motifs was evaluated. The measures were applied under different testing conditions, and relations among them were identified. This study provides some pertinent insights on the choice of the right set of significance measures for the evaluation of deterministic motifs extracted from protein databases.

Figure 1: Ranking performance for different support values of the prosite and synthetic datasets. These figures present the variation of the Rm metric for each measure according to different support values of the target motif. Rm is shown on a logarithmic scale (y-axis) and support in relative values (x-axis). The evaluation was performed on the prosite and synthetic datasets from Tables 3 and 4. Support, IG and Z-score have, respectively, the best results for the two sets.

Mentions: Target motifs appear highly conserved in the datasets from Tables 3 and 4, and consequently the experiments can be biased in favor of support. An additional experiment was devised in which the support of the target motifs was reduced to different values. This was done by removing the appropriate number of motif occurrences from the dataset. Rank results were then obtained for both the prosite and synthetic datasets and are presented in Figure 1. These two experiments show that, even for lower support values, Support and IG still maintain a clear advantage over the remaining measures.
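
The support-reduction step described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: the motif is written as a Python regular expression, relative support is taken to be the fraction of sequences with at least one match, and "removing motif occurrences" is approximated by dropping whole matching sequences. The function names `relative_support` and `reduce_support` and the toy sequences are hypothetical.

```python
import re


def relative_support(pattern, sequences):
    """Fraction of sequences that contain at least one match of the motif."""
    if not sequences:
        return 0.0
    regex = re.compile(pattern)
    return sum(1 for seq in sequences if regex.search(seq)) / len(sequences)


def reduce_support(pattern, sequences, target):
    """Drop matching sequences until relative support is at most `target`.

    Assumption: an "occurrence" is removed by discarding the whole sequence
    that contains it; the source does not say how removal was done.
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must lie in [0, 1]")
    regex = re.compile(pattern)
    matching = [seq for seq in sequences if regex.search(seq)]
    others = [seq for seq in sequences if not regex.search(seq)]
    # Remove one matching sequence at a time and re-check the ratio.
    while matching and len(matching) / (len(matching) + len(others)) > target:
        matching.pop()
    return matching + others


# Toy data (hypothetical): the motif C-x(2)-C written as a regex.
motif = "C.{2}C"
dataset = ["ACAAC", "CGGCA", "AAAA", "GGGG", "TCTTC", "TTTT"]

original_support = relative_support(motif, dataset)
reduced = reduce_support(motif, dataset, target=0.25)
reduced_support = relative_support(motif, reduced)
```

Each reduced dataset could then be passed to the significance measures under study to re-rank the target motif at the lower support level.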