Fuzzy association rules for biological data analysis: a case study on yeast.

Lopez FJ, Blanco A, Garcia F, Cano C, Marin A - BMC Bioinformatics (2008)

Bottom Line: A number of association rules have been found, many of them agreeing with previous research in the area. In addition, a comparison between crisp and fuzzy results proves the fuzzy associations to be more reliable than crisp ones. An integrative approach such as the one carried out in this work can unveil significant knowledge which is currently hidden and dispersed through the existing biological databases.


Affiliation: Department of Computer Science and AI, University of Granada, 18071, Granada, Spain. fjavier@decsai.ugr.es

ABSTRACT

Background: In recent years, the mapping of diverse genomes has generated huge amounts of biological data which are currently dispersed through many databases. Integrating the information available in the various databases is required to unveil possible associations among already known data. Biological data are often imprecise and noisy. Fuzzy set theory is especially suitable for modelling imprecise data, while association rules are well suited to integrating heterogeneous data.

Results: In this work we propose a novel fuzzy methodology based on a fuzzy association rule mining method for biological knowledge extraction. We apply this methodology to a yeast genome dataset containing heterogeneous information regarding structural and functional genome features. A number of association rules have been found, many of them agreeing with previous research in the area. In addition, a comparison between crisp and fuzzy results proves the fuzzy associations to be more reliable than crisp ones.

Conclusion: An integrative approach such as the one carried out in this work can unveil significant knowledge which is currently hidden and dispersed through the existing biological databases. It is shown that fuzzy association rules can model this knowledge in an intuitive way by using linguistic labels and a few easily understandable parameters.


Figure 5: Comparison between fuzzy and crisp results 1. A) The histogram shows the distribution of the genes annotated in the term electron transport along the protein abundance domain. The graph below describes how the fuzzy sets are defined in this domain. The red dashed lines show the percentiles p33 and p66, i.e. the borders of the crisp sets. B) The same but for the genes annotated in the term snoRNA binding. Only the percentile p66 is shown in this case.

Mentions: Table 9 provides some concrete examples that also show the need for fuzzy methodologies. It lists rules whose quality measures vary significantly between the crisp and fuzzy versions. For example, the fuzzy Confidence and fuzzy CF of the first rule in Table 9 are lower than their crisp counterparts, and its fuzzy Support is also lower than the crisp Support. Figure 5A explains why the fuzzy values are lower. This figure shows how the genes annotated in the term electron transport are distributed along the protein abundance domain, and how the linguistic labels are defined in the fuzzy and crisp algorithms. The histogram in Figure 5A shows that many genes lie on the border between the MEDIUM and HIGH labels. The crisp algorithm considers most of these genes HIGH, whereas in the fuzzy one they are "a bit" MEDIUM and "a bit" HIGH. As a result, the fuzzy algorithm counts fewer HIGH genes annotated in electron transport and therefore obtains lower Confidence, CF and Support values. The same reasoning holds for the next two rules in Table 9; their corresponding graphs are shown in Figures 6A and 6B. For the last rule in Table 9, the Support, Confidence and CF values are slightly higher in the fuzzy version than in the crisp one. Figure 6B shows that many of the genes located on chromosome 16 have intergenic lengths near the borders that divide the MEDIUM – LOW and MEDIUM – HIGH sets. In this case, the MEDIUM fuzzy set not only includes "completely" (i.e. with membership degree 1) almost all the genes in the MEDIUM crisp set, but also contains many genes that belong to it with a lower membership degree and are not included in the crisp set. This increases the fuzzy Support value, since the number of genes located on chromosome 16 is the same in the crisp and fuzzy algorithms. Because the number of chromosome 16 genes with a MEDIUM intergenic length increases, the Confidence and CF values also increase.
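The effect described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the abundance values, the crisp border `P66`, the ramp limits of the fuzzy HIGH set, and the use of a sum of minimum memberships to evaluate Support and Confidence, which is a common simplification and not necessarily the evaluation used by the authors. The certainty factor follows the usual Shortliffe-Buchanan form. With several annotated genes sitting just above the crisp border, the fuzzy measures come out lower than the crisp ones.

```python
# Toy comparison of crisp vs fuzzy measures for a rule "term -> abundance HIGH".
# All data and thresholds below are hypothetical, chosen only for illustration.

ABUNDANCE = [10, 20, 30, 62, 64, 66, 68, 70, 90, 95]  # invented abundance values
IN_TERM = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]              # 1 = gene annotated in the term
P66 = 60.0                                             # assumed crisp border of HIGH
RAMP_LO, RAMP_HI = 50.0, 80.0                          # assumed fuzzy HIGH ramp limits


def crisp_high(x: float) -> float:
    """Crisp label: a gene is either fully HIGH or not HIGH at all."""
    return 1.0 if x >= P66 else 0.0


def fuzzy_high(x: float) -> float:
    """Fuzzy label: membership rises linearly from 0 at RAMP_LO to 1 at RAMP_HI."""
    if x <= RAMP_LO:
        return 0.0
    if x >= RAMP_HI:
        return 1.0
    return (x - RAMP_LO) / (RAMP_HI - RAMP_LO)


def certainty_factor(conf: float, supp_c: float) -> float:
    """Certainty factor of a rule given its confidence and consequent support."""
    if conf > supp_c:
        return (conf - supp_c) / (1.0 - supp_c)
    if conf < supp_c:
        return (conf - supp_c) / supp_c
    return 0.0


def rule_measures(mu_a, mu_c):
    """Support, Confidence and CF using min as the conjunction (an assumption)."""
    n = len(mu_a)
    joint = sum(min(a, c) for a, c in zip(mu_a, mu_c))
    antecedent = sum(mu_a)
    support = joint / n
    confidence = joint / antecedent if antecedent else 0.0
    cf = certainty_factor(confidence, sum(mu_c) / n)
    return {"support": support, "confidence": confidence, "cf": cf}


antecedent = [float(v) for v in IN_TERM]
crisp = rule_measures(antecedent, [crisp_high(x) for x in ABUNDANCE])
fuzzy = rule_measures(antecedent, [fuzzy_high(x) for x in ABUNDANCE])

print("crisp:", crisp)
print("fuzzy:", fuzzy)
```

Moving the invented values so that many genes fall just below the crisp border of a label reverses the effect: those genes contribute partial membership in the fuzzy version but nothing in the crisp one, which corresponds to the chromosome 16 case.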

