Towards a semi-automatic functional annotation tool based on decision-tree techniques.

Azé J, Gentils L, Toffano-Nioche C, Loux V, Gibrat JF, Bessières P, Rouveirol C, Poupon A, Froidevaux C - BMC Proc (2008)

Bottom Line: Using combined approaches increases the recall and the prediction rate. The combination of the two approaches is very encouraging and we will further refine these combinations in order to get rules even more useful for the annotators. This first study is a crucial step towards designing a semi-automatic functional annotation tool.


Affiliation: LRI - CNRS UMR 8623 - University Paris-Sud 11, F-91405 Orsay Cedex, France. Jerome.Aze@lri.fr

ABSTRACT

Background: Due to the continuous improvements of high-throughput technologies and experimental procedures, the number of sequenced genomes is increasing exponentially. Ultimately, the task of annotating these data relies on the expertise of biologists. The need for annotation to be supervised by human experts is the rate-limiting step of the data analysis. To face the deluge of new genomic data, automating the annotation process as much as possible becomes critical.

Results: We consider annotation of a protein with terms of the functional hierarchy that has been used to annotate Bacillus subtilis and propose a set of rules that predict classes in terms of elements of the functional hierarchy, i.e., a class is a node or a leaf of the hierarchy tree. The rules are obtained through two decision-tree techniques, first-order decision-trees and multilabel attribute-value decision-trees, using as training data the proteins from two lactic bacteria: Lactobacillus sakei and Lactobacillus bulgaricus. We tested the two methods, first independently, then in a combined approach, and evaluated the results using hierarchical evaluation measures. Results obtained for the two approaches on both genomes are comparable and show good precision together with a high prediction rate. Using combined approaches increases the recall and the prediction rate.

Conclusion: The combination of the two approaches is very encouraging and we will further refine these combinations in order to get rules even more useful for the annotators. This first study is a crucial step towards designing a semi-automatic functional annotation tool.


Figure 3: Hierarchical Precision. Plot of the hierarchical precision as a function of the confidence threshold (CT). A class is predicted only if it represents more than CT% of the examples observed in the leaf at the learning stage. Left: decision-trees learnt on L. sakei and then used to predict proteins of L. bulgaricus. Right: the converse, with decision-trees learnt on L. bulgaricus and used to predict proteins of L. sakei.
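The excerpt does not spell out how hierarchical precision is computed. A minimal sketch, assuming one common definition in which the predicted and true class sets are each extended with all their ancestors in the hierarchy before being compared, could look as follows. The `parent` mapping, the class identifiers, and both function names are hypothetical and serve only as illustration; the measure is shown for a single protein, whereas an evaluation over a genome would aggregate over all proteins.

```python
def with_ancestors(classes, parent):
    """Return the given classes together with all their ancestors.

    `parent` maps each class to its parent class (None for a top-level class).
    This representation of the hierarchy is an assumption of the sketch.
    """
    closed = set()
    for c in classes:
        # Walk up the hierarchy; stop at the top or at a class already visited
        # (its ancestors were added when it was first visited).
        while c is not None and c not in closed:
            closed.add(c)
            c = parent.get(c)
    return closed


def hierarchical_pr(predicted, true, parent):
    """Hierarchical precision and recall for one protein (assumed definition)."""
    p = with_ancestors(predicted, parent)
    t = with_ancestors(true, parent)
    overlap = len(p & t)
    precision = overlap / len(p) if p else 0.0
    recall = overlap / len(t) if t else 0.0
    return precision, recall


# Toy hierarchy with made-up class identifiers.
parent = {"1": None, "1.1": "1", "1.1.2": "1.1", "1.2": "1"}

# Predicting a class one level too deep is only partly penalised.
hp, hr = hierarchical_pr({"1.1.2"}, {"1.1"}, parent)
```

Under this definition a prediction that is too specific but on the right branch keeps full recall and loses only part of its precision, which is the usual motivation for hierarchical measures over flat ones.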

Mentions: In both approaches, decision-trees were learnt under the same condition: the minimal number of proteins in a leaf was set to 8 (smaller values would likely result in overfitting). When applying a decision-tree, a class was predicted only if it represented more than a minimal ratio of the examples observed in the leaf at the learning stage. This minimal ratio, called the Confidence Threshold and denoted CT, allows us to control the prediction rate. Its value was chosen empirically. As shown in Fig. 3, the precision increases steadily with the threshold, whereas the recall in Fig. 4 decreases sharply beyond CT = 75%. As a result, the hierarchical F-score (β = 1) also decreases for thresholds larger than 75% (see Fig. 5). In the following we use CT = 75%.
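The prediction rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: a leaf is represented simply as the list of class sets carried by its training proteins, and the function names and class identifiers are hypothetical. The F-score helper uses the standard F-beta formula, which reduces to the harmonic mean of precision and recall for β = 1.

```python
from collections import Counter


def predict_from_leaf(leaf_examples, ct=0.75):
    """Return the classes predicted by a leaf under a confidence threshold.

    `leaf_examples` is a list of sets, one per training protein that reached
    the leaf, each holding that protein's classes (assumed representation).
    A class is predicted only if it covers strictly more than `ct` of them.
    """
    total = len(leaf_examples)
    if total == 0:
        return []
    counts = Counter(c for classes in leaf_examples for c in classes)
    return sorted(c for c, n in counts.items() if n / total > ct)


def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; returns 0.0 when both inputs are 0."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# A leaf with 8 training proteins (the minimal leaf size used in the text):
# 7 carry the made-up class "2.1" and 1 carries "3.4".
leaf = [{"2.1"}] * 7 + [{"3.4"}]

at_75 = predict_from_leaf(leaf, ct=0.75)  # 7/8 = 0.875 exceeds 0.75
at_90 = predict_from_leaf(leaf, ct=0.90)  # 0.875 does not exceed 0.90
```

Raising `ct` makes a leaf abstain more often, which is how the threshold trades prediction rate and recall against precision.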

