Semi-supervised multi-label collective classification ensemble for functional genomics.

Wu Q, Ye Y, Ho SS, Zhou S - BMC Genomics (2014)

Bottom Line: As such, we propose a novel generative model based SMCC algorithm, called GM-SMCC, to effectively compute the label probability distributions of unannotated protein instances and predict their functional properties. Experimental results on a yeast gene dataset predicting the functions and localization of proteins demonstrate the effectiveness of the proposed method. In the comparison, we find that the performances of the proposed algorithms are better than the other compared algorithms.

ABSTRACT

Background: With the rapid accumulation of proteomic and genomic datasets, in the form of genome-scale features and interaction networks produced by high-throughput experimental techniques, manually predicting the functional properties of proteins has become increasingly cumbersome, and computational methods to automate this annotation task are urgently needed. Most approaches to predicting functional properties of proteins require either identifying a reliable set of labeled proteins with attribute features similar to those of the unannotated proteins, or learning from a fully labeled protein interaction network with a large amount of labeled data. However, acquiring such labels can be very difficult in practice, especially for multi-label protein function prediction problems. Learning with only a few labeled data can lead to poor performance, as only limited supervision knowledge can be obtained from similar proteins or from connections between them. To annotate proteins effectively even when labeled data are scarce, it is important to take advantage of all data sources available in this problem setting, including interaction networks, attribute feature information, correlations of functional labels, and unlabeled data.

Results: In this paper, we show that predicting functional properties of proteins from various sources of relational data is, in its underlying nature, a typical collective classification (CC) problem in machine learning. The protein function prediction task with limited annotation is then cast into a semi-supervised multi-label collective classification (SMCC) framework. We propose a novel generative-model-based SMCC algorithm, called GM-SMCC, to effectively compute the label probability distributions of unannotated protein instances and predict their functional properties. To further boost predictive performance, we extend the method into an ensemble, called EGM-SMCC, which utilizes multiple heterogeneous networks with various latent linkages, constructed to explicitly model the relationships among the nodes, in order to effectively propagate the supervision knowledge from labeled to unlabeled nodes.

Conclusion: Experimental results on a yeast gene dataset, predicting the functions and localization of proteins, demonstrate the effectiveness of the proposed method. In the comparison, the proposed algorithms perform better than the other compared algorithms.
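
The idea of propagating supervision knowledge from labeled to unlabeled nodes can be illustrated with a generic neighbor-averaging scheme. The sketch below is not GM-SMCC or EGM-SMCC, whose details are not given in this excerpt; it is a simplified relational-neighbor style propagation in which each unlabeled node repeatedly takes the mean of its neighbors' per-label probabilities while labeled nodes stay fixed. The function name `propagate_labels`, the node names `p1` to `p4`, and the starting value 0.5 are all assumptions made for illustration.

```python
def propagate_labels(neighbors, seed_labels, n_labels, n_iters=50):
    """Spread per-label probabilities over a graph by neighbor averaging.

    neighbors:   dict mapping each node to a list of adjacent nodes
    seed_labels: dict mapping each labeled node to a 0/1 list of length n_labels
    Returns a dict mapping every node to a list of per-label probabilities.
    """
    # Labeled nodes start (and stay) at their known labels;
    # unlabeled nodes start at 0.5, i.e. no prior knowledge (an assumption).
    probs = {
        v: [float(y) for y in seed_labels[v]] if v in seed_labels
        else [0.5] * n_labels
        for v in neighbors
    }
    for _ in range(n_iters):
        updated = {}
        for v, nbrs in neighbors.items():
            if v in seed_labels or not nbrs:
                updated[v] = probs[v]  # clamp labeled or isolated nodes
                continue
            # Each label is handled independently (multi-label setting).
            updated[v] = [
                sum(probs[u][k] for u in nbrs) / len(nbrs)
                for k in range(n_labels)
            ]
        probs = updated  # synchronous update of all nodes
    return probs


# Hypothetical chain of four proteins: p1 - p2 - p3 - p4.
toy_graph = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2", "p4"], "p4": ["p3"]}
toy_seeds = {"p1": [1, 0], "p4": [0, 1]}  # two functions, two annotated proteins
result = propagate_labels(toy_graph, toy_seeds, n_labels=2)
print(result["p2"], result["p3"])
```

In this toy chain, `p2` sits next to the protein annotated with the first function and ends up favoring it, while `p3` favors the second. Treating labels independently ignores the label correlations that the abstract names as one of the data sources, so this is only a starting point for intuition.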

Figure 3: The performance of the algorithms with varying percentages of labeled data on problem (2) of KDD Cup 2001. (a) Coverage; (b) RankingLoss; (c) 1-Macro-F1.

Mentions: We compare the performance of our proposed GM-SMCC approach and the baseline algorithms with varying percentages of labeled data, from 3% to 10%. For each percentage, we execute each algorithm 10 times, each time randomly selecting the labeled/unlabeled data split from the dataset. We then report the average results as well as the standard deviation of each compared algorithm over the 10 runs. The results are shown in Figure 3. To keep consistency with the Coverage and RankingLoss evaluation metrics, we use 1-MacroF1 instead of MacroF1. Thus, for every metric, the smaller the value, the better the performance of the algorithm. We see from Figure 3 that GM-SMCC (the black line) has the best performance (it lies under the other curves) across all evaluation metrics and label ratios. Semi-ICA is the second-best method. In the comparison, SVM performs poorly in terms of Coverage. On the other hand, wvRN+RL, ICML and ICA perform poorly in terms of MacroF1. Recent studies [41] have shown that one multi-label learning algorithm rarely outperforms another on all criteria, because the evaluation measures used in the experiments assess the learning performance from different aspects. In our experiments, however, we find that GM-SMCC consistently outperforms the other algorithms across all label ratios. On average, GM-SMCC achieves a Coverage improvement of 0.35 (GM-SMCC: 3.90 versus semiICA: 4.25), a RankingLoss improvement of 0.01 (GM-SMCC: 0.104 versus semiICA: 0.114), and a 1-MacroF1 improvement of 0.068 (GM-SMCC: 0.640 versus semiICA: 0.708) against the second-best method. This result indicates that the proposed GM-SMCC algorithm is effective for the multi-label protein function prediction task.
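
The evaluation protocol described above (random labeled/unlabeled splits, three lower-is-better metrics, mean and standard deviation over runs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the exact metric conventions used in the paper are not given in this excerpt, so Coverage is taken here as the deepest rank of a true label minus one, RankingLoss counts ties as errors, and all function names and the toy data are hypothetical.

```python
import random
import statistics


def coverage(scores, truth):
    """Mean depth in the ranked label list needed to reach every true label."""
    total = 0
    for s, t in zip(scores, truth):
        order = sorted(range(len(s)), key=lambda j: -s[j])  # best score first
        rank = {label: r for r, label in enumerate(order, start=1)}
        relevant = [j for j, y in enumerate(t) if y == 1]
        if relevant:
            total += max(rank[j] for j in relevant) - 1  # assumed "minus one" convention
    return total / len(scores)


def ranking_loss(scores, truth):
    """Mean fraction of (true, false) label pairs that are ordered wrongly."""
    losses = []
    for s, t in zip(scores, truth):
        rel = [j for j, y in enumerate(t) if y == 1]
        irr = [j for j, y in enumerate(t) if y == 0]
        if not rel or not irr:
            continue  # undefined for this instance, so skip it
        bad = sum(1 for a in rel for b in irr if s[a] <= s[b])
        losses.append(bad / (len(rel) * len(irr)))
    return sum(losses) / len(losses) if losses else 0.0


def one_minus_macro_f1(pred, truth):
    """1 minus the unweighted mean of per-label F1, so that smaller is better."""
    n_labels = len(truth[0])
    f1s = []
    for j in range(n_labels):
        tp = sum(1 for p, t in zip(pred, truth) if p[j] == 1 and t[j] == 1)
        fp = sum(1 for p, t in zip(pred, truth) if p[j] == 1 and t[j] == 0)
        fn = sum(1 for p, t in zip(pred, truth) if p[j] == 0 and t[j] == 1)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return 1.0 - sum(f1s) / n_labels


def random_split(n_items, labeled_fraction, rng):
    """Randomly choose which item indices are labeled and which are unlabeled."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_labeled = max(1, round(n_items * labeled_fraction))
    return indices[:n_labeled], indices[n_labeled:]


def summarize(values):
    """Mean and sample standard deviation over repeated runs."""
    return statistics.mean(values), statistics.stdev(values)


# Toy scores for two instances and three labels (made-up numbers).
toy_scores = [[0.9, 0.7, 0.6], [0.1, 0.8, 0.3]]
toy_truth = [[1, 0, 1], [0, 1, 0]]
toy_pred = [[1 if v >= 0.5 else 0 for v in row] for row in toy_scores]
print(coverage(toy_scores, toy_truth),
      ranking_loss(toy_scores, toy_truth),
      one_minus_macro_f1(toy_pred, toy_truth))
```

In a full experiment, one would call `random_split` ten times per labeled percentage, train and score a model on each split, and pass the ten metric values to `summarize`. Reporting 1-MacroF1 rather than MacroF1 means all three curves read the same way: lower is better.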

