Combining classifiers for improved classification of proteins from sequence or structure.

Melvin I, Weston J, Leslie CS, Noble WS - BMC Bioinformatics (2008)

Bottom Line: Predicting a protein's structural or functional class from its amino acid sequence or structure is a fundamental problem in computational biology. Recently, there has been considerable interest in using discriminative learning algorithms, in particular support vector machines (SVMs), for classification of proteins. In this study, we develop a hybrid machine learning approach for classifying proteins, and we apply the method to the problem of assigning proteins to structural categories based on their sequences or their 3D structures.


Affiliation: NEC Laboratories of America, Princeton, NJ, USA. iainmelvin@gmail.com

ABSTRACT

Background: Predicting a protein's structural or functional class from its amino acid sequence or structure is a fundamental problem in computational biology. Recently, there has been considerable interest in using discriminative learning algorithms, in particular support vector machines (SVMs), for classification of proteins. However, because sufficiently many positive examples are required to train such classifiers, all SVM-based methods are hampered by limited coverage.

Results: In this study, we develop a hybrid machine learning approach for classifying proteins, and we apply the method to the problem of assigning proteins to structural categories based on their sequences or their 3D structures. The method combines a full-coverage but lower-accuracy nearest-neighbor method with higher-accuracy but reduced-coverage multiclass SVMs to produce a full-coverage classifier with overall improved accuracy. The hybrid approach is based on the simple idea of "punting" from one method to another using a learned threshold.

Conclusion: In cross-validated experiments on the SCOP hierarchy, the hybrid methods consistently outperform the individual component methods at all levels of coverage. Code and data sets are available at http://noble.gs.washington.edu/proj/sabretooth.

Figure 1: Two punting strategies. (A) Two classifiers are combined to produce a hybrid classifier with improved accuracy and coverage. The punting thresholds (T = [T1, ..., Tn]) are class-dependent and are set using held-out data. (B) This approach is similar to (A), except that using two vectors of punting thresholds – T1 for the primary classifier and T2 for the secondary classifier – allows the method sometimes to make no prediction at all.

The punting strategy is depicted in Figure 1. In its simplest form (Figure 1A), the strategy relies upon a vector T of class-specific parameters. These parameters are learned by the algorithm, given a single hyperparameter supplied by the user. A query protein representation is first given to the primary classification method. The classifier produces a predicted class i along with a score s, the magnitude of which indicates the confidence in the prediction. If this score exceeds Ti, then class i is predicted. Otherwise, the query is punted to the secondary classifier, which makes its own prediction. Typically, the primary classifier is the one with higher accuracy and lower coverage, although we also experiment with punting in the other direction. It is sometimes preferable to make no prediction at all, rather than make a prediction that is very likely incorrect. In this case, a second set of class-specific thresholds allows the second classifier to punt as well, as shown in Figure 1B.
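
The decision rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' released code: the function name punt_predict, the (class, score) return convention for classifiers, and the toy stand-in classifiers are all assumptions made for the example. Passing only t1 corresponds to Figure 1A; passing t2 as well corresponds to Figure 1B, where the result may be no prediction at all.

```python
from typing import Callable, Optional, Sequence, Tuple

# Assumed convention: a classifier maps a query to
# (predicted class index, confidence score).
Classifier = Callable[[dict], Tuple[int, float]]


def punt_predict(
    query: dict,
    primary: Classifier,
    secondary: Classifier,
    t1: Sequence[float],
    t2: Optional[Sequence[float]] = None,
) -> Optional[int]:
    """Return a class index, or None if both classifiers punt."""
    cls, score = primary(query)
    if score > t1[cls]:
        # The primary classifier is confident enough for its predicted class.
        return cls
    # Otherwise punt to the secondary classifier.
    cls, score = secondary(query)
    if t2 is None or score > t2[cls]:
        # Figure 1A (no t2): the secondary classifier always answers.
        # Figure 1B: it answers only above its own class-specific threshold.
        return cls
    return None


# Hypothetical stand-ins for an SVM and a nearest-neighbor classifier.
def toy_svm(query: dict) -> Tuple[int, float]:
    return 0, query["svm_score"]


def toy_nn(query: dict) -> Tuple[int, float]:
    return 1, query["nn_score"]
```

For instance, with thresholds [0.5, 0.5] for both classifiers, a query whose primary score is 0.9 is assigned the primary class, while a query with a primary score of 0.3 is punted. In the two-threshold variant, that punted query receives no prediction if its secondary score is also below threshold.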

