Use of machine learning algorithms to classify binary protein sequences as highly-designable or poorly-designable.

Peto M, Kloczkowski A, Honavar V, Jernigan RL - BMC Bioinformatics (2008)

Bottom Line: We classify sequences as folding to either highly- or poorly-designable conformations. We have randomly selected subsets of the sequences belonging to highly-designable and poorly-designable conformations and used them to train several different standard machine learning algorithms. By using these machine learning algorithms with ten-fold cross-validation we are able to classify the two classes of sequences with high accuracy -- in some cases exceeding 95%.

Affiliation: Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA 50011-3020, USA. myron.peto@ars.usda.gov

ABSTRACT

Background: By using a standard Support Vector Machine (SVM) with a Sequential Minimal Optimization (SMO) method of training, Naïve Bayes, and other machine learning algorithms, we are able to distinguish between two classes of protein sequences: those folding to highly-designable conformations and those folding to poorly- or non-designable conformations.

Results: First, we generate all possible compact lattice conformations for the specified shape (a hexagon or a triangle) on the 2D triangular lattice. Then we generate all possible binary hydrophobic/polar (H/P) sequences and, using a specified energy function, thread them through all of these compact conformations. If, for a given sequence, the lowest energy is obtained for a particular lattice conformation, we assume that this sequence folds to that conformation. Highly-designable conformations have many H/P sequences folding to them, while poorly-designable conformations have few or no H/P sequences folding to them. We classify sequences as folding to either highly- or poorly-designable conformations. We have randomly selected subsets of the sequences belonging to highly-designable and poorly-designable conformations and used them to train several different standard machine learning algorithms.
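
The enumeration and threading steps described above can be sketched on a very small case. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the energy function, so the sketch assumes a simple H/P contact energy of -1 per non-bonded H-H contact; it uses a 6-site triangle rather than the shapes studied in the paper; and it treats walks with identical contact maps as one conformation. All function names are hypothetical.

```python
from collections import Counter
from itertools import product

# The six neighbour directions of the 2D triangular lattice (axial coordinates).
NEIGHBORS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]


def triangle_sites(side):
    """Lattice sites of a triangular shape with `side` sites along each edge."""
    return {(q, r) for q in range(side) for r in range(side - q)}


def adjacent(a, b):
    return (b[0] - a[0], b[1] - a[1]) in NEIGHBORS


def compact_paths(sites):
    """Yield every self-avoiding walk that visits each site exactly once."""
    def extend(path, seen):
        if len(path) == len(sites):
            yield tuple(path)
            return
        q, r = path[-1]
        for dq, dr in NEIGHBORS:
            nxt = (q + dq, r + dr)
            if nxt in sites and nxt not in seen:
                path.append(nxt)
                seen.add(nxt)
                yield from extend(path, seen)
                path.pop()
                seen.remove(nxt)

    for start in sorted(sites):
        yield from extend([start], {start})


def contact_map(path):
    """Pairs of chain positions that are lattice neighbours but not bonded."""
    n = len(path)
    return frozenset(
        (i, j) for i in range(n) for j in range(i + 2, n)
        if adjacent(path[i], path[j])
    )


def energy(seq, contacts):
    # ASSUMPTION: -1 per H-H contact; the source only says "a specified energy function".
    return -sum(1 for i, j in contacts if seq[i] == "H" and seq[j] == "H")


def designability(side):
    """Count, per conformation, the H/P sequences with a unique lowest energy there."""
    sites = triangle_sites(side)
    # ASSUMPTION: walks sharing a contact map are merged into one conformation.
    conformations = sorted({contact_map(p) for p in compact_paths(sites)}, key=sorted)
    counts = Counter()
    for seq in product("HP", repeat=len(sites)):
        energies = [energy(seq, c) for c in conformations]
        best = min(energies)
        if energies.count(best) == 1:  # unique ground state: the sequence "folds"
            counts[energies.index(best)] += 1
    return conformations, counts


conformations, counts = designability(3)
for index, n_sequences in counts.most_common():
    print(index, n_sequences)
```

Conformations with large counts would be the highly-designable ones in this toy setting; sequences whose lowest energy is shared by several conformations are not assigned to any of them.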

Conclusion: By using these machine learning algorithms with ten-fold cross-validation we are able to classify the two classes of sequences with high accuracy -- in some cases exceeding 95%.
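
As a rough illustration of ten-fold cross-validation with a Naïve Bayes classifier, the sketch below uses only the Python standard library. It rests on several assumptions that are not taken from the paper: features are counts of overlapping tripeptide segments, the classifier is a multinomial Naïve Bayes with Laplace smoothing, and the demonstration data at the end are a synthetic, trivially separable toy set rather than real designability labels. The function names are hypothetical.

```python
import math
import random
from collections import Counter
from itertools import product

# All eight H/P tripeptides: HHH, HHP, ..., PPP.
TRIPEPTIDES = ["".join(t) for t in product("HP", repeat=3)]


def tripeptide_counts(seq):
    """Counts of overlapping length-3 segments (ASSUMED featurization)."""
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2))


def train_naive_bayes(seqs, labels):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    model = {}
    for cls in set(labels):
        members = [s for s, y in zip(seqs, labels) if y == cls]
        totals = Counter()
        for s in members:
            totals.update(tripeptide_counts(s))
        denom = sum(totals.values()) + len(TRIPEPTIDES)
        log_like = {t: math.log((totals[t] + 1) / denom) for t in TRIPEPTIDES}
        model[cls] = (math.log(len(members) / len(seqs)), log_like)
    return model


def predict(model, seq):
    counts = tripeptide_counts(seq)

    def score(cls):
        log_prior, log_like = model[cls]
        return log_prior + sum(n * log_like[t] for t, n in counts.items())

    return max(sorted(model), key=score)


def k_fold_accuracy(seqs, labels, k=10, seed=0):
    """Shuffle once, split into k folds, and test each fold on a model
    trained on the remaining k - 1 folds."""
    order = list(range(len(seqs)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in order if i not in held]
        model = train_naive_bayes([seqs[i] for i in train],
                                  [labels[i] for i in train])
        correct += sum(predict(model, seqs[i]) == labels[i] for i in held_out)
    return correct / len(seqs)


# SYNTHETIC toy data for illustration only, not results from the paper.
toy_seqs = ["HHHHHH"] * 10 + ["PPPPPP"] * 10
toy_labels = ["high"] * 10 + ["low"] * 10
print(k_fold_accuracy(toy_seqs, toy_labels))
```

Each sequence is held out exactly once, so the reported accuracy is the fraction of all sequences classified correctly by a model that never saw them during training.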

Figure 6: ROC curve for the Naïve Bayes classifier. Tripeptide segments are used to classify binary sequences folding to highly- and poorly-designable conformations of the hexagonal shape. The diagonal line y = x, which we would expect if we used a classifier that randomly guessed which class to assign to a sequence, has been added for clarification.

Mentions: In Figure 6 we show a receiver operating characteristic (ROC) curve for the Naïve Bayes classifier on tripeptide segments in the hexagonal shape. This plot of sensitivity (the true positive rate) against 1 - specificity (the false positive rate) gives a visual indication of how our classifier performed. Qualitatively, we see that we obtain a high rate of true positives without having to accept many false positives. This is exactly how we want our classifier to perform and is an indication of the success of the Naïve Bayes classifier on tripeptide segments of sequences folding to conformations in the hexagonal shape.
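
For readers who want to see how such a curve is built, here is a minimal sketch that computes ROC points and the area under the curve from classifier scores. It is a generic construction, not the authors' plotting code; the function names are hypothetical, labels are assumed to be 1 for the positive class and 0 for the negative class, and both classes are assumed to be present.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, obtained by
    lowering the decision threshold through the sorted scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for idx, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item sharing this score, so that
        # tied scores produce a single step.
        if idx + 1 == len(ranked) or ranked[idx + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Illustrative scores only, not values from the paper.
example = roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])
print(example, auc(example))
```

A classifier that guesses at random gives points near the diagonal y = x and an area near 0.5, while a curve that rises steeply toward the top-left corner, as described for Figure 6, corresponds to an area close to 1.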

