Use of machine learning algorithms to classify binary protein sequences as highly-designable or poorly-designable.

Peto M, Kloczkowski A, Honavar V, Jernigan RL - BMC Bioinformatics (2008)

Bottom Line: We classify sequences as folding to either highly- or poorly-designable conformations. We have randomly selected subsets of the sequences belonging to highly-designable and poorly-designable conformations and used them to train several different standard machine learning algorithms. By using these machine learning algorithms with ten-fold cross-validation we are able to classify the two classes of sequences with high accuracy -- in some cases exceeding 95%.


Affiliation: Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA 50011-3020, USA. myron.peto@ars.usda.gov

ABSTRACT

Background: By using a standard Support Vector Machine (SVM) trained with the Sequential Minimal Optimization (SMO) method, Naïve Bayes, and other machine learning algorithms, we are able to distinguish between two classes of protein sequences: those folding to highly-designable conformations and those folding to poorly- or non-designable conformations.

Results: First, we generate all possible compact lattice conformations for the specified shape (a hexagon or a triangle) on the 2D triangular lattice. Then we generate all possible binary hydrophobic/polar (H/P) sequences and, using a specified energy function, thread them through all of these compact conformations. If a given sequence attains its lowest energy on one particular lattice conformation, we assume that the sequence folds to that conformation. Highly-designable conformations have many H/P sequences folding to them, while poorly-designable conformations have few or no H/P sequences folding to them. We classify sequences as folding to either highly- or poorly-designable conformations. We have randomly selected subsets of the sequences belonging to highly-designable and poorly-designable conformations and used them to train several different standard machine learning algorithms.
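
The threading step described above can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the article: the energy function is a simple count of H-H contacts (the abstract says only "a specified energy function"), each conformation is reduced to a list of contacting residue pairs, and the three contact maps in TOY_CONFORMATIONS are hypothetical stand-ins for a real enumeration of compact conformations on the triangular lattice.

```python
from itertools import product

def hh_energy(sequence, contacts):
    """Assumed energy: -1 for every contact between two H residues."""
    return -sum(1 for i, j in contacts
                if sequence[i] == "H" and sequence[j] == "H")

def fold(sequence, conformations):
    """Index of the unique lowest-energy conformation, or None on a tie."""
    energies = [hh_energy(sequence, c) for c in conformations]
    best = min(energies)
    if energies.count(best) != 1:
        return None  # degenerate ground state: no unique native conformation
    return energies.index(best)

def designability(n_residues, conformations):
    """Count, per conformation, how many H/P sequences fold uniquely to it."""
    counts = [0] * len(conformations)
    for seq in product("HP", repeat=n_residues):
        idx = fold(seq, conformations)
        if idx is not None:
            counts[idx] += 1
    return counts

# Hypothetical toy input: a 4-residue chain with three made-up contact maps.
TOY_CONFORMATIONS = [
    [(0, 3)],
    [(0, 2)],
    [(1, 3)],
]

counts = designability(4, TOY_CONFORMATIONS)
print(counts)
```

Sequences whose lowest energy is shared by two or more conformations are treated here as not folding, which matches the requirement of a unique lowest-energy conformation; how the authors handle such ties in detail is not stated in this text.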

Conclusion: By using these machine learning algorithms with ten-fold cross-validation we are able to classify the two classes of sequences with high accuracy -- in some cases exceeding 95%.
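
As an illustration of ten-fold cross-validation on H/P sequences, the sketch below trains a small Bernoulli Naïve Bayes classifier on per-position H/P features. It is not the authors' implementation: the classifier is written from scratch so the example is self-contained, the fold assignment and random seed are arbitrary choices, and the toy data (all 6-residue sequences, labelled by their first residue) are invented for the demonstration.

```python
import math
import random
from itertools import product

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_nb(seqs, labels):
    """Per-class prior and per-position P(H), with Laplace smoothing."""
    model = {}
    for cls in set(labels):
        rows = [s for s, y in zip(seqs, labels) if y == cls]
        prior = len(rows) / len(seqs)
        p_h = [(sum(r[i] == "H" for r in rows) + 1) / (len(rows) + 2)
               for i in range(len(seqs[0]))]
        model[cls] = (prior, p_h)
    return model

def predict_nb(model, seq):
    """Return the class with the highest log-posterior for seq."""
    def log_post(cls):
        prior, p_h = model[cls]
        return math.log(prior) + sum(
            math.log(p if c == "H" else 1 - p) for c, p in zip(seq, p_h))
    return max(model, key=log_post)

def cross_validate(seqs, labels, k=10, seed=0):
    """Overall accuracy when each fold is held out once."""
    correct = 0
    for fold in k_fold_indices(len(seqs), k, seed):
        held = set(fold)
        train_x = [s for i, s in enumerate(seqs) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        model = train_nb(train_x, train_y)
        correct += sum(predict_nb(model, seqs[i]) == labels[i] for i in fold)
    return correct / len(seqs)

# Invented toy data: the label depends only on the first residue.
toy_seqs = ["".join(s) for s in product("HP", repeat=6)]
toy_labels = ["high" if s[0] == "H" else "low" for s in toy_seqs]
accuracy = cross_validate(toy_seqs, toy_labels)
print(accuracy)
```

Each sequence is scored exactly once, by a model that never saw it during training, so the returned accuracy is an estimate of performance on unseen sequences.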


Figure 2: The dependence of the logarithm of the number of conformations Nconf on the number NS of sequences folding to them. a) corresponds to data for the hexagonal shape and b) is for the triangular shape.

We enumerate all binary sequences and test each for folding to a unique native conformation, that is, a conformation with the lowest energy among all compact conformations within the given shape. The two shapes studied have 19 (hexagon) and 21 (triangle) nodes, so the total numbers of binary H/P sequences are 2^19 and 2^21 (524,288 and 2,097,152); these are threaded through the 20,843 and 22,104 conformations of each shape, respectively. We then count, for each conformation of a given shape, the number of different sequences that fold to it with energy lower than in all other conformations, and store the counts. These results are shown in Figure 2a for the hexagon and Figure 2b for the triangle, where the logarithm of the number of conformations, log Nconf, having NS sequences folding to them is plotted against NS. These two graphs express qualitatively the same ideas reported in earlier studies [17,24,28,29,33,38]: there are many conformations with relatively few (or no) sequences folding to them, and a much smaller number of conformations with many sequences folding to them. The latter are called highly-designable conformations and the former poorly-designable conformations. We used the top 10% and bottom 10% of conformations for the two respective groups.
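
The counts quoted above and the top/bottom 10% selection can be reproduced with a short sketch. The function names are hypothetical, and two details are assumptions rather than statements from the article: conformations are ranked by their sequence count NS, and ties at the 10% boundary are broken by original order.

```python
from collections import Counter

def sequence_count(n_nodes):
    """Number of binary H/P sequences for a chain of n_nodes residues."""
    return 2 ** n_nodes

def nconf_histogram(counts):
    """Map each N_S value to N_conf, the number of conformations having it."""
    return Counter(counts)

def split_by_designability(counts, fraction=0.10):
    """Return (highly, poorly) designable conformation indices.

    Conformations are ranked by sequence count, highest first; the first
    and last `fraction` of the ranking form the two groups.
    """
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    k = max(1, int(len(order) * fraction))
    return order[:k], order[-k:]

print(sequence_count(19), sequence_count(21))

# Invented per-conformation counts, for illustration only.
example_counts = list(range(20))
highly, poorly = split_by_designability(example_counts)
print(highly, poorly)
```

Taking the logarithm of the N_conf values returned by nconf_histogram and plotting them against NS gives the kind of curve described for Figure 2.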

