Use of machine learning algorithms to classify binary protein sequences as highly-designable or poorly-designable.

Peto M, Kloczkowski A, Honavar V, Jernigan RL - BMC Bioinformatics (2008)

Bottom Line: We classify sequences as folding to either highly- or poorly-designable conformations. We have randomly selected subsets of the sequences belonging to highly-designable and poorly-designable conformations and used them to train several different standard machine learning algorithms. By using these machine learning algorithms with ten-fold cross-validation we are able to classify the two classes of sequences with high accuracy, in some cases exceeding 95%.


Affiliation: Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA 50011-3020, USA. myron.peto@ars.usda.gov

ABSTRACT

Background: By using a standard Support Vector Machine (SVM) trained with the Sequential Minimal Optimization (SMO) method, Naïve Bayes, and other machine learning algorithms, we are able to distinguish between two classes of protein sequences: those folding to highly-designable conformations and those folding to poorly- or non-designable conformations.

Results: First, we generate all possible compact lattice conformations for the specified shape (a hexagon or a triangle) on the 2D triangular lattice. Then we generate all possible binary hydrophobic/polar (H/P) sequences and, using a specified energy function, thread them through all of these compact conformations. If, for a given sequence, the lowest energy is obtained for a particular lattice conformation, we assume that this sequence folds to that conformation. Highly-designable conformations have many H/P sequences folding to them, while poorly-designable conformations have few or no H/P sequences folding to them. We classify sequences as folding to either highly- or poorly-designable conformations. We have randomly selected subsets of the sequences belonging to highly-designable and poorly-designable conformations and used them to train several different standard machine learning algorithms.
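The threading and counting procedure described above can be illustrated with a minimal sketch. Several things here are assumptions and not taken from the article: the contact energies in `CONTACT_ENERGY` (the text only says "a specified energy function"), the requirement that the ground state be unique for a sequence to count as folding, and the three toy contact maps in `TOY_CONFORMATIONS`, which stand in for an actual enumeration of compact conformations on the triangular lattice.

```python
from collections import Counter
from itertools import product

# Hypothetical contact energies: only H-H contacts are favourable.
# The article's actual energy function is not given in this excerpt.
CONTACT_ENERGY = {
    ("H", "H"): -1.0,
    ("H", "P"): 0.0,
    ("P", "H"): 0.0,
    ("P", "P"): 0.0,
}


def energy(seq, contacts):
    """Energy of a sequence threaded onto one conformation (a list of contact pairs)."""
    return sum(CONTACT_ENERGY[(seq[i], seq[j])] for i, j in contacts)


def fold(seq, conformations):
    """Index of the lowest-energy conformation, or None if the minimum is degenerate.

    Requiring a unique minimum is an assumption of this sketch.
    """
    energies = [energy(seq, c) for c in conformations]
    best = min(energies)
    winners = [k for k, e in enumerate(energies) if e == best]
    return winners[0] if len(winners) == 1 else None


def designability(conformations, length):
    """Number of H/P sequences folding to each conformation."""
    counts = Counter()
    for seq in product("HP", repeat=length):  # all 2**length binary sequences
        k = fold(seq, conformations)
        if k is not None:
            counts[k] += 1
    return [counts[k] for k in range(len(conformations))]


# Toy contact maps for a 4-residue chain (illustrative only, not real lattice folds).
TOY_CONFORMATIONS = [[(0, 3)], [(0, 2)], [(1, 3)]]

if __name__ == "__main__":
    print(designability(TOY_CONFORMATIONS, 4))
```

Each entry of the returned list is the designability of the corresponding conformation; sequences with a degenerate ground state are simply not counted.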

Conclusion: By using these machine learning algorithms with ten-fold cross-validation we are able to classify the two classes of sequences with high accuracy, in some cases exceeding 95%.
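As a rough illustration of ten-fold cross-validation on H/P sequences, the sketch below implements a small per-position Naïve Bayes classifier in plain Python. It is not the authors' implementation: the function names, the add-one smoothing, the fold assignment, and especially the toy labelling rule in the usage lines (a sequence is called "high" when its first residue is H) are assumptions chosen only to make the example self-contained.

```python
import math
import random
from itertools import product


def train_nb(seqs, labels):
    """Fit per-position Naive Bayes parameters with add-one smoothing."""
    n_pos = len(seqs[0])
    model = {}
    for c in sorted(set(labels)):
        members = [s for s, y in zip(seqs, labels) if y == c]
        log_prior = math.log(len(members) / len(seqs))
        # Smoothed estimate of P(residue i is H | class c)
        p_h = [
            (sum(s[i] == "H" for s in members) + 1) / (len(members) + 2)
            for i in range(n_pos)
        ]
        model[c] = (log_prior, p_h)
    return model


def predict_nb(model, seq):
    """Return the class with the highest log-posterior score."""
    def score(c):
        log_prior, p_h = model[c]
        return log_prior + sum(
            math.log(p if ch == "H" else 1.0 - p) for ch, p in zip(seq, p_h)
        )
    return max(model, key=score)


def cross_validate(seqs, labels, k=10, seed=0):
    """k-fold cross-validation accuracy: every sequence is held out exactly once."""
    idx = list(range(len(seqs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = train_nb([seqs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_nb(model, seqs[i]) == labels[i] for i in test_idx)
    return correct / len(seqs)


# Toy data: all H/P sequences of length 6 with a hypothetical labelling rule.
seqs = ["".join(s) for s in product("HP", repeat=6)]
labels = ["high" if s[0] == "H" else "low" for s in seqs]

if __name__ == "__main__":
    print(f"ten-fold accuracy: {cross_validate(seqs, labels, k=10):.3f}")
```

In a real experiment the labels would come from the designability of the conformation each sequence folds to, and the reported accuracy would be the fraction of held-out sequences classified correctly across the ten folds.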


Figure 4: Average energy difference between the ground state and the next-lowest energy state for different values of designability N_S for the hexagonal (a) and triangular (b) shapes. Although there is a clearly visible trend towards a higher energy gap as the conformations become more designable, there are exceptions, particularly for the most designable conformations (those with the largest N_S), which in both cases have average energy gaps below the maximum.

Mentions: We also show in Figures 4a and 4b the relationship between the average energy gap and the designability of conformations for the two shapes. The energy gap is defined as the difference between the energy of the ground-state conformation and that of the second-lowest-energy conformation for a given sequence. The average energy gap is the mean of this gap over all sequences folding to conformations of equal designability (N_S). As observed in previous studies [24,26,27,38], we find a marked tendency for the energy gap to increase for more designable conformations. This trend seems weaker for larger N_S, which may be a result of having too few conformations to obtain a reliable average. For the hexagonal shape there are fewer than 40 conformations with more than 38 sequences folding to them, whereas there are more than 20,000 conformations with fewer sequences folding to them.
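The two definitions above (the per-sequence energy gap and its average over sequences grouped by designability) can be sketched as follows. The record layout, a pair of the designability of the ground-state conformation and the list of that sequence's energies over all conformations, is a hypothetical choice made for this example, and the numbers in the usage lines are made up for illustration.

```python
from collections import defaultdict


def energy_gap(energies):
    """Second-lowest energy minus the lowest energy for one sequence."""
    lowest, second = sorted(energies)[:2]
    return second - lowest


def average_gap_by_designability(records):
    """Mean energy gap of sequences, grouped by the designability N_S
    of the conformation each sequence folds to.

    records: iterable of (n_s, energies) pairs, one per folding sequence.
    """
    totals = defaultdict(lambda: [0.0, 0])  # n_s -> [sum of gaps, count]
    for n_s, energies in records:
        acc = totals[n_s]
        acc[0] += energy_gap(energies)
        acc[1] += 1
    return {n_s: total / count for n_s, (total, count) in sorted(totals.items())}


# Made-up records: (designability of ground-state conformation, energies).
example_records = [
    (2, [-3.0, -1.0, 0.0]),
    (2, [-2.0, -1.0, -1.0]),
    (5, [-4.0, -1.0]),
]

if __name__ == "__main__":
    print(average_gap_by_designability(example_records))
```

Groups with very few records give noisy averages, which is consistent with the caveat in the text about the largest N_S values.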

