A novel approach for protein subcellular location prediction using amino acid exposure.

Mer AS, Andrade-Navarro MA - BMC Bioinformatics (2013)

Bottom Line: For stage one, we trained multiple Support Vector Machines (SVMs) to score eukaryotic protein sequences for membership in each of three categories: nuclear, cytoplasmic and extracellular, plus an extra category, nucleocytoplasmic, to account for the large number of proteins that shuttle between those two locations. In stage two we use an artificial neural network (ANN) to propose a category from the scores given to the four locations in stage one. Our algorithm uses a novel approach to address the multiclass classification problem.


Affiliation: Computational Biology and Data Mining, Max Delbrück Center for Molecular Medicine, Robert-Rössle-Str, 10, Berlin 13125, Germany. miguel.andrade@mdc-berlin.de.

ABSTRACT

Background: Proteins perform their functions in associated cellular locations. Therefore, the study of protein function can be facilitated by predictions of protein location. Protein location can be predicted either from the sequence of a protein alone, by identification of targeting peptide sequences and motifs, or by homology to proteins of known location. A third, complementary approach exploits the differences in amino acid composition of proteins associated with different cellular locations, and can be useful if motif and homology information are missing. Here we extend this approach by taking into account amino acid composition at different levels of amino acid exposure.
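As a rough illustration of composition measured at different exposure levels, the following minimal sketch computes a 20-component amino acid composition vector restricted to residues whose exposure falls within a given range. The function name, the exposure thresholds, the toy sequence, and the idea of concatenating two ranges into a 40-component vector are assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition_in_range(sequence, exposure, low, high):
    """Fraction of each amino acid among residues with low <= exposure < high.

    Hypothetical helper: the thresholds and the exposure scale are assumptions.
    """
    if len(sequence) != len(exposure):
        raise ValueError("sequence and exposure must have the same length")
    selected = [aa for aa, e in zip(sequence, exposure) if low <= e < high]
    if not selected:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(selected)
    return [counts[aa] / len(selected) for aa in AMINO_ACIDS]

# Toy input: one relative exposure value per residue (made-up numbers).
seq = "MKKLAE"
exp = [0.10, 0.80, 0.90, 0.05, 0.20, 0.70]

buried = composition_in_range(seq, exp, 0.0, 0.25)    # 20 components
exposed = composition_in_range(seq, exp, 0.25, 1.01)  # 20 components
combined = buried + exposed                           # 40 components (assumed layout)
```

In this toy case the buried residues are M, L and A, and the exposed residues are K, K and E, so the two vectors differ even though they come from the same sequence.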

Results: Our method has two stages. For stage one, we trained multiple Support Vector Machines (SVMs) to score eukaryotic protein sequences for membership in each of three categories: nuclear, cytoplasmic and extracellular, plus an extra category, nucleocytoplasmic, to account for the large number of proteins that shuttle between those two locations. In stage two we use an artificial neural network (ANN) to propose a category from the scores given to the four locations in stage one. The method reaches an accuracy of 68% when using 3D-derived values of amino acid exposure as input. Calibrating the method with predicted values of amino acid exposure allows classification of proteins without 3D information with an accuracy of 62%, and discrimination of proteins in different locations even if they share high levels of identity.
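One common way to train a separate scorer per category is to relabel the training set once per category, marking that category as positive and everything else as negative. The sketch below shows only this relabeling step for the four categories named above; the function name and the example annotations are hypothetical, and the actual training setup is described in the paper's Methods, not here.

```python
CATEGORIES = ["nuclear", "cytoplasmic", "extracellular", "nucleocytoplasmic"]

def one_vs_rest_labels(locations, positive):
    """Return +1 for proteins annotated with `positive` and -1 for all others."""
    return [1 if loc == positive else -1 for loc in locations]

# Made-up annotations for five proteins.
locations = ["nuclear", "extracellular", "nucleocytoplasmic",
             "cytoplasmic", "nuclear"]

# One binary label vector per category, each suitable for a two-class SVM.
labels = {cat: one_vs_rest_labels(locations, cat) for cat in CATEGORIES}
```

Each protein is positive in exactly one of the four label vectors, which is why a second stage is needed to reconcile the four resulting scores into a single proposed category.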

Conclusions: In this study we explored the relationship between residue exposure and protein subcellular location. We developed a new algorithm for subcellular location prediction that uses residue exposure signatures. Our algorithm uses a novel approach to address the multiclass classification problem. The algorithm is implemented as the web server 'NYCE' and can be accessed at http://cbdm.mdc-berlin.de/~amer/nyce.


Figure 8: Optimization of the artificial neural network (ANN). (Top) ANNs were optimized using different numbers of hidden neurons and SVM types. (Bottom) Best accuracy value obtained. The legends indicate the type of SVM input used. SVM ranges and vectors (of 20 or 40 components) are indicated as in Figure 6. The use of multiple sets of SVMs is indicated by labels with “/” as the separator. For example, the best accuracy value (0.68) was obtained using three sets of SVMs as input, two of them trained on 40-component vectors and one trained on 20-component vectors.

Mentions: We wondered whether combining SVM scores for different localizations and ranges using an Artificial Neural Network (ANN), as opposed to just taking the best-scoring prediction, could improve the accuracy of the method. To combine multiple SVM predictions for different locations we used an ANN with three layers: an input layer with one neuron for each of the one-vs.-rest SVMs used, a hidden layer, and an output layer with four neurons, one for each location class. The ANN was trained with SVM-calculated values and was required to produce an output of 1 for the correct class and 0 for the others (see Methods for details). The number of neurons in the hidden layer was optimized for maximum accuracy, as were the type and number of SVMs used as input (see Methods for details; Figure 8). For example, we tried using four SVMs as input, one for each location class, but also tried using SVMs for two types of ranges (8 input neurons), three types of ranges (12 input neurons), and four ranges (16 input neurons).
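The three-layer arrangement described above can be sketched as a plain forward pass. This is a minimal illustration, not the authors' implementation: the sigmoid activation, the random untrained weights, the hidden-layer size, and the example SVM scores are all assumptions, and training is omitted. It also shows the best-score baseline and the 1/0 target encoding mentioned in the text.

```python
import math
import random

CLASSES = ["nuclear", "cytoplasmic", "extracellular", "nucleocytoplasmic"]

def n_inputs(n_ranges):
    """One input neuron per one-vs.-rest SVM: four classes per range type."""
    return len(CLASSES) * n_ranges

def one_hot(label):
    """Training target: 1 for the correct class and 0 for the others."""
    return [1.0 if c == label else 0.0 for c in CLASSES]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_in, n_out, rng):
    # One row per neuron: n_in weights followed by a bias term.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
            for _ in range(n_out)]

def layer_forward(weights, inputs):
    return [sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
            for row in weights]

def ann_forward(hidden_w, output_w, svm_scores):
    """Input layer -> hidden layer -> four output neurons."""
    return layer_forward(output_w, layer_forward(hidden_w, svm_scores))

def best_score_class(svm_scores):
    """Baseline for four scores: take the class with the highest SVM score."""
    return CLASSES[max(range(len(CLASSES)), key=lambda i: svm_scores[i])]

rng = random.Random(0)
n_hidden = 5                              # assumed; the paper tunes this value
hidden_w = init_layer(n_inputs(1), n_hidden, rng)
output_w = init_layer(n_hidden, len(CLASSES), rng)

scores = [0.9, -0.2, -1.1, 0.4]           # made-up SVM scores, one per class
outputs = ann_forward(hidden_w, output_w, scores)
predicted = CLASSES[max(range(len(CLASSES)), key=lambda i: outputs[i])]
```

Because the weights here are random, `predicted` carries no meaning; the point is the shape of the network: `n_inputs(2)`, `n_inputs(3)` and `n_inputs(4)` give the 8, 12 and 16 input neurons mentioned above, while the output layer always has four neurons.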

