PredGPI: a GPI-anchor predictor.

Pierleoni A, Martelli PL, Casadio R - BMC Bioinformatics (2008)

Bottom Line: Several eukaryotic proteins associated with the extracellular leaflet of the plasma membrane carry a glycosylphosphatidylinositol (GPI) anchor, which is linked to the C-terminal residue after a proteolytic cleavage at the so-called omega-site. PredGPI outperforms all previously described methods and correctly reproduces the results of previously published high-throughput experiments. PredGPI yields a lower rate of false positive predictions than other available methods and is therefore a cost-free, rapid and accurate method for screening whole proteomes.


Affiliation: Biocomputing Group, Department of Biology, University of Bologna, Via Irnerio 42, 40126 Bologna, Italy. andrea@biocomp.unibo.it

ABSTRACT

Background: Several eukaryotic proteins associated with the extracellular leaflet of the plasma membrane carry a glycosylphosphatidylinositol (GPI) anchor, which is linked to the C-terminal residue after a proteolytic cleavage at the so-called omega-site (ω-site). Computational methods have been developed to identify proteins that undergo this post-translational modification from their amino acid sequences. However, more accurate methods are needed for reliable annotation of whole proteomes.

Results: Here we present PredGPI, a prediction method that couples a Hidden Markov Model (HMM) with a Support Vector Machine (SVM) to predict both the presence of the GPI anchor and the position of the ω-site. PredGPI is trained on a non-redundant dataset of experimentally characterized GPI-anchored proteins whose annotation was carefully checked in the literature.

Conclusion: PredGPI outperforms all previously described methods and correctly reproduces the results of previously published high-throughput experiments. PredGPI yields a lower rate of false positive predictions than other available methods and is therefore a cost-free, rapid and accurate method for screening whole proteomes.


Figure 1: The HMM model of the ω-site. Different colors represent different sets of emission probabilities. The ω-site is shown in red. The surrounding residues are colored in green, orange and yellow. The preceding region is shown in dark green. The spacer and the C-terminal hydrophobic regions are shown in violet and blue, respectively. The total number of independent trainable parameters is 147.

Mentions: In particular, the model depicted in Figure 1 is designed to describe the 40-residue C-terminal segment of GPI-anchored proteins. It contains 46 states centered on the state describing the ω-site. States filled with the same color share the same emission parameters, so that the model describes different zones with different residue compositions. The ω-site, the residue upstream of it and the two residues downstream of it are each described with independent emission probabilities. The regions upstream and downstream of these ω-site neighbors are described with one and two sets of emission probabilities, respectively. Two extra states begin and end the process and do not emit any letter. The topology of the transitions describes C-terminal cleaved propeptides (the portions following the ω-sites) longer than 16 residues and models their experimental length distribution.

The model was trained to recognize ω-sites starting from the C-terminal sequences of the proteins included in the GPIω-Set. Single-sequence coding and labeled Baum-Welch training were adopted, using three labels: Upstream, ω, and Downstream. A complete cross-validation was performed on all 26 sequences with an experimentally known ω-site, divided into 20 sets: in each round, 19 sets were used for training and the remaining one for testing. Because the 20 sets share low sequence identity with one another, this procedure gives an estimate of the performance that is not biased by sequence homology. Owing to the scarcity of known examples, pseudocounts were used when updating the emission parameters to improve generalization.

The posterior Viterbi algorithm was used for decoding [21]. Given a sequence, this algorithm finds the alignment to the model that maximizes the a posteriori state probabilities while complying with the topological constraints of the model. The predicted ω-site is the residue that is aligned with the ω-site state.
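As an illustration of the decoding step, the following is a minimal Python sketch of posterior Viterbi decoding on a toy three-state left-to-right model (Upstream, ω, Downstream). It is not the 46-state PredGPI model: the four-letter alphabet, every transition and emission value, and the function names `posteriors` and `posterior_viterbi` are assumptions invented for this example. The sketch first computes posterior state probabilities with the forward-backward recursions, then searches for the path that maximizes their product among the paths the topology allows.

```python
# Toy model (all values are illustrative assumptions, not PredGPI parameters).
STATES = ("U", "W", "D")            # Upstream, omega-site, Downstream
START = {"U": 1.0}                  # every path starts upstream
END = {"D"}                         # every path ends downstream
TRANS = {"U": {"U": 0.8, "W": 0.2}, "W": {"D": 1.0}, "D": {"D": 1.0}}
EMIT = {
    "U": {"S": 0.25, "G": 0.25, "A": 0.25, "L": 0.25},
    "W": {"S": 0.50, "G": 0.40, "A": 0.05, "L": 0.05},
    "D": {"S": 0.10, "G": 0.10, "A": 0.30, "L": 0.50},
}


def posteriors(seq):
    """Forward-backward posteriors P(state at t | seq); unscaled, short sequences only."""
    n = len(seq)
    fwd = [{s: 0.0 for s in STATES} for _ in range(n)]
    bwd = [{s: 0.0 for s in STATES} for _ in range(n)]
    for s in STATES:
        fwd[0][s] = START.get(s, 0.0) * EMIT[s][seq[0]]
        bwd[n - 1][s] = 1.0 if s in END else 0.0
    for t in range(1, n):
        for s in STATES:
            inflow = sum(fwd[t - 1][r] * TRANS[r].get(s, 0.0) for r in STATES)
            fwd[t][s] = inflow * EMIT[s][seq[t]]
    for t in range(n - 2, -1, -1):
        for r in STATES:
            bwd[t][r] = sum(
                TRANS[r].get(s, 0.0) * EMIT[s][seq[t + 1]] * bwd[t + 1][s]
                for s in STATES
            )
    z = sum(fwd[n - 1][s] * bwd[n - 1][s] for s in STATES)  # sequence likelihood
    return [{s: fwd[t][s] * bwd[t][s] / z for s in STATES} for t in range(n)]


def posterior_viterbi(seq):
    """Return the allowed state path maximizing the product of posteriors."""
    post = posteriors(seq)
    n = len(seq)
    score = [{s: 0.0 for s in STATES} for _ in range(n)]
    back = [{s: None for s in STATES} for _ in range(n)]
    for s in STATES:
        if START.get(s, 0.0) > 0.0:
            score[0][s] = post[0][s]
    for t in range(1, n):
        for s in STATES:
            # Only predecessors connected by a non-zero transition are allowed.
            allowed = [r for r in STATES if TRANS[r].get(s, 0.0) > 0.0]
            best = max(allowed, key=lambda r: score[t - 1][r])
            score[t][s] = score[t - 1][best] * post[t][s]
            back[t][s] = best
    last = max(END, key=lambda s: score[n - 1][s])
    path = [last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return "".join(reversed(path))


path = posterior_viterbi("AAASLLL")
print(path, path.index("W"))  # UUUWDDD 3
```

Because the toy topology forces every allowed path through the ω state exactly once, the decoded path always contains a single `W`, and its index plays the role of the predicted ω-site. The sketch assumes sequences of at least three letters drawn from the toy alphabet and omits the scaling or log-space arithmetic that longer sequences would need.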
The emission probability of the sequence is also computed and used as input to the SVM discriminator described in the next section. A conservative HMM was also trained, without adding any pseudocounts during training. This makes the prediction more constrained: in particular, it allows as ω-sites only the residues that are observed at that position in experimentally annotated sequences, namely cysteine, aspartic acid, glycine, asparagine, and serine. However, as we observed in the Introduction, there is no stringent evidence for excluding other residues.
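The effect of pseudocounts on the ω-site emission set can be sketched in a few lines of Python. This is a simplified illustration, not the authors' training code: it uses plain residue counts rather than the expected counts accumulated by Baum-Welch, and the observed residue string, the pseudocount value of 1.0 and the function name `emission_probs` are assumptions chosen for the example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def emission_probs(observed, pseudocount):
    """Estimate emission probabilities from residue counts plus a uniform pseudocount."""
    counts = Counter(observed)
    total = len(observed) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


# Hypothetical omega-site residues; the counts are invented, only the residue
# types (C, D, G, N, S) follow the text.
observed_omega = "SSSGGNNDC"

conservative = emission_probs(observed_omega, pseudocount=0.0)
smoothed = emission_probs(observed_omega, pseudocount=1.0)

allowed_conservative = sorted(aa for aa, p in conservative.items() if p > 0)
allowed_smoothed = sorted(aa for aa, p in smoothed.items() if p > 0)

print(allowed_conservative)   # ['C', 'D', 'G', 'N', 'S']
print(len(allowed_smoothed))  # 20
```

With a zero pseudocount, any residue never seen at the ω-site keeps an emission probability of exactly zero, so no decoded path can place it there. A positive pseudocount leaves all 20 residues possible, at the price of a less constrained prediction.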

