Prediction of glycosylation sites using random forests.

Hamby SE, Hirst JD - BMC Bioinformatics (2008)

Bottom Line: Our prediction program, GPP (glycosylation prediction program), predicts glycosylation sites with an accuracy of 90.8% for Ser sites, 92.0% for Thr sites and 92.8% for Asn sites. This is significantly better than current glycosylation predictors. We have created an accurate predictor of glycosylation sites and used this to extract comprehensible rules about the glycosylation process.


Affiliation: School of Chemistry, University of Nottingham, University Park, Nottingham NG7 2RD, UK. pcxsh1@nottingham.ac.uk

ABSTRACT

Background: Post-translational modifications (PTMs) occur in the vast majority of proteins and are essential for function. Prediction of the sequence location of PTMs enhances the functional characterisation of proteins. Glycosylation is one type of PTM, and is implicated in protein folding, transport and function.

Results: We use the random forest algorithm and pairwise patterns to predict glycosylation sites. We identify pairwise patterns surrounding glycosylation sites and use an odds ratio to weight their propensity of association with modified residues. Our prediction program, GPP (glycosylation prediction program), predicts glycosylation sites with an accuracy of 90.8% for Ser sites, 92.0% for Thr sites and 92.8% for Asn sites. This is significantly better than current glycosylation predictors. We use the trepan algorithm to extract a set of comprehensible rules from GPP, which provide biological insight into all three major glycosylation types.

Conclusion: We have created an accurate predictor of glycosylation sites and used this to extract comprehensible rules about the glycosylation process. GPP is available online at http://comp.chem.nottingham.ac.uk/glyco/.
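The pairwise-pattern weighting described in the Results can be illustrated with a minimal sketch. It assumes a pattern is "residue a at one window position together with residue b at another", and that the odds ratio compares how often the pattern occurs in windows around modified residues versus unmodified ones. The function names, the pattern encoding and the pseudocount are illustrative assumptions, not details taken from GPP.

```python
from collections import Counter
from itertools import combinations


def pairwise_patterns(window):
    """Return the set of (i, a, j, b) patterns in a window:
    residue a at index i together with residue b at index j, with i < j."""
    return {(i, window[i], j, window[j])
            for i, j in combinations(range(len(window)), 2)}


def odds_ratios(positive, negative, pseudocount=0.5):
    """Odds ratio of each pattern occurring in modified vs. unmodified windows.

    The pseudocount is an assumption of this sketch; it only avoids
    division by zero for patterns seen in one class but not the other.
    """
    pos_counts = Counter(p for w in positive for p in pairwise_patterns(w))
    neg_counts = Counter(p for w in negative for p in pairwise_patterns(w))
    n_pos, n_neg = len(positive), len(negative)
    ratios = {}
    for pattern in set(pos_counts) | set(neg_counts):
        a = pos_counts[pattern] + pseudocount          # modified, pattern present
        b = n_pos - pos_counts[pattern] + pseudocount  # modified, pattern absent
        c = neg_counts[pattern] + pseudocount          # unmodified, pattern present
        d = n_neg - neg_counts[pattern] + pseudocount  # unmodified, pattern absent
        ratios[pattern] = (a / b) / (c / d)
    return ratios
```

A ratio above 1 marks a pattern seen more often around modified residues, and a ratio below 1 marks one seen more often around unmodified residues. Weights of this kind could then serve as input features for a random forest classifier.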


Figure 2: Thr glycosylation rules. A subset of the complete decision tree encompassing all the rules produced for Thr glycosylation (the complete tree is available as additional file 5). Each node is numbered in the order it was added to the tree. All rules are 1 of A, B...N so only the relevant features are shown. Amino acids are represented using the single-letter code, and positions are indicated with respect to a sequence window of length 15, with the target residue at position 08.
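The window convention and the "1 of A, B...N" rule form in the caption can be sketched as follows. This is a minimal illustration only: the padding character for positions beyond the sequence ends, the helper names and the example conditions are hypothetical and are not taken from the figure or from GPP.

```python
WINDOW = 15  # window length stated in the caption
TARGET = 8   # 1-based position of the candidate residue within the window


def window_around(sequence, index, pad="X"):
    """Return the 15-residue window centred on sequence[index] (0-based).

    Positions that fall beyond either end of the sequence are filled
    with `pad` (a hypothetical choice made for this sketch).
    """
    flank = TARGET - 1
    padded = pad * flank + sequence + pad * flank
    return padded[index:index + WINDOW]


def m_of_n(window, conditions, m=1):
    """True if at least m of the (position, residue) conditions hold.

    Positions are 1-based window positions, so position 8 is the target.
    """
    hits = sum(window[pos - 1] == res for pos, res in conditions)
    return hits >= m


# Illustrative usage with a made-up sequence and made-up conditions.
window = window_around("MKTAPSTPSAAPTTC", 2)       # Thr at 0-based index 2
rule_fires = m_of_n(window, [(11, "P"), (9, "A")])  # a "1 of 2" rule
```

Under this convention a residue at offset +3 from the target sits at window position 11, and offset -1 sits at position 7.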

Mentions: Trepan identifies the consensus motif for Asn glycosylation (figure 1 and additional file 4) as the most prominent rules in the decision tree. However, subsequent rules are somewhat misleading as they allow glycosylation without the consensus sequence being present. This is probably an artefact of the generation of additional data by trepan. This approach is reliant on the distribution of the training data and will highlight patterns additional to the consensus sequence. The tree corresponding to Thr glycosylation (figure 2 and additional file 5) shows features in line with the statistical data. Pro at residue +3 increases glycosylation when accompanied by a Ser or Thr. The end of the sequence seems to be given undue importance. However, other rules are in line with the frequency analysis. Cys seems to strongly discourage glycosylation, whilst Ser, Thr and Pro encourage it when accompanied by various other amino acids. Some rules may be inexact, due to the limited data in O-unique that trepan can base its derived examples on. This is also true for the Ser tree (figure 3 and additional file 6). The tree for Ser is similar to the one for Thr, although more complicated. Once again the end of the sequence is implicated as is the presence of Pro at various positions. Cys again seems to block glycosylation, whilst Ser, Thr, Glu, and Pro all encourage it when present at various positions along the sequence, especially in conjunction.

