Peak intensity prediction in MALDI-TOF mass spectrometry: a machine learning study to support quantitative proteomics.

Timm W, Scherbart A, Böcker S, Kohlbacher O, Nattkemper TW - BMC Bioinformatics (2008)

Bottom Line: Features encoding the peptides' physico-chemical properties as well as string-based features were extracted. The techniques presented here are a useful first step going beyond the binary prediction of proteotypic peptides towards a more quantitative prediction of peak intensities. These predictions in turn will turn out to be beneficial for mass spectrometry-based quantitative proteomics.


Affiliation: Applied Neuroinformatics Group, Bielefeld University, Germany. wiebke.timm@childrens.harvard.edu

ABSTRACT

Background: Mass spectrometry is a key technique in proteomics and can be used to analyze complex samples quickly. One key problem with the mass spectrometric analysis of peptides and proteins, however, is the fact that absolute quantification is severely hampered by the unclear relationship between the observed peak intensity and the peptide concentration in the sample. While there are numerous approaches to circumvent this problem experimentally (e.g. labeling techniques), reliable prediction of the peak intensities from peptide sequences could provide a peptide-specific correction factor. Thus, it would be a valuable tool towards label-free absolute quantification.

Results: In this work we present machine learning techniques for peak intensity prediction for MALDI mass spectra. Features encoding the peptides' physico-chemical properties as well as string-based features were extracted. A feature subset was obtained from multiple forward feature selections on the extracted features. Based on these features, two advanced machine learning methods (support vector regression and local linear maps) are shown to yield good results for this problem (Pearson correlation of 0.68 in a ten-fold cross validation).

Conclusion: The techniques presented here are a useful first step going beyond the binary prediction of proteotypic peptides towards a more quantitative prediction of peak intensities. These predictions in turn will turn out to be beneficial for mass spectrometry-based quantitative proteomics.



Figure 3: Prediction results with randomly shuffled sequences. When assigning randomly shuffled sequences to the target values of dataset A, prediction by ν-SVR shows no correlation in 10-fold cross-validation. This indicates that we are picking up the true signal, i.e. the predicted values are correlated to the peptide sequence.

Mentions: To show that the predicted values reflect an actual signal related to the peptide sequences, and not some random pattern the learning machines find in the data, we randomly shuffled the assignment of peptide sequences to peak intensities. No good correlation was observed in cross-validation on the shuffled datasets. From dataset A we generated 100 datasets with randomly permuted target values. For each of the 100 shuffled datasets, we trained a ν-SVR with sss features and parameters optimized using another shuffled dataset. The best correlation with this dataset was r = 0.20. For the 100 datasets, we reach a mean correlation coefficient of r = -0.14. None of the shuffled datasets yielded a good correlation coefficient (standard deviation below 0.044). See Fig. 3 for an exemplary scatter plot. This is a clear indication that we are picking up a true signal, that is, the predicted intensities are correlated with the peptide sequence.
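
The shuffling control described above can be sketched in code. The following is a minimal, illustrative Python sketch, not the authors' implementation. A small k-nearest-neighbour regressor stands in for the ν-SVR so that the example needs only the standard library. The two numeric features and the toy data from make_toy_data are hypothetical placeholders for the sss features and dataset A. The parameter optimization on a separate shuffled dataset is omitted. All function names are assumptions made for illustration.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for constant input
    return sxy / (sxx * syy) ** 0.5


def knn_predict(train_X, train_y, x, k=3):
    """Stand-in regressor (NOT nu-SVR): mean target of the k nearest points."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), t)
        for row, t in zip(train_X, train_y)
    )
    return statistics.fmean([t for _, t in dists[:k]])


def cv_correlation(X, y, folds=10, k=3):
    """Correlation between targets and their cross-validated predictions."""
    n = len(y)
    preds = [0.0] * n
    for f in range(folds):
        train_idx = [i for i in range(n) if i % folds != f]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in range(f, n, folds):  # held-out samples of fold f
            preds[i] = knn_predict(train_X, train_y, X[i], k)
    return pearson(preds, y)


def permutation_test(X, y, n_perm=100, seed=0):
    """Compare the real correlation with correlations on shuffled targets."""
    rng = random.Random(seed)
    observed = cv_correlation(X, y)
    shuffled = []
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)  # breaks the link between features and targets
        shuffled.append(cv_correlation(X, y_perm))
    return observed, shuffled


def make_toy_data(n, seed=0):
    """Hypothetical toy data: two features, target depends on both plus noise."""
    rng = random.Random(seed)
    X = [[rng.random(), rng.random()] for _ in range(n)]
    y = [2.0 * a - b + rng.gauss(0.0, 0.1) for a, b in X]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data(100, seed=1)
    observed, shuffled = permutation_test(X, y, n_perm=100, seed=2)
    print("observed r:", round(observed, 3))
    print("mean shuffled r:", round(statistics.fmean(shuffled), 3))
    print("max shuffled r:", round(max(shuffled), 3))
```

On such toy data the unshuffled targets should give a clearly higher cross-validated correlation than any shuffled copy, which is the pattern this control is designed to reveal.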

