Peak intensity prediction in MALDI-TOF mass spectrometry: a machine learning study to support quantitative proteomics.

Timm W, Scherbart A, Böcker S, Kohlbacher O, Nattkemper TW - BMC Bioinformatics (2008)

Bottom Line: Features encoding the peptides' physico-chemical properties as well as string-based features were extracted. The techniques presented here are a useful first step beyond the binary prediction of proteotypic peptides towards a more quantitative prediction of peak intensities. These predictions, in turn, should benefit mass spectrometry-based quantitative proteomics.

Affiliation: Applied Neuroinformatics Group, Bielefeld University, Germany. wiebke.timm@childrens.harvard.edu

ABSTRACT

Background: Mass spectrometry is a key technique in proteomics and can be used to analyze complex samples quickly. One key problem with the mass spectrometric analysis of peptides and proteins, however, is the fact that absolute quantification is severely hampered by the unclear relationship between the observed peak intensity and the peptide concentration in the sample. While there are numerous approaches to circumvent this problem experimentally (e.g. labeling techniques), reliable prediction of the peak intensities from peptide sequences could provide a peptide-specific correction factor. Thus, it would be a valuable tool towards label-free absolute quantification.

Results: In this work we present machine learning techniques for peak intensity prediction for MALDI mass spectra. Features encoding the peptides' physico-chemical properties as well as string-based features were extracted. A feature subset was obtained from multiple forward feature selections on the extracted features. Based on these features, two advanced machine learning methods (support vector regression and local linear maps) are shown to yield good results for this problem (Pearson correlation of 0.68 in a ten-fold cross-validation).

Conclusion: The techniques presented here are a useful first step beyond the binary prediction of proteotypic peptides towards a more quantitative prediction of peak intensities. These predictions, in turn, should benefit mass spectrometry-based quantitative proteomics.

Figure 4: Analysis of absolute prediction error. Plot of target value vs. prediction error. Data were pooled into 20 bins according to their target values. For each bin, the mean absolute prediction error is plotted on the left y-axis, and the number of values falling into that bin is shown with squares on the right y-axis. The lowest error is achieved for intermediate target values; the highest error occurs for low ones. The absolute error is not correlated with the number of values per bin. Thus, intensities within a certain range are more difficult to predict than others.
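The binning behind such a plot can be sketched in a few lines. This is an illustration only, not the authors' code: the function name `binned_abs_error` is hypothetical, and equal-width bins are an assumption, since the caption states only that 20 bins were formed by target value.

```python
def binned_abs_error(targets, predictions, n_bins=20):
    """Pool samples into bins by target value.

    Returns one (bin_center, mean_abs_error, count) tuple per bin.
    mean_abs_error is None for a bin that received no samples.
    Assumption: bins are equal-width over the observed target range.
    """
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have equal length")
    if not targets:
        raise ValueError("no data")

    lo, hi = min(targets), max(targets)
    # Fall back to width 1.0 when all targets are identical.
    width = (hi - lo) / n_bins or 1.0

    err_sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, p in zip(targets, predictions):
        # The maximum target lands exactly on the upper edge; clamp it
        # into the last bin.
        idx = min(int((t - lo) / width), n_bins - 1)
        err_sums[idx] += abs(t - p)
        counts[idx] += 1

    rows = []
    for i in range(n_bins):
        center = lo + (i + 0.5) * width
        mae = err_sums[i] / counts[i] if counts[i] else None
        rows.append((center, mae, counts[i]))
    return rows


if __name__ == "__main__":
    import random

    # Synthetic stand-in data; the paper's data set is not reproduced here.
    rng = random.Random(0)
    y_true = [rng.uniform(0.0, 10.0) for _ in range(500)]
    y_pred = [t + rng.gauss(0.0, 1.0) for t in y_true]
    for center, mae, count in binned_abs_error(y_true, y_pred):
        print(center, mae, count)
```

Plotting the second column against the first gives the error curve, and the third column gives the per-bin counts that the figure shows on the right axis.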

Mentions: For both datasets, the most reliable predictions are achieved for intensities slightly above the intermediate range. Low target values have the highest prediction error (Fig. 4). Our training sets contain a large number of samples with small target values, so this effect cannot be attributed to undersampling; low intensities are simply more difficult to predict. One likely reason is that we predict the logarithm of the intensities, whereas noise in the mass spectra is additive and hence has a stronger effect at low intensities. In addition, noise in regions of lower intensity behaves differently from noise in regions of higher intensity [21]. The problem might be overcome when more measurements for each peptide become available. Alternatively, two or more models specialized for different intensity ranges could be used.
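The argument about additive noise and logarithmic targets can be checked numerically. The sketch below is a toy simulation, not part of the study: it assumes zero-mean Gaussian noise with a fixed standard deviation, and the function name `log_error_spread` is hypothetical. To first order, log(I + e) - log(I) is approximately e / I, so the same additive noise produces a spread in log space that scales as 1 / I.

```python
import math
import random


def log_error_spread(intensity, noise_sd, n=10000, seed=0):
    """Standard deviation of log(I + e) - log(I) for additive noise e.

    Assumption: e is zero-mean Gaussian with standard deviation noise_sd.
    Draws that would make the noisy intensity non-positive are skipped,
    because the logarithm is undefined there.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n):
        noisy = intensity + rng.gauss(0.0, noise_sd)
        if noisy <= 0.0:
            continue
        errors.append(math.log(noisy) - math.log(intensity))
    mean = sum(errors) / len(errors)
    return math.sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))


if __name__ == "__main__":
    # Same additive noise level, two very different true intensities.
    for intensity in (10.0, 1000.0):
        print(intensity, log_error_spread(intensity, noise_sd=1.0))
```

With identical noise, the spread of the log-intensity error is far larger for the weak peak than for the strong one, which is consistent with the higher prediction error reported for low target values.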

