LC-MSsim--a simulation software for liquid chromatography mass spectrometry data.

Schulz-Trieglaff O, Pfeifer N, Gröpl C, Kohlbacher O, Reinert K - BMC Bioinformatics (2008)

Bottom Line: The data resulting from an LC-MS experiment is huge, highly complex and noisy. Algorithm developers can match the results of feature detection and alignment algorithms against the simulated ion lists and meaningful error rates can be computed. We anticipate that LC-MSsim will be useful to the wider community to perform benchmark studies and comparisons between computational tools.


Affiliation: International Max Planck Research School for Computational Biology and Scientific Computing, Berlin, Germany. trieglaf@inf.fu-berlin.de

ABSTRACT

Background: Mass Spectrometry coupled to Liquid Chromatography (LC-MS) is commonly used to analyze the protein content of biological samples in large-scale studies. The data resulting from an LC-MS experiment is huge, highly complex and noisy. Accordingly, it has sparked new developments in Bioinformatics, especially in the fields of algorithm development, statistics and software engineering. In a quantitative label-free mass spectrometry experiment, crucial steps are the detection of peptide features in the mass spectra and the alignment of samples by correcting for shifts in retention time. At the moment, it is difficult to compare the plethora of algorithms for these tasks. So far, curated benchmark data exists only for peptide identification algorithms; there is no data that represents a ground truth for the evaluation of feature detection, alignment and filtering algorithms.

Results: We present LC-MSsim, a simulation software for LC-ESI-MS experiments. It simulates ESI spectra on the MS level. It reads a list of proteins from a FASTA file and digests the protein mixture using a user-defined enzyme. The software creates an LC-MS data set using a predictor for the retention time of the peptides and a model for peak shapes and elution profiles of the mass spectral peaks. Our software also offers the possibility to add contaminants and to change the background noise level, and it includes a model for the detectability of peptides in mass spectra. After the simulation, LC-MSsim writes the simulated data to mzData, a public XML format. The software also stores the positions (monoisotopic m/z and retention time) and ion counts of the simulated ions in separate files.
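The first two steps described above (reading proteins from FASTA and digesting them with an enzyme) can be illustrated with a minimal sketch. This is not LC-MSsim's own code: the function names `read_fasta` and `digest` are hypothetical, and the default cleavage rule (cut after K or R unless the next residue is P, a common trypsin-like rule) is an assumption chosen for illustration, since the abstract only states that the enzyme is user-defined.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    proteins, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            proteins[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped; concatenate them.
            proteins[header] += line
    return proteins


def digest(sequence, cleave_after="KR", blocked_by="P"):
    """Split a protein into peptides without missed cleavages.

    Assumed rule (illustrative only): cut after any residue in
    `cleave_after` unless the following residue is in `blocked_by`.
    """
    peptides, start = [], 0
    last = len(sequence) - 1
    for i, residue in enumerate(sequence):
        if residue not in cleave_after:
            continue
        if i == last or sequence[i + 1] not in blocked_by:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep the C-terminal remainder that has no cleavage site.
        peptides.append(sequence[start:])
    return peptides


if __name__ == "__main__":
    fasta = ">example_protein\nMKRPAK\nLLR\n"
    for name, seq in read_fasta(fasta).items():
        print(name, digest(seq))
```

A different enzyme could be modeled in this sketch by passing other values for `cleave_after` and `blocked_by`; missed cleavages are deliberately left out to keep the example short.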

Conclusion: LC-MSsim generates simulated LC-MS data sets and incorporates models for peak shapes and contaminations. Algorithm developers can match the results of feature detection and alignment algorithms against the simulated ion lists and meaningful error rates can be computed. We anticipate that LC-MSsim will be useful to the wider community to perform benchmark studies and comparisons between computational tools.
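To make the idea of matching algorithm output against the simulated ion lists concrete, here is a minimal sketch of how error rates might be computed. It is not part of LC-MSsim: the function name `match_features`, the greedy one-to-one matching strategy, and the tolerance values `mz_tol` and `rt_tol` are all assumptions for illustration. Each feature is represented as a `(monoisotopic m/z, retention time)` pair, matching the positions the software is described as storing.

```python
def match_features(detected, truth, mz_tol=0.01, rt_tol=5.0):
    """Greedily match detected features to simulated (ground-truth) ions.

    `detected` and `truth` are lists of (mz, rt) tuples. A detected
    feature counts as a true positive if an unused ground-truth ion lies
    within both tolerances (hypothetical default values).
    """
    unmatched = list(truth)
    tp = 0
    for mz, rt in detected:
        for k, (true_mz, true_rt) in enumerate(unmatched):
            if abs(mz - true_mz) <= mz_tol and abs(rt - true_rt) <= rt_tol:
                tp += 1
                del unmatched[k]  # each simulated ion is matched at most once
                break
    fp = len(detected) - tp
    fn = len(unmatched)
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(truth) if truth else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


if __name__ == "__main__":
    simulated = [(500.25, 100.0), (600.30, 200.0), (700.35, 300.0)]
    found = [(500.251, 101.0), (600.30, 260.0), (800.0, 50.0)]
    print(match_features(found, simulated))
```

Greedy matching is the simplest choice; in dense regions of the map an optimal assignment would be more robust, at the cost of a longer example.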

Figure 2: ROC plot for peptide detectability prediction. This plot shows the ROC curve for peptide detectability prediction on the excluded test partitions of the nested cross-validation runs on balanced samples of the MUDPIT-ESI data set of Mallick et al. [35].

Mentions: It was previously shown that not all peptides of a digested protein actually appear in the LC-MS sample [35,36,39]. There are numerous reasons for this: for example, some peptides ionize only poorly in the electrospray and others are simply not soluble. To account for this, we determine the likelihood of detectability of each peptide using a support vector machine [40] with the POBK as kernel function. We performed a nested cross-validation on balanced samples of the MUDPIT-ESI data set of Mallick et al. [35]. This means that we selected all n positive examples and chose n negative examples randomly out of all negative examples. The whole process was repeated ten times to minimize random effects. A ROC curve of all predictions on the excluded test partitions is shown in Fig. 2. Mallick et al. [35] evaluated their method in terms of 1 - positive predictive value (PPV) against coverage. For the MUDPIT-ESI data set they obtained a coverage of about 0.65 at 1 - PPV = 0.1 and a coverage of about 0.75 at 1 - PPV = 0.15. In our evaluation (data not shown) we obtain a coverage of about 0.65 at 1 - PPV = 0.11 and a coverage of about 0.8 at 1 - PPV = 0.15. This means that our method performs comparably to the methods of Mallick et al. [35] while requiring just a fraction of the negative data set for training (about 1,000 instead of 25,000), which drastically decreases the time needed for training the classifier. We used the probability estimates [41] of libsvm [42] to compute the likelihood of a peptide appearing in the LC-MS spectra.
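The balanced sampling scheme and the two evaluation quantities mentioned above can be sketched as follows. This is an illustration only, not the authors' pipeline: the function names are hypothetical, the SVM training itself is omitted, and "coverage" is assumed here to mean the fraction of positive examples whose predicted score reaches the threshold (i.e. recall), which may differ in detail from the definition used by Mallick et al. [35].

```python
import random


def balanced_samples(positives, negatives, repeats=10, seed=0):
    """Build `repeats` balanced data sets.

    Each data set keeps all n positives and draws n negatives at random
    without replacement. Raises ValueError if fewer than n negatives exist.
    """
    rng = random.Random(seed)
    n = len(positives)
    return [(list(positives), rng.sample(negatives, n))
            for _ in range(repeats)]


def coverage_and_one_minus_ppv(labels, scores, threshold):
    """Return (coverage, 1 - PPV) at a score threshold.

    `labels` are 1 (detected) or 0 (not detected); `scores` are predicted
    probabilities. 1 - PPV is None when nothing is predicted positive,
    because PPV is undefined in that case.
    """
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    coverage = tp / (tp + fn) if (tp + fn) else 0.0
    one_minus_ppv = 1.0 - tp / (tp + fp) if (tp + fp) else None
    return coverage, one_minus_ppv


if __name__ == "__main__":
    pos = ["PEPTIDEK", "SAMPLER", "ELVISK"]
    neg = ["NEG%d" % i for i in range(20)]
    for p, n in balanced_samples(pos, neg, repeats=2):
        print(len(p), "positives,", len(n), "negatives")
    print(coverage_and_one_minus_ppv([1, 1, 1, 0, 0],
                                     [0.9, 0.8, 0.3, 0.7, 0.1], 0.5))
```

Sweeping `threshold` over the classifier's probability estimates would trace out a coverage versus 1 - PPV curve of the kind the comparison above refers to.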

