msmsEval: tandem mass spectral quality assignment for high-throughput proteomics.

Wong JW, Sullivan MJ, Cartwright HM, Cagney G - BMC Bioinformatics (2007)

Bottom Line: We describe an application, msmsEval, that builds on previous work by statistically modeling the spectral quality discriminant function using a Gaussian mixture model. This allows a researcher to filter spectra based on the probability that a spectrum will ultimately be identified by database searching. We show that spectra that are predicted by msmsEval to be of high quality, yet remain unidentified in standard database searches, are candidates for more intensive search strategies.


Affiliation: Chemistry Department, Oxford University, Physical and Theoretical Chemistry Laboratory, South Parks Road, Oxford OX1 3QZ, UK. jason.wong@ucd.ie

ABSTRACT

Background: In proteomics experiments, database-search programs are the method of choice for protein identification from tandem mass spectra. As amino acid sequence databases grow, however, the computing resources required for these programs have become prohibitive, particularly in searches for modified proteins. Recently, different research groups have proposed methods to limit the number of spectra to be searched based on spectral quality, but rankings of spectral quality have thus far been based on arbitrary cut-off values. In this work, we develop a more readily interpretable spectral quality statistic by providing probability values for the likelihood that spectra will be identifiable.

Results: We describe an application, msmsEval, that builds on previous work by statistically modeling the spectral quality discriminant function using a Gaussian mixture model. This allows a researcher to filter spectra based on the probability that a spectrum will ultimately be identified by database searching. We show that spectra that are predicted by msmsEval to be of high quality, yet remain unidentified in standard database searches, are candidates for more intensive search strategies. Using a well-studied public dataset, we also show that a high proportion (83.9%) of the spectra that msmsEval predicts to be of high quality, but that elude standard search strategies, are in fact interpretable.

Conclusion: msmsEval will be useful for high-throughput proteomics projects and is freely available for download from http://proteomics.ucd.ie/msmseval. It supports Windows, Mac OS X and Linux/Unix operating systems.


Figure 2: Removal of unidentifiable spectra by msmsEval. The predicted fractions of spectra removed for identifiable (◆) and unidentifiable (x) spectra are plotted against the observed fractions for 10 runs of the UCD test dataset (A) and 22 runs of the ISB test dataset (B). The estimated fraction of spectra removed is calculated by taking the respective percentiles from the Gaussian distributions of the identifiable spectra. The thin dashed diagonal line shows the expected trend for the removal of identifiable spectra if the estimated values match the observed values perfectly. Error bars are one standard deviation from the average of the respective test datasets. Receiver operating characteristic (ROC) curves showing the fraction of identifiable spectra removed versus unidentifiable spectra removed for the UCD test dataset (solid line) and ISB dataset (dashed line) are also shown (C).
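
The percentile calculation in the caption can be illustrated with a short sketch. It rests on assumptions not stated in the source: that spectra scoring below a cutoff are the ones removed, that the cutoff is a percentile of the fitted identifiable Gaussian, and that the matching estimate for unidentifiable spectra is the fraction of their Gaussian lying below that cutoff. The function name and all means and standard deviations are hypothetical placeholders, not values from the paper.

```python
from statistics import NormalDist


def removal_estimates(identifiable, unidentifiable, fraction_identifiable_removed):
    """Return (score cutoff, estimated fraction of unidentifiable spectra removed).

    Assumes spectra scoring below the cutoff are discarded.
    """
    # The cutoff is the chosen percentile of the identifiable-spectra Gaussian.
    cutoff = identifiable.inv_cdf(fraction_identifiable_removed)
    # Fraction of the unidentifiable-spectra Gaussian that falls below the cutoff.
    return cutoff, unidentifiable.cdf(cutoff)


# Hypothetical fitted components (illustrative numbers only).
identifiable = NormalDist(mu=3.0, sigma=1.0)
unidentifiable = NormalDist(mu=-2.0, sigma=1.0)

for fraction in (0.01, 0.05, 0.10):
    cutoff, unid_removed = removal_estimates(identifiable, unidentifiable, fraction)
    print(f"lose {fraction:.0%} identifiable: cutoff={cutoff:.3f}, "
          f"unidentifiable removed={unid_removed:.4f}")
```

Sweeping the fraction from 0 to 1 and plotting the two removal fractions against each other would trace a model-based curve of the kind shown in panel C, under the same assumptions.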

Mentions: The EM algorithm is allowed to run until there are no significant changes to the estimated parameters between iterations. Supplementary Figure 2 (see Additional file 3) shows examples of the algorithm fitting datasets with different spectra distributions. In general, the predicted identifiable and unidentifiable spectra distributions match the observed distributions well, especially for the UCD dataset. For the ISB example, the unidentified spectra distribution is less well modeled; this may be a consequence of the smaller number of spectra available, or may reflect the need to include an additional distribution for effective modeling. Nevertheless, the predicted model still provides a reasonable estimate and, importantly, the identifiable distribution is modeled well, which is of principal importance for finding high-quality spectra.
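
The fitting procedure described above can be sketched as a generic two-component, one-dimensional EM loop that stops once no parameter changes by more than a tolerance. This is not msmsEval's own code: the function names, the quartile-based starting values, the tolerance, and the synthetic scores are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(scores, tol=1e-6, max_iter=500):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Returns (weights, means, sds, posteriors); posteriors[i] is the
    probability that scores[i] belongs to component 1, which is
    initialised at the upper quartile of the scores.
    """
    ordered = sorted(scores)
    n = len(ordered)
    overall_mean = sum(scores) / n
    overall_sd = math.sqrt(sum((s - overall_mean) ** 2 for s in scores) / n)
    # Crude start: quartiles as means, overall spread as both sds.
    means = [ordered[n // 4], ordered[(3 * n) // 4]]
    sds = [overall_sd, overall_sd]
    weights = [0.5, 0.5]
    post = [0.5] * n

    for _ in range(max_iter):
        # E-step: posterior probability of component 1 for every score.
        post = []
        for s in scores:
            p0 = weights[0] * normal_pdf(s, means[0], sds[0])
            p1 = weights[1] * normal_pdf(s, means[1], sds[1])
            total = p0 + p1
            post.append(p1 / total if total > 0.0 else 0.5)

        # M-step: re-estimate weight, mean and sd of each component.
        new_weights, new_means, new_sds = [], [], []
        for k in (0, 1):
            resp = [p if k == 1 else 1.0 - p for p in post]
            mass = sum(resp)
            mean = sum(r * s for r, s in zip(resp, scores)) / mass
            var = sum(r * (s - mean) ** 2 for r, s in zip(resp, scores)) / mass
            new_weights.append(mass / n)
            new_means.append(mean)
            new_sds.append(max(math.sqrt(var), 1e-6))

        # Convergence check: largest change in any parameter.
        old = weights + means + sds
        new = new_weights + new_means + new_sds
        change = max(abs(a - b) for a, b in zip(new, old))
        weights, means, sds = new_weights, new_means, new_sds
        if change < tol:
            break

    return weights, means, sds, post


# Synthetic scores for demonstration only (not data from the paper).
random.seed(0)
scores = ([random.gauss(-2.0, 1.0) for _ in range(600)]
          + [random.gauss(3.0, 1.0) for _ in range(400)])
weights, means, sds, post = fit_two_gaussians(scores)
print("weights:", weights)
print("means:", means)
print("sds:", sds)
```

The returned posteriors are the per-score probabilities of membership in the higher-scoring component, which is the kind of quantity a probability-based filter would threshold. A third component, as the text suggests might be needed for the ISB data, would require extending the loops beyond two components.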

