SAMPI: protein identification with mass spectra alignments.

Kaltenbach HM, Wilke A, Böcker S - BMC Bioinformatics (2007)

Bottom Line: A protein is digested (usually by trypsin) and its mass spectrum is compared to simulated spectra for protein sequences in a database. We prove the applicability of our approach using biological mass spectrometry data and compare our results to the standard software Mascot. Introducing more noise peaks, we are able to keep identification rates at a similar level by using the flexibility introduced by scoring schemes.


Affiliation: AG Genominformatik, Technische Fakultät, Universität Bielefeld, Bielefeld, Germany. michael@cebitec.uni-bielefeld.de

ABSTRACT

Background: Mass spectrometry based peptide mass fingerprints (PMFs) offer a fast, efficient, and robust method for protein identification. A protein is digested (usually by trypsin) and its mass spectrum is compared to simulated spectra for protein sequences in a database. However, existing tools for analyzing PMFs often suffer from missing or heuristic analysis of the significance of search results and insufficient handling of missing and additional peaks.

Results: We present a unified framework for analyzing Peptide Mass Fingerprints that offers a number of advantages over existing methods. First, comparison of mass spectra is based on a scoring function that can be custom-designed for certain applications and explicitly takes missing and additional peaks into account. The method is able to simulate almost every additive scoring scheme. Second, we present an efficient deterministic method for assessing the significance of a protein hit, independent of the underlying scoring function and sequence database. We prove the applicability of our approach using biological mass spectrometry data and compare our results to the standard software Mascot.

Conclusion: The proposed framework for analyzing Peptide Mass Fingerprints shows performance comparable to Mascot on small peak lists. Introducing more noise peaks, we are able to keep identification rates at a similar level by using the flexibility introduced by scoring schemes.



Figure 5: Occurrence probabilities. The mass occurrence probabilities p[L, m] for masses m = 1000.0, 1500.0, 2000.0, 2500.0, 3000.0 Da and string lengths L = 1 ... 1000. Precision 0.1 Da.

Mentions: Both tables have to be computed up to the largest sequence length Lmax in the sequence database and up to the largest integer fragment mass. For PMF using MALDI, mmax ≈ 3,000 Da, and for SwissProt as the sequence database, Lmax ≈ 10,000. For a mass precision of one decimal (0.1 Da), there are 30,000 integer masses, so storing doubles would require about 30,000·10,000·8 bytes = 2.4·10⁹ bytes ≈ 2.24 GiB of main memory for each table. As fL[·, ·] is only needed during computation of p[·, ·], only a very small part of it, about 3–4 MB, is required at any time. To efficiently compute the significance of an alignment score, however, the occurrence probability table p[·, ·] needs to be kept in memory.

Its columns can be computed independently, and the entries of each column depend smoothly on L (the occurrence probability does not change abruptly as the sequence length grows). It is thus sufficient to store the first 100 entries of each column completely and then only every 25th row, using linear interpolation to obtain intermediate values. Comparing the exact values in each column with the values computed by this interpolation scheme, we found the interpolation error to be smaller than 10⁻⁹ in every case. Note that the interpolation nodes are exact values, so the interpolation error does not accumulate with growing string length.

The mass occurrence probability p[L, m] is given for masses m = 1000.0, 1500.0, 2000.0, 2500.0, 3000.0 Da and a precision of 0.1 Da in Figures 4 and 5, for string lengths up to 50 and 1000, respectively, showing the continuous behavior of the function for L > 40. The "hump" at small string lengths can be explained by the fact that for these lengths, the only possible fragment of mass m is the whole string itself. For greater string lengths, the corresponding fragment(s) must be "real" fragments, subject to tighter constraints on their combinatorial character composition, e.g. they must have a cleavage character at the end. This "hump" is located around L ≈ m/μavg, where μavg denotes the average character mass. For average molecular masses and SwissProt frequencies we have μavg ≈ 111.2 Da. By further exploiting the fact that fL[l, m*] = 0 whenever l exceeds m* divided by the smallest integer character mass in Σ, both fL[l, m*] and p[L, m*] can be computed in time O(Lmax·). We refer the interested reader to [10] for details and proofs on the memory- and time-efficient implementation.
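The storage scheme for a single column of p[·, m] can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `node_lengths`, `SparseColumn` and `exact_value` are hypothetical, and keeping the final row Lmax as an extra node (so that no extrapolation is ever needed) is an assumption the text does not state. Only the numbers 100 and 25 come from the description above.

```python
from bisect import bisect_right

DENSE_ROWS = 100  # rows L = 1..100 are stored completely (from the text)
STRIDE = 25       # beyond that, only every 25th row is stored (from the text)


def node_lengths(l_max):
    """Return the string lengths L whose exact column entries are kept."""
    nodes = list(range(1, min(DENSE_ROWS, l_max) + 1))
    nodes.extend(range(DENSE_ROWS + STRIDE, l_max + 1, STRIDE))
    if nodes[-1] != l_max:
        # Assumption: also keep the last row so lookups never extrapolate.
        nodes.append(l_max)
    return nodes


class SparseColumn:
    """One column of the table, stored only at the interpolation nodes."""

    def __init__(self, exact_value, l_max):
        # exact_value(L) is a hypothetical callback returning the exact
        # entry for string length L; it is called at the nodes only.
        self.nodes = node_lengths(l_max)
        self.values = [exact_value(L) for L in self.nodes]

    def __call__(self, L):
        if not self.nodes[0] <= L <= self.nodes[-1]:
            raise ValueError("string length outside the stored range")
        i = bisect_right(self.nodes, L) - 1  # last node with length <= L
        lo = self.nodes[i]
        if lo == L:
            return self.values[i]  # nodes hold exact values
        hi = self.nodes[i + 1]
        t = (L - lo) / (hi - lo)
        # Linear interpolation between the two neighbouring exact nodes.
        return (1 - t) * self.values[i] + t * self.values[i + 1]


if __name__ == "__main__":
    # Stand-in column for demonstration only, not a real occurrence probability.
    column = SparseColumn(lambda L: 2.0 * L + 1.0, l_max=10_000)
    print(len(column.nodes), "stored entries instead of 10000")
    print(column(137))
```

Under these assumptions, a column with Lmax = 10,000 rows is reduced to 496 stored values (100 dense rows plus 396 rows at stride 25). Because every lookup interpolates between two exact nodes, an error made in one interval cannot propagate to the next, which matches the remark that the interpolation error does not accumulate.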

