RAId_DbS: peptide identification using database searches with realistic statistics.

Alves G, Ogurtsov AY, Yu YK - Biol. Direct (2007)

Affiliation: National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD 20894, USA.

ABSTRACT

Background: The key to mass-spectrometry-based proteomics is peptide identification. A major challenge in peptide identification is to obtain realistic E-values when assigning statistical significance to candidate peptides.

Results: Using a simple scoring scheme, we propose a database search method with theoretically characterized statistics. Taking into account possible skewness in the random variable distribution and the effect of finite sampling, we provide a theoretical derivation for the tail of the score distribution. For every experimental spectrum examined, we collect the scores of peptides in the database, and find good agreement between the collected score statistics and our theoretical distribution. Using Student's t-tests, we quantify the degree of agreement between the theoretical distribution and the score statistics collected. The t-tests may be used to measure the reliability of reported statistics. When combined with the reported P-value for a peptide hit using a score distribution model, this new measure prevents exaggerated statistics. Another feature of RAId_DbS is its capability of detecting multiple co-eluted peptides. The peptide identification performance and statistical accuracy of RAId_DbS are assessed and compared with several other search tools. The executables and data related to RAId_DbS are freely available upon request.

Figure 3: Performance analysis of methods tested. The methods compared are RAId_DbS, X! Tandem (v1.0), Mascot (v2.1), OMSSA (v2.0), and SEQUEST (v3.2). Panels (A) and (C) display the results from 6,734 spectra in profile format, while panels (B) and (D) display the results from 6,592 centroidized spectra obtained from [19]. In panels (A) and (B), typical ROC curves are shown with the number of false positives (FP) plotted along the abscissa and the number of true positives (TP) plotted along the ordinate. Thus, a curve closer to the upper-left corner implies better performance. To reveal the information in the region of small numbers of false positives, usually the region of most interest, the abscissa is plotted in log-scale. In panels (C) and (D), a different type of ROC curve is shown. Defining the cumulative number of true negatives as TN and the cumulative number of false negatives as FN, the ROC curves in panels (C) and (D) plot "1 – specificity" (FP/(FP + TN)) along the abscissa (also in log-scale) and the sensitivity (TP/(TP + FN)) along the ordinate. For each method tested, the area under the curve (AUC) of this type of ROC curve, computed with both axes in linear scale, is shown in parentheses in the figure legend. All AUC values have an uncertainty of about ± 0.005. Note that ROC curves of this type do not reflect the total number of correct hits, and methods that report very few negatives may show a lower specificity and superficially seem inferior. For example, X! Tandem may be penalized when evaluated using this type of ROC curve. Also note that in panel (D) the trend of AUC for Mascot, X! Tandem, and SEQUEST is consistent with previously reported results [14]. For X! Tandem, Mascot, OMSSA, and SEQUEST, the default parameters for each method were used in every search, except that the maximum number of miscleavages was set to 3 uniformly.
It is observed that analysis using profile data gives rise to better ROC curves than analysis using centroidized data. Although this may be because the profile data contain more information, it may also be caused by variations in spectral quality and sample concentration.
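
The two quantities plotted in panels (C) and (D), "1 – specificity" and sensitivity, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names `roc_points` and `auc`, the input layout (a list of score and true/false pairs), the convention that a higher score means a more confident hit, and the trapezoidal rule for the area are all assumptions made for illustration.

```python
from typing import List, Tuple


def roc_points(scored_hits: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (1 - specificity, sensitivity) points for a list of hits.

    Each hit is (score, is_true). Assumption: a higher score means a more
    confident hit; for E-values, pass e.g. the negated logarithm instead.
    """
    total_pos = sum(1 for _, is_true in scored_hits if is_true)
    total_neg = len(scored_hits) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one true hit and one false hit")

    # Sweep the acceptance threshold from the strictest to the most lenient.
    ranked = sorted(scored_hits, key=lambda hit: hit[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, is_true) in enumerate(ranked):
        if is_true:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all hits sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            # FP + TN is the number of false hits; TP + FN the number of true hits.
            points.append((fp / total_neg, tp / total_pos))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the curve on linear axes, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Hypothetical toy data: three true hits and two false hits.
    hits = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False)]
    pts = roc_points(hits)
    print(pts)
    print(auc(pts))
```

Because FP + TN and TP + FN are fixed totals, each point depends only on how many true and false hits lie above the current threshold, which is why this type of curve does not reflect the total number of correct hits a method reports.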

Mentions: Finally, we test the effectiveness of RAId_DbS in database retrieval along with several other search methods using Receiver Operating Characteristic (ROC) analysis. The results from spectra in profile and centroidized format are displayed in panels (A) and (B) of Fig. 3, respectively. Although the results in panel (A) seem to suggest that RAId_DbS performs better than X! Tandem and significantly better than the other methods, this may be largely because RAId_DbS is designed to take profile data while other methods may not be. This is supported by our other assessment using centroidized data published by the Institute for Systems Biology [19]. Data sets A1–A4 of [19] (consisting of 6,592 spectra) were used for this test. As seen in panel (B) of Fig. 3, the overall performance gain of RAId_DbS relative to other methods decreases. Nevertheless, this result indicates that by recording the spectrum in profile format, one may have a better chance of uncovering the true peptide(s). Although this may be because the profile data contain more information than centroid data, it may also be caused by variations in spectral quality and sample concentration.

