RAId_DbS: peptide identification using database searches with realistic statistics.

Alves G, Ogurtsov AY, Yu YK - Biol. Direct (2007)

Bottom Line: Another feature of RAId_DbS is its capability of detecting multiple co-eluted peptides. The peptide identification performance and statistical accuracy of RAId_DbS are assessed and compared with several other search tools. The executables and data related to RAId_DbS are freely available upon request.

Affiliation: National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD 20894, USA.

ABSTRACT

Background: The key to mass-spectrometry-based proteomics is peptide identification. A major challenge in peptide identification is to obtain realistic E-values when assigning statistical significance to candidate peptides.

Results: Using a simple scoring scheme, we propose a database search method with theoretically characterized statistics. Taking into account possible skewness in the random variable distribution and the effect of finite sampling, we provide a theoretical derivation for the tail of the score distribution. For every experimental spectrum examined, we collect the scores of peptides in the database, and find good agreement between the collected score statistics and our theoretical distribution. Using Student's t-tests, we quantify the degree of agreement between the theoretical distribution and the score statistics collected. The t-tests may be used to measure the reliability of reported statistics. When combined with the reported P-value for a peptide hit using a score distribution model, this new measure prevents exaggerated statistics. Another feature of RAId_DbS is its capability of detecting multiple co-eluted peptides. The peptide identification performance and statistical accuracy of RAId_DbS are assessed and compared with several other search tools. The executables and data related to RAId_DbS are freely available upon request.

Figure 2: Average cumulative number of false positives versus E-values. In theory, the average number of false positives with E-values less than or equal to a cutoff Ec should be Ec, provided that the number of trials is large enough. The accuracy of E-values assigned by RAId_DbS is tested along with three other methods: X! Tandem (v1.0), Mascot (v2.1), and OMSSA (v2.0). For the X! Tandem, Mascot, and OMSSA searches, the default parameters of each program are used except for the maximum number of miscleavages, which is set to 3 uniformly for this test. The diagonal solid line in each panel is the theoretical line. There are two curves associated with each method. The dashed line corresponds to the results using regular nr. The solid line corresponds to the results using nr with cluster removal, which we anticipate to be a better representative of a random database. See text for additional details.
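The calibration check described in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name, the toy E-values, and the choice to average over the number of query spectra are all assumptions made for the example. For a well-calibrated method the observed value at each cutoff Ec should lie close to Ec itself, which is the diagonal line in the figure.

```python
from bisect import bisect_right


def average_cumulative_false_positives(fp_evalues, num_spectra, cutoffs):
    """For each cutoff Ec, count false-positive hits with E <= Ec and
    divide by the number of spectra searched (assumed averaging unit)."""
    if num_spectra <= 0:
        raise ValueError("num_spectra must be positive")
    ordered = sorted(fp_evalues)
    # bisect_right gives the number of sorted values that are <= ec.
    return [bisect_right(ordered, ec) / num_spectra for ec in cutoffs]


# Hypothetical toy data: E-values of false-positive hits pooled over 4 spectra.
toy_fp_evalues = [0.05, 0.4, 0.9, 0.9, 2.5, 7.0]
toy_cutoffs = [0.1, 1.0, 10.0]
toy_curve = average_cumulative_false_positives(toy_fp_evalues, 4, toy_cutoffs)

for ec, observed in zip(toy_cutoffs, toy_curve):
    # A point above the diagonal (observed > Ec) means the reported
    # E-values are too optimistic; a point below means too conservative.
    print(f"Ec={ec:g}  observed={observed:.2f}  theoretical={ec:g}")
```

Plotting the observed values against the cutoffs on log-log axes gives a curve of the kind shown in each panel; the toy numbers here carry no information about any of the four search tools.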

Mentions: To further test the statistical accuracy of RAId_DbS and a few other search methods reporting E-values, we compare the reported E-values versus cumulative false positives. The results of the statistical accuracy test are summarized in Fig. 2 and its caption. Two databases are used: the NCBI's nr protein database and nr after cluster removal (CR). CR is done as follows. Each of the eight protein chains is used as a query to search against the NCBI's nr protein database. Protein hits in nr that align with any of the eight query chains with E-values less than 10^-15 are removed from the database. This procedure removes 1,848 proteins from nr, which originally contains 1,486,014 proteins.
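The cluster-removal step amounts to a simple filter over alignment hits. The sketch below assumes the hits are already available as (query, protein identifier, E-value) tuples; the function name, the tuple layout, and the toy identifiers are hypothetical and are not taken from the paper. Only the strict threshold of 10^-15 comes from the text.

```python
def remove_clusters(database_ids, alignment_hits, evalue_threshold=1e-15):
    """Drop every database protein that aligns to any query chain with an
    E-value strictly below the threshold. Returns (kept_ids, removed_ids)."""
    flagged = {
        protein_id
        for _query, protein_id, evalue in alignment_hits
        if evalue < evalue_threshold
    }
    kept = [pid for pid in database_ids if pid not in flagged]
    return kept, flagged


# Hypothetical toy database and alignment hits.
toy_db = ["P1", "P2", "P3", "P4"]
toy_hits = [
    ("chainA", "P2", 1e-40),  # strong match to a query chain: removed
    ("chainB", "P3", 1e-3),   # weak match: kept
    ("chainA", "P4", 1e-15),  # equal to the threshold, not below it: kept
]
kept_ids, removed_ids = remove_clusters(toy_db, toy_hits)
print(kept_ids, sorted(removed_ids))
```

With the counts quoted above, removing 1,848 of 1,486,014 proteins leaves 1,484,166 entries in the cluster-removed database.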

