RAId_DbS: peptide identification using database searches with realistic statistics.

Alves G, Ogurtsov AY, Yu YK - Biol. Direct (2007)

Bottom Line: Another feature of RAId_DbS is its capability of detecting multiple co-eluted peptides. The peptide identification performance and statistical accuracy of RAId_DbS are assessed and compared with several other search tools. The executables and data related to RAId_DbS are freely available upon request.


Affiliation: National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD 20894, USA.

ABSTRACT

Background: The key to mass-spectrometry-based proteomics is peptide identification. A major challenge in peptide identification is to obtain realistic E-values when assigning statistical significance to candidate peptides.

Results: Using a simple scoring scheme, we propose a database search method with theoretically characterized statistics. Taking into account possible skewness in the random variable distribution and the effect of finite sampling, we provide a theoretical derivation for the tail of the score distribution. For every experimental spectrum examined, we collect the scores of peptides in the database, and find good agreement between the collected score statistics and our theoretical distribution. Using Student's t-tests, we quantify the degree of agreement between the theoretical distribution and the score statistics collected. The t-tests may be used to measure the reliability of reported statistics. When combined with the reported P-value for a peptide hit using a score distribution model, this new measure prevents exaggerated statistics. Another feature of RAId_DbS is its capability of detecting multiple co-eluted peptides. The peptide identification performance and statistical accuracy of RAId_DbS are assessed and compared with several other search tools. The executables and data related to RAId_DbS are freely available upon request.


Figure 1: Comparison of score histogram versus theoretical distribution. A randomly picked query spectrum is used to score peptides in NCBI's nr database. For this query spectrum, nine hundred unit intensity peaks were added to the processed spectrum to match S_u. In panel (A), the red staircase represents the histogram of scores computed using Eq. (1) with w_i = 1, while the blue line represents the theoretical distribution predicted from peptides with n = 44 theoretical peaks. In panel (B), scores computed using Eq. (1) with w_i(m_i) = exp(-Δ m_i) for peptides with different numbers of theoretical peaks are collected, resulting in the overall score histogram represented by the red staircase. The solid curve plots our fitting of the histogram using Eq. (17), where the fitting variables are β, γ ≡ n/(6⟨x^2⟩ β^2) and an overall constant.

Mentions: Eq. (17) is derived for fixed n (the number of peaks used to score). Using a random database, if one were to score only peptides with the same number of theoretical peaks, one should be able to obtain the distribution with the overall constant as the only fitting parameter. This is tested by using w_i = 1 in Eq. (1). In Fig 1(A), we show the score histogram from scoring a query spectrum against peptides within NCBI's nr database. Only scores from peptides with 44 theoretical peaks are included. Once the score histogram is normalized, we first find S_u, the highest point of the histogram. The number of unit intensity peaks in the processed/filtered spectrum is then determined by S_u through S_u = ⟨ln I⟩. All the cumulants are then calculated using the processed/filtered spectrum, and the only free parameter left is the overall constant. By plotting the normalized histogram and the expression in Eq. (17), without the overall constant, on a linear-log scale, one may determine the overall shift (the logarithm of that constant) needed through regression. The solid curve in Fig 1(A) is the theoretical distribution from Eq. (17) with the overall constant fitted through a least squares procedure.
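
The regression step described above can be illustrated with a minimal sketch. It assumes that the only unknown is a multiplicative constant C, so that on a logarithmic scale the fit reduces to a single vertical shift log(C); the least-squares value of that shift is the mean difference between the log of the histogram and the log of the unscaled curve. Eq. (17) is not reproduced on this page, so `toy_shape` below is a hypothetical stand-in, and the function name `fit_log_shift` and all numerical values are illustrative assumptions rather than part of RAId_DbS.

```python
import math


def fit_log_shift(hist, shape):
    """Least-squares estimate of log(C) such that C * shape ~ hist on a log scale.

    Minimizing sum_i (log hist_i - log shape_i - a)^2 over a gives the mean of
    the log differences. Bins where either value is non-positive are skipped
    because the logarithm is undefined there.
    """
    diffs = [
        math.log(h) - math.log(f)
        for h, f in zip(hist, shape)
        if h > 0 and f > 0
    ]
    if not diffs:
        raise ValueError("no bins with positive histogram and shape values")
    return sum(diffs) / len(diffs)


def toy_shape(s):
    """Hypothetical unscaled curve standing in for Eq. (17) without its constant."""
    return math.exp(-0.5 * s * s)


# Synthetic "normalized histogram": the toy curve scaled by a known constant.
scores = [0.5 * k for k in range(-6, 7)]
true_c = 0.4  # assumed value, used only to check the fit
hist = [true_c * toy_shape(s) for s in scores]

log_c = fit_log_shift(hist, [toy_shape(s) for s in scores])
fitted_c = math.exp(log_c)
print(f"fitted overall constant: {fitted_c:.6f}")
```

With noise-free synthetic data the fitted constant recovers the assumed one. With a real histogram, empty bins are skipped, and sparsely populated tail bins would carry the same weight as well-populated ones unless a weighting scheme is added.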

