A nonparametric model for quality control of database search results in shotgun proteomics.

Zhang J, Li J, Liu X, Xie H, Zhu Y, He F - BMC Bioinformatics (2008)

Bottom Line: Validation of database search results creates a bottleneck in MS/MS data processing. Using the nonparametric model, a more flexible DF can be obtained, resulting in improved sensitivity and good FPR estimation. This nonparametric statistical technique is a powerful tool for tackling the complexity and diversity of datasets in shotgun proteomics.


Affiliation: College of Mechanical & Electronic Engineering and Automatization, National University of Defense Technology, Changsha, 410073, China. zhangjy@hupo.org.cn

ABSTRACT

Background: Analysis of complex samples with tandem mass spectrometry (MS/MS) has become routine in proteomic research. However, validation of database search results creates a bottleneck in MS/MS data processing. Recently, methods based on a randomized database have become popular for quality control of database search results. It remains unclear, however, how to combine different database search scores to improve the sensitivity of randomized database methods.

Results: In this paper, a multivariate nonlinear discriminant function (DF) based on the multivariate nonparametric density estimation technique was used to filter out false-positive database search results with a predictable false positive rate (FPR). Application of this method to control datasets from different instruments (LCQ, LTQ, and LTQ/FT) yielded an estimated FPR close to the actual FPR. As expected, the method was more sensitive when more features were used. Furthermore, the new method was shown to be more sensitive than two commonly used methods on three complex sample datasets and three control datasets.

Conclusion: Using the nonparametric model, a more flexible DF can be obtained, resulting in improved sensitivity and good FPR estimation. This nonparametric statistical technique is a powerful tool for tackling the complexity and diversity of datasets in shotgun proteomics.


Figure 8: The partitions of k-means clustering before (A) and after (B) normalization (z-score) of the features. Blue and red points represent different clusters. The observations derive from the control dataset. Records with larger Xcorr and ΔCn are more likely to be positive results. The partition given by k-means clustering using the observed values is based on Xcorr; ΔCn has no effect. After normalization, the partition is more consistent with the empirical knowledge.

Mentions: K-means clustering [59] is commonly used to partition observations into groups according to a defined distance measure (such as Euclidean distance). The optimization goal of k-means clustering is to find a partition in which objects within each cluster are as close as possible to each other and as far as possible from objects in other clusters. In practice, however, the scale of each feature significantly affects the clustering results when Euclidean distance is used. In our application, Xcorr and ΔCn were the two main features. Xcorr is a floating-point value that is typically around 2.5 but may exceed 10; ΔCn lies in the range [0, 1]. When the observed values are used directly in k-means clustering, Xcorr dominates the partition results (Figure 8) because the distance (formula 5) between two observations (Xcorr_i, ΔCn_i), i = 1, 2, is mainly determined by Xcorr, which has the larger scale.
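The scale effect described above can be illustrated with a minimal sketch. It is not the authors' implementation: the eight (Xcorr, ΔCn) pairs are invented toy values, the helper names `zscore` and `kmeans` are hypothetical, the choice of initial centers is arbitrary, and formula 5 (not reproduced here) is assumed to be the ordinary Euclidean distance. The sketch runs a plain two-cluster k-means once on the raw values and once on z-scored values.

```python
import math


def zscore(values):
    """Standardize one feature column: (x - mean) / std (population std)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def kmeans(points, centers, n_iter=50):
    """Plain Lloyd's algorithm with Euclidean distance; returns cluster labels."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels


# Hypothetical toy records (invented values, not data from the paper).
# Four records have low delta-Cn and four have high delta-Cn; Xcorr overlaps.
xcorr = [2.0, 1.5, 2.5, 3.5, 2.5, 4.5, 5.0, 3.5]
delta_cn = [0.05, 0.02, 0.03, 0.04, 0.45, 0.40, 0.55, 0.50]

raw_points = list(zip(xcorr, delta_cn))
z_points = list(zip(zscore(xcorr), zscore(delta_cn)))

# Assumption: start from the first and last record in both runs so that the
# two partitions are directly comparable.
raw_labels = kmeans(raw_points, [raw_points[0], raw_points[-1]])
z_labels = kmeans(z_points, [z_points[0], z_points[-1]])

print("raw     :", raw_labels)
print("z-scored:", z_labels)
```

On this toy data the raw partition is a cut on Xcorr alone (records with Xcorr of 2.5 or less versus 3.5 or more), so a record with high ΔCn but modest Xcorr is grouped with the low-ΔCn records. After z-scoring, both features contribute comparably to the distance and the partition separates the low-ΔCn records from the high-ΔCn records. The outcome depends on the toy values and the initial centers chosen here.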

