Pairwise statistical significance of local sequence alignment using multiple parameter sets and empirical justification of parameter set change penalty.

Agrawal A, Huang X - BMC Bioinformatics (2009)

Bottom Line: The results of pairwise statistical significance using multiple parameter sets are shown to be significantly better than database statistical significance estimates reported by BLAST and PSI-BLAST, and comparable to and at times significantly better than those of SSEARCH. The fact that the homology detection performance does not degrade when using multiple parameter sets is strong evidence for the validity of the assumption that the alignment score distribution follows an extreme value distribution even when using multiple parameter sets. Pairwise statistical significance using multiple parameter sets can be effectively used to determine the relatedness of one or a few pairs of sequences without performing a time-consuming database search.


Affiliation: Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, IA 50011-1041, USA. ankitag@iastate.edu

ABSTRACT

Background: Accurate estimation of statistical significance of a pairwise alignment is an important problem in sequence comparison. Recently, a comparative study of pairwise statistical significance with database statistical significance was conducted. In this paper, we extend the earlier work on pairwise statistical significance by incorporating with it the use of multiple parameter sets.

Results: Results for a knowledge discovery application of homology detection reveal that using multiple parameter sets for pairwise statistical significance estimates gives better coverage than using a single parameter set, at least at some error levels. Further, the results of pairwise statistical significance using multiple parameter sets are shown to be significantly better than database statistical significance estimates reported by BLAST and PSI-BLAST, and comparable to and at times significantly better than those of SSEARCH. Using non-zero values of the parameter set change penalty gives better performance than a zero penalty.

Conclusion: The fact that the homology detection performance does not degrade when using multiple parameter sets is strong evidence for the validity of the assumption that the alignment score distribution follows an extreme value distribution even when using multiple parameter sets. The parameter set change penalty is a useful parameter for alignment using multiple parameter sets. Pairwise statistical significance using multiple parameter sets can be effectively used to determine the relatedness of one or a few pairs of sequences without performing a time-consuming database search.
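As a rough illustration of the extreme value assumption mentioned in the conclusion, the sketch below computes a P-value from the Gumbel (type I extreme value) form P(S >= x) = 1 - exp(-exp(-lambda (x - mu))). The function name and the parameter values are hypothetical; how the paper fits the scale parameter lambda and the location parameter mu for a given sequence pair is not described in this abstract.

```python
import math


def gumbel_p_value(score, lam, mu):
    """P-value of an alignment score under a Gumbel (extreme value) model.

    lam (scale) and mu (location) are assumed to have been estimated
    elsewhere for the specific sequence pair; the values used below are
    illustrative only.
    """
    # 1 - exp(-exp(-lam * (score - mu))), written with expm1 so that very
    # small P-values do not lose precision to cancellation.
    return -math.expm1(-math.exp(-lam * (score - mu)))


# Hypothetical parameters: a higher score yields a smaller P-value.
p_low_score = gumbel_p_value(60.0, lam=0.25, mu=50.0)
p_high_score = gumbel_p_value(100.0, lam=0.25, mu=50.0)
```

At a score equal to mu the P-value is 1 - 1/e, and it decays roughly exponentially at rate lambda for scores well above mu.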

Figure 1: Pairwise statistical significance using two parameter sets. Errors per Query vs. Coverage plot for pairwise statistical significance using two parameter sets, along with the curves using the corresponding single parameter sets. (a) BLOSUM45, BLOSUM50; (b) BLOSUM45, BLOSUM62; (c) BLOSUM45, BLOSUM80; (d) BLOSUM50, BLOSUM62; (e) BLOSUM50, BLOSUM80; (f) BLOSUM62, BLOSUM80. In five of the six panels, (b) through (f), using two parameter sets leads to better coverage than using a single parameter set at most error levels.
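The Errors per Query (EPQ) vs. Coverage curves in the figure can be read with the following sketch in mind. It assumes the definition commonly used in homology detection benchmarks: comparisons are sorted from most to least significant, and at each threshold EPQ is the number of false positives divided by the number of queries, while coverage is the fraction of true homolog pairs recovered. The function name, the input format, and the toy data are hypothetical and are not taken from the paper.

```python
def epq_coverage_curve(hits, num_queries, num_true_pairs):
    """Return a list of (errors_per_query, coverage) points.

    hits: iterable of (p_value, is_homolog) tuples, one per comparison.
    num_queries: number of query sequences in the benchmark.
    num_true_pairs: total number of true homolog pairs in the benchmark.
    """
    curve = []
    true_pos = 0
    false_pos = 0
    # Walk through the comparisons from most to least significant.
    for _, is_homolog in sorted(hits, key=lambda h: h[0]):
        if is_homolog:
            true_pos += 1
        else:
            false_pos += 1
        curve.append((false_pos / num_queries, true_pos / num_true_pairs))
    return curve


# Toy data: four comparisons, two queries, four true homolog pairs overall.
toy_hits = [(0.5, False), (0.001, True), (0.02, True), (0.01, False)]
toy_curve = epq_coverage_curve(toy_hits, num_queries=2, num_true_pairs=4)
```

A method whose curve lies further to the right at the same EPQ recovers more true homologs at that error level.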

Mentions: Of the 15 substitution matrix combinations, 4 use a single parameter set, 6 use two parameter sets, 4 use three parameter sets, and 1 uses all four parameter sets. The EPQ vs. Coverage curves using pairwise statistical significance with two, three, and four parameter sets are presented in Figures 1, 2, and 3, respectively. For comparison purposes, the EPQ vs. Coverage curves using the corresponding single parameter sets are also presented in the same figures. The y-axis (error axis) in all these graphs is in log scale, and hence there is more information in the upper part of the graphs. These figures suggest that pairwise statistical significance using multiple parameter sets performs comparably to, and sometimes significantly better than, pairwise statistical significance using a single parameter set, for most of the single parameter sets and at most error levels.
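The count of 15 combinations follows from taking every non-empty subset of four substitution matrices. A minimal sketch is given below, assuming the four matrices are the ones named in the Figure 1 caption (BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80); the variable names are illustrative.

```python
from itertools import combinations

# The four substitution matrices named in the Figure 1 caption.
matrices = ["BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80"]

# Group every non-empty subset of the matrices by its size.
combos_by_size = {
    k: list(combinations(matrices, k)) for k in range(1, len(matrices) + 1)
}

# Number of combinations of each size, and the overall total.
counts = {k: len(v) for k, v in combos_by_size.items()}
total = sum(counts.values())
```

Here counts is {1: 4, 2: 6, 3: 4, 4: 1} and total is 15, matching the breakdown in the text. The six pairs in combos_by_size[2] come out in the same order as panels (a) through (f) of Figure 1.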

