A probabilistic model of local sequence alignment that simplifies statistical significance estimation.
Bottom Line:
Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores ("Viterbi" scores) are Gumbel-distributed with constant lambda = log 2, and the high scoring tail of Forward scores is exponential with the same constant lambda. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.
View Article:
PubMed Central - PubMed
Affiliation: Howard Hughes Medical Institute, Janelia Farm Research Campus, Ashburn, Virginia, United States of America. eddys@janelia.hhmi.org
ABSTRACT
Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (lambda) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty ("Forward" scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores ("Viterbi" scores) are Gumbel-distributed with constant lambda = log 2, and the high scoring tail of Forward scores is exponential with the same constant lambda. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.
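The two conjectured distributions can be turned into P-values and E-values with a few lines of arithmetic. The sketch below is illustrative only: the location parameters `mu` (Gumbel) and `tau` (exponential tail) and the database size `n_targets` are hypothetical inputs that the abstract does not specify, and the exponential form is meant only for the high-scoring tail of Forward scores.

```python
import math

LAMBDA = math.log(2)  # conjectured constant for bit scores


def viterbi_pvalue(score, mu, lam=LAMBDA):
    """P(S > score) under a Gumbel distribution with location mu (assumed known)."""
    # 1 - exp(-exp(-lam * (score - mu))), written with expm1 for accuracy in the tail
    return -math.expm1(-math.exp(-lam * (score - mu)))


def forward_pvalue(score, tau, lam=LAMBDA):
    """P(S > score) under an exponential tail starting at tau (assumed known)."""
    # Only meaningful for high scores; capped at 1 for scores below tau
    return min(1.0, math.exp(-lam * (score - tau)))


def evalue(pvalue, n_targets):
    """Expected number of target sequences scoring at least this well by chance."""
    return pvalue * n_targets
```

With lambda = log 2, each additional bit of Forward score halves the tail probability, so `forward_pvalue(tau + 10, tau)` equals 2^-10.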
Mentions: Viterbi bit scores are predicted to be Gumbel distributed with parametric λ = log 2. To test this prediction on many different profile HMMs, I estimated λ̂ (where λ̂ denotes a maximum likelihood estimate fitted to a finite sample of scores, as distinguished from the true parametric λ) for 9,318 different profile HMMs built from Pfam 22.0 seed alignments, by collecting multihit local Viterbi score distributions for n = 10^5 i.i.d. random sequences of length 400 generated with the same residue frequencies as the model R. Figure 2 shows the results of maximum likelihood fitting of these scores to Gumbel distributions. The 9,318 λ̂ estimates are tightly clustered with mean 0.6928, consistent with the conjecture that λ = log 2 = 0.6931.
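The fitting step can be illustrated with a generic maximum likelihood Gumbel fit. This is a minimal sketch, not the procedure used in the study: it draws synthetic scores directly from a Gumbel distribution rather than scoring random sequences against profile HMMs, it solves the likelihood equation by bisection, and the function names, the bracket `[1e-6, 100]`, and the sample size in the demo are all assumptions made for illustration.

```python
import math
import random


def sample_gumbel(rng, mu, lam):
    """Draw one Gumbel(mu, lam) variate by inverting the CDF exp(-exp(-lam*(x-mu)))."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); vanishingly rare
        u = rng.random()
    return mu - math.log(-math.log(u)) / lam


def fit_gumbel_ml(scores, tol=1e-9):
    """Return (mu_hat, lambda_hat) maximizing the Gumbel likelihood of the scores."""
    n = len(scores)
    xmin = min(scores)
    mean = sum(scores) / n
    shifted = [x - xmin for x in scores]  # shift for numerical stability

    def likelihood_eq(lam):
        # d(logL)/d(lam) / n after eliminating mu; strictly decreasing in lam
        w = [math.exp(-lam * d) for d in shifted]
        weighted_mean = sum(wi * x for wi, x in zip(w, scores)) / sum(w)
        return 1.0 / lam - mean + weighted_mean

    lo, hi = 1e-6, 100.0  # assumed to bracket the root
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if likelihood_eq(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    mean_w = sum(math.exp(-lam * d) for d in shifted) / n
    mu = xmin - math.log(mean_w) / lam
    return mu, lam


def demo(n=20000, seed=0):
    """Fit synthetic scores drawn with lambda = log 2 and return the estimates."""
    rng = random.Random(seed)
    scores = [sample_gumbel(rng, 5.0, math.log(2)) for _ in range(n)]
    return fit_gumbel_ml(scores)
```

Calling `demo()` returns a pair `(mu_hat, lambda_hat)`; with a few tens of thousands of samples the fitted λ̂ should land close to log 2, mirroring in miniature the comparison of λ̂ against the conjectured value.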