A probabilistic model of local sequence alignment that simplifies statistical significance estimation.

Eddy SR - PLoS Comput. Biol. (2008)

Bottom Line: Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores ("Viterbi" scores) are Gumbel-distributed with constant lambda = log 2, and the high scoring tail of Forward scores is exponential with the same constant lambda. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.


Affiliation: Howard Hughes Medical Institute, Janelia Farm Research Campus, Ashburn, Virginia, United States of America. eddys@janelia.hhmi.org

ABSTRACT
Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (lambda) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty ("Forward" scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores ("Viterbi" scores) are Gumbel-distributed with constant lambda = log 2, and the high scoring tail of Forward scores is exponential with the same constant lambda. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.


Viterbi scores follow Gumbel distributions with constant λ. (A) A histogram showing the distribution of λ̂ estimates determined by maximum likelihood Gumbel fits to multihit local Viterbi scores of n = 10^5 i.i.d. random sequences of length L = 400, for 9,318 profile HMMs built from Pfam 22.0 seed alignments. The sharp black peak is from prototype HMMER3, with mean 0.6928 and standard deviation 0.0114, and extreme outliers indicated by arrows. The broader grey histogram is from old HMMER2, for comparison. The conjectured λ = log 2 is shown as a vertical dotted red line. (B,C) Log survival plots (P(V>t) on a log scale, versus score threshold t) showing observed versus expected distributions for multihit local Viterbi scores for two typical Pfam models, RRM_1 and Caudal_act, for n = 10^8 i.i.d. random sequences of length L = 400. On a log survival plot, the high-scoring tail of a Gumbel distribution is a straight line with slope −λ. Black circles show the observed data. The black lines show maximum likelihood fitted Gumbel distributions, with λ̂ estimates as indicated. The red lines show the conjectured λ = log 2 Gumbel distributions, with μ fitted by maximum likelihood. (D,E) Log survival plots for the extreme outliers DUF851 and Sulfakinin, as described in the text.
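
The straight-line tail follows directly from the Gumbel survival function. A short derivation, assuming the standard parameterization with location μ and scale parameter λ (the symbols V, t, μ and λ are those of the caption; the parameterization itself is an assumption here, not quoted from the source):

```latex
\[
P(V > t) \;=\; 1 - \exp\!\left(-e^{-\lambda (t - \mu)}\right)
\;\approx\; e^{-\lambda (t - \mu)} \quad \text{for } t \gg \mu,
\qquad\text{so}\qquad
\log P(V > t) \;\approx\; -\lambda\,(t - \mu).
\]
```

The approximation uses 1 − exp(−y) ≈ y for small y. With λ = log 2, the tail becomes P(V > t) ≈ 2^−(t − μ), so each additional bit of score halves the tail probability.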

Mentions: Viterbi bit scores are predicted to be Gumbel distributed with parametric λ = log 2. To test this prediction on many different profile HMMs, I estimated λ̂ (λ̂ represents a maximum likelihood estimate fitted to a finite sample of scores, as distinguished from the parametric true λ) for 9,318 different profile HMMs built from Pfam 22.0 seed alignments, by collecting multihit local Viterbi score distributions for n = 10^5 i.i.d. random sequences of length 400 generated with the same residue frequencies as the model R. Figure 2 shows the results of maximum likelihood fitting these scores to Gumbel distributions. The 9,318 λ̂ estimates are tightly clustered with mean 0.6928, consistent with the conjecture that λ = log 2 = 0.6931.
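
The fitting step described above can be illustrated with a minimal sketch. This is not the HMMER code and not necessarily the procedure used in the paper: it draws synthetic scores directly from a Gumbel distribution (standing in for Viterbi scores of random sequences), then recovers λ̂ and μ̂ by solving the standard Gumbel maximum likelihood equations with bisection. The function names `sample_gumbel`, `fit_gumbel_ml` and `fit_mu_fixed_lambda`, the value μ = −5, and the sample size are all hypothetical choices for illustration.

```python
import math
import random


def sample_gumbel(n, mu, lam, rng):
    """Draw n samples from a Gumbel (maximum) distribution by inverse CDF."""
    out = []
    while len(out) < n:
        u = rng.random()
        if 0.0 < u < 1.0:  # skip u == 0 so that log(-log(u)) is defined
            out.append(mu - math.log(-math.log(u)) / lam)
    return out


def _shifted_weights(scores, lam):
    """Return min(scores) and exp(-lam * (x - min)) per score (shift avoids overflow)."""
    xmin = min(scores)
    return xmin, [math.exp(-lam * (x - xmin)) for x in scores]


def fit_mu_fixed_lambda(scores, lam):
    """ML estimate of mu for a known lambda (e.g. the conjectured log 2)."""
    xmin, w = _shifted_weights(scores, lam)
    return xmin - math.log(sum(w) / len(scores)) / lam


def fit_gumbel_ml(scores, lo=1e-3, hi=100.0, iters=60):
    """ML estimates (mu_hat, lam_hat) for a complete sample of Gumbel scores.

    Solves 1/lam - mean(x) + sum(x * w) / sum(w) = 0, with w = exp(-lam * x),
    by bisection. The left-hand side is strictly decreasing in lam, so the
    root inside the bracket [lo, hi] is unique. The bracket is an assumption.
    """
    mean = sum(scores) / len(scores)

    def g(lam):
        _, w = _shifted_weights(scores, lam)
        weighted_mean = sum(wi * x for wi, x in zip(w, scores)) / sum(w)
        return 1.0 / lam - mean + weighted_mean

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam_hat = 0.5 * (lo + hi)
    return fit_mu_fixed_lambda(scores, lam_hat), lam_hat


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the demo is repeatable
    scores = sample_gumbel(20000, mu=-5.0, lam=math.log(2), rng=rng)
    mu_hat, lam_hat = fit_gumbel_ml(scores)
    print(f"lambda_hat = {lam_hat:.4f} (log 2 = {math.log(2):.4f})")
    print(f"mu_hat     = {mu_hat:.4f}")
    print(f"mu with lambda fixed at log 2 = {fit_mu_fixed_lambda(scores, math.log(2)):.4f}")
```

`fit_mu_fixed_lambda` mirrors the case where λ is taken as the conjectured log 2 and only μ is fitted, which removes the need to estimate λ from simulation at all.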

