Evolution of biological sequences implies an extreme value distribution of type I for both global and local pairwise alignment scores.

Bastien O, Maréchal E - BMC Bioinformatics (2008)

Bottom Line: In particular, one parameter corresponds to the information hazard rate. It reflects the redundancy of information between homologous sequences, under functional conservative pressure. This model also provides a link between concepts of biological sequence analysis and of systems biology.


Affiliation: UMR 5168 CNRS-CEA-INRA-Université J. Fourier, Laboratoire de Physiologie Cellulaire Végétale, Département Réponse et Dynamique Cellulaire, CEA Grenoble, 17 rue des Martyrs, F-38054, Grenoble cedex 09, France. olivier.bastien@cea.fr

ABSTRACT

Background: Confidence in pairwise alignments of biological sequences, obtained by various methods such as BLAST or Smith-Waterman, is critical for automatic analyses of genomic data. Two statistical models have been proposed. In the asymptotic limit of long sequences, the Karlin-Altschul model is based on the computation of a P-value, assuming that the number of high-scoring matching regions above a threshold is Poisson distributed. Alternatively, the Lipman-Pearson model is based on the computation of a Z-value from a random score distribution obtained by Monte Carlo simulation. Following the TULIP theorem, Z-values allow the deduction of an upper bound of the P-value (1/Z-value²). Simulated Z-value distributions are known to fit a Gumbel law. This remarkable property had not been demonstrated and had no obvious biological support.
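The Z-value approach and the 1/Z-value² bound can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: the toy match/mismatch/gap scoring scheme, the choice of shuffling one sequence as the Monte Carlo null model, and all function names are assumptions made for this sketch.

```python
import random
import statistics

def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score (toy linear-gap scoring, assumed)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

def z_value(a, b, n_shuffles=200, seed=0):
    """Z-value of the observed score against scores of shuffled copies of b."""
    rng = random.Random(seed)
    observed = local_score(a, b)
    letters = list(b)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # null model assumed here: random permutation of b
        scores.append(local_score(a, "".join(letters)))
    sigma = statistics.stdev(scores)
    if sigma == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - statistics.mean(scores)) / sigma

def p_value_upper_bound(z):
    """Upper bound 1/Z^2 on the P-value, capped at 1 (uninformative for Z <= 1)."""
    return 1.0 if z <= 1 else 1.0 / z ** 2
```

Calling `p_value_upper_bound(z_value(seq_a, seq_b))` gives a conservative bound that needs no assumption about the shape of the score distribution, which is why it is looser than a P-value computed from a fitted Gumbel law.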

Results: We built a model of sequence evolution based on aging, as meant in Reliability Theory, using the fact that the amount of information shared between an initial sequence and the sequences in its lineage (i.e., mutual information in Information Theory) is a decreasing function of time. This quantity is simply measured by a sequence alignment score. In systems aging, the failure rate is related to the system's longevity. The system can be a machine with structured components, or a living entity or population. "Reliability" refers to the ability to operate properly according to a standard. Here, the "reliability" of a sequence refers to the ability to conserve a sufficient functional level at the level of the folded and mature protein (positive selection pressure). Homologous sequences were considered as systems 1) having a high redundancy of information, reflected by the magnitude of their alignment scores, and 2) whose components are the amino acids, which can independently be damaged by random DNA mutations. From these assumptions, we deduced that the information shared at each amino acid position evolved at a constant rate, corresponding to the information hazard rate, and that pairwise sequence alignment scores should follow a Gumbel distribution, whose parameters could find some theoretical rationale. In particular, one parameter corresponds to the information hazard rate.
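The connection between a constant hazard rate and a Gumbel law can be illustrated with a classical extreme-value result: the maximum of n independent exponential variables with rate λ, shifted by ln(n)/λ, approaches a Gumbel distribution with scale 1/λ. The sketch below is only a numerical illustration of that textbook fact, not the authors' derivation; the function names, sample sizes, and the rate value used in the usage note are assumptions.

```python
import math
import random

def centred_maxima(rate, n, replicates, seed=0):
    """Maxima of n exponential(rate) draws, shifted by ln(n)/rate."""
    rng = random.Random(seed)
    shift = math.log(n) / rate
    return [
        max(rng.expovariate(rate) for _ in range(n)) - shift
        for _ in range(replicates)
    ]

def gumbel_cdf(x, rate):
    """Gumbel (type I extreme value) CDF with location 0 and scale 1/rate."""
    return math.exp(-math.exp(-rate * x))

def empirical_cdf(sample, x):
    """Fraction of the sample that is less than or equal to x."""
    return sum(1 for v in sample if v <= x) / len(sample)
```

For example, with `sample = centred_maxima(0.287, 200, 2000)`, the values of `empirical_cdf(sample, x)` and `gumbel_cdf(x, 0.287)` should be close for any x, and the sample mean should be near the Gumbel mean γ/λ, where γ ≈ 0.5772 is the Euler-Mascheroni constant.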

Conclusion: The extreme value distribution of alignment scores, assessed from high-scoring segment pairs following the Karlin-Altschul model, can also be deduced from Reliability Theory applied to molecular sequences. It reflects the redundancy of information between homologous sequences, under functional conservative pressure. This model also provides a link between concepts of biological sequence analysis and of systems biology.


Figure 1: Aging properties of amino acids. Protein sequences are considered as systems whose components are amino acids. Over time, amino acids are either conserved (the similarity of a residue with its descendant is that of identity, the diagonal term of a substitution matrix) or modified by random DNA mutations. Similarity therefore decreases with time, since no similarity is higher than that of identity. When the similarity falls below the threshold required for the residue to operate according to a standard (functional conservation), the component is damaged. (A) Score distribution corresponding to valine substitution. In this case, the score distribution is exponential, suggesting that valine (V) is a non-aging component. Based on BLOSUM62, residues of this type are V, F, P, W, Y, E, G, H, I, L, K, R, N, D and C. (B) Score distribution corresponding to threonine substitution. The score distribution shows a peak, indicating a probable accelerated process of aging (functional damage) when the residue is substituted by random mutation with some other amino acids. Based on BLOSUM62, residues of this type are T, S, M, A and Q. (C) Score distribution in the BLOSUM62 similarity matrix. The complete distribution in the BLOSUM62 matrix is exponential, 0.287·exp(-0.287·(s + 4)), supporting a general model of amino acids as non-aging components. The exponential law for positive scores is characterized by the same parameter (λ' = 0.287). The original residue is termed i; its descendant is termed j.
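The "non-aging" reading of the exponential fit in panel (C) can be checked numerically: an exponential density has a hazard rate (density divided by survival probability) that does not depend on the score. The sketch below assumes the score is treated as a continuous variable with support starting at s = -4, the lowest BLOSUM62 score; the function names and the CDF obtained by integrating the caption's density are part of this illustration and are not taken from the article.

```python
import math

RATE = 0.287    # lambda' from the fit reported in the caption
OFFSET = 4.0    # shift so that the support starts at s = -4 (assumed)

def density(s):
    """Fitted density 0.287 * exp(-0.287 * (s + 4)) for s >= -4, else 0."""
    return RATE * math.exp(-RATE * (s + OFFSET)) if s >= -OFFSET else 0.0

def cdf(s):
    """Integral of the density from -4 to s: probability of a score <= s."""
    return 1.0 - math.exp(-RATE * (s + OFFSET)) if s >= -OFFSET else 0.0

def hazard(s):
    """Hazard rate density / (1 - cdf), defined here for s >= -4."""
    return density(s) / (1.0 - cdf(s))
```

Evaluating `hazard(s)` at any score from -4 upward returns the same value, 0.287, up to floating-point error. A peaked distribution such as the one in panel (B) would instead give a hazard that changes with s.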

Mentions: Over time, an amino acid i is either conserved or substituted. The similarity of i in an initial sequence compared with residues at the same position in protein descendants is therefore either that of identity (the diagonal term in the scoring matrix) or a lower value (no score is higher than that of identity). On average, the magnitude of the similarity of i compared with its descendants, related to mutual information following equation (13), is therefore a decreasing function of elapsed time. From a functional point of view, the probability that i was mutated into a residue with a score Si lower than a threshold si, defined to allow the component to operate like i, can be deduced from the distribution of substitution scores. For most amino acids (F, P, W, Y, V, E, G, H, I, L, K, R, N, D and C), the distribution of scores deduced from BLOSUM62 fits an exponential distribution (see the case of valine in Figure 1A). For five amino acids (M, S, T, A and Q), the distribution of scores does not fit an exponential distribution (see the case of threonine in Figure 1B). Taking the average situation, the distribution of scores deduced from the BLOSUM62 matrix is exponential-like (Figure 1C), supporting a general model for the distribution of amino acid mutual information: The probability Pr that a residue i is mutated into a residue with mutual information below si is:

