Universal sequence map (USM) of arbitrary discrete sequences.

Almeida JS, Vinga S - BMC Bioinformatics (2002)

Bottom Line: The technique, named Universal Sequence Mapping (USM), is applicable to sequences of arbitrary length with an arbitrary number of unique units, and generates a representation in which map distance estimates sequence similarity. The novel USM procedure is based on earlier work by these and other authors on the properties of Chaos Game Representation (CGR). The latter enables the representation of sequences with 4 unit types (like DNA) as an order-free Markov chain transition table.

Affiliation: Dept Biometry & Epidemiology, Medical Univ South Carolina, 135 Cannon street, Suite 303, PO Box 250835, Charleston, SC 29425, USA. almeidaj@musc.edu

ABSTRACT

Background: For over a decade the idea of representing biological sequences in a continuous coordinate space has maintained its appeal but has not been fully realized. The basic idea is that any sequence of symbols may define trajectories in the continuous space that conserve all its statistical properties. Ideally, such a representation would allow scale-independent sequence analysis, without the constraint of a fixed memory length. A simple example would be the ability to infer the homology between two sequences solely by comparing the coordinates of any two homologous units.

Results: We have identified an iterative function for the bijective mapping, psi, of discrete sequences into objects in a continuous state space that enables scale-independent sequence analysis. The technique, named Universal Sequence Mapping (USM), is applicable to sequences of arbitrary length with an arbitrary number of unique units, and generates a representation in which map distance estimates sequence similarity. The novel USM procedure is based on earlier work by these and other authors on the properties of Chaos Game Representation (CGR). The latter enables the representation of sequences with 4 unit types (like DNA) as an order-free Markov chain transition table. The properties of USM are illustrated with test data and can be verified for other data by using the accompanying web-based tool: http://bioinformatics.musc.edu/~jonas/usm/.
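
The abstract does not state the iteration itself, so the following is only a minimal sketch of a CGR-style map under stated assumptions: each unit is assigned a binary corner of a unit hypercube with ceil(log2(alphabet size)) dimensions, every step moves halfway from the current point toward the corner of the next unit, and the trajectory starts at the hypercube centre. The names `usm_forward` and `similarity_estimate` are hypothetical, only the forward direction is shown, and the similarity estimate (negative log2 of the largest coordinate gap) is an assumption rather than the paper's stated formula.

```python
import math


def usm_forward(sequence, alphabet):
    """Forward CGR-style trajectory in a binary hypercube (illustrative only)."""
    # Assumed coding: unit i of the alphabet maps to the bits of its index.
    n_bits = max(1, math.ceil(math.log2(len(alphabet))))
    corners = {
        unit: [(index >> bit) & 1 for bit in range(n_bits)]
        for index, unit in enumerate(alphabet)
    }
    point = [0.5] * n_bits  # assumed start: centre of the hypercube
    trajectory = []
    for unit in sequence:
        # Move halfway from the current point toward the unit's corner.
        point = [p + 0.5 * (c - p) for p, c in zip(point, corners[unit])]
        trajectory.append(point)
    return trajectory


def similarity_estimate(point_a, point_b):
    """Assumed estimate: -log2 of the largest coordinate gap between two points."""
    gap = max(abs(a - b) for a, b in zip(point_a, point_b))
    return math.inf if gap == 0 else -math.log2(gap)


# Two short DNA sequences that differ only in their first unit.
a = usm_forward("ACGT", "ACGT")
b = usm_forward("TCGT", "ACGT")
print(similarity_estimate(a[-1], b[-1]))
```

In this sketch every additional shared trailing unit halves the gap between the two end points, so the estimate rises by one per shared unit. That is the sense in which map distance can stand in for sequence similarity without fixing a memory length in advance.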

Conclusions: USM is shown to enable a statistical mechanics approach to sequence analysis. The scale-independent representation frees sequence analysis from the need to assume a memory length when investigating syntactic rules.


Figure 4: Cumulative distribution of bi-directional similarity, D, between the two stanzas and comparison of genomic and proteomic sequences of E. coli threonine gene A, thrA (2463 base pairs for the genomic sequence and 820 amino acids for the proteomic sequence), with B, thrB (933 base pairs for the genomic sequence and 310 amino acids for the proteomic sequence). The model expectation, that of a uniform random distribution of units, is represented by dashed lines, obtained using Eq. 11 for n = 2 (half dimensionality of USM state space for DNA) and n = 4.3 (half dimensionality of USM state space for proteins, n = 4.32, and for the two stanzas, n = 4.25). The solid lines represent the actual cumulative distribution of D values.

Mentions: Probability distribution of similarity estimates for the uniformly random sequence model; experimental values deviating from this model would indicate real homology, as in Fig. 4. The dark lines represent the numerical solution for the bi-directional over-determination, P3 (Equation 11), for different dimensionalities, n, identified by numbers in the plot. The gray lines represent the numerical solution, for the same values of n, for the uni-directional over-determination, P1 (Equation 9). The solution for the dimensionality of the two stanzas, n = log2(19) = 4.25, is highlighted by a thick line, for both P3 (thick dark line) and P1 (thick gray line).
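
The dimensionalities quoted here and in the Figure 4 caption appear to follow n = log2(number of unique units). The short check below reproduces them. The count of 19 unique units for the stanzas comes from the text; the counts of 4 for DNA and 20 for proteins are assumptions inferred from the quoted values n = 2 and n = 4.32.

```python
import math

# Number of unique units per sequence type.
# 19 is stated in the text; 4 and 20 are assumed from the quoted n values.
unique_units = {"DNA": 4, "proteins": 20, "stanzas": 19}

# Half dimensionality of the state space, n = log2(unique units), to 2 decimals.
half_dimensionality = {
    name: round(math.log2(count), 2) for name, count in unique_units.items()
}

for name, n in half_dimensionality.items():
    print(f"{name}: n = {n}")
```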

