Limits...
Back-translation for discovering distant protein homologies in the presence of frameshift mutations.

Girdea M, Noe L, Kucherov G - Algorithms Mol Biol (2010)

Bottom Line: Frameshift mutations in protein-coding DNA sequences produce a drastic change in the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin.Moreover, when a large number of substitutions are additionally involved in the divergence, the homology detection becomes difficult even at the DNA level.We developed a novel method to infer distant homology relations of two proteins, that accounts for frameshift and point mutations that may have affected the coding sequences.

View Article: PubMed Central - HTML - PubMed

ABSTRACT

Background: Frameshift mutations in protein-coding DNA sequences produce a drastic change in the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin. Moreover, when a large number of substitutions are additionally involved in the divergence, the homology detection becomes difficult even at the DNA level.

Results: We developed a novel method to infer distant homology relations of two proteins, that accounts for frameshift and point mutations that may have affected the coding sequences. We design a dynamic programming alignment algorithm over memory-efficient graph representations of the complete set of putative DNA sequences of each protein, with the goal of determining the two putative DNA sequences which have the best scoring alignment under a powerful scoring system designed to reflect the most probable evolutionary process. Our implementation is freely available at [http://bioinfo.lifl.fr/path/].

Conclusions: Our approach allows to uncover evolutionary information that is not captured by traditional alignment methods, which is confirmed by biologically significant examples.

No MeSH data available.


Back-translation graph examples. A fully represented (a) and condensed (b) back-translation graph for the amino acid sequence YSH.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC2821327&req=5

Figure 1: Back-translation graph examples. A fully represented (a) and condensed (b) back-translation graph for the amino acid sequence YSH.

Mentions: An explicit enumeration and pairwise alignment of all the putative DNA sequences is not an option, since their number increases exponentially with the protein's length, as all amino acids are encoded by 2, 3, 4 or 6 codons, with the exception of M and W, which have a single corresponding codon. Therefore, we represent the protein's "back-translation" (set of possible source DNAs) as a directed acyclic graph, whose size depends linearly on the length of the protein, and where a path represents one putative sequence. As illustrated in Figure 1(a), the graph is organized as a sequence of length 3n where n is the length of the protein sequence. At each position i in the graph, there is a group of nodes, each node representing a possible nucleotide that can appear at position i in at least one of the putative coding sequences.


Back-translation for discovering distant protein homologies in the presence of frameshift mutations.

Girdea M, Noe L, Kucherov G - Algorithms Mol Biol (2010)

Back-translation graph examples. A fully represented (a) and condensed (b) back-translation graph for the amino acid sequence YSH.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC2821327&req=5

Figure 1: Back-translation graph examples. A fully represented (a) and condensed (b) back-translation graph for the amino acid sequence YSH.
Mentions: An explicit enumeration and pairwise alignment of all the putative DNA sequences is not an option, since their number increases exponentially with the protein's length, as all amino acids are encoded by 2, 3, 4 or 6 codons, with the exception of M and W, which have a single corresponding codon. Therefore, we represent the protein's "back-translation" (set of possible source DNAs) as a directed acyclic graph, whose size depends linearly on the length of the protein, and where a path represents one putative sequence. As illustrated in Figure 1(a), the graph is organized as a sequence of length 3n where n is the length of the protein sequence. At each position i in the graph, there is a group of nodes, each node representing a possible nucleotide that can appear at position i in at least one of the putative coding sequences.

Bottom Line: Frameshift mutations in protein-coding DNA sequences produce a drastic change in the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin.Moreover, when a large number of substitutions are additionally involved in the divergence, the homology detection becomes difficult even at the DNA level.We developed a novel method to infer distant homology relations of two proteins, that accounts for frameshift and point mutations that may have affected the coding sequences.

View Article: PubMed Central - HTML - PubMed

ABSTRACT

Background: Frameshift mutations in protein-coding DNA sequences produce a drastic change in the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin. Moreover, when a large number of substitutions are additionally involved in the divergence, the homology detection becomes difficult even at the DNA level.

Results: We developed a novel method to infer distant homology relations of two proteins, that accounts for frameshift and point mutations that may have affected the coding sequences. We design a dynamic programming alignment algorithm over memory-efficient graph representations of the complete set of putative DNA sequences of each protein, with the goal of determining the two putative DNA sequences which have the best scoring alignment under a powerful scoring system designed to reflect the most probable evolutionary process. Our implementation is freely available at [http://bioinfo.lifl.fr/path/].

Conclusions: Our approach allows to uncover evolutionary information that is not captured by traditional alignment methods, which is confirmed by biologically significant examples.

No MeSH data available.