A basic analysis toolkit for biological sequences.
Bottom Line:
Therefore, our main contribution is to fill this gap between algorithmic theory and practice by providing an extensible and easy to use software library that includes algorithms for the mentioned string matching and alignment problems.The library consists of C/C++ library functions as well as Perl library functions.It can be interfaced with Bioperl and can also be used as a stand-alone system with a GUI.
Affiliation: Dipartimento di Matematica Applicazioni, Università di Palermo, Italy. raffaele@math.unipa.it
ABSTRACT
This paper presents a software library, nicknamed BATS, for some basic sequence analysis tasks. Namely, local alignments, via approximate string matching, and global alignments, via longest common subsequence and alignments with affine and concave gap cost functions. Moreover, it also supports filtering operations to select strings from a set and establish their statistical significance, via z-score computation. None of the algorithms is new, but although they are generally regarded as fundamental for sequence analysis, they have not been implemented in a single and consistent software package, as we do here. Therefore, our main contribution is to fill this gap between algorithmic theory and practice by providing an extensible and easy to use software library that includes algorithms for the mentioned string matching and alignment problems. The library consists of C/C++ library functions as well as Perl library functions. It can be interfaced with Bioperl and can also be used as a stand-alone system with a GUI. The software is available at http://www.math.unipa.it/~raffaele/BATS/ under the GNU GPL. No MeSH data available. |
Related In:
Results -
Collection
License getmorefigures.php?uid=PMC2147010&req=5
Mentions: It is well known that finding the LCS of X and Y is equivalent to finding the Levenshtein edit distance between the two strings [4], where the "edit operations" are insertion and deletion of a single character. Those edit operations naturally correspond to the differences of type (b) and (c) introduced in section 3 for approximate string matching. Although there is analogy between approximate string matching and LCS computation, the former can be regarded as a local alignment method as opposed to the latter, that is a global alignment method [1]. Following Myers [28], we phrase the LCS problem as the computation of a shortest path in the edit graph for X and Y, defined as follows. It is a directed grid graph (see Fig. 2) with vertices (i, j), where 0 ≤ i ≤ n and 0 ≤ j ≤ m, /X/ = n and /Y/ = m. We refer to the vertices also as points. There is a vertical edge from each non-bottom point to its neighbor below. There is a horizontal edge from each non-rightmost point to its right neighbor. Finally, if X[i] = Y[j], there is a diagonal edge from (i - 1, j - 1) to (i, j). Assume that each non-diagonal edge has weight 1 and the remaining edges weight 0. Then, the Levenshtein edit distance is given by the minimum cost of any path from (0, 0) to (n, m). We assume the reader to be familiar with the notion of edit script corresponding to the min-cost path and how to recover an LCS from an edit script [28-30]. Our LCS from Fragments problem also corresponds naturally to an edit graph. The vertices and the horizontal and vertical edges are as before, but the diagonal edges correspond to a given set of fragments. Each fragment, formally described as a triple (i, j, k), represents a sequence of diagonal edges from (i - j - 1) (the start point) to (i + k - 1, j + k - 1) (the end point). For a fragment f, the start and end points of f are denoted by start(f) and end(f), respectively. In the example of Figure 3, the fragments are the sequences of at least 2 diagonal edges of Fig. 2. The LCS from Fragments problem is equivalent to finding a minimum-cost path in the edit graph from (0, 0) to (n, m), where each diagonal edge has weight 0 and each non-diagonal edge has weight 1. The problem has an obvious dynamic programming solution since the graph naturally corresponds to an nxm dynamic programming matrix. However, it also falls into the more efficient algorithmic paradigm of Sparse Dynamic Programming [31,32], as discussed in [13] and outlined next. |
View Article: PubMed Central - HTML - PubMed
Affiliation: Dipartimento di Matematica Applicazioni, Università di Palermo, Italy. raffaele@math.unipa.it
No MeSH data available.