Limits...
FAMSA: Fast and accurate multiple sequence alignment of huge protein families

View Article: PubMed Central - PubMed

ABSTRACT

Rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein families databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the utilization of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. What matters is that its implementation is highly optimized and parallelized to make the most of modern computer platforms. Thanks to the above, quality indicators, i.e. sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms, such as Clustal Omega or MAFFT for datasets exceeding a few thousand sequences. Quality does not compromise on time or memory requirements, which are an order of magnitude lower than those in the existing solutions. For example, a family of 415519 sequences was analyzed in less than two hours and required no more than 8 GB of RAM. FAMSA is available for free at http://sun.aei.polsl.pl/REFRESH/famsa.

No MeSH data available.


Illustration of gapped sequence representation of – – C A – – H – F – – – Q – G A C – – D L M – – – – F A – P – S.The ‘+’ symbol is a guard, present to simplify the implementation. The values of dps are computed according to the rule: dps[i] = dps[2i] + dps[2i + 1], provided that the necessary cells are present. Otherwise, they are calculated on the no_gaps and sequence vectors. For example, dps[8] is the number of symbols in sequence [0 … 1] (equal to 2), incremented by the number of gaps present just before these symbols, i.e. no_gaps[0] and no_gaps[1].
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC5037421&req=5

f1: Illustration of gapped sequence representation of – – C A – – H – F – – – Q – G A C – – D L M – – – – F A – P – S.The ‘+’ symbol is a guard, present to simplify the implementation. The values of dps are computed according to the rule: dps[i] = dps[2i] + dps[2i + 1], provided that the necessary cells are present. Otherwise, they are calculated on the no_gaps and sequence vectors. For example, dps[8] is the number of symbols in sequence [0 … 1] (equal to 2), incremented by the number of gaps present just before these symbols, i.e. no_gaps[0] and no_gaps[1].

Mentions: While two former components were previously employed by alignment algorithms, e.g. Kalign, the gapped representation is, to the best of our knowledge, a novel technique. In this representation, two equal-sized arrays are stored for each sequence: (i) sequence symbols, (ii) the number of gaps present before the corresponding sequence symbol. Moreover, to quickly localize a symbol in a column, as well as to insert or remove gaps, dynamic position statistics are stored in an additional array. The space for the gapped sequence is approximately 13 times the length of the sequence (see Fig. 1 for example). The proposed profile representation allows a dynamic programming matrix to be computed rapidly and is memory frugal. The DP computation step for a pair of profiles takes


FAMSA: Fast and accurate multiple sequence alignment of huge protein families
Illustration of gapped sequence representation of – – C A – – H – F – – – Q – G A C – – D L M – – – – F A – P – S.The ‘+’ symbol is a guard, present to simplify the implementation. The values of dps are computed according to the rule: dps[i] = dps[2i] + dps[2i + 1], provided that the necessary cells are present. Otherwise, they are calculated on the no_gaps and sequence vectors. For example, dps[8] is the number of symbols in sequence [0 … 1] (equal to 2), incremented by the number of gaps present just before these symbols, i.e. no_gaps[0] and no_gaps[1].
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC5037421&req=5

f1: Illustration of gapped sequence representation of – – C A – – H – F – – – Q – G A C – – D L M – – – – F A – P – S.The ‘+’ symbol is a guard, present to simplify the implementation. The values of dps are computed according to the rule: dps[i] = dps[2i] + dps[2i + 1], provided that the necessary cells are present. Otherwise, they are calculated on the no_gaps and sequence vectors. For example, dps[8] is the number of symbols in sequence [0 … 1] (equal to 2), incremented by the number of gaps present just before these symbols, i.e. no_gaps[0] and no_gaps[1].
Mentions: While two former components were previously employed by alignment algorithms, e.g. Kalign, the gapped representation is, to the best of our knowledge, a novel technique. In this representation, two equal-sized arrays are stored for each sequence: (i) sequence symbols, (ii) the number of gaps present before the corresponding sequence symbol. Moreover, to quickly localize a symbol in a column, as well as to insert or remove gaps, dynamic position statistics are stored in an additional array. The space for the gapped sequence is approximately 13 times the length of the sequence (see Fig. 1 for example). The proposed profile representation allows a dynamic programming matrix to be computed rapidly and is memory frugal. The DP computation step for a pair of profiles takes

View Article: PubMed Central - PubMed

ABSTRACT

Rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein families databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the utilization of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. What matters is that its implementation is highly optimized and parallelized to make the most of modern computer platforms. Thanks to the above, quality indicators, i.e. sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms, such as Clustal Omega or MAFFT for datasets exceeding a few thousand sequences. Quality does not compromise on time or memory requirements, which are an order of magnitude lower than those in the existing solutions. For example, a family of 415519 sequences was analyzed in less than two hours and required no more than 8 GB of RAM. FAMSA is available for free at http://sun.aei.polsl.pl/REFRESH/famsa.

No MeSH data available.