FAMSA: Fast and accurate multiple sequence alignment of huge protein families

View Article: PubMed Central - PubMed

ABSTRACT

Rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein family databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the use of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. Importantly, its implementation is highly optimized and parallelized to make the most of modern computer platforms. As a result, the quality indicators, i.e. sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms, such as Clustal Omega or MAFFT, for datasets exceeding a few thousand sequences. This quality does not come at the cost of time or memory requirements, which are an order of magnitude lower than those of the existing solutions. For example, a family of 415519 sequences was analyzed in less than two hours and required no more than 8 GB of RAM. FAMSA is available for free at http://sun.aei.polsl.pl/REFRESH/famsa.
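
To illustrate the longest common subsequence (LCS) measure mentioned in the abstract, the following is a minimal sketch of the textbook dynamic-programming computation of LCS length for two sequences. It is not FAMSA's optimized implementation, and the function name `lcs_length` is an assumption made for this example.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Textbook O(len(a) * len(b)) dynamic programming that keeps only
    two rows of the table. Illustrative only; not FAMSA's own code.
    """
    # Keep the shorter sequence in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                # Matching symbols extend the LCS of both prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one symbol from either sequence.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Two short, made-up protein fragments.
    print(lcs_length("MKVLAAGIV", "MKLAGGIV"))
```

A longer LCS relative to the sequence lengths indicates more similar sequences; how FAMSA turns this length into a pairwise similarity value is not specified in this excerpt.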



Figure 3: Comparison of algorithms on extHomFam. The solid bars (lower) represent TC scores, while the transparent ones (higher) represent SP scores. For each subset, the algorithms were sorted in increasing order of the TC measure. Execution times are provided above the bars in hours:minutes format.
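
The sum-of-pairs (SP) and total-column (TC) scores shown in the figure can be sketched as follows, under simplifying assumptions: the reference and test alignments contain the same sequences in the same order, gaps are written as `-`, SP is taken as the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and TC as the fraction of reference columns reproduced exactly. The function names are hypothetical, and benchmark tools may treat gap-only or partially gapped columns differently.

```python
from itertools import combinations


def residue_columns(alignment):
    """Describe each column by the residue index of every sequence (None = gap)."""
    counters = [0] * len(alignment)
    columns = []
    for col in range(len(alignment[0])):
        column = []
        for s, row in enumerate(alignment):
            if row[col] == "-":
                column.append(None)
            else:
                column.append(counters[s])
                counters[s] += 1
        columns.append(tuple(column))
    return columns


def aligned_pairs(columns):
    """Set of residue pairs ((seq, idx), (seq, idx)) that share a column."""
    pairs = set()
    for column in columns:
        present = [(s, r) for s, r in enumerate(column) if r is not None]
        pairs.update(combinations(present, 2))
    return pairs


def sp_tc_scores(reference, test):
    """Return (SP, TC) of a test alignment against a reference alignment."""
    ref_cols = residue_columns(reference)
    test_cols = residue_columns(test)
    ref_pairs = aligned_pairs(ref_cols)
    sp = len(ref_pairs & aligned_pairs(test_cols)) / len(ref_pairs)
    test_set = set(test_cols)
    tc = sum(col in test_set for col in ref_cols) / len(ref_cols)
    return sp, tc


if __name__ == "__main__":
    # Made-up toy alignments of the same two sequences.
    reference = ["AC-GT", "ACAGT"]
    test = ["ACG-T", "ACAGT"]
    print(sp_tc_scores(reference, test))
```

Because a column counts towards TC only when every sequence is placed correctly, TC is never higher than SP would suggest for the same alignment quality, which is consistent with the TC bars being the lower ones in the figure.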

The experiments on extHomFam confirmed that the superior accuracy and execution time of FAMSA scale well with the number of sequences (Fig. 3; more detailed results are given in Supplementary material). FAMSA was inferior to Clustal-iter2 by a small margin, and only on the small subset. For k > 4000, it became the best aligner and, depending on the subset and quality measure, was followed by Clustal, MAFFT, or UPP. As MUSCLE and MAFFT -parttree rendered inferior results, they were excluded from Fig. 3. Kalign2, Kalign-LCS, and MUSCLE did not complete the analyses on the extra-large subset due to excessive memory or time requirements. Clustal Omega and MAFFT-default failed to process, respectively, the one and the four largest extHomFam families (the missing MAFFT results were taken from the -dpparttree variant, though). The advantage of FAMSA over the competing software in SP and TC measures on the medium, large, and extra-large subsets was assessed statistically using the Wilcoxon signed-rank test with the Bonferroni-Holm correction for multiple testing. The differences are significant at α = 0.005; p-values for all pairwise comparisons can be found in Table 2.
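
The Bonferroni-Holm correction mentioned above can be sketched as a step-down procedure over the p-values of the individual comparisons. The function name `holm_bonferroni` and the p-values in the usage example are hypothetical; the actual values are those reported in Table 2 of the article. The p-values themselves would come from the Wilcoxon signed-rank test, which is not reimplemented here.

```python
def holm_bonferroni(p_values, alpha=0.005):
    """Holm's step-down procedure.

    Returns a list of booleans, one per input p-value, that is True
    where the corresponding null hypothesis is rejected at level alpha.
    """
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared
        # with alpha / (m - k + 1), i.e. alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Once one test fails, all larger p-values fail too.
            break
    return reject


if __name__ == "__main__":
    # Hypothetical p-values for four pairwise comparisons.
    print(holm_bonferroni([0.0001, 0.002, 0.0004, 0.01], alpha=0.005))
```

Compared with the plain Bonferroni correction, which tests every p-value against alpha / m, the step-down thresholds are less conservative while still controlling the family-wise error rate.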

