FAMSA: Fast and accurate multiple sequence alignment of huge protein families


ABSTRACT

Rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein family databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the use of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. Importantly, its implementation is highly optimized and parallelized to make the most of modern computer platforms. As a result, the quality indicators, i.e. sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms, such as Clustal Omega or MAFFT, for datasets exceeding a few thousand sequences. This quality does not come at the expense of time or memory requirements, which are an order of magnitude lower than those of existing solutions. For example, a family of 415519 sequences was analyzed in less than two hours and required no more than 8 GB of RAM. FAMSA is available for free at http://sun.aei.polsl.pl/REFRESH/famsa.
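To make the longest common subsequence (LCS) measure mentioned above concrete, the following is a minimal sketch of a textbook dynamic-programming LCS length computation applied to every pair of sequences. It is not FAMSA's optimized implementation, and it does not reproduce how FAMSA converts LCS lengths into similarities; the function names `lcs_length` and `pairwise_lcs` and the toy sequences are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Textbook dynamic programming, O(len(a) * len(b)) time,
    keeping only two rows of the table in memory.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row dimension
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def pairwise_lcs(seqs):
    """LCS length for every unordered pair of sequences, keyed by index pair.

    Each entry depends only on its own two sequences, so the pairs
    could be computed in any order or in parallel.
    """
    return {
        (i, j): lcs_length(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }


if __name__ == "__main__":
    # Hypothetical toy sequences, not taken from any benchmark.
    toy = ["MKTAYIAK", "MKTAYIAR", "MKAYAK"]
    for pair, length in pairwise_lcs(toy).items():
        print(pair, length)
```

The number of pairs grows quadratically with the number of sequences, which is why this step is worth optimizing for families with hundreds of thousands of members.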



Figure 6: Computational scalability of FAMSA with respect to the number of cores, evaluated on the ten largest extHomFam families (k ≥ 100000). The algorithm stages are represented by different colors. Execution times of the largest set (ABC_tran) are marked with solid fill; the other families are drawn with transparency.

As FAMSA was designed to use all the available computational power, it takes advantage of the multi-core architectures of contemporary computers. The ten largest protein families from extHomFam (all of which contain at least 100 000 sequences: ABC_tran, gtp, HATPase_c, helicase_NC, kinase, mdd, response_reg, rvp, sdr, TyrKc) were selected to investigate the scalability of the algorithm stages with respect to the number of computing threads. The experiments also considered a variant of FAMSA in which the similarity calculation was adapted for massively parallel architectures using OpenCL. For convenience, the processing times of ABC_tran are marked separately. As can be seen in Fig. 6, when FAMSA was run serially, more than 90% of the execution time was spent in stages I and II (the algorithm performs them simultaneously). However, because pairwise similarities can be calculated independently, these stages scale noticeably better with the number of threads than the progressive construction. In particular, when more than 12 cores were involved, stage III of the algorithm became the bottleneck. This was also the case for the GPU variant of FAMSA.
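The shift of the bottleneck to stage III can be illustrated with a simple Amdahl-style model. The sketch below is not derived from the measured data: it assumes that stages I and II parallelize perfectly and that stage III does not parallelize at all, which is a deliberate simplification, since the text only states that stage III scales worse. The parallel fraction of 0.9 is taken from the "more than 90%" figure for the serial run, and all function names are hypothetical.

```python
def modeled_times(parallel_fraction: float, threads: int):
    """Return (time of stages I+II, time of stage III) in units of the serial run time.

    Assumption: stages I+II speed up linearly with the thread count,
    while stage III takes the same time regardless of threads.
    """
    parallel_part = parallel_fraction / threads
    serial_part = 1.0 - parallel_fraction
    return parallel_part, serial_part


def speedup(parallel_fraction: float, threads: int) -> float:
    """Amdahl's law: overall speedup relative to the single-threaded run."""
    parallel_part, serial_part = modeled_times(parallel_fraction, threads)
    return 1.0 / (parallel_part + serial_part)


def crossover_threads(parallel_fraction: float) -> float:
    """Thread count above which the non-parallel stage dominates the run time."""
    return parallel_fraction / (1.0 - parallel_fraction)


if __name__ == "__main__":
    p = 0.9  # assumed fraction of serial run time spent in stages I and II
    for n in (1, 2, 4, 8, 12, 16, 32):
        par, ser = modeled_times(p, n)
        print(f"threads={n:2d}  stages I+II={par:.3f}  stage III={ser:.3f}  "
              f"speedup={speedup(p, n):.2f}")
    print("stage III dominates above about", round(crossover_threads(p), 1), "threads")
```

Under these assumptions a parallel fraction of 0.9 puts the crossover at about 9 threads, and a fraction slightly above 0.92 puts it at about 12, which is broadly consistent with the observation that stage III became the bottleneck beyond 12 cores.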

