FAMSA: Fast and accurate multiple sequence alignment of huge protein families

View Article: PubMed Central - PubMed

ABSTRACT

Rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein family databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the use of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. Importantly, its implementation is highly optimized and parallelized to make the most of modern computer platforms. As a result, the quality indicators, i.e. sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms, such as Clustal Omega or MAFFT, for datasets exceeding a few thousand sequences. This quality does not come at the cost of time or memory requirements, which are an order of magnitude lower than those of existing solutions. For example, a family of 415519 sequences was analyzed in less than two hours and required no more than 8 GB of RAM. FAMSA is available for free at http://sun.aei.polsl.pl/REFRESH/famsa.
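To make the longest common subsequence (LCS) measure concrete, the following is a minimal sketch of the textbook dynamic-programming computation of LCS length for two sequences. It is only an illustration: the function name `lcs_length` and the example strings are hypothetical, and the code does not reproduce FAMSA's optimized, parallelized implementation or the way FAMSA converts LCS lengths into pairwise similarities.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (textbook DP).

    Illustrative only; this is not FAMSA's optimized implementation.
    Uses O(len(a) * len(b)) time and O(len(b)) memory.
    """
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Matching symbols extend the LCS of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one symbol from either sequence.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Two made-up peptide fragments that differ by a single residue.
    print(lcs_length("MKTAYIAK", "MKTAIAK"))  # 7
```

A longer common subsequence relative to the sequence lengths indicates a more similar pair, which is the kind of signal a progressive aligner can use when building its guide tree.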



Figure 7: The normalized Sackin indexes (Sackin indexes divided by the cardinality of input sets) for various guide tree computation methods.

The experiments confirmed that single linkage was superior in terms of alignment quality. The smallest, yet statistically significant, advantage (pSP = 0.000011, pTC = 0.000019) was observed in comparison with UPGMA. This, together with its memory efficiency, made the authors choose single linkage for FAMSA. To provide deeper insight into the structure of the trees produced by the different strategies, the Sackin index [44] was used, defined as the sum of the heights of all leaves in the tree (i.e. their distances from the root). Figure 7 compares the normalized Sackin indexes (i.e. the Sackin indexes divided by the number of sequences in the family) for trees produced by FAMSA + single linkage, FAMSA + UPGMA, and Clustal Omega. The lines corresponding to perfectly balanced and fully imbalanced (i.e. chained) trees are also presented for reference. The indexes for single linkage trees are noticeably higher than those for UPGMA and Clustal Omega. Interestingly, the normalized Sackin indexes for UPGMA and Clustal Omega trees are approximately twice as large as in the perfectly balanced case.
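The two reference lines in Figure 7 can be reproduced on a toy scale. The sketch below computes the normalized Sackin index for a perfectly balanced tree and for a chained tree over the same leaves. The tree representation (nested tuples with string leaves) and all function names are assumptions made for this illustration and are not taken from FAMSA; the recursion is also only suitable for small trees, since a chained tree over many thousands of leaves would exceed Python's default recursion limit.

```python
def sackin_index(tree, depth=0):
    """Sum of leaf depths. A leaf is any non-tuple; a tuple is an internal node."""
    if not isinstance(tree, tuple):
        return depth
    return sum(sackin_index(child, depth + 1) for child in tree)


def count_leaves(tree):
    """Number of leaves (sequences) in the tree."""
    if not isinstance(tree, tuple):
        return 1
    return sum(count_leaves(child) for child in tree)


def normalized_sackin(tree):
    """Sackin index divided by the number of leaves."""
    return sackin_index(tree) / count_leaves(tree)


def balanced_tree(labels):
    """Split the labels in half recursively to get a balanced binary tree."""
    if len(labels) == 1:
        return labels[0]
    mid = len(labels) // 2
    return (balanced_tree(labels[:mid]), balanced_tree(labels[mid:]))


def chained_tree(labels):
    """Attach one leaf at a time to get a fully imbalanced (chained) tree."""
    tree = labels[0]
    for label in labels[1:]:
        tree = (tree, label)
    return tree


if __name__ == "__main__":
    labels = [f"seq{i}" for i in range(8)]  # hypothetical sequence names
    print(normalized_sackin(balanced_tree(labels)))  # 3.0, i.e. log2(8)
    print(normalized_sackin(chained_tree(labels)))   # 4.375, i.e. 35 / 8
```

For n leaves, the balanced case grows like log2(n), whereas the chained case grows roughly like n/2, which is why the two reference curves diverge so strongly for large families.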

