FAMSA: Fast and accurate multiple sequence alignment of huge protein families

View Article: PubMed Central - PubMed

ABSTRACT

Rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein family databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the use of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. Importantly, its implementation is highly optimized and parallelized to make the most of modern computer platforms. As a result, the quality indicators, i.e. sum-of-pairs (SP) and total-column (TC) scores, show FAMSA to be superior to competing algorithms, such as Clustal Omega or MAFFT, for datasets exceeding a few thousand sequences. This quality does not come at the expense of time or memory requirements, which are an order of magnitude lower than those of existing solutions. For example, a family of 415519 sequences was analyzed in less than two hours and required no more than 8 GB of RAM. FAMSA is available for free at http://sun.aei.polsl.pl/REFRESH/famsa.
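
To illustrate the longest common subsequence (LCS) measure mentioned in the abstract, the following is a minimal sketch of the textbook dynamic-programming computation of LCS length for two protein sequences. It is not FAMSA's optimized and parallelized implementation, and the abstract does not state how the LCS length is turned into a similarity score, so that step is left out. The function name `lcs_length` is an illustrative assumption.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Classic O(len(a) * len(b)) dynamic programme that keeps only two rows
    of the table. Illustrative only: not the optimized FAMSA code.
    """
    if len(b) > len(a):
        a, b = b, a  # iterate the shorter sequence in the inner loop
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Matching residues extend the LCS of both prefixes by one.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of the two shorter prefixes.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(lcs_length("ACDEFG", "ADFG"))
```

A longer common subsequence indicates that two sequences share more residues in the same order, which is why it can serve as a cheap proxy for pairwise similarity before any alignment is computed.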



Figure 5: Scalability of SP (dashed lines) and TC (solid lines) scores with respect to the number of sequences. Experiments were performed on the 53 largest extHomFam families, randomly resampled to obtain the desired set size. Processing times for selected values of k are provided as bar plots.


Mentions: As can be seen in Fig. 3, the quality advantage of the presented software over the other algorithms increased for consecutive subsets. For instance, on the extra-large subset, FAMSA correctly aligned approximately 25% more columns than UPP, the second-best algorithm. A more detailed analysis of FAMSA accuracy relative to the competitors is given in Fig. 4. The four extHomFam categories were further divided into 11 subsets of approximately 35 families each. Selected statistical indicators (median, mean, 12.5th and 87.5th percentiles) of the absolute differences in SP and TC measures between FAMSA and the other algorithms were plotted for each interval along the k axis. Clearly, both the number of test cases in which the presented software was superior to the competitors and the absolute advantage in quality increase with growing set size. This observation is supported by the scalability analysis performed on the 53 largest families (k ≥ 30000), randomly resampled to obtain smaller sets. Figure 5 shows that FAMSA outperformed the competitors when the number of proteins exceeded 5000. Importantly, the performance was hardly affected when more sequences were added. This might be caused by the bias of the guide trees towards reference sequences. Indeed, the following section shows that the reference sequences were slightly closer to each other in the guide trees than suggested by the random model. Nevertheless, this held for all analyzed algorithms and can therefore be considered a property of the benchmark.
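
The SP and TC measures discussed above can be sketched as follows, using their common definitions: SP is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and TC is the fraction of reference columns reproduced exactly. This is a simplified sketch, not the scoring tool used in the article. It assumes both alignments hold the same sequences in the same order with `-` as the gap character, and it scores every reference column that contains at least two residues. The names `residue_columns` and `sp_tc_scores` are hypothetical.

```python
from itertools import combinations


def residue_columns(alignment):
    """For each row, list the column index of every non-gap residue."""
    return [[col for col, ch in enumerate(row) if ch != "-"] for row in alignment]


def sp_tc_scores(reference, test):
    """Return (SP, TC) of `test` measured against `reference`."""
    ungapped_ref = [row.replace("-", "") for row in reference]
    ungapped_test = [row.replace("-", "") for row in test]
    if ungapped_ref != ungapped_test:
        raise ValueError("alignments must contain the same sequences in the same order")

    ref_maps = residue_columns(reference)
    test_maps = residue_columns(test)

    # Group residues, identified as (sequence index, residue index), by reference column.
    n_cols = len(reference[0]) if reference else 0
    columns = [[] for _ in range(n_cols)]
    for seq, cols in enumerate(ref_maps):
        for res, col in enumerate(cols):
            columns[col].append((seq, res))

    pairs_total = pairs_ok = cols_total = cols_ok = 0
    for members in columns:
        if len(members) < 2:
            continue  # a column with fewer than two residues aligns nothing
        cols_total += 1
        # Where does the test alignment place the residues of this column?
        placed = [test_maps[seq][res] for seq, res in members]
        if len(set(placed)) == 1:
            cols_ok += 1
        for x, y in combinations(placed, 2):
            pairs_total += 1
            if x == y:
                pairs_ok += 1

    sp = pairs_ok / pairs_total if pairs_total else 0.0
    tc = cols_ok / cols_total if cols_total else 0.0
    return sp, tc


if __name__ == "__main__":
    reference = ["AC-G", "ACTG", "A-TG"]
    test = ["ACG-", "ACTG", "A-TG"]
    print(sp_tc_scores(reference, test))
```

In this toy case only the last reference column is disturbed, so TC loses one whole column while SP loses only the residue pairs involving the misplaced residue. TC is the stricter of the two measures.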

