Comprehensive comparison of graph based multiple protein sequence alignment strategies.

Plyusnin I, Holm L - BMC Bioinformatics (2012)

Bottom Line: Our results indicate the optimal alignment strategy based on the choices compared. Single linkage clustering is almost invariably the best algorithm to build a guiding tree for progressive alignment. This is the first time multiple protein alignment strategies are comprehensively and clearly compared using a single implementation platform.


Affiliation: Institute of Biotechnology, University of Helsinki, P.O. Box 56, Viikinkaari 5, Helsinki, Finland. Ilja.Pljusnin@gmail.com

ABSTRACT

Background: Multiple protein sequence alignment (MPSA) is the starting point for a multitude of applications in molecular biology. Here, we present a novel MPSA program based on the SeqAn sequence alignment library. Our implementation has a strict modular structure, which allows different components of the alignment process to be swapped and, thus, their contribution to alignment quality and computation time to be investigated. We systematically varied information sources, guiding trees, score transformations and iterative refinement options, and evaluated the resulting alignments on BAliBASE and SABmark.

Results: Our results indicate the optimal alignment strategy based on the choices compared. First, we show that pairwise global and local alignments contain sufficient information to construct a high quality multiple alignment. Second, single linkage clustering is almost invariably the best algorithm to build a guiding tree for progressive alignment. Third, triplet library extension, with introduction of new edges, is the most efficient consistency transformation of those compared. Alternatively, one can apply tree dependent partitioning as a post processing step, which was shown to be comparable with the best consistency transformation in both time and accuracy. Finally, propagating information beyond four transitive links introduces more noise than signal.

Conclusions: This is the first time multiple protein alignment strategies are comprehensively and clearly compared using a single implementation platform. In particular, we showed which of the existing consistency transformations and iterative refinement techniques are the most valid. Our implementation is freely available at http://ekhidna.biocenter.helsinki.fi/MMSA and as a supplementary file attached to this article (see Additional file 1).
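The triplet library extension named in the Results can be illustrated with a minimal sketch. It assumes the T-Coffee-style rule, in which two residues x and y that are both linked to a third residue z gain min(w(x,z), w(z,y)) on their own edge, and the edge is created if it did not exist. The function name, the (sequence id, position) vertex encoding and the min rule are assumptions made for illustration; the abstract does not specify the exact weighting used in the authors' SeqAn-based program.

```python
from collections import defaultdict


def edge_key(u, v):
    """Order the two endpoints so that each undirected edge has one key."""
    return (u, v) if u <= v else (v, u)


def triplet_extension(edges):
    """Hypothetical sketch of a triplet extension on an alignment graph.

    edges maps (u, v) -> weight, where u and v are (sequence_id, position)
    tuples from different sequences. Returns a new dict in which every pair
    (x, y) sharing a neighbour z has gained min(w(x, z), w(z, y)); pairs
    without a direct edge receive a new one.
    """
    # Adjacency built from the ORIGINAL weights only, so updates do not cascade.
    neighbours = defaultdict(dict)
    for (u, v), w in edges.items():
        neighbours[u][v] = w
        neighbours[v][u] = w

    extended = {edge_key(u, v): w for (u, v), w in edges.items()}
    for z, adjacent in neighbours.items():
        items = list(adjacent.items())
        for i in range(len(items)):
            x, w_xz = items[i]
            for j in range(i + 1, len(items)):
                y, w_zy = items[j]
                if x[0] == y[0]:
                    continue  # residues of the same sequence are never aligned
                key = edge_key(x, y)
                extended[key] = extended.get(key, 0) + min(w_xz, w_zy)
    return extended


# Usage: A0-B0 and B0-C0 are supported directly; A0-C0 is inferred through B0.
library = {(("A", 0), ("B", 0)): 10, (("B", 0), ("C", 0)): 6}
result = triplet_extension(library)
```

In this toy library the pair A0-C0 has no direct support, so the extension introduces it as a new edge through the shared neighbour B0, which is the "introduction of new edges" variant mentioned above.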


Figure 2: Performance of strategies with different information sources on SABmark. Boxplots show developer (FD) score (equivalent to sum-of-pairs [SP] score) achieved by different alignment strategies for (A) "Twilight Zone" and (B) "Superfamily" sets in the SABmark benchmark database. Boxplots display first, second and third quartiles as vertical lines; outliers are shown as pluses. The strategies tested differ in the combination of pairwise sequence information that is used to construct the alignment graph. Combinations include: C, longest common subsequence; L, the four top scoring local alignments; G, global alignment; M, GTG motifs.
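As a rough illustration of the sum-of-pairs (SP) score that the caption equates with the FD score, the sketch below computes the fraction of residue pairs aligned in a reference alignment that are also aligned in a test alignment. The function names and the list-of-gapped-strings input are assumptions for illustration; this is not the scoring program distributed with SABmark or BAliBASE, which may apply further rules such as restricting scoring to particular reference regions.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Collect aligned residue pairs from equal-length gapped rows.

    Each residue is identified as (row_index, index_in_ungapped_sequence),
    so the same pair is recognised even when gap placement differs.
    """
    pairs = set()
    counters = [0] * len(alignment)  # next ungapped index for every row
    for col in range(len(alignment[0])):
        residues = []
        for row, seq in enumerate(alignment):
            if seq[col] != gap:
                residues.append((row, counters[row]))
                counters[row] += 1
        # Rows are visited in order, so every pair has one canonical form.
        pairs.update(combinations(residues, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    reference_pairs = aligned_pairs(reference)
    if not reference_pairs:
        return 0.0
    return len(reference_pairs & aligned_pairs(test)) / len(reference_pairs)


# Usage: the reference aligns the first and last residues of both sequences.
reference = ["AC-G", "A-TG"]
score_full = sum_of_pairs_score(["ACG", "ATG"], reference)
score_half = sum_of_pairs_score(["ACG-", "A-TG"], reference)
score_none = sum_of_pairs_score(["ACG-", "-ATG"], reference)
```

Identifying residues by their ungapped index rather than by column makes the comparison independent of how many gap columns each alignment contains.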

Mentions: We found that using both global and local pairwise alignments to construct the alignment graph resulted in high quality multiple alignments for both the BAliBASE (Figure 1) and SABmark (Figure 2A and 2B) benchmark databases. Adding longest common subsequences to global and local alignments had a minor effect on the quality of the multiple alignments. Adding external information in the form of GTG motifs, extracted using motif tracking as described in the original GTG article [22], increased the quality of the multiple alignments by an average of 1% for both BAliBASE and SABmark (full data available in Additional file 2: Tables S2 and S3).

