Limits...
Formatt: Correcting protein multiple structural alignments by incorporating sequence alignment.

Daniels NM, Nadimpalli S, Cowen LJ - BMC Bioinformatics (2012)

Bottom Line: We show that Formatt outperforms Matt and other popular structure alignment programs on the popular HOMSTRAD benchmark. For the SABMark twilight zone benchmark set that captures more remote homology, Formatt and Matt outperform other programs; depending on choice of embedded sequence aligner, Formatt produces either better sequence and structural alignments with a smaller core size than Matt, or similarly sized alignments with better sequence similarity, for a small cost in average RMSD. Considering sequence information as well as purely geometric information seems to improve quality of multiple structure alignments, though defining what constitutes the best alignment when sequence and structural measures would suggest different alignments remains a difficult open question.


Affiliation: Department of Computer Science, Tufts University, 161 College Ave, Medford, MA 02155, USA.

ABSTRACT

Background: The quality of multiple protein structure alignments is usually computed and assessed based on geometric functions of the coordinates of the backbone atoms from the protein chains. These purely geometric methods do not directly use protein sequence similarity, and in fact, determining the proper way to incorporate sequence similarity measures into the construction and assessment of protein multiple structure alignments has proved surprisingly difficult.

Results: We present Formatt, a multiple structure alignment program based on Matt, a purely geometric multiple structure alignment program, that also takes sequence similarity into account when constructing alignments. We show that Formatt outperforms Matt and other popular structure alignment programs on the popular HOMSTRAD benchmark. For the SABMark twilight zone benchmark set that captures more remote homology, Formatt and Matt outperform other programs; depending on choice of embedded sequence aligner, Formatt produces either better sequence and structural alignments with a smaller core size than Matt, or similarly sized alignments with better sequence similarity, for a small cost in average RMSD.

Conclusions: Considering sequence information as well as purely geometric information seems to improve quality of multiple structure alignments, though defining what constitutes the best alignment when sequence and structural measures would suggest different alignments remains a difficult open question.

Figure 2: Staccato conservation score vs. alignment length. Separation of 1,000 domain pairs where both domains are in the same SCOP family, and 1,000 domain pairs where both domains are in different SCOP families, along the dimensions of Staccato conservation score and core length of the pairwise alignment.

Mentions: In this implementation, we have followed the example of Shatsky et al. [12] in weighting the sequence and structure components of the Staccato score equally, and we have left the choice between a longer aligned core and better alignment quality to the user. Although we use the Staccato scores, this approach has weaknesses. First, the combined “Cons” score, which we use to decide whether Formatt should use a sequence- or structure-based alignment for a particular region, weights the “Seq” and “Str” scores equally, which seems arbitrary. Second, and more seriously, Staccato scores are not length-invariant: while they are appropriate for comparing different alignments of the same length, they will always prefer shorter alignments. In fact, one could worry that the only gain Formatt makes over Matt in Staccato score comes from Formatt preferring shorter, more conservative alignments (particularly when Mafft is used as the sequence aligner). To show that this is not the case, we created a ‘truncated’ Matt alignment by ranking the columns of the Matt alignment by Staccato Cons score and, on a structure-by-structure basis, greedily dropping columns from the Matt alignment until it matched the Formatt (Mafft) alignment in length. This resulted in an identical average core length of 148.2 on the HOMSTRAD and 45.01 on the SABMark benchmarks. However, Formatt (Mafft) is still better than this truncated Matt, both in terms of the Staccato Cons score (1.39 for truncated Matt versus 1.35 for Formatt (Mafft) on HOMSTRAD, and 2.86 for truncated Matt versus 2.81 for Formatt (Mafft) on SABMark) and in terms of the percent correct on HOMSTRAD (78.4% for truncated Matt versus 78.7% for Formatt (Mafft)). This shows that it is worth considering sequence alignment directly, as Formatt does, and not just in terms of Staccato score.
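The truncation step described above can be sketched in a few lines. This is only an illustration, not the authors' code: the function name `truncate_alignment`, the flat list of per-column scores, and the `lower_is_better` flag are all hypothetical. The sketch also assumes that a column's score does not change when other columns are dropped, so that dropping the worst column repeatedly is the same as keeping the best-ranked ones.

```python
from typing import List, Sequence


def truncate_alignment(
    column_scores: Sequence[float],
    target_length: int,
    lower_is_better: bool = True,  # assumption: which direction counts as "better"
) -> List[int]:
    """Return the indices of the columns kept, in their original order.

    Columns are ranked by score and the worst-ranked ones are dropped
    until only `target_length` columns remain.
    """
    if target_length < 0:
        raise ValueError("target_length must be non-negative")
    n = len(column_scores)
    if target_length >= n:
        # Nothing to drop: the alignment is already short enough.
        return list(range(n))
    # Rank column indices from best to worst score.
    ranked = sorted(
        range(n),
        key=lambda i: column_scores[i],
        reverse=not lower_is_better,
    )
    # Keep the best-ranked columns and restore alignment order.
    return sorted(ranked[:target_length])
```

In use, `target_length` would be the core length of the alignment being matched (here, the Formatt (Mafft) alignment of the same structures), and the returned indices select the surviving columns.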
The problem of how to normalize a Staccato measure of alignment ‘quality’ by alignment length remains an interesting question. One way to achieve this normalization is suggested by [8]. A plot of aligned core length versus Staccato conservation score for one thousand random pairs of same-family protein domains and one thousand random pairs of different-family protein domains illustrates a possible method for trading off core length against alignment quality (see Figure 2). We see that an optimal linear separator of 0.126×x−0.213 divides same-family from different-family domains. Thus, given two possible alignments a1 and a2, with Staccato “cons” scores of c1 and c2 and core lengths of l1 and l2 respectively, we could view these as points in the space defined by “cons” score and core length. We could then compute the y-intercept of a line with a slope of .0126 through each point and favor the alignment with the lower y-intercept. We suggest that this would be a plausible way to quantify the trade-off between alignment quality and core length.
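The intercept comparison proposed above amounts to a one-line formula, sketched here under stated assumptions. The names `Alignment`, `y_intercept` and `prefer` are hypothetical, and the sketch assumes that core length is on the x-axis and the “cons” score on the y-axis. Because the text quotes the slope both as 0.126 and as .0126, the slope is left as a required parameter rather than fixed.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """A candidate alignment, seen as a point (core_length, cons_score)."""
    name: str
    cons_score: float   # Staccato "cons" score (assumed y coordinate)
    core_length: float  # aligned core length (assumed x coordinate)


def y_intercept(a: Alignment, slope: float) -> float:
    """Intercept b of the line y = slope * x + b passing through `a`."""
    return a.cons_score - slope * a.core_length


def prefer(a1: Alignment, a2: Alignment, slope: float) -> Alignment:
    """Return the alignment whose line has the lower y-intercept.

    Ties go to the first argument.
    """
    return a1 if y_intercept(a1, slope) <= y_intercept(a2, slope) else a2
```

For example, two alignments with the same “cons” score but core lengths of 100 and 150 have different intercepts for any positive slope, and `prefer` returns the longer one.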

