On the necessity of dissecting sequence similarity scores into segment-specific contributions for inferring protein homology, function prediction and annotation.

Wong WC, Maurer-Stroh S, Eisenhaber B, Eisenhaber F - BMC Bioinformatics (2014)

Bottom Line: Protein sequence similarities to any type of non-globular segment (coiled coils, low complexity regions, transmembrane regions, long loops, etc., where either positional sequence conservation is the result of a very simple, physically induced pattern or integral sequence properties are critical) are pertinent sources of mistaken homologies. Quantitative criteria are required to suppress events of function annotation transfer as a result of false homology assignments. We propose to dissect the total similarity score into fold-critical and other, remaining contributions and suggest that, for a valid homology statement, the fold-relevant score contribution should at least be significant on its own.

Affiliation: Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore. wongwc@bii.a-star.edu.sg.

ABSTRACT

Background: Protein sequence similarities to any type of non-globular segment (coiled coils, low complexity regions, transmembrane regions, long loops, etc., where either positional sequence conservation is the result of a very simple, physically induced pattern or integral sequence properties are critical) are pertinent sources of mistaken homologies. Regrettably, these considerations regularly escape attention in large-scale annotation studies since, often, there is no substitute for manual handling of these cases. Quantitative criteria are required to suppress events of function annotation transfer as a result of false homology assignments.

Results: The sequence homology concept is based on the similarity comparison between the structural elements, the basic building blocks that confer the overall fold of a protein. We propose to dissect the total similarity score into fold-critical and other, remaining contributions and suggest that, for a valid homology statement, the fold-relevant score contribution should at least be significant on its own. As part of the article, we provide the DissectHMMER software program for dissecting HMMER2/3 scores into segment-specific contributions. We show that DissectHMMER reproduces HMMER2/3 scores with sufficient accuracy and that it is useful in automated decisions about homology for instructive sequence examples. To generalize the dissection concept to cases without 3D structural information, we find that a dissection based on alignment quality is an appropriate surrogate. The approach was applied in a large-scale study of SMART and Pfam domains in the space of seed sequences and in the space of UniProt/SwissProt.
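The dissection idea can be illustrated with a minimal Python sketch. It is not the DissectHMMER implementation: it assumes that per-position score contributions are already available as a plain list, that segments are given as hypothetical `Segment` records flagged as fold-critical or not, and that "significant on its own" is approximated by a simple score threshold (`min_fold_score`) instead of a proper E-value calculation. All names and numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Segment:
    """Hypothetical annotation of one alignment segment."""
    name: str
    start: int           # first alignment position, 0-based, inclusive
    end: int             # last alignment position, exclusive
    fold_critical: bool  # True if the segment belongs to the structural core


def dissect_score(position_scores: Sequence[float],
                  segments: Sequence[Segment]) -> Dict[str, float]:
    """Sum the per-position score contributions inside each segment."""
    return {seg.name: sum(position_scores[seg.start:seg.end])
            for seg in segments}


def fold_critical_score(position_scores: Sequence[float],
                        segments: Sequence[Segment]) -> float:
    """Total score contributed by the fold-critical segments only."""
    return sum(sum(position_scores[seg.start:seg.end])
               for seg in segments if seg.fold_critical)


def supports_homology(position_scores: Sequence[float],
                      segments: Sequence[Segment],
                      min_fold_score: float) -> bool:
    """Accept a hit only if the fold-critical part passes the threshold alone.

    Assumption: a score threshold stands in for a significance test.
    """
    return fold_critical_score(position_scores, segments) >= min_fold_score


# Made-up example: six alignment positions, two segments.
example_scores = [2.0, 1.5, -0.5, 4.0, 3.5, 0.5]
example_segments = [
    Segment("tm_helix", 0, 3, fold_critical=False),
    Segment("globular_core", 3, 6, fold_critical=True),
]
per_segment = dissect_score(example_scores, example_segments)
total_score = sum(example_scores)
accepted = supports_homology(example_scores, example_segments,
                             min_fold_score=10.0)
```

In this toy case the total score would pass a threshold of 10, but the fold-critical contribution alone would not, so the hit is rejected. That is the kind of decision the proposed criterion is meant to make.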

Conclusions: Sequence similarity score dissection with regard to fold-critical and other contributions systematically suppresses false hits and, additionally, recovers previously obscured homology relationships such as the one between aquaporins and formate/nitrite transporters that, so far, was only supported by structure comparison.

Figure 12: The distributions of the two classes of quality score for Pfam release 27. Compared to the distributions from SMART (version 6), panel A exhibits the same trimodal distribution, while panel B again shows that the lower quality scores come mainly from weaker alignment positions with fewer than 5 amino acids.

Mentions: The same procedure was performed on Pfam (release 27). Figure 12A exhibits the same trimodal distribution, while Figure 12B once again shows that the low-quality scores originate from alignment positions with fewer than 5 amino acids or from sparsely aligned segments. Table 7 gives the respective false-positive and true-positive rates (FPR, TPR) for various quality score cutoffs. Based on the table, an FPR of 5% corresponds to a quality score of at least 0.14 and yields a TPR of 91%.
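The relationship between a quality score cutoff and the resulting FPR and TPR can be sketched in a few lines of Python. This is an illustrative sketch, not the procedure behind Table 7: it assumes two hypothetical lists of quality scores, one for the class counted as positive and one for the class counted as negative, and that an item is accepted when its score is at or above the cutoff. The scores below are made up and unrelated to the Pfam or SMART data.

```python
from typing import Sequence, Tuple


def rates_at_cutoff(positive_scores: Sequence[float],
                    negative_scores: Sequence[float],
                    cutoff: float) -> Tuple[float, float]:
    """Return (FPR, TPR) when items with score >= cutoff are accepted."""
    tpr = sum(s >= cutoff for s in positive_scores) / len(positive_scores)
    fpr = sum(s >= cutoff for s in negative_scores) / len(negative_scores)
    return fpr, tpr


def lowest_cutoff_for_fpr(positive_scores: Sequence[float],
                          negative_scores: Sequence[float],
                          max_fpr: float) -> Tuple[float, float, float]:
    """Find the smallest observed score usable as cutoff with FPR <= max_fpr.

    FPR can only fall as the cutoff rises, so the first candidate that
    meets the target (in ascending order) also keeps the TPR as high as
    possible. Returns (cutoff, FPR, TPR).
    """
    for cutoff in sorted(set(positive_scores) | set(negative_scores)):
        fpr, tpr = rates_at_cutoff(positive_scores, negative_scores, cutoff)
        if fpr <= max_fpr:
            return cutoff, fpr, tpr
    raise ValueError("no observed score reaches the requested FPR")


# Made-up quality scores for illustration only.
toy_positive = [0.10, 0.20, 0.30, 0.40]
toy_negative = [0.05, 0.08, 0.12, 0.25]
toy_result = lowest_cutoff_for_fpr(toy_positive, toy_negative, max_fpr=0.25)
```

Scanning the observed scores in ascending order mirrors how a table of cutoffs would be read: the first row whose FPR falls to the target value gives the cutoff and its accompanying TPR.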

