Positive and strongly relaxed purifying selection drive the evolution of repeats in proteins


ABSTRACT

Protein repeats are considered hotspots of protein evolution, associated with acquisition of new functions and novel phenotypic traits, including disease. Paradoxically, however, repeats are often strongly conserved through long spans of evolution. To resolve this conundrum, it is necessary to directly compare paralogous (horizontal) evolution of repeats within proteins with their orthologous (vertical) evolution through speciation. Here we develop a rigorous methodology to identify highly periodic repeats with significant sequence similarity, for which evolutionary rates and selection (dN/dS) can be estimated, and systematically characterize their evolution. We show that horizontal evolution of repeats is markedly accelerated compared with their divergence from orthologues in closely related species. This observation is universal across the diversity of life forms and implies a biphasic evolutionary regime whereby new copies experience rapid functional divergence under combined effects of strongly relaxed purifying selection and positive selection, followed by fixation and conservation of each individual repeat.


Figure 1: Three-step computational pipeline to extract repeats, and its validation. (a) First, the period length (L) of repeats is inferred from the most frequent interval (MFI) of frequent triplets (FT). Second, a seed of repeats is identified by aligning all possible k-mers (k=L=MFI) that contain FTs, transforming the alignment into a scoring matrix, and selecting valid top-ranked MFI-mers. Third, a probability position matrix (PPM), Q, is built from the seed over a background, B, to predict additional repeats. See Methods and Supplementary Methods for more details. (b) A set of 18,513 canonical proteins from Swissprot that match information in GenBank, containing the corresponding coding DNA, was analysed. Swissprot annotates 3,045 proteins (blue numbers) that contain at least four repeats of length ≥4 aa, assigned to three distinct classes (Repeats, Domains, Zinc-fingers). The method extracts a subset of 765 of these proteins (red) and identifies 316 novel ones (green), totaling 1,081 proteins with repeats. It excludes non-periodic and/or highly diverged repeats (blue dotted curves), and includes only repeats of identical length and high sequence similarity (red and green solid curves). Repeats in matched proteins (red numbers) have similar IC to the annotated repeats (red dotted versus red solid curves; see also the IC scatter subplot).
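The first and third steps of panel (a) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `min_count` threshold for calling a triplet "frequent", the pseudocount, and the log-odds scoring against the background B are all assumptions made for illustration, and the second step (seed selection through a scoring matrix) is not reproduced.

```python
import math
from collections import Counter, defaultdict


def infer_period(seq, min_count=4):
    """Step 1 (sketch): return the most frequent interval (MFI) between
    consecutive occurrences of frequent triplets (FT), or None.
    min_count is a hypothetical threshold for calling a triplet frequent."""
    positions = defaultdict(list)
    for i in range(len(seq) - 2):
        positions[seq[i:i + 3]].append(i)
    intervals = Counter()
    for pos in positions.values():
        if len(pos) >= min_count:
            # distances between neighbouring occurrences of this triplet
            intervals.update(b - a for a, b in zip(pos, pos[1:]))
    if not intervals:
        return None
    return intervals.most_common(1)[0][0]


def build_ppm(seed, alphabet, pseudocount=1.0):
    """Step 3 (sketch): per-position residue probabilities Q estimated
    from equal-length seed repeats; the pseudocount is an assumption."""
    total = len(seed) + pseudocount * len(alphabet)
    ppm = []
    for j in range(len(seed[0])):
        counts = Counter(s[j] for s in seed)
        ppm.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return ppm


def log_odds(window, ppm, background):
    """Score a candidate window as log2(Q/B) summed over positions."""
    return sum(math.log2(ppm[j][a] / background[a])
               for j, a in enumerate(window))


AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Uniform background B, assumed here only to keep the example small.
UNIFORM_B = {a: 1 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
```

For example, `infer_period("ACDEFGH" * 5)` recovers a period of 7, and a window that matches the seed scores above zero under `log_odds`, while an unrelated window scores below zero. In practice the background would be estimated from real sequence composition rather than taken as uniform.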
© Copyright Policy - open-access

Numerous algorithms for repeat detection have been developed [39]. However, because repeats manifest rich patterns of recurrence, divergence and variable lengths, a single uniform detection methodology does not appear to be attainable, and each algorithm is tuned to identify specific patterns of the repetitive phenomena [40]. Most repeat-detection algorithms are not suitable for large-scale analysis [41]. Moreover, because repeats are relatively short and often diverged, estimates of their evolutionary rates are frequently unreliable. Nonetheless, for subsets of repeats that are highly periodic (that is, have identical length) and possess significant sequence similarity, evolutionary rates can be estimated in many cases and become meaningful when averaged across many comparisons, as long as there are no systematic biases. This is the approach employed in the present study. To this end, we develop a computational pipeline to rapidly extract 'near perfect' repeats from protein sequences in a systematic manner and validate it against the well-annotated human Swissprot reference proteome (Fig. 1). Throughout this study, we focus on repeats that are at least four amino acids long and that recur at least four times in a protein.
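The selection criteria in this paragraph (identical length, at least four residues, at least four copies, high similarity) can be expressed as a simple filter. The sketch below is illustrative only: `is_near_perfect` and `identity` are hypothetical names, and the mean pairwise identity cut-off of 0.7 is an assumed stand-in for the paper's test of significant sequence similarity, which is not specified in this excerpt.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions between two equal-length repeats."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def is_near_perfect(repeats, min_len=4, min_copies=4, min_identity=0.7):
    """Return True if the repeats are periodic and similar enough.
    min_len and min_copies follow the text; min_identity is an assumption."""
    if len(repeats) < min_copies:
        return False
    lengths = {len(r) for r in repeats}
    # highly periodic: every copy must have the same length
    if len(lengths) != 1 or lengths.pop() < min_len:
        return False
    pairs = list(combinations(repeats, 2))
    mean_identity = sum(identity(a, b) for a, b in pairs) / len(pairs)
    return mean_identity >= min_identity
```

A set such as `["ACDE", "ACDE", "ACDF", "ACDE"]` passes this filter, whereas a set with only three copies, with copies of unequal length, or with unrelated sequences does not.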

