Capturing coevolutionary signals in repeat proteins.

Espada R, Parra RG, Mora T, Walczak AM, Ferreiro DU - BMC Bioinformatics (2015)

Bottom Line: Importantly, parameter values obtained for all other interactions are not significantly affected by the equalization. We quantify the robustness of the procedure and assign confidence levels to the interactions, identifying the minimum number of sequences needed to extract evolutionary information in several repeat protein families. The overall procedure can be used to reconstruct the interactions at distances larger than repeat-pairs, identifying the characteristics of the strongest couplings in each family, and can be applied to any system that appears translationally symmetric.


Affiliation: Protein Physiology Lab, Dep de Química Biológica, Facultad de Ciencias Exactas y Naturales, UBA-CONICET-IQUIBICEN, Buenos Aires, Argentina.

ABSTRACT

Background: The analysis of correlations of amino acid occurrences in globular domains has led to the development of statistical tools that can identify native contacts - portions of the chains that come to close distance in folded structural ensembles. Here we introduce a direct coupling analysis for repeat proteins - natural systems for which the identification of folding domains remains challenging.

Results: We show that the inherent translational symmetry of repeat protein sequences introduces a strong bias in the pair correlations at precisely the length scale of the repeat-unit. Equalizing for this bias in an objective way reveals true co-evolutionary signals from which local native contacts can be identified. Importantly, parameter values obtained for all other interactions are not significantly affected by the equalization. We quantify the robustness of the procedure and assign confidence levels to the interactions, identifying the minimum number of sequences needed to extract evolutionary information in several repeat protein families.

Conclusions: The overall procedure can be used to reconstruct the interactions at distances larger than repeat-pairs, identifying the characteristics of the strongest couplings in each family, and can be applied to any system that appears translationally symmetric.


Fig. 5: Robustness of the DI^id procedure. Subsets of alignments were constructed by recurrently removing random groups of sequences from each dataset of repeat pairs. M_eff is the number of effective sequences used in the alignment. (a) Particular examples of the stability of DI^id assignments as sampling changes in the ANK family. The gray shadow delimits the 1 % fluctuation interval set as the convergence criterion. (b) Overall stability of the DI^id assignments in several repeat protein families.

Mentions: For well-determined parameters we expect the true value to be better estimated as sampling increases. Examples of the robustness of the DI^id assignments are shown in the panels of Fig. 5a. While the DI^id of some residue pairs can be confidently established with about 500 effective sequences, other pairs do not reach stable values even when all the available sequences are taken into account (Fig. 5a).

To globally quantify the convergence of the DI^id matrix, we evaluated how many of the residue pairs reach a limiting value within 1 % of the one obtained with the largest sample size. For every subset of sequences, s, we require that $\frac{|DI^{s}_{ij} - DI_{ij}|}{\max(DI) - \min(DI)} < 0.01$, where $DI^{s}_{ij}$ is the DI between positions i and j calculated over the s-th subset, $DI_{ij}$ is the DI on the largest set of sequences, and max(DI) and min(DI) are the maximum and minimum values for all positions in all subsets. Additionally, all subsets larger than subset s must have a standard deviation lower than 1 % of the standard deviation of the DI values from all the subsets. If a residue pair fulfills these conditions, we say it has converged at sample size s.

We quantified how many of the residue pairs satisfy the convergence criteria at various sample sizes (Fig. 5b). The best-sampled families, ANK and TPR, contain enough sequences to converge the DI^id for almost all residue pairs of consecutive repeats. Reducing the number of input sequences results in a loss of convergence at some sites; the DI^id of around 90 % of the residue pairs can be confidently established with about 10 % of the total sequences (M_eff ≈ 3000) (Fig. 5b). If the subsamples are reduced further, the proportion of positions that converge drops catastrophically. Yet even more relaxed criteria for convergence give confident results for the high-ranking DI pairs, as exemplified by the PUF and ANEX families (Fig. 5b). However, the samples for the HEAT family are not sufficient to confidently quantify co-evolving repeat pairs.
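The two convergence conditions described above can be illustrated with a minimal Python sketch. It is not the authors' code. The function names `convergence_mask` and `fraction_converged`, the array layout (subsets ordered by increasing size) and the `tol` parameter are all hypothetical. The text leaves the second condition open to interpretation; here it is assumed to mean that, for each residue pair, the standard deviation of its DI over all subsets strictly larger than s must fall below 1 % of the standard deviation of all DI values pooled over pairs and subsets.

```python
import numpy as np

def convergence_mask(di_subsets, di_full, tol=0.01):
    """Flag residue pairs that have converged at each subsample size.

    di_subsets: array of shape (S, L, L), DI matrices for S subsets
                ordered by increasing size (assumed layout).
    di_full:    array of shape (L, L), DI on the largest sequence set.
    Returns a boolean array of shape (S, L, L).
    """
    di_subsets = np.asarray(di_subsets, dtype=float)
    di_full = np.asarray(di_full, dtype=float)

    # Range of DI over all positions in all subsets (the normalization).
    di_range = di_subsets.max() - di_subsets.min()
    if di_range == 0:
        raise ValueError("DI values are constant; range is zero")

    # Condition 1: within tol of the value on the largest set.
    close = np.abs(di_subsets - di_full) / di_range < tol

    # Condition 2 (assumed reading): fluctuation across all larger
    # subsets stays below tol times the pooled standard deviation.
    pooled_std = di_subsets.std()
    stable = np.ones(di_subsets.shape, dtype=bool)
    for s in range(di_subsets.shape[0] - 1):
        stable[s] = di_subsets[s + 1:].std(axis=0) < tol * pooled_std

    return close & stable

def fraction_converged(mask):
    """Fraction of distinct pairs (i < j) converged at each subset."""
    iu = np.triu_indices(mask.shape[1], k=1)
    return mask[:, iu[0], iu[1]].mean(axis=1)
```

Plotting `fraction_converged` against the effective number of sequences of each subset would give a curve of the kind summarized in Fig. 5b, under the assumptions stated above.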

