Estimation and efficient computation of the true probability of recurrence of short linear protein sequence motifs in unrelated proteins.

Davey NE, Edwards RJ, Shields DC - BMC Bioinformatics (2010)

Bottom Line: However, it approximates the probability of all other possible motifs, occurring with a score of p or less, as being equal to p. Sig'v, which corrects both approximations, is shown to be uniformly distributed in a random dataset when searching for non-ambiguous motifs, indicating that it is a robust significance measure. A method is presented to compute exactly the true probability of a non-ambiguous short protein sequence motif, and the utility of an approximate approach for novel motif discovery across a large number of datasets is demonstrated.


Affiliation: UCD Complex and Adaptive Systems Laboratory, University College Dublin, Dublin, Ireland. davey@embl.de

ABSTRACT

Background: Large datasets of protein interactions provide a rich resource for the discovery of Short Linear Motifs (SLiMs) that recur in unrelated proteins. However, existing methods for estimating the probability of motif recurrence may be biased by the size and composition of the search dataset, such that p-value estimates from different datasets, or from motifs containing different numbers of non-wildcard positions, are not strictly comparable. Here, we develop more exact methods and explore the potential biases of computationally efficient approximations.

Results: A widely used heuristic for the calculation of motif over-representation approximates motif probability by assuming that all proteins have the same length and composition. We introduce pv, which calculates the probability exactly. Secondly, the recently introduced SLiMFinder statistic Sig accounts for multiple testing (across all possible motifs) in motif discovery. However, it approximates the probability of all other possible motifs, occurring with a score of p or less, as being equal to p. Here, we show that the exhaustive calculation of the probability of all possible motif occurrences that are as rare or rarer than the motif of interest, Sig', may be carried out efficiently by grouping motifs of a common probability (i.e. those which have permuted orders of the same residues). Sig'v, which corrects both approximations, is shown to be uniformly distributed in a random dataset when searching for non-ambiguous motifs, indicating that it is a robust significance measure.

Conclusions: A method is presented to compute exactly the true probability of a non-ambiguous short protein sequence motif, and the utility of an approximate approach for novel motif discovery across a large number of datasets is demonstrated.
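The grouping idea mentioned in the Results can be illustrated with a minimal sketch. It is not the authors' implementation of Sig': it assumes a simple model in which the per-site probability of a fixed-position motif is the product of independent residue frequencies, so every permutation of the same residues shares one probability. The names `composition_classes` and `count_as_rare_or_rarer` and the toy frequencies in the usage note are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import factorial, prod


def composition_classes(freqs, k):
    """Yield (residues, probability, orderings) for each residue multiset.

    freqs maps residue -> background frequency (assumed independent per site).
    k is the number of defined (non-wildcard) positions.
    """
    for combo in combinations_with_replacement(sorted(freqs), k):
        # Multinomial coefficient: distinct orderings of this multiset.
        orderings = factorial(k)
        for count in Counter(combo).values():
            orderings //= factorial(count)
        # All orderings share the same product of residue frequencies.
        probability = prod(freqs[r] for r in combo)
        yield combo, probability, orderings


def count_as_rare_or_rarer(freqs, k, p):
    """Count motifs whose per-site probability is <= p, one class at a time."""
    return sum(n for _, q, n in composition_classes(freqs, k) if q <= p)
```

With a toy three-letter alphabet such as `{"A": 0.5, "B": 0.3, "C": 0.2}` and `k = 2`, the nine possible motifs collapse into six classes, so the rarity count needs six probability evaluations instead of nine. For a 20-residue alphabet the saving grows quickly with motif length.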


Figure 2: Test for the comparability of the significance scores, for the 4 tested significance scoring methods, between fixed-position motifs of different length and datasets of different size. (a) Boxplot comparisons of the 4 tested significance scoring schemes, for each dataset size and motif length, of the distribution of the significance values of top-ranking motifs. The first boxplot in each panel is the uniform distribution. (b) The heatmap shows the Mann-Whitney p-value for all-by-all comparisons of the 3 dataset sizes and 3 motif lengths. The Mann-Whitney test, in this case, tests for rejection of the hypothesis that the distributions of Sig values of top-ranking motifs of different length and dataset size are sampled from the same distribution.

Mentions: No biases are evident with the Sig'v scoring scheme, where the hypothesis that the significance scores were sampled from the same distribution was rejected (p < 0.05) for only one of the 36 comparisons (Fig. 2b), consistent with expectations. The Sig statistic showed some biases but, somewhat surprisingly, the Sigv statistic appeared more biased than the Sig statistic (Fig. 2b): in particular, the larger datasets seemed to be biased in a way that the smaller datasets are not (Fig. 2a). However, it can be seen that the majority of Sig correlation between datasets relates to p-values in the range of 0.8 to 1.0 (Fig. 2a) and is related to the scheme's conservative bias. The box-plots in Fig. 2a indicate that Sig' is reasonably similarly distributed for all motif lengths and dataset sizes, but it is noticeable that the mean p-value varies more across datasets and motif sizes than Sig'v does. What is not clear from these random datasets is the extent to which these two statistics are biased for motifs that are significantly over-enriched (i.e. in ranking true positives, at the other end of the p-value scale).
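The all-by-all comparison described above can be sketched as follows. This is an illustration only, not the authors' analysis code: it uses a two-sided Mann-Whitney U test with the normal approximation and no tie or continuity correction, and the function names `mann_whitney_u` and `all_by_all` are hypothetical. Note that 3 dataset sizes by 3 motif lengths give 9 conditions and therefore 36 distinct pairs, which matches the number of comparisons quoted in the text.

```python
from itertools import combinations
from math import erfc, sqrt


def mann_whitney_u(xs, ys):
    """Return (U for xs, two-sided p-value) using the normal approximation."""
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at index i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        n_from_x = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_x += avg_rank * n_from_x
        i = j
    n1, n2 = len(xs), len(ys)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, erfc(abs(z) / sqrt(2))


def all_by_all(samples):
    """Map each unordered pair of condition labels to its Mann-Whitney p-value.

    samples maps a condition label, e.g. (dataset_size, motif_length),
    to the list of top-ranking significance values for that condition.
    """
    return {
        (a, b): mann_whitney_u(samples[a], samples[b])[1]
        for a, b in combinations(sorted(samples), 2)
    }
```

For large samples with many ties, a library routine with tie correction would be a more careful choice than this approximation.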

