Efficient and accurate P-value computation for Position Weight Matrices. Touzet H, Varré JS - Algorithms Mol Biol (2007) Bottom Line: For many examples of PWMs, they fail to give accurate results in a reasonable amount of time.The main idea is to use a series of discretized score distributions that improves the final result step by step until some convergence criterion is met.Experimental results show that it achieves better performance in terms of computational time and precision than existing tools. View Article: PubMed Central - HTML - PubMed Affiliation: LIFL, UMR CNRS 8022, Université des Sciences et Technologies de Lille, 59655 Villeneuve d'Ascq, France. helene.touzet@lifl.fr ABSTRACTBackground: Position Weight Matrices (PWMs) are probabilistic representations of signals in sequences. They are widely used to model approximate patterns in DNA or in protein sequences. The usage of PWMs needs as a prerequisite to knowing the statistical significance of a word according to its score. This is done by defining the P-value of a score, which is the probability that the background model can achieve a score larger than or equal to the observed value. This gives rise to the following problem: Given a P-value, find the corresponding score threshold. Existing methods rely on dynamic programming or probability generating functions. For many examples of PWMs, they fail to give accurate results in a reasonable amount of time.Results: The contribution of this paper is two fold. First, we study the theoretical complexity of the problem, and we prove that it is NP-hard. Then, we describe a novel algorithm that solves the P-value problem efficiently. The main idea is to use a series of discretized score distributions that improves the final result step by step until some convergence criterion is met. Moreover, the algorithm is capable of calculating the exact P-value without any error, even for matrices with non-integer coefficient values. The same approach is also used to devise an accurate algorithm for the reverse problem: finding the P-value for a given score. Both methods are implemented in a software called TFM-PVALUE, that is freely available.Conclusion: We have tested TFM-PVALUE on a large set of PWMs representing transcription factor binding sites. Experimental results show that it achieves better performance in terms of computational time and precision than existing tools. No MeSH data available. Related in: MedlinePlus © Copyright Policy - open-access Related In: Results  -  Collection License getmorefigures.php?uid=PMC2238751&req=5 .flowplayer { width: px; height: px; } Figure 7: Time efficiency for granularity 10-3. We compare the running time for the computation of the score threshold associated to a given P-value for FROM P-VALUE TO SCORE and LAZYDISTRIBUTION onto the Jaspar matrices with a granularity set to 10-3. We choose two P-value levels: 10-3 and 10-6. There are 122 matrices (resp. 75 matrices) that can achieve a P-value equal to 10-3 (resp. 10-6). For each algorithm, we classified the matrices into four groups according to the time needed to complete the computation: less than 0.1 second, from 0.1 second to 1 second, from 1 second to 1 minute, and greater than 1 minute. The results are represented by a histogram with four bars. The height of each bar gives the percentage of matrices involved and the number at the top of each bar indicates the corresponding number of matrices. Mentions: We first chose a granularity of 10-3 for the two algorithms and computed the score associated to P-values equal to 10-3 and 10-6 for each matrix of the Jaspar database (see Figure 7). The results show that TFM-PVALUE outperforms LAZYDISTRIBUTION in both cases. With the P-value set to 10-3, the average computation time is 0.64 second per matrix for LAZYDISTRIBUTION compared to 0.03 second for TFM-PVALUE. Considering each matrix individually, TFM-PVALUE is 61 times faster than LAZYDISTRIBUTION. With the P-value set to 10-6, the average computation time is 0.118 second per matrix for LAZYDISTRIBUTION and 0.019 second for TFM-PVALUE. Considering each matrix individually, TFM-PVALUE is 15 times faster than LAZYDISTRIBUTION.

Efficient and accurate P-value computation for Position Weight Matrices.

Touzet H, Varré JS - Algorithms Mol Biol (2007)

Related In: Results  -  Collection

Show All Figures
getmorefigures.php?uid=PMC2238751&req=5

Figure 7: Time efficiency for granularity 10-3. We compare the running time for the computation of the score threshold associated to a given P-value for FROM P-VALUE TO SCORE and LAZYDISTRIBUTION onto the Jaspar matrices with a granularity set to 10-3. We choose two P-value levels: 10-3 and 10-6. There are 122 matrices (resp. 75 matrices) that can achieve a P-value equal to 10-3 (resp. 10-6). For each algorithm, we classified the matrices into four groups according to the time needed to complete the computation: less than 0.1 second, from 0.1 second to 1 second, from 1 second to 1 minute, and greater than 1 minute. The results are represented by a histogram with four bars. The height of each bar gives the percentage of matrices involved and the number at the top of each bar indicates the corresponding number of matrices.
Mentions: We first chose a granularity of 10-3 for the two algorithms and computed the score associated to P-values equal to 10-3 and 10-6 for each matrix of the Jaspar database (see Figure 7). The results show that TFM-PVALUE outperforms LAZYDISTRIBUTION in both cases. With the P-value set to 10-3, the average computation time is 0.64 second per matrix for LAZYDISTRIBUTION compared to 0.03 second for TFM-PVALUE. Considering each matrix individually, TFM-PVALUE is 61 times faster than LAZYDISTRIBUTION. With the P-value set to 10-6, the average computation time is 0.118 second per matrix for LAZYDISTRIBUTION and 0.019 second for TFM-PVALUE. Considering each matrix individually, TFM-PVALUE is 15 times faster than LAZYDISTRIBUTION.

Bottom Line: For many examples of PWMs, they fail to give accurate results in a reasonable amount of time.The main idea is to use a series of discretized score distributions that improves the final result step by step until some convergence criterion is met.Experimental results show that it achieves better performance in terms of computational time and precision than existing tools.

View Article: PubMed Central - HTML - PubMed

Affiliation: LIFL, UMR CNRS 8022, Université des Sciences et Technologies de Lille, 59655 Villeneuve d'Ascq, France. helene.touzet@lifl.fr

ABSTRACT

Background: Position Weight Matrices (PWMs) are probabilistic representations of signals in sequences. They are widely used to model approximate patterns in DNA or in protein sequences. The usage of PWMs needs as a prerequisite to knowing the statistical significance of a word according to its score. This is done by defining the P-value of a score, which is the probability that the background model can achieve a score larger than or equal to the observed value. This gives rise to the following problem: Given a P-value, find the corresponding score threshold. Existing methods rely on dynamic programming or probability generating functions. For many examples of PWMs, they fail to give accurate results in a reasonable amount of time.

Results: The contribution of this paper is two fold. First, we study the theoretical complexity of the problem, and we prove that it is NP-hard. Then, we describe a novel algorithm that solves the P-value problem efficiently. The main idea is to use a series of discretized score distributions that improves the final result step by step until some convergence criterion is met. Moreover, the algorithm is capable of calculating the exact P-value without any error, even for matrices with non-integer coefficient values. The same approach is also used to devise an accurate algorithm for the reverse problem: finding the P-value for a given score. Both methods are implemented in a software called TFM-PVALUE, that is freely available.

Conclusion: We have tested TFM-PVALUE on a large set of PWMs representing transcription factor binding sites. Experimental results show that it achieves better performance in terms of computational time and precision than existing tools.

No MeSH data available.

Related in: MedlinePlus