Evaluating a linear k-mer model for protein-DNA interactions using high-throughput SELEX data.

Kähärä J, Lähdesmäki H - BMC Bioinformatics (2013)

Bottom Line: We implemented the standard cross-validation scheme to reduce the number of k-mers in the model and observed that the number of k-mers can often be reduced significantly without a great negative effect on prediction accuracy. We also found that the later SELEX enrichment cycles provide much better discrimination between bound and unbound sequences, as model prediction accuracies increased with the cycle number for all proteins. Consistent with previous results on PBM data, the performance of the k-mer model was on average 9 percentage points better than that of the PWM models.

ABSTRACT
Transcription factor (TF) binding to DNA can be modeled in a number of different ways. It is highly debated which modeling methods are the best, how the models should be built and what they can be applied to. In this study, a linear k-mer model proposed for predicting TF specificity in protein binding microarrays (PBM) is applied to high-throughput SELEX data, and the question of how to choose the most informative k-mers for the binding model is studied. We implemented the standard cross-validation scheme to reduce the number of k-mers in the model and observed that the number of k-mers can often be reduced significantly without a great negative effect on prediction accuracy. We also found that the later SELEX enrichment cycles provide much better discrimination between bound and unbound sequences, as model prediction accuracies increased with the cycle number for all proteins. We compared the prediction performance of k-mer and position specific weight matrix (PWM) models derived from the same SELEX data. Consistent with previous results on PBM data, the performance of the k-mer model was on average 9 percentage points better. For the 15 proteins in the SELEX data set with medium enrichment cycles, classification accuracies were on average 71% and 62% for the k-mer and PWM models, respectively. Finally, the k-mer model trained with SELEX data was evaluated on ChIP-seq data, demonstrating substantial improvements for some proteins. For protein GATA1 the model can distinguish between true ChIP-seq peaks and negative peaks. For proteins RFX3 and NFATC1 the performance of the model was no better than chance.
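As a rough illustration of what a linear k-mer model computes, the sketch below counts overlapping k-mers in a sequence and scores the sequence as a weighted sum of those counts. The function names, the example weights and the choice of k are hypothetical; the abstract does not specify the feature construction or how the weights are fitted, so this is only a minimal sketch of the general idea, not the authors' implementation.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a DNA sequence (assumed plain ACGT text)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def linear_score(seq, weights, k):
    """Score a sequence as the sum of weight * count over its k-mers.

    `weights` maps k-mer -> weight; k-mers missing from it contribute 0.
    The weights here are illustrative, not fitted values from the paper.
    """
    counts = kmer_counts(seq, k)
    return sum(weights.get(kmer, 0.0) * n for kmer, n in counts.items())


if __name__ == "__main__":
    example_weights = {"AC": 1.5, "GT": -1.0}  # hypothetical weights
    print(kmer_counts("ACGTAC", 2))
    print(linear_score("ACGTAC", example_weights, 2))
```

Reducing the number of k-mers in the model, as described above, corresponds in this sketch to keeping fewer entries in the weight dictionary.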

Figure 5: Classification accuracy in different enrichment cycles. The classification accuracy from different SELEX enrichment cycles for five proteins. The 95% normal approximation confidence intervals are plotted around the accuracies.

Mentions: We also investigated how the choice of SELEX enrichment cycle affects classification accuracy. The results are plotted in Figure 5. Performance is clearly higher at later cycles. This is intuitive, because the set of sequences should become more homogeneous as the enrichment process proceeds. Classifying sequences from the first SELEX rounds is also much more difficult than classifying those from later cycles, and at later cycles there is much more variability in the results between proteins. The round used in the performance comparison is marked with a star.
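The Figure 5 caption refers to 95% normal approximation confidence intervals around the accuracies. A minimal sketch of that standard interval for a proportion is shown below; the function name and the example counts are hypothetical, and the test set sizes used in the paper are not given in this excerpt.

```python
import math


def accuracy_ci(n_correct, n_total, z=1.96):
    """Normal approximation (Wald) confidence interval for a classification accuracy.

    z = 1.96 corresponds to a 95% interval. The bounds are clipped to [0, 1].
    """
    p = n_correct / n_total
    half_width = z * math.sqrt(p * (1.0 - p) / n_total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


if __name__ == "__main__":
    # Hypothetical example: 80 of 100 test sequences classified correctly.
    low, high = accuracy_ci(80, 100)
    print(f"accuracy 0.80, 95% CI [{low:.4f}, {high:.4f}]")
```

The interval narrows as the number of test sequences grows, so wide intervals in such a plot would point to small test sets rather than to differences between proteins.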

