Evaluating a linear k-mer model for protein-DNA interactions using high-throughput SELEX data.

Kähärä J, Lähdesmäki H - BMC Bioinformatics (2013)

Bottom Line: We implemented the standard cross-validation scheme to reduce the number of k-mers in the model and observed that the number of k-mers can often be reduced significantly without a great negative effect on prediction accuracy. We also found that the later SELEX enrichment cycles provide much better discrimination between bound and unbound sequences, as model prediction accuracies increased with the cycle number for all proteins. Consistent with previous results on PBM data, performance of the k-mer model was on average 9 percentage points better than that of the PWM model.

ABSTRACT
Transcription factor (TF) binding to DNA can be modeled in a number of different ways. It is highly debated which modeling methods are the best, how the models should be built and what they can be applied to. In this study, a linear k-mer model proposed for predicting TF specificity in protein binding microarrays (PBM) is applied to high-throughput SELEX data, and the question of how to choose the most informative k-mers for the binding model is studied. We implemented the standard cross-validation scheme to reduce the number of k-mers in the model and observed that the number of k-mers can often be reduced significantly without a great negative effect on prediction accuracy. We also found that the later SELEX enrichment cycles provide much better discrimination between bound and unbound sequences, as model prediction accuracies increased with the cycle number for all proteins. We compared the prediction performance of k-mer and position specific weight matrix (PWM) models derived from the same SELEX data. Consistent with previous results on PBM data, performance of the k-mer model was on average 9 percentage points better. For the 15 proteins in the SELEX data set with medium enrichment cycles, classification accuracies were on average 71% and 62% for the k-mer and PWM models, respectively. Finally, the k-mer model trained with SELEX data was evaluated on ChIP-seq data, demonstrating substantial improvements for some proteins. For protein GATA1 the model can distinguish between true ChIP-seq peaks and negative peaks. For proteins RFX3 and NFATC1 the performance of the model was no better than chance.
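To make the idea of a linear k-mer model concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a sequence is represented by the counts of a chosen set of equal-length k-mers and that the score is a weighted sum of those counts. The function names, the example k-mers and the weights are all hypothetical, and reverse complements and the cross-validation selection of k-mers are not handled here.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def feature_vector(seq, kmers):
    """Represent a sequence by the counts of a chosen list of k-mers.

    Assumes all k-mers in the list have the same length.
    """
    k = len(kmers[0])
    counts = kmer_counts(seq, k)
    return [counts[kmer] for kmer in kmers]


def linear_score(seq, kmers, weights, bias=0.0):
    """Linear model: bias plus the weighted sum of k-mer counts."""
    x = feature_vector(seq, kmers)
    return bias + sum(w * c for w, c in zip(weights, x))


# Hypothetical example: two 4-mers with made-up weights.
kmers = ["GATA", "TATC"]
weights = [1.5, 1.0]
score = linear_score("AGATAAGATA", kmers, weights)
print(score)
```

In this sketch, reducing the number of k-mers in the model corresponds to shortening the `kmers` list and its matching `weights`.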

Figure 9: AUC as a function of the number of k-mers in the model for GATA1 ChIP-seq samples, when k-mers are selected with the CV-scheme. In the left figure (red), the AUC is plotted when using the most frequent k-mers. In the right figure, the AUC is calculated when only the centers of the ChIP-seq peaks are used. The 95% Mann-Whitney confidence intervals are plotted around the curves.
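The caption does not say how the 95% confidence intervals were computed. One common approximation for an AUC derived from the Mann-Whitney statistic is the Hanley-McNeil standard error, sketched below. Its use here is an assumption, and the function name and the sample sizes are made up for illustration.

```python
import math


def hanley_mcneil_ci(auc, n_pos, n_neg, z=1.96):
    """Approximate confidence interval for an AUC (Hanley-McNeil).

    auc   : estimated area under the ROC curve
    n_pos : number of positive examples (e.g. true peaks)
    n_neg : number of negative examples (e.g. false peaks)
    z     : normal quantile, 1.96 for a 95% interval
    """
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc * auc)
        + (n_neg - 1) * (q2 - auc * auc)
    ) / (n_pos * n_neg)
    se = math.sqrt(variance)
    # Clip to the valid AUC range [0, 1].
    return max(0.0, auc - z * se), min(1.0, auc + z * se)


# Made-up sample sizes, for illustration only.
low, high = hanley_mcneil_ci(0.9, 500, 500)
print(low, high)
```

Under this approximation the interval narrows as the number of positive and negative examples grows.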

Mentions: For GATA1 the results are more interesting, as in some cases the k-mer model provides great predictive power. The ChIP-seq peaks can be quite accurately distinguished from false peaks, as the AUC rises close to 0.9 with a relatively high number of k-mers in the model (Figure 9, left). The summits, in turn, can be distinguished from false peaks by using only a small number of k-mers selected with the cross-validation scheme (Figure 9, right). There is, however, a somewhat controversial phenomenon of the AUC dropping significantly below 0.5 when a greater number of k-mers is used in the model. Whether it is a property of the data or of the k-mer model remains to be investigated. Nevertheless, taken together, our results indicate that the previously reported poor performance of the k-mer model on in vivo data can be improved using careful feature selection strategies.
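As a reminder of what an AUC below 0.5 means, the sketch below computes the AUC as the Mann-Whitney statistic divided by the number of positive-negative pairs, that is, the fraction of pairs in which the positive example receives the higher score. The scores are made up and the function name is hypothetical; this illustrates the metric only and does not explain the drop reported above.

```python
def auc_mann_whitney(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This equals the Mann-Whitney U
    statistic divided by the number of pairs.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up scores, for illustration only.
peaks = [0.9, 0.8, 0.4]
background = [0.7, 0.3, 0.2]

auc = auc_mann_whitney(peaks, background)

# Negating every score reverses the ranking, so the AUC becomes 1 - auc.
flipped = auc_mann_whitney([-s for s in peaks], [-s for s in background])
print(auc, flipped)
```

An AUC significantly below 0.5 therefore indicates that the model systematically ranks negative examples above positive ones, which differs from the chance-level performance expected near 0.5.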

