Evaluating a linear k-mer model for protein-DNA interactions using high-throughput SELEX data.

Kähärä J, Lähdesmäki H - BMC Bioinformatics (2013)

Bottom Line: We implemented the standard cross-validation scheme to reduce the number of k-mers in the model and observed that the number of k-mers can often be reduced significantly without a great negative effect on prediction accuracy. We also found that the later SELEX enrichment cycles discriminate much better between bound and unbound sequences: model prediction accuracy increased with the cycle number for all proteins. Consistent with previous results on PBM data, the k-mer model performed on average 9 percentage points better than the PWM model.


ABSTRACT
Transcription factor (TF) binding to DNA can be modeled in a number of different ways. It is highly debated which modeling methods are the best, how the models should be built and what they can be applied to. In this study, a linear k-mer model proposed for predicting TF specificity in protein binding microarrays (PBM) is applied to high-throughput SELEX data, and the question of how to choose the most informative k-mers for the binding model is studied. We implemented the standard cross-validation scheme to reduce the number of k-mers in the model and observed that the number of k-mers can often be reduced significantly without a great negative effect on prediction accuracy. We also found that the later SELEX enrichment cycles provide much better discrimination between bound and unbound sequences, as model prediction accuracies increased with the cycle number for all proteins. We compared the prediction performance of k-mer and position-specific weight matrix (PWM) models derived from the same SELEX data. Consistent with previous results on PBM data, the performance of the k-mer model was on average 9 percentage points better. For the 15 proteins in the SELEX data set with medium enrichment cycles, classification accuracies were on average 71% and 62% for the k-mer and PWM models, respectively. Finally, the k-mer model trained with SELEX data was evaluated on ChIP-seq data, demonstrating substantial improvements for some proteins. For protein GATA1, the model can distinguish between true ChIP-seq peaks and negative peaks. For proteins RFX3 and NFATC1, the performance of the model was no better than chance.
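
To make the idea of a linear k-mer model concrete, the following is a minimal sketch, not the authors' implementation. It assumes the model scores a sequence as a weighted sum of its overlapping k-mer counts. The function names, the `bias` term and the example weights are hypothetical and chosen only for illustration; reverse-complement handling and model fitting are left out.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def linear_kmer_score(seq, weights, k, bias=0.0):
    """Return bias + sum over k-mers of (weight * count in seq).

    `weights` maps a k-mer string to its coefficient; k-mers that are
    absent from the sequence contribute nothing to the score.
    """
    counts = kmer_counts(seq, k)
    return bias + sum(w * counts.get(kmer, 0) for kmer, w in weights.items())


# Hypothetical weights for illustration only (not fitted to any data).
example_weights = {"GATA": 1.5, "ATAA": 0.5, "CCCC": -1.0}
score = linear_kmer_score("TTGATAAGG", example_weights, k=4)
```

In a classification setting such as the one described above, a sequence would be called bound when its score exceeds a chosen threshold; the threshold and the way the weights are estimated are outside the scope of this sketch.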


Figure 3: Illustration of classification accuracy using the CV scheme for four proteins. The test set classification accuracy is shown as a function of the number of k-mers when the k-mers are chosen with the CV scheme. The CV is started from the 100 most frequent k-mers. The 95% normal-approximation confidence intervals are plotted around the curves.
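
The caption refers to 95% normal-approximation confidence intervals. A minimal sketch of that calculation for a classification accuracy is given below, using the standard formula p ± z·sqrt(p(1 − p)/n). The function name and the example counts are assumptions for illustration; the test set sizes used in the paper are not stated here.

```python
import math


def accuracy_confidence_interval(n_correct, n_total, z=1.96):
    """Normal-approximation (Wald) interval for a classification accuracy.

    z = 1.96 corresponds to a two-sided 95% interval. The bounds are
    clipped to the valid accuracy range [0, 1].
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p = n_correct / n_total
    half_width = z * math.sqrt(p * (1.0 - p) / n_total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical counts: 710 of 1000 test sequences classified correctly.
low, high = accuracy_confidence_interval(710, 1000)
```

With these hypothetical counts, the interval runs from roughly 0.68 to 0.74. The approximation becomes unreliable for small test sets or accuracies close to 0 or 1.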

Mentions: The cross-validation gives a slight improvement to the results. If we start with, for example, 200 k-mers, we can end up with a set of 100 k-mers that is better than simply taking the 100 most frequent k-mers, which is the main motivation for feature selection in discriminatory analysis. The change in classification accuracy when starting with 100 k-mers is shown for four proteins in Figure 3. The blue line represents the accuracy in the 10-fold cross-validation within the training set, and the red line represents the classification accuracy in the test set. The green line represents the classification accuracy when the model is trained using an equal number of the most frequent k-mers in the data. The horizontal axis represents the number of k-mers in the model.
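
The two selection strategies compared in the passage above can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the text here does not spell out how k-mers are removed, so greedy backward elimination is assumed. `cv_accuracy` is a hypothetical caller-supplied function that would train the model on a given k-mer subset and return its cross-validated accuracy.

```python
from collections import Counter


def most_frequent_kmers(sequences, k, n):
    """Return the n k-mers with the highest total count across sequences."""
    totals = Counter()
    for seq in sequences:
        seq = seq.upper()
        totals.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Sort by count (descending), then alphabetically so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [kmer for kmer, _ in ranked[:n]]


def backward_eliminate(kmers, cv_accuracy, target_size):
    """Greedily drop the k-mer whose removal leaves the best cv_accuracy.

    Assumed scheme: at each step every single-k-mer removal is scored and
    the best-scoring reduced set is kept, until target_size k-mers remain.
    """
    if target_size < 0:
        raise ValueError("target_size must be non-negative")
    selected = list(kmers)
    while len(selected) > target_size:
        best_subset, best_score = None, float("-inf")
        for kmer in selected:
            subset = [x for x in selected if x != kmer]
            score = cv_accuracy(subset)
            if score > best_score:
                best_subset, best_score = subset, score
        selected = best_subset
    return selected


# Toy input for illustration only.
top3 = most_frequent_kmers(["GATAAG", "AGATAA", "CCGATA"], k=4, n=3)
```

Starting from the 200 most frequent k-mers and eliminating down to 100 would correspond to the CV-selected set, while `most_frequent_kmers(sequences, k, 100)` alone corresponds to the frequency-only baseline. Each elimination step scores every remaining k-mer, so the cost grows quadratically with the size of the starting set.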

