MARZ: an algorithm to combinatorially analyze gapped n-mer models of transcription factor binding.

Zellers RG, Drewell RA, Dresch JM - BMC Bioinformatics (2015)

Bottom Line: A number of computational approaches have been developed to examine these interactions, including simple mononucleotide and dinucleotide position weight matrix models. Here we develop a novel, unbiased computational algorithm, MARZ, that systematically analyzes all possible gapped matrices across a fixed number of nucleotides. Our results indicate that in many cases gapped matrix models can outperform traditional models, but that the relative strength of the binding sites considered in the analysis can profoundly influence the predictive ability of specific models.


Affiliation: Department of Computer Science, Harvey Mudd College, 301 Platt Boulevard, Claremont, CA 91711, USA. rzellers@hmc.edu.

ABSTRACT

Background: A key challenge in understanding the molecular mechanisms that control gene regulation is the characterization of the specificity with which transcription factor proteins bind to specific DNA sequences. A number of computational approaches have been developed to examine these interactions, including simple mononucleotide and dinucleotide position weight matrix models.

Results: Here we develop a novel, unbiased computational algorithm, MARZ, that systematically analyzes all possible gapped matrices across a fixed number of nucleotides. In addition, to evaluate the ability of these matrix models to predict in vivo binding sites, we utilize a new scoring system and, in combination with established scoring methods and statistical analysis, test the performance of 32 different gapped matrices on the well-characterized HUNCHBACK transcription factor in Drosophila.
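
To make "all possible gapped matrices across a fixed number of nucleotides" concrete, the following is a minimal sketch of how such patterns could be enumerated. It is not the MARZ implementation. The window length of 6, the rule that the first position is always included, and the names `gapped_masks` and `apply_mask` are assumptions made for illustration; they were chosen only because they happen to yield 32 patterns, which matches the number of matrices tested.

```python
from itertools import product

WINDOW = 6  # hypothetical window length, chosen so the count comes out to 32


def gapped_masks(window=WINDOW):
    """Return every include/gap pattern over `window` positions.

    Position 0 is always included (an assumption for this sketch); each
    remaining position is either included ('1') or treated as a gap ('0').
    """
    return ["1" + "".join(bits) for bits in product("01", repeat=window - 1)]


def apply_mask(site, mask):
    """Keep only the nucleotides of `site` at positions the mask includes."""
    return "".join(base for base, keep in zip(site, mask) if keep == "1")


masks = gapped_masks()
print(len(masks))                      # number of distinct patterns
print(apply_mask("ACGTAC", "101001"))  # nucleotides at positions 0, 2 and 5
```

Under these assumptions, the all-included pattern `111111` corresponds to an ungapped model over the full window, and every other pattern skips at least one position.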

Conclusions: Our results indicate that in many cases gapped matrix models can outperform traditional models, but that the relative strength of the binding sites considered in the analysis can profoundly influence the predictive ability of specific models.


Figure 4: RZ score evaluation for all 32 gapped n-mer matrices for HB. The x-axis corresponds to the threshold position used for each run of the MARZ algorithm. The y-axis corresponds to the RZ score obtained from each run. At a given threshold, the central mark represents the median RZ score of all gapped n-mer matrices, the boxes enclose the 25th to 75th percentiles of the data set, whiskers extend to all other points not considered outliers, and outliers are plotted separately (red crosses).
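
The box-plot elements described in the caption can be computed for a single threshold as in the sketch below. The caption does not say how outliers are defined, so the common 1.5 × IQR rule is assumed here; the function name `box_summary` and the example scores are hypothetical and are not data from the paper.

```python
from statistics import median, quantiles


def box_summary(scores, whisker=1.5):
    """Summarize one threshold's scores the way a box plot draws them.

    Points farther than `whisker` * IQR from the box are treated as
    outliers (an assumed convention, not stated in the caption).
    """
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    inliers = [s for s in scores if low <= s <= high]
    return {
        "median": median(scores),
        "q1": q1,
        "q3": q3,
        "whiskers": (min(inliers), max(inliers)),
        "outliers": sorted(s for s in scores if s < low or s > high),
    }


# Made-up scores for illustration only; not values from the paper.
example = [0.60, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.90]
summary = box_summary(example)
print(summary)
```

Running `box_summary` once per threshold position would give the per-threshold medians, boxes, whiskers and outliers that a plot of this kind displays.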

Mentions: The relatively uninformative scores obtained using AUROC led us to investigate alternative scoring methods to assess the performance of the 32 matrices. The RZ score effectively measures the predictive ability of a particular matrix to discriminate between in vivo confirmed ‘real’ binding sites contained in ChIP peaks from the genome and sequences from ‘scrambled’ ChIP peaks. In the case of HB (Figure 4), some general trends are observed: i) All 32 matrices outperform a random guesser at all thresholds (all RZ scores above 0.52 are significantly greater, at a significance level of 0.01, than those obtained using the random shuffle). ii) The matrix scores tend to show a monotonic decrease in average score as the threshold position is increased. iii) Performance at any given threshold varies with matrix type. iv) Performance of each matrix changes in response to the threshold used. v) Variation in the performance of different matrix types is greater at extreme thresholds (i.e., those close to 0 and 1).
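
The ‘scrambled’ ChIP peaks serve as a negative control for the real peaks. The passage does not state how the scrambling is performed, so the sketch below assumes the simplest option: a random permutation of the nucleotides, which preserves base composition while destroying any binding-site order. The function name `scramble`, the fixed seed, and the example sequence are hypothetical.

```python
import random
from collections import Counter


def scramble(sequence, rng):
    """Return a random permutation of `sequence` with the same base counts."""
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)


rng = random.Random(0)      # fixed seed so the sketch is repeatable
peak = "ACGTTTTTTGCAACGT"   # made-up sequence, not taken from the paper
control = scramble(peak, rng)
print(control)
print(Counter(control) == Counter(peak))  # base composition is preserved
```

A control built this way has the same length and nucleotide frequencies as the original peak, so any difference in matrix scores between the two reflects the order of the nucleotides rather than their composition.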

