MARZ: an algorithm to combinatorially analyze gapped n-mer models of transcription factor binding.

Zellers RG, Drewell RA, Dresch JM - BMC Bioinformatics (2015)

Bottom Line: A number of computational approaches have been developed to examine these interactions, including simple mononucleotide and dinucleotide position weight matrix models. Here we develop a novel, unbiased computational algorithm, MARZ, that systematically analyzes all possible gapped matrices across a fixed number of nucleotides. Our results indicate that in many cases gapped matrix models can outperform traditional models, but that the relative strength of the binding sites considered in the analysis can profoundly influence the predictive ability of specific models.


Affiliation: Department of Computer Science, Harvey Mudd College, 301 Platt Boulevard, Claremont CA, 91711, USA. rzellers@hmc.edu.

ABSTRACT

Background: A key challenge in understanding the molecular mechanisms that control gene regulation is the characterization of the specificity with which transcription factor proteins bind to specific DNA sequences. A number of computational approaches have been developed to examine these interactions, including simple mononucleotide and dinucleotide position weight matrix models.

Results: Here we develop a novel, unbiased computational algorithm, MARZ, that systematically analyzes all possible gapped matrices across a fixed number of nucleotides. In addition, to evaluate the ability of these matrix models to predict in vivo binding sites, we utilize a new scoring system and, in combination with established scoring methods and statistical analysis, test the performance of 32 different gapped matrices on the well-characterized HUNCHBACK transcription factor in Drosophila.

Conclusions: Our results indicate that in many cases gapped matrix models can outperform traditional models, but that the relative strength of the binding sites considered in the analysis can profoundly influence the predictive ability of specific models.


Figure 2: Comparison of a traditional frequency matrix and a gapped n-mer matrix for HB. (A) The traditional matrix treats each of the seven nucleotides (m) in the aligned HB binding site sequences independently and slides across the sequences with a window frame size equal to one nucleotide. The 4×7 matrix represents the frequency at which each of the four nucleotides is found at each of the seven positions in the binding site and can be used to generate a standard position weight matrix (PWM). (B) The matrix for the gapped n-mer mkkkkm considers each of the two outer nucleotides (m), but ignores the four inner nucleotides (k), with a sliding window frame size equal to six nucleotides. This generates a 16×2 matrix that represents the frequency at which the 16 possible nucleotide pairs are found at the two outer positions across the two possible frames in a seven-nucleotide binding site sequence. It should be noted that the complex matrix constructed for each of the 32 different gapped n-mers is generated using the same principles as the mkkkkm example above, but is distinct based on the specific composition of the gapped n-mer. (C-D) Corresponding visualizations of each matrix at an HB binding site.
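The matrix construction described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MARZ implementation: the function name gapped_frequency_matrix is hypothetical, the three seven-nucleotide sequences are made-up toy data rather than real HB binding sites, and "frequency" is taken to mean the fraction of aligned sequences (the source does not say whether raw counts or fractions are stored). Sequences are assumed to be uppercase and to contain only A, C, G and T.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def gapped_frequency_matrix(sites, model):
    """Return {row label: [frequency per frame]} for a k/m model string.

    sites: aligned binding site sequences of equal length (assumed ACGT only).
    model: string of 'm' (considered) and 'k' (ignored) positions, e.g. 'mkkkkm'.
    Hypothetical helper, not taken from the MARZ source code.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all binding site sequences must have the same length")

    window = len(model)                      # sliding window frame size
    considered = [i for i, c in enumerate(model) if c == "m"]
    n_frames = width - window + 1            # number of window positions

    # One row per possible combination of nucleotides at the 'm' positions:
    # 4 rows for 'm', 16 rows for a model with two 'm' positions, and so on.
    rows = ["".join(p) for p in product(NUCLEOTIDES, repeat=len(considered))]
    counts = {row: [0] * n_frames for row in rows}

    for site in sites:
        for frame in range(n_frames):
            key = "".join(site[frame + i] for i in considered)
            counts[key][frame] += 1

    total = len(sites)
    return {row: [c / total for c in column] for row, column in counts.items()}


# Toy sequences for illustration only (not real HB binding sites).
toy_sites = ["ACGTACG", "ACGTACC", "TCGTACG"]

mono = gapped_frequency_matrix(toy_sites, "m")         # 4 rows x 7 frames
gapped = gapped_frequency_matrix(toy_sites, "mkkkkm")  # 16 rows x 2 frames
print(len(mono), len(mono["A"]))
print(len(gapped), len(gapped["AC"]))
```

With seven-nucleotide sites, the model m yields the 4×7 layout of panel (A) and the model mkkkkm yields the 16×2 layout of panel (B), because a six-nucleotide window fits into a seven-nucleotide site in exactly two frames.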

To obtain the gapped n-mer model for a given type ID, first convert the type ID to binary. Then convert the binary representation to a string of k’s and m’s representing the nucleotides considered in that particular model by replacing each zero with a k and each one with an m. Any leading k’s are omitted, as the leading nucleotide must be included, and an m is inserted on the right-hand side, as the terminal nucleotide is always included. Table 1 lists all of the type IDs and Figure 2 gives a graphical illustration of the matrix construction and sequence interpretation for the mononucleotide model m and the more complex gapped n-mer model mkkkkm.
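The conversion steps in this paragraph can be sketched as a short Python function. The name type_id_to_model is hypothetical, and the assumption that the type IDs run from 0 to 31 is inferred from the 32 models mentioned in the abstract; Table 1 itself is not reproduced here.

```python
def type_id_to_model(type_id: int) -> str:
    """Convert a type ID to its k/m model string (hypothetical helper)."""
    if type_id < 0:
        raise ValueError("type ID must be non-negative")

    bits = format(type_id, "b")                        # binary, e.g. 16 -> '10000'
    pattern = bits.replace("0", "k").replace("1", "m")  # zero -> k, one -> m
    pattern = pattern.lstrip("k")                      # omit any leading k's
    return pattern + "m"                               # terminal nucleotide is always included


# Assumed ID range 0..31, giving 32 distinct models of at most six positions.
for type_id in (0, 1, 5, 16, 31):
    print(type_id, type_id_to_model(type_id))
```

Under these assumptions, type ID 0 gives the mononucleotide model m, type ID 1 gives the dinucleotide model mm, and type ID 16 gives the gapped model mkkkkm discussed in Figure 2.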

