CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences.

Ding Y, Lorenz WA, Chuang JH - BMC Bioinformatics (2012)

Bottom Line: We demonstrate that our method identifies known functional motifs more accurately than sampling and parametric-based approaches in a variety of coding datasets of various sizes, including ChIP-seq data for the transcription factors NRSF and GABP. CodingMotif provides a theoretically and empirically demonstrated advance for the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions.


Affiliation: Department of Biology, Boston College, Chestnut Hill, MA 02467, USA.

ABSTRACT

Background: It has been increasingly appreciated that coding sequences harbor regulatory sequence motifs in addition to encoding protein. These sequence motifs are expected to be overrepresented in nucleotide sequences bound by a common protein or small RNA. However, detecting overrepresented motifs has been difficult because of interference by constraints at the protein level. Sampling-based approaches to this problem, which rely on codon shuffling, have been limited by exploring only an infinitesimal fraction of the sequence space and by their use of parametric approximations.

Results: We present a novel O(N (log N)^2)-time algorithm, CodingMotif, to identify nucleotide-level motifs of unusual copy number in protein-coding regions. Using a new dynamic programming algorithm, we are able to exhaustively calculate the distribution of the number of occurrences of a motif over all possible coding sequences that encode the same amino acid sequence, given a background model for codon usage and dinucleotide biases. Our method takes advantage of the sparseness of loci where a given motif can occur, greatly speeding up the required convolution calculations. Knowledge of the distribution allows one to assess the exact non-parametric p-value of whether a given motif is over- or under-represented. We demonstrate that our method identifies known functional motifs more accurately than sampling and parametric-based approaches in a variety of coding datasets of various sizes, including ChIP-seq data for the transcription factors NRSF and GABP.

Conclusions: CodingMotif provides a theoretically and empirically-demonstrated advance for the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions. A software implementation is available at http://bioinformatics.bc.edu/chuanglab/codingmotif.tar.
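
The idea of computing an exact count distribution over all synonymous encodings, as described in the Results, can be illustrated with a minimal Python sketch. This is not the CodingMotif implementation: it assumes independent codon choice (no dinucleotide bias), it does not exploit motif sparseness or fast convolution, and the codon table `SYNONYMS` and the function names are hypothetical and chosen for illustration only.

```python
from collections import defaultdict

# Tiny illustrative subset of the standard genetic code (assumption:
# only these amino acids appear in the input protein).
SYNONYMS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "D": ["GAT", "GAC"],
    "A": ["GCT", "GCC", "GCA", "GCG"],
}


def motif_count_distribution(protein, motif, codon_prob=None):
    """Exact distribution of motif occurrences over all synonymous encodings.

    codon_prob optionally maps codon -> probability given its amino acid;
    if omitted, synonymous codons are treated as equally likely.
    """
    m = len(motif)
    # State: trailing (m - 1) nucleotides -> {occurrence count: probability}
    states = {"": {0: 1.0}}
    for aa in protein:
        codons = SYNONYMS[aa]
        new_states = defaultdict(lambda: defaultdict(float))
        for suffix, dist in states.items():
            for codon in codons:
                p = codon_prob[codon] if codon_prob else 1.0 / len(codons)
                text = suffix + codon
                # The suffix is shorter than the motif, so every match in
                # `text` ends inside the newly added codon (no double counting).
                hits = sum(
                    1 for i in range(len(text) - m + 1) if text[i:i + m] == motif
                )
                new_suffix = text[-(m - 1):] if m > 1 else ""
                for count, q in dist.items():
                    new_states[new_suffix][count + hits] += q * p
        states = new_states
    # Marginalize over the trailing-nucleotide state.
    total = defaultdict(float)
    for dist in states.values():
        for count, q in dist.items():
            total[count] += q
    return dict(total)


def upper_tail_p_value(dist, observed):
    """Probability of seeing at least `observed` occurrences (overrepresentation)."""
    return sum(p for count, p in dist.items() if count >= observed)
```

For example, `motif_count_distribution("KD", "GG")` enumerates the four encodings of Lys-Asp implicitly and `upper_tail_p_value` then gives the probability of observing at least a given number of `GG` occurrences. Because the state only tracks the last few nucleotides, the work grows linearly with protein length rather than with the number of synonymous sequences.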


Figure 2: Comparison of dinucleotide usage under different models. The dinucleotide usage of sequences generated by the DCM Markov model (black) and the dinucleotide usage of the original data (white) exhibit Pearson correlation r = 0.9999, in comparison to correlation r = 0.9580 between ICM-generated dinucleotide usage and that of the original sequences. The largest discrepancy is for CpG dinucleotides, for which the ICM-generated frequency is 1.60 times that in the original data. For the DCM-generated sequences, the CpG frequency is 1.0008 times that in the original data.

Mentions: If each amino acid had only one possible first nucleotide for the underlying codon, then the expected dinucleotide and codon usage in the DCM model would be exactly equal to those of the reference sequence (see Methods for proof). However, the true genetic code deviates slightly from this behavior (Arginine, Leucine, and Serine can have two possible first nucleotides). To determine how well the DCM preserves dinucleotide and codon usage, we generated a sequence using the DCM Markov model and compared its properties to those of the reference sequence. Figure 2 shows the dinucleotide usage of the original sequence and that generated by both the DCM Markov model and the ICM model. The original sequence used was the set of all mouse coding sequences (NCBIM 37; 22,791,336 codons). The DCM and ICM sequences were generated through Python scripts using the Python built-in random number generator. The correlation between the original and ICM dinucleotide frequencies is high (Pearson r = 0.9580), but the correlation with the DCM frequencies is noticeably higher (Pearson r = 0.9999). The strongest discrepancy is for CG dinucleotides, which are well known to be hypermutable compared to other dinucleotides. The CG frequency in the ICM is 1.60 times that in the original data, while the DCM CG frequency is only 1.0008 times that in the original data. None of the DCM dinucleotide frequencies differs from its respective original sequence dinucleotide frequency by more than 0.5% of the original sequence value, even though such differences are affected by both systematic biases and finite-size fluctuations.
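
The comparison reported above (Pearson correlation of dinucleotide frequencies and the CG frequency ratio) could be computed along the following lines. This is a hedged sketch, not the authors' script: the function names are hypothetical, dinucleotides are counted at every position regardless of codon frame, and the inputs are assumed to be plain strings over A, C, G, T.

```python
from itertools import product
from math import sqrt

# The 16 dinucleotides in a fixed order, so frequency vectors are comparable.
DINUCS = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_freqs(seq):
    """Relative frequency of each dinucleotide, counted at every position."""
    counts = dict.fromkeys(DINUCS, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs with ambiguous bases such as N
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total for d in DINUCS]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cg_ratio(generated, original):
    """CG frequency in the generated sequence relative to the original."""
    idx = DINUCS.index("CG")
    return dinucleotide_freqs(generated)[idx] / dinucleotide_freqs(original)[idx]
```

With a model-generated sequence `generated` and the reference sequence `original`, `pearson(dinucleotide_freqs(generated), dinucleotide_freqs(original))` corresponds to the r values quoted for the DCM and ICM, and `cg_ratio(generated, original)` corresponds to the 1.0008 and 1.60 ratios.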

