CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences.

Ding Y, Lorenz WA, Chuang JH - BMC Bioinformatics (2012)

Bottom Line: We demonstrate that our method identifies known functional motifs more accurately than sampling and parametric-based approaches in a variety of coding datasets of various size, including ChIP-seq data for the transcription factors NRSF and GABP. CodingMotif provides a theoretically and empirically-demonstrated advance for the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions.


Affiliation: Department of Biology, Boston College, Chestnut Hill, MA 02467, USA.

ABSTRACT

Background: It has been increasingly appreciated that coding sequences harbor regulatory sequence motifs in addition to encoding protein. These sequence motifs are expected to be overrepresented in nucleotide sequences bound by a common protein or small RNA. However, detecting overrepresented motifs has been difficult because of interference by constraints at the protein level. Sampling-based approaches to this problem, which rely on codon shuffling, have been limited by exploring only an infinitesimal fraction of the sequence space and by their use of parametric approximations.

Results: We present a novel O(N (log N)^2)-time algorithm, CodingMotif, to identify nucleotide-level motifs of unusual copy number in protein-coding regions. Using a new dynamic programming algorithm we are able to exhaustively calculate the distribution of the number of occurrences of a motif over all possible coding sequences that encode the same amino acid sequence, given a background model for codon usage and dinucleotide biases. Our method takes advantage of the sparseness of loci where a given motif can occur, greatly speeding up the required convolution calculations. Knowledge of the distribution allows one to assess the exact non-parametric p-value of whether a given motif is over- or under-represented. We demonstrate that our method identifies known functional motifs more accurately than sampling and parametric-based approaches in a variety of coding datasets of various size, including ChIP-seq data for the transcription factors NRSF and GABP.
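To make the idea of an exact count distribution concrete, the following is a minimal Python sketch and not the CodingMotif algorithm itself. It assumes the standard genetic code and a uniform, independent choice among synonymous codons (no dinucleotide bias), and it uses a plain dynamic program over (sequence suffix, motif count) states rather than the sparse convolution scheme described above. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Map each amino acid to its synonymous codons.
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    SYNONYMS[aa].append(codon)


def translate(cds):
    """Translate a coding sequence, ignoring any trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def count_overlapping(seq, motif):
    """Count motif occurrences, allowing overlaps."""
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif)


def motif_count_distribution(protein, motif):
    """Exact distribution of motif counts over all synonymous codings.

    Assumption: each synonymous codon is equally likely and codons are
    chosen independently (a simplification of the paper's background model).
    """
    k = len(motif)
    # State: (last k-1 nucleotides emitted so far, motif count) -> probability.
    states = {("", 0): 1.0}
    for aa in protein:
        codons = SYNONYMS[aa]
        weight = 1.0 / len(codons)
        nxt = defaultdict(float)
        for (suffix, count), prob in states.items():
            for codon in codons:
                window = suffix + codon
                # The suffix is shorter than the motif, so every match in
                # the window ends inside the new codon and is counted once.
                new = count_overlapping(window, motif)
                kept = window[-(k - 1):] if k > 1 else ""
                nxt[(kept, count + new)] += prob * weight
        states = nxt
    dist = defaultdict(float)
    for (_, count), prob in states.items():
        dist[count] += prob
    return dict(dist)


def overrepresentation_pvalue(cds, motif):
    """P(count >= observed count) over synonymous codings of the same protein."""
    observed = count_overlapping(cds, motif)
    dist = motif_count_distribution(translate(cds), motif)
    return sum(p for c, p in dist.items() if c >= observed)


if __name__ == "__main__":
    # Lys-Lys has four synonymous codings; only AAAAAA contains AAA four times.
    print(overrepresentation_pvalue("AAAAAA", "AAA"))  # 0.25
```

Because the state keeps only the last k-1 nucleotides, the number of states stays small for short motifs, but this sketch makes no attempt to reach the running time stated above.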

Conclusions: CodingMotif provides a theoretically and empirically-demonstrated advance for the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions. A software implementation is available at http://bioinformatics.bc.edu/chuanglab/codingmotif.tar.


Figure 1: Distribution of motif overrepresentation p-values for mouse chr19 coding sequence with the Independent Codon Model. Three ICM p-value distributions are shown: the p-values for the original coding sequences; the p-values after shuffling synonymous codons across coding sequences; and the p-values after shuffling dicodons across coding sequences.

Mentions: k-mer scores exhibited strongly bimodal behavior under the ICM, with the vast majority of p-values close to either 0 or 1. The distribution of p-values for these k-mers is shown in Figure 1, labeled as "original sequence." For example, 32% of 6-mers have p-value < 0.05, and 19% of 6-mers have p-value > 0.95. The large number of motifs with strong copy number biases suggests that the ICM may not be an adequate model for detecting motifs under selection. To clarify the reason for the bimodal behavior, we shuffled the codons while keeping the amino acid sequences fixed, yielding a "codon shuffled" sequence. When the overrepresentation algorithm was run on this shuffled sequence, the bimodality of the scores was substantially decreased (Figure 1). This suggests that there are systematic dinucleotide biases at the boundaries of neighboring codons that significantly impact the p-values calculated for the original sequence. Such biases may include selective pressures on motifs, which is what we seek to identify; however, the large number of motifs with scores altered by the codon shuffling suggests that there are neutral effects as well.
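The codon-shuffling control described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' code: it assumes the standard genetic code, operates on a single in-frame coding sequence, and permutes codons only among positions that encode the same amino acid, so the protein and the overall codon usage are preserved while codon-boundary dinucleotides are scrambled. The function names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(cds):
    """Split an in-frame coding sequence into codons, dropping a partial tail."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def translate(cds):
    """Translate a coding sequence with the standard genetic code."""
    return "".join(CODON_TABLE[c] for c in split_codons(cds))


def shuffle_synonymous_codons(cds, rng):
    """Permute codons among positions that encode the same amino acid."""
    codons = split_codons(cds)
    # Group codon positions by the amino acid they encode.
    positions = defaultdict(list)
    for idx, codon in enumerate(codons):
        positions[CODON_TABLE[codon]].append(idx)
    shuffled = list(codons)
    for idxs in positions.values():
        pool = [codons[i] for i in idxs]
        rng.shuffle(pool)  # in-place permutation of the synonymous codons
        for i, codon in zip(idxs, pool):
            shuffled[i] = codon
    return "".join(shuffled)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the shuffle is reproducible
    original = "ATGAAACTGAAGTTAAGCTCTAAATTGTAA"
    shuffled = shuffle_synonymous_codons(original, rng)
    # The encoded protein and the codon multiset are unchanged by construction.
    print(translate(original) == translate(shuffled))
    print(sorted(split_codons(original)) == sorted(split_codons(shuffled)))
```

Running the overrepresentation calculation on both the original and the shuffled sequence then separates effects of codon-boundary dinucleotide context from effects of codon usage alone.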

