CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences.

Ding Y, Lorenz WA, Chuang JH - BMC Bioinformatics (2012)

Bottom Line: We demonstrate that our method identifies known functional motifs more accurately than sampling and parametric-based approaches in a variety of coding datasets of various sizes, including ChIP-seq data for the transcription factors NRSF and GABP. CodingMotif provides a theoretically and empirically demonstrated advance for the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions.


Affiliation: Department of Biology, Boston College, Chestnut Hill, MA 02467, USA.

ABSTRACT

Background: It has been increasingly appreciated that coding sequences harbor regulatory sequence motifs in addition to encoding protein. These sequence motifs are expected to be overrepresented in nucleotide sequences bound by a common protein or small RNA. However, detecting overrepresented motifs has been difficult because of interference by constraints at the protein level. Sampling-based approaches to this problem, which rely on codon shuffling, have been limited by exploring only an infinitesimal fraction of the sequence space and by their use of parametric approximations.

Results: We present a novel O(N (log N)^2)-time algorithm, CodingMotif, to identify nucleotide-level motifs of unusual copy number in protein-coding regions. Using a new dynamic programming algorithm, we are able to exhaustively calculate the distribution of the number of occurrences of a motif over all possible coding sequences that encode the same amino acid sequence, given a background model for codon usage and dinucleotide biases. Our method takes advantage of the sparseness of loci where a given motif can occur, greatly speeding up the required convolution calculations. Knowledge of the distribution allows one to assess the exact non-parametric p-value of whether a given motif is over- or under-represented. We demonstrate that our method identifies known functional motifs more accurately than sampling and parametric-based approaches in a variety of coding datasets of various sizes, including ChIP-seq data for the transcription factors NRSF and GABP.
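As a rough illustration of the idea in the Results paragraph, the sketch below computes the exact distribution of motif counts over every synonymous coding of a short amino acid sequence with a simple dynamic program. It is not the CodingMotif algorithm: it assumes uniform usage among synonymous codons (no codon-usage or dinucleotide background), does not exploit the sparseness of motif loci, and the function names `count_distribution`, `combine`, and `upper_tail_pvalue` are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
SYNONYMS = defaultdict(list)
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    SYNONYMS[aa].append("".join(codon))


def count_distribution(protein, motif):
    """Return {count: probability} of motif occurrences over all codings of protein."""
    k = len(motif)
    # DP state: (last k-1 nucleotides emitted so far, occurrences so far) -> probability.
    states = {("", 0): 1.0}
    for aa in protein:
        codons = SYNONYMS[aa]
        weight = 1.0 / len(codons)  # ASSUMPTION: uniform synonymous codon usage
        nxt = defaultdict(float)
        for (suffix, count), prob in states.items():
            for codon in codons:
                window = suffix + codon
                # The suffix is shorter than the motif, so every hit in the
                # window ends inside the new codon and is counted exactly once.
                hits = sum(window[i:i + k] == motif
                           for i in range(len(window) - k + 1))
                new_suffix = window[-(k - 1):] if k > 1 else ""
                nxt[(new_suffix, count + hits)] += prob * weight
        states = nxt
    dist = defaultdict(float)
    for (_, count), prob in states.items():
        dist[count] += prob
    return dict(dist)


def combine(dist_a, dist_b):
    """Convolve two count distributions (sequences treated as independent)."""
    out = defaultdict(float)
    for ca, pa in dist_a.items():
        for cb, pb in dist_b.items():
            out[ca + cb] += pa * pb
    return dict(out)


def upper_tail_pvalue(dist, observed):
    """Exact probability of seeing at least `observed` occurrences."""
    return sum(p for c, p in dist.items() if c >= observed)


if __name__ == "__main__":
    dist = count_distribution("MKWK", "AAG")
    print(upper_tail_pvalue(dist, 1))
```

The state space here grows with the number of distinct suffixes and counts, so this version is only practical for short sequences and motifs; it is meant to show what "exact distribution over all synonymous sequences" means, not how to compute it at scale.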

Conclusions: CodingMotif provides a theoretically and empirically demonstrated advance for the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions. A software implementation is available at http://bioinformatics.bc.edu/chuanglab/codingmotif.tar.


Figure 5: Motif p-values on synthetic data. For a randomly generated set of 20 sequences of 350 codons, copies of a motif were overwritten onto random positions within the sequences. CodingMotif and ideal z-score based p-values as a function of the number of inserted copies were calculated. This procedure was performed 10 times for each of the 4096 possible 6-mers. CodingMotif plotted values indicate average and standard deviation of log p-values. Z-score plotted values indicate the value of the erfc function when applied to the average z-score. Standard deviations of z-score based p-values were similar to those of CodingMotif (data not shown).
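The caption does not spell out how the erfc function is scaled when it is applied to a z-score. A minimal sketch, assuming the usual one-sided upper tail of a standard normal distribution, is shown below; the function names `zscore_to_pvalue` and `log10_pvalue` are hypothetical.

```python
import math


def zscore_to_pvalue(z):
    """One-sided upper-tail p-value of a standard normal, P(Z >= z).

    ASSUMPTION: the standard conversion 0.5 * erfc(z / sqrt(2)); the
    caption itself only says that erfc is applied to the average z-score.
    """
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def log10_pvalue(z):
    """Log-scale p-value, convenient for plotting against inserted copies."""
    return math.log10(zscore_to_pvalue(z))


if __name__ == "__main__":
    for z in (0.0, 2.0, 4.0):
        print(z, zscore_to_pvalue(z))
```

Using `erfc` rather than `1 - erf` avoids loss of precision for large z-scores, where the tail probability is very small.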

Mentions: For 6-mers, a p-value better than 4^-6 ≈ 0.0002 is an appropriate significance threshold taking into account multiple testing. As can be seen in Figure 5, this level of significance is achieved, on average, when 6-7 of the sequences have a copy of the motif, though there is strong variability from motif to motif, as indicated by the plotted standard deviations. For larger datasets, a relatively lower motif density would be expected to be sufficient for detection of significant overrepresentation.
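The threshold quoted above is the reciprocal of the number of possible 6-mers, which corresponds to roughly one expected false positive when all 4^6 = 4096 motifs are tested. The sketch below reproduces that arithmetic; the helper names `kmer_threshold` and `significant_motifs`, and the example p-values, are hypothetical.

```python
def kmer_threshold(k, alphabet_size=4):
    """Significance threshold equal to 1 / (number of possible k-mers)."""
    return 1.0 / alphabet_size ** k


def significant_motifs(pvalues, k):
    """Return the motifs whose p-value beats the k-mer threshold, sorted."""
    cutoff = kmer_threshold(k)
    return sorted(m for m, p in pvalues.items() if p < cutoff)


if __name__ == "__main__":
    print(kmer_threshold(6))  # 0.000244140625
    # Made-up p-values, for illustration only.
    example = {"AAAAAA": 1e-5, "CCCCCC": 1e-2}
    print(significant_motifs(example, 6))
```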

