CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences.

Ding Y, Lorenz WA, Chuang JH - BMC Bioinformatics (2012)

Bottom Line: We demonstrate that our method identifies known functional motifs more accurately than sampling and parametric-based approaches in a variety of coding datasets of various sizes, including ChIP-seq data for the transcription factors NRSF and GABP. CodingMotif provides a theoretically and empirically demonstrated advance for the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions.


Affiliation: Department of Biology, Boston College, Chestnut Hill, MA 02467, USA.

ABSTRACT

Background: It has been increasingly appreciated that coding sequences harbor regulatory sequence motifs in addition to encoding protein. These sequence motifs are expected to be overrepresented in nucleotide sequences bound by a common protein or small RNA. However, detecting overrepresented motifs has been difficult because of interference by constraints at the protein level. Sampling-based approaches to this problem, which rely on codon shuffling, have been limited both by exploring only an infinitesimal fraction of the sequence space and by their use of parametric approximations.

Results: We present a novel O(N (log N)^2)-time algorithm, CodingMotif, to identify nucleotide-level motifs of unusual copy number in protein-coding regions. Using a new dynamic programming algorithm, we are able to exhaustively calculate the distribution of the number of occurrences of a motif over all possible coding sequences that encode the same amino acid sequence, given a background model for codon usage and dinucleotide biases. Our method takes advantage of the sparseness of loci where a given motif can occur, greatly speeding up the required convolution calculations. Knowledge of the distribution allows one to assess the exact non-parametric p-value of whether a given motif is over- or under-represented. We demonstrate that our method identifies known functional motifs more accurately than sampling and parametric-based approaches in a variety of coding datasets of various sizes, including ChIP-seq data for the transcription factors NRSF and GABP.
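
The idea of an exact count distribution can be illustrated with a much simpler dynamic program than the one the paper describes. The sketch below is not the CodingMotif algorithm: it omits the dinucleotide bias model and the sparse convolution speedups, and it assumes uniform usage of synonymous codons unless a hypothetical `codon_prob` table (P(codon | amino acid)) is supplied. All function names are illustrative assumptions. It tracks, codon by codon, the probability of each (last k-1 nucleotides, occurrences so far) state over every synonymous encoding of a protein.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = defaultdict(list)
for _codon, _aa in CODON_TO_AA.items():
    SYNONYMS[_aa].append(_codon)


def translate(cds):
    """Translate an in-frame coding sequence (length divisible by 3)."""
    return "".join(CODON_TO_AA[cds[i:i + 3]] for i in range(0, len(cds), 3))


def count_occurrences(seq, motif):
    """Count motif occurrences in seq, overlapping ones included."""
    return sum(seq.startswith(motif, i) for i in range(len(seq) - len(motif) + 1))


def motif_count_distribution(protein, motif, codon_prob=None):
    """Exact P(count = c) over all coding sequences that encode `protein`.

    Assumption: codons are chosen independently, uniformly among synonyms
    unless codon_prob (codon -> P(codon | amino acid)) is given.
    """
    k = len(motif)
    states = {("", 0): 1.0}  # (last k-1 nucleotides, count so far) -> probability
    for aa in protein:
        codons = SYNONYMS[aa]
        nxt = defaultdict(float)
        for (tail, count), p in states.items():
            for codon in codons:
                w = codon_prob[codon] if codon_prob else 1.0 / len(codons)
                ext = tail + codon
                # tail is shorter than the motif, so every match found in
                # ext ends inside the new codon and is counted exactly once.
                new = count_occurrences(ext, motif)
                new_tail = ext[-(k - 1):] if k > 1 else ""
                nxt[(new_tail, count + new)] += p * w
        states = nxt
    dist = defaultdict(float)
    for (_, count), p in states.items():
        dist[count] += p
    return dict(dist)


def overrepresentation_pvalue(cds, motif, codon_prob=None):
    """Exact upper-tail probability P(count >= observed count in cds)."""
    observed = count_occurrences(cds, motif)
    dist = motif_count_distribution(translate(cds), motif, codon_prob)
    return sum(p for c, p in dist.items() if c >= observed)


# Toy input: 4 of the 32 synonymous encodings of P-E-G contain CGGAAG.
print(overrepresentation_pvalue("CCGGAAGGT", "CGGAAG"))  # 0.125
```

Because the state carries only the last k-1 nucleotides and the running count, the work grows with sequence length rather than with the number of synonymous sequences, which is what makes an exhaustive (non-sampled) distribution feasible even in this simplified form.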

Conclusions: CodingMotif provides a theoretically and empirically demonstrated advance for the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions. A software implementation is available at http://bioinformatics.bc.edu/chuanglab/codingmotif.tar.

Figure 4: Comparison of CodingMotif and parametric methods for known binding motifs. A) All 4 of the top 4 motifs predicted by CodingMotif p-value are exact matches to the canonical motif for the human transcription factor GABP. For comparison, 3 of the top 4 motifs ranked by z-score, and 1 of the top 4 motifs ranked by the ratio of counts in the real sequence to the average in the distribution, match the GABP canonical motif. B) All 4 of the top 4 motifs predicted by CodingMotif p-value match the canonical motif for NRSF. None of the top 4 motifs ranked by z-score, and none of the top 4 ranked by count ratio, match the known canonical motif.

Mentions: Results for GABP are shown in Figure 4A, with the previously known canonical motif shown in weblogo form [31]. The signal for the canonical GABP motif is essentially 7 bp long with little degeneracy (CCGGAAG). We determined the top four 6-mer motifs from CodingMotif as ranked by their overrepresentation p-values, each of which was 1e-21 or smaller. These four motifs were the four possible perfect 6-mer matches to the canonical motif: CGGAAG, CCGGAA, and their reverse complements CTTCCG and TTCCGG. To determine whether a parametric approximation of the count distribution would perform equally well, we also calculated a z-score for each 6-mer based on its observed count and its average count over all possible coding sequences, which we determined directly from the distribution function calculated by CodingMotif. Note that the mean calculated by this method is the ideal that would be approached with an infinite amount of sampling. The top 4 motifs by z-score included only 3 of the 4 canonical GABP hexamers. We also sorted motifs according to their count ratio (observed counts/mean counts in the control distribution), the statistic used by Itzkovitz et al. [17] to identify motifs. Count ratio yielded only 1 of the 4 canonical hexamers among the top 4. Thus CodingMotif gives superior results to this idealized parametric method. Methods that parameterize the count distribution from a finite sample set can do no better than this parametric ideal.
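
The three ranking statistics compared above can all be derived from a single count distribution, which makes the comparison easy to sketch. The code below is a minimal illustration, not the authors' implementation: the function names are hypothetical, and `toy_dist` is a made-up distribution rather than data from the paper. It also enumerates the 6-mer windows of the 7 bp GABP site with their reverse complements, reproducing the four hexamers listed in the text.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers_with_reverse_complements(site, k):
    """All k-mer windows of a binding site, then their reverse complements."""
    windows = [site[i:i + k] for i in range(len(site) - k + 1)]
    return windows + [reverse_complement(w) for w in windows]


def summary_statistics(dist, observed):
    """Compute z-score, count ratio, and exact upper-tail p-value.

    dist maps a motif count to its probability under the background model.
    """
    mean = sum(c * p for c, p in dist.items())
    var = sum((c - mean) ** 2 * p for c, p in dist.items())
    z_score = (observed - mean) / math.sqrt(var) if var > 0 else float("nan")
    count_ratio = observed / mean if mean > 0 else float("inf")
    p_value = sum(p for c, p in dist.items() if c >= observed)
    return {"z_score": z_score, "count_ratio": count_ratio, "p_value": p_value}


# The four perfect 6-mer matches to the canonical GABP site CCGGAAG.
print(kmers_with_reverse_complements("CCGGAAG", 6))
# ['CCGGAA', 'CGGAAG', 'TTCCGG', 'CTTCCG']

# Made-up, strongly skewed distribution (illustration only).
toy_dist = {0: 0.90, 1: 0.09, 2: 0.01}
print(summary_statistics(toy_dist, observed=2))
```

The z-score and count ratio each compress the distribution into one or two moments, whereas the p-value uses the full tail. For skewed, discrete count distributions such as the toy example, these summaries can rank motifs differently, which is consistent with the discrepancies reported for GABP.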

