Recovering motifs from biased genomes: application of signal correction.

Hasan S, Schreiber M - Nucleic Acids Res. (2006)

Bottom Line: We find that the average Euclidian distance between RBS signal frequency matrices of different genomes can be significantly reduced by using the correction technique. Within this reduced average distance, we can find examples of class-specific RBS signals. Our results have implications for motif-based prediction, particularly with regards to the estimation of reliable inter-genomic model parameters.


Affiliation: Novartis Institute for Tropical Diseases (NITD), 10 Biopolis Road, #05-01 Chromos, Singapore 138670.

ABSTRACT
A significant problem in biological motif analysis arises when the background symbol distribution is biased (e.g. high/low GC content in the case of DNA sequences). This can lead to overestimation of the amount of information encoded in a motif. A motif can be depicted as a signal using information theory (IT). We apply two concepts from IT, distortion and patterned interference (a type of noise), to model genomic and codon bias respectively. This modeling approach allows us to correct a raw signal to recover signals that are weakened by compositional bias. The corrected signal is more likely to be discriminated from a biased background by a macromolecule. We apply this correction technique to recover ribosome-binding site (RBS) signals from available sequenced and annotated prokaryotic genomes having diverse compositional biases. We observed that linear correction was sufficient for recovering signals even at the extremes of these biases. Further comparative genomics studies were made possible upon correction of these signals. We find that the average Euclidian distance between RBS signal frequency matrices of different genomes can be significantly reduced by using the correction technique. Within this reduced average distance, we can find examples of class-specific RBS signals. Our results have implications for motif-based prediction, particularly with regards to the estimation of reliable inter-genomic model parameters.
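The overestimation described in the abstract can be illustrated with a small sketch. It is not the authors' correction method (which models distortion and patterned interference); it only contrasts the information of one motif column computed against an assumed uniform background with the relative entropy against the actual biased background. The background and column frequencies below are hypothetical.

```python
import math

def info_uniform(column):
    """Information (bits) of a column assuming an equiprobable 4-letter background: 2 - H(p)."""
    entropy = -sum(p * math.log2(p) for p in column if p > 0)
    return 2.0 - entropy

def info_relative(column, background):
    """Relative entropy (bits) of a column measured against an explicit background."""
    return sum(p * math.log2(p / q) for p, q in zip(column, background) if p > 0)

# Hypothetical GC-rich genome (80% GC); order is A, C, G, T.
background = [0.1, 0.4, 0.4, 0.1]
# Hypothetical motif column that merely mirrors the background composition.
column = [0.1, 0.4, 0.4, 0.1]

uniform_bits = info_uniform(column)               # positive, although the column carries no signal
relative_bits = info_relative(column, background) # zero: column is indistinguishable from background
print(f"uniform-background estimate: {uniform_bits:.3f} bits")
print(f"biased-background estimate:  {relative_bits:.3f} bits")
```

In this toy case the uniform-background estimate is greater than zero even though the column is identical to the background, which is the kind of inflation that compositional bias introduces.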


Figure 4: Total Euclidian distance of raw and corrected RBS signals to the E. coli raw signal.

Mentions: All the E. coli strains completely sequenced to date (GenBank accession nos NC_004431, NC_000913, NC_002695, NC_002655) have a uniform background nt distribution (50% GC). Therefore, we used the RBS signal of E. coli as a reference to assess the effects of correcting extremely biased raw signals. The summed Euclidian distance between the reference raw signal of E. coli strain CFT073 (Figure 3; GenBank accession no. NC_004431) and the signals of all other genomes was thus measured (Figure 4). This measurement also included the distribution of SD motif-initiation codon distances. As background GC content increased or decreased away from 50%, there was a significant tendency for raw signals to diverge from the reference signal. Fitting a polynomial curve to the raw data yielded a strong squared correlation coefficient (R²) of 0.70, with the minimum of the curve at roughly 50% GC. One of the effects of correction was a significant reduction in the average total Euclidian distance from μ = 8.25 to 8.60, as shown using a paired t-test (α = 0.05). An F-test (α = 0.05) was also performed and showed that the variation in the distances had also significantly reduced, from 6.36 to 2.23, upon correction. Correction was also expected to remove any correlation between %GC and total Euclidian distance. This was observed, as the relationship between these two variables was successfully flattened (Figure 4: a minor slope of −0.03 was seen when regressing a linear model). This showed that linear correction was sufficient for removing correlation caused by compositional bias.
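The measurements in this passage can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that "total Euclidian distance" means the per-position Euclidean distance between two frequency matrices, summed over positions. The function names and toy values are hypothetical, and the SD motif-initiation codon distance distribution is left out.

```python
import math

def total_euclidean_distance(matrix_a, matrix_b):
    """Sum of per-position Euclidean distances between two frequency matrices.

    Each matrix is a list of rows, one row of [A, C, G, T] frequencies per position.
    Summing per position is an assumption made for this sketch.
    """
    if len(matrix_a) != len(matrix_b):
        raise ValueError("matrices must cover the same number of positions")
    return sum(math.dist(row_a, row_b) for row_a, row_b in zip(matrix_a, matrix_b))

def linear_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def paired_t_statistic(before, after):
    """Paired t statistic for the per-genome differences (before - after)."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    return mean_d / math.sqrt(var_d / n)

# Hypothetical two-position signals (rows are A, C, G, T frequencies).
reference = [[0.25, 0.25, 0.25, 0.25], [0.10, 0.10, 0.70, 0.10]]
other = [[0.40, 0.10, 0.10, 0.40], [0.20, 0.10, 0.50, 0.20]]
distance = total_euclidean_distance(reference, other)

# Hypothetical %GC values and distances for three genomes.
slope = linear_slope([30.0, 50.0, 70.0], [1.0, 2.0, 3.0])
print(f"distance = {distance:.3f}, slope = {slope:.3f}")
```

A slope near zero after correction would correspond to the flattened relationship between %GC and distance described above. The paired t statistic would then be compared with the critical value for the chosen α and n − 1 degrees of freedom.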

