Limits...
WordCluster: detecting clusters of DNA words and genomic elements.

Hackenberg M, Carpena P, Bernaola-Galván P, Barturen G, Alganza AM, Oliver JL - Algorithms Mol Biol (2011)

Bottom Line: We demonstrate the usefulness of this approach by detecting the clusters of CAG/CTG (cytosine contexts that can be methylated in undifferentiated cells), showing that the degree of methylation vary drastically between inside and outside of the clusters.As another example, we used WordCluster to search for statistically significant clusters of olfactory receptor (OR) genes in the human genome.WordCluster seems to predict biological meaningful clusters of DNA words (k-mers) and genomic entities.

View Article: PubMed Central - HTML - PubMed

Affiliation: Dpto, de Genética, Facultad de Ciencias, Universidad de Granada, Campus de Fuentenueva s/n, 18071-Granada & Lab, de Bioinformática, Centro de Investigación Biomédica, PTS, Avda, del Conocimiento s/n, 18100-Granada, Spain. mlhack@gmail.com.

ABSTRACT

Background: Many k-mers (or DNA words) and genomic elements are known to be spatially clustered in the genome. Well established examples are the genes, TFBSs, CpG dinucleotides, microRNA genes and ultra-conserved non-coding regions. Currently, no algorithm exists to find these clusters in a statistically comprehensible way. The detection of clustering often relies on densities and sliding-window approaches or arbitrarily chosen distance thresholds.

Results: We introduce here an algorithm to detect clusters of DNA words (k-mers), or any other genomic element, based on the distance between consecutive copies and an assigned statistical significance. We implemented the method into a web server connected to a MySQL backend, which also determines the co-localization with gene annotations. We demonstrate the usefulness of this approach by detecting the clusters of CAG/CTG (cytosine contexts that can be methylated in undifferentiated cells), showing that the degree of methylation vary drastically between inside and outside of the clusters. As another example, we used WordCluster to search for statistically significant clusters of olfactory receptor (OR) genes in the human genome.

Conclusions: WordCluster seems to predict biological meaningful clusters of DNA words (k-mers) and genomic entities. The implementation of the method into a web server is available at http://bioinfo2.ugr.es/wordCluster/wordCluster.php including additional features like the detection of co-localization with gene regions or the annotation enrichment tool for functional analysis of overlapped genes.

No MeSH data available.


Clusters of OR genes. A region of human chromosome 11 showing OR genes (green), the clusters annotated in the CLIC/HORDE database (blue) and the statistically significant clusters predicted by WordCluster (red). Our algorithm predicts more compact clusters compared to the CLIC/HORDE annotation. For example, in the first and third HORDE clusters pronounced gaps exist between the genes, which is detected by WordCluster but ignored by the CLIC/HORDE annotation. The figure was generated using the UCSC Genome Browser [8].
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC3037320&req=5

Figure 2: Clusters of OR genes. A region of human chromosome 11 showing OR genes (green), the clusters annotated in the CLIC/HORDE database (blue) and the statistically significant clusters predicted by WordCluster (red). Our algorithm predicts more compact clusters compared to the CLIC/HORDE annotation. For example, in the first and third HORDE clusters pronounced gaps exist between the genes, which is detected by WordCluster but ignored by the CLIC/HORDE annotation. The figure was generated using the UCSC Genome Browser [8].

Mentions: As a third example, we used WordCluster to search for significant clusters of olfactory receptor (OR) genes, the largest multigene family in multicellular organisms whose members are known to be clustered within vertebrate genomes [18,19]. Table 4 shows the basic statistics for the 13 clusters of OR genes detected by our algorithm in human chromosome 11. Figure 2 shows a comparative analysis of the clusters predicted by WordCluster to the clusters currently annotated in the CLIC/HORDE database [19] in a selected region of chromosome 11. Our algorithm predicts a higher number of clusters, being all of them statistically significant.


WordCluster: detecting clusters of DNA words and genomic elements.

Hackenberg M, Carpena P, Bernaola-Galván P, Barturen G, Alganza AM, Oliver JL - Algorithms Mol Biol (2011)

Clusters of OR genes. A region of human chromosome 11 showing OR genes (green), the clusters annotated in the CLIC/HORDE database (blue) and the statistically significant clusters predicted by WordCluster (red). Our algorithm predicts more compact clusters compared to the CLIC/HORDE annotation. For example, in the first and third HORDE clusters pronounced gaps exist between the genes, which is detected by WordCluster but ignored by the CLIC/HORDE annotation. The figure was generated using the UCSC Genome Browser [8].
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC3037320&req=5

Figure 2: Clusters of OR genes. A region of human chromosome 11 showing OR genes (green), the clusters annotated in the CLIC/HORDE database (blue) and the statistically significant clusters predicted by WordCluster (red). Our algorithm predicts more compact clusters compared to the CLIC/HORDE annotation. For example, in the first and third HORDE clusters pronounced gaps exist between the genes, which is detected by WordCluster but ignored by the CLIC/HORDE annotation. The figure was generated using the UCSC Genome Browser [8].
Mentions: As a third example, we used WordCluster to search for significant clusters of olfactory receptor (OR) genes, the largest multigene family in multicellular organisms whose members are known to be clustered within vertebrate genomes [18,19]. Table 4 shows the basic statistics for the 13 clusters of OR genes detected by our algorithm in human chromosome 11. Figure 2 shows a comparative analysis of the clusters predicted by WordCluster to the clusters currently annotated in the CLIC/HORDE database [19] in a selected region of chromosome 11. Our algorithm predicts a higher number of clusters, being all of them statistically significant.

Bottom Line: We demonstrate the usefulness of this approach by detecting the clusters of CAG/CTG (cytosine contexts that can be methylated in undifferentiated cells), showing that the degree of methylation vary drastically between inside and outside of the clusters.As another example, we used WordCluster to search for statistically significant clusters of olfactory receptor (OR) genes in the human genome.WordCluster seems to predict biological meaningful clusters of DNA words (k-mers) and genomic entities.

View Article: PubMed Central - HTML - PubMed

Affiliation: Dpto, de Genética, Facultad de Ciencias, Universidad de Granada, Campus de Fuentenueva s/n, 18071-Granada & Lab, de Bioinformática, Centro de Investigación Biomédica, PTS, Avda, del Conocimiento s/n, 18100-Granada, Spain. mlhack@gmail.com.

ABSTRACT

Background: Many k-mers (or DNA words) and genomic elements are known to be spatially clustered in the genome. Well established examples are the genes, TFBSs, CpG dinucleotides, microRNA genes and ultra-conserved non-coding regions. Currently, no algorithm exists to find these clusters in a statistically comprehensible way. The detection of clustering often relies on densities and sliding-window approaches or arbitrarily chosen distance thresholds.

Results: We introduce here an algorithm to detect clusters of DNA words (k-mers), or any other genomic element, based on the distance between consecutive copies and an assigned statistical significance. We implemented the method into a web server connected to a MySQL backend, which also determines the co-localization with gene annotations. We demonstrate the usefulness of this approach by detecting the clusters of CAG/CTG (cytosine contexts that can be methylated in undifferentiated cells), showing that the degree of methylation vary drastically between inside and outside of the clusters. As another example, we used WordCluster to search for statistically significant clusters of olfactory receptor (OR) genes in the human genome.

Conclusions: WordCluster seems to predict biological meaningful clusters of DNA words (k-mers) and genomic entities. The implementation of the method into a web server is available at http://bioinfo2.ugr.es/wordCluster/wordCluster.php including additional features like the detection of co-localization with gene regions or the annotation enrichment tool for functional analysis of overlapped genes.

No MeSH data available.