Comparative analysis using K-mer and K-flank patterns provides evidence for CpG island sequence evolution in mammalian genomes.

Chae H, Park J, Lee SW, Nephew KP, Kim S - Nucleic Acids Res. (2013)

Bottom Line: First, by calculating genome distance based on rank correlation tests, we show that k-mer and k-flank patterns around CpG sites can be used to correctly reconstruct the phylogeny of 10 mammalian genomes. Further, we used various machine learning algorithms to demonstrate that CpG island sequences can be characterized using k-mers. In addition, by testing a human model on the nine other mammalian genomes, we provide the first evidence that k-mer signatures are consistent with evolutionary history.


Affiliation: Department of Computer Science, School of Informatics and Computing, Indiana University, Bloomington, IN, USA.

ABSTRACT
CpG islands are GC-rich regions often located at the 5' end of genes and normally protected from cytosine methylation in mammals. The important role of CpG islands in gene transcription strongly suggests evolutionary conservation in the mammalian genome. However, as CpG dinucleotides are over-represented in CpG islands, comparative CpG island analysis using conventional sequence analysis techniques remains a major challenge in the epigenetics field. In this study, we conducted a comparative analysis of all CpG island sequences in 10 mammalian genomes. As sequence similarity methods and character composition techniques such as information theory are particularly difficult to apply, we used exact patterns in CpG island sequences and single character discrepancies to identify differences in CpG island sequences. First, by calculating genome distance based on rank correlation tests, we show that k-mer and k-flank patterns around CpG sites can be used to correctly reconstruct the phylogeny of 10 mammalian genomes. Further, we used various machine learning algorithms to demonstrate that CpG island sequences can be characterized using k-mers. In addition, by testing a human model on the nine other mammalian genomes, we provide the first evidence that k-mer signatures are consistent with evolutionary history.

© Copyright Policy - creative-commons

Figure 1 (gkt144-F1): Definition of K-mer and K-flank.

Mentions: As shown in the previous section, over-represented CpG dinucleotides make it difficult to analyze CpG islands. To overcome this hurdle, we used two new oligomer-counting approaches based on exact sequence patterns, called k-mer and k-flank, for the comparative analysis of all CpG island sequences in 10 mammalian genomes. The first model, k-mer, gives a general descriptor of the oligomer landscape across the entire CpG island sequence. Given a sequence S, k-mers are sub-sequences of S of length k, also known as oligomers for small k. For each CpG island sequence, a sliding window of length k is moved across the sequence from the 5′ end to the 3′ end, and each k-mer occurrence is counted. Determining the number of occurrences of specific k-mers in a sequence is called k-mer counting or oligomer counting, and can provide descriptive information about the DNA sequence (6, 21). To better characterize and describe CpG island sequences, we used k-mer counting techniques and frequency measurements to perform a comparative analysis of the CpG islands. The second oligomer-counting approach, called k-flank, records the DNA sequence of length k directly upstream and downstream of each CpG site in a CpG island sequence. This model is stricter and specifically describes the DNA bases directly adjacent to CpG sites. Figure 1 illustrates the definitions of k-mer and k-flank. In this study, we counted k-mers and k-flanks for k = 3–10. Once k-mers and k-flanks were counted, we used the pattern counts to reconstruct the phylogeny of the 10 mammalian genomes and for machine learning analysis to show that (i) k-mers are characteristic of CpG island sequences and (ii) k-mer data are consistent with the evolutionary history of the 10 mammalian genomes. As far as we know, this study is the first extensive comparative analysis of CpG island sequences.
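The two counting schemes described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `count_kmers` and `count_kflanks`, the toy sequence, and the choice to record each k-flank as an (upstream, downstream) pair are assumptions made for illustration. The sketch also assumes that CpG sites with fewer than k bases on either side are skipped, a boundary rule the excerpt does not state.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every length-k window, sliding one base at a time from 5' to 3'."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def count_kflanks(seq, k):
    """Count (upstream, downstream) k-base pairs around each CpG site.

    Assumption (not from the paper): a CpG site with fewer than k bases
    on either side is skipped rather than padded.
    """
    seq = seq.upper()
    flanks = Counter()
    start = seq.find("CG")
    while start != -1:
        up_start = start - k        # first base of the upstream flank
        down_end = start + 2 + k    # one past the last downstream base
        if up_start >= 0 and down_end <= len(seq):
            upstream = seq[up_start:start]
            downstream = seq[start + 2:down_end]
            flanks[(upstream, downstream)] += 1
        start = seq.find("CG", start + 1)
    return flanks


# Toy sequence (hypothetical, not real CpG island data).
example = "ATCGGACGTTCGA"
kmers = count_kmers(example, 3)
kflanks = count_kflanks(example, 2)
print(kmers.most_common(3))
print(dict(kflanks))
```

In this toy sequence the third CpG site sits one base from the 3′ end, so it contributes to the 3-mer counts but not to the 2-flank counts under the skipping rule assumed here. Repeating both counts for k = 3 to 10 over every CpG island in a genome would give the kind of per-genome pattern profile the excerpt describes.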

