CG dinucleotide clustering is a species-specific property of the genome.

Glass JL, Thompson RF, Khulan B, Figueroa ME, Olivier EN, Oakley EJ, Van Zant G, Bouhassira EE, Melnick A, Golden A, Fazzari MJ, Greally JM - Nucleic Acids Res. (2007)

Bottom Line: We also show that the CG clusters co-localize in the human genome with hypomethylated loci and annotated transcription start sites to a greater extent than annotations produced by prior CpG island definitions. Moreover, this new approach allows CG clusters to be identified in a species-specific manner, revealing a degree of orthologous conservation that is not revealed by current base compositional approaches. Finally, our approach is able to identify methylating genomes (such as Takifugu rubripes) that lack CG clustering entirely, in which it is inappropriate to annotate CpG islands or CG clusters.

Affiliation: Department of Molecular Genetics, Albert Einstein College of Medicine, Bronx, NY 10461, USA, Division of Hematology/Oncology, University of Kentucky, Markey Cancer Center, 800 Rose Street, Lexington KY 40536, USA.

ABSTRACT
Cytosines at cytosine-guanine (CG) dinucleotides are the near-exclusive target of DNA methyltransferases in mammalian genomes. Spontaneous deamination of methylcytosine to thymine makes methylated cytosines unusually susceptible to mutation and consequent depletion. The loci where CG dinucleotides remain relatively enriched, presumably due to their unmethylated status during the germ cell cycle, have been referred to as CpG islands. Currently, CpG islands are solely defined by base compositional criteria, allowing annotation of any sequenced genome. Using a novel bioinformatic approach, we show that CG clusters can be identified as an inherent property of genomic sequence without imposing a base compositional a priori assumption. We also show that the CG clusters co-localize in the human genome with hypomethylated loci and annotated transcription start sites to a greater extent than annotations produced by prior CpG island definitions. Moreover, this new approach allows CG clusters to be identified in a species-specific manner, revealing a degree of orthologous conservation that is not revealed by current base compositional approaches. Finally, our approach is able to identify methylating genomes (such as Takifugu rubripes) that lack CG clustering entirely, in which it is inappropriate to annotate CpG islands or CG clusters.

Figure 1: The analytical technique used to define CG clusters. First, a fixed number of CG dinucleotides is chosen, as illustrated in the example using six CGs (a). The first CG is identified in a chromosome, then the sixth, allowing the number of nucleotides between them to be recorded. The second and seventh CGs are then identified and the distance recorded and so on until the data for the entire genome is collected. When this analysis is performed for the entire human genome using 30 CGs at a time, the resulting lengths can be represented as a frequency histogram as shown in panel (b). Two populations are apparent—the peak on the left with short fragment lengths for this number of CGs, and the remainder of the genome where CGs do not cluster. The maximum fragment length for the clustered CGs is shown as a vertical red line. When the analysis is repeated for 5, 10, 15, … 100 CGs genome-wide, different maximum fragment lengths for each number of CGs are derived, with a near-linear relationship between these variables as shown in panel (c). The arrowhead refers to the observation made for 30 CGs illustrated in panel (b).
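
The windowing step described in the caption can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cg_spans` is hypothetical, and the span is counted inclusively from the C of the first CG to the G of the last CG in each window, a convention the caption does not specify.

```python
import re


def cg_spans(sequence, n_cgs):
    """Return the length spanned by every run of n_cgs consecutive CGs.

    Assumption: each span runs from the C of the first CG through the G
    of the n-th CG, inclusive. The window then slides forward by one CG.
    """
    seq = sequence.upper()
    # Start index of every CG dinucleotide, in order along the sequence.
    positions = [m.start() for m in re.finditer("CG", seq)]
    spans = []
    for i in range(len(positions) - n_cgs + 1):
        first = positions[i]
        last = positions[i + n_cgs - 1]
        spans.append(last + 2 - first)  # +2 covers both bases of the last CG
    return spans


# Toy sequence with CGs at indices 0, 3, 7, 9 and 14.
toy = "CGACGTTCGCGAAACG"
print(cg_spans(toy, 3))
```

Run chromosome by chromosome with `n_cgs=30`, the returned lengths would be the raw values behind a histogram like the one in panel (b).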

Mentions: We pursued the hypothesis that there is a subpopulation of sequences in the genome defined solely by their clustering of CG dinucleotides. This clustering is a result of the genome-wide decay of CG dinucleotide content, with preservation of CG density at certain regions. By measuring the distance spanned by a fixed number of CG dinucleotides for every such group genome-wide, we observed that there are two populations of loci with distinctive CG clustering densities (Figure 1a and b). Using the first local minimum in the distribution of spanned sequence fragment lengths as the boundary of the short, CG-dense population, we identified the maximum fragment length for each cluster corresponding to a fixed number of CGs. In analyzing these cutoffs, we defined a linear relationship between CG dinucleotide number and the associated maximum fragment length (Figure 1c).
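
One way to picture the cutoff and the linear fit described above is the following sketch. It rests on several assumptions: the function names, the fixed bin width, the use of the lower bin edge as the cutoff, and the toy numbers are all hypothetical, and real histograms would likely need smoothing before a local minimum is read off. It illustrates the idea and does not reproduce the published procedure.

```python
from collections import Counter


def first_local_minimum_cutoff(spans, bin_width=50):
    """Return the lower edge of the first local-minimum bin, or None.

    Assumption: spans are binned at a fixed width, and the first bin that
    follows a descent and is not above its right neighbour marks the
    boundary of the short, CG-dense population.
    """
    counts = Counter(span // bin_width for span in spans)
    if not counts:
        return None
    hist = [counts.get(b, 0) for b in range(max(counts) + 1)]
    for i in range(1, len(hist) - 1):
        if hist[i - 1] > hist[i] <= hist[i + 1]:
            return i * bin_width
    return None  # no interior minimum found in this distribution


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical span lengths: a short, dense peak and a long, sparse tail.
toy_spans = [100] * 5 + [120] * 8 + [160] * 3 + [220] + [400] * 4 + [450] * 6
print(first_local_minimum_cutoff(toy_spans))

# Hypothetical (number of CGs, cutoff) pairs, one per window size.
print(fit_line([5, 10, 15], [60, 110, 160]))
```

Repeating the cutoff step for each window size (5, 10, 15, … 100 CGs) and passing the resulting pairs to `fit_line` would give a slope and intercept of the kind plotted in panel (c).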

