Annotation of suprachromosomal families reveals uncommon types of alpha satellite organization in pericentromeric regions of hg38 human genome assembly.

Shepelev VA, Uralsky LI, Alexandrov AA, Yurov YB, Rogaev EI, Alexandrov IA - Genom Data (2015)

Bottom Line: We used our previously described PERCON program to annotate various suprachromosomal families (SFs) of AS in the hg38 assembly and presented the results of our primary analysis as an easy-to-read track for the UCSC Genome Browser. No new SFs or large unclassed AS domains were discovered. The dataset reveals the architecture of human centromeres and allows classification of AS sequence reads by alignment to the annotated hg38 assembly.


Affiliation: Institute of Molecular Genetics, Russian Academy of Sciences, Kurchatov sq. 2, Moscow 123182, Russia; Department of Genomics and Human Genetics, Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow 119991, Russia; Center for Brain Neurobiology and Neurogenetics, Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk 630090, Russia.

ABSTRACT

Centromeric alpha satellite (AS) is composed of highly identical higher-order repetitive DNA sequences, which makes the standard assembly process impossible. Because of this, AS repeats were severely underrepresented in previous versions of the human genome assembly, which showed large centromeric gaps. The latest hg38 assembly (GCA_000001405.15) employed a novel method of approximate representation of these sequences, using AS reference models to fill the gaps. As a result, much more assembled AS became available for genomic analysis. We used our previously described PERCON program to annotate various suprachromosomal families (SFs) of AS in the hg38 assembly and presented the results of our primary analysis as an easy-to-read track for the UCSC Genome Browser. The monomeric classes characteristic of the five known SFs were color-coded, which allowed quick visual assessment of AS composition in whole multi-megabase centromeres down to each individual AS monomer. This is the first time such a comprehensive annotation of AS in the human genome assembly has been performed. It showed the expected prevalence of the known major types of AS organization characteristic of the five established SFs. In addition, some less common types of AS arrays were identified, such as pure R2 domains in SF5, apparent J/R and D/R mixes in SF1 and SF2, and several different SF4 higher-order repeats among reference models and in regular contigs. No new SFs or large unclassed AS domains were discovered. The dataset reveals the architecture of human centromeres and allows classification of AS sequence reads by alignment to the annotated hg38 assembly. The data were deposited here: http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&hgt.customText=https://dl.dropboxusercontent.com/u/22994534/AS-tracks/human-GRC-hg38-M1SFs.bed.bz2.



Fig. 1: Comparison of AS SF profiles of the hg38 human genome assembly and the HuRef WGS dataset. The figure plots the SF content of the two datasets in Mb per haploid genome (3 × 10^9 bp). For the WGS dataset (1 million reads), the number of AS monomers identified by PERCON was multiplied by the average length of a monomer in this dataset (146 bp) and normalized to the genome size (shown as “HuRef raw”). The same amount of AS, divided in the proportions obtained in the same sample with bad ends trimmed and filtered for monomers 140 bp or longer (average monomer length 168 bp), is shown as “HuRef corrected” (see Fig. S2 for details). For the assembly, the length of all monomers identified by PERCON in each category was summed directly from the PERCON track using the Table Browser. In both datasets, the real amounts are slightly underestimated in a similar manner, because the small gaps that PERCON often leaves between monomers due to imperfect alignment of the ends are not taken into account.
© Copyright Policy - CC BY

Mentions: The overall statistics of AS were collected from the UCSC Browser track using the Table Browser and analyzed to check how well the track data corresponded to what was known from other methods and sources. The overall detection of AS by PERCON (70.1 Mbp, 2.30%) did not differ significantly from that of RepeatMasker (http://repeatmasker.org/), as used in the UCSC Browser RepeatMasker track (70.8 Mbp, 2.32%). RepeatMasker records that had at least 98% overlap with the PERCON track constituted 69.5 Mbp, or 98.2% of total RepeatMasker detection. RepeatMasker records that had no overlap with the PERCON AS SF track constituted 3628 bp, or 0.01% of total detection; these records were all small fragments shorter than one monomer. The size of DNA occupied by monomers of each SF in the hg38 assembly is shown in Fig. 1, where it is compared to the data obtained from the analysis of a WGS dataset. One million HuRef WGS reads (PRJNA19621) obtained by random DNA fragmentation were processed by PERCON in the same way as described previously for the analysis of BAC ends [7]. There was no dramatic difference in SF profiles between the two sets, except that the proportion of unclassed monomers in WGS (6%) was predictably higher than in the assembly (2.5%). Both the large number of truncated monomers at the ends of WGS reads and the low sequence quality at the ends of Sanger reads contributed to this difference (Fig. S2). To correct for these factors, we evaluated the effect of trimming the bad ends using the LUCY program with default parameters [22] and of filtering the dataset for monomers 140 bp or longer (such monomers are marked with upper-case letters in PERCON output). The results are shown in Fig. S2, where it can be seen that each step reduced both the number of unclassed monomers and the total AS detection.
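The overlap comparison between the RepeatMasker and PERCON tracks described above can be illustrated with a minimal sketch. This is not the authors' procedure (they used the UCSC Table Browser); it only shows one way to compute, for records on a single chromosome, how many bases fall in records that are at least 98% covered by another track and how many fall in records with no coverage at all. The function names and the toy coordinates are hypothetical, and intervals are assumed to be half-open (start, end) pairs as in BED.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), single chromosome (assumption)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or abutting intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(record: Interval, merged_track: List[Interval]) -> float:
    """Fraction of the record's bases covered by a disjoint (merged) track."""
    start, end = record
    covered = sum(
        max(0, min(end, t_end) - max(start, t_start))
        for t_start, t_end in merged_track
    )
    return covered / (end - start)


def overlap_summary(
    records: List[Interval], track: List[Interval], threshold: float = 0.98
) -> Tuple[int, int]:
    """Return (bases in records covered >= threshold, bases in records with no overlap)."""
    merged = merge_intervals(track)
    high_bp = 0
    none_bp = 0
    for record in records:
        length = record[1] - record[0]
        fraction = covered_fraction(record, merged)
        if fraction >= threshold:
            high_bp += length
        elif fraction == 0:
            none_bp += length
    return high_bp, none_bp


if __name__ == "__main__":
    # Toy coordinates, not taken from the hg38 tracks.
    repeatmasker_like = [(0, 171), (200, 300), (1000, 1050)]
    percon_like = [(0, 100), (100, 170), (205, 300)]
    print(overlap_summary(repeatmasker_like, percon_like))
```

Merging the reference track first keeps the covered-base count from double counting where track records overlap each other. Records that are partly covered but fall below the threshold are counted in neither total, which is consistent with the two reported categories not summing to 100%.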
To combine good detection with more accurate SF measurements, we divided the AS amount obtained in the unfiltered sample according to the SF proportions obtained in the double-filtered sample. After this correction (see the legend to Fig. S2 for more details), the proportion of unclassed monomers dropped to 2% and the WGS data appeared to be in fairly good concordance with the assembly. The only significant difference was a larger SF3 in the assembly; we did not investigate the reasons for this discrepancy. A more detailed discussion of the possible sources of unclassed monomers was provided in [7]. The proportions of SFs in the human genome shown in Fig. 1 are, to our knowledge, the first accurate estimate obtained in a non-biased sample. The results differ significantly from those in [7], which was our previous attempt to SF-class a large sample of AS fragments. The difference is presumably due to restriction enzyme bias in the older sample.
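The arithmetic behind the "HuRef raw" and "HuRef corrected" values can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the total number of sequenced bases in the read sample (`sample_bp`) is not given in the text and is a hypothetical input, the SF proportions are assumed here to be taken by monomer count, and all numbers in the example are made up.

```python
from typing import Dict

HAPLOID_GENOME_BP = 3e9  # genome size used for normalization in the figure legend


def as_mb_per_haploid_genome(
    n_monomers: int,
    mean_monomer_bp: float,
    sample_bp: float,  # assumption: total bases in the read sample
    genome_bp: float = HAPLOID_GENOME_BP,
) -> float:
    """Estimate AS content in Mb per haploid genome from a read sample."""
    as_fraction = n_monomers * mean_monomer_bp / sample_bp
    return as_fraction * genome_bp / 1e6


def corrected_profile(
    raw_total_mb: float, filtered_counts: Dict[str, int]
) -> Dict[str, float]:
    """Split the unfiltered AS total using proportions from the filtered sample."""
    total = sum(filtered_counts.values())
    return {sf: raw_total_mb * n / total for sf, n in filtered_counts.items()}


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the calculation.
    raw_mb = as_mb_per_haploid_genome(
        n_monomers=1000, mean_monomer_bp=146, sample_bp=1_000_000
    )
    filtered = {"SF1": 50, "SF2": 30, "unclassed": 20}
    for sf, mb in corrected_profile(raw_mb, filtered).items():
        print(f"{sf}: {mb:.1f} Mb")
```

The point of the two-step design is that the unfiltered sample gives the better estimate of total AS, while the trimmed and length-filtered sample gives the more reliable class proportions, so the corrected profile keeps the former total and redistributes it according to the latter.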

