A new rhesus macaque assembly and annotation for next-generation sequencing analyses.

Zimin AV, Cornish AS, Maudhoo MD, Gibbs RM, Zhang X, Pandey S, Meehan DT, Wipfler K, Bosinger SE, Johnson ZP, Tharp GK, Marçais G, Roberts M, Ferguson B, Fox HS, Treangen T, Salzberg SL, Yorke JA, Norgren RB - Biol. Direct (2014)

Bottom Line: Further, we demonstrate that the new annotations provide greatly improved accuracy as compared to the current annotations of rheMac2. The MacaM assembly and annotation files provide a substantially more complete and accurate representation of the rhesus macaque genome than rheMac2 or CR_1.0 and will serve as an important resource for investigators conducting next-generation sequencing studies with nonhuman primates. This article was reviewed by Dr. Lutz Walter, Dr. Soojin Yi and Dr. Kateryna Makova.


Affiliation: Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, Nebraska 68198, USA. rnorgren@unmc.edu.

ABSTRACT

Background: The rhesus macaque (Macaca mulatta) is a key species for advancing biomedical research. Like all draft mammalian genomes, the draft rhesus assembly (rheMac2) has gaps, sequencing errors and misassemblies that have prevented automated annotation pipelines from functioning correctly. Another rhesus macaque assembly, CR_1.0, is also available but is substantially more fragmented than rheMac2 with smaller contigs and scaffolds. Annotations for these two assemblies are limited in completeness and accuracy. High quality assembly and annotation files are required for a wide range of studies including expression, genetic and evolutionary analyses.

Results: We report a new de novo assembly of the rhesus macaque genome (MacaM) that incorporates both the original Sanger sequences used to assemble rheMac2 and new Illumina sequences from the same animal. MacaM has a weighted average (N50) contig size of 64 kilobases, more than twice that of the rheMac2 assembly and almost five times that of the CR_1.0 assembly. The MacaM chromosome assembly incorporates information from previously unutilized mapping data and preliminary annotation of scaffolds. Independent assessment of the assemblies using Ion Torrent read alignments indicates that MacaM is more complete and accurate than rheMac2 and CR_1.0. We assembled messenger RNA sequences from several rhesus tissues into transcripts, which allowed us to identify a total of 11,712 complete proteins representing 9,524 distinct genes. Using a combination of our assembled rhesus macaque transcripts and human transcripts, we annotated 18,757 transcripts and 16,050 genes with complete coding sequences in the MacaM assembly. Further, we demonstrate that the new annotations provide greatly improved accuracy as compared to the current annotations of rheMac2. Finally, we show that the MacaM genome provides an accurate resource for alignment of reads produced by RNA sequence expression studies.

Conclusions: The MacaM assembly and annotation files provide a substantially more complete and accurate representation of the rhesus macaque genome than rheMac2 or CR_1.0 and will serve as an important resource for investigators conducting next-generation sequencing studies with nonhuman primates.

Reviewers: This article was reviewed by Dr. Lutz Walter, Dr. Soojin Yi and Dr. Kateryna Makova.
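
For readers unfamiliar with the N50 statistic quoted in the Results, the following is a minimal Python sketch of how it is commonly computed from a list of contig lengths. The function name `n50` and the toy lengths are illustrative assumptions; they are not taken from the article or from any real assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    account for at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy contig lengths (in bases), not data from any assembly.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Because the statistic weights each contig by its own length, a handful of long contigs raises N50 far more than many short ones, which is why it is used to compare assembly contiguity.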


Figure 3: Alignment of rhesus macaque SHE proteins from different annotations with the human protein. Human SHE protein accession: NP_001010846.1. MacaM: Protein derived from the MacaM rhesus macaque genome. rheMac2_N: Protein obtained from the NCBI annotation of rheMac2, accession. rheMac2_E: Protein obtained from the Ensembl annotation of rheMac2, accession ENSMMUT00000032345. CR_1.0: Protein obtained from the Chinese rhesus macaque genome produced by BGI [8]. Yellow highlighting indicates sequence that is identical in the human and alternative rhesus macaque annotations, with the exception of sequences shared only by rheMac2_E and CR_1.0, which are indicated by green highlighting. Exon boundaries are indicated by lines separating amino acids.

Mentions: We were able to annotate genes in MacaM whose exons had been split by misassemblies in rheMac2 [3] but which were contiguous in the MacaM assembly (Figure 2). We used one such case, the SHE gene, to assess the annotations for rheMac2 and CR_1.0 (Figure 3). A search for SHE in the RhesusBase database returned no gene. We were able to identify proteins that appeared to correspond to the SHE protein in MacaM, rheMac2 (both NCBI and Ensembl annotations) and CR_1.0. We aligned these sequences against the human SHE protein sequence. The MacaM sequence was highly similar to the human SHE protein sequence. A portion of the protein sequence encoded by exon 1 was missing in CR_1.0. The protein sequence encoded by exon 3 was missing from the NCBI annotation of rheMac2. In the Ensembl annotation of rheMac2 and in CR_1.0, spurious sequence, presumably derived from intronic sequence, was substituted for the correct protein sequence encoded by exon 3. A correct protein sequence for exon 3 was derived from MacaM because the scaffold containing exon 3 was correctly assembled, which was not the case for rheMac2 [3] or, apparently, CR_1.0 (Figure 2). This demonstrates an important principle: better genome assemblies permit better annotations.
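
The exon-by-exon comparison described above can be illustrated with a small Python sketch that scores an already aligned annotation against a reference protein, one exon at a time. This is not the authors' procedure. The function `per_exon_identity`, the toy sequences and the exon boundaries are all hypothetical and do not represent the SHE protein; the sketch assumes two gapped strings of equal length and simply ignores columns where the reference has a gap.

```python
def per_exon_identity(ref_aln, query_aln, exon_ends):
    """Fraction of reference residues matched by the query, per exon.

    ref_aln, query_aln: gapped ('-') aligned protein strings of equal length.
    exon_ends: cumulative reference residue counts at which each exon ends,
               e.g. [4, 7, 10] for exons covering residues 1-4, 5-7, 8-10.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    matches = [0] * len(exon_ends)
    totals = [0] * len(exon_ends)
    ref_pos = 0  # number of reference residues seen so far
    exon = 0     # index of the exon containing the current residue
    for r, q in zip(ref_aln, query_aln):
        if r == "-":
            continue  # insertion relative to the reference; ignored here
        ref_pos += 1
        while exon < len(exon_ends) and ref_pos > exon_ends[exon]:
            exon += 1
        if exon == len(exon_ends):
            break  # residues beyond the last listed exon end
        totals[exon] += 1
        if r == q:
            matches[exon] += 1
    return [m / t if t else 0.0 for m, t in zip(matches, totals)]


# Hypothetical toy data: a 10-residue reference with three "exons".
reference = "MKTAYLQRSW"
missing_exon3 = "MKTAYLQ---"   # last exon absent from the annotation
partial_exon1 = "--TAYLQRSW"   # start of the first exon absent
ends = [4, 7, 10]
print(per_exon_identity(reference, missing_exon3, ends))
print(per_exon_identity(reference, partial_exon1, ends))
```

A per-exon score separates a localized defect, such as a missing or substituted exon, from uniform divergence across the protein, which a single whole-protein identity value would obscure.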

