GASS: genome structural annotation for Eukaryotes based on species similarity.

Wang Y, Chen L, Song N, Lei X - BMC Genomics (2015)

Bottom Line: The experimental results showed that GASS exactly recovered more than 65% of RefSeq exons and splicing junctions. We also found mis-assemblies in the rheMac3 genome, which led to 2 bp shifts in the annotated positions of exon boundaries and, in turn, to incomplete canonical splice sites in the RefSeq annotations. GASS can be applied in many settings, such as the analysis of RNA-Seq datasets from unannotated species whose draft genomes are available but whose annotations are not.


Affiliation: Department of Automation, School of Information Science and Technology, Xiamen University, Xiamen, Fujian, 361005, China. wangying@xmu.edu.cn.

ABSTRACT

Background: With the development of high-throughput sequencing techniques, more and more genomes have been sequenced and assembled. However, annotating a genome's structure rapidly and expressly remains challenging. Current eukaryotic genome annotation requires varied and abundant supporting data, such as species-specific and cross-species protein sequences, ESTs, cDNA and RNA-Seq data. Collecting these data and merging their analytical results into a consistent, complete annotation is a complex, time-consuming and costly task.

Results: In this study, we propose a fast and easy-to-use computational tool, GASS (Genome Annotation based on Species Similarity). It annotates a eukaryotic genome using only the annotations of another, similar species. By aligning the exon sequences of an annotated similar species to the un-annotated genome, GASS detects the optimal transcript annotations with a shortest-path model. We used GASS to produce rhesus annotations from the human annotations. The resulting annotations were evaluated by comparing them directly with the two existing rhesus annotation databases (RefSeq and Ensembl) and by aligning them with three rhesus RNA-Seq datasets. The experimental results showed that GASS exactly recovered more than 65% of RefSeq exons and splicing junctions. GASS's sensitivity was higher than RefSeq's and close to Ensembl's. GASS had higher specificity than Ensembl at the gene, transcript, exon and splicing-junction levels. We also found mis-assemblies in the rheMac3 genome, which led to 2 bp shifts in the annotated positions of exon boundaries and, in turn, to incomplete canonical splice sites in the RefSeq annotations. These findings were further supported by various data sources.
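The exon-level comparison described above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the excerpt does not give the exact definitions, so the sketch assumes the common gene-prediction convention in which sensitivity is the fraction of reference exons recovered exactly and specificity is the fraction of predicted exons that match a reference exon exactly. The `Exon` tuple layout, the function name `exon_level_metrics` and the toy coordinates are all hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical exon record: (chromosome, start, end, strand).
Exon = Tuple[str, int, int, str]


def exon_level_metrics(
    predicted: Iterable[Exon], reference: Iterable[Exon]
) -> Tuple[int, float, float]:
    """Return (exact matches, sensitivity, specificity) at the exon level.

    Assumed definitions (not taken from the paper):
      sensitivity = exact matches / number of reference exons
      specificity = exact matches / number of predicted exons
    An exon counts as matched only if chromosome, both boundaries and
    strand are identical.
    """
    pred, ref = set(predicted), set(reference)
    n_exact = len(pred & ref)
    sensitivity = n_exact / len(ref) if ref else 0.0
    specificity = n_exact / len(pred) if pred else 0.0
    return n_exact, sensitivity, specificity


# Toy data: made-up coordinates, not values from the study.
reference_exons = [
    ("chr6", 100, 200, "+"),
    ("chr6", 300, 400, "+"),
    ("chr6", 500, 600, "+"),
    ("chr6", 700, 800, "+"),
]
predicted_exons = [
    ("chr6", 100, 200, "+"),
    ("chr6", 300, 400, "+"),
    ("chr6", 502, 600, "+"),  # boundary shifted by 2 bp: not an exact match
    ("chr6", 700, 800, "+"),
    ("chr6", 900, 950, "+"),  # exon absent from the reference
]

n_exact, sensitivity, specificity = exon_level_metrics(predicted_exons, reference_exons)
print(n_exact, sensitivity, specificity)
```

Under exact matching, a boundary that is off by only 2 bp counts as a miss, which is why small coordinate shifts matter for figures such as the 65% quoted above.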

Conclusions: GASS quickly produces structural genome annotations of sufficient abundance and accuracy. Because GASS is simple and fast to run, small labs can obtain a quick view of the genome annotations for an un-annotated species without having to create, collect, analyze and synthesize various additional data sources, or wait several months for annotations from professional organizations. GASS can be applied in many settings, such as the analysis of RNA-Seq datasets from unannotated species whose draft genomes are available but whose annotations are not.


Figure 7: Mis-annotation in RefSeq on rheMac3. For transcript NM_001260832: A) The 12th exon has a 2 bp shift between the RefSeq-rheMac3 and GASS annotations. B) The 12th intron does not satisfy the “GT-AG” canonical splice-site rule. C) The amino acids translated codon by codon from the nucleotide sequence of the RefSeq transcript are inconsistent with the amino acid sequence in the RefSeq protein database. D) Three RNA-Seq datasets and three DNA-Seq datasets are mapped to the sequences of the 12th and 13th exons; “AG” cannot be mapped to the RNA-Seq or DNA-Seq datasets, meaning that the assembly of the 12th intron is wrong. E) The corrected genome sequence and annotations are given; the transcript nucleotide sequence and the translated amino acid sequence are then consistent.

Mentions: While comparing the three databases, we found more than 2,000 exons with 2 bp shifts at the splice donor or acceptor between GASS and RefSeq-rheMac3. Some of these 2 bp-shifted boundaries were checked in detail. As shown in Figure 7(A), for transcript NM_001260832 of gene UTP15, the 12th exon has a 2 bp shift (70174203 vs. 70174205) between the RefSeq-rheMac3 and GASS annotations. We then found an incomplete GT-AG canonical splice site in the 11th intron in RefSeq-rheMac3, as shown in Figure 7(B): the first two nucleotides of the 11th intron are “gt” while the last two are “tt”, so the “GT-AG” canonical splice site is missing its “AG”. Meanwhile, as shown in Figure 7(C), the amino acids translated codon by codon from the RefSeq nucleotides are inconsistent with the amino acid sequence in the RefSeq protein database. For further validation, three RNA-Seq and two DNA-Seq datasets (see Additional file 1: Table S4) were aligned to the RefSeq-rheMac3 transcript and the rheMac3 genome, respectively. As shown in Figure 7(D), the highly consistent alignment results support the following conclusions: ① The RNA-Seq alignments show that the splicing junction should include the complete “GT-AG” canonical splice site. ② The DNA-Seq alignments show that the current rheMac3 genome assembly is missing two nucleotides, “TG”, between positions 70174212 and 70174213 of chromosome 6. The corrected genome sequence and annotations are given in Figure 7(E), and the transcript nucleotide sequence and the translated amino acid sequence are then consistent. Another example, the analysis of transcript NM_001260538, is given in Additional file 1: Figure S8.
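The GT-AG check described above can be sketched in a few lines. This is an illustrative example only, not part of GASS: the function names, the 0-based half-open coordinate convention and the toy sequence are assumptions, and the sequence is invented rather than taken from rheMac3. It shows how an intron boundary that is off by 2 bp turns a canonical “AG” acceptor into a non-canonical dinucleotide, the kind of signal that flagged the exons discussed here.

```python
from typing import Tuple


def splice_dinucleotides(genome: str, intron_start: int, intron_end: int) -> Tuple[str, str]:
    """Return the (donor, acceptor) dinucleotides of an intron.

    Coordinates are assumed to be 0-based and half-open, i.e. the intron
    is genome[intron_start:intron_end] on the forward strand.
    """
    intron = genome[intron_start:intron_end].upper()
    if len(intron) < 4:
        raise ValueError("intron too short to carry donor and acceptor sites")
    return intron[:2], intron[-2:]


def is_canonical_gt_ag(genome: str, intron_start: int, intron_end: int) -> bool:
    """True if the intron starts with GT and ends with AG."""
    return splice_dinucleotides(genome, intron_start, intron_end) == ("GT", "AG")


# Toy forward-strand sequence: exon (upper case), intron (lower case), exon.
toy_genome = "ATGGCC" + "gtaagtcttag" + "GATTAA"

# Intron annotated at its true boundaries: positions 6..17.
true_sites = splice_dinucleotides(toy_genome, 6, 17)

# Same intron with the acceptor boundary annotated 2 bp too early.
shifted_sites = splice_dinucleotides(toy_genome, 6, 15)

print(true_sites, is_canonical_gt_ag(toy_genome, 6, 17))
print(shifted_sites, is_canonical_gt_ag(toy_genome, 6, 15))
```

A failed check of this kind does not say whether the annotation or the assembly is at fault; in the case above, that was settled by the RNA-Seq and DNA-Seq alignments.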

