GASS: genome structural annotation for Eukaryotes based on species similarity.

Wang Y, Chen L, Song N, Lei X - BMC Genomics (2015)

Bottom Line: The experimental results showed that GASS exactly recovered more than 65% of RefSeq exons and splice junctions. We also found mis-assemblies in the rheMac3 genome, which caused 2 bp shifts in the annotated positions of exon boundaries and, in turn, incomplete canonical splice sites in the RefSeq annotations. GASS can be applied in many settings, such as the analysis of RNA-Seq datasets from unannotated species whose genome drafts are available but whose annotations are not.


Affiliation: Department of Automation, School of Information Science and Technology, Xiamen University, Xiamen, Fujian, 361005, China. wangying@xmu.edu.cn.

ABSTRACT

Background: With the development of high-throughput sequencing techniques, more and more genomes have been sequenced and assembled. However, annotating a genome's structure rapidly and expressly remains challenging. Current eukaryotic genome annotation requires abundant and varied supporting data, such as species-specific and cross-species protein sequences, ESTs, cDNA and RNA-Seq data. Collecting these data and merging their analytical results into a consistent, complete annotation is a complex, time-consuming and costly task.

Results: In our study, we propose a fast and easy-to-use computational tool: GASS (Genome Annotation based on Species Similarity). It annotates a eukaryotic genome using only the annotations of another, similar species. By aligning the exon sequences of an annotated similar species to the un-annotated genome, GASS detects the optimal transcript annotations with a shortest-path model. In our study, GASS was used to produce rhesus annotations from the human annotations. The resulting annotations were evaluated by comparing them directly with the two existing rhesus annotation databases (RefSeq and Ensembl) and by aligning them with three rhesus RNA-Seq datasets. The experimental results showed that GASS exactly recovered more than 65% of RefSeq exons and splice junctions. GASS's sensitivity was higher than RefSeq's and close to Ensembl's. GASS had higher specificity than Ensembl at the gene, transcript, exon and splice junction levels. We also found mis-assemblies in the rheMac3 genome, which caused 2 bp shifts in the annotated positions of exon boundaries and, in turn, incomplete canonical splice sites in the RefSeq annotations. These findings were further supported by various data sources.

Conclusions: GASS quickly produces structural genome annotations of sufficient abundance and accuracy. Because GASS is simple and fast to run, small labs can obtain a quick view of the annotations for an un-annotated species without having to create, collect, analyze and synthesize additional data sources, or wait several months for annotations from professional organizations. GASS can be applied in many settings, such as the analysis of RNA-Seq datasets from unannotated species whose genome drafts are available but whose annotations are not.


Figure 4: Comparing GASS and RefSeq-rheMac3 at the transcript level. The overlap of the two sets is the number of exactly identical transcripts. Among the non-identical transcripts, 2,077 GASS transcripts are each a “subset” of some RefSeq-rheMac3 transcript, and 27 RefSeq-rheMac3 transcripts are each a “subset” of some GASS transcript. The number of transcripts sharing at least one identical exon is also shown.

Mentions: Transcript level: The 5’ and 3’ UTRs are the least precise parts of gene annotations [25]. Therefore, we excluded the first and last exons of each transcript from the comparison. As shown in Figure 4, approximately 50% of RefSeq-rheMac3’s transcripts are exactly the same as GASS transcripts. For two transcripts A and B, if every exon of A can be found in B but A lacks one or more exons of B, then A is a “subset” of B. There are 2,077 GASS transcripts that are each a “subset” of a RefSeq-rheMac3 transcript, and 27 RefSeq-rheMac3 transcripts that are each a “subset” of a GASS transcript, as illustrated in Additional file 1: Figure S4. In addition, 1,687 RefSeq transcripts and 8,709 GASS transcripts share at least one common exon. That is, 87.91% of RefSeq-rheMac3’s transcripts share at least one exon with transcripts in GASS.
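The transcript-level comparison described above (drop the first and last exons, then test for identity, “subset”, or a shared exon) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the exon representation as (chromosome, start, end) tuples, the function names `internal_exons` and `classify`, and the decision to treat transcripts with no internal exons as "not_comparable" are all assumptions introduced here.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical exon representation: (chromosome, start, end).
Exon = Tuple[str, int, int]


def internal_exons(transcript: List[Exon]) -> FrozenSet[Exon]:
    """Return the exons of a transcript without its first and last exon.

    The terminal exons carry the UTRs, which the text excludes from the
    comparison. Exons are sorted by position first, so the input order
    does not matter.
    """
    ordered = sorted(transcript)
    return frozenset(ordered[1:-1])


def classify(a: List[Exon], b: List[Exon]) -> str:
    """Classify the relation between transcripts a and b on internal exons."""
    ia, ib = internal_exons(a), internal_exons(b)
    if not ia or not ib:
        # Assumption: with two or fewer exons nothing is left to compare.
        return "not_comparable"
    if ia == ib:
        return "identical"
    if ia < ib:  # every exon of a is in b, and b has extra exon(s)
        return "a_subset_of_b"
    if ib < ia:
        return "b_subset_of_a"
    if ia & ib:
        return "share_exon"
    return "disjoint"


# Made-up coordinates, for illustration only.
gass_tx = [("chr1", 100, 200), ("chr1", 300, 400),
           ("chr1", 500, 600), ("chr1", 700, 800)]
refseq_tx = [("chr1", 100, 200), ("chr1", 300, 400), ("chr1", 450, 480),
             ("chr1", 500, 600), ("chr1", 700, 800)]

print(classify(gass_tx, refseq_tx))
```

Here the hypothetical GASS transcript lacks the internal exon at 450-480, so it would be counted among the “subset” transcripts. Using proper-subset comparison on sets matches the definition in the text, where the subset transcript must miss at least one exon of the other.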

