Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm.
Bottom Line: Use of 'assembled' RNA-Seq transcripts is far from trivial; significant error rate of assembly was revealed in recent assessments. We demonstrated in computational experiments that the proposed method of incorporation of 'unassembled' RNA-Seq reads improves the accuracy of gene prediction; particularly, for the 1.3 GB genome of Aedes aegypti the mean value of prediction Sensitivity and Specificity at the gene level increased over GeneMark-ES by 24.5%. In the current surge of genomic data when the need for accurate sequence annotation is higher than ever, GeneMark-ET will be a valuable addition to the narrow arsenal of automatic gene prediction tools.
Affiliation: Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering, Atlanta, GA, USA 30332.
Mentions: In what follows we show how an unsupervised training procedure can use spliced alignments of ‘unassembled’ RNA-Seq reads (rather than assembled transcripts) to improve the accuracy of parameter estimation and gene prediction. The key to combining two independent methods, ab initio gene prediction and RNA-Seq read mapping, is the notion of ‘anchor splice sites’: sites supported both by ab initio gene prediction and by RNA-Seq read alignment. In contrast to existing training methods, which rely on training sets of complete or almost complete gene structures, the new algorithm uses sets of gene elements (exons and introns) supported by anchor splice sites to form reliable training sets in the iterative cycles of model re-training. The new algorithm was tested on genomic and transcriptomic data of several insect species, Aedes aegypti, Anopheles gambiae, Anopheles stephensi, Culex quinquefasciatus and Drosophila melanogaster, which vary significantly in the average size of introns and intergenic regions (Figure 1). We have demonstrated that RNA-Seq support in training gives GeneMark-ET higher gene prediction accuracy than GeneMark-ES, which uses purely unsupervised training. The parameter estimation procedure was robust to variations in the size of the set of mapped introns, repeat content and fragmentation of the genome assembly. The new method, used stand-alone or as part of a pipeline, should streamline and accelerate the annotation process in large genomes while improving the accuracy of gene identification.
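The notion of anchor splice sites described above can be illustrated with a minimal Python sketch. It is not the GeneMark-ET implementation: the `Intron` and `Exon` types, the function names, the 1-based inclusive coordinates and the rule that an exon qualifies when at least one of its borders touches an anchored intron are all assumptions made for illustration. The actual selection criteria used in the iterative re-training may differ.

```python
from typing import Iterable, List, NamedTuple, Set


class Intron(NamedTuple):
    """Hypothetical intron record; coordinates are 1-based and inclusive."""
    contig: str
    start: int
    end: int
    strand: str


class Exon(NamedTuple):
    """Hypothetical exon record; coordinates are 1-based and inclusive."""
    contig: str
    start: int
    end: int
    strand: str


def anchored_introns(predicted: Iterable[Intron],
                     mapped: Iterable[Intron]) -> Set[Intron]:
    """Introns whose two splice sites are supported by both sources.

    `predicted` comes from the ab initio gene predictions, `mapped` from
    spliced alignments of RNA-Seq reads. Exact coordinate agreement is
    the assumed criterion for support.
    """
    return set(predicted) & set(mapped)


def anchored_exons(exons: Iterable[Exon],
                   anchors: Iterable[Intron]) -> List[Exon]:
    """Predicted exons that border at least one anchored intron."""
    anchors = list(anchors)
    intron_starts = {(i.contig, i.strand, i.start) for i in anchors}
    intron_ends = {(i.contig, i.strand, i.end) for i in anchors}
    selected = []
    for e in exons:
        # An exon touches an intron if the intron starts right after the
        # exon ends, or ends right before the exon starts.
        followed = (e.contig, e.strand, e.end + 1) in intron_starts
        preceded = (e.contig, e.strand, e.start - 1) in intron_ends
        if followed or preceded:
            selected.append(e)
    return selected


# Toy data: the second predicted intron disagrees with the read mapping.
predicted = [Intron("chr1", 101, 200, "+"), Intron("chr1", 301, 400, "+")]
mapped = [Intron("chr1", 101, 200, "+"), Intron("chr1", 305, 400, "+")]
exons = [Exon("chr1", 1, 100, "+"),
         Exon("chr1", 201, 300, "+"),
         Exon("chr1", 401, 500, "+")]

anchors = anchored_introns(predicted, mapped)
training_exons = anchored_exons(exons, anchors)
```

In this toy input only the first intron is anchored, so the first two exons, which flank it, enter the training set and the third is left out. In a sketch of this kind the selected elements would be used to re-estimate model parameters, after which prediction and anchoring would be repeated in the next cycle.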