Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm.
Bottom Line: Use of 'assembled' RNA-Seq transcripts is far from trivial; recent assessments revealed a significant assembly error rate. We demonstrated in computational experiments that the proposed method of incorporating 'unassembled' RNA-Seq reads improves the accuracy of gene prediction; in particular, for the 1.3 Gb genome of Aedes aegypti the mean of prediction Sensitivity and Specificity at the gene level increased over GeneMark-ES by 24.5%. In the current surge of genomic data, when the need for accurate sequence annotation is higher than ever, GeneMark-ET will be a valuable addition to the narrow arsenal of automatic gene prediction tools.
Affiliation: Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering, Atlanta, GA, USA 30332.
Mentions: The general training logic of GeneMark-ET is similar to that of GeneMark-ES (5). First, using an initially defined set of parameters of the hidden semi-Markov model (HSMM), the algorithm predicts protein-coding regions in the chosen genomic sequence. Second, a subset of the newly predicted genes and non-coding regions is selected and used for HSMM parameter re-estimation. Next, the prediction and re-estimation steps are repeated until convergence. GeneMark-ET differs from GeneMark-ES in how it selects the more reliably predicted coding and non-coding regions used for parameter re-estimation. In GeneMark-ET, a likely protein-coding exon is included in the training set only if the predicted exon has at least one ‘anchor splice site’ (Figure 3). ‘Anchor splice sites’ are those predicted independently by both methods: the ab initio prediction and the RNA-Seq read alignment.
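The exon selection rule described above can be illustrated with a minimal Python sketch. It is not the GeneMark-ET implementation: the `Exon` type, the `(kind, position)` encoding of splice sites, and the function name `select_anchored_exons` are all assumptions made for illustration. The sketch only shows the filtering idea, namely that a predicted exon enters the training set when at least one of its splice sites is also supported by RNA-Seq read alignment.

```python
from typing import Iterable, List, NamedTuple, Optional, Set, Tuple

# Hypothetical encoding: a splice site is a (kind, position) pair,
# where kind is "acceptor" or "donor".
SpliceSite = Tuple[str, int]


class Exon(NamedTuple):
    """A predicted exon (illustrative type, not from GeneMark-ET)."""
    start: int
    end: int
    acceptor: Optional[int]  # acceptor site position, None if the exon has none
    donor: Optional[int]     # donor site position, None if the exon has none


def select_anchored_exons(
    predicted: Iterable[Exon],
    rnaseq_sites: Set[SpliceSite],
) -> List[Exon]:
    """Keep exons with at least one 'anchor splice site'.

    A site counts as an anchor when the ab initio prediction (the exon
    boundary) coincides with a site found by RNA-Seq read alignment.
    """
    selected = []
    for exon in predicted:
        acceptor_anchored = (
            exon.acceptor is not None
            and ("acceptor", exon.acceptor) in rnaseq_sites
        )
        donor_anchored = (
            exon.donor is not None
            and ("donor", exon.donor) in rnaseq_sites
        )
        if acceptor_anchored or donor_anchored:
            selected.append(exon)
    return selected


# Toy data: three predicted exons, two splice sites supported by reads.
exons = [
    Exon(100, 200, None, 200),  # donor at 200 is supported
    Exon(300, 400, 300, 400),   # acceptor at 300 is supported
    Exon(500, 600, 500, 600),   # no supported site
]
rnaseq = {("donor", 200), ("acceptor", 300)}
training_exons = select_anchored_exons(exons, rnaseq)
```

In the iterative scheme described in the text, a filter of this kind would be applied after each prediction step, and only the retained exons would feed the next round of parameter re-estimation.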