Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm.

Lomsadze A, Burns PD, Borodovsky M - Nucleic Acids Res. (2014)

Bottom Line: Use of 'assembled' RNA-Seq transcripts is far from trivial; significant error rate of assembly was revealed in recent assessments. We demonstrated in computational experiments that the proposed method of incorporation of 'unassembled' RNA-Seq reads improves the accuracy of gene prediction; particularly, for the 1.3 GB genome of Aedes aegypti the mean value of prediction Sensitivity and Specificity at the gene level increased over GeneMark-ES by 24.5%. In the current surge of genomic data when the need for accurate sequence annotation is higher than ever, GeneMark-ET will be a valuable addition to the narrow arsenal of automatic gene prediction tools.


Affiliation: Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering, Atlanta, GA, USA 30332.



Figure 3: Selection of training set elements in GeneMark-ET for the next iteration. The new training set of protein-coding regions comprises exons with at least one ‘anchored splice site’ as well as long exons predicted ab initio (>800 nt).

Mentions: The general training logic of GeneMark-ET is similar to that of GeneMark-ES (5). First, using an initial set of hidden semi-Markov model (HSMM) parameters, the algorithm predicts protein-coding regions in the chosen genomic sequence. Second, a subset of the newly predicted genes and non-coding regions is selected and used to re-estimate the HSMM parameters. The prediction and re-estimation steps are then repeated until convergence. GeneMark-ET differs from GeneMark-ES in how it selects the more reliably predicted coding and non-coding regions used for parameter re-estimation. In GeneMark-ET, inclusion of a likely protein-coding exon in the training set requires the predicted exon to have at least one ‘anchor splice site’ (Figure 3). ‘Anchor splice sites’ are those predicted independently by both methods: ab initio prediction and RNA-Seq read alignment.
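
The selection rule and the predict/re-estimate loop described above can be sketched roughly as follows. This is an illustrative sketch, not the GeneMark-ET implementation: the `Exon` and `SpliceSite` representations, the function names `select_training_exons` and `train`, the `predict` and `reestimate` callables, the iteration cap, and the convergence test (parameters stop changing) are all assumptions made for the example. Only the two selection criteria (at least one anchored splice site, or an ab initio exon longer than 800 nt) come from the text and the Figure 3 caption. Selection of non-coding regions is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical splice-site key: (sequence id, coordinate, strand, "donor"/"acceptor").
SpliceSite = Tuple[str, int, str, str]

# Threshold from the Figure 3 caption: long ab initio exons are > 800 nt.
LONG_EXON_MIN_NT = 800


@dataclass(frozen=True)
class Exon:
    """An ab initio predicted exon (hypothetical representation)."""
    seq_id: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    strand: str
    splice_sites: Tuple[SpliceSite, ...]  # predicted splice sites bounding the exon

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def select_training_exons(predicted: Iterable[Exon],
                          rnaseq_sites: Set[SpliceSite]) -> List[Exon]:
    """Keep exons with an anchored splice site, plus long ab initio exons."""
    selected = []
    for exon in predicted:
        # A site is "anchored" when the ab initio prediction and the
        # RNA-Seq read alignments agree on it.
        anchored = any(site in rnaseq_sites for site in exon.splice_sites)
        if anchored or exon.length > LONG_EXON_MIN_NT:
            selected.append(exon)
    return selected


def train(params,
          predict: Callable[[object], List[Exon]],
          reestimate: Callable[[List[Exon]], object],
          rnaseq_sites: Set[SpliceSite],
          max_iter: int = 10):
    """Alternate prediction and re-estimation until parameters stop changing.

    The equality-based stopping rule and max_iter are assumptions of this sketch.
    """
    for _ in range(max_iter):
        exons = predict(params)
        training_set = select_training_exons(exons, rnaseq_sites)
        new_params = reestimate(training_set)
        if new_params == params:
            break
        params = new_params
    return params
```

In this sketch the RNA-Seq evidence enters only through `rnaseq_sites`, the set of splice sites supported by read alignments, which mirrors the idea of using mapped, unassembled reads rather than assembled transcripts.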

