Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm.

Lomsadze A, Burns PD, Borodovsky M - Nucleic Acids Res. (2014)

Bottom Line: Use of 'assembled' RNA-Seq transcripts is far from trivial; significant error rate of assembly was revealed in recent assessments. We demonstrated in computational experiments that the proposed method of incorporation of 'unassembled' RNA-Seq reads improves the accuracy of gene prediction; particularly, for the 1.3 GB genome of Aedes aegypti the mean value of prediction Sensitivity and Specificity at the gene level increased over GeneMark-ES by 24.5%. In the current surge of genomic data when the need for accurate sequence annotation is higher than ever, GeneMark-ET will be a valuable addition to the narrow arsenal of automatic gene prediction tools.
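
The accuracy figure quoted above is the mean of gene-level Sensitivity and Specificity. As a minimal sketch of how such a mean could be computed, the Python below assumes the convention common in gene-finding evaluation, where Sensitivity = TP / (TP + FN) and Specificity = TP / (TP + FP); the summary does not spell out its definitions, so this convention is an assumption. The function names and the counts in the usage note are hypothetical and are not results from the paper.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of annotated genes that were predicted exactly (assumed definition)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tp: int, fp: int) -> float:
    """Fraction of predicted genes that match an annotated gene (assumed definition).

    Note: in the gene-finding convention assumed here this is TP / (TP + FP),
    which other fields would call precision.
    """
    return tp / (tp + fp) if (tp + fp) else 0.0


def mean_sn_sp(tp: int, fp: int, fn: int) -> float:
    """Arithmetic mean of gene-level Sensitivity and Specificity."""
    return (sensitivity(tp, fn) + specificity(tp, fp)) / 2.0
```

For example, with made-up counts of 80 correctly predicted genes, 20 false predictions and 20 missed genes, `mean_sn_sp(80, 20, 20)` averages a Sensitivity of 0.8 and a Specificity of 0.8.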


Affiliation: Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering, Atlanta, GA, USA 30332.


Figure 1: Dot plot of the average lengths of exons, introns and intergenic regions against the percentage of non-coding DNA in a given genome, made for the five insect genomes used in the GeneMark-ET tests as well as for several other eukaryotic species. The average lengths of introns and intergenic regions correlate with genome length, while the average length of protein-coding exons (CDS) shows no dependence on genome size.

Mentions: In what follows we show how an unsupervised training procedure can use spliced alignments of ‘unassembled’ RNA-Seq reads (rather than assembled transcripts) to improve the accuracy of parameter estimation and gene prediction. The key point in combining two independent methods, ab initio gene prediction and RNA-Seq read mapping, is the introduction of the notion of ‘anchor splice sites’: sites supported both by ab initio gene prediction and by RNA-Seq read alignment. In contrast to existing training methods, which rely on training sets consisting of complete or almost complete gene structures, the new algorithm uses sets of gene elements, exons and introns, supported by anchor splice sites, to form reliable training sets in the iterative cycles of model re-training. The new algorithm was tested on genomic and transcriptomic data of several insect species, Aedes aegypti, Anopheles gambiae, Anopheles stephensi, Culex quinquefasciatus and Drosophila melanogaster, which vary significantly in the average size of introns and intergenic regions (Figure 1). We have demonstrated that RNA-Seq support in training boosts GeneMark-ET performance in gene prediction to a higher level than that of GeneMark-ES, which uses purely unsupervised training. The parameter estimation procedure showed robust performance with respect to variations in the size of the set of mapped introns, repeat content and fragmentation of the genome assembly. The new method, used stand-alone or as part of a pipeline, should streamline and accelerate the annotation process in large genomes while improving the accuracy of gene identification.
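
The notion of anchor splice sites described above can be illustrated with a small sketch. The Python below is not GeneMark-ET code; it only shows, under stated assumptions, how splice sites supported by both sources could be found as a set intersection. The `Intron` record, the function names and the coordinate convention (1-based, inclusive) are all hypothetical choices made for illustration.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Intron(NamedTuple):
    """Hypothetical intron record; coordinates are 1-based and inclusive."""
    seqid: str
    start: int
    end: int
    strand: str  # "+" or "-"


# (seqid, position, strand, "donor" or "acceptor")
Site = Tuple[str, int, str, str]


def splice_sites(introns: Iterable[Intron]) -> Set[Site]:
    """Collect the donor and acceptor sites that bound the given introns."""
    sites: Set[Site] = set()
    for intron in introns:
        # On the minus strand the donor lies at the higher coordinate.
        if intron.strand == "+":
            donor, acceptor = intron.start, intron.end
        else:
            donor, acceptor = intron.end, intron.start
        sites.add((intron.seqid, donor, intron.strand, "donor"))
        sites.add((intron.seqid, acceptor, intron.strand, "acceptor"))
    return sites


def anchor_splice_sites(predicted: Iterable[Intron],
                        mapped: Iterable[Intron]) -> Set[Site]:
    """Sites supported by both ab initio prediction and RNA-Seq read alignment."""
    return splice_sites(predicted) & splice_sites(mapped)


def anchored_introns(predicted: List[Intron],
                     mapped: List[Intron]) -> List[Intron]:
    """Predicted introns whose donor and acceptor are both anchor sites."""
    anchors = anchor_splice_sites(predicted, mapped)
    return [i for i in predicted if splice_sites([i]) <= anchors]


# Toy data, invented for illustration only.
predicted = [Intron("chr1", 101, 200, "+"),
             Intron("chr1", 301, 400, "+"),
             Intron("chr1", 501, 650, "-")]
mapped = [Intron("chr1", 101, 200, "+"),
          Intron("chr1", 301, 420, "+"),
          Intron("chr1", 501, 650, "-")]

anchors = anchor_splice_sites(predicted, mapped)
training_introns = anchored_introns(predicted, mapped)
```

With this toy input, five splice sites are anchors and two of the three predicted introns have both ends anchored; the second predicted intron shares only its donor site with a mapped intron. Note that this simplified helper checks each end independently, so an intron could qualify even if its two ends are supported by different mapped introns; whether that is desirable is a design choice the summary does not address.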

