Exploiting single-molecule transcript sequencing for eukaryotic gene prediction.

Minoche AE, Dohm JC, Schneider J, Holtgräwe D, Viehöver P, Montfort M, Sörensen TR, Weisshaar B, Himmelbauer H - Genome Biol. (2015)

Bottom Line: Optimized training and prediction settings and mRNA-seq noise reduction of assisting Illumina reads result in increased gene prediction sensitivity and precision. Additionally, we present an improved gene set for sugar beet (Beta vulgaris) and the first genome-wide gene set for spinach (Spinacia oleracea). The workflow and guidelines are a valuable resource to obtain comprehensive gene sets for newly sequenced genomes of non-model eukaryotes.


Affiliation: Max Planck Institute for Molecular Genetics, Berlin, Germany.

ABSTRACT
We develop a method to predict and validate gene models using PacBio single-molecule, real-time (SMRT) cDNA reads. Ninety-eight percent of full-insert SMRT reads span complete open reading frames. Gene model validation using SMRT reads is developed as an automated process. Optimized training and prediction settings and mRNA-seq noise reduction of assisting Illumina reads result in increased gene prediction sensitivity and precision. Additionally, we present an improved gene set for sugar beet (Beta vulgaris) and the first genome-wide gene set for spinach (Spinacia oleracea). The workflow and guidelines are a valuable resource to obtain comprehensive gene sets for newly sequenced genomes of non-model eukaryotes.



Fig. 2: Transcript length distribution. a Length distribution of 29,831 transcript models supported by evidence previously annotated in the RefBeet-1.1 assembly [13]. b Length distribution of SMRT CCS representing full-length transcripts. c Length distribution of transcripts annotated in RefBeet-1.1 that were matched by CCS representing full-length transcripts.

Mentions: We generated large sugar beet cDNA fragments of the reference genotype KWS2320 by using the ‘SMART’ approach [17, 18], which favors the reverse transcription of intact, full-length RNA molecules. To sample long and short transcripts equally, the cDNA was size-selected in fractions of lengths 1-2 kb, 2-3 kb, and >3 kb. Using Pacific Biosciences’ SMRT sequencing technology, 395,038 cDNA sequencing reads were generated, each consisting of one or more ‘subreads’, which represent the same circularized cDNA template (Fig. 1). A total of 1.1 million subreads were merged into 78,965 circular consensus sequences (CCS), and 626,871 subreads remained unmerged. For 56,546 CCS and 53,374 unmerged subreads we identified the RNA poly(A) tail as well as the SMART cDNA 5' and 3' primers, which are distinct from the PacBio SMRT sequencing adapter. These sequences are referred to as full-insert SMRT reads. Full-length open reading frames (ORFs) could be identified in 98 % of all full-insert SMRT reads by comparison with sugar beet genes that were found to be complete in multiple alignments containing gene sequences from four additional eudicot plant species. The remaining 2 % of cases may be explained by internal priming of short oligo(A) stretches within the coding region. Among the subreads that could not be merged into CCS, a substantial portion (35.8 %) still contained complete ORFs. A general uncertainty remains whether full-ORF sequences also contain a gene’s entire 5' UTR. In line with the expectation that shorter cDNA fragments are more likely to be sequenced full length, the 1-2 kb fraction had the highest percentage of sequences containing both primers (92.2 % of CCS) and the highest percentage of sequences comprising full-length ORFs (94.5 % of CCS, Tables 1 and 2). The length distribution of SMRT read data suggested a genuine representation of expressed sugar beet genes (Fig. 2).
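The full-insert criterion described above (both cDNA primers plus a poly(A) tail in the same read) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The primer sequences are made-up placeholders, and the exact-match search, the 50-base search window, and the 15-base poly(A) minimum are all assumptions chosen for illustration. A real workflow would presumably tolerate sequencing errors by using alignment rather than exact string matching.

```python
import re

# Hypothetical placeholder primers, NOT the primers used in the study.
FIVE_PRIME = "GGATCCTTGACA"    # assumed 5' cDNA primer, as seen at the read start
THREE_PRIME = "CTGAGTCCATGG"   # assumed 3' cDNA primer, as seen at the read end
MIN_POLYA = 15                 # assumed minimum poly(A) length
WINDOW = 50                    # assumed number of terminal bases searched for primers

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def _is_full_insert_sense(seq):
    """Check one orientation: 5' primer ... insert ... poly(A) + 3' primer."""
    head = seq[:WINDOW]
    tail = seq[-WINDOW:]
    if head.find(FIVE_PRIME) == -1:
        return False
    three_in_tail = tail.rfind(THREE_PRIME)
    if three_in_tail == -1:
        return False
    # Position of the 3' primer in the full read.
    three_start = len(seq) - len(tail) + three_in_tail
    # The poly(A) tail must end directly upstream of the 3' primer.
    upstream = seq[:three_start]
    return re.search(r"A{%d,}$" % MIN_POLYA, upstream) is not None


def is_full_insert(read):
    """True if the read, in either orientation, has both primers and a poly(A) tail."""
    seq = read.upper()
    return _is_full_insert_sense(seq) or _is_full_insert_sense(reverse_complement(seq))


# Toy reads built from the placeholder primers above.
complete = FIVE_PRIME + "ATGGCCGATTAA" + "A" * 20 + THREE_PRIME
no_polya = FIVE_PRIME + "ATGGCCGATTAA" + THREE_PRIME
print(is_full_insert(complete), is_full_insert(no_polya))
```

Checking both orientations reflects the fact that a sequenced cDNA molecule can be read from either strand. Reads that pass such a filter would correspond to the "full-insert SMRT reads" counted in the text.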

