Exploiting single-molecule transcript sequencing for eukaryotic gene prediction.

Minoche AE, Dohm JC, Schneider J, Holtgräwe D, Viehöver P, Montfort M, Sörensen TR, Weisshaar B, Himmelbauer H - Genome Biol. (2015)

Bottom Line: Optimized training and prediction settings and mRNA-seq noise reduction of assisting Illumina reads result in increased gene prediction sensitivity and precision. Additionally, we present an improved gene set for sugar beet (Beta vulgaris) and the first genome-wide gene set for spinach (Spinacia oleracea). The workflow and guidelines are a valuable resource to obtain comprehensive gene sets for newly sequenced genomes of non-model eukaryotes.


Affiliation: Max Planck Institute for Molecular Genetics, Berlin, Germany.

ABSTRACT
We develop a method to predict and validate gene models using PacBio single-molecule, real-time (SMRT) cDNA reads. Ninety-eight percent of full-insert SMRT reads span complete open reading frames. Gene model validation using SMRT reads is developed as an automated process. Optimized training and prediction settings and mRNA-seq noise reduction of assisting Illumina reads result in increased gene prediction sensitivity and precision. Additionally, we present an improved gene set for sugar beet (Beta vulgaris) and the first genome-wide gene set for spinach (Spinacia oleracea). The workflow and guidelines are a valuable resource to obtain comprehensive gene sets for newly sequenced genomes of non-model eukaryotes.


Fig. 5: mRNA-seq coverage of sugar beet genes. Each dot represents one sugar beet gene. x-axis: mRNA-seq coverage as in the annotation based on the RefBeet-1.1 assembly; y-axis: mRNA-seq coverage for BeetSet-2 genes. The mRNA-seq data used in the RefBeet-1.1 annotation consisted chiefly of Illumina reads from genotype KWS2320, plus reads from other accessions (total amount: 616.3 million reads). The mRNA-seq data used to generate BeetSet-2 included KWS2320 reads plus isogenic reads from plants grown under stress conditions and their controls (total amount: 923.8 million reads). The overall mRNA-seq coverage increased in BeetSet-2, which improved the prediction of lowly expressed genes.
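
The per-gene comparison plotted in Fig. 5 can be illustrated with a minimal Python sketch. It assumes that coverage per gene is already available as two dictionaries keyed by a shared gene ID; the function name, the dictionary layout, and the toy values are hypothetical and are not taken from the paper. The default thresholds (coverage below 200x in the older gene set, at least two-fold increase) follow the criterion stated in the text accompanying the figure.

```python
from typing import Dict, List


def genes_with_coverage_gain(
    old_cov: Dict[str, float],
    new_cov: Dict[str, float],
    max_old_cov: float = 200.0,
    min_fold: float = 2.0,
) -> List[str]:
    """Return sorted IDs of genes whose old coverage is below max_old_cov
    and whose new coverage is at least min_fold times the old coverage."""
    selected = []
    for gene_id, old in old_cov.items():
        new = new_cov.get(gene_id)
        if new is None:
            continue  # gene has no counterpart in the new gene set
        # Require new > 0 so that a gene with zero coverage in both
        # sets is not counted as having gained coverage.
        if old < max_old_cov and new > 0 and new >= min_fold * old:
            selected.append(gene_id)
    return sorted(selected)


# Toy, made-up coverage values (not data from the paper).
old_cov = {"g1": 10.0, "g2": 150.0, "g3": 500.0, "g4": 40.0, "g5": 0.0}
new_cov = {"g1": 25.0, "g2": 200.0, "g3": 1500.0, "g4": 80.0, "g5": 0.0}
gained = genes_with_coverage_gain(old_cov, new_cov)
print(gained)
```

In this toy input, g3 is excluded because its old coverage is already above the threshold, and g2 because its coverage did not double. How genes were matched between the two annotations is not described here, so the shared gene ID is an assumption of the sketch.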

Mentions: We combined 396.9 million Illumina mRNA-seq reads previously used for sugar beet gene prediction [13] with 526.9 million newly generated reads from sugar beet plants grown under abiotic stress conditions (treated with heat, salt, or high light intensity) and from their untreated controls. All reads were derived from the reference genotype KWS2320. This dataset of almost 1 billion quality-filtered mRNA-seq reads (923.8 million in total; Table 6) led to increased evidence levels for a large number of genes with low or intermediate levels of expression: 8,201 genes with an average mRNA-seq read coverage below 200x in the published gene set [13] increased their coverage at least two-fold (Fig. 5). However, adding more expression evidence resulted in a higher level of background noise, due to interference from, for example, rare isoforms or incompletely spliced mRNAs, which affected the prediction accuracy (Table 5). We reduced the noise by applying coverage filters (see Methods for details) to facilitate the correct prediction of the most abundant isoform per locus, being aware that low-abundance isoforms might be lost in this way. The noise reduction improved the sensitivity from 76.4 % to 84.7 %. We further increased the bonus factor for intron hints, and increased the malus factor for predictions that did not coincide with intron hints. In combination with these improved settings, repeat hint masking performed slightly better than genome masking. Using pre-assembled mRNA-seq reads as additional EST hints did not increase the sensitivity. SMRT full-insert sequences used as additional EST hints improved the prediction result only slightly, due to the shallow coverage of such reads. Increasing their weight by conversion to ‘anchors’ increased the sensitivity to 91 %.
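
For readers unfamiliar with the metrics quoted above, the following Python sketch shows how gene-level sensitivity and precision are conventionally computed. It assumes the usual definitions (sensitivity = TP / (TP + FN), precision = TP / (TP + FP)) and assumes that a prediction counts as correct only if it matches a reference gene model exactly. The paper's own evaluation criteria are given in its Methods and may differ; the type alias, function name, and toy gene models are hypothetical.

```python
from typing import Set, Tuple

# A gene model is represented here as (sequence ID, tuple of exon
# (start, end) pairs). This representation is an assumption of the sketch.
GeneModel = Tuple[str, Tuple[Tuple[int, int], ...]]


def sensitivity_precision(
    predicted: Set[GeneModel], reference: Set[GeneModel]
) -> Tuple[float, float]:
    """Return (sensitivity, precision) using exact-match true positives."""
    true_pos = len(predicted & reference)
    sensitivity = true_pos / len(reference) if reference else 0.0
    precision = true_pos / len(predicted) if predicted else 0.0
    return sensitivity, precision


# Toy, made-up gene models (not data from the paper).
reference = {
    ("chr1", ((100, 200), (300, 400))),
    ("chr1", ((1000, 1500),)),
    ("chr2", ((50, 150), (250, 350))),
    ("chr2", ((900, 1200),)),
}
predicted = {
    ("chr1", ((100, 200), (300, 400))),  # exact match
    ("chr1", ((1000, 1500),)),           # exact match
    ("chr2", ((50, 150), (250, 350))),   # exact match
    ("chr2", ((900, 1100),)),            # wrong exon end: no match
    ("chr3", ((10, 90),)),               # not in the reference set
}
sens, prec = sensitivity_precision(predicted, reference)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

Under this exact-match convention, a prediction with a single wrong exon boundary counts as both a missed reference gene and a false prediction, which is why noise from rare or incompletely spliced isoforms can lower both values.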

