Estimation of alternative splicing isoform frequencies from RNA-Seq data.

Nicolae M, Mangul S, Măndoiu II, Zelikovsky A - Algorithms Mol Biol (2011)

Bottom Line: Massively parallel whole transcriptome sequencing, commonly referred to as RNA-Seq, is quickly becoming the technology of choice for gene expression profiling. However, due to the short read length delivered by current sequencing technologies, estimation of expression levels for alternative splicing gene isoforms remains challenging. Simulation experiments confirm previous findings that, for a fixed sequencing cost, using reads longer than 25-36 bases does not necessarily lead to better accuracy for estimating expression levels of annotated isoforms and genes.


Affiliation: Department of Computer Science & Engineering, University of Connecticut, 371 Fairfield Rd., Unit 2155, Storrs, CT 06269-2155, USA. man09004@engr.uconn.edu.

ABSTRACT

Background: Massively parallel whole transcriptome sequencing, commonly referred to as RNA-Seq, is quickly becoming the technology of choice for gene expression profiling. However, due to the short read length delivered by current sequencing technologies, estimation of expression levels for alternative splicing gene isoforms remains challenging.

Results: In this paper we present a novel expectation-maximization algorithm for inference of isoform- and gene-specific expression levels from RNA-Seq data. Our algorithm, referred to as IsoEM, is based on disambiguating information provided by the distribution of insert sizes generated during sequencing library preparation, and takes advantage of base quality scores, strand and read pairing information when available. The open source Java implementation of IsoEM is freely available at http://dna.engr.uconn.edu/software/IsoEM/.
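
To make the expectation-maximization idea concrete, the following is a minimal Python sketch of a generic EM loop for isoform frequencies. It is not the IsoEM implementation (which is written in Java), and it makes simplifying assumptions: each read arrives with a precomputed compatibility weight per isoform (in a full method such weights could reflect insert sizes, base qualities, strand, and pairing), and each isoform has a known effective length. The function name `em_isoform_frequencies` and its parameters are hypothetical.

```python
def em_isoform_frequencies(read_weights, isoform_lengths, n_iter=200, tol=1e-9):
    """Generic EM sketch for isoform frequency estimation (not IsoEM itself).

    read_weights: list with one dict per read, mapping isoform id -> weight,
        where the weight expresses how compatible the read is with that isoform.
    isoform_lengths: dict mapping isoform id -> effective length (assumed > 0).
    Returns a dict mapping isoform id -> estimated frequency (sums to 1).
    """
    isoforms = list(isoform_lengths)
    # Start from a uniform guess.
    freq = {j: 1.0 / len(isoforms) for j in isoforms}

    for _ in range(n_iter):
        # E-step: split each read fractionally among its compatible isoforms,
        # in proportion to weight * current frequency.
        counts = dict.fromkeys(isoforms, 0.0)
        for weights in read_weights:
            total = sum(w * freq[j] for j, w in weights.items())
            if total == 0.0:
                continue  # read is incompatible with every isoform
            for j, w in weights.items():
                counts[j] += w * freq[j] / total

        # M-step: expected reads per base, renormalized to frequencies.
        per_base = {j: counts[j] / isoform_lengths[j] for j in isoforms}
        norm = sum(per_base.values())
        if norm == 0.0:
            return freq  # no usable reads; keep the current estimate
        new_freq = {j: per_base[j] / norm for j in isoforms}

        delta = max(abs(new_freq[j] - freq[j]) for j in isoforms)
        freq = new_freq
        if delta < tol:
            break
    return freq


# Toy input: 30 reads unique to isoform A, 10 unique to B, 20 ambiguous.
toy_reads = [{"A": 1.0}] * 30 + [{"B": 1.0}] * 10 + [{"A": 1.0, "B": 1.0}] * 20
toy_freq = em_isoform_frequencies(toy_reads, {"A": 1000, "B": 1000})
```

In this toy case the ambiguous reads are apportioned according to the evidence from the unique reads, so the estimate settles near 0.75 for A and 0.25 for B.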

Conclusions: Empirical experiments on both synthetic and real RNA-Seq datasets show that IsoEM has scalable running time and outperforms existing methods of isoform and gene expression level estimation. Simulation experiments confirm previous findings that, for a fixed sequencing cost, using reads longer than 25-36 bases does not necessarily lead to better accuracy for estimating expression levels of annotated isoforms and genes.




Figure 10: IsoEM r² (a) and CPU time (b) for 1-60 million single/paired reads of length 75, with or without strand information.

Mentions: Although high-throughput technologies allow users to trade off read length against the number of generated reads, very little has been done to determine optimal parameters, even for common applications such as RNA-Seq. The intuition that longer reads are better certainly holds for many applications, such as de novo genome and transcriptome assembly. Surprisingly, [13] found that shorter reads are better for isoform expression (IE) estimation when the total number of sequenced bases (a rough approximation of sequencing cost) is fixed. Figure 9 plots IE estimation accuracy for read lengths between 10 and 100 when the total amount of sequence data is kept constant at 750 M bases. Our results confirm the finding of [13], although the optimal read length is somewhat sensitive to the accuracy measure used and to the availability of pairing information. While 25 bp reads minimize the median percent error (MPE) regardless of the availability of paired reads, the read length that maximizes r² is 25 for paired reads and 50 for single reads. Although further experiments are needed to determine how the optimum length depends on the amount of sequence data and transcriptome complexity, our simulations suggest that, for isoform and gene expression analysis, increasing the number of reads may be more useful than increasing read length beyond 50 bases.

Figure 10(a) shows, for reads of length 75, the effects of paired reads and strand information on estimation accuracy as measured by r². Not surprisingly, for a fixed number of reads, paired reads yield better accuracy than single reads. Also not very surprisingly, adding strand information to paired sequencing yields no benefit to genome-wide IE accuracy (although it may be helpful, e.g., in the identification of novel transcripts). Quite surprisingly, performing strand-specific single-read sequencing is actually detrimental to IsoEM IE (and hence gene expression, GE) accuracy under the simulated scenario, most likely due to the reduction in sampled transcript length.
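
The fixed-budget comparison above comes down to simple arithmetic: with 750 M bases in total, shorter reads mean proportionally more reads. The Python sketch below shows that arithmetic alongside two accuracy measures of the kind discussed. It rests on assumptions not spelled out in this excerpt: r² is taken to be the squared Pearson correlation between true and estimated values, and MPE is taken to be the median of |estimate - truth| / truth expressed as a percentage. All function names are hypothetical.

```python
import statistics

TOTAL_BASES = 750_000_000  # fixed sequencing budget quoted in the text


def reads_for_budget(read_length, total_bases=TOTAL_BASES):
    """Number of reads of a given length that fit in the base budget."""
    return total_bases // read_length


def r_squared(true_vals, est_vals):
    """Squared Pearson correlation (assumed meaning of r² here)."""
    n = len(true_vals)
    mean_t = sum(true_vals) / n
    mean_e = sum(est_vals) / n
    s_te = sum((t - mean_t) * (e - mean_e) for t, e in zip(true_vals, est_vals))
    s_tt = sum((t - mean_t) ** 2 for t in true_vals)
    s_ee = sum((e - mean_e) ** 2 for e in est_vals)
    if s_tt == 0 or s_ee == 0:
        raise ValueError("r² is undefined when either series is constant")
    return (s_te * s_te) / (s_tt * s_ee)


def median_percent_error(true_vals, est_vals):
    """Median relative error in percent (assumed meaning of MPE here).

    Entries whose true value is zero are skipped to avoid division by zero.
    """
    errors = [abs(e - t) / t * 100.0
              for t, e in zip(true_vals, est_vals) if t > 0]
    return statistics.median(errors)


# Read counts available at a few read lengths under the fixed budget.
budget_table = {length: reads_for_budget(length) for length in (25, 50, 75, 100)}
```

Under this budget, 25-base reads give 30 million reads, while 100-base reads give only 7.5 million, which is the tradeoff the simulations explore.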

