A Comparison of Methods for RNA-Seq Differential Expression Analysis and a New Empirical Bayes Approach.

Wesolowski S, Birtwistle MR, Rempala GA - Biosensors (Basel) (2013)

Bottom Line: Our analysis indicates that DESeq identifies the first half of the differentially expressed transcripts well, but then is outperformed by Cuffdiff and R-EBSeq. Cuffdiff and R-EBSeq are the two top performers. Thus, R-EBSeq offers good performance, while allowing flexible and rigorous comparison of multiple biological conditions.


Affiliation: Department of Mathematics, Florida State University, Tallahassee, FL 32306, USA; E-Mail: wesserg@gmail.com.

ABSTRACT
Transcriptome-based biosensors are expected to have a large impact on the future of biotechnology. A central aspect of transcriptomics is differential expression analysis, where, currently, deep RNA sequencing (RNA-seq) has the potential to replace the microarray as the standard assay for RNA quantification. Our contributions here to RNA-seq differential expression analysis are two-fold. First, given the high cost of an RNA-seq run, biological replicates are rare, and therefore, information sharing across genes to obtain variance estimates is crucial. To handle such information sharing in a rigorous manner, we propose a hierarchical, empirical Bayes approach (R-EBSeq) that combines the Cufflinks model for generating relative transcript abundance measurements, known as FPKM (fragments per kilobase of transcript length per million mapped reads), with the EBArrays framework, which was previously developed for empirical Bayes analysis of microarray data. A desirable feature of R-EBSeq is easy-to-implement analysis of more than pairwise comparisons, as we illustrate with experimental data. Secondly, we develop a standard RNA-seq test data set, on the level of reads, where 79 transcripts are artificially differentially expressed and, therefore, explicitly known. This test data set allows us to compare the performance, in terms of the true discovery rate, of R-EBSeq to three other widely used RNA-seq data analysis packages: Cuffdiff, DESeq and BaySeq. Our analysis indicates that DESeq identifies the first half of the differentially expressed transcripts well, but then is outperformed by Cuffdiff and R-EBSeq. Cuffdiff and R-EBSeq are the two top performers. Thus, R-EBSeq offers good performance, while allowing flexible and rigorous comparison of multiple biological conditions.
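
As a rough illustration of the FPKM unit named in the abstract, the sketch below applies the plain definition (fragments per kilobase of transcript length per million mapped fragments) to a single transcript. It is not the Cufflinks model, which estimates abundances with a statistical model rather than a direct count ratio; the function name `fpkm` and the example numbers are hypothetical.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Naive FPKM: fragments per kilobase of transcript per million mapped fragments.

    Hypothetical helper for illustration only; Cufflinks derives FPKM from a
    likelihood model, not from this direct ratio.
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragment_count / (kilobases * millions)


# Made-up example: 500 fragments on a 2,000 bp transcript in a library of
# 10 million mapped fragments.
example_fpkm = fpkm(500, 2_000, 10_000_000)
print(example_fpkm)
```

Normalizing by both transcript length and library size is what makes the values comparable across transcripts and across sequencing runs.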




Figure 4: Comparison of true discovery rates for various RNA-Seq differential expression testing methods. The test data sets were generated and various software suites implemented, as described in Methods. Each panel contains two plots; the plot on the right is a zoomed-in version of the plot on the left. On the y-axis is the number of correctly identified transcripts, and on the x-axis is the number of transcripts selected (in order of increasing p-value). DE stands for differentially expressed. In every plot, the thick black line corresponds to Cufflinks, the thick red line to R-EBSeq, the large-dashed black line to DESeq and the small-dashed black line to BaySeq. These lines are also labeled as indicated. The shaded region surrounding the R-EBSeq curve depicts the range of 20 independent runs. (A) Performance of the various methods with noise-free data; (B) Performance of the various methods with noisy data. Noise was added as described in Methods, and the data are overdispersed, as is typical for RNA-seq data.
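
The curve described in this caption can be sketched as follows: transcripts are sorted by increasing p-value, and for each number of selected transcripts the truly differentially expressed ones are counted. This is a minimal sketch of that bookkeeping, not the authors' code; the function name and the toy p-values and labels are hypothetical.

```python
def true_discovery_curve(p_values, is_truly_de):
    """Count correctly identified transcripts as more transcripts are selected.

    Element k-1 of the result is the number of truly DE transcripts among
    the k transcripts with the smallest p-values.
    """
    if len(p_values) != len(is_truly_de):
        raise ValueError("p_values and is_truly_de must have the same length")
    # Rank transcripts from most to least significant.
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    curve = []
    hits = 0
    for i in order:
        if is_truly_de[i]:
            hits += 1
        curve.append(hits)
    return curve


# Toy data (made up): five transcripts, two of which are truly DE.
p_values = [0.001, 0.20, 0.03, 0.50, 0.004]
truth = [True, False, True, False, False]
curve = true_discovery_curve(p_values, truth)
print(curve)
```

A perfect method would rise along the diagonal until every truly DE transcript is found and then stay flat; curves that fall below that line are spending selections on false discoveries.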

Mentions: To evaluate how R-EBSeq performs, we generated a test data set according to the underlying empirical Bayes model, imposed differential expression on a subset of transcripts from this data set and then calculated the performance of R-EBSeq in terms of true positive and false positive identifications (see Methods). Plots of the true positive rate against the false positive rate are called receiver operating characteristic (ROC) curves. An ROC curve along the x = y line implies a very poor algorithm that performs no better than random choice, whereas an ROC curve that rises high above the x = y line at low x values implies a very good algorithm. We investigated how three characteristics of a transcript affect the ROC curves: the difference of means between two conditions (Figure 4(A)), the variance of the transcript expression level (Figure 4(B)), and the number of replicates, M, used as input to the software (Figure 4(C)). In general, we see that R-EBSeq is capable of very good behavior in terms of the ROC curves. As expected, as we increase the difference of means between two conditions and/or decrease the variance, the ability of R-EBSeq to identify truly differentially expressed genes improves. Increasing the number of replicates, M, also improves the performance of R-EBSeq, likely because R-EBSeq is able to get a better estimate of a transcript’s variance.
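
To make the ROC description concrete, the sketch below builds ROC points by sweeping a threshold over per-transcript scores and then computes the area under the curve with the trapezoid rule. It assumes that a higher score means stronger evidence of differential expression (for example, a posterior probability) and treats tied scores as separate thresholds; the function names and toy data are hypothetical and not taken from the paper.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    Assumes a higher score means stronger evidence of differential
    expression; labels are 1 for truly DE and 0 otherwise.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    # Lower the threshold one transcript at a time, best score first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data (made up): six transcripts, three truly DE.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points[-1], auc)
```

An area near 0.5 corresponds to the x = y line (random choice), while an area near 1 corresponds to a curve that rises steeply at low false positive rates.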

