The level of residual dispersion variation and the power of differential expression tests for RNA-Seq data.

Mi G, Di Y - PLoS ONE (2015)

Bottom Line: RNA-Sequencing (RNA-Seq) has been widely adopted for quantifying gene expression changes in comparative transcriptome analysis. Presumably, dispersion models with fewer parameters will result in greater power if the models are correct, but will produce misleading conclusions if not. We propose a simple statistic to quantify the level of residual dispersion variation from a fitted dispersion model and show that the magnitude of this statistic gives hints about whether and how much we can gain statistical power by a dispersion-modeling approach.


Affiliation: Department of Statistics, Oregon State University, Corvallis, Oregon, United States of America.

ABSTRACT
RNA-Sequencing (RNA-Seq) has been widely adopted for quantifying gene expression changes in comparative transcriptome analysis. For detecting differentially expressed genes, a variety of statistical methods based on the negative binomial (NB) distribution have been proposed. These methods differ in the ways they handle the NB nuisance parameters (i.e., the dispersion parameters associated with each gene) to save power, such as by using a dispersion model to exploit an apparent relationship between the dispersion parameter and the NB mean. Presumably, dispersion models with fewer parameters will result in greater power if the models are correct, but will produce misleading conclusions if not. This paper investigates this power and robustness trade-off by assessing rates of identifying true differential expression using the various methods under realistic assumptions about NB dispersion parameters. Our results indicate that the relative performances of the different methods are closely related to the level of dispersion variation unexplained by the dispersion model. We propose a simple statistic to quantify the level of residual dispersion variation from a fitted dispersion model and show that the magnitude of this statistic gives hints about whether and how much we can gain statistical power by a dispersion-modeling approach.
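To make the idea of residual dispersion variation concrete, the following is a minimal sketch, not the authors' simulation code. It assumes the common NB parameterization with mean mu and variance mu + phi * mu^2, and it assumes a hypothetical log-linear dispersion trend, log(phi) = alpha0 + alpha1 * log(mu) + eps with eps ~ N(0, sigma^2). The names `alpha0`, `alpha1`, `sigma` and `simulate_nb_counts` are illustrative and do not come from the paper.

```python
import numpy as np

def simulate_nb_counts(mu, alpha0, alpha1, sigma, rng):
    """Draw NB counts whose log-dispersion follows a linear trend in log(mu)
    plus N(0, sigma^2) gene-specific noise (hypothetical trend and names)."""
    mu = np.asarray(mu, dtype=float)
    # Trend part of the dispersion plus residual variation of size sigma.
    log_phi = alpha0 + alpha1 * np.log(mu) + rng.normal(0.0, sigma, size=mu.shape)
    phi = np.exp(log_phi)
    # An NB with mean mu and variance mu + phi * mu^2 corresponds to numpy's
    # (n, p) parameterization with n = 1 / phi and p = n / (n + mu).
    n = 1.0 / phi
    p = n / (n + mu)
    return rng.negative_binomial(n, p), phi
```

With `sigma = 0` every gene sits exactly on the trend, which is the case where a dispersion model with few parameters is correct; larger `sigma` values mean more of the dispersion variation is left unexplained by the trend.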



True Positive Rate (TPR) vs. False Discovery Rate (FDR) plots for the six DE test methods performed on RNA-Seq datasets simulated to mimic real datasets. The fold changes of DE genes are fixed at 1.2 (half of the DE genes are over-expressed and the other half are under-expressed). Other simulation settings are identical to those described in Fig. 2 legend.

pone.0120117.g003: True Positive Rate (TPR) vs. False Discovery Rate (FDR) plots for the six DE test methods performed on RNA-Seq datasets simulated to mimic real datasets. The fold changes of DE genes are fixed at 1.2 (half of the DE genes are over-expressed and the other half are under-expressed). Other simulation settings are identical to those described in Fig. 2 legend.
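The fixed-fold-change setup in the caption can be sketched as follows. This is an illustrative assumption about how such an assignment might be coded, not the authors' procedure: the function name `assign_fixed_fold_changes`, the `de_fraction` argument, and the choice to model under-expression as a fold change of 1/1.2 are all hypothetical.

```python
import numpy as np

def assign_fixed_fold_changes(n_genes, de_fraction, fold_change, rng):
    """Return per-gene log fold changes: 0 for non-DE genes, +log(fold_change)
    for half of the DE genes and -log(fold_change) for the other half."""
    n_de = int(round(n_genes * de_fraction))
    # Pick which genes are DE, without replacement.
    de_idx = rng.choice(n_genes, size=n_de, replace=False)
    log_fc = np.zeros(n_genes)
    half = n_de // 2
    log_fc[de_idx[:half]] = np.log(fold_change)    # over-expressed half
    log_fc[de_idx[half:]] = -np.log(fold_change)   # under-expressed half
    return log_fc
```

Multiplying a gene's baseline mean by `np.exp(log_fc)` in one condition then gives the mean used to simulate its counts in that condition.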

Mentions: To understand how each method performs under a wide range of situations, we also performed simulations where the fold changes for DE genes were fixed instead of estimated from real data, while other settings (e.g., the percentage of DE genes, σ and ) remained the same as before. Figs. 3 and 4 show the TPR-FDR plots when the fold changes of DE genes were fixed at 1.2 (low) and 1.5 (moderate), respectively. In general, NBPSeq:genewise, edgeR:tagwise-trend and QuasiSeq:QLSpline perform better than edgeR:common, NBPSeq:NBQ and edgeR:trend, which is consistent with the observations when the fold changes are estimated from real data. In the low DE fold change case, when the residual dispersion variation is as estimated (upper row of Fig. 3), there is more separation between the QuasiSeq:QLSpline method and the edgeR:tagwise-trend method. In the simulation based on the mouse data, the NBPSeq:genewise method outperforms all other methods for finding the first 25% of truly DE genes (i.e., in the plot region where TPR ≤ 0.25), but it is eventually outperformed by QuasiSeq:QLSpline and edgeR:tagwise-trend if a greater percentage of truly DE genes needs to be detected. A similar trend is observed in simulations based on the zebrafish and fruit fly datasets. This indicates that the NBPSeq:genewise method can have an advantage for detecting DE genes with small fold changes. There is less separation between the QuasiSeq:QLSpline and edgeR:tagwise-trend methods when the DE fold changes are specified to be 1.5. Again, the performance of all methods assuming a dispersion model (i.e., all methods except NBPSeq:genewise) improves significantly when the residual dispersion variation is halved.
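A TPR-FDR curve of the kind discussed here can be computed from a method's p-values when the true DE status is known, as it is in a simulation. The sketch below is one plausible way to do this and is not taken from the paper: genes are ranked by p-value, and for each cutoff k the top k genes are treated as discoveries. The function name `tpr_fdr_curve` is hypothetical, and the "FDR" it returns is the empirical proportion of false discoveries among the top k genes.

```python
import numpy as np

def tpr_fdr_curve(p_values, is_de):
    """Rank genes by p-value and return (tpr, fdr) after each successive call."""
    p_values = np.asarray(p_values, dtype=float)
    is_de = np.asarray(is_de, dtype=bool)
    # Most significant genes first; stable sort keeps tie order deterministic.
    order = np.argsort(p_values, kind="stable")
    hits = is_de[order]
    tp = np.cumsum(hits)                 # true positives among the top k genes
    k = np.arange(1, len(hits) + 1)      # number of genes called so far
    tpr = tp / is_de.sum()               # fraction of truly DE genes recovered
    fdr = (k - tp) / k                   # fraction of calls that are false
    return tpr, fdr
```

Plotting `fdr` against `tpr` for each method on the same axes gives a comparison in which, at a given TPR, a lower curve indicates fewer false discoveries.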

