The level of residual dispersion variation and the power of differential expression tests for RNA-Seq data.

Mi G, Di Y - PLoS ONE (2015)

Bottom Line: RNA-Sequencing (RNA-Seq) has been widely adopted for quantifying gene expression changes in comparative transcriptome analysis. Presumably, dispersion models with fewer parameters will result in greater power if the models are correct, but will produce misleading conclusions if not. We propose a simple statistic to quantify the level of residual dispersion variation from a fitted dispersion model and show that the magnitude of this statistic gives hints about whether and how much we can gain statistical power by a dispersion-modeling approach.


Affiliation: Department of Statistics, Oregon State University, Corvallis, Oregon, United States of America.

ABSTRACT
RNA-Sequencing (RNA-Seq) has been widely adopted for quantifying gene expression changes in comparative transcriptome analysis. For detecting differentially expressed genes, a variety of statistical methods based on the negative binomial (NB) distribution have been proposed. These methods differ in the ways they handle the NB nuisance parameters (i.e., the dispersion parameters associated with each gene) to save power, such as by using a dispersion model to exploit an apparent relationship between the dispersion parameter and the NB mean. Presumably, dispersion models with fewer parameters will result in greater power if the models are correct, but will produce misleading conclusions if not. This paper investigates this power and robustness trade-off by assessing rates of identifying true differential expression using the various methods under realistic assumptions about NB dispersion parameters. Our results indicate that the relative performances of the different methods are closely related to the level of dispersion variation unexplained by the dispersion model. We propose a simple statistic to quantify the level of residual dispersion variation from a fitted dispersion model and show that the magnitude of this statistic gives hints about whether and how much we can gain statistical power by a dispersion-modeling approach.
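
For orientation, the NB setup the abstract refers to can be sketched as follows. The mean-variance relation is the standard NB parameterization with a dispersion parameter. The trend function f, the gene-level residual term and its standard deviation σ are notation assumed here for illustration; the abstract does not state them.

```latex
\begin{align*}
Y_{ij} &\sim \mathrm{NB}(\mu_{ij}, \phi_{ij}), &
\operatorname{Var}(Y_{ij}) &= \mu_{ij} + \phi_{ij}\,\mu_{ij}^{2}, \\
\log \phi_{ij} &= f(\mu_{ij}) + \varepsilon_{i}, &
\varepsilon_{i} &\sim \mathcal{N}(0, \sigma^{2}).
\end{align*}
```

Here i indexes genes and j indexes samples. Under this assumed form, σ = 0 means the dispersion model explains the dispersions exactly, and a larger σ corresponds to more residual dispersion variation left unexplained by the model.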



pone.0120117.g008: MA plots for the edgeR:trended, NBPSeq:genewise, edgeR:tagwise-trend and QuasiSeq:QLSpline methods performed on the mouse dataset. Predictive log fold changes (posterior Bayesian estimators of the true log fold changes, the “M” values) are shown on the y-axis. Averages of log counts per million (CPM) are shown on the x-axis (the “A” values). The M- and A-values are calculated using edgeR. The highlighted points correspond to the top 200 DE genes identified by each of the DE test methods.
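
To make the axes concrete, here is a minimal Python sketch of how M- and A-values for an MA plot could be computed from raw counts. It is a simplification and not edgeR's procedure: the figure uses edgeR's predictive log fold changes, whereas this sketch only adds a small prior count before taking logs. The function names `ma_values` and `top_k`, the `prior_count` default, and the genes-by-samples input layout are all assumptions made for illustration.

```python
import numpy as np


def ma_values(counts_a, counts_b, prior_count=0.5):
    """Return (M, A) per gene for two groups of samples.

    counts_a, counts_b: genes-by-samples arrays of raw read counts.
    Simplified illustration only; this is NOT edgeR's predictive
    log fold change or its average log CPM.
    """
    def mean_log_cpm(counts):
        counts = np.asarray(counts, dtype=float)
        lib_sizes = counts.sum(axis=0)                    # total reads per sample
        cpm = (counts + prior_count) / lib_sizes * 1e6    # counts per million
        return np.log2(cpm).mean(axis=1)                  # average over samples

    log_a = mean_log_cpm(counts_a)
    log_b = mean_log_cpm(counts_b)
    m = log_b - log_a            # log2 fold change (y-axis)
    a = (log_a + log_b) / 2.0    # average log2 CPM (x-axis)
    return m, a


def top_k(p_values, k=200):
    """Indices of the k genes with the smallest p-values (points to highlight)."""
    return np.argsort(np.asarray(p_values))[:k]
```

A plot in the style of the figure would then scatter `a` against `m` and highlight the indices returned by `top_k` for each test method's p-values.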

Mentions: One notable difference between the NBPSeq:genewise method and a dispersion-modeling method is that the former detects more DE genes with small fold changes, whereas a method using a dispersion model tends to detect DE genes with large fold changes. This phenomenon agrees with what we observed in the power simulation when the DE fold change was fixed at a low value, 1.2. Fig. 8 illustrates this point using MA plots. The pattern arises because current dispersion models often assume that the dispersion is the same for genes with similar mean levels (genes having the same x-values in the MA plots). Under such an assumption, larger fold changes tend to correspond to more significant test results. The behaviors of the edgeR:tagwise-trend and QuasiSeq:QLSpline methods are intermediate between those of the NBPSeq:genewise method and of a dispersion-modeling method such as edgeR:trended.
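
The reasoning in this paragraph can be illustrated with a rough Python sketch. It uses a Wald-type z statistic for the log fold change, with a delta-method variance derived from the NB variance μ + φμ². This is a stand-in chosen for simplicity and is not the test used by any of the methods named above; the function name `wald_z` and all numeric values are hypothetical.

```python
import math


def wald_z(mean_a, mean_b, dispersion, n_per_group):
    """Approximate z statistic for log(mean_b / mean_a) under an NB model.

    Delta method: Var(log of a group's sample mean) is roughly
    (1/mu + phi) / n when Var(Y) = mu + phi * mu**2.
    Illustration only; not the test used by edgeR, NBPSeq or QuasiSeq.
    """
    var_log_a = (1.0 / mean_a + dispersion) / n_per_group
    var_log_b = (1.0 / mean_b + dispersion) / n_per_group
    return math.log(mean_b / mean_a) / math.sqrt(var_log_a + var_log_b)


# Two hypothetical genes with the same baseline mean, three replicates per group.
n = 3
small_fc = (100.0, 120.0)   # fold change 1.2
large_fc = (100.0, 200.0)   # fold change 2.0

# Shared dispersion (what a trend-only model assumes at equal mean level):
# the ranking is driven by the size of the fold change.
z_small_shared = wald_z(*small_fc, dispersion=0.1, n_per_group=n)
z_large_shared = wald_z(*large_fc, dispersion=0.1, n_per_group=n)

# Gene-specific dispersions: a low-dispersion gene with a small fold change
# can outrank a high-dispersion gene with a large fold change.
z_small_genewise = wald_z(*small_fc, dispersion=0.001, n_per_group=n)
z_large_genewise = wald_z(*large_fc, dispersion=0.5, n_per_group=n)

print(z_small_shared < z_large_shared)
print(z_small_genewise > z_large_genewise)
```

With a shared dispersion, the two genes differ almost only in the numerator, so the larger fold change ranks higher. Once each gene carries its own dispersion, the denominator can reverse that order, which is consistent with a genewise method picking up more small-fold-change genes.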

