The level of residual dispersion variation and the power of differential expression tests for RNA-Seq data.

Mi G, Di Y - PLoS ONE (2015)

Bottom Line: RNA-Sequencing (RNA-Seq) has been widely adopted for quantifying gene expression changes in comparative transcriptome analysis. Presumably, dispersion models with fewer parameters will result in greater power if the models are correct, but will produce misleading conclusions if not. We propose a simple statistic to quantify the level of residual dispersion variation from a fitted dispersion model and show that the magnitude of this statistic gives hints about whether and how much we can gain statistical power by a dispersion-modeling approach.


Affiliation: Department of Statistics, Oregon State University, Corvallis, Oregon, United States of America.

ABSTRACT
RNA-Sequencing (RNA-Seq) has been widely adopted for quantifying gene expression changes in comparative transcriptome analysis. For detecting differentially expressed genes, a variety of statistical methods based on the negative binomial (NB) distribution have been proposed. These methods differ in the ways they handle the NB nuisance parameters (i.e., the dispersion parameters associated with each gene) to save power, such as by using a dispersion model to exploit an apparent relationship between the dispersion parameter and the NB mean. Presumably, dispersion models with fewer parameters will result in greater power if the models are correct, but will produce misleading conclusions if not. This paper investigates this power and robustness trade-off by assessing rates of identifying true differential expression using the various methods under realistic assumptions about NB dispersion parameters. Our results indicate that the relative performances of the different methods are closely related to the level of dispersion variation unexplained by the dispersion model. We propose a simple statistic to quantify the level of residual dispersion variation from a fitted dispersion model and show that the magnitude of this statistic gives hints about whether and how much we can gain statistical power by a dispersion-modeling approach.




pone.0120117.g010: The calibration plot for estimating residual dispersion variation $\sigma$ for the mouse dataset. The x-axis is the $\sigma$ value used to generate the data. The y-axis is the estimate $\hat{\sigma}$. The horizontal line corresponds to the $\hat{\sigma}$ estimated from the mouse dataset.

Mentions: As discussed in the “Results/Power-Robustness Evaluations/Simulation Setup” subsection, when simulating the RNA-Seq datasets, we want to choose a $\sigma$ that matches the level of residual dispersion variation in real data. To do so, we need to correct for potential bias in the estimator $\hat{\sigma}$ and to account for the discrepancy between $\pi_{ij}$ (used when simulating the data) and $\hat{\pi}_{ij}$ (used when fitting the dispersion model). This is achieved by a calibration approach [38]; the calibrated estimates of $\sigma$ are essentially read off a calibration plot. Fig. 10 shows the calibration plot for the mouse dataset (subsetted to 5,000 genes). We choose $\sigma$ at eight levels (0.5, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2 and 1.5) and simulate the dispersion $\phi_{ij}$ according to

$$\log(\phi_{ij}) = \log(\phi_{ij}^{\mathrm{NBQ}}) + \epsilon_i = \alpha_0 + \alpha_1 \log(\pi_{ij}) + \alpha_2 [\log(\pi_{ij})]^2 + \epsilon_i,$$

where $\epsilon_i \sim N(0, \sigma^2)$ is the residual variation on top of an NBQ dispersion model with the parameters $\alpha_0, \alpha_1, \alpha_2$ estimated from the mouse dataset. At each level of $\sigma$, we simulate three datasets and obtain three values of $\hat{\sigma}$. We then fit a quadratic curve to the eight median $\hat{\sigma}$ values as a function of $\sigma$, with a 95% prediction interval superimposed as dashed curves. The $\hat{\sigma}$ estimated from the mouse dataset is also calculated and shown as a horizontal solid line. The x-coordinate of the intersection of the fitted quadratic curve and the horizontal line (the solid red point) is the calibrated estimate of $\sigma$. Similarly, the intersections of the upper and lower bounds of the 95% prediction interval with the horizontal line determine the associated 95% calibration interval (CI) for the calibrated estimate. We include only the calibration plot for the mouse dataset as an illustration. Table 6 summarizes the calibrated estimates with 95% CIs for each of the five real datasets.
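The two computational steps described above, drawing dispersions from an NBQ trend with gene-level noise and inverting a quadratic calibration curve, can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the assumption that i indexes genes and j indexes samples, and the synthetic median values are all hypothetical, and the 95% prediction interval is omitted.

```python
import numpy as np


def simulate_dispersion(pi, alpha, sigma, rng):
    """Draw dispersions phi_ij from an NBQ trend plus gene-level noise.

    pi    : (genes, samples) array of pi_ij values (assumed layout)
    alpha : (alpha0, alpha1, alpha2), NBQ coefficients
    sigma : standard deviation of the residual term epsilon_i
    rng   : numpy.random.Generator
    """
    log_pi = np.log(pi)
    log_phi_nbq = alpha[0] + alpha[1] * log_pi + alpha[2] * log_pi ** 2
    # One epsilon_i per gene, shared by all samples of that gene.
    eps = rng.normal(0.0, sigma, size=(pi.shape[0], 1))
    return np.exp(log_phi_nbq + eps)


def calibrate_sigma(sigma_grid, median_sigma_hat, observed_sigma_hat):
    """Invert a quadratic calibration curve at the observed estimate.

    Fits median_sigma_hat ~ c0 + c1*sigma + c2*sigma^2 and returns the
    sigma inside the grid range where the curve equals observed_sigma_hat.
    If two roots fall inside the range, the smaller one is returned.
    """
    c2, c1, c0 = np.polyfit(sigma_grid, median_sigma_hat, deg=2)
    roots = np.roots([c2, c1, c0 - observed_sigma_hat])
    real = roots[np.isreal(roots)].real
    lo, hi = float(np.min(sigma_grid)), float(np.max(sigma_grid))
    inside = real[(real >= lo) & (real <= hi)]
    if inside.size == 0:
        raise ValueError("observed estimate does not cross the fitted curve")
    return float(inside.min())


# The eight sigma levels quoted in the text.
sigma_grid = np.array([0.5, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.5])

# Made-up median estimates lying on a known quadratic (illustration only;
# in practice these come from fitting the dispersion model to simulations).
demo_medians = 0.1 + 0.8 * sigma_grid + 0.05 * sigma_grid ** 2
calibrated = calibrate_sigma(sigma_grid, demo_medians, observed_sigma_hat=0.95)

# Hypothetical NBQ coefficients and a constant pi matrix for a quick draw.
rng = np.random.default_rng(0)
demo_pi = np.full((4, 3), 1e-4)
demo_phi = simulate_dispersion(demo_pi, (-1.0, 0.1, 0.01), sigma=0.9, rng=rng)
```

In an actual analysis, the medians would be obtained by fitting the dispersion model to the simulated datasets at each level of σ, and the same inversion would be applied to the prediction-interval bounds to obtain the calibration interval.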

