The level of residual dispersion variation and the power of differential expression tests for RNA-Seq data.

Mi G, Di Y - PLoS ONE (2015)

Bottom Line: RNA-Sequencing (RNA-Seq) has been widely adopted for quantifying gene expression changes in comparative transcriptome analysis. Presumably, dispersion models with fewer parameters will result in greater power if the models are correct, but will produce misleading conclusions if not. We propose a simple statistic to quantify the level of residual dispersion variation from a fitted dispersion model and show that the magnitude of this statistic gives hints about whether and how much we can gain statistical power by a dispersion-modeling approach.


Affiliation: Department of Statistics, Oregon State University, Corvallis, Oregon, United States of America.

ABSTRACT
RNA-Sequencing (RNA-Seq) has been widely adopted for quantifying gene expression changes in comparative transcriptome analysis. For detecting differentially expressed genes, a variety of statistical methods based on the negative binomial (NB) distribution have been proposed. These methods differ in the ways they handle the NB nuisance parameters (i.e., the dispersion parameters associated with each gene) to save power, such as by using a dispersion model to exploit an apparent relationship between the dispersion parameter and the NB mean. Presumably, dispersion models with fewer parameters will result in greater power if the models are correct, but will produce misleading conclusions if not. This paper investigates this power and robustness trade-off by assessing rates of identifying true differential expression using the various methods under realistic assumptions about NB dispersion parameters. Our results indicate that the relative performances of the different methods are closely related to the level of dispersion variation unexplained by the dispersion model. We propose a simple statistic to quantify the level of residual dispersion variation from a fitted dispersion model and show that the magnitude of this statistic gives hints about whether and how much we can gain statistical power by a dispersion-modeling approach.




pone.0120117.g009: Estimation accuracy of σ̂. In the simulation, the dispersion is simulated according to an NB2 (left panel) or an NBQ (right panel) trend with added individual variation εi ∼ 𝒩(0, σ²). The x-axis is the true σ value and the y-axis is the estimated σ̂. For each true σ value, the simulation is repeated three times. The blue dots correspond to the median σ̂ values.

To evaluate the estimation accuracy for σ, we perform a set of simulations using the human RNA-Seq dataset as the “template” in order to preserve the observed relationships between the dispersions and gene-specific mean counts. We simulate 5,000 genes with a single group of seven replicates: the mean structure μ is randomly generated according to a log-normal distribution with mean 8.5 and standard deviation 1.5 (both on the log scale, with values chosen to mimic the real dataset); the trend of the dispersion is estimated from the real dataset according to an NB2 or an NBQ model; individual residual variation εi is simulated according to 𝒩(0, σ²) and added to the trend. We compare the estimate σ̂ with the true σ, specified at eight levels that are within a reasonable range for typical RNA-Seq data: 0.1, 0.3, 0.5, 0.7, 0.9, 1.2, 1.5 and 2.0. At each level of σ we repeat the simulation three times using different random number seeds for generating εi ∼ 𝒩(0, σ²). Fig. 9 shows the simulation results. We highlight the median value (out of three repetitions) as a solid blue point at each σ level; ideally these points should follow the y = x reference line. We see that there is some bias in the estimation, and the bias will increase for smaller sample sizes. The estimate σ̂ is more accurate for σ values between 0.3 and 0.9 and less so for σ values outside this range. The results (not shown) are similar when we use the NBP (log-linear) and NBS (smooth function) models to capture the general trend in the dispersion.
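The data-generating step of this simulation can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' code. The function name `simulate_counts` is hypothetical. The dispersion trend is taken to be a constant (an NB2-style trend) with a placeholder value of 0.1, whereas the paper estimates the trend from the real dataset. The residual εi is assumed to be added on the log-dispersion scale. The estimator σ̂ itself is not reproduced here.

```python
import numpy as np

def simulate_counts(n_genes=5000, n_reps=7, sigma=0.5,
                    log_phi_trend=np.log(0.1), seed=0):
    """Simulate NB counts for one group of replicates.

    Assumptions (not taken from the paper): a constant log-dispersion
    trend `log_phi_trend`, and residuals added on the log scale.
    """
    rng = np.random.default_rng(seed)

    # Gene-wise means: log-normal with mean 8.5 and sd 1.5 on the log scale.
    mu = rng.lognormal(mean=8.5, sigma=1.5, size=n_genes)

    # Gene-wise residual dispersion variation, eps_i ~ N(0, sigma^2).
    eps = rng.normal(loc=0.0, scale=sigma, size=n_genes)
    phi = np.exp(log_phi_trend + eps)

    # NB with mean mu and variance mu + phi * mu^2 corresponds to
    # NumPy's (n, p) parameterization with n = 1/phi, p = n / (n + mu).
    size = 1.0 / phi
    p = size / (size + mu)
    counts = rng.negative_binomial(size[:, None], p[:, None],
                                   size=(n_genes, n_reps))
    return mu, phi, counts

# Eight sigma levels, three seeds each, as in the simulation design.
sigma_levels = [0.1, 0.3, 0.5, 0.7, 0.9, 1.2, 1.5, 2.0]
datasets = {
    (s, seed): simulate_counts(sigma=s, seed=seed)
    for s in sigma_levels
    for seed in range(3)
}
```

Each entry of `datasets` holds the true means, the true dispersions, and a 5,000 × 7 count matrix. A fitted dispersion model and an estimator of σ would then be applied to each count matrix and compared against the known σ.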

