The level of residual dispersion variation and the power of differential expression tests for RNA-Seq data.

Mi G, Di Y - PLoS ONE (2015)

Bottom Line: RNA-Sequencing (RNA-Seq) has been widely adopted for quantifying gene expression changes in comparative transcriptome analysis. Presumably, dispersion models with fewer parameters will result in greater power if the models are correct, but will produce misleading conclusions if not. We propose a simple statistic to quantify the level of residual dispersion variation from a fitted dispersion model and show that the magnitude of this statistic gives hints about whether and how much we can gain statistical power by a dispersion-modeling approach.


Affiliation: Department of Statistics, Oregon State University, Corvallis, Oregon, United States of America.

ABSTRACT
RNA-Sequencing (RNA-Seq) has been widely adopted for quantifying gene expression changes in comparative transcriptome analysis. For detecting differentially expressed genes, a variety of statistical methods based on the negative binomial (NB) distribution have been proposed. These methods differ in the ways they handle the NB nuisance parameters (i.e., the dispersion parameters associated with each gene) to save power, such as by using a dispersion model to exploit an apparent relationship between the dispersion parameter and the NB mean. Presumably, dispersion models with fewer parameters will result in greater power if the models are correct, but will produce misleading conclusions if not. This paper investigates this power and robustness trade-off by assessing rates of identifying true differential expression using the various methods under realistic assumptions about NB dispersion parameters. Our results indicate that the relative performances of the different methods are closely related to the level of dispersion variation unexplained by the dispersion model. We propose a simple statistic to quantify the level of residual dispersion variation from a fitted dispersion model and show that the magnitude of this statistic gives hints about whether and how much we can gain statistical power by a dispersion-modeling approach.
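The idea of a dispersion model with unexplained (residual) variation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes the common NB parameterization with variance mu + phi * mu^2, a hypothetical log-linear trend of the dispersion on the mean (coefficients `a0`, `a1`), and normally distributed residuals on the log scale whose standard deviation `sigma` stands in for the "level of residual dispersion variation". All function and parameter names are made up for this example.

```python
import numpy as np

def nb_variance(mu, phi):
    """Variance of an NB variable with mean mu and dispersion phi
    (assumed parameterization: Var = mu + phi * mu^2)."""
    return mu + phi * mu ** 2

def simulate_dispersions(mu, a0, a1, sigma, rng):
    """Draw gene-wise dispersions around a hypothetical log-linear trend.

    log(phi) = a0 + a1 * log(mu) + eps,  eps ~ Normal(0, sigma^2).
    sigma = 0 means the trend explains the dispersions exactly;
    larger sigma means more residual dispersion variation.
    """
    log_trend = a0 + a1 * np.log(mu)
    eps = rng.normal(0.0, sigma, size=mu.shape)
    return np.exp(log_trend + eps)

def simulate_counts(mu, phi, n_samples, rng):
    """Simulate a genes x samples matrix of NB counts.

    NumPy parameterizes the NB by (n, p); n = 1/phi and p = n / (n + mu)
    give mean mu and variance mu + phi * mu^2.
    """
    n = 1.0 / phi
    p = n / (n + mu)
    return rng.negative_binomial(n[:, None], p[:, None],
                                 size=(mu.shape[0], n_samples))

# Illustrative values only (not taken from the paper).
rng = np.random.default_rng(1)
mu = np.array([5.0, 50.0, 500.0])
phi = simulate_dispersions(mu, a0=-1.0, a1=-0.5, sigma=0.5, rng=rng)
counts = simulate_counts(mu, phi, n_samples=6, rng=rng)
```

Setting `sigma` to zero reproduces the case where the dispersion model is exactly correct. Increasing it mimics the case where gene-wise dispersions scatter around the trend, which is the situation the abstract links to the relative performance of the methods.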


pone.0120117.g005: True Positive Rate (TPR) vs. False Discovery Rate (FDR) plots for the six DE test methods performed on an RNA-Seq dataset simulated to mimic the human dataset. On each curve, we marked the position corresponding to a reported FDR of 10% with a cross. The fold changes of DE genes are fixed at 1.2 (half of the DE genes are over-expressed and the other half are under-expressed). Other simulation settings are identical to those for the upper row of Fig. 2.


Mentions: Fig. 5 shows what happens if one uses the reported FDR to identify DE genes. We use one of the simulated human datasets as an example (the fold change is specified to be 1.2 for the designated 20% DE genes, and ), since the tests are well separated here. For methods that do not correctly control FDR, such as NBPSeq:NBQ and edgeR:trended, identifying DE genes by a cutoff on the reported FDR (e.g., 10%) detects more genes as DE than a cutoff on the actual FDR would, at the cost of an underestimated FDR.
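The gap between reported and actual FDR can be made concrete with a small sketch. In a simulation the true DE status of each gene is known, so the actual FDR and TPR among genes called at a reported-FDR cutoff can be computed directly. The code below assumes the reported FDR is a Benjamini-Hochberg adjusted p-value, which the excerpt does not state; the function names and the toy p-values are hypothetical.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (used here as the 'reported FDR';
    this choice is an assumption, not taken from the source)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards.
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.minimum(adj_sorted, 1.0)
    return adj

def actual_rates(reported_fdr, is_de, cutoff=0.10):
    """Actual FDR and TPR among genes called DE at a reported-FDR cutoff.

    is_de holds the true DE status, which is known only in simulations.
    """
    reported_fdr = np.asarray(reported_fdr, dtype=float)
    is_de = np.asarray(is_de, dtype=bool)
    called = reported_fdr <= cutoff
    n_called = int(called.sum())
    false_pos = int((called & ~is_de).sum())
    true_pos = int((called & is_de).sum())
    fdr = false_pos / n_called if n_called > 0 else 0.0
    tpr = true_pos / int(is_de.sum())
    return fdr, tpr

# Toy input (made-up numbers, for illustration only).
pvals = [0.01, 0.04, 0.03, 0.5]
truth = [True, False, True, False]
fdr, tpr = actual_rates(bh_adjust(pvals), truth, cutoff=0.10)
```

When a method's reported FDR is too optimistic, the actual FDR returned here exceeds the cutoff, which corresponds to a cross in the figure lying to the right of the nominal 10% level.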

