The level of residual dispersion variation and the power of differential expression tests for RNA-Seq data.

Mi G, Di Y - PLoS ONE (2015)

Bottom Line: RNA-Sequencing (RNA-Seq) has been widely adopted for quantifying gene expression changes in comparative transcriptome analysis. Presumably, dispersion models with fewer parameters will result in greater power if the models are correct, but will produce misleading conclusions if not. We propose a simple statistic to quantify the level of residual dispersion variation from a fitted dispersion model and show that the magnitude of this statistic gives hints about whether and how much we can gain statistical power by a dispersion-modeling approach.


Affiliation: Department of Statistics, Oregon State University, Corvallis, Oregon, United States of America.

ABSTRACT
RNA-Sequencing (RNA-Seq) has been widely adopted for quantifying gene expression changes in comparative transcriptome analysis. For detecting differentially expressed genes, a variety of statistical methods based on the negative binomial (NB) distribution have been proposed. These methods differ in the ways they handle the NB nuisance parameters (i.e., the dispersion parameters associated with each gene) to save power, such as by using a dispersion model to exploit an apparent relationship between the dispersion parameter and the NB mean. Presumably, dispersion models with fewer parameters will result in greater power if the models are correct, but will produce misleading conclusions if not. This paper investigates this power and robustness trade-off by assessing rates of identifying true differential expression using the various methods under realistic assumptions about NB dispersion parameters. Our results indicate that the relative performances of the different methods are closely related to the level of dispersion variation unexplained by the dispersion model. We propose a simple statistic to quantify the level of residual dispersion variation from a fitted dispersion model and show that the magnitude of this statistic gives hints about whether and how much we can gain statistical power by a dispersion-modeling approach.
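The idea of "dispersion variation unexplained by the dispersion model" can be pictured with a small simulation. The sketch below is illustrative only: it assumes the common NB parameterization in which a count with mean μ and dispersion φ has variance μ + φμ², and it assumes a log-linear trend of the dispersion on the mean plus normally distributed residuals with standard deviation sigma. The trend form, the parameter values, and the function names (`simulate_dispersions`, `simulate_nb_counts`, `residual_sd`) are hypothetical and are not the authors' model or statistic.

```python
import numpy as np

def simulate_dispersions(mu, alpha0, alpha1, sigma, rng):
    # Assumed trend: log(phi) = alpha0 + alpha1 * log(mu) + eps, eps ~ N(0, sigma^2).
    # sigma controls how much dispersion variation the trend leaves unexplained.
    eps = rng.normal(0.0, sigma, size=mu.shape)
    return np.exp(alpha0 + alpha1 * np.log(mu) + eps)

def simulate_nb_counts(mu, phi, n_samples, rng):
    # NB counts with mean mu and variance mu + phi * mu^2,
    # expressed in numpy's (n, p) parameterization.
    size = 1.0 / phi
    p = size / (size + mu)
    return rng.negative_binomial(size[:, None], p[:, None],
                                 size=(mu.size, n_samples))

def residual_sd(mu, phi):
    # Standard deviation of log(phi) around a least-squares line in log(mu).
    # This uses the true dispersions, so it is a toy summary, not an estimator
    # that could be applied to real count data.
    x, y = np.log(mu), np.log(phi)
    slope, intercept = np.polyfit(x, y, 1)
    return np.std(y - (intercept + slope * x), ddof=2)

rng = np.random.default_rng(1)
mu = rng.uniform(10.0, 1000.0, size=5000)          # hypothetical gene means
phi = simulate_dispersions(mu, alpha0=-1.0, alpha1=-0.3, sigma=0.5, rng=rng)
counts = simulate_nb_counts(mu, phi, n_samples=6, rng=rng)
print(counts.shape)
print(round(residual_sd(mu, phi), 2))
```

With sigma set to 0, every gene's dispersion lies exactly on the trend and a model with few parameters loses nothing. As sigma grows, the trend describes individual genes less well, which is the situation in which the abstract warns that a parsimonious dispersion model may mislead.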




pone.0120117.g006: Histograms of p-values for the non-DE genes from the six DE test methods. The simulation dataset is based on the human dataset with σ specified as the estimated value. Out of a total of 5,000 genes, 80% are non-DE.

Mentions: FDR control is closely related to the test p-values. Fig. 6 shows histograms of the p-values computed for the non-DE genes in one of the datasets used for the FDR comparison above (fold change estimated from data, 20% DE, and σ set to its estimated value). The histograms from the NBPSeq:genewise and QuasiSeq:QLSpline methods are closer to uniform. For the edgeR:common, NBPSeq:NBQ and edgeR:trended methods, the histograms are asymmetric and V-shaped: there is an overabundance of small p-values compared with a uniform distribution, but the histograms also indicate that these tests are conservative for many genes. Lund et al. [18] observed similar patterns for other dispersion-modeling methods. The edgeR:tagwise-trend method produces conservative p-values.
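The histogram shapes described above can be summarized numerically by comparing each bin's count with the count expected under a uniform distribution. The following is a minimal sketch, not the procedure used in the paper. The function name `bin_ratios` and the two synthetic p-value sets are assumptions made for illustration: one set is drawn from a uniform distribution (a well-calibrated test), and the other is deliberately skewed toward 1 to mimic a conservative test.

```python
import numpy as np

def bin_ratios(pvals, n_bins=20):
    # Observed bin counts divided by the count expected if p-values were uniform.
    # Ratios near 1 in every bin indicate a roughly uniform histogram;
    # a first-bin ratio above 1 indicates an excess of small p-values;
    # ratios that rise toward the last bin indicate a conservative test.
    counts, _ = np.histogram(pvals, bins=n_bins, range=(0.0, 1.0))
    expected = len(pvals) / n_bins
    return counts / expected

rng = np.random.default_rng(0)
u = rng.uniform(size=4000)        # 4,000 null genes, as in 80% of 5,000

calibrated = u                    # uniform null p-values
conservative = np.sqrt(u)         # P(p <= x) = x^2, so small p-values are rare

for name, p in [("calibrated", calibrated), ("conservative", conservative)]:
    r = bin_ratios(p)
    print(name, "first bin:", round(r[0], 2), "last bin:", round(r[-1], 2))
```

A V-shaped histogram of the kind reported for edgeR:common, NBPSeq:NBQ and edgeR:trended would show ratios above 1 in both the first and the last bins, with ratios below 1 in between.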

