The level of residual dispersion variation and the power of differential expression tests for RNA-Seq data.

Mi G, Di Y - PLoS ONE (2015)

Bottom Line: RNA-Sequencing (RNA-Seq) has been widely adopted for quantifying gene expression changes in comparative transcriptome analysis. Presumably, dispersion models with fewer parameters will result in greater power if the models are correct, but will produce misleading conclusions if not. We propose a simple statistic to quantify the level of residual dispersion variation from a fitted dispersion model and show that the magnitude of this statistic gives hints about whether and how much we can gain statistical power by a dispersion-modeling approach.


Affiliation: Department of Statistics, Oregon State University, Corvallis, Oregon, United States of America.

ABSTRACT
RNA-Sequencing (RNA-Seq) has been widely adopted for quantifying gene expression changes in comparative transcriptome analysis. For detecting differentially expressed genes, a variety of statistical methods based on the negative binomial (NB) distribution have been proposed. These methods differ in the ways they handle the NB nuisance parameters (i.e., the dispersion parameters associated with each gene) to save power, such as by using a dispersion model to exploit an apparent relationship between the dispersion parameter and the NB mean. Presumably, dispersion models with fewer parameters will result in greater power if the models are correct, but will produce misleading conclusions if not. This paper investigates this power and robustness trade-off by assessing rates of identifying true differential expression using the various methods under realistic assumptions about NB dispersion parameters. Our results indicate that the relative performances of the different methods are closely related to the level of dispersion variation unexplained by the dispersion model. We propose a simple statistic to quantify the level of residual dispersion variation from a fitted dispersion model and show that the magnitude of this statistic gives hints about whether and how much we can gain statistical power by a dispersion-modeling approach.



pone.0120117.g001: Mean-dispersion plots for the human RNA-Seq dataset. The left panel is for the control group and the right panel is for the E2-treated group. Each group has seven biological replicates. The sequencing depth for this dataset is 30 million. Each point on the plots represents one gene with its method-of-moment (MOM) dispersion estimate on the y-axis and estimated relative mean frequency on the x-axis. The fitted curves for five dispersion models are superimposed on the scatter plot.

Mentions: Fig. 1 shows the mean-dispersion plots for the two treatment groups in the human dataset (with a sequencing depth of 30 million). In each plot, method-of-moment (MOM) estimates ($\hat{\phi}_i$) of the dispersion $\phi$ for each gene are plotted against estimated relative mean frequencies (on the log-log scale). For each gene $i$, $\hat{\phi}_i$ is defined as $\hat{\phi}_i = \frac{\sum_{j=1}^{n} (y_{ij} - \bar{y}_i)^2 - (n-1)\,\bar{y}_i}{(n-1)\,\bar{y}_i^2}$, where $y_{ij}$ ($j = 1, \dots, n$) are the read counts and $\bar{y}_i$ is their mean. Note that for this dataset, the library sizes (column totals) are roughly the same. Genes with $\hat{\phi}_i \le 0$ were not used in the mean-dispersion plots and the gamma log-linear regression analysis. We also overlaid the trends from five fitted dispersion models representing the wide range of currently available options: common, NBP, NBQ, NBS and trended (see the “Background/NB Dispersion Models” subsection above). We make the following remarks:
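
To make the MOM estimate in the paragraph above concrete, here is a minimal Python sketch of how the points of such a mean-dispersion plot could be computed. It is not the authors' code. It assumes the usual NB variance function Var(y) = μ + ϕμ² and equal library sizes within a group, so that the moment estimator reduces to (sample variance − mean) / mean². The function names `mom_dispersion` and `mean_dispersion_points` are hypothetical, and the relative mean frequency is approximated here as the gene mean divided by the average column total.

```python
import numpy as np

def mom_dispersion(counts):
    """Per-gene method-of-moments NB dispersion (rows = genes, columns = replicates).

    Assumption: equal library sizes and Var(y) = mu + phi * mu^2,
    so phi_hat = (s^2 - ybar) / ybar^2 with s^2 the sample variance.
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=1)
    var = counts.var(axis=1, ddof=1)  # unbiased sample variance (divides by n - 1)
    # Genes with zero mean give 0/0; silence the warning and let them become NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        phi = (var - mean) / mean**2
    return mean, phi

def mean_dispersion_points(counts):
    """Return log relative mean frequency and log MOM dispersion for plottable genes."""
    mean, phi = mom_dispersion(counts)
    # Assumption: relative frequency = gene mean / average library size.
    rel_freq = mean / mean.sum()
    # Genes with non-positive (or undefined) estimates cannot go on a log scale.
    keep = np.isfinite(phi) & (phi > 0)
    return np.log(rel_freq[keep]), np.log(phi[keep]), keep

if __name__ == "__main__":
    # Toy counts for one treatment group: 4 genes x 3 replicates.
    toy = np.array([[1, 3, 5], [2, 2, 2], [0, 0, 0], [10, 30, 50]])
    log_pi, log_phi, keep = mean_dispersion_points(toy)
    print("genes kept:", keep)
    print("log relative frequency:", log_pi)
    print("log MOM dispersion:", log_phi)
```

In the toy data, the second gene has zero sample variance and the third has zero counts, so both are dropped, mirroring the exclusion of genes with non-positive estimates described above. Plotting `log_phi` against `log_pi` for each group would give a scatter of the kind shown in Fig. 1, without the fitted model curves.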

