A powerful statistical approach for large-scale differential transcription analysis.

Tan YD, Chandler AM, Chaudhury A, Neilson JR - PLoS ONE (2015)

Bottom Line: Next generation sequencing (NGS) is increasingly being used for transcriptome-wide analysis of differential gene expression. The NGS data are multidimensional count data. But due to high cost and limitation of biological resources, current NGS data are still generated from a few replicate libraries.


Affiliation: Department of Molecular Physiology and Biophysics and Dan L. Duncan Cancer Center, Baylor College of Medicine, Houston, Texas, 77030, United States of America.

ABSTRACT
Next generation sequencing (NGS) is increasingly being used for transcriptome-wide analysis of differential gene expression. NGS data are multidimensional count data. Therefore, most of the statistical methods developed for microarray data analysis are not applicable to transcriptomic data. For this reason, a variety of new statistical methods based on count data of transcript reads have been proposed. However, because of high cost and limited biological resources, current NGS data are still generated from only a few replicate libraries. Some of the existing methods do not always perform well on count data. Here we developed a powerful and robust statistical method based on beta and binomial distributions. Our method (mBeta t-test) is specifically applicable to sequence count data from small samples. On both simulated and real transcriptomic data, the mBeta t-test significantly outperformed the existing top statistical methods chosen in all 12 given scenarios and performed with high efficiency and high stability. The differentially expressed genes found by our method in real transcriptomic data were validated by qPCR experiments. Our method shows high power in finding truly differential expression, conservative estimation of FDR, and high stability on RNA sequence count data derived from small samples. Our method can also be extended to genome-wide detection of differential splicing events.


Fig 3 (pone.0123658.g003): Heatmaps of expression of transcripts identified by only a single statistical method in the real transcriptomic datasets. A: 10 isoforms were called differentially expressed only by the mBeta t-test. B: 700 isoforms were found to be differentially expressed only by the Beta t-test. C: 220 DE isoforms were identified only by the edgeR Exact test. NS: non-stimulated cells. 48h: cells stimulated via the antigen receptor for 48 hours. A, B, and C are replicates.

Mentions: Within this comparison, if a gene or isoform is found to be differentially expressed by only a single method, it is highly likely to be a false discovery. Fig 3 shows heat maps of the ten genes identified only by the mBeta t-test (Fig 3A), the 770 genes identified only by the Beta t-test (Fig 3B), and the 220 genes identified only by the edgeR Exact test (Fig 3C). Indeed, the genes identified by only a single method do not display obvious differences in expression between the non-stimulated and stimulated states. We defined the no-share ratio of findings, m_i / M_i, where m_i is the number of findings specific to method i and M_i is the total number of findings identified by method i, as the least true false discovery rate (the least true FDR corresponds to the q-value defined by Storey et al. [35]). Using this indirect method, we obtained the least true FDRs for the findings of the edgeR Exact test, mBeta t-test, and Beta t-test in our real PAS-seq datasets (Table 3). Table 3 shows that the edgeR Exact test and the Beta t-test severely underestimated their FDRs, while the mBeta t-test still overestimated the FDR of its findings. This is consistent with the results obtained from the simulated data.
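
The no-share ratio m_i / M_i described above can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `no_share_ratios`, the method labels, and the gene identifiers are hypothetical toy values chosen only to show the arithmetic, and a finding is assumed to be "method-specific" when no other method in the comparison reports it.

```python
def no_share_ratios(findings):
    """Return m_i / M_i for each method.

    findings maps a method name to the set of identifiers it called
    differentially expressed (hypothetical input format).
    """
    ratios = {}
    for method, called in findings.items():
        # Union of everything reported by the other methods.
        others = set().union(
            *(ids for name, ids in findings.items() if name != method)
        )
        specific = called - others  # m_i: findings unique to this method
        # M_i is the size of the method's full finding set.
        ratios[method] = len(specific) / len(called) if called else float("nan")
    return ratios


# Toy data (not from the paper): three methods and made-up identifiers.
toy_findings = {
    "method_A": {"g1", "g2", "g3", "g4"},
    "method_B": {"g1", "g2", "g5"},
    "method_C": {"g1", "g6", "g7", "g8", "g9"},
}
ratios = no_share_ratios(toy_findings)
for method, ratio in sorted(ratios.items()):
    print(f"{method}: no-share ratio = {ratio:.3f}")
```

In this toy example, method_A has two findings that no other method shares out of four in total, so its no-share ratio is 0.5. Comparing such a ratio with the FDR a method reports for itself is the kind of check summarized in Table 3.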

