ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis.

Pierson E, Yau C - Genome Biol. (2015)

Affiliation: Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG, Oxford, UK. emma.pierson@st-annes.ox.ac.uk.

ABSTRACT
Single-cell RNA-seq data allow insight into normal cellular function and various disease states through molecular characterization of gene expression at the single-cell level. Dimensionality reduction of such high-dimensional data sets is essential for visualization and analysis, but single-cell RNA-seq data are challenging for classical dimensionality-reduction methods because of the prevalence of dropout events, which lead to zero-inflated data. Here, we develop a dimensionality-reduction method, (Z)ero (I)nflated (F)actor (A)nalysis (ZIFA), which explicitly models the dropout characteristics, and show that it improves modeling accuracy on simulated and biological data sets.

Fig. 2: Comparison of exact and block-based EM algorithms. Plots show the correlation between expectations computed using the exact and block-based EM algorithms for latent low-dimensional positions (Z) (a) and latent observations (X) (b). Simulations were performed on a simulated data set with 500 genes and 200 cells. A block size of 50 was chosen for the approximate approach.

Mentions: The EM algorithm requires computing conditional expectations of multivariate Gaussian distributions. For each cell, the non-zero measurements are used to jointly impute the expected expression levels of the genes measured as zero. If all available expressed genes are used for this imputation, the exact computation requires large, computationally intensive matrix multiplications. In practice, we have found that it is not necessary to compute the expectations using all available genes at once. Substantial computational savings can be achieved by partitioning the genes into disjoint blocks and then performing exact computations within each block. This reduces the run time of our algorithm from quadratic to linear in the number of genes, allowing it to run on data sets with hundreds of samples and tens of thousands of genes on a standard computer. Figure 2 shows that the expectations obtained with this approximate strategy closely follow those from the exact calculations while being substantially faster to compute. Parameter estimates based on these approximate expectations are also robust (Additional file 1: Figure S2).
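The block-based idea can be illustrated with a generic multivariate Gaussian. The sketch below is not ZIFA's actual E-step (it has no latent factors and no dropout model); it only assumes a vector x ~ N(mu, cov) in which some entries are treated as unobserved, and compares the exact conditional mean with a block-wise approximation that ignores covariance between blocks. The function names, the boolean `observed` mask, and the `block_size` argument are hypothetical and chosen for illustration.

```python
import numpy as np


def conditional_mean(mu, cov, x, observed):
    """Exact E[x_missing | x_observed] for x ~ N(mu, cov).

    `observed` is a boolean mask; entries of x where it is False are ignored.
    """
    missing = ~observed
    if not missing.any():
        return np.empty(0)
    if not observed.any():
        # Nothing to condition on: fall back to the marginal mean.
        return mu[missing].copy()
    cov_oo = cov[np.ix_(observed, observed)]
    cov_mo = cov[np.ix_(missing, observed)]
    # mu_m + cov_mo @ inv(cov_oo) @ (x_o - mu_o), using a solve instead of an inverse.
    return mu[missing] + cov_mo @ np.linalg.solve(cov_oo, x[observed] - mu[observed])


def blockwise_conditional_mean(mu, cov, x, observed, block_size):
    """Approximate version: apply the exact formula inside each block of genes.

    Covariance between different blocks is ignored, so the result is exact
    only when `cov` is block-diagonal with respect to the chosen partition.
    Returns a copy of x with the unobserved entries filled in.
    """
    filled = np.asarray(x, dtype=float).copy()
    for start in range(0, len(mu), block_size):
        block = slice(start, start + block_size)
        obs_b = observed[block]
        filled_b = filled[block]  # view into `filled`
        filled_b[~obs_b] = conditional_mean(mu[block], cov[block, block], x[block], obs_b)
    return filled


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_genes = 500
    loadings = rng.normal(size=(n_genes, 2))
    cov = loadings @ loadings.T + np.eye(n_genes)  # factor-analysis-style covariance
    mu = np.zeros(n_genes)
    x = rng.multivariate_normal(mu, cov)
    observed = rng.random(n_genes) > 0.3  # pretend about 30% of genes dropped out

    exact = conditional_mean(mu, cov, x, observed)
    approx = blockwise_conditional_mean(mu, cov, x, observed, block_size=50)[~observed]
    print("correlation between exact and block-wise:", np.corrcoef(exact, approx)[0, 1])
```

In this sketch each linear solve involves at most `block_size` genes, so for a fixed block size the total work grows linearly with the number of genes, at the cost of discarding information shared across blocks.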

