Identifying and removing the cell-cycle effect from single-cell RNA-Sequencing data


ABSTRACT

Single-cell RNA-Sequencing (scRNA-Seq) is a revolutionary technique for discovering and describing cell types in heterogeneous tissues, yet its measurement of expression often suffers from large systematic bias. A major source of this bias is the cell cycle, which introduces large within-cell-type heterogeneity that can obscure the differences in expression between cell types. The current method for removing the cell-cycle effect is unable to effectively identify this effect and has a high risk of removing other biological components of interest, compromising downstream analysis. We present ccRemover, a new method that reliably identifies the cell-cycle effect and removes it. ccRemover preserves other biological signals of interest in the data and thus can serve as an important pre-processing step for many scRNA-Seq data analyses. The effectiveness of ccRemover is demonstrated using simulation data and three real scRNA-Seq datasets, where it boosts the performance of existing clustering algorithms in distinguishing between cell types.



Figure 1: The simulation data projected onto its first two principal components. The cell types are represented by the different shapes (circle, triangle) and the cell-cycle time point of each cell is represented by the different colors (red, blue, green). (a) Original data. Here the data is clustered into six groups corresponding to the combinations of cell type and cell-cycle status. (b) scLVM corrected data (one latent factor removed). The data clusters into three groups corresponding to cell-cycle status. (c) scLVM corrected data (three latent factors removed). No distinct clusters are observed. (d) ccRemover corrected data. The data splits into two groups corresponding to the cell types.

Applying clustering algorithms, such as k-means clustering or hierarchical clustering, to the gene expression profiles of single cells can reveal the different cell types present in heterogeneous tissues, allowing them to be identified and characterized [9-14]. However, for this approach to achieve its full power, the high-noise nature of scRNA-Seq data needs to be handled carefully [15-21]. scRNA-Seq data, while known to have large variance introduced during library preparation [17, 22], also suffers from large systematic bias caused by biological noise, which acts as a confounding factor that obscures biological signals of interest in the data [12, 15, 23]. For data generated by other high-throughput techniques such as microarrays, removing systematic bias has been shown to be critically important [24-26]. For scRNA-Seq data, one of the major sources of biological noise is the cell cycle [19, 27-32]. During the cell cycle a cell increases in size, replicates its DNA and splits into daughter cells. Different cells are at different time points of this cycle, and thus they may have quite different expression profiles [15], even if they are cells of the same type [33, 34]. This within-type heterogeneity can seriously degrade the performance of clustering algorithms for cell-type identification: it may blur clusters of cell types or cause cells of similar cell-cycle status to stand out as new clusters.

Figure 1 shows an example using simulation data. Gene expression data is simulated for 50 cells and 2,000 genes. The cells are randomly assigned to two cell types (denoted using different shapes) and three cell-cycle stages (denoted using different colors). Figure 1a shows the results of principal component analysis (PCA) on this simulated data. The cells form six distinct clusters, grouped by both cell type and cell-cycle status. Performing cell-type discovery directly on this original data would mistakenly identify six cell types.
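The simulation setup described above can be sketched roughly as follows. This is only an illustrative sketch and not the authors' simulation code: the source gives the dimensions (50 cells, 2,000 genes), the two cell types and the three cell-cycle stages, but the additive Gaussian model, the number of genes affected by each factor (`n_type_genes`, `n_cc_genes`), the effect sizes and the function names are all assumptions made here for illustration.

```python
import numpy as np


def simulate_expression(n_cells=50, n_genes=2000,
                        n_type_genes=400, n_cc_genes=400, seed=0):
    """Simulate a cells x genes matrix with cell-type and cell-cycle effects.

    Hypothetical model (not from the source): Gaussian noise plus an
    additive shift per cell type on the first `n_type_genes` genes and an
    additive shift per cell-cycle stage on the last `n_cc_genes` genes.
    """
    rng = np.random.default_rng(seed)

    # Randomly assign each cell to one of 2 cell types and 3 cell-cycle stages.
    cell_type = rng.integers(0, 2, size=n_cells)
    cc_stage = rng.integers(0, 3, size=n_cells)

    # Baseline noise for every cell and gene.
    X = rng.normal(0.0, 1.0, size=(n_cells, n_genes))

    # One mean-shift vector per cell type, applied to the cell-type genes.
    type_effect = rng.normal(0.0, 1.0, size=(2, n_type_genes))
    X[:, :n_type_genes] += type_effect[cell_type]

    # One mean-shift vector per cell-cycle stage, applied to the cell-cycle genes.
    cc_effect = rng.normal(0.0, 1.0, size=(3, n_cc_genes))
    X[:, n_genes - n_cc_genes:] += cc_effect[cc_stage]

    return X, cell_type, cc_stage


def pca_scores(X, n_components=2):
    """Project the rows of X onto the first principal components via SVD."""
    Xc = X - X.mean(axis=0)  # center each gene
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


X, cell_type, cc_stage = simulate_expression()
scores = pca_scores(X)  # one (PC1, PC2) coordinate pair per cell
```

Plotting `scores` with marker shape set by `cell_type` and color set by `cc_stage` gives a picture analogous to Figure 1a. How cleanly the type-by-stage combinations separate depends on the assumed effect sizes, so the exact cluster layout should not be read as a reproduction of the published figure.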

