Single-cell RNA sequencing (scRNA-seq) is widely used to measure the genome-wide expression profile of individual cells. From each cell, mRNA is isolated and reverse transcribed to cDNA for high-throughput sequencing (Stegle, Teichmann, and Marioni 2015). This can be done using microfluidics platforms like the Fluidigm C1 (Pollen et al. 2014), or with protocols based on microtiter plates like Smart-seq2 (Picelli et al. 2014). The number of reads mapped to each gene can then be used to quantify its expression in each cell. Alternatively, unique molecular identifiers (UMIs) can be used to directly measure the number of transcript molecules for each gene (Islam et al. 2014). Count data can be analyzed to identify new cell subpopulations via dimensionality reduction and clustering; to detect highly variable genes (HVGs) across a population; or to detect differentially expressed genes (DEGs) between conditions. This provides biological insights at a single-cell resolution that cannot be achieved with conventional bulk RNA sequencing of cell populations.
Strategies for scRNA-seq data analysis differ markedly from those for bulk RNA-seq. One technical reason is that scRNA-seq data is much noisier than bulk data (Brennecke et al. 2013; Marinov et al. 2014). Reliable capture (i.e., conversion) of transcripts into cDNA for sequencing is difficult with the low quantity of RNA in a single cell. This increases the frequency of drop-out events where none of the transcripts for a gene are captured. Dedicated steps are required to deal with this noise, especially during quality control. In addition, scRNA-seq data can be used to study cell-to-cell heterogeneity, e.g., to identify new cell subtypes, to characterize differentiation processes, to assign cells into their cell cycle phases, or to identify HVGs driving variability across the population (Vallejos, Marioni, and Richardson 2015; J. Fan et al. 2016; Trapnell et al. 2014). This is simply not possible with bulk data, such that custom methods are required to perform these analyses.
This article describes a computational workflow for basic analysis of scRNA-seq data using software packages from the open-source Bioconductor project (Huber et al. 2015). Starting from a count matrix, this workflow contains the steps required for quality control to remove problematic cells; normalization of cell-specific biases, with and without spike-ins; cell-cycle phase classification from gene expression data; data exploration to identify putative subpopulations; and finally, HVG and DEG identification to prioritize interesting genes. The application of different steps in the workflow will be demonstrated on several public scRNA-seq data sets involving haematopoietic stem cells, brain-derived cells, T-helper cells and mouse embryonic stem cells, generated with a range of experimental protocols and platforms (Wilson et al. 2015; Zeisel et al. 2015; Buettner et al. 2015; Kolodziejczyk et al. 2015). The aim is to provide a variety of modular usage examples that can be applied to construct custom analysis pipelines.
To introduce most of the concepts of scRNA-seq data analysis, we use a relatively simple data set from a study of haematopoietic stem cells (HSCs) (Wilson et al. 2015). Single mouse HSCs were isolated into microtiter plates and libraries were prepared for 96 cells using the Smart-seq2 protocol. A constant amount of spike-in RNA from the External RNA Controls Consortium (ERCC) was also added to each cell prior to library preparation. High-throughput sequencing was performed and the expression of each gene was quantified by counting the total number of reads mapped to its exonic regions. Similarly, the quantity of each spike-in transcript was measured by counting reads mapped to the spike-in reference sequence. Counts for all genes/transcripts in each cell were obtained from the NCBI Gene Expression Omnibus (GEO) as a supplementary file under the accession number GSE61533.
For simplicity, we forgo a description of the read processing steps required to generate the count matrix, i.e., read alignment and counting into features. These steps have been described in some detail elsewhere (Love et al. 2015), and are largely the same for bulk and single-cell data. The only additional consideration is that the spike-in information must be included in the pipeline. Typically, spike-in sequences can be included as additional FASTA files during genome index building prior to alignment, while genomic intervals for both spike-in transcripts and endogenous genes can be concatenated into a single GTF file prior to counting. For users favouring an R-based approach to read alignment and counting, we suggest using the methods in the Rsubread package (Liao, Smyth, and Shi 2013; Liao, Smyth, and Shi 2014).
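As a brief illustration of the points above, the sketch below shows how such a pipeline might look with Rsubread. The file names (a FASTA file containing both genome and ERCC sequences, a combined GTF and a per-cell FASTQ file) are placeholders rather than files used in this workflow, and parameters would need to be tuned for real data.
library(Rsubread)
# Build an index from a reference containing both genomic and spike-in sequences (placeholder file names).
buildindex(basename="mm10_ercc", reference="mm10_with_ERCC.fa")
# Align the reads for one cell, then count them against a GTF covering both genes and spike-ins.
align(index="mm10_ercc", readfile1="cell1_R1.fastq.gz", output_file="cell1.bam")
fc <- featureCounts("cell1.bam", annot.ext="genes_with_ERCC.gtf", isGTFAnnotationFile=TRUE)
head(fc$counts)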
The first task is to load the count matrix into memory. This requires some work to decompress and retrieve the data from the Excel format. Each row of the matrix represents an endogenous gene or a spike-in transcript, and each column represents a single HSC. For convenience, the counts for spike-in transcripts and endogenous genes are stored in a SCESet object from the scater package.
library(R.utils)
gunzip("GSE61533_HTSEQ_count_results.xls.gz", remove=FALSE, overwrite=TRUE)
library(gdata)
all.counts <- read.xls('GSE61533_HTSEQ_count_results.xls', sheet=1, header=TRUE, row.names=1)
library(scater)
sce <- newSCESet(countData=all.counts)
dim(sce)
## Features Samples
## 38498 96
We annotate those rows corresponding to ERCC spike-ins and mitochondrial genes. This information can be easily extracted from the row names, though in general, identifying mitochondrial genes from standard identifiers like Ensembl requires extra annotation. For each cell, we calculate quality control metrics such as the total number of counts or the proportion of counts in mitochondrial genes or spike-in transcripts. These metrics are stored in the pData of the SCESet for future reference.
is.spike <- grepl("^ERCC", rownames(sce))
isSpike(sce) <- is.spike
is.mito <- grepl("^mt-", rownames(sce))
sce <- calculateQCMetrics(sce, feature_controls=list(Spike=is.spike, Mt=is.mito))
head(colnames(pData(sce)))
## [1] "total_counts" "log10_total_counts" "filter_on_total_counts"
## [4] "total_features" "log10_total_features" "filter_on_total_features"
Two common measures of cell quality are the library size and the number of expressed features in each library. The library size is defined as the total sum of counts across all features, i.e., genes and spike-in transcripts. Cells with small library sizes are considered to be of low quality as the RNA has not been efficiently captured (i.e., converted into cDNA and amplified) during library preparation. The number of expressed features in each cell is defined as the number of features with non-zero counts for that cell. Any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured. The distributions of both of these metrics are shown in Figure 1.
par(mfrow=c(1,2))
hist(sce$total_counts/1e6, xlab="Library sizes (millions)", main="",
breaks=20, col="grey80", ylab="Number of cells")
hist(sce$total_features, xlab="Number of expressed genes", main="",
breaks=20, col="grey80", ylab="Number of cells")
Figure 1: Histograms of library sizes (left) and number of expressed genes (right) for all cells in the HSC data set.
Picking a threshold for these metrics is not straightforward as their absolute values depend on the protocol and biological system. For example, sequencing to greater depth will lead to more reads, regardless of the quality of the cells. To obtain an adaptive threshold, we assume that most of the data set consists of high-quality cells. We remove cells with log-library sizes that are more than 3 median absolute deviations (MADs) below the median log-library size. A log-transformation is used because the wide range of library sizes would otherwise inflate the MAD on the raw scale. We also remove cells where the number of expressed genes is more than 3 MADs below the median. This eliminates low-quality cells that are small outliers for either metric.
libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE)
feature.drop <- isOutlier(sce$total_features, nmads=3, type="lower")
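For intuition, the lower threshold on the library sizes corresponds approximately to the explicit calculation sketched below (shown on a log10 scale for illustration; the exact internals of isOutlier may differ slightly).
log.lib <- log10(sce$total_counts)
lower.limit <- median(log.lib) - 3*mad(log.lib)   # adaptive threshold: 3 MADs below the median
summary(log.lib < lower.limit)                    # should closely mirror libsize.drop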
Another measure of quality is the proportion of reads mapped to genes in the mitochondrial genome. High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of increased apoptosis and/or loss of cytoplasmic RNA from lysed cells. A similar case can be made for the proportion of reads mapped to spike-in transcripts. The quantity of spike-in RNA added to each cell should be constant, which means that the proportion should increase upon loss of endogenous RNA in low-quality cells. The distributions of mitochondrial and spike-in proportions across all cells are shown in Figure 2.
par(mfrow=c(1,2))
hist(sce$pct_counts_feature_controls_Mt, xlab="Mitochondrial proportion (%)",
ylab="Number of cells", breaks=20, main="", col="grey80")
hist(sce$pct_counts_feature_controls_Spike, xlab="ERCC proportion (%)",
ylab="Number of cells", breaks=20, main="", col="grey80")
Figure 2: Histogram of the proportion of reads mapped to mitochondrial genes (left) or spike-in transcripts (right) across all cells in the HSC data set.
Again, the ideal threshold for these proportions depends on the cell type and the experimental protocol. Cells with more mitochondria or more mitochondrial activity may naturally have larger mitochondrial proportions. Similarly, cells with more endogenous RNA or in protocols using less spike-in RNA will have lower spike-in proportions. If we assume that most cells in the data set are of high quality, then the threshold can be set to remove any large outliers from the distribution of proportions. We use the MAD-based definition of outliers to remove putative low-quality cells from the data set.
mito.drop <- isOutlier(sce$pct_counts_feature_controls_Mt, nmads=3, type="higher")
spike.drop <- isOutlier(sce$pct_counts_feature_controls_Spike, nmads=3, type="higher")
Subsetting by column will retain only the high-quality cells that pass each filter described above. We can examine the number of cells removed by each filter, and the total number remaining in the data set. Removal of a substantial proportion of cells (> 10%) may be indicative of an overall issue with data quality. It may also reflect genuine biology in extreme cases (e.g., low numbers of expressed genes in erythrocytes) for which the filters described here are not appropriate.
sce <- sce[,!(libsize.drop | feature.drop | mito.drop | spike.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop),
ByMito=sum(mito.drop), BySpike=sum(spike.drop), Remaining=ncol(sce))
## ByLibSize ByFeature ByMito BySpike Remaining
## Samples 2 2 6 3 86
An alternative approach to quality control is to perform a principal components analysis (PCA) based on the quality metrics for each cell, e.g., the total number of reads, the total number of features, the proportion of mitochondrial or spike-in reads. Outliers on a PCA plot may be indicative of low-quality cells that have aberrant technical properties compared to the (presumed) majority of high-quality cells. In Figure 3, no obvious outliers are present which is consistent with the removal of suspect cells in the preceding quality control steps.
fontsize <- theme(axis.text=element_text(size=12), axis.title=element_text(size=16))
plotPCA(sce, pca_data_input="pdata") + fontsize
Figure 3: PCA plot for all remaining cells in the HSC data set, constructed using quality metrics. The first and second components are shown on each axis, along with the percentage of total variance explained by each component. Bars represent the coordinates of the cells on each axis.
Methods like PCA-based outlier detection and support vector machines can provide more power to distinguish low-quality cells from high-quality counterparts (Ilicic et al. 2016). This is because they are able to detect subtle patterns across many quality metrics simultaneously. However, this comes at some cost to interpretability, as the reason for removing a given cell may not always be obvious. Thus, for this workflow, we will use the simple approach whereby each quality metric is considered separately. Users interested in the more sophisticated approaches are referred to the scater and cellity packages.
Low-abundance genes are removed as the counts are too low for reliable statistical inferences. In addition, the discreteness of the counts may interfere with downstream statistical procedures, e.g., by compromising the accuracy of asymptotic approximations. Here, low-abundance genes are defined as those with an average count across cells below 1. Removing them avoids problems with discreteness and also reduces the amount of computational work.
keep <- rowMeans(counts(sce)) >= 1
sce <- sce[keep,]
sum(keep)
## [1] 13997
An alternative approach to gene filtering is to select genes that have non-zero counts in at least n cells. This provides some more protection against genes with outlier expression patterns, i.e., strong expression in only one or two cells. Such outliers are typically uninteresting as they can arise from amplification artifacts that are not replicable across cells. (The exception is for studies involving rare cells where the outliers may be biologically relevant.) An example of this filtering approach is shown below for n set to 10.
alt.keep <- rowSums(is_exprs(sce)) >= 10
sum(alt.keep)
## [1] 11419
The relationship between the proportion of expressing cells and the mean can be examined more closely in Figure 4. The two statistics tend to be well-correlated, so filtering on either should give roughly similar results.
plotQC(sce, type = "exprs-freq-vs-mean") + fontsize
Figure 4: Frequency of expression against the mean expression for each gene. Circles represent endogenous genes and triangles represent spike-in transcripts or mitochondrial genes. The bars on each axis represent the location of each gene on that axis. Genes with expression frequencies higher than the dropout rate are defined as those above a non-linear trend fitted to the spike-in transcripts.
In general, we prefer the mean-based filter as it tends to be less aggressive. A gene will be retained as long as it has sufficient expression in any subset of cells. The “at least n” filter depends heavily on the choice of n – in this case, a gene expressed in a subset of 9 cells would be lost. While the mean-based filter will retain more outlier-driven genes, this can be handled by choosing methods that are robust to outliers in the downstream analyses.
Read counts are subject to differences in capture efficiency and sequencing depth between cells (Stegle, Teichmann, and Marioni 2015). Normalization is required to eliminate these cell-specific biases prior to downstream quantitative analyses. This is often done by assuming that most genes are not differentially expressed (DE) between cells. Any systematic difference in count size across the non-DE majority of genes between two cells is assumed to represent bias and is removed by scaling. More specifically, “size factors” are calculated that represent the extent to which counts should be scaled in each library.
Size factors can be computed with several different approaches, e.g., using the estimateSizeFactorsForMatrix function in the DESeq2 package (Anders and Huber 2010; Love, Huber, and Anders 2014), or with the calcNormFactors function (Robinson and Oshlack 2010) in the edgeR package. However, single-cell data can be problematic for these bulk data-based methods due to the dominance of low and zero counts. To overcome this, we pool counts from many cells to increase the count size for accurate size factor estimation (Lun, Bach, and Marioni 2016). Pool-based size factors are then “deconvolved” into cell-based factors for cell-specific normalization.
sce <- computeSumFactors(sce, sizes=c(20, 40, 60, 80))
summary(sizeFactors(sce))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4268 0.8411 0.9955 1.0420 1.2140 2.1660
In this case, the size factors are tightly correlated with the library sizes for all cells (Figure 5). This suggests that the systematic differences between cells are primarily driven by differences in capture efficiency or sequencing depth. Any DE between cells would yield a non-linear trend between the total count and size factor, and/or increased scatter around the trend. This does not occur here as strong DE is unlikely to exist between cells of the same type.
plot(sizeFactors(sce), sce$total_counts/1e6, log="xy",
ylab="Library size (millions)", xlab="Size factor")
Figure 5: Size factors from deconvolution, plotted against library sizes for all cells in the HSC data set. Axes are shown on a log-scale.
Normalized log-expression values can be computed for use in downstream analyses. Each value is defined as the log-ratio of each count to the size factor for the corresponding cell (after adding a small prior count to avoid undefined values at zero counts). Division by the size factor ensures that any cell-specific biases are removed. The log-transformation provides some measure of variance stabilization (Law et al. 2014), so that high-abundance genes with large variances do not dominate downstream analyses. The computed values are stored as an exprs matrix in addition to the other assay elements.
sce <- normalize(sce)
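Conceptually, this corresponds to the calculation sketched below, which follows the definition given above. The actual normalize function may differ in details such as the centring of the size factors and the exact placement of the prior count (assumed here to be 1).
sf <- sizeFactors(sce)
manual.exprs <- log2(t(t(counts(sce))/sf) + 1)   # log-ratio of counts to size factors, plus a prior count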
Dimensionality reduction is often useful to examine major features of the data before more quantitative analyses. Of particular interest is whether the HSCs partition into distinct subpopulations. This can be visualized by constructing a PCA plot from the normalized log-expression values (Figure 6). Cells with similar expression profiles should be located close together in the plot, while dissimilar cells should be far apart. By default, the plotPCA function will only use the top 500 genes with the largest variances. This focuses on the genes that are driving heterogeneity in the population and should provide greater visual resolution of any systematic differences between groups of cells.
plotPCA(sce, exprs_values="exprs") + fontsize
Figure 6: PCA plot constructed from normalized log-expression values, where each point represents a cell in the HSC data set. First and second components are shown, along with the percentage of variance explained. Bars represent the coordinates of the cells on each axis. None of the cells are controls (e.g., empty wells) so the legend can be ignored.
Another popular approach to dimensionality reduction is the t-distributed stochastic neighbour embedding (t-SNE) method (Van der Maaten and Hinton 2008). t-SNE tends to work better than PCA for separating cells in large data sets with many subpopulations, at the cost of more computational effort and complexity. Like plotPCA, the plotTSNE function will use the genes with the largest variances to focus on heterogeneity in the population. However, unlike PCA, t-SNE is a stochastic method – users should run the algorithm several times to ensure that the results are representative, and then set a seed to ensure that the chosen results are reproducible. It is also advisable to test different settings of the “perplexity” parameter as this will affect the distribution of points in the low-dimensional space (Figure 7).
set.seed(100)
out5 <- plotTSNE(sce, exprs_values="exprs", perplexity=5) + fontsize + ggtitle("Perplexity = 5")
out10 <- plotTSNE(sce, exprs_values="exprs", perplexity=10) + fontsize + ggtitle("Perplexity = 10")
out20 <- plotTSNE(sce, exprs_values="exprs", perplexity=20) + fontsize + ggtitle("Perplexity = 20")
multiplot(out5, out10, out20, cols=3)
Figure 7: t-SNE plot constructed from normalized log-expression values using a range of perplexity values. In each plot, each point represents a cell in the HSC data set. Bars represent the coordinates of the cells on each axis.
For this data set, all methods suggest that there is no separation into distinct subpopulations. This might be expected for a homogeneous population of cells of the same type. Of course, there are many dimensionality reduction techniques that we have not considered here but could also be used, e.g., multidimensional scaling, diffusion maps. These have their own advantages and disadvantages – for example, diffusion maps (see plotDiffusionMap) place cells along a continuous trajectory and are suited for visualizing graduated processes like differentiation (Angerer et al. 2015).
We use the prediction method described by Scialdone et al. (2015) to classify cells into cell cycle phases based on the gene expression data. Using a training data set, the sign of the difference in expression between two genes was computed for each pair of genes. Pairs with changes in the sign across cell cycle phases were chosen as markers. Cells in a test data set can then be classified into the appropriate phase, based on whether the observed sign for each marker pair is consistent with one phase or another. This approach is implemented in the cyclone function using a pre-trained set of marker pairs for mouse data. The result of phase assignment for each cell in the HSC data set is shown in Figure 8. (Some additional work is necessary to match the gene symbols in the data to the Ensembl annotation in the set of pairs.)
mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds", package="scran"))
library(org.Mm.eg.db)
anno <- select(org.Mm.eg.db, keys=rownames(sce), keytype="SYMBOL", column="ENSEMBL")
ensembl <- anno$ENSEMBL[match(rownames(sce), anno$SYMBOL)]
keep <- !is.na(ensembl)
assignments <- cyclone(sce[keep,], mm.pairs, gene.names=ensembl[keep])
plot(assignments$score$G1, assignments$score$G2M, xlab="G1 score", ylab="G2/M score", pch=16)
Figure 8: Cell cycle phase scores from applying the pair-based classifier on the HSC data set, where each point represents a cell.
Cells are classified as being in G1 phase if the G1 score is above 0.5; in G2/M phase if the G2/M score is above 0.5; and in S phase, if neither is above 0.5. Here, the vast majority of cells are classified as being in G1 phase. We will focus on these cells in the downstream analysis. Cells in other phases are removed to avoid potential confounding effects from cell cycle-induced differences. Alternatively, if a non-negligible number of cells are in other phases, we can use the assigned phase as a blocking factor in downstream analyses. This protects against cell cycle effects without discarding information.
g1.only <- assignments$score$G1 > 0.5
sce <- sce[,g1.only]
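For reference, the full classification described above can be recorded for every cell (using the scores computed before the subsetting above), as sketched below; the rare cells with both scores above 0.5 are arbitrarily assigned to G2/M here.
phase <- rep("S", nrow(assignments$score))        # default to S if neither score exceeds 0.5
phase[assignments$score$G1 > 0.5] <- "G1"
phase[assignments$score$G2M > 0.5] <- "G2M"
table(phase)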
Pre-trained classifiers are available in scran for human and mouse data. The mouse classifier used here was trained on embryonic stem cells but can still be generally applied – the pair-based method is a non-parametric procedure that should be robust to technical differences between data sets, and the transcriptional program associated with cell cycling should be mostly conserved across cell types. However, it will (inevitably) be less accurate for cell types that are substantially different from those used in the training set. Users can also construct a custom classifier from their own training data using the sandbag function. This may be necessary for other model organisms where pre-trained classifiers are not available.
We identify HVGs to focus on the genes that are driving heterogeneity across the population of cells. This requires estimation of the variance in expression for each gene, followed by decomposition of the variance into biological and technical components. HVGs are then identified as those genes with the highest biological components. This avoids prioritizing genes that are highly variable due to technical factors such as sampling noise during RNA capture and library preparation.
Ideally, the technical component would be estimated by fitting a mean-variance trend to the spike-in transcripts. Recall that the same set of spike-ins was added in the same quantity to each cell. This means that the spike-in transcripts should exhibit no biological variability, such that any variance in the counts should be technical in origin. Fitting is performed by the trendVar function, using a loess curve with a low span as the trend is highly non-linear. (Some adjustment of the parameters may be required to obtain a satisfactory fit.)
var.fit <- trendVar(sce, trend="loess", span=0.3)
Given the mean abundance of a gene, the fitted value of the trend can be used as an estimate of the technical component for that gene. The biological component of the variance can then be calculated by subtracting the technical component from the total variance of each gene, using the decomposeVar function.
var.out <- decomposeVar(sce, var.fit)
In practice, this strategy is complicated by the difficulty of accurately fitting a complex trend to a low number of unevenly distributed points. An alternative approach is to fit the mean-variance trend to the endogenous genes. This assumes that the majority of genes are constantly expressed, such that the technical component dominates the total variance of expression for those genes. The fitted value of the trend can then be used as an estimate of the technical component.
var.fit2 <- trendVar(sce, trend="loess", use.spikes=FALSE, span=0.2)
var.out2 <- decomposeVar(sce, var.fit2)
We assess the suitability of the trend fitted to the endogenous variances by examining whether it is consistent with the spike-in variances (Figure 9). The former passes through the bulk of the latter in the plot below, indicating that our assumption (that most genes have low levels of biological variability) is valid. In contrast, the spike-in trend fits poorly as it lies below the variance estimates at mean intervals with few spike-in transcripts. The use of an endogenous trend is the only option in data sets where no spike-ins were added or in situations where not enough spike-in RNA was added to cover the range of means for the endogenous genes.
plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression",
ylab="Variance of log-expression")
points(var.fit$mean, var.fit$var, col="red", pch=16)
o <- order(var.out$mean)
lines(var.out$mean[o], var.out$tech[o], col="red", lwd=2)
lines(var.out2$mean[o], var.out2$tech[o], col="dodgerblue", lwd=2)
Figure 9: Variance of normalized log-expression values for each gene in the HSC data set, plotted against the mean log-expression. The red line represents the mean-dependent trend in the technical variance of the spike-in transcripts (also highlighted as red points). The blue line represents the trend fitted to the variances of the endogenous genes.
The top HVGs are identified by ranking genes on their biological components. This can be used to prioritize interesting genes for further investigation. In general, we consider a gene to be a HVG if it has a biological component of at least 1. For log2-counts, this means that gene expression will vary, for biological reasons, by at least 2-fold around the mean.
top.hvgs <- order(var.out2$bio, decreasing=TRUE)
write.table(file="hsc_hvg.tsv", var.out2[top.hvgs,], sep="\t", quote=FALSE, col.names=NA)
head(var.out2[top.hvgs,])
## mean total bio tech
## Fos 6.354617 20.182191 12.302393 7.879798
## Rgs1 5.156189 20.261351 9.416550 10.844802
## Dusp1 6.638309 16.092466 9.074147 7.018319
## H2-Aa 4.237864 19.423406 7.524803 11.898603
## Ppp1r15a 6.485799 14.971509 7.462378 7.509130
## Ctla2a 8.594131 9.509346 7.400235 2.109111
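Based on the definition above, the set of HVGs with biological components of at least 1 can also be extracted directly, as sketched below.
hvg.out <- var.out2[which(var.out2$bio >= 1),]   # genes varying at least 2-fold around the mean
nrow(hvg.out)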
We recommend checking the distribution of expression values for the top HVGs to ensure that the variance estimate is not being dominated by one or two outlier cells (Figure 10).
examined <- top.hvgs[1:10]
all.names <- matrix(rownames(sce)[examined], nrow=length(examined), ncol=ncol(sce))
boxplot(split(exprs(sce)[examined,], all.names), las=2, ylab="Normalized log-expression", col="grey80")
Figure 10: Boxplots of normalized log-expression values for the top 10 HVGs in the HSC data set. Points correspond to cells that are more than 1.5 interquartile ranges from the edge of each box.
There are many other ways of defining HVGs, e.g., by using the coefficient of variation (Kolodziejczyk et al. 2015; Kim et al. 2015), with the dispersion parameter in the negative binomial distribution (McCarthy, Chen, and Smyth 2012), or as a proportion of total variability (Vallejos, Marioni, and Richardson 2015). We use the variance of the log-expression values because the log-transformation provides some protection against genes with strong expression in only one or two outlier cells. This ensures that the set of top HVGs is not dominated by genes with (mostly uninteresting) outlier expression patterns. However, the cost of this robustness is the need to fit a complex mean-variance relationship.
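For comparison, a simple (unmodelled) squared coefficient of variation can be computed directly from the size factor-normalized counts, as sketched below; note that the cited methods fit dedicated models rather than using this raw statistic.
norm.counts <- t(t(counts(sce))/sizeFactors(sce))
cv2 <- apply(norm.counts, 1, var)/rowMeans(norm.counts)^2   # CV^2 = variance/mean^2
head(sort(cv2, decreasing=TRUE))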
Once the basic analysis is completed, it is often useful to save the SCESet object to file with the saveRDS function. The object can then be easily restored into new R sessions using the readRDS function. This allows further work to be conducted without having to repeat all of the processing steps described above.
saveRDS(file="hsc_data.rds", sce)
A variety of methods are available to perform more complex analyses on the processed expression data. For example, cells can be ordered by pseudotime (e.g., for progress along a differentiation pathway) with monocle (Trapnell et al. 2014); cell-state hierarchies can be characterized with the sincell package (Julia, Telenti, and Rausell 2015); and oscillatory behaviour can be identified using Oscope (Leng et al. 2015). HVGs can be used in gene set enrichment analyses to identify biological pathways and processes with heterogeneous activity, using packages designed for bulk data like topGO or with dedicated single-cell methods like scde (J. Fan et al. 2016). Full descriptions of these analyses are outside the scope of this workflow, so interested users are advised to consult the relevant documentation.
We proceed to a more complex data set from a study of cell types in the mouse brain (Zeisel et al. 2015). This contains approximately 3000 cells of varying types such as oligodendrocytes, microglia and neurons. Individual cells were isolated using the Fluidigm C1 microfluidics system and library preparation was performed on each cell using a UMI-based protocol. After sequencing, expression was quantified by counting the number of UMIs mapped to each gene. Count data for all endogenous genes, mitochondrial genes and spike-in transcripts were obtained from http://linnarssonlab.org/cortex.
The count data are distributed across several files, so some work is necessary to consolidate them into a single matrix. We define a simple utility function for loading data in from each file. (We stress that this function is only relevant to the current data set, and should not be used for other data sets. This kind of effort is generally not required if all of the counts are in a single file and separated from the metadata.)
readFormat <- function(infile) {
# First column is empty.
metadata <- read.delim(infile, stringsAsFactors=FALSE, header=FALSE, nrow=10)[,-1]
rownames(metadata) <- metadata[,1]
metadata <- metadata[,-1]
metadata <- as.data.frame(t(metadata))
# First column after row names is some useless filler.
counts <- read.delim(infile, stringsAsFactors=FALSE, header=FALSE, row.names=1, skip=11)[,-1]
counts <- as.matrix(counts)
return(list(metadata=metadata, counts=counts))
}
Using this function, we read in the counts for the endogenous genes, ERCC spike-ins and mitochondrial genes.
endo.data <- readFormat("expression_mRNA_17-Aug-2014.txt")
spike.data <- readFormat("expression_spikes_17-Aug-2014.txt")
mito.data <- readFormat("expression_mito_17-Aug-2014.txt")
We also need to rearrange the columns for the mitochondrial data, as the order is not consistent with the other files.
m <- match(endo.data$metadata$cell_id, mito.data$metadata$cell_id)
mito.data$metadata <- mito.data$metadata[m,]
mito.data$counts <- mito.data$counts[,m]
The counts are then combined into a single matrix for constructing a SCESet object. For convenience, metadata for all cells are stored in the same object for later access.
all.counts <- rbind(endo.data$counts, mito.data$counts, spike.data$counts)
metadata <- AnnotatedDataFrame(endo.data$metadata)
sce <- newSCESet(countData=all.counts, phenoData=metadata)
dim(sce)
## Features Samples
## 20063 3005
We also add annotation identifying which rows correspond to each class of features.
nrows <- c(nrow(endo.data$counts), nrow(mito.data$counts), nrow(spike.data$counts))
is.spike <- rep(c(FALSE, FALSE, TRUE), nrows)
isSpike(sce) <- is.spike
is.mito <- rep(c(FALSE, TRUE, FALSE), nrows)
The original authors of the study have already removed low-quality cells prior to data publication. Nonetheless, we can compute some metrics to check whether the remaining cells are satisfactory.
sce <- calculateQCMetrics(sce, feature_controls=list(Spike=is.spike, Mt=is.mito))
We examine the distribution of library sizes and numbers of expressed genes across cells (Figure 13).
par(mfrow=c(1,2))
hist(sce$total_counts/1e3, xlab="Library sizes (thousands)", main="",
breaks=20, col="grey80", ylab="Number of cells")
hist(sce$total_features, xlab="Number of expressed genes", main="",
breaks=20, col="grey80", ylab="Number of cells")
Figure 13: Histograms of library sizes (left) and number of expressed genes (right) for all cells in the brain data set.
We also examine the distribution of the proportions of total reads mapped to mitochondrial genes or spike-in transcripts (Figure 14). Note that the spike-in proportions here are more variable than in the HSC data set. This may reflect a greater variability in the total amount of endogenous RNA per cell when many cell types are present.
par(mfrow=c(1,2))
hist(sce$pct_counts_feature_controls_Mt, xlab="Mitochondrial proportion (%)",
ylab="Number of cells", breaks=20, main="", col="grey80")
hist(sce$pct_counts_feature_controls_Spike, xlab="ERCC proportion (%)",
ylab="Number of cells", breaks=20, main="", col="grey80")
Figure 14: Histogram of the proportion of reads mapped to mitochondrial genes (left) or spike-in transcripts (right) across all cells in the brain data set.
We remove small outliers in Figure 13 and large outliers in Figure 14, using a MAD-based threshold as previously described.
libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE)
feature.drop <- isOutlier(sce$total_features, nmads=3, type="lower")
mito.drop <- isOutlier(sce$pct_counts_feature_controls_Mt, nmads=3, type="higher")
spike.drop <- isOutlier(sce$pct_counts_feature_controls_Spike, nmads=3, type="higher")
Removal of low-quality cells can then be performed by combining all of the metrics. The majority of cells are retained, which suggests that the original quality control procedures were generally adequate.
sce <- sce[,!(libsize.drop | feature.drop | spike.drop | mito.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop),
ByMito=sum(mito.drop), BySpike=sum(spike.drop), Remaining=ncol(sce))
## ByLibSize ByFeature ByMito BySpike Remaining
## Samples 8 0 87 8 2902
Low-abundance genes are also removed by applying a simple mean-based filter. This yields fewer genes than in the HSC data set, mostly because the sequencing depth per cell is much lower.
keep <- rowMeans(counts(sce)) >= 1
sce <- sce[keep,]
sum(keep)
## [1] 3175
Some data sets may contain strong heterogeneity in mitochondrial RNA content, possibly due to differences in mitochondrial copy number or activity between cell types. This heterogeneity will cause mitochondrial genes to dominate the top set of results, e.g., for identification of correlated HVGs. However, these genes are largely uninteresting given that most studies focus on nuclear regulation. As such, we filter them out prior to further analysis. Other candidates for removal include pseudogenes or ribosomal RNA/protein-coding genes that might not be biologically relevant but can interfere with interpretation of the results.
sce <- sce[!fData(sce)$is_feature_control_Mt,]
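For example, ribosomal protein genes could be flagged by their symbol prefixes (Rps/Rpl in mouse) and removed in the same manner; this is a hypothetical filter that is not applied in this workflow.
is.ribo <- grepl("^Rp[sl]", rownames(sce))   # ribosomal protein gene symbols
summary(is.ribo)
# sce <- sce[!is.ribo,]                      # not run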
Normalization of cell-specific biases is performed using the deconvolution method in the computeSumFactors function. Here, we cluster similar cells together and normalize the cells in each cluster using the deconvolution method. This improves the accuracy of normalization by reducing the number of DE genes between cells in the same cluster. Normalization between clusters is then performed to ensure that expression values from cells in different clusters are comparable.
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, cluster=clusters)
sce <- normalize(sce)
Compared to the HSC analysis, more scatter is observed around the trend between the total count and size factor for each cell (Figure 15). This is consistent with an increased amount of DE between cells of different types, which compromises the accuracy of library size normalization (Robinson and Oshlack 2010). In contrast, the deconvolution size factors are estimated based on median ratios and are more robust to the presence of DE between cells.
plot(sizeFactors(sce), sce$total_counts/1e3, log="xy",
ylab="Library size (thousands)", xlab="Size factor")
Figure 15: Size factors from deconvolution, plotted against library sizes for all cells in the brain data set. Axes are shown on a log-scale.
We also attempt to classify cells into cell cycle phases using the cyclone method. However, examination of Figure 16 indicates that many of the G1 and G2/M scores are ambiguous. This highlights the potential difficulties of training a classifier on one cell type (mouse embryonic stem cells – see Scialdone et al. (2015) for more details) and applying it to a substantially different cell type. Some neuron types are particularly problematic as they are postmitotic and do not belong in any phase of the cell cycle.
anno <- select(org.Mm.eg.db, keys=rownames(sce), keytype="SYMBOL", column="ENSEMBL")
ensembl <- anno$ENSEMBL[match(rownames(sce), anno$SYMBOL)]
keep <- !is.na(ensembl)
assignments <- cyclone(sce[keep,], mm.pairs, gene.names=ensembl[keep])
plot(assignments$score$G1, assignments$score$G2M, xlab="G1 score", ylab="G2/M score", pch=16)
Figure 16: Cell cycle phase scores from applying the pair-based classifier on the brain data set, where each point represents a cell.
Given the lack of definitive classification, we will not perform any processing of the data set by cell cycle phase. However, this information is still useful for verifying downstream results. For example, if we were to identify putative subpopulations, and those subpopulations had systematically different phase scores, we might be wary of the possibility that the differences between subpopulations are being driven by cell cycle effects.
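For example, the phase scores can be stored in the pData of the SCESet so that they remain available for such checks, as sketched below. (Note that cyclone was run on all cells, as only genes were subsetted when removing rows without Ensembl identifiers.)
sce$G1score <- assignments$score$G1
sce$G2Mscore <- assignments$score$G2M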
For large experiments, data exploration has two functions – to identify interesting biology, and also to check the effect of various technical factors. PCA plots constructed from the expression data suggest that distinct subpopulations are present (Figure 17). Some of the substructure may be due to differences in the tissue from which the cells were extracted, e.g., cells from the cortex and hippocampus dominate different parts of the plot. In contrast, cells taken from mice of different sexes mix throughout the plot, indicating that sex has little effect on the overall differences across the data set.
pca1 <- plotPCA(sce, exprs_values="exprs", colour_by="tissue") + fontsize
pca2 <- plotPCA(sce, exprs_values="exprs", colour_by="sex") + fontsize
multiplot(pca1, pca2, cols=2)
Figure 17: PCA plots constructed from the normalized expression values for all remaining cells in the brain data set. Left: cells are coloured according to the tissue of origin (cortex or hippocampus). Right: cells are coloured according to the sex of the mouse – male (-1), female (1) or unassigned (0).
Similar results are observed with t-SNE plots (Figure 18). Again, users should set the seed to a constant value to ensure that the results are reproducible.
set.seed(100)
tsne1 <- plotTSNE(sce, exprs_values="exprs", colour_by="tissue") + fontsize
set.seed(100)
tsne2 <- plotTSNE(sce, exprs_values="exprs", colour_by="sex") + fontsize
multiplot(tsne1, tsne2, cols=2)
Figure 18: t-SNE plots constructed from the normalized expression values for all remaining cells in the brain data set. Left: cells are coloured according to the tissue of origin (cortex or hippocampus). Right: cells are coloured according to the sex of the mouse – male (-1), female (1) or unassigned (0).
An additional effect to consider is the fact that cells were processed on many different C1 chips. This can lead to batch effects due to technical differences in library preparation between chips. To check that this is not the case, we examine the spread of cells from each chip on the PCA plot (Figure 19). Cells from different chips seem to mix together, which suggests that the substructure is not being driven by a batch effect. Note that we separate cells by tissue because the chip factor is nested within the tissue factor. If all cells were plotted together, differences between tissues would dominate the plot such that more subtle differences between chips may not be visible.
sce$chip <- sub("_.*", "", sce$cell_id)
pca1 <- plotPCA(sce[,sce$tissue=="sscortex"], exprs_values="exprs",
colour_by="chip", legend="none") + fontsize + ggtitle("Cortex")
pca2 <- plotPCA(sce[,sce$tissue!="sscortex"], exprs_values="exprs",
colour_by="chip", legend="none") + fontsize + ggtitle("Hippocampus")
multiplot(pca1, pca2, cols=2)
Figure 19: PCA plots constructed from the normalized expression values for all cells in the brain data set from the cortex (left) or hippocampus (right). Each cell is coloured according to the C1 chip on which its library was prepared.
In summary, the major difference between cells seems to be associated with the tissue of origin. Whether or not this is interesting depends on the biological hypothesis being studied. For the purposes of this workflow, we will treat the tissue of origin as an uninteresting confounding effect. This is because we are mainly interested in the cell subpopulations within each tissue. As such, we will block on tissue in all of our downstream analyses.
design <- model.matrix(~sce$tissue)
Once putative subpopulations are identified by clustering, we can identify some candidate marker genes that are unique to those subpopulations. This is done by testing for DE between each pair of subpopulations and selecting those genes that are consistently upregulated (or downregulated) in one subpopulation compared to all others. DE testing can be done using a number of packages, but for this workflow, we will use the edgeR package (Robinson, McCarthy, and Smyth 2010). First, we set up a design matrix specifying which cells belong in which cluster. Each cluster* coefficient represents the average log-expression of all cells in the corresponding cluster. We also block on uninteresting factors such as the tissue of origin.
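The clustering step itself is not shown in this section, but the code below assumes that a vector of cluster assignments (my.clusters), the corresponding dendrogram (my.tree) and a colour palette (clust.col) are available. One hypothetical way to obtain them is sketched here, using hierarchical clustering of cells on their normalized log-expression profiles; the actual clusters used to generate the results shown below may have been derived differently.
gene.var <- apply(exprs(sce), 1, var)
chosen <- order(gene.var, decreasing=TRUE)[1:500]   # top variable genes (illustrative choice)
my.dist <- dist(t(exprs(sce)[chosen,]))
my.tree <- hclust(my.dist, method="ward.D2")
my.clusters <- unname(cutree(my.tree, k=5))         # k=5, matching the five clusters referenced below
clust.col <- rainbow(5)                             # one colour per cluster, for the heatmap later on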
cluster <- factor(my.clusters)
design <- model.matrix(~0 + cluster + sce$tissue)
colnames(design)
## [1] "cluster1" "cluster2" "cluster3" "cluster4"
## [5] "cluster5" "sce$tissuesscortex"
We set up a DGEList object for entry into the edgeR analysis. Spike-in transcripts are removed as they are not relevant for marker identification. The size factors are divided by the library sizes to obtain normalization factors for all cells. (The normalization factor is simply an alternative formulation of the size factor, and quantifies the bias that is not caused by differences in library size between samples.)
library(edgeR)
y <- convertTo(sce, type="edgeR")
edgeR uses negative binomial (NB) distributions to model the read counts for each sample. We estimate the NB dispersion parameter that quantifies the biological variability in expression across cells in the same cluster. Large dispersion estimates above 0.5 are often observed in scRNA-seq data due to technical noise, in contrast to bulk data where values of 0.05-0.2 are more typical. We then use the design matrix to fit a NB GLM to the counts for each gene (McCarthy, Chen, and Smyth 2012).
y <- estimateDisp(y, design, robust=TRUE)
fit <- glmFit(y, design)
summary(y$tagwise.dispersion)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06273 0.29090 0.45600 1.00300 0.73000 102.40000
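As a quick visual check of the fitted dispersions, edgeR's plotBCV function can be used to plot the biological coefficient of variation against abundance; this is only a suggested diagnostic and not part of the analysis proper.
plotBCV(y)   # BCV (square root of the dispersion) against average log-CPM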
To identify marker genes for a particular cluster, we test each gene for DE between that cluster and each other cluster. This is done using the likelihood ratio test (LRT) for each comparison, as demonstrated below for cluster 2. The same process can be repeated for each cluster by changing chosen.clust, to identify markers specific to the corresponding subpopulation.
result.logFC <- result.PValue <- list()
chosen.clust <- which(levels(cluster)=="2") # character, as 'cluster' is a factor.
for (clust in seq_len(nlevels(cluster))) {
if (clust==chosen.clust) { next }
contrast <- numeric(ncol(design))
contrast[chosen.clust] <- 1
contrast[clust] <- -1
res <- glmLRT(fit, contrast=contrast)
con.name <- paste0('vs.', levels(cluster)[clust])
result.logFC[[con.name]] <- res$table$logFC
result.PValue[[con.name]] <- res$table$PValue
}
Potential marker genes for cluster 2 are ranked based on the maximum p-value across all comparisons. A gene that is DE between the chosen cluster and all others should have small p-values for all comparisons, and thus a small maximum p-value. In addition, we only focus on genes with the same sign of the log-fold change across all comparisons. This is necessary to identify specific markers that are unambiguously upregulated (or downregulated) in cluster 2 relative to the other clusters.
max.PValue <- do.call(pmax, result.PValue)
all.logFC <- do.call(cbind, result.logFC)
all.signs <- sign(all.logFC)
same.sign <- rowSums(all.signs[,1]!=all.signs)==0L
marker.set <- data.frame(Gene=rownames(y), logFC=all.logFC,
PValue=max.PValue, stringsAsFactors=FALSE)
marker.set <- marker.set[same.sign,]
marker.set <- marker.set[order(marker.set$PValue),]
head(marker.set)
## Gene logFC.vs.1 logFC.vs.3 logFC.vs.4 logFC.vs.5 PValue
## 1344 Taldo1 2.616537 4.289601 3.279624 3.021371 4.500667e-207
## 1421 Mog 5.305202 9.738385 8.349295 7.466953 1.637919e-184
## 1345 Mbp 4.797108 7.835503 5.558221 5.468495 1.291619e-180
## 1387 Mobp 5.067485 8.699098 6.779929 6.486105 6.987337e-176
## 1410 Dbndd2 2.817153 4.701019 5.024328 2.965054 4.984182e-168
## 1473 Qdpr 3.521928 5.467438 4.555087 3.982385 2.648376e-166
We save the list of candidate marker genes for further examination. We also examine their expression profiles to verify that the DE is not being driven by outlier cells. Figure 23 indicates that all of the top markers have strong and consistent differences between cells in cluster 2 and those in every other cluster. Indeed, some robustness to outliers is expected from edgeR, as any outliers will inflate the dispersion and increase the maximum p-value for the affected genes.
write.table(marker.set, file="brain_marker_2.tsv", sep="\t", quote=FALSE, row.names=FALSE)
top.markers <- marker.set$Gene[1:20]
norm.exprs <- exprs(sce)[top.markers,,drop=FALSE]
heat.vals <- norm.exprs - rowMeans(norm.exprs)
library(gplots)
heatmap.2(heat.vals, col=bluered, symbreak=TRUE, trace='none', cexRow=1,
ColSideColors=clust.col[my.clusters], Colv=as.dendrogram(my.tree), dendrogram='none')
legend("bottomleft", col=clust.col, legend=sort(unique(my.clusters)), pch=16)
Figure 23: Heatmap of mean-centred normalized log-expression values for the top set of markers for cluster 2 in the brain data set. Column colours represent the cluster to which each cell is assigned.
An alternative approach is to identify DE genes across any clusters using an ANOVA-like contrast. This is less stringent than identifying markers for a specific cluster, which may overlook important genes that are expressed in two or more clusters. (For example, in a mixed population of CD4+-only, CD8+-only, double-positive and double-negative T-cells, neither Cd4 nor Cd8 would be detected as subpopulation-specific markers because each gene is expressed in two subpopulations.) Here, we report the log-fold changes for each cluster against the average of all other clusters for each gene. This facilitates interpretation of the results as the relevant cluster(s) expressing each gene can be quickly determined.
# Automatic construction of the contrast matrix.
nclusters <- nlevels(cluster)
contrast.matrix <- matrix(0, ncol(design), nclusters)
contrast.matrix[1,] <- -1
diag(contrast.matrix) <- 1
contrast.matrix <- contrast.matrix[,-1]
res.any <- glmLRT(fit, contrast=contrast.matrix)
# Computing log-fold changes between each cluster and the average of the rest.
cluster.expression <- fit$coefficients[,seq_len(nclusters)]
other.expression <- (rowSums(cluster.expression) - cluster.expression)/(nclusters-1)
log.fold.changes <- cluster.expression - other.expression
colnames(log.fold.changes) <- paste0("LogFC.for.", levels(cluster))
rownames(log.fold.changes) <- NULL
# Ordering by the likelihood ratio; p-values affected by numerical imprecision.
any.de <- data.frame(Gene=rownames(y), log.fold.changes,
LR=res.any$table$LR, stringsAsFactors=FALSE)
any.de <- any.de[order(any.de$LR, decreasing=TRUE),]
head(any.de)
## Gene LogFC.for.1 LogFC.for.2 LogFC.for.3 LogFC.for.4 LogFC.for.5 LR
## 1339 Scd2 -0.43975144 2.224036 -2.017434 -1.1847370 1.4178862 8643.685
## 1410 Dbndd2 0.24637738 2.687254 -1.385868 -1.6659942 0.1182310 6371.499
## 1344 Taldo1 0.02156529 2.288622 -1.428034 -0.5529562 -0.3291965 6245.922
## 1345 Mbp -0.05652804 4.099849 -2.689097 -0.7159830 -0.6382407 6164.056
## 1484 Rnf13 -0.31093708 2.452358 -1.838081 -0.5499326 0.2465924 5991.242
## 1421 Mog 0.75099454 5.347602 -3.090065 -1.8865112 -1.1220197 5955.469
It must be stressed that the p-values cannot be interpreted as measures of significance. This is because the clusters have been empirically identified from the data. edgeR does not account for the uncertainty and stochasticity in clustering, which means that the p-values are much lower than they should be. The maximum p-value calculated here should only be used for ranking candidate markers for follow-up studies. However, this is not a concern in other analyses where the groups are pre-defined. For such analyses, the FDR-corrected p-value can be directly used to define significant genes for each DE comparison, though some care may be required to deal with plate effects (Hicks, Teng, and Irizarry 2015).
Having completed the basic analysis, we save the SCESet object with its associated data to file. This is especially important here as the brain data set is quite large. If further analyses are to be performed, it would be inconvenient to have to repeat all of the pre-processing steps described above.
saveRDS(file="brain_data.rds", sce)
Scaling normalization strategies for scRNA-seq data can be broadly divided into two classes. The first class assumes that there exists a subset of genes that are not DE between samples, as previously described. The second class uses the fact that the same amount of spike-in RNA was added to each cell. Differences in the coverage of the spike-in transcripts can only be due to cell-specific biases, e.g., in capture efficiency or sequencing depth. Scaling normalization is then applied to equalize spike-in coverage across cells.
The choice between these two normalization strategies depends on the biology of the cells and the features of interest. If there is no reliable house-keeping set, and if the majority of genes are expected to be DE, then spike-in normalization may be the only option for removing technical biases. Spike-in normalization should also be used if differences in the total RNA content of individual cells are of interest. This is because the same amount of spike-in RNA is added to each cell, such that the relative quantity of endogenous RNA can be easily quantified in each cell. For non-DE normalization, any change in total RNA content will affect all genes in the non-DE subset, such that it will be treated as bias and removed.
The use of spike-in normalization can be demonstrated on the HSC data set. We load in the SCESet object that we saved earlier, which contains the count data for filtered genes in high-quality HSCs. We then apply the computeSpikeFactors method to estimate size factors for all cells. This method computes the total count over all spike-in transcripts in each cell, and calculates size factors to equalize the total spike-in count across cells.
sce <- readRDS("hsc_data.rds")
deconv.sf <- sizeFactors(sce)
sce <- computeSpikeFactors(sce)
Both non-DE methods (like deconvolution) and spike-in normalization will capture technical biases such as sequencing depth and capture efficiency. Indeed, Figure 24 shows a rough positive correlation between the two sets of size factors, consistent with removal of technical biases by both methods. However, differences between the two sets are still present and are attributable to variability in total RNA content across the HSC population. Spike-in normalization will preserve differences in RNA content, whereas non-DE normalization will eliminate them.
plot(sizeFactors(sce), deconv.sf, pch=16, log="xy", xlab="Size factor (spike-in)",
ylab="Size factor (deconvolution)")
Figure 24: Size factors from spike-in normalization, plotted against the size factors from deconvolution for all cells in the HSC data set. Axes are shown on a log-scale.
Whether or not total RNA content is relevant depends on the biological hypothesis. In the analyses described above, variability in total RNA across the population was treated as noise and removed by non-DE normalization. This may not always be appropriate if total RNA is associated with a biological difference of interest. For example, Islam et al. (2011) describe a 5-fold difference in total RNA between mouse embryonic stem cells and fibroblasts. Spike-in normalization will preserve this difference and may provide more accurate quantification in downstream analyses.
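Following this reasoning, the ratio of the deconvolution size factor to the spike-in size factor provides a rough per-cell measure of relative endogenous RNA content, as sketched below using the objects computed above; the values are only interpretable up to an arbitrary scaling constant.
relative.rna <- deconv.sf/sizeFactors(sce)     # deconvolution factors relative to spike-in factors
summary(relative.rna/mean(relative.rna))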
Cell cycle phase is usually uninteresting in studies focusing on other aspects of biology. However, the effects of cell cycle on the expression profile can mask other effects and interfere with the interpretation of the results. This cannot be avoided by simply removing cell cycle marker genes, as the cell cycle can affect a substantial number of other transcripts (Buettner et al. 2015). Rather, more sophisticated strategies are required, which are demonstrated below using data from a study of T Helper 2 (TH2) cells (Mahata et al. 2014). Buettner et al. (2015) have already applied quality control and normalized the data, so we can use them directly as log-expression values (accessible as Supplementary Data 1 of https://dx.doi.org/10.1038/nbt.3102).
library(openxlsx)
incoming <- read.xlsx("nbt.3102-S7.xlsx", sheet=1, rowNames=TRUE)
incoming <- incoming[,!duplicated(colnames(incoming))] # Remove duplicated genes.
sce <- newSCESet(exprsData=t(incoming))
We empirically identify the cell cycle phase using the pair-based classifier in cyclone. The majority of cells in Figure 25 seem to lie in G1 phase, with small numbers of cells in the other phases.
anno <- select(org.Mm.eg.db, keys=rownames(sce), keytype="SYMBOL", column="ENSEMBL")
ensembl <- anno$ENSEMBL[match(rownames(sce), anno$SYMBOL)]
keep <- !is.na(ensembl) # Remove genes without ENSEMBL IDs.
assignments <- cyclone(exprs(sce)[keep,], mm.pairs, gene.names=ensembl[keep])
plot(assignments$score$G1, assignments$score$G2M, xlab="G1 score", ylab="G2/M score", pch=16)
Figure 25: Cell cycle phase scores from applying the pair-based classifier on the TH2 data set, where each point represents a cell.
We can block directly on the phase scores in downstream analyses, which is more graduated than using a strict assignment of each cell to a specific phase. This will absorb any phase-related effects on expression such that they will not affect estimation of the effects of other experimental factors. Note that users should ensure that the phase score is not confounded with other factors of interest. For example, model fitting is not possible if all cells in one experimental condition are in one phase, and all cells in another condition are in a different phase.
design <- model.matrix(~ G1 + G2M, assignments$score)
fit.block <- trendVar(sce, use.spikes=NA, trend="loess", design=design)
dec.block <- decomposeVar(sce, fit.block)
For analyses that do not use design matrices, we can remove the cell cycle effect directly from the expression values using removeBatchEffect. The result of this procedure can be visualized with some PCA plots in Figure 26. Before removal, cells in the G1 and non-G1 phases tend to be concentrated in different parts of the plot. Afterwards, more intermingling is observed between the phases which suggests that the cell cycle effect has been mitigated.
# Identify HVGs without blocking and make the "before" PCA plot, coloured by G1 score.
fit <- trendVar(sce, use.spikes=NA, trend="loess")
dec <- decomposeVar(sce, fit)
top.hvgs <- order(dec$bio, decreasing=TRUE)[1:500]
sce$G1score <- assignments$score$G1
out <- plotPCA(sce, select=top.hvgs, colour_by="G1score") + fontsize + ggtitle("Before removal")
# Remove the cell cycle effect and make the "after" plot with the blocked HVGs.
top.hvgs2 <- order(dec.block$bio, decreasing=TRUE)[1:500]
corrected <- removeBatchEffect(exprs(sce), covariates=assignments$score[,c("G1", "G2M")])
sce2 <- newSCESet(exprsData=corrected, phenoData=phenoData(sce))
out2 <- plotPCA(sce2, select=top.hvgs2, colour_by="G1score") + fontsize + ggtitle("After removal")
multiplot(out, out2, cols=2)
Figure 26: PCA plots before (left) and after (right) removal of the cell cycle effect in the TH2 data set. Each point represents a cell, coloured according to its G1 score. Only the top 500 HVGs were used to make each PCA plot.
As an aside, this data set contains cells at various stages of differentiation (Mahata et al. 2014). This is an ideal use case for diffusion maps, which perform dimensionality reduction along a continuous process. In Figure 27, cells are arranged along a trajectory in the low-dimensional space. The first diffusion component is likely to correspond to TH2 differentiation, given that a key regulator, Gata3 (J. Zhu et al. 2006), changes in expression from left to right.
plotDiffusionMap(sce2, colour_by="Gata3") + fontsize
Figure 27: A diffusion map for the TH2 data set, where each cell is coloured by its expression of Gata3.
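If the diffusion components themselves are needed, e.g., to order cells along the putative trajectory, they can be computed directly with the destiny package that underlies the plot above. The sketch below assumes that ranking cells by the first component gives a crude ordering along the trajectory; dedicated pseudotime methods may be more appropriate for formal analyses.
# Compute the diffusion map directly and order cells along the first component.
library(destiny)
dm <- DiffusionMap(t(exprs(sce2)))   # cells as rows, genes as columns
dc1 <- eigenvectors(dm)[,1]          # first diffusion component
pseudo.order <- order(dc1)           # crude ordering of cells along the trajectory
head(colnames(sce2)[pseudo.order])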
Feature-counting tools typically report genes in terms of standard identifiers like Ensembl or Entrez. These identifiers are used as they are unambiguous and highly stable. However, they are difficult to interpret compared to gene symbols, which are more often used in the literature. We can easily convert from one to the other using annotation packages like org.Mm.eg.db. This is demonstrated below for Ensembl identifiers in the mouse embryonic stem cell (mESC) data set (Kolodziejczyk et al. 2015) obtained from http://www.ebi.ac.uk/teichmann-srv/espresso. The select call extracts the specified data from the annotation object, and the match call ensures that the first gene symbol is used if multiple symbols correspond to a single Ensembl identifier.
incoming <- read.table("counttable_es.csv", header=TRUE, row.names=1)
my.ids <- rownames(incoming)
library(org.Mm.eg.db)
anno <- select(org.Mm.eg.db, keys=my.ids, keytype="ENSEMBL", column="SYMBOL")
anno <- anno[match(my.ids, anno$ENSEMBL),]
head(anno)
## ENSEMBL SYMBOL
## 1 ENSMUSG00000000001 Gnai3
## 2 ENSMUSG00000000003 Pbsn
## 3 ENSMUSG00000000028 Cdc45
## 4 ENSMUSG00000000031 <NA>
## 5 ENSMUSG00000000037 Scml2
## 6 ENSMUSG00000000049 Apoh
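Some Ensembl identifiers have no matching symbol, as seen for ENSMUSG00000000031 above. One possible convention, sketched below, is to fall back to the Ensembl identifier itself so that every feature retains a usable name; this is an illustrative choice rather than a required part of the workflow.
# Fall back to the Ensembl ID when no symbol is available (illustrative convention).
new.names <- anno$SYMBOL
missing.sym <- is.na(new.names)
new.names[missing.sym] <- my.ids[missing.sym]
head(new.names)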
To identify which rows correspond to mitochondrial genes, we need to use extra annotation describing the genomic location of each gene. For Ensembl, this involves using the TxDb.Mmusculus.UCSC.mm10.ensGene package.
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
location <- select(TxDb.Mmusculus.UCSC.mm10.ensGene, keys=my.ids,
column="CDSCHROM", keytype="GENEID")
location <- location[match(my.ids, location$GENEID),]
is.mito <- location$CDSCHROM == "chrM" & !is.na(location$CDSCHROM)
sum(is.mito)
## [1] 13
Identification of which rows correspond to spike-in transcripts is much easier, given that the ERCC spike-ins were used.
is.spike <- grepl("^ERCC", my.ids)
sum(is.spike)
## [1] 92
All of this information can be consolidated into a SCESet object for further manipulation.
anno <- anno[,-1,drop=FALSE] # Keep only the symbol column as feature metadata.
rownames(anno) <- my.ids
sce <- newSCESet(countData=incoming, featureData=AnnotatedDataFrame(anno))
isSpike(sce) <- is.spike
We remove rows that do not correspond to endogenous genes or spike-in transcripts. This includes rows containing mapping statistics, e.g., the number of unaligned or unassigned reads. The object is then ready for downstream analyses as previously described.
sce <- sce[grepl("ENSMUS", rownames(sce)) | isSpike(sce),]
dim(sce)
## Features Samples
## 38653 704
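As an example of the downstream analyses previously described, the first step would typically be to compute per-cell quality control metrics. The sketch below uses calculateQCMetrics from scater with the mitochondrial and spike-in sets defined above as feature controls; the control-set names are arbitrary, and the mitochondrial vector is subsetted to match the retained rows.
# Compute per-cell QC metrics, using the mitochondrial and spike-in sets as controls.
keep.row <- grepl("ENSMUS", my.ids) | is.spike
sce <- calculateQCMetrics(sce, feature_controls=list(Mt=is.mito[keep.row], Spike=isSpike(sce)))
head(colnames(pData(sce)))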
This workflow provides a step-by-step guide for performing basic analyses of single-cell RNA-seq data. It covers a number of low-level steps, including quality control, normalization, cell cycle phase assignment, data exploration, HVG and DEG detection, and clustering. These steps are demonstrated on several different data sets to provide a range of usage examples. In addition, the processed data can easily be used for higher-level analyses with other Bioconductor packages. We anticipate that this workflow will assist readers in assembling analyses of their own scRNA-seq data.
All software packages used in this workflow are publicly available from the Comprehensive R Archive Network (https://cran.r-project.org) or the Bioconductor project (http://bioconductor.org). The specific version numbers of the packages used are shown below, along with the version of the R installation. The workflow takes less than an hour and 5 GB of memory to run on a desktop computer.
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dynamicTreeCut_1.63-1 RBGL_1.48.1
## [3] graph_1.50.0 TxDb.Mmusculus.UCSC.mm10.ensGene_3.2.2
## [5] GenomicFeatures_1.24.5 org.Mm.eg.db_3.3.0
## [7] AnnotationDbi_1.34.4 R.utils_2.4.0
## [9] R.oo_1.20.0 R.methodsS3_1.7.1
## [11] openxlsx_3.0.0 gdata_2.17.0
## [13] gplots_3.0.1 destiny_1.2.1
## [15] mvoutlier_2.0.6 sgeostat_1.0-27
## [17] Rtsne_0.11 edgeR_3.14.0
## [19] limma_3.28.21 scran_1.0.4
## [21] scater_1.0.4 ggplot2_2.1.0
## [23] DESeq2_1.12.4 SummarizedExperiment_1.2.3
## [25] Biobase_2.32.0 GenomicRanges_1.24.3
## [27] GenomeInfoDb_1.8.7 IRanges_2.6.1
## [29] S4Vectors_0.10.3 BiocGenerics_0.18.0
## [31] BiocParallel_1.6.6 knitr_1.14
## [33] BiocStyle_2.1.32
##
## loaded via a namespace (and not attached):
## [1] Hmisc_3.17-4 RcppEigen_0.3.2.9.0 plyr_1.8.4
## [4] igraph_1.0.1 sp_1.2-3 shinydashboard_0.5.3
## [7] splines_3.3.0 digest_0.6.10 htmltools_0.3.5
## [10] viridis_0.3.4 magrittr_1.5 cluster_2.0.4
## [13] Biostrings_2.40.2 annotate_1.50.0 matrixStats_0.50.2
## [16] colorspace_1.2-6 rrcov_1.4-3 dplyr_0.5.0
## [19] RCurl_1.95-4.8 tximport_1.0.3 genefilter_1.54.2
## [22] lme4_1.1-12 survival_2.39-5 zoo_1.7-13
## [25] gtable_0.2.0 zlibbioc_1.18.0 XVector_0.12.1
## [28] MatrixModels_0.4-1 car_2.1-3 kernlab_0.9-24
## [31] prabclus_2.2-6 DEoptimR_1.0-6 SparseM_1.72
## [34] VIM_4.5.0 scales_0.4.0 mvtnorm_1.0-5
## [37] DBI_0.5-1 GGally_1.2.0 Rcpp_0.12.7
## [40] sROC_0.1-2 xtable_1.8-2 laeken_0.4.6
## [43] foreign_0.8-66 proxy_0.4-16 mclust_5.2
## [46] Formula_1.2-1 vcd_1.4-3 FNN_1.1
## [49] RColorBrewer_1.1-2 fpc_2.1-10 acepack_1.3-3.3
## [52] modeltools_0.2-21 reshape_0.8.5 XML_3.98-1.4
## [55] flexmix_2.3-13 nnet_7.3-12 locfit_1.5-9.1
## [58] labeling_0.3 reshape2_1.4.1 munsell_0.4.3
## [61] tools_3.3.0 RSQLite_1.0.0 pls_2.5-0
## [64] evaluate_0.9 stringr_1.1.0 cvTools_0.3.2
## [67] yaml_2.1.13 robustbase_0.92-6 caTools_1.17.1
## [70] nlme_3.1-128 mime_0.5 quantreg_5.29
## [73] formatR_1.4 biomaRt_2.28.0 pbkrtest_0.4-6
## [76] e1071_1.6-7 statmod_1.4.26 tibble_1.2
## [79] robCompositions_2.0.2 geneplotter_1.50.0 pcaPP_1.9-60
## [82] stringi_1.1.1 lattice_0.20-34 trimcluster_0.1-2
## [85] Matrix_1.2-6 nloptr_1.0.4 lmtest_0.9-34
## [88] data.table_1.9.6 bitops_1.0-6 rtracklayer_1.32.2
## [91] httpuv_1.3.3 R6_2.1.3 latticeExtra_0.6-28
## [94] KernSmooth_2.23-15 gridExtra_2.2.1 boot_1.3-18
## [97] MASS_7.3-45 gtools_3.5.0 assertthat_0.1
## [100] chron_2.3-47 rhdf5_2.16.0 rjson_0.2.15
## [103] GenomicAlignments_1.8.4 Rsamtools_1.24.0 diptest_0.75-7
## [106] mgcv_1.8-12 grid_3.3.0 rpart_4.1-10
## [109] class_7.3-14 minqa_1.2.4 rmarkdown_1.0
## [112] scatterplot3d_0.3-37 shiny_0.14
No competing interests were disclosed.
A.T.L.L. and J.C.M. were supported by core funding from Cancer Research UK (award no. A17197). J.C.M. was also supported by core funding from EMBL.
We would like to thank Davis McCarthy, for assistance with coding for scater; Antonio Scialdone, for helpful discussions regarding spike-ins and HVGs; and Michael Epstein, for trialling the workflow on other data sets.
Anders, S., and W. Huber. 2010. “Differential expression analysis for sequence count data.” Genome Biol. 11 (10): R106.
Angel, P., and M. Karin. 1991. “The role of Jun, Fos and the AP-1 complex in cell-proliferation and transformation.” Biochim. Biophys. Acta 1072 (2-3): 129–57.
Angerer, P., L. Haghverdi, M. Buttner, F. J. Theis, C. Marr, and F. Buettner. 2015. “destiny: diffusion maps for large-scale single-cell data in R.” Bioinformatics, Dec.
Brennecke, P., S. Anders, J. K. Kim, A. A. Kolodziejczyk, X. Zhang, V. Proserpio, B. Baying, et al. 2013. “Accounting for technical noise in single-cell RNA-seq experiments.” Nat. Methods 10 (11): 1093–5.
Buettner, F., K. N. Natarajan, F. P. Casale, V. Proserpio, A. Scialdone, F. J. Theis, S. A. Teichmann, J. C. Marioni, and O. Stegle. 2015. “Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells.” Nat. Biotechnol. 33 (2): 155–60.
Fan, J., N. Salathia, R. Liu, G. E. Kaeser, Y. C. Yung, J. L. Herman, F. Kaper, et al. 2016. “Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis.” Nat. Methods, Jan.
Heng, T. S., M. W. Painter, K. Elpek, V. Lukacs-Kornek, N. Mauermann, S. J. Turley, D. Koller, et al. 2008. “The Immunological Genome Project: networks of gene expression in immune cells.” Nat. Immunol. 9 (10): 1091–4.
Hicks, S. C., M. Teng, and R. A. Irizarry. 2015. “On the widespread and critical impact of systematic bias and batch effects in single-cell RNA-Seq data.” BioRxiv. Cold Spring Harbor Labs Journals. doi:10.1101/025528.
Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating high-throughput genomic analysis with Bioconductor.” Nat. Methods 12 (2): 115–21.
Ilicic, T., J. K. Kim, A. A. Kolodziejczyk, F. O. Bagger, D. J. McCarthy, J. C. Marioni, and S. A. Teichmann. 2016. “Classification of low quality cells from single-cell RNA-seq data.” Genome Biol. 17 (1): 29.
Islam, S., U. Kjallquist, A. Moliner, P. Zajac, J. B. Fan, P. Lonnerberg, and S. Linnarsson. 2011. “Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq.” Genome Res. 21 (7): 1160–7.
Islam, S., A. Zeisel, S. Joost, G. La Manno, P. Zajac, M. Kasper, P. Lonnerberg, and S. Linnarsson. 2014. “Quantitative single-cell RNA-seq with unique molecular identifiers.” Nat. Methods 11 (2): 163–66.
Julia, M., A. Telenti, and A. Rausell. 2015. “Sincell: an R/Bioconductor package for statistical assessment of cell-state hierarchies from single-cell RNA-seq.” Bioinformatics 31 (20): 3380–2.
Kim, J. K., A. A. Kolodziejczyk, T. Illicic, S. A. Teichmann, and J. C. Marioni. 2015. “Characterizing noise structure in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression.” Nat. Commun. 6: 8687.
Kolodziejczyk, A. A., J. K. Kim, J. C. Tsang, T. Ilicic, J. Henriksson, K. N. Natarajan, A. C. Tuck, et al. 2015. “Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation.” Cell Stem Cell 17 (4): 471–85.
Langfelder, P., B. Zhang, and S. Horvath. 2008. “Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R.” Bioinformatics 24 (5): 719–20.
Law, C. W., Y. Chen, W. Shi, and G. K. Smyth. 2014. “voom: Precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biol. 15 (2): R29.
Leng, N., L. F. Chu, C. Barry, Y. Li, J. Choi, X. Li, P. Jiang, R. M. Stewart, J. A. Thomson, and C. Kendziorski. 2015. “Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments.” Nat. Methods 12 (10): 947–50.
Liao, Y., G. K. Smyth, and W. Shi. 2013. “The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote.” Nucleic Acids Res. 41 (10): e108.
———. 2014. “featureCounts: an efficient general purpose program for assigning sequence reads to genomic features.” Bioinformatics 30 (7): 923–30.
Love, M. I., S. Anders, V. Kim, and W. Huber. 2015. “RNA-Seq workflow: gene-level exploratory analysis and differential expression.” F1000Res 4: 1070.
Love, M. I., W. Huber, and S. Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biol. 15 (12): 550.
Lun, A. T. L., K. Bach, and J. C. Marioni. 2016. “Pooling Across Cells to Normalize Single-Cell RNA Sequencing Data with Many Zero Counts.” Genome Biol. 17: 75.
Mahata, B., X. Zhang, A. A. Kolodziejczyk, V. Proserpio, L. Haim-Vilmovsky, A. E. Taylor, D. Hebenstreit, et al. 2014. “Single-cell RNA sequencing reveals T helper cells synthesizing steroids de novo to contribute to immune homeostasis.” Cell Rep. 7 (4): 1130–42.
Marinov, G. K., B. A. Williams, K. McCue, G. P. Schroth, J. Gertz, R. M. Myers, and B. J. Wold. 2014. “From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing.” Genome Res. 24 (3): 496–510.
McCarthy, D. J., Y. Chen, and G. K. Smyth. 2012. “Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.” Nucleic Acids Res. 40 (10): 4288–97.
Phipson, B., and G. K. Smyth. 2010. “Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn.” Stat. Appl. Genet. Mol. Biol. 9: Article39.
Picelli, S., O. R. Faridani, A. K. Bjorklund, G. Winberg, S. Sagasser, and R. Sandberg. 2014. “Full-length RNA-seq from single cells using Smart-seq2.” Nat Protoc 9 (1): 171–81.
Pollen, A. A., T. J. Nowakowski, J. Shuga, X. Wang, A. A. Leyrat, J. H. Lui, N. Li, et al. 2014. “Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex.” Nat. Biotechnol. 32 (10): 1053–8.
Ritchie, M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth. 2015. “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Res. 43 (7): e47.
Robinson, M. D., and A. Oshlack. 2010. “A scaling normalization method for differential expression analysis of RNA-seq data.” Genome Biol. 11 (3): R25.
Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2010. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (1): 139–40.
Scialdone, A., K. N. Natarajan, L. R. Saraiva, V. Proserpio, S. A. Teichmann, O. Stegle, J. C. Marioni, and F. Buettner. 2015. “Computational assignment of cell-cycle stage from single-cell transcriptome data.” Methods 85 (Sep): 54–61.
Stegle, O., S. A. Teichmann, and J. C. Marioni. 2015. “Computational and analytical challenges in single-cell transcriptomics.” Nat. Rev. Genet. 16 (3): 133–45.
Trapnell, C., D. Cacchiarelli, J. Grimsby, P. Pokharel, S. Li, M. Morse, N. J. Lennon, K. J. Livak, T. S. Mikkelsen, and J. L. Rinn. 2014. “The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells.” Nat. Biotechnol. 32 (4): 381–86.
Vallejos, C. A., J. C. Marioni, and S. Richardson. 2015. “BASiCS: Bayesian Analysis of Single-Cell Sequencing Data.” PLoS Comput. Biol. 11 (6): e1004333.
Van der Maaten, L., and G. Hinton. 2008. “Visualizing Data Using t-SNE.” J. Mach. Learn. Res. 9: 2579–2605.
Wilson, N. K., D. G. Kent, F. Buettner, M. Shehata, I. C. Macaulay, F. J. Calero-Nieto, M. Sanchez Castillo, et al. 2015. “Combined Single-Cell Functional and Gene Expression Analysis Resolves Heterogeneity within Stem Cell Populations.” Cell Stem Cell 16 (6): 712–24.
Zeisel, A., A. B. Munoz-Manchado, S. Codeluppi, P. Lonnerberg, G. La Manno, A. Jureus, S. Marques, et al. 2015. “Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.” Science 347 (6226): 1138–42.
Zhu, J., H. Yamane, J. Cote-Sierra, L. Guo, and W. E. Paul. 2006. “GATA-3 promotes Th2 responses through three different mechanisms: induction of Th2 cytokine production, selective growth of Th2 cells and inhibition of Th1 cell-specific factors.” Cell Res. 16 (1): 3–10.