1 Introduction

Single-cell RNA sequencing (scRNA-seq) is widely used to measure the genome-wide expression profile of individual cells. From each cell, mRNA is isolated and reverse transcribed to cDNA for high-throughput sequencing (Stegle, Teichmann, and Marioni 2015). This can be done using microfluidics platforms like the Fluidigm C1 (Pollen et al. 2014), or with protocols based on microtiter plates like Smart-seq2 (Picelli et al. 2014). The number of reads mapped to each gene can then be used to quantify its expression in each cell. Alternatively, unique molecular identifiers (UMIs) can be used to directly measure the number of transcript molecules for each gene (Islam et al. 2014). Count data can be analyzed to identify new cell subpopulations via dimensionality reduction and clustering; to detect highly variable genes (HVGs) across a population; or to detect differentially expressed genes (DEGs) between conditions. This provides biological insights at a single-cell resolution that cannot be achieved with conventional bulk RNA sequencing of cell populations.

Strategies for scRNA-seq data analysis differ markedly from those for bulk RNA-seq. One technical reason is that scRNA-seq data is much noisier than bulk data (Brennecke et al. 2013; Marinov et al. 2014). Reliable capture (i.e., conversion) of transcripts into cDNA for sequencing is difficult with the low quantity of RNA in a single cell. This increases the frequency of drop-out events where none of the transcripts for a gene are captured. Dedicated steps are required to deal with this noise, especially during quality control. In addition, scRNA-seq data can be used to study cell-to-cell heterogeneity, e.g., to identify new cell subtypes, to characterize differentiation processes, to assign cells into their cell cycle phases, or to identify HVGs driving variability across the population (Vallejos, Marioni, and Richardson 2015; J. Fan et al. 2016; Trapnell et al. 2014). This is simply not possible with bulk data, such that custom methods are required to perform these analyses.

This article describes a computational workflow for basic analysis of scRNA-seq data using software packages from the open-source Bioconductor project (Huber et al. 2015). Starting from a count matrix, this workflow contains the steps required for quality control to remove problematic cells; normalization of cell-specific biases, with and without spike-ins; cell-cycle phase classification from gene expression data; data exploration to identify putative subpopulations; and finally, HVG and DEG identification to prioritize interesting genes. The application of different steps in the workflow will be demonstrated on several public scRNA-seq data sets involving haematopoietic stem cells, brain-derived cells, T-helper cells and mouse embryonic stem cells, generated with a range of experimental protocols and platforms (Wilson et al. 2015; Zeisel et al. 2015; Buettner et al. 2015; Kolodziejczyk et al. 2015). The aim is to provide a variety of modular usage examples that can be applied to construct custom analysis pipelines.

2 A simple analysis on haematopoietic stem cells

2.1 Overview

To introduce most of the concepts of scRNA-seq data analysis, we use a relatively simple data set from a study of haematopoietic stem cells (HSCs) (Wilson et al. 2015). Single mouse HSCs were isolated into microtiter plates and libraries were prepared for 96 cells using the Smart-seq2 protocol. A constant amount of spike-in RNA from the External RNA Controls Consortium (ERCC) was also added to each cell prior to library preparation. High-throughput sequencing was performed and the expression of each gene was quantified by counting the total number of reads mapped to its exonic regions. Similarly, the quantity of each spike-in transcript was measured by counting reads mapped to the spike-in reference sequence. Counts for all genes/transcripts in each cell were obtained from the NCBI Gene Expression Omnibus (GEO) as a supplementary file under the accession number GSE61533.

For simplicity, we forgo a description of the read processing steps required to generate the count matrix, i.e., read alignment and counting into features. These steps have been described in some detail elsewhere (Love et al. 2015), and are largely the same for bulk and single-cell data. The only additional consideration is that the spike-in information must be included in the pipeline. Typically, spike-in sequences can be included as additional FASTA files during genome index building prior to alignment, while genomic intervals for both spike-in transcripts and endogenous genes can be concatenated into a single GTF file prior to counting. For users favouring an R-based approach to read alignment and counting, we suggest using the methods in the Rsubread package (Liao, Smyth, and Shi 2013; Liao, Smyth, and Shi 2014).

2.2 Count loading and quality control

The first task is to load the count matrix into memory. This requires some work to decompress and retrieve the data from the Excel format. Each row of the matrix represents an endogenous gene or a spike-in transcript, and each column represents a single HSC. For convenience, the counts for spike-in transcripts and endogenous genes are stored in a SCESet object from the scater package.

library(R.utils)
gunzip("GSE61533_HTSEQ_count_results.xls.gz", remove=FALSE, overwrite=TRUE)
library(gdata)
all.counts <- read.xls('GSE61533_HTSEQ_count_results.xls', sheet=1, header=TRUE, row.names=1)
library(scater)
sce <- newSCESet(countData=all.counts)
dim(sce)
## Features  Samples 
##    38498       96

We annotate those rows corresponding to ERCC spike-ins and mitochondrial genes. This information can be easily extracted from the row names, though in general, identifying mitochondrial genes from standard identifiers like Ensembl requires extra annotation. For each cell, we calculate quality control metrics such as the total number of counts or the proportion of counts in mitochondrial genes or spike-in transcripts. These metrics are stored in the pData of the SCESet for future reference.

is.spike <- grepl("^ERCC", rownames(sce))
isSpike(sce) <- is.spike
is.mito <- grepl("^mt-", rownames(sce))
sce <- calculateQCMetrics(sce, feature_controls=list(Spike=is.spike, Mt=is.mito))
head(colnames(pData(sce)))
## [1] "total_counts"             "log10_total_counts"       "filter_on_total_counts"  
## [4] "total_features"           "log10_total_features"     "filter_on_total_features"

Two common measures of cell quality are the library size and the number of expressed features in each library. The library size is defined as the total sum of counts across all features, i.e., genes and spike-in transcripts. Cells with small library sizes are considered to be of low quality as the RNA has not been efficiently captured (i.e., converted into cDNA and amplified) during library preparation. The number of expressed features in each cell is defined as the number of features with non-zero counts for that cell. Any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured. The distributions of both of these metrics are shown in Figure 1.

par(mfrow=c(1,2))
hist(sce$total_counts/1e6, xlab="Library sizes (millions)", main="", 
    breaks=20, col="grey80", ylab="Number of cells")
hist(sce$total_features, xlab="Number of expressed genes", main="", 
    breaks=20, col="grey80", ylab="Number of cells")

Figure 1: Histograms of library sizes (left) and number of expressed genes (right) for all cells in the HSC data set.

Picking a threshold for these metrics is not straightforward as their absolute values depend on the protocol and biological system. For example, sequencing to greater depth will lead to more reads, regardless of the quality of the cells. To obtain an adaptive threshold, we assume that most of the data set consists of high-quality cells. We remove cells with log-library sizes that are more than 3 median absolute deviations (MADs) below the median log-library size. The wide range of library sizes requires a log-transformation, as the MAD would be too large on the raw scale. We also remove cells where the number of expressed genes is more than 3 MADs below the median. This eliminates low-quality cells corresponding to small outliers.

libsize.drop <- isOutlier(sce$total_counts, n=3, type="lower", log=TRUE)
feature.drop <- isOutlier(sce$total_features, n=3, type="lower")
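To make the thresholding rule concrete, the following base R sketch applies the MAD-based definition to a handful of made-up library sizes. The helper name mad_outlier and the toy values are hypothetical, and the use of log10 and of the default scaling constant in mad() are assumptions. It illustrates the rule described above and is not the code of isOutlier itself.

```r
# Hypothetical helper: flag values more than `nmads` MADs from the median.
mad_outlier <- function(x, nmads = 3, type = c("lower", "higher"), log = FALSE) {
    type <- match.arg(type)
    if (log) x <- log10(x)          # assumption: base-10 log for wide-ranging values
    med <- median(x)
    dev <- mad(x, center = med)     # mad() scales by 1.4826 by default
    if (type == "lower") {
        x < med - nmads * dev
    } else {
        x > med + nmads * dev
    }
}

# Toy library sizes: seven typical cells and one with very few reads.
lib.sizes <- c(1.2e6, 1.5e6, 1.1e6, 1.4e6, 1.3e6, 1.6e6, 1.25e6, 2e4)
flagged <- mad_outlier(lib.sizes, nmads = 3, type = "lower", log = TRUE)
which(flagged)
```

Only the last cell falls below the lower threshold here. Because the threshold is derived from the data, the same rule adapts to experiments sequenced at different depths.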

Another measure of quality is the proportion of reads mapped to genes in the mitochondrial genome. High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of increased apoptosis and/or loss of cytoplasmic RNA from lysed cells. A similar case can be made for the proportion of reads mapped to spike-in transcripts. The quantity of spike-in RNA added to each cell should be constant, which means that the proportion should increase upon loss of endogenous RNA in low-quality cells. The distributions of mitochondrial and spike-in proportions across all cells are shown in Figure 2.

par(mfrow=c(1,2))
hist(sce$pct_counts_feature_controls_Mt, xlab="Mitochondrial proportion (%)", 
    ylab="Number of cells", breaks=20, main="", col="grey80")
hist(sce$pct_counts_feature_controls_Spike, xlab="ERCC proportion (%)", 
    ylab="Number of cells", breaks=20, main="", col="grey80")

Figure 2: Histogram of the proportion of reads mapped to mitochondrial genes (left) or spike-in transcripts (right) across all cells in the HSC data set.

Again, the ideal threshold for these proportions depends on the cell type and the experimental protocol. Cells with more mitochondria or more mitochondrial activity may naturally have larger mitochondrial proportions. Similarly, cells with more endogenous RNA or in protocols using less spike-in RNA will have lower spike-in proportions. If we assume that most cells in the data set are of high quality, then the threshold can be set to remove any large outliers from the distribution of proportions. We use the MAD-based definition of outliers to remove putative low-quality cells from the data set.

mito.drop <- isOutlier(sce$pct_counts_feature_controls_Mt, n=3, type="higher")
spike.drop <- isOutlier(sce$pct_counts_feature_controls_Spike, n=3, type="higher")

Subsetting by column will retain only the high-quality cells that pass each filter described above. We can examine the number of cells removed by each filter, and the total number remaining in the data set. Removal of a substantial proportion of cells (> 10%) may be indicative of an overall issue with data quality. It may also reflect genuine biology in extreme cases (e.g., low numbers of expressed genes in erythrocytes) for which the filters described here are not appropriate.

sce <- sce[,!(libsize.drop | feature.drop | mito.drop | spike.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop),
    ByMito=sum(mito.drop), BySpike=sum(spike.drop), Remaining=ncol(sce))
##         ByLibSize ByFeature ByMito BySpike Remaining
## Samples         2         2      6       3        86

An alternative approach to quality control is to perform a principal components analysis (PCA) based on the quality metrics for each cell, e.g., the total number of reads, the total number of features, the proportion of mitochondrial or spike-in reads. Outliers on a PCA plot may be indicative of low-quality cells that have aberrant technical properties compared to the (presumed) majority of high-quality cells. In Figure 3, no obvious outliers are present, which is consistent with the removal of suspect cells in the preceding quality control steps.

fontsize <- theme(axis.text=element_text(size=12), axis.title=element_text(size=16))
plotPCA(sce, pca_data_input="pdata") + fontsize

Figure 3: PCA plot for all remaining cells in the HSC data set, constructed using quality metrics. The first and second components are shown on each axis, along with the percentage of total variance explained by each component. Bars represent the coordinates of the cells on each axis.

Methods like PCA-based outlier detection and support vector machines can provide more power to distinguish low-quality cells from high-quality counterparts (Ilicic et al. 2016). This is because they are able to detect subtle patterns across many quality metrics simultaneously. However, this comes at some cost to interpretability, as the reason for removing a given cell may not always be obvious. Thus, for this workflow, we will use the simple approach whereby each quality metric is considered separately. Users interested in the more sophisticated approaches are referred to the scater and cellity packages.

2.3 Filtering out low-abundance genes

Low-abundance genes are removed as the counts are too low for reliable statistical inferences. In addition, the discreteness of the counts may interfere with downstream statistical procedures, e.g., by compromising the accuracy of asymptotic approximations. Here, low-abundance genes are defined as those with an average count across cells below 1. Removing them avoids problems with discreteness and also reduces the amount of computational work.

keep <- rowMeans(counts(sce)) >= 1
sce <- sce[keep,] 
sum(keep)
## [1] 13997

An alternative approach to gene filtering is to select genes that have non-zero counts in at least n cells. This provides some more protection against genes with outlier expression patterns, i.e., strong expression in only one or two cells. Such outliers are typically uninteresting as they can arise from amplification artifacts that are not replicable across cells. (The exception is for studies involving rare cells where the outliers may be biologically relevant.) An example of this filtering approach is shown below for n set to 10.

alt.keep <- rowSums(is_exprs(sce)) >= 10
sum(alt.keep)
## [1] 11419

The relationship between the proportion of expressing cells and the mean can be examined more closely in Figure 4. The two statistics tend to be well-correlated, so filtering on either should give roughly similar results.

plotQC(sce, type = "exprs-freq-vs-mean") + fontsize

Figure 4: Frequency of expression against the mean expression for each gene. Circles represent endogenous genes and triangles represent spike-in transcripts or mitochondrial genes. The bars on each axis represent the location of each gene on that axis. Genes with expression frequencies higher than the dropout rate are defined as those above a non-linear trend fitted to the spike-in transcripts.

In general, we prefer the mean-based filter as it tends to be less aggressive. A gene will be retained as long as it has sufficient expression in any subset of cells. The “at least n” filter depends heavily on the choice of n – in this case, a gene expressed in a subset of 9 cells would be lost. While the mean-based filter will retain more outlier-driven genes, this can be handled by choosing methods that are robust to outliers in the downstream analyses.
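The difference between the two filters can be seen on a small made-up count matrix. The gene names and counts below are hypothetical and chosen only to show one gene of each kind; the thresholds (a mean of 1, and n set to 10) follow the text.

```r
# Toy counts for 4 genes across 20 cells.
counts <- rbind(
    broad   = rep(2, 20),                 # modest expression in every cell
    outlier = c(100, rep(0, 19)),         # strong expression in a single cell
    subset9 = c(rep(5, 9), rep(0, 11)),   # expressed in a subset of 9 cells
    rare    = c(3, rep(0, 19))            # weak expression in a single cell
)

mean.keep <- rowMeans(counts) >= 1        # mean-based filter
n.keep <- rowSums(counts > 0) >= 10       # "at least n" filter with n = 10

data.frame(Mean=rowMeans(counts), ByMean=mean.keep, ByN=n.keep)
```

In this toy example, the mean-based filter retains the gene expressed in 9 cells but also the outlier-driven gene, whereas the "at least n" filter discards both.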

2.4 Normalization of cell-specific biases

Read counts are subject to differences in capture efficiency and sequencing depth between cells (Stegle, Teichmann, and Marioni 2015). Normalization is required to eliminate these cell-specific biases prior to downstream quantitative analyses. This is often done by assuming that most genes are not differentially expressed (DE) between cells. Any systematic difference in count size across the non-DE majority of genes between two cells is assumed to represent bias and is removed by scaling. More specifically, “size factors” are calculated that represent the extent to which counts should be scaled in each library.

Size factors can be computed with several different approaches, e.g., using the estimateSizeFactorsForMatrix function in the DESeq2 package (Anders and Huber 2010; Love, Huber, and Anders 2014), or with the calcNormFactors function (Robinson and Oshlack 2010) in the edgeR package. However, single-cell data can be problematic for these bulk data-based methods due to the dominance of low and zero counts. To overcome this, we pool counts from many cells to increase the count size for accurate size factor estimation (Lun, Bach, and Marioni 2016). Pool-based size factors are then “deconvolved” into cell-based factors for cell-specific normalization.

sce <- computeSumFactors(sce, sizes=c(20, 40, 60, 80))
summary(sizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4268  0.8411  0.9955  1.0420  1.2140  2.1660
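The scaling assumption can be illustrated with a much simpler estimator than deconvolution. The sketch below is a median-of-ratios calculation in the spirit of the bulk methods cited above, applied to simulated counts that contain no zeros. It is not the pooling procedure used by computeSumFactors, and all object names and simulation settings are assumptions made for illustration.

```r
set.seed(1)
# Simulate 200 non-DE genes in 3 cells with known scaling biases.
true.bias <- c(1, 2, 0.5)
base.mean <- runif(200, 100, 500)
toy.counts <- sapply(true.bias, function(b) rpois(200, base.mean * b))

# Pseudo-reference profile: geometric mean of each gene across cells (log scale).
log.ref <- rowMeans(log(toy.counts))

# Size factor for each cell: median ratio of its counts to the reference.
sf <- apply(toy.counts, 2, function(y) exp(median(log(y) - log.ref)))
sf <- sf / mean(sf)     # centre the size factors at unity

round(sf / sf[1], 2)    # relative size factors, to compare against true.bias
```

Taking the median across genes is what makes the estimate insensitive to a DE minority. With the many zero counts typical of real single-cell data, the log-ratios would be undefined for many genes, which is the motivation for pooling cells before estimation.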

In this case, the size factors are tightly correlated with the library sizes for all cells (Figure 5). This suggests that the systematic differences between cells are primarily driven by differences in capture efficiency or sequencing depth. Any DE between cells would yield a non-linear trend between the total count and size factor, and/or increased scatter around the trend. This does not occur here as strong DE is unlikely to exist between cells of the same type.

plot(sizeFactors(sce), sce$total_counts/1e6, log="xy",
    ylab="Library size (millions)", xlab="Size factor")

Figure 5: Size factors from deconvolution, plotted against library sizes for all cells in the HSC data set. Axes are shown on a log-scale.

Normalized log-expression values can be computed for use in downstream analyses. Each value is defined as the log-ratio of each count to the size factor for the corresponding cell (after adding a small prior count to avoid undefined values at zero counts). Division by the size factor ensures that any cell-specific biases are removed. The log-transformation provides some measure of variance stabilization (Law et al. 2014), so that high-abundance genes with large variances do not dominate downstream analyses. The computed values are stored as an exprs matrix in addition to the other assay elements.

sce <- normalize(sce)
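As a worked example of the definition above, the following sketch computes normalized log-expression values by hand for a made-up matrix. The counts and size factors are hypothetical, and the prior count of 1 and the base-2 logarithm are assumptions about the transformation rather than a statement of what normalize does internally.

```r
# Toy counts for 3 genes in 2 cells; cellB is sequenced twice as deeply as cellA.
toy.counts <- matrix(c(0, 10, 4,
                       0, 20, 8),
                     nrow = 3,
                     dimnames = list(c("g1", "g2", "g3"), c("cellA", "cellB")))
toy.sf <- c(cellA = 2/3, cellB = 4/3)   # hypothetical size factors, mean of 1

# Divide each column by its size factor, add a prior count, then log-transform.
prior.count <- 1
norm.exprs <- log2(t(t(toy.counts) / toy.sf) + prior.count)
norm.exprs
```

After scaling, the two cells have identical values for every gene, as the two-fold difference in depth has been removed. The zero count maps to a log-expression of zero instead of an undefined value.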

2.5 Data exploration with dimensionality reduction techniques

Dimensionality reduction is often useful to examine major features of the data before more quantitative analyses. Of particular interest is whether the HSCs partition into distinct subpopulations. This can be visualized by constructing a PCA plot from the normalized log-expression values (Figure 6). Cells with similar expression profiles should be located close together in the plot, while dissimilar cells should be far apart. By default, the plotPCA function will only use the top 500 genes with the largest variances. This focuses on the genes that are driving heterogeneity in the population and should provide greater visual resolution of any systematic differences between groups of cells.

plotPCA(sce, exprs_values="exprs") + fontsize

Figure 6: PCA plot constructed from normalized log-expression values, where each point represents a cell in the HSC data set. First and second components are shown, along with the percentage of variance explained. Bars represent the coordinates of the cells on each axis. None of the cells are controls (e.g., empty wells) so the legend can be ignored.

Another popular approach to dimensionality reduction is the t-distributed stochastic neighbour embedding (t-SNE) method (Van der Maaten and Hinton 2008). t-SNE tends to work better than PCA for separating cells in large data sets with many subpopulations, at the cost of more computational effort and complexity. Like plotPCA, the plotTSNE function will use the genes with the largest variances to focus on heterogeneity in the population. However, unlike PCA, t-SNE is a stochastic method – users should run the algorithm several times to ensure that the results are representative, and then set a seed to ensure that the chosen results are reproducible. It is also advisable to test different settings of the “perplexity” parameter as this will affect the distribution of points in the low-dimensional space (Figure 7).

set.seed(100)
out5 <- plotTSNE(sce, exprs_values="exprs", perplexity=5) + fontsize + ggtitle("Perplexity = 5")
out10 <- plotTSNE(sce, exprs_values="exprs", perplexity=10) + fontsize + ggtitle("Perplexity = 10")
out20 <- plotTSNE(sce, exprs_values="exprs", perplexity=20) + fontsize + ggtitle("Perplexity = 20")
multiplot(out5, out10, out20, cols=3)

Figure 7: t-SNE plot constructed from normalized log-expression values using a range of perplexity values. In each plot, each point represents a cell in the HSC data set. Bars represent the coordinates of the cells on each axis.

For this data set, all methods suggest that there is no separation into distinct subpopulations. This might be expected for a homogeneous population of cells of the same type. Of course, there are many dimensionality reduction techniques that we have not considered here but could also be used, e.g., multidimensional scaling, diffusion maps. These have their own advantages and disadvantages – for example, diffusion maps (see plotDiffusionMap) place cells along a continuous trajectory and are suited for visualizing graduated processes like differentiation (Angerer et al. 2015).

2.6 Classification of cell cycle phase

We use the prediction method described by Scialdone et al. (2015) to classify cells into cell cycle phases based on the gene expression data. Using a training data set, the sign of the difference in expression between two genes was computed for each pair of genes. Pairs with changes in the sign across cell cycle phases were chosen as markers. Cells in a test data set can then be classified into the appropriate phase, based on whether the observed sign for each marker pair is consistent with one phase or another. This approach is implemented in the cyclone function using a pre-trained set of marker pairs for mouse data. The result of phase assignment for each cell in the HSC data set is shown in Figure 8. (Some additional work is necessary to match the gene symbols in the data to the Ensembl annotation in the set of pairs.)

mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds", package="scran"))
library(org.Mm.eg.db)
anno <- select(org.Mm.eg.db, keys=rownames(sce), keytype="SYMBOL", columns="ENSEMBL")
ensembl <- anno$ENSEMBL[match(rownames(sce), anno$SYMBOL)]
keep <- !is.na(ensembl)
assignments <- cyclone(sce[keep,], mm.pairs, gene.names=ensembl[keep])
plot(assignments$score$G1, assignments$score$G2M, xlab="G1 score", ylab="G2/M score", pch=16)

Figure 8: Cell cycle phase scores from applying the pair-based classifier on the HSC data set, where each point represents a cell.
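The core idea of the pair-based classifier can be sketched in a few lines of base R. The marker pairs, gene names and expression values below are hypothetical, and the score is simply the fraction of pairs whose sign agrees with the phase. This omits any further calibration of the scores and should not be read as the implementation of cyclone.

```r
# Hypothetical G1 marker pairs: in G1, `first` is expected to exceed `second`.
g1.pairs <- data.frame(first = c("A", "B", "C"),
                       second = c("D", "E", "F"),
                       stringsAsFactors = FALSE)

# Fraction of informative pairs whose sign is consistent with the phase.
pair_score <- function(expr, pairs) {
    diff <- expr[pairs$first] - expr[pairs$second]
    informative <- diff != 0            # ties carry no sign information
    mean(diff[informative] > 0)
}

cell1 <- c(A = 5, B = 7, C = 3, D = 1, E = 2, F = 0)   # every pair agrees with G1
cell2 <- c(A = 0, B = 1, C = 4, D = 6, E = 3, F = 2)   # only one pair agrees

score1 <- pair_score(cell1, g1.pairs)
score2 <- pair_score(cell2, g1.pairs)
c(score1, score2)
```

Because only the sign of each within-cell difference is used, the score is unaffected by cell-specific scaling of the expression values. This is one reason why such a classifier can be applied across data sets.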

Cells are classified as being in G1 phase if the G1 score is above 0.5; in G2/M phase if the G2/M score is above 0.5; and in S phase, if neither is above 0.5. Here, the vast majority of cells are classified as being in G1 phase. We will focus on these cells in the downstream analysis. Cells in other phases are removed to avoid potential confounding effects from cell cycle-induced differences. Alternatively, if a non-negligible number of cells are in other phases, we can use the assigned phase as a blocking factor in downstream analyses. This protects against cell cycle effects without discarding information.

g1.only <- assignments$score$G1 > 0.5
sce <- sce[,g1.only]

Pre-trained classifiers are available in scran for human and mouse data. The mouse classifier used here was trained on embryonic stem cells but can still be generally applied – the pair-based method is a non-parametric procedure that should be robust to technical differences between data sets, and the transcriptional program associated with cell cycling should be mostly conserved across cell types. However, it will (inevitably) be less accurate for cell types that are substantially different from those used in the training set. Users can also construct a custom classifier from their own training data using the sandbag function. This may be necessary for other model organisms where pre-trained classifiers are not available.

2.7 Identifying HVGs from the normalized log-expression

We identify HVGs to focus on the genes that are driving heterogeneity across the population of cells. This requires estimation of the variance in expression for each gene, followed by decomposition of the variance into biological and technical components. HVGs are then identified as those genes with the highest biological components. This avoids prioritizing genes that are highly variable due to technical factors such as sampling noise during RNA capture and library preparation.

Ideally, the technical component would be estimated by fitting a mean-variance trend to the spike-in transcripts. Recall that the same set of spike-ins was added in the same quantity to each cell. This means that the spike-in transcripts should exhibit no biological variability, such that any variance in the counts should be technical in origin. Fitting is performed by the trendVar function, using a loess curve with a low span as the trend is highly non-linear. (Some adjustment of the parameters may be required to obtain a satisfactory fit.)

var.fit <- trendVar(sce, trend="loess", span=0.3)

Given the mean abundance of a gene, the fitted value of the trend can be used as an estimate of the technical component for that gene. The biological component of the variance can then be calculated by subtracting the technical component from the total variance of each gene in the decomposeVar function.

var.out <- decomposeVar(sce, var.fit)

In practice, this strategy is complicated by the difficulty of accurately fitting a complex trend to a low number of unevenly distributed points. An alternative approach is to fit the mean-variance trend to the endogenous genes. This assumes that the majority of genes are constantly expressed, such that the technical component dominates the total variance of expression for those genes. The fitted value of the trend can then be used as an estimate of the technical component.

var.fit2 <- trendVar(sce, trend="loess", use.spikes=FALSE, span=0.2)
var.out2 <- decomposeVar(sce, var.fit2)
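The trend-and-subtract logic described above can be sketched on simulated variances using the loess function in base R. All values here are simulated under an assumed technical trend, and the robust fitting option (family="symmetric") is our own choice for the sketch. It is not a description of what trendVar or decomposeVar do internally.

```r
set.seed(42)
n.genes <- 500
hvg <- 1:20                                    # genes given extra biological variance
means <- runif(n.genes, 1, 10)                 # mean log-expression of each gene
tech.var <- 1 + 3 * exp(-means / 2)            # hypothetical technical trend
bio.var <- replace(numeric(n.genes), hvg, 4)   # true biological components

# Observed variances: true variance with chi-squared sampling noise (95 d.f.).
total.var <- (tech.var + bio.var) * rchisq(n.genes, df = 95) / 95

# Fit a trend to all genes, assuming that most have no biological variance.
fit <- loess(total.var ~ means, span = 0.3, degree = 1, family = "symmetric")
tech.est <- fitted(fit)                        # estimated technical component
bio.est <- total.var - tech.est                # estimated biological component

head(order(bio.est, decreasing = TRUE))        # top-ranked genes by biological component
```

The estimated biological component is centred near zero for the majority of genes and is clearly positive for the simulated HVGs. Negative estimates are expected for roughly half of the non-variable genes, as the trend passes through the middle of their variances.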

We assess the suitability of the trend fitted to the endogenous variances by examining whether it is consistent with the spike-in variances (Figure 9). The former passes through the bulk of the latter in the plot below, indicating that our assumption (that most genes have low levels of biological variability) is valid. In contrast, the spike-in trend fits poorly as it lies below the variance estimates at mean intervals with few spike-in transcripts. The use of an endogenous trend is the only option in data sets where no spike-ins were added or in situations where not enough spike-in RNA was added to cover the range of means for the endogenous genes.

plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression", 
    ylab="Variance of log-expression")
points(var.fit$mean, var.fit$var, col="red", pch=16)
o <- order(var.out$mean)
lines(var.out$mean[o], var.out$tech[o], col="red", lwd=2)
lines(var.out2$mean[o], var.out2$tech[o], col="dodgerblue", lwd=2)

Figure 9: Variance of normalized log-expression values for each gene in the HSC data set, plotted against the mean log-expression. The red line represents the mean-dependent trend in the technical variance of the spike-in transcripts (also highlighted as red points). The blue line represents the trend fitted to the variances of the endogenous genes.

The top HVGs are identified by ranking genes on their biological components. This can be used to prioritize interesting genes for further investigation. In general, we consider a gene to be an HVG if it has a biological component of at least 1. On the log2 scale, a variance of 1 corresponds to a standard deviation of 1, which means that gene expression will typically vary, for biological reasons, by at least 2-fold around the mean.

top.hvgs <- order(var.out2$bio, decreasing=TRUE)
write.table(file="hsc_hvg.tsv", var.out2[top.hvgs,], sep="\t", quote=FALSE, col.names=NA)
head(var.out2[top.hvgs,])
##              mean     total       bio      tech
## Fos      6.354617 20.182191 12.302393  7.879798
## Rgs1     5.156189 20.261351  9.416550 10.844802
## Dusp1    6.638309 16.092466  9.074147  7.018319
## H2-Aa    4.237864 19.423406  7.524803 11.898603
## Ppp1r15a 6.485799 14.971509  7.462378  7.509130
## Ctla2a   8.594131  9.509346  7.400235  2.109111

We recommend checking the distribution of expression values for the top HVGs to ensure that the variance estimate is not being dominated by one or two outlier cells (Figure 10).

examined <- top.hvgs[1:10]
all.names <- matrix(rownames(sce)[examined], nrow=length(examined), ncol=ncol(sce))
boxplot(split(exprs(sce)[examined,], all.names), las=2, ylab="Normalized log-expression", col="grey80")

Figure 10: Boxplots of normalized log-expression values for the top 10 HVGs in the HSC data set. Points correspond to cells that are more than 1.5 interquartile ranges from the edge of each box.

There are many other ways of defining HVGs, e.g., by using the coefficient of variation (Kolodziejczyk et al. 2015; Kim et al. 2015), with the dispersion parameter in the negative binomial distribution (McCarthy, Chen, and Smyth 2012), or as a proportion of total variability (Vallejos, Marioni, and Richardson 2015). We use the variance of the log-expression values because the log-transformation provides some protection against genes with strong expression in only one or two outlier cells. This ensures that the set of top HVGs is not dominated by genes with (mostly uninteresting) outlier expression patterns. However, the cost of this robustness is the need to fit a complex mean-variance relationship.

2.8 Identifying correlated gene pairs with Spearman’s rho

Another useful procedure is to identify the HVGs that are highly correlated with one another. This distinguishes between HVGs caused by random noise and those involved in driving systematic differences between subpopulations. Gene pairs with significantly large positive or negative values for Spearman’s rho are identified using the correlatePairs function. Note that we only apply this function for the top set of HVGs – doing so for all possible gene pairs would require too much computational time and may prioritize uninteresting genes that have strong correlations but low variance, e.g., tightly co-regulated house-keeping genes.

set.seed(100)
var.cor <- correlatePairs(sce[top.hvgs[1:200],])
write.table(file="hsc_cor.tsv", var.cor, sep="\t", quote=FALSE, col.names=NA)
head(var.cor)
##      gene1  gene2       rho      p.value         FDR
## 1    H2-Aa   Cd74 0.5724151 1.999998e-06 0.004974995
## 2    H2-Aa H2-Ab1 0.5720793 1.999998e-06 0.004974995
## 3     Ly6a   Cd74 0.5379287 1.999998e-06 0.004974995
## 4     Egr1    Jun 0.5286512 1.999998e-06 0.004974995
## 5      Fos   Egr1 0.5081861 1.999998e-06 0.004974995
## 6 Ppp1r15a  Zfp36 0.5012384 1.999998e-06 0.004974995

The significance of each correlation is determined using a permutation test. For each pair of genes, the null hypothesis is that the expression profiles of the two genes are independent. Shuffling the profiles and recalculating the correlation yields a null distribution that can be used to obtain a p-value for each observed correlation value (Phipson and Smyth 2010). Correction for multiple testing across many gene pairs is performed by controlling the false discovery rate (FDR) at 5%.

sig.cor <- var.cor$FDR <= 0.05
summary(sig.cor)
##    Mode   FALSE    TRUE    NA's 
## logical   19883      17       0
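
The permutation test described above can be sketched in base R for a single pair of genes. This is a minimal illustration under stated assumptions, not the correlatePairs implementation: `permRho` is a hypothetical helper, the data are simulated, and the two-sided p-value formula is one reasonable choice that may differ from the one used by the package.

```r
# Minimal sketch of a permutation test for Spearman's rho.
# 'permRho' is a hypothetical helper, not part of any package.
permRho <- function(x, y, iters=1000) {
    obs <- cor(x, y, method="spearman")
    # Shuffling one profile breaks any dependence between the two genes.
    null <- replicate(iters, cor(x, sample(y), method="spearman"))
    # Two-sided p-value; adding 1 to numerator and denominator avoids
    # p-values of zero (Phipson and Smyth 2010).
    p <- (sum(abs(null) >= abs(obs)) + 1) / (iters + 1)
    c(rho=obs, p.value=p)
}

set.seed(100)
x <- rnorm(50)
related <- permRho(x, x + rnorm(50, sd=0.5)) # dependent pair
unrelated <- permRho(x, rnorm(50))            # independent pair
rbind(related, unrelated)
```

The smallest attainable p-value is limited by the number of permutations, which is why many of the top pairs in a real analysis can share the same p-value.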

Larger sets of correlated genes can be assembled by treating genes as nodes in a graph and each pair of genes with significantly large correlations as an edge. In this manner, an undirected graph can be constructed using methods in the RBGL package. Highly connected subgraphs can then be identified and defined as gene sets. This provides a convenient summary of the pairwise correlations between genes.

library(RBGL)
g <- ftM2graphNEL(cbind(var.cor$gene1, var.cor$gene2)[sig.cor,], W=NULL, V=NULL, edgemode="undirected")
cl <- highlyConnSG(g)$clusters
cl <- cl[order(lengths(cl), decreasing=TRUE)]
cl <- cl[lengths(cl) > 2]
cl
## [[1]]
## [1] "Egr1" "Fos"  "Jun"  "Junb"
## 
## [[2]]
## [1] "H2-Aa"  "Cd74"   "H2-Eb1"
## 
## [[3]]
## [1] "Srm"    "Tuba4a" "Zfp945"

Significant correlations provide evidence for substructure in the data set, i.e., subpopulations of cells with systematic differences in their expression profiles. The number of significantly correlated HVG pairs represents the strength of the substructure. If many pairs were significant, this would indicate that the subpopulations were clearly defined and distinct from one another. For this particular data set, a relatively low number of HVGs exhibit significant correlations. This suggests that any substructure in the data will be modest, which may not be unexpected given that rigorous selection was performed to obtain a homogeneous population of HSCs (Wilson et al. 2015).

The correlation results can also be used directly in follow-up experiments to verify if any substructure is present. This is done by using sets of correlated HVGs as markers in procedures such as fluorescence-activated cell sorting, immunohistochemistry or RNA fluorescence in situ hybridization. In this manner, the existence of subpopulations with distinct expression patterns for the chosen HVGs can be experimentally validated. Negatively correlated pairs may be particularly useful as they provide more power to discriminate between subpopulations. In the simplest example, a subpopulation would be positive for one marker and negative for the other while the reverse would be true for a different subpopulation, thus allowing the two subpopulations to be easily distinguished.

2.9 Using correlated HVGs for further data exploration

For further analyses, we focus on the significantly correlated HVGs for which any substructure should be most pronounced. This may allow us to identify subpopulations that would have otherwise been masked by random noise in the expression profiles.

chosen <- unique(c(var.cor$gene1[sig.cor], var.cor$gene2[sig.cor]))
norm.exprs <- exprs(sce)[chosen,,drop=FALSE]

We construct a simple dendrogram to group together cells with similar expression patterns across the chosen genes. Here, we cluster on Euclidean distances to provide greater sensitivity to differences in expression for low numbers of genes. Ward’s clustering criterion is used to minimize the total variance within each cluster.

my.dist <- dist(t(norm.exprs))
my.tree <- hclust(my.dist, method="ward.D2")

In addition, a tree cut can be used to explicitly define subpopulations of cells from the dendrogram. Note that some tuning of the cut height h is required to obtain satisfactory results for each data set.

my.clusters <- unname(cutree(my.tree, h=50))
my.clusters
##  [1] 1 2 2 2 2 2 2 1 1 3 1 3 2 2 2 2 2 3 1 1 1 3 2 2 2 1 1 3 2 2 2 1 1 2 2 1 1 1 2 1 1 3 2 2 2 2 2 1
## [49] 2 1 1 2 1 2 2 2 2 1 2 2 1 1 1 3 2 2 2 3 2 2 1 2 2 1 1 2 1 1 1 2 3 1 2
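
The effect of the cut height can be explored on simulated data before tuning it on a real data set. The sketch below uses an invented matrix with two well-separated groups of cells; all object names (`toy`, `toy.tree`, `by.k`, `n.clust`) are hypothetical.

```r
set.seed(100)
# Simulated log-expression for 10 genes (rows) and 60 cells (columns),
# with the last 30 cells shifted upwards to create two groups.
toy <- matrix(rnorm(600), nrow=10)
toy[,31:60] <- toy[,31:60] + 3

toy.tree <- hclust(dist(t(toy)), method="ward.D2")
# Cutting by the number of clusters (k) avoids having to guess a height (h).
by.k <- cutree(toy.tree, k=2)
table(by.k, rep(c("A", "B"), each=30))
# Lower cut heights yield more (and smaller) clusters.
n.clust <- sapply(c(5, 10, 20), function(h) length(unique(cutree(toy.tree, h=h))))
n.clust
```

Inspecting the number of clusters over a range of heights, as in the last line, is one simple way to choose h for a given dendrogram.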

We can visualize the constructed dendrogram with a heatmap (Figure 11). All expression values are mean-centred for each gene to highlight the relative expression between cells. We recommend storing the heatmap at a sufficiently high resolution so that the relevant genes can be easily identified for further examination.

library(gplots)
heat.vals <- norm.exprs - rowMeans(norm.exprs)
clust.col <- rainbow(max(my.clusters))
heatmap.2(heat.vals[chosen,], col=bluered, symbreak=TRUE, trace='none', cexRow=0.8,
    ColSideColors=clust.col[my.clusters], Colv=as.dendrogram(my.tree))
Figure 11: Heatmap of mean-centred normalized log-expression values for correlated HVGs in the HSC data set. Dendrograms are formed by hierarchical clustering on the Euclidean distances between genes (row) or cells (column). Column colours represent the cluster to which each cell is assigned after a tree cut.

Modest substructure in Figure 11 is consistent with the low number of correlated HVGs. Nonetheless, a close look suggests that an H2-Aa/Cd74-expressing subpopulation exists alongside a Fos/Jun-negative subpopulation. H2-Aa codes for a component of the class II major histocompatibility complex and may be indicative of B-cell contamination (Heng et al. 2008). Fos and Jun may also be relevant as they are involved in cell proliferation (Angel and Karin 1991). In addition, these visual subpopulations roughly correspond to the computationally identified clusters from the tree cut. Further analyses can then be performed on those empirical clusters – for example, we could perform a DE analysis to identify marker genes for each corresponding subpopulation.

That being said, users should treat clustering results with some caution. It is difficult to maintain statistical rigour during clustering to protect against the formation of spurious clusters. As such, determining whether a cluster is “real” or not usually depends on subjective judgement, unless the clusters are very clearly defined (e.g., strong separation on a PCA plot, widespread differences in the expression profiles). Moreover, different algorithms can yield substantially different clusters by focusing on different aspects of the data. In short, experimental validation of the clustering results is critical to ensure that the putative subpopulations actually exist.

Finally, dimensionality reduction can be applied using only the set of correlated HVGs to highlight any substructure that might be present. This is shown in Figure 12 for both PCA and t-SNE plots, though in this case, focusing on HVGs does not provide any additional separation into distinct subpopulations. A more informative strategy is to colour cells in the plot based on the expression of a gene of interest. This improves visualization by highlighting changes in expression across the cell population.

out.pca <- plotPCA(sce, exprs_values="exprs", feature_set=chosen, colour_by="H2-Aa") + fontsize
set.seed(100)
out.tsne <- plotTSNE(sce, exprs_values="exprs", feature_set=chosen, colour_by="H2-Aa") + fontsize
multiplot(out.pca, out.tsne, cols=2)
Figure 12: PCA (left) and t-SNE plots (right) using only the expression values for significantly correlated HVGs in the HSC data set. Cells are coloured according to the level of H2-Aa expression.

2.10 Additional comments

Once the basic analysis is completed, it is often useful to save the SCESet object to file with the saveRDS function. The object can then be easily restored into new R sessions using the readRDS function. This allows further work to be conducted without having to repeat all of the processing steps described above.

saveRDS(file="hsc_data.rds", sce)
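
The round trip can be checked on a small stand-in object. This sketch uses a plain list and a temporary file (both invented for illustration); the same two calls apply to an SCESet object.

```r
# Round trip with a stand-in object written to a temporary file.
obj <- list(counts=matrix(1:6, nrow=2), note="toy object")
tmp <- tempfile(fileext=".rds")
saveRDS(obj, file=tmp)
restored <- readRDS(tmp)
# The restored object should match the original exactly.
identical(obj, restored)
```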

A variety of methods are available to perform more complex analyses on the processed expression data. For example, cells can be ordered by pseudotime (e.g., for progress along a differentiation pathway) with monocle (Trapnell et al. 2014); cell-state hierarchies can be characterized with the sincell package (Julia, Telenti, and Rausell 2015); and oscillatory behaviour can be identified using Oscope (Leng et al. 2015). HVGs can be used in gene set enrichment analyses to identify biological pathways and processes with heterogeneous activity, using packages designed for bulk data like topGO or with dedicated single-cell methods like scde (J. Fan et al. 2016). Full descriptions of these analyses are outside the scope of this workflow, so interested users are advised to consult the relevant documentation.

3 A more complex analysis on brain cell types

3.1 Overview

We proceed to a more complex data set from a study of cell types in the mouse brain (Zeisel et al. 2015). This contains approximately 3000 cells of varying types such as oligodendrocytes, microglia and neurons. Individual cells were isolated using the Fluidigm C1 microfluidics system and library preparation was performed on each cell using a UMI-based protocol. After sequencing, expression was quantified by counting the number of UMIs mapped to each gene. Count data for all endogenous genes, mitochondrial genes and spike-in transcripts were obtained from http://linnarssonlab.org/cortex.

3.2 Count loading

The count data are distributed across several files, so some work is necessary to consolidate them into a single matrix. We define a simple utility function for loading data in from each file. (We stress that this function is only relevant to the current data set, and should not be used for other data sets. This kind of effort is generally not required if all of the counts are in a single file and separated from the metadata.)

readFormat <- function(infile) { 
    # First column is empty.
    metadata <- read.delim(infile, stringsAsFactors=FALSE, header=FALSE, nrows=10)[,-1] 
    rownames(metadata) <- metadata[,1]
    metadata <- metadata[,-1]
    metadata <- as.data.frame(t(metadata))
    # First column after row names is some useless filler.
    counts <- read.delim(infile, stringsAsFactors=FALSE, header=FALSE, row.names=1, skip=11)[,-1] 
    counts <- as.matrix(counts)
    return(list(metadata=metadata, counts=counts))
}

Using this function, we read in the counts for the endogenous genes, ERCC spike-ins and mitochondrial genes.

endo.data <- readFormat("expression_mRNA_17-Aug-2014.txt")
spike.data <- readFormat("expression_spikes_17-Aug-2014.txt")
mito.data <- readFormat("expression_mito_17-Aug-2014.txt")

We also need to rearrange the columns for the mitochondrial data, as the order is not consistent with the other files.

m <- match(endo.data$metadata$cell_id, mito.data$metadata$cell_id)
mito.data$metadata <- mito.data$metadata[m,]
mito.data$counts <- mito.data$counts[,m]
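
The reordering step relies on match, which returns the position of each reference identifier in the other vector. A toy illustration follows, with invented cell names and counts; it also checks for NA values, which would indicate that a cell is missing from the second file.

```r
# Toy illustration of reordering columns with match(); names are invented.
ref.ids <- c("cellA", "cellB", "cellC")
other.ids <- c("cellC", "cellA", "cellB")
other.counts <- matrix(1:6, nrow=2, dimnames=list(c("g1", "g2"), other.ids))

idx <- match(ref.ids, other.ids)
# An NA here would mean that a reference cell is absent from the other file.
stopifnot(!anyNA(idx))
reordered <- other.counts[,idx]
colnames(reordered)
```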

The counts are then combined into a single matrix for constructing a SCESet object. For convenience, metadata for all cells are stored in the same object for later access.

all.counts <- rbind(endo.data$counts, mito.data$counts, spike.data$counts)
metadata <- AnnotatedDataFrame(endo.data$metadata)
sce <- newSCESet(countData=all.counts, phenoData=metadata)
dim(sce)
## Features  Samples 
##    20063     3005

We also add annotation identifying which rows correspond to each class of features.

nrows <- c(nrow(endo.data$counts), nrow(mito.data$counts), nrow(spike.data$counts))
is.spike <- rep(c(FALSE, FALSE, TRUE), nrows)
isSpike(sce) <- is.spike
is.mito <- rep(c(FALSE, TRUE, FALSE), nrows)
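
These annotation vectors work because rep repeats each flag by the number of rows in the corresponding block, in the same order used for rbind. A small sketch with invented block sizes:

```r
# Toy block sizes: 3 endogenous genes, 2 mitochondrial genes, 1 spike-in.
toy.nrows <- c(3, 2, 1)
toy.spike <- rep(c(FALSE, FALSE, TRUE), toy.nrows)
toy.mito <- rep(c(FALSE, TRUE, FALSE), toy.nrows)
# Each row is flagged according to the block from which it came.
data.frame(block=rep(c("endo", "mito", "spike"), toy.nrows), toy.spike, toy.mito)
```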

3.3 Quality control on the cells and genes

The original authors of the study have already removed low-quality cells prior to data publication. Nonetheless, we can compute some metrics to check whether the remaining cells are satisfactory.

sce <- calculateQCMetrics(sce, feature_controls=list(Spike=is.spike, Mt=is.mito)) 

We examine the distribution of library sizes and numbers of expressed genes across cells (Figure 13).

par(mfrow=c(1,2))
hist(sce$total_counts/1e3, xlab="Library sizes (thousands)", main="", 
    breaks=20, col="grey80", ylab="Number of cells")
hist(sce$total_features, xlab="Number of expressed genes", main="", 
    breaks=20, col="grey80", ylab="Number of cells")
Figure 13: Histograms of library sizes (left) and number of expressed genes (right) for all cells in the brain data set.

We also examine the distribution of the proportions of total reads mapped to mitochondrial genes or spike-in transcripts (Figure 14). Note that the spike-in proportions here are more variable than in the HSC data set. This may reflect a greater variability in the total amount of endogenous RNA per cell when many cell types are present.

par(mfrow=c(1,2))
hist(sce$pct_counts_feature_controls_Mt, xlab="Mitochondrial proportion (%)", 
    ylab="Number of cells", breaks=20, main="", col="grey80")
hist(sce$pct_counts_feature_controls_Spike, xlab="ERCC proportion (%)",
    ylab="Number of cells", breaks=20, main="", col="grey80")
Figure 14: Histogram of the proportion of reads mapped to mitochondrial genes (left) or spike-in transcripts (right) across all cells in the brain data set.

We remove small outliers in Figure 13 and large outliers in Figure 14, using a MAD-based threshold as previously described.

libsize.drop <- isOutlier(sce$total_counts, n=3, type="lower", log=TRUE)
feature.drop <- isOutlier(sce$total_features, n=3, type="lower")
mito.drop <- isOutlier(sce$pct_counts_feature_controls_Mt, n=3, type="higher")
spike.drop <- isOutlier(sce$pct_counts_feature_controls_Spike, n=3, type="higher")
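
The idea behind a MAD-based threshold can be sketched in base R. The helper `lowOutlier` below is hypothetical and is not the isOutlier implementation; it assumes that the log-transformation is base 10 and that the MAD is scaled as in the default of mad. The library sizes are simulated, with one deliberately failed library.

```r
# Hypothetical helper mirroring the idea of a lower MAD-based threshold.
lowOutlier <- function(x, nmads=3, log=FALSE) {
    if (log) x <- log10(x)
    # mad() scales by 1.4826 by default, so that it is comparable to a
    # standard deviation for normally distributed values.
    x < median(x) - nmads * mad(x)
}

set.seed(100)
lib.sizes <- c(round(10^rnorm(99, mean=5, sd=0.1)), 500) # one failed library
flagged <- lowOutlier(lib.sizes, log=TRUE)
which(flagged)
```

Because the median and MAD are robust, the threshold itself is barely affected by the failed library that it is meant to detect.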

Removal of low-quality cells can then be performed by combining all of the metrics. The majority of cells are retained, which suggests that the original quality control procedures were generally adequate.

sce <- sce[,!(libsize.drop | feature.drop | spike.drop | mito.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop), 
    ByMito=sum(mito.drop), BySpike=sum(spike.drop), Remaining=ncol(sce))
##         ByLibSize ByFeature ByMito BySpike Remaining
## Samples         8         0     87       8      2902

Low-abundance genes are also removed by applying a simple mean-based filter. This yields fewer genes than in the HSC data set, mostly because the sequencing depth per cell is much lower.

keep <- rowMeans(counts(sce)) >= 1
sce <- sce[keep,]
sum(keep)
## [1] 3175

Some data sets may contain strong heterogeneity in mitochondrial RNA content, possibly due to differences in mitochondrial copy number or activity between cell types. This heterogeneity will cause mitochondrial genes to dominate the top set of results, e.g., for identification of correlated HVGs. However, these genes are largely uninteresting given that most studies focus on nuclear regulation. As such, we filter them out prior to further analysis. Other candidates for removal include pseudogenes or ribosomal RNA/protein-coding genes that might not be biologically relevant but can interfere with interpretation of the results.

sce <- sce[!fData(sce)$is_feature_control_Mt,]

3.4 Normalization and cell-cycle classification

Normalization of cell-specific biases is performed using the deconvolution method in the computeSumFactors function. Here, we cluster similar cells together and normalize the cells in each cluster using the deconvolution method. This improves the accuracy of normalization by reducing the number of DE genes between cells in the same cluster. Normalization between clusters is then performed to ensure that expression values from cells in different clusters are comparable.

clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, cluster=clusters)
sce <- normalize(sce)

Compared to the HSC analysis, more scatter is observed around the trend between the total count and size factor for each cell (Figure 15). This is consistent with an increased amount of DE between cells of different types, which compromises the accuracy of library size normalization (Robinson and Oshlack 2010). In contrast, the size factors are estimated based on median ratios and are more robust to the presence of DE between cells.

plot(sizeFactors(sce), sce$total_counts/1e3, log="xy",
    ylab="Library size (thousands)", xlab="Size factor")
Figure 15: Size factors from deconvolution, plotted against library sizes for all cells in the brain data set. Axes are shown on a log-scale.
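
The sensitivity of library size normalization to DE can be seen in a toy example with two cells. This sketch uses invented expression values with no technical bias, and a plain median ratio stands in for the more elaborate deconvolution procedure.

```r
set.seed(100)
n.genes <- 1000
base <- rgamma(n.genes, shape=2, rate=0.01) # hypothetical true expression
cell1 <- base
cell2 <- base
# 10% of genes are strongly upregulated in cell 2; no technical bias is present.
de <- seq_len(100)
cell2[de] <- cell2[de] * 20

# Library size normalization attributes the DE to a scaling bias...
lib.ratio <- sum(cell2) / sum(cell1)
# ... whereas the median ratio across genes is unaffected by the DE genes.
med.ratio <- median(cell2 / cell1)
c(library.size=lib.ratio, median.ratio=med.ratio)
```

Here, the correct scaling factor between the two cells is 1, which the median ratio recovers while the library size ratio does not.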

We also attempt to classify cells into cell cycle phases using the cyclone method. However, examination of Figure 16 indicates that many of the G1 and G2/M scores are ambiguous. This highlights the potential difficulties of training a classifier on one cell type (mouse embryonic stem cells – see Scialdone et al. (2015) for more details) and applying it on a substantially different cell type. Some neuron types are particularly problematic as they are postmitotic and do not belong in any phase of the cell cycle.

anno <- select(org.Mm.eg.db, keys=rownames(sce), keytype="SYMBOL", columns="ENSEMBL")
ensembl <- anno$ENSEMBL[match(rownames(sce), anno$SYMBOL)]
keep <- !is.na(ensembl)
assignments <- cyclone(sce[keep,], mm.pairs, gene.names=ensembl[keep])
plot(assignments$score$G1, assignments$score$G2M, xlab="G1 score", ylab="G2/M score", pch=16)
Figure 16: Cell cycle phase scores from applying the pair-based classifier on the brain data set, where each point represents a cell.

Given the lack of definitive classification, we will not perform any processing of the data set by cell cycle phase. However, this information is still useful for verifying downstream results. For example, if we were to identify putative subpopulations, and those subpopulations had systematically different phase scores, we might be wary of the possibility that the differences between subpopulations are being driven by cell cycle effects.

3.5 Data exploration to examine the effect of technical factors

For large experiments, data exploration has two functions – to identify interesting biology, and also to check the effect of various technical factors. PCA plots constructed from the expression data suggest that distinct subpopulations are present (Figure 17). Some of the substructure may be due to differences in the tissue from which the cells were extracted, e.g., cells from the cortex and hippocampus dominate different parts of the plot. In contrast, cells taken from mice of different sexes mix throughout the plot, indicating that sex has little effect on the overall differences across the data set.

pca1 <- plotPCA(sce, exprs_values="exprs", colour_by="tissue") + fontsize
pca2 <- plotPCA(sce, exprs_values="exprs", colour_by="sex") + fontsize
multiplot(pca1, pca2, cols=2)
Figure 17: PCA plots constructed from the normalized expression values for all remaining cells in the brain data set. Left: cells are coloured according to the tissue of origin (cortex or hippocampus). Right: cells are coloured according to the sex of the mouse – male (-1), female (1) or unassigned (0).

Similar results are observed with t-SNE plots (Figure 18). Again, users should set the seed to a constant value to ensure that the results are reproducible.

set.seed(100)
tsne1 <- plotTSNE(sce, exprs_values="exprs", colour_by="tissue") + fontsize
set.seed(100)
tsne2 <- plotTSNE(sce, exprs_values="exprs", colour_by="sex") + fontsize
multiplot(tsne1, tsne2, cols=2)
Figure 18: t-SNE plots constructed from the normalized expression values for all remaining cells in the brain data set. Left: cells are coloured according to the tissue of origin (cortex or hippocampus). Right: cells are coloured according to the sex of the mouse – male (-1), female (1) or unassigned (0).

An additional effect to consider is the fact that cells were processed on many different C1 chips. This can lead to batch effects due to technical differences in library preparation between chips. To check that this is not the case, we examine the spread of cells from each chip on the PCA plot (Figure 19). Cells from different chips seem to mix together, which suggests that the substructure is not being driven by a batch effect. Note that we separate cells by tissue because the chip factor is nested within the tissue factor. If all cells were plotted together, differences between tissues would dominate the plot such that more subtle differences between chips may not be visible.

sce$chip <- sub("_.*", "", sce$cell_id)
pca1 <- plotPCA(sce[,sce$tissue=="sscortex"], exprs_values="exprs", 
    colour_by="chip", legend="none") + fontsize + ggtitle("Cortex")
pca2 <- plotPCA(sce[,sce$tissue!="sscortex"], exprs_values="exprs", 
    colour_by="chip", legend="none") + fontsize + ggtitle("Hippocampus")
multiplot(pca1, pca2, cols=2)
Figure 19: PCA plots constructed from the normalized expression values for all cells in the brain data set from the cortex (left) or hippocampus (right). Each cell is coloured according to the C1 chip on which its library was prepared.

In summary, the major difference between cells seems to be associated with the tissue of origin. Whether or not this is interesting depends on the biological hypothesis being studied. For the purposes of this workflow, we will treat the tissue of origin as an uninteresting confounding effect. This is because we are mainly interested in the cell subpopulations within each tissue. As such, we will block on tissue in all of our downstream analyses.

design <- model.matrix(~sce$tissue)
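
To see what such a design matrix contains, consider a toy vector of tissue labels. The label "hippocampus" is an invented placeholder (only "sscortex" appears in the code above), and the object names are hypothetical.

```r
# Toy tissue labels for five cells; "hippocampus" is a placeholder level.
tissue <- c("sscortex", "sscortex", "hippocampus", "sscortex", "hippocampus")
toy.design <- model.matrix(~tissue)
# The intercept represents the first level (alphabetically), while the second
# column is an indicator for cells from the other tissue.
toy.design
```

Any method that accepts this matrix can then estimate and account for the difference between tissues, rather than treating it as biological heterogeneity.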

3.6 Identifying correlated HVGs

We identify HVGs that may be involved in driving population heterogeneity. This is done by fitting a trend to the technical variances for the spike-in transcripts as previously described. We then compute the biological component of the variance for each endogenous gene by subtracting the fitted value of the trend from the total variance.

var.fit <- trendVar(sce, trend="loess", design=design, span=0.4)
var.out <- decomposeVar(sce, var.fit)
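
The decomposition can be sketched in base R on simulated values. This is an illustration of the subtraction described above, not the trendVar or decomposeVar code: the means, variances and trend shape are invented, and a plain loess fit on the variances is assumed.

```r
set.seed(100)
# Simulated spike-ins: technical variance falls with increasing mean.
spike.mean <- seq(0, 10, length.out=50)
spike.var <- 2 * exp(-spike.mean / 3) + rnorm(50, sd=0.02)
# Simulated endogenous genes: same technical trend plus a biological component.
endo.mean <- runif(500, 1, 9)
true.bio <- rexp(500, rate=2)
endo.var <- 2 * exp(-endo.mean / 3) + true.bio

# Fit the trend to the spike-ins only, then evaluate it at each endogenous mean.
toy.fit <- loess(spike.var ~ spike.mean, span=0.4)
tech <- predict(toy.fit, newdata=data.frame(spike.mean=endo.mean))
# The biological component is the total variance minus the fitted trend.
bio <- endo.var - tech
cor(bio, true.bio)
```

Note that loess does not extrapolate by default, so the spike-in means should span the range of the endogenous means for the fitted values to be defined.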

Figure 20 suggests that the trend is fitted accurately to the technical variances. Errors in fitting seem to be negligible relative to the size of the total variances for the endogenous genes. The technical variances are also much smaller than those in the HSC data set. This is due to the use of UMIs which reduces the noise caused by variable PCR amplification. Furthermore, the spike-in trend is consistently lower than the variances of the endogenous genes. This means the previous strategy of fitting a trend to the endogenous variances would not be appropriate here (or necessary, given the quality of the spike-in trend).

plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression", 
    ylab="Variance of log-expression")
points(var.fit$mean, var.fit$var, col="red", pch=16)
o <- order(var.out$mean)
lines(var.out$mean[o], var.out$tech[o], col="red", lwd=2)
Figure 20: Variance of normalized log-expression values for each gene in the brain data set, plotted against the mean log-expression. The red line represents the mean-dependent trend in the technical variance of the spike-in transcripts (also highlighted as red points).

The top HVGs are identified based on their biological components. These are saved to file for future reference.

top.hvgs <- order(var.out$bio, decreasing=TRUE)
write.table(file="brain_hvg.tsv", var.out[top.hvgs,], sep="\t", quote=FALSE, col.names=NA)
head(var.out[top.hvgs,])
##          mean     total       bio      tech
## Plp1 3.653363 13.385484 13.074259 0.3112251
## Trf  2.032646  9.324462  8.688295 0.6361662
## Mal  2.059767  8.830603  8.201525 0.6290775
## Apod 1.646420  7.513916  6.817105 0.6968103
## Mog  1.575109  7.255534  6.552027 0.7035075
## Mbp  1.930524  6.849602  6.187858 0.6617450

Again, we check the distribution of expression values for the top 10 HVGs to ensure that they are not being driven by outliers (Figure 21).

examined <- top.hvgs[1:10]
all.names <- matrix(rownames(sce)[examined], nrow=length(examined), ncol=ncol(sce))
boxplot(split(exprs(sce)[examined,], all.names), las=2, ylab="Normalized log-expression", col="grey80")
Figure 21: Boxplots of normalized log-expression values for the top 10 HVGs in the brain data set. Points correspond to cells that are more than 1.5 interquartile ranges from the edge of each box.

To identify genes involved in defining subpopulations, the top 200 HVGs can be tested for significant pairwise correlations. Here, the number of significantly correlated pairs is much higher than in the HSC data set, indicating that strong substructure is present. These results are also saved to file for use in designing validation experiments.

set.seed(100)
var.cor <- correlatePairs(sce[top.hvgs[1:200],], design=design)
write.table(file="brain_cor.tsv", var.cor, sep="\t", quote=FALSE, col.names=NA)
head(var.cor)
##   gene1  gene2       rho      p.value          FDR
## 1   Mog   Mobp 0.8717332 1.999998e-06 2.431571e-06
## 2   Mog   Ermn 0.8631645 1.999998e-06 2.431571e-06
## 3   Mog  Ugt8a 0.8609433 1.999998e-06 2.431571e-06
## 4   Mbp   Mobp 0.8563148 1.999998e-06 2.431571e-06
## 5  Mobp    Mag 0.8517917 1.999998e-06 2.431571e-06
## 6 Ugt8a Cldn11 0.8513843 1.999998e-06 2.431571e-06
sig.cor <- var.cor$FDR <= 0.05
sum(sig.cor)
## [1] 18617

Correlated HVGs can also be used to construct a heatmap that can be inspected for subpopulations. Here, we use the normalized log-expression values for the top HVGs that have significant correlations at a FDR of 5%. We also apply the removeBatchEffect function from the limma package (Ritchie et al. 2015) to remove the tissue effect. This ensures that any differences due to the tissue of origin will not dominate the visualization of the expression profiles. (Note that, if an analysis method can accept a design matrix, then blocking on nuisance factors in the design matrix is preferable to manipulating the expression values with removeBatchEffect. This is because the latter does not account for the loss of residual degrees of freedom, nor the uncertainty of estimation of the blocking factor terms.)

chosen <- unique(c(var.cor$gene1[sig.cor], var.cor$gene2[sig.cor]))
norm.exprs <- exprs(sce)[chosen,,drop=FALSE]
library(limma)
norm.exprs <- removeBatchEffect(norm.exprs, batch=sce$tissue)
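
For a single blocking factor, the effect of this correction can be approximated by shifting each batch so that its per-gene mean matches the average of the batch means. The sketch below is a hypothetical base-R helper (`centreBatches`) on simulated data; it follows the same idea but is not the limma code.

```r
# Hypothetical helper for a single batch factor (genes in rows, cells in columns).
centreBatches <- function(x, batch) {
    batch <- factor(batch)
    # Per-gene mean within each batch, with one column per batch.
    batch.means <- do.call(cbind, lapply(levels(batch), function(b) {
        rowMeans(x[,batch == b, drop=FALSE])
    }))
    # Shift each batch so that its mean matches the average of the batch means.
    shift <- batch.means - rowMeans(batch.means)
    x - shift[,as.integer(batch), drop=FALSE]
}

set.seed(100)
toy <- matrix(rnorm(40), nrow=4)
toy.batch <- rep(c("a", "b"), c(6, 4))
toy[,toy.batch == "b"] <- toy[,toy.batch == "b"] + 5 # artificial batch shift
corrected <- centreBatches(toy, toy.batch)
# After correction, the per-gene means are the same in both batches.
rbind(a=rowMeans(corrected[,toy.batch == "a"]), b=rowMeans(corrected[,toy.batch == "b"]))
```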

The heatmap in Figure 22 shows systematic differences between groups of cells with distinct patterns of expression. This is consistent with the presence of well-defined subpopulations. Clusters can also be defined by applying a dynamic tree cut (Langfelder, Zhang, and Horvath 2008) on the dendrogram. This exploits the shape of the branches in the dendrogram to refine the cluster definitions, and is more appropriate than cutree for complex dendrograms. The empirical clusters correspond roughly to the visual subpopulations, which suggests that they can be reliably used for downstream analyses. Greater control of the empirical clusters can be obtained by manually specifying cutHeight in cutreeDynamic.

my.dist <- dist(t(norm.exprs))
my.tree <- hclust(my.dist, method="ward.D2")
library(dynamicTreeCut)
my.clusters <- unname(cutreeDynamic(my.tree, distM=as.matrix(my.dist), verbose=0))
heat.vals <- norm.exprs - rowMeans(norm.exprs)
clust.col <- rainbow(max(my.clusters))
heatmap.2(heat.vals, col=bluered, symbreak=TRUE, trace='none', cexRow=0.8,
    ColSideColors=clust.col[my.clusters], Colv=as.dendrogram(my.tree))
Figure 22: Heatmap of mean-centred normalized log-expression values for correlated HVGs in the brain data set. Dendrograms are formed by hierarchical clustering on the Euclidean distances between genes (row) or cells (column). Column colours represent the cluster to which each cell is assigned after a dynamic tree cut.

This heatmap can be stored at a greater resolution for detailed inspection later.

pdf("brain_heat.pdf", width=7, height=20)
heatmap.2(heat.vals, col=bluered, symbreak=TRUE, trace='none', cexRow=0.8,
    ColSideColors=clust.col[my.clusters], Colv=as.dendrogram(my.tree))
dev.off()

3.7 Detecting marker genes between subpopulations

Once putative subpopulations are identified by clustering, we can identify some candidate marker genes that are unique to those subpopulations. This is done by testing for DE between each pair of subpopulations and selecting those genes that are consistently upregulated (or downregulated) in one subpopulation compared to all others. DE testing can be done using a number of packages, but for this workflow, we will use the edgeR package (Robinson, McCarthy, and Smyth 2010). First, we set up a design matrix specifying which cells belong in which cluster. Each cluster* coefficient represents the average log-expression of all cells in the corresponding cluster. We also block on uninteresting factors such as the tissue of origin.

cluster <- factor(my.clusters)
design <- model.matrix(~0 + cluster + sce$tissue)
colnames(design)
## [1] "cluster1"           "cluster2"           "cluster3"           "cluster4"          
## [5] "cluster5"           "sce$tissuesscortex"
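
The layout of such a design matrix can be illustrated with a minimal sketch on made-up data. The vectors toy.cluster and toy.tissue below are hypothetical stand-ins for my.clusters and sce$tissue, and the tissue labels are invented for illustration.

```r
# Hypothetical cluster and tissue labels for six cells (not from the brain data set).
toy.cluster <- factor(c(1, 1, 2, 2, 3, 3))
toy.tissue <- factor(c("other", "sscortex", "other", "sscortex", "other", "sscortex"))

# No intercept, so each cluster gets its own coefficient.
# Tissue enters as an additive blocking term relative to its first level.
toy.design <- model.matrix(~0 + toy.cluster + toy.tissue)
colnames(toy.design)
```

Each cluster column is an indicator for membership of that cluster, while the single tissue column absorbs the difference between the two tissues.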

We set up a DGEList object for entry into the edgeR analysis. Spike-in transcripts are removed as they are not relevant for marker identification. The size factors are divided by the library sizes to obtain normalization factors for all cells. (The normalization factor is simply an alternative formulation of the size factor, and quantifies the bias that is not caused by differences in library size between samples.)

y <- convertTo(sce)
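
The relationship between size factors and normalization factors can be sketched in base R. This is an illustration only, not the internal code of convertTo: the values are made up, and the rescaling to a geometric mean of 1 is an assumption that follows the usual edgeR convention.

```r
# Hypothetical size factors and library sizes for four cells.
size.factors <- c(0.5, 1.0, 1.5, 2.0)
lib.sizes <- c(40000, 100000, 120000, 250000)

# Ratio of size factor to library size, rescaled to a geometric mean of 1
# (the rescaling is an assumption, not taken from the workflow).
log.ratio <- log(size.factors / lib.sizes)
norm.factors <- exp(log.ratio - mean(log.ratio))

# The effective library size is proportional to the original size factor.
eff.lib <- lib.sizes * norm.factors
eff.lib / size.factors
```

The constant ratio in the last line shows that the two formulations carry the same information about cell-specific bias.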

edgeR uses negative binomial (NB) distributions to model the read counts for each sample. We estimate the NB dispersion parameter that quantifies the biological variability in expression across cells in the same cluster. Large dispersion estimates above 0.5 are often observed in scRNA-seq data due to technical noise, in contrast to bulk data where values of 0.05-0.2 are more typical. We then use the design matrix to fit a NB GLM to the counts for each gene (McCarthy, Chen, and Smyth 2012).

y <- estimateDisp(y, design, robust=TRUE)
fit <- glmFit(y, design)
summary(y$tagwise.dispersion)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   0.06273   0.29090   0.45600   1.00300   0.73000 102.40000
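
To give a sense of what these dispersion values mean, the sketch below uses the standard NB mean-variance relationship, where the variance is mu + phi * mu^2 for mean mu and dispersion phi. The function name nb.variance and the chosen values are hypothetical.

```r
# Variance of a negative binomial count with mean 'mu' and dispersion 'phi'.
nb.variance <- function(mu, phi) mu + phi * mu^2

mu <- 100
nb.variance(mu, phi=0.1) # dispersion in the range typical of bulk data
nb.variance(mu, phi=0.5) # dispersion more typical of scRNA-seq data

# Simulated counts: rnbinom() is parameterized by size = 1/dispersion.
set.seed(100)
sim <- rnbinom(10000, mu=mu, size=1/0.5)
var(sim)
```

The sample variance of the simulated counts should be close to the theoretical value for a dispersion of 0.5, and several times larger than the value for a dispersion of 0.1.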

To identify marker genes for a particular cluster, we test each gene for DE between that cluster and each other cluster. This is done using the likelihood ratio test (LRT) for each comparison, as demonstrated below for cluster 2. The same process can be repeated for each cluster by changing chosen.clust, to identify markers specific to the corresponding subpopulation.

result.logFC <- result.PValue <- list()
chosen.clust <- which(levels(cluster)=="2") # character, as 'cluster' is a factor.
for (clust in seq_len(nlevels(cluster))) {
    if (clust==chosen.clust) { next }
    contrast <- numeric(ncol(design))
    contrast[chosen.clust] <- 1
    contrast[clust] <- -1
    res <- glmLRT(fit, contrast=contrast)
    con.name <- paste0('vs.', levels(cluster)[clust])
    result.logFC[[con.name]] <- res$table$logFC
    result.PValue[[con.name]] <- res$table$PValue
}

Potential marker genes for cluster 2 are ranked based on the maximum p-value across all comparisons. A gene that is DE between the chosen cluster and all others should have small p-values for all comparisons, and thus a small maximum p-value. In addition, we only focus on genes with the same sign of the log-fold change across all comparisons. This is necessary to identify specific markers that are unambiguously upregulated (or downregulated) in cluster 2 relative to the other clusters.

max.PValue <- do.call(pmax, result.PValue)
all.logFC <- do.call(cbind, result.logFC)
all.signs <- sign(all.logFC)
same.sign <- rowSums(all.signs[,1]!=all.signs)==0L
marker.set <- data.frame(Gene=rownames(y), logFC=all.logFC, 
    PValue=max.PValue, stringsAsFactors=FALSE)
marker.set <- marker.set[same.sign,]
marker.set <- marker.set[order(marker.set$PValue),]
head(marker.set)
##        Gene logFC.vs.1 logFC.vs.3 logFC.vs.4 logFC.vs.5        PValue
## 1344 Taldo1   2.616537   4.289601   3.279624   3.021371 4.500667e-207
## 1421    Mog   5.305202   9.738385   8.349295   7.466953 1.637919e-184
## 1345    Mbp   4.797108   7.835503   5.558221   5.468495 1.291619e-180
## 1387   Mobp   5.067485   8.699098   6.779929   6.486105 6.987337e-176
## 1410 Dbndd2   2.817153   4.701019   5.024328   2.965054 4.984182e-168
## 1473   Qdpr   3.521928   5.467438   4.555087   3.982385 2.648376e-166
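
The ranking logic can be checked on a small hypothetical example with four genes and three comparisons. All values below are invented for illustration.

```r
# Hypothetical p-values for four genes in three pairwise comparisons.
toy.PValue <- list(vs.1=c(1e-10, 1e-8, 0.5, 1e-3),
                   vs.3=c(1e-12, 0.2, 1e-9, 1e-4),
                   vs.4=c(1e-9, 1e-7, 1e-6, 1e-5))

# Matching log-fold changes; the last gene changes direction in one comparison.
toy.logFC <- cbind(vs.1=c(2, 3, 0.1, 1),
                   vs.3=c(4, 0.5, 3, -2),
                   vs.4=c(3, 2, 2, 1.5))

# A gene is only as convincing as its weakest comparison.
toy.max <- do.call(pmax, toy.PValue)

# Keep genes whose sign agrees with the first comparison in every column.
toy.same <- rowSums(sign(toy.logFC[,1]) != sign(toy.logFC)) == 0L
toy.max
toy.same
```

Only the first gene has a small maximum p-value, as the second and third genes each have one weak comparison. The fourth gene is discarded regardless of its p-values because its log-fold change is negative in one comparison.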

We save the list of candidate marker genes for further examination. We also examine their expression profiles to verify that the DE is not being driven by outlier cells. Figure 23 indicates that all of the top markers have strong and consistent differences between cells in cluster 2 and those in every other cluster. Indeed, some robustness to outliers is expected from edgeR, as any outliers will inflate the dispersion and increase the maximum p-value for the affected genes.

write.table(marker.set, file="brain_marker_2.tsv", sep="\t", quote=FALSE, row.names=FALSE)
top.markers <- marker.set$Gene[1:20]
norm.exprs <- exprs(sce)[top.markers,,drop=FALSE]
heat.vals <- norm.exprs - rowMeans(norm.exprs)
heatmap.2(heat.vals, col=bluered, symbreak=TRUE, trace='none', cexRow=1,
    ColSideColors=clust.col[my.clusters], Colv=as.dendrogram(my.tree), dendrogram='none')
legend("bottomleft", col=clust.col, legend=sort(unique(my.clusters)), pch=16)
Figure 23: Heatmap of mean-centred normalized log-expression values for the top set of markers for cluster 2 in the brain data set. Column colours represent the cluster to which each cell is assigned.

An alternative approach is to identify DE genes across any clusters using an ANOVA-like contrast. This is less stringent than identifying markers for a specific cluster, as the latter approach may overlook important genes that are expressed in two or more clusters. (For example, in a mixed population of CD4+-only, CD8+-only, double-positive and double-negative T-cells, neither Cd4 nor Cd8 would be detected as subpopulation-specific markers because each gene is expressed in two subpopulations.) Here, we report the log-fold changes for each cluster against the average of all other clusters for each gene. This facilitates interpretation of the results as the relevant cluster(s) expressing each gene can be quickly determined.

# Automatic construction of the contrast matrix.
nclusters <- nlevels(cluster)
contrast.matrix <- matrix(0, ncol(design), nclusters) 
contrast.matrix[1,] <- -1 
diag(contrast.matrix) <- 1
contrast.matrix <- contrast.matrix[,-1]
res.any <- glmLRT(fit, contrast=contrast.matrix)

# Computing log-fold changes between each cluster and the average of the rest.
cluster.expression <- fit$coefficients[,seq_len(nclusters)] 
other.expression <- (rowSums(cluster.expression) - cluster.expression)/(nclusters-1)
log.fold.changes <- cluster.expression - other.expression
colnames(log.fold.changes) <- paste0("LogFC.for.", levels(cluster))
rownames(log.fold.changes) <- NULL

# Ordering by the likelihood ratio; p-values affected by numerical imprecision.
any.de <- data.frame(Gene=rownames(y), log.fold.changes, 
    LR=res.any$table$LR, stringsAsFactors=FALSE)
any.de <- any.de[order(any.de$LR, decreasing=TRUE),]
head(any.de)
##        Gene LogFC.for.1 LogFC.for.2 LogFC.for.3 LogFC.for.4 LogFC.for.5       LR
## 1339   Scd2 -0.43975144    2.224036   -2.017434  -1.1847370   1.4178862 8643.685
## 1410 Dbndd2  0.24637738    2.687254   -1.385868  -1.6659942   0.1182310 6371.499
## 1344 Taldo1  0.02156529    2.288622   -1.428034  -0.5529562  -0.3291965 6245.922
## 1345    Mbp -0.05652804    4.099849   -2.689097  -0.7159830  -0.6382407 6164.056
## 1484  Rnf13 -0.31093708    2.452358   -1.838081  -0.5499326   0.2465924 5991.242
## 1421    Mog  0.75099454    5.347602   -3.090065  -1.8865112  -1.1220197 5955.469
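
To see what the contrast matrix and the log-fold change calculation look like, the same construction can be repeated on a toy example. The sketch assumes a hypothetical design with five cluster coefficients followed by one blocking coefficient, and invented coefficients for a single gene.

```r
# Five cluster coefficients plus one blocking coefficient (hypothetical).
n.coef <- 6
n.clust <- 5

# Each remaining column compares one cluster to cluster 1.
# The blocking coefficient in the last row is never involved.
toy.con <- matrix(0, n.coef, n.clust)
toy.con[1,] <- -1
diag(toy.con) <- 1
toy.con <- toy.con[,-1]
toy.con

# Hypothetical coefficients for one gene across the five clusters.
toy.coef <- matrix(c(1, 5, 0, 2, 2), nrow=1)

# Average of the other clusters, obtained by removing each cluster from the total.
toy.other <- (rowSums(toy.coef) - toy.coef) / (n.clust - 1)
toy.lfc <- toy.coef - toy.other
toy.lfc
```

Testing all four columns of the contrast matrix together is equivalent to testing for any difference among the five clusters. In the toy gene, the second cluster stands out with the largest positive value against the average of the rest.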

It must be stressed that the p-values cannot be interpreted as measures of significance. This is because the clusters have been empirically identified from the data. edgeR does not account for the uncertainty and stochasticity in clustering, which means that the p-values are much lower than they should be. The maximum p-value calculated here should only be used for ranking candidate markers for follow-up studies. However, this is not a concern in other analyses where the groups are pre-defined. For such analyses, the FDR-corrected p-value can be directly used to define significant genes for each DE comparison, though some care may be required to deal with plate effects (Hicks, Teng, and Irizarry 2015; ???).
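
The effect of defining clusters from the same data can be demonstrated with a small simulation. This is a simplified sketch that swaps in k-means clustering and t-tests on Gaussian noise, rather than the hierarchical clustering and edgeR analysis used above. All object names are hypothetical.

```r
set.seed(1000)

# 200 cells and 10 genes of pure noise: there are no true subpopulations.
null.exprs <- matrix(rnorm(200 * 10), nrow=200, ncol=10)

# Clusters are defined from the same data that is later tested.
null.clust <- kmeans(null.exprs, centers=2, nstart=10)$cluster

# Two-sample t-test per gene between the empirical clusters.
null.p <- apply(null.exprs, 2, function(x) {
    t.test(x[null.clust == 1], x[null.clust == 2])$p.value
})

# Proportion of genes called at a nominal 5% threshold.
mean(null.p < 0.05)
```

If the p-values were valid, around 5% of genes would fall below the threshold. The proportion is expected to be much higher here because the clustering has already separated the cells along the directions in which they differ most.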

3.8 Additional comments

Having completed the basic analysis, we save the SCESet object with its associated data to file. This is especially important here as the brain data set is quite large. If further analyses are to be performed, it would be inconvenient to have to repeat all of the pre-processing steps described above.

saveRDS(sce, file="brain_data.rds")

4 Alternative parameter settings and strategies

4.1 Normalizing based on spike-in coverage

Scaling normalization strategies for scRNA-seq data can be broadly divided into two classes. The first class assumes that there exists a subset of genes that are not DE between samples, as previously described. The second class uses the fact that the same amount of spike-in RNA was added to each cell. Differences in the coverage of the spike-in transcripts can only be due to cell-specific biases, e.g., in capture efficiency or sequencing depth. Scaling normalization is then applied to equalize spike-in coverage across cells.

The choice between these two normalization strategies depends on the biology of the cells and the features of interest. If there is no reliable house-keeping set, and if the majority of genes are expected to be DE, then spike-in normalization may be the only option for removing technical biases. Spike-in normalization should also be used if differences in the total RNA content of individual cells are of interest. This is because the same amount of spike-in RNA is added to each cell, such that the relative quantity of endogenous RNA can be easily quantified in each cell. For non-DE normalization, any change in total RNA content will affect all genes in the non-DE subset, such that it will be treated as bias and removed.

The use of spike-in normalization can be demonstrated on the HSC data set. We load in the SCESet object that we saved earlier, which contains the count data for filtered genes in high-quality HSCs. We then apply the computeSpikeFactors method to estimate size factors for all cells. This method computes the total count over all spike-in transcripts in each cell, and calculates size factors to equalize the total spike-in count across cells.

sce <- readRDS("hsc_data.rds")
deconv.sf <- sizeFactors(sce)
sce <- computeSpikeFactors(sce)
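
The calculation described above can be sketched in base R on made-up counts. This is not the implementation of computeSpikeFactors: the counts are invented, and the scaling of the size factors to a mean of 1 is an assumption about the convention used.

```r
# Hypothetical counts: three spike-in transcripts (rows) in four cells (columns).
spike.counts <- cbind(c(10, 50, 200), c(20, 100, 400),
                      c(5, 25, 100), c(15, 75, 300))

# Total spike-in count per cell, scaled so that the size factors average to 1
# (the scaling is an assumption, not taken from the workflow).
spike.totals <- colSums(spike.counts)
spike.sf <- spike.totals / mean(spike.totals)
spike.sf

# Dividing each cell by its size factor equalizes the total spike-in coverage.
scaled.totals <- colSums(t(t(spike.counts) / spike.sf))
scaled.totals
```

After scaling, every cell has the same total spike-in count, which is the property that spike-in normalization aims for.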

Both non-DE methods (like deconvolution) and spike-in normalization will capture technical biases such as sequencing depth and capture efficiency. Indeed, Figure 24 shows a rough positive correlation between the two sets of size factors, consistent with removal of technical biases by both methods. However, differences between the two sets are still present and are attributable to variability in total RNA content across the HSC population. Spike-in normalization will preserve differences in RNA content, whereas non-DE normalization will eliminate them.

plot(sizeFactors(sce), deconv.sf, pch=16, log="xy", xlab="Size factor (spike-in)",
    ylab="Size factor (deconvolution)")
Figure 24: Size factors from spike-in normalization, plotted against the size factors from deconvolution for all cells in the HSC data set. Axes are shown on a log-scale.

Whether or not total RNA content is relevant depends on the biological hypothesis. In the analyses described above, variability in total RNA across the population was treated as noise and removed by non-DE normalization. This may not always be appropriate if total RNA is associated with a biological difference of interest. For example, Islam et al. (2011) describe a 5-fold difference in total RNA between mouse embryonic stem cells and fibroblasts. Spike-in normalization will preserve this difference and may provide more accurate quantification in downstream analyses.

4.2 Blocking on the cell-cycle phase

Cell cycle phase is usually uninteresting in studies focusing on other aspects of biology. However, the effects of cell cycle on the expression profile can mask other effects and interfere with the interpretation of the results. This cannot be avoided by simply removing cell cycle marker genes, as the cell cycle can affect a substantial number of other transcripts (Buettner et al. 2015). Rather, more sophisticated strategies are required, which are demonstrated below using data from a study of T Helper 2 (TH2) cells (Mahata et al. 2014). Buettner et al. (2015) have already applied quality control and normalized the data, so we can use them directly as log-expression values (accessible as Supplementary Data 1 of https://dx.doi.org/10.1038/nbt.3102).

library(openxlsx)
incoming <- read.xlsx("nbt.3102-S7.xlsx", sheet=1, rowNames=TRUE)
incoming <- incoming[,!duplicated(colnames(incoming))] # Remove duplicated genes.
sce <- newSCESet(exprsData=t(incoming))

We empirically identify the cell cycle phase using the pair-based classifier in cyclone. The majority of cells in Figure 25 seem to lie in G1 phase, with small numbers of cells in the other phases.

anno <- select(org.Mm.eg.db, keys=rownames(sce), keytype="SYMBOL", column="ENSEMBL")
ensembl <- anno$ENSEMBL[match(rownames(sce), anno$SYMBOL)]
keep <- !is.na(ensembl) # Remove genes without ENSEMBL IDs.
assignments <- cyclone(exprs(sce)[keep,], mm.pairs, gene.names=ensembl[keep])
plot(assignments$score$G1, assignments$score$G2M, xlab="G1 score", ylab="G2/M score", pch=16)
Figure 25: Cell cycle phase scores from applying the pair-based classifier on the TH2 data set, where each point represents a cell.

We can block directly on the phase scores in downstream analyses, which is more graduated than using a strict assignment of each cell to a specific phase. This will absorb any phase-related effects on expression such that they will not affect estimation of the effects of other experimental factors. Note that users should ensure that the phase score is not confounded with other factors of interest. For example, model fitting is not possible if all cells in one experimental condition are in one phase, and all cells in another condition are in a different phase.

design <- model.matrix(~ G1 + G2M, assignments$score)
fit.block <- trendVar(sce, use.spikes=NA, trend="loess", design=design)
dec.block <- decomposeVar(sce, fit.block)
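
Confounding of the kind described above can be detected before model fitting by checking the rank of the design matrix. The sketch below uses hypothetical condition labels and discrete phase assignments for eight cells.

```r
# Hypothetical experiment: condition is perfectly confounded with the assigned phase.
condition <- factor(rep(c("control", "treated"), each=4))
phase <- factor(rep(c("G1", "G2M"), each=4))
confounded <- model.matrix(~ condition + phase)
is.deficient <- qr(confounded)$rank < ncol(confounded)
is.deficient

# With both phases present in each condition, the effects can be separated.
phase2 <- factor(rep(c("G1", "G2M"), times=4))
separable <- model.matrix(~ condition + phase2)
is.full.rank <- qr(separable)$rank == ncol(separable)
is.full.rank
```

A design matrix with fewer independent columns than coefficients indicates that the condition and phase effects cannot be estimated separately.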

For analyses that do not use design matrices, we can remove the cell cycle effect directly from the expression values using removeBatchEffect. The result of this procedure can be visualized with the PCA plots in Figure 26. Before removal, cells in the G1 and non-G1 phases tend to be concentrated in different parts of the plot. Afterwards, more intermingling is observed between the phases, which suggests that the cell cycle effect has been mitigated.

fit <- trendVar(sce, use.spikes=NA, trend="loess")
dec <- decomposeVar(sce, fit)
top.hvgs <- order(dec$bio, decreasing=TRUE)[1:500]
sce$G1score <- assignments$score$G1
out <- plotPCA(sce, select=top.hvgs, colour_by="G1score") + fontsize + ggtitle("Before removal")

top.hvgs2 <- order(dec.block$bio, decreasing=TRUE)[1:500]
corrected <- removeBatchEffect(exprs(sce), covariates=assignments$score[,c("G1", "G2M")])
sce2 <- newSCESet(exprsData=corrected, phenoData=phenoData(sce))
out2 <- plotPCA(sce2, select=top.hvgs2, colour_by="G1score") + fontsize + ggtitle("After removal")
multiplot(out, out2, cols=2)
Figure 26: PCA plots before (left) and after (right) removal of the cell cycle effect in the TH2 data set. Each point represents a cell, coloured according to its G1 score. Only the top 500 HVGs were used to make each PCA plot.
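
The idea behind removing a covariate effect can be sketched with ordinary least squares in base R. This is a simplified illustration on simulated data with a single covariate, not the code of removeBatchEffect. The names score, toy.exprs and toy.corrected are hypothetical.

```r
set.seed(42)

# Hypothetical data: 3 genes (rows) by 50 cells (columns), with a covariate
# (e.g., a phase score) that affects only the first gene.
score <- runif(50)
toy.exprs <- matrix(rnorm(150), nrow=3)
toy.exprs[1,] <- toy.exprs[1,] + 5 * score

# Regress each gene on the centred covariate.
centred <- score - mean(score)
X <- cbind(Intercept=1, Score=centred)
beta <- lm.fit(X, t(toy.exprs))$coefficients

# Subtract the fitted covariate effect, leaving the gene-wise means unchanged.
toy.corrected <- toy.exprs - outer(beta["Score",], centred)

# The corrected values are uncorrelated with the covariate.
cor(t(toy.corrected), score)
```

Centring the covariate before fitting means that only the variation associated with the score is removed, while the average expression of each gene is preserved.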

As an aside, this data set contains cells at various stages of differentiation (Mahata et al. 2014). This is an ideal use case for diffusion maps, which perform dimensionality reduction along a continuous process. In Figure 27, cells are arranged along a trajectory in the low-dimensional space. The first diffusion component is likely to correspond to TH2 differentiation, given that a key regulator Gata3 (J. Zhu et al. 2006) changes in expression from left to right.

plotDiffusionMap(sce2, colour_by="Gata3") + fontsize
Figure 27: A diffusion map for the TH2 data set, where each cell is coloured by its expression of Gata3.

4.3 Extracting annotation from Ensembl identifiers

Feature-counting tools typically report genes in terms of standard identifiers like Ensembl or Entrez. These identifiers are used as they are unambiguous and highly stable. However, they are difficult to interpret compared to gene symbols, which are more often used in the literature. We can easily convert from one to the other using annotation packages like org.Mm.eg.db. This is demonstrated below for Ensembl identifiers in the mouse embryonic stem cell (mESC) data set (Kolodziejczyk et al. 2015) obtained from http://www.ebi.ac.uk/teichmann-srv/espresso. The select call extracts the specified data from the annotation object, and the match call ensures that the first gene symbol is used if multiple symbols correspond to a single Ensembl identifier.

incoming <- read.table("counttable_es.csv", header=TRUE, row.names=1)
my.ids <- rownames(incoming)
library(org.Mm.eg.db)
anno <- select(org.Mm.eg.db, keys=my.ids, keytype="ENSEMBL", column="SYMBOL")
anno <- anno[match(my.ids, anno$ENSEMBL),]
head(anno)
##              ENSEMBL SYMBOL
## 1 ENSMUSG00000000001  Gnai3
## 2 ENSMUSG00000000003   Pbsn
## 3 ENSMUSG00000000028  Cdc45
## 4 ENSMUSG00000000031   <NA>
## 5 ENSMUSG00000000037  Scml2
## 6 ENSMUSG00000000049   Apoh
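
The behaviour of the match call can be seen on a small hypothetical annotation table, in which one identifier maps to two symbols and another has no symbol at all. The identifiers and symbols below are invented.

```r
# Hypothetical annotation where one identifier maps to two symbols.
toy.anno <- data.frame(
    ENSEMBL=c("ENSMUSG0001", "ENSMUSG0002", "ENSMUSG0002", "ENSMUSG0003"),
    SYMBOL=c("GeneA", "GeneB", "GeneB.alt", NA),
    stringsAsFactors=FALSE)
toy.ids <- c("ENSMUSG0003", "ENSMUSG0002", "ENSMUSG0001")

# match() returns the first matching row for each identifier,
# in the same order as 'toy.ids'.
toy.first <- toy.anno[match(toy.ids, toy.anno$ENSEMBL),]
toy.first
```

The result has exactly one row per input identifier, so it can be attached to the count matrix without any further reordering.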

To identify which rows correspond to mitochondrial genes, we need to use extra annotation describing the genomic location of each gene. For Ensembl, this involves using the TxDb.Mmusculus.UCSC.mm10.ensGene package.

library(TxDb.Mmusculus.UCSC.mm10.ensGene)
location <- select(TxDb.Mmusculus.UCSC.mm10.ensGene, keys=my.ids, 
    column="CDSCHROM", keytype="GENEID")
location <- location[match(my.ids, location$GENEID),]
is.mito <- location$CDSCHROM == "chrM" & !is.na(location$CDSCHROM)
sum(is.mito)
## [1] 13
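
The additional !is.na check matters because a comparison involving a missing location yields NA rather than FALSE. A minimal sketch with a hypothetical chromosome vector:

```r
# Hypothetical chromosome annotation, with NA for a row lacking a location.
toy.chrom <- c("chr1", "chrM", NA, "chrM", "chrX")

# The plain comparison propagates the NA for the unannotated row.
raw.mito <- toy.chrom == "chrM"
raw.mito

# Combining with !is.na() converts the NA into FALSE.
toy.mito <- toy.chrom == "chrM" & !is.na(toy.chrom)
toy.mito
sum(toy.mito)
```

Without the check, the NA would propagate into sum and into any subsetting that uses the logical vector.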

Identification of which rows correspond to spike-in transcripts is much easier, given that the ERCC spike-ins were used.

is.spike <- grepl("^ERCC", my.ids)
sum(is.spike)
## [1] 92

All of this information can be consolidated into a SCESet object for further manipulation.

anno <- anno[,-1,drop=FALSE]
rownames(anno) <- my.ids
sce <- newSCESet(countData=incoming, featureData=AnnotatedDataFrame(anno))
isSpike(sce) <- is.spike

We remove rows that do not correspond to endogenous genes or spike-in transcripts. This includes rows containing mapping statistics, e.g., the number of unaligned or unassigned reads. The object is then ready for downstream analyses as previously described.

sce <- sce[grepl("ENSMUS", rownames(sce)) | isSpike(sce),]
dim(sce)
## Features  Samples 
##    38653      704

5 Conclusions

This workflow provides a step-by-step guide for performing basic analyses of single-cell RNA-seq data. It provides instructions for a number of low-level steps such as quality control, normalization, cell cycle phase assignment, data exploration, HVG and DEG detection, and clustering. This is done with a number of different data sets to provide a range of usage examples. In addition, the processed data can be easily used for higher-level analyses with other Bioconductor packages. We anticipate that this workflow will assist readers in assembling analyses of their own scRNA-seq data.

6 Software availability

All software packages used in this workflow are publicly available from the Comprehensive R Archive Network (https://cran.r-project.org) or the Bioconductor project (http://bioconductor.org). The specific version numbers of the packages used are shown below, along with the version of the R installation. The workflow takes less than an hour and 5 GB of memory to run on a desktop computer.

sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] dynamicTreeCut_1.63-1                  RBGL_1.48.1                           
##  [3] graph_1.50.0                           TxDb.Mmusculus.UCSC.mm10.ensGene_3.2.2
##  [5] GenomicFeatures_1.24.5                 org.Mm.eg.db_3.3.0                    
##  [7] AnnotationDbi_1.34.4                   R.utils_2.4.0                         
##  [9] R.oo_1.20.0                            R.methodsS3_1.7.1                     
## [11] openxlsx_3.0.0                         gdata_2.17.0                          
## [13] gplots_3.0.1                           destiny_1.2.1                         
## [15] mvoutlier_2.0.6                        sgeostat_1.0-27                       
## [17] Rtsne_0.11                             edgeR_3.14.0                          
## [19] limma_3.28.21                          scran_1.0.4                           
## [21] scater_1.0.4                           ggplot2_2.1.0                         
## [23] DESeq2_1.12.4                          SummarizedExperiment_1.2.3            
## [25] Biobase_2.32.0                         GenomicRanges_1.24.3                  
## [27] GenomeInfoDb_1.8.7                     IRanges_2.6.1                         
## [29] S4Vectors_0.10.3                       BiocGenerics_0.18.0                   
## [31] BiocParallel_1.6.6                     knitr_1.14                            
## [33] BiocStyle_2.1.32                      
## 
## loaded via a namespace (and not attached):
##   [1] Hmisc_3.17-4            RcppEigen_0.3.2.9.0     plyr_1.8.4             
##   [4] igraph_1.0.1            sp_1.2-3                shinydashboard_0.5.3   
##   [7] splines_3.3.0           digest_0.6.10           htmltools_0.3.5        
##  [10] viridis_0.3.4           magrittr_1.5            cluster_2.0.4          
##  [13] Biostrings_2.40.2       annotate_1.50.0         matrixStats_0.50.2     
##  [16] colorspace_1.2-6        rrcov_1.4-3             dplyr_0.5.0            
##  [19] RCurl_1.95-4.8          tximport_1.0.3          genefilter_1.54.2      
##  [22] lme4_1.1-12             survival_2.39-5         zoo_1.7-13             
##  [25] gtable_0.2.0            zlibbioc_1.18.0         XVector_0.12.1         
##  [28] MatrixModels_0.4-1      car_2.1-3               kernlab_0.9-24         
##  [31] prabclus_2.2-6          DEoptimR_1.0-6          SparseM_1.72           
##  [34] VIM_4.5.0               scales_0.4.0            mvtnorm_1.0-5          
##  [37] DBI_0.5-1               GGally_1.2.0            Rcpp_0.12.7            
##  [40] sROC_0.1-2              xtable_1.8-2            laeken_0.4.6           
##  [43] foreign_0.8-66          proxy_0.4-16            mclust_5.2             
##  [46] Formula_1.2-1           vcd_1.4-3               FNN_1.1                
##  [49] RColorBrewer_1.1-2      fpc_2.1-10              acepack_1.3-3.3        
##  [52] modeltools_0.2-21       reshape_0.8.5           XML_3.98-1.4           
##  [55] flexmix_2.3-13          nnet_7.3-12             locfit_1.5-9.1         
##  [58] labeling_0.3            reshape2_1.4.1          munsell_0.4.3          
##  [61] tools_3.3.0             RSQLite_1.0.0           pls_2.5-0              
##  [64] evaluate_0.9            stringr_1.1.0           cvTools_0.3.2          
##  [67] yaml_2.1.13             robustbase_0.92-6       caTools_1.17.1         
##  [70] nlme_3.1-128            mime_0.5                quantreg_5.29          
##  [73] formatR_1.4             biomaRt_2.28.0          pbkrtest_0.4-6         
##  [76] e1071_1.6-7             statmod_1.4.26          tibble_1.2             
##  [79] robCompositions_2.0.2   geneplotter_1.50.0      pcaPP_1.9-60           
##  [82] stringi_1.1.1           lattice_0.20-34         trimcluster_0.1-2      
##  [85] Matrix_1.2-6            nloptr_1.0.4            lmtest_0.9-34          
##  [88] data.table_1.9.6        bitops_1.0-6            rtracklayer_1.32.2     
##  [91] httpuv_1.3.3            R6_2.1.3                latticeExtra_0.6-28    
##  [94] KernSmooth_2.23-15      gridExtra_2.2.1         boot_1.3-18            
##  [97] MASS_7.3-45             gtools_3.5.0            assertthat_0.1         
## [100] chron_2.3-47            rhdf5_2.16.0            rjson_0.2.15           
## [103] GenomicAlignments_1.8.4 Rsamtools_1.24.0        diptest_0.75-7         
## [106] mgcv_1.8-12             grid_3.3.0              rpart_4.1-10           
## [109] class_7.3-14            minqa_1.2.4             rmarkdown_1.0          
## [112] scatterplot3d_0.3-37    shiny_0.14

7 Author contributions

A.T.L.L. developed and tested the workflow on all data sets. All authors wrote and approved the final manuscript.

8 Competing interests

No competing interests were disclosed.

9 Grant information

A.T.L.L. and J.C.M. were supported by core funding from Cancer Research UK (award no. A17197). J.C.M. was also supported by core funding from EMBL.

10 Acknowledgements

We would like to thank Davis McCarthy, for assistance with coding for scater; Antonio Scialdone, for helpful discussions regarding spike-ins and HVGs; and Michael Epstein, for trialling the workflow on other data sets.

References

Anders, S., and W. Huber. 2010. “Differential expression analysis for sequence count data.” Genome Biol. 11 (10): R106.

Angel, P., and M. Karin. 1991. “The role of Jun, Fos and the AP-1 complex in cell-proliferation and transformation.” Biochim. Biophys. Acta 1072 (2-3): 129–57.

Angerer, P., L. Haghverdi, M. Buttner, F. J. Theis, C. Marr, and F. Buettner. 2015. “destiny: diffusion maps for large-scale single-cell data in R.” Bioinformatics, Dec.

Brennecke, P., S. Anders, J. K. Kim, A. A. Kołodziejczyk, X. Zhang, V. Proserpio, B. Baying, et al. 2013. “Accounting for technical noise in single-cell RNA-seq experiments.” Nat. Methods 10 (11): 1093–5.

Buettner, F., K. N. Natarajan, F. P. Casale, V. Proserpio, A. Scialdone, F. J. Theis, S. A. Teichmann, J. C. Marioni, and O. Stegle. 2015. “Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells.” Nat. Biotechnol. 33 (2): 155–60.

Fan, J., N. Salathia, R. Liu, G. E. Kaeser, Y. C. Yung, J. L. Herman, F. Kaper, et al. 2016. “Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis.” Nat. Methods, Jan.

Heng, T. S., M. W. Painter, K. Elpek, V. Lukacs-Kornek, N. Mauermann, S. J. Turley, D. Koller, et al. 2008. “The Immunological Genome Project: networks of gene expression in immune cells.” Nat. Immunol. 9 (10): 1091–4.

Hicks, S. C., M. Teng, and R. A. Irizarry. 2015. “On the widespread and critical impact of systematic bias and batch effects in single-cell RNA-Seq data.” BioRxiv. Cold Spring Harbor Labs Journals. doi:10.1101/025528.

Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating high-throughput genomic analysis with Bioconductor.” Nat. Methods 12 (2): 115–21.

Ilicic, T., J. K. Kim, A. A. Kolodziejczyk, F. O. Bagger, D. J. McCarthy, J. C. Marioni, and S. A. Teichmann. 2016. “Classification of low quality cells from single-cell RNA-seq data.” Genome Biol. 17 (1): 29.

Islam, S., U. Kjallquist, A. Moliner, P. Zajac, J. B. Fan, P. Lonnerberg, and S. Linnarsson. 2011. “Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq.” Genome Res. 21 (7): 1160–7.

Islam, S., A. Zeisel, S. Joost, G. La Manno, P. Zajac, M. Kasper, P. Lonnerberg, and S. Linnarsson. 2014. “Quantitative single-cell RNA-seq with unique molecular identifiers.” Nat. Methods 11 (2): 163–66.

Julia, M., A. Telenti, and A. Rausell. 2015. “Sincell: an R/Bioconductor package for statistical assessment of cell-state hierarchies from single-cell RNA-seq.” Bioinformatics 31 (20): 3380–2.

Kim, J. K., A. A. Kolodziejczyk, T. Ilicic, S. A. Teichmann, and J. C. Marioni. 2015. “Characterizing noise structure in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression.” Nat. Commun. 6: 8687.

Kolodziejczyk, A. A., J. K. Kim, J. C. Tsang, T. Ilicic, J. Henriksson, K. N. Natarajan, A. C. Tuck, et al. 2015. “Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation.” Cell Stem Cell 17 (4): 471–85.

Langfelder, P., B. Zhang, and S. Horvath. 2008. “Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R.” Bioinformatics 24 (5): 719–20.

Law, C. W., Y. Chen, W. Shi, and G. K. Smyth. 2014. “voom: Precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biol. 15 (2): R29.

Leng, N., L. F. Chu, C. Barry, Y. Li, J. Choi, X. Li, P. Jiang, R. M. Stewart, J. A. Thomson, and C. Kendziorski. 2015. “Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments.” Nat. Methods 12 (10): 947–50.

Liao, Y., G. K. Smyth, and W. Shi. 2013. “The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote.” Nucleic Acids Res. 41 (10): e108.

———. 2014. “featureCounts: an efficient general purpose program for assigning sequence reads to genomic features.” Bioinformatics 30 (7): 923–30.

Love, M. I., S. Anders, V. Kim, and W. Huber. 2015. “RNA-Seq workflow: gene-level exploratory analysis and differential expression.” F1000Res 4: 1070.

Love, M. I., W. Huber, and S. Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biol. 15 (12): 550.

Lun, A. T. L., K. Bach, and J. C. Marioni. 2016. “Pooling Across Cells to Normalize Single-Cell RNA Sequencing Data with Many Zero Counts.” Genome Biol. 17: 75.

Mahata, B., X. Zhang, A. A. Kolodziejczyk, V. Proserpio, L. Haim-Vilmovsky, A. E. Taylor, D. Hebenstreit, et al. 2014. “Single-cell RNA sequencing reveals T helper cells synthesizing steroids de novo to contribute to immune homeostasis.” Cell Rep. 7 (4): 1130–42.

Marinov, G. K., B. A. Williams, K. McCue, G. P. Schroth, J. Gertz, R. M. Myers, and B. J. Wold. 2014. “From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing.” Genome Res. 24 (3): 496–510.

McCarthy, D. J., Y. Chen, and G. K. Smyth. 2012. “Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.” Nucleic Acids Res. 40 (10): 4288–97.

Phipson, B., and G. K. Smyth. 2010. “Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn.” Stat. Appl. Genet. Mol. Biol. 9: Article39.

Picelli, S., O. R. Faridani, A. K. Bjorklund, G. Winberg, S. Sagasser, and R. Sandberg. 2014. “Full-length RNA-seq from single cells using Smart-seq2.” Nat Protoc 9 (1): 171–81.

Pollen, A. A., T. J. Nowakowski, J. Shuga, X. Wang, A. A. Leyrat, J. H. Lui, N. Li, et al. 2014. “Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex.” Nat. Biotechnol. 32 (10): 1053–8.

Ritchie, M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth. 2015. “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Res. 43 (7): e47.

Robinson, M. D., and A. Oshlack. 2010. “A scaling normalization method for differential expression analysis of RNA-seq data.” Genome Biol. 11 (3): R25.

Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2010. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (1): 139–40.

Scialdone, A., K. N. Natarajan, L. R. Saraiva, V. Proserpio, S. A. Teichmann, O. Stegle, J. C. Marioni, and F. Buettner. 2015. “Computational assignment of cell-cycle stage from single-cell transcriptome data.” Methods 85 (Sep): 54–61.

Stegle, O., S. A. Teichmann, and J. C. Marioni. 2015. “Computational and analytical challenges in single-cell transcriptomics.” Nat. Rev. Genet. 16 (3): 133–45.

Trapnell, C., D. Cacchiarelli, J. Grimsby, P. Pokharel, S. Li, M. Morse, N. J. Lennon, K. J. Livak, T. S. Mikkelsen, and J. L. Rinn. 2014. “The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells.” Nat. Biotechnol. 32 (4): 381–86.

Vallejos, C. A., J. C. Marioni, and S. Richardson. 2015. “BASiCS: Bayesian Analysis of Single-Cell Sequencing Data.” PLoS Comput. Biol. 11 (6): e1004333.

Van der Maaten, L., and G. Hinton. 2008. “Visualizing Data Using T-SNE.” J. Mach. Learn. Res. 9: 2579–605.

Wilson, N. K., D. G. Kent, F. Buettner, M. Shehata, I. C. Macaulay, F. J. Calero-Nieto, M. Sanchez Castillo, et al. 2015. “Combined Single-Cell Functional and Gene Expression Analysis Resolves Heterogeneity within Stem Cell Populations.” Cell Stem Cell 16 (6): 712–24.

Zeisel, A., A. B. Munoz-Manchado, S. Codeluppi, P. Lonnerberg, G. La Manno, A. Jureus, S. Marques, et al. 2015. “Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.” Science 347 (6226): 1138–42.

Zhu, J., H. Yamane, J. Cote-Sierra, L. Guo, and W. E. Paul. 2006. “GATA-3 promotes Th2 responses through three different mechanisms: induction of Th2 cytokine production, selective growth of Th2 cells and inhibition of Th1 cell-specific factors.” Cell Res. 16 (1): 3–10.