A Seurat object serves as a container for single-cell dataset data (e.g., count matrix) and analyses (e.g., PCA or clustering results).
- From CellRanger output (folder containing barcodes.tsv, genes.tsv, and matrix.mtx)
- Create a sparse count matrix by reading in the data:
Read10X() - Convert the sparse count matrix in to a Seurat object:
CreateSeuratObject()
- Create a sparse count matrix by reading in the data:
- From CellRanger output (file ending in: count_raw_feature_bc_matrix.h5)
- Create a sparse count matrix by reading in the data:
Read10X_h5() - Convert the sparse count matrix in to a Seurat object:
CreateSeuratObject()
- Create a sparse count matrix by reading in the data:
- From GEO datasets with multiple samples
- Read in count and metadata files using:
read_tsv() - Convert the files into Seurat objects:
purr::map2(counts, meta, ~CreateSeuratObject(counts = as(.x, "sparseMatrix"), meta.data = .y)) - Merge the objects into one Seurat object:
purrr::reduce(obj, function(x,y) {merge(x,y)})
- Read in count and metadata files using:
- From datasets stored in SeuratData
- Load in data using:
LoadData()
- Load in data using:
Low quality cells can be the result of cell damage or failure in library preparation, and typically manifest as cells with low total counts, a low level of expressed genes ot high mitochondrial proportions. If low quality cells are not removed, the first few principal components and the genes with the largest variances will be driven quality rather than biology.
- Proportion of mitochondrial genes
- A high percentage of mitochondrial genes is indicative of low quality or dying cells, due to the loss of cytoplasmic RNA from perforated cells.
- Create new column using:
PercentageFeatureSet(pattern = "^MT-")
- Number of unique genes detected in each cell (nFeature_RNA)
- Low numbers of unique genes is likely a low quality cell, because the diverse transcript population has not been successfully captured.
- Total number counts detected in each cell (nCount_RNA)
- Low number of counts is indicative of low quality cells, due to RNA having been lost at some point during the library preparation (e.g., cell lysis or inefficient cDNA capture and amplification).
- A high number of genes or counts can be indicative of doublets or multiplets.
- Create plots to explore data and determine thresholds.
VlnPlot(nsclc_seu, features = c("nCount_RNA", "nFeature_RNA", "percent_mt"), ncol = 3)FeatureScatter(nsclc_seu, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = 'lm')
- Filter data according to the determined thresholds.
subset(subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
Note
Quality control metrics are dependent on cell type (e.g, hepatocytes are highly metabolically active and therefore have higher mitochondrial percentages than other cells) and data type (e.g., counts-based data vs UMI-based data).
- Normalizes the feature expression measurements for each cell by the total expression, then multiples this by a scale factor (10,000) and log-transforms the result.
NormalizeData(normalization.method = "LogNormalize", scale.factor = 10000)
SCTransform()can be used in the workflow in place ofNormalizeData(),FindVariableFeatures()andScaleData(). Log normalization relies on the assumption that each cell originally contained the same number of molecules: SCTransform does not make this assumption. Confounding sources of variation, like high mitochondrial percentages, can also be removed with SCTransform().PercentageFeatureSet(pattern = "^MT-", col.name = "percent.mt")followed bySCTransform(vars.to.regress = "percent.mt", verbose = FALSE)
Cells with low cell-to-cell variation (i.e., housekeeping genes) are not very informative, whereas focusing on variable genes helps to highlight the biological signal in single-cell datasets.
- Find variable features
FindVariableFeatures(nfeatures = 2000) - Examine the variable features
VariableFeaturePlot(): look at most variable genes
- Variable features are used with in downstream analyses (e.g., PCA).
This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate.
- Shifts the expression of each gene, so that the mean expression across cells is 0.
- Scales the expression of each gene, so that the variance across cells is 1.
ScaleData(features = default is variable features)
- Standard pre-processing step prior to dimensional reduction techniques like PCA
Linear transformation of the original dataset into principal components ranked in decreasing order of variance. The variance of the data is maximized in the lower dimensional space.
- Dimensionality reduction has two components:
- Feature selection (selection of a smaller subset from the original set of variables)
- Based on the assumption that genes showing high variability correspond to biological variation
- Feature extraction (high dimension data is projected to a lower dimension)
- Feature selection (selection of a smaller subset from the original set of variables)
- Run the principal component analysis
RunPCA()
- By default, only previously determined variable features are used as input.
- Visualize the results
VizDimLoadings(dims = 1:2, reduction = "pca")DimPlot(reduction = “pca”)DimHeatMap(dims = 1:4, cells = 500, balanced = TRUE)- Setting cells to a number plots the “extreme” cells on both ends of the spectrum and dramatically decreases plotting time for large datasets.
- Determine the dimensionality of the dataset (number of PCs to use)
ElbowPlot()
Note
scRNAseq has a highly non-linear structure, so PCA alone is not best suited for data visualization.
Cells are embedded into a graph structure, edges are drawn between cells with similar feature expression patterns, and the graph is partitioned into highly interconnected communties/clusters.
k-means clustering randomly initializes k cluster centers, assigns points to nearest center and then updates the cluster centers and repeats the assignment process until the centers stop changing. Intitial cluster centers are randomly assigned: set a seed
set.seed(1234)for consistent clustering.
- A k-nearest neighbors (kNN) graph is constructed based on the Euclidean distance in the PCA space and the edge weights between any two cells are refined based on the shared overlap in their local neighborhoods (Jaccard similarity).
FindNeighbors(dims = # of PCs decided upon in the previous step)
- Modularity optimization techniques (default = Louvain algorithm) are applied to iteratively group cells together with the goal of optimizing the standard modularity function.
FindClusters(resolution = c(0.1, 0.3, 0.5, 0.7, 1))- Resolution parameter sets the granularity of the downstream clustering = higher values result in greater numbers of clusters
- 0.4 -1.2 typically has good results for single-cell datasets of around 3000 cells and optimal resolution often increases for larger datasets
- Visulaize the effect of the different resolution values
DimPlot(group.by = 'RNA_snn_res.0.1', label = TRUE)Idents() <- 'RNA_snn_res.0.1'
Cells that are grouped together within graph-based clusters determined during the clustering step should co-localize on these dimension reduction plots
- Run non-linear dimensional reduction and examine the results
RunUMAP(dims = # of PCs determined previously)DimPlot(reduction = “umap”)
- UMAP aims to preserve local distances in the dataset, but often does not preserve more glocal relationships
Note
UMAP can be used for visualization, but biological conclusions should not be drawn solely from this visualization technique. More information on UMAP interpretation can be found here: Is UMAP Accurate?
Perform integration if batch effects were observed in the UMAP (e.g., cells clustering based on batches, donors or conditions)
- Perform integration
IntegrateLayers(method = HarmonyIntegration, orig.reduction = “pca”, new.reduction = “harmony”)- Performs integration in low-dimensional space and returns a dimensional object (in reductions slot: pca, umap and harmony)
Note
Seurat supports five integration methods: Anchor-based CCA integration, Anchor-based RPCA integration, Harmony, FastMNN, and scVI. More information about batch correction can be found here: Batch Correction.
- Repeat clustering with integrated data
FindNeighbors(reduction = “harmony”)FindClusters(cluster.name = “harmony_clusters”)
- Re-run non-linear dimensional reduction
RunUMAP(reduction = “harmony”, reduction.name = “umap_harmony”)
- Visualize the results to see if batch effects persist
DimPlot(reduction = “umap_harmony”)VlnPlot(group.by = “harmony_clusters”)
- Integration can be performed after using ```SCTransform()````
SCTransform()RunPCA()RunHarmony(assay.use="SCT", group.by.vars = "Method")FindNeighbors(reduction = "harmony")FindClusters()RunUMAP(reduction = "harmony")DimPlot(group.by = "Method")
- Rejoin the layers
- Collapses the individual datasets together and recreates the original counts and data layers.
JoinLayers()
Important
Layers need to be rejoined before performing differential expression analysis.
- Identify marker genes for the clusters and assign the annotations manually.
- If your clusters are composed of cells from more than one condition (e.g., control and stimulated).
FindConservedMarkers(ident.1 = 3, grouping.var = "stim")- Finds the markers in cluster three that are concerved between the control and stimulated groups.
- Not setting
ident.2to a specific cluster compares the cluster inident.1to all other clusters. - Visulaize the identified markers
VlnPlot(features = c("FCGR3A"))FeaturePlot(features = c("FCGR3A"), min.cutoff = "q10")min.cutoff = "q10": genes in the tenth quartile will appear grey, increasing contrast.
- Rename the cluster based on the identified markers.
Idents()RenameIdents("3" = "CD16 Mono")
- Annotation using a reference Seurat object
- Find a set of anchors between a reference and a query object.
anchors <- FindTransferAnchors(reference = reference, query = query, dims = 1:30, reference.reduction = "pca")
- Use anchors to transfer data from the reference to the query object.
predictions <- TransferData(anchorset = anchors, refdata = reference$celltype, dims = 1:30)
- Add the cell type predictions to the query metadata
query <- AddMetaData(query, metadata = predictions)
- Verify cell type predictions by examining canonical cell type marker expression in the clusters.
VlnPlot(query, c("REG1A", "PPY"), group.by = "predicted.id")
- Automated annotation using an annotated object from a database
- The reference needs to contain a cell population sufficiently similar to the cell population in your query object. Using multiple references increases the likelihood that the cell types in your query are accounted for.
- celldex: provides a collection of reference expression datasets with curated cell type labels.
- SingleR: uses reference transcriptomic datasets of pure cell types to infer the cell of origin of each of the single cells independently.
- Default method: performs classification seperately with each reference and then combines the results and chooses the highest overall prediction score.
- Load the reference libraries:
hpca <- celldex::HumanPrimaryCellAtlasData()anddice <- celldex::DatabaseImmuneCellExpressionData() - Get the counts from you query dataset:
counts <- GetAssayData(obj, slot = "counts") - Run SingleR:
pred <- SingleR(test = counts, ref = list(HPCA = hpca, DICE = dice), labels = list(hpca$label.main, dice$label.main)) - Save the cell type labels to the Seurat metadata:
obj$pred.labels <- pred[match(rownames(obj@meta.data), rownames(pred)), "labels"]- May have to fix inconsistencies with cell labels.
- Visualize the labels:
DimPlot(obj, reduction = "umap", group.by = "pred.labels", label = TRUE) - Get the marker genes from each reference dataset for each cell type:
metadata(pred$orig.results$HPCA)$de.genesandmetadata(com.res2$orig.results$DICE)$de.genes
Note
Lack of consistency in cell labels across references can complicate interpretation.
- Identify genes that are differentially expressed between conditions for a given cluster.
- Create a column that has information about both cluster and condition.
obj$cluster.cnd <- paste0(obj$cluster,"_", obj$condition)Idents(obj) <- obj$cluster.cnd
- Find differential markers between the groups
data.frame <- FindMarkers(obj, ident.1 = "cluster.cnd_1", ident.2 = "cluster.cnd_2")
- Visualize the differential gene expression
FeaturePlot(features = c("FCGR3A", "VMO1"), split.by = "stim", min.cutoff = "q10")VlnPlot(features = c("FCGR3A", "VMO1"))DotPlot(features = c("FCGR3A", "VMO1"), split.by = "stim")
- Create a column that has information about both cluster and condition.
- Pseudobulking + differential expression analysis
- Single cell methods treat each cell as a sample: p-values are inflated, variation across the population is not truly investigated, and there are issues with unmodelled correlations between samples (the samples/single cells are not independent of each other).
- Aggregate counts across sample, condition and cell type.
pseudo_obj <- AggregateExpression(obj, assays = "RNA", return.seurat = TRUE, group.by = c("stim", "donor_id", "seurat_annotations"))
- Create a column that has information about both cell type and condition.
obj$celltype.stim <- paste(obj$seurat_annotations, obj$stim, sep = "_")Idents(obj) <- "celltype.stim"
- Perform differential expression analysis using DESeq2 at the sample level
FindMarkers(object = obj, ident.1 = "CD14 Mono_STIM", ident.2 = "CD14 Mono_CTRL", test.use = "DESeq2")
Note
Differential expression analysis using FindMarkers() can be performed using alternative tests, including: "wilcox", "wilcox_limma", "bimod", "roc", "t", "poisson", "negbiom", "LR" and "MAST".