Seurat – Guided Clustering Tutorial

Adapted from https://satijalab.org/seurat/v3.0/pbmc3k_tutorial.html.

For this tutorial, we will be analyzing a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. It contains 2,700 single cells that were sequenced on the Illumina NextSeq 500.

This is the raw data:

matrix.mtx

genes.tsv

barcodes.tsv

We start by loading the required libraries:

library(dplyr)
library(ggplot2)
library(Seurat)

Seurat requires the data files to be in a common directory, while Nextjournal's data versioning stores them separately. We can work around this using symbolic links.

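mkdir -p /data/pbmc3k/hg19/
# Link each raw file into the common directory. Absolute source paths keep the
# links valid; this assumes the three files sit in the current working directory.
ln -sf "$PWD/matrix.mtx" /data/pbmc3k/hg19/
ln -sf "$PWD/genes.tsv" /data/pbmc3k/hg19/
ln -sf "$PWD/barcodes.tsv" /data/pbmc3k/hg19/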

We start by reading in the data. The Read10X function reads the output of the cellranger pipeline from 10X and returns a unique molecular identifier (UMI) count matrix. Each value in this matrix is the number of molecules detected for a feature (i.e. gene; row) in a cell (column).

pbmc.data <- Read10X(data.dir = "/data/pbmc3k/hg19/")

We next use the count matrix to create a Seurat object. The object serves as a container for both the data (like the count matrix) and the analysis (like PCA or clustering results) of a single-cell dataset. For example, the count matrix is stored in pbmc[["RNA"]]@counts. For a technical discussion of the Seurat object structure, check out our GitHub Wiki.

pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
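
The min.cells and min.features arguments drop sparsely detected genes and nearly empty cells when the object is created. The sketch below illustrates the idea on a made-up toy matrix using the Matrix package. It is not Seurat's own code; the variable names, the small min_features value, and the order of the two steps (cells first, then genes) are assumptions made for illustration.

```r
library(Matrix)

# Toy count matrix: 4 genes (rows) x 5 cells (columns); values are made up.
counts <- Matrix(
  c(1, 0, 2, 0, 0,
    0, 0, 3, 0, 0,
    4, 1, 1, 0, 2,
    0, 2, 5, 0, 1),
  nrow = 4, byrow = TRUE, sparse = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5))
)

min_features <- 2  # hypothetical threshold (the tutorial uses 200)
min_cells <- 3     # same value as in the tutorial

# Keep cells in which at least `min_features` genes were detected.
genes_per_cell <- Matrix::colSums(counts > 0)
counts <- counts[, genes_per_cell >= min_features, drop = FALSE]

# Then keep genes detected in at least `min_cells` of the remaining cells.
cells_per_gene <- Matrix::rowSums(counts > 0)
filtered <- counts[cells_per_gene >= min_cells, , drop = FALSE]

dim(filtered)
```

With these toy values, the empty cell4 is dropped and only gene3 and gene4 pass the gene filter.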

What does data in a count matrix look like?

pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
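
Read10X returns a sparse matrix, so zeros (no molecules detected) are printed as "." in the output above. Because most entries of a single-cell count matrix are zero, a sparse representation usually takes far less memory than a dense one. The sketch below shows this on simulated counts; the matrix dimensions and the Poisson rate are arbitrary assumptions, not properties of the PBMC dataset.

```r
library(Matrix)

set.seed(1)
# Simulated counts: 2,000 genes x 500 cells, roughly 95% zeros (made-up data).
n_genes <- 2000
n_cells <- 500
values <- as.numeric(rpois(n_genes * n_cells, lambda = 0.05))
dense <- matrix(values, nrow = n_genes, ncol = n_cells)
sparse <- Matrix(dense, sparse = TRUE)

# Sparse matrices print zeros as "."
sparse[1:3, 1:10]

# Compare the memory footprint of the two representations.
dense_size <- as.numeric(object.size(dense))
sparse_size <- as.numeric(object.size(sparse))
dense_size / sparse_size
```

The same comparison can be run on the real data with object.size(as.matrix(pbmc.data)) and object.size(pbmc.data).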

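We then compute the percentage of counts that come from mitochondrial genes (gene names starting with MT-) and store it as a per-cell QC metric:

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")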
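
To make the metric concrete, the following base-R sketch computes the same kind of percentage by hand: the counts of all features matching the pattern, divided by the cell's total counts, times 100. The toy matrix and its values are made up, and the code illustrates the calculation rather than reproducing Seurat's implementation.

```r
# Toy count matrix: 5 genes (rows) x 3 cells (columns); values are made up.
counts <- matrix(
  c( 1,  0, 10,
     1,  1,  0,
    48, 10, 45,
    30,  5, 25,
    20,  4, 20),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("MT-ND1", "MT-CO1", "CD3D", "MS4A1", "TCL1A"),
    c("cell1", "cell2", "cell3")
  )
)

# Select mitochondrial genes with the same pattern used in the tutorial.
mt_genes <- grepl("^MT-", rownames(counts))

# Per cell: mitochondrial counts as a percentage of all counts.
percent_mt <- 100 * colSums(counts[mt_genes, , drop = FALSE]) / colSums(counts)
percent_mt
```

Here cell1, cell2 and cell3 come out at 2%, 5% and 10% mitochondrial counts, so only cell1 would pass a strict "less than 5%" cutoff.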

In the example below, we visualize QC metrics, and use these to filter cells.

We filter out cells that have unique feature counts over 2,500 or under 200
We filter out cells that have >5% mitochondrial counts

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
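
The thresholds listed above can be written as a single logical condition on the per-cell metrics. The sketch below applies that condition to a small, hypothetical metadata table so that it runs on its own; the values and cell names are made up. The final comment shows how the same condition would be passed to subset() on the Seurat object.

```r
# Hypothetical per-cell QC metrics; in the tutorial these live in pbmc@meta.data.
qc <- data.frame(
  nFeature_RNA = c(150, 800, 1200, 2600, 900),
  percent.mt   = c(2.0, 3.1, 7.5, 1.0, 4.9),
  row.names    = paste0("cell", 1:5)
)

# Keep cells with 200 < nFeature_RNA < 2500 and less than 5% mitochondrial counts.
keep <- qc$nFeature_RNA > 200 & qc$nFeature_RNA < 2500 & qc$percent.mt < 5
kept_cells <- rownames(qc)[keep]
kept_cells

# On the Seurat object itself, the same filter would be:
# pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
```

In this toy table, cell1 fails the lower feature bound, cell3 fails the mitochondrial cutoff, and cell4 exceeds the upper feature bound, leaving cell2 and cell5.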
