Martin Kavalar / Jun 11 2019

Seurat – Guided Clustering Tutorial

For this tutorial, we will analyze a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. It contains 2,700 single cells that were sequenced on the Illumina NextSeq 500.

This is the raw data:

matrix.mtx
genes.tsv
barcodes.tsv

We start by loading the required libraries:

library(dplyr)
library(ggplot2)
library(Seurat)

Seurat requires the data files to be in a common directory, while Nextjournal's data versioning stores them separately. We can work around this using symbolic links.

mkdir -p /data/pbmc3k/hg19/
ln -sf matrix.mtx /data/pbmc3k/hg19/
ln -sf genes.tsv /data/pbmc3k/hg19/
ln -sf barcodes.tsv /data/pbmc3k/hg19/
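
The same staging step can also be done from R. The following is a minimal sketch, not the notebook's actual setup: the source directory and the empty placeholder files are hypothetical stand-ins for the versioned data files, and everything is created under tempdir() so the example can be run anywhere on a Unix-like system.

```r
# Hypothetical stand-ins for the three versioned input files; real paths will differ.
src_dir <- file.path(tempdir(), "nj_sources")
dir.create(src_dir, showWarnings = FALSE)
src_files <- file.path(src_dir, c("matrix.mtx", "genes.tsv", "barcodes.tsv"))
file.create(src_files)

# Common directory that Read10X() could later be pointed at.
target_dir <- file.path(tempdir(), "pbmc3k", "hg19")
dir.create(target_dir, recursive = TRUE, showWarnings = FALSE)

# Link each file into the common directory under its expected name.
links <- file.path(target_dir, basename(src_files))
unlink(links)  # mimic the -f flag of ln: replace any existing links
linked <- file.symlink(src_files, links)

list.files(target_dir)
```

file.symlink() returns one logical value per link, so all(linked) is a quick check that every file was staged.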

We start by reading in the data. The Read10X function reads the output of the cellranger pipeline from 10X and returns a unique molecular identifier (UMI) count matrix. Each value in this matrix is the number of molecules detected for a given feature (i.e. gene; row) in a given cell (column).

pbmc.data <- Read10X(data.dir = "/data/pbmc3k/hg19/")
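
To make the structure of the returned matrix concrete, here is a rough sketch of reading such a directory by hand with the Matrix package. It is an illustration under assumptions, not Read10X's implementation: the toy directory, gene names, and barcodes are made up, and the second column of genes.tsv is assumed to hold the gene symbols.

```r
library(Matrix)

# Write a tiny, made-up 10X-style directory: 3 genes x 2 cells, 3 non-zero counts.
toy_dir <- file.path(tempdir(), "toy10x")
dir.create(toy_dir, showWarnings = FALSE)
writeLines(c("%%MatrixMarket matrix coordinate integer general",
             "3 2 3",                       # rows, columns, non-zero entries
             "1 1 4", "3 1 1", "2 2 7"),    # row, column, count
           file.path(toy_dir, "matrix.mtx"))
writeLines(c("ENSG01\tGENE_A", "ENSG02\tGENE_B", "ENSG03\tGENE_C"),
           file.path(toy_dir, "genes.tsv"))
writeLines(c("AAAC-1", "AAAG-1"), file.path(toy_dir, "barcodes.tsv"))

# Read the sparse counts, then attach gene symbols (rows) and barcodes (columns).
counts <- as(readMM(file.path(toy_dir, "matrix.mtx")), "CsparseMatrix")
genes <- read.delim(file.path(toy_dir, "genes.tsv"), header = FALSE,
                    stringsAsFactors = FALSE)
barcodes <- readLines(file.path(toy_dir, "barcodes.tsv"))
rownames(counts) <- genes[[2]]
colnames(counts) <- barcodes

counts
```

Printing the result shows zeros as dots, which is how sparse matrices from the Matrix package are displayed.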

We next use the count matrix to create a Seurat object. The object serves as a container that holds both the data (like the count matrix) and the analysis results (like PCA or clustering) for a single-cell dataset. For a technical discussion of the Seurat object structure, check out our GitHub Wiki. For example, the count matrix is stored in pbmc[["RNA"]]@counts.

pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
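
The min.cells and min.features arguments drop rarely detected genes and nearly empty cells at creation time. The sketch below shows the idea on a made-up toy matrix in base R, with much smaller thresholds than the call above. The variable names are illustrative, and applying the cell filter before the gene filter is an assumption about the order, not something the text above states.

```r
# Toy count matrix: 4 genes (rows) x 5 cells (columns); values are made up.
counts <- matrix(c(1, 0, 2, 0, 0,
                   3, 1, 1, 0, 2,
                   0, 0, 4, 0, 0,
                   2, 5, 1, 0, 1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)))

min_cells <- 3     # keep genes detected in at least this many cells
min_features <- 2  # keep cells with at least this many detected genes

# Cell filter: count the genes with a non-zero count in each cell.
features_per_cell <- colSums(counts > 0)
counts <- counts[, features_per_cell >= min_features, drop = FALSE]

# Gene filter (assumed to run second): count the remaining cells expressing each gene.
cells_per_gene <- rowSums(counts > 0)
counts <- counts[cells_per_gene >= min_cells, , drop = FALSE]

dim(counts)
```

Here cell4 is removed because it has no detected genes, and gene1 and gene3 are removed because fewer than three of the remaining cells express them.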

What does data in a count matrix look like?

pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
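
# Store the percentage of counts from mitochondrial genes (names starting with "MT-") as a QC metric
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")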
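
The percent.mt value is, per cell, the share of counts that come from features matching the pattern. A minimal base R sketch of that arithmetic on a made-up matrix (the gene names and counts are purely illustrative) could look like this:

```r
# Toy counts: two mitochondrial genes and one other gene across three cells.
counts <- matrix(c(8, 0, 5,
                   2, 1, 0,
                   90, 19, 45),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("MT-CO1", "MT-ND1", "CD3D"),
                                 c("cell1", "cell2", "cell3")))

# Select genes whose name starts with "MT-".
mt_genes <- grepl("^MT-", rownames(counts))

# Mitochondrial counts divided by total counts per cell, as a percentage.
percent_mt <- colSums(counts[mt_genes, , drop = FALSE]) / colSums(counts) * 100
percent_mt
```

For cell1, 10 of 100 counts are mitochondrial, giving 10%.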

In the example below, we visualize QC metrics, and use these to filter cells.

  • We filter out cells that have more than 2,500 or fewer than 200 unique features
  • We filter out cells that have >5% mitochondrial counts

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
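
The two filtering rules above can be written as a single logical condition. The sketch below applies it to a made-up per-cell QC table in base R. On the Seurat object itself, the equivalent step would typically be subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5), which is not shown in the section above and should be treated as an assumption here.

```r
# Hypothetical per-cell QC metrics; the values are invented for illustration.
qc <- data.frame(
  nFeature_RNA = c(150, 800, 2600, 1200, 900),
  percent.mt = c(2.0, 3.1, 1.5, 7.4, 4.9),
  row.names = paste0("cell", 1:5)
)

# Keep cells with 200 < nFeature_RNA < 2500 and less than 5% mitochondrial counts.
keep <- qc$nFeature_RNA > 200 & qc$nFeature_RNA < 2500 & qc$percent.mt < 5
filtered <- qc[keep, ]

rownames(filtered)
```

In this toy table, cell1 fails the lower feature bound, cell3 the upper bound, and cell4 the mitochondrial threshold.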