Explore an HCA data set in Scanpy

ENSG_to_name.csv

pancreas.loom

This notebook shows a few quick steps to start exploring an example HCA data sets using scanpy (citation).

%matplotlib inline

import loompy
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy.api as sc

sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=80, color_map='Greys')
sc.logging.print_versions()
results_file = '/results/pancreas.h5ad'

Python

Read the expression matrix file downloaded from the HCA

The file shown here is from the study Single-Cell Analysis of Human Pancreas Reveals Transcriptional Signatures of Aging and Somatic Mutation Patterns by Enge, et al., available in this repository at ./pancreas.loom

adata = sc.read_loom(pancreas.loom
, sparse=True, cleanup=False, 
                     X_name='spliced', obs_names='CellID', var_names='Gene')

Python

Query for genes of interest

HCA uses Gencode nomenclature as gene identifiers. A list of these identifiers and their corresponding gene symbols is included in this repository as ./ENSG_to_name.csv

gene_names = pd.read_csv(ENSG_to_name.csv
, index_col = 0)
named_list = adata.var.join(gene_names)

Python

named_list.query('name == ["GCG"]') ## replace GCG with gene of interest

Python

Filter and normalize

Filter out genes that are present in few cells, cells that contain few genes
Transform the sum of (remaining) gene expression counts in each cell to equal the same number (1e4)
Keep 500 highest expressed genes from each cell (this threshold used in initial publication)

sc.pp.filter_genes(adata, min_cells=3)
sc.pp.filter_cells(adata, min_genes=200)

adata.obs['n_counts'] = np.sum(adata.X, axis=1).A1
dat = sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4, copy=True)

# here we'll make "top", which is the union of the top 500 expressed genes from each cell
top = []
for i in range(dat.shape[0]):
    indices = np.asarray(np.argsort(dat.X[i,:].todense())[0,:]).flatten()[::-1][0:500]
    top = np.union1d(top, indices)
    
# update adata to include only the list "top"
adata = adata[:, top.astype('int')]

Python

Perform dimensionality reduction and clustering

Compute the principal components
Generate the neighborhood graph
Determine clusters using Louvain method
Perform dimensionality reduction with UMAP
Perform dimensionality reduction with tSNE

sc.pp.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata)
#sc.tl.louvain(adata, resolution=0.15)
sc.tl.tsne(adata)
sc.tl.umap(adata)
adata.write(results_file)

Python

pancreas.h5ad

Determine weighted expression of select genes

This will be used in a few figures below

genes = ['ENSG00000254647.6', 'ENSG00000115263.14', 'ENSG00000157005.3', 'ENSG00000204983.13', #INS #GCG #SST #PRSS1
         'ENSG00000007062.11','ENSG00000154096.13', 'ENSG00000108849.7'] #PRSS1 #PROM1 #THY1 #PPY
colors = np.asarray([[226,43,34], [67,114,175], [48,178,78], 
                     [139,47,175], [244,152,46], [249,246,49], [119,73,27]])
weights = adata[:,genes].X.todense()
weights = weights / (weights.sum(axis=1) + 0.0001)
c = np.dot(weights, colors)

Python

Embed cells using tSNE

INS: ENSG00000254647.6
GCG: ENSG00000115263.14
PRSS1: ENSG00000204983.13
SST: ENSG00000157005.3
PROM1: ENSG00000007062.11
THY1: ENSG00000154096.13
PPY: ENSG00000108849.7

The quick way to plot in tSNE space

sc.pl.tsne(adata, color=['ENSG00000254647.6', 'ENSG00000115263.14', 'ENSG00000204983.13'], 
           wspace=0.3, size=150, linewidths=1, edgecolors='black');

Python

A slightly prettier custom plot

def clean_plot():
    plt.grid(False)
    plt.yticks([])
    plt.xticks([])
    plt.xlabel('tSNE 1')
    plt.ylabel('tSNE 2')

plt.figure(figsize=[15,4.5])
ids = ['ENSG00000254647.6', 'ENSG00000115263.14', 'ENSG00000204983.13']
genes = ['INS', 'GCG', 'PRSS1']
plots = [131, 132, 133]
for i, g, p in zip(ids, genes, plots):
    plt.subplot(p);
    plt.scatter(adata.obsm['X_tsne'][:,0], adata.obsm['X_tsne'][:,1], c=adata[:,i].X,
                linewidths=0.3, edgecolors='black');
    plt.title(g)
    clean_plot()

Python

Cells colored by weighted expression of select genes

plt.figure(figsize=[8, 8])
plt.scatter(adata.obsm['X_tsne'][:,0], adata.obsm['X_tsne'][:,1], c=c/255, alpha=0.3);
clean_plot()

Python

Embed cells using UMAP

The same clusters and markers can be viewed using alternative embeddings:

Cells colored by louvain clustering

#sc.pl.umap(adata, color='louvain', wspace=0.3, size=150, linewidths=0.3, edgecolors='black');
## skipping this for now -- working on some louvain issues above

Python

Cells colored by weighted expression of select genes

plt.figure(figsize=[8, 8])
plt.scatter(adata.obsm['X_umap'][:,0], adata.obsm['X_umap'][:,1], c=c/255, alpha=0.3);
plt.grid(False)
plt.yticks([])
plt.xticks([])
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2');

Python

Where to go from here

There are many more ways to explore the data you've downloaded from the HCA -- see the scanpy tutorials for ideas https://scanpy.readthedocs.io/en/latest/examples.html#