Rare Cancers Genomics / Jun 04 2020

by AGAurélie Gabriel, NANicolas Alcala and

Remix of V2 Building accurate data visualisations of high-dimensional molecular data for cancer classification: application to rare lung neoplasms by mmathian

DOI: https://doi.org/10.33016/nextjournal.100003

A molecular map of lung neuroendocrine neoplasms

Introduction

In a study published in Nature Communications in 2019 (Alcala et al. Nat Commun 2019), we performed a molecular characterization of the rare understudied lung neuroendocrine neoplasms (lung NENs) that allowed us to identify clinically relevant molecular groups. To complement this study, we have submitted a data note to journal GigaScience where we firstly describe the preprocessing and the quality control steps performed on data generated by Alcala et al. and propose the first molecular map of lung NENs (pan-LNEN map) using the Uniform Manifold Approximation and Projection (UMAP) on RNA-sequencing data. This map integrates homogenized data from multiple molecular studies on lung NENs (Peifer et al. Nat Genet 2012; Fernandez-Cuesta et al. Nat Commun 2014; George et al. Natu re 2015; George et al. Nat Commun 2018; Alcala et al. Nat Commun 2019, Laddha et al. Cancer research 2019) with the aim of increasing the sample size to better understand this rare disease. The map was published on the TumorMap interactive portal.

This notebook complements the data note by providing the code necessary to reproduce the pan-LNEN molecular map. We also propose to assess the quality of the molecular map by: i) validating the biological hypotheses proposed in the previous studies, ii) exploring the preservation of the samples' neighborhood in the map, and iii) validating the preservation of spatial auto-correlations in the map.

Setup

This section allows to set up the working environment, install the R packages, and upload the scripts and data sets needed for the analyses.

R packages installation

Install CRAN packages

install.packages(c("ade4",      # Principal Component Analysis                   "umap",      # Uniform manifold analysis projection                   "FNN",       # K-Neirest Neighbours                   "ape",       # for Moran index                   "reshape2",  # Basic tools                   "ggpubr",    # Layout                   "gridExtra", # Layout                   "eulerr",    # Layout for UMAP                   "viridis",   # Layout                   "foreach",   # Parallel computing                   "doParallel" # Parallel computing                  ))

539.4s

R+Libs+DESeq2 (R)

Install Bioconductor packages

install.packages("BiocManager")BiocManager::install("DESeq2")

55.9s

R+Libs+DESeq2 (R)

R packages loading

library(umap) library(ade4)library(FNN)library(reshape2)library(ape)library(ggpubr)library(foreach)library(doParallel)library(gridExtra)library(eulerr)library(viridis)library(DESeq2)library(RColorBrewer)sessionInfo()

10.1s

R+Libs+DESeq2 (R)

ls

1.0s

R+Libs+DESeq2 (Bash in R)

Note that to avoid running the notebook entirely, the R environment allowing to generate all the results included in the notebook has been saved in the file R_environment.Rdata and can be loaded by uncommenting the following line.

#load('R_environment.Rdata')

0.7s

R+Libs+DESeq2 (R)

Scripts and data loading

The scripts used to perform and evaluate the dimensionality reduction of the pan-LNEN RNA-seq data are available in the following GitHub repository:

IARCbioinfo/DRMetrics

4fb3527a37b3bb48f4b88e059a72d5ced6e3fa3a

This repo is mounted by: R+Libs+DESeq2

The DRMetrics repository contains (see readme for additional details):

A folder named "scripts" containing the R scripts enabling the assessment of the quality of low-dimensional representations
A folder named "data" containing the data required to reproduce the results presented in the GigaScience paper:
- read_counts_all.txt.zip: homogenized gene expression quantification from RNA-seq data from several published manuscripts (Peifer et al. Nat Genet 2012; Fernandez-Cuesta et al. Nat Commun 2014; George et al. Nature 201 5; George et al. Nat Commun 2018; Alcala et al. Nat Commun 2019, Laddha et al. 2019). The data are stored as a matrix of raw read counts of 59,607 genes (rows) from the 239 samples (columns, including 1 technical replicate).
- Attributes.txt.zip: main clinical features and the relevant genes (columns) for the 238 unique individuals (rows) reported in the previous manuscripts.
- Coords_umap_nn238.txt.zip: UMAP coordinates (columns) associated with the map shared on the TumorMap interactive portal, for each of 238 unique individuals (rows).
- ref_annot.gtf.gene_spans.zip: genome annotation associated to the GRCh38 human reference genome, version 33, from bundle CTAT from the 6th of April 2020
- hg38_Gencode_V33.bed.zip and hg38_Gencode_V33_merged.bed.zip: bed files used in the RNA-Seq data preprocessing (see below)

Loading the main functions needed to perform the analyses:

source("/DRMetrics/scripts/DR_quality_metrics.r")

0.2s

R+Libs+DESeq2 (R)

Loading the RNA-Seq and clinical data:

The raw RNA-seq fastq files were preprocessed following four nextflow pipelines available in the IARCbioinfo GitHub repository. Note that the pipelines were run using singularity but docker and conda profiles could also be used.

Step 1: we performed RNA sequencing mapping and quality controls using the pipeline IARCbioinfo/RNAseq-nf by running the following command:

nextflow run iarcbioinfo/RNAseq-nf -r v2.3 -profile singularity --input_folder folder_with_fastq_files --output_folder out_RNAseq-nf-2.3 --ref_folder ref_genome.fa.star.idx/ --ref ref_genome.fa --gtf ref_annot.gtf --bed hg38_Gencode_v33.bed --fastq_ext fastq --STAR_mapqUnique 60 --cutadapt

Where folder_with_fastq_files is a folder containing all the fastq files, ref_genome.fa.star.idx is a folder with the genome reference files for software STAR, ref_annot.gtf the annotation file and hg38_Gencode_v33.bed a bed file with list of intervals for further annotations by RSeQC. The bed file was generated using the UCSC table browser and the genome reference files, the annotation file should be associated to the GRCh38 human reference genome, version 33, from bundle CTAT from the 6th of April 2020 (located in the folder GRCh38_gencode_v33_CTAT_lib_Apr062020.plug-n-play/ctat_genome_lib_build_dir/).

Step 2: we refined alignments of the BAM files using the pipeline IARCbioinfo/abra-nf by running the following command:

nextflow run iarcbioinfo/abra-nf -r v3.0 -profile singularity --bam_folder out_RNAseq-nf-2.3/BAM --output_folder out_abra-nf-3.0 --ref ref_genome.fa --gtf ref_annot.gtf --bed hg38_Gencode_v33_merged.bed --junctions --rna

The BAM files used as input for ABRA are the outputs of the first alignment step. The reference and annotation files used are the same as the ones used in step 1 with an exception for the bed file which has been modified so that no overlapping intervals remain using bedtools:

sort -k1,1 -k2,2n hg38_Gencode_v33.bed | bedtools merge -i stdin > hg38_Gencode_v33_merged.bed

Step 3: we performed base quality score recalibration using the pipeline IARCbioinfo/BQSR-nf by running the following command:

nextflow run iarcbioinfo/BQSR-nf -r v1.1 -profile singularity --input_folder out_abra-nf-3.0/ --ref ref_genome.fa --snp_vcf dbsnp_146.hg38.vcf.gz --indel_vcf Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --output_folder out_bqsr-nf-1.1

Where the vcf files come from the GATK bundle hg38.

Step 4: we performed gene expression quantification using the pipeline IARCbioinfo/RNAseq-transcript-nf by running the following command:

nextflow run iarcbioinfo/RNAseq-transcript-nf -r v2.1 -profile singularity --input_file input-transcript.txt --gtf ref_annot.gtf --output_folder out_RNAseq-transcript-nf-2.1

The file input-transcript.txt contains in rows, the samples ID, paths to BAM files, and read lengths per sample. The latter were computed using the post-trimming fastqc outputs of the first step by computing the average of the two fastq files read length values and rounding it.

For our analysis, we merged the samples raw read counts generated by the fourth step in a single matrix. In the following code cell, we load the read counts associated to all samples as well as the clinical data associated.

Attributes = read.table(unzip('DRMetrics/data/Attributes.txt.zip')[1], sep = '\t', header = T)Attributes[1:5,1:10]read_counts = as.matrix(read.table(unzip('DRMetrics/data/read_counts_all.txt.zip')[1], header = T, row.names=1))read_counts[1:5,1:10]

6.1s

R+Libs+DESeq2 (R)

0 items

0 items

We removed read counts associated with genes located on sex and mitochondrial chromosomes. After this filtering 56 597 remained.

# Read gene annotations genespans   = read.table(unzip('DRMetrics/data/ref_annot.gtf.gene_spans.zip')[1],stringsAsFactors = F)dim(genespans)rownames(genespans) = genespans[,1]# Remove genes on sex and mitochondrial chromosomes genenames   = rownames(read_counts)summary(genenames %in% rownames(genespans))genes.sex   = genespans[genespans$V2 %in% c("chrM", "chrX", "chrY"), ]genes.nosex = genespans[!(genespans$V2 %in% c("chrM", "chrX", "chrY")), ]read_counts = read_counts[which(rownames(read_counts) %in% rownames(genes.nosex)), ]dim(read_counts)

1.3s

R+Libs+DESeq2 (R)

Data pre-processing:

The raw read counts needed to be normalized; we used the variance stabilizing transformation from the DESeq2 R package.

colData = as.matrix(colnames(read_counts))DESeq_object = DESeqDataSetFromMatrix(countData=read_counts,colData = colData,design = ~1)VST = varianceStabilizingTransformation(DESeq_object)VST = assay(VST)dim(VST)VST[1:5,1:5]

113.0s

R+Libs+DESeq2 (R)

0 items

For each analysis, the most variable genes, explaining 50% of the total variance in variance-stabilized read counts, were selected (6,390 out of the 56,597 initial genes).

rvdm     = apply(VST,1,var)ntopdm   = rev(which( cumsum( sort(rvdm,decreasing = T) )/sum(rvdm) <= 0.5 ))[1]selectdm = order(rvdm, decreasing = TRUE)[seq_len(ntopdm)]vst50    = t(VST[selectdm,])vst50[1:5,1:5]dim(vst50)

1.8s

R+Libs+DESeq2 (R)

0 items

We kept only one replicate per individual (the column Sample_ID in table Attributes, excluding sample S00716).

vst50 = vst50[as.character(Attributes$Sample_ID),]dim(vst50)

0.4s

R+Libs+DESeq2 (R)

Dimensionality reduction

The analysis of the gene expression matrix is complex due to its high dimensionality. Dimensionality reduction methods are thus essential for hypothesis generation and interpretations. In this section, we present firstly the projections resulting from the principal component analysis (PCA) method and secondly a two-dimensional projection resulting from the Uniform Manifold Analysis Projection (UMAP) algorithm.

Principal Component Analysis (PCA)

In the previously published datasets that we integrated, different molecular groups were identified: carcinoids A1, A2, and B (Alcala et al. Nat Commun 2019), and clusters LC1, LC2 and LC3 (Laddha et al. 2019)-that we found here corresponding respectively to carcinoids A1, B and A2 (see low dimensional projections below)-, LCNEC/TypeI and LCNEC/TypeII (George mun 2018), and SCLC (George et al. Nat Commun 2015). In order to distinguish the six molecular groups, we considered five principal components when evaluating the PCA results.

We used the dudi.pca function to perform a PCA on the lung NEN samples:

pca5D = dudi.pca(vst50, center = T , scale = T,  scannf = F , nf=5 )$lihead(pca5D)

1.9s

R+Libs+DESeq2 (R)

0 items

However, to visualize the samples in the reduced space generated by the PCA , we needed to consider only two dimensions. The next code cell represents the lung NEN samples in several two dimensional spaces based on the different combination of the first five principal components.

distinctive_cols <- c("#9E0142","#E41A1C","#E95D47","#FF8904","#FDBE6E","#FFD422","#88CFA4","#47A265","#397EB8","#00CCFF","#91569B","#D01C8B","#CCCCCC")Axis_list = list(c(2,1), c(3,1), c(4,1), c(5,1), c(3,2), c(4,2), c(5,2), c(4,3), c(5,3), c(5,4))plist <- list()for(i in 1:length(Axis_list)){  x_index = Axis_list[[i]][1]  y_index = Axis_list[[i]][2]  xlab   = paste("PC", x_index)  ylab   = paste("PC", y_index)  c_data = data.frame("axisX"=  pca5D[,x_index], "axisY" =pca5D[,y_index] )  p      = ggplot(c_data, aes(x = axisX , y = axisY, color = Attributes$Molecular_clusters)) + geom_point(size=2, alpha=0.8) + scale_colour_manual(values=distinctive_cols)     if (x_index == 2 & y_index == 1){    plist[[i]] = p + theme( legend.position="right", legend.title= element_blank()) + labs(x = xlab, y = ylab)   }else{    plist[[i]] = p + theme( legend.position="none", legend.title= element_blank()) + labs(x = xlab, y = ylab)   }}ggarrange(plist[[1]])grid.arrange(grobs = lapply(2:10, function(i){plist[[i]]}),               layout_matrix = rbind(c(2:4),c(5:7),c(8:10)), nrow =2)

2.5s

R+Libs+DESeq2 (R)

Uniform Manifold Analysis Projection (UMAP)

Another way to visualize data in a two dimensional space is to use UMAP. The UMAP algorithm is a non-linear dimensionality reduction technique based on topology. We applied UMAP to the RNA-seq pan-LNEN data using the umap R package. Using this method we aimed at visualizing the molecular clusters identified in previous studies. UMAP doesn't necessarily preserve the global structure of the data and one important parameter for this is n_neighbors. Small and high values of n_neighbors favor the preservation of local structures and global structures, respectively. The pan-LNEN TumorMap was hence generated fixing the n_neighbors parameter to 238, which is the total number of samples. The other parameters were set to their default values.

UMAP_TM = umap(vst50, n_neighbors = 238)$layoutcolnames(UMAP_TM) = c("UMAP1","UMAP2")UMAP_TM=as.data.frame(UMAP_TM)

5.0s

R+Libs+DESeq2 (R)

Because UMAP is not deterministic, in order to reproduce the exact results presented in the GigaScience paper we upload in the following cell the UMAP coordinates used to create the pan-LNENs TumorMap (available in the IARCbioinfo/DRMetrics github repository):

# Note that the next line replace the UMAP coordinates calculated in the previous cell with our pre-computed coordinates. You should comment it if you are running a customized analysis.UMAP_TM = read.table(unzip('DRMetrics/data/Coords_umap_nn238.txt.zip')[1], sep = '\t', header = T, row.names=1)UMAP_TM <- cbind(UMAP_TM, 'Molecular_clusters' = Attributes$Molecular_clusters)ggplot(UMAP_TM, aes(x = UMAP1 , y = UMAP2, color = Molecular_clusters))+geom_point(size=4, alpha=0.8)+ scale_colour_manual(values=distinctive_cols)+labs(x = "UMAP dimension 1" , y = "UMAP dimension 2", color = "")+ theme(legend.position = "bottom")+guides(col = guide_legend( nrow = 4)) head(UMAP_TM)

0.7s

R+Libs+DESeq2 (R)

0 items

Validation of biological hypotheses

Although non-linear dimensionality reduction methods such as UMAP lack interpretability in comparison to PCA, we attempted to show that this simple layout captures the main biological findings highlighted in the previous studies (Fernandez-Cuesta et al. Nat Commun 2014; George et al. Nature 2015; George et al. Nat Commun 2019; Alcala et al. Nat Commun 2019; Laddha et al. Cancer research 2019).

In previous studies, some subgroups of lung NENs have been identified as having the molecular features of a histopathological type that is different from their initial histopathological classification. We checked whether our tumor map captures this phenomenon by comparing the distances from these discordant tumors to their molecular and histopathological groups, respectively. The following code chunk corresponds to the code used to perform these comparisons using a one-sided Wilcoxon's test.

clusters_distances_F <- function(Coords, evals, ref1, ref2, alt_hyp){  # Coords: a data frame with the following columns: i) the samples coordinates on the dimensionality reduction axes, and ii) the molecular cluster associated to each sample  # evals: the molecular cluster to analyse   # ref1: the supposed molecular cluster(s)  # ref2: the supposed histopathological group(s)  # alt_hyp: the alternative hypothesis for the wilcoxon test   Coords_evals = Coords[Coords$Molecular_clusters==evals,]  T1           = Coords[which(Coords$Molecular_clusters %in% ref1),]   T1_centroid  = colMeans(T1[,1:2])  dist_1       = sqrt( (Coords_evals[,1] - T1_centroid[1])^2 + ( Coords_evals[,2] - T1_centroid [2])^2 )  T2           = Coords[which(Coords$Molecular_clusters %in% ref2),]  T2_centroid  = colMeans(T2[,1:2])  dist_2       = sqrt( (Coords_evals[,1] - T2_centroid[1])^2 + ( Coords_evals[,2] - T2_centroid [2])^2 )  return(wilcox.test(dist_1 , dist_2, alt_hyp)$p.value) }

0.3s

R+Libs+DESeq2 (R)

As an example, we can consider the LCNEC/SCLC-like samples identified by George et al. in 2019 as being LCNEC samples that have the molecular profile of the SCLC samples. Using the function above, we showed that in the pan-LNEN TumorMap, these LCNEC/SCLC-like samples were indeed significantly closer to the SCLC samples than the LCNEC samples:

wilcoxon_res = clusters_distances_F(UMAP_TM, "LCNEC/SCLC-like","SCLC/SCLC-like",c("LCNEC/TypeI", "LCNEC/TypeII"), 'less')paste('Wilcoxon p-value = ', format(wilcoxon_res, digits = 2))

0.4s

R+Libs+DESeq2 (R)

Quality controls of the dimensionality reduction

Structural consistency of projections

We evaluated the preservation of the samples' neighborhood using the sequence difference view (SD) dissimilarity metric proposed by Martins et al. in 2015. This measure compares samples' neighborhoods in two different spaces, $\small D_1$ and $\small D_2$ . In our case, $\small D_1$ will typically be the original high-dimensional space and $\small D_2$ the reduced space from UMAP. The SD metric of sample i is defined as follows:

Parameter $\small k$ corresponds to the number of nearest neighbors to consider for the computation of the neighborhood of sample $\small i$ ; $\small V^1_k(i)$ and $\small V^2_k(i)$ denote the neighborhoods in $\small D_1$ and $\small D_2$ , respectively. $\small \rho^1_i(j)$ and $\small \rho^2_i(j)$ are the ranks of the distance between sample i and the $\small j_{th}$ neighbor of sample $\small i$ in the two spaces, respectively (ranked from 0 to k-1).

The SD metric firstly evaluates, for each point, its ranking in the two neighborhoods ( $\small \rho^1_i(j)$ and $\small \rho^2_i(j)$ ). Secondly, it penalizes discordant rankings (terms | $\small \rho^1_i(j)$ - $\small \rho^2_i(j)$ |), the penalty being larger if the discordance is observed for close neighbors (terms |k- $\small \rho^1_i(j)$ | and |k- $\small \rho^2_i(j)$ |). As a result, SD is non-negative, with high SD values indicating a poor preservation of the neighborhood while values close to 0 indicate a near-perfect preservation of the neighborhood.

Individual evaluation of samples' neighborhood

The following code chunk uses the compute_SD function to compute the SD values associated to each sample for different values of $\small k$ . In the following example, we used the SD metric to compare the UMAP layout with the five-dimensional space resulting from the PCA (PCA-5D).

distpca5D = as.matrix(dist(pca5D, method = "euclidean", diag = TRUE, upper = TRUE))distpca5D[1:5,1:5]k_values = seq(2,238,5)SD_pca5D = compute_SD_allSamples(distRef = distpca5D, List_projection = list("UMAP_TM" = UMAP_TM[,1:2]), k_values = k_values,colnames_res_df = "UMAP-nn238")head(SD_pca5D)

6.2s

R+Libs+DESeq2 (R)

0 items

0 items

The SD_map_f function is then called to represent the lung NEN samples in the UMAP projection and show, with a color gradient, the SD values averaged over the $k$ values.

sd_map_lnen = SD_map_f(SD_pca5D , UMAP_TM, "bottom")sd_map_res  = sd_map_lnen[[1]]head(sd_map_res)sd_map_lnen[[2]]

0.4s

R+Libs+DESeq2 (R)

0 items

The above representation allows the identification of the samples with the lowest sample neighborhood preservation between UMAP and PCA-5D. We chose to explore the samples' neighborhood of S02340 which have the highest SD metric. This sample is an atypical sample that clusters with the Carcinoid-A1 cluster on the UMAP projection. To explain this high SD value, we computed the euclidean distances between each sample and S02340 in the two spaces.

plot_dist <- function(coords,projection_coords,focal_sample){  dist1 = as.matrix(dist(coords[, 1:(ncol(coords)-1)], method = "euclidian", diag = TRUE, upper = TRUE))  Sample_names = rownames(coords)  dist1        = data.frame("Sample_ID" = rownames(dist1)[order(dist1[,focal_sample])], "rank" = 1:length(rownames(dist1)) )  merged_df    = merge(cbind(Sample_ID = rownames(projection_coords), projection_coords), dist1,  by= "Sample_ID")  colnames(merged_df)[2:3] = c("V1","V2")    proj_dist    = ggplot(merged_df, aes(x=V1, y=V2,  color=rank, label=merged_df$Sample_ID)) +  geom_point(size=2, alpha =.8) + scale_color_distiller(palette = "Spectral") +  geom_point(aes(merged_df$V1[which(merged_df$Sample_ID == focal_sample)],                 merged_df$V2[which(merged_df$Sample_ID == focal_sample)] ),             size=2, col = "black") +  geom_text(aes(merged_df$V1[ merged_df$Sample_ID== focal_sample],                merged_df$V2[ merged_df$Sample_ID== focal_sample],                label=merged_df$Sample_ID[merged_df$Sample_ID==focal_sample]),            col = 'black' , hjust=0, vjust=0 ) +  labs(title="", y="dim2", x="dim1") + theme( legend.position = "bottom")	return(proj_dist)}p1 <- plot_dist(UMAP_TM,UMAP_TM,"S02340")p2 <- plot_dist(pca5D,UMAP_TM,"S02340")grid.arrange(grobs = list(p1,p2), layout_matrix = rbind(c(1,2)), nrow =1)

1.0s

R+Libs+DESeq2 (R)

In the above figure, the left and right panels represent the UMAP projection of lung NEN samples, with a color gradient corresponding to the rank of each sample in the neighborhoods of S02340 in UMAP and PCA-5D projections, respectively. In the UMAP projection (left panel), the nearest neighbors of S02340 belong to the carcinoids cluster (top cluster). However, in the PCA projection (right panel) S02340 is also close to LCNEC samples, more specifically the LCNEC Type I samples (bottom cluster), suggesting that the UMAP projection for that sample might be misleading.

Exploring the SD metric for each sample allows us to identify samples with questionable projections.

Global preservation of samples' neighborhoods

Using the SD metric, we also evaluated the global preservation of the samples neighborhoods of different dimensionality reductions: i) PCA-5D: the projection based on the five first principal components of the PCA, ii) PCA-2D: the projection based on the two first principal components of the PCA, iii) projections generated by UMAP based on different values of the n_neighbors parameter (15 (default value), 30, 90, 180 and 238).

n_neighborsL <- c(15,30,90,180,238)List_projection <- list("pca2D" = pca5D[1:2], "pca5D" = pca5D) for (j in 1:length(n_neighborsL)){  umap_c = umap(vst50, n_neighbors = n_neighborsL[j])$layout  List_projection[[j+2]] <- umap_c  names(List_projection)[j+2] <- paste("umap_nn", n_neighborsL[j])}

14.2s

R+Libs+DESeq2 (R)

We computed the SD metric for each sample and each projection while varying the parameter $k$ in order to explore the samples neighborhood preservation at different scales. We averaged the SD metric over the samples. Due to the stochastic nature of the UMAP optimization step, we repeated each UMAP run multiple times. The following code cell performs these iterations. To speed up the computation on Nextjournal, we fixed the number of iterations to 10 (n_iter parameter below) but this number can be increased by the user. In the GigaScience paper, n_iter was fixed to 1000 to generate Figure 4.

distRef = as.matrix(dist(vst50, method = "euclidean", diag = TRUE, upper = TRUE))distRef[1:5,1:5]k_values        = seq(2,238,25)n_neighborsL    = c(15,30,90,180,238)colnames_res_df = c('PCA-2D', 'PCA-5D',  paste0('UMAP-nn',n_neighborsL) )n_iter          = 10no_cores        = 2                                                cl = makeCluster(no_cores)registerDoParallel(cl)SD_means <-  foreach(i=1:n_iter, .packages = c("umap","doParallel"))%dopar%{  List_projection = list("PCA-2D" = pca5D[,1:2], "PCA-5D" = pca5D)   for (j in 1:length(n_neighborsL)){    List_projection[[j+2]] = umap(vst50, n_neighbors = n_neighborsL[j])$layout    names(List_projection)[j+2] = paste("umap_nn", n_neighborsL[j])  }   SD_comp = compute_SD_allSamples(distRef, List_projection, k_values, colnames_res_df)    # For each k compute the mean value over all samples  SD_k_mean = sapply(1:length(colnames_res_df), function(j) aggregate(SD_comp[,colnames_res_df[j]],list(SD_comp$K),mean)$x)   colnames(SD_k_mean) = colnames_res_df  rownames(SD_k_mean) = k_values    return(SD_k_mean)}                       stopCluster(cl)

174.5s

R+Libs+DESeq2 (R)

0 items

For each of the 10 replicate iterations (elements of list SD_means), for each dimensionality reduction technique (columns), we computed the mean SD value across samples for different $\small k$ (rows):

#mean SD as function of k for the first iterationhead(SD_means[[1]])

0.2s

R+Libs+DESeq2 (R)

0 items

Mean_SD = melt( apply(simplify2array(SD_means), 1:2, mean), varnames = c("k","DRM"), value.name = "mean" )sd_SD   = melt( apply(simplify2array(SD_means), 1:2, sd), varnames = c("k","DRM"), value.name = "sd" )Main_SD_DF = cbind(Mean_SD, sd = sd_SD$sd)head(Main_SD_DF)p = ggplot(Main_SD_DF, aes(x =  k, y = mean, color = DRM)) +  geom_line() +  geom_point()+  scale_color_viridis(discrete=TRUE) +  geom_errorbar(aes(ymin = mean-sd, ymax = mean+sd))+  labs(title = "", y = "mean(SD)", x = "k") p

0.6s

R+Libs+DESeq2 (R)

0 items

To facilitate the visualization and to compare the sample neighborhood preservation of the different projections, we normalized the mean SD values so that the SD of the PCA-2D is 1 and that of the PCA-5D is 0, as described in the GigaScience paper (equation 1).

SD_means_norm = lapply( SD_means, function(x) (x[,2] - x)/(x[,2] - x[,1] ) )Mean_SD_norm = melt( apply(simplify2array(SD_means_norm), 1:2, mean), varnames = c("k","DRM"), value.name = "mean" )sd_SD_norm   = melt( apply(simplify2array(SD_means_norm), 1:2, sd), varnames = c("k","DRM"), value.name = "sd" )Main_SD_norm_DF = cbind(Mean_SD_norm, sd = sd_SD_norm$sd)theme_set(theme_bw())ggplot(Main_SD_norm_DF, aes(x = k, y = mean, color = DRM)) +  geom_line() + geom_point()+  scale_color_viridis(discrete=TRUE) +  geom_errorbar(aes(ymin = mean-sd, ymax = mean+sd))+  labs(title = "", y = "mean(SD')", x = "k") +  theme( axis.title.x=element_text(size=20),  # X axis title         axis.title.y=element_text(size=20),  # Y axis title         axis.text.x=element_text(size=16),  # X axis text         axis.text.y=element_text(size=16),         legend.text = element_text(size = 20) ,         legend.title = element_blank())

0.5s

R+Libs+DESeq2 (R)

The above figure shows that varying the n_neighbors parameter has an impact on the sample neighborhood preservation. While lower values of n_neighbors improve the preservation of the local sample neighborhood (low SD for small $k$ values), higher values of n_neighbors improve the global preservation of the neighborhood (low SD large $k$ values). Overall, n_neighbors=238 is a good compromise between local and global preservation, having preservations at least as good as PCA-2D for most $k$ values but also having preservations almost as good as PCA-5D for local neighborhoods.

Preservation of spatial autocorrelation

We can evaluate the spatial auto-correlation for focal genes using the Moran index. This metric ranges from -1 (inverse spatial autocorrelation) to 1 (perfect spatial autocorrelation). For a given gene, the index considers the similarity between the gene expression levels of samples in the $k$ -neighborhood of each sample. The following code cell computes the Moran index values for all the genes and different values of $k$ for: i) the original space, ii) PCA-5D, and iii) UMAP-nn238. The means of Moran index values for all k levels are then calculated.

n_genes = ncol(vst50)k_step  = seq(2,238,25)MI_GeneExprMat = moran_I_knn(vst50[,1:n_genes], vst50,k_step)MI_UMAP        = moran_I_knn(vst50[,1:n_genes], UMAP_TM[,1:2],k_step)MI_PCA5D       = moran_I_knn(vst50[,1:n_genes], pca5D,k_step)head(MI_UMAP)

643.8s

R+Libs+DESeq2 (R)

We computed first the average of the Moran's indexes over $k$ values for each gene and then the overlap between the greatest averaged Moran indexes with respect to the different spaces:

MI_PCA5D_order       = order(colMeans(MI_PCA5D[,1,]), decreasing=T) MI_UMAP_order        = order(colMeans(MI_UMAP[,1,]), decreasing=T) MI_GeneExprMat_order = order(colMeans(MI_GeneExprMat[,1,]), decreasing=T) PCA5D_GeneExprMat_overlap = sapply(1:n_genes, function(i) mean( MI_PCA5D_order[1:i] %in% MI_GeneExprMat_order[1:i] ) )PCA5D_UMAP_overlap        = sapply(1:n_genes, function(i) mean( MI_PCA5D_order[1:i] %in% MI_UMAP_order[1:i] ) )UMAP_GeneExprMat_overlap  = sapply(1:n_genes, function(i) mean( MI_UMAP_order[1:i] %in% MI_GeneExprMat_order[1:i] ) )PCA5D_UMAP_GeneExprMat_overlap = sapply(1:n_genes, function(i) mean( MI_PCA5D_order[1:i] %in% MI_GeneExprMat_order[1:i] & MI_PCA5D_order[1:i] %in% MI_UMAP_order[1:i] ) )Expected = sapply(1:n_genes, function(i) (i/n_genes) )cols <- c("Expected" =  "#FDE725FF", "PCA 5D - GeneExprMat" = "#440154FF","PCA 5D - UMAP 238 - GeneExprMat" = "#5DC863FF",  "PCA 5D - UMAP 238" = "#21908CFF", "UMAP 238 - GeneExprMat" = "#3B528BFF")ggplot()+geom_point(aes(x=seq(1:n_genes), y=PCA5D_GeneExprMat_overlap , colour="PCA 5D - GeneExprMat"), size = 0.1)+geom_point(aes(seq(1:n_genes),  y=UMAP_GeneExprMat_overlap, colour="UMAP 238 - GeneExprMat"), size = 0.1)+geom_point(aes(seq(1:n_genes),  y=PCA5D_UMAP_overlap , colour="PCA 5D - UMAP 238"), size = 0.1)+geom_point(aes(seq(1:n_genes),  y=PCA5D_UMAP_GeneExprMat_overlap, colour="PCA 5D - UMAP 238 - GeneExprMat"), size = 0.1)+labs(title="", y="Overlapping %", x="N", caption = "") +scale_colour_manual(values=cols, labels =c("Expected", "PCA 5D - GeneExprMat","PCA 5D - UMAP 238 - GeneExprMat",  "PCA 5D - UMAP 238", "UMAP 238 - GeneExprMat")) +theme( legend.position =  "bottom")+geom_line(aes(seq(1:n_genes),  y=Expected, colour="Expected"), size = 1)+geom_vline(xintercept = 1000, linetype="dashed")

8.0s

R+Libs+DESeq2 (R)

We finally represented the Euler diagram of the gene sets corresponding to the top 1000 Moran indexes:

n_genes_sel = 1000abc = PCA5D_UMAP_GeneExprMat_overlap[n_genes_sel]*n_genes_selab  = PCA5D_UMAP_overlap[n_genes_sel]*n_genes_selac  = PCA5D_GeneExprMat_overlap[n_genes_sel]*n_genes_selbc  = UMAP_GeneExprMat_overlap[n_genes_sel]*n_genes_sel

0.3s

R+Libs+DESeq2 (R)

Venn <- c(c("A" = n_genes_sel - ((ab -abc) + abc + (ac -abc)) ,           "B"  = n_genes_sel - ((ab -abc) + abc + (bc -abc)),           "C"  = n_genes_sel - ((ac -abc) + abc + (bc -abc)),           "A&B" = ab -abc,           "A&C" = ac -abc,           "B&C" = bc -abc ,           "A&B&C" = abc))PlotEuler <- plot(euler(Venn), fills = c("#5DC863FF", "#FDE725FF", "#21908CFF"),  quantities = list(cex = 1), alpha= .7, cex = 2, legend = list(labels = c("PCA 5D", "UMAP nn = 238", "GenesExprMat"), alpha=.7 ,cex =1.2))PlotEuler 

0.4s

R+Libs+DESeq2 (R)

The Euler diagram above represents the overlap between the top 1000 genes in the three lists and shows that 88.8% of the genes are in common. This result shows that the UMAP projection preserves genes spatial auto-correlation.

save.image('R_environment.Rdata')

20.1s

R+Libs+DESeq2 (R)

A molecular map of lung neuroendocrine neoplasms

Introduction

Setup

R packages installation

R packages loading

Scripts and data loading

Dimensionality reduction

Principal Component Analysis (PCA)

Uniform Manifold Analysis Projection (UMAP)

Validation of biological hypotheses

Quality controls of the dimensionality reduction

Structural consistency of projections

Individual evaluation of samples' neighborhood

Global preservation of samples' neighborhoods

Preservation of spatial autocorrelation

Runtimes (1)