2 Building accurate data visualisations of high-dimensional molecular data for cancer classification: application to rare lung neoplasms

Subtitle

1. Introduction

Lung neuroendocrine neoplasms (LNENs) are rare understudied diseases with limited therapeutic opportunities. Over the past years several relatively large genomic studies have tried to tackle the molecular characteristics of these diseases in order to provide some evidence for a more personalized clinical management (Peifer et al. Nat Genet 2012; Rudin et al. Nat Genet 2012; Fernandez-Cuesta et al. Nat Commun 2014; George et al. Nature 2015; George et al. Nat Commun 2019; Alcala et al. Nat Commun 2019). This notebook was designed, to produce a comprehensive map of the different LNENs molecular clusters, using the Uniform Manifold Approximation and Projection (UMAP) algorithm. This map was published on the TumorMap interactive portal. Here we proposed a reusable procedure to prove the structural relevance and the biological consistency of this map.

2. Setup

This section allows to set the working environment, to install the R packages and to upload the main data sets.

2.1. Install R packages

Remix this to get started with R.

install.packages(c("ade4", # Principal Component Analysis
                   "umap", # Uniform manifold analysis projection
                   "cluster", # For Silhouette coefficients
                   "FNN", # For KNN
                 "data.table",  # Basic tools
                   "devtools", # Basic tools
                   "ggpubr", # Layout
                   "gridExtra", # Layout
                    "eulerr", # Layout for UMAP
                   "latex2exp",# Layout
                      "viridis", # Layout
                    "foreach", # HPLC
                    "parallel" ,# HPLC
                   "doParallel"   # HPLC 
                  
                  ))

445.6s

library(umap) 
library(ade4)
library(cluster)
library(FNN)
library(data.table)
library(ggpubr)
library(foreach)
library(parallel)
library(doParallel)

library(latex2exp)
library(gridExtra)
library(devtools)
library(eulerr)
library(viridis)
set.seed(1564404882)
sessionInfo()

3.1s

sessionInfo()

0.4s

2.2. Import GitHub

We implemented tools to evaluate low dimensional representations quality. Our metrics and statistics especially target the neighborhood preservation, the biological consistency of potential clusters created by projections and how low dimensional spaces could be interpreted in term of biological features. This tools are available on the following GitHub repository.emathian/DRMetrics.

emathian/DRMetrics

Public

The DRMetrics repository contains :

A repository named DR_Method, containing the R scripts allow to assess the quality of low dimensional representations. The SEQ_DIFF.R file allows to evaluate the neighborhood preservation, between the high and the low dimensional spaces, according the sequence difference metric (Martin et Al 2015). (The CP.R script allows to evaluate the centrality preservation, between the original and the projected spaces, (Martin et Al 2015)). The MORAN_I.R program used the Moran's Index metric to evaluate features spatial auto-correlation. A Manual describing the different functions is also available Man.html.

Importation:

source('DRMetrics/DR_Method/Dimensionality_reduction_comparison_metrics.R', local = F, echo = T)

0.4s

2.3. Importation of Data

In the following cells, the files necessary to the compilation are imported. The VST_nosex_TCACLCNECSCLC.txt file contains the whole-transcriptome (RNAseq) available from the previous manuscripts (Peifer et al. Nat Genet 2012; Rudin et al. Nat Genet 2012; Fernandez-Cuesta et al. Nat Commun 2014; George et al. Nature 2015; George et al. Nat Commun 2019; Alcala et al. Nat Commun 2019). The raw read counts of 57 822 genes from the 210 samples were normalized using the variance stabilization transform (vst function from R package DESeq2 v1.14.1.)

VST <- read.table(unzip('DRMetrics/data/VST_nosex_TCACLCNECSCLC.txt.zip')[1], header = T )

16.2s

The Attributes_UMAP_TCACLCNECSCLC.tsv file gathers the main clinical features and the relevant genes reported in the previous manuscripts (George et al. Nat Commun 2019; Alcala et al. Nat Commun 2019.)Attributes_UMAP_TCACLCNECSCLC.tsv

Attributes <- read.table(unzip('DRMetrics/data/Attributes_UMAP_TCACLCNECSCLC.tsv.zip')[1], sep = '\t', header = T)

Attributes <- Attributes[order(Attributes$Sample_ID),]

head(Attributes)

0.2s

0 items

3. Genes selection

For each analysis, the most variable genes, explaining 50% of the total variance in variance- stabilized read counts, were selected. Like this, over the initial 54,851 genes and 310 samples 9398 were kept.

vst_mat <- VST 
rvdm <- apply(vst_mat,1,stats::var)
ntopdm = rev(which( cumsum( sort(rvdm,decreasing = T) )/sum(rvdm) <= 0.5 ))[1]
selectdm <- order(rvdm, decreasing = TRUE)[seq_len(min(ntopdm, length(rvdm)))]
vst_mat <- vst_mat[selectdm,]
vst50 <- data.frame(t(vst_mat))
head(vst50)

2.4s

0 items

We removed the replicates named S00716_A and S02322_B.

vst50 <- vst50[- which(rownames(vst50) == "S00716_A"|rownames(vst50) == "S02322_B"),]
rownames(vst50)[which(rownames(vst50) == "S02322_A")] <- "S02322.R1"
vst50_sampleID <- rownames(vst50)
vst50 <- cbind("Sample_ID" = vst50_sampleID, vst50) 
vst50$Sample_ID <- as.character(vst50$Sample_ID)
vst50 <- vst50[order(vst50$Sample_ID),]

0.4s

4. Dimensionality reduction

The analysis of this gene expression matrix is complicated by their high dimensionality, that is why low dimensional projections are essential for hypothesis generation and interpretations. We are going to present firstly the projections resulting from the principal component analysis (PCA), the two dimensional projetions resulting from the UMAP algorithm. If this new non linear method lacks the strong interpretability of the PCA (McInnes et al). We are going to show that this simple layout capture the main biological findings (Fernandez-Cuesta et al. Nat Commun 2014; George et al. Nature 2015; George et al. Nat Commun 2019; Alcala et al. Nat Commun 2019).

4.1. Principale Component Analysis (PCA)

On n dimensional space only n+1 equidistant groups could be distinguished. In order to capture the six main molecular clusters previously described (Carcinoids A1, A2, and B (Alcala et al. Nat Commun 2019) , LCNEC/Type1 and LCNEC/Type2 (George et al. Nat Commun 2019), and SCLC/SCLC-like), the first five axes of the PCA were kept.

pca5D <- as.data.frame(dudi.pca(vst50[,2:dim(vst50)[2]], center = T , scale = T,  scannf = F , nf=5 )$li) 
pca5D  = setDT(pca5D, keep.rownames = TRUE)[]
colnames(pca5D)[1]<-'Sample_ID'
pca5D <- pca5D[order(pca5D$Sample_ID),]
pca5D <- data.frame(pca5D)
pca2D <- pca5D[,1:3]
head(pca2D)

1.7s

0 items

Axis_list <- list(c(3,2), c(4,2), c(5,2), c(6,2), c(4,3), c(5,3), c(6,3), c(5,4), c(6,4), c(6,5))

plist <- list()

for(i in 1:length(Axis_list)){

  xlab <- paste("PC", (Axis_list[[i]][1]-1))
  ylab <- paste("PC", (Axis_list[[i]][2]-1))
  c_data <- data.frame("axisX"=  pca5D[,Axis_list[[i]][1]], "axisY" =pca5D[,Axis_list[[i]][2]] )
  p <- ggplot(c_data, aes(x = axisX , y = axisY, color = Attributes$Molecular_clusters)) +  geom_point(size=4, alpha=0.8)+ scale_color_brewer(palette="Spectral") 

  if (Axis_list[[i]][1] ==3 & Axis_list[[i]][2] == 2){
  	p <- p + theme( legend.position="right", legend.title= element_blank()) + labs(x = xlab, y = ylab) 
	}
  
  else{
     p <- p + theme( legend.position="none", legend.title= element_blank()) + labs(x = xlab, y = ylab) 
	}
  
  plist[[i]] <- p
}

0.3s

ggarrange(plist[[1]])
grid.arrange(grobs = lapply(2:7, function(i){plist[[i]]}),  
  layout_matrix = rbind(c(2, 3, 4),
                        c(5, 6, 7)), nrow =3)
grid.arrange(grobs = lapply(8:10, function(i){plist[[i]]}),  
  layout_matrix = rbind(c(8, 9, NA),
                       c(10, NA, NA)), nrow =3)

2.7s

4.2. Uniform Manifold Analysis Projection (UMAP)

The UMAP algorithm is a dimensionality reduction technique based on topology. This method firstly builds a fuzzy topological representation of the data, and secondly finds the best low dimensional representation of this topological structures, through an optimization algorithm. The UMAP algorithm have several parameter, however we are going to focus our analysis on the n_neighbors parameter. The small values of n_neighbors guarantee a local preservation of the manifold, and reciprocally. The following animation, obtained through the gganimate R package, shows projections of the LNEN transcriptome data according different level of n_neighbors. In order to reproduce the exact results of the article, we upload the UMAP's coordinates used to create the TumorMap pan-LNENs map. This map was obtained owing to the UMAP algorithm setting the n_neighbors parameter to 208.

UMAP_TM <- read.table(unzip('DRMetrics/data/Coords_umap_nn208.tsv.zip')[1], sep = '\t', header = F)
colnames(UMAP_TM) <- c("Sample_ID", "V1", "V2")
UMAP_TM <- UMAP_TM[order(UMAP_TM$Sample_ID),]
UMAP_TM <- cbind(UMAP_TM, 'Molecular_clusters' = Attributes$Molecular_clusters)
pumap <- ggplot(UMAP_TM, aes(x = V1 , y = V2, color = Molecular_clusters))+  geom_point(size=4, alpha=0.8)+ scale_color_brewer(palette="Spectral")+ labs(x = "dim1" , y = "dim2", color = "")+ theme(legend.position = "bottom")+  guides(col = guide_legend( nrow = 4)) 
pumap

0.7s

5. Molecular clusters on the UMAP layout

5.1. Identification of border line tumors

In this section, the consistency of the classification of the border line tumors are going to be assessed, according the distances between this tumors and the centroid of their supposed molecular cluster, and between this discordant tumors and their supposed histopathogical cluster . These distances are finally going to be compared through Wilcoxon's tests. The following function generalize the procedure :

clusters_distances_F <- function(Coords, evals, ref1, ref2, alt_hyp){

  # Coords must be a data frame of n colunms such as : col1 = Sample_ID, col2 =   dim1, col3 =   dim2, ..., coln = 

  # evals must be the names of border line tumors # ref1 is a vector containing the named of the supposed molecular cluster(s)

  # ref2 is a vector containing the named of the supposed histopathological cluster(s)

  # alt_hyp is the alternative hypothessis  
  
  Coords_evals <- Coords[which(Coords$Molecular_clusters %in% evals),]
  T1 <- Coords[which(Coords$Molecular_clusters %in% ref1),] 
  T1_centroid <- apply(T1[,2:3], 2, mean)
  dist_1 <- unlist(lapply( 1:dim(Coords_evals)[1], function(i){
          sqrt( (Coords_evals[i,2] - T1_centroid[1])^2 + ( Coords_evals[i,3] -              T1_centroid [2])^2 )}))
  T2 <- Coords[which(Coords$Molecular_clusters %in% ref2),]
  T2_centroid <- apply(T2[,2:3], 2,mean)
  dist_2 <- unlist(lapply( 1:dim(Coords_evals)[1], function(i){
        sqrt( ( Coords_evals[i,2] - T2_centroid[1])^2 + ( Coords_evals[i,3] -           T2_centroid [2])^2 )}))
  return(wilcox.test(dist_1 , dist_2, alt_hyp)$p.value) 
}

0.2s

A set of SCLCs present the molecular profile of LCNEC and therefore they are called SCLC/LCNEC-like (Peifer et al. Nat Genet 2012; George et al. Nature 2015; George et al. Nat Commun 2019 ): these samples tend to cluster with the LCNECs in the pan-LNEN TumorMap.

paste('Wilcoxon p-value = ', format(clusters_distances_F(UMAP_TM, "SCLC/LCNEC-like",c("LCNEC/TypeII","LCNEC/TypeI"),"SCLC/SCLC-like", 'less'), digits = 2))

0.6s

A set of LCNECs present the molecular profile of SCLC and therefore they are called LCNEC/SCLC-like( George et al. Nat Commun 2019): these samples tend to cluster with the SCLCs in the pan-LNEN TumorMap.

paste('Wilcoxon p-value = ', format(clusters_distances_F(UMAP_TM, "LCNEC/SCLC-like","LCNEC/SCLC-like",c("LCNEC/TypeI", "LCNEC/TypeII"), 'less'), digits = 2))

1.1s

The LCNEC/LCNEC-like can been further divided into type-I and type-II molecular groups ( George et al. Nat Commun 2019): these samples tend to be closer to each other than to the SCLC/SCLC-like.

paste('Wilcoxon p-value = ', format(clusters_distances_F(UMAP_TM, "LCNEC/TypeI","LCNEC/TypeII","SCLC/SCLC-like", 'less'), digits = 2))

0.4s

SCLC/LCNEC-like samples are molecularly closer to type-II than type-I LCNECs ( George et al. Nat Commun 2019).

paste('Wilcoxon p-value = ', format(clusters_distances_F(UMAP_TM, "SCLC/LCNEC-like","LCNEC/TypeII","LCNEC/TypeI", 'less'), digits = 2))

0.4s

Supra-carcinoids are pulmonary carcinoids from a morphological standpoint but harbor the molecular characteristics of LCNEC (Alcala et al. Nat Commun 2019): in the pan-LNENs TumorMap, supra-carcinoids cluster with LNECs, and are molecularly closer to LCNEC than to SCLC.

paste('Wilcoxon p-value = ', format(clusters_distances_F(UMAP_TM, "Supra_carcinoid", c("LCNEC/TypeII","LCNEC/TypeI", 'SCLC/LCNEC-like'), c("SCLC/SCLC-like", "LCNEC/SCLC-like"), 'less'), digits = 2))

0.4s

6. Structural consistency of projections

Since pairwise distances of non linear data cannot be preserved in low dimensional spaces, we evaluated the preservation of points neighborhood according a metric named SD. This metric evaluates the number point's neighbors preserved between the original space and the low dimensional space. Secondly it penalizes the potential re-orderings. High values of SD metric imply a poor preservation of points' neighborhood and reciprocally.

6.1. Global preservation of samples neighborhood

We are going to compare the samples' neighborhood preservation between the PCA, keeping the first five axis, and between the low dimensional spaces resulting from UMAP, through the SD metric. For each sample the SD metric evaluates the number of neighbors among the k first neighbors in common between the original space and a low dimensional space. It also penalizes samplesâ€™ neighborhood reordering. Once the SD values have been determined for each point and for different k levels, the means by k level can be calculated. The optimization stage of UMAP induces that several layouts could be obtained according the same parameters. That is why we are going to repeat the UMAP algorithm several times. For each iteration the means SD values by k level are going to be determined. The effect of the n_neighbors parameter will be explored, setting this parameter to 15 (default value), 30, 90 and 208 (maximum value).

n_neighborsL <- c(15,30,90,180,208)
List_projection <- list("pca2D" = pca5D[1:3], "pca5D" = pca5D)
for (j in 1:length(n_neighborsL)){
  umap_c <- umap(vst50[,2:dim(vst50)[2]], n_neighbors = n_neighborsL[j])
  umap_c <- as.data.frame(umap_c$layout)
  umap_c <- cbind("Sample_ID" = vst50[,1], umap_c)
  names_umap <- paste("umap_nn", n_neighborsL[j])
  List_projection[[j+2]] <- umap_c
  names(List_projection)[j+2] <- names_umap 
}
#print(length(List_projection))
#List_projection[[length(List_projection)+1]] <- UMAP_TM[,1:3]
#names(List_projection)[length(List_projection)] <-"umap_nn 208"
Main_SQ_res_NN <- Seq_main(l_data = List_projection , dataRef =  data.frame(vst50) , listK = seq(from= 10, to = 208, by = 5), colnames_res_df = c("pca_2D", "pca_5D","umap_nn=15", "umap_nn=30","umap_nn=90", "umap_nn=180", "umap_nn=208"))
  Main_SQ_res_NN[[1]]

336.7s

0 items

start_time <- Sys.time()
no_cores <- detectCores()
no_cores
cl <- makeCluster(no_cores)
registerDoParallel(cl)

Main_SQ_REP <-  foreach(i=1:10, .combine=c, .packages = c("umap", 'data.table', "ade4", "parallel", "doParallel", "cluster"))%dopar%{
n_neighborsL <- c(15,30,90,180,208)
List_projection <- list("pca2D" = pca5D[1:3], "pca5D" = pca5D)
for (i in 1:length(n_neighborsL)){
  umap_c <- umap(vst50[,2:dim(vst50)[2]], n_neighbors = n_neighborsL[i])
  umap_c <- as.data.frame(umap_c$layout)
  umap_c <- cbind("Sample_ID" = vst50[,1], umap_c)
  names_umap <- paste("umap_nn", n_neighborsL[i])
  List_projection[[i+2]] <- umap_c
  names(List_projection)[i+2] <- names_umap 
}

  Main_SQ_res_NN <- Seq_main(l_data = List_projection , dataRef =  data.frame(vst50) , listK = seq(from= 1, to = 208, by = 5), colnames_res_df = c("pca_2D", "pca_5D","umap_nn=15", "umap_nn=30","umap_nn=90", "umap_nn=180", "umap_nn=208"))

  Main_SQ_res_NN[[1]]
}  

end_time <- Sys.time()
start_time - end_time

3166.3s

MEAN_SD_DF <- data.frame()

for (i in 1:9){
  MEAN_SD_K_L <- list()
  c <- 1
  for(j in seq(1,81,9)){
    MAIN_SD_DF_c <- data.frame("Sample_ID" = Main_SQ_REP[[j]],"K" =         Main_SQ_REP[[j+1]], "PCA 2D" = Main_SQ_REP[[j+2]], "PCA 5D" = Main_SQ_REP[[j+3]],"Umap nn = 15" = Main_SQ_REP[[j+4]],  "Umap nn = 30"= Main_SQ_REP[[j+5]], "Umap nn = 90" = Main_SQ_REP[[j+6]], "Umap nn = 180" = Main_SQ_REP[[j+7]], "Umap nn = 208" = Main_SQ_REP[[j+8]])

    mean_df <- aggregate(MAIN_SD_DF_c[ ,3:8], list(MAIN_SD_DF_c$K), mean)
    colnames(mean_df) <- c("K",  'pca_2D', 'pca_5D',  'umap nn = 15', "Umap nn = 30" ,"Umap nn = 90", "umap nn = 208" )
    MEAN_SD_K_L[[c]] <- mean_df
    c <- c + 1
  }
  MEAN_SD_DF <- rbind(MEAN_SD_DF, as.data.frame(MEAN_SD_K_L[[i]]))
}
Mean_SD <- melt(aggregate(MEAN_SD_DF[, 2:7], list(MEAN_SD_DF$K), mean), id.vars = "Group.1")

colnames(Mean_SD)[3] <- "mean"
colnames(Mean_SD)[1] <- "K"
sd_SD <- melt(aggregate(MEAN_SD_DF[, 2:7], list(MEAN_SD_DF$K), sd), id.vars = "Group.1")
colnames(sd_SD)[3] <- "sd"
Main_SD_DF <- cbind(Mean_SD, "sd" = sd_SD$sd)

Main_SD_DF <- Main_SD_DF[-c(which(Main_SD_DF$K ==1)) , ]
ggplot(Main_SD_DF, aes(x =  K, y = mean, color = variable)) + geom_line() + geom_point()+
  scale_color_viridis(discrete=TRUE) +
  geom_errorbar(aes(ymin = mean-sd, ymax = mean+sd))+  labs(title = "", y = "mean(SD)", x = "k")

8.8s

To highlight the main trends, the values of could be normalized according values resulting from the two dimensional space and the five dimensional space generated by the PCA.

MEAN_SD_DF_norm <- data.frame("K" = MEAN_SD_DF[,1])
for (i in 2:dim(MEAN_SD_DF)[2]){
  MEAN_SD_DF_norm[ ,i] <- (MEAN_SD_DF[,3] - MEAN_SD_DF[,i])/(MEAN_SD_DF[,3] - MEAN_SD_DF[,2])

}

colnames(MEAN_SD_DF_norm) <- colnames(MEAN_SD_DF)
Mean_SDn <- melt(aggregate(MEAN_SD_DF_norm[, 2:7], list(MEAN_SD_DF_norm$K), mean), id.vars = "Group.1")
colnames(Mean_SDn)[3] <- "mean"
colnames(Mean_SDn)[1] <- "K"
sd_SDn <- melt(aggregate(MEAN_SD_DF_norm[, 2:7], list(MEAN_SD_DF_norm$K), sd), id.vars = "Group.1")
colnames(sd_SDn)[3] <- "sd"

Main_SD_DFn <- cbind(Mean_SDn, "sd" = sd_SDn$sd)
ggplot(Main_SD_DFn, aes(x =  K, y = mean, color = variable)) + geom_line() + geom_point()+

  scale_color_viridis(discrete=TRUE) +

  geom_errorbar(aes(ymin = mean-sd, ymax = mean+sd))+  labs(title = "", y = "mean(SD')", x = "k")

0.6s

6.2. Individual evaluation of samples' neighborhood

In this section section, we are going to focus our attention on samples' neighborhood preservation, at the scale individual scale. The five dimensional space resulting from the PCA will be taken as reference. After the calculation of the SD values, we are going to represent the represent the map of the mean SD values by sample.

List_projection <- list("UMAP_TM" = UMAP_TM)

Main_SQ_res_pca <- Seq_main(l_data = List_projection , dataRef =  data.frame(pca5D) , listK = seq(from= 1, to = 208, by = 5))

Shift+Enter to run

Sd_map_df <- data.frame("Sample_ID" = Main_SQ_res_pca[[1]]$Sample_ID, "k" = Main_SQ_res_pca[[1]]$K , "UMAP208" = Main_SQ_res_pca[[1]][,3] )
sd_map_lnen <- SD_map_f(Sd_map_df , UMAP_TM, "bottom")
sd_map_lnen[[2]]

Shift+Enter to run

The most misplaced, between the five dimensional space resulting from the PCA and the UMAP layout is the S02297 sample. We are going to compare the neighborhood of this sample according the five dimensional space resulting from the PCA and the UMAP layout.

dist1 <- as.matrix(dist(UMAP_TM[, 2:dim(UMAP_TM)[2]], method = "euclidian", diag = TRUE, upper = TRUE))
UMAP_TM_Sample <- as.character(UMAP_TM$Sample_ID)
rownames(dist1) <- UMAP_TM_Sample
colnames(dist1) <- UMAP_TM_Sample
dist1 <- as.data.frame(dist1)
dist1 <-  dist1[order(dist1$S02297),]
dist1 <- data.frame("Sample_ID" = rownames(dist1), "rank" = seq(1:length(rownames(dist1))))
Umap_S02297 <- merge(UMAP_TM, dist1,  by= "Sample_ID")

ggplot(Umap_S02297, aes(x=V1, y=V2,  color=rank, label =Umap_S02297$Sample_ID)) +  geom_point(size=4, alpha =.8) + scale_color_distiller(palette = "Spectral")+ geom_point(aes(x = Umap_S02297$V1[which(Umap_S02297$Sample_ID == "S02297")] , y =  Umap_S02297$V2[which(Umap_S02297$Sample_ID == "S02297")] ), size=4, col = "black") + geom_text(aes( Umap_S02297$V1[ Umap_S02297$Sample_ID== "S02297"],Umap_S02297$V2[ Umap_S02297$Sample_ID== "S02297"], label = Umap_S02297$Sample_ID[ Umap_S02297$Sample_ID== "S02297"]) , col = 'black' , hjust=0, vjust=0 ) + labs(title="", y=TeX("dim2"), x="dim1") + theme( legend.position = "bottom")

Shift+Enter to run

This layout highlights that UMAP projection the nearest neighbors to the S02297 sample belong either to the carcinoid A1 cluster or to the SCLC/SCLC-like cluster. We now are going to represent S02297's neighborhood according the five space resulting from the PCA on the UMAP layout.

dist2 <- as.matrix(dist(pca5D[, 2:dim(pca5D)[2]], method = "euclidian", diag = TRUE, upper = TRUE))
pca_Sample <- as.character(pca5D$Sample_ID)
rownames(dist2) <- pca_Sample 
colnames(dist2) <- pca_Sample 
dist2 <- as.data.frame(dist2)
dist2 <-  dist2[order(dist2$S02297),]
dist2 <- data.frame("Sample_ID" = rownames(dist2), "rank" =seq(1:length(rownames(dist2))))

UMAP_PCA_S02297 <- merge(UMAP_TM, dist2,  by= "Sample_ID")

ggplot(UMAP_PCA_S02297, aes(x=V1, y=V2,  color=rank, label =UMAP_PCA_S02297$Sample_ID)) +  geom_point(size=4, alpha =.8) + scale_color_distiller(palette = "Spectral") + geom_point(aes(x = UMAP_PCA_S02297$V1[which(UMAP_PCA_S02297$Sample_ID == "S02297")] , y =  UMAP_PCA_S02297$V2[which(UMAP_PCA_S02297$Sample_ID == "S02297")] ), size=4, col = "black")   +

  geom_text(aes( UMAP_PCA_S02297$V1[ UMAP_PCA_S02297$Sample_ID== "S02297"],UMAP_PCA_S02297$V2[UMAP_PCA_S02297$Sample_ID== "S02297"], label = UMAP_PCA_S02297$Sample_ID[UMAP_PCA_S02297$Sample_ID== "S02297"]) , col = 'black' , hjust=0, vjust=0 )

Shift+Enter to run

This second representation highlights that on the UMAP layout, comparatively to the PCA low dimensional space, the S02297 sample is too close to the carcinoids B and A2 and to far from the SCLC/SCLC-like.

7. Moran index

One of the main hypothesis resulting from the UMAP algorithm is that close points one the layout have a similar molecular profile. That is why we are interesting by genes' spatial auto-correlation. These correlations are going to be measured through the Moran's index. This metric that could be interpreted such as a classic correlation, like this it tends to one for high spatial auto-correlation and to minus in opposite case. Furthermore, this statistic could be determined according for partition of the space, which formally calculated in function samples nearest neighbors determined through the KNN method. The parameter of the KNN algorithm represents the size of spatial unities.

7.1. OTP spatial auto-correlation

As described above the Moran's index could be calculated for each gene and for different sizes of the spatial unities. Alcala and colleagues showed that the value of OTP expression levels have an important effect on LNENs prognosis and diagnosis. That is why we are going to evaluate the spatial auto-correlation this gene on the UMAP layout.

List_coords <- list('PCA5D' = data.frame(pca5D), 'UMAP' =  data.frame(UMAP_TM[,1:3]), "gene_expr_mat" = data.frame(vst50) )

Embl_OTP = "ENSG00000171540.6"

otp_df <- data.frame("Sample_ID" = vst50$Sample_ID, "OTP" =  vst50[,which(colnames(vst50)== Embl_OTP)] )

MI_OTP <- moran_I_main(l_coords_data = List_coords, spatial_data = data.frame(otp_df), listK= seq(10,208,5), nsim = 10, Stat=FALSE, Graph = FALSE, methods_name = c('PCA 5D','UMAP nn = 208', "GeneExprMat"))

MI_OTP_by_k = moran_I_scatter_plot_by_k(data = as.array(MI_OTP)[[1]], Xlab = NULL, Ylab=NULL, Title= NULL)

Shift+Enter to run

In order to understand this graphic we could represent the expression of OTP on the UMAP layout.

ggplot(UMAP_TM, aes(x=V1, y=V2,  color=otp_df$OTP)) +  geom_point(size=4, alpha =.8) + scale_color_distiller(palette = "Spectral")

Shift+Enter to run

According to this map the expression OTP highly differs between the carcinoids and the cluster of the SCLCs and the LCNECs. This observation is concordant with the Moran index values, since the spatial autocorrelation is relatively high until to a size of the spatial unities around 100, what matches with the carcinoids clusters size (89). In addition, we note that the OTP's Moran index values are very similar between the original space and the five dimensional space resulting from the PCA.

7.2. Biological consistency of UMAP embedding space

In order to evaluate, the global preservation of genes' spatial auto-correlation. The previous example could be generalized to the entire lists of genes used to generate the low dimensional spaces. Then the means of Moran index values for all u levels are going to calculated for each space, that it is to say for the original space, the five dimensional space generated by the PCA and the two dimensional space resulting from UMAP. Finally, for each space the the overlaps between the three resulting list of genes, ordered according their Moran Index values, will be determined.

List_coords <- list('PCA 5D' = data.frame(pca5D), 'umap_nn=208' =  data.frame(UMAP_TM[,1:3]), 'gene_expr_mat' = vst50)
MI_foreach <- list()
start_time <- Sys.time()
for (i in 2:1000){
   c_vst50 <- data.frame("Sample_ID"  = vst50[,1],  vst50[,i])
   colnames(c_vst50)[2] <- colnames(vst50)[i]
 MI <- moran_I_main(l_coords_data = List_coords  , spatial_data =data.frame(c_vst50), listK= seq(2,208,15), nsim = 10, Stat=FALSE, Graph = FALSE, methods_name = c('pca5D','umap208','gene_expr_mat'))
 MI_foreach[[i-1]] <- MI$MoranIndex

}

7605.3s

1.2s

MI_gene_name <- colnames(vst50)[1:999]
MI_mean_by_gene_by_method <- list()
for (I in 1:3){
  MI_mean_gene <- unlist(lapply(1:999, function(i){
   c_gene_PCA5D <- mean(unlist(lapply(1:14, function(j){
     MI_foreach[[i]][I,1,j]
    })) )
  }))
 MI_mean_by_gene_by_method[[I]] <- MI_mean_gene
}

MI_mean_genes_df <- data.frame('Gene_names' = MI_gene_name, 'PCA5D' = MI_mean_by_gene_by_method[[1]] , "UMAP208" = MI_mean_by_gene_by_method[[2]],  "GeneExprMat" = MI_mean_by_gene_by_method[[3]])

PCA5D_order <- MI_mean_genes_df[order(MI_mean_genes_df$PCA5D, decreasing = T),]
PCA5D_order_genes <- as.character(PCA5D_order$Gene_names)
UMAP208_order <- MI_mean_genes_df[order(MI_mean_genes_df$UMAP208, decreasing = T),]
UMAP208_order_genes <- as.character(UMAP208_order$Gene_names)
GeneExpr_order <- MI_mean_genes_df[order(MI_mean_genes_df$GeneExprMat, decreasing = T),]
GeneExpr_order_genes <-  as.character(GeneExpr_order$Gene_names)
PCA5D_GeneExprMat_overlap <-unlist(lapply(1:999, function(i){
length(intersect(PCA5D_order_genes[1:i], GeneExpr_order_genes[1:i])) /i 
}))


UMAP208_GeneExprMat_overlap <-unlist(lapply(1:999, function(i){
  length(intersect(UMAP208_order_genes[1:i], GeneExpr_order_genes[1:i])) /i 
}))

PCA5D_UMAP208_overlap <- unlist(lapply(1:999, function(i){
  length(intersect(UMAP208_order_genes[1:i], PCA5D_order_genes[1:i])) /i 
}))

PCA5D_UMAP208_GeneExprMat_overlap <- unlist(lapply(1:999, function(i){
	length(intersect( intersect(UMAP208_order_genes[1:i], PCA5D_order_genes[1:i]), GeneExpr_order_genes[1:i])) /i 
}))

Expected <- unlist(lapply(1:999, function(i){
  (i *(i/999)^2) / i
}))

cols <- c("Expected" =  "#FDE725FF", "PCA 5D - GeneExprMat"="#440154FF","PCA 5D - UMAP 208 - GeneExprMat" =  "#5DC863FF",  "PCA5D - UMAP208"= "#21908CFF", "UMAP 208 - GeneExprMat"="#3B528BFF")

p <- ggplot() 
p <- p + geom_point(aes(x=seq(1:999), y=PCA5D_GeneExprMat_overlap , colour="PCA 5D - GeneExprMat"), size = 0.1)
p <- p + geom_point(aes(seq(1:999),  y=UMAP208_GeneExprMat_overlap, colour="UMAP 208 - GeneExprMat"), size = 0.1)
p <- p + geom_point(aes(seq(1:999),  y=PCA5D_UMAP208_overlap , colour="PCA5D - UMAP208"), size = 0.1)
p <- p + geom_point(aes(seq(1:999),  y=PCA5D_UMAP208_GeneExprMat_overlap, colour="PCA 5D - UMAP 208 - GeneExprMat"), size = 0.1)
p <- p +  labs(title="", y="Overlapping %", x="N", caption = "") + 
   scale_colour_manual(values=cols, labels =c("Expected", "PCA 5D - GeneExprMat","PCA 5D - UMAP 208 - GeneExprMat",  "PCA5D - UMAP208", "UMAP 208 - GeneExprMat")) + theme( legend.position =  "bottom")
p <- p + geom_line(aes(seq(1:999),  y=Expected, colour="Expected"), size = 1)+ geom_vline(xintercept = 500, linetype="dashed")
p

1.2s

a = PCA5D_order_genes[1:500]; b= UMAP208_order_genes[1:500]; c= GeneExpr_order_genes[1:500]
abc = length(intersect(intersect(a,b),c))
ab = length(intersect(a,b))
ac = length(intersect(a,c))
bc = length(intersect(b,c))

0.2s

Venn<- c(c(A = length(a) - ((ab -abc) + abc + (ac -abc)) ,
               B = length(a) - ((ab -abc) + abc + (bc -abc)),
               C = length(a) - ((ac -abc) + abc + (bc -abc)), 
               "A&B" = ab -abc,
                "A&C" = ac -abc, 
               "B&C"= bc -abc ,
               "A&B&C" = abc))
PlotEuler <- plot(euler(Venn), fills = c("#5DC863FF", "#FDE725FF", "#21908CFF"),  quantities = list(cex = 1), alpha= .7, cex = 2, legend = list(labels = c("PCA 5D ", "UMAP nn = 208 ", "GenesExprMat"), alpha=.7 ,cex =1.2))

PlotEuler

0.4s

2 Building accurate data visualisations of high-dimensional molecular data for cancer classification: application to rare lung neoplasms

Subtitle

1. Introduction

2. Setup

2.1. Install R packages

2.2. Import GitHub

2.3. Importation of Data

3. Genes selection

4. Dimensionality reduction

4.1. Principale Component Analysis (PCA)

4.2. Uniform Manifold Analysis Projection (UMAP)

5. Molecular clusters on the UMAP layout

5.1. Identification of border line tumors

6. Structural consistency of projections

6.1. Global preservation of samples neighborhood

6.2. Individual evaluation of samples' neighborhood

7. Moran index

7.1. OTP spatial auto-correlation

7.2. Biological consistency of UMAP embedding space

8. Conclusion