This notebook complements the data note by providing the code necessary to reproduce the pan-LNEN molecular map. We also propose to assess the quality of the molecular map by: i) validating the biological hypotheses proposed in the previous studies, ii) exploring the preservation of the samples' neighborhood in the map, and iii) validating the preservation of spatial auto-correlations in the map.
This section allows to set up the working environment, install the R packages, and upload the scripts and data sets needed for the analyses.
R packages installation
Install CRAN packages
install.packages(c("ade4", # Principal Component Analysis
Note that to avoid running the notebook entirely, the R environment allowing to generate all the results included in the notebook has been saved in the file R_environment.Rdata and can be loaded by uncommenting the following line.
The raw RNA-seq fastq files were preprocessed following four nextflow pipelines available in the IARCbioinfo GitHub repository. Note that the pipelines were run using singularity but docker and conda profiles could also be used.
Step 1: we performed RNA sequencing mapping and quality controls using the pipeline IARCbioinfo/RNAseq-nf by running the following command:
Where folder_with_fastq_files is a folder containing all the fastq files, ref_genome.fa.star.idxis a folder with the genome reference files for software STAR, ref_annot.gtf the annotation file and hg38_Gencode_v33.bed a bed file with list of intervals for further annotations by RSeQC. The bed file was generated using the UCSC table browser and the genome reference files, the annotation file should be associated to the GRCh38 human reference genome, version 33, from bundle CTAT from the 6th of April 2020 (located in the folder GRCh38_gencode_v33_CTAT_lib_Apr062020.plug-n-play/ctat_genome_lib_build_dir/).
Step 2: we refined alignments of the BAM files using the pipeline IARCbioinfo/abra-nf by running the following command:
The BAM files used as input for ABRA are the outputs of the first alignment step. The reference and annotation files used are the same as the ones used in step 1 with an exception for the bed file which has been modified so that no overlapping intervals remain using bedtools:
The file input-transcript.txt contains in rows, the samples ID, paths to BAM files, and read lengths per sample. The latter were computed using the post-trimming fastqc outputs of the first step by computing the average of the two fastq files read length values and rounding it.
For our analysis, we merged the samples raw read counts generated by the fourth step in a single matrix. In the following code cell, we load the read counts associated to all samples as well as the clinical data associated.
The analysis of the gene expression matrix is complex due to its high dimensionality. Dimensionality reduction methods are thus essential for hypothesis generation and interpretations. In this section, we present firstly the projections resulting from the principal component analysis (PCA) method and secondly a two-dimensional projection resulting from the Uniform Manifold Analysis Projection (UMAP) algorithm.
Principal Component Analysis (PCA)
In the previously published datasets that we integrated, different molecular groups were identified: carcinoids A1, A2, and B (Alcala et al. Nat Commun 2019), and clusters LC1, LC2 and LC3 (Laddha et al. 2019)-that we found here corresponding respectively to carcinoids A1, B and A2 (see low dimensional projections below)-, LCNEC/TypeI and LCNEC/TypeII (George mun 2018), and SCLC (George et al. Nat Commun 2015). In order to distinguish the six molecular groups, we considered five principal components when evaluating the PCA results.
We used the dudi.pca function to perform a PCA on the lung NEN samples:
However, to visualize the samples in the reduced space generated by the PCA , we needed to consider only two dimensions. The next code cell represents the lung NEN samples in several two dimensional spaces based on the different combination of the first five principal components.
Another way to visualize data in a two dimensional space is to use UMAP. The UMAP algorithm is a non-linear dimensionality reduction technique based on topology. We applied UMAP to the RNA-seq pan-LNEN data using the umap R package. Using this method we aimed at visualizing the molecular clusters identified in previous studies. UMAP doesn't necessarily preserve the global structure of the data and one important parameter for this is n_neighbors. Small and high values of n_neighbors favor the preservation of local structures and global structures, respectively. The pan-LNEN TumorMap was hence generated fixing the n_neighbors parameter to 238, which is the total number of samples. The other parameters were set to their default values.
Because UMAP is not deterministic, in order to reproduce the exact results presented in the GigaScience paper we upload in the following cell the UMAP coordinates used to create the pan-LNENs TumorMap (available in the IARCbioinfo/DRMetrics github repository):
# Note that the next line replace the UMAP coordinates calculated in the previous cell with our pre-computed coordinates. You should comment it if you are running a customized analysis.
Although non-linear dimensionality reduction methods such as UMAP lack interpretability in comparison to PCA, we attempted to show that this simple layout captures the main biological findings highlighted in the previous studies (Fernandez-Cuesta et al. Nat Commun 2014; George et al. Nature 2015; George et al. Nat Commun 2019; Alcala et al. Nat Commun 2019; Laddha et al. Cancer research 2019).
In previous studies, some subgroups of lung NENs have been identified as having the molecular features of a histopathological type that is different from their initial histopathological classification. We checked whether our tumor map captures this phenomenon by comparing the distances from these discordant tumors to their molecular and histopathological groups, respectively. The following code chunk corresponds to the code used to perform these comparisons using a one-sided Wilcoxon's test.
As an example, we can consider the LCNEC/SCLC-like samples identified by George et al. in 2019 as being LCNEC samples that have the molecular profile of the SCLC samples. Using the function above, we showed that in the pan-LNEN TumorMap, these LCNEC/SCLC-like samples were indeed significantly closer to the SCLC samples than the LCNEC samples:
We evaluated the preservation of the samples' neighborhood using the sequence difference view (SD) dissimilarity metric proposed by Martins et al. in 2015. This measure compares samples' neighborhoods in two different spaces, D1 and D2. In our case, D1 will typically be the original high-dimensional space and D2 the reduced space from UMAP. The SD metric of sample i is defined as follows:
Parameter k corresponds to the number of nearest neighbors to consider for the computation of the neighborhood of sample i; Vk1(i) and Vk2(i) denote the neighborhoods in D1 and D2, respectively. ρi1(j) and ρi2(j) are the ranks of the distance between sample i and the jth neighbor of sample i in the two spaces, respectively (ranked from 0 to k-1).
The SD metric firstly evaluates, for each point, its ranking in the two neighborhoods (ρi1(j) and ρi2(j)). Secondly, it penalizes discordant rankings (terms |ρi1(j) - ρi2(j)|), the penalty being larger if the discordance is observed for close neighbors (terms |k-ρi1(j)| and |k-ρi2(j)|). As a result, SD is non-negative, with high SD values indicating a poor preservation of the neighborhood while values close to 0 indicate a near-perfect preservation of the neighborhood.
Individual evaluation of samples' neighborhood
The following code chunk uses the compute_SD function to compute the SD values associated to each sample for different values of k. In the following example, we used the SD metric to compare the UMAP layout with the five-dimensional space resulting from the PCA (PCA-5D).
The above representation allows the identification of the samples with the lowest sample neighborhood preservation between UMAP and PCA-5D. We chose to explore the samples' neighborhood of S02340 which have the highest SD metric. This sample is an atypical sample that clusters with the Carcinoid-A1 cluster on the UMAP projection. To explain this high SD value, we computed the euclidean distances between each sample and S02340 in the two spaces.
In the above figure, the left and right panels represent the UMAP projection of lung NEN samples, with a color gradient corresponding to the rank of each sample in the neighborhoods of S02340 in UMAP and PCA-5D projections, respectively. In the UMAP projection (left panel), the nearest neighbors of S02340 belong to the carcinoids cluster (top cluster). However, in the PCA projection (right panel) S02340 is also close to LCNEC samples, more specifically the LCNEC Type I samples (bottom cluster), suggesting that the UMAP projection for that sample might be misleading.
Exploring the SD metric for each sample allows us to identify samples with questionable projections.
Global preservation of samples' neighborhoods
Using the SD metric, we also evaluated the global preservation of the samples neighborhoods of different dimensionality reductions: i) PCA-5D: the projection based on the five first principal components of the PCA, ii) PCA-2D: the projection based on the two first principal components of the PCA, iii) projections generated by UMAP based on different values of the n_neighbors parameter (15 (default value), 30, 90, 180 and 238).
We computed the SD metric for each sample and each projection while varying the parameter k in order to explore the samples neighborhood preservation at different scales. We averaged the SD metric over the samples. Due to the stochastic nature of the UMAP optimization step, we repeated each UMAP run multiple times. The following code cell performs these iterations. To speed up the computation on Nextjournal, we fixed the number of iterations to 10 (n_iter parameter below) but this number can be increased by the user. In the GigaScience paper, n_iter was fixed to 1000 to generate Figure 4.
To facilitate the visualization and to compare the sample neighborhood preservation of the different projections, we normalized the mean SD values so that the SD of the PCA-2D is 1 and that of the PCA-5D is 0, as described in the GigaScience paper (equation 1).
The above figure shows that varying the n_neighbors parameter has an impact on the sample neighborhood preservation. While lower values of n_neighbors improve the preservation of the local sample neighborhood (low SD for small k values), higher values of n_neighbors improve the global preservation of the neighborhood (low SD large k values). Overall, n_neighbors=238 is a good compromise between local and global preservation, having preservations at least as good as PCA-2D for most k values but also having preservations almost as good as PCA-5D for local neighborhoods.
Preservation of spatial autocorrelation
We can evaluate the spatial auto-correlation for focal genes using the Moran index. This metric ranges from -1 (inverse spatial autocorrelation) to 1 (perfect spatial autocorrelation). For a given gene, the index considers the similarity between the gene expression levels of samples in the k-neighborhood of each sample. The following code cell computes the Moran index values for all the genes and different values of k for: i) the original space, ii) PCA-5D, and iii) UMAP-nn238. The means of Moran index values for all k levels are then calculated.
The Euler diagram above represents the overlap between the top 1000 genes in the three lists and shows that 88.8% of the genes are in common. This result shows that the UMAP projection preserves genes spatial auto-correlation.