# Plotting Kaplan-Meier survival curves - Part 1

## Introduction

The first step of survival analysis is most often **modeling** and **visualizing** the time-to-event data as a survival function *S(t)*, which describes the probability that the event of interest does *not* happen at a given time *t*.

As the *actual* survival function cannot be observed, it is approximated by the **Kaplan-Meier (KM) estimator**, which essentially calculates the proportion of at-risk subjects that has not yet "succumbed" (to death, malfunction, etc) out of all at-risk subjects present at each time *t:*

Being a *non-parametric* method, the Kaplan-Meier estimator does not make assumptions about the distribution of survival times and any specific relationships between covariates and time to the event of interest.

Throughout this course, we will use the well-known IBM Telco customer churn dataset as example to demonstrate the utility of survival analysis for deriving business insights. The Python package `lifelines`

and R package `survival`

both provide easy implementations to plot KM survival curves of a dataset that contains (at least) two parameters for each subject observed: 1) whether the event of interest has occurred and 2) if so, when did it occur relative to the start point of observation. For the Telco data set, this correspond to the `Churn`

and `Tenure`

columns, respectively.

## Plotting KM survival curves

### Workflow in Python

First up, let's create the KM survival curve in Python. The `lifelines`

API is very similar to that of `scikit-learn`

, where the `KaplanMeierFitter`

object is first instantiated and then fitted to the data.

## Import libraries import pandas as pd import numpy as np from lifelines import KaplanMeierFitter import matplotlib.pyplot as plt from matplotlib import style ## Import data df = pd.read_csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_yes_no.csv") ## Re-encode "Churn" as 0/1 df['Churn'] = np.where(df['Churn'] == 'Yes', 1, 0) ## Instantiate kmf object kmf = KaplanMeierFitter() ## Fit kmf object to data kmf.fit(df['Tenure'], event_observed = df['Churn']) ## Plot KM curve ax = kmf.plot(xlim=(0, 75), ylim=(0, 1)) ax.set_title("Overall survival probability") ax.set_xlabel("Tenure (months)") ax.set_ylabel("Survival probability") plt.gcf()

In this plot, the y-axis represents the probability that a customer is still subscribed to the company's services (namely the "survival probability") at a given time (x-axis) after they first signed on. We see that the probability of a given customer leaving decreases (the curve flattens) with time, which makes sense as more and more loyal customers remain as time goes on. The blue band superimposed on the KM curve is the 95% confidence interval of survival probability at each time point.

### Workflow in R

Now let's try it in R:

## Import libraries library(dplyr) library(survival) library(survminer) ## Import data df <- read.csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_yes_no.csv") ## Encode "Churn" as 0/1 df <- df %>% mutate(Churn = ifelse(Churn == "Yes", 1, 0)) ## Fit KM estimator to data fit <- survfit(Surv(Tenure, Churn) ~ 1, data=df) ## Plot ggsurvplot(fit, conf.int = TRUE, ggtheme = theme_bw(), title = "Overall survival probability", xlim = c(0, 75)) + xlab('Tenure (months)')

We see the same KM curve, with the grey band indicating the 95% confidence interval.

The `survival`

package also offers implementations for several transformations of the survival function, to aid in interpretations of the data. One of the most helpful is the **cumulative event rates**, which calculates the proportion of subjects* experiencing* the event of interesting before or at a given time *t*.

ggsurvplot(fit, conf.int = TRUE, risk.table.col = "strata", # Change risk table color by groups ggtheme = theme_bw(), # Change ggplot2 theme palette = c("#E7B800", "#2E9FDF"), fun = "event") + xlab('Tenure (months)')

In this case, we see that ~40% of customers churn by 72 months after signing up with the company. In terms of business insights, this helps to calculate the average customer lifetime value and to inform the amount of marketing spending appropriate for customer retention/acquisition.

## Parting notes

In this notebook, we covered the basic workflow for plotting an overall KM survival curve for a time-to-event dataset. We will examine how to use KM survival curves to investigate group differences in the second notebook on this topic.

If you want to run the code yourself, click on "Remix" in the upper right corner to get a copy of the notebook in your own workspace. Please remember to **change the Python and R runtimes to** `survival_Python`

and `survival_R`

, respectively, from this notebook to ensure that you have all the installed packages and can start right away.