Plotting Kaplan-Meier survival curves - Part 1

Introduction

The first step of survival analysis is most often modeling and visualizing the time-to-event data as a survival function S(t), which describes the probability that the event of interest does not happen at a given time t.

As the actual survival function cannot be observed, it is approximated by the Kaplan-Meier (KM) estimator, which essentially calculates the proportion of at-risk subjects that has not yet "succumbed" (to death, malfunction, etc) out of all at-risk subjects present at each time t:

Being a non-parametric method, the Kaplan-Meier estimator does not make assumptions about the distribution of survival times and any specific relationships between covariates and time to the event of interest.

Throughout this course, we will use the well-known IBM Telco customer churn dataset as example to demonstrate the utility of survival analysis for deriving business insights. The Python package lifelines and R package survival both provide easy implementations to plot KM survival curves of a dataset that contains (at least) two parameters for each subject observed: 1) whether the event of interest has occurred and 2) if so, when did it occur relative to the start point of observation. For the Telco data set, this correspond to the Churn and Tenure columns, respectively.

Plotting KM survival curves

Workflow in Python

First up, let's create the KM survival curve in Python. The lifelines API is very similar to that of scikit-learn, where the KaplanMeierFitter object is first instantiated and then fitted to the data.

## Import libraries
import pandas as pd
import numpy as np
from lifelines import KaplanMeierFitter
import matplotlib.pyplot as plt
from matplotlib import style 

## Import data
df = pd.read_csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_yes_no.csv")

## Re-encode "Churn" as 0/1
df['Churn'] = np.where(df['Churn'] == 'Yes', 1, 0)

## Instantiate kmf object
kmf = KaplanMeierFitter()

## Fit kmf object to data
kmf.fit(df['Tenure'], event_observed = df['Churn'])

## Plot KM curve
ax = kmf.plot(xlim=(0, 75), ylim=(0, 1))
ax.set_title("Overall survival probability")
ax.set_xlabel("Tenure (months)")
ax.set_ylabel("Survival probability")
plt.gcf()

In this plot, the y-axis represents the probability that a customer is still subscribed to the company's services (namely the "survival probability") at a given time (x-axis) after they first signed on. We see that the probability of a given customer leaving decreases (the curve flattens) with time, which makes sense as more and more loyal customers remain as time goes on. The blue band superimposed on the KM curve is the 95% confidence interval of survival probability at each time point.

Workflow in R

Now let's try it in R:

## Import libraries
library(dplyr)
library(survival)
library(survminer)

## Import data
df <- read.csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_yes_no.csv")

## Encode "Churn" as 0/1
df <- df %>%
         mutate(Churn = ifelse(Churn == "Yes", 1, 0))

## Fit KM estimator to data
fit <- survfit(Surv(Tenure, Churn) ~ 1, data=df)

## Plot
ggsurvplot(fit, 
           conf.int = TRUE,
           ggtheme = theme_bw(),
           title = "Overall survival probability",
           xlim = c(0, 75)) +
	xlab('Tenure (months)')

We see the same KM curve, with the grey band indicating the 95% confidence interval.

The survival package also offers implementations for several transformations of the survival function, to aid in interpretations of the data. One of the most helpful is the cumulative event rates, which calculates the proportion of subjects experiencing the event of interesting before or at a given time t.

ggsurvplot(fit,
          conf.int = TRUE,
          risk.table.col = "strata", # Change risk table color by groups
          ggtheme = theme_bw(), # Change ggplot2 theme
          palette = c("#E7B800", "#2E9FDF"),
          fun = "event") +
	xlab('Tenure (months)')

In this case, we see that ~40% of customers churn by 72 months after signing up with the company. In terms of business insights, this helps to calculate the average customer lifetime value and to inform the amount of marketing spending appropriate for customer retention/acquisition.

Parting notes

In this notebook, we covered the basic workflow for plotting an overall KM survival curve for a time-to-event dataset. We will examine how to use KM survival curves to investigate group differences in the second notebook on this topic.

If you want to run the code yourself, click on "Remix" in the upper right corner to get a copy of the notebook in your own workspace. Please remember to change the Python and R runtimes to survival_Python and survival_R, respectively, from this notebook to ensure that you have all the installed packages and can start right away.