# Plotting Kaplan-Meier survival curves - Part 2

## Introduction

As mentioned in the previous notebook, when analyzing time-to-event data we are almost always interested in identifying factors (covariates) that significantly modify the probability of the event of interest. In our example, this means identifying which subsets of customers, whether defined by demographics or purchasing behaviour, are more or less likely to stop buying the company's services at a given point in time. One way to do this is to plot and examine KM survival curves stratified by each variable.
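
To make "stratified" concrete, here is a minimal pure-Python sketch of the Kaplan-Meier product-limit estimate computed separately for each level of a covariate. It is only an illustration of the idea, not the code used in this notebook (which relies on lifelines); the helper names `kaplan_meier` and `stratified_km`, the column names, and the toy rows are all made up for this example.

```python
from collections import defaultdict


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time where an event was observed.

    durations: observed time for each subject
    events:    1 if the event (e.g. churn) was observed, 0 if censored
    """
    events_at = defaultdict(int)   # number of events at each time
    leaving_at = defaultdict(int)  # events + censorings at each time
    for t, e in zip(durations, events):
        leaving_at[t] += 1
        if e:
            events_at[t] += 1

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving_at):
        d = events_at[t]
        if d:
            # Product-limit step: multiply by the share of at-risk subjects surviving t
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Everyone observed at t (event or censored) leaves the risk set
        at_risk -= leaving_at[t]
    return curve


def stratified_km(rows, group_key, duration_key="tenure", event_key="churn"):
    """Fit one KM curve per level of `group_key` (hypothetical column names)."""
    groups = defaultdict(lambda: ([], []))
    for row in rows:
        durations, events = groups[row[group_key]]
        durations.append(row[duration_key])
        events.append(row[event_key])
    return {g: kaplan_meier(d, e) for g, (d, e) in groups.items()}


# Toy data, invented purely for illustration
toy_rows = [
    {"contract": "Month-to-month", "tenure": 1, "churn": 1},
    {"contract": "Month-to-month", "tenure": 3, "churn": 1},
    {"contract": "Month-to-month", "tenure": 3, "churn": 0},
    {"contract": "Month-to-month", "tenure": 5, "churn": 1},
    {"contract": "Two year", "tenure": 10, "churn": 0},
    {"contract": "Two year", "tenure": 24, "churn": 1},
    {"contract": "Two year", "tenure": 30, "churn": 0},
]

curves = stratified_km(toy_rows, "contract")
for group, curve in curves.items():
    print(group, curve)
```

Each stratum gets its own risk set, so the curves can diverge; comparing them is what lets us judge whether a covariate is associated with churn.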

If you want to try this yourself, click on "Remix" in the upper right corner to get a copy of the notebook in your own workspace. Remember to change the Python and R runtimes to the survival_Python and survival_R runtimes from this notebook, respectively, so that all the required packages are already installed and you can start right away.

## Import data and discretize continuous variable

Because Kaplan-Meier estimation of the survival function cannot relate the probability of event occurrence to a continuous variable, we need to bin the continuous variable MonthlyCharges into discrete levels, so that we can examine the probability of customer churn at different "tiers" of customer spending. Here, we will use the arulesCBA package, which takes a supervised approach: it identifies the bin breaks that are most informative with respect to a class label, which is Churn in our case. We will ignore TotalCharges, as it is a product of MonthlyCharges and Tenure, the latter of which is already part of the survival function.

## Import library
library(plyr)
library(dplyr)
library(arulesCBA)

## Import data

## Encode "Churn" as 0/1
df <- df %>%
  mutate(Churn = ifelse(Churn == "No", 0, 1))

## Set "Churn" as factor
df$Churn <- as.factor(df$Churn)

## Discretize "MonthlyCharges" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
df$Binned_MonthlyCharges <- discretizeDF.supervised(Churn ~ ., df[, c('MonthlyCharges', 'Churn')], method='mdlp')$MonthlyCharges

## Check the dataframe
head(df)
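
For intuition about what "supervised" binning means, the sketch below finds the single cut point on a numeric variable that most reduces the entropy of a class label. Entropy-based methods such as MDLP apply this kind of split recursively and use a minimum description length criterion to decide when to stop; that stopping rule is not reproduced here, and this is not the arulesCBA implementation. The function names and the toy values are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut_point(values, labels):
    """Return (cut, information_gain) for the best single binary split.

    Candidate cuts are midpoints between consecutive distinct sorted values.
    Returns (None, 0.0) if no split improves on the unsplit entropy.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([label for _, label in pairs])
    best = (None, 0.0)
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best[1]:
            best = ((lo + hi) / 2, gain)
    return best


# Toy data, invented for illustration: low spenders stay, high spenders churn
toy_charges = [20, 25, 30, 35, 70, 80, 90, 100]
toy_churn = [0, 0, 0, 0, 1, 1, 1, 1]

cut, gain = best_cut_point(toy_charges, toy_churn)
print(cut, gain)
```

Because the cut is chosen using the label, the resulting bins separate churners from non-churners better than equal-width or equal-frequency bins typically would.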

Let's see what the new bins look like:

## Import library
library(ggplot2)

## Count customers who churned vs. stayed for each tier of monthly fee
summary_df <- data.frame(t(table(df$Binned_MonthlyCharges, df$Churn)))

## Plot
ggplot(summary_df, aes(x = Var2, y = Freq, fill = Var1)) +
  geom_col() +
  xlab("Binned monthly charges") +
  ylab("No. customers") +
  labs(fill = "Churn")

It looks like there are differences in the propensity to churn between tiers of monthly fees. Consistent with what we saw in the conditional probability density plots, customers paying $29.4-56/month or $68.8-107/month appear to be more likely to leave the company.
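
Since the stacked bars show counts and the bins differ in size, it can help to compare the share of churners within each bin directly. The following is a small stdlib-only sketch of that calculation; the function name `churn_rate_by_bin` and the toy values are hypothetical and are not taken from the telco dataset.

```python
from collections import Counter


def churn_rate_by_bin(bins, churned):
    """Return {bin_label: fraction of customers in that bin who churned}."""
    totals = Counter(bins)
    leavers = Counter(b for b, c in zip(bins, churned) if c)
    return {b: leavers[b] / totals[b] for b in totals}


# Toy data, invented for illustration
toy_bins = ["low", "low", "low", "low", "mid", "mid", "high", "high", "high", "high"]
toy_churned = [0, 0, 0, 1, 1, 0, 1, 1, 1, 0]

rates = churn_rate_by_bin(toy_bins, toy_churned)
for label, rate in rates.items():
    print(f"{label}: {rate:.0%}")
```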

This suggests that the discretization does produce informative bins. Next, we rename the levels of the binned MonthlyCharges variable to make them more reader-friendly:

## Check levels of binned variable
unique(df$Binned_MonthlyCharges)

## Rename the levels based on knowledge of min/max monthly charges
df$Binned_MonthlyCharges <- revalue(df$Binned_MonthlyCharges,
                                    c("[-Inf,29.4)" = "$0-29.4",
                                      "[29.4,56)" = "$29.4-56",
                                      "[56,68.8)" = "$56-68.8",
                                      "[68.8,107)" = "$68.8-107",
                                      "[107, Inf]" = "$107-118.75"))

## Output dataframe to CSV so it can be used by Python runtime
write.csv(df, './results/processed_telco_df.csv', row.names=FALSE)

## Plotting stratified KM survival curves

### Workflow in Python

## Import libraries
import pandas as pd
from lifelines import KaplanMeierFitter
import matplotlib.pyplot as plt
from matplotlib import style
from lifelines.statistics import multivariate_logrank_test
from matplotlib.offsetbox import AnchoredText

## Import data

## Set colour dictionary for consistent colour coding of KM curves
colours = {'Yes':'g', 'No':'r',
'Female':'b', 'Male':'y',
'Month-to-month':'#007f0e', 'Two year':'#c4507c','One year':'#feba9e',