Testing the proportional hazards assumption
Introduction
In the second part of our series, we will learn about arguably one of the most important part of conducting survival analysis: characterizing the time-dependence of covariate effects on the probability of the event of interest happening at a given time, which is called hazard.
Whether it be patient death or equipment failure, we are interested in identifying factors that increase or decrease the probability of the event occurring, using a subset of approaches called survival regression. However, this is not simply a yes or no question, as often the contribution of a factor can and does change with time. For example, there is evidence that the status of hormone receptors increase the risk of breast cancer metastases during the early stages of the disease, but becomes protective later on. Authors of the study suggested that this reversal may account for previous conventional survival regressions (which did not consider time-dependent effects) not finding the status of hormone receptors to be a significant factor in breast cancer metastases.
One of the most commonly used models for survival regression is the Cox proportional hazards model, which has the assumption that the factors under examination have the same effect on hazard at any point in time. If the data in question does not satisfy the proportional hazards assumption, which is very likely given the complexity of real-life conditions, results of the Cox's model cannot be trusted. Nevertheless, determining which factors have time-varying effects can be quite useful in itself in terms of gaining insights into the data.
A well-established approach to test the proportional hazards assumption is based on scaled Schoenfeld residuals, which is independent of time if the assumption holds. Therefore, for any given covariate, a significant relationship between the Schoenfeld residuals and time indicates its effect on hazard is time-dependent.
Let's get started!
Import and pre-process data
Here we will use the same IBM Telco customer churn dataset used in part 1 of this series. We will use label encoding to quickly convert categorical variables to numerical encoding, as required by the lifelines
package:
## Import library import pandas as pd import numpy as np from sklearn.preprocessing import LabelEncoder ## Import data df = pd.read_csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_yes_no.csv") ## Label encode categorical features cat_col = ['Gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn'] le = LabelEncoder() df[cat_col] = df[cat_col].apply(le.fit_transform) ## Check dataframe df.head() ## Output to file df.to_csv('./results/binary_telco.csv', index=False)
Workflow in Python
First up, we will use the Python package lifelines
to calculate and plot scaled Schoenfeld residuals against various transformations of time for each input variable:
## Import libraries from matplotlib import pyplot as plt from lifelines import CoxPHFitter from lifelines.statistics import proportional_hazard_test ## Instantiate object cph = CoxPHFitter() ## Fit to data cph.fit(df, 'Tenure', 'Churn', show_progress=False) ## Test assumption results = proportional_hazard_test(cph, df, time_transform='rank') ## Output results results.summary.to_csv('./results/lifelines_res.csv', index=False)
## Plot scaled Schoenfeld residuals cph.check_assumptions(df, p_value_threshold=0.05, show_plots=True, advice=True)
As an example, we will plot the scaled Schoenfeld residual against transformed time for the TotalCharges
variable, which does not pass the test:
plt.gcf()
Workflow in R
To perform the same test in R, we will use the survival
package introduced in the introductory post on performing survival analysis:
## Impor library library(survival) ## Import data df <- read.csv(binary_telco.csv) ## Test proportional hazard assumption cox <- coxph(Surv(Tenure, Churn) ~ ., data = df) x <- cox.zph(cox, transform="km", global=TRUE) ## Output results survival_stats <- as.data.frame(x$table) write.csv(survival_stats, './results/survival_res.csv', row.names=FALSE)
plot(x[1], ylim=c(-10, 10), col = '#D3D3D3')
plot(x[7], ylim=c(-10, 10), col = '#ff0000')
## Import libraries library(ggplot2) library(ggpubr) ## Convert rownames to a variable column lifelines_res <- read.csv(survival_res.csv) lifelines_res$var <- rownames(lifelines_res) rownames(lifelines_res) <- NULL survival_stats <- read.csv(survival_res.csv) survival_stats$var <- rownames(survival_stats) rownames(survival_stats) <- NULL ## Select only relevant columns from each results dataframe lifelines_res <- lifelines_res[, c('var', 'p')] survival_stats <- survival_stats[, c('var', 'p')] ## Rename column headers colnames(lifelines_res) <- c('var', 'lifelines p-values') colnames(survival_stats) <- c('var', 'survival p-values') ## Merge dataframes for comparison merged_df <- Reduce(merge, list(lifelines_res, survival_stats)) rownames(merged_df) <- merged_df$var merged_df$var <- NULL ## Sort by one of the columns for comparison merged_df <- merged_df[order(-merged_df['survival p-values']), ] ## Make balloon plot of dataframe ggballoonplot(merged_df, fill = "value") + scale_fill_gradient(low = "red", high = "green", limits=c(0, 0.07)) + theme(text = element_text(size=20))