Testing the proportional hazards assumption

Introduction

In the second part of our series, we will learn about arguably one of the most important part of conducting survival analysis: characterizing the time-dependence of covariate effects on the probability of the event of interest happening at a given time, which is called hazard.

Whether it be patient death or equipment failure, we are interested in identifying factors that increase or decrease the probability of the event occurring, using a subset of approaches called survival regression. However, this is not simply a yes or no question, as often the contribution of a factor can and does change with time. For example, there is evidence that the status of hormone receptors increase the risk of breast cancer metastases during the early stages of the disease, but becomes protective later on. Authors of the study suggested that this reversal may account for previous conventional survival regressions (which did not consider time-dependent effects) not finding the status of hormone receptors to be a significant factor in breast cancer metastases.

One of the most commonly used models for survival regression is the Cox proportional hazards model, which has the assumption that the factors under examination have the same effect on hazard at any point in time. If the data in question does not satisfy the proportional hazards assumption, which is very likely given the complexity of real-life conditions, results of the Cox's model cannot be trusted. Nevertheless, determining which factors have time-varying effects can be quite useful in itself in terms of gaining insights into the data.

A well-established approach to test the proportional hazards assumption is based on scaled Schoenfeld residuals, which is independent of time if the assumption holds. Therefore, for any given covariate, a significant relationship between the Schoenfeld residuals and time indicates its effect on hazard is time-dependent.

Let's get started!

Import and pre-process data

Here we will use the same IBM Telco customer churn dataset used in part 1 of this series. We will use label encoding to quickly convert categorical variables to numerical encoding, as required by the lifelines package:

## Import library
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

## Import data
df = pd.read_csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_yes_no.csv")

## Label encode categorical features 
cat_col = ['Gender', 'SeniorCitizen', 'Partner', 'Dependents', 
           'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
           'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
           'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']

le = LabelEncoder()

df[cat_col] = df[cat_col].apply(le.fit_transform)

## Check dataframe
df.head()

## Output to file
df.to_csv('./results/binary_telco.csv', index=False)

1.6s

test_ph_Python (Python)

binary_telco.csv

Workflow in Python

First up, we will use the Python package lifelines to calculate and plot scaled Schoenfeld residuals against various transformations of time for each input variable:

## Import libraries
from matplotlib import pyplot as plt
from lifelines import CoxPHFitter
from lifelines.statistics import proportional_hazard_test

## Instantiate object
cph = CoxPHFitter()

## Fit to data
cph.fit(df, 'Tenure', 'Churn', show_progress=False)

## Test assumption
results = proportional_hazard_test(cph, df, time_transform='rank')

## Output results
results.summary.to_csv('./results/lifelines_res.csv', index=False)

1.4s

test_ph_Python (Python)

lifelines_res.csv

## Plot scaled Schoenfeld residuals
cph.check_assumptions(df, p_value_threshold=0.05, show_plots=True, advice=True)

353.6s

test_ph_Python (Python)

As an example, we will plot the scaled Schoenfeld residual against transformed time for the TotalCharges variable, which does not pass the test:

plt.gcf()

1.3s

test_ph_Python (Python)

Workflow in R

To perform the same test in R, we will use the survival package introduced in the introductory post on performing survival analysis:

## Impor library
library(survival)

## Import data
df <- read.csv(binary_telco.csv
)

## Test proportional hazard assumption
cox <- coxph(Surv(Tenure, Churn) ~  ., data = df)

x <- cox.zph(cox, transform="km", global=TRUE)

## Output results
survival_stats <- as.data.frame(x$table)

write.csv(survival_stats, './results/survival_res.csv', row.names=FALSE)

0.6s

test_ph_R (R)

survival_res.csv

plot(x[1], ylim=c(-10, 10),  col = '#D3D3D3')

0.3s

test_ph_R (R)

plot(x[7], ylim=c(-10, 10), col = '#ff0000')

0.4s

test_ph_R (R)

0.6s

test_ph_R (R)

## Import libraries
library(ggplot2)
library(ggpubr)

## Convert rownames to a variable column
lifelines_res <- read.csv(survival_res.csv
)
lifelines_res$var <- rownames(lifelines_res)
rownames(lifelines_res) <- NULL

survival_stats <- read.csv(survival_res.csv
)
survival_stats$var <- rownames(survival_stats)
rownames(survival_stats) <- NULL

## Select only relevant columns from each results dataframe
lifelines_res <- lifelines_res[, c('var', 'p')]

survival_stats <- survival_stats[, c('var', 'p')]

## Rename column headers
colnames(lifelines_res) <- c('var', 'lifelines p-values')

colnames(survival_stats) <- c('var', 'survival p-values')

## Merge dataframes for comparison
merged_df <- Reduce(merge, list(lifelines_res, survival_stats))

rownames(merged_df) <- merged_df$var

merged_df$var <- NULL

## Sort by one of the columns for comparison
merged_df <- merged_df[order(-merged_df['survival p-values']), ] 

## Make balloon plot of dataframe
ggballoonplot(merged_df, fill = "value") + 
      scale_fill_gradient(low = "red", high = "green", limits=c(0, 0.07)) +
        theme(text = element_text(size=20))

0.6s

test_ph_R (R)