Association rule miningAug 16 2019 UTC

Filter association rules

Introduction

In the last notebook, we generated 79,313 association rules that appear in at least 0.1% of customers, which is almost certainly too many to provide actionable insights. There are implementations in the arules package to set more stringent thresholds at the stage of rule generation, and to filter out redundant and statistically non-significant rules afterwards. We will cover these steps in this notebook.

Import and pre-process data

First, we will prepare the data for association rule mining, as shown in previously:

4.7s
Assn_rule_mining_R (R)
## Import library
library(plyr)
library(dplyr)
library(arulesCBA)

## Import data
df <-read.csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_yes_no.csv")

## Discretize "MonthlyCharges" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
df$Binned_MonthlyCharges <- discretizeDF.supervised(Churn ~ ., 
                                                    df[, c('MonthlyCharges', 'Churn')], 
                                                    method='mdlp')$MonthlyCharges

## Rename the levels based on knowledge of min/max monthly charges
df$Binned_MonthlyCharges <- revalue(df$Binned_MonthlyCharges, 
                                   c("[-Inf,29.4)"="$0-29.4", 
                                     "[29.4,56)"="$29.4-56", 
                                     "[56,68.8)"="$56-68.8", 
                                     "[68.8,107)"="$68.8-107", 
                                     "[107, Inf]" = "$107-118.75")) 

## Discretize "Tenure" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
df$Binned_Tenure <- discretizeDF.supervised(Churn ~ ., 
                                            df[, c('Tenure', 'Churn')], 
                                            method='mdlp')$Tenure
                                            
## Rename the levels based on knowledge of min/max tenures
df$Binned_Tenure <- revalue(df$Binned_Tenure, 
                            c("[-Inf,1.5)"="1-1.5m", 
                              "[1.5,5.5)"="1.5-5.5m", 
                              "[5.5,17.5)"="5.5-17.5m", 
                              "[17.5,43.5)"="17.5-43.5m", 
                              "[43.5,59.5)"="43.5-59.5m", 
                              "[59.5,70.5)"="59.5-70.5m", 
                              "[70.5, Inf]"="70.5-72m"))

## Rename values by column header
df[df == "No"] <- NA

df[] <- lapply(df, function(x) levels(x)[x])

w <- which(df == "Yes", arr.ind = TRUE)
   
df[w] <- names(df)[w[,"col"]]

## Write to CSV
write.csv(df, './results/final_df.csv', row.names=FALSE)
               
## Convert dataframe to transaction format
tData <- read.transactions('./results/final_df.csv', 
                           format = "basket", sep = ",", 
                           header=TRUE)

## Check data
inspect(head(tData))
final_df.csv

When creating association rules

At the step of rule generation, constraints on the rule support, confidence and length can be set:

Support

Given an example grocery transaction dataset:

Support of apples is calculated as:

Therefore, support is a measure of the proportion of "transactions" in which the given item/itemset/association rule holds true. This provides an idea of how common the rule is.

Confidence

Confidence of the rule {apples -> beer} is calculated as:

Given a rule {X -> Y}, confidence is calculated as the proportion of itemsets that contain both X and Y, out of all itemsets that contain X. It effectively is a measure of how reliable the rule is. Unlike support, confidence is only applicable to association rules.

So, we can generate rules regarding itemsets that appear in at least 0.1% of all the customers in the dataset, that are true 90% of the time given the antecendent, and contain 3-5 items:

## Import library
library(arules)

## Create rules
rules <- apriori(tData, 
                 parameter = list(supp = 0.001,  ## Support (0.1%)
                                  conf=0.9,       ## Confidence (90%)
                                  minlen=3,      ## At least 3 items in rule
                                  maxlen=5))     ## At most 5 items in rule

The more stringent thresholds result in 45,444 rules.

Filter rules by lift

The generated rules can be further filtered down using a lift threshold:

Lift is a measure of how much more or less likely the items in a given rule appear together as compared to by chance. Therefore, it is a metric of the importance of a rule.

Given a set of items on the left hand side of the rule, items on the right hand side are:

  • Lift > 1: more likely to occur
  • Lift < 1: less likely to occur

A lift ~ 1 suggusts that occurrence of the items on the left and right side of the rule is statistically independent.

## Check range of lift of generated rules
summary(rules)

We see that the rules have lift that range from 0.88 to 16.82. Let's get rules with lift > 1.5:

filtered_rules <- subset(rules, subset = lift > 1.5)

summary(filtered_rules)

Remove redundant rules

Furthermore, a rule is considered redundant if it is equally or less predictive than a more general rule, which has the same items on the right hand side, but one or more items less in the left hand side.

## Filter out redundant rules
nonr_rules <- filtered_rules[!is.redundant(filtered_rules)]   

## Check
summary(nonr_rules)

Remove statistically non-significant rules

Finally, the is.significant() function uses the Fisher's exact test to identify rules where probability of occurrence of items on the left and right hand side are statistically dependent. It also corrects for multiple comparisons.

## Filter out statistically insignificant rules
sig_rules <- nonr_rules[!is.significant(nonr_rules, 
                                        tData, 
                                        method = "fisher", 
                                        adjust = 'bonferroni')]

## Check
summary(sig_rules)

Now we are down to a final set of 1,704 rules that are non-redundant and statistically significant.

Extract rules with specific items

Depending on the use case, you might want to look at rules that contain specific items. This can be done using subset() with the options:

  • %in% - select itemsets matching any given item
  • %ain% - select only itemsets matching all given item
  • %oin% - select only itemsets matching only the given item
  • %pin% - equivalent of %in% with partial matching

In our case, we are most interested in rules that contain "Churn" on the right hand side, to gain insights into customer characteristics and/or buying behaviours that are associated with churn:

## Extract rules that have "Churn" as consequent 
churn_rules <- subset(sig_rules, subset=rhs %pin% 'Churn')

summary(churn_rules)

Here we have a set of 26 rules that describe customer traits and buying habits that are associated with churn. The lift of these rules range between 3.3-3.7, indicating they are also fairly strong rules, being able to predict churn quite a bit better than chance.

We can convert the rule set to a dataframe format to inspect them:

## Convert rules matrix to dataframe
rules_df <- DATAFRAME(churn_rules, setStart='', setEnd='', separate = TRUE)

## Check
rules_df
0 items

Interestingly, a few "items" are present in a majority of the rules, such as 1-1.5 month tenure, having purchased device protection, paying by electronic cheque, etc. From the conditional probability density plots we made of this dataset, we saw that the probability of churn steadily increases with tenure, with customers with the shortest tenures most likely to leave the company. We will compare these results with those from our future analyses of this dataset to gain more insights.

Parting notes

In the nextbook, we will learn about functions that summarize and visualize these rules, including an interactive Shiny app that allows both filtering and visualizations on the fly.

Til then! :)