Filter association rules
Introduction
In the last notebook, we generated 79,313 association rules that appear in at least 0.1% of customers, which is almost certainly too many to provide actionable insights. The arules package provides ways to set more stringent thresholds at the rule-generation stage, and to filter out redundant and statistically non-significant rules afterwards. We will cover these steps in this notebook.
Import and pre-process data
First, we will prepare the data for association rule mining, as shown previously:
## Import libraries
library(plyr)
library(dplyr)
library(arulesCBA)

## Import data
df <- read.csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_yes_no.csv")

## Discretize "MonthlyCharges" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
df$Binned_MonthlyCharges <- discretizeDF.supervised(Churn ~ .,
                                                    df[, c('MonthlyCharges', 'Churn')],
                                                    method = 'mdlp')$MonthlyCharges

## Rename the levels based on knowledge of min/max monthly charges
df$Binned_MonthlyCharges <- revalue(df$Binned_MonthlyCharges,
                                    c("[-Inf,29.4)" = "$0-29.4",
                                      "[29.4,56)" = "$29.4-56",
                                      "[56,68.8)" = "$56-68.8",
                                      "[68.8,107)" = "$68.8-107",
                                      "[107, Inf]" = "$107-118.75"))

## Discretize "Tenure" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
df$Binned_Tenure <- discretizeDF.supervised(Churn ~ .,
                                            df[, c('Tenure', 'Churn')],
                                            method = 'mdlp')$Tenure

## Rename the levels based on knowledge of min/max tenures
df$Binned_Tenure <- revalue(df$Binned_Tenure,
                            c("[-Inf,1.5)" = "1-1.5m",
                              "[1.5,5.5)" = "1.5-5.5m",
                              "[5.5,17.5)" = "5.5-17.5m",
                              "[17.5,43.5)" = "17.5-43.5m",
                              "[43.5,59.5)" = "43.5-59.5m",
                              "[59.5,70.5)" = "59.5-70.5m",
                              "[70.5, Inf]" = "70.5-72m"))

## Drop "No" values and rename "Yes" values by column header
df[df == "No"] <- NA
df[] <- lapply(df, function(x) levels(x)[x])
w <- which(df == "Yes", arr.ind = TRUE)
df[w] <- names(df)[w[, "col"]]

## Write to CSV
write.csv(df, './results/final_df.csv', row.names = FALSE)

## Convert dataframe to transaction format
tData <- read.transactions('./results/final_df.csv', format = "basket", sep = ",", header = TRUE)

## Check data
inspect(head(tData))
Set constraints when creating association rules
At the rule-generation step, constraints can be set on rule support, confidence, and length:
Support
Consider an example grocery transaction dataset, where each transaction lists the items bought by one customer.

The support of {apples} is then calculated as:

supp({apples}) = (number of transactions containing apples) / (total number of transactions)

Therefore, support is a measure of the proportion of "transactions" in which the given item/itemset/association rule holds true. This provides an idea of how common the rule is.
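As a minimal sketch of the idea (using a hypothetical five-transaction grocery dataset, not our telco data), item supports can be computed with arules:

## Toy grocery dataset (hypothetical) to illustrate support
library(arules)

groceries <- list(c("apples", "beer"),
                  c("apples", "milk"),
                  c("beer", "milk"),
                  c("apples", "beer", "milk"),
                  c("milk"))
gData <- as(groceries, "transactions")

## Relative support of each item, i.e. the proportion of transactions
## containing it; apples appear in 3 of 5 transactions, so supp = 0.6
itemFrequency(gData, type = "relative")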
Confidence
The confidence of the rule {apples -> beer} is calculated as:

conf({apples} -> {beer}) = supp({apples, beer}) / supp({apples})

Given a rule {X -> Y}, confidence is the proportion of transactions that contain both X and Y out of all transactions that contain X. It is effectively a measure of how reliable the rule is. Unlike support, confidence is only applicable to association rules.
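Continuing the toy grocery sketch above, the confidence of {apples -> beer} can be computed from pairwise item supports:

## Pairwise item supports (support of each two-item combination)
ct <- crossTable(gData, measure = "support")

## conf({apples} -> {beer}) = supp({apples, beer}) / supp({apples})
## = (2/5) / (3/5) ≈ 0.67
ct["apples", "beer"] / itemFrequency(gData)["apples"]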
So, we can generate rules regarding itemsets that appear in at least 0.1% of all the customers in the dataset, that hold true 90% of the time given the antecedent, and that contain 3-5 items:
## Import library
library(arules)

## Create rules
rules <- apriori(tData,
                 parameter = list(supp = 0.001,  ## Support (0.1%)
                                  conf = 0.9,    ## Confidence (90%)
                                  minlen = 3,    ## At least 3 items in rule
                                  maxlen = 5))   ## At most 5 items in rule
The more stringent thresholds result in 45,444 rules.
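Before filtering further, it can be helpful to eyeball a few of the strongest rules. As a quick sketch, we can sort by confidence and inspect the top three:

## Peek at the rules with the highest confidence
inspect(head(sort(rules, by = "confidence"), 3))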
Filter rules by lift
The generated rules can be further filtered down using a lift threshold. For a rule {X -> Y}, lift is calculated as:

lift({X} -> {Y}) = supp({X, Y}) / (supp({X}) × supp({Y}))

Lift is a measure of how much more or less likely the items in a given rule are to appear together, compared to what would be expected if they were independent. Therefore, it is a metric of the importance of a rule.
Given a set of items on the left-hand side of the rule, the items on the right-hand side are:
- Lift > 1: more likely to occur
- Lift < 1: less likely to occur
A lift of ~1 suggests that the occurrence of the items on the left- and right-hand sides of the rule is statistically independent.
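As a side note, lift (and many other interest measures) can also be recomputed for an existing rule set with interestMeasure(); here is a quick sketch using the rules and tData objects from above:

## Recompute lift for the generated rules
head(interestMeasure(rules, measure = "lift", transactions = tData))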
## Check range of lift of generated rules
summary(rules)
We see that the lift of the rules ranges from 0.88 to 16.82. Let's keep rules with lift > 1.5:
## Keep only rules with lift above 1.5
filtered_rules <- subset(rules, subset = lift > 1.5)

summary(filtered_rules)
Remove redundant rules
Furthermore, a rule is considered redundant if it is equally or less predictive than a more general rule, i.e. one that has the same items on the right-hand side but one or more fewer items on the left-hand side.
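As a minimal sketch before dropping anything, is.redundant() returns a logical vector, so we can check how many rules are flagged and peek at a few of them:

## Flag redundant rules (TRUE = redundant)
redundant_flags <- is.redundant(filtered_rules)

## Count the redundant rules and inspect a few examples
sum(redundant_flags)
inspect(head(filtered_rules[redundant_flags]))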
## Filter out redundant rules
nonr_rules <- filtered_rules[!is.redundant(filtered_rules)]

## Check
summary(nonr_rules)
Remove statistically non-significant rules
Finally, the is.significant() function uses Fisher's exact test to identify rules where the occurrence of the items on the left- and right-hand sides is statistically dependent. It also corrects for multiple comparisons.
## Keep only statistically significant rules
sig_rules <- nonr_rules[is.significant(nonr_rules, tData,
                                       method = "fisher",
                                       adjust = 'bonferroni')]

## Check
summary(sig_rules)
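If you would rather see the underlying p-values than a TRUE/FALSE filter, here is a quick sketch, assuming the "fishersExactTest" measure offered by interestMeasure():

## Unadjusted Fisher's exact test p-value for each rule
pvals <- interestMeasure(nonr_rules, measure = "fishersExactTest",
                         transactions = tData)
summary(pvals)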
Now we are down to a final set of 1,704 rules that are non-redundant and statistically significant.
Extract rules with specific items
Depending on the use case, you might want to look at rules that contain specific items. This can be done using subset()
with the options:
- %in%: select itemsets matching any of the given items
- %ain%: select only itemsets matching all of the given items
- %oin%: select only itemsets matching only the given items
- %pin%: equivalent of %in% with partial matching
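As a brief sketch of how these operators are used (the item labels here are hypothetical; substitute real ones from itemLabels(tData)):

## Rules whose left-hand side contains ANY of the given items
subset(sig_rules, subset = lhs %in% c('Partner', 'DeviceProtection'))

## Rules whose left-hand side contains ALL of the given items
subset(sig_rules, subset = lhs %ain% c('Partner', 'DeviceProtection'))

## Item conditions can be combined with quality measures
subset(sig_rules, subset = lhs %in% 'Partner' & lift > 2)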
In our case, we are most interested in rules that contain "Churn" on the right hand side, to gain insights into customer characteristics and/or buying behaviours that are associated with churn:
## Extract rules that have "Churn" as consequent
churn_rules <- subset(sig_rules, subset = rhs %pin% 'Churn')

summary(churn_rules)
Here we have a set of 26 rules that describe customer traits and buying habits associated with churn. The lift of these rules ranges between 3.3 and 3.7, indicating that they are fairly strong rules, able to predict churn quite a bit better than chance.
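As a quick sketch, we can sort these rules by lift and inspect the strongest few:

## The five churn rules with the highest lift
inspect(head(sort(churn_rules, by = "lift"), 5))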
We can convert the rule set to a dataframe format to inspect them:
## Convert rules matrix to dataframe
rules_df <- DATAFRAME(churn_rules, setStart = '', setEnd = '', separate = TRUE)

## Check
rules_df
Interestingly, a few "items" are present in a majority of the rules, such as a 1-1.5 month tenure, having purchased device protection, and paying by electronic cheque. From the conditional probability density plots we made of this dataset, we saw that the probability of churn steadily decreases with tenure, with customers with the shortest tenures being the most likely to leave the company. We will compare these results with those from our future analyses of this dataset to gain more insights.
Parting notes
In the next notebook, we will learn about functions that summarize and visualize these rules, including an interactive Shiny app that allows both filtering and visualization on the fly.
Til then! :)