Exploratory data analysis
Aug 16 2019 UTC

Supervised discretization of continuous variables

I am just popping in with a little fun tidbit that I found while researching for the continuous variable discretization post (coming up soon!).

While binning continuous variables is almost always a bad idea for data modeling, it does seem to have some value in exploratory data analysis. Previously, we have used conditional probability density plots to examine how categorical variables vary in relation to a continuous variable. However, visual inspection of these plots can be inconclusive, as it is unclear which ranges of values, along a smooth continuous distribution, have significantly different probabilities for various levels of a given categorical variable. In addition, the probability density plots do not show the number of data points at various values of the continuous variable.

This is where weight of evidence (WOE)-based binning of continuous variables can offer complementary insights: it splits a continuous variable into distinct value segments, based on the variable's relationship to an outcome variable of interest, that differ in some significant way. For this reason, the bins determined using this method are more meaningful than arbitrarily determined bin widths.
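To make the idea concrete, here is a minimal sketch of how WOE can be computed by hand once the bins are known. The data frame `toy` and its columns are invented for illustration, and the sign convention (share of positives divided by share of negatives) is an assumption; some sources use the inverse ratio, which flips the sign.

```r
# Toy data, invented for illustration only (not the Telco dataset)
toy <- data.frame(
  bin   = c("low", "low", "low", "low", "high", "high", "high", "high"),
  churn = c("Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No")
)

# Cross-tabulate bins against the outcome
counts <- table(toy$bin, toy$churn)

# Share of all positives ("Yes") and of all negatives ("No") that fall in each bin
pos_share <- counts[, "Yes"] / sum(counts[, "Yes"])
neg_share <- counts[, "No"] / sum(counts[, "No"])

# WOE per bin: a positive value means the bin holds a larger share of the
# churners than of the non-churners (assumed sign convention)
woe <- log(pos_share / neg_share)
print(round(woe, 3))
```

In this toy example the "low" bin gets a positive WOE and the "high" bin a negative one, so the two bins are separated by how strongly they are associated with churn rather than by an arbitrary width.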

There are numerous packages that offer implementations for supervised (scorecard, discretization, arulesCBA) and unsupervised (arules) discretization of continuous variables. The R package scorecard, which is designed for assessing credit risk, has easy-to-use visualization functions that make it well-suited for quick exploratory data analysis, so it will be the focus of this notebook.

If you want to try this yourself, click on "Remix" in the upper right corner to get a copy of the notebook in your own workspace. Please remember to change the R runtime to discretize_R from this notebook to ensure that you have all the installed packages and can start right away.

Let's take a look!

  • Import data
  • We will grab our usual IBM Telco customer churn dataset and get only the numeric variables, plus the outcome variable Churn.

    ## Import data
    data <- read.csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_yes_no.csv")
    
    ## Get only the numeric variables
    df <- data[, c('TotalCharges', 'MonthlyCharges', 'Tenure', 'Churn')]

  • Supervised discretization using scorecard
  • We will perform WOE-based binning of the three continuous variables in the Telco dataset: Tenure, MonthlyCharges and TotalCharges. The scorecard package can plot the resulting bins together with the probability of the outcome of interest (Churn, in our case) for each bin, so we can identify potential subpopulations of customers with decreased or increased risk of churn. To illustrate how the WOE bin plots and conditional probability density plots can offer complementary insights, we will compare them side by side.

    MonthlyCharges

    Let's start with examining the relationship between how much a customer has to pay monthly with their likelihood to leave the company:

    ## Import libraries
    library(scorecard)
    library(ggplot2)
    library(ggplotify)
    library(plotly)
    
    ## Calculate bin breaks for numeric variables with respect to their relationships with the outcome variable Churn
    bins = woebin(df[, c('Tenure', 'MonthlyCharges', 'TotalCharges', 'Churn')], y = 'Churn', positive = 'Yes')
    
    ## Visualize bins
    woebin_plot(bins$MonthlyCharges)$MonthlyCharges
    ## Compare results with conditional probability density plot
    ggplotly(ggplot(df, aes(x = MonthlyCharges, fill = Churn)) + 
                                geom_density(position='fill', alpha = 0.5) + 
                                xlab('Monthly Charges') + labs(fill='Churn') +
                                theme(legend.text=element_text(size=10), 
                                      axis.title=element_text(size=10)))

    We see that both plots identified two groups of customers that are more likely to churn, one group paying $26-56/month and another paying $68-106/month. As we had mentioned previously, this could be of interest for the company as these may reflect uncompetitive pricing that should be adjusted. Most interestingly, what the conditional probability density plot does not show is that most customers are in the $68-106/month tier, which poses a potentially significant problem as these customers are also much more likely to churn than the rest.
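The point about group sizes can be checked with a small table of counts and churn rates per bin. The sketch below uses simulated data, not the Telco dataset: the column names `monthly` and `churn` and the churn probabilities are assumptions made for illustration, and the break points are simply the dollar values quoted above.

```r
# Simulated data, for illustration only
set.seed(1)
toy <- data.frame(monthly = runif(200, 20, 120))
toy$churn <- ifelse(runif(200) < ifelse(toy$monthly > 68, 0.4, 0.15), "Yes", "No")

# Break points taken from the bin edges mentioned in the text
breaks <- c(-Inf, 26, 56, 68, 106, Inf)
toy$bin <- cut(toy$monthly, breaks = breaks, right = FALSE)

# Number of customers and observed churn rate in each bin
summary_tbl <- data.frame(
  count      = as.vector(table(toy$bin)),
  churn_rate = as.vector(tapply(toy$churn == "Yes", toy$bin, mean)),
  row.names  = levels(toy$bin)
)
print(summary_tbl)
```

Reading the `count` column next to `churn_rate` shows at a glance whether a high-risk bin is also a large one, which is exactly the information the density plot leaves out.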

    Tenure

    Next, we will see how the tendency of customers to churn changes with their tenure with the company:

    ## Visualize bins
    woebin_plot(bins$Tenure)$Tenure
    ## Compare results with conditional probability density plot
    ggplotly(ggplot(df, aes(x = Tenure, fill = Churn)) + 
                                geom_density(position='fill', alpha = 0.5) + 
                                xlab('Tenure') + labs(fill='Churn') +
                                theme(legend.text=element_text(size=10), 
                                      axis.title=element_text(size=10)))

    Both plots show a steady decrease in the probability of churn as a customer's time with the company increases. This makes sense: customers who are more likely to leave will already have done so as time goes on, so gradually only the loyal customers remain.

    TotalCharges

    Finally, TotalCharges gives us a sense of the interaction between Tenure and MonthlyCharges.

    ## Visualize bins
    woebin_plot(bins$TotalCharges)$TotalCharges
    ## Compare results with conditional probability density plot
    ggplotly(ggplot(df, aes(x = TotalCharges, fill = Churn)) + 
                                geom_density(position='fill', alpha = 0.5) + 
                                xlab('Total Charges') + labs(fill='Churn') +
                                theme(legend.text=element_text(size=10), 
                                      axis.title=element_text(size=10)))

    We see that customers with lower total charges are more likely to leave. However, as this could be due to different combinations of monthly fee and tenure with the company, we try to dig deeper using a multivariate scatter plot (much like the one made in the automated exploratory data analysis post). As there are ~7,000 data points, we will set the opacity (alpha) very low, so that regions in the plot with many data points stand out through denser colouring.

    This plot illustrates the interaction amongst the four variables of interest: MonthlyCharges, Tenure, TotalCharges and Churn. Churned customers are represented by teal circles; we see that many have high monthly fees and short tenures (lower right quadrant), which result in lower total charges (small circles). This potentially identifies a segment of high-paying customers that require some extra attention in the first few months after signing on, in the form of discounts or gifts, in order to retain them for longer.

    ggplot(df, aes(x = MonthlyCharges, y = Tenure)) + 
        geom_point(aes(colour = factor(Churn), size = TotalCharges), alpha=0.1) + 
        labs(color='Churn', size = 'Total charges') +
        scale_x_continuous(name="Monthly charges") +
        scale_y_continuous(name="Tenure")

    This is a whole lot of insights for such a simple analysis!

    Parting notes

    In an upcoming notebook, I will look at a variety of methods for discretizing continuous variables. While binning is almost always a bad idea for building data models, converting continuous variables into a categorical form has its uses in making these variables available for analysis methods that only work with categorical variables, such as multiple correspondence analysis (MCA) and association rule learning (another post for the near future).
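As a small taste of the unsupervised side, here is a hedged sketch of equal-frequency binning in base R, one simple way to turn a continuous variable into a factor for methods that need categorical input. The `tenure` vector is simulated, and the choice of quartiles is an arbitrary assumption for illustration.

```r
# Simulated tenure values in months, for illustration only
set.seed(42)
tenure <- sample(1:72, 300, replace = TRUE)

# Quartile cut points; unique() guards against duplicated breaks
breaks <- unique(quantile(tenure, probs = seq(0, 1, by = 0.25)))

# include.lowest = TRUE keeps the minimum value inside the first bin
tenure_cat <- cut(tenure, breaks = breaks, include.lowest = TRUE)
print(table(tenure_cat))
```

Unlike the WOE-based bins above, these cut points ignore the outcome variable entirely, so they are only guaranteed to balance bin sizes, not to separate customers by churn risk.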

    Til next time! :)