Relationship between variables

Introduction

In this notebook, we will use the results of factor analysis of mixed data (FAMD) to explore relationships between variables.

  • Import and pre-process data
    ## Import library
    library(plyr)
    library(dplyr)
    library(arulesCBA)
    
    ## Import data
    df <- read.csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv")
    
    ## Drop TotalCharges variable, as it is a product of MonthlyCharges and Tenure
    df <- within(df, rm('TotalCharges'))
    
    ## Discretize "MonthlyCharges" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
    df$Binned_MonthlyCharges <- discretizeDF.supervised(Churn ~ ., df[, c('MonthlyCharges', 'Churn')], method='mdlp')$MonthlyCharges
    
    ## Rename the levels based on knowledge of min/max monthly charges
    df$Binned_MonthlyCharges = revalue(df$Binned_MonthlyCharges, 
                                       c("[-Inf,29.4)"="$0-29.4", 
                                         "[29.4,56)"="$29.4-56", 
                                         "[56,68.8)"="$56-68.8", 
                                         "[68.8,107)"="$68.8-107", 
                                         "[107, Inf]" = "$107-118.75"))
    
    ## Discretize "Tenure" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
    df$Binned_Tenure <- discretizeDF.supervised(Churn ~ ., 
                                                df[, c('Tenure', 'Churn')], 
                                                method='mdlp')$Tenure
    
    ## Rename the levels based on knowledge of min/max tenures
    df$Binned_Tenure = revalue(df$Binned_Tenure, 
                               c("[-Inf,1.5)"="1-1.5m", 
                                 "[1.5,5.5)"="1.5-5.5m",
                                 "[5.5,17.5)"="5.5-17.5m",
                                 "[17.5,43.5)"="17.5-43.5m",
                                 "[43.5,59.5)"="43.5-59.5m",
                                 "[59.5,70.5)"="59.5-70.5m",
                                 "[70.5, Inf]"="70.5-72m"))
    
    ## Export data to CSV for use in Python runtime
    write.csv(df, "./results/cleaned_df.csv", row.names=FALSE)
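For readers curious about what supervised discretization does under the hood, here is a minimal Python sketch of the core idea behind entropy-based methods such as MDLP: pick the cut point that minimizes the class entropy of the resulting bins. It finds only a single best cut and leaves out MDLP's recursive splitting and stopping criterion, so it is not what `discretizeDF.supervised` actually runs. The tenure values, churn labels and the `best_cut` helper are made up for illustration.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * log2(p)
    return result


def best_cut(values, labels):
    """Return (cut, score) for the cut with the lowest weighted class entropy.

    Candidate cuts are midpoints between consecutive distinct values.
    """
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [c for _, c in pairs]
    best = None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no cut possible between identical values
        cut = (xs[i] + xs[i - 1]) / 2
        left, right = ys[:i], ys[i:]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if best is None or score < best[1]:
            best = (cut, score)
    return best


# Hypothetical toy data: tenure in months and whether the customer churned
tenure = [1, 1, 2, 3, 5, 12, 24, 36, 48, 60, 72]
churn = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "No"]

cut, score = best_cut(tenure, churn)
print(f"best cut: {cut}, weighted entropy: {score:.3f}")
```

In this toy example the labels separate cleanly, so the best cut leaves two pure bins. On real data the bins are rarely pure, which is why MDLP needs a criterion for deciding when further splits stop being worthwhile.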

    Relationship between variables

    Numerical variables

    Correlation circles, which were introduced in part 2 of this series as a way to visualize how well numerical variables are represented by two given principal components (PCs), can also be used to examine relationships between those variables.

    ## Import libraries
    library(FactoMineR)
    library(factoextra)
    
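    ## FAMD, with "Churn" as the supplementary variable (looked up by name rather than by position)
    res.famd <- FAMD(df, 
                     sup.var = which(names(df) == "Churn"), 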
                     graph = FALSE, 
                     ncp=25)
    
    ## Plotting
    fviz_famd_var(res.famd, 
                  "quanti.var", 
                  repel = TRUE,
                  col.var = "black")

    In a correlation circle, positively correlated variables point in the same direction and tend to fall in the same quadrant, whereas negatively correlated variables sit on opposite sides of the origin. In this example, Tenure and MonthlyCharges appear to be negatively correlated. This makes some intuitive sense, as customers with high monthly charges may not stay with the company for very long.
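To make the geometry concrete, here is a minimal Python sketch of the two-variable case, where each variable's coordinates on the correlation circle are its correlations with the two PCs. The numbers are made up, and the analysis is plain PCA on two standardized variables rather than the FAMD fitted above, so treat it only as an illustration of how to read the plot.

```python
from math import sqrt


def standardize(xs):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    sd = sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / sd for x in xs]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


def variance(xs):
    """Population variance of scores that are already centred."""
    return sum(x * x for x in xs) / len(xs)


# Hypothetical toy data: longer-tenured customers pay less per month
tenure = [2, 5, 9, 14, 20, 31, 45, 60]
monthly = [95, 99, 80, 85, 70, 62, 66, 40]

z1, z2 = standardize(tenure), standardize(monthly)

# For two standardized variables the principal axes are the sum and
# difference directions; the one with more variance is PC1.
pc_sum = [(a + b) / sqrt(2) for a, b in zip(z1, z2)]
pc_diff = [(a - b) / sqrt(2) for a, b in zip(z1, z2)]
pc1, pc2 = sorted([pc_sum, pc_diff], key=variance, reverse=True)

# Coordinates on the correlation circle = correlation with each PC
coords = {
    "Tenure": (pearson(tenure, pc1), pearson(tenure, pc2)),
    "MonthlyCharges": (pearson(monthly, pc1), pearson(monthly, pc2)),
}
for name, (x, y) in coords.items():
    print(f"{name:>15}: Dim1 = {x:+.2f}, Dim2 = {y:+.2f}")
print(f"correlation(Tenure, MonthlyCharges) = {pearson(tenure, monthly):+.2f}")
```

The two toy variables end up with opposite signs on Dim1, which is the pattern to look for in a negatively correlated pair. With only two variables both arrows reach the unit circle exactly; in a real analysis with more variables, shorter arrows indicate variables that are less well represented on the two plotted dimensions.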

    Finally, as a sidenote, the PCAmixdata package can produce the exact same plot from the FAMD results. One of the authors of the package has a fairly detailed tutorial here.

    Categorical variables

    The squared loading plots shown in part 2 of this series gave us a rough idea of the relationships between categorical variables. Level maps let us go further by visualizing the relationships between individual levels of categorical variables (including discretized continuous variables, such as MonthlyCharges and Tenure). This gives much more fine-grained insight: for example, "Senior Citizen" and "Not Senior Citizen" carry very different meanings, which are lost when the two are lumped together into a single variable.
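Level maps are possible because MCA works on an indicator matrix in which every level, rather than every variable, gets its own column. Here is a minimal Python sketch of that expansion, using made-up customer records and a hypothetical `indicator_matrix` helper:

```python
def indicator_matrix(records):
    """One-hot encode a list of dicts: one 0/1 column per (variable, level) pair."""
    columns = sorted({(var, level) for rec in records for var, level in rec.items()})
    rows = [
        [1 if rec.get(var) == level else 0 for var, level in columns]
        for rec in records
    ]
    return columns, rows


# Hypothetical toy records (not taken from the telco dataset)
customers = [
    {"SeniorCitizen": "Senior Citizen", "Contract": "Month-to-month"},
    {"SeniorCitizen": "Not Senior Citizen", "Contract": "One year"},
    {"SeniorCitizen": "Not Senior Citizen", "Contract": "Month-to-month"},
]

columns, rows = indicator_matrix(customers)

# Print one line per level, followed by its 0/1 column across customers
for (var, level), column in zip(columns, zip(*rows)):
    print(f"{var} = {level:<20} {column}")
```

Each customer has exactly one 1 per original variable, so two variables expand into four level columns here. Each of those columns becomes a point on the level map.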

    All three packages used in this series have some implementation for this type of visualization, as you will see in the comparison below.

    Using FactoMineR

    The FAMD implementation in the FactoMineR package somehow does not allow display of the supplementary variable (Churn in this case), which is key to interpreting the results. However, as FAMD essentially uses multiple correspondence analysis (MCA) to handle the categorical variables, we can directly perform MCA using FactoMineR to visualize relationships between various aspects of customer behaviour and our outcome of interest, customer churn.

    I will plot results from both FAMD and MCA for the sake of comparison:

    ## Plot relationship between levels of categorical variables obtained from FAMD
    fviz_famd_var(res.famd, "quali.var", col.var = "cos2", 
                 gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
                 labelsize = 3,
                 repel=TRUE) +
                 xlim(-3, 3) + ylim(-2, 2)
    
    ## Plot relationship between levels of categorical variables obtained from MCA
    res.mca <- MCA(df, quanti.sup=c(5, 18), quali.sup=19, graph = FALSE)
    
    fviz_mca_var(res.mca, col.var = "cos2", 
                 gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
                 labelsize = 3, 
                 repel=TRUE) +
                 xlim(-1.5, 1.5) + ylim(-1, 1)

    The two plots show similar relationships between the levels of all categorical variables, but the MCA results also let us see where "Churn" and "No Churn" (in dark green) fall relative to the levels of the other variables. As levels that are closer together on the factor map are more closely related, we can glean that having a month-to-month plan and paying by electronic cheque are associated with customers who churn, whereas having a one-year contract and not being a senior citizen are associated with those who do not.
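Proximity on a level map is a summary, so it is worth confirming a suspected association directly in the data. One simple check, sketched here in Python with made-up records, is to compare the churn rate among customers with a given level against the overall churn rate. The `churn_rate` helper and the counts are hypothetical.

```python
def churn_rate(records, variable=None, level=None):
    """Share of customers who churned, optionally restricted to one level."""
    subset = [r for r in records if variable is None or r[variable] == level]
    return sum(r["Churn"] == "Churn" for r in subset) / len(subset)


# Hypothetical toy records (not taken from the telco dataset)
records = (
    [{"Contract": "Month-to-month", "Churn": "Churn"}] * 6
    + [{"Contract": "Month-to-month", "Churn": "No Churn"}] * 4
    + [{"Contract": "One year", "Churn": "Churn"}] * 1
    + [{"Contract": "One year", "Churn": "No Churn"}] * 9
)

overall = churn_rate(records)
for level in ("Month-to-month", "One year"):
    rate = churn_rate(records, "Contract", level)
    # A ratio above 1 means the level is over-represented among churners
    print(f"{level:<15} churn rate = {rate:.2f}, ratio to overall = {rate / overall:.2f}")
```

A level that sits near "Churn" on the map should show a ratio above 1 in a check like this, and a level near "No Churn" a ratio below 1.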

    Using PCAmixdata

    We can generate the same kind of plot using the results of FAMD with PCAmixdata.

    ## Import library
    library(PCAmixdata)
    
    ## Split quantitative and qualitative variables
    split <- splitmix(df)
    
    ## PCA
    res.pcamix <- PCAmix(X.quanti=split$X.quanti,  
                         X.quali=split$X.quali, 
                         rename.level=TRUE, 
                         graph=FALSE, 
                         ndim=25)
    
    ## Add "Churn" as a supplementary variable
    res.sup <- supvar(res.pcamix,  
                      X.quanti.sup = NULL, 
                      X.quali.sup = df[19], 
                      rename.level=TRUE)
    
    ## Plotting
    plot(res.sup, 
         choice="levels", 
         main="Levels", 
         cex=0.6)

    Although PCAmixdata lets us generate this plot more directly from the FAMD results, the output is noticeably less readable and less visually appealing than the plots made using FactoMineR.

    Using prince

    If your analysis needs to be in Python, the prince package is the way to go. As far as I can see, there is not (yet) an option to create a level map directly from the results of FAMD. However, as mentioned above, we can use MCA to get the same information.

    ## Import library
    import pandas as pd
    import prince
    import matplotlib.pyplot as plt
    
    ## Import data
    df = pd.read_csv("cleaned_df.csv")  # adjust the path to wherever the R step above saved the file
    
    ## Drop the continuous columns, keeping their binned versions
    df.drop(['Tenure', 'MonthlyCharges'], axis=1, inplace=True)
    
    ## Instantiate MCA object
    mca = prince.MCA(
        n_components=2,
        n_iter=10,
        copy=True,
        check_input=True,
        engine='auto',
        random_state=42)
    
    ## Fit MCA object to qualitative data
    mca = mca.fit(df)
    
    ## Generate figure
    plt.figure()
    mca.plot_coordinates(df,
                         figsize=(14, 14),
                         show_row_points=False,
                         show_row_labels=False,
                         show_column_points=True,
                         column_points_size=3,
                         show_column_labels=True,
                         legend_n_cols=1)
    plt.gcf()

    Again, we see the same relationships between categorical variables. And again, the figure suffers in comparison with that of FactoMineR in terms of aesthetics.

    In summary, the MCA implementation in FactoMineR produces the most readable and informative representation of the relationships between categorical variables in a dataset. However, knowing that all three packages deliver very similar results, you have plenty of options depending on your workflow needs.

    In the next notebook, we will bring individual data points into the mix and see what insights we can gain there.

    Til then! :)