Relationship between variables

Introduction

In this notebook, we will use the results of factor analysis of mixed data (FAMD) to explore relationships between variables.

Import and pre-process data

## Import library
library(plyr)
library(dplyr)
library(arulesCBA)

## Import data
df <- read.csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv")

## Drop TotalCharges variable, as it is approximately the product of MonthlyCharges and Tenure
df <- within(df, rm('TotalCharges'))

## Discretize "MonthlyCharges" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
df$Binned_MonthlyCharges <- discretizeDF.supervised(Churn ~ ., df[, c('MonthlyCharges', 'Churn')], method='mdlp')$MonthlyCharges

## Rename the levels based on knowledge of min/max monthly charges
df$Binned_MonthlyCharges = revalue(df$Binned_MonthlyCharges, 
                                   c("[-Inf,29.4)"="$0-29.4", 
                                     "[29.4,56)"="$29.4-56", 
                                     "[56,68.8)"="$56-68.8", 
                                     "[68.8,107)"="$68.8-107", 
                                     "[107, Inf]" = "$107-118.75"))

## Discretize "Tenure" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
df$Binned_Tenure <- discretizeDF.supervised(Churn ~ ., 
                                            df[, c('Tenure', 'Churn')], 
                                            method='mdlp')$Tenure

## Rename the levels based on knowledge of min/max tenures
df$Binned_Tenure = revalue(df$Binned_Tenure, 
                           c("[-Inf,1.5)"="1-1.5m", 
                             "[1.5,5.5)"="1.5-5.5m",
                             "[5.5,17.5)"="5.5-17.5m",
                             "[17.5,43.5)"="17.5-43.5m",
                             "[43.5,59.5)"="43.5-59.5m",
                             "[59.5,70.5)"="59.5-70.5m",
                             "[70.5, Inf]"="70.5-72m"))

## Export data to CSV for use in Python runtime
write.csv(df, "./results/cleaned_df.csv", row.names=FALSE)
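
As a rough illustration of what supervised discretization is doing, the Python sketch below finds the single cut point that best separates two classes by minimising the size-weighted entropy of the resulting bins. This is only the core step: the MDLP method used above applies it recursively and adds a stopping criterion, both omitted here. The function names and the toy tenure values are hypothetical and are not taken from the dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def best_cut(values, labels):
    """Return (cut point, score) minimising the size-weighted entropy of the two bins."""
    pairs = sorted(zip(values, labels))
    best_point, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary can be placed between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            # Place the boundary halfway between the two neighbouring values
            best_point, best_score = (lo + hi) / 2, score
    return best_point, best_score


# Hypothetical toy data: short-tenure customers churn, long-tenure ones stay
tenure = [1, 2, 3, 4, 30, 40, 50, 60]
churn = ["Churn", "Churn", "Churn", "No Churn",
         "No Churn", "No Churn", "No Churn", "No Churn"]

cut, score = best_cut(tenure, churn)
print("Best cut point:", cut)
```

On this toy data the chosen boundary falls between the last churner and the first non-churner, which is the same kind of label-aware boundary that appears in the bin names above.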

Relationship between variables

Numerical variables

Correlation circles, which were introduced in part 2 of this series as a way to visualize how well numerical variables are represented by two given principal components (PCs), can also be used to examine relationships between those variables.

## Import libraries
library(FactoMineR)
library(factoextra)

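## FAMD, with "Churn" as the supplementary variable
## (looked up by name; it is column 19 in the MCA and PCAmixdata calls further down)
res.famd <- FAMD(df, 
                 sup.var = which(names(df) == "Churn"), 
                 graph = FALSE, 
                 ncp=25)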

## Plotting
fviz_famd_var(res.famd, 
              "quanti.var", 
              repel = TRUE,
              col.var = "black")

To interpret the correlation circle: positively correlated variables are grouped together, appearing in the same quadrant, whereas negatively correlated variables are positioned on opposite sides of the origin. In this example, Tenure and MonthlyCharges appear to be negatively correlated to some degree. This makes some intuitive sense, as customers with high monthly charges may not stay with the company for very long.
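
To make the geometry a bit more concrete, here is a small Python sketch of the simplest possible case: a PCA on just two standardized variables, where the positions on the correlation circle can be written down in closed form. It is not the FAMD computation itself (which also involves all the categorical variables), and the toy values and function names are hypothetical. It only shows that the cosine of the angle between the two arrows equals their Pearson correlation, so a negative correlation puts them on opposite sides of the first axis.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def circle_coords(r):
    """Correlation-circle coordinates of two standardized variables with correlation r.

    The correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + |r| (first axis)
    and 1 - |r| (second axis). Each coordinate is the correlation between a
    variable and an axis. The overall sign of an axis is arbitrary.
    """
    big = math.sqrt((1 + abs(r)) / 2)    # magnitude along the first axis
    small = math.sqrt((1 - abs(r)) / 2)  # magnitude along the second axis
    if r >= 0:
        return (big, small), (big, -small)   # same side of the first axis
    return (big, small), (-big, small)       # opposite sides of the first axis


# Hypothetical toy data: monthly charges fall as tenure grows
tenure = [1, 5, 12, 24, 48, 72]
monthly = [95, 90, 70, 75, 40, 30]

r = pearson(tenure, monthly)
v1, v2 = circle_coords(r)

# Both arrows have length 1 here, so their dot product is the cosine of the angle
cosine = v1[0] * v2[0] + v1[1] * v2[1]
print("correlation:", round(r, 3), "cosine of angle:", round(cosine, 3))
```

With more than two variables the arrows are generally shorter than the unit circle, and the angle is only a reliable guide to correlation for variables that are well represented on the two plotted dimensions.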

Finally, as a side note, the PCAmixdata package can produce the exact same plot from the FAMD results. One of the authors of the package has written a fairly detailed tutorial.

Categorical variables

In the squared loading plots shown in part 2 of this series, we could get a rough idea of the relationships between categorical variables. However, we can further visualize the relationships between the levels of categorical variables (including discretized continuous variables, such as MonthlyCharges and Tenure) in level maps. This gives much more fine-grained insight: for example, "Senior Citizen" and "Not Senior Citizen" carry very different meanings, which are lost when the two are lumped together into a single variable.

All three packages used in this series have some implementation for this type of visualization, as you will see in the comparison below.

Using `FactoMineR`

The FAMD implementation in the FactoMineR package does not appear to allow display of the supplementary variable (Churn in this case), which is key to interpreting the results. However, as FAMD essentially uses multiple correspondence analysis (MCA) to handle the categorical variables, we can directly perform MCA using FactoMineR to visualize relationships between various aspects of customer behaviour and our outcome of interest, customer churn.
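
For readers wondering why MCA gives one point per level rather than one per variable: it works on an indicator (one-hot) table in which every level of every categorical variable gets its own 0/1 column. The Python sketch below builds such a table for a few made-up customers. The helper name and the rows are hypothetical, and the later steps of MCA (weighting and decomposition of this table) are not shown.

```python
def indicator_matrix(rows):
    """One-hot encode a list of dicts (one per customer) into a 0/1 table.

    Returns the (variable, level) pair behind each column, plus the table itself.
    """
    columns = sorted({(var, level) for row in rows for var, level in row.items()})
    matrix = [
        [1 if row.get(var) == level else 0 for var, level in columns]
        for row in rows
    ]
    return columns, matrix


# Hypothetical customers, using level names of the kind seen in this notebook
customers = [
    {"Contract": "Month-to-month", "SeniorCitizen": "Senior Citizen"},
    {"Contract": "One year", "SeniorCitizen": "Not Senior Citizen"},
    {"Contract": "Month-to-month", "SeniorCitizen": "Not Senior Citizen"},
]

columns, matrix = indicator_matrix(customers)

# Print one line per level, showing which customers have it
for (var, level), col in zip(columns, zip(*matrix)):
    print(f"{var} = {level}: {col}")
```

Each row contains exactly one 1 per original variable, and each column of the table corresponds to one labelled point on the level maps.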

I will plot results from both FAMD and MCA for the sake of comparison:

## Plot relationship between levels of categorical variables obtained from FAMD
fviz_famd_var(res.famd, "quali.var", col.var = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             labelsize = 3,
             repel=TRUE) +
             xlim(-3, 3) + ylim (-2, 2)

## Plot relationship between levels of categorical variables obtained from MCA
res.mca <- MCA(df, quanti.sup=c(5, 18), quali.sup=19, graph = FALSE)

fviz_mca_var(res.mca, col.var = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             labelsize = 3, 
             repel=TRUE) +
             xlim(-1.5, 1.5) + ylim (-1, 1)

We see that the two plots show similar relationships between the levels of all categorical variables, but the MCA results also let us see where "Churn" and "No Churn" (in dark green) fall relative to the levels of the other categorical variables. As levels that are closer together on the factor map are more closely related, we can glean that having a month-to-month plan and paying by electronic cheque are associated with customers who churn, whereas having a one-year contract and not being a senior citizen are associated with those who do not.
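
One simple way to sanity-check a reading of the level map is to compare churn rates across the levels of a variable directly. The Python sketch below does this with a plain counter. The helper name and the toy labels are hypothetical and are not results from the dataset.

```python
from collections import Counter


def churn_rate_by_level(levels, churn, positive="Churn"):
    """Share of customers labelled `positive` within each level."""
    totals = Counter(levels)
    churned = Counter(lv for lv, c in zip(levels, churn) if c == positive)
    return {lv: churned[lv] / totals[lv] for lv in totals}


# Hypothetical toy data: eight customers, two contract types
contract = ["Month-to-month"] * 4 + ["One year"] * 4
churn = ["Churn", "Churn", "Churn", "No Churn",
         "No Churn", "No Churn", "No Churn", "Churn"]

rates = churn_rate_by_level(contract, churn)
for level, rate in rates.items():
    print(f"{level}: {rate:.0%} churned")
```

A level that sits close to "Churn" on the map would be expected to show a noticeably higher rate here than a level that sits close to "No Churn".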

Using `PCAmixdata`

We can generate the same kind of plot using the results of FAMD with PCAmixdata.

## Import library
library(PCAmixdata)

## Split quantitative and qualitative variables
split <- splitmix(df)

## PCA
res.pcamix <- PCAmix(X.quanti=split$X.quanti,  
                     X.quali=split$X.quali, 
                     rename.level=TRUE, 
                     graph=FALSE, 
                     ndim=25)

## Add "Churn" as a supplementary variable
res.sup <- supvar(res.pcamix,  
                  X.quanti.sup = NULL, 
                  X.quali.sup = df[19], 
                  rename.level=TRUE)

## Plotting
plot(res.sup, 
     choice="levels", 
     main="Levels", 
     cex=0.6)

While we can generate the same plot more directly from the FAMD results, the result is quite a bit less readable and visually appealing than the plots made using FactoMineR.

Using `prince`

If your analysis needs to be in Python, the prince package is the way to go. As far as I can see, there is not (yet) an option to create a level map directly from the results of FAMD. However, as mentioned above, we can use MCA to get the same information.

## Import library
import pandas as pd
import prince
import matplotlib.pyplot as plt

## Import data
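## (path as written by the R export step above)
df = pd.read_csv("./results/cleaned_df.csv")

## Drop the continuous variables; their binned versions remain in the dataframe
df.drop(['Tenure', 'MonthlyCharges'], axis=1, inplace=True)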

## Instantiate MCA object
mca = prince.MCA(
     n_components=2,
     n_iter=10,
     copy=True,
     check_input=True,
     engine='auto',
     random_state=42)

## Fit MCA object to the categorical data
mca = mca.fit(df)

## Generate figure
plt.figure()

mca.plot_coordinates(df,
                     figsize=(14, 14),
                     show_row_points=False,
                     show_row_labels=False,
                     show_column_points=True,
                     column_points_size=3,
                     show_column_labels=True,
                     legend_n_cols=1)

plt.gcf()

Again, we see the same relationships between categorical variables. And again, the figure suffers in comparison with that of FactoMineR in terms of aesthetics.

In summary, the MCA implementation in FactoMineR provides the most visually pleasing and informative representation of the relationships between categorical variables in a dataset. However, knowing that all three packages deliver very similar results, you have plenty of options depending on your workflow needs.

In the next notebook, we will bring individual data points into the mix and see what insights we can gain there.

Til then! :)
