Relationship between variables
Introduction
In this notebook, we will use the results of factor analysis of mixed data (FAMD) to explore relationships between variables.
Import and pre-process data
## Import libraries
library(plyr)
library(dplyr)
library(arulesCBA)

## Import data
df <- read.csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv")

## Drop TotalCharges variable, as it is a product of MonthlyCharges and Tenure
df <- within(df, rm('TotalCharges'))

## Discretize "MonthlyCharges" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
df$Binned_MonthlyCharges <- discretizeDF.supervised(Churn ~ .,
                                                    df[, c('MonthlyCharges', 'Churn')],
                                                    method='mdlp')$MonthlyCharges

## Rename the levels based on knowledge of min/max monthly charges
df$Binned_MonthlyCharges <- revalue(df$Binned_MonthlyCharges,
                                    c("[-Inf,29.4)"="$0-29.4",
                                      "[29.4,56)"="$29.4-56",
                                      "[56,68.8)"="$56-68.8",
                                      "[68.8,107)"="$68.8-107",
                                      "[107, Inf]"="$107-118.75"))

## Discretize "Tenure" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
df$Binned_Tenure <- discretizeDF.supervised(Churn ~ .,
                                            df[, c('Tenure', 'Churn')],
                                            method='mdlp')$Tenure

## Rename the levels based on knowledge of min/max tenures
df$Binned_Tenure <- revalue(df$Binned_Tenure,
                            c("[-Inf,1.5)"="1-1.5m",
                              "[1.5,5.5)"="1.5-5.5m",
                              "[5.5,17.5)"="5.5-17.5m",
                              "[17.5,43.5)"="17.5-43.5m",
                              "[43.5,59.5)"="43.5-59.5m",
                              "[59.5,70.5)"="59.5-70.5m",
                              "[70.5, Inf]"="70.5-72m"))

## Export data to CSV for use in Python runtime
write.csv(df, "./results/cleaned_df.csv", row.names=FALSE)
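As a quick sanity check (not part of the original pipeline), we can cross-tabulate the new binned columns against the Churn label to confirm that the supervised discretization produced sensible groups:

## Cross-tabulate the supervised bins against the churn label
table(df$Binned_MonthlyCharges, df$Churn)
table(df$Binned_Tenure, df$Churn)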
Relationship between variables
Numerical variables
Correlation circles, which were introduced in part 2 of this series as a way to visualize how well numerical variables are represented by two given principal components (PCs), can also be used to examine relationships between those variables.
## Import libraries
library(FactoMineR)
library(factoextra)

## FAMD, with the supplementary variable in column 20
res.famd <- FAMD(df, sup.var = 20, graph = FALSE, ncp = 25)

## Plot the correlation circle of the numerical variables
fviz_famd_var(res.famd, "quanti.var", repel = TRUE, col.var = "black")
To interpret the correlation circle: positively correlated variables appear close together (in the same quadrant), whereas negatively correlated variables are positioned on opposite sides of the origin. So in this example, we see that Tenure and MonthlyCharges are negatively correlated to some degree. This makes intuitive sense, as customers with high monthly charges may not stay very long with the company.
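If you want the numbers behind the arrows, they are stored in the FAMD result object. Below is a minimal sketch (assuming the res.famd object created above) that prints the coordinates and cos2 values of the quantitative variables on the first two dimensions:

## Coordinates of the quantitative variables on the first two dimensions
## (these are what the correlation circle draws as arrows)
round(res.famd$quanti.var$coord[, 1:2], 3)

## cos2 values indicate how well each variable is represented by these dimensions
round(res.famd$quanti.var$cos2[, 1:2], 3)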
Finally, as a side note, the PCAmixdata package can produce the exact same plot from the FAMD results (a minimal sketch is shown below). One of the authors of the package has a fairly detailed tutorial here.
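For completeness, here is a minimal sketch of how that correlation circle can be drawn with PCAmixdata; it mirrors the PCAmix call used later in this notebook and assumes the same df as above.

## Import library
library(PCAmixdata)

## Split quantitative and qualitative variables
split <- splitmix(df)

## FAMD via PCAmix
res.pcamix <- PCAmix(X.quanti = split$X.quanti, X.quali = split$X.quali,
                     rename.level = TRUE, graph = FALSE, ndim = 25)

## Correlation circle of the numerical variables
plot(res.pcamix, choice = "cor", main = "Numerical variables")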
Categorical variables
In the squared loading plots shown in part 2 of this series, we could get a rough idea of the relationships between categorical variables. However, we can further visualize the relationships between the levels of categorical variables (including discretized continuous variables, such as MonthlyCharges and Tenure) in level maps. This gives us much more fine-grained insights: for example, "Senior Citizen" and "Not Senior Citizen" carry very different meanings, which are lost when lumped together into a single variable.
All three packages used in this series have some implementation for this type of visualization, as you will see in the comparison below.
Using FactoMineR
The FAMD implementation in the FactoMineR package somehow does not allow display of the supplementary variable (Churn in this case), which is key to interpreting the results. However, as FAMD essentially uses multiple correspondence analysis (MCA) to handle the categorical variables, we can perform MCA directly with FactoMineR to visualize relationships between various aspects of customer behaviour and our outcome of interest, customer churn.
I will plot results from both FAMD and MCA for the sake of comparison:
## Plot relationship between levels of categorical variables obtained from FAMD
fviz_famd_var(res.famd, "quali.var", col.var = "cos2",
              gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
              labelsize = 3, repel = TRUE) +
  xlim(-3, 3) + ylim(-2, 2)
## Plot relationship between levels of categorical variables obtained from MCA
res.mca <- MCA(df, quanti.sup = c(5, 18), quali.sup = 19, graph = FALSE)

fviz_mca_var(res.mca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             labelsize = 3, repel = TRUE) +
  xlim(-1.5, 1.5) + ylim(-1, 1)
We see that the two plots show similar relationships between the levels of all categorical variables, but the MCA results allow us to see where "Churn" and "No Churn" (in dark green) fall relative to the levels of other categorical variables. As variables that are closer together on the factor map are more closely related, we can glean that whereas having a month-to-month plan and paying by electronic cheque are associated with customers who churn, having a one-year contract and not being a senior citizen are associated with those who do not.
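To put rough numbers on these proximities, we can compare the coordinates of the supplementary "Churn"/"No Churn" levels with those of the active category levels. This is a minimal sketch assuming the res.mca object created above:

## Coordinates of the supplementary variable levels ("Churn"/"No Churn")
res.mca$quali.sup$coord[, 1:2]

## Coordinates of the active category levels, for comparison
head(res.mca$var$coord[, 1:2], 10)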
Using PCAmixdata
We can generate the same kind of plot using the results of FAMD with PCAmixdata.
## Import library
library(PCAmixdata)

## Split quantitative and qualitative variables
split <- splitmix(df)

## FAMD (PCA of mixed data)
res.pcamix <- PCAmix(X.quanti = split$X.quanti, X.quali = split$X.quali,
                     rename.level = TRUE, graph = FALSE, ndim = 25)

## Add "Churn" as a supplementary variable
res.sup <- supvar(res.pcamix, X.quanti.sup = NULL, X.quali.sup = df[19],
                  rename.level = TRUE)

## Plot the level map
plot(res.sup, choice = "levels", main = "Levels", cex = 0.6)
While we can generate the same plot more directly from the FAMD results (see the sketch below), it is quite a bit less readable and visually appealing than the plots made using FactoMineR.
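For reference, the more direct version simply plots the category levels straight from the res.pcamix object created above, without the supplementary "Churn" variable:

## Level map directly from the PCAmix result (no supplementary variable)
plot(res.pcamix, choice = "levels", main = "Levels", cex = 0.6)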
Using prince
If your analysis needs to be in Python, the prince package is the way to go. As far as I can see, there is not (yet) an option to create a level map directly from the results of FAMD. However, as mentioned above, we can use MCA to get the same information.
## Import libraries
import pandas as pd
import prince
import matplotlib.pyplot as plt

## Import data exported from the R runtime
df = pd.read_csv('./results/cleaned_df.csv')

## Drop the continuous columns, keeping their binned counterparts
df.drop(['Tenure', 'MonthlyCharges'], axis=1, inplace=True)

## Instantiate MCA object
mca = prince.MCA(
    n_components=2,
    n_iter=10,
    copy=True,
    check_input=True,
    engine='auto',
    random_state=42)

## Fit MCA object to the categorical data
mca = mca.fit(df)

## Generate figure
plt.figure()

mca.plot_coordinates(df,
                     figsize=(14, 14),
                     show_row_points=False,
                     show_row_labels=False,
                     show_column_points=True,
                     column_points_size=3,
                     show_column_labels=True,
                     legend_n_cols=1)

plt.gcf()
Again, we see the same relationships between categorical variables. And again, the figure suffers in comparison with that of FactoMineR in terms of aesthetics.
In summary, the MCA implementation in FactoMineR provides the most visually pleasing and informative representation of the relationships between categorical variables in a dataset. However, knowing that all three packages deliver very similar results, you have plenty of options depending on your workflow needs.
In the next notebook, we will bring individual data points into the mix and see what insights we can gain there.
Til then! :)