Calculate principal components of mixed-type data

Introduction

Many datasets that a data scientist encounters in the real world contain both numerical and categorical variables. Factor analysis of mixed data (FAMD) is a principal component method that combines principal component analysis (PCA) for continuous variables and multiple correspondence analysis (MCA) for categorical variables. To learn more about FAMD, see this excellent tutorial using the FactoMineR package.

In the next few posts of this series, I present two packages in R, FactoMineR (along with its companion visualization package factoextra) and PCAmixdata, and one in Python, prince, for performing FAMD. Using the IBM Telco customer churn dataset as an example, we will explore how FAMD can be used to gain insights into the relationships between various aspects of customer behaviour and to derive actionable business insights for customer retention.

As an important note, standardization of the numeric variables is critical for FAMD and PCA. The three packages introduced here do this automatically, so I will not do it beforehand. However, if you will be using other methods, it is a necessary preprocessing step, as sketched below.
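
For reference, here is a minimal sketch of standardizing the numeric columns by hand with base R's scale() function; this is only an illustration in case your method of choice does not standardize for you, not part of the workflow below:

## Standardize numeric columns manually (only needed if your method does not do it for you)
df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

num_cols <- sapply(df, is.numeric)   ## Identify the numeric columns

df[num_cols] <- scale(df[num_cols])  ## Centre to mean 0 and scale to unit variance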

If you want to try this yourself, click "Remix" in the upper right corner to get a copy of the notebook in your own workspace. Remember to change the Python and R runtimes of this notebook to famd_Python and famd_R, respectively, to ensure that all required packages are installed and you can start right away.

Workflow in R

Using FactoMineR

FactoMineR provides a variety of functions for PCA, correspondence analysis (CA), multiple correspondence analysis (MCA) and FAMD.

## Import libraries
library(FactoMineR)
library(factoextra)

## Import data
df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## FAMD
res.famd <- FAMD(df, 
                 sup.var = 19,  ## Set the target variable "Churn" as a supplementary variable, so it is not included in the analysis for now
                 graph = FALSE, 
                 ncp = 25)

## Inspect principal components
get_eigenvalue(res.famd)
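
To complement the eigenvalue table, a scree plot gives a quick visual of how much variance each PC captures. A minimal sketch using factoextra, assuming res.famd from above:

## Scree plot of the percentage of variance explained by each PC
fviz_screeplot(res.famd, addlabels = TRUE)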

We can visualize the individual data points along the first two principal components (PCs):

fviz_famd_ind(res.famd, 
              col.ind = "#2eb135", 
              label = "none", 
              repel = TRUE) + 
  xlim(-5, 5) + ylim(-4.5, 4.5)
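
Since "Churn" was kept as a supplementary variable, it can also be used to colour the individuals; a sketch assuming the habillage argument accepts the variable name, as it does for fviz_mca_ind():

## Colour individuals by the supplementary variable "Churn" to see
## whether churned customers separate along the first two PCs
fviz_famd_ind(res.famd, 
              habillage = "Churn", 
              addEllipses = TRUE, 
              label = "none", 
              repel = TRUE)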

Using PCAmixdata

According to its authors, PCAmixdata is "dedicated to multivariate analysis of mixed data where observations are described by a mixture of numerical and categorical variables" (Chavent et al., 2017). As we will see in part 2 of this series, PCAmixdata provides a very useful function for performing (a generalized form of) varimax rotation that aids in interpreting the principal components identified.

## Import library
library(PCAmixdata)

## Split mixed dataset into quantitative and qualitative variables
## For now excluding the target variable "Churn", which will be added later as a supplementary variable
split <- splitmix(df[1:18])  

## PCA
res.pcamix <- PCAmix(X.quanti=split$X.quanti,  
                     X.quali=split$X.quali, 
                     rename.level=TRUE, 
                     graph=FALSE, 
                     ndim=25)

## Inspect principal components
res.pcamix$eig
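
PCAmixdata also ships its own plot() method for PCAmix objects; a brief sketch of the individuals map (the choice argument can instead take "cor", "levels" or "sqload" to plot the variables):

## Map of the individuals on the first two PCs
plot(res.pcamix, choice = "ind", label = FALSE, 
     main = "Individuals")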

Similarly, to inspect the results in further detail, call summary(res.pcamix) and print(res.pcamix).

Thus far, we see that the results from FactoMineR and PCAmixdata are identical.

A little background: an eigenvalue > 1 indicates that a principal component (PC) accounts for more variance than any single one of the original variables (note that this holds only when the data are standardized). This is commonly used as a cutoff for deciding which PCs to retain.
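
Applying this rule of thumb to the PCAmixdata results takes one line; a small sketch assuming the first column of res.pcamix$eig holds the eigenvalues:

## Keep only the PCs whose eigenvalue exceeds 1
eig <- res.pcamix$eig

eig[eig[, 1] > 1, ]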

Interestingly, only the first four PCs each account for more variance than a single original variable, and together they explain only 46.7% of the total variance in the data set. This suggests that 1) the patterns among the variables may be non-linear and complex, and/or 2) some factors that drive variation in the data are not captured in this data set.

Workflow in Python

Using prince

Like FactoMineR, the prince package offers a collection of functions for a variety of analyses involving purely numerical/categorical or mixed-type datasets. Unlike the two R packages above, there does not seem to be an option for adding supplementary variables to a fitted FAMD.

prince uses the familiar scikit-learn-style API.

## Import libraries
import pandas as pd
import prince
import pprint

## Import data
df = pd.read_csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## Instantiate FAMD object
famd = prince.FAMD(
     n_components=25,
     n_iter=10,
     copy=True,
     check_input=True,
     engine='auto',       ## Can be 'auto', 'sklearn' or 'fbpca'
     random_state=42)

## Fit FAMD object to data 
famd = famd.fit(df.drop('Churn', axis=1)) ## Exclude target variable "Churn"

## Inspect principal dimensions
pp = pprint.PrettyPrinter()
pp.pprint(famd.explained_inertia_) 

Surprisingly, the results here differ greatly from the ones above. From my preliminary reading, "explained inertia" is synonymous with "explained variance", so terminology seems an unlikely cause of the discrepancy. I will keep digging, but as you will see in later parts of this series, FAMD performed using prince reaches nearly identical conclusions to those of the two R packages.

Parting notes

In this post, we saw that calculating the PCs of a data set provides insights into how "informative" the variables are in understanding the data. Coming up next, we will explore the relationships between the PCs and the variables, and use that as a way of determining variable importance.