Calculate principal components of mixed-type data
Introduction
Many real-world datasets contain both continuous and categorical variables. Factor analysis of mixed data (FAMD) is a principal component method that combines principal component analysis (PCA) for continuous variables with multiple correspondence analysis (MCA) for categorical variables. To learn more about FAMD, see this excellent tutorial using the FactoMineR package.
In this series, I will use three well-established packages, two in R (FactoMineR and PCAmixdata) and one in Python (prince), to perform FAMD on the IBM Telco customer churn dataset and gain insights into the relationships between various aspects of customer behaviour. This is a toy example of how FAMD can be used to derive actionable business insights in the real world.
As an important note, the target variable "Churn" is set aside as a supplementary variable in each analysis below, so that it does not influence the calculation of the principal components.
Workflow in R
Using FactoMineR
FactoMineR provides a variety of functions for PCA, correspondence analysis (CA), multiple correspondence analysis (MCA) and FAMD. See the CRAN documentation for FactoMineR for details.
## Import libraries
library(FactoMineR)
library(factoextra)

## Import data
df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## FAMD
res.famd <- FAMD(df,
                 sup.var = 19,  ## Set the target variable "Churn" as a supplementary variable, so it is not included in the analysis for now
                 graph = FALSE,
                 ncp = 25)

## Inspect principal components
get_eigenvalue(res.famd)
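Since factoextra is already loaded, the eigenvalues can also be visualized as a scree plot. This is a minimal sketch using factoextra's fviz_screeplot(), assuming res.famd from the chunk above is in memory:

## Scree plot of the percentage of variance explained by each principal dimension
fviz_screeplot(res.famd, addlabels = TRUE)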
Using PCAmixdata
According to its authors, PCAmixdata is "dedicated to multivariate analysis of mixed data where observations are described by a mixture of numerical and categorical variables" (Chavent et al., 2017). As we will see in part 2 of this series, PCAmixdata provides a very useful function for performing a generalized form of varimax rotation that aids in interpreting the identified principal components (a preview is sketched after the code below).
## Import library
library(PCAmixdata)

## Split mixed dataset into quantitative and qualitative variables
## For now excluding the target variable "Churn", which will be added later as a supplementary variable
split <- splitmix(df[1:18])

## PCA of the mixed data
res.pcamix <- PCAmix(X.quanti = split$X.quanti,
                     X.quali = split$X.quali,
                     rename.level = TRUE,
                     graph = FALSE,
                     ndim = 25)

## Inspect principal components
res.pcamix$eig
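As a quick preview of part 2, the rotation mentioned above is performed with PCAmixdata's PCArot() function. A minimal sketch, assuming res.pcamix from the chunk above; rotating the first four components is an illustrative choice based on the eigenvalue cutoff discussed below:

## Apply the (generalized) varimax rotation to the first four principal components
res.rot <- PCArot(res.pcamix, dim = 4, graph = FALSE)

## Inspect the variance carried by the rotated components
res.rot$eig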
Similarly, to inspect the results in further detail, use the summary(res.pcamix) and print(res.pcamix) functions.
Thus far, we see that the results from FactoMineR and PCAmixdata are identical.
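This can be spot-checked programmatically. A minimal sketch, assuming both res.famd and res.pcamix from the chunks above are in memory:

## Compare the first few eigenvalues from the two packages side by side
head(get_eigenvalue(res.famd))
head(res.pcamix$eig)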
A little background: an eigenvalue > 1 indicates that a principal component (PC) accounts for more variance than any single one of the original variables (note that this holds true only when the data are standardized). This is commonly used as a cutoff for deciding which PCs to retain.
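Applying this cutoff to the FactoMineR output is a one-liner. A minimal sketch, assuming res.famd from above (get_eigenvalue() returns a data frame with an eigenvalue column):

## Keep only the principal dimensions with eigenvalue > 1
eig <- get_eigenvalue(res.famd)
eig[eig$eigenvalue > 1, ]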
Interestingly, only the first four PCs each account for more variance than any of the original variables, and together they explain only 46.7% of the total variance in the dataset. This suggests that the relationships between the variables are complex and likely non-linear.
Workflow in Python
Using prince
Like FactoMineR, the prince package has a collection of functions for a variety of analyses involving purely numerical, purely categorical or mixed-type datasets. The package uses a familiar scikit-learn-style API.
Unlike the two R packages above, prince does not seem to offer an option for adding supplementary variables to a fitted FAMD (a possible workaround is sketched after the code below).
For more detailed documentation, see the GitHub repo.
## Import libraries
import pandas as pd
import prince
import pprint

## Import data
df = pd.read_csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## Instantiate FAMD object
famd = prince.FAMD(n_components=25,
                   n_iter=10,
                   copy=True,
                   check_input=True,
                   engine='auto',   ## Can be 'auto', 'sklearn' or 'fbpca'
                   random_state=42)

## Fit FAMD object to data, excluding the target variable "Churn"
famd = famd.fit(df.drop('Churn', axis=1))

## Inspect principal dimensions
pp = pprint.PrettyPrinter()
pp.pprint(famd.explained_inertia_)
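Since the target variable cannot be declared as supplementary, one workaround is to drop "Churn" before fitting (as above) and then re-attach it to the row coordinates afterwards for visualization or downstream analysis. This is a minimal sketch, assuming the version of prince used here exposes the row_coordinates() method:

## Get the coordinates of each observation on the principal dimensions
coords = famd.row_coordinates(df.drop('Churn', axis=1))

## Re-attach the target variable for plotting/grouping, without it having influenced the FAMD
coords['Churn'] = df['Churn'].values
print(coords.head())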
Surprisingly, the results here differ greatly from those above. From my preliminary readings, "explained inertia" is synonymous with "explained variance", so terminology seems an unlikely cause of the discrepancy. I will keep digging, but as you will see in later parts of this series, FAMD performed using prince does reach nearly identical conclusions to those of the two R packages.
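To make the side-by-side comparison easier, prince's explained inertia fractions can be converted to cumulative percentages, matching the scale of the R outputs above. A minimal sketch, assuming the famd object from the previous chunk:

## Express explained inertia as cumulative percentages for comparison with the R results
import numpy as np

inertia = np.asarray(famd.explained_inertia_)
print(np.round(100 * inertia.cumsum(), 1))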
Parting notes
If you want to try this analysis yourself, click on "Remix" in the upper right corner to get a copy of the notebook in your own workspace. Please remember to import both the Python (famd_Python) and R (famd_R) runtimes from this notebook (intelrefinery/calculate-pc-mixed-data) under "Runtime Settings" to ensure that you have all the installed packages and can start right away.