Calculate principal components of mixed-type data
Introduction
Many real-world datasets contain both continuous and categorical variables. Factor analysis of mixed data (FAMD) is a principal component method that combines principal component analysis (PCA) for continuous variables with multiple correspondence analysis (MCA) for categorical variables. To learn more about FAMD, see this excellent tutorial using the FactoMineR package.
In this series, I will use three well-established packages, two in R (FactoMineR and PCAmixdata) and one in Python (prince), to perform FAMD on the IBM Telco customer churn dataset and gain insights into the relationships between various aspects of customer behaviour. This is a toy example of how FAMD can be used to derive actionable business insights in the real world.
As an important note, the target variable "Churn" is set aside as a supplementary variable in each analysis below, so that it does not influence the calculation of the principal components.
Workflow in R
Using FactoMineR
FactoMineR provides a variety of functions for PCA, correspondence analysis (CA), multiple correspondence analysis (MCA) and FAMD. See the CRAN documentation for FactoMineR for details.
## Import libraries
library(FactoMineR)
library(factoextra)

## Import data
df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## FAMD
res.famd <- FAMD(df,
                 sup.var = 19,  ## Set the target variable "Churn" as a supplementary variable, so it is not included in the analysis for now
                 graph = FALSE,
                 ncp = 25)

## Inspect principal components
get_eigenvalue(res.famd)
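Since factoextra is already loaded, the eigenvalues can also be visualized as a scree plot. This is a minimal sketch using factoextra's fviz_screeplot(), assuming res.famd from the chunk above is in memory:

## Scree plot of the percentage of variance explained by each principal dimension
fviz_screeplot(res.famd, addlabels = TRUE)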
Using PCAmixdata
According to its authors, PCAmixdata is "dedicated to multivariate analysis of mixed data where observations are described by a mixture of numerical and categorical variables" (Chavent et al., 2017). As we will see in part 2 of this series, PCAmixdata provides a very useful function for performing a generalized form of varimax rotation that aids in interpreting the identified principal components (a preview is sketched after the code below).
## Import library
library(PCAmixdata)

## Split mixed dataset into quantitative and qualitative variables
## For now excluding the target variable "Churn", which will be added later as a supplementary variable
split <- splitmix(df[1:18])

## PCA of the mixed data
res.pcamix <- PCAmix(X.quanti = split$X.quanti,
                     X.quali = split$X.quali,
                     rename.level = TRUE,
                     graph = FALSE,
                     ndim = 25)

## Inspect principal components
res.pcamix$eig
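As a quick preview of part 2, the rotation mentioned above is performed with PCAmixdata's PCArot() function. A minimal sketch, assuming res.pcamix from the chunk above; rotating the first four components is an illustrative choice based on the eigenvalue cutoff discussed below:

## Apply the (generalized) varimax rotation to the first four principal components
res.rot <- PCArot(res.pcamix, dim = 4, graph = FALSE)

## Inspect the variance carried by the rotated components
res.rot$eig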
Similarly, to inspect the results in further detail, use the summary(res.pcamix) and print(res.pcamix) functions.
Thus far, we see that the results from FactoMineR and PCAmixdata are identical.
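This can be spot-checked programmatically. A minimal sketch, assuming both res.famd and res.pcamix from the chunks above are in memory:

## Compare the first few eigenvalues from the two packages side by side
head(get_eigenvalue(res.famd))
head(res.pcamix$eig)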
A little background: an eigenvalue > 1 indicates that a principal component (PC) accounts for more variance than any single one of the original variables (note that this holds true only when the data are standardized). This is commonly used as a cutoff for deciding which PCs to retain.
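Applying this cutoff to the FactoMineR output is a one-liner. A minimal sketch, assuming res.famd from above (get_eigenvalue() returns a data frame with an eigenvalue column):

## Keep only the principal dimensions with eigenvalue > 1
eig <- get_eigenvalue(res.famd)
eig[eig$eigenvalue > 1, ]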
Interestingly, only the first four PCs each account for more variance than any of the original variables, and together they explain only 46.7% of the total variance in the dataset. This suggests that the relationships between the variables are complex and likely non-linear.
Workflow in Python
Using prince
Like FactoMineR, the prince package has a collection of functions for a variety of analyses involving purely numerical, purely categorical or mixed-type datasets. The package uses a familiar scikit-learn-style API.
Unlike the two R packages above, prince does not seem to offer an option for adding supplementary variables to a fitted FAMD (a possible workaround is sketched after the code below).
For more detailed documentation, see the GitHub repo.
## Import libraries
import pandas as pd
import prince
import pprint

## Import data
df = pd.read_csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## Instantiate FAMD object
famd = prince.FAMD(n_components=25,
                   n_iter=10,
                   copy=True,
                   check_input=True,
                   engine='auto',   ## Can be 'auto', 'sklearn' or 'fbpca'
                   random_state=42)

## Fit FAMD object to data, excluding the target variable "Churn"
famd = famd.fit(df.drop('Churn', axis=1))

## Inspect principal dimensions
pp = pprint.PrettyPrinter()
pp.pprint(famd.explained_inertia_)
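Since the target variable cannot be declared as supplementary, one workaround is to drop "Churn" before fitting (as above) and then re-attach it to the row coordinates afterwards for visualization or downstream analysis. This is a minimal sketch, assuming the version of prince used here exposes the row_coordinates() method:

## Get the coordinates of each observation on the principal dimensions
coords = famd.row_coordinates(df.drop('Churn', axis=1))

## Re-attach the target variable for plotting/grouping, without it having influenced the FAMD
coords['Churn'] = df['Churn'].values
print(coords.head())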
Surprisingly, the results here differ greatly from those above. From my preliminary readings, "explained inertia" is synonymous with "explained variance", so terminology seems an unlikely cause of the discrepancy. I will keep digging, but as you will see in later parts of this series, FAMD performed using prince does reach nearly identical conclusions to those of the two R packages.
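To make the side-by-side comparison easier, prince's explained inertia fractions can be converted to cumulative percentages, matching the scale of the R outputs above. A minimal sketch, assuming the famd object from the previous chunk:

## Express explained inertia as cumulative percentages for comparison with the R results
import numpy as np

inertia = np.asarray(famd.explained_inertia_)
print(np.round(100 * inertia.cumsum(), 1))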
Parting notes
If you want to try this analysis yourself, click on "Remix" in the upper right corner to get a copy of the notebook in your own workspace. Please remember to import both the Python (famd_Python) and R (famd_R) runtimes from this notebook (intelrefinery/calculate-pc-mixed-data) under "Runtime Settings" to ensure that you have all the installed packages and can start right away.