Calculate principal components of mixed-type data
Introduction
Many datasets that a data scientist encounters in the real world contain both numerical and categorical variables. Factor analysis of mixed data (FAMD) is a principal component method that combines principal component analysis (PCA) for continuous variables with multiple correspondence analysis (MCA) for categorical variables. To learn more about FAMD, see this excellent tutorial using the FactoMineR package.
In the next few posts of this series, I present two packages in R, FactoMineR (along with its visualization companion factoextra) and PCAmixdata, and one in Python, prince, for performing FAMD. Using the IBM Telco customer churn dataset as an example, we will explore how FAMD can be used to gain insights into the relationships between various aspects of customer behaviour and to derive actionable business insights regarding customer retention.
As an important note, if you want to try this yourself, click on "Remix" in the upper right corner to get a copy of the notebook in your own workspace. Please remember to change the Python and R runtimes to famd_Python and famd_R, respectively, to ensure that you have all the installed packages and can start right away.
Workflow in R
Using FactoMineR
FactoMineR provides a variety of functions for PCA, correspondence analysis (CA), multiple correspondence analysis (MCA) and FAMD.
```r
## Import libraries
library(FactoMineR)
library(factoextra)

## Import data
df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## FAMD
res.famd <- FAMD(df,
                 sup.var = 19,  ## Set the target variable "Churn" as a supplementary variable, so it is not included in the analysis for now
                 graph = FALSE,
                 ncp = 25)

## Inspect principal components
get_eigenvalue(res.famd)
```
We can visualize the individual data points along the first two principal components (PCs):
```r
## Import libraries
library(FactoMineR)
library(factoextra)

## FAMD
res.famd <- FAMD(df, sup.var = 19, graph = FALSE, ncp = 25)

## Plot individuals along the first two PCs
fviz_famd_ind(res.famd,
              col.ind = "#2eb135",
              label = "none",
              repel = TRUE) +
  xlim(-5, 5) + ylim(-4.5, 4.5)
```
Using PCAmixdata
According to its authors, PCAmixdata is "dedicated to multivariate analysis of mixed data where observations are described by a mixture of numerical and categorical variables" (Chavent et al., 2017). As we will see in part 2 of this series, PCAmixdata provides a very useful function for performing (a generalized form of) varimax rotation that aids in interpreting the identified principal components.
```r
## Import library
library(PCAmixdata)

## Split mixed dataset into quantitative and qualitative variables
## For now excluding the target variable "Churn", which will be added later as a supplementary variable
split <- splitmix(df[1:18])

## PCA
res.pcamix <- PCAmix(X.quanti = split$X.quanti,
                     X.quali = split$X.quali,
                     rename.level = TRUE,
                     graph = FALSE,
                     ndim = 25)

## Inspect principal components
res.pcamix$eig
```
Similarly, to inspect the results in further detail, use the summary(res.pcamix) and print(res.pcamix) functions.
Thus far, we see that the results from FactoMineR and PCAmixdata are identical.
A little background: an eigenvalue > 1 indicates that a principal component (PC) accounts for more variance than any single one of the original variables (note that this holds only when the data are standardized). This is commonly used as a cutoff for deciding which PCs to retain.
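As a concrete illustration of this cutoff, here is how the retention rule plays out on a made-up eigenvalue vector (these numbers are hypothetical, not the Telco results):

```python
import numpy as np

## Hypothetical eigenvalues from a PCA/FAMD fit, in descending order
eigenvalues = np.array([4.2, 2.1, 1.6, 1.05, 0.93, 0.60, 0.35])

## Kaiser criterion: keep components whose eigenvalue exceeds 1
n_keep = int(np.sum(eigenvalues > 1))

## Fraction of total variance explained by the retained components
explained = eigenvalues[:n_keep].sum() / eigenvalues.sum()
print(n_keep, round(explained, 3))
```

Here four components clear the cutoff, jointly explaining roughly 83% of the variance in this toy example; on real data, as we see below, the retained components can explain considerably less.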
Interestingly, only the first four PCs each account for more variance than any one of the original variables, and together they explain only 46.7% of the total variance in the dataset. This suggests that 1) the relationships between the variables may be non-linear and complex, and/or 2) some factors that drive variation in the data are not captured in this dataset.
Workflow in Python
Using prince
Like FactoMineR, the prince package has a collection of functions for a variety of analyses involving purely numerical, purely categorical or mixed-type datasets. Unlike the two R packages above, there does not appear to be an option for adding supplementary variables to a FAMD. Conveniently, prince uses the familiar scikit-learn API.
```python
## Import libraries
import pandas as pd
import prince
import pprint

## Import data
df = pd.read_csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## Instantiate FAMD object
famd = prince.FAMD(
    n_components=25,
    n_iter=10,
    copy=True,
    check_input=True,
    engine='auto',   ## Can be 'auto', 'sklearn' or 'fbpca'
    random_state=42)

## Fit FAMD object to data, excluding the target variable "Churn"
famd = famd.fit(df.drop('Churn', axis=1))

## Inspect principal dimensions
pp = pprint.PrettyPrinter()
pp.pprint(famd.explained_inertia_)
```
Surprisingly, the results here differ greatly from the ones above. From my preliminary reading, "explained inertia" is synonymous with "explained variance", so terminology seems an unlikely cause of the discrepancy. I will keep digging, but as you will see in later parts of this series, FAMD performed with prince reaches nearly the same conclusions as the two R packages.
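One way to put the outputs of different packages on common ground, whatever each one reports, is to normalize the raw eigenvalues into proportions of total inertia; "explained variance" and "explained inertia" are then directly comparable. A small sketch with hypothetical numbers:

```python
import numpy as np

## Hypothetical eigenvalues from any of the three packages
eigenvalues = np.array([4.2, 2.1, 1.6, 1.05])

## Explained inertia of each component is its eigenvalue over the total
explained_inertia = eigenvalues / eigenvalues.sum()
print(explained_inertia)
```

If two packages disagree even after this normalization, the difference lies in the decomposition itself (e.g. preprocessing or weighting), not in how the results are reported.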
Parting notes
In this post, we saw that calculating the PCs of a data set provides insights into how "informative" the variables are in understanding the data. Coming up next, we will explore the relationships between the PCs and the variables, and use that as a way of determining variable importance.