Factor loading
Introduction
A key use of principal component methods is examining the relationships between variables and principal components, which helps identify the variables that matter most in describing a dataset.
Factor loading and squared loading
The factor loading of a variable describes the correlation, i.e. information shared, between it and a given principal component (PC).
By squaring the factor loading for a variable, we get its squared loading (also called squared cosine or cos2). This measures the proportion of variance in a variable that is captured by a PC. So, for each variable, the sum of its squared loadings across all PCs equals 1.
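The sum-to-one property can be checked numerically. The following is a minimal sketch in base R, assuming the built-in USArrests dataset as a stand-in for the Telco data and using prcomp rather than the packages used elsewhere in this notebook:

```r
## Standardized PCA on a built-in dataset (USArrests: four numeric variables)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

## Factor loadings: correlation between each variable and each PC's scores
loadings <- cor(USArrests, pca$x)

## Squared loadings (cos2): share of each variable's variance captured by each PC
cos2 <- loadings^2

## Summed across all PCs, each variable's squared loadings should equal 1
round(rowSums(cos2), 6)
```

This holds only when every PC is kept; with fewer PCs the row sums fall below 1.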
One way to depict this relationship is with a correlation circle, which plots variables using their loadings on PC1 and PC2 (or any other two chosen PCs) as coordinates. It is very useful for illustrating some key concepts in interpreting PCA results.
Note that only quantitative variables can be depicted in correlation circles. Here is an example using the Telco dataset, in which MonthlyCharges and Tenure are the two quantitative variables (after removing the TotalCharges variable, which is essentially a product of the two):
```r
## Import library
library(PCAmixdata)

## Import data
df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## Drop the TotalCharges variable, as it is a product of MonthlyCharges and Tenure
df <- within(df, rm('TotalCharges'))

## Split quantitative and qualitative variables
split <- splitmix(df)

## FAMD
res.pcamix <- PCAmix(X.quanti = split$X.quanti,
                     X.quali = split$X.quali,
                     rename.level = TRUE,
                     graph = FALSE,
                     ndim = 25)

## Plotting
plot(res.pcamix, choice = "cor", main = "Numerical variables", cex = 0.6)
```
To interpret this figure, recall that the sum of squared loadings for a given variable across all PCs equals 1. So if a given variable can be perfectly represented by only the two PCs plotted, then:
$$\text{loading}_{PC1}^2 + \text{loading}_{PC2}^2 = 1$$
When plotted using the factor loading on each PC as coordinates in a Cartesian grid, this is the same as the equation of a unit circle centred on the origin:
$$x^2 + y^2 = 1$$
The circle in the plot has a radius of 1, meaning that the projection endpoint for any such variable would be positioned on the circle. In the figure above, we see that PC1 and PC2 together do a pretty good job of capturing the information contained in the MonthlyCharges variable, as its endpoint is very close to the circle. Conversely, if more PCs are needed to capture the information contained in a variable, then the length of its projection would be less than 1 and the endpoint would be positioned inside the circle. The projection for the Tenure variable lies closer to the origin than that of MonthlyCharges, indicating that more than PC1 and PC2 are needed to completely represent the information it contains. Therefore, the closer a variable's endpoint is to the circle, the more important that variable is to interpreting the PCs involved.
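The link between arrow length and quality of representation can be made concrete with a short sketch in base R. It assumes the built-in USArrests dataset and prcomp instead of the Telco data and PCAmix, so the numbers will not match the figure above:

```r
## Standardized PCA and factor loadings (variable-to-PC correlations)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
loadings <- cor(USArrests, pca$x)

## Coordinates of each variable in the PC1-PC2 correlation circle
coords <- loadings[, c("PC1", "PC2")]

## Arrow length = sqrt(loading_PC1^2 + loading_PC2^2); it never exceeds 1
arrow_length <- sqrt(rowSums(coords^2))
sort(arrow_length, decreasing = TRUE)

## Draw the unit circle and the variable arrows with base graphics
theta <- seq(0, 2 * pi, length.out = 200)
plot(cos(theta), sin(theta), type = "l", asp = 1, xlab = "PC1", ylab = "PC2")
abline(h = 0, v = 0, lty = 2)
arrows(0, 0, coords[, 1], coords[, 2], length = 0.1)
text(coords, labels = rownames(coords), pos = 3, cex = 0.8)
```

Variables whose arrow length is close to 1 are almost fully described by the two plotted PCs, while shorter arrows indicate that part of the variable's variance sits on the remaining PCs.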
To visualize qualitative and quantitative variables together in the same principal subspace, PCAmixdata offers an implementation called the "squared loading plot". This has the added benefit of allowing me to include the Churn variable as a supplementary variable, thereby seeing its relationship with other variables without including it in the original analysis. This is useful as most downstream analyses would try to predict Churn. The code below draws this type of plot with FAMD from FactoMineR and fviz_famd_var from factoextra.
```r
## Import libraries
library(FactoMineR)
library(factoextra)

## FAMD, with column 19 (Churn) as a supplementary variable
res.famd <- FAMD(df, sup.var = 19, graph = FALSE, ncp = 25)

## Visualization
p <- fviz_famd_var(res.famd, 'var',
                   axes = c(1, 2),
                   labelsize = 3,
                   col.var = 'cos2',  ## Colour variables by their squared loading
                   gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
                   repel = TRUE) +
  xlim(-0.1, 0.85) + ylim(-0.1, 0.85)

## Add the supplementary variable, Churn, to the plot
fviz_add(p, res.famd$var$coord.sup,
         geom = c("arrow", "text"),
         labelsize = 3,
         color = "blue",
         repel = TRUE)
```
We see several interesting things in the figure above:
- Consistent with what we saw in the correlation circle, MonthlyCharges is more closely correlated with PC1 than with PC2, whereas Tenure is described by a more even combination of PC1 and PC2.
- Being furthest from the origin, the variables Contract, InternetService and MonthlyCharges have the highest squared loading values, and so are more important in explaining the variance captured by PC1 and PC2 than variables clustered near the origin, such as Gender, PhoneService and SeniorCitizen.
- The variable of interest, Churn, lies on the y-axis, indicating that its squared loading on PC1 is close to zero. Of the two PCs plotted, only PC2 captures variation in this variable, and the share it captures is given by the variable's height on the y-axis.
As a sidenote, unlike correlation circles, this plot depicts only positive values on the x- and y-axes. According to the authors of the package, the coordinates are to be interpreted as measuring "the links (signless) between variables and principal components". In other words, the coordinates of each variable are its squared loadings, which cannot be negative: they show how strongly a variable is linked to a PC, but not the direction of that link.
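To see why such coordinates are signless, here is a small base R sketch. It assumes, following the PCAmixdata description of squared loadings, that a quantitative variable is measured by its squared correlation with a PC and a qualitative variable by its correlation ratio (eta squared). The iris dataset and the helper eta2 are illustrative choices, not part of the original analysis, and this does not reproduce the internals of either package:

```r
## PCA on the four numeric iris measurements; Species is kept aside,
## playing the role of a supplementary qualitative variable
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

## Quantitative variable: squared correlation with each PC
sq_quanti <- cor(iris$Sepal.Length, scores)[1, ]^2

## Qualitative variable: correlation ratio (eta squared), i.e. the share of
## each PC's variance explained by the Species groups (hypothetical helper)
eta2 <- function(score, groups) summary(lm(score ~ groups))$r.squared
sq_quali <- apply(scores, 2, eta2, groups = iris$Species)

## Both measures lie between 0 and 1, so there is no negative half-plane
rbind(Sepal.Length = sq_quanti, Species = sq_quali)
```

Because both quantities are proportions of variance, they say how strongly a variable is linked to a PC but not in which direction.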
Parting notes
In the next notebook, we will learn how principal components can be rotated to facilitate interpretation of the relationship between variables and PCs.