Factor loading

Introduction

One of the great uses of principal component methods is for examining the relationships between variables and principal components, thereby identifying those that are the most important in describing a dataset.

Factor loading and squared loading

The factor loading of a variable describes the correlation, i.e. information shared, between it and a given principal component (PC).

By squaring the factor loading for a variable, we get its squared loading (also called squared cosine or cos2). This measures the proportion of variance in the variable that is captured by a PC. So, for each variable, the sum of its squared loadings across all PCs equals 1.
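To make these definitions concrete, here is a minimal sketch in base R. It uses the built-in USArrests data and `prcomp()` rather than the Telco dataset and PCAmixdata, so the numbers are purely illustrative; the object names (`loadings`, `cos2`) are my own.

```r
## Illustrative only: standardized PCA on the built-in USArrests data
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

## Factor loading = correlation between each variable and each PC's scores
loadings <- cor(USArrests, pca$x)

## Equivalent route: eigenvectors scaled by the standard deviation of each PC
loadings_alt <- sweep(pca$rotation, 2, pca$sdev, `*`)

## Squared loadings (cos2): share of each variable's variance captured by each PC
cos2 <- loadings^2

## Summed across all PCs, each variable's squared loadings add up to 1
rowSums(cos2)
```

The two routes to the loadings agree only because the variables were standardized (`scale. = TRUE`); on unscaled data the scaled eigenvectors are covariances rather than correlations.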

One way to depict this relationship is with a correlation circle, which plots variables using their loadings for PC1 and PC2 (or any other two PCs chosen) as coordinates. Correlation circles are very useful for illustrating some key concepts in interpreting PCA results.

Note that only quantitative variables can be depicted in correlation circles. Here is an example using the Telco dataset, in which MonthlyCharges and Tenure are the two quantitative variables (after removing the TotalCharges variable that is essentially a product of the two):

## Import library
library(PCAmixdata)

## Import data (stringsAsFactors = TRUE keeps text columns as factors on R >= 4.0)
df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv',
               stringsAsFactors = TRUE)

## Drop the TotalCharges variable, as it is a product of MonthlyCharges and Tenure
df <- within(df, rm('TotalCharges'))

## Split quantitative and qualitative variables
split <- splitmix(df)

## PCA of mixed data (PCAmix)
res.pcamix <- PCAmix(X.quanti=split$X.quanti,  
                     X.quali=split$X.quali, 
                     rename.level=TRUE, 
                     graph=FALSE, 
                     ndim=25)

## Plotting
plot(res.pcamix,
     choice="cor", 
     main="Numerical variables", 
     cex=0.6)

To interpret this figure, recall that the sum of squared loadings for a given variable across all PCs equals 1. So if a given variable can be perfectly represented by only the two PCs plotted, then:

$$(factor\ loading_{PC1})^2 + (factor\ loading_{PC2})^2 = 1$$

When plotted using the factor loading on each PC as coordinates in a Cartesian grid, this is the same as the equation of a circle of radius 1 centred on the origin:

$$x^2 + y^2 = 1$$

The circle in the plot has a radius of 1, meaning that the projection endpoint for any such variable would be positioned on the circle. In the figure above, we see that PC1 and PC2 together do a pretty good job of capturing the information contained in the MonthlyCharges variable, as its endpoint is very close to the circle. Conversely, if more PCs are needed to capture the information contained in a variable, then the length of its projection would be less than 1 and the endpoint would be positioned inside the circle. The projection for the Tenure variable lies closer to the origin than that of MonthlyCharges, indicating that more than PC1 and PC2 are needed to completely represent the information it contains. Therefore, the closer a variable is to the circle, the more important it is to interpreting the PCs involved.
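The geometry described above can be sketched by hand in base R. This is not how PCAmixdata draws its figure; it is a rough reconstruction on the built-in USArrests data, with `arrow_length` being a name I made up for the distance of each variable from the origin in the PC1-PC2 plane.

```r
## Illustrative only: standardized PCA on the built-in USArrests data
pca <- prcomp(USArrests, scale. = TRUE)
loadings <- cor(USArrests, pca$x)

## Distance from the origin in the PC1-PC2 plane; it reaches 1 only if
## PC1 and PC2 together capture all of the variable's variance
arrow_length <- sqrt(loadings[, "PC1"]^2 + loadings[, "PC2"]^2)
round(arrow_length, 3)

## Draw the unit circle, then one arrow per variable
theta <- seq(0, 2 * pi, length.out = 200)
plot(cos(theta), sin(theta), type = "l", asp = 1,
     xlab = "PC1", ylab = "PC2", main = "Correlation circle (sketch)")
abline(h = 0, v = 0, lty = 3)
arrows(0, 0, loadings[, "PC1"], loadings[, "PC2"], length = 0.08)
text(loadings[, "PC1"], loadings[, "PC2"],
     labels = rownames(loadings), pos = 3, cex = 0.7)
```

Variables whose `arrow_length` is well below 1 are the ones that need further PCs to be represented fully.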

To visualize qualitative and quantitative variables together in the same principal subspace, PCAmixdata offers an implementation called the "squared loading plot"; the code below draws this kind of plot with FactoMineR and factoextra. This has the added benefit of allowing me to include the Churn variable as a supplementary variable, thereby seeing its relationship with other variables without including it in the original analysis. This is useful as most downstream analyses would try to predict Churn.

## Import libraries
library(FactoMineR)
library(factoextra)

## FAMD, with column 19 (Churn) as a supplementary variable
res.famd <- FAMD(df, 
                 sup.var = 19, 
                 graph = FALSE, 
                 ncp=25)

## Visualization
p <- fviz_famd_var(res.famd, 'var', axes = c(1, 2),
                     labelsize = 3,
                     col.var = 'cos2',   ## Colour variables by their squared loading
                     gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
                     repel=TRUE) +
                     xlim(-0.1, 0.85) + ylim (-0.1, 0.85)

## Add the supplementary variable, Churn, to the plot
fviz_add(p, 
         res.famd$var$coord.sup,
         geom = c("arrow", "text"),
         labelsize = 3,
         col.var = 'cos2',
         color = "blue", 
         repel=TRUE)

We see several interesting things in the figure above:

  • Consistent with what we saw in the correlation circle, MonthlyCharges is more closely correlated with PC1 than with PC2, whereas Tenure is described by a more even combination of PC1 and PC2
  • Being furthest from the origin, the variables Contract, InternetService and MonthlyCharges have the highest squared loading values and so are more important in explaining the variance captured by PC1 and PC2 than variables clustered near the origin, such as Gender, PhoneService, and SeniorCitizen
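  • The variable of interest, Churn, overlaps with the y-axis, indicating that, of the two PCs plotted, only PC2 shares information with this variable; how much of its variation PC2 captures is given by how far along the y-axis it lies, and PC2 would capture all of it only at a value of 1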
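The idea behind treating Churn as a supplementary variable can be sketched in base R: fit the PCA on the active variables only, then relate the held-out variable to the resulting PC scores. The example below uses the built-in USArrests data with UrbanPop arbitrarily chosen as the held-out variable; it is not what `FAMD()` does internally, just the same principle in miniature.

```r
## Active variables: everything except UrbanPop (an arbitrary choice for illustration)
active <- USArrests[, c("Murder", "Assault", "Rape")]
pca <- prcomp(active, scale. = TRUE)

## Supplementary variable: correlated with the PC scores after the fact,
## so it has no influence on how the PCs were constructed
sup_loadings <- cor(USArrests$UrbanPop, pca$x)
sup_cos2 <- sup_loadings^2
round(sup_cos2, 3)

## Unlike an active variable, its squared loadings need not sum to 1
sum(sup_cos2)
```

Because the supplementary variable did not help define the PCs, the PCs are under no obligation to reproduce it, which is why the sum of its squared loadings can fall short of 1.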

As a side note, unlike correlation circles, this plot shows only positive values on the x- and y-axes. According to the authors of the package, the coordinates are to be interpreted as measuring "the links (signless) between variables and principal components". This may be interpreted as the coordinates of each variable being its squared loadings on the two PCs, which cannot be negative.
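To see why such coordinates carry no sign, here is a small sketch under an assumption drawn from my reading of the PCAmixdata documentation: the squared loading of a quantitative variable is its squared correlation with the PC scores, while that of a qualitative variable is its correlation ratio with them. The example uses the built-in iris data, and a PC taken from an ordinary PCA of the numeric columns stands in for a mixed-data PC; `correlation_ratio()` is a helper I wrote for illustration.

```r
## Stand-in PC scores from an ordinary PCA of the numeric iris columns
pca <- prcomp(iris[, 1:4], scale. = TRUE)
pc1 <- pca$x[, "PC1"]

## Quantitative variable: squaring the correlation discards its sign
r_quanti  <- cor(iris$Sepal.Width, pc1)
sq_quanti <- r_quanti^2

## Qualitative variable: correlation ratio = between-group / total sum of squares
correlation_ratio <- function(score, group) {
  group_means <- ave(score, group)   # each observation replaced by its group mean
  sum((group_means - mean(score))^2) / sum((score - mean(score))^2)
}
sq_quali <- correlation_ratio(pc1, iris$Species)

## Both measures lie between 0 and 1, so they can share the same axes
c(Sepal.Width = sq_quanti, Species = sq_quali)
```

A correlation ratio has no direction to begin with, which is one way to understand why the plot can only show "signless" links once qualitative variables are included.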

Parting notes

In the next notebook, we will learn how principal components can be rotated to facilitate interpretation of the relationship between variables and PCs.