# Principal component analysis and biplots in Python
Before getting into data science, I remember being deeply impressed by principal component analyses (PCA). The graphical representation of variables as arrows seemed to be pure sorcery. Then I learned the R language as well as PCA basics, and the sorcery became poetic mathematics.
If you need an introduction to PCA, I recommend Josh Starmer's video on his StatQuest channel. Search on YouTube!
Python, like R, comes with helper packages to compute PCAs. The widely known machine-learning package scikit-learn offers PCA transformers, mainly for preprocessing high-dimensional data. The statistical package statsmodels, on the other hand, takes a more traditional statistical approach. There are probably a plethora of other Python packages offering their own version of PCA.
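For reference, here is a minimal sketch of the scikit-learn transformer approach (the data and variable names are invented for illustration): `fit_transform` returns the scores, and `components_` holds the loadings, one row per component.

```python
import numpy as np
from sklearn.decomposition import PCA as SklearnPCA

# random data standing in for real measurements: 100 observations, 4 variables
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))

# scikit-learn's PCA is a transformer: fit_transform returns the scores
sk_pca = SklearnPCA(n_components=2)
scores = sk_pca.fit_transform(X)

print(scores.shape)              # (100, 2) -- one row of scores per observation
print(sk_pca.components_.shape)  # (2, 4)   -- one row of loadings per component
```

This is convenient for machine-learning pipelines, but it does not produce biplots out of the box, which is what the rest of this post is about.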
But to my knowledge, none compares to R for generating biplots. So I wrote my own code.
## Scaling
PCAs basically return two important outcomes:

- scores (or site scores), representing the observations (rows)
- loadings (or species scores), representing the variables (columns)
The first challenge is to figure out how to properly scale scores and loadings so that they are displayed in a desirable fashion. This step is decisive for the interpretation of the biplot.
The `rda` function comes from the vegan R package, which is designed for ecological analyses, where the terms site scores and species scores are commonly used for scores and loadings.
Basically, you will prefer scaling 1 for a distance biplot, where distances between observations in the biplot approximate their distances in the high-dimensional data. You will prefer scaling 2 for a correlation biplot, where angles between loadings (with segments linking loadings to the origin) approximate their correlations (read Borcard et al. 2011 for details). In my experience, other scalings are rarely used. Also in my experience, few scientific papers care about scaling, although it considerably affects the interpretation.
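The two scalings differ only in where the eigenvalue-derived factor is applied: to the scores (scaling 1) or to the loadings (scaling 2). A numpy-only sketch with made-up eigenvalues and placeholder matrices makes this visible (the `rda`-style function below also multiplies by a constant, which this sketch omits):

```python
import numpy as np

# made-up eigenvalues for a 3-component PCA (illustration only)
eig = np.array([4.0, 2.0, 1.0])
fac = np.sqrt(eig / eig.sum())  # per-component factor used by scalings 1 and 2

scores = np.ones((5, 3))    # placeholder scores (5 observations)
loadings = np.ones((4, 3))  # placeholder loadings (4 variables)

# scaling 1 (distance biplot): the factor shrinks the scores
scores_1, loadings_1 = scores * fac, loadings
# scaling 2 (correlation biplot): the factor shrinks the loadings
scores_2, loadings_2 = scores, loadings * fac

# either way, the low-rank reconstruction scores @ loadings.T is identical;
# only the relative spread of points versus arrows in the biplot changes
print(np.allclose(scores_1 @ loadings_1.T, scores_2 @ loadings_2.T))  # True
```

In other words, scaling does not change the information in the PCA, only which distances or angles can be read off the picture.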
The first of the two Python functions transforms scores and loadings to the desired scaling. The scaling function I wrote mimics the `rda` function only.
```python
import numpy as np                   # generic mathematics
import pandas as pd                  # handling tabular data
import matplotlib.pyplot as plt     # basic plots
import seaborn as sns               # more complex plots
from statsmodels.multivariate.pca import PCA


def eigen_scaling(pca, scaling=0):
    # pca is a PCA object obtained from statsmodels.multivariate.pca
    # scaling is one of [0, 1, 2, 3]
    # The eigenvalues of the pca object are n times the ones computed with R,
    # so we divide their sum by the number of rows.
    const = ((pca.scores.shape[0] - 1) * pca.eigenvals.sum() / pca.scores.shape[0]) ** 0.25
    if scaling == 0:
        scores = pca.scores
        loadings = pca.loadings
    elif scaling == 1:
        scaling_fac = (pca.eigenvals / pca.eigenvals.sum()) ** 0.5
        scaling_fac.index = pca.scores.columns
        scores = pca.scores * scaling_fac * const
        loadings = pca.loadings * const
    elif scaling == 2:
        scaling_fac = (pca.eigenvals / pca.eigenvals.sum()) ** 0.5
        scaling_fac.index = pca.scores.columns
        scores = pca.scores * const
        loadings = pca.loadings * scaling_fac * const
    elif scaling == 3:
        scaling_fac = (pca.eigenvals / pca.eigenvals.sum()) ** 0.25
        scaling_fac.index = pca.scores.columns
        scores = pca.scores * scaling_fac * const
        loadings = pca.loadings * scaling_fac * const
    else:
        raise ValueError("scaling should be 0, 1, 2 or 3")
    return scores, loadings
```
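To sanity-check the function, it can be run on random data. The snippet below inlines a condensed copy of the scaling-2 branch so it stands alone (coercing the eigenvalues to a pandas Series up front, and keeping all components so the eigenvalue vector lines up with the score columns); in practice you would simply call `eigen_scaling` as defined above.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.pca import PCA

# random data standing in for real measurements (50 rows, 4 variables)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))

# keep all components so eigenvalues and score columns line up
pca = PCA(data=df, standardize=True, normalize=True, ncomp=4)

# condensed copy of the scaling-2 branch of eigen_scaling
n = pca.scores.shape[0]
eig = pd.Series(np.asarray(pca.eigenvals), index=pca.scores.columns)
const = ((n - 1) * eig.sum() / n) ** 0.25
fac = (eig / eig.sum()) ** 0.5
scaled_scores = pca.scores * const
scaled_loadings = pca.loadings * fac * const

# scaling only rescales the columns; shapes are unchanged
print(scaled_scores.shape, scaled_loadings.shape)
```

Scores keep one row per observation and loadings one row per variable; only their relative spread changes with the scaling.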
## Biplot
Once the scaling is computed, we can draw the biplot (function 2 of 2). I created a basic Seaborn plot, to which I added Matplotlib elements.
```python
def biplot(pca, scaling=0, plot_loading_labels=True, color=None, alpha_scores=1):
    scores, loadings = eigen_scaling(pca, scaling=scaling)

    # Plot scores
    scores_ = scores.copy()
    if color is None:
        sns.relplot(
            x="comp_0",
            y="comp_1",
            palette="muted",
            alpha=alpha_scores,
            data=scores_,
        )
    else:
        scores_["group"] = color
        sns.relplot(
            x="comp_0",
            y="comp_1",
            hue="group",
            palette="muted",
            alpha=alpha_scores,
            data=scores_,
        )

    # Plot loadings
    if plot_loading_labels:
        loading_labels = pca.loadings.index
    for i in range(loadings.shape[0]):
        plt.arrow(
            0, 0,
            loadings.iloc[i, 0],
            loadings.iloc[i, 1],
            color='black',
            alpha=0.7,
            linestyle='-',
            head_width=loadings.values.max() / 50,
            width=loadings.values.max() / 2000,
            length_includes_head=True,
        )
        if plot_loading_labels:
            plt.text(
                loadings.iloc[i, 0] * 1.05,
                loadings.iloc[i, 1] * 1.05,
                loading_labels[i],
                color='black',
                ha='center',
                va='center',
                fontsize=10,
            )

    # Range of the plot
    scores_loadings = np.vstack([scores.values[:, :2], loadings.values[:, :2]])
    xymin = scores_loadings.min(axis=0) * 1.2
    xymax = scores_loadings.max(axis=0) * 1.2
    plt.axhline(y=0, color='k', linestyle='dotted', linewidth=0.75)
    plt.axvline(x=0, color='k', linestyle='dotted', linewidth=0.75)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.xlim(xymin[0], xymax[0])
    plt.ylim(xymin[1], xymax[1])
```
## Try it
Having had enough of the iris data set, Allison Horst introduced the Palmer penguins data set, with real data from Palmer Station, Antarctica.
```python
penguin = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")
penguin.sample(10)
```
I remove some columns and drop rows containing `NA` values.
```python
penguin_ = (
    penguin
    .loc[:, ["species", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
    .dropna(axis=0)
)
```
I compute the PCA on scaled values.
```python
penguin_pca = PCA(
    data=penguin_.drop("species", axis=1),
    standardize=True,
    normalize=True,
    ncomp=2,
)
```
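Before interpreting the biplot, it is worth checking how much of the total variance the retained components carry; the `eigenvals` attribute gives this directly. Here is a sketch on random stand-in data, so the snippet runs without downloading the CSV (with the real data you would inspect `penguin_pca.eigenvals` the same way):

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.pca import PCA

# random stand-in data: 100 observations, 4 variables
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("wxyz"))
pca = PCA(data=df, standardize=True, normalize=True, ncomp=4)

# proportion of total variance carried by each component
explained = np.asarray(pca.eigenvals)
explained = explained / explained.sum()
print(explained.round(3))
```

If the first two proportions are low, a two-dimensional biplot compresses the data a lot and should be read with caution.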
Then show the biplot.
```python
biplot(
    pca=penguin_pca,
    scaling=2,
    plot_loading_labels=True,
    color=penguin_["species"],
)
```
Which is similar to what we would obtain with the vegan package.