Principal component analysis and biplots in Python

Before getting into data science, something I remember being deeply impressed by was principal component analysis (PCA). The graphical representation of variables and arrows seemed like pure sorcery. Then I learned the R language along with the basics of PCA, and the sorcery turned into poetic mathematics.

If you need an introduction to PCA, I recommend Josh Starmer's video on his StatQuest channel. Search on YouTube!

Python, like R, comes with helper packages to compute PCAs. The widely known machine-learning package scikit-learn offers PCA transformers, mainly intended for preprocessing high-dimensional data. The statistical package statsmodels, on the other hand, takes a more traditional statistical approach. There are probably a plethora of other Python packages offering their own version of PCA.
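For instance, extracting scores and loadings from a scikit-learn PCA looks roughly like this (a minimal sketch on random data; the variable names are my own):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))          # toy data: 100 observations, 4 variables
X_std = StandardScaler().fit_transform(X)  # standardize, so the PCA works on correlations
pca = PCA(n_components=2).fit(X_std)
scores = pca.transform(X_std)              # observations projected on the PCs
loadings = pca.components_.T               # one row per variable
print(scores.shape, loadings.shape)        # (100, 2) (4, 2)
```

scikit-learn stops there: the plotting, and in particular the scaling of scores against loadings, is left to you.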

But to my knowledge, none compares to R for generating biplots. So I wrote my own code.

Scaling

PCAs basically return two important outcomes:

  • scores (or sites scores), representing the observations (rows)

  • loadings (or species scores), representing the variables (columns)
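On standardized data, both matrices can be recovered directly from a singular value decomposition; here is a minimal numpy sketch (made-up data and my own variable names):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 4))                  # 150 observations, 4 variables
Xc = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize the columns
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s    # one row per observation (sites)
loadings = Vt.T   # one row per variable (species)
print(scores.shape, loadings.shape)  # (150, 4) (4, 4)
```

Multiplying the two matrices back together (`scores @ loadings.T`) recovers the standardized data, which is what makes the biplot a faithful joint display of rows and columns.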

The first challenge is to figure out how to properly scale scores and loadings so that they are displayed in a desirable fashion. This step is decisive for the interpretation of the biplot.

PCA functions in R return biplots with different scalings. Screenshot from https://rdrr.io/rforge/vegan/f/inst/doc/decision-vegan.pdf

PCA constants specified in the previous figure. Screenshot from https://rdrr.io/rforge/vegan/f/inst/doc/decision-vegan.pdf

The rda function comes from the vegan R package, which is designed for ecological analysis, where the terms site scores and species scores are commonly used for scores and loadings.

Basically, you will prefer scaling 1 for a distance biplot, where distances between objects (observations and variables) in the biplot approximate their distances in the high-dimensional data. You will prefer scaling 2 for a correlation biplot, where the angles between loadings (the segments linking loadings to the origin) approximate their correlations (read Borcard et al. 2011 for details). In my experience, other scalings are rarely used. Also in my experience, few scientific papers care about scaling, although it considerably affects the interpretation.
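As a quick sanity check of the scaling 2 property: when all components are kept, the cosine of the angle between two loading vectors (each axis weighted by the square root of its eigenvalue) equals the variables' correlation exactly; in a two-dimensional biplot it is only an approximation. A numpy sketch with simulated data (names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
z = rng.standard_normal(n)
X = np.column_stack([
    z + 0.3 * rng.standard_normal(n),  # two strongly correlated variables
    z + 0.3 * rng.standard_normal(n),
    rng.standard_normal(n),            # one independent variable
])
R = np.corrcoef(X, rowvar=False)        # correlation matrix of the variables
eigvals, eigvecs = np.linalg.eigh(R)
loadings = eigvecs * np.sqrt(eigvals)   # scaling 2 style loadings, all PCs kept
cos_angle = loadings[0] @ loadings[1] / (
    np.linalg.norm(loadings[0]) * np.linalg.norm(loadings[1])
)
print(round(cos_angle, 4), round(R[0, 1], 4))  # the two values match
```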

The first of the two Python functions transforms scores and loadings to the desired scaling. The scaling function I wrote mimics the rda function only.

import sys
import numpy as np # generic mathematics
import pandas as pd # handling tabular data
import matplotlib.pyplot as plt # basic plots
import seaborn as sns # more complex plots
from statsmodels.multivariate.pca import PCA
def eigen_scaling(pca, scaling = 0):
    # pca is a PCA object obtained from statsmodels.multivariate.pca
    # scaling is one of [0, 1, 2, 3]
    # the eigenvalues of the pca object are n times the ones computed with R,
    # so their sum is divided by the number of rows
    if scaling not in [0, 1, 2, 3]:
        raise ValueError("scaling should be 0, 1, 2 or 3")
    const = ((pca.scores.shape[0] - 1) * pca.eigenvals.sum() / pca.scores.shape[0])**0.25
    if scaling == 0:
        # no scaling: raw scores and loadings
        scores = pca.scores
        loadings = pca.loadings
    elif scaling == 1:
        # distance biplot: scores scaled by the relative eigenvalues
        scaling_fac = (pca.eigenvals / pca.eigenvals.sum())**0.5
        scaling_fac.index = pca.scores.columns
        scores = pca.scores * scaling_fac * const
        loadings = pca.loadings * const
    elif scaling == 2:
        # correlation biplot: loadings scaled by the relative eigenvalues
        scaling_fac = (pca.eigenvals / pca.eigenvals.sum())**0.5
        scaling_fac.index = pca.scores.columns
        scores = pca.scores * const
        loadings = pca.loadings * scaling_fac * const
    else:  # scaling == 3
        # symmetric biplot: both scaled by the fourth root
        scaling_fac = (pca.eigenvals / pca.eigenvals.sum())**0.25
        scaling_fac.index = pca.scores.columns
        scores = pca.scores * scaling_fac * const
        loadings = pca.loadings * scaling_fac * const
    return scores, loadings

Biplot

Once you can compute the scalings, you can draw the biplot (function 2 of 2). I created a basic Seaborn plot, to which I added Matplotlib elements.

def biplot(pca, scaling = 0, plot_loading_labels = True, color = None, alpha_scores = 1):
    scores, loadings = eigen_scaling(pca, scaling=scaling)
    # Plot scores
    if color is None:
        sns.relplot(
            x = "comp_0",
            y = "comp_1",
            palette = "muted",
            alpha = alpha_scores,
            data = scores,
        )
    else:
        scores_ = scores.copy()
        scores_["group"] = color
        sns.relplot(
            x = "comp_0",
            y = "comp_1",
            hue = "group",
            palette = "muted",
            alpha = alpha_scores,
            data = scores_,
        )
    # Plot loadings
    if plot_loading_labels:
        loading_labels = pca.loadings.index
    for i in range(loadings.shape[0]):
        plt.arrow(
            0, 0,
            loadings.iloc[i, 0],
            loadings.iloc[i, 1],
            color = 'black',
            alpha = 0.7,
            linestyle = '-',
            head_width = loadings.values.max() / 50,
            width = loadings.values.max() / 2000,
            length_includes_head = True
        )
        if plot_loading_labels:
            plt.text(
                loadings.iloc[i, 0]*1.05,
                loadings.iloc[i, 1]*1.05,
                loading_labels[i],
                color = 'black',
                ha = 'center',
                va = 'center',
                fontsize = 10
            );
    # range of the plot
    scores_loadings = np.vstack([scores.values[:, :2], loadings.values[:, :2]])
    xymin = scores_loadings.min(axis=0) * 1.2
    xymax = scores_loadings.max(axis=0) * 1.2
    plt.axhline(y = 0, color = 'k', linestyle = 'dotted', linewidth=0.75)
    plt.axvline(x = 0, color = 'k', linestyle = 'dotted', linewidth=0.75)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.xlim(xymin[0], xymax[0])
    plt.ylim(xymin[1], xymax[1]);

Try it

Having had enough of the iris data set, Allison Horst introduced the Palmer penguins data set, with real data from Palmer Station, Antarctica.

penguin = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")
penguin.sample(10)

I remove some columns and drop rows containing NA values.

penguin_ = (
  penguin
  .loc[:, ["species", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
  .dropna(axis = 0)
)

I compute the PCA on scaled values.

penguin_pca = PCA(
  data = penguin_.drop("species", axis = 1),
  standardize = True,
  normalize = True,
  ncomp = 2
)

Then show the biplot.

biplot(
  pca = penguin_pca,
  scaling = 2,
  plot_loading_labels = True,
  color = penguin_["species"]
)

Which is similar to what we would obtain with the vegan package.
