Principal component analysis and biplots in Python

Before getting into data science, something I remember being deeply impressed by was principal component analysis (PCA). The graphical representation of variables and arrows seemed like pure sorcery. Then I learned the R language along with the basics of PCA, and the sorcery turned into poetic mathematics.

If you need an introduction to PCA, I recommend Josh Starmer's video on his StatQuest channel. Search on YouTube!

Python, like R, comes with helper packages to compute PCAs. The widely known machine-learning package scikit-learn offers PCA transformers, mainly intended for preprocessing high-dimensional data. The statistical package statsmodels, on the other hand, takes a more traditional statistical approach. There are probably a plethora of other Python packages offering their own version of PCA.
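For instance, extracting scores and loadings from a scikit-learn PCA looks roughly like this (a minimal sketch on random data; the variable names are my own):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))          # toy data: 100 observations, 4 variables
X_std = StandardScaler().fit_transform(X)  # standardize, so the PCA works on correlations
pca = PCA(n_components=2).fit(X_std)
scores = pca.transform(X_std)              # observations projected on the PCs
loadings = pca.components_.T               # one row per variable
print(scores.shape, loadings.shape)        # (100, 2) (4, 2)
```

scikit-learn stops there: the plotting, and in particular the scaling of scores against loadings, is left to you.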

But to my knowledge, none compares to R for generating biplots. So I wrote my own code.

Scaling

PCAs basically return two important outcomes:

  • scores (or sites scores), representing the observations (rows)

  • loadings (or species scores), representing the variables (columns)
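On standardized data, both matrices can be recovered directly from a singular value decomposition; here is a minimal numpy sketch (made-up data and my own variable names):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 4))                  # 150 observations, 4 variables
Xc = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize the columns
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s    # one row per observation (sites)
loadings = Vt.T   # one row per variable (species)
print(scores.shape, loadings.shape)  # (150, 4) (4, 4)
```

Multiplying the two matrices back together (`scores @ loadings.T`) recovers the standardized data, which is what makes the biplot a faithful joint display of rows and columns.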

The first challenge is to figure out how to properly scale scores and loadings so that they are displayed in a desirable fashion. This step is decisive for the interpretation of the biplot.

PCA functions in R return biplots with different scalings. Screenshot from https://rdrr.io/rforge/vegan/f/inst/doc/decision-vegan.pdf

PCA constants specified in the previous figure. Screenshot from https://rdrr.io/rforge/vegan/f/inst/doc/decision-vegan.pdf

The rda function comes from the vegan R package, which is designed for ecological analysis, where the terms site scores and species scores are commonly used for scores and loadings.

Basically, you will prefer scaling 1 for a distance biplot, where distances between objects (observations and variables) in the biplot approximate their distances in the high-dimensional data. You will prefer scaling 2 for a correlation biplot, where the angles between loadings (the segments linking loadings to the origin) approximate their correlations (read Borcard et al. 2011 for details). In my experience, other scalings are rarely used. Also in my experience, few scientific papers care about scaling, although it considerably affects the interpretation.
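As a quick sanity check of the scaling 2 property: when all components are kept, the cosine of the angle between two loading vectors (each axis weighted by the square root of its eigenvalue) equals the variables' correlation exactly; in a two-dimensional biplot it is only an approximation. A numpy sketch with simulated data (names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
z = rng.standard_normal(n)
X = np.column_stack([
    z + 0.3 * rng.standard_normal(n),  # two strongly correlated variables
    z + 0.3 * rng.standard_normal(n),
    rng.standard_normal(n),            # one independent variable
])
R = np.corrcoef(X, rowvar=False)        # correlation matrix of the variables
eigvals, eigvecs = np.linalg.eigh(R)
loadings = eigvecs * np.sqrt(eigvals)   # scaling 2 style loadings, all PCs kept
cos_angle = loadings[0] @ loadings[1] / (
    np.linalg.norm(loadings[0]) * np.linalg.norm(loadings[1])
)
print(round(cos_angle, 4), round(R[0, 1], 4))  # the two values match
```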

The first of the two Python functions transforms scores and loadings to the desired scaling. The scaling function I wrote mimics the rda function only.

import sys
import numpy as np # generic mathematics
import pandas as pd # handling tabular data
import matplotlib.pyplot as plt # basic plots
import seaborn as sns # more complex plots
from statsmodels.multivariate.pca import PCA
def eigen_scaling(pca, scaling = 0):
    # pca is a PCA object obtained from statsmodels.multivariate.pca
    # scaling is one of [0, 1, 2, 3]
    # the eigenvalues of the pca object are n times the ones computed with R,
    # so their sum is divided by the number of rows
    if scaling not in [0, 1, 2, 3]:
        raise ValueError("scaling should be 0, 1, 2 or 3")
    const = ((pca.scores.shape[0] - 1) * pca.eigenvals.sum() / pca.scores.shape[0])**0.25
    if scaling == 0:
        # no scaling: raw scores and loadings
        scores = pca.scores
        loadings = pca.loadings
    elif scaling == 1:
        # distance biplot: scores scaled by the relative eigenvalues
        scaling_fac = (pca.eigenvals / pca.eigenvals.sum())**0.5
        scaling_fac.index = pca.scores.columns
        scores = pca.scores * scaling_fac * const
        loadings = pca.loadings * const
    elif scaling == 2:
        # correlation biplot: loadings scaled by the relative eigenvalues
        scaling_fac = (pca.eigenvals / pca.eigenvals.sum())**0.5
        scaling_fac.index = pca.scores.columns
        scores = pca.scores * const
        loadings = pca.loadings * scaling_fac * const
    else:  # scaling == 3
        # symmetric biplot: both scaled by the fourth root
        scaling_fac = (pca.eigenvals / pca.eigenvals.sum())**0.25
        scaling_fac.index = pca.scores.columns
        scores = pca.scores * scaling_fac * const
        loadings = pca.loadings * scaling_fac * const
    return scores, loadings

Biplot

Once you can compute the scalings, you can draw the biplot (function 2 of 2). I created a basic Seaborn plot, to which I added Matplotlib elements.

def biplot(pca, scaling = 0, plot_loading_labels = True, color = None, alpha_scores = 1):
    scores, loadings = eigen_scaling(pca, scaling=scaling)
    # Plot scores
    if color is None:
        sns.relplot(
            x = "comp_0",
            y = "comp_1",
            palette = "muted",
            alpha = alpha_scores,
            data = scores,
        )
    else:
        scores_ = scores.copy()
        scores_["group"] = color
        sns.relplot(
            x = "comp_0",
            y = "comp_1",
            hue = "group",
            palette = "muted",
            alpha = alpha_scores,
            data = scores_,
        )
    # Plot loadings
    if plot_loading_labels:
        loading_labels = pca.loadings.index
    for i in range(loadings.shape[0]):
        plt.arrow(
            0, 0,
            loadings.iloc[i, 0],
            loadings.iloc[i, 1],
            color = 'black',
            alpha = 0.7,
            linestyle = '-',
            head_width = loadings.values.max() / 50,
            width = loadings.values.max() / 2000,
            length_includes_head = True
        )
        if plot_loading_labels:
            plt.text(
                loadings.iloc[i, 0]*1.05,
                loadings.iloc[i, 1]*1.05,
                loading_labels[i],
                color = 'black',
                ha = 'center',
                va = 'center',
                fontsize = 10
            );
    # range of the plot
    scores_loadings = np.vstack([scores.values[:, :2], loadings.values[:, :2]])
    xymin = scores_loadings.min(axis=0) * 1.2
    xymax = scores_loadings.max(axis=0) * 1.2
    plt.axhline(y = 0, color = 'k', linestyle = 'dotted', linewidth=0.75)
    plt.axvline(x = 0, color = 'k', linestyle = 'dotted', linewidth=0.75)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.xlim(xymin[0], xymax[0])
    plt.ylim(xymin[1], xymax[1]);

Try it

Having had enough of the iris data set, Allison Horst introduced the Palmer penguins data set, with real data from Palmer Station, Antarctica.

penguin = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")
penguin.sample(10)

I remove some columns and drop rows containing NA values.

penguin_ = (
  penguin
  .loc[:, ["species", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
  .dropna(axis = 0)
)

I compute the PCA on scaled values.

penguin_pca = PCA(
  data = penguin_.drop("species", axis = 1),
  standardize = True,
  normalize = True,
  ncomp = 2
)

Then show the biplot.

biplot(
  pca = penguin_pca,
  scaling = 2,
  plot_loading_labels = True,
  color = penguin_["species"]
)

Which is similar to what we would obtain with the vegan package.
