Spatial prediction of soil pollutants with multi-output Gaussian processes
A tutorial with GPFlow
Note. The PyMC version can be found here.
Purple sunset, photo of the Meuse river by Ilirjan Rrumbullaku
The Meuse river flows through France, Belgium, and the Netherlands. Several industries have settled on its banks, so its bed and the soils along its banks are polluted by metals. An area north of Maastricht, the Netherlands, is often used as a case study in geostatistics.
Knowledge requirements. Intermediate level in Python programming and data science. A beginner level in Gaussian processes can be reached by reading A Visual Exploration of Gaussian Processes by Görtler et al. (2019). A beginner level in positioning can be quickly reached by carefully watching this 6-minute Vox video.
Specific objectives. At the end of this tutorial, you will be able to
show geographic data on a map background,
perform spatial queries,
model spatial attributes with Gaussian processes,
project spatial attributes on a background map.
Tools. We will use Python with the following packages: NumPy (generic mathematical operations), Pandas (to handle tabular data), GeoPandas (to create spatial data types in tabular data, and perform spatial queries on them), Lets-Plot (for interactive plots) and GPFlow (to build our Gaussian process model).
Multi-output Gaussian processes
You might need a multi-output prediction when you suppose that the outputs are related to each other (i.e. correlated). Such cases are common in spatial predictions. Without committing ourselves on causality, we can often assume that things found at the same location might have occurred together. In this case, soil pollutants may occur together due to sediment deposits from the flow of the river, floods, one-off spills, etc. Of course, we could use a collection of individual models, but when outputs are correlated, they inform each other, and a joint model can take advantage of it.
In the knowledge requirements above, I specified that you need a beginner level in Gaussian processes. Gaussian processes are cognitively intensive, but mathematically powerful. They are also, in my view, poetic. Basically, a Gaussian process ($\mathcal{GP}$) is a multivariate normal distribution with infinite dimensions, relying on continuous functions to define the mean vector and the covariance matrix. We can set the mean to zero when data are centered on 0. Covariance functions are named kernels, and many different kernels are distributed by GPFlow. In single-output Gaussian processes, we construct a kernel on the inputs.
$$y = f(x)$$
$$f(x) \sim \mathcal{GP}\left(0, k(x, x')\right)$$
In other words, we estimate $y$ as a function of $x$, and this function is a Gaussian process (infinite multivariate normal) with 0 mean and a covariance defined by a function between a pair of observations $x$ and $x'$.
In multi-output Gaussian processes, we add another kernel on the outputs. There are several ways to approach such a problem with GPFlow. One of them is coregionalization, where any pair among the $P$ output functions, $f_i$ and $f_j$, will have a covariance equal to our kernel times an entry of a $P \times P$ modifier matrix $B$.
$$\mathrm{cov}\left(f_i(x), f_j(x')\right) = k(x, x') \cdot B_{ij}$$
Just like our kernel, this $B$ matrix must be positive-definite. It is convenient to define $B$ by two objects, $W$ and $\kappa$.
$$B = W W^\top + \mathrm{diag}(\kappa)$$
GPFlow assembles the multi-output kernel for you, as long as you provide a kernel for the features, as well as $W$ and $\kappa$.
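To make the modifier matrix less abstract, here is a minimal NumPy sketch, assuming the usual coregionalization form B = W Wᵀ + diag(κ). The sizes (four outputs, rank 2) and the random values are made up for illustration; they are not the fitted values of this tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)
n_outputs, rank = 4, 2  # hypothetical sizes: four metals, rank 2

W = rng.normal(size=(n_outputs, rank))          # low-rank factor, shape (P, rank)
kappa = rng.uniform(0.1, 1.0, size=n_outputs)   # strictly positive diagonal terms

# Modifier matrix: low-rank part plus a positive diagonal
B = W @ W.T + np.diag(kappa)

eigenvalues = np.linalg.eigvalsh(B)             # B is symmetric, so eigvalsh applies
low_rank_part = np.linalg.matrix_rank(W @ W.T)  # cannot exceed `rank`

print(B.round(2))
print("smallest eigenvalue:", eigenvalues.min())
print("rank of W W^T:", low_rank_part)
```

The positive diagonal is what keeps B positive-definite even though W Wᵀ alone has a rank of at most `rank`.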
Prepare the notebook
Nextjournal comes with Python and the package installers pip and conda. We need to install GeoPandas, Lets-plot and GPFlow; the other packages are already installed in Nextjournal's default Python environment. Because Lets-plot is not available in conda, for simplicity we just install everything with pip in a Bash cell.
The data from the Meuse river are distributed as shapefiles on the companion website of the book A Practical Guide to Geostatistical Mapping, by Tomislav Hengl. I unzipped the file, opened it in QGIS, then exported the data as a csv file.
Import packages and data
Packages are installed, but still not loaded in our Python session.
Similarly, data are loaded in Nextjournal, but not to our Python session.
As written on the website distributing the Meuse data set, coordinates x and y are in the proj4: +init=epsg:28992 coordinate system. If we only cared about modelling, we wouldn't need to mind much about the coordinate system. But since we care about plotting data on a map and later performing spatial queries, I specified in GeoPandas that x and y are point geometries expressed in the 28992 coordinate reference system (crs), then transformed the geometries to longitude-latitude angular data with the universal WGS84 system, i.e. 4326 (go to epsg.io to get the numbers). Moreover, later on, we will use the distance from the river as a predictor.
For a quick overview of the geometry, here is a Lets-plot alternative to meuse_gdf.plot(). I will introduce Lets-plot shortly.
Show geographic data on a map background
In the meuse_gdf geographic data frame, we have a column for spatial data named geometry. We still have x and y in the table, but they are still expressed in EPSG 28992 (the geometry is in 4326, but the original x and y have not been modified). Let's update these values.
If you are familiar with ggplot (a widely known package in the R statistical programming language), you will understand the following code block right away. Basically, the grammar of graphics implemented in ggplot and Lets-plot consists in calling the data, then stacking layers of graphical entities whose elements (aesthetics like x-y position, colors, point shape, etc.) are mapped to columns.
I wasn't aware of it when I wrote the first version of this tutorial, but Artem Smirnov notified me that Lets-plot is designed to work with GeoPandas geometries! The trick is to call the data in the graphical entity layer (earlier, I called meuse_gdf in the lp.geom_point() layer). In the following code block, I call the plot (in this special case without the data), add a map, add points filled with a color according to values in the copper column, then customize the fill colors.
Lets-plot offers a large number of map styles (tilesets). I used the STAMEN_DESIGN_TONER tileset because it's unobtrusive, has great contrast, and is pleasant to my eyes.
If you know R's ggplot, plotting multiple metals would consist in, first, pivoting the table to long format, with all metal concentration values in a single column and the name of the metal variable in another column. Then, we would use faceting (facet_grid) to obtain multiple plots. But Lets-plot doesn't yet have the option to plot a legend per facet (UPDATE: Lets-plot version 2.3.0 now has the option to scale the position, but not yet other aesthetics like colors and filling), so zinc concentrations would take up the whole fill color gradient, while cadmium points would all appear white since their concentrations are all low compared to zinc.
The trick for now (I will update the code when Lets-plot implements some scaling argument) is to define the plot in a function and plot each metal one after the other with lp.GGBunch(), which is kind of tricky since it needs the position (below, h) of each plot from the top-left corner of the whole patchwork, but works great.
Perform spatial queries
The dist column of the meuse_gdf table contains the distance between the river and the sample. However, if we wish to evaluate the distance at any point in the domain of application of our model, in order to subsequently make our projections in space, we will have to compute the distance to the river ourselves.
Note that we are working here with geographic coordinates (longitude, latitude), which we approximate as planar geometric data. At this scale and for our usage, it is appropriate. But over larger areas, or close to the poles, these distances could be distorted.
We will need the shape of the river to compute the distance to it at any point. I drew a line by clicking along the center of the river in QGIS. Since a line is difficult to handle in a csv, I exported it in the geojson format, which can be imported painlessly with GeoPandas.
I used the distance function of GeoPandas in a loop across rows, and assigned the result to a new column.
Let's see what it looks like and how our computed distance compares to the original dist column.
The left plot shows that distances are comparable, and the right plot shows that points away from the river are going darker. Good!
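For readers curious about what a point-to-line distance involves, here is a minimal NumPy sketch. It is a stand-in written for illustration, not the GeoPandas distance call used in this tutorial, and it assumes planar coordinates and a line without repeated vertices. The river vertices and sample points are hypothetical.

```python
import numpy as np

def distance_to_polyline(points, line):
    """Shortest planar distance from each point (n, 2) to a polyline (m, 2)."""
    points = np.asarray(points, dtype=float)
    line = np.asarray(line, dtype=float)
    start, end = line[:-1], line[1:]             # the m-1 segments of the line
    seg = end - start
    seg_len2 = np.sum(seg ** 2, axis=1)          # squared segment lengths (assumed > 0)
    # Position of the orthogonal projection along each segment, clipped to [0, 1]
    rel = points[:, None, :] - start[None, :, :]             # shape (n, m-1, 2)
    t = np.clip(np.sum(rel * seg[None, :, :], axis=2) / seg_len2, 0.0, 1.0)
    nearest = start[None, :, :] + t[:, :, None] * seg[None, :, :]
    dist = np.linalg.norm(points[:, None, :] - nearest, axis=2)
    return dist.min(axis=1)                      # closest segment wins

# Hypothetical river centre line and sampling points
river = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
samples = np.array([[0.5, 0.5], [2.0, 1.0], [0.0, -1.0]])
print(distance_to_polyline(samples, river))
```

With real data, the geometric operations of GeoPandas remain the more convenient route, since they read the geometries directly.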
Model spatial attributes with Gaussian processes
This part is probably the most difficult. Our workflow is the following.
Assign observations to training and testing
Preprocess data (standardize to 0 mean and 1 variance)
Arrange data to a format amenable to coregionalisation with GPFlow
Create the model
Fit the model
Predict and criticize the model
Use the model
Assign to train / test
This is a necessary step to check whether the model predicts well at locations it has not seen. The split must be random, and it's usually good to check afterwards that distributions in both sets are similar (I won't do it here though). I retain 70% of the data for training, the rest for testing.
I prefer to perform the split operation after preprocessing and arrangements.
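A random 70/30 split can be sketched on the observation index alone, so that it can be applied later, once the data are preprocessed and arranged. The number of observations and the seed below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)    # fixed seed, so the split is reproducible
n_obs = 150                        # hypothetical number of sampling locations
obs_index = np.arange(n_obs)

n_train = int(round(0.7 * n_obs))  # 70% of the observations go to training
shuffled = rng.permutation(obs_index)
train_index = np.sort(shuffled[:n_train])
test_index = np.sort(shuffled[n_train:])

print(len(train_index), len(test_index))
```

Splitting on the index of the sampling location, rather than on rows of the long table, keeps all metals measured at one location in the same set.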
I identify the columns I want to predict (targets, outcomes or outputs) and those containing the information needed to predict (features).
The data set contains other information useful to predict soil pollutants. I could have used the elevation, flood frequency, soil type, etc. But remember that if you want to predict pollutants at any location on your map, you also need to provide the features at any location. What would you do to obtain them? Think about it 30 seconds...
For categorical variables (e.g. soil), you could draw polygons in your GIS to delimit zones and perform spatial queries to assign a category to each point enclosed in a polygon. But this would be impossible for numerical variables. In all cases (categorical and numerical), you could create a simple spatial model, e.g. a k-nearest neighbors model with scikit-learn, with the variable of interest (e.g. elev) as target and the position (x and y) as features, then use this model to predict at any location what would become a feature for the soil pollutant model. By doing so, you should fit the soil pollutant model on the outcomes of your first spatial model, not on the original points.
I won't do that here, since it would just make the approach more complicated - we will stick to the position and the distance from the river as features.
The XY table must be standardized (or scaled, _sc) on the training set. Most tutorials online use scikit-learn tools, but since standardizing is a very simple operation, I prefer to stay explicit.
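An explicit standardization can be sketched as follows, assuming the mean and standard deviation are computed on the training rows only and then applied to the whole table. The small XY table and the training index are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical table with two features and one target
XY = pd.DataFrame({
    "x": [178605.0, 178660.0, 178800.0, 179000.0, 179200.0, 179500.0],
    "y": [333611.0, 333558.0, 333300.0, 333000.0, 332500.0, 332000.0],
    "zinc": [1022.0, 1141.0, 640.0, 257.0, 269.0, 180.0],
})
train_index = [0, 1, 2, 4]  # hypothetical training rows

# Statistics from the training rows only, to avoid leaking test information
train_mean = XY.loc[train_index].mean()
train_sd = XY.loc[train_index].std()

XY_sc = (XY - train_mean) / train_sd

print(XY_sc.round(2))
```

Keeping train_mean and train_sd around matters, since the same values are needed later to bring predictions back to the original scale.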
Note that concentration values are compositional data. Usually, I would have transformed them to log-ratios before scaling, but this would have complicated this workflow. For more info, read Why, and How, Should Geologists Use Compositional Data Analysis, by Ricardo A. Valls (2008).
Coregionalization in GPFlow needs a special format: the target is a single column of a matrix in long format, stacking all outputs into a vector, with an integer column (starting from 0) added to the feature matrix to identify which output each row belongs to.
The first step to arrange our data set is to create a long format along the targets with the melt() method. The variable column contains the target identifier and the value column contains its value. To keep track of the index for the train/test split down the road, I "reset" the index so that it becomes a genuine column.
As I wrote, the variable must be an integer, while the variables in the previous table are specified as strings. We can create a helper table, then merge it into our initial table.
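The two steps above (melting, then merging a helper table) can be sketched with a toy table. The values are invented, and the column name variable_id follows the name used later in this tutorial.

```python
import pandas as pd

# Hypothetical scaled table: two features, two targets
XY_sc = pd.DataFrame({
    "x": [0.1, -0.3, 1.2],
    "y": [0.5, 0.0, -0.7],
    "cadmium": [1.1, -0.2, 0.4],
    "copper": [0.9, -0.5, 0.3],
})
features = ["x", "y"]
targets = ["cadmium", "copper"]

# reset_index() turns the row label into a genuine "index" column,
# melt() stacks the targets into "variable" (name) and "value" (content)
XY_long = XY_sc.reset_index().melt(id_vars=["index"] + features, value_vars=targets)

# Helper table mapping each target name to an integer starting from 0
helper = pd.DataFrame({"variable": targets, "variable_id": list(range(len(targets)))})
XY_long = XY_long.merge(helper, on="variable", how="left")

print(XY_long)
```

Each sampling location now appears once per metal, and the index column still tells which location a row came from.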
Create the model
The first version of this tutorial used the PyMC package to compute the multi-output Gaussian process. PyMC is my go-to tool for everything related to Bayesian approaches. Due to a confusion about PyMC's capabilities to predict multiple outputs, I reworked the tutorial with GPFlow, which comes with all the tools we need for the tasks at hand, without excessive complications. But it turned out that PyMC is fine and that I had used it right, so if you prefer PyMC, the notebook is available here.
Because GPFlow has a tensorflow-probability backend, we can straightforwardly specify the hyper-parameters of our model with probability distributions instead of fixed values, just like we would have done with PyMC.
At this point, I need my training features (X) and multi-output targets (Y) in the form of arrays, not tables (from tables to arrays with the .values attribute). The variable_id column goes in the X array. I also need the index of the column where the variable_id is stored (id_dim), the column indices of the features and the number of outputs.
This information will allow us to create our kernel, the ❤️ of our Gaussian process model. The generic Matern32 kernel will handle the job for our features (the non-multi-output part). We have to tell the kernel which dimensions it applies to.
I wrote earlier that we use tensorflow-probability to define probability distributions for our hyper-parameters. We can have a look at our priors with Scipy. I used a half-Cauchy prior on the length scale and another on the amplitude. I tend to use Cauchy priors intensively since they have a thicker tail than normal distributions, thus allowing more extreme values. Half-Cauchy, like half-normal, is restricted to positive values (and by default comes with a location of zero; strictly speaking, a Cauchy distribution has a location rather than a mean). Such weakly informative priors allow a lot of flexibility, although they favour values close to the lower boundary of the distribution. The following plot shows priors from a half-normal and two half-Cauchys (more on the half-Cauchy with location = 5 later). You can see the thicker tail of the half-Cauchy (location = 0) compared to the half-normal.
The other half-Cauchy has a location of 5 instead of 0. Using a hard lower limit is useful for the lengthscales hyper-parameter of the features kernel. While the variance hyper-parameter controls the amplitude of the Gaussian process (i.e. how far from the mean values are able to reach), lengthscales controls the wiggliness (or the frequency in signal analysis), i.e. how much features close to each other can output different values. Low lengthscales tend to overfit, so it's good practice to make sure that it doesn't optimize to low values, which can happen when you have few data points. In this case, the hard lower limit of variance is zero and that of lengthscales is 0.5.
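To compare the tails without a plot, the densities can be written by hand with NumPy. This is a sketch of the standard half-normal and half-Cauchy densities with unit scale; the evaluation points and the shifted location of 0.5 are chosen for illustration only.

```python
import numpy as np

def half_normal_pdf(x, scale=1.0):
    """Density of a half-normal distribution starting at 0."""
    x = np.asarray(x, dtype=float)
    pdf = np.sqrt(2.0 / np.pi) / scale * np.exp(-0.5 * (x / scale) ** 2)
    return np.where(x >= 0, pdf, 0.0)

def half_cauchy_pdf(x, loc=0.0, scale=1.0):
    """Density of a half-Cauchy distribution with a hard lower limit at loc."""
    x = np.asarray(x, dtype=float)
    z = (x - loc) / scale
    pdf = 2.0 / (np.pi * scale * (1.0 + z ** 2))
    return np.where(x >= loc, pdf, 0.0)

x = np.array([0.25, 0.5, 1.0, 5.0, 10.0])
print("half-normal         :", half_normal_pdf(x))
print("half-Cauchy, loc=0  :", half_cauchy_pdf(x))
print("half-Cauchy, loc=0.5:", half_cauchy_pdf(x, loc=0.5))
```

At large values the half-Cauchy density stays well above the half-normal one, which is the thicker tail discussed above, and shifting the location gives zero density below the hard limit.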
How did I choose 0.5? I first tried 0, but the regression error was close to zero on the training data set, and mapping the predictions showed local changes not supported by the testing set. I then tried 1, but the prediction on both train and test was pretty bad, not allowing local changes. With 0.5, the prediction was wiggly, but credible.
The feature kernel will be combined with a coregionalization kernel, which needs the number of outputs, a rank describing the expected correlation between outputs, and the column index it applies to, i.e. the variable_id column. The rank is described in the GPFlow documentation as "the number of degrees of correlation between the outputs". I'm not sure what it means - if you know, please tell me!
Here, we find the $W$ and $\kappa$ defined in the introduction. I will put priors on both W and kappa, whose dimensions depend on the number of outputs and the rank.
The features and multi-output kernels are combined by a kernel product.
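What the kernel product computes can be sketched with plain NumPy: a Matern 3/2 kernel on the feature columns, multiplied element-wise by the entries of B = W Wᵀ + diag(κ) selected by the output ids. This is an illustration of the arithmetic, not GPFlow's implementation; the toy inputs, W and kappa are made up, and the id is placed in the last column by assumption.

```python
import numpy as np

def matern32(X1, X2, variance=1.0, lengthscale=1.0):
    """Matern 3/2 kernel on Euclidean distances between rows."""
    d = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2) / lengthscale
    return variance * (1.0 + np.sqrt(3.0) * d) * np.exp(-np.sqrt(3.0) * d)

def coregion(ids1, ids2, W, kappa):
    """Entries of B = W W^T + diag(kappa) for each pair of output ids."""
    B = W @ W.T + np.diag(kappa)
    return B[np.ix_(ids1, ids2)]

# Two locations, each repeated for two outputs; last column is the output id
X = np.array([[0.0, 0.0, 0], [1.0, 0.5, 0], [0.0, 0.0, 1], [1.0, 0.5, 1]])
feature_dims, id_dim = [0, 1], 2
ids = X[:, id_dim].astype(int)

W = np.array([[1.0], [0.5]])   # hypothetical, rank 1
kappa = np.array([0.1, 0.2])   # hypothetical

# Kernel product: features kernel times the coregionalization kernel
K = matern32(X[:, feature_dims], X[:, feature_dims]) * coregion(ids, ids, W, kappa)
print(K.round(3))
```

In this sketch, the rank is simply the number of columns of W, so a low rank forces the between-output covariances in W Wᵀ to be driven by a small number of shared components.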
Fit the model
There are multiple types of Gaussian process models to choose from. I chose a variational Gaussian process (gpflow.models.VGP), since data are not sparse, and I need a quick estimation (with the variational path). The likelihood is Gaussian since my outputs are continuous. The Gaussian process is optimized with a fast and accurate Scipy optimizer.
We can get the summary of the model just by calling it.
Predict and criticize the model
GPFlow offers several ways to predict the outcomes. The predict_y method includes the noise of the Gaussian process, so we should expect some extra variance compared to predict_f.
I include the prediction (mean) and its uncertainty (sd) in our table.
Remember these were scaled predictions, and we must convert them back to their original scale.
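Back-transforming is the reverse of the standardization: the mean is multiplied by the training standard deviation and shifted by the training mean, while the predictive standard deviation is only multiplied. A minimal sketch, with hypothetical training statistics and predictions for one metal:

```python
import numpy as np

# Hypothetical training statistics of one metal, on the original scale
target_mean, target_sd = 40.0, 20.0

# Hypothetical scaled predictions: mean and variance
pred_mean_sc = np.array([-0.5, 0.0, 1.5])
pred_var_sc = np.array([0.04, 0.09, 0.25])

pred_mean = pred_mean_sc * target_sd + target_mean  # shift and rescale the mean
pred_sd = np.sqrt(pred_var_sc) * target_sd          # a spread is rescaled only

print(pred_mean, pred_sd)
```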
The RMSE (root mean square error) lets us appreciate the error on the scale of the outputs, and should be computed on the testing set.
The acceptability of these errors depends on the usage of the model.
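The RMSE per metal can be sketched with a grouped mean of squared errors. The observed and predicted values below are invented, and the column names value and pred_mean are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical testing rows, on the original scale
test = pd.DataFrame({
    "variable": ["copper", "copper", "zinc", "zinc"],
    "value": [30.0, 50.0, 400.0, 600.0],       # observed
    "pred_mean": [33.0, 46.0, 430.0, 560.0],   # predicted
})

squared_error = (test["value"] - test["pred_mean"]) ** 2
rmse = np.sqrt(squared_error.groupby(test["variable"]).mean())

print(rmse)
```

Computing one RMSE per metal keeps each error in the units and range of its own metal, which a single pooled value would hide.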
The predictions on the testing set are not so good, with some points far off. The model could be improved by adding more predictors and adjusting hyper-parameters. Coregionalization may not even be needed, and simpler models might perform just as well. In fact, a much simpler approach with generalized additive models with PyGAM returned similar results with the same features. If you want to see how I did it with GAMs, go here (the following code exports data for GAMs).
Use the model
Once we have our model, we can project it onto our map. To do this, one can use a grid of points and evaluate the metal contents on this grid. There are several ways to create grids. The Pandas package, although it does not offer a function to create grids of points, does provide an expand_grid recipe in its documentation.
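A grid helper in the spirit of that recipe can be sketched as follows. It is adapted rather than copied, and the coordinate ranges and resolution are hypothetical.

```python
import itertools
import numpy as np
import pandas as pd

def expand_grid(data_dict):
    """All combinations of the values given for each column."""
    rows = itertools.product(*data_dict.values())
    return pd.DataFrame(list(rows), columns=list(data_dict))

# Hypothetical extent, in the units of the original x and y columns
grid = expand_grid({
    "x": np.linspace(178000.0, 182000.0, 5),
    "y": np.linspace(329000.0, 334000.0, 6),
})

print(grid.shape)
print(grid.head())
```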
The problem with this grid is that it extends far beyond our domain. Indeed, predicting concentrations on the west side of the river is risky, since we only have data between the east side and the canal. I drew a polygon in QGIS, exported it in the geojson format, then performed a spatial query to filter out points outside the polygon. The grid must first be turned into a GeoPandas data frame.
Since the grid is good, I save it in its own object.
The resolution of the grid could be finer, but for this example it is better to keep it coarse. Let us now calculate the distance from the river for each point in the grid.
To obtain the desirable prediction, we have to format the data just like the data we fitted the model onto. Remember that the features were scaled, and that the first column contained the index of the target.
In machine learning, we must feed our model data in the same format as the data it was fitted on. So I need the grid to be stacked fourfold to predict the four metals, with the variable id of the metal as the last column. With more metals, this could have been elegantly implemented in a loop.
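The stacking can be sketched with a short loop over output ids, which scales to any number of metals. The grid values and the column name dist_river are hypothetical, and the id is placed last here by assumption; whatever order is used, it must match the column order used when fitting.

```python
import pandas as pd

# Hypothetical scaled grid with three points
grid_sc = pd.DataFrame({
    "x": [-1.0, 0.0, 1.0],
    "y": [0.5, 0.0, -0.5],
    "dist_river": [0.2, -0.1, 1.3],
})
n_outputs = 4  # four metals

# One copy of the grid per output, each tagged with its integer id
stacked = pd.concat(
    [grid_sc.assign(variable_id=i) for i in range(n_outputs)],
    ignore_index=True,
)

# Array in the same column order as the (assumed) training features
X_grid = stacked[["x", "y", "dist_river", "variable_id"]].values

print(X_grid.shape)
```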
We predict the mean and the variance, then store them in the geogrid_mod table (the variance is stored as a standard deviation by taking its square root).
Because I prefer metal names rather than variable ids, I run the following join operation.
The predictions and features are scaled. Better to put everything back on the original scale.
We can finally plot the spatial distribution of the predictions, as we did before, but it's more appropriate to present results with tiles instead of points.
What else could I do with Gaussian process predictions? ⏲️
That's right, mapping uncertainties! Where there are fewer observations, Gaussian processes return larger uncertainties. Let us check it out.
In this tutorial, we began by importing a tabular csv into a Pandas data frame. We transformed this table into a GeoPandas data frame, i.e. a data frame including a geometry column. By transforming our points to the correct coordinate system, we could project them on a map with Lets-plot.
We then imported a geojson file directly as a GeoPandas data frame in order to compute the distance to the river from any location. This was our first spatial query with GeoPandas.
We created a spatial model as conventionally done in machine learning, with training and testing sets and preprocessing. We used multi-output Gaussian processes with GPFlow. We then used our fitted Gaussian process to obtain predictions and criticize our model.
We finally used our model to spatially project soil pollutant concentrations and their uncertainty on a grid contained in a polygon to avoid spatial extrapolations.