# Spatial prediction of soil pollutants with multi-output Gaussian processes

## A tutorial with *GPFlow*

Note. The *PyMC* version can be found here.

*Purple sunset*, photo of the Meuse river by Ilirjan Rrumbullaku

The Meuse river flows through France, Belgium, and the Netherlands. Several industries have settled on its banks, so its bed, as well as the soils of its banks, are polluted by metals. A particular area north of Maastricht, the Netherlands, is often used as a case study for geostatistics.

**Knowledge requirements**. Intermediate levels in Python programming and data science. A beginner level in Gaussian processes can be reached by reading *A Visual Exploration of Gaussian Processes* by Görtler et al. (2019). A beginner level in positioning can be quickly reached by carefully watching this 6-minute Vox video.

**Specific objectives**. At the end of this tutorial, you will be able to

- show geographic data on a map background,
- perform spatial queries,
- model spatial attributes with Gaussian processes,
- project spatial attributes on a background map.

**Tools**. We will use Python as programming language with, as packages, *Numpy* (generic mathematical operations), *Pandas* (to handle tabular data), *GeoPandas* (to create spatial data types in tabular data, and perform spatial queries on them), *Lets-Plot* (for interactive plots) and *GPFlow* (to build our Gaussian process model).

## Multi-output Gaussian processes

You might need a multi-output prediction when you suppose that the outputs are related to each other (i.e. correlated). Such cases are common in spatial predictions. Without committing ourselves on causality, we can often assume that things at the same location might have occurred together. In this case, soil pollutants may occur together due to sediment deposits from the flow of the river, floods, occasional spills, etc. Of course, we could use a collection of individual models, but when outputs are correlated, they inform each other.

In the knowledge requirements above, I specified that you need a beginner level in Gaussian processes. Gaussian processes are cognitively intensive, but mathematically powerful. They are also, in my view, poetic. Basically, a Gaussian process ($\mathcal{GP}$) is a multivariate-normal distribution with infinite dimensions, relying on continuous functions to define the mean vector and the covariance matrix. We can set the mean to zero when data are centered to 0. Covariance functions are named *kernels*, and many different kernels are distributed by *GPFlow*. In single-output Gaussian processes, we construct a kernel on the inputs.

In other words, we estimate $y$ as a function of $x$, and this function is a Gaussian process (infinite multivariate normal) with 0 mean and a covariance defined by a function between a pair of observations $x$ and $x'$.

In multi-output Gaussian processes, we add another kernel on the outputs. There are several ways to approach such a problem with *GPFlow*. One of them is coregionalization, where a pair of $\mathcal{GP}$ functions $f_i$ and $f_j$ will have a covariance equal to our kernel times a $B$ modifier matrix.

Just like our kernel, this $B$ matrix must be positive-definite. It is convenient to define $B$ by two objects, $W$ and $\kappa$.

$$B = W W^\top + \mathrm{diag}(\kappa)$$

*GPFlow* assembles the multi-output kernel for you, as long as you provide a kernel for the features, as well as $W$ and $\kappa$.
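As a rough illustration of the coregionalization matrix, here is a minimal NumPy sketch. It is not what *GPFlow* does internally, and the values of `W` and `kappa`, as well as the choice `rank = 2`, are made up for the example. It builds $B$ from a low-rank part and a positive diagonal, then checks that the result is symmetric with strictly positive eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)

n_outputs = 4   # e.g. four metals
rank = 2        # number of columns of W (hypothetical choice)

W = rng.normal(size=(n_outputs, rank))   # made-up mixing weights
kappa = np.full(n_outputs, 0.5)          # made-up positive diagonal terms

# Coregionalization matrix: low-rank part plus a positive diagonal
B = W @ W.T + np.diag(kappa)

# B is symmetric and all its eigenvalues are strictly positive
eigenvalues = np.linalg.eigvalsh(B)
print(np.allclose(B, B.T), eigenvalues.min() > 0)
```

The low-rank part alone is only positive semi-definite when `rank` is smaller than the number of outputs; the positive diagonal is what guarantees positive-definiteness in this sketch.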

## Prepare the notebook

Nextjournal comes with Python and the package installers *pip* and *conda*. We need to install *GeoPandas*, *Lets-plot* and *GPFlow*; other packages are already installed in Nextjournal's default Python environment. Because *Lets-plot* is not available in *conda*, for simplicity we just install everything with *pip* in a Bash cell.

`# I installed the cpu version of tensorflow because this environment is running on a CPU`

`pip install geopandas lets-plot tensorflow-cpu gpflow`

The data from the Meuse river are distributed as shape files on the companion website of the book *A Practical Guide to Geostatistical Mapping*, by Tomislav Hengl. I unzipped the file, opened it in QGIS, then exported the data as a csv file.

## Import packages and data

Packages are installed, but still not loaded in our Python session.

`# Generic math`

`import numpy as np`

`# Tables`

`import pandas as pd # tabular data`

`import geopandas as gpd # spatial data`

`# Plots`

`import lets_plot as lp # interactive grammar of graphics`

`from lets_plot import tilesets # to define the map tiles`

`lp.LetsPlot.setup_html(isolated_frame=True) # so that lets-plot works in Nextjournal`

`# Modelling`

`import gpflow # obviously`

`from gpflow.utilities import to_default_float # a function to transform Python numbers to tensors for compatibility`

`from gpflow.ci_utils import ci_niter # a function to set the max number of iterations`

`import tensorflow_probability as tfp # to set priors on hyper-parameters of GP`

`# For reproducible randomness`

`np.random.seed(629545) # random.org`

Similarly, data are loaded in Nextjournal, but not to our Python session.

`meuse_df = pd.read_csv("meuse.csv")`

`meuse_df.head()`

As written on the website distributing the Meuse data set, coordinates `x` and `y` are in the `proj4: +init=epsg:28992` coordinate system. If we only cared about modelling, we wouldn't need to mind much about the coordinate system. But since we care about plotting data on a map and later performing spatial queries, I specified in *GeoPandas* that `x` and `y` are point geometries expressed in the 28992 coordinate reference system (`crs`), then transformed the geometries to longitude-latitude angular data with the universal WGS84 system, i.e. 4326 (go to epsg.io to get the numbers). Moreover, later on, we will use the distance from the river as a predictor.

`meuse_gdf = (`

` gpd.GeoDataFrame(`

` meuse_df,`

` geometry = gpd.points_from_xy(meuse_df["x"], meuse_df["y"])`

` )`

` .set_crs(crs = 28992)`

` .to_crs(crs = 4326)`

`)`

`meuse_gdf.head() # quick check`

For a quick overview of the geometry, here is a *Lets-plot* alternative to `meuse_gdf.plot()` (I will introduce *Lets-plot* shortly).

`(lp.ggplot() +`

` lp.geom_point(data = meuse_gdf) +`

` lp.coord_map()`

`)`

## Show geographic data on a map background

In the `meuse_gdf` geographic data frame, we have a column for spatial data named `geometry`. We still have `x` and `y` in the table, but they are still expressed in EPSG 28992 (the geometry is in 4326, but the original `x` and `y` have not been modified). Let's update these values.

`meuse_gdf['x'] = meuse_gdf["geometry"].x`

`meuse_gdf['y'] = meuse_gdf["geometry"].y`

If you are familiar with *ggplot* (a widely known package in the R statistical programming language), you will understand the following code block right away. Basically, the grammar of graphics implemented in *ggplot* and *Lets-plot* consists in calling the data, then stacking layers of graphical entities whose visual properties (*aesthetics* like x-y position, colors, point shape, etc.) are mapped to columns.

I wasn't aware of it when I wrote the first version of this tutorial, but Artem Smirnov notified me that *Lets-plot* is designed to work with *GeoPandas* geometries! The trick is to call the data in the graphical entity layer (earlier, I called `meuse_gdf` in the `lp.geom_point()` layer). In the following code block, I call the plot (in this special case without the data), add a map, add points filled with a color according to values in the `copper` column, then customize the fill colors.

*Lets-plot* offers a large number of map styles (tilesets). I used the `STAMEN_DESIGN_TONER` tile set because it's unobtrusive, has great contrasts, and is pleasant to my eyes.

`(lp.ggplot() +`

` lp.geom_livemap(tiles = tilesets.STAMEN_DESIGN_TONER) + `

` lp.geom_point(`

` data = meuse_gdf,`

` mapping = lp.aes(fill = "copper"),`

` size = 3, shape = 21, color = "black"`

` ) +`

` lp.scale_fill_brewer(type = "seq", palette = "Reds")`

`)`

If you know R's *ggplot*, plotting multiple metals would consist in, first, pivoting the table to long format, with all metal concentration values in a single column and the name of the metal variable in another column. Then, we would use faceting (`facet_wrap` or `facet_grid`) to obtain multiple plots. But *Lets-plot* doesn't yet have the option to plot a legend per facet (UPDATE: *Lets-plot* version 2.3.0 now has the option to scale the position, but not yet other aesthetics like colors and filling), so zinc concentrations would take up the whole fill color gradient, while cadmium points would all appear white since their concentrations are all low compared to zinc.

The trick for now (I will update the code when *Lets-plot* implements some scaling argument) is to define the plot in a function and plot each metal one after the other with `lp.GGBunch()`, which is kind of tricky since it needs the position of each plot from the top-left corner of the whole patchwork as well as its size (below, `w` and `h`), but works great.

`def plot_observations(metal):`

` p = (`

` lp.ggplot() +`

` lp.geom_livemap(tiles = tilesets.STAMEN_DESIGN_TONER) +`

` lp.scale_fill_brewer(type = 'seq', palette = 'Reds') +`

` lp.geom_point(`

` data = meuse_gdf, mapping = lp.aes(fill = metal), size = 2, shape = 21, color = 'black'`

` )`

` )`

` return p`

`w, h = 480, 320`

`offset = 15`

`bunch = lp.GGBunch()`

`bunch.add_plot(plot_observations('cadmium'), 0, 0, w, h)`

`bunch.add_plot(plot_observations('copper'), w + offset, 0, w, h)`

`bunch.add_plot(plot_observations('lead'), 0, h + offset, w, h)`

`bunch.add_plot(plot_observations('zinc'), w + offset, h + offset, w, h)`

`bunch`

## Perform spatial queries

The `dist` column of the `meuse_gdf` table contains the distance between the river and the sample. However, if we wish to evaluate the distance at any point in the domain of application of our model, in order to subsequently make our projections in space, we will have to be able to compute the distance to the river.

Note that we are working here with geographic coordinates (longitude, latitude), which we approximate as geometric data. At this scale and for our usage, it is appropriate. But on a smaller scale, close to the poles, these distances could be distorted.

We will need the shape of the river to compute the distance to it at any point. I drew a line by hand in QGIS along the center of the river. Since a line is difficult to handle in a csv, I exported it in the geojson format, which can be imported painlessly with *GeoPandas*.

`river = gpd.read_file("river.geojson")`

`lp.ggplot() + lp.geom_path(data = river) + lp.coord_map()`

I used the `distance` method of the geometries in a loop across rows, and assigned the result to a new column named `distance_river`.

`meuse_gdf = meuse_gdf.assign(`

` distance_river = [meuse_gdf["geometry"][i].distance(river["geometry"][0]) for i in range(meuse_gdf.shape[0])]`

`)`

`meuse_gdf.head()`
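To make explicit what such a distance query computes, here is a minimal pure-Python sketch of the planar distance from a point to a polyline. The helper names `point_to_segment` and `point_to_polyline` and the coordinates are hypothetical, and this is only an illustration of the geometry, not the implementation used by *GeoPandas*. The result is expressed in the units of the coordinates, so with longitude-latitude data it is in degrees.

```python
import math

def point_to_segment(p, a, b):
    """Planar distance from point p to segment [a, b]."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:          # degenerate segment: a == b
        return math.hypot(px - ax, py - ay)
    # Projection of p onto the line, clamped to the segment
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len2
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def point_to_polyline(p, vertices):
    """Smallest distance from p to any segment of the polyline."""
    return min(
        point_to_segment(p, a, b)
        for a, b in zip(vertices[:-1], vertices[1:])
    )

# Made-up river centre line and sample location (same units as the CRS)
river_line = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
d = point_to_polyline((0.5, 0.3), river_line)
print(d)
```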

Let's see what it looks like and how our computed distance compares to the `dist` column.

`dist_compare = (`

` lp.ggplot(meuse_gdf) + # call the data set`

` lp.geom_point(lp.aes(x = "dist", y = "distance_river"), size = 3, color = "#ca0020") + # a layer of points`

` lp.labs(x = "Distance in meuse.csv (km)", y = "Distance computed with GeoPandas (degrees)") # axes titles`

`)`

`dist_map = (`

` lp.ggplot() +`

` lp.geom_livemap(tiles = tilesets.STAMEN_DESIGN_TONER) +`

` lp.geom_point(`

` data = meuse_gdf, `

` mapping = lp.aes(fill = "distance_river"),`

` size = 3, shape = 21, color = "black"`

` ) +`

` lp.scale_fill_brewer(type = "seq", palette = "Reds") +`

` lp.labs(x = "Longitude", y = "Latitude")`

`)`

`w, h = 480, 480`

`offset = 15`

`bunch = lp.GGBunch() # plot both`

`bunch.add_plot(dist_compare, 0, 0, w, h)`

`bunch.add_plot(dist_map, w + offset, 0, w, h)`

`bunch`

The left plot shows that distances are comparable, and the right plot shows that points farther from the river are darker. Good!

## Model spatial attributes with Gaussian processes

This part is probably the most difficult. Our workflow is the following.

1. Assign observations to training and testing
2. Preprocess data (standardize to 0 mean and 1 variance)
3. Arrange data to a format amenable to coregionalisation with *GPFlow*
4. Create the model
5. Fit the model
6. Predict and criticize the model
7. Use the model

### Assign to train / test

This step is necessary to check whether the model predicts well at locations it has not seen. The assignment must be random, and it's usually good to check afterwards that distributions in both sets are similar (I won't do it here though). I retain 70% of the data for training, the rest for testing.

`obs_id = meuse_gdf.index`

`n_train = np.round(obs_id.shape[0] * 0.7, 0).astype("int")`

`id_train = np.random.choice(obs_id, size = n_train, replace = False)`

`id_test = obs_id[~obs_id.isin(id_train)].values`
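A quick sanity check on such a split can be sketched as follows, on made-up identifiers (the 20 observations below are hypothetical, and I use NumPy's `default_rng` generator here rather than the global seed): the training and testing sets should be disjoint and together cover every observation.

```python
import numpy as np

rng = np.random.default_rng(629545)

obs_id = np.arange(20)                      # hypothetical observation ids
n_train = int(round(obs_id.size * 0.7))
id_train = rng.choice(obs_id, size=n_train, replace=False)
id_test = np.setdiff1d(obs_id, id_train)

# The two sets must be disjoint and together cover every observation
overlap = np.intersect1d(id_train, id_test)
covered = np.union1d(id_train, id_test)
print(overlap.size, covered.size)
```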

Here I only draw the identifiers: I prefer to perform the actual split after preprocessing and arranging the data.

### Preprocess data

I identify the columns I want to predict (targets, outcomes or outputs) and those containing the information needed to predict (features).

`targets = ["cadmium", "copper", "lead", "zinc"]`

`features = ["x", "y", "distance_river"]`

The data set contains other information useful to predict soil pollutants. I could have used the elevation, flood frequency, soil type, etc. But remember that if you want to predict pollutants at any location on your map, you also need to provide the features at any location. What would you do to obtain them? Think about it 30 seconds...

⏲️

For categorical variables (e.g. `soil`), you could draw polygons in your GIS to delimit zones and perform spatial queries to assign to each point the category of the polygon enclosing it. But this would be impossible for numerical variables. In all cases (categorical and numerical), you could create a simple spatial model, e.g. a *k*-nearest neighbors model with scikit-learn, with the variable of interest (e.g. `elev`) as target and the position (`x` and `y`) as features, then use this model to predict at any location what would become a feature for the soil pollutant model. By doing so, you should fit the soil pollutant model on the outcomes of your first spatial model, not the original points.

I won't do that here, since it would just make the approach more complicated - we will stick to the position and the distance from the river as features.

`XY = meuse_gdf[targets + features]`

The `XY` table must be standardized (or scaled, `_sc`) *on the training set*. Most tutorials online use scikit-learn tools, but since standardizing is a very simple operation, I prefer to stay explicit.

`mean_sc = XY.loc[XY.index.isin(id_train), :].mean(axis = 0)`

`std_sc = XY.loc[XY.index.isin(id_train), :].std(axis = 0)`

`XY_sc = XY.apply(lambda x: (x-mean_sc)/std_sc, axis = 1)`
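The same operation can also be written without a row-wise `apply`, relying on *Pandas* aligning a Series with the columns of a data frame. The sketch below uses a made-up table (`toy`, `train_idx` and the values are hypothetical) and also shows the back-transformation to the original scale, which is needed later on for predictions.

```python
import numpy as np
import pandas as pd

# Made-up table with one target and one feature
toy = pd.DataFrame(
    {"zinc": [100.0, 200.0, 300.0, 400.0], "x": [5.71, 5.72, 5.73, 5.74]}
)
train_idx = [0, 1, 2]  # hypothetical training rows

mean_tr = toy.loc[train_idx].mean(axis=0)
std_tr = toy.loc[train_idx].std(axis=0)   # pandas uses ddof=1 by default

# Broadcasting aligns the Series on column names, so no row-wise apply is needed
toy_sc = (toy - mean_tr) / std_tr

# Back-transform to the original scale
toy_back = toy_sc * std_tr + mean_tr
print(toy_sc)
```

Only the training rows end up with exactly zero mean and unit variance; the testing row is scaled with the training statistics, which is the point of scaling *on the training set*.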

Note that concentration values are compositional data. Usually, I would have transformed them to log-ratios before scaling, but this would have complicated this workflow. For more info, read *Why, and How, Should Geologists Use Compositional Data Analysis*, by Ricardo A. Valls (2008).

### Arrange data

Coregionalization in *GPFlow* needs a special format: the target is a single column in long format, stacking all outputs into one vector, and the feature matrix gets an extra integer column (starting from 0) identifying which output each row belongs to.

The first step to arrange our data set is to create a long format along the targets with the `melt()` method. The `variable` column contains the target identifier and the `value` column contains its value. To keep track of the index for the train/test split down the road, I "reset" the index so that it becomes a genuine column.

`XY_m = (`

` XY_sc`

` .reset_index()`

` .melt(id_vars = ["index"] + features, value_vars = targets)`

`)`

`XY_m.sample(8)`

As I wrote, the variable identifier must be an integer, while the variables in the previous table are specified as strings. We can create a helper table, then merge it with our table.

`variable_ids = pd.DataFrame(dict(`

` variable = targets,`

` variable_id = np.arange(0, len(targets))`

`))`

`variable_ids`

`XY_m = XY_m.merge(variable_ids, on = "variable")`

`XY_m.sample(8)`
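To see the whole rearrangement at a glance, here is a small sketch on a made-up two-row, two-metal table (`wide`, `toy_targets` and the values are hypothetical). It follows the same melt-then-merge steps and ends with one row per observation and per output, tagged with an integer `variable_id`.

```python
import pandas as pd

wide = pd.DataFrame({
    "x": [0.1, 0.2],
    "cadmium": [1.0, 2.0],
    "zinc": [10.0, 20.0],
})
toy_targets = ["cadmium", "zinc"]

# Wide to long: one row per (observation, target) pair
long = wide.reset_index().melt(id_vars=["index", "x"], value_vars=toy_targets)

# Helper table mapping each target name to an integer starting from 0
ids = pd.DataFrame({
    "variable": toy_targets,
    "variable_id": list(range(len(toy_targets))),
})
long = long.merge(ids, on="variable")
print(long)
```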

### Create the model

The first version of this tutorial used the *PyMC* package to compute the multi-output Gaussian process. *PyMC* is my go-to tool for everything related to Bayesian approaches. Due to a confusion about *PyMC*'s capabilities to predict multiple outputs, I reworked the tutorial with *GPFlow*, which comes with all the tools we need for the tasks at hand, without excessive complications. But it turned out that *PyMC* is fine and that I had used it right, so if you prefer *PyMC*, the notebook is available here.

Because *GPFlow* has a *tensorflow-probability* backend, we can straightforwardly specify the hyper-parameters of our model with probability distributions instead of fixed values, just like we would have done with *PyMC*.

At this point, I need my training features (`X`) and multi-output targets (`Y`) in the form of arrays, not tables (from tables to arrays with the `.values` attribute). The `variable_id` column goes in the X array. I also need the index of the column where the `variable_id` is stored (`id_dim`), the column indices of the features and the number of outputs.

`Xtr = XY_m.loc[XY_m["index"].isin(id_train), features + ["variable_id"]].values`

`Ytr = XY_m.loc[XY_m["index"].isin(id_train), ["value"]].values`

`id_dim = Xtr.shape[1] - 1 # output integer column in Xtr`

`feature_dims = np.arange(id_dim) # features columns in Xtr`

`n_outputs = np.unique(Xtr[:, id_dim]).shape[0]`

This information will allow us to create our kernel, the ❤️ of our Gaussian process model. The generic Matern32 kernel will handle the job for our features (non-multi-output). We have to tell the kernel which dimensions it applies to.

`kernel_features = gpflow.kernels.Matern32(active_dims = feature_dims)`

`kernel_features`

I wrote earlier that we use *tensorflow-probability* to define probability distributions for our hyper-parameters. We can have a look at our priors with *Scipy*. I used a half-Cauchy prior on the length scale and another on the amplitude. I tend to use Cauchy priors intensively since they have a thicker tail than normal distributions, thus allowing more extreme values. Half-Cauchy, like half-normal, is restricted to positive values (and by default starts at zero; the plot legend calls this lower bound the "mean", although it is strictly a location parameter). Such weakly informative priors allow a lot of flexibility, although they favour values close to the lower boundary of the distribution. The following plot shows priors from a half-normal and two half-Cauchys (more on the half-Cauchy with mean = 5 later). You can see the thicker tail of the half-Cauchy (mean = 0) compared to the half-normal.

`from scipy import stats`

`x = np.linspace(0, 30, 100)`

`data = pd.DataFrame({`

` 'x': np.hstack([x, x, x]),`

` 'y': np.hstack([`

` stats.halfnorm.pdf(x, 0, 10),`

` stats.halfcauchy.pdf(x, 0, 10),`

` stats.halfcauchy.pdf(x, 5, 10)`

` ]),`

` 'stat': ["Half-normal, mean = 0"] * x.size + ["Half-Cauchy, mean = 0"] * x.size + ["Half-Cauchy mean = 5"] * x.size`

`})`

`(lp.ggplot(data) +`

` lp.geom_area(`

` lp.aes(x = 'x', y = 'y', fill = 'stat'),`

` position = 'identity', color = 'white', alpha = 0.6, size = 0`

` ) +`

` lp.labs(x = "x", y = "density")`

`)`

The other half-Cauchy starts at 5 instead of 0 (its location, labelled mean in the plot). Using a hard lower limit is useful for the `lengthscales` hyper-parameter of the features kernel. While the `variance` hyper-parameter controls the amplitude of the Gaussian process (i.e. how far from the mean values are able to reach), `lengthscales` controls the wiggliness (or the frequency in signal analysis), i.e. how much features close to each other can output different values. Low `lengthscales` tend to overfit, so it's good practice to ensure that it doesn't optimize to low values, which can be the case when you have few data points. In this case, the hard lower limit of `variance` is zero and that of `lengthscales` is 0.5.
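The effect of the hard lower limit can be checked with a hand-written density. The function `halfcauchy_pdf` below is a hypothetical helper written for this illustration, assuming the usual location-scale form of the half-Cauchy density; it is not taken from *Scipy* or *tensorflow-probability*. Below the location, the density is exactly zero, so the optimizer is never rewarded for proposing such values.

```python
import numpy as np

def halfcauchy_pdf(x, loc=0.0, scale=1.0):
    """Density of a half-Cauchy distribution shifted to start at `loc`."""
    z = (np.asarray(x, dtype=float) - loc) / scale
    density = 2.0 / (np.pi * scale * (1.0 + z**2))
    return np.where(z >= 0, density, 0.0)   # no mass below `loc`

# Same location and scale as the prior on lengthscales
x = np.array([0.0, 0.25, 0.5, 5.0])
prior_density = halfcauchy_pdf(x, loc=0.5, scale=5.0)
print(prior_density)
```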

How did I choose 0.5? I first tried 0, but the regression error was close to zero on the training data set, and mapping the predictions showed local changes not supported by the testing set. I tried 1, but the prediction on both train and test was pretty bad, not allowing local changes. With 0.5, the prediction was wiggly, but credible.

`kernel_features.variance.prior = tfp.distributions.HalfCauchy(`

` to_default_float(0), to_default_float(1)`

`)`

`kernel_features.lengthscales.prior = tfp.distributions.HalfCauchy(`

` to_default_float(0.5), to_default_float(5)`

`)`

`kernel_features`

The feature kernel will be combined with a coregionalization kernel, which needs the number of outputs, a rank describing the expected correlation between outputs, and the column index it applies to, i.e. the `variable_id` column. The rank is described in the documentation of *GPFlow* as "the number of degrees of correlation between the outputs". I'm not sure what it means - if you know, please tell me!

`rank = n_outputs # We refer to the number of columns on W as ‘rank’: it is the number of degrees of correlation between the outputs.`

`kernel_mo = gpflow.kernels.Coregion(`

` output_dim = n_outputs,`

` rank = rank,`

` active_dims = [id_dim] # active_dims require a list, so don't forget the brackets!`

`)`

`kernel_mo`

Here, we find the $W$ and $\kappa$ defined in the introduction. I will put priors on both `W` and `kappa`, according to the specified dimensions (related to the `rank`).

`kernel_mo.W.prior = tfp.distributions.Normal(`

` loc = to_default_float(np.repeat(0, rank * rank).reshape(rank, rank)),`

` scale = to_default_float(np.repeat(5, rank * rank).reshape(rank, rank))`

`)`

`kernel_mo.kappa.prior = tfp.distributions.HalfCauchy(`

` loc = to_default_float(np.repeat(0, rank)),`

` scale = to_default_float(np.repeat(5, rank))`

`)`

`kernel_mo`

The features and multi-output kernels are combined by a kernel product.

`kernel = kernel_features * kernel_mo`

`kernel`
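What this product amounts to can be sketched with plain NumPy on a toy data set. The `matern32` function, the three locations and the values of `W_toy` are all made up for illustration, and this is my reading of the product kernel rather than *GPFlow*'s code: each pair of rows gets the feature covariance multiplied by the entry of the coregionalization matrix matching the two output ids.

```python
import numpy as np

def matern32(X1, X2, variance=1.0, lengthscale=1.0):
    """Matern 3/2 covariance between the rows of X1 and X2."""
    r = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1) / lengthscale
    return variance * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

rng = np.random.default_rng(1)
n_out = 2
W_toy = rng.normal(size=(n_out, n_out))          # made-up values
B_toy = W_toy @ W_toy.T + np.diag(np.ones(n_out))

# Three locations, each observed for both outputs (last column = output id)
locs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
X_toy = np.vstack([np.column_stack([locs, np.full(3, j)]) for j in range(n_out)])

out_id = X_toy[:, -1].astype(int)
K_feat = matern32(X_toy[:, :-1], X_toy[:, :-1])  # kernel on the features only
K_out = B_toy[np.ix_(out_id, out_id)]            # B[i, j] for each pair of rows
K_toy = K_feat * K_out                           # elementwise product
print(K_toy.shape)
```

For two rows at the same location but with different outputs, the feature covariance is 1, so the entry of `K_toy` is simply the off-diagonal term of `B_toy`: this is how one metal informs another.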

### Fit the model

There are multiple types of Gaussian processes to choose from. I chose a variational Gaussian process (`gpflow.models.VGP`), since data are not sparse, and I need a quick estimation (with the variational path). The likelihood is Gaussian since my outputs are continuous. The Gaussian process is optimized with *Scipy*'s fast and accurate `L-BFGS-B` algorithm.

`gp_coreg = gpflow.models.VGP(`

` (Xtr, Ytr),`

` kernel = kernel,`

` likelihood = gpflow.likelihoods.Gaussian()`

`)`

`# fit the covariance function parameters`

`maxiter = ci_niter(2000)`

`gpflow.optimizers.Scipy().minimize(`

` gp_coreg.training_loss,`

` gp_coreg.trainable_variables,`

` options=dict(maxiter = maxiter),`

` method="L-BFGS-B"`

`)`

We can get the summary of the model just by calling it.

`gp_coreg`

### Predict and criticize the model

*GPFlow* offers several ways to predict the outcomes. The `predict_y` method includes the noise of the Gaussian process, so we should expect some extra variance compared to `predict_f`.

`yhat_mean, yhat_var = gp_coreg.predict_y(XY_m.loc[:, features + ["variable_id"]].values)`
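With a Gaussian likelihood, the relation between the two kinds of prediction is simple enough to write by hand. The numbers below are made up, and `noise_var` stands for the variance of the likelihood, which is an assumption of this sketch rather than a value taken from the fitted model: the mean is unchanged and the noise variance is added to the variance of the latent function.

```python
import numpy as np

f_mean = np.array([0.2, -1.0])      # made-up latent means
f_var = np.array([0.05, 0.30])      # made-up latent variances
noise_var = 0.1                     # hypothetical likelihood variance

y_mean = f_mean                     # the Gaussian noise has zero mean
y_var = f_var + noise_var           # the noise variance adds up
print(y_mean, y_var)
```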

I include the prediction (mean) and its uncertainty (sd) in our `XY_m` table.

`XY_m["yhat_meansc"] = yhat_mean.numpy()`

`XY_m["yhat_sdsc"] = np.sqrt(yhat_var)`

`XY_m["tr_te"] = "train"`

`XY_m.loc[XY_m["index"].isin(id_test), "tr_te"] = "test"`

Remember these were scaled predictions, and we must convert them back to their original scale.

`XY_m["value_o"] = 0.0 # float placeholders, filled in the loop below`

`XY_m["yhat_meano"] = 0.0`

`XY_m["yhat_sdo"] = 0.0`

`for metal in targets:`

` XY_m.loc[XY_m["variable"] == metal, "value_o"] = XY_m.loc[XY_m["variable"] == metal, "value"] * std_sc[metal] + mean_sc[metal]`

` XY_m.loc[XY_m["variable"] == metal, "yhat_meano"] = XY_m.loc[XY_m["variable"] == metal, "yhat_meansc"] * std_sc[metal] + mean_sc[metal]`

` XY_m.loc[XY_m["variable"] == metal, "yhat_sdo"] = XY_m.loc[XY_m["variable"] == metal, "yhat_sdsc"] * std_sc[metal]`

`XY_m.head()`

The RMSE (root mean square error) expresses the error on the scale of the outputs, and should be computed on the testing set.

`rmse = (`

` XY_m`

` .loc[XY_m["index"].isin(id_test), :]`

` .query("tr_te == 'test'")`

` .groupby('variable')`

` .apply(func = lambda x: np.mean((x.value_o - x.yhat_meano)**2)**0.5)`

` .round(2)`

`)`

`rmse`
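The computation itself can be sketched on a made-up table (the values in `toy_pred` are hypothetical): the squared errors are averaged per metal, then the square root brings the result back to the scale of the concentrations.

```python
import numpy as np
import pandas as pd

toy_pred = pd.DataFrame({
    "variable": ["cadmium", "cadmium", "zinc", "zinc"],
    "value_o": [1.0, 3.0, 100.0, 200.0],
    "yhat_meano": [2.0, 3.0, 130.0, 160.0],
})

# Squared error per row, averaged per metal, then square-rooted
sq_err = (toy_pred["value_o"] - toy_pred["yhat_meano"]) ** 2
toy_rmse = np.sqrt(sq_err.groupby(toy_pred["variable"]).mean())
print(toy_rmse)
```

Because each RMSE is in the units of its own metal, values are not comparable across metals: an error of a few units matters much more for cadmium than for zinc.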

The acceptability of these errors depends on the usage of the model.

`(lp.ggplot(XY_m, lp.aes(x = "value", y = "yhat_meansc")) +`

` lp.facet_grid(x = "tr_te", y = "variable") +`

` lp.geom_abline(intercept = 0, slope = 1, color = 'red', size = 1) +`

` lp.geom_point()`

`)`

The testing prediction is not so good, with some off points. The model could be improved by adding more predictors and adjusting hyper-parameters. Coregionalization may not be needed and simpler models would perform better. In fact, a much simpler approach with generalized additive models with *PyGAM* returned similar results with the same features. If you want to see how I did it with GAMs, go here - (the following code exports data for GAMs).

`import pickle`

`with open("results/to_gam.pkl", "wb") as file:`

` pickle.dump([XY, XY_sc, id_train, id_test, mean_sc, std_sc, targets, features], file)`

### Use the model

Once we have our model, we can project it onto our map. To do this, one can use a grid of points, and evaluate the metal contents on this grid. There are several ways to create grids. The *Pandas* package, although not offering a function to create grids of points, does provide an `expand_grid` recipe in its documentation.

`import itertools`

`def expand_grid(data_dict):`

` rows = itertools.product(*data_dict.values())`

` return pd.DataFrame.from_records(rows, columns=data_dict.keys())`

`gridres = 0.0012 # grid resolution : could be finer`

`x_grid = np.arange(5.71, 5.77, gridres)`

`y_grid = np.arange(50.95, 51.0, gridres)`

`squaregrid = expand_grid(`

` {"x": x_grid, "y": y_grid}`

`)`

`(lp.ggplot(squaregrid, lp.aes('x', 'y')) +`

` lp.geom_livemap(tiles = tilesets.STAMEN_DESIGN_TONER) +`

` lp.geom_point()`

`)`
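One of the other ways to create such a grid is NumPy's `meshgrid`. The sketch below uses made-up coordinates (`xs`, `ys`) and should give the same rows, in the same order, as the `itertools.product` recipe.

```python
import numpy as np
import pandas as pd

xs = np.array([5.71, 5.72, 5.73])   # made-up longitudes
ys = np.array([50.95, 50.96])       # made-up latitudes

# indexing="ij" makes x vary slowest, like itertools.product(xs, ys)
xx, yy = np.meshgrid(xs, ys, indexing="ij")
toy_grid = pd.DataFrame({"x": xx.ravel(), "y": yy.ravel()})
print(toy_grid)
```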

The problem with this grid is that it goes far away from our domain. Indeed, predicting concentrations on the West side of the river is risky, since we only have data between the East side and the canal. I drew a polygon with QGIS, exported it in the geojson format, then performed a spatial query to filter out points outside the polygon. The grid must first be turned to a *GeoPandas* data frame.

`geogrid = gpd.GeoDataFrame(`

` squaregrid,`

` geometry = gpd.points_from_xy(squaregrid["x"], squaregrid["y"], crs = "epsg:4326")`

`)`

`geo_polygon = gpd.read_file("polygon.geojson")`

`grid_filter = np.array([geogrid["geometry"][i].within(geo_polygon["geometry"][0]) for i in range(geogrid.shape[0])])`

`(lp.ggplot(geogrid.iloc[grid_filter, :]) +`

` lp.geom_livemap(tiles = tilesets.STAMEN_DESIGN_TONER) +`

` lp.geom_point(lp.aes(x = 'x', y = 'y'))`

`)`

Since the grid is good, I save it in its own object.

`geogrid_f = geogrid.iloc[grid_filter, :]`

The resolution of the grid could be finer, but for this example it is better to keep it coarse. Let us now calculate the distance from the river for each of the points in the grid.

`geogrid_f = geogrid_f.assign(`

` distance_river = [geogrid_f["geometry"][i].distance(river["geometry"][0]) for i in geogrid_f.index]`

`)`

`(lp.ggplot(geogrid_f) +`

` lp.geom_livemap(tiles = tilesets.STAMEN_DESIGN_TONER) +`

` lp.geom_tile(lp.aes('x', 'y', fill = 'distance_river'), alpha = 0.8) +`

` lp.scale_fill_brewer(type = 'seq', palette = 'Reds')`

`)`

To obtain the desired prediction, we have to format the data just like the data the model was fitted on. Remember that the features were scaled, and that the last column contained the index of the target.

`geogrid_sc = (geogrid_f[features] - mean_sc[features]) / std_sc[features]`

In machine learning, we must feed our model data in the same format as the data it was fitted to. So I need the grid to be stacked four times to predict the four metals, with the variable id of the metal as the last column. With more metals, this could have been elegantly implemented in a loop.

`geogrid_mod = pd.concat([geogrid_sc, geogrid_sc, geogrid_sc, geogrid_sc]).reset_index(drop = True)`

`geogrid_varid = pd.Series(np.repeat([0, 1, 2, 3], geogrid_sc.shape[0]), name = "variable_id")`

`geogrid_mod = pd.concat([geogrid_mod, geogrid_varid], axis = 1)`

`geogrid_mod`
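The loop version mentioned above could look like the following sketch, written on a made-up three-point grid (`toy_features` is hypothetical): one copy of the grid per output, each tagged with its integer id as the last column.

```python
import pandas as pd

toy_features = pd.DataFrame({
    "x": [0.0, 1.0, 2.0],
    "distance_river": [0.3, 0.1, 0.2],
})
toy_targets = ["cadmium", "copper", "lead", "zinc"]

# One copy of the grid per output, each tagged with its integer id (last column)
stacked = pd.concat(
    [toy_features.assign(variable_id=i) for i in range(len(toy_targets))],
    ignore_index=True,
)
print(stacked)
```

The number of copies follows the length of the list of targets, so adding a metal does not require touching this code.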

We predict the mean and the variance, then store them in the `geogrid_mod` table (the variance is stored as a standard deviation by taking its square root).

`yhat_meangrid, yhat_vargrid = gp_coreg.predict_y(geogrid_mod.values)`

`geogrid_mod["yhat_meansc"] = yhat_meangrid.numpy()`

`geogrid_mod["yhat_sdsc"] = np.sqrt(yhat_vargrid)`

Because I prefer metal names rather than variable ids, I run the following join operation.

`geogrid_mod = geogrid_mod.merge(variable_ids, on = "variable_id")`

The predictions and features are scaled. Better put everything back to original scale.

`for metal in targets:`

` geogrid_mod.loc[geogrid_mod["variable"] == metal, "yhat_meano"] = geogrid_mod.loc[geogrid_mod["variable"] == metal, "yhat_meansc"] * std_sc[metal] + mean_sc[metal]`

` geogrid_mod.loc[geogrid_mod["variable"] == metal, "yhat_sdo"] = geogrid_mod.loc[geogrid_mod["variable"] == metal, "yhat_sdsc"] * std_sc[metal]`

`for var in features:`

` geogrid_mod.loc[:, var] = geogrid_mod.loc[:, var] * std_sc[var] + mean_sc[var]`

We can finally plot the spatial distribution of the predictions, as we did before, but it's more appropriate to present results with tiles instead of points.

`def plot_predictions(metal):`

` p = (lp.ggplot() +`

` lp.geom_livemap(tiles = tilesets.STAMEN_DESIGN_TONER) +`

` lp.scale_fill_brewer(type = 'seq', palette = 'Reds') +`

` lp.geom_tile(data = geogrid_mod.loc[geogrid_mod["variable"] == metal, :], mapping = lp.aes('x', 'y', fill = 'yhat_meano'), alpha = 0.85)`

` )`

` return p`

`w, h = 480, 320`

`offset = 15`

`bunch = lp.GGBunch()`

`bunch.add_plot(plot_predictions('cadmium'), 0, 0, w, h)`

`bunch.add_plot(plot_predictions('copper'), w + offset, 0, w, h)`

`bunch.add_plot(plot_predictions('lead'), 0, h + offset, w, h)`

`bunch.add_plot(plot_predictions('zinc'), w + offset, h + offset, w, h)`

`bunch`

What could I also do with Gaussian process predictions? ⏲️

That's right, mapping uncertainties! Where there are fewer observations, Gaussian processes return larger uncertainties. Let us check it out.

`def plot_predictions(metal):`

` p = (lp.ggplot() +`

` lp.geom_livemap(tiles = tilesets.STAMEN_DESIGN_TONER) +`

` lp.scale_fill_brewer(type = 'seq', palette = 'Reds') +`

` lp.geom_tile(data = geogrid_mod.loc[geogrid_mod["variable"] == metal, :], mapping = lp.aes('x', 'y', fill = 'yhat_sdo'), alpha = 0.85)`

` )`

` return p`

`w, h = 480, 320`

`offset = 15`

`bunch = lp.GGBunch()`

`bunch.add_plot(plot_predictions('cadmium'), 0, 0, w, h)`

`bunch.add_plot(plot_predictions('copper'), w + offset, 0, w, h)`

`bunch.add_plot(plot_predictions('lead'), 0, h + offset, w, h)`

`bunch.add_plot(plot_predictions('zinc'), w + offset, h + offset, w, h)`

`bunch`

## Recapitulation

In this tutorial, we began by importing a tabular *csv* file into a *Pandas* data frame. We transformed this table to a *GeoPandas* data frame, i.e. a data frame including a geometry column. By transforming the projection of our points to the correct coordinate system, we could project our points on a map with *Lets-plot*.

We then imported a *geojson* file directly as a *GeoPandas* data frame with the aim of computing the distance to the river from any location. This was our first spatial query with *GeoPandas*.

We created a spatial model as conventionally done in machine learning, with training and testing sets and preprocessing. We used multi-output Gaussian processes with *GPFlow*. We then predicted from our fitted Gaussian process to obtain our predictions and criticize our model.

We finally used our model to spatially project soil pollutants concentrations and their uncertainty on a grid contained in a polygon to avoid spatial extrapolations.