Spatial prediction of soil pollutants with multi-output Gaussian processes
A tutorial with GPFlow
Note. The PyMC version can be found here.
Purple sunset, photo of the Meuse river by Ilirjan Rrumbullaku
The Meuse river flows through France, Belgium, and the Netherlands. Several industries have settled on its banks, so that its bed, as well as the soils of its banks, are polluted by metals. A particular area north of Maastricht, in the Netherlands, is often used as a case study in geostatistics.
Knowledge requirements. Intermediate levels in Python programming and data science. A beginner level in Gaussian processes can be reached by reading A Visual Exploration of Gaussian Processes by Görtler et al. (2019). A beginner level in positioning can be quickly reached by carefully watching this six-minute Vox video.
Specific objectives. At the end of this tutorial, you will be able to
show geographic data on a map background,
perform spatial queries,
model spatial attributes with Gaussian processes,
project spatial attributes on a background map.
Tools. We will use Python as the programming language with, as packages, Numpy (generic mathematical operations), Pandas (to handle tabular data), GeoPandas (to add spatial data types to tabular data and perform spatial queries on them), Lets-Plot (for interactive plots) and GPFlow (to build our Gaussian process model).
Multi-output Gaussian processes
You might need a multi-output prediction when you suppose that the outputs are related to each other (i.e. correlated). Such cases are common in spatial prediction. Without committing ourselves to causality, we can often assume that things at the same location might have occurred together. In this case, soil pollutants may occur together due to sediment deposits from the flow of the river during floods, occasional spills, etc. Of course, we could use a collection of individual models, but when outputs are correlated, they inform each other.
In the knowledge requirements above, I specified that you need a beginner level in Gaussian processes. Gaussian processes are cognitively intensive, but mathematically powerful. They are also, in my view, poetic. Basically, a Gaussian process ($\mathcal{GP}$) is a multivariate normal distribution with infinite dimensions, relying on continuous functions to define the mean vector and the covariance matrix. We can set the mean to zero when data are centered on 0. Covariance functions are named kernels, and many different kernels are distributed with GPFlow. In single-output Gaussian processes, we construct a kernel on the inputs.
$$y = f(x), \qquad f(x) \sim \mathcal{GP}\big(0,\, k(x, x')\big)$$

In other words, we estimate $y$ as a function of $x$, and this function is a Gaussian process (infinite multivariate normal) with 0 mean and a covariance defined by a function $k$ between a pair of observations $x$ and $x'$.
In multi-output Gaussian processes, we add another kernel on the outputs. There are several ways to approach such a problem with GPFlow. One of them is coregionalization, where a pair of output functions $f_i(x)$ and $f_j(x')$ will have a covariance equal to our kernel times a $B$ modifier matrix.
$$\mathrm{cov}\big(f_i(x), f_j(x')\big) = k(x, x')\, B[i, j]$$

Just like our kernel, this $B$ matrix must be positive-definite. It is convenient to define $B$ by two objects, $W$ and $\kappa$.
$$B = W W^\top + \mathrm{diag}(\kappa)$$

GPFlow assembles the multi-output kernel for you, as long as you provide a kernel for the features, as well as $W$ and $\kappa$.
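To make this concrete, here is a minimal numpy sketch (with made-up dimensions and values, not taken from this tutorial) of how such a $B$ matrix is assembled and why it is positive-definite.

import numpy as np
n_outputs, rank = 4, 2                 # made-up dimensions for illustration
W = np.random.normal(size = (n_outputs, rank))
kappa = np.array([0.5, 1.0, 0.2, 0.8]) # made-up positive values
B = W @ W.T + np.diag(kappa)           # the coregionalization matrix
print(np.linalg.eigvalsh(B) > 0)       # all eigenvalues positive: B is positive-definite

Since $WW^\top$ is positive semi-definite and $\mathrm{diag}(\kappa)$ adds a positive value to each diagonal entry, the sum is positive-definite for any $W$ and positive $\kappa$.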
Prepare the notebook
Nextjournal comes with Python and the package installers pip and conda. We need to install GeoPandas, Lets-plot and GPFlow - other packages are already installed in Nextjournal's default Python environment. Because Lets-plot is not available in conda, for simplicity we just install everything with pip in a Bash cell.
# I installed the cpu version of tensorflow because this environment is running on a CPU
pip install geopandas lets-plot tensorflow-cpu gpflow
The data from the Meuse river are distributed as shapefiles on the companion website of the book A Practical Guide to Geostatistical Mapping, by Tomislav Hengl. I unzipped the file, opened it in QGIS, then exported the data as a csv file.
Import packages and data
Packages are installed, but still not loaded in our Python session.
# Generic math
import numpy as np
# Tables
import pandas as pd # tabular data
import geopandas as gpd # spatial data
# Plots
import lets_plot as lp # interactive grammar of graphics
from lets_plot import tilesets # to define the map tiles
lp.LetsPlot.setup_html(isolated_frame=True) # so that lets-plot works in Nextjournal
# Modelling
import gpflow # obviously
from gpflow.utilities import to_default_float # a function to transform Python numbers to tensors for compatibility
from gpflow.ci_utils import ci_niter # a function to set the max number of iterations
import tensorflow_probability as tfp # to set priors on hyper-parameters of the GP
# For predictable randomness
np.random.seed(629545) # random.org
Similarly, data are loaded in Nextjournal, but not yet into our Python session.
meuse_df = pd.read_csv("meuse.csv")
meuse_df.head()
As written on the website distributing the Meuse data set, the coordinates x and y are in the proj4 +init=epsg:28992 coordinate system. If we only cared about modelling, we wouldn't need to mind much about the coordinate system. But since we care about plotting data on a map and later performing spatial queries, I specified in GeoPandas that x and y are point geometries expressed in the 28992 coordinate reference system (crs), then transformed the geometries to longitude-latitude angular data with the universal WGS84 system, i.e. 4326 (go to epsg.io to get the numbers). Moreover, later on, we will use the distance from the river as a predictor.
meuse_gdf = (
gpd.GeoDataFrame(
meuse_df,
geometry = gpd.points_from_xy(meuse_df["x"], meuse_df["y"])
)
.set_crs(crs = 28992)
.to_crs(crs = 4326)
)
meuse_gdf.head() # quick check
For a quick overview of the geometry, here is a Lets-plot alternative to meuse_gdf.plot() - I will introduce Lets-plot shortly.
(lp.ggplot() +
lp.geom_point(data = meuse_gdf) +
lp.coord_map()
)
Show geographic data on a map background
In the meuse_gdf geographic data frame, we have a column for spatial data named geometry. We still have x and y in the table, but they are still expressed in EPSG 28992 (the geometry is in 4326, but the original x and y have not been modified). Let's update these values.
meuse_gdf['x'] = meuse_gdf["geometry"].x
meuse_gdf['y'] = meuse_gdf["geometry"].y
If you are familiar with ggplot (a widely known package in the R statistical programming language), you will understand the following code block right away. Basically, the grammar of graphics implemented in ggplot and Lets-plot consists in calling the data and stacking layers of graphical entities, some mapping elements (aesthetics like x-y position, colors, point shape, etc.) to columns.
I wasn't aware of it when I wrote the first version of this tutorial, but Artem Smirnov notified me that Lets-plot is designed to work with GeoPandas geometries! The trick is to call the data in the graphical entity layer (earlier, I called meuse_gdf in the lp.geom_point() layer). In the following code block, I call the plot (in this special case without the data), add a map, add points filled with a color according to values in the copper column, then customize the colors of the filling.
Lets-plot proposes a large number of map styles (tilesets). I used the STAMEN_DESIGN_TONER tile set because it's unobtrusive, has great contrast, and is pleasant to my eyes.
(lp.ggplot() +
lp.geom_livemap(tiles = tilesets.STAMEN_DESIGN_TONER) +
lp.geom_point(
data = meuse_gdf,
mapping = lp.aes(fill = "copper"),
size = 3, shape = 21, color = "black"
) +
lp.scale_fill_brewer(type = "seq", palette = "Reds")
)
If you know R's ggplot, plotting multiple metals would consist in, first, pivoting the table into long format, with all metal concentration values in a single column and the name of the metal variable in another column. Then, we would use faceting (facet_wrap or facet_grid) to obtain multiple plots. But Lets-plot doesn't yet have the option to set a legend per facet (UPDATE: Lets-plot version 2.3.0 now has the option to scale the position, but not yet other aesthetics like colors and filling), so zinc concentrations would take up the whole fill color gradient, while cadmium points would all appear white, since their concentrations are low compared to zinc.
The trick for now (I will update the code when Lets-plot implements some scaling argument) is to define the plot in a function and plot each metal one after the other with lp.GGBunch(), which is kind of tricky since it needs each plot's position (computed below from w and h) relative to the top-left corner of the whole patchwork, but works great.
def plot_observations(metal):
p = (
lp.ggplot() +
lp.geom_livemap(tiles = tilesets.STAMEN_DESIGN_TONER) +
lp.scale_fill_brewer(type = 'seq', palette = 'Reds') +
lp.geom_point(
data = meuse_gdf, mapping = lp.aes(fill = metal), size = 2, shape = 21, color = 'black'
)
)
return p
w, h = 480, 320
offset = 15
bunch = lp.GGBunch()
bunch.add_plot(plot_observations('cadmium'), 0, 0, w, h)
bunch.add_plot(plot_observations('copper'), w + offset, 0, w, h)
bunch.add_plot(plot_observations('lead'), 0, h + offset, w, h)
bunch.add_plot(plot_observations('zinc'), w + offset, h + offset, w, h)
bunch
Perform spatial queries
The dist column of the meuse_gdf table contains the distance between the river and the sample. However, if we wish to evaluate the distance at any point in the domain of application of our model, in order to subsequently make our projections in space, we must be able to compute the distance to the river ourselves.
Note that we are working here with geographic coordinates (longitude, latitude), which we treat as if they were planar geometric data. At this scale and for our usage, it is appropriate. But over larger areas, or closer to the poles, these distances would be distorted.
We will need the shape of the river to compute the distance to it at any point. I drew a line by hand in QGIS along the center of the river. Since a line is difficult to handle in a csv, I exported it in the geojson format, which can be imported painlessly with GeoPandas.
river = gpd.read_file("river.geojson")
lp.ggplot() + lp.geom_path(data = river) + lp.coord_map()
I used the distance function of GeoPandas in a loop across rows, and assigned a new column named distance_river.
meuse_gdf = meuse_gdf.assign(
distance_river = [meuse_gdf["geometry"][i].distance(river["geometry"][0]) for i in range(meuse_gdf.shape[0])]
)
meuse_gdf.head()
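As an aside, GeoPandas can broadcast this distance computation without an explicit loop, and, if you need the distances in metres rather than in the degrees of the 4326 system, you can compute them on copies re-projected to the metric EPSG:28992 system. A hedged sketch (the column names here are mine, used for illustration only):

# Same result as the loop above, vectorized (distances in degrees)
meuse_gdf["distance_river_deg"] = meuse_gdf.geometry.distance(river["geometry"][0])

# Distances in metres, computed in the original projected system
meuse_metric = meuse_gdf.to_crs(28992)
river_metric = river.to_crs(28992)
meuse_gdf["distance_river_m"] = meuse_metric.geometry.distance(river_metric["geometry"][0]).values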
Let's see what it looks like and how our computed distance compares to the dist column.
dist_compare = (
lp.ggplot(meuse_gdf) + # call the data set
lp.geom_point(lp.aes(x = "dist", y = "distance_river"), size = 3, color = "#ca0020") + # a layer of points
lp.labs(x = "Distance in meuse.csv (km)", y = "Distance computed with GeoPandas (m)") # axes titles
)
dist_map = (
lp.ggplot() +
lp.geom_livemap(tiles = tilesets.STAMEN_DESIGN_TONER) +
lp.geom_point(
data = meuse_gdf,
mapping = lp.aes(fill = "distance_river"),
size = 3, shape = 21, color = "black"
) +
lp.scale_fill_brewer(type = "seq", palette = "Reds") +
lp.labs(x = "Longitude", y = "Latitude")
)
w, h = 480, 480
offset = 15
bunch = lp.GGBunch() # plot both
bunch.add_plot(dist_compare, 0, 0, w, h)
bunch.add_plot(dist_map, w + offset, 0, w, h)
bunch
The left plot shows that distances are comparable, and the right plot shows that points get darker as they move away from the river. Good!
Model spatial attributes with Gaussian processes
This part is probably the most difficult. Our workflow is the following.
Assign observations to training and testing
Preprocess data (standardize to 0 mean and 1 variance)
Arrange data to a format amenable to coregionalization with GPFlow
Create the model
Fit the model
Predict and criticize the model
Use the model
Assign to train / test
This is a necessary step to check whether the model is able to predict at locations unknown to it. The split must be done with randomness, and it's usually good to check afterwards that the distributions in both sets are similar (I won't dwell on it here). I retain 70% of the data for training, the rest for testing.
obs_id = meuse_gdf.index
n_train = np.round(obs_id.shape[0] * 0.7, 0).astype("int")
id_train = np.random.choice(obs_id, size = n_train, replace = False)
id_test = obs_id[~obs_id.isin(id_train)].values
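If you do want to run the distribution check mentioned above, a minimal sketch could look like this, comparing summary statistics of the four metal columns (the same columns used as targets below) between the two sets.

metals = ["cadmium", "copper", "lead", "zinc"]
print(meuse_gdf.loc[meuse_gdf.index.isin(id_train), metals].describe())
print(meuse_gdf.loc[meuse_gdf.index.isin(id_test), metals].describe())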
I prefer to perform the split operation after preprocessing and arrangements.
Preprocess data
I identify the columns I want to predict (targets, outcomes or outputs) and those containing the information needed to predict (features).
targets = ["cadmium", "copper", "lead", "zinc"]
features = ["x", "y", "distance_river"]
The data set contains other information useful for predicting soil pollutants. I could have used the elevation, flood frequency, soil type, etc. But remember that if you want to predict pollutants at any location on your map, you also need to provide the features at any location. What would you do to obtain them? Think about it for 30 seconds...
⏲️
For categorical variables (e.g. soil), you could draw polygons in your GIS to delimit zones and perform spatial queries to assign the category based on enclosure in a polygon. But this would be impossible for numerical variables. In both cases (categorical and numerical), you could create a simple spatial model, e.g. a k-nearest neighbors model with scikit-learn with the variable of interest (e.g. elev) as target and the position (x and y) as features, then use this model to predict, at any location, what would become a feature for the soil pollutant model. By doing so, you should fit the soil pollutant model on the outcomes of your first spatial model, not on the original points.
I won't do that here, since it would just make the approach more complicated - we will stick to the position and the distance from the river as features.
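Still, for the record, here is a minimal sketch of that k-nearest-neighbors idea with scikit-learn (scikit-learn is not installed in the cell at the top of this notebook, and the settings below are illustrative, not tuned).

from sklearn.neighbors import KNeighborsRegressor

# Fit a simple spatial interpolator of elevation from the observed points...
elev_knn = KNeighborsRegressor(n_neighbors = 5)
elev_knn.fit(meuse_gdf[["x", "y"]], meuse_gdf["elev"])

# ...then predict elevation at any new location (e.g. on a prediction grid)
# and use these predictions, not the original points, as a feature:
# elev_at_new_locations = elev_knn.predict(new_xy)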
XY = meuse_gdf[targets + features]
The XY table must be standardized (or scaled, _sc) using statistics from the training set. Most tutorials online use scikit-learn tools, but since standardizing is a very simple operation, I prefer to stay explicit.
mean_sc = XY.loc[XY.index.isin(id_train), :].mean(axis = 0)
std_sc = XY.loc[XY.index.isin(id_train), :].std(axis = 0)
XY_sc = XY.apply(lambda x: (x-mean_sc)/std_sc, axis = 1)
Note that concentration values are compositional data. Usually, I would have transformed them to log-ratios before scaling, but this would have complicated this workflow. For more info, read Why, and How, Should Geologists Use Compositional Data Analysis, by Ricardo A. Valls (2008).
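For the curious, here is a minimal sketch of a centered log-ratio (clr) transform, not used in this tutorial; it assumes that the four metals (plus, implicitly, the rest of the soil) form the composition of interest.

# Each concentration is divided by the geometric mean of its row before taking the log
comp = meuse_gdf[["cadmium", "copper", "lead", "zinc"]]
geom_mean = np.exp(np.log(comp).mean(axis = 1))
clr = np.log(comp.divide(geom_mean, axis = 0))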
Arrange data
Coregionalization in GPFlow needs a special format: the target is a single column of a matrix in long format, stacking all outputs into a vector and adding an integer (starting from 0) to the feature matrix identifying which output each row belongs to.
The first step to arrange our data set is to create a long format along the targets with the melt() method. The variable column contains the target identifier and the value column contains its value. To keep track of the index for the train/test split down the road, I "reset" the index so that it becomes a genuine column.
XY_m = (
XY_sc
.reset_index()
.melt(id_vars = ["index"] + features, value_vars = targets)
)
XY_m.sample(8)
As I wrote, the variable must be an integer, while the variables in the previous table are specified as strings. We can create a helper table, then merge it with our initial table.
variable_ids = pd.DataFrame(dict(
variable = targets,
variable_id = np.arange(0, len(targets))
))
variable_ids
XY_m = XY_m.merge(variable_ids, on = "variable")
XY_m.sample(8)
Create the model
The first version of this tutorial used the PyMC package to compute the multi-output Gaussian process. PyMC is my go-to tool for everything related to Bayesian approaches. Due to a confusion about PyMC's capabilities to predict multiple outputs, I reworked the tutorial with GPFlow, which comes with all the tools we need for the tasks at hand, without excessive complications. But it turned out that PyMC is fine and I used it right, so if you prefer PyMC, the notebook is available here.
Because GPFlow has a tensorflow-probability backend, we can straightforwardly specify the hyper-parameters of our model with probability distributions instead of fixed values, just like we would have done with PyMC.
At this point, I need my training features (X) and multi-output targets (Y) in the form of arrays, not tables (from tables to arrays with the .values attribute). The variable_id column goes in the X array. I also need the index of the column where the variable_id is stored (id_dim), the column indices of the features, and the number of outputs.
Xtr = XY_m.loc[XY_m["index"].isin(id_train), features + ["variable_id"]].values
Ytr = XY_m.loc[XY_m["index"].isin(id_train), ["value"]].values
id_dim = Xtr.shape[1] - 1 # index of the output integer column in Xtr
feature_dims = np.arange(id_dim) # indices of the feature columns in Xtr
n_outputs = np.unique(Xtr[:, id_dim]).shape[0]
This information allows us to create our kernel, the ❤️ of our Gaussian process model. The generic Matern32 kernel will handle the job for our features (the non-multi-output part). We have to tell the kernel on which dimensions it applies.
kernel_features = gpflow.kernels.Matern32(active_dims = feature_dims)
kernel_features
I wrote earlier that we use tensorflow-probability to define probability distributions for our hyper-parameters. We can have a look at our priors with Scipy. I used a half-Cauchy prior on the length scale and another on the amplitude. I tend to use Cauchy priors intensively since they have a thicker tail than normal distributions, thus allowing more extreme values. A half-Cauchy, like a half-normal, restricts values to be positive (and by default starts at a location of zero). Such weakly informative priors allow a lot of flexibility, although favouring values close to the lower boundary of the distribution. The following plot shows priors from a half-normal and two half-Cauchys (more on the half-Cauchy with location 5 later). You can see the thicker tail on the half-Cauchy (location 0) compared to the half-normal.
from scipy import stats
x = np.linspace(0, 30, 100)
data = pd.DataFrame({
'x': np.hstack([x, x, x]),
'y': np.hstack([
stats.halfnorm.pdf(x, 0, 10),
stats.halfcauchy.pdf(x, 0, 10),
stats.halfcauchy.pdf(x, 5, 10)
]),
'stat': ["Half-normal, mean = 0"] * x.size + ["Half-Cauchy, mean = 0"] * x.size + ["Half-Cauchy mean = 5"] * x.size
})
(lp.ggplot(data) +
lp.geom_area(
lp.aes(x = 'x', y = 'y', fill = 'stat'),
position = 'identity', color = 'white', alpha = 0.6, size = 0
) +
lp.labs(x = "x", y = "density")
)
The other half-Cauchy has a location of 5 instead of 0. Using a hard lower limit is useful for the lengthscales hyper-parameter of the features kernel. While the variance hyper-parameter controls the amplitude of the Gaussian process (i.e. how far from the mean values are able to reach), lengthscales controls the wiggliness (or the frequency, in signal-analysis terms), i.e. how differently features close to each other can output values. Low lengthscales tend to overfit, so it's good practice to make sure it doesn't optimize towards low values, which can happen when you have few data points. In this case, the hard lower limit of variance is zero and that of lengthscales is 0.5.
How did I choose 0.5? I first tried 0, but the regression error was close to zero on the training data set and mapping the predictions showed local changes not supported by the testing set. I tried 1, but the prediction on both train and test was pretty bad, not allowing local changes. With 0.5, the prediction was wiggly, but credible.
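To see what "hard lower limit" means here: the half-Cauchy density is exactly zero below its location parameter, so the prior gives no support to values under it. A quick check with Scipy (already imported above):

print(stats.halfcauchy.pdf([0.2, 0.5, 1.0], loc = 0.5, scale = 5))
# the first value is 0: no density, hence no prior support, below 0.5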
kernel_features.variance.prior = tfp.distributions.HalfCauchy(
to_default_float(0), to_default_float(1)
)
kernel_features.lengthscales.prior = tfp.distributions.HalfCauchy(
to_default_float(0.5), to_default_float(5)
)
kernel_features
The feature kernel will be combined with a coregionalized kernel, which needs the number of outputs, a rank describing the expected correlation between outputs, and the column index it applies to, i.e. the variable_id column. The rank is described in the GPFlow documentation as "the number of degrees of correlation between the outputs". I'm not sure what that means - if you know, please tell me!
rank = n_outputs # We refer to the number of columns on W as ‘rank’: it is the number of degrees of correlation between the outputs.
kernel_mo = gpflow.kernels.Coregion(
output_dim = n_outputs,
rank = rank,
active_dims = [id_dim] # active_dims require a list, so don't forget the brackets!
)
kernel_mo
Here, we find the $W$ and $\kappa$ defined in the introduction. I will put priors on both W and kappa, according to the specified dimensions (related to the rank).
kernel_mo.W.prior = tfp.distributions.Normal(
loc = to_default_float(np.repeat(0, rank * rank).reshape(rank, rank)),
scale = to_default_float(np.repeat(5, rank * rank).reshape(rank, rank))
)
kernel_mo.kappa.prior = tfp.distributions.HalfCauchy(
loc = to_default_float(np.repeat(0, rank)),
scale = to_default_float(np.repeat(5, rank))
)
kernel_mo
The features and multi-output kernels are combined by a kernel product.
kernel = kernel_features * kernel_mo
kernel
Fit the model
There are multiple Gaussian process model types to choose from. I chose a variational Gaussian process (gpflow.models.VGP), since the data set is small enough to go without a sparse approximation and I need a quick estimation (via the variational path). The likelihood is Gaussian since my outputs are continuous. The Gaussian process is optimized with Scipy's fast and accurate L-BFGS-B algorithm.
gp_coreg = gpflow.models.VGP(
(Xtr, Ytr),
kernel = kernel,
likelihood = gpflow.likelihoods.Gaussian()
)
# fit the covariance function parameters
maxiter = ci_niter(2000)
gpflow.optimizers.Scipy().minimize(
gp_coreg.training_loss,
gp_coreg.trainable_variables,
options=dict(maxiter = maxiter),
method="L-BFGS-B"
)
We can get the summary of the model just by calling it.
gp_coreg
Predict and criticize the model
GPFlow offers several ways to predict the outcomes. The predict_y method includes the noise of the Gaussian process, so we should expect some extra variance compared to predict_f.
yhat_mean, yhat_var = gp_coreg.predict_y(XY_m.loc[:, features + ["variable_id"]].values)
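If you are curious about the difference, here is a quick hedged check against predict_f: with a Gaussian likelihood, the predict_y variance should be the predict_f variance plus the likelihood's noise variance, while the means match.

fhat_mean, fhat_var = gp_coreg.predict_f(XY_m.loc[:, features + ["variable_id"]].values)
print(np.allclose(yhat_mean.numpy(), fhat_mean.numpy()))  # same means
print(np.max(yhat_var.numpy() - fhat_var.numpy()))        # roughly the likelihood's noise variance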
I include the prediction (mean) and its uncertainty (sd) in our XY_m table.
XY_m["yhat_meansc"] = yhat_mean.numpy()
XY_m["yhat_sdsc"] = np.sqrt(yhat_var)
XY_m["tr_te"] = "train"
XY_m.loc[XY_m["index"].isin(id_test), "tr_te"] = "test"
Remember these were scaled predictions, and we must convert them back to their original scale.
XY_m["value_o"] = 0
XY_m["yhat_meano"] = 0
XY_m["yhat_sdo"] = 0
for metal in targets:
XY_m.loc[XY_m["variable"] == metal, "value_o"] = XY_m.loc[XY_m["variable"] == metal, "value"] * std_sc[metal] + mean_sc[metal]
XY_m.loc[XY_m["variable"] == metal, "yhat_meano"] = XY_m.loc[XY_m["variable"] == metal, "yhat_meansc"] * std_sc[metal] + mean_sc[metal]
XY_m.loc[XY_m["variable"] == metal, "yhat_sdo"] = XY_m.loc[XY_m["variable"] == metal, "yhat_sdsc"] * std_sc[metal]
XY_m.head()
The RMSE (root mean square error) allows us to appreciate the error on the scale of the outputs, and should be computed on the testing set.
rmse = (
XY_m
.loc[XY_m["index"].isin(id_test), :]
.query("tr_te == 'test'")
.groupby('variable')
.apply(func = lambda x: np.mean((x.value_o - x.yhat_meano)**2)**0.5)
.round(2)
)
rmse
The acceptability of these errors depends on the usage of the model.
(lp.ggplot(XY_m, lp.aes(x = "value", y = "yhat_meansc")) +
lp.facet_grid(x = "tr_te", y = "variable") +
lp.geom_abline(intercept = 0, slope = 1, color = 'red', size = 1) +
lp.geom_point()
)
The testing predictions are not so good, with some points far off the identity line. The model could be improved by adding more predictors and adjusting hyper-parameters. Coregionalization may not even be needed, and simpler models might perform better. In fact, a much simpler approach with generalized additive models with PyGAM returned similar results with the same features. If you want to see how I did it with GAMs, go here - (the following code exports data for GAMs).
import pickle
with open("results/to_gam.pkl", "wb") as file:
pickle.dump([XY, XY_sc, id_train, id_test, mean_sc, std_sc, targets, features], file)
Use the model
Once we have our model, we can project it onto our map. To do this, one can use a grid of points and evaluate the metal contents on this grid. There are several ways to create grids. The Pandas package, although not offering a function to create grids of points, does provide an expand_grid recipe in its documentation.
import itertools
def expand_grid(data_dict):
rows = itertools.product(*data_dict.values())
return pd.DataFrame.from_records(rows, columns=data_dict.keys())
gridres = 0.0012 # grid resolution : could be finer
x_grid = np.arange(5.71, 5.77, gridres)
y_grid = np.arange(50.95, 51.0, gridres)
squaregrid = expand_grid(
{"x": x_grid, "y": y_grid}
)
(lp.ggplot(squaregrid, lp.aes('x', 'y')) +
lp.geom_livemap(tiles = tilesets.STAMEN_DESIGN_TONER) +
lp.geom_point()
)
The problem with this grid is that it extends far beyond our domain. Indeed, predicting concentrations on the west side of the river is risky, since we only have data between the east side and the canal. I drew a polygon with QGIS, exported it in the geojson format, then performed a spatial query to filter out points outside the polygon. The grid must first be turned into a GeoPandas data frame.
geogrid = gpd.GeoDataFrame(
squaregrid,
geometry = gpd.points_from_xy(squaregrid["x"], squaregrid["y"], crs = "epsg:4326")
)
geo_polygon = gpd.read_file("polygon.geojson")
grid_filter = np.array([geogrid["geometry"][i].within(geo_polygon["geometry"][0]) for i in range(geogrid.shape[0])])
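As with the distance computation earlier, this point-in-polygon test can also be vectorized; a hedged one-liner that should be equivalent to the loop above:

grid_filter = geogrid.within(geo_polygon["geometry"][0]).values  # boolean array, same result as the loop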
(lp.ggplot(geogrid.iloc[grid_filter, :]) +
lp.geom_livemap(tiles = tilesets.STAMEN_DESIGN_TONER) +
lp.geom_point(lp.aes(x = 'x', y = 'y'))
)
Since the grid is good, I save it in its own object.
geogrid_f = geogrid.iloc[grid_filter, :]
The resolution of the grid could be finer, but for this example it is better to keep it coarse. Let us now compute the distance from the river for each of the points in the grid.
geogrid_f = geogrid_f.assign(
distance_river = [geogrid_f["geometry"][i].distance(river["geometry"][0]) for i in geogrid_f.index]
)
(lp.ggplot(geogrid_f) +
lp.geom_livemap(tiles = tilesets.STAMEN_DESIGN_TONER) +
lp.geom_tile(lp.aes('x', 'y', fill = 'distance_river'), alpha = 0.8) +
lp.scale_fill_brewer(type = 'seq', palette = 'Reds')
)
To obtain the desired prediction, we have to format the data just like the data we fitted the model on. Remember that the features were scaled, and that the last column contained the integer id of the target.
geogrid_sc = (geogrid_f[features] - mean_sc[features]) / std_sc[features]
In machine learning, we must feed our model data in the same format as the data it was fitted to. So I need the grid to be stacked four times to predict the four metals, with the variable id of the metal as the last column. With more metals, this could have been elegantly implemented in a loop (see the sketch after the next code block).
geogrid_mod = pd.concat([geogrid_sc, geogrid_sc, geogrid_sc, geogrid_sc]).reset_index(drop = True)
geogrid_varid = pd.Series(np.repeat([0, 1, 2, 3], geogrid_sc.shape[0]), name = "variable_id")
geogrid_mod = pd.concat([geogrid_mod, geogrid_varid], axis = 1)
geogrid_mod
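As mentioned above, with more metals the stacking could be written as a comprehension over the targets; a hedged equivalent of the block above (the name geogrid_mod_loop is mine):

geogrid_mod_loop = pd.concat(
    [geogrid_sc.assign(variable_id = i) for i in range(len(targets))],
    ignore_index = True
)
# geogrid_mod_loop holds the same content as the geogrid_mod built manually above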
We predict the mean and the variance, then store them in the geogrid_mod table (the variance stored as a standard deviation by taking the square root).
yhat_meangrid, yhat_vargrid = gp_coreg.predict_y(geogrid_mod.values)
geogrid_mod["yhat_meansc"] = yhat_meangrid.numpy()
geogrid_mod["yhat_sdsc"] = np.sqrt(yhat_vargrid)
Because I prefer metal names rather than variable ids, I run the following join operation.
geogrid_mod = geogrid_mod.merge(variable_ids, on = "variable_id")
The predictions and features are scaled. Better to put everything back on the original scale.
for metal in targets:
geogrid_mod.loc[geogrid_mod["variable"] == metal, "yhat_meano"] = geogrid_mod.loc[geogrid_mod["variable"] == metal, "yhat_meansc"] * std_sc[metal] + mean_sc[metal]
geogrid_mod.loc[geogrid_mod["variable"] == metal, "yhat_sdo"] = geogrid_mod.loc[geogrid_mod["variable"] == metal, "yhat_sdsc"] * std_sc[metal]
for var in features:
geogrid_mod.loc[:, var] = geogrid_mod.loc[:, var] * std_sc[var] + mean_sc[var]
We can finally plot the spatial distribution of the predictions, as we did before, but it's more appropriate to present results with tiles instead of points.
def plot_predictions(metal):
p = (lp.ggplot() +
lp.geom_livemap(tiles = tilesets.STAMEN_DESIGN_TONER) +
lp.scale_fill_brewer(type = 'seq', palette = 'Reds') +
lp.geom_tile(data = geogrid_mod.loc[geogrid_mod["variable"] == metal, :], mapping = lp.aes('x', 'y', fill = 'yhat_meano'), alpha = 0.85)
)
return p
w, h = 480, 320
offset = 15
bunch = lp.GGBunch()
bunch.add_plot(plot_predictions('cadmium'), 0, 0, w, h)
bunch.add_plot(plot_predictions('copper'), w + offset, 0, w, h)
bunch.add_plot(plot_predictions('lead'), 0, h + offset, w, h)
bunch.add_plot(plot_predictions('zinc'), w + offset, h + offset, w, h)
bunch
What else could I do with Gaussian process predictions? ⏲️
That's right, mapping uncertainties! Where there are fewer observations, Gaussian processes return larger uncertainties. Let us check it out.
def plot_predictions(metal):
p = (lp.ggplot() +
lp.geom_livemap(tiles = tilesets.STAMEN_DESIGN_TONER) +
lp.scale_fill_brewer(type = 'seq', palette = 'Reds') +
lp.geom_tile(data = geogrid_mod.loc[geogrid_mod["variable"] == metal, :], mapping = lp.aes('x', 'y', fill = 'yhat_sdo'), alpha = 0.85)
)
return p
w, h = 480, 320
offset = 15
bunch = lp.GGBunch()
bunch.add_plot(plot_predictions('cadmium'), 0, 0, w, h)
bunch.add_plot(plot_predictions('copper'), w + offset, 0, w, h)
bunch.add_plot(plot_predictions('lead'), 0, h + offset, w, h)
bunch.add_plot(plot_predictions('zinc'), w + offset, h + offset, w, h)
bunch
Recapitulation
In this tutorial, we began by importing a tabular csv into a Pandas data frame. We transformed this table into a GeoPandas data frame, i.e. a data frame including a geometry column. By transforming the projection of our points to the correct coordinate system, we could project our points onto a map with Lets-plot.
We then imported a geojson file directly as a GeoPandas data frame, with the aim of computing the distance to the river from any location. This was our first spatial query with GeoPandas.
We created a spatial model as conventionally done in machine learning, with training and testing sets and preprocessing. We used multi-output Gaussian processes with GPFlow. We then predicted from our fitted Gaussian process to obtain our predictions and criticize our model.
We finally used our model to spatially project soil pollutant concentrations and their uncertainty on a grid contained in a polygon, to avoid spatial extrapolations.