🌱 Machine learning for optimization in agriculture with R
Abstract. Modelling how crops respond to fertilizer is a challenging task involving knowledge in plant and soil science, professional experience, reliable data and mathematical models. In this notebook, I present mathematical techniques that allow optimizing yields through plant nutrition adapted to soil, weather and farm management conditions.
Nextjournal. The platform you are currently looking at is Nextjournal. You can run your code online without having to install a coding environment on your computer, so we can spend more time focusing on the content of the course than on installing everything you need on your system. To use this notebook as a coding environment, sign up (or sign in), then, from the top of this page, hit Remix. You will see the status of your remixed notebook at the top when you run your first cell. As in the well-known Jupyter notebooks, a Nextjournal notebook lets you mix code and formatted text: an approach named literate programming. Code cells can be run by clicking on the blue Play triangle ▶️ at the top-right of a cell, or by focusing on the code cell with the cursor and hitting Shift+Enter.
The R 4.1 environment that I created (Nextjournal provides one, but based on an older version of R) comes with all the packages we need for the workshop. You can export this notebook to the ipynb format, then convert it to the Rmd format for RStudio using the rmarkdown::convert_ipynb() function.
The machine should take a minute to boot.
library("tidyverse") # generic package for data handling and plots
library("tidymodels") # generic modeling, but caret is also used
library("caret") # generic machine learning
library("ggpubr") # publiation theme for plots
library("patchwork") # grids of plots
library("janitor") # clean data
library("mgcv") # generalized additive models
Note that the new R modeling approach with tidymodels works well with most modeling techniques. However, the algorithm I will use here (Gaussian process) is not yet included. Although it could be included with some hacking, to keep things simple I used the older but more complete approach, with the caret package.
Introduction
Optimizing the nutrient inputs of agricultural plants is usually done by fitting a predetermined curve through dose-response data, then computing the dose with an optimal outcome, usually yield considering the costs of fertilizers and their ecological impacts. Many equations can be used to perform this task, each one with its pros and cons (Dhakal & Lange, 2021). One of these equations is the Mitscherlich equation.
$$y = A \left( 1 - e^{-R (E + x)} \right)$$
With yield as the outcome $y$ and fertilization dose as the feature $x$, we obtain a curve with diminishing yield returns with increasing dosage.
# Mitscherlich function
mitscherlich <- function(x, A, R, E) {
y <- A * (1 - exp(-R * (E + x)))
return(y)
}
# Define parameters
A <- 40
R <- 0.02
E <- 30
# Create an x sequence and compute y
yield <- tibble(x = seq(-E, 250, 1)) |>
  mutate(
    y = mitscherlich(x = x, A = A, R = R, E = E),
    is_neg = x < 0
  )
rate <- tibble(
x = c(32, 43, 43),
y = c(27, 27, 30)
)
# Plot
options(repr.plot.width = 6, repr.plot.height = 4)
ggplot(yield, aes(x = x, y = y)) +
geom_vline(xintercept = 0, colour = "black", size = 0.3) +
geom_hline(yintercept = 0, colour = "black", size = 0.3) +
geom_line(aes(linetype = is_neg), show.legend = FALSE) +
geom_hline(yintercept = A, colour = "grey70", linetype = 2) +
geom_polygon(data = rate, fill = NA, colour = 'grey30') +
geom_text(x = 60, y = 28.5, label = "R : Rate") +
annotate(
geom = "curve", x = 5, y = 5, xend = -E, yend = 0,
curvature = .3, arrow = arrow(length = unit(3, "mm"))
) +
geom_text(x = 37, y = 5, label = "- E : Environment") +
geom_text(x = 35, y = 39, label = "A : Asymptote") +
labs(x = 'Dose', y = 'Yield') +
theme_pubr()
Although convenient, such an approach can be overly simplistic if one wants to include effects other than a single dose. In Parent et al. (2017), I proposed to model each parameter of the Mitscherlich equation with a linear model. Accordingly, each parameter was a linear combination of soil, weather, cultivar and management variables. An additive random effect on the A parameter added some randomness.
$$A = X \beta_A + Z u + \varepsilon_A, \quad R = X \beta_R + \varepsilon_R, \quad E = X \beta_E + \varepsilon_E$$
$A$, $R$ and $E$ are parameters of the Mitscherlich equation, $\beta_A$, $\beta_R$ and $\beta_E$ are vectors of fixed-effect coefficients, $X$ is the fixed-effect model matrix, $u$ is the vector of random-effect coefficients, $Z$ is the random-effect model matrix and the $\varepsilon$ terms are modelling errors associated with their respective subscripts.
I have used such a model for statistical hypotheses (next figure, left) and to draw a response curve (next figure, right) for given site conditions. I could draw random curve samples from the random effects to compute a distribution of economically optimal doses and its associated yield distribution.
The details of this model can be found in the Supplementary material of the Parent et al. (2017) paper.
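To give an idea of what such a model looks like in code, here is a minimal sketch, not the actual code from the paper (which modelled each parameter as a linear combination of covariates): a Mitscherlich fit with intercept-only parameters and a random effect on A, on a hypothetical simulated dose-response data set.
library("nlme")
# Simulate a hypothetical dose-response data set with a site effect on A
set.seed(101)
dose_response <- expand.grid(site = factor(1:8), dose = seq(0, 200, 25))
A_site <- 40 + rnorm(8, sd = 4)
dose_response$yield <- A_site[dose_response$site] *
  (1 - exp(-0.02 * (30 + dose_response$dose))) + rnorm(nrow(dose_response), sd = 1)
# Sketch only: Mitscherlich fit with a random effect on the asymptote A
mitscherlich_nlme <- nlme(
  yield ~ A * (1 - exp(-R * (E + dose))),
  data = dose_response,
  fixed = A + R + E ~ 1,
  random = A ~ 1 | site,
  start = c(A = 40, R = 0.02, E = 30)
)
summary(mitscherlich_nlme)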
My first dissatisfaction with this model resulted from its focus on statistics, not predictability. To evaluate the capacity of a model to predict, a test data set must be held out from the fitting, then used to assess the performance of the model on unseen data. A second dissatisfaction resulted from the constraint of a single element for dosage, nitrogen, whereas at least phosphorus and potassium should also be included using a multi-outcome model. So I designed a planar Mitscherlich response instead of a curve (predicting yield from the sum of three Mitscherlich equations), but another thing bugged me: the Mitscherlich model was imposed. Indeed, why do we a priori need polynomial, linear-plateau or Mitscherlich models?
Conditioning a machine learning model
In Coulibali et al. (2020), Z. Coulibali generated probabilistic NPK response surfaces with machine learning, without constraining the shape of the response. He fitted machine learning models with yield and other quality measures as outcomes (or endogenous variables), and site conditions (fertilizer dosage, weather, soil genesis, soil texture, preceding crops, etc.) as features (or exogenous variables). Once he had his models, he generated a grid of increasing doses with fixed uncontrollable values (historical weather, soil parameters, preceding crops). He could then assess how yields would respond to combinations of dosages given the characteristics of a site. In Parent et al. (2020), I pushed the idea further and generalized it not only to doses, but to every variable we can manage on a farm.
Indeed, machine learning aims at finding patterns in data. You have an array $X$ (the features) and an array $Y$ (the outcomes), then you train a model on $X$ so that it guesses $Y$. If the model is good enough for practical usage, you can guess any output $Y$ from any $X$ you input, and maybe predict the uncertainty of the prediction.
Now if you need to figure out how $Y$ would respond to other $X$ values, you can create new $X$ values, then predict $Y$. You could also fix some columns in $X$ and vary others, a process named conditioning.
In agriculture, we can fix uncontrollable features like weather and soil, vary plant nutrition (e.g. fertilizer doses), then find the highest yield (or any economic and ecological optimum) for the field conditions.
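To make the idea concrete, here is a toy sketch on simulated data (made-up variables and a simple linear model, not the machine learning models used later): we fix the uncontrollable features at the values of one site and vary the dose over a grid.
# Toy sketch of conditioning on simulated data (not the blueberry data)
set.seed(123)
toy <- tibble(
  dose = runif(200, 0, 200),
  rain = runif(200, 300, 700),
  clay = runif(200, 5, 40)
) |>
  mutate(yield = 30 * (1 - exp(-0.02 * (dose + 20))) + 0.01 * rain - 0.1 * clay + rnorm(200))
toy_model <- lm(yield ~ poly(dose, 2) + rain + clay, data = toy)
# Conditioning: rain and clay are fixed at the values of one site, dose varies
one_site <- toy[1, ]
conditioned <- tibble(dose = seq(0, 200, 10), rain = one_site$rain, clay = one_site$clay)
conditioned$predicted_yield <- predict(toy_model, newdata = conditioned)
conditioned[which.max(conditioned$predicted_yield), ] # dose with the highest predicted yield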
Theory
Understanding several mathematical concepts is essential to get started with modelling. I suppose that the reader is familiar with basic mathematics, linear algebra, geometry, probability distributions and the R programming language. If you are not sure, I recommend reading on anyway. I will add links to reliable online resources in English as we go along.
🔺 Compositional data analysis
A row vector, $x = [x_1, x_2, \ldots, x_D]$, is defined as a D-part composition when all its components are strictly positive real numbers and they carry only relative information. - Pawlowsky-Glahn et al. (2011)
In other words, data are compositional when they can be expressed as proportions of a whole by the closure operation $\mathcal{C}$. This operation assures that each row of a compositional table sums up to the same amount $\kappa$, usually 1 or 100%.
$$\mathcal{C}(x) = \left[ \frac{\kappa \, x_1}{\sum_{i=1}^{D} x_i}, \frac{\kappa \, x_2}{\sum_{i=1}^{D} x_i}, \ldots, \frac{\kappa \, x_D}{\sum_{i=1}^{D} x_i} \right]$$
The closure operation is inherent to compositional data, but induces special properties.
Scale invariance. The structure of data is given by the ratios between components, no matter how the total sum is defined. Changing the unit of measurement also doesn't affect the ratios.
Subcompositional coherence. A subcomposition of $x$ is created when at least one part is removed from $x$. Including or excluding a component shouldn't affect the structure of compositional data.
Permutation invariance. The order of the columns doesn't affect the structure of the data.
Moreover, the normality assumption on compositional data risks biasing analyses. An obvious symptom of wrong statistics occurs when confidence regions span outside the 0-100% range. But even when statistics stay in the expected range, the normality assumption might incorrectly represent the data. The following plot shows the distribution of sediment data represented by a normal distribution in red and a logistic-normal (i.e. compositional) distribution in blue.
These properties of compositional data are generally addressed by transforming the data before the analysis. There are several ways to do so: you might encounter additive log-ratios, centered log-ratios, isometric log-ratios and summed log-ratios. Here, we will use centered log-ratios (CLRs), where $g(x)$ is the geometric mean of the parts of $x$.
$$clr(x) = \left[ \ln \frac{x_1}{g(x)}, \ln \frac{x_2}{g(x)}, \ldots, \ln \frac{x_D}{g(x)} \right]$$
There are a few ways to compute the geometric mean, but the most intuitive one, at least to me, is to transform the data with a log, compute the mean, then back-transform the mean with the exponential, i.e.
$$g(x) = \exp\left( \frac{1}{D} \sum_{i=1}^{D} \ln x_i \right) = \left( \prod_{i=1}^{D} x_i \right)^{1/D}$$
Indeed, the geometric mean is just the mean computed on the log scale, brought back to the original scale.
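A quick numerical check on a made-up three-part composition shows the closure, the geometric mean, and the fact that CLRs sum to zero and are scale invariant.
# Numerical check on a made-up composition
parts <- c(sand = 60, silt = 30, clay = 10)
closed <- parts / sum(parts)                # closure to 1
g <- exp(mean(log(closed)))                 # geometric mean
clr_parts <- log(closed / g)                # centered log-ratios
round(sum(clr_parts), 10)                   # CLRs sum to (numerically) zero
all.equal(clr_parts, log(parts / exp(mean(log(parts))))) # closure doesn't change the CLRs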
📏 Mahalanobis distance
When you measure a distance between two points with the Pythagorean equation, i.e. by taking the square root of the sum of squares of each edge, you are computing the Euclidean distance (blue circle below). The Mahalanobis distance is a generalisation of the Euclidean distance that considers the multivariate structure of the data (red ellipse below).
set.seed(643263)
eucl_dist <- 2.7 # radius of the blue circle, comparable to sqrt(qchisq(p = 0.95, df = 2))
n <- 100
covmat <- matrix(
c(1, 0.6,
0.6, 1),
nrow = 2, ncol = 2
)
mu <- c(x = 10, y = 5)
maha_example <- data.frame(MASS::mvrnorm(n, mu = mu, Sigma = covmat))
mu2 <- colMeans(maha_example)
ggplot(maha_example, aes(x = x, y = y)) +
geom_point() +
geom_point(x = mu2[1], y = mu2[2], size = 8, pch = 21, fill = "grey50") +
stat_ellipse(level = 0.95, colour = "#e31a1c", size = 1) +
annotate(
"path",
x = mu2[1] + eucl_dist * cos(seq(0, 2 * pi, length.out=100)),
y = mu2[2] + eucl_dist * sin(seq(0, 2 * pi, length.out=100)),
colour = "#1f78b4", size = 1
) +
coord_equal() +
theme_pubr()
The Mahalanobis distance has many uses. Here, we will use it to define the scope of the model. Beyond a certain Mahalanobis distance from the center of our features, we will consider that our model is extrapolating, i.e. is likely to return a prediction for a point the model was not, or poorly, trained to predict.
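As a quick illustration with the simulated points above, the squared Mahalanobis distances to the center can be compared to a chi-squared quantile to flag points lying outside the red ellipse.
# Squared Mahalanobis distances of the simulated points to their center
d2 <- mahalanobis(
  x = maha_example,
  center = colMeans(maha_example),
  cov = cov(maha_example)
)
mean(d2 > qchisq(p = 0.95, df = 2)) # roughly 5% of the points fall outside the 95% ellipse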
🤖 Machine learning
Machine learning is a generic term for phenomenological models used to predict outcomes from data. In contrast with mechanistic models, which rely on mechanisms or rules (Newton's laws, thermodynamics, nutrient flows, system dynamics) fed with data to obtain answers, phenomenology relies on empirical evidence to detect patterns or rules from answers associated with data.
Machine learning covers algorithms ranging from simple linear regressions to complex deep neural networks. The workflow of a modeling process should look like the following chart. Don't worry about the complexity, I will explain it right away.
The Data blue box consists of collecting data from experiments or observations, then storing them in your favorite data structure (from a simple csv to a complex SQL database). Many introductions to machine learning take this part for granted, but data acquisition is often the most difficult, time-consuming and costly part of the analysis. Since data are likely to be messy and incomplete, the arrange data part is often time-consuming as well. Then data are split into a training set and a testing set. Why? Because to evaluate the capacity of a model to predict, we must use data which were not used to fit the model. The test set must not be used in modelling. I mean, at all. Not even to adjust hyperparameters afterwards. For this, you should rather use a cross-validation technique like k-folds - machine learning packages usually come with cross-validation capabilities. This split is usually random, but there are exceptions like time series (where the split is at some point in time) and experimental designs (you might split by site or block and not by observation).
Once data are split, we prepare them for the Model blue box. Complex transformations are called feature engineering. This includes creating new variables from observations, compositional data transformations, scaling to zero mean and unit variance, etc. These transformations must be based on the training set only (i.e. scaling to the mean of the training data).
Then we fit the model. As said, many algorithms are available. For instance, a simple technique called k-nearest neighbors (KNN) can be used to predict outcomes from given features based on what happened in similar cases: in its simplest form, KNN returns a prediction of the outcome by averaging the outputs of the k points which are closest to the input features, as sketched below. I quickly covered some algorithms in another workshop. Since we can't learn them all, my suggestion is to have a minimal understanding of the major algorithms, try some, and if the one you try is of interest, try to understand it more deeply. An understanding of the maths behind a model is necessary to understand its outcomes.
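Here is a bare-bones sketch of KNN regression (not the implementation of any package), just to show the averaging idea.
# Bare-bones KNN regression: average the outcomes of the k closest points
knn_predict <- function(X_train, y_train, x_new, k = 5) {
  distances <- sqrt(rowSums(sweep(X_train, 2, x_new)^2)) # Euclidean distances to x_new
  neighbors <- order(distances)[1:k]                     # indices of the k closest points
  mean(y_train[neighbors])                               # average their outcomes
}
# Toy usage on simulated data
set.seed(42)
X <- matrix(rnorm(100 * 2), ncol = 2)
y <- X[, 1]^2 + X[, 2] + rnorm(100, sd = 0.1)
knn_predict(X, y, x_new = c(0.5, -0.5), k = 5)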
For this workshop, I will start with a technique named generalized additive models (GAMs), which is similar to polynomial regression, but instead of cumulating weighted exponents, we cumulate basis functions to create splines.
In multivariate regressions, numerous splines can be added up. Since splines show the effect of a variable along its gradient, GAMs remain instructive to inspect compared to black-box machine learning algorithms. Noam Ross has created a very interesting free online course on GAMs. Gavin Simpson gave a 3-hour lecture on GAMs in R, which I recommend following if you need to understand the mathematics behind GAMs.
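As a minimal example on simulated data (not the data used later), a single smooth term replaces the fixed polynomial terms of a classical regression.
# Minimal GAM on simulated data: a thin plate spline smooth of x (mgcv's default)
set.seed(7)
toy_gam <- tibble(x = runif(200, 0, 10)) |>
  mutate(y = sin(x) + 0.1 * x + rnorm(200, sd = 0.3))
fit_toy_gam <- gam(y ~ s(x, k = 10), data = toy_gam, method = "REML")
plot(fit_toy_gam, residuals = TRUE, pch = 1) # the estimated smooth with partial residuals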
I haven't used GAMs in the Parent et al. (2020) paper, but rather Gaussian processes, which, like GAMs, return smooth responses. Quickly, a Gaussian process is a multivariate normal (i.e. Gaussian) distribution so extreme that it covers infinitely many dimensions. To use this tool for machine learning, we marginalise this distribution on the data plus some noise. What we really model here is the covariance of the data: the shorter the Euclidean distance between points, the higher the covariance. There are a lot of great resources on Gaussian processes online, but most rely heavily on mathematics. John Cunningham explains Gaussian processes intuitively in a lecture available on YouTube. I also recommend reading Görtler et al. (2019) for an interactive and visual overview of the mathematics behind Gaussian processes.
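For the curious, here is a minimal Gaussian process regression on simulated data with kernlab (the engine behind caret's gaussprRadial used later); sigma is the inverse width of the radial kernel, and this sketch leaves most settings at their defaults.
# Minimal Gaussian process regression on simulated data with kernlab
library("kernlab")
set.seed(7)
gp_toy <- data.frame(x = runif(200, 0, 10))
gp_toy$y <- sin(gp_toy$x) + rnorm(200, sd = 0.3)
fit_gp <- gausspr(y ~ x, data = gp_toy, kernel = "rbfdot", kpar = list(sigma = 1))
plot(gp_toy$x, predict(fit_gp, gp_toy), xlab = "x", ylab = "GP prediction")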
We won't cover the subject in this notebook, but if you want to get familiar with machine learning, it is difficult these days to avoid studying deep learning neural networks. There are many courses and tutorials online covering the topic, like the 3blue1brown video series. But I recommend reading Deep Learning with R, by François Chollet and J.J. Allaire (François Chollet also wrote a Python version). The first three chapters can be read online for free, where you will learn the very basics of the R interface to TensorFlow.
Once a model is obtained, we criticize it on the testing set. We can compute indices like the root-mean-square error (RMSE), but to get a grasp of the whole prediction, I suggest plotting observed outcomes on the x-axis and predictions on the y-axis. This helps to detect potential biases in the model, such as systematic underestimation, as happened with blueberries in Parent et al. (2020).
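For reference, the root-mean-square error is simply the square root of the mean squared difference between observations and predictions; a one-line helper (the code later in this notebook computes it inline) could be:
# Root-mean-square error between observed and predicted values
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))
rmse(observed = c(2, 4, 6), predicted = c(2.5, 3.5, 6.5)) # returns 0.5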
🐿️ Optimization
Whatever the algorithm, a supervised machine learning model maps the outcomes to the features. Once such a model is obtained, we can use it to find the maximum or minimum of an outcome. Several algorithms can be used to do so. In the Parent et al. (2020) paper, I used my own algorithm because I needed code that could follow the path of convergence from a field observation. This code (1) used a single starting point, e.g. the field measurement, (2) generated random points in the feature space around the starting point up to a certain distance, (3) computed the outcomes for all these points, (4) if the outcome improved, moved to that point, decreased the distance and continued to generate points and scan, but if the outcome had not improved, increased the distance and continued to scan. Like this squirrel trying to find the top of the hill (which is a Gaussian process model of the volcano data set).
You might have noticed that the domain where the squirrel is allowed to scan has the shape of an ellipse. Indeed, I restricted the scan to a certain Mahalanobis distance from the center of the domain. We will use this strategy later on. The scan-and-move idea is sketched below.
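Here is a toy sketch of that scan-and-move idea (not the code used in the paper, and without the Mahalanobis restriction): maximize a simple two-dimensional function from a starting point.
# Toy random-search hill climb (sketch only): maximize a simple 2-D function
set.seed(123)
objective <- function(p) -((p[1] - 2)^2 + (p[2] + 1)^2) # peak at (2, -1)
current <- c(0, 0)
radius <- 1
for (i in 1:100) {
  # scan: random candidates around the current point, up to a certain distance
  candidates <- sweep(matrix(runif(2 * 20, -radius, radius), ncol = 2), 2, current, "+")
  values <- apply(candidates, 1, objective)
  best <- which.max(values)
  if (values[best] > objective(current)) {
    current <- candidates[best, ] # improvement: move there and tighten the scan
    radius <- radius * 0.8
  } else {
    radius <- radius * 1.2        # no improvement: widen the scan
  }
}
round(current, 2) # should be close to (2, -1)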
Most optimization algorithms work similarly, but are more sophisticated. The optim() function, shipped with base R, provides several of the best optimization algorithms developed to date. We will prefer optim() to my Markov-chain Monte-Carlo algorithm.
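As a preview of the syntax used later, the same kind of problem can be handed to optim(), which minimizes by default (hence the sign convention in the loss function defined in the blueberry section).
# Minimal optim() call: minimize a simple 2-D function, optimum at (2, -1)
optim(
  par = c(0, 0),
  fn = function(p) (p[1] - 2)^2 + (p[2] + 1)^2,
  method = "L-BFGS-B"
)$par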
🇷 The R language
The power of a programming language comes less from its core than from its packages. Although R was created for statistical computing, the development of reliable and usable R packages for a plethora of tasks has exploded through the years, making R a world-class choice in the wider perspective of data science, competing with Python (also a very good choice for data science and the most popular language for machine learning). And just like Python, R is open source and can be downloaded freely.
If you are new to R, my suggestion is to pick an R book on a subject you like. I learned R by reading and re-reading Numerical Ecology with R, by Daniel Borcard et al. (2011). If you don't know where to start, the free online book R for Data Science (Wickham & Grolemund, 2017) is a very good choice.
⛏️ Tutorial with concrete resistance
We have data relating concrete compressive strength to its age and composition. Our task is to model the compressive strength of a concrete mixture through time. We will not perform any optimization: the goal is to focus on compositional data transformations, machine learning and conditioning.
The data set comes from Yeh (1998) and was downloaded from University of California - Irvine - Machine learning repository.
concrete <- read_csv("Concrete_Data.csv")
head(concrete)
I'm using the centered log-ratio transformation for all compositional sets. I could have used a specialized package like compositions or robCompositions, but I found it simpler to create my own functions.
geomean <- function(x) exp(mean(log(x))) # geometric mean of a vector of parts
clr <- function(x) t(apply(x, 1, function(x) log(x / geomean(x)))) # row-wise centered log-ratios
clr_inv <- function(x) { # back-transform CLRs to a closed composition
  x <- exp(x)
  x <- t(apply(x, 1, function(x) x / sum(x)))
  return(x)
}
The components are expressed as kg/m³. This is not ideal, since compositions should be expressed without dimensions (e.g. kg/kg). I could convert them to mass fractions using the density, but we don't have it. Consequently, I assumed that the density was constant among concretes and kept values as kg/m³. Instead of writing down the names of the components, I get them from the column names.
conc_comp_names <- concrete |>
select(contains("kg")) |>
names()
I isolate the parts of the composition.
conc_parts <- concrete[conc_comp_names]
head(conc_parts)
The dataset contains a lot of zeros. What happens when computing log(0)?
log(0)
Woops 😬. Indeed, compositional data analysis struggles with zeros. Lubbe et al. (2021) reviewed several methods to adjust BDL (below detection limit) data.
library("robCompositions")
conc_parts_z <- imputeBDLs(
x = conc_parts,
method = "lmrob",
dl = 0.65 * apply(conc_parts, 2, function(x) min(x[x!=0]))
)$x
head(conc_parts_z)
Then, we can compute the CLRs.
conc_clr <- clr(conc_parts_z)
I create a new compositional data frame, where compositions are replaced by CLRs.
conc_co <- concrete
conc_co[conc_comp_names] <- conc_clr
Before modeling, you should be aware that some algorithms have problems with irregular characters such as spaces, accents and brackets in column names. The clean_names function from the janitor package is a convenient way to edit column names.
conc_co <- conc_co |> clean_names()
conc_co
The Data part of the workflow is already done, except for the split. I perform the split randomly, but I use a seed (a randomly generated number) to make sure that the split is reproducible, i.e. the operation will always return the same split.
set.seed(216430) # random.org
conc_split <- initial_split(conc_co, prop = 0.7)
conc_tr <- training(conc_split)
conc_te <- testing(conc_split)
I continue with the tidymodels approach by creating a recipe, which includes a step to normalize (0 mean and unit variance) each column based on the mean and standard deviation of the training set. The prep() function prepares the recipe and the bake() function actually transforms the data.
Because tidymodels doesn't include back-transformations, I also compute the mean and standard deviation of the compressive strength. These values will allow me to back-transform the predictions later.
conc_recipe <- conc_tr |>
recipe(compressive_strength_m_pa ~ .) |>
step_normalize(all_numeric()) |>
prep()
conctr_baked <- bake(conc_recipe, conc_tr)
concte_baked <- bake(conc_recipe, conc_te)
mean_cs <- mean(conc_tr$compressive_strength_m_pa)
sd_cs <- sd(conc_tr$compressive_strength_m_pa)
In the following code block, I train a GAM. The first line gets the position of the outcome column. The following lines dynamically construct the formula. Believe it or not, I prefer using dynamic constructions rather than typing formulas. I use the default spline number and type (10 thin plate regression spline basis functions per smooth).
outcome_icol <- which(names(conc_co) == "compressive_strength_m_pa")
conc_formula <- as.formula(
paste0(
names(conctr_baked)[outcome_icol],
" ~ s(",
paste0(
names(conctr_baked)[-outcome_icol],
collapse = ") + s("
),
")")
)
conc_formula
conc_fit <- gam(conc_formula, data = conctr_baked, method = "REML")
GAM summaries look like outputs of linear models in R. P-values test whether a spline could have been drawn from a flat function with a null slope. Gavin Simpson explains how to statistically interpret this output in his lecture at around 17:10.
conc_fit |> summary()
I can also plot the individual effects.
plot(conc_fit, pages = 1)
There are many ways to define splines, including interacting components, as in the sketch below. Noam Ross and Gavin Simpson cover the subject.
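Here is a minimal sketch of an interacting smooth on simulated data (not fitted to the concrete data): a tensor product smooth lets two variables act jointly instead of additively.
# Minimal tensor product smooth on simulated data
set.seed(11)
toy_te <- tibble(x1 = runif(300), x2 = runif(300)) |>
  mutate(y = sin(2 * pi * x1) * x2 + rnorm(300, sd = 0.1))
fit_te <- gam(y ~ te(x1, x2), data = toy_te, method = "REML")
plot(fit_te, scheme = 2) # the fitted 2-D smooth shown as a heat map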
I use the predict() function to (obviously) predict with the model on the testing set (concte_baked). The prediction is transformed back to its original scale, then I plot the results. It is up to the analyst to judge whether the prediction error is acceptable.
conc_te_sc_pred <- predict(conc_fit, concte_baked)
conc_te_pred <- conc_te_sc_pred * sd_cs + mean_cs
conc_te |>
mutate(y_hat = conc_te_pred) |>
ggplot(aes(x = compressive_strength_m_pa, y = y_hat)) +
geom_abline(slope = 1, intercept = 0, colour = "red") +
geom_point() +
labs(x = "Observed (test)", y = "Predicted (test)") +
theme_pubr()
To condition the model, I keep everything constant except time (age_day) for two different cases sampled randomly. To make things short, I sample from the CLR-transformed data, but if I wrote down my own cases, I would have to transform them to CLRs, since this is what the model expects.
set.seed(731548)
age <- seq(0, 365, 5)
conditioned_c1 <- conc_co[sample(1:nrow(conc_co), 1), ]
conditioned_c1 <- conditioned_c1[rep(nrow(conditioned_c1), each = length(age)), ]
conditioned_c1["age_day"] <- age
conditioned_c2 <- conc_co[sample(1:nrow(conc_co), 1), ]
conditioned_c2 <- conditioned_c2[rep(nrow(conditioned_c2), each = length(age)), ]
conditioned_c2["age_day"] <- age
Remember that the model is trained on transformed values: I have to bake these conditions. I predict for both conditions, then do some manipulations for plotting.
# Scaling
c1_baked <- bake(conc_recipe, conditioned_c1)
c2_baked <- bake(conc_recipe, conditioned_c2)
# Create a table
cond <- tibble(age_day = conditioned_c1$age_day) |>
mutate(
pred1 = predict(conc_fit, c1_baked) * sd_cs + mean_cs,
pred2 = predict(conc_fit, c2_baked) * sd_cs + mean_cs
) |>
pivot_longer(
cols = c(pred1, pred2)
)
head(cond)
ggplot(cond, aes(x = age_day, y = value)) +
geom_line(aes(colour = name)) +
labs(x = "Age (day)", y = "Predicted compressive strength (MPa)") +
theme_pubr()
The trends are similar for both cements, and correspond to an additive shift of the individual time effect shown earlier. A GAM being additive per se, it doesn't capture all the complexity of multivariate effects unless interacting smooths are specified.
Nevertheless, is this prediction reliable? Well, it has a similar behavior to the neural network model published by Yeh (1998). Expert knowledge would be necessary to criticize our model.
🫐 Case study with blueberry
The data set
In Quebec (Canada), wild lowbush blueberries, obviously berries and obviously blue, are tasty fruits mainly produced by a plant named Vaccinium angustifolium. The data set has been published as supplementary material for a paper I wrote with colleagues (Parent et al., 2020), and comes with yields, elements in foliar tissues, elements in the soil and soil texture. Because I was also interested in weather effects, I fetched weather data from Canadian archives using the weathercan R package. Many R packages are developed for such tasks throughout the world, such as BrazilMet for Brazil. I computed weather indices over the preceding years (e.g. mean monthly precipitation in July through the 6 years before the experiment).
It is important to train the model on historical events. A machine learning model will likely be used to predict the optimal dosage from historical weather, since no one will use weather indices that haven't happened yet. Computing such indices (the Arranging data part of the workflow) is challenging and specific to the case studied.
If you are interested in what I did, you might check the supplementary material of the paper - be aware that although I tried to document it clearly, the weather part might be difficult to understand.
I also performed some imputations with the mice package. I uploaded the resulting csv on Nextjournal as a starting point for modeling.
data <- read_csv("blueberry.csv") |> drop_na()
names(data) <- str_replace_all(names(data), " ", "_")
names(data)
The purpose is to predict yield from ionomic (leaf elements), edaphic (soil) and weather features.
The modeling framework
I will proceed with the following steps.
Preprocess compositional data
Compute predictive models
Optimize NPK dosage, soil chemistry and leaf ionome
I like to define the features in a list (note that the pipe sign |> can be translated to English as "then" and is equivalent to the former %>% sign).
select <- dplyr::select # the select() function from dplyr was masked by another loaded package; put it back to dplyr's version
features <- list(
leaf = data |> select(starts_with("Leaf")) |> names(),
soil = data |> select(starts_with("Soil")) |> names(),
weather = data |> select(starts_with("weather_prev")) |> names(),
dose = data |> select(contains("dose")) |> names(),
other = c("pH")
)
outcome <- "yield"
Compositional data analysis
I must define my simplexes, one for the soil and the other for the leaves, then compute the CLRs.
# Leaf simplex
leaf_parts <- data |>
select(one_of(features$leaf))
leaf_clr <- clr(leaf_parts)
# Soil simplex
soil_parts <- data |>
select(one_of(features$soil))
soil_clr <- clr(soil_parts)
I create a new compositional data frame, where compositions are replaced by CLRs.
dataco <- data
dataco[features$soil] <- soil_clr
dataco[features$leaf] <- leaf_clr
Modeling
I remove two weather indices: historical growing degree-days, because they are highly correlated with mean temperatures, and historical freezing days, because they were poor predictors (freezing greatly affects the flowers, but is too inconsistent to be used as a predictor). I also remove columns related to the experimental design.
dataco <- dataco |>
select(
-starts_with("weather_prev.GDD"),
-starts_with("weather_prev.frozen"),
-Year, -Site, -Block
)
Since column names don't contain special characters, there is no need to invoke janitor::clean_names(): I can split the data into training and testing sets right away.
set.seed(442042)
berry_split <- initial_split(dataco, prop = 0.7)
berry_tr <- training(berry_split)
berry_te <- testing(berry_split)
I scale all the columns.
berry_recipe <- berry_tr |>
recipe(yield ~ .) |>
step_normalize(all_numeric()) |>
prep()
berrytr_baked <- bake(berry_recipe, berry_tr)
berryte_baked <- bake(berry_recipe, berry_te)
mean_y <- mean(berry_tr$yield)
sd_y <- sd(berry_tr$yield)
I launch the Gaussian process fitting with the caret modelling package, which (in 2021) is slowly being phased out in favour of the parsnip package. The caret package allows integrating k-folds cross-validation, which separates data into k approximately equal-sized sets, successively leaving one set out to fit the model and optimize the hyperparameters. We can also use a grid of tuning parameters, and caret will take care of selecting the best combination. With Gaussian processes, I only need to tune the sigma parameter, so the expand.grid function could have been left out. In Parent et al. (2020), the best sigma value was 0.075. The gaussprRadial method is a Gaussian process with a radial (or exponentiated quadratic) kernel.
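For reference, the radial kernel used by kernlab (the engine behind gaussprRadial) is $k(x, x') = \exp\left(-\sigma \, \lVert x - x' \rVert^2\right)$, so larger sigma values mean shorter correlation lengths and thus wigglier response surfaces.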
set.seed(2779781) # random.org
train_control <- trainControl(
method = "repeatedcv",
number = 10
)
set.seed(327958) # random.org
tune_gp <- expand.grid(
sigma = c(0.05, 0.075, 0.1)
)
m_fit <- train(yield ~ .,
data = berrytr_baked,
method = "gaussprRadial",
trControl = train_control,
tuneGrid = tune_gp,
verbose = FALSE
)
To inspect the model in more detail, I use it to predict on both the training and the testing sets, then compute the R² and the root-mean-square error.
berrytr_pred <- tibble(
yield = berry_tr$yield,
pred = predict(m_fit, newdata = bake(berry_recipe, berry_tr)) * sd_y + mean_y
)
berryte_pred <- tibble(
yield = berry_te$yield,
pred = predict(m_fit, newdata = bake(berry_recipe, berry_te)) * sd_y + mean_y
)
rsq_tr <- cor(berrytr_pred$yield, berrytr_pred$pred)^2
rsq_te <- cor(berryte_pred$yield, berryte_pred$pred)^2
rmse_tr <- sqrt(mean((berrytr_pred$yield - berrytr_pred$pred)^2))
rmse_te <- sqrt(mean((berryte_pred$yield - berryte_pred$pred)^2))
print(paste("Train R²:", rsq_tr))
print(paste("Test R²:", rsq_te))
print(paste("Train RMSE:", rmse_tr))
print(paste("Test RMSE:", rmse_te))
The following code block is almost identical to the one used for Parent et al. (2020). Results might differ a bit since the split is different and packages are updated.
gg_pred_f <- function(data, rsq, rmse, title) {
data |>
ggplot(aes(x = yield, y = pred)) +
geom_abline(slope = 1, intercept = 0, colour = "gray60", lwd = 2) +
geom_label(x = 0, y = 15000,
vjust = 1, hjust = 0,
label = paste("R² =", signif(rsq, 2), "\nRMSE =", round(rmse, 0))) +
geom_point(pch = 1) +
labs(x = expression("Observed yield (kg ha"^"-1"*")"),
y = expression("Predicted yield (kg ha"^"-1"*")"),
title = title) +
expand_limits(x = c(0, 15000), y = c(0, 15000)) +
coord_equal() +
theme_pubr()
}
gg_training <- gg_pred_f(berrytr_pred, rsq_tr, rmse_tr, "Training")
gg_testing <- gg_pred_f(berryte_pred, rsq_te, rmse_te, "Testing")
gg_training + gg_testing
Conditioning and optimization
So far, what I did with blueberries doesn't differ much from what I did with concrete. In this section, I will use conditioned optimization.
First, I split the variables I want to optimize from those I don't. If I had data on soil texture or pedogenesis, I could consider them as fixed. I could also decide that I don't want to control pH, for instance. Here, I want to optimize everything except weather, which obviously can't be managed.
conditional_features <- dataco |>
select(starts_with("weather"), -yield) |>
names()
managed_features <- dataco |>
select(-starts_with("weather"), -yield) |>
names()
Instead of writing down a vector with my own field values that I want to optimize, I randomly select a low yielder, then split it into manageable and conditional parts.
set.seed(747156)
observation <- dataco |>
filter(yield < 3000) |>
sample_n(1)
obs_baked <- bake(berry_recipe, observation)
obs_managed <- obs_baked |> select(all_of(managed_features))
obs_conditions <- obs_baked |> select(all_of(conditional_features))
Earlier in the notebook, I wrote about the Mahalanobis distance. This is a gadget I like to add to limit the possibility of extrapolation. The squared Mahalanobis distance of multivariate normal data follows a $\chi^2$ distribution with as many degrees of freedom as there are variables. I compute a maximal distance from the center (berrytr_means) and covariance (invcov) of the training set, so that my optimization function returns a penalized (null, on the scaled yield axis) prediction if the optimizer goes too far from the data scatter on which the model was fitted.
critical_q <- 0.9 # -> 1 is loose, -> 0 is strict
crit_dist <- qchisq(p = critical_q, df = ncol(berrytr_baked) - 1) # degrees of freedom = number of features (yield excluded)
invcov <- MASS::ginv(cov(berrytr_baked %>% select(-yield)))
berrytr_means <- colMeans(berrytr_baked %>% select(-yield)) # obviously zeros, since columns were normalized
In the next code block, I define the function to optimize (the loss function). The optim() function minimizes the outcome by default: this is why the function to optimize returns the negative of the prediction.
x_conditions <- as.vector(obs_conditions)
model <- m_fit
minim_f <- function(x) {
  all_x <- data.frame(t(unlist(c(x, x_conditions)))) # managed candidate values + fixed site conditions
  y <- predict(model, newdata = all_x) # predicted yield, on the scaled scale
  maha_dist <- mahalanobis(
    x = all_x,
    center = berrytr_means,
    cov = invcov,
    inverted = TRUE # invcov is already an inverse covariance matrix
  )
  if (maha_dist > crit_dist) y <- 0 # penalize candidates outside the training domain
  return(-y) # optim() minimizes, so return the negative of the prediction
}
All this to run the optimization. But one last detail: even though the Mahalanobis distance controls the extrapolation, it supposes that data are normally distributed, and the Mahalanobis critical ellipsoid often crosses the minimum and maximum values of the data set. If we don't impose these hard limits, we risk obtaining negative fertilization doses. These hard limits could have been defined in the loss function, but optim() offers a way to pass them as arguments.
managed_min <- berrytr_baked |> select(all_of(managed_features)) |> summarize_all(min)
managed_max <- berrytr_baked |> select(all_of(managed_features)) |> summarize_all(max)
berry_opt <- optim(
  par = unlist(obs_managed), # optim() expects a named numeric vector
  fn = minim_f,
  method = "L-BFGS-B",
  lower = unlist(managed_min),
  upper = unlist(managed_max),
  control = list(maxit = 1000, reltol = 1e-12)
)
berry_par_sc <- berry_opt$par
The berry_par_sc object contains the scaled optimal parameters. We can use it to predict the outcome.
all_x <- data.frame(t(unlist(c(berry_par_sc, x_conditions))))
best_outcome <- predict(model, all_x) * sd_y + mean_y
best_outcome
At this point, the berry_par_sc object is difficult to interpret. Let's put it back on its original scale.
step_id <- tidy(berry_recipe)$id
managed_means <- tidy(berry_recipe, id = step_id) |>
filter(statistic == "mean", terms %in% managed_features) |>
pull(value)
managed_sds <- tidy(berry_recipe, id = step_id) |>
filter(statistic == "sd", terms %in% managed_features) |>
pull(value)
berry_par <- berry_par_sc * managed_sds + managed_means
berry_par <- data.frame(t(data.frame(berry_par))) # turn to a single-row data frame
t(berry_par)
Still, soil and leaf data are expressed as CLRs. The following code blocks back-transform the CLRs and prepare the data to create a recommendation table.
optimal_ionome <- clr_inv(berry_par |> select(one_of(features$leaf)))
optimal_soil <- clr_inv(berry_par |> select(one_of(features$soil)))
optimal_dose <- berry_par |> select(one_of(features$dose))
optimal_other <- berry_par |> select(one_of(features$other))
obs_ionome <- clr_inv(observation |> select(all_of(features$leaf)))
obs_soil <- clr_inv(observation |> select(all_of(features$soil)))
obs_dose <- observation |> select(all_of(features$dose))
obs_other <- observation |> select(all_of(features$other))
I put observed and optimal data together in the same table for all types.
ionome <- tibble(observed = t(obs_ionome), optimal = t(optimal_ionome)) |>
mutate(variable = features$leaf, type = "Leaf ionome") |>
filter(variable != "LeafFv")
soil <- tibble(observed = t(obs_soil), optimal = t(optimal_soil)) |>
mutate(variable = features$soil, type = "Soil chemistry") |>
filter(variable != "SoilFv")
dose <- tibble(observed = t(obs_dose), optimal = t(optimal_dose)) |>
mutate(variable = features$dose, type = "Fertilizers")
other <- tibble(observed = t(obs_other), optimal = t(optimal_other)) |>
mutate(variable = features$other, type = "Other")
yield <- tibble(observed = observation$yield, optimal = best_outcome, type = "Outcome", variable = "yield")
recommendation <- bind_rows(ionome, soil, dose, other, yield)
recommendation
Recommendation
The last step is to plot the recommendation. Because the data are not expressed on the same scale, they are difficult to represent. An option is to use ratios as visuals and a table for the recommendation.
recommendation |>
mutate(ratio = optimal/observed) |>
ggplot() +
geom_segment(
aes(xend = ratio, y = variable, yend = variable),
x = 1,
arrow = arrow(length = unit(0.2, "cm"))
) +
geom_vline(xintercept = 1) +
facet_grid(type ~ ., scales = "free", space = "free") +
labs(y = "") +
theme_bw() +
theme(strip.text.y = element_text(angle = 0))
Another option is to use a single panel per variable.
recommend_long <- recommendation |>
pivot_longer(cols = c(observed, optimal))
ggplot(recommend_long, aes(x = value, y = name)) +
geom_col() +
geom_vline(xintercept = 0) +
facet_wrap(variable ~ ., scales = "free", ncol = 4) +
labs(y = "") +
theme_bw() +
theme(strip.text.y = element_text(angle = 0))
Finally, the slope plot is more complex to design, but easier to interpret.
recommend_long <- recommend_long |>
mutate(
label = case_when(
variable == "pH" ~ as.character(signif(value, 2)),
variable %in% c("doseN", "doseP", "doseK") ~ paste0(signif(value, 2), " kg/ha"),
variable == "yield" ~ paste0(signif(value, 4), " kg/ha"),
TRUE ~ paste0(signif(value*100, 2), " %")
)
)
gg_recom <- ggplot(recommend_long, aes(x = name, y = value)) +
facet_wrap(~variable, ncol = 2, scales = "free") +
geom_line(aes(group = variable)) +
geom_point(size = 1) +
geom_text(
data = recommend_long |> filter(name == "observed"),
mapping = aes(label = label) ,
hjust = 1.15,
fontface = "bold",
size = 3
) +
geom_text(
data = recommend_long |> filter(name == "optimal"),
mapping = aes(label = label),
hjust = -0.25,
fontface = "bold",
size = 3
) +
expand_limits(y = 0) +
labs(title = "Parameter adjustment for site-specific optimization", y = "") +
scale_y_continuous(breaks=0) +
geom_hline(yintercept = 0) +
theme_bw() +
theme(
strip.text = element_text(size=10),
panel.spacing = unit(10, 'mm'),
panel.grid = element_blank(),
panel.border = element_blank(),
strip.background = element_rect(fill=NA, size=0),
axis.title.x = element_blank(),
panel.grid.major.x = element_blank()
)
ggsave(filename = "results/recommend.png", plot = gg_recom, height = 14, width = 8, dpi = 300)
The recommendation plots show that correctly adjusting the soil, plant nutrition and major fertilizers could more than double yields.
⚠️ But the model comes with contradictions we must be aware of. While soil P and leaf P should decrease, the model proposes an increase in P dosage. This is why a model won't replace the accountability of professional judgement.
On a final note, I used a Gaussian process algorithm, but a plethora of methods are available in caret and tidymodels, like random forests, extreme gradient boosting, neural networks, etc. I could also have used GAMs. I invite you to try some options. But as a warning, I highly recommend studying the mathematics on which an algorithm relies before using it with confidence.
Conclusion
We have covered a workflow leading to optimal recommendations for agriculture. We began with a well-structured data set, and ended with a recommendation model. A lot of work is needed upstream of this workflow, not only to organize the data, but mostly to collect it. On the other hand, a few operations downstream could lead to interactive applications, e.g. Shiny apps, that could complement and even challenge professional experience for field recommendations.