🤖 Introduction to machine learning for ecological engineers
EDIT. The original notebook included an automatic algorithm selector to fit the data. I used that approach to simplify the process, but it ended up adding confusion and technical difficulties. The current version of this document explores machine learning algorithms without the auto-sklearn automation.
Nextjournal
The platform you are currently looking at is Nextjournal. You can run your code online without having to install a coding environment on your computer, so we can spend more time on the content of the workshop than on installing everything you need on your system. If you do need a local installation, I highly recommend Anaconda. Nextjournal is great for coding in a collaborative environment with multiple programming languages used in science: Python, R, Julia, etc. We will use Python for this workshop, since Python has a fairly simple and well-documented machine learning package named scikit-learn.
To use this notebook as a coding environment, sign up (or sign in), then from the top of this page, hit Remix. You will see the status of your remixed notebook at the top when you run your first cell. On the left, you can see that you are running in an environment named Python 3, courtesy of Nextjournal.
As in the well-known Jupyter notebooks, in a Nextjournal notebook you can mix code and formatted text: an approach named literate programming. Code cells can be run by clicking on the blue Play triangle (▶️) at the top-right of a cell, or by focusing on the code cell with the cursor and hitting Shift+Enter.
Why machine learning matters to engineers
Mechanisms and phenomena
Engineering aims at finding solutions using science and technology. Mathematics is a common ground to most tasks done by an engineer. Typically, we use mechanistic models to analyze and predict stress distributions, liquid flows, chemical budgets, etc. These models rely on sets of parameters calibrated in laboratories on real hardware and materials. In the end, modeling almost always relies on both physics and empirical evidence.
When dealing with complexity, mechanistic models become less obvious. Systems thinking, with its stocks and flows, becomes difficult to tune where elements interact through functions that vary over space and time. Since "all models are wrong, but some are useful", sometimes a simple set of linear regressions will offer a good approximation of what is happening under the hood of an ecosystem.
But most ecological patterns are nonlinear. Fortunately, in 2020 we have access to myriads of tools to detect nonlinear patterns through data. Surprisingly simple modeling techniques are able to detect complex patterns. For example, you can collect numerous outcomes from the outflow of a constructed wetland: nitrogen load, turbidity, pH, etc. You can also collect data describing the conditions of the wetland: water and nutrient inflows, weather data, % coverage of plant species in the wetland, etc.
Hundreds of algorithms
To predict outflows from inflows, one could rely on the mechanisms of the ecosystem: compute flow mechanics, nutrient budgets and dynamics with nutrient deposition and nutrient uptake in plants, etc. Another approach could rely purely on phenomenology with machine learning. Using this approach, we identify key features to predict outcomes using pattern detection. In other words, we predict a Y array using an X array. For instance, a simple technique called k-nearest neighbors (KNN) can be used to predict outcomes (usually future outcomes) from given features based on what happened in similar cases: in its simplest form, KNN returns a prediction of the outcome by averaging the outputs of the k points which are closest to the input features. While a KNN model is pretty straightforward, artificial neural networks (ANN), on the other hand, can become a complete black box.
k-nearest neighbors
The basic principle of KNN is that the outcomes of an observation will be similar to the outcomes of similar observations, i.e. a set of features will return values or categories similar to those of the k closest observations in its neighborhood, k being a parameter of the model to be optimized. Using this approach, we need a distance metric (Euclidean, Mahalanobis, Manhattan, Gower, etc.) - be careful about the influence of scaling on the distance metric.
The next figure shows, on the left, a simplistic 2D classification example, where with KNN (k=3 and a Euclidean distance metric) the predicted color of the white circle would have a probability of 66.7% of being cyan and 33.3% of being magenta. If it were a regression problem, the predicted outcome of the white circle would be the mean or median of the 3 surrounding points. On the right, I created a KNN model based on the Mahalanobis distance to classify penguin species from the dimensions of their bill.
The white dot can move around a little bit without affecting its outcome. This is why outcomes from KNN are usually not smooth, but change in steps.
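To make this concrete, here is a minimal sketch of the left-hand example with scikit-learn's KNeighborsClassifier; the coordinates and labels are made up for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 2D points: three cyan and three magenta observations
X_toy = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],   # cyan
                  [2.5, 2.5], [4.0, 4.0], [4.0, 5.0]])  # magenta
y_toy = np.array(["cyan", "cyan", "cyan", "magenta", "magenta", "magenta"])

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_toy, y_toy)
print(knn.predict([[2.0, 2.0]]))        # predicted class of the "white circle"
print(knn.predict_proba([[2.0, 2.0]]))  # about 2/3 cyan, 1/3 magenta among the 3 nearest neighbors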
Decision trees
A decision tree is a hierarchical sequence of mostly binary decisions. At each node, a boolean (true - false) test is made. The predicted outcome, whether it's a category or a number, is found at the final node. The decisions and their sequence are optimized to find the most accurate outcome. Simple trees can be printed on paper and are easy to understand.
In the following example, I created a tree to classify penguin species from the dimensions of their bill.
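As a minimal sketch (not the exact model shown in the figure), assuming seaborn's bundled copy of the Palmer penguins data, such a tree could be built and printed as follows.
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, export_text

penguins = sns.load_dataset("penguins").dropna()  # Palmer penguins data bundled with seaborn
X_pen = penguins[["bill_length_mm", "bill_depth_mm"]]
y_pen = penguins["species"]

tree = DecisionTreeClassifier(max_depth=3)  # a shallow tree stays readable on paper
tree.fit(X_pen, y_pen)
print(export_text(tree, feature_names=list(X_pen.columns)))  # the sequence of binary decisions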
There are numerous algorithms based on decision trees. A random forest is a collection of numerous trees, each one created from a random sample of the data, with replacement. The outcome is the mean of the values of all the trees, or the most frequently predicted category. A popular algorithm named XGBoost is also a forest, with trees sequentially modelling the error of the previous tree.
Even more than KNN, results from decision trees and decision tree-based algorithms are not smooth.
Neural networks
A neural network (or artificial neural network) is a series of layers, each one made of interconnected functions named neurons. The most basic neural network is a multilayer perceptron, such as the one presented above.
In the previous figure, each circle is a neuron. Each layer of neurons is connected to the next layer. Each connection has a weight, which is optimized to obtain the most accurate outcome. The first layer is the input, i.e. the values of the features. Each neuron in a layer takes as input a matrix multiplication of the previous neurons times the weights. Usually, an intercept called the bias is also added, but I removed it for simplicity. Each neuron also carries a function which transforms the result of the matrix multiplication. In this example, I used a simple function called the rectified linear unit (ReLU), which turns negative values to zero. There are numerous functions available, each one carrying its pros and cons, but ReLU is known to return pretty good results. These functions can differ from one neuron to another, or from one layer to another, but usually using the same function for the whole network is enough.
To demystify neural networks, let's compute a simple one by hand (without the bias).
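Here is a minimal numpy sketch of that forward pass, with two inputs, one hidden layer of two ReLU neurons and one output; the weights are made-up numbers for illustration.
import numpy as np

def relu(z):
    return np.maximum(z, 0)  # rectified linear unit: negative values become zero

x = np.array([1.0, 2.0])            # input features
W1 = np.array([[0.5, -1.0],
               [0.25, 0.75]])       # weights from the inputs to the 2 hidden neurons
W2 = np.array([0.8, -0.3])          # weights from the hidden neurons to the output

hidden = relu(x @ W1)   # x @ W1 = [1.0, 0.5], already positive so ReLU keeps it as is
output = hidden @ W2    # 1.0 * 0.8 + 0.5 * (-0.3) = 0.65
print(output)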
Neural networks can have hundreds of layers with hundreds of neurons per layer. Networks can be assembled in all sorts of ways, e.g. by taking the output of one network as the input of another one. The way they are arranged is called the network architecture. When these networks become deep, we refer to this technique as deep learning.
The classification of penguins with neural networks has linear borders, but it's not the case for all classifications with neural networks.
Other algorithms like Gaussian processes, which can return probabilistic results, are also commonly used by data scientists and engineers.
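As a minimal sketch with made-up 1D data, scikit-learn's GaussianProcessRegressor returns both a prediction and its uncertainty.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

X_gp = np.array([[0.0], [1.0], [2.0], [3.0]])  # made-up feature values
y_gp = np.array([0.0, 0.8, 0.9, 0.1])          # made-up outcomes

gp = GaussianProcessRegressor().fit(X_gp, y_gp)
mean, std = gp.predict(np.array([[1.5]]), return_std=True)  # prediction and its standard deviation
print(mean, std)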
All these algorithms can be put together in complex pipelines we call ensembles. The auto-sklearn library can generate such ensembles automatically, but we will keep it simple here by using single models.
Conditioning models
Conditioning a machine learning model means observing the output of a model by varying some features (usually those we can control) while other features are fixed (usually uncontrollable features, like weather). Using computer-based random searches, it becomes possible to find optimal controllable features given, for instance, expected weather.
In this workshop, I will show how to tune a machine learning algorithm to create a model and how a model can be used for conditioning. Of course, this is a complex area full of traps and caveats that I can't cover here. To go further with Python, I highly recommend Python Data Science Handbook, by Jake VanderPlas.
Although I did my best to avoid relying too much on programming knowledge, some programming is needed to get the most out of this workshop. If you are familiar with spreadsheets, you might not be aware of it, but you already have some programming skills! I will be concise and use as little programming as possible.
Machine learning typical workflow
Task zero is always to state the problem and ask the question you need to answer. Once you know where you are going, you can get your hands dirty.
Data collecting and cleaning
Machine learning is math with data. Clean and efficient algorithms are usually provided by machine learning packages like scikit-learn. But collected data are often a real mess. Cleaning the data is a boring but critical task in data modeling. Make sure that
each column has one single meaning with a single type of data (numerical, categorical, date, text, boolean, etc.),
each row is a single observation,
missing values are noted consistently (usually empty cells),
units are consistent,
etc.
Data partitioning
Machine learning is not like usual statistics: it is used for prediction. Accordingly, the performance of a machine learning model must be assessed on predicted cases, not on fitted cases. Typically, we split the data into a training set used to fit the model and a testing set used to assess the performance. As a rule of thumb, training and testing sets are split with a 70/30 % ratio, but the proportions are up to the analyst. Cross-validation techniques can be applied on the training set when fitting the model, but we will not cover this topic here.
Preprocessing
Many machine learning algorithms, like k-nearest neighbors, highly depend on the scale of the data. For instance, if we had spatial coordinates with a NS position expressed in meters and a WE position expressed in feet, k-nearest neighbors would unduly amplify the importance of the WE axis, whose numbers are larger. A typical way to express all columns on a common scale is to subtract the mean and divide by the standard deviation, a process named standardisation. Preprocessing could also include missing value imputation and one-hot encoding of categorical features.
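As a quick illustration with made-up numbers, standardisation boils down to the following; scikit-learn's StandardScaler, used later in this notebook, does the same while remembering the mean and standard deviation so it can scale values back.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])         # a made-up column of data
x_standardised = (x - x.mean()) / x.std()  # mean 0, standard deviation 1
print(x_standardised)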
Fit the model
I have covered three of the numerous families of algorithms that have been developed for machine learning. Some of these can only predict categories, others only numerical values, and some both. These are supervised learning techniques, because we fit features to outcomes. Unsupervised learning aims at predicting something that does not exist yet, like creating groups from a priori unclassified data. Algorithms have their pros and cons.
scikit-learn comes with a myriad of algorithms: artificial neural networks, convolutional neural networks (mainly for image processing), support vector machines (mostly used for classification), decision trees, random forests, k-nearest neighbors, k-means clustering (for unsupervised clustering), etc.
I tend to prefer artificial neural networks and Gaussian processes for regressions, since they usually return a smooth response, while k-nearest neighbors and algorithms based on trees (decision trees, random forest, extreme gradient boosting, etc.) return stepped responses.
Criticize and improve the model
Once you have your model, it's a good idea to inspect its performance. Some are very good right away, but most are bad. Really bad. Features must be engineered, parameters must be tuned, etc. The model can also be improved continuously by collecting more and more data.
Use your model
You could save your model, then reuse it in other environments. Some will also create apps for clients.
Coding
Load data
Our data come from blueberry crops field experiments done in Québec, Canada. We aim at predicting yields from fertilizer dosages (nitrogen, phosphorus and potassium), weather data at different growth stages and soil chemistry. I have already done the boring job of cleaning data.
In Nextjournal, you can drag and drop data in csv (comma-separated values) format. csv files are just text files with rows separated by carriage returns and columns separated by commas. You could also click on the menu at the bottom: in the Editor menu, click on More, then File. Browse and select your csv file on your desktop, and Nextjournal will create a block including your file.
The file is now on the server, but not loaded into our Python environment. To load our data table in Python, you need the pandas library, which is already installed in this computing environment (installed means it's downloaded and ready to be used, loaded means it's available for the session). A library (or package) adds functionalities to the basic operations that can be done with Python. The pandas library handles data tables. In Python, packages can be loaded like this.
import pandas as pd
I specified to load pandas as pd so that all pandas functions will be accessible with the pd prefix. pd is just a conventionally used shortcut. To load the csv in Python, pandas has a read_csv() function. Usually, read_csv() takes a path to the file. But in Nextjournal, click on Insert inline in the menu or hit Shift Cmd/Ctrl E, then select your file.
df = pd.read_csv("df.csv")  # in Nextjournal, the inline file reference replaces this path
Our data are now saved in Python, and accessible by calling the object we named df. In pandas, we can inspect the table using the head() method. Just like we would call frog.jump(5) to ask a frog to jump by 5 units, we ask for the top 5 rows as
df.head(5)
We might also want a graphical representation of our data. But in Nextjournal, plot size needs to be adjusted with this little trick.
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (14,14)
A scatter matrix can help find obvious correlations in the data.
pd.plotting.scatter_matrix(df);
There are some correlations between weather features, and between soil elements. We could reduce the number of features and remove some redundancy by using principal component analysis, and do some feature engineering, but that's another subject.
We will create two objects: the matrix of features (the data used to predict an outcome, X), and the target (the outcome to be predicted, y). The drop() method removes a row (axis=0) or a column (axis=1).
X = df.drop("yield", axis=1) # features
y = df["yield"] # outcome
Data partitioning
From the scikit-learn package, we need to load the train_test_split function to partition our data. We can assign the output of the split to four objects (X_train, X_test, y_train, y_test) all at once. We use a 70/30 split ratio. To ensure that each split will return the same values, and thus code reproducibility, I add a random seed with the numpy package (I usually generate a number from random.org to assure randomness). The numpy package is a well-known mathematical package for Python.
import numpy as np
np.random.seed(856499)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
Preprocessing
Scikit-learn has a good collection of preprocessing functions. We will use the standard scaler, which subtracts the mean and divides by the standard deviation. The advantage of using scikit-learn functions instead of doing it manually is that the scaler object can scale data back to the original scale.
from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler()
X_scaler.fit(X_train)
X_train_sc = X_scaler.transform(X_train)
X_test_sc = X_scaler.transform(X_test)
StandardScaler expects 2D data, while y is a 1D vector. To transform the vector to a single-column matrix, we extract the values from the table, which transforms it to an array, then reshape it to whatever the number of rows (-1) and one column. The operation can be done in one line with y_train.values.reshape(-1, 1). After the transformation, we reshape y back to a vector with .flatten().
y_scaler = StandardScaler()
y_scaler.fit(y_train.values.reshape(-1, 1))
y_train_sc = y_scaler.transform(y_train.values.reshape(-1, 1)).flatten()
y_test_sc = y_scaler.transform(y_test.values.reshape(-1, 1)).flatten()
Fit the model
To build our model, we first select a regressor among those provided by scikit-learn. We will use a neural network regressor, specifically a multi-layer perceptron (MLP). The architecture of the network is defined in a tuple (an immutable list in Python), where we set the number of neurons per layer. The activation functions will be ReLU.
from sklearn.neural_network import MLPRegressor
reg = MLPRegressor(
hidden_layer_sizes = (30, 30, 30, 30),
activation = 'relu',
max_iter = 25000
)
Our regressor (reg) is not yet fitted to our training data. The next code fits our scaled X to y in the training set. Multi-layer perceptrons are fast to train.
reg.fit(X_train_sc, y_train_sc)
Now that our model has been fitted, we might want to know if it's any good. Remember that the performance of a model is assessed on the test set. We predict the outcome from our fitted object by using the predict() method. We put our prediction back to the original scale using the inverse_transform() method of our scaler.
y_test_pred_sc = reg.predict(X_test_sc)
y_test_pred = y_scaler.inverse_transform(y_test_pred_sc.reshape(-1, 1)).flatten()
We can check some metrics. The R² score offers a common ground to assess regressions, while the median absolute error gives an idea of the adequacy with the scale of the outcome. The metrics can be computed with functions included in the scikit-learn package. From the scikit-learn package (named sklearn), we import the r2_score() and median_absolute_error() functions.
from sklearn.metrics import r2_score, median_absolute_error
print("R²: ", r2_score(y_test, y_test_pred))
Such an R² seems low compared to the ones we usually obtain from controlled processes, but for open ecological systems, it can be considered acceptable. The median absolute error tells us that a typical prediction should return a blueberry yield error, per hectare, of
print("Median error: ", median_absolute_error(y_test, y_test_pred))
I like to plot observed versus predicted. It gives a sense of model reliability across the whole range of outcomes. Model performance is always assessed on the testing set.
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (8,8)
plt.plot(y_test, y_test_pred, 'o', color="#98d7e8")
plt.plot([0, 12000], [0, 12000], color="#31afd0", linewidth=3)
plt.xlabel("Observed yield")
plt.ylabel("Predicted yield")
We could refine our model with more features, more data, better modeling parameters and allow more time. But the model is acceptable for the workshop. Let's save our model.
from joblib import dump
dump(reg, "results/model.joblib")
We can share the joblib file with anyone, or upload it to a server for a web application.
from joblib import load
reg_ = load("results/model.joblib")
Note that the preprocessing could also have been included in the joblib file using scikit-learn pipelines.
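As a hedged sketch of that idea (not run here), the scaler and the regressor can be chained with make_pipeline so that a single joblib file carries both; note that, unlike the manual approach above, this version leaves y unscaled.
from joblib import dump
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

pipe = make_pipeline(
    StandardScaler(),  # X is scaled inside the pipeline
    MLPRegressor(hidden_layer_sizes=(30, 30, 30, 30), activation="relu", max_iter=25000)
)
pipe.fit(X_train, y_train)             # y is left unscaled in this sketch
dump(pipe, "results/pipeline.joblib")  # one file for preprocessing + model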
Use your model
To actually use your model, you can create a new csv table with the exact column names and enter the information of a case you want to predict. Just remove the y column (here, remove yield), and you are good to go.
my_case = pd.read_csv("df_pred.csv")  # in Nextjournal, the inline file reference replaces this path
my_case
Remember that the model asks for preprocessed data!
my_case_sc = X_scaler.transform(my_case.values)
my_pred = y_scaler.inverse_transform(
reg_.predict(
my_case_sc
).reshape(-1, 1)
).flatten()
my_pred
Condition your model
A very clever way of using our model is to fix some features while varying others. In our data set, fertilizer dosage can be managed given the results of a soil test and given the historical weather in the region. This process is named conditioning.
The scipy package has useful optimizers, and we could use them to find the best dose combination. But to keep it simple here, we will generate a large table populated with 10 000 random dose combinations, then retain the one with the highest predicted yield. The numpy package, widely known in Python, is used for all sorts of mathematical operations, like generating random values.
n_cases = 10000
conditioned_df = pd.DataFrame(
{
'dose N': np.random.uniform(0, 30, n_cases),
'dose P': np.random.uniform(0, 30, n_cases),
'dose K': np.random.uniform(0, 30, n_cases),
'pH': 4.2,
'prec 1': 72,
'prec 2': 110,
'prec 3': 60,
'prec 4': 120,
'prec 5': 80,
'temp 1': 2.1,
'temp 2': 10,
'temp 3': 14,
'temp 4': 18,
'temp 5': 17,
'soil P': 8.1e-5,
'soil K': 1.6e-4,
'soil Ca': 3.8e-4,
'soil Mg': 2.8e-4
}
)
conditioned_df
To the conditioned_df table, we add a new column including predicted yields.
conditioned_df["yield predicted"] = y_scaler.inverse_transform(
reg_.predict(
X_scaler.transform(conditioned_df.values)
).reshape(-1, 1)
).flatten()
conditioned_df
We look for the row index with the maximum predicted yield.
best_case = np.argmax(conditioned_df["yield predicted"])
best_case
... and we recover the corresponding row, with optimal dosage!
conditioned_df.iloc[best_case, :3]
Once we have our model, we can condition it in all sorts of ways, like looking for the optimal dosage according to specific weather, or seeing how it reacts to different soil tests. But beware that the model is not valid for values lying outside the ranges of the original data set. I could have created random doses over a larger range (like 0 to 50), but since there is no such dose in the original data set, this extrapolation would be meaningless.
Usually, fertilizer dosages are prescribed using soil tests, tables created using fertilizer trials and professional judgement, often disregarding important local conditions, like weather.
Exercise
We have data relating concrete compressive strength to its age and composition. Our task is to model the compressive strength through time of a concrete with a given composition.
The data set comes from Yeh (1998) and was downloaded from University of California - Irvine - Machine learning repository.
concrete = pd.read_csv("Concrete_Data.csv")  # in Nextjournal, the inline file reference replaces this path
concrete.head(5)
Try it!
Tip: to create a vector of ages from 0 to 365 by steps of 5, you can use np.linspace(start=0, stop=365, num=74).
No, really, try it before digging into the solution.
Solution
As before, we plot a bivariate representation of the data.
pd.plotting.scatter_matrix(concrete);
There are clear patterns, probably from experimental design. We need our features and our outcomes.
Xc = concrete.drop("Compressive strength (MPa)", axis=1)
yc = concrete["Compressive strength (MPa)"]
Which we split.
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, train_size=0.7, random_state=948637)
We instantiate our model object, then fit it. Let's try with random forests this time (preprocessing is not mandatory with random forests).
from sklearn.ensemble import RandomForestRegressor
reg_concrete = RandomForestRegressor() # default arguments
reg_concrete.fit(Xc_train, yc_train)
We assess the model on the testing set.
yc_test_pred = reg_concrete.predict(Xc_test)
plt.plot(yc_test, yc_test_pred, 'o', color="#98d7e8")
plt.plot([0, 100], [0, 100], color="#31afd0", linewidth=3)
plt.xlabel("Observed yield")
plt.ylabel("Predicted yield")
Now let's condition our model. We keep everything constant except time (Age (day)) for two different water contents.
conditioned_c1 = pd.DataFrame(
{
'Cement (kg/m³)': 420,
'Blast Furnace Slag (kg/m³)': 50,
'Fly Ash (kg/m³)': 10,
'Water (kg/m³)': 150,
'Superplasticizer (kg/m³)': 0,
'Coarse Aggregate (kg/m³)': 900,
'Fine Aggregate (kg/m³)': 800,
'Age (day)': np.linspace(0, 365, 74)
}
)
conditioned_c2 = pd.DataFrame(
{
'Cement (kg/m³)': 420,
'Blast Furnace Slag (kg/m³)': 50,
'Fly Ash (kg/m³)': 10,
'Water (kg/m³)': 250,
'Superplasticizer (kg/m³)': 0,
'Coarse Aggregate (kg/m³)': 900,
'Fine Aggregate (kg/m³)': 800,
'Age (day)': np.linspace(0, 365, 74)
}
)
We predict these cases and plot compressive strength against time.
yc_cond1 = reg_concrete.predict(conditioned_c1)
yc_cond2 = reg_concrete.predict(conditioned_c2)
plt.plot(conditioned_c1["Age (day)"], yc_cond1, linewidth=3)
plt.plot(conditioned_c2["Age (day)"], yc_cond2, linewidth=3)
plt.xlabel("Age (day)")
plt.ylabel("Compressive strength (MPa)");
Conclusion
We have seen how to create a basic machine learning model in Python, and used it to simulate yields for different fertilizer dosages given site-specific conditions. Such a strategy has great potential to avoid over-fertilization, which causes water eutrophication, and can be applied in other fields as well. We did an exercise to train a model predicting how the compressive strength of a concrete evolves through time. With the help of machine learning, phenomenological modelling opportunities are numerous.