# July deliverables: two new surrogate methods and benchmarking

In July I mainly worked on:

- Multivariate adaptive regression splines (MARS)
- Gradient-enhanced kriging (GEK)
- Benchmarking problems
- Mentoring work

Let's see the details!

## MARS surrogate

MARS is a non-parametric regression technique that can be seen as an extension of linear models which automatically models non-linearities.

It is just a linear combination of basis functions:

$$\hat{f}(x) = \sum_{i=1}^{k} c_i B_i(x)$$

where each basis function $B_i(x)$ can be a constant or a hinge function of the form $\max(0, x - t)$ or $\max(0, t - x)$, with $t$ called the knot. To build the surrogate, we just need to find the number of basis functions $k$ and the knot of each hinge. The coefficients $c_i$ are then found using the usual least-squares techniques.
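To make the basis concrete, here is a small illustrative sketch in Python (not the package's Julia code) showing a mirrored hinge pair and the least-squares fit of the coefficients, assuming the knot is already known:

```python
import numpy as np

def hinge_pair(x, t):
    """Mirrored hinge functions max(0, x - t) and max(0, t - x)."""
    return np.maximum(0.0, x - t), np.maximum(0.0, t - x)

# toy 1-D data with a kink at x = 2, which a hinge pair captures exactly
x = np.linspace(0, 4, 50)
y = np.abs(x - 2.0)

# design matrix: a constant term plus one mirrored hinge pair with knot t = 2
h1, h2 = hinge_pair(x, 2.0)
B = np.column_stack([np.ones_like(x), h1, h2])

# coefficients via ordinary least squares
c, *_ = np.linalg.lstsq(B, y, rcond=None)
print(np.round(c, 6))  # close to [0, 1, 1]: y = max(0, x-2) + max(0, 2-x)
```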

The model is built in two different phases: a forward and backward pass.

At each step of the forward pass, the pair of basis functions that gives the maximum reduction in sum-of-squares residual error is found. The two basis functions in the pair are identical except that a different side of a mirrored hinge function is used for each function.

This process of adding terms continues until the change in residual error is too small to continue or until the maximum number of terms is reached.
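A minimal sketch of one flavor of this greedy search, in Python rather than the actual Julia implementation (here candidate knots are simply placed at the sample locations, which is an assumption for illustration):

```python
import numpy as np

def forward_pass(x, y, max_terms=3, tol=1e-10):
    """Greedy forward pass on 1-D data: repeatedly add the mirrored
    hinge pair whose knot gives the smallest residual sum of squares."""
    columns = [np.ones_like(x)]          # start from the constant term
    knots = []
    while len(knots) < max_terms:
        best = None
        for t in x:                      # candidate knots at the samples
            trial = np.column_stack(
                columns + [np.maximum(0, x - t), np.maximum(0, t - x)])
            c, *_ = np.linalg.lstsq(trial, y, rcond=None)
            sse = np.sum((trial @ c - y) ** 2)
            if best is None or sse < best[0]:
                best = (sse, t)
        sse, t = best
        knots.append(t)
        columns += [np.maximum(0, x - t), np.maximum(0, t - x)]
        if sse < tol:                    # residual too small to continue
            break
    return knots

x = np.linspace(0, 4, 41)
y = np.abs(x - 2.0)
knots = forward_pass(x, y)
print(knots)  # the first (and only) knot lands at the kink near x = 2
```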

To build a model with better generalization ability, the backward pass prunes the model: it removes terms one by one, deleting the least effective term at each step until it finds the best sub-model. Model subsets are compared using the GCV criterion, a residual sum-of-squares criterion that penalizes the number of terms in the model.
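The GCV score fits in a couple of lines. This is a sketch, not the package's exact formula; the `penalty` default is an assumption (values around 2–3 are common in the MARS literature):

```python
import numpy as np

def gcv(y, y_hat, n_terms, penalty=3.0):
    """Generalized cross-validation score: residual sum of squares per
    sample, inflated by an effective number of parameters C(M)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    # effective parameters: the terms themselves plus a penalty per knot
    c_m = n_terms + penalty * (n_terms - 1) / 2.0
    return (rss / n) / (1.0 - c_m / n) ** 2

# with identical residuals, the model with more terms scores worse
y = np.arange(20.0)
score_small = gcv(y, y + 0.1, n_terms=2)
score_large = gcv(y, y + 0.1, n_terms=3)
print(score_small < score_large)  # True
```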

## Gradient-enhanced kriging (GEK)

Last summer, I started off building the Kriging model.

GEK improves the fit by taking into account gradient information of the expensive process being approximated. In principle, getting this kind of information is quite difficult and usually another approximation is needed. However, using Zygote we can get the gradients hassle-free.

The downside is that the model matrix grows from dimension *n* to dimension *n + nd*, where *n* is the number of samples and *d* is the dimension of the problem. Because of this, it is advised to use this surrogate only when *d* is quite low.

Besides updating the matrix, it is just a matter of calling the right Kriging-related methods, adjusted to account for the changed dimensions and a few other technicalities.
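To illustrate how the matrix grows, here is a Python sketch of a gradient-enhanced correlation matrix for *d* = 1, so the matrix is *(n + nd) × (n + nd)* = 2*n* × 2*n*. The squared-exponential kernel is an assumption for illustration, not necessarily the kernel Surrogates.jl uses:

```python
import numpy as np

def gek_matrix(xs, theta=1.0):
    """Gradient-enhanced correlation matrix for the 1-D kernel
    r(x, x') = exp(-theta * (x - x')^2). Block layout:
        [[ r       dr/dx'    ]
         [ dr/dx   d2r/dxdx' ]]"""
    d = np.subtract.outer(xs, xs)        # d[i, j] = x_i - x_j
    r = np.exp(-theta * d ** 2)
    dr_dxp = 2 * theta * d * r           # derivative w.r.t. second argument
    dr_dx = -2 * theta * d * r           # derivative w.r.t. first argument
    d2r = (2 * theta - 4 * theta ** 2 * d ** 2) * r
    return np.block([[r, dr_dxp], [dr_dx, d2r]])

xs = np.array([0.0, 0.5, 1.0, 2.0])      # n = 4 samples, d = 1
K = gek_matrix(xs)
print(K.shape)  # (8, 8): n + n*d = 4 + 4*1
```

The value part and the derivative blocks fit together into one symmetric matrix, which is exactly why the linear system to solve grows with the dimension.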

## Benchmarking problems

In my first GSoC review, I was told that I should work more on benchmarking and on documenting in the docs the why's behind each surrogate method.

It is a very fair point: in the first month I churned out a lot of methods with no documentation whatsoever, which is quite bad practice.

That's why I decided to do some catch-up work on this: I started creating tutorials for each surrogate, and I also made a benchmark section to compare them on functions that are common in the literature.

## Mentoring work

In the last few weeks there has been an influx of students interested in these topics. I have been trying to act as a mentor for them: I am helping with their first PRs around the Julia community, and Surrogates.jl in particular. It is actually very helpful, because it lets me keep the development of different topics going in parallel.

The plan is to have revamped tutorials and a new surrogate method called KPLS, another Kriging variation that aims to speed up the training process. If everything goes smoothly, those deliverables will be ready by the end of August.

## What's to come in the final month

From my proposal, I just need to code two new surrogates: regularized minimal-energy tensor-product splines (RMTS) and DENSE. I actually planned on doing RMTS in July; however, the paper turned out to be pretty confusing for both my mentor and me, and the authors did not answer my questions. So, at the last minute I decided to go with GEK instead.

I am in love with this kind of freedom!

So, those two surrogates are what I need to work on, but the flexibility allows me to pick what's most useful for the package, so we'll see in the last update what I actually decided to work on!

See you then!