Nextjournal Blog / Mar 22 2021

🛠 Bootstrapping our CI

While Nextjournal is currently focused on improving data science workflows, we've conceived of it as a tool for arbitrary computation right from the start. This gives us the luxury of bootstrapping our work.

In Doug Engelbart's definition, bootstrapping means:

To use what you build to boost your own effectiveness.

We aim to create a virtuous cycle that multiplies our efforts and results in a better final product.

In other words, we use Nextjournal to build Nextjournal.

By doing that, we are our first users, and we get to experience firsthand all the great parts (and the rough ones!) of our platform.

In a certain sense, bootstrapping is also sound from a business perspective: if we don't trust our product or it doesn't make us more productive, how could we expect others to pay for it?

Build to solve your own problems vs 'Not invented here'

If you have the ability to invent and make new tools that are needed for your problem, then you must.
Alan Kay - How?

A historic example of building your own specialised tools, cited by Alan Kay, is the construction of the Empire State Building in 1930. Since the builders were facing an endeavour never attempted before, they made their own tools from scratch:

Gentlemen, this building of yours is going to present unusual problems. Ordinary building equipment won't be worth a damn on it. We'll buy and make new stuff, fitted for the job … That's what we do on every big job. It costs less than renting secondhand stuff, and it's more efficient.
William Starrett - Builder of the Empire State Building

We work under the guideline that some of the problems that we face when building Nextjournal are novel, and therefore need to be tackled in a novel way. That's our innovation side.

But we must not make the mistake of believing that we need to make a tool for all our problems, otherwise we're simply falling for the "Not invented here" (NIH) syndrome.

To avoid the trap, it's very important to have an idea, even a fuzzy one, of the product we want to build, and to start from the problem itself. If the problem is relevant to the scope of our platform, we will eventually want to bootstrap its solution. Otherwise, we're happy to use external tools and maintain focus on what matters to our users.

So, bootstrapping can bring some major benefits, but it all comes at a cost. A point that then becomes critical is resilience: whenever you make the tools yourself, you are adding more potential points of failure that you need to fix yourself.

Circling back to the construction of the Empire State Building as an example: what happens if the people within the construction site that make the tools are suddenly unavailable, and you need their work urgently?

In a few moments I'll give a practical example of how this has bitten us, and what we've learned from it.

Building a bootstrapped CI

With this bootstrapping mindset, Nextjournal CI emerged.

It has been running for roughly a year, so it feels like a good time to discuss a little how we got there, and what it has brought us so far.

The initial state

One of the main subsystems of Nextjournal is scheduling: when you run a notebook, we use our cloud provider's APIs to schedule a runner, a VM that runs all the code cells in the notebook.

When I started to work on our bootstrapped CI, I didn't really start from scratch: we had our own custom-built CI, which was working as follows:

On a new commit, schedule a new VM.
Check out the repo at the pushed commit there.
Run a bunch of shell scripts.

In the end, what we had was not very different from what most commercial CI services offer. But we were rolling our own instead.

Our initial state is a good example of how things may end up when you're not focusing on the added value.

No real added value compared to a commercial CI provider (maybe a little cheaper)
One more system to maintain

The potential cost saving is often one of the main reasons for building a tool internally, but unless you're sure it provides more value somewhere else, you're just trading some of the costs for risks.

Real bootstrapping, i.e., getting rid of custom code

Our CI was working fine, but we weren't too happy about two things:

The code powering this CI was separate from the code that runs our application day after day (i.e. not really bootstrapping).
Some of the tests that we run are flaky, and if one of them failed, the whole CI task needed to be re-run, instead of just the offending test.

So I set out to rebuild our CI with two goals in mind:

Running our CI should work exactly like a user running a notebook
Notebooks instead of scripts

And the great part about doing this is that it wasn't even too much work.

Reusing existing processes in our platform meant that the bulk of development was just some glue between existing parts. In fact, the majority of the changes were related to the GitHub webhook, where I had to make sure Nextjournal was running a notebook instead of scheduling a VM.

Our current bootstrapped CI works as follows:

There are a bunch of CI notebook templates (which are just normal notebooks, in reality). They contain the code to run the tests / build the release or whatever else the task is supposed to do.
These notebooks are referenced in a .nextjournal/config.edn file in our repository with an immutable permalink.
When pushing to GitHub, our application takes the code at the pushed commit and figures out, by parsing the config.edn file in the repo, which CI notebook templates to use and at which version.
Each CI notebook template gets remixed (a copy of it is created), and run
Each of the CI notebooks reports its status independently to GitHub by using GitHub APIs (with code in the notebook itself).
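
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the config is shown as an already-parsed dict rather than EDN, and `remix` and `run` are hypothetical callables standing in for platform operations that this post doesn't detail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CiRun:
    job: str        # job name from the config, e.g. "foo"
    template: str   # immutable permalink of the CI notebook template
    notebook: str   # identifier of the remixed copy (hypothetical)
    commit: str     # commit the notebook runs against


def trigger_ci(
    config: Dict[str, Dict[str, str]],
    commit: str,
    remix: Callable[[str], str],
    run: Callable[[str, str], None],
) -> List[CiRun]:
    """Remix and run one notebook per job listed under "ci-jobs"."""
    runs = []
    for job, permalink in config.get("ci-jobs", {}).items():
        notebook = remix(permalink)  # copy the pinned template
        run(notebook, commit)        # each copy runs on its own
        runs.append(CiRun(job, permalink, notebook, commit))
    return runs
```

Passing `remix` and `run` in as arguments keeps the sketch testable with stubs; in the real system these steps are the same ones a user triggers when remixing and running a notebook.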

This process isn't perfect, and I already have some improvements in mind, but it has been working great for us over the last 12 months: at the time of writing, we have created 28668 CI notebooks.

The price to pay

The fact that we use Nextjournal itself for running our tests and building our releases immediately poses a problem: what happens if the platform is down?

This was the case a few months ago, when a token/API key bundled in our VM images expired, and suddenly Nextjournal was not able to schedule new runners to run code.

We were effectively locked out of our own system: we had a fix ready, but couldn't build the release.

In that case, we resorted to building the release locally. It worked fine, and that is our fallback. But that is only necessary because of bootstrapping.

Unsurprisingly, it's a tradeoff. For us, running our CI with our normal infrastructure is a no-brainer, even if we have to fall back to a "manual" build every once in a while (hopefully, never again 😅).

Advantages of Nextjournal CI

Now that we've been using our CI for over a year, I can confidently say that we are liking it way more than a traditional CI.

Even more, when I had to build a CI pipeline for another, unrelated project, I wished I could use interactive notebooks.

Here are a couple of reasons why.

Interactive building of CI tasks

Usually, building a CI pipeline is a tedious trial-and-error loop:

change the yaml file (or whatever) that defines the CI tasks
commit and push the changed CI definition
check CI report and see whether it passed/failed and why
if it failed, figure out what's wrong and start again

How making changes to CI typically looks on GitHub Actions

This process is generally tedious and slow.

We don't experience this pain when building our CI pipelines, because a CI notebook is just a normal interactive notebook. We can run it and iterate until it works, with a very fast feedback loop (only the time it takes to run).

When we're done, we "save it" by copying its immutable permalink into the config.edn in our repository.

Our .nextjournal/config.edn looks like this:

{:ci-jobs
 {:foo "https://nextjournal.com/a/NEUbaaXZMRWjZ3EVEZw3d?change-id=Ct7UUEqFhjoToURK985Mod"
  :bar "https://nextjournal.com/a/CGhxCjZn67e9w8NRdi8xk1?change-id=CtLDRN2Q4zVLNE8o5EHZFP"}}

You can see that we're referencing not just the notebook, but also the change-id, which marks an immutable state of a notebook. So anybody can change the notebook, but the CI keeps running the pinned version until this file is updated.
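
To make the pinning concrete, here is a small Python sketch that splits such a permalink into the notebook URL and its change-id. The function and type names are made up for illustration, and rejecting a permalink without a change-id is an assumed policy, not something this post states.

```python
from typing import NamedTuple
from urllib.parse import parse_qs, urlsplit


class PinnedNotebook(NamedTuple):
    notebook_url: str  # permalink without the query string
    change_id: str     # the immutable state the CI is pinned to


def parse_permalink(permalink: str) -> PinnedNotebook:
    """Split a permalink into its notebook URL and change-id."""
    parts = urlsplit(permalink)
    change_ids = parse_qs(parts.query).get("change-id", [])
    if len(change_ids) != 1:
        # Assumed policy: a CI job must point at exactly one pinned state.
        raise ValueError(f"expected exactly one change-id in {permalink!r}")
    notebook_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return PinnedNotebook(notebook_url, change_ids[0])


pinned = parse_permalink(
    "https://nextjournal.com/a/NEUbaaXZMRWjZ3EVEZw3d?change-id=Ct7UUEqFhjoToURK985Mod"
)
print(pinned.notebook_url)  # https://nextjournal.com/a/NEUbaaXZMRWjZ3EVEZw3d
print(pinned.change_id)     # Ct7UUEqFhjoToURK985Mod
```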

And all of this comes directly from the fact that there's no difference between running a notebook when "developing it", or running it through our CI.

So you know exactly how it will behave.

Not just scripting in CI tasks

Using Notebooks as CI tasks has another advantage: having access to a fully fledged computing environment is much more powerful than what the usual CI offers.

Here are a couple of examples:

Linter

Our codebase is in Clojure, and we use clj-kondo as a linter. Since it was introduced fairly late in the history of the repository, it reports quite a few warnings.

Now, fixing all the issues at once isn't particularly feasible or useful for us (big cleanups like this rarely are). On the other hand, using the linter as-is on this codebase isn't useful either: imagine pushing a commit and seeing in the CI report "459 warnings"... is it ok, or not?

Luckily, clj-kondo is written in Clojure, and running Clojure notebooks is our forte: instead of running the linter as a command-line binary, as one normally would, our CI notebook treats clj-kondo as a normal Clojure library and calls its functions directly.

This allows us to lint the repository both at the given commit and at the head of our main branch. Since we're inside a Clojure program and everything is data, we can compare the results, figuring out what has been fixed, what is new, and what stayed the same.

Excerpt from our CI `clj-kondo` notebook

This is not something that couldn't have been achieved with a traditional CI and a bit of Bash scripting, but the ability to write Clojure, which we're much more proficient at, and the interactive development explained before, made this process a breeze.
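
The comparison itself is simple once lint results are plain data. The following Python sketch is not the code from our notebook; it only illustrates the idea, assuming each finding is a dict with `filename`, `type`, and `message` fields (a simplified stand-in for what a linter reports). All names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Finding = Dict[str, object]
Key = Tuple[str, str, str]


def finding_key(finding: Finding) -> Key:
    # Row and column are left out on purpose: unrelated edits shift line
    # numbers, which would make old warnings look new.
    return (
        str(finding["filename"]),
        str(finding["type"]),
        str(finding["message"]),
    )


def compare_findings(
    base: Iterable[Finding], head: Iterable[Finding]
) -> Dict[str, Counter]:
    """Classify findings as fixed, new, or unchanged between two lint runs."""
    base_counts = Counter(finding_key(f) for f in base)
    head_counts = Counter(finding_key(f) for f in head)
    return {
        "fixed": base_counts - head_counts,      # only on the base branch
        "new": head_counts - base_counts,        # only at the pushed commit
        "unchanged": base_counts & head_counts,  # present in both
    }


main_branch = [
    {"filename": "src/a.clj", "row": 3, "type": "unused-binding",
     "message": "unused binding x"},
    {"filename": "src/b.clj", "row": 9, "type": "unused-namespace",
     "message": "namespace foo is required but never used"},
]
pushed_commit = [
    {"filename": "src/a.clj", "row": 7, "type": "unused-binding",
     "message": "unused binding x"},
    {"filename": "src/c.clj", "row": 1, "type": "unresolved-symbol",
     "message": "unresolved symbol y"},
]
result = compare_findings(main_branch, pushed_commit)
print({name: sum(counts.values()) for name, counts in result.items()})
# {'fixed': 1, 'new': 1, 'unchanged': 1}
```

With a report like this, a commit can be judged by the warnings it introduces rather than by the total count.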

Build reports

In our CI notebooks, everything needed for building and reporting the CI run is contained within the notebook itself. In practice, this means that the status report to GitHub happens with a code cell.

Usually, our CI notebooks have this shape:

GitHub repository component
Report 'start' to GitHub
One or more code cells that are the bulk of the CI task
Report outcome to GitHub
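
As a rough Python sketch of this shape, assuming the report goes through GitHub's commit status endpoint (this post doesn't say which GitHub API the notebooks use): `status_request` only builds the request instead of sending it, and both function names are hypothetical.

```python
import json
from typing import Callable, Dict

# States accepted by GitHub's commit status API.
VALID_STATES = {"pending", "success", "failure", "error"}


def status_request(
    owner: str, repo: str, sha: str, state: str, context: str,
    description: str = "",
) -> Dict[str, str]:
    """Build (but do not send) a commit status request."""
    if state not in VALID_STATES:
        raise ValueError(f"invalid state: {state!r}")
    body = {"state": state, "context": context, "description": description}
    return {
        "method": "POST",
        "url": f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}",
        "body": json.dumps(body),
    }


def run_ci_task(task: Callable[[], None], report: Callable[[str], None]) -> bool:
    """Report the start, run the task, then report the outcome."""
    report("pending")      # the 'start' report
    try:
        task()             # the bulk of the CI task
    except Exception:
        report("failure")
        return False
    report("success")
    return True
```

Each step here corresponds to one code cell in the notebook, so a failing step can be inspected and re-run in place.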

But in our CI release task, after the last step we have another code cell, which generates the shadow-cljs build report (see below):

Excerpt from the build report at the end of the Release CI task

The build report gets written to /results, and therefore it ends up in our persistent content-addressed storage.

It can be useful later on for auditing, or if we want to figure out when exactly the build size of our frontend has changed, for example.

However, it is not mission critical, and we definitely don't want to make the Release CI task take longer than what's strictly needed (you can see in the screenshot that this build report takes ~4 minutes to complete), so all this computation is done after everything has already succeeded (or failed).
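
The ordering matters more than the details, so here is a minimal Python sketch of it. The three callables are hypothetical stand-ins for notebook cells, and swallowing a failure of the report step is an assumption made for illustration.

```python
from typing import Callable, List


def release_task(
    build: Callable[[], None],
    report_outcome: Callable[[str], None],
    write_build_report: Callable[[], None],
) -> List[str]:
    """Report the build outcome first, then do the slow, optional work."""
    steps = []
    try:
        build()
        outcome = "success"
    except Exception:
        outcome = "failure"
    report_outcome(outcome)  # GitHub sees the result without waiting
    steps.append(f"reported {outcome}")
    try:
        write_build_report()  # slow and not mission critical
        steps.append("build report written")
    except Exception:
        # Assumption: a broken report must not change the reported outcome.
        steps.append("build report skipped")
    return steps
```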

With a more traditional CI provider, you'd have an extra CI task named build-report which would be triggered after the build. I think this is another good example where using notebooks and a single computational environment is definitely handy and reduces complexity.

Nextjournal CI as a service?

All in all, bootstrapping our CI proved useful not only because we got rid of custom code, so that CI is now just business as usual, but also because it actively made our Continuous Integration processes better, with interactive development and a more complete computing environment.

We are thinking of opening up the same service to others. Are you interested in running your CI using Nextjournal? Let us know.