Learn XTDB Datalog Today
Learn XTDB Datalog Today is derived from the classic learndatalogtoday.org tutorial materials, but adapted to focus on the xtdb API and unique Datalog semantics, and extended with additional topics and exercises. It is an interactive tutorial designed to teach you the xtdb dialect of Datalog. Datalog is a declarative database query language with roots in logic programming. Datalog has similar expressive power to SQL.
You can follow along by hitting the "Remix" button on the toolbar to start your own interactive session running entirely on the Nextjournal platform, or you can copy the steps into a Clojure REPL running locally on your machine. You could also use curl (or similar) to send the data and queries in this tutorial via HTTP to a pre-built Docker image.
master.md source file for this notebook is available on GitHub.
You need to get xtdb running before you can use it. Here we are using Clojure on the JVM and running xtdb locally, as an embedded in-memory library. However, the Datalog query API is identical for all clients and all these queries would work equally well over the network via HTTP and the Java/Clojure client library.
Next we need some data. XTDB interprets maps as "documents." These require no pre-defined schema -- they only need a valid ID attribute. In the data below (which is defined using edn, and fully explained in the section) we are using negative integers as IDs. Any top-level attributes that refer to these integers can be interpreted in an ad-hoc way and traversed by the query engine -- this capability is known as "schema-on-read".
This vector of maps contains two kinds of documents: documents relating to people (actors and directors) and documents relating to movies. As a convention to aid human interpretation, all persons have IDs like
-1XX and all movies have IDs like
-2XX. Many ID value types are supported, such as strings and UUIDs, which may be more appropriate in a real application.
Note that xtdb also has a JSON-over-HTTP API that naturally supports JSON documents, this is possible because JSON can be very simply mapped to a subset of edn.
To start an in-memory instance of xtdb, you can use the
start-node function like so:
Loading the small amount of data we defined under
my-docs above can be comfortably done in a single transaction. In practice you will often find benefit to batch
put operations into groups of 1000 at a time. The following code maps over the docs to generate a single transaction containing one
:xtdb.api/put operation per document, then submits the transaction. Finally we call the
sync function to ensure that the documents are fully indexed (and that the transaction has succeeded) before we attempt to run any queries -- this is necessary because of xtdb's asynchronous design.
With xtdb running and the data loaded, you can now execute a query, which is a Clojure map, by passing it to xtdb's
q API, which takes the result of a
db call as it's first value. The meaning of this query will become apparent very soon!
To simplify this
xt/q call throughout the rest of the tutorial we can define a new
q function that saves us a few characters and visual clutter.
Queries can then be executed trivially:
Extensible Data Notation
In xtdb, a Datalog query is written in extensible data notation (edn). Edn is a data format similar to JSON, but it:
Edn consists of:
"This is a string"
[1 2 3]
[foo "bar" ?baz 123 ...]
(3.14 :foo [:bar :baz]),
(+ 1 2 3 4)
.. and a few other things which we will not need in this tutorial.
Here is an example query that finds all movie titles in our example database:
Note that the query is a map with two key-value pairs:
XTDB also supports queries in an alternative "vector" format:
However, in this tutorial we will use the map format. Also note that xtdb does not require logical variables to be preceded by
?, although you may use this convention if you wish.
Q1. Find all the movie titles in the database
The example database we'll use contains movies mostly, but not exclusively, from the 80s. You'll find information about movie titles, release year, directors, cast members, etc. As the tutorial advances we'll learn more about the contents of the database and how it's organized.
The data model in xtdb is based around atomic collections of facts. Those atomic collections of facts are called documents. The facts are called triples. A triple is a 3-tuple consisting of:
Although it is the document which is atomic in xtdb (and not the triple), you can think of the database as a flat set of triples of the form:
Note that the last two triples share the same entity ID, which means they are facts about the same movie (one document). Note also that the last triple's value is the same as the first triple's entity ID, i.e. the value of the
:movie/director attribute is itself an entity.
A query is represented as a map with at least two key-value pairs. In the first pair, the key is the keyword
:find, and the value is a vector of one or more logic variables (symbols, e.g.
e). The other key-value pair is the
:where keyword key with a vector of clauses which restrict the query to triples that match the given data patterns.
For example, this query finds all entity-ids that have the attribute
:person/name with a value of
The simplest data pattern is a triple with some parts replaced with logic variables. It is the job of the query engine to figure out every possible value of each of the logic variables and return the ones that are specified in the
_ can be used as a wildcard for the parts of the data pattern that you wish to ignore. You can also elide trailing values in a data pattern. Therefore, the following two queries are equivalent.
Q1. Find the entity ids of movies made in 1987
Q2. Find the entity-id and titles of movies in the database
Q3. Find the name of all people in the database
In the previous chapter, we looked at data patterns, i.e., vectors within the
:where vector, such as
[e :movie/title "Commando"]. There can be many data patterns in a
The important thing to note here is that the logic variable
e is used in both data patterns. When a logic variable is used in multiple places, the query engine requires it to be bound to the same value in each place. Therefore, this query will only find movie titles for movies made in 1987.
The order of the data patterns does not matter. XTDB ignores the user-provided clause ordering so the query engine can optimize query execution. Thus, the previous query could just as well have been written this way:
In both cases, the result set will be exactly the same.
Let's say we want to find out who starred in "Lethal Weapon". We will need three data patterns for this. The first one finds the entity ID of the movie with "Lethal Weapon" as the title:
Using the same entity ID at
m, we can find the cast members with the data pattern:
In this pattern,
p will now be (the entity ID of) a person entity, so we can grab the actual name with:
The query will therefore be:
Q1. Find movie titles made in 1985
Q2. What year was "Alien" released?
Q3. Who directed RoboCop? You will need to use
[<movie-eid> :movie/director <person-eid>] to find the director for a movie.
Q4. Find directors who have directed Arnold Schwarzenegger in a movie.
Looking at this query:
It would be great if we could reuse this query to find movie titles for any actor and not just for "Sylvester Stallone". This is possible with an
:in clause, which provides the query with input parameters, much in the same way that function or method arguments do in your programming language.
Here's that query with an input parameter for the actor:
This query takes one argument,
name, which will be the name of some actor.
The above query is executed like
(q db query "Sylvester Stallone"), where
query is the query we just saw, and
db is a database value. You can have any number of inputs to a query.
In the above query, the input logic variable
name is bound to a scalar - a string in this case. There are four different kinds of input: scalars, tuples, collections, and relations.
A quick aside
Note that an implicit first argument
$ is also available, should it ever be needed, which is the value of database
db itself. You can use this in conjunction with more advanced features like subqueries and custom Clojure predicates (more on those later!).
A tuple input is written as e.g.
[name age] and can be used when you want to destructure an input. Let's say you have the vector
["James Cameron" "Arnold Schwarzenegger"] and you want to use this as input to find all movies where these two people collaborated:
Of course, in this case, you could just as well use two distinct inputs instead:
You can use collection destructuring to implement a kind of logical OR in your query. Say you want to find all movies directed by either James Cameron or Ridley Scott:
director logic variable is initially bound to both "James Cameron" and "Ridley Scott". Note that the ellipsis following
director is a literal, not elided code.
Relations - a set of tuples - are the most interesting and powerful of input types, since you can join external relations with the triples in your database.
As a simple example, let's consider a relation with tuples
Let's use this data and the data in our database to find box office earnings for a particular director:
Note that the
box-office logic variable does not appear in any of the data patterns in the
Q1. Find movie title by year
Q2. Given a list of movie titles, find the title and the year that movie was released.
Q3. Find all movie
titles where the
actor and the
director has worked together
Q4. Write a query that, given an actor name and a relation with movie-title/rating, finds the movie titles and corresponding rating for which that actor was a cast member.
So far, we have only been dealing with data patterns:
[m :movie/year year]. We have not yet seen a proper way of handling questions like "Find all movies released before 1984". This is where predicate clauses come into play.
Let's start with the query for the question above:
The last clause,
[(< year 1984)], is a predicate clause. The predicate clause filters the result set to only include results for which the predicate returns a "truthy" (non-nil, non-false) value. You can use any Clojure function as a predicate function:
Clojure functions must be fully namespace-qualified, so if you have defined your own predicate
awesome? you must write it as
(my.namespace/awesome? movie). All
clojure.core/* functions may be used as predicates without namespace qualification:
<, >, <=, >=, =, not= and so on. XTDB provides a "Predicate AllowList" feature to restrict the exact set of predicates available to queries.
Q1. Find movies older than a certain year (inclusive)
Q2. Find actors older than Danny Glover
Q3. Find movies newer than
year (inclusive) and has a
rating higher than the one supplied
Transformation functions are pure (side-effect free) functions which can be used in queries as "function expression" predicates to transform values and bind their results to new logic variables. Say, for example, there exists an attribute
:person/born with type
:db.type/instant. Given the birthday, it's easy to calculate the (very approximate) age of a person:
With this function, we can now calculate the age of a person inside the query itself:
A transformation function clause has the shape
[(<fn> <arg1> <arg2> ...) <result-binding>] where
<result-binding> can be the same binding forms as we saw in chapter 3:
One thing to be aware of is that transformation functions can't be nested. For example, you can't write:
Instead, you must bind intermediate results in temporary logic variables:
Q1. Find people by age. Use the function
user/age to find the names of people, given their age and a date representing "today".
Q2. Find the names of people younger than Bruce Willis and their corresponding age.
Aggregate functions such as
max etc. are readily available in xtdb's Datalog implementation. They are written in the
:find clause in your query:
An aggregate function collects values from multiple triples and returns
A single value:
A collection of values:
(min n d)
(max n d)
(sample n e) etc. where
n is an integer specifying the size of the collection.
count the number of movies in the database
Q2. Find the birth date of the oldest person in the database.
Q3. Given a collection of actors and (the now familiar) ratings data. Find the average rating for each actor. The query should return the actor name and the
Many times over the course of this tutorial, we have had to write the following three lines of repetitive query code:
Rules are the means of abstraction in Datalog. You can abstract away reusable parts of your queries into rules, give them meaningful names and forget about the implementation details, just like you can with functions in your favorite programming language. Let's create a rule for the three lines above:
The first vector is called the head of the rule where the first symbol is the name of the rule. The rest of the rule is called the body.
You can think of a rule as a kind of function, but remember that this is logic programming, so we can use the same rule to:
Put another way, we can use both
(actor-movie name title) for input as well as for output. If we provide values for neither, we'll get all the possible combinations in the database. If we provide values for one or both, it will constrain the result returned by the query as you'd expect.
To use the above rule, you simply write the head of the rule instead of the data patterns. Any variable with values already bound will be input, the rest will be output.
The query to find cast members of some movie, for which we previously had to write:
You can write any number of rules, collect them in a vector, and pass them to the query engine using the
:rules key, as above.
You can use data patterns, predicates, transformation functions and calls to other rules in the body of a rule.
Rules can also be used as another tool to write logical OR queries, as the same rule name can be used several times:
Subsequent rule definitions will only be used if the ones preceding it aren't satisfied.
Using this rule, we can find both directors and cast members very easily:
Given the fact that rules can contain calls to other rules, what would happen if a rule called itself? Interesting things, it turns out, but let's find out in the exercises.
Q1. Write a rule
(movie-year title year) where
title is the title of some movie and
year is that movie's release year.
Q2. Two people are friends if they have worked together in a movie. Write a rule
(friends p1 p2) where
p2 are person entities. Try with a few different
name inputs to make sure you got it right. There might be some edge cases here.
Congratulations for making it through the tutorial - we hope this knowledge helps you in your Datalog journey! Any and all feedback is appreciated, as are new contributions, please email email@example.com or open an issue via GitHub
Copyright & License
The MIT License (MIT)
Copyright © 2013 - 2021 Jonas Enlund
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Thank you Jonas and contributors for freely licensing your excellent materials!