Peter Cheng / May 26 2019
Remix of Julia by Nextjournal

JSoC 2019: Practical Implementation of BERT models for Julia

I was very lucky to be part of JSoC 2019, the Julia Season of Contributions. This is the first blog post of my project: Practical Implementation of BERT models for Julia. BERT is one of the most powerful language models, proposed by Google AI in 2018. By implementing BERT in Julia, I hope it will be useful to NLP researchers who love Julia, and also show the beauty and power of the language. The details of the model will be introduced in future posts. Before we get there, we need to know about the Transformer model, which was also proposed by Google, in 2017.

The following content covers a basic introduction to the Transformer model and its implementation in Julia, Transformers.jl.

Transformer model

The Transformer model was proposed in the paper Attention Is All You Need. The paper provides a new way of handling sequence transduction problems (such as machine translation) without a complex recurrent or convolutional structure: it simply uses a stack of attention mechanisms to capture the latent structure of the input sentences, and a special embedding (the positional embedding) to capture the word order. The whole model is an encoder-decoder architecture built from the components described below.

Multi-Head Attention

Instead of using a single regular attention mechanism, they split the input vectors into several pairs of subvectors and perform a scaled dot-product attention on each subvector pair.

For those who like mathematical expressions, here are the formulas:

Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V

MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O \quad \text{where } head_i = Attention(QW^Q_i, KW^K_i, VW^V_i)
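
To make the formulas concrete, here is a minimal sketch of scaled dot-product attention and a loop-based multi-head attention for a single sentence, written with plain Julia arrays (one row per position). It only illustrates the math above; it is not the implementation used in Transformers.jl, and all the names and sizes below are made up for the example.

#scaled dot-product attention; Q, K, V are (seq_len, d_k) matrices, one row per position
function attention(Q, K, V)
  dk = size(K, 2)
  scores = softmax_rows(Q * K' ./ sqrt(dk)) #(seq_len, seq_len) attention weights
  return scores * V                         #(seq_len, d_k)
end

#row-wise softmax (no numerical-stability tricks, this is just a sketch)
softmax_rows(x) = exp.(x) ./ sum(exp.(x); dims=2)

#multi-head attention: project into h subspaces, attend in each, concatenate, project back
function multihead(Q, K, V, WQ, WK, WV, WO)
  h = length(WQ)
  heads = [attention(Q * WQ[i], K * WK[i], V * WV[i]) for i = 1:h]
  return hcat(heads...) * WO
end

#toy self-attention: 10 positions, model size 512, 8 heads of size 64
seq, dm, h, dk = 10, 512, 8, 64
x  = randn(seq, dm)
WQ = [randn(dm, dk) for _ = 1:h]; WK = [randn(dm, dk) for _ = 1:h]
WV = [randn(dm, dk) for _ = 1:h]; WO = randn(h * dk, dm)
multihead(x, x, x, WQ, WK, WV, WO) #returns a 10×512 matrix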

Positional Embedding

As we mentioned above, the Transformer model doesn't depend on a recurrent or convolutional structure. On the other hand, we still need a way to differentiate two sequences with the same words but in different orders. Therefore, they add positional information to the embedding, i.e. the original word embedding plus a special embedding that indicates the position of that word in the sequence. The special embedding can be computed by a fixed equation or learned as another trainable embedding matrix. In the paper, the positional embedding uses this formula:

PE_{(pos, k)} = \begin{cases} \sin(\frac{pos}{10^{4k/d_k}}), & \text{if } k \text{ is even}\\ \cos(\frac{pos}{10^{4k/d_k}}), & \text{if } k \text{ is odd} \end{cases}

where pos is the positional information that tells you the given word is the pos-th word of the sequence, k is the k-th dimension of the input vector, and d_k is the total length of the word/positional embedding. So the new embedding will be computed as:

Embedding_k(word) = WordEmbedding_k(word) + PE_{(pos\_of\_word, k)}
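
As a quick sanity check, the formula can be written in a few lines of plain Julia. This is only a sketch that follows the formula above literally (with the dimension index k counted from 0); Transformers.jl already provides this functionality as the PositionEmbedding layer used later in this post, so the names below are just for illustration.

#sinusoidal positional embedding following the formula above
#returns a dmodel × seq_len matrix with one column per position
function positional_embedding(seq_len, dmodel)
  pe = zeros(Float32, dmodel, seq_len)
  for pos = 1:seq_len, k = 0:dmodel-1 #k is the 0-based dimension index from the formula
    angle = pos / 10_000^(k / dmodel)
    pe[k+1, pos] = iseven(k) ? sin(angle) : cos(angle)
  end
  return pe
end

#the new embedding is just the word embedding plus the positional embedding
#(wordemb is assumed to be a dmodel × seq_len matrix of word vectors)
embedding_with_position(wordemb) =
  wordemb .+ positional_embedding(size(wordemb, 2), size(wordemb, 1))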

Julia implementation: Transformers.jl

Now that we know what the Transformer model looks like, let's take a look at the package Transformers.jl. The package is built on top of a famous deep learning framework in Julia, Flux.jl.

Example

To best illustrate the usage of Transformers.jl, we will start by building a two-layer Transformer model for a sequence copy task. Before we start, we need to install all the packages we need:

using Pkg
Pkg.add("CuArrays")
Pkg.add("Flux")
Pkg.add("Transformers")

We use CuArrays.jl for GPU support.

using Flux
using CuArrays
using Transformers
using Transformers.Basic #for loading the positional embedding

Copy task

The copy task is a toy sequence transduction problem that simply returns the input sequence as the output. Here we define the input as a random sequence of numbers from 1 to 10 with length 10. We will also need a start symbol and an end symbol to indicate where the sequence begins and ends. We can use Transformers.Basic.Vocabulary to turn the input into the corresponding indices.

labels = collect(1:10)
startsym = 11
endsym = 12
unksym = 0
labels = [unksym, startsym, endsym, labels...]
vocab = Vocabulary(labels, unksym)
Vocabulary(13, unk=0)
#function for generating training data
sample_data() = (d = rand(1:10, 10); (d,d))
#function for adding start & end symbol
preprocess(x) = [startsym, x..., endsym]

@show sample = preprocess.(sample_data())
@show encoded_sample = vocab(sample[1]) #use Vocabulary to encode the training data
12-element Array{Int64,1}: 2 8 7 5 8 5 8 8 8 10 11 3

Defining the model

With Transformers.jl and Flux.jl, we can define the model easily. We use Transformer layers with a hidden size of 512 and 8 attention heads.

#define a word embedding layer which turns word indices into word vectors
embed = Embed(512, length(vocab)) |> gpu
#define the position embedding layer mentioned above
pe = PositionEmbedding(512) |> gpu

#wrapper to get the input embedding (scaled word embedding plus position embedding)
function embedding(x)
  we = embed(x, inv(sqrt(512)))
  e = we .+ pe(we)
  return e
end

#define 2 layers of transformer encoder
encode_t1 = Transformer(512, 8, 64, 2048) |> gpu
encode_t2 = Transformer(512, 8, 64, 2048) |> gpu

#define 2 layers of transformer decoder
decode_t1 = TransformerDecoder(512, 8, 64, 2048) |> gpu
decode_t2 = TransformerDecoder(512, 8, 64, 2048) |> gpu

#define the layer to get the final output probabilities
linear = Positionwise(Dense(512, length(vocab)), logsoftmax) |> gpu

function encoder_forward(x)
  e = embedding(x)
  t1 = encode_t1(e)
  t2 = encode_t2(t1)
  return t2
end

function decoder_forward(x, m)
  e = embedding(x)
  t1 = decode_t1(e, m)
  t2 = decode_t2(t1, m)
  p = linear(t2)
  return p
end
decoder_forward (generic function with 1 method)

Then we run the model on the sample:

enc = encoder_forward(encoded_sample)
probs = decoder_forward(encoded_sample, enc)
Tracked 13×12 CuArray{Float32,2} (the output log-probabilities: one row per vocabulary entry, one column per position)

We can also use Transformers.Stack to define the encoder and decoder, so you can define multiple layers and the corresponding forward functions at once. See the README for more information about the API.

Define the loss and training loop

For the last step, we need to define the loss function and the training loop. We use the KL divergence on the output probabilities, with label smoothing on the targets.

#label smoothing: give the target probability 1 - 1e-6 and spread a small mass over the other entries
function smooth(et)
  sm = fill!(similar(et, Float32), 1e-6/size(embed, 2))
  p = sm .* (1 .+ -et)
  label = p .+ et .* (1 - convert(Float32, 1e-6))
  return label
end

#define the loss function
function loss(x, y)
  label = onehot(vocab, y) #turn the indices into one-hot encodings
  label = smooth(label) #perform label smoothing
  enc = encoder_forward(x)
  probs = decoder_forward(y, enc)
  l = logkldivergence(label[:, 2:end, :], probs[:, 1:end-1, :]) #shift by one position: predict the next token
  return l
end

#collect all the parameters
ps = params(embed, pe, encode_t1, encode_t2, decode_t1, decode_t2, linear)
opt = ADAM(1e-4)

#function for creating batched data
using Transformers.Datasets: batched

#flux functions for computing gradients and updating parameters
using Flux: gradient
using Flux.Optimise: update!

#define the training loop
function train!()
  @info "start training"
  for i = 1:2000
    data = batched([sample_data() for i = 1:32]) #create 32 random samples and batch them
    x, y = preprocess.(data[1]), preprocess.(data[2])
    x, y = vocab(x), vocab(y) #encode the data
    x, y = todevice(x, y) #move to gpu
    l = loss(x, y)
    grad = gradient(()->l, ps)
    if i % 8 == 0
      println("loss = $l")
    end
    update!(opt, ps, grad)
  end
end
train! (generic function with 1 method)
train!()

Test our model

After training, we can try to test the model.

using Flux: onecold
function translate(x)
  ix = todevice(vocab(preprocess(x)))
  seq = [startsym]

  enc = encoder_forward(ix)

  len = length(ix)
  for i = 1:2len
    trg = todevice(vocab(seq))
    dec = decoder_forward(trg, enc)
    #move back to the cpu with `collect`, since argmax gives wrong results on CuArrays
    ntok = onecold(collect(dec), labels)
    push!(seq, ntok[end])
    ntok[end] == endsym && break
  end
  seq[2:end-1] #strip the start and end symbols
end
translate (generic function with 1 method)
translate([5,5,6,6,1,2,3,4,7, 10])
10-element Array{Int64,1}: 5 5 6 6 1 2 3 4 7 10

The result looks good!

Last of All

This is just a basic introduction to the base model used inside BERT. The following work will be built on top of it. If you have any questions or encounter any problems with Transformers.jl, you can open an issue or tag me (@chengchingwen) on the Julia Slack/Discourse. I hope we can write tons of interesting code during JSoC 2019.