JSoC 2019 Blog #2 (End of Phase One): How's BERT going?

It has been a month since JSoC 2019 started, which also means we have reached the end of phase one. In this post, I will talk about what I have done during this month and demonstrate how to use the code.

Using BERT with Transformers.jl

All the code lives in the Transformers#bert branch. Since it's not finished, some of the API might change in the future; once it is complete, it will be merged into #master. I will show how to use the code step by step in the following sections.

Prepare environment

To use the BERT code, we need to check out the #bert branch:

using Pkg
pkg"add Transformers#bert"

Here are some other packages we will need later. Please note that you have to install TensorFlow.jl beforehand if you want to test the code on your own computer.

using Flux
using CuArrays
using WordTokenizers
using TensorFlow #not in the dependency; run `pkg"add TensorFlow"` to install
using Transformers

using Transformers.BidirectionalEncoder

Processing the pre-trained model

As mentioned in the last post, being able to use a pre-trained model is one of the pleasant features of BERT. However, the pre-trained weights were released as TensorFlow checkpoint files, so we need to convert them and save the result in a Julia-friendly file format (here we use BSON.jl).

First, we need to download a pre-trained file. This is one of the links found in the official BERT repo; we download it to our computer.

wget https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip

readdir()[findfirst(n->startswith(n, "multi"), readdir())]

"multi_cased_L-12_H-768_A-12.bson"

Once the pre-trained file is ready, we can start the conversion with the tfckpt2bson function. The result is saved to a file with the same name but a different filename extension. (This will take a few minutes.)

BidirectionalEncoder.tfckpt2bson("multi_cased_L-12_H-768_A-12.zip")

"multi_cased_L-12_H-768_A-12.bson"

This is the only function that needs TensorFlow.jl, so if you already have the bson file, or someone has done the conversion for you, you don't need that package.

Loading pre-trained model

Now that we have the pre-trained weights in BSON format, we use load_bert_pretrain to load the saved model (or use BSON.load). You will also find the tokenizer and WordPiece inside that file.

bert_model, wordpiece, tokenizer = load_bert_pretrain("multi_cased_L-12_H-768_A-12.bson")

# is equivalent to ---
# using BSON
# bert_bson = BSON.load("multi_cased_L-12_H-768_A-12.bson")

Then, we have the desired model and other related objects.

Process input

Before we can run BERT on our sentences, we need to process the input a little. Here I will show how to use the pre-trained model to get sentence representations.

sample1 = "We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab."
sample2 = "quick fox jumps over the lazy dog"
sample3 = "I can eat glass, it doesn't hurt me."
sample = [sample1, sample2, sample3]

3-element Array{String,1}:
 "We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab."
 "quick fox jumps over the lazy dog"
 "I can eat glass, it doesn't hurt me."

Next, we run the tokenizer and WordPiece on each sample.

processed_sample = wordpiece.(tokenizer.(sample))

3-element Array{Array{String,1},1}:
 ["We", "want", "the", "speed", "of", "[UNK]", "with", "the", "dy", "##nami" … "with", "obvious", "[UNK]", "familiar", "mathematical", "notation", "like", "Mat", "##lab", "[UNK]"]
 ["quick", "[UNK]", "[UNK]", "over", "the", "la", "##zy", "dog"]
 ["[UNK]", "can", "eat", "glass", "[UNK]", "it", "doesn", "[UNK]", "[UNK]", "[UNK]", "me", "[UNK]"]
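To give a feel for where pieces such as "la", "##zy" and "[UNK]" come from, here is a minimal sketch of greedy longest-match-first WordPiece splitting. It is not the code used in Transformers.jl; toy_wordpiece and toy_vocab are hypothetical names, and the tiny vocabulary is made up for illustration.

```julia
# Toy greedy longest-match-first WordPiece split of a single word.
# `vocab` is a hypothetical miniature vocabulary, not the real BERT one.
function toy_wordpiece(word::AbstractString, vocab::Set{String}; unk="[UNK]")
    chars = collect(word)
    pieces = String[]
    start = 1
    while start <= length(chars)
        stop = length(chars)
        found = nothing
        # Try the longest remaining substring first, then shrink it.
        while stop >= start
            piece = String(chars[start:stop])
            start > 1 && (piece = "##" * piece)  # continuation pieces get a "##" prefix
            if piece in vocab
                found = piece
                break
            end
            stop -= 1
        end
        # If no piece matches, the whole word becomes the unknown token.
        found === nothing && return [unk]
        push!(pieces, found)
        start = stop + 1
    end
    return pieces
end

toy_vocab = Set(["la", "##zy", "dog", "Mat", "##lab"])
toy_wordpiece("lazy", toy_vocab)    # ["la", "##zy"]
toy_wordpiece("Matlab", toy_vocab)  # ["Mat", "##lab"]
toy_wordpiece("fox", toy_vocab)     # ["[UNK]"]
```

A word that cannot be fully covered by vocabulary pieces collapses to a single unknown token in this sketch.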

Then we use Vocabulary to turn each token into an embedding index.

vocab = Transformers.Basic.Vocabulary(wordpiece.vocab, wordpiece.vocab[wordpiece.unk_idx])

sample_indices = vocab(processed_sample)

44×3 Array{Int64,2}:
  12866  69610    101
  21529    101  10945
  10106    101  69111
  19086  10492  32363
  10109  10106    101
    101  10110  10272
  10170  12548  47799
  10106  17836    101
  13907    101    101
  46526    101    101
      ⋮
  94453    101    101
    101    101    101
  29627    101    101
  73470    101    101
 100238    101    101
  11851    101    101
  57472    101    101
  41285    101    101
    101    101    101
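In the output above, each column is one sentence, and shorter sentences are filled up with 101, the same index that "[UNK]" maps to. The following is a rough sketch of that kind of lookup, not the actual Vocabulary implementation; toy_lookup and the small token list are hypothetical.

```julia
# Hypothetical stand-in for a vocabulary lookup: maps tokens to 1-based indices,
# falls back to the unknown token, and pads shorter sentences with it.
function toy_lookup(sentences::Vector{Vector{String}}, tokens::Vector{String}, unk::String)
    index = Dict(tok => i for (i, tok) in enumerate(tokens))
    unk_idx = index[unk]
    maxlen = maximum(length, sentences)
    # One column per sentence, pre-filled with the unknown index as padding.
    out = fill(unk_idx, maxlen, length(sentences))
    for (j, sent) in enumerate(sentences), (i, tok) in enumerate(sent)
        out[i, j] = get(index, tok, unk_idx)
    end
    return out
end

toy_tokens = ["[UNK]", "the", "la", "##zy", "dog"]
toy_batch = [["the", "la", "##zy", "dog"], ["dog", "cat"]]
toy_lookup(toy_batch, toy_tokens, "[UNK]")
# 4×2 matrix with columns [2, 3, 4, 5] and [5, 1, 1, 1]
```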

Besides the sample indices, we also need the segment indices. Since we only take one sentence as input, we can just use ones.

seg_indices = ones(Int, size(sample_indices)...)

44×3 Array{Int64,2}:
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 ⋮
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1

And don't forget the masks, which mark real tokens with 1.0 and padding positions with 0.0.

masks = Transformers.Basic.getmask(processed_sample)

1×44×3 Array{Float32,3}:
[:, :, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0

[:, :, 2] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0

[:, :, 3] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
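The mask has the layout 1×length×batch: a 1.0 wherever a sentence has a real token and a 0.0 where it was padded. A small sketch of how such a mask could be built by hand is shown here; toy_mask is a hypothetical helper, not the getmask implementation.

```julia
# Hypothetical re-creation of a padding mask with the 1×length×batch layout shown above.
function toy_mask(sentences::Vector{Vector{String}})
    maxlen = maximum(length, sentences)
    mask = zeros(Float32, 1, maxlen, length(sentences))
    for (j, sent) in enumerate(sentences)
        mask[1, 1:length(sent), j] .= 1  # real tokens get 1.0, padding stays 0.0
    end
    return mask
end

m = toy_mask([["a", "b", "c"], ["d"]])
size(m)     # (1, 3, 2)
m[1, :, 2]  # Float32[1.0, 0.0, 0.0]
```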

Get Embeddings

Next we need to turn those indices into embeddings. This is done by bert_model.embed.

bert_model.embed

CompositeEmbedding(tok = Embed(768), segment = Embed(768), pe = PositionEmbedding(768, max_len=512), postprocessor = Positionwise{Tuple{LayerNorm{TrackedArray{…,Array{Float32,1}}},Dropout{Float64}}}((LayerNorm(768), Dropout{Float64}(0.1, true))))

It composes the different embeddings together. You run them by passing the indices as keyword arguments, where the keyword name specifies which embedding the indices are for. (The position embedding is applied automatically.)

embeddings = bert_model.embed(tok=sample_indices, segment=seg_indices)
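Conceptually, and assuming the three embeddings are summed as in the original BERT design, the composite embedding can be pictured with plain arrays as below. This is only a sketch: the table names and toy sizes are hypothetical, and the LayerNorm and Dropout post-processing shown in the printed CompositeEmbedding is left out.

```julia
# Toy dimensions and random tables; all names here are hypothetical.
hidden, vocab_size, max_len = 4, 10, 8
tok_table = randn(Float32, hidden, vocab_size)
seg_table = randn(Float32, hidden, 2)
pos_table = randn(Float32, hidden, max_len)

tok_idx = [3 7; 5 1; 2 1]           # 3 tokens × 2 sentences
seg_idx = ones(Int, size(tok_idx))  # single-sentence input, so all ones

# Indexing a table with an index matrix gives a hidden×length×batch array.
tok_emb = tok_table[:, tok_idx]
seg_emb = seg_table[:, seg_idx]
# Positions depend only on the row, so they broadcast over the batch dimension.
pos_emb = pos_table[:, 1:size(tok_idx, 1)]

toy_embeddings = tok_emb .+ seg_emb .+ pos_emb
size(toy_embeddings)  # (4, 3, 2)
```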

Get the Representations

Finally, we can pass the input embeddings and masks to get the sentence representations.

representations = bert_model.transformers(embeddings, masks)

You can also get the output of every transformer layer with the all keyword argument.

representations, all_outputs = bert_model.transformers(embeddings, masks;all=true)
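The idea behind such a keyword can be sketched with a plain stack of functions. This is only an illustration of the pattern, not how bert_model.transformers is implemented; toy_stack and toy_layers are hypothetical.

```julia
# Hypothetical stack runner; `layers` is any collection of callables.
function toy_stack(layers, x; all::Bool=false)
    outputs = Any[]
    for layer in layers
        x = layer(x)
        all && push!(outputs, x)  # keep each intermediate result on request
    end
    return all ? (x, outputs) : x
end

toy_layers = [x -> x .+ 1, x -> x .* 2, x -> x .- 3]
toy_stack(toy_layers, [1, 2])  # [1, 3]
final, per_layer = toy_stack(toy_layers, [1, 2]; all=true)
per_layer  # Any[[2, 3], [4, 6], [1, 3]]
```

The last element of per_layer equals the final output, so nothing is lost by asking for all layers.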

This is what we have in #bert for now.

Conclusion

Currently I have only implemented the forward pass. The next step is implementing the pre-training functions, and if we have enough time, I will try to make a TPU version with XLA.jl.

Appendix: I don't like the struct

As you may have seen, I use a custom wrapper struct to put everything in one variable (the bert_model). However, some people (including me) might feel unhappy about it one day, so we also handle this situation.

In the previous sections, we used tfckpt2bson to convert the checkpoint to a wrapped struct. This time, we only extract the variables from the TensorFlow checkpoint files and save them unmodified with a .tfbson filename extension.

BidirectionalEncoder.tfckpt2bson("multi_cased_L-12_H-768_A-12.zip"; raw=true)

"multi_cased_L-12_H-768_A-12.tfbson"

We use the raw keyword to specify this. If you want to use your own pre-trained model but saved its files under different names from the ones Google used, just pass the filenames with the corresponding keywords, like this:

#=
tfckpt2bson("model_files_zip_or_can_be_a_folder"; raw=true,
            saveto="/my/data/volumn/",
            confname="mybert_config.json",
            ckptname="mybert.ckpt",
            vocabname="special_vocab.txt")
=#

You can still use load_bert_pretrain to load the raw weights.

config, weights, vocab = load_bert_pretrain("multi_cased_L-12_H-768_A-12.tfbson")

Then you can handle those variable names and weights yourself.

weights
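As a rough idea of what handling the names yourself could look like, the sketch below groups variable names by encoder layer. It assumes the raw weights behave like a dictionary from TensorFlow variable names to arrays, which this post does not show, so toy_weights is a hypothetical stand-in with names written in the style of Google's BERT checkpoints.

```julia
# Hypothetical stand-in for the raw weights: variable name => array.
toy_weights = Dict(
    "bert/embeddings/word_embeddings" => zeros(Float32, 2, 3),
    "bert/encoder/layer_0/attention/self/query/kernel" => zeros(Float32, 3, 3),
    "bert/encoder/layer_0/attention/self/query/bias" => zeros(Float32, 3),
    "bert/encoder/layer_1/attention/self/query/kernel" => zeros(Float32, 3, 3),
)

# Collect the variable names that belong to each encoder layer.
by_layer = Dict{Int,Vector{String}}()
for name in sort(collect(keys(toy_weights)))
    m = match(r"layer_(\d+)", name)
    m === nothing && continue  # skip variables outside the encoder layers
    push!(get!(by_layer, parse(Int, m.captures[1]), String[]), name)
end

length(by_layer[0])  # 2
```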
