JSoC 2019 - Blog #2 (End of Phase One): How's BERT going?
It has been a month since JSoC 2019 started, which also means we have reached the end of phase one. So in this blog post, I will talk about what I have done during this month and demonstrate how to use the code.
Using BERT with Transformers.jl
All the code is on the Transformers.jl #bert branch. Since the work is not finished, some of the APIs might change in the future; once it is 100% finished, it will be merged into #master. I will show how to use the code step by step in the following sections.

To use the BERT code, we need to check out the #bert branch:
using Pkg
pkg"add Transformers#bert"
Here are some other packages we will need later. Please note that you have to install TensorFlow.jl beforehand if you want to run the conversion on your own computer.
using Flux
using CuArrays
using WordTokenizers
using TensorFlow # not in the dependencies; run `pkg"add TensorFlow"` to install
using Transformers
using Transformers.BidirectionalEncoder
Processing the pre-trained model
As we mentioned in the last blog, using a pre-trained model is one of the pleasant features BERT offers. However, the pre-trained weights were released as TensorFlow checkpoint files, so we need to convert the pre-trained files and save them in a format Julia can read easily (here we use BSON.jl to store everything).
First, we need to download a pre-trained file. This is one of the file links found on the official BERT repo; we download it to our computer.
readdir()[findfirst(n->startswith(n, "multi"), readdir())]
Once the pre-trained file is ready, we can start the conversion with the tfckpt2bson function. The result is a saved file with the same name but a different filename extension. (This will take a few minutes.)
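A minimal sketch of the conversion call (the file name is the multilingual model downloaded above; the exact signature may still change on the #bert branch):

```julia
using Transformers.BidirectionalEncoder

# convert the TensorFlow checkpoint zip into a BSON file;
# this is the only step that requires TensorFlow.jl
tfckpt2bson("multi_cased_L-12_H-768_A-12.zip")
# the result is saved with the same name but a .bson extension
```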
This is the only function that needs TensorFlow.jl, so if you already have the BSON file, or someone did the conversion for you, you don't need that package anymore.
Loading pre-trained model
Now that we have our pre-trained weights in BSON format, we can use load_bert_pretrain to load the saved model (or use BSON.load directly). You will also find the WordPiece tokenizer inside that file.
bert_model, wordpiece, tokenizer = load_bert_pretrain("multi_cased_L-12_H-768_A-12.bson")

# the above is equivalent to:
# using BSON
# bert_bson = BSON.load("multi_cased_L-12_H-768_A-12.bson")
Then we have the desired model and the other related objects.
Before we can run BERT on our sentences, we need to process the input a little. Here I will show you how to use the pre-trained model to get sentence representations.
sample1 = "We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab."
sample2 = "quick fox jumps over the lazy dog"
sample3 = "I can eat glass, it doesn't hurt me."
sample = [sample1, sample2, sample3]
Run the tokenizer and WordPiece on each sample:
processed_sample = wordpiece.(tokenizer.(sample))
Then we use Vocabulary to turn each token into an embedding index.
vocab = Transformers.Basic.Vocabulary(wordpiece.vocab, wordpiece.vocab[wordpiece.unk_idx])
sample_indices = vocab(processed_sample)
Besides the sample indices, we also need the segment indices. Since we only take single sentences as input here, we can just use all ones:
seg_indices = ones(Int, size(sample_indices)...)
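For sentence-pair inputs, the segment indices would instead mark which sentence each token comes from. A hypothetical sketch, assuming the segment embedding is 1-indexed (1 for the first sentence, 2 for the second):

```julia
# e.g. a pair where sentence A has 5 tokens and sentence B has 4 tokens:
# tokens from A get segment 1, tokens from B get segment 2
pair_seg_indices = vcat(fill(1, 5), fill(2, 4))  # [1,1,1,1,1,2,2,2,2]
```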
And don't forget the masks.
masks = Transformers.Basic.getmask(processed_sample)
Next, we need to turn those indices into embeddings. This is done by bert_model.embed, which composes different embeddings together and runs them when we pass the indices with a name specifying which embedding each set of indices is for. (The position embedding is applied automatically.)
embeddings = bert_model.embed(tok=sample_indices, segment=seg_indices)
Getting the Representations
Finally, we can pass the input embeddings and masks in to get the sentence representations.
representations = bert_model.transformers(embeddings, masks)
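A quick sanity check on the result: assuming the usual feature × sequence × batch layout, the output for this model (H-768 in the file name, i.e. hidden size 768) and our batch of 3 samples should be a 3-dimensional array:

```julia
# hidden size × padded sequence length × batch size
size(representations)  # expect (768, max_seq_len, 3)
```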
You can also get the output of every transformer layer with the all keyword argument.
representations, all_outputs = bert_model.transformers(embeddings, masks; all=true)
These are what we have in #bert for now.

Currently I have only implemented the forward pass. The next phase will be implementing the pre-training related functions, and if we have enough time, I will try to make a TPU version with XLA.jl.
Appendix: I don't like the struct
As you may have seen, I use a custom wrapper struct to put everything in one variable (the bert_model). However, some people (including me) might feel unhappy about that one day, so we also handle this situation.
In the previous sections, we used tfckpt2bson to convert the checkpoint into a wrapped struct. This time, we will only extract the variables from the TensorFlow checkpoint files and save them unmodified, with a .tfbson filename extension.
We use the raw keyword to specify this. If you want to use your own pre-trained model but you saved the files with names different from the ones Google used, just pass the filenames with the relevant keywords:
#=
tfckpt2bson("model_files_zip_or_can_be_a_folder";
            raw=true,
            saveto="/my/data/volumn/",
            confname = "mybert_config.json",
            ckptname = "mybert.ckpt",
            vocabname = "special_vocab.txt")
=#
You can still use load_bert_pretrain to load the raw weights.
config, weights, vocab = load_bert_pretrain("multi_cased_L-12_H-768_A-12.tfbson")
Then you can handle those variable names and weights yourself.
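For instance, you could list every stored variable with its shape, or pull out a single matrix by its original TensorFlow name; a sketch, assuming weights behaves like a Dict from variable name to array:

```julia
# inspect every raw variable saved from the checkpoint
for (name, w) in weights
    println(name, " => ", size(w))
end

# grab one matrix by its TensorFlow variable name
# (name follows the standard BERT checkpoint layout)
query_kernel = weights["bert/encoder/layer_0/attention/self/query/kernel"]
```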