JSoC 2019 - Blog #2 (End of Phase One): How's BERT going?
It has been a month since JSoC 2019 started, which also means we have reached the end of phase one. So in this blog, I will talk about what I have done during the month and demonstrate how to use the code.
Using BERT with Transformers.jl
All the code is in the Transformers#bert branch. Since it's not finished, some of the API might change in the future, and once it's 100% finished, it will be merged into #master. I will show how to use the code step by step in the following sections.
Prepare environment
To use the BERT code, we need to check out the #bert branch:
using Pkg
pkg"add Transformers#bert"
Here are some other packages we will need later. Please note that you have to install TensorFlow.jl beforehand if you want to test the code on your own computer.
using Flux
using CuArrays
using WordTokenizers
using TensorFlow # not in the dependencies; run `pkg"add TensorFlow"` to install
using Transformers
using Transformers.BidirectionalEncoder
Processing the pre-trained model
As we mentioned in the last blog, using a pre-trained model is one of the pleasant features BERT has. However, the pre-trained weights were released as TensorFlow checkpoint files, so we need to convert the pre-trained files and save them in a Julia-friendly file format (here we use BSON.jl to store everything).
First, we need to download a pre-trained file. This is one of the file links found on the official BERT repo; we download it to our computer.
wget https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip
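If you prefer to stay inside Julia, Base's download can fetch the same file (just a convenience; the wget above does exactly the same thing):

# download the pre-trained checkpoint from within Julia instead of the shell
download("https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip",
         "multi_cased_L-12_H-768_A-12.zip")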
# check that the downloaded zip is in the current directory
readdir()[findfirst(n->startswith(n, "multi"), readdir())]
Once the pre-trained file is ready, we can start the conversion with the tfckpt2bson function. The result is a saved file with the same name but a different filename extension. (This will take a few minutes.)
BidirectionalEncoder.tfckpt2bson("multi_cased_L-12_H-768_A-12.zip")
This is the only function that needs TensorFlow.jl, so if you already have the BSON file, or someone did the conversion for you, you don't need that package anymore.
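For example, a minimal guard like the following (my own sketch, not part of the package) lets a machine without TensorFlow.jl skip the conversion entirely when the BSON file is already there:

# only run the TensorFlow-dependent conversion when the BSON file doesn't exist yet
bsonname = "multi_cased_L-12_H-768_A-12.bson"
if !isfile(bsonname)
    BidirectionalEncoder.tfckpt2bson("multi_cased_L-12_H-768_A-12.zip")
end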
Loading pre-trained model
Now that we have our pre-trained weights in BSON format, we use load_bert_pretrain to load the saved model (or use BSON.load directly). Besides the model, you will also find the tokenizer and the WordPiece object inside that file.
bert_model, wordpiece, tokenizer = load_bert_pretrain("multi_cased_L-12_H-768_A-12.bson")

# --- is equivalent to ---
# using BSON
# bert_bson = BSON.load("multi_cased_L-12_H-768_A-12.bson")
Then, we have the desired model and other related objects.
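A quick way to check what came back (the concrete types depend on the package internals, so don't worry about the exact names):

# inspect the three objects returned by load_bert_pretrain
typeof(bert_model), typeof(wordpiece), typeof(tokenizer)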
Process input
Before we can run BERT on our sentences, we need to process the input a little bit. Here I will show you how to use the pre-trained model to get sentence representations.
sample1 = "We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab."
sample2 = "quick fox jumps over the lazy dog"
sample3 = "I can eat glass, it doesn't hurt me."
sample = [sample1, sample2, sample3]
Run the tokenizer and WordPiece on each sample.
processed_sample = wordpiece.(tokenizer.(sample))
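To see the two stages separately, here is a small sketch on a single sentence. The exact sub-word split depends on the multilingual vocabulary, so the pieces shown in the comment are only illustrative:

tokens = tokenizer(sample2)   # basic tokenization into words and punctuation
pieces = wordpiece(tokens)    # sub-word pieces, e.g. something like ["quick", "fox", "jump", "##s"]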
Then we use Vocabulary to turn each token into an index for the embeddings.
vocab = Transformers.Basic.Vocabulary(wordpiece.vocab, wordpiece.vocab[wordpiece.unk_idx])
sample_indices = vocab(processed_sample)
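The same pipeline also works one sentence at a time. Assuming vocab can be called on a single token vector as well (and not only on a batch of them), a new sentence could be converted like this:

# hypothetical single-sentence conversion, reusing the objects defined above
new_sentence = "Julia is fast."
new_indices = vocab(wordpiece(tokenizer(new_sentence)))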
Besides the token indices, we also need the segment indices. However, since we only take one sentence as input, we can just use ones.
seg_indices = ones(Int, size(sample_indices)...)
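For the curious, here is a hedged sketch of what the segment indices might look like for a sentence-pair input, assuming the two segments are simply numbered 1 and 2 (the single-sentence case above is all ones):

# hypothetical: a 5-token first sentence followed by a 3-token second sentence in one sequence
pair_seg_indices = vcat(ones(Int, 5), fill(2, 3))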
And don't forget the masks.
masks = Transformers.Basic.getmask(processed_sample)
Get Embeddings
Next we need to turn those indices into embeddings. This is done by bert_model.embed.
bert_model.embed
It composes different embeddings together, and we can run it by passing the indices with a keyword name that specifies which embedding each set of indices is for (the position embedding is applied automatically).
embeddings = bert_model.embed(tok=sample_indices, segment=seg_indices)
Get the Representations
Finally, we can pass the input embeddings and masks to get the sentence representations.
representations = bert_model.transformers(embeddings, masks)
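If you want one fixed-size vector per sentence, a rough sketch is to pool over the token dimension. This assumes the output array is laid out as (hidden size, sequence length, batch) and, for brevity, ignores the padded positions:

using Statistics
# mean-pool over the token dimension to get one vector per sentence (padding ignored here)
sentence_vectors = dropdims(mean(representations; dims=2); dims=2)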
You can also get the output of every transformer layer with the all keyword argument.
representations, all_outputs = bert_model.transformers(embeddings, masks; all=true)
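Assuming all_outputs holds one hidden-state array per transformer layer, a quick way to inspect them is:

# print the layer index and the size of each layer's output
for (i, layer_output) in enumerate(all_outputs)
    @show i size(layer_output)
end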
This is what we have in #bert for now.
Conclusion
Currently I have only implemented the forward part. The next step will be implementing the pre-training related functions, and if we have enough time, I will try to make a TPU version with XLA.jl.
Appendix: I don't like the struct
As you may have noticed, I use a custom wrapper struct to put everything in one variable (the bert_model). However, some people (including me) might feel unhappy about it one day, so we also handle this situation.
In the previous sections, we used tfckpt2bson to convert to a wrapped struct. This time, we will only extract the variables from the TensorFlow checkpoint files and save them unmodified with a .tfbson filename extension.
BidirectionalEncoder.tfckpt2bson("multi_cased_L-12_H-768_A-12.zip"; raw=true)
We use the raw keyword argument to specify this. If you want to use your own pre-trained model but saved the files with different names from the ones Google used, just pass the filenames with the relevant keyword arguments, like this:
#= tfckpt2bson("model_files_zip_or_can_be_a_folder";
               raw=true,
               saveto="/my/data/volumn/",
               confname = "mybert_config.json",
               ckptname = "mybert.ckpt",
               vocabname = "special_vocab.txt") =#
You can still use load_bert_pretrain to load the raw weights.
config, weights, vocab = load_bert_pretrain("multi_cased_L-12_H-768_A-12.tfbson")
Then you can handle those variable names and weights yourself.
weights
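For instance, assuming weights behaves like a dictionary mapping the original TensorFlow variable names to weight arrays, you can walk over it like this:

# list every variable name together with the size of its weight array
for (name, w) in weights
    println(name, " => ", size(w))
end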