Peter Cheng / Jun 13 2019

JSoC 2019-Blog#1: How does BERT Work?

As I mentioned in the previous blog post, BERT is one of the most powerful language models right now. The goal of my project is to implement the BERT model in Julia. Therefore, in this post I will give you a brief introduction to how everything fits together in BERT. Since I haven't finished the code yet, we won't have any real or toy examples in the following content.

How does it work?

BERT marks a milestone in language-related neural models. Lots of models reach state-of-the-art results by simply integrating BERT with their own model. Let's first start with how BERT encodes the input sentences.

First: The Embeddings

If you are familiar with NLP, you might already know that we have been using word embeddings for years. With the embedding technique, we represent each token of the input sentence as a high-dimensional vector. In BERT, we don't represent each "word" as a vector. Instead, we represent each "subword token" as a vector. That is, when given an unfamiliar word, instead of using the word itself and its representational vector, we try to split the word into sub-pieces such as word roots and prefixes (not necessarily a meaningful split; it depends on the algorithm we use) and use the representational vectors of all the sub-pieces together to represent the unfamiliar word.
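To make the subword idea concrete, here is a minimal greedy longest-match-first sketch. The toy vocabulary and the `wordpieces` function are made up for illustration only; the real BERT tokenizer (WordPiece) is more involved.

# a toy subword vocabulary; continuation pieces carry a "##" prefix
const toy_vocab = Set(["un", "##break", "##able", "play", "##ing", "[UNK]"])

function wordpieces(word; vocab = toy_vocab)
    pieces = String[]
    start = 1
    while start <= lastindex(word)
        stop = lastindex(word)
        piece = ""
        while stop >= start
            candidate = word[start:stop]
            start > 1 && (candidate = "##" * candidate)  # mark non-initial pieces
            if candidate in vocab
                piece = candidate
                break
            end
            stop = prevind(word, stop)
        end
        isempty(piece) && return ["[UNK]"]  # no known sub-piece covers this part of the word
        push!(pieces, piece)
        start = nextind(word, stop)
    end
    return pieces
end

wordpieces("unbreakable")  # => ["un", "##break", "##able"]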

Besides, BERT also uses the position embedding I mentioned in the previous post and a new segment embedding. This new segment embedding is highly related to the pre-training task, so I will skip it for now and explain it in the pre-train section. Just take it as another bias term for now (and in the normal situation it truly is).

Second: the Transformers

After getting the embeddings of the input sentences, we feed those embedding vectors into a multi-layer stack of transformer encoder blocks.

Last: the representation

We take the output of the last encoder block as our final representation. And TA-DA! Now you have the most powerful sentence representation at this moment! You can just feed the representation to your old NLP model to get better performance.

Wait, What?

This doesn't sound right. Shouldn't it have a complex and complicated structure to get such good performance? Well, one of the beauties of BERT is that it doesn't have a weirdly complicated structure like some other models do. However, you should notice that the BERT model is very, very large. The SOTA version of BERT has about 340M parameters, which is too big to fit into my 11GB 1080 Ti GPU. But using a large model is not the core of BERT's success, although Google does tend to use really large models to get good performance these years.
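To see roughly where those 340M parameters come from, here is a back-of-the-envelope count assuming the published BERT-Large hyperparameters (24 layers, hidden size 1024, feed-forward size 4096, a ~30K WordPiece vocabulary, maximum length 512). Biases, LayerNorm and the pooler are ignored, so the estimate comes out a bit low.

# rough parameter count for BERT-Large (weight matrices only)
vocab, hidden, ff, layers, maxlen = 30_000, 1024, 4096, 24, 512

token_emb    = vocab * hidden           # ≈ 30.7M
position_emb = maxlen * hidden          # ≈ 0.5M
segment_emb  = 2 * hidden               # negligible
attention    = 4 * hidden^2             # Q, K, V and output projections per layer
feedforward  = 2 * hidden * ff          # the two dense layers of each block
per_layer    = attention + feedforward  # ≈ 12.6M

total = token_emb + position_emb + segment_emb + layers * per_layer
println(round(total / 1e6, digits = 1), "M")  # ≈ 333M, close to the reported 340M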

The Innovation

"What makes BERT so different?", This is probably what most people want to ask when they first saw BERT. As we have talked about, BERT doesn't seem to invent any new neural structure. It nothing but a large transformer encoder. Here I list some ingredients I think that brings BERT to the current state.

  1. New pre-train tasks

And yes, that's the only thing I think makes BERT so different. However, I'm not saying BERT didn't do anything. On the contrary, I'm really amazed by the work. There is a lot of recent research showing interesting properties that BERT has. I'll explain the pre-train tasks in the following sections.

Pre-train tasks

Pre-trained models have been adopted in CV for years, e.g. models pre-trained on ImageNet. One of the pleasant features BERT gives us is that it makes using a large pre-trained model on language tasks possible. However, it is not the first work to come up with using a pre-trained model on a language task, and it is also not the first to use a Transformer model for pre-training. Six months before BERT, OpenAI published the paper "Improving Language Understanding by Generative Pre-Training", which brought the idea to the table. That model has several different names because OpenAI didn't give it a name like BERT in the original paper, but it's commonly known as "GPT". In that paper, they pre-trained a 12-layer transformer model on the language modeling task over the BooksCorpus dataset.

Language modeling task

$$\arg\max_{x_n}\ P(x_n \mid x_1, x_2, \ldots, x_{n-1})$$

This is the pre-train task used in the GPT model: given a sequence of words, predict what the next word would be. After the pre-training phase, you can fine-tune the pre-trained model on an NLP task (you also need some modification for each fine-tuning task).
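As a tiny illustration of this objective, the sketch below computes the average next-token negative log-likelihood over a sequence. The `dummy_model` here is just a stand-in that returns a uniform distribution so the snippet runs without any trained weights; a real setup would plug the transformer in its place.

vocab_size = 300
dummy_model(context) = fill(1 / vocab_size, vocab_size)  # stand-in for P(x_n | x_1, ..., x_{n-1})

function lm_loss(tokens; model = dummy_model)
    loss = 0.0
    for n in 2:length(tokens)
        probs = model(tokens[1:n-1])   # condition only on the past
        loss -= log(probs[tokens[n]])  # negative log-likelihood of the true next token
    end
    return loss / (length(tokens) - 1)
end

lm_loss(rand(1:vocab_size, 10))  # ≈ log(300) for the uniform dummy model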

So what's the difference between GPT and BERT? First of all, BERT is larger than GPT. Second, in GPT we only use the words before the current one. By contrast, BERT takes information from both sides (past and future). In order to avoid receiving the correct answer from the input, they turn the task into a so-called masked language modeling task.

Masked Language modeling task

$$\arg\max_{x_n}\ P(x_n \mid x_1, x_2, \ldots, x_{n-1}, \text{[MASK]}, x_{n+1}, x_{n+2}, \ldots)$$

In this task, we randomly choose some tokens in a given sentence and replace them with a special "[MASK]" token. We then take this new input sentence, run the steps above to get a sequence of representations, and use the vector representation at the position of each masked token to predict the original token. For the detailed setup, see the appendix of the original paper.
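Here is a sketch of how those masked training pairs can be generated, following the masking scheme described in the paper: choose 15% of the positions; of those, replace 80% with the mask token, 10% with a random token, and leave 10% unchanged. The `vocab_size` and `mask_id` values below are placeholders for this sketch.

function mask_tokens(tokens; vocab_size = 300, mask_id = vocab_size + 1)
    input  = copy(tokens)
    labels = zeros(Int, length(tokens))    # 0 = position not used in the loss
    for i in eachindex(tokens)
        rand() < 0.15 || continue
        labels[i] = tokens[i]              # the model must recover the original token here
        r = rand()
        if r < 0.8
            input[i] = mask_id             # 80%: replace with the "[MASK]" token
        elseif r < 0.9
            input[i] = rand(1:vocab_size)  # 10%: replace with a random token
        end                                # 10%: keep the original token
    end
    return input, labels
end

masked_input, mlm_labels = mask_tokens(rand(1:300, 20))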

Next Sentence prediction task

$$\arg\max_{C}\ P(C \mid S_2, S_1, \text{tasktoken}), \quad \text{where } C \in \{\text{IsNext}, \text{NotNext}\}$$

Another pre-train task introduced in BERT is the next sentence prediction task. That is, given two sentences S1 and S2, you need to predict whether S2 is a possible context of S1. And here comes the segment embedding we mentioned in the embedding section. For this task, we add a bias term Ea to every token in S1 and another bias term Eb to every token in S2. By adding a task token "[CLS]" and a special token "[SEP]" to separate the two sentences, we form a new input [CLS] S1 [SEP] S2 [SEP] and encode it with the steps above. Finally, we use the vector representation at the position of the task token to predict whether the two sentences are contextual.

What about the regular situation where we don't have a second sentence? Then we only add the bias term Ea to every input token, and that's why I said the segment embedding is just a bias term.
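To make the input layout concrete, here is a small sketch of how the token and segment sequences can be assembled. The special-token ids are placeholders; the segment ids (1 and 2) are what would index the two-entry segment embedding (Ea & Eb) in the code sample below.

cls_id, sep_id = 101, 102  # placeholder ids for "[CLS]" and "[SEP]"

function nsp_input(s1::Vector{Int}, s2::Vector{Int})
    tokens   = vcat(cls_id, s1, sep_id, s2, sep_id)                    # [CLS] S1 [SEP] S2 [SEP]
    segments = vcat(fill(1, length(s1) + 2), fill(2, length(s2) + 1))  # Ea for S1, Eb for S2
    return tokens, segments
end

nsp_input(rand(1:300, 5), rand(1:300, 7))

In the single-sentence case, the segment vector would simply be all ones, which is exactly the "just a bias term" situation described above.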

All in all

The comparison between GPT and BERT shows the great potential of using a good objective to pre-train large neural models on a large corpus. There might still exist different or more complicated pre-train tasks that can help neural models learn better language representation vectors, and more and more research has shown that these representations have some impressive properties.

With Transformers.jl

This is probably everything you need to know about BERT right now. If you want to see what the actual code will look like with Transformers.jl, here is a construction-only code sample describing the network structure.

using Pkg
Pkg.add("Transformers")
using Flux
using Transformers
using Transformers.Basic

tok_emb = Embed(
  768,
  300 # size of vocab 
)

seg_emb = Embed(
  768,
  2 # Ea & Eb
)

posi_emb = PositionEmbedding(
  768,
  512,
  trainable = true # use trained position embedding instead of sin/cos
)

bert = Chain(
  [
    # `future=true` lets each position attend to both sides; GPT uses `future=false`
    # this is the setting of BERT-Base
    Transformer(768, 12, 64, 3072; future=true, act=gelu, pdrop=0.1)
    for i = 1:12
  ]...
)

# and getting the representation with

forward(x_token) = bert(tok_emb(x_token) + seg_emb(ones(Int, length(x_token))) .+ posi_emb(length(x_token)))

forward(rand(1:300, 15))

Further code will be at Transformers.jl#bert and will be merged into #master once complete.

Conclusion

The work on BERT shows a lot of interesting stuff. Adapting BERT might need some modifications depending on the task, and since we don't have any code to show yet, I'll leave those topics for future posts. If you have any questions or encounter any problems with Transformers.jl, you can post an issue or tag me on the Julia Slack/Discourse as @chengchingwen.