Peter Cheng / Aug 05 2019

JSoC 2019 Blog #3 (End of Phase Two): Bert model in Julia

It has been a while since the last update, but we are finally here. Most of the functionality is done, so let's see what we have now. As before, we first check out the #bert branch.

using Pkg
pkg"add Transformers#bert"

Getting Pretrain

In the last blog post, we showed that the tfckpt2bson function can convert a Bert TensorFlow pretrained checkpoint file into a BSON file that can be loaded in Julia. Now, if you just want to use the publicly released checkpoints from Google, we have a self-hosted version that you can download directly, so there's no need to use tfckpt2bson. For example, if we want to fine-tune the uncased_L-12_H-768_A-12 pretrained weights, we can do this:

using Transformers
using Transformers.Pretrain

ENV["DATADEPS_ALWAYS_ACCEPT"] = true

The download is handled by DataDeps.jl, so we need to set ENV["DATADEPS_ALWAYS_ACCEPT"] to true to download the model without having to type Y at the prompt. The model is loaded with a string macro with the pretrain prefix, in the format pretrain"bert-<model-name>:<item>", where <item> can be :bert_model, :wordpiece, :tokenizer, or :all (the default). Like this:

bert_model, wordpiece, tokenizer = pretrain"bert-uncased_L-12_H-768_A-12"
(TransformerModel{Bert}(
  embed = CompositeEmbedding(tok = Embed(768), segment = Embed(768), pe = PositionEmbedding(768, max_len=512), postprocessor = Positionwise(LayerNorm(768), Dropout{Float64}(0.1, true))),
  transformers = Bert(layers=12, head=12, head_size=64, pwffn_size=3072, size=768),
  classifier =
    (
      pooler => Dense(768, 768, tanh)
      masklm => (
        transform => Chain(Dense(768, 768, gelu), LayerNorm(768))
        output_bias => TrackedArray{…,Array{Float32,1}}
      )
      nextsentence => Chain(Dense(768, 2), logsoftmax)
    )
), WordPiece(vocab_size=30522, unk=[UNK], max_char=200), bert_uncased_tokenizer)
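To make the pretrain"bert-<model-name>:<item>" format a bit more concrete, here is a small sketch of how such a string breaks down into its parts. The helper split_pretrain_spec is hypothetical and written only for illustration; it is not how Transformers.jl parses the string internally.

```julia
# Hypothetical helper, not part of Transformers.jl: split a spec such as
# "bert-uncased_L-12_H-768_A-12:wordpiece" into model type, model name and item.
function split_pretrain_spec(spec::AbstractString)
    # the ":<item>" suffix is optional and defaults to :all
    parts = split(spec, ':'; limit = 2)
    item = length(parts) == 2 ? Symbol(parts[2]) : :all
    # everything before the first '-' is the model type, the rest is the model name
    type_and_name = split(parts[1], '-'; limit = 2)
    length(type_and_name) == 2 ||
        throw(ArgumentError("expected <type>-<model-name>, got $(parts[1])"))
    return (type = String(type_and_name[1]),
            name = String(type_and_name[2]),
            item = item)
end

full = split_pretrain_spec("bert-uncased_L-12_H-768_A-12")
only_wordpiece = split_pretrain_spec("bert-uncased_L-12_H-768_A-12:wordpiece")
```

With no item given, the sketch falls back to :all, which corresponds to getting the model, the wordpiece and the tokenizer together as in the call above.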

Now we can use the pretrained model in our own task.

You can also list which pretrained models are supported. If a public model is missing from the list, please open an issue.


Fine-tune model

Fine-tuning a Bert model is also very easy. We use the GLUE tasks as an example.

GLUE datasets

Before we start fine-tuning, we need to download the dataset. Datasets includes a version of the GLUE tasks, so you can test the model without worrying about handling the datasets yourself. For example:

using Transformers.Datasets
using Transformers.Datasets.GLUE

task = GLUE.QNLI()
datas = dataset(Train, task)
get_batch(datas, 4)

You can see which GLUE tasks are supported by listing the contents of the GLUE module.

using InteractiveUtils: varinfo
varinfo(Transformers.Datasets.GLUE)

name        size        summary
CoLA        172 bytes   DataType
Diagnostic  172 bytes   DataType
GLUE        82.390 KiB  Module
MNLI        188 bytes   DataType
MRPC        172 bytes   DataType
QNLI        172 bytes   DataType
QQP         172 bytes   DataType
RTE         172 bytes   DataType
SNLI        172 bytes   DataType
SST         172 bytes   DataType
STS         172 bytes   DataType
WNLI        172 bytes   DataType

Preprocess the data

using Transformers.Basic
using Flux: onehotbatch

vocab = Vocabulary(wordpiece) #get vocabulary from WordPiece
labels = get_labels(task) #get dataset labels

#add start and separate symbol around sentence
markline(s1, s2) = ["[CLS]"; s1; "[SEP]"; s2; "[SEP]"]

function preprocess(batch)
    s1 = wordpiece.(tokenizer.(batch[1]))
    s2 = wordpiece.(tokenizer.(batch[2]))
    sentence = markline.(s1, s2)
    mask = getmask(sentence)
    tok = vocab(sentence)

    #segment id is 1 up to the first [SEP] and 2 after it
    segment = fill!(similar(tok), 1)
    for (i, sent) in enumerate(sentence)
      j = findfirst(isequal("[SEP]"), sent)
      if j !== nothing
        @view(segment[j+1:end, i]) .= 2
      end
    end

    label = onehotbatch(batch[3], labels)
    return (tok=tok, segment=segment), label, mask
end

preprocess(get_batch(datas, 4))
((tok = [102 102 102 102; 2055 2044 2055 2130; … ; 1013 101 101 101; 103 101 101 101], segment = [1 1 1 1; 1 1 1 1; … ; 2 2 2 2; 2 2 2 2]), Bool[false true true false; true false false true], Float32[1.0 1.0 … 1.0 1.0] Float32[1.0 1.0 … 0.0 0.0] Float32[1.0 1.0 … 0.0 0.0] Float32[1.0 1.0 … 0.0 0.0])
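To show what the segment array and the mask look like on a tiny input, here is a sketch that uses only Base. The helpers segment_ids and pad_columns are made up for this illustration, and the 2-D integer mask is a simplification; the real getmask returns a Float32 array, as the output above shows.

```julia
# add start and separator symbols around a sentence pair, as in the post
markline(s1, s2) = ["[CLS]"; s1; "[SEP]"; s2; "[SEP]"]

# Segment ids for one marked sentence: 1 up to and including the first [SEP], 2 after it.
function segment_ids(sent::Vector{String})
    j = findfirst(isequal("[SEP]"), sent)
    return [(j === nothing || i <= j) ? 1 : 2 for i in eachindex(sent)]
end

# Pad sequences to a common length and stack them as columns (position × batch).
function pad_columns(seqs::Vector{Vector{T}}, padval::T) where {T}
    maxlen = maximum(length, seqs)
    out = fill(padval, maxlen, length(seqs))
    for (b, s) in enumerate(seqs)
        out[1:length(s), b] = s
    end
    return out
end

batch = [markline(["who", "is", "he"], ["he", "is", "guy"]),
         markline(["why"], ["because"])]

# padded positions get segment id 2, matching the `j+1:end` assignment in preprocess
segment = pad_columns(segment_ids.(batch), 2)
# 1 marks a real token, 0 marks padding
mask = pad_columns([ones(Int, length(s)) for s in batch], 0)
```

The second sentence pair is shorter, so its column is padded: the padded rows carry segment id 2 and mask value 0.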

Training the model

Now that we have the fine-tuning dataset and the preprocess function, we can start training our model with a simple training loop:

using Flux
using Flux: gradient
import Flux.Optimise: update!

using CuArrays

clf = Chain(
    Dense(768, length(labels)), logsoftmax
)

# remove masklm/nextsentence weights,
# set clf as part of classifiers,
# move the result model to gpu
bert_model = gpu(
  set_classifier(bert_model,
    (
      pooler = bert_model.classifier.pooler,
      clf = clf
    )
  )
)
@show bert_model

ps = params(bert_model)
opt = ADAM(1e-4)

#define the loss
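function loss(data, label, mask=nothing)
    e = bert_model.embed(data)
    t = bert_model.transformers(e, mask)
    # classify from the pooled output at the first ([CLS]) position;
    # the argument order (label first) is assumed here
    l = Basic.logcrossentropy(
        label,
        bert_model.classifier.clf(
            bert_model.classifier.pooler(
                t[:,1,:]
            )
        )
    )
    return l
end

loss (generic function with 2 methods)
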
#the training loop
for i in 1:10 # 10 training steps, just for illustration
  batch = get_batch(datas, 2)
  batch === nothing && break # run out of training data
  data, label, mask = todevice(preprocess(batch)) #move data to gpu
  l = loss(data, label, mask)
  @show l
  grad = gradient(()->l, ps)
  update!(opt, ps, grad)
end
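For readers who want to see what the loss computes, here is a Base-only sketch of a cross entropy on log-probabilities with one-hot labels. It is written under the assumption that Basic.logcrossentropy plays this role (mean negative log-likelihood over the batch); the actual function may differ in details such as normalization. The names logsoftmax_cols and logcrossentropy_sketch are made up for this example.

```julia
# Numerically stable log-softmax over each column (class × batch).
function logsoftmax_cols(x::AbstractMatrix)
    m = maximum(x; dims = 1)
    return x .- m .- log.(sum(exp.(x .- m); dims = 1))
end

# Mean negative log-likelihood given one-hot labels and log-probabilities,
# both of size (class × batch). Assumed stand-in for Basic.logcrossentropy.
logcrossentropy_sketch(onehot, logp) = -sum(onehot .* logp) / size(onehot, 2)

logits = [2.0 0.0; 0.0 0.0]               # 2 classes × 2 examples
onehot = Bool[true false; false true]     # example 1 is class 1, example 2 is class 2
logp = logsoftmax_cols(logits)
l = logcrossentropy_sketch(onehot, logp)
```

Because the classifier chain in the post already ends with logsoftmax, the loss only has to pick out the log-probability of the correct class and average it.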

Feature-based Approach

We can also use our Bert model to extract fixed feature vectors (like ELMo). To do so, pass the extra keyword argument all=true when calling the transformers, and you get the output of every Transformer layer.

data, label, mask = todevice(preprocess(get_batch(datas, 4)))

e = bert_model.embed(data)
out, ts = bert_model.transformers(e, mask; all=true)

#the 11-th transformer layer output
ts[11]
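Once we have the per-layer outputs, there are a few common ways to turn them into features. The sketch below uses stand-in arrays instead of a real model and assumes that each layer output has shape (hidden × sequence length × batch) and that the collection can be indexed by layer number. Concatenating the last four layers is one of the feature-based variants tried in the BERT paper.

```julia
hidden, seq_len, batch, nlayers = 4, 3, 2, 12

# stand-in for `ts`: one (hidden × seq_len × batch) array per layer,
# filled with the layer number so the results are easy to check
ts = [fill(Float32(l), hidden, seq_len, batch) for l in 1:nlayers]

layer11 = ts[11]

# feature vector at the first ([CLS]) position for every example: hidden × batch
cls_features = layer11[:, 1, :]

# concatenate the last four layers along the feature dimension:
# (4 * hidden) × seq_len × batch
last4 = cat(ts[end-3:end]...; dims = 1)
```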

Pretrain our own model

We can also pretrain a Bert model on our own dataset. We will show how to do it with the following example.

Pretrain task helper

Bert has two pretraining tasks: masked language modeling and next sentence prediction. Here is a document from a recent wiki dump, which we'll use as the dataset in the following code.

# one document from wiki dump, just for illustration
docs = """
Guy Fawkes (; 13 April 1570 – 31 January 1606), also known as Guido Fawkes while fighting for the Spanish, was a member of a group of provincial English Catholics who planned the failed Gunpowder Plot of 1605. He was born and educated in York, England; his father died when Fawkes was eight years old, after which his mother married a recusant Catholic.

Fawkes converted to Catholicism and left for mainland Europe, where he fought for Catholic Spain in the Eighty Years' War against Protestant Dutch reformers in the Low Countries. He travelled to Spain to seek support for a Catholic rebellion in England without success. He later met Thomas Wintour, with whom he returned to England, and Wintour introduced him to Robert Catesby, who planned to assassinate and restore a Catholic monarch to the throne. The plotters leased an undercroft beneath the House of Lords, and Fawkes was placed in charge of the gunpowder which they stockpiled there. The authorities were prompted by an anonymous letter to search Westminster Palace during the early hours of 5 November, and they found Fawkes guarding the explosives. He was questioned and tortured over the next few days, and he finally confessed.

Immediately before his execution on 31 January, Fawkes fell from the scaffold where he was to be hanged and broke his neck, thus avoiding the agony of being hanged, drawn and quartered. He became synonymous with the Gunpowder Plot, the failure of which has been commemorated in Britain as Guy Fawkes Night since 5 November 1605, when his effigy is traditionally burned on a bonfire, commonly accompanied by fireworks.

Guy Fawkes was born in 1570 in Stonegate, York. He was the second of four children born to Edward Fawkes, a proctor and an advocate of the consistory court at York, and his wife, Edith. Guy's parents were regular communicants of the Church of England, as were his paternal grandparents; his grandmother, born Ellen Harrington, was the daughter of a prominent merchant, who served as Lord Mayor of York in 1536. Guy's mother's family were recusant Catholics, and his cousin, Richard Cowling, became a Jesuit priest. "Guy" was an uncommon name in England, but may have been popular in York on account of a local notable, Sir Guy Fairfax of Steeton.

The date of Fawkes's birth is unknown, but he was baptised in the church of St Michael le Belfrey on 16 April. As the customary gap between birth and baptism was three days, he was probably born about 13 April. In 1568, Edith had given birth to a daughter named Anne, but the child died aged about seven weeks, in November that year. She bore two more children after Guy: Anne (b. 1572), and Elizabeth (b. 1575). Both were married, in 1599 and 1594 respectively.

In 1579, when Guy was eight years old, his father died. His mother remarried several years later, to the Catholic Dionis Baynbrigge (or Denis Bainbridge) of Scotton, Harrogate. Fawkes may have become a Catholic through the Baynbrigge family's recusant tendencies, and also the Catholic branches of the Pulleyn and Percy families of Scotton, but also from his time at St. Peter's School in York. A governor of the school had spent about 20 years in prison for recusancy, and its headmaster, John Pulleyn, came from a family of noted Yorkshire recusants, the Pulleyns of Blubberhouses. In her 1915 work "The Pulleynes of Yorkshire", author Catharine Pullein suggested that Fawkes's Catholic education came from his Harrington relatives, who were known for harbouring priests, one of whom later accompanied Fawkes to Flanders in 1592–1593. Fawkes's fellow students included John Wright and his brother Christopher (both later involved with Fawkes in the Gunpowder Plot) and Oswald Tesimond, Edward Oldcorne and Robert Middleton, who became priests (the latter executed in 1601).
"""

You can use the pretrain helper function BidirectionalEncoder.bert_pretrain_task to get the masked input sentences and the masked token ids (and also the next sentence prediction labels), but you need to wrap your own data in a Channel that yields sentences one by one in the correct order (otherwise the next sentence prediction can't get the correct label).
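If Channels are new to you, the following Base-only sketch shows the pattern from the consumer's side: a task puts sentences into a small buffered Channel in order, and bind closes the Channel when the task finishes, so the reader knows when the data has run out. The sentences are toy data, and how bert_pretrain_task reads the Channel internally is not shown here.

```julia
lines = ["First sentence.", "", "Second sentence.", "Third sentence."]

# small buffer: the producer blocks when the channel is full
chn = Channel{String}(2)

# Producer: put the non-empty lines into the channel, preserving their order.
producer = @async begin
    for line in lines
        isempty(line) || put!(chn, line)
    end
end

# Close the channel automatically when the producer task finishes.
bind(chn, producer)

# Consumer: iterating a channel blocks until an item arrives and
# stops once the channel is closed and drained.
received = String[]
for line in chn
    push!(received, line)
end
```

Without the bind call the loop at the end would wait forever after the last sentence, because nothing would signal that the producer is done.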

using WordTokenizers

chn = Channel(3)

sentences = split_sentences(docs)
task = @async foreach(sentences) do sentence
  if !isempty(sentence)
    put!(chn, sentence)
  end
end
bind(chn, task)

using Transformers.BidirectionalEncoder

datas = BidirectionalEncoder.bert_pretrain_task(chn, wordpiece; tokenizer = tokenizer) # we need to pass the wordpiece. If no tokenizer provided, it will use default tokenizer from WordTokenizers
masked_sentence, mask_idx, masked_token, isnext = get_batch(datas, 1)
4-element Array{Array{T,1} where T,1}:
 Array{String,1}[["[CLS]", "guy", "[MASK]", "##wk", "##es", "(", ";", "13", "april", "1570" … "his", "mother", "married", "a", "rec", "##usa", "##nt", "catholic", ".", "[SEP]"]]
 Array{Int64,1}[[3, 13, 20, 30, 40, 63, 74, 78, 81]]
 Array{String,1}[["fa", "january", "as", ",", "catholics", "died", "which", "a", "##nt"]]
 Bool[true]
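The output above has the same three pieces that the masked language modeling recipe from the BERT paper produces: the masked sentence, the masked positions and the original tokens at those positions. Here is a sketch of that recipe (pick about 15% of the non-special tokens; replace 80% of them with [MASK], 10% with a random token, and leave 10% unchanged). The function mask_tokens and the toy vocabulary are made up for this example, and whether bert_pretrain_task uses exactly these rates is an assumption.

```julia
using Random

# Sketch of BERT-style token masking. Returns the masked sentence,
# the sorted masked positions, and the original tokens at those positions.
function mask_tokens(tokens::Vector{String}, vocab::Vector{String};
                     rng::AbstractRNG = Random.GLOBAL_RNG, mask_rate = 0.15)
    special = ("[CLS]", "[SEP]")
    # only ordinary tokens are candidates for masking
    candidates = [i for (i, t) in enumerate(tokens) if !(t in special)]
    isempty(candidates) && return copy(tokens), Int[], String[]
    n = max(1, round(Int, mask_rate * length(candidates)))
    picked = sort(shuffle(rng, candidates)[1:n])
    masked = copy(tokens)
    for i in picked
        r = rand(rng)
        if r < 0.8
            masked[i] = "[MASK]"          # 80%: replace with the mask symbol
        elseif r < 0.9
            masked[i] = rand(rng, vocab)  # 10%: replace with a random token
        end                               # 10%: keep the original token
    end
    return masked, picked, tokens[picked]
end

toy_vocab = ["guy", "fa", "##wk", "##es", "was", "born", "in", "york"]
tokens = ["[CLS]", "guy", "fa", "##wk", "##es", "was", "born", "in", "york", "[SEP]"]
masked, idx, original = mask_tokens(tokens, toy_vocab;
                                    rng = MersenneTwister(1), mask_rate = 0.5)
```

The model is then trained to predict the entries of original at the positions idx, no matter which of the three replacements was applied.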

Here we use the wordpiece and tokenizer from the previous section. If you only want them without loading the whole model, you can do this:

wordpiece = pretrain"bert-uncased_L-12_H-768_A-12:wordpiece"
tokenizer = pretrain"bert-uncased_L-12_H-768_A-12:tokenizer"
bert_uncased_tokenizer (generic function with 1 method)
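The pieces like "fa", "##wk", "##es" in the output earlier come from WordPiece, which splits a word greedily into the longest pieces found in the vocabulary and marks non-initial pieces with "##". Below is a sketch of that greedy longest-match-first idea with a toy vocabulary. The function wordpiece_sketch is written for this post only; it is not the implementation behind the WordPiece object and ignores details such as max_char.

```julia
# Greedy longest-match-first splitting of a single word into word pieces.
function wordpiece_sketch(word::AbstractString, vocab::Set{String}; unk = "[UNK]")
    chars = collect(word)
    pieces = String[]
    start = 1
    while start <= length(chars)
        stop = length(chars)
        piece = nothing
        # try the longest remaining substring first, then shrink it
        while stop >= start
            cand = String(chars[start:stop])
            start > 1 && (cand = "##" * cand)   # non-initial pieces get the ## prefix
            if cand in vocab
                piece = cand
                break
            end
            stop -= 1
        end
        # no piece matched: the whole word becomes the unknown token
        piece === nothing && return [unk]
        push!(pieces, piece)
        start = stop + 1
    end
    return pieces
end

toy_vocab = Set(["guy", "fa", "##wk", "##es", "rec", "##usa", "##nt"])
fawkes = wordpiece_sketch("fawkes", toy_vocab)
recusant = wordpiece_sketch("recusant", toy_vocab)
unknown = wordpiece_sketch("xyz", toy_vocab)
```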

Training the model

Here we run a pretraining example on a small Bert model:

emb = CompositeEmbedding(
  tok = Embed(300, length(vocab)),
  pe = PositionEmbedding(300, 512; trainable=false),
  seg = Embed(300, 2)
)

bert = Bert(
  300, #hidden
  12, #head
  512, #intermediate hidden size
  3, #layer
  act = gelu
)

masklm = Dense(300,300)
nextsentence = Chain(Dense(300, 2), logsoftmax)

bert_model = TransformerModel(emb, bert, (mlm=masklm, ns = nextsentence)) |> gpu
TransformerModel{Bert}(
  embed = CompositeEmbedding(tok = Embed(300), pe = PositionEmbedding(300), seg = Embed(300)),
  transformers = Bert(layers=3, head=12, head_size=25, pwffn_size=512, size=300),
  classifier =
    (
      mlm => Dense(300, 300)
      ns => Chain(Dense(300, 2), logsoftmax)
    )
)

function preprocess(batch)
  mask = getmask(batch[1])
  tok = vocab(batch[1])
  segment = fill!(similar(tok), 1.0)

  for (i, sentence) in enumerate(batch[1])
    j = findfirst(isequal("[SEP]"), sentence)
    if j !== nothing
      @view(segment[j+1:end, i]) .= 2.0
    end
  end

  # (position, batch index) pairs of every masked token
  ind = vcat(
    map(enumerate(batch[2])) do (i, x)
     map(j->(j,i), x)
    end...)

  masklabel = onehotbatch(vocab(vcat(batch[3]...)), 1:length(vocab))
  nextlabel = onehotbatch(batch[4], (true, false))

  return (tok=tok, seg=segment), ind, masklabel, nextlabel, mask
end

function loss(data, ind, masklabel, nextlabel, mask = nothing)
  e = bert_model.embed(data)
  t = bert_model.transformers(e, mask)
  # next sentence prediction from the first ([CLS]) position;
  # the argument order (label first) is assumed here
  nextloss = Basic.logcrossentropy(
    nextlabel,
    bert_model.classifier.ns(
      t[:,1,:]
    )
  )
  mkloss = masklmloss(bert_model.embed.embeddings.tok, # embedding table for compute similarity
                      bert_model.classifier.mlm, # transform function on output embedding
                      t, # output embeddings
                      ind, # mask index
                      masklabel #masked token
                      )
  return nextloss + mkloss
end

ps = params(bert_model)
opt = ADAM(1e-4)
ADAM(0.0001, (0.9, 0.999), IdDict{Any,Any}())
for i in 1:10 # run 10 steps for illustration
  batch = get_batch(datas, 2)
  batch === nothing && break # out of data
  data, ind, masklabel, nextlabel, mask = todevice(preprocess(batch))
  l = loss(data, ind, masklabel, nextlabel, mask)
  @show l
  grad = gradient(()->l, ps)
  update!(opt, ps, grad)
end
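To give an idea of what a masked language modeling loss does with its five arguments, here is a Base-only sketch: gather the output vectors at the masked positions, apply the transform, score them against every row of the embedding table, and take the cross entropy against the true tokens. The function masklmloss_sketch is an illustration written under these assumptions; it is not the masklmloss from Transformers.jl and the real function may differ (for example in how it normalizes or handles a bias).

```julia
# Sketch of a masked-LM loss.
#   table     : hidden × vocab embedding table
#   transform : function applied to the gathered output vectors
#   t         : hidden × seq_len × batch transformer outputs
#   ind       : (position, batch index) pairs of the masked tokens
#   onehot    : vocab × n_masked one-hot labels of the original tokens
function masklmloss_sketch(table::AbstractMatrix, transform, t::AbstractArray{<:Real,3},
                           ind::Vector{Tuple{Int,Int}}, onehot::AbstractMatrix)
    # gather the output vector of every masked token: hidden × n_masked
    gathered = hcat((t[:, pos, b] for (pos, b) in ind)...)
    # similarity between each transformed output and each embedding: vocab × n_masked
    logits = table' * transform(gathered)
    # numerically stable log-softmax over the vocabulary
    m = maximum(logits; dims = 1)
    logp = logits .- m .- log.(sum(exp.(logits .- m); dims = 1))
    return -sum(onehot .* logp) / size(onehot, 2)
end

# per-sentence masked positions, flattened to (position, batch index) pairs
mask_positions = [[2, 3], [1]]
ind = [(j, i) for (i, x) in enumerate(mask_positions) for j in x]

table = [1.0 0.0 1.0; 0.0 1.0 1.0]     # hidden = 2, vocab = 3
t = zeros(2, 3, 2)                     # all-zero outputs give a uniform prediction
onehot = Bool[1 0 0; 0 1 0; 0 0 1]
l = masklmloss_sketch(table, identity, t, ind, onehot)
```

With all-zero outputs every vocabulary entry gets the same score, so the loss equals log(3), the cross entropy of a uniform guess over three tokens.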


You can find more examples here. Most of the functionality is done, but the documentation isn't finished. I'll try to add more examples and docs in the following weeks. As for the TPU part, I don't have working code yet. Fortunately, I got access to the TFRC TPU Cloud, so hopefully I can get something done with it.