JSoC 2019 - Blog #3 (End of Phase Two): BERT model in Julia
It has been a while since the last update, but we finally got here. Most of the functionality is done. Let's see what we have now. Again, before we start, we first check out the #bert branch.
```julia
using Pkg
pkg"add Transformers#bert"
```
Getting a Pretrained Model
In the last blog post, we showed that we can use the `tfckpt2bson` function to convert a BERT TensorFlow pretrained checkpoint into a BSON file and load it in Julia. Now, if you just want to use the publicly released checkpoints from Google, we have a self-hosted version that you can download directly, so there's no need to use `tfckpt2bson`. For example, if we want to fine-tune the `uncased_L-12_H-768_A-12` pretrained weights, we can do this:
```julia
using Transformers
using Transformers.Pretrain

ENV["DATADEPS_ALWAYS_ACCEPT"] = true
```
The download process is handled by DataDeps.jl, so we need to set `"DATADEPS_ALWAYS_ACCEPT"` to `true` to download the model without having to type `Y` at the prompt. Loading the model is done with a special string literal with the `pretrain` prefix, in the format `pretrain"bert-<model-name>:<item>"`, where `<item>` can be `:bert_model`, `:wordpiece`, `:tokenizer`, or `:all` (the default). Like this:
```julia
bert_model, wordpiece, tokenizer = pretrain"bert-uncased_L-12_H-768_A-12"
```
Then we can use the pretrained model in our task. You can also list the supported pretrained models; if you can't find a public model on the list, please open an issue.
```julia
pretrains()
```
Fine-tune model
Fine-tuning a BERT model is also very easy. We use the GLUE tasks as an example.
GLUE datasets
Before we start the fine-tuning process, we need to download the dataset. We have a version of the GLUE tasks in `Datasets`, so you can test the model without worrying about how to handle the data. For example:
```julia
using Transformers.Datasets
using Transformers.Datasets.GLUE

task = GLUE.QNLI()
datas = dataset(Train, task)
get_batch(datas, 4)
```
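To get a feel for the data, here is a rough sketch of what such a batch contains, inferred from how the `preprocess` function below consumes it (the descriptions are mine, not part of the API):

```julia
# a batch of 4 QNLI examples, as consumed by `preprocess` below
batch = get_batch(datas, 4)
# batch[1] -- the 4 first sentences (the questions, for QNLI)
# batch[2] -- the 4 second sentences
# batch[3] -- the 4 labels, drawn from `get_labels(task)`
```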
You can see which GLUE tasks are supported here.
```julia
using InteractiveUtils: varinfo
varinfo(GLUE)
```
| name | size | summary |
|---|---|---|
| CoLA | 172 bytes | DataType |
| Diagnostic | 172 bytes | DataType |
| GLUE | 82.390 KiB | Module |
| MNLI | 188 bytes | DataType |
| MRPC | 172 bytes | DataType |
| QNLI | 172 bytes | DataType |
| QQP | 172 bytes | DataType |
| RTE | 172 bytes | DataType |
| SNLI | 172 bytes | DataType |
| SST | 172 bytes | DataType |
| STS | 172 bytes | DataType |
| WNLI | 172 bytes | DataType |
Preprocess the data
```julia
using Transformers.Basic
using Flux: onehotbatch

vocab = Vocabulary(wordpiece) # get the vocabulary from WordPiece
labels = get_labels(task)     # get the dataset labels

# add the start and separator symbols around the sentence pair
markline(s1, s2) = ["[CLS]"; s1; "[SEP]"; s2; "[SEP]"]

function preprocess(batch)
    s1 = wordpiece.(tokenizer.(batch[1]))
    s2 = wordpiece.(tokenizer.(batch[2]))
    sentence = markline.(s1, s2)
    mask = getmask(sentence)
    tok = vocab(sentence)
    segment = fill!(similar(tok), 1)
    for (i, sent) ∈ enumerate(sentence)
        j = findfirst(isequal("[SEP]"), sent)
        if j !== nothing
            segment[j+1:end, i] .= 2
        end
    end
    label = onehotbatch(batch[3], labels)
    return (tok = tok, segment = segment), label, mask
end

preprocess(get_batch(datas, 4))
```
Training the model
Now that we have the fine-tuning dataset and the preprocess function, we can start training our model with a simple training loop.
```julia
# preparation
using Flux
using Flux: gradient
import Flux.Optimise: update!
using CuArrays

clf = Chain(
    Dropout(0.1),
    Dense(768, length(labels)),
    logsoftmax
)

# remove the masklm/nextsentence weights,
# set clf as part of the classifier,
# and move the resulting model to the gpu
bert_model = gpu(
    Basic.set_classifier(bert_model,
        (
            pooler = bert_model.classifier.pooler,
            clf = clf
        )
    )
)
bert_model # inspect the new model

ps = params(bert_model)
opt = ADAM(1e-4)

# define the loss function
function loss(data, label, mask = nothing)
    e = bert_model.embed(data)
    t = bert_model.transformers(e, mask)
    l = Basic.logcrossentropy(
        label,
        bert_model.classifier.clf(
            bert_model.classifier.pooler(
                t[:,1,:]
            )
        )
    )
    return l
end
```
```julia
# the training loop
for i ∈ 1:10 # 10 training steps, just for illustration
    batch = get_batch(datas, 2)
    batch === nothing && break # ran out of training data

    data, label, mask = todevice( # move the data to the gpu
        preprocess(batch)
    )
    l = loss(data, label, mask)
    @show l # print the loss
    grad = gradient(()->l, ps)
    update!(opt, ps, grad)
end
```
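After (or during) training you will probably want to check how the model does on held-out data. The snippet below is a minimal evaluation sketch under a couple of assumptions: it assumes the GLUE wrapper exposes a dev split the same way it exposes `Train` (i.e. `dataset(Dev, task)`), and the `acc` helper is ad hoc, not part of the package:

```julia
using Flux: onecold

# ad-hoc accuracy helper: fraction of predictions matching the one-hot labels
function acc(p, label)
    pred  = onecold(collect(p))
    truth = onecold(collect(label))
    sum(pred .== truth) / length(truth)
end

devset = dataset(Dev, task)  # assumes a Dev split is available
devbatch = get_batch(devset, 4)
data, label, mask = todevice(preprocess(devbatch))

e = bert_model.embed(data)
t = bert_model.transformers(e, mask)
p = bert_model.classifier.clf(bert_model.classifier.pooler(t[:,1,:]))
@show acc(p, label)
```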
Feature-based Approach
We can also use our BERT model to extract fixed feature vectors (like ELMo). To do so, just pass an extra keyword argument when calling the transformer layers, and you get the output of every Transformer layer.
```julia
data, label, mask = todevice(preprocess(get_batch(datas, 4)))
e = bert_model.embed(data)
out, ts = bert_model.transformers(e, mask; all = true)

# the output of the 11-th transformer layer
ts[11]
```
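If you want a single feature tensor rather than per-layer outputs, one common choice in the feature-based setup of the BERT paper is to combine the last few layers. A minimal sketch (purely illustrative, not part of the package API):

```julia
# sum the outputs of the last four Transformer layers into one
# feature tensor of size hidden × seq_len × batch
feature = sum(ts[end-3:end])

# e.g. take the vector at the [CLS] position as a sentence-level feature
sentence_feature = feature[:, 1, :]
```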
Pretrain our own model
We can also pretrain a BERT model on our own dataset. The following example shows how to do it.
Pretrain task helper
BERT has two pretraining tasks: masked language modeling and next-sentence prediction. Here is a document from a recent Wikipedia dump; we'll use it as the dataset in the following code.
```julia
# one document from a wiki dump, just for illustration
docs = """
Guy Fawkes (; 13 April 1570 – 31 January 1606), also known as Guido Fawkes while fighting for the Spanish, was a member of a group of provincial English Catholics who planned the failed Gunpowder Plot of 1605. He was born and educated in York, England; his father died when Fawkes was eight years old, after which his mother married a recusant Catholic. Fawkes converted to Catholicism and left for mainland Europe, where he fought for Catholic Spain in the Eighty Years' War against Protestant Dutch reformers in the Low Countries. He travelled to Spain to seek support for a Catholic rebellion in England without success. He later met Thomas Wintour, with whom he returned to England, and Wintour introduced him to Robert Catesby, who planned to assassinate and restore a Catholic monarch to the throne. The plotters leased an undercroft beneath the House of Lords, and Fawkes was placed in charge of the gunpowder which they stockpiled there. The authorities were prompted by an anonymous letter to search Westminster Palace during the early hours of 5 November, and they found Fawkes guarding the explosives. He was questioned and tortured over the next few days, and he finally confessed. Immediately before his execution on 31 January, Fawkes fell from the scaffold where he was to be hanged and broke his neck, thus avoiding the agony of being hanged, drawn and quartered. He became synonymous with the Gunpowder Plot, the failure of which has been commemorated in Britain as Guy Fawkes Night since 5 November 1605, when his effigy is traditionally burned on a bonfire, commonly accompanied by fireworks. Guy Fawkes was born in 1570 in Stonegate, York. He was the second of four children born to Edward Fawkes, a proctor and an advocate of the consistory court at York, and his wife, Edith. Guy's parents were regular communicants of the Church of England, as were his paternal grandparents; his grandmother, born Ellen Harrington, was the daughter of a prominent merchant, who served as Lord Mayor of York in 1536. Guy's mother's family were recusant Catholics, and his cousin, Richard Cowling, became a Jesuit priest. "Guy" was an uncommon name in England, but may have been popular in York on account of a local notable, Sir Guy Fairfax of Steeton. The date of Fawkes's birth is unknown, but he was baptised in the church of St Michael le Belfrey on 16 April. As the customary gap between birth and baptism was three days, he was probably born about 13 April. In 1568, Edith had given birth to a daughter named Anne, but the child died aged about seven weeks, in November that year. She bore two more children after Guy: Anne (b. 1572), and Elizabeth (b. 1575). Both were married, in 1599 and 1594 respectively. In 1579, when Guy was eight years old, his father died. His mother remarried several years later, to the Catholic Dionis Baynbrigge (or Denis Bainbridge) of Scotton, Harrogate. Fawkes may have become a Catholic through the Baynbrigge family's recusant tendencies, and also the Catholic branches of the Pulleyn and Percy families of Scotton, but also from his time at St. Peter's School in York. A governor of the school had spent about 20 years in prison for recusancy, and its headmaster, John Pulleyn, came from a family of noted Yorkshire recusants, the Pulleyns of Blubberhouses.
In her 1915 work "The Pulleynes of Yorkshire", author Catharine Pullein suggested that Fawkes's Catholic education came from his Harrington relatives, who were known for harbouring priests, one of whom later accompanied Fawkes to Flanders in 1592–1593. Fawkes's fellow students included John Wright and his brother Christopher (both later involved with Fawkes in the Gunpowder Plot) and Oswald Tesimond, Edward Oldcorne and Robert Middleton, who became priests (the latter executed in 1601).
"""
```
You can use the pretraining helper function `BidirectionalEncoder.bert_pretrain_task` to get the masked input sentences, the masked indices and tokens, and the next-sentence label, but you need to wrap your own data in a `Channel` that yields the sentences one by one in the correct order (otherwise the next-sentence prediction labels will be wrong).
```julia
using WordTokenizers

chn = Channel(3)

sentences = split_sentences(docs)
# feed the sentences into the channel asynchronously
task = @async foreach(sentences) do sentence
    if !isempty(sentence)
        put!(chn, sentence)
    end
end
bind(chn, task)
```
```julia
using Transformers.BidirectionalEncoder

# we need to pass the wordpiece; if no tokenizer is provided,
# the default tokenizer from WordTokenizers is used
datas = BidirectionalEncoder.bert_pretrain_task(chn, wordpiece; tokenizer = tokenizer)
```
```julia
masked_sentence, mask_idx, masked_token, isnext = get_batch(datas, 1)
```
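Roughly, the four fields look like this (my description, inferred from how the pretraining `preprocess` function below consumes them):

```julia
# a peek at the batch of size 1
masked_sentence[1]  # wordpiece tokens with "[CLS]"/"[SEP]" added and some positions masked
mask_idx[1]         # the positions that were selected for masking in that sentence
masked_token[1]     # the original tokens at those positions (the MLM targets)
isnext[1]           # whether the second sentence really follows the first (the NSP label)
```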
Here we use the `wordpiece` and `tokenizer` from the previous section. If you only want them without loading the whole model, you can do this:
```julia
wordpiece = pretrain"bert-uncased_L-12_H-768_A-12:wordpiece"
tokenizer = pretrain"bert-uncased_L-12_H-768_A-12:tokenizer"
```
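As a quick sanity check, you can run them directly on a sentence (the example sentence is arbitrary); this is the same `tokenizer.(...)`/`wordpiece.(...)` pattern used in the fine-tuning preprocess above:

```julia
# tokenize a sentence, then split the tokens into wordpieces
tokens = tokenizer("Guy Fawkes was born in 1570 in Stonegate, York.")
pieces = wordpiece(tokens)
```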
Training the model
Here we run a pretraining example with a small BERT model:
```julia
emb = CompositeEmbedding(
    tok = Embed(300, length(vocab)),
    pe = PositionEmbedding(300, 512; trainable = false),
    seg = Embed(300, 2)
)

bert = Bert(
    300, # hidden size
    12,  # number of heads
    512, # intermediate hidden size
    3,   # number of layers
    act = gelu,
)

masklm = Dense(300, 300)
nextsentence = Chain(Dense(300, 2), logsoftmax)

bert_model = TransformerModel(emb, bert, (mlm = masklm, ns = nextsentence)) |> gpu
```
```julia
function preprocess(batch)
    mask = getmask(batch[1])
    tok = vocab(batch[1])
    segment = fill!(similar(tok), 1.0)

    for (i, sentence) ∈ enumerate(batch[1])
        j = findfirst(isequal("[SEP]"), sentence)
        if j !== nothing
            segment[j+1:end, i] .= 2.0
        end
    end

    # flatten the per-sentence mask positions into (position, batch-index) pairs
    ind = vcat(
        map(enumerate(batch[2])) do (i, x)
            map(j -> (j, i), x)
        end...)

    masklabel = onehotbatch(vocab(vcat(batch[3]...)), 1:length(vocab))
    nextlabel = onehotbatch(batch[4], (true, false))
    return (tok = tok, seg = segment), ind, masklabel, nextlabel, mask
end

function loss(data, ind, masklabel, nextlabel, mask = nothing)
    e = bert_model.embed(data)
    t = bert_model.transformers(e, mask)
    nextloss = Basic.logcrossentropy(
        nextlabel,
        bert_model.classifier.ns(
            t[:,1,:]
        )
    )
    mkloss = masklmloss(bert_model.embed.embeddings.tok, # embedding table for computing similarity
                        bert_model.classifier.mlm,       # transform function on the output embedding
                        t,          # output embeddings
                        ind,        # masked indices
                        masklabel   # masked tokens
                        )
    return nextloss + mkloss
end

ps = params(bert_model)
opt = ADAM(1e-4)
```
```julia
for i ∈ 1:10 # run 10 steps, just for illustration
    batch = get_batch(datas, 2)
    batch === nothing && break # out of data

    data, ind, masklabel, nextlabel, mask = todevice(preprocess(batch))
    l = loss(data, ind, masklabel, nextlabel, mask)
    @show l # print the loss
    grad = gradient(()->l, ps)
    update!(opt, ps, grad)
end
```
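Once pretraining is done you will probably want to keep the weights. Here is a minimal sketch using BSON.jl (the same format the converted checkpoints use); the file name is arbitrary, and moving the model back to the CPU first avoids serializing GPU arrays:

```julia
using BSON: @save

cpu_model = cpu(bert_model)  # move the model back to the cpu before saving
@save "my_pretrained_bert.bson" cpu_model
```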
Conclusion
You can find more examples here. Most of the functionality is done, but the documentation isn't finished yet. I'll try to add more examples and docs in the following weeks. As for the TPU part, I don't have workable code yet. Fortunately, I got access to the TFRC TPU Cloud, so hopefully I can get something done with it.