Peter Cheng / Aug 25 2019

JSoC 2019 Blog #4: Summary of the Season

The Julia Season of Contributions is coming to an end. It's always a great time coding in Julia. Here is a summary of my season.

Proposal: Practical implementation of BERT models for Julia

using Transformers
using Transformers.BidirectionalEncoder
using Transformers.Pretrain

Implement the Bidirectional Encoder Representations from Transformers (BERT) model in pure Julia. The objectives listed in the proposal were:

  • The model definition: including the Bert layer type and the forward call.
Bert(size::Int, head::Int, ps::Int, layer::Int;
    act = gelu, pdrop = 0.1, attn_pdrop = 0.1)
Bert(size::Int, head::Int, hs::Int, ps::Int, layer::Int;
    act = gelu, pdrop = 0.1, attn_pdrop = 0.1)

The Bidirectional Encoder Representations from Transformers (BERT) model.

(bert::Bert)(x, mask=nothing; all::Bool=false)

Evaluate the Bert layer on input x. If a length mask is given (with shape (1, seqlen, batchsize)), the attention is masked with getmask(mask, mask). Set all to true to also get the outputs of every transformer layer.
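To make the masking step concrete, here is a minimal sketch of how a length mask of shape (1, seqlen, batchsize) can be turned into a pairwise attention mask. It assumes the pairwise mask is the per-sample outer product of the length mask with itself; `pairwise_mask` is a hypothetical stand-in written for this post, not the package's getmask.

```julia
# Hypothetical stand-in for a pairwise attention mask; not the package's getmask.
# `m` has shape (1, seqlen, batchsize): 1 marks a real token, 0 marks padding.
function pairwise_mask(m::AbstractArray{T,3}) where {T}
    _, seqlen, batchsize = size(m)
    out = zeros(T, seqlen, seqlen, batchsize)
    for b in 1:batchsize, j in 1:seqlen, i in 1:seqlen
        # Position i may attend to position j only if both are real tokens.
        out[i, j, b] = m[1, i, b] * m[1, j, b]
    end
    return out
end

# Two sequences padded to length 4: the first has 3 real tokens, the second has 2.
lengths = [3, 2]
m = zeros(Float32, 1, 4, 2)
for (b, len) in enumerate(lengths)
    m[1, 1:len, b] .= 1
end

attn_mask = pairwise_mask(m)   # size (4, 4, 2)
```

Every entry that involves a padded position is zero, so padding can neither attend nor be attended to.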

  • Functions to load the released pre-trained model.

A convenience macro (@pretrain_str) for loading pretrained data. It uses DataDeps to download the model automatically if it has not been downloaded yet. The string should be in the pretrain"<model>-<model-name>:<item>" format.

See also Pretrain.pretrains().
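As an illustration of the string format only, the sketch below splits a "<model>-<model-name>:<item>" string into its three parts. `parse_pretrain` is a hypothetical helper, not how the macro is implemented, and the model string is just an example of the format.

```julia
# Hypothetical parser for the "<model>-<model-name>:<item>" format; illustration only.
function parse_pretrain(str::AbstractString)
    # <model> ends at the first '-', <model-name> ends at the first ':'.
    m = match(r"^([^-:]+)-([^:]+):(.+)$", str)
    m === nothing &&
        throw(ArgumentError("expected \"<model>-<model-name>:<item>\", got \"$str\""))
    return (model = String(m[1]), name = String(m[2]), item = String(m[3]))
end

parsed = parse_pretrain("bert-uncased_L-12_H-768_A-12:bert_model")
# parsed.model is the model family, parsed.name the model name, parsed.item the item to load.
```

Note that the model name may itself contain hyphens, so only the first hyphen separates the model family from the name.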

load_pretrain(name; kw...)

Same as @pretrain_str, but accepts keyword arguments if needed.

pretrains(model::String = "")

Show all available models.

  • The text segmentation (tokenization) functions used in the original Python code.

Google BERT tokenizer that preserves case during tokenization. Recommended for multilingual data.


Google BERT tokenizer that lowercases the input before tokenization.

WordPiece(vocab::Vector{String}, unk::String = "[UNK]"; max_char::Int=200)

WordPiece implementation.


Split the given token.


Split the given tokens.

(wp::WordPiece)(type, tokens::Vector{String})

Split the given tokens. If type is Int, return piece indices instead of string pieces.

(wp::WordPiece)(tks::Vector{T}, token::String)

Split the given token and append the result to tks. If T is Int, append indices instead of string pieces.
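For readers unfamiliar with WordPiece, here is a minimal sketch of the greedy longest-match-first splitting it is based on. `wordpiece_split` and the toy vocabulary are hypothetical and written for this post; the sketch takes a Set for lookup and is not the package's WordPiece type.

```julia
# Minimal greedy longest-match-first WordPiece sketch; not the package's implementation.
function wordpiece_split(vocab::Set{String}, token::String;
                         unk::String = "[UNK]", max_char::Int = 200)
    length(token) > max_char && return [unk]
    chars = collect(token)              # work on characters, not bytes
    pieces = String[]
    start = 1
    while start <= length(chars)
        stop = length(chars)
        piece = nothing
        # Try the longest remaining substring first, then shrink it.
        while stop >= start
            cand = String(chars[start:stop])
            start > 1 && (cand = "##" * cand)   # continuation pieces carry a "##" prefix
            if cand in vocab
                piece = cand
                break
            end
            stop -= 1
        end
        piece === nothing && return [unk]   # nothing matched: the whole token is unknown
        push!(pieces, piece)
        start = stop + 1
    end
    return pieces
end

vocab = Set(["un", "##aff", "##able", "play", "##ing", "[UNK]"])
known   = wordpiece_split(vocab, "unaffable")
unknown = wordpiece_split(vocab, "xyz")
```

With this toy vocabulary, "unaffable" is split into "un", "##aff" and "##able", while a token with no matching pieces maps to the unknown token.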

  • Helper functions for preparing the pretraining data, e.g. a function that randomly masks 15% of the tokens for masked language modeling.
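A minimal sketch of the masking step, assuming tokens are plain strings: `mask_tokens` is a hypothetical helper written for this post, not a function from the package. It only replaces the chosen tokens with "[MASK]"; the original BERT recipe additionally swaps some of the chosen tokens for random ones or leaves them unchanged, which is omitted here.

```julia
using Random

# Hypothetical helper: replace about `ratio` of the tokens with `mask_token`.
# Returns the masked copy, the chosen positions, and the original tokens there.
function mask_tokens(rng::AbstractRNG, tokens::Vector{String};
                     ratio::Float64 = 0.15, mask_token::String = "[MASK]")
    n = max(1, round(Int, ratio * length(tokens)))   # always mask at least one token
    positions = sort(randperm(rng, length(tokens))[1:n])
    labels = tokens[positions]                       # what the model should predict
    masked = copy(tokens)
    masked[positions] .= mask_token
    return masked, positions, labels
end

rng = MersenneTwister(42)
tokens = String.(split("the quick brown fox jumps over the lazy dog and keeps " *
                       "running far away from the farm today before noon"))
masked, positions, labels = mask_tokens(rng, tokens)
```

In a real pipeline, special tokens such as [CLS] and [SEP] would be excluded from the candidate positions.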
  • Loss functions: masked LM and next-sentence prediction.
masklmloss(embed::Embed{T}, transform,
           t::AbstractArray{T, N}, posis::AbstractArray{Tuple{Int,Int}}, labels) where {T,N}
masklmloss(embed::Embed{T}, transform, output_bias,
           t::AbstractArray{T, N}, posis::AbstractArray{Tuple{Int,Int}}, labels) where {T,N}

Helper function for computing the masked language modeling loss. It performs transform(x) .+ output_bias, where x holds the masked tokens specified by posis, then computes the similarity with embed.embedding and the cross-entropy against the true labels.
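The computation can be sketched in plain Julia as follows. This is not the package's code and it rests on several assumptions: the embedding is a hidden × vocab matrix, each entry of posis is a (position, batch index) pair, labels are vocabulary indices rather than one-hot vectors, and the bias has one entry per vocabulary item and is added to the similarity scores. `masked_lm_loss` and `logsoftmax_cols` are hypothetical names.

```julia
using Random

# Numerically stable log-softmax over the first (vocabulary) dimension.
function logsoftmax_cols(z::AbstractMatrix)
    zmax = maximum(z; dims = 1)
    return z .- zmax .- log.(sum(exp.(z .- zmax); dims = 1))
end

# embedding: hidden × vocab, t: hidden × seqlen × batch,
# posis: (position, batch index) pairs, labels: vocabulary index of each masked token.
function masked_lm_loss(embedding, transform, output_bias, t, posis, labels)
    # Gather the hidden states at the masked positions: hidden × nmask.
    x = reduce(hcat, [t[:, p, b] for (p, b) in posis])
    # Similarity of each transformed state with each embedding: vocab × nmask.
    scores = embedding' * transform(x) .+ output_bias
    logp = logsoftmax_cols(scores)
    # Mean negative log-likelihood of the true tokens.
    return -sum(logp[labels[i], i] for i in eachindex(labels)) / length(labels)
end

rng = MersenneTwister(0)
hidden, vocab, seqlen, batch = 4, 5, 6, 2
embedding = randn(rng, Float32, hidden, vocab)
t = randn(rng, Float32, hidden, seqlen, batch)
posis = [(2, 1), (5, 1), (3, 2)]
labels = [1, 4, 2]
loss = masked_lm_loss(embedding, identity, zeros(Float32, vocab), t, posis, labels)
```

A quick sanity check: with an all-zero embedding and bias every token is equally likely, so the loss equals the log of the vocabulary size.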

  • Feature-based BERT embedding.

This is the same as the Bert forward call (bert::Bert)(x, mask=nothing; all::Bool) with the all keyword set to true.
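Once the per-layer outputs are available, a common way to build feature-based embeddings is to combine the last few layers. The sketch below assumes the outputs are held as a vector of hidden × seqlen × batch arrays; the exact return structure of the forward call is not shown in this post, so `layer_outputs` is a stand-in filled with dummy values, and `concat_last` and `sum_last` are hypothetical helpers.

```julia
# Stand-in for the per-layer outputs: 12 layers of size hidden × seqlen × batch.
# Layer l is filled with the value l so the result is easy to inspect.
hidden, seqlen, batch, nlayers = 8, 5, 2, 12
layer_outputs = [fill(Float32(l), hidden, seqlen, batch) for l in 1:nlayers]

# Concatenate the last `n` layers along the feature dimension.
concat_last(outputs, n) = cat(outputs[end-n+1:end]...; dims = 1)

# Alternatively, sum the last `n` layers to keep the feature size unchanged.
sum_last(outputs, n) = sum(outputs[end-n+1:end])

features = concat_last(layer_outputs, 4)   # size (4 * hidden, seqlen, batch)
summed   = sum_last(layer_outputs, 4)      # size (hidden, seqlen, batch)
```

Concatenation keeps more information at the cost of a larger feature size; summing is cheaper for downstream models.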

  • Example code: fine-tuning examples.

All the examples can be found in the example folder of the project repo.

  • Documentation.

Most of the implemented functions have docstrings, which can also be found in the documentation.

Unachieved stuff

There are also some things I tried to do during the season but wasn't able to finish. Here is the list:

  • Tune GPU performance:

Currently the implementation is not optimized. If you run the model on a GPU, you'll notice that it doesn't use 100% of your GPU cores. After a few experiments, I found some type instabilities in the forward pass (and in the backward pass as well). One main cause is the broadcast operation on TrackedArray. Fixing it might require changing the broadcast-related definitions in Tracker.jl.

On the other hand, it's also possible to write custom GPU kernels to optimize performance, as FasterTransformer does. However, I don't have solid experience in GPU programming, and implementing one with CUDAnative required more than I originally thought.
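For readers who have not hunted type instabilities before, here is a generic sketch of how one can be detected with the Test standard library. The two functions are toy examples written for this post and are unrelated to the actual Tracker.jl broadcast code.

```julia
using Test

# Type-unstable: returns a Float64 on one branch and an Int on the other.
unstable_relu(x::Float64) = x > 0 ? x : 0

# Type-stable: both branches return the same type as the input.
stable_relu(x::Float64) = x > 0 ? x : zero(x)

# `@inferred` returns the value when inference gives a concrete matching type...
stable_ok = (@inferred stable_relu(1.5)) == 1.5

# ...and throws when the inferred return type differs from the actual one.
unstable_caught = try
    @inferred unstable_relu(1.5)
    false
catch
    true
end
```

For interactive inspection, `@code_warntype` from InteractiveUtils prints the inferred types of a call and highlights the non-concrete ones.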

  • Adopt Zygote as the Flux AD backend:

Since Tracker.jl is said to be deprecated soon, I also tried to switch the Flux AD backend to Zygote.jl. However, Zygote is not ready for use yet: some features are still missing and others do not work. With a few hacks, it's possible to use Zygote on a regular transformer, but not on BERT, because there are errors when differentiating the gelu activation.

  • Running model on TPU:

Running the model on TPU requires either adopting Zygote or implementing a trace-based AD. However, I couldn't get Zygote to run, and my trace-based AD implementation never worked.

  • Parallel data preprocessing

The preprocessing for the BERT pretraining task is relatively complicated, so it would be much more convenient to have a complete API for processing and handling the data. I have an experimental API here, but it is not finished yet.

Future development

There is still a lot that can be done, including the unachieved items above.

  • Optimize GPU performance
  • Contribute to Zygote
  • Make a TPU version
  • Complete the parallel data preprocessing API
  • Tools for investigating BERT models, e.g. visualization tools
  • More models and pretrained weights


After this season, we have a pure Julia implementation of the BERT model built on Flux and Tracker. It ain't much, but it's honest work. With Transformers.jl and a little coding, you can train or run inference with your own BERT model.


Thanks to JuliaLang and the Julia community for this opportunity. It's always a great time coding in Julia.

After JSoC, I will keep developing Transformers.jl. Hopefully, the project can become the "Cybertron", i.e. the home planet of all Julia Transformers.