Yash Patel / Aug 18 2019

JSOC'19: Practical implementation of ULMFiT in Julia [2]

In the previous blog post, I explained the techniques and layers we need to build the language model for ULMFiT. This post continues from there: here we combine all those techniques and layers to build the language model and then pre-train it.

Language Model Pre-training [Part 2]

All the layers discussed in the previous part will be used to build the language model. The language model is simply a mutable struct with two fields, vocab and layers: vocab is the vocabulary of the language (e.g. English) that is used throughout the model, and layers is an instance of Flux's Chain type, which contains all the layers needed to train the model, including the DropOut layers:


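# ULMFiT Language Model
# Assumes `using Flux, DelimitedFiles, InternedStrings`, plus DroppedEmbeddings,
# AWD_LSTM and init_weights from the previous post.
mutable struct LanguageModel
    vocab :: Vector
    layers :: Flux.Chain
end

function LanguageModel(embedding_size::Integer=400, hidLSTMSize::Integer=1150, outLSTMSize::Integer=embedding_size;
    embedDropProb::Float64 = 0.05, wordDropProb::Float64 = 0.4, hidDropProb::Float64 = 0.5, LayerDropProb::Float64 = 0.3, FinalDropProb::Float64 = 0.3)
    # The first column of vocab.csv holds the vocabulary words
    vocab = intern.(string.(readdlm("vocab.csv",',', header=false)[:, 1]))
    de = DroppedEmbeddings(length(vocab), embedding_size, embedDropProb; init = (dims...) -> init_weights(0.1, dims...))
    lm = LanguageModel(
        vocab,
        Chain(
            de,
            # The DropOut layers from the previous post (using wordDropProb, LayerDropProb
            # and FinalDropProb) sit between these layers; they are omitted from this listing.
            AWD_LSTM(embedding_size, hidLSTMSize, hidDropProb; init = (dims...) -> init_weights(1/hidLSTMSize, dims...)),
            AWD_LSTM(hidLSTMSize, hidLSTMSize, hidDropProb; init = (dims...) -> init_weights(1/hidLSTMSize, dims...)),
            AWD_LSTM(hidLSTMSize, outLSTMSize, hidDropProb; init = (dims...) -> init_weights(1/hidLSTMSize, dims...)),
            x -> de(x, true),   # the embedding layer `de` is used again at the output
            softmax             # assumed here, so that the output is a probability distribution
        )
    )
    return lm
end

Flux.@treelike LanguageModel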

By default, the language model is a three-layer AWD-LSTM model with 1150 units in the hidden layers and an embedding layer of size 400. AWD-LSTM is a regular LSTM (without attention, short-cut connections or any other sophisticated additions) with various tuned dropout hyper-parameters.

The init_weights function is used to initialize the LSTM and embedding layer weights. The vocabulary file provided by the authors can be downloaded from here. The vocabulary should contain special tokens for UNKNOWN (<unk>) and PADDING (<pad>); the code in this post writes them as _unk_ and _pad_. More special tokens can be added for a better language model. (Refer this for understanding the meaning of the above tokens).


Hyper-parameters are the parameters that the algorithm is not going to learn by itself; they have to be tuned manually. These include:

  1. DropOut probabilities
  2. DropConnect probabilities
  3. Embedding layer size
  4. LSTM layer sizes
  5. Gradient clip bound values (if used)
  6. Parameters of the optimizer used for updating weights (if any)
  7. Learning rates
  8. batchsize and bptt
  9. Number of epochs
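
As a small illustration, the default values that appear in the code of this post can be gathered in one place. This is only a sketch: the name hparams is hypothetical, and the values are the defaults shown in this post, not tuned recommendations.

```julia
# Hyper-parameter defaults that appear in this post, collected in a NamedTuple.
# `hparams` is a hypothetical name used only for this illustration.
hparams = (
    embedding_size = 400,    # embedding layer size
    hidLSTMSize    = 1150,   # units in the hidden LSTM layers
    embedDropProb  = 0.05,
    wordDropProb   = 0.4,
    hidDropProb    = 0.5,
    LayerDropProb  = 0.3,
    FinalDropProb  = 0.3,
    batchsize      = 64,
    bptt           = 70,
)
```

Keeping them together like this makes it easier to see what has to be tuned by hand before a training run.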

Preparing Mini-batch - batchsize and bptt

batchsize and bptt are often confused, which is why I have included this section. To train such a model we need a big dataset or corpus, and for training we need to divide it into mini-batches. The structure of a mini-batch depends on two hyper-parameters, batchsize and bptt. Here, batchsize is the number of independent sequences going through the model in parallel, and bptt is the length of one sequence.

To get a mini-batch, we first pre-process the data so that it becomes one long sequence of tokens. Then we divide this long sequence (the corpus) into equal sequences, such that the number of sequences produced equals batchsize. Finally, we divide each of these sequences again into smaller sequences of length bptt.
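
The two splitting steps can be sketched on a toy corpus in plain Julia. Everything here is an assumption made for illustration: the corpus is just the integers 1 to 20, batchsize is 2 and bptt is 3, and each mini-batch is stored sequence by sequence for readability, which is not necessarily the layout the real implementation uses.

```julia
# Toy corpus of 20 tokens (hypothetical data; integers stand in for words)
corpus = collect(1:20)
batchsize = 2   # number of parallel sequences
bptt = 3        # tokens per sequence in one mini-batch

# Step 1: split the corpus into `batchsize` equal, contiguous sequences
seq_len = length(corpus) ÷ batchsize
sequences = [corpus[(k-1)*seq_len+1 : k*seq_len] for k in 1:batchsize]

# Step 2: cut every sequence into windows of length `bptt`.
# One token is kept spare so that the targets can be the inputs shifted by one.
n_batches = (seq_len - 1) ÷ bptt
minibatches = map(1:n_batches) do i
    start = bptt*(i-1) + 1
    X = [seq[start : start+bptt-1] for seq in sequences]   # inputs
    Y = [seq[start+1 : start+bptt] for seq in sequences]   # next-token targets
    (X = X, Y = Y)
end
```

With these toy values the corpus yields three mini-batches, and the first one has inputs [1, 2, 3] and [11, 12, 13] with targets [2, 3, 4] and [12, 13, 14].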

After loading the corpus (WikiText-103 in this case) and pre-processing it, mini-batches are produced and passed through the model, and the weights are updated with each pass. To produce a mini-batch whenever one is needed, I have written a generator function that gives a mini-batch at every call, using the concept of tasks (or co-routines). [Refer this doc]

# Padding multiple sequences w.r.t. the longest sequence
function pre_pad_sequences(sequences::Vector, pad::String="_pad_")
    max_len = maximum([length(x) for x in sequences])
    return [[fill(pad, max_len-length(sequence)); sequence] for sequence in sequences]
end

function post_pad_sequences(sequences::Vector, pad::String="_pad_")
    max_len = maximum([length(x) for x in sequences])
    return [[sequence; fill(pad, max_len-length(sequence))] for sequence in sequences]
end

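# Generator: the first value it gives is the number of mini-batches; after that it
# should be called two times per mini-batch, since it gives X in the first and Y in the second call
function generator(c::Channel, corpus; batchsize::Integer=64, bptt::Integer=70)
    X_total = post_pad_sequences(chunk(corpus, batchsize))
    # Each mini-batch reads bptt+1 tokens (inputs plus shifted targets),
    # so one token is kept spare to stay inside the sequence bounds
    n_batches = (length(X_total[1]) - 1) ÷ bptt
    put!(c, n_batches)
    for i=1:n_batches
        start = bptt*(i-1) + 1
        batch = [Flux.batch(X_total[k][j] for k=1:batchsize) for j=start:start+bptt]
        put!(c, batch[1:end-1])   # X: inputs
        put!(c, batch[2:end])     # Y: the same tokens shifted by one position
    end
end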

Here, Y (the target variable) is simply the next word in the sequence with respect to X (the input variable), which is why the batch variable in generator is built the way it is. Notice that the very first value given by the generator is the number of mini-batches, which is helpful for training. The corpus argument of the generator function is a long Vector of words. It can be used like this:

corpus = read(corpuspath, String)
corpus = intern.(tokenize(corpus))
gen = Channel(x -> generator(x, corpus))
num_of_batches = take!(gen)   # the first value is the number of mini-batches
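
The calling pattern can be tried out without a corpus by using a toy producer. This is only a sketch of the protocol described above: toy_generator and the strings it emits are hypothetical stand-ins, not part of the actual implementation.

```julia
# Toy producer that follows the same protocol as the generator in this post:
# first the number of mini-batches, then X and Y alternately.
function toy_generator(c::Channel, n_batches::Integer)
    put!(c, n_batches)        # first value: number of mini-batches
    for i in 1:n_batches
        put!(c, "X$i")        # stand-in for the inputs of mini-batch i
        put!(c, "Y$i")        # stand-in for the targets of mini-batch i
    end
end

gen = Channel(c -> toy_generator(c, 2))

n = take!(gen)                                       # number of mini-batches
pairs = [(take!(gen), take!(gen)) for _ in 1:n]      # two take! calls per mini-batch
```

Because the producer runs as a task, each put! waits until the consumer calls take!, so only one mini-batch has to exist at a time.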

Other than the regularization functions discussed in the previous blog, we need some more functions to train the model.

Forward Propagation

LSTMs, DropOuts and DropConnect Layers:

For each forward pass, the batch passes through the regularization steps and the RNN layers. This is done in the forward function. It takes a LanguageModel instance and a mini-batch as input and, for every time step of the mini-batch, returns a probability distribution over the words in the vocabulary, i.e. a Vector whose length equals the vocabulary size for each sequence in the batch. The main purpose of this function is to avoid holding extra memory in the main training loop:

# Forward
function forward(lm, batch)
    batch = map(x -> indices(x, lm.vocab, "_unk_"), batch)   # words -> vocabulary indices
    batch = lm.layers.(batch)                                # apply the model to every time step
    return batch
end
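
To make the shape of that output concrete, here is a minimal sketch in plain Julia that does not use Flux. The toy_model function, the weight matrix W and all sizes are hypothetical stand-ins for lm.layers; only the broadcasting over time steps mirrors the forward function above.

```julia
# Column-wise softmax: every column becomes a probability distribution
function softmax_cols(z::AbstractMatrix)
    e = exp.(z .- maximum(z, dims=1))   # subtract the column maximum for numerical stability
    return e ./ sum(e, dims=1)
end

vocab_size, feature_size, batchsize, bptt = 5, 4, 2, 3   # hypothetical sizes

# Stand-in weights and model; the real model is the Chain stored in lm.layers
W = reshape(collect(1.0:vocab_size*feature_size), vocab_size, feature_size) ./ 10
toy_model(x) = softmax_cols(W * x)

# One matrix per time step, one column per sequence in the mini-batch
batch = [fill(0.1 * t, feature_size, batchsize) for t in 1:bptt]

H = toy_model.(batch)   # broadcast over the time steps, as in forward
```

H holds one matrix per time step, each of size vocab_size × batchsize, and every column of every matrix sums to 1.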

Objective function or loss function:

This is the function where the forward pass actually ends. It calculates the cross-entropy loss of the output predicted by the forward function:

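# Loss function - cross-entropy summed over the time steps of a mini-batch
# (the AR and TAR regularization terms are not shown in this listing)
function loss(lm, gen)
    H = forward(lm, take!(gen))   # predictions for X
    Y = broadcast(x -> gpu(Flux.onehotbatch(x, lm.vocab, "_unk_")), take!(gen))   # one-hot targets
    l = sum(crossentropy.(H, Y))
    Flux.truncate!(lm.layers)     # truncate the gradient calculation at the mini-batch boundary
    return l
end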

Here, I am using the truncate! function to truncate the gradient calculation. [Refer these docs to know more about it]
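
For intuition, the cross-entropy term can be checked by hand on a tiny example. This is a sketch in plain Julia under stated assumptions: xent is a hypothetical helper written for this illustration, the three-word vocabulary is made up, and Flux's own crossentropy may additionally average over the batch dimension.

```julia
# Cross-entropy between a predicted distribution yhat and a one-hot target y
xent(yhat, y) = -sum(y .* log.(yhat))

yhat = [0.7, 0.2, 0.1]   # hypothetical softmax output over a 3-word vocabulary
y    = [1.0, 0.0, 0.0]   # one-hot target: the first word is the correct next word

l = xent(yhat, y)        # only the probability of the correct word contributes: -log(0.7)
```

The loss shrinks towards 0 as the probability assigned to the correct next word approaches 1.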


As in other machine learning frameworks, back-propagation in Flux is easy compared to forward propagation. After calculating the loss, we need to calculate the gradients for that loss and then update the weights according to the chosen optimizer. I have written a function for back-propagation:

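# Backward - calculating gradients and updating weights
function backward!(layers, l, opt, gradient_clip::Float64)
    p = get_trainable_params(layers)       # collect the trainable parameters
    grads = Tracker.gradient(() -> l, p)   # gradients of the loss w.r.t. the parameters
    # gradient clipping with gradient_clip, if needed, would go here
    Tracker.update!(opt, p, grads)         # optimizer step
end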

Check out Backpropagation in Flux for a better understanding. Also, if gradient clipping needs to be applied, the backward! function is the best place to do it.
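
What clipping does to a gradient can be sketched in plain Julia. The helpers clip_by_value and clip_by_norm are hypothetical names written for this illustration, not functions from Flux, and the gradient values are made up; which of the two variants fits best is a design choice.

```julia
using LinearAlgebra   # for norm

# Clipping by value: every entry is forced into [-bound, bound]
clip_by_value(g::AbstractArray, bound::Real) = clamp.(g, -bound, bound)

# Clipping by norm: rescale the whole gradient if its norm exceeds maxnorm
function clip_by_norm(g::AbstractArray, maxnorm::Real)
    n = norm(g)
    return n > maxnorm ? g .* (maxnorm / n) : g
end

g = [0.5, -3.0, 12.0]                      # hypothetical gradient values
by_value = clip_by_value(g, 1.0)           # [0.5, -1.0, 1.0]
by_norm  = clip_by_norm([3.0, 4.0], 1.0)   # norm 5 rescaled to norm 1
```

Clipping by value bounds each entry independently, while clipping by norm keeps the direction of the gradient and only shortens it.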


This wraps up the whole implementation of the language model for ULMFiT. For a better conceptual understanding, see the research papers referenced below. The full implementation of the model discussed above can be found here. In the next blog, I will discuss fine-tuning the language model for a downstream task.