Yash Patel / Aug 18 2019

JSOC'19 : Practical implementation of ULMFiT for Text Classification [1]

Julia, developed at MIT, is a solution to the "two-language problem". It is a high-performance (like C), dynamically typed (like Python) language, well suited to scientific and numerical computing. I am glad to contribute to the development of the "Language of the Future".

This project is being done as part of Julia Season of Contributions (JSOC) 2019. The project falls under the Natural Language Processing domain. The objective of this project is to get an efficient and practical model for Text Classification, which is the task of assigning predefined tags or categories to a given piece of text. I found ULMFiT (Universal Language Model Fine-Tuning) to be a suitable model for the task. The model was proposed in 2018 by Jeremy Howard and Sebastian Ruder, and at the time of publication it reported state-of-the-art results on six text classification datasets. It leverages the concept of Transfer Learning, which is the improvement of learning a new task through the knowledge gained from learning a related task. Specifically, the model uses inductive transfer learning, which means extracting general rules from the observed training cases that can then be applied to various test cases, unlike transductive transfer learning, which solves a specific problem by extracting rules from specific training cases.

The implementation of the model can be divided into three major steps:

  1. Pre-training a Language Model
  2. Fine-Tuning of Language Model
  3. Fine-Tuning of Text Classifier

[All the steps are covered in successive blogs]

Language Model Pre-training [Part 1]

In this step, a general-domain Language Model is pre-trained. We use a large corpus so that the model can learn the general properties of the language (in this case, English); here we use the WikiText-103 corpus for pre-training. This step needs to be performed only once, and the trained Language Model can later be reused to improve performance and help the whole classifier model converge earlier on downstream tasks. Pre-training the Language Model also helps the classifier learn task-specific properties efficiently even from small datasets, which is what we refer to as the Transfer Learning concept.

In this part, we will see the techniques and layers that will be used to build and train the language model [refer to this paper for all the techniques used]. In the second part of Language Model Pre-training, all these techniques will be combined to train the model.

First, we will import Flux, a machine learning framework in Julia, and some other important libraries:

import Pkg

using Flux
import Flux: crossentropy, chunk
using WordTokenizers   # For accessories
using InternedStrings   # For using Interned strings

If CUDA is installed, CuArrays.jl is also needed to get GPU support for training:

using CuArrays

Regularization Techniques

There are several techniques discussed in this paper to train the LSTM Language Model efficiently. Below I have listed some of them, which are used in this implementation.

Variational DropOut:

Unlike the standard DropOut technique, where each time step in the LSTM gets a new dropout mask, Variational DropOut uses a common mask for all the time steps of one forward and backward pass (that is, one mini-batch). The masks for different mini-batches still differ from one another. We will be using Variational DropOut for all dropout operations except for hidden-to-hidden transitions.
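
The difference between the two masking schemes can be illustrated with plain arrays. The following is only a minimal sketch, not the implementation used in this project: the names bernoulli_mask, standard_dropout and variational_dropout are hypothetical, and it assumes inverted dropout scaling (kept units are multiplied by 1/(1-p)).

```julia
using Random

# Hypothetical helper: inverted-dropout mask whose entries are 0 or 1/(1-p).
bernoulli_mask(rng, dims, p) = Float32.(rand(rng, Float32, dims...) .> p) ./ Float32(1 - p)

# Standard dropout: a fresh mask is sampled at every time step.
function standard_dropout(rng, xs::Vector{<:AbstractMatrix}, p)
    return [x .* bernoulli_mask(rng, size(x), p) for x in xs]
end

# Variational (locked) dropout: one mask is sampled and reused for all time steps.
function variational_dropout(rng, xs::Vector{<:AbstractMatrix}, p)
    mask = bernoulli_mask(rng, size(first(xs)), p)
    return [x .* mask for x in xs]
end

rng = MersenneTwister(0)
xs = [ones(Float32, 4, 2) for _ in 1:3]   # 3 time steps, 4 features, batch of 2
locked = variational_dropout(rng, xs, 0.5)
fresh = standard_dropout(rng, xs, 0.5)
```

Because the inputs here are all ones, each element of locked is simply the shared mask, so the same positions are zeroed at every time step, whereas the elements of fresh are sampled independently.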


For hidden-to-hidden transitions we will use DropConnect instead of DropOut. Unlike DropOut, where the output of a neuron is dropped randomly, here the connections (weights) to the activation units are dropped randomly. That means an activation unit does not receive input from all of its input connections.
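
As an illustration, DropConnect on a recurrent weight matrix can be sketched with plain arrays. This is a toy example under stated assumptions, not the layer built later in this post: dropconnect_weights and run_steps are hypothetical names, kept weights are rescaled by 1/(1-p), and a simple tanh recurrence stands in for the LSTM.

```julia
using Random

# Hypothetical helper: drop individual weights and rescale the kept ones by 1/(1-p).
function dropconnect_weights(rng, W::AbstractMatrix, p)
    mask = Float32.(rand(rng, Float32, size(W)...) .> p) ./ Float32(1 - p)
    return W .* mask
end

# Toy recurrence: the same (already dropped) matrix is applied at every time step.
function run_steps(W, h, steps)
    for _ in 1:steps
        h = tanh.(W * h)
    end
    return h
end

rng = MersenneTwister(1)
Wh = randn(rng, Float32, 4, 4)                   # hidden-to-hidden weights (toy size)
Wh_dropped = dropconnect_weights(rng, Wh, 0.5)   # sampled once per mini-batch
h_final = run_steps(Wh_dropped, ones(Float32, 4), 3)
```

The mask is applied to the weights rather than to the activations, and because it is sampled once and reused for all steps, it behaves like a Variational DropOut mask.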

In ULMFiT, DropConnect follows the concept of Variational DropOut, i.e., the same mask is used for the whole mini-batch. Below is the implementation of Variational DropOut and DropConnect:


For DropOut or DropConnect operations, a common function, drop_mask, for mask generation is used:

# Generates Mask
import Flux: _dropout_kernel   # internal (unexported) Flux helper
import Random: rand!

function drop_mask(x, p)
    y = similar(x, size(x))
    rand!(y)   # fill with uniform random numbers before thresholding
    y .= _dropout_kernel.(y, p, 1 - p)
    return y
end

drop_mask(shape::Tuple, p; type = Float32) = (mask = rand(type, shape...); mask .= _dropout_kernel.(mask, p, 1 - p))

For Variational DropOut, I have created a struct named VarDrop:

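########################## Variational DropOut ######################
mutable struct VarDrop{F}
    p::F            # dropout probability
    mask            # mask shared across time steps
    reset::Bool     # true when a new mask must be generated
    active::Bool    # false disables dropout (e.g. at test time)
end

VarDrop(probability::Float64=0.0) = VarDrop(probability, Array{Float32, 2}(UndefInitializer(), 0, 0), true, true)

function (vd::VarDrop)(inp)
    vd.active || return inp
    if vd.reset
        vd.mask = drop_mask(inp, vd.p)
        vd.reset = false
    end
    return inp .* vd.mask
end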

Since these are Variational (or Locked) DropOuts, the masks need to be remembered until a full mini-batch has passed through all the layers. To address that, the above struct remembers its mask until the reset field is explicitly set to true. To do that, I have created a common generic function for all such structs, namely reset_masks!, which works in place: it modifies the given struct rather than returning a new one:

reset_masks!(vd::VarDrop) = (vd.reset = true)

Custom Layers

Weight-Dropped LSTM:

Often we use dropout on the output of the LSTM, or on the update to the memory cell at time step t, to reduce over-fitting. But here, we will use DropConnect instead of DropOut on the hidden-to-hidden connections (weights). Below is the mathematical formulation of the LSTM:
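
For reference, the formulation can be written in a form that matches the forward pass of the cell defined in this section. This is a reconstruction from that code, so the notation is an assumption and may differ from the original figure: W^{(x)} and W^{(h)} are the per-gate blocks of Wi and Wh in the code, b is the bias, \sigma is the sigmoid and \odot is element-wise multiplication.

```latex
\begin{aligned}
i_t &= \sigma\left(W^{(x)}_{i}\, x_t + W^{(h)}_{i}\, h_{t-1} + b_{i}\right) \\
f_t &= \sigma\left(W^{(x)}_{f}\, x_t + W^{(h)}_{f}\, h_{t-1} + b_{f}\right) \\
\tilde{c}_t &= \tanh\left(W^{(x)}_{c}\, x_t + W^{(h)}_{c}\, h_{t-1} + b_{c}\right) \\
o_t &= \sigma\left(W^{(x)}_{o}\, x_t + W^{(h)}_{o}\, h_{t-1} + b_{o}\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```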

Here, the hidden-to-hidden weight matrices (Wh in the code below) are the connections for hidden-to-hidden transitions. We will use the DropConnect technique on these connections. We can also use it with the input weights of the LSTM (Wi in the code below). Since there is no built-in function for DropConnect, I have implemented WeightDroppedLSTM, an LSTM layer with DropConnect masks for the corresponding weights; these masks behave like Variational DropOut masks:

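#################### Weight-Dropped LSTM Cell #######################
import Flux: gate, tanh, σ, Tracker

mutable struct WeightDroppedLSTMCell{A, V, M}
    Wi::A           # input-to-hidden weights
    Wh::A           # hidden-to-hidden weights
    b::V            # bias
    h::V            # initial hidden state
    c::V            # initial cell state
    p::Float64      # DropConnect probability
    maskWi::M       # DropConnect mask for Wi
    maskWh::M       # DropConnect mask for Wh
    active::Bool    # whether DropConnect is applied
end

function WeightDroppedLSTMCell(in::Integer, out::Integer, probability::Float64=0.0;
    init = Flux.glorot_uniform)
    cell = WeightDroppedLSTMCell(
        param(init(out*4, in)),
        param(init(out*4, out)),
        param(init(out*4)),
        param(zeros(Float32, out)),
        param(zeros(Float32, out)),
        probability,
        drop_mask((out*4, in), probability),
        drop_mask((out*4, out), probability),
        true
    )
    cell.b.data[gate(out, 2)] .= 1   # initialise the forget-gate bias to 1
    return cell
end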

function (m::WeightDroppedLSTMCell)((h, c), x)
    b, o = m.b, size(h, 1)
    Wi = m.active ? m.Wi .* m.maskWi : m.Wi
    Wh = m.active ? m.Wh .* m.maskWh : m.Wh
    g = Wi*x .+ Wh*h .+ b
    input = σ.(gate(g, o, 1))
    forget = σ.(gate(g, o, 2))
    cell = tanh.(gate(g, o, 3))
    output = σ.(gate(g, o, 4))
    c = forget .* c .+ input .* cell
    h′ = output .* tanh.(c)
    return (h′, c), h′
end

Flux.@treelike WeightDroppedLSTMCell

# Weight-Dropped LSTM [stateful]
function WeightDroppedLSTM(a...; kw...)
    cell = WeightDroppedLSTMCell(a...;kw...)
    hidden = (cell.h, cell.c)
    return Flux.Recur(cell, hidden, hidden)
end

Similar to Variational DropOut in the above section, we need to reset the masks, so, as explained above, I have overloaded reset_masks! for that:

function reset_masks!(wd::T) where T <: Flux.Recur{<:WeightDroppedLSTMCell}
    wd.cell.maskWi = drop_mask(wd.cell.Wi, wd.cell.p)
    wd.cell.maskWh = drop_mask(wd.cell.Wh, wd.cell.p)
    return nothing
end

Averaged SGD Weight-Dropped LSTM (AWD-LSTM) layer

In the architecture, we use the AWD-LSTM (ASGD Weight-Dropped LSTM) layer. This means that, after a trigger (or threshold) iteration T, the weights are averaged over the subsequent iterations. To implement that, we need an accumulator variable that accumulates the sum of the weights over all iterations passed since the averaging point.
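
The averaging idea described above can be sketched on plain vectors. This is only an illustration under simplifying assumptions, not the layer-specific step used in this project: RunningAverage and average_step! are hypothetical names, and the "weights" are stand-in vectors rather than tracked parameters.

```julia
# Hypothetical accumulator for averaging parameters after a trigger iteration T.
mutable struct RunningAverage
    T::Int                    # trigger iteration
    total::Vector{Float64}    # sum of the parameter vectors seen since T
    count::Int                # number of vectors accumulated
end

RunningAverage(T::Integer, n::Integer) = RunningAverage(T, zeros(n), 0)

# Before the trigger, return the parameters unchanged. From the trigger on,
# add them to the accumulator and return the average accumulated so far.
function average_step!(avg::RunningAverage, iter::Integer, w::AbstractVector)
    iter < avg.T && return copy(w)
    avg.total .+= w
    avg.count += 1
    return avg.total ./ avg.count
end

avg = RunningAverage(3, 2)
iterates = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [5.0, 5.0]]   # stand-ins for SGD iterates
averaged = [average_step!(avg, i, w) for (i, w) in enumerate(iterates)]
```

With the trigger at iteration 3, the first two results are the raw iterates, the third equals the third iterate, and the fourth is the mean of the third and fourth iterates.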


To implement AWD-LSTM, a struct AWD_LSTM is made, which is basically a wrapper around the WeightDroppedLSTM layer discussed above, with the trigger iteration and an averaging accumulator as additional fields:


set_trigger!(t, m) = nothing
asgd_step!(i, l) = nothing

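mutable struct AWD_LSTM
    layer::Flux.Recur   # wrapped WeightDroppedLSTM
    T::Integer          # trigger iteration (-1 means not set)
    accum               # accumulator for ASGD
end

AWD_LSTM(in::Integer, out::Integer, probability::Float64=0.0; kw...) = AWD_LSTM(WeightDroppedLSTM(in, out, probability; kw...), -1, [])

Flux.@treelike AWD_LSTM

(m::AWD_LSTM)(in) = m.layer(in)

# Sets the trigger iteration of an AWD_LSTM layer
set_trigger!(trigger_point::Integer, m::AWD_LSTM) = (m.T = trigger_point)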

To set the trigger iteration of an AWD_LSTM, the set_trigger! function can be used. For the ASGD step, I have prepared an additional function named asgd_step!, which is responsible for averaging the weights after the trigger iteration.

# Averaged Stochastic Gradient Descent Step
function asgd_step!(iter::Integer, layer::AWD_LSTM)
    # Average only once a trigger has been set (T > 0) and reached
    if layer.T > 0 && iter >= layer.T
        p = get_trainable_params([layer])   # helper not shown in this part
        avg_fact = 1/max(iter - layer.T + 1, 1)
        if avg_fact != 1
            layer.accum = layer.accum .+ Tracker.data.(p)
            for (ps, accum) in zip(p, layer.accum)
                Tracker.data(ps) .= avg_fact*accum
            end
        else
            layer.accum = deepcopy(Tracker.data.(p))   # Accumulator for ASGD
        end
    end
    return nothing
end

Here, iter is the iteration number, T is the trigger point, and accum accumulates the sum after the trigger has occurred. Below are the Variational DropOuts used in the Language Model:

Word-Embedding DropOut:

This dropout is applied after the words are converted to their corresponding embedding vectors. To apply it, we use VarDrop with probability wordDropProb and keep the same mask for one forward and backward pass.

Layer-to-Layer DropOut and Final LSTM Layer DropOut:

This dropout is applied to the output between the LSTM layers with probability LayerDropProb. The output of the final LSTM layer is also dropped, with probability FinalDropProb. All these dropouts use the VarDrop struct.

Embedding Dropout and Weight Tying:

Embedding Dropout is performed before any of the above dropouts and works at the word level, where the whole embedding vector of a specific word is dropped. This means all occurrences of that word disappear for that pass. Here is the implementation of the DroppedEmbeddings struct, which contains the embedding matrix along with a dropout mask for it.

Weight tying shares the weights between the Embedding layer and the SoftMax layer, which substantially reduces the total parameter count of the model. It prevents the model from having to learn a one-to-one correspondence between input and output, resulting in substantial improvements over the standard LSTM language model.


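################# Variational Dropped Embeddings ####################
mutable struct DroppedEmbeddings{A, F}
    emb::A          # embedding matrix (vocabulary size × embedding size)
    p::F            # dropout probability
    mask            # word-level dropout mask
    active::Bool    # whether dropout is applied
end

function DroppedEmbeddings(in::Integer, embed_size::Integer, probability::Float64=0.0;
    init = Flux.glorot_uniform)
    de = DroppedEmbeddings{AbstractArray, typeof(probability)}(
        param(init(in, embed_size)),
        probability,
        drop_mask((in,), probability),
        true
    )
    return de
end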

function (de::DroppedEmbeddings)(in::AbstractArray, tying::Bool=false)
    dropped = de.active ? de.emb .* de.mask : de.emb
    return tying ? dropped * in : transpose(dropped[in, :])
end

Flux.@treelike DroppedEmbeddings

The functor takes a tying argument, which tells it whether the pass is for the Embedding layer or for the SoftMax layer. Since this uses Variational DropOut for the embedding matrix, I have overloaded the reset_masks! function to reset the mask when needed:

function reset_masks!(de::DroppedEmbeddings)
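    # Regenerate a word-level mask (one entry per word), same shape as the current mask
    de.mask = drop_mask(de.mask, de.p)
    return nothing
end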
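
The two passes of the embedding layer (lookup and tied output projection) can be sketched with plain arrays. This is a minimal illustration that assumes a word-level mask with inverted scaling; the sizes, indices and variable names are hypothetical and no Flux types are involved.

```julia
using Random

rng = MersenneTwister(2)
vocab, embed_size = 5, 3
emb = randn(rng, Float32, vocab, embed_size)   # one row per word

# Word-level mask: one entry per vocabulary word, so a whole row is kept or dropped.
p = 0.4
word_mask = Float32.(rand(rng, Float32, vocab) .> p) ./ Float32(1 - p)
dropped = emb .* word_mask                      # the mask is broadcast across each row

# Embedding pass (tying = false): look up rows by word index, one column per word.
idxs = [2, 5, 2]
embedded = transpose(dropped[idxs, :])          # embed_size × length(idxs)

# Tied output pass (tying = true): reuse the same matrix to score every word.
hidden = randn(rng, Float32, embed_size, length(idxs))
scores = dropped * hidden                       # vocab × length(idxs)
```

Because the same matrix serves both passes, no separate output projection has to be stored, which is where the reduction in parameter count from weight tying comes from.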

Let's go to the actual implementation

This completes Part 1 of Language Model Pre-training for ULMFiT. In Part 2, we will combine these regularization techniques and layers to build the Language Model and construct the training procedure for Language Model pre-training.