# JSOC'19 : Practical implementation of ULMFiT for Text Classification [1]

**Julia** is a solution to the "two-language problem", developed at MIT. It is a high-performance (like C) and dynamically typed (like Python) language, well suited to scientific and numerical computing. I am glad to contribute to the development of the "Language of the Future".

This project is being done as a part of **Julia Season of Contributions (JSOC) 2019** and falls under the Natural Language Processing domain. The objective of this project is to build an efficient and practical model for Text Classification, which is the task of assigning predefined tags or categories to a given piece of text.

I found ULMFiT (Universal Language Model Fine-Tuning) to be a suitable model for the task. The model was proposed in 2018 by Jeremy Howard and Sebastian Ruder, and in the paper it outperformed the then state of the art on six text classification tasks. It leverages the concept of **Transfer Learning**, which is the improvement of learning a new task by using the knowledge gained from learning a related task. Specifically, the model uses *inductive transfer learning*, which means extracting general rules from the observed training cases that can then be applied to various test cases, unlike *transductive transfer learning*, which solves a specific problem by extracting rules from specific training cases.

The implementation of the model can be divided into three major steps:

- Pre-training a Language Model
- Fine-Tuning of Language Model
- Fine-Tuning of Text Classifier

[All the steps are covered in successive blogs]

## Language Model Pre-training [Part 1]

In this step, a general-domain Language Model is pre-trained. We use a large corpus so that the model can learn the general properties of the language (in this case, English); here we use the WikiText-103 corpus for pre-training. This step needs to be performed only once. Afterwards, the trained Language Model can be reused to improve performance and to help the whole classifier model converge early on downstream tasks. Pre-training the Language Model also helps the classifier learn task-specific properties efficiently even from small datasets, which is what we refer to as the Transfer Learning concept.

In this part, we will see the techniques and layers used to build and train the language model [refer to this paper for all the techniques used]. In the second part of Language Model Pre-training, all of these techniques will be combined to train the model.

First, we import `Flux`, the machine learning framework in Julia, along with some other important libraries:

```julia
import Pkg
Pkg.add("Flux")
Pkg.add("WordTokenizers")
Pkg.add("InternedStrings")
Pkg.add("BSON")

using Flux
import Flux: crossentropy, chunk
using WordTokenizers   # For accessories
using InternedStrings  # For using interned strings
```

If **CUDA** is installed, **CuArrays.jl** is also needed to get GPU support for training:

```julia
Pkg.add("CuArrays")
using CuArrays
```

### Regularization Techniques

The paper discusses several techniques for training the LSTM Language Model efficiently. Below I have listed the ones used in this implementation.

**Variational DropOut:**

Unlike the standard DropOut technique, where each time step in the LSTM gets a new dropout mask, Variational DropOut uses a common mask for all the time steps of one forward and backward pass (one mini-batch); the masks for different mini-batches still differ from one another. We will use Variational DropOut for all dropout operations except the hidden-to-hidden transitions.
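
The difference between the two schemes can be seen in a minimal plain-Julia sketch. It does not use Flux, and the names `make_mask`, `steps` and `shared_mask` are made up for this illustration; they are not part of the implementation in this post.

```julia
using Random

# Inverted-dropout mask: each entry is 0 with probability p, otherwise 1/(1 - p).
# `make_mask` is a hypothetical helper written only for this sketch.
make_mask(rng, dims, p) = Float32.(rand(rng, Float32, dims...) .> p) ./ Float32(1 - p)

rng = MersenneTwister(0)
p = 0.5
steps = [ones(Float32, 4, 2) for _ in 1:3]   # 3 time steps, 4 features, batch of 2

# Standard dropout: a fresh mask is sampled at every time step.
standard = [x .* make_mask(rng, size(x), p) for x in steps]

# Variational dropout: sample once per mini-batch, reuse at every time step.
shared_mask = make_mask(rng, size(steps[1]), p)
variational = [x .* shared_mask for x in steps]

# With the shared mask, every time step drops exactly the same positions.
same_pattern = all(v -> (v .== 0) == (variational[1] .== 0), variational)
```

For the next mini-batch, `shared_mask` would be sampled again, which is the role the `reset` flag plays in the implementation.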

**DropConnect:**

For hidden-to-hidden transitions we will be using DropConnect instead of DropOut. Unlike DropOut where the output of a neuron is dropped randomly, here the connections (weights) to activation units are dropped randomly. That means, the activation unit won't get the input from all of its input connections.
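
As a rough illustration of the distinction, the sketch below applies a mask to the output of a toy linear layer (DropOut) and to its weight matrix (DropConnect). The weights, sizes and variable names are hypothetical, and the masks are rescaled by `1/(1 - p)` on the assumption that DropConnect uses the same inverted-dropout scaling as the mask code in this post.

```julia
using Random

rng = MersenneTwister(1)
p = 0.5f0
W = ones(Float32, 3, 4)      # hypothetical weights: 3 units, 4 inputs
x = ones(Float32, 4)

# DropOut: zero (and rescale) entries of the layer output.
out_mask = Float32.(rand(rng, Float32, 3) .> p) ./ (1 - p)
y_dropout = (W * x) .* out_mask

# DropConnect: zero (and rescale) individual weights before the product,
# so each unit sees only a random subset of its inputs.
w_mask = Float32.(rand(rng, Float32, size(W)...) .> p) ./ (1 - p)
y_dropconnect = (W .* w_mask) * x
```

With DropOut a unit is either fully kept or fully removed, while with DropConnect each unit still produces an output from whichever of its connections survived.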

In ULMFiT, DropConnect follows the concept of Variational DropOut, i.e., the same mask is used for a whole mini-batch. Below is the implementation of Variational DropOut and DropConnect:

**Implementation:**

For DropOut and DropConnect operations, a common function, `drop_mask`, is used for mask generation:

```julia
# Generates Mask
import Flux: _dropout_kernel, rand!

# Mask with the same shape (and array type) as `x`
function drop_mask(x, p)
    y = similar(x, size(x))
    rand!(y)
    y .= _dropout_kernel.(y, p, 1 - p)
    return y
end

# Mask built from a shape tuple
drop_mask(shape::Tuple, p; type = Float32) = (mask = rand(type, shape...); mask .= _dropout_kernel.(mask, p, 1 - p))
```
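
To see what such a mask contains, here is a self-contained sketch that avoids Flux's internals. The `kernel` function is a stand-in written for this example, assumed to mirror how `_dropout_kernel` is called above (keep an entry when the random draw exceeds `p`, then rescale by `1/(1 - p)`); `sketch_mask` is likewise a hypothetical name.

```julia
using Random, Statistics

# Stand-in kernel (assumption): keep with probability 1 - p and rescale by 1/q.
kernel(y::T, p, q) where {T} = y > p ? T(1 / q) : T(0)

function sketch_mask(rng, dims::Tuple, p; type = Float32)
    mask = rand(rng, type, dims...)      # uniform draws in [0, 1)
    mask .= kernel.(mask, p, 1 - p)      # turn each draw into 0 or 1/(1 - p)
    return mask
end

mask = sketch_mask(MersenneTwister(42), (1000, 100), 0.4)
kept_fraction = count(!iszero, mask) / length(mask)   # close to 1 - p
mask_mean = mean(mask)                                # close to 1
```

Because the kept entries are scaled up, the mask has a mean close to 1, so multiplying by it leaves the expected magnitude of the input roughly unchanged.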

For **Variational DropOut**, I have created a `struct` named `VarDrop`:

```julia
########################## Variational DropOut ######################
mutable struct VarDrop{F}
    p::F
    mask
    reset::Bool
    active::Bool
end

VarDrop(probability::Float64=0.0) = VarDrop(probability, Array{Float32, 2}(undef, 0, 0), true, true)

function (vd::VarDrop)(inp)
    vd.active || return inp
    # Draw a new mask only when a reset was requested
    if vd.reset
        vd.mask = drop_mask(inp, vd.p)
        vd.reset = false
    end
    return inp .* vd.mask
end
```

Since these are Variational DropOuts, or **Locked DropOut**s, the masks need to be remembered until a full mini-batch has passed through all the layers. To address that, the above `struct` remembers its mask until the `reset` field is explicitly set to `true`. To do that, I have created a common generic function for all such `struct`s, namely `reset_masks!`, which works in place: it modifies the given `struct` instead of returning a new one:

```julia
reset_masks!(vd::VarDrop) = (vd.reset = true)
```

### Custom Layers

**Weight-Dropped LSTM:**

Dropout is often applied to the activations at time step *t* to reduce over-fitting. Here, we instead use DropConnect on the hidden-to-hidden connections (weights) of the LSTM.
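
Written out to match the gate order used by the cell code in this section, one step of the weight-dropped LSTM looks roughly as follows. The symbols are chosen here for illustration: $M_i$ and $M_h$ denote the DropConnect masks on the input-to-hidden weights $W_i$ and hidden-to-hidden weights $W_h$, $\odot$ is element-wise multiplication, and $g_t^{(k)}$ is the $k$-th of the four equally sized blocks of $g_t$.

$$
\begin{aligned}
g_t &= (M_i \odot W_i)\,x_t + (M_h \odot W_h)\,h_{t-1} + b \\
i_t &= \sigma\big(g_t^{(1)}\big), \quad f_t = \sigma\big(g_t^{(2)}\big), \quad \tilde{c}_t = \tanh\big(g_t^{(3)}\big), \quad o_t = \sigma\big(g_t^{(4)}\big) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

When the layer is not active, the masks are simply omitted and the step reduces to a standard LSTM.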

Here, `WeightDroppedLSTM` is an LSTM layer with DropConnect masks for the corresponding weights; like Variational DropOut masks, these masks stay fixed until they are reset:

```julia
#################### Weight-Dropped LSTM Cell #######################
import Flux: gate, tanh, σ, Tracker

mutable struct WeightDroppedLSTMCell{A, V, M}
    Wi::A
    Wh::A
    b::V
    h::V
    c::V
    p::Float64
    maskWi::M
    maskWh::M
    active::Bool
end

function WeightDroppedLSTMCell(in::Integer, out::Integer, probability::Float64=0.0;
        init = Flux.glorot_uniform)
    cell = WeightDroppedLSTMCell(
        param(init(out*4, in)),
        param(init(out*4, out)),
        param(init(out*4)),
        param(zeros(Float32, out)),
        param(zeros(Float32, out)),
        probability,
        drop_mask((out*4, in), probability),
        drop_mask((out*4, out), probability),
        true
    )
    cell.b.data[gate(out, 2)] .= 1  # initialise the forget-gate bias to 1
    return cell
end

function (m::WeightDroppedLSTMCell)((h, c), x)
    b, o = m.b, size(h, 1)
    # Apply the DropConnect masks only while the layer is active
    Wi = m.active ? m.Wi .* m.maskWi : m.Wi
    Wh = m.active ? m.Wh .* m.maskWh : m.Wh
    g = Wi*x .+ Wh*h .+ b
    input = σ.(gate(g, o, 1))
    forget = σ.(gate(g, o, 2))
    cell = tanh.(gate(g, o, 3))
    output = σ.(gate(g, o, 4))
    c = forget .* c .+ input .* cell
    h′ = output .* tanh.(c)
    return (h′, c), h′
end

Flux.@treelike WeightDroppedLSTMCell

# Weight-Dropped LSTM [stateful]
function WeightDroppedLSTM(a...; kw...)
    cell = WeightDroppedLSTMCell(a...; kw...)
    hidden = (cell.h, cell.c)
    return Flux.Recur(cell, hidden, hidden)
end
```

As with Variational DropOut in the section above, the masks need to be reset, so I have overloaded `reset_masks!` for this layer as well:

```julia
function reset_masks!(wd::T) where T <: Flux.Recur{<:WeightDroppedLSTMCell}
    wd.cell.maskWi = drop_mask(wd.cell.Wi, wd.cell.p)
    wd.cell.maskWh = drop_mask(wd.cell.Wh, wd.cell.p)
    return
end
```

**Averaged SGD Weight-Dropped LSTM (AWD-LSTM) layer:**

In the architecture, we use the AWD-LSTM (ASGD Weight-Dropped LSTM) layer. This means that after a trigger (or threshold) iteration `T`, the weights are averaged over the subsequent iterations. To implement this, we need an accumulator variable that accumulates the sum of the weights over all iterations after the averaging point. I have constructed a `struct`, `AWD_LSTM`, which is simply a wrapper around the `WeightDroppedLSTM` layer [discussed above], with the trigger and the accumulator as additional fields.

**Implementation:**

The `AWD_LSTM` `struct` below wraps the `WeightDroppedLSTM` layer and adds the trigger and averaging accumulator fields:

```julia
# AWD_LSTM
set_trigger!(t, m) = nothing   # generic fallback: does nothing
asgd_step!(i, l) = nothing     # generic fallback: does nothing

mutable struct AWD_LSTM
    layer::Flux.Recur
    T::Integer
    accum
end

AWD_LSTM(in::Integer, out::Integer, probability::Float64=0.0; kw...) =
    AWD_LSTM(WeightDroppedLSTM(in, out, probability; kw...), -1, [])

Flux.@treelike AWD_LSTM

(m::AWD_LSTM)(in) = m.layer(in)
```

To set the trigger iteration of an `AWD_LSTM`, the `set_trigger!` function can be used. For the ASGD step, I have prepared an additional function named `asgd_step!`, which is responsible for averaging the weights after the trigger iteration:

```julia
# Averaged Stochastic Gradient Descent Step
function asgd_step!(iter::Integer, layer::AWD_LSTM)
    if iter >= layer.T
        p = get_trainable_params([layer])  # helper not shown in this part
        avg_fact = 1/max(iter - layer.T + 1, 1)
        if avg_fact != 1
            layer.accum = layer.accum .+ Tracker.data.(p)
            for (ps, accum) in zip(p, layer.accum)
                Tracker.data(ps) .= avg_fact*accum
            end
        else
            layer.accum = deepcopy(Tracker.data.(p))  # Accumulator for ASGD
        end
    end
    return
end
```
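
The averaging itself can be traced on plain vectors. The sketch below is a simplified, textbook-style running average with hypothetical toy values, reusing the names `iter`, `T` and `accum`. Unlike `asgd_step!` above, it does not write the average back into the parameters; it only returns it.

```julia
# Running average of the parameters from the trigger iteration T onwards.
# `history[iter]` holds the (toy) parameter vector after iteration `iter`.
function averaged_params(history::Vector{Vector{Float64}}, T::Integer)
    accum = zeros(length(history[1]))
    averaged = copy(history[1])
    for (iter, p) in enumerate(history)
        if iter < T
            averaged = copy(p)               # before the trigger: plain weights
            continue
        end
        accum .+= p                          # sum of the weights since the trigger
        averaged = accum ./ (iter - T + 1)   # mean over iterations T..iter
    end
    return averaged
end

history = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [5.0, 50.0]]
avg = averaged_params(history, 2)   # mean of the weights from iterations 2, 3 and 4
```

The divisor `iter - T + 1` plays the same role as `avg_fact` in `asgd_step!`: it counts how many iterations have been accumulated since the trigger.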

In `asgd_step!`, `iter` is the iteration number, `T` is the trigger point and `accum` accumulates the sum of the weights after the trigger has occurred.

Below are the **Variational DropOuts** used in the Language Model:

**Word-Embedding DropOut:**

This dropout is applied after the words have been converted to their corresponding embedding vectors. To apply it, we use `VarDrop` with probability `wordDropProb` and keep the same mask for one forward and backward pass.

**Layer-to-Layer DropOut and Final LSTM Layer DropOut:**

This dropout is applied to the output passed between the LSTM layers, with probability `LayerDropProb`. The output of the final LSTM layer is also dropped, with probability `FinalDropProb`. All of these dropouts use the `VarDrop` `struct`.

**Embedding Dropout and Weight Tying:**

**Embedding Dropout** is performed before any of the above dropouts and works at the *word level*: the whole embedding vector of a specific word is dropped. This means all occurrences of that word disappear for that pass. The `DroppedEmbeddings` `struct` implemented below contains the embedding matrix together with a dropout mask for it.
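
A small plain-Julia sketch of word-level dropout, with a made-up vocabulary size, embedding size and token list: the mask has one entry per word, and broadcasting applies that entry to the word's whole row of the embedding matrix.

```julia
using Random

rng = MersenneTwister(3)
vocab, embed_size, p = 6, 4, 0.5f0
emb = ones(Float32, vocab, embed_size)       # hypothetical embedding matrix, one row per word

# One mask entry per word; broadcasting applies it to the whole row.
word_mask = Float32.(rand(rng, Float32, vocab) .> p) ./ (1 - p)
dropped = emb .* word_mask

tokens = [2, 5, 2]                            # word indices of a toy sentence
vectors = transpose(dropped[tokens, :])       # embed_size × length(tokens)
```

Every row of `dropped` is either entirely zero or entirely rescaled, so both occurrences of word `2` in `tokens` are kept or dropped together.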

**Weight tying** shares the weights between the Embedding layer and the *SoftMax* layer, which substantially reduces the total parameter count of the model. It also prevents the model from having to learn a one-to-one correspondence between input and output, which results in substantial improvements over the standard LSTM language model.
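
The idea can be sketched with a single matrix used on both sides of a toy model. Everything here (`E`, `embed`, `decode`, `colsoftmax`, the sizes) is hypothetical and only illustrates the sharing. It assumes, for simplicity, that the hidden size fed to the output layer equals the embedding size.

```julia
vocab, embed_size = 5, 3
# Hypothetical shared matrix, one row per word.
E = reshape(Float32.(1:vocab*embed_size), vocab, embed_size) ./ 10

embed(tokens) = transpose(E[tokens, :])   # input side: look up rows, embed_size × n
decode(h) = E * h                          # output side: same matrix gives vocab scores

# Column-wise softmax written out for this sketch.
colsoftmax(z) = (e = exp.(z .- maximum(z, dims = 1)); e ./ sum(e, dims = 1))

h = embed([1, 4])                          # stand-in for the output of the LSTM stack
scores = decode(h)                         # vocab × 2, one column of scores per token
probs = colsoftmax(scores)                 # each column sums to 1
```

Only `E` would need to be trained here; an untied model would hold a second `vocab × embed_size` matrix for the output layer.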

**Implementation:**

```julia
################# Variational Dropped Embeddings ####################
mutable struct DroppedEmbeddings{A, F}
    emb::A
    p::F
    mask
    active::Bool
end

function DroppedEmbeddings(in::Integer, embed_size::Integer, probability::Float64=0.0;
        init = Flux.glorot_uniform)
    de = DroppedEmbeddings{AbstractArray, typeof(probability)}(
        param(init(in, embed_size)),
        probability,
        drop_mask((in,), probability),  # one mask entry per word
        true
    )
    return de
end

function (de::DroppedEmbeddings)(in::AbstractArray, tying::Bool=false)
    dropped = de.active ? de.emb .* de.mask : de.emb
    # tying = true: output (SoftMax) side; otherwise: embedding lookup
    return tying ? dropped * in : transpose(dropped[in, :])
end

Flux.@treelike DroppedEmbeddings
```

The functor of this `struct` takes a `tying` argument, which tells it whether the pass is for the Embedding layer or for the SoftMax layer. Since the embedding matrix uses Variational DropOut, I have overloaded the `reset_masks!` function to reset the mask when needed:

```julia
function reset_masks!(de::DroppedEmbeddings)
    # Regenerate from the existing mask so it keeps one entry per word
    de.mask = drop_mask(de.mask, de.p)
    return
end
```

### Let's go to the actual implementation

This completes Part 1 of Language Model Pre-training for ULMFiT. In Part 2, we will combine these regularization techniques and layers to build the Language Model, and construct the training procedure for Language Model pre-training.