# JSOC'19: Practical implementation of ULMFiT in Julia [2]

In the previous blog post, I explained the techniques and layers we need to build the Language Model for ULMFiT. This post is the continuation of that one; here we will combine all the techniques and layers to build the Language Model and then pretrain it.

## Language Model Pre-training [Part 2]

In the previous part, we discussed all the layers that will be used to build the Language Model. The Language Model itself is nothing but a `mutable struct` with fields `vocab` and `layers`, where `vocab` is the vocabulary of the language (e.g. English) used throughout the model, and `layers` is an instance of the `Chain` type of `Flux` which contains all the layers needed to train the model, including the DropOut layers:

**Implementation:**

```julia
# ULMFiT Language Model
mutable struct LanguageModel
    vocab :: Vector
    layers :: Flux.Chain
end

function LanguageModel(embedding_size::Integer=400, hidLSTMSize::Integer=1150, outLSTMSize::Integer=embedding_size;
        embedDropProb::Float64=0.05, wordDropProb::Float64=0.4, hidDropProb::Float64=0.5,
        LayerDropProb::Float64=0.3, FinalDropProb::Float64=0.3)
    vocab = intern.(string.(readdlm("vocab.csv", ',', header=false)[:, 1]))
    de = DroppedEmbeddings(length(vocab), embedding_size, 0.1;
        init = (dims...) -> init_weights(0.1, dims...))
    lm = LanguageModel(
        vocab,
        Chain(
            de,
            VarDrop(wordDropProb),
            AWD_LSTM(embedding_size, hidLSTMSize, hidDropProb; init = (dims...) -> init_weights(1/hidLSTMSize, dims...)),
            VarDrop(LayerDropProb),
            AWD_LSTM(hidLSTMSize, hidLSTMSize, hidDropProb; init = (dims...) -> init_weights(1/hidLSTMSize, dims...)),
            VarDrop(LayerDropProb),
            AWD_LSTM(hidLSTMSize, outLSTMSize, hidDropProb; init = (dims...) -> init_weights(1/hidLSTMSize, dims...)),
            VarDrop(FinalDropProb),
            x -> de(x, true),
            softmax
        )
    )
    return lm
end

Flux.@treelike LanguageModel
```

By default, the language model is a three-layer AWD-LSTM model with 1150 units in each hidden layer and an embedding layer of size 400. AWD-LSTM is a regular LSTM (without attention, short-cut connections or any other sophisticated additions) with various tuned dropout hyper-parameters.

The `init_weights` function is used to initialize the LSTM and embedding layer weights. The vocabulary file provided by the authors can be downloaded from here. The vocabulary should contain special tokens for UNKNOWN (`<unk>`) and PADDING (`<pad>`). More special tokens can be added for a better language model. (Refer to this for understanding the meaning of the above tokens.)

### Hyper-Parameters

These are the parameters which the algorithm does not learn by itself; they must be tuned manually. They include:

- DropOut probabilities
- DropConnect probabilities
- Embedding layer size
- LSTM layer sizes
- Learning rates
- Gradient clip bound values (if used)
- `batchsize` and `bptt`
- Number of epochs
- Parameters of the optimizer used for updating weights (if any)
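As a concrete reference, the default values used in this implementation can be collected in one place. This is purely an illustrative sketch (the NamedTuple and its field names are mine, not part of the model code); the values come from the `LanguageModel` constructor and the `generator` defaults shown in this post:

```julia
# Illustrative collection of the default hyper-parameters used in this post.
hparams = (
    embedding_size = 400,    # embedding layer size
    hid_lstm_size  = 1150,   # hidden units in each LSTM layer
    embed_drop     = 0.05,   # embedding dropout probability
    word_drop      = 0.4,    # dropout on word embeddings
    hid_drop       = 0.5,    # DropConnect on hidden-to-hidden weights
    layer_drop     = 0.3,    # dropout between LSTM layers
    final_drop     = 0.3,    # dropout after the last LSTM layer
    batchsize      = 64,     # independent sequences per mini-batch
    bptt           = 70,     # truncated back-propagation length
)
```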

### Preparing Mini-batches - `batchsize` and `bptt`

There is often confusion between `batchsize` and `bptt`, so I have included this section. To train such a model we need a big dataset or corpus, and for training we need to divide it into mini-batches. The structure of a mini-batch depends upon these two hyper-parameters: `batchsize` is the number of independent sequences going in parallel through the model, and `bptt` is the length of one sequence.

To get a mini-batch, we first pre-process the data so that it becomes one long sequence of tokens. Then we divide this long sequence (or corpus) into equal sequences, such that the number of sequences produced equals the `batchsize` value. Finally, we divide each of these sequences into small sequences of length `bptt`.
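To make these two steps concrete, here is a toy sketch in plain Julia with a tiny 12-token corpus, `batchsize = 2` and `bptt = 3` (the index arithmetic stands in for the `chunk` helper that Flux provides):

```julia
# Step 1: split the corpus into batchsize equal sequences.
# Step 2: cut each sequence into pieces of length bptt.
corpus = ["the", "quick", "brown", "fox", "jumps", "over",
          "the", "lazy", "dog", "again", "and", "again"]
batchsize, bptt = 2, 3

seqlen = length(corpus) ÷ batchsize
seqs = [corpus[(k-1)*seqlen+1 : k*seqlen] for k in 1:batchsize]
# seqs[1] == ["the", "quick", "brown", "fox", "jumps", "over"]

pieces = [[s[(j-1)*bptt+1 : j*bptt] for j in 1:(seqlen ÷ bptt)] for s in seqs]
# pieces[1][1] == ["the", "quick", "brown"]
```

Each inner piece is one time-slice of a mini-batch: the model sees `batchsize` of them in parallel, `bptt` tokens at a time.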

After loading the corpus (WikiText-103 in this case) and pre-processing it, mini-batches are produced and passed through the model, with the weights updated on each pass. To produce a mini-batch whenever needed, I have written a `generator` function that gives a mini-batch at every call, using the concept of tasks (or co-routines). [Refer to this doc]

```julia
# Padding multiple sequences w.r.t. the longest sequence
function pre_pad_sequences(sequences::Vector, pad::String="_pad_")
    max_len = maximum([length(x) for x in sequences])
    return [[fill(pad, max_len-length(sequence)); sequence] for sequence in sequences]
end

function post_pad_sequences(sequences::Vector, pad::String="_pad_")
    max_len = maximum([length(x) for x in sequences])
    return [[sequence; fill(pad, max_len-length(sequence))] for sequence in sequences]
end

# Generator: after the first call (which gives the number of batches),
# it should be called twice per iteration, since it gives X in the
# first call and y in the second call
function generator(c::Channel, corpus; batchsize::Integer=64, bptt::Integer=70)
    X_total = post_pad_sequences(chunk(corpus, batchsize))
    n_batches = Int(floor(length(X_total[1])/bptt))
    put!(c, n_batches)
    for i=1:n_batches
        start = bptt*(i-1) + 1
        batch = [Flux.batch(X_total[k][j] for k=1:batchsize) for j=start:start+bptt]
        put!(c, batch[1:end-1])
        put!(c, batch[2:end])
    end
end
```
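For instance, the post-padding helper pads every sequence to the length of the longest one (the function is reproduced here so the sketch is self-contained and runnable on its own):

```julia
# Append the pad token until every sequence has the length of the longest one.
function post_pad_sequences(sequences::Vector, pad::String="_pad_")
    max_len = maximum([length(x) for x in sequences])
    return [[sequence; fill(pad, max_len-length(sequence))] for sequence in sequences]
end

seqs = [["i", "like", "julia"], ["hello"]]
padded = post_pad_sequences(seqs)
# padded[2] == ["hello", "_pad_", "_pad_"]
```

`pre_pad_sequences` works the same way, except the pad tokens go in front of the sequence instead of after it.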

Here, `Y` (the target variable) is nothing but the next word in the sequence with respect to `X` (the input variable), which is how the `batch` variable in `generator` is constructed. Notice that the very first call to the `generator` outputs the number of mini-batches, which will be helpful for training. The `corpus` argument to the `generator` function is a long `Vector` of words. It can be used like:

```julia
corpus = read(open(corpuspath, "r"), String)
corpus = intern.(tokenize(corpus))
gen = Channel(x -> generator(x, corpus))
```

Other than the regularization functions discussed in the previous blog, we need some more functions to train the model.

### Forward Propagation

**LSTMs, DropOuts and DropConnect Layers:**

For each forward pass, the batch passes through the regularization steps and RNN layers. This step is completed in the `forward` function. It takes a `LanguageModel` instance and a mini-batch as input and, after processing, returns a probability distribution over the words in the vocabulary (a `Vector` of length equal to `vocab`). The main purpose of this function is to avoid occupying memory in the main training loop:

```julia
# Forward pass through the dropout and LSTM layers
function forward(lm, batch)
    batch = map(x -> indices(x, lm.vocab, "_unk_"), batch)
    batch = lm.layers.(batch)
    return batch
end
```
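The `indices` step above maps each word of the batch to its position in the vocabulary, falling back to the unknown token for out-of-vocabulary words. The real `indices` lives in the ULMFiT code; a minimal stand-in (the name `indices_sketch` is mine) could look like:

```julia
# Map words to vocabulary positions, using the unknown token as fallback.
function indices_sketch(words::Vector{String}, vocab::Vector{String}, unk::String)
    unk_idx = findfirst(==(unk), vocab)
    return [something(findfirst(==(w), vocab), unk_idx) for w in words]
end

vocab = ["_unk_", "the", "cat", "sat"]
indices_sketch(["the", "cat", "flew"], vocab, "_unk_")  # [2, 3, 1]
```

A practical implementation would use a `Dict` lookup instead of `findfirst` to avoid a linear scan per word over a large vocabulary.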

**Objective function or loss function:**

This is the function where the forward pass actually ends. It calculates the *cross-entropy loss* of the predicted output of the `forward` function:

```julia
# Loss function - loss calculation with AR and TAR regularization
function loss(lm, gen)
    H = forward(lm, take!(gen))
    Y = broadcast(x -> gpu(Flux.onehotbatch(x, lm.vocab, "_unk_")), take!(gen))
    l = sum(crossentropy.(H, Y))
    Flux.truncate!(lm.layers)
    return l
end
```

Here, I am using the `truncate!` function to truncate the gradient calculation. [Refer to these docs to know more about it]
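As a quick numeric illustration of the cross-entropy part, independent of Flux: for a single time step, the loss is the negative log-probability the model assigned to the true next word (the one-hot target zeroes out every other term):

```julia
# Cross-entropy for one prediction over a toy 3-word vocabulary.
p = [0.7, 0.2, 0.1]          # predicted probability distribution (softmax output)
y = [1.0, 0.0, 0.0]          # one-hot target: the true next word is word 1
ce = -sum(y .* log.(p))      # reduces to -log(0.7)
```

The `loss` function above sums this quantity over all time steps of the mini-batch.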

### Back-propagation

As in other machine learning frameworks, back-propagation in `Flux` is easy compared to forward propagation. After calculating the loss, we need to compute the gradients of that loss and then update the weights according to the chosen optimizer. I have constructed a function for back-propagation:

```julia
# Backward - calculating gradients and updating weights
function backward!(layers, l, opt, gradient_clip::Float64)
    p = get_trainable_params(layers)
    grads = Tracker.gradient(() -> l, p)
    Tracker.update!(opt, p, grads)
    return
end
```

Check out Backpropagation in Flux for a better understanding. Also, if gradient clipping needs to be applied, the `backward!` function is the best place to do it.
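As a sketch of what such clipping could look like, here is plain element-wise value clipping (the helper name is illustrative, not part of Flux's API; it would be applied to each gradient array before `Tracker.update!`):

```julia
# Clip each gradient element to the interval [-bound, bound].
clip_gradient(g::AbstractArray, bound::Float64) = clamp.(g, -bound, bound)

g = [0.5, -3.0, 1.2]
clip_gradient(g, 1.0)   # [0.5, -1.0, 1.0]
```

Clipping by global norm (rescaling the whole gradient vector when its norm exceeds the bound) is a common alternative that preserves the gradient's direction.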