JSOC'19: Practical implementation of ULMFiT in Julia [2]
In the previous blog post, I explained the techniques and layers we need to build the language model for ULMFiT. This post is a continuation of that one: here we will combine all those techniques and layers to build the language model and then pre-train it.
Language Model Pre-training [Part 2]
In the previous part, we discussed all the layers that will be used to build the Language Model. The Language Model itself is nothing but a mutable struct with fields vocab and layers, where vocab is the vocabulary of the language (e.g. English) which will be used throughout the model, and layers is an instance of Flux's Chain type which contains all the layers needed to train the model, including the DropOut layers:
Implementation:
# ULMFiT Language Model
mutable struct LanguageModel
    vocab :: Vector
    layers :: Flux.Chain
end

function LanguageModel(embedding_size::Integer=400, hidLSTMSize::Integer=1150, outLSTMSize::Integer=embedding_size;
                       embedDropProb::Float64 = 0.05, wordDropProb::Float64 = 0.4, hidDropProb::Float64 = 0.5,
                       LayerDropProb::Float64 = 0.3, FinalDropProb::Float64 = 0.3)
    vocab = intern.(string.(readdlm("vocab.csv", ',', header=false)[:, 1]))
    de = DroppedEmbeddings(length(vocab), embedding_size, 0.1;
                           init = (dims...) -> init_weights(0.1, dims...))
    lm = LanguageModel(
        vocab,
        Chain(
            de,
            VarDrop(wordDropProb),
            AWD_LSTM(embedding_size, hidLSTMSize, hidDropProb; init = (dims...) -> init_weights(1/hidLSTMSize, dims...)),
            VarDrop(LayerDropProb),
            AWD_LSTM(hidLSTMSize, hidLSTMSize, hidDropProb; init = (dims...) -> init_weights(1/hidLSTMSize, dims...)),
            VarDrop(LayerDropProb),
            AWD_LSTM(hidLSTMSize, outLSTMSize, hidDropProb; init = (dims...) -> init_weights(1/hidLSTMSize, dims...)),
            VarDrop(FinalDropProb),
            x -> de(x, true),
            softmax
        )
    )
    return lm
end
By default, the language model will be a three-layer AWD-LSTM model with 1150 units in each hidden layer and an embedding layer of size 400. AWD-LSTM is a regular LSTM (without attention, short-cut connections, or other sophisticated additions) with various tuned dropout hyper-parameters.
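For instance, assuming a vocab.csv file is present in the working directory, the model can be constructed with the default sizes or with custom ones (a usage sketch):

# construct with the default sizes (400-d embeddings, 1150 hidden units)
lm = LanguageModel()

# or override the sizes and dropout probabilities
lm_small = LanguageModel(300, 600; wordDropProb=0.3, hidDropProb=0.4)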
The init_weights function is used to initialize the LSTM and embedding layer weights. The vocabulary file provided by the authors can be downloaded from here. The vocabulary should contain special tokens for UNKNOWN ("_unk_") and PADDING ("_pad_"). More special tokens can be added for a better language model. (Refer to this for the meaning of these tokens.)
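The definition of init_weights is not shown in this post; a minimal sketch consistent with how it is called above (the first argument bounds a uniform initialization) might look like this, though the actual implementation may differ:

# hypothetical sketch: uniform weights in [-extreme, extreme]
init_weights(extreme::AbstractFloat, dims...) = (rand(Float64, dims...) .- 0.5) .* 2 .* extreme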
Hyper-Parameters
These are the parameters which the algorithm is not going to learn by itself; they have to be tuned manually (an illustrative setting is sketched after the list). They include:
- DropOut probabilities
- DropConnect probabilities
- Embedding layer size
- LSTM layer sizes
- Gradient clip bound values (if used)
- Parameters of the optimizer used for updating weights (if any)
- Learning rates
- batchsize and bptt
- Number of epochs
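For concreteness, one possible configuration (the values are assumptions for illustration, not the tuned settings used for pre-training; batchsize and bptt match the generator defaults shown later):

# illustrative hyper-parameter choices (assumed values)
batchsize     = 64                # independent sequences per mini-batch
bptt          = 70                # time-steps per mini-batch
epochs        = 1
gradient_clip = 0.25              # bound for gradient clipping, if used
opt           = Flux.ADAM(0.001)  # optimizer with its learning rate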
Preparing Mini-batch - batchsize and bptt
There is often confusion between batchsize and bptt, so I have included this section. To train such a model we need a big dataset or corpus, and for training we need to divide it into mini-batches. The structure of a mini-batch depends on these two hyper-parameters, batchsize and bptt. Here, batchsize is the number of independent sequences going in parallel through the model, and bptt (back-propagation through time) is the length of one sequence.
To get a mini-batch, we first pre-process the data so that it becomes one long sequence of tokens. Then we divide this long sequence (or corpus) into equal-length sequences, such that the number of sequences produced is equal to the batchsize value. Finally, we divide each of these sequences into smaller sequences of length bptt.
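As a toy illustration (not part of the actual pipeline), here is how a tiny corpus of 12 tokens would be split with batchsize = 2 and bptt = 3, using Flux's chunk and batch utilities:

using Flux: chunk, batch

corpus = ["the", "quick", "brown", "fox", "jumps", "over",
          "the", "lazy", "dog", "and", "runs", "away"]
batchsize, bptt = 2, 3

# split the corpus into `batchsize` independent sequences of 6 tokens each
seqs = chunk(corpus, batchsize)
# seqs[1] == ["the", "quick", "brown", "fox", "jumps", "over"]
# seqs[2] == ["the", "lazy", "dog", "and", "runs", "away"]

# one mini-batch: bptt time-steps, each holding one token from every sequence
first_batch = [batch([seqs[k][j] for k in 1:batchsize]) for j in 1:bptt]
# first_batch[1] == ["the", "the"], first_batch[2] == ["quick", "lazy"], ...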
After loading the corpus, WikiText-103 in this case, and pre-processing it, mini-batches will be produced and then passed through the model, with the weights updated on each pass. To produce a mini-batch whenever needed, I have made a generator function which gives a mini-batch at every call; for this I have used the concept of tasks (or co-routines). [Refer to this doc]
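A Channel bound to a task lets a function produce values lazily, one per take! call. A minimal toy example of this pattern (unrelated to the model) looks like:

# toy producer: puts the squares of 1..5 into the channel one at a time
function squares(c::Channel)
    for i in 1:5
        put!(c, i^2)
    end
end

chnl = Channel(squares)
take!(chnl)   # 1
take!(chnl)   # 4  -- each take! resumes the producer up to its next put!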
# Padding multiple sequences w.r.t. the longest sequence
function pre_pad_sequences(sequences::Vector, pad::String="_pad_")
    max_len = maximum([length(x) for x in sequences])
    return [[fill(pad, max_len-length(sequence)); sequence] for sequence in sequences]
end

function post_pad_sequences(sequences::Vector, pad::String="_pad_")
    max_len = maximum([length(x) for x in sequences])
    return [[sequence; fill(pad, max_len-length(sequence))] for sequence in sequences]
end

# Generator: for every mini-batch it should be called twice, since it gives X in the first call and Y in the second
function generator(c::Channel, corpus; batchsize::Integer=64, bptt::Integer=70)
    X_total = post_pad_sequences(chunk(corpus, batchsize))
    n_batches = Int(floor(length(X_total[1])/bptt))
    put!(c, n_batches)
    for i=1:n_batches
        start = bptt*(i-1) + 1
        batch = [Flux.batch(X_total[k][j] for k=1:batchsize) for j=start:start+bptt]
        put!(c, batch[1:end-1])
        put!(c, batch[2:end])
    end
end
Here, Y (the target variable) is nothing but the next word in the sequence with respect to X (the input variable), and the batch variable in generator is built accordingly. Notice that the very first call to the generator returns the number of mini-batches, which will be helpful for training. The corpus argument to the generator function is a long Vector of words. It can be used like:
corpus = read(open(corpuspath, "r"), String)
corpus = intern.(tokenize(corpus))
gen = Channel(x -> generator(x, corpus))
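The training loop can then pull from the channel; following the put! order in generator above, the calls pair up like this:

num_of_batches = take!(gen)   # first call returns the number of mini-batches
X = take!(gen)                # input mini-batch: a Vector of bptt word-batches
Y = take!(gen)                # target mini-batch, shifted by one time-step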
Other than the regularization functions discussed in the previous blog, we need some more functions to train the model.
Forward Propagation
LSTMs, DropOuts and DropConnect Layers:
For each forward pass, the batch passes through the regularization steps and the RNN layers. This step is completed in the forward function. It takes a LanguageModel instance and a mini-batch as input and, after processing, returns a Vector of length equal to vocab, which is a probability distribution over the words in the vocabulary. The main purpose of this function is to avoid occupying memory in the main training loop:
# Forward
function forward(lm, batch)
    batch = map(x -> indices(x, lm.vocab, "_unk_"), batch)
    batch = lm.layers.(batch)
    return batch
end
Objective function or loss function:
This is the function where the forward pass actually ends. It calculates the cross-entropy loss on the output predicted by the forward function:
# Loss function - loss calculation with AR and TAR regularization
function loss(lm, gen)
    H = forward(lm, take!(gen))
    Y = broadcast(x -> gpu(Flux.onehotbatch(x, lm.vocab, "_unk_")), take!(gen))
    l = sum(crossentropy.(H, Y))
    Flux.truncate!(lm.layers)
    return l
end
Here, I am using the truncate! function to truncate the gradient computation. [Refer to these docs to know more about it]
Back-propagation
As in other machine learning frameworks, back-propagation in Flux is easier than forward propagation. After calculating the loss, we need to compute the gradients of that loss and then update the weights according to the chosen optimizer. I have constructed a function for back-propagation:
# Backward pass - calculating gradients and updating weights
function backward!(layers, l, opt, gradient_clip::Float64)
    p = get_trainable_params(layers)
    grads = Tracker.gradient(() -> l, p)
    Tracker.update!(opt, p, grads)
    return
end
Check out Back-propagation in Flux for a better understanding. Also, if gradient clipping needs to be applied, the backward! function is the best place to do it.
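For example, clipping can be done element-wise on the gradients before the update. This is only a sketch under the assumption that grads[param] returns a plain array that can be modified in place; it is not the exact implementation used in the package:

# hypothetical variant of backward! with element-wise gradient clipping
function backward!(layers, l, opt, gradient_clip::Float64)
    p = get_trainable_params(layers)
    grads = Tracker.gradient(() -> l, p)
    for param in p
        # clamp every gradient entry to [-gradient_clip, gradient_clip]
        grads[param] .= clamp.(grads[param], -gradient_clip, gradient_clip)
    end
    Tracker.update!(opt, p, grads)
    return
end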