JSOC'19: Practical implementation of ULMFiT in Julia [3]
In the previous blog post, we discussed pre-training of the Language Model for ULMFiT. In this post, we move to the next step of building the ULMFiT model, i.e. fine-tuning the Language Model on target data. This is where the concept of transfer learning comes into the picture. Since the Language Model is pre-trained on a general-domain corpus, it already understands the basic structure and properties of the language (English in this case), so we can use this knowledge to train the model effectively for our downstream task, which here is Sentiment Analysis. For that we need some target data; I have chosen the IMDB movie reviews dataset, a standard dataset for Sentiment Analysis. The training is similar to what we did while pre-training the Language Model; the only changes are that the data now comes from the target dataset (IMDB in this case) and that some of the novel techniques introduced by the authors in the ULMFiT paper, namely discriminative fine-tuning and slanted triangular learning rates, will be used.
Fine-tuning Language Model
We will be doing the same training as in the previous step, but with some novel techniques introduced by the authors. These techniques are discussed below:
Discriminative Fine-Tuning
Different layers capture different types of information, so they should be fine-tuned to different extents. To address this, the authors introduced the discriminative fine-tuning method, in which different learning rates are used for different layers.
Implementation:
A function has been implemented which replaces the backward! function used in the Language Model pre-training step:
# Gradient Clipping
grad_clipping(g, upper_bound) = min(g, upper_bound)

# Discriminative Fine-Tuning Step
function discriminative_step!(layers, ηL::Float64, l, gradient_clip::Float64)
    # Applying gradient clipping
    l = Tracker.hook(x -> grad_clipping(x, gradient_clip), l)

    # Gradient calculation
    grads = Tracker.gradient(() -> l, get_trainable_params(layers))

    # Discriminative step: the learning rate grows by a factor of 2.6 per layer,
    # so the lowest layer gets the smallest learning rate and the last layer gets ηL
    ηl = ηL/(2.6^length(layers))
    for layer in layers
        ηl *= 2.6
        for p in get_trainable_params(layer)
            Tracker.update!(p, -ηl*grads[p])
        end
    end
    return
end
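As a rough usage sketch (the layer grouping and the field names lm.embedding_layer, lm.lstm1, ..., lm.decoder below are assumptions for illustration, not the actual fields of the model struct), a call could look like this, where loss is the tracked loss of the current mini-batch:

# Hypothetical layer groups, ordered from lowest (embedding) to highest (decoder),
# so that the last group receives the full learning rate ηL
layer_groups = [lm.embedding_layer, lm.lstm1, lm.lstm2, lm.lstm3, lm.decoder]

# 0.01 is the learning rate of the last layer, 0.25 is the gradient clipping bound
discriminative_step!(layer_groups, 0.01, loss, 0.25)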
In the paper, the authors suggest first choosing the learning rate of the last layer, i.e. $\eta^L$, by fine-tuning only the last layer, and then setting the learning rate of each lower layer as $\eta^{l-1} = \eta^l / 2.6$.
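For instance, if we settle on $\eta^L = 0.01$ for the last of four layer groups, the 2.6 rule gives the following learning rates going down the network (a quick numeric check, not part of the model code):

# Learning rate of the last (top-most) layer, chosen first
ηL = 0.01

# Learning rates from the top layer downwards: each lower layer is divided by 2.6
lrs = [ηL / 2.6^k for k in 0:3]
# ≈ [0.01, 0.00385, 0.00148, 0.00057]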
Slanted triangular learning rate
In this step, we want the model to quickly converge to a suitable region of the parameter space and then refine its parameters there. This can't be done with a constant learning rate throughout training. To address this, the authors introduced slanted triangular learning rates, which increase linearly over the initial training iterations and then decrease linearly until the end of training. The learning rate changes according to the following equations:

$$cut = \lfloor T \cdot cut\_frac \rfloor$$

$$p = \begin{cases} t/cut, & \text{if } t < cut \\ 1 - \dfrac{t - cut}{cut \cdot (1/cut\_frac - 1)}, & \text{otherwise} \end{cases}$$

$$\eta_t = \eta_{max} \cdot \frac{1 + p \cdot (ratio - 1)}{ratio}$$

Here, $T$ is the total number of training iterations, $cut\_frac$ is the fraction of iterations for which the learning rate is increased, $cut$ is the iteration at which the schedule switches from increasing to decreasing, $p$ is the fraction of the current phase that has been completed, $ratio$ specifies how much smaller the lowest learning rate is than the maximum learning rate $\eta_{max}$, and $\eta_t$ is the learning rate at iteration $t$.
Implementation:
# Slanted triangular learning rate step
cut = num_of_iters * epochs * stlr_cut_frac
t = num_of_iters * (epoch-1) + i
p_frac = (t < cut) ? t/cut : (1 - ((t-cut)/(cut*(1/stlr_cut_frac-1))))
ηL = 0.01*((1+p_frac*(stlr_ratio-1))/stlr_ratio)
Here, epochs is the total number of epochs, epoch is the current epoch, i is the current iteration within the epoch and num_of_iters is the number of iterations per epoch. This block of code should sit inside the iteration loop, before the discriminative_step! function is used. It introduces some hyper-parameters in addition to those discussed in Part 2 on Language Model pre-training (a small numeric illustration of the schedule follows the list below):
- STLR cut fraction, stlr_cut_frac
- STLR ratio, stlr_ratio
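To see the triangular shape of the schedule in numbers, the same formula can be evaluated on its own. The values below are a toy illustration only, not from the actual training run: 2 epochs of 100 iterations each, with stlr_cut_frac = 0.1 and stlr_ratio = 32 as suggested in the paper:

num_of_iters, epochs = 100, 2        # iterations per epoch and total epochs (toy values)
stlr_cut_frac, stlr_ratio = 0.1, 32  # cut fraction and ratio (defaults suggested in the paper)

cut = num_of_iters * epochs * stlr_cut_frac   # = 20, iteration at which the peak is reached

function stlr(t)
    p_frac = (t < cut) ? t/cut : (1 - ((t-cut)/(cut*(1/stlr_cut_frac-1))))
    return 0.01*((1 + p_frac*(stlr_ratio-1))/stlr_ratio)
end

stlr(1)     # ≈ 0.0008, warm-up has just started
stlr(20)    # = 0.01, peak learning rate at the cut
stlr(200)   # ≈ 0.0003, decayed back down by the end of training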
Other than these methods, everything else is the same as in the pre-training step: all the regularization techniques, the ASGD step and the weight-tying used in that step are used here as well.
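Putting the pieces of this post together, a minimal sketch of the fine-tuning loop might look like the following. The forward pass and loss computation (loss_of), train_batches and layer_groups are placeholders assumed for illustration; the actual implementation in the repository also interleaves the regularization, ASGD and weight-tying steps mentioned above:

for epoch in 1:epochs
    for (i, batch) in enumerate(train_batches)
        # Slanted triangular learning rate for the current iteration
        cut = num_of_iters * epochs * stlr_cut_frac
        t = num_of_iters * (epoch-1) + i
        p_frac = (t < cut) ? t/cut : (1 - ((t-cut)/(cut*(1/stlr_cut_frac-1))))
        ηL = 0.01*((1 + p_frac*(stlr_ratio-1))/stlr_ratio)

        # Tracked loss of the Language Model on the target-domain (IMDB) mini-batch
        l = loss_of(lm, batch)    # placeholder for the forward pass and cross-entropy loss

        # Discriminative fine-tuning step with the current ηL
        discriminative_step!(layer_groups, ηL, l, gradient_clip)
    end
end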
Conclusion
This completes the fine-tuning step of the ULMFiT model following the paper; for the full implementation, check out this repository. In the next blog post, we will discuss the last step of the ULMFiT model, i.e. training the classifier for a downstream task such as sentiment classification.