Yash Patel / Aug 18 2019

JSOC'19: Practical implementation of ULMFiT in Julia [3]

The previous blog post covered pre-training of the Language Model for ULMFiT. This post covers the next step in building the ULMFiT model: fine-tuning the Language Model on target data. This is where transfer learning comes into the picture. Because the Language Model is pre-trained on a general-domain corpus, it already understands the basic structure and properties of the language (English in this case), and we can use this knowledge to train the model effectively for our downstream task, which here is Sentiment Analysis. For that we need some target data; I have chosen the IMDB movie reviews dataset, a standard dataset for Sentiment Analysis. Training proceeds much as it did when pre-training the Language Model. The only changes are that the data now comes from the target dataset (the IMDB dataset in this case) and that two novel techniques introduced by the authors of the ULMFiT paper are used: discriminative fine-tuning and slanted triangular learning rates.

Fine-tuning Language Model

We will run the same training as in the previous step, but with some novel techniques proposed by the authors. These techniques are discussed below:

Discriminative Fine-Tuning

Every layer captures a different type of information, so each should be fine-tuned to a different extent. To address this, the authors introduced discriminative fine-tuning, in which different layers are trained with different learning rates.
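For reference, the ULMFiT paper writes this as a per-layer variant of the usual SGD update. The notation below follows the paper rather than the code in this post: θ^l denotes the parameters of layer l, η^l its learning rate, and J the loss.

```latex
\theta_t^{l} = \theta_{t-1}^{l} - \eta^{l} \cdot \nabla_{\theta^{l}} J(\theta)
```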


A function has been implemented that replaces the backward! function from the Language Model pre-training step:

# Gradient Clipping: cap the incoming gradient at upper_bound
grad_clipping(g, upper_bound) = min(g, upper_bound)

# Discriminative Fine-Tuning Step
function discriminative_step!(layers, ηL::Float64, l, gradient_clip::Float64)
    # Applying gradient clipping
    l = Tracker.hook(x -> grad_clipping(x, gradient_clip), l)

    # Gradient calculation
    grads = Tracker.gradient(() -> l, get_trainable_params(layers))

    # discriminative step: the last layer is updated with ηL and
    # each lower layer with a rate 2.6 times smaller than the layer above it
    ηl = ηL/(2.6^length(layers))
    for layer in layers
        ηl *= 2.6
        for p in get_trainable_params(layer)
            Tracker.update!(p, -ηl*grads[p])
        end
    end
    return
end

In the paper, the authors suggest first choosing the learning rate η^L of the last layer by fine-tuning only the last layer, and then using η^(l-1) = η^l / 2.6 as the learning rate for the lower layers.
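To make the rule concrete, here is a minimal sketch that lists the learning rate each layer would receive. The helper layer_learning_rates and the example values (a last-layer rate of 0.01 and 4 layers) are assumptions made for illustration and are not part of the original implementation.

```julia
# Hypothetical helper (not part of the original implementation):
# returns the learning rate of every layer, ordered from the lowest
# layer to the last one, so that each layer's rate is `factor` times
# smaller than the rate of the layer above it.
function layer_learning_rates(ηL::Float64, num_layers::Int; factor::Float64 = 2.6)
    return [ηL / factor^(num_layers - l) for l in 1:num_layers]
end

# Example values chosen only for illustration
rates = layer_learning_rates(0.01, 4)
for (l, η) in enumerate(rates)
    println("layer ", l, ": η = ", η)
end
```

The last entry is the chosen last-layer rate itself, and every step down the stack divides the rate by 2.6, which is the same sequence the loop in discriminative_step! walks through.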

Slanted triangular learning rate

In this step, we want the model to converge quickly to a suitable region of the parameter space and then refine its parameters there. This can't be done with a constant learning rate throughout training. To address this, the authors introduced slanted triangular learning rates (STLR), which first increase the learning rate linearly during the initial training iterations and then decrease it linearly until the end of training. The learning rate changes according to the equations below:

$$cut = \lfloor T \cdot cut\_frac \rfloor$$
$$p = \begin{cases} t/cut & \text{if } t < cut \\ 1 - \dfrac{t - cut}{cut \cdot (1/cut\_frac - 1)} & \text{otherwise} \end{cases}$$
$$\eta_t = \eta_{max} \cdot \dfrac{1 + p \cdot (ratio - 1)}{ratio}$$

Here, T is the number of training iterations, cut_frac is the fraction of iterations during which we increase the learning rate, cut is the iteration at which we switch from increasing to decreasing the learning rate, p is the fraction of the number of iterations we have increased or will decrease the learning rate respectively, ratio specifies how much smaller the lowest learning rate is than the maximum learning rate η_max, and η_t is the learning rate at iteration t. Plotted against the number of iterations, the STLR forms a slanted triangle: a linear rise up to iteration cut, followed by a linear decay until the end of training.


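# Slanted triangular learning rate step
cut = num_of_iters * epochs * stlr_cut_frac    # iteration at which the learning rate peaks
t = num_of_iters * (epoch-1) + i               # current iteration, counted across epochs
p_frac = (t < cut) ? t/cut : (1 - ((t-cut)/(cut*(1/stlr_cut_frac-1))))
ηL = 0.01*((1+p_frac*(stlr_ratio-1))/stlr_ratio)    # 0.01 is the maximum learning rate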

Here, epochs is the total number of epochs, epoch is the current epoch and num_of_iters is the number of iterations per epoch. This block of code should be placed inside the iteration loop, before the call to the discriminative_step! function. It introduces two hyper-parameters in addition to those discussed in Part 2 on Language Model pre-training:

  1. STLR cut fraction, stlr_cut_frac
  2. STLR ratio, stlr_ratio
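To see how these two hyper-parameters shape the schedule, the STLR equations can be wrapped in a stand-alone function and evaluated over a whole run. This is only a sketch: the function stlr and its argument names are hypothetical, the defaults (cut_frac = 0.1, ratio = 32, η_max = 0.01) are the values the ULMFiT paper reports using, and the run length of 1000 iterations is picked purely for illustration.

```julia
# Hypothetical stand-alone version of the schedule (not part of the
# original implementation); argument names follow the STLR equations.
function stlr(t::Integer, T::Integer; cut_frac::Float64 = 0.1,
              ratio::Float64 = 32.0, η_max::Float64 = 0.01)
    cut = floor(T * cut_frac)    # iteration at which the rate peaks
    # fraction of the way up the rising edge, or down the falling edge
    p = t < cut ? t / cut : 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return η_max * (1 + p * (ratio - 1)) / ratio
end

# Evaluate the schedule for an assumed run of T = 1000 iterations
T = 1000
schedule = [stlr(t, T) for t in 0:T]
println("start: ", schedule[1])
println("peak:  ", maximum(schedule), " at iteration ", argmax(schedule) - 1)
println("end:   ", schedule[end])
```

A larger stlr_cut_frac moves the peak later in training, while a larger stlr_ratio lowers the learning rate at the start and end of training relative to the peak.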

Other than these methods, everything else is the same as in the pre-training step: all the regularization techniques, the ASGD step and the weight-tying used in that step are used here as well.


This completes the fine-tuning step of the ULMFiT model as described in the paper. For the full implementation, check out this repository. In the next blog post, we will discuss the last step of the ULMFiT model, i.e. training the classifier for a downstream task such as sentiment classification.