# JSOC'19: Practical implementation of ULMFiT in Julia \[3\]

![julia.png][nextjournal#file#5c09851e-4fbd-4fb1-afbe-1152084be27b]

In the [previous blog post](https://nextjournal.com/ComputerMaestro/jsoc19-practical-implementation-of-ulmfit-in-julia-2), we completed the discussion of pre-training the language model for ULMFiT. In this post, we move to the next step of building the ULMFiT model: fine-tuning the language model on the target data. This is where the concept of transfer learning comes into the picture. Since the language model is pre-trained on a general-domain corpus, it already understands the basic structure and properties of the language (English in this case), and we can reuse that knowledge to train the model effectively for our downstream task, which here is *Sentiment Analysis*. For that we need some target data; I have chosen the IMDB movie reviews dataset, a standard dataset for *Sentiment Analysis*. The training proceeds much like the pre-training of the language model; the only changes are that the data now comes from the target dataset (IMDB in this case) and that two of the novel techniques introduced by the authors in the ULMFiT paper, *discriminative fine-tuning* and *slanted triangular learning rates*, are used.

# Fine-tuning Language Model

We will run the same training as in the previous step, but with some novel techniques proposed by the authors. These techniques are discussed below.

## Discriminative Fine-Tuning

Different layers capture different types of information, so they should be fine-tuned to different extents. To address this, the authors introduced the *discriminative fine-tuning* method, in which different layers are updated with different learning rates.

**Implementation:** The function below replaces the `backward!` function used in the language-model pre-training step:

```julia id=26f42ed8-4332-49d0-aef8-43f68f8383de
# Gradient clipping
grad_clipping(g, upper_bound) = min(g, upper_bound)

# Discriminative fine-tuning step
function discriminative_step!(layers, ηL::Float64, l, gradient_clip::Float64)
    # Apply gradient clipping to the loss gradient
    l = Tracker.hook(x -> grad_clipping(x, gradient_clip), l)

    # Gradient calculation
    grads = Tracker.gradient(() -> l, get_trainable_params(layers))

    # Discriminative step: the last layer is updated with ηL,
    # each lower layer with a learning rate 2.6 times smaller than the layer above it
    ηl = ηL/(2.6^length(layers))
    for layer in layers
        ηl *= 2.6
        for p in get_trainable_params(layer)
            Tracker.update!(p, -ηl*grads[p])
        end
    end
    return
end
```

In the paper, the authors suggest first choosing the learning rate of the last layer, $η^L$, by fine-tuning only the last layer, and then using $η^{l-1} = η^l/2.6$ as the learning rate for each lower layer.
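To make this decay rule concrete, here is a small, purely illustrative snippet; the four-layer stack and the choice $η^L = 0.01$ are assumptions made for the example, not values taken from the paper or the repository. It mirrors the progression that `discriminative_step!` computes internally:

```julia
# Illustrative only: learning rates implied by η^(l-1) = η^l / 2.6 for an assumed
# 4-layer stack (e.g. embedding + three LSTM layers) and a chosen ηL = 0.01
ηL = 0.01
n_layers = 4

# Learning rate for each layer, ordered from the lowest layer to the last layer
ηs = [ηL / 2.6^(n_layers - l) for l in 1:n_layers]
# ηs ≈ [0.00057, 0.00148, 0.00385, 0.01]
```

The lowest layers, which carry the most general language knowledge, are perturbed the least, while the last layer, closest to the task, is updated the most aggressively.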
## Slanted Triangular Learning Rates

During fine-tuning we want the model to converge quickly to a suitable region of the parameter space and then refine its parameters within that region. This cannot be achieved with a constant learning rate throughout training. To address this, the authors introduced *slanted triangular learning rates* (STLR), a schedule that increases the learning rate linearly during the initial training iterations and then decreases it linearly until the end of training.

The learning rate changes according to the following equations:

$$ cut = \lfloor T \cdot cut\_frac \rfloor $$

$$ p = \begin{cases} t/cut, & \text{if } t < cut \\[2ex] 1 - \dfrac{t - cut}{cut \cdot (1/cut\_frac - 1)}, & \text{otherwise} \end{cases} $$

$$ η_t = η_{max} \cdot \dfrac{1 + p \cdot (ratio - 1)}{ratio} $$

Here, $T$ is the total number of training iterations, $cut\_frac$ is the fraction of iterations during which we increase the learning rate, $cut$ is the iteration at which we switch from increasing to decreasing the learning rate, $p$ is the fraction of the iterations over which we have increased or will decrease the learning rate respectively, $ratio$ specifies how much smaller the lowest learning rate is than the maximum learning rate $η_{max}$, and $η_t$ is the learning rate at iteration $t$. Plotting the STLR schedule against the number of iterations gives a graph like the one shown below:

![STLR.PNG][nextjournal#file#84d9735d-aeed-451f-817e-6f81f5e5d09a]

**Implementation:**

```julia id=e009a208-95d1-447d-87e1-d15caed9cd38
# Slanted triangular learning rate step
cut = num_of_iters * epochs * stlr_cut_frac   # iteration at which the rate switches to decreasing
t = num_of_iters * (epoch-1) + i              # current iteration number
p_frac = (t < cut) ? t/cut : (1 - ((t-cut)/(cut*(1/stlr_cut_frac-1))))
ηL = 0.01*((1+p_frac*(stlr_ratio-1))/stlr_ratio)   # learning rate for the last layer (η_max = 0.01)
```

Here, `epochs` is the total number of epochs, `epoch` is the current epoch, `i` is the current iteration within the epoch, and `num_of_iters` is the number of iterations per epoch. This block should sit inside the iteration loop, before the call to `discriminative_step!`. It introduces two hyper-parameters in addition to those discussed in [Part 2 of Language Model pre-training](https://nextjournal.com/ComputerMaestro/jsoc19-practical-implementation-of-ulmfit-in-julia-2):

1. STLR cut fraction, `stlr_cut_frac`
2. STLR ratio, `stlr_ratio`

Other than these two methods, everything else is the same as in the pre-training step: all the regularization techniques, the ASGD step, and weight-tying used there are used here as well.
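To make the shape of the schedule concrete, here is a small, purely illustrative trace; the epoch count, iteration count, and hyper-parameter values below are assumptions for the sake of the example, not values from the paper or the repository:

```julia
# Illustrative only: trace the STLR schedule over an assumed run of
# 2 epochs × 100 iterations with stlr_cut_frac = 0.1, stlr_ratio = 32, η_max = 0.01
epochs, num_of_iters = 2, 100
stlr_cut_frac, stlr_ratio, η_max = 0.1, 32, 0.01

T   = epochs * num_of_iters
cut = floor(Int, T * stlr_cut_frac)

ηs = map(1:T) do t
    p_frac = t < cut ? t/cut : 1 - (t - cut)/(cut*(1/stlr_cut_frac - 1))
    η_max * (1 + p_frac*(stlr_ratio - 1)) / stlr_ratio
end

# ηs rises linearly up to η_max at iteration `cut`, then decays linearly,
# ending close to η_max/stlr_ratio at the final iteration
```

The short warm-up lets the model move quickly toward a good region of the parameter space, while the long decay phase refines the parameters within that region.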
# Conclusion

This completes the fine-tuning step of the ULMFiT model, following the [paper](https://arxiv.org/pdf/1801.06146.pdf); for the full implementation check out [this repository](https://github.com/ComputerMaestro/ULMFiT). In the [next blog post](https://nextjournal.com/ComputerMaestro/jsoc19-practical-implementation-of-ulmfit-in-julia-4), we will discuss the last step of the ULMFiT model, i.e. training the classifier for a downstream task such as sentiment classification.

# References

1. [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/pdf/1801.06146.pdf)
2. [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/pdf/1708.02182.pdf)

[nextjournal#file#5c09851e-4fbd-4fb1-afbe-1152084be27b]:
[nextjournal#file#84d9735d-aeed-451f-817e-6f81f5e5d09a]: