Yash Patel / Aug 18 2019
Remix of Julia by Nextjournal

JSOC'19: Practical Implementation of ULMFiT in julia [4]

In the previous blog post, the discussion on fine-tuning the Language Model for the downstream task was completed. In this post, I will move to the third and final step of building the ULMFiT model, which is fine-tuning the task-specific classifier. In this case, my downstream task is Sentiment Analysis with the help of the ULMFiT model, but the model can be fine-tuned for several downstream tasks like Question Classification, Topic Classification, Named Entity Recognition, etc.

Fine-Tuning of Text Classifier - Part 1

In this part, we will discuss the techniques which will be used to train the text classifier later. In the next part, the implementation of the text classifier and its training procedure will be discussed. To build the full classifier, we will add two more linear blocks after the RNN layers of the Language Model. One will be a Dense (or fully-connected) layer with Concat Pooling, and the other will be the final output layer before the SoftMax layer, which will output the probability distribution over the classes.

As suggested by the authors, the hidden linear layer will be of size 50, and since this implementation is for Binary Sentiment Analysis, 2 output units will be used to get the probability distribution over the negative and positive polarities of the text. Apart from this, we will use Batch Normalization [refer to this lecture by Dr. Andrew Ng to understand Batch Normalization] and DropOut with RELU activation for the hidden linear layer of the network, and SoftMax activation for the final output layer of the model. The parameters of these two linear layers are learnt from scratch. Also, the first linear layer takes as input the pooled last hidden-layer states; here we will use Concat Pooling, which is a novel technique introduced by the authors of the ULMFiT paper.

Concat Pooling:

In a given text sequence, the important information may occur anywhere in the sequence, but since a document can consist of hundreds of words, information may get lost if we only consider the last hidden state of the model. For this reason, we concatenate the hidden state at the last time step of the document with both the max-pooled and the mean-pooled representation of the hidden states over as many time steps as the GPU can hold. In simple words, in concat pooling we simply max-pool and mean-pool over all the hidden states and then concatenate them with the last hidden state.

H = {h_1, ..., h_T}
h_c = [h_T, maxpool(H), meanpool(H)]

where [] is concatenation.
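As a tiny illustration (with hypothetical 2-dimensional hidden states over T = 3 time steps), the pooled representation is three times the length of a single hidden state:

H = [[1.0, 4.0], [3.0, 2.0], [2.0, 6.0]]              # h_1, h_2, h_3
h_T = H[end]                                          # last hidden state: [2.0, 6.0]
maxpool = maximum(hcat(H...), dims=2)[:]              # element-wise max over time: [3.0, 6.0]
meanpool = (sum(hcat(H...), dims=2) / length(H))[:]   # element-wise mean over time: [2.0, 4.0]
hc = vcat(h_T, maxpool, meanpool)                     # length 3 * 2 = 6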

Implementation:

"""
Concat-Pooled Dense layer
"""

mutable struct PooledDense{F, S, T}
    W::S
    b::T
    σ::F
end

PooledDense(W, b) = PooledDense(W, b, identity)

function PooledDense(hidden_sz::Integer, out::Integer, σ = identity;
             initW = Flux.glorot_uniform, initb = (dims...) -> zeros(Float32, dims...))
return PooledDense(param(initW(out, hidden_sz*3)), param(initb(out)), σ)
end

Flux.@treelike PooledDense

function (a::PooledDense)(in)
    W, b, σ = a.W, a.b, a.σ
    in = cat(in..., dims=3)
    maxpool = maximum(in, dims=3)[:, :, 1]
    meanpool = (sum(in, dims=3)/size(in, 3))[:, :, 1]
    hc = cat(in[:, :, 1], maxpool, meanpool, dims=1)
    σ.(W*hc .+ b)
end

This layer takes the vector of hidden states output by the model at the different time steps. It max-pools and mean-pools all the hidden states across the time steps, and then concatenates the max-pooled and mean-pooled results with the hidden state of the last time step. To initialize this layer, the hidden_sz argument takes the size of a hidden state, which is the output size of the layer preceding this one:

# Initializing Concat Pooled Dense Layer
pd = PooledDense(1150, 50)
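
Putting the pieces together, here is a minimal sketch of the classifier head described above: the concat-pooled hidden block of size 50 with Batch Normalization, DropOut and RELU, followed by the 2-unit output layer and SoftMax. The dropout probability is only illustrative, and the exact chaining in the final implementation may differ.

# sketch of the two linear blocks added after the RNN layers of the Language Model
classifier_head = Chain(
    PooledDense(1150, 50),   # concat-pooled dense layer on top of the last RNN layer
    BatchNorm(50, relu),     # batch normalization with RELU activation
    Dropout(0.4),            # dropout on the hidden linear block (illustrative rate)
    Dense(50, 2),            # output layer: negative / positive
    softmax                  # probability distribution over the two classes
)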

Gradual Freezing:

Overly aggressive fine-tuning might lead to catastrophic forgetting, eliminating the benefit of the pre-training step. For this reason, gradual freezing is proposed: instead of fine-tuning all layers at once, we gradually unfreeze the layers, starting from the last layer, as it contains the least general knowledge. First, we unfreeze the last layer and fine-tune it for one epoch; then we unfreeze the next lower layer and fine-tune all unfrozen layers, repeating this until we fine-tune all layers until convergence in the last iteration.

Implementation:

# Select only the layers (and their optimizers) that have been unfrozen so far:
# in the first epoch only the last layer group, in the second the last two, and so on,
# until all the layer groups in `trainable` are being fine-tuned together.
unfreezed_layers, cur_opts = (epoch < length(trainable)) ?
    (trainable[end-epoch+1:end], opts[end-epoch+1:end]) : (trainable, opts)

# one discriminative fine-tuning step on the currently unfrozen layers
discriminative_step!(unfreezed_layers, ηL, l, gradient_clip, cur_opts)
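
For context, here is a rough sketch of how this selection could sit inside the epoch loop; epochs, data_loader and loss are hypothetical placeholder names, while trainable, opts, ηL, gradient_clip and discriminative_step! are the ones used in the training code of the previous posts.

for epoch in 1:epochs
    for (x, y) in data_loader      # mini-batches of labelled text
        l = loss(x, y)             # forward pass through the classifier
        # unfreeze one more layer group from the top at every epoch
        unfreezed_layers, cur_opts = (epoch < length(trainable)) ?
            (trainable[end-epoch+1:end], opts[end-epoch+1:end]) : (trainable, opts)
        discriminative_step!(unfreezed_layers, ηL, l, gradient_clip, cur_opts)
    end
end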

Along with these two techniques, we will also use discriminative fine-tuning and slanted triangular learning rates, which were discussed in the previous blog post.

Conclusion

This completes Part 1 of the last step of building the ULMFiT model. In the next part, the implementation and training of the text classifier will be discussed, based on the techniques above.

References