JSOC'19: Practical Implementation of ULMFiT in Julia
In the previous blog post, I completed the discussion of some aspects of fine-tuning the classifier. In this post, the structure of the text classifier and its training procedure will be discussed.
Fine-tuning of Text Classifier - Part 2
In this part, the PooledDense layer and gradual freezing, which were discussed in the previous part, will be used to train the fine-tuned language model for the downstream task. First, it is essential to know what a text classifier is. Like LanguageModel, it is a wrapper around the layers of the LanguageModel, with 2 additional Dense layers for classification.
    # ULMFiT - Text Classifier
    mutable struct TextClassifier
        vocab::Vector
        rnn_layers::Flux.Chain
        linear_layers::Flux.Chain
    end

    function TextClassifier(lm::LanguageModel=LanguageModel(), clsfr_out_sz::Integer=1,
                            clsfr_hidden_sz::Integer=50, clsfr_hidden_drop::Float64=0.0)
        return TextClassifier(
            lm.vocab,
            lm.layers[1:8],    # reuse the first 8 layers of the language model
            Chain(
                # hidden size is read from an LSTM layer; index 7 is assumed here
                # (the last of the LSTM layers [3, 5, 7] used in forward)
                gpu(PooledDense(length(lm.layers[7].layer.cell.h), clsfr_hidden_sz, relu)),
                gpu(BatchNorm(clsfr_hidden_sz, relu)),
                Dropout(clsfr_hidden_drop),
                gpu(Dense(clsfr_hidden_sz, clsfr_out_sz)),
                gpu(BatchNorm(clsfr_out_sz)),
                softmax
            )
        )
    end

    # register the struct with Flux (assumes the Tracker-era Flux.@treelike macro)
    Flux.@treelike TextClassifier
Two more linear blocks are added on top of the language model layers. The first is a Dense layer with concat pooling functionality and the second is the output layer. Both layers are followed by batch normalization (BatchNorm, refer to the docs to see usage), with dropout applied between them; the intermediate layer uses ReLU activation and the final output layer uses softmax activation.
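As a reminder of what the concat pooling in the first block computes, here is a minimal sketch in plain Julia. It follows the definition from the ULMFiT paper: the last hidden state is concatenated with the max-pooled and mean-pooled hidden states over all time-steps. The name concat_pool is hypothetical, and this is not the PooledDense code from TextAnalysis.jl, which additionally applies the dense weights and the activation.

```julia
using Statistics

# Concat pooling over a sequence of hidden states (one vector per time-step).
# `concat_pool` is a hypothetical helper used only for illustration.
function concat_pool(hs::Vector{<:AbstractVector{<:Real}})
    H = reduce(hcat, hs)                      # features × time-steps matrix
    last_state = hs[end]                      # hidden state at the last time-step
    max_pooled = vec(maximum(H, dims=2))      # element-wise max over time
    mean_pooled = vec(mean(H, dims=2))        # element-wise mean over time
    return vcat(last_state, max_pooled, mean_pooled)
end

hs = [[1.0, 4.0], [3.0, 2.0], [2.0, 0.0]]     # 3 time-steps, hidden size 2
pooled = concat_pool(hs)                      # length is 3 × hidden size
```

The pooled vector is three times the hidden size, which is why the layer that consumes it needs a correspondingly wider weight matrix.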
Training Text Classifier
When training the language model, the sequences in a batch can start anywhere within a sentence, so a single sentence can even be split across two different mini-batches. For training the text classifier, however, we need the whole sequence in one pass: the label is assigned to the whole sentence, so the whole sentence must be passed through the model to predict the label and calculate the loss.
This creates several problems while training. Since the sequences will not all have the same length, they must be padded or truncated to a common length for vectorized computation. Truncating sequences loses data that might be useful for predicting the label, so padding is preferred. Padding can be done in two ways: at the start of the sequences or at the end. In this case, padding at the start is preferred, because as the time-steps pass the network remembers the features collected from the last time-steps better than the features from the start of the sequence.
If a batch is made randomly out of the data-set, the difference in sequence lengths within it can be very large. This leads to unnecessarily long padded sequences, which increases the computation time of training. To tackle this, a technique called sequence bucketing is commonly used: sequences of similar length are grouped together to form batches, and these batches are then passed through the model after shuffling both between the batches and within the batches. The technique is applied by sorting the sequences by length, forming bigger groups called buckets, and then dividing these buckets into batches. This drastically reduces the amount of padding and the unnecessary computation that comes with it.
Padding at the start of the sequence also fits well with the remedy for the next problem, which is discussed next.
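Padding at the start can be sketched in a few lines of plain Julia. The function name pad_left and the "_pad_" token are assumptions made for this illustration; the actual implementation may use a different token and helper.

```julia
# Left-pad (pre-pad) tokenized sequences so that all have the same length.
# `pad_left` and the "_pad_" token are hypothetical names for this sketch.
function pad_left(seqs::Vector{Vector{String}}, pad::String="_pad_")
    maxlen = maximum(length, seqs)            # length of the longest sequence
    # prepend as many pad tokens as each sequence is short of maxlen
    return [vcat(fill(pad, maxlen - length(s)), s) for s in seqs]
end

batch = [["the", "movie", "was", "great"], ["loved", "it"]]
padded = pad_left(batch)
```

With the padding at the front, the real tokens of every sequence end at the same final time-step, so the last hidden states all come from actual words rather than from padding.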
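The sort, bucket, and batch steps can be sketched as follows. This is a minimal illustration, not the data loader from the actual implementation; bucket_batches, bucket_size, and batch_size are hypothetical names.

```julia
using Random

# Group sequences of similar length into batches (sequence bucketing).
# `bucket_batches`, `bucket_size` and `batch_size` are hypothetical names.
function bucket_batches(seqs::Vector{<:AbstractVector}, bucket_size::Integer,
                        batch_size::Integer; rng::AbstractRNG=Random.default_rng())
    sorted = sort(seqs, by=length)                       # sort by sequence length
    batches = Vector{typeof(sorted)}()
    for bucket in Iterators.partition(sorted, bucket_size)
        shuffled = shuffle(rng, collect(bucket))         # shuffle within the bucket
        for b in Iterators.partition(shuffled, batch_size)
            push!(batches, collect(b))                   # one batch of similar lengths
        end
    end
    return shuffle(rng, batches)                         # shuffle the order of batches
end

# Eight sequences with two clearly different length groups
seqs = [rand(1:100, n) for n in (3, 50, 4, 48, 5, 52, 2, 49)]
batches = bucket_batches(seqs, 4, 2)
```

Here the short sequences (lengths 2 to 5) and the long ones (lengths 48 to 52) never share a batch, so a short sequence is never padded out to length 50.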
Truncated Back-prop Through Time:
When training a model with sequences of variable length, the main problem arises when a sequence becomes very long (e.g. 1000 or 1500 time-steps). This leads to a very large number of intermediate values being stored in memory because of tracking in Flux, which may lead to OutOfMemoryErrors. In such cases it is essential to use truncated back-propagation through time: instead of calculating the gradients for all the time-steps passed, gradients are calculated only for the last few time-steps. The whole sequence is still passed through the network, but tracking (which is what lets Flux calculate gradients) is applied only to the last few time-steps, which can be held in memory simultaneously. For this, the forward function is changed for the TextClassifier:
    # Forward step for classifier
    function forward(tc::TextClassifier, gen; tracked_words::Integer=32)
        # switching off tracking: `classifier` is an untracked version of `tc`
        classifier = mapleaves(Tracker.data, tc)
        X = take!(gen)
        l = length(X)
        # Truncated backprop through time
        for i = 1:cld(l, tracked_words) - 1    # tracking is switched off inside this loop
            # the first pass takes the remainder so that all later passes are full-sized
            last_idx = (i == 1 && l % tracked_words != 0) ? l % tracked_words : tracked_words
            H = broadcast(x -> indices(x, classifier.vocab, "_unk_"), X[1:last_idx])
            H = classifier.rnn_layers.(H)
            X = X[last_idx+1:end]
        end
        # copy the hidden states of the untracked LSTM layers into the tracked model
        for (t_layer, unt_layer) in zip(tc.rnn_layers[[3, 5, 7]], classifier.rnn_layers[[3, 5, 7]])
            t_layer.layer.state = unt_layer.layer.state
        end
        # last part of the sequences in X - tracking is switched on
        # (the first layer is applied to the word indices, the remaining layers follow)
        H = broadcast(x -> tc.rnn_layers[1](indices(x, classifier.vocab, "_unk_")), X)
        H = tc.rnn_layers[2:end].(H)
        H = tc.linear_layers(H)
        return H
    end
The function takes an argument tracked_words, which is the number of words for which tracking will be on. The non-tracked part is used to calculate the hidden state, which is then passed to the last part, which is tracked.
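To make the split concrete, the sketch below reproduces only the index arithmetic of the forward step, with no model involved. It assumes that each untracked pass covers tracked_words time-steps; tbptt_chunks is a hypothetical helper and not part of TextAnalysis.jl.

```julia
# Split time-step indices 1:l into untracked chunks plus a tracked tail.
# `tbptt_chunks` is a hypothetical helper used only to illustrate the arithmetic.
function tbptt_chunks(l::Integer, tracked_words::Integer=32)
    untracked = UnitRange{Int}[]
    start = 1
    for i in 1:cld(l, tracked_words) - 1
        # the first chunk absorbs the remainder so the rest are full-sized
        len = (i == 1 && l % tracked_words != 0) ? l % tracked_words : tracked_words
        push!(untracked, start:start + len - 1)
        start += len
    end
    return untracked, start:l                 # the tail is the tracked part
end

untracked, tracked = tbptt_chunks(100, 32)
# untracked chunks: 1:4, 5:36, 37:68 and tracked tail: 69:100
```

For a sequence of 100 time-steps and tracked_words = 32, the first 68 time-steps only update the hidden state, and gradients are computed through the final 32.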
This concludes this blog post and also the training of the whole ULMFiT model. The model can be trained for several text classification tasks. In the actual implementation, it is used for sentiment analysis in the Julia TextAnalysis.jl library.