Yash Patel / Sep 15 2019

JSOC'19: Practical Implementation of ULMFiT in Julia [5]

In the previous blog post, I discussed some aspects of fine-tuning the classifier. In this blog, the structure of the text classifier and its training procedure will be discussed.

Fine-tuning of Text Classifier - Part 2

In this part, the PooledDense layer and gradual freezing, which were discussed in the previous part, will be used to train the fine-tuned language model for the downstream task. First, it is essential to know what a text classifier is. Like LanguageModel, the TextClassifier is a wrapper around the layers of the LanguageModel, with two additional dense layers [PooledDense and Dense] on top for classification.

Implementation:

# ULMFiT - Text Classifier
mutable struct TextClassifier
    vocab::Vector
    rnn_layers::Flux.Chain
    linear_layers::Flux.Chain
end

function TextClassifier(lm::LanguageModel=LanguageModel(), clsfr_out_sz::Integer=1, clsfr_hidden_sz::Integer=50, clsfr_hidden_drop::Float64=0.0)
    return TextClassifier(
        lm.vocab,
        lm.layers[1:8],
        Chain(
            gpu(PooledDense(length(lm.layers[7].layer.cell.h), clsfr_hidden_sz, relu)),
            gpu(BatchNorm(clsfr_hidden_sz, relu)),
            Dropout(clsfr_hidden_drop),
            gpu(Dense(clsfr_hidden_sz, clsfr_out_sz)),
            gpu(BatchNorm(clsfr_out_sz)),
            softmax
        )
    )
end

Flux.@treelike TextClassifier
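
As a quick usage sketch (the hyper-parameter values here are only illustrative, and in practice the LanguageModel would be the fine-tuned one from the previous posts), a binary classifier can be built like this:

# Usage sketch - values are illustrative, not from the library
lm = LanguageModel()                         # in practice, load the fine-tuned language model
classifier = TextClassifier(lm, 2, 50, 0.4)  # 2 output classes, hidden size 50, dropout 0.4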

Two more linear blocks are added on top of the language model layers. The first is a Dense layer with concat-pooling functionality (PooledDense) and the second is the output layer. Both blocks use Batch Normalization (BatchNorm, refer to the Flux docs for usage) and dropout, with a ReLU activation for the intermediate layer and a softmax activation for the final output layer.
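
To make the concat-pooling idea concrete, here is a minimal sketch of the pooled feature vector; the concat_pool helper below is my own illustration, not the actual PooledDense internals:

# Illustrative concat pooling (not the actual PooledDense code):
# given hidden states H of size (hidden_sz, timesteps), concatenate the
# last hidden state with the max-pooled and mean-pooled states over time.
using Statistics

function concat_pool(H::AbstractMatrix)
    last_h = H[:, end]                   # hidden state at the last time-step
    max_h  = vec(maximum(H, dims=2))     # element-wise max over time
    mean_h = vec(mean(H, dims=2))        # element-wise mean over time
    return vcat(last_h, max_h, mean_h)   # length 3 * hidden_sz
end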

Training Text Classifier

Training a text classifier differs from training the language model. For the language model, the sequences in a batch can start anywhere within a sentence, so a single sentence may even be split across two mini-batches. For the text classifier, however, the whole sequence is needed in one pass: the label is assigned to the whole sentence, so the whole sentence must be passed through the model to predict the label and calculate the loss.

This creates several problems during training. The sequences will not all have the same length, so they must be padded or truncated to a common length for vectorized computation. Truncation discards data that may be important for predicting the label, so padding is preferred. Padding can be done in two ways: at the start of the sequences or at the end. In this case, padding at the start is preferred, because as the time-steps pass, the network remembers features collected from the last time-steps better than features from the start of the sequence.
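
For illustration, a simple front-padding helper could look like the sketch below; pad_front and the "<pad>" token are my own names, not part of the library:

# Illustrative front padding: pad every sequence at the *start* so that all
# sequences in a batch have the same length.
function pad_front(seqs::Vector{Vector{String}}, pad_token::String="<pad>")
    maxlen = maximum(length.(seqs))
    return [vcat(fill(pad_token, maxlen - length(s)), s) for s in seqs]
end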

Sequence Bucketing:

If a batch is formed randomly from the data-set, the difference in sequence lengths can be very large. This leads to unnecessarily long padded sequences, which increases training time. To tackle this, sequence bucketing is commonly used: sequences of similar length are grouped together to form batches, and these batches are then passed through the model after shuffling between batches and within batches as well. The technique works by sorting the sequences by length, forming larger groups called buckets, and then dividing these buckets into batches (a rough sketch follows below). This drastically reduces padding and avoids unnecessary computation.
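
The sketch below illustrates the idea; bucket_batches is an illustrative helper and not the code used in the library:

# Illustrative sequence bucketing: sort by length, slice into buckets,
# cut each bucket into mini-batches, then shuffle within and between batches.
using Random

function bucket_batches(seqs::Vector{Vector{String}}, bucket_size::Integer, batch_size::Integer)
    sorted = sort(seqs, by=length)
    buckets = [sorted[i:min(i+bucket_size-1, end)] for i in 1:bucket_size:length(sorted)]
    batches = Vector{Vector{Vector{String}}}()
    for bucket in buckets
        shuffle!(bucket)                 # shuffle within the bucket
        for i in 1:batch_size:length(bucket)
            push!(batches, bucket[i:min(i+batch_size-1, end)])
        end
    end
    return shuffle!(batches)             # shuffle between batches
end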

Padding at the start of the sequence also suits the remedy for the next problem, which is discussed below.

Truncated Back-prop Through Time:

When training on sequences of variable length, the main problem arises when a sequence becomes very long (e.g. 1000 or 1500 time-steps). Tracking in Flux then stores a very large number of intermediate values in memory, which may lead to OutOfMemoryErrors. In such cases it is essential to use truncated back-propagation through time: instead of calculating gradients over all the time-steps passed, gradients are calculated only for the last few time-steps. The whole sequence is still passed through the network, but tracking (which lets Flux calculate gradients) is applied only to the last few time-steps, which can be accommodated in memory at once. For this, the forward function of the TextClassifier is changed:

# Forward step for classifier
function forward(tc::TextClassifier, gen; tracked_words::Integer=32)
    # switching off tracking
    classifier = mapleaves(Tracker.data, tc)
    X = take!(gen)
    l = length(X)
    # Truncated back-prop through time
    for i=1:ceil(l/tracked_words)-1   # Tracking is switched off inside this loop
        (i == 1 && l%tracked_words != 0) ? (last_idx = l%tracked_words) : (last_idx = tracked_words)
        H = broadcast(x -> indices(x, classifier.vocab, "_unk_"), X[1:last_idx])
        H = classifier.rnn_layers.(H)
        X = X[last_idx+1:end]
    end
    # copy hidden states from the untracked RNN layers to the tracked ones
    for (t_layer, unt_layer) in zip(tc.rnn_layers[[3, 5, 7]], classifier.rnn_layers[[3, 5, 7]])
        t_layer.layer.state = unt_layer.layer.state
    end
    # last part of the sequences in X - Tracking is switched on
    H = broadcast(x -> tc.rnn_layers[1](indices(x, classifier.vocab, "_unk_")), X)
    H = tc.rnn_layers[2:end].(H)
    H = tc.linear_layers(H)
    return H
end

The function takes a keyword argument tracked_words, which is the number of words for which tracking will be on. The untracked part of the pass only computes the hidden states, which are then carried over to the last, tracked part of the sequence.
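
For context, here is a hedged sketch of how this forward pass might slot into a training step; train_step! and its arguments are my own illustration (the actual training loop lives in the library) and it relies on the same Tracker-based Flux version used throughout this post:

# Illustrative training step (not the library's exact loop)
function train_step!(tc::TextClassifier, gen, Y; tracked_words::Integer=32)
    Ŷ = forward(tc, gen; tracked_words=tracked_words)   # softmax probabilities
    loss = Flux.crossentropy(Ŷ, Y)                      # Y holds the one-hot labels
    Tracker.back!(loss)                                 # gradients flow through the tracked part only
    # ...then update the tracked parameters with an optimiser and reset the RNN states
    return loss
end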

Conclusion

This concludes this blog post and also the training of the whole ULMFiT model. The model can be trained for several text classification tasks. In the actual implementation, it is used for sentiment analysis in the Julia TextAnalysis.jl library.
