A Deeper Look at Computational Sarcasm
Benchmarking
Select the model using the MODEL variable:
1 - CNN baseline
2 - ResNet baseline
3 - DweNet
MODEL = 3

Select the dataset using the DATASET variable:
1 - Headlines
2 - Reddit Main
3 - Reddit Pol
DATASET = 3

# Per-dataset settings: data directory, CSV file, text column,
# pickled embedding weights, padded length and results log path.
if DATASET == 1:
    pth = '/.nextjournal/data-named/QmXM1SAUDr39KzBUo4rkFn9VTti8hSGTneAsShAdL6VAcG/'
    file = 'Headlines.csv'
    D_COL = 'headline'
    WEIGHTS = 'Headlines_Glove_Weights.pkl'
    PADDING = 64
    RES_OUT = '/results/Headline_Results'
elif DATASET == 2:
    pth = '/.nextjournal/data-named/QmX5TD1r1Hox3Wo5ZYw5SxeiHfKRZ9qFZH9PNYzHkvdnJv/'
    file = 'Reddit_Main.csv'
    D_COL = 'comment'
    WEIGHTS = 'Headlines_Glove_Weights.pkl'
    PADDING = 128
    RES_OUT = '/results/Main_Results'
elif DATASET == 3:
    pth = '/.nextjournal/data-named/QmQhy5wz8vnWVwg6mDTW2TtRqiGL3MehRYJPmNWiP6hbbt/'
    file = 'Reddit_Pol.csv'
    D_COL = 'comment'
    WEIGHTS = 'Headlines_Glove_Weights.pkl'
    PADDING = 128
    RES_OUT = '/results/Pol_Results'

Model Setup
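Model setup pads every tokenised text to a fixed length with the 'xxpad' token (PADDING is 64 for headlines and 128 for Reddit comments). The following is a minimal standalone sketch of that padding step, independent of fastai. The `pad_token` parameter and the toy inputs are assumptions added for illustration; like the notebook's own rule, the sketch only pads and never truncates.

```python
PAD = 'xxpad'  # padding token named in the notebook's pre-processing comment

def pad_to(tokens, pad_til, pad_token=PAD):
    """Return tokens extended with pad_token up to pad_til items.

    Sequences already at or beyond pad_til are returned unchanged
    (no truncation is performed).
    """
    tokens = list(tokens)
    missing = max(0, pad_til - len(tokens))
    return tokens + [pad_token] * missing

# Toy inputs (illustrative only).
short = pad_to(['oh', 'great'], pad_til=4)
long_ = pad_to(['a', 'b', 'c', 'd', 'e'], pad_til=4)
print(short)
print(long_)
```

Because longer texts pass through unchanged, a fixed-width model input would still need a separate truncation step for texts above the chosen length.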
%matplotlib inline

# Load the model definitions; this file is expected to provide the
# ConvNet, ResNet and DweNet classes used later.
exec(open('dweNet_final.py').read())

from fastai import *
from fastai.text import *
from fastai.callbacks import *
import torch.utils.data as data_utils
import numpy

defaults.device = torch.device('cuda')

def pad_to(x:Collection[str], pad_til = PADDING) -> Collection[str]:
    # Pads data with PAD token 'xxpad' during pre-processing
    res = []
    count = 0
    for t in x:
        res.append(t)
        count += 1
    while count < pad_til:
        res.append(PAD)
        count += 1
    return res

# Tokenizer with the padding rule appended to the post-processing rules.
tokenizer = Tokenizer(SpacyTokenizer, 'en',
                      pre_rules=[fix_html, replace_rep, replace_wrep,
                                 spec_add_spaces, rm_useless_spaces],
                      post_rules=[replace_all_caps, deal_caps, pad_to],
                      n_cpus=1)
processor = [TokenizeProcessor(tokenizer=tokenizer), NumericalizeProcessor()]

# This will take a few minutes for the Reddit Main dataset.
data = (TextList.from_csv(pth, file, cols=D_COL, processor=processor)
        .split_from_df(col='valid')
        .label_from_df(cols=0)
        .databunch())

Displays a random sample of the data.
data.show_batch()

Initializes the model with the pretrained embedding.
import pickle

# Load the pretrained embedding weights for the selected dataset.
with open(WEIGHTS, 'rb') as f:
    weights_matrix = pickle.load(f)

if MODEL == 1:
    net = ConvNet(weights_matrix)
elif MODEL == 2:
    net = ResNet(weights_matrix)
else:
    net = DweNet(weights_matrix)

net.to('cuda')  # Will print out an entire trace of network

learn = Learner(data, net, wd=0.1, loss_func=CrossEntropyFlat(),
                metrics=[accuracy, FBeta(average='micro', beta=1)],
                callback_fns=[partial(CSVLogger, filename=RES_OUT, append=True)])

Model Training
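Training in this section uses a one-cycle policy (fit_one_cycle with a peak learning rate of 1e-3 and moms=(0.8, 0.7)). The sketch below illustrates the general shape of such a schedule: the learning rate rises to its peak and then decays, while momentum moves in the opposite direction. It is a simplified illustration, not fastai's implementation; the function names `cosine_anneal` and `one_cycle`, the warm-up fraction `pct_start`, the divisors and the cosine interpolation are all assumptions chosen for the example.

```python
import math

def cosine_anneal(start, end, pct):
    """Cosine interpolation: returns start at pct=0 and end at pct=1."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle(step, total_steps, lr_max=1e-3, moms=(0.8, 0.7),
              pct_start=0.3, div_factor=25.0, final_div=1e4):
    """Return (learning_rate, momentum) for a given training step.

    pct_start, div_factor and final_div are assumed illustration values.
    """
    warm = int(total_steps * pct_start)
    if step < warm:
        # Phase 1: learning rate rises to lr_max, momentum falls to moms[1].
        pct = step / warm
        lr = cosine_anneal(lr_max / div_factor, lr_max, pct)
        mom = cosine_anneal(moms[0], moms[1], pct)
    else:
        # Phase 2: learning rate decays, momentum returns to moms[0].
        pct = (step - warm) / max(1, total_steps - warm - 1)
        lr = cosine_anneal(lr_max, lr_max / final_div, pct)
        mom = cosine_anneal(moms[1], moms[0], pct)
    return lr, mom

# Inspect the start, the peak and the end of a 100-step schedule.
start = one_cycle(0, 100)
peak = one_cycle(30, 100)
end = one_cycle(99, 100)
print(start, peak, end)
```

Pairing a high learning rate with low momentum in the middle of the cycle is the usual motivation for this kind of schedule; the exact curve used by the library may differ from this sketch.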
Set the number of training epochs using the EPOCH variable below:
EPOCH = 10

Train the model for EPOCH epochs, evaluating on the held-out set (the rows flagged in the 'valid' column) after each epoch. Once training finishes, the results are shown in a table, with accuracy and F1-score (f_beta) as metrics.
# The model will require a long time per epoch on the Reddit Main dataset
learn.fit_one_cycle(EPOCH, 1e-03, moms=(0.8,0.7))

Result Analysis
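The quantities used in this analysis (the confusion matrix, accuracy and micro-averaged F1) can be reproduced by hand from true and predicted labels. The sketch below does so in plain Python on a made-up set of six labels; the function names and the toy data are assumptions for illustration and are not taken from the notebook's results.

```python
def confusion_matrix(y_true, y_pred, n_classes=2):
    """Rows index the actual class, columns the predicted class."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def accuracy(cm):
    """Share of items on the diagonal (correctly classified)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def micro_f1(cm):
    """F1 computed from true/false positives pooled over all classes."""
    n = len(cm)
    tp = sum(cm[i][i] for i in range(n))
    fp = sum(cm[j][i] for i in range(n) for j in range(n) if j != i)
    fn = sum(cm[i][j] for i in range(n) for j in range(n) if j != i)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels (illustrative only): 0 = non-sarcastic, 1 = sarcastic.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm, accuracy(cm), micro_f1(cm))
```

When every text carries exactly one label, each misclassification counts once as a false positive and once as a false negative, so micro-averaged F1 coincides with accuracy.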
preds, y, losses = learn.get_preds(with_loss=True)
interp = ClassificationInterpretation(learn, preds, y, losses)

Confusion matrix after the final epoch. Diagonal entries correspond to correctly classified texts.
interp.plot_confusion_matrix()

Displays the TOP_LOSS sentences with the highest loss, in descending order of loss, together with each sentence's category (sarcastic or non-sarcastic). Note that the texts are shown in their processed form.
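TOP_LOSS = 10

# top_losses() returns (losses, indices) sorted by descending loss;
# keep the indices of the TOP_LOSS worst-scoring validation texts.
top_idxs = interp.top_losses()[1][0:TOP_LOSS]
for i in top_idxs:
    print(data.valid_ds[i.item()])

Appendix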
Model source code, datasets, and weight matrices initialised using 50-dimensional GloVe embeddings.
# The notebook uses the fastai v1 API (fastai.text, fastai.callbacks)
pip install "fastai<2"
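The appendix mentions weight matrices initialised from GloVe embeddings. How those pickled matrices were produced is not shown here, so the following is only a hedged sketch of one common approach: parse GloVe's plain-text format (a word followed by its vector components on each line), copy the vector for each vocabulary token, and fall back to a random vector for tokens GloVe does not cover. The function names, the toy 3-dimensional vectors, the vocabulary and the standard deviation of the random fallback are all assumptions for illustration.

```python
import random

def parse_glove(lines):
    """Parse GloVe text lines of the form 'word v1 v2 ... vd'."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors

def build_weights_matrix(vocab, vectors, dim, seed=0):
    """One row per vocabulary token, in vocabulary order.

    Tokens missing from the GloVe vectors get a random row
    (assumed fallback; other choices such as zeros are possible).
    """
    rng = random.Random(seed)
    matrix = []
    for token in vocab:
        if token in vectors:
            matrix.append(list(vectors[token]))
        else:
            matrix.append([rng.gauss(0.0, 0.6) for _ in range(dim)])
    return matrix

# Toy 3-dimensional "GloVe" lines; the real embeddings are 50-dimensional.
glove_lines = ["great 0.1 0.2 0.3", "job -0.4 0.5 0.0"]
vocab = ['xxpad', 'great', 'job']
matrix = build_weights_matrix(vocab, parse_glove(glove_lines), dim=3)
print(matrix[1], matrix[2])
```

A matrix built this way could then be pickled and loaded by the notebook's WEIGHTS setting, provided the row order matches the vocabulary used when numericalising the texts.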