Anon / Nov 03 2019

Remix of PyTorch Template by Nextjournal

A Deeper Look at Computational Sarcasm

Benchmarking

Select model using MODEL variable:

1 - CNN baseline

2 - ResNet baseline

3 - DweNet

MODEL = 3

Select dataset using DATASET variable:

1 - Headlines

2 - Reddit Main

3 - Reddit Pol

DATASET = 3

# Per-dataset settings. Weight file names follow the Appendix listing.
if DATASET == 1:
    pth = '/.nextjournal/data-named/QmXM1SAUDr39KzBUo4rkFn9VTti8hSGTneAsShAdL6VAcG/'
    file = 'Headlines.csv'
    D_COL = 'headline'
    WEIGHTS = 'Headlines_Glove_Weights.pkl'
    PADDING = 64
    RES_OUT = '/results/Headline_Results'
elif DATASET == 2:
    pth = '/.nextjournal/data-named/QmX5TD1r1Hox3Wo5ZYw5SxeiHfKRZ9qFZH9PNYzHkvdnJv/'
    file = 'Reddit_Main.csv'
    D_COL = 'comment'
    WEIGHTS = 'NJ_WEIGHTS_REDDIT_MAIN.pkl'
    PADDING = 128
    RES_OUT = '/results/Main_Results'
elif DATASET == 3:
    pth = '/.nextjournal/data-named/QmQhy5wz8vnWVwg6mDTW2TtRqiGL3MehRYJPmNWiP6hbbt/'
    file = 'Reddit_Pol.csv'
    D_COL = 'comment'
    WEIGHTS = 'NJ_WEIGHTS_POL.pkl'
    PADDING = 128
    RES_OUT = '/results/Pol_Results'

Model Setup

%matplotlib inline

# Load the three model definitions (source files listed in the Appendix)
exec(open('ConvNet_Baseline_Final.py').read())
exec(open('ResNet_baseline_final.py').read())
exec(open('dweNet_final.py').read())

from fastai import *
from fastai.text import *
from fastai.callbacks import *
import torch.utils.data as data_utils
import numpy

defaults.device = torch.device('cuda')

def pad_to(x: Collection[str], pad_til=PADDING) -> Collection[str]:
    # Pads data with PAD token 'xxpad' during pre-processing
    res = []
    count = 0
    for t in x:
        res.append(t)
        count += 1
    while count < pad_til:
        res.append(PAD)
        count += 1
    return res
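To make the padding rule concrete, here is a minimal, self-contained sketch of the same logic outside fastai. The `PAD` constant is set by hand to `'xxpad'` (the token named in the comment above), and `pad_tokens` is a hypothetical stand-in name, not part of the notebook.

```python
from typing import List

PAD = 'xxpad'  # assumed value of fastai's PAD token, as named in the comment above


def pad_tokens(tokens: List[str], pad_til: int = 8) -> List[str]:
    """Right-pad a token list with PAD up to pad_til tokens.

    Inputs that are already longer than pad_til are returned untruncated.
    """
    return list(tokens) + [PAD] * max(0, pad_til - len(tokens))


short = pad_tokens(['oh', 'great', ',', 'another', 'meeting'], pad_til=8)
longer = pad_tokens(['a'] * 10, pad_til=8)
print(short)
print(len(longer))
```

As with `pad_to`, texts longer than the target length are not truncated, so PADDING acts as a minimum length rather than a fixed one.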

tokenizer = Tokenizer(SpacyTokenizer, 'en',
                      pre_rules=[fix_html, replace_rep, replace_wrep,
                                 spec_add_spaces, rm_useless_spaces],
                      post_rules=[replace_all_caps, deal_caps, pad_to],
                      n_cpus=1)
processor = [TokenizeProcessor(tokenizer=tokenizer), NumericalizeProcessor()]

# This will take a few minutes for the Reddit Main dataset.
data = (TextList.from_csv(pth, file, cols=D_COL, processor=processor)
        .split_from_df(col='valid')
        .label_from_df(cols=0)
        .databunch())
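The call chain above implies a particular CSV layout: the label sits in the first column (`label_from_df(cols=0)`), the text in the `D_COL` column, and a boolean `valid` column marks the held-out rows. The sketch below mimics that split with the standard library only. The column name `label` and the sample rows are invented for illustration and are not taken from the datasets.

```python
import csv
import io

# Hypothetical sample in the layout the pipeline appears to expect.
SAMPLE = """label,comment,valid
1,oh sure that will work,False
0,the vote is on tuesday,False
1,what a totally fair process,True
"""


def split_rows(csv_text, text_col, valid_col='valid'):
    """Return (train, valid) lists of (text, label) pairs, split on a boolean column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    text_idx, valid_idx = header.index(text_col), header.index(valid_col)
    train, valid = [], []
    for row in reader:
        pair = (row[text_idx], row[0])  # the label is read from column 0
        (valid if row[valid_idx] == 'True' else train).append(pair)
    return train, valid


train, valid = split_rows(SAMPLE, text_col='comment')
print(len(train), len(valid))
```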

Displays a random sample of the data.

data.show_batch()

Initializes the model with the pretrained embedding.

with open(WEIGHTS, 'rb') as f:
    weights_matrix = pickle.load(f)

if MODEL == 1:
    net = ConvNet(weights_matrix)
elif MODEL == 2:
    net = ResNet(weights_matrix)
else:
    net = DweNet(weights_matrix)

net.to('cuda')  # Will print out the full network structure
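For readers unfamiliar with pretrained embeddings: the weight matrix holds one row per vocabulary entry, and an embedding layer looks up the row for each token id. The following is a toy illustration of that lookup in plain Python, assuming a row-per-token layout. The 3-dimensional vectors are made up (the notebook's matrices are built from 50-dimensional GloVe embeddings), and the model files may wire the matrix in differently.

```python
from typing import List

# Toy weight matrix: one row per vocabulary id (values are invented).
weights_matrix = [
    [0.0, 0.0, 0.0],   # id 0, e.g. a padding token
    [0.1, -0.2, 0.3],  # id 1
    [0.5, 0.4, -0.1],  # id 2
]


def embed(token_ids: List[int], matrix: List[List[float]]) -> List[List[float]]:
    """Map each token id to its row of the weight matrix."""
    return [matrix[i] for i in token_ids]


vectors = embed([2, 1, 0, 0], weights_matrix)
print(len(vectors), len(vectors[0]))
```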

learn = Learner(data, net, wd=0.1, loss_func=CrossEntropyFlat(),
                metrics=[accuracy, FBeta(average='micro', beta=1)],
                callback_fns=[partial(CSVLogger, filename=RES_OUT, append=True)])
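The two metrics passed to the learner can be written out by hand. The sketch below computes accuracy and micro-averaged F1 from label lists using the standard textbook definitions, not fastai's internals, and the sample labels are invented. For single-label classification, micro-averaging pools every prediction into one count of true positives, false positives and false negatives, so micro-F1 coincides with accuracy.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred):
    """F1 from counts pooled over all classes (assumes at least one correct prediction)."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            tp += (p == c and t == c)
            fp += (p == c and t != c)
            fn += (p != c and t == c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented labels: 1 = sarcastic, 0 = non-sarcastic (encoding assumed).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred), micro_f1(y_true, y_pred))
```

If the two columns in the results table match, that is the expected behaviour of micro-averaging rather than a logging error.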

Model Training

Set the number of epochs to run the model for using EPOCH variable below:

EPOCH = 10

Trains the model for EPOCH epochs, evaluating it after each epoch on the held-out rows (those flagged in the 'valid' column). Results are presented in a table after training, with accuracy and F1-score (f_beta) as metrics.

# The model will require a long time per epoch on the Reddit Main dataset.
learn.fit_one_cycle(EPOCH, 1e-03, moms=(0.8, 0.7))
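`fit_one_cycle` follows the one-cycle policy: the learning rate ramps up to its peak (1e-03 here) and then decays, while momentum moves between the two `moms` values in the opposite direction. The sketch below reproduces the shape of that schedule in plain Python. The `div_factor=25`, `pct_start=0.3`, the final divisor and the cosine interpolation are assumed to match fastai v1's defaults and should be checked against the installed version.

```python
import math


def cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)


def one_cycle(pct, lr_max=1e-3, moms=(0.8, 0.7), div_factor=25.0, pct_start=0.3):
    """Return (lr, momentum) at fraction pct of training, with 0 <= pct <= 1."""
    if pct < pct_start:
        # Warm-up phase: learning rate rises, momentum falls.
        p = pct / pct_start
        return (cos_anneal(lr_max / div_factor, lr_max, p),
                cos_anneal(moms[0], moms[1], p))
    # Annealing phase: learning rate decays, momentum recovers.
    p = (pct - pct_start) / (1 - pct_start)
    return (cos_anneal(lr_max, lr_max / (div_factor * 1e4), p),
            cos_anneal(moms[1], moms[0], p))


for pct in (0.0, 0.3, 1.0):
    lr, mom = one_cycle(pct)
    print(f"{pct:.1f}  lr={lr:.2e}  mom={mom:.2f}")
```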

Result Analysis

preds, y, losses = learn.get_preds(with_loss=True)
interp = ClassificationInterpretation(learn, preds, y, losses)

Confusion matrix of final epoch - diagonal entries correspond to correctly classified texts.

interp.plot_confusion_matrix()
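For reference, the counts behind such a plot can be tallied by hand. This is a minimal sketch with invented labels, assuming the encoding 0 = non-sarcastic and 1 = sarcastic. Rows index the actual class and columns the predicted class, so the diagonal holds the correct predictions.

```python
def confusion_matrix(y_true, y_pred, n_classes=2):
    """Count (actual, predicted) pairs; rows are actual, columns are predicted."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


# Invented labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
correct = sum(cm[i][i] for i in range(len(cm)))
print(cm, correct / len(y_true))
```

Dividing the diagonal sum by the total number of texts recovers the accuracy reported during training.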

Displays the TOP_LOSS validation sentences with the highest loss, in descending order of loss, together with their category (sarcastic or non-sarcastic). Note that the processed texts (tokenized and padded) are shown, not the raw input.

TOP_LOSS = 10

top_idx = interp.top_losses()[1][0:TOP_LOSS]  # indices of the highest-loss validation items
for i in top_idx:
    print(data.valid_ds[i.item()])
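Why high-loss items are worth inspecting: with cross-entropy, the per-example loss is the negative log of the probability assigned to the true class, so the largest losses are the confidently wrong predictions. The sketch below ranks a handful of invented probabilities this way. The helper name `top_losses` mirrors the call above but is a plain-Python stand-in, not fastai's implementation.

```python
import math

# Invented probabilities the model assigned to the TRUE class of five texts.
p_true = [0.90, 0.10, 0.95, 0.23, 0.41]

# Per-example cross-entropy: low probability on the true class means high loss.
losses = [-math.log(p) for p in p_true]


def top_losses(losses, k):
    """Return (loss, index) pairs for the k largest losses, largest first."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return [(losses[i], i) for i in order[:k]]


for loss, idx in top_losses(losses, k=3):
    print(idx, round(loss, 2))
```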

Appendix

Model source code, datasets, and weight matrices (initialised using 50-dimensional GloVe embeddings):

ResNet_baseline_final.py

ConvNet_Baseline_Final.py

dweNet_final.py

Reddit_Main.csv

NJ_WEIGHTS_REDDIT_MAIN.pkl

Reddit_Pol.csv

NJ_WEIGHTS_POL.pkl

Headlines.csv

Headlines_Glove_Weights.pkl

pip install "fastai<2"  # the notebook uses the fastai v1 API (fastai.text, fastai.callbacks)
