<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bi-ISCA: Bidirectional Inter-Sentence Contextual Attention Mechanism for Detecting Sarcasm in User Generated Noisy Short Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Prakamya Mishra</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saroj Kaushik</string-name>
          <email>saroj.kaushik@snu.edu.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kuntal Dey</string-name>
          <email>kuntal.dey@accenture.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Accenture Technology Labs</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Shiv Nadar University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Many online comments on social media platforms are hateful, humorous, or sarcastic. The sarcastic nature of these comments (especially the short ones) alters their actual implied sentiments, which leads to misinterpretations by existing sentiment analysis models. A lot of research has already been done to detect sarcasm in text using user-based, topical, and conversational information, but not much work has been done on using inter-sentence contextual information for the same. This paper proposes a new deep learning architecture that uses a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) to capture inter-sentence dependencies for detecting sarcasm in user-generated short text using only the conversational context. The proposed deep learning model demonstrates the capability to capture explicit, implicit, and contextual incongruous words &amp; phrases responsible for invoking sarcasm. Bi-ISCA generates results comparable to the state-of-the-art on two widely used benchmark datasets for the sarcasm detection task (Reddit and Twitter). To the best of our knowledge, none of the existing models use an inter-sentence contextual attention mechanism to detect sarcasm in user-generated short text using only the conversational context.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Sentiment analysis is one of the most important natural
language processing (NLP) applications. Its goal is to identify,
extract, quantify, and study subjective information. The
sudden rise in the usage of social media platforms as a means of
communication has led to a vast amount of data being shared
between their users on a wide range of topics. This type of data
is very helpful to several organizations for analyzing the
sentiments of people towards products, movies, political events, etc.
Understanding the unique intricacies of human language
remains one of the most important open NLP problems
of this time. Humans regularly use sarcasm as a crucial part
of day-to-day conversations when venting, arguing, or
engaging on social media platforms. Sarcastic remarks
on these platforms pose problems for existing sentiment
analysis systems in identifying the true intentions of their users.</p>
      <p>The Cambridge Dictionary describes sarcasm as an irony
conveyed hilariously or amusingly to criticize something.
Sarcasm may not show criticism on the surface but instead might
carry a criticizing implied meaning. This figurative aspect of
sarcasm makes it difficult to detect in modern micro
texts [Ghosh and Veale, 2016]. Several linguistic studies have
analyzed different aspects of sarcasm. The kind of
response a comment evokes has been considered a
major indicator of sarcasm [Eisterhold et al., 2006]. [Wilson,
2006] states that circumstantial incongruity between a
comment and its corresponding contextual information plays an
important role in implying sarcasm.</p>
      <p>Previous research works have used rule-based,
statistical, and deep-learning-based methods for detecting sarcasm.
The use of contextual information like conversational
context, author personality features, or prior knowledge of the
topic has proved to be very useful. [Khattri et al., 2015]
used sentiments of the author’s historical tweets as context.
[Rajadesingan et al., 2015] used personality features like the
author’s familiarity with Twitter, language (structure and word
usage), and the author’s familiarity with sarcasm (history of
previous sarcastic tweets) for consolidating context. [Bamman
and Smith, 2015] explored the use of historical terms, topics,
and sentiments along with profile information as the author’s
context. They also exploited the use of conversational context,
like the immediately preceding tweets in the thread. [Joshi et al.,
2015] demonstrated that concatenating the preceding comment
with the target comment in a discussion forum led to an
increase in the precision score.</p>
      <p>Overall, in recent years a lot of work has been done to use
different types of contextual information for sarcasm detection,
but none of it has used inter-sentence dependencies. In
this paper, we propose a novel Bidirectional Inter-Sentence
Contextual Attention mechanism (Bi-ISCA) based deep
learning neural network for sarcasm detection. The main
contributions of this paper can be summarized as follows:
• We propose a new deep learning architecture that uses a
novel Bidirectional Inter-Sentence Contextual Attention
mechanism (Bi-ISCA) for detecting sarcasm in short texts
(short texts are more difficult to analyze due to the shortage
of contextual information).
• Bi-ISCA focuses on using only the conversational
contextual comment/tweet for detecting sarcasm rather than
any other topical/personality-based features, as
using only the contextual information enriches the model’s
ability to capture syntactic and semantic textual
properties responsible for invoking sarcasm.
• We also explain model behavior and predictions by
visualizing attention maps generated by Bi-ISCA, which
helps in identifying significant parts of the sentences
responsible for invoking sarcasm.</p>
      <p>The rest of the paper is organized as follows. Section 2
describes the related work. Section 3 explains the
proposed model architecture for detecting sarcasm. Section 4
describes the datasets used, the pre-processing pipeline, and
training details for reproducibility. Experimental results
are explained in section 5, and section 6 illustrates model
behavior and predictions by visualizing attention maps. Finally,
we conclude in section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>A diverse spectrum of approaches has been used to detect
sarcasm. Recent sarcasm detection approaches have mainly
focused either on machine-learning-based approaches
that leverage explicitly declared relevant features,
or on neural-network-based deep learning
approaches that do not require handcrafted features. Also, the
recent advances in using deep learning for performing natural
language processing tasks have led to a promising increase in
the performance of these sarcasm detection systems.</p>
      <p>A lot of research has been done using bag of words as
features. However, to improve performance, scholars started
to explore the use of several other semantic and syntactical
features like punctuations [Tsur et al., 2010]; emotion marks
and intensifiers [Liebrecht et al., 2013]; positive verbs and
negative phrases [Riloff et al., 2013]; polarity skip grams
[Reyes et al., 2013]; synonyms &amp; ambiguity [Barbieri et al.,
2014]; implicit and explicit incongruity-based [Joshi et al.,
2015]; sentiment flips [Rajadesingan et al., 2015]; affect-based
features derived from multiple emotion lexicons [Farías et al.,
2016].</p>
      <p>Every day an enormous amount of short text data is
generated by users on popular social media platforms like Twitter
(www.twitter.com) and Reddit (www.reddit.com). Easy accessibility of such data sources has
enticed researchers to use them for extracting user-based and
discourse-based features. [Hazarika et al., 2018] utilized
contextual information by building user embeddings for capturing
indicative behavioral traits. These user embeddings
incorporated personality features along with the author’s writing style
(using historical posts). They also used discourse comments
along with background cues and topical information for
detecting sarcasm. They performed their experiments on the largest
Reddit dataset, SARC [Khodak et al., 2018]. Many works have only
used the target text for classification purposes, where a target
text is a textual unit that has to be classified as sarcastic or
not. Simply using gated recurrent units (GRU) [Cho et al.,
2014] or long short term memory (LSTM) [Hochreiter and
Schmidhuber, 1997] does not capture the in-between interactions
of word pairs, which makes it difficult to model contrast and
incongruity. [Tay et al., 2018] were able to solve this problem
by looking in between word pairs using a multi-dimensional
intra-attention recurrent network. They focused on modeling
the intra-sentence relationships among the words. [Kumar et
al., 2020] exploited the use of a multi-head attention
mechanism [Vaswani et al., 2017], which could capture dependencies
between different representation subspaces at different
positions. Their model consisted of a word encoder for generating
new word representations by summarizing comment
contextual information in a bidirectional manner. On top of that, they
used multi-head attention for focusing on different contexts
of a sentence, and in the end, a simple multi-layer perceptron
was used for classification.</p>
      <p>There has not been much work done on conversation-dependent
(comment and reply) approaches for sarcasm detection.
[Ghaeini et al., 2018] proposed a model that not only used
information from the target utterance but also used its
conversational context to perceive sarcasm. They aimed to detect
sarcasm by just using the sequences of sentences, without any
extra knowledge about the user and topic. They combined the
predictions from utterance-only and conversation-dependent
parts for generating the final prediction, which was able to
capture the words responsible for delivering sarcasm. [Ghosh and
Veale, 2017] also modeled conversational context for sarcasm
detection. They also attempted to derive what parts of the
conversational context triggered a sarcastic reply. Their proposed
model used sentence embeddings created by taking an average
of word embeddings, and a sentence-level attention mechanism
was used to generate attention-induced representations of both
the context and the response, which were later concatenated and
used for classification.</p>
      <p>Among all the previous works, [Ghaeini et al., 2018] and
[Ghosh and Veale, 2017] share similar motives of detecting
sarcasm using only the conversational context. However, we
introduce a novel Bidirectional Inter-Sentence Contextual
Attention mechanism (Bi-ISCA) for detecting sarcasm. Unlike
previous works, our work considers short texts, in which
sarcasm is far more challenging to detect than in
long texts, as long texts provide much more contextual
information.</p>
    </sec>
    <sec id="sec-3">
      <title>Model</title>
      <p>This section will introduce the proposed Bi-ISCA:
Bidirectional Inter-Sentence Contextual Attention based neural
network for sarcasm detection (as shown in Figure 1). Sarcasm
detection is a binary classification task that tries to predict
whether a given comment is sarcastic or not. The proposed
model uses comment-reply pairs for detecting sarcasm. The
input to the model is represented by $U = [W^u_1, W^u_2, \ldots, W^u_n]$
and $V = [W^v_1, W^v_2, \ldots, W^v_n]$, where $U$ represents the
comment sentence and $V$ represents the reply sentence (both
sentences padded to a length of $n$). Here, $W^u_i, W^v_j \in \mathbb{R}^d$ are
$d$-dimensional word embedding vectors. The objective is
to predict the label $y$, which indicates whether the reply to the
corresponding comment was sarcastic or not.</p>
      <sec id="sec-3-1">
        <title>3.1 Intra-Sentence Word Encoder Layer</title>
        <p>The primary purpose of this layer is to summarize
intra-sentence contextual information from both directions in both
the sentences (comment &amp; reply) using Bidirectional Long
Short Term Memory networks (Bi-LSTM). A Bi-LSTM
[Schuster and Paliwal, 1997] processes information in both
directions using a forward LSTM [Hochreiter and
Schmidhuber, 1997] $\overrightarrow{h}$, which reads the sentence $S = [w_1, w_2, \ldots, w_n]$
from $w_1$ to $w_n$, and a backward LSTM $\overleftarrow{h}$, which reads the
sentence from $w_n$ to $w_1$. Hidden states from both LSTMs are
added to get the final hidden state representation of each word.
So the hidden state representation of the $t$th word ($h_t$) can be
represented by the sum of the $t$th hidden representations of the
forward and backward LSTMs ($\overrightarrow{h_t}, \overleftarrow{h_t}$), as shown in equations 1 and 2.</p>
        <p>$\overrightarrow{h_t} = \overrightarrow{LSTM}(w_t, \overrightarrow{h_{t-1}}); \quad \overleftarrow{h_t} = \overleftarrow{LSTM}(w_t, \overleftarrow{h_{t-1}})$ (1)
$h_t = \overleftarrow{h_t} + \overrightarrow{h_t}$ (2)</p>
        <p>This Intra-Sentence Word Encoder Layer consists of
two independent Bidirectional LSTMs, one for the comment
($BiLSTM_c$) and one for the reply ($BiLSTM_r$). Apart from the hidden
states, both these Bi-LSTMs also generate separate (forward
&amp; backward) final cell states, represented by $\overrightarrow{C}$ &amp; $\overleftarrow{C}$. The
comment sentence $U$ is given as an input to $BiLSTM_c$ and
the reply sentence $V$ is given as an input to $BiLSTM_r$. The
outputs of both Bi-LSTMs are represented by equations
3 and 4.</p>
        <p>$\overrightarrow{C^u}, h^u, \overleftarrow{C^u} = BiLSTM_c(U)$ (3)
$\overrightarrow{C^v}, h^v, \overleftarrow{C^v} = BiLSTM_r(V)$ (4)</p>
        <p>Here, $\overrightarrow{C^u}, \overrightarrow{C^v} \in \mathbb{R}^d$ are the final cell states of the
forward LSTMs corresponding to $BiLSTM_c$ &amp; $BiLSTM_r$;
$\overleftarrow{C^u}, \overleftarrow{C^v} \in \mathbb{R}^d$ are the final cell states of the backward
LSTMs corresponding to $BiLSTM_c$ &amp; $BiLSTM_r$; $h^u = [h^u_1, h^u_2, \ldots, h^u_n]$
and $h^v = [h^v_1, h^v_2, \ldots, h^v_n]$ are the hidden
state representations of $BiLSTM_c$ &amp; $BiLSTM_r$
respectively, where $h^u_i, h^v_j \in \mathbb{R}^d$ and $h^u, h^v \in \mathbb{R}^{n \times d}$.</p>
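        <p>To make this layer concrete, the following is a minimal PyTorch sketch of the encoder (an illustrative reconstruction, not the authors’ released implementation; the module name, the batch-first layout, and the use of torch.nn.LSTM are our assumptions). It returns the summed per-word hidden states $h$ and the final forward/backward cell states that Bi-ISCA consumes later.</p>
        <preformat>
import torch
import torch.nn as nn

class IntraSentenceEncoder(nn.Module):
    """Bi-LSTM word encoder (sketch): sums the forward and backward hidden
    states per word (equations 1-2) and exposes the final forward/backward
    cell states (equations 3-4)."""

    def __init__(self, emb_dim: int):
        super().__init__()
        # Two independent copies of this module play the roles of
        # BiLSTM_c (comment) and BiLSTM_r (reply).
        self.bilstm = nn.LSTM(emb_dim, emb_dim,
                              bidirectional=True, batch_first=True)

    def forward(self, x):                  # x: (batch, n, d) word embeddings
        out, (h_n, c_n) = self.bilstm(x)   # out: (batch, n, 2d)
        d = x.size(-1)
        h = out[..., :d] + out[..., d:]    # h_t = forward h_t + backward h_t
        c_fwd, c_bwd = c_n[0], c_n[1]      # final cell states, each (batch, d)
        return c_fwd, h, c_bwd
</preformat>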
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Bi-ISCA: Bidirectional Inter-Sentence</title>
      </sec>
      <sec id="sec-3-3">
        <title>Contextual Attention Mechanism</title>
        <p>Sarcasm is context-dependent in nature. Even humans
sometimes have a hard time understanding sarcasm without
any contextual information. The hidden states
generated by the two Bi-LSTMs ($BiLSTM_c$ &amp; $BiLSTM_r$)
capture the intra-sentence bidirectional contextual information
in the comment &amp; reply respectively, but fail to capture the
inter-sentence contextual information between them. This paper
introduces a novel Bidirectional Inter-Sentence Contextual
Attention mechanism (Bi-ISCA) for capturing the inter-sentence
contextual information between the two sentences.</p>
        <p>Bi-ISCA uses the hidden state representations of $U$ &amp; $V$ along
with the auxiliary sentence’s cell state representations ($\overrightarrow{C}$ &amp; $\overleftarrow{C}$)
to capture the inter-sentence contextual information. At
first, the attention mechanism computes four sets of attention
scores, namely $\alpha^{\overrightarrow{C^u}}, \alpha^{\overleftarrow{C^u}}, \alpha^{\overrightarrow{C^v}}, \alpha^{\overleftarrow{C^v}} \in \mathbb{R}^n$. These sets
of inter-sentence attention scores are used to generate new
inter-sentence contextualized hidden representations. Then
($\alpha^{\overrightarrow{C^u}}, \alpha^{\overleftarrow{C^u}}$) are calculated using the hidden state
representations of $BiLSTM_r$ along with the forward and backward
final states ($\overrightarrow{C^u}, \overleftarrow{C^u}$) of $BiLSTM_c$ (as shown in equations
5 &amp; 6); similarly, ($\alpha^{\overrightarrow{C^v}}, \alpha^{\overleftarrow{C^v}}$) are calculated using the hidden
state representations of $BiLSTM_c$ along with the forward
and backward final states ($\overrightarrow{C^v}, \overleftarrow{C^v}$) of $BiLSTM_r$ (as shown
in equations 7 &amp; 8). In the equations below, ($\cdot$) represents a
dot product between two vectors.</p>
        <p>$\alpha^{\overrightarrow{C^u}} = [\alpha^{\overrightarrow{C^u}}_1, \alpha^{\overrightarrow{C^u}}_2, \ldots, \alpha^{\overrightarrow{C^u}}_n]; \quad \alpha^{\overrightarrow{C^u}}_i = \overrightarrow{C^u} \cdot h^v_i$ (5)
$\alpha^{\overleftarrow{C^u}} = [\alpha^{\overleftarrow{C^u}}_1, \alpha^{\overleftarrow{C^u}}_2, \ldots, \alpha^{\overleftarrow{C^u}}_n]; \quad \alpha^{\overleftarrow{C^u}}_i = \overleftarrow{C^u} \cdot h^v_i$ (6)
$\alpha^{\overrightarrow{C^v}} = [\alpha^{\overrightarrow{C^v}}_1, \alpha^{\overrightarrow{C^v}}_2, \ldots, \alpha^{\overrightarrow{C^v}}_n]; \quad \alpha^{\overrightarrow{C^v}}_i = \overrightarrow{C^v} \cdot h^u_i$ (7)
$\alpha^{\overleftarrow{C^v}} = [\alpha^{\overleftarrow{C^v}}_1, \alpha^{\overleftarrow{C^v}}_2, \ldots, \alpha^{\overleftarrow{C^v}}_n]; \quad \alpha^{\overleftarrow{C^v}}_i = \overleftarrow{C^v} \cdot h^u_i$ (8)</p>
        <p>In the next step, the above-calculated sets of inter-sentence
attention scores ($\alpha^{\overrightarrow{C^u}}, \alpha^{\overleftarrow{C^u}}$) are multiplied back with the
hidden state representations of $BiLSTM_r$ to generate two new
sets of hidden representations $h^{\overrightarrow{C^u}}_v, h^{\overleftarrow{C^u}}_v \in \mathbb{R}^{n \times d}$ of the
reply sentence, namely reply contextualized on comment
(forward) &amp; reply contextualized on comment (backward)
respectively (as shown in equations 9 &amp; 10). Similarly, ($\alpha^{\overrightarrow{C^v}}, \alpha^{\overleftarrow{C^v}}$)
are multiplied back with the hidden state representations of
$BiLSTM_c$ to generate two new sets of hidden representations
$h^{\overrightarrow{C^v}}_u, h^{\overleftarrow{C^v}}_u \in \mathbb{R}^{n \times d}$ of the comment sentence, namely comment
contextualized on reply (forward) &amp; comment contextualized
on reply (backward) respectively (as shown in equations 11
&amp; 12). In the equations below, ($\times$) represents multiplication
between a scalar and a vector.</p>
        <p>$h^{\overrightarrow{C^u}}_v = [h^{\overrightarrow{C^u}}_{v,1}, h^{\overrightarrow{C^u}}_{v,2}, \ldots, h^{\overrightarrow{C^u}}_{v,n}]; \quad h^{\overrightarrow{C^u}}_{v,i} = \alpha^{\overrightarrow{C^u}}_i \times h^v_i$ (9)
$h^{\overleftarrow{C^u}}_v = [h^{\overleftarrow{C^u}}_{v,1}, h^{\overleftarrow{C^u}}_{v,2}, \ldots, h^{\overleftarrow{C^u}}_{v,n}]; \quad h^{\overleftarrow{C^u}}_{v,i} = \alpha^{\overleftarrow{C^u}}_i \times h^v_i$ (10)
$h^{\overrightarrow{C^v}}_u = [h^{\overrightarrow{C^v}}_{u,1}, h^{\overrightarrow{C^v}}_{u,2}, \ldots, h^{\overrightarrow{C^v}}_{u,n}]; \quad h^{\overrightarrow{C^v}}_{u,i} = \alpha^{\overrightarrow{C^v}}_i \times h^u_i$ (11)
$h^{\overleftarrow{C^v}}_u = [h^{\overleftarrow{C^v}}_{u,1}, h^{\overleftarrow{C^v}}_{u,2}, \ldots, h^{\overleftarrow{C^v}}_{u,n}]; \quad h^{\overleftarrow{C^v}}_{u,i} = \alpha^{\overleftarrow{C^v}}_i \times h^u_i$ (12)</p>
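        <p>As a minimal sketch of equations 5-12 (our illustrative reconstruction; the function name, batching, and use of torch.einsum are assumptions), the scores are dot products between one sentence’s final cell states and the other sentence’s hidden states, and they rescale those hidden states position by position.</p>
        <preformat>
import torch

def bi_isca_half(c_fwd_u, c_bwd_u, h_v):
    """Contextualize reply hidden states h_v (batch, n, d) on the comment's
    final cell states c_fwd_u / c_bwd_u (batch, d): equations 5, 6, 9, 10.
    Calling it with (c_fwd_v, c_bwd_v, h_u) gives equations 7, 8, 11, 12."""
    # alpha_i = C . h_i  ->  (batch, n) attention scores (eqs. 5-6)
    alpha_fwd = torch.einsum('bd,bnd->bn', c_fwd_u, h_v)
    alpha_bwd = torch.einsum('bd,bnd->bn', c_bwd_u, h_v)
    # scale each hidden state by its scalar score -> (batch, n, d) (eqs. 9-10)
    h_v_fwd = alpha_fwd.unsqueeze(-1) * h_v
    h_v_bwd = alpha_bwd.unsqueeze(-1) * h_v
    return h_v_fwd, h_v_bwd
</preformat>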
      </sec>
      <sec id="sec-3-4">
        <title>3.3 Integration and Final Prediction</title>
        <p>The proposed model uses Convolutional Neural Networks
(CNN) [Lecun et al., 1998] for capturing location-invariant
local features from the newly obtained contextualized
hidden representations $h^{\overrightarrow{C^v}}_u, h^{\overleftarrow{C^v}}_u, h^{\overrightarrow{C^u}}_v, h^{\overleftarrow{C^u}}_v$. Four independent
CNN blocks ($CNN_1, CNN_2, CNN_3, CNN_4$) are used,
corresponding to each of the newly obtained contextualized
hidden representations. Each CNN block consists of two
convolutional layers. Both convolutional layers consist of $k$ filters
of height $h$. The role of these filters is to detect particular
features at different locations of the input. The output $c^l_i$ of
the $l$th layer consists of $k^l$ feature maps of height $h$. The $i$th
feature map ($c^l_i$) is calculated as:</p>
        <p>$c^l_i = b^l_i + \sum_{j=1}^{k^{l-1}} K^l_{i,j} * c^{l-1}_j$ (13)</p>
        <p>In the above equation, $b^l_i$ is a bias matrix and $K^l_{i,j}$ is a filter
connecting the $j$th feature map of layer $(l-1)$ to the $i$th feature
map of layer $l$. The output of each convolutional layer is
passed through an activation function $f$. The proposed model
uses LeakyReLU as its activation function.</p>
        <p>$f(x) = \begin{cases} x, &amp; \text{for } x \geq 0 \\ ax, &amp; \text{for } x &lt; 0 \end{cases}; \quad a \in \mathbb{R}$ (14)</p>
        <p>For each of the CNN blocks, the corresponding
contextualized hidden representations are first concatenated ($\oplus$) and
then given as input. The outputs of all the CNN blocks are
flattened ($F_1, F_2, F_3, F_4 \in \mathbb{R}^{dk}$) and concatenated to generate
a new vector ($p \in \mathbb{R}^{4dk}$), where $d$ represents the dimension of
the hidden representation and $k$ represents the number of
convolutional filters used. This concatenated vector $p$ is then given
as input to a dense layer having $4dk$ neurons, which is followed
by the final sigmoid prediction layer.</p>
        <p>$F_1 = CNN_1([h^{\overrightarrow{C^v}}_{u,1} \oplus h^{\overrightarrow{C^v}}_{u,2} \oplus \ldots \oplus h^{\overrightarrow{C^v}}_{u,n}])$ (15)
$F_2 = CNN_2([h^{\overleftarrow{C^v}}_{u,1} \oplus h^{\overleftarrow{C^v}}_{u,2} \oplus \ldots \oplus h^{\overleftarrow{C^v}}_{u,n}])$ (16)
$F_3 = CNN_3([h^{\overrightarrow{C^u}}_{v,1} \oplus h^{\overrightarrow{C^u}}_{v,2} \oplus \ldots \oplus h^{\overrightarrow{C^u}}_{v,n}])$ (17)
$F_4 = CNN_4([h^{\overleftarrow{C^u}}_{v,1} \oplus h^{\overleftarrow{C^u}}_{v,2} \oplus \ldots \oplus h^{\overleftarrow{C^u}}_{v,n}])$ (18)
$p = [F_1 \oplus F_2 \oplus F_3 \oplus F_4]$ (19)
$\hat{y} = \sigma(Wp + b)$ (20)
$W \in \mathbb{R}^{4dk}; \quad b \in \mathbb{R}$ (21)</p>
        <p>The proposed model uses binary cross-entropy as the
training loss function, as shown in equation 22. Here $\mathcal{L}$ is the
cost function, $\hat{y}_i \in \mathbb{R}$ represents the output of the proposed
model, $y_i \in \mathbb{R}$ represents the true label, and $N \in \mathbb{N}$ represents
the number of training samples.</p>
        <p>$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$ (22)</p>
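        <p>A sketch of this integration stage follows (illustrative only; the 1-D convolution orientation, padding, and the exact flattening behind equations 15-18 are not fully specified in the paper, so the shapes below are our assumptions).</p>
        <preformat>
import torch
import torch.nn as nn

class CNNBlock(nn.Module):
    """Two stacked 1-D convolutions with LeakyReLU (equations 13-14), then
    flattening (a sketch; the stride and padding choices are assumptions)."""

    def __init__(self, d: int, k: int = 64, h: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d, k, kernel_size=h, padding='same'),
            nn.LeakyReLU(0.3),
            nn.Conv1d(k, k, kernel_size=h, padding='same'),
            nn.LeakyReLU(0.3),
        )

    def forward(self, x):        # x: (batch, n, d) contextualized states
        # Conv1d expects (batch, channels, length); flatten the feature maps.
        return self.net(x.transpose(1, 2)).flatten(1)

def predict(reps, blocks, dense, out_layer):
    """reps: the four contextualized representations; equations 15-20 (sketch)."""
    p = torch.cat([blk(r) for blk, r in zip(blocks, reps)], dim=1)  # eq. 19
    return torch.sigmoid(out_layer(dense(p)))                       # eq. 20
</preformat>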
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation Setup</title>
      <sec id="sec-4-1">
        <title>4.1 Dataset</title>
        <p>This paper focuses on detecting sarcasm in user-generated
short text using only the conversational context. Social media
platforms like Reddit and Twitter are widely used by users for
posting opinions and replying to others’ opinions. They have
proved to be a great source for extracting conversational
data. So the experiments were conducted on two publicly
available benchmark datasets (Reddit &amp; Twitter) used for the
sarcasm detection task. Both datasets consist of comment
and reply pairs.</p>
        <p>SARC [Khodak et al., 2018] (https://nlp.cs.princeton.edu/SARC/2.0/) is the largest
dataset available for sarcasm detection, containing millions
of sarcastic/non-sarcastic comment-reply pairs from the
social media site Reddit. This dataset was generated by scraping
comments from Reddit containing the \s (sarcasm) tag. It
contains replies, their parent comment (acts as context), and a
label that shows whether the reply was sarcastic or non-sarcastic
with respect to its parent comment. To compare the
performance of the model on a different, more recent dataset, the proposed
model was also evaluated on the Twitter dataset provided in the
FigLang 2020 workshop (sites.google.com/view/figlang2020) [Ghosh et al., 2020] for the
"sarcasm detection shared task". This dataset consists of
sarcastic/non-sarcastic tweets and their corresponding contextual parent
tweets. The sarcastic tweets were collected using hashtags
like #sarcasm, #sarcastic, and #irony; similarly, non-sarcastic
tweets were collected using hashtags like #happy, #sad, and
#hate. This dataset sometimes contains more than one
contextual parent tweet, so in those cases, all of the contextual tweets
are considered independently with the target tweet.</p>
        <p>In both datasets, replies are the target comment/tweet to
be classified as sarcastic/non-sarcastic, and their
corresponding parent comment/tweet acts as context. Both datasets
contain comments/tweets of varying lengths, but because
this paper only focuses on detecting sarcasm in short text,
only the short comment/reply pairs were used. Comment/reply
sentences of length (number of words) less than 20 and 40 were used
in the case of the SARC and Twitter datasets respectively. In
both cases, the balanced datasets contain equal proportions
of sarcastic/non-sarcastic comment/reply pairs, and the
imbalanced datasets maintain a 20:80 ratio (approximately) between
sarcastic and non-sarcastic comment/reply pairs. Testing was
done on 10% of the dataset and the rest was used for
training; 10% of the training set was used for validation purposes.
Statistics of both datasets are shown in Table 1.</p>
        <p>[Table 1: training/testing set statistics for the Reddit (balanced &amp; imbalanced) and Twitter (balanced) datasets.]</p>
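        <p>As a sketch, the length filtering and splits described above could be implemented as follows (illustrative only; the tuple layout and the sequential carving of the splits are our assumptions, not the authors’ code).</p>
        <preformat>
def make_splits(pairs, max_len):
    """Keep only short comment/reply pairs, then carve out a 10% test set
    and use 10% of the remaining training data for validation (sketch)."""
    short = [(c, r, y) for c, r, y in pairs
             if len(c.split()) &lt; max_len and len(r.split()) &lt; max_len]
    n_test = len(short) // 10
    test, rest = short[:n_test], short[n_test:]
    n_val = len(rest) // 10
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

# max_len = 20 for SARC (Reddit) and 40 for the Twitter dataset
</preformat>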
      </sec>
      <sec id="sec-4-3">
        <title>4.2 Data Preprocessing</title>
        <p>The preprocessing of the textual data was done by first
lowercasing all the sentences and separating punctuation from the
words. We do not remove the stop-words because we believe
that stop-words sometimes play a major role in making a
sentence sarcastic, e.g., "is it?" and "am I?". The problem with
social media platforms is that users use a lot of abbreviations,
shortened words, and slang words like "IMO" for "in my
opinion", "lmk" for "let me know", "fr" for "for", etc. These words
are challenging to take care of in NLP tasks, particularly
in the automatic discovery of flexible word usages. So to solve
this problem, these words are converted to their corresponding
full forms using abbreviation/slang word dictionaries obtained
from Urban Dictionary (https://www.urbandictionary.com/). After this, all the sentences were
tokenized into a list of words. The proposed model has a fixed
input size for both comment and reply, but not all the sentences
were of the same length. So all the sentences were padded
to the length of the longest sentence (20 in the case of the
Reddit dataset and 40 in the case of the Twitter dataset). Word
embeddings are used to give semantically-meaningful dense
representations to the words. Word-based embeddings are
constructed using contextual words whereas character-based
embeddings are constructed from character n-grams of the
words. In contrast to word-based embeddings, character-based
embeddings solve the problem of out-of-vocabulary words and
perform better in the case of infrequent words by creating
word embeddings based only on their spellings. So for
generating proper representations for words we have used FastText,
a character-based word embedding. This not only gives
words better representations compared to a word-based model
but also incorporates slang/shortened/infrequent words (which
commonly appear on social media platforms).</p>
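        <p>A minimal sketch of this preprocessing pipeline is shown below (illustrative; the regular expression, the tiny slang dictionary, and the &lt;pad&gt; token are assumptions standing in for the full Urban Dictionary lookup).</p>
        <preformat>
import re

# Stand-in for the abbreviation/slang dictionary built from Urban Dictionary.
SLANG = {"imo": "in my opinion", "lmk": "let me know", "fr": "for"}

def preprocess(sentence: str, max_len: int):
    """Lowercase, split punctuation off words, expand slang, tokenize,
    and pad to a fixed length; stop-words are deliberately kept."""
    s = sentence.lower()
    s = re.sub(r"([!?.,;:'\"])", r" \1 ", s)                 # separate punctuation
    expanded = " ".join(SLANG.get(t, t) for t in s.split())  # expand slang words
    tokens = expanded.split()[:max_len]
    return tokens + ["&lt;pad&gt;"] * (max_len - len(tokens))
</preformat>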
      </sec>
      <sec id="sec-4-4">
        <title>4.3 Training Details</title>
        <p>We have used macro-averaged F1 (F1) and accuracy (Acc) scores
as the evaluation metrics, as is standard for the sarcasm
detection task. We have also reported Precision (P) and Recall
(R) scores for the Twitter dataset as well as for the
Reddit dataset (wherever available). Hyperparameter tuning
was used to find optimum values of the hyperparameters. The
FastText embeddings used were of size d = 30 and were
trained for 30 iterations with window sizes of 3 and 5 in the case
of the SARC and Twitter datasets respectively. The numbers of
filters in all the convolutional blocks were [64, 64], with heights
[2, 2]. The learning optimizer used is Adam with an initial
learning rate of 0.01. The value of α in all the LeakyReLU
layers was set to 0.3. All the models were trained for 20
epochs. L2 regularization set to $10^{-2}$ is applied to all the
feed-forward connections, along with early stopping with a
patience of 5 to avoid overfitting. The mini-batch size
was tuned amongst {100, 500, 1000, 2000, 3000, 4000}, and it
was observed that mini-batch sizes of 2000 and 500 gave the best
performance for the SARC and Twitter datasets respectively.</p>
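        <p>The reported configuration can be wired up roughly as follows (a sketch; applying weight decay through the optimizer is a simplification, since the paper applies L2 regularization to the feed-forward connections specifically).</p>
        <preformat>
import torch

EPOCHS, PATIENCE = 20, 5                     # early stopping on validation loss
BATCH_SIZE = {"SARC": 2000, "Twitter": 500}  # best mini-batch sizes after tuning

def make_optimizer(model: torch.nn.Module):
    """Adam with the reported initial learning rate and L2 strength."""
    return torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-2)

criterion = torch.nn.BCELoss()               # binary cross-entropy (equation 22)
</preformat>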
        <p>The recent success of transformer-based language models
has led to their wide usage in sentiment analysis tasks. They
are known for generating high-quality, high-dimensional word
representations (768-dimensional for BERT). Their main
drawback is that they require high processing power and memory
to train. The above-mentioned configuration of the proposed
model generates ≈1120K trainable parameters, and increasing
either the embedding size or the number of tokens in a
sentence led to an exponential increase in the number of trainable
parameters. So due to computational resource limitations, we
limited our experiments to lower-dimensional word
embeddings.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>Table 2: performance results on the SARC dataset (balanced and imbalanced).
Models | Balanced (Acc / F1 / P / R) | Imbalanced (Acc / F1 / P / R)
CNN-SVM [Poria et al., 2016] †⋆ | 68.0 / 68.0 / – / – | 69.0 / 79.0 / – / –
AMR [Ghaeini et al., 2018] ‡ | 69.5 / 69.5 / 74.8 / 69.7 | – / – / – / –
[Ghosh and Veale, 2017] ‡ | – / 67.8 / 68.2 / 67.9 | – / – / – / –
CUE-CNN [Amir et al., 2016] †⋆ | 70.0 / 69.0 / – / – | 73.0 / 81.0 / – / –
MHA-BiLSTM [Kumar et al., 2020] † | – / 77.5 / 72.6 / 83.0 | – / 56.8 / 60.3 / 53.7
CASCADE [Hazarika et al., 2018] ‡⋆ | 77.0 / 77.0 / – / – | 79.0 / 86.0 / – / –
CASCADE (only discourse features) ‡ | 68.0 / 66.0 / – / – | 68.0 / 78.0 / – / –
Bi-ISCA (this paper) ‡ | 72.3 / 75.7 / 74.2 / 77.6 | 71.9 / 74.4 / 73.0 / 75.8
Δ w.r.t. CASCADE (only discourse features) | 4.3 ↑ / 9.7 ↑ / – / – | 3.9 ↑ / 3.6 ↓ / – / –
† uses only the target sentence; ‡ uses context along with the target sentence;
⋆ uses personality-based features.</p>
      <p>Bi-ISCA focuses on only using the contextual
comment/tweet for detecting sarcasm rather than using any other
topical/personality-based features. Using only the contextual
information enriches the model’s ability to capture syntactical
and semantical textual properties responsible for invoking
sarcasm in any type of conversation. Table 2 reports performance
results on the SARC dataset. For comparison purposes, the
F1 score (F1), Accuracy score (Acc), Precision (P), and Recall (R)
were used.</p>
      <p>
        When compared with the existing works, Bi-ISCA was able
to outperform all the models (only ‡) that use only
conversational context for sarcasm detection
        <xref ref-type="bibr" rid="ref11 ref13 ref18 ref30 ref9">(Improvement of Δ 7.9%
in F1 score when compared to [Ghosh and Veale, 2017]; Δ
6.2% in F1 score and Δ 2.8% in accuracy when compared to
AMR [Ghaeini et al., 2018])</xref>
        , and was even able to perform
better than the models (†⋆) that use personality-based features
along with the target sentence for detecting sarcasm
        <xref ref-type="bibr" rid="ref1 ref1 ref10 ref10 ref24 ref24 ref8 ref8">(improvement of Δ 7.7% in F1 and Δ 4.3% in accuracy score when
compared to CNN-SVM [Poria et al., 2016]; Δ 6.7% in F1
score and Δ 2.3% in accuracy when compared to CUE-CNN
[Amir et al., 2016])</xref>
        . MHA-BiLSTM [Kumar et al., 2020]
had a Δ 1.8% higher F1 score in the balanced dataset but
Bi-ISCA was able to show drastic improvement of Δ 17.6%
in the imbalanced dataset, which demonstrated the ability of
Bi-ISCA to handle class imbalance.
      </p>
      <p>The current state-of-the-art on the SARC dataset is achieved
by CASCADE. Even though CASCADE uses personality-based
features and contextual information along with large
sentences of average length ≈55-62 (very large compared to
our dataset, which gives it the advantage of using a lot
more contextual information), Bi-ISCA was able to achieve
an F1 score comparable to it (despite using relatively short
text). In comparison with the CASCADE variant that only uses
discourse-based features, Bi-ISCA performed drastically better, with an
increase of Δ 9.7% in F1 and Δ 4.3% in accuracy score for
the balanced dataset.</p>
      <p>Bi-ISCA clearly demonstrated its capability to robustly
handle an imbalance in the dataset, although it was unable to
outperform both the CASCADE models. This slightly poorer
performance on the imbalanced dataset can be explained by
the length of sentences used by CASCADE, which is
significantly (≈5 times) greater than that of the ones on which Bi-ISCA
was tested. Longer sentences result in increased contextual
information, which improves performance, especially in the
case of imbalance, where a little extra information can lead to a
drastic increase in performance.</p>
      <p>[Table 3: performance comparison on the Twitter (FigLang 2020) dataset. Models compared: Baseline (LSTM with attention); BERT-Large+BiLSTM+SVM [Baruah et al., 2020]; BERT+CNN+LSTM [Srivastava et al., 2020]; RoBERTa+LSTM [Kumar and Anand, 2020]; RoBERTa-Large [Dong et al., 2020]; RoBERTa + Multi-Initialization Ensemble [Jaiswal, 2020]; BERT + BiLSTM + NeXtVLAD + Context Ensemble + Data Augmentation [Lee et al., 2020]; and Bi-ISCA (this paper).]</p>
      <p>The attention scores generated by the attention mechanism
make the proposed model highly interpretable. Table 4
showcases the distribution of the attention scores over four sarcastic
(correctly predicted by Bi-ISCA) comment-reply pairs from
the SARC dataset. Not only was the proposed model able to
correctly detect sarcasm in these pairs of sentences, but it was also
able to correctly identify the words responsible for the contextual,
explicit, or implicit incongruity which invokes sarcasm.</p>
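      <p>Table 4-style visualizations can be produced by rendering the attention scores of equations 5-8 as a heatmap over the tokens; a minimal matplotlib sketch is given below (illustrative only; the figure size and colormap are arbitrary choices).</p>
      <preformat>
import matplotlib.pyplot as plt

def plot_attention(tokens, scores, title="Bi-ISCA attention"):
    """Render one sentence's per-token attention scores as a one-row heatmap."""
    fig, ax = plt.subplots(figsize=(max(4, len(tokens) * 0.6), 1.5))
    ax.imshow([scores], aspect="auto", cmap="Reds")
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=45, ha="right")
    ax.set_yticks([])
    ax.set_title(title)
    plt.tight_layout()
    plt.show()
</preformat>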
      <p>For example, in Pair 1, Bi-ISCA correctly identified
explicitly incongruous words like "amazing" and "force" in the reply
sentence, which were responsible for the sarcastic nature of
the reply. Interestingly, the word "traumatized" in the parent
comment also had a high attention weight value, which shows
that the proposed attention mechanism was able to learn the
contextual incongruity between opposite sentiment words
like "traumatized" &amp; "amazing" in the comment-reply pair.
Pair 2 demonstrates the model’s ability to capture words
responsible for invoking sarcasm by making sentences implicitly
incongruous. Sarcasm due to implicit incongruity is usually
the toughest to perceive. Despite this, Bi-ISCA was able to
give high attention weights to words like "announces" and
"crashes &amp; security holes". Not only this, but the proposed
intra-sentence attention mechanism was also able to learn a
link between "microsoft" and "m" (slang for Microsoft)
without having any prior knowledge related to slang. Pair 3 is
also an example of an explicitly and contextually incongruous
comment-reply pair, where the model was successfully able
to capture opposite sentiment words &amp; phrases like "blind
drunk", "cautious" and "behind the wheel" that made the reply
sarcastic in nature. Pair 4 is an example of sarcasm due to
implicit incongruity between the words "pause" &amp; "watch",
and simultaneously contextual incongruity between "reported"
&amp; "enjoyable", both of which were successfully captured by
Bi-ISCA.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper, we introduce a novel Bidirectional
Inter-Sentence Contextual Attention mechanism based model (Bi-ISCA) for
detecting sarcasm. The proposed model not only was able to
capture both intra- and inter-sentence dependencies but was also
able to achieve state-of-the-art results in detecting sarcasm
in user-generated short text using only the conversational
context. Further investigation of attention maps illustrated
Bi-ISCA’s ability to capture explicitly, implicitly, and
contextually incongruous words &amp; phrases responsible for invoking
sarcasm. The success of the proposed model is due
to the use of character-based embeddings that take care of
slang/shortened &amp; out-of-vocabulary words, Bi-LSTMs that
capture intra-sentence dependencies between words in the
same sentence, and Bi-ISCA, which captures inter-sentence
dependencies between words of different sentences.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Amir et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Silvio</given-names>
            <surname>Amir</surname>
          </string-name>
          , Byron C Wallace, Hao Lyu, Paula Carvalho, and Silva Mário J.
          <article-title>Modelling context with user embeddings for sarcasm detection in social media</article-title>
          .
          <source>Proceedings of the Conference on Natural Language Learning (CoNLL)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>[Bamman and Smith</source>
          , 2015]
          <string-name>
            <given-names>David</given-names>
            <surname>Bamman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Noah A</given-names>
            <surname>Smith.</surname>
          </string-name>
          <article-title>Contextualized sarcasm detection on twitter</article-title>
          .
          <source>In Ninth International AAAI Conference on Web and Social Media</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Barbieri et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Barbieri</surname>
          </string-name>
          , Horacio Saggion, and
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Ronzano</surname>
          </string-name>
          .
          <article-title>Modelling sarcasm in twitter, a novel approach</article-title>
          .
          <source>In Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis</source>
          , pages
          <fpage>50</fpage>
          -
          <lpage>58</lpage>
          , Baltimore, Maryland,
          <year>June 2014</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Baruah et al.,
          <year>2020</year>
          ]
          <string-name>
            <given-names>Arup</given-names>
            <surname>Baruah</surname>
          </string-name>
          ,
          <string-name>
            <surname>Kaushik Das</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ferdous Barbhuiya</surname>
            , and
            <given-names>Kuntal</given-names>
          </string-name>
          <string-name>
            <surname>Dey</surname>
          </string-name>
          .
          <article-title>Context-aware sarcasm detection using BERT</article-title>
          .
          <source>In Proceedings of the Second Workshop on Figurative Language Processing</source>
          , pages
          <fpage>83</fpage>
          -
          <lpage>87</lpage>
          , Online,
          <year>July 2020</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Cho et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
          , Bart van Merriënboer,
          <string-name>
            <surname>Caglar Gulcehre</surname>
            , Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
            <given-names>Yoshua</given-names>
          </string-name>
          <string-name>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>1724</fpage>
          -
          <lpage>1734</lpage>
          , Doha, Qatar,
          <year>October 2014</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Dong et al.,
          <year>2020</year>
          ]
          <string-name>
            <given-names>Xiangjue</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Changmao</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jinho D.</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <article-title>Transformer-based context-aware sarcasm detection in conversation threads from social media</article-title>
          .
          <source>In Proceedings of the Second Workshop on Figurative Language Processing</source>
          , pages
          <fpage>276</fpage>
          -
          <lpage>280</lpage>
          , Online,
          <year>July 2020</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Eisterhold et al.,
          <year>2006</year>
          ]
          <string-name>
            <given-names>Jodi</given-names>
            <surname>Eisterhold</surname>
          </string-name>
          , Salvatore Attardo, and
          <string-name>
            <given-names>Diana</given-names>
            <surname>Boxer</surname>
          </string-name>
          .
          <article-title>Reactions to irony in discourse: evidence for the least disruption principle</article-title>
          .
          <source>Journal of Pragmatics</source>
          ,
          <volume>38</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1239</fpage>
          -
          <lpage>1256</lpage>
          ,
          <year>2006</year>
          .
          <article-title>Focus-on Issue: Discourse and Conversation</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Farías et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Delia</given-names>
            <surname>Irazú Hernández Farías</surname>
          </string-name>
          , Viviana Patti, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <article-title>Irony detection in twitter: The role of affective content</article-title>
          .
          <source>ACM Trans. Internet Technol</source>
          .,
          <volume>16</volume>
          (
          <issue>3</issue>
          ),
          <year>July 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Ghaeini et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Reza</given-names>
            <surname>Ghaeini</surname>
          </string-name>
          , Xiaoli
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fern</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Prasad</given-names>
            <surname>Tadepalli</surname>
          </string-name>
          .
          <article-title>Attentional multi-reading sarcasm detection</article-title>
          .
          <source>CoRR</source>
          , abs/1809.03051,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>[Ghosh and Veale</source>
          , 2016]
          <string-name>
            <given-names>Aniruddha</given-names>
            <surname>Ghosh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tony</given-names>
            <surname>Veale</surname>
          </string-name>
          .
          <article-title>Fracking sarcasm using neural network</article-title>
          .
          <source>In Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis</source>
          , pages
          <fpage>161</fpage>
          -
          <lpage>169</lpage>
          , San Diego, California, June 2016.
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>[Ghosh and Veale</source>
          , 2017]
          <string-name>
            <given-names>Aniruddha</given-names>
            <surname>Ghosh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tony</given-names>
            <surname>Veale</surname>
          </string-name>
          .
          <article-title>Magnets for sarcasm: Making sarcasm detection timely, contextual and very personal</article-title>
          .
          <source>In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>482</fpage>
          -
          <lpage>491</lpage>
          , Copenhagen, Denmark,
          <year>September 2017</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Ghosh et al.,
          <year>2020</year>
          ]
          <string-name>
            <given-names>Debanjan</given-names>
            <surname>Ghosh</surname>
          </string-name>
          , Avijit Vajpayee, and
          <string-name>
            <given-names>Smaranda</given-names>
            <surname>Muresan</surname>
          </string-name>
          .
          <article-title>A report on the 2020 sarcasm detection shared task</article-title>
          .
          <source>In Proceedings of the Second Workshop on Figurative Language Processing</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          , Online,
          <year>July 2020</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Hazarika et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Devamanyu</given-names>
            <surname>Hazarika</surname>
          </string-name>
          , Soujanya Poria, Sruthi Gorantla, Erik Cambria, Roger Zimmermann, and
          <article-title>Rada Mihalcea. CASCADE: Contextual sarcasm detection in online discussion forums</article-title>
          .
          <source>In Proceedings of the 27th International Conference on Computational Linguistics</source>
          , pages
          <fpage>1837</fpage>
          -
          <lpage>1848</lpage>
          ,
          Santa Fe
          , New Mexico, USA,
          <year>August 2018</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>[Hochreiter and Schmidhuber</source>
          , 1997]
          <article-title>Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <source>[Jaiswal</source>
          , 2020]
          <string-name>
            <given-names>Nikhil</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          .
          <article-title>Neural sarcasm detection using conversation context</article-title>
          .
          <source>In Proceedings of the Second Workshop on Figurative Language Processing</source>
          , pages
          <fpage>77</fpage>
          -
          <lpage>82</lpage>
          , Online,
          <year>July 2020</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Joshi et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Aditya</given-names>
            <surname>Joshi</surname>
          </string-name>
          , Vinita Sharma, and
          <string-name>
            <given-names>Pushpak</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          .
          <article-title>Harnessing context incongruity for sarcasm detection</article-title>
          .
          <source>In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)</source>
          , pages
          <fpage>757</fpage>
          -
          <lpage>762</lpage>
          , Beijing, China,
          <year>July 2015</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [Khattri et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Anupam</given-names>
            <surname>Khattri</surname>
          </string-name>
          , Aditya Joshi, Pushpak Bhattacharyya, and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Carman</surname>
          </string-name>
          .
          <article-title>Your sentiment precedes you: Using an author's historical tweets to predict sarcasm</article-title>
          .
          <source>In Proceedings of the 6th workshop on computational approaches to subjectivity, sentiment and social media analysis</source>
          , pages
          <fpage>25</fpage>
          -
          <lpage>30</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [Khodak et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Mikhail</given-names>
            <surname>Khodak</surname>
          </string-name>
          , Nikunj Saunshi, and
          <string-name>
            <given-names>Kiran</given-names>
            <surname>Vodrahalli</surname>
          </string-name>
          .
          <article-title>A large self-annotated corpus for sarcasm</article-title>
          .
          <source>In Proceedings of the Linguistic Resource and Evaluation Conference (LREC)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>[Kumar and Anand</source>
          , 2020]
          <string-name>
            <given-names>Amardeep</given-names>
            <surname>Kumar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vivek</given-names>
            <surname>Anand</surname>
          </string-name>
          .
          <article-title>Transformers on sarcasm detection with context</article-title>
          .
          <source>In Proceedings of the Second Workshop on Figurative Language Processing</source>
          , pages
          <fpage>88</fpage>
          -
          <lpage>92</lpage>
          , Online,
          <year>July 2020</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [Kumar et al.,
          <year>2020</year>
          ]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. T.</given-names>
            <surname>Narapareddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Aditya Srikanth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Malapati</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. B. M.</given-names>
            <surname>Neti</surname>
          </string-name>
          .
          <article-title>Sarcasm detection using multi-head attention based bidirectional lstm</article-title>
          .
          <source>IEEE Access</source>
          ,
          <volume>8</volume>
          :
          <fpage>6388</fpage>
          -
          <lpage>6397</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [Lecun et al.,
          <year>1998</year>
          ]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lecun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Haffner</surname>
          </string-name>
          .
          <article-title>Gradient-based learning applied to document recognition</article-title>
          .
          <source>Proceedings of the IEEE</source>
          ,
          <volume>86</volume>
          (
          <issue>11</issue>
          ):
          <fpage>2278</fpage>
          -
          <lpage>2324</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [Lee et al.,
          <year>2020</year>
          ]
          <string-name>
            <given-names>Hankyol</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Youngjae</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Gunhee</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>Augmenting data for sarcasm detection with unlabeled conversation context</article-title>
          .
          <source>In Proceedings of the Second Workshop on Figurative Language Processing</source>
          , pages
          <fpage>12</fpage>
          -
          <lpage>17</lpage>
          , Online,
          <year>July 2020</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [Liebrecht et al.,
          <year>2013</year>
          ]
          <string-name>
            <given-names>Christine</given-names>
            <surname>Liebrecht</surname>
          </string-name>
          , Florian Kunneman, and Antal van den Bosch.
          <article-title>The perfect solution for detecting sarcasm in tweets #not</article-title>
          .
          <source>In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis</source>
          , pages
          <fpage>29</fpage>
          -
          <lpage>37</lpage>
          , Atlanta, Georgia,
          <year>June 2013</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [Poria et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Soujanya</given-names>
            <surname>Poria</surname>
          </string-name>
          , Erik Cambria, Devamanyu Hazarika, and
          <string-name>
            <given-names>Prateek</given-names>
            <surname>Vij</surname>
          </string-name>
          .
          <article-title>A deeper look into sarcastic tweets using deep convolutional neural networks</article-title>
          .
          <source>In Proceedings of COLING</source>
          <year>2016</year>
          ,
          <source>the 26th International Conference on Computational Linguistics: Technical Papers</source>
          , pages
          <fpage>1601</fpage>
          -
          <lpage>1612</lpage>
          , Osaka, Japan,
          <year>December 2016</year>
          .
          <article-title>The COLING 2016 Organizing Committee</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [Rajadesingan et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Ashwin</given-names>
            <surname>Rajadesingan</surname>
          </string-name>
          , Reza Zafarani, and Huan Liu.
          <article-title>Sarcasm detection on Twitter: A behavioral modeling approach</article-title>
          .
          <source>In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM '15</source>
          , pages
          <fpage>97</fpage>
          -
          <lpage>106</lpage>
          , New York, NY, USA,
          <year>2015</year>
          .
          <article-title>Association for Computing Machinery</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [Reyes et al.,
          <year>2013</year>
          ] Antonio Reyes, Paolo Rosso, and
          <string-name>
            <given-names>Tony</given-names>
            <surname>Veale</surname>
          </string-name>
          .
          <article-title>A multidimensional approach for detecting irony in Twitter</article-title>
          .
          <source>Language resources and evaluation</source>
          ,
          <volume>47</volume>
          (
          <issue>1</issue>
          ):
          <fpage>239</fpage>
          -
          <lpage>268</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [Riloff et al.,
          <year>2013</year>
          ]
          <string-name>
            <given-names>Ellen</given-names>
            <surname>Riloff</surname>
          </string-name>
          , Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, and
          <string-name>
            <given-names>Ruihong</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <article-title>Sarcasm as contrast between a positive sentiment and negative situation</article-title>
          .
          <source>In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL</source>
          , pages
          <fpage>704</fpage>
          -
          <lpage>714</lpage>
          . ACL,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [Schuster and Paliwal,
          <year>1997</year>
          ]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuster</surname>
          </string-name>
          and
          <string-name>
            <given-names>K. K.</given-names>
            <surname>Paliwal</surname>
          </string-name>
          .
          <article-title>Bidirectional recurrent neural networks</article-title>
          .
          <source>IEEE Transactions on Signal Processing</source>
          ,
          <volume>45</volume>
          (
          <issue>11</issue>
          ):
          <fpage>2673</fpage>
          -
          <lpage>2681</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [Srivastava et al.,
          <year>2020</year>
          ]
          <string-name>
            <given-names>Himani</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , Vaibhav Varshney, Surabhi Kumari, and
          <string-name>
            <given-names>Saurabh</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>A novel hierarchical BERT architecture for sarcasm detection</article-title>
          .
          <source>In Proceedings of the Second Workshop on Figurative Language Processing</source>
          , pages
          <fpage>93</fpage>
          -
          <lpage>97</lpage>
          , Online,
          <year>July 2020</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [Tay et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Yi</given-names>
            <surname>Tay</surname>
          </string-name>
          , Anh Tuan Luu, Siu Cheung Hui, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Su</surname>
          </string-name>
          .
          <article-title>Reasoning with sarcasm by reading in-between</article-title>
          .
          <source>In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>1010</fpage>
          -
          <lpage>1020</lpage>
          , Melbourne, Australia,
          <year>July 2018</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [Tsur et al.,
          <year>2010</year>
          ]
          <string-name>
            <given-names>Oren</given-names>
            <surname>Tsur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Davidov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ari</given-names>
            <surname>Rappoport</surname>
          </string-name>
          .
          <article-title>ICWSM - a great catchy name: Semi-supervised recognition of sarcastic sentences in online product reviews</article-title>
          .
          <source>In Fourth International AAAI Conference on Weblogs and Social Media</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [Vaswani et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin.
          <article-title>Attention is all you need</article-title>
          . In I. Guyon,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , and R. Garnett, editors,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          , pages
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          . Curran Associates, Inc.,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [Wilson,
          <year>2006</year>
          ]
          <string-name>
            <given-names>Deirdre</given-names>
            <surname>Wilson</surname>
          </string-name>
          .
          <article-title>The pragmatics of verbal irony: Echo or pretence?</article-title>
          <source>Lingua</source>
          ,
          <volume>116</volume>
          (
          <issue>10</issue>
          ):
          <fpage>1722</fpage>
          -
          <lpage>1743</lpage>
          ,
          <year>2006</year>
          .
          <article-title>Language in Mind: A Tribute to Neil Smith on the Occasion of his Retirement</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>