Twelfth International Workshop Modelling and Reasoning in Context (MRC) @IJCAI 2021

Bi-ISCA: Bidirectional Inter-Sentence Contextual Attention Mechanism for Detecting Sarcasm in User Generated Noisy Short Text

Prakamya Mishra1∗, Saroj Kaushik2, Kuntal Dey3
1,2 Shiv Nadar University; 3 Accenture Technology Labs
{pm669, saroj.kaushik}@snu.edu.in, kuntal.dey@accenture.com

∗ Contact Author
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Many online comments on social media platforms are hateful, humorous, or sarcastic. The sarcastic nature of these comments (especially the short ones) alters their actual implied sentiment, which leads to misinterpretations by existing sentiment analysis models. A lot of research has already been done to detect sarcasm in text using user-based, topical, and conversational information, but not much work has been done on using inter-sentence contextual information for the same task. This paper proposes a new deep learning architecture that uses a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) to capture inter-sentence dependencies for detecting sarcasm in user-generated short text using only the conversational context. The proposed deep learning model demonstrates the capability to capture explicitly, implicitly, and contextually incongruous words and phrases responsible for invoking sarcasm. Bi-ISCA generates results comparable to the state-of-the-art on two widely used benchmark datasets for the sarcasm detection task (Reddit and Twitter). To the best of our knowledge, none of the existing models use an inter-sentence contextual attention mechanism to detect sarcasm in user-generated short text using only conversational context.

1 Introduction

Sentiment analysis is one of the most important natural language processing (NLP) applications. Its goal is to identify, extract, quantify, and study subjective information. The sudden rise in the usage of social media platforms as a means of communication has led to a vast amount of data being shared between their users on a wide range of topics. This type of data is very helpful to organizations for analyzing the sentiments of people towards products, movies, political events, etc. Understanding the unique intricacies of human language remains one of the most important open NLP problems of this time. Humans regularly use sarcasm as a crucial part of day-to-day conversations when venting, arguing, or simply engaging on social media platforms. Sarcastic remarks on these platforms make it hard for existing sentiment analysis systems to identify the true intentions of the users.

The Cambridge Dictionary1 describes sarcasm as irony conveyed humorously or amusingly to criticize something. Sarcasm may not show criticism on the surface but might instead have a criticizing implied meaning. Such a figurative aspect of sarcasm makes it difficult to detect in modern micro texts [Ghosh and Veale, 2016]. A considerable amount of linguistic research has analyzed different aspects of sarcasm. The kind of responses evoked by a comment has been considered a major indicator of sarcasm [Eisterhold et al., 2006]. [Wilson, 2006] states that circumstantial incongruity between a comment and its corresponding contextual information plays an important role in implying sarcasm.

Previous research works have used policy-based, statistical, and deep-learning-based methods for detecting sarcasm. The use of contextual information like conversational context, author personality features, or prior knowledge of the topic has proved to be very useful. [Khattri et al., 2015] used sentiments of the author's historical tweets as context. [Rajadesingan et al., 2015] used personality features like the author's familiarity with Twitter, language (structure and word usage), and the author's familiarity with sarcasm (history of previous sarcastic tweets) for consolidating context. [Bamman and Smith, 2015] explored the use of historical terms, topics, and sentiments along with profile information as the author's context. They also exploited conversational context such as the immediately preceding tweets in the thread. [Joshi et al., 2015] demonstrated that concatenating the preceding comment with the objective comment in a discussion forum led to an increase in the precision score.

Overall, in recent years a lot of work has been done on using different types of contextual information for sarcasm detection, but none of it has used inter-sentence dependencies. In this paper, we propose a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) based deep learning neural network for sarcasm detection. The main contributions of this paper can be summarised as follows:

• We propose a new deep learning architecture that uses a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) for detecting sarcasm in short texts (short texts are more difficult to analyze due to the shortage of contextual information).

• Bi-ISCA focuses on using only the conversational contextual comment/tweet for detecting sarcasm rather than any other topical/personality-based features, as using only the contextual information enriches the model's ability to capture the syntactic and semantic textual properties responsible for invoking sarcasm.

• We also explain model behavior and predictions by visualizing attention maps generated by Bi-ISCA, which helps in identifying the significant parts of the sentences responsible for invoking sarcasm.

The rest of the paper is organized as follows. Section 2 describes the related work. Section 3 explains the proposed model architecture for detecting sarcasm. Section 4 describes the datasets used, the pre-processing pipeline, and training details for reproducibility. Experimental results are explained in Section 5, and Section 6 illustrates model behavior and predictions by visualizing attention maps. Finally, we conclude in Section 7.

1 https://dictionary.cambridge.org/

2 Related Work

A diverse spectrum of approaches has been used to detect sarcasm. Recent sarcasm detection approaches have mainly focused either on machine learning approaches that leverage explicitly declared relevant features, or on neural network based deep learning approaches that do not require handcrafted features. The recent advances in using deep learning for natural language processing tasks have led to a promising increase in the performance of these sarcasm detection systems.

A lot of research has been done using bag-of-words features. However, to improve performance, scholars started to explore several other semantic and syntactic features such as punctuation [Tsur et al., 2010]; emotion marks and intensifiers [Liebrecht et al., 2013]; positive verbs and negative phrases [Riloff et al., 2013]; polarity skip-grams [Reyes et al., 2013]; synonyms and ambiguity [Barbieri et al., 2014]; implicit and explicit incongruity-based features [Joshi et al., 2015]; sentiment flips [Rajadesingan et al., 2015]; and affect-based features derived from multiple emotion lexicons [Farías et al., 2016].

Every day an enormous amount of short text data is generated by users on popular social media platforms like Twitter2 and Reddit3. Easy accessibility of such data sources has enticed researchers to use them for extracting user-based and discourse-based features. [Hazarika et al., 2018] utilized contextual information by building user embeddings for capturing indicative behavioral traits. These user embeddings incorporated personality features along with the author's writing style (using historical posts). They also used discourse comments along with background cues and topical information for detecting sarcasm. They performed their experiments on the largest Reddit dataset, SARC [Khodak et al., 2018]. Many works have used only the target text for classification, where a target text is a textual unit that has to be classified as sarcastic or not. Simply using gated recurrent units (GRU) [Cho et al., 2014] or long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997] does not capture the in-between interactions of word pairs, which makes it difficult to model contrast and incongruity. [Tay et al., 2018] were able to solve this problem by looking in between word pairs using a multi-dimensional intra-attention recurrent network. They focused on modeling the intra-sentence relationships among the words. [Kumar et al., 2020] exploited a multi-head attention mechanism [Vaswani et al., 2017], which can capture dependencies between different representation subspaces at different positions. Their model consisted of a word encoder for generating new word representations by summarizing comment contextual information in a bidirectional manner. On top of that, they used multi-head attention for focusing on different contexts of a sentence, and in the end a simple multi-layer perceptron was used for classification.

There has not been much work done on conversation-dependent (comment and reply) approaches for sarcasm detection. [Ghaeini et al., 2018] proposed a model that used not only information from the target utterance but also its conversational context to perceive sarcasm. They aimed to detect sarcasm by just using the sequences of sentences, without any extra knowledge about the user and topic. They combined the predictions from utterance-only and conversation-dependent parts to generate the final prediction, which was able to capture the words responsible for delivering sarcasm. [Ghosh and Veale, 2017] also modeled conversational context for sarcasm detection, and additionally attempted to derive what parts of the conversational context triggered a sarcastic reply. Their proposed model used sentence embeddings created by averaging word embeddings, and a sentence-level attention mechanism generated attention-induced representations of both the context and the response, which were later concatenated and used for classification.

Among all the previous works, [Ghaeini et al., 2018] and [Ghosh and Veale, 2017] share similar motives of detecting sarcasm using only the conversational context. However, we introduce a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) for detecting sarcasm. Unlike previous works, our work considers short texts, in which sarcasm is far more challenging to detect than in long texts, as long texts provide much more contextual information.

2 www.twitter.com/
3 www.reddit.com/

3 Model

This section introduces the proposed Bi-ISCA: Bidirectional Inter-Sentence Contextual Attention based neural network for sarcasm detection (as shown in Figure 1). Sarcasm detection is a binary classification task that tries to predict whether a given comment is sarcastic or not. The proposed model uses comment-reply pairs for detecting sarcasm. The input to the model is represented by $U = [W^u_1, W^u_2, \ldots, W^u_n]$ and $V = [W^v_1, W^v_2, \ldots, W^v_n]$, where $U$ represents the comment sentence and $V$ represents the reply sentence (both sentences padded to a length of $n$). Here, $W^u_i, W^v_j \in \mathbb{R}^d$ are $d$-dimensional word embedding vectors. The objective is to predict the label $y$, which indicates whether the reply to the corresponding comment was sarcastic or not.

Figure 1: Bi-ISCA: Bi-Directional Inter-Sentence Contextual Attention Mechanism for Sarcasm Detection.

3.1 Intra-Sentence Word Encoder Layer

The primary purpose of this layer is to summarize intra-sentence contextual information from both directions in both sentences (comment and reply) using Bidirectional Long Short-Term Memory networks (Bi-LSTM). A Bi-LSTM [Schuster and Paliwal, 1997] processes information in both directions using a forward LSTM [Hochreiter and Schmidhuber, 1997] $\overrightarrow{h}$ that reads the sentence $S = [w_1, w_2, \ldots, w_n]$ from $w_1$ to $w_n$, and a backward LSTM $\overleftarrow{h}$ that reads the sentence from $w_n$ to $w_1$. Hidden states from both LSTMs are added to get the final hidden state representation of each word. So the hidden state representation of the $t$-th word ($h_t$) can be represented by the sum of the $t$-th hidden representations of the forward and backward LSTMs ($\overrightarrow{h}_t$, $\overleftarrow{h}_t$), as shown in the equations below:

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(w_t, \overrightarrow{h}_{t-1}); \qquad \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(w_t, \overleftarrow{h}_{t-1}) \tag{1}$$

$$h_t = \overrightarrow{h}_t + \overleftarrow{h}_t \tag{2}$$

This Intra-Sentence Word Encoder Layer consists of two independent bidirectional LSTMs, one for the comment ($BiLSTM_c$) and one for the reply ($BiLSTM_r$). Apart from the hidden states, both Bi-LSTMs also generate separate (forward and backward) final cell states, represented by $\overrightarrow{C}$ and $\overleftarrow{C}$. The comment sentence $U$ is given as input to $BiLSTM_c$ and the reply sentence $V$ is given as input to $BiLSTM_r$. The outputs of both Bi-LSTMs are represented by Equations 3 and 4:

$$\overrightarrow{C}_u,\; h^u,\; \overleftarrow{C}_u = BiLSTM_c(U) \tag{3}$$

$$\overrightarrow{C}_v,\; h^v,\; \overleftarrow{C}_v = BiLSTM_r(V) \tag{4}$$

Here, $\overrightarrow{C}_u, \overrightarrow{C}_v \in \mathbb{R}^d$ are the final cell states of the forward LSTMs of $BiLSTM_c$ and $BiLSTM_r$; $\overleftarrow{C}_u, \overleftarrow{C}_v \in \mathbb{R}^d$ are the final cell states of the backward LSTMs of $BiLSTM_c$ and $BiLSTM_r$; and $h^u = [h^u_1, h^u_2, \ldots, h^u_n]$ and $h^v = [h^v_1, h^v_2, \ldots, h^v_n]$ are the hidden state representations of $BiLSTM_c$ and $BiLSTM_r$ respectively, where $h^u_i, h^v_j \in \mathbb{R}^d$ and $h^u, h^v \in \mathbb{R}^{n \times d}$.

3.2 Bi-ISCA: Bidirectional Inter-Sentence Contextual Attention Mechanism

Sarcasm is context-dependent in nature. Even humans sometimes have a hard time understanding sarcasm without any contextual information. The hidden states generated by the two Bi-LSTMs ($BiLSTM_c$ and $BiLSTM_r$) capture the intra-sentence bidirectional contextual information in the comment and reply respectively, but fail to capture the inter-sentence contextual information between them. This paper introduces a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) for capturing the inter-sentence contextual information between the two sentences.

Bi-ISCA uses the hidden state representations of $U$ and $V$ along with the auxiliary sentence's cell state representations ($\overrightarrow{C}$ and $\overleftarrow{C}$) to capture the inter-sentence contextual information. First, the attention mechanism computes four sets of attention scores, namely $\overrightarrow{\alpha}^{C_u}, \overleftarrow{\alpha}^{C_u}, \overrightarrow{\alpha}^{C_v}, \overleftarrow{\alpha}^{C_v} \in \mathbb{R}^n$. These sets of inter-sentence attention scores are used to generate new inter-sentence contextualized hidden representations. The scores $(\overrightarrow{\alpha}^{C_u}, \overleftarrow{\alpha}^{C_u})$ are calculated using the hidden state representations of $BiLSTM_r$ along with the forward and backward final cell states $(\overrightarrow{C}_u, \overleftarrow{C}_u)$ of $BiLSTM_c$ (as shown in Equations 5 and 6).
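Concretely, each inter-sentence attention score is a dot product between one sentence's final (forward or backward) cell state and a hidden state of the other sentence. The following pure-Python sketch (illustrative names only, not the authors' code) shows the computation for the forward comment cell state attended over the reply's hidden states:

```python
# Sketch (not the authors' code) of the Bi-ISCA attention scores in
# Equations 5-8: each score is a dot product between a final cell state
# of one sentence's Bi-LSTM and a hidden state of the other sentence.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def inter_sentence_scores(cell_state, hidden_states):
    """alpha_i = C . h_i, for every time step of the other sentence."""
    return [dot(cell_state, h) for h in hidden_states]

# Toy d = 2 example: forward cell state of the comment encoder (C_u)
# attended over three reply hidden states (h^v_1 .. h^v_3).
C_u_fwd = [1.0, 2.0]
h_v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = inter_sentence_scores(C_u_fwd, h_v)
print(scores)  # [1.0, 2.0, 3.0]
```

The remaining three score vectors are computed the same way with the other cell states, and Equations 9-12 then scale each hidden state by its score.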
Similarly, $(\overrightarrow{\alpha}^{C_v}, \overleftarrow{\alpha}^{C_v})$ are calculated using the hidden state representations of $BiLSTM_c$ along with the forward and backward final cell states $(\overrightarrow{C}_v, \overleftarrow{C}_v)$ of $BiLSTM_r$ (as shown in Equations 7 and 8). In the equations below, $(\bullet)$ denotes the dot product between two vectors.

$$\overrightarrow{\alpha}^{C_u} = [\overrightarrow{\alpha}^{C_u}_1, \ldots, \overrightarrow{\alpha}^{C_u}_n]; \qquad \overrightarrow{\alpha}^{C_u}_i = \overrightarrow{C}_u \bullet h^v_i \tag{5}$$

$$\overleftarrow{\alpha}^{C_u} = [\overleftarrow{\alpha}^{C_u}_1, \ldots, \overleftarrow{\alpha}^{C_u}_n]; \qquad \overleftarrow{\alpha}^{C_u}_i = \overleftarrow{C}_u \bullet h^v_i \tag{6}$$

$$\overrightarrow{\alpha}^{C_v} = [\overrightarrow{\alpha}^{C_v}_1, \ldots, \overrightarrow{\alpha}^{C_v}_n]; \qquad \overrightarrow{\alpha}^{C_v}_i = \overrightarrow{C}_v \bullet h^u_i \tag{7}$$

$$\overleftarrow{\alpha}^{C_v} = [\overleftarrow{\alpha}^{C_v}_1, \ldots, \overleftarrow{\alpha}^{C_v}_n]; \qquad \overleftarrow{\alpha}^{C_v}_i = \overleftarrow{C}_v \bullet h^u_i \tag{8}$$

In the next step, the attention scores $(\overrightarrow{\alpha}^{C_u}, \overleftarrow{\alpha}^{C_u})$ are multiplied back with the hidden state representations of $BiLSTM_r$ to generate two new sets of hidden representations $\overrightarrow{h}^{C_u}_v, \overleftarrow{h}^{C_u}_v \in \mathbb{R}^{n \times d}$ of the reply sentence, namely reply contextualized on comment (forward) and reply contextualized on comment (backward), respectively (as shown in Equations 9 and 10). Similarly, $(\overrightarrow{\alpha}^{C_v}, \overleftarrow{\alpha}^{C_v})$ are multiplied back with the hidden state representations of $BiLSTM_c$ to generate two new sets of hidden representations $\overrightarrow{h}^{C_v}_u, \overleftarrow{h}^{C_v}_u \in \mathbb{R}^{n \times d}$ of the comment sentence, namely comment contextualized on reply (forward) and comment contextualized on reply (backward), respectively (as shown in Equations 11 and 12). In the equations below, $(\times)$ denotes multiplication between a scalar and a vector.

$$\overrightarrow{h}^{C_u}_v = [\overrightarrow{h}^{C_u}_{v,1}, \ldots, \overrightarrow{h}^{C_u}_{v,n}]; \qquad \overrightarrow{h}^{C_u}_{v,i} = \overrightarrow{\alpha}^{C_u}_i \times h^v_i \tag{9}$$

$$\overleftarrow{h}^{C_u}_v = [\overleftarrow{h}^{C_u}_{v,1}, \ldots, \overleftarrow{h}^{C_u}_{v,n}]; \qquad \overleftarrow{h}^{C_u}_{v,i} = \overleftarrow{\alpha}^{C_u}_i \times h^v_i \tag{10}$$

$$\overrightarrow{h}^{C_v}_u = [\overrightarrow{h}^{C_v}_{u,1}, \ldots, \overrightarrow{h}^{C_v}_{u,n}]; \qquad \overrightarrow{h}^{C_v}_{u,i} = \overrightarrow{\alpha}^{C_v}_i \times h^u_i \tag{11}$$

$$\overleftarrow{h}^{C_v}_u = [\overleftarrow{h}^{C_v}_{u,1}, \ldots, \overleftarrow{h}^{C_v}_{u,n}]; \qquad \overleftarrow{h}^{C_v}_{u,i} = \overleftarrow{\alpha}^{C_v}_i \times h^u_i \tag{12}$$

3.3 Integration and Final Prediction

The proposed model uses convolutional neural networks (CNN) [Lecun et al., 1998] for capturing location-invariant local features from the newly obtained contextualized hidden representations $\overrightarrow{h}^{C_v}_u, \overleftarrow{h}^{C_v}_u, \overrightarrow{h}^{C_u}_v, \overleftarrow{h}^{C_u}_v$. Four independent CNN blocks ($CNN_1$, $CNN_2$, $CNN_3$, $CNN_4$) are used, one for each of the contextualized hidden representations. Each CNN block consists of two convolutional layers, and both layers consist of $k$ filters of height $h$. The role of these filters is to detect particular features at different locations of the input. The output $c^l$ of the $l$-th layer consists of $k^l$ feature maps, and the $i$-th feature map ($c^l_i$) is calculated as:

$$c^l_i = b^l_i + \sum_{j=1}^{k^{l-1}} K^l_{i,j} * c^{l-1}_j \tag{13}$$

In the above equation, $b^l_i$ is a bias matrix and $K^l_{i,j}$ is a filter connecting the $j$-th feature map of layer $(l-1)$ to the $i$-th feature map of layer $l$. The output of each convolutional layer is passed through an activation function $f$. The proposed model uses LeakyReLU as its activation function:

$$f(x) = \begin{cases} x, & \text{for } x \geq 0 \\ a \cdot x, & \text{for } x < 0 \end{cases} \qquad a \in \mathbb{R} \tag{14, 15}$$

For each of the CNN blocks, the corresponding contextualized hidden representations are first concatenated ($\oplus$) and then given as input. The outputs of all the CNN blocks are flattened ($F_1, F_2, F_3, F_4 \in \mathbb{R}^{dk}$) and concatenated to generate a new vector $p \in \mathbb{R}^{4dk}$, where $d$ is the dimension of the hidden representation and $k$ is the number of convolutional filters used. This concatenated vector $p$ is then given as input to a dense layer with $4dk$ neurons, followed by the final sigmoid prediction layer.

$$F_1 = CNN_1([\overrightarrow{h}^{C_v}_{u,1} \oplus \overrightarrow{h}^{C_v}_{u,2} \oplus \cdots \oplus \overrightarrow{h}^{C_v}_{u,n}]) \tag{16}$$

$$F_2 = CNN_2([\overleftarrow{h}^{C_v}_{u,1} \oplus \overleftarrow{h}^{C_v}_{u,2} \oplus \cdots \oplus \overleftarrow{h}^{C_v}_{u,n}]) \tag{17}$$

$$F_3 = CNN_3([\overrightarrow{h}^{C_u}_{v,1} \oplus \overrightarrow{h}^{C_u}_{v,2} \oplus \cdots \oplus \overrightarrow{h}^{C_u}_{v,n}]) \tag{18}$$

$$F_4 = CNN_4([\overleftarrow{h}^{C_u}_{v,1} \oplus \overleftarrow{h}^{C_u}_{v,2} \oplus \cdots \oplus \overleftarrow{h}^{C_u}_{v,n}]) \tag{19}$$

$$p = [F_1 \oplus F_2 \oplus F_3 \oplus F_4] \tag{20}$$

$$\hat{y} = \sigma(Wp + b), \qquad W \in \mathbb{R}^{4dk};\; b \in \mathbb{R} \tag{21}$$

The proposed model uses binary cross-entropy as the training loss function, as shown in Equation 22. Here $L$ is the cost function, $\hat{y}_i \in \mathbb{R}$ represents the output of the proposed model, $y_i \in \mathbb{R}$ represents the true label, and $N \in \mathbb{N}$ represents the number of training samples.

$$L = -\frac{1}{N}\sum_{i=1}^{N} \big( y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i) \big) \tag{22}$$

4 Evaluation Setup

4.1 Dataset

This paper focuses on detecting sarcasm in user-generated short text using only the conversational context. Social media platforms like Reddit and Twitter are widely used for posting opinions and replying to others' opinions, and they have proved to be a great source for extracting conversational data. So the experiments were conducted on two publicly available benchmark datasets (Reddit and Twitter) used for the sarcasm detection task. Both datasets consist of comment and reply pairs.

SARC4 [Khodak et al., 2018] is the largest dataset available for sarcasm detection, containing millions of sarcastic/non-sarcastic comment-reply pairs from the social media site Reddit. This dataset was generated by scraping comments from Reddit containing the \s (sarcasm) tag. It contains replies, their parent comment (which acts as context), and a label that shows whether the reply was sarcastic or non-sarcastic with respect to its parent comment. To compare the performance of the model on a different (more recent) dataset, the proposed model was also evaluated on the Twitter dataset provided in the FigLang5 2020 workshop [Ghosh et al., 2020] for the "sarcasm detection shared task". It consists of sarcastic/non-sarcastic tweets and their corresponding contextual parent tweets. The sarcastic tweets were collected using hashtags like #sarcasm, #sarcastic, and #irony; similarly, non-sarcastic tweets were collected using hashtags like #happy, #sad, and #hate. This dataset sometimes contains more than one contextual parent tweet; in those cases, all of the contextual tweets are considered independently with the target tweet.

In both datasets, replies are the target comment/tweet to be classified as sarcastic/non-sarcastic, and their corresponding parent comment/tweet acts as context. Both datasets consist of comments/tweets of varying lengths, but because this paper focuses only on detecting sarcasm in short text, only the short comment/reply pairs were used: comment/reply sentences of length (number of words) less than 20 and 40 were used in the case of the SARC and Twitter datasets, respectively. In both cases, the balanced datasets contain equal proportions of sarcastic and non-sarcastic comment/reply pairs, and the imbalanced datasets maintain an approximately 20:80 ratio between sarcastic and non-sarcastic comment/reply pairs. Testing was done on 10% of the dataset and the rest was used for training; 10% of the training set was used for validation. Statistics of both datasets are shown in Table 1.

| Split | Dataset | Sarcastic pairs | Non-sarcastic pairs | Avg. words/comment (Sar. / Non-sar.) | Avg. words/reply (Sar. / Non-sar.) |
|---|---|---|---|---|---|
| Training set | Reddit (balanced) | 81205 | 81205 | 12.69 / 12.67 | 12.19 / 12.21 |
| Training set | Reddit (imbalanced) | 16303 | 81205 | 12.69 / 12.65 | 12.15 / 12.21 |
| Training set | Twitter (balanced) | 3496 | 3496 | 24.97 / 24.97 | 24.25 / 24.25 |
| Testing set | Reddit (balanced) | 9058 | 9058 | 12.71 / 12.64 | 12.14 / 12.22 |
| Testing set | Reddit (imbalanced) | 1747 | 9058 | 12.73 / 12.69 | 12.20 / 12.21 |
| Testing set | Twitter (balanced) | 874 | 874 | 24.97 / 24.97 | 24.25 / 24.25 |

Table 1: Statistics of the SARC dataset and the FigLang 2020 workshop Twitter dataset.

4.2 Data Preprocessing

The preprocessing of the textual data was done by first lowercasing all the sentences and separating punctuation from the words. We do not remove stop-words because we believe stop-words sometimes play a major role in making a sentence sarcastic, e.g., "is it?" and "am I?". A problem with social media platforms is that users use a lot of abbreviations, shortened words, and slang, like "IMO" for "in my opinion", "lmk" for "let me know", "fr" for "for", etc. These words are challenging to handle in NLP tasks, particularly in the automatic discovery of flexible word usages. To solve this problem, such words are converted to their corresponding full forms using abbreviation/slang dictionaries obtained from Urban Dictionary6. After this, all the sentences were tokenized into lists of words. The proposed model has a fixed input size for both comment and reply, but not all sentences were of the same length, so all sentences were padded to the length of the longest sentence (20 in the case of the Reddit dataset and 40 in the case of the Twitter dataset).

Word embeddings are used to give semantically meaningful dense representations to the words. Word-based embeddings are constructed using contextual words, whereas character-based embeddings are constructed from character n-grams of the words. Character-based embeddings, in contrast to word-based embeddings, solve the problem of out-of-vocabulary words and perform better in the case of infrequent words by creating word embeddings based only on their spellings. So, to generate proper representations for words, we have used FastText7, a character-based word embedding. This not only gives words better representations compared to a word-based model but also accommodates slang, shortened, and infrequent words (which commonly appear on social media platforms).

4.3 Training Details

We have used macro-averaged F1 (F1) and accuracy (Acc) scores as the evaluation metrics, as is standard for the sarcasm detection task. We have also reported Precision (P) and Recall (R) scores for the Twitter dataset as well as for the Reddit dataset (wherever available). Hyperparameter tuning was used to find optimum values of the hyperparameters. The FastText embeddings used were of size d = 30 and were trained for 30 iterations with a window size of 3 and 5 in the case of the SARC and Twitter datasets, respectively. The numbers of filters in all the convolutional blocks were [64, 64], with heights [2, 2]. The learning optimizer used is Adam with an initial learning rate of 0.01. The value of α in all the LeakyReLU layers was set to 0.3. All the models were trained for 20 epochs. L2 regularization set to 10^-2 is applied to all the feed-forward connections, along with early stopping with a patience of 5 to avoid overfitting. The mini-batch size was tuned over {100, 500, 1000, 2000, 3000, 4000}, and it was observed that mini-batch sizes of 2000 and 500 gave the best performance for the SARC and Twitter datasets, respectively.

The recent success of transformer-based language models has led to their wide usage in sentiment analysis tasks. They are known for generating high-quality, high-dimensional word representations (768-dimensional for BERT). Their drawback is that they require high processing power and memory to train. The above-mentioned configuration of the proposed model generates ≈1120K trainable parameters, and increasing either the embedding size or the number of tokens in a sentence led to an exponential increase in the number of trainable parameters. So, due to computational resource limitations, we limited our experiments to lower-dimensional word embeddings.

4 https://nlp.cs.princeton.edu/SARC/2.0/
5 sites.google.com/view/figlang2020
6 https://www.urbandictionary.com/
7 https://fasttext.cc/

5 Results

| Model | Balanced: Acc / F1 / P / R | Imbalanced: Acc / F1 / P / R |
|---|---|---|
| CNN-SVM [Poria et al., 2016] †⋆ | 68.0 / 68.0 / – / – | 69.0 / 79.0 / – / – |
| AMR [Ghaeini et al., 2018] ‡ | 69.5 / 69.5 / 74.8 / 69.7 | – / – / – / – |
| [Ghosh and Veale, 2017] ‡ | – / 67.8 / 68.2 / 67.9 | – / – / – / – |
| CUE-CNN [Amir et al., 2016] †⋆ | 70.0 / 69.0 / – / – | 73.0 / 81.0 / – / – |
| MHA-BiLSTM [Kumar et al., 2020] † | – / 77.5 / 72.6 / 83.0 | – / 56.8 / 60.3 / 53.7 |
| CASCADE [Hazarika et al., 2018] ‡⋆ | 77.0 / 77.0 / – / – | 79.0 / 86.0 / – / – |
| CASCADE (only discourse features) ‡ | 68.0 / 66.0 / – / – | 68.0 / 78.0 / – / – |
| Bi-ISCA (this paper) ‡ | 72.3 / 75.7 / 74.2 / 77.6 | 71.9 / 74.4 / 73.0 / 75.8 |
| ∆ increase w.r.t. CASCADE (only discourse features) | 4.3↑ / 9.7↑ / – / – | 3.9↑ / 3.6↓ / – / – |

† uses only the target sentence; ‡ uses context along with the target sentence; ⋆ uses personality-based features.

Table 2: Results on the SARC dataset. Models having only ‡ use only contextual text for detecting sarcasm.

| Model | P | R | F1 |
|---|---|---|---|
| Baseline (LSTM_attn) | 70.0 | 66.9 | 68.0 |
| BERT-Large+BiLSTM+SVM [Baruah et al., 2020] | 73.4 | 73.5 | 73.4 |
| BERT+CNN+LSTM [Srivastava et al., 2020] | 74.2 | 74.6 | 74.1 |
| RoBERTa+LSTM [Kumar and Anand, 2020] | 77.3 | 77.4 | 77.2 |
| RoBERTa-Large [Dong et al., 2020] | 79.1 | 79.4 | 79.0 |
| RoBERTa+Multi-Initialization Ensemble [Jaiswal, 2020] | 79.2 | 79.3 | 79.1 |
| BERT+BiLSTM+NeXtVLAD+Context Ensemble+Data Augmentation [Lee et al., 2020] | 93.2 | 93.6 | 93.1 |
| Bi-ISCA (this paper) | 89.4 | 94.8 | 91.7 |

Table 3: Results on the FigLang 2020 workshop Twitter dataset.
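The macro-averaged F1 used as the headline metric in Tables 2 and 3 computes a per-class F1 score independently and then averages them, so the sarcastic and non-sarcastic classes count equally even under the 20:80 imbalance. A minimal sketch (illustrative, not the paper's evaluation code):

```python
# Illustrative sketch of macro-averaged F1: per-class F1 scores are
# computed independently and then averaged, so a minority class (e.g.,
# sarcastic replies in the imbalanced split) is weighted equally.

def macro_f1(y_true, y_pred, classes=(0, 1)):
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1s.append(f1)
    return sum(f1s) / len(f1s)

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
print(round(macro_f1(y_true, y_pred), 3))  # 0.667
```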
Bi-ISCA focuses on using only the contextual comment/tweet for detecting sarcasm rather than any other topical/personality-based features. Using only the contextual information enriches the model's ability to capture the syntactic and semantic textual properties responsible for invoking sarcasm in any type of conversation. Table 2 reports performance results on the SARC datasets. For comparison purposes, F1-score (F1), Accuracy (Acc), Precision (P), and Recall (R) were used.

When compared with the existing works, Bi-ISCA was able to outperform all the models (marked only ‡) that use only conversational context for sarcasm detection (an improvement of ∆7.9% in F1 score compared to [Ghosh and Veale, 2017], and ∆6.2% in F1 score and ∆2.8% in accuracy compared to AMR [Ghaeini et al., 2018]), and was even able to perform better than the models (†⋆) that use personality-based features along with the target sentence (an improvement of ∆7.7% in F1 and ∆4.3% in accuracy compared to CNN-SVM [Poria et al., 2016], and ∆6.7% in F1 and ∆2.3% in accuracy compared to CUE-CNN [Amir et al., 2016]). MHA-BiLSTM [Kumar et al., 2020] had a ∆1.8% higher F1 score on the balanced dataset, but Bi-ISCA showed a drastic improvement of ∆17.6% on the imbalanced dataset, which demonstrates the ability of Bi-ISCA to handle class imbalance.

The current state-of-the-art on the SARC dataset is achieved by CASCADE. Even though CASCADE uses personality-based features and contextual information along with long sentences of average length ≈55-62 (very large compared to our dataset, which gives it the advantage of much more contextual information), Bi-ISCA was able to achieve a comparable F1 score despite using relatively short text. In comparison with the CASCADE variant that uses only discourse-based features, Bi-ISCA performed drastically better, with an increase of ∆9.7% in F1 and ∆4.3% in accuracy score on the balanced dataset.

Bi-ISCA clearly demonstrated its capability to robustly handle imbalance in the dataset, although it was unable to outperform both CASCADE models. This slightly poorer performance on the imbalanced dataset can be explained by the length of the sentences used by CASCADE, which are significantly (≈5 times) longer than the ones on which Bi-ISCA was tested. Longer sentences provide more contextual information, which improves performance, especially in the case of imbalance, where a little extra information can lead to a drastic increase in performance.

Table 3 reports Precision (P), Recall (R), and F1-score (F1) of different models from the leaderboard of the FigLang 2020 sarcasm detection shared task using the Twitter dataset. In this case, not only was Bi-ISCA able to outperform the baseline model [Ghosh et al., 2020] (improvements of ∆19.4%, ∆27.9%, and ∆23.7% in precision, recall, and F1 score, respectively), but it was also able to perform comparably to the state-of-the-art [Lee et al., 2020], with a ∆1.2% increase in recall, which further validates the performance of the proposed model. Even though all the models other than the baseline in Table 3 are transformer-based, Bi-ISCA was able to outperform all of them except the ensemble of [Lee et al., 2020].

6 Discussion

Table 4: Attention weight distribution in Reddit comment-reply pairs. Here CcR represents "Comment contextualized on Reply", whereas RcC represents "Reply contextualized on Comment"; (R) and (L) represent forward and backward attention.

The attention scores generated by the attention mechanism make the proposed model highly interpretable. Table 4 showcases the distribution of the attention scores over four sarcastic comment-reply pairs from the SARC dataset (correctly predicted by Bi-ISCA). Not only was the proposed model able to correctly detect sarcasm in these pairs of sentences, it was also able to correctly identify the words responsible for the contextual, explicit, or implicit incongruity which invokes sarcasm.
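The paper does not specify how the raw attention scores are normalized into the per-word weight distributions shown in Table 4; a standard choice is a softmax, sketched below with words from Pair 1 and made-up scores:

```python
# Hypothetical sketch of turning raw Bi-ISCA attention scores into a
# normalized per-word weight distribution for visualization. The softmax
# normalization and the raw scores below are assumptions for
# illustration, not values from the paper.
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["being", "traumatized", "is", "amazing"]
raw_scores = [0.1, 2.0, 0.1, 2.3]
weights = softmax(raw_scores)
# The incongruous words receive the largest normalized weights.
for t, w in zip(tokens, weights):
    print(f"{t}: {w:.2f}")
```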
the length of sentences used by CASCADE, which are signif- For example in Pair 1, Bi-ISCA correctly identified explic- icantly (≈5 times) greater than the ones on which Bi-ISCA itly incongruous words like "amazing" and "force" in the reply was tested. Longer sentences result in increased contextual sentence which were responsible for the sarcastic nature of information which improves performance especially in the the reply. Interestingly the word "traumatized" in the parent Copyright c 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Twelfth International Workshop Modelling and Reasoning in Context (MRC) @IJCAI 2021 7 comment also had a high attention weight value, which shows [Baruah et al., 2020] Arup Baruah, Kaushik Das, Ferdous that the proposed attention mechanism was able to learn the Barbhuiya, and Kuntal Dey. Context-aware sarcasm detec- contextual incongruity between the opposite sentiment words tion using BERT. In Proceedings of the Second Workshop like "traumatized" & "amazing" in the comment-reply pair. on Figurative Language Processing, pages 83–87, Online, Pair 2 demonstrates the model’s ability to capture words re- July 2020. Association for Computational Linguistics. sponsible for invoking sarcasm by making sentences implicitly [Cho et al., 2014] Kyunghyun Cho, Bart van Merriënboer, incongruous. Sarcasm due to implicit incongruity is usually Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Hol- the toughest to perceive. Despite this, Bi-ISCA was able to ger Schwenk, and Yoshua Bengio. Learning phrase repre- give high attention weights to words like "announces" and sentations using RNN encoder–decoder for statistical ma- "crashes & security holes". Not only this, but the proposed chine translation. 
In Proceedings of the 2014 Conference intra-sentence attention mechanism was also able to learn a on Empirical Methods in Natural Language Processing link between "microsoft" and "m" (slang for microsoft) with- (EMNLP), pages 1724–1734, Doha, Qatar, October 2014. out having any prior knowledge related to slangs. Pair 3 is Association for Computational Linguistics. also an example of an explicitly and contextually incongruous comment-reply pair, where the model was successfully able [Dong et al., 2020] Xiangjue Dong, Changmao Li, and to capture opposite sentiment words & phrases like "blind Jinho D. Choi. Transformer-based context-aware sarcasm drunk", "cautious" and "behind the wheel" that made the reply detection in conversation threads from social media. In Pro- sarcastic in nature. Pair 4 is an example of sarcasm due to ceedings of the Second Workshop on Figurative Language implicit incongruity between the words, "pause" & "watch", Processing, pages 276–280, Online, July 2020. Association and contextual incongruity simultaneously between "reported" for Computational Linguistics. & "enjoyable", both of which were successfully captured by [Eisterhold et al., 2006] Jodi Eisterhold, Salvatore Attardo, Bi-BISCA. and Diana Boxer. Reactions to irony in discourse: evidence for the least disruption principle. Journal of Pragmatics, 7 Conclusion 38(8):1239 – 1256, 2006. Focus-on Issue: Discourse and Conversation. In this paper, we introduce a novel Bi-directional Inter- [Farías et al., 2016] Delia Irazú Hernaundefineddez Farías, Sentence Attention mechanism based model (Bi-ISCA) for Viviana Patti, and Paolo Rosso. Irony detection in twit- detecting sarcasm. The proposed model not only was able to ter: The role of affective content. ACM Trans. Internet capture both intra and inter-sentence dependencies but was Technol., 16(3), July 2016. 
able to achieve state-of-the-art results in detecting sarcasm in user-generated short text using only the conversational context. Further investigation of the attention maps illustrated Bi-ISCA's ability to capture explicitly, implicitly, and contextually incongruous words & phrases responsible for invoking sarcasm. The success of the proposed model is due to the use of character-based embeddings, which take care of slang/shortened & out-of-vocabulary words; Bi-LSTMs, which capture intra-sentence dependencies between words in the same sentence; and Bi-ISCA, which captures inter-sentence dependencies between words of different sentences.

References

[Amir et al., 2016] Silvio Amir, Byron C. Wallace, Hao Lyu, Paula Carvalho, and Mário J. Silva. Modelling context with user embeddings for sarcasm detection in social media. In Proceedings of the Conference on Natural Language Learning (CoNLL), 2016.

[Bamman and Smith, 2015] David Bamman and Noah A. Smith. Contextualized sarcasm detection on Twitter. In Ninth International AAAI Conference on Web and Social Media, 2015.

[Barbieri et al., 2014] Francesco Barbieri, Horacio Saggion, and Francesco Ronzano. Modelling sarcasm in Twitter, a novel approach. In Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 50–58, Baltimore, Maryland, June 2014. Association for Computational Linguistics.

[Baruah et al., 2020] Arup Baruah, Kaushik Das, Ferdous Barbhuiya, and Kuntal Dey. Context-aware sarcasm detection using BERT. In Proceedings of the Second Workshop on Figurative Language Processing, pages 83–87, Online, July 2020. Association for Computational Linguistics.

[Cho et al., 2014] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October 2014. Association for Computational Linguistics.

[Dong et al., 2020] Xiangjue Dong, Changmao Li, and Jinho D. Choi. Transformer-based context-aware sarcasm detection in conversation threads from social media. In Proceedings of the Second Workshop on Figurative Language Processing, pages 276–280, Online, July 2020. Association for Computational Linguistics.

[Eisterhold et al., 2006] Jodi Eisterhold, Salvatore Attardo, and Diana Boxer. Reactions to irony in discourse: evidence for the least disruption principle. Journal of Pragmatics, 38(8):1239–1256, 2006.

[Farías et al., 2016] Delia Irazú Hernández Farías, Viviana Patti, and Paolo Rosso. Irony detection in Twitter: The role of affective content. ACM Transactions on Internet Technology, 16(3), July 2016.

[Ghaeini et al., 2018] Reza Ghaeini, Xiaoli Z. Fern, and Prasad Tadepalli. Attentional multi-reading sarcasm detection. CoRR, abs/1809.03051, 2018.

[Ghosh and Veale, 2016] Aniruddha Ghosh and Tony Veale. Fracking sarcasm using neural network. In Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 161–169, San Diego, California, June 2016. Association for Computational Linguistics.

[Ghosh and Veale, 2017] Aniruddha Ghosh and Tony Veale. Magnets for sarcasm: Making sarcasm detection timely, contextual and very personal. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 482–491, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.

[Ghosh et al., 2020] Debanjan Ghosh, Avijit Vajpayee, and Smaranda Muresan. A report on the 2020 sarcasm detection shared task. In Proceedings of the Second Workshop on Figurative Language Processing, pages 1–11, Online, July 2020. Association for Computational Linguistics.

[Hazarika et al., 2018] Devamanyu Hazarika, Soujanya Poria, Sruthi Gorantla, Erik Cambria, Roger Zimmermann, and Rada Mihalcea. CASCADE: Contextual sarcasm detection in online discussion forums. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1837–1848, Santa Fe, New Mexico, USA, August 2018. Association for Computational Linguistics.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Jaiswal, 2020] Nikhil Jaiswal. Neural sarcasm detection using conversation context. In Proceedings of the Second Workshop on Figurative Language Processing, pages 77–82, Online, July 2020. Association for Computational Linguistics.

[Joshi et al., 2015] Aditya Joshi, Vinita Sharma, and Pushpak Bhattacharyya. Harnessing context incongruity for sarcasm detection. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 757–762, Beijing, China, July 2015. Association for Computational Linguistics.

[Khattri et al., 2015] Anupam Khattri, Aditya Joshi, Pushpak Bhattacharyya, and Mark Carman. Your sentiment precedes you: Using an author's historical tweets to predict sarcasm. In Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 25–30, 2015.

[Khodak et al., 2018] Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. A large self-annotated corpus for sarcasm. In Proceedings of the Language Resources and Evaluation Conference (LREC), 2018.

[Kumar and Anand, 2020] Amardeep Kumar and Vivek Anand. Transformers on sarcasm detection with context. In Proceedings of the Second Workshop on Figurative Language Processing, pages 88–92, Online, July 2020. Association for Computational Linguistics.

[Kumar et al., 2020] A. Kumar, V. T. Narapareddy, V. Aditya Srikanth, A. Malapati, and L. B. M. Neti. Sarcasm detection using multi-head attention based bidirectional LSTM. IEEE Access, 8:6388–6397, 2020.

[Lecun et al., 1998] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[Lee et al., 2020] Hankyol Lee, Youngjae Yu, and Gunhee Kim. Augmenting data for sarcasm detection with unlabeled conversation context. In Proceedings of the Second Workshop on Figurative Language Processing, pages 12–17, Online, July 2020. Association for Computational Linguistics.

[Liebrecht et al., 2013] Christine Liebrecht, Florian Kunneman, and Antal van den Bosch. The perfect solution for detecting sarcasm in tweets #not. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 29–37, Atlanta, Georgia, June 2013. Association for Computational Linguistics.

[Poria et al., 2016] Soujanya Poria, Erik Cambria, Devamanyu Hazarika, and Prateek Vij. A deeper look into sarcastic tweets using deep convolutional neural networks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1601–1612, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee.

[Rajadesingan et al., 2015] Ashwin Rajadesingan, Reza Zafarani, and Huan Liu. Sarcasm detection on Twitter: A behavioral modeling approach. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM '15, pages 97–106, New York, NY, USA, 2015. Association for Computing Machinery.

[Reyes et al., 2013] Antonio Reyes, Paolo Rosso, and Tony Veale. A multidimensional approach for detecting irony in Twitter. Language Resources and Evaluation, 47(1):239–268, 2013.

[Riloff et al., 2013] Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, and Ruihong Huang. Sarcasm as contrast between a positive sentiment and negative situation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 704–714. ACL, 2013.

[Schuster and Paliwal, 1997] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.

[Srivastava et al., 2020] Himani Srivastava, Vaibhav Varshney, Surabhi Kumari, and Saurabh Srivastava. A novel hierarchical BERT architecture for sarcasm detection. In Proceedings of the Second Workshop on Figurative Language Processing, pages 93–97, Online, July 2020. Association for Computational Linguistics.

[Tay et al., 2018] Yi Tay, Anh Tuan Luu, Siu Cheung Hui, and Jian Su. Reasoning with sarcasm by reading in-between. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1010–1020, Melbourne, Australia, July 2018. Association for Computational Linguistics.

[Tsur et al., 2010] Oren Tsur, Dmitry Davidov, and Ari Rappoport. ICWSM—a great catchy name: Semi-supervised recognition of sarcastic sentences in online product reviews. In Fourth International AAAI Conference on Weblogs and Social Media, 2010.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.

[Wilson, 2006] Deirdre Wilson. The pragmatics of verbal irony: Echo or pretence? Lingua, 116(10):1722–1743, 2006.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
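As an illustration of the kind of bidirectional inter-sentence attention discussed in this paper, the following is a minimal sketch, not the authors' exact formulation: given Bi-LSTM states for a comment and a reply, each sentence attends over the other's states (producing CcR- and RcC-style attention maps) and is contextualized on the other. The dot-product scoring and the variable names are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inter_sentence_attention(comment, reply):
    """Sketch of bidirectional inter-sentence attention.

    comment: (m, d) array of Bi-LSTM states for the comment's m words
    reply:   (n, d) array of Bi-LSTM states for the reply's n words
    Returns both attention maps and the contextualized representations.
    """
    scores = comment @ reply.T        # (m, n) word-pair affinities
    c_on_r = softmax(scores, axis=1)  # each comment word attends over reply words
    r_on_c = softmax(scores.T, axis=1)  # each reply word attends over comment words
    comment_ctx = c_on_r @ reply      # comment contextualized on reply (CcR-style)
    reply_ctx = r_on_c @ comment      # reply contextualized on comment (RcC-style)
    return c_on_r, r_on_c, comment_ctx, reply_ctx

rng = np.random.default_rng(0)
c, r = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
c_on_r, r_on_c, cc, rc = inter_sentence_attention(c, r)
print(c_on_r.shape, r_on_c.shape, cc.shape, rc.shape)  # (4, 6) (6, 4) (4, 8) (6, 8)
```

Inspecting the rows of `c_on_r` and `r_on_c` is what makes such a model interpretable: high weights mark word pairs (e.g. opposite-sentiment words across the comment and reply) that the model treats as incongruous.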