<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Bi-ISCA: Bidirectional Inter-Sentence Contextual Attention Mechanism for Detecting Sarcasm in User Generated Noisy Short Text</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Prakamya</forename><surname>Mishra</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution" key="instit1">https</orgName>
								<orgName type="institution" key="instit2">dictionary.cambridge.org</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Saroj</forename><surname>Kaushik</surname></persName>
							<email>saroj.kaushik@snu.edu</email>
							<affiliation key="aff0">
								<orgName type="institution">Shiv Nadar University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kuntal</forename><surname>Dey</surname></persName>
							<email>kuntal.dey@accenture.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Accenture Technology Labs</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution" key="instit1">https</orgName>
								<orgName type="institution" key="instit2">dictionary.cambridge.org</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Bi-ISCA: Bidirectional Inter-Sentence Contextual Attention Mechanism for Detecting Sarcasm in User Generated Noisy Short Text</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">7457E880F7E7E1E173318661F0326F85</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T01:31+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Many online comments on social media platforms are hateful, humorous, or sarcastic. The sarcastic nature of these comments (especially the short ones) alters their actual implied sentiments, which leads to misinterpretations by the existing sentiment analysis models. A lot of research has already been done to detect sarcasm in the text using user-based, topical, and conversational information but not much work has been done to use inter-sentence contextual information for detecting the same. This paper proposes a new deep learning architecture that uses a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) to capture intersentence dependencies for detecting sarcasm in the user-generated short text using only the conversational context. The proposed deep learning model demonstrates the capability to capture explicit, implicit, and contextual incongruous words &amp; phrases responsible for invoking sarcasm. Bi-ISCA generates results comparable to the state-of-the-art on two widely used benchmark datasets for the sarcasm detection task (Reddit and Twitter). To the best of our knowledge, none of the existing models use an intersentence contextual attention mechanism to detect sarcasm in the user-generated short text using only conversational context.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Sentiment analysis is one of the most important natural language processing (NLP) applications. Its goal is to identify, extract, quantify, and study subjective information. The sudden rise in the usage of social media platforms as a means of communication has led to a vast amount of data being shared between its users on a wide range of topics. This type of data is very helpful to several organizations for analyzing the sentiments of people towards products, movies, political events, etc. Understanding the unique intricacies of the human language remains one of the most important pending NLP problems of this time. Humans regularly use sarcasm as a crucial part of the day-to-day conversations when venting, arguing, or * Contact Author maybe engaging on social media platforms. Sarcastic remarks on these platforms inflict problems on the existing sentiment analysis systems in identifying the true intentions of the users.</p><p>The Cambridge Dictionary 1 describes sarcasm as an irony conveyed hilariously or amusingly to criticize something. Sarcasm may not show criticism on the surface but instead might have a criticizing implied meaning. Such a figurative aspect of sarcasm makes it difficult to be detected in the modern micro texts <ref type="bibr" target="#b3">[Ghosh and Veale, 2016]</ref>. Several linguistic research has been done to analyze different aspects of sarcasm. Kind of responses evoked because of comments has been considered a major indicator of sarcasm <ref type="bibr" target="#b3">[Eisterhold et al., 2006]</ref>. <ref type="bibr" target="#b14">[Wilson, 2006]</ref> states that circumstantial incongruity between a comment and its corresponding contextual information plays an important role in implying sarcasm.</p><p>Previous research works have used policy-based, statistical, and deep-learning-based methods for detecting sarcasm. 
The use of contextual information like conversational context, author personality features, or prior knowledge of the topic, have proved to be very useful. <ref type="bibr" target="#b6">[Khattri et al., 2015]</ref> used sentiments of the author's historical tweets as context. <ref type="bibr" target="#b11">[Rajadesingan et al., 2015]</ref> used personality features like the author's familiarity with twitter, language (structure and word usage), and the author's familiarity with sarcasm (history of previous sarcastic tweets) for consolidating context. <ref type="bibr" target="#b0">[Bamman and Smith, 2015]</ref> explored the use of historical terms, topics, and sentiments along with profile information as the author's context. They also exploited the use of conversational context like the immediate previous tweets in the thread. <ref type="bibr" target="#b6">[Joshi et al., 2015]</ref> demonstrated that concatenation of preceding comment with the objective comment in a discussion forum led to an increase in the precision score.</p><p>Overall in recent years a lot of work has been done to use different types of contextual information for sarcasm detection but none of them have used inter-sentence dependencies. In this paper, we propose a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) based deep learning neural network for sarcasm detection. The main contribution of this paper can be summarised as follows:</p><p>• We propose a new deep learning architecture that uses a novel Bidirectional Inter-Sentence Contextual attention mechanism (Bi-ISCA) for detecting sarcasm in short texts (short texts are more difficult to analyze due to shortage of contextual information). 
• Bi-ISCA focuses on only using the conversational contextual comment/tweet for detecting sarcasm rather than using any other topical/personality-based features, as using only the contextual information enriches the model's ability to capture syntactical and semantical textual properties responsible for invoking sarcasm. • We also explain model behavior and predictions by visualizing attention maps generated by Bi-ISCA, which helps in identifying significant parts of the sentences responsible for invoking sarcasm. The rest of the paper is organized as follows. Section 2 describes the related work. Then section 3, explains the proposed model architecture for detecting sarcasm. Section 4 will describe the datasets used, pre-processing pipeline, and training details for reproducibility. Then experimental results are explained in section 5 and section 6 illustrates model behavior and predictions by visualizing attention maps. Finally we conclude in section 7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>A diverse spectrum of approaches has been used to detect sarcasm. Recent sarcasm detection approaches have either mainly focused on using machine learning based approaches that leverage the use of explicitly declared relevant features or they focus on using neural network based deep learning approaches that do not require handcrafted features. Also, the recent advances in using deep learning for preforming natural language processing tasks have led to a promising increase in the performance of these sarcasm detection systems.</p><p>A lot of research has been done using bag of words as features. However, to improve performance, scholars started to explore the use of several other semantic and syntactical features like punctuations <ref type="bibr" target="#b13">[Tsur et al., 2010]</ref>; emotion marks and intensifiers <ref type="bibr" target="#b9">[Liebrecht et al., 2013]</ref>; positive verbs and negative phrases <ref type="bibr" target="#b11">[Riloff et al., 2013]</ref>; polarity skip grams <ref type="bibr" target="#b11">[Reyes et al., 2013]</ref>; synonyms &amp; ambiguity <ref type="bibr" target="#b1">[Barbieri et al., 2014]</ref>; implicit and explicit incongruity-based <ref type="bibr" target="#b6">[Joshi et al., 2015]</ref>; sentiment flips <ref type="bibr" target="#b11">[Rajadesingan et al., 2015]</ref>; affect-based features derived from multiple emotion lexicons <ref type="bibr" target="#b3">[Farías et al., 2016]</ref>.</p><p>Every day an enormous amount of short text data is generated by users on popular social media platforms like Twitter 2 and Reddit 3 . Easy accessibility of such data sources has enticed researchers to use them for extracting user-based and discourse-based features. <ref type="bibr" target="#b5">[Hazarika et al., 2018]</ref> utilized contextual information by making user-embeddings for capturing indicative behavioral traits. 
These user-embeddings incorporated personality features along with the author's writing style (using historical posts). They also used discourse comments along with background cues and topical information for detecting sarcasm. They performed their experiments on the largest Reddit dataset SARC <ref type="bibr" target="#b6">[Khodak et al., 2018]</ref>. Many have only used the target text for classification purposes, where a target 2 www.twitter.com/ 3 www.reddit.com/ text is a textual unit that has to be classified as sarcastic or not. Simply using gated recurrent units (GRU) <ref type="bibr" target="#b1">[Cho et al., 2014]</ref> or long short term memory (LSTM) <ref type="bibr" target="#b5">[Hochreiter and Schmidhuber, 1997]</ref> do not capture in between interactions of word pairs which makes it difficult to model contrast and incongruity. <ref type="bibr" target="#b12">[Tay et al., 2018]</ref> were able to solve this problem by looking in-between word pairs using a multi-dimensional intra-attention recurrent network. They focused on modeling the intra-sentence relationships among the words. <ref type="bibr">[Kumar et al., 2020]</ref> exploited the use of a multi-head attention mechanism <ref type="bibr" target="#b13">[Vaswani et al., 2017]</ref> which could capture dependencies between different representations subspaces in different positions. Their model consisted of a word encoder for generating new word representations by summarizing comment contextual information in a bidirectional manner. On top of that, they used multi-head attention for focusing on different contexts of a sentence, and in the end, a simple multi-layer perceptron was used for classification.</p><p>There has not been much work done in conversation dependent (comment and reply) approaches for sarcasm detection. 
<ref type="bibr" target="#b3">[Ghaeini et al., 2018]</ref> proposed a model that not only used information from the target utterance but also used its conversational context to perceive sarcasm. They aimed to detect sarcasm by just using the sequences of sentences, without any extra knowledge about the user and topic. They combined the predictions from utterance-only and conversation-dependent parts for generating its final prediction which was able to capture the words responsible for delivering sarcasm. <ref type="bibr" target="#b4">[Ghosh and Veale, 2017</ref>] also modeled conversational context for sarcasm detection. They also attempted to derive what parts of the conversational context triggered a sarcastic reply. Their proposed model used sentence embeddings created by taking an average of word embeddings and a sentence-level attention mechanism was used to generate attention induced representations of both the context and the response which was later concatenated and used for classification.</p><p>Among all the previous works, [ <ref type="bibr" target="#b3">Ghaeini et al., 2018]</ref> and <ref type="bibr" target="#b4">[Ghosh and Veale, 2017]</ref> share similar motives of detecting sarcasm using only the conversational context. However, we introduce a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) for detecting sarcasm. Unlike previous works, our work considers short texts for detecting sarcasm, which is far more challenging to detect when compared to long texts as long texts provide much more contextual information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Model</head><p>This section will introduce the proposed Bi-ISCA: Bidirectional Inter Sentence Contextual Attention based neural network for sarcasm detection (as shown in Figure <ref type="figure" target="#fig_0">1</ref>). Sarcasm detection is a binary classification task that tries to predict whether a given comment is sarcastic or not. The proposed model uses comment-reply pairs for detecting sarcasm. The input to the model is represented by U</p><formula xml:id="formula_0">= [W u 1 , W u 2 , ...., W u n ] and V = [W v 1 , W v 2 , ...., W v n ],</formula><p>where U represents the comment sentence and V represents the reply sentence (both sentences padded to a length of n). Here, W u i , W v j ∈ R d are d−dimensional word embedding vectors. The objective is to predict label y which indicates whether the reply to the corresponding comment was sarcastic or not.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Intra-Sentence Word Encoder Layer</head><p>The primary purpose of this layer is to summarize intrasentence contextual information from both directions in both the sentences (comment &amp; reply) using Bidirectional Long Short Term Memory Networks (Bi-LSTM). A Bi-LSTM <ref type="bibr" target="#b11">[Schuster and Paliwal, 1997]</ref> processes information in both the directions using a forward LSTM [Hochreiter and Schmidhuber, 1997] − → h , that reads the sentence S = [w 1 , w 2 , ...., w n ] from w 1 to w n and a backward LSTM ← − h that reads the sentence from w n to w 1 . Hidden states from both the LSTMs are added to get the final hidden state representations of each word. So the hidden state representation of the t th word (h t ) can be represented by the sum of t th hidden representations of the forward and backward LSTMs (</p><formula xml:id="formula_1">− → h t , ← − h t ) as show in equations below. − → h t = −−−−→ LST M (w t , − − → h t−1 ); ← − h t = ←−−−− LST M (w t , ← − − h t−1 )<label>(1)</label></formula><formula xml:id="formula_2">h t = ← − h t + − → h t (2)</formula><p>This Intra-Sentence Word Encoder Layer consists of two independent Bidirectional LSTMs for both comment (BiLST M c ) and reply (BiLST M r ). Apart from the hidden states, both these Bi-LSTMs also generate separate (forward &amp; backward) final cell states represented by</p><formula xml:id="formula_3">← − C &amp; − → C .</formula><p>The comment sentence U is given as an input to BiLST M c and the reply sentence V is given as an input to BiLST M r . 
The outputs of both the Bi-LSTMs are represented by the equations 3 and 4.</p><formula xml:id="formula_4">− → C u , h u , ← − C u = BiLST M c (U ) (3) − → C v , h v , ← − C v = BiLST M r (V ) (4) Here, − → C u , − → C v ∈ R d are the final cell states of the for- ward LSTMs corresponding to BiLST M c &amp; BiLST M r ; ← − C u , ← − C v ∈ R d are the final cell states of the backward LSTMs corresponding to BiLST M c &amp; BiLST M r ; h u = [h u 1 , h u 2 , ...., h u n ] and h v = [h v 1 , h v 2 , ...., h v n ] are the hidden state representations of BiLST M c &amp; BiLST M r respec- tively, where h u i , h v j ∈ R d and h u , h v ∈ R n×d .</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Bi-ISCA: Bidirectional Inter-Sentence Contextual Attention Mechanism</head><p>Sarcasm is context-dependent in nature. Even humans sometimes have a hard time understanding sarcasm without having any contextual information. The hidden states generated by both the Bi-LSTMs (BiLST M c &amp; BiLST M r ) captures the intra-sentence bidirectional contextual information in comment &amp; reply respectively, but fails to capture the intersentence contextual information between them. This paper introduces a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) for capturing the inter-sentence contextual information between both the sentences. Bi-ISCA uses hidden state representations of U &amp; V along with the auxiliary sentence's cell state representations ( − → C &amp; ← − C ) to capture the inter-sentence contextual information. At first, the attention mechanism captures four sets of attentions scores namely, (α   ). In the equations below (×) represents multiplication between a scalar and a vector.</p><formula xml:id="formula_5">− → Cu , α ← − Cu , α − → Cv , α ← − Cv ∈ R n ).</formula><formula xml:id="formula_6">− → Cu = [α − → Cu 1 , α − → Cu 2 , ...., α − → Cu n ]; α − → Cu i = − → C u • h v i (5) α ← − Cu = [α ← − Cu 1 , α ← − Cu 2 , ...., α ← − Cu n ]; α ← − Cu i = ← − C u • h v i (6) α − → Cv = [α − → Cv 1 , α − → Cv 2 , ...., α ← − Cv n ]; α − → Cv i = − → C v • h u i (7) α ← − Cv = [α ← − Cv 1 , α ← − Cv 2 , ...., α ← − Cv n ]; α ← − Cv i = ← − C v • h u i (<label>8</label></formula><formula xml:id="formula_7">h − → Cu v = [h − → Cu v,1 , h − → Cu v,2 , ...., h − → Cu v,n ], ; h − → Cu v,i = α − → Cu i × h v i (9) h ← − Cu v = [h ← − Cu v,1 , h ← − Cu v,2 , ...., h ← − Cu v,n ], ; h ← − Cu v,i = α ← − Cu i × h v i (10) h − → Cv u = [h − → Cv u,1 , h − → Cv u,2 , ...., h − → Cv u,n ], ; h − → Cv u,i = α − → Cv i × h u i (11) h ← − Cv u = [h ← − Cv u,1 , h ← − Cv u,2 , ...., h ← − Cv u,n 
], ; h ← − Cv u,i = α ← − Cv i × h u i (12)</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Integration and Final Prediction</head><p>The proposed model uses Convolutional Neural Networks (CNN) <ref type="bibr" target="#b8">[Lecun et al., 1998]</ref> for capturing location-invariant local features from the newly obtained contextualized hidden representations h</p><formula xml:id="formula_8">← − Cv u , h − → Cv u , h ← − Cu v , h − → Cu v . Four independent CNN blocks (CN N 1 , CN N 2 , CN N 3 , CN N 4</formula><p>) are used, corresponding to each of the newly obtained contextualized hidden representations. Each CN N block consists two convolutional layers. Both the convolution layer consist of k filters of height h. The role of these filters is to detect particular features at different locations of the input. The output c l i of the l th layer consists of k l feature maps of height h. The i th feature map (c l i ) is calculated as:</p><formula xml:id="formula_9">c l i = b l i + j=1 k l−1 K l i,j * c l−1 j (13)</formula><p>In the above equation, b l i is a bias matrix and K l i,j is a filter connecting j th feature map of layer (l − 1) to the i th feature map of layer (l). The output of each convolution layer is passed through a activation function f . The proposed model uses LeakyReLu as its activation function.</p><formula xml:id="formula_10">f = a * x, for x ≥ 0; a ∈ R (14)</formula><p>x, for x &lt; 0 (15)</p><p>For each of the CNN blocks, the corresponding contextualized hidden representations are first concatenated (⊕) and then given as input. The outputs of all the CNN blocks are flattened (F 1 , F 2 , F 3 , F 4 ∈ R dk ) and concatenated to generate a new vector (p ∈ R 4dk ), where d represents the dimension of the hidden representation and k represents number of convolutional filters used. This concatenated (p) vector is then given as input to a dense layer having 4dk neurons and is followed by the final sigmoid prediction layer. 
</p><formula xml:id="formula_11">F 1 = CN N 1 ([h − → Cv u,1 ⊕ h − → Cv u,2 ⊕ .... ⊕ h − → Cv u,n ])<label>(16)</label></formula><formula xml:id="formula_12">F 2 = CN N 2 ([h ← − Cv u,1 ⊕ h ← − Cv u,2 ⊕ .... ⊕ h ← − Cv u,n ])<label>(17)</label></formula><formula xml:id="formula_13">F 3 = CN N 3 ([h − → Cu v,1 ⊕ h − → Cu v,2 ⊕ .... ⊕ h − → Cu v,n ])<label>(18)</label></formula><formula xml:id="formula_14">F 4 = CN N 4 ([h ← − Cu v,1 ⊕ h ← − Cu v,2 ⊕ .... ⊕ h ← − Cu v,n ]) (19) p = [F 1 ⊕ F 2 ⊕ F 3 ⊕ F 4 ]<label>(20)</label></formula><formula xml:id="formula_15">ŷ = σ(W p + b), W ∈ R 4dk ; b ∈ R<label>(</label></formula><formula xml:id="formula_16">L = − 1 N N i=1 y i • log(ŷ i ) + (1 − y i ) • log(1 − ŷi ) (22)</formula><p>4 Evaluation Setup comments from Reddit containing the \s (sarcasm) tag. It contains replies, their parent comment (acts as context), and a label that shows whether the reply was sarcastic/non-sarcastic to their corresponding parent comment. To compare the performance of the model on a different dataset (latest), the proposed model was also evaluated on the Twitter dataset provided in the FigLang<ref type="foot" target="#foot_3">5</ref> 2020 workshop <ref type="bibr" target="#b5">[Ghosh et al., 2020]</ref> for the "sarcasm detection shared task". This consists of sarcastic/nonsarcastic tweets and their corresponding contextual parent tweets. The sarcastic tweets were collected using hashtags like #sarcasm, #sarcastic, and #irony, similarly non-sarcastic tweets were collected using hashtags like #happy, #sad, and #hate. This dataset sometime contains more than one contextual parent tweet, so in those cases, all of the contextual tweets are considered independently with the target tweet.</p><p>In both the datasets, replies are the target comment/tweet to be classified as sarcastic/non-sarcastic, and their corresponding parent comment/tweet acts as context. 
Both the datasets constitute of comments/tweets of varying lengths, but because this paper only focuses on detecting sarcasm in the short text, only the short comment/reply pairs were used. Comment/reply sentences of length (no. of words) less than 20, 40 were used in the case of SARC and Twitter dataset respectively. In both cases, the balanced datasets contain equal proportions of sarcastic/non-sarcastic comment/reply pairs, and the imbalanced datasets maintain a 20:80 ratio (approximately) between sarcastic and non-sarcastic comment/reply pairs. Testing was done on 10% of the dataset and the rest was used for training. 10% of the training set was used for validation purposes. Statistics of both the datasets are shown in Table <ref type="table" target="#tab_0">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Data Preprocessing</head><p>The preprocessing of the textual data was done by first lowercasing all the sentences and separating punctuations from the words. We do not remove the stop-words because we believe that sometimes stop-words play a major role in making a sentence sarcastic e.g., "is it?" and "am I?". The problem with social media platforms is that, users use a lot of abbreviations, shortened words and slang words like, "IMO" for "in my opinion", lmk" for "let me know ", "fr" for "for", etc. These words are challenging to taken care of in the NLP tasks, particularly in the automatic discovery of flexible word usages. So to solve this problem, these words are converted to their corresponding full-forms using abbreviation/slang word dictionaries obtained from urban dictionary<ref type="foot" target="#foot_4">6</ref> . After this, all the sentences were tokenized into a list of words. The proposed model had a fixed input size for both comment and reply, but not all the sentences were of the same length. So all the sentences were padded to the length of the longest sentence (20 in the case of the Reddit dataset and 40 in the case of the Twitter dataset). Word embeddings are used to give semantically-meaningful dense representations to the words. Word-based embeddings are constructed using contextual words whereas character-based embeddings are constructed from character n-grams of the words. Character-based in contrast to the Word-based embeddings solves the problem of out of vocabulary words and performs better in the case of infrequent words by creating word embeddings based only on their spellings. So for generating proper representations for words we have used FastText<ref type="foot" target="#foot_5">7</ref> , a character-based word embedding. 
This would not only give words better representation compared to the word-based model but also incorporate slang/shortened/infrequent words (which commonly appear in social media platforms).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Training Details</head><p>We have used macro-averaged (F1) and accuracy (Acc) scores as the evaluation metric, as it is standard for the sarcasm detection task. We have also reported Precision (P) and Recall (R) scores in the case of the Twitter dataset as well as for the Reddit dataset (wherever available). Hyperparameter tuning was used to find optimum values of the hyperparameters. The FastText embeddings used were of size d = 30 and were trained for 30 iterations having window size of 3, 5 in the case of SARC, and Twitter dataset respectively. The number of filters in all the convolutional blocks were <ref type="bibr">[64,</ref><ref type="bibr">64]</ref> of height <ref type="bibr">[2,</ref><ref type="bibr">2]</ref>. The learning optimizer used is Adam with an initial learning rate of 0.01. The value of α in all the LeakyReLu layers was set to 0.3. All the models were trained for 20 epochs. L2 regularization set to 10 −2 is applied to all the feed-forward connections along with early stopping having the patience of 5 to avoid overfitting. The mini-batch size was tuned amongst {100, 500, 1000, 2000, 3000, 4000} and was observed that mini-batch size of 2000, 500 gave the best performance for the SARC and Twitter dataset respectively.</p><p>The recent success of transformer-based language models has led to their wide usage in sentiment analysis tasks. They are known for generating high quality high dimensional word representations (768-dimensional for BERT). Their only drawback is that they require high processing power and memory to train. The above-mentioned configuration of the proposed model generates ≈1120K trainable parameters, and increasing either the embedding size or the number of tokens in a sentence led to an exponential increase in the number of trainable parameters. So due to computational resource limitations, we limited our experiments to lower-dimensional word embeddings.  
Bi-ISCA focuses on only using the contextual comment/tweet for detecting sarcasm rather than using any other topical/personality-based features. Using only the contextual information enriches the model's ability to capture syntactical and semantical textual properties responsible for invoking sarcasm in any type of conversation. Table <ref type="table" target="#tab_2">2</ref> reports performance results on the SARC datasets. For comparison purposes, F1score (F1), Accuracy score (Acc), Precision (P) and Recall (R) were used.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Models</head><p>When compared with the existing works, Bi-ISCA was able to outperform all the models (only ‡) that use only conversational context for sarcasm detection (Improvement of ∆ 7.9% in F1 score when compared to <ref type="bibr" target="#b4">[Ghosh and Veale, 2017]</ref>; ∆ 6.2% in F1 score and ∆ 2.8% in accuracy when compared to AMR <ref type="bibr" target="#b3">[Ghaeini et al., 2018]</ref>), and was even able to perform better than the models ( † ) that use personality-based features along with the target sentence for detecting sarcasm (improvement of ∆ 7.7% in F1 and ∆ 4.3% in accuracy score when compared to CNN-SVM <ref type="bibr" target="#b10">[Poria et al., 2016]</ref>; ∆ 6.7% in F1 score and ∆ 2.3% in accuracy when compared to CUE-CNN <ref type="bibr" target="#b0">[Amir et al., 2016]</ref>). MHA-BiLSTM <ref type="bibr" target="#b7">[Kumar et al., 2020]</ref> had a ∆ 1.8% higher F1 score in the balanced dataset but Bi-ISCA was able to show drastic improvement of ∆ 17.6% in the imbalanced dataset, which demonstrated the ability of Bi-ISCA to handle class imbalance.</p><p>The current state-of-the-art on the SARC dataset is achieved by CASCADE. Even though CASCADE uses personalitybased features and contextual information along with large sentences of average length ≈55-62 (very large compared to our dataset, which gives them the advantage of using a lot more contextual information), Bi-ISCA was able to achieve an F1 score comparable to it (despite using relatively short text). In comparison with CASCADE that only uses discoursebased features, Bi-ISCA performed drastically better with an increase of ∆ 9.7% in F1 and ∆ 4.3% in accuracy score for the balanced dataset.</p><p>Bi-ISCA clearly demonstrated its capabilities to robustly handle an imbalance in the dataset, although it was unable to outperform both the CASCADE models. 
This slightly poor performance in the imbalanced dataset can be explained by the length of sentences used by CASCADE, which are significantly (≈5 times) greater than the ones on which Bi-ISCA was tested. Longer sentences result in increased contextual information which improves performance especially in the case of imbalance where little extra information can lead to a drastic increase in performance. Models P R F1 Baseline (LST M a ttn) 70.0 66.9 68.0 BERT-Large+BiLSTM+SVM <ref type="bibr" target="#b1">[Baruah et al., 2020]</ref> 73.4 73.5 73.4 BERT+CNN+LSTM <ref type="bibr" target="#b11">[Srivastava et al., 2020]</ref> 74.2 74.6 74.1 RoBERTa+LSTM <ref type="bibr" target="#b6">[Kumar and Anand, 2020]</ref> 77 Table <ref type="table">3</ref>: Results on the FigLang 2020 workshop Twitter dataset.</p><p>Table <ref type="table">3</ref> reports Precision (P), Recall (R), and F1-score (F1) of different models from the leaderboard of FigLang 2020 sarcasm detection shared task using the Twitter dataset. In this case, not only Bi-ISCA was able to outperform the baseline model <ref type="bibr" target="#b5">[Ghosh et al., 2020]</ref> (improvement of ∆ 19.4%, ∆ 27.9% &amp; ∆ 23.7% in precision, recall, and F1 score respectively), but was also able to perform comparably to the state-of-the-art [Lee et al., 2020] with a ∆ 1.2% increase in recall, which further validates the performance of the proposed model. Even though all the models other than the baseline in Table <ref type="table">3 are</ref>  The attention scores generated by the attention mechanism makes the proposed model highly interpretable. Table <ref type="table">4</ref> showcases the distribution of the attention scores over four sarcastic (correctly predicted by Bi-ISCA) comment-reply pairs from the SARC dataset. 
Not only was the proposed model able to correctly detect sarcasm in these pairs of sentences, but it was also able to correctly identify the words responsible for the contextual, explicit, or implicit incongruity that invokes sarcasm.</p><p>For example, in Pair 1, Bi-ISCA correctly identified explicitly incongruous words like "amazing" and "force" in the reply sentence, which were responsible for the sarcastic nature of the reply. Interestingly, the word "traumatized" in the parent comment also had a high attention weight, which shows that the proposed attention mechanism was able to learn the contextual incongruity between opposite-sentiment words like "traumatized" &amp; "amazing" in the comment-reply pair. Pair 2 demonstrates the model's ability to capture words that invoke sarcasm by making sentences implicitly incongruous. Sarcasm due to implicit incongruity is usually the toughest to perceive. Despite this, Bi-ISCA was able to give high attention weights to words like "announces" and "crashes &amp; security holes". Moreover, the proposed intra-sentence attention mechanism was able to learn a link between "microsoft" and "m" (slang for Microsoft) without any prior knowledge of slang. Pair 3 is also an example of an explicitly and contextually incongruous comment-reply pair, where the model successfully captured opposite-sentiment words &amp; phrases like "blind drunk", "cautious", and "behind the wheel" that made the reply sarcastic in nature. Pair 4 is an example of sarcasm due to implicit incongruity between the words "pause" &amp; "watch" and, simultaneously, contextual incongruity between "reported" &amp; "enjoyable", both of which were successfully captured by Bi-ISCA.</p></div>
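As an illustration of how such attention scores support interpretability, the following minimal sketch ranks the words of a reply by their attention weight; the tokens, weights, and the helper `top_attended` are illustrative assumptions of ours, not values from the paper's experiments.

```python
def top_attended(tokens, weights, k=3):
    """Return the k tokens with the highest attention weight."""
    ranked = sorted(zip(tokens, weights), key=lambda tw: -tw[1])
    return [tok for tok, _ in ranked[:k]]

# Hypothetical reply and attention weights (illustrative only).
reply = ["what", "an", "amazing", "use", "of", "force"]
alpha = [0.05, 0.03, 0.42, 0.10, 0.04, 0.36]
print(top_attended(reply, alpha, k=2))  # → ['amazing', 'force']
```

Inspecting the highest-weighted words in this way is how the attention maps in Table 4 were read: the words carrying the incongruity receive visibly larger weights.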
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>In this paper, we introduce a novel Bidirectional Inter-Sentence Contextual Attention mechanism based model (Bi-ISCA) for detecting sarcasm. The proposed model was not only able to capture both intra- and inter-sentence dependencies but also achieved state-of-the-art results in detecting sarcasm in user-generated short text using only the conversational context. Further investigation of the attention maps illustrated Bi-ISCA's ability to capture the explicitly, implicitly, and contextually incongruous words &amp; phrases responsible for invoking sarcasm. The success of the proposed model stems from the use of character-based embeddings, which handle slang/shortened &amp; out-of-vocabulary words; Bi-LSTMs, which capture intra-sentence dependencies between words in the same sentence; and Bi-ISCA, which captures inter-sentence dependencies between words of different sentences.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Bi-ISCA: Bi-Directional Inter-Sentence Contextual Attention Mechanism for Sarcasm Detection.</figDesc><graphic coords="3,155.99,53.84,300.02,241.12" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>These sets of inter-sentence attention scores are used to generate new inter-sentence contextualized hidden representations. Then (α_{→Cu}, α_{←Cu}) are calculated using the hidden state representations of BiLSTM_r along with the forward and backward final states (→C_u, ←C_u) of BiLSTM_c (as shown in equations 5 &amp; 6); similarly, (α_{→Cv}, α_{←Cv}) are calculated using the hidden state representations of BiLSTM_c along with the forward and backward final states (→C_v, ←C_v) of BiLSTM_r (as shown in equations 7 &amp; 8). In the equations below, (•) represents the dot product between two vectors.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>In the next step, the above-calculated sets of inter-sentence attention scores (α_{→Cu}, α_{←Cu}) are multiplied back with the hidden state representations of BiLSTM_r to generate two new sets of hidden representations h^{→Cu}_v, h^{←Cu}_v ∈ R^{n×d} of the reply sentence, namely reply contextualized on comment (forward) &amp; reply contextualized on comment (backward) respectively (as shown in equations 9 &amp; 10). Similarly, (α_{→Cv}, α_{←Cv}) are multiplied back with the hidden state representations of BiLSTM_c to generate two new sets of hidden representations h^{→Cv}_u, h^{←Cv}_u ∈ R^{n×d} of the comment sentence, namely comment contextualized on reply (forward) &amp; comment contextualized on reply (backward) respectively (as shown in equations 11 &amp; 12).</figDesc></figure>
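A minimal NumPy sketch of this inter-sentence attention step, under the assumption that the scores are a softmax over dot products between each hidden state and the other sentence's final state (the function name `inter_sentence_attention` and the softmax normalization are our assumptions, not the paper's exact equations 5-12):

```python
import numpy as np

def inter_sentence_attention(H, c_final):
    """Score each hidden state of one sentence against the final
    state of the other sentence, then reweight the hidden states.

    H       : (n, d) hidden states of one BiLSTM direction
    c_final : (d,)   forward or backward final state of the other BiLSTM
    returns : attention scores (n,) and contextualized states (n, d)
    """
    scores = H @ c_final                   # dot product per time step
    alpha = np.exp(scores - scores.max())  # numerically stable softmax
    alpha /= alpha.sum()
    return alpha, alpha[:, None] * H       # scores, reweighted hidden states

rng = np.random.default_rng(0)
H_r = rng.normal(size=(6, 4))  # reply hidden states (n=6, d=4)
c_u = rng.normal(size=4)       # comment forward final state
alpha, h_ctx = inter_sentence_attention(H_r, c_u)
print(alpha.shape, h_ctx.shape)  # (6,) (6, 4)
```

Calling this once per direction of each BiLSTM yields the four contextualized representations described above.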
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head></head><label></label><figDesc>The proposed model uses binary cross-entropy as the training loss function, as shown in equation 22. Here L is the cost function, ŷ_i ∈ R represents the output of the proposed model, y_i ∈ R represents the true label, and N ∈ ℕ represents the number of training samples.</figDesc></figure>
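The binary cross-entropy objective described above can be written out directly. A small sketch (the eps clipping is a numerical-safety detail we add; the paper does not specify it):

```python
import math

def bce_loss(y_true, y_pred, eps=1e-12):
    """L = -(1/N) * sum_i [ y_i*log(yhat_i) + (1-y_i)*log(1-yhat_i) ]"""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

print(round(bce_loss([1, 0], [0.9, 0.1]), 4))  # → 0.1054 (confident predictions)
```

A prediction of 0.5 on a positive label gives exactly log 2 ≈ 0.693, the loss of random guessing, which is why the baseline F1 scores in Table 3 correspond to much lower training losses than chance.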
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head></head><label></label><figDesc>Attention weight distribution in Reddit comment-reply pairs. Here CcR represents "Comment contextualized on Reply" whereas RcC represents "Reply contextualized on Comment"; (R) &amp; (L) represent forward &amp; backward attention.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Statistics of the SARC dataset and FigLang 2020 workshop Twitter dataset.</figDesc><table><row><cell>4.1 Dataset</cell></row><row><cell>This paper focuses on detecting sarcasm in user-generated short text using only the conversational context. Social media platforms like Reddit and Twitter are widely used for posting opinions and replying to others' opinions, and have proved to be a great source of conversational data. The experiments were therefore conducted on two publicly available benchmark datasets (Reddit &amp; Twitter) used for the sarcasm detection task. Both datasets consist of comment-reply pairs. SARC 4 Reddit [Khodak et al., 2018] is the largest dataset available for sarcasm detection, containing millions of sarcastic/non-sarcastic comment-reply pairs from the social media site Reddit. This dataset was generated by scraping</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 :</head><label>2</label><figDesc>Results on the SARC dataset. Models marked only with ‡ use only conversational context for detecting sarcasm.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0">Twelfth International Workshop Modelling and Reasoning in Context (MRC) @IJCAI 2021</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1">Copyright c 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">https://nlp.cs.princeton.edu/SARC/2.0/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">sites.google.com/view/figlang2020</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">https://www.urbandictionary.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_5">https://fasttext.cc/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Modelling context with user embeddings for sarcasm detection in social media</title>
		<author>
			<persName><surname>Amir</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Ninth International AAAI Conference on Web and Social Media</title>
				<imprint>
			<publisher>CoNLL</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note>Proceedings of the Conference on Natural Language Learning</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Learning phrase representations using RNN encoder-decoder for statistical machine translation</title>
		<author>
			<persName><surname>Barbieri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis</title>
				<meeting>the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis<address><addrLine>Baltimore, Maryland; Doha, Qatar</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014-06">2014. June 2014. 2020. July 2020. 2014. October 2014</date>
			<biblScope unit="page" from="1724" to="1734" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Transformer-based context-aware sarcasm detection in conversation threads from social media</title>
		<author>
			<persName><forename type="first">Dong</forename></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Workshop on Figurative Language Processing</title>
				<meeting>the Second Workshop on Figurative Language Processing</meeting>
		<imprint>
			<date type="published" when="2020-07">2020. July 2020</date>
			<biblScope unit="page" from="276" to="280" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Reactions to irony in discourse: evidence for the least disruption principle</title>
		<author>
			<persName><surname>Eisterhold</surname></persName>
		</author>
		<idno>CoRR, abs/1809.03051</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis</title>
				<editor>
			<persName><forename type="first">Prasad</forename><surname>Tadepalli</surname></persName>
		</editor>
		<meeting>the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis<address><addrLine>San Diego, California</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2006">2006. 2006. 2016. July 2016. 2018. 2018. June 2016</date>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="page" from="161" to="169" />
		</imprint>
	</monogr>
	<note>Attentional multi-reading sarcasm detection</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Magnets for sarcasm: Making sarcasm detection timely, contextual and very personal</title>
		<author>
			<persName><forename type="first">Aniruddha</forename><surname>Ghosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tony</forename><surname>Veale</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2017 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Copenhagen, Denmark</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2017-09">2017. September 2017</date>
			<biblScope unit="page" from="482" to="491" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">CASCADE: Contextual sarcasm detection in online discussion forums</title>
		<author>
			<persName><surname>Ghosh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 27th International Conference on Computational Linguistics</title>
				<editor>
			<persName><forename type="first">Schmidhuber</forename><surname>Hochreiter</surname></persName>
		</editor>
		<meeting>the 27th International Conference on Computational Linguistics<address><addrLine>Santa Fe, New Mexico, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="1997">2020. July 2020. 2018. August 2018. 1997. 1997. 2020. July 2020</date>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="77" to="82" />
		</imprint>
	</monogr>
	<note>Proceedings of the Second Workshop on Figurative Language Processing</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Your sentiment precedes you: Using an author&apos;s historical tweets to predict sarcasm</title>
		<author>
			<persName><surname>Joshi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing</title>
				<meeting>the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing<address><addrLine>Beijing, China</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2015-07">2015. July 2015. 2015. 2015. 2018. 2020. July 2020</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="88" to="92" />
		</imprint>
	</monogr>
	<note>Proceedings of the Second Workshop on Figurative Language Processing</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Sarcasm detection using multi-head attention based bidirectional lstm</title>
		<author>
			<persName><surname>Kumar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="6388" to="6397" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Augmenting data for sarcasm detection with unlabeled conversation context</title>
		<author>
			<persName><surname>Lecun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Workshop on Figurative Language Processing</title>
				<meeting>the Second Workshop on Figurative Language Processing</meeting>
		<imprint>
			<date type="published" when="1998">1998. 1998. 2020. July 2020</date>
			<biblScope unit="volume">86</biblScope>
			<biblScope unit="page" from="12" to="17" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">The perfect solution for detecting sarcasm in tweets #not</title>
		<author>
			<persName><surname>Liebrecht</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis</title>
				<meeting>the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis<address><addrLine>Atlanta, Georgia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013-06">2013. June 2013</date>
			<biblScope unit="page" from="29" to="37" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A deeper look into sarcastic tweets using deep convolutional neural networks</title>
		<author>
			<persName><surname>Poria</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers</title>
				<meeting>COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers<address><addrLine>Osaka, Japan</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016-12">2016. December 2016</date>
			<biblScope unit="page" from="1601" to="1612" />
		</imprint>
	</monogr>
	<note>The COLING 2016 Organizing Committee</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Sarcasm as contrast between a positive sentiment and negative situation</title>
		<author>
			<persName><forename type="first">Rajadesingan</forename></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013</title>
				<meeting>the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013<address><addrLine>New York, NY, USA; Grand Hyatt Seattle, Seattle, Washington, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="1997-07">2015. 2015. 2013. 2013. 18-21 October 2013. 2013. 1997. 1997. July 2020</date>
			<biblScope unit="volume">47</biblScope>
			<biblScope unit="page" from="93" to="97" />
		</imprint>
	</monogr>
	<note>Proceedings of the Second Workshop on Figurative Language Processing</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Reasoning with sarcasm by reading in-between</title>
		<author>
			<persName><surname>Tay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
				<meeting>the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)<address><addrLine>Melbourne, Australia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-07">2018. July 2018</date>
			<biblScope unit="page" from="1010" to="1020" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Icwsm-a great catchy name: Semi-supervised recognition of sarcastic sentences in online product reviews</title>
		<author>
			<persName><surname>Tsur</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">fourth international AAAI conference on weblogs and social media</title>
				<editor>
			<persName><forename type="first">I</forename><surname>Guyon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">U</forename><forename type="middle">V</forename><surname>Luxburg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bengio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Wallach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Fergus</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Vishwanathan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2010">2010. 2010. 2017. 2017</date>
			<biblScope unit="page" from="5998" to="6008" />
		</imprint>
	</monogr>
	<note>Advances in Neural Information Processing Systems 30</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">The pragmatics of verbal irony: Echo or pretence?</title>
		<author>
			<persName><forename type="first">Deirdre</forename><surname>Wilson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Lingua</title>
		<imprint>
			<biblScope unit="volume">116</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="1722" to="1743" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
	<note>Language in Mind: A Tribute to Neil Smith on the Occasion of his Retirement</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
