Twelfth International Workshop Modelling and Reasoning in Context (MRC) @IJCAI 2021

Bi-ISCA: Bidirectional Inter-Sentence Contextual Attention Mechanism for Detecting Sarcasm in User Generated Noisy Short Text

Prakamya Mishra1∗, Saroj Kaushik2, Kuntal Dey3
1,2 Shiv Nadar University; 3 Accenture Technology Labs
{pm669, saroj.kaushik}@snu.edu.in, kuntal.dey@accenture.com

∗ Contact Author
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Many online comments on social media platforms are hateful, humorous, or sarcastic. The sarcastic nature of these comments (especially the short ones) alters their actual implied sentiment, which leads to misinterpretations by existing sentiment analysis models. A lot of research has already been done to detect sarcasm in text using user-based, topical, and conversational information, but not much work has been done on using inter-sentence contextual information for the same task. This paper proposes a new deep learning architecture that uses a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) to capture inter-sentence dependencies for detecting sarcasm in user-generated short text using only the conversational context. The proposed deep learning model demonstrates the capability to capture explicitly, implicitly, and contextually incongruous words and phrases responsible for invoking sarcasm. Bi-ISCA generates results comparable to the state-of-the-art on two widely used benchmark datasets for the sarcasm detection task (Reddit and Twitter). To the best of our knowledge, none of the existing models use an inter-sentence contextual attention mechanism to detect sarcasm in user-generated short text using only conversational context.

1 Introduction

Sentiment analysis is one of the most important natural language processing (NLP) applications. Its goal is to identify, extract, quantify, and study subjective information. The sudden rise in the usage of social media platforms as a means of communication has led to a vast amount of data being shared between their users on a wide range of topics. This type of data is very helpful to organizations for analyzing the sentiments of people towards products, movies, political events, etc. Understanding the unique intricacies of human language remains one of the most important open NLP problems of this time. Humans regularly use sarcasm as a crucial part of day-to-day conversations when venting, arguing, or simply engaging on social media platforms. Sarcastic remarks on these platforms make it hard for existing sentiment analysis systems to identify the true intentions of the users.

The Cambridge Dictionary1 describes sarcasm as irony conveyed humorously or amusingly to criticize something. Sarcasm may not show criticism on the surface but might instead have a criticizing implied meaning. Such a figurative aspect of sarcasm makes it difficult to detect in modern micro texts [Ghosh and Veale, 2016]. A considerable amount of linguistic research has analyzed different aspects of sarcasm. The kind of responses evoked by a comment has been considered a major indicator of sarcasm [Eisterhold et al., 2006]. [Wilson, 2006] states that circumstantial incongruity between a comment and its corresponding contextual information plays an important role in implying sarcasm.

Previous research works have used policy-based, statistical, and deep-learning-based methods for detecting sarcasm. The use of contextual information like conversational context, author personality features, or prior knowledge of the topic has proved to be very useful. [Khattri et al., 2015] used sentiments of the author's historical tweets as context. [Rajadesingan et al., 2015] used personality features like the author's familiarity with Twitter, language (structure and word usage), and the author's familiarity with sarcasm (history of previous sarcastic tweets) for consolidating context. [Bamman and Smith, 2015] explored the use of historical terms, topics, and sentiments along with profile information as the author's context. They also exploited conversational context such as the immediately preceding tweets in the thread. [Joshi et al., 2015] demonstrated that concatenating the preceding comment with the objective comment in a discussion forum led to an increase in the precision score.

Overall, in recent years a lot of work has been done on using different types of contextual information for sarcasm detection, but none of it has used inter-sentence dependencies. In this paper, we propose a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) based deep learning neural network for sarcasm detection. The main contributions of this paper can be summarised as follows:

• We propose a new deep learning architecture that uses a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) for detecting sarcasm in short texts (short texts are more difficult to analyze due to the shortage of contextual information).

• Bi-ISCA focuses on using only the conversational contextual comment/tweet for detecting sarcasm rather than any other topical/personality-based features, as using only the contextual information enriches the model's ability to capture the syntactic and semantic textual properties responsible for invoking sarcasm.

• We also explain model behavior and predictions by visualizing attention maps generated by Bi-ISCA, which helps in identifying the significant parts of the sentences responsible for invoking sarcasm.

The rest of the paper is organized as follows. Section 2 describes the related work. Section 3 explains the proposed model architecture for detecting sarcasm. Section 4 describes the datasets used, the pre-processing pipeline, and training details for reproducibility. Experimental results are explained in Section 5, and Section 6 illustrates model behavior and predictions by visualizing attention maps. Finally, we conclude in Section 7.

1 https://dictionary.cambridge.org/

2 Related Work

A diverse spectrum of approaches has been used to detect sarcasm. Recent sarcasm detection approaches have mainly focused either on machine learning approaches that leverage explicitly declared relevant features, or on neural network based deep learning approaches that do not require handcrafted features. The recent advances in using deep learning for natural language processing tasks have led to a promising increase in the performance of these sarcasm detection systems.

A lot of research has been done using bag-of-words features. However, to improve performance, scholars started to explore several other semantic and syntactic features such as punctuation [Tsur et al., 2010]; emotion marks and intensifiers [Liebrecht et al., 2013]; positive verbs and negative phrases [Riloff et al., 2013]; polarity skip-grams [Reyes et al., 2013]; synonyms and ambiguity [Barbieri et al., 2014]; implicit and explicit incongruity-based features [Joshi et al., 2015]; sentiment flips [Rajadesingan et al., 2015]; and affect-based features derived from multiple emotion lexicons [Farías et al., 2016].

Every day an enormous amount of short text data is generated by users on popular social media platforms like Twitter2 and Reddit3. Easy accessibility of such data sources has enticed researchers to use them for extracting user-based and discourse-based features. [Hazarika et al., 2018] utilized contextual information by building user embeddings for capturing indicative behavioral traits. These user embeddings incorporated personality features along with the author's writing style (using historical posts). They also used discourse comments along with background cues and topical information for detecting sarcasm. They performed their experiments on the largest Reddit dataset, SARC [Khodak et al., 2018]. Many works have used only the target text for classification, where a target text is a textual unit that has to be classified as sarcastic or not. Simply using gated recurrent units (GRU) [Cho et al., 2014] or long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997] does not capture the in-between interactions of word pairs, which makes it difficult to model contrast and incongruity. [Tay et al., 2018] were able to solve this problem by looking in between word pairs using a multi-dimensional intra-attention recurrent network. They focused on modeling the intra-sentence relationships among the words. [Kumar et al., 2020] exploited a multi-head attention mechanism [Vaswani et al., 2017], which can capture dependencies between different representation subspaces at different positions. Their model consisted of a word encoder for generating new word representations by summarizing comment contextual information in a bidirectional manner. On top of that, they used multi-head attention for focusing on different contexts of a sentence, and in the end a simple multi-layer perceptron was used for classification.

There has not been much work done on conversation-dependent (comment and reply) approaches for sarcasm detection. [Ghaeini et al., 2018] proposed a model that used not only information from the target utterance but also its conversational context to perceive sarcasm. They aimed to detect sarcasm by just using the sequences of sentences, without any extra knowledge about the user and topic. They combined the predictions from utterance-only and conversation-dependent parts to generate the final prediction, which was able to capture the words responsible for delivering sarcasm. [Ghosh and Veale, 2017] also modeled conversational context for sarcasm detection, and additionally attempted to derive what parts of the conversational context triggered a sarcastic reply. Their proposed model used sentence embeddings created by averaging word embeddings, and a sentence-level attention mechanism generated attention-induced representations of both the context and the response, which were later concatenated and used for classification.

Among all the previous works, [Ghaeini et al., 2018] and [Ghosh and Veale, 2017] share similar motives of detecting sarcasm using only the conversational context. However, we introduce a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) for detecting sarcasm. Unlike previous works, our work considers short texts, in which sarcasm is far more challenging to detect than in long texts, as long texts provide much more contextual information.

2 www.twitter.com/
3 www.reddit.com/

3 Model

This section introduces the proposed Bi-ISCA: Bidirectional Inter-Sentence Contextual Attention based neural network for sarcasm detection (as shown in Figure 1). Sarcasm detection is a binary classification task that tries to predict whether a given comment is sarcastic or not. The proposed model uses comment-reply pairs for detecting sarcasm. The input to the model is represented by $U = [W^u_1, W^u_2, \ldots, W^u_n]$ and $V = [W^v_1, W^v_2, \ldots, W^v_n]$, where $U$ represents the comment sentence and $V$ represents the reply sentence (both sentences padded to a length of $n$). Here, $W^u_i, W^v_j \in \mathbb{R}^d$ are $d$-dimensional word embedding vectors. The objective is to predict the label $y$, which indicates whether the reply to the corresponding comment was sarcastic or not.

Figure 1: Bi-ISCA: Bi-Directional Inter-Sentence Contextual Attention Mechanism for Sarcasm Detection.

3.1 Intra-Sentence Word Encoder Layer

The primary purpose of this layer is to summarize intra-sentence contextual information from both directions in both sentences (comment and reply) using Bidirectional Long Short-Term Memory networks (Bi-LSTM). A Bi-LSTM [Schuster and Paliwal, 1997] processes information in both directions using a forward LSTM [Hochreiter and Schmidhuber, 1997] $\overrightarrow{h}$ that reads the sentence $S = [w_1, w_2, \ldots, w_n]$ from $w_1$ to $w_n$, and a backward LSTM $\overleftarrow{h}$ that reads the sentence from $w_n$ to $w_1$. Hidden states from both LSTMs are added to get the final hidden state representation of each word. So the hidden state representation of the $t$-th word ($h_t$) can be represented by the sum of the $t$-th hidden representations of the forward and backward LSTMs ($\overrightarrow{h}_t$, $\overleftarrow{h}_t$), as shown in the equations below:

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(w_t, \overrightarrow{h}_{t-1}); \qquad \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(w_t, \overleftarrow{h}_{t-1}) \tag{1}$$

$$h_t = \overrightarrow{h}_t + \overleftarrow{h}_t \tag{2}$$

This Intra-Sentence Word Encoder Layer consists of two independent bidirectional LSTMs, one for the comment ($BiLSTM_c$) and one for the reply ($BiLSTM_r$). Apart from the hidden states, both Bi-LSTMs also generate separate (forward and backward) final cell states, represented by $\overrightarrow{C}$ and $\overleftarrow{C}$. The comment sentence $U$ is given as input to $BiLSTM_c$ and the reply sentence $V$ is given as input to $BiLSTM_r$. The outputs of both Bi-LSTMs are represented by Equations 3 and 4:

$$\overrightarrow{C}_u,\; h^u,\; \overleftarrow{C}_u = BiLSTM_c(U) \tag{3}$$

$$\overrightarrow{C}_v,\; h^v,\; \overleftarrow{C}_v = BiLSTM_r(V) \tag{4}$$

Here, $\overrightarrow{C}_u, \overrightarrow{C}_v \in \mathbb{R}^d$ are the final cell states of the forward LSTMs of $BiLSTM_c$ and $BiLSTM_r$; $\overleftarrow{C}_u, \overleftarrow{C}_v \in \mathbb{R}^d$ are the final cell states of the backward LSTMs of $BiLSTM_c$ and $BiLSTM_r$; and $h^u = [h^u_1, h^u_2, \ldots, h^u_n]$ and $h^v = [h^v_1, h^v_2, \ldots, h^v_n]$ are the hidden state representations of $BiLSTM_c$ and $BiLSTM_r$ respectively, where $h^u_i, h^v_j \in \mathbb{R}^d$ and $h^u, h^v \in \mathbb{R}^{n \times d}$.

3.2 Bi-ISCA: Bidirectional Inter-Sentence Contextual Attention Mechanism

Sarcasm is context-dependent in nature. Even humans sometimes have a hard time understanding sarcasm without any contextual information. The hidden states generated by the two Bi-LSTMs ($BiLSTM_c$ and $BiLSTM_r$) capture the intra-sentence bidirectional contextual information in the comment and reply respectively, but fail to capture the inter-sentence contextual information between them. This paper introduces a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) for capturing the inter-sentence contextual information between the two sentences.

Bi-ISCA uses the hidden state representations of $U$ and $V$ along with the auxiliary sentence's cell state representations ($\overrightarrow{C}$ and $\overleftarrow{C}$) to capture the inter-sentence contextual information. First, the attention mechanism computes four sets of attention scores, namely $\overrightarrow{\alpha}^{C_u}, \overleftarrow{\alpha}^{C_u}, \overrightarrow{\alpha}^{C_v}, \overleftarrow{\alpha}^{C_v} \in \mathbb{R}^n$. These sets of inter-sentence attention scores are used to generate new inter-sentence contextualized hidden representations. The scores $(\overrightarrow{\alpha}^{C_u}, \overleftarrow{\alpha}^{C_u})$ are calculated using the hidden state representations of $BiLSTM_r$ along with the forward and backward final cell states $(\overrightarrow{C}_u, \overleftarrow{C}_u)$ of $BiLSTM_c$ (as shown in Equations 5 and 6).
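Concretely, each inter-sentence attention score is a dot product between one sentence's final (forward or backward) cell state and a hidden state of the other sentence. The following pure-Python sketch (illustrative names only, not the authors' code) shows the computation for the forward comment cell state attended over the reply's hidden states:

```python
# Sketch (not the authors' code) of the Bi-ISCA attention scores in
# Equations 5-8: each score is a dot product between a final cell state
# of one sentence's Bi-LSTM and a hidden state of the other sentence.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def inter_sentence_scores(cell_state, hidden_states):
    """alpha_i = C . h_i, for every time step of the other sentence."""
    return [dot(cell_state, h) for h in hidden_states]

# Toy d = 2 example: forward cell state of the comment encoder (C_u)
# attended over three reply hidden states (h^v_1 .. h^v_3).
C_u_fwd = [1.0, 2.0]
h_v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = inter_sentence_scores(C_u_fwd, h_v)
print(scores)  # [1.0, 2.0, 3.0]
```

The remaining three score vectors are computed the same way with the other cell states, and Equations 9-12 then scale each hidden state by its score.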
Similarly, $(\overrightarrow{\alpha}^{C_v}, \overleftarrow{\alpha}^{C_v})$ are calculated using the hidden state representations of $BiLSTM_c$ along with the forward and backward final cell states $(\overrightarrow{C}_v, \overleftarrow{C}_v)$ of $BiLSTM_r$ (as shown in Equations 7 and 8). In the equations below, $(\bullet)$ denotes the dot product between two vectors.

$$\overrightarrow{\alpha}^{C_u} = [\overrightarrow{\alpha}^{C_u}_1, \ldots, \overrightarrow{\alpha}^{C_u}_n]; \qquad \overrightarrow{\alpha}^{C_u}_i = \overrightarrow{C}_u \bullet h^v_i \tag{5}$$

$$\overleftarrow{\alpha}^{C_u} = [\overleftarrow{\alpha}^{C_u}_1, \ldots, \overleftarrow{\alpha}^{C_u}_n]; \qquad \overleftarrow{\alpha}^{C_u}_i = \overleftarrow{C}_u \bullet h^v_i \tag{6}$$

$$\overrightarrow{\alpha}^{C_v} = [\overrightarrow{\alpha}^{C_v}_1, \ldots, \overrightarrow{\alpha}^{C_v}_n]; \qquad \overrightarrow{\alpha}^{C_v}_i = \overrightarrow{C}_v \bullet h^u_i \tag{7}$$

$$\overleftarrow{\alpha}^{C_v} = [\overleftarrow{\alpha}^{C_v}_1, \ldots, \overleftarrow{\alpha}^{C_v}_n]; \qquad \overleftarrow{\alpha}^{C_v}_i = \overleftarrow{C}_v \bullet h^u_i \tag{8}$$

In the next step, the attention scores $(\overrightarrow{\alpha}^{C_u}, \overleftarrow{\alpha}^{C_u})$ are multiplied back with the hidden state representations of $BiLSTM_r$ to generate two new sets of hidden representations $\overrightarrow{h}^{C_u}_v, \overleftarrow{h}^{C_u}_v \in \mathbb{R}^{n \times d}$ of the reply sentence, namely reply contextualized on comment (forward) and reply contextualized on comment (backward), respectively (as shown in Equations 9 and 10). Similarly, $(\overrightarrow{\alpha}^{C_v}, \overleftarrow{\alpha}^{C_v})$ are multiplied back with the hidden state representations of $BiLSTM_c$ to generate two new sets of hidden representations $\overrightarrow{h}^{C_v}_u, \overleftarrow{h}^{C_v}_u \in \mathbb{R}^{n \times d}$ of the comment sentence, namely comment contextualized on reply (forward) and comment contextualized on reply (backward), respectively (as shown in Equations 11 and 12). In the equations below, $(\times)$ denotes multiplication between a scalar and a vector.

$$\overrightarrow{h}^{C_u}_v = [\overrightarrow{h}^{C_u}_{v,1}, \ldots, \overrightarrow{h}^{C_u}_{v,n}]; \qquad \overrightarrow{h}^{C_u}_{v,i} = \overrightarrow{\alpha}^{C_u}_i \times h^v_i \tag{9}$$

$$\overleftarrow{h}^{C_u}_v = [\overleftarrow{h}^{C_u}_{v,1}, \ldots, \overleftarrow{h}^{C_u}_{v,n}]; \qquad \overleftarrow{h}^{C_u}_{v,i} = \overleftarrow{\alpha}^{C_u}_i \times h^v_i \tag{10}$$

$$\overrightarrow{h}^{C_v}_u = [\overrightarrow{h}^{C_v}_{u,1}, \ldots, \overrightarrow{h}^{C_v}_{u,n}]; \qquad \overrightarrow{h}^{C_v}_{u,i} = \overrightarrow{\alpha}^{C_v}_i \times h^u_i \tag{11}$$

$$\overleftarrow{h}^{C_v}_u = [\overleftarrow{h}^{C_v}_{u,1}, \ldots, \overleftarrow{h}^{C_v}_{u,n}]; \qquad \overleftarrow{h}^{C_v}_{u,i} = \overleftarrow{\alpha}^{C_v}_i \times h^u_i \tag{12}$$

3.3 Integration and Final Prediction

The proposed model uses convolutional neural networks (CNN) [Lecun et al., 1998] for capturing location-invariant local features from the newly obtained contextualized hidden representations $\overrightarrow{h}^{C_v}_u, \overleftarrow{h}^{C_v}_u, \overrightarrow{h}^{C_u}_v, \overleftarrow{h}^{C_u}_v$. Four independent CNN blocks ($CNN_1$, $CNN_2$, $CNN_3$, $CNN_4$) are used, one for each of the contextualized hidden representations. Each CNN block consists of two convolutional layers, and both layers consist of $k$ filters of height $h$. The role of these filters is to detect particular features at different locations of the input. The output $c^l$ of the $l$-th layer consists of $k^l$ feature maps, and the $i$-th feature map ($c^l_i$) is calculated as:

$$c^l_i = b^l_i + \sum_{j=1}^{k^{l-1}} K^l_{i,j} * c^{l-1}_j \tag{13}$$

In the above equation, $b^l_i$ is a bias matrix and $K^l_{i,j}$ is a filter connecting the $j$-th feature map of layer $(l-1)$ to the $i$-th feature map of layer $l$. The output of each convolutional layer is passed through an activation function $f$. The proposed model uses LeakyReLU as its activation function:

$$f(x) = \begin{cases} x, & \text{for } x \geq 0 \\ a \cdot x, & \text{for } x < 0 \end{cases} \qquad a \in \mathbb{R} \tag{14, 15}$$

For each of the CNN blocks, the corresponding contextualized hidden representations are first concatenated ($\oplus$) and then given as input. The outputs of all the CNN blocks are flattened ($F_1, F_2, F_3, F_4 \in \mathbb{R}^{dk}$) and concatenated to generate a new vector $p \in \mathbb{R}^{4dk}$, where $d$ is the dimension of the hidden representation and $k$ is the number of convolutional filters used. This concatenated vector $p$ is then given as input to a dense layer with $4dk$ neurons, followed by the final sigmoid prediction layer.

$$F_1 = CNN_1([\overrightarrow{h}^{C_v}_{u,1} \oplus \overrightarrow{h}^{C_v}_{u,2} \oplus \cdots \oplus \overrightarrow{h}^{C_v}_{u,n}]) \tag{16}$$

$$F_2 = CNN_2([\overleftarrow{h}^{C_v}_{u,1} \oplus \overleftarrow{h}^{C_v}_{u,2} \oplus \cdots \oplus \overleftarrow{h}^{C_v}_{u,n}]) \tag{17}$$

$$F_3 = CNN_3([\overrightarrow{h}^{C_u}_{v,1} \oplus \overrightarrow{h}^{C_u}_{v,2} \oplus \cdots \oplus \overrightarrow{h}^{C_u}_{v,n}]) \tag{18}$$

$$F_4 = CNN_4([\overleftarrow{h}^{C_u}_{v,1} \oplus \overleftarrow{h}^{C_u}_{v,2} \oplus \cdots \oplus \overleftarrow{h}^{C_u}_{v,n}]) \tag{19}$$

$$p = [F_1 \oplus F_2 \oplus F_3 \oplus F_4] \tag{20}$$

$$\hat{y} = \sigma(Wp + b), \qquad W \in \mathbb{R}^{4dk};\; b \in \mathbb{R} \tag{21}$$

The proposed model uses binary cross-entropy as the training loss function, as shown in Equation 22. Here $L$ is the cost function, $\hat{y}_i \in \mathbb{R}$ represents the output of the proposed model, $y_i \in \mathbb{R}$ represents the true label, and $N \in \mathbb{N}$ represents the number of training samples.

$$L = -\frac{1}{N}\sum_{i=1}^{N} \big( y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i) \big) \tag{22}$$

4 Evaluation Setup

4.1 Dataset

This paper focuses on detecting sarcasm in user-generated short text using only the conversational context. Social media platforms like Reddit and Twitter are widely used for posting opinions and replying to others' opinions, and they have proved to be a great source for extracting conversational data. So the experiments were conducted on two publicly available benchmark datasets (Reddit and Twitter) used for the sarcasm detection task. Both datasets consist of comment and reply pairs.

SARC4 [Khodak et al., 2018] is the largest dataset available for sarcasm detection, containing millions of sarcastic/non-sarcastic comment-reply pairs from the social media site Reddit. This dataset was generated by scraping comments from Reddit containing the \s (sarcasm) tag. It contains replies, their parent comment (which acts as context), and a label that shows whether the reply was sarcastic or non-sarcastic with respect to its parent comment. To compare the performance of the model on a different (more recent) dataset, the proposed model was also evaluated on the Twitter dataset provided in the FigLang5 2020 workshop [Ghosh et al., 2020] for the "sarcasm detection shared task". It consists of sarcastic/non-sarcastic tweets and their corresponding contextual parent tweets. The sarcastic tweets were collected using hashtags like #sarcasm, #sarcastic, and #irony; similarly, non-sarcastic tweets were collected using hashtags like #happy, #sad, and #hate. This dataset sometimes contains more than one contextual parent tweet; in those cases, all of the contextual tweets are considered independently with the target tweet.

In both datasets, replies are the target comment/tweet to be classified as sarcastic/non-sarcastic, and their corresponding parent comment/tweet acts as context. Both datasets consist of comments/tweets of varying lengths, but because this paper focuses only on detecting sarcasm in short text, only the short comment/reply pairs were used: comment/reply sentences of length (number of words) less than 20 and 40 were used in the case of the SARC and Twitter datasets, respectively. In both cases, the balanced datasets contain equal proportions of sarcastic and non-sarcastic comment/reply pairs, and the imbalanced datasets maintain an approximately 20:80 ratio between sarcastic and non-sarcastic comment/reply pairs. Testing was done on 10% of the dataset and the rest was used for training; 10% of the training set was used for validation. Statistics of both datasets are shown in Table 1.

| Split | Dataset | Sarcastic pairs | Non-sarcastic pairs | Avg. words/comment (Sar. / Non-sar.) | Avg. words/reply (Sar. / Non-sar.) |
|---|---|---|---|---|---|
| Training set | Reddit (balanced) | 81205 | 81205 | 12.69 / 12.67 | 12.19 / 12.21 |
| Training set | Reddit (imbalanced) | 16303 | 81205 | 12.69 / 12.65 | 12.15 / 12.21 |
| Training set | Twitter (balanced) | 3496 | 3496 | 24.97 / 24.97 | 24.25 / 24.25 |
| Testing set | Reddit (balanced) | 9058 | 9058 | 12.71 / 12.64 | 12.14 / 12.22 |
| Testing set | Reddit (imbalanced) | 1747 | 9058 | 12.73 / 12.69 | 12.20 / 12.21 |
| Testing set | Twitter (balanced) | 874 | 874 | 24.97 / 24.97 | 24.25 / 24.25 |

Table 1: Statistics of the SARC dataset and the FigLang 2020 workshop Twitter dataset.

4.2 Data Preprocessing

The preprocessing of the textual data was done by first lowercasing all the sentences and separating punctuation from the words. We do not remove stop-words because we believe stop-words sometimes play a major role in making a sentence sarcastic, e.g., "is it?" and "am I?". A problem with social media platforms is that users use a lot of abbreviations, shortened words, and slang, like "IMO" for "in my opinion", "lmk" for "let me know", "fr" for "for", etc. These words are challenging to handle in NLP tasks, particularly in the automatic discovery of flexible word usages. To solve this problem, such words are converted to their corresponding full forms using abbreviation/slang dictionaries obtained from Urban Dictionary6. After this, all the sentences were tokenized into lists of words. The proposed model has a fixed input size for both comment and reply, but not all sentences were of the same length, so all sentences were padded to the length of the longest sentence (20 in the case of the Reddit dataset and 40 in the case of the Twitter dataset).

Word embeddings are used to give semantically meaningful dense representations to the words. Word-based embeddings are constructed using contextual words, whereas character-based embeddings are constructed from character n-grams of the words. Character-based embeddings, in contrast to word-based embeddings, solve the problem of out-of-vocabulary words and perform better in the case of infrequent words by creating word embeddings based only on their spellings. So, to generate proper representations for words, we have used FastText7, a character-based word embedding. This not only gives words better representations compared to a word-based model but also accommodates slang, shortened, and infrequent words (which commonly appear on social media platforms).

4.3 Training Details

We have used macro-averaged F1 (F1) and accuracy (Acc) scores as the evaluation metrics, as is standard for the sarcasm detection task. We have also reported Precision (P) and Recall (R) scores for the Twitter dataset as well as for the Reddit dataset (wherever available). Hyperparameter tuning was used to find optimum values of the hyperparameters. The FastText embeddings used were of size d = 30 and were trained for 30 iterations with a window size of 3 and 5 in the case of the SARC and Twitter datasets, respectively. The numbers of filters in all the convolutional blocks were [64, 64], with heights [2, 2]. The learning optimizer used is Adam with an initial learning rate of 0.01. The value of α in all the LeakyReLU layers was set to 0.3. All the models were trained for 20 epochs. L2 regularization set to 10^-2 is applied to all the feed-forward connections, along with early stopping with a patience of 5 to avoid overfitting. The mini-batch size was tuned over {100, 500, 1000, 2000, 3000, 4000}, and it was observed that mini-batch sizes of 2000 and 500 gave the best performance for the SARC and Twitter datasets, respectively.

The recent success of transformer-based language models has led to their wide usage in sentiment analysis tasks. They are known for generating high-quality, high-dimensional word representations (768-dimensional for BERT). Their drawback is that they require high processing power and memory to train. The above-mentioned configuration of the proposed model generates ≈1120K trainable parameters, and increasing either the embedding size or the number of tokens in a sentence led to an exponential increase in the number of trainable parameters. So, due to computational resource limitations, we limited our experiments to lower-dimensional word embeddings.

4 https://nlp.cs.princeton.edu/SARC/2.0/
5 sites.google.com/view/figlang2020
6 https://www.urbandictionary.com/
7 https://fasttext.cc/

5 Results

| Model | Balanced: Acc / F1 / P / R | Imbalanced: Acc / F1 / P / R |
|---|---|---|
| CNN-SVM [Poria et al., 2016] †⋆ | 68.0 / 68.0 / – / – | 69.0 / 79.0 / – / – |
| AMR [Ghaeini et al., 2018] ‡ | 69.5 / 69.5 / 74.8 / 69.7 | – / – / – / – |
| [Ghosh and Veale, 2017] ‡ | – / 67.8 / 68.2 / 67.9 | – / – / – / – |
| CUE-CNN [Amir et al., 2016] †⋆ | 70.0 / 69.0 / – / – | 73.0 / 81.0 / – / – |
| MHA-BiLSTM [Kumar et al., 2020] † | – / 77.5 / 72.6 / 83.0 | – / 56.8 / 60.3 / 53.7 |
| CASCADE [Hazarika et al., 2018] ‡⋆ | 77.0 / 77.0 / – / – | 79.0 / 86.0 / – / – |
| CASCADE (only discourse features) ‡ | 68.0 / 66.0 / – / – | 68.0 / 78.0 / – / – |
| Bi-ISCA (this paper) ‡ | 72.3 / 75.7 / 74.2 / 77.6 | 71.9 / 74.4 / 73.0 / 75.8 |
| ∆ increase w.r.t. CASCADE (only discourse features) | 4.3↑ / 9.7↑ / – / – | 3.9↑ / 3.6↓ / – / – |

† uses only the target sentence; ‡ uses context along with the target sentence; ⋆ uses personality-based features.

Table 2: Results on the SARC dataset. Models having only ‡ use only contextual text for detecting sarcasm.

| Model | P | R | F1 |
|---|---|---|---|
| Baseline (LSTM_attn) | 70.0 | 66.9 | 68.0 |
| BERT-Large+BiLSTM+SVM [Baruah et al., 2020] | 73.4 | 73.5 | 73.4 |
| BERT+CNN+LSTM [Srivastava et al., 2020] | 74.2 | 74.6 | 74.1 |
| RoBERTa+LSTM [Kumar and Anand, 2020] | 77.3 | 77.4 | 77.2 |
| RoBERTa-Large [Dong et al., 2020] | 79.1 | 79.4 | 79.0 |
| RoBERTa+Multi-Initialization Ensemble [Jaiswal, 2020] | 79.2 | 79.3 | 79.1 |
| BERT+BiLSTM+NeXtVLAD+Context Ensemble+Data Augmentation [Lee et al., 2020] | 93.2 | 93.6 | 93.1 |
| Bi-ISCA (this paper) | 89.4 | 94.8 | 91.7 |

Table 3: Results on the FigLang 2020 workshop Twitter dataset.
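The macro-averaged F1 used as the headline metric in Tables 2 and 3 computes a per-class F1 score independently and then averages them, so the sarcastic and non-sarcastic classes count equally even under the 20:80 imbalance. A minimal sketch (illustrative, not the paper's evaluation code):

```python
# Illustrative sketch of macro-averaged F1: per-class F1 scores are
# computed independently and then averaged, so a minority class (e.g.,
# sarcastic replies in the imbalanced split) is weighted equally.

def macro_f1(y_true, y_pred, classes=(0, 1)):
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1s.append(f1)
    return sum(f1s) / len(f1s)

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
print(round(macro_f1(y_true, y_pred), 3))  # 0.667
```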
Bi-ISCA focuses on using only the contextual comment/tweet for detecting sarcasm rather than any other topical/personality-based features. Using only the contextual information enriches the model's ability to capture the syntactic and semantic textual properties responsible for invoking sarcasm in any type of conversation. Table 2 reports performance results on the SARC datasets. For comparison purposes, F1-score (F1), Accuracy (Acc), Precision (P), and Recall (R) were used.

When compared with the existing works, Bi-ISCA was able to outperform all the models (marked only ‡) that use only conversational context for sarcasm detection (an improvement of ∆7.9% in F1 score compared to [Ghosh and Veale, 2017], and ∆6.2% in F1 score and ∆2.8% in accuracy compared to AMR [Ghaeini et al., 2018]), and was even able to perform better than the models (†⋆) that use personality-based features along with the target sentence (an improvement of ∆7.7% in F1 and ∆4.3% in accuracy compared to CNN-SVM [Poria et al., 2016], and ∆6.7% in F1 and ∆2.3% in accuracy compared to CUE-CNN [Amir et al., 2016]). MHA-BiLSTM [Kumar et al., 2020] had a ∆1.8% higher F1 score on the balanced dataset, but Bi-ISCA showed a drastic improvement of ∆17.6% on the imbalanced dataset, which demonstrates the ability of Bi-ISCA to handle class imbalance.

The current state-of-the-art on the SARC dataset is achieved by CASCADE. Even though CASCADE uses personality-based features and contextual information along with long sentences of average length ≈55-62 (very large compared to our dataset, which gives it the advantage of much more contextual information), Bi-ISCA was able to achieve a comparable F1 score despite using relatively short text. In comparison with the CASCADE variant that uses only discourse-based features, Bi-ISCA performed drastically better, with an increase of ∆9.7% in F1 and ∆4.3% in accuracy score on the balanced dataset.

Bi-ISCA clearly demonstrated its capability to robustly handle imbalance in the dataset, although it was unable to outperform both CASCADE models. This slightly poorer performance on the imbalanced dataset can be explained by the length of the sentences used by CASCADE, which are significantly (≈5 times) longer than the ones on which Bi-ISCA was tested. Longer sentences provide more contextual information, which improves performance, especially in the case of imbalance, where a little extra information can lead to a drastic increase in performance.

Table 3 reports Precision (P), Recall (R), and F1-score (F1) of different models from the leaderboard of the FigLang 2020 sarcasm detection shared task using the Twitter dataset. In this case, not only was Bi-ISCA able to outperform the baseline model [Ghosh et al., 2020] (improvements of ∆19.4%, ∆27.9%, and ∆23.7% in precision, recall, and F1 score, respectively), but it was also able to perform comparably to the state-of-the-art [Lee et al., 2020], with a ∆1.2% increase in recall, which further validates the performance of the proposed model. Even though all the models other than the baseline in Table 3 are transformer-based, Bi-ISCA was able to outperform all of them except the ensemble of [Lee et al., 2020].

6 Discussion

Table 4: Attention weight distribution in Reddit comment-reply pairs. Here CcR represents "Comment contextualized on Reply", whereas RcC represents "Reply contextualized on Comment"; (R) and (L) represent forward and backward attention.

The attention scores generated by the attention mechanism make the proposed model highly interpretable. Table 4 showcases the distribution of the attention scores over four sarcastic comment-reply pairs from the SARC dataset (correctly predicted by Bi-ISCA). Not only was the proposed model able to correctly detect sarcasm in these pairs of sentences, it was also able to correctly identify the words responsible for the contextual, explicit, or implicit incongruity which invokes sarcasm.
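The paper does not specify how the raw attention scores are normalized into the per-word weight distributions shown in Table 4; a standard choice is a softmax, sketched below with words from Pair 1 and made-up scores:

```python
# Hypothetical sketch of turning raw Bi-ISCA attention scores into a
# normalized per-word weight distribution for visualization. The softmax
# normalization and the raw scores below are assumptions for
# illustration, not values from the paper.
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["being", "traumatized", "is", "amazing"]
raw_scores = [0.1, 2.0, 0.1, 2.3]
weights = softmax(raw_scores)
# The incongruous words receive the largest normalized weights.
for t, w in zip(tokens, weights):
    print(f"{t}: {w:.2f}")
```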
the length of sentences used by CASCADE, which are signif- For example in Pair 1, Bi-ISCA correctly identified explic- icantly (≈5 times) greater than the ones on which Bi-ISCA itly incongruous words like "amazing" and "force" in the reply was tested. Longer sentences result in increased contextual sentence which were responsible for the sarcastic nature of information which improves performance especially in the the reply. Interestingly the word "traumatized" in the parent Copyright c 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Twelfth International Workshop Modelling and Reasoning in Context (MRC) @IJCAI 2021 7 comment also had a high attention weight value, which shows [Baruah et al., 2020] Arup Baruah, Kaushik Das, Ferdous that the proposed attention mechanism was able to learn the Barbhuiya, and Kuntal Dey. Context-aware sarcasm detec- contextual incongruity between the opposite sentiment words tion using BERT. In Proceedings of the Second Workshop like "traumatized" & "amazing" in the comment-reply pair. on Figurative Language Processing, pages 83–87, Online, Pair 2 demonstrates the model’s ability to capture words re- July 2020. Association for Computational Linguistics. sponsible for invoking sarcasm by making sentences implicitly [Cho et al., 2014] Kyunghyun Cho, Bart van Merriënboer, incongruous. Sarcasm due to implicit incongruity is usually Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Hol- the toughest to perceive. Despite this, Bi-ISCA was able to ger Schwenk, and Yoshua Bengio. Learning phrase repre- give high attention weights to words like "announces" and sentations using RNN encoder–decoder for statistical ma- "crashes & security holes". Not only this, but the proposed chine translation. 
In Proceedings of the 2014 Conference intra-sentence attention mechanism was also able to learn a on Empirical Methods in Natural Language Processing link between "microsoft" and "m" (slang for microsoft) with- (EMNLP), pages 1724–1734, Doha, Qatar, October 2014. out having any prior knowledge related to slangs. Pair 3 is Association for Computational Linguistics. also an example of an explicitly and contextually incongruous comment-reply pair, where the model was successfully able [Dong et al., 2020] Xiangjue Dong, Changmao Li, and to capture opposite sentiment words & phrases like "blind Jinho D. Choi. Transformer-based context-aware sarcasm drunk", "cautious" and "behind the wheel" that made the reply detection in conversation threads from social media. In Pro- sarcastic in nature. Pair 4 is an example of sarcasm due to ceedings of the Second Workshop on Figurative Language implicit incongruity between the words, "pause" & "watch", Processing, pages 276–280, Online, July 2020. Association and contextual incongruity simultaneously between "reported" for Computational Linguistics. & "enjoyable", both of which were successfully captured by [Eisterhold et al., 2006] Jodi Eisterhold, Salvatore Attardo, Bi-BISCA. and Diana Boxer. Reactions to irony in discourse: evidence for the least disruption principle. Journal of Pragmatics, 7 Conclusion 38(8):1239 – 1256, 2006. Focus-on Issue: Discourse and Conversation. In this paper, we introduce a novel Bi-directional Inter- [Farías et al., 2016] Delia Irazú Hernaundefineddez Farías, Sentence Attention mechanism based model (Bi-ISCA) for Viviana Patti, and Paolo Rosso. Irony detection in twit- detecting sarcasm. The proposed model not only was able to ter: The role of affective content. ACM Trans. Internet capture both intra and inter-sentence dependencies but was Technol., 16(3), July 2016. 
able to achieve state-of-the-art results in detecting sarcasm in user-generated short text using only the conversational context. Further investigation of the attention maps illustrated Bi-ISCA's ability to capture explicitly, implicitly, and contextually incongruous words & phrases responsible for invoking sarcasm. The success of the proposed model is due to the use of character-based embeddings, which take care of slang/shortened & out-of-vocabulary words; Bi-LSTMs, which capture intra-sentence dependencies between words in the same sentence; and Bi-ISCA, which captures inter-sentence dependencies between words of different sentences.

References

[Amir et al., 2016] Silvio Amir, Byron C. Wallace, Hao Lyu, Paula Carvalho, and Mário J. Silva. Modelling context with user embeddings for sarcasm detection in social media. In Proceedings of the Conference on Natural Language Learning (CoNLL), 2016.

[Bamman and Smith, 2015] David Bamman and Noah A. Smith. Contextualized sarcasm detection on Twitter. In Ninth International AAAI Conference on Web and Social Media, 2015.

[Barbieri et al., 2014] Francesco Barbieri, Horacio Saggion, and Francesco Ronzano. Modelling sarcasm in Twitter, a novel approach. In Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 50–58, Baltimore, Maryland, June 2014. Association for Computational Linguistics.

[Baruah et al., 2020] Arup Baruah, Kaushik Das, Ferdous Barbhuiya, and Kuntal Dey. Context-aware sarcasm detection using BERT. In Proceedings of the Second Workshop on Figurative Language Processing, pages 83–87, Online, July 2020. Association for Computational Linguistics.

[Cho et al., 2014] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October 2014. Association for Computational Linguistics.

[Dong et al., 2020] Xiangjue Dong, Changmao Li, and Jinho D. Choi. Transformer-based context-aware sarcasm detection in conversation threads from social media. In Proceedings of the Second Workshop on Figurative Language Processing, pages 276–280, Online, July 2020. Association for Computational Linguistics.

[Eisterhold et al., 2006] Jodi Eisterhold, Salvatore Attardo, and Diana Boxer. Reactions to irony in discourse: evidence for the least disruption principle. Journal of Pragmatics, 38(8):1239–1256, 2006.

[Farías et al., 2016] Delia Irazú Hernández Farías, Viviana Patti, and Paolo Rosso. Irony detection in Twitter: The role of affective content. ACM Transactions on Internet Technology, 16(3), July 2016.

[Ghaeini et al., 2018] Reza Ghaeini, Xiaoli Z. Fern, and Prasad Tadepalli. Attentional multi-reading sarcasm detection. CoRR, abs/1809.03051, 2018.

[Ghosh and Veale, 2016] Aniruddha Ghosh and Tony Veale. Fracking sarcasm using neural network. In Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 161–169, San Diego, California, June 2016. Association for Computational Linguistics.

[Ghosh and Veale, 2017] Aniruddha Ghosh and Tony Veale. Magnets for sarcasm: Making sarcasm detection timely, contextual and very personal. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 482–491, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.

[Ghosh et al., 2020] Debanjan Ghosh, Avijit Vajpayee, and Smaranda Muresan. A report on the 2020 sarcasm detection shared task. In Proceedings of the Second Workshop on Figurative Language Processing, pages 1–11, Online, July 2020. Association for Computational Linguistics.

[Hazarika et al., 2018] Devamanyu Hazarika, Soujanya Poria, Sruthi Gorantla, Erik Cambria, Roger Zimmermann, and Rada Mihalcea. CASCADE: Contextual sarcasm detection in online discussion forums. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1837–1848, Santa Fe, New Mexico, USA, August 2018. Association for Computational Linguistics.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Jaiswal, 2020] Nikhil Jaiswal. Neural sarcasm detection using conversation context. In Proceedings of the Second Workshop on Figurative Language Processing, pages 77–82, Online, July 2020. Association for Computational Linguistics.

[Joshi et al., 2015] Aditya Joshi, Vinita Sharma, and Pushpak Bhattacharyya. Harnessing context incongruity for sarcasm detection. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 757–762, Beijing, China, July 2015. Association for Computational Linguistics.

[Khattri et al., 2015] Anupam Khattri, Aditya Joshi, Pushpak Bhattacharyya, and Mark Carman. Your sentiment precedes you: Using an author's historical tweets to predict sarcasm. In Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 25–30, 2015.

[Khodak et al., 2018] Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. A large self-annotated corpus for sarcasm. In Proceedings of the Language Resources and Evaluation Conference (LREC), 2018.

[Kumar and Anand, 2020] Amardeep Kumar and Vivek Anand. Transformers on sarcasm detection with context. In Proceedings of the Second Workshop on Figurative Language Processing, pages 88–92, Online, July 2020. Association for Computational Linguistics.

[Kumar et al., 2020] A. Kumar, V. T. Narapareddy, V. Aditya Srikanth, A. Malapati, and L. B. M. Neti. Sarcasm detection using multi-head attention based bidirectional LSTM. IEEE Access, 8:6388–6397, 2020.

[Lecun et al., 1998] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[Lee et al., 2020] Hankyol Lee, Youngjae Yu, and Gunhee Kim. Augmenting data for sarcasm detection with unlabeled conversation context. In Proceedings of the Second Workshop on Figurative Language Processing, pages 12–17, Online, July 2020. Association for Computational Linguistics.

[Liebrecht et al., 2013] Christine Liebrecht, Florian Kunneman, and Antal van den Bosch. The perfect solution for detecting sarcasm in tweets #not. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 29–37, Atlanta, Georgia, June 2013. Association for Computational Linguistics.

[Poria et al., 2016] Soujanya Poria, Erik Cambria, Devamanyu Hazarika, and Prateek Vij. A deeper look into sarcastic tweets using deep convolutional neural networks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1601–1612, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee.

[Rajadesingan et al., 2015] Ashwin Rajadesingan, Reza Zafarani, and Huan Liu. Sarcasm detection on Twitter: A behavioral modeling approach. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM '15, pages 97–106, New York, NY, USA, 2015. Association for Computing Machinery.

[Reyes et al., 2013] Antonio Reyes, Paolo Rosso, and Tony Veale. A multidimensional approach for detecting irony in Twitter. Language Resources and Evaluation, 47(1):239–268, 2013.

[Riloff et al., 2013] Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, and Ruihong Huang. Sarcasm as contrast between a positive sentiment and negative situation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 704–714. ACL, 2013.

[Schuster and Paliwal, 1997] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.

[Srivastava et al., 2020] Himani Srivastava, Vaibhav Varshney, Surabhi Kumari, and Saurabh Srivastava. A novel hierarchical BERT architecture for sarcasm detection. In Proceedings of the Second Workshop on Figurative Language Processing, pages 93–97, Online, July 2020. Association for Computational Linguistics.

[Tay et al., 2018] Yi Tay, Anh Tuan Luu, Siu Cheung Hui, and Jian Su. Reasoning with sarcasm by reading in-between. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1010–1020, Melbourne, Australia, July 2018. Association for Computational Linguistics.

[Tsur et al., 2010] Oren Tsur, Dmitry Davidov, and Ari Rappoport. ICWSM—a great catchy name: Semi-supervised recognition of sarcastic sentences in online product reviews. In Fourth International AAAI Conference on Weblogs and Social Media, 2010.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.

[Wilson, 2006] Deirdre Wilson. The pragmatics of verbal irony: Echo or pretence? Lingua, 116(10):1722–1743, 2006.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
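As an illustration of the kind of bidirectional inter-sentence attention discussed in this paper, the following is a minimal sketch, not the authors' exact formulation: given Bi-LSTM states for a comment and a reply, each sentence attends over the other's states (producing CcR- and RcC-style attention maps) and is contextualized on the other. The dot-product scoring and the variable names are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inter_sentence_attention(comment, reply):
    """Sketch of bidirectional inter-sentence attention.

    comment: (m, d) array of Bi-LSTM states for the comment's m words
    reply:   (n, d) array of Bi-LSTM states for the reply's n words
    Returns both attention maps and the contextualized representations.
    """
    scores = comment @ reply.T        # (m, n) word-pair affinities
    c_on_r = softmax(scores, axis=1)  # each comment word attends over reply words
    r_on_c = softmax(scores.T, axis=1)  # each reply word attends over comment words
    comment_ctx = c_on_r @ reply      # comment contextualized on reply (CcR-style)
    reply_ctx = r_on_c @ comment      # reply contextualized on comment (RcC-style)
    return c_on_r, r_on_c, comment_ctx, reply_ctx

rng = np.random.default_rng(0)
c, r = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
c_on_r, r_on_c, cc, rc = inter_sentence_attention(c, r)
print(c_on_r.shape, r_on_c.shape, cc.shape, rc.shape)  # (4, 6) (6, 4) (4, 8) (6, 8)
```

Inspecting the rows of `c_on_r` and `r_on_c` is what makes such a model interpretable: high weights mark word pairs (e.g. opposite-sentiment words across the comment and reply) that the model treats as incongruous.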