=Paper= {{Paper |id=Vol-2995/paper1 |storemode=property |title=Bi-ISCA: Bidirectional Inter-Sentence Contextual Attention Mechanism for Detecting Sarcasm in User Generated Noisy Short Text |pdfUrl=https://ceur-ws.org/Vol-2995/paper1.pdf |volume=Vol-2995 |authors=Prakamya Mishra,Saroj Kaushik,Kuntal Dey |dblpUrl=https://dblp.org/rec/conf/ijcai/MishraKD21 }} ==Bi-ISCA: Bidirectional Inter-Sentence Contextual Attention Mechanism for Detecting Sarcasm in User Generated Noisy Short Text== https://ceur-ws.org/Vol-2995/paper1.pdf
Twelfth International Workshop Modelling and Reasoning in Context (MRC) @IJCAI 2021                                                          1




         Bi-ISCA: Bidirectional Inter-Sentence Contextual Attention Mechanism for
                  Detecting Sarcasm in User Generated Noisy Short Text

                               Prakamya Mishra1∗ , Saroj Kaushik2 , Kuntal Dey3
                                          1,2 Shiv Nadar University
                                         3 Accenture Technology Labs
                             {pm669, saroj.kaushik}@snu.edu.in, kuntal.dey@accenture.com

                              Abstract

   Many online comments on social media platforms are hateful, humorous, or sarcastic. The sarcastic nature of these comments (especially the short ones) alters their actual implied sentiments, which leads to misinterpretations by existing sentiment analysis models. A lot of research has already been done to detect sarcasm in text using user-based, topical, and conversational information, but not much work has been done on using inter-sentence contextual information for the same purpose. This paper proposes a new deep learning architecture that uses a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) to capture inter-sentence dependencies for detecting sarcasm in user-generated short text using only the conversational context. The proposed deep learning model demonstrates the capability to capture explicit, implicit, and contextually incongruous words & phrases responsible for invoking sarcasm. Bi-ISCA generates results comparable to the state-of-the-art on two widely used benchmark datasets for the sarcasm detection task (Reddit and Twitter). To the best of our knowledge, none of the existing models use an inter-sentence contextual attention mechanism to detect sarcasm in user-generated short text using only the conversational context.

1   Introduction

Sentiment analysis is one of the most important natural language processing (NLP) applications. Its goal is to identify, extract, quantify, and study subjective information. The sudden rise in the usage of social media platforms as a means of communication has led to a vast amount of data being shared between their users on a wide range of topics. This type of data is very helpful to several organizations for analyzing the sentiments of people towards products, movies, political events, etc. Understanding the unique intricacies of human language remains one of the most important open NLP problems of this time. Humans regularly use sarcasm as a crucial part of day-to-day conversations when venting, arguing, or maybe engaging on social media platforms. Sarcastic remarks on these platforms make it difficult for existing sentiment analysis systems to identify the true intentions of the users.

   The Cambridge Dictionary1 describes sarcasm as an irony conveyed hilariously or amusingly to criticize something. Sarcasm may not show criticism on the surface but instead might have a criticizing implied meaning. This figurative aspect of sarcasm makes it difficult to detect in modern micro texts [Ghosh and Veale, 2016]. Several linguistic studies have analyzed different aspects of sarcasm. The kind of responses evoked by a comment has been considered a major indicator of sarcasm [Eisterhold et al., 2006]. [Wilson, 2006] states that circumstantial incongruity between a comment and its corresponding contextual information plays an important role in implying sarcasm.

   Previous research works have used policy-based, statistical, and deep-learning-based methods for detecting sarcasm. The use of contextual information like conversational context, author personality features, or prior knowledge of the topic has proved to be very useful. [Khattri et al., 2015] used sentiments of the author's historical tweets as context. [Rajadesingan et al., 2015] used personality features like the author's familiarity with Twitter, language (structure and word usage), and the author's familiarity with sarcasm (history of previous sarcastic tweets) for consolidating context. [Bamman and Smith, 2015] explored the use of historical terms, topics, and sentiments along with profile information as the author's context. They also exploited the use of conversational context like the immediately preceding tweets in the thread. [Joshi et al., 2015] demonstrated that concatenating the preceding comment with the objective comment in a discussion forum led to an increase in the precision score.

   Overall, in recent years a lot of work has been done on using different types of contextual information for sarcasm detection, but none of it has used inter-sentence dependencies. In this paper, we propose a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) based deep learning neural network for sarcasm detection. The main contributions of this paper can be summarised as follows:

   • We propose a new deep learning architecture that uses a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) for detecting sarcasm in short texts

   ∗ Contact Author
   1 https://dictionary.cambridge.org/




Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


     (short texts are more difficult to analyze due to the shortage of contextual information).

   • Bi-ISCA focuses on using only the conversational contextual comment/tweet for detecting sarcasm, rather than any other topical/personality-based features, as using only the contextual information enriches the model's ability to capture the syntactic and semantic textual properties responsible for invoking sarcasm.

   • We also explain model behavior and predictions by visualizing attention maps generated by Bi-ISCA, which helps in identifying significant parts of the sentences responsible for invoking sarcasm.

   The rest of the paper is organized as follows. Section 2 describes the related work. Section 3 explains the proposed model architecture for detecting sarcasm. Section 4 describes the datasets used, the pre-processing pipeline, and training details for reproducibility. Experimental results are explained in section 5, and section 6 illustrates model behavior and predictions by visualizing attention maps. Finally, we conclude in section 7.

2   Related Work

A diverse spectrum of approaches has been used to detect sarcasm. Recent sarcasm detection approaches have mainly focused either on machine learning based approaches that leverage explicitly declared relevant features, or on neural network based deep learning approaches that do not require handcrafted features. Also, recent advances in using deep learning for performing natural language processing tasks have led to a promising increase in the performance of these sarcasm detection systems.

   A lot of research has been done using bag-of-words features. However, to improve performance, scholars started to explore several other semantic and syntactic features like punctuation [Tsur et al., 2010]; emotion marks and intensifiers [Liebrecht et al., 2013]; positive verbs and negative phrases [Riloff et al., 2013]; polarity skip grams [Reyes et al., 2013]; synonyms & ambiguity [Barbieri et al., 2014]; implicit and explicit incongruity-based features [Joshi et al., 2015]; sentiment flips [Rajadesingan et al., 2015]; and affect-based features derived from multiple emotion lexicons [Farías et al., 2016].

   Every day an enormous amount of short text data is generated by users on popular social media platforms like Twitter2 and Reddit3. Easy accessibility of such data sources has enticed researchers to use them for extracting user-based and discourse-based features. [Hazarika et al., 2018] utilized contextual information by making user-embeddings for capturing indicative behavioral traits. These user-embeddings incorporated personality features along with the author's writing style (using historical posts). They also used discourse comments along with background cues and topical information for detecting sarcasm. They performed their experiments on the largest Reddit dataset, SARC [Khodak et al., 2018]. Many have only used the target text for classification purposes, where a target text is a textual unit that has to be classified as sarcastic or not. Simply using gated recurrent units (GRU) [Cho et al., 2014] or long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997] does not capture the in-between interactions of word pairs, which makes it difficult to model contrast and incongruity. [Tay et al., 2018] were able to solve this problem by looking in between word pairs using a multi-dimensional intra-attention recurrent network. They focused on modeling the intra-sentence relationships among the words. [Kumar et al., 2020] exploited a multi-head attention mechanism [Vaswani et al., 2017] which could capture dependencies between different representation subspaces in different positions. Their model consisted of a word encoder for generating new word representations by summarizing comment contextual information in a bidirectional manner. On top of that, they used multi-head attention for focusing on different contexts of a sentence, and in the end, a simple multi-layer perceptron was used for classification.

   There has not been much work done on conversation-dependent (comment and reply) approaches for sarcasm detection. [Ghaeini et al., 2018] proposed a model that not only used information from the target utterance but also used its conversational context to perceive sarcasm. They aimed to detect sarcasm by just using the sequences of sentences, without any extra knowledge about the user and topic. They combined the predictions from utterance-only and conversation-dependent parts for generating the final prediction, which was able to capture the words responsible for delivering sarcasm. [Ghosh and Veale, 2017] also modeled conversational context for sarcasm detection, and attempted to derive what parts of the conversational context triggered a sarcastic reply. Their proposed model used sentence embeddings created by taking an average of word embeddings, and a sentence-level attention mechanism was used to generate attention-induced representations of both the context and the response, which were later concatenated and used for classification.

   Among all the previous works, [Ghaeini et al., 2018] and [Ghosh and Veale, 2017] share similar motives of detecting sarcasm using only the conversational context. However, we introduce a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) for detecting sarcasm. Unlike previous works, our work considers short texts, where sarcasm is far more challenging to detect than in long texts, as long texts provide much more contextual information.

3   Model

This section introduces the proposed Bi-ISCA: Bidirectional Inter-Sentence Contextual Attention based neural network for sarcasm detection (as shown in Figure 1). Sarcasm detection is a binary classification task that tries to predict whether a given comment is sarcastic or not. The proposed model uses comment-reply pairs for detecting sarcasm. The input to the model is represented by $U = [W_1^u, W_2^u, \ldots, W_n^u]$ and $V = [W_1^v, W_2^v, \ldots, W_n^v]$, where $U$ represents the comment sentence and $V$ represents the reply sentence (both sentences padded to a length of $n$). Here, $W_i^u, W_j^v \in \mathbb{R}^d$ are $d$-dimensional word embedding vectors. The objective is

   2 www.twitter.com/
   3 www.reddit.com/







                  Figure 1: Bi-ISCA: Bi-Directional Inter-Sentence Contextual Attention Mechanism for Sarcasm Detection.


to predict the label $y$, which indicates whether the reply to the corresponding comment was sarcastic or not.

3.1   Intra-Sentence Word Encoder Layer

The primary purpose of this layer is to summarize intra-sentence contextual information from both directions in both sentences (comment & reply) using Bidirectional Long Short-Term Memory networks (Bi-LSTM). A Bi-LSTM [Schuster and Paliwal, 1997] processes information in both directions using a forward LSTM [Hochreiter and Schmidhuber, 1997] $\overrightarrow{h}$ that reads the sentence $S = [w_1, w_2, \ldots, w_n]$ from $w_1$ to $w_n$, and a backward LSTM $\overleftarrow{h}$ that reads the sentence from $w_n$ to $w_1$. The hidden states from both LSTMs are added to get the final hidden state representation of each word. So the hidden state representation of the $t$-th word ($h_t$) is the sum of the $t$-th hidden representations of the forward and backward LSTMs ($\overrightarrow{h_t}$, $\overleftarrow{h_t}$), as shown in the equations below.

$$\overrightarrow{h_t} = \overrightarrow{LSTM}(w_t, \overrightarrow{h_{t-1}}); \quad \overleftarrow{h_t} = \overleftarrow{LSTM}(w_t, \overleftarrow{h_{t-1}}) \tag{1}$$

$$h_t = \overleftarrow{h_t} + \overrightarrow{h_t} \tag{2}$$

   This Intra-Sentence Word Encoder Layer consists of two independent Bidirectional LSTMs, one for the comment ($BiLSTM_c$) and one for the reply ($BiLSTM_r$). Apart from the hidden states, both Bi-LSTMs also generate separate (forward & backward) final cell states, represented by $\overrightarrow{C}$ & $\overleftarrow{C}$. The comment sentence $U$ is given as input to $BiLSTM_c$ and the reply sentence $V$ is given as input to $BiLSTM_r$. The outputs of both Bi-LSTMs are represented by equations 3 and 4.

$$\overrightarrow{C_u}, h_u, \overleftarrow{C_u} = BiLSTM_c(U) \tag{3}$$

$$\overrightarrow{C_v}, h_v, \overleftarrow{C_v} = BiLSTM_r(V) \tag{4}$$

   Here, $\overrightarrow{C_u}, \overrightarrow{C_v} \in \mathbb{R}^d$ are the final cell states of the forward LSTMs corresponding to $BiLSTM_c$ & $BiLSTM_r$; $\overleftarrow{C_u}, \overleftarrow{C_v} \in \mathbb{R}^d$ are the final cell states of the backward LSTMs corresponding to $BiLSTM_c$ & $BiLSTM_r$; $h_u = [h_1^u, h_2^u, \ldots, h_n^u]$ and $h_v = [h_1^v, h_2^v, \ldots, h_n^v]$ are the hidden state representations of $BiLSTM_c$ & $BiLSTM_r$ respectively, where $h_i^u, h_j^v \in \mathbb{R}^d$ and $h_u, h_v \in \mathbb{R}^{n \times d}$.

3.2   Bi-ISCA: Bidirectional Inter-Sentence Contextual Attention Mechanism

Sarcasm is context-dependent in nature. Even humans sometimes have a hard time understanding sarcasm without any contextual information. The hidden states generated by the two Bi-LSTMs ($BiLSTM_c$ & $BiLSTM_r$) capture the intra-sentence bidirectional contextual information in the comment & reply respectively, but fail to capture the inter-sentence contextual information between them. This paper introduces a novel Bidirectional Inter-Sentence Contextual Attention mechanism (Bi-ISCA) for capturing the inter-sentence contextual information between the two sentences.

   Bi-ISCA uses the hidden state representations of $U$ & $V$ along with the auxiliary sentence's cell state representations ($\overrightarrow{C}$ & $\overleftarrow{C}$) to capture the inter-sentence contextual information. First, the attention mechanism computes four sets of attention scores, namely $\alpha^{\overrightarrow{C_u}}, \alpha^{\overleftarrow{C_u}}, \alpha^{\overrightarrow{C_v}}, \alpha^{\overleftarrow{C_v}} \in \mathbb{R}^n$. These sets of inter-sentence attention scores are used to generate new inter-sentence contextualized hidden representations. ($\alpha^{\overrightarrow{C_u}}$, $\alpha^{\overleftarrow{C_u}}$) are calculated using the hidden state representations of $BiLSTM_r$ along with the forward and backward final states ($\overrightarrow{C_u}$, $\overleftarrow{C_u}$) of $BiLSTM_c$ (as shown in equations 5 & 6).
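To make the attention computation concrete, the scoring of equations 5–8 and the rescaling of equations 9–12 can be sketched in NumPy as follows. This is a minimal sketch under our own naming conventions — the function `bi_isca_attention` and its signature are ours, not the authors' released code — for a single unbatched comment/reply pair of length n with hidden size d.

```python
import numpy as np

def bi_isca_attention(h_u, h_v, cu_fwd, cu_bwd, cv_fwd, cv_bwd):
    """Bi-ISCA attention as we read equations 5-12.

    h_u, h_v : (n, d) hidden states of the comment / reply Bi-LSTMs.
    c*_fwd, c*_bwd : (d,) final forward / backward cell states.
    Returns the four contextualized representations, each (n, d).
    """
    # Equations 5-8: the score for word i is a plain dot product between
    # the *other* sentence's final cell state and the hidden state h_i.
    a_cu_f = h_v @ cu_fwd          # (n,) scores over reply words
    a_cu_b = h_v @ cu_bwd
    a_cv_f = h_u @ cv_fwd          # (n,) scores over comment words
    a_cv_b = h_u @ cv_bwd
    # Equations 9-12: rescale each hidden state by its scalar score.
    hv_on_cu_f = a_cu_f[:, None] * h_v   # reply contextualized on comment (fwd)
    hv_on_cu_b = a_cu_b[:, None] * h_v   # reply contextualized on comment (bwd)
    hu_on_cv_f = a_cv_f[:, None] * h_u   # comment contextualized on reply (fwd)
    hu_on_cv_b = a_cv_b[:, None] * h_u   # comment contextualized on reply (bwd)
    return hv_on_cu_f, hv_on_cu_b, hu_on_cv_f, hu_on_cv_b
```

Note that, as written in the paper's equations, the scores are raw dot products with no softmax normalization: each hidden state is simply rescaled by how strongly it aligns with the other sentence's summary cell state.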




                        −
                        →     ←
                              −
5 & 6), similarly (αCv , αCv ) are calculated using the hidden                  In the above equation, bli is a bias matrix and Ki,j
                                                                                                                                 l
                                                                                                                                     is a filter
state representations of BiLST Mc along with the forward                      connecting j feature map of layer (l − 1) to the i feature
                                                                                          th                                        th
                             −→ ←   −
and backward final states (Cv , Cv ) of BiLST Mr (as shown                    map of layer (l). The output of each convolution layer is
in equations 7 & 8). In the equations below (•) represents a                  passed through a activation function f . The proposed model
dot product between two vectors.                                              uses LeakyReLu as its activation function.
          −
          →       −→    −→           −
                                     →       −
                                             →    −
                                                  →                                             
        αCu = [α1Cu , α2Cu , ...., αnCu ]; αiCu = Cu • hvi  (5)                                    a ∗ x, for x ≥ 0; a ∈ R                 (14)
                                                                                            f=
                                                                                                   x,         for x < 0                    (15)
          ←−      ←−     ←−           ←−      ←−   ←−
         αCu = [α1Cu , α2Cu , ...., αnCu ]; αiCu = Cu • hvi          (6)         For each of the CNN blocks, the corresponding contextu-
                                                                              alized hidden representations are first concatenated (⊕) and
          −
          →       −
                  →      −
                         →            ←−      −
                                              →    −
                                                   →                          then given as input. The outputs of all the CNN blocks are
         αCv = [α1Cv , α2Cv , ...., αnCv ]; αiCv = Cv • hui          (7)      flattened (F1 , F2 , F3 , F4 ∈ Rdk ) and concatenated to generate
           ←−      ← −    ←
                          −           ←−      ←−   ←−                         a new vector (p ∈ R4dk ), where d represents the dimension of
         αCv = [α1Cv , α2Cv , ...., αnCv ]; αiCv = Cv • hui   (8)             the hidden representation and k represents number of convolu-
   In the next step, the above calculated sets of inter-sentence              tional filters used. This concatenated (p) vector is then given
                    −→    ←−                                                  as input to a dense layer having 4dk neurons and is followed
attention scores αCu , αCu ) are multiplied back with the hid-
                                                                              by the final sigmoid prediction layer.
den state representations of BiLST Mr to generate two new
                                     −→     ← −                                                               −
                                                                                                              →        −
                                                                                                                       →                  −
                                                                                                                                          →
set of hidden representations hC     v , hv
                                       u    Cu
                                                 ∈ Rn×d of the re-                         F1 = CN N1 ([hC      Cv            Cv
                                                                                                         u,1 ⊕ hu,2 ⊕ .... ⊕ hu,n ])
                                                                                                          v
                                                                                                                                                (16)
ply sentence namely, reply contextualized on comment (for-
ward) & reply contextualized on comment (backward) respec-                                                    ←
                                                                                                              −        ←
                                                                                                                       −                  ←
                                                                                                                                          −
                                                            −
                                                            →  ←−
tively (as shown in equations 9 & 10). Similarly αCv , αCv                                 F2 = CN N2 ([hC      Cv            Cv
                                                                                                         u,1 ⊕ hu,2 ⊕ .... ⊕ hu,n ])
                                                                                                          v
                                                                                                                                                (17)
are multiplied back with the hidden state representations of                                                  −
                                                                                                              →        −
                                                                                                                       →                  −
                                                                                                                                          →
BiLST Mc to generate two new set of hidden representations                                  F3 = CN N3 ([hC      Cu            Cu
                                                                                                          v,1 ⊕ hv,2 ⊕ .... ⊕ hv,n ])
                                                                                                            u
                                                                                                                                                (18)
  −
  →     ←−
hCu , hu ∈ R
   v    Cv     n×d
                    of the comment sentence namely, comment
                                                                                                              ←−       ←−                 ←−
contextualized on reply (forward) & comment contextualized                                  F4 = CN N4 ([hC      Cu            Cu
                                                                                                                                                (19)
                                                                                                          v,1 ⊕ hv,2 ⊕ .... ⊕ hv,n ])
                                                                                                            u

on reply (backward) respectively (as shown in equations 11
& 12). In the equations below (×) represents multiplication
between a scalar and a vector.                                                                       p = [F1 ⊕ F2 ⊕ F3 ⊕ F4 ]                   (20)

         −
         →
               Cu
                   −
                   →
                      Cu
                          −
                          →            −
                                       →
                                             Cu
\overrightarrow{h^{C_u}_v} = [\overrightarrow{h^{C_u}_{v,1}}, \overrightarrow{h^{C_u}_{v,2}}, \ldots, \overrightarrow{h^{C_u}_{v,n}}], \text{ where } \overrightarrow{h^{C_u}_{v,i}} = \overrightarrow{\alpha^{C_u}_i} \times \overrightarrow{h^v_i}    (9)

\overleftarrow{h^{C_u}_v} = [\overleftarrow{h^{C_u}_{v,1}}, \overleftarrow{h^{C_u}_{v,2}}, \ldots, \overleftarrow{h^{C_u}_{v,n}}], \text{ where } \overleftarrow{h^{C_u}_{v,i}} = \overleftarrow{\alpha^{C_u}_i} \times \overleftarrow{h^v_i}    (10)

\overrightarrow{h^{C_v}_u} = [\overrightarrow{h^{C_v}_{u,1}}, \overrightarrow{h^{C_v}_{u,2}}, \ldots, \overrightarrow{h^{C_v}_{u,n}}], \text{ where } \overrightarrow{h^{C_v}_{u,i}} = \overrightarrow{\alpha^{C_v}_i} \times \overrightarrow{h^u_i}    (11)

\overleftarrow{h^{C_v}_u} = [\overleftarrow{h^{C_v}_{u,1}}, \overleftarrow{h^{C_v}_{u,2}}, \ldots, \overleftarrow{h^{C_v}_{u,n}}], \text{ where } \overleftarrow{h^{C_v}_{u,i}} = \overleftarrow{\alpha^{C_v}_i} \times \overleftarrow{h^u_i}    (12)

\hat{y} = \sigma(Wp + b), \quad W \in \mathbb{R}^{4dk};\ b \in \mathbb{R}    (21)

The proposed model uses binary cross-entropy as the training loss function, as shown in equation 22. Here L is the cost function, \hat{y}_i \in \mathbb{R} represents the output of the proposed model, y_i \in \mathbb{R} represents the true label, and N \in \mathbb{N} represents the number of training samples.

L = -\frac{1}{N} \sum_{i=1}^{N} \big[\, y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i) \,\big]    (22)
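The arithmetic of equations 9–12 (attention-weighted hidden states) and 21–22 (sigmoid prediction and binary cross-entropy) can be sketched in a few lines of pure Python. This is only an illustration of the formulas, not the authors' implementation: the attention weights, hidden states, weight vector, and the naive pooling into p below are toy values (in the paper, p is produced by the CNN blocks of Section 3.3).

```python
import math

def contextualize(alphas, hidden):
    """Eq. 9-12: scale each hidden state h_i by its attention weight alpha_i."""
    return [[a * x for x in h] for a, h in zip(alphas, hidden)]

def predict(W, p, b):
    """Eq. 21: y_hat = sigmoid(W . p + b) for a pooled feature vector p."""
    z = sum(w * x for w, x in zip(W, p)) + b
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(y_true, y_pred):
    """Eq. 22: mean binary cross-entropy over N samples."""
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / n

# Toy example: 3 words with 2-dimensional hidden states (values are arbitrary).
alphas = [0.7, 0.2, 0.1]                      # attention weights
hidden = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
weighted = contextualize(alphas, hidden)       # eq. 9: per-word weighted vectors

p = [sum(col) for col in zip(*weighted)]       # naive pooling (assumption)
y_hat = predict([0.5, -0.3], p, 0.1)           # W here is in R^2, not R^{4dk}
loss = bce_loss([1.0], [y_hat])                # loss for one training sample
```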
3.3 Integration and Final Prediction

The proposed model uses Convolutional Neural Networks (CNNs) [Lecun et al., 1998] to capture location-invariant local features from the newly obtained contextualized hidden representations \overleftarrow{h^{C_v}_u}, \overrightarrow{h^{C_v}_u}, \overleftarrow{h^{C_u}_v}, \overrightarrow{h^{C_u}_v}. Four independent CNN blocks (CNN_1, CNN_2, CNN_3, CNN_4) are used, one for each of the newly obtained contextualized hidden representations. Each CNN block consists of two convolutional layers, and each convolutional layer consists of k filters of height h. The role of these filters is to detect particular features at different locations of the input. The output c^l of the l-th layer consists of k^l feature maps of height h. The i-th feature map c^l_i is calculated as:

c^l_i = b^l_i + \sum_{j=1}^{k^{l-1}} K^l_{i,j} * c^{l-1}_j    (13)

4 Evaluation Setup

4.1 Dataset
This paper focuses on detecting sarcasm in user-generated short text using only the conversational context. Social media platforms like Reddit and Twitter are widely used for posting opinions and replying to others' opinions, and they have proved to be a great source of conversational data. The experiments were therefore conducted on two publicly available benchmark datasets (Reddit & Twitter) used for the sarcasm detection task. Both datasets consist of comment and reply pairs.

SARC^4 Reddit [Khodak et al., 2018] is the largest dataset available for sarcasm detection, containing millions of sarcastic/non-sarcastic comment-reply pairs from the social media site Reddit. This dataset was generated by scraping

^4 https://nlp.cs.princeton.edu/SARC/2.0/




Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                                         No. of comment-reply pairs   Avg. words per comment      Avg. words per reply
                                         Sarcastic   Non-sarcastic    Sarcastic   Non-sarcastic   Sarcastic   Non-sarcastic
Training set   Reddit    Balanced        81205       81205            12.69       12.67           12.19       12.21
               Reddit    Imbalanced      16303       81205            12.69       12.65           12.15       12.21
               Twitter   Balanced         3496        3496            24.97       24.97           24.25       24.25
Testing set    Reddit    Balanced         9058        9058            12.71       12.64           12.14       12.22
               Reddit    Imbalanced       1747        9058            12.73       12.69           12.20       12.21
               Twitter   Balanced          874         874            24.97       24.97           24.25       24.25

Table 1: Statistics of the SARC dataset and the FigLang 2020 workshop Twitter dataset.


comments from Reddit containing the /s (sarcasm) tag. It contains replies, their parent comments (which act as context), and a label indicating whether the reply was sarcastic or non-sarcastic with respect to its parent comment. To compare the performance of the model on a different, more recent dataset, the proposed model was also evaluated on the Twitter dataset provided in the FigLang^5 2020 workshop [Ghosh et al., 2020] for the "sarcasm detection shared task". It consists of sarcastic/non-sarcastic tweets and their corresponding contextual parent tweets. The sarcastic tweets were collected using hashtags like #sarcasm, #sarcastic, and #irony; similarly, non-sarcastic tweets were collected using hashtags like #happy, #sad, and #hate. This dataset sometimes contains more than one contextual parent tweet; in those cases, each contextual tweet is considered independently with the target tweet.

In both datasets, replies are the target comments/tweets to be classified as sarcastic/non-sarcastic, and their corresponding parent comments/tweets act as context. Both datasets contain comments/tweets of varying lengths, but because this paper focuses only on detecting sarcasm in short text, only short comment/reply pairs were used: comment/reply sentences shorter than 20 and 40 words were used for the SARC and Twitter datasets, respectively. In both cases, the balanced datasets contain equal proportions of sarcastic/non-sarcastic comment/reply pairs, and the imbalanced datasets maintain an approximately 20:80 ratio between sarcastic and non-sarcastic comment/reply pairs. Testing was done on 10% of each dataset and the rest was used for training; 10% of the training set was used for validation. Statistics of both datasets are shown in Table 1.

4.2 Data Preprocessing
The textual data was preprocessed by first lowercasing all sentences and separating punctuation from words. We do not remove stop-words because stop-words sometimes play a major role in making a sentence sarcastic, e.g., "is it?" and "am I?". A further challenge on social media platforms is that users use many abbreviations, shortened words, and slang words, such as "IMO" for "in my opinion", "lmk" for "let me know", and "fr" for "for". These words are difficult to handle in NLP tasks, particularly in the automatic discovery of flexible word usages. To solve this problem, such words are converted to their corresponding full forms using abbreviation/slang dictionaries obtained from Urban Dictionary^6. After this, all sentences were tokenized into lists of words. The proposed model has a fixed input size for both comment and reply, but not all sentences are of the same length, so all sentences were padded to the length of the longest sentence (20 in the case of the Reddit dataset and 40 in the case of the Twitter dataset).

Word embeddings are used to give semantically meaningful dense representations to words. Word-based embeddings are constructed from contextual words, whereas character-based embeddings are constructed from character n-grams of the words. Character-based embeddings, in contrast to word-based embeddings, solve the problem of out-of-vocabulary words and perform better on infrequent words by creating word embeddings based only on their spellings. To generate proper representations for words, we therefore used FastText^7, a character-based word embedding. This not only gives words better representations compared to a word-based model, but also incorporates slang/shortened/infrequent words (which commonly appear on social media platforms).

4.3 Training Details
We used the macro-averaged F1 score (F1) and accuracy (Acc) as evaluation metrics, as is standard for the sarcasm detection task. We also report Precision (P) and Recall (R) scores for the Twitter dataset as well as for the Reddit dataset (wherever available). Hyperparameter tuning was used to find optimum values of the hyperparameters. The FastText embeddings used were of size d = 30 and were trained for 30 iterations with a window size of 3 and 5 for the SARC and Twitter datasets, respectively. The number of filters in all convolutional blocks was [64, 64], with heights [2, 2]. The optimizer used is Adam with an initial learning rate of 0.01. The value of α in all LeakyReLU layers was set to 0.3. All models were trained for 20 epochs. L2 regularization of 10^{-2} is applied to all feed-forward connections, along with early stopping with a patience of 5 to avoid overfitting. The mini-batch size was tuned amongst {100, 500, 1000, 2000, 3000, 4000}; mini-batch sizes of 2000 and 500 gave the best performance for the SARC and Twitter datasets, respectively.

The recent success of transformer-based language models has led to their wide usage in sentiment analysis tasks. They are known for generating high-quality, high-dimensional word representations (768-dimensional for BERT). Their main drawback is that they require substantial processing power and memory to train. The above configuration of the proposed model generates ≈1120K trainable parameters, and increasing either the embedding size or the number of tokens in a sentence led to an exponential increase in the number of trainable parameters. Due to computational resource limitations, we therefore limited our experiments to lower-dimensional word embeddings.

^5 sites.google.com/view/figlang2020
^6 https://www.urbandictionary.com/
^7 https://fasttext.cc/
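The preprocessing steps of Section 4.2 (lowercasing, separating punctuation, slang expansion, tokenization, padding) can be sketched as follows. This is a minimal pure-Python illustration, not the authors' code: the slang dictionary here contains only the examples mentioned above, and the pad token name is an assumption (the paper does not specify one).

```python
import re

# Tiny slang dictionary; the paper uses full abbreviation/slang
# dictionaries derived from Urban Dictionary.
SLANG = {"imo": "in my opinion", "lmk": "let me know", "fr": "for"}

def preprocess(sentence, max_len=20, pad="<pad>"):
    """Lowercase, separate punctuation, expand slang, tokenize, and pad."""
    s = sentence.lower()
    # Put spaces around punctuation so it becomes separate tokens.
    s = re.sub(r"([^\w\s])", r" \1 ", s)
    tokens = []
    for tok in s.split():
        # Replace slang/abbreviations with their full forms.
        tokens.extend(SLANG.get(tok, tok).split())
    # Truncate/pad to the fixed model input length (20 for Reddit, 40 for Twitter).
    tokens = tokens[:max_len]
    return tokens + [pad] * (max_len - len(tokens))

tokens = preprocess("IMO this is amazing, lmk!", max_len=10)
# -> ['in', 'my', 'opinion', 'this', 'is', 'amazing', ',', 'let', 'me', 'know']
#    (the trailing '!' token is dropped by truncation at max_len)
```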






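The early-stopping rule described in Section 4.3 (stop once the validation loss has failed to improve for 5 consecutive epochs) can be sketched as a small helper. This is a generic illustration of the technique, not the authors' training loop; the function name and signature are hypothetical.

```python
def train_with_early_stopping(epochs, val_losses, patience=5):
    """Return the epoch at which training stops, given per-epoch validation
    losses; stops after `patience` consecutive epochs without improvement."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses[:epochs], start=1):
        if loss < best:          # improvement: remember it, reset the counter
            best = loss
            wait = 0
        else:                    # no improvement this epoch
            wait += 1
            if wait >= patience:
                return epoch     # early stop
    return min(epochs, len(val_losses))
```

With patience 5 and validation losses that improve for three epochs and then stall, training stops at epoch 8 rather than running all 20 epochs.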
5 Results

Models                                        Balanced                    Imbalanced
                                              Acc    F1    P     R       Acc    F1    P     R
CNN-SVM [Poria et al., 2016] †⋆               68.0   68.0  –     –       69.0   79.0  –     –
AMR [Ghaeini et al., 2018] ‡                  69.5   69.5  74.8  69.7    –      –     –     –
[Ghosh and Veale, 2017] ‡                     –      67.8  68.2  67.9    –      –     –     –
CUE-CNN [Amir et al., 2016] †⋆                70.0   69.0  –     –       73.0   81.0  –     –
MHA-BiLSTM [Kumar et al., 2020] †             –      77.5  72.6  83.0    –      56.8  60.3  53.7
CASCADE [Hazarika et al., 2018] ‡⋆            77.0   77.0  –     –       79.0   86.0  –     –
CASCADE (only discourse features) ‡           68.0   66.0  –     –       68.0   78.0  –     –
Bi-ISCA (this paper) ‡                        72.3   75.7  74.2  77.6    71.9   74.4  73.0  75.8
∆ increase w.r.t. CASCADE
(only discourse features)                     4.3↑   9.7↑  –     –       3.9↑   3.6↓  –     –
† uses only the target sentence; ‡ uses context along with the target sentence; ⋆ uses personality-based features

Table 2: Results on the SARC dataset. Models marked with only ‡ use only contextual text for detecting sarcasm.

Bi-ISCA focuses on using only the contextual comment/tweet for detecting sarcasm, rather than any other topical/personality-based features. Using only the contextual information enriches the model's ability to capture the syntactic and semantic textual properties responsible for invoking sarcasm in any type of conversation. Table 2 reports performance results on the SARC datasets. For comparison purposes, F1-score (F1), Accuracy (Acc), Precision (P), and Recall (R) were used.

When compared with existing work, Bi-ISCA outperformed all the models (marked only with ‡) that use only conversational context for sarcasm detection (an improvement of ∆7.9% in F1 score over [Ghosh and Veale, 2017]; ∆6.2% in F1 score and ∆2.8% in accuracy over AMR [Ghaeini et al., 2018]), and it even performed better than the models (†⋆) that use personality-based features along with the target sentence (an improvement of ∆7.7% in F1 and ∆4.3% in accuracy over CNN-SVM [Poria et al., 2016]; ∆6.7% in F1 and ∆2.3% in accuracy over CUE-CNN [Amir et al., 2016]). MHA-BiLSTM [Kumar et al., 2020] had a ∆1.8% higher F1 score on the balanced dataset, but Bi-ISCA showed a drastic improvement of ∆17.6% on the imbalanced dataset, which demonstrates the ability of Bi-ISCA to handle class imbalance.

The current state-of-the-art on the SARC dataset is achieved by CASCADE. Even though CASCADE uses personality-based features and contextual information along with long sentences of average length ≈55-62 words (very long compared to our dataset, which gives it the advantage of much more contextual information), Bi-ISCA achieved an F1 score comparable to it despite using relatively short text. In comparison with the CASCADE variant that uses only discourse-based features, Bi-ISCA performed drastically better, with an increase of ∆9.7% in F1 and ∆4.3% in accuracy on the balanced dataset.

Bi-ISCA clearly demonstrated its ability to robustly handle imbalance in the dataset, although it was unable to outperform both CASCADE models. This slightly poorer performance on the imbalanced dataset can be explained by the length of the sentences used by CASCADE, which are significantly (≈5 times) longer than the ones on which Bi-ISCA was tested. Longer sentences provide more contextual information, which improves performance, especially in the case of imbalance, where a little extra information can lead to a drastic increase in performance.

Models                                                      P      R      F1
Baseline (LSTM_attn)                                        70.0   66.9   68.0
BERT-Large+BiLSTM+SVM [Baruah et al., 2020]                 73.4   73.5   73.4
BERT+CNN+LSTM [Srivastava et al., 2020]                     74.2   74.6   74.1
RoBERTa+LSTM [Kumar and Anand, 2020]                        77.3   77.4   77.2
RoBERTa-Large [Dong et al., 2020]                           79.1   79.4   79.0
RoBERTa+Multi-Initialization Ensemble [Jaiswal, 2020]       79.2   79.3   79.1
BERT+BiLSTM+NeXtVLAD+Context Ensemble
+ Data Augmentation [Lee et al., 2020]                      93.2   93.6   93.1
Bi-ISCA (this paper)                                        89.4   94.8   91.7

Table 3: Results on the FigLang 2020 workshop Twitter dataset.

Table 3 reports the Precision (P), Recall (R), and F1-score (F1) of different models from the leaderboard of the FigLang 2020 sarcasm detection shared task on the Twitter dataset. In this case, not only was Bi-ISCA able to outperform the baseline model [Ghosh et al., 2020] (improvements of ∆19.4%, ∆27.9%, and ∆23.7% in precision, recall, and F1 score, respectively), but it also performed comparably to the state-of-the-art [Lee et al., 2020], with a ∆1.2% increase in recall, which further validates the performance of the proposed model. Even though all the models other than the baseline in Table 3 are transformer-based, Bi-ISCA was able to outperform all of them except the ensemble of [Lee et al., 2020].

6 Discussion

[Pairs 1-4: comment-reply text rendered with word-level attention heatmaps; not recoverable from the text extraction]

Table 4: Attention weight distribution in Reddit comment-reply pairs. Here CcR represents "Comment contextualized on Reply" whereas RcC represents "Reply contextualized on Comment"; (R) & (L) represent forward & backward attention.

The attention scores generated by the attention mechanism make the proposed model highly interpretable. Table 4 showcases the distribution of the attention scores over four sarcastic (correctly predicted by Bi-ISCA) comment-reply pairs from the SARC dataset. Not only was the proposed model able to correctly detect sarcasm in these pairs of sentences, it was also able to correctly identify the words responsible for the contextual, explicit, or implicit incongruity that invokes sarcasm.

For example, in Pair 1, Bi-ISCA correctly identified explicitly incongruous words like "amazing" and "force" in the reply sentence, which were responsible for the sarcastic nature of the reply. Interestingly, the word "traumatized" in the parent





comment also had a high attention weight, which shows that the proposed attention mechanism was able to learn the contextual incongruity between the opposite-sentiment words "traumatized" & "amazing" in the comment-reply pair. Pair 2 demonstrates the model's ability to capture words responsible for invoking sarcasm by making sentences implicitly incongruous. Sarcasm due to implicit incongruity is usually the toughest to perceive. Despite this, Bi-ISCA was able to give high attention weights to words like "announces" and "crashes & security holes". Moreover, the proposed intra-sentence attention mechanism was also able to learn a link between "microsoft" and "m" (slang for Microsoft) without any prior knowledge of slang. Pair 3 is also an example of an explicitly and contextually incongruous comment-reply pair, where the model successfully captured opposite-sentiment words & phrases like "blind drunk", "cautious", and "behind the wheel" that made the reply sarcastic in nature. Pair 4 is an example of sarcasm due to implicit incongruity between the words "pause" & "watch" and, simultaneously, contextual incongruity between "reported" & "enjoyable", both of which were successfully captured by Bi-ISCA.

7 Conclusion

In this paper, we introduced Bi-ISCA, a novel model based on a bidirectional inter-sentence attention mechanism for detecting sarcasm. The proposed model not only captures both intra- and inter-sentence dependencies but also achieves state-of-the-art results in detecting sarcasm in user-generated short text using only the conversational context. Further investigation of the attention maps illustrated Bi-ISCA's ability to capture explicitly, implicitly, and contextually incongruous words & phrases responsible for invoking sarcasm. The success of the proposed model stems from the use of character-based embeddings, which take care of slang/shortened & out-of-vocabulary words; Bi-LSTMs, which capture intra-sentence dependencies between words of the same sentence; and Bi-ISCA, which captures inter-sentence dependencies between words of different sentences.

References

[Amir et al., 2016] Silvio Amir, Byron C. Wallace, Hao Lyu, Paula Carvalho, and Mário J. Silva. Modelling context with user embeddings for sarcasm detection in social media. In Proceedings of the Conference on Natural Language Learning (CoNLL), 2016.

[Bamman and Smith, 2015] David Bamman and Noah A. Smith. Contextualized sarcasm detection on Twitter. In Ninth International AAAI Conference on Web and Social Media, 2015.

[Barbieri et al., 2014] Francesco Barbieri, Horacio Saggion, and Francesco Ronzano. Modelling sarcasm in Twitter, a novel approach. In Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 50–58, Baltimore, Maryland, June 2014. Association for Computational Linguistics.

[Baruah et al., 2020] Arup Baruah, Kaushik Das, Ferdous Barbhuiya, and Kuntal Dey. Context-aware sarcasm detection using BERT. In Proceedings of the Second Workshop on Figurative Language Processing, pages 83–87, Online, July 2020. Association for Computational Linguistics.

[Cho et al., 2014] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October 2014. Association for Computational Linguistics.

[Dong et al., 2020] Xiangjue Dong, Changmao Li, and Jinho D. Choi. Transformer-based context-aware sarcasm detection in conversation threads from social media. In Proceedings of the Second Workshop on Figurative Language Processing, pages 276–280, Online, July 2020. Association for Computational Linguistics.

[Eisterhold et al., 2006] Jodi Eisterhold, Salvatore Attardo, and Diana Boxer. Reactions to irony in discourse: evidence for the least disruption principle. Journal of Pragmatics, 38(8):1239–1256, 2006. Focus-on Issue: Discourse and Conversation.

[Farías et al., 2016] Delia Irazú Hernández Farías, Viviana Patti, and Paolo Rosso. Irony detection in Twitter: The role of affective content. ACM Transactions on Internet Technology, 16(3), July 2016.

[Ghaeini et al., 2018] Reza Ghaeini, Xiaoli Z. Fern, and Prasad Tadepalli. Attentional multi-reading sarcasm detection. CoRR, abs/1809.03051, 2018.

[Ghosh and Veale, 2016] Aniruddha Ghosh and Tony Veale. Fracking sarcasm using neural network. In Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 161–169, San Diego, California, June 2016. Association for Computational Linguistics.

[Ghosh and Veale, 2017] Aniruddha Ghosh and Tony Veale. Magnets for sarcasm: Making sarcasm detection timely, contextual and very personal. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 482–491, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.

[Ghosh et al., 2020] Debanjan Ghosh, Avijit Vajpayee, and Smaranda Muresan. A report on the 2020 sarcasm detection shared task. In Proceedings of the Second Workshop on Figurative Language Processing, pages 1–11, Online, July 2020. Association for Computational Linguistics.

[Hazarika et al., 2018] Devamanyu Hazarika, Soujanya Poria, Sruthi Gorantla, Erik Cambria, Roger Zimmermann, and Rada Mihalcea. CASCADE: Contextual sarcasm detection in online discussion forums. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1837–1848, Santa Fe, New Mexico, USA, August 2018. Association for Computational Linguistics.





[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Jaiswal, 2020] Nikhil Jaiswal. Neural sarcasm detection using conversation context. In Proceedings of the Second Workshop on Figurative Language Processing, pages 77–82, Online, July 2020. Association for Computational Linguistics.

[Joshi et al., 2015] Aditya Joshi, Vinita Sharma, and Pushpak Bhattacharyya. Harnessing context incongruity for sarcasm detection. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 757–762, Beijing, China, July 2015. Association for Computational Linguistics.

[Khattri et al., 2015] Anupam Khattri, Aditya Joshi, Pushpak Bhattacharyya, and Mark Carman. Your sentiment precedes you: Using an author's historical tweets to predict sarcasm. In Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 25–30, 2015.

[Khodak et al., 2018] Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. A large self-annotated corpus for sarcasm. In Proceedings of the Linguistic Resource and Evaluation Conference (LREC), 2018.

[Kumar and Anand, 2020] Amardeep Kumar and Vivek Anand. Transformers on sarcasm detection with context. In Proceedings of the Second Workshop on Figurative Language Processing, pages 88–92, Online, July 2020. Association for Computational Linguistics.

[Kumar et al., 2020] A. Kumar, V. T. Narapareddy, V. Aditya Srikanth, A. Malapati, and L. B. M. Neti. Sarcasm detection using multi-head attention based bidirectional LSTM. IEEE Access, 8:6388–6397, 2020.

[Lecun et al., 1998] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[Lee et al., 2020] Hankyol Lee, Youngjae Yu, and Gunhee Kim. Augmenting data for sarcasm detection with unlabeled conversation context. In Proceedings of the Second Workshop on Figurative Language Processing, pages 12–17, Online, July 2020. Association for Computational Linguistics.

[Liebrecht et al., 2013] Christine Liebrecht, Florian Kunneman, and Antal van den Bosch. The perfect solution for detecting sarcasm in tweets #not. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 29–37, Atlanta, Georgia, June 2013. Association for Computational Linguistics.

[Poria et al., 2016] Soujanya Poria, Erik Cambria, Devamanyu Hazarika, and Prateek Vij. A deeper look into sarcastic tweets using deep convolutional neural networks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1601–1612, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee.

[Rajadesingan et al., 2015] Ashwin Rajadesingan, Reza Zafarani, and Huan Liu. Sarcasm detection on Twitter: A behavioral modeling approach. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM '15, pages 97–106, New York, NY, USA, 2015. Association for Computing Machinery.

[Reyes et al., 2013] Antonio Reyes, Paolo Rosso, and Tony Veale. A multidimensional approach for detecting irony in Twitter. Language Resources and Evaluation, 47(1):239–268, 2013.

[Riloff et al., 2013] Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, and Ruihong Huang. Sarcasm as contrast between a positive sentiment and negative situation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18–21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 704–714. ACL, 2013.

[Schuster and Paliwal, 1997] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.

[Srivastava et al., 2020] Himani Srivastava, Vaibhav Varshney, Surabhi Kumari, and Saurabh Srivastava. A novel hierarchical BERT architecture for sarcasm detection. In Proceedings of the Second Workshop on Figurative Language Processing, pages 93–97, Online, July 2020. Association for Computational Linguistics.

[Tay et al., 2018] Yi Tay, Anh Tuan Luu, Siu Cheung Hui, and Jian Su. Reasoning with sarcasm by reading in-between. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1010–1020, Melbourne, Australia, July 2018. Association for Computational Linguistics.

[Tsur et al., 2010] Oren Tsur, Dmitry Davidov, and Ari Rappoport. ICWSM—a great catchy name: Semi-supervised recognition of sarcastic sentences in online product reviews. In Fourth International AAAI Conference on Weblogs and Social Media, 2010.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.

[Wilson, 2006] Deirdre Wilson. The pragmatics of verbal irony: Echo or pretence? Lingua, 116(10):1722–1743, 2006. Language in Mind: A Tribute to Neil Smith on the Occasion of his Retirement.


