Context-Based Sarcasm Detection Model in Chinese Social
Media Using BERT and Bi-GRU Models
Chenghao Jia1* and Hongying Zan1
1 College of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, China
* ch_jia@stu.zzu.edu.cn

    Abstract
    The emergence of web text has brought new challenges to sentiment analysis research. At present, most sarcasm detection approaches are not suited to situational sarcasm, which can only be identified from contextual information, and the semantics of Chinese sarcasm are more complex than those of English sarcasm. Against the background of the increasing use of sarcasm on the Internet, this paper proposes a Chinese social media sarcasm detection model that combines BERT and Bi-GRU to address these problems. First, we use bi-directional encoder representations from transformers (BERT) to obtain word vectors fused with their context. Then, we use a bi-directional gated recurrent unit (Bi-GRU) to extract semantic features and contextual information a second time. Finally, we obtain the sarcasm probability of an online comment through a sigmoid function. In this model, the topic background is used as part of the contextual information for detecting sarcasm, which improves the accuracy of situational sarcasm detection. At the same time, extracting semantic features twice addresses the greater semantic complexity of Chinese sarcasm.

    Keywords
    Sarcasm Detection, Bi-GRU, BERT, Social Media

1. Introduction
    Sentiment analysis is a significant part of natural language processing (NLP), and enterprises and
governments increasingly need sentiment analysis to provide a reference on public opinion [1]. For the government, accurately analyzing public sentiment from users' language is of great value for serving the people and managing public opinion. For enterprises, knowing customers'
preferences will help them improve their products and services. Moreover, if a computer can accurately understand the sentiment of a text, it can ease users' emotional pressure, which also has medical value. However, with the vigorous development of social media, many web users employ sarcasm to express their views and emotions on the Internet. For example, users often write "gou tou bao ming" or "[doge]" as a joke rather than with their literal meaning. Sarcasm is defined as a rhetorical device
whose purpose is to belittle someone or to say something in the opposite sense. In the experiments of Marrese-Taylor et al. [2], the types of irony in the dataset are divided into three categories: verbal irony, situational irony, and other types of verbal irony. Verbal irony refers to words that express the opposite of their literal meaning, while situational irony refers to irony that must be judged from context.
    Traditional sentiment analysis methods cannot correctly identify the real sentiment behind sarcasm, which seriously affects the accuracy of sentiment analysis. Therefore, how to accurately identify sarcasm
is a difficult problem in the field of sentiment analysis. Accurate sarcasm detection can not only greatly
improve the accuracy of enterprises and governments in judging online public opinion trends, but also
help computers understand the sentiment of the language at a deeper level, laying the foundation for a
more accurate and friendly human-computer interaction environment. Joshi et al. summarized two trends in sarcasm detection: pattern discovery and the role of context [3]. Finding sarcastic
patterns was an early trend in sarcasm detection. Pattern-based approaches attempt to identify sarcasm
through linguistic and pattern features. In these tasks, the traditional machine learning algorithms are
used extensively. For example, Kreuz et al. used support vector machine (SVM) to detect sarcasm [4].
Bamman and Smith used binary logistic regression [5]. Using context is a more recent trend in sarcasm detection. The term context here refers to any information beyond common sense, as well as any information beyond the text itself. Wallace et al. highlighted the need for context in sarcasm detection [6]. Amir et al. proposed a new method based on convolutional networks in 2016 that also allows the model to learn user-specific context [7]; their results showed a 2% performance improvement. Joshi et al.
summarized three types of contexts: author-specific context, conversation context, and topical context
[3]. Because the topic context of each comment can be easily obtained on social media platforms such as Weibo and Twitter, and some topics are more likely to evoke sarcasm than others, sarcasm detection through topic context is comparatively easy and straightforward in practice. For these reasons, this paper mainly considers the topic context.
    Raffel et al. verified the generally held opinion that utilizing a denoising objective commonly results
in better downstream task performance compared to a language modeling objective [8]. Marrese-Taylor et al. found that using more pre-training schemes allows a model that relies solely on pre-trained embeddings to perform better [2]. Both works demonstrate that pre-training models can improve performance.
performance. Compared with other pre-training models, the bi-directional encoder representations from
transformers (BERT) model has a stronger ability to fuse text. Therefore, this paper proposes a sarcasm
detection model combining BERT and a bi-directional gated recurrent unit (Bi-GRU). First, the comment
text and topic background are converted into dynamic vectors fused with semantics through the BERT
pre-training model. Then, the Bi-GRU extracts semantic features and contextual information again from the output of BERT. Finally, the sarcasm probability is judged from these deep semantics.
    The rest of this paper is organized as follows: Section 2 reviews previous work. Section 3 describes how the Chinese sarcasm detection model combining BERT and Bi-GRU is realized. Section 4 concludes the paper.

2. Related Works

2.1. Previous work on sarcasm detection
    Previous sarcasm detection methods can be divided into three groups: rule-based methods, statistical methods, and deep learning-based methods.
    Rule-based methods attempt to identify sarcasm through specific evidence. For example, Bharti et al. proposed two rule-based classifiers: one based on the occurrence of interjection words and one based on a parsing-based lexicon generation algorithm (PBLGA) [9]. Rule-based methods need to construct a large number of dictionaries and rule sets, which requires people to spend considerable time collecting punctuation marks, keywords, central words, demonstrative words and other features. These methods match the target text against the corresponding dictionaries, which requires repeated correction and matching. They also depend heavily on handcrafted rules and on specific languages, domains and styles. It is difficult to cover all types of sarcasm, and such methods are only suitable for small-scale data. Moreover, when new problems are encountered, new dictionaries and rules have to be constructed.
    Statistical methods, which aim to detect sarcasm through the characteristics of statements, are mainly implemented with machine learning algorithms. Riloff et al. considered hyperbole, ellipsis and imbalance in their feature set [10]. Reyes et al. used naive Bayes and decision trees to detect verbal irony in short web texts [11]. However, statistical methods still rely too much on manual annotation and domain knowledge to construct normalized texts, which makes them hard to transfer and consumes a lot of manpower and material resources.
    As architectures based on deep learning technologies become more popular, deep learning-based
methods are gradually being applied to the task of sarcasm detection. For example, Ghosh and Veale
compared their deep learning architectures with recursive support vector machines, and the result
showed that the deep learning architectures bring improvements [12]. Marrese-Taylor et al. used training methods that combine a multi-layer bi-directional long short-term memory (Bi-LSTM) network with pre-trained word embeddings [2].
    The works above detect sarcasm through pattern discovery. Even though Tay et al. used deep learning techniques [13], they essentially detect sarcasm through linguistic features. In fact, sarcasm detection cannot depend solely on the characteristics of the language; it also depends on the context. Sarcasm recognized only from linguistic features may overlook contextual cues, which leads to a decrease in accuracy. Many recent researchers therefore pay attention to context and use deep learning technology for contextual sarcasm detection. Hazarika et al. studied context that includes user information and topic information [14], where user information covers the user's style and behavior. Kolchinski et al. studied the role of author-specific context in sarcasm detection [15].

2.2. Previous work on pre-training model
    Natural language pre-training models are divided into static models and dynamic models. Static models include word2vec [16], GloVe and so on. Dynamic models include embeddings from language models (ELMo) [17], generative pre-training (GPT), BERT and so on. Although the field of NLP has developed greatly thanks to static pre-training techniques, static word vectors cannot represent polysemous words well. For example, the word "apple" can be understood as a fruit or as Apple the company depending on the context, and it may also be the title of a book or the name of a film. With static word embedding technology, the computer uses a single vector to represent all of these meanings. In 2018, the emergence of ELMo effectively solved the problem of polysemous words. Then, new pre-training models such as GPT and BERT were proposed one after another; the BERT model in particular provides a powerful pre-training tool for many tasks in the field of NLP and is an important breakthrough. The ELMo pre-training model uses a two-layer Bi-LSTM that can dynamically adjust word embeddings based on the current context, but its feature fusion capability is weaker than that of BERT. Although GPT adopts the Transformer model with its strong feature extraction ability, it only attends to the preceding context and discards the following context, so it cannot fully combine the semantic information of both directions. BERT makes up for the shortcomings of ELMo and GPT: it extracts word-level semantics from the context, so polysemous words are no longer a problem.
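    To make the contrast between static and contextual embeddings concrete, the short Python sketch below probes a contextual model with the "apple" example. It is only an illustration: it assumes the HuggingFace transformers library and the publicly released bert-base-chinese checkpoint (neither is prescribed by this paper), and the example sentences and the probe character "苹" are purely illustrative.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def contextual_vector(sentence, char):
    # bert-base-chinese tokenizes per character, so the probe character can be looked up directly
    enc = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    idx = tokens.index(char)
    with torch.no_grad():
        return bert(**enc).last_hidden_state[0, idx]

v_fruit = contextual_vector("我早上吃了一个苹果", "苹")   # "apple" as a fruit
v_brand = contextual_vector("苹果发布了新款手机", "苹")   # "Apple" the company
# a static embedding would give identical vectors here; a contextual model typically does not
print(torch.cosine_similarity(v_fruit, v_brand, dim=0))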

3. Proposed algorithm
    The sarcasm detection model combining BERT and Bi-GRU is shown in Figure 1. The model
has 5 layers: the text preprocessing layer, the word embedding layer, the BERT layer, the Bi-GRU layer
and the decision layer. The text preprocessing layer removes the part of the text that has nothing to do
with semantics, and only retains the text containing semantic information by segmenting, cleaning, and
standardizing the data in the text dataset. Then, we use the pre-training model to get the static vectors
of the corresponding words. The BERT layer further trains and learns the static vectors corresponding
to the topic background and the static vectors corresponding to the comment text, and obtains the
dynamic vectors fused with their respective contexts. The two groups of dynamic vectors output by the BERT layer are concatenated as the input of the Bi-GRU layer. The Bi-GRU layer extracts the
contextual information of comments fused with their topic background. The final states of the two directions of the Bi-GRU are concatenated with each other and passed through one or more fully connected linear layers. The output of the last linear layer is fed into a sigmoid function, which yields the estimated probability of sarcasm.
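    The following PyTorch sketch shows one way these layers could be wired together. It is a minimal illustration, not the paper's implementation: the HuggingFace bert-base-chinese backbone, the 128-unit GRU hidden size and the single linear decision layer are assumptions, and all names (BertBiGRUSarcasm, etc.) are hypothetical.

import torch
import torch.nn as nn
from transformers import BertModel

class BertBiGRUSarcasm(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", hidden_size=128):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)            # BERT layer
        self.bigru = nn.GRU(self.bert.config.hidden_size, hidden_size,
                            batch_first=True, bidirectional=True)   # Bi-GRU layer
        self.fc = nn.Linear(2 * hidden_size, 1)                     # decision layer

    def forward(self, comment, topic):
        # comment / topic: tokenizer outputs (input_ids, attention_mask, ...)
        c = self.bert(**comment).last_hidden_state                  # (B, Lc, 768)
        t = self.bert(**topic).last_hidden_state                    # (B, Lt, 768)
        seq = torch.cat([c, t], dim=1)                              # join comment and topic vectors
        _, h_n = self.bigru(seq)                                    # h_n: (2, B, hidden_size)
        final = torch.cat([h_n[0], h_n[1]], dim=-1)                 # forward + backward final states
        return torch.sigmoid(self.fc(final)).squeeze(-1)            # estimated sarcasm probability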

3.1. The text preprocessing layer and the word embedding layer
    There is a lot of noisy data in the original comments and topic background text, such as emoticons, long runs of punctuation marks, etc. First, we use regular expressions to clean the text. We treat the whole comment text as one sentence, then split the sentence by character and delete the stop words. The same is done for the topic background text. The cleaned sentences contain only characters with semantic information. We make sure that the maximum length of a sentence does not exceed the set sentence length minus 2, because the remaining 2 positions are used to store the [CLS] flag and the [SEP] flag. After preprocessing, each character of the sentence is mapped to a vector one by one. Finally, we obtain the static vectors. These word vectors are used as one of the inputs of the BERT pre-training model. After the BERT layer, dynamic word vectors fused with the context are obtained.
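    As an illustration, the sketch below implements this kind of cleaning and tokenization. The regular expressions, the stop-word list and the maximum length of 128 are assumptions for the example rather than values taken from this paper, and the HuggingFace BERT tokenizer is used because it adds the [CLS] and [SEP] flags itself.

import re
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
STOPWORDS = {"的", "了", "呢", "吧"}   # placeholder stop-word list
MAX_LEN = 128                           # set sentence length

def clean(text):
    text = re.sub(r"https?://\S+", "", text)          # remove URLs
    text = re.sub(r"\[[^\]]{1,10}\]", "", text)       # remove emoticon tags such as [doge]
    text = re.sub(r"[!?。，！？~…#@]+", "", text)      # remove runs of punctuation
    chars = [c for c in text if c.strip() and c not in STOPWORDS]
    return "".join(chars)[: MAX_LEN - 2]              # reserve 2 positions for [CLS] and [SEP]

encoded = tokenizer(clean("这个客服真是太给力了[doge]!!!"),
                    padding="max_length", truncation=True,
                    max_length=MAX_LEN, return_tensors="pt")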
Figure 1: Model combining BERT and Bi-GRU

3.2. The BERT layer
    The BERT [18] model adopts the Transformer architecture, which has a strong ability to integrate text, and uses the attention mechanism to weigh the importance of each word with respect to every other word. The vectors pre-trained by the BERT model therefore work better. The network structure of the BERT model is shown in Figure 2, where e1, …, en are the input vectors of the BERT model and T1, …, Tn are the output vectors of the BERT model.




Figure 2: Network structure of BERT model
    The formation of the input vector ei is shown in Figure 3: it is formed by adding up the corresponding elements of three different vectors. The first vector of each sentence is the [CLS] flag, which can be used for downstream classification tasks. The [SEP] flag is used as a separator between different sentences. We treat the comment text as a single sentence, so we use only one sentence vector; the topic background is likewise treated as a single sentence. PE is the position vector that records the position of each token in the sentence, and it is calculated as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d))                                      (1)
PE(pos, 2i + 1) = cos(pos / 10000^(2i/d))                                  (2)
Where pos represents the position of the token in the sentence, 2i and 2i + 1 represent the even and odd dimensions of the word vector respectively, and d represents the dimension of the input vectors, which is also the dimension of the output vectors.
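    A small NumPy sketch of formulas (1) and (2) is given below for reference. Note that the released BERT models actually learn their position embeddings, so the sinusoidal table here only illustrates the formula as presented above.

import numpy as np

def positional_encoding(max_pos, d):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
    assert d % 2 == 0, "assumes an even embedding dimension"
    pe = np.zeros((max_pos, d))
    pos = np.arange(max_pos)[:, np.newaxis]            # (max_pos, 1)
    i = np.arange(0, d, 2)                             # even dimensions 2i
    angle = pos / np.power(10000.0, i / d)
    pe[:, 0::2] = np.sin(angle)                        # even dimensions
    pe[:, 1::2] = np.cos(angle)                        # odd dimensions
    return pe

pe = positional_encoding(max_pos=128, d=768)           # one row per token position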




Figure 3: Input structure of BERT model
    The Transformer structure used in the BERT model is shown in Figure 4. The Transformer model includes an encoder and a decoder, each composed of a stack of N = 6 identical layers.




Figure 4: Structure of Transformer model
    In practice, BERT is based on matrix calculations. We need to stack all of the input vectors into a matrix E = {e1, …, en}. The multi-head attention mechanism consists of 8 self-attention heads. The input of self-attention is three different matrices: the Query matrix (Q), the Key matrix (K) and the Value matrix (V). They are obtained by multiplying the matrix E by the matrices W^Q, W^K and W^V respectively. The self-attention mechanism is calculated as:
Attention(Q, K, V) = Softmax(QK^T / √d_K) V                                (3)
    Where Q ∈ R^(n×d_K), K ∈ R^(n×d_K), V ∈ R^(n×d_V) and d_K is the vector dimension. The Softmax function normalizes each row vector of QK^T / √d_K in order to compute the weight of each word with respect to the other words.
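    Formula (3) maps directly onto a few lines of PyTorch. The sketch below is a plain restatement of the equation, with batch and head dimensions omitted for clarity; the tensor sizes are illustrative.

import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Softmax(Q K^T / sqrt(d_K)) V, as in formula (3)
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (n, n) word-to-word weights
    return F.softmax(scores, dim=-1) @ V

n, d_k, d_v = 10, 64, 64
Q, K, V = torch.randn(n, d_k), torch.randn(n, d_k), torch.randn(n, d_v)
out = attention(Q, K, V)                            # (n, d_v)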
    The output X of the multi-head attention mechanism is calculated by formulas (4) and (5):
X = MultiHead(Q, K, V) = Concat(Attention_1, …, Attention_i, …, Attention_g) · W^O        (4)
Attention_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)                                        (5)
    Where W^O is a parameter matrix to be learned during training, and W_i^Q, W_i^K and W_i^V represent W^Q, W^K and W^V in the i-th attention head respectively.
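    The projection-split-concatenate structure of formulas (4) and (5) can be sketched as follows. The module is illustrative only (d_model = 768 and g = 8 heads follow the description above), not the BERT implementation itself.

import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=768, heads=8):
        super().__init__()
        assert d_model % heads == 0
        self.h, self.d_k = heads, d_model // heads
        self.wq = nn.Linear(d_model, d_model)   # W^Q for all heads
        self.wk = nn.Linear(d_model, d_model)   # W^K for all heads
        self.wv = nn.Linear(d_model, d_model)   # W^V for all heads
        self.wo = nn.Linear(d_model, d_model)   # W^O in formula (4)

    def forward(self, E):                       # E: (B, n, d_model)
        B, n, _ = E.shape
        split = lambda x: x.view(B, n, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(self.wq(E)), split(self.wk(E)), split(self.wv(E))
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5
        heads = torch.softmax(scores, dim=-1) @ V       # Attention_1 ... Attention_g
        X = heads.transpose(1, 2).reshape(B, n, -1)     # Concat(...)
        return self.wo(X)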
    Z is obtained from the output X of the multi-head attention mechanism after residual connection and layer normalization operations. Z is the input of the fully connected feedforward network (FFNN), which consists of two fully connected layers and is calculated as:
FFNN(Z) = max(0, Z W_F1 + b_1) · W_F2 + b_2                                (6)
    Where W_F1 and W_F2 represent the parameter matrices of the FFNN layers, and b_1 and b_2 are bias vectors. They are parameters that need to be learned during training.
    After the output of the fully connected feedforward network undergoes residual connection and layer normalization operations, the result becomes the input of the next encoder. The input of the first encoder is the sentence word vector matrix, the input of each subsequent encoder is the output of the previous encoder, and the output of the last encoder is the encoded matrix, which acts on each decoder. The calculation process of the decoder is similar to that of the encoder, except that a masked multi-head attention layer is added and its masked output matrix is used as one of the inputs of the next sub-layer. Denote the output matrix of the last layer of BERT as Tr = {T1, …, Tn}. The row and column dimensions of Tr are the same as those of the BERT input matrix, and each row vector is the disambiguated deep representation of the corresponding token, which serves as the input to the downstream task.
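    Putting the attention sub-layer, the residual connections, the layer normalization and the FFNN of formula (6) together gives one encoder layer. The sketch below is a simplified stand-in for the actual BERT encoder: it uses PyTorch's built-in nn.MultiheadAttention for brevity, and the inner dimension of 3072 is an assumption.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=768, d_ff=3072, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # formula (6): FFNN(Z) = max(0, Z W_F1 + b_1) W_F2 + b_2
        self.ffnn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                  nn.Linear(d_ff, d_model))

    def forward(self, E):
        a, _ = self.attn(E, E, E)               # multi-head self-attention
        Z = self.norm1(E + a)                   # residual + layer normalization
        return self.norm2(Z + self.ffnn(Z))     # output feeds the next encoder

layers = nn.ModuleList(EncoderLayer() for _ in range(6))   # a stack of N = 6 identical layers
x = torch.randn(2, 16, 768)
for layer in layers:
    x = layer(x)                                # the last output plays the role of Tr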

3.3. The Bi-GRU layer and the decision layer
    The gated recurrent unit (GRU) [19] is a variant of long short-term memory (LSTM). GRU and LSTM are advanced variants of the recurrent neural network (RNN). Because of the serious vanishing gradient problem when an RNN processes sequences, the ability of later nodes to perceive information from earlier nodes is weak. To tackle the vanishing gradient issue, the LSTM model was proposed [20]. As a variant of LSTM, GRU is also well suited to processing sequence data. It likewise uses a gate mechanism to memorize the information of previous nodes, thus alleviating the vanishing gradient problem. The GRU model is a simplified version of the LSTM model, and its gate structure differs from that of LSTM: GRU merges the input gate and the forget gate into an update gate, so it contains only two gate structures, the reset gate and the update gate. Its linear self-update is not built on an additional memory cell; instead, it accumulates linearly on the hidden state and is regulated by the gate structure. The reset gate determines how to combine the previous information with the current input. The update gate determines how much previous information is retained. Compared with LSTM, GRU has fewer parameters and reduces the risk of overfitting. Because most social media texts are short, the GRU model is more suitable than the LSTM model.
The GRU network model is shown in Figure 5.




Figure 5: Network structure of GRU model
    Where x is the input data, h is the output of the GRU unit, r is the reset gate, and z is the update gate. Together, r and z complete the computation from the previous hidden state h_{t-1} to the new hidden state h_t. The update gate considers the current input x_t and the previous memory h_{t-1} simultaneously, and outputs z_t, a number between 0 and 1:
z_t = σ(W_z · [h_{t-1}, x_t] + b_z)                                        (7)
    Where z_t determines how much of h_{t-1} is carried over to the next state: 0 represents complete abandonment and 1 represents complete retention. In formula (7), σ is the sigmoid function, W_z is the weight matrix of the update gate and b_z is its bias.
    The reset gate controls the importance of h_{t-1} to the candidate state h̃_t. If the previous memory h_{t-1} is completely irrelevant to the new memory, the reset gate removes the influence of the previous memory:
r_t = σ(W_r · [h_{t-1}, x_t] + b_r)                                        (8)
    The new candidate memory h̃_t is generated under the control of the reset gate:
h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t] + b_h)                               (9)
    The output at the current time step is h_t:
h_t = (1 − z_t) h_{t-1} + z_t h̃_t                                          (10)
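    For reference, a single GRU step following formulas (7)-(10) can be written as below. The weight shapes are assumptions of the sketch (each matrix acts on the concatenation of h_{t-1} and x_t), and in practice one would simply use torch.nn.GRU.

import torch

def gru_step(x_t, h_prev, W_z, b_z, W_r, b_r, W_h, b_h):
    hx = torch.cat([h_prev, x_t], dim=-1)
    z_t = torch.sigmoid(hx @ W_z + b_z)                       # update gate, formula (7)
    r_t = torch.sigmoid(hx @ W_r + b_r)                       # reset gate, formula (8)
    h_tilde = torch.tanh(torch.cat([r_t * h_prev, x_t], dim=-1) @ W_h + b_h)  # formula (9)
    return (1 - z_t) * h_prev + z_t * h_tilde                 # new hidden state, formula (10)

d_in, d_h = 768, 128
weights = [torch.randn(d_h + d_in, d_h) for _ in range(3)]
biases = [torch.zeros(d_h) for _ in range(3)]
h = gru_step(torch.randn(d_in), torch.zeros(d_h),
             weights[0], biases[0], weights[1], biases[1], weights[2], biases[2])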
    A unidirectional GRU only captures the preceding information and ignores the following information. However, the semantic information of a sentence is related to both. Therefore, this paper uses a bi-directional GRU neural network, which can obtain the preceding and following information simultaneously and improves the accuracy of feature extraction. Moreover, Bi-GRU has the benefits of low dependence on word vectors, low complexity and fast response time. The Bi-GRU [21] model is composed of two superimposed GRU networks, so each time step contains two GRU units running in opposite directions. Each word vector of the BERT layer output matrix Tr is fed to the forward and backward GRUs of the corresponding time step. We input the BERT vectors of the comment text together with the BERT vectors of the topic background to the Bi-GRU layer, so that the output in each direction takes the topic background into account.
    The final states of the two directions of the Bi-GRU are concatenated with each other and passed through one or more fully connected linear layers in the decision layer. The output of the last linear layer is fed into a sigmoid function, which yields the estimated probability of sarcasm.
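    The sketch below shows this step in isolation. The hidden size of 128, the extra 64-unit linear layer and the dummy input shapes are assumptions for illustration.

import torch
import torch.nn as nn

bigru = nn.GRU(input_size=768, hidden_size=128, batch_first=True, bidirectional=True)
decision = nn.Sequential(nn.Linear(2 * 128, 64), nn.ReLU(), nn.Linear(64, 1))

Tr = torch.randn(4, 160, 768)                  # BERT vectors of comment + topic, batch of 4
_, h_n = bigru(Tr)                             # h_n: (2, batch, 128), one state per direction
final = torch.cat([h_n[0], h_n[1]], dim=-1)    # concatenate forward and backward final states
p_sarcasm = torch.sigmoid(decision(final)).squeeze(-1)   # estimated sarcasm probability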

3.4. Implementation Issues
    First, the text is preprocessed by the text preprocessing layer and the word embedding layer, yielding noise-free token vectors. Second, we input the static vectors of the comment text and the static vectors of the topic background into the BERT layer separately to obtain dynamic vectors that capture the sentiment of each text. Next, we input the dynamic vectors of the comment text together with the dynamic vectors of the topic background into the Bi-GRU layer, which yields output vectors fused with the topic context. Finally, we obtain the estimated probability of sarcasm through the decision layer.
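    A possible training step under this pipeline is sketched below. It assumes the hypothetical BertBiGRUSarcasm module sketched earlier in Section 3, hand-written example texts and binary labels (1 = sarcastic, 0 = literal), and it uses binary cross-entropy, a natural choice for a sigmoid output but not one stated in this paper.

import torch
import torch.nn as nn
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertBiGRUSarcasm()                       # the sketch from Section 3
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.BCELoss()

comments = ["这服务真是好得没话说", "今天的天气很好"]
topics = ["某平台客服被曝敷衍用户", "周末出游"]              # hypothetical topic backgrounds
labels = torch.tensor([1.0, 0.0])                            # 1 = sarcastic, 0 = literal

c = tokenizer(comments, padding=True, truncation=True, max_length=128, return_tensors="pt")
t = tokenizer(topics, padding=True, truncation=True, max_length=128, return_tensors="pt")
loss = criterion(model(c, t), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()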

4. Conclusion
    The sarcasm detection model proposed in this paper combines BERT and Bi-GRU. It uses the BERT model to obtain the dynamic word vectors of the comment text and the topic background respectively, which enables the model to better understand the text semantics. Then, the model uses Bi-GRU to obtain a result fused with the topic context and extracts semantic features again to strengthen the model's understanding of the semantics. The analysis in the third section supports the effectiveness of the method. In future work, we will consider replacing BERT with optimized variants of BERT or with newer pre-training models such as XLNet, which is built on the Transformer-XL network, to improve accuracy. On the other hand, since there is no authoritative Chinese sarcasm dataset at present, sarcasm corpora can only be annotated manually by each research group, and due to subjective factors the quality of such corpora cannot be guaranteed. Building an appropriate experimental dataset will therefore be an important direction for future work.
5. References
[1] Ravi, K. and V. Ravi, A survey on opinion mining and sentiment analysis: tasks, approaches and
     applications[J]. Knowledge-based systems, 2015. 89: p. 14-46.
[2] Marrese-Taylor, E., et al., IIIDYT at SemEval-2018 Task 3: Irony detection in English tweets[J].
     arXiv preprint arXiv:1804.08094, 2018.
[3] Joshi, A., P. Bhattacharyya, and M.J. Carman, Automatic sarcasm detection: A survey[J]. ACM
     Computing Surveys (CSUR), 2017. 50(5): p. 1-22.
[4] Kreuz, R. and G. Caucci. Lexical influences on the perception of sarcasm[C]. in Proceedings of
     the Workshop on computational approaches to Figurative Language. 2007.
[5] Bamman, D. and N.A. Smith. Contextualized sarcasm detection on twitter[C]. in Ninth
     international AAAI conference on web and social media. 2015.
[6] Wallace, B.C., L. Kertz, and E. Charniak. Humans require context to infer ironic intent (so
     computers probably do, too)[C]. in Proceedings of the 52nd Annual Meeting of the Association
     for Computational Linguistics (Volume 2: Short Papers). 2014.
[7] Amir, S., et al., Modelling context with user embeddings for sarcasm detection in social media[J].
     arXiv preprint arXiv:1607.00976, 2016.
[8] Raffel, C., et al., Exploring the limits of transfer learning with a unified text-to-text transformer[J].
     arXiv preprint arXiv:1910.10683, 2019.
[9] Bharti, S.K., K.S. Babu, and S.K. Jena. Parsing-based sarcasm sentiment recognition in twitter
     data[C]. in 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis
     and Mining (ASONAM). 2015. IEEE.
[10] Riloff, E., et al. Sarcasm as contrast between a positive sentiment and negative situation[C]. in
     Proceedings of the 2013 conference on empirical methods in natural language processing. 2013.
[11] Reyes, A., P. Rosso, and T. Veale, A multidimensional approach for detecting irony in twitter[J].
     Language resources and evaluation, 2013. 47(1): p. 239-268.
[12] Muresan, S., et al., Identification of nonliteral language in social media: A case study on
     sarcasm[J]. Journal of the Association for Information Science and Technology, 2016. 67(11): p.
     2725-2737.
[13] Tay, Y., et al., Reasoning with sarcasm by reading in-between[J]. arXiv preprint
     arXiv:1805.02856, 2018.
[14] Hazarika, D., et al., Cascade: Contextual sarcasm detection in online discussion forums[J]. arXiv
     preprint arXiv:1805.06413, 2018.
[15] Kolchinski, Y.A. and C. Potts, Representing social media users for sarcasm detection[J]. arXiv
     preprint arXiv:1808.08470, 2018.
[16] Mikolov, T., et al., Efficient estimation of word representations in vector space[J]. arXiv preprint
     arXiv:1301.3781, 2013.
[17] Peters, M., et al., Deep Contextualized Word Representations[J]. arXiv preprint arXiv:1802.05365,
     2018.
[18] Devlin, J., et al., Bert: Pre-training of deep bidirectional transformers for language
     understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
[19] Chung, J., et al., Empirical evaluation of gated recurrent neural networks on sequence modeling[J].
     arXiv preprint arXiv:1412.3555, 2014.
[20] Hinton, G.E. Learning distributed representations of concepts[C]. in Proceedings of the eighth
     annual conference of the cognitive science society. 1986. Amherst, MA.
[21] Cho, K., et al., Learning phrase representations using RNN encoder-decoder for statistical machine
     translation[J]. arXiv preprint arXiv:1406.1078, 2014.