=Paper= {{Paper |id=Vol-2841/DARLI-AP_17 |storemode=property |title=Emotion and sentiment analysis of tweets using BERT |pdfUrl=https://ceur-ws.org/Vol-2841/DARLI-AP_17.pdf |volume=Vol-2841 |authors=Andrea Chiorrini,Claudia Diamantini,Alex Mircoli,Domenico Potena |dblpUrl=https://dblp.org/rec/conf/edbt/ChiorriniDMP21 }} ==Emotion and sentiment analysis of tweets using BERT== https://ceur-ws.org/Vol-2841/DARLI-AP_17.pdf
           Emotion and sentiment analysis of tweets using BERT
                            Andrea Chiorrini                                                       Claudia Diamantini
                  Università Politecnica delle Marche                                       Università Politecnica delle Marche
                             Ancona, Italy                                                             Ancona, Italy
                      a.chiorrini@pm.univpm.it                                                   c.diamantini@univpm.it

                                Alex Mircoli                                                        Domenico Potena
                  Università Politecnica delle Marche                                       Università Politecnica delle Marche
                             Ancona, Italy                                                             Ancona, Italy
                         a.mircoli@univpm.it                                                       d.potena@univpm.it

ABSTRACT                                                                          through both lexicon-based [7] and learning-based approaches
The huge diffusion of social networks has made available an un-                   [11], the latter have shown better performance in terms of classifi-
precedented amount of publicly-available user-generated data,                     cation. For this reason, recent works have focused on large deep
which may be analyzed in order to determine people’s opinions                     learning models [32] [3]. In order to be accurately trained, such
and emotions. In this paper we investigate the use of Bidirectional               models require large corpora of labelled data, which are usually
Encoder Representations from Transformers (BERT) models for                       scarce and expensive to build [19].
both sentiment analysis and emotion recognition of Twitter data.                     As a consequence, pre-trained models that only need a fine-
We define two separate classifiers for the two tasks and we                       tuning phase with a smaller dataset have been widely used. In
evaluate the performance of the obtained models on real-world                     particular, many neural networks composed of a task-agnostic
tweet datasets. Experiments show that the models achieve an                       pre-trained word embedding layer (e.g., GloVe [21]) and a task-
accuracy of 0.92 and 0.90 on, respectively, sentiment analysis and                specific neural architecture have been proposed but the improve-
emotion recognition.                                                              ment of these models measured by accuracy or F1 score has
                                                                                  reached a bottleneck [14]. Anyway, recent architectures based
KEYWORDS                                                                          on Trasformer [28] have shown further room for improvement.
                                                                                     In the present paper, we investigate the enhancement in terms
sentiment analysis, emotion recognition, BERT, deep learning,
                                                                                  of classification accuracy of Bidirectional Encoder Representations
tweet sentiment analysis
                                                                                  from Transformers (BERT) [6], one of the most popular pre-
                                                                                  trained language models based on Transformer, on both the tasks
1    INTRODUCTION                                                                 of senti-ment analysis and emotion recognition. To this purpose,
In the last decade, the great diffusion of social networks, personal              we propose two BERT-based architectures for text classification
blogs and review sites has made available a huge amount of                        and we fine-tune them in order to evaluate their performance. In
publicly-available user-generated content. Such data is considered                the rest of the work we focus on data collected from microblogging
authentic, as in the above contexts people usually feel free to                   platforms and, in particular, from Twitter. The main reasons of
express their thoughts. Therefore, the analysis of this user-genera-              this choice are the wide availability of tweets (as opposed to, for
ted content provides valuable information about the opinion                       instance, Facebook posts, due to different data policies) and the
of users about a large variety of topics and products, allowing                   fact that such data are usually challenging to analyze due to the
firms to address typical marketing problems as, for instance,                     presence of slang, typos and abbreviations (e.g., "btw" for "by the
the evaluation of custo-mer satisfaction or the measurement of                    way") and hence represent a good benchmark for text classifiers.
the impact of a new marke-ting campaign on brand perception.                         The rest of the paper is structured as follows: the next section
Moreover, the analysis of customers’ opinions about a certain                     presents some relevant related work on sentiment analysis and
product can be a driver for open innovation, as it helps business                 emotion recognition. The architecture of the models used for
owners to find out possible issues and can possibly suggest                       both tasks is proposed in Section 3, while Section 4 reports the
new interesting features. For this reason, in the last years many                 results of the experimental evaluation of the models on real-
researchers (e.g., [10], [13], [22]) focused on techniques for the                world datasets of tweets. Finally, Section 5 draws conclusions
automatic analysis of writer’s opinions and emotions, general-                    and discusses future work.
ly referred to as, respectively, sentiment analysis and emotion
analysis.                                                                         2 RELATED WORK
   Sentiment analysis is the process of automatic extraction of                   2.1 Sentiment analysis
writer’s opinions and their characterization in terms of polarity:
                                                                                  With the ever increasing amount of user generated content available
positive, negative and neutral. On the other hand, emotion analysis
                                                                                  online, the field of automatic sentiment analysis has become a
has the goal of recognizing the emotion expressed in the text. This
                                                                                  topic of increasing research interest. As in many other field, deep
task is usually more difficult than sentiment analysis given the
                                                                                  learning techniques are being widely used for sentiment analysis,
greater variety of classes and the more subtle differences between
                                                                                  as demonstrated by the presence of various surveys regarding the
them. Although in literature such tasks have been addressed
                                                                                  subject over the last years [12, 16, 24]. The first complex task that
© 2021 Copyright for this paper by its author(s). Published in the Workshop       sentiment analysis must tackle is the vector representation of
Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia,   words, which is typically performed thorough word embeddings:
Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0)
                                                                                  a technique which transforms the words in a vocabulary into
                                                                                  vectors of continuous real numbers.
   The most commonly used word embeddings are Word2Vec 1                           3     MODEL
and Global Vector (GloVe) [21].                                                    In this section we describe the proposed model for the tasks
   Word2Vec is a neural network that learns the word embeddings                    of emotion and sentiment analysis. The model is built by fine-
from text, and contains both the continuous bag-of-words (CBoW)                    tuning BERT on specific datasets of tweets developed for such
model [17] and the Skip-gram model [18]. Given a set of context                    tasks. Since tweets usually contain words that are irrelevant for
words (e.g. “the girl is _ an apple,” where “_” denotes the target                 text classification, a text preprocessing phase is needed in order
word) the CBoW predicts the target word (e.g., “eating”), conversely               to remove:
the Skip-gram model, given the target word, predicts the context
                                                                                        • mentions: users often cite other Twitter usernames in their
words.
                                                                                           tweets through the character ’@’ in order to direct their
   GloVe is trained on the non-zero entries of a global word-word
                                                                                           messages;
co-occurrence matrix, rather than on the entire sparse matrix or
                                                                                        • urls: urls are very common in tweets, both for media (i.e.,
on individual context windows in a large corpus.
                                                                                           pictures and videos) and links to other webpages;
   Subsequent works have focused on further refining the idea
                                                                                        • retweets: users often resend tweets they consider relevant
of embedding. In [26, 27] the authors proposed models that learn
                                                                                           to their followers. Retweets are usually marked with the
sentiment-specific word embeddings (SSWE). In these embeddings,
                                                                                           prefix "RT" and hence are easily identifiable.
the senti-ment information is embedded in the learned word
vectors as well as the semantic. The authors of [29] designed                         After the preprocessing phase, data can be used as input to
and trained a neural network that learns a sentiment-related                       train task-specific BERT-based models. The architecture of a
embedding representation through the integration of sentiment                      generic BERT model consists of a series of bi-directional multi-
supervision both at document level and at word level. A further                    layer encoder-based Transformers. Nowadays, several pre-trained
refinement of semantics-oriented word vectors has been proposed                    BERT models are available. Table 1 shows the main BERT models
in [33] which integrates the word embedding model with standard                    as a function of the number of layers L (i.e. the number of
matrix factorization through a projection level.                                   encoders) and the number of hidden units H. Smaller models are
                                                                                   intended for environments with limited computational resources,
                                                                                   since bigger models have a large number of trainable parameters:
2.2     Emotion analysis                                                           a model of average size like BERT-Base has approximately 110
Though there is not universal agreement over which are the                         million trainable parameters, while BERT-Large has more than
primary emotions of human being, the scientific community                          340 million parameters.
is giving ever increasing attention to the specific problem of                        Specifically, the reference model used in this work is the BERT-
emotion recognition.                                                               Base, both in the uncased and the cased version. The uncased
   In [30] a bilingual attention network model has been proposed                   version implies that text is converted to lowercase before the
for code-switched emotion prediction. In particular, a document                    word tokeniza-tion process (ex. Michael Jackson becomes michael
level representation of each post has been built using a Long                      jackson) and accents are ignored. The architecture of the BERT-
Short-Term Memory (LSTM) model, while the informative words                        Base model consists of 12 encoders, each composed of 8 layers: 4
from the context have been captured through the attention mecha-                   multi-head self-attention layers and 4 feed forward layers. We
nism.                                                                              extended such model by adding a fully connected layer and a
   In [1], the authors used distant supervision to automatically                   softmax layer for classification, as reported in Figure 1. The
build a dataset for emotion detection and trained a fine-grained                   architecture is common to both the sentiment and emotion classi-
emotion detection system using Gated Recurrent Unit (GRU)                          fiers: the only difference between the two models is represented
network.                                                                           by the last softmax layer, in which the number of neurons is
   Another approach [8] focused on learning a better representa-                   equal to the number of classes (i.e., 3 for sentiment analysis and
tions of emotional contexts by using millions of emoji occurrences                 4 for emotion recognition).
in social media to pre-train neural models.
   Even more recently a Bidirectional Encoder Representations                      4     EXPERIMENTS
from Transformers (BERT) model has been proposed in [5]. This                      In this section we present some experimental results aimed at
pre-trained BERT model has provided, without any substantial                       evaluating the performance of the proposed BERT-based models.
task-specific architecture modifications, state of the art performan-              The results for the emotion analysis and the sentiment analysis
ces over various NLP tasks.                                                        tasks are discussed separately.
   In the sentiment analysis field, BERT has been mostly used in
aspect-based sentiment analysis such as in [15, 25, 31], while few                 4.1    Experimental setting
authors focused on emotion analysis.                                               The proposed models have been evaluated through two different
   In [2], the authors performed a comparative analysis of various                 datasets, namely Go et al. [9] for sentiment analysis and the Tweet
pre-trained transformer model, including BERT, for the text                        Emotion Intensity dataset [20] for emotion recognition. The same
emotion recognition problem. However, our work differs from                        criteria have been used for the experiments: in particular, each
the previous as we evaluate the performance of the emotion                         dataset has been split through a stratified sampling into train
classification when applied to social content, which is usually                    (80%), dev (10%) and test (10%) set. Moreover, we tested both the
more challenging.                                                                  uncased and the case version of BERT. Experiments have been
                                                                                   performed on a laptop with 2x2.2GHz CPU, 8GB RAM and a
                                                                                   Nvidia Geforce 740M graphics card: execution times reported in
                                                                                   the following subsections refer to such hardware configuration.
1 https://code.google.com/archive/p/word2vec/ https://code.google.com/archive/p/
                                                                                      We evaluated the model by means of two metrics: classification
word2vec/                                                                          accuracy and F1 score. Let 𝑥𝑖 𝑗 be the number of data belonging
                                                          Table 1: BERT pre-trained models

                         BERT Models                 H=128       H=256         H=512          H=768          H=1024
                         L=2                        BERT-Tiny      –             –              –               –
                         L=4                           –        BERT-Mini    BERT-Small         –               –
                         L=8                           –           –        BERT-Medium         –               –
                         L=12                          –           –             –           BERT-Base          –
                         L=24                          –           –             –              –           BERT-Large


                                                                             4.2    Emotion analysis
                                                                             In order to evaluate the performance of the proposed architecture
                                                                             on the emotion analysis task we considered the Tweet Emotion
                                                                             Intensity dataset, which consists of 6755 tweets labelled with
                                                                             respect to the following four emotions: anger, fear, happiness,
                                                                             sadness. Since samples in the original dataset were not equally
                                                                             distributed among classes, we balanced the training set by applying
                                                                             the undersampling technique. In particular, we randomly chose
                                                                             1300 tweets from each class. We also filtered out 974 meaningless
                                                                             tweets, e.g. tweets only containing non-ASCII characters or very
                                                                             short tweets. As a result, we obtained a training+dev set of 5200
                                                                             equally distributed tweets and a test set of 581 tweets with the
                                                                             class distribution reported in Figure 2.




Figure 1: The architecture of the proposed classification
model.


to 𝑗-th class which have been classified as 𝑖-th class. Let 𝐶 be
                                                                             Figure 2: Tweet Emotion Intensity dataset: class
the number of classes and 𝑁 be the total number of data. The
                                                                             distribution of the test set.
accuracy achieved by a classifier is computed as:
                                             𝐶
                                       1 ∑︂                                      Each occurrence in the dataset is associated not only with
                       𝑎𝑐𝑐𝑢𝑟𝑎𝑐𝑦 =            𝑥𝑖𝑖                     (1)
                                       𝑁 𝑖=1                                 a label emotion but also with a parameter called intensity, that
                                                                             represents the intensity of the emotion. Specifically, this parameter
Precision and recall of 𝑖-th class are determined as follows:                is a value between 0 and 1 that indicates the degree of intensity
                                       𝑥𝑖𝑖                                   with which the author of the tweet felt that emotion. In Figure 3
                      𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛𝑖 =                            (2)
                                      𝐶
                                      ∑︁                                     it is shown the histogram of the occurrences for different lengths
                                         𝑥𝑖 𝑗
                                           𝑗=1                               (in characters) of tweets; the length of 452 characters represents
                                                                             the upper-bound of lengths present in the dataset and is actually
                                           𝑥𝑖𝑖                               an isolated case given by a single tweet, while the average length
                          𝑟𝑒𝑐𝑎𝑙𝑙𝑖 =                                  (3)
                                       𝐶
                                       ∑︁                                    ranges between 9 and 50 characters. Since the model requires
                                             𝑥 𝑗𝑖
                                       𝑗=1                                   defining a maximum length for the sequence of input characters
                                                                             (the max_seq_length parameter), analyzing the histogram we
F1 score of 𝑖-th class is equal to:
                                                                             decided to set max_seq_length=95.
                                                                                 After a preliminary phase of hyperparameter tuning aimed
                                𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛𝑖 · 𝑟𝑒𝑐𝑎𝑙𝑙𝑖
                   𝐹 1𝑖 = 2 ·                                        (4)     at determining the best values for the hyperparameters of our
                                𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛𝑖 + 𝑟𝑒𝑐𝑎𝑙𝑙𝑖                         model, we trained our classifier by using the values reported in
Therefore, the F1 score achieved by a classification model is                Table 2.
defined as the average of F1𝑖 :                                                  Training required about 5’30"/epoch, while the prediction of
                                       𝐶                                     the emotion related to a tweet in test set took approximately 0.4
                                   1 ∑︂                                      seconds. We trained the model for a variable number of epochs,
                           𝐹1 =          𝐹 1𝑖                        (5)
                                   𝐶 𝑖=1                                     ranging from 1 to 6. The reason for choosing such a small number
Figure 3: Histogram of the occurrences for
different lengths (in characters) of tweets. Source:
saifmohammad.com

Table 2: Optimal hyperparameters for the emotion
recognition task.                                                      Figure 5: Cased BERT for emotion recognition: Training
                                                                       and validation loss over epochs
                       Hyperparameter    Value
                     learning_rate       2e-5                          that the classification of tweets is often very challenging (see
                     train_batch_size      8                           Section 1).
                     eval_batch_size       8
                     max_seq_length       95                           4.3    Sentiment analysis
                     adam_epsilon        1e-8                          The performance of the sentiment analysis classifier has been
                                                                       evaluated on the dataset proposed by Go et al. [9]. Such dataset is
of epochs is the fact that pre-trained models usually need a short     composed of a training set of 1,600,000 tweets annotated through
fine-tuning phase in order not to overfit the data.                    distant supervision (by considering emoticons in text) and a test
   We evaluated both the uncased and the cased version of BERT-        set of 430 manually-annotated tweets. We only considered the
Base by using the same hyper-parameter configuration. Training         latter, since that dataset has been annotated by humans and
and validation loss in function of the number of epochs are            hence it is more reliable. Each tweet has been annotated with
reported, respectively, in Figure 4 for the uncased version and in     respect to its polarity (i.e., positive, negative or neutral); the
Figure 5 for the cased one. It has to be noticed that,in line with     class distribution is reported in Table 5. The dataset is slightly
expectations, in both cases the optimal training is reached in         imbalanced and the neutral is the minority class. However, it
only 2 epochs. In fact, starting from the third epoch, even if the     does not represent a problem since we are more interested in
training error diminishes, the validation loss begins to increase:     detecting emotion-bearing tweets.
a phenomenon which is usually correlated with overfitting.                Coherently with the approach proposed in Section 3, we prepro-
                                                                       cessed the dataset in order to remove noisy words like links,
                                                                       hashtags, retweets and mentions. Similarly to Section 4.2, we
                                                                       analyzed the length (in terms of characters) of each tweet. In
                                                                       this case, we set max_seq_length=82, since tweets were shorter
                                                                       than those in the Tweet Emotion Intensity Dataset on average.
                                                                       We performed a phase of hyperparameter tuning through a grid
                                                                       search and we determined the best configuration (see Table 6).
                                                                          Due to the smaller size of the dataset, the time required by
                                                                       training was smaller: in particular, it took about 1’15"/epoch. We
                                                                       trained the model for a variable number of epochs - from 1 to 6
                                                                       - and we noticed a behavior similar to the emotion recognition
                                                                       task. As it can be observed in Figures 6 and 7, the validation loss
                                                                       reached its minimum value after a single epoch, then started to
                                                                       increase, probably due to overfitting. A possible explaination of
Figure 4: Uncased BERT for emotion recognition:                        such phenomenon is that the dataset was small if compared to
Training and validation loss over epochs                               the number of parameters of the model and hence the classifier
                                                                       rapidly overfitted. Anyway, further investigations with larger
The confusion matrices for the uncased and cased version are           datasets are required.
respectively shown in Table 3 and 4. The uncased BERT has                 The confusion matrices for the uncased and cased version are
accuracy = 0.89 and 𝐹 1 = 0.89, while the cased version has accuracy   respectively reported in Table 7 and 8. In this case, the uncased
= 0.90 and 𝐹 1 =0.91: hence, the cased version shows slightly higher   and cased BERT have similar performance, both in terms of
performance. Table 4 shows that the happiness class has the            accuracy (0.92) and 𝐹 1 (0.92), hence the cased version provides
highest precision, while the highest recall is reached by the          no improvement over the cased one.
sadness class, which has also the best average metrics. Happy             It can be observed that the largest part of misclassified tweets
tweets seems to be the most difficult to be detected, since the        is composed by emotion-bearing text that are, instead, classified
happiness class has the lowest recall (0.85). Anyway, the difference   as neutral. This phenomenon can be justified by considering that
with other classes is rather small. Generally speaking, the perfor-    there are sentences that are weakly polarized (e.g., for the lack of
mance of the classifier seems promising, especially considering        strongly polarized adjectives, such as "wonderful" or "ugly") and
                          Table 3: Uncased BERT: confusion matrix for the emotion recognition task

                                                 Actual Happiness    Actual Anger     Actual Sadness      Actual Fear
                  Predicted Happiness                   131                3                 0                10
                  Predicted Anger                        10              127                 3                10
                  Predicted Sadness                       6                3               122                 0
                  Predicted Fear                         11                5                 2                138
                  Recall                                0.83             0.92              0.96              0.87
                  Precision                             0.91             0.85              0.93              0.88

                           Table 4: Cased BERT: confusion matrix for the emotion recognition task

                                                 Actual Happiness    Actual Anger     Actual Sadness      Actual Fear
                  Predicted Happiness                   135                2                 0                 6
                  Predicted Anger                         7              121                 3                 4
                  Predicted Sadness                       9                2               122                 1
                  Predicted Fear                          7                2                 2                147
                  Recall                                0.85             0.88              0.96              0.93
                  Precision                             0.94             0.90              0.91              0.93

Table 5: Class distribution of the test set proposed by Go
et al.

                        Class     Occurrences
                     Positive          157
                     Neutral           117
                     Negative          156
                     Total             430

Table 6: Optimal hyperparameters for the sentiment
analysis task.

                      Hyperparameter     Value                          Figure 7: Cased BERT for sentiment analysis: Training
                    learning_rate        1e-5                           and validation loss over epochs
                    train_batch_size       8
                    eval_batch_size        8
                    max_seq_length        82                            terms of classification error as they correspond to completely
                    adam_epsilon         1e-7                           misrepresent the user’s opinion. Such performance can be compared
                                                                        to those presented in [9], where traditional machine learning
                                                                        algorithms are applied to the same dataset. It can be noticed
                                                                        that the best model proposed in [9], i.e., SVM, has an accuracy
                                                                        of 0.82. Therefore, the use of BERT leads to a remarkable 0.10
                                                                        improvement in terms of accuracy.

                                                                        5   CONCLUSION
                                                                        The goal of this work was the evaluation of the use of Bidirectional
                                                                        Encoder Representations from Transformers (BERT) models for
                                                                        both sentiment analysis and emotion recognition of Twitter data.
                                                                        We defined an architecture composed of BERT-Base followed
                                                                        by a final classification stage and we fine-tuned the model for
                                                                        the above-mentioned tasks. We measured the performance of
                                                                        our classifiers by considering two datasets of tweets and we
Figure 6: Uncased BERT for sentiment analysis: Training                 obtained a remarkable 92% accuracy for sentiment analysis and
and validation loss over epochs                                         a 90% accuracy for emotion analysis, from which it was possible
                                                                        to deduce that BERT’s language modeling power significantly
                                                                        contributes to achieve a good text classification.
sentences containing slang, which are really difficult to properly         In future work, we plan to improve the performance of our
classify. It is remarkable that polarity inversions, i.e. positive      classifiers by determining the best number of layers and neurons
sentences classified as negatives and vice versa, are quite rare        in the final classification layers (i.e., fully connected layers) .
(1.8%). In fact, polarity inversions are usually more costly in         We also intend to extend the experimentation by considering
                                  Table 7: Uncased BERT: confusion matrix for the sentiment analysis task

                                                                  Actual Negative      Actual Neutral           Actual Positive
                                   Predicted Negative                   141                   3                        4
                                   Predicted Neutral                     11                 112                       10
                                   Predicted Positive                     3                   3                      143
                                   Recall                              0.91                 0.95                     0.91
                                   Precision                           0.95                 0.84                     0.96

                                    Table 8: Cased BERT: confusion matrix for the sentiment analysis task

                                                                  Actual Negative      Actual Neutral           Actual Positive
                                   Predicted Negative                   141                   2                        5
                                   Predicted Neutral                     12                 112                        9
                                   Predicted Positive                     3                   3                      143
                                   Recall                              0.90                 0.96                     0.91
                                   Precision                           0.95                 0.84                     0.96


larger datasets, such as the SemEval 2017 Task 4 [23] dataset                           [9] Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification
for sentiment analysis and the EmoBank [4] dataset for emotion                              using distant supervision. CS224N project report, Stanford 1, 12 (2009), 2009.
                                                                                       [10] Ali Hasan, Sana Moin, Ahmad Karim, and Shahaboddin Shamshirband. 2018.
analysis. This is particular important for the sentiment analysis                           Machine learning-based sentiment analysis for twitter accounts. Mathematical
task, in which we observed a repentine increment of the validation                          and Computational Applications 23, 1 (2018), 11.
                                                                                       [11] Maha Heikal, Marwan Torki, and Nagwa El-Makky. 2018. Sentiment analysis
loss after the first epoch, probably due to overfitting. Although                           of Arabic Tweets using deep learning. Procedia Computer Science 142 (2018),
the models reach high accuracy and the approach seems promising,                            114–122.
a comparison with other state-of-the-art classifiers will be useful                    [12] Doaa Mohey El-Din Mohamed Hussein. 2018. A survey on sentiment analysis
                                                                                            challenges. Journal of King Saud University-Engineering Sciences 30, 4 (2018),
to thoroughly evaluate the performance of our approach. We also                             330–338.
intend to investi-gate the impact of BERT-Base by replacing it                         [13] Zhao Jianqiang, Gui Xiaolin, and Zhang Xuejun. 2018. Deep convolution
with other BERT distributions (e.g., BERT-Large) or traditional                             neural networks for twitter sentiment analysis. IEEE Access 6 (2018), 23253–
                                                                                            23260.
word embeddings, such as Word2Vec [17] or GloVe [21].                                  [14] Xin Li, Lidong Bing, Wenxuan Zhang, and Wai Lam. 2019. Exploiting BERT for
                                                                                            end-to-end aspect-based sentiment analysis. arXiv preprint arXiv:1910.00883
                                                                                            (2019).
ACKNOWLEDGMENTS                                                                        [15] Xinlong Li, Xingyu Fu, Guangluan Xu, Yang Yang, Jiuniu Wang, Li Jin, Qing
                                                                                            Liu, and Tianyuan Xiang. 2020. Enhancing BERT representation with context-
The authors would like to thank the students Federico Filipponi,                            aware embedding for aspect-based sentiment analysis. IEEE Access 8 (2020),
Leonardo Lucarelli and Alessandrino Manilii for their help in                               46868–46876.
imple-menting the architecture for emotion recognition.                                [16] Walaa Medhat, Ahmed Hassan, and Hoda Korashy. 2014. Sentiment analysis
                                                                                            algorithms and applications: A survey. Ain Shams engineering journal 5, 4
                                                                                            (2014), 1093–1113.
REFERENCES                                                                             [17] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient
                                                                                            estimation of word representations in vector space.            arXiv preprint
 [1] Muhammad Abdul-Mageed and Lyle Ungar. 2017. Emonet: Fine-grained                       arXiv:1301.3781 (2013).
     emotion detection with gated recurrent neural networks. In Proceedings of the     [18] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013.
     55th annual meeting of the association for computational linguistics (volume 1:        Distributed representations of words and phrases and their compositionality.
     Long papers). 718–728.                                                                 arXiv preprint arXiv:1310.4546 (2013).
 [2] Acheampong Francisca Adoma, Nunoo-Mensah Henry, and Wenyu Chen.                   [19] A. Mircoli, A. Cucchiarelli, C. Diamantini, and D. Potena. 2017. Automatic
     2020. Comparative Analyses of Bert, Roberta, Distilbert, and Xlnet for Text-           emotional text annotation using facial expression analysis. In CEUR Workshop
     Based Emotion Recognition. In 2020 17th International Computer Conference              Proceedings, Vol. 1848. 188–196.
     on Wavelet Active Media Technology and Information Processing (ICCWAMTIP).        [20] Saif M Mohammad and Felipe Bravo-Marquez. 2017. Emotion intensities in
     IEEE, 117–121.                                                                         tweets. arXiv preprint arXiv:1708.03696 (2017).
 [3] Qurat Tul Ain, Mubashir Ali, Amna Riaz, Amna Noureen, Muhammad Kamran,            [21] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove:
     Babar Hayat, and A Rehman. 2017. Sentiment analysis using deep learning                Global vectors for word representation. In Proceedings of the 2014 conference
     techniques: a review. Int J Adv Comput Sci Appl 8, 6 (2017), 424.                      on empirical methods in natural language processing (EMNLP). 1532–1543.
 [4] Sven Buechel and Udo Hahn. 2017. Emobank: Studying the impact of                  [22] Ana Reyes-Menendez, José Ramón Saura, and Cesar Alvarez-Alonso. 2018.
     annotation perspective and representation format on dimensional emotion                Understanding# WorldEnvironmentDay user opinions in Twitter: A topic-
     analysis. In Proceedings of the 15th Conference of the European Chapter of the         based sentiment analysis approach. International journal of environmental
     Association for Computational Linguistics: Volume 2, Short Papers. 578–585.            research and public health 15, 11 (2018), 2537.
 [5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert:     [23] Sara Rosenthal, Noura Farra, and Preslav Nakov. 2017.                SemEval-
     Pre-training of deep bidirectional transformers for language understanding.            2017 Task 4: Sentiment Analysis in Twitter. In Proceedings of the 11th
     arXiv preprint arXiv:1810.04805 (2018).                                                International Workshop on Semantic Evaluation (SemEval-2017). Association
 [6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.                      for Computational Linguistics, Vancouver, Canada, 502–518. https://doi.org/
     2019. BERT: Pre-training of Deep Bidirectional Transformers for Language               10.18653/v1/S17-2088
     Understanding. In Proceedings of the 2019 Conference of the North American        [24] Kim Schouten and Flavius Frasincar. 2015. Survey on aspect-level sentiment
     Chapter of the Association for Computational Linguistics: Human Language               analysis. IEEE Transactions on Knowledge and Data Engineering 28, 3 (2015),
     Technologies, Volume 1 (Long and Short Papers). Association for Computational          813–830.
     Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/      [25] Chi Sun, Luyao Huang, and Xipeng Qiu. 2019. Utilizing BERT for aspect-
     N19-1423                                                                               based sentiment analysis via constructing auxiliary sentence. arXiv preprint
 [7] C. Diamantini, A. Mircoli, D. Potena, and E. Storti. 2015. Semantic                    arXiv:1903.09588 (2019).
     disambiguation in a social information discovery system. In 2015 International    [26] Duyu Tang, Furu Wei, Bing Qin, Nan Yang, Ting Liu, and Ming Zhou.
     Conference on Collaboration Technologies and Systems, CTS 2015. 326–333.               2015. Sentiment embeddings with applications to sentiment analysis. IEEE
 [8] Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann.             transactions on knowledge and data Engineering 28, 2 (2015), 496–509.
     2017. Using millions of emoji occurrences to learn any-domain representations     [27] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin.
     for detecting sentiment, emotion and sarcasm. arXiv preprint arXiv:1708.00524          2014. Learning sentiment-specific word embedding for twitter sentiment
     (2017).
     classification. In Proceedings of the 52nd Annual Meeting of the Association for
     Computational Linguistics (Volume 1: Long Papers). 1555–1565.
[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
     Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all
     you need. arXiv preprint arXiv:1706.03762 (2017).
[29] Leyi Wang and Rui Xia. 2017. Sentiment lexicon construction with
     representation learning based on hierarchical sentiment supervision. In
     Proceedings of the 2017 conference on empirical methods in natural language
     processing. 502–510.
[30] Zhongqing Wang, Yue Zhang, Sophia Lee, Shoushan Li, and Guodong
     Zhou. 2016. A bilingual attention network for code-switched emotion
     prediction. In Proceedings of COLING 2016, the 26th International Conference
     on Computational Linguistics: Technical Papers. 1624–1634.
[31] Hu Xu, Bing Liu, Lei Shu, and Philip S Yu. 2019. BERT post-training for review
     reading comprehension and aspect-based sentiment analysis. arXiv preprint
     arXiv:1904.02232 (2019).
[32] Lei Zhang, Shuai Wang, and Bing Liu. 2018. Deep learning for sentiment
     analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge
     Discovery 8, 4 (2018), e1253.
[33] Wei Zhang, Quan Yuan, Jiawei Han, and Jianyong Wang. 2016. Collaborative
     multi-Level embedding learning from reviews for rating prediction.. In IJCAI,
     Vol. 16. 2986–2992.