Bidirectional Dilated LSTM with Attention for
 Fine-grained Emotion Classification in Tweets

             Annika M Schoene[0000−0002−9248−617X] , Alexander P
       Turner[0000−0002−2392−6549] , and Nina Dethlefs[0000−0002−6917−5066]

                The University of Hull, Cottingham Road, Hull HU6 7RX
                                  amschoene@gmail.com


       Abstract. We propose a novel approach for fine-grained emotion classi-
       fication in tweets using a Bidirectional Dilated LSTM (BiDLSTM) with
       attention. Conventional LSTM architectures can face problems when
       classifying long sequences, which is problematic for tweets, where crucial
       information is often attached to the end of a sequence, e.g. an emoti-
       con. We show that by adding a bidirectional layer, dilations and atten-
       tion mechanism to a standard LSTM, our model overcomes these prob-
       lems and is able to maintain complex data dependencies over time. We
       present experiments with two datasets, the 2018 WASSA Implicit Emo-
       tions Shared Task and a new dataset of 240,000 tweets. Our BiDLSTM
       with attention achieves a test accuracy of up to 81.97% outperforming
       competitive baselines by up to 10.52% on both datasets. Finally, we eval-
       uate our data against a human benchmark on the same task.

       Keywords: Natural Language Processing · Sentiment Analysis· Recur-
       rent Neural Networks.


1    Introduction

There has been a surge of interest in the field of sentiment analysis in recent
years, which is likely due to the growing number of social media users, who
increasingly express their opinions, beliefs and attitudes in online posts towards
a range of different topics, events and products [37]. Most sentiment analysis
approaches to date focus on polarity detection [17, 3] but neglect the classification
of more fine-grained emotion categories, such as Ekman’s basic six emotions
[15]. Fine-grained emotion detection has promising applicability in a number of
domains, including detecting cyber-bullying [55] or identifying potential mental
health issues in social media posts [44].
    The majority of current approaches to sentiment analysis rely on deep learn-
ing algorithms [59], such as recurrent neural networks (RNN) [47, 25] and con-
volutional neural networks (CNN) [12, 11]. While tweets have previously been
categorised as short sequences or sentence-level sentiment analysis [22], we ar-
gue that this should no longer be the case especially since Twitter increased
its allowed character limit from 140 to 280 [49]. As such, tweets mostly face


 Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License
 Attribution 4.0 International (CC BY 4.0). In: N. Chhaya, K. Jaidka, J. Healey, L. H. Ungar, A. Sinha
 (eds.): Proceedings of the 3rd Workshop of Affective Content Analysis, New York, USA, 07-
 FEB-2020, published at http://ceur-ws.org
2       A.M. Schoene et al.

also problems with classifying long sequences, similar to other natural language
processing tasks [20].
     In this paper we propose the use of Dilated RNNs (DRNN) for emotion clas-
sification from tweets. DRNNs introduce skip connections into a standard RNN
to increase the range of temporal dependencies that can be modelled. Experi-
ments on sequence classification for language modelling on the Penn Treebank,
pixel-by-pixel MNIST classification and speaker identification from audio [10]
have shown to outperform competitive baselines such as standard LSTM/GRU
architectures as well as more specialised models. We expect that the same ad-
vantages can be observed for tweets. We extend the standard proposed DRNN
with an embedding layer, bidirectional layer and attention mechanism and ap-
ply it to the classification of six basic emotion categories, anger, fear, disgust,
surprise, joy and sadness. Figure 1 shows an example of a tweet.


                Fig. 1. Example of a tweet from the ’Joy’ category.


    Therefore we hypothesise that by using dilated recurrent neural networks
we can take advantage of the increased sequence length of tweets and avoid
information loss over time. Another reason for the good performance of dilated
recurrent skip connections is that they have a better balance of memory over
a larger period of time compared to standard RNNs. We believe that using a
similar structure, albeit not for a very long sequence but treating tweets as longer
sequence will enable us to achieve better classification accuracies compared to
treating tweets as a short sequence problem.
    We experiment with two datasets, the 2018 WASSA Implicit Emotions Shared
Task dataset which contains 153,383 tweets and can be considered an established
benchmark. In addition, we collected a new larger dataset of 240,000 tweets us-
ing the same six emotion categories. We find that on both datasets, DLSTMs
with attention perform better than standard LSTM or CNN architectures, as
well as any of the submissions to the WASSA shared task, achieving up to
71.45% of accuracy. We find that the BiDLSTMs with attention are particularly
beneficial for the longest sequences in our datasets and that the additions of a
word embeddings, bidirectional layer and attention mechanism further increase
performance.

2   Related Work
Recently, deep learning methods for sentiment and emotion classification have
become the predominant technique. For example, [23] developed a soft attention-
based LSTM with CNN for sarcasm detection. Work conducted by [36] use a
BiDLSTM with Attention for Fine-grained Emotion Classification in Tweets        3

deep CNN with a multi-kernel classifier to extract features of short sequences
for multi-modal sentiment analysis and show that this increases accuracy. [42] use
a BiLSTM for a range of different text classification tasks, including sentiment
analysis. In their experiments they show that using a single-layer BiLSTM with
pretrained word embeddings and trained with cross-entropy loss achieves com-
petitive results compared to more complex learning models. Most recently the
Implicit Emotions Shared Task (IEST) [21] has used Tweets, where the winning
model, named ’Amobee’, was able to outperform the baseline score significantly
by achieving an accuracy of 71.45% [41]. Amobee is a bidirectional GRU with
an additional attention mechanism inspired by [5] and additional hidden layers.
It has been reasoned that the model’s success has been due to its specific type of
transfer learning. The baseline model for this shared task was established using
a maxentropy classifier with L2 regularization, where the F1 score reached an
accuracy of 59.1% on the test data. Recurrent neural networks have become the
predominant neural network across as range of sentiment analysis and emotion
detection tasks [13]. Similarly, almost half of the submissions to the annual Se-
mEval shared task [39, 27, 29] used some form of neural networks. At the same
time, the majority of approaches to detect sentiment continue to focus on polar-
ity detection [9], including approaches to identifying sentiment on social media
such as Twitter [39, 30] or longer texts such as reviews or blogs [32]. This is
limiting for real-world applications, where for mental state detection, customer
reviews, advertising, and many more, fine-grained emotions can add substantial
added value.
    Approaches that have attempted more fine-grained classification are mostly
based on Ekman’s six basic emotions [15], anger, fear, disgust, surpise, joy and
sadness, or Plutchik’s eight basic emotions [35], who extended [15] basic emo-
tions with Trust and Anticipation. For example, [2] apply Gated Recurrent Neu-
ral Networks (GRNNs) to classify tweets collected based on hashtags carrying
emotions into [35] emotion categories. Research conducted by [28] used hashtags
that contain emotion words based on Plutschnik’s eight basic emotions to show
that user-labelled hashtags used as annotations are consistent with those anno-
tated by trained judges. Furthermore a new lexicon based on the same twitter
corpus is introduced. [26] introduces a Topic Sentiment Model (TSM), which can
capture both topics and sentiment. The model is based on Probabilistic Latent
Semantic Indexing (pLSI) and utilises an online sentiment retrieval service to
induce prior knowledge to the model. Research by [46] use distant supervision
and a lexicon to label tweets for Plutschik’s eight basic emotions [35] and then
classify them. Work conducted by [40] also investigated eight basic emotions in
online discourse. [8] used the whole taxonomy of Plutschik’s emotions to analyse
chat messages.
    Work on sentiment classification from social media has additionally explored
the occurrence of emoticons and their influence on sentiment classification [16].
[31] conducted research distinguishing happiness and sadness in emoticons. Sim-
ilarly, [22] have shown that the usage of both hashtags and emoticons can be
beneficial and contribute to more accurate classification of tweets.
4       A.M. Schoene et al.

3   Learning Model

Motivation There are a number of challenges that have to be taken into ac-
count when using recurrent neural networks to learn longer sequences, which
include but are not limited to: (1) maintaining mid- and short term memory
is problematic when memorising long-term dependencies [19] and (2) vanishing
and exploding gradient descent [33]. Therefore it could be argued that there is
a need for a more specialised learning model which can overcomes these chal-
lenges. [52] introduce a dilated LSTM as part of a reinforcement learning task,
where the learning model has one dilated recurrent layer with fixed dilations.
Work by [10] introduced a Dilated RNN by using dilated skip connections. The
dilated LSTM alleviates the problem of learning long sequences, however not
every word in a sequence has the same meaning or importance. Therefore we
extend this network by (1) an embedding layer, (2) a bidirectional layer and (3)
attention mechanism. The full architecture of the Bidirectional Dilated LSTM
(BiDLSTM) with attention is shown in Figure 2.


                    Fig. 2. bidirectional DLSTM with attention


LSTM architecture Our primary model is the Long-short-term memory (LSTM)
given its suitability for language and time-series data [20]. We feed into the LSTM
an input sequence x = (x1 , . . . , xN ) of words in a tweet alongside a label y ∈ Y
denoting an emotion from any of the six basic emotion categories. The LSTM
learns to map inputs x to an output y via a hidden representation ht which can
be found recursively from an activation function:

                                   f (ht−1 , xt ),                              (1)
BiDLSTM with Attention for Fine-grained Emotion Classification in Tweets         5

where t denotes a time-step. During training, we minimise a loss function, in our
case categorical cross-entropy, as:
                                      1 X
                          L(x, y) = −       xn log yn .                       (2)
                                      N
                                          n∈N

    Standard LSTMs manage their weight updates through a number of gates
that determine the amount of information that should be retained and forgotten
at each time step. In particular, we distinguish an ‘input gate’ i that decides how
much new information to add at each time-step, a ‘forget gate’ f that decides
what information not to retain and an ‘output gate’ o determining the output.
More formally, and following the definition by [18], this leads us to update our
hidden state h as follows (where σ refers to the logistic sigmoid function, c is
the ‘cell state’, W is the weight matrix and b is the bias term):
                    it = σ(Wxi xt + Whi ht−1 + Wci ct−1 + bi )                 (3)
                   ft = σ(Wxf xt + Whf ht−1 + Wcf ct−1 + bf )                  (4)
                  ct = ft ct−1 + it tanh(Wxc xt + Whc ht−1 + bc )              (5)
                    ot = σ(Wxo xt + Who ht−1 + Wco ct + bo )                   (6)
                                 ht = ot tanh(ct )                             (7)
    A standard LSTM definition solves some of the problems of vanilla RNNs
have, such as the vanishing gradient descent problem [20], but it still has some
shortcomings when learning long-term dependencies. One of them is due to the
cell state of an LSTM; the cell state is changed by adding some function of
the inputs. When we backpropagate and take the derivative of ct with respect to
ct −1, the added term would disappear and less information would travel through
the layers of a learning model. This shortcoming can be addressed through the
use of dilations and skip-connections in the dilated LSTM.

Embedding and bidirectional Layer Each tweet t contains wi words where wi t, t ∈
[0, T ] represents the ith word in each tweet. We utilise GloVe word embeddings
trained on 2 billion tweets as developed by [34], in our 200-dimensional embed-
ding layer. Then we use a bidirectional LSTM to obtain information from both
directions of each word in order to capture the contextual information. The bidi-
                                                    →
                                                    −
rectional LSTM incorporates the forward LSTM h t(i) which reads each tweet
                                           ←−
from wi 1 to wi T and a backward LSTM h t(i) which reads words in each tweet
from wi T to wi , where xi t represents word vectors in an embedding matrix:
                             xi t = We wi t, t ∈ [1, T ]                       (8)
                         →
                         −        −−−−→
                         h t(i) = LST M (xi t), t ∈ [1, T ]                    (9)
                         ←
                         −        ←−−−−
                         h t(i) = LST M (xi t), t ∈ [1, T ]                (10)
                                                                   →
                                                                   −
   We then concatenate all outputs of the forward hidden state h t and back-
                    ←
                    −
ward hidden state h t , where the output o allows us to utilise all information
available in each tweet. The output o is then fed into the Dilated LSTM.
6       A.M. Schoene et al.

Dilated LSTM Layer For our implementation of a Dilated LSTM, we follow
the implementation of recurrent skip connections with exponentially increasing
dilations in a multi-layered learning model - as proposed by [10] - as it allows
LSTMs to better learn input sequences and their dependencies. This means that
temporal and complex data dependencies are learned on different layers. The
most important part of this architecture is the dilated recurrent skip connection
                           (l)
in the LSTM cell, where ct is the cell in layer l at time t:

                              (l)               (l)     (l)
                              ct = LST M (ot , ct−sl )·                        (11)


                                    (l)
s(l) is the skip length of layer l; ot as the input to layer l at time t in a LSTM.
The exponentially increasing dilations across layers have been inspired by [51];
s(l) denotes the dilation of the l-th layer, where M and L denotes dilations at
different layers:
                              s(l) = M (l−1) , l = 1, . . . L.                  (12)

As outlined by [10] there are two main benefits to stacking exponentially dilated
recurrent layers: (1) it enables different layers to focus on different temporal
resolutions and (2) it reduces the length of paths between nodes at different time-
steps, which enables the network to learn more complex long-term dependencies.
Therefore exponentially increasing dilations shortens any given sequence length
at different layers.

Attention Layer The attention mechanism was first introduced by [4], but has
since been used in a number of different tasks including machine translation [24],
sentence pairs detection [58], neural image captioning [56] and action recognition
[45].
    Our implementation of the attention mechanism is inspired by [57], using
attention to find words that are most important to the meaning of a tweet.
We use the output of the dilated LSTM as direct input into the attention layer,
where O denotes the output of final layer L of the Dilated LSTM at time t+1 . The
attention for each word w in a tweet t is computed as follows, where hiw is the
hidden representation of the dilated LSTM output, αiw represents normalised
alpha weights measuring the importance of each word and ti is the corresponding
tweet vector:

                                uiw = tanh(O + bw )                            (13)

                                       exp (hTiw )
                                αiw = P         T
                                                                               (14)
                                        t exp (hiw )
                                           X
                                    ti =       αiw o·                          (15)
                                           t
BiDLSTM with Attention for Fine-grained Emotion Classification in Tweets            7

4     Experiments

We present the datasets used, our baselines and discuss objective and subjective
results.


4.1    Data

We will work with the following datasets:


 – The WASSA Implicit Emotions Shared Task (IEST) [21] data consists of
   155,383 tweets and is based on [15] six basic emotions.
 – The Ekman’s Emotion Keyword (EEK) data, a collection of 240,000 tweets
   that we collected between September 2017 and December 2018. 1

Table 1 shows a comparison of the two datasets in terms of their size and basic
distribution of emotion categories represented in them.


        Emotion IEST EEK
                                                   Emotion Keywords
        Anger    25,384 40,000
                                                   Anger    anger,angry, furious
        Fear     25,387 40,000
                                                   Fear     fear, scared, fearful
        Disgust 25,396 40,000
                                                   Disgust disgust, disgusting
        Surprise 25,402 40,000
                                                   Surprise surprise, surprising
        Joy      25,377 40,000
                                                   Joy      joy, happy
        Sadness 25,396 40,000
                                                   Sadness sad
Table 1. Comparison of IEST and
                                               Table 2. Synonyms for Twitter API
EEK dataset emotion category distri-
                                               queries
bution


    Both datasets were collected using the Twitter API [50] and a list of keyword
and synonyms were specified for automatic data collection from Twitter. See
Table 2 for the keywords that we used, following [21] and using Ekman’s six basic
emotions. After the initial data collection we filtered tweets by those marked in
the language tab as ”English” and removed any duplicates. Then we used the text
processing library developed by [6], to anonymise usernames and mask URLs.
Afterwards we used a dictionary containing all emotion keywords listed in Table
2 and replaced existing keywords in all tweets with the term [keyword]. Finally
each tweet was assigned a label based on the emotion category its keyword
belonged to (see Figure 1). For our experiments we use 80% of the data for
training, 10% for validation and the remaining 10% for testing.
1
    The dataset will be released to the research community upon request and in accor-
    dance with the Twitter API guidelines [50]
8         A.M. Schoene et al.

4.2     Baselines

Similarly to [21] we use a a maximum entropy classifier with L2 regularisation
for establishing the baselines of our datasets. All baselines will be evaluated in
two conditions:

Capped length , where we cap the length of any sequence to 40 in accordance
with the WASSA IEST challenge winners.

Full length ,where we use the average full uncapped length of a sequence (max-
imum 103). Our intuition is that this condition will particularly reveal the ad-
vantages of the skip connections.

    For the DLSTM, BiDLSTM and BiDLSTM with attention, we established the
number of dilations empirically. There are two dilated layers with the dilations
increasing exponentially starting at 1 [1,2]. This means that each sub-LSTM for
the pruned sequence has the following sequence length [Dilation 1 = 40, Dilation
2 = 20] with a total of 20 hidden units per layer. Whilst each sub-LSTM for the
longer sequence has the following sequence length:[Dilation 1 = 102, Dilation 2
= 51].
We evaluate our BiDLSTM with attention against the following baselines:

    – DLSTM – a dilated LSTM with hierarchically stacked dilations and hyper-
      parameters: learning rate: 0.001, batch size: 128, optimizer: Adam, dropout:
      0.5
    – BiDLSTM – a two-layer bidirectional dilated LSTM with a three-layer
      LSTM, hierarchically stacked dilations and the same hyperparameters as
      the DLSTM.
    – BiLSTM – a BiLSTM with 2 layers and the following hyper-parameters:
      learning rate: 0.001, batch size: 128, optimizer: Adam, dropout: 0.5. This
      model is similar to recent work by [42] who used a single layer biLSTM to
      classify the ImdB movie review dataset into positive and negative reviews.
    – BiLSTM with attention – a BiLSTM with attention and the following
      hyper-parameters: learning rate: 0.001, batch size: 128, optimizer: Adam,
      dropout: 0.5. This model is similar to recent work by [7, 43].
    – CNN – a CNN 2-D convolution with two fully connected layers, a filter size
      of 1,2 and 102 filters, and a ReLU function. This learning model is similar
      to recent work by [14].
    – CNN-LSTM – we follow the implementation of the learning model by [53],
      using a CNN that is feeding into an LSTM. This model was used to predict
      the valence/arousal of ratings in textual data.

Also, we compare our model against the winner of the 2019 WASSA IEST
dataset, called Amobee[41]. All of the experiments conducted using Tensorflow
[1].
BiDLSTM with Attention for Fine-grained Emotion Classification in Tweets       9

5   Results
We benchmark the BiDLSTM with attention to a number of different neural
networks, using both vanilla neural networks and more specialised neural net-
works that have been used in sentiment analysis tasks. We compare results by
two different sequence length and use four different metrics for evaluation; test
set accuracy, precision, recall and F1-score.

Capped Sequences Tables 3 and 4 show the results for capped sequences lengths
for both the IEST and EEK dataset respectively.


             Learning Model    Test Acc. Precision Recall F1-score
             Max Entropy       58.4      0.59      0.57 0.58
             CNN               43.17     0.44      0.42 0.43
             CNN LSTM          55.42     0.56      0.54 0.55
             BI LSTM           49.47     0.50      0.48 0.49
             BI LSTM attention 58.60     0.60      0.56 0.58
             DLSTM             56.44     0.57      0.55 0.56
             BiDLSTM           67.96     0.68      0.67 0.67
             Amobee            -         -         -      71.45
             BiDLSTM attention 72.83     0.74      0.71 0.72

              Table 3. Results for capped sequences (IEST Dataset)


             Learning Model    Test Acc. Precision Recall F1-score
             Max Entropy       62.50     0.63      0.62 0.62
             CNN               55.33     0.56      0.54 0.55
             CNN LSTM          59.79     0.60      0.59 0.59
             BI LSTM           60.19     0.61      0.59 0.60
             BI LSTM attention 63.62     0.64      0.62 0.63
             DLSTM             66.80     0.67      0.65 0.66
             BiDLSTM           69.71     0.70      0.69 0.69
             BiDLSTM attention 73.74     0.75      0.72 0.73

              Table 4. Results for capped sequences (EEK dataset)


    It can be seen that vanilla CNN and BiLSTM fall just short of the baselines
established for this task. The CNN-LSTM and DLSTM architecture, both out-
perform their vanilla predecessors. The BiLSTM with attention and BiDLSTM
surpass the baselines but falls short of the model proposed in the IEST task for
both datasets. It can be seen that BiDLSTM with attention outperforms all pre-
vious models on the capped sequence length by over 14.43% for capped sequences
and the IEST baseline by 11.24%. The results for capped sequence length using
the IEST dataset (Table 3) show that our proposed model surpasses the ’Amobee’
model’s result, however this is only marginally. We hypothesis that the reason
10     A.M. Schoene et al.

the DLSTM, BiDLSTM and BiDLSTM and with attention either fall short of
the baselines or only marginally surpass them is due the model not being able
to take full advantage of the full sequence length.


Long sequences Table 5 shows the results for the IEST dataset using full length
sequences and Table 6 also shows the results for the full length for the EEK
dataset. Similarly to the results for the capped sequence length, the CNN and
Bi-LSTM fall short of the established baselines. Only the CNN-LSTM improves
the performance of the results, whereas for the long sequences the DLSTM, BiL-
STM with attention and BiDSLTM surpasses the baselines of both datasets. The
BiDLSTM with attention outperforms all models on the full length sequences
by over 20.36% on the EEK dataset and the IEST baseline by 18.47%. These
results show that incorporating contextual information through the bidirectional
layer and using attention to focus on the most important words in a tweet en-
hances the dilated LSTMs ability to cope with longer sequences. This confirms
that using more specialised learning models such as the DLSTM, BiDLSTM
and BiDLSTM with attention allow us to better capture information in longer
sequences.


             Learning Model    Test Acc. Precision Recall F1-score
             Max Entropy       58.4      0.59      0.57 0.58
             CNN               43.95     0.44      0.43 0.43
             CNN LSTM          56.15     0.57      0.55 0.56
             BI LSTM           51.73     0.52      0.51 0.51
             BI LSTM attention 58.79     0.59      0.58 0.58
             DLSTM             60.27     0.61      0.59 0.60
             BiDLSTM           69.01     0.71      0.67 0.69
             BiDLSTM attention 78.76     0.79      0.78 0.78

                 Table 5. Results for full length (IEST dataset)


             Learning Model     Test Acc. Precision Recall F1-score
             Max Entropy        62.50       0.63     0.62 0.62
             CNN                55.12       0.56     0.54 0.55
             CNN LSTM           60.11       0.61     0.59 0.60
             BI LSTM            60.88       0.61     0.60 0.60
             BI LSTM attention 62.70        0.63     0.62 0.62
             DLSTM              67.18       0.68     0.66 0.67
             BiDLSTM            69.53       0.71     0.68 0.69
             BiDLSTM attention 80.97        0.82     0.79 0.80
                 Table 6. Results for full length (EEK dataset)
BiDLSTM with Attention for Fine-grained Emotion Classification in Tweets      11

5.1   Evaluation of Prediction Labels

In order to evaluate the performance of each model, we have set aside 5,000
tweets per dataset that have not been used during training or testing previously.
We then use the pretrained models to establish, which labels are hardest to
predict for each network. We compare the best performing learning model with
human performance. For this we used Amazon Mechanical Turk [48], where each
tweet was annotated by three different annotators for the six emotion categories,
yielding 15,000 annotations per dataset. All emotion words were replaced with
the term ’[Keyword]’, a sample tweet can be seen in Figure 3.


                 Fig. 3. Example of a tweet shown to annotators.


    We use confusion matrices to visualise the quality of label output for our
learning model on both datasets. Figures 4 and 5 both show the confusion ma-
trices for the BiDLSTM with attention. Figures 4 and 5 shows that for the both
datasets Joy was most accurately predicted emotion, whilst Anger (61.96 %) was
often misclassified. Furthermore it is shows that Anger is more often confused
with Disgust in both datasets.


 Fig. 4. BiDLSTM attention (IEST)             Fig. 5. BiDLSTM attention (EEK)


    Furthermore we have also looked at each emotion in both datasets in order
to gain a better insight into how well each emotion is classified by the proposed
learning model. We use Precision, Recall and F-1 score as our evaluation metrics
for both of the test datasets. Table 7 shows the emotion labels in the IEST
dataset using the full sequence length, where the best performing emotion is
Joy and the emotion Anger is most often misclassfied. Table 8 also shows the
12     A.M. Schoene et al.

label classification for the EEK dataset using the full sequence length, confirming
that the same emotions, Joy and Anger, are also the most and least likely to be
accurately classified.


 Type     Precision Recall F1-score           Type     Precision Recall F1-score
 Anger 0.69         0.76 0.72                 Anger 0.71         0.79 0.75
 Fear     0.69      0.83 0.75                 Fear     0.74      0.86 0.80
 Disgust 0.83       0.75 0.79                 Disgust 0.84       0.78 0.81
 Sadness 0.76       0.78 0.77                 Sadness 0.78       0.81 0.79
 Joy      0.90      0.75 0.82                 Joy      0.93      0.77 0.85
 Surprise 0.84      0.78 0.81                 Surprise 0.84      0.79 0.81
 Average 0.79       0.78 0.78                 Average 0.81       0.80 0.80

Table 7. Evaluation metrics per emo-         Table 8. Evaluation metrics per emo-
tion label - BiDLSMT with attention          tion label - BiDLSMT with attention
in % (IEST dataset)                          in % (EEK dataset)


    Afterwards we looked at the results for the human annotation, for the same
test datasets. Figures 6 and 7 show the confusion matrices for the human anno-
tators. Each confusion matrix shows the number of correctly and false predicted
labels in percentages. We have found that for both datasets evaluated by humans
that the most commonly correctly annotated emotion was Joy with 37.70% in
the IEST and 41.80% in the EEK dataset. The emotion Disgust was least likely
to be accurately annotated in both datasets. Furthermore Disgust was most of-
ten mistaken for the emotion Sadness in both datasets and overall there were
far fewer accurately predicted labels by the human annotators compared to the
proposed learning model.


 Fig. 6. Humans annotators (IEST)              Fig. 7. Humans annotators (EEK)


   In Figure 8 we show an example of a tweet with its true label and the labels
predicted by human annotators. It can be seen that for all three people anno-
 BiDLSTM with Attention for Fine-grained Emotion Classification in Tweets           13

tating this tweet there was no agreement on the emotion label and no annotator
picked the correct label. This illustrate how hard this task may be for humans
as the keyword could have been replaced with a number of different emotion
keywords and made sense.


Fig. 8. A tweet illustrating the difficulty of the task for a human annotator to choose
one emotion keyword.


Probabilities of labels Furthermore we have looked at 100 random test samples
to see the probability distribution of the output labels (see Figures 9 and 10).
It could be argued that there might be some larger pattern that is detected by
learning models when humans write about emotion that may not be detected by
humans on a qualitative basis.


Fig. 9. Visualisation of IEST Emotion labels based on the probability of accurate
prediction - BiDLSTM with attention
14     A.M. Schoene et al.


Fig. 10. Visualisation of EEK Emotion labels based on the probability of accurate
prediction - BiDLSTM with attention


    This might be due to the difficulty in the task where many emotions are
closely related or overlapping such as Disgust and Anger, where humans were
not able to interpret them correctly [54]. Other studies have previously found
that humans struggle to identify emotions in textual data due to the lack of extra
information provided (e.g.: tone of voice or facial expression) and therefore often
projecting their own emotional state and information [38]. However, this is not
possible for any learning model and therefore might be the reason why they are
better at detecting underlying patterns in this type of data.


6    Conclusion

In this paper we have found that our learning model, the bidirectional dilated
LSTM with attention, performs above the baseline of 58.4% by over 14.43% on
the WASSA shared task dataset. Furthermore, our model performs also best on
our own dataset achieving an accuracy of 80.97%. We have also found that when
using longer sequences we achieve better results with models that are more
specialised compared to vanilla neural networks. Additionally, we have shown
that when pruning our model to use a shorter input sequence it still outperforms
state-of-the art results. Also, it could be argued that treating tweets as longer
sequences we can utilise more information in a tweet. Furthermore we have
evaluated which labels are most likely predicted correctly by both humans and
the BiDLSTM with attention. We have demonstrated that the task of accurately
 BiDLSTM with Attention for Fine-grained Emotion Classification in Tweets              15

identifying the six emotion categories in tweets is considerably harder for humans
compared to the learning model. This could largely be due to the amount of
emotions projected by humans on an individual tweet which doesn’t enable them
to identified overall patterns on a qualitative basis. Also, we have outlined the
collection of a new resource, a dataset of 240,000 tweets that have been labelled
for six emotion categories.

References
 1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghe-
    mawat, S., Irving, G., Isard, M., et al.: Tensorflow: a system for large-scale machine
    learning. In: OSDI. vol. 16, pp. 265–283 (2016)
 2. Abdul-Mageed, M., Ungar, L.: Emonet: Fine-grained emotion detection with gated
    recurrent neural networks. In: Proceedings of the 55th Annual Meeting of the
    Association for Computational Linguistics (Volume 1: Long Papers). vol. 1, pp.
    718–728 (2017)
 3. Amplayo, R.K., Kim, J., Sung, S., Hwang, S.w.: Cold-start aware user and product
    attention for sentiment classification. arXiv preprint arXiv:1806.05507 (2018)
 4. Bahdanau, D., Cho, K., Bengio, Y.: Neural Machine Translation by Jointly Learn-
    ing to Align and Translate. In: Proc. of the International Conference on Learning
    Representations (ICLR). San Diego, CA, USA (2015)
 5. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning
    to align and translate. arXiv preprint arXiv:1409.0473 (2014)
 6. Baziotis, C., Pelekis, N., Doulkeridis, C.: Datastories at semeval-2017 task 4: Deep
    lstm with attention for message-level and topic-based sentiment analysis. In: Pro-
    ceedings of the 11th International Workshop on Semantic Evaluation (SemEval-
    2017). pp. 747–754. Association for Computational Linguistics, Vancouver, Canada
    (August 2017)
 7. Baziotis, C., Pelekis, N., Doulkeridis, C.: Datastories at semeval-2017 task 4: Deep
    lstm with attention for message-level and topic-based sentiment analysis. In: Pro-
    ceedings of the 11th International Workshop on Semantic Evaluation (SemEval-
    2017). pp. 747–754 (2017)
 8. Brooks, M., Kuksenok, K., Torkildson, M.K., Perry, D., Robinson, J.J., Scott, T.J.,
    Anicello, O., Zukowski, A., Harris, P., Aragon, C.R.: Statistical affect detection in
    collaborative chat. In: Proceedings of the 2013 conference on Computer supported
    cooperative work. pp. 317–328. ACM (2013)
 9. Cambria, E.: Affective computing and sentiment analysis. IEEE Intelligent Systems
    31(2), 102–107 (2016)
10. Chang, S., Zhang, Y., Han, W., Yu, M., Guo, X., Tan, W., Cui, X., Witbrock,
    M., Hasegawa-Johnson, M.A., Huang, T.S.: Dilated recurrent neural networks. In:
    Advances in Neural Information Processing Systems. pp. 77–87 (2017)
11. Chen, T., Xu, R., He, Y., Wang, X.: Improving sentiment analysis via sentence
    type classification using bilstm-crf and cnn. Expert Systems with Applications 72,
    221–230 (2017)
12. Dahou, A., Elaziz, M.A., Zhou, J., Xiong, S.: Arabic sentiment classification using
    convolutional neural network and differential evolution algorithm. Computational
    Intelligence and Neuroscience 2019 (2019)
13. Derczynski, L., Bontcheva, K., Liakata, M., Procter, R., Hoi, G.W.S., Zubiaga, A.:
    Semeval-2017 task 8: Rumoureval: Determining rumour veracity and support for
    rumours. arXiv preprint arXiv:1704.05972 (2017)
16      A.M. Schoene et al.

14. Dos Santos, C., Gatti, M.: Deep convolutional neural networks for sentiment anal-
    ysis of short texts. In: Proceedings of COLING 2014, the 25th International Con-
    ference on Computational Linguistics: Technical Papers. pp. 69–78 (2014)
15. Ekman, P., Levenson, R.W., Friesen, W.V.: Autonomic nervous system activity
    distinguishes among emotions. Science 221(4616), 1208–1210 (1983)
16. Felbo, B., Mislove, A., Søgaard, A., Rahwan, I., Lehmann, S.: Using millions of
    emoji occurrences to learn any-domain representations for detecting sentiment,
    emotion and sarcasm. arXiv preprint arXiv:1708.00524 (2017)
17. Go, A., Bhayani, R., Huang, L.: Twitter sentiment classification using distant
    supervision. CS224N Project Report, Stanford 1(12) (2009)
18. Graves, A.: Generating Sequences With Recurrent Neural Networks. CoRR
    abs/1308.0850 (2013), http://arxiv.org/abs/1308.0850
19. Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J., et al.: Gradient flow in
    recurrent nets: the difficulty of learning long-term dependencies (2001)
20. Hochreiter, S., Schmidhuber, J.: Lstm can solve hard long time lag problems. In:
    Advances in neural information processing systems. pp. 473–479 (1997)
21. Klinger, R., De Clercq, O., Mohammad, S.M., Balahur, A.: Iest: Wassa-2018 im-
    plicit emotions shared task. arXiv preprint arXiv:1809.01083 (2018)
22. Kouloumpis, E., Wilson, T., Moore, J.D.: Twitter sentiment analysis: The good
    the bad and the omg! Icwsm 11, 164 (2011)
23. Kumar, A., Sangwan, S.R., Arora, A., Nayyar, A., Abdel-Basset, M., et al.: Sar-
    casm detection using soft attention-based bidirectional long short-term memory
    model with convolution network. IEEE Access (2019)
24. Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based
    neural machine translation. arXiv preprint arXiv:1508.04025 (2015)
25. Ma, Y., Peng, H., Cambria, E.: Targeted aspect-based sentiment analysis via em-
    bedding commonsense knowledge into an attentive lstm. In: Thirty-Second AAAI
    Conference on Artificial Intelligence (2018)
26. Mei, Q., Ling, X., Wondra, M., Su, H., Zhai, C.: Topic sentiment mixture: modeling
    facets and opinions in weblogs. In: Proceedings of the 16th international conference
    on World Wide Web. pp. 171–180. ACM (2007)
27. Mohammad, S., Bravo-Marquez, F., Salameh, M., Kiritchenko, S.: Semeval-2018
    task 1: Affect in tweets. In: Proceedings of The 12th International Workshop on
    Semantic Evaluation. pp. 1–17 (2018)
28. Mohammad, S.M.: # emotional tweets. In: Proceedings of the First Joint Confer-
    ence on Lexical and Computational Semantics-Volume 1: Proceedings of the main
    conference and the shared task, and Volume 2: Proceedings of the Sixth Interna-
    tional Workshop on Semantic Evaluation. pp. 246–255. Association for Computa-
    tional Linguistics (2012)
29. Mohammad, S.M., Bravo-Marquez, F.: Wassa-2017 shared task on emotion inten-
    sity. arXiv preprint arXiv:1708.03700 (2017)
30. Nakov, P., Ritter, A., Rosenthal, S., Sebastiani, F., Stoyanov, V.: Semeval-2016
    task 4: Sentiment analysis in twitter. In: Proceedings of the 10th international
    workshop on semantic evaluation (semeval-2016). pp. 1–18 (2016)
31. Pak, A., Paroubek, P.: Twitter as a corpus for sentiment analysis and opinion
    mining. In: LREc. vol. 10, pp. 1320–1326 (2010)
32. Pang, B., Lee, L., et al.: Opinion mining and sentiment analysis. Foundations and
    Trends R in Information Retrieval 2(1–2), 1–135 (2008)
33. Pascanu, R., Mikolov, T., Bengio, Y.: Understanding the exploding gradient prob-
    lem. CoRR, abs/1211.5063 (2012)
BiDLSTM with Attention for Fine-grained Emotion Classification in Tweets           17

34. Pennington, J., Socher, R., Manning, C.: Glove: Global vectors for word repre-
    sentation. In: Proceedings of the 2014 conference on empirical methods in natural
    language processing (EMNLP). pp. 1532–1543 (2014)
35. Plutchik, R.: The nature of emotions: Human emotions have deep evolutionary
    roots, a fact that may explain their complexity and provide tools for clinical prac-
    tice. American scientist 89(4), 344–350 (2001)
36. Poria, S., Cambria, E., Gelbukh, A.: Deep convolutional neural network textual fea-
    tures and multiple kernel learning for utterance-level multimodal sentiment anal-
    ysis. In: Proceedings of the 2015 conference on empirical methods in natural lan-
    guage processing. pp. 2539–2544 (2015)
37. Ravi, K., Ravi, V.: A survey on opinion mining and sentiment analysis: tasks,
    approaches and applications. Knowledge-Based Systems 89, 14–46 (2015)
38. Riordan, M.A., Trichtinger, L.A.: Overconfidence at the keyboard: Confidence and
    accuracy in interpreting affect in e-mail exchanges. Human Communication Re-
    search 43(1), 1–24 (2017)
39. Rosenthal, S., Farra, N., Nakov, P.: Semeval-2017 task 4: Sentiment analysis in
    twitter. In: Proceedings of the 11th International Workshop on Semantic Evalua-
    tion (SemEval-2017). pp. 502–518 (2017)
40. Rothkrantz, L.: Online emotional facial expression dictionary. In: Proceedings of
    the 15th International Conference on Computer Systems and Technologies. pp.
    116–123. ACM (2014)
41. Rozental, A., Fleischer, D.: Amobee at semeval-2018 task 1: Gru neural net-
    work with a cnn attention mechanism for sentiment classification. arXiv preprint
    arXiv:1804.04380 (2018)
42. Sachan, D.S., Zaheer, M., Salakhutdinov, R.: Revisiting lstm networks for semi-
    supervised text classification via mixed objective function (2018)
43. Schoene, A.M., Dethlefs, N.: Unsupervised suicide note classification (2018)
44. Schoene, A.M., Dethlefs, N.: Automatic identification of suicide notes from lin-
    guistic and sentiment features. In: Proceedings of the 10th SIGHUM Workshop on
    Language Technology for Cultural Heritage, Social Sciences, and Humanities. pp.
    128–133 (2016)
45. Sharma, S., Kiros, R., Salakhutdinov, R.: Action recognition using visual attention.
    arXiv preprint arXiv:1511.04119 (2015)
46. Suttles, J., Ide, N.: Distant supervision for emotion classification with discrete bi-
    nary values. In: International Conference on Intelligent Text Processing and Com-
    putational Linguistics. pp. 121–136. Springer (2013)
47. Tay, Y., Tuan, L.A., Hui, S.C.: Learning to attend via word-aspect associative
    fusion for aspect-based sentiment analysis. In: Thirty-Second AAAI Conference on
    Artificial Intelligence (2018)
48. Turk, A.M.: Amazon mechanical turk. Retrieved August 17, 2012 (2012)
49. Twitter: Counting characters. https://developer.twitter.com/en/docs/basics/counting-
    characters.html (Dec 2018), accessed on 2018-11-11
50. Twitter: Developer policy. https://developer.twitter.com/en.html (Dec 2018), ac-
    cessed on 2018-11-11
51. Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A.,
    Kalchbrenner, N., Senior, A.W., Kavukcuoglu, K.: Wavenet: A generative model
    for raw audio. In: SSW. p. 125 (2016)
52. Vezhnevets, A.S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D.,
    Kavukcuoglu, K.: Feudal networks for hierarchical reinforcement learning. arXiv
    preprint arXiv:1703.01161 (2017)
18      A.M. Schoene et al.

53. Wang, J., Yu, L.C., Lai, K.R., Zhang, X.: Dimensional sentiment analysis using
    a regional cnn-lstm model. In: Proceedings of the 54th Annual Meeting of the
    Association for Computational Linguistics (Volume 2: Short Papers). vol. 2, pp.
    225–230 (2016)
54. Widen, S.C., Russell, J.A., Brooks, A.: Anger and disgust: Discrete or overlapping
    categories. In: 2004 APS Annual Convention, Boston College, Chicago, IL (2004)
55. Xu, J.M., Zhu, X., Bellmore, A.: Fast learning for sentiment analysis on bully-
    ing. In: Proceedings of the First International Workshop on Issues of Sentiment
    Discovery and Opinion Mining. p. 10. ACM (2012)
56. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R.,
    Bengio, Y.: Show, attend and tell: Neural image caption generation with visual
    attention. In: International conference on machine learning. pp. 2048–2057 (2015)
57. Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E.: Hierarchical attention
    networks for document classification. In: Proceedings of the 2016 Conference of
    the North American Chapter of the Association for Computational Linguistics:
    Human Language Technologies. pp. 1480–1489 (2016)
58. Yin, W., Schütze, H., Xiang, B., Zhou, B.: Abcnn: Attention-based convolutional
    neural network for modeling sentence pairs. Transactions of the Association for
    Computational Linguistics 4, 259–272 (2016)
59. Young, T., Hazarika, D., Poria, S., Cambria, E.: Recent trends in deep learn-
    ing based natural language processing. ieee Computational intelligenCe magazine
    13(3), 55–75 (2018)