<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bidirectional Dilated LSTM with Attention for Fine-grained Emotion Classification in Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Annika M Schoene</string-name>
          <email>amschoene@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander P Turn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Hull</institution>
          ,
          <addr-line>Cottingham Road, Hull HU6 7RX</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a novel approach for fine-grained emotion classification in tweets using a Bidirectional Dilated LSTM (BiDLSTM) with attention. Conventional LSTM architectures can face problems when classifying long sequences, which is problematic for tweets, where crucial information is often attached to the end of a sequence, e.g. an emoticon. We show that by adding a bidirectional layer, dilations and an attention mechanism to a standard LSTM, our model overcomes these problems and is able to maintain complex data dependencies over time. We present experiments with two datasets, the 2018 WASSA Implicit Emotions Shared Task and a new dataset of 240,000 tweets. Our BiDLSTM with attention achieves a test accuracy of up to 81.97%, outperforming competitive baselines by up to 10.52% on both datasets. Finally, we evaluate our model against a human benchmark on the same task.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Recurrent Neural Networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        There has been a surge of interest in the field of sentiment analysis in recent
years, which is likely due to the growing number of social media users, who
increasingly express their opinions, beliefs and attitudes in online posts towards
a range of different topics, events and products [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]. Most sentiment analysis
approaches to date focus on polarity detection [
        <xref ref-type="bibr" rid="ref17 ref3">17, 3</xref>
        ] but neglect the classification
of more fine-grained emotion categories, such as Ekman's six basic emotions
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Fine-grained emotion detection has promising applicability in a number of
domains, including detecting cyber-bullying [
        <xref ref-type="bibr" rid="ref55">55</xref>
        ] or identifying potential mental
health issues in social media posts [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ].
      </p>
      <p>
        The majority of current approaches to sentiment analysis rely on deep
learning algorithms [
        <xref ref-type="bibr" rid="ref59">59</xref>
        ], such as recurrent neural networks (RNN) [
        <xref ref-type="bibr" rid="ref25 ref47">47, 25</xref>
        ] and
convolutional neural networks (CNN) [
        <xref ref-type="bibr" rid="ref11 ref12">12, 11</xref>
        ]. While tweets have previously been
categorised as short sequences or sentence-level sentiment analysis [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], we
argue that this should no longer be the case, especially since Twitter increased
its allowed character limit from 140 to 280 [
        <xref ref-type="bibr" rid="ref49">49</xref>
        ]. As such, tweets now also
face problems with classifying long sequences, similar to other natural language
processing tasks [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>
        In this paper we propose the use of Dilated RNNs (DRNNs) for emotion
classification from tweets. DRNNs introduce skip connections into a standard RNN
to increase the range of temporal dependencies that can be modelled.
Experiments on sequence classification for language modelling on the Penn Treebank,
pixel-by-pixel MNIST classification and speaker identification from audio [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
have shown that DRNNs outperform competitive baselines such as standard LSTM/GRU
architectures as well as more specialised models. We expect that the same
advantages can be observed for tweets. We extend the originally proposed DRNN
with an embedding layer, a bidirectional layer and an attention mechanism, and
apply it to the classification of six basic emotion categories: anger, fear, disgust,
surprise, joy and sadness. Figure 1 shows an example of a tweet.
      </p>
      <p>We therefore hypothesise that by using dilated recurrent neural networks
we can take advantage of the increased sequence length of tweets and avoid
information loss over time. Another reason for the good performance of dilated
recurrent skip connections is that they maintain a better balance of memory over
a larger period of time compared to standard RNNs. We believe that using a
similar structure, treating tweets as longer sequences rather than as a
short-sequence problem, will enable us to achieve better classification
accuracies.</p>
      <p>We experiment with two datasets: the 2018 WASSA Implicit Emotions Shared
Task dataset, which contains 153,383 tweets and can be considered an established
benchmark, and a new, larger dataset of 240,000 tweets that we collected
using the same six emotion categories. We find that on both datasets, DLSTMs
with attention perform better than standard LSTM or CNN architectures, as
well as any of the submissions to the WASSA shared task, achieving an accuracy
of up to 71.45%. We find that the BiDLSTMs with attention are particularly
beneficial for the longest sequences in our datasets and that the addition of
word embeddings, a bidirectional layer and an attention mechanism further
increases performance.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Recently, deep learning methods for sentiment and emotion classification have
become the predominant technique. For example, [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] developed a soft
attention-based LSTM with CNN for sarcasm detection. Work conducted by [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ] uses a
deep CNN with a multi-kernel classifier to extract features of short sequences
for multi-modal sentiment analysis and shows that this increases accuracy. [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ] use
a BiLSTM for a range of different text classification tasks, including sentiment
analysis. In their experiments they show that using a single-layer BiLSTM with
pretrained word embeddings and trained with cross-entropy loss achieves
competitive results compared to more complex learning models. Most recently the
Implicit Emotions Shared Task (IEST) [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] used tweets, where the winning
model, named 'Amobee', was able to outperform the baseline score significantly
by achieving an accuracy of 71.45% [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ]. Amobee is a bidirectional GRU with
an additional attention mechanism inspired by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and additional hidden layers.
It has been reasoned that the model's success is due to its specific type of
transfer learning. The baseline model for this shared task was established using
a maximum entropy classifier with L2 regularisation, reaching an F1 score of
59.1% on the test data. Recurrent neural networks have become the
predominant neural network across a range of sentiment analysis and emotion
detection tasks [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Similarly, almost half of the submissions to the annual
SemEval shared task [
        <xref ref-type="bibr" rid="ref27 ref29 ref39">39, 27, 29</xref>
        ] used some form of neural networks. At the same
time, the majority of approaches to detect sentiment continue to focus on
polarity detection [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], including approaches to identifying sentiment on social media
such as Twitter [
        <xref ref-type="bibr" rid="ref30 ref39">39, 30</xref>
        ] or longer texts such as reviews or blogs [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. This is
limiting for real-world applications, such as mental state detection, customer
reviews and advertising, where fine-grained emotions can add substantial
value.
      </p>
      <p>
        Approaches that have attempted more fine-grained classification are mostly
based on Ekman's six basic emotions [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], anger, fear, disgust, surprise, joy and
sadness, or Plutchik's eight basic emotions [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ], who extended [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]'s basic
emotions with Trust and Anticipation. For example, [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] apply Gated Recurrent
Neural Networks (GRNNs) to classify tweets collected based on hashtags carrying
emotions into [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ] emotion categories. Research conducted by [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] used hashtags
that contain emotion words based on Plutchik's eight basic emotions to show
that user-labelled hashtags used as annotations are consistent with those
annotated by trained judges. Furthermore, a new lexicon based on the same Twitter
corpus is introduced. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] introduces a Topic Sentiment Model (TSM), which can
capture both topics and sentiment. The model is based on Probabilistic Latent
Semantic Indexing (pLSI) and utilises an online sentiment retrieval service to
induce prior knowledge to the model. Research by [
        <xref ref-type="bibr" rid="ref46">46</xref>
        ] uses distant supervision
and a lexicon to label tweets for Plutchik's eight basic emotions [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ] and then
classify them. Work conducted by [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ] also investigated eight basic emotions in
online discourse. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] used the whole taxonomy of Plutchik's emotions to analyse
chat messages.
      </p>
      <p>
        Work on sentiment classification from social media has additionally explored
the occurrence of emoticons and their influence on sentiment classification [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
[
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] conducted research distinguishing happiness and sadness in emoticons.
Similarly, [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] have shown that the usage of both hashtags and emoticons can be
beneficial and contribute to more accurate classification of tweets.
Motivation There are a number of challenges that have to be taken into
account when using recurrent neural networks to learn longer sequences, which
include but are not limited to: (1) maintaining mid- and short-term memory
is problematic when memorising long-term dependencies [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and (2) vanishing
and exploding gradients [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. Therefore it could be argued that there is
a need for a more specialised learning model which can overcome these
challenges. [
        <xref ref-type="bibr" rid="ref52">52</xref>
        ] introduce a dilated LSTM as part of a reinforcement learning task,
where the learning model has one dilated recurrent layer with fixed dilations.
Work by [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] introduced a Dilated RNN by using dilated skip connections. The
dilated LSTM alleviates the problem of learning long sequences; however, not
every word in a sequence has the same meaning or importance. Therefore we
extend this network with (1) an embedding layer, (2) a bidirectional layer and (3) an
attention mechanism. The full architecture of the Bidirectional Dilated LSTM
(BiDLSTM) with attention is shown in Figure 2.
LSTM architecture Our primary model is the Long Short-Term Memory (LSTM),
given its suitability for language and time-series data [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. We feed into the LSTM
an input sequence x = (x_1, ..., x_N) of words in a tweet, alongside a label y ∈ Y
denoting an emotion from any of the six basic emotion categories. The LSTM
learns to map inputs x to an output y via a hidden representation h_t, which is
computed recursively from an activation function:

h_t = f(h_{t-1}, x_t),   (1)

where t denotes a time-step. During training, we minimise a loss function, in our
case categorical cross-entropy:

L(x, y) = -(1/N) Σ_{n ∈ N} x_n log y_n.   (2)
      </p>
      <p>
        Standard LSTMs manage their weight updates through a number of gates
that determine the amount of information that should be retained and forgotten
at each time step. In particular, we distinguish an 'input gate' i that decides how
much new information to add at each time-step, a 'forget gate' f that decides
what information not to retain, and an 'output gate' o that determines the output.
More formally, and following the definition by [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], this leads us to update our
hidden state h as follows (where σ refers to the logistic sigmoid function, c is
the 'cell state', W are weight matrices and b are bias terms):

i_t = σ(W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)   (3)
f_t = σ(W_xf x_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)   (4)
c_t = f_t c_{t-1} + i_t tanh(W_xc x_t + W_hc h_{t-1} + b_c)   (5)
o_t = σ(W_xo x_t + W_ho h_{t-1} + W_co c_t + b_o)   (6)
h_t = o_t tanh(c_t)   (7)
      </p>
      <p>
        The standard LSTM definition solves some of the problems that vanilla RNNs
have, such as the vanishing gradient problem [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], but it still has some
shortcomings when learning long-term dependencies. One of them is due to the
cell state of the LSTM: the cell state is changed by adding some function of
the inputs, so when we backpropagate and take the derivative of c_t with respect to
c_{t-1}, the added term disappears and less information travels through
the layers of the model. This shortcoming can be addressed through the
use of dilations and skip connections in the dilated LSTM.
      </p>
      <p>
        Embedding and bidirectional layer Each tweet i contains words w_it, where w_it, t ∈
[1, T] represents the t-th word in tweet i. We utilise GloVe word embeddings
trained on 2 billion tweets, as developed by [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], in our 200-dimensional
embedding layer. We then use a bidirectional LSTM to obtain information from both
directions of each word in order to capture contextual information. The
bidirectional LSTM incorporates a forward LSTM h→_t(i), which reads each tweet
from w_i1 to w_iT, and a backward LSTM h←_t(i), which reads the words in each tweet
from w_iT to w_i1, where x_it represents the word vectors in the embedding matrix W_e:

x_it = W_e w_it, t ∈ [1, T]
h→_t(i) = LSTM→(x_it), t ∈ [1, T]
h←_t(i) = LSTM←(x_it), t ∈ [1, T]
      </p>
      <p>We then concatenate the outputs of the forward hidden state h→_t and the
backward hidden state h←_t into an output o, which allows us to utilise all information
available in each tweet. The output o is then fed into the Dilated LSTM.</p>
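      <p>
        As an illustration, a single LSTM time-step implementing the gate updates above can be
sketched in NumPy as follows. This is a minimal sketch rather than the implementation used in
our experiments: the stacked weight layout is an assumption, and the peephole terms (W_ci, W_cf,
W_co) are omitted for brevity.
      </p>
      <p>
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time-step (peephole terms omitted).
    W: input weights (4H x D), U: recurrent weights (4H x H), b: bias (4H,).
    The four H-sized slices of z are the input gate, forget gate,
    candidate cell state and output gate, in that order."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])           # input gate
    f = sigmoid(z[H:2*H])         # forget gate
    g = np.tanh(z[2*H:3*H])       # candidate cell state
    o = sigmoid(z[3*H:4*H])       # output gate
    c = f * c_prev + i * g        # new cell state
    h = o * np.tanh(c)            # new hidden state
    return h, c
```
      </p>
      <p>
        In practice these updates are handled internally by the framework's LSTM cell; the sketch
only makes the gate equations concrete.
      </p>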
      <p>
        Dilated LSTM Layer For our implementation of a Dilated LSTM, we follow
the implementation of recurrent skip connections with exponentially increasing
dilations in a multi-layered learning model - as proposed by [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] - as it allows
LSTMs to better learn input sequences and their dependencies. This means that
temporal and complex data dependencies are learned on different layers. The
most important part of this architecture is the dilated recurrent skip connection
in the LSTM cell, where c_t(l) is the cell in layer l at time t:

c_t(l) = LSTM(o_t(l), c_{t-s(l)}(l)),

where s(l) is the skip length (dilation) of layer l and o_t(l) is the input to layer l at time t.
The exponentially increasing dilations across layers are inspired by [
        <xref ref-type="bibr" rid="ref51">51</xref>
        ];
s(l) denotes the dilation of the l-th layer, where M is the dilation base and L the
number of layers:
      </p>
      <p>
        s(l) = M^(l-1),   l = 1, ..., L.
As outlined by [
        <xref ref-type="bibr" rid="ref10">10</xref>
          ] there are two main benefits to stacking exponentially dilated
recurrent layers: (1) it enables different layers to focus on different temporal
resolutions and (2) it reduces the length of paths between nodes at different
time-steps, which enables the network to learn more complex long-term dependencies.
Exponentially increasing dilations therefore shorten any given sequence length
at the different layers.
      </p>
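      <p>
        As a concrete illustration of this schedule, the dilations and the resulting per-layer
effective sequence lengths can be computed as follows. This is a sketch; with M = 2 and L = 2 it
reproduces the dilations [1, 2] and the sub-sequence lengths used in our experiments.
      </p>
      <p>
```python
import math

def dilations(M, L):
    """Dilation schedule s(l) = M**(l-1) for layers l = 1..L."""
    return [M ** (l - 1) for l in range(1, L + 1)]

def effective_lengths(seq_len, M, L):
    """Length of the sub-sequence each dilated layer processes: a skip
    connection of dilation s splits the input into s interleaved strands
    of ceil(seq_len / s) steps each."""
    return [math.ceil(seq_len / s) for s in dilations(M, L)]
```
      </p>
      <p>
        For example, a capped sequence of 40 steps yields per-layer lengths of 40 and 20, and a
full-length sequence of 102 steps yields 102 and 51.
      </p>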
      <p>
        Attention Layer The attention mechanism was first introduced by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], but has
since been used in a number of different tasks, including machine translation [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ],
sentence pairs detection [
        <xref ref-type="bibr" rid="ref58">58</xref>
        ], neural image captioning [
        <xref ref-type="bibr" rid="ref56">56</xref>
        ] and action recognition
[
        <xref ref-type="bibr" rid="ref45">45</xref>
        ].
      </p>
      <p>
        Our implementation of the attention mechanism is inspired by [
        <xref ref-type="bibr" rid="ref57">57</xref>
          ], using
attention to find the words that are most important to the meaning of a tweet.
We use the output of the dilated LSTM as the direct input to the attention layer,
where O denotes the output of the final layer L of the Dilated LSTM at time t+1. The
attention for each word w in a tweet i is computed as follows, where h_iw is the
hidden representation of the dilated LSTM output, α_iw represents the normalised
attention weights measuring the importance of each word, and t_i is the corresponding
tweet vector:

u_iw = tanh(O + b_w)
α_iw = exp(h_iw^T u_iw) / Σ_t exp(h_iw^T u_iw)
t_i = Σ_w α_iw h_iw
      </p>
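      <p>
        The attention computation can be sketched in NumPy as a softmax over per-word scores
followed by a weighted sum over hidden states. This is an illustrative sketch under the
definitions above; the shapes and the bias value are assumptions.
      </p>
      <p>
```python
import numpy as np

def attention_pool(H_words, b_w):
    """H_words: (T, d) hidden states from the dilated LSTM for one tweet.
    Returns the tweet vector t_i and the attention weights alpha."""
    u = np.tanh(H_words + b_w)                    # per-word score representation
    scores = np.sum(H_words * u, axis=1)          # h_iw^T u_iw for each word
    scores = scores - scores.max()                # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum() # normalised attention weights
    t_i = (alpha[:, None] * H_words).sum(axis=0)  # weighted sum of states
    return t_i, alpha
```
      </p>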
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>We present the datasets used and our baselines, and discuss objective and
subjective results.</p>
      <sec id="sec-3-1">
        <title>Data</title>
        <p>
          We work with the following datasets:
- The WASSA Implicit Emotions Shared Task (IEST) [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] data consists of
155,383 tweets and is based on [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]'s six basic emotions.
- The Ekman Emotion Keyword (EEK) data, a collection of 240,000 tweets
that we collected between September 2017 and December 2018.1
        </p>
        <p>
          Both datasets were collected using the Twitter API [
          <xref ref-type="bibr" rid="ref50">50</xref>
          ]; a list of keywords
and synonyms was specified for automatic data collection from Twitter. See
Table 2 for the keywords that we used, following [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and using Ekman's six basic
emotions. After the initial data collection, we filtered the tweets to those marked in
the language tab as "English" and removed any duplicates. Then we used the text
processing library developed by [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], to anonymise usernames and mask URLs.
Afterwards we used a dictionary containing all emotion keywords listed in Table
2 and replaced existing keywords in all tweets with the term [keyword]. Finally
each tweet was assigned a label based on the emotion category its keyword
belonged to (see Figure 1). For our experiments we use 80% of the data for
training, 10% for validation and the remaining 10% for testing.
1 The dataset will be released to the research community upon request and in
accordance with the Twitter API guidelines [
          <xref ref-type="bibr" rid="ref50">50</xref>
          ].
        </p>
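        <p>
          The keyword masking and label assignment described above can be sketched as follows.
This is a minimal illustration; the keyword-to-emotion mapping shown is a small assumed subset
of Table 2, not the full keyword list.
        </p>
        <p>
```python
import re

# Assumed subset of the emotion keywords in Table 2 (illustrative only).
KEYWORDS = {
    "angry": "anger", "furious": "anger",
    "scared": "fear", "afraid": "fear",
    "happy": "joy", "joyful": "joy",
    "sad": "sadness", "disgusted": "disgust", "surprised": "surprise",
}

def mask_and_label(tweet):
    """Replace the first matching emotion keyword with '[keyword]' and
    return the masked tweet with its emotion-category label."""
    for word, emotion in KEYWORDS.items():
        pattern = r"\b" + re.escape(word) + r"\b"
        if re.search(pattern, tweet, flags=re.IGNORECASE):
            masked = re.sub(pattern, "[keyword]", tweet, flags=re.IGNORECASE)
            return masked, emotion
    return tweet, None
```
        </p>
        <p>
          For example, "I am so happy about this!" is masked to "I am so [keyword] about this!"
and labelled Joy.
        </p>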
      </sec>
      <sec id="sec-3-2">
        <title>Baselines</title>
        <p>
          Similarly to [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] we use a maximum entropy classifier with L2 regularisation
to establish the baselines for our datasets. All baselines are evaluated in
two conditions:
Capped length, where we cap the length of any sequence to 40, in accordance
with the WASSA IEST challenge winners.
        </p>
        <p>Full length, where we use the average full uncapped length of a sequence
(maximum 103). Our intuition is that this condition will particularly reveal the
advantages of the skip connections.</p>
        <p>
          For the DLSTM, BiDLSTM and BiDLSTM with attention, we established the
number of dilations empirically. There are two dilated layers, with the dilations
increasing exponentially starting at 1, i.e. dilations [1, 2]. This means that each sub-LSTM for
the capped sequences has the following sequence lengths [Dilation 1 = 40, Dilation
2 = 20], with a total of 20 hidden units per layer, whilst each sub-LSTM for the
full-length sequences has the following sequence lengths [Dilation 1 = 102, Dilation 2
= 51].
        </p>
        <p>
          We evaluate our BiDLSTM with attention against the following baselines:
- DLSTM: a dilated LSTM with hierarchically stacked dilations and
hyper-parameters: learning rate: 0.001, batch size: 128, optimizer: Adam, dropout:
0.5.
- BiDLSTM: a two-layer bidirectional dilated LSTM with a three-layer
LSTM, hierarchically stacked dilations and the same hyper-parameters as
the DLSTM.
- BiLSTM: a BiLSTM with 2 layers and the following hyper-parameters:
learning rate: 0.001, batch size: 128, optimizer: Adam, dropout: 0.5. This
model is similar to recent work by [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ], who used a single-layer BiLSTM to
classify the IMDb movie review dataset into positive and negative reviews.
- BiLSTM with attention: a BiLSTM with attention and the following
hyper-parameters: learning rate: 0.001, batch size: 128, optimizer: Adam,
dropout: 0.5. This model is similar to recent work by [
          <xref ref-type="bibr" rid="ref43 ref7">7, 43</xref>
          ].
- CNN: a CNN with 2-D convolutions and two fully connected layers, filter
sizes of 1 and 2, 102 filters, and a ReLU activation function. This learning model is similar
to recent work by [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
- CNN-LSTM: we follow the implementation of the learning model by [
          <xref ref-type="bibr" rid="ref53">53</xref>
          ],
using a CNN that feeds into an LSTM. This model was used to predict
valence/arousal ratings in textual data.
        </p>
        <p>
          Finally, we compare our model against the winner of the 2018 WASSA IEST
shared task, called Amobee [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ]. All experiments were conducted using TensorFlow
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>We benchmark the BiDLSTM with attention against a number of different neural
networks, using both vanilla neural networks and more specialised neural
networks that have been used in sentiment analysis tasks. We compare results at
two different sequence lengths and use four different metrics for evaluation: test
set accuracy, precision, recall and F1-score.</p>
      <p>Capped Sequences Tables 3 and 4 show the results for capped sequence lengths
on the IEST and EEK datasets respectively.</p>
      <p>Learning Model Test Acc. Precision Recall F1-score
Max Entropy 58.4 0.59 0.57 0.58
CNN 43.17 0.44 0.42 0.43
CNN LSTM 55.42 0.56 0.54 0.55
BI LSTM 49.47 0.50 0.48 0.49
BI LSTM attention 58.60 0.60 0.56 0.58
DLSTM 56.44 0.57 0.55 0.56
BiDLSTM 67.96 0.68 0.67 0.67
Amobee - - - 71.45
BiDLSTM attention 72.83 0.74 0.71 0.72</p>
      <p>It can be seen that the vanilla CNN and BiLSTM fall just short of the baselines
established for this task. The CNN-LSTM and DLSTM architectures both
outperform their vanilla predecessors. The BiLSTM with attention and BiDLSTM
surpass the baselines but fall short of the winning model of the IEST task for
both datasets. The BiDLSTM with attention outperforms all
previous models on the capped sequence length by over 14.43%, and the IEST
baseline by 11.24%. The results for the capped sequence length using
the IEST dataset (Table 3) show that our proposed model surpasses the 'Amobee'
model's result, although only marginally. We hypothesise that the reason
the DLSTM, BiDLSTM and BiDLSTM with attention either fall short of
the baselines or only marginally surpass them is that the models are not able
to take full advantage of the full sequence length.</p>
      <p>Long sequences Table 5 shows the results for the IEST dataset using full-length
sequences, and Table 6 shows the results for the full length on the EEK
dataset. Similarly to the results for the capped sequence length, the CNN and
BiLSTM fall short of the established baselines and only the CNN-LSTM improves
on them, whereas for the long sequences the DLSTM,
BiLSTM with attention and BiDLSTM surpass the baselines of both datasets. The
BiDLSTM with attention outperforms all models on the full-length sequences
by over 20.36% on the EEK dataset and the IEST baseline by 18.47%. These
results show that incorporating contextual information through the bidirectional
layer and using attention to focus on the most important words in a tweet
enhances the dilated LSTM's ability to cope with longer sequences. This confirms
that using more specialised learning models such as the DLSTM, BiDLSTM
and BiDLSTM with attention allows us to better capture information in longer
sequences.</p>
      <p>Learning Model Test Acc. Precision Recall F1-score
Max Entropy 58.4 0.59 0.57 0.58
CNN 43.95 0.44 0.43 0.43
CNN LSTM 56.15 0.57 0.55 0.56
BI LSTM 51.73 0.52 0.51 0.51
BI LSTM attention 58.79 0.59 0.58 0.58
DLSTM 60.27 0.61 0.59 0.60
BiDLSTM 69.01 0.71 0.67 0.69
BiDLSTM attention 78.76 0.79 0.78 0.78</p>
      <sec id="sec-4-1">
        <title>Evaluation of Prediction Labels</title>
        <p>
          In order to evaluate the performance of each model, we have set aside 5,000
tweets per dataset that have not been used during training or testing previously.
We then use the pretrained models to establish which labels are hardest to
predict for each network. We compare the best-performing learning model with
human performance. For this we used Amazon Mechanical Turk [
          <xref ref-type="bibr" rid="ref48">48</xref>
          ], where each
tweet was annotated by three different annotators for the six emotion categories,
yielding 15,000 annotations per dataset. All emotion words were replaced with
the term '[Keyword]', a sample tweet can be seen in Figure 3.
        </p>
        <p>We use confusion matrices to visualise the quality of label output for our
learning model on both datasets. Figures 4 and 5 both show the confusion
matrices for the BiDLSTM with attention. They show that for both
datasets Joy was the most accurately predicted emotion, whilst Anger (61.96%) was
often misclassified. Furthermore, they show that Anger is most often confused
with Disgust in both datasets.</p>
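        <p>
          For reference, confusion matrices such as those reported here can be computed with a
simple routine like the following sketch (illustrative only; not the code used to produce
Figures 4 and 5).
        </p>
        <p>
```python
import numpy as np

EMOTIONS = ["anger", "fear", "disgust", "surprise", "joy", "sadness"]

def confusion_matrix(true_labels, pred_labels, classes=EMOTIONS):
    """Counts of tweets per (true class, predicted class) pair.
    Rows index the true class, columns the predicted class."""
    idx = {c: i for i, c in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        cm[idx[t], idx[p]] += 1
    return cm
```
        </p>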
        <p>Furthermore we have also looked at each emotion in both datasets in order
to gain better insight into how well each emotion is classified by the proposed
learning model. We use precision, recall and F1-score as our evaluation metrics
for both test datasets. Table 7 shows the per-emotion results on the IEST
dataset using the full sequence length, where the best-performing emotion is
Joy and the emotion Anger is most often misclassified. Table 8 shows the
per-label results for the EEK dataset using the full sequence length, confirming
that the same emotions, Joy and Anger, are the most and least likely to be
accurately classified.</p>
        <p>Type Precision Recall F1-score
Anger 0.69 0.76 0.72
Fear 0.69 0.83 0.75
Disgust 0.83 0.75 0.79
Sadness 0.76 0.78 0.77
Joy 0.90 0.75 0.82
Surprise 0.84 0.78 0.81</p>
        <p>Average 0.79 0.78 0.78</p>
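        <p>
          The per-emotion precision, recall and F1 scores above follow the standard definitions
and can be derived from a confusion matrix as in the following sketch (an illustrative helper;
rows are assumed to index true classes and columns predicted classes).
        </p>
        <p>
```python
def prf1(cm, k):
    """Precision, recall and F1 for class index k of confusion matrix cm
    (rows: true class, columns: predicted class)."""
    tp = cm[k][k]
    fp = sum(cm[r][k] for r in range(len(cm))) - tp   # predicted k, truly other
    fn = sum(cm[k][c] for c in range(len(cm))) - tp   # truly k, predicted other
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```
        </p>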
        <p>Afterwards we looked at the results of the human annotation for the same
test datasets. Figures 6 and 7 show the confusion matrices for the human
annotators. Each confusion matrix shows the number of correctly and falsely predicted
labels as percentages. We found that for both datasets evaluated by humans,
the most commonly correctly annotated emotion was Joy, with 37.70% in
the IEST and 41.80% in the EEK dataset. The emotion Disgust was least likely
to be accurately annotated in both datasets. Furthermore, Disgust was most
often mistaken for the emotion Sadness in both datasets, and overall there were
far fewer accurately predicted labels by the human annotators compared to the
proposed learning model.</p>
        <p>In Figure 8 we show an example of a tweet with its true label and the labels
predicted by human annotators. It can be seen that for all three people
annotating this tweet there was no agreement on the emotion label and no annotator
picked the correct label. This illustrates how hard this task may be for humans,
as the keyword could have been replaced with a number of different emotion
keywords and still made sense.</p>
        <p>Probabilities of labels Furthermore, we have looked at 100 random test samples
to see the probability distribution of the output labels (see Figures 9 and 10).
It could be argued that there is some larger pattern in how humans write about
emotion that is detected by learning models but may not be detectable by
humans on a qualitative basis.</p>
        <p>
          This might be due to the difficulty of the task, where many emotions are
closely related or overlapping, such as Disgust and Anger, which humans were
not able to interpret correctly [
          <xref ref-type="bibr" rid="ref54">54</xref>
          ]. Other studies have previously found
that humans struggle to identify emotions in textual data due to the lack of extra
information (e.g. tone of voice or facial expression), and therefore often
project their own emotional state onto it [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ]. However, this is not
possible for any learning model, and this might be the reason why such models are
better at detecting underlying patterns in this type of data.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper we have found that our learning model, the bidirectional dilated
LSTM with attention, performs above the baseline of 58.4% by over 14.43% on
the WASSA shared task dataset. Furthermore, our model also performs best on
our own dataset, achieving an accuracy of 80.97%. We have also found that when
using longer sequences we achieve better results with more specialised models
than with vanilla neural networks. Additionally, we have shown
that when pruning our model to use a shorter input sequence it still outperforms
state-of-the-art results. It could also be argued that by treating tweets as longer
sequences we can utilise more of the information in a tweet. Furthermore, we have
evaluated which labels are most likely to be predicted correctly by both humans and
the BiDLSTM with attention. We have demonstrated that the task of accurately
identifying the six emotion categories in tweets is considerably harder for humans
than for the learning model. This could largely be due to the amount of
emotion projected by humans onto an individual tweet, which prevents them
from identifying overall patterns on a qualitative basis. Finally, we have outlined the
collection of a new resource, a dataset of 240,000 tweets that have been labelled
for six emotion categories.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barham</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Devin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghemawat</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Irving</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>TensorFlow: a system for large-scale machine learning</article-title>
          .
          <source>In: OSDI</source>
          . vol.
          <volume>16</volume>
          , pp.
          <fpage>265</fpage>
          –
          <lpage>283</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Abdul-Mageed</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ungar</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Emonet: Fine-grained emotion detection with gated recurrent neural networks</article-title>
          .
          <source>In: Proceedings of the 55th Annual</source>
          <article-title>Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</article-title>
          .
          <source>vol. 1</source>
          , pp.
          <fpage>718</fpage>
          –
          <lpage>728</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Amplayo</surname>
            ,
            <given-names>R.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sung</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hwang</surname>
            ,
            <given-names>S.W.</given-names>
          </string-name>
          :
          <article-title>Cold-start aware user and product attention for sentiment classification</article-title>
          .
          <source>arXiv preprint arXiv:1806.05507</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.:</given-names>
          </string-name>
          <article-title>Neural Machine Translation by Jointly Learning to Align and Translate</article-title>
          .
          <source>In: Proc. of the International Conference on Learning Representations (ICLR)</source>
          . San Diego, CA, USA (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.:</given-names>
          </string-name>
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>arXiv preprint arXiv:1409.0473</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Baziotis</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pelekis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doulkeridis</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>DataStories at SemEval-2017 task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis</article-title>
          .
          <source>In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          . pp.
          <fpage>747</fpage>
          –
          <lpage>754</lpage>
          . Association for Computational Linguistics, Vancouver, Canada (
          <year>August 2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Baziotis</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pelekis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doulkeridis</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>DataStories at SemEval-2017 task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis</article-title>
          .
          <source>In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          . pp.
          <fpage>747</fpage>
          –
          <lpage>754</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Brooks</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuksenok</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torkildson</surname>
            ,
            <given-names>M.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perry</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robinson</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scott</surname>
            ,
            <given-names>T.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anicello</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zukowski</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harris</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aragon</surname>
            ,
            <given-names>C.R.</given-names>
          </string-name>
          :
          <article-title>Statistical affect detection in collaborative chat</article-title>
          .
          <source>In: Proceedings of the 2013 conference on Computer supported cooperative work</source>
          . pp.
          <fpage>317</fpage>
          –
          <lpage>328</lpage>
          . ACM (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Cambria</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Affective computing and sentiment analysis</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          <volume>31</volume>
          (
          <issue>2</issue>
          ),
          <fpage>102</fpage>
          –
          <lpage>107</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cui</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witbrock</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasegawa-Johnson</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>T.S.:</given-names>
          </string-name>
          <article-title>Dilated recurrent neural networks</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp.
          <fpage>77</fpage>
          –
          <lpage>87</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>72</volume>
          ,
          <fpage>221</fpage>
          –
          <lpage>230</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Dahou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elaziz</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiong</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Arabic sentiment classification using convolutional neural network and differential evolution algorithm</article-title>
          .
          <source>Computational Intelligence and Neuroscience</source>
          <year>2019</year>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Derczynski</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bontcheva</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liakata</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Procter</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoi</surname>
            ,
            <given-names>G.W.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zubiaga</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Semeval-2017 task 8: Rumoureval: Determining rumour veracity and support for rumours</article-title>
          .
          <source>arXiv preprint arXiv:1704.05972</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Dos Santos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gatti</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Deep convolutional neural networks for sentiment analysis of short texts</article-title>
          .
          <source>In: Proceedings of COLING</source>
          <year>2014</year>
          ,
          <source>the 25th International Conference on Computational Linguistics: Technical Papers</source>
          . pp.
          <fpage>69</fpage>
          –
          <lpage>78</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Ekman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levenson</surname>
            ,
            <given-names>R.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friesen</surname>
            ,
            <given-names>W.V.</given-names>
          </string-name>
          :
          <article-title>Autonomic nervous system activity distinguishes among emotions</article-title>
          .
          <source>Science</source>
          <volume>221</volume>
          (
          <issue>4616</issue>
          ),
          <fpage>1208</fpage>
          –
          <lpage>1210</lpage>
          (
          <year>1983</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Felbo</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mislove</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Søgaard</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahwan</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm</article-title>
          .
          <source>arXiv preprint arXiv:1708.00524</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Go</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhayani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Twitter sentiment classification using distant supervision</article-title>
          .
          <source>CS224N Project Report, Stanford</source>
          <volume>1</volume>
          (
          <issue>12</issue>
          ) (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Generating Sequences With Recurrent Neural Networks</article-title>
          .
          <source>CoRR abs/1308.0850</source>
          (
          <year>2013</year>
          ), http://arxiv.org/abs/1308.0850
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frasconi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Gradient flow in recurrent nets: the difficulty of learning long-term dependencies</article-title>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Lstm can solve hard long time lag problems</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>473</fpage>
          –
          <lpage>479</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Klinger</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Clercq</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balahur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Iest: Wassa-2018 implicit emotions shared task</article-title>
          .
          <source>arXiv preprint arXiv:1809.01083</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Kouloumpis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          :
          <article-title>Twitter sentiment analysis: The good the bad and the omg!</article-title>
          <source>ICWSM</source>
          <volume>11</volume>
          ,
          <fpage>164</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sangwan</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arora</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nayyar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abdel-Basset</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>Sarcasm detection using soft attention-based bidirectional long short-term memory model with convolution network</article-title>
          .
          <source>IEEE Access</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Luong</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pham</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>Effective approaches to attention-based neural machine translation</article-title>
          .
          <source>arXiv preprint arXiv:1508.04025</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cambria</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM</article-title>
          .
          <source>In: Thirty-Second AAAI Conference on Artificial Intelligence</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Mei</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wondra</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Topic sentiment mixture: modeling facets and opinions in weblogs</article-title>
          .
          <source>In: Proceedings of the 16th international conference on World Wide Web</source>
          . pp.
          <fpage>171</fpage>
          –
          <lpage>180</lpage>
          . ACM (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bravo-Marquez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salameh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiritchenko</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>SemEval-2018 task 1: Affect in tweets</article-title>
          .
          <source>In: Proceedings of The 12th International Workshop on Semantic Evaluation</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>17</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          :
          <article-title># emotional tweets</article-title>
          .
          <source>In: Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation</source>
          . pp.
          <fpage>246</fpage>
          –
          <lpage>255</lpage>
          . Association for Computational Linguistics (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bravo-Marquez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Wassa-2017 shared task on emotion intensity</article-title>
          .
          <source>arXiv preprint arXiv:1708.03700</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ritter</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosenthal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Semeval-2016 task 4: Sentiment analysis in twitter</article-title>
          .
          <source>In: Proceedings of the 10th international workshop on semantic evaluation (semeval-2016)</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>18</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Pak</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paroubek</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Twitter as a corpus for sentiment analysis and opinion mining</article-title>
          .
          <source>In: LREC</source>
          . vol.
          <volume>10</volume>
          , pp.
          <fpage>1320</fpage>
          –
          <lpage>1326</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , et al.:
          <article-title>Opinion mining and sentiment analysis</article-title>
          .
          <source>Foundations and Trends® in Information Retrieval</source>
          <volume>2</volume>
          (
          <issue>1–2</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>135</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Pascanu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Understanding the exploding gradient problem</article-title>
          .
          <source>CoRR abs/1211.5063</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          . pp.
          <fpage>1532</fpage>
          –
          <lpage>1543</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Plutchik</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice</article-title>
          .
          <source>American Scientist</source>
          <volume>89</volume>
          (
          <issue>4</issue>
          ),
          <fpage>344</fpage>
          –
          <lpage>350</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Poria</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cambria</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis</article-title>
          .
          <source>In: Proceedings of the 2015 conference on empirical methods in natural language processing</source>
          . pp.
          <fpage>2539</fpage>
          –
          <lpage>2544</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <surname>Ravi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ravi</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>A survey on opinion mining and sentiment analysis: tasks, approaches and applications</article-title>
          .
          <source>Knowledge-Based Systems 89</source>
          ,
          <fpage>14</fpage>
          –
          <lpage>46</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          38.
          <string-name>
            <surname>Riordan</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trichtinger</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          :
          <article-title>Overconfidence at the keyboard: Confidence and accuracy in interpreting affect in e-mail exchanges</article-title>
          .
          <source>Human Communication Research</source>
          <volume>43</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>24</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          39.
          <string-name>
            <surname>Rosenthal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farra</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>SemEval-2017 task 4: Sentiment analysis in Twitter</article-title>
          .
          <source>In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          . pp.
          <fpage>502</fpage>
          –
          <lpage>518</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          40.
          <string-name>
            <surname>Rothkrantz</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Online emotional facial expression dictionary</article-title>
          .
          <source>In: Proceedings of the 15th International Conference on Computer Systems and Technologies</source>
          . pp.
          <fpage>116</fpage>
          –
          <lpage>123</lpage>
          . ACM
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          41.
          <string-name>
            <surname>Rozental</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fleischer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Amobee at SemEval-2018 task 1: GRU neural network with a CNN attention mechanism for sentiment classification</article-title>
          .
          <source>arXiv preprint arXiv:1804.04380</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          42.
          <string-name>
            <surname>Sachan</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaheer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Revisiting LSTM networks for semi-supervised text classification via mixed objective function</article-title>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          43.
          <string-name>
            <surname>Schoene</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dethlefs</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Unsupervised suicide note classification</article-title>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          44.
          <string-name>
            <surname>Schoene</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dethlefs</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Automatic identification of suicide notes from linguistic and sentiment features</article-title>
          .
          <source>In: Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage</source>
          ,
          <source>Social Sciences, and Humanities</source>
          . pp.
          <fpage>128</fpage>
          –
          <lpage>133</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          45.
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiros</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Action recognition using visual attention</article-title>
          .
          <source>arXiv preprint arXiv:1511.04119</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          46.
          <string-name>
            <surname>Suttles</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ide</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Distant supervision for emotion classification with discrete binary values</article-title>
          .
          <source>In: International Conference on Intelligent Text Processing and Computational Linguistics</source>
          . pp.
          <fpage>121</fpage>
          –
          <lpage>136</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          47.
          <string-name>
            <surname>Tay</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuan</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hui</surname>
            ,
            <given-names>S.C.</given-names>
          </string-name>
          :
          <article-title>Learning to attend via word-aspect associative fusion for aspect-based sentiment analysis</article-title>
          .
          <source>In: Thirty-Second AAAI Conference on Artificial Intelligence</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          48.
          <string-name>
            <surname>Turk</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          :
          <article-title>Amazon Mechanical Turk</article-title>
          .
          <source>Retrieved August 17, 2012</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          49. Twitter:
          <article-title>Counting characters</article-title>
          . https://developer.twitter.com/en/docs/basics/countingcharacters.html (
          <year>Dec 2018</year>
          ),
          <source>accessed on 2018-11-11</source>
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          50. Twitter:
          <article-title>Developer policy</article-title>
          . https://developer.twitter.com/en.html (
          <year>Dec 2018</year>
          ),
          <source>accessed on 2018-11-11</source>
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          51.
          <string-name>
            <surname>Van Den Oord</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dieleman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalchbrenner</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senior</surname>
            ,
            <given-names>A.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Wavenet: A generative model for raw audio</article-title>
          .
          <source>In: SSW</source>
          . p.
          <fpage>125</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          52.
          <string-name>
            <surname>Vezhnevets</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Osindero</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schaul</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heess</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaderberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Feudal networks for hierarchical reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1703.01161</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          53.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>L.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lai</surname>
            ,
            <given-names>K.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Dimensional sentiment analysis using a regional cnn-lstm model</article-title>
          .
          <source>In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          . vol.
          <volume>2</volume>
          , pp.
          <fpage>225</fpage>
          –
          <lpage>230</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          54.
          <string-name>
            <surname>Widen</surname>
            ,
            <given-names>S.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brooks</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Anger and disgust: Discrete or overlapping categories?</article-title>
          .
          <source>In: 2004 APS Annual Convention</source>
          , Boston College, Chicago, IL (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          55.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellmore</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Fast learning for sentiment analysis on bullying</article-title>
          .
          <source>In: Proceedings of the First International Workshop on Issues of Sentiment Discovery and Opinion Mining</source>
          . p.
          <fpage>10</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          56.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiros</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhudinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zemel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Show, attend and tell: Neural image caption generation with visual attention</article-title>
          .
          <source>In: International conference on machine learning</source>
          . pp.
          <fpage>2048</fpage>
          –
          <lpage>2057</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          57.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smola</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Hierarchical attention networks for document classification</article-title>
          .
          <source>In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          . pp.
          <fpage>1480</fpage>
          –
          <lpage>1489</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          58.
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , Schutze, H.,
          <string-name>
            <surname>Xiang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>ABCNN: Attention-based convolutional neural network for modeling sentence pairs</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>4</volume>
          ,
          <fpage>259</fpage>
          –
          <lpage>272</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          59.
          <string-name>
            <surname>Young</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hazarika</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poria</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cambria</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Recent trends in deep learning based natural language processing</article-title>
          .
          <source>IEEE Computational Intelligence Magazine</source>
          <volume>13</volume>
          (
          <issue>3</issue>
          ),
          <fpage>55</fpage>
          –
          <lpage>75</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>