<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Happy Together: Learning and Understanding Appraisal From Natural Language</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arun Rajendran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chiyu Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Abdul-Mageed</string-name>
          <email>muhammad.mageed@ubc.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Natural Language Processing Lab The University of British Columbia</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we explore various approaches for learning two types of appraisal components from happy language. We focus on `agency' of the author and the `sociality' involved in happy moments based on the HappyDB dataset. We develop models based on deep neural networks for the task, including uni- and bi-directional long short-term memory networks, with and without attention. We also experiment with a number of novel embedding methods, such as embedding from neural machine translation (as in CoVe) and embedding from language models (as in ELMo). We compare our results to those acquired by several traditional machine learning methods. Our best models achieve 87.97% accuracy on agency and 93.13% accuracy on sociality, both of which are significantly higher than our baselines.</p>
      </abstract>
      <kwd-group>
        <kwd>Emotion</kwd>
        <kwd>emotion detection</kwd>
        <kwd>sentiment analysis</kwd>
        <kwd>language models</kwd>
<kwd>text classification</kwd>
        <kwd>agency</kwd>
        <kwd>sociality</kwd>
        <kwd>appraisal theory</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        Emotion is an essential part of human experience that affects both individual and
group decision making. For this reason, it is desirable to understand the language
of emotion and to develop tools that aid such an understanding. Although there has
recently been work on detecting human emotion in text data [
        <xref ref-type="bibr" rid="ref1 ref2 ref20">20, 1,
2</xref>
        ], we still lack a deeper understanding of the various components related to emotion.
Available emotion detection tools have so far been based on theories of basic
emotion such as the work of Paul Ekman and colleagues (e.g., [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) and extensions
of these (e.g., Robert Plutchik's models [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]). Emotion theory, however, has
more to offer than mere categorization of human experience based on valence
(e.g., anger, joy, sadness). As such, computational treatments of emotion have yet
to benefit from existing (e.g., psychological) theories by building models that
capture the nuances these theories offer. Our work focuses on the cognitive appraisal
theory [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], in which Roseman posits the existence of five appraisal components,
including that of `agency'. Agency refers to whether a stimulus is caused by the
individual (self-caused), by another individual (other-caused), or merely by the
situation (circumstance-caused). Identifying the exact type of agency related
to an emotion is useful in that it helps determine the target of the emotion (i.e.,
another person or some other type of entity). We focus on agency since it was
recently labeled as an extension of the HappyDB dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] as part of the
CL-Aff shared task [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] (https://sites.google.com/view/affcon2019/cl-aff-shared-task).
      </p>
      <p>The CL-Aff shared task distribution of HappyDB also includes labels for the
concept of `sociality'. Sociality refers to whether or not people other than the
author are involved in the emotion situation. Identifying the type of sociality
associated with an emotion further enriches our knowledge of the emotion
experience. For example, an emotion experience with a sociality value of `yes' (i.e.,
other people are involved) could teach us about social groups (e.g., families)
and the range of emotions expressed during specific types of situations (e.g.,
weddings, deaths). Overall, agency and sociality are two concepts that we believe
to be useful. Predictions of these concepts can be added to a computational
toolkit that can be run on large datasets to derive useful insights. To the best
of our knowledge, no previous work has investigated learning these two concepts from
language data. In this paper, we thus aim to pioneer this learning task by
developing novel deep learning models for predicting agency and sociality.</p>
      <p>Moreover, we train attention-based models that are able to assign weights
to the features contributing to a given task. In other words, we are able to identify
the words most relevant to each of the two concepts of agency and sociality.
This not only enriches our knowledge about the distribution of these language
items over each of these concepts, but also provides us with intuition about
what our models learn (i.e., model interpretability). Interpretability is becoming
increasingly important, especially for deep learning models, since many of these
models are currently deployed in various real-life domains. Being able to identify
why a model is making a certain decision helps us explain model decisions to
end users, including by showing them examples of attention-based outputs.</p>
      <p>
        In modeling agency and sociality, we experiment with various machine
learning methods, both traditional and deep learning-based. In this way, we are able
to establish strong baselines for this task as well as report competitive models.
Our deep learning models are based on recurrent neural networks [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. We
also exploit frameworks with novel embedding methods, including embeddings
from neural machine translation as in CoVe [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and embeddings from language
models as in ELMo [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Additionally, we investigate the utility of fine-tuning
our models using the recently proposed ULMFiT model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>Overall, we offer the following contributions: (1) we develop successful models
for identifying the novel concepts of agency and sociality in happy language, and
(2) we probe our models to offer meaningful interpretations (in the form of
visualizations) of the contribution of different words to the learning tasks, thereby
supporting model interpretability. The rest of the paper is organized as follows:
Section 2 describes our dataset and data splits. In Section 3, we describe our
methods, and in Section 4 we provide our results. We offer attention-based model
visualizations in Section 5, and we conclude in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <sec id="sec-2-1">
        <title>Dataset</title>
        <p>
          HappyDB [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] is a dataset of about 100,000 `happy moments' crowd-sourced via
Amazon's Mechanical Turk, where each worker was asked to describe, in a complete
sentence, "what made them happy in the past 24 hours". Each worker was asked to
describe three such moments. In particular, we exploit the agency and sociality
annotations provided on the dataset as part of the recent CL-Aff shared task
(https://sites.google.com/view/affcon2019/cl-aff-shared-task?authuser=0), associated
with the AAAI-19 workshop on affective content analysis
(https://sites.google.com/view/affcon2019/).
        </p>
        <p>For this particular shared task, 10,560 moments labeled for agency and
sociality were available as labeled training data. (There were also 72,326 moments
available as unlabeled training data, but we did not use these in our work.) In
addition, 17,215 moments were used as test data. Test labels were not released, and
teams were expected to submit the predictions of their systems on the test split. For
our models, we split the labeled data into an 80% training set (8,448 moments)
and a 20% development set (2,112 moments). We train our models on train and
tune parameters on dev. For our system runs, we submit labels from the
models trained only on the 8,448 training data points. The distribution of the labeled
data is as follows: agency (`yes'=7,796; `no'=2,764), sociality (`yes'=5,625; `no'=
4,935).</p>
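        <p>For concreteness, the following is a minimal sketch of one way to produce such an 80/20 split with scikit-learn (our illustration, not part of the shared task distribution); the file name and column names (happy_moments.csv, agency) are assumptions, and stratifying on the label is our choice here.</p>
        <preformat># Sketch of the 80/20 train/dev split described above. Assumes a CSV
# with the moment text and an `agency' label column (hypothetical names).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("happy_moments.csv")  # hypothetical file name

# Stratify on the label so train and dev keep the same class balance.
train_df, dev_df = train_test_split(
    df, test_size=0.20, stratify=df["agency"], random_state=42
)
print(len(train_df), len(dev_df))  # roughly 8,448 and 2,112 moments</preformat>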
      </sec>
      <sec id="sec-2-2">
        <title>Methods</title>
        <sec id="sec-2-2-1">
          <title>Traditional Machine Learning Models</title>
          <p>We develop multiple basic machine learning models, including Naive Bayes,
a linear Support Vector Machine (LinSVM), and Logistic Regression (LogReg). For
each model, we have two settings: (a) we use a bag-of-words (BOW) approach
(with n-gram values from 1 to 4), and (b) we combine the BOW with a TF-IDF
transformation. These are strong baselines, due to our use of higher-order
n-grams (with n up to 4). We use the default parameters of Scikit-learn
(https://scikit-learn.org/) to train all the classical machine learning models. We also
use an ensemble method that takes the prediction labels from each of the
classifiers and takes the majority vote among the different model predictions to decide
the final prediction. We report results in terms of binary classification
accuracy.</p>
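          <p>As a rough illustration of this baseline setup (a sketch under our own assumptions, not the exact code used), the snippet below builds BOW and BOW+TF-IDF pipelines in scikit-learn with default parameters and combines three classifiers by hard majority voting; train_texts, train_labels, dev_texts, and dev_labels are assumed to come from the split in Section 2.</p>
          <preformat># Sketch of the baseline classifiers and the majority-vote ensemble.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

# Setting (a): bag of words with 1- to 4-grams; setting (b) adds TF-IDF.
models = [
    ("nb", make_pipeline(CountVectorizer(ngram_range=(1, 4)),
                         MultinomialNB())),
    ("svm", make_pipeline(CountVectorizer(ngram_range=(1, 4)),
                          TfidfTransformer(), LinearSVC())),
    ("logreg", make_pipeline(CountVectorizer(ngram_range=(1, 4)),
                             TfidfTransformer(), LogisticRegression())),
]

# Hard (majority) voting over the individual classifiers' predicted labels.
ensemble = VotingClassifier(estimators=models, voting="hard")
ensemble.fit(train_texts, train_labels)
print(accuracy_score(dev_labels, ensemble.predict(dev_texts)))</preformat>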
        </sec>
        <sec id="sec-2-2-2">
          <title>Deep Learning</title>
          <p>
            We apply various models based on deep neural networks. All our deep learning
models are variations of recurrent neural networks (RNNs), which have
achieved remarkable performance on text classification tasks such as sentiment
analysis and emotion detection [
            <xref ref-type="bibr" rid="ref1 ref11 ref16 ref18 ref19 ref21">19, 16, 11, 1, 21, 18</xref>
            ]. RNNs and their variants are
able to capture sequential dependencies, especially in time-series data. One
weakness of basic RNNs, however, is that gradients either vanish or explode
as time gaps become larger. Long short-term memory (LSTM) networks [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]
were developed to address this limitation. We also use a bidirectional LSTM
(BiLSTM). A BiLSTM extends the unidirectional LSTM network by offering a
second layer in which the hidden-to-hidden states flow in the opposite chronological
order [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ]. Overall, our systems can be categorized as follows: (1) systems tuning
simple pre-trained embeddings; (2) systems tuning embeddings from neural
machine translation (NMT); (3) systems tuning embeddings from language models
(LM); and (4) systems directly fine-tuning language models (ULMFiT).
          </p>
          <p>Exploiting Simple GloVe Embeddings. For the embedding layer, we obtain
300-dimensional embedding vectors for tokens using GloVe's Common Crawl
pre-trained model [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. GloVe embeddings are global vectors for word
representation based on the frequencies of pairs of co-occurring words. In this setting, we fix
the embedding layer in our deep learning models to the pre-trained GloVe
embeddings. We apply four architectures (i.e., LSTM, LSTM with attention (LSTM-A),
BiLSTM, and BiLSTM with attention (BiLSTM-A)) to learn to classify agency and
sociality, respectively. For each model, we optimize the number of layers and the
number of hidden units within each layer to obtain the best performance. We experiment
with layers from the set {1, 2} and hidden units from the set {128, 256, 512}.
Each setting was run with batch size 64 and dropout 0.75 for 20 epochs.</p>
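          <p>A minimal PyTorch sketch of this LSTM-with-attention family follows (our illustration of the general architecture, not released code); glove_weights is assumed to be a precomputed vocabulary-by-300 tensor of GloVe vectors.</p>
          <preformat># Sketch: LSTM classifier with additive self-attention over a frozen
# GloVe embedding layer. `glove_weights' (vocab_size x 300) is assumed.
import torch
import torch.nn as nn

class AttnLSTMClassifier(nn.Module):
    def __init__(self, glove_weights, hidden=256, layers=1, dropout=0.75):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.lstm = nn.LSTM(300, hidden, num_layers=layers, batch_first=True)
        self.attn = nn.Linear(hidden, 1)  # one attention score per time step
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(hidden, 2)   # binary label: agency or sociality

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))       # (B, T, hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # (B, T, 1), over T
        context = (weights * h).sum(dim=1)            # weighted sum of states
        return self.out(self.drop(context)), weights.squeeze(-1)

# The returned per-word weights are what we visualize in Section 5.</preformat>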
          <p>Embeddings from NMT. McCann et al. [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] proposed CoVe, an approach for learning
contextualized word embeddings directly from machine translation models. CoVe
not only contains the word-level information from GloVe but also information
learned with an LSTM in the context of the MT task. CoVe is trained on three
different MT datasets: the 2016 WMT multimodal dataset, the 2016 IWSLT training
set, and the 2017 WMT news track training set. On top of CoVe, we use an LSTM
with attention. Our hyperparameters for CoVe are shown in Table 1.</p>
          <p>Embeddings from LM. Peters et al. [
            <xref ref-type="bibr" rid="ref14 ref6">14, 6</xref>
            ] introduced ELMo, a model based
on learning embeddings directly from language models. The pre-training with
language models provides ELMo with both complex characteristics of words and
the usage of these words across various linguistic contexts. ELMo is
trained on the One Billion Word Benchmark dataset [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ], and these embeddings are
employed as our input layer. More specifically, we extract the third layer of the
ELMo representation and experiment with it using an LSTM-with-attention
network. This is our best model, and the only model we submitted for
the competition. We provide its hyperparameters in Table 1.</p>
          <p>Fine-Tuning an LM: ULMFiT. Transfer learning is extensively used in the field
of computer vision to improve the ability of models to learn from new data.
Inspired by this idea, Howard and Ruder [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] present ULMFiT (http://nlp.fast.ai/), which fine-tunes a
pretrained language model (trained on the Wikitext-103 dataset). With ULMFiT,
we use a forward language model. We use the same network architecture and
hyperparameters (except the dropout ratio and number of epochs) that Howard and
Ruder used, as we report in Table 1.</p>
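          <p>For the ELMo setting, the embeddings can be obtained with AllenNLP [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], roughly as sketched below; the option and weight file paths are assumptions (any compatible released ELMo checkpoint works the same way).</p>
          <preformat># Sketch of extracting ELMo representations with AllenNLP.
from allennlp.modules.elmo import Elmo, batch_to_ids

options = "elmo_options.json"  # assumed local path to released options file
weights = "elmo_weights.hdf5"  # assumed local path to released weights file

# One mixed output representation; reading a single layer (as we do with
# the third layer) requires accessing the per-layer activations instead.
elmo = Elmo(options, weights, num_output_representations=1, dropout=0.0)

sentences = [["I", "had", "lunch", "with", "my", "coworkers"]]
char_ids = batch_to_ids(sentences)  # shape: (batch, words, max_chars)
embeddings = elmo(char_ids)["elmo_representations"][0]  # (1, 6, 1024)</preformat>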
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Results</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>Tables 2, 3, and 4 report the performance of our models with simple GloVe
embeddings; the best of these settings acquires the best accuracy (0.9181) on the
sociality task. These results suggest that sociality is an easier task than agency.
One confounding factor is that the sociality training data is more balanced than
the agency training data (with the majority class at 0.5327 for sociality vs. 0.7382
for agency).</p>
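      <p>These majority-class figures follow directly from the label counts given in Section 2, as the following quick check shows:</p>
      <preformat># Majority-class baselines implied by the label counts in Section 2.
agency_yes, agency_no = 7796, 2764
social_yes, social_no = 5625, 4935

print(agency_yes / (agency_yes + agency_no))  # 0.73825..., about 0.7382
print(social_yes / (social_yes + social_no))  # 0.53267..., about 0.5327</preformat>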
      <p>[Table: accuracy of the LSTM, BiLSTM, LSTM-A, and BiLSTM-A models with
GloVe embeddings, for layer counts {1, 2} and hidden units {128, 256, 512}; the
numeric cells are not recoverable in this version.]</p>
      <p>Next, we present our results with the CoVe-, ELMo-, and ULMFiT-trained
models in Table 5, which reports accuracy, AUC score, and F1 score (for the
positive class in binary classification) on the validation set. From
Tables 2, 3, 4, and 5, it can be observed that our CoVe, ELMo, and ULMFiT
models lead to (a) significant performance improvements over
traditional machine learning models and (b) sizable improvements over deep
learning models with simple GloVe embeddings.</p>
      <p>Among the systems with pre-trained embeddings (described in Section 3.2),
ELMo performs best. One nuance is that ELMo outperforms the ULMFiT
model, which fine-tunes a language model rather than the embeddings. One
probable explanation is the impact of the attention mechanism used in the LSTM
model with ELMo embeddings, which is crucial for this particular task and is not
present in the ULMFiT model. We now turn to probing our models further by
visualizing the attention weights for words in our data.</p>
      <p>For interpretability, and to acquire a better understanding of the two important
concepts of agency and sociality, we provide attention-based visualizations of 24
example sentences from our data. In each example, the color intensity corresponds
to the self-attention weights assigned by our model (LSTM-A). Figures 1
(hand-picked) and 2 (randomly picked) provide examples from the agency data,
for the positive and then the negative class, respectively. As the figures demonstrate,
the model's attentions are relatively intuitive. For example, for the positive-class
cases (hand-picked), the model attends to words such as `my', `with', and
`coworkers', which refer to (or establish a connection with) the agent. Figures 3
and 4 provide similar visualizations for the sociality task. Again, the attention
weights cast some intuitive light on the concept of sociality. The model, for
example, attends to words like `daughter', `grandson', `members', and `family'
in the hand-picked positive cases. Also, in the hand-picked negative examples, the
model attends to words referring to non-persons such as `book', `mail', `workout',
and `dog'.</p>
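      <p>As a sketch of how such a visualization can be produced (our own minimal formulation), the tokens of a sentence can be rendered with an intensity proportional to the weights returned by the attention classifier sketched in Section 3:</p>
      <preformat># Sketch: print per-word attention weights as a simple text heatmap.
# `tokens' and `weights' are assumed to come from the attention model.
tokens = ["I", "had", "lunch", "with", "my", "coworkers"]
weights = [0.05, 0.05, 0.15, 0.25, 0.20, 0.30]  # illustrative values

peak = max(weights)
for tok, w in zip(tokens, weights):
    bar = "#" * int(10 * w / peak)  # longer bar = stronger attention
    print(f"{tok:>12s} {w:.2f} {bar}")</preformat>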
      <p>[Figures 1 and 2: (a) examples of happy moments with a positive agency label;
(b) examples of happy moments with a negative agency label.]</p>
      <p>[Figures 3 and 4: (a) examples of happy moments with a positive sociality label;
(b) examples of happy moments with a negative sociality label.]</p>
      <p>Next, in Figure 5, we provide the top 35 words the model attends to in
the positive classes of the agency and sociality datasets. Again, many
of these words are intuitively relevant to each task. For example, for agency, the
model attends to words referring to others the agent is interacting with
(e.g., `girlfriend', `friend', `mother', and `family') and social activities the agent
is possibly contributing to (e.g., `lunch', `trip', `party', and `dinner'). Similarly,
for sociality, the model attends to verbs indicating being socially involved
(e.g., `told', `came', `bought', and `took') and to others/social groups (e.g., `friends',
`son', `family', and `daughter'). Clearly, the two concepts of agency and sociality
are not orthogonal: the words the model attends to in each case indicate some
degree of overlap between the two concepts.</p>
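      <p>A figure like Figure 5 can be produced by aggregating the per-word attention weights over the positive-class examples and keeping the top 35 words; a minimal sketch follows (our own formulation, with model, dev_batches, and vocab assumed from the setup in Section 3):</p>
      <preformat># Sketch: aggregate attention weight per word over the positive class.
from collections import Counter

totals = Counter()
for token_ids, labels in dev_batches:        # assumed iterable of batches
    _, attn = model(token_ids)               # (B, T) attention weights
    for ids_row, attn_row, label in zip(token_ids, attn, labels):
        if label != 1:                       # keep the positive class only
            continue
        for tok_id, w in zip(ids_row.tolist(), attn_row.tolist()):
            totals[vocab[tok_id]] += w       # vocab maps ids to strings

print(totals.most_common(35))                # the top 35 attended words</preformat>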
      <sec id="sec-3-1">
        <title>Conclusion</title>
        <p>In this paper, we reported successful models for learning agency and sociality in
a supervised setting. We also presented extensive visualizations based on the
models' self-attention that enhance our understanding of these two concepts as
well as of model decisions (i.e., interpretability). In the future, we plan to develop
models for the same tasks based on more sophisticated attention mechanisms.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Acknowledgement</title>
        <p>We acknowledge the support of the Natural Sciences and Engineering Research
Council of Canada (NSERC). The research was partially enabled by support
from WestGrid (www.westgrid.ca) and Compute Canada (www.computecanada.ca).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abdul-Mageed</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ungar</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Emonet: Fine-grained emotion detection with gated recurrent neural networks</article-title>
          .
          <source>In: Proceedings of the 55th Annual</source>
          <article-title>Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</article-title>
          .
          <source>vol. 1</source>
          , pp.
          <volume>718</volume>
          –
          <issue>728</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Alhuzali</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abdul-Mageed</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ungar</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Enabling deep learning of emotion with first-person seed expressions</article-title>
          .
          <source>In: Proceedings of the Second Workshop on Computational Modeling of People's Opinions</source>
          , Personality, and Emotions in Social Media. pp.
          <volume>25</volume>
          –
          <issue>35</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Asai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evensen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golshan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopatenko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stepanov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suhara</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>W.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>HappyDB: A corpus of 100,000 crowdsourced happy moments</article-title>
          .
          <source>arXiv preprint arXiv:1801.07746</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chelba</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ge</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brants</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koehn</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robinson</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>One billion word benchmark for measuring progress in statistical language modeling</article-title>
          .
          <source>arXiv preprint arXiv:1312.3005</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ekman</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>An argument for basic emotions</article-title>
          .
          <source>Cognition &amp; emotion 6(3-4)</source>
          ,
          <volume>169</volume>
          –
          <fpage>200</fpage>
          (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grus</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tafjord</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dasigi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>N.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmitz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.S.:</given-names>
          </string-name>
          <article-title>AllenNLP: A deep semantic natural language processing platform</article-title>
          .
          <source>In: ACL workshop for NLP Open Source Software</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Supervised sequence labelling</article-title>
          .
          <source>In: Supervised Sequence Labelling with Recurrent Neural Networks</source>
          , pp.
          <volume>5</volume>
          –
          <fpage>13</fpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation 9(8)</source>
          ,
          <volume>1735</volume>
          –
          <fpage>1780</fpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Universal language model fine-tuning for text classification</article-title>
          . In:
          <article-title>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</article-title>
          .
          <source>vol. 1</source>
          , pp.
          <volume>328</volume>
          –
          <issue>339</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Jaidka</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mumick</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chhaya</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ungar</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>The CL-Aff Happiness Shared Task: Results and Key Insights</article-title>
          .
          <source>In: Proceedings of the 2nd Workshop on Affective Content Analysis @ AAAI (AffCon2019)</source>
          . Honolulu, Hawaii
          (
          <year>January 2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qiu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Recurrent neural network for text classification with multi-task learning</article-title>
          .
          <source>In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence</source>
          . pp.
          <volume>2873</volume>
          –
          <fpage>2879</fpage>
          . AAAI Press (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>McCann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bradbury</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiong</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
          </string-name>
          , R.:
          <article-title>Learned in translation: Contextualized word vectors</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp.
          <volume>6294</volume>
          –
          <issue>6305</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , C.D.:
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In: EMNLP</source>
          . vol.
          <volume>14</volume>
          , pp.
          <volume>1532</volume>
          –
          <issue>1543</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>arXiv preprint arXiv:1802.05365</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Plutchik</surname>
            ,
            <given-names>R.:</given-names>
          </string-name>
          <article-title>The psychology and biology of emotion</article-title>
          . HarperCollins College Publishers (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Context-sensitive Twitter sentiment classification using neural network</article-title>
          .
          <source>In: AAAI</source>
          . pp.
          <volume>215</volume>
          –
          <issue>221</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Roseman</surname>
            ,
            <given-names>I.J.:</given-names>
          </string-name>
          <article-title>Cognitive determinants of emotion: A structural theory</article-title>
          .
          <source>Review of personality &amp; social psychology</source>
          (
          <year>1984</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Samy</surname>
            ,
            <given-names>A.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>El-Beltagy</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hassanien</surname>
          </string-name>
          , E.:
          <article-title>A context integrated model for multilabel emotion detection</article-title>
          .
          <source>Procedia computer science 142</source>
          ,
          <volume>61</volume>
          –
          <fpage>71</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Tai</surname>
            ,
            <given-names>K.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , C.D.:
          <article-title>Improved semantic representations from tree-structured long short-term memory networks</article-title>
          .
          <source>arXiv preprint arXiv:1503.00075</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Volkova</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bachrach</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Inferring perceived demographics from user emotional tone and user-environment emotional contrast</article-title>
          .
          <source>In: Proceedings of the 54th ACL</source>
          . vol.
          <volume>1</volume>
          , pp.
          <volume>1567</volume>
          –
          <issue>1578</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qiu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Cached long short-term memory neural networks for document-level sentiment classification</article-title>
          .
          <source>In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <volume>1660</volume>
          –
          <issue>1669</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tian</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qi</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Attention-based bidirectional long short-term memory networks for relation classification</article-title>
          .
          <source>In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          .
          <source>vol. 2</source>
          , pp.
          <volume>207</volume>
          –
          <issue>212</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>