Ingredients for Happiness: Modeling constructs via semi-supervised content-driven inductive transfer learning

Bakhtiyar Syed*1, Vijayasaradhi Indurthi*1, Kulin Shah1, Manish Gupta1,2, and Vasudeva Varma1

1 IIIT Hyderabad
{syed.b, vijaya.saradhi}@research.iiit.ac.in, kulin.shah@students.iiit.ac.in, {manish.gupta, vv}@iiit.ac.in
2 Microsoft
gmanish@microsoft.com

* The authors contributed equally.

Abstract. Modeling affect by understanding the social constructs behind it is an important task in devising robust and accurate systems for socially relevant scenarios. For the CL-Aff Shared Task (part of the Affective Content Analysis workshop @ AAAI 2019), the organizers released a dataset of 'happy' moments, the HappyDB corpus. The task is to detect two social constructs: agency (i.e., whether the author is in control of the happy moment) and the social characteristic (i.e., whether anyone other than the author was also involved in the happy moment). We employ an inductive transfer learning technique in which we utilize a pre-trained language model and fine-tune it on the target task for both binary classification problems. First, we use a language model pre-trained on the large WikiText-103 corpus; this language model is an AWD-LSTM with three hidden layers. Second, we fine-tune the pre-trained language model on both the labeled and the unlabeled instances of the HappyDB dataset. Finally, we train a classifier on top of the language model for each identification task. Our experiments using 10-fold cross validation on the corpus show that we achieve a high accuracy of ∼93% for detection of the social characteristic and ∼87% for agency of the author, showing significant gains over other baselines. We also show that using the unlabeled dataset for fine-tuning the language model in the second step improves accuracy by 1-2% on both detection tasks.

Keywords: Happy Moments · Inductive transfer learning · Language model fine-tuning · Agency Prediction · Social Characteristic Detection

1 Introduction

In our quest to better model happy moments and characterize them, it is important to understand which entities were involved in the happy moments, and the psychology and behaviours that make people happy. Once the reasons and behaviours that trigger happiness are identified, techniques can be developed to steer people towards such behaviours and thereby increase their happiness levels. It is therefore useful to answer questions such as (1) whether the author was in control of the happy moment (referred to as agency in this paper), and (2) whether multiple people contributed to the happy moment (referred to as the social characteristic in this paper). The CL-Aff shared task at AffCon 2019 (https://sites.google.com/view/affcon2019/cl-aff-shared-task) focuses on answering these two research questions. Asai et al. [1] developed HappyDB, a database of 100K happy moments, using crowdsourcing and made it publicly available. We use this dataset to build models for answering the two questions.

Recently, there has been significant progress in the area of inductive transfer learning for natural language processing (NLP). Training deep learning models from scratch requires enormous amounts of labeled data to achieve high accuracy. Recent advances, however, yield better performance on tasks such as text classification from only a few labeled instances [6].
In this work, we show that inductive transfer learning is greatly beneficial for identifying the agency and social characteristics of happy moments in the dataset. We also employ a variant in which we leverage the 'unlabeled' happy moments to further increase system performance. Our experiments using 10-fold cross validation on the corpus show that we achieve a high accuracy of ∼93% for detection of the social characteristic and ∼87% for agency of the author, showing significant gains over other baselines.

2 Problem Definition

We specifically attempt to solve Task 1 of the CL-Aff shared task, i.e., detecting the agency and social labels of a given happy moment. Formally, we model the task as follows.

Agency and social characteristic detection: Given a happy moment H, we intend to learn the agency label C1 and the social label C2. C1 indicates whether the author is in control of the happy moment being described. C2, on the other hand, indicates whether anybody other than the author, i.e., multiple entities, are involved in the happy moment being described. We model both tasks as binary classification problems. Thus, if the author is in control of the happy moment, C1=1; otherwise, C1=0. Similarly, if anyone other than the author is involved in the happy moment, C2=1; otherwise, C2=0.

To solve these problems, we propose a semi-supervised inductive transfer learning approach. Our approach is inspired by the ULMFiT architecture [6] and AWD-LSTM [11], which we discuss briefly below.

3 Preliminaries

In this section, we discuss the ULMFiT architecture and the AWD-LSTM model in brief.

3.1 The ULMFiT Architecture

Previous research has proposed multiple models for exploiting inductive transfer for Natural Language Processing (NLP) applications [5, 12]. In this work, we adapt a recently proposed architecture called ULMFiT (Universal Language Model Fine-tuning) for inductive transfer learning. The ULMFiT architecture proposed by Howard and Ruder [6] uses multiple heuristics for fine-tuning language models (LMs) to avoid overfitting when training neural models on small labeled datasets. ULMFiT not only reduces LM over-fitting but also prevents the catastrophic forgetting of information that earlier models built on LMs were susceptible to. We adapt the ULMFiT model for our inductive transfer learning approach with a variant and show that inductive transfer learning is greatly beneficial for identifying the agency and social characteristics of the happy moments in the given corpus. Besides exploiting the labeled data, our variant also utilizes the unlabeled corpus for fine-tuning the language model, which further improves the classification performance on both constructs.

3.2 The AWD-LSTM Model

Our inductive transfer learning mechanism also makes use of the Averaged-SGD Weight-Dropped Long Short Term Memory (AWD-LSTM) network [11]. The AWD-LSTM uses DropConnect on the recurrent weights and a variant of averaged SGD (NT-ASGD), along with several other well-known regularization strategies. We leverage AWD-LSTMs because they have been shown to be very effective in learning low-perplexity language models.
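To make the weight-dropping idea concrete, the following is a minimal, self-contained sketch of DropConnect applied to a recurrent weight matrix, in the spirit of the AWD-LSTM [11]. It is only an illustration, not the authors' implementation: the hidden size follows the setting reported in Section 5.2, while the dropout probability is an assumed value.

import torch
import torch.nn.functional as F

def weight_drop(weight: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    # DropConnect: zero individual *weights* (not activations) with probability p
    # and rescale the survivors by 1/(1-p); at evaluation time the weights pass through.
    return F.dropout(weight, p=p, training=training)

hidden_size = 1150                                  # hidden size as in Section 5.2
U = torch.randn(4 * hidden_size, hidden_size)       # hidden-to-hidden weights of the 4 LSTM gates
h_prev = torch.randn(hidden_size)                   # previous hidden state

U_dropped = weight_drop(U, p=0.5, training=True)    # the same dropped matrix is reused for all time steps
recurrent_preact = U_dropped @ h_prev               # recurrent contribution to the gate pre-activations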
4 Approach: Inductive Transfer Learning

In this section, we describe the three phases of the proposed inductive transfer learning approach. Figure 1 illustrates the overall system architecture.

[Fig. 1: Inductive transfer learning mechanism to identify agency and social characteristics. The same architecture is used for classifying the agency and social characteristics separately.]

The proposed inductive transfer learning framework for identification of the 'agency' and 'social' characteristics uses the following three phases, in order.

1. General Domain Pre-training: The first phase pre-trains the AWD-LSTM based language model on a huge text corpus. In our case, we use the language model pre-trained on the WikiText-103 [11] dataset, which consists of over 103 million word tokens from 28,595 pre-processed Wikipedia articles. General-domain pre-training helps the model learn basic characteristics of the language in question. It is essential that the LM be pre-trained on a huge corpus so that these general-domain characteristics are learned well.

2. Language Model Fine-tuning for the Target Task: After pre-training the language model on a huge corpus of general text, we fine-tune it using both the labeled and the unlabeled parts of the happy moments corpus. In this stage, we utilize task-specific data to fine-tune the language model in an unsupervised manner. As proposed in [6], our fine-tuning uses discriminative fine-tuning and slanted triangular learning rates to combat the catastrophic forgetting of language models observed in previous work [5, 12]. In discriminative fine-tuning, instead of keeping the same learning rate for all layers of the AWD-LSTM, a different learning rate is used to tune each of the three layers. The intuition is that since the layers capture different kinds of information [21], they must be fine-tuned to different extents; a single shared learning rate is not the best way to let the model converge to a suitable region of the parameter space. We therefore also adopt the slanted triangular learning rate [6], which first increases the learning rate and then linearly decays it as training iterations progress (see the schedule sketch after this list).

3. Classifier Fine-tuning for the Target Task: The weights obtained from the second phase are fine-tuned by extending the upstream architecture with two fully connected layers with softmax activation for the classification. In this phase, we adopt the gradual unfreezing heuristic [6] for our task (see the pipeline sketch after this list). In gradual unfreezing (GU), the layers are not all fine-tuned at the same time; instead, the model is gradually unfrozen starting from the last layer, as it contains the least general knowledge [21]. The last layer is unfrozen first and fine-tuned for one epoch. Subsequently, the next frozen layer is unfrozen and all unfrozen layers are fine-tuned. This is repeated until all layers have been fine-tuned and convergence is reached.
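To make the learning rate schedule concrete, here is a minimal sketch of the slanted triangular learning rate as defined by Howard and Ruder [6]. The default constants (cut_frac = 0.1, ratio = 32, lr_max = 0.01) are the ones reported in that paper, not values confirmed by our experiments.

import math

def slanted_triangular_lr(t: int, T: int, lr_max: float = 0.01,
                          cut_frac: float = 0.1, ratio: float = 32.0) -> float:
    """Learning rate at iteration t (0 <= t < T): a short linear warm-up over the
    first cut_frac fraction of iterations, followed by a long linear decay.
    ratio bounds how much smaller the lowest rate is compared to lr_max."""
    cut = math.floor(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

# Example: the rate rises during the first 10% of 1000 iterations, then decays.
schedule = [slanted_triangular_lr(t, T=1000) for t in range(1000)]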
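Building on the schedule above, the three phases can be assembled with the fastai v1 library, which ships the reference ULMFiT implementation. The sketch below is only an illustrative wiring of the pipeline under stated assumptions, not our exact training code: the file and column names ('moment', 'agency') are hypothetical, the batch size and base learning rates follow Section 5.2, fit_one_cycle applies a warm-up/decay schedule in the spirit of the slanted triangular learning rate, and slice(...) assigns discriminative (per-layer-group) learning rates.

import pandas as pd
from fastai.text import (TextLMDataBunch, TextClasDataBunch, AWD_LSTM,
                         language_model_learner, text_classifier_learner)

# Hypothetical CSVs: the CL-Aff release provides labeled and unlabeled happy moments.
all_moments = pd.read_csv('happydb_all.csv')        # labeled + unlabeled text, column 'moment'
train_lab = pd.read_csv('labeled_train.csv')        # 'moment' text plus 'agency' label
valid_lab = pd.read_csv('labeled_valid.csv')
lm_valid = all_moments.sample(frac=0.1, random_state=0)
lm_train = all_moments.drop(lm_valid.index)

# Phases 1 + 2: start from the WikiText-103 pre-trained AWD-LSTM (pretrained=True by
# default) and fine-tune the LM on all happy moments, labeled and unlabeled.
lm_data = TextLMDataBunch.from_df('.', train_df=lm_train, valid_df=lm_valid,
                                  text_cols='moment', bs=30)
lm = language_model_learner(lm_data, AWD_LSTM, drop_mult=0.3)
lm.unfreeze()
lm.fit_one_cycle(5, 4e-3)                           # LM base learning rate 0.004 (Section 5.2)
lm.save_encoder('happydb_enc')

# Phase 3: classifier fine-tuning with gradual unfreezing and discriminative LRs.
clf_data = TextClasDataBunch.from_df('.', train_df=train_lab, valid_df=valid_lab,
                                     text_cols='moment', label_cols='agency',
                                     vocab=lm_data.vocab, bs=30)
clf = text_classifier_learner(clf_data, AWD_LSTM, drop_mult=0.5)
clf.load_encoder('happydb_enc')
clf.fit_one_cycle(1, 1e-2)                          # only the new classifier head is trainable at first
clf.freeze_to(-2)                                   # unfreeze one more layer group
clf.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
clf.unfreeze()                                      # finally fine-tune all layers
clf.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))

An analogous classifier would be trained for the 'social' label by pointing label_cols at that column.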
5 Experiments

In this section, we describe the baselines and compare them with our proposed approach.

5.1 Baselines

Word embedding is a technique in NLP that maps the words of a language to dense vectors of real numbers in a continuous embedding space. Traditional representations such as BoW (Bag of Words) and TF-IDF (Term Frequency-Inverse Document Frequency) are mainly syntactic and cannot capture the semantic relationships between words. Word embedding techniques have gained popularity in a range of NLP tasks such as sentiment analysis [10, 20], named entity recognition [8, 18] and question answering [15].

As baselines, we use word embeddings together with demographic features of the author of the happy moment, such as age, country, gender, marital status, parenthood and happiness duration. We train multiple classifiers using this set of features. Specifically, for the baselines, we use the following pre-trained word and sentence embedding models: GloVe, Concatenated Power Mean, Google Universal Sentence Encoder, fastText, Lexical Vectors and InferSent embeddings. For word-based embeddings, the embedding of a sentence is computed by tokenizing the sentence into words and averaging the embeddings of those words.

We formulate the identification of the social and agency attributes as text classification tasks. Hence, we use multiple supervised learning algorithms, namely Logistic Regression (LR), Support Vector Machines (SVM), Random Forests (RF), Neural Networks with two hidden layers (NN), and gradient boosting (XGB), to train the models. In the following, we describe the word/sentence embeddings that we use as baselines.

(1) fastText [2] is a skipgram-based word embedding method in which each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and words are represented as the sum of these representations.

(2) GloVe [13] is an unsupervised learning algorithm for distributed word representation. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. We use the standard 300-dimensional GloVe embeddings (GloVe1) trained on 840B word tokens. As another baseline, we also use 200-dimensional GloVe embeddings trained on a Twitter corpus of 27B word tokens (GloVe2).

(3) InferSent [4] is a set of sentence embeddings trained by Facebook on the task of natural language inference: given two sentences, the model is trained to infer whether they form a contradiction, a neutral pairing, or an entailment. The output is a 4096-dimensional embedding.

(4) Concatenated Power Mean Word Embedding [16] generalizes average word embeddings to power mean word embeddings. Concatenating different types of power mean word embeddings considerably closes the gap to state-of-the-art methods mono-lingually and substantially outperforms many complex techniques cross-lingually.

(5) Lexical Vectors (LexVec) [17] is another word embedding model, similar to fastText but with a slightly modified objective; like fastText [2], it incorporates character n-gram (subword) information.

(6) The Universal Sentence Encoder [3] encodes text into high-dimensional vectors. The model is trained and optimized for greater-than-word-length text such as sentences, phrases or short paragraphs, on a variety of data sources and tasks, with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable-length English text and the output is a 512-dimensional vector.

For each of the embeddings in the above list, we train models using the different supervised learning algorithms. We use the scikit-learn implementations of these algorithms with the standard default parameters, without any hyper-parameter tuning.
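As an illustration of this baseline setup, the sketch below averages pre-trained word vectors into a sentence vector, appends one-hot encoded demographic features, and evaluates a default scikit-learn logistic regression with 10-fold cross validation. It is a minimal sketch under assumptions: the file name, column names and the way the GloVe vectors are loaded are hypothetical, and in practice this is repeated for every embedding/classifier pair and for both labels.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv('labeled_train.csv')            # hypothetical: 'moment', 'agency', demographic columns
glove = {}                                       # word -> 300-d vector, e.g. parsed from glove.840B.300d.txt
dim = 300

def sentence_vector(text: str) -> np.ndarray:
    # Average of the word vectors of the tokens; a zero vector if nothing is in-vocabulary.
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X_text = np.vstack([sentence_vector(m) for m in df['moment']])
X_demo = pd.get_dummies(df[['age', 'country', 'gender', 'marital', 'parenthood']]).values
X = np.hstack([X_text, X_demo])
y = (df['agency'] == 'yes').astype(int).values   # 'yes'/'no' label values assumed

clf = LogisticRegression()                       # scikit-learn defaults, no hyper-parameter tuning
print(cross_val_score(clf, X, y, cv=10, scoring='accuracy').mean())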
Model            LR (Acc./F1/AUC)       RF (Acc./F1/AUC)       NN (Acc./F1/AUC)       SVM (Acc./F1/AUC)      XGB (Acc./F1/AUC)
Universal        84.98 / 84.77 / 79.29  82.35 / 80.49 / 70.12  85.30 / 84.75 / 79.17  83.83 / 83.55 / 77.92  83.12 / 82.21 / 73.91
GloVe1           75.37 / 72.41 / 60.29  75.17 / 70.37 / 57.28  75.17 / 71.91 / 59.52  85.30 / 71.84 / 59.43  85.26 / 70.81 / 57.58
GloVe2           82.80 / 82.11 / 74.42  81.55 / 79.30 / 68.30  81.87 / 81.92 / 73.80  82.63 / 81.64 / 73.75  82.77 / 81.68 / 72.85
fastText         74.57 / 71.07 / 58.52  74.90 / 69.67 / 56.43  75.44 / 71.37 / 58.78  74.50 / 64.58 / 51.52  76.03 / 70.77 / 57.51
InferSent        80.00 / 78.72 / 73.54  85.20 / 82.66 / 73.44  83.27 / 80.90 / 74.91  79.35 / 78.27 / 72.90  85.84 / 84.51 / 77.09
LexVec           82.28 / 81.57 / 73.78  80.71 / 77.65 / 65.90  82.27 / 80.94 / 72.54  81.95 / 81.16 / 73.20  81.17 / 79.54 / 69.45
Concat. p-means  76.36 / 76.61 / 70.63  83.27 / 82.20 / 73.13  80.77 / 80.95 / 75.66  76.44 / 76.74 / 70.89  83.81 / 82.98 / 74.96

Table 1: 10-fold cross-validation Accuracy, F1 and AUC scores for Agency labels for the baseline models (embeddings + demographic features).

Model            LR (Acc./F1/AUC)       RF (Acc./F1/AUC)       NN (Acc./F1/AUC)       SVM (Acc./F1/AUC)      XGB (Acc./F1/AUC)
Universal        91.51 / 91.51 / 91.51  90.93 / 91.21 / 91.30  92.04 / 91.95 / 91.99  91.70 / 90.86 / 90.88  90.82 / 90.83 / 90.86
GloVe1           81.05 / 81.01 / 81.48  78.89 / 79.14 / 79.12  81.13 / 80.97 / 81.48  80.61 / 80.40 / 81.01  81.86 / 81.72 / 82.59
GloVe2           88.80 / 88.81 / 88.87  87.98 / 87.89 / 87.96  89.30 / 88.89 / 88.94  88.92 / 88.62 / 88.72  88.15 / 88.16 / 88.25
fastText         79.49 / 79.39 / 80.11  78.15 / 78.18 / 78.13  79.83 / 79.79 / 80.56  79.27 / 79.26 / 80.18  80.85 / 80.68 / 81.64
InferSent        86.40 / 86.76 / 86.71  87.54 / 88.63 / 88.70  88.45 / 88.97 / 89.04  85.53 / 85.84 / 85.77  89.64 / 89.93 / 90.04
LexVec           88.88 / 88.89 / 88.94  87.58 / 87.69 / 87.91  88.70 / 88.78 / 88.90  88.55 / 88.69 / 88.78  88.42 / 88.43 / 88.55
Concat. p-means  85.11 / 85.15 / 85.06  88.57 / 88.25 / 88.29  89.66 / 89.06 / 89.07  84.49 / 84.42 / 84.35  89.78 / 89.79 / 89.91

Table 2: 10-fold cross-validation Accuracy, F1 and AUC scores for Social labels for the baseline models (embeddings + demographic features).

5.2 Hyper-parameter Settings

As suggested in [6], we use the AWD-LSTM language model with three layers, 1150 hidden activations per layer and an embedding size of 400. The hidden layer of the classifier is of size 50. A batch size of 30 is used to train the model. LM fine-tuning and classifier fine-tuning use base learning rates of 0.004 and 0.01, respectively. We build separate models for the 'agency' and 'social' classification tasks.

5.3 Results and Analysis

Tables 1 and 2 show the performance of the models trained on different word embeddings across various machine learning algorithms for the Agency and Social label prediction tasks, respectively. Table 3 shows the performance of the inductive transfer model when the unlabeled corpus is used for LM fine-tuning, compared with not using it. Making use of the unlabeled corpus clearly pays off: it yields the highest accuracy for both agency and social detection, strongly outperforming the other baselines as well as the inductive transfer model trained without the unlabeled corpus.

Unlabeled data used in LM fine-tuning?  Agency (Acc./F1/ROC-AUC)  Social (Acc./F1/ROC-AUC)
No                                      85.97 / 85.45 / 78.93     92.10 / 92.11 / 92.23
Yes                                     86.45 / 86.16 / 80.18     92.70 / 92.69 / 92.71

Table 3: 10-fold cross-validation performance using inductive transfer learning for both the Agency and Social characteristic detection tasks.
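For completeness, the following is a minimal sketch of how the three reported measures (accuracy, F1 and ROC-AUC) can be gathered under 10-fold cross validation for any of the classifiers above, assuming a scikit-learn style estimator with predict_proba; whether the original evaluation averaged fold-level scores exactly this way is an assumption.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cross_validate_scores(make_model, X, y, n_splits=10, seed=42):
    """Return mean accuracy, F1 and ROC-AUC over stratified folds."""
    accs, f1s, aucs = [], [], []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        model = make_model()                        # fresh model per fold
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        prob = model.predict_proba(X[test_idx])[:, 1]
        accs.append(accuracy_score(y[test_idx], pred))
        f1s.append(f1_score(y[test_idx], pred))
        aucs.append(roc_auc_score(y[test_idx], prob))
    return np.mean(accs), np.mean(f1s), np.mean(aucs)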
6 Conclusions

In this work, we showed that inductive transfer learning via fine-tuning of language models gives robust performance for the detection of both the agency and the social characteristic. We also showed that using unlabeled data for LM fine-tuning in the second stage improves the 10-fold cross-validation evaluation measures for both tasks.

We plan to repeat this text classification with other pre-trained embeddings such as ELMo (Embeddings from Language Models) [14], Skip-Thought Vectors [7], Quick-Thoughts [9] and multi-task learning based sentence representations [19], and to investigate whether those embeddings can further improve classification accuracy. We would also like to experiment with other semi-supervised techniques to improve the classification accuracy.

References

1. Asai, A., Evensen, S., Golshan, B., Halevy, A., Li, V., Lopatenko, A., Stepanov, D., Suhara, Y., Tan, W.C., Xu, Y.: HappyDB: A corpus of 100,000 crowdsourced happy moments. In: Proceedings of LREC 2018. European Language Resources Association (ELRA), Miyazaki, Japan (May 2018)
2. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606 (2016)
3. Cer, D., Yang, Y., Kong, S.y., Hua, N., Limtiaco, N., John, R.S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., et al.: Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018)
4. Conneau, A., Kiela, D., Schwenk, H., Barrault, L., Bordes, A.: Supervised learning of universal sentence representations from natural language inference data. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 670–680. Association for Computational Linguistics, Copenhagen, Denmark (September 2017), https://www.aclweb.org/anthology/D17-1070
5. Dai, A.M., Le, Q.V.: Semi-supervised sequence learning. In: Advances in Neural Information Processing Systems. pp. 3079–3087 (2015)
6. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). vol. 1, pp. 328–339 (2018)
7. Kiros, R., Zhu, Y., Salakhutdinov, R.R., Zemel, R., Urtasun, R., Torralba, A., Fidler, S.: Skip-thought vectors. In: Advances in Neural Information Processing Systems. pp. 3294–3302 (2015)
8. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360 (2016)
9. Logeswaran, L., Lee, H.: An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893 (2018)
10. Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. pp. 142–150. Association for Computational Linguistics (2011)
11. Merity, S., Keskar, N.S., Socher, R.: Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182 (2017)
12. Mou, L., Meng, Z., Yan, R., Li, G., Xu, Y., Zhang, L., Jin, Z.: How transferable are neural networks in NLP applications? arXiv preprint arXiv:1603.06111 (2016)
13. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)
14. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
15. Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question answering. In: Advances in Neural Information Processing Systems. pp. 2953–2961 (2015)
16. Rücklé, A., Eger, S., Peyrard, M., Gurevych, I.: Concatenated p-mean word embeddings as universal cross-lingual sentence representations. arXiv preprint arXiv:1803.01400 (2018)
17. Salle, A., Villavicencio, A.: Incorporating subword information into matrix factorization word embeddings. arXiv preprint arXiv:1805.03710 (2018)
18. Santos, C.N.d., Guimaraes, V.: Boosting named entity recognition with neural character embeddings. arXiv preprint arXiv:1505.05008 (2015)
19. Subramanian, S., Trischler, A., Bengio, Y., Pal, C.J.: Learning general purpose distributed sentence representations via large scale multi-task learning. arXiv preprint arXiv:1804.00079 (2018)
20. Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., Qin, B.: Learning sentiment-specific word embedding for Twitter sentiment classification. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). vol. 1, pp. 1555–1565 (2014)
21. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? In: Advances in Neural Information Processing Systems. pp. 3320–3328 (2014)