
Ingredients for Happiness: Modeling constructs via semi-supervised content driven inductive transfer learning

Bakhtiyar Syed*

Vijayasaradhi Indurthi*

Kulin Shah

kulin.shah@students.iiit.ac.in

Manish Gupta

gmanish@microsoft.com

Vasudeva Varma

IIIT Hyderabad

Modeling affect by understanding the social constructs behind it is an important task in devising robust and accurate systems for socially relevant scenarios. In the CL-Aff Shared Task (part of the Affective Content Analysis workshop @ AAAI 2019), the organizers released a dataset of 'happy' moments, called the HappyDB corpus. The task is to detect two social constructs: agency (i.e., whether the author is in control of the happy moment) and social characteristics (i.e., whether anyone other than the author was also involved in the happy moment). We employ an inductive transfer learning technique in which we take a pre-trained language model and fine-tune it on the target task for both binary classification tasks. First, we use a language model pre-trained on the large WikiText-103 corpus; this language model is an AWD-LSTM with three hidden layers. Second, we fine-tune the pre-trained language model on both the labeled and unlabeled instances from the HappyDB dataset. Finally, we train a classifier on top of the language model for each of the identification tasks. Our experiments using 10-fold cross validation on the corpus show that we achieve a high accuracy of 93% for detection of the social characteristic and 87% for agency of the author, showing significant gains over other baselines. We also show that using the unlabeled dataset for fine-tuning the language model in the second step improves our accuracy by 1-2% across detection of both the constructs.

Keywords: Happy Moments · Inductive transfer learning · Language model fine-tuning · Agency Prediction · Social Characteristic Detection

1 Introduction

In our quest to better model happy moments and characterize them, it is important to understand which entities were involved in the happy moments, and the psychology and behaviours which make people happy. Once the reasons and behaviours which trigger happiness are identified, techniques can be developed to steer towards behaviours that can increase people's happiness levels. It is therefore useful to answer questions like (1) whether the author was in control of the happy moment (referred to as agency in this paper), and (2) whether multiple people contributed to the happy moment (referred to as social characteristic in this paper). The CL-Aff shared task at AffCon 2019³ focuses on answering these two research questions. Asai et al. [1] developed a database of 100K happy moments, HappyDB, using crowdsourcing and made it publicly available. We use this dataset to build models for answering the two questions.

* The authors contributed equally.

Recently, there has been significant progress in the area of inductive transfer learning for natural language processing (NLP). Training deep learning models from scratch requires an enormous amount of labeled data to achieve high accuracy. In recent times though, there have been advances which give better performance on tasks like text classification from only a few labeled instances [6].

In this work, we show that inductive transfer learning is greatly beneficial in identifying the agency and social characteristics of happy moments in the dataset. We also employ a variant wherein we utilize the 'unlabeled' happy moments and leverage them to increase system performance. Our experiments using 10-fold cross validation on the corpus show that we achieve a high accuracy of 93% for detection of the social characteristic and 87% for agency of the author, showing significant gains over other baselines.

2 Problem Definition

We specifically attempt to solve Task 1 of the CL-Aff shared task, i.e., detecting the agency and social labels of the given happy moment. Formally, we model the task as follows.

Agency and Social characteristic detection: Given a happy moment H, we intend to learn the agency label C1 and the social label C2. C1 indicates whether the author is in control of the happy moment being described. C2, on the other hand, indicates whether anybody other than the author is involved, i.e., whether multiple entities are involved in the happy moment being described. We model both tasks as binary classification problems. Thus, if the author is in control of the happy moment, C1=1; otherwise, C1=0. Similarly, if anyone other than the author is involved in the happy moment, C2=1; otherwise, C2=0.
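The label definition above can be illustrated with a small sketch. It assumes, hypothetically, that each annotated moment carries raw "yes"/"no" strings for agency and social; the `HappyMoment` record and the `encode_labels` helper are illustrative names, not part of the shared task tooling.

```python
from dataclasses import dataclass


@dataclass
class HappyMoment:
    """Hypothetical record for one annotated happy moment H."""
    text: str
    agency: str  # assumed raw annotation: "yes" or "no"
    social: str  # assumed raw annotation: "yes" or "no"


def encode_labels(moment: HappyMoment) -> tuple[int, int]:
    """Map the raw annotations to the binary targets (C1, C2)."""
    c1 = 1 if moment.agency.strip().lower() == "yes" else 0  # author in control
    c2 = 1 if moment.social.strip().lower() == "yes" else 0  # others involved
    return c1, c2


example = HappyMoment("I baked a cake with my sister.", agency="yes", social="yes")
labels = encode_labels(example)
```

Because the two targets are independent, each one can be handed to its own binary classifier.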

To solve these problems, we propose a semi-supervised inductive transfer learning approach. Our approach is inspired by the ULMFiT architecture [6] and the AWD-LSTM [11], which we discuss in brief below.

³ https://sites.google.com/view/affcon2019/cl-aff-shared-task

3 Preliminaries

In this section, we discuss the ULMFiT architecture and the AWD-LSTM model in brief.

3.1 The ULMFiT Architecture

Previous research has proposed multiple models for exploiting inductive transfer for Natural Language Processing (NLP) applications [5, 12]. In this work, we adapt a recently proposed architecture called ULMFiT (Universal Language Model Fine-tuning) for inductive transfer learning. The ULMFiT architecture proposed by Howard and Ruder [6] uses multiple heuristics for fine-tuning language models (LMs) to avoid overfitting when training neural models on small labeled datasets. The ULMFiT architecture not only reduces LM overfitting but also prevents catastrophic forgetting of information, to which earlier models built on LMs were susceptible. We adapt the ULMFiT model for our inductive transfer learning approach with a variant and show that inductive transfer learning is greatly beneficial for identifying agency and social characteristics of the happy moments in the given corpus. Besides exploiting the labeled data, our variant also utilizes the unlabeled corpus for fine-tuning the language model, which further improves classification performance across both constructs.

3.2 The AWD-LSTM Model

Our inductive transfer learning mechanism also makes use of the Averaged-SGD Weight-Dropped Long Short-Term Memory (AWD-LSTM) network [11]. The AWD-LSTM uses DropConnect and a variant of averaged SGD (NT-ASGD) along with several other well-known regularization strategies. We use AWD-LSTMs as they have been shown to be very effective in learning low-perplexity language models.

4 Approach: Inductive Transfer Learning

In this section, we describe the three phases of the proposed inductive transfer learning approach. Figure 1 illustrates the overall system architecture.

The proposed inductive transfer learning framework for identification of the 'agency' and 'social' characteristics makes use of the following three phases in order.

1. General Domain Pre-training: The first phase pre-trains the AWD-LSTM based language model on a large text corpus. In our case, we use the language model pre-trained on the WikiText-103 [11] dataset, which consists of about 103 million words from 28,595 pre-processed Wikipedia articles. General domain pre-training helps the model learn basic characteristics of the language in question. It is essential that the LM be pre-trained on a large corpus so that these general-domain characteristics are learned well.

2. Language Model Fine-tuning for the Target Task: In this step, after pre-training the language model on a large corpus, we fine-tune it using both the labeled and the unlabeled parts of the happy moments corpus. In this stage, we utilize task-specific data to fine-tune our language model in an unsupervised manner. As proposed in [6], our fine-tuning involves discriminative fine-tuning and slanted triangular learning rates to combat the catastrophic forgetting exhibited by language models in previous works [5, 12]. In discriminative fine-tuning, instead of keeping the same learning rate for all the layers of the AWD-LSTM, a different learning rate is used for each of the three layers. The intuition is that since each layer captures a different kind of information [21], the layers must be fine-tuned to different extents. Also, using the same learning rate throughout training is not the best way to enable the model to converge to a suitable region of the parameter space. Thus we adopt the slanted triangular learning rate [6], which first increases the learning rate and then linearly decays it as the number of training samples increases.

3. Classifier Fine-tuning for the Target Task: The weights that we obtain from the second phase are fine-tuned by extending the upstream architecture with two fully connected layers and a softmax output for classification. In this phase, we adopt the gradual unfreezing heuristic [6] for our task. In gradual unfreezing (GU), the layers are not all fine-tuned at the same time; instead, the model is gradually unfrozen starting from the last layer, as it contains the least general knowledge [21]. The last layer is first unfrozen and fine-tuned for one epoch. Subsequently, the next frozen layer is unfrozen and all unfrozen layers are fine-tuned. This is repeated until all layers are unfrozen, after which training continues until convergence.
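The fine-tuning heuristics named in phases 2 and 3 can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, and the constants (a layer-wise decay of 2.6, cut_frac = 0.1, ratio = 32) are the defaults suggested in the ULMFiT paper [6]; whether the present system used exactly these values is an assumption.

```python
import math


def discriminative_lrs(top_lr, n_layers, decay=2.6):
    """Per-layer learning rates, lowest layer first.

    The top layer gets top_lr; each layer below it gets the rate of the
    layer above divided by `decay` (2.6 is the ULMFiT default, assumed here).
    """
    return [top_lr / decay ** (n_layers - 1 - i) for i in range(n_layers)]


def slanted_triangular_lr(t, total_steps, lr_max, cut_frac=0.1, ratio=32):
    """Learning rate at step t: short linear warm-up, long linear decay."""
    cut = max(1, math.floor(total_steps * cut_frac))  # step at which the peak is reached
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio


def gradual_unfreezing_schedule(n_layers):
    """Indices of trainable layers per epoch: last layer first, then one more each epoch."""
    return [list(range(n_layers - 1 - epoch, n_layers)) for epoch in range(n_layers)]


layer_lrs = discriminative_lrs(top_lr=0.004, n_layers=3)
warmup_peak = slanted_triangular_lr(t=100, total_steps=1000, lr_max=0.004)
schedule = gradual_unfreezing_schedule(n_layers=3)  # [[2], [1, 2], [0, 1, 2]]
```

In the sketch, the learning rate starts at lr_max / ratio, peaks at lr_max after the first 10% of steps, and returns to lr_max / ratio at the end of training.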

5 Experiments

In this section, we describe the baselines and present comparisons between the baselines and our proposed approach.

5.1 Baselines

Word embedding is a technique in NLP which maps the words of a language to dense real-valued vectors in a continuous embedding space. Traditional NLP representations such as BoW (Bag of Words) and TF-IDF (Term Frequency-Inverse Document Frequency) are mainly syntactic representations and cannot capture the semantic relationships between words. Word embedding techniques have been gaining popularity in a range of NLP tasks like sentiment analysis [10, 20], named entity recognition [8, 18], question answering [15], etc.

As baselines, we use word embeddings together with demographic features of the author of the happy moment, such as age, country, gender, marital status, parenthood, and happiness duration. We train multiple classifiers using this set of features. Specifically, for the baselines, we use the following pre-trained word and sentence embedding models: GloVe, Concatenated Power Mean, Google Universal Sentence Encoder, fastText, Lexical Vectors and InferSent embeddings.

For word-based embeddings, the embedding of a sentence is computed by tokenizing the sentence into words and averaging the embeddings of all its words. We formulate the problem of identifying the social and agency attributes as text classification tasks. Hence, we use multiple supervised learning algorithms, namely Logistic Regression (LR), Support Vector Machines (SVM), Random Forests (RF), Neural Networks (with two hidden layers), and boosting (XGB), to train the models.
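The averaging step described above can be sketched as follows. The whitespace tokenizer, the toy two-dimensional vectors, and the choice to skip out-of-vocabulary words are assumptions made for illustration; the paper does not specify these details.

```python
def sentence_embedding(sentence, word_vectors, dim):
    """Average the embeddings of the known words in a sentence.

    Tokenization is a plain lowercase whitespace split (an assumption);
    words without a vector are skipped, and an all-zero vector is
    returned if no word is known.
    """
    tokens = sentence.lower().split()
    vectors = [word_vectors[tok] for tok in tokens if tok in word_vectors]
    if not vectors:
        return [0.0] * dim
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy vectors standing in for pre-trained embeddings such as GloVe or fastText.
toy_vectors = {"happy": [1.0, 0.0], "dog": [0.0, 1.0]}
embedding = sentence_embedding("Happy dog", toy_vectors, dim=2)  # [0.5, 0.5]
```

The resulting fixed-length vector, optionally concatenated with the demographic features, is what a classifier such as LR or SVM would consume.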

In the following, we describe the word/sentence embeddings which we use as baselines.

(1) fastText [2] is a skip-gram based word embedding method in which each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and a word is represented as the sum of these representations.
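To make the bag-of-character-n-grams idea concrete, the sketch below extracts n-grams from a word wrapped in boundary markers and sums their vectors. The helper names and the dictionary lookup are illustrative assumptions; the actual fastText library hashes n-grams into a fixed number of buckets and also keeps a vector for the whole word, which is omitted here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


def word_vector(word, ngram_vectors, dim, n_min=3, n_max=6):
    """Sum the vectors of all known n-grams of the word (unknown n-grams are skipped)."""
    total = [0.0] * dim
    for gram in char_ngrams(word, n_min, n_max):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return total


trigrams = char_ngrams("where", n_min=3, n_max=3)
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because vectors are attached to n-grams rather than whole words, a word never seen in training still receives a representation from the n-grams it shares with known words.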

(2) GloVe [13] is an unsupervised learning algorithm for distributed word representation. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. We use the standard 300-dimensional GloVe embeddings (GloVe1) trained on 840B word tokens. As another baseline, we also use 200-dimensional GloVe embeddings trained on a Twitter corpus (GloVe2) containing 27B word tokens.

(3) InferSent [4] is another set of embeddings, trained by Facebook. InferSent is trained on the task of natural language inference: given two sentences, the model is trained to infer whether they are a contradiction, a neutral pairing, or an entailment. The output is an embedding of 4096 dimensions.

(4) Concatenated Power Mean Word Embedding [16] generalizes the concept of average word embeddings to power mean word embeddings. The concatenation of different types of power mean word embeddings considerably closes the gap to state-of-the-art methods monolingually and substantially outperforms many complex techniques cross-lingually.
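A minimal sketch of the concatenated power mean idea follows. It covers only three power values, p = -inf (element-wise minimum), p = 1 (arithmetic mean), and p = +inf (element-wise maximum); this choice, the function names, and the toy vectors are assumptions for illustration and do not reproduce the pre-trained embeddings used as the baseline.

```python
NEG_INF = float("-inf")
POS_INF = float("inf")


def power_mean(vectors, p):
    """Element-wise power mean over word vectors for p in {-inf, 1, +inf}."""
    dim = len(vectors[0])
    columns = [[vec[i] for vec in vectors] for i in range(dim)]
    if p == POS_INF:
        return [max(col) for col in columns]
    if p == NEG_INF:
        return [min(col) for col in columns]
    if p == 1:
        return [sum(col) / len(col) for col in columns]
    raise ValueError("this sketch only supports p in {-inf, 1, +inf}")


def concatenated_power_means(vectors, powers=(NEG_INF, 1, POS_INF)):
    """Concatenate the power means, giving a vector of size dim * len(powers)."""
    result = []
    for p in powers:
        result.extend(power_mean(vectors, p))
    return result


words = [[1.0, -2.0], [3.0, 4.0]]
sentence_vector = concatenated_power_means(words)
# [1.0, -2.0, 2.0, 1.0, 3.0, 4.0]
```

With p = 1 alone this reduces to the plain averaging used for the other word-based baselines; the extra means add information about the range of each dimension.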

(5) Lexical Vectors [17] is another word embedding method, similar to fastText but with a slightly modified objective. As noted above, fastText [2] incorporates character n-grams into the skip-gram model of Word2Vec and thus considers subword information.

(6) The Universal Sentence Encoder [3] encodes text into high-dimensional vectors. The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable-length English text and the output is a 512-dimensional vector.

For each of the embeddings in the above list, we train models using different supervised learning algorithms. We use the scikit-learn implementations of these algorithms with the standard default parameters without any hyper-parameter tuning.

5.2 Hyper-parameter Settings

As suggested in [6], we use the AWD-LSTM language model with three layers, 1150 hidden activations per layer and an embedding size of 400. The hidden layer of the classifier is of size 50. A batch size of 30 is used to train the model. LM fine-tuning and classifier fine-tuning are done with base learning rates of 0.004 and 0.01, respectively. We build separate models for the 'agency' and 'social' classification tasks.
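For reference, the settings listed above can be collected into a single configuration. The values are taken from the text; the key names and the per-task copy are hypothetical conveniences and do not correspond to any particular library's API.

```python
# Hyper-parameters as reported in the text; key names are illustrative only.
HYPERPARAMS = {
    "lstm_layers": 3,
    "hidden_units_per_layer": 1150,
    "embedding_size": 400,
    "classifier_hidden_size": 50,
    "batch_size": 30,
    "lm_finetune_base_lr": 0.004,
    "classifier_finetune_base_lr": 0.01,
}

# Separate models are built per task, so each task gets its own copy.
TASKS = ("agency", "social")
configs = {task: dict(HYPERPARAMS) for task in TASKS}
```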

5.3 Results and Analysis

6 Conclusion

In this work, we showed that inductive transfer learning by fine-tuning language models gives robust performance in detecting agency and social characteristics. We also showed that using unlabeled data for LM fine-tuning in our second stage improved performance under 10-fold cross validation for both tasks.

We plan to perform the given text classification using other pre-trained embeddings like ELMo (Embeddings from Language Models) [14], Skip-Thought Vectors [7], Quick-Thoughts [9] and multi-task learning based sentence representations [19], and to investigate whether these embeddings can improve the classification accuracy. We would also like to experiment with other semi-supervised techniques to improve the classification accuracy.

References

1. Asai, A., Evensen, S., Golshan, B., Halevy, A., Li, V., Lopatenko, A., Stepanov, D., Suhara, Y., Tan, W.C., Xu, Y.: HappyDB: A corpus of 100,000 crowdsourced happy moments. In: Proceedings of LREC 2018. European Language Resources Association (ELRA), Miyazaki, Japan (May 2018)

2. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606 (2016)

3. Cer, D., Yang, Y., Kong, S.y., Hua, N., Limtiaco, N., John, R.S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., et al.: Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018)

4. Conneau, A., Kiela, D., Schwenk, H., Barrault, L., Bordes, A.: Supervised learning of universal sentence representations from natural language inference data. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 670–680. Association for Computational Linguistics, Copenhagen, Denmark (September 2017), https://www.aclweb.org/anthology/D17-1070

5. Dai, A.M., Le, Q.V.: Semi-supervised sequence learning. In: Advances in Neural Information Processing Systems. pp. 3079–3087 (2015)

6. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). vol. 1, pp. 328–339 (2018)

7. Kiros, R., Zhu, Y., Salakhutdinov, R.R., Zemel, R., Urtasun, R., Torralba, A., Fidler, S.: Skip-thought vectors. In: Advances in Neural Information Processing Systems. pp. 3294–3302 (2015)

8. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360 (2016)

9. Logeswaran, L., Lee, H.: An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893 (2018)

10. Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. pp. 142–150. Association for Computational Linguistics (2011)

11. Merity, S., Keskar, N.S., Socher, R.: Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182 (2017)

12. Mou, L., Meng, Z., Yan, R., Li, G., Xu, Y., Zhang, L., Jin, Z.: How transferable are neural networks in NLP applications? arXiv preprint arXiv:1603.06111 (2016)

13. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)

14. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)

15. Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question answering. In: Advances in Neural Information Processing Systems. pp. 2953–2961 (2015)

16. Rücklé, A., Eger, S., Peyrard, M., Gurevych, I.: Concatenated p-mean word embeddings as universal cross-lingual sentence representations. arXiv preprint arXiv:1803.01400 (2018)

17. Salle, A., Villavicencio, A.: Incorporating subword information into matrix factorization word embeddings. arXiv preprint arXiv:1805.03710 (2018)

18. Santos, C.N.d., Guimaraes, V.: Boosting named entity recognition with neural character embeddings. arXiv preprint arXiv:1505.05008 (2015)

19. Subramanian, S., Trischler, A., Bengio, Y., Pal, C.J.: Learning general purpose distributed sentence representations via large scale multi-task learning. arXiv preprint arXiv:1804.00079 (2018)

20. Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., Qin, B.: Learning sentiment-specific word embedding for Twitter sentiment classification. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). vol. 1, pp. 1555–1565 (2014)

21. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? In: Advances in Neural Information Processing Systems. pp. 3320–3328 (2014)