=Paper=
{{Paper
|id=Vol-2614/session3_paper2
|storemode=property
|title=BERT-based Ensembles for Modeling Disclosure and Support in Conversational Social Media Text
|pdfUrl=https://ceur-ws.org/Vol-2614/AffCon20_session3_bertbased.pdf
|volume=Vol-2614
|authors=Kartikey Pant,Tanvi Dadu,Radhika Mamidi
|dblpUrl=https://dblp.org/rec/conf/aaai/PantDM20
}}
==BERT-based Ensembles for Modeling Disclosure and Support in Conversational Social Media Text==
Kartikey Pant (1), Tanvi Dadu (2), and Radhika Mamidi (1)

(1) International Institute of Information Technology, Hyderabad — kartikey.pant@research.iiit.ac.in, radhika.mamidi@iiit.ac.in
(2) Netaji Subhas Institute of Technology, Delhi — tanvid.co.16@nsit.net.in

The first two authors contributed equally to the work.

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: N. Chhaya, K. Jaidka, J. Healey, L. H. Ungar, A. Sinha (eds.): Proceedings of the 3rd Workshop of Affective Content Analysis, New York, USA, 07-FEB-2020, published at http://ceur-ws.org

Abstract. There is a growing interest in understanding how humans initiate and hold conversations. The affective understanding of conversations focuses on the problem of how speakers use emotions to react to a situation and to each other. In the CL-Aff Shared Task, the organizers released the Get it #OffMyChest dataset, which contains Reddit comments from casual and confessional conversations, labeled for their disclosure and supportiveness characteristics. In this paper, we introduce a predictive ensemble model exploiting the finetuned contextualized word embeddings RoBERTa and ALBERT. We show that our model outperforms the base models on all considered metrics, achieving an improvement of 3% in the F1 score. We further conduct a statistical analysis and outline deeper insights into the given dataset while providing a new characterization of impact for the dataset.

Keywords: emotion recognition · sentiment analysis · natural language processing · social media analysis

1 Introduction

The word 'Affective' refers to emotions, mood, sentiment, personality, subjective evaluations, opinions, and attitude. Affect analysis refers to the techniques used to identify and measure the 'experience of emotion' in multimodal content containing text, audio, images, and videos [6]. Affect is an essential part of the human experience and directly influences how people react to a particular situation. It is therefore crucial to analyze how speakers use emotions and sentiment to react to different situations and to each other.

This paper addresses the challenge put forward in the CL-Aff Shared Task at the AAAI-2020 Workshop on Affective Content Analysis, Modeling Affect in Response (AffCon 2020). The theme of this task is to study affect in response to interactive content that grows over time. The task offers two datasets (a small labeled dataset and a large unlabeled dataset) sampled from casual and confessional conversations on Reddit in the subreddits /r/CasualConversations and /r/OffMyChest. The shared task comprises two subtasks. The first subtask is a semi-supervised text classification task predicting Disclosure and Supportiveness labels from the two given datasets, whereas the second subtask is open-ended, asking authors to propose new characterizations and insights to capture conversation dynamics.

Recent works in text classification have used pre-trained contextualized word representations rather than context-independent word representations. These representations include BERT [3], RoBERTa [14], and ALBERT [15], which produce contextualized word representations and are pre-trained using bidirectional Transformers [10].
These BERT-based pre-trained models have outperformed many existing techniques on most NLP tasks with minimal task-specific architectural changes. Ensemble models exploiting features learned from multiple pre-trained models are hypothesized to perform competitively. In this work, we propose an ensemble-based model exploiting pre-trained BERT-based word representations. We document the experimental results of our proposed model on the CL-Aff Shared Task in comparison to the baseline models. We further perform attribute-based statistical analysis using attributes such as word count, day of the week, and comments per parent post. We conclude the paper by proposing impact as a new characterization to model conversation dynamics.

2 Our Model

In this section, we introduce our predictive model, which uses transfer learning in the form of pre-trained BERT-based models. We propose an ensemble of two pre-trained models, RoBERTa and ALBERT. We first outline the pre-trained models incorporated and then discuss the ensembling technique used.

2.1 Preliminaries

Transfer learning is the process of extracting knowledge from a source problem or domain and applying it to a different target problem or domain. Recent works on text classification use transfer learning in the form of pre-trained embeddings [13,14,15]. These pre-trained embeddings have outperformed many existing techniques with minimal architectural changes. The use of pre-trained embeddings reduces the need for annotated data and allows one to perform the downstream task with minimal resources for finetuning the model.

Devlin et al. [3] introduced BERT, a contextualized word representation pre-trained using a bidirectional Transformer-based encoder. These embeddings are trained with a linear combination of the masked language modeling and next sentence prediction objectives, on 3.3B words from various sources including BooksCorpus [16] and the English Wikipedia.

Liu et al. introduced RoBERTa, a replication study of BERT with carefully tuned hyperparameters and more extensive training data [14]. It is trained with a batch size eight times larger for half as many optimization steps, thus taking significantly less time to train. It is also trained on more than twelve times the data used to train BERT-large, drawing on the OpenWebText [4], CC-News [5], and STORIES [9] datasets. These optimizations lead the RoBERTa-large pre-trained model to perform better than BERT-large on all benchmarking tests, including SQuAD [7] and GLUE [11].

Lan et al. introduced ALBERT, a BERT-based model with two parameter-reduction techniques: factorized embedding parameterization and cross-layer parameter sharing [15]. These techniques lower memory consumption and increase training speed. Moreover, the model uses a self-supervised loss that focuses on modeling inter-sentence coherence and improves performance on downstream tasks with multi-sentence input. ALBERT-xxlarge-v2 achieves significant improvements over BERT-large on multiple tasks.

2.2 Our Approach

Ensemble methods construct a predictive model by integrating multiple models to improve prediction performance. They are meta-algorithms that combine several machine learning and deep learning classifiers into one predictive model in order to decrease variance and bias and to improve predictions. Recent works show that ensemble-based classifiers utilizing contextual embeddings outperform single-model classifiers [14,15]. Hence, we use ensembling techniques to combine the predictions from multiple models for the given task.

Figure 1 depicts our proposed ensemble model. In this model, a sentence is processed in parallel by RoBERTa and ALBERT, each finetuned for predicting a given label. The results from these base models are then combined using a weighted-average ensembling technique to predict the final label set, which includes predictions for the six labels. A minimal sketch of this weighted-average combination is given after the figure caption below.

Fig. 1. Architecture of our ensemble models, which predict the label set denoting support and disclosure from the comment text.
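The following is a minimal sketch of the weighted-average combination, assuming per-label positive-class probabilities already obtained from the finetuned RoBERTa and ALBERT classifiers; the function name, weights dictionary, and toy scores are illustrative, not part of the authors' released code.

```python
# A minimal, hypothetical sketch of weighted-average ensembling of two
# finetuned classifiers (RoBERTa and ALBERT) for one binary label.
# `roberta_probs` and `albert_probs` are assumed to hold the predicted
# probability of the positive class for each comment.
import numpy as np

def ensemble_predict(roberta_probs, albert_probs, w_roberta, w_albert, threshold=0.5):
    """Combine two models' positive-class probabilities with a weighted average."""
    roberta_probs = np.asarray(roberta_probs, dtype=float)
    albert_probs = np.asarray(albert_probs, dtype=float)
    combined = w_roberta * roberta_probs + w_albert * albert_probs
    return (combined >= threshold).astype(int)

# Label-specific weight pairs in the style of Table 3 (RoBERTa weight, ALBERT weight).
label_weights = {
    "Informational Disclosure": (0.1, 0.9),
    "Support": (1.0, 0.0),
}

if __name__ == "__main__":
    roberta_probs = [0.72, 0.31, 0.55]   # toy scores for three comments
    albert_probs = [0.64, 0.40, 0.48]
    w_r, w_a = label_weights["Informational Disclosure"]
    print(ensemble_predict(roberta_probs, albert_probs, w_r, w_a))
```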
3 Experiments and Results

In this section, we outline the experimental setup, the baselines for the task, and a comparative analysis of our proposed ensemble model against the two base models finetuned for the task, RoBERTa-large and ALBERT-xxlarge-v2. We further compare our ensemble model with four other ensemble models and show that our model performs best among all the models on four out of five evaluation metrics using 10-fold cross-validation.

For our baselines, we finetune the RoBERTa-large and ALBERT-xxlarge-v2 models for three epochs with a maximum sequence length of 50 and a batch size of 16, predicting each label separately. We finetune the models with a learning rate of 2e-5, a weight decay of 0.01, and 20 warm-up steps. We evaluate all the models on the following metrics: Accuracy, F1, Precision-1, Recall-1, and the mean of Accuracy and F1, denoted Acc&F1 from here on. A sketch of this finetuning configuration follows.
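A minimal sketch of the finetuning setup, under the assumption that the HuggingFace transformers Trainer API is used (the paper does not state its implementation framework); the dataset wrapper and toy examples below are hypothetical.

```python
# A minimal, hypothetical finetuning sketch for one binary label using the
# HuggingFace transformers Trainer API; the real training data comes from the
# labeled Get it #OffMyChest split, the toy examples here are illustrative.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "roberta-large"  # the ALBERT baseline would use "albert-xxlarge-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

class CommentDataset(Dataset):
    """Wraps comment texts and 0/1 annotations for a single target label."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding="max_length",
                                   max_length=50)  # maximum sequence length of 50
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_ds = CommentDataset(["I finally told my family the truth.",
                           "You did the right thing, be proud of yourself."], [1, 0])

args = TrainingArguments(
    output_dir="offmychest-roberta",
    num_train_epochs=3,               # three epochs
    per_device_train_batch_size=16,   # batch size of 16
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=20,
)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```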
Model/Metrics     | Accuracy | Precision-1 | Recall-1 | F1    | Acc&F1
RoBERTa-large     | 84.86%   | 0.585       | 0.514    | 0.541 | 0.695
ALBERT-xxlarge-v2 | 84.90%   | 0.596       | 0.472    | 0.524 | 0.686
Our Model         | 85.55%   | 0.623       | 0.515    | 0.558 | 0.707

Table 1. Label-averaged values of each metric for RoBERTa, ALBERT, and our best-performing ensemble model.

From Table 1, we can discern that our ensemble-based model achieves the best results when compared with the base models, RoBERTa and ALBERT. We observe a significant increase in Accuracy, Precision-1, and F1 and a slight increase in Recall-1 and Acc&F1 in our best-performing ensemble model as compared to the base models.

Label/Metrics            | Accuracy | Precision-1 | Recall-1 | F1    | Acc&F1
Informational Disclosure | 74.12%   | 0.710       | 0.551    | 0.620 | 0.681
Emotional Disclosure     | 74.20%   | 0.636       | 0.510    | 0.566 | 0.654
Support                  | 84.38%   | 0.685       | 0.724    | 0.704 | 0.774
General Support          | 95.42%   | 0.483       | 0.241    | 0.322 | 0.638
Informational Support    | 91.30%   | 0.592       | 0.485    | 0.533 | 0.723
Emotional Support        | 93.86%   | 0.632       | 0.577    | 0.603 | 0.771

Table 2. Label-wise values of each metric for our best-performing ensemble model.

Table 2 further shows the performance of our ensemble-based model on the individual labels, evaluated using the same metrics.

Labels/Model             | Model 1  | Model 2  | Model 3  | Model 4  | Model 5
Informational Disclosure | 0.0, 1.0 | 0.5, 0.5 | 0.0, 1.0 | 0.0, 1.0 | 0.1, 0.9
Emotional Disclosure     | 0.0, 1.0 | 0.5, 0.5 | 0.5, 0.5 | 0.5, 0.5 | 0.5, 0.5
Support                  | 1.0, 0.0 | 0.5, 0.5 | 1.0, 0.0 | 1.0, 0.0 | 1.0, 0.0
General Support          | 0.0, 1.0 | 0.5, 0.5 | 0.5, 0.5 | 0.6, 0.4 | 0.6, 0.4
Informational Support    | 1.0, 0.0 | 0.5, 0.5 | 1.0, 0.0 | 1.0, 0.0 | 1.0, 0.0
Emotional Support        | 1.0, 0.0 | 0.5, 0.5 | 0.5, 0.5 | 0.5, 0.5 | 0.5, 0.5

Table 3. Weights assigned to each model in the different ensemble models. Each cell contains a pair (x, y), where x denotes the weight assigned to RoBERTa and y the weight assigned to ALBERT.

We further performed a comparative study of ensembling techniques by choosing different weights for RoBERTa and ALBERT, as given in Table 3, which shows the combinations of weights assigned to the two models for each label. This gives rise to five different models, which are then compared using the above metrics.

Table 4 reports the results of this comparative study of the five ensemble models. We discern that Model 5 performs best on the Accuracy, Precision-1, F1, and Acc&F1 metrics, while Model 1 performs best on the Recall-1 metric. Since Model 5 outperforms all other models on four out of five metrics, it is the best predictive model for the task and is referred to as Our Model in this paper.

Model/Metrics | Accuracy | Precision-1 | Recall-1 | F1    | Acc&F1
Model 1       | 85.18%   | 0.595       | 0.516    | 0.547 | 0.699
Model 2       | 85.42%   | 0.622       | 0.490    | 0.544 | 0.699
Model 3       | 85.47%   | 0.619       | 0.514    | 0.557 | 0.706
Model 4       | 85.48%   | 0.622       | 0.480    | 0.557 | 0.706
Model 5       | 85.54%   | 0.623       | 0.515    | 0.558 | 0.707

Table 4. Label-averaged values of each metric for the different ensemble models.

For the shared task, System Run 1 to System Run 5 are the predictions generated by Model 1 to Model 5, respectively. System Run 6 and System Run 7 are the predictions generated by the finetuned RoBERTa-large and ALBERT-xxlarge-v2 models, respectively.

4 Dataset

In this section, we provide a comprehensive statistical analysis of the Get it #OffMyChest dataset, which comprises comments and parent posts from the subreddits /r/CasualConversations and /r/OffMyChest. We further propose new characterizations and outline semantic features for the given dataset.

4.1 Analysis

Statistical analysis of the labels Emotional Disclosure, Informational Disclosure, Support, General Support, Informational Support, and Emotional Support shows significant variation in the number of positive and negative instances. The percentage of positive labels is highest for Informational Disclosure, at 37.99%, and lowest for General Support, at 5.37%. The given dataset is therefore highly imbalanced, which makes training predictive models a strenuous task.

Further analysis of the labeled dataset shows that there are 3,511 parent posts for 11,573 comments. We observe an average of 3.29 comments per parent post, ranging from 1 to 52 comments per parent post. The dataset contains 6,999 unique users, with an average of 1.653 comments per user and significant variation in the number of comments per user, ranging from 1 to 159 with a standard deviation of 2.669. From this, we conclude that multiple comments within the same parent post and by the same author may be related to each other.

We also observe significant variation in the word count of the comments: the average comment is 14.7 words long, which translates to roughly one sentence [2]. However, comment length varies from 3 to 151 words, with a standard deviation of 9.670. The dataset is thus well-rounded and represents a realistic discourse setting in which participants exchange comments of varying lengths. A sketch of how such descriptive statistics can be computed is given below.
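A minimal sketch of the descriptive statistics above, assuming the labeled data is loaded into a pandas DataFrame; the column names (`parent_id`, `author`, `full_text`) and the toy rows are hypothetical and may differ from the released CSV.

```python
# A minimal, hypothetical sketch of the dataset statistics reported above,
# assuming columns `parent_id`, `author`, and `full_text` (names may differ).
import pandas as pd

def describe_dataset(df: pd.DataFrame) -> None:
    comments_per_post = df.groupby("parent_id").size()
    comments_per_user = df.groupby("author").size()
    word_counts = df["full_text"].str.split().str.len()

    print(f"parent posts:      {df['parent_id'].nunique()}")
    print(f"comments:          {len(df)}")
    print(f"comments per post: mean={comments_per_post.mean():.2f}, "
          f"min={comments_per_post.min()}, max={comments_per_post.max()}")
    print(f"comments per user: mean={comments_per_user.mean():.3f}, "
          f"std={comments_per_user.std():.3f}")
    print(f"words per comment: mean={word_counts.mean():.1f}, "
          f"std={word_counts.std():.3f}, min={word_counts.min()}, max={word_counts.max()}")

if __name__ == "__main__":
    toy = pd.DataFrame({
        "parent_id": ["p1", "p1", "p2"],
        "author": ["u1", "u2", "u1"],
        "full_text": ["I need to get this off my chest.",
                      "That sounds rough, stay strong.",
                      "Thanks for listening."],
    })
    describe_dataset(toy)
```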
We intuitively proceeded to examine the effect of the day of the week on the characterized labels representing disclosure and support in a comment. It was expected that users would behave differently as the week progresses. However, as illustrated in Table 5, we do not see any significant variation in the existing characterizations with a change in the day of the week. We therefore conclude that, in this dataset, the day of the week does not lead users to be either more supportive or to disclose more information.

Weekday/Label | Emotional Disclosure | Informational Disclosure | Support | General Support | Informational Support | Emotional Support
Monday    | 29.73% | 38.55% | 25.70% | 6.01% | 9.49%  | 7.89%
Tuesday   | 29.71% | 38.23% | 24.80% | 5.43% | 9.66%  | 7.26%
Wednesday | 30.70% | 38.35% | 26.80% | 5.93% | 11.68% | 7.79%
Thursday  | 29.40% | 37.27% | 24.95% | 5.08% | 9.78%  | 7.75%
Friday    | 31.14% | 35.53% | 24.67% | 5.27% | 8.41%  | 8.35%
Saturday  | 30.59% | 38.95% | 22.04% | 4.34% | 8.22%  | 6.32%
Sunday    | 31.85% | 38.92% | 25.93% | 5.41% | 10.31% | 9.06%
Overall   | 30.44% | 37.99% | 25.02% | 5.37% | 9.66%  | 7.79%

Table 5. Weekday-wise label distribution of the labeled dataset.

4.2 Impact Prediction

The score assigned to a comment quantifies its Impact since, on Reddit, the score is the difference between the upvotes and downvotes the comment obtains. We observed the posts to have a moderately positive Impact of 10.938 on average. The breadth of the Impact spectrum is also captured well by the dataset, with a standard deviation of 57.198 and a range of -49 to 2,637. This motivates the need to characterize and predict the Impact of a post.

Labels                   | ρ with Impact
Emotional Disclosure     | 0.046
Informational Disclosure | 0.024
Support                  | 0.021
General Support          | 0.028
Informational Support    | 0.005
Emotional Support        | 0.019

Table 6. The relationship between the labels and Impact, as represented by the Pearson correlation coefficient ρ.

Upon performing a correlation study between Impact and the previously characterized labels using Pearson's correlation coefficient [8], we observe only a very small positive correlation between the variables. As illustrated in Table 6, the maximum ρ of 0.046, between Impact and Emotional Disclosure, indicates that Impact is characteristically distinct from the previously predicted labels.

We further analyze the influence of Impact, characterized by the score, on the semantic structure of the comments. We perform a correlation study between Impact and selected semantic features, as explored previously in Yang et al. [12]. Semantic structure is captured by the following features (a small sketch of the Sense Combination feature and of the correlation study follows the list):

1. Positive words: the number of occurrences of positive words in a comment.
2. Negative words: the number of occurrences of negative words in a comment.
3. Positive Polarity Confidence: the probability that a sentence is positive. This metric is used to capture the polarity of comments and is calculated using fastText [1].
4. Subjective words: the number of occurrences of subjectivity-oriented words in a comment, used to capture the linguistic expression of people's opinions, beliefs, and speculations.
5. Sense Combination: computed as log(∏_{i=1}^{k} n_{w_i}), where n_{w_i} is the total number of senses of word w_i.
6. Sense Farmost: the largest Path Similarity of any word sense in a sentence.
7. Sense Closest: the smallest Path Similarity of any word sense in a sentence.
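A minimal sketch of the Sense Combination feature and of a Pearson correlation against Impact, assuming NLTK's WordNet interface for sense counts and scipy for the correlation; the whitespace tokenization and toy data are illustrative, not the authors' exact feature-extraction pipeline.

```python
# A minimal, hypothetical sketch of the Sense Combination feature
# (log of the product of WordNet sense counts of the words in a comment)
# and of a Pearson correlation of that feature against Impact.
import math

import nltk
from nltk.corpus import wordnet as wn
from scipy.stats import pearsonr

nltk.download("wordnet", quiet=True)

def sense_combination(comment: str) -> float:
    """log(prod_i n_{w_i}); words with no WordNet senses are skipped here."""
    log_product = 0.0
    for word in comment.lower().split():
        n_senses = len(wn.synsets(word))
        if n_senses > 0:
            log_product += math.log(n_senses)   # sum of logs = log of product
    return log_product

if __name__ == "__main__":
    comments = ["I finally told my parents everything",
                "good luck with the move",
                "that is a really hard situation"]
    impact = [12, 3, 41]                         # toy scores (upvotes - downvotes)
    features = [sense_combination(c) for c in comments]
    rho, _ = pearsonr(features, impact)
    print(f"Sense Combination values: {features}")
    print(f"Pearson rho with Impact:  {rho:.3f}")
```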
Semantic Features            | ρ with Impact
Positive words               | -0.009
Negative words               | -0.018
Subjective words             | -0.018
Sense Combination            | -0.019
Sense Farmost                | -0.015
Sense Closest                | 0.040
Positive Polarity Confidence | -0.012

Table 7. The relationship between the semantic features and Impact, as represented by the Pearson correlation coefficient ρ.

From Table 7, we observe a minimal correlation between Impact and the selected semantic features. The maximum ρ of 0.040, between Impact and the Sense Closest feature [12], shows that the new characterization is also distinct from the semantic features of the comments. Although predicting Impact would be beneficial for numerous applications, such as finance and product marketing, and would provide insights into social dynamics, it is a hard problem that depends on many factors. Our attempt to capture relationships between Impact and the selected semantic features did not establish a strong correlation, which suggests that more sophisticated architectures would be valuable for the task of Impact prediction.

5 Conclusion

This paper presents a novel BERT-based predictive ensemble model to predict the given labels: Emotional Disclosure, Informational Disclosure, Support, General Support, Informational Support, and Emotional Support. Our model gives competitive results for label prediction on the given Get it #OffMyChest dataset. Analysis of the dataset shows a highly imbalanced distribution of the given labels and high variation in features such as score, word count, comments per parent post, and comments per user. We further discerned that the day of the week has no significant effect on the frequency of Disclosure- and Support-related comments on Reddit. Future work may involve exploring more ensembling techniques and more sophisticated architectures to predict the impact of a comment.

References

1. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information (2016), http://arxiv.org/abs/1607.04606
2. Cutts, M.: Oxford Guide to Plain English (2009)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding (2018), http://arxiv.org/abs/1810.04805
4. Gokaslan, A., Cohen, V.: OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus (2019)
5. Nagel, S.: CC-News (2016), http://web.archive.org/save/http://commoncrawl.org/2016/10/newsdataset-available/
6. Rajendran, A., Zhang, C., Abdul-Mageed, M.: Happy together: Learning and understanding appraisal from natural language. In: AffCon@AAAI (2019)
7. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016)
8. Rodgers, J., Nicewander, W.: Thirteen ways to look at the correlation coefficient. The American Statistician 42(1), 59-66 (1988), http://www.jstor.org/stable/2685263
9. Trinh, T.H., Le, Q.V.: A simple method for commonsense reasoning. CoRR abs/1806.02847 (2018), http://dblp.uni-trier.de/db/journals/corr/corr1806.html#abs-1806-02847
10. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998-6008 (2017)
11. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.: GLUE: A multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. pp. 353-355. Association for Computational Linguistics, Brussels, Belgium (Nov 2018). https://doi.org/10.18653/v1/W18-5446, https://www.aclweb.org/anthology/W18-5446
12. Yang, D., Lavie, A., Dyer, C., Hovy, E.H.: Humor recognition and humor anchor extraction. In: Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., Marton, Y. (eds.) EMNLP. pp. 2367-2376. The Association for Computational Linguistics (2015), http://dblp.uni-trier.de/db/conf/emnlp/emnlp2015.html#YangLDH15
13. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., Le, Q.V.: XLNet: Generalized autoregressive pretraining for language understanding (2019), http://arxiv.org/abs/1906.08237
14. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. In: Submitted to International Conference on Learning Representations (2020), https://openreview.net/forum?id=SyxS0T4tvS, under review
15. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: A lite BERT for self-supervised learning of language representations. In: Submitted to International Conference on Learning Representations (2020), https://openreview.net/forum?id=H1eA7AEtvS, under review
16. Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., Fidler, S.: Aligning books and movies: Towards story-like visual explanations by watching movies and reading books (2015), http://arxiv.org/abs/1506.06724