<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BERT-based Ensembles for Modeling Disclosure and Support in Conversational Social Media Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kartikey Pant</string-name>
          <email>kartikey.pant@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tanvi Dadu</string-name>
          <email>tanvid.co.16@nsit.net.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Radhika Mamidi</string-name>
          <email>radhika.mamidi@iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>International Institute of Information Technology</institution>
          ,
          <addr-line>Hyderabad</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Netaji Subhas Institute of Technology</institution>
          ,
          <addr-line>Delhi</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>There is a growing interest in understanding how humans initiate and hold conversations. The affective understanding of conversations focuses on the problem of how speakers use emotions to react to a situation and to each other. In the CL-Aff Shared Task, the organizers released the Get it #OffMyChest dataset, which contains Reddit comments from casual and confessional conversations, labeled for their disclosure and supportiveness characteristics. In this paper, we introduce a predictive ensemble model exploiting the finetuned contextualized word embeddings, RoBERTa and ALBERT. We show that our model outperforms the base models in all considered metrics, achieving an improvement of 3% in the F1 score. We further conduct statistical analysis and outline deeper insights into the given dataset while providing a new characterization of impact for the dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>emotion recognition</kwd>
        <kwd>sentiment analysis</kwd>
        <kwd>natural language processing</kwd>
        <kwd>social media analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The word `Affective' refers to emotions, mood, sentiment, personality,
subjective evaluations, opinions, and attitude. Affect analysis refers to the techniques
used to identify and measure the `experience of emotion' in multimodal content
containing text, audio, images, and videos.[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] Affect has become an essential
part of the human experience, which directly influences people's reactions towards
a particular situation. Therefore, it has become crucial to analyze how speakers
use emotions and sentiment to react to different situations and to each other.
      </p>
      <p>This paper addresses the challenge put forward in the CL-Aff Shared Task
at the AAAI-2020 Workshop on Affective Content Analysis to Model Affect in
Response (AffCon 2020). The theme of this task is to study affect in response
to interactive content which grows over time. The task offers two datasets (a
small labeled dataset and a large unlabeled dataset) sampled from casual and
confessional conversations on Reddit in the subreddits /r/CasualConversations
and /r/OffMyChest. This shared task comprises two subtasks. The first
subtask is a semi-supervised text classification task predicting Disclosure and
Supportiveness labels based on the given two datasets. The second subtask
is an open-ended task, which requires authors to propose new characterizations
and insights to capture conversation dynamics. (The first two authors
contributed equally to the work.)</p>
      <p>
        Recent works in the task of text classification have used pre-trained
contextualized word representations rather than context-independent word
representations. Some of these representations include BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], RoBERTa[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and
ALBERT [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. These models perform contextualized word representation and are
pre-trained using bidirectional transformers[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. These BERT-based pre-trained
models have outperformed many existing techniques on most NLP tasks with
minimal task-specific architectural changes.
      </p>
      <p>Ensemble models exploiting features learned from multiple pre-trained
models are hypothesized to perform competitively. In this work, we propose an
ensemble-based model exploiting pre-trained BERT-based word representations.
We document the experimental results of our proposed model for the CL-Aff
Shared Task in comparison to the baseline models. We further perform
attribute-based statistical analysis using attributes like word count, day of the week, and
comments per parent post. We conclude the paper by proposing impact as a new
characterization to model conversation dynamics.</p>
    </sec>
    <sec id="sec-2">
      <title>Our Model</title>
      <p>In this section, we introduce our predictive model, which uses transfer learning
in the form of pretrained BERT-based models. We propose an ensemble of two
pre-trained models: RoBERTa and ALBERT. We first outline the
pre-trained models incorporated and then discuss the ensemble technique used.</p>
      <sec id="sec-2-1">
        <title>Preliminaries</title>
        <p>
          Transfer learning is the process of extracting knowledge from a source
problem or domain and applying it to a different target problem or domain. Recent
works on text classification use transfer learning in the form of pre-trained
embeddings.[
          <xref ref-type="bibr" rid="ref13 ref14 ref15">13,14,15</xref>
          ] These pre-trained embeddings have outperformed many
of the existing techniques with minimal architectural changes. The use of
pretrained embeddings reduces the need for annotated data and allows one to
perform the downstream task with minimal resources for finetuning the model.
        </p>
        <p>
          Devlin et al.[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] introduced BERT, a contextualized word representation,
pretrained using a bi-directional Transformer-based encoder. These embeddings use
a linear combination of the masked language modeling and next sentence
prediction objectives. It is pre-trained on 3.3B words from various sources, including
BooksCorpus [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and the English Wikipedia.
        </p>
        <p>
          Liu et al. introduced RoBERTa, a replication study of BERT with
carefully tuned hyperparameters and more extensive training data[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. It is trained
with a batch size eight times larger for half as many optimization steps, thus
taking significantly less time to train in comparison. It is trained on more
than twelve times the data used to train BERT-large, using data from the
OpenWebText [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], CC-News [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], and STORIES [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] datasets. These optimizations lead
the RoBERTa-large pre-trained model to perform better than the BERT-large
model in all benchmarking tests, including SQuAD [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and GLUE [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>
          Lan et al. introduced ALBERT, a BERT-based model with two
parameter-reduction techniques: factorized embedding parameterization and cross-layer
parameter sharing.[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] These techniques help in lowering memory consumption
and increasing training speed. Moreover, this model uses a self-supervised loss
that focuses on modeling inter-sentence coherence and improves on downstream
tasks with multi-sentence input. ALBERT-xxlarge-v2 achieves significant
improvements over BERT-large on multiple tasks.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Our Approach</title>
        <p>
          Ensemble methodology entails constructing a predictive model by integrating
multiple models in order to improve prediction performance. Ensembles are
meta-algorithms that combine several machine learning and deep learning classifiers
into one predictive model to decrease variance and bias and to improve predictions.
Recent works show that ensemble-based classifiers utilizing contextual
embeddings outperform single-model classifiers.[
          <xref ref-type="bibr" rid="ref14 ref15">14,15</xref>
          ] Hence, we use ensembling
techniques to combine predictions from multiple models to make a
prediction for the given task.
        </p>
      <p>Figure 1 depicts our proposed ensemble model. In this model, a sentence is
processed in parallel by RoBERTa and ALBERT, each finetuned for predicting a
given label. The results from these base models are then combined using a
weighted-average based ensembling technique to predict the final label set, which includes
predictions for the six labels.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Results</title>
      <p>In this section, we outline the experimental setup, the baselines for the task, and
a comparative analysis of our proposed ensemble model with the two base
models finetuned for the task, RoBERTa-large and ALBERT-xxlarge-v2. We further
compare our ensemble model with four other ensemble models and show that
our model performs the best among all the models in four out of five evaluation
metrics using 10-fold cross-validation.</p>
      <p>For our baselines, we finetune RoBERTa-large and ALBERT-xxlarge-v2
models for three epochs with a maximum sequence length of 50 and a batch size of
16 for predicting each label separately. We finetune the model with a learning
rate of 2e-5, a weight decay of 0.01, and 20 steps for warm-up. We evaluate
all the models on the following metrics: Accuracy, F1, Precision-1, Recall-1, and
the mean of Accuracy and F1, denoted as Acc&amp;F1 from hereon.</p>
      <p>Model/Metrics     Accuracy Precision-1 Recall-1 F1    Acc&amp;F1
RoBERTa-large     84.86%   0.585       0.514    0.541 0.695
ALBERT-xxlarge-v2 84.90%   0.596       0.472    0.524 0.686
Our Model         85.55%   0.623       0.515    0.558 0.707
Table 1. Label-averaged values for each metric for RoBERTa, ALBERT, and our best
performing ensemble model.</p>
      <p>From Table 1, we can discern that our ensemble-based model achieves the
best results when compared with the base models, RoBERTa and ALBERT. We
observe a significant increase in Accuracy, Precision-1, and F1 and a slight increase
in Recall-1 and Acc&amp;F1 in our best-performing ensemble model as compared to
the base models.</p>
      <p>Label/Metrics            Accuracy Precision-1 Recall-1 F1    Acc&amp;F1
Informational Disclosure 74.12%   0.710       0.551    0.620 0.681
Emotional Disclosure     74.20%   0.636       0.510    0.566 0.654
Support                  84.38%   0.685       0.724    0.704 0.774
General Support          95.42%   0.483       0.241    0.322 0.638
Informational Support    91.30%   0.592       0.485    0.533 0.723
Emotional Support        93.86%   0.632       0.577    0.603 0.771
Table 2. Label-wise values for each metric for our best performing ensemble model.</p>
      <p>Table 2 further shows the performance of our ensemble-based model on
individual labels. Its performance on different labels is evaluated using the above
metrics.</p>
      <p>Labels/Model             Model 1 Model 2 Model 3 Model 4 Model 5
Informational Disclosure 0.0,1.0 0.5,0.5 0.0,1.0 0.0,1.0 0.1,0.9
Emotional Disclosure     0.0,1.0 0.5,0.5 0.5,0.5 0.5,0.5 0.5,0.5
Support                  1.0,0.0 0.5,0.5 1.0,0.0 1.0,0.0 1.0,0.0
General Support          0.0,1.0 0.5,0.5 0.5,0.5 0.6,0.4 0.6,0.4
Informational Support    1.0,0.0 0.5,0.5 1.0,0.0 1.0,0.0 1.0,0.0
Emotional Support        1.0,0.0 0.5,0.5 0.5,0.5 0.5,0.5 0.5,0.5
Table 3. Weights assigned to each model in different ensemble models. Each cell
contains a pair (x, y), where x denotes the weight assigned to RoBERTa and y denotes
the weight assigned to ALBERT.</p>
      <p>We further performed a comparative study on ensembling techniques by
choosing different weights for RoBERTa and ALBERT, as given in Table 3.
It shows different combinations of weights assigned for each label to RoBERTa
and ALBERT, respectively. This gives rise to five different models, which are
then compared using the above metrics.</p>
      <p>Table 4 depicts the results of the comparative study conducted on the five
different ensemble models. We discern that Model 5 performs the best on the
Accuracy, Precision-1, F1, and Acc&amp;F1 metrics, and Model 1 performs the best on the
Recall-1 metric among all the compared models. Since Model 5 outperforms all
other models in four out of five metrics, it is the best predictive model for the
task and is referred to as Our Model in the paper.</p>
      <p>For the shared task, our System Run 1 to System Run 5 are predictions
generated by Model 1 to Model 5, respectively. System Run 6 and System Run 7
are the predictions generated by the finetuned RoBERTa-large and
ALBERT-xxlarge-v2 models, respectively.</p>
      <p>Model/Metrics Accuracy Precision-1 Recall-1 F1    Acc&amp;F1
Model 1       85.18%   0.595       0.516    0.547 0.699
Model 2       85.42%   0.622       0.490    0.544 0.699
Model 3       85.47%   0.619       0.514    0.557 0.706
Model 4       85.48%   0.622       0.480    0.557 0.706
Model 5       85.54%   0.623       0.515    0.558 0.707
Table 4. Label-averaged values for each metric for different ensemble models.</p>
    </sec>
    <sec id="sec-4">
      <title>Dataset</title>
      <p>In this section, we provide a comprehensive statistical analysis of the
Get it #OffMyChest dataset, which comprises comments and parent posts from the
subreddits /r/CasualConversations and /r/OffMyChest. We further propose new
characterizations and outline semantic features for the given dataset.</p>
      <sec id="sec-4-1">
        <title>Analysis</title>
        <p>Statistical analysis of the labels Emotional Disclosure, Informational
Disclosure, Support, General Support, Informational Support, and Emotional Support
shows significant variation in the numbers of positive and negative labels. The
percentage of positive labels is maximum for Informational Disclosure with 37.99%
and minimum for General Support with 5.37%. Therefore, the given dataset is
highly imbalanced, which makes the training of predictive models a strenuous
task.</p>
        <p>Further analysis of the labeled dataset shows that there are 3,511 parent
posts for 11,573 comments. We observe an average of 3.29 comments per parent
post, ranging from one comment per parent post to 52 comments per parent
post. In the given dataset, there are 6,999 unique users with an average of 1.653
comments per user and a significant variation in the number of comments per
user, ranging from 1 to 159 comments per user, with a standard deviation of
2.669. From this, we conclude that multiple comments within the same parent
post and by the same author may be related to each other.</p>
        <p>
          We also observe significant variations in the word count of the comments,
with an average comment being 14.7 words long, which translates to around one
sentence[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. However, the comment length varies significantly, from 3 words to 151
words per comment, with the distribution having a standard deviation of 9.670.
The dataset is thus well-rounded and represents a realistic discourse setting with
participants exchanging comments of varying lengths.
        </p>
        <p>We intuitively proceeded to predict the effect of the day of the week on
the characterized labels representing disclosure and support in a comment. It
was expected that the users would behave differently as the week progresses.
However, as illustrated in Table 5 (label frequencies for Emotional Disclosure,
Informational Disclosure, Support, General Support, Informational Support, and
Emotional Support, broken down by day of the week; values omitted here), we do
not see any significant variation in the existing characterizations with a change
in the day of the week. Thus, we conclude that, in this dataset, the day of the
week does not lead users to be either more supportive or to disclose more
information.</p>
        <p>The score assigned to a comment quantifies its Impact since, on Reddit, it is the
difference between the upvotes and downvotes that it obtains. We observed the
posts to have a moderately positive Impact of 10.938 on average. We also see
that the breadth of the spectrum in the Impact is captured well by the dataset,
with a standard deviation of 57.198, and a range of 49 to 2,637. This paves
the way for a need to characterize and predict the Impact of a post.</p>
        <p>
          Upon performing a correlation study between Impact and the previously
characterized labels using Pearson's correlation coefficient [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], we observe a very
small positive correlation between the variables. As illustrated in Table 6, the
maximum of 0.046, between Impact and Emotional Disclosure, indicates that
Impact is characteristically distinct from the previously predicted labels.
        </p>
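        <p>Pearson's correlation coefficient used in this study can be computed directly; the sketch below is self-contained, and the toy vectors are illustrative rather than taken from the dataset.
```python
# Minimal Pearson correlation, as used for the Impact-vs-label study.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

impact = [3, -1, 12, 0]              # score = upvotes - downvotes (toy values)
emotional_disclosure = [1, 0, 1, 0]  # binary label (toy values)
r = pearson(impact, emotional_disclosure)
```
</p>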
        <p>
          We further analyze the influence of Impact, characterized by the score, on the
semantic structure of the comments. We perform a correlation study between
Impact and selected semantic features, as explored previously in Yang et al.[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
Semantic structure is captured by the following features:
1. Positive words: The number of occurrences of positive words in a comment.
2. Negative words: The number of occurrences of negative words in a comment.
3. Positive Polarity Confidence: The probability that a sentence is positive.
        </p>
        <p>
          This metric is used to capture the polarity of comments and is calculated
using fastText[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
4. Subjective words: The number of occurrences of subjectivity-oriented
words in a comment. It is used to capture the linguistic expression of people's
opinions, beliefs, and speculations.
5. Sense Combination: It is computed as log(∏_{i=1}^{k} n_{w_i}), where n_{w_i} is the
total number of senses of word w_i.
6. Sense Farmost: The largest Path Similarity of any word sense in a sentence.
7. Sense Closest: The smallest Path Similarity of any word sense in a sentence.
        </p>
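        <p>A few of these features can be illustrated with small stand-in resources. In the sketch below, the positive/negative lexicons and the per-word sense counts are hypothetical placeholders for real resources such as a sentiment lexicon and WordNet; Positive Polarity Confidence, Sense Farmost, and Sense Closest would additionally require a trained polarity classifier and a word-sense similarity measure, so they are omitted.
```python
# Hedged sketch of the lexicon- and sense-based features listed above.
# POSITIVE, NEGATIVE, and SENSES are toy stand-ins for real resources.
from math import log

POSITIVE = {"good", "great", "happy"}
NEGATIVE = {"bad", "sad", "awful"}
SENSES = {"good": 21, "day": 10, "bad": 14}  # illustrative sense counts

def semantic_features(comment):
    words = comment.lower().split()
    return {
        "positive_words": sum(w in POSITIVE for w in words),
        "negative_words": sum(w in NEGATIVE for w in words),
        # Sense Combination: log of the product of per-word sense counts,
        # accumulated as a sum of logs over words with a known count.
        "sense_combination": sum(log(SENSES[w]) for w in words if w in SENSES),
    }

feats = semantic_features("good day")
```
</p>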
        <p>
          From Table 7, we observe a minimal correlation between Impact and the
selected semantic features. The maximum of 0.040, between Impact and the
feature Sense Closest [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] indicates that the new characterization is distinct from the
semantic features of the comment.
        </p>
        <p>Although predicting Impact would be beneficial for
numerous applications, such as finance and product marketing, and would provide insights into
social dynamics, it is a hard problem that depends on various factors. Our attempt
to capture relationships between Impact and selected semantic features did
not establish a strong correlation between them. This implies
that the use of sophisticated architectures for the task of Impact prediction would
be valuable.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>This paper presents a novel BERT-based predictive ensemble model to predict the
given labels: Emotional Disclosure, Informational Disclosure, Support, General
Support, Informational Support, and Emotional Support. Our model gives
competitive results for label prediction on the given dataset, Get it #OffMyChest.
Analysis of the dataset shows the highly imbalanced distribution of the given labels
and high variation in features like score, word count, comments per parent
post, and comments per user. We further discerned that the day of the week has no
significant impact on the frequency of Disclosure- and Support-based comments
on Reddit. Future work may involve exploring more ensembling techniques and
sophisticated architectures to predict the impact of a comment.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information (</article-title>
          <year>2016</year>
          ), http://arxiv.org/abs/1607.04606
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cutts</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          : Oxford Guide to Plain English (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding (</article-title>
          <year>2018</year>
          ), http://arxiv.org/abs/1810.04805
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gokaslan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Openwebtext corpus</article-title>
          . http://Skylion007.github.io/OpenWebTextCorpus (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Nagel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          : CC-News (
          <year>2016</year>
          ), http://web.archive.org/save/http://commoncrawl.org/2016/10/news-dataset-available/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Rajendran</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abdul-Mageed</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Happy together: Learning and understanding appraisal from natural language</article-title>
          .
          <source>In: A Con@AAAI</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Rajpurkar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , Zhang, J.,
          <string-name>
            <surname>Lopyrev</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>SQuAD: 100,000+ questions for machine comprehension of text</article-title>
          .
          <source>arXiv preprint arXiv:1606.05250</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Rodgers</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nicewander</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Thirteen ways to look at the correlation coefficient</article-title>
          .
          <source>The American Statistician</source>
          <volume>42</volume>
          (
          <issue>1</issue>
          ),
          <fpage>59</fpage>
          -
          <lpage>66</lpage>
          (
          <year>1988</year>
          ), http://www.jstor.org/stable/2685263
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Trinh</surname>
            ,
            <given-names>T.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          :
          <article-title>A simple method for commonsense reasoning</article-title>
          . CoRR abs/1806.02847 (
          <year>2018</year>
          ), http://dblp.uni-trier.de/db/journals/corr/corr1806.html#abs-1806-02847
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polosukhin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Attention is all you need</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michael</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bowman</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>GLUE: A multitask benchmark and analysis platform for natural language understanding</article-title>
          .
          <source>In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>
          . pp.
          <fpage>353</fpage>
          -
          <lpage>355</lpage>
          . Association for Computational Linguistics, Brussels, Belgium (Nov
          <year>2018</year>
          ). https://doi.org/10.18653/v1/W18-5446, https://www.aclweb.org/anthology/W18-5446
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lavie</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.H.</given-names>
          </string-name>
          :
          <article-title>Humor recognition and humor anchor extraction</article-title>
          . In: Marquez,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Pighin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Marton</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y</surname>
          </string-name>
          . (eds.) EMNLP. pp.
          <fpage>2367</fpage>
          –
          <lpage>2376</lpage>
          . The Association for Computational Linguistics (
          <year>2015</year>
          ), http://dblp.uni-trier.de/db/conf/emnlp/emnlp2015.html#YangLDH15
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carbonell</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          :
          <article-title>XLNet: Generalized autoregressive pretraining for language understanding (</article-title>
          <year>2019</year>
          ), http://arxiv.org/abs/1906.08237
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          . In: Submitted to International Conference on Learning Representations (
          <year>2020</year>
          ), https://openreview.net/forum?id=SyxS0T4tvS, under review
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Lan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gimpel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soricut</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>ALBERT: A lite BERT for self-supervised learning of language representations</article-title>
          . In: Submitted to International Conference on Learning Representations (
          <year>2020</year>
          ), https://openreview.net/forum?id=H1eA7AEtvS, under review
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiros</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zemel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Urtasun</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fidler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Aligning books and movies: Towards story-like visual explanations by watching movies and reading books (</article-title>
          <year>2015</year>
          ), http://arxiv.org/abs/1506.06724
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>