=Paper=
{{Paper
|id=Vol-2614/session4_paper5
|storemode=property
|title=Contextual representation of self-disclosure and supportiveness in short text
|pdfUrl=https://ceur-ws.org/Vol-2614/AffCon20_session4_contextual.pdf
|volume=Vol-2614
|authors=Chandan Akiti,Sarah Rajtmajer,Anna Squicciarini
|dblpUrl=https://dblp.org/rec/conf/aaai/AkitiRS20
}}
==Contextual representation of self-disclosure and supportiveness in short text==
Contextual Representation of Self-Disclosure and Supportiveness in Short Text

Chandan Reddy Akiti, Sarah Rajtmajer, and Anna Squicciarini
Pennsylvania State University, University Park, PA 16804, USA

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: N. Chhaya, K. Jaidka, J. Healey, L. H. Ungar, A. Sinha (eds.): Proceedings of the 3rd Workshop on Affective Content Analysis, New York, USA, 07-FEB-2020, published at http://ceur-ws.org

Abstract. As user engagement in online public discourse continues to grow richer, voluntary disclosure of personal information and its associated risks to privacy and security are of increasing concern. Users are often unaware of the sheer amount of personal information they share across online forums, commentaries, and social networks, as well as of the power of modern AI to synthesize and gain insights from this data. We develop a novel multi-modal approach for the joint classification of self-disclosure and supportiveness in short text. We take an ensemble approach to representation learning, leveraging BERT, LSTM, and CNN neural networks.

1 Introduction

As public discourse facilitated through social media and online forums grows increasingly commonplace, voluntary disclosure of personal information has been normalized. But users are often unaware or under-aware of the threat self-disclosure poses to their privacy and security. We argue that public self-disclosure is often encouraged and even primed by a false sense of intimacy, as well as by the topics and tone of conversations. Accordingly, we aim to leverage the contextual representations afforded by deep neural language models for the detection of self-disclosure and supportive text.

The application of deep learning to NLP is made possible by representing words as vectors in a low-dimensional continuous space. Traditionally, these word representations were static: each word was represented by a single vector, regardless of context. However, this approach has fundamental deficits for tasks like sentiment analysis, where the representation of a word in context is critically important. Recent work, e.g., [1], has shown that contextual word representations improve performance on NLP tasks.

Deep neural language models such as BERT [2] and GPT-2 [3] represent successful attempts to create contextualized word representations. Replacing static with contextualized representations has led to significant improvement in a number of NLP tasks [2]. In this work, we use the pre-trained BERT model from the huggingface [4] library.

In the following, we present our approach and results for the CL-Aff Shared Task on the OffMyChest conversation dataset put forth in the AAAI-2020 Workshop on Affective Content Analysis. The task involves multi-class classification of sentences with 6 labels: emotional disclosure, information disclosure, support, general support, information support, and emotional support.

We propose two alternative models for the task of detecting self-disclosure and supportiveness in online user-generated comments. The first model is classification using BERT, a bi-directional transformer. We use this model to fine-tune our word representations and for classification using sentence representations and hidden attentions obtained from the model. The second model is classification using a CNN, where we replace the typical embedding layer with the pre-trained BERT model and apply multiple convolutions using different window sizes (numbers of words).
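To make the role of contextual representations concrete, the sketch below (our illustration, not code from the paper) loads the pre-trained bert-base-uncased model through the huggingface transformers library referenced above and shows that the same word receives different vectors in different sentences, unlike a static word2vec/GloVe embedding. The example sentences are ours, and a recent version of the transformers API is assumed.

<pre>
# Sketch: contextual token representations from pre-trained BERT (huggingface).
# Assumes a recent transformers API; sentences are illustrative only.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["I deposited cash at the bank.",
             "We sat on the bank of the river."]
with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state        # (1, seq_len, 768)
        # Locate the token "bank" and print part of its contextual vector;
        # the vector differs between the two sentences.
        bank_id = tokenizer.convert_tokens_to_ids("bank")
        position = inputs["input_ids"][0].tolist().index(bank_id)
        print(text, hidden[0, position, :5])
</pre>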
2 Problem Definition

The OffMyChest conversation dataset consists of 12,860 labeled sentences and 5,000 unlabeled sentences sampled from comments on subreddits within the OffMyChest community, with the following tags: "wife", "girlfriend", "gf", "husband", "boyfriend", and "bf". For example, the following training sentence is labeled for emotional disclosure and emotional support:

I hope this chapter results in a better, healthier, more fulfilled you!!

Of particular interest in this dataset is the General Support label. This label is intended for sentences offering support through catch phrases or quotes, which are exceptionally difficult to distinguish using automated approaches.

Table 1. Label frequency for each label in the training data.

  Label                    Frequency
  Emotional Disclosure     0.44
  Information Disclosure   0.61
  Support                  0.35
  General Support          0.06
  Information Support      0.11
  Emotional Support        0.08

As indicated in Table 1, the sub-labels for supportiveness suffer from class imbalance. We address this issue with weighted batch sampling. That is, we assign each sample a weight given by the inverse frequency of occurrence of the label assigned to it, and then draw each mini-batch from the multinomial distribution defined by the sample weights. Thus, highly weighted samples are drawn more often in each mini-batch.
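The weighted batch sampling described above can be implemented, for example, with PyTorch's WeightedRandomSampler. The sketch below is our illustration rather than the authors' code, and uses a toy single-label setup for brevity; all names are placeholders.

<pre>
# Sketch: mini-batch sampling weighted by inverse label frequency (PyTorch).
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = np.array([0, 0, 0, 1, 0, 2, 0, 1])        # toy per-sample labels
class_weights = 1.0 / np.bincount(labels)          # inverse label frequency
sample_weights = class_weights[labels]             # one weight per sample

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(sample_weights),
    replacement=True)                              # multinomial draw

features = torch.randn(len(labels), 8)             # placeholder features
loader = DataLoader(TensorDataset(features, torch.as_tensor(labels)),
                    batch_size=4, sampler=sampler)

for x, y in loader:
    pass  # rare labels now appear in mini-batches much more often
</pre>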
3 Modeling Approach

Our model leverages the generalizability of Bidirectional Encoder Representations from Transformers (BERT). As BERT models tend to generalize well across a diverse set of NLP tasks [5], we use this capability in our transfer learning approach. The sentence-level representations obtained from the BERT model can also be fine-tuned on the unlabeled dataset to better generalize to unseen data for our task.

BERT has its origins in Semi-supervised Sequence Learning [8], Generative Pre-Training [13], ELMo [6], and ULMFiT [7]. Unlike previous models, though, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus [2]. This makes it particularly suitable for our task, as it allows us to input training text as-is, without imposing predefined and possibly biased features or setting hyper-parameters that would require further analysis.

Context-free models such as word2vec [11] and GloVe [12] generate a single embedding for each word in the vocabulary; that is, the same word used in different contexts has the same representation. The BERT model, in contrast, accounts for the context of a sentence using the words both preceding and following. BERT has achieved state-of-the-art performance on eleven NLP tasks [2].

We present two classification models. The first is a BERT model for sequence classification, fine-tuned on the unlabeled text corpus. The second is a CNN model in which we use BERT word embeddings instead of a typical word embedding layer. We use the CNN model to visualize which parts of a sentence contribute to the classification.

3.1 Model 1: BERT model with fine-tuning

The BERT-base model has 12 layers with a hidden size of 768 and 12 self-attention heads. This model is pre-trained on BooksCorpus (800M words) [10] and English Wikipedia (2,500M words). We fine-tune the BERT model using a Masked Language Model (LM) objective on the post and comment data provided as the unlabeled dataset.

For fine-tuning with the Masked LM, we prepare a text corpus with all the posts and comments and feed it to the model. The model masks 15% of the words in each sentence and tries to predict them using the other 85% of the words in the sentence. We thus fine-tune the language model to predict the words in the corpus accurately in the context of the surrounding words. For this task, we use a learning rate of 5e-5, weight decay of 0, Adam epsilon of 1e-8, and gradient clipping at 1.0. We train the Masked LM for 1 epoch.

Sentence tokenization. We add [CLS] and [SEP] tokens at the start and end of every sentence. These tokens are used in the BERT model to indicate the start and end of a sequence. We then generate word indices for all the tokens in the sentence using a BERT tokenizer.

Sentence padding. The sentences in the training and test data have an average length of 17 words and a maximum length of 171 words. We pad all sentences to 200 tokens to avoid truncating the tokenized sentences.

Fig. 1. Sentence classification using Bidirectional Transformers [2].

Sentence classification. For sentence classification, we add a linear layer on top of the pooled output of the [CLS] token. We then apply softmax to the output of the linear layer to obtain the classification probability for the label.

Table 2. Training parameters for Model 1.

  parameter          value
  optimizer          Adam
  lr                 1e-5
  weight decay       0
  pre-trained model  bert-base-uncased
  batch size         32
  epochs             2

Training parameters are set according to Table 2. We measure the cross-entropy loss for each mini-batch and back-propagate the loss.
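The sketch below is our reconstruction (not the released implementation) of the Model 1 pipeline described in this section: tokenization with [CLS] and [SEP], padding to 200 tokens, and a linear classifier over the pooled [CLS] output trained with Adam at the settings of Table 2. It assumes a recent transformers API, treats a single label as a binary classification task (our reading of the setup), and omits the Masked LM fine-tuning step.

<pre>
# Sketch: BERT sequence classification with [CLS]/[SEP] tokenization and padding.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

MAX_LEN = 200                                       # pad all sentences to 200 tokens

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)              # binary decision for one label

sentences = ["I hope this chapter results in a better, healthier you!"]
labels = torch.tensor([1])                          # toy gold label

enc = tokenizer(sentences, padding="max_length", truncation=True,
                max_length=MAX_LEN, return_tensors="pt")  # adds [CLS]/[SEP]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=0.0)

model.train()
for epoch in range(2):                              # 2 epochs, as in Table 2
    optimizer.zero_grad()
    out = model(input_ids=enc["input_ids"],
                attention_mask=enc["attention_mask"],
                labels=labels)                      # returns cross-entropy loss
    out.loss.backward()
    optimizer.step()
</pre>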
3.2 Model 2: CNN with BERT embedding

As an alternative model, we deploy the CNN model presented in [9] for text classification. The intuition behind this model is to process windows of words in the contextual representation of a sentence and aggregate the decisions from all the windows. For this purpose, we represent the sentence as a matrix and design convolution kernels of different sizes, each size corresponding to a window of words in the sentence.

We tokenize the sentences in the same fashion as for the BERT model. If a tokenized sentence has n tokens, where n < 200, we pad it with zero tokens to get a standardized length of N = 200 tokens. We then obtain word representations of the N tokens from the BERT-base pre-trained/fine-tuned model. The BERT-base model has a hidden size of H = 768; thus we take the representation of a sentence to be an N × H matrix.

Fig. 2. Sample sentences for Emotional Disclosure. The word windows, e.g., "I was happy" and "I was left in awe", activate for the classification of Emotional Disclosure.

Fig. 3. CNN model with BERT embedding, modified from [9].

Table 3. Training parameters for Model 2.

  parameter     value
  optimizer     Adam
  lr            1e-3
  weight decay  0
  batch size    32
  epochs        30

We use 4 types of convolution kernels, with filter sizes 2 through 5 and filter width equal to the length of the word representations, H. Suppose a kernel w^h with filter size h is used for convolution on sentence A over the window from the i-th token to the (i + h − 1)-th token. We obtain the kernel output

  o_i^h = w^h · A[i : i + h − 1],

where o^h is the convolution output of kernel w^h on sentence A. We add a bias term b^h and an activation function f to each o_i^h, giving

  c_i^h = f(o_i^h + b^h).

In this way, for each kernel we obtain an N × 1 output c^h after a padded convolution and activation. The output c_i^h indicates the weight of window i of length h for the classification task. The output of the convolution layer can also be visualized as a heat-map for the sentence classification task.

To maximize learning, we use 12 kernels: 3 of each kernel type. We apply max-pooling on each of the convolution outputs to obtain 4 outputs for each of the kernels. Finally, a linear layer is applied to the outputs for classification. Loss is measured using the cross-entropy loss function and back-propagated for training. We use an Adam optimizer with a learning rate of 1e-3 and a batch size of 32, train for 30 epochs, and choose the model with the best validation F1-score.
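A possible PyTorch realization of this architecture is sketched below. This is our reconstruction under the hyper-parameters reported above, not the authors' code; following the usual reading of [9], each of the 12 feature maps is max-pooled to a single value before the linear layer, and the BERT token matrix is pre-computed and held fixed.

<pre>
# Sketch: CNN over fixed BERT token representations (window sizes 2-5).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BertCNNClassifier(nn.Module):
    def __init__(self, hidden=768, kernels_per_size=3,
                 window_sizes=(2, 3, 4, 5), num_classes=2):
        super().__init__()
        # One Conv1d per window size h; 3 kernels each, 12 kernels in total.
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, kernels_per_size, kernel_size=h)
            for h in window_sizes)
        self.fc = nn.Linear(kernels_per_size * len(window_sizes), num_classes)

    def forward(self, bert_embeddings):
        # bert_embeddings: (batch, N, H) token vectors from BERT, held fixed.
        x = bert_embeddings.transpose(1, 2)          # (batch, H, N)
        pooled = []
        for conv in self.convs:
            c = F.relu(conv(x))                      # c_i^h = f(w^h . A[i:i+h-1] + b^h)
            pooled.append(F.max_pool1d(c, c.size(2)).squeeze(2))
        return self.fc(torch.cat(pooled, dim=1))     # logits for cross-entropy

# Usage with pre-computed BERT representations (shape: batch x 200 x 768):
# model = BertCNNClassifier()
# loss = F.cross_entropy(model(embeddings), targets)
</pre>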
Visualization of activations. We visualize activations in the CNN network during classification. Given a sentence A, we pass the input as a single-element batch to the model and extract the activations c^h for each window size h. In Figure 6, we plot the heat map for each word. For a window size h and a word A_i, we measure the heat-map value as the sum of the activations of all the windows that include the word A_i. We then normalize the heat maps across the sentence to plot activations with respect to window size h.

Fig. 4. Activations in the CNN model for Emotional Disclosure. h = 2 for the first two sentences and h = 4 for the third sentence. The fourth sentence does not contain emotional disclosure and has low activation values.

Fig. 5. Activations in the CNN model for Emotional Support. h = 2 for both sentences.

Fig. 6. Activations in the CNN model for Information Disclosure. h = 2, 3, and 4, respectively, for the sentences.

4 Results

We evaluate against our labeled dataset using standard metrics, i.e., precision, recall, F1-score, and AuROC. To avoid misleadingly high accuracy on the imbalanced dataset, we measure precision and recall for the positive (true) class of each label only. AuROC measures how well the model is able to distinguish true labels from false labels.

Model 1 results. Results for our BERT model are reported in Table 4. As shown, BERT provides a mean F1-score of 0.525 with mean precision 0.45 and mean recall 0.655. The Emotional Disclosure, Information Disclosure, and Support labels had no data imbalance problem, accounting for their reasonable precision; the model performs well on these three labels with a mean F1-score of 0.60. However, the General Support, Information Support, and Emotional Support labels are highly imbalanced, with approximately 10% true labels. Low precision for these labels indicates a high false-positive rate. The model performs very poorly on General Support. This is to be expected, given the difficulty of distinguishing catch phrases and quotes from ordinary text. Performing weighted sampling based on the labels increased recall but considerably decreased precision.

Table 4. BERT fine-tuned model: average scores on 10-fold cross validation.

  Label            Precision  Recall  F1-score  AuROC
  Emo Disclosure   0.48       0.68    0.57      0.74
  Info Disclosure  0.60       0.64    0.62      0.74
  Support          0.55       0.78    0.61      0.82
  General Support  0.26       0.42    0.32      0.73
  Info Support     0.39       0.72    0.49      0.84
  Emo Support      0.44       0.69    0.54      0.85

Model 2 results. The CNN model provides a mean F1-score of 0.485 with precision 0.417 and recall 0.592. The Emotional Disclosure, Information Disclosure, and Support labels had no data imbalance problem, accounting for their reasonable precision; the model performs well on these three labels with a mean F1-score of 0.58.

Table 5. CNN model with BERT embedding: average scores on 10-fold cross validation.

  Label            Precision  Recall  F1-score  AuROC
  Emo Disclosure   0.43       0.67    0.53      0.69
  Info Disclosure  0.56       0.67    0.61      0.72
  Support          0.57       0.70    0.62      0.83
  General Support  0.20       0.39    0.26      0.75
  Info Support     0.38       0.56    0.45      0.83
  Emo Support      0.36       0.56    0.44      0.83

Comparing the two models, BERT performs better because we fine-tune the model while performing the classification task itself. In contrast, the CNN model takes as input BERT word embeddings that are held fixed; we do not propagate loss into the BERT model and only train the CNN.
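For concreteness, the per-label metrics reported above can be computed as in the following sketch (our illustration, with toy arrays): precision, recall, and F1-score on the positive class only, and AuROC from the positive-class scores.

<pre>
# Sketch: per-label evaluation with precision/recall/F1 on the positive class and AuROC.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # gold binary labels
y_prob = np.array([0.9, 0.4, 0.7, 0.2, 0.1, 0.6, 0.8, 0.3])   # model scores
y_pred = (y_prob >= 0.5).astype(int)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1)
auroc = roc_auc_score(y_true, y_prob)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f} AuROC={auroc:.2f}")
</pre>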
5 Conclusion

In this work, we have shown that contextual embeddings perform well on complex sentence classification tasks. We have also tested an alternative model with a CNN classification layer and found that the contextual embeddings performed with comparable accuracy despite the lack of fine-tuning. We have presented visualizations of the convolution activations for the classification task.

Currently we use the BERT-base pre-trained model with only 12 layers. However, there are other variations of BERT models with additional parameters. In future work, we plan to rigorously validate model parameters such as the number of layers and to add text pre-processing, e.g., stemming, to further clean the data. We suspect that fine-tuning with pre-processed text data would improve generalization on test data. We also plan to dissect bi-directional transformers like BERT and XLNet to enhance contextual word representations for tasks specific to disclosure and support classification in spoken dialogue systems. Finally, we would like to model the effects of peer influence on self-disclosure.

References

1. Zhilin Yang, Ruslan Salakhutdinov and William W. Cohen. Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks, 2017; arXiv:1703.06345.
2. Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018; arXiv:1810.04805.
3. Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. Language Models are Unsupervised Multitask Learners, 2019.
4. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz and Jamie Brew. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019; arXiv:1910.03771.
5. Yaru Hao, Li Dong, Furu Wei and Ke Xu. Visualizing and Understanding the Effectiveness of BERT, 2019; arXiv:1908.05620.
6. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee and Luke Zettlemoyer. Deep Contextualized Word Representations, 2018; arXiv:1802.05365.
7. Jeremy Howard and Sebastian Ruder. Universal Language Model Fine-tuning for Text Classification, 2018; arXiv:1801.06146.
8. Andrew M. Dai and Quoc V. Le. Semi-supervised Sequence Learning, 2015; arXiv:1511.01432.
9. Ye Zhang and Byron Wallace. A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification, 2015; arXiv:1510.03820.
10. Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba and Sanja Fidler. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books, 2015; arXiv:1506.06724.
11. Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space, 2013; arXiv:1301.3781.
12. Jeffrey Pennington, Richard Socher and Christopher D. Manning. GloVe: Global Vectors for Word Representation. In: Proceedings of EMNLP, 2014.
13. Alec Radford. Improving Language Understanding by Generative Pre-Training, 2018.