<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Contextual Representation of Self-Disclosure and Supportiveness in Short Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chandan Reddy Akiti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarah Rajtmajer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Squicciarini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Pennsylvania State University</institution>
          ,
          <addr-line>University Park, PA 16804</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>As user engagement in online public discourse continues to grow richer, voluntary disclosure of personal information and its associated risks to privacy and security are of increasing concern. Users are often unaware of the sheer amount of personal information they share across online forums, commentaries, and social networks, as well as the power of modern AI to synthesize and gain insights from this data. We develop a novel multi-modal approach for the joint classification of self-disclosure and supportiveness in short text. We take an ensemble approach for representation learning, leveraging BERT, LSTM, and CNN neural networks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>We consider sentences annotated with six labels: emotional disclosure; information disclosure; support; general support; information support; and emotional support.</p>
      <p>We propose two alternative models for the task of detecting self-disclosure and supportiveness in users' online comments. The first model is classification using BERT, a bi-directional transformer. We use this model to fine-tune our word representations and for classification using sentence representations and hidden attentions obtained from the model. The second model is classification using a CNN, where we replace the typical embedding layer with the pre-trained BERT model. We apply multiple convolutions using different window sizes (numbers of words).</p>
    </sec>
    <sec id="sec-2">
      <title>Problem Definition</title>
      <p>The OffMyChest conversation dataset consists of 12,860 labeled sentences and 5,000 unlabeled sentences sampled from comments on subreddits within the OffMyChest community, with the following tags: "wife"; "girlfriend"; "gf"; "husband"; "boyfriend"; and "bf". For example, the following training sentence is labeled for emotional disclosure and emotional support: I hope this chapter results in a better, healthier, more fulfilled you!! Of particular interest in this dataset is the General Support label. This label is intended for sentences offering support through catchphrases or quotes, which are exceptionally difficult to distinguish using automated approaches.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Label frequencies in the labeled dataset.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Label</th>
              <th>Frequency</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>Emotional Disclosure</td><td>0.44</td></tr>
            <tr><td>Information Disclosure</td><td>0.61</td></tr>
            <tr><td>Support</td><td>0.35</td></tr>
            <tr><td>General Support</td><td>0.06</td></tr>
            <tr><td>Information Support</td><td>0.11</td></tr>
            <tr><td>Emotional Support</td><td>0.08</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>As indicated in Table 1, sub-labels for supportiveness suffer from class imbalance. We address this issue with weighted batch sampling. That is, we assign a weight to each sample given by the inverse frequency of occurrence of the label assigned to it, and then draw each mini-batch from the multinomial distribution based on the sample weights. Thus, highly-weighted samples are sampled more often in each mini-batch, as sketched below.</p>
    </sec>
    <sec id="sec-3">
      <title>Modeling Approach</title>
      <p>
        Our model leverages the generalizability of Bidirectional Encoder Representations from Transformers (BERT). As BERT models tend to generalize well across a diverse set of NLP tasks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we use this capability in our transfer learning approach. Sentence-level representations obtained from the BERT model can also be fine-tuned on the unlabeled dataset to better generalize to unseen data for our task.
      </p>
      <p>
        BERT has its origins in Semi-supervised Sequence Learning [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Generative Pre-Training [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], ELMo [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and ULMFiT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Unlike previous models, though, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This makes it particularly suitable for our task, as it allows us to input training text as-is, without imposing predefined and possibly biased features or setting hyper-parameters that would require further analysis.
      </p>
      <p>
        Context-free models such as word2vec [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and GloVe [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] generate a single word embedding for each word in the vocabulary; that is, the same word used in different contexts has the same representation. The BERT model, in contrast, accounts for the context of a sentence using the words both preceding and following a given word. BERT has achieved state-of-the-art performance on eleven NLP tasks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>We present two classification models. The first model is a BERT model for sequence classification, fine-tuned on the unlabeled text corpus. The second model is a CNN model in which we use BERT word embeddings instead of a typical word embedding layer. We use the CNN model to visualize which parts of a sentence contribute to the classification.</p>
      <sec id="sec-3-1">
        <title>Model 1: BERT model with fine-tuning</title>
        <p>
          The BERT-base model has 12 layers with a hidden size of 768 and 12 self-attention heads. This model is pre-trained on BooksCorpus (800M words) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and English Wikipedia (2,500M words). We fine-tune the BERT model using a Masked Language Model (LM) objective and the post and comment data provided as the unlabeled dataset.
        </p>
        <p>For fine-tuning with the Masked LM, we prepare a text corpus with all the posts and comments and feed it to the model. The model masks 15% of the words in each sentence and tries to predict them using the other 85% of the words in the sentence. We thus fine-tune the language model to accurately predict the words in the corpus from the context of the surrounding words. For this task, we use a learning rate of 5e-5, weight decay of 0, Adam epsilon of 1e-8 and gradient clipping at 1.0. We train the Masked LM for 1 epoch.</p>
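        <p>A rough sketch of this Masked LM fine-tuning step, assuming the HuggingFace Transformers library (exact class names and arguments vary across library versions; the corpus file name is a placeholder), is given below.</p>
        <preformat>
# Hedged sketch of Masked-LM fine-tuning on the unlabeled posts/comments
# (HuggingFace Transformers assumed; "unlabeled_corpus.txt" is a placeholder).
from transformers import (BertForMaskedLM, BertTokenizer,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="unlabeled_corpus.txt",  # posts + comments
                                block_size=200)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="bert-mlm-finetuned",
                         num_train_epochs=1,           # 1 epoch, as in the text
                         learning_rate=5e-5,
                         weight_decay=0.0,
                         adam_epsilon=1e-8,
                         max_grad_norm=1.0)            # gradient clipping at 1.0
Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=collator).train()
</preformat>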
        <p>Sentence tokenization. We add [CLS] and [SEP] tokens at the start and end of every sentence. These tokens are used in the BERT model to indicate the start and end of a sequence. We then generate word indices for all the tokens in the sentence using a BERT tokenizer.</p>
        <p>Sentence padding. The sentences in the training and test data have an average length of 17 words and a maximum length of 171 words. We pad all sentences to 200 tokens in length to avoid truncating the tokenized sentences.</p>
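        <p>The tokenization and padding steps can be sketched as follows (a simplified illustration assuming the HuggingFace BERT tokenizer, padding with token id 0, i.e. [PAD], up to 200 tokens).</p>
        <preformat>
# Simplified sketch of sentence tokenization and padding (HuggingFace BERT
# tokenizer assumed; pads every sentence to 200 token ids).
from transformers import BertTokenizer

MAX_LEN = 200
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode(sentence):
    tokens = ["[CLS]"] + tokenizer.tokenize(sentence) + ["[SEP]"]
    ids = tokenizer.convert_tokens_to_ids(tokens)
    attention_mask = [1] * len(ids) + [0] * (MAX_LEN - len(ids))
    ids = ids + [0] * (MAX_LEN - len(ids))   # 0 is the [PAD] token id
    return ids, attention_mask

ids, mask = encode("I hope this chapter results in a better, healthier you!")
</preformat>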
        <p>Sentence classification. For sentence classification, we add a linear layer on top of the pooled output of the [CLS] token. We then apply softmax to the output of the linear layer to obtain the classification probability for each label.</p>
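        <p>A minimal sketch of this classification head, assuming PyTorch and the HuggingFace BertModel (class and variable names are illustrative), is shown below.</p>
        <preformat>
# Sketch of the classifier: a linear layer over BERT's pooled [CLS] output,
# followed by softmax (PyTorch + HuggingFace assumed; names are illustrative).
import torch
import torch.nn as nn
from transformers import BertModel

class BertSentenceClassifier(nn.Module):
    def __init__(self, num_labels=6, hidden_size=768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs[1]                     # pooled output of the [CLS] token
        logits = self.classifier(pooled)
        return torch.softmax(logits, dim=-1)    # classification probabilities
</preformat>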
        <p>Training parameters are set according to Table 2. We measure the cross-entropy loss for each mini-batch and back-propagate the loss.</p>
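        <p>For illustration, one training step with the mini-batch cross-entropy loss looks roughly as follows (toy stand-in model and data; the values in Table 2 are not reproduced here, so the optimizer settings are placeholders).</p>
        <preformat>
# Illustrative training step: cross-entropy loss on a mini-batch, then
# back-propagation (toy stand-in model; optimizer settings are placeholders).
import torch
import torch.nn as nn

model = nn.Linear(768, 6)                      # stand-in for the classifier head
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

pooled = torch.randn(32, 768)                  # toy mini-batch of pooled outputs
labels = torch.randint(0, 6, (32,))
loss = criterion(model(pooled), labels)        # cross-entropy for the mini-batch
loss.backward()                                # back-propagate the loss
optimizer.step()
optimizer.zero_grad()
</preformat>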
      </sec>
      <sec id="sec-3-2">
        <title>Model 2: CNN with BERT embedding</title>
        <p>
          As an alternative model, we deploy the CNN model presented in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] for text classification. The intuition behind this model is to process windows of words in the contextual representation of a sentence and aggregate the decision from all the windows. For this purpose we represent the sentence as a matrix and design convolution kernels of different sizes, each size corresponding to a window size in words.
        </p>
        <p>We tokenize the sentences in the same fashion as for the BERT model. If a tokenized sentence has n tokens, where n &lt; 200, we pad it with zero tokens to a standardized length of N = 200 tokens. We then obtain word representations of the N tokens from the BERT-base pre-trained/fine-tuned model. The BERT-base model has a hidden size of H = 768. Thus we take the representation of a sentence to be an N × H matrix.</p>
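        <p>Building the N × H sentence matrix from BERT token representations can be sketched as follows (HuggingFace assumed; the last hidden layer of BERT-base supplies the 768-dimensional token vectors).</p>
        <preformat>
# Sketch of building the N x H sentence matrix from BERT token representations
# (HuggingFace assumed; N = 200, H = 768 for BERT-base).
import torch
from transformers import BertModel, BertTokenizer

N = 200
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer("I hope this chapter results in a better, healthier you!",
                padding="max_length", max_length=N, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**enc)[0]           # last hidden states, shape (1, N, 768)
sentence_matrix = hidden.squeeze(0)   # N x H input to the CNN
</preformat>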
        <p>We use 4 types of convolution kernels, with filter sizes 2 through 5 and filter width equal to the length of the word representations, H. Suppose a kernel w<sup>h</sup> with filter size h is convolved with the sentence A over the window from the i-th token to the (i + h - 1)-th token. We obtain the kernel output o<sup>h</sup><sub>i</sub> as o<sup>h</sup><sub>i</sub> = w<sup>h</sup> · A[i : i + h - 1], where o<sup>h</sup> is the convolution output of kernel w<sup>h</sup> on sentence A. Adding a bias term b<sup>h</sup> and an activation function f to each o<sup>h</sup><sub>i</sub>, we have c<sup>h</sup><sub>i</sub> = f(o<sup>h</sup><sub>i</sub> + b<sup>h</sup>).</p>
        <p>In this way, for each kernel we obtain an N × 1 output c<sup>h</sup> after a padded convolution and activation. The output c<sup>h</sup><sub>i</sub> is an indicator of the weight of window i of length h for the classification task. The output of the convolution layer can also be visualized as a heat map for the sentence classification task.</p>
        <p>To maximize learning, we use 12 kernels: 3 of each kernel size. We apply max-pooling on each of the convolution outputs to obtain a pooled output for each of the kernels. Finally, a linear layer is applied on the pooled outputs for classification. Loss is measured using the cross-entropy loss function and back-propagated for training. We use an Adam optimizer with a learning rate of 1e-3 and a batch size of 32, train for 30 epochs, and choose the model with the best validation F1-score.</p>
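        <p>A hedged sketch of this CNN classifier over BERT embeddings (PyTorch assumed; layer and variable names are illustrative) is given below.</p>
        <preformat>
# Hedged sketch of the CNN over BERT embeddings: kernel sizes 2-5, three
# kernels per size, max-pooling, then a linear layer (names are illustrative).
import torch
import torch.nn as nn

class BertEmbeddingCNN(nn.Module):
    def __init__(self, num_labels=6, hidden=768, kernel_sizes=(2, 3, 4, 5),
                 kernels_per_size=3):
        super().__init__()
        # One Conv2d per window size h: the kernel spans h tokens by the full
        # embedding width H, with padding along the token axis.
        self.convs = nn.ModuleList([
            nn.Conv2d(1, kernels_per_size, (h, hidden), padding=(h - 1, 0))
            for h in kernel_sizes
        ])
        self.relu = nn.ReLU()
        self.classifier = nn.Linear(kernels_per_size * len(kernel_sizes), num_labels)

    def forward(self, bert_embeddings):             # (batch, N, H) from BERT
        x = bert_embeddings.unsqueeze(1)            # (batch, 1, N, H)
        pooled = []
        for conv in self.convs:
            c = self.relu(conv(x)).squeeze(3)       # (batch, kernels, windows)
            pooled.append(torch.max(c, dim=2).values)   # max-pool over windows
        return self.classifier(torch.cat(pooled, dim=1))

# toy usage with random stand-ins for BERT embeddings
model = BertEmbeddingCNN()
logits = model(torch.randn(2, 200, 768))
</preformat>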
        <p>Visualization of activations. We visualize activations in the CNN network during classification. Given a sentence A, we pass the input as a single-element batch to the model and extract the activations c<sup>h</sup> for each window size h. In Figure 6, we plot the heat map for each word. For window size h and word A<sub>i</sub>, we measure the heat-map value as the sum of activations of all the windows in which the word A<sub>i</sub> is included. We then normalize the heat maps across the sentence to plot activations with respect to window size h.</p>
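        <p>The per-word heat-map computation can be sketched as follows (a toy example, without the plotting step).</p>
        <preformat>
# Sketch of the per-word heat map for one window size h: each word's score is
# the sum of activations of all windows containing it, normalized over the
# sentence (toy activations, no plotting).
import torch

n, h = 6, 3                              # toy sentence length and window size
c_h = torch.rand(n - h + 1)              # one activation per window of size h

heat = torch.zeros(n)
for i in range(n - h + 1):               # window i covers words i .. i+h-1
    heat[i:i + h] += c_h[i]
heat = heat / heat.sum()                 # normalize across the sentence
</preformat>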
        <p>We measure accuracy against our labeled dataset using standard metrics, i.e., precision, recall, F1-score and AuROC.</p>
        <p>To avoid misleadingly high accuracy on the imbalanced dataset, we measure precision and recall of only the true (positive) values for each label. AuROC measures how well the model is able to distinguish true labels from false labels.</p>
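        <p>For one label, these metrics can be computed roughly as follows (scikit-learn assumed; the arrays are toy placeholders).</p>
        <preformat>
# Sketch of per-label evaluation: precision/recall/F1 of the positive class
# only, plus AuROC (scikit-learn assumed; toy predictions).
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.7, 0.4, 0.1, 0.3, 0.5])
y_pred = (y_prob >= 0.5).astype(int)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1)
auroc = roc_auc_score(y_true, y_prob)
</preformat>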
        <p>Model 1 results. Results for our BERT model are reported in Table 4. As shown, BERT provides a mean F1-score of 0.525 with mean precision 0.45 and mean recall 0.655. The Emotional Disclosure, Information Disclosure and Support labels had no data imbalance problem, thus accounting for reasonable precision. The model performs well on these three labels, with a mean F1-score of 0.60.</p>
        <p>However, the General Support, Information Support and Emotional Support labels have high data imbalance, with approximately 10% true labels. Low precision for these labels indicates a high false positive rate. The model performs very poorly on General Support. This is to be expected, given the difficulty of distinguishing catchphrases and quotes from ordinary text. Performing weighted sampling based on the labels increased recall but considerably decreased precision.</p>
        <p>Model 2 results. The CNN model provides a mean F1-score of 0.485 with precision 0.417 and recall 0.592. The Emotional Disclosure, Information Disclosure and Support labels had no data imbalance problem, thus accounting for reasonable precision. The model performs well on these three labels, with a mean F1-score of 0.58.</p>
        <p>Comparing the two models, BERT performs better because we fine-tune the model while performing the classification task itself. In contrast, the CNN model takes as input static word embeddings from BERT. We do not propagate loss into the BERT model; we only train the CNN.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this work, we have shown that the use of contextual embeddings performs well on complex sentence classification tasks. We have also tested an alternative model with a CNN classification layer. We found the contextual embeddings performed with comparable accuracy, despite the lack of fine-tuning. We have presented visualizations of the convolution activations for the classification task.</p>
      <p>Currently we use the BERT-base pre-trained model with only 12 layers. However, there are other variations of BERT models with additional parameters. In future work, we plan to rigorously validate model parameters such as the number of layers and to add text pre-processing, e.g., stemming, to further clean the data. We suspect that fine-tuning with pre-processed text data would improve generalization on test data.</p>
      <p>We also plan to dissect bi-directional transformers like BERT and XLNet to enhance contextual word representations for tasks specific to disclosure and support classification in spoken dialogue systems. Finally, we would like to model the effects of peer influence on self-disclosure.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Zhilin</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          and William W. Cohen.
          <article-title>Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks</article-title>
          ,
          <year>2017</year>
          ; arXiv:
          <fpage>1703</fpage>
          .
          <fpage>06345</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <source>BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding</source>
          ,
          <year>2018</year>
          ; arXiv:
          <year>1810</year>
          .04805.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Radford</surname>
          </string-name>
          ,
          <article-title>Alec and Wu, Je and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya: Language Models are Unsupervised Multitask Learners (</article-title>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Wolf</surname>
          </string-name>
          , Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz and Jamie Brew.
          <article-title>HuggingFace's Transformers: State-of-the-art</article-title>
          <source>Natural Language Processing</source>
          ,
          <year>2019</year>
          ; arXiv:
          <year>1910</year>
          .03771.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Yaru</given-names>
            <surname>Hao</surname>
          </string-name>
          , Li Dong,
          <source>Furu Wei and Ke Xu. Visualizing and Understanding the E ectiveness of BERT</source>
          ,
          <year>2019</year>
          ; arXiv:
          <year>1908</year>
          .05620.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Matthew</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Peters</surname>
            , Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
            <given-names>Kenton</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            and
            <given-names>Luke</given-names>
          </string-name>
          <string-name>
            <surname>Zettlemoyer</surname>
          </string-name>
          .
          <source>Deep contextualized word representations</source>
          ,
          <year>2018</year>
          ; arXiv:
          <year>1802</year>
          .05365.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Jeremy</given-names>
            <surname>Howard</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ruder</surname>
          </string-name>
          .
          <article-title>Universal Language Model Fine-tuning for Text Classi cation</article-title>
          ,
          <year>2018</year>
          ; arXiv:
          <year>1801</year>
          .06146.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Andrew</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Dai</surname>
            and
            <given-names>Quoc V.</given-names>
          </string-name>
          <string-name>
            <surname>Le.</surname>
          </string-name>
          Semi-supervised
          <source>Sequence Learning</source>
          ,
          <year>2015</year>
          ; arXiv:
          <fpage>1511</fpage>
          .
          <fpage>01432</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Ye</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Byron</given-names>
            <surname>Wallace</surname>
          </string-name>
          .
          <article-title>A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classi cation</article-title>
          ,
          <year>2015</year>
          ; arXiv:
          <fpage>1510</fpage>
          .
          <fpage>03820</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Yukun</surname>
            <given-names>Zhu</given-names>
          </string-name>
          , Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba and
          <string-name>
            <given-names>Sanja</given-names>
            <surname>Fidler</surname>
          </string-name>
          .
          <article-title>Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies</article-title>
          and Reading Books,
          <year>2015</year>
          ; arXiv:
          <fpage>1506</fpage>
          .
          <fpage>06724</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Tomas</surname>
            <given-names>Mikolov</given-names>
          </string-name>
          , Kai Chen,
          <source>Greg Corrado and Je rey Dean. E cient Estimation of Word Representations in Vector Space</source>
          ,
          <year>2013</year>
          ; arXiv:
          <fpage>1301</fpage>
          .
          <fpage>3781</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Sakakibara</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>1992</year>
          ).
          <article-title>E cient Learning of Context-Free Grammars from Positive Structural Examples</article-title>
          .
          <source>Inf. Comput.</source>
          ,
          <volume>97</volume>
          ,
          <fpage>23</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Alec Radford: Improving Language Understanding by Generative Pre-Training</surname>
          </string-name>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>