<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Contextual Representation of Self-Disclosure and Supportiveness in Short Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chandan Reddy Akiti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarah Rajtmajer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Squicciarini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Pennsylvania State University</institution>
          ,
          <addr-line>University Park, PA 16804</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>As user engagement in online public discourse continues to grow richer, voluntary disclosure of personal information and its associated risks to privacy and security are of increasing concern. Users are often unaware of the sheer amount of personal information they share across online forums, commentaries, and social networks, as well as the power of modern AI to synthesize and gain insights from this data. We develop a novel multi-modal approach for the joint classification of self-disclosure and supportiveness in short text. We take an ensemble approach for representation learning, leveraging BERT, LSTM, and CNN neural networks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>We consider sentences annotated with six labels: emotional disclosure; information disclosure; support; general support; information support; and emotional support.</p>
      <p>We propose two alternative models for the task of detecting self-disclosure and supportiveness in users' online comments. The first model is classification using BERT, a bi-directional transformer. We use this model to fine-tune our word representations and for classification using sentence representations and hidden attentions obtained from the model. The second model is classification using a CNN, where we replace the typical embedding layer with the pre-trained BERT model. We apply multiple convolutions using different window sizes (numbers of words).</p>
    </sec>
    <sec id="sec-2">
      <title>Problem Definition</title>
      <p>The OffMyChest conversation dataset consists of 12,860 labeled sentences and 5,000 unlabeled sentences sampled from comments on subreddits within the OffMyChest community, with the following tags: "wife"; "girlfriend"; "gf"; "husband"; "boyfriend"; and "bf". For example, the following training sentence is labeled for emotional disclosure and emotional support: I hope this chapter results in a better, healthier, more fulfilled you!! Of particular interest in this dataset is the General Support label. This label is intended for sentences offering support through catchphrases or quotes, which are exceptionally difficult to distinguish using automated approaches.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Label frequencies in the labeled dataset.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Label</th>
              <th>Frequency</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>Emotional Disclosure</td><td>0.44</td></tr>
            <tr><td>Information Disclosure</td><td>0.61</td></tr>
            <tr><td>Support</td><td>0.35</td></tr>
            <tr><td>General Support</td><td>0.06</td></tr>
            <tr><td>Information Support</td><td>0.11</td></tr>
            <tr><td>Emotional Support</td><td>0.08</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>As indicated in Table 1, sub-labels for supportiveness suffer from class imbalance. We address this issue with weighted batch sampling. That is, we assign a weight to each sample given by the inverse frequency of occurrence of the label assigned to it, and then draw each mini-batch from the multinomial distribution based on the sample weights. Thus, highly-weighted samples are sampled more often in each mini-batch, as sketched below.</p>
    </sec>
    <sec id="sec-3">
      <title>Modeling Approach</title>
      <p>
        Our model leverages the generalizability of Bidirectional Encoder Representations from Transformers (BERT). As BERT models tend to generalize well across a diverse set of NLP tasks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we use this capability in our transfer learning approach. Sentence-level representations obtained from the BERT model can also be fine-tuned on the unlabeled dataset to better generalize to unseen data for our task.
      </p>
      <p>
        BERT has its origins in Semi-supervised Sequence Learning [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Generative Pre-Training [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], ELMo [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and ULMFiT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Unlike previous models, though, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This makes it particularly suitable for our task, as it allows us to input training text as-is, without imposing predefined and possibly biased features or setting hyper-parameters that would require further analysis.
      </p>
      <p>
        Context-free models such as word2vec [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and GloVe [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] generate a single word embedding for each word in the vocabulary; that is, the same word used in different contexts has the same representation. The BERT model, in contrast, accounts for the context of a sentence using the words both preceding and following a given word. BERT has achieved state-of-the-art performance on eleven NLP tasks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>We present two classification models. The first model is a BERT model for sequence classification, fine-tuned on the unlabeled text corpus. The second model is a CNN model in which we use BERT word embeddings instead of a typical word embedding layer. We use the CNN model to visualize which parts of a sentence contribute to the classification.</p>
      <sec id="sec-3-1">
        <title>Model 1: BERT model with fine-tuning</title>
        <p>
          The BERT-base model has 12 layers with a hidden size of 768 and 12 self-attention heads. This model is pre-trained on BooksCorpus (800M words) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and English Wikipedia (2,500M words). We fine-tune the BERT model using a Masked Language Model (LM) objective and the post and comment data provided as the unlabeled dataset.
        </p>
        <p>For fine-tuning with the Masked LM, we prepare a text corpus with all the posts and comments and feed it to the model. The model masks 15% of the words in each sentence and tries to predict them using the other 85% of the words in the sentence. We thus fine-tune the language model to accurately predict the words in the corpus from the context of the surrounding words. For this task, we use a learning rate of 5e-5, weight decay of 0, Adam epsilon of 1e-8 and gradient clipping at 1.0. We train the Masked LM for 1 epoch.</p>
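        <p>A rough sketch of this Masked LM fine-tuning step, assuming the HuggingFace Transformers library (exact class names and arguments vary across library versions; the corpus file name is a placeholder), is given below.</p>
        <preformat>
# Hedged sketch of Masked-LM fine-tuning on the unlabeled posts/comments
# (HuggingFace Transformers assumed; "unlabeled_corpus.txt" is a placeholder).
from transformers import (BertForMaskedLM, BertTokenizer,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="unlabeled_corpus.txt",  # posts + comments
                                block_size=200)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="bert-mlm-finetuned",
                         num_train_epochs=1,           # 1 epoch, as in the text
                         learning_rate=5e-5,
                         weight_decay=0.0,
                         adam_epsilon=1e-8,
                         max_grad_norm=1.0)            # gradient clipping at 1.0
Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=collator).train()
</preformat>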
        <p>Sentence tokenization. We add [CLS] and [SEP] tokens at the start and end of every sentence. These tokens are used in the BERT model to indicate the start and end of a sequence. We then generate word indices for all the tokens in the sentence using a BERT tokenizer.</p>
        <p>Sentence padding. The sentences in the training and test data have an average length of 17 words and a maximum length of 171 words. We pad all sentences to 200 tokens in length to avoid truncating the tokenized sentences.</p>
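        <p>The tokenization and padding steps can be sketched as follows (a simplified illustration assuming the HuggingFace BERT tokenizer, padding with token id 0, i.e. [PAD], up to 200 tokens).</p>
        <preformat>
# Simplified sketch of sentence tokenization and padding (HuggingFace BERT
# tokenizer assumed; pads every sentence to 200 token ids).
from transformers import BertTokenizer

MAX_LEN = 200
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode(sentence):
    tokens = ["[CLS]"] + tokenizer.tokenize(sentence) + ["[SEP]"]
    ids = tokenizer.convert_tokens_to_ids(tokens)
    attention_mask = [1] * len(ids) + [0] * (MAX_LEN - len(ids))
    ids = ids + [0] * (MAX_LEN - len(ids))   # 0 is the [PAD] token id
    return ids, attention_mask

ids, mask = encode("I hope this chapter results in a better, healthier you!")
</preformat>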
        <p>Sentence classification. For sentence classification, we add a linear layer on top of the pooled output of the [CLS] token. We then apply softmax to the output of the linear layer to obtain the classification probability for each label.</p>
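        <p>A minimal sketch of this classification head, assuming PyTorch and the HuggingFace BertModel (class and variable names are illustrative), is shown below.</p>
        <preformat>
# Sketch of the classifier: a linear layer over BERT's pooled [CLS] output,
# followed by softmax (PyTorch + HuggingFace assumed; names are illustrative).
import torch
import torch.nn as nn
from transformers import BertModel

class BertSentenceClassifier(nn.Module):
    def __init__(self, num_labels=6, hidden_size=768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs[1]                     # pooled output of the [CLS] token
        logits = self.classifier(pooled)
        return torch.softmax(logits, dim=-1)    # classification probabilities
</preformat>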
        <p>Training parameters are set according to Table 2. We measure the cross-entropy loss for each mini-batch and back-propagate the loss.</p>
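        <p>For illustration, one training step with the mini-batch cross-entropy loss looks roughly as follows (toy stand-in model and data; the values in Table 2 are not reproduced here, so the optimizer settings are placeholders).</p>
        <preformat>
# Illustrative training step: cross-entropy loss on a mini-batch, then
# back-propagation (toy stand-in model; optimizer settings are placeholders).
import torch
import torch.nn as nn

model = nn.Linear(768, 6)                      # stand-in for the classifier head
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

pooled = torch.randn(32, 768)                  # toy mini-batch of pooled outputs
labels = torch.randint(0, 6, (32,))
loss = criterion(model(pooled), labels)        # cross-entropy for the mini-batch
loss.backward()                                # back-propagate the loss
optimizer.step()
optimizer.zero_grad()
</preformat>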
      </sec>
      <sec id="sec-3-2">
        <title>Model 2: CNN with BERT embedding</title>
        <p>
          As an alternative model, we deploy the CNN model presented in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] for text classification. The intuition behind this model is to process windows of words in the contextual representation of a sentence and aggregate the decision from all the windows. For this purpose we represent the sentence as a matrix and design convolution kernels of different sizes, each size corresponding to a window size in words.
        </p>
        <p>We tokenize the sentences in the same fashion as for the BERT model. If a tokenized sentence has n tokens, where n &lt; 200, we pad it with zero tokens to a standardized length of N = 200 tokens. We then obtain word representations of the N tokens from the BERT-base pre-trained/fine-tuned model. The BERT-base model has a hidden size of H = 768. Thus we take the representation of a sentence to be an N × H matrix.</p>
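        <p>Building the N × H sentence matrix from BERT token representations can be sketched as follows (HuggingFace assumed; the last hidden layer of BERT-base supplies the 768-dimensional token vectors).</p>
        <preformat>
# Sketch of building the N x H sentence matrix from BERT token representations
# (HuggingFace assumed; N = 200, H = 768 for BERT-base).
import torch
from transformers import BertModel, BertTokenizer

N = 200
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer("I hope this chapter results in a better, healthier you!",
                padding="max_length", max_length=N, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**enc)[0]           # last hidden states, shape (1, N, 768)
sentence_matrix = hidden.squeeze(0)   # N x H input to the CNN
</preformat>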
        <p>We use 4 types of convolution kernels, with filter sizes 2 through 5 and filter width equal to the length of the word representations, H. Suppose a kernel w<sup>h</sup> with filter size h is convolved with the sentence A over the window from the i-th token to the (i + h - 1)-th token. We obtain the kernel output o<sup>h</sup><sub>i</sub> as o<sup>h</sup><sub>i</sub> = w<sup>h</sup> · A[i : i + h - 1], where o<sup>h</sup> is the convolution output of kernel w<sup>h</sup> on sentence A. Adding a bias term b<sup>h</sup> and an activation function f to each o<sup>h</sup><sub>i</sub>, we have c<sup>h</sup><sub>i</sub> = f(o<sup>h</sup><sub>i</sub> + b<sup>h</sup>).</p>
        <p>In this way, for each kernel we obtain an N × 1 output c<sup>h</sup> after a padded convolution and activation. The output c<sup>h</sup><sub>i</sub> is an indicator of the weight of window i of length h for the classification task. The output of the convolution layer can also be visualized as a heat map for the sentence classification task.</p>
        <p>To maximize learning, we use 12 kernels: 3 of each kernel size. We apply max-pooling on each of the convolution outputs to obtain a pooled output for each of the kernels. Finally, a linear layer is applied on the pooled outputs for classification. Loss is measured using the cross-entropy loss function and back-propagated for training. We use an Adam optimizer with a learning rate of 1e-3 and a batch size of 32, train for 30 epochs, and choose the model with the best validation F1-score.</p>
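        <p>A hedged sketch of this CNN classifier over BERT embeddings (PyTorch assumed; layer and variable names are illustrative) is given below.</p>
        <preformat>
# Hedged sketch of the CNN over BERT embeddings: kernel sizes 2-5, three
# kernels per size, max-pooling, then a linear layer (names are illustrative).
import torch
import torch.nn as nn

class BertEmbeddingCNN(nn.Module):
    def __init__(self, num_labels=6, hidden=768, kernel_sizes=(2, 3, 4, 5),
                 kernels_per_size=3):
        super().__init__()
        # One Conv2d per window size h: the kernel spans h tokens by the full
        # embedding width H, with padding along the token axis.
        self.convs = nn.ModuleList([
            nn.Conv2d(1, kernels_per_size, (h, hidden), padding=(h - 1, 0))
            for h in kernel_sizes
        ])
        self.relu = nn.ReLU()
        self.classifier = nn.Linear(kernels_per_size * len(kernel_sizes), num_labels)

    def forward(self, bert_embeddings):             # (batch, N, H) from BERT
        x = bert_embeddings.unsqueeze(1)            # (batch, 1, N, H)
        pooled = []
        for conv in self.convs:
            c = self.relu(conv(x)).squeeze(3)       # (batch, kernels, windows)
            pooled.append(torch.max(c, dim=2).values)   # max-pool over windows
        return self.classifier(torch.cat(pooled, dim=1))

# toy usage with random stand-ins for BERT embeddings
model = BertEmbeddingCNN()
logits = model(torch.randn(2, 200, 768))
</preformat>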
        <p>Visualization of activations. We visualize activations in the CNN network during classification. Given a sentence A, we pass the input as a single-element batch to the model and extract the activations c<sup>h</sup> for each window size h. In Figure 6, we plot the heat map for each word. For window size h and word A<sub>i</sub>, we measure the heat-map value as the sum of activations of all the windows in which the word A<sub>i</sub> is included. We then normalize the heat maps across the sentence to plot activations with respect to window size h.</p>
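        <p>The per-word heat-map computation can be sketched as follows (a toy example, without the plotting step).</p>
        <preformat>
# Sketch of the per-word heat map for one window size h: each word's score is
# the sum of activations of all windows containing it, normalized over the
# sentence (toy activations, no plotting).
import torch

n, h = 6, 3                              # toy sentence length and window size
c_h = torch.rand(n - h + 1)              # one activation per window of size h

heat = torch.zeros(n)
for i in range(n - h + 1):               # window i covers words i .. i+h-1
    heat[i:i + h] += c_h[i]
heat = heat / heat.sum()                 # normalize across the sentence
</preformat>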
        <p>We measure accuracy against our labeled dataset using standard metrics, i.e., precision, recall, F1-score and AuROC.</p>
        <p>To avoid misleadingly high accuracy on the imbalanced dataset, we measure precision and recall of only the true (positive) values for each label. AuROC measures how well the model is able to distinguish true labels from false labels.</p>
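        <p>For one label, these metrics can be computed roughly as follows (scikit-learn assumed; the arrays are toy placeholders).</p>
        <preformat>
# Sketch of per-label evaluation: precision/recall/F1 of the positive class
# only, plus AuROC (scikit-learn assumed; toy predictions).
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.7, 0.4, 0.1, 0.3, 0.5])
y_pred = (y_prob >= 0.5).astype(int)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1)
auroc = roc_auc_score(y_true, y_prob)
</preformat>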
        <p>Model 1 results. Results for our BERT model are reported in Table 4. As shown, BERT provides a mean F1-score of 0.525 with mean precision 0.45 and mean recall 0.655. The Emotional Disclosure, Information Disclosure and Support labels had no data imbalance problem, thus accounting for reasonable precision. The model performs well on these three labels, with a mean F1-score of 0.60.</p>
        <p>However, the General Support, Information Support and Emotional Support labels have high data imbalance, with approximately 10% true labels. Low precision for these labels indicates a high false positive rate. The model performs very poorly on General Support. This is to be expected, given the difficulty of distinguishing catchphrases and quotes from ordinary text. Performing weighted sampling based on the labels increased recall but considerably decreased precision.</p>
        <p>Model 2 results. The CNN model provides a mean F1-score of 0.485 with precision 0.417 and recall 0.592. The Emotional Disclosure, Information Disclosure and Support labels had no data imbalance problem, thus accounting for reasonable precision. The model performs well on these three labels, with a mean F1-score of 0.58.</p>
        <p>Comparing the two models, BERT performs better because we fine-tune the model while performing the classification task itself. In contrast, the CNN model takes as input static word embeddings from BERT. We do not propagate loss into the BERT model; we only train the CNN.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this work, we have shown that the use of contextual embeddings performs well on complex sentence classification tasks. We have also tested an alternative model with a CNN classification layer. We found the contextual embeddings performed with comparable accuracy, despite the lack of fine-tuning. We have presented visualizations of the convolution activations for the classification task.</p>
      <p>Currently we use the BERT-base pre-trained model with only 12 layers. However, there are other variations of BERT models with additional parameters. In future work, we plan to rigorously validate model parameters such as the number of layers and to add text pre-processing, e.g., stemming, to further clean the data. We suspect that fine-tuning with pre-processed text data would improve generalization on test data.</p>
      <p>We also plan to dissect bi-directional transformers like BERT and XLNet to enhance contextual word representations for tasks specific to disclosure and support classification in spoken dialogue systems. Finally, we would like to model the effects of peer influence on self-disclosure.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Zhilin</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          and William W. Cohen.
          <article-title>Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks</article-title>
          ,
          <year>2017</year>
          ; arXiv:
          <fpage>1703</fpage>
          .
          <fpage>06345</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <source>BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding</source>
          ,
          <year>2018</year>
          ; arXiv:
          <year>1810</year>
          .04805.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Radford</surname>
          </string-name>
          ,
          <article-title>Alec and Wu, Je and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya: Language Models are Unsupervised Multitask Learners (</article-title>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Wolf</surname>
          </string-name>
          , Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz and Jamie Brew.
          <article-title>HuggingFace's Transformers: State-of-the-art</article-title>
          <source>Natural Language Processing</source>
          ,
          <year>2019</year>
          ; arXiv:
          <year>1910</year>
          .03771.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Yaru</given-names>
            <surname>Hao</surname>
          </string-name>
          , Li Dong,
          <source>Furu Wei and Ke Xu. Visualizing and Understanding the E ectiveness of BERT</source>
          ,
          <year>2019</year>
          ; arXiv:
          <year>1908</year>
          .05620.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Matthew</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Peters</surname>
            , Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
            <given-names>Kenton</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            and
            <given-names>Luke</given-names>
          </string-name>
          <string-name>
            <surname>Zettlemoyer</surname>
          </string-name>
          .
          <source>Deep contextualized word representations</source>
          ,
          <year>2018</year>
          ; arXiv:
          <year>1802</year>
          .05365.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Jeremy</given-names>
            <surname>Howard</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ruder</surname>
          </string-name>
          .
          <article-title>Universal Language Model Fine-tuning for Text Classi cation</article-title>
          ,
          <year>2018</year>
          ; arXiv:
          <year>1801</year>
          .06146.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Andrew</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Dai</surname>
            and
            <given-names>Quoc V.</given-names>
          </string-name>
          <string-name>
            <surname>Le.</surname>
          </string-name>
          Semi-supervised
          <source>Sequence Learning</source>
          ,
          <year>2015</year>
          ; arXiv:
          <fpage>1511</fpage>
          .
          <fpage>01432</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Ye</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Byron</given-names>
            <surname>Wallace</surname>
          </string-name>
          .
          <article-title>A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classi cation</article-title>
          ,
          <year>2015</year>
          ; arXiv:
          <fpage>1510</fpage>
          .
          <fpage>03820</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Yukun</surname>
            <given-names>Zhu</given-names>
          </string-name>
          , Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba and
          <string-name>
            <given-names>Sanja</given-names>
            <surname>Fidler</surname>
          </string-name>
          .
          <article-title>Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies</article-title>
          and Reading Books,
          <year>2015</year>
          ; arXiv:
          <fpage>1506</fpage>
          .
          <fpage>06724</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Tomas</surname>
            <given-names>Mikolov</given-names>
          </string-name>
          , Kai Chen,
          <source>Greg Corrado and Je rey Dean. E cient Estimation of Word Representations in Vector Space</source>
          ,
          <year>2013</year>
          ; arXiv:
          <fpage>1301</fpage>
          .
          <fpage>3781</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Sakakibara</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>1992</year>
          ).
          <article-title>E cient Learning of Context-Free Grammars from Positive Structural Examples</article-title>
          .
          <source>Inf. Comput.</source>
          ,
          <volume>97</volume>
          ,
          <fpage>23</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Alec Radford: Improving Language Understanding by Generative Pre-Training</surname>
          </string-name>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>