<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>[CL-Aff Shared Task] Detecting Disclosure and Support via Deep Multi-Task Learning</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Ottawa</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a novel way of deploying deep multi-task learning models for the task of detecting disclosure and support. We compute all possible logical relations among the six labels and represent them in a Venn diagram. Based on it, the six labels are distributed over multiple fragment clusters, and a multi-task deep neural network is then built on these clusters.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Multi-Task Learning</kwd>
        <kwd>Word Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Deep Learning (DL) has achieved great success in many fields, including, but
not limited to, natural language processing, computer vision, and speech
recognition. However, there are still many limitations and challenges related to training DL
models, such as overfitting, hyperparameter optimization, long training times,
and high memory usage.</p>
      <p>
        Even if we do not consider the high demands for computing power, there
are still interesting techniques in classical neural network models that can
improve performance, such as deep multi-task learning structures. Multi-task
learning (MTL), particularly with deep neural networks, can not only reduce
the risk of overfitting, but also improve the results for each task, compared with
single-task learning [<xref ref-type="bibr" rid="ref7">7</xref>].
      </p>
      <p>
        Turning to the 2020 CL-Aff Shared Task [<xref ref-type="bibr" rid="ref3">3</xref>], the inspiration for
this shared task is the growing interest in understanding how humans initiate
and hold conversations. We want to know people's reactions, both in terms of
emotion and information. As Task 2 is an open-ended problem, we focus only
on Task 1 in this paper.
      </p>
      <p>For Task 1, the OffMyChest conversation dataset is provided. About twelve
thousand samples (12,860 in the training set) are included in this dataset, and each entry contains a sentence and
six binary labels: Information disclosure, Emotion disclosure, Support,
General support, Info support, and Emo support.
The distribution of the six labels in the training data is shown in Table 1. We can see
that for all the labels the negative class accounts for a high proportion, especially
for the label General support, where the negative class reaches a proportion
of 94.7%. This tells us to pay attention to the class weights during training.
Nonetheless, it is likely that the result on the label General support will be
among the lowest, since it has the highest class imbalance.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Distribution of the six labels in the training data.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Label</th><th>True</th><th>False</th></tr>
          </thead>
          <tbody>
            <tr><td>Emotional disclosure</td><td>3948</td><td>8912</td></tr>
            <tr><td>Information disclosure</td><td>4891</td><td>7969</td></tr>
            <tr><td>Support</td><td>3226</td><td>9634</td></tr>
            <tr><td>General support</td><td>680</td><td>12180</td></tr>
            <tr><td>Info support</td><td>1250</td><td>11610</td></tr>
            <tr><td>Emo support</td><td>1006</td><td>11854</td></tr>
          </tbody>
        </table>
      </table-wrap>
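      <p>As an illustration (not part of the original system), the following minimal sketch shows one common way per-label class weights could be derived from the counts in Table 1; the dictionary keys and the inverse-frequency weighting scheme are assumptions for this example.</p>
      <preformat>
# Sketch: per-label class weights from the Table 1 counts.
label_counts = {
    "emotional_disclosure": (3948, 8912),    # (positive, negative)
    "information_disclosure": (4891, 7969),
    "support": (3226, 9634),
    "general_support": (680, 12180),
    "info_support": (1250, 11610),
    "emo_support": (1006, 11854),
}

# Weight each class inversely to its frequency: w_c = total / (2 * count_c).
class_weights = {}
for label, (pos, neg) in label_counts.items():
    total = pos + neg
    class_weights[label] = {0: total / (2 * neg), 1: total / (2 * pos)}

print(class_weights["general_support"])  # the most imbalanced label
      </preformat>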
      <p>Table 2 shows the token analysis of the training dataset. As shown in the
table, although the maximum token length reaches 171, 95% of sentences
have a length of no more than 34. Therefore, the target length used in sentence
preprocessing is 34. After retrieving the word embedding vectors, all sentences are
transformed into vectors with shape (34 × embedding dimension).</p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption>
          <p>Token statistics of the training dataset.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Statistic</th><th>Value</th></tr>
          </thead>
          <tbody>
            <tr><td>Max token length</td><td>171</td></tr>
            <tr><td>Min token length</td><td>1</td></tr>
            <tr><td>Mean token length</td><td>15.07</td></tr>
            <tr><td>Median token length</td><td>13</td></tr>
            <tr><td>Number of unique tokens</td><td>11460</td></tr>
            <tr><td>Total number of tokens</td><td>193837</td></tr>
            <tr><td>Length that covers 95% of sentences</td><td>34</td></tr>
          </tbody>
        </table>
      </table-wrap>
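      <p>For reference, the statistics in Table 2 can be reproduced with a few lines of NumPy; the sketch below assumes a variable named tokenized that holds one token list per sentence.</p>
      <preformat>
import numpy as np

# `tokenized` is assumed to be a list of token lists, one per sentence.
lengths = np.array([len(tokens) for tokens in tokenized])

print("Max token length:", lengths.max())
print("Mean token length:", round(lengths.mean(), 2))
print("Median token length:", np.median(lengths))
# Length that covers 95% of sentences (34 for our data):
print("95th percentile:", int(np.percentile(lengths, 95)))
      </preformat>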
    </sec>
    <sec id="sec-2">
      <title>Word Embedding</title>
      <p>
        Preprocessing steps included standard operations such as splitting the sentences
into words, removing all punctuation marks, transforming the sentences into
sequences, and padding them to the same length of 34. The word embedding method
we chose is BERT [<xref ref-type="bibr" rid="ref2">2</xref>]. Unfortunately, as the GPU we had available did not
have sufficient memory, we could not use the BERT embedding as a layer in
the model. Instead, we used bert-as-service [<xref ref-type="bibr" rid="ref9">9</xref>] to compute the word embeddings
beforehand. At the cost of not updating the embedding values during training, this allowed us to use less
memory and reduced the computation time. We used the default configuration
and the bert-as-service configuration called "ELMo-like contextual word
embedding". The former aims to generate sentence embeddings, while the latter
creates embeddings with a shape similar to the ELMo embeddings [<xref ref-type="bibr" rid="ref6">6</xref>];
in other words, it generates a separate embedding for each word in the padded
sentences. We obtained two word embedding files, one with shape 12,860 × 1,024
and the second one with shape 12,860 × 34 × 1,024. Here, 12,860 is the number of
instances in the training set, 34 is the target length we defined for each
sentence, and 1,024 is the dimension of each word (or of the whole sentence in
the default configuration).
      </p>
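      <p>A minimal sketch of how the two embedding files could be generated with bert-as-service is shown below; the model directory, the server flags shown in the comments, and the sentences variable are assumptions, and the exact setup we used may differ.</p>
      <preformat>
from bert_serving.client import BertClient
import numpy as np

# `sentences` is assumed to be the list of 12,860 training sentences.

# Default configuration: the server is started with something like
#   bert-serving-start -model_dir /path/to/bert_large -max_seq_len 34
# and returns one 1,024-dimensional vector per sentence, i.e. (12860, 1024).
bc = BertClient()
sentence_vecs = bc.encode(sentences)
np.save("sentence_embeddings.npy", sentence_vecs)

# "ELMo-like contextual word embedding" configuration: the server is
# restarted with -pooling_strategy NONE, so one vector is returned per
# token position, i.e. (12860, 34, 1024) after padding/truncation.
token_vecs = bc.encode(sentences)
np.save("token_embeddings.npy", token_vecs)
      </preformat>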
    </sec>
    <sec id="sec-3">
      <title>Models</title>
      <p>
        In the model, we want to fully utilize the power of neural networks for
multi-task learning with hard parameter sharing [<xref ref-type="bibr" rid="ref7">7</xref>].
      </p>
      <p>
        In general, when training a model on a task using noisy datasets, we need
to ignore the data-dependent noise and learn good patterns based on other
features. Because different tasks have different noise patterns, a model trained
on multiple tasks can learn a more general representation and average out
the noise patterns across the different tasks [<xref ref-type="bibr" rid="ref7">7</xref>].
      </p>
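      <p>As a minimal sketch of hard parameter sharing (illustrative only; the layer sizes and output names are not those of our final model), all six labels can share a common trunk and keep only small task-specific output heads, e.g. in Keras:</p>
      <preformat>
from tensorflow import keras
from tensorflow.keras import layers

# One shared trunk, one sigmoid head per label (hard parameter sharing).
inputs = keras.Input(shape=(1024,))            # sentence-level BERT vector
shared = layers.Dense(256)(inputs)
shared = layers.LeakyReLU()(shared)

label_names = ["emo_disc", "info_disc", "support",
               "general_sup", "info_sup", "emo_sup"]
outputs = [layers.Dense(1, activation="sigmoid", name=name)(shared)
           for name in label_names]
model = keras.Model(inputs, outputs)
      </preformat>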
      <p>
        Furthermore, as similar tasks have similar patterns, we want their
task-specific layers to be at a closer position in the model, compared with other
tasks. For example, among the six labels of our shared task, it is easy to imagine
that the label Support has a strong relationship with the labels General support,
Info support and Emo support, as they all refer to something about support.
Considering each label as a set, which contains the entries in the training data where the
corresponding label is 1, the relationship among these four labels can be described by
the Venn diagram in Fig. 1. The numbers on the graph show the sizes of the
intersections between or among the sets. A Venn diagram [<xref ref-type="bibr" rid="ref1">1</xref>] is a simple diagram used
to represent unions and intersections of sets. However, depending on the number of
sets, it can become extremely complicated, as we will see below.
      </p>
      <p>From Fig. 1, we can see that the label Support covers almost all the cases
in the other three sets, except for a few trivial ones. How, then, can we reflect this
relationship in the neural network? Because the label Support covers a more
general concept, it should be treated at a lower layer. The other three labels
refine information from the Support layer, using part of its information, while
sharing some neurons among themselves, as shown in Fig. 2. The bottom of
Fig. 2 is a large dense layer, which is split into several parts; we call it a
fragment layer. Above the fragment layer in Fig. 2 are the territories of the four labels;
this means that the label-corresponding task-specific layers will only connect to
their specific territories (neurons).</p>
      <p>Each label's territory has some overlap with other labels' territories, and the
label Support occupies the whole layer, from the leftmost node to the rightmost
node.</p>
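      <p>The fragment-layer idea for the four support-related labels can be sketched as follows; the slice boundaries are invented for illustration (in Fig. 2 they follow the Venn-diagram intersections), and only the label Support spans the whole fragment layer.</p>
      <preformat>
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(1024,))
fragment = layers.Dense(120)(inputs)           # the shared fragment layer
fragment = layers.LeakyReLU()(fragment)

# Each label's "territory" is a span of the fragment layer; spans overlap,
# and Support occupies the whole layer.
support_t = layers.Lambda(lambda t: t[:, 0:120])(fragment)
general_t = layers.Lambda(lambda t: t[:, 0:50])(fragment)
info_t    = layers.Lambda(lambda t: t[:, 30:90])(fragment)
emo_t     = layers.Lambda(lambda t: t[:, 70:120])(fragment)

outputs = [
    layers.Dense(1, activation="sigmoid", name="support")(support_t),
    layers.Dense(1, activation="sigmoid", name="general_sup")(general_t),
    layers.Dense(1, activation="sigmoid", name="info_sup")(info_t),
    layers.Dense(1, activation="sigmoid", name="emo_sup")(emo_t),
]
model = keras.Model(inputs, outputs)
      </preformat>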
      <p>The example above covers only four labels. Nevertheless, Support contains almost
all of the other three, which means that intersections only appear among three
layers. What about the six labels in our task? The Venn diagram is far less clear,
as shown in Fig. 3. Discarding the pieces in the figure whose size is too small
(intersections comprising fewer than 10 instances), there are in total 31 major
intersections in the Venn diagram. For the six labels, each label is composed of
15, 16, 28, 12, 12, and 15 intersections, respectively.</p>
      <p>Roughly, from the bottom to the top, the network contains the input layer,
shared hidden layers, task-specific layers and task-specific outputs. The
connection between the shared hidden layers and the task-specific layers is based on the
fragments and the Venn diagrams, as shown above.</p>
    </sec>
    <sec id="sec-4">
      <title>Experiment</title>
      <sec id="sec-4-1">
        <title>Structure</title>
        <p>As we have two types of embeddings, of shape (12,860 × 1,024) and shape (12,860 ×
34 × 1,024), we tried two types of models in the experiment.</p>
        <p>For the data with shape (12,860 × 1,024), from the bottom to the top, we have
an input layer, fragment dense layers, concatenate layers and output layers. As we
mentioned above, the embedding layer is not included in the model because the
word embeddings are already generated in preprocessing, using bert-as-service.</p>
        <p>
          For the data with shape (12,860 × 34 × 1,024), from the bottom to the top, the
model is composed of an input layer, a bidirectional LSTM layer, fragment dense
layers, a concatenate layer, an attention layer [<xref ref-type="bibr" rid="ref8">8</xref>], a flatten layer and an output layer.
        </p>
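        <p>A rough sketch of the second model is given below. The layer sizes and the number of fragments are illustrative, the task-specific wiring of fragments to outputs is omitted for brevity, and Keras' built-in Attention layer is used as a stand-in for the attention mechanism of [8].</p>
        <preformat>
from tensorflow import keras
from tensorflow.keras import layers

# Token-level embeddings: (34 tokens, 1,024 dimensions) per sentence.
inputs = keras.Input(shape=(34, 1024))
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(inputs)

# Fragment dense layers, then concatenation into one shared representation.
fragments = [layers.Dense(32)(x) for _ in range(4)]
shared = layers.Concatenate()(fragments)

attended = layers.Attention()([shared, shared])   # self-attention
flat = layers.Flatten()(attended)

label_names = ["emo_disc", "info_disc", "support",
               "general_sup", "info_sup", "emo_sup"]
outputs = [layers.Dense(1, activation="sigmoid", name=name)(flat)
           for name in label_names]
model = keras.Model(inputs, outputs)
        </preformat>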
      </sec>
      <sec id="sec-4-2">
        <title>Training</title>
        <p>
          During training, we use mini-batch gradient descent with a batch size of at least 512, and
the Adam optimizer [<xref ref-type="bibr" rid="ref4">4</xref>] is used with a learning rate of 0.0001. The loss function
is binary cross-entropy, and the activation functions used in the model are mostly leaky
ReLU [<xref ref-type="bibr" rid="ref10">10</xref>], except for the output layers, which use the sigmoid function.
        </p>
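        <p>A sketch of this training setup in Keras is shown below; the variable names for the data arrays are assumptions, and model refers to a multi-task model such as the ones sketched earlier.</p>
        <preformat>
from tensorflow import keras

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",            # applied to every output head
    metrics=["accuracy"],
)

# y_train is assumed to be a list with one binary label array per output head.
model.fit(
    x_train,
    y_train,
    batch_size=512,
    epochs=20,
    validation_data=(x_val, y_val),
)
        </preformat>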
      </sec>
      <sec id="sec-4-3">
        <title>Results and Parameter Description</title>
        <p>We evaluate the performance of the models on the provided training dataset. The split ratio
we used is 0.6:0.2:0.2, which means 60% of the data is used for training, 20% for
validation and 20% for testing.</p>
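        <p>The 0.6:0.2:0.2 split can be obtained, for example, with two consecutive calls to scikit-learn's train_test_split; X and Y below are assumed to hold the embeddings and the six-column label matrix.</p>
        <preformat>
from sklearn.model_selection import train_test_split

# First split off 40%, then split that part half-and-half into validation and test.
X_train, X_tmp, Y_train, Y_tmp = train_test_split(X, Y, test_size=0.4,
                                                  random_state=0)
X_val, X_test, Y_val, Y_test = train_test_split(X_tmp, Y_tmp, test_size=0.5,
                                                random_state=0)
        </preformat>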
        <p>Table 3 shows the results of the two models. The parameters used in Model 1
are: a learning rate of 2e-5, 20 epochs and a mini-batch size of
1024. The corresponding file name in the submission is system runs uottawa1. The
parameters used in Model 2 are: a learning rate of 2e-5, 20 epochs and a mini-batch
size of 512. The corresponding file name in the submission is
system runs uottawa2.</p>
        <p>From the table, we can see that General support always has the worst result.
This is reasonable, considering the imbalance of this label, for which the ratio of positive
to negative cases is 680:12,180.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we presented a multi-task deep learning model. Our model achieves
a reasonable result on some of the labels, but not all, and especially not on
General support. The reason is that General support classifies quotes and
catchphrases, which have less distinctive features than the Emotional or
Information labels, while also having fewer positive cases in the dataset.</p>
      <p>
        During the experiments, we tried several other methods for training the models,
for instance, using LIWC [<xref ref-type="bibr" rid="ref5">5</xref>] as an auxiliary input/output to assist the main tasks.
ELMo embeddings and GloVe embeddings were also tried, as well as a combination
using ELMo. A transformer as the classification model was also tested. Unfortunately,
none of these improved the performance.
      </p>
      <p>One possible direction of further work for this task is to make use of the large
amount of unlabeled data provided. An idea here is to use it to find the different patterns in
the texts in order to split them into several groups, and then to train models on
each group. Sentences can have different patterns and structures, which require
different mapping functions in the network. If we can separate them into clusters
in which sentences have similar patterns, there might be an improvement in the
classification results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Venn</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>XXV. On the diagrammatic and mechanical representation of propositions and reasonings</article-title>
          .
          <source>The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science</source>
          <volume>10</volume>
          (
          <issue>61</issue>
          ),
          <fpage>168</fpage>
          -
          <lpage>171</lpage>
          (
          <year>1880</year>
          ). https://doi.org/10.1080/14786448008626913
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          . CoRR abs/1810.04805 (
          <year>2018</year>
          ), http://arxiv.org/abs/1810.04805
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jaidka</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chhaya</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ungar</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>A report of the CL-Aff OffMyChest Shared Task at the Affective Content Workshop @ AAAI</article-title>
          .
          <source>In: Proceedings of the 3rd Workshop on Affective Content Analysis @ AAAI (AffCon2020)</source>
          . New York, New York (
          <year>February 2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>CoRR abs/1412</source>
          .6980 (
          <year>2014</year>
          ), http://arxiv.org/abs/1412.6980
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boyd</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blackburn</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>The development and psychometric properties of LIWC2015</article-title>
          (September
          <year>2015</year>
          ). https://doi.org/10.15781/T29G6Z
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>In: Proc. of NAACL</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>An Overview of Multi-Task Learning in Deep Neural Networks</article-title>
          .
          <source>ArXiv e-prints arXiv:1706.05098 (Jun</source>
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polosukhin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Attention is all you need</article-title>
          .
          <source>CoRR abs/1706</source>
          .03762 (
          <year>2017</year>
          ), http://arxiv.org/abs/1706.03762
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>bert-as-service</article-title>
          . https://github.com/hanxiao/bert-as-service (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Empirical evaluation of rectified activations in convolutional network</article-title>
          .
          <source>CoRR abs/1505</source>
          .00853 (
          <year>2015</year>
          ), http://arxiv.org/abs/1505.00853
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>