[CL-Aff Shared Task] Detecting Disclosure and Support via Deep Multi-Task Learning

Weizhao Xin[0000-0003-1712-2218] and Diana Inkpen[0000-0002-0202-2444]
University of Ottawa
{wxin074,diana.inkpen}@uottawa.ca

Abstract. We propose a novel way of deploying deep multi-task learning models for the task of detecting disclosure and support. We calculate all possible logical relations among the six labels and represent them in a Venn diagram. Based on this diagram, the six labels are distributed over multiple fragment clusters, and a multi-task deep neural network is built on top of these groups.

Keywords: Deep Multi-Task Learning · Natural Language Processing · Word Embeddings

1 Introduction

Deep Learning (DL) has achieved great success in many fields, including, but not limited to, natural language processing, computer vision, and speech recognition. However, training DL models still comes with many limitations and challenges, such as overfitting, hyperparameter optimization, long training times, and high memory usage. Even setting aside the high demand for computing power, there are still interesting techniques within classical neural network models that can improve performance, such as deep multi-task learning structures. Multi-task learning (MTL), particularly with deep neural networks, can not only reduce the risk of overfitting, but also improve the results for each task compared with single-task learning [7].

The 2020 CL-Aff Shared Task [3] is inspired by the growing interest in understanding how humans initiate and hold conversations: we want to know people's reactions, both in terms of emotion and of information. As Task 2 is an open-ended problem, we focus only on Task 1 in this paper.

For Task 1, the OffMyChest conversation dataset is provided. It contains twelve thousand samples, and each entry consists of a sentence and six binary labels: Emotional disclosure, Information disclosure, Support, General support, Info support, and Emo support.

2 Data Preprocessing

The distribution of the six labels in the training data is shown in Table 1. For all labels, the negative class accounts for a high proportion of the data, especially for the label General support, where it reaches 94.7%. This tells us to pay attention to the class weights during training. Even so, the results for General support will likely be among the lowest, since it has the highest class imbalance.

Table 1. Data Distribution in the Training Set

Label                     True    False
Emotional disclosure      3948     8912
Information disclosure    4891     7969
Support                   3226     9634
General support            680    12180
Info support              1250    11610
Emo support               1006    11854

Table 2 shows the token analysis of the training dataset. Although the maximum token length reaches 171, 95% of the sentences contain no more than 34 tokens, so we set the target length for sentence preprocessing to 34. After retrieving the word embedding vectors, every sentence is transformed into a matrix of shape (34 × embedding dimension).

Table 2. Word Preprocessing Results

Statistic                               Value
Max token length                        171
Min token length                        1
Mean token length                       15.07
Median token length                     13
Number of unique tokens                 11460
Total number of tokens                  193837
Length that covers 95% of sentences     34
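The short sketch below illustrates how such statistics and per-label class weights can be computed. It is a minimal example only: the file name, the column names and the use of pandas and the NLTK tokenizer are assumptions made for illustration, not the actual shared-task preprocessing code.

    import numpy as np
    import pandas as pd
    from nltk.tokenize import word_tokenize  # assumed tokenizer

    # Assumed column names; the real CSV may use different ones.
    LABELS = ["Emotional_disclosure", "Information_disclosure", "Support",
              "General_support", "Info_support", "Emo_support"]

    df = pd.read_csv("offmychest_train.csv")  # hypothetical file name

    # Token-length statistics (cf. Table 2)
    lengths = df["sentence"].apply(lambda s: len(word_tokenize(str(s))))
    print("max:", lengths.max(), "mean:", round(lengths.mean(), 2),
          "median:", lengths.median())
    print("95th percentile:", int(np.percentile(lengths, 95)))  # target length

    # Inverse-frequency class weights for each binary label (cf. Table 1)
    for label in LABELS:
        pos = int(df[label].sum())
        neg = len(df) - pos
        # rare positive classes (e.g. General support) receive a larger weight
        print(label, {0: len(df) / (2 * neg), 1: len(df) / (2 * pos)})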
3 Word Embedding

The preprocessing steps included standard operations: splitting the sentences into words, removing all punctuation marks, transforming the sentences into sequences, and padding them to the same length of 34. The word embedding method we chose is BERT [2]. Unfortunately, the GPU we had available did not have enough memory to use BERT as an embedding layer inside the model. Instead, we used bert-as-service [9] to compute the word embeddings beforehand. At the cost of not updating the embedding weights during training, this reduced both memory usage and computation time. We used two configurations: the default one, which generates sentence embeddings, and the bert-as-service configuration called "ELMo-like contextual word embedding", which creates embeddings with a shape similar to ELMo embeddings [6], that is, a separate embedding for every word in the padded sentence. We obtained two word embedding files, one with shape 12,860 × 1,024 and the second with shape 12,860 × 34 × 1,024, where 12,860 is the number of instances in the training set, 34 is the target length we defined for each sentence, and 1,024 is the dimension of each word embedding (or of the whole sentence in the default configuration).
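As an illustration, the following sketch shows how the two embedding files could be produced with bert-as-service. The model directory, the output file names and the example sentences are assumptions, and the server is assumed to be started separately with the commands shown in the comments.

    import numpy as np
    from bert_serving.client import BertClient

    sentences = ["I finally told my parents the truth.",
                 "You are doing great, keep going!"]   # toy examples

    # (1) Default server configuration -> one 1,024-dim vector per sentence:
    #     bert-serving-start -model_dir uncased_L-24_H-1024_A-16 -max_seq_len 34
    bc = BertClient()
    sentence_vecs = bc.encode(sentences)      # shape: (n, 1024)
    np.save("bert_sentence_embeddings.npy", sentence_vecs)

    # (2) After restarting the server in the "ELMo-like contextual word
    #     embedding" mode -> one vector per token position:
    #     bert-serving-start -model_dir uncased_L-24_H-1024_A-16 \
    #         -pooling_strategy NONE -max_seq_len 34
    token_vecs = bc.encode(sentences)         # shape: (n, 34, 1024)
    np.save("bert_token_embeddings.npy", token_vecs)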
4 Models

We want to fully utilize the power of neural networks for multi-tasking with hard parameter sharing [7]. In general, when training a model on a noisy dataset, we need it to ignore the data-dependent noise and learn good patterns from the remaining features. Because different tasks have different noise patterns, a model trained on multiple tasks learns a more general representation and averages out the noise patterns across the tasks [7]. Furthermore, since similar tasks share similar patterns, we want their task-specific layers to be placed closer to each other in the model than those of unrelated tasks. For example, among the six labels of our shared task, it is easy to imagine that the label Support is strongly related to the labels General support, Info support and Emo support, since they all refer to support. If we consider each label as a set containing the training entries for which that label is 1, the relationship among these four labels can be described by the Venn diagram in Fig. 1, where the numbers show the sizes of the intersections between the sets. A Venn diagram [1] is a simple diagram used to represent unions and intersections of sets, although it can become extremely complicated as the number of sets grows, as we will see below.

Fig. 1. Venn diagram for the labels Support, General support, Info support and Emo support

From Fig. 1, we can see that the label Support covers almost all the cases in the other three sets, except for a few trivial ones. How can this relationship be reflected in the neural network? Because the label Support covers a more general concept, it should be handled at a lower layer. The other three labels refine the information from the Support layer, each using part of it, while sharing some neurons among themselves, as shown in Fig. 2. At the bottom of Fig. 2 is a large dense layer split into several parts; we call it a fragment layer. Above the fragment layer are the territories of the four labels: each label's task-specific layers connect only to the neurons of its own territory. Each label's territory overlaps with the others', and the label Support occupies the whole layer, from the leftmost node to the rightmost one.

Fig. 2. An example showing how the fragment layer works for Support, General support, Info support and Emo support

The example above covers only four labels, and since Support contains the other three, the intersections involve at most three sets. What about the six labels of our task? Their Venn diagram is far less clear, as shown in Fig. 3. Discarding the pieces whose size is too small (intersections with fewer than 10 instances), there are 31 major intersections in total, and the six labels are composed of 15, 16, 28, 12, 12 and 15 of these intersections, respectively.

Fig. 3. Venn diagram for all six labels

Roughly, from the bottom to the top, the network contains the input layer, the shared hidden layers, the task-specific layers and the task-specific outputs. The connections between the shared hidden layers and the task-specific layers are based on the fragments derived from the Venn diagrams, as described above.

5 Experiments

5.1 Structure

As we have two types of embeddings, with shapes (12860 × 1024) and (12860 × 34 × 1024), we tried two types of models in the experiments. For the data with shape (12860 × 1024), from the bottom to the top, the model consists of an input layer, fragment dense layers, concatenate layers and output layers. As mentioned above, no embedding layer is included in the model, because the word embeddings are already generated during preprocessing with bert-as-service. For the data with shape (12860 × 34 × 1024), from the bottom to the top, the model is composed of an input layer, a bidirectional LSTM layer, fragment dense layers, a concatenate layer, an attention layer [8], a flatten layer and the output layer.

5.2 Training

During training, we use mini-batch gradient descent with a batch size of at least 512, and the Adam optimizer [4] with a learning rate of 0.0001. The loss function is binary cross-entropy, and the activation function used in the model is mostly leaky ReLU [10], except for the output layers, which use the sigmoid function.
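To make the second architecture more concrete, the sketch below shows a minimal Keras-style implementation for the four-label example of Fig. 2. The framework (tf.keras), the layer sizes, the assignment of fragments to labels and the per-head attention are illustrative assumptions; the actual models use six labels and the 31 fragments derived from the Venn diagram, with the hyperparameters listed above.

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    SEQ_LEN, EMB_DIM = 34, 1024

    # Pre-computed BERT token embeddings are fed in directly (no embedding layer).
    inp = layers.Input(shape=(SEQ_LEN, EMB_DIM))
    seq = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(inp)

    # Fragment layer: one dense block per Venn-diagram fragment (sizes illustrative).
    frag_support = layers.Dense(32, activation=tf.nn.leaky_relu)(seq)  # Support only
    frag_general = layers.Dense(32, activation=tf.nn.leaky_relu)(seq)  # Support ∩ General support
    frag_info    = layers.Dense(32, activation=tf.nn.leaky_relu)(seq)  # Support ∩ Info support
    frag_emo     = layers.Dense(32, activation=tf.nn.leaky_relu)(seq)  # Support ∩ Emo support

    def head(name, fragments):
        """Task-specific head connected only to the fragments in its territory."""
        x = layers.Concatenate()(fragments) if len(fragments) > 1 else fragments[0]
        x = layers.Attention()([x, x])   # self-attention over the 34 time steps [8]
        x = layers.Flatten()(x)
        return layers.Dense(1, activation="sigmoid", name=name)(x)

    outputs = [
        head("support",         [frag_support, frag_general, frag_info, frag_emo]),
        head("general_support", [frag_general]),
        head("info_support",    [frag_info]),
        head("emo_support",     [frag_emo]),
    ]

    model = Model(inp, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
                  loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()

The Support head connects to the entire fragment layer, while the other three heads see only their own fragments, mirroring the territories of Fig. 2.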
5.3 Results and Parameter Description

We evaluate the performance of the models on the provided training dataset, using a 0.6:0.2:0.2 split, i.e., 60% of the data for training, 20% for validation and 20% for testing. Table 3 shows the results of the two models. The parameters used in Model 1 are: a learning rate of 2e-5, 20 epochs and a mini-batch size of 1024; the corresponding file name in the submission is system runs uottawa1. The parameters used in Model 2 are: a learning rate of 2e-5, 20 epochs and a mini-batch size of 512; the corresponding file name in the submission is system runs uottawa2.

Table 3. Experiment results of the two models

Model 1 (data with shape 12860 × 1024)

Label                     Accuracy  Precision  Recall  F1
Emotional disclosure      0.6640    0.4697     0.7440  0.5758
Information disclosure    0.7000    0.6048     0.6429  0.6233
Support                   0.8205    0.6153     0.7446  0.6738
General support           0.8975    0.2260     0.3884  0.2857
Info support              0.8982    0.4857     0.5251  0.5046
Emo support               0.9305    0.5081     0.4181  0.4587
Macro scores              0.8184    0.4849     0.5771  0.5203

Model 2 (data with shape 12860 × 34 × 1024)

Label                     Accuracy  Precision  Recall  F1
Emotional disclosure      0.6847    0.4902     0.7133  0.5811
Information disclosure    0.7012    0.6110     0.6215  0.6162
Support                   0.8233    0.6213     0.7436  0.6770
General support           0.9272    0.3023     0.2902  0.2961
Info support              0.8772    0.4177     0.6181  0.4986
Emo support               0.9284    0.4920     0.5151  0.5033
Macro scores              0.8236    0.4890     0.5836  0.5287

From the table, we can see that General support always has the worst results. This is reasonable, considering the imbalance of this label, for which the ratio of positive to negative cases is 680:12,180.
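For completeness, the sketch below shows one way the per-label scores and the macro scores of Table 3 could be computed from the sigmoid outputs. The use of scikit-learn and the 0.5 decision threshold are assumptions, not details taken from the original submission.

    import numpy as np
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    LABELS = ["Emotional disclosure", "Information disclosure", "Support",
              "General support", "Info support", "Emo support"]

    def evaluate(y_true, y_prob, threshold=0.5):
        """Per-label accuracy/precision/recall/F1 and their macro averages."""
        y_pred = (y_prob >= threshold).astype(int)  # binarize the sigmoid outputs
        rows = []
        for i, label in enumerate(LABELS):
            acc = accuracy_score(y_true[:, i], y_pred[:, i])
            p, r, f1, _ = precision_recall_fscore_support(
                y_true[:, i], y_pred[:, i], average="binary", zero_division=0)
            rows.append((label, acc, p, r, f1))
        # Macro scores: average of the per-label metrics, as reported in Table 3.
        rows.append(("Macro scores", *np.mean([row[1:] for row in rows], axis=0)))
        for name, acc, p, r, f1 in rows:
            print(f"{name:24s} {acc:.4f} {p:.4f} {r:.4f} {f1:.4f}")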
6 Conclusion and Future Work

In this paper, we presented a multi-task deep learning model. Our model obtains reasonable results on some of the labels, but not all of them, in particular not on General support. The reason is that General support covers quotes and catchphrases, which have less distinctive features than the Emotional or Information labels, while also having fewer positive cases in the dataset.

During the experiments, we tried several other methods for training the models, for instance using LIWC [5] as an auxiliary input/output to assist the main tasks. ELMo and GloVe embeddings were also tried, as well as a combination using ELMo, and a Transformer as the classification model was also tested. Unfortunately, none of them improved the performance.

One possible direction for future work on this task is to make use of the large unlabeled dataset provided. The idea is to use it to find different patterns in the texts, split them into several groups, and then train a model on each group. Sentences can have different patterns and structures, which require different mapping functions in the network; if we can separate them into clusters in which the sentences have similar patterns, the classification results might improve.

References

1. Venn, J.: XXV. On the diagrammatic and mechanical representation of propositions and reasonings. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 10(61), 168-171 (1880). https://doi.org/10.1080/14786448008626913
2. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018), http://arxiv.org/abs/1810.04805
3. Jaidka, K., Singh, I., Jiahui, L., Chhaya, N., Ungar, L.: A report of the CL-Aff OffMyChest Shared Task at the Affective Content Workshop @ AAAI. In: Proceedings of the 3rd Workshop on Affective Content Analysis @ AAAI (AffCon2020). New York, New York (February 2020)
4. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014), http://arxiv.org/abs/1412.6980
5. Pennebaker, J., Boyd, R., Jordan, K., Blackburn, K.: The development and psychometric properties of LIWC2015 (2015). https://doi.org/10.15781/T29G6Z
6. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: Proc. of NAACL (2018)
7. Ruder, S.: An overview of multi-task learning in deep neural networks. ArXiv e-prints arXiv:1706.05098 (2017)
8. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. CoRR abs/1706.03762 (2017), http://arxiv.org/abs/1706.03762
9. Xiao, H.: bert-as-service. https://github.com/hanxiao/bert-as-service (2018)
10. Xu, B., Wang, N., Chen, T., Li, M.: Empirical evaluation of rectified activations in convolutional network. CoRR abs/1505.00853 (2015), http://arxiv.org/abs/1505.00853