<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>[CL-Aff Shared Task] Happiness Ingredients Detection using Multi-Task Deep Learning</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Ottawa</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a novel deep multi-task learning model for the task of detecting happiness ingredients. The two classes/labels "agency" and "social" are treated as two separate tasks for training deep learning classifiers. Then, we train a multi-task deep learning classifier to see whether the knowledge shared between the two tasks can improve the overall results. In addition, we compare several models that use different kinds of word embeddings: different vector dimensions, fixed versus trainable embeddings, and embeddings initialized randomly or from existing pre-trained embeddings.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Multi-Task Learning</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Word Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Deep learning has achieved great success in many fields, such as natural
language processing, computer vision, and speech recognition. But there are still
many limits and challenges in deep learning, including overfitting,
hyperparameter optimization, and sometimes long training times. Multi-task learning (MTL),
particularly with deep neural networks, greatly reduces the risk of overfitting
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. With multiple tasks being learned simultaneously, the model tries to
capture a representation that serves all the tasks, which significantly lowers the chance of
overfitting on any single task.
      </p>
      <p>Happiness is one of the important facets of human emotion. In psychology, it
is a certain state of mind. The descriptions of happy moments include events that
give satisfaction, pleasure, or a positive emotional condition. For the purpose of
natural language processing (NLP) tasks, it is difficult to formalize happiness.
However, as human affect is context-driven, what we are concerned with here is
the contextual and agentic attributes of the descriptions of the happy moments.</p>
      <p>
        The CL-Aff Shared Task aims to challenge the current understanding of
emotion and affect in text through a task that models the experiential, contextual,
and agentic attributes of crowd-sourced single sentences that describe happy
moments. The CL-Aff Shared Task is based on HappyDB [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which is a corpus
of more than 100,000 happy moments crowd-sourced via Amazon's
Mechanical Turk. There are two sub-tasks in the shared task for analyzing happiness
and well-being in written language on the modified HappyDB corpus. Task 1
focuses on predicting the Agency and Social labels (classes), while Task 2 is
open-ended, encouraging participants to propose new characterizations and insights
for the descriptions of the happy moments in the test set. In Task 1, in addition to
the two existing tasks, we introduced determining the Concepts as a third task,
to take full advantage of the power of multi-task learning.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Data Preprocessing</title>
      <p>The distribution of the Social and Agency labels in the training data is shown in
Table 1. Whereas the data are almost evenly distributed for the
attribute Social, for Agency the positive class accounts for a high proportion.
This imbalance could cause our models to perform worse on Agency
than on Social.</p>
      <p>We added one step for label processing: categorizing Agency, Social, and Concepts.
The classes Agency and Social
both contain only binary values, yes and no, which can be easily
transformed into 1 and 0, whereas the class Concepts has 15 different values and
many more combinations of them. We use a one-hot-style encoding to
represent all 15 concepts. Each value is transformed into a 15-dimensional
array, in which the positions of the concepts that are present are set to
1, and the others are set to 0.</p>
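      <p>The label processing described above can be sketched as follows. This is a minimal illustration, not the authors' code: the concept names, the "|" separator, and the function names are assumptions made for the example, and only four placeholder concepts are used instead of the 15 in the task.</p>
      <preformat>
```python
# Hypothetical concept vocabulary: the real task uses 15 concepts; four
# placeholder names are used here only to keep the example short.
CONCEPTS = ["family", "food", "career", "exercise"]


def encode_binary(label):
    """Map a yes/no label to 1/0."""
    return 1 if label.strip().lower() == "yes" else 0


def encode_concepts(value, vocabulary=CONCEPTS, sep="|"):
    """Return a vector with 1 at the position of every concept present.

    `value` is assumed to be a separator-joined string such as "family|food".
    """
    present = {c.strip().lower() for c in value.split(sep) if c.strip()}
    return [1 if concept in present else 0 for concept in vocabulary]


agency = encode_binary("yes")
social = encode_binary("no")
concepts = encode_concepts("family|food")
```
</preformat>
      <p>Because several concepts can be present at once, the resulting vector may contain more than one 1, which is why a single fixed-length array can represent any combination of concepts.</p>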
    </sec>
    <sec id="sec-3">
      <title>Models</title>
      <p>From bottom to top, our model comprises: the embedding layer, the
convolutional layer, the dropout and pooling layer, and two separate stacks of dense layers.
We compared the results of several deep learning models, such as Convolutional
Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks. The CNN
achieved better results in our experiments, which is why we present only the
results for the CNN-based models in Section 4.</p>
      <sec id="sec-3-1">
        <title>Embedding Layer</title>
        <p>The embedding layer is fed 1-D moment descriptions, which are then
embedded into 2-D matrices. The size of the second dimension is 100, which means
every word is transformed into a 100-dimensional vector. For example, for
a sentence with a length of 20, we first add 9 zeros at the front to reach the
length 29. After passing through the embedding layer, the size of the output matrix is
(29, 100).</p>
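        <p>The padding and lookup steps in this example can be sketched in plain Python. The vocabulary size, the word ids, the random initialization range, and the truncation of over-long sentences are assumptions made for illustration; only the lengths 20 and 29 and the dimension 100 come from the text.</p>
        <preformat>
```python
import random

MAX_LEN = 29      # padded sequence length, as stated in the text
EMBED_DIM = 100   # embedding dimension, as stated in the text
VOCAB_SIZE = 50   # hypothetical vocabulary size for this toy example


def pad_front(token_ids, max_len=MAX_LEN, pad_id=0):
    """Left-pad with zeros; keep only the last max_len ids if too long."""
    kept = token_ids[-max_len:]
    return [pad_id] * (max_len - len(kept)) + kept


def embed(token_ids, table):
    """Look up one row of the embedding table per token id."""
    return [table[i] for i in token_ids]


rng = random.Random(0)
# Randomly initialized table; row 0 is reserved for the padding id.
table = [[rng.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
         for _ in range(VOCAB_SIZE)]

sentence = list(range(1, 21))          # 20 hypothetical word ids
padded = pad_front(sentence)           # 9 zeros followed by the 20 ids
matrix = embed(padded, table)          # 29 rows of 100 values each
```
</preformat>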
        <p>There are two ways to initialize the values inside the embeddings: randomly,
or with pre-trained embeddings from an outside corpus. There are also
two ways to handle the values: keep them fixed, or allow them to be updated
during training (so that they are tuned for our tasks). In our experiments, using
pre-trained embeddings that can be updated during training led to the
best results.</p>
      </sec>
      <sec id="sec-3-2">
        <title>GloVe: Global Vectors for Word Representation</title>
        <p>
          GloVe is an unsupervised learning algorithm for obtaining vector representations for words. The
pre-trained embeddings we used were trained on a corpus of 2 billion tweets, with 27
billion tokens and a vocabulary of 1.2 million words [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
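        <p>One way to turn pre-trained vectors into an initial embedding matrix is sketched below. It relies on the plain-text GloVe format, in which each line holds a word followed by its vector values; the tiny in-memory sample, the word indices, and the random initialization of words missing from the pre-trained set are assumptions of this sketch, not details given in the paper.</p>
        <preformat>
```python
import io
import random


def load_glove(handle):
    """Parse GloVe text format: a word followed by its vector values."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) == 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim, seed=0):
    """Row i holds the vector of the word whose index is i.

    Row 0 stays all-zero for padding; words missing from the pre-trained
    vectors get small random values (an assumption of this sketch).
    """
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = list(vectors[word])
        else:
            matrix[idx] = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
    return matrix


# Tiny stand-in for a real GloVe file (3-dimensional, made-up values).
sample = io.StringIO("happy 0.1 0.2 0.3\nfriend 0.4 0.5 0.6\n")
vectors = load_glove(sample)
word_index = {"happy": 1, "friend": 2, "picnic": 3}   # hypothetical indices
matrix = build_embedding_matrix(word_index, vectors, dim=3)
```
</preformat>
        <p>The resulting matrix can then serve as the initial weights of the embedding layer, either frozen or left trainable.</p>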
      </sec>
      <sec id="sec-3-3">
        <title>Convolutional Layer</title>
        <p>
          Whereas a 2D convolution layer is best suited to image processing, here we used a 1D
convolution layer, which is commonly used for natural language processing [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The
kernel, also called a filter, spans the full embedding vector of each word it covers, so it slides
only along the input sentence, and each filter generates a 1-D vector. The size of
the final output therefore depends only on the number of filters and the length of the
sentence, not on the dimension of the word embeddings.
        </p>
        <p>After the convolution, the output is fed into a dropout layer and then the
max pooling operation is performed.</p>
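        <p>The following sketch shows a single filter sliding along a sentence matrix, followed by max pooling. The filter width of 3, the ReLU activation, the absence of padding, and pooling over the whole sentence are assumptions made for illustration, since the paper does not state them. Dropout is omitted because it only has an effect during training.</p>
        <preformat>
```python
def conv1d(sentence, kernel, bias=0.0):
    """Valid 1D convolution of one filter over a sentence matrix.

    sentence: list of word vectors, shape (length, dim)
    kernel:   list of weight vectors, shape (width, dim)
    Returns one ReLU-activated value per window: length - width + 1 values.
    """
    width = len(kernel)
    outputs = []
    for start in range(len(sentence) - width + 1):
        total = bias
        for offset in range(width):
            word = sentence[start + offset]
            weights = kernel[offset]
            total += sum(w * x for w, x in zip(weights, word))
        outputs.append(max(0.0, total))
    return outputs


def global_max_pool(feature_map):
    """Keep the strongest response of a filter over the whole sentence."""
    return max(feature_map)


# Toy input: 5 words with 2-dimensional embeddings (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0]]
kernel = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.5]]     # one filter of width 3
feature_map = conv1d(sentence, kernel)
pooled = global_max_pool(feature_map)
```
</preformat>
        <p>The feature map has one value per window position, so its length depends on the sentence length and the filter width but not on the embedding dimension, as noted above.</p>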
      </sec>
      <sec id="sec-3-4">
        <title>Hard Parameter Sharing for Multi-task Learning</title>
        <p>
          Hard parameter sharing [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] is the most commonly used approach to multi-task
learning in neural networks. It is applied by sharing the hidden layers between
all the tasks, while keeping several task-specific output layers separate.
        </p>
        <p>
          Normally, the convolutional layers are followed by a fully
connected layer and a single output (possibly with multiple dimensions) at the
end. With hard parameter sharing, however, the first part of the model is shared between
the multiple tasks, while the layers after the convolution are task-specific. Here,
the class Concepts is treated as a third classification task, besides the two
classes Agency and Social. Fig. 1 gives a high-level description of our MTL model
for three tasks. The idea is that sharing layers between the tasks could reduce
the risk of overfitting for each task. Intuitively, the more tasks we train
simultaneously, the more our model must represent all of the tasks, leading
to a lower chance of overfitting on a single task. Our proposed MTL model still
achieves good results, as shown in the next section.
        </p>
        <p>
          During training, we use mini-batch gradient descent with a batch size of 32 and the Adam
optimizer [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] with a learning rate of 0.1. The size of the embedding layer
was set to 100, and the dropout rate to 0.2.
        </p>
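        <p>The structure of hard parameter sharing can be illustrated with a forward pass in which three task-specific output layers read the same shared representation. This is a simplified sketch and not the authors' implementation: the size of the shared representation, the single dense layer per head, the sigmoid outputs, and the unweighted sum of per-task losses are all assumptions, and a random vector stands in for the output of the shared convolutional part.</p>
        <preformat>
```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def make_layer(rng, n_out, n_in):
    """Create small random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def binary_cross_entropy(targets, probs, eps=1e-7):
    terms = [-(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
             for t, p in zip(targets, probs)]
    return sum(terms) / len(terms)


rng = random.Random(0)
FEATURES = 8   # hypothetical size of the shared (pooled) representation
HEADS = {"agency": 1, "social": 1, "concepts": 15}   # outputs per task

# One task-specific output layer per task; everything before it is shared.
head_params = {task: make_layer(rng, n_out, FEATURES)
               for task, n_out in HEADS.items()}


def forward(shared_features):
    """Run every task head on the same shared representation."""
    return {task: [sigmoid(z) for z in dense(shared_features, w, b)]
            for task, (w, b) in head_params.items()}


def joint_loss(predictions, targets):
    """Unweighted sum of per-task losses (an assumption of this sketch)."""
    return sum(binary_cross_entropy(targets[task], predictions[task])
               for task in predictions)


# Random stand-in for the output of the shared convolutional layers.
shared = [rng.uniform(0.0, 1.0) for _ in range(FEATURES)]
predictions = forward(shared)
targets = {"agency": [1], "social": [0], "concepts": [1] + [0] * 14}
loss = joint_loss(predictions, targets)
```
</preformat>
        <p>Because the gradient of the joint loss flows from all three heads into the same shared layers, those layers are pushed toward a representation that is useful for every task at once.</p>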
      </sec>
      <sec id="sec-3-5">
        <title>Evaluation and Results</title>
        <p>We evaluate the performance of different models on the training data. The split
ratio we used is 60%:20%:20%, which means 60% of the data is used for training,
20% for validation, and the remaining 20% for testing.</p>
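        <p>A 60%:20%:20% split of this kind can be produced as in the sketch below. The shuffling, the fixed seed, and the function name are assumptions; the paper does not say whether the split was random or stratified.</p>
        <preformat>
```python
import random


def split_60_20_20(items, seed=42):
    """Shuffle a copy of items and cut it into train/validation/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 60 // 100
    n_val = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = split_60_20_20(range(100))
```
</preformat>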
        <p>Table 2 shows the results of several models. From left to right, the
abbreviations of the models have the following meanings (all word embeddings have dimension 100 in this set of
experiments):</p>
        <p>CNN: Convolutional Neural Network model with a randomly initialized
embedding layer;</p>
        <p>CNN+MTL: Convolutional Neural Network model with a randomly initialized
embedding layer, followed by the multi-task learning layers;</p>
        <p>CNN+MTL+GloVe, fixed: Convolutional Neural Network model with an
embedding layer that is initialized from pre-trained GloVe embeddings whose
values are not updated during training, followed by the multi-task
learning layers;</p>
        <p>CNN+MTL+GloVe, trainable: Convolutional Neural Network model with an
embedding layer that is initialized from pre-trained GloVe embeddings whose
values are updated during training, followed by the multi-task learning layers.</p>
        <p>From the results, we can see that, compared with the other models, the CNN
model with multi-task learning and pre-trained GloVe embeddings achieves the
best result when the embeddings are allowed to update (that is, when they are trained for
the two tasks at hand).</p>
        <p>Comparing between classes, our model obtains a relatively better result for
the class Social than for the class Agency, probably because of the imbalance in
the training data for Agency.</p>
        <p>All the results in the table are from models with 100-dimensional embeddings.
We also tested our models with other dimensions, such as 50 and 200, and the results
did not change much. In other words, the dimension of the embeddings did not
affect the model performance significantly. We also tested two-task multi-task
learning, in which the task-specific layers cover only the targets Agency and Social,
without the target Concepts. The results of the two-task model were very similar to
those of the three-task model, which means that adding the Concepts target did
not contribute much, at least not to the classification into Agency and Social.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we presented a multi-task deep learning model. Our experiments
show that the model works well on the provided happiness data. We obtained
acceptable accuracy, AUC, and F1 scores, especially on the label Social.</p>
      <p>Although our best model achieved good results, there are still some methods
we can use to improve its performance. One direction of future work is to also
learn from the unlabelled data (70,000 instances). We did not use the unlabelled
data in our current model due to time constraints. We propose to use a bootstrapping
algorithm: run our current best model on the unlabelled data, then add to the
labelled training data the best of the automatically labelled instances, namely
the ones for which the confidence in the prediction is high for both classes (Social
and Agency), and then retrain our model on the enlarged training data. The
model trained by this bootstrapping method might work better, but only if we
do not add too much noise to the training data.</p>
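      <p>The selection step of the proposed bootstrapping procedure could look like the sketch below. It is only an illustration of the idea: the confidence threshold of 0.9, the tuple layout, and the example predictions are hypothetical, and the procedure itself is future work that was not run in this paper.</p>
      <preformat>
```python
def is_confident(prob, threshold):
    """True when the probability is far enough from 0.5 in either direction."""
    return max(prob, 1.0 - prob) >= threshold


def select_pseudo_labels(predictions, threshold=0.9):
    """Keep instances whose Agency and Social predictions are both confident.

    predictions: list of (text, p_agency, p_social) tuples, where each p is
    the predicted probability of the positive class.
    Returns (text, agency_label, social_label) tuples.
    """
    selected = []
    for text, p_agency, p_social in predictions:
        if is_confident(p_agency, threshold) and is_confident(p_social, threshold):
            selected.append((text, int(p_agency >= 0.5), int(p_social >= 0.5)))
    return selected


# Made-up model outputs on three unlabelled moments.
predictions = [
    ("I had dinner with my sister.", 0.97, 0.95),
    ("The weather was nice.", 0.55, 0.08),
    ("I finished my thesis.", 0.99, 0.03),
]
pseudo_labelled = select_pseudo_labels(predictions)
```
</preformat>
      <p>Raising the threshold admits fewer but cleaner pseudo-labelled instances, which is one way to limit the noise added to the training data.</p>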
      <p>Another direction of future work is to make use of other information provided
in the training data, such as age, gender, location, marital status, and parental
status. Another piece of information from the training data that we plan to use is the
provided concepts. They are available for the training data but not for the test
data. We experimented with detecting concepts as a separate task while training
the MTL model, but we could further apply the model to predict concepts on
the test data. We could then use these automatically detected concepts when we
run the model on the test data in order to obtain results for the multi-task model
with three tasks.</p>
      <p>As mentioned, sub-task 2 is an open-ended task in which the
participants can propose their own task that could bring insights into the concept of
happiness as reflected in texts. One idea for sub-task
2, which we propose as future work, is to apply event detection methods
to find out which events make people happy, and then to analyze the
events by age, gender, location, marital status, and parental status. This could
show what kinds of events are important or happy at various ages. We could see
which events are considered happy by women; they might be different from
those that men consider happy. Cultural events might differ by location.
Married and single people might choose different events as important or happy at
their current stage in life. Finally, parents could be happy when their children
reach developmental milestones, and this kind of event would not
show up for people who are not parents.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Asai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evensen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golshan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopatenko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stepanov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suhara</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>W.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>HappyDB: A corpus of 100,000 crowdsourced happy moments</article-title>
          .
          <source>In: Proceedings of LREC 2018</source>
          .
          European Language Resources Association (ELRA), Miyazaki, Japan (May
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Caruana</surname>
          </string-name>
          , R.:
          <article-title>Multitask learning: A knowledge-based source of inductive bias</article-title>
          .
          <source>Proceedings of the Tenth International Conference on Machine Learning</source>
          . (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional Neural Networks for Sentence Classification</article-title>
          .
          <source>ArXiv e-prints arXiv:1408.5882 (Aug</source>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
          </string-name>
          , J.:
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>CoRR abs/1412.6980</source>
          (
          <year>2014</year>
          ), http://arxiv.org/abs/1412.6980
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , C.D.:
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In: Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <fpage>1532</fpage>
          –
          <lpage>1543</lpage>
          (
          <year>2014</year>
          ), http://www.aclweb.org/anthology/D14-1162
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>An Overview of Multi-Task Learning in Deep Neural Networks</article-title>
          .
          <source>ArXiv e-prints arXiv:1706.05098 (Jun</source>
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>