<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-Task Learning in Deep Neural Network for Sentiment Polarity and Irony classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorenzo De Mattei</string-name>
          <email>lorenzo.demattei@di.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Cimino</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <email>felice.dellorletta@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Informatica, Università di Pisa</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC-CNR), ItaliaNLP Lab, Pisa</institution>
        </aff>
      </contrib-group>
      <fpage>76</fpage>
      <lpage>82</lpage>
      <abstract>
        <p>We study the impact of a new multi-task learning approach in deep neural networks for polarity and irony detection in Italian Twitter posts. We compare this approach with traditional single-task learning models. The different behavior of the two approaches shows the effectiveness of the proposed method, which is able to combine the information from the two tasks and improve the accuracy on both. This is particularly true on edge cases in which knowledge about both tasks is needed to classify a tweet, for example when the literal polarity of a tweet is inverted by irony.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep neural network</kwd>
        <kwd>Multi-Task learning</kwd>
        <kwd>Sentiment analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years, Sentiment Analysis and related tasks have attracted a
lot of attention in the research community. Several works have been published
on these topics, and with the rise of deep learning the performance of the
systems has increased considerably. Despite these improvements,
machine-learning-based systems still struggle in edge cases, such as
when literal polarity is inverted by irony, especially when these cases are
underrepresented in the training data. Such cases were annotated for the SENTIPOLC
2016 shared task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Consider this tweet from the dataset: "Ho molta fiducia nel
nuovo Governo Monti. Più o meno la stessa che ripongo in mia madre che tenta
di inviare un'email" ("I have a lot of faith in the new Monti government. More or
less the same that I have in my mother when she tries to send an email"). The
tweet has a literal positive polarity, but irony changes the final polarity annotation.
      </p>
      <p>
        Previous work on neural networks has already shown problems in learning such
difficult cases: Marcus [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] pointed out a set of 10 criticisms of deep neural networks, such as
the inability to deal with hierarchical structure, the limited capacity for
transfer learning, the impossibility of integrating prior knowledge, and the lack of systematic
compositional skills. Despite these issues, previous work [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] has shown that
multi-task learning (MTL) is an appealing alternative to single-task learning
(STL), since it allows prior knowledge about the task hierarchy to be incorporated
into neural network architectures. Ruder et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] have shown that MTL is useful for
combining even loosely related tasks, letting the networks learn the task
hierarchy automatically.
      </p>
      <p>
        To study the effectiveness of MTL on Sentiment Analysis tasks, in this
paper we present a mixed MTL/STL approach (named MIX) based on deep
bidirectional recurrent neural networks [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] applied to polarity and irony detection
on Italian tweets. We modeled our networks to solve three binary tasks:
positive, negative and ironic tweet identification. We tested the performance of our
system on the most recent datasets available for Italian. We show that our
system outperforms the state of the art for Italian in both polarity and
irony classification. Furthermore, we show that the proposed mixed approach
outperforms both our STL and MTL approaches.
      </p>
      <p>
        To our knowledge, this is the first work that shows the effectiveness of MTL
in combining irony and polarity detection. A previous work on this topic [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] was
presented at EVALITA 2016, but the authors proposed an approach that
is closer to a multi-label classification method based on a single classifier
for all the labels than to an MTL method in which different loss functions are used
for the different tasks.
      </p>
      <p>We present an in-depth analysis of the results obtained by our method,
showing how the proposed multi-task learning approach is able to compose the
information coming from the different tasks.</p>
      <p>Our contributions: (i) to our knowledge, this is the first work that presents
an MTL system for polarity and irony detection; (ii) we introduce a novel mixed
MTL and STL approach; (iii) we present an error analysis suggesting that
the proposed multi-task learning approach is able to combine the information
extracted from the sentiment polarity and irony classification training sets and
improves the performance on both tasks. This is particularly true on edge cases
in which knowledge about both tasks is needed to classify a tweet.</p>
    </sec>
    <sec id="sec-2">
      <title>Dataset</title>
      <p>For the Italian polarity and irony classification tasks we relied on the dataset
provided for the SENTIPOLC task, which was part of EVALITA 2016, the
periodic evaluation campaign of NLP and speech tools for the Italian language. The
SENTIPOLC dataset contains a training set of 7,410 tweets and a test set
of 2,000 tweets. Each tweet is labeled with a set of 6 binary labels that define
whether a tweet is subjective (subj), positive (pos), negative (neg), ironic (iro), literally
positive (lpos) and literally negative (lneg). We performed our experiments only
on the positive, negative and ironic classes, but we still used the other labels to
perform a comparative analysis between the performance of the system trained
in the single-task and in the multi-task settings.</p>
      <p>Table 1 reports the distribution of labels in the dataset.</p>
      <p>[...] ironic inputs. In this work we introduce a new method (named MIX) that combines
these two architectures using a two-stage training approach in which a layer is
shared in only one stage of the training phase.</p>
      <p>
        Features: We built two sets of 128-dimensional word embeddings using
word2vec [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The first set was generated from the
itWac Corpus [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], while the second was built from approximately 25
million Italian tweets. Both corpora were POS-tagged using the tagger by
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and the word embeddings were computed on the combination of each word
and its part of speech. The resulting itWac and Twitter embeddings cover
91.5% and 96.6% of the SENTIPOLC dataset, respectively. In addition, the sentiment
polarity of each word, taken from the sentiment polarity
lexicon by [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], is used as a feature.
      </p>
      <p>Each token of a tweet is represented by a vector resulting from the
concatenation of the described features.</p>
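      <p>As an illustration only, the concatenation could be sketched as follows. The lookup tables, the zero vector for out-of-vocabulary tokens, and the encoding of lexicon polarity as a single number are assumptions made for this sketch, not details of our implementation.</p>
      <preformat>
```python
from typing import Dict, List, Tuple

EMB_DIM = 128  # embedding size stated in the text

# Hypothetical lookup tables keyed by (word, part-of-speech) pairs.
Embeddings = Dict[Tuple[str, str], List[float]]


def token_vector(word: str, pos: str, itwac: Embeddings, twitter: Embeddings,
                 lexicon: Dict[str, float]) -> List[float]:
    """Concatenate both embeddings and a lexicon polarity score for one token."""
    zeros = [0.0] * EMB_DIM  # assumption: out-of-vocabulary tokens map to zeros
    key = (word.lower(), pos)
    polarity = lexicon.get(word.lower(), 0.0)  # assumption: 0.0 = unknown or neutral
    return itwac.get(key, zeros) + twitter.get(key, zeros) + [polarity]


# Toy usage: one token covered by the itWac table only.
itwac = {("fiducia", "NOUN"): [0.1] * EMB_DIM}
vec = token_vector("Fiducia", "NOUN", itwac, {}, {"fiducia": 1.0})
print(len(vec))
```
      </preformat>
      <p>Under these assumptions each token is represented by 2 x 128 + 1 values; a richer polarity encoding would only change the last component.</p>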
      <p>Training: To train the STL networks, we performed three different training
runs, one for each task. To train the MTL architecture, we ran a shared training
that optimizes, at each step, a loss function for each task. For MTL,
the global loss function is the sum of the three individual loss functions.
For both the STL and MTL architectures, we stopped the training after 50 epochs without
improvement of the loss function on the validation set, and chose the parameters
with the best performance.</p>
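      <p>The summed objective and the stopping rule can be sketched as below. The text does not name the per-task loss, so binary cross-entropy is assumed here; the function names are hypothetical, and only the patience of 50 epochs is taken from the description above.</p>
      <preformat>
```python
import math
from typing import Dict, List

TASKS = ("pos", "neg", "iro")


def binary_cross_entropy(prob: float, label: int) -> float:
    """Cross-entropy of one predicted probability against a 0/1 label."""
    eps = 1e-7
    p = min(max(prob, eps), 1.0 - eps)  # clip to keep log() finite
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def global_loss(probs: Dict[str, float], labels: Dict[str, int]) -> float:
    """MTL objective: the sum of the three per-task losses (assumed to be BCE)."""
    return sum(binary_cross_entropy(probs[t], labels[t]) for t in TASKS)


def best_epoch(val_losses: List[float], patience: int = 50) -> int:
    """Index of the best epoch, stopping once `patience` epochs bring no improvement."""
    best, best_idx = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if best > loss:  # strict improvement on the validation loss
            best, best_idx = loss, epoch
        elif epoch - best_idx >= patience:
            break
    return best_idx
```
      </preformat>
      <p>In this sketch the parameters kept are those of the epoch returned by best_epoch, which corresponds to choosing the parameters with the best validation loss.</p>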
      <p>To mix the MTL and STL approaches we used a two-stage training. In
the first stage we trained the MTL network as described above. In the second
stage we initialized the first-level Bi-LSTM layers of the three STL
networks with the weights of the MTL network's shared Bi-LSTM, and the
second-level Bi-LSTM layers with the weights learned in the first stage. We then ran
a specific training for each task. We used the same stopping criteria as for STL
and MTL training.</p>
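      <p>A minimal sketch of the second-stage initialization follows, with weights reduced to plain lists. The layer names (shared_bilstm, pos_bilstm, and so on) are hypothetical, and the handling of the output layers is not covered because the description above does not specify it.</p>
      <preformat>
```python
import copy
from typing import Dict, List

Weights = Dict[str, List[float]]  # layer name -> flattened parameters (toy stand-in)
TASKS = ("pos", "neg", "iro")


def init_stl_from_mtl(mtl: Weights) -> Dict[str, Weights]:
    """Build one STL weight set per task from the stage-one MTL weights."""
    stl = {}
    for task in TASKS:
        stl[task] = {
            # first-level Bi-LSTM: a private copy of the layer shared in stage one
            "bilstm_1": copy.deepcopy(mtl["shared_bilstm"]),
            # second-level Bi-LSTM: the task-specific weights learned in stage one
            "bilstm_2": copy.deepcopy(mtl[f"{task}_bilstm"]),
        }
    return stl
```
      </preformat>
      <p>Because every task receives its own copy, the formerly shared layer can diverge across tasks during the second stage, which is what makes the layer shared in only one of the two stages.</p>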
      <p>
        Since all the tweets in the dataset are labeled with both their polarity and irony
labels, and the ironic tweets are heavily outnumbered by the non-ironic
ones, we oversampled the ironic examples by replicating them in the dataset.
Oversampling has been shown to improve classification performance
on unbalanced datasets [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
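      <p>Oversampling by replication can be sketched as follows. The replication factor is a hypothetical parameter, since its value is not reported here, and the final shuffle is an assumption of the sketch.</p>
      <preformat>
```python
import random
from typing import Dict, List

Tweet = Dict[str, object]


def oversample_ironic(tweets: List[Tweet], factor: int, seed: int = 0) -> List[Tweet]:
    """Return a shuffled copy in which every ironic tweet appears `factor` times."""
    out: List[Tweet] = []
    for tweet in tweets:
        copies = factor if tweet["iro"] == 1 else 1
        out.extend([tweet] * copies)
    random.Random(seed).shuffle(out)  # assumption: replicas are mixed into the data
    return out
```
      </preformat>
      <p>Replicated tweets keep all their labels, so the polarity tasks also see the ironic examples more often; this is plain replication, not the synthetic interpolation of SMOTE.</p>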
    </sec>
    <sec id="sec-3">
      <title>Results</title>
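      <table-wrap id="tab2">
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>System</th><th>POS</th><th>NEG</th><th>Polarity</th><th>IRO</th></tr>
          </thead>
          <tbody>
            <tr><td>STL</td><td>.641</td><td>.665</td><td>.653</td><td>.608</td></tr>
            <tr><td>PMIX</td><td>.670</td><td>.699</td><td>.684</td><td>-</td></tr>
            <tr><td>MTL</td><td>.674</td><td>.700</td><td>.674</td><td>.586</td></tr>
            <tr><td>MIX</td><td>.660</td><td>.736</td><td>.698</td><td>.622</td></tr>
            <tr><td>SwissCheese.c</td><td>.653</td><td>.713</td><td>.683</td><td>.536</td></tr>
            <tr><td>UniPI.2.c</td><td>.685</td><td>.643</td><td>.664</td><td>-</td></tr>
            <tr><td>tweet2check16.c</td><td>-</td><td>-</td><td>-</td><td>.541</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="tab3">
        <label>Table 3</label>
        <table>
          <thead>
            <tr><th rowspan="2">System</th><th colspan="2">POS</th><th colspan="2">NEG</th><th colspan="2">Polarity</th></tr>
            <tr><th>Iro</th><th>l Pol</th><th>Iro</th><th>l Pol</th><th>Iro</th><th>l Pol</th></tr>
          </thead>
          <tbody>
            <tr><td>STL</td><td>.115</td><td>.105</td><td>0.11</td><td>.090</td><td>.080</td><td>.085</td></tr>
            <tr><td>PMIX</td><td>.143</td><td>.044</td><td>.093</td><td>.075</td><td>.093</td><td>.049</td></tr>
            <tr><td>MTL</td><td>.104</td><td>.069</td><td>.086</td><td>.075</td><td>.086</td><td>.061</td></tr>
            <tr><td>MIX</td><td>.539</td><td>.567</td><td>.553</td><td>.492</td><td>.553</td><td>.500</td></tr>
          </tbody>
        </table>
      </table-wrap>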
      <p>
        As Table 2 shows, in the polarity detection tasks the MTL, PMIX,
and MIX models all outperform the best SENTIPOLC system that used a
single-task approach [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] (UniPI.2.c row), while only the MIX model performed better
than the system of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] (SwissCheese.c row), which used a multi-label classifier for the
subjectivity, polarity and irony identification tasks.
      </p>
      <p>As for irony detection, we observe that all our networks
outperform the best SENTIPOLC system, probably thanks to the use of
oversampling (the F-score of our STL model without oversampling is only 0.473).
More importantly, we observe that the MIX model significantly outperforms the STL
baseline, while the standard MTL does not.</p>
      <p>These results show that the MIX model brings improvements in both the polarity and
irony detection tasks.</p>
      <p>To study the impact of multi-task learning on polarity and irony detection,
we conducted an in-depth error analysis to investigate the performance of our
models on edge cases. We studied the behavior of the models on selected
subsets of the test set. Table 3 reports the polarity detection accuracies of our
models on Italian ironic tweets (columns Iro in the table) and on tweets for which
irony changes the literal polarity (l Pol). We can clearly observe that the MIX
model brings great improvements for polarity detection in l Pol tweets, while
the standard MTL does not. The improvements are clear for both positive and
negative tweets. This result suggests that the MIX model is able to compose
information coming from different examples of different tasks and to obtain
better results on edge cases. This is also shown by the results obtained in the
polarity detection task on ironic tweets (Iro).</p>
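      <p>The subset accuracies used in this analysis can be computed along the following lines. The selection rule for tweets whose literal polarity is changed by irony (an ironic tweet whose pos/neg labels differ from its lpos/lneg labels) is an assumption of this sketch, as are the function names.</p>
      <preformat>
```python
from typing import Callable, Dict, List

Tweet = Dict[str, int]  # keys: pos, neg, iro, lpos, lneg (0/1 gold labels)


def is_ironic(t: Tweet) -> bool:
    return t["iro"] == 1


def polarity_inverted(t: Tweet) -> bool:
    # assumption: irony "changes the literal polarity" when an ironic tweet
    # has overall polarity labels that differ from its literal ones
    return t["iro"] == 1 and (t["pos"], t["neg"]) != (t["lpos"], t["lneg"])


def subset_accuracy(gold: List[Tweet], pred: List[int], task: str,
                    keep: Callable[[Tweet], bool]) -> float:
    """Accuracy of binary predictions for `task`, restricted to tweets passing `keep`."""
    pairs = [(g[task], p) for g, p in zip(gold, pred) if keep(g)]
    if not pairs:
        raise ValueError("empty subset")
    return sum(1 for g, p in pairs if g == p) / len(pairs)
```
      </preformat>
      <p>For example, subset_accuracy(gold, pred_neg, "neg", polarity_inverted) would give the accuracy of the negative classifier on the tweets selected by the assumed rule.</p>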
      <p>Table 4 reports the accuracy of our systems in the irony detection task for
all the different label combinations in the test set. We can observe that the STL
and MTL models show the same behavior, while the MIX model significantly
outperforms the other two on almost all kinds of ironic instances (rows 1-8) and
on non-ironic positive instances (row 9). Conversely, MTL and STL outperform MIX
on the negative non-ironic tweets (rows 10-11). Given that the MIX approach
brings impressive improvements for edge cases (especially rare ones), it is likely
that it overestimates the correlation between irony and negativity.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We conducted a study on the effectiveness of multi-task learning approaches in
sentiment polarity and irony classification. We presented a mixed single- and
multi-task learning approach that is able to improve the performance in both
polarity and irony detection with respect to single-task and standard multi-task
learning approaches. In particular, our approach led to substantial improvements
on edge cases in which knowledge about both tasks is needed to classify a
tweet. This is particularly true when these cases are under-represented in the
training data, for example when the literal polarity of a tweet is
inverted by irony.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Attardi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sartiano</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alzetta</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Semplici</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural networks for sentiment analysis on italian tweets</article-title>
          .
          <source>In: Proceedings of Third Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2016</year>
          ) &amp;
          <article-title>Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2016</year>
          )
          <article-title>(</article-title>
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Barbieri</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basile</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croce</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nissim</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Novielli</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patti</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Overview of the evalita 2016 sentiment polarity classification task</article-title>
          .
          <source>In: Proceedings of Third Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2016</year>
          ) &amp;
          <article-title>Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2016</year>
          )
          <article-title>(</article-title>
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Baroni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernardini</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferraresi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zanchetta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>The wacky wide web: a collection of very large linguistically processed web-crawled corpora</article-title>
          .
          <source>Journal of Language Resources and Evaluation</source>
          <volume>43</volume>
          (
          <issue>3</issue>
          ),
          <volume>209</volume>
          –
          <fpage>226</fpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chawla</surname>
            ,
            <given-names>N.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bowyer</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>L.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kegelmeyer</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          :
          <article-title>Smote: synthetic minority over-sampling technique</article-title>
          .
          <source>Journal of Artificial Intelligence Research 16</source>
          ,
          <volume>321</volume>
          –
          <fpage>357</fpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cimino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dell'Orletta</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Building the state-of-the-art in pos tagging of italian tweets</article-title>
          .
          <source>In: Proceedings of Third Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2016</year>
          ) &amp;
          <article-title>Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2016</year>
          )
          <article-title>(</article-title>
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Deriu</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cieliebak</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Sentiment analysis using convolutional neural networks with multi-task training and distant supervision on italian tweets</article-title>
          .
          <source>In: Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Napoli, Italy, December 5-7</source>
          ,
          <year>2016</year>
          .
          <source>Italian Journal of Computational Linguistics</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Framewise phoneme classification with bidirectional lstm and other neural network architectures</article-title>
          .
          <source>Neural Networks</source>
          <volume>18</volume>
          (
          <issue>5-6</issue>
          ),
          <volume>602</volume>
          –
          <fpage>610</fpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation 9(8)</source>
          ,
          <volume>1735</volume>
          –
          <fpage>1780</fpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Maks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Izquierdo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frontini</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agerri</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Azpeitia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vossen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Generating polarity lexicons with wordnet propagation in five languages</article-title>
          .
          <source>Proceedings of LREC2014</source>
          , Reykjavik (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Marcus</surname>
          </string-name>
          , G.:
          <article-title>Deep learning: A critical appraisal</article-title>
          . Computing Research Repository abs/
          <year>1801</year>
          .00631 (
          <year>2018</year>
          ), http://arxiv.org/abs/
          <year>1801</year>
          .00631, version 2
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bingel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Augenstein</surname>
          </string-name>
          , I., Søgaard, A.:
          <article-title>Sluice networks: Learning what to share between loosely related tasks</article-title>
          .
          <source>arXiv preprint arXiv:1705.08142</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paliwal</surname>
            ,
            <given-names>K.K.:</given-names>
          </string-name>
          <article-title>Bidirectional recurrent neural networks</article-title>
          .
          <source>IEEE Transactions on Signal Processing</source>
          <volume>45</volume>
          (
          <issue>11</issue>
          ),
          <volume>2673</volume>
          –
          <fpage>2681</fpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Søgaard, A.,
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Deep multi-task learning with low level tasks supervised at lower layers</article-title>
          .
          <source>In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>2</volume>
          :
          <string-name>
            <given-names>Short</given-names>
            <surname>Papers</surname>
          </string-name>
          <article-title>)</article-title>
          .
          <source>vol. 2</source>
          , pp.
          <volume>231</volume>
          –
          <issue>235</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>