<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-task Learning in Deep Neural Networks at EVALITA 2018</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Cimino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo De Mattei</string-name>
          <email>lorenzo.demattei@di.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Informatica, Università di Pisa</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>English. In this paper we describe the system used for our participation in the ABSITA, GxG, HaSpeeDe and IronITA shared tasks of the EVALITA 2018 conference. We developed a classifier that can be configured to use Bidirectional Long Short Term Memory networks (Bi-LSTMs) or linear Support Vector Machines as learning algorithms. When using Bi-LSTMs we tested a multi-task learning approach, which learns the network parameters by exploiting all the annotated dataset labels simultaneously, and a multi-classifier voting approach based on a k-fold technique. In addition, we developed generic and task-specific word embedding lexicons to further improve classification performance. When evaluated on the official test sets, our system ranked 1st in almost all subtasks of each shared task, showing the effectiveness of our approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Abstract (Italian)</title>
      <p>Italian. In this paper we describe the system used for our participation in the ABSITA, GxG, HaSpeeDe and IronITA shared tasks of the EVALITA 2018 conference. We developed a system that uses both Bidirectional Long Short Term Memory networks (Bi-LSTMs) and Support Vector Machines as learning algorithms. With the Bi-LSTMs we tested a multi-task learning approach, in which the network parameters are optimized by using all the annotations in the dataset simultaneously, and a k-fold voting classification strategy. We created generic and task-specific word embeddings to further improve classification performance. When evaluated on the official test sets, our system ranked first in almost all subtasks of every shared task we addressed, demonstrating the validity of our approach.</p>
    </sec>
    <sec id="sec-2">
      <title>1 Description of the System</title>
      <p>The EVALITA 2018 edition has been one of the
most successful editions in terms of the number of
shared tasks proposed. In particular, a large part of
the tasks proposed by the organizers can be tackled
as binary document classification tasks. This gave
us the opportunity to test a new system specifically
designed for this EVALITA edition.</p>
      <p>
        We implemented a system which relies on
Bi-LSTMs
        <xref ref-type="bibr" rid="ref13">(Hochreiter and Schmidhuber, 1997)</xref>
        and SVMs, which
are widely used learning algorithms for
document classification. The learning algorithm
can be selected in a configuration file. In this work
we used the Keras
        <xref ref-type="bibr" rid="ref5">(Chollet, 2016)</xref>
        library and
the liblinear
        <xref ref-type="bibr" rid="ref11">(Fan et al., 2008)</xref>
        library to generate
the Bi-LSTM and SVM statistical models
respectively. Since our approach relies on
morpho-syntactically tagged text, training and test data
were automatically tagged by the
PoS tagger described in
        <xref ref-type="bibr" rid="ref18 ref22 ref7 ref8">(Cimino and Dell’Orletta,
2016)</xref>
        . Due to the label constraints in the dataset, if
our system classified an aspect as not present, we
forced the related positive and negative labels to
be classified as not positive and not negative. We
developed sentiment polarity and word embedding
lexicons with the aim of improving the overall
accuracy of our system.
      </p>
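      <p>The label-constraint rule above can be sketched as a simple post-processing step (a minimal illustration; the aspect names and prediction layout are hypothetical, not the system's actual data structures):</p>

```python
# Post-processing sketch: if an aspect is predicted as not present,
# force its positive and negative labels to "not positive"/"not negative".
# Aspect names and the dict layout are illustrative assumptions.
def enforce_label_constraints(predictions):
    """predictions: {aspect: {'present': 0/1, 'positive': 0/1, 'negative': 0/1}}"""
    for labels in predictions.values():
        if labels["present"] == 0:
            labels["positive"] = 0
            labels["negative"] = 0
    return predictions

preds = {"cleanliness": {"present": 0, "positive": 1, "negative": 0},
         "staff": {"present": 1, "positive": 1, "negative": 0}}
preds = enforce_label_constraints(preds)
# the 'cleanliness' polarity labels are now both forced to 0
```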
      <p>
        Some specific adaptations were made due to the
characteristics of each shared task. In the
Aspect-based Sentiment Analysis (ABSITA) 2018 shared
task
        <xref ref-type="bibr" rid="ref2 ref6">(Basile et al., 2018)</xref>
        participants were asked,
given a training set of Booking hotel reviews, to
detect the aspect categories mentioned in a review
among a set of 8 fixed categories (ACD task) and
to assign a polarity (positive, negative, neutral or
positive-negative) to each detected aspect (ACP
task). Since each Booking review in the training
set is labeled with 24 binary labels (8 indicating
the presence of an aspect, 8 indicating positivity
and 8 indicating negativity w.r.t. an aspect), we
addressed the ABSITA 2018 shared task as 24 binary
classification problems.
      </p>
      <p>
        The Gender X-Genre (GxG) 2018 shared task
        <xref ref-type="bibr" rid="ref10 ref3 ref6 ref9">(Dell’Orletta and Nissim, 2018)</xref>
        consisted in the
automatic identification of the gender of the
author of a text (Female or Male). Five different
training sets and test sets were provided by the
organizers for five different genres: Children essays
(CH), Diary (DI), Journalism (JO), Twitter posts
(TW) and YouTube comments (YT). For each test
set the participants were requested to submit a
system trained using the in-domain training dataset and a
system trained using cross-domain data only.
      </p>
      <p>
        The IronITA task
        <xref ref-type="bibr" rid="ref6">(Cignarella et al., 2018)</xref>
        consisted of two tasks. In the first task participants
had to automatically label a message as ironic or
not. The second task was more fine-grained: given
a message, participants had to classify whether the
message is sarcastic, ironic but not sarcastic, or not
ironic.
      </p>
      <p>
        Finally, the HaSpeeDe 2018 shared task
        <xref ref-type="bibr" rid="ref3">(Bosco et al., 2018)</xref>
        consisted in automatically
annotating messages from Twitter and Facebook
with a boolean value indicating the presence
(or not) of hate speech. In particular, three
tasks were proposed: HaSpeeDe-FB, where only
the Facebook dataset could be used to classify
Facebook comments; HaSpeeDe-TW, where just
Twitter data could be used to classify tweets;
and Cross-HaspeeDe, where only the Facebook
dataset could be used to classify the Twitter test
set and vice versa (Cross-HaspeeDe FB and
Cross-HaspeeDe TW).
      </p>
      <sec id="sec-2-1">
        <title>HaspeeDe TW).</title>
        <p>1.1
1.1.1</p>
      </sec>
      <sec id="sec-2-2">
        <title>1.1 Lexical Resources</title>
      </sec>
      <sec id="sec-2-3">
        <title>Automatically Generated Sentiment</title>
      </sec>
      <sec id="sec-2-4">
        <title>1.1.1 Automatically Generated Sentiment Polarity Lexicons for Social Media</title>
        <p>For the purpose of modeling the word usage in
generic, positive and negative contexts of social
media texts, we developed three lexicons which
we named TW_GEN, TW_NEG and TW_POS. Each
lexicon reports the relative frequency of each word
in one of three different corpora. The main idea behind
building these lexicons is that positive and
negative words should show a higher relative
frequency in TW_POS and TW_NEG respectively. The
three corpora were generated by first downloading
approximately 50,000,000 tweets and then
applying some filtering rules to the downloaded tweets
to build the positive and negative corpora (no
filtering rules were applied to build the generic
corpus). In order to build a corpus of positive tweets,
we constrained the downloaded tweets to contain
at least one positive emoji among heart and kisses.
Since emojis are rarely used in negative tweets, to
build the negative tweets corpus we created a list
of commonly used words in negative language and
constrained these tweets to contain at least one of
these words.</p>
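        <p>The construction of the three lexicons can be sketched as follows (a toy illustration; the emoji markers and negative-word list used here are assumptions, since the paper does not enumerate the actual filters):</p>

```python
from collections import Counter

# Assumed filter lists: the paper mentions heart/kiss emojis and a
# hand-crafted list of negative-language words, without listing them.
POSITIVE_MARKS = {"\u2764", ":*"}
NEGATIVE_WORDS = {"odio", "schifo"}

def relative_frequencies(tweets):
    """Relative frequency of each token over a corpus of tweets."""
    counts = Counter(tok for tweet in tweets for tok in tweet.split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def build_lexicons(tweets):
    tw_gen = relative_frequencies(tweets)  # generic: no filtering
    tw_pos = relative_frequencies(
        [t for t in tweets if any(m in t for m in POSITIVE_MARKS)])
    tw_neg = relative_frequencies(
        [t for t in tweets if any(w in t.split() for w in NEGATIVE_WORDS)])
    return tw_gen, tw_pos, tw_neg
```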
      </sec>
      <sec id="sec-2-5">
        <title>1.1.2 Automatically translated Sentiment</title>
      </sec>
      <sec id="sec-2-6">
        <title>1.1.2 Automatically Translated Sentiment Polarity Lexicons</title>
        <p>
          We used the Multi-Perspective Question Answering
(hereafter MPQA) Subjectivity Lexicon
          <xref ref-type="bibr" rid="ref21">(Wilson et al., 2005)</xref>
          . This lexicon consists of
approximately 8,200 English words with their
associated polarity. To use this resource for the Italian
language, we translated all the entries through the
Yandex translation service (http://api.yandex.com/translate/).
        </p>
      </sec>
      <sec id="sec-2-7">
        <title>1.1.3 Word Embedding Lexicons</title>
        <p>
          We generated four word embedding lexicons using
the word2vec toolkit (http://code.google.com/p/word2vec/)
          <xref ref-type="bibr" rid="ref14">(Mikolov et al., 2013)</xref>
          . As
recommended in
          <xref ref-type="bibr" rid="ref14">(Mikolov et al., 2013)</xref>
          , we used
the CBOW model that learns to predict the word
in the middle of a symmetric window based on
the sum of the vector representations of the words
in the window. For our experiments, we
considered a context window of 5 words. We built the
Word Embedding Lexicons starting from the following
corpora, which were tokenized and PoS tagged by the
PoS tagger for Twitter described in
          <xref ref-type="bibr" rid="ref18 ref22 ref7 ref8">(Cimino and
Dell’Orletta, 2016)</xref>
          :
        </p>
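        <p>The CBOW model predicts the middle word of a symmetric window from the surrounding words. A toy sketch of how such (context, target) training pairs are extracted (this illustrates the windowing only, not the word2vec toolkit itself):</p>

```python
def cbow_pairs(tokens, window=5):
    """Yield (context, target) pairs for a symmetric +/- `window` context."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

# With window=2, the middle word of "il servizio era davvero ottimo"
# is predicted from the two words on each side.
pairs = cbow_pairs("il servizio era davvero ottimo".split(), window=2)
```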
        <p>The first lexicon was built using the itWaC
corpus (http://wacky.sslmit.unibo.it/doku.php?id=corpora).
The itWaC corpus is a 2 billion word
corpus constructed from the Web by limiting the
crawl to the .it domain and using
medium-frequency words from the Repubblica corpus
and basic Italian vocabulary lists as seeds.
The second lexicon was built using the set
of the 50,000,000 tweets we downloaded to
build the sentiment polarity lexicons
previously described in subsection 1.1.1.
The third and the fourth lexicons were built
using a corpus consisting of 538,835 Booking
reviews scraped from the web. Since each
review on the Booking site is split into a positive
section (indicated by a plus mark) and a
negative section (indicated by a minus mark), we
split these reviews, obtaining 338,494
positive reviews and 200,341 negative reviews.
Starting from the positive and the negative
reviews, we finally obtained two different word
embedding lexicons.</p>
        <p>Each entry of the lexicons maps a pair (word,
POS) to the associated word embedding, allowing
us to mitigate polysemy problems which can lead to
poorer classification results. In addition, the
corpora were preprocessed in order to 1) map
each url to the word "URL" and 2) distinguish between
all-uppercased words and non-uppercased words
(e.g., "mai" vs "MAI"), since all-uppercased words
are usually used in negative contexts. Since each
task has its own characteristics in terms of the
information that needs to be captured by the
classifiers, we decided to use a subset of the word
embedding lexicons in each task. Table 1 sums up the word
embeddings used in each shared task.</p>
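        <p>The preprocessing and the (word, POS) keying can be sketched as follows (a minimal reading of the description above; the tag set and the lowercasing of non-all-caps words are assumptions):</p>

```python
import re

URL_RE = re.compile(r"https?://\S+")

def preprocess(tagged_tokens):
    """Map each url to 'URL' and keep all-uppercased words distinct from
    the rest (e.g. 'MAI' vs 'mai'), keyed by (word, POS)."""
    out = []
    for word, pos in tagged_tokens:
        if URL_RE.fullmatch(word):
            word = "URL"
        elif not word.isupper():
            word = word.lower()   # assumption: only all-caps words keep case
        out.append((word, pos))
    return out

keys = preprocess([("MAI", "B"), ("Mai", "B"), ("http://t.co/x", "NOCAT")])
```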
      </sec>
      <sec id="sec-2-8">
        <title>Task</title>
        <p>ABSITA</p>
        <p>GxG
HaSpeeDe
IronITA</p>
      </sec>
      <sec id="sec-2-9">
        <title>Booking</title>
        <p>3
7
7
7</p>
      </sec>
      <sec id="sec-2-10">
        <title>ITWAC</title>
        <p>3
3
3
3</p>
      </sec>
      <sec id="sec-2-11">
        <title>Twitter</title>
        <p>
          7
3
3
3
The classifier we built for our participation to the
tasks was designed with the aim of testing
different learning algorithms and learning strategies.
More specifically our classifier implements two
workflows which allow testing SVM and recurrent
neural networks as learning algorithms. In
addition, when recurrent neural networks are chosen
as learning algorithms, our classifier can
perform neural network multi-task learning (MTL)
using an external dataset in order to share
knowledge between related tasks. We decided to test the
MTL strategy since, as demonstrated in
          <xref ref-type="bibr" rid="ref9">(De
Mattei et al., 2018)</xref>
          , it can improve the performance of
the classifier on emotion recognition tasks. The
benefits of this approach were investigated also
by Søgaard and Goldberg (2016), who showed
that MTL is appealing since it allows
incorporating prior knowledge about task hierarchies
into neural network architectures. Furthermore,
Ruder et al. (2017) showed that MTL is useful to
combine even loosely related tasks, letting the
networks automatically learn the tasks hierarchy.
        </p>
        <p>Both the workflows we implemented share a
common pattern used in machine learning
classifiers, consisting of a document feature
extraction phase and a learning phase based on the extracted
features; however, since SVM and Bi-LSTM take as
input 2-dimensional and 3-dimensional tensors
respectively, a different feature extraction phase is
involved for each considered algorithm. In
addition, when the Bi-LSTM workflow is selected
the classifier can take as input an extra file which
is used to exploit the MTL learning approach.
Furthermore, when the Bi-LSTM workflow is
selected, the classifier performs a 5-fold training
approach. More precisely, we build 5 different
models using different training and validation sets.
These models are then exploited in the
classification phase: the assigned labels are the ones that
obtain the majority among all the models. The
5-fold strategy was chosen in order to
generate a global model which should be less prone
to overfitting or underfitting w.r.t. a single learned
model.</p>
      </sec>
      <sec id="sec-2-12">
        <title>1.2.1 The SVM classifier</title>
        <p>
          The SVM classifier exploits a wide set of
features ranging across different levels of
linguistic description. With the exception of the word
embedding combination, these features were
already tested in our previous participation at the
EVALITA 2016 SENTIPOLC edition
          <xref ref-type="bibr" rid="ref7 ref8">(Cimino et
al., 2016)</xref>
          . The features are organised into three
main categories: raw and lexical text features,
morpho-syntactic features and lexicon features.
Due to size constraints we report only the feature
names.
        </p>
      </sec>
      <sec id="sec-2-13">
        <title>Raw and Lexical Text Features</title>
        <p>number of tokens, character n-grams, word n-grams, lemma
n-grams, repetition of char n-grams, number of
mentions, number of hashtags, punctuation.</p>
      </sec>
      <sec id="sec-2-14">
        <title>Morpho-syntactic Features</title>
        <p>coarse grained Part-Of-Speech n-grams, fine grained
Part-Of-Speech n-grams, coarse grained Part-Of-Speech
distribution.</p>
        <p>Lexicon Features: emoticons presence, lemma
sentiment polarity n-grams, polarity modifier,
PMI score, sentiment polarity distribution, most
frequent sentiment polarity, sentiment polarity in
text sections, word embeddings combination.</p>
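        <p>A toy sketch of how a few of these feature families can be extracted into the sparse dictionary representation typically fed to liblinear (the feature names and the choice of n are illustrative, not the system's actual feature templates):</p>

```python
from collections import Counter

def ngrams(seq, n):
    """Contiguous n-grams of a token or tag sequence."""
    return [" ".join(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def extract_features(tokens, pos_tags):
    """Small subset of the SVM feature set: word/POS n-grams plus counts."""
    feats = Counter()
    for n in (1, 2):
        feats.update("word|%d|%s" % (n, g) for g in ngrams(tokens, n))
        feats.update("pos|%d|%s" % (n, g) for g in ngrams(pos_tags, n))
    feats["n_tokens"] = len(tokens)
    feats["n_hashtags"] = sum(t.startswith("#") for t in tokens)
    feats["n_mentions"] = sum(t.startswith("@") for t in tokens)
    return dict(feats)

feats = extract_features(["@user", "che", "bello", "#hotel"],
                         ["NOCAT", "PRON", "ADJ", "NOCAT"])
```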
      </sec>
      <sec id="sec-2-15">
        <title>1.2.2 The Deep Neural Network classifier</title>
        <p>
          We tested two different models based on
BiLSTM: one that learns to classify the labels
without sharing information from all the labels in the
training phase (Single task learning - STL), and
the other one which learns to classify the labels
exploiting the related information through a shared
Bi-LSTM (Multi task learning - MTL). We
employed Bi-LSTM architectures since they
allow capturing long-range dependencies
from both directions of a document by
constructing bidirectional links in the network
          <xref ref-type="bibr" rid="ref17">(Schuster and Paliwal, 1997)</xref>
          . We applied a dropout factor to both the
input gates and the recurrent connections in
order to prevent overfitting, a typical issue
in neural networks
          <xref ref-type="bibr" rid="ref12 ref19">(Gal and Ghahramani, 2015)</xref>
          .
We chose a dropout factor of 0.50.
        </p>
        <p>As for GxG, since we had to deal with
longer documents such as news articles, we employed a
two-layer Bi-LSTM encoder. The first Bi-LSTM
layer encodes each sentence as a token
sequence; the second layer encodes the
sequence of sentences. For IronITA
we added a task-specific Bi-LSTM for each
subtask before the dense layer.</p>
        <p>
          Figure 1 shows a graphical representation of the
STL and MTL architectures we employed. For
what concerns the optimization process, the binary
cross entropy function is used as a loss function
and optimization is performed by the rmsprop
optimizer
          <xref ref-type="bibr" rid="ref20">(Tieleman and Hinton, 2012)</xref>
          .
Each input word is represented by a vector
composed of:
Word embeddings: the concatenation of the word
embeddings extracted from the available Word
Embedding Lexicons (128 dimensions for each word
embedding); for each word embedding an
extra component was added to handle the "unknown
word" case (1 dimension for each lexicon used).
Word polarity: the corresponding word polarity
obtained by exploiting the Sentiment Polarity
Lexicons. This results in 3 components, one for each
possible lexicon outcome (negative, neutral,
positive) (3 dimensions). We assumed that a word not
found in the lexicons has neutral polarity.
        </p>
      </sec>
      <sec id="sec-2-16">
        <title>Automatically Generated Sentiment Polarity</title>
      </sec>
      <sec id="sec-2-17">
        <title>Automatically Generated Sentiment Polarity Lexicons for Social Media</title>
        <p>the presence or the
absence of the word in a lexicon and its relative
frequency if the word is found in the lexicon. Since
we built TW_GEN, TW_POS and TW_NEG, 6
dimensions are needed, 2 for each lexicon.</p>
      </sec>
      <sec id="sec-2-18">
        <title>Coarse Grained Part-of-Speech: 13 dimensions.</title>
        <p>End of Sentence: a component (1 dimension)
indicating whether the sentence has been fully read.</p>
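        <p>Putting the pieces of this section together, the per-word input vector can be assembled as follows (dimensions follow the text: 128+1 per word embedding lexicon, 3 for polarity, 2 per sentiment lexicon, 13 for coarse POS, 1 for end of sentence; the lexicon contents here are toy values):</p>

```python
def word_vector(word, emb_lexicons, polarity, sent_lexicons, pos_id, is_eos):
    """Concatenate all the per-word components described in this section."""
    vec = []
    for lex in emb_lexicons:                 # one 128-dim embedding per lexicon
        emb = lex.get(word)
        vec += emb if emb is not None else [0.0] * 128
        vec.append(0.0 if emb is not None else 1.0)  # "unknown word" component
    pol = [0.0, 0.0, 0.0]                    # negative / neutral / positive
    pol[polarity.get(word, 1)] = 1.0         # words not in the lexicon: neutral
    vec += pol
    for lex in sent_lexicons:                # TW_GEN, TW_POS, TW_NEG
        freq = lex.get(word)                 # presence + relative frequency
        vec += [1.0 if freq is not None else 0.0, freq or 0.0]
    pos_onehot = [0.0] * 13                  # coarse grained POS one-hot
    pos_onehot[pos_id] = 1.0
    vec += pos_onehot
    vec.append(1.0 if is_eos else 0.0)       # end-of-sentence flag
    return vec

v = word_vector("ottimo", [{"ottimo": [0.1] * 128}, {}],
                {"ottimo": 2}, [{}, {"ottimo": 0.002}, {}], 3, True)
```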
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2 Results and Discussion</title>
      <p>Table 2 reports the official results obtained by our
best runs on all the tasks we participated in. As can
be noted, our system performed extremely well,
achieving the best scores in almost every single
subtask. In the following subsections a discussion
of the results obtained in each task is provided.</p>
      <sec id="sec-3-1">
        <title>2.1 ABSITA</title>
        <p>We tested five learning configurations of our
system based on linear SVM and DNN learning
algorithms using the features described in section
1.2.1 and 1.2.2. All the experiments were aimed
at testing the contribution in terms of f-score of
MTL vs STL, the k-fold technique and the
external resources. For what concerns the Bi-LSTM
learning algorithm we tested Bi-LSTM both in the
STL and MTL scenarios. In addition, to test the
contribution of the Booking word embeddings, we
created a configuration which uses a shallow
Bi-LSTM in the MTL setting without these
embeddings (MTL NO BOOKING-WE). Finally, to test
the contribution of the k-fold technique we created
a configuration which does not use the k-fold
technique (MTL NO K-FOLD). To obtain fair
comparisons in the last case we ran all the experiments
5 times and averaged the scores of the runs. To
test the proposed classification models, we created
        <p>an internal development set by randomly selecting
documents from the training sets distributed by the
task organizers. The resulting development set is
composed of approximately 10% (561
documents) of the whole training set.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Table 3: Results on the Internal Development Set</title>
        <p>baseline: ACD 0.313, ACP 0.197;
linear SVM: ACD 0.797, ACP 0.739;
STL: ACD 0.821, ACP 0.795;
MTL: ACD 0.824, ACP 0.804;
MTL NO K-FOLD: ACD 0.819, ACP 0.782;
MTL NO BOOKING-WE: ACD 0.817, ACP 0.757.</p>
        <p>Table 3 reports the overall accuracies achieved
by the models on the internal development set for
all the tasks. In addition, the results of a
baseline system (baseline row), which always emits the
most probable label according to the label
distributions in the training set, are reported. The accuracy
is calculated as the micro f-score obtained using
the evaluation tool provided by the organizers. For
what concerns the ACD task, it is worth noting that
the models based on DNNs always outperform the
linear SVM, even though the difference in terms of
f-score is small (approximately 2 f-score points).
The MTL configuration was the best performing
among all the models, but the difference in
terms of f-score among the DNN configurations
is not evident.</p>
        <p>Table 4: Results on the Official Test Set (submitted runs starred):
baseline: ACD 0.338, ACP 0.199;
linear SVM: ACD 0.772*, ACP 0.686*;
STL: ACD 0.814, ACP 0.765;
MTL: ACD 0.811*, ACP 0.767*;
MTL NO K-FOLD: ACD 0.801, ACP 0.755;
MTL NO BOOKING-WE: ACD 0.808, ACP 0.753.</p>
        <p>When analyzing the results obtained on the
ACP task we can notice remarkable differences
among the performances obtained by the models.
Again the linear SVM was the worst performing
model, but this time with a difference in terms
of f-score of 6 points with respect to MTL, the
best performing model on the task. It is
interesting to notice that the results achieved by the
DNN models differ more from each other
in terms of f-score than in the ACD task:
this suggests that the external resources and the
k-fold technique contributed significantly to obtaining
the best result in the ACP task. The configuration
that does not use the k-fold technique scored 2
f-score points less than the MTL configuration. We can
also notice that the Booking word embeddings
were particularly helpful in this task: the MTL
NO BOOKING-WE configuration in fact scored
5 points less than the best configuration. The
results obtained on the internal development set led
us to choose the models for the official runs on the
provided test set. Table 4 reports the overall
accuracies achieved by all our classifier configurations
on the official test set; the official submitted runs
are starred in the table.</p>
        <p>As can be noticed, the best scores in both the
ACD and ACP tasks were obtained by the DNN
models. Surprisingly, the differences in terms of
f-score were reduced in both tasks, with the
exception of the linear SVM, which performed 4 and 8
f-score points less in the ACD and ACP tasks
respectively when compared to the best DNN
systems. The STL model outperformed the MTL
models in the ACD task, even though the difference
in terms of f-score is not relevant. When the results
on the ACP task are considered, the MTL model
outperformed all the other models, even though
the difference in terms of f-score with respect to
the STL model is not noticeable. It is worth
noticing that the k-fold technique and the Booking
word embeddings again seemed to contribute to
the final accuracy of the MTL system. This can be
seen by looking at the results achieved by the MTL
NO BOOKING-WE model and the MTL NO
K-FOLD model, which scored 1.2 and 1.5 f-score points
less than the MTL system.</p>
        <p>2.2 GxG</p>
        <p>
We tested three different learning configurations
of our system based on linear SVM and DNN
learning algorithms using the features described in
sections 1.2.1 and 1.2.2. For what concerns the
Bi-LSTM learning algorithm we tested both the STL
and MTL approaches. We tested the three
configurations for each of the five in-domain subtasks
and for each of the five cross-domain subtasks.
To test the proposed classification models, we
created internal development sets by randomly
selecting documents from the training sets distributed
by the task organizers. The resulting development
sets are composed of approximately 10% of
each dataset. For what concerns the in-domain
task, we tried to train the SVM classifier on
in-domain data only and on both in-domain and
cross-domain data.</p>
        <p>Table 5 (development sets, in-domain):
SVMa: CH 0.667, DI 0.626, JO 0.485, TW 0.582, YT 0.611;
SVM: CH 0.701, DI 0.737, JO 0.560, TW 0.728, YT 0.619;
STL: CH 0.556, DI 0.545, JO 0.500, TW 0.724, YT 0.596;
MTL: CH 0.499, DI 0.817, JO 0.625, TW 0.729, YT 0.632.</p>
        <p>Table 5 and 6 report the overall accuracy,
computed as the average accuracy for the two classes
(male and female), achieved by the models on
the development data sets for the in-domain and
the cross-domain tasks respectively. For the
in-domain tasks we observe that the SVM performs
well on the smaller datasets (Children and
Diary), while the MTL neural network has the best
overall performance. When trained on all the
datasets, in- and cross-domain, the SVM (SVMa)
performs worse than when trained on in-domain
data only (SVM). For what concerns the
cross-domain datasets we observe poor performances
over all the subtasks with all the employed
models, implying that the models have difficulties in
cross-domain generalization.</p>
        <p>Table 7 and 8 report the overall accuracy,
computed as the average accuracy for the two classes
(male and female), achieved by the models on the
official test sets for the in-domain and the
cross-domain tasks respectively (* marks the runs
that obtained the best results in the competition). For
what concerns the in-domain subtasks, the
performances appear not to be in line with the ones
obtained on the development set, but our
models still outperform the other participants' systems in
four out of five subtasks. The MTL model
provided the best results for the Children and Diary
test sets, while on the other test sets all the
models performed quite poorly. Again, when trained
on all the datasets, in- and cross-domain, the SVM
(SVMa) performs worse than when trained on
in-domain data only (SVM). For what concerns the
cross-domain subtasks, while our model achieves the
best performance on three out of five subtasks,
the results confirm poor performances over all the
subtasks, again indicating that the models have
difficulties in cross-domain generalization.</p>
      </sec>
      <sec id="sec-3-3">
        <title>2.3 HaSpeeDe</title>
        <p>We tested seven learning configurations of our
system based on linear SVM and DNN learning
algorithms using the features described in sections 1.2.1
and 1.2.2. All the experiments were aimed at
testing the contribution in terms of f-score of the
number of layers, MTL vs STL, the k-fold technique
and the external resources. For what concerns the
Bi-LSTM learning algorithm we tested one- and
two-layer Bi-LSTMs both in the STL and MTL
scenarios. In addition, to test the contribution of
the sentiment lexicon features, we created a
configuration which uses a 2-layer Bi-LSTM in the MTL
setting without these features (1L MTL NO
SNT). Finally, to test the contribution of the
k-fold technique we created a configuration which
does not use the k-fold technique (1 STL NO
K-FOLD). To obtain fair results in the last case we
ran all the experiments 5 times and averaged the
scores of the runs. To test the proposed
classification models, we created two internal development
sets, one for each dataset, by randomly selecting
documents from the training sets distributed by the
task organizers. The resulting development sets
are composed of 10% (300 documents) of the
whole training sets.</p>
        <p>Table 9 reports the overall accuracies achieved
by the models on our internal development sets
for all the tasks. In addition, the results of a
baseline system (baseline row), which always emits the
most probable label according to the label
distribution in the training set, are reported. The
accuracy is calculated as the f-score obtained using the
evaluation tool provided by the organizers. For
what concerns the Twitter in-domain task (TW
in the table), it is worth noting that the linear SVM
outperformed all the configurations based on
Bi-LSTMs. In addition, the MTL architecture results
are slightly better than the STL ones (+1 f-score
point with respect to the STL counterparts).
External sentiment resources were not particularly
helpful in this task, as shown by the result obtained by
the 1L MTL NO SNT row. In the FB task,
Bi-LSTMs considerably outperformed the linear SVM (+5
f-score points on average); this is most probably due
to the longer texts found in this dataset
with respect to the Twitter one. For what
concerns the out-domain tasks, when testing models
trained on Twitter and tested on Facebook (C TW
column), we can notice an expected drop in
performance with respect to the models trained on
the FB dataset (15-20 f-score points). The
best result was achieved by the 2L MTL
configuration (+4 points w.r.t. the STL counterpart).
Finally, when testing the models trained on
Facebook and tested on Twitter (C FB column), the
linear SVM showed a huge drop in terms of
accuracy (-30 f-score points), while all the models
trained with Bi-LSTMs showed a performance drop
of approximately 12 f-score points. Also in this
setting the best result was achieved by an MTL
configuration (1L MTL), which performed better
with respect to the STL counterpart (+2 f-score
points). For what concerns the k-fold learning
strategy, we can notice that the results achieved by
the model not using the k-fold learning strategy
(1 STL NO K-FOLD) are always lower than those of the
counterpart which used the k-fold approach (+2.5
f-score points gained in the C TW task), showing
the benefits of using this technique.</p>
        <p>These results led us to choose the models for
the official runs on the provided test set. Table
10 reports the overall accuracies achieved by all
our classifier configurations on the official test set;
the official submitted runs are starred in the
table. The best official system row reports, for each
task, the best official results submitted by the
participants of the EVALITA 2018 HaSpeeDe shared
task. As we can note, the best scores in each task
were obtained by the Bi-LSTMs in the MTL
setting, showing that MTL networks seem to be more
effective than STL networks. For what
concerns the Twitter in-domain task, we obtained
results similar to the development set ones. A
considerable drop in performance is observed in the FB
task w.r.t. the development set (-5 f-score points
on average). Still, the Bi-LSTM models outperformed
the linear SVM model by 5 f-score points. In the
out-domain tasks, all the models performed
similarly to what was observed on the development set. It
is worth observing that the linear SVM performed
almost like a baseline system in the C FB task. In
addition, in the same task the model exploiting the
sentiment lexicon (1L MTL) showed a better
performance (+1.5 f-score points) w.r.t. the 1L MTL
NO SNT model. It is worth noticing that the
k-fold learning strategy was beneficial also on the
official test set: the 1L STL model obtained better
results (approximately +2 f-score points in each
task) w.r.t. the model that did not use the k-fold
learning strategy.</p>
      </sec>
      <sec id="sec-3-4">
        <title>2.4 IronITA</title>
        <p>We tested four learning configurations
of our system based on linear SVM and deep
neural network (DNN) learning algorithms using
the features described in sections 1.2.1 and 1.2.2.
To select the proposed classification models, we
used k-fold cross validation (k=4).</p>
        <p>Table 11 reports the overall average f-score
achieved by the models on the cross-validation
sets for both the irony and sarcasm detection
tasks.</p>
<p>We can observe that the SVM obtains good
results on irony detection, but the MTL neural
approach noticeably outperforms the SVM. We
also note that the use of the additional Polarity and
Hate Speech datasets leads to better performance.
These results led us to choose the MTL models
trained with the additional datasets for the two
official run submissions.</p>
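The MTL setting above optimizes a single shared representation against several label sets at once. The following numpy sketch illustrates the idea with a linear tanh layer standing in for the Bi-LSTM encoder and one logistic head per task; all names, dimensions, and the synthetic data are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two related binary tasks (think irony and polarity)
# defined over the same 64 inputs with 10 features each.
X = rng.normal(size=(64, 10))
w1 = rng.normal(size=10)
w2 = w1 + 0.5 * rng.normal(size=10)          # related but distinct task
targets = {"irony": (X @ w1 > 0).astype(float),
           "polarity": (X @ w2 > 0).astype(float)}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Shared encoder (a single tanh layer here, a Bi-LSTM in the paper)
# plus one logistic head per task.
W_shared = rng.normal(scale=0.1, size=(10, 8))
heads = {t: rng.normal(scale=0.1, size=8) for t in targets}

def joint_loss():
    h = np.tanh(X @ W_shared)
    return sum(float(np.mean((sigmoid(h @ heads[t]) - targets[t]) ** 2))
               for t in targets)

lr = 0.5
losses = [joint_loss()]
for _ in range(200):
    h = np.tanh(X @ W_shared)                      # shared representation
    grad_shared = np.zeros_like(W_shared)
    for task, w in heads.items():
        p = sigmoid(h @ w)
        err = (p - targets[task]) * p * (1 - p)    # dL/dlogits for MSE loss
        grad_shared += X.T @ (np.outer(err, w) * (1 - h ** 2)) / len(X)
        heads[task] = w - lr * h.T @ err / len(X)  # task-specific update
    W_shared -= lr * grad_shared                   # one update from ALL tasks
    losses.append(joint_loss())

print(f"joint loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The key line is the accumulation into `grad_shared`: the encoder receives gradient signal from every task's labels in the same update, which is what lets an auxiliary dataset (e.g. Polarity) regularize the representation used for the main task.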
<p>Table 12 reports the overall scores achieved
by all our classifier configurations on the
official test set; the official submitted runs are starred
in the table. The scores have been computed
in terms of F-score using the official
evaluation script. We submitted the runs MTL+Polarity
and MTL+Polarity+Hate. The run MTL+Polarity
ranked first in subtask A, and third in
subtask B on the official leaderboard. The run
MTL+Polarity+Hate ranked second in subtask A,
and fourth in subtask B on the official
leaderboard.</p>
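For reference, a class-averaged F-score over binary labels can be computed as below. This is an illustrative re-implementation, not the official evaluation script, and it assumes the reported score is the unweighted mean of the per-class F1 values (a common choice in EVALITA tasks).

```python
def f1(gold, pred, positive):
    """F1 of one class, treating `positive` as the positive label."""
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores (binary labels here)."""
    return (f1(gold, pred, 0) + f1(gold, pred, 1)) / 2

gold = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 1]
print(round(macro_f1(gold, pred), 4))  # -> 0.6667
```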
<p>The results on the test set confirm the good
performance of the SVM classifier on the irony
detection task and that the MTL neural approaches
outperform the SVM. The model trained on the
IronITA and SENTIPOLC datasets outperformed
all the systems that participated in subtask A,
while on subtask B it slightly underperformed
the best participant system. The model trained on
the IronITA, SENTIPOLC and HaSpeeDe datasets
outperformed all the systems that participated in
subtask A except our model trained on the IronITA
and SENTIPOLC datasets only. However, the best
scores in both tasks were obtained by the MTL
network trained on the IronITA dataset only:
although not submitted as an official run, this model
would have outperformed all the systems submitted
by all participants to both subtasks. It seems that
for these tasks the use of additional datasets
leads to overfitting.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 Conclusions</title>
<p>In this paper we reported the results of our
participation in the ABSITA, GxG, HaSpeeDe and
IronITA shared tasks of the EVALITA 2018
conference. By resorting to a system which uses
Support Vector Machines and Deep Neural
Networks (DNN) as learning algorithms, we achieved
the best scores in almost every task, showing the
effectiveness of our approach. In addition, when a
DNN was used as the learning algorithm, we
introduced a new multi-task learning approach and a
majority-vote classification approach to further
improve the overall accuracy of our system. The
proposed system proved to be a very effective solution,
achieving the first position in almost all subtasks
of each shared task.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We gratefully acknowledge the support of
NVIDIA Corporation with the donation of the
Tesla K40 used for this research.</p>
    </sec>
  </body>
  <back>
  </back>
</article>