Multi-task Learning in Deep Neural Networks at EVALITA 2018

Andrea Cimino*, Lorenzo De Mattei*†, and Felice Dell'Orletta*
* Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC-CNR), ItaliaNLP Lab - www.italianlp.it
† Dipartimento di Informatica, Università di Pisa
{andrea.cimino,felice.dellorletta}@ilc.cnr.it, lorenzo.demattei@di.unipi.it

Abstract

English. In this paper we describe the system used for our participation in the ABSITA, GxG, HaSpeeDe and IronITA shared tasks of the EVALITA 2018 conference. We developed a classifier that can be configured to use Bidirectional Long Short Term Memories (Bi-LSTMs) and linear Support Vector Machines (SVMs) as learning algorithms. When using Bi-LSTMs, we tested a multi-task learning approach, which learns the optimized parameters of the network by exploiting all the annotated dataset labels simultaneously, and a multi-classifier voting approach based on a k-fold technique. In addition, we developed generic and task-specific word embedding lexicons to further improve classification performance. When evaluated on the official test sets, our system ranked 1st in almost all subtasks of each shared task, showing the effectiveness of our approach.

1 Description of the System

The EVALITA 2018 edition has been one of the most successful editions in terms of the number of shared tasks proposed. In particular, a large part of the tasks proposed by the organizers can be tackled as binary document classification tasks. This gave us the opportunity to test a new system specifically designed for this EVALITA edition.

We implemented a system which relies on Bi-LSTMs (Hochreiter and Schmidhuber, 1997) and SVMs, which are widely used learning algorithms in document classification. The learning algorithm can be selected in a configuration file. In this work we used the Keras (Chollet, 2016) library and the liblinear (Fan et al., 2008) library to generate the Bi-LSTM and SVM statistical models, respectively. Since our approach relies on morphosyntactically tagged text, training and test data were automatically tagged by the PoS tagger described in (Cimino and Dell'Orletta, 2016). We also developed sentiment polarity and word embedding lexicons with the aim of improving the overall accuracy of our system.

Some specific adaptations were made to account for the characteristics of each shared task. In the Aspect-based Sentiment Analysis (ABSITA) 2018 shared task (Basile et al., 2018), participants were asked, given a training set of Booking hotel reviews, to detect the aspect categories mentioned in a review among a set of 8 fixed categories (ACD task) and to assign a polarity (positive, negative, neutral or positive-negative) to each detected aspect (ACP task). Since each Booking review in the training set is labeled with 24 binary labels (8 indicating the presence of an aspect, 8 indicating positivity and 8 indicating negativity w.r.t. an aspect), we addressed the ABSITA 2018 shared task as 24 binary classification problems. Due to the label constraints in the dataset, if our system classified an aspect as not present, we forced the related positive and negative labels to be classified as not positive and not negative.
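The label post-processing described above can be made concrete with a minimal sketch (the function and variable names are ours, not taken from the authors' code): each review carries 8 aspect-presence labels plus 8 positivity and 8 negativity labels, and whenever an aspect is predicted as absent, the corresponding positive and negative labels are forced to 0.

```python
# Minimal sketch of the ABSITA label-constraint rule: an aspect predicted
# as absent cannot carry a positive or a negative polarity label.

ASPECTS = 8  # fixed number of ABSITA aspect categories

def enforce_label_constraints(presence, positive, negative):
    """Force positive/negative predictions to 0 for aspects predicted absent."""
    assert len(presence) == len(positive) == len(negative) == ASPECTS
    positive = [p if a else 0 for a, p in zip(presence, positive)]
    negative = [n if a else 0 for a, n in zip(presence, negative)]
    return presence, positive, negative
```

For instance, a positivity prediction on an aspect classified as absent is zeroed out, so the 24 binary outputs always satisfy the dataset's label constraints.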
The Gender X-Genre (GxG) 2018 shared task (Dell'Orletta and Nissim, 2018) consisted in the automatic identification of the gender of the author of a text (Female or Male). Five different training and test sets were provided by the organizers for five different genres: Children essays (CH), Diary (DI), Journalism (JO), Twitter posts (TW) and YouTube comments (YT). For each test set, participants were requested to submit a system trained on the in-domain training dataset and a system trained on cross-domain data only.

The IronITA task (Cignarella et al., 2018) consisted of two tasks. In the first task, participants had to automatically label a message as ironic or not. The second task was more fine-grained: given a message, participants had to classify whether the message is sarcastic, ironic but not sarcastic, or not ironic.

Finally, the HaSpeeDe 2018 shared task (Bosco et al., 2018) consisted in automatically annotating messages from Twitter and Facebook with a boolean value indicating the presence (or not) of hate speech. In particular, three tasks were proposed: HaSpeeDe-FB, where only the Facebook dataset could be used to classify Facebook comments; HaSpeeDe-TW, where only Twitter data could be used to classify tweets; and Cross-HaSpeeDe, where only the Facebook dataset could be used to classify the Twitter test set and vice versa (Cross-HaSpeeDe_FB, Cross-HaSpeeDe_TW).

1.1 Lexical Resources

1.1.1 Automatically Generated Sentiment Polarity Lexicons for Social Media

For the purpose of modeling word usage in generic, positive and negative contexts of social media texts, we developed three lexicons, which we named TW_GEN, TW_NEG and TW_POS. Each lexicon reports the relative frequency of a word in one of three different corpora. The main idea behind building these lexicons is that positive and negative words should present a higher relative frequency in TW_POS and TW_NEG, respectively. The three corpora were generated by first downloading approximately 50,000,000 tweets and then applying some filtering rules to the downloaded tweets to build the positive and negative corpora (no filtering rules were applied to build the generic corpus). In order to build a corpus of positive tweets, we constrained the downloaded tweets to contain at least one positive emoji among hearts and kisses. Since emojis are rarely used in negative tweets, to build the negative tweets corpus we created a list of commonly used words in negative language and constrained these tweets to contain at least one of these words.

1.1.2 Automatically Translated Sentiment Polarity Lexicons

The Multi-Perspective Question Answering (hereafter referred to as MPQA) Subjectivity Lexicon (Wilson et al., 2005) consists of approximately 8,200 English words with their associated polarity. To use this resource for the Italian language, we translated all the entries through the Yandex translation service (http://api.yandex.com/translate/).

1.1.3 Word Embedding Lexicons

We generated four word embedding lexicons using the word2vec toolkit (Mikolov et al., 2013) (http://code.google.com/p/word2vec/). As recommended in (Mikolov et al., 2013), we used the CBOW model, which learns to predict the word in the middle of a symmetric window based on the sum of the vector representations of the words in the window. For our experiments, we considered a context window of 5 words. The word embedding lexicons were built starting from the following corpora, which were tokenized and PoS-tagged by the PoS tagger for Twitter described in (Cimino and Dell'Orletta, 2016):

• The first lexicon was built using the itWaC corpus (http://wacky.sslmit.unibo.it/doku.php?id=corpora), a 2 billion word corpus constructed from the Web by limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds.

• The second lexicon was built using the set of 50,000,000 tweets we downloaded to build the sentiment polarity lexicons described in subsection 1.1.1.

• The third and fourth lexicons were built using a corpus of 538,835 Booking reviews scraped from the web. Since each review on the Booking site is split into a positive section (indicated by a plus mark) and a negative section (indicated by a minus mark), we split these reviews, obtaining 338,494 positive reviews and 200,341 negative reviews. Starting from the positive and the negative reviews, we finally obtained two different word embedding lexicons.
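The corpus filtering and relative-frequency computation of subsection 1.1.1 can be sketched as follows. This is our own minimal illustration, assuming whitespace tokenization and tiny illustrative emoji/word seed sets; the authors' actual filtering rules and tokenizer are richer.

```python
# Sketch of the TW_GEN / TW_POS / TW_NEG construction: filter tweets into
# positive/negative corpora, then map each token to its relative frequency.
from collections import Counter

POSITIVE_EMOJIS = {"❤", "😘"}        # hearts and kisses (illustrative subset)
NEGATIVE_WORDS = {"odio", "schifo"}  # illustrative negative-language seeds

def build_lexicon(tweets):
    """Map each token to its relative frequency in the given corpus."""
    counts = Counter(tok for tweet in tweets for tok in tweet.lower().split())
    total = sum(counts.values()) or 1
    return {tok: c / total for tok, c in counts.items()}

def build_corpora(tweets):
    generic = list(tweets)  # no filtering rules for the generic corpus
    positive = [t for t in tweets if POSITIVE_EMOJIS & set(t)]
    negative = [t for t in tweets if NEGATIVE_WORDS & set(t.lower().split())]
    return (build_lexicon(generic),
            build_lexicon(positive),
            build_lexicon(negative))
```

The intuition is directly visible: a genuinely positive word ends up with a higher relative frequency in the positive lexicon than in the generic one.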
Each entry of the lexicons maps a pair (word, PoS) to the associated word embedding, which mitigates polysemy problems that can lead to poorer classification results. In addition, both corpora were preprocessed in order to 1) map each URL to the token "URL" and 2) distinguish between all-uppercase and non-uppercase words (e.g., "mai" vs "MAI"), since all-uppercase words are usually used in negative contexts. Since each task has its own characteristics in terms of the information that the classifiers need to capture, we decided to use a subset of the word embeddings in each task. Table 1 sums up the word embeddings used in each shared task.

Task       Booking  ITWAC  Twitter
ABSITA     yes      yes    no
GxG        no       yes    yes
HaSpeeDe   no       yes    yes
IronITA    no       yes    yes

Table 1: Word embedding lexicons used by our system in each shared task.

1.2 The Classifier

The classifier we built for our participation in the tasks was designed with the aim of testing different learning algorithms and learning strategies. More specifically, our classifier implements two workflows which allow testing SVMs and recurrent neural networks as learning algorithms. In addition, when recurrent neural networks are chosen as the learning algorithm, our classifier allows performing neural network multi-task learning (MTL) using an external dataset in order to share knowledge between related tasks. We decided to test the MTL strategy since, as demonstrated in (De Mattei et al., 2018), it can improve the performance of the classifier on emotion recognition tasks. The benefits of this approach were also investigated by Søgaard and Goldberg (2016), who showed that MTL is appealing since it allows incorporating prior knowledge about task hierarchies into neural network architectures. Furthermore, Ruder et al. (2017) showed that MTL is useful to combine even loosely related tasks, letting the networks automatically learn the task hierarchy.

Both workflows we implemented share a common pattern used in machine learning classifiers, consisting of a document feature extraction phase and a learning phase based on the extracted features. However, since SVMs and Bi-LSTMs take as input 2-dimensional and 3-dimensional tensors respectively, a different feature extraction phase is involved for each considered algorithm. In addition, when the Bi-LSTM workflow is selected, the classifier can take as input an extra file which is used to exploit the MTL approach. Furthermore, when the Bi-LSTM workflow is selected, the classifier performs a 5-fold training approach. More precisely, we build 5 different models using different training and validation sets. These models are then exploited in the classification phase: the assigned labels are the ones that obtain the majority among all the models. The 5-fold strategy was chosen in order to generate a global model which should be less prone to overfitting or underfitting w.r.t. a single learned model.

1.2.1 The SVM classifier

The SVM classifier exploits a wide set of features ranging across different levels of linguistic description. With the exception of the word embedding combination, these features were already tested in our previous participation in the EVALITA 2016 SENTIPOLC edition (Cimino et al., 2016). The features are organised into three main categories: raw and lexical text features, morpho-syntactic features and lexicon features. Due to size constraints, we report only the feature names.

Raw and Lexical Text Features: number of tokens, character n-grams, word n-grams, lemma n-grams, repetition of n-gram characters, number of mentions, number of hashtags, punctuation.

Morpho-syntactic Features: coarse-grained Part-Of-Speech n-grams, fine-grained Part-Of-Speech n-grams, coarse-grained Part-Of-Speech distribution.

Lexicon Features: emoticon presence, lemma sentiment polarity n-grams, polarity modifier, PMI score, sentiment polarity distribution, most frequent sentiment polarity, sentiment polarity in text sections, word embedding combination.
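The 5-fold training and majority-voting strategy described in Section 1.2 can be sketched as follows (a minimal, framework-agnostic illustration written by us; the actual system builds five Bi-LSTM or SVM models, one per split):

```python
# Sketch of the 5-fold training strategy: build k train/validation splits,
# train one model per split, then combine predictions by majority vote.
from collections import Counter

def k_fold_splits(data, k=5):
    """Yield (train, validation) pairs, one per fold."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

def majority_vote(predictions):
    """predictions: one label list per model; return per-instance majority."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]
```

At classification time, each of the five models labels the test documents independently and `majority_vote` picks, per document, the label assigned by most models.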
1.2.2 The Deep Neural Network classifier

We tested two different models based on Bi-LSTMs: one that learns to classify the labels without sharing information across labels in the training phase (single-task learning, STL), and one that learns to classify the labels exploiting the related information through a shared Bi-LSTM (multi-task learning, MTL). We employed Bi-LSTM architectures since they allow capturing long-range dependencies from both directions of a document by constructing bidirectional links in the network (Schuster and Paliwal, 1997). We applied a dropout factor to both the input gates and the recurrent connections in order to prevent overfitting, a typical issue in neural networks (Gal and Ghahramani, 2015). We chose a dropout factor value of 0.50.

For GxG, as we had to deal with longer documents such as news articles, we employed a two-layer Bi-LSTM encoder: the first Bi-LSTM layer encodes each sentence as a token sequence, and the second layer encodes the sequence of sentences. For IronITA, we added a task-specific Bi-LSTM for each subtask before the dense layer. Figure 1 shows a graphical representation of the STL and MTL architectures we employed: in the STL model each label has its own Bi-LSTM and dense layer, while in the MTL model a shared Bi-LSTM feeds a separate dense layer for each label L1 ... Ln. For the optimization process, the binary cross-entropy function is used as loss function and optimization is performed by the rmsprop optimizer (Tieleman and Hinton, 2012).

Figure 1: STL and MTL architectures. (a) STL Model; (b) MTL Model.

Each input word is represented by a vector which is composed of:

Word embeddings: the concatenation of the word embeddings extracted from the available word embedding lexicons (128 dimensions for each word embedding); for each word embedding an extra component was added to handle the "unknown word" (1 dimension for each lexicon used).

Word polarity: the corresponding word polarity obtained by exploiting the sentiment polarity lexicons. This results in 3 components, one for each possible lexicon outcome (negative, neutral, positive) (3 dimensions). We assumed that a word not found in the lexicons has neutral polarity.

Automatically generated sentiment polarity lexicons for social media: the presence or absence of the word in a lexicon, and its relative frequency if the word is found in the lexicon. Since we built TW_GEN, TW_POS and TW_NEG, 6 dimensions are needed, 2 for each lexicon.

Coarse-grained Part-of-Speech: 13 dimensions.

End of sentence: a component (1 dimension) indicating whether the sentence was completely read.

2 Results and Discussion

Table 2 reports the official results obtained by our best runs on all the tasks in which we participated. As can be noted, our system performed extremely well, achieving the best score in almost every single subtask. In the following subsections, a discussion of the results obtained in each task is provided.

Task              Our Score  Best Score  Rank
ABSITA
  ACD             0.811      0.811       1
  ACP             0.767      0.767       1
GxG IN-DOMAIN
  CH              0.640      0.640       1
  DI              0.676      0.676       1
  JO              0.555      0.585       2
  TW              0.595      0.595       1
  YT              0.555      0.555       1
GxG CROSS-DOMAIN
  CH              0.640      0.640       1
  DI              0.595      0.635       2
  JO              0.510      0.515       2
  TW              0.609      0.609       1
  YT              0.513      0.513       1
HaSpeeDe
  TW              0.799      0.799       1
  FB              0.829      0.829       1
  C_TW            0.699      0.699       1
  C_FB            0.607      0.654       5
IronITA
  IRONY           0.730      0.730       1
  SARCASM         0.516      0.520       3

Table 2: Classification results of our best runs on the ABSITA, GxG, HaSpeeDe and IronITA test sets.

2.1 ABSITA

We tested five learning configurations of our system based on linear SVM and DNN learning algorithms, using the features described in sections 1.2.1 and 1.2.2. All the experiments were aimed at testing the contribution, in terms of F-score, of MTL vs STL, the k-fold technique and the external resources. For the Bi-LSTM learning algorithm, we tested Bi-LSTMs in both the STL and MTL scenarios. In addition, to test the contribution of the Booking word embeddings, we created a configuration which uses a shallow Bi-LSTM in the MTL setting without these embeddings (MTL NO BOOKING-WE). Finally, to test the contribution of the k-fold technique, we created a configuration which does not use it (MTL NO K-FOLD). To obtain fair comparisons in this last case, we ran all the experiments 5 times and averaged the scores of the runs.
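As a concrete recap of the per-word input representation described in Section 1.2.2, the feature blocks can be concatenated as in the sketch below. The helper names and data layout are ours; the dimensions (128 per embedding lexicon plus a 1-dimensional unknown-word flag, 3 for polarity, 2 per TW lexicon, 13 for coarse PoS, 1 for end of sentence) follow the paper.

```python
# Sketch of the word-level input vector: concatenation of embeddings,
# unknown-word flags, polarity one-hot, TW lexicon features, PoS one-hot
# and an end-of-sentence component.
EMB_DIM = 128
POLARITIES = ["negative", "neutral", "positive"]

def word_vector(word, pos_tag, embeddings, polarity_lexicon, tw_lexicons,
                pos_tags, end_of_sentence):
    vec = []
    # one 128-d embedding block per lexicon, plus a 1-d "unknown word" flag
    for lex in embeddings:
        known = (word, pos_tag) in lex
        vec += lex.get((word, pos_tag), [0.0] * EMB_DIM)
        vec.append(0.0 if known else 1.0)
    # 3-d one-hot word polarity (neutral if the word is out of the lexicon)
    polarity = polarity_lexicon.get(word, "neutral")
    vec += [1.0 if p == polarity else 0.0 for p in POLARITIES]
    # 2 dims per TW lexicon: presence flag and relative frequency
    for lex in tw_lexicons:
        vec += [1.0 if word in lex else 0.0, lex.get(word, 0.0)]
    # 13-d one-hot coarse-grained PoS
    vec += [1.0 if t == pos_tag else 0.0 for t in pos_tags]
    # 1-d end-of-sentence component
    vec.append(1.0 if end_of_sentence else 0.0)
    return vec
```

With two embedding lexicons and the three TW lexicons, a word vector has 2 x (128 + 1) + 3 + 6 + 13 + 1 = 281 dimensions.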
To test the proposed classification models, we created an internal development set by randomly selecting documents from the training sets distributed by the task organizers. The resulting development set is composed of approximately 10% (561 documents) of the whole training set.

Table 3 reports the overall accuracies achieved by the models on the internal development set for all the tasks. In addition, the results of a baseline system (baseline row), which always emits the most probable label according to the label distribution in the training set, are reported. The accuracy is calculated as the micro F-score obtained using the evaluation tool provided by the organizers.

Configuration        ACD    ACP
baseline             0.313  0.197
linear SVM           0.797  0.739
STL                  0.821  0.795
MTL                  0.824  0.804
MTL NO K-FOLD        0.819  0.782
MTL NO BOOKING-WE    0.817  0.757

Table 3: Classification results (micro F-score) of the different learning models on our ABSITA development set.

For the ACD task, it is worth noting that the models based on DNNs always outperform the linear SVM, even though the difference in terms of F-score is small (approximately 2 F-score points). The MTL configuration was the best performing among all the models, but the difference in terms of F-score among the DNN configurations is not marked.

When analyzing the results obtained on the ACP task, we can notice remarkable differences among the performances of the models. Again the linear SVM was the worst performing model, but this time with a difference in terms of F-score of 6 points with respect to MTL, the best performing model on the task. It is interesting to notice that the results achieved by the DNN models differ more from each other in terms of F-score than in the ACD task: this suggests that the external resources and the k-fold technique contributed significantly to obtaining the best result in the ACP task. The configuration that does not use the k-fold technique scored 2 F-score points less than the MTL configuration. We can also notice that the Booking word embeddings were particularly helpful in this task: the MTL NO BOOKING-WE configuration in fact scored 5 points less than the best configuration. The results obtained on the internal development set led us to choose the models for the official runs on the provided test set.

Table 4 reports the overall accuracies achieved by all our classifier configurations on the official test set; the officially submitted runs are starred in the table.

Configuration        ACD     ACP
baseline             0.338   0.199
linear SVM           0.772*  0.686*
STL                  0.814   0.765
MTL                  0.811*  0.767*
MTL NO K-FOLD        0.801   0.755
MTL NO BOOKING-WE    0.808   0.753

Table 4: Classification results (micro F-score) of the different learning models on the ABSITA official test set.

As can be noticed, the best scores in both the ACD and ACP tasks were obtained by the DNN models. Surprisingly, the differences in terms of F-score were reduced in both tasks, with the exception of the linear SVM, which performed 4 and 8 F-score points worse in the ACD and ACP tasks respectively when compared to the best DNN model.
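The micro F-score reported in Tables 3 and 4 pools true positives, false positives and false negatives across all 24 binary decisions before computing precision and recall. A minimal re-implementation sketch (ours, not the organizers' evaluation tool):

```python
# Micro-averaged F-score over per-document binary label vectors.
def micro_f_score(gold, predicted):
    """gold, predicted: lists of binary label vectors (one per document)."""
    tp = fp = fn = 0
    for g_vec, p_vec in zip(gold, predicted):
        for g, p in zip(g_vec, p_vec):
            tp += g and p
            fp += (not g) and p
            fn += g and (not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because counts are pooled, frequent labels dominate the score, which is the intended behavior of micro averaging in this multi-label setting.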
The STL model outperformed the MTL models in the ACD task, even though the difference in terms of F-score is not relevant. When the results on the ACP task are considered, the MTL model outperformed all the other models, even though the difference in terms of F-score with respect to the STL model is not noticeable. It is worth noticing that the k-fold technique and the Booking word embeddings again seemed to contribute to the final accuracy of the MTL system. This can be seen by looking at the results achieved by the MTL NO BOOKING-WE model and the MTL NO K-FOLD model, which scored 1.2 and 1.5 F-score points less than the MTL system.

2.2 GxG

We tested three different learning configurations of our system based on linear SVM and DNN learning algorithms, using the features described in sections 1.2.1 and 1.2.2. For the Bi-LSTM learning algorithm, we tested both the STL and MTL approaches. We tested the three configurations for each of the 5 in-domain subtasks and each of the 5 cross-domain subtasks. To test the proposed classification models, we created internal development sets by randomly selecting documents from the training sets distributed by the task organizers. The resulting development sets are composed of approximately 10% of each dataset. For the in-domain tasks, we trained the SVM classifier both on in-domain data only and on both in-domain and cross-domain data.

Model  CH     DI     JO     TW     YT
SVMa   0.667  0.626  0.485  0.582  0.611
SVM    0.701  0.737  0.560  0.728  0.619
STL    0.556  0.545  0.500  0.724  0.596
MTL    0.499  0.817  0.625  0.729  0.632

Table 5: Classification results of the different learning models on the development set in terms of accuracy for the in-domain tasks.

Model  CH     DI     JO     TW     YT
SVM    0.530  0.565  0.580  0.588  0.568
STL    0.550  0.535  0.505  0.625  0.580
MTL    0.523  0.549  0.538  0.500  0.556

Table 6: Classification results of the different learning models on the development set in terms of accuracy for the cross-domain tasks.

Tables 5 and 6 report the overall accuracy, computed as the average accuracy for the two classes (male and female), achieved by the models on the development sets for the in-domain and cross-domain tasks respectively. For the in-domain tasks, we observe that the SVM performs well on the smaller datasets (Children and Diary), while the MTL neural network has the best overall performance. When trained on all the datasets, in- and cross-domain, the SVM (SVMa) performs worse than when trained on in-domain data only (SVM). For the cross-domain datasets, we observe poor performances over all the subtasks with all the employed models, implying that the models have difficulties in cross-domain generalization.

Model  CH      DI      JO     TW      YT
SVMa   0.545   0.514   0.475  0.539   0.585
SVM    0.550   0.649   0.555  0.567   0.555*
STL    0.545   0.541   0.500  0.595*  0.512
MTL    0.640*  0.676*  0.470  0.561   0.546

Table 7: Classification results of the different learning models on the official test set in terms of accuracy for the in-domain tasks (* marks runs that outperformed all the systems that participated in the task).

Model  CH      DI     JO     TW      YT
SVM    0.540   0.514  0.505  0.586   0.513*
STL    0.640*  0.554  0.495  0.609*  0.510
MTL    0.535   0.595  0.510  0.500   0.500

Table 8: Classification results of the different learning models on the official test set in terms of accuracy for the cross-domain tasks (* marks runs that outperformed all the systems that participated in the task).

Tables 7 and 8 report the overall accuracy, computed as the average accuracy for the two classes (male and female), achieved by the models on the official test sets for the in-domain and cross-domain tasks respectively (* marks the runs that obtained the best results in the competition).
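The GxG score, the accuracy averaged over the male and female classes, can be computed as in this minimal sketch of ours (not the official scorer):

```python
# Average per-class accuracy: compute accuracy separately on the instances
# of each gold class, then take the mean of the two values.
def average_class_accuracy(gold, predicted, classes=("M", "F")):
    per_class = []
    for c in classes:
        instances = [(g, p) for g, p in zip(gold, predicted) if g == c]
        correct = sum(1 for g, p in instances if g == p)
        per_class.append(correct / len(instances) if instances else 0.0)
    return sum(per_class) / len(per_class)
```

Averaging per class keeps the metric insensitive to class imbalance: a system that always predicts the majority gender scores only 0.5.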
For the in-domain subtasks, the performances appear not to be in line with the ones obtained on the development set, but our models still outperform the other participants' systems in four out of five subtasks. The MTL model provided the best results for the Children and Diary test sets, while on the other test sets all the models performed quite poorly. Again, when trained on all the datasets, in- and cross-domain, the SVM (SVMa) performs worse than when trained on in-domain data only (SVM). For the cross-domain subtasks, while our model obtains the best performance in three out of five subtasks, the results confirm poor performances over all the subtasks, again indicating that the models have difficulties in cross-domain generalization.

2.3 HaSpeeDe

We tested seven learning configurations of our system based on linear SVM and DNN learning algorithms, using the features described in sections 1.2.1 and 1.2.2. All the experiments were aimed at testing the contribution, in terms of F-score, of the number of layers, MTL vs STL, the k-fold technique and the external resources. For the Bi-LSTM learning algorithm, we tested one- and two-layer Bi-LSTMs in both the STL and MTL scenarios. In addition, to test the contribution of the sentiment lexicon features, we created a configuration which uses a one-layer Bi-LSTM in the MTL setting without these features (1L MTL NO SNT). Finally, to test the contribution of the k-fold technique, we created a configuration which does not use it (1L STL NO K-FOLD). To obtain fair results in this last case, we ran all the experiments 5 times and averaged the scores of the runs. To test the proposed classification models, we created two internal development sets, one for each dataset, by randomly selecting documents from the training sets distributed by the task organizers. The resulting development sets are composed of 10% (300 documents) of the whole training sets.

Configuration      TW     FB     C_TW   C_FB
baseline           0.378  0.345  0.345  0.378
linear SVM         0.800  0.813  0.617  0.503
1L STL             0.774  0.860  0.683  0.647
2L STL             0.790  0.860  0.672  0.597
1L MTL             0.783  0.860  0.672  0.663
2L MTL             0.796  0.853  0.710  0.613
1L MTL NO SNT      0.793  0.857  0.651  0.661
1L STL NO K-FOLD   0.771  0.846  0.657  0.646

Table 9: Classification results of the different learning models on our HaSpeeDe development sets in terms of F1-score.

Configuration         TW      FB      C_TW    C_FB
baseline              0.403   0.404   0.404   0.403
best official system  0.799   0.829   0.699   0.654
linear SVM            0.798*  0.761   0.658   0.451
1L STL                0.793   0.811*  0.669*  0.607*
2L STL                0.791   0.812   0.644   0.561
1L MTL                0.788   0.818   0.707   0.635
2L MTL                0.799*  0.829*  0.699*  0.585*
1L MTL NO SNT         0.801   0.808   0.709   0.620
1L STL NO K-FOLD      0.785   0.806   0.652   0.583

Table 10: Classification results of the different learning models on the official HaSpeeDe test set in terms of F1-score.

Table 9 reports the overall accuracies achieved by the models on our internal development sets for all the tasks. In addition, the results of a baseline system (baseline row), which always emits the most probable label according to the label distribution in the training set, are reported. The accuracy is calculated as the F-score obtained using the evaluation tool provided by the organizers.

For the Twitter in-domain task (TW column in the table), it is worth noting that the linear SVM outperformed all the configurations based on Bi-LSTMs. In addition, the MTL architecture results are slightly better than the STL ones (+1 F-score point with respect to the STL counterparts). External sentiment resources were not particularly helpful in this task, as shown by the result obtained by the 1L MTL NO SNT configuration. In the FB task, Bi-LSTMs sensibly outperformed the linear SVM (+5 F-score points on average); this is most probably due to the longer texts found in this dataset with respect to the Twitter one. For the out-of-domain tasks, when testing models trained on Twitter and tested on Facebook (C_TW column), we can notice an expected drop in performance with respect to the models trained on the FB dataset (15-20 F-score points). The best result was achieved by the 2L MTL configuration (+4 points w.r.t. the STL counterpart). Finally, when testing the models trained on Facebook and tested on Twitter (C_FB column), the linear SVM showed a huge drop in terms of accuracy (-30 F-score points), while all the models trained with Bi-LSTMs showed a performance drop of approximately 12 F-score points. Also in this setting the best result was achieved by an MTL configuration (1L MTL), which performed better than its STL counterpart (+2 F-score points). For the k-fold learning strategy, we can notice that the results achieved by the model not using it (1L STL NO K-FOLD) are always lower than those of its counterpart which used the k-fold approach (+2.5 F-score points gained in the C_TW task), showing the benefits of this technique.

These results led us to choose the models for the official runs on the provided test set.
Table 10 reports the overall accuracies achieved by all our classifier configurations on the official test set; the officially submitted runs are starred in the table. The "best official system" row reports, for each task, the best official result submitted by the participants of the EVALITA 2018 HaSpeeDe shared task. As we can note, the best score in each task was obtained by a Bi-LSTM in the MTL setting, showing that MTL networks seem to be more effective than STL networks. For the Twitter in-domain task, we obtained results similar to the development set ones. A sensible drop in performance is observed in the FB task w.r.t. the development set (-5 F-score points on average). Still, the Bi-LSTM models outperformed the linear SVM model by 5 F-score points. In the out-of-domain tasks, all the models performed similarly to what was observed on the development set. It is worth observing that the linear SVM performed almost like a baseline system in the C_FB task. In addition, in the same task the model exploiting the sentiment lexicon (1L MTL) showed a better performance (+1.5 F-score points) w.r.t. the 1L MTL NO SNT model. It is worth noticing that the k-fold learning strategy was beneficial also on the official test set: the 1L STL model obtained better results (approximately +2 F-score points in each task) w.r.t. the model that did not use the k-fold learning strategy.

2.4 IronITA

We tested the four designed learning configurations of our system based on linear SVM and deep neural network (DNN) learning algorithms, using the features described in sections 1.2.1 and 1.2.2. To select the proposed classification models, we used k-fold cross-validation (k=4).

Table 11 reports the overall average F-score achieved by the models under cross-validation for both the irony and sarcasm detection tasks.

Configuration       Irony  Sarcasm
linear SVM          0.734  0.512
MTL                 0.745  0.530
MTL+Polarity        0.757  0.562
MTL+Polarity+Hate   0.760  0.557

Table 11: Classification results of the different learning models under k-fold cross-validation in terms of average F1-score.

We can observe that the SVM obtains good results on irony detection, but the MTL neural approach sensibly outperforms the SVM. We also note that the usage of the additional polarity and hate speech datasets leads to better performances. These results led us to choose the MTL models trained with the additional datasets for the two official run submissions.

Table 12 reports the overall accuracies achieved by all our classifier configurations on the official test set; the officially submitted runs are starred in the table. The accuracies have been computed in terms of F-score using the official evaluation script. We submitted the runs MTL+Polarity and MTL+Polarity+Hate. The run MTL+Polarity ranked first in subtask A and third in subtask B on the official leaderboard. The run MTL+Polarity+Hate ranked second in subtask A and fourth in subtask B on the official leaderboard.

Configuration       Irony   Sarcasm
baseline-random     0.505   0.337
baseline-mfc        0.334   0.223
best participant    0.730   0.520
linear SVM          0.701   0.493
MTL                 0.736   0.530
MTL+Polarity        0.730*  0.516*
MTL+Polarity+Hate   0.713*  0.503*

Table 12: Classification results of the different learning models on the official test set in terms of F1-score (* submitted runs).

The results on the test set confirm the good performance of the SVM classifier on the irony detection task, and that the MTL neural approaches outperform the SVM. The model trained on the IronITA and SENTIPOLC datasets outperformed all the systems that participated in subtask A, while on subtask B it slightly underperformed the best participant system.
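The IronITA evaluation reports the F1-score averaged over the two classes (e.g. ironic and not ironic). As a minimal illustration (our own sketch, not the official evaluation script), the averaged F1 can be computed as:

```python
# Per-class F1 scores averaged over the two classes (macro F1).
def averaged_f1(gold, predicted, classes=(0, 1)):
    scores = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, predicted) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, predicted) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, predicted) if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)
```

Unlike the micro F-score used for ABSITA, this macro variant weights both classes equally, so a system that ignores the minority class is penalized.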
subtask A, except our model trained on the IronITA and SENTIPOLC datasets only. However, the best scores in both tasks were obtained by the MTL network trained on the IronITA dataset only: this model would have outperformed all the systems submitted to both subtasks by all participants. It seems that for these tasks the use of additional datasets leads to overfitting issues.

3 Conclusions

In this paper we reported the results of our participation in the ABSITA, GxG, HaSpeeDe and IronITA shared tasks of the EVALITA 2018 conference. By resorting to a system that uses Support Vector Machines and Deep Neural Networks (DNN) as learning algorithms, we achieved the best scores in almost every task, showing the effectiveness of our approach. In addition, when a DNN was used as the learning algorithm, we introduced a new multi-task learning approach and a majority-vote classification approach to further improve the overall accuracy of our system. The proposed system proved to be a very effective solution, achieving the first position in almost all sub-tasks of each shared task.

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

References

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTiment POLarity Classification Task. In Proceedings of EVALITA '16, Evaluation of NLP and Speech Tools for Italian. December, Naples, Italy.

Pierpaolo Basile, Valerio Basile, Danilo Croce and Marco Polignano. 2018. Overview of the EVALITA 2018 Aspect-based Sentiment Analysis (ABSITA) Task. T. Caselli, N. Novielli, V. Patti, P. Rosso (eds). In Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18). December, Turin, Italy.

François Chollet. 2016. Keras. Software available at https://github.com/fchollet/keras/tree/master/keras.

Alessandra Cignarella, Simona Frenda, Valerio Basile, Cristina Bosco, Viviana Patti and Paolo Rosso. 2018. Overview of the EVALITA 2018 Task on Irony Detection in Italian Tweets (IronITA). T. Caselli, N. Novielli, V. Patti, P. Rosso (eds). In Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18).

Andrea Cimino and Felice Dell'Orletta. 2016. Tandem LSTM-SVM Approach for Sentiment Analysis. In Proceedings of EVALITA '16, Evaluation of NLP and Speech Tools for Italian. December, Naples, Italy.

Andrea Cimino and Felice Dell'Orletta. 2016. Building the state-of-the-art in POS tagging of Italian Tweets. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.

Lorenzo De Mattei, Andrea Cimino and Felice Dell'Orletta. 2018. Multi-Task Learning in Deep Neural Networks for Sentiment Polarity and Irony Classification. In Proceedings of the 2nd Workshop on Natural Language for Artificial Intelligence, Trento, Italy, November 22-23, 2018.

Felice Dell'Orletta and Malvina Nissim. 2018. Overview of the EVALITA 2018 Cross-Genre Gender Prediction in Italian (GxG) Task. T. Caselli, N. Novielli, V. Patti, P. Rosso (eds). In Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18).

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871-1874.

Yarin Gal and Zoubin Ghahramani. 2015. A theoretically grounded application of dropout in recurrent neural networks. arXiv preprint arXiv:1512.05287.

Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural Computation, 9(8):1735-1780.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. T. Caselli, N. Novielli, V. Patti, P. Rosso (eds). In Proceedings of EVALITA '18, Evaluation of NLP and Speech Tools for Italian. December, Turin, Italy.

Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani and Veselin Stoyanov. 2016. SemEval-2016 task 4: Sentiment analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016).

Sebastian Ruder, Joachim Bingel, Isabelle Augenstein and Anders Søgaard. 2017. Sluice networks: Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673-2681.

Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 231-235, Berlin, Germany.

Duyu Tang, Bing Qin and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of EMNLP 2015, 1422-1432, Lisbon, Portugal.

Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-RmsProp: Divide the gradient by a running average of its recent magnitude. In COURSERA: Neural Networks for Machine Learning.

Theresa Wilson, Janyce Wiebe and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of HLT-EMNLP 2005, 347-354, Stroudsburg, PA, USA. ACL.

XingYi Xu, HuiZhi Liang and Timothy Baldwin. 2016. UNIMELB at SemEval-2016 Tasks 4A and 4B: An Ensemble of Neural Networks and a Word2Vec Based Model for Sentiment Classification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016).