=Paper=
{{Paper
|id=Vol-2263/paper013
|storemode=property
|title=Multi-task Learning in Deep Neural Networks at EVALITA 2018
|pdfUrl=https://ceur-ws.org/Vol-2263/paper013.pdf
|volume=Vol-2263
|authors=Andrea Cimino,Lorenzo De Mattei,Felice Dell'Orletta
|dblpUrl=https://dblp.org/rec/conf/evalita/CiminoMD18
}}
==Multi-task Learning in Deep Neural Networks at EVALITA 2018==
Andrea Cimino, Lorenzo De Mattei, and Felice Dell'Orletta
Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC-CNR)
ItaliaNLP Lab - www.italianlp.it
Dipartimento di Informatica, Università di Pisa
{andrea.cimino,felice.dellorletta}@ilc.cnr.it
lorenzo.demattei@di.unipi.it
Abstract

English. In this paper we describe the system used for the participation in the ABSITA, GxG, HaSpeeDe and IronITA shared tasks of the EVALITA 2018 conference. We developed a classifier that can be configured to use Bidirectional Long Short Term Memories (Bi-LSTMs) and linear Support Vector Machines as learning algorithms. When using Bi-LSTMs we tested a multi-task learning approach, which learns the optimized parameters of the network by exploiting all the annotated dataset labels simultaneously, and a multi-classifier voting approach based on a k-fold technique. In addition, we developed generic and task-specific word embedding lexicons to further improve classification performance. When evaluated on the official test sets, our system ranked 1st in almost all subtasks of each shared task, showing the effectiveness of our approach.

Italiano. In this paper we describe the system used to participate in the ABSITA, GxG, HaSpeeDe and IronITA shared tasks of the EVALITA 2018 conference. We developed a system that uses both Bidirectional Long Short Term Memory networks (Bi-LSTMs) and Support Vector Machines as learning algorithms. With the Bi-LSTMs we tested a multi-task learning approach in which the network parameters are optimized using all the annotations present in the dataset simultaneously, together with a k-fold voting classification strategy. We created generic and task-specific word embeddings to further improve classification performance. When evaluated on the official test sets, our system obtained first place in almost all subtasks of every shared task addressed, demonstrating the validity of our approach.

1 Description of the System

The EVALITA 2018 edition has been one of the most successful editions in terms of the number of shared tasks proposed. In particular, a large part of the proposed tasks can be tackled as binary document classification problems. This gave us the opportunity to test a new system specifically designed for this EVALITA edition.

We implemented a system that relies on Bi-LSTMs (Hochreiter and Schmidhuber, 1997) and SVMs, two widely used learning algorithms for document classification. The learning algorithm can be selected in a configuration file. In this work we used the Keras (Chollet, 2016) library and the liblinear (Fan et al., 2008) library to generate the Bi-LSTM and SVM statistical models, respectively. Since our approach relies on morphosyntactically tagged text, training and test data were automatically tagged by the PoS tagger described in (Cimino and Dell'Orletta, 2016). Due to the label constraints in the ABSITA dataset, if our system classified an aspect as not present, we forced the related positive and negative labels to be classified as not positive and not negative. We developed sentiment polarity and word embedding lexicons with the aim of improving the overall accuracy of our system.
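As a minimal sketch, this label-constraint post-processing amounts to a filter over the 24 binary predictions of the ABSITA task described in the next paragraph; the array layout used below is our own assumption, not taken from the paper:

```python
import numpy as np

def enforce_label_constraints(pred):
    """Force polarity labels to 0 when the aspect is predicted as absent.

    `pred` is assumed to be a (n_docs, 24) binary array laid out as
    [8 aspect labels | 8 positive labels | 8 negative labels].
    """
    pred = pred.copy()
    absent = pred[:, 0:8] == 0       # aspects predicted as not present
    pred[:, 8:16][absent] = 0        # not present -> not positive
    pred[:, 16:24][absent] = 0       # not present -> not negative
    return pred
```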
Some specific adaptations were made to accommodate the characteristics of each shared task. In the Aspect-based Sentiment Analysis (ABSITA) 2018 shared task (Basile et al., 2018) participants were asked, given a training set of Booking hotel reviews, to detect the aspect categories mentioned in a review among a set of 8 fixed categories (ACD task) and to assign the polarity (neutral, positive, negative, positive-negative) to each detected aspect (ACP task). Since each Booking review in the training set is labeled with 24 binary labels (8 indicating the presence of an aspect, 8 indicating positivity and 8 indicating negativity w.r.t. an aspect), we addressed the ABSITA 2018 shared task as 24 binary classification problems.

The Gender X-Genre (GxG) 2018 shared task (Dell'Orletta and Nissim, 2018) consisted in the automatic identification of the gender of the author of a text (Female or Male). Five different training and test sets were provided by the organizers for five different genres: Children essays (CH), Diary (DI), Journalism (JO), Twitter posts (TW) and YouTube comments (YT). For each test set, participants were requested to submit a system trained on the in-domain training dataset and a system trained on cross-domain data only.

The IronITA shared task (Cignarella et al., 2018) consisted of two tasks. In the first task participants had to automatically label a message as ironic or not. The second task was more fine-grained: given a message, participants had to classify whether the message is sarcastic, ironic but not sarcastic, or not ironic.

Finally, the HaSpeeDe 2018 shared task (Bosco et al., 2018) consisted in automatically annotating messages from Twitter and Facebook with a boolean value indicating the presence (or not) of hate speech. In particular, three tasks were proposed: HaSpeeDe-FB, where only the Facebook dataset could be used to classify Facebook comments; HaSpeeDe-TW, where only Twitter data could be used to classify tweets; and Cross-HaSpeeDe, where only the Facebook dataset could be used to classify the Twitter test set and vice versa (Cross-HaSpeeDe_FB, Cross-HaSpeeDe_TW).

1.1 Lexical Resources

1.1.1 Automatically Generated Sentiment Polarity Lexicons for Social Media

For the purpose of modeling word usage in generic, positive and negative contexts of social media texts, we developed three lexicons, which we named TWGEN, TWNEG and TWPOS. Each lexicon reports the relative frequency of a word in one of three different corpora. The main idea behind these lexicons is that positive and negative words should show a higher relative frequency in TWPOS and TWNEG, respectively. The three corpora were generated by first downloading approximately 50,000,000 tweets and then applying filtering rules to the downloaded tweets to build the positive and negative corpora (no filtering rules were applied to build the generic corpus). In order to build a corpus of positive tweets, we constrained the downloaded tweets to contain at least one positive emoji among hearts and kisses. Since emojis are rarely used in negative tweets, to build the negative tweets corpus we created a list of words commonly used in negative language and constrained these tweets to contain at least one of these words.
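The construction of these corpora and lexicons can be sketched as follows; the emoji set, the negative word list and the tokenizer below are placeholders, since the paper does not enumerate the actual filtering rules:

```python
from collections import Counter

# Placeholder filters: the paper constrains positive tweets to contain a
# heart or kiss emoji, and negative tweets to contain a word from a manually
# built list of negative-language words, without enumerating either set.
POSITIVE_EMOJIS = {"\u2764", "\U0001F618", "\U0001F48B"}  # assumed subset
NEGATIVE_WORDS = {"odio", "schifo", "vergogna"}           # assumed subset

def split_corpora(tweets, tokenize):
    """Build the generic, positive and negative corpora from raw tweets."""
    generic, positive, negative = [], [], []
    for tweet in tweets:
        generic.append(tweet)  # no filtering rule for the generic corpus
        if any(emoji in tweet for emoji in POSITIVE_EMOJIS):
            positive.append(tweet)
        if any(word in tokenize(tweet) for word in NEGATIVE_WORDS):
            negative.append(tweet)
    return generic, positive, negative

def relative_frequencies(corpus, tokenize):
    """Map each word to its relative frequency in the given corpus."""
    counts = Counter(tok for tweet in corpus for tok in tokenize(tweet))
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

# TWGEN, TWPOS and TWNEG are then the relative-frequency lexicons obtained
# from the generic, positive and negative corpora respectively.
```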
1.1.2 Automatically Translated Sentiment Polarity Lexicons

We used the Multi-Perspective Question Answering (hereafter referred to as MPQA) Subjectivity Lexicon (Wilson et al., 2005). This lexicon consists of approximately 8,200 English words with their associated polarity. To use this resource for the Italian language, we translated all the entries through the Yandex translation service (http://api.yandex.com/translate/).

1.1.3 Word Embedding Lexicons

We generated four word embedding lexicons using the word2vec toolkit (http://code.google.com/p/word2vec/) (Mikolov et al., 2013). As recommended in (Mikolov et al., 2013), we used the CBOW model, which learns to predict the word in the middle of a symmetric window based on the sum of the vector representations of the words in the window. For our experiments, we considered a context window of 5 words. The word embedding lexicons were built starting from the following corpora, which were tokenized and PoS-tagged by the PoS tagger for Twitter described in (Cimino and Dell'Orletta, 2016):

• The first lexicon was built using the itWaC corpus (http://wacky.sslmit.unibo.it/doku.php?id=corpora), a 2 billion word corpus constructed from the Web by limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds.

• The second lexicon was built using the set of 50,000,000 tweets we downloaded to build the sentiment polarity lexicons previously described in subsection 1.1.1.

• The third and the fourth lexicons were built using a corpus consisting of 538,835 Booking reviews scraped from the web. Since each review on the Booking site is split into a positive section (indicated by a plus mark) and a negative section (indicated by a minus mark), we split these reviews, obtaining 338,494 positive reviews and 200,341 negative reviews. Starting from the positive and the negative reviews, we finally obtained two different word embedding lexicons.

Each entry of the lexicons maps a pair (word, POS) to the associated word embedding, allowing us to mitigate polysemy problems which can lead to poorer classification results. In addition, the corpora were preprocessed in order to 1) map each URL to the word "URL" and 2) distinguish between all-uppercased and non-uppercased words (e.g., "mai" vs "MAI"), since all-uppercased words are usually used in negative contexts. Since each task has its own characteristics in terms of the information that the classifiers need to capture, we decided to use a subset of the word embeddings in each task. Table 1 sums up the word embeddings used in each shared task.

Task     | Booking | ITWAC | Twitter
ABSITA   | ✓       | ✓     | ✗
GxG      | ✗       | ✓     | ✓
HaSpeeDe | ✗       | ✓     | ✓
IronITA  | ✗       | ✓     | ✓

Table 1: Word embedding lexicons used by our system in each shared task (✓ used; ✗ not used).
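As an informal illustration, one of these lexicons could be trained as follows using the gensim implementation of word2vec rather than the original C toolkit used by the authors; the corpus reader is a placeholder, while the (word, POS) keys, the URL and uppercase normalization, the CBOW model with a 5-word window and sum-combined context vectors, and the 128 dimensions follow the description given above and in Section 1.2.2:

```python
import re
from gensim.models import Word2Vec

def normalize(token):
    """Apply the preprocessing described above to a single token."""
    if re.match(r"https?://\S+", token):
        return "URL"              # 1) map each URL to the word "URL"
    if not token.isupper():
        token = token.lower()     # 2) keep all-uppercased forms distinct
    return token

def to_training_token(token, pos):
    """Key embeddings by (word, POS) to mitigate polysemy."""
    return f"{normalize(token)}_{pos}"

# `tagged_corpus` is a placeholder: an iterable of sentences, each a list
# of (token, pos) pairs produced by the PoS tagger.
def train_lexicon(tagged_corpus):
    sentences = [[to_training_token(tok, pos) for tok, pos in sent]
                 for sent in tagged_corpus]
    # CBOW (sg=0) with a symmetric window of 5 and context vectors combined
    # by sum (cbow_mean=0), as described above; vector_size=128 matches the
    # dimensions in Section 1.2.2. min_count is our own choice.
    return Word2Vec(sentences, vector_size=128, window=5,
                    sg=0, cbow_mean=0, min_count=5)
```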
1.2 The Classifier

The classifier we built for our participation in the tasks was designed with the aim of testing different learning algorithms and learning strategies. More specifically, our classifier implements two workflows, which allow testing SVMs and recurrent neural networks as learning algorithms. In addition, when recurrent neural networks are chosen as the learning algorithm, our classifier allows performing neural network multi-task learning (MTL) using an external dataset in order to share knowledge between related tasks. We decided to test the MTL strategy since, as demonstrated in (De Mattei et al., 2018), it can improve the performance of the classifier on emotion recognition tasks. The benefits of this approach were investigated also by Søgaard and Goldberg (2016), who showed that MTL is appealing since it allows incorporating prior knowledge about task hierarchies into neural network architectures. Furthermore, Ruder et al. (2017) showed that MTL is useful to combine even loosely related tasks, letting the networks automatically learn the task hierarchy.

Both the workflows we implemented share a common pattern used in machine learning classifiers, consisting of a document feature extraction phase and a learning phase based on the extracted features; but since SVMs and Bi-LSTMs take as input 2-dimensional and 3-dimensional tensors respectively, a different feature extraction phase is involved for each considered algorithm. In addition, when the Bi-LSTM workflow is selected, the classifier can take as input an extra file which is used to exploit the MTL approach. Furthermore, when the Bi-LSTM workflow is selected, the classifier performs a 5-fold training approach. More precisely, we build 5 different models using different training and validation sets. These models are then exploited in the classification phase: the assigned labels are the ones that obtain the majority among all the models. The 5-fold strategy was chosen in order to generate a global model which should be less prone to overfitting or underfitting w.r.t. a single learned model.

1.2.1 The SVM classifier

The SVM classifier exploits a wide set of features ranging across different levels of linguistic description. With the exception of the word embedding combination, these features were already tested in our previous participation in the EVALITA 2016 SENTIPOLC edition (Cimino et al., 2016). The features are organised into three main categories: raw and lexical text features, morpho-syntactic features and lexicon features. Due to size constraints we report only the feature names.

Raw and Lexical Text Features: number of tokens, character n-grams, word n-grams, lemma n-grams, repetition of n-gram chars, number of mentions, number of hashtags, punctuation.

Morpho-syntactic Features: coarse-grained Part-Of-Speech n-grams, fine-grained Part-Of-Speech n-grams, coarse-grained Part-Of-Speech distribution.

Lexicon Features: emoticons presence, lemma sentiment polarity n-grams, polarity modifier, PMI score, sentiment polarity distribution, most frequent sentiment polarity, sentiment polarity in text sections, word embeddings combination.
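For the SVM workflow, the feature extraction and learning phases can be sketched with scikit-learn, whose LinearSVC is built on the same liblinear library used by the authors; only the n-gram features are shown here, with the morpho-syntactic and lexicon features omitted for brevity:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

# A reduced sketch of the feature space: word and character n-grams.
# The full system also adds lemma n-grams, morpho-syntactic features and
# the lexicon features listed above; the n-gram orders are our assumptions.
features = make_union(
    CountVectorizer(analyzer="word", ngram_range=(1, 2)),
    CountVectorizer(analyzer="char", ngram_range=(2, 4)),
)

svm_workflow = make_pipeline(features, LinearSVC())

# Usage, with placeholder data for one of the binary tasks:
# svm_workflow.fit(train_texts, train_labels)
# predictions = svm_workflow.predict(test_texts)
```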
1.2.2 The Deep Neural Network classifier

We tested two different models based on Bi-LSTMs: one that learns to classify the labels without sharing information across the labels in the training phase (single-task learning, STL), and one that learns to classify the labels exploiting the related information through a shared Bi-LSTM (multi-task learning, MTL). We employed Bi-LSTM architectures since they allow capturing long-range dependencies from both directions of a document by constructing bidirectional links in the network (Schuster and Paliwal, 1997). We applied a dropout factor to both the input gates and the recurrent connections in order to prevent overfitting, a typical issue in neural networks (Gal and Ghahramani, 2015). We chose a dropout factor value of 0.50. For GxG, as we had to deal with longer documents such as news articles, we employed a two-layer Bi-LSTM encoder: the first Bi-LSTM layer encodes each sentence as a token sequence, and the second layer encodes the sequence of sentences. For IronITA, we added a task-specific Bi-LSTM for each subtask before the dense layer.

Figure 1 shows a graphical representation of the STL and MTL architectures we employed. For the optimization process, the binary cross entropy function is used as the loss function and optimization is performed by the rmsprop optimizer (Tieleman and Hinton, 2012).

[Figure 1: STL and MTL architectures. (a) STL Model: the input feeds a separate Bi-LSTM and dense layer for each label L1 ... Ln. (b) MTL Model: the input feeds a single shared Bi-LSTM followed by a dense layer for each label L1 ... Ln.]

Each input word is represented by a vector which is composed of:

Word embeddings: the concatenation of the word embeddings extracted from the available word embedding lexicons (128 dimensions for each word embedding); for each word embedding an extra component was added to handle the "unknown word" case (1 dimension for each lexicon used).

Word polarity: the word polarity obtained by exploiting the sentiment polarity lexicons. This results in 3 components, one for each possible lexicon outcome (negative, neutral, positive) (3 dimensions). We assumed that a word not found in the lexicons has neutral polarity.

Automatically generated sentiment polarity lexicons for social media: the presence or absence of the word in a lexicon, and its relative frequency if the word is found in the lexicon. Since we built the TWGEN, TWPOS and TWNEG lexicons, 6 dimensions are needed, 2 for each lexicon.

Coarse-grained Part-of-Speech: 13 dimensions.

End of Sentence: a component indicating whether the sentence has been read completely (1 dimension).
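To make the two workflows concrete, the sketch below wires a shared Bi-LSTM to one sigmoid output per binary label (the MTL architecture of Figure 1b) and wraps training in the 5-fold majority-vote procedure of Section 1.2; the layer size, epoch count and fold split are our assumptions, as the paper only fixes the dropout factor (0.50), the binary cross entropy loss and the rmsprop optimizer:

```python
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.layers import Bidirectional, Dense, Input, LSTM
from tensorflow.keras.models import Model

def build_mtl_model(seq_len, feat_dim, n_labels, units=100):
    """Shared Bi-LSTM with one sigmoid head per binary label (Figure 1b)."""
    inputs = Input(shape=(seq_len, feat_dim))
    # Dropout on both input gates and recurrent connections, as in the paper.
    shared = Bidirectional(
        LSTM(units, dropout=0.5, recurrent_dropout=0.5))(inputs)
    outputs = [Dense(1, activation="sigmoid", name=f"label_{i}")(shared)
               for i in range(n_labels)]
    model = Model(inputs, outputs)
    model.compile(loss="binary_crossentropy", optimizer="rmsprop")
    return model

def kfold_majority_vote(X, Y, X_test, n_labels, epochs=10):
    """Train 5 models on different train/validation splits and vote."""
    votes = np.zeros((len(X_test), n_labels))
    for train_idx, _val_idx in KFold(n_splits=5, shuffle=True).split(X):
        model = build_mtl_model(X.shape[1], X.shape[2], n_labels)
        model.fit(X[train_idx], [Y[train_idx, i] for i in range(n_labels)],
                  epochs=epochs, verbose=0)
        preds = np.hstack(model.predict(X_test))   # (n_test, n_labels)
        votes += (preds > 0.5)
    return (votes >= 3).astype(int)  # majority among the 5 models
```

The STL counterpart would simply train one such network per label, without the shared encoder.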
2 Results and Discussion

Table 2 reports the official results obtained by our best runs on all the tasks in which we participated. As can be noted, our system performed extremely well, achieving the best scores in almost every single subtask. In the following subsections a discussion of the results obtained in each task is provided.

Task             | Our Score | Best Score | Rank
ABSITA
  ACD            | 0.811     | 0.811      | 1
  ACP            | 0.767     | 0.767      | 1
GxG IN-DOMAIN
  CH             | 0.640     | 0.640      | 1
  DI             | 0.676     | 0.676      | 1
  JO             | 0.555     | 0.585      | 2
  TW             | 0.595     | 0.595      | 1
  YT             | 0.555     | 0.555      | 1
GxG CROSS-DOMAIN
  CH             | 0.640     | 0.640      | 1
  DI             | 0.595     | 0.635      | 2
  JO             | 0.510     | 0.515      | 2
  TW             | 0.609     | 0.609      | 1
  YT             | 0.513     | 0.513      | 1
HaSpeeDe
  TW             | 0.799     | 0.799      | 1
  FB             | 0.829     | 0.829      | 1
  C TW           | 0.699     | 0.699      | 1
  C FB           | 0.607     | 0.654      | 5
IronITA
  IRONY          | 0.730     | 0.730      | 1
  SARCASM        | 0.516     | 0.520      | 3

Table 2: Classification results of our best runs on the ABSITA, GxG, HaSpeeDe and IronITA test sets.

2.1 ABSITA

We tested five learning configurations of our system based on the linear SVM and DNN learning algorithms, using the features described in sections 1.2.1 and 1.2.2. All the experiments were aimed at testing the contribution, in terms of f-score, of MTL vs STL, the k-fold technique and the external resources. For the Bi-LSTM learning algorithm we tested Bi-LSTMs both in the STL and MTL scenarios. In addition, to test the contribution of the Booking word embeddings, we created a configuration which uses a shallow Bi-LSTM in the MTL setting without these embeddings (MTL NO BOOKING-WE). Finally, to test the contribution of the k-fold technique, we created a configuration which does not use it (MTL NO K-FOLD). To obtain fair comparisons in the latter case, we ran all the experiments 5 times and averaged the scores of the runs. To test the proposed classification models, we created an internal development set by randomly selecting documents from the training sets distributed by the task organizers. The resulting development set is composed of approximately 10% (561 documents) of the whole training set.

Configuration      | ACD   | ACP
baseline           | 0.313 | 0.197
linear SVM         | 0.797 | 0.739
STL                | 0.821 | 0.795
MTL                | 0.824 | 0.804
MTL NO K-FOLD      | 0.819 | 0.782
MTL NO BOOKING-WE  | 0.817 | 0.757

Table 3: Classification results (micro f-score) of the different learning models on our ABSITA development set.

Table 3 reports the overall accuracies achieved by the models on the internal development set for all the tasks. In addition, the results of the baseline system (baseline row), which always emits the most probable label according to the label distribution in the training set, are reported. The accuracy is calculated as the micro f-score obtained using the evaluation tool provided by the organizers. Regarding the ACD task, it is worth noting that the models based on DNNs always outperform the linear SVM, even though the difference in terms of f-score is small (approximately 2 f-score points). The MTL configuration was the best performing among all the models, but the difference in terms of f-score among the DNN configurations is not marked.

When analyzing the results obtained on the ACP task, we can notice remarkable differences among the performances of the models. Again the linear SVM was the worst performing model, but this time with a difference of 6 f-score points with respect to MTL, the best performing model on the task. It is interesting to notice that the results achieved by the DNN models show bigger differences in terms of f-score than in the ACD task: this suggests that the external resources and the k-fold technique contributed significantly to obtaining the best result in the ACP task. The configuration that does not use the k-fold technique scored 2 f-score points less than the MTL configuration. We can also notice that the Booking word embeddings were particularly helpful in this task: the MTL NO BOOKING-WE configuration in fact scored 5 points less than the best configuration. The results obtained on the internal development set led us to choose the models for the official runs on the provided test set. Table 4 reports the overall accuracies achieved by all our classifier configurations on the official test set; the officially submitted runs are starred in the table.

Configuration      | ACD    | ACP
baseline           | 0.338  | 0.199
linear SVM         | 0.772* | 0.686*
STL                | 0.814  | 0.765
MTL                | 0.811* | 0.767*
MTL NO K-FOLD      | 0.801  | 0.755
MTL NO BOOKING-WE  | 0.808  | 0.753

Table 4: Classification results (micro f-score) of the different learning models on the ABSITA official test set.
As can be noticed, the best scores in both the ACD and ACP tasks were obtained by the DNN models. Surprisingly, the differences in terms of f-score were smaller in both tasks, with the exception of the linear SVM, which performed 4 and 8 f-score points worse in the ACD and ACP tasks respectively when compared to the best DNN model. The STL model outperformed the MTL models in the ACD task, even though the difference in terms of f-score is not relevant. When the results on the ACP task are considered, the MTL model outperformed all the other models, even though the difference in terms of f-score with respect to the STL model is not noticeable. It is worth noticing that the k-fold technique and the Booking word embeddings again seemed to contribute to the final accuracy of the MTL system. This can be seen by looking at the results achieved by the MTL NO BOOKING-WE model and the MTL NO K-FOLD model, which scored 1.2 and 1.5 f-score points less than the MTL system.

2.2 GxG

We tested three different learning configurations of our system based on the linear SVM and DNN learning algorithms, using the features described in sections 1.2.1 and 1.2.2. For the Bi-LSTM learning algorithm we tested both the STL and MTL approaches. We tested the three configurations for each of the five in-domain subtasks and each of the five cross-domain subtasks. To test the proposed classification models, we created internal development sets by randomly selecting documents from the training sets distributed by the task organizers. The resulting development sets are composed of approximately 10% of each dataset. For the in-domain tasks, we tried to train the SVM classifier both on in-domain data only and on in-domain plus cross-domain data.

Model | CH    | DI    | JO    | TW    | YT
SVMa  | 0.667 | 0.626 | 0.485 | 0.582 | 0.611
SVM   | 0.701 | 0.737 | 0.560 | 0.728 | 0.619
STL   | 0.556 | 0.545 | 0.500 | 0.724 | 0.596
MTL   | 0.499 | 0.817 | 0.625 | 0.729 | 0.632

Table 5: Classification results of the different learning models on the development set in terms of accuracy for the in-domain tasks.

Model | CH    | DI    | JO    | TW    | YT
SVM   | 0.530 | 0.565 | 0.580 | 0.588 | 0.568
STL   | 0.550 | 0.535 | 0.505 | 0.625 | 0.580
MTL   | 0.523 | 0.549 | 0.538 | 0.500 | 0.556

Table 6: Classification results of the different learning models on the development set in terms of accuracy for the cross-domain tasks.

Tables 5 and 6 report the overall accuracy, computed as the average accuracy for the two classes (male and female), achieved by the models on the development datasets for the in-domain and the cross-domain tasks respectively. For the in-domain tasks we observe that the SVM performs well on the smaller datasets (Children and Diary), while the MTL neural network has the best overall performance. When trained on all the datasets, in-domain and cross-domain, the SVM (SVMa) performs worse than when trained on in-domain data only (SVM). For the cross-domain datasets we observe poor performances over all the subtasks with all the employed models, implying that the models have difficulties in cross-domain generalization.

Model | CH     | DI     | JO    | TW     | YT
SVMa  | 0.545  | 0.514  | 0.475 | 0.539  | 0.585
SVM   | 0.550  | 0.649  | 0.555 | 0.567  | 0.555*
STL   | 0.545  | 0.541  | 0.500 | 0.595* | 0.512
MTL   | 0.640* | 0.676* | 0.470 | 0.561  | 0.546

Table 7: Classification results of the different learning models on the official test set in terms of accuracy for the in-domain tasks (* marks runs that outperformed all the systems that participated in the task).

Model | CH     | DI    | JO    | TW     | YT
SVM   | 0.540  | 0.514 | 0.505 | 0.586  | 0.513*
STL   | 0.640* | 0.554 | 0.495 | 0.609* | 0.510
MTL   | 0.535  | 0.595 | 0.510 | 0.500  | 0.500

Table 8: Classification results of the different learning models on the official test set in terms of accuracy for the cross-domain tasks (* marks runs that outperformed all the systems that participated in the task).

Tables 7 and 8 report the overall accuracy, computed as the average accuracy for the two classes (male and female), achieved by the models on the official test sets for the in-domain and the cross-domain tasks respectively (* marks the runs that obtained the best results in the competition).
For the in-domain subtasks, the performances appear not to be in line with those obtained on the development set, but our models still outperform the other participants' systems in four out of five subtasks. The MTL model provided the best results for the Children and Diary test sets, while on the other test sets all the models performed quite poorly. Again, when trained on all the datasets, in-domain and cross-domain, the SVM (SVMa) performs worse than when trained on in-domain data only (SVM). For the cross-domain subtasks, while our models achieve the best performances in three out of five subtasks, the results confirm poor performances across all the subtasks, again indicating that the models have difficulties in cross-domain generalization.

2.3 HaSpeeDe

We tested seven learning configurations of our system based on the linear SVM and DNN learning algorithms, using the features described in sections 1.2.1 and 1.2.2. All the experiments were aimed at testing the contribution, in terms of f-score, of the number of layers, MTL vs STL, the k-fold technique and the external resources. For the Bi-LSTM learning algorithm we tested one- and two-layer Bi-LSTMs both in the STL and MTL scenarios. In addition, to test the contribution of the sentiment lexicon features, we created a configuration which uses a one-layer Bi-LSTM in the MTL setting without these features (1L MTL NO SNT). Finally, to test the contribution of the k-fold technique, we created a configuration which does not use it (1L STL NO K-FOLD). To obtain fair results in the latter case, we ran all the experiments 5 times and averaged the scores of the runs. To test the proposed classification models, we created two internal development sets, one for each dataset, by randomly selecting documents from the training sets distributed by the task organizers. The resulting development sets are composed of 10% (300 documents) of the whole training sets.

Configuration    | TW    | FB    | C TW  | C FB
baseline         | 0.378 | 0.345 | 0.345 | 0.378
linear SVM       | 0.800 | 0.813 | 0.617 | 0.503
1L STL           | 0.774 | 0.860 | 0.683 | 0.647
2L STL           | 0.790 | 0.860 | 0.672 | 0.597
1L MTL           | 0.783 | 0.860 | 0.672 | 0.663
2L MTL           | 0.796 | 0.853 | 0.710 | 0.613
1L MTL NO SNT    | 0.793 | 0.857 | 0.651 | 0.661
1L STL NO K-FOLD | 0.771 | 0.846 | 0.657 | 0.646

Table 9: Classification results of the different learning models on our HaSpeeDe development sets in terms of F1-score.

Table 9 reports the overall accuracies achieved by the models on our internal development sets for all the tasks. In addition, the results of the baseline system (baseline row), which always emits the most probable label according to the label distribution in the training set, are reported. The accuracy is calculated as the F1-score obtained using the evaluation tool provided by the organizers. Regarding the Twitter in-domain task (TW in the table), it is worth noting that the linear SVM outperformed all the configurations based on Bi-LSTMs. In addition, the MTL architectures perform slightly better than the STL ones (+1 f-score point with respect to the STL counterparts). External sentiment resources were not particularly helpful in this task, as shown by the result of the 1L MTL NO SNT configuration. In the FB task, Bi-LSTMs clearly outperformed the linear SVM (+5 f-score points on average); this is most probably due to the longer texts found in this dataset with respect to the Twitter one. For the out-of-domain tasks, when testing models trained on Twitter against Facebook data (C TW column), we can notice an expected drop in performance with respect to the models trained on the FB dataset (15-20 f-score points). The best result was achieved by the 2L MTL configuration (+4 points w.r.t. the STL counterpart). Finally, when testing the models trained on Facebook against Twitter data (C FB column), the linear SVM showed a huge drop in terms of accuracy (-30 f-score points), while all the Bi-LSTM models showed a performance drop of approximately 12 f-score points.
Also in this setting, the best result was achieved by an MTL configuration (1L MTL), which performed better than its STL counterpart (+2 f-score points). Regarding the k-fold learning strategy, we can notice that the results achieved by the model not using it (1L STL NO K-FOLD) are always lower than those of the counterpart which used the k-fold approach (+2.5 f-score points gained in the C TW task), showing the benefits of this technique.

These results led us to choose the models for the official runs on the provided test set. Table 10 reports the overall accuracies achieved by all our classifier configurations on the official test set; the officially submitted runs are starred in the table. The best official system row reports, for each task, the best official result submitted by the participants in the EVALITA 2018 HaSpeeDe shared task.

Configuration        | TW     | FB     | C TW   | C FB
baseline             | 0.403  | 0.404  | 0.404  | 0.403
best official system | 0.799  | 0.829  | 0.699  | 0.654
linear SVM           | 0.798* | 0.761  | 0.658  | 0.451
1L STL               | 0.793  | 0.811* | 0.669* | 0.607*
2L STL               | 0.791  | 0.812  | 0.644  | 0.561
1L MTL               | 0.788  | 0.818  | 0.707  | 0.635
2L MTL               | 0.799* | 0.829* | 0.699* | 0.585*
1L MTL NO SNT        | 0.801  | 0.808  | 0.709  | 0.620
1L STL NO K-FOLD     | 0.785  | 0.806  | 0.652  | 0.583

Table 10: Classification results of the different learning models on the official HaSpeeDe test set in terms of F1-score.

As we can note, the best scores in each task were obtained by Bi-LSTMs in the MTL setting, showing that MTL networks seem to be more effective than STL networks. For the Twitter in-domain task, we obtained results similar to those on the development set. A sizable drop in performance is observed in the FB task w.r.t. the development set (-5 f-score points on average). Still, the Bi-LSTM models outperformed the linear SVM model by 5 f-score points. In the out-of-domain tasks, all the models performed similarly to what was observed on the development set. It is worth observing that the linear SVM performed almost like a baseline system in the C FB task. In addition, in the same task the model exploiting the sentiment lexicon (1L MTL) showed a better performance (+1.5 f-score points) w.r.t. the 1L MTL NO SNT model. It is worth noticing that the k-fold learning strategy was beneficial also on the official test set: the 1L STL model obtained better results (approximately +2 f-score points in each task) w.r.t. the model that did not use the k-fold learning strategy.

2.4 IronITA

We tested four learning configurations of our system based on the linear SVM and deep neural network (DNN) learning algorithms, using the features described in sections 1.2.1 and 1.2.2. To select the proposed classification models, we used k-fold cross-validation (k=4).

Configuration     | Irony | Sarcasm
linear SVM        | 0.734 | 0.512
MTL               | 0.745 | 0.530
MTL+Polarity      | 0.757 | 0.562
MTL+Polarity+Hate | 0.760 | 0.557

Table 11: Classification results (average F1-score) of the different learning models under cross-validation.

Table 11 reports the overall average F1-score achieved by the models on the cross-validation sets for both the irony and sarcasm detection tasks. We can observe that the SVM obtains good results on irony detection, but the MTL neural approach clearly outperforms the SVM. We also note that the usage of the additional Polarity and Hate Speech datasets leads to better performances. These results led us to choose the MTL models trained with the additional datasets for the two official run submissions.
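A rough sketch of how an auxiliary dataset could enter the MTL training loop is given below: a shared Bi-LSTM encoder feeds both the IronITA heads and an auxiliary polarity head, and the two models are trained in alternation. The alternation schedule and layer size are our assumptions, since the paper does not detail how the external datasets were interleaved:

```python
from tensorflow.keras.layers import Bidirectional, Dense, Input, LSTM
from tensorflow.keras.models import Model

def build_models_with_auxiliary(seq_len, feat_dim, units=100):
    """Two models sharing one Bi-LSTM encoder: main heads for irony and
    sarcasm, plus an auxiliary head for polarity (e.g. SENTIPOLC data)."""
    inputs = Input(shape=(seq_len, feat_dim))
    encoded = Bidirectional(
        LSTM(units, dropout=0.5, recurrent_dropout=0.5))(inputs)
    irony = Dense(1, activation="sigmoid", name="irony")(encoded)
    sarcasm = Dense(1, activation="sigmoid", name="sarcasm")(encoded)
    polarity = Dense(1, activation="sigmoid", name="polarity")(encoded)
    main = Model(inputs, [irony, sarcasm])
    aux = Model(inputs, polarity)
    main.compile(loss="binary_crossentropy", optimizer="rmsprop")
    aux.compile(loss="binary_crossentropy", optimizer="rmsprop")
    return main, aux

def train_alternating(main, aux, X_task, y_irony, y_sarcasm,
                      X_aux, y_aux, n_rounds=10):
    """Assumed schedule: alternate one epoch on the auxiliary polarity data
    and one on the IronITA data, so the shared encoder sees both."""
    for _ in range(n_rounds):
        aux.fit(X_aux, y_aux, epochs=1, verbose=0)
        main.fit(X_task, [y_irony, y_sarcasm], epochs=1, verbose=0)
```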
Configuration     | Irony  | Sarcasm
baseline-random   | 0.505  | 0.337
baseline-mfc      | 0.334  | 0.223
best participant  | 0.730  | 0.52
linear SVM        | 0.701  | 0.493
MTL               | 0.736  | 0.530
MTL+Polarity      | 0.730* | 0.516*
MTL+Polarity+Hate | 0.713* | 0.503*

Table 12: Classification results of the different learning models on the official test set in terms of F1-score (* submitted run).

Table 12 reports the overall accuracies achieved by all our classifier configurations on the official test set; the officially submitted runs are starred in the table. The accuracies have been computed in terms of F1-score using the official evaluation script. We submitted the runs MTL+Polarity and MTL+Polarity+Hate. The run MTL+Polarity ranked first in subtask A and third in subtask B on the official leaderboard. The run MTL+Polarity+Hate ranked second in subtask A and fourth in subtask B.

The results on the test set confirm the good performance of the SVM classifier on the irony detection task, and that the MTL neural approaches outperform the SVM. The model trained on the IronITA and SENTIPOLC datasets outperformed all the systems that participated in subtask A, while on subtask B it slightly underperformed the best participant system. The model trained on the IronITA, SENTIPOLC and HaSpeeDe datasets outperformed all the systems that participated in subtask A except our own model trained on the IronITA and SENTIPOLC datasets only. However, the best scores in both tasks were obtained by the MTL network trained on the IronITA dataset only: this model would have outperformed all the systems submitted by all participants to both subtasks. It seems that for these tasks the usage of additional datasets leads to overfitting issues.

3 Conclusions

In this paper we reported the results of our participation in the ABSITA, GxG, HaSpeeDe and IronITA shared tasks of the EVALITA 2018 conference. By resorting to a system which uses Support Vector Machines and Deep Neural Networks (DNNs) as learning algorithms, we achieved the best scores in almost every task, showing the effectiveness of our approach. In addition, when a DNN was used as the learning algorithm, we introduced a new multi-task learning approach and a majority-vote classification approach to further improve the overall accuracy of our system. The proposed system turned out to be a very effective solution, achieving the first position in almost all subtasks of each shared task.

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

References

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTiment POLarity Classification Task. In Proceedings of EVALITA '16, Evaluation of NLP and Speech Tools for Italian. December, Naples, Italy.

Pierpaolo Basile, Valerio Basile, Danilo Croce, and Marco Polignano. 2018. Overview of the EVALITA Aspect-based Sentiment Analysis (ABSITA) Task. T. Caselli, N. Novielli, V. Patti, P. Rosso (eds.). In Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18). December, Turin, Italy.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. T. Caselli, N. Novielli, V. Patti, P. Rosso (eds.). In Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18). December, Turin, Italy.

François Chollet. 2016. Keras. Software available at https://github.com/fchollet/keras/tree/master/keras.

Alessandra Cignarella, Simona Frenda, Valerio Basile, Cristina Bosco, Viviana Patti, and Paolo Rosso. 2018. Overview of the EVALITA 2018 Task on Irony Detection in Italian Tweets (IronITA). T. Caselli, N. Novielli, V. Patti, P. Rosso (eds.). In Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18). December, Turin, Italy.

Andrea Cimino and Felice Dell'Orletta. 2016. Tandem LSTM-SVM Approach for Sentiment Analysis. In Proceedings of EVALITA '16, Evaluation of NLP and Speech Tools for Italian. December, Naples, Italy.

Andrea Cimino and Felice Dell'Orletta. 2016. Building the state-of-the-art in POS tagging of Italian Tweets. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Final Workshop (EVALITA 2016), Naples, Italy, December 5-7, 2016.

Lorenzo De Mattei, Andrea Cimino, and Felice Dell'Orletta. 2018. Multi-Task Learning in Deep Neural Network for Sentiment Polarity and Irony classification. In Proceedings of the 2nd Workshop on Natural Language for Artificial Intelligence, Trento, Italy, November 22-23, 2018.

Felice Dell'Orletta and Malvina Nissim. 2018. Overview of the EVALITA Cross-Genre Gender Prediction in Italian (GxG) Task. T. Caselli, N. Novielli, V. Patti, P. Rosso (eds.). In Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18). December, Turin, Italy.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871-1874.

Yarin Gal and Zoubin Ghahramani. 2015. A theoretically grounded application of dropout in recurrent neural networks. arXiv preprint arXiv:1512.05287.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. SemEval-2016 task 4: Sentiment analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016).

Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. 2017. Sluice networks: Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673-2681.

Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 231-235, Berlin, Germany.

Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of EMNLP 2015, 1422-1432, Lisbon, Portugal.

Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of HLT-EMNLP 2005, 347-354, Stroudsburg, PA, USA. ACL.

XingYi Xu, HuiZhi Liang, and Timothy Baldwin. 2016. UNIMELB at SemEval-2016 Tasks 4A and 4B: An Ensemble of Neural Networks and a Word2Vec Based Model for Sentiment Classification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016).