Multi-task Learning in Deep Neural Networks at EVALITA 2018

Andrea Cimino*, Lorenzo De Mattei*†, and Felice Dell'Orletta*
* Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC-CNR), ItaliaNLP Lab - www.italianlp.it
† Dipartimento di Informatica, Università di Pisa
{andrea.cimino,felice.dellorletta}@ilc.cnr.it, lorenzo.demattei@di.unipi.it

Abstract

English. In this paper we describe the system used for our participation in the ABSITA, GxG, HaSpeeDe and IronITA shared tasks of the EVALITA 2018 conference. We developed a classifier that can be configured to use Bidirectional Long Short Term Memories (Bi-LSTMs) and linear Support Vector Machines (SVMs) as learning algorithms. When using Bi-LSTMs, we tested a multi-task learning approach, which learns the optimized parameters of the network by exploiting all the annotated dataset labels simultaneously, and a multi-classifier voting approach based on a k-fold technique. In addition, we developed generic and task-specific word embedding lexicons to further improve classification performance. When evaluated on the official test sets, our system ranked 1st in almost all subtasks of each shared task, showing the effectiveness of our approach.

1 Description of the System

The EVALITA 2018 edition has been one of the most successful editions in terms of the number of shared tasks proposed. In particular, a large part of the tasks proposed by the organizers can be tackled as binary document classification tasks. This gave us the opportunity to test a new system specifically designed for this EVALITA edition.

We implemented a system which relies on Bi-LSTMs (Hochreiter and Schmidhuber, 1997) and SVMs, which are widely used learning algorithms in document classification. The learning algorithm can be selected in a configuration file. In this work we used the Keras (Chollet, 2016) library and the liblinear (Fan et al., 2008) library to generate the Bi-LSTM and SVM statistical models, respectively. Since our approach relies on morphosyntactically tagged text, training and test data were automatically tagged by the PoS tagger described in (Cimino and Dell'Orletta, 2016). We also developed sentiment polarity and word embedding lexicons with the aim of improving the overall accuracy of our system.

Some specific adaptations were made to account for the characteristics of each shared task. In the Aspect-based Sentiment Analysis (ABSITA) 2018 shared task (Basile et al., 2018), participants were asked, given a training set of Booking hotel reviews, to detect the aspect categories mentioned in a review among a set of 8 fixed categories (ACD task) and to assign a polarity (positive, negative, neutral or positive-negative) to each detected aspect (ACP task). Since each Booking review in the training set is labeled with 24 binary labels (8 indicating the presence of an aspect, 8 indicating positivity and 8 indicating negativity w.r.t. an aspect), we addressed the ABSITA 2018 shared task as 24 binary classification problems. Due to the label constraints in the dataset, if our system classified an aspect as not present, we forced the related positive and negative labels to be classified as not positive and not negative.
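The label post-processing described above can be made concrete with a minimal sketch (the function and variable names are ours, not taken from the authors' code): each review carries 8 aspect-presence labels plus 8 positivity and 8 negativity labels, and whenever an aspect is predicted as absent, the corresponding positive and negative labels are forced to 0.

```python
# Minimal sketch of the ABSITA label-constraint rule: an aspect predicted
# as absent cannot carry a positive or a negative polarity label.

ASPECTS = 8  # fixed number of ABSITA aspect categories

def enforce_label_constraints(presence, positive, negative):
    """Force positive/negative predictions to 0 for aspects predicted absent."""
    assert len(presence) == len(positive) == len(negative) == ASPECTS
    positive = [p if a else 0 for a, p in zip(presence, positive)]
    negative = [n if a else 0 for a, n in zip(presence, negative)]
    return presence, positive, negative
```

For instance, a positivity prediction on an aspect classified as absent is zeroed out, so the 24 binary outputs always satisfy the dataset's label constraints.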
The Gender X-Genre (GxG) 2018 shared task (Dell'Orletta and Nissim, 2018) consisted in the automatic identification of the gender of the author of a text (Female or Male). Five different training and test sets were provided by the organizers for five different genres: Children essays (CH), Diary (DI), Journalism (JO), Twitter posts (TW) and YouTube comments (YT). For each test set, participants were requested to submit a system trained on the in-domain training dataset and a system trained on cross-domain data only.

The IronITA task (Cignarella et al., 2018) consisted of two tasks. In the first task, participants had to automatically label a message as ironic or not. The second task was more fine-grained: given a message, participants had to classify whether the message is sarcastic, ironic but not sarcastic, or not ironic.

Finally, the HaSpeeDe 2018 shared task (Bosco et al., 2018) consisted in automatically annotating messages from Twitter and Facebook with a boolean value indicating the presence (or not) of hate speech. In particular, three tasks were proposed: HaSpeeDe-FB, where only the Facebook dataset could be used to classify Facebook comments; HaSpeeDe-TW, where only Twitter data could be used to classify tweets; and Cross-HaSpeeDe, where only the Facebook dataset could be used to classify the Twitter test set and vice versa (Cross-HaSpeeDe_FB, Cross-HaSpeeDe_TW).

1.1 Lexical Resources

1.1.1 Automatically Generated Sentiment Polarity Lexicons for Social Media

For the purpose of modeling word usage in generic, positive and negative contexts of social media texts, we developed three lexicons, which we named TW_GEN, TW_NEG and TW_POS. Each lexicon reports the relative frequency of a word in one of three different corpora. The main idea behind building these lexicons is that positive and negative words should present a higher relative frequency in TW_POS and TW_NEG, respectively. The three corpora were generated by first downloading approximately 50,000,000 tweets and then applying some filtering rules to the downloaded tweets to build the positive and negative corpora (no filtering rules were applied to build the generic corpus). In order to build a corpus of positive tweets, we constrained the downloaded tweets to contain at least one positive emoji among hearts and kisses. Since emojis are rarely used in negative tweets, to build the negative tweets corpus we created a list of commonly used words in negative language and constrained these tweets to contain at least one of these words.

1.1.2 Automatically Translated Sentiment Polarity Lexicons

The Multi-Perspective Question Answering (hereafter referred to as MPQA) Subjectivity Lexicon (Wilson et al., 2005) consists of approximately 8,200 English words with their associated polarity. To use this resource for the Italian language, we translated all the entries through the Yandex translation service (http://api.yandex.com/translate/).

1.1.3 Word Embedding Lexicons

We generated four word embedding lexicons using the word2vec toolkit (Mikolov et al., 2013) (http://code.google.com/p/word2vec/). As recommended in (Mikolov et al., 2013), we used the CBOW model, which learns to predict the word in the middle of a symmetric window based on the sum of the vector representations of the words in the window. For our experiments, we considered a context window of 5 words. The word embedding lexicons were built starting from the following corpora, which were tokenized and PoS-tagged by the PoS tagger for Twitter described in (Cimino and Dell'Orletta, 2016):

• The first lexicon was built using the itWaC corpus (http://wacky.sslmit.unibo.it/doku.php?id=corpora), a 2 billion word corpus constructed from the Web by limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds.

• The second lexicon was built using the set of 50,000,000 tweets we downloaded to build the sentiment polarity lexicons described in subsection 1.1.1.

• The third and fourth lexicons were built using a corpus of 538,835 Booking reviews scraped from the web. Since each review on the Booking site is split into a positive section (indicated by a plus mark) and a negative section (indicated by a minus mark), we split these reviews, obtaining 338,494 positive reviews and 200,341 negative reviews. Starting from the positive and the negative reviews, we finally obtained two different word embedding lexicons.
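The corpus filtering and relative-frequency computation of subsection 1.1.1 can be sketched as follows. This is our own minimal illustration, assuming whitespace tokenization and tiny illustrative emoji/word seed sets; the authors' actual filtering rules and tokenizer are richer.

```python
# Sketch of the TW_GEN / TW_POS / TW_NEG construction: filter tweets into
# positive/negative corpora, then map each token to its relative frequency.
from collections import Counter

POSITIVE_EMOJIS = {"❤", "😘"}        # hearts and kisses (illustrative subset)
NEGATIVE_WORDS = {"odio", "schifo"}  # illustrative negative-language seeds

def build_lexicon(tweets):
    """Map each token to its relative frequency in the given corpus."""
    counts = Counter(tok for tweet in tweets for tok in tweet.lower().split())
    total = sum(counts.values()) or 1
    return {tok: c / total for tok, c in counts.items()}

def build_corpora(tweets):
    generic = list(tweets)  # no filtering rules for the generic corpus
    positive = [t for t in tweets if POSITIVE_EMOJIS & set(t)]
    negative = [t for t in tweets if NEGATIVE_WORDS & set(t.lower().split())]
    return (build_lexicon(generic),
            build_lexicon(positive),
            build_lexicon(negative))
```

The intuition is directly visible: a genuinely positive word ends up with a higher relative frequency in the positive lexicon than in the generic one.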
Each entry of the lexicons maps a pair (word, PoS) to the associated word embedding, which mitigates polysemy problems that can lead to poorer classification results. In addition, both corpora were preprocessed in order to 1) map each URL to the token "URL" and 2) distinguish between all-uppercase and non-uppercase words (e.g., "mai" vs "MAI"), since all-uppercase words are usually used in negative contexts. Since each task has its own characteristics in terms of the information that the classifiers need to capture, we decided to use a subset of the word embeddings in each task. Table 1 sums up the word embeddings used in each shared task.

Task       Booking  ITWAC  Twitter
ABSITA     yes      yes    no
GxG        no       yes    yes
HaSpeeDe   no       yes    yes
IronITA    no       yes    yes

Table 1: Word embedding lexicons used by our system in each shared task.

1.2 The Classifier

The classifier we built for our participation in the tasks was designed with the aim of testing different learning algorithms and learning strategies. More specifically, our classifier implements two workflows which allow testing SVMs and recurrent neural networks as learning algorithms. In addition, when recurrent neural networks are chosen as the learning algorithm, our classifier allows performing neural network multi-task learning (MTL) using an external dataset in order to share knowledge between related tasks. We decided to test the MTL strategy since, as demonstrated in (De Mattei et al., 2018), it can improve the performance of the classifier on emotion recognition tasks. The benefits of this approach were also investigated by Søgaard and Goldberg (2016), who showed that MTL is appealing since it allows incorporating prior knowledge about task hierarchies into neural network architectures. Furthermore, Ruder et al. (2017) showed that MTL is useful to combine even loosely related tasks, letting the networks automatically learn the task hierarchy.

Both workflows we implemented share a common pattern used in machine learning classifiers, consisting of a document feature extraction phase and a learning phase based on the extracted features. However, since SVMs and Bi-LSTMs take as input 2-dimensional and 3-dimensional tensors respectively, a different feature extraction phase is involved for each considered algorithm. In addition, when the Bi-LSTM workflow is selected, the classifier can take as input an extra file which is used to exploit the MTL approach. Furthermore, when the Bi-LSTM workflow is selected, the classifier performs a 5-fold training approach. More precisely, we build 5 different models using different training and validation sets. These models are then exploited in the classification phase: the assigned labels are the ones that obtain the majority among all the models. The 5-fold strategy was chosen in order to generate a global model which should be less prone to overfitting or underfitting w.r.t. a single learned model.

1.2.1 The SVM classifier

The SVM classifier exploits a wide set of features ranging across different levels of linguistic description. With the exception of the word embedding combination, these features were already tested in our previous participation in the EVALITA 2016 SENTIPOLC edition (Cimino et al., 2016). The features are organised into three main categories: raw and lexical text features, morpho-syntactic features and lexicon features. Due to size constraints, we report only the feature names.

Raw and Lexical Text Features: number of tokens, character n-grams, word n-grams, lemma n-grams, repetition of n-gram characters, number of mentions, number of hashtags, punctuation.

Morpho-syntactic Features: coarse-grained Part-Of-Speech n-grams, fine-grained Part-Of-Speech n-grams, coarse-grained Part-Of-Speech distribution.

Lexicon Features: emoticon presence, lemma sentiment polarity n-grams, polarity modifier, PMI score, sentiment polarity distribution, most frequent sentiment polarity, sentiment polarity in text sections, word embedding combination.
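The 5-fold training and majority-voting strategy described in Section 1.2 can be sketched as follows (a minimal, framework-agnostic illustration written by us; the actual system builds five Bi-LSTM or SVM models, one per split):

```python
# Sketch of the 5-fold training strategy: build k train/validation splits,
# train one model per split, then combine predictions by majority vote.
from collections import Counter

def k_fold_splits(data, k=5):
    """Yield (train, validation) pairs, one per fold."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

def majority_vote(predictions):
    """predictions: one label list per model; return per-instance majority."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]
```

At classification time, each of the five models labels the test documents independently and `majority_vote` picks, per document, the label assigned by most models.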
1.2.2 The Deep Neural Network classifier

We tested two different models based on Bi-LSTMs: one that learns to classify the labels without sharing information across labels in the training phase (single-task learning, STL), and one that learns to classify the labels exploiting the related information through a shared Bi-LSTM (multi-task learning, MTL). We employed Bi-LSTM architectures since they allow capturing long-range dependencies from both directions of a document by constructing bidirectional links in the network (Schuster and Paliwal, 1997). We applied a dropout factor to both the input gates and the recurrent connections in order to prevent overfitting, a typical issue in neural networks (Gal and Ghahramani, 2015). We chose a dropout factor value of 0.50.

For GxG, as we had to deal with longer documents such as news articles, we employed a two-layer Bi-LSTM encoder: the first Bi-LSTM layer encodes each sentence as a token sequence, and the second layer encodes the sequence of sentences. For IronITA, we added a task-specific Bi-LSTM for each subtask before the dense layer. Figure 1 shows a graphical representation of the STL and MTL architectures we employed: in the STL model each label has its own Bi-LSTM and dense layer, while in the MTL model a shared Bi-LSTM feeds a separate dense layer for each label L1 ... Ln. For the optimization process, the binary cross-entropy function is used as loss function and optimization is performed by the rmsprop optimizer (Tieleman and Hinton, 2012).

Figure 1: STL and MTL architectures. (a) STL Model; (b) MTL Model.

Each input word is represented by a vector which is composed of:

Word embeddings: the concatenation of the word embeddings extracted from the available word embedding lexicons (128 dimensions for each word embedding); for each word embedding an extra component was added to handle the "unknown word" (1 dimension for each lexicon used).

Word polarity: the corresponding word polarity obtained by exploiting the sentiment polarity lexicons. This results in 3 components, one for each possible lexicon outcome (negative, neutral, positive) (3 dimensions). We assumed that a word not found in the lexicons has neutral polarity.

Automatically generated sentiment polarity lexicons for social media: the presence or absence of the word in a lexicon, and its relative frequency if the word is found in the lexicon. Since we built TW_GEN, TW_POS and TW_NEG, 6 dimensions are needed, 2 for each lexicon.

Coarse-grained Part-of-Speech: 13 dimensions.

End of sentence: a component (1 dimension) indicating whether the sentence was completely read.

2 Results and Discussion

Table 2 reports the official results obtained by our best runs on all the tasks in which we participated. As can be noted, our system performed extremely well, achieving the best score in almost every single subtask. In the following subsections, a discussion of the results obtained in each task is provided.

Task              Our Score  Best Score  Rank
ABSITA
  ACD             0.811      0.811       1
  ACP             0.767      0.767       1
GxG IN-DOMAIN
  CH              0.640      0.640       1
  DI              0.676      0.676       1
  JO              0.555      0.585       2
  TW              0.595      0.595       1
  YT              0.555      0.555       1
GxG CROSS-DOMAIN
  CH              0.640      0.640       1
  DI              0.595      0.635       2
  JO              0.510      0.515       2
  TW              0.609      0.609       1
  YT              0.513      0.513       1
HaSpeeDe
  TW              0.799      0.799       1
  FB              0.829      0.829       1
  C_TW            0.699      0.699       1
  C_FB            0.607      0.654       5
IronITA
  IRONY           0.730      0.730       1
  SARCASM         0.516      0.520       3

Table 2: Classification results of our best runs on the ABSITA, GxG, HaSpeeDe and IronITA test sets.

2.1 ABSITA

We tested five learning configurations of our system based on linear SVM and DNN learning algorithms, using the features described in sections 1.2.1 and 1.2.2. All the experiments were aimed at testing the contribution, in terms of F-score, of MTL vs STL, the k-fold technique and the external resources. For the Bi-LSTM learning algorithm, we tested Bi-LSTMs in both the STL and MTL scenarios. In addition, to test the contribution of the Booking word embeddings, we created a configuration which uses a shallow Bi-LSTM in the MTL setting without these embeddings (MTL NO BOOKING-WE). Finally, to test the contribution of the k-fold technique, we created a configuration which does not use it (MTL NO K-FOLD). To obtain fair comparisons in this last case, we ran all the experiments 5 times and averaged the scores of the runs.
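As a concrete recap of the per-word input representation described in Section 1.2.2, the feature blocks can be concatenated as in the sketch below. The helper names and data layout are ours; the dimensions (128 per embedding lexicon plus a 1-dimensional unknown-word flag, 3 for polarity, 2 per TW lexicon, 13 for coarse PoS, 1 for end of sentence) follow the paper.

```python
# Sketch of the word-level input vector: concatenation of embeddings,
# unknown-word flags, polarity one-hot, TW lexicon features, PoS one-hot
# and an end-of-sentence component.
EMB_DIM = 128
POLARITIES = ["negative", "neutral", "positive"]

def word_vector(word, pos_tag, embeddings, polarity_lexicon, tw_lexicons,
                pos_tags, end_of_sentence):
    vec = []
    # one 128-d embedding block per lexicon, plus a 1-d "unknown word" flag
    for lex in embeddings:
        known = (word, pos_tag) in lex
        vec += lex.get((word, pos_tag), [0.0] * EMB_DIM)
        vec.append(0.0 if known else 1.0)
    # 3-d one-hot word polarity (neutral if the word is out of the lexicon)
    polarity = polarity_lexicon.get(word, "neutral")
    vec += [1.0 if p == polarity else 0.0 for p in POLARITIES]
    # 2 dims per TW lexicon: presence flag and relative frequency
    for lex in tw_lexicons:
        vec += [1.0 if word in lex else 0.0, lex.get(word, 0.0)]
    # 13-d one-hot coarse-grained PoS
    vec += [1.0 if t == pos_tag else 0.0 for t in pos_tags]
    # 1-d end-of-sentence component
    vec.append(1.0 if end_of_sentence else 0.0)
    return vec
```

With two embedding lexicons and the three TW lexicons, a word vector has 2 x (128 + 1) + 3 + 6 + 13 + 1 = 281 dimensions.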
To test the proposed classification models, we created an internal development set by randomly selecting documents from the training sets distributed by the task organizers. The resulting development set is composed of approximately 10% (561 documents) of the whole training set.

Table 3 reports the overall accuracies achieved by the models on the internal development set for all the tasks. In addition, the results of a baseline system (baseline row), which always emits the most probable label according to the label distribution in the training set, are reported. The accuracy is calculated as the micro F-score obtained using the evaluation tool provided by the organizers.

Configuration        ACD    ACP
baseline             0.313  0.197
linear SVM           0.797  0.739
STL                  0.821  0.795
MTL                  0.824  0.804
MTL NO K-FOLD        0.819  0.782
MTL NO BOOKING-WE    0.817  0.757

Table 3: Classification results (micro F-score) of the different learning models on our ABSITA development set.

For the ACD task, it is worth noting that the models based on DNNs always outperform the linear SVM, even though the difference in terms of F-score is small (approximately 2 F-score points). The MTL configuration was the best performing among all the models, but the difference in terms of F-score among the DNN configurations is not marked.

When analyzing the results obtained on the ACP task, we can notice remarkable differences among the performances of the models. Again the linear SVM was the worst performing model, but this time with a difference in terms of F-score of 6 points with respect to MTL, the best performing model on the task. It is interesting to notice that the results achieved by the DNN models differ more from each other in terms of F-score than in the ACD task: this suggests that the external resources and the k-fold technique contributed significantly to obtaining the best result in the ACP task. The configuration that does not use the k-fold technique scored 2 F-score points less than the MTL configuration. We can also notice that the Booking word embeddings were particularly helpful in this task: the MTL NO BOOKING-WE configuration in fact scored 5 points less than the best configuration. The results obtained on the internal development set led us to choose the models for the official runs on the provided test set.

Table 4 reports the overall accuracies achieved by all our classifier configurations on the official test set; the officially submitted runs are starred in the table.

Configuration        ACD     ACP
baseline             0.338   0.199
linear SVM           0.772*  0.686*
STL                  0.814   0.765
MTL                  0.811*  0.767*
MTL NO K-FOLD        0.801   0.755
MTL NO BOOKING-WE    0.808   0.753

Table 4: Classification results (micro F-score) of the different learning models on the ABSITA official test set.

As can be noticed, the best scores in both the ACD and ACP tasks were obtained by the DNN models. Surprisingly, the differences in terms of F-score were reduced in both tasks, with the exception of the linear SVM, which performed 4 and 8 F-score points worse in the ACD and ACP tasks respectively when compared to the best DNN model.
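The micro F-score reported in Tables 3 and 4 pools true positives, false positives and false negatives across all 24 binary decisions before computing precision and recall. A minimal re-implementation sketch (ours, not the organizers' evaluation tool):

```python
# Micro-averaged F-score over per-document binary label vectors.
def micro_f_score(gold, predicted):
    """gold, predicted: lists of binary label vectors (one per document)."""
    tp = fp = fn = 0
    for g_vec, p_vec in zip(gold, predicted):
        for g, p in zip(g_vec, p_vec):
            tp += g and p
            fp += (not g) and p
            fn += g and (not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because counts are pooled, frequent labels dominate the score, which is the intended behavior of micro averaging in this multi-label setting.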
The STL model outperformed the MTL models in the ACD task, even though the difference in terms of F-score is not relevant. When the results on the ACP task are considered, the MTL model outperformed all the other models, even though the difference in terms of F-score with respect to the STL model is not noticeable. It is worth noticing that the k-fold technique and the Booking word embeddings again seemed to contribute to the final accuracy of the MTL system. This can be seen by looking at the results achieved by the MTL NO BOOKING-WE model and the MTL NO K-FOLD model, which scored 1.2 and 1.5 F-score points less than the MTL system.

2.2 GxG

We tested three different learning configurations of our system based on linear SVM and DNN learning algorithms, using the features described in sections 1.2.1 and 1.2.2. For the Bi-LSTM learning algorithm, we tested both the STL and MTL approaches. We tested the three configurations for each of the 5 in-domain subtasks and each of the 5 cross-domain subtasks. To test the proposed classification models, we created internal development sets by randomly selecting documents from the training sets distributed by the task organizers. The resulting development sets are composed of approximately 10% of each dataset. For the in-domain tasks, we trained the SVM classifier both on in-domain data only and on both in-domain and cross-domain data.

Model  CH     DI     JO     TW     YT
SVMa   0.667  0.626  0.485  0.582  0.611
SVM    0.701  0.737  0.560  0.728  0.619
STL    0.556  0.545  0.500  0.724  0.596
MTL    0.499  0.817  0.625  0.729  0.632

Table 5: Classification results of the different learning models on the development set in terms of accuracy for the in-domain tasks.

Model  CH     DI     JO     TW     YT
SVM    0.530  0.565  0.580  0.588  0.568
STL    0.550  0.535  0.505  0.625  0.580
MTL    0.523  0.549  0.538  0.500  0.556

Table 6: Classification results of the different learning models on the development set in terms of accuracy for the cross-domain tasks.

Tables 5 and 6 report the overall accuracy, computed as the average accuracy for the two classes (male and female), achieved by the models on the development sets for the in-domain and cross-domain tasks respectively. For the in-domain tasks, we observe that the SVM performs well on the smaller datasets (Children and Diary), while the MTL neural network has the best overall performance. When trained on all the datasets, in- and cross-domain, the SVM (SVMa) performs worse than when trained on in-domain data only (SVM). For the cross-domain datasets, we observe poor performances over all the subtasks with all the employed models, implying that the models have difficulties in cross-domain generalization.

Model  CH      DI      JO     TW      YT
SVMa   0.545   0.514   0.475  0.539   0.585
SVM    0.550   0.649   0.555  0.567   0.555*
STL    0.545   0.541   0.500  0.595*  0.512
MTL    0.640*  0.676*  0.470  0.561   0.546

Table 7: Classification results of the different learning models on the official test set in terms of accuracy for the in-domain tasks (* marks runs that outperformed all the systems that participated in the task).

Model  CH      DI     JO     TW      YT
SVM    0.540   0.514  0.505  0.586   0.513*
STL    0.640*  0.554  0.495  0.609*  0.510
MTL    0.535   0.595  0.510  0.500   0.500

Table 8: Classification results of the different learning models on the official test set in terms of accuracy for the cross-domain tasks (* marks runs that outperformed all the systems that participated in the task).

Tables 7 and 8 report the overall accuracy, computed as the average accuracy for the two classes (male and female), achieved by the models on the official test sets for the in-domain and cross-domain tasks respectively (* marks the runs that obtained the best results in the competition).
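The GxG score, the accuracy averaged over the male and female classes, can be computed as in this minimal sketch of ours (not the official scorer):

```python
# Average per-class accuracy: compute accuracy separately on the instances
# of each gold class, then take the mean of the two values.
def average_class_accuracy(gold, predicted, classes=("M", "F")):
    per_class = []
    for c in classes:
        instances = [(g, p) for g, p in zip(gold, predicted) if g == c]
        correct = sum(1 for g, p in instances if g == p)
        per_class.append(correct / len(instances) if instances else 0.0)
    return sum(per_class) / len(per_class)
```

Averaging per class keeps the metric insensitive to class imbalance: a system that always predicts the majority gender scores only 0.5.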
For the in-domain subtasks, the performances appear not to be in line with the ones obtained on the development set, but our models still outperform the other participants' systems in four out of five subtasks. The MTL model provided the best results for the Children and Diary test sets, while on the other test sets all the models performed quite poorly. Again, when trained on all the datasets, in- and cross-domain, the SVM (SVMa) performs worse than when trained on in-domain data only (SVM). For the cross-domain subtasks, while our model obtains the best performance in three out of five subtasks, the results confirm poor performances over all the subtasks, again indicating that the models have difficulties in cross-domain generalization.

2.3 HaSpeeDe

We tested seven learning configurations of our system based on linear SVM and DNN learning algorithms, using the features described in sections 1.2.1 and 1.2.2. All the experiments were aimed at testing the contribution, in terms of F-score, of the number of layers, MTL vs STL, the k-fold technique and the external resources. For the Bi-LSTM learning algorithm, we tested one- and two-layer Bi-LSTMs in both the STL and MTL scenarios. In addition, to test the contribution of the sentiment lexicon features, we created a configuration which uses a one-layer Bi-LSTM in the MTL setting without these features (1L MTL NO SNT). Finally, to test the contribution of the k-fold technique, we created a configuration which does not use it (1L STL NO K-FOLD). To obtain fair results in this last case, we ran all the experiments 5 times and averaged the scores of the runs. To test the proposed classification models, we created two internal development sets, one for each dataset, by randomly selecting documents from the training sets distributed by the task organizers. The resulting development sets are composed of 10% (300 documents) of the whole training sets.

Configuration      TW     FB     C_TW   C_FB
baseline           0.378  0.345  0.345  0.378
linear SVM         0.800  0.813  0.617  0.503
1L STL             0.774  0.860  0.683  0.647
2L STL             0.790  0.860  0.672  0.597
1L MTL             0.783  0.860  0.672  0.663
2L MTL             0.796  0.853  0.710  0.613
1L MTL NO SNT      0.793  0.857  0.651  0.661
1L STL NO K-FOLD   0.771  0.846  0.657  0.646

Table 9: Classification results of the different learning models on our HaSpeeDe development sets in terms of F1-score.

Configuration         TW      FB      C_TW    C_FB
baseline              0.403   0.404   0.404   0.403
best official system  0.799   0.829   0.699   0.654
linear SVM            0.798*  0.761   0.658   0.451
1L STL                0.793   0.811*  0.669*  0.607*
2L STL                0.791   0.812   0.644   0.561
1L MTL                0.788   0.818   0.707   0.635
2L MTL                0.799*  0.829*  0.699*  0.585*
1L MTL NO SNT         0.801   0.808   0.709   0.620
1L STL NO K-FOLD      0.785   0.806   0.652   0.583

Table 10: Classification results of the different learning models on the official HaSpeeDe test set in terms of F1-score.

Table 9 reports the overall accuracies achieved by the models on our internal development sets for all the tasks. In addition, the results of a baseline system (baseline row), which always emits the most probable label according to the label distribution in the training set, are reported. The accuracy is calculated as the F-score obtained using the evaluation tool provided by the organizers.

For the Twitter in-domain task (TW column in the table), it is worth noting that the linear SVM outperformed all the configurations based on Bi-LSTMs. In addition, the MTL architecture results are slightly better than the STL ones (+1 F-score point with respect to the STL counterparts). External sentiment resources were not particularly helpful in this task, as shown by the result obtained by the 1L MTL NO SNT configuration. In the FB task, Bi-LSTMs sensibly outperformed the linear SVM (+5 F-score points on average); this is most probably due to the longer texts found in this dataset with respect to the Twitter one. For the out-of-domain tasks, when testing models trained on Twitter and tested on Facebook (C_TW column), we can notice an expected drop in performance with respect to the models trained on the FB dataset (15-20 F-score points). The best result was achieved by the 2L MTL configuration (+4 points w.r.t. the STL counterpart). Finally, when testing the models trained on Facebook and tested on Twitter (C_FB column), the linear SVM showed a huge drop in terms of accuracy (-30 F-score points), while all the models trained with Bi-LSTMs showed a performance drop of approximately 12 F-score points. Also in this setting the best result was achieved by an MTL configuration (1L MTL), which performed better than its STL counterpart (+2 F-score points). For the k-fold learning strategy, we can notice that the results achieved by the model not using it (1L STL NO K-FOLD) are always lower than those of its counterpart which used the k-fold approach (+2.5 F-score points gained in the C_TW task), showing the benefits of this technique.

These results led us to choose the models for the official runs on the provided test set.
Table 10 reports the overall accuracies achieved by all our classifier configurations on the official test set; the officially submitted runs are starred in the table. The "best official system" row reports, for each task, the best official result submitted by the participants of the EVALITA 2018 HaSpeeDe shared task. As we can note, the best score in each task was obtained by a Bi-LSTM in the MTL setting, showing that MTL networks seem to be more effective than STL networks. For the Twitter in-domain task, we obtained results similar to the development set ones. A sensible drop in performance is observed in the FB task w.r.t. the development set (-5 F-score points on average). Still, the Bi-LSTM models outperformed the linear SVM model by 5 F-score points. In the out-of-domain tasks, all the models performed similarly to what was observed on the development set. It is worth observing that the linear SVM performed almost like a baseline system in the C_FB task. In addition, in the same task the model exploiting the sentiment lexicon (1L MTL) showed a better performance (+1.5 F-score points) w.r.t. the 1L MTL NO SNT model. It is worth noticing that the k-fold learning strategy was beneficial also on the official test set: the 1L STL model obtained better results (approximately +2 F-score points in each task) w.r.t. the model that did not use the k-fold learning strategy.

2.4 IronITA

We tested the four designed learning configurations of our system based on linear SVM and deep neural network (DNN) learning algorithms, using the features described in sections 1.2.1 and 1.2.2. To select the proposed classification models, we used k-fold cross-validation (k=4).

Table 11 reports the overall average F-score achieved by the models under cross-validation for both the irony and sarcasm detection tasks.

Configuration       Irony  Sarcasm
linear SVM          0.734  0.512
MTL                 0.745  0.530
MTL+Polarity        0.757  0.562
MTL+Polarity+Hate   0.760  0.557

Table 11: Classification results of the different learning models under k-fold cross-validation in terms of average F1-score.

We can observe that the SVM obtains good results on irony detection, but the MTL neural approach sensibly outperforms the SVM. We also note that the usage of the additional polarity and hate speech datasets leads to better performances. These results led us to choose the MTL models trained with the additional datasets for the two official run submissions.

Table 12 reports the overall accuracies achieved by all our classifier configurations on the official test set; the officially submitted runs are starred in the table. The accuracies have been computed in terms of F-score using the official evaluation script. We submitted the runs MTL+Polarity and MTL+Polarity+Hate. The run MTL+Polarity ranked first in subtask A and third in subtask B on the official leaderboard. The run MTL+Polarity+Hate ranked second in subtask A and fourth in subtask B on the official leaderboard.

Configuration       Irony   Sarcasm
baseline-random     0.505   0.337
baseline-mfc        0.334   0.223
best participant    0.730   0.520
linear SVM          0.701   0.493
MTL                 0.736   0.530
MTL+Polarity        0.730*  0.516*
MTL+Polarity+Hate   0.713*  0.503*

Table 12: Classification results of the different learning models on the official test set in terms of F1-score (* submitted runs).

The results on the test set confirm the good performance of the SVM classifier on the irony detection task, and that the MTL neural approaches outperform the SVM. The model trained on the IronITA and SENTIPOLC datasets outperformed all the systems that participated in subtask A, while on subtask B it slightly underperformed the best participant system.
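The IronITA evaluation reports the F1-score averaged over the two classes (e.g. ironic and not ironic). As a minimal illustration (our own sketch, not the official evaluation script), the averaged F1 can be computed as:

```python
# Per-class F1 scores averaged over the two classes (macro F1).
def averaged_f1(gold, predicted, classes=(0, 1)):
    scores = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, predicted) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, predicted) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, predicted) if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)
```

Unlike the micro F-score used for ABSITA, this macro variant weights both classes equally, so a system that ignores the minority class is penalized.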
subtask A, except our model trained on the IronITA and SENTIPOLC datasets only. However, the best scores in both tasks were obtained by the MTL network trained on the IronITA dataset only: this model would have outperformed all the systems submitted to both subtasks by all participants. It seems that for these tasks the use of additional datasets leads to overfitting issues.

3 Conclusions

In this paper we reported the results of our participation in the ABSITA, GxG, HaSpeeDe and IronITA shared tasks of the EVALITA 2018 conference. By resorting to a system that uses Support Vector Machines and Deep Neural Networks (DNN) as learning algorithms, we achieved the best scores in almost every task, showing the effectiveness of our approach. In addition, when a DNN was used as the learning algorithm, we introduced a new multi-task learning approach and a majority-vote classification approach to further improve the overall accuracy of our system. The proposed system proved to be a very effective solution, achieving the first position in almost all sub-tasks of each shared task.

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

References

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTiment POLarity Classification Task. In Proceedings of EVALITA '16, Evaluation of NLP and Speech Tools for Italian. December, Naples, Italy.

Pierpaolo Basile, Valerio Basile, Danilo Croce and Marco Polignano. 2018. Overview of the EVALITA 2018 Aspect-based Sentiment Analysis (ABSITA) Task. T. Caselli, N. Novielli, V. Patti, P. Rosso (eds). In Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18). December, Turin, Italy.

François Chollet. 2016. Keras. Software available at https://github.com/fchollet/keras/tree/master/keras.

Alessandra Cignarella, Simona Frenda, Valerio Basile, Cristina Bosco, Viviana Patti and Paolo Rosso. 2018. Overview of the EVALITA 2018 Task on Irony Detection in Italian Tweets (IronITA). T. Caselli, N. Novielli, V. Patti, P. Rosso (eds). In Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18).

Andrea Cimino and Felice Dell'Orletta. 2016. Tandem LSTM-SVM Approach for Sentiment Analysis. In Proceedings of EVALITA '16, Evaluation of NLP and Speech Tools for Italian. December, Naples, Italy.

Andrea Cimino and Felice Dell'Orletta. 2016. Building the state-of-the-art in POS tagging of Italian Tweets. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.

Lorenzo De Mattei, Andrea Cimino and Felice Dell'Orletta. 2018. Multi-Task Learning in Deep Neural Networks for Sentiment Polarity and Irony Classification. In Proceedings of the 2nd Workshop on Natural Language for Artificial Intelligence, Trento, Italy, November 22-23, 2018.

Felice Dell'Orletta and Malvina Nissim. 2018. Overview of the EVALITA 2018 Cross-Genre Gender Prediction in Italian (GxG) Task. T. Caselli, N. Novielli, V. Patti, P. Rosso (eds). In Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18).

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871-1874.

Yarin Gal and Zoubin Ghahramani. 2015. A theoretically grounded application of dropout in recurrent neural networks. arXiv preprint arXiv:1512.05287.

Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural Computation, 9(8):1735-1780.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. T. Caselli, N. Novielli, V. Patti, P. Rosso (eds). In Proceedings of EVALITA '18, Evaluation of NLP and Speech Tools for Italian. December, Turin, Italy.

Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani and Veselin Stoyanov. 2016. SemEval-2016 task 4: Sentiment analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016).

Sebastian Ruder, Joachim Bingel, Isabelle Augenstein and Anders Søgaard. 2017. Sluice networks: Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673-2681.

Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 231-235, Berlin, Germany.

Duyu Tang, Bing Qin and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of EMNLP 2015, 1422-1432, Lisbon, Portugal.

Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-RmsProp: Divide the gradient by a running average of its recent magnitude. In COURSERA: Neural Networks for Machine Learning.

Theresa Wilson, Janyce Wiebe and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of HLT-EMNLP 2005, 347-354, Stroudsburg, PA, USA. ACL.

XingYi Xu, HuiZhi Liang and Timothy Baldwin. 2016. UNIMELB at SemEval-2016 Tasks 4A and 4B: An Ensemble of Neural Networks and a Word2Vec Based Model for Sentiment Classification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016).