<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Multi-task Learning in Deep Neural Networks at EVALITA 2018</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Andrea</forename><surname>Cimino</surname></persName>
							<email>andrea.cimino@ilc.cnr.it</email>
						</author>
						<author>
							<persName><forename type="first">Lorenzo</forename><surname>De Mattei</surname></persName>
							<email>lorenzo.demattei@di.unipi.it</email>
						</author>
						<author>
							<persName><forename type="first">Felice</forename><surname>Dell'Orletta</surname></persName>
							<email>felice.dellorletta@ilc.cnr.it</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department">Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC-CNR)</orgName>
								<orgName type="laboratory">ItaliaNLP Lab - www.italianlp.it</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">Dipartimento di Informatica</orgName>
								<orgName type="institution">Università di Pisa</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Multi-task Learning in Deep Neural Networks at EVALITA 2018</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">3DE7E2935D43335DA963FD3D438EDBBA</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T21:47+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper we describe the system used for our participation in the ABSITA, GxG, HaSpeeDe and IronITA shared tasks of the EVALITA 2018 conference. We developed a classifier that can be configured to use Bidirectional Long Short Term Memories (Bi-LSTMs) or linear Support Vector Machines as learning algorithms. When using Bi-LSTMs we tested a multi-task learning approach, which optimizes the parameters of the network by exploiting all the annotated dataset labels simultaneously, and a multi-classifier voting approach based on a k-fold technique. In addition, we developed generic and task-specific word embedding lexicons to further improve classification performance. When evaluated on the official test sets, our system ranked 1st in almost all subtasks of each shared task, showing the effectiveness of our approach.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Description of the System</head><p>The EVALITA 2018 edition has been one of the most successful in terms of number of shared tasks proposed. In particular, a large part of the proposed tasks can be tackled as binary document classification problems. This gave us the opportunity to test a new system specifically designed for this EVALITA edition.</p><p>We implemented a system which relies on Bi-LSTMs <ref type="bibr" target="#b13">(Hochreiter et al., 1997)</ref> and SVMs, two learning algorithms widely used for document classification. The learning algorithm can be selected in a configuration file. In this work we used the Keras <ref type="bibr" target="#b4">(Chollet, 2016)</ref> library and the liblinear <ref type="bibr" target="#b11">(Fan et al., 2008)</ref> library to generate the Bi-LSTM and SVM statistical models respectively. Since our approach relies on morphosyntactically tagged text, training and test data were automatically tagged by the PoS tagger described in <ref type="bibr">(Cimino and Dell'Orletta, 2016)</ref>. We also developed sentiment polarity and word embedding lexicons with the aim of improving the overall accuracy of our system. Some task-specific adaptations were made to account for the characteristics of each shared task.</p><p>In the Aspect-based Sentiment Analysis (ABSITA) 2018 shared task <ref type="bibr" target="#b1">(Basile et al., 2018)</ref>, participants were asked, given a training set of Booking hotel reviews, to detect the aspect categories mentioned in a review among a set of 8 fixed categories (ACD task) and to assign a polarity (positive, negative, positive-negative or neutral) to each detected aspect (ACP task). Since each Booking review in the training set is labeled with 24 binary labels (8 indicating the presence of an aspect, 8 indicating positivity and 8 indicating negativity w.r.t. an aspect), we addressed the ABSITA 2018 shared task as 24 binary classification problems. Due to the label constraints in the dataset, if our system classified an aspect as not present, we forced the related positive and negative labels to be classified as not positive and not negative.</p><p>The Gender X-Genre (GxG) 2018 shared task <ref type="bibr" target="#b10">(Dell'Orletta and Nissim, 2018)</ref> consisted in the automatic identification of the gender (female or male) of the author of a text. Five different training and test sets were provided by the organizers for five different genres: Children essays (CH), Diary (DI), Journalism (JO), Twitter posts (TW) and YouTube comments (YT). For each test set, participants were requested to submit a system trained on the in-domain training dataset and a system trained on cross-domain data only.</p><p>The IronITA task <ref type="bibr" target="#b5">(Cignarella et al., 2018)</ref> consisted of two subtasks. In the first, participants had to automatically label a message as ironic or not. The second was more fine-grained: given a message, participants had to classify it as sarcastic, ironic but not sarcastic, or not ironic.</p><p>Finally, the HaSpeeDe 2018 shared task <ref type="bibr" target="#b3">(Bosco et al., 2018)</ref> consisted in automatically annotating messages from Twitter and Facebook with a boolean value indicating the presence (or not) of hate speech. In particular, three tasks were proposed: HaSpeeDe-FB, where only the Facebook dataset could be used to classify Facebook comments; HaSpeeDe-TW, where only Twitter data could be used to classify tweets; and Cross-HaSpeeDe, where only the Facebook dataset could be used to classify the Twitter test set and vice versa (Cross-HaSpeeDe FB, Cross-HaSpeeDe TW).</p></div>
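The label-constraint post-processing described above can be sketched as follows; the dict keyed by (aspect, kind) pairs and the function name are hypothetical, a minimal illustration rather than the actual implementation:

```python
def enforce_aspect_constraints(labels):
    """Post-process the 24 binary ABSITA labels (8 aspects x presence/pos/neg):
    if an aspect is predicted as absent, force its polarity labels to 0.
    `labels` maps (aspect_id, kind) -> 0/1, with kind in {"presence", "pos", "neg"}."""
    fixed = dict(labels)
    for aspect in range(8):
        if fixed[(aspect, "presence")] == 0:
            fixed[(aspect, "pos")] = 0
            fixed[(aspect, "neg")] = 0
    return fixed

# toy predictions: aspect 3 present and positive, aspect 5 absent but
# (inconsistently) predicted positive
preds = {(a, k): 0 for a in range(8) for k in ("presence", "pos", "neg")}
preds[(3, "presence")] = 1
preds[(3, "pos")] = 1
preds[(5, "pos")] = 1
fixed = enforce_aspect_constraints(preds)
```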
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Lexical Resources</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.1">Automatically Generated Sentiment Polarity Lexicons for Social Media</head><p>For the purpose of modeling word usage in generic, positive and negative contexts of social media texts, we developed three lexicons, which we named TW_GEN, TW_NEG and TW_POS. Each lexicon reports the relative frequency of a word in one of three different corpora. The main idea behind these lexicons is that positive and negative words should show a higher relative frequency in TW_POS and TW_NEG respectively. The three corpora were generated by first downloading approximately 50,000,000 tweets and then applying filtering rules to build the positive and negative corpora (no filtering rules were applied to build the generic corpus). To build the corpus of positive tweets, we constrained the downloaded tweets to contain at least one positive emoji among hearts and kisses.</p><p>Since emojis are rarely used in negative tweets, to build the negative corpus we created a list of words commonly used in negative language and constrained the tweets to contain at least one of these words.</p></div>
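A minimal sketch of how such relative-frequency lexicons could be built; the emoji set, the negative word list and the toy tweets are invented for illustration (the paper does not publish its filter lists):

```python
from collections import Counter

# Hypothetical filters: a couple of positive emojis (hearts/kisses) and a
# tiny negative-word list standing in for the real, unpublished ones.
POSITIVE_EMOJIS = {"❤", "😘"}
NEGATIVE_WORDS = {"odio", "schifo"}

def relative_frequencies(tweets):
    """Map each token to its relative frequency in the given corpus."""
    counts = Counter(tok for tweet in tweets for tok in tweet.split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

tweets = ["che schifo questo hotel", "❤ amo questo hotel", "odio il lunedì"]
tw_gen = relative_frequencies(tweets)  # generic: no filtering
tw_pos = relative_frequencies([t for t in tweets if set(t) & POSITIVE_EMOJIS])
tw_neg = relative_frequencies([t for t in tweets if set(t.split()) & NEGATIVE_WORDS])
```

A positive word such as "amo" then shows up in TW_POS with a higher relative frequency than in TW_GEN, which is exactly the signal these lexicons are meant to capture.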
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.2">Automatically Translated Sentiment Polarity Lexicons</head><p>We used the Multi-Perspective Question Answering (hereafter MPQA) Subjectivity Lexicon <ref type="bibr" target="#b20">(Wilson et al., 2005)</ref>, which consists of approximately 8,200 English words with their associated polarity. To use this resource for the Italian language, we translated all the entries through the Yandex translation service<ref type="foot" target="#foot_0">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.3">Word Embedding Lexicons</head><p>We generated four word embedding lexicons using the word2vec<ref type="foot" target="#foot_1">2</ref> toolkit <ref type="bibr">(Mikolov et al., 2013)</ref>. As recommended in <ref type="bibr">(Mikolov et al., 2013)</ref>, we used the CBOW model, which learns to predict the word in the middle of a symmetric window based on the sum of the vector representations of the words in the window. For our experiments, we considered a context window of 5 words. The word embedding lexicons were built starting from the following corpora, which were tokenized and PoS-tagged by the PoS tagger for Twitter described in <ref type="bibr">(Cimino and Dell'Orletta, 2016</ref>):</p><p>• The first lexicon was built using the itWaC corpus<ref type="foot" target="#foot_2">3</ref>. The itWaC corpus is a 2 billion word corpus constructed from the Web by limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds.</p><p>• The second lexicon was built using the set of 50,000,000 tweets we downloaded to build the sentiment polarity lexicons described in subsection 1.1.1.</p><p>• The third and the fourth lexicons were built using a corpus of 538,835 Booking reviews scraped from the web. Since each review on the Booking site is split into a positive section (indicated by a plus mark) and a negative section (indicated by a minus mark), we split these reviews, obtaining 338,494 positive reviews and 200,341 negative reviews. Starting from the positive and the negative reviews, we obtained two different word embedding lexicons.</p><p>Each entry of the lexicons maps a pair (word, PoS) to the associated word embedding, allowing us to mitigate polysemy problems which can lead to poorer classification results. In addition, the corpora were preprocessed in order to 1) map each URL to the word "URL" and 2) distinguish between all-uppercased and non-uppercased words (e.g. "mai" vs "MAI"), since all-uppercased words are usually used in negative contexts. Since each task has its own characteristics in terms of the information the classifiers need to capture, we decided to use a subset of the word embeddings in each task. Table <ref type="table">1</ref> sums up the word embeddings used in each shared task.</p></div>
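The (word, PoS) keying can be sketched as a simple lookup; the random vectors and helper names are hypothetical stand-ins for the word2vec-trained embeddings, and the extra component flags unknown words:

```python
import numpy as np

EMB_DIM = 128

def make_lookup(keys):
    """Hypothetical lexicon: maps (word, coarse PoS) pairs to 128-d vectors
    (random here; word2vec-trained in the real system)."""
    rng = np.random.default_rng(0)
    return {key: rng.standard_normal(EMB_DIM) for key in keys}

def embed(lexicon, word, pos):
    """Return 129 dims: the 128-d embedding plus one 'unknown word' flag."""
    vec = lexicon.get((word, pos))
    if vec is None:
        return np.concatenate([np.zeros(EMB_DIM), [1.0]])
    return np.concatenate([vec, [0.0]])

# "vista" gets distinct vectors as a noun ("view") and as a past participle,
# which is the polysemy the (word, PoS) keying is meant to mitigate
lex = make_lookup([("vista", "NOUN"), ("vista", "VERB")])
noun_vec = embed(lex, "vista", "NOUN")
unk_vec = embed(lex, "xyzzy", "NOUN")
```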
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Table <ref type="table">1</ref>: Word embedding lexicons (Booking, itWaC, Twitter) used by our system in each shared task (ABSITA, GxG, HaSpeeDe, IronITA).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">The Classifier</head><p>The classifier we built for our participation in the tasks was designed with the aim of testing different learning algorithms and learning strategies. More specifically, our classifier implements two workflows, which allow testing SVMs and recurrent neural networks as learning algorithms. In addition, when recurrent neural networks are chosen, our classifier can perform neural network multi-task learning (MTL) using an external dataset in order to share knowledge between related tasks. We decided to test the MTL strategy since, as demonstrated in (De <ref type="bibr" target="#b9">Mattei et al., 2018)</ref>, it can improve the performance of the classifier on emotion recognition tasks. The benefits of this approach were investigated also by <ref type="bibr" target="#b17">Søgaard and Goldberg (2016)</ref>, who showed that MTL is appealing since it allows incorporating prior knowledge about task hierarchies into neural network architectures. Furthermore, <ref type="bibr" target="#b15">Ruder et al. (2017)</ref> showed that MTL is useful for combining even loosely related tasks, letting the network automatically learn the task hierarchy.</p><p>Both workflows share a common pattern used in machine learning classifiers, consisting of a document feature extraction phase and a learning phase based on the extracted features; but since SVMs and Bi-LSTMs take as input 2-dimensional and 3-dimensional tensors respectively, a different feature extraction phase is involved for each algorithm. In addition, when the Bi-LSTM workflow is selected, the classifier can take as input an extra file which is used for the MTL learning approach. Furthermore, when the Bi-LSTM workflow is selected, the classifier performs a 5-fold training approach: we build 5 different models using different training and validation sets. These models are then exploited in the classification phase: the assigned labels are the ones that obtain the majority among all the models. The 5-fold strategy was chosen in order to generate a global model which should be less prone to overfitting or underfitting w.r.t. a single learned model.</p></div>
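The 5-fold training and majority-voting strategy can be sketched as follows; the contiguous fold splitting and the dummy model predictions are illustrative, not the actual classifier:

```python
from collections import Counter

def kfold_indices(n, k=5):
    """Split range(n) into k (train, validation) index pairs with contiguous folds."""
    splits = []
    for i in range(k):
        val = set(range(i * n // k, (i + 1) * n // k))
        train = [j for j in range(n) if j not in val]
        splits.append((train, sorted(val)))
    return splits

def majority_vote(per_model_predictions):
    """Given one label sequence per model, return the per-document majority label."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*per_model_predictions)]

# predictions of five hypothetical fold models on four test documents
per_model = [[1, 0, 1, 0], [1, 1, 1, 0], [0, 0, 1, 0], [1, 0, 0, 1], [1, 0, 1, 0]]
voted = majority_vote(per_model)
```

In the real system each of the 5 models is a Bi-LSTM trained on a different train/validation split, and the vote above is taken per label.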
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.1">The SVM classifier</head><p>The SVM classifier exploits a wide set of features ranging across different levels of linguistic description. With the exception of the word embedding combination, these features were already tested in our previous participation in the EVALITA 2016 SENTIPOLC edition <ref type="bibr">(Cimino et al., 2016)</ref>. The features are organised into three main categories: raw and lexical text features, morpho-syntactic features and lexicon features. Due to size constraints we report only the feature names.</p><p>Raw and Lexical Text Features: number of tokens, character n-grams, word n-grams, lemma n-grams, repetition of character n-grams, number of mentions, number of hashtags, punctuation.</p></div>
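A few of the raw and lexical text features can be sketched as follows; the helper names and the example tweet are hypothetical, and real extraction would work on the PoS-tagged text:

```python
def char_ngrams(text, n):
    """All contiguous character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(tokens, n):
    """All contiguous word n-grams of the token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def raw_text_features(text):
    """A small subset of the raw/lexical features named above."""
    tokens = text.split()
    return {
        "n_tokens": len(tokens),
        "n_mentions": sum(t.startswith("@") for t in tokens),
        "n_hashtags": sum(t.startswith("#") for t in tokens),
        "word_bigrams": word_ngrams(tokens, 2),
        "char_trigrams": char_ngrams(text, 3),
    }

feats = raw_text_features("@hotel ottima #vista mare")
```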
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Morpho-syntactic Features coarse grained</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.2">The Deep Neural Network classifier</head><p>We tested two different models based on Bi-LSTMs: one that learns to classify the labels without sharing information among the labels in the training phase (single-task learning, STL), and one that learns to classify the labels exploiting the related information through a shared Bi-LSTM (multi-task learning, MTL). We employed Bi-LSTM architectures since they capture long-range dependencies from both directions of a document by constructing bidirectional links in the network <ref type="bibr" target="#b16">(Schuster et al., 1997)</ref>. We applied a dropout factor to both the input gates and the recurrent connections in order to prevent overfitting, a typical issue in neural networks <ref type="bibr">(Gal and Ghahramani, 2015)</ref>. We chose a dropout factor of 0.50.</p><p>For what concerns GxG, since we had to deal with longer documents such as news articles, we employed a two-layer Bi-LSTM encoder: the first Bi-LSTM layer encodes each sentence as a token sequence, and the second layer encodes the sequence of sentences. For what concerns IronITA, we added a task-specific Bi-LSTM for each subtask before the dense layer.</p><p>Figure <ref type="figure" target="#fig_1">1</ref> shows a graphical representation of the STL and MTL architectures we employed. For what concerns the optimization process, the binary cross-entropy function is used as loss function and optimization is performed by the rmsprop optimizer <ref type="bibr" target="#b19">(Tieleman and Hinton, 2012)</ref>.</p></div>
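Assuming a Keras backend as cited above, a shared-encoder MTL model of this kind could be sketched as follows; the layer sizes, sequence length, feature dimension and number of labels are illustrative, not the values used in the experiments:

```python
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN, FEAT_DIM, N_LABELS = 50, 150, 3  # illustrative sizes

inputs = keras.Input(shape=(SEQ_LEN, FEAT_DIM))
# shared Bi-LSTM encoder with dropout 0.5 on inputs and recurrent connections
encoded = layers.Bidirectional(
    layers.LSTM(64, dropout=0.5, recurrent_dropout=0.5))(inputs)
# MTL: one sigmoid head per binary label, all sharing the encoder parameters
outputs = [layers.Dense(1, activation="sigmoid", name=f"label_{i}")(encoded)
           for i in range(N_LABELS)]
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="binary_crossentropy")
```

The STL variant simply trains one such network per label instead of attaching all the heads to a single shared encoder.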
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Each input word is represented by a vector</head><p>which is composed by: Word embeddings: the concatenation of the word embeddings extracted from the available word embedding lexicons (128 dimensions for each word embedding); for each word embedding an extra component was added to handle the "unknown word" case (1 dimension for each lexicon used). Word polarity: the word polarity obtained by exploiting the sentiment polarity lexicons; this results in 3 components, one for each possible lexicon outcome (negative, neutral, positive) (3 dimensions), and we assumed that a word not found in the lexicons has neutral polarity. Automatically generated sentiment polarity lexicons for social media: the presence or absence of the word in a lexicon and its relative frequency if found; since we built TW_GEN, TW_POS and TW_NEG, 6 dimensions are needed, 2 for each lexicon. Coarse-grained part-of-speech: 13 dimensions. End of sentence: a component (1 dimension) indicating whether the sentence has been completely read.</p></div>
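Putting the components together, the per-token input vector can be assembled as follows (a sketch with two word embedding lexicons; the helper name and default values are illustrative):

```python
import numpy as np

EMB, N_LEX, POS_TAGS = 128, 2, 13  # two word-embedding lexicons in this sketch

def token_vector(word_embs, unk_flags, polarity, tw_lexicons, cgpos, eos):
    """Concatenate the components listed above into one per-token input vector:
    N_LEX * 128 embedding dims + N_LEX unknown-word flags + 3 polarity dims +
    6 TW-lexicon dims + 13 coarse PoS dims + 1 end-of-sentence dim."""
    return np.concatenate(list(word_embs) + [unk_flags, polarity, tw_lexicons, cgpos, eos])

vec = token_vector(
    [np.zeros(EMB), np.zeros(EMB)],   # one embedding per lexicon
    np.zeros(N_LEX),                  # unknown-word flags
    np.array([0.0, 1.0, 0.0]),        # polarity one-hot: neutral by default
    np.zeros(6),                      # TW_GEN/TW_POS/TW_NEG presence + rel. freq.
    np.eye(POS_TAGS)[0],              # coarse-grained PoS one-hot
    np.array([0.0]),                  # end-of-sentence flag
)
```

With two lexicons this gives 2*128 + 2 + 3 + 6 + 13 + 1 = 281 dimensions per token; the count varies with the number of lexicons used in each task.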
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Results and Discussion</head><p>Table <ref type="table" target="#tab_0">2</ref> reports the official results obtained by our best runs on all the task we participated. As it can be noted our system performed extremely well, achieving the best scores almost in every single subtask. In the following subsections a discussion of the results obtained in each task is provided.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">ABSITA</head><p>We tested five learning configurations of our system based on linear SVM and DNN learning algorithms using the features described in sections 1.2.1 and 1.2.2. All the experiments were aimed at testing the contribution, in terms of f-score, of MTL vs STL, of the k-fold technique and of the external resources. For what concerns the Bi-LSTM learning algorithm, we tested Bi-LSTMs both in the STL and MTL scenarios. In addition, to test the contribution of the Booking word embeddings, we created a configuration which uses a shallow Bi-LSTM in the MTL setting without these embeddings (MTL NO BOOKING-WE). Finally, to test the contribution of the k-fold technique, we created a configuration which does not use it (MTL NO K-FOLD). To obtain fair comparisons in the last case, we ran all the experiments 5 times and averaged the scores of the runs. To test the proposed classification models, we created an internal development set by randomly selecting documents from the training sets distributed by the task organizers; the resulting development set is composed of approximately 10% (561 documents) of the whole training set. The MTL configuration was the best performing among all the models, but the difference in terms of f-score among the DNN configurations is small.</p><p>When analyzing the results obtained on the ACP task, we can notice remarkable differences among the performances of the models. Again the linear SVM was the worst performing model, this time with a difference of 6 f-score points with respect to MTL, the best performing model on the task. It is interesting to notice that the results achieved by the DNN models show bigger differences in terms of f-score with respect to the ACD task: this suggests that the external resources and the k-fold technique contributed significantly to obtaining the best result in the ACP task. The configuration that does not use the k-fold technique scored 2 f-score points less than the MTL configuration. We can also notice that the Booking word embeddings were particularly helpful in this task: the MTL NO BOOKING-WE configuration in fact scored 5 points less than the best configuration. The results obtained on the internal development set led us to choose the models for the official runs on the provided test set. Table <ref type="table">4</ref> reports the overall accuracies achieved by all our classifier configurations on the official test set; the official submitted runs are starred in the table.</p><p>As can be noticed, the best scores in both the ACD and ACP tasks were obtained by the DNN models. Surprisingly, the differences in terms of f-score were reduced in both tasks, with the exception of linear SVM, which performed 4 and 8 f-score points less in the ACD and ACP tasks respectively when compared to the best DNN models. The STL model outperformed the MTL models in the ACD task, even though the difference in terms of f-score is not relevant. When the results on ACP are considered, the MTL model outperformed all the other models, even though the difference in terms of f-score with respect to the STL model is not noticeable. It is worth noticing that the k-fold technique and the Booking word embeddings seemed to again contribute to the final accuracy of the MTL system: the MTL NO BOOKING-WE model and the MTL NO K-FOLD model scored 1.2 and 1.5 f-score points less than the MTL system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">GxG</head><p>We tested three different learning configurations of our system based on linear SVM and DNN learning algorithms using the features described in sections 1.2.1 and 1.2.2. For what concerns the Bi-LSTM learning algorithm, we tested both the STL and MTL approaches. We tested the three configurations for each of the five in-domain subtasks and for each of the five cross-domain subtasks. Table <ref type="table">8</ref>: Classification results of the different learning models on the official test set in terms of accuracy for the cross-domain tasks (* marks runs that outperformed all the systems that participated in the task).</p><p>Tables <ref type="table" target="#tab_6">7</ref> and 8 report the overall accuracy, computed as the average accuracy for the two classes (male and female), achieved by the models on the official test sets for the in-domain and the cross-domain tasks respectively (* marks the runs that obtained the best results in the competition). For what concerns the in-domain subtasks, the performances appear not to be in line with the ones obtained on the development set, but our models still outperform the other participants' systems in four out of five subtasks. The MTL model provided the best results for the Children and Diary test sets, while on the other test sets all the models performed quite poorly. Again, when trained on all the datasets, in- and cross-domain, the SVM (SVMa) performs worse than when trained on in-domain data only (SVM). For what concerns the cross-domain subtasks, while our model achieves the best performance in three out of five subtasks, the results confirm poor performance over all the subtasks, again indicating that the models have difficulties in cross-domain generalization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">HaSpeeDe</head><p>We tested seven learning configurations of our system based on linear SVM and DNN learning algorithms using the features described in sections 1.2.1 and 1.2.2. All the experiments were aimed at testing the contribution, in terms of f-score, of the number of layers, of MTL vs STL, of the k-fold technique and of the external resources. For what concerns the Bi-LSTM learning algorithm, we tested one- and two-layer Bi-LSTMs both in the STL and MTL scenarios. In addition, to test the contribution of the sentiment lexicon features, we created a configuration which uses a 2-layer Bi-LSTM in the MTL setting without these features (1L MTL NO SNT). Finally, to test the contribution of the k-fold technique, we created a configuration which does not use it (1 STL NO K-FOLD). To obtain fair results in the last case, we ran all the experiments 5 times and averaged the scores of the runs. To test the proposed classification models, we created two internal development sets, one for each dataset, by randomly selecting documents from the training sets distributed by the task organizers. The resulting development sets are composed of 10% (300 documents) of the whole training sets.</p><p>Table <ref type="table">9</ref> reports the overall accuracies achieved by the models on our internal development sets for all the tasks. In addition, the results of a baseline system (baseline row), which always emits the most probable label according to the label distribution in the training set, are reported. The accuracy is calculated as the f-score obtained using the evaluation tool provided by the organizers. 
For what concerns the k-fold learning strategy, we can notice that the results achieved by the model not using it (1 STL NO K-FOLD) are always lower than those of the counterpart which used the k-fold approach (+2.5 f-score points gained in the C TW task), showing the benefits of this technique. These results led us to choose the models for the official runs on the provided test set. Table <ref type="table">10</ref> reports the overall accuracies achieved by all our classifier configurations on the official test set; the official submitted runs are starred in the table. The best official system row reports, for each task, the best official results submitted by the participants of the EVALITA 2018 HaSpeeDe shared task. As we can note, the best scores in each task were obtained by the Bi-LSTMs in the MTL setting, showing that MTL networks seem to be more effective than STL networks. For what concerns the Twitter in-domain task, we obtained results similar to the development set ones. A noticeable drop in performance is observed in the FB task w.r.t. the development set (-5 f-score points on average); still, the Bi-LSTM models outperformed the linear SVM model by 5 f-score points. In the cross-domain tasks, all the models performed similarly to what was observed on the development set. It is worth observing that the linear SVM performed almost as a baseline system in the C FB task. In addition, in the same task the model exploiting the sentiment lexicon (1L MTL) showed a better performance (+1.5 f-score points) w.r.t. the 1L MTL NO SNT model. It is worth noticing that the k-fold learning strategy was beneficial also on the official test set: the 1L STL model obtained better results (approximately +2 f-score points in each task) w.r.t. the model that did not use the k-fold learning strategy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">IronITA</head><p>We tested the four designed learning configurations of our system based on linear SVM and deep neural network (DNN) learning algorithms using the features described in sections 1.2.1 and 1.2.2. To select the proposed classification models, we used k-fold cross validation (k=4).</p><p>Table <ref type="table" target="#tab_8">11</ref> reports the overall average f-score achieved by the models on the cross-validation sets for both the irony and sarcasm detection tasks.</p><p>We can observe that the SVM obtains good results on irony detection, but the MTL neural approach noticeably outperforms it. We also note that the usage of the additional Polarity and Hate Speech datasets leads to better performance. These results led us to choose the MTL models trained with the additional datasets for the two official run submissions.</p><p>Table <ref type="table" target="#tab_9">12</ref> reports the overall accuracies achieved by all our classifier configurations on the official test set; the official submitted runs are starred in the table. The accuracies have been computed in terms of f-score using the official evaluation script. We submitted the runs MTL+Polarity and MTL+Polarity+Hate. The run MTL+Polarity ranked first in subtask A and third in subtask B on the official leaderboard; the run MTL+Polarity+Hate ranked second in subtask A and fourth in subtask B.</p><p>The results on the test set confirm the good performance of the SVM classifier on the irony detection task and that the MTL neural approaches outperform the SVM. The model trained on the IronITA and SENTIPOLC datasets outperformed all the systems that participated in subtask A, while in subtask B it slightly underperformed the best participant system. The model trained on the IronITA, SENTIPOLC and HaSpeeDe datasets outperformed all the systems that participated in subtask A except our model trained on the IronITA and SENTIPOLC datasets only. However, the best scores in both tasks were obtained by the MTL network trained on the IronITA dataset only: this model would have outperformed all the systems submitted by all participants to both subtasks. It seems that for these tasks the usage of additional datasets leads to overfitting issues.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Conclusions</head><p>In this paper we reported the results of our participation in the ABSITA, GxG, HaSpeeDe and IronITA shared tasks of the EVALITA 2018 conference. By resorting to a system which uses Support Vector Machines and Deep Neural Networks (DNNs) as learning algorithms, we achieved the best scores in almost every task, showing the effectiveness of our approach. In addition, when a DNN was used as learning algorithm, we introduced a new multi-task learning approach and a majority vote classification approach to further improve the overall accuracy of our system. The proposed system proved a very effective solution, achieving the first position in almost all subtasks of each shared task.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Part-Of-Speech n-grams, Fine grained Part-Of-Speech n-grams, Coarse grained Part-Of-Speech distribution Lexicon features Emoticons Presence, Lemma sentiment polarity n-grams, Polarity modifier, PMI score, sentiment polarity distribution, Most frequent sentiment polarity, Sentiment polarity in text sections, Word embeddings combination.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: STL and MTL architectures.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2 :</head><label>2</label><figDesc>Classification results of our best runs on the ABSITA, GxG, HaSpeeDe and IronITA test sets.</figDesc><table><row><cell>Task</cell><cell cols="2">Our Score Best Score</cell><cell>Rank</cell></row><row><cell></cell><cell>ABSITA</cell><cell></cell><cell></cell></row><row><cell>ACD</cell><cell>0.811</cell><cell>0.811</cell><cell>1</cell></row><row><cell>ACP</cell><cell>0.767</cell><cell>0.767</cell><cell>1</cell></row><row><cell></cell><cell cols="2">GxG IN-DOMAIN</cell><cell></cell></row><row><cell>CH</cell><cell>0.640</cell><cell>0.640</cell><cell>1</cell></row><row><cell>DI</cell><cell>0.676</cell><cell>0.676</cell><cell>1</cell></row><row><cell>JO</cell><cell>0.555</cell><cell>0.585</cell><cell>2</cell></row><row><cell>TW</cell><cell>0.595</cell><cell>0.595</cell><cell>1</cell></row><row><cell>YT</cell><cell>0.555</cell><cell>0.555</cell><cell>1</cell></row><row><cell cols="3">GxG CROSS-DOMAIN</cell><cell></cell></row><row><cell>CH</cell><cell>0.640</cell><cell>0.640</cell><cell>1</cell></row><row><cell>DI</cell><cell>0.595</cell><cell>0.635</cell><cell>2</cell></row><row><cell>JO</cell><cell>0.510</cell><cell>0.515</cell><cell>2</cell></row><row><cell>TW</cell><cell>0.609</cell><cell>0.609</cell><cell>1</cell></row><row><cell>YT</cell><cell>0.513</cell><cell>0.513</cell><cell>1</cell></row><row><cell></cell><cell>HaSpeeDe</cell><cell></cell><cell></cell></row><row><cell>TW</cell><cell>0.799</cell><cell>0.799</cell><cell>1</cell></row><row><cell>FB</cell><cell>0.829</cell><cell>0.829</cell><cell>1</cell></row><row><cell>C TW</cell><cell>0.699</cell><cell>0.699</cell><cell>1</cell></row><row><cell>C FB</cell><cell>0.607</cell><cell>0.654</cell><cell>5</cell></row><row><cell></cell><cell>IronITA</cell><cell></cell><cell></cell></row><row><cell>IRONY</cell><cell>0.730</cell><cell>0.730</cell><cell>1</cell></row><row><cell>SARCASM</cell><cell>0.516</cell><cell>0.520</cell><cell>3</cell></row><row><cell cols="4">an internal development set by randomly selecting documents from the training sets distributed by the task organizers. The resulting development set comprises approximately 10% (561 documents) of the whole training set.</cell></row><row><cell cols="2">Configuration</cell><cell cols="2">ACD ACP</cell></row><row><cell>baseline</cell><cell></cell><cell cols="2">0.313 0.197</cell></row><row><cell>linear SVM</cell><cell></cell><cell cols="2">0.797 0.739</cell></row><row><cell>STL</cell><cell></cell><cell cols="2">0.821 0.795</cell></row><row><cell cols="2">MTL</cell><cell cols="2">0.824 0.804</cell></row><row><cell cols="2">MTL NO K-FOLD</cell><cell cols="2">0.819 0.782</cell></row><row><cell cols="2">MTL NO BOOKING-WE</cell><cell cols="2">0.817 0.757</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3 :</head><label>3</label><figDesc>Classification results (micro f-score) of the different learning models on our ABSITA development set.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>reports the overall accuracies achieved by the models on the internal development set for all the tasks. In addition, the results of the baseline system (baseline row), which always emits the most probable label according to the label distributions in the training set, are reported. The accuracy is calculated as the micro f-score obtained using the evaluation tool provided by the organizers. As regards the ACD task, it is worth noting that the models based on DNNs always outperform the linear SVM, even though the difference in terms of f-score is small (approximately 2 f-score points).</figDesc><table><row><cell>Configuration</cell><cell>ACD</cell><cell>ACP</cell></row><row><cell>baseline</cell><cell>0.338</cell><cell>0.199</cell></row><row><cell>linear SVM</cell><cell>0.772*</cell><cell>0.686*</cell></row><row><cell>STL</cell><cell>0.814</cell><cell>0.765</cell></row><row><cell>MTL</cell><cell>0.811*</cell><cell>0.767*</cell></row><row><cell>MTL NO K-FOLD</cell><cell>0.801</cell><cell>0.755</cell></row><row><cell>MTL NO BOOKING-WE</cell><cell>0.808</cell><cell>0.753</cell></row><row><cell cols="3">Table 4: Classification results (micro f-score) of the different learning models on the ABSITA official test set.</cell></row></table></figure>
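The micro f-score used in these tables pools true positives, false positives and false negatives over all documents before computing precision and recall. A minimal sketch (our illustration, not the organizers' evaluation tool; the aspect labels are invented):

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over multi-label annotations.

    gold, pred: lists of sets of labels, one set per document.
    Counts are pooled over all documents, so frequent labels
    weigh more than rare ones.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # labels predicted and correct
        fp += len(p - g)  # labels predicted but wrong
        fn += len(g - p)  # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [{"cleanliness"}, {"staff", "location"}]
pred = [{"cleanliness", "staff"}, {"staff"}]
print(round(micro_f1(gold, pred), 3))  # 0.667
```

Under this scheme tp=2, fp=1, fn=1 in the example, giving precision = recall = 2/3.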
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head></head><label></label><figDesc>To test the proposed classification models, we created internal development sets by randomly selecting documents from the training sets distributed by the task organizers. The resulting development sets comprise approximately 10% of each data set. As for the in-domain task, we trained the SVM classifier both on in-domain data only and on in-domain plus cross-domain data.</figDesc><table><row><cell cols="2">Model CH</cell><cell>DI</cell><cell>JO</cell><cell>TW</cell><cell>YT</cell></row><row><cell cols="6">SVMa 0.667 0.626 0.485 0.582 0.611</cell></row><row><cell>SVM</cell><cell cols="5">0.701 0.737 0.560 0.728 0.619</cell></row><row><cell>STL</cell><cell cols="5">0.556 0.545 0.500 0.724 0.596</cell></row><row><cell>MTL</cell><cell cols="5">0.499 0.817 0.625 0.729 0.632</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5 :</head><label>5</label><figDesc>Classification results of the different learning models on the development set in terms of accuracy for the in-domain tasks.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 5</head><label>5</label><figDesc>and 6 report the overall accuracy, computed as the average accuracy over the two classes (male and female), achieved by the models on the development data sets for the in-domain and the cross-domain tasks respectively.</figDesc><table><row><cell cols="2">Model CH</cell><cell>DI</cell><cell>JO</cell><cell>TW</cell><cell>YT</cell></row><row><cell>SVM</cell><cell cols="5">0.530 0.565 0.580 0.588 0.568</cell></row><row><cell>STL</cell><cell cols="5">0.550 0.535 0.505 0.625 0.580</cell></row><row><cell>MTL</cell><cell cols="5">0.523 0.549 0.538 0.500 0.556</cell></row><row><cell cols="6">Table 6: Classification results of the different learning models on the development set in terms of accuracy for the cross-domain tasks.</cell></row><row><cell cols="6">For the in-domain tasks we observe that the SVM performs well on the smaller datasets (Children and Diary), while the MTL neural network has the best overall performance. When trained on all the datasets, in- and cross-domain, the SVM (SVMa) performs worse than when trained on in-domain data only (SVM). As for the cross-domain datasets, we observe poor performance on all the subtasks with all the employed models, implying that the models have difficulty generalizing across domains.</cell></row><row><cell cols="2">Model CH</cell><cell>DI</cell><cell>JO</cell><cell>TW</cell><cell>YT</cell></row><row><cell cols="2">SVMa 0.545</cell><cell>0.514</cell><cell cols="2">0.475 0.539</cell><cell>0.585</cell></row><row><cell>SVM</cell><cell>0.550</cell><cell>0.649</cell><cell cols="2">0.555 0.567</cell><cell>0.555*</cell></row><row><cell>STL</cell><cell>0.545</cell><cell>0.541</cell><cell cols="3">0.500 0.595* 0.512</cell></row><row><cell>MTL</cell><cell cols="4">0.640* 0.676* 0.470 0.561</cell><cell>0.546</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 7 :</head><label>7</label><figDesc>Classification results of the different learning models on the official test set in terms of accuracy for the in-domain tasks (* marks runs that outperformed all the systems that participated in the task).</figDesc><table><row><cell cols="2">Model CH</cell><cell>DI</cell><cell>JO</cell><cell>TW</cell><cell>YT</cell></row><row><cell>SVM</cell><cell>0.540</cell><cell cols="3">0.514 0.505 0.586</cell><cell>0.513*</cell></row><row><cell>STL</cell><cell cols="5">0.640* 0.554 0.495 0.609* 0.510</cell></row><row><cell>MTL</cell><cell>0.535</cell><cell cols="3">0.595 0.510 0.500</cell><cell>0.500</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 11 :</head><label>11</label><figDesc>Classification results of the different learning models under k-fold cross-validation in terms of average F1-score.</figDesc><table><row><cell>Configuration</cell><cell cols="2">Irony Sarcasm</cell></row><row><cell>linear SVM</cell><cell cols="2">0.734 0.512</cell></row><row><cell>MTL</cell><cell cols="2">0.745 0.530</cell></row><row><cell>MTL+Polarity</cell><cell cols="2">0.757 0.562</cell></row><row><cell cols="3">MTL+Polarity+Hate 0.760 0.557</cell></row><row><cell>Configuration</cell><cell>Irony</cell><cell>Sarcasm</cell></row><row><cell>baseline-random</cell><cell>0.505</cell><cell>0.337</cell></row><row><cell>baseline-mfc</cell><cell>0.334</cell><cell>0.223</cell></row><row><cell>best participant</cell><cell>0.730</cell><cell>0.52</cell></row><row><cell>linear SVM</cell><cell>0.701</cell><cell>0.493</cell></row><row><cell>MTL</cell><cell>0.736</cell><cell>0.530</cell></row><row><cell>MTL+Polarity</cell><cell cols="2">0.730* 0.516*</cell></row><row><cell cols="3">MTL+Polarity+Hate 0.713* 0.503*</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_9"><head>Table 12 :</head><label>12</label><figDesc>Classification results of the different learning models on the official test set in terms of F1-score (* submitted run).</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://api.yandex.com/translate/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://code.google.com/p/word2vec/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://wacky.sslmit.unibo.it/doku.php?id=corpora</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 used for this research.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of the EVALITA 2016 SENTiment POLarity Classification Task</title>
		<author>
			<persName><forename type="first">Francesco</forename><surname>Barbieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Valerio</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Danilo</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Malvina</forename><surname>Nissim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nicole</forename><surname>Novielli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Viviana</forename><surname>Patti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of EVALITA &apos;16, Evaluation of NLP and Speech Tools for Italian</title>
				<meeting>EVALITA &apos;16, Evaluation of NLP and Speech Tools for Italian<address><addrLine>Naples, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016-12">2016. December</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">Pierpaolo</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Valerio</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Danilo</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marco</forename><surname>Polignano</surname></persName>
		</author>
		<title level="m">Overview of the EVALITA Aspect-based Sentiment Analysis (ABSITA) Task</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m">Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA&apos;18)</title>
				<editor>
			<persName><forename type="first">T</forename><surname>Caselli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Novielli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</editor>
		<meeting>the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA&apos;18)<address><addrLine>Turin, Italy</addrLine></address></meeting>
		<imprint>
			<date>December</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Overview of the Evalita 2018 Hate Speech Detection Task</title>
		<author>
			<persName><forename type="first">Cristina</forename><surname>Bosco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Felice</forename><surname>Dell'orletta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fabio</forename><surname>Poletto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Manuela</forename><surname>Sanguinetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Maurizio</forename><surname>Tesconi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of EVALITA &apos;18, Evaluation of NLP and Speech Tools for Italian</title>
				<editor>
			<persName><forename type="first">T</forename><surname>Caselli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Novielli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</editor>
		<meeting>EVALITA &apos;18, Evaluation of NLP and Speech Tools for Italian<address><addrLine>Turin, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-12">2018. December</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
<persName><forename type="first">François</forename><surname>Chollet</surname></persName>
		</author>
		<ptr target="https://github.com/fchollet/keras/tree/master/keras" />
		<title level="m">Keras</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">Alessandra</forename><surname>Cignarella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Simona</forename><surname>Frenda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Valerio</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cristina</forename><surname>Bosco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Viviana</forename><surname>Patti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Rosso</surname></persName>
		</author>
		<title level="m">Overview of the Evalita 2018 Task on Irony Detection in Italian Tweets</title>
				<imprint>
			<publisher>IronITA</publisher>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m">Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA&apos;18)</title>
				<editor>
			<persName><forename type="first">T</forename><surname>Caselli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Novielli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</editor>
		<meeting>the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA&apos;18)</meeting>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Tandem LSTM-SVM Approach for Sentiment Analysis</title>
		<author>
			<persName><forename type="first">Andrea</forename><surname>Cimino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Felice</forename><surname>Dell'orletta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of EVALITA &apos;16, Evaluation of NLP and Speech Tools for Italian</title>
				<meeting>EVALITA &apos;16, Evaluation of NLP and Speech Tools for Italian<address><addrLine>Naples, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016-12">2016. December</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Building the state-of-the-art in POS tagging of Italian Tweets</title>
		<author>
			<persName><forename type="first">Andrea</forename><surname>Cimino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Felice</forename><surname>Dell'orletta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) &amp; Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016)</title>
				<meeting>Third Italian Conference on Computational Linguistics (CLiC-it 2016) &amp; Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016)<address><addrLine>Napoli, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016-12-05">2016. December 5-7, 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Multi-Task Learning in Deep Neural Network for Sentiment Polarity and Irony classification</title>
		<author>
			<persName><forename type="first">Lorenzo</forename><surname>De Mattei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrea</forename><surname>Cimino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Felice</forename><surname>Dell'Orletta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd Workshop on Natural Language for Artificial Intelligence</title>
				<meeting>the 2nd Workshop on Natural Language for Artificial Intelligence<address><addrLine>Trento, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-11-22">2018. November 22-23, 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Overview of the EVALITA Cross-Genre Gender Prediction in Italian (GxG) Task</title>
		<author>
			<persName><forename type="first">Felice</forename><surname>Dell'Orletta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Malvina</forename><surname>Nissim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA&apos;18)</title>
				<editor>
			<persName><forename type="first">T</forename><surname>Caselli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Novielli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</editor>
		<meeting>the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA&apos;18)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research</title>
		<author>
			<persName><forename type="first">Rong-En</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kai-Wei</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cho-Jui</forename><surname>Hsieh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiang-Rui</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chih-Jen</forename><surname>Lin</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="1871" to="1874" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">A theoretically grounded application of dropout in recurrent neural networks</title>
		<author>
			<persName><forename type="first">Yarin</forename><surname>Gal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zoubin</forename><surname>Ghahramani</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1512.05287</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Efficient estimation of word representations in vector space</title>
		<author>
			<persName><forename type="first">Sepp</forename><surname>Hochreiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jürgen</forename><surname>Schmidhuber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tomas</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kai</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Greg</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1301.3781</idno>
	</analytic>
	<monogr>
		<title level="m">Neural Computation</title>
				<imprint>
			<date type="published" when="1997">1997. 2013</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
	<note>Long short-term memory</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">SemEval-2016 task 4: Sentiment analysis in Twitter</title>
		<author>
			<persName><forename type="first">Preslav</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alan</forename><surname>Ritter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sara</forename><surname>Rosenthal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fabrizio</forename><surname>Sebastiani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Veselin</forename><surname>Stoyanov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th International Workshop on Semantic Evaluation</title>
				<meeting>the 10th International Workshop on Semantic Evaluation<address><addrLine>SemEval-</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016. 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Sluice networks: Learning what to share between loosely related tasks</title>
		<author>
			<persName><forename type="first">Sebastian</forename><surname>Ruder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Joachim</forename><surname>Bingel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Isabelle</forename><surname>Augenstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anders</forename><surname>Søgaard</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1705.08142</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Bidirectional recurrent neural networks</title>
		<author>
			<persName><forename type="first">Mike</forename><surname>Schuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kuldip</forename><forename type="middle">K</forename><surname>Paliwal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Signal Processing</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="issue">11</biblScope>
			<biblScope unit="page" from="2673" to="2681" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Deep multi-task learning with low level tasks supervised at lower layers</title>
		<author>
			<persName><forename type="first">Anders</forename><surname>Søgaard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoav</forename><surname>Goldberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 54th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Berlin, Germany</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="231" to="235" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Document modeling with gated recurrent neural network for sentiment classification</title>
		<author>
			<persName><forename type="first">Duyu</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bing</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ting</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of EMNLP 2015</title>
				<meeting>EMNLP 2015<address><addrLine>Lisbon, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1422" to="1432" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Lecture 6.5-RmsProp: Divide the gradient by a running average of its recent magnitude</title>
		<author>
			<persName><forename type="first">Tijmen</forename><surname>Tieleman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geoffrey</forename><surname>Hinton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">COURSERA: Neural Networks for Machine Learning</title>
				<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Recognizing contextual polarity in phraselevel sentiment analysis</title>
		<author>
			<persName><forename type="first">Theresa</forename><surname>Wilson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zornitsa</forename><surname>Kozareva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Preslav</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sara</forename><surname>Rosenthal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Veselin</forename><surname>Stoyanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alan</forename><surname>Ritter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of HLT-EMNLP 2005</title>
				<meeting>HLT-EMNLP 2005<address><addrLine>Stroudsburg, PA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACL</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="347" to="354" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">UNIMELB at SemEval-2016 Tasks 4A and 4B: An Ensemble of Neural Networks and a Word2Vec Based Model for Sentiment Classification</title>
		<author>
			<persName><forename type="first">Xingyi</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Huizhi</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Timothy</forename><surname>Baldwin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th International Workshop on Semantic Evaluation</title>
				<meeting>the 10th International Workshop on Semantic Evaluation<address><addrLine>SemEval-</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016. 2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
