Sentiment Analysis using Convolutional Neural Networks with Multi-Task Training and Distant Supervision on Italian Tweets Jan Deriu Mark Cieliebak Zurich University of Applied Sciences Zurich University of Applied Sciences Switzerland Switzerland deri@zhaw.ch ciel@zhaw.ch Abstract T 3 : Irony detection: is the tweet ironic? The classic approaches to sentiment analysis usu- English. In this paper, we propose a clas- ally consist of manual feature engineering and ap- sifier for predicting sentiments of Italian plying some sort of classifier on these features Twitter messages. This work builds upon (Liu, 2015). Deep neural networks have shown a deep learning approach where we lever- great promises at capturing salient features for age large amounts of weakly labelled data these complex tasks (Mikolov et al., 2013b; Sev- to train a 2-layer convolutional neural net- eryn and Moschitti, 2015a). Particularly success- work. To train our network we apply a ful for sentiment classification were Convolutional form of multi-task training. Our system Neural Networks (CNN) (Kim, 2014; Kalchbren- participated in the EvalItalia-2016 com- ner et al., 2014; Severyn and Moschitti, 2015b; petition and outperformed all other ap- Johnson and Zhang, 2015), on which our work proaches on the sentiment analysis task. builds upon. These networks typically have a large In questo articolo, presentiamo un sis- number of parameters and are especially effective tema per la classificazione di soggettività when trained on large amounts of data. e polarità di tweet in lingua italiana. In this work, we use a distant supervision ap- L’approccio descritto si basa su reti neu- proach to leverage large amounts of data in order rali. In particolare, utilizziamo un dataset to train a 2-layer CNN 2 . More specifically, we di 300M di tweet per addestrare una con- train a neural network using the following three- volutional neural network. Il sistema è phase procedure: P 1 : creation of word embed- stato addestrato e valutato sui dati for- dings for the initialization of the first layer based niti dagli organizzatori di Sentipolc, task on an unsupervised corpus of 300M Italian tweets; di sentiment analysis su Twitter organiz- P 2 : distant supervised phase, where the net- zato nell’ambito di Evalita 2016.. work is pre-trained on a weakly labelled dataset of 40M tweets where the network weights and word embeddings are trained to capture aspects related 1 Introduction to sentiment; and P 3 : supervised phase, where the network is trained on the provided supervised Sentiment analysis is a fundamental problem aim- training data consisting of 7410 manually labelled ing to give a machine the ability to understand the tweets. emotions and opinions expressed in a written text. As the three tasks of EvalItalia-2016 are closely This is an extremely challenging task due to the related we apply a form of multitask training as complexity and variety of human language. proposed by (Collobert et al., 2011), i.e. we train The sentiment polarity classification task of one CNN for all the tasks simultaneously. This EvalItalia-2016 1 (sentipolc) consists of three sub- has two advantages: i) we need to train only tasks which cover different aspects of sentiment one model instead of three models, and ii) the detection: T 1 : Subjectivity detection: is the tweet CNN has access to more information which ben- subjective or objective? T 2 : Polarity detection: is efits the score. The experiments indicate that the the sentiment of the tweet neutral, positive, nega- multi-task CNN performs better than the single- tive or mixed? 1 2 http://www.di.unito.it/˜tutreeb/sentipolc- We here refer to a layer as one convolutional and one evalita16/index.html pooling layer. task CNN. After a small bugfix regarding the data- feature map matrix has the form Cpooled ∈ preprocessing our system outperforms all the other m× n−h+1−s R st . Depending on whether the borders systems in the sentiment polarity task. are included or not, the result of the fraction is rounded up or down respectively. 2 Convolutional Neural Networks Hidden layer. A fully connected hidden layer We train a 2-layer CNN using 9-fold cross- computes the transformation α(W ∗x+b), where validation and combine the outputs of the 9 re- W ∈ Rm×m is the weight matrix, b ∈ IRm the sulting classifiers to increase robustness. The 9 bias, and α the rectified linear (relu) activation classifiers differ in the data used for the super- function (Nair and Hinton, 2010). The output vec- vised phase since cross-validation creates 9 differ- tor of this layer, x ∈ Rm , corresponds to the sen- ent training and validation sets. tence embeddings for each tweet. The architecture of the CNN is shown in Fig- ure 1 and described in detail below. Softmax. Finally, the outputs of the hidden layer x ∈ Rm are fully connected to a soft-max regres- Sentence model. Each word in the input data is sion layer, which returns the class ŷ ∈ [1, K] with associated to a vector representation, which con- largest probability, sists in a d-dimensional vector. A sentence (or | tweet) is represented by the concatenation of the ex wj +aj ŷ := arg max PK , (2) representations of its n constituent words. This j x| wk +aj k=1 e yields a matrix S ∈ Rd×n , which is used as input to the convolutional neural network. where wj denotes the weights vector of class j and The first layer of the network consists of a lookup aj the bias of class j. table where the word embeddings are represented Network parameters. Training the neural net- as a matrix X ∈ Rd×|V| , where V is the vocabu- work consists in learning the set of parameters lary. Thus the i-th column of X represents the i-th Θ = {X, F1 , b1 , F2 , b2 , W, a}, where X is the word in the vocabulary V . embedding matrix, with each row containing the Convolutional layer. In this layer, a set of m fil- d-dimensional embedding vector for a specific ters is applied to a sliding window of length h over word; Fi , bi (i = {1, 2}) the filter weights and bi- each sentence. Let S[i:i+h] denote the concatena- ases of the first and second convolutional layers; tion of word vectors si to si+h . A feature ci is W the concatenation of the weights wj for every generated for a given filter F by: output class in the soft-max layer; and a the bias of the soft-max layer. X ci := (S[i:i+h] )k,j · Fk,j (1) Hyperparameters For both convolutional lay- k,j ers we set the length of the sliding window h to A concatenation of all vectors in a sentence pro- 5, the size of the pooling interval s is set to 3 in duces a feature vector c ∈ Rn−h+1 . The vectors both layers, where we use a striding of 2 in the c are then aggregated over all m filters into a fea- first layer, and the number of filters m is set to 200 ture map matrix C ∈ Rm×(n−h+1) . The filters in both convolutional layers. are learned during the training phase of the neu- Dropout Dropout is an alternative technique ral network using a procedure detailed in the next used to reduce overfitting (Srivastava et al., section. 2014). In each training stage individual nodes are dropped with probability p, the reduced neural net Max pooling. The output of the convolutional is updated and then the dropped nodes are rein- layer is passed through a non-linear activation serted. We apply Dropout to the hidden layer and function, before entering a pooling layer. The lat- to the input layer using p = 0.2 in both cases. ter aggregates vector elements by taking the max- imum over a fixed set of non-overlapping inter- Optimization The network parameters are vals. The resulting pooled feature map matrix has learned using AdaDelta (Zeiler, 2012), which n−h+1 the form: Cpooled ∈ Rm× s , where s is the adapts the learning rate for each dimension using length of each interval. In the case of overlap- only first order information. We used the default ping intervals with a stride value st , the pooled hyper-parameters. Convolutional pooled Convolutional pooled Hidden Softmax Sentence Matrix Feature Map repr. Feature Map repr. Layer Figure 1: The architecture of the CNN used in our approach. ber of dimensions is d = 52 3 . The resulting vo- Word Embeddings cat 0.1 0.9 0.3 cabulary contains 890K unique words. The word Raw Tweets word2vec cats 0.3 0.2 0.7 Twitter (200M) GloVe cute … 0.2 0.3 0.1 embeddings account for the majority of network Adapted Word Emb. 3-Phase Training parameters (42.2M out of 46.6M parameters) and :-) Smiley Distant cat cats 0.1 0.3 0.9 0.2 0.3 0.7 are updated during the next two phases to intro- :-( Tweets Supervision cute 0.2 0.3 0.1 (90M) . … duce sentiment specific information into the word embeddings and create a good initialization for the Annotated Tweets 2-Layer CNN. CNN (18k) Distant Supervised Phase We pre-train the Application CNN for 1 epoch on an weakly labelled dataset Unknown Predictive Tweet Model of 40M Italian tweets where each tweet contains an emoticon. The label is inferred by the emoti- Figure 2: The overall architecture of our 3-phase approach. cons inside the tweet, where we ignore tweets with opposite emoticons. This results in 30M positive tweets and 10M negative tweets. Thus, the classi- 3 Training fier is trained on a binary classification task. Supervised Phase During the supervised phase We train the parameters of the CNN using the we train the pre-trained CNN with the provided three-phase procedure as described in the intro- annotated data. The CNN is trained jointly on all duction. Figure 2 depicts the general flow of this tasks of EvalItalia. There are four different binary procedure. labels as well as some restrictions which result in 9 possible joint labels (for more details, see Sec- 3.1 Three-Phase Training tion 3.2). The multi-task classifier is trained to pre- dict the most likely joint-label. Preprocessing We apply standard preprocess- We apply 9-fold cross-validation on the dataset ing procedures of normalizing URLs, hashtags generating 9 equally sized buckets. In each round and usernames, and lowercasing the tweets. The we train the CNN using early stopping on the held- tweets are converted into a list of indices where out set, i.e. we train it as long as the score im- each index corresponds to the word position in the proves on the held-out set. For the multi-task train- vocabulary V . This representation is used as in- ing we monitor the scores for all 4 subtasks simul- put for the lookup table to assemble the sentence taneously and store the best model for each sub- matrix S. task. The training stops if there is no improvement of any of the 4 monitored scores. Word Embeddings We create the word embed- dings in phase P 1 using word2vec (Mikolov et al., Meta Classifier We train the CNN using 9-fold 2013a) and train a skip-gram model on a corpus of cross-validation, which results in 9 different mod- 300M unlabelled Italian tweets. The window size els. Each model outputs nine real-value numbers for the skip-gram model is 5, the threshold for the 3 According to the gensim implementation of word2vec minimal word frequency is set to 20 and the num- using d divisible by 4 speeds up the process. ŷ corresponding to the probabilities for each of scores show that the CNN is tuned too much to- the nine classes. To increase the robustness of the wards the held-out folds since the scores of the system we train a random forest which takes the held-out folds are significantly higher. For exam- outputs of the 9 models as its input. The hyper- ple, the average score of the positivity task is 0.733 parameters were found via grid-search to obtain on the held-out sets but only 0.6694 on the dev-set the best overall performance over a development and 0.6601 on the test-set. Similar differences in set: Number of trees (100), maximum depth of the sores can be observed for the other tasks as well. forest (3) and the number of features used per ran- To mitigate this problem we apply a random for- dom selection (5). est on the outputs of the 9 classifiers obtained by cross-validation. The results are shown in Table 3.2 Data 3. The meta-classifier outperforms the average The supervised training and test data is provided scores obtained by the CNNs by up to 2 points by the EvalItalia-2016 competition. Each tweet on the dev-set. The scores on the test-set show contains four labels: L1 : is the tweet subjective a slightly lower increase in score. Especially the or objective? L2 : is the tweet positive? L3 : is the single-task classifier did not benefit from the meta- tweet negative? L4 : is the tweet ironic? Further- classifier as the scores on the test set decreased in more an objective tweet implies that it is neither some cases. positive nor negative as well as not ironic. There The results show that the multi-task classifier out- are 9 possible combination of labels. performs the single-task classifier in most cases. To jointly train the CNN for all three tasks T 1, T 2 There is some variation in the magnitude of the and T 3 we join the labels of each tweet into a sin- difference: the multi-task classifier outperforms gle label. In contrast, the single task training trains the single-task classifier by 0.06 points in the neg- a single model for each of the four labels sepa- ativity task in the test-set but only by 0.005 points rately. in the subjectivity task. Table 1 shows an overview of the data available. Set Task Subjective Positive Negative Irony Fold-Set Single Task 0.723 0.738 0.721 0.646 Table 1: Overview of datasets provided in EvalItalia-2016. Multi Task 0.729 0.733 0.737 0.657 Label Training Set Test Set Dev-Set Single Task 0.696 0.650 0.685 0.563 Multi Task 0.710 0.669 0.699 0.595 Total 7410 2000 Subjective 5098 1305 Test-Set Single Task 0.705 0.652 0.696 0.526 Overall Positive 2051 352 Multi Task 0.681 0.660 0.700 0.540 Overall Negative 2983 770 Irony 868 235 Table 2: Average F1-score obtained after applying cross val- idation. 3.3 Experiments & Results Set Task Subjective Positive Negative Irony Dev-Set Single Task 0.702 0.693 0.695 0.573 We compare the performance of the multi-task Multi Task 0.714 0.686 0.717 0.604 CNN with the performance of the single-task Test-Set Single Task 0.712 0.650 0.643 0.501 Multi Task 0.714 0.653 0.713 0.536 CNNs. All the experiments start at the third-phase, i.e. the supervised phase. Since there was no Table 3: F1-Score obtained by the meta classifier. predefined split in training and development set, we generated a development set by sampling 10% 4 Conclusion uniformly at random from the provided training set. The development set is needed when assess- In this work we presented a deep-learning ap- ing the generalization power of the CNNs and the proach for sentiment analysis. We described the meta-classifier. For each task we compute the av- three-phase training approach to guarantee a high eraged F1-score (Barbieri et al., 2016). We present quality initialization of the CNN and showed the the results achieved on the dev-set and the test-set effects of using a multi-task training approach. To used for the competition. We refer to the set which increase the robustness of our system we applied was held out during a cross validation iteration as a meta-classifier on top of the CNN. The system fold-set. was evaluated in the EvalItalia-2016 competition In Table 2 we show the average results obtained where it achieved 1st place in the polarity task and by the 9 CNNs after the cross validation. The high positions on the other two subtasks. References Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Francesco Barbieri, Valerio Basile, Danilo Croce, Dropout: A Simple Way to Prevent Neural Networks Malvina Nissim, Nicole Novielli, and Viviana Patti. from Overfitting. Journal of Machine Learning Re- 2016. Overview of the EVALITA 2016 SENTi- search, 15:1929–1958. ment POLarity Classification Task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Matthew D. Zeiler. 2012. ADADELTA: An Adaptive Montemagni, Malvina Nissim, Viviana Patti, Gio- Learning Rate Method. arXiv, page 6. vanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Compu- tational Linguistics (CLiC-it 2016) & Fifth Evalua- tion Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Com- putazionale (AILC). Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (Almost) from Scratch. JMLR, 12:2493−2537. Rie Johnson and Tong Zhang. 2015. Semi-supervised Convolutional Neural Networks for Text Categoriza- tion via Region Embedding. In NIPS 2015 - Ad- vances in Neural Information Processing Systems 28, pages 919–927. Nal Kalchbrenner, Edward Grefenstette, and Phil Blun- som. 2014. A Convolutional Neural Network for Modelling Sentences. In ACL - Proceedings of the 52nd Annual Meeting of the Association for Com- putational Linguistics, pages 655–665, Baltimore, Maryland, USA, April. Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In EMNLP 2014 - Empiri- cal Methods in Natural Language Processing, pages 1746–1751, August. Bing Liu. 2015. Sentiment Analysis. Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013a. Exploiting Similarities among Languages for Machine Translation. arXiv, September. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- rado, and Jeff Dean. 2013b. Distributed represen- tations of words and phrases and their composition- ality. In Advances in neural information processing systems, pages 3111–3119. Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814. Aliaksei Severyn and Alessandro Moschitti. 2015a. Twitter Sentiment Analysis with Deep Convolu- tional Neural Networks. In 38th International ACM SIGIR Conference, pages 959–962, New York, USA, August. ACM. Aliaksei Severyn and Alessandro Moschitti. 2015b. UNITN: Training Deep Convolutional Neural Net- work for Twitter Sentiment Classification. In Se- mEval 2015 - Proceedings of the 9th International Workshop on Semantic Evaluation.