=Paper=
{{Paper
|id=Vol-1749/paper_029
|storemode=property
|title=Context–aware Convolutional Neural Networks for Twitter Sentiment Analysis in Italian
|pdfUrl=https://ceur-ws.org/Vol-1749/paper_029.pdf
|volume=Vol-1749
|authors=Giuseppe Castellucci,Danilo Croce,Roberto Basili
|dblpUrl=https://dblp.org/rec/conf/clic-it/CastellucciCB16
}}
==Context–aware Convolutional Neural Networks for Twitter Sentiment Analysis in Italian==
Context-aware Convolutional Neural Networks for Twitter Sentiment Analysis in Italian

Giuseppe Castellucci, Danilo Croce, Roberto Basili
Department of Enterprise Engineering, University of Roma, Tor Vergata
Via del Politecnico 1, 00133 Roma, Italy
castellucci@ing.uniroma2.it, {croce,basili}@info.uniroma2.it

Abstract

English. This paper describes the Unitor system that participated in the SENTIment POLarity Classification (SENTIPOLC) task proposed at Evalita 2016. The system implements a classification workflow made of several Convolutional Neural Network classifiers that generalize the linguistic information observed in the training tweets by also considering their context. Moreover, sentiment-specific information is injected into the training process by using Polarity Lexicons automatically acquired through the analysis of unlabeled collections of tweets. Unitor achieved the best results in the Subjectivity Classification sub-task, and it scored 2nd in the Polarity Classification sub-task, among about 25 different submissions.

Italiano (translated). This work describes the Unitor system evaluated in the SENTIment POLarity Classification task proposed within Evalita 2016. The system is based on a classification workflow implemented with Convolutional Neural Networks, which generalize the evidence observable in the training data by analyzing their contexts and by exploiting automatically generated sentiment-specific lexicons. The system obtained very good results, achieving the best performance in the Subjectivity Classification task and the second best in the Polarity Classification task.

1 Introduction

In this paper, the Unitor system participating in the SENTIment POLarity Classification (SENTIPOLC) task (Barbieri et al., 2016) within the Evalita 2016 evaluation campaign is described. The system is based on a cascade of three classifiers based on Deep Learning methods, and it has been applied to all three sub-tasks of SENTIPOLC: Subjectivity Classification, Polarity Classification and the pilot task called Irony Detection. Each classifier is implemented with a Convolutional Neural Network (CNN) (LeCun et al., 1998) according to the modeling proposed in (Croce et al., 2016). The adopted solution extends the CNN architecture proposed in (Kim, 2014) with (i) sentiment-specific information derived from an automatically acquired polarity lexicon (Castellucci et al., 2015a), and (ii) the contextual information associated with each tweet (see (Castellucci et al., 2015b) for more information about contextual modeling for Sentiment Analysis in Twitter). The Unitor system ranked 1st in the Subjectivity Classification task and 2nd in the Polarity Classification task among the unconstrained systems, making it one of the best solutions in the challenge. This is a remarkable result, as the CNNs have been trained without any complex feature engineering, adopting almost the same modeling in each sub-task. The proposed solution achieves state-of-the-art results in the Subjectivity Classification and Polarity Classification tasks by applying unsupervised analysis of unlabeled data that can be easily gathered from Twitter.

In Section 2 the deep learning architecture adopted in Unitor is presented, while the classification workflow is described in Section 3. In Section 4 the experimental results are reported and discussed, while Section 5 derives the conclusions.
2 A Sentiment and Context-aware Convolutional Neural Network

The Unitor system is based on the Convolutional Neural Network (CNN) architecture for text classification proposed in (Kim, 2014), and further extended in (Croce et al., 2016). This deep network is characterized by 4 layers (see Figure 1).

The first layer represents the input through word embeddings: a low-dimensional representation of words, which is derived by the unsupervised analysis of large-scale corpora, with approaches similar to (Mikolov et al., 2013). The embedding of a vocabulary V is a look-up table E, where each element is the d-dimensional representation of a word. Details about this representation are discussed in the next sections. Let x_i ∈ R^d be the d-dimensional representation of the i-th word. A sentence of length n is represented through the concatenation of the word vectors composing it, i.e., a matrix I whose dimension is n × d.

The second layer represents the convolutional features that are learned during the training stage. A filter, or feature detector, W ∈ R^{f×d}, is applied over the input layer matrix, producing the learned representations. In particular, a new feature c_i is learned according to c_i = g(W · I_{i:i+f-1} + b), where g is a non-linear function, such as the rectifier function, b ∈ R is a bias term and I_{i:i+f-1} is a portion of the input matrix along the first dimension. The filter slides over the input matrix, producing a feature map c = [c_1, ..., c_{n-f+1}]. The filter is applied over the whole input matrix under two key assumptions: local invariance and compositionality. The former specifies that the filter should learn to detect patterns in texts without considering their exact position in the input. The latter specifies that each local patch of height f, i.e., an f-gram, of the input should be considered in the learned feature representations. Ideally, an f-gram is composed through W into a higher-level representation.

In practice, multiple filters of different heights can be applied, resulting in a set of learned representations, which are combined in a third layer through the max-over-time operation, i.e., c̃ = max{c}. It is expected to select the most important features, i.e., the ones with the highest value, for each feature map. The max-over-time pooling operation also serves to make the learned features of a fixed size: it allows dealing with variable sentence lengths and adopting the learned features in fully connected layers.

This representation is finally used in the fourth layer, which is a fully connected softmax layer. It classifies the example into one of the categories of the task. In particular, this layer is characterized by a parameter matrix S and a bias term b_c that are used to classify a message, given the learned representation c̃. The final classification y is obtained through y = argmax_{y ∈ Y} softmax(S · c̃ + b_c), where Y is the set of classes of interest.

In order to reduce the risk of over-fitting, two forms of regularization are applied, as in (Kim, 2014). First, a dropout operation over the penultimate layer (Hinton et al., 2012) is adopted to prevent co-adaptation of hidden units by randomly dropping out, i.e., setting to zero, a portion of the hidden units during forward-backpropagation. The second regularization is obtained by constraining the l2 norm of S and b_c.

Figure 1: The Convolutional Neural Network architecture adopted for the Unitor system: the input layer looks up the word embedding and DPL channels, followed by convolution filters of widths (2,3,4), max pooling and a fully connected softmax layer.
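As an illustration of the four layers just described, the following minimal NumPy sketch computes the forward pass for a single-channel input: convolution with filters of height f, max-over-time pooling, and the final softmax classification. It is not the authors' TensorFlow implementation; the rectifier non-linearity, the function names and the omission of dropout are simplifying assumptions.

<pre>
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cnn_forward(I, filters, S, b_c):
    """Forward pass of the 4-layer CNN described above.

    I       : (n, d) input matrix, one embedding row per token
    filters : list of (W, b) pairs, W of shape (f, d), b a scalar
    S, b_c  : softmax layer parameters, S of shape (|Y|, len(filters))
    """
    pooled = []
    for W, b in filters:
        f = W.shape[0]
        n = I.shape[0]
        # feature map c = [c_1, ..., c_{n-f+1}], with c_i = g(W . I_{i:i+f-1} + b)
        c = np.array([relu(np.sum(W * I[i:i + f]) + b) for i in range(n - f + 1)])
        # max-over-time pooling: keep the strongest activation of each filter
        pooled.append(c.max())
    c_tilde = np.array(pooled)
    # fully connected softmax layer: y = argmax softmax(S . c_tilde + b_c)
    probs = softmax(S @ c_tilde + b_c)
    return int(np.argmax(probs)), probs
</pre>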
2.1 Injecting Sentiment Information through Polarity Lexicons

In (Kim, 2014), the use of word embeddings is advised to generalize lexical information. These word representations can capture paradigmatic relationships between lexical items, and they are well suited to help the generalization of learning algorithms in natural language tasks. However, paradigmatic relationships do not always reflect the relative sentiment between words. In Deep Learning, it is common practice to make the input representations trainable in the final learning stages. This is a valid strategy, but it makes the learning process more complex: the number of learnable parameters increases significantly, resulting in the need for more annotated examples in order to adequately estimate them.

We instead advocate the adoption of a multi-channel input representation, which is typical of CNNs in image processing. A first channel is dedicated to hosting representations derived from a word embedding. A second channel is introduced to inject sentiment information of words through a large-scale polarity lexicon, which is acquired according to the methodology proposed in (Castellucci et al., 2015a). This method leverages word embedding representations to assign polarity information to words by transferring it from sentences whose polarity is known. The resulting lexicons are called Distributional Polarity Lexicons (DPLs). The process is based on the capability of word embeddings to represent both sentences and words in the same space (Landauer and Dumais, 1997). First, sentences (here tweets) are labeled with some polarity classes: in (Castellucci et al., 2015a) this labeling is achieved by applying a Distant Supervision heuristic (Go et al., 2009). The labeled dataset is projected into the embedding space by applying a simple but effective linear combination of the word vectors composing each sentence. Then, a polarity classifier is trained over these sentences in order to emphasize those dimensions of the space more related to the polarity classes. The DPL is generated by classifying each word (represented in the embedding through a vector) with respect to each targeted class, using the confidence level of the classification to derive a word polarity signature. For example, in a DPL the word ottimo is 0.89 positive, 0.04 negative and 0.07 neutral (see Table 1). For more details, please refer to (Castellucci et al., 2015a).

This method has two main advantages: first, it allows deriving a signature for each word in the embedding to be used in the CNN; second, it allows assigning sentiment information to words by observing their usage. This represents an interesting setting to observe sentiment-related phenomena, as a word often does not carry a sentiment if not immersed in a context (i.e., a sentence). As proposed in (Croce et al., 2016), in order to keep the computational complexity of the CNN training phase limited, we augment each vector from the embedding with the polarity scores derived from the DPL (we normalize the embedding and the DPL vectors before the juxtaposition). In Table 1, the most similar words of some polarity carriers are compared when the polarity lexicon is not adopted (second column) and when the multi-channel schema is adopted (third column). Notice that the DPL positively affects the vector representations for Sentiment Analysis: for example, the word pessimo is no longer in the set of the 3 most similar words of the word ottimo. The polarity information captured in the DPL makes words that are semantically related and whose polarity agrees nearer in the space.

Term (pos, neg, neu)        | w/o DPL                          | w/ DPL
ottimo (0.89, 0.04, 0.07)   | pessimo, eccellente, ottima      | ottima, eccellente, fantastico
peggiore (0.17, 0.57, 0.26) | peggior, peggio, migliore        | peggior, peggio, peggiori
triste (0.04, 0.82, 0.14)   | deprimente, tristissima, felice  | deprimente, tristissima, depressa

Table 1: Most similar words in the embedding without (2nd column) and with (3rd column) the DPL; the DPL scores (positivity, negativity, neutrality) are reported next to each term.
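A minimal sketch of the juxtaposition step described above: each normalized embedding vector is extended with its normalized 3-dimensional DPL signature, yielding the multi-channel word representation fed to the CNN. The dictionary-based interface and the uniform fallback for words missing from the DPL are assumptions of this sketch, not part of the original method.

<pre>
import numpy as np

def l2_normalize(v, eps=1e-8):
    return v / (np.linalg.norm(v) + eps)

def augment_with_dpl(embedding, dpl):
    """Juxtapose each word embedding with its DPL polarity signature.

    embedding : dict word -> np.ndarray of shape (250,)  (Skip-gram vector)
    dpl       : dict word -> np.ndarray of shape (3,)    (pos, neg, neu scores)
    Returns a dict word -> np.ndarray of shape (253,).
    """
    augmented = {}
    for word, vec in embedding.items():
        # assumption: words missing from the DPL get a flat, uninformative signature
        signature = dpl.get(word, np.array([1 / 3, 1 / 3, 1 / 3]))
        # both channels are normalized before the juxtaposition
        augmented[word] = np.concatenate([l2_normalize(vec), l2_normalize(signature)])
    return augmented
</pre>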
2.2 Context-aware Model for SA in Twitter

In (Severyn and Moschitti, 2015) a pre-training strategy is suggested for the Sentiment Analysis task: the adoption of heuristically classified tweet messages is advised to initialize the network parameters. The selection of messages is based on the presence of emoticons (Go et al., 2009) that can be related to polarities, e.g. :) and :(. However, selecting messages only with emoticons could potentially introduce many topically unrelated messages that use out-of-domain linguistic expressions, limiting the contribution of the pre-training. We instead suggest another strategy for the selection of pre-training data. We draw on the work in (Vanzo et al., 2014), where topically related messages of the target domain are selected by considering the reply-to or hashtag contexts of each message. The former (conversational context) is made of the stream of messages belonging to the same conversation in Twitter, while the latter (hashtag context) is composed of tweets preceding a target message and sharing at least one hashtag with it. In (Vanzo et al., 2014), these messages are first classified through a context-unaware SVM classifier. Here, we leverage contextual information for the selection of pre-training material for the CNN: we select the messages in the conversation context and classify them with a context-unaware classifier to produce the pre-training dataset.

3 The Unitor Classification Workflow

The SENTIPOLC challenge is made of three sub-tasks aiming at investigating different aspects of the subjectivity of short messages. The first sub-task is Subjectivity Classification, which consists in deciding whether a message expresses subjectivity or is objective. The second task is Polarity Classification: given a subjective tweet, a system should decide whether the tweet expresses a neutral, positive, negative or conflict position. Finally, the Irony Detection sub-task aims at finding whether a message expresses ironic content or not. The Unitor system tackles each sub-task with a different CNN classifier, resulting in the classification workflow summarized in Algorithm 1: a message is first classified with the Subjectivity CNN-based classifier S; if the message is classified as subjective (subjective=True), it is also processed with the other two classifiers, the Polarity classifier P and the Irony classifier I. If the message is instead classified as objective (subjective=False), the remaining classifiers are not invoked.

Algorithm 1 Unitor classification workflow.
1: function TAG(tweet T, cnn S, cnn P, cnn I)
2:   subjective = S(T)
3:   if subjective == True then
4:     polarity = P(T), irony = I(T)
5:   else
6:     polarity = none, irony = none
7:   end if
8:   return subjective, polarity, irony
9: end function
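Algorithm 1 translates directly into a small cascade function. The sketch below assumes the three trained CNN classifiers are available as callables returning the labels described above; the function and variable names are illustrative.

<pre>
def tag(tweet, subjectivity_cnn, polarity_cnn, irony_cnn):
    """Cascade of Algorithm 1: the Polarity and Irony CNNs run only on subjective tweets."""
    subjective = subjectivity_cnn(tweet)        # True / False
    if subjective:
        polarity = polarity_cnn(tweet)          # neutral / positive / negative / conflict
        irony = irony_cnn(tweet)                # ironic / not-ironic
    else:
        polarity, irony = None, None
    return subjective, polarity, irony
</pre>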
The same CNN architecture is adopted to implement all three classifiers, and tweets are modeled in the same way for the three sub-tasks. Each classifier has been specialized to the corresponding sub-task by adopting different selection policies for the training material and by adapting the output layer of the CNN to the sub-task specific classes. In detail, the Subjectivity CNN is trained over the whole training dataset with respect to the classes subjective and objective. The Polarity CNN is trained over the subset of subjective tweets, with respect to the classes neutral, positive, negative and conflict. The Irony CNN is trained over the subset of subjective tweets, with respect to the classes ironic and not-ironic.

Each CNN classifier has been trained in the two settings specified in the SENTIPOLC guidelines: constrained and unconstrained. The constrained setting refers to a system that adopts only the provided training data: for example, in the constrained setting the use of a word embedding generated from other tweets is forbidden. The unconstrained systems, instead, can also adopt other tweets in the training stage. In our work, the constrained CNNs are trained without using a pre-computed word embedding in the input layer. In order to provide input data to the neural network, we randomly initialize the word embeddings, adding them to the parameters to be estimated in the training process: in the following, we will refer to the constrained classification workflow as Unitor. The unconstrained CNNs are instead initialized with the pre-computed word embedding and DPL; notice that in this setting we do not back-propagate over the input layer. The word embedding is obtained from a corpus of about 10 million tweets downloaded in July 2016. A 250-dimensional embedding is generated according to a Skip-gram model (Mikolov et al., 2013), with the following settings: window 5 and min-count 10 with hierarchical softmax. Starting from this corpus and the generated embedding, we acquired the DPL according to the methodology described in Section 2.1. The final embedding is obtained by juxtaposing the Skip-gram vectors and the DPL, resulting in a 253-dimensional representation for about 290,000 words, as shown in Figure 1 (measures adopting only the Skip-gram vectors were pursued in the classifier tuning stage; these highlighted the positive contribution of the DPL). The resulting classification workflow made of unconstrained classifiers is called Unitor-U1. Notice that these word representations constitute a richer feature set for the CNN, yet the cost of obtaining them is negligible, as no manual activity is needed.

As suggested in (Croce et al., 2016), the contextual pre-training (see Section 2.2) is obtained by considering the conversational contexts of the provided training data. This dataset is made of about 2,200 new messages, which have been classified with the Unitor-U1 system. This set of messages is adopted to initialize the network parameters. In the following, the system adopting the pre-trained CNNs is called Unitor-U2.
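The unconstrained word embedding described above (250-dimensional Skip-gram, window 5, min-count 10, hierarchical softmax) can be reproduced, for instance, with gensim. The paper does not state which toolkit was used, so the library, the tiny dummy corpus and the variable names below are illustrative assumptions.

<pre>
from gensim.models import Word2Vec  # gensim >= 4.0 assumed ("vector_size" was "size" in older versions)

# tokenized_tweets should be the ~10M tweet corpus, one token list per tweet;
# a tiny dummy corpus is used here only to make the sketch runnable
tokenized_tweets = [
    ["buona", "fortuna", "a", "tutti", "i", "ragazzi", "domani"],
    ["che", "giornata", "pessima", "oggi"],
] * 100

model = Word2Vec(
    sentences=tokenized_tweets,
    vector_size=250,   # 250-dimensional embedding
    window=5,          # window 5
    min_count=10,      # min-count 10
    sg=1,              # Skip-gram model
    hs=1, negative=0,  # hierarchical softmax
)
embedding = {w: model.wv[w] for w in model.wv.index_to_key}
</pre>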
The CNNs have a number of hyper-parameters that should be fine-tuned. The parameters we investigated are: the size of the filters, i.e., capturing 2/3/4/5-grams, combining multiple filter sizes in the same run; the number of filters for each size, selected among 50, 100 and 200; and the dropout keep probability, selected among 0.5, 0.8 and 1.0. The final parameters have been determined over a development dataset made of 20% of the training material. Other parameters have been kept fixed: batch size (100), learning rate (0.001), number of epochs (15) and L2 regularization (0.0). The CNNs are implemented in Tensorflow (https://www.tensorflow.org/) and have been optimized with the Adam optimizer.

4 Experimental Results

In Tables 2, 3 and 4 the performances of the Unitor systems are reported, respectively for the tasks of Subjectivity Classification, Polarity Classification and Irony Detection. In Table 2 the F-0 measure refers to the F1 measure of the objective class, while F-1 refers to the F1 measure of the subjective class. In Table 3 the F-0 measure refers to the F1 measure of the negative class, while F-1 refers to the F1 measure of the positive class; notice that in this case the neutral class is mapped to a "not negative" and "not positive" classification, the conflict class is mapped to a "negative" and "positive" classification, and the F-0 and F-1 measures capture these configurations as well. In Table 4 the F-0 measure refers to the F1 measure of the not-ironic class, while F-1 refers to the F1 measure of the ironic class. Finally, F-Mean is the mean between the F-0 and F-1 values, and is the score used by the organizers for producing the final ranks.
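A short sketch of the F-Mean score as defined above, i.e., the mean of the two per-class F1 measures. The use of scikit-learn and the simple label interface are assumptions; the official SENTIPOLC scorer derives the per-class F1 values from the task-specific annotation scheme, so this is only an approximation of the ranking measure.

<pre>
import numpy as np
from sklearn.metrics import f1_score

def sentipolc_f_mean(y_true, y_pred, class0, class1):
    """F-Mean as reported in the tables below: the mean of the two per-class F1 scores.

    class0 / class1 are the task-specific labels, e.g. "objective" / "subjective"
    for Subjectivity Classification.
    """
    per_class = f1_score(y_true, y_pred, labels=[class0, class1], average=None)
    return float(np.mean(per_class))
</pre>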
System    | F-0   | F-1   | F-Mean | Rank
Unitor-C  | .6733 | .7535 | .7134  | 4
Unitor-U1 | .6784 | .8105 | .7444  | 1
Unitor-U2 | .6723 | .7979 | .7351  | 2

Table 2: Subjectivity Classification results

Notice that our unconstrained system (Unitor-U1) is the best performing system in recognizing whether a message expresses a subjective position or not, with a final F-Mean of .7444 (Table 2). Moreover, the Unitor-U2 system is also capable of adequately classifying whether a message is subjective or not. The fact that the pre-trained system does not perform as well as Unitor-U1 can be ascribed to the pre-training material actually being small. During the classifier tuning phases we also adopted the hashtag contexts (about 20,000 messages) (Vanzo et al., 2014) to pre-train our networks: the measures over the development set indicated that the hashtag contexts were probably introducing too many unrelated messages. Moreover, the pre-training material has been classified with the Unitor-U1 system itself. It could be the case that the adoption of such added material was not very effective, contrary to what is demonstrated in (Croce et al., 2016). In fact, in that work the pre-training material was classified with a totally different algorithm (Support Vector Machine) and a totally different representation (kernel-based); the different algorithm and representation produced a better and substantially different dataset, in terms of covered linguistic phenomena and their relationships with the target classes. Finally, the constrained version of our system obtained a remarkable score of .7134, demonstrating that the random initialization of the input vectors can also be adopted for the classification of the subjectivity of a message.

System    | F-0   | F-1   | F-Mean | Rank
Unitor-C  | .6486 | .6279 | .6382  | 11
Unitor-U1 | .6885 | .6354 | .6620  | 2
Unitor-U2 | .6838 | .6312 | .6575  | 3

Table 3: Polarity Classification results

In Table 3 the Polarity Classification results are reported. Also in this task, the performances of the unconstrained systems are higher with respect to the constrained one (.6620 against .6382). This demonstrates the usefulness of acquiring lexical representations and using them as inputs for the CNNs. Notice that the performances of the Unitor classifiers are remarkable, as the two unconstrained systems rank in 2nd and 3rd position. The contribution of the pre-training is not positive, contrary to what is measured in (Croce et al., 2016). Again, we believe that the problem resides in the size and quality of the pre-training dataset.

System    | F-0   | F-1  | F-Mean | Rank
Unitor-C  | .9358 | .016 | .4761  | 10
Unitor-U1 | .9373 | .008 | .4728  | 11
Unitor-U2 | .9372 | .025 | .4810  | 9

Table 4: Irony Detection results

In Table 4 the Irony Detection results are reported. Our systems do not perform well, as all the submitted systems reported a very low recall for the ironic class: for example, the Unitor-U2 recall is only .0013, while its precision is .4286. This can be due mainly to two factors. First, the CNN devoted to the classification of the irony of a message has been trained with a dataset very skewed towards the not-ironic class: in the original dataset only 868 out of 7409 messages are ironic. Second, a CNN observes local features (bi-grams, tri-grams, ...) without ever considering global constraints. Irony is not a word-level phenomenon; instead, it is related to sentence-level or even social aspects. For example, the best performing system in Irony Detection at SENTIPOLC 2014 (Castellucci et al., 2014) adopted a specific feature that estimates the violation of the paradigmatic coherence of a word with respect to the entire sentence, i.e., a global piece of information about a tweet. This is not accounted for in the CNN discussed here, and ironic sub-phrases are likely to be neglected.
5 Conclusions

The results obtained by the Unitor system at SENTIPOLC 2016 are promising, as the system won the Subjectivity Classification sub-task and placed 2nd in the Polarity Classification. While the results in Irony Detection are not satisfactory, the proposed architecture is straightforward, as its setup cost is very low: the human effort in producing data for the CNNs, i.e., the pre-training material and the acquisition of the Distributional Polarity Lexicon, is very limited. In fact, the former can be easily acquired with the Twitter Developer API, while the latter is realized through an unsupervised process (Castellucci et al., 2015a). In the future, we need to better model the irony detection problem, as the CNN adopted here is probably not best suited for such a task: irony is a more global linguistic phenomenon than the ones captured by the (local) convolutions operated by a CNN.

References

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTiment POLarity Classification Task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Giuseppe Castellucci, Danilo Croce, Diego De Cao, and Roberto Basili. 2014. A multiple kernel approach for twitter sentiment analysis in Italian. In Fourth International Workshop EVALITA 2014.

Giuseppe Castellucci, Danilo Croce, and Roberto Basili. 2015a. Acquiring a large scale polarity lexicon through unsupervised distributional methods. In Proc. of 20th NLDB, volume 9103. Springer.

Giuseppe Castellucci, Andrea Vanzo, Danilo Croce, and Roberto Basili. 2015b. Context-aware models for twitter sentiment analysis. IJCoL vol. 1, n. 1: Emerging Topics at the 1st CLiC-it Conf., page 69.

Danilo Croce, Giuseppe Castellucci, and Roberto Basili. 2016. Injecting sentiment information in context-aware convolutional neural networks. Proceedings of SocialNLP@IJCAI, 2016.

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford.

Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP 2014, pages 1746–1751, Doha, Qatar, October. Association for Computational Linguistics.

Tom Landauer and Sue Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. of the IEEE, 86(11), Nov.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Aliaksei Severyn and Alessandro Moschitti. 2015. Twitter sentiment analysis with deep convolutional neural networks. In Proc. of SIGIR 2015, pages 959–962, New York, NY, USA. ACM.

Andrea Vanzo, Danilo Croce, and Roberto Basili. 2014. A context-based model for sentiment analysis in twitter. In Proc. of 25th COLING, pages 2345–2354.