Bidirectional Attentional LSTM for Aspect Based Sentiment Analysis on Italian

Giancarlo Nicola
University of Pavia
giancarlo.nicola01@universitadipavia.it

Abstract

English. This paper describes the SentITA system that participated in the ABSITA task proposed at Evalita 2018. The system is based on a Bidirectional Long Short Term Memory network with attention that exploits word embeddings and sentiment-specific polarity embeddings. The model also leverages grammatical information from POS tagging and NER tagging. The system participated in both the Aspect Category Detection (ACD) and Aspect Category Polarity (ACP) tasks, achieving 5th place in the ACD task and 2nd place in the ACP task.

Italiano. This paper describes the SentITA system evaluated in the ABSITA task proposed within Evalita 2018. The system is based on a recurrent neural network with Long Short Term Memory cells and an attention mechanism. The model exploits both general word embeddings and sentiment-specific polarity embeddings, and it also uses information derived from POS tagging and NER tagging. The system participated in both the Aspect Category Detection (ACD) and Aspect Category Polarity (ACP) challenges, placing fifth in the former and second in the latter.

1 Introduction

This paper describes the SentITA system that participated in the ABSITA task (Basile et al. 2018) proposed at Evalita 2018. In ABSITA the task consists in performing Aspect Based Sentiment Analysis (ABSA) on self-contained sentences scraped from the "booking.com" website. The aspects are related to accommodation reviews and cover topics such as cleanliness, comfort, location, etc. The task is divided into two subtasks: Aspect Category Detection (ACD) and Aspect Category Polarity (ACP). The first, ACD, consists in identifying the aspects mentioned in the sentence, while the second requires associating a sentiment polarity label with the aspects evoked in the sentence. Both tasks are addressed with the same architecture and the same data preprocessing. The system is based on a deep learning model, a Bidirectional Long Short Term Memory network with attention. The model exploits word embeddings and sentiment-specific polarity embeddings, and it also leverages grammatical information from POS tagging and NER tagging.

Recently, deep learning has emerged as a powerful machine learning technique achieving state-of-the-art results in many application domains, including sentiment analysis. Among the deep learning frameworks applied to sentiment analysis, many employ a combination of semantic vector representations (Mikolov et al. 2013), (Pennington et al. 2014) and different deep learning architectures. Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber 1997), (Socher et al. 2013), (Cho et al. 2014) have been applied to model complex and long-term non-local relationships in both word-level and character-level text sequences. Recursive Neural Tensor Networks (RNTN) have shown strong results for semantic compositionality (Socher et al. 2011), (Socher et al. 2013), and convolutional networks (CNN) for both sentiment analysis (Collobert et al. 2011) and sentence modelling (Kalchbrenner et al. 2014) have outperformed previous state-of-the-art methodologies. In most applications, all these methods receive as input a vector representation of words called word embeddings.
(Mikolov 2012), (Mikolov et al. 2013) and (Pennington et al. 2014), further expanding the work on word embeddings (Bengio et al. 2003), which builds on the idea of distributed representations for symbols (Hinton et al. 1986), introduced unsupervised learning methods to create dense multidimensional spaces in which words are represented by vectors. The position of such vectors is related to their semantic meaning and grammatical properties, and they are widely used in all of modern NLP. In fact, they allow for a dimensionality reduction compared to traditional sparse vector space models, and they are often used as pre-trained initialization for the first embedding layers of neural networks in NLP tasks. In (Le and Mikolov 2014), expanding the previous work on word embeddings, a model is developed that can also represent sentences in a dense multidimensional space. In this case too, sentences are represented by vectors whose position is related to the semantic content of the sentence, with similar sentences represented by vectors that are close to each other.

When working with isolated and short sentences, often written in a specific style, like tweets or phrases extracted from internet reviews, many long-term text dependencies are lost and cannot be exploited. In this situation it is important that the model learns both to pay attention to specific words that play key roles in determining the sentence polarity, such as negations, magnifiers and adjectives, and to model the discourse, but with less focus on long-term dependencies (due to the text brevity). For this reason, deep learning word-embedding-based models augmented with task-specific gazettes (dictionaries) and features represent a solid baseline when working with these kinds of datasets (Nakov et al. 2016), (Attardi et al. 2016), (Castellucci et al. 2016), (Cimino et al. 2016), (Deriu et al. 2016). In this system, a polarity dictionary for Italian has been included as a feature of the model in addition to the word embeddings. Moreover, during preprocessing every sentence is augmented with its NER tags and POS tags, which are then used as features in the model. Thanks to the inclusion of these task-relevant features, in combination with word embeddings and an attentional bidirectional LSTM recurrent neural network, the model achieves useful results with a few thousand labelled examples.

The remainder of the paper presents the model and the experiments on the ABSITA task. In Section 2 the model and its features are explained; in Section 3 the model training and its performance are discussed; in Section 4 a conclusion with the next improvements of the model is given.

2 Description of the system

The model implemented is an Attentional Bidirectional Recurrent Neural Network with LSTM cells. It operates at word level, and therefore each sentence is represented as a sequence of word representations that are sequentially fed to the model one after another until the sequence has been entirely used up. One sentence sequence coupled with its polarity scores represents a single datapoint for the model.

The input to the model are sentences, represented as sequences of word representations. The maximum sequence length has been set to 35, with shorter sentences left-padded to this length and longer sentences cut to this length. Each word of the sequence is represented by five vectors corresponding to 5 different features: a high dimensional word embedding, the word polarity, the word NER tag, the word POS tag and a custom low dimensional word embedding. The high dimensional word embeddings are the pretrained Fastext embeddings for Italian (Grave et al. 2018). They are 300-dimensional vectors obtained using the skip-gram model described in (Bojanowski et al. 2016) with default parameters. The word polarity is obtained from the OpeNER Sentiment Lexicon Italian (Russo et al. 2016). This freely available Italian sentiment lexicon contains a total of 24,293 lexical entries annotated for positive/negative/neutral polarity. It was semi-automatically developed using a propagation algorithm starting from a list of seed key-words and manually reviewing the most frequent entries.
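As an illustration of this input encoding, the sketch below builds the five per-word feature channels for one tokenized sentence; the lookup tables (word2id, custom2id, polarity_lexicon, ner2id, pos2id) and the padding and OOV conventions are assumptions made for the example, not the exact SentITA preprocessing code.

```python
import numpy as np

MAX_LEN = 35  # maximum sequence length used by the system

def encode_sentence(tokens, ner_tags, pos_tags,
                    word2id, custom2id, polarity_lexicon, ner2id, pos2id,
                    pad_id=0, oov_id=1):
    """Encode one tokenized sentence into the five feature channels
    (Fastext word ids, polarity+confidence, NER tag ids, POS tag ids, custom word ids),
    left-padded or truncated to MAX_LEN."""
    n = min(len(tokens), MAX_LEN)
    word_ids = np.full(MAX_LEN, pad_id, dtype=np.int32)
    custom_ids = np.full(MAX_LEN, pad_id, dtype=np.int32)
    ner_ids = np.full(MAX_LEN, pad_id, dtype=np.int32)
    pos_ids = np.full(MAX_LEN, pad_id, dtype=np.int32)
    polarity = np.zeros((MAX_LEN, 2), dtype=np.float32)  # (polarity score, confidence)

    offset = MAX_LEN - n  # left padding: real tokens occupy the last n positions
    for i in range(n):
        tok = tokens[i].lower()
        word_ids[offset + i] = word2id.get(tok, oov_id)      # Fastext vocabulary id
        custom_ids[offset + i] = custom2id.get(tok, oov_id)  # custom low-dim vocabulary id
        ner_ids[offset + i] = ner2id.get(ner_tags[i], oov_id)
        pos_ids[offset + i] = pos2id.get(pos_tags[i], oov_id)
        polarity[offset + i] = polarity_lexicon.get(tok, (0.0, 0.0))
    return word_ids, polarity, ner_ids, pos_ids, custom_ids
```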
Both the NER tags and the POS tags are obtained from the Spacy library Tagger model for Italian (Spacy 2.0.11 - https://spacy.io/). The custom low dimensional word embeddings are generated by random initialization and are added to provide an embedding representation for the words that are missing from the Fastext embeddings, which would otherwise all be represented by the same out-of-vocabulary (OOV) token. In general, it would be possible to train and fine-tune these custom embeddings on specific datasets to let the model learn the usage of words in specific cases.

The information extracted from the OpeNER Sentiment Lexicon Italian is the word polarity together with its confidence; these are concatenated in a vector of length 2 that is one of the inputs to the first layer of the network. The NER tags and POS tags are instead mapped to randomly initialized embeddings of dimensionality 2 and 4, respectively, that were not trained during model training for the task submission. With more data available it would probably be beneficial to train all the NER, POS and custom embeddings, but for this specific dataset the results were comparable and slightly better when the embeddings were not trained.

Figure 1: Model architecture

The model, whose architecture is schematized in Fig. 1, performs in its initial layer a dimensionality reduction on the Fastext embeddings and then concatenates them with the rest of the embeddings (polarity, NER tag, POS tag, and custom word embeddings) at each timestep (word) of the sequence. The concatenation of the embeddings is fed into a sequence of two bidirectional recurrent layers with LSTM cells. The result of these recurrent layers is passed to the attention mechanism presented in (Raffel et al. 2016) and finally to the dense layers that output the aspect detection and aspect polarity signals.
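A minimal Keras-style sketch of this architecture is given below. The hidden layer sizes (other than the 300-dimensional Fastext input and the 2- and 4-dimensional NER/POS embeddings), the vocabulary sizes and the number of output labels are assumptions, since the paper does not report them, and the attention layer is a generic feed-forward attention in the spirit of (Raffel et al. 2016) rather than the exact SentITA implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN = 35
VOCAB, CUSTOM_VOCAB, N_NER, N_POS = 150_000, 20_000, 20, 20  # assumed vocabulary sizes
N_LABELS = 8                                                  # assumed number of output signals

# One input per feature channel.
words  = layers.Input((MAX_LEN,), dtype="int32", name="fastext_ids")
polar  = layers.Input((MAX_LEN, 2), name="polarity")          # polarity score + confidence
ner    = layers.Input((MAX_LEN,), dtype="int32", name="ner_ids")
pos    = layers.Input((MAX_LEN,), dtype="int32", name="pos_ids")
custom = layers.Input((MAX_LEN,), dtype="int32", name="custom_ids")

# Frozen pretrained Fastext embeddings followed by a dimensionality reduction.
w = layers.Embedding(VOCAB, 300, trainable=False, name="fastext")(words)
w = layers.Dense(64, activation="relu")(w)   # dimensionality reduction (size assumed)
w = layers.Dropout(0.5)(w)

n = layers.Embedding(N_NER, 2, trainable=False)(ner)              # NER tag embeddings (dim 2)
p = layers.Embedding(N_POS, 4, trainable=False)(pos)              # POS tag embeddings (dim 4)
c = layers.Embedding(CUSTOM_VOCAB, 16, trainable=False)(custom)   # custom low-dim embeddings

x = layers.Concatenate()([w, polar, n, p, c])

# Two bidirectional LSTM layers; dropout 0.5 between timesteps and 0.3 on the outputs.
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True, recurrent_dropout=0.5))(x)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True, recurrent_dropout=0.5))(x)
x = layers.Dropout(0.3)(x)

# Feed-forward attention: an adaptive weighted average of the RNN states h.
scores = layers.Dense(1, activation="tanh")(x)   # (batch, MAX_LEN, 1)
alphas = layers.Softmax(axis=1)(scores)          # attention weights over time
context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, alphas])

h = layers.Dense(64, activation="relu")(context)
out = layers.Dense(N_LABELS, activation="sigmoid")(h)  # aspect detection / polarity signals

model = Model([words, polar, ner, pos, custom], out)
```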
The attention mechanism in this formulation produces a fixed-length embedding of the input sequence by computing an adaptive weighted average of the sequence of states (normally denoted as "h") of the RNN. This form of integration is similar to the "global temporal pooling" described in (Sander 2014), which is based on the "global average pooling" technique of (Min et al. 2014). The non-linear activations used in the model are Rectified Linear Units (ReLU) for the internal dense layers, hyperbolic tangent (tanh) in the recurrent layers and sigmoid in the output dense layer. In order to counter overfitting, dropout has been used after the Fastext embedding dimensionality reduction with rate 0.5, in both recurrent layers between each timestep with rate 0.5, and on the output of the recurrent layers with rate 0.3.

The model has 61,368 trainable parameters and a total of 45,233,366 parameters, the majority of them representing the Fastext embedding matrix (45,000,300). Compared to many NLP models used today, the number of trainable parameters is quite small, to reduce the possibility of overfitting the training dataset (6,337 examples is small compared to many English sentiment datasets) and also because it is compensated by the addition of engineered features, like the polarity dictionary, NER tags and POS tags, that help in classifying the examples.

3 Training and results

The only preprocessing applied to the text is the conversion of each character to its lower case form. The vocabulary of the model is then limited to the first 150,000 words of the Fastext embeddings through a cap on the maximum number of embeddings, due to memory constraints of the GPU used for training the model. The Fastext embeddings are sorted by descending frequency of appearance in their training corpus, so the vocabulary comprises the 150,000 most frequent words in Italian. The other words that fall outside this cut are represented in the model's high dimensional embeddings (the Fastext embeddings) by an out-of-vocabulary token. However, all the training set words are included in the custom low dimensional word embeddings; this is done because both our training text and user text in general (especially reviews, tweets and social network platforms) can be quite different from the corpus on which the Fastext embeddings were trained. In addition, the NER-tagging and POS-tagging models for Italian included in the Spacy library are applied to the text to compute the additional NER-tag and POS-tag features for each word.

To train the model and generate the challenge submission, a k-fold cross-validation strategy has been applied. The dataset has been divided into 5 folds and 5 different instantiations of the same model (with the same architecture) have been trained, picking each time a different fold as validation set (20%) and the remaining 4 folds as training set (80%). The number of training epochs is determined with the early stopping technique with a patience parameter equal to 7. Once the training epochs are completed, the model snapshot that achieved the best validation loss is loaded. At the end, the predictions from the 5 models are averaged together and thresholded at 0.5. Training five different instantiations of the same model and averaging their predictions compensates for the fact that, in each fold, the model selection based on the best validation loss is biased towards the validation fold itself.
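The sketch below outlines this 5-fold training and prediction-averaging scheme. It assumes a build_model() factory returning the network sketched above and in-memory arrays X (the five input channels) and y (the label matrix); the EarlyStopping callback with patience 7 and restore_best_weights plays the role of reloading the best-validation-loss snapshot, while the batch size and epoch cap are assumptions and the Nadam settings follow the values reported below.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold

def train_ensemble(build_model, X, y, n_folds=5, patience=7):
    """Train one model instantiation per fold, as described in the paper."""
    models = []
    for train_idx, val_idx in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(y):
        model = build_model()
        # The paper also reports schedule decay = 0.004, a parameter of the older
        # standalone Keras Nadam that is not exposed by tf.keras Nadam.
        model.compile(
            optimizer=tf.keras.optimizers.Nadam(learning_rate=0.002, beta_1=0.9, beta_2=0.999),
            loss="binary_crossentropy",
        )
        model.fit(
            [x[train_idx] for x in X], y[train_idx],
            validation_data=([x[val_idx] for x in X], y[val_idx]),
            epochs=100, batch_size=32,
            callbacks=[tf.keras.callbacks.EarlyStopping(
                monitor="val_loss", patience=patience, restore_best_weights=True)],
        )
        models.append(model)
    return models

def predict_ensemble(models, X_test, threshold=0.5):
    """Average the sigmoid outputs of the five models and threshold at 0.5."""
    probs = np.mean([m.predict(X_test) for m in models], axis=0)
    return (probs >= threshold).astype(int)
```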
Each of the five models is trained by minimizing the crossentropy loss on the different classes with the Nesterov Adam (Nadam) optimizer (Dozat 2015) with default parameters (λ = 0.002, β1 = 0.9, β2 = 0.999, schedule decay = 0.004). The Nesterov Adam optimizer is similar to the Adam optimizer (Kingma et al. 2014), but momentum is replaced with Nesterov momentum (Nesterov 1983). Adam in fact combines two algorithms known to work well for different reasons: momentum, which points the model in a better direction, and RMSProp, which adapts how far the model goes in that direction on a per-parameter basis. However, Nesterov momentum, which can be viewed as a simple modification of the former, increases stability and can sometimes provide a distinct improvement in performance, superior to plain momentum (Sutskever et al. 2013). For this reason the two approaches are combined in the Nadam optimizer.

This system obtained the 5th place in the ACD task and the 2nd place in the ACP task, as reported respectively in Table 1 and Table 2. In these tables the performances of the systems participating in the challenge have been ranked by F1-score by the task organizers. In particular, the second place in the ACP task is interesting, since the model is oriented more towards polarity classification, for which it has specific dictionaries, than towards aspect detection. This is confirmed also by the high precision score obtained by the model in the ACP task, the 2nd highest among the participating systems.

Ranking   Micro Precision   Micro Recall   Micro F1-score
1         0.8397            0.7837         0.8108
2         0.8713            0.7504         0.8063
3         0.8697            0.7481         0.8043
4         0.8626            0.7519         0.8035
---------------------------------------------------------
5         0.8819            0.7378         0.8035
---------------------------------------------------------
6         0.898             0.6937         0.7827
7         0.8658            0.697          0.7723
8         0.7902            0.7181         0.7524
9         0.6232            0.6093         0.6162
10        0.6164            0.6134         0.6149
11        0.5443            0.5418         0.5431
12        0.6213            0.433          0.5104
baseline  0.4111            0.2866         0.3377

Table 1: Task ACD (Aspect Category Detection) ranking. This system's score is reported between dashed lines.

Ranking   Micro Precision   Micro Recall   Micro F1-score
1         0.8264            0.7161         0.7673
---------------------------------------------------------
2         0.8612            0.6562         0.7449
---------------------------------------------------------
3         0.7472            0.7186         0.7326
4         0.7387            0.7206         0.7295
5         0.8735            0.5649         0.6861
6         0.6869            0.5409         0.6052
7         0.4123            0.3125         0.3555
8         0.5452            0.2511         0.3439
baseline  0.2451            0.1681         0.1994

Table 2: Task ACP (Aspect Category Polarity) ranking. This system's score is reported between dashed lines.

4 Discussion

The results obtained by the SentITA system at ABSITA 2018 are promising, as the system placed 2nd in the ACP task and 5th in the ACD task, but not very far from the 1st in terms of F1-score. The model in general shows a high precision but a lower recall compared to the other systems. The proposed architecture makes use of different features that are easy to obtain through other models, like POS and NER tags, polarity and word embeddings; for this reason, the human effort in the data preprocessing is very limited. One important direction to further improve the model would be to rely more on unsupervised learning, which at the moment is used only for the word embeddings. It could be possible, for example, to integrate in the model features based on language models or encoder-decoder networks. More unsupervised learning would better ensure the generalization of the model to cover most of the topical and lexical content of the Italian language, thanks to the large quantity of text available, and thus also improve the model recall.

References

Giuseppe Attardi, Daniele Sartiano, Chiara Alzetta, Federica Semplici. 2016. Convolutional Neural Networks for Sentiment Analysis on Italian Tweets. CLiC-it/EVALITA (2016).

Pierpaolo Basile, Valerio Basile, Danilo Croce and Marco Polignano. 2018. Overview of the EVALITA 2018 Aspect-based Sentiment Analysis task (ABSITA). Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), CEUR.org, Turin.

Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.

P. Bojanowski, E. Grave, A. Joulin, T. Mikolov. 2016. Enriching Word Vectors with Subword Information. arXiv:1607.04606v2.

Giuseppe Castellucci, Danilo Croce, Roberto Basili. 2016. Context-aware Convolutional Neural Networks for Twitter Sentiment Analysis in Italian. CLiC-it/EVALITA (2016).
K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.

Andrea Cimino, Felice Dell'Orletta. 2016. Tandem LSTM-SVM Approach for Sentiment Analysis. In Castellucci, Giuseppe et al., CLiC-it/EVALITA (2016).

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493–2537.

Jan Deriu, Mark Cieliebak. 2016. Sentiment Detection using Convolutional Neural Networks with Multi-Task Training and Distant Supervision. CLiC-it/EVALITA (2016).

Timothy Dozat. 2015. Incorporating Nesterov Momentum into Adam. http://cs229.stanford.edu/proj2015/054_report.pdf

E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov. 2018. Learning Word Vectors for 157 Languages. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

G. E. Hinton, J. L. McClelland, and D. E. Rumelhart. 1986. Distributed representations. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, MIT Press, Cambridge, MA, pp. 77–109.

S. Hochreiter, J. Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

N. Kalchbrenner, E. Grefenstette, P. Blunsom. 2014. A Convolutional Neural Network for Modelling Sentences. In Proceedings of ACL 2014.

Kingma, Diederik and Ba, Jimmy. 2014. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations. https://arxiv.org/pdf/1412.6980.pdf

Q. Le, T. Mikolov. 2014. Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning, Beijing, China. JMLR: W&CP, volume 32.

T. Mikolov. 2012. Statistical Language Models Based on Neural Networks. PhD thesis, Brno University of Technology.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of Workshop at International Conference on Learning Representations, 2013.

Min Lin, Qiang Chen, and Shuicheng Yan. 2014. Network in network. arXiv preprint arXiv:1312.4400.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, Veselin Stoyanov. 2016. SemEval-2016 Task 4: Sentiment Analysis in Twitter. Proceedings of SemEval-2016, pages 1–18, San Diego, California, June 16-17, 2016.

Y. Nesterov. 1983. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, volume 27, pages 372–376.

J. Pennington, R. Socher, and C. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October. Association for Computational Linguistics.

Colin Raffel, Daniel P. W. Ellis. 2016. Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems. https://arxiv.org/abs/1512.08756

Russo, Irene; Frontini, Francesca and Quochi, Valeria. 2016. OpeNER Sentiment Lexicon Italian - LMF, ILC-CNR for CLARIN-IT repository hosted at Institute for Computational Linguistics "A. Zampolli", National Research Council, Pisa. http://hdl.handle.net/20.500.11752/ILC-73.

Sander Dieleman. 2014. Recommending music on Spotify with deep learning. http://benanne.github.io/2014/08/05/spotify-cnns.html

R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and Christopher D. Manning. 2011. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP).

R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Stroudsburg, PA, October. Association for Computational Linguistics.
Ilya Sutskever, James Martens, George Dahl, Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. Proceedings of the 30th International Conference on Machine Learning, PMLR 28(3):1139–1147.