Aspect-based Sentiment Analysis: X2Check at ABSITA 2018

Emanuele Di Rosa (Chief Technology Officer, App2Check s.r.l., emanuele.dirosa@app2check.com)
Alberto Durante (Research Scientist, App2Check s.r.l., alberto.durante@app2check.com)

Abstract

In this paper we describe and present the results of the two systems, called here X2C-A and X2C-B, that we specifically developed and submitted for our participation in ABSITA 2018, for the Aspect Category Detection (ACD) and Aspect Category Polarity (ACP) tasks. The results show that X2C-A is a top-ranking system in the official results of the ACD task, at a distance of just 0.0073 from the best system; moreover, its improved post-deadline version, called X2C-A-s, scores first in the official ACD results. Regarding the ACP results, our X2C-A-s system, which takes advantage of our ready-to-use industrial Sentiment API, scores at a distance of just 0.0577 from the best system, even though it has not been specifically trained on the training set of the evaluation.

1 Introduction

The traditional task of sentiment analysis is the classification of a sentence according to the positive, negative, or neutral classes. However, this simple formulation is not enough when a sentence contains mixed sentiment, in which a positive sentiment refers to one aspect and a negative sentiment to another. Aspect-based sentiment analysis focuses on the sentiment classification (negative, neutral, positive) for a given aspect/category in a sentence. Nowadays, reviews have become an important tool widely used by consumers to evaluate services and products. Given the large amount of reviews available online, systems that automatically classify reviews according to different categories, and assign a sentiment to each of those categories, are gaining more and more interest in the market. The former task is called Aspect Category Detection (ACD), since it detects whether a review speaks about one of the categories under evaluation; the latter, called Aspect Category Polarity (ACP), tries to assign a sentiment independently for each aspect. In this paper, we present X2C-A and X2C-B, two different implementations for dealing with the ACD and ACP tasks, specifically developed for the ABSITA evaluation (Basile et al., 2018). In particular, we describe the models used to participate in the ACD competition together with some post-deadline results, in which we had the opportunity to improve our ACD results and to evaluate our systems on the ACP task as well. The results show that our X2C-A system is top ranking in the official ACD competition and scores first in its X2C-A-s version. Moreover, by testing our ACD models on the ACP task, with the help of our standard X2Check sentiment API, the X2C-A-s system scores fifth at a distance of just 0.057 from the best system, even though the other systems have a sentiment classifier specifically trained on the training set of the competition.

This paper is structured as follows: after the introduction we present the descriptions of our two systems submitted to ABSITA and the results on the development set; then we show and discuss the results on the official test set of the competition for both ACD and ACP; finally, we provide our conclusions.

2 Systems description

The official training dataset has been split into our internal training set (80% of the documents) and development set (the remaining 20%). We randomly sampled the examples for each category, thus obtaining different sets for training and test, keeping the per-category distribution of the samples across the sets. We submitted two runs, as the results of the two different systems we developed for each category, called X2C-A and X2C-B. The former has been developed on top of the Scikit-learn library in the Python language (Pedregosa et al., 2011), and the latter on top of the WEKA library (Frank et al., 2016) in the Java language. In both cases, the input text has been cleaned with a typical NLP pipeline, involving punctuation, number and stopword removal. The two systems have been developed separately, but the best algorithms obtained by both model selections are different implementations of Support Vector Machines. More details follow in the next sections.

2.1 X2C-A

The X2C-A system has been created by applying an NLP pipeline including a vectorization of the collection of reviews to a matrix of token counts of the bi-grams; then, the count matrix has been transformed to a normalized tf-idf representation (term frequency times inverse document frequency). As the machine learning algorithm, an implementation of the Support Vector Machine has been used, specifically the LinearSVC. This algorithm has been selected as the best performer on this dataset compared to other common implementations available in the sklearn library. Table 1 shows the F1 score on the positive label on the development set for each category; the average value over all of the categories is 84.92%. X2C-A shows the lowest performance on the Value category, while it shows the best performance on Location, and high scores on Wifi and Staff.

2.2 X2C-B

In the model selection process, the two best algorithms have been Naive Bayes and SMO. We built a model with both algorithms for each category, and took into account the F1 score on the positive labels in order to select the best algorithm. In this implementation, SMO (Sequential Minimal Optimization) (Platt, 1998; Keerthi et al., 2001; Hastie et al., 1998) has been the best-performing algorithm on all of the categories, and showed an average F1 score across all categories of 85.08%. Its scores are reported in Table 1, where we also compare its performance with the X2C-A one on the development set.

The two systems are built on different implementations of support vector machines, as previously pointed out, and differ in the feature extraction process. In fact, X2C-B takes into account a vocabulary of the 1000 most frequent words in the training set, according to the size limit parameter available in the StringToWordVector WEKA filter. Moreover, it uses unigrams instead of the bi-gram extraction performed in X2C-A. The two systems reach similar results, i.e. high scores on Location, Wifi and Staff, and low scores on the Value category. The overall weighted performance is very close, around 85% F1 on the positive labels, and since X2C-A is better for some categories and X2C-B for others, we decided to submit both implementations, in order to understand which is the best one on the test set of the ABSITA evaluation.

Category     X2C-A    X2C-B
Cleanliness  0.8675   0.8882
Comfort      0.8017   0.7995
Amenities    0.8041   0.7896
Staff        0.8917   0.8978
Value        0.7561   0.7333
Wifi         0.9056   0.9412
Location     0.9179   0.9058

Table 1: F1 score per category on the positive labels on the development set. Best system in bold.
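The X2C-A pipeline of Section 2.1 (bi-gram token counts, tf-idf weighting, a LinearSVC per category) can be sketched in scikit-learn roughly as follows. The helper name and the toy reviews/labels are ours for illustration, not ABSITA data, and we assume pure bi-grams since the text contrasts them with X2C-B's unigrams.

```python
# Sketch of an X2C-A-style per-category classifier: bi-gram token counts,
# a normalized tf-idf transformation, and a linear SVM, as in Section 2.1.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def make_category_model():
    return Pipeline([
        ("counts", CountVectorizer(ngram_range=(2, 2))),  # matrix of bi-gram token counts
        ("tfidf", TfidfTransformer()),                    # normalized tf-idf representation
        ("svm", LinearSVC()),                             # linear SVM classifier
    ])

# One binary model per category; here a toy "Staff" detector (labels are ours).
reviews = [
    "the staff was very kind and helpful",
    "the staff was rude at check in",
    "the room was clean and the bed comfortable",
    "breakfast was poor and the room dirty",
]
staff_labels = [1, 1, 0, 0]  # 1 = review mentions the Staff category

staff_model = make_category_model()
staff_model.fit(reviews, staff_labels)
print(staff_model.predict(["the staff was kind"]))
```

An X2C-B-like variant would instead cap the vocabulary at the most frequent words (e.g. `max_features=1000` in CountVectorizer, mirroring WEKA's size limit) and use unigrams.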
3 Results on the ABSITA testset

3.1 Aspect Category Detection

Table 2 shows the official results of the Aspect Category Detection task, with the addition of two post-deadline results obtained by additional versions of X2C-A and X2C-B, called X2C-A-s and X2C-B-s.

Team      Mic-Pr   Mic-Re   Mic-F1
X2C-A-s   0.8278   0.8014   0.8144
1         0.8397   0.7837   0.8108
2         0.8713   0.7504   0.8063
3         0.8697   0.7481   0.8043
X2C-A     0.8626   0.7519   0.8035
5         0.8819   0.7378   0.8035
X2C-B     0.8980   0.6937   0.7827
X2C-B-s   0.8954   0.6855   0.7765
7         0.8658   0.6970   0.7723
8         0.7902   0.7181   0.7524
9         0.6232   0.6093   0.6162
10        0.6164   0.6134   0.6149
11        0.5443   0.5418   0.5431
12        0.6213   0.4330   0.5104
baseline  0.4111   0.2866   0.3377

Table 2: ACD results.

The difference between the submitted results and the versions called X2C-A-s and X2C-B-s lies only at prediction time: X2C-A and X2C-B make a prediction at document level, i.e. on the whole review, while X2C-A-s and X2C-B-s make a prediction at sentence level, where each sentence has been obtained by splitting the reviews on some punctuation and key conjunction words. This makes it more likely that each sentence contains a single category, which seems to make category detection easier for the models. For example, the review

    The sight is beautiful, but the staff is rude

is about Location and Staff, but since only a part of it is about Location, the model of this category would receive a document containing "noise" from its point of view. In the post-deadline runs, we reduce the "noise" by splitting this example review into "The sight is beautiful", which is only about Location, and "but the staff is rude", which is only about Staff. As we can see in Table 2, the performance of X2C-A increased significantly and reached a score that is better even than that of the first classified system. However, the performance of X2C-B slightly decreased in its X2C-B-s version. This means that the model of this latter system is not helped by this kind of "noise" removal technique. This last result shows that such an approach does not have general applicability but depends on the model; however, it works very well for X2C-A.

In order to identify the categories where we perform better, we calculated the score of our systems on each category¹, as shown in Table 3 and Table 4. In Table 3, X2C-A is the best of our systems on all the categories except Cleanliness and Wifi, where X2C-B reached the higher score. In Table 4, X2C-A-s shows the best performance on all of the categories. By comparing the results across Tables 3 and 4, we can see that X2C-A-s is the best system on all of the categories, with the exception of Cleanliness, where X2C-B shows a slightly better performance. Comparing the results on the development set (Table 1) with those on the ABSITA test set, Value is confirmed to be the most difficult category for our systems to detect, with a score of 0.6168. Instead, concerning Wifi, which was the easiest category in Table 1, Table 4 shows a lower relative score, while the easiest category to detect overall was Location, on which X2C-A-s reached a score of 0.8898.

¹ To obtain these scores, we modified the ABSITA evaluation script so that only one category is taken into account.

Category     X2C-A    X2C-B
Cleanliness  0.8357   0.8459
Comfort      0.7940   0.7475
Amenities    0.8156   0.7934
Staff        0.8751   0.8681
Value        0.6146   0.6141
Wifi         0.8403   0.8667
Location     0.8887   0.8681

Table 3: X2Check per-category results submitted to ACD.
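The sentence-splitting step used by the "-s" runs in Section 3.1 can be sketched as follows. The paper only says reviews are split on "some punctuation and key conjunction words"; the concrete pattern and conjunction list below are our guess for illustration, chosen so that the conjunction stays with the following segment, as in the paper's example.

```python
# Sketch of the sentence-splitting used by the "-s" variants (assumed pattern).
import re

# Split on sentence punctuation, or right before an adversative conjunction
# (the conjunction is kept with the following segment).
_SPLITTER = re.compile(r"[.;!?]+|,?\s+(?=(?:but|however|ma|però)\b)", re.IGNORECASE)

def split_review(review):
    """Return the sentence-like segments of a review, dropping empty parts."""
    return [part.strip() for part in _SPLITTER.split(review) if part.strip()]

print(split_review("The sight is beautiful, but the staff is rude"))
# -> ['The sight is beautiful', 'but the staff is rude']
```

Each segment is then fed independently to the per-category classifiers, so a mixed review contributes less "noise" to each category model.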
Category     X2C-A-s  X2C-B-s
Cleanliness  0.8445   0.8411
Comfort      0.8099   0.7390
Amenities    0.8308   0.7884
Staff        0.8833   0.8652
Value        0.6168   0.6089
Wifi         0.8702   0.8667
Location     0.8898   0.8584

Table 4: X2Check per-category results submitted post deadline to ACD.

3.2 Aspect Category Polarity

In Table 5 we show the results of the Aspect Category Polarity task, in which X2Check did not formally participate; in fact, we only had time to work on the ACP task after the evaluation deadline. In order to deal with it, we decided to take advantage of our ready-to-use, standard X2Check sentiment API (Di Rosa and Durante, 2017). Since we have an industrial perspective, we realized that in a real-world setting, training an aspect-based sentiment system on a specific training set carries a high effort and cannot have general-purpose applicability. A very common case is one in which new categories to predict have to be quickly added to the system; in this setting, a high-effort activity of labeling examples for the training set would be required. Moreover, labeling a review according to the aspects mentioned, and additionally assigning a sentiment to each aspect, requires higher human effort than just labeling the category. For this reason, we decided not to train a sentiment predictor specialized on the categories/aspects given in the evaluation. Thus, we performed an experimental evaluation in which, after the prediction of the category in the review, our standard X2Check sentiment API was called to predict the sentiment. Since we are aware that a review may, in general, speak about multiple aspects with different associated sentiments, we decided to apply the X2C-A-s and X2C-B-s versions, which use the splitting method described in Section 3.1. More specifically:

1. each review document has been split into sentences;

2. both the X2Check sentiment API and the X2C-A/X2C-B category classifiers were run on each sentence: the former gives as output the polarity of each sentence (our assumption is that each portion of the review has a high probability of carrying just one sentiment), while the latter gives as output all of the categories detected in each sentence;

3. the overall result for a review is given by the collection of all of the category-sentiment pairs found in its sentences.

The results shown in Table 5 confirm that our assumption is valid. In fact, despite using a single sentiment model for all of the categories, we reach fifth place in the official ranking with our X2C-A-s system, at a distance of just 0.057 from the best system, which was specifically trained on the competition training set. Furthermore, the ACP performance depends on the ACD results: the former task cannot reach a performance higher than the latter. For this reason, we decided to evaluate the sentiment performance reached on the reviews whose categories have been correctly predicted. Thus, we created a score capturing the relationship between the two results: it is the ratio between the micro F1 score obtained in the ACP task and the one obtained in the ACD task. This hand-crafted score shows the quality of the sentiment model, removing the influence of the performance on the ACD task. The overall sentiment score obtained is 88.0% for X2C-B and 87.1% for X2C-A, showing that even though no task-specific training has been performed, the general-purpose X2Check sentiment API achieves very good results (recall that, according to (Wilson et al., 2009), humans agree on the sentiment classification in 82% of cases).

Team      Mic-Pr   Mic-Re   Mic-F1
1         0.8264   0.7161   0.7673
2         0.8612   0.6562   0.7449
3         0.7472   0.7186   0.7326
4         0.7387   0.7206   0.7295
X2C-A-s   0.7175   0.7019   0.7096
5         0.8735   0.5649   0.6861
X2C-B-s   0.7888   0.6025   0.6832
6         0.6869   0.5409   0.6052
7         0.4123   0.3125   0.3555
8         0.5452   0.2511   0.3439
baseline  0.2451   0.1681   0.1994

Table 5: ACP results.

Tables 6 and 7 show, for each category, the micro-F1 of the ACP task (calculated per category as in Table 4) and the sentiment score (SS), i.e. the ratio between the per-category ACP and ACD scores.
The latter gives as output all of the see that the sentiment model has reached a very detected categories in each sentence good performance on Cleanliness, Comfort, Staff and Location since it is close or over the 90%. 3. the overall result of a review is given by However, like noticed for the ACD results, it is difficult to handle reviews about the Value cate- 2018 Aspect-based Sentiment Analysis task (AB- gory. SITA) in Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Pro- Micro-F1 SS cessing and Speech tools for Italian (EVALITA’18), Cleanliness 0.7739 91.6% CEUR.org, Turin Comfort 0.7165 88.5% Eibe Frank, Mark A. Hall, and Ian H. Witten. 2016. Amenities 0.6618 79.7% The WEKA Workbench. Online Appendix for Staff 0.8086 91.5% ”Data Mining: Practical Machine Learning Tools Value 0.4533 73.5% and Techniques”, Morgan Kaufmann, Fourth Edi- Wifi 0.6615 76.0% tion, 2016. Location 0.8168 91.8% Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Table 6: X2C-A ACP results and sentiment score Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and by category. Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E. 2011. Scikit-learn: Machine Learn- ing in Python in Journal of Machine Learning Re- Micro-F1 SS search, pp. 2825–2830. Cleanliness 0.7626 90.7% Emanuele Di Rosa and Alberto Durante. LREC 2016 Comfort 0.671 90.8% App2Check: a Machine Learning-based system for Amenities 0.6276 79.6% Sentiment Analysis of App Reviews in Italian Lan- Staff 0.7948 91.9% guage in Proc. of the 2nd International Workshop on Value 0.4581 75.2% Social Media World Sensors, pp. 8-11. Wifi 0.6441 74.3% Emanuele Di Rosa and Alberto Durante. 2017. Eval- Location 0.7969 92.8% uating Industrial and Research Sentiment Analysis Engines on Multiple Sources in Proc. 
of AI*IA 2017 Advances in Artificial Intelligence - International Table 7: X2C-B ACP results and sentiment score Conference of the Italian Association for Artificial by category. Intelligence, Bari, Italy, November 14-17, 2017, pp. 141-155. Sophie de Kok, Linda Punt, Rosita van den Puttelaar, 4 Conclusions Karoliina Ranta, Kim Schouten and Flavius Frasin- In this paper we presented a description of two dif- car. 2018. Review-aggregated aspect-based senti- ment analysis with ontology features in Prog Artif ferent implementations for dealing with the ACD Intell (2018) 7: 295. and ACP tasks at ABSITA 2018. In particular, we described the models used to participate to the Theresa Wilson, Janyce Wiebe and Paul Hoffmann. 2009. Recognizing Contextual Polarity: An Explo- ACD competition together with some post dead- ration of Features for Phrase-Level Sentiment Anal- line results, in which we had the opportunity to ysis in Computational Linguistic, pp. 399–433. improve our ACD results and evaluate our systems Seth Grimes. 2010. Expert Analysis: Is also on the ACP task. The resuls show that our Sentiment Analysis an 80% Solution? X2C-A system is top ranking in the official ACD http://www.informationweek.com/software/information- competition and scores first, in its X2C-A-s ver- management/expert-analysis-is-sentiment-analysis- sion. Moreover, by testing our ACD models on the an-80–solution/d/d-id/1087919 ACP tasks, with the help of our standard X2Check J. Platt. 1998. Fast Training of Support Vector sentiment API, the X2C-A-s system scores fifth Machines using Sequential Minimal Optimization at a distance of just 0.057 from the best system, in Advances in Kernel Methods - Support Vector Learning even if the other systems have a sentiment classi- fier specifically trained on the training set of the S.S. Keerthi and S.K. Shevade and C. Bhattacharyya competition. and K.R.K. Murthy. 2001. 
Improvements to Platt’s SMO Algorithm for SVM Classifier Design in Neural Computation, volume 13, pp. 637-649. References Trevor Hastie and Robert Tibshirani 1998. Classifi- cation by Pairwise Coupling in Advances in Neural Pierpaolo Basile, Valerio Basile, Danilo Croce and Information Processing Systems, volume 10. Marco Polignano. 2018. Overview of the EVALITA