App2Check @ ATE ABSITA 2020: Aspect Term Extraction and Aspect-based Sentiment Analysis

Emanuele Di Rosa (Chief Technology Officer), emanuele.dirosa@app2check.com
Alberto Durante (Research Scientist), alberto.durante@app2check.com

Abstract

In this paper we describe and present the results of the system we specifically developed and submitted for our participation in the ATE ABSITA 2020 evaluation campaign on the Aspect Term Extraction (ATE), Aspect-based Sentiment Analysis (ABSA), and Sentiment Analysis (SA) tasks. The official results show that App2Check ranks first in all three tasks, reaching an F1 score that is 0.14236 higher than the second best system in the ATE task and 0.11943 higher in the ABSA task; it shows a Root-Mean-Square Error (RMSE) that is 0.13075 lower than the second classified system in the SA task.

1 Introduction

User reviews are becoming more important for all consumer-oriented industries. Thanks to the expansion of a review culture, collecting and sharing feedback from the buyer of a product or service can both help the seller improve and help other customers, who can take advantage of the reviews for their purchase decisions. However, having automatic tools that process reviews and extract useful insights for analysts, especially where large amounts of reviews are available, becomes relevant for any consumer-oriented industry.

The Aspect-Term Extraction and Aspect-Based Sentiment Analysis tasks are focused, respectively, on extracting the main aspects in a sentence and on assigning a specific sentiment to each of them. These are essential tools for understanding the reasons behind the success or failure of a product or service, or in any case for taking actions aimed at improving customer perception. The former helps analysts go beyond the traditional "word cloud" that is available in most text analytics tools and that focuses just on the most recurrent words in a collection. Aspect-Term Extraction, similarly to the Named-Entity Recognition task, detects a sequence of word tokens that conceptually identify an "aspect" of the sentence. The Sentiment Analysis task maintains its importance at a higher level, where it can substitute the user rating, which can sometimes be incoherent with the opinions expressed in the review text. However, it represents just the overall polarity of an opinion, which is very often the result of different polarities on multiple aspects. The assignment of a specific and, in general, different polarity to each aspect in the sentence leads to the ABSA task, which is highly dependent on the ATE task but can take advantage of the learning obtained by an SA model. In the last few years, deep learning-based models have proved to be the best technical approach for natural language processing and understanding, and are very promising also for the ATE, SA and ABSA tasks.

In this paper, we present the system that we specifically developed and submitted for our participation in the ATE ABSITA 2020 evaluation campaign (De Mattei et al., 2020), which is part of EVALITA 2020 (Basile et al., 2020), on the Aspect Term Extraction (ATE), Aspect-based Sentiment Analysis (ABSA), and Sentiment Analysis (SA) tasks. To this aim, we decided to focus just on deep learning-based approaches to train a specific model for each task. More specifically, we take advantage of the most recent approach in which pre-trained language models, largely recognized as bringing NLP to a new era (Qiu et al., 2020), are used as the main component for the three tasks. In particular, for the ATE task, in order to select the best performing pre-trained models to use for our submission, we performed an extensive experimental analysis and comparison.

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
The experimental evaluation shows some interesting and unpredictable results, discussed in Section 2, which also represent an added value of this paper. In fact, we can summarize that, on the dev set:

1. the NER fine-tuned model shows lower performance than general-purpose pre-trained models without a specific NER fine-tuning;

2. a language-specific, Italian-native model shows lower performance than multilingual models fine-tuned on Italian on the specific downstream tasks;

3. the biggest and most recent multilingual model, XLM-RoBERTa, shows the best performance when fine-tuned on the downstream tasks.

While the last result, related to the fact that bigger models (in terms of number of parameters) are more effective than smaller models, is quite common (with the exception of distilled models) and known in the literature (see also the recent GPT-3 vs GPT-2 comparison (Brown et al., 2020)), the first two results are quite surprising. In fact, we expected that a multilingual model specifically fine-tuned on the NER task in another language could take advantage of that previous training, as shown in (Pires et al., 2019). Moreover, the native Italian pre-trained language model GilBERTo (based on the Facebook RoBERTa architecture (Liu et al., 2019) and the CamemBERT text tokenization approach (Martin et al., 2020)), later fine-tuned on the NER task with the Italian training set, shows a performance that is 4% lower than the XLM-RoBERTa multilingual pre-trained model later trained on a NER training set in Italian.

For the SA task, we take advantage of a previously trained predictive model we had at App2Check, an evolution of the one presented in (Di Rosa and Durante, 2017), which is now based on the Multilingual BERT model and later fine-tuned on a 1-to-5 sentiment scale on a large amount of product reviews. This model has been additionally trained on the training set of the competition in order to have a domain-specific training. For the SA task, we decided not to perform any additional experimental comparison with other pre-trained models. Finally, for the ABSA task, we created a special encoding to map the output of our available SA model so that it could be additionally fine-tuned on the ABSA training set of the competition: this helped to take advantage of transfer learning from the SA task to the ABSA task.

This paper is structured as follows: in Sections 2, 3 and 4 we describe each of the three tasks of the competition and the details of our training and system implementation, and present the results on both the dev set and the competition test set. Finally, we draw our conclusions in Section 5.

2 Aspect-Term Extraction

Aspect Term Extraction (ATE) is the task of identifying an "aspect" in a text without knowing a priori the list of aspects that contains it. According to the literature definition, a term/phrase is considered an aspect when it co-occurs with "opinion words" that indicate a sentiment polarity on it.

Our approach has been to consider the ATE task as a Named Entity Recognition (NER) task and to fine-tune already existing pre-trained language models on the NER task, using the training set of the competition. More specifically, we decided to investigate four different classes of models:

1. native Italian pre-trained language models, with no specific NER fine-tuning;

2. multilingual pre-trained language models, with no specific NER fine-tuning;

3. native Italian pre-trained language models, with a specific NER fine-tuning;

4. multilingual pre-trained language models, with a specific NER fine-tuning.

To implement all of these approaches, we built on the Hugging Face transformers library (Wolf et al., 2019) and, in order to simplify our work, we looked for pre-trained models made publicly available by Hugging Face. With the exception of class 3, for which we could not find any publicly available model in the Hugging Face models list, we considered more than one state-of-the-art model for each class, which we further trained/fine-tuned on the competition training set. For class 1, we considered dbmdz/bert-base-italian-xxl-uncased¹ and GilBERTo². For class 2, we considered two implementations of RoBERTa, xlm-roberta-large³ (Conneau et al., 2020) and xlm-roberta-base⁴ (Liu et al., 2019), as well as multilingual BERT⁵ (Pires et al., 2019). We wanted to try xlm-roberta-large with a 512 maximum sequence length, but an out-of-memory exception prevented us from using it. For class 4, we considered wietsedv/bert-base-multilingual-cased-finetuned-conll2002-ner⁶.

¹ https://github.com/dbmdz/berts
² https://github.com/idb-ita/GilBERTo
³ https://huggingface.co/xlm-roberta-large
⁴ https://huggingface.co/xlm-roberta-base
⁵ bert-base-multilingual-cased
⁶ https://github.com/chambliss/Multilingual_NER

K  Len  Model             Ep  F1-T   F1-D
1  512  B-BERT ita unc.   11  0.961  0.663
1  512  GilBERTo unc.     10  0.941  0.697
1  512  GilBERTo unc.     15  0.973  0.670
2  512  B-xlmRoBERTa       8  0.981  0.687
2  256  L-xlmRoBERTa *    12  0.965  0.728
2  256  L-xlmRoBERTa      15  0.980  0.708
2  512  B-mBERT           20  0.991  0.679
4  512  B-mBERT NER       30  0.910  0.657
4  512  B-mBERT NER       45  0.965  0.623

Table 1: Aspect-Term Extraction performance on the development set.

All models have been trained on a cloud platform using an Nvidia Tesla P100-PCIE GPU accelerator. In Table 1 we show the results obtained by the models on the training and development sets, marking with * the model chosen for the competition. The values in columns K, Len and Ep are, respectively, the class of pre-trained model used, the maximum sequence length used in training, and the number of training epochs. The F1-T and F1-D columns contain the F1 scores on the training set and on the development set. For each model, the prefixes B and L indicate whether the base or large version has been used; if an uncased version of the pre-trained model has been used, the model name is labeled with "unc.".

The Italian Base BERT and GilBERTo approaches, both of class 1, show similar results on both the training and development sets. Interestingly, on the development set, the multilingual Base BERT model in class 2 shows very similar results to the best model in class 1, which is specifically trained on Italian. The Large XLM-RoBERTa multilingual model shows an F1 score on the development set that is higher than that of the Base version of the same model, even if they show almost the same performance on the training set. The model in class 4, multilingual Base BERT specifically trained on the NER task, shows the worst performance on the development set, even if trained for a much higher number of epochs.

Thanks to the F1 score reached on the development set, the Large XLM-RoBERTa multilingual model was chosen as our competition model, so it was further trained on the development set and tested on the competition test set.

Pos.  Name           F1 score
1     App2Check      0.68222
2     ghostwriter19  0.53986
3     SentNa         0.34027
4     Baseline       0.2556

Table 2: Aspect-Term Extraction on the test set of the competition.

In Table 2 we show the official results of the Aspect-Term Extraction task (De Mattei et al., 2020). The App2Check model ranked first, with an F1 score that is 0.14236 higher than that of the second best system.

3 Sentiment Analysis

The SA task concerns the detection of the opinion expressed in a review text. Following the typical user rating, which is here used as the reference value for the polarity, the score is defined on a five-value scale from 1 (very negative) to 5 (very positive).

For our implementation of this task, we took advantage of a previously trained predictive model we had at App2Check. It is an evolution of the one presented in (Di Rosa and Durante, 2017), which is now based on the Multilingual BERT model (covering 104 languages, with 110M parameters), later fine-tuned on a 1-to-5 sentiment scale on a large amount of product reviews. This model has been additionally trained on the training set of the competition in order to have a domain-specific training. We decided not to perform any additional experimental comparison with other pre-trained models, since it had already been compared with other approaches in the past, and also because of the little time at our disposal.

Pos.  Name                 RMSE
1     App2Check            0.66458
2     SentNa               0.79533
3     ghostwriter19        0.81394
4     Baseline-AVG score   1.0040
5     Baseline-AlBERTo     1.0806
6     Baseline-Freq score  1.2800

Table 3: Sentiment Analysis on the test set of the competition.

In Table 3 we show the results of the competition for the Sentiment Analysis task. The root-mean-square error of App2Check is 0.13075 lower than the error of the second best system, ranking in first position.

4 Aspect-Based Sentiment Analysis

The Aspect-Based Sentiment Analysis task is an extension of both the ATE and the SA tasks. In fact, the aim of the Aspect-Based Sentiment Analysis task is to detect the sentiment polarity associated to each aspect extracted by the ATE step discussed in Section 2. The possible polarity values are:

Polarity  Value
neutral   [0,0]
positive  [1,0]
negative  [0,1]
mixed     [1,1]

Similarly to what we did for the Aspect Category Polarity task at ABSITA 2018 (Di Rosa and Durante, 2018), we assumed that the sentiment score of every aspect detected in Section 2 is the one associated to the portion of text in which it is contained. In order to do so, we split the review into portions using strong punctuation marks and some conjunctions (especially the ones leading to sentiment inversion). For example, in the case of:

Ottimo prodotto di marca, la qualità é veramente notevole. Non è molto capiente ma si può prendere un'altra versione. È provvisto di una tasca piccola davanti e quella grande⁷

the aspect capiente⁸ has the same polarity score as Non è molto capiente, while the aspect qualità⁹ has the same polarity score as Ottimo prodotto di marca, la qualità é veramente notevole.

⁷ Translation: Great branded product, the quality is truly remarkable. It is not very capacious but you can get another version. It has a small front pocket and a large one.
⁸ Translation: capacious
⁹ Translation: quality

The same assumption has been applied to the training set: the polarity of each portion of a review has been associated to the aspect it contains. If a portion of a review does not contain any aspect, it has been ignored.

The submitted ABSA system is based on a single sentiment classification model, rather than on two binary models for positive and negative polarities. The final model is a four-class re-training of the sentiment model presented in Section 3, which was originally trained on user reviews with five levels (strong positive, positive, mixed/neutral, negative, strong negative) using multilingual BERT (Pires et al., 2019). In this way, we take advantage of some transfer learning about positive, negative and neutral sentiment learned on reviews.

Pos.  Name           F1 score
1     App2Check      0.61878
2     ghostwriter19  0.49935
3     SentNa         0.28632
4     Baseline       0.20000

Table 4: Aspect-Based Sentiment Analysis on the test set of the competition.

In Table 4 we show the results of the Aspect-Based Sentiment Analysis task of the competition. The App2Check system is in first position, with an F1 score that is 0.11943 higher than that of the second best system.

5 Conclusions

In this paper we described the approach we followed and the models we built for our participation in the ATE ABSITA 2020 competition. We also presented the experimental evaluation we carried out during our model selection process on the development set, showing interesting results: (i) the NER fine-tuned model shows lower performance than general-purpose pre-trained models without a specific NER fine-tuning; (ii) a language-specific, Italian-native model shows lower performance than multilingual models fine-tuned on Italian on the specific downstream tasks; (iii) the biggest and most recent multilingual model, XLM-RoBERTa, shows the best performance when fine-tuned on the downstream tasks. We also showed that our App2Check system scored first in all three tasks of the competition, reaching an F1 score that is 0.14236 higher than the second best system in the ATE task and 0.11943 higher in the ABSA task; in the SA task, our system shows a Root-Mean-Square Error (RMSE) that is 0.13075 lower than that of the second classified system.

References

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020). CEUR.org.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. CoRR. https://arxiv.org/abs/2005.14165

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, July 5-10, 2020.

Lorenzo de Mattei, Graziella de Martino, Andrea Iovine, Alessio Miaschi, Marco Polignano, and Giulia Rambelli. 2020. ATE ABSITA@EVALITA2020: Overview of the Aspect Term Extraction and Aspect-based Sentiment Analysis Task. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR.org.

Emanuele Di Rosa and Alberto Durante. 2017. Evaluating Industrial and Research Sentiment Analysis Engines on Multiple Sources. In Proceedings of AI*IA 2017 Advances in Artificial Intelligence, International Conference of the Italian Association for Artificial Intelligence, Bari, Italy, November 14-17, 2017, pp. 141-155.

Emanuele Di Rosa and Alberto Durante. 2018. Aspect-based Sentiment Analysis: X2Check at ABSITA 2018. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy, December 12-13, 2018.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, abs/1907.11692.

Louis Martin, Benjamin Müller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: a Tasty French Language Model. In ACL 2020, pp. 7203-7219.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How Multilingual is Multilingual BERT? CoRR. http://arxiv.org/abs/1906.01502

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained Models for Natural Language Processing: A Survey. arXiv. https://arxiv.org/abs/2003.08271

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771.