<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">App2Check @ ATE ABSITA 2020: Aspect Term Extraction and Aspect-based Sentiment Analysis</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Emanuele</forename><surname>Di Rosa</surname></persName>
							<email>emanuele.dirosa@app2check.com</email>
						</author>
						<author>
							<persName><forename type="first">Alberto</forename><surname>Durante</surname></persName>
							<email>alberto.durante@app2check.com</email>
						</author>
						<title level="a" type="main">App2Check @ ATE ABSITA 2020: Aspect Term Extraction and Aspect-based Sentiment Analysis</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">9DEBC811CCC4DAE70FC5690273C986A0</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T01:04+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper we describe, and present the results of, the system we specifically developed and submitted for our participation in the ATE ABSITA 2020 evaluation campaign on the Aspect Term Extraction (ATE), Aspect-based Sentiment Analysis (ABSA), and Sentiment Analysis (SA) tasks. The official results show that App2Check ranks first in all three tasks, reaching an F1 score that is 0.14236 higher than that of the second-best system in the ATE task and 0.11943 higher in the ABSA task; in the SA task, it shows a Root-Mean-Square Error (RMSE) that is 0.13075 lower than that of the second-best system.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>User reviews are becoming more important for all consumer-oriented industries. Thanks to the spread of a review culture, collecting and sharing feedback from buyers of a product or service helps both the seller, who can improve, and other customers, who can rely on reviews for their purchase decisions. Automatic tools that process reviews and extract useful insights for analysts, especially where large amounts of reviews are available, are therefore relevant for any consumer-oriented industry.</p><p>The Aspect-Term Extraction and Aspect-Based Sentiment Analysis tasks focus, respectively, on extracting the main aspects mentioned in a sentence and on assigning a specific sentiment to each of them. These are essential tools to understand the reasons behind the success or failure of a product or service and, more generally, to take actions aimed at improving customer perception. The former helps analysts go beyond the traditional "word cloud" available in most text analytics tools, which focuses only on the most recurrent words in a collection. Aspect-Term Extraction, similarly to the Named-Entity Recognition task, detects sequences of word tokens that conceptually identify an "aspect" of the sentence. The Sentiment Analysis task remains important at a higher level, where it can substitute the user rating, which is sometimes incoherent with the opinions expressed in the review text. However, it represents only the overall polarity of an opinion, which is very often the result of different polarities on multiple aspects. Assigning a specific and, in general, different polarity to each aspect in the sentence leads to the ABSA task, which is highly dependent on the ATE task but can take advantage of the learning obtained by an SA model. 
In the last few years, deep learning-based models have proved to be the best technical approach for natural language processing and understanding, and they are very promising for the ATE, SA and ABSA tasks as well.</p><p>In this paper, we present the system that we specifically developed and submitted for our participation in the ATE ABSITA 2020 evaluation campaign <ref type="bibr">(De Mattei et al., 2020)</ref>, which is part of EVALITA 2020 <ref type="bibr">(Basile et al., 2020)</ref>, covering the Aspect Term Extraction (ATE), Aspect-based Sentiment Analysis (ABSA), and Sentiment Analysis (SA) tasks. To this aim, we decided to focus only on deep learning-based approaches and to train a specific model for each task. More specifically, we take advantage of the most recent approach in which pre-trained language models, largely recognized as bringing NLP to a new era <ref type="bibr" target="#b7">(Qiu et al., 2020)</ref>, are used as the main component for all three tasks. In particular, for the ATE task, in order to select the best performing pre-trained models for our submission, we performed an extensive experimental analysis and comparison. The experimental evaluation shows some interesting and unexpected results, discussed in section 2, which also represent an added value of this paper. In fact, we can summarize that, on the dev set:</p><p>1. the NER fine-tuned model shows lower performance than general-purpose pre-trained models without a specific NER fine-tuning; 2. a language-specific, Italian-native model shows lower performance than multilingual models fine-tuned on Italian on the specific downstream tasks; 3. 
the biggest and most recent multilingual model, XLM-RoBERTa, shows the best performance when fine-tuned on the downstream tasks. While the last result, namely that bigger models (in terms of number of parameters) are more effective than smaller models, is quite common (with the exception of distilled models) and well known in the literature (see also the recent GPT-3 vs GPT-2 comparison <ref type="bibr" target="#b8">(Brown et al., 2020)</ref>), the first two results are quite surprising. In fact, we expected that the multilingual model specifically fine-tuned on the NER task in another language could take advantage of its previous training in that language, as shown in <ref type="bibr" target="#b10">(Pires et al., 2019)</ref>. Moreover, the native Italian pre-trained language model GilBERTo (based on the Facebook RoBERTa architecture <ref type="bibr" target="#b4">(Liu et al., 2019)</ref> and the CamemBERT text tokenization approach <ref type="bibr">(Martin et al., 2020)</ref>), later fine-tuned on the NER task with an Italian training set, shows a performance that is 4% lower than that of the XLM-RoBERTa multilingual pre-trained model trained on a NER training set in Italian.</p><p>For the SA task, we take advantage of a previously trained predictive model we had at App2Check, an evolution of the one presented in <ref type="bibr" target="#b3">(Di Rosa and Durante, 2017)</ref>, which is now based on the Multilingual BERT model and later fine-tuned on a 1-to-5 sentiment scale on a large amount of product reviews. This model has been additionally trained on the training set of the competition in order to obtain a domain-specific training. For the SA task, we decided not to perform any additional experimental comparison with other pre-trained models. 
Finally, for the ABSA task, we created a special encoding to map the output of our available SA model so that it could be additionally fine-tuned on the ABSA training set of the competition: this helped to take advantage of transfer learning from the SA task to the ABSA task.</p><p>This paper is structured as follows: in sections 2, 3 and 4 we describe each of the three tasks of the competition and the details of our training and system implementation, and we present the results on both the dev set and the competition test set. Finally, we draw our conclusions in section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Aspect-Term Extraction</head><p>Aspect Term Extraction (ATE) is the task of identifying an "aspect" in a text without knowing a priori the list of possible aspects. According to the literature definition, a term/phrase is considered an aspect when it co-occurs with "opinion words" that indicate a sentiment polarity towards it.</p><p>Our approach has been to treat the ATE task as a Named Entity Recognition (NER) task and to fine-tune already existing pre-trained language models on the NER task, using the training set of the competition. More specifically, we decided to investigate four different classes of models:</p><p>1. native Italian pre-trained language models, with no specific NER fine-tuning; 2. multilingual pre-trained language models, with no specific NER fine-tuning; 3. native Italian pre-trained language models, with a specific NER fine-tuning; 4. multilingual pre-trained language models, with a specific NER fine-tuning.</p><p>To implement all of these approaches, we relied on the Hugging Face transformers library <ref type="bibr" target="#b9">(Wolf et al., 2019)</ref> and, in order to simplify our work, we looked for pre-trained models made publicly available on the Hugging Face model hub. With the exception of type 3, for which we could not find any publicly available model in the Hugging Face models list, we considered more than one state-of-the-art model for each type, which we further trained/fine-tuned on the competition training set.</p><p>For type 1, we considered dbmdz/bert-base-italian-xxl-uncased<ref type="foot" target="#foot_0">1</ref> and GilBERTo<ref type="foot" target="#foot_1">2</ref>. 
For type 2, we considered two implementations of XLM-RoBERTa:</p><p>xlm-roberta-large<ref type="foot" target="#foot_2">3</ref> <ref type="bibr" target="#b5">(Conneau et al., 2020)</ref> and xlm-roberta-base<ref type="foot" target="#foot_3">4</ref> <ref type="bibr" target="#b4">(Liu et al., 2019)</ref>, as well as multilingual BERT<ref type="foot" target="#foot_4">5</ref> <ref type="bibr" target="#b10">(Pires et al., 2019)</ref>. We wanted to try xlm-roberta-large with a 512 maximum sequence length, but an out-of-memory exception prevented us from using it. For type 4, we considered wietsedv/bert-base-multilingual-cased-finetuned-conll2002-ner<ref type="foot" target="#foot_5">6</ref>.</p></div>
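As an illustration of the NER framing described above, the following sketch (the function name and the BIO label set are illustrative assumptions, not the authors' released code) converts character-level aspect spans into the per-token BIO labels on which a token-classification model such as xlm-roberta-large is fine-tuned:

```python
# Sketch: casting Aspect Term Extraction as token classification requires
# turning character-level aspect annotations into per-token BIO labels.

def bio_tags(text, aspect_spans):
    """Whitespace-tokenize `text` and label each token B-ASP/I-ASP/O
    according to the character-level (start, end) spans in `aspect_spans`."""
    tagged = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)   # character offset of this token
        end = start + len(token)
        pos = end
        label = "O"
        for s, e in aspect_spans:
            if start >= s and end <= e:  # token lies inside an aspect span
                label = "B-ASP" if start == s else "I-ASP"
        tagged.append((token, label))
    return tagged

# Example: the aspect "qualità" in an Italian review sentence.
print(bio_tags("la qualità è veramente notevole", [(3, 10)]))
# [('la', 'O'), ('qualità', 'B-ASP'), ('è', 'O'), ('veramente', 'O'), ('notevole', 'O')]
```

A subword tokenizer would further split these tokens, with labels propagated to the first subword of each token, but the span-to-label alignment above is the essential step.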
<div xmlns="http://www.tei-c.org/ns/1.0"><head></head><p>All models have been trained on a cloud platform using an Nvidia Tesla P100-PCIE GPU accelerator. In Table <ref type="table" target="#tab_0">1</ref> we show the results obtained by the models on the training and development sets. The XLM-RoBERTa Large multilingual model shows an F1-score on the development set that is higher than that of the Base version of the same model, even though they show almost the same performance on the training set. The model in class 4, multilingual BERT Base specifically trained on the NER task, shows the worst performance on the development set, even though it was trained with a much higher number of epochs.</p><p>Thanks to the F1 score reached on the development set, the XLM-RoBERTa Large multilingual model has been chosen as our competition model; it has therefore been further trained on the development set and tested on the competition test set.</p></div>
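The model comparison above relies on F1 scores over extracted aspect terms. A minimal sketch of such a metric under an exact span-match assumption (our illustration, not the official ATE ABSITA scorer):

```python
# Sketch: micro-averaged precision/recall/F1 over sets of predicted vs gold
# aspect-term spans, counting only exact (start, end) matches as true positives.

def span_f1(gold, pred):
    """Return (precision, recall, f1) for two collections of (start, end) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # exact matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One of two gold aspects recovered: precision 1.0, recall 0.5.
p, r, f1 = span_f1(gold=[(3, 10), (25, 33)], pred=[(3, 10)])
print(round(f1, 3))  # 0.667
```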
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table"><head>Table 2:</head><label>2</label><figDesc>Aspect-Term Extraction on the test set of the competition.</figDesc><table><row><cell>Pos. Name</cell><cell>F1 score</cell></row><row><cell>1 App2Check</cell><cell>0.68222</cell></row><row><cell>2 ghostwriter19</cell><cell>0.53986</cell></row><row><cell>3 SentNa</cell><cell>0.34027</cell></row><row><cell>4 Baseline</cell><cell>0.2556</cell></row></table></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In Table <ref type="table">2</ref> we show the official results of the Aspect-Term Extraction task <ref type="bibr">(De Mattei et al., 2020)</ref>. The App2Check model ranked first, with an F1 score that is 0.14236 higher than that of the second-best system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Sentiment Analysis</head><p>The SA task concerns the detection of the opinion expressed in a text review. Following the typical user rating, which is here used as the reference value for the polarity, the score is defined on a five-value scale from 1 (very negative) to 5 (very positive).</p><p>For our implementation of this task, we took advantage of a previously trained predictive model we had at App2Check. It is an evolution of the one presented in <ref type="bibr" target="#b3">(Di Rosa and Durante, 2017)</ref>, now built on the Multilingual BERT model (covering 104 languages, with 110M parameters) and later fine-tuned on a 1-to-5 sentiment scale on a large amount of product reviews. This model has been additionally trained on the training set of the competition in order to obtain a domain-specific training. We decided not to perform any additional experimental comparison with other pre-trained models, since this model had already been compared with other approaches in the past, and also because of the limited time at our disposal.</p><p>In Table <ref type="table" target="#tab_1">3</ref> we show the results of the competition for the Sentiment Analysis task.</p></div>
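The ranking metric for this task is the Root-Mean-Square Error between predicted and gold 1-to-5 scores. A minimal sketch of its computation (our illustration, not the official evaluation script):

```python
import math

def rmse(gold, pred):
    """Root-mean-square error between two equal-length score sequences."""
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold))

# e.g. predictions off by one star on two out of four reviews
print(rmse([5, 4, 2, 1], [5, 3, 2, 2]))  # sqrt(2/4) ≈ 0.707
```

Because the scale is ordinal, RMSE penalizes a 5-predicted-as-1 error far more than a 5-predicted-as-4 error, unlike plain accuracy.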
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Aspect-Based Sentiment Analysis</head><p>The Aspect-Based Sentiment Analysis task is an extension of both the ATE and the SA tasks. In fact, its aim is to detect the sentiment polarity associated with each aspect extracted through the ATE task discussed in Section 2. The possible polarity values are:</p><formula xml:id="formula_0">Polarity Value neutral [0,0] positive [1,0] negative [0,1] mixed [1,1]</formula><p>Similarly to what we did for the Aspect Category Polarity task at ABSITA 2018 <ref type="bibr" target="#b2">(Di Rosa and Durante, 2018)</ref>, we assumed that the sentiment score of every aspect detected as in Section 2 is the one associated with the portion of text in which it is contained. To this end, we split the review into portions using strong punctuation marks and some conjunctions (especially those leading to sentiment inversion). For example, in the case of:</p><p>Ottimo prodotto di marca, la qualità é veramente notevole. Non è molto capiente ma si può prendere un'altra versione. È provvisto di una tasca piccola davanti e quella grande.</p><p>the aspect capiente has the same polarity score as Non è molto capiente, while the aspect qualità has the same polarity score as Ottimo prodotto di marca, la qualità é veramente notevole. (Translation of the example: Great branded product, the quality is truly remarkable. It is not very capacious but you can get another version. It has a small front pocket and a large one; capiente: capacious; qualità: quality.)</p><p>The same assumption has been applied to the training set: the polarity of each portion of a review has been associated with the contained aspect. If a portion of a review does not contain any aspect, it has been ignored. The submitted ABSA system is based on a single sentiment classification model, rather than on two binary models for positive and negative polarities. 
The final model is a four-class retraining of the sentiment model presented in section 3, originally trained on user reviews with five levels (strong positive, positive, mixed/neutral, negative, strong negative) using multilingual BERT <ref type="bibr" target="#b10">(Pires et al., 2019)</ref>. In this way, we take advantage of transfer learning about positive, negative and neutral sentiment learned on reviews. In Table <ref type="table" target="#tab_2">4</ref> we show the results of the Aspect-Based Sentiment Analysis task of the competition. The App2Check system ranks first, with an F1 score that is 0.11943 higher than that of the second-best system.</p></div>
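The portion-splitting heuristic and the four-class polarity encoding described above can be sketched as follows (the function names, the reduced conjunction list, and the toy classifier standing in for the fine-tuned sentiment model are illustrative assumptions, not the authors' released code):

```python
import re

# Four-class polarity encoding of the ABSA task: [positive, negative] pairs.
ENCODING = {"neutral": [0, 0], "positive": [1, 0],
            "negative": [0, 1], "mixed": [1, 1]}

def split_portions(review):
    """Split a review on strong punctuation and on the adversative
    conjunction 'ma' (a deliberately reduced conjunction list)."""
    return [p.strip() for p in re.split(r"[.!?;]|\bma\b", review) if p.strip()]

def aspect_polarity(review, aspects, classify):
    """Give each aspect the polarity predicted for its containing portion."""
    result = {}
    for portion in split_portions(review):
        for aspect in aspects:
            if aspect in portion:
                result[aspect] = ENCODING[classify(portion)]
    return result

# Toy stand-in for the fine-tuned sentiment model described in the paper.
def toy_classifier(text):
    return "negative" if "Non" in text else "positive"

review = ("Ottimo prodotto di marca, la qualità é veramente notevole. "
          "Non è molto capiente ma si può prendere un'altra versione.")
print(aspect_polarity(review, ["qualità", "capiente"], toy_classifier))
# {'qualità': [1, 0], 'capiente': [0, 1]}
```

Note the word boundary `\b` in the split pattern: it keeps "marca" intact while still splitting on the standalone conjunction "ma", which reverses the sentiment between the two halves of the second sentence.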
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusions</head><p>In this paper we described the approach we followed and the models we built for our participation in the ATE ABSITA 2020 competition. We also presented the experimental evaluation we carried out in the context of our model selection process on the development set, which showed interesting results: (i) the NER fine-tuned model shows lower performance than general-purpose pre-trained models without a specific NER fine-tuning; (ii) a language-specific, Italian-native model shows lower performance than multilingual models fine-tuned on Italian on the specific downstream tasks; (iii) the biggest and most recent multilingual model, XLM-RoBERTa, shows the best performance when fine-tuned on the downstream tasks. We also showed that our App2Check system ranked first in all three tasks of the competition, reaching an F1 score that is 0.14236 higher than that of the second-best system in the ATE task and 0.11943 higher in the ABSA task; in the SA task, our system shows a Root-Mean-Square Error (RMSE) that is 0.13075 lower than that of the second-best system.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Table 1 shows the results obtained by the models on the training and development sets, highlighting in bold the model chosen for the competition. The values in the K, Len and Ep columns indicate, respectively, the class of pre-trained model used, the maximum sequence length used in training, and the number of training epochs. The F1-T and F1-D columns contain the F1-scores on the training and development sets. For each model, the prefix L or B indicates whether the large or base version has been used; if an uncased version of the pre-trained model has been used, the model name is labeled with unc. The Italian Base BERT and GilBERTo approaches, both of class 1, show similar results on both the training and development sets. Interestingly, on the development set, the multilingual Base BERT model in class 2 shows very similar results to the best model in class 1, which is specifically trained on Italian.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1:</head><label>1</label><figDesc>Aspect-Term Extraction performance on the development set.</figDesc><table><row><cell cols="2">1 512 B-BERT ita unc. 11 0.961 0.663</cell></row><row><cell>1 512 GilBERTo unc.</cell><cell>10 0.941 0.697</cell></row><row><cell>1 512 GilBERTo unc.</cell><cell>15 0.973 0.670</cell></row><row><cell cols="2">2 512 B-xlmRoBERTa 8 0.981 0.687</cell></row><row><cell cols="2">2 256 L-xlmRoBERTa 12 0.965 0.728</cell></row><row><cell cols="2">2 256 L-xlmRoBERTa 15 0.980 0.708</cell></row><row><cell>2 512 B-mBERT</cell><cell>20 0.991 0.679</cell></row><row><cell cols="2">4 512 B-mBERT NER 30 0.910 0.657</cell></row><row><cell cols="2">4 512 B-mBERT NER 45 0.965 0.623</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3:</head><label>3</label><figDesc>Sentiment Analysis on the test set of the competition. The root-mean-square error of App2Check is 0.13075 lower than that of the second-best system, ranking in first position.</figDesc><table><row><cell>Pos Name</cell><cell>RMSE</cell></row><row><cell>1 App2Check</cell><cell>0.66458</cell></row><row><cell>2 SentNa</cell><cell>0.79533</cell></row><row><cell>3 ghostwriter19</cell><cell>0.81394</cell></row><row><cell cols="2">4 Baseline-AVG score 0.10040</cell></row><row><cell>5 Baseline-AlBERTo</cell><cell>0.10806</cell></row><row><cell cols="2">6 Baseline-Freq score 0.12800</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 4 :</head><label>4</label><figDesc>Aspect-Based Sentiment Analysis on the test set of the competition.</figDesc><table><row><cell>. Name</cell><cell>F1 score</cell></row><row><cell>1 App2Check</cell><cell>0.61878</cell></row><row><cell cols="2">2 ghostwriter19 0.49935</cell></row><row><cell>3 SentNa</cell><cell>0.28632</cell></row><row><cell>4 Baseline</cell><cell>0.20000</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/dbmdz/berts</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://github.com/idb-ita/GilBERTo</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://huggingface.co/xlm-roberta-large</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://huggingface.co/xlm-roberta-base</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">bert-base-multilingual-cased</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">https://github.com/chambliss/Multilingual_NER</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</title>
		<author>
			<persName><forename type="first">Valerio</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Danilo</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Maria</forename><surname>Di Maro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lucia</forename><forename type="middle">C</forename><surname>Passaro</surname></persName>
		</author>
		<ptr target="CEUR.org" />
	</analytic>
	<monogr>
		<title level="m">Final Workshop</title>
				<meeting><address><addrLine>EVALITA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">ATE ABSITA@EVALITA2020: Overview of the Aspect Term Extraction and Aspect-based Sentiment Analysis Task</title>
		<author>
			<persName><forename type="first">Lorenzo</forename><surname>De Mattei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Graziella</forename><surname>De Martino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrea</forename><surname>Iovine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alessio</forename><surname>Miaschi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marco</forename><surname>Polignano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giulia</forename><surname>Rambelli</surname></persName>
		</author>
		<ptr target="CEUR.org" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian</title>
				<meeting>the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian<address><addrLine>EVALITA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Aspect-based Sentiment Analysis: X2Check at ABSITA 2018</title>
		<author>
			<persName><forename type="first">Emanuele</forename><surname>Di Rosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alberto</forename><surname>Durante</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) colocated with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)</title>
				<meeting>the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) colocated with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)<address><addrLine>Turin, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-12-12">2018. December 12-13, 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Evaluating Industrial and Research Sentiment Analysis Engines on Multiple Sources</title>
		<author>
			<persName><forename type="first">Emanuele</forename><surname>Di Rosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alberto</forename><surname>Durante</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of AI*IA 2017 Advances in Artificial Intelligence -International Conference of the Italian Association for Artificial Intelligence</title>
				<meeting>of AI*IA 2017 Advances in Artificial Intelligence -International Conference of the Italian Association for Artificial Intelligence<address><addrLine>Bari, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017-11-14">2017. November 14-17, 2017</date>
			<biblScope unit="page" from="141" to="155" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">Yinhan</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Myle</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Naman</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jingfei</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mandar</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Danqi</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Omer</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mike</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luke</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Veselin</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno>abs/1907.11692</idno>
		<title level="m">RoBERTa: A Robustly Optimized BERT Pretraining Approach</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Unsupervised Cross-lingual Representation Learning at Scale</title>
		<author>
			<persName><forename type="first">Alexis</forename><surname>Conneau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kartikay</forename><surname>Khandelwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Naman</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vishrav</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Guillaume</forename><surname>Wenzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Francisco</forename><surname>Guzmán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Edouard</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Myle</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luke</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Veselin</forename><surname>Stoyanov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020</meeting>
		<imprint>
			<date type="published" when="2020-07-05">2020. July 5-10, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">CamemBERT: a Tasty French Language Model</title>
		<author>
			<persName><forename type="first">Louis</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Benjamin</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pedro</forename><forename type="middle">Javier</forename><surname>Ortiz Suárez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoann</forename><surname>Dupont</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Laurent</forename><surname>Romary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Éric</forename><surname>De La Clergerie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Djamé</forename><surname>Seddah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Benoît</forename><surname>Sagot</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACL</title>
		<imprint>
			<biblScope unit="volume">2020</biblScope>
			<biblScope unit="page" from="7203" to="7219" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Pre-trained Models for Natural Language Processing: A Survey</title>
		<author>
			<persName><forename type="first">Xipeng</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tianxiang</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yige</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yunfan</forename><surname>Shao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ning</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xuanjing</forename><surname>Huang</surname></persName>
		</author>
		<idno>arXiv:2003.08271</idno>
		<ptr target="https://arxiv.org/abs/2003.08271" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">Tom</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Benjamin</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nick</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Melanie</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jared</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Prafulla</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Arvind</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pranav</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Girish</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Amanda</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sandhini</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ariel</forename><surname>Herbert-Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gretchen</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tom</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rewon</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aditya</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><forename type="middle">M</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Clemens</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mark</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eric</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mateusz</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Scott</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Benjamin</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jack</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sam</forename><surname>McCandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alec</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ilya</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dario</forename><surname>Amodei</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2005.14165</idno>
		<ptr target="https://arxiv.org/abs/2005.14165" />
		<title level="m">Language Models are Few-Shot Learners</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Transformers: State-of-the-art natural language processing</title>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Wolf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lysandre</forename><surname>Debut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Victor</forename><surname>Sanh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Julien</forename><surname>Chaumond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Clement</forename><surname>Delangue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anthony</forename><surname>Moi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pierric</forename><surname>Cistac</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tim</forename><surname>Rault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rémi</forename><surname>Louf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Morgan</forename><surname>Funtowicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Joe</forename><surname>Davison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sam</forename><surname>Shleifer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Patrick</forename><surname>Von Platen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Clara</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yacine</forename><surname>Jernite</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Julien</forename><surname>Plu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Canwen</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Teven</forename><surname>Le Scao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sylvain</forename><surname>Gugger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mariama</forename><surname>Drame</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quentin</forename><surname>Lhoest</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexander</forename><forename type="middle">M</forename><surname>Rush</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1910.03771</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">How Multilingual is Multilingual BERT?</title>
		<author>
			<persName><forename type="first">Telmo</forename><surname>Pires</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eva</forename><surname>Schlinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dan</forename><surname>Garrette</surname></persName>
		</author>
		<ptr target="http://arxiv.org/abs/1906.01502" />
	</analytic>
	<monogr>
		<title level="j">CoRR</title>
		<imprint>
			<biblScope unit="volume">abs/1906.01502</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</title>
		<author>
			<persName><forename type="first">Valerio</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Danilo</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Maria</forename><surname>Di Maro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lucia</forename><forename type="middle">C</forename><surname>Passaro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Online Proceedings of EVALITA 2020</title>
		<editor>
			<persName><forename type="first">Valerio</forename><surname>Basile</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Danilo</forename><surname>Croce</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Maria</forename><surname>Di Maro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Lucia</forename><forename type="middle">C</forename><surname>Passaro</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
