App2Check @ ATE ABSITA 2020: Aspect Term Extraction and Aspect-based Sentiment Analysis

Emanuele Di Rosa (Chief Technology Officer), emanuele.dirosa@app2check.com
Alberto Durante (Research Scientist), alberto.durante@app2check.com

Abstract

In this paper we describe and present the results of the system we specifically developed and submitted for our participation in the ATE ABSITA 2020 evaluation campaign on the Aspect Term Extraction (ATE), Aspect-based Sentiment Analysis (ABSA), and Sentiment Analysis (SA) tasks. The official results show that App2Check ranks first in all three tasks, reaching an F1 score that is 0.14236 higher than the second best system in the ATE task and 0.11943 higher in the ABSA task; it shows a Root-Mean-Square Error (RMSE) that is 0.13075 lower than the second classified system in the SA task.

1 Introduction

User reviews are becoming more important for all consumer-oriented industries. Thanks to the expansion of a review culture, collecting and sharing feedback from the buyer of a product or service can both help the seller improve and help other customers, who can take advantage of the reviews for their purchase decisions. However, having automatic tools that process reviews and extract useful insights for analysts, especially where large amounts of reviews are available, becomes relevant for any consumer-oriented industry.

The Aspect-Term Extraction and Aspect-Based Sentiment Analysis tasks are focused, respectively, on extracting the main aspects in a sentence and on assigning a specific sentiment to each of them. These are essential tools for understanding the reasons behind the success or failure of a product or service, or in any case for taking actions aimed at improving customer perception. The former helps analysts go beyond the traditional "word cloud" that is available in most text analytics tools and that focuses just on the most recurrent words in a collection. Aspect-Term Extraction, similarly to the Named-Entity Recognition task, detects a sequence of word tokens that conceptually identify an "aspect" of the sentence. The Sentiment Analysis task maintains its importance at a higher level, where it can substitute the user rating, which can sometimes be incoherent with the opinions expressed in the review text. However, it represents just the overall polarity of an opinion, which is very often the result of different polarities on multiple aspects. The assignment of a specific and, in general, different polarity to each aspect in the sentence leads to the ABSA task, which is highly dependent on the ATE task but can take advantage of the learning obtained by an SA model. In the last few years, deep learning-based models have proved to be the best technical approach for natural language processing and understanding, and are very promising also for the ATE, SA and ABSA tasks.

In this paper, we present the system that we specifically developed and submitted for our participation in the ATE ABSITA 2020 evaluation campaign (De Mattei et al., 2020), which is part of EVALITA 2020 (Basile et al., 2020), on the Aspect Term Extraction (ATE), Aspect-based Sentiment Analysis (ABSA), and Sentiment Analysis (SA) tasks. To this aim, we decided to focus just on deep learning-based approaches to train a specific model for each task. More specifically, we take advantage of the most recent approach in which pre-trained language models, largely recognized as bringing NLP to a new era (Qiu et al., 2020), are used as the main component for the three tasks. In particular, for the ATE task, in order to select the best performing pre-trained models to use for our submission, we performed an extensive experimental analysis and comparison.

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
The experimental evaluation shows some interesting and unpredictable results, discussed in Section 2, which also represent an added value of this paper. In fact, we can summarize that, on the dev set:

1. the NER fine-tuned model shows lower performance than general-purpose pre-trained models without a specific NER fine-tuning;

2. a language-specific, Italian-native model shows lower performance than multilingual models fine-tuned on Italian on the specific downstream tasks;

3. the biggest and most recent multilingual model, XLM-RoBERTa, shows the best performance when fine-tuned on the downstream tasks.

While the last result, related to the fact that bigger models (in terms of number of parameters) are more effective than smaller models, is quite common (with the exception of distilled models) and known in the literature (see also the recent GPT-3 vs GPT-2 comparison (Brown et al., 2020)), the first two results are quite surprising. In fact, we expected that a multilingual model specifically fine-tuned on the NER task in another language could take advantage of that previous training, as shown in (Pires et al., 2019). Moreover, the native Italian pre-trained language model GilBERTo (based on the Facebook RoBERTa architecture (Liu et al., 2019) and the CamemBERT text tokenization approach (Martin et al., 2020)), later fine-tuned on the NER task with the Italian training set, shows a performance that is 4% lower than the XLM-RoBERTa multilingual pre-trained model later trained on a NER training set in Italian.

For the SA task, we take advantage of a previously trained predictive model we had at App2Check, an evolution of the one presented in (Di Rosa and Durante, 2017), which is now based on the Multilingual BERT model and later fine-tuned on a 1-to-5 sentiment scale on a large amount of product reviews. This model has been additionally trained on the training set of the competition in order to have a domain-specific training. For the SA task, we decided not to perform any additional experimental comparison with other pre-trained models. Finally, for the ABSA task, we created a special encoding to map the output of our available SA model so that it could be additionally fine-tuned on the ABSA training set of the competition: this helped to take advantage of transfer learning from the SA task to the ABSA task.

This paper is structured as follows: in Sections 2, 3 and 4 we describe each of the three tasks of the competition and the details of our training and system implementation, and present the results on both the dev set and the competition test set. Finally, we draw our conclusions in Section 5.

2 Aspect-Term Extraction

Aspect Term Extraction (ATE) is the task of identifying an "aspect" in a text without knowing a priori the list of aspects that contains it. According to the literature definition, a term/phrase is considered an aspect when it co-occurs with "opinion words" that indicate a sentiment polarity on it.

Our approach has been to consider the ATE task as a Named Entity Recognition (NER) task and to fine-tune already existing pre-trained language models on the NER task, using the training set of the competition. More specifically, we decided to investigate four different classes of models:

1. native Italian pre-trained language models, with no specific NER fine-tuning;

2. multilingual pre-trained language models, with no specific NER fine-tuning;

3. native Italian pre-trained language models, with a specific NER fine-tuning;

4. multilingual pre-trained language models, with a specific NER fine-tuning.

To implement all of these approaches, we built on the Hugging Face transformers library (Wolf et al., 2019) and, in order to simplify our work, we looked for pre-trained models made publicly available by Hugging Face. With the exception of class 3, for which we could not find any publicly available model in the Hugging Face models list, we considered more than one state-of-the-art model for each class, which we further trained/fine-tuned on the competition training set. For class 1, we considered dbmdz/bert-base-italian-xxl-uncased¹ and GilBERTo². For class 2, we considered two implementations of RoBERTa, xlm-roberta-large³ (Conneau et al., 2020) and xlm-roberta-base⁴ (Liu et al., 2019), as well as multilingual BERT⁵ (Pires et al., 2019). We wanted to try xlm-roberta-large with a 512 maximum sequence length, but an out-of-memory exception prevented us from using it. For class 4, we considered wietsedv/bert-base-multilingual-cased-finetuned-conll2002-ner⁶.

¹ https://github.com/dbmdz/berts
² https://github.com/idb-ita/GilBERTo
³ https://huggingface.co/xlm-roberta-large
⁴ https://huggingface.co/xlm-roberta-base
⁵ bert-base-multilingual-cased
⁶ https://github.com/chambliss/Multilingual_NER

K  Len  Model             Ep  F1-T   F1-D
1  512  B-BERT ita unc.   11  0.961  0.663
1  512  GilBERTo unc.     10  0.941  0.697
1  512  GilBERTo unc.     15  0.973  0.670
2  512  B-xlmRoBERTa       8  0.981  0.687
2  256  L-xlmRoBERTa *    12  0.965  0.728
2  256  L-xlmRoBERTa      15  0.980  0.708
2  512  B-mBERT           20  0.991  0.679
4  512  B-mBERT NER       30  0.910  0.657
4  512  B-mBERT NER       45  0.965  0.623

Table 1: Aspect-Term Extraction performance on the development set.

All models have been trained on a cloud platform using an Nvidia Tesla P100-PCIE GPU accelerator. In Table 1 we show the results obtained by the models on the training and development sets, marking with * the model chosen for the competition. The values in columns K, Len and Ep are, respectively, the class of pre-trained model used, the maximum sequence length used in training, and the number of training epochs. The F1-T and F1-D columns contain the F1 scores on the training set and on the development set. For each model, the prefixes B and L indicate whether the base or large version has been used; if an uncased version of the pre-trained model has been used, the model name is labeled with "unc.".

The Italian Base BERT and GilBERTo approaches, both of class 1, show similar results on both the training and development sets. Interestingly, on the development set, the multilingual Base BERT model in class 2 shows very similar results to the best model in class 1, which is specifically trained on Italian. The Large XLM-RoBERTa multilingual model shows an F1 score on the development set that is higher than that of the Base version of the same model, even if they show almost the same performance on the training set. The model in class 4, multilingual Base BERT specifically trained on the NER task, shows the worst performance on the development set, even if trained for a much higher number of epochs.

Thanks to the F1 score reached on the development set, the Large XLM-RoBERTa multilingual model was chosen as our competition model, so it was further trained on the development set and tested on the competition test set.

Pos.  Name           F1 score
1     App2Check      0.68222
2     ghostwriter19  0.53986
3     SentNa         0.34027
4     Baseline       0.2556

Table 2: Aspect-Term Extraction on the test set of the competition.

In Table 2 we show the official results of the Aspect-Term Extraction task (De Mattei et al., 2020). The App2Check model ranked first, with an F1 score that is 0.14236 higher than that of the second best system.

3 Sentiment Analysis

The SA task concerns the detection of the opinion expressed in a review text. Following the typical user rating, which is here used as the reference value for the polarity, the score is defined on a five-value scale from 1 (very negative) to 5 (very positive).

For our implementation of this task, we took advantage of a previously trained predictive model we had at App2Check. It is an evolution of the one presented in (Di Rosa and Durante, 2017), which is now based on the Multilingual BERT model (covering 104 languages, with 110M parameters), later fine-tuned on a 1-to-5 sentiment scale on a large amount of product reviews. This model has been additionally trained on the training set of the competition in order to have a domain-specific training. We decided not to perform any additional experimental comparison with other pre-trained models, since it had already been compared with other approaches in the past, and also because of the little time at our disposal.

Pos.  Name                 RMSE
1     App2Check            0.66458
2     SentNa               0.79533
3     ghostwriter19        0.81394
4     Baseline-AVG score   1.0040
5     Baseline-AlBERTo     1.0806
6     Baseline-Freq score  1.2800

Table 3: Sentiment Analysis on the test set of the competition.

In Table 3 we show the results of the competition for the Sentiment Analysis task. The root-mean-square error of App2Check is 0.13075 lower than the error of the second best system, ranking in first position.

4 Aspect-Based Sentiment Analysis

The Aspect-Based Sentiment Analysis task is an extension of both the ATE and the SA tasks. In fact, the aim of the Aspect-Based Sentiment Analysis task is to detect the sentiment polarity associated to each aspect extracted by the ATE step discussed in Section 2. The possible polarity values are:

Polarity  Value
neutral   [0,0]
positive  [1,0]
negative  [0,1]
mixed     [1,1]

Similarly to what we did for the Aspect Category Polarity task at ABSITA 2018 (Di Rosa and Durante, 2018), we assumed that the sentiment score of every aspect detected in Section 2 is the one associated to the portion of text in which it is contained. In order to do so, we split the review into portions using strong punctuation marks and some conjunctions (especially the ones leading to sentiment inversion). For example, in the case of:

Ottimo prodotto di marca, la qualità é veramente notevole. Non è molto capiente ma si può prendere un'altra versione. È provvisto di una tasca piccola davanti e quella grande⁷

the aspect capiente⁸ has the same polarity score as Non è molto capiente, while the aspect qualità⁹ has the same polarity score as Ottimo prodotto di marca, la qualità é veramente notevole.

⁷ Translation: Great branded product, the quality is truly remarkable. It is not very capacious but you can get another version. It has a small front pocket and a large one.
⁸ Translation: capacious
⁹ Translation: quality

The same assumption has been applied to the training set: the polarity of each portion of a review has been associated to the aspect it contains. If a portion of a review does not contain any aspect, it has been ignored.

The submitted ABSA system is based on a single sentiment classification model, rather than on two binary models for positive and negative polarities. The final model is a four-class re-training of the sentiment model presented in Section 3, which was originally trained on user reviews with five levels (strong positive, positive, mixed/neutral, negative, strong negative) using multilingual BERT (Pires et al., 2019). In this way, we take advantage of some transfer learning about positive, negative and neutral sentiment learned on reviews.

Pos.  Name           F1 score
1     App2Check      0.61878
2     ghostwriter19  0.49935
3     SentNa         0.28632
4     Baseline       0.20000

Table 4: Aspect-Based Sentiment Analysis on the test set of the competition.

In Table 4 we show the results of the Aspect-Based Sentiment Analysis task of the competition. The App2Check system is in first position, with an F1 score that is 0.11943 higher than that of the second best system.

5 Conclusions

In this paper we described the approach we followed and the models we built for our participation in the ATE ABSITA 2020 competition. We also presented the experimental evaluation we carried out during our model selection process on the development set, showing interesting results: (i) the NER fine-tuned model shows lower performance than general-purpose pre-trained models without a specific NER fine-tuning; (ii) a language-specific, Italian-native model shows lower performance than multilingual models fine-tuned on Italian on the specific downstream tasks; (iii) the biggest and most recent multilingual model, XLM-RoBERTa, shows the best performance when fine-tuned on the downstream tasks. We also showed that our App2Check system scored first in all three tasks of the competition, reaching an F1 score that is 0.14236 higher than the second best system in the ATE task and 0.11943 higher in the ABSA task; in the SA task, our system shows a Root-Mean-Square Error (RMSE) that is 0.13075 lower than that of the second classified system.

References

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020). CEUR.org.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. CoRR. https://arxiv.org/abs/2005.14165

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, July 5-10, 2020.

Lorenzo de Mattei, Graziella de Martino, Andrea Iovine, Alessio Miaschi, Marco Polignano, and Giulia Rambelli. 2020. ATE ABSITA@EVALITA2020: Overview of the Aspect Term Extraction and Aspect-based Sentiment Analysis Task. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). CEUR.org.

Emanuele Di Rosa and Alberto Durante. 2017. Evaluating Industrial and Research Sentiment Analysis Engines on Multiple Sources. In Proceedings of AI*IA 2017 Advances in Artificial Intelligence, International Conference of the Italian Association for Artificial Intelligence, Bari, Italy, November 14-17, 2017, pp. 141-155.

Emanuele Di Rosa and Alberto Durante. 2018. Aspect-based Sentiment Analysis: X2Check at ABSITA 2018. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy, December 12-13, 2018.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, abs/1907.11692.

Louis Martin, Benjamin Müller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: a Tasty French Language Model. In ACL 2020, pp. 7203-7219.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How Multilingual is Multilingual BERT? CoRR. http://arxiv.org/abs/1906.01502

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained Models for Natural Language Processing: A Survey. arXiv. https://arxiv.org/abs/2003.08271

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771.