KERMIT for Sentiment Analysis in Italian Healthcare Reviews

Leonardo Ranaldi1, Michele Mastromattei2, Dario Onorati2, Elena Sofia Ruzzetti2, Francesca Fallucchi1, Fabio Massimo Zanzotto2
1. Dept. of Innovation and Information Engineering, Guglielmo Marconi University, Italy
2. Dept. of Enterprise Engineering, University of Rome Tor Vergata, Italy
l.ranaldi@unimarconi.com, elenasofia.ruzzetti@alumni.uniroma2.eu, {michele.mastromattei,fabio.massimo.zanzotto}@uniroma2.it, dario.onorati@uniroma1.it, f.fallucchi@unimarconi.it

Abstract

In this paper, we describe our approach to the sentiment classification challenge on Italian reviews in the healthcare domain. First, we followed the work of Bacco et al. (2020), from which we obtained the dataset. Then, we built our model, KERMITHC, based on KERMIT (Zanzotto et al., 2020). Through an extensive comparative analysis of the results, we show how the use of syntax can improve performance in terms of both accuracy and F1 score compared to previously proposed models. Finally, we explored the interpretative power of KERMIT-viz to explain the inferences made by neural networks on examples.

1 Introduction

People review practically anything on online sites, and understanding the polarity of a comment through an automatic sentiment classifier is a tantalizing challenge. In recent years, the number of online reviewers has drastically increased, and many products and services can be reviewed. Before buying a product or a service, people search the reviews of others who have already experienced it. Review portals are usually linked to leisure or business activities such as tourism, e-commerce, or movies. However, there are domains where these reviews, and the automatically computed sentiment associated with them, may lead users to select the wrong services, with potentially dramatic effects on their lives.
When dealing with health-related services, the effect of positive or negative reviews of hospitals and doctors can have a potentially catastrophic impact on the health of whoever uses this information. QSalute (https://www.qsalute.it/) is one of the most important Italian portals for reviews of hospitals, nursing homes, and doctors. It is very important for patients to be able to seek the best hospital for their condition based on the past experience of other patients. Reviews in the health domain benefit both patients and hospitals because they are a means to discover problems and solve them (Greaves et al., 2013; Khanbhai et al., 2021).

Automatic sentiment analyzers therefore bear a great responsibility in the context of health-related services. In such sensitive areas, it is important to design AI systems whose decisions are transparent (Doshi-Velez and Kim, 2017); that is, the systems must motivate their choices so that people can trust them. If users do not trust a model or a prediction, they will not use it (Ribeiro et al., 2016).

In this article, we investigate a model that can mitigate the responsibility of sentiment analyzers for health-related services. The model exploits syntactic information within neural networks to provide a clear visualization of the internal decision mechanism that produced each decision. We propose KERMITHC (KERMIT for HealthCare), based on KERMIT (Zanzotto et al., 2020), to solve the sentiment analysis task introduced by Bacco et al. (2020). We apply KERMITHC to reviews from the QSalute Italian portal in order to include symbolic knowledge as part of the architecture and to visualize the internal decision-making mechanism of the neural model using KERMIT-viz (Ranaldi et al., 2021).

In the rest of the paper, Section 2 gives details about the dataset and methods, Sections 3 and 4 describe the experiments and discuss the results, and Section 5 presents conclusions and future goals.

2 Data & Methods

To explore our hunch that syntactic interpretation may help in recognizing the sentiment of healthcare reviews, we leverage: (1) a healthcare training corpus (Sec. 2.1); (2) KERMITHC, which is based on syntactic interpretation and can explain its decisions; and (3) the challenges that can be addressed thanks to KERMITHC (Sec. 2.2).

2.1 Dataset

To investigate reviews in the healthcare area, we selected the QSalute portal, one of the most important health websites in Italy. This portal can be seen as the TripAdvisor of hospital facilities: reviews cover Expertise, Assistance, Cleaning, and Services. In addition to the reviews, there are associated metadata such as user id, hospital name, review title, and patient pathology. To ensure privacy, we do not consider sensitive data such as user id and hospital name.

We used a freely available scraper (https://github.com/lbacco/Italian-Healthcare-Reviews-4-Sentiment-Analysis) to download the dataset. Then, to cast this data as a sentiment analysis task, we followed the indications of Bacco et al. (2020): a review is (1) negative if the average of its scores is less than or equal to 2, (2) positive if the average of its scores is greater than or equal to 4, and (3) neutral otherwise.

The resulting dataset is composed of 47,224 reviews: 40,641 in the positive class, 3,898 in the neutral class, and 2,685 in the negative class. In this work, we consider only the positive and negative classes, so our final dataset is composed of 43,326 reviews. The dataset is heavily skewed toward reviews labeled as positive (93.80% positive, 6.20% negative).
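To make the labeling rule concrete, here is a minimal Python sketch; representing a review as a plain list of aspect scores is our assumption, since the paper does not specify the scraper's output format.

```python
# Minimal sketch of the labeling rule from Bacco et al. (2020).
# Assumption: a review carries a list of 1-5 ratings (e.g., for
# Expertise, Assistance, Cleaning, Services); the actual scraper
# output format is not specified in the paper.

def label_review(scores: list) -> str:
    """Map the average of a review's scores to a sentiment label."""
    avg = sum(scores) / len(scores)
    if avg <= 2:
        return "negative"
    if avg >= 4:
        return "positive"
    return "neutral"

# Example: a review rated 5, 4, 5, 4 across the four aspects.
assert label_review([5, 4, 5, 4]) == "positive"
```

Neutral reviews produced by this rule are then discarded, leaving the 43,326 positive and negative reviews described above.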
2.2 KERMIT 4 Healthcare

The KERMITHC (KERMIT for HealthCare) architecture is composed of three major parts: (1) a KERMIT model as described in Zanzotto et al. (2020), (2) a Transformer model, and (3) a decoder layer that combines the outputs of the previous two sub-parts. Figure 1 shows a graphical representation of the KERMITHC architecture, highlighting its components.

[Figure 1: KERMITHC architecture, forward and interpretation pass.]

This architecture makes KERMITHC a distinctive model because it combines the syntax offered by KERMIT with the versatility of a Transformer model. We use KERMIT because it allows the encoding of universal syntactic interpretations in a neural network architecture. The KERMIT component is itself composed of two parts: a KERMIT encoder, which converts a parse tree T into embedding vectors, and a multi-layer perceptron that exploits these embedding vectors. The second sub-part of our architecture is a Bidirectional Encoder Representations from Transformers (BERT) model that classifies the sentiment of the reviews. BERT is a pre-trained language model developed by Devlin et al. (2019) at Google AI Language. In particular, since the task concerns sentences in Italian, we used a BERT version pre-trained on that language, AlBERTo (Polignano et al., 2019).

3 Experiments

We used the KERMITHC architecture to examine whether the research questions posed for KERMIT (Zanzotto et al., 2020) can also be answered in the healthcare domain and in Italian. Those research questions are: (1) Can the symbolic knowledge provided by universal symbolic syntactic interpretations make a difference and be used effectively in neural networks? (2) Do universal symbolic syntactic interpretations encode different syntactic information than that encoded in "universal sentence embeddings"? (3) Can the universal symbolic syntactic interpretations provided by KERMITHC supply a better and clearer way to explain the decisions of neural networks than transformers provide?

To provide a comprehensive answer to these questions, we tested the architecture in a completely universal setting where both KERMIT and AlBERTo are trained only in the last decision layer. The rest of this section describes the experimental set-up and the quantitative results, and discusses how KERMIT-viz can be used to explain the decisions of neural network inferences over examples.

3.1 Experimental Set-up

This section describes the general experimental set-up and the specific configurations adopted. The parameters used for the KERMIT encoder are those proposed in Zanzotto et al. (2020). The constituency parse trees used for the KERMIT sub-part are obtained using our freely available script (https://github.com/LeonardRanaldi/Constituency-Parser-Italian).

We tested several BERT versions pre-trained on Italian in order to find the best model for our task. In particular, we tested the following transformers: (1) UmBERTo (Parisi et al., 2020); (2) AlBERTo (Polignano et al., 2019); (3) BERT multilingual (Devlin et al., 2018); and (4) ELECTRAita, an Italian version of the ELECTRA model (Clark et al., 2020) implemented by Schweter (2020) following the work of Chan et al. (2020). All models were implemented using Hugging Face's transformers library (Wolf et al., 2019) and used in the uncased setting with their pre-trained weights. The input text for BERT was preprocessed and tokenized as specified in the respective works (Parisi et al., 2020; Polignano et al., 2019; Devlin et al., 2018; Schweter, 2020).

Model               Average Accuracy   Average Macro F1   Average Weighted F1
UmBERTo             0.74 (±0.14)⋄      0.43 (±0.02)       0.75 (±0.18)◦
AlBERTo             0.82 (±0.15)⋄      0.47 (±0.05)†      0.80 (±0.14)◦
BERT multilingual   0.73 (±0.13)       0.46 (±0.10)†      0.73 (±0.22)
ELECTRAita          0.67 (±0.17)       0.40 (±0.13)       0.66 (±0.20)

Table 1: Performance of the tested BERT versions on 25% of the QSalute dataset. Mean and standard deviation are computed over 10 runs. The symbols ⋄, ◦, and † indicate a statistically significant difference between two results at the 95% confidence level under the sign test.
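As an illustration of this set-up, the following sketch loads the four models with Hugging Face's transformers library; the checkpoint identifiers are our best guesses for the publicly released versions of these models, not identifiers confirmed by the authors.

```python
# Sketch: loading the tested Italian models via Hugging Face.
# The checkpoint names below are assumptions (the paper does not
# list them); swap in the exact checkpoints used by the authors.
from transformers import AutoModel, AutoTokenizer

CHECKPOINTS = {
    "UmBERTo": "Musixmatch/umberto-commoncrawl-cased-v1",
    "AlBERTo": "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0",
    "BERT multilingual": "bert-base-multilingual-uncased",
    "ELECTRAita": "dbmdz/electra-base-italian-xxl-cased-discriminator",
}

def load(name: str):
    """Return the (tokenizer, model) pair for one tested checkpoint."""
    ckpt = CHECKPOINTS[name]
    return AutoTokenizer.from_pretrained(ckpt), AutoModel.from_pretrained(ckpt)

tokenizer, model = load("AlBERTo")
```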
Since our experiments address a text classification task, the decoder layer of our KERMITHC architecture is a fully connected layer with a softmax activation applied to the concatenation of the KERMIT sub-part output and the final [CLS] token representation of the selected transformer model. Finally, the optimizer used to train the whole architecture is AdamW (Loshchilov and Hutter, 2019) with the learning rate set to 2e-5. For reproducibility, the source code of our experiments is publicly available on our GitHub repository (https://github.com/ART-Group-it/KERMIT-4-Sentiment-Analysis-on-Italian-Reviews-in-Healthcare).
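A minimal PyTorch sketch of this decoder follows; the KERMIT output dimension (here 4,000) and the class count are illustrative assumptions, while the concatenation, the single fully connected layer with softmax, and the AdamW optimizer with learning rate 2e-5 come from the description above.

```python
# Sketch of the KERMITHC decoder: one fully connected layer with
# softmax over [KERMIT output ; transformer [CLS] representation].
# kermit_dim is an assumed illustrative value; bert_dim = 768 is
# the hidden size of BERT-base models such as AlBERTo.
import torch
import torch.nn as nn

class KermitHCDecoder(nn.Module):
    def __init__(self, kermit_dim: int = 4000, bert_dim: int = 768,
                 n_classes: int = 2):
        super().__init__()
        self.fc = nn.Linear(kermit_dim + bert_dim, n_classes)

    def forward(self, kermit_out: torch.Tensor,
                cls_repr: torch.Tensor) -> torch.Tensor:
        # Concatenate the two sub-part outputs, then classify.
        joint = torch.cat([kermit_out, cls_repr], dim=-1)
        return torch.softmax(self.fc(joint), dim=-1)

decoder = KermitHCDecoder()
# In the universal setting of Section 3, only this decision layer
# is trained; both encoders stay frozen.
optimizer = torch.optim.AdamW(decoder.parameters(), lr=2e-5)
```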
4 Results and Discussion

Syntactic information significantly improves performance in classifying healthcare reviews (see Table 2). KERMITHC uses AlBERTo, which is the best Italian BERT version according to our experiments (Table 1). In particular, KERMITHC outperforms the AlBERTo sub-part alone (Table 2).

As in the work of Bacco et al. (2020), we divided the dataset by "Site" and evaluated the models using accuracy and F1-score metrics. Despite this division, the dataset remains very unbalanced in favor of the positive class. We report results in terms of accuracy, Macro F1, and Weighted F1. Observing Table 2, the Macro F1 obtained by KERMITHC exceeds that of the best BERT configuration, AlBERTo, on nearly all Sites. Hence, trained on the healthcare review dataset (Bacco et al., 2020) (see Section 2.1), KERMITHC seems a good candidate for analyzing the sentiment of hospital patients.

Site              Model      Average Accuracy   Average Macro F1   Average Weighted F1
Pneumology        KERMITHC   0.71 (±0.14)       0.51 (±0.08)       0.70 (±0.11)
                  AlBERTo    0.66 (±0.27)       0.40 (±0.12)†      0.61 (±0.26)
Thoracic Surgery  KERMITHC   0.78 (±0.13)       0.51 (±0.07)       0.81 (±0.08)
                  AlBERTo    0.74 (±0.28)       0.43 (±0.13)       0.74 (±0.26)
Nervous System    KERMITHC   0.87 (±0.05)†      0.60 (±0.03)†      0.89 (±0.03)
                  AlBERTo    0.94 (±0.01)†      0.48 (±0.00)†      0.91 (±0.01)
Heart             KERMITHC   0.93 (±0.03)†      0.56 (±0.03)†      0.93 (±0.02)
                  AlBERTo    0.96 (±0.01)†      0.49 (±0.00)†      0.94 (±0.01)
Vascular Surgery  KERMITHC   0.81 (±0.16)       0.49 (±0.06)†      0.83 (±0.12)
                  AlBERTo    0.70 (±0.29)       0.42 (±0.11)†      0.73 (±0.23)
Ophthalmology     KERMITHC   0.79 (±0.08)       0.55 (±0.05)†      0.83 (±0.06)
                  AlBERTo    0.87 (±0.08)       0.48 (±0.02)†      0.86 (±0.04)
Rheumatology      KERMITHC   0.58 (±0.23)       0.43 (±0.11)       0.60 (±0.20)
                  AlBERTo    0.68 (±0.20)       0.44 (±0.10)       0.69 (±0.19)
Infections        KERMITHC   0.68 (±0.19)       0.51 (±0.12)       0.70 (±0.17)
                  AlBERTo    0.57 (±0.23)       0.42 (±0.13)       0.58 (±0.21)
Skin              KERMITHC   0.64 (±0.11)       0.50 (±0.07)       0.70 (±0.10)
                  AlBERTo    0.63 (±0.26)       0.39 (±0.11)       0.61 (±0.24)
Genital           KERMITHC   0.79 (±0.09)†      0.55 (±0.03)†      0.82 (±0.06)
                  AlBERTo    0.88 (±0.06)†      0.49 (±0.02)†      0.87 (±0.03)
Endoscopy         KERMITHC   0.75 (±0.09)       0.52 (±0.04)†      0.80 (±0.05)
                  AlBERTo    0.80 (±0.19)       0.45 (±0.07)†      0.78 (±0.17)
Facial            KERMITHC   0.70 (±0.24)       0.42 (±0.08)       0.76 (±0.18)
                  AlBERTo    0.72 (±0.26)       0.42 (±0.10)       0.76 (±0.22)
Oncology          KERMITHC   0.91 (±0.06)       0.52 (±0.04)†      0.92 (±0.03)
                  AlBERTo    0.89 (±0.21)       0.46 (±0.08)†      0.89 (±0.17)
Haematology       KERMITHC   0.56 (±0.30)       0.36 (±0.14)       0.57 (±0.31)
                  AlBERTo    0.41 (±0.25)       0.30 (±0.11)       0.46 (±0.23)
Endocrinology     KERMITHC   0.71 (±0.20)       0.48 (±0.12)       0.71 (±0.22)
                  AlBERTo    0.73 (±0.29)       0.41 (±0.13)       0.69 (±0.28)
Gynaecology       KERMITHC   0.82 (±0.08)       0.56 (±0.05)†      0.85 (±0.05)
                  AlBERTo    0.85 (±0.14)       0.48 (±0.04)†      0.84 (±0.09)
Otorhinology      KERMITHC   0.84 (±0.14)       0.50 (±0.06)       0.86 (±0.09)
                  AlBERTo    0.80 (±0.18)       0.46 (±0.05)       0.83 (±0.13)

Table 2: Performance of KERMITHC and AlBERTo on the QSalute dataset grouped by Site. Mean and standard deviation are computed over 10 runs. For each Site, the best-performing model is determined by the F1 scores obtained. The symbol † indicates a statistically significant difference between two results at the 95% confidence level under the sign test.
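As an aside, the significance markers in Tables 1 and 2 can be reproduced with a two-sided sign test; the sketch below assumes per-run scores for two models paired by run, which is our reading of the 10-run protocol, not a procedure detailed in the paper.

```python
# Sketch of a paired sign test at the 95% confidence level.
# Pairing the two models' scores by run is an assumption about
# the evaluation protocol; ties are discarded, as is standard.
from scipy.stats import binomtest

def sign_test(scores_a, scores_b, alpha=0.05):
    """Return (p-value, significant?) for paired score lists."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    p = binomtest(wins, wins + losses, p=0.5).pvalue
    return p, p < alpha

# Illustrative dummy scores for 10 runs of two models.
p_value, significant = sign_test(
    [0.87, 0.86, 0.88, 0.85, 0.89, 0.86, 0.88, 0.87, 0.86, 0.88],
    [0.84, 0.85, 0.83, 0.86, 0.84, 0.85, 0.83, 0.84, 0.85, 0.83],
)
```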
Using the KERMIT-viz visualizer, we analyzed how important the contribution of the symbolic knowledge provided by KERMIT can be; in many cases it makes all the difference. Figure 2 shows two sentences whose target class is positive. The first sentence (Fig. 2a) is clearly positive, while the second (Fig. 2b) could be ambiguous, as the patient makes bad remarks about the service but praises the head of the department. We can observe how some words are colored in red (meaning they received greater weight during the classification phase), emphasizing the positive aspects of the sentence and causing it to be labeled as a "positive review". In this way explainability is guaranteed, and on very delicate topics, like sentiment in health reviews, we can place more "trust" in sentiment analyzers.

[Figure 2: The visualizations offered by KERMIT-viz. (a) S: Uno staff di grandissima competenza e professionalità! ("A staff of the greatest competence and professionalism!") (b) S: Pessima assistenza e servizi assenti tranne il primario di reparto di neurochirurgia eccellente professionista ("Terrible assistance and absent services, except for the head of the neurosurgery ward, an excellent professional"). Both examples have the target class positive, but in the first it is easy to see the positivity; in the second, the reviewer criticizes the medical staff but at the same time praises the head of the department.]

5 Conclusion

In this article, we investigated a model that can mitigate the responsibility of sentiment analyzers for health-related services. Our model, KERMITHC, exploits syntactic information within neural networks to provide a clear visualization of its internal decision mechanism. KERMITHC is based on KERMIT (Zanzotto et al., 2020), and we applied it to the sentiment analysis task introduced by Bacco et al. (2020).

We studied several versions of BERT models pre-trained on Italian and found that AlBERTo is the best among them for this task. However, KERMITHC, which combines KERMIT and AlBERTo, outperforms the AlBERTo model alone. Additionally, via KERMIT-viz, we visualized the reasons behind KERMITHC's classifications. We observed how KERMITHC captures relevant syntactic information by catching the keywords in each sentence and giving them more weight in the decision phase, mitigating and catching possible errors of the sentiment analyzers. Our future goal is to gain full control of the sentiment analyzers by injecting human rules (Onorati et al., 2020) in order to mitigate possible errors.

References

Luca Bacco, A. Cimino, L. Paulon, M. Merone, and F. Dell'Orletta. 2020. A machine learning approach for sentiment analysis for Italian reviews in healthcare. In CLiC-it.

Branden Chan, Stefan Schweter, and Timo Möller. 2020. German's next language model. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6788–6796, Barcelona, Spain (Online), December. International Committee on Computational Linguistics.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.

Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning.

Felix Greaves, Daniel Ramirez-Cano, Christopher Millett, Ara Darzi, and Liam Donaldson. 2013. Use of sentiment analysis for capturing patient experience from free-text comments posted online. Journal of Medical Internet Research, 15:e239, 11.

Mustafa Khanbhai, Patrick Anyadi, Joshua Symons, Kelsey Flott, Ara Darzi, and Erik Mayer. 2021. Applying natural language processing and machine learning techniques to patient experience feedback: a systematic review. BMJ Health & Care Informatics, 28(1).

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019.

Dario Onorati, Pierfrancesco Tommasino, Leonardo Ranaldi, Francesca Fallucchi, and Fabio Massimo Zanzotto. 2020. Pat-in-the-loop: Declarative knowledge for controlling neural networks. Future Internet, 12(12).

Loreto Parisi, Simone Francia, and Paolo Magnani. 2020. UmBERTo: An Italian language model trained with whole word masking.

Marco Polignano, Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro, and Valerio Basile. 2019. AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019), volume 2481. CEUR.

Leonardo Ranaldi, Francesca Fallucchi, and Fabio Massimo Zanzotto. 2021. KERMITviz: Visualizing neural network activations on syntactic trees. In the 15th International Conference on Metadata and Semantics Research (MTSR'21), volume 1.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier.

Stefan Schweter. 2020. Italian BERT and ELECTRA models, November.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.0.

Fabio Massimo Zanzotto, Andrea Santilli, Leonardo Ranaldi, Dario Onorati, Pierfrancesco Tommasino, and Francesca Fallucchi. 2020. KERMIT: Complementing transformer architectures with encoders of explicit syntactic interpretations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 256–267, Online, November. Association for Computational Linguistics.