KERMIT for Sentiment Analysis in Italian Healthcare Reviews

Leonardo Ranaldi1, Michele Mastromattei2, Dario Onorati2, Elena Sofia Ruzzetti2, Francesca Fallucchi1, Fabio Massimo Zanzotto2
1. Dept. of Innovation and Information Engineering, Guglielmo Marconi University, Italy
2. Dept. of Enterprise Engineering, University of Rome Tor Vergata, Italy
l.ranaldi@unimarconi.com, elenasofia.ruzzetti@alumni.uniroma2.eu, {michele.mastromattei,fabio.massimo.zanzotto}@uniroma2.it, dario.onorati@uniroma1.it, f.fallucchi@unimarconi.it

Abstract

In this paper, we describe our approach to the sentiment classification challenge on Italian reviews in the healthcare domain. First, we followed the work of Bacco et al. (2020), from which we obtained the dataset. Then, we built our model, KERMITHC, based on KERMIT (Zanzotto et al., 2020). Through an extensive comparative analysis of the results, we show how the use of syntax can improve performance in terms of both accuracy and F1 score compared to previously proposed models. Finally, we explored the interpretative power of KERMIT-viz to explain the inferences made by neural networks on examples.

1 Introduction

People review practically anything on online sites, and understanding the polarity of a comment through an automatic sentiment classifier is a tantalizing challenge. In recent years, the number of online reviewers has drastically increased, and many products and services can be reviewed. Before buying a product or a service, people search the reviews of others who have already experienced it. Review portals are usually linked to leisure or business activities such as tourism, e-commerce, or movies. However, there are domains where these reviews, and the automatically computed sentiment associated with them, may lead users to select the wrong services, with potentially dramatic effects on their lives.
When dealing with health-related services, the effect of positive or negative reviews of hospitals and doctors can have a potentially catastrophic impact on the health of whoever uses this information. QSalute (https://www.qsalute.it/) is one of the most important Italian portals for reviews of hospitals, nursing homes, and doctors. It is very important for patients to be able to seek the best hospital for their condition based on the past experience of other patients. Reviews in the health domain benefit both patients and hospitals because they are a means to discover problems and solve them (Greaves et al., 2013; Khanbhai et al., 2021).

Automatic sentiment analyzers therefore bear a great responsibility in the context of health-related services. In such sensitive areas, it is important to design AI systems whose decisions are transparent (Doshi-Velez and Kim, 2017); that is, the systems must motivate their choices so that people can trust them. If users do not trust a model or a prediction, they will not use it (Ribeiro et al., 2016).

In this article, we investigate a model that can mitigate the responsibility of sentiment analyzers for health-related services. The model exploits syntactic information within neural networks to provide a clear visualization of the internal decision mechanism that produced each decision. We propose KERMITHC (KERMIT for HealthCare), based on KERMIT (Zanzotto et al., 2020), to solve the sentiment analysis task introduced by Bacco et al. (2020). We apply KERMITHC to reviews from the QSalute Italian portal in order to include symbolic knowledge as part of the architecture and to visualize the internal decision-making mechanism of the neural model using KERMIT-viz (Ranaldi et al., 2021).

In the rest of the paper, Section 2 gives details about the dataset and methods, Sections 3 and 4 describe the experiments and discuss the results, and Section 5 presents conclusions and future goals.

2 Data & Methods

To explore our hunch that syntactic interpretation may help in recognizing the sentiment of healthcare reviews, we leverage: (1) a healthcare training corpus (Sec. 2.1); (2) KERMITHC, which is based on syntactic interpretation and can explain its decisions; and (3) the challenges that can be addressed thanks to KERMITHC (Sec. 2.2).

2.1 Dataset

To investigate reviews in the healthcare area, we selected the QSalute portal, one of the most important health websites in Italy. This portal can be seen as the TripAdvisor of hospital facilities: reviews cover Expertise, Assistance, Cleaning, and Services. In addition to the reviews, there are associated metadata such as user id, hospital name, review title, and patient pathology. To ensure privacy, we do not consider sensitive data such as user id and hospital name.

We used a freely available scraper (https://github.com/lbacco/Italian-Healthcare-Reviews-4-Sentiment-Analysis) to download the dataset. Then, to cast this data as a sentiment analysis task, we followed the indications of Bacco et al. (2020): a review is (1) negative if the average of its scores is less than or equal to 2, (2) positive if the average of its scores is greater than or equal to 4, and (3) neutral otherwise.

The resulting dataset is composed of 47,224 reviews: 40,641 in the positive class, 3,898 in the neutral class, and 2,685 in the negative class. In this work, we consider only the positive and negative classes, so our final dataset is composed of 43,326 reviews. The dataset is heavily skewed toward reviews labeled as positive (93.80% positive, 6.20% negative).
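To make the labeling rule concrete, here is a minimal Python sketch; representing a review as a plain list of aspect scores is our assumption, since the paper does not specify the scraper's output format.

```python
# Minimal sketch of the labeling rule from Bacco et al. (2020).
# Assumption: a review carries a list of 1-5 ratings (e.g., for
# Expertise, Assistance, Cleaning, Services); the actual scraper
# output format is not specified in the paper.

def label_review(scores: list) -> str:
    """Map the average of a review's scores to a sentiment label."""
    avg = sum(scores) / len(scores)
    if avg <= 2:
        return "negative"
    if avg >= 4:
        return "positive"
    return "neutral"

# Example: a review rated 5, 4, 5, 4 across the four aspects.
assert label_review([5, 4, 5, 4]) == "positive"
```

Neutral reviews produced by this rule are then discarded, leaving the 43,326 positive and negative reviews described above.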
2.2 KERMIT 4 Healthcare

The KERMITHC (KERMIT for HealthCare) architecture is composed of three major parts: (1) a KERMIT model as described in Zanzotto et al. (2020), (2) a Transformer model, and (3) a decoder layer that combines the outputs of the previous two sub-parts. Figure 1 shows a graphical representation of the KERMITHC architecture, highlighting its components.

[Figure 1: KERMITHC architecture, forward and interpretation pass.]

This architecture makes KERMITHC a distinctive model because it combines the syntax offered by KERMIT with the versatility of a Transformer model. We use KERMIT because it allows the encoding of universal syntactic interpretations in a neural network architecture. The KERMIT component is itself composed of two parts: a KERMIT encoder, which converts a parse tree T into embedding vectors, and a multi-layer perceptron that exploits these embedding vectors. The second sub-part of our architecture is a Bidirectional Encoder Representations from Transformers (BERT) model that classifies the sentiment of the reviews. BERT is a pre-trained language model developed by Devlin et al. (2019) at Google AI Language. In particular, since the task concerns sentences in Italian, we used a BERT version pre-trained on that language, AlBERTo (Polignano et al., 2019).

3 Experiments

We used the KERMITHC architecture to examine whether the research questions posed for KERMIT (Zanzotto et al., 2020) can also be answered in the healthcare domain and in Italian. Those research questions are: (1) Can the symbolic knowledge provided by universal symbolic syntactic interpretations make a difference and be used effectively in neural networks? (2) Do universal symbolic syntactic interpretations encode different syntactic information than that encoded in "universal sentence embeddings"? (3) Can the universal symbolic syntactic interpretations provided by KERMITHC supply a better and clearer way to explain the decisions of neural networks than transformers provide?

To provide a comprehensive answer to these questions, we tested the architecture in a completely universal setting where both KERMIT and AlBERTo are trained only in the last decision layer. The rest of this section describes the experimental set-up and the quantitative results, and discusses how KERMIT-viz can be used to explain the decisions of neural network inferences over examples.

3.1 Experimental Set-up

This section describes the general experimental set-up and the specific configurations adopted. The parameters used for the KERMIT encoder are those proposed in Zanzotto et al. (2020). The constituency parse trees used for the KERMIT sub-part are obtained using our freely available script (https://github.com/LeonardRanaldi/Constituency-Parser-Italian).

We tested several BERT versions pre-trained on Italian in order to find the best model for our task. In particular, we tested the following transformers: (1) UmBERTo (Parisi et al., 2020); (2) AlBERTo (Polignano et al., 2019); (3) BERT multilingual (Devlin et al., 2018); and (4) ELECTRAita, an Italian version of the ELECTRA model (Clark et al., 2020) implemented by Schweter (2020) following the work of Chan et al. (2020). All models were implemented using Hugging Face's transformers library (Wolf et al., 2019) and used in the uncased setting with their pre-trained weights. The input text for BERT was preprocessed and tokenized as specified in the respective works (Parisi et al., 2020; Polignano et al., 2019; Devlin et al., 2018; Schweter, 2020).

Model               Average Accuracy   Average Macro F1   Average Weighted F1
UmBERTo             0.74 (±0.14)⋄      0.43 (±0.02)       0.75 (±0.18)◦
AlBERTo             0.82 (±0.15)⋄      0.47 (±0.05)†      0.80 (±0.14)◦
BERT multilingual   0.73 (±0.13)       0.46 (±0.10)†      0.73 (±0.22)
ELECTRAita          0.67 (±0.17)       0.40 (±0.13)       0.66 (±0.20)

Table 1: Performance of the tested BERT versions on 25% of the QSalute dataset. Mean and standard deviation are computed over 10 runs. The symbols ⋄, ◦, and † indicate a statistically significant difference between two results at the 95% confidence level under the sign test.
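As an illustration of this set-up, the following sketch loads the four models with Hugging Face's transformers library; the checkpoint identifiers are our best guesses for the publicly released versions of these models, not identifiers confirmed by the authors.

```python
# Sketch: loading the tested Italian models via Hugging Face.
# The checkpoint names below are assumptions (the paper does not
# list them); swap in the exact checkpoints used by the authors.
from transformers import AutoModel, AutoTokenizer

CHECKPOINTS = {
    "UmBERTo": "Musixmatch/umberto-commoncrawl-cased-v1",
    "AlBERTo": "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0",
    "BERT multilingual": "bert-base-multilingual-uncased",
    "ELECTRAita": "dbmdz/electra-base-italian-xxl-cased-discriminator",
}

def load(name: str):
    """Return the (tokenizer, model) pair for one tested checkpoint."""
    ckpt = CHECKPOINTS[name]
    return AutoTokenizer.from_pretrained(ckpt), AutoModel.from_pretrained(ckpt)

tokenizer, model = load("AlBERTo")
```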
Since our experiments address a text classification task, the decoder layer of our KERMITHC architecture is a fully connected layer with a softmax activation applied to the concatenation of the KERMIT sub-part output and the final [CLS] token representation of the selected transformer model. Finally, the optimizer used to train the whole architecture is AdamW (Loshchilov and Hutter, 2019) with the learning rate set to 2e-5. For reproducibility, the source code of our experiments is publicly available on our GitHub repository (https://github.com/ART-Group-it/KERMIT-4-Sentiment-Analysis-on-Italian-Reviews-in-Healthcare).
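A minimal PyTorch sketch of this decoder follows; the KERMIT output dimension (here 4,000) and the class count are illustrative assumptions, while the concatenation, the single fully connected layer with softmax, and the AdamW optimizer with learning rate 2e-5 come from the description above.

```python
# Sketch of the KERMITHC decoder: one fully connected layer with
# softmax over [KERMIT output ; transformer [CLS] representation].
# kermit_dim is an assumed illustrative value; bert_dim = 768 is
# the hidden size of BERT-base models such as AlBERTo.
import torch
import torch.nn as nn

class KermitHCDecoder(nn.Module):
    def __init__(self, kermit_dim: int = 4000, bert_dim: int = 768,
                 n_classes: int = 2):
        super().__init__()
        self.fc = nn.Linear(kermit_dim + bert_dim, n_classes)

    def forward(self, kermit_out: torch.Tensor,
                cls_repr: torch.Tensor) -> torch.Tensor:
        # Concatenate the two sub-part outputs, then classify.
        joint = torch.cat([kermit_out, cls_repr], dim=-1)
        return torch.softmax(self.fc(joint), dim=-1)

decoder = KermitHCDecoder()
# In the universal setting of Section 3, only this decision layer
# is trained; both encoders stay frozen.
optimizer = torch.optim.AdamW(decoder.parameters(), lr=2e-5)
```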
4 Results and Discussion

Syntactic information significantly improves performance in classifying healthcare reviews (see Table 2). KERMITHC uses AlBERTo, which is the best Italian BERT version according to our experiments (Table 1). In particular, KERMITHC outperforms the AlBERTo sub-part alone (Table 2).

As in the work of Bacco et al. (2020), we divided the dataset by "Site" and evaluated the models using accuracy and F1-score metrics. Despite this division, the dataset remains very unbalanced in favor of the positive class. We report results in terms of accuracy, Macro F1, and Weighted F1. Observing Table 2, the Macro F1 obtained by KERMITHC exceeds that of the best BERT configuration, AlBERTo, on nearly all Sites. Hence, trained on the healthcare review dataset (Bacco et al., 2020) (see Section 2.1), KERMITHC seems a good candidate for analyzing the sentiment of hospital patients.

Site              Model      Average Accuracy   Average Macro F1   Average Weighted F1
Pneumology        KERMITHC   0.71 (±0.14)       0.51 (±0.08)       0.70 (±0.11)
                  AlBERTo    0.66 (±0.27)       0.40 (±0.12)†      0.61 (±0.26)
Thoracic Surgery  KERMITHC   0.78 (±0.13)       0.51 (±0.07)       0.81 (±0.08)
                  AlBERTo    0.74 (±0.28)       0.43 (±0.13)       0.74 (±0.26)
Nervous System    KERMITHC   0.87 (±0.05)†      0.60 (±0.03)†      0.89 (±0.03)
                  AlBERTo    0.94 (±0.01)†      0.48 (±0.00)†      0.91 (±0.01)
Heart             KERMITHC   0.93 (±0.03)†      0.56 (±0.03)†      0.93 (±0.02)
                  AlBERTo    0.96 (±0.01)†      0.49 (±0.00)†      0.94 (±0.01)
Vascular Surgery  KERMITHC   0.81 (±0.16)       0.49 (±0.06)†      0.83 (±0.12)
                  AlBERTo    0.70 (±0.29)       0.42 (±0.11)†      0.73 (±0.23)
Ophthalmology     KERMITHC   0.79 (±0.08)       0.55 (±0.05)†      0.83 (±0.06)
                  AlBERTo    0.87 (±0.08)       0.48 (±0.02)†      0.86 (±0.04)
Rheumatology      KERMITHC   0.58 (±0.23)       0.43 (±0.11)       0.60 (±0.20)
                  AlBERTo    0.68 (±0.20)       0.44 (±0.10)       0.69 (±0.19)
Infections        KERMITHC   0.68 (±0.19)       0.51 (±0.12)       0.70 (±0.17)
                  AlBERTo    0.57 (±0.23)       0.42 (±0.13)       0.58 (±0.21)
Skin              KERMITHC   0.64 (±0.11)       0.50 (±0.07)       0.70 (±0.10)
                  AlBERTo    0.63 (±0.26)       0.39 (±0.11)       0.61 (±0.24)
Genital           KERMITHC   0.79 (±0.09)†      0.55 (±0.03)†      0.82 (±0.06)
                  AlBERTo    0.88 (±0.06)†      0.49 (±0.02)†      0.87 (±0.03)
Endoscopy         KERMITHC   0.75 (±0.09)       0.52 (±0.04)†      0.80 (±0.05)
                  AlBERTo    0.80 (±0.19)       0.45 (±0.07)†      0.78 (±0.17)
Facial            KERMITHC   0.70 (±0.24)       0.42 (±0.08)       0.76 (±0.18)
                  AlBERTo    0.72 (±0.26)       0.42 (±0.10)       0.76 (±0.22)
Oncology          KERMITHC   0.91 (±0.06)       0.52 (±0.04)†      0.92 (±0.03)
                  AlBERTo    0.89 (±0.21)       0.46 (±0.08)†      0.89 (±0.17)
Haematology       KERMITHC   0.56 (±0.30)       0.36 (±0.14)       0.57 (±0.31)
                  AlBERTo    0.41 (±0.25)       0.30 (±0.11)       0.46 (±0.23)
Endocrinology     KERMITHC   0.71 (±0.20)       0.48 (±0.12)       0.71 (±0.22)
                  AlBERTo    0.73 (±0.29)       0.41 (±0.13)       0.69 (±0.28)
Gynaecology       KERMITHC   0.82 (±0.08)       0.56 (±0.05)†      0.85 (±0.05)
                  AlBERTo    0.85 (±0.14)       0.48 (±0.04)†      0.84 (±0.09)
Otorhinology      KERMITHC   0.84 (±0.14)       0.50 (±0.06)       0.86 (±0.09)
                  AlBERTo    0.80 (±0.18)       0.46 (±0.05)       0.83 (±0.13)

Table 2: Performance of KERMITHC and AlBERTo on the QSalute dataset grouped by Site. Mean and standard deviation are computed over 10 runs. For each Site, the best-performing model is determined by the F1 scores obtained. The symbol † indicates a statistically significant difference between two results at the 95% confidence level under the sign test.
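As an aside, the significance markers in Tables 1 and 2 can be reproduced with a two-sided sign test; the sketch below assumes per-run scores for two models paired by run, which is our reading of the 10-run protocol, not a procedure detailed in the paper.

```python
# Sketch of a paired sign test at the 95% confidence level.
# Pairing the two models' scores by run is an assumption about
# the evaluation protocol; ties are discarded, as is standard.
from scipy.stats import binomtest

def sign_test(scores_a, scores_b, alpha=0.05):
    """Return (p-value, significant?) for paired score lists."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    p = binomtest(wins, wins + losses, p=0.5).pvalue
    return p, p < alpha

# Illustrative dummy scores for 10 runs of two models.
p_value, significant = sign_test(
    [0.87, 0.86, 0.88, 0.85, 0.89, 0.86, 0.88, 0.87, 0.86, 0.88],
    [0.84, 0.85, 0.83, 0.86, 0.84, 0.85, 0.83, 0.84, 0.85, 0.83],
)
```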
Using the KERMIT-viz visualizer, we analyzed how important the contribution of the symbolic knowledge provided by KERMIT can be; in many cases it makes all the difference. Figure 2 shows two sentences whose target class is positive. The first sentence (Fig. 2a) is clearly positive, while the second (Fig. 2b) could be ambiguous, as the patient makes bad remarks about the service but praises the head of the department. We can observe how some words are colored in red (meaning they received greater weight during the classification phase), emphasizing the positive aspects of the sentence and causing it to be labeled as a "positive review". In this way explainability is guaranteed, and on very delicate topics, like sentiment in health reviews, we can place more "trust" in sentiment analyzers.

[Figure 2: The visualizations offered by KERMIT-viz. (a) S: Uno staff di grandissima competenza e professionalità! ("A staff of the greatest competence and professionalism!") (b) S: Pessima assistenza e servizi assenti tranne il primario di reparto di neurochirurgia eccellente professionista ("Terrible assistance and absent services, except for the head of the neurosurgery ward, an excellent professional"). Both examples have the target class positive, but in the first it is easy to see the positivity; in the second, the reviewer criticizes the medical staff but at the same time praises the head of the department.]

5 Conclusion

In this article, we investigated a model that can mitigate the responsibility of sentiment analyzers for health-related services. Our model, KERMITHC, exploits syntactic information within neural networks to provide a clear visualization of its internal decision mechanism. KERMITHC is based on KERMIT (Zanzotto et al., 2020), and we applied it to the sentiment analysis task introduced by Bacco et al. (2020).

We studied several versions of BERT models pre-trained on Italian and found that AlBERTo is the best among them for this task. However, KERMITHC, which combines KERMIT and AlBERTo, outperforms the AlBERTo model alone. Additionally, via KERMIT-viz, we visualized the reasons behind KERMITHC's classifications. We observed how KERMITHC captures relevant syntactic information by catching the keywords in each sentence and giving them more weight in the decision phase, mitigating and catching possible errors of the sentiment analyzers. Our future goal is to gain full control of the sentiment analyzers by injecting human rules (Onorati et al., 2020) in order to mitigate possible errors.

References

Luca Bacco, A. Cimino, L. Paulon, M. Merone, and F. Dell'Orletta. 2020. A machine learning approach for sentiment analysis for Italian reviews in healthcare. In CLiC-it.

Branden Chan, Stefan Schweter, and Timo Möller. 2020. German's next language model. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6788–6796, Barcelona, Spain (Online), December. International Committee on Computational Linguistics.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.

Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning.

Felix Greaves, Daniel Ramirez-Cano, Christopher Millett, Ara Darzi, and Liam Donaldson. 2013. Use of sentiment analysis for capturing patient experience from free-text comments posted online. Journal of Medical Internet Research, 15:e239, 11.

Mustafa Khanbhai, Patrick Anyadi, Joshua Symons, Kelsey Flott, Ara Darzi, and Erik Mayer. 2021. Applying natural language processing and machine learning techniques to patient experience feedback: a systematic review. BMJ Health & Care Informatics, 28(1).

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019.

Dario Onorati, Pierfrancesco Tommasino, Leonardo Ranaldi, Francesca Fallucchi, and Fabio Massimo Zanzotto. 2020. Pat-in-the-loop: Declarative knowledge for controlling neural networks. Future Internet, 12(12).

Loreto Parisi, Simone Francia, and Paolo Magnani. 2020. UmBERTo: An Italian language model trained with whole word masking.

Marco Polignano, Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro, and Valerio Basile. 2019. AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019), volume 2481. CEUR.

Leonardo Ranaldi, Francesca Fallucchi, and Fabio Massimo Zanzotto. 2021. KERMITviz: Visualizing neural network activations on syntactic trees. In the 15th International Conference on Metadata and Semantics Research (MTSR'21), volume 1.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier.

Stefan Schweter. 2020. Italian BERT and ELECTRA models, November.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.0.

Fabio Massimo Zanzotto, Andrea Santilli, Leonardo Ranaldi, Dario Onorati, Pierfrancesco Tommasino, and Francesca Fallucchi. 2020. KERMIT: Complementing transformer architectures with encoders of explicit syntactic interpretations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 256–267, Online, November. Association for Computational Linguistics.