A Preliminary Evaluation of a Privacy-Preserving Dialogue System

Bettina Fazzinga*1, Andrea Galassi*2(B), and Paolo Torroni*2

1 ICAR CNR, Rende, Italy; DICES, University of Calabria, Rende, Italy
bettina.fazzinga@unical.it
2 DISI, University of Bologna, Bologna, Italy
a.galassi@unibo.it, paolo.torroni@unibo.it

Abstract. Dialogue systems are AI applications widely used in many contexts requiring user interaction. However, unconstrained interaction may lead to users communicating sensitive data. This raises concerns about how these systems handle personal data, and about their compliance with relevant laws, regulations, and ethical principles. We propose to integrate advanced natural language processing techniques in a dialogue system architecture based on computational argumentation, ensuring that user data are ethically managed and regulations are respected. A preliminary experimental evaluation of our proposal over a COVID-19 vaccine information case study shows promising results.

Keywords: Dialogue systems · Sentence embeddings · Data Protection · Expert systems · Chatbots · COVID-19.

1 Introduction

The idea of an artificial agent capable of communicating with the user through natural language has inspired researchers since the early days of artificial intelligence. Recent developments in language technologies have nourished this ambition further, and the full maturation of intelligent dialogue systems no longer seems a distant dream. Their adoption makes immediate support available to any user, making them incredibly valuable for companies and public administrations alike. In fact, they are being used by public administrations to help citizens request services,3 but also to provide updates and information on pressing matters, such as COVID-19 [23].4

* Equal contribution.
3 https://www.canada.ca/en/employment-social-development/services/my-account/terms-use-chatbot.html
4 https://government.economictimes.indiatimes.com/news/digital-india/covid-19-govt-launches-facebook-and-messenger-chatbot/74843125

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The pervasive presence of information-providing chatbots and assistive dialogue systems in many delicate contexts raises the need for trustworthy AI methods, which can guarantee citizens protection against possible misuses of technology. We believe that trustworthiness demands transparency, explainability, and correctness, and that it requires architectural choices that take data access into account from the very beginning. In fact, chatbots should not only process data through transparent and verifiable methods following appropriate regulations, but also provide explanations of their outputs in a manner adapted to the intended (human) user. This is especially true in the public sector and when the interaction among different legal entities is involved.

In our earlier work [11], we identified a combination of computational argumentation and language technology as a possible answer to some of these challenges. We described an architecture for AI dialogue systems where user interaction is carried out in natural language, both to provide information to the user and to answer user queries about the reasons leading to the system output (explainability).
We proposed to use computational argumentation techniques to realize a transparent reasoning module with a rigorous, verifiable semantics (transparency, auditability). We also underlined the importance of modularity in the architecture's design, to decouple the natural language interface, where user data is processed, from the reasoning module, where expert knowledge is used to generate outputs (privacy and data governance). In [11], we focused on the computational argumentation module, describing how the system computes answers. The focus of this work, instead, is on the language module's design and its initial evaluation. The main idea is simple: to use sentence embeddings and a similarity function to match user inputs with a set of natural language sentences describing relevant facts. In order to evaluate whether this concept may work in practice, we constructed a small dataset of sentences describing user information in possible dialogues regarding COVID-19 vaccines. For example, we encoded different ways users may express whether they suffer from drug allergies, or are immunosuppressed. We ran a preliminary experiment to compare different sentence embeddings and hyperparameters, obtaining encouraging results.

Our presentation starts by discussing related approaches (Section 2). Section 3 gives a high-level description of the system architecture, while we illustrate the implementation in more detail in Section 4. In Section 5 we offer an initial empirical evaluation of the language module, pointing to the feasibility of the approach in real-world contexts. We conclude and address future developments in Section 6.

2 Related Work

Our work is positioned at the intersection of two areas: computational argumentation and natural language understanding. While computational argumentation has had significant applications in the context of automated dialogues among software agents, its combination with systems able to interact in natural language in socio-technical systems is more recent [5].

Dialogue systems are typically divided into conversational agents, which support open-domain dialogues, and task-oriented agents, which assist the user in a specific task [7,8]. Our proposal falls into the second category: the task is to obtain information on a specific topic. The advancement of deep learning techniques and their successful application to many Natural Language Processing tasks has led researchers to investigate the use of neural architectures for end-to-end dialogue systems [27,22]. However, these architectures have downsides too. Their training phase usually has a heavy computational footprint and requires the construction of large corpora for the specific use cases. Moreover, they are often vulnerable to biases, privacy violations, adversarial attacks, and safety concerns [1,10,14,21]. Finally, reuse and adaptation to a different domain typically require building a new training corpus and a complete retrain. Given our focus on user protection and our aim to develop a general, data-independent approach, our system is modular (as opposed to end-to-end) and does not involve any training phase. Since it only uses off-the-shelf tools, it can be applied to new contexts without having to construct new training corpora.

The protection of users' identity and personal information is usually addressed through redaction [34] or sanitization [4] methods.
Such techniques are usually built on large but domain-specific datasets [24,31]. Nonetheless, they are still far from guaranteeing zero risk to the user [19], and they often focus only on predefined categories of entities, ignoring elements that may play a role in re-identifying the individual [20].

Our proposal is general-purpose and aims at maximizing user protection. The main idea is that user information is neither shared nor stored in the application. Instead, it is replaced by a collection of general, "sanitized" information elements that are pertinent to the case at hand. Our approach is akin to Information Retrieval-based chatbots, where dialogue agents retrieve their answers from a knowledge base made of dialogues, treating the user's sentences as queries. In the same vein, Charras et al. [6] use sentence similarity to retrieve the desired answer from a knowledge base made of dialogues, while Chalaguine and Hunter [5] retrieve an answer from a graph. Both works compare sentences through the cosine similarity between their TF-IDF representations, but Charras et al. also explore the use of the doc2vec [18] representation. However, the design of these approaches includes neither a conversation history nor the possibility of retrieving multiple information elements within a single interaction. This is a strong limitation in real-world scenarios, where information cannot be considered in isolation: on the contrary, multiple pieces must be considered at the same time, regardless of whether they have been communicated in a single sentence or at different points in the dialogue. Another limitation of previous approaches is their reliance on lexical rather than semantic similarity. Conversely, we use sentence-level embeddings, which enable semantic similarity measures. Moreover, we consider the possibility of retrieving multiple information elements in a single interaction and of maintaining a history of retrieved concepts, thanks to a reasoning module based on argumentation.

3 System Architecture

Our architecture consists of two main modules: the language module and the argumentation module. The former, which is the focus of this work, is responsible for user interaction: it processes user input and generates answers, all in natural language. The latter, described in our previous work [11], receives the processed information and reasons over it, so as to find the appropriate answer, according to a knowledge base (KB) of the domain of interest.

Fig. 1. System architecture and example of interaction with the user. Natural language sentences are represented as rectangles and indicated with S, while circles are used for status and reply nodes (indicated respectively with N and R). We represent a case where nodes and sentences refer to two concepts, A and B, and the user sentence regards B. The information provided by the user is represented in green and with diagonal stripes. It is easy to see that such information does not reach the argumentation module.
The KB is built by domain experts and consists of an argumentation graph, with a node for each possible relevant piece of information that users could communicate to the system, called status nodes (which thus contain factual information about possible users' conditions/statuses), and a node for each possible answer, called reply nodes. A set of natural language sentences is also associated with each status node. In this way, we have a natural language representation of the possible ways a user could express what a node is meant to encode. These different representations of facts could be produced by domain experts or crowd-sourced, as proposed by Chalaguine and Hunter [5].

The interaction with the user is represented in Figure 1 and is structured as follows:

1. The user inputs one or more sentences.
2. The language module compares each sentence with the KB sentences associated with status nodes, obtaining a set of "matched" sentences, corresponding to a set of "activated" status nodes. These nodes, collectively, represent the specific use case described by the user.
3. The list of activated status nodes is sent to the argumentation module.
4. The argumentation module performs reasoning over the activated nodes, resulting in an answer, or in a request for more information. Either way, a node is selected and this selection is communicated to the language module.
5. The language module elaborates the output of the argumentation module and produces a natural language reply to the user.

Since the sentence provided by the user may match multiple KB sentences, multiple nodes may be activated in a single interaction. At the same time, the system allows the user to provide information over multiple interactions, since the argumentation module stores the history of activated nodes in its "memory".

We shall remark that such an architecture protects the privacy of its users on two levels. First of all, the system ignores any information that does not match a KB sentence. The KB is not user-specific: it only represents general knowledge, in a user-independent way. Therefore, any information not strictly relevant to the scenario is filtered out. Then, during step 3, the argumentation module receives only the list of activated nodes, not the sentences as the user formulated them. The outcome of this procedure is similar to the one produced by a "sanitization" process [4], since all the relevant information is kept, but in a form that is general and does not contain any information that may lead to the identification of the user.

Our proposal is therefore suited to any scenario where the language module is authorized and entrusted to manage the user's data, but the argumentation module is not. One such scenario is a client-server implementation, where the client side includes the language module, while the argumentation module resides on the server side. In this case, all the personal information of the user remains on the client side, and only its sanitized version reaches the server. Another possibility is a multi-agent system, where the two modules are managed by different organizations, e.g., a service provided jointly by a government and by a private company.

Last but not least, the reasoning module is transparent, rigorous, and verifiable, allowing users to request more information regarding the provided answer. More details about this process are discussed in Fazzinga et al. [11].
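To make the data flow of steps 1-5 concrete, the following Python sketch shows one possible shape of the interaction loop. All names (StatusNode, ArgumentationModule, dialogue_turn, is_match) and the toy reply policy are our own illustrative assumptions, not the actual implementation; the point is simply that the argumentation module receives node identifiers, never the user's text.

```python
# A minimal sketch of the interaction loop of Figure 1 (steps 1-5), assuming a
# generic is_match predicate supplied by the language module. All names and the
# toy reply policy are illustrative, not the authors' implementation.
from dataclasses import dataclass, field

@dataclass
class StatusNode:
    node_id: str
    sentences: list  # KB sentences expressing the fact this node encodes

@dataclass
class ArgumentationModule:
    reply_policy: dict                         # toy policy: frozenset of node ids -> reply id
    memory: set = field(default_factory=set)   # history of activated nodes ("memory")

    def reason(self, activated):
        # Step 4: reason over the history of activated nodes (details in [11]).
        self.memory |= activated
        return self.reply_policy.get(frozenset(self.memory), "ASK_MORE_INFO")

def dialogue_turn(user_sentences, kb_nodes, arg_module, is_match):
    # Step 2: match user sentences against KB sentences -> activated status nodes.
    activated = {n.node_id for s in user_sentences for n in kb_nodes
                 if any(is_match(s, kb_s) for kb_s in n.sentences)}
    # Step 3: only node identifiers cross the module boundary, never the raw user text.
    selected = arg_module.reason(activated)
    # Step 5: the language module would render the selected node as a NL reply (elided here).
    return selected
```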
4 Language Module

One of the main objectives of our proposal is to keep the approach as general as possible. While many scenarios may benefit from tailored NLP solutions, their construction may be too costly, or even impossible. We have therefore decided to follow previous works [6,5] and to assess which KB sentences match the user's ones by computing the similarity between their embedded representations. However, instead of relying on simple syntactic representations, we propose to use state-of-the-art techniques, apt to capture the semantic content of the sentences.

In particular, we encode both the user sentences and the KB sentences using sentence embeddings. These are high-dimensional numerical representations of textual sentences that can be computed using (pre-trained) neural architectures. Many embeddings have been proposed over the years [26,25], and modern attention-based [13] sentence embeddings such as BERT [9] not only model the syntactic content and structure of a sentence, but also capture its meaning. Ideally, if two sentences have a similar meaning, they will be mapped onto similar sentence embeddings. Sentence embeddings have been used successfully in a variety of NLP tasks, including hard ones such as understanding negations and speculations, and have been shown to outperform traditional rule-based systems [30]. Among the many possible models, we have decided to focus on Sentence-BERT models [28], which are specifically trained to perform well on sentence similarity tasks. While it is possible to train new models for specific domains or tasks, many pre-trained models are already available and can be used as off-the-shelf tools, without the need to create a corpus or to perform training or fine-tuning steps.

The similarity between two embeddings can be computed using any similarity function that operates on high-dimensional numerical vectors. We use the Bray-Curtis similarity [2], since it has led to satisfactory results in related settings before [12], but other measures, such as cosine similarity [16], may be equally valid options. A possible alternative to the use of sentence embeddings combined with a similarity measure may be the use of neural architectures specifically trained to perform this task, such as cross-encoders [28]. However, the computational footprint of these techniques may be too heavy in most contexts, since they require encoding and processing every possible sentence pair at every step of the interaction.

Given a measure of similarity between two sentences, we transform it into a Boolean value by applying a threshold, which is a hyperparameter of the architecture. In this way, we discriminate between the pairs of sentences that are similar enough to be considered "a match" and those that are not.
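As an illustration, the snippet below sketches this matching step with the sentence-transformers library and SciPy, under two assumptions of ours: that the Bray-Curtis similarity is computed as one minus the SciPy Bray-Curtis distance, and that the "paraphrase-mpnet" model mentioned later corresponds to the publicly available paraphrase-mpnet-base-v2 checkpoint.

```python
# A sketch of the embedding-and-threshold matching step. The checkpoint name and
# the "1 - Bray-Curtis distance" conversion are assumptions made for illustration.
from scipy.spatial.distance import braycurtis
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-mpnet-base-v2")  # assumed checkpoint

def is_match(user_sentence, kb_sentence, threshold=0.65):
    # Encode both sentences with the same pre-trained Sentence-BERT model.
    u, k = model.encode([user_sentence, kb_sentence])
    similarity = 1.0 - braycurtis(u, k)  # Bray-Curtis similarity [2]
    return similarity >= threshold       # the threshold is a hyperparameter

# Example: a user formulation against a KB sentence from Table 1.
print(is_match("I recently discovered that I am celiac", "I am celiac"))
```

In a deployment, the KB embeddings would naturally be pre-computed once, rather than re-encoded at every turn.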
Table 1. Sentences used in our case study and the status node they are associated with.

Node ID  Sent. ID  Sentence
N1       S1        I am celiac
N1       S2        I suffer from the celiac disease
N1       S3        I am afflicted with the celiac disease
N1       S4        I have the celiac disease
N1       S5        I recently found out to be celiac
N1       S6        I have suffered from celiac disease since birth
N2       S7        I do not have the celiac disease
N2       S8        I am not celiac
N2       S9        I do not suffer from the celiac disease
N2       S10       I am not afflicted with the celiac disease
N3       S11       I am not immunosuppressed
N3       S12       I do not suffer from immunosuppression
N3       S13       I am not afflicted with immunosuppression
N4       S14       I am immunosuppressed
N4       S15       I suffer from immunosuppression
N4       S16       I am afflicted with immunosuppression
N4       S17       I do suffer from immunosuppression
N4       S18       I indeed suffer from immunosuppression
N4       S19       I recently found out to be immunosuppressed
N5       S20       I do not have any drug allergy
N5       S21       I do not suffer from drug allergies
N5       S22       I do not suffer from any drug allergy
N5       S23       I am not afflicted with any drug allergy
N5       S24       I do not have medication allergies
N5       S25       I do not have any medication allergy
N6       S26       I have a drug allergy
N6       S27       I do have a drug allergy
N6       S28       I have a serious drug allergy
N6       S29       I suffer from drug allergy
N6       S30       I am afflicted with drug allergies
N6       S31       I suffer from medication allergies

5 Experimental Evaluation

To assess the effectiveness of our language module based on sentence embeddings and similarity measures, we ran a preliminary experiment on a small dataset built around the use case of vaccines for COVID-19. We are especially interested in evaluating our method on sentences with a similar syntactic structure but a different meaning (e.g., a sentence and its negation).

5.1 Setting

In the context of COVID-19 vaccines, our dialogue system helps users understand whether or not, and if so where (hospital or generic vaccination site), they can get vaccinated, depending on their health status. Our KB has been built from the information published by the Italian Medicines Agency (AIFA) on their website (https://www.aifa.gov.it/en/). For example, for people suffering from diabetes, no special recommendation is given, so our system will tell users that they can be vaccinated at any site (without the need to go to the hospital), while if they suffer from bronchial asthma, our system will tell them to get vaccinated at the hospital. In this context, it is essential for our system to perfectly understand users' health conditions, so, in the following, we focus on the matching phase between user sentences and the information stored in the KB.

We consider a case study with a KB made of only 6 status nodes, corresponding to the presence/absence of 3 particular medical conditions, i.e., celiac disease, immunosuppression, and drug allergy. For each node, our KB contains from 3 to 6 sentences that can be used to express the same concept (see Table 1). Instead of using an additional set of sentences to simulate the user input, we compare the KB sentences with each other and verify whether sentences belonging to the same node match.

To evaluate our method quantitatively, we treat it as a binary classification task on every possible pair of (different) sentences. If the two sentences belong to the same status node, their pair is considered a positive instance; otherwise, it is considered negative. In our experiments we compare different sentence embedding models and different threshold criteria.
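One possible rendering of this pairwise protocol is sketched below, building on the similarity function described in Section 4. The function and variable names are illustrative, and the assumption that the distribution-based thresholds introduced next (mean, mean+std) are computed over all pairwise similarity scores is ours.

```python
# A sketch of the pairwise evaluation protocol: every pair of distinct KB
# sentences is labeled positive if both belong to the same status node.
from itertools import combinations
import numpy as np

def evaluate(kb, similarity, threshold="mean+std"):
    """kb: list of (node_id, sentence) pairs as in Table 1;
    similarity: a function returning a score for two sentences."""
    pairs = list(combinations(range(len(kb)), 2))
    scores = np.array([similarity(kb[i][1], kb[j][1]) for i, j in pairs])
    gold = np.array([kb[i][0] == kb[j][0] for i, j in pairs])  # same node -> positive

    if threshold == "mean":                  # threshold derived from the score distribution
        t = scores.mean()
    elif threshold == "mean+std":
        t = scores.mean() + scores.std()
    else:                                    # fixed value, e.g. 0.75, 0.70, 0.65
        t = float(threshold)
    pred = scores >= t

    tp = np.sum(pred & gold)
    precision = tp / max(pred.sum(), 1)      # correctness: how often a match is right
    recall = tp / max(gold.sum(), 1)         # coverage: how few true matches are missed
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1
```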
For sentence embeddings we evaluate the following Sentence-BERT [28] models:5

– stsb-mpnet: based on MPNet [33] and pre-trained for semantic similarity on the STS benchmark [3].
– paraphrase-mpnet: based on MPNet and pre-trained for paraphrase mining.
– paraphrase-TinyBERT-L6: based on TinyBERT [15] and pre-trained for paraphrase mining.
– paraphrase-MiniLM-L3: based on MiniLM [35] and pre-trained for paraphrase mining.
– nq-distilbert: based on DistilBERT [32] and pre-trained for question answering on Google's Natural Questions dataset [17].
– paraphrase-multilingual-mpnet: multilingual extension [29] of the monolingual model, which we include in view of future multilingual applications.

5 All the implementations of the models are taken from http://www.sbert.net/.

We also include a TF-IDF representation, as in Charras et al. [6] and Chalaguine and Hunter [5], using the entire set of sentences to create the vocabulary. As thresholds, we use three arbitrary values (0.75, 0.70, 0.65), plus two values based on the distribution of the similarity scores: one is the average of the similarities (mean), and the other is the sum of the average similarity and the standard deviation (mean+std).

For each combination of model and threshold, we measure precision, recall, and F1 score of the positive class (see Table 2). Precision is especially important: false positives can be seen as cases where the system "misunderstands" the input of the user, and therefore precision can be seen as a measure of correctness. Recall, instead, can be seen as a measure of the ability of the system not to "miss" information communicated by the user. For the purposes of our system, poor recall is a less serious problem than poor precision, since the argumentation module proactively asks the user for missing bits of information that would influence the final result. In our perspective, the priority must be to guarantee the correctness of the final answer, even if this means that the system will, in some cases, ask for information that the user has already submitted. For this reason, we use precision as the main evaluation metric.

5.2 Results and Discussion

Our results clearly show that the stsb-mpnet and paraphrase-mpnet models are the best ones, with the former achieving perfect precision with all the fixed similarity thresholds and the latter achieving equivalent or even better F1 scores with every threshold. In particular, they both achieve an almost perfect result (only one false positive, no false negatives) using the mean+std threshold. The paraphrase-multilingual-mpnet model performs slightly worse than the monolingual version, providing encouraging results in view of future multilingual applications. The TF-IDF model performs worst with all threshold values, probably due in part to the small size of the vocabulary.

Table 3 shows an example of matching using sentences from S1 to S19, which are those related to the status nodes "Has celiac disease", "Does not have celiac disease", "Is immunosuppressed", and "Is not immunosuppressed". The matches are computed by the stsb-mpnet and paraphrase-mpnet models using a threshold value of 0.65. The former achieves perfect precision but not perfect recall, and indeed we can see that it misses some matches, such as the pair formed by S8 and S10.
The latter reaches perfect recall but not perfect precision, which indicates the presence of false positives, e.g., the pair formed by S1 and S8. Some of these false positives might be particularly dangerous in a real application, since they mean that the system has mistaken a sentence for its negation, e.g., the sentence "I am not celiac" for "I am celiac". The argumentation module would be able to detect such conflicts, and in future work we plan to include conflict resolution modules and procedures. A careful user experience design may also be able to mitigate the issue, for instance by displaying relevant pieces of information interactively as they are understood by the system.

These results are encouraging and motivate us to continue along this research direction. Nonetheless, our research is still in its early stages and we are aware that a proper and sound evaluation of the whole proposal would require more nodes and a rigorous split between calibration and test sentences, and should eventually be validated by human testers.

Table 2. Experimental results of the embedding models and the threshold criteria on the sentence matching task.

Embedding Model                  Threshold  P     R     F1
stsb-mpnet                       mean       0.33  1.00  0.50
                                 mean+std   0.99  1.00  0.99
                                 0.75       1.00  0.67  0.80
                                 0.70       1.00  0.86  0.92
                                 0.65       1.00  0.97  0.99
paraphrase-mpnet                 mean       0.32  1.00  0.49
                                 mean+std   0.99  1.00  0.99
                                 0.75       1.00  0.86  0.92
                                 0.70       1.00  0.94  0.97
                                 0.65       0.96  1.00  0.98
paraphrase-TinyBERT-L6           mean       0.40  1.00  0.57
                                 mean+std   0.72  0.99  0.83
                                 0.75       1.00  0.46  0.63
                                 0.70       0.94  0.70  0.80
                                 0.65       0.81  0.94  0.87
paraphrase-MiniLM-L3             mean       0.43  1.00  0.60
                                 mean+std   0.55  0.96  0.70
                                 0.75       0.81  0.43  0.57
                                 0.70       0.66  0.61  0.63
                                 0.65       0.57  0.87  0.69
nq-distilbert                    mean       0.37  1.00  0.54
                                 mean+std   0.50  0.75  0.60
                                 0.75       0.96  0.33  0.49
                                 0.70       0.64  0.46  0.54
                                 0.65       0.58  0.64  0.61
paraphrase-multilingual-mpnet    mean       0.31  1.00  0.47
                                 mean+std   0.99  0.97  0.98
                                 0.75       1.00  0.81  0.90
                                 0.70       0.98  0.93  0.96
                                 0.65       0.90  1.00  0.95
TF-IDF                           mean       0.27  0.71  0.39
                                 mean+std   0.34  0.39  0.36
                                 0.75       0.38  0.07  0.12
                                 0.70       0.33  0.07  0.12
                                 0.65       0.50  0.14  0.22

Table 3. Matches computed by the models using the 0.65 threshold value on sentences from S1 to S19. The + symbol indicates the correct matches. The • symbol indicates the matches computed using the stsb-mpnet model. The ◦ symbol indicates the matches computed using the paraphrase-mpnet model. Matches are listed row by row for each sentence.

S1:  +◦•  +◦•  +◦•  +◦•  +◦•  +◦•  ◦
S2:  +◦•  +◦•  +◦•  +◦•  +◦•  +◦•
S3:  +◦•  +◦•  +◦•  +◦•  +◦•  +◦•
S4:  +◦•  +◦•  +◦•  +◦•  +◦•  +◦•  ◦
S5:  +◦•  +◦•  +◦•  +◦•  +◦•  +◦•
S6:  +◦•  +◦•  +◦•  +◦•  +◦•  +◦•  ◦
S7:  +◦•  +◦•  +◦•  +◦•  ◦
S8:  +◦•  +◦•  +◦•  +◦
S9:  +◦•  +◦•  +◦•  +◦•
S10: +◦•  +◦   +◦•  +◦•
S11: +◦•  +◦•  +◦   ◦
S12: +◦•  +◦•  +◦•
S13: +◦   +◦•  +◦•  ◦
S14: +◦•  +◦•  +◦•  +◦•  +◦•  +◦•
S15: +◦•  +◦•  +◦•  +◦•  +◦•  +◦•
S16: +◦•  +◦•  +◦•  +◦•  +◦•  +◦•
S17: +◦•  +◦•  +◦•  +◦•  +◦•  +◦•
S18: +◦•  +◦•  +◦•  +◦•  +◦•  +◦•
S19: +◦•  +◦•  +◦•  +◦•  +◦•  +◦•

6 Conclusion

We proposed the integration of advanced sentence embeddings into a modular dialogue system architecture based on argumentation, so as to support privacy by design. In particular, the language module is the only module that processes user input, and its output to the argumentation module is devoid of any sensitive, personal, or irrelevant piece of information the user may have written. The output of this module can therefore be seen as the anonymized and sanitized version of the user's sentences.
This makes the system amenable to distributed, multi-party implementations, where domain knowledge representation and reasoning may be left to third parties, and the user interface completely decouples the user input from the arguments used in the reasoning. We shall point out that guaranteeing the anonymization of user data may not only be a desirable feature, but even a legal requirement in some contexts, such as those regulated by the EU's GDPR.6 Importantly, the architecture is general-purpose and does not require domain-specific training or reference corpora.

The COVID-19 vaccines case study has provided the context for a preliminary experimental evaluation. Our results indicate that the use of sentence embeddings computed by pre-trained neural architectures greatly outperforms the TF-IDF model used in other approaches, leading to precise matches. We also emphasized the importance of precision and correctness over recall.

In future developments we aim at extending our experimental evaluation, including human testers in the loop. We also want to investigate additional case studies, potentially involving languages other than English. It would also be interesting to extend our architecture with techniques for the detection and resolution of conflicts, especially false positives, both in the reasoning module and in the language module. Finally, we would like to give users the possibility to directly correct matches. That could further improve the transparency of our architecture and reduce the number of false positives. However, it would require redesigning the user interaction, which is now intentionally simple, possibly making it more complicated and less intuitive.

6 See https://eur-lex.europa.eu/eli/reg/2016/679/oj.

Acknowledgments

The research reported in this work was partially supported by the EU H2020 ICT48 project "Humane AI Net" under contract #952026.

References

1. Barikeri, S., Lauscher, A., Vulic, I., Glavas, G.: RedditBias: A real-world resource for bias evaluation and debiasing of conversational language models. In: ACL/IJCNLP (1), pp. 1941–1955. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.acl-long.151
2. Bray, J.R., Curtis, J.T.: An ordination of the upland forest communities of Southern Wisconsin. Ecological Monographs 27(4), 325–349 (1957). https://doi.org/10.2307/1942268
3. Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., Specia, L.: SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 1–14. Association for Computational Linguistics, Vancouver, Canada (Aug 2017). https://doi.org/10.18653/v1/S17-2001
4. Chakaravarthy, V.T., Gupta, H., Roy, P., Mohania, M.K.: Efficient techniques for document sanitization. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM '08), pp. 843–852. Association for Computing Machinery, New York, NY, USA (2008). https://doi.org/10.1145/1458082.1458194
5. Chalaguine, L.A., Hunter, A.: A persuasive chatbot using a crowd-sourced argument graph and concerns. In: Prakken, H., Bistarelli, S., Santini, F., Taticchi, C. (eds.) COMMA. Frontiers in Artificial Intelligence and Applications, vol. 326, pp. 9–20. IOS Press (2020). https://doi.org/10.3233/FAIA200487
6. Charras, F., Dubuisson Duplessis, G., Letard, V., Ligozat, A.L., Rosset, S.: Comparing system-response retrieval models for open-domain and casual conversational agent. In: WOCHAT. Los Angeles, United States (2016). https://hal.archives-ouvertes.fr/hal-01782262
7. Chen, H., Liu, X., Yin, D., Tang, J.: A survey on dialogue systems: Recent advances and new frontiers. SIGKDD Explor. Newsl. 19(2), 25–35 (Nov 2017). https://doi.org/10.1145/3166054.3166058
8. Deriu, J., Rodrigo, Á., Otegi, A., Echegoyen, G., Rosset, S., Agirre, E., Cieliebak, M.: Survey on evaluation methods for dialogue systems. Artif. Intell. Rev. 54(1), 755–810 (2021). https://doi.org/10.1007/s10462-020-09866-x
9. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) NAACL-HLT (1), pp. 4171–4186. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/n19-1423
10. Dinan, E., Fan, A., Williams, A., Urbanek, J., Kiela, D., Weston, J.: Queens are powerful too: Mitigating gender bias in dialogue generation. In: EMNLP (1), pp. 8173–8188. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.emnlp-main.656
11. Fazzinga, B., Galassi, A., Torroni, P.: An argumentative dialogue system for COVID-19 vaccine information. In: Baroni, P., Benzmüller, C., Wáng, Y.N. (eds.) Logic and Argumentation - 4th International Conference, CLAR 2021, Hangzhou, China, October 20-22, 2021, Proceedings. Lecture Notes in Computer Science, vol. 13040, pp. 477–485. Springer (2021). https://doi.org/10.1007/978-3-030-89391-0_27
12. Galassi, A., Drazewski, K., Lippi, M., Torroni, P.: Cross-lingual annotation projection in legal texts. In: COLING, pp. 915–926. International Committee on Computational Linguistics, Barcelona, Spain (Online) (Dec 2020). https://doi.org/10.18653/v1/2020.coling-main.79
13. Galassi, A., Lippi, M., Torroni, P.: Attention in natural language processing. IEEE Trans. Neural Networks Learn. Syst. 32(10), 4291–4308 (2021). https://doi.org/10.1109/TNNLS.2020.3019893
14. Henderson, P., Sinha, K., Angelard-Gontier, N., Ke, N.R., Fried, G., Lowe, R., Pineau, J.: Ethical challenges in data-driven dialogue systems. In: AIES, pp. 123–129. Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3278721.3278777
15. Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., Liu, Q.: TinyBERT: Distilling BERT for natural language understanding. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4163–4174. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.findings-emnlp.372
16. Kenter, T., de Rijke, M.: Short text similarity with word embeddings. In: CIKM '15, pp. 1411–1420. Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2806416.2806475
17. Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A.P., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M., Dai, A.M., Uszkoreit, J., Le, Q., Petrov, S.: Natural Questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguistics 7, 452–466 (2019). https://doi.org/10.1162/tacl_a_00276
18. Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. In: ICML. JMLR Workshop and Conference Proceedings, vol. 32, pp. 1188–1196. JMLR.org (2014). http://proceedings.mlr.press/v32/le14.html
19. Li, B., Vorobeychik, Y., Li, M., Malin, B.A.: Scalable iterative classification for sanitizing large-scale datasets. IEEE Trans. Knowl. Data Eng. 29(3), 698–711 (2017). https://doi.org/10.1109/TKDE.2016.2628180
20. Lison, P., Pilán, I., Sánchez, D., Batet, M., Øvrelid, L.: Anonymisation models for text data: State of the art, challenges and future directions. In: ACL/IJCNLP (1), pp. 4188–4203. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.acl-long.323
21. Liu, H., Dacon, J., Fan, W., Liu, H., Liu, Z., Tang, J.: Does gender matter? Towards fairness in dialogue systems. In: COLING, pp. 4403–4416. International Committee on Computational Linguistics, Barcelona, Spain (Online) (Dec 2020). https://doi.org/10.18653/v1/2020.coling-main.390
22. Luo, L., Huang, W., Zeng, Q., Nie, Z., Sun, X.: Learning personalized end-to-end goal-oriented dialog. Proceedings of the AAAI Conference on Artificial Intelligence 33(01), 6794–6801 (Jul 2019). https://doi.org/10.1609/aaai.v33i01.33016794
23. Miner, A.S., Laranjo, L., Kocaballi, A.B.: Chatbots in the fight against the COVID-19 pandemic. npj Digital Medicine 3(1) (2020). https://doi.org/10.1038/s41746-020-0280-0
24. Nguyen, H., Cavallari, S.: Neural multi-task text normalization and sanitization with pointer-generator. In: Proceedings of the First Workshop on Natural Language Interfaces, pp. 37–47. Association for Computational Linguistics, Online (Jul 2020). https://doi.org/10.18653/v1/2020.nli-1.5
25. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: EMNLP, pp. 1532–1543. Association for Computational Linguistics, Doha, Qatar (Oct 2014). https://doi.org/10.3115/v1/D14-1162
26. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: Walker, M.A., Ji, H., Stent, A. (eds.) NAACL-HLT, pp. 2227–2237. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/n18-1202
27. Rajendran, J., Ganhotra, J., Singh, S., Polymenakos, L.: Learning end-to-end goal-oriented dialog with multiple answers. In: EMNLP, pp. 3834–3843. Association for Computational Linguistics, Brussels, Belgium (Oct-Nov 2018). https://doi.org/10.18653/v1/D18-1418
28. Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: EMNLP/IJCNLP (1), pp. 3982–3992. Association for Computational Linguistics, Hong Kong, China (Nov 2019). https://doi.org/10.18653/v1/D19-1410
29. Reimers, N., Gurevych, I.: Making monolingual sentence embeddings multilingual using knowledge distillation. In: Webber, B., Cohn, T., He, Y., Liu, Y. (eds.) EMNLP, pp. 4512–4525. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.emnlp-main.365
30. Rivera Zavala, R., Martinez, P.: The impact of pretrained language models on negation and speculation detection in cross-lingual medical text: Comparative study. JMIR Med Inform 8(12), e18953 (Dec 2020). https://doi.org/10.2196/18953
31. Sánchez, D., Batet, M., Viejo, A.: Utility-preserving privacy protection of textual healthcare documents. J. Biomed. Informatics 52, 189–198 (2014). https://doi.org/10.1016/j.jbi.2014.06.008
32. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In: The 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing @ NeurIPS (2019). http://arxiv.org/abs/1910.01108
33. Song, K., Tan, X., Qin, T., Lu, J., Liu, T.: MPNet: Masked and permuted pre-training for language understanding. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) NeurIPS (2020). https://proceedings.neurips.cc/paper/2020/hash/c3a690be93aa602ee2dc0ccab5b7b67e-Abstract.html
34. Szarvas, G., Farkas, R., Busa-Fekete, R.: State-of-the-art anonymization of medical records using an iterative machine learning framework. J. Am. Medical Informatics Assoc. 14(5), 574–580 (2007). https://doi.org/10.1197/jamia.M2441
35. Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., Zhou, M.: MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) NeurIPS (2020). https://proceedings.neurips.cc/paper/2020/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html