The “Corpus Anchise 320” and the analysis of conversations between healthcare workers and people with dementia

Nicola Benvenuti (Università di Torino), Andrea Bolioli (CELI), Alessio Bosca (CELI), Alessandro Mazzei (Università di Torino), Pietro Vigorelli (Gruppo Anchise)

Abstract

The aim of this research was to create the first Italian corpus of free conversations between healthcare professionals and people with dementia, in order to investigate specific linguistic phenomena from a computational point of view. Most previous research on the speech disorders of people with dementia has been based on qualitative analysis, or on the study of a few dozen cases carried out in laboratory conditions rather than on spontaneous speech (in particular for the Italian language). The Corpus Anchise 320 aims to support the investigation of the language of dementia by providing a larger number of dialogues collected in ecological conditions. Automatic linguistic analysis can help healthcare professionals to understand some characteristics of the language used by patients and to implement effective dialogue strategies.¹

¹ Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Introduction

In this paper we present the construction of the first annotated corpus of conversations between healthcare workers and people with dementia for Italian, called “Corpus Anchise 320”, and the quantitative linguistic analysis we carried out on it. The aim of the project is twofold. On the one hand, we created a dataset of spoken dialogue transcriptions that is useful for research on the language of people with dementia. On the other hand, techniques typical of computational linguistics are applied to help doctors assess the state of the disease and implement effective dialogue strategies. Focusing attention on verbal exchanges between speakers is one of the cornerstones of the approach developed by the Anchise Group to support people with dementia and their caregivers, i.e. the “Enabling Approach” (Vigorelli 2018).

The paper is divided into five sections. Section 1 introduces the topic of Alzheimer's language. Section 2 presents recent research and related works. Section 3 discusses the creation of the Corpus Anchise 320, which collects the transcripts and annotations of a set of dialogues in Italian between healthcare professionals and dementia patients, carried out by the Anchise Group from 2007 until today. Section 4 reports the results of the computational linguistic analysis performed with the StanfordNLP library for Italian, and discusses them to outline some of the peculiarities of the language of dementia. Section 5 concludes the paper with some final considerations.

1 The Alzheimer's language

Dementia refers to a series of symptoms that manifest in “difficulties with memory, language, problem solving and other cognitive skills that affect a person's ability to perform everyday activities.” (Alzheimer's Association 2018, 368). These symptoms change over time and reflect the degree of neuronal damage in different parts of the brain. Alzheimer's disease (AD), a neurodegenerative brain disease, is the most common form of dementia. One of the most popular neuropsychological tests for assessing a patient's neurocognitive and functional status is still the Mini-Mental Test, designed by Folstein et al. (1975).

The first symptoms are memory loss or a state of frequent confusion. Alzheimer's disease, semantic dementia, aphasia and amnesia all share a close link with lexical memory, and therefore with declarative memory, while they leave grammar and procedural memory intact. Language would thus move between a structural component, formed by grammatical rules that are stabilized over the course of life and are preserved longer as a crystallized function, and a semantic component that collapses more quickly because it requires a mnemonic and contextualized effort that makes the cognitive activity of the individual more complex. This dissociation is confirmed by studies on Alzheimer's language (Almor 1999), (Kempler 2008), (Bucks 2000), which have amply demonstrated that one of the first symptoms is anomia, i.e. the difficulty in finding the lexical target, as opposed to a good ability to construct the sentence up to the advanced stages of the disease. These deficits are then compensated through linguistic strategies, such as the frequent use of pronouns, circumlocutions and passe-partout words in the speech of Alzheimer's patients: “empty words (“things”, “do”, “he”, “it”, etc.) are successfully and relatively easily activated precisely because they are high in frequency and allow the patients to produce fluent and grammatical sentences in the presence of debilitating semantic deficits” (Almor 1999, 205). In the more advanced stages of the disease, communication becomes increasingly problematic as Alzheimer's patients experience difficulties in understanding and constructing a coherent discourse: “their narratives are often repetitive with topic changes, unclear references, and lack of coherence and informativeness” (Kempler 2008, 76).

2 Related works

The recent workshop on the creation of medical dialogue corpora (Bhatia et al. 2020) is a consequence of an increasing interest in this specific application field. The main reason for this interest is the possibility of designing and building software applications that can assist medical professionals in their daily work in order to avoid errors: “It is imperative to find a solution to minimize causes of such errors, via better tooling and visualization or by providing automated decision support assistants to medical practitioners.”. With this final aim, the creation of medical dialogue corpora can be seen as a first step toward the creation of a virtual medical assistant that can assist, speed up and improve the capacities of medical practitioners.

As stated in (de la Fuente Garcia 2020), “datasets containing both clinical information and spontaneous speech suitable for statistical learning are relatively scarce. In addition, speech data are often collected under different conditions, such as monologue and dialogue recording protocols.” A notable example is the Carolinas Conversation Collection (CCC), which is amongst the few spontaneous dialogue datasets available in English in the context of AD research. It is hosted and distributed by the Medical University of South Carolina (Pope 2011).

The study of AD language with computational methods is fairly recent, but a number of works have shown the applicability of symbolic and statistical algorithms to the prediction of dementia and similar diseases (Karlekar et al. 2018, Mirheidari et al. 2019, Kong et al. 2019).

In (Karlekar et al. 2018) neural networks were used on the publicly available DementiaBank dataset in order to predict a patient's Alzheimer's dementia starting from the language produced, annotated with the POS feature. They reached a precision between 80% and 85%. Interestingly, they showed that there is no significant difference in the prediction results when gender is considered.

In (Mirheidari et al. 2019) an automatic dementia detection system was presented, including a diarisation unit, an automatic speech recogniser, a conversation analysis (CA) based acoustic and lexical feature extraction module and a machine learning classifier, in order to facilitate and improve screening procedures for dementia. They showed that, using these features, they can obtain a high precision in detecting dementia for both neurologist-patient and VirtualAgent-patient conversations.

In (Kong et al. 2019) neural networks were also applied to the DementiaBank dataset. They reached precision results close to the state of the art (80-85%), and they pointed out the scalability of their neural methods, which need less data. Moreover, they showed that “the attention mechanism of the model manages to capture similar key concepts as the information unit features specified by human experts.”.

As for the Italian language, in (Beltrami 2018) the participants (both healthy and cognitively impaired) were asked to perform three specific tasks, i.e. the description of a drawing, the details of a recent dream and the description of a working day. The researchers investigated whether the analysis performed by Natural Language Processing techniques could reveal alterations of language performance in early cognitive decline.

3 The Corpus Anchise 320

Corpus Anchise 320 collects the transcripts of dialogues between healthcare professionals and patients carried out over the period from 2007 until today by the Anchise Group, an association of experts (doctors, psychologists, nurses, trainers) for the research, training and care of the elderly with dementia. The corpus covers an unselected series of people diagnosed with dementia, not only those with a diagnosis established according to specific criteria for Alzheimer's disease. For probabilistic reasons, most patients are affected by AD. The corpus contains 320 individual conversations, each resulting from the transcription of about 15 minutes of dialogue in which the patient can speak freely with the health worker. This peculiarity is of considerable importance in a field of investigation that was mainly based on “formal medical-psychological situations of the anamnestic investigation and the collection of tests” (Lai 2000).

The corpus contains 20,588 turns of conversation, consisting of 10,193 turns of patients with dementia and 10,381 turns of health workers. The total number of tokens is 222,856 and the total number of types (different words) is 14,513. In the table below we present a small portion of one conversation.

Turn  Speaker  Utterance
7     P        Eh ma mia figlia… è dura, è dura.
               [Eh but my daughter... it's hard, it's hard.]
8     O        Lo sa Marta che da quando abbiamo iniziato a vederci lei parla molto, molto meglio?
               [You know, Marta, that since we started seeing each other you talk much, much better?]
9     P        Ma a casa mia no! Loro non capiscono. Han detto che non capiscono niente…
               [But not at home! They don't understand. They said they don't understand anything…]
10    O        Forse sono loro che non capiscono.
               [Maybe it is they who do not understand.]

Table 1: An excerpt from a conversation between a patient (P) and a health worker (O), with English translation (turns 7 to 10).

The corpus was created in two phases. In the first phase, health professionals of the Anchise Group made the audio recordings, transcribed portions of the dialogues and annotated each transcription with a series of metadata, with the aim of investigating the relationship between language, age, sex and stage of dementia (MMSE² score). In the second phase, we collected the 320 transcriptions, we removed the pragmalinguistic comments of the health professionals, such as “[Touch the recorder]”, “[Silence]”, “[Laughs]”, etc., and we analyzed and annotated the corpus as described in the following sections.

² Mini-Mental State Examination.

Corpus Anchise 320 has been built and archived according to the EU General Data Protection Regulation (GDPR). Audio recordings and transcriptions were made with the consent of the speaker (as far as possible), of the family member and of the head of the facility or department. Personal data have been anonymized. The dataset is not publicly available, but it can be requested from the authors for research purposes.

4 Computational linguistic analysis of Corpus Anchise 320

In this section we discuss the results of the lexical analysis (Section 4.1) and of the morphosyntactic analysis (Section 4.2) carried out on the Corpus Anchise 320.

4.1 Lexical analysis

The Corpus Anchise 320 contains 222,856 tokens and 14,513 types. The ratio between types and tokens is the Type-Token Ratio (TTR), an index of the lexical richness of a text (Torruella 2013). The numbers of tokens and types were then calculated separately for the patients' turns and for the health workers' turns. The results are shown in Table 2.

                 Tokens    Types    TTR
Corpus Anchise   222,856   14,513   0.07
Patients         144,405    8,499   0.06
Health workers    78,451    6,014   0.08

Table 2: Token/types data.

TTR is low for both groups of speakers. As for the patients, this trend is closely linked to Alzheimer's disease, in which “the production of high-frequency words is relatively preserved while the production of low-frequency is impaired” (Almor 1999, 204). As for the health professionals, this trend reflects Grice's Cooperative Principle, according to which speakers should conform their conversational contribution to what is required, at the moment it occurs, by the accepted common intent or by the direction of the verbal exchange. Finally, if we look at the number of tokens, we see that the patients speak more than the health workers but with a poorer vocabulary, as their lower lexical richness index shows.

A frequency list was then created on the corpus sample of patients with dementia. Table 3 is the result of a pre-processing phase in which four types of function words were removed from the frequency list, i.e. adpositions, determiners, conjunctions and auxiliaries.

From the analysis of the data it emerges that the first 50 words in order of frequency cover 32% of the entire Corpus Anchise 320 and 49.4% of the patients' speech; the first 100 words cover 40.0% of the entire corpus and 61.8% of that of patients with dementia; the first 200 words cover 46.7% of the entire corpus and 72.1% of the words used by patients. This means that, on an expressive level, 200 words are enough for patients to cover almost three quarters of everything they say in these conversations.

Rank  Word    Frequency    Rank  Word    Frequency
1     non     3,954        10    adesso  687
2     sì      3,123        11    lì      666
3     mi      2,272        12    mia     639
4     io      2,063        13    qui     639
5     no      1,546        14    casa    637
6     eh      1,343        15    me      612
7     anche   1,032        16    fare    577
8     bene    1,007        17    lei     552
9     cosa    768          18    so      535

Table 3: Word frequencies of patients with dementia.

The analysis of the words most used by patients diagnosed with dementia in the Corpus Anchise 320 shows a high percentage of deictics, such as “io” (“I”), “qui” (“here”), “lì” (“there”), and the presence of semantically empty words, such as “cosa” (“thing”) and “cose” (“things”). This behaviour confirms the research carried out so far on Alzheimer's language regarding word finding deficits: “the earliest language deficits observed in DAT is anomia. (...) Semantically empty words are scattered throughout the DAT patient's utterances in place of content words, thereby maintaining fluency and sacrificing informational content.” (Kempler 1991, 98). Among the first 100 words used, we note the presence of the words “casa” (“home”) with 637 occurrences, “mamma” (“mother”) with 394 occurrences, “marito” (“husband”) with 190 occurrences and “figli” (“children”) with 162 occurrences. As the corpus contains spontaneous speech, we can note that the most common topic is the patient's family.

4.2 Morphosyntactic analysis

The Corpus Anchise 320 was analyzed morphosyntactically by means of the StanfordNLP library for Python (Qi 2018). The default pre-trained neural model for Italian was used. Specifically, tokenization, lemmatization, POS tagging and dependency parsing were carried out. These annotations, i.e. ID, Form, Lemma, POS, FEATS, HEAD, DEPREL, were organized according to the CoNLL-U format (Zeman 2018, Bosco 2014). A linguist reviewed the automatic annotations.

ID  TOKEN    XPOS  LEMMA    FEATS  HEAD  DEPREL
3   che      PRON  che      ...    4     nsubj
4   fa       VERB  fare     ...    2     acl:relcl
5   un       DET   uno      ...    6     det
6   po'      ADV   poco     ...    7     advmod
7   fatica   NOUN  fatica   ...    4     obj
8   a        ADP   a        ...    9     mark
9   parlare  VERB  parlare  ...    4     xcomp

Table 4: An excerpt from the annotated corpus. Features are not shown due to space constraints.

The linguistic data of patients suffering from dementia were analyzed using both the LIP³ corpus (De Mauro 1993, 155) and the speech corpus of healthcare professionals as a reference.

³ Lessico di frequenza dell'italiano parlato.

The analysis of the percentages of occurrence of the parts of speech in the patient corpus sample reveals a higher use of pronouns and adverbs, both with respect to the LIP and with respect to the corpus of health workers. In the LIP, pronouns account for 10.9% of occurrences and adverbs for 10.1%. If we compare these data with the rates of occurrence in the patients' speech (Table 5), 13.9% for pronouns and 14.2% for adverbs, we notice a notable difference. Furthermore, these two indices, when added together, are 1.7 percentage points higher than in the health workers' speech (ADV 13.2%, PRON 13.2%). This trend would confirm what was said in the analysis of word frequency, i.e. that patients have difficulty accessing the lexicon and therefore compensate for this deficit with the use of deictics, which are closely linked to the context. If we cross these data with the rate of nouns used by patients (1.6 percentage points lower than in the corpus of the health workers, NOUN 13.2%), we can deduce that the patient implements a real compensatory strategy linked to the impairment of access to semantic memory. A significant difference is also present with respect to the LIP, which records a rate of nouns of 15.7% against 11.6% in the corpus of patients with dementia.

POS     Patients   %
ADJ     4,843      3.3
ADP     11,177     7.7
ADV     20,560     14.2
AUX     10,586     7.3
CCONJ   5,562      3.8
DET     14,200     9.8
INTJ    7,383      5.1
NOUN    16,787     11.6
NUM     1,145      0.7
PRON    20,118     13.9
PROPN   1,799      1.2
SCONJ   5,078      3.5
VERB    24,749     17.1
X       418        0.3
TOT.    144,405

Table 5: Percentages of occurrence of the parts of speech.

At the morphosyntactic level, it is known in the literature that Alzheimer's patients do not suffer from serious deficits in the construction of the sentence: “sentence production in DAT is characterized by intact morphosyntactic structure (i.e., subject verb agreement, well formed plural and tense markings)” (Kempler 2008, 75). However, some linguistic phenomena that emerged from the analysis of the occurrences of the verbal system could be linked to a spatial-temporal disorientation characteristic of Alzheimer's disease (Macrì 2016). This disorientation is reflected in the massive use of the indicative mood, present in 95.9% of cases (Table 6). The use of the subjunctive and conditional moods is minimal (1.8% and 1.1% respectively). This tendency could be explained in terms of cognitive work, since these two verbal moods require both the ability to imagine possible worlds and, at the level of sentence construction, the ability to handle conjugation and temporal concordance.

Verb form   Fin 25,771 (72.9%)    Inf 4,655 (13.2%)    Part 4,751 (13.4%)   Ger 158 (0.4%)
Mood        Ind 24,702 (95.9%)    Sub 452 (1.8%)       Imp 326 (1.2%)       Cnd 285 (1.1%)
Tense       Pres 21,958 (71.9%)   Past 4,800 (15.7%)   Imp 3,459 (11.3%)    Fut 304 (0.9%)

Table 6: Percentages of occurrence of the verbal system.

5 Conclusion and further research

In this paper we presented the first Italian corpus of conversations between healthcare professionals and people with dementia, called “Corpus Anchise 320”. The study of this corpus with computational linguistic analysis confirmed some characteristics of the language of people with dementia, such as the reduction in the rate of nouns and the increase in deictics. Corpus Anchise 320 has been built and archived according to GDPR. It is not publicly available, but it can be requested from the authors for research purposes.

The large size of the sample (320 conversations) and the use of computational analysis will make it possible to identify indicators of pathological language to be used in the preclinical phase, to trace the change in the linguistic abilities of people with dementia as the disease progresses, and to relate the characteristics of the pathological language to a series of metalinguistic data such as age, sex and degree of dementia. The corpus will be extended in the coming months with the addition and annotation of other transcripts of dialogues with people with dementia.

References

Alzheimer's Association (2018). 2018 Alzheimer's disease facts and figures. Alzheimer's and Dementia, 367-429.

Almor, A., Kempler, D., MacDonald, M. C., Andersen, E. S., & Tyler, L. K. (1999). Why do Alzheimer patients have difficulty with pronouns? Working memory, semantics, and reference in comprehension and production in Alzheimer's disease. Brain and Language, 67(3), 202-227.

Associazione Gruppo Anchise. http://www.formalzheimer.it/.

Bhatia, P., Lin, S., Gangadharaiah, R., Wallace, B., Shafran, I., Shivade, C., Du, N., & Diab, M. (Eds.) (2020). Proceedings of the First Workshop on Natural Language Processing for Medical Conversations. Online, July 2020. Association for Computational Linguistics.

Beltrami, D., Gagliardi, G., Rossini Favretti, R., Ghidoni, E., Tamburini, F., & Calzà, L. (2018). Speech analysis by natural language processing techniques: a possible tool for very early detection of cognitive decline? Frontiers in Aging Neuroscience, 10, 369.

Bucks, R. S., Singh, S., Cuerden, J. M., & Wilcock, G. K. (2000). Analysis of spontaneous, conversational speech in dementia of Alzheimer type: Evaluation of an objective technique for analysing lexical performance. Aphasiology, 14(1), 71-91.

Bosco, C., Montemagni, S., & Simi, M. (2013). Converting Italian Treebanks: Towards an Italian Stanford Treebanks.

Bosco, C., Dell'Orletta, F., Montemagni, S., Sanguinetti, M., & Simi, M. (2014). The EVALITA 2014 dependency parsing task. In EVALITA 2014 Evaluation of NLP and Speech Tools for Italian (pp. 1-8). Pisa University Press.

de la Fuente Garcia, S., Haider, F., & Luz, S. (2020). Cross-corpus Feature Learning between Spontaneous Monologue and Dialogue for Automatic Classification of Alzheimer's Dementia Speech. In 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) (pp. 5851-5855). IEEE.

De Mauro, T., Mancini, F., Vedovelli, M., & Voghera, M. (1993). Lessico di frequenza dell'italiano parlato. Milano: Etaslibri.

Folstein, M. F., Folstein, S. E., & McHugh, P. R. (1975). “Mini-mental state”: a practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research, 12(3), 189-198.

Karlekar, S., Niu, T., & Bansal, M. (2018). Detecting linguistic characteristics of Alzheimer's dementia by interpreting neural models. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (pp. 701-707). New Orleans, Louisiana, June 2018. Association for Computational Linguistics.

Kempler, D. (1991). Language Changes in Dementia of Alzheimer Type. In Dementia and Communication, by Rosemary Lubinsky, 98-114. Philadelphia: B.C. Decker, Inc.

Kempler, D., & Goral, M. (2008). Language and dementia: Neuropsychological aspects. Annual Review of Applied Linguistics, 28, 73.

Kong, W., Jang, H., Carenini, G., & Field, T. (2019). A neural model for predicting dementia from language. Proceedings of Machine Learning Research, volume 106 (pp. 270-286). Ann Arbor, Michigan, 09-10 Aug 2019. PMLR.

Lai, G. (2000). Conversazioni con l'Alzheimer. Prospettive sociali e sanitarie, 18, 2-5.

Macrì, A. (2016). La lingua della demenza di Alzheimer. Analisi linguistica del parlato spontaneo. In Le lingue della malattia, 329-424. Milano: Mimesis Edizioni.

Mirheidari, B., Blackburn, D., Walker, T., Reuber, M., & Christensen, H. (2019). Dementia detection using automatic analysis of conversations. Computer Speech & Language, 53, 65-79.

Pope, C., & Davis, B. H. (2011). Finding a balance: The Carolinas Conversation Collection. Corpus Linguistics and Linguistic Theory, 7(1), 143-161.

Qi, P., Dozat, T., Zhang, Y., & Manning, C. D. (2018). Universal Dependency Parsing from Scratch. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (pp. 160-170).

Torruella, J., & Capsada, R. (2013). Lexical Statistics and Tipological Structures: A Measure of Lexical Richness. Procedia - Social and Behavioral Sciences, 447-454.

Vigorelli, P. (2018). Alzheimer, come parlare e comunicare nella vita quotidiana nonostante la malattia. Milano: Franco Angeli Editore.

Vigorelli, P. (2004). La conversazione possibile con il malato Alzheimer. Milano: Franco Angeli Editore.

Zeman, D., Hajic, J., Popel, M., Potthast, M., Straka, M., Ginter, F., ... & Petrov, S. (2018, October). CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (pp. 1-21).
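
As a minimal sketch of the type-token ratio reported in Table 2, the following Python snippet recomputes the index from the published token and type counts. The names `type_token_ratio` and `ttr_from_tokens` are illustrative and are not part of the authors' pipeline; lower-casing word forms before counting types is an assumption, since the paper does not state how forms were normalised.

```python
from typing import Iterable


def type_token_ratio(n_types: int, n_tokens: int) -> float:
    """Return types / tokens, the lexical richness index shown in Table 2."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return n_types / n_tokens


def ttr_from_tokens(tokens: Iterable[str]) -> float:
    """Compute TTR from a token sequence (assumption: case-insensitive types)."""
    lowered = [t.lower() for t in tokens]
    return type_token_ratio(len(set(lowered)), len(lowered))


# (types, tokens) pairs copied from Table 2.
TABLE_2 = {
    "Corpus Anchise": (14_513, 222_856),
    "Patients": (8_499, 144_405),
    "Health workers": (6_014, 78_451),
}

ttr_table = {
    name: round(type_token_ratio(n_types, n_tokens), 2)
    for name, (n_types, n_tokens) in TABLE_2.items()
}
```

Rounded to two decimals, these ratios match the TTR column of Table 2 (0.07, 0.06 and 0.08).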
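
The coverage figures in the lexical analysis (the share of running words accounted for by the 50, 100 or 200 most frequent words) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `top_n_coverage` is a hypothetical helper, and the toy token list stands in for the corpus, which is not publicly available.

```python
from collections import Counter
from typing import Iterable


def top_n_coverage(tokens: Iterable[str], n: int) -> float:
    """Share of all tokens covered by the n most frequent word forms."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(freq for _, freq in counts.most_common(n))
    return covered / total


# Toy example (not corpus data): 10 tokens, 5 distinct word forms.
toy_tokens = ["non", "non", "non", "non", "sì", "sì", "sì", "casa", "cosa", "qui"]
toy_coverage = {n: top_n_coverage(toy_tokens, n) for n in (1, 2, 5)}
```

The choice of denominator matters: the paper reports coverage both against the entire corpus and against the patients' speech alone, which corresponds to passing a different token list to the same function.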
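
The part-of-speech percentages in Table 5 follow directly from the tag counts. The sketch below recomputes them from the published counts; the variable and function names are illustrative, and the counts are copied from Table 5 rather than derived from the corpus. Recomputed values may differ from the published ones in the last decimal, depending on whether the figures were rounded or truncated.

```python
from typing import Dict, Mapping

# Tag counts for the patients' turns, copied from Table 5.
PATIENT_POS_COUNTS = {
    "ADJ": 4_843, "ADP": 11_177, "ADV": 20_560, "AUX": 10_586,
    "CCONJ": 5_562, "DET": 14_200, "INTJ": 7_383, "NOUN": 16_787,
    "NUM": 1_145, "PRON": 20_118, "PROPN": 1_799, "SCONJ": 5_078,
    "VERB": 24_749, "X": 418,
}


def pos_percentages(counts: Mapping[str, int]) -> Dict[str, float]:
    """Convert raw tag counts into percentages of all tagged tokens."""
    total = sum(counts.values())
    return {tag: 100 * n / total for tag, n in counts.items()}


patient_total = sum(PATIENT_POS_COUNTS.values())
patient_pct = pos_percentages(PATIENT_POS_COUNTS)
```

The counts sum to the 144,405 patient tokens given in the TOT. row, so the percentages are taken over all patient tokens.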
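
To make the annotation layout of Table 4 concrete, the following sketch reads rows in the column order shown there (ID, TOKEN, XPOS, LEMMA, FEATS, HEAD, DEPREL) and groups tokens under their syntactic head. It is a hedged illustration only: `AnnotatedToken`, `parse_rows` and `dependents_by_head` are hypothetical names, the seven-column layout mirrors the table rather than the full CoNLL-U specification, and the FEATS values stay elided exactly as in the paper.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class AnnotatedToken(NamedTuple):
    """One row of the annotated corpus, in the column order of Table 4."""
    id: int
    token: str
    xpos: str
    lemma: str
    feats: str
    head: int
    deprel: str


def parse_rows(block: str) -> List[AnnotatedToken]:
    """Parse whitespace-separated rows laid out like Table 4."""
    rows = []
    for line in block.strip().splitlines():
        id_, token, xpos, lemma, feats, head, deprel = line.split()
        rows.append(AnnotatedToken(int(id_), token, xpos, lemma, feats, int(head), deprel))
    return rows


def dependents_by_head(rows: List[AnnotatedToken]) -> Dict[int, List[str]]:
    """Group token forms under the ID of their syntactic head."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for row in rows:
        grouped[row.head].append(row.token)
    return dict(grouped)


# Rows copied from Table 4; FEATS stays elided ("...") as in the table.
TABLE_4 = """
3 che PRON che ... 4 nsubj
4 fa VERB fare ... 2 acl:relcl
5 un DET uno ... 6 det
6 po' ADV poco ... 7 advmod
7 fatica NOUN fatica ... 4 obj
8 a ADP a ... 9 mark
9 parlare VERB parlare ... 4 xcomp
"""

rows = parse_rows(TABLE_4)
deps = dependents_by_head(rows)
```

In this excerpt the verb “fa” (ID 4) governs “che”, “fatica” and “parlare”, which is the kind of head-dependent structure the dependency annotation records for every sentence.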