=Paper=
{{Paper
|id=Vol-2769/67
|storemode=property
|title=The "Corpus Anchise 320" and the Analysis of Conversations between Healthcare Workers and People with Dementia
|pdfUrl=https://ceur-ws.org/Vol-2769/paper_67.pdf
|volume=Vol-2769
|authors=Nicola Benvenuti,Andrea Bolioli,Alessandro Mazzei,Pietro Vigorelli,Alessio Bosca
|dblpUrl=https://dblp.org/rec/conf/clic-it/BenvenutiBMVB20
}}
==The "Corpus Anchise 320" and the Analysis of Conversations between Healthcare Workers and People with Dementia==
Nicola Benvenuti (Università di Torino), Andrea Bolioli (CELI), Alessio Bosca (CELI), Alessandro Mazzei (Università di Torino), Pietro Vigorelli (Gruppo Anchise)
===Abstract===
The aim of this research was to create the first Italian corpus of free conversations between healthcare professionals and people with dementia, in order to investigate specific linguistic phenomena from a computational point of view. Most previous research on the speech disorders of people with dementia has been based on qualitative analysis, or on the study of a few dozen cases carried out in laboratory conditions rather than on spontaneous speech (in particular for the Italian language). The Corpus Anchise 320 aims to support the investigation of the language of dementia by providing a larger number of dialogues collected in ecological conditions. Automatic linguistic analysis can help healthcare professionals to understand some characteristics of the language used by patients and to implement effective dialogue strategies.

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Introduction===
In this paper we present the construction of the first annotated corpus of conversations between healthcare workers and people with dementia for Italian, called “Corpus Anchise 320”, and the quantitative linguistic analysis we carried out. The aim of the project is twofold. On the one hand, we created a dataset of spoken dialogue transcriptions that is useful for research on the language of people with dementia. On the other hand, techniques typical of computational linguistics are applied to help doctors in assessing the state of the disease and implementing effective dialogue strategies. Focusing attention on verbal exchanges between speakers is one of the cornerstones of the approach developed by the Anchise Group to support people with dementia and their caregivers, i.e. the “Enabling Approach” (Vigorelli 2018).

The paper is divided into five sections. Section 1 introduces the topic of Alzheimer’s language. Section 2 presents recent research and related work. Section 3 discusses the creation of the Corpus Anchise 320, which collects the transcripts and annotations of a set of dialogues in Italian between healthcare professionals and dementia patients, carried out by the Anchise Group from 2007 until today. Section 4 reports the results of the computational linguistic analysis performed with the StanfordNLP library for Italian; the results are discussed to outline some of the peculiarities of the language of dementia. Section 5 concludes the paper with some final considerations.

===1 The Alzheimer’s language===
Dementia refers to a series of symptoms that manifest in “difficulties with memory, language, problem solving and other cognitive skills that affect a person's ability to perform everyday activities.” (Alzheimer's-Association 2018, 368). These symptoms change over time and reflect the degree of neuronal damage in different parts of the brain. Alzheimer's disease (AD), a neurodegenerative brain disease, is the most common form of dementia. One of the most popular neuropsychological tests for assessing a patient's neurocognitive and functional status is still the Mini-Mental Test, designed by Folstein et al. (1975).

The first symptoms are memory loss or a state of frequent confusion. Alzheimer's disease, semantic dementia, aphasia and amnesia all share
a close link with lexical memory, and therefore with declarative memory, while they would leave grammar and procedural memory intact. Language would thus move between a structural component, formed by grammatical rules that are stabilized over the course of life and are preserved longer as a crystallized function, and a semantic component that would collapse more quickly because it requires a mnemonic and contextualized effort that makes the cognitive activity of the individual more complex. This dissociation is confirmed by studies on Alzheimer's language (Almor 1999), (Kempler 2008), (Bucks 2000), which have amply demonstrated that one of the first symptoms is anomia, i.e. difficulty in finding the lexical target, as opposed to a good ability to construct sentences up to the advanced stages of the disease. These deficits would then be compensated through linguistic strategies, such as the high use of pronouns, circumlocutions and passe-partout words found in the speech of Alzheimer's patients: “empty words (“things”, “do”, “he”, “it”, etc.) are successfully and relatively easily activated precisely because they are high in frequency and allow the patients to produce fluent and grammatical sentences in the presence of debilitating semantic deficits” (Almor 1999, 205). In the more advanced stages of the disease, communication becomes increasingly problematic as Alzheimer's patients experience difficulties in understanding and constructing a coherent discourse: “their narratives are often repetitive with topic changes, unclear references, and lack of coherence and informativeness” (Kempler 2008, 76).

===2 Related works===
The recent workshop on the creation of medical dialogue corpora (Bhatia et al. 2020) reflects an increasing interest in this specific application field. The main reason for this interest is the possibility of designing and building software applications that can assist medical professionals in their daily work in order to avoid errors: “It is imperative to find a solution to minimize causes of such errors, via better tooling and visualization or by providing automated decision support assistants to medical practitioners.” With this final aim, the creation of medical dialogue corpora can be seen as a first step toward a virtual medical assistant that can assist, speed up and improve the capacities of medical practitioners.

As stated in (de la Fuente Garcia 2020), “datasets containing both clinical information and spontaneous speech suitable for statistical learning are relatively scarce. In addition, speech data are often collected under different conditions, such as monologue and dialogue recording protocols.” A notable example is the Carolina Conversations Collection (CCC), which is among the few spontaneous dialogue datasets available in English in the context of AD research. It is hosted and distributed by the Medical University of South Carolina (Pope 2011).

The study of AD language with computational methods is fairly recent, but a number of works have shown the applicability of symbolic and statistical algorithms to the prediction of dementia and similar diseases (Karlekar et al. 2018, Mirheidari et al. 2019, Kong et al. 2019).

In (Karlekar et al. 2018) neural networks were applied to the publicly available DementiaBank dataset in order to predict a patient's Alzheimer’s dementia from the language produced, annotated with POS features. They reached a precision between 80% and 85%. Interestingly, they showed that there is no significant difference in the prediction results when gender is taken into account.

In (Mirheidari et al. 2019) an automatic dementia detection system was presented, including a diarisation unit, an automatic speech recogniser, a conversation analysis (CA) based acoustic and lexical feature extraction module and a machine learning classifier, in order to facilitate and improve screening procedures for dementia. They showed that, using these features, they can obtain a high precision in detecting dementia for both neurologist-patient and VirtualAgent-patient conversations.

In (Kong et al. 2019) neural networks were also applied to the DementiaBank dataset. They reached precision results close to the state of the art (80-85%), and they pointed out the scalability of their neural methods, which need less data. Moreover, they showed that “the attention mechanism of the model manages to capture similar key concepts as the information unit features specified by human experts.”

As for the Italian language, in (Beltrami 2018) the participants (both healthy and cognitively impaired) were asked to perform three specific tasks, i.e. the description of a drawing, the details of their last dream and the description of a working day. The researchers investigated whether the analysis performed by Natural Language Processing
techniques could reveal alterations of language performance in early cognitive decline.

===3 The Corpus Anchise 320===
Corpus Anchise 320 collects the transcripts of dialogues between healthcare professionals and patients carried out from 2007 until today by the Anchise Group, an association of experts (doctors, psychologists, nurses, trainers) for research, training and care of the elderly with dementia. The corpus covers an unselected series of people diagnosed with dementia; it is not restricted to those with an established diagnosis based on specific criteria for Alzheimer's. For probabilistic reasons, most patients are affected by AD. The corpus contains 320 individual conversations, each resulting from the transcription of about 15 minutes of dialogue in which the patient can speak freely with the health worker. This peculiarity is of considerable importance in a field of investigation that was mainly based on "formal medical-psychological situations of the anamnestic investigation and the collection of tests" (Lai 2000).

The corpus contains 20,588 turns of conversation, consisting of 10,193 turns of patients with dementia and 10,381 turns of health workers. The total number of tokens is 222,856 and the total number of types (different words) is 14,513. Table 1 shows a small portion of one conversation.

{| class="wikitable"
! Turn !! Speaker !! Utterance
|-
| 7 || P || Eh ma mia figlia… è dura, è dura. [Eh but my daughter... it's hard, it's hard.]
|-
| 8 || O || Lo sa Marta che da quando abbiamo iniziato a vederci lei parla molto, molto meglio? [You know, Marta, that since we started seeing each other you talk much, much better?]
|-
| 9 || P || Ma a casa mia no! Loro non capiscono. Han detto che non capiscono niente… [But not at home! They don't understand. They said they don't understand anything…]
|-
| 10 || O || Forse sono loro che non capiscono. [Maybe it is they who do not understand.]
|}

Table 1: An excerpt from a conversation between a patient (P) and a health worker (O), with English translation (turns 7 to 10).

The corpus has been created in two phases. In the first phase, health professionals of the Anchise Group made the audio recordings, transcribed portions of the dialogues and annotated each transcription with a series of metadata, with the aim of investigating the relationship between language, age, sex and stage of dementia (MMSE score, i.e. Mini-Mental State Examination score). In the second phase, we collected the 320 transcriptions, removed the pragmalinguistic comments of the health professionals, such as "[Touch the recorder]", "[Silence]", "[Laughs]", etc., and analyzed and annotated the corpus as described in the following sections.

Corpus Anchise 320 has been built and archived according to the EU General Data Protection Regulation (GDPR). Audio recordings and transcriptions were made with the consent of the speaker, as far as possible, of the family member and of the head of the facility or department. Personal data have been anonymized. The dataset is not publicly available but it can be requested from the authors for research purposes.

===4 Computational linguistic analysis of Corpus Anchise 320===
In this Section we discuss the results of the lexical analysis (3.1) and of the morphosyntactic analysis (3.2) carried out on the Corpus Anchise 320.

====3.1 Lexical analysis====
The Corpus Anchise 320 contains 222,856 tokens and 14,513 types. The ratio between types and tokens is the Type-Token Ratio (TTR), an index of the lexical richness of a text (Torruella 2013). The number of tokens and types was then calculated separately for the patients' turns and for the health workers' turns. The results are shown in Table 2.

{| class="wikitable"
! !! Tokens !! Types !! TTR
|-
| Corpus Anchise || 222,856 || 14,513 || 0.07
|-
| Patients || 144,405 || 8,499 || 0.06
|-
| Health workers || 78,451 || 6,014 || 0.08
|}

Table 2: Token/types data.
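As an illustration of how the TTR values reported for the corpus follow from the token and type counts, here is a minimal Python sketch. The function names `type_token_ratio` and `ttr_from_tokens` are hypothetical, and lowercasing tokens before counting types is an assumption: the paper does not state how types were normalized.

```python
def type_token_ratio(n_types: int, n_tokens: int) -> float:
    """Return types / tokens; a lower value means more repetition."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return n_types / n_tokens


def ttr_from_tokens(tokens: list[str]) -> float:
    """Compute TTR from a token list.

    Assumption: case variants count as the same type.
    """
    normalized = [t.lower() for t in tokens]
    return type_token_ratio(len(set(normalized)), len(normalized))


# (tokens, types) as reported in the paper's token/types table.
REPORTED_COUNTS = {
    "Corpus Anchise": (222856, 14513),
    "Patients": (144405, 8499),
    "Health workers": (78451, 6014),
}

for name, (n_tokens, n_types) in REPORTED_COUNTS.items():
    print(f"{name}: TTR = {type_token_ratio(n_types, n_tokens):.2f}")
```

Rounded to two decimals, these ratios match the reported values (0.07, 0.06 and 0.08). Note that TTR is sensitive to sample size, so comparing sub-corpora of different lengths should be done with some caution.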
TTR is low for both speakers. As for the patients, this trend is closely linked to Alzheimer's disease, in which “the production of high-frequency words is relatively preserved while the production of low-frequency is impaired” (Almor 1999, 204). As for the health professionals, this trend reflects Grice's Cooperative Principle, according to which speakers must conform their conversational contribution to what is required, when it occurs, by the accepted common intent or by the direction of the verbal exchange. Finally, if we look at the number of tokens, we see that the patients speak more but with a poorer vocabulary, as shown by their lower lexical richness index compared with the sample of health workers.

A frequency list was then created on the corpus sample of patients with dementia. Table 3 is the result of a pre-processing phase in which four types of function words were removed from the frequency list, i.e. adpositions, determiners, conjunctions and auxiliaries.

From the analysis of the data it emerges that the first 50 words in order of frequency cover 32% of the entire Corpus Anchise 320 and 49.4% of the patients' speech; the first 100 words cover 40.0% of the entire corpus and 61.8% of that of patients with dementia; the first 200 words cover 46.7% of the entire corpus and 72.1% of the words used by patients. This means that, on an expressive level, 200 words account for almost three quarters of everything the patients say in these conversations.

{| class="wikitable"
! Rank !! Word !! Frequency !! Rank !! Word !! Frequency
|-
| 1 || non || 3,954 || 10 || adesso || 687
|-
| 2 || sì || 3,123 || 11 || lì || 666
|-
| 3 || mi || 2,272 || 12 || mia || 639
|-
| 4 || io || 2,063 || 13 || qui || 639
|-
| 5 || no || 1,546 || 14 || casa || 637
|-
| 6 || eh || 1,343 || 15 || me || 612
|-
| 7 || anche || 1,032 || 16 || fare || 577
|-
| 8 || bene || 1,007 || 17 || lei || 552
|-
| 9 || cosa || 768 || 18 || so || 535
|}

Table 3: Word frequencies of patients with dementia.

The analysis of the words most used by patients diagnosed with dementia in the Corpus Anchise 320 shows a high percentage of deictics, such as “io” (“I”), “qui” (“here”), “lì” (“there”), and the presence of semantically empty words, such as “cosa” (“thing”) and “cose” (“things”). This finding confirms the research carried out so far on Alzheimer's language regarding word-finding deficits: “the earliest language deficits observed in DAT is anomia. (...) Semantically empty words are scattered throughout the DAT patient's utterances in place of content words, thereby maintaining fluency and sacrificing informational content." (Kempler 1991, 98). Among the 100 most frequent words, we note the presence of “casa” (“home”) with 637 occurrences, “mamma” (“mother”) with 394 occurrences, “marito” (“husband”) with 190 occurrences and “figli” (“children”) with 162 occurrences. As the corpus contains spontaneous speech, we can note that the most common topic is the patient’s family.

====3.2 Morphosyntactic analysis====
The Corpus Anchise 320 was analyzed morpho-syntactically by means of the StanfordNLP library for Python (Qi 2018). The default pre-trained neural model for Italian was used. Specifically, tokenization, lemmatization, POS tagging and dependency parsing were carried out. These annotations, i.e. ID, Form, Lemma, POS, FEATS, HEAD, DEPREL, were organized according to the CoNLL-U format (Zeman 2018, Bosco 2014). A linguist reviewed the automatic annotations.

{| class="wikitable"
! ID !! TOKEN !! XPOS !! LEMMA !! FEATS !! HEAD !! DEPREL
|-
| 3 || che || PRON || che || ... || 4 || nsubj
|-
| 4 || fa || VERB || fare || ... || 2 || acl:relcl
|-
| 5 || un || DET || uno || ... || 6 || det
|-
| 6 || po' || ADV || poco || ... || 7 || advmod
|-
| 7 || fatica || NOUN || fatica || ... || 4 || obj
|-
| 8 || a || ADP || a || ... || 9 || mark
|-
| 9 || parlare || VERB || parlare || ... || 4 || xcomp
|}

Table 4: An excerpt from the annotated corpus. Features are not shown due to space constraints.

The analysis of the linguistic data of patients suffering from dementia was made using both the LIP corpus (Lessico di frequenza dell’italiano parlato; De Mauro 1993, 155) and the speech corpus of healthcare professionals as a reference.
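To make the step from CoNLL-U annotations to part-of-speech percentages concrete, the following Python sketch reads CoNLL-U lines and computes the share of each POS tag. It is only an illustration, not the authors' pipeline: the helper names `read_conllu` and `pos_percentages` are hypothetical, and the sample reuses the tokens of the annotated excerpt laid out in the standard ten-column CoNLL-U order, with the tag placed in the UPOS column and `_` for every field the paper does not show (both are assumptions).

```python
from collections import Counter

# Hypothetical sample in the 10-column CoNLL-U order:
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC ("_" = unspecified).
SAMPLE = "\n".join([
    "# text = (fragment) che fa un po' fatica a parlare",
    "3\tche\tche\tPRON\t_\t_\t4\tnsubj\t_\t_",
    "4\tfa\tfare\tVERB\t_\t_\t2\tacl:relcl\t_\t_",
    "5\tun\tuno\tDET\t_\t_\t6\tdet\t_\t_",
    "6\tpo'\tpoco\tADV\t_\t_\t7\tadvmod\t_\t_",
    "7\tfatica\tfatica\tNOUN\t_\t_\t4\tobj\t_\t_",
    "8\ta\ta\tADP\t_\t_\t9\tmark\t_\t_",
    "9\tparlare\tparlare\tVERB\t_\t_\t4\txcomp\t_\t_",
])


def read_conllu(text: str) -> list[dict]:
    """Parse CoNLL-U text into a list of token dictionaries."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    return tokens


def pos_percentages(tokens: list[dict]) -> dict[str, float]:
    """Return the percentage of tokens carrying each UPOS tag."""
    counts = Counter(t["upos"] for t in tokens)
    total = sum(counts.values())
    return {pos: 100 * n / total for pos, n in counts.items()}


tokens = read_conllu(SAMPLE)
for pos, pct in sorted(pos_percentages(tokens).items()):
    print(f"{pos}: {pct:.1f}%")
```

Applied to the full patient and health-worker sub-corpora, the same counting would give per-tag rates that can be compared with each other and with a reference corpus.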
The analysis of the percentages of occurrence of the parts of speech in the patient corpus sample reveals a higher use of pronouns and adverbs with respect both to LIP and to the corpus of health workers. In LIP, pronouns account for 10.9% of occurrences and adverbs for 10.1%. If we compare these data with the rates of occurrence in the patients' speech (Table 5), 13.9% for pronouns and 14.2% for adverbs, we notice a notable difference. Furthermore, these two indices, when added together, are 1.7 percentage points higher than in the health workers’ speech (ADV 13.2%, PRON 13.2%). This trend would confirm what was said in the analysis of word frequency, i.e. that patients have difficulty accessing the lexicon and therefore compensate for this deficit with the use of deictics, which are closely linked to the context. If we cross these data with the rate of nouns used by patients (1.6 percentage points lower than in the corpus of the health workers, NOUN 13.2%), we can deduce that the patient implements a real compensatory strategy linked to the impairment of access to semantic memory. A significant difference is also present with respect to LIP, which records a rate of nouns of 15.7% against 11.6% for the corpus of patients with dementia.

{| class="wikitable"
! POS !! Patients !! %
|-
| ADJ || 4,843 || 3.3
|-
| ADP || 11,177 || 7.7
|-
| ADV || 20,560 || 14.2
|-
| AUX || 10,586 || 7.3
|-
| CCONJ || 5,562 || 3.8
|-
| DET || 14,200 || 9.8
|-
| INTJ || 7,383 || 5.1
|-
| NOUN || 16,787 || 11.6
|-
| NUM || 1,145 || 0.7
|-
| PRON || 20,118 || 13.9
|-
| PROPN || 1,799 || 1.2
|-
| SCONJ || 5,078 || 3.5
|-
| VERB || 24,749 || 17.1
|-
| X || 418 || 0.3
|-
| TOT. || 144,405 ||
|}

Table 5: Percentages of occurrence of the parts of speech.

At the morphosyntactic level, it is known in the literature that Alzheimer's patients do not suffer from serious deficits in sentence construction: "sentence production in DAT is characterized by intact morphosyntactic structure (i.e., subject verb agreement, well formed plural and tense markings)” (Kempler 2008, 75). However, some linguistic phenomena that emerged from the analysis of the occurrences of the verbal system could be linked to the spatial-temporal disorientation characteristic of Alzheimer's disease (Macrì 2016). This disorientation is reflected in the massive use of the indicative mood, present in 95.9% of cases (Table 6). The use of the subjunctive and conditional moods is minimal, with percentages between 1% and 2%. This tendency could be explained in terms of cognitive work, since these two verbal moods require both the ability to imagine possible worlds and, at the level of sentence construction, the ability to handle conjugation and temporal concordance.

{| class="wikitable"
! Verb form !! Fin !! Inf !! Part !! Ger
|-
| Occurrences || 25,771 (72.9%) || 4,655 (13.2%) || 4,751 (13.4%) || 158 (0.4%)
|-
! Mood !! Ind !! Sub !! Imp !! Cnd
|-
| Occurrences || 24,702 (95.9%) || 452 (1.8%) || 326 (1.2%) || 285 (1.1%)
|-
! Tense !! Pres !! Past !! Imp !! Fut
|-
| Occurrences || 21,958 (71.9%) || 4,800 (15.7%) || 3,459 (11.33%) || 304 (0.9%)
|}

Table 6: Percentages of occurrence of the verbal system.

===5 Conclusion and further research===
In this paper we presented the first Italian corpus of conversations between healthcare professionals and people with dementia, called “Corpus Anchise 320”. The study of this corpus with computational linguistic analysis confirmed some characteristics of the language of people with dementia, such as the reduction in the rate of nouns and the increase in deictics. Corpus Anchise 320 has been built and archived according to GDPR. It is not publicly available but it can be requested from the authors for research purposes.

The large size of the sample (320 conversations) and the use of computational analysis will make it possible to identify indicators of pathological language to be used in the preclinical phase, to trace the change in the linguistic abilities of people with dementia as the disease progresses, and to relate the characteristics of the pathological language to a series of metalinguistic data such as age, sex and
degree of dementia. The corpus will be increased in the coming months with the addition and annotation of other transcripts of dialogues of people with dementia.

===References===
Alzheimer's-Association. "2018 Alzheimer's disease facts and figures." Alzheimer's and Dementia, 2018: 367-429.

Almor, A., Kempler, D., MacDonald, M. C., Andersen, E. S., & Tyler, L. K. (1999). Why do Alzheimer patients have difficulty with pronouns? Working memory, semantics, and reference in comprehension and production in Alzheimer's disease. Brain and language, 67(3), 202-227.

Associazione Gruppo Anchise. http://www.formalzheimer.it/.

P. Bhatia, S. Lin, R. Gangadharaiah, B. Wallace, I. Shafran, C. Shivade, N. Du, and M. Diab, editors. Proceedings of the First Workshop on Natural Language Processing for Medical Conversations, Online, July 2020. Association for Computational Linguistics.

Beltrami, D., Gagliardi, G., Rossini Favretti, R., Ghidoni, E., Tamburini, F., & Calzà, L. (2018). Speech analysis by natural language processing techniques: a possible tool for very early detection of cognitive decline?. Frontiers in aging neuroscience, 10, 369.

Bucks, R. S., Singh, S., Cuerden, J. M., & Wilcock, G. K. (2000). Analysis of spontaneous, conversational speech in dementia of Alzheimer type: Evaluation of an objective technique for analysing lexical performance. Aphasiology, 14(1), 71-91.

Bosco, C., Montemagni, S., Simi, M. (2013). Converting Italian Treebanks: Towards an Italian Stanford Treebanks.

Bosco, C., Dell'Orletta, F., Montemagni, S., Sanguinetti, M., & Simi, M. (2014). The EVALITA 2014 dependency parsing task. In EVALITA 2014 Evaluation of NLP and Speech Tools for Italian (pp. 1-8). Pisa University Press.

de la Fuente Garcia, S., Haider, F., & Luz, S. (2020). Cross-corpus Feature Learning between Spontaneous Monologue and Dialogue for Automatic Classification of Alzheimer’s Dementia Speech. In 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) (pp. 5851-5855). IEEE.

De Mauro, T., Mancini, F., Vedovelli, M., Voghera, M. (1993). Lessico di frequenza dell'italiano parlato. Milano: Etaslibri.

Folstein, M. F., Folstein, S. E., & McHugh, P. R. (1975). “Mini-mental state”: a practical method for grading the cognitive state of patients for the clinician. Journal of psychiatric research, 12(3), 189-198.

S. Karlekar, T. Niu, and M. Bansal. Detecting linguistic characteristics of Alzheimer’s dementia by interpreting neural models. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 701–707, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.

Kempler, D. (1991). Language Changes in Dementia of Alzheimer Type. In Dementia and Communication, by Rosemary Lubinsky, 98-114. Philadelphia: B.C. Decker, Inc.

Kempler, D., & Goral, M. (2008). Language and dementia: Neuropsychological aspects. Annual review of applied linguistics, 28, 73.

W. Kong, H. Jang, G. Carenini, and T. Field. A neural model for predicting dementia from language. volume 106 of Proceedings of Machine Learning Research, pages 270–286, Ann Arbor, Michigan, 09–10 Aug 2019. PMLR.

Lai, G. (2000). Conversazioni con l'Alzheimer. Prospettive sociali e sanitarie, 18, 2-5.

Macrì, A. (2016). La lingua della demenza di Alzheimer. Analisi linguistica del parlato spontaneo. In Le lingue della malattia, 329-424. Milano: Mimesis Edizioni.

Mirheidari, B., Blackburn, D., Walker, T., Reuber, M., & Christensen, H. (2019). Dementia detection using automatic analysis of conversations. Computer Speech & Language, 53, 65-79.

Pope, C., & Davis, B. H. (2011). Finding a balance: The Carolinas Conversation Collection. Corpus Linguistics and Linguistic Theory, 7(1), 143-161.

Qi, P., Dozat, T., Zhang, Y., & Manning, C. D. (2018). Universal Dependency Parsing from Scratch. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 160-170.

Torruella, J.; Capsada, R. (2013). "Lexical
Statistics and Tipological Structures: A Measure
of Lexical Richness." Procedia - Social and
Behavioral Sciences, pp. 447-454.
Vigorelli, P. (2018). Alzheimer, come parlare e
comunicare nella vita quotidiana nonostante la
malattia. Milano: Franco Angeli Editore.
Vigorelli, P. (2004). La conversazione possibile
con il malato Alzheimer. Milano: Franco Angeli
Editore.
Zeman, D., Hajic, J., Popel, M., Potthast, M.,
Straka, M., Ginter, F., ... & Petrov, S. (2018,
October). CoNLL 2018 shared task: Multilingual
parsing from raw text to universal dependencies.
In Proceedings of the CoNLL 2018 Shared Task:
Multilingual parsing from raw text to universal
dependencies (pp. 1-21).