The “Corpus Anchise 320” and the analysis of conversations between
healthcare workers and people with dementia

Nicola Benvenuti (Università di Torino), Andrea Bolioli (CELI),
Alessio Bosca (CELI), Alessandro Mazzei (Università di Torino),
Pietro Vigorelli (Gruppo Anchise)


Abstract

The aim of this research was to create the first Italian corpus of free
conversations between healthcare professionals and people with dementia, in
order to investigate specific linguistic phenomena from a computational point
of view. Most previous research on the speech disorders of people with
dementia has been based on qualitative analysis, or on the study of a few
dozen cases carried out in laboratory conditions rather than on spontaneous
speech (in particular for the Italian language). The Corpus Anchise 320 was
created to investigate the language of dementia on a larger number of
dialogues collected in ecological conditions. Automatic linguistic analysis
can help healthcare professionals to understand some characteristics of the
language used by patients and to implement effective dialogue strategies.¹

¹ Copyright ©2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).

Introduction

In this paper we present the construction of the first annotated corpus of
conversations between healthcare workers and people with dementia for Italian,
called “Corpus Anchise 320”, and the quantitative linguistic analysis we
carried out on it. The aim of the project is twofold. On the one hand, we
created a dataset of spoken dialogue transcriptions that is useful for
research on the language of people with dementia. On the other hand, we
applied techniques typical of computational linguistics to help doctors assess
the state of the disease and implement effective dialogue strategies. Focusing
attention on verbal exchanges between speakers is one of the cornerstones of
the approach developed by the Anchise Group to support people with dementia
and their caregivers, i.e. the “Enabling Approach” (Vigorelli 2018).

The paper is divided into five sections. Section 1 introduces the topic of
Alzheimer’s language. Section 2 presents recent research and related work.
Section 3 discusses the creation of the Corpus Anchise 320, which collects the
transcripts and annotations of a set of dialogues in Italian between
healthcare professionals and dementia patients, carried out by the Anchise
Group from 2007 until today. Section 4 reports the results of the
computational linguistic analysis performed with the StanfordNLP library for
Italian and discusses them to outline some of the peculiarities of the
language of dementia. Section 5 concludes the paper with some final
considerations.

1    The Alzheimer’s language

Dementia refers to a series of symptoms that manifest in “difficulties with
memory, language, problem solving and other cognitive skills that affect a
person's ability to perform everyday activities.” (Alzheimer's-Association
2018, 368). These symptoms change over time and reflect the degree of neuronal
damage in different parts of the brain. Alzheimer's disease (AD), a
neurodegenerative brain disease, is the most common form of dementia. One of
the most popular neuropsychological tests for assessing a patient's
neurocognitive and functional status is still the Mini-Mental Test, designed
by Folstein et al. (1975).

The first symptoms are memory loss or a state of frequent confusion.
Alzheimer's disease, semantic dementia, aphasia and amnesia all share
a close link with lexical memory and therefore declarative memory, while they
would leave grammar and procedural memory intact. Language would thus move
between a structural component, formed by grammatical rules that are
stabilized over the course of life and are preserved longer as a crystallized
function, and a semantic component that would collapse more quickly because it
requires a mnemonic and contextualized effort that makes the cognitive
activity of the individual more complex. This dissociation is confirmed by
studies on Alzheimer's language (Almor 1999), (Kempler 2008), (Bucks 2000),
which have amply demonstrated that one of the first symptoms is anomia, or the
difficulty in finding the lexical target, as opposed to a good ability to
construct sentences up to the advanced stages of the disease. These deficits
would then be compensated through linguistic strategies, such as the frequent
use of pronouns, circumlocutions and passe-partout words in the speech of
Alzheimer's patients: “empty words (“things”, “do”, “he”, “it”, etc.) are
successfully and relatively easily activated precisely because they are high
in frequency and allow the patients to produce fluent and grammatical
sentences in the presence of debilitating semantic deficits” (Almor 1999,
205). In the more advanced stages of the disease, communication becomes
increasingly problematic as Alzheimer's patients experience difficulties in
understanding and constructing a coherent discourse: “their narratives are
often repetitive with topic changes, unclear references, and lack of coherence
and informativeness” (Kempler 2008, 76).

2    Related works

The recent workshop on the creation of medical dialogue corpora (Bhatia et al.
2020) reflects an increasing interest in this specific application field. The
main reason for this interest is the possibility of designing and building
software applications that can assist medical professionals in their daily
work in order to avoid errors: “It is imperative to find a solution to
minimize causes of such errors, via better tooling and visualization or by
providing automated decision support assistants to medical practitioners.”
With this final aim, the creation of medical dialogue corpora can be seen as a
first step toward a virtual medical assistant that can support, speed up and
improve the work of medical practitioners.

As stated in (de la Fuente Garcia 2020), “datasets containing both clinical
information and spontaneous speech suitable for statistical learning are
relatively scarce. In addition, speech data are often collected under
different conditions, such as monologue and dialogue recording protocols.” A
notable example is the Carolinas Conversation Collection (CCC), which is among
the few spontaneous dialogue datasets available in English in the context of
AD research. It is hosted and distributed by the Medical University of South
Carolina (Pope 2011).

The study of AD language with computational methods is fairly recent, but a
number of works have shown the applicability of symbolic and statistical
algorithms to the prediction of dementia and similar diseases (Karlekar et al.
2018, Mirheidari et al. 2019, Kong et al. 2019).

In (Karlekar et al. 2018) neural networks were used on the publicly available
DementiaBank dataset to predict a patient's Alzheimer’s dementia from the
language produced, annotated with POS tags. They reached a precision between
80% and 85%. Interestingly, they showed that there is no significant
difference in prediction results between genders.

In (Mirheidari et al. 2019) an automatic dementia detection system was
presented, including a diarisation unit, an automatic speech recogniser, a
conversation analysis (CA) based acoustic and lexical feature extraction
module and a machine learning classifier, in order to facilitate and improve
screening procedures for dementia. They showed that, using these features,
they can obtain a high precision in detecting dementia for both
neurologist-patient and VirtualAgent-patient conversations.

In (Kong et al. 2019) neural networks were also used on the DementiaBank
dataset. They reached precision results close to the state of the art
(80-85%), and they emphasized the scalability of their neural methods, which
need less data. Moreover, they showed that “the attention mechanism of the
model manages to capture similar key concepts as the information unit features
specified by human experts.”

As for the Italian language, in (Beltrami 2018) the participants (both healthy
and cognitively impaired) were asked to perform three specific tasks, i.e. the
description of a drawing, the details of their last dream and the description
of a working day. The researchers investigated whether the analysis performed
by Natural Language Processing
techniques could reveal alterations of the language performance in early
cognitive decline.

3       The Corpus Anchise 320

Corpus Anchise 320 collects the transcripts of dialogues between healthcare
professionals and patients carried out from 2007 until today by the Anchise
Group, an association of experts (doctors, psychologists, nurses, trainers)
for the research, training and care of the elderly with dementia. The corpus
covers an unselected series of people diagnosed with dementia, not only those
with a diagnosis established according to specific criteria for Alzheimer's
disease. For probabilistic reasons, most patients are affected by AD. The
corpus contains 320 individual conversations, each resulting from the
transcription of about 15 minutes of dialogue in which the patient can speak
freely with the health worker. This peculiarity is of considerable importance
in a field of investigation that was mainly based on "formal
medical-psychological situations of the anamnestic investigation and the
collection of tests" (Lai 2000).

The corpus contains 20,588 conversational turns, consisting of 10,193 turns by
patients with dementia and 10,381 turns by health workers. The total number of
tokens is 222,856 and the total number of types (different words) is 14,513.
Table 1 presents a small portion of one conversation.

7    P    Eh ma mia figlia… è dura, è dura.
          [Eh but my daughter... it's hard, it's hard.]
8    O    Lo sa Marta che da quando abbiamo iniziato a vederci lei parla
          molto, molto meglio?
          [You know, Marta, that since we started seeing each other you talk
          much, much better?]
9    P    Ma a casa mia no! Loro non capiscono. Han detto che non capiscono
          niente…
          [But not at home! They don't understand. They said they don't
          understand anything…]
10   O    Forse sono loro che non capiscono.
          [Maybe it is they who do not understand.]

Table 1: An excerpt from a conversation between a patient (P) and a health
worker (O), with English translation (turns 7 to 10).

The corpus was created in two phases. In the first phase, health professionals
of the Anchise Group made the audio recordings, transcribed portions of the
dialogues and annotated each transcription with a series of metadata, with the
aim of investigating the relationship between language, age, sex and stage of
dementia (MMSE² score). In the second phase, we collected the 320
transcriptions, removed the pragmalinguistic comments of the health
professionals, such as "[Touch the recorder]", "[Silence]", "[Laughs]", etc.,
and analyzed and annotated the corpus as described in the following sections.

² Mini-Mental State Examination.

Corpus Anchise 320 has been built and archived according to the EU General
Data Protection Regulation (GDPR). Audio recordings and transcriptions were
made with the consent of the speaker, as far as possible, of the family member
and of the head of the facility or department. Personal data have been
anonymized. The dataset is not publicly available, but it can be requested
from the authors for research purposes.

4    Computational linguistic analysis of Corpus Anchise 320

In this Section we discuss the results of the lexical analysis (4.1) and of
the morphosyntactic analysis (4.2) carried out on the Corpus Anchise 320.

4.1 Lexical analysis

The Corpus Anchise 320 contains 222,856 tokens and 14,513 types. The ratio
between types and tokens is the Type-Token Ratio (TTR), an index of the
lexical richness of a text (Torruella 2013). The numbers of tokens and types
were then calculated separately for the patients' turns and the health
workers' turns. The results are shown in Table 2.

                    Tokens     Types     TTR
Corpus Anchise      222,856    14,513    0.07
Patients            144,405     8,499    0.06
Health workers       78,451     6,014    0.08

Table 2: Token/types data.
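As a minimal illustration of how a Type-Token Ratio is obtained, the sketch
below computes it in Python, first on a single turn and then from the
corpus-level counts reported in Table 2. The regular-expression tokenizer and
the function names are assumptions made for this example; they do not
reproduce the tokenization used to build the corpus.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (toy tokenizer, an assumption)."""
    return re.findall(r"\w+", text.lower())


def type_token_ratio(tokens):
    """Return (number of tokens, number of types, TTR) for a list of tokens."""
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    ttr = n_types / n_tokens if n_tokens else 0.0
    return n_tokens, n_types, ttr


# Turn 7 of Table 1, spoken by the patient.
turn = "Eh ma mia figlia… è dura, è dura."
print(type_token_ratio(tokenize(turn)))

# The same ratio computed from the (types, tokens) counts reported in Table 2.
table2_counts = [
    ("Corpus Anchise", 14513, 222856),
    ("Patients", 8499, 144405),
    ("Health workers", 6014, 78451),
]
for label, n_types, n_tokens in table2_counts:
    print(label, round(n_types / n_tokens, 2))
```

Applied to the counts of Table 2, the loop gives 0.07, 0.06 and 0.08, the
values listed in the TTR column.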
TTR is low for both speakers. As for the patients, this trend is closely
linked to Alzheimer's disease, in which “the production of high-frequency
words is relatively preserved while the production of low-frequency is
impaired” (Almor 1999, 204). As for the health professionals, this trend
reflects Grice's Cooperative Principle, according to which speakers conform
their conversational contribution to what is required, when it occurs, by the
accepted common intent or by the direction of the verbal exchange. Finally, if
we look at the number of tokens, we see that the patients speak more than the
health workers but with a poorer vocabulary, as shown by their lower lexical
richness index.

A frequency list was then created on the corpus sample of patients with
dementia. Table 3 is the result of a pre-processing phase in which four types
of function words were removed from the frequency list, i.e. adpositions,
determiners, conjunctions and auxiliaries.

The data show that the first 50 words in order of frequency cover 32% of the
entire Corpus Anchise 320 and 49.4% of the patients' speech; the first 100
words cover 40.0% of the entire corpus and 61.8% of the patients' speech; the
first 200 words cover 46.7% of the entire corpus and 72.1% of the words used
by patients. This means that, on an expressive level, patients use just 200
words to cover almost three quarters of all the word tokens they produce in
these conversations.

      Word      Frequency           Word      Frequency
 1    non       3,954         10    adesso    687
 2    sì        3,123         11    lì        666
 3    mi        2,272         12    mia       639
 4    io        2,063         13    qui       639
 5    no        1,546         14    casa      637
 6    eh        1,343         15    me        612
 7    anche     1,032         16    fare      577
 8    bene      1,007         17    lei       552
 9    cosa        768         18    so        535

Table 3: Word frequencies of patients with dementia.

The analysis of the words most used by patients diagnosed with dementia in the
Corpus Anchise 320 shows a high percentage of deictics, such as “io” (“I”),
“qui” (“here”), “lì” (“there”), and the presence of semantically empty words,
such as “cosa” (“thing”) and “cose” (“things”). This finding confirms previous
research on Alzheimer's language regarding word-finding deficits: “the
earliest language deficits observed in DAT is anomia. (...) Semantically empty
words are scattered throughout the DAT patient's utterances in place of
content words, thereby maintaining fluency and sacrificing informational
content." (Kempler 1991, 98). Among the 100 most frequent words, we note the
presence of “casa” (“home”) with 637 occurrences, “mamma” (“mother”) with 394
occurrences, “marito” (“husband”) with 190 occurrences and “figli”
(“children”) with 162 occurrences. Since the corpus contains spontaneous
speech, we can note that the most common topic is the patient’s family.

4.2 Morphosyntactic analysis

The Corpus Anchise 320 was analyzed morpho-syntactically by means of the
StanfordNLP library for Python (Qi 2018). The default pre-trained neural model
for Italian was used. Specifically, tokenization, lemmatization, POS tagging
and dependency parsing were carried out. The resulting annotations, i.e. ID,
Form, Lemma, POS, FEATS, HEAD, DEPREL, were organized according to the CoNLL-U
format (Zeman 2018, Bosco 2014). A linguist reviewed the automatic
annotations.

ID   TOKEN      XPOS    LEMMA      FEATS   HEAD   DEPREL
3    che        PRON    che        ...     4      nsubj
4    fa         VERB    fare       ...     2      acl:relcl
5    un         DET     uno        ...     6      det
6    po'        ADV     poco       ...     7      advmod
7    fatica     NOUN    fatica     ...     4      obj
8    a          ADP     a          ...     9      mark
9    parlare    VERB    parlare    ...     4      xcomp

Table 4: An excerpt from the annotated corpus. Features are not shown due to
space constraints.

The analysis of the linguistic data of patients suffering from dementia was
made using both the LIP³ corpus (De Mauro 1993, 155) and the speech of the
healthcare professionals as a reference.

³ Lessico di frequenza dell’italiano parlato (frequency lexicon of spoken
Italian).
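The part-of-speech percentages discussed next can be derived by counting tags
over annotations laid out as in Table 4. The sketch below is a minimal Python
illustration of that counting step. It assumes a simplified tab-separated
layout with the seven columns of Table 4 (full CoNLL-U files have ten
columns), writes the omitted FEATS values as "_", and uses hypothetical
function names; it is not the code used to produce the tables in this paper.

```python
from collections import Counter

# Rows modelled on Table 4; FEATS is "_" because the table does not show it.
SAMPLE = """\
3\tche\tPRON\tche\t_\t4\tnsubj
4\tfa\tVERB\tfare\t_\t2\tacl:relcl
5\tun\tDET\tuno\t_\t6\tdet
6\tpo'\tADV\tpoco\t_\t7\tadvmod
7\tfatica\tNOUN\tfatica\t_\t4\tobj
8\ta\tADP\ta\t_\t9\tmark
9\tparlare\tVERB\tparlare\t_\t4\txcomp
"""

# Column names of the simplified layout (an assumption of this sketch).
COLUMNS = ("id", "form", "pos", "lemma", "feats", "head", "deprel")


def parse_rows(text):
    """Turn tab-separated annotation lines into dictionaries, skipping blanks and comments."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rows.append(dict(zip(COLUMNS, line.split("\t"))))
    return rows


def pos_shares(rows):
    """Return the percentage of tokens carrying each POS tag."""
    counts = Counter(row["pos"] for row in rows)
    total = sum(counts.values())
    return {tag: 100 * n / total for tag, n in counts.items()}


rows = parse_rows(SAMPLE)
shares = pos_shares(rows)
for tag, share in sorted(shares.items()):
    print(f"{tag}\t{share:.1f}%")
```

Running the same count separately over the patients' turns and the health
workers' turns gives the two distributions that are compared with each other
and with LIP.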
The analysis of the percentages of occurrence of the parts of speech in the
patient corpus sample reveals a higher use of pronouns and adverbs with
respect to both LIP and the corpus of health workers. In LIP, pronouns account
for 10.9% of occurrences and adverbs for 10.1%. If we compare these data with
the rates of occurrence in the patients' speech (Table 5), i.e. 13.9% for
pronouns and 14.2% for adverbs, we notice a notable difference. Furthermore,
these two indices, when added together, are 1.7 percentage points higher than
in the health workers’ speech (ADV 13.2%, PRON 13.2%). This trend would
confirm what was said in the analysis of word frequency, i.e. the difficulty
patients have in accessing the lexicon and their tendency to compensate for
this deficit with the use of deictics, which are closely linked to the
context. If we cross these data with the rate of nouns used by patients (1.6
percentage points lower than in the corpus of the health workers, NOUN 13.2%),
we can deduce that the patient implements a real compensatory strategy linked
to the impairment of access to semantic memory. A significant difference is
also present with respect to LIP, which records a rate of nouns of 15.7%
against 11.6% in the corpus of patients with dementia.

           Patients       %
ADJ           4,843     3.3
ADP          11,177     7.7
ADV          20,560    14.2
AUX          10,586     7.3
CCONJ         5,562     3.8
DET          14,200     9.8
INTJ          7,383     5.1
NOUN         16,787    11.6
NUM           1,145     0.7
PRON         20,118    13.9
PROPN         1,799     1.2
SCONJ         5,078     3.5
VERB         24,749    17.1
X               418     0.3
TOT.        144,405

Table 5: Percentages of occurrence of the parts of speech.

At the morphosyntactic level, it is known in the literature that Alzheimer's
patients do not suffer from serious deficits in sentence construction:
"sentence production in DAT is characterized by intact morphosyntactic
structure (i.e., subject verb agreement, well formed plural and tense
markings)” (Kempler 2008, 75). However, some linguistic phenomena that emerged
from the analysis of the occurrences of the verbal system could be linked to
the spatial-temporal disorientation characteristic of Alzheimer's disease
(Macrì 2016). This disorientation is reflected in the massive use of the
indicative mood, present in 95.9% of cases (Table 6). The use of the
subjunctive and conditional moods is minimal, with percentages between 1% and
2%. This tendency could be explained in terms of cognitive load, since the two
moods require both the ability to imagine possible worlds and, at the level of
sentence construction, the mastery of conjugation and tense agreement.

Verb form    Fin               Inf              Part              Ger
             25,771 (72.9%)    4,655 (13.2%)    4,751 (13.4%)     158 (0.4%)
Mood         Ind               Sub              Imp               Cnd
             24,702 (95.9%)    452 (1.8%)       326 (1.2%)        285 (1.1%)
Tense        Pres              Past             Imp               Fut
             21,958 (71.9%)    4,800 (15.7%)    3,459 (11.33%)    304 (0.9%)

Table 6: Percentages of occurrence of the verbal system.

5     Conclusion and further research

In this paper we presented the first Italian corpus of conversations between
healthcare professionals and people with dementia, called “Corpus Anchise
320”. The study of this corpus with computational linguistic analysis
confirmed some characteristics of the language of people with dementia, such
as the reduction in the rate of nouns and the increase in deictics. Corpus
Anchise 320 has been built and archived according to GDPR. It is not publicly
available, but it can be requested from the authors for research purposes.

The large size of the sample (320 conversations) and the use of computational
analysis will make it possible to identify indicators of pathological language
to be used in the preclinical phase, to trace the change in the linguistic
abilities of people with dementia as the disease progresses, and to relate the
characteristics of the pathological language to a
series of metalinguistic data such as age, sex and degree of dementia. The
corpus will be extended in the coming months with the addition and annotation
of other transcripts of dialogues with people with dementia.

References

Alzheimer's-Association. "2018 Alzheimer's disease facts and figures."
Alzheimer's and Dementia, 2018: 367-429.

Almor, A., Kempler, D., MacDonald, M. C., Andersen, E. S., & Tyler, L. K.
(1999). Why do Alzheimer patients have difficulty with pronouns? Working
memory, semantics, and reference in comprehension and production in
Alzheimer's disease. Brain and language, 67(3), 202-227.

Associazione Gruppo Anchise. http://www.formalzheimer.it/.

P. Bhatia, S. Lin, R. Gangadharaiah, B. Wallace, I. Shafran, C. Shivade, N.
Du, and M. Diab, editors. Proceedings of the First Workshop on Natural
Language Processing for Medical Conversations, Online, July 2020. Association
for Computational Linguistics.

Beltrami, D., Gagliardi, G., Rossini Favretti, R., Ghidoni, E., Tamburini, F.,
& Calzà, L. (2018). Speech analysis by natural language processing techniques:
a possible tool for very early detection of cognitive decline?. Frontiers in
aging neuroscience, 10, 369.

Bucks, R. S., Singh, S., Cuerden, J. M., & Wilcock, G. K. (2000). Analysis of
spontaneous, conversational speech in dementia of Alzheimer type: Evaluation
of an objective technique for analysing lexical performance. Aphasiology,
14(1), 71-91.

Bosco, C., Montemagni, S., Simi, M. (2013). Converting Italian Treebanks:
Towards an Italian Stanford Treebanks.

Bosco, C., Dell'Orletta, F., Montemagni, S., Sanguinetti, M., & Simi, M.
(2014). The EVALITA 2014 dependency parsing task. In EVALITA 2014 Evaluation
of NLP and Speech Tools for Italian (pp. 1-8). Pisa University Press.

de la Fuente Garcia, S., Haider, F., & Luz, S. (2020). Cross-corpus Feature
Learning between Spontaneous Monologue and Dialogue for Automatic
Classification of Alzheimer’s Dementia Speech. In 2020 42nd Annual
International Conference of the IEEE Engineering in Medicine & Biology Society
(EMBC) (pp. 5851-5855). IEEE.

De Mauro, T., Mancini, F., Vedovelli, M., Voghera, M. (1993). Lessico di
frequenza dell'italiano parlato. Milano: Etaslibri.

Folstein, M. F., Folstein, S. E., & McHugh, P. R. (1975). “Mini-mental state”:
a practical method for grading the cognitive state of patients for the
clinician. Journal of psychiatric research, 12(3), 189-198.

S. Karlekar, T. Niu, and M. Bansal. Detecting linguistic characteristics of
Alzheimer’s dementia by interpreting neural models. In Proceedings of the 2018
Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages
701–707, New Orleans, Louisiana, June 2018. Association for Computational
Linguistics.

Kempler, D. (1991). Language Changes in Dementia of Alzheimer Type. In
Dementia and Communication, by Rosemary Lubinsky, 98-114. Philadelphia: B.C.
Decker, Inc.

Kempler, D., & Goral, M. (2008). Language and dementia: Neuropsychological
aspects. Annual review of applied linguistics, 28, 73.

W. Kong, H. Jang, G. Carenini, and T. Field. A neural model for predicting
dementia from language. Volume 106 of Proceedings of Machine Learning
Research, pages 270–286, Ann Arbor, Michigan, 09–10 Aug 2019. PMLR.

Lai, G. (2000). Conversazioni con l'Alzheimer. Prospettive sociali e
sanitarie, 18, 2-5.

Macrì, A. (2016). La lingua della demenza di Alzheimer. Analisi linguistica
del parlato spontaneo. In Le lingue della malattia, 329-424. Milano: Mimesis
Edizioni.

Mirheidari, B., Blackburn, D., Walker, T., Reuber, M., & Christensen, H.
(2019). Dementia detection using automatic analysis of conversations. Computer
Speech & Language, 53, 65-79.

Pope, C., & Davis, B. H. (2011). Finding a
balance: The carolinas conversation collection.
Corpus Linguistics and Linguistic Theory, 7(1),
143-161.

Qi, P., Dozat, T., Zhang, Y., & Manning,
C. D. (2018). Universal Dependency Parsing
from Scratch. In Proceedings of the CoNLL 2018
Shared Task: Multilingual Parsing from Raw
Text to Universal Dependencies, pp. 160-170.

Torruella, J.; Capsada, R. (2013). "Lexical
Statistics and Tipological Structures: A Measure
of Lexical Richness." Procedia - Social and
Behavioral Sciences, pp. 447-454.

Vigorelli, P. (2018). Alzheimer, come parlare e
comunicare nella vita quotidiana nonostante la
malattia. Milano: Franco Angeli Editore.

Vigorelli, P. (2004). La conversazione possibile
con il malato Alzheimer. Milano: Franco Angeli
Editore.

Zeman, D., Hajic, J., Popel, M., Potthast, M.,
Straka, M., Ginter, F., ... & Petrov, S. (2018,
October). CoNLL 2018 shared task: Multilingual
parsing from raw text to universal dependencies.
In Proceedings of the CoNLL 2018 Shared Task:
Multilingual parsing from raw text to universal
dependencies (pp. 1-21).