
Exploring the Use of Cohesive Devices in Dementia within an Elderly Italian Semi-spontaneous Speech Corpus

Giorgia Albertin

Elena Martinelli

Alma Mater Studiorum - University of Bologna, Department of Classical Philology and Italian Studies, 32 Zamboni Street, 40126 Bologna, Italy


The study of language disruption in dementia, aimed at identifying which features correlate with cognitive impairment, is a growing area of computational linguistic research. Still, further development is needed in the analysis of discourse phenomena that also undergo deterioration and that can help expand our understanding of dementia-related speech and refine automatic tools. This paper explores the discourse property of cohesion by investigating three types of cohesive devices: reference, lexical iteration, and connectives. Ten features related to these categories have been defined and automatically extracted from an Italian corpus of semi-spontaneous speech collected from dementia patients and healthy controls. Some of the designed features proved significant for the binary classification of the two groups, and further quantitative analysis highlights interesting differences in the use of cohesive devices that seem to be associated with cognitive decline.

Keywords: Cohesion, Cohesive devices, Dementia, Cognitive Impairment, Semi-spontaneous speech

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† The contribution of each author to the paper is specified in the CRediT authorship statement declaration.
giorgia.albertin3@unibo.it (G. Albertin); elena.martinelli12@unibo.it (E. Martinelli)
https://www.unibo.it/sitoweb/giorgia.albertin3 (G. Albertin)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Linguistic deficits commonly characterize neurodegenerative diseases from their onset. In Dementia, or Major Neurocognitive Disorder (DSM-5 [1]), a syndrome of acquired and progressive impairment in cognitive function that interferes with independence in everyday life, language deterioration manifests itself within a broader framework of cognitive impairment, which can affect memory, visuo-spatial skills, executive functions and reasoning. Deficits in both verbal production and comprehension have been observed, despite the specificity of the different etiological subtypes of Dementia, the most common of which is Alzheimer's Disease (AD), characterized by a primary impairment in episodic memory. In AD, for example, the well-established linguistic deficits include word-finding problems, which comprise anomia, the production of semantic paraphasias [2, 3] and the "tip-of-the-tongue" experience [4], as well as low speech rate, poor word comprehension [5] and, as the disease worsens, a generalized simplification of syntax [6]. The discourse and pragmatic levels are also affected by cognitive decline. Errors in referential cohesion have been registered, in particular regarding the ambiguous use of pronouns [7]. Coherence is compromised, especially in spontaneous speech: the discourse shows an abundance of irrelevant details and an overt difficulty in mentioning the key concept or referring to the topic, resulting in a lack of informativeness in communication [8, 9, 10].

In recent years, speech analysis in cognitive decline has gained increasing importance in the development of low-cost and portable tools for dementia screening, also supported by the remarkable advancements in Natural Language Processing (NLP) and Machine Learning (ML) technologies [11]. The refinement of classification systems goes hand in hand with the operationalization of linguistic features computed from oral productions, which need to be adapted to different languages. Regarding Italian, the OPLON (OPportunities for active and healthy LONgevity) project [2014-2016] was devoted to the automatic extraction of an extensive group of linguistic features from the acoustic, rhythmic, readability, lexical, morpho-syntactic and syntactic levels, from a speech corpus of cognitively impaired patients and healthy peers [12, 13]. Analysis of the significance of the features highlighted that the acoustic ones largely correlated with the cognitive state of the subjects [14].

Expanding the list of language levels covered to include speech properties would enrich the features used for classification and, in addition, could broaden our understanding of how cognitive decline manifests itself in verbal competence. Nevertheless, defining specific features of higher-level and complex phenomena is not trivial. Drawing inspiration from works that propose a "stratified" approach to discourse analysis, which individually considers macro-phenomena that intersect with one another [15, 16], this paper examines cohesion, the property of the superficial form of the text to reflect its internal unity [17]. Cohesion assures continuity in discourse through a network of cohesive devices, which are mainly words or morphemes that contribute to maintaining the semantic relations occurring in the text [17]. Therefore, we propose a method to design and formalize a set of cohesion features, with the aim of observing whether they contribute to discriminating the speech of individuals with dementia from that of healthy peers. Specifically, three types of elements, which Halliday & Hasan [18] indicate among the major contributors to cohesion, were taken into consideration: reference, lexical iteration and connectives.

The implementation of measures based on cohesive devices is the first step towards including discourse properties in the automatic analysis of language in cognitive decline. The study of their interaction with features of other linguistic levels is crucial to observe whether they have a positive impact on the discrimination between dementia subjects and healthy subjects. The work presented in this paper, therefore, is intended as a preliminary analysis that will serve to pursue more sophisticated ML classification in the future.

2. Corpus Description

In this study, we used the corpus collected within the project "Linguistic characteristics of the speech of elderly subjects with dementia" [20, 21], approved by the Bioethics Committee of the University of Bologna (Prot. N. 0072032/2022). The corpus consists of the oral linguistic production of 40 Italian-speaking individuals living in Basilicata, forming two groups balanced by sex and age. Although the initial objective was to balance the cohorts on education level as well, it was not possible to consider this aspect because the information was missing from some patients' medical records. From a sociolinguistic perspective, it is also important to note in advance that some participants, albeit Italian-speaking, were exposed to dialect systems in their lives. This aspect explains the frequent occurrence of substandard linguistic expressions in the collected speech, and will be discussed in Section 4 in relation to the results of the analysis.

The Pathological Group (PG) consists of 20 patients suffering from different forms of dementia (9 cases of Alzheimer's Disease, 2 of Mixed Dementia, 5 of unspecified Dementia, 3 of Vascular Dementia, 1 of Frontotemporal Dementia), recruited at the "Universo Salute - Opera Don Uva (PZ)" rest home, and the Control Group (CG) consists of 20 subjects with neurotypical cognitive aging. Informed consent was obtained from all participants (in the case of patients, from their family members, caregivers, or legal guardians). As a first step, the recruited subjects underwent an evaluation of their cognitive status through the administration of the following four neuropsychological tests: Mini-Mental State Examination (MMSE [22]), Montreal Cognitive Assessment (MoCA [23]), and Verbal Fluency Test, both Phonemic [24, 25, 26, 27] and Semantic [28]. Table 1 summarizes the recruitment criteria and the demographics of the study participants.

Then, two narrative tasks (the story of a journey and the story of Christmas holiday traditions) and one picture description task (using the stimulus figure in "Language Examination II" [19], see Figure 1) were administered to collect semi-spontaneous speech, elicited with the following stimulus sentences: 1) "Do you want to tell me about a trip you took?"; 2) "How do you usually spend Christmas day?"; 3) "Could you describe this figure to me?". This protocol allowed the collection of approximately 9 hours of audio (i.e., 8 hours for the recruited groups and 1 hour for the interviewer), subsequently annotated at various linguistic levels. Using the ELAN software [29], the corpus was manually transcribed at the orthographic level, segmented into utterances (i.e., the reference unit of discursive analysis [30]), and annotated at the prosodic level (theoretical framework: the Language into Act Theory, L-AcT [31]). Table 2 summarizes the size of the corpus and the average material (audio/tokens) collected for each patient and control subject. The total number of tokens was calculated on the orthographic transcription of the corpus (cleaned of annotation tags) and amounts to 49,263 tokens (i.e., 23,518 for PG and 25,745 for CG). Finally, using the Gagliardi & Tamburini pipeline [32], tokenization, lemmatization, part-of-speech tagging, and syntactic parsing were automatically performed for the entire corpus.

Figure 1: Esame del Linguaggio II [19], stimulus figure used in the picture description task.

3. Cohesive Devices' Features

Ten features that quantify the use of cohesive devices by the speakers were designed and formalised. The features were computed for each subject, thus referring to the amount of speech produced by the single individual across the three tasks. To comprehensively address the categories of cohesive devices considered, we used the .conll files resulting from the data annotation as the input for our analysis. The automatic extraction of the features was done via Python scripts. The methodology is described in detail in the following sections.

3.1. Reference

Reference is involved when an expression that requires interpretation by referring to something else occurs in the discourse [18]. This mechanism can be employed in both anaphoric and cataphoric uses, to refer respectively to something already known in the text or to anticipate it. Reference functions either by repetition, which can be partial (e.g., through a synonym) or total, by semantic contiguity, or by substitution with pronouns or other elements [17]. It is the substitution type of referential expression, closely linked to the textual dimension, that is investigated through the features, thus focusing on the occurrence of anaphora and cataphora.

An extensive literature review was necessary to select a relevant group of these expressions in the Italian language (see [33, 34, 35]). The group of elements collected includes pronouns, namely personal (e.g., io, tu, lei, lui), demonstrative (e.g., questo, quello), indefinite (e.g., alcuni, tutti), and possessive pronouns, possessive adjectives (e.g., mio, tuo), as well as deictics (e.g., fuori, sopra, avanti, qua, qui, dentro, dietro, giù, indietro, su, lì, oltre, ci). The occurrences of these groups were counted and divided by the total number of tokens per subject (COE_REF). Additionally, the pronoun density (COE_PRON_DENS), defined as the ratio between pronouns and nouns uttered [36], was computed for each subject.

3.2. Lexical iteration

According to Halliday and Hasan [18], the iteration of a lexical item is a specific use of the repetition-type referential mechanism, which acquires cohesive force on its own because it is typically used when the referent is farther away in the text. This set of features focuses on the repetition of three main open-class categories, namely nouns, (main) verbs, and adjectives. The use of words from these classes affects the richness of vocabulary, reflecting the speaker's tendency toward lexical variation. Word-finding problems occurring in cognitive decline often manifest as difficulties in retrieving forms from the lexicon. The repetition of the same words can then occur as a sort of repair mechanism, resulting in semantically impoverished speech. Conversely, the use of some types of closed-class particles, such as prepositions and auxiliaries, is bound to the syntactic structure.

Lexical iteration features were computed by separately considering word forms and lemmas of nouns, verbs, and adjectives. These features include the repetitions of elements divided by the total number of words (COE_RIP_LEM, COE_RIP_WORD), the average number of repetitions for repeated elements (COE_MEDRIP_LEM, COE_MEDRIP_WORD), and the maximum number of repetitions over the total number of iterations (COE_MAXRIP_LEM, COE_MAXRIP_WORD).
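As an illustration only, the three families of iteration measures can be sketched as follows. The function name `iteration_features` and the toy input are hypothetical, and the sketch assumes that every occurrence of an item after its first one counts as one repetition and that "total number of iterations" means the total number of such repetitions; the extraction scripts used in the study may count differently.

```python
from collections import Counter


def iteration_features(items, total_words):
    """Compute repetition measures for a list of lemmas or word forms.

    `items` holds only nouns, main verbs and adjectives; `total_words`
    is the subject's overall word count. Both names are hypothetical.
    """
    counts = Counter(items)
    # Assumption: each occurrence after the first is one repetition.
    reps = [n - 1 for n in counts.values() if n > 1]
    total_reps = sum(reps)
    return {
        # repetitions over the total number of words (COE_RIP_* analogue)
        "rip": total_reps / total_words if total_words else 0.0,
        # mean repetitions per repeated item (COE_MEDRIP_* analogue)
        "medrip": total_reps / len(reps) if reps else 0.0,
        # largest repetition count over all repetitions (COE_MAXRIP_* analogue)
        "maxrip": max(reps) / total_reps if reps else 0.0,
    }


# Toy input: "casa" occurs three times, "andare" twice, "bello" once.
lemmas = ["casa", "andare", "casa", "bello", "andare", "casa"]
feats = iteration_features(lemmas, total_words=12)
```

Running the same function once on lemmas and once on word forms would give the two variants of each measure.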

3.3. Connectives

As defined by Ferrari [37], connectives are morphologically invariable forms (e.g., conjunctions or locutions) that explicitly indicate logical relations within parts of the text and pertain to the logical level. Elements from different grammatical classes can be used as connectives and are classified based on their function, which usually reflects their meaning (e.g., temporal, causal, additive).

To compile an extensive list of connectives, we relied on the Lexicon of Italian Connectives (LICO; see footnote 1) [38, 39]. LICO contains 173 entries, including single words (e.g., e, se, ma, infatti, quando, quindi), complex expressions (e.g., a causa di, da allora), and correlatives (e.g., da un lato ... dall'altro). Connectives are reported along with their lexical or orthographic variants, part-of-speech category, the semantic relations conveyed according to the Penn Discourse Treebank 3.0 schema [40], examples of usage, and alignments with connectives from other languages. One feature computes the occurrences of connectives relative to the total number of tokens per subject (COE_TC).
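A minimal sketch of how a token-normalised connective count could be obtained is given below. It is not the extraction script used in the study: `SAMPLE_CONNECTIVES` contains only the few entries quoted above rather than the full lexicon, the function names are hypothetical, matching is purely string-based with longest-match-first, and discontinuous correlatives are not handled.

```python
# Hypothetical sample: a handful of entries quoted in the text, not the full LICO.
SAMPLE_CONNECTIVES = ["e", "se", "ma", "infatti", "quando", "quindi",
                      "a causa di", "da allora"]


def count_connectives(tokens, lexicon):
    """Count lexicon entries in a token list, preferring longer entries."""
    entries = sorted((entry.split() for entry in lexicon), key=len, reverse=True)
    tokens = [t.lower() for t in tokens]
    count, i = 0, 0
    while i < len(tokens):
        for entry in entries:
            if tokens[i:i + len(entry)] == entry:
                count += 1
                i += len(entry)  # skip past the matched expression
                break
        else:
            i += 1  # no entry starts at this position
    return count


def coe_tc(tokens, lexicon=SAMPLE_CONNECTIVES):
    """Connective occurrences divided by the total number of tokens."""
    return count_connectives(tokens, lexicon) / len(tokens) if tokens else 0.0


tokens = "siamo rimasti a casa a causa di la pioggia e quindi niente viaggio".split()
n_conn = count_connectives(tokens, SAMPLE_CONNECTIVES)
ratio = coe_tc(tokens)
```

String matching alone counts every occurrence of an ambiguous form such as "e" or "se" as a connective; the part-of-speech information that LICO records for each entry could be used to filter such cases.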

Finally, the last feature was designed as an attempt to capture the overall impact of the classes of cohesive devices studied in this paper in the two cohorts of corpus speakers. Therefore, the role of cohesion elements was comprehensively measured in COE_TOT by summing referential-substitute expressions, lexical iteration items and connectives, divided by the total number of words.
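Read literally, the comprehensive measure is a single ratio. The sketch below spells it out under stated assumptions: the function and argument names are hypothetical, the counts are toy values rather than corpus data, and the source does not specify whether the lexical iteration count is taken over lemmas or word forms.

```python
def coe_tot(n_reference, n_iteration, n_connectives, total_words):
    """Sum of the three cohesive-device counts over the total word count."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return (n_reference + n_iteration + n_connectives) / total_words


# Toy counts for one hypothetical speaker.
value = coe_tot(n_reference=150, n_iteration=70, n_connectives=30, total_words=1000)
```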

Figure 3.3 shows, as an example, an excerpt from the annotation in .conll format, in which some of the linguistic elements considered are highlighted.
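To make the input format concrete, the following sketch reads a small annotation excerpt and derives a pronoun-to-noun ratio in the spirit of COE_PRON_DENS. The column layout is an assumption: it follows the common CoNLL-U order (ID, FORM, LEMMA, UPOS, tab-separated), which may differ from the files produced by the pipeline, and the sample sentence, tag names and function names are hypothetical.

```python
def read_conll(text):
    """Parse tab-separated annotation lines into token dictionaries."""
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) < 4 or not cols[0].isdigit():
            continue  # skip range IDs such as "1-2" and malformed lines
        tokens.append({"form": cols[1], "lemma": cols[2], "upos": cols[3]})
    return tokens


def pronoun_density(tokens):
    """Ratio between pronouns and nouns (0.0 when no noun occurs)."""
    n_pron = sum(t["upos"] == "PRON" for t in tokens)
    n_noun = sum(t["upos"] == "NOUN" for t in tokens)
    return n_pron / n_noun if n_noun else 0.0


# Invented four-token sentence, assuming Universal POS tags.
SAMPLE = "\n".join([
    "# text = lui vede la casa",
    "1\tlui\tlui\tPRON",
    "2\tvede\tvedere\tVERB",
    "3\tla\til\tDET",
    "4\tcasa\tcasa\tNOUN",
])

parsed = read_conll(SAMPLE)
density = pronoun_density(parsed)
```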

1 http://connective-lex.info/

4. Results

The statistical significance of the cohesion features for the binary discrimination of the PG and CG cohorts was calculated using the non-parametric Kolmogorov-Smirnov test, due to the limited sample size of the corpus. Given the number of comparisons performed, we adjusted the results with the Bonferroni correction to control for Type I error. This approach involves adjusting the significance level by dividing the conventional alpha value (0.05) by the total number of comparisons made. The results of the test, reported in Table 3, show that two of the designed features contribute significantly to differentiating the two groups: a feature related to the iteration of lemmas (COE_RIP_LEM) and the comprehensive feature of cohesive devices (COE_TOT). The distribution of these features is reported in Figure 4.
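The testing procedure can be sketched in plain Python as below. This is an illustration under assumptions, not the analysis code of the study: the p-value uses the asymptotic Kolmogorov series, which is only approximate for small samples (statistical libraries such as SciPy offer `scipy.stats.ks_2samp` with exact methods), the number of comparisons is assumed to be ten (one per feature), and the feature values are toy numbers.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        fa = bisect_right(a, v) / len(a)
        fb = bisect_right(b, v) / len(b)
        d = max(d, abs(fa - fb))
    return d


def ks_pvalue_asymptotic(d, n, m):
    """Approximate two-sided p-value from the asymptotic Kolmogorov series."""
    ne = n * m / (n + m)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.3:
        return 1.0  # the series converges slowly here and p is close to 1
    series = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                 for j in range(1, 101))
    return min(1.0, max(0.0, 2.0 * series))


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level after Bonferroni correction."""
    return alpha / n_comparisons


# Toy feature values for two groups of 20 speakers (not corpus data).
pg = [0.10 + 0.001 * i for i in range(20)]
cg = [0.20 + 0.001 * i for i in range(20)]

d = ks_statistic(pg, cg)
p = ks_pvalue_asymptotic(d, len(pg), len(cg))
alpha = bonferroni_alpha(0.05, 10)  # assumed: ten comparisons
is_significant = p < alpha
```

With ten comparisons, a feature is retained only if its p-value falls below 0.05 / 10 = 0.005, which is why features that pass the uncorrected 0.05 level can fail after correction.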

After the application of Bonferroni's correction, two features that were significant at the uncorrected level, namely COE_TC and COE_MAXRIP_WORD, no longer reached significance. Given the exploratory nature of the experiment, which involves the formalisation of new features in order to discriminate subjects with cognitive impairment from healthy controls in Italian, we have nevertheless chosen to highlight the p-values of these features in Table 3.

We can observe that, compared with the control group, the speech of dementia subjects is characterized by fewer repetitions of the same noun, verb and adjective lemmas out of the total number of words uttered, as captured by COE_RIP_LEM. Thus, in this dataset, the PG appears less prone to lexical iteration of lemmas than the CG. However, if we look at the distributions of the occurrences of the cohesive elements considered, reported in Table 4, interesting trends can be noticed. Indeed, the quantitative analysis of lexical repetitions revealed a disparity between repeated lemmas and repeated word forms of the same grammatical categories (nouns, adjectives and verbs) between the two groups. Specifically, despite the high variability due to subjective differences, in PG the average repetition of forms (mean=74.15) is higher than the repetition of lemmas (mean=68.9), while the two values are very similar in CG (lemmas: mean=87.05, words: mean=87.8). This imbalance in favor of forms in the dementia patients appears to reveal lexical impoverishment compared to healthy subjects. Indeed, in CG, although a higher overall number of repetitions is registered, it is combined with a more balanced distribution between lemmas and forms, suggesting greater lexical variety.

The opposing trend observed between lemmas and forms could also be explained by the sociolinguistic profile of the data, related to the diatopic variation of the Italian language [41]. Indeed, speakers from both groups show an extensive use of dialectal terms and structures characteristic of the Italian variety spoken in the Lucanian Apennine area. As reported in Section 2, the annotation was conducted automatically using the pipeline developed by Gagliardi & Tamburini [32], which is designed to analyze standard Italian. Therefore, it is likely that the system struggled to handle some substandard expressions, which often diverge orthographically from the other words in the transcription, as can be observed in this example from a PG subject: gemm' a trua' [=andammo a fare visita] a mia suocera, ca [=che] mio suocero è morto (...).

It cannot be excluded that the presence of dialect may also have influenced the automatic extraction of other cohesive devices. Indeed, the higher frequency in CG of substitution-type reference items (mean=161) and connectives (mean=36.65) compared to PG (ref. mean=146.5, conn. mean=23.8) contrasts with what has been observed in the oral production of narrative discourse in cohorts of dementia subjects and healthy controls [8]. Therefore, we consider the possibility that automatic feature extraction performed on manually checked annotation may yield results different from those obtained here.

Nevertheless, the significance of the comprehensive feature (COE_TOT) indicates that the use of the cohesive devices investigated in this paper plays a role in distinguishing dementia subjects from healthy controls. In Figure 4 it can be noted that COE_TOT shows, on average, lower values for the PG compared to the CG. This result suggests that the linguistic processing of some phenomena related to cohesion (i.e., substitution-type reference elements, lexical iteration items, and connectives) is generally affected by cognitive decline in semi-spontaneous speech. Thus, the analysis of discourse properties seems to be a promising path for studying the linguistic characterisation of neurodegenerative disorders. We therefore hope that in the future our approach can be applied to phenomena closely related to cohesion, first of all coherence, or extended to other domains, such as pragmatics, that may mask subtle clues of cognitive frailty.

5. Conclusion

In this work, we present a methodology for delineating linguistic features of cohesion to track and study changes in discourse properties in the speech of individuals with cognitive impairment compared to healthy peers. The research focused on three types of cohesive devices, i.e., reference, lexical iteration, and connectives, that were automatically extracted from an Italian corpus of semi-spontaneous speech from dementia subjects and controls, collected in Basilicata. Statistical significance for binary discrimination was computed by applying the Kolmogorov-Smirnov test and then adjusting the results with Bonferroni's method. The test shows that a feature based on the repetition of lemmas and the one related to the set of cohesive devices jointly considered contribute to distinguishing the two groups. Moreover, the quantitative distribution of the cohesive devices reveals differences in the use of elements within the considered categories between PG and CG, which seem to highlight a general deterioration in discursive competencies associated with dementia. The results obtained provide a preliminary basis for further study of discourse properties in cognitive decline, with the aim of extending the set of linguistic features that can be automatically extracted to other levels of language. This expansion is intended to refine digital systems that could be employed as support for the early diagnosis and monitoring of neurodegenerative diseases, potentially improving timely interventions for patients and their caregivers.

CRediT authorship statement declaration

GA: Conceptualization, Methodology, Software (i.e. features formalization), Formal analysis, Writing (§ 1, 3, 4, 5).

EM: Resources (i.e. data collection), Data curation (i.e. manual transcription), Writing (§ 2).

References

[1] American Psychiatric Association, Diagnostic and statistical manual of mental disorders: DSM-5, volume 5, American Psychiatric Association, Washington, DC, 2013.
[2] E. Catricalà, P. A. Della Rosa, V. Plebani, D. Perani, P. Garrard, S. F. Cappa, Semantic feature degradation and naming performance. Evidence from neurodegenerative disorders, Brain and language 147 (2015) 58–65.
[3] V. Taler, N. A. Phillips, Language performance in Alzheimer's disease and mild cognitive impairment: a comparative review, Journal of clinical and experimental neuropsychology 30 (2008) 501–556.
[4] E. A. Stamatakis, M. A. Shafto, G. Williams, P. Tam, L. K. Tyler, White matter changes and word finding failures with increasing age, PloS one 6 (2011) e14496.
[5] A. E. Budson, N. W. Kowall, The handbook of Alzheimer's disease and other dementias, John Wiley & Sons, 2011.
[6] S. O. Orimaye, J. S.-M. Wong, K. J. Golden, Learning predictive linguistic features for Alzheimer's disease and related dementias using verbal utterances, in: Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From linguistic signal to clinical reality, 2014, pp. 78–87.
[7] S. Carlomagno, A. Santoro, A. Menditti, M. Pandolfi, A. Marini, Referential communication in Alzheimer's type dementia, Cortex 41 (2005) 520–534.
[8] C. Drummond, G. Coutinho, R. P. Fonseca, N. Assunção, A. Teldeschi, R. de Oliveira-Souza, J. Moll, F. Tovar-Moll, P. Mattos, Deficits in narrative discourse elicited by visual stimuli are already present in patients with mild cognitive impairment, Frontiers in aging neuroscience 7 (2015) 96.
[9] S. Ahmed, A.-M. F. Haigh, C. A. de Jager, P. Garrard, Connected speech as a marker of disease progression in autopsy-proven Alzheimer's disease, Brain 136 (2013) 3727–3737.
[10] T. Bschor, K.-P. Kühl, F. M. Reischies, Spontaneous speech of patients with dementia of the Alzheimer type and mild cognitive impairment, International psychogeriatrics 13 (2001) 289–298.
[11] S. De la Fuente Garcia, C. W. Ritchie, S. Luz, Artificial intelligence, speech, and language processing approaches to monitoring Alzheimer's disease: a systematic review, Journal of Alzheimer's Disease 78 (2020) 1547–1574.
[12] L. Calzà, G. Gagliardi, R. R. Favretti, F. Tamburini, Linguistic features and automatic classifiers for identifying mild cognitive impairment and dementia, Computer Speech & Language 65 (2021) 101113.
[13] D. Beltrami, G. Gagliardi, R. Rossini Favretti, E. Ghidoni, F. Tamburini, L. Calzà, Speech analysis by natural language processing techniques: a possible tool for very early detection of cognitive decline?, Frontiers in aging neuroscience 10 (2018) 369.
[14] G. Gagliardi, F. Tamburini, Linguistic biomarkers for the detection of mild cognitive impairment, Lingue e linguaggio 20 (2021) 3–31.
[15] B. S. Kim, Y. B. Kim, H. Kim, Discourse measures to differentiate between mild cognitive impairment and healthy aging, Frontiers in aging neuroscience 11 (2019) 221.
[16] J. Kim, J. Shim, J. H. Yoon, Subjective rating scale for