=Paper=
{{Paper
|id=Vol-3033/paper8
|storemode=property
|title=Moving from Human Ratings to Word Vectors to Classify People With Focal Dementias: Are We There Yet?
|pdfUrl=https://ceur-ws.org/Vol-3033/paper8.pdf
|volume=Vol-3033
|authors=Chiara Barattieri di San Pietro,Marco Marelli,Carlo Reverberi
|dblpUrl=https://dblp.org/rec/conf/clic-it/PietroMR21
}}
==Moving from Human Ratings to Word Vectors to Classify People With Focal Dementias: Are We There Yet?==
Moving from Human Ratings to Word Vectors to Classify People with Focal Dementias: Are We There Yet?

Chiara Barattieri di San Pietro (1,2), Marco Marelli (1), Carlo Reverberi (1)
1. Università degli Studi di Milano-Bicocca, Milano, Italy
2. Università degli Studi di Verona, Verona, Italy
chiara.barattieridisanpietro@unimib.it, carlo.reverberi@unimib.it, marco.marelli@unimib.it

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Fine-grained variables based on the semantic proximity of words can provide helpful diagnostic information when applied to the analysis of Verbal Fluency tasks. However, before leaving human-based ratings in favour of measures derived from distributional approaches, it is essential to assess the performance of the latter against that of the former. In this work, we analysed a Verbal Fluency task using measures of semantic proximity derived from Distributional Semantic Models of language, and we show that Machine Learning models based on them are less accurate in classifying patients with focal dementias than the same models built on human-based ratings. We discuss the possible interpretation of these results and the implications for the application of distributional semantics in clinical settings.

1 Introduction

A Verbal Fluency (VF) task (Lezak et al., 2004) is a test routinely used in neuropsychological practice that requires participants to produce as many words as possible belonging to a given semantic category (e.g., "colours", "animals", etc.) within a time limit (typically 60 seconds). It is commonly used to study lexical retrieval, and the subject's performance is standardly rated by the number of correct words produced for a given cue. However, to overcome the opacity of the overall score and help distinguish the different cognitive functions underpinning VF performance, additional measures of VF performance have been proposed. Among these are the number of consecutive words produced that share similar properties, such as being a citrus fruit (this is called a "semantic cluster", and its size is a clinically useful variable), and the total number of transitions between clusters (called the "number of switches"; Troyer et al., 1997). Indeed, by characterising a semantic VF task (category "fruits") using the number of semantic categories produced, the average semantic proximity between words, and the number of new words and out-of-category words, it has been possible to classify people with and without focal dementias, as well as across three different subtypes of dementia (Fronto-Temporal Dementia versus Primary Progressive Aphasia versus Semantic Dementia), with good accuracy (78% accuracy for the patients vs healthy controls classification, and 58.3% accuracy for the classification across the three pathological subcategories; Reverberi et al., 2014). One shortcoming of this model, however, is that those VF indexes are built upon human-based ratings of semantic proximity between pairs of words collected from a sample of healthy controls, making it hard to extend the same approach to words for which human judgments were not previously collected, i.e., other semantic categories.
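To make the cluster and switch measures concrete, consider a constructed example (not taken from the study data): a participant producing "orange, lemon, lime, banana, strawberry" generates a citrus cluster of size three ("orange, lemon, lime"), followed by a switch to "banana" and a further switch to "strawberry" (assuming these fall into different subcategories), so the production would be scored as one large cluster and two switches.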
Recent advances in Natural Language Processing techniques could help overcome this limitation. Distributional Semantic Models (DSMs) of language start from lexical co-occurrences extracted from large text corpora (Turney & Pantel, 2010) and, by applying different computational techniques, end up representing word meanings as numerical vectors in a multidimensional space. Here, terms that are semantically related are located close to each other. Such models can be used to simulate the structure of conceptual knowledge implied in the performance of semantic tasks such as a VF task. Indeed, DSMs have been successfully applied to different tasks involving semantic relationships (Mandera et al., 2017), including the analysis of VF tasks to classify patients with Alzheimer's disease (Linz et al., 2017), reaching remarkable accuracy (F1 = 0.77). However, despite this success, questions have been raised concerning what exactly distributional models can learn (Erk, 2016) and whether such models are sufficiently rich in terms of encoded features (Lucy and Gauthier, 2017) to be applied to all sorts of semantic tasks and problems.

The present study aims to test whether the analysis of a VF task based on DSM-derived measures would reproduce the results of an analysis based on human-derived measures. In particular, we decided to re-analyse the original data of a semantic VF task (category "fruit") that Reverberi et al. collected on a cohort of participants with focal dementias and healthy controls (CTR). Focal dementias are neurodegenerative diseases that cause deterioration of cognitive function, including language. The original cohort included people with Fronto-Temporal Dementia (FTD), Primary Progressive Aphasia (PPA), and Semantic Dementia (SD). Each diagnostic group presents peculiar linguistic symptomatology, making these syndromes ideal candidates for a differential approach. The human-based indexes of VF (see Section 2 for details) were adapted to be computed on different DSMs (Landauer & Dumais, 1997; Mikolov et al., 2013). Specifically, we adopted two predict models and one count model. All three semantic spaces were based on the itWaC web-crawled corpus (Baroni et al., 2009).

2 Materials and Methods

The verbal production to a semantic VF task (category "fruits") from the original cohort of 371 subjects (Table 1) was analysed. The overall dataset comprised N = 3,642 words, with 133 unique words.

            PPA        FTD        SD         CTR
Number      16         33         15         307
Age         73.6±3.4   67.0±6.1   67.9±6.5   54.9±17
Education   7±4.6      8.6±4.4    9.3±4.9    9.6±5

Table 1: Demographic information for all the subject groups (age and education in years).

Data were entered in an R pipeline, leveraging two word2vec (Mikolov et al., 2013) semantic spaces ("WEISS1" and "WEISS2") and an LSA space with identical vocabulary size ("LSA"). The two predict models (Word-Embeddings Italian Semantic Space 1 and 2, "WEISS1" and "WEISS2") were obtained from Marelli (2017) and were chosen for both their practical accessibility (http://meshugga.ugent.be/snaut-italian) and their proven good performance in previous studies (Mancuso et al., 2020; Nadalini et al., 2018). WEISS1 is based on a CBOW model with 400 dimensions and a 9-word window; WEISS2 is based on a CBOW model with 200 dimensions and a 5-word window. Both models consider words with a minimum frequency of 100 in the original corpus. The count model based on Latent Semantic Analysis ("LSA") was created ad hoc for this study following Günther and colleagues' (2015) procedure. Many psycholinguistic studies applying LSA in the English language used the TASA corpus (http://lsa.colorado.edu, including 12,190,931 tokens), which is a far smaller corpus than itWaC (about 1.9 billion tokens). To ensure comparability with this previous literature, we extracted a subset of the itWaC corpus to match the TASA size. We selected an untagged set of 91,058 documents randomly extracted from itWaC, comprising the same set of words (N = 180,080) as the WEISS semantic spaces. The matrix of co-occurrences was built using the DISSECT toolkit (Dinu et al., 2013), applying a Positive Pointwise Mutual Information weighting scheme (Niwa & Nitta, 1995), followed by dimensionality reduction through Singular Value Decomposition. We set the number of dimensions at 300 following Landauer and Dumais (1997), who report good performance for dimensionalities ranging from 300 to 1,000.
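The weighting and reduction steps can be sketched in R as follows. This is a minimal illustration, not the original DISSECT-based pipeline: it assumes a raw word-by-context co-occurrence matrix `counts` with target words as row names has already been extracted from the itWaC subset, and for a vocabulary of 180,080 words a sparse-matrix implementation would be needed in practice.

```r
# Sketch of PPMI weighting followed by truncated SVD (k = 300), as described above.
ppmi_svd <- function(counts, k = 300) {
  total <- sum(counts)
  p_ij  <- counts / total                   # joint probabilities
  p_i   <- rowSums(counts) / total          # row (target word) marginals
  p_j   <- colSums(counts) / total          # column (context) marginals
  pmi   <- log2(p_ij / outer(p_i, p_j))     # pointwise mutual information
  ppmi  <- pmax(pmi, 0)                     # keep only positive values
  ppmi[!is.finite(ppmi)] <- 0               # zero counts give -Inf/NaN: set to 0
  dec   <- svd(ppmi, nu = k, nv = 0)        # truncated singular value decomposition
  vecs  <- dec$u %*% diag(dec$d[1:k])       # k-dimensional word vectors
  rownames(vecs) <- rownames(counts)
  vecs
}
```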
For each participant, the pipeline outputs three sets of semantic indexes computed according to five different thresholds (set to identify the occurrence of a semantic switch), corresponding to the 10th, 30th, 50th, 70th, and 90th quantiles of the distribution of semantic relatedness values (Table 2), computed considering the cosine proximity of all adjacent words produced by the whole study cohort.

         10th   30th   50th   70th   90th
WEISS1   .185   .226   .247   .268   .287
WEISS2   .303   .371   .405   .434   .463
LSA      .336   .431   .479   .519   .582

Table 2: Cosine values adopted as thresholds for the three semantic spaces.
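The derivation of these thresholds, and the switch counts based on them, can be illustrated with a short R sketch (illustrative only, not the authors' code). It assumes a matrix `space` of word vectors with words as row names and a list `productions` of character vectors, one per participant; for brevity, the clustering rule of the paper, which is based on the mean within-cluster cosine, is simplified here to a comparison between adjacent words.

```r
# Cosine proximity of adjacent words, quantile-based thresholds, and switch counts.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

adjacent_proximities <- function(words, space) {
  words <- words[words %in% rownames(space)]        # drop out-of-vocabulary items
  if (length(words) < 2) return(numeric(0))
  sapply(seq_len(length(words) - 1),
         function(i) cosine(space[words[i], ], space[words[i + 1], ]))
}

# Thresholds: quantiles of the proximity distribution over the whole cohort.
all_prox   <- unlist(lapply(productions, adjacent_proximities, space = space))
thresholds <- quantile(all_prox, probs = c(.10, .30, .50, .70, .90))

# A switch is counted whenever adjacent-word proximity falls below the threshold;
# clusters are the runs of words between two switches.
count_switches <- function(words, space, threshold) {
  sum(adjacent_proximities(words, space) < threshold)
}
```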
For each participant, we computed the following 9 indexes of VF:

1) Total number of valid words produced in 1 minute, excluding repetitions. Differently from the original work, words not included in the vocabulary of the semantic space were necessarily excluded, while words not belonging to the category "fruit" were kept. Due to limitations of the semantic space's vocabulary, 53 words and compound expressions (8 from the patient group and 45 from the control group) out of the 3,642 (1.5%) were removed from the data;

2) Repetitions ("rep"): the total number of repeated words;

3) Total number of switches ("switch"): the computational equivalent of the "number of switches between subcategories" in the original work. Semantic switches were identified based on measures of semantic relatedness obtained from the three semantic spaces and according to the five different thresholds (Table 2);

4) Total number of semantic clusters ("NC"): the computational equivalent of the "number of subcategories" in the original work. Clusters were identified based on the occurrence of a semantic switch, i.e., when the mean value of the cosine similarity of words within a cluster drops below the identified threshold (Table 2);

5) Mean size of clusters ("SC"): the mean number of words within a semantic cluster; the computational equivalent of the "relative switching" index in the original work;

6) Average semantic proximity ("prox"): the semantic distance between adjacent words. Unlike the original index, based on human-derived estimates of semantic proximity (Reverberi et al., 2006), we derived this index from the mean cosine between the vectorial representations of adjacent words in the participants' production.

In addition, to ascertain the replicability of the original results with computational methodologies, the following indexes were adapted from the original work:

7) Mean familiarity ("fam"): as a computational equivalent of the original index, calculated according to familiarity scores collected from a sample of healthy controls (Reverberi et al., 2004), we computed the raw word frequency as derived from the corpus of reference (itWaC), converted to lower case and excluding metadata;

8) Out-of-category words ("OOC"): the number of words not pertaining to the 15 subcategories of "fruit" identified in previous works by the same authors (Reverberi et al., 2004; 2006). Given that the vectorial representation of words differs according to inflectional morphology, data were not normalised (singular to plural) but kept as originally produced;

9) Order Index ("OI"): computed following the formula proposed in Reverberi et al., 2006. In its simplified notation, the Order Index is equivalent to the difference between the theoretical maximum number of switches (total number of words minus 1) and the actual observed switches, divided by the range of theoretically possible switches (total number of words minus 1, minus total number of clusters minus 1); that is, OI = ((n - 1) - s) / ((n - 1) - (c - 1)), where n is the total number of words, s the observed number of switches, and c the number of clusters. To avoid non-linearity problems, each participant's production is represented in a three-dimensional space having the number of words, the number of switches, and the number of subcategories as axes: the order index is then transformed using the arctangents of the resulting segments.

2.1 Statistical Analyses

All variables of interest were pre-processed to remove variance due to differences in age, level of education, and the total number of words. We ran a linear regression analysis with the relevant variable as the dependent factor and with age, education, and the total number of words as regressors (only considering healthy subjects, to avoid any potential bias in the estimates due to brain damage). We then used the regression coefficients to compute the residuals for each variable and all subjects. The residuals were then used as predicting variables for the classification analysis. The average of each variable for each patient group was compared with the respective average in the control group through a two-sample t-test, Bonferroni corrected.

2.2 Classification Analysis

The R packages caret and e1071 (interfaces to LIBSVM; Chang & Lin, 2011) were used. The aim of the classification analysis was to determine: i) which variables, alone or in combination, would be able to classify a subject as being either a patient or a control; and ii) which variables, alone or in combination, would best classify a patient as a member of one of the three dementia groups (FTD, PPA, SD).

After removing variance due to differences in age and education, we performed a Leave-One-Out Cross-Validation (LOOCV) analysis. The model kernels were set as linear, and relative weights were added to counterbalance the difference in group numerosity. In LOOCV, a data instance is left out, and a model is constructed on all other data instances in the training set. The model is tested against the data point left out, and the associated error is recorded. The process is then repeated for all data points, and the overall prediction error is calculated by taking the average of the recorded test error estimates. The LOOCV analysis was repeated for each combination of the 9 variables of interest, for each of the 3 semantic spaces, and for each of the 5 thresholds, resulting in 7,665 models.
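The two procedures above can be sketched in R as follows (illustrative only, not the original pipeline): first the residualisation of Section 2.1, fitted on healthy controls and applied to all subjects, then the leave-one-out classification of Section 2.2 with a linear, class-weighted SVM from e1071. The data frame `dat`, its column names (`group`, `age`, `education`, `total_words`, `switch`), and the data frame `d` holding the residualised predictors plus a factor `label` are assumed names introduced for illustration; the inverse-frequency weighting is one plausible reading of the "relative weights" mentioned above.

```r
library(e1071)   # linear SVM (interface to LIBSVM)

## Section 2.1: residualise an index against age, education and total word count,
## estimating the regression on healthy controls only (assumed column names).
residualise <- function(dat, var) {
  ctr <- subset(dat, group == "CTR")
  m   <- lm(reformulate(c("age", "education", "total_words"), response = var),
            data = ctr)
  dat[[var]] - predict(m, newdata = dat)       # residuals for all subjects
}
dat$switch_res <- residualise(dat, "switch")

# Two-sample t-test against controls (here for FTD), Bonferroni-corrected over 9 indexes.
p_raw <- t.test(subset(dat, group == "FTD")$switch_res,
                subset(dat, group == "CTR")$switch_res)$p.value
p_adj <- min(1, p_raw * 9)

## Section 2.2: leave-one-out cross-validation with a linear SVM and class weights.
## `d` holds one row per subject: a factor `label` plus the residualised predictors.
tab <- table(d$label)
w   <- setNames(as.numeric(sum(tab) / (length(tab) * tab)), names(tab))  # inverse-frequency weights

preds <- character(nrow(d))
for (i in seq_len(nrow(d))) {
  fit <- svm(label ~ ., data = d[-i, ], kernel = "linear", class.weights = w)
  preds[i] <- as.character(predict(fit, newdata = d[i, , drop = FALSE]))
}
mean(preds == as.character(d$label))           # overall LOOCV accuracy
```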
3 Results

We compared the performance of each group to that of healthy controls for each of the nine variables considered. All pathological groups significantly differed from the controls on at least one variable (Table 3). In the classification analysis, we investigated which variables (alone, or in all the possible combinations with other variables, i.e., 511 combinations) would best predict the membership of participants. We carried out two sets of analyses: i) healthy controls versus participants with focal dementias (PPA, FTD, and SD); and ii) participants with PPA versus participants with FTD versus participants with SD. The analysis was performed for each semantic space and for each preidentified threshold, for a total number of 7,665 models.

FTD PPA SD
Proximity +
Familiarity
New words + +
Out-Of-Category
N Switches +
N Cluster +
Size Cluster + + +
Order Index + +
Repetitions

Table 3: Variables that are significantly different between a given pathological group vis-à-vis healthy controls. Bonferroni-corrected results for multiple comparisons are reported.

The best classification performance for patients versus healthy controls was found when we considered the variables "total number of new words" and "Order Index", at any threshold and with all semantic spaces. In these cases, the overall accuracy of the models was 61.2%, with a sensitivity of 57.4% and a specificity of 79.7% (Table 4).

SS          | Thres. | Vars                  | Acc. | Sens. | Spec.
Human-based | -      | NC + prox + new + OOC | 84   | 86    | 82
all         | all    | new + OI              | 61.2 | 57.4  | 79.7
-           | -      | new                   | 61.0 | 57.0  | 79.7
all         | all    | OI                    | 61.0 | 57.0  | 79.7
all         | all    | rep + new + OI        | 60.7 | 55.7  | 84.4
-           | -      | OOC                   | 60.4 | 56.4  | 79.7

Table 4: Top 5 performing classification models (patients vs controls); the first row reports the model built on human-based ratings for comparison. SS = semantic space; Acc. = accuracy (%); Sens. = sensitivity (%); Spec. = specificity (%).

The best classification performance for patients in their specific pathology group was found when we considered the variables "out-of-category words", "average semantic proximity", and "size of clusters" computed at the 3rd threshold (the 50th quantile) of the WEISS2 space (Table 5). In this case, the overall maximum accuracy was 43.8%. Sensitivity and specificity for each pathology group were: PPA = 87.5% and 62.5%; FTD = 36.4% and 71%; SD = 13.33% and 81.6%, respectively.

SS          | Thres. | Vars                      | Acc. | PPA       | FTD       | SD
Human-based | -      | fam + NS + OI + new + rep | 58   | NA        | NA        | NA
W2          | 50     | OOC + prox + SC           | 43.8 | 87.5/62.5 | 36.4/71   | 13.3/81.6
W1          | 10     | OOC + SC                  | 42.2 | 87.5/56.3 | 39.4/74.2 | 0/83.7
W1          | 30     | NS + NC                   | 40.6 | 93.8/50   | 33.3/77.4 | 0/85.7
W1          | 70     | OOC + SC                  | 40.6 | 87.5/62.5 | 36.4/64.5 | 0/81.6
W2          | 90     | SC                        | 39.1 | 68.8/60.4 | 42.4/64.5 | 0/81.6

Table 5: Top 5 performing classification models (patients in each specific pathology group); the first row reports the model built on human-based ratings for comparison. Group columns report sensitivity/specificity (%).
4 Discussion

In this work, we replaced human-based measures of semantic proximity with DSM-derived measures of semantic proximity to compute a set of VF indexes that had been found able to classify, with good accuracy, people with and without focal dementias based on their verbal production in a semantic VF task (category "fruits", originally adopted to limit the set of possible items as compared to broader categories such as "animals"). The objective of the study was to assess the accuracy of Machine Learning (ML) models based on DSM measures of semantic information, in view of their possible extension to words and semantic categories for which a measure of semantic proximity is not available.

Despite being above chance in both cases, ML models based on DSM-derived measures of semantic proximity showed lower accuracy compared to models built on human-based ratings. This was true both for the classification of patients versus controls (61.2% and 84%, respectively) and for the subclassification of diagnoses (43.8% and 58%, respectively).

The observed differences might be due to the functional adaptations needed to transpose the original VF indexes to DSM-derived measures. For example, the computational equivalent of the "familiarity" index, originally calculated according to familiarity scores collected from a sample of healthy controls, was approximated via the raw word frequency derived from the corpus of reference. Moreover, given that the vectorial representation of words differs according to inflectional morphology, data were not normalised (singular to plural) but kept as originally produced, unlike in the original work. Hence, it is possible that these operations introduced some distortions that could explain the differences observed compared to the original study.

In terms of parameter setting, it is worth noting that our choices might have affected the overall performance of the adopted models, possibly reducing their ability to avoid noise and biases. For example, according to Tripodi and Li Pira (2017), hyperparameter setting for Italian has specific requirements in terms of vector size, negative sampling, and vocabulary threshold cutting to maximise performance in an analogy task (although to what extent such recommendations can be extended to VF is an empirical question that remains to be addressed). Also, the choice of a CBOW model, instead of "more predictive" algorithms such as Skipgram and Mask, might have reduced the ability of the model to mimic human ratings of word associations.

However, a different explanation might be related to the type of information encoded into the human proximity ratings. Given its evolutionary relevance, the neural substrate underpinning the notion of "fruits" might encode a rich multidimensional semantic characterisation (including sensory information such as taste, smell, sight, and touch). As such, the representation of this semantic category might not be simply derivable from the lexical distribution of its items in a corpus. By contrast, other semantic categories might rely on less perceptual and more encyclopaedic semantic knowledge, such as, for example, the category "animals", another semantic cue widely used for the assessment of VF. Indeed, while people do generally have first-hand, real-life experience of "fruits", knowledge about "animals" may be more commonly derived from indirect exposure to encyclopaedic information (i.e., the media). In other words, when we think about a cherry, we may not only recall the meaning of the lemma, as compared to, for example, an apple, but at the same time we might also recall the sensory information attached to the drupe (round, red, juicy, etc.). Conversely, apart from common pets, it is unlikely that participants have first-hand experience of most of the items commonly included in the "animals" category (e.g., "lion", "whale", etc.).

This means that distributional models might not be the best-suited tool to resolve semantic problems when the task under investigation makes use of a subset of words pertaining to a perceptually rich semantic category (such as that of "fruits").

5 Conclusions and Future Works

The past decades have witnessed an increasing interest in the application of NLP techniques to answer, or support the resolution of, different clinical problems, from patient classification to disease monitoring, and from differential diagnosis to the prediction of treatment response (see de Boer et al., 2018 for a comprehensive review). All these applications implicitly rely on the assumption that these techniques are agnostic/transparent to the semantic task under investigation and, given the good results obtained, that they are equipped with sufficiently rich semantic information to solve any kind of task based on linguistic data. Our findings challenge this idea and align with previous works pointing to a lack of basic features of perceptual meaning in DSMs (Lucy and Gauthier, 2017).

For the application of DSM-derived measures to clinical work and research, this implies that the choice of the verbal task and of the associated DSM can affect the results. For this reason, we plan to assess the classification accuracy of ML models built both on human ratings and on DSM-derived measures of semantic proximity for other categorical VF tasks, as well as to adopt word vectors derived from lemmatised corpora. Before moving to more recent language models, such as the last generation of deep neural language models like BERT (Devlin et al., 2019), consideration should be given to the trade-off between the computational and data resources needed to train them (Bender et al., 2021) on the one hand, and the added value they can provide compared to traditional "static" embeddings (Lenci et al., 2021) on the other. Further research might address the limits of current DSMs by enriching the information encoded, integrating experiential and distributional data to induce reliable semantic representations (Andrews et al., 2009). Additional sources of multimodal information (e.g., Lynott et al., 2020), including visual and audio information, might also help overcome these current limitations (Chen et al., 2021).
References

Baroni Marco, Bernardini Silvia, Ferraresi Adriano and Zanchetta Eros. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3): 209–226.

Bender Emily M., Gebru Timnit, McMillan-Major Angelina and Shmitchell Shmargaret. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency: 610–623.

Chang Chih-Chung and Lin Chih-Jen. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3): 1–27.

Chen Wei, Wang Weiping, Liu Li and Lew Michael S. 2021. New ideas and trends in deep multimodal content understanding: A review. Neurocomputing, 426: 195–215.

De Boer Jann N., Voppel Alban E., Begemann Marieke J.H., Schnack Hugo G., Wijnen Frank and Sommer Iris E.C. 2018. Clinical use of semantic space models in psychiatry and neurology: A systematic review and meta-analysis. Neuroscience & Biobehavioral Reviews, 93: 85–92.

Devlin Jacob, Chang Ming-Wei, Lee Kenton and Toutanova Kristina. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT 2019: 4171–4186.

Dinu Georgiana and Baroni Marco. 2013. DISSECT: DIStributional SEmantics Composition Toolkit. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations: 31–36.

Günther Fritz, Dudschig Caroline and Kaup Barbara. 2015. Latent semantic analysis cosines as a cognitive similarity measure: Evidence from priming studies. Quarterly Journal of Experimental Psychology, 69(4): 626–653.

Landauer Thomas and Dumais Susan. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2): 211.

Lenci Alessandro, Sahlgren Magnus, Jeuniaux Patrick, Gyllensten Amaru Cuba and Miliani Martina. 2021. A comprehensive comparative evaluation and analysis of Distributional Semantic Models. arXiv preprint arXiv:2105.09825.

Lezak Muriel, Howieson Diane, Loring David, Hannay Julia and Fischer Jill. 2004. Neuropsychological Assessment. New York: OUP, USA.

Lucy Li and Gauthier Jon. 2017. Are Distributional Representations Ready for the Real World? Evaluating Word Vectors for Grounded Perceptual Meaning. In Proceedings of the First Workshop on Language Grounding for Robotics.

Lynott Dermot, Connell Louise, Brysbaert Marc, Brand James and Carney James. 2020. The Lancaster Sensorimotor Norms: Multidimensional measures of perceptual and action strength for 40,000 English words. Behavior Research Methods, 52(3): 1271–1291.

Mandera Paul, Keuleers Emmanuel and Brysbaert Marc. 2017. Explaining human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and empirical validation. Journal of Memory and Language, 92: 57–78.

Marelli Marco. 2017. Word-Embeddings Italian Semantic Spaces: A semantic model for psycholinguistic research. Psihologija, 50(4): 503–520.

Mikolov Tomas, Sutskever Ilya, Chen Kai, Corrado Greg and Dean Jeffrey. 2013. Distributed Representations of Words and Phrases and their Compositionality. Retrieved from http://arxiv.org/abs/1310.4546.

Niwa Yoshiki and Nitta Yoshihiko. 1995. Co-occurrence vectors from corpora vs. distance vectors from dictionaries. arXiv preprint cmp-lg/9503025.

R Core Team. 2021. R: A language and environment for statistical computing. R Foundation for Statistical Computing. Retrieved from https://www.r-project.org.

Reverberi Carlo, Cherubini Paolo, Baldinelli Sara and Luzzi Simona. 2014. Semantic fluency: Cognitive basis and diagnostic performance in focal dementias and Alzheimer's disease. Cortex, 54: 150–164.

Tripodi Rocco and Li Pira Stefano. 2017. Analysis of Italian word embeddings. arXiv preprint arXiv:1707.08783.

Turney Peter D. and Pantel Patrick. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37: 141–188.