Dec

Automatic Detection of Parkinson's Disease with Connected Speech Acoustic Features: towards a Linguistically Interpretable Approach

Marta Maffia

Loredana Schettino

Vincenzo Norman Vitale

0 DIETI - University of Naples Federico II, Italy
1 Dept. of Literary, Linguistics and Comparative Studies, University of Naples L'Orientale, Italy
2 Interdepartmental Research Center Urban/Eco, University of Naples Federico II, Italy

2023


Alterations in speech and voice are among the earliest symptoms of Parkinson's Disease (PD). Nevertheless, the rich information carried by patients' speech and voice is only partially used for diagnosis and clinical decision-making, which is currently based on holistic ratings of speech intelligibility. An accurate diagnosis could be supported by the application of fully automated analytic methods and machine learning techniques to speech recordings. However, most of the proposed procedures were designed for highly functional but “artificial” vocal paradigms such as sustained phonation, and they consider the entire, very large set of features that can be extracted using automatic systems. In this work, we perform PD detection trials using features extracted from connected speech rather than isolated speech units. Moreover, we support the adopted machine learning-based methods with linguistic considerations so as to reduce the number of features to a few meaningful ones. The main findings highlight that this procedure allows more accurate, economical and, most importantly, interpretable discrimination.

1. Introduction

Parkinson’s Disease (PD) is the most common movement disorder and the second most common neurodegenerative disorder worldwide after Alzheimer’s disease. It affects more than 2-3% of the population aged 65 and over [1, 2].

Caused by the deterioration or loss of dopaminergic neurons in the substantia nigra of the basal ganglia, PD is generally diagnosed based on clinical criteria, by using the individual’s medical history and a physical/neurological exam. The loss of dopamine in the central nervous system, along with the anatomical and physiological changes related to the disease, has an impact on the laryngeal, respiratory and articulatory functions of Persons with PD (PwPD). Alterations in speech and voice are in fact among the earliest symptoms of PD, which results in a motor speech disorder called hypokinetic dysarthria [3, 4]. Nevertheless, the rich information carried by patients’ speech and voice is only partially used for diagnosis and clinical decision-making, since the Unified Parkinson’s Disease Rating Scale (UPDRS), a standardized rating tool used to assess the severity and progression of the pathology, only presents one item (item 3.1) that concerns the evaluation of speech [5]. This item is based on the clinician’s perception and mostly considers speech in terms of intelligibility. A deeper understanding of speech and voice phenomena by advanced data analytics methods could therefore be very useful both in the diagnostic phase and in the monitoring of therapy response in PwPD.

2. Speech in Parkinson’s Disease

PD-related dysarthria, caused by poor activation and coordination of the muscles involved in speech production, includes a range of alterations, extensively described in experimental studies on different languages [6].

As for the voice quality, a breathy, husky-semiwhisper and hoarse voice is often reported in PwPD, accompanied by vocal tremor, an increase in nasality, reduced voice intensity and constant loudness [7]. Voice quality spectrum was also studied using a deep learning approach applied to differential phonological posterior features for the characterization of pathological PD speech, collected through different tasks and compared to healthy non-modal phonation [8].

At the segmental level, the decreased amplitude of motility of lips, tongue, and jaw provokes imprecision in the production of consonantal sounds, with the so-called spirantization phenomenon or occlusive weakening [9, 10]. A reduction in the vowel space area and an impaired and less distinctive formant generation in the speech of PwPD have also been described, both in sustained prolongation of single vowels [11] and in continuous speech, such as sentence repetition [12] or reading passage [13]. The centralization of formant values, measured by the Vowel Articulation Index (VAI), was also proposed as a potential early marker of PD, especially when observed in spontaneous speech [14].

As for the suprasegmental aspects, PwPD often show a significantly narrower tonal range (monopitch) or an abnormal pitch variability, along with a compromised ability to consciously manipulate intonation [15, 4]. Articulation and speech rate are also altered in PD, although previous findings do not highlight a uniform pattern of variation in the speech of PwPD: in some studies a reduction in speech rate was observed in PD patients [16], while some reported the opposite effect [17, 18] and others found no intergroup differences between pathological and healthy speech [19]. Furthermore, different rhythmic metrics were used to describe the alteration of rhythm in PD speech, as part of a more “general dysrhythmia” [20]. In recent studies on Italian PD patients, the percentage of vocalic intervals (%V) was found to be effective in characterizing pathological speech, when compared to that of healthy individuals, both in read and spontaneous conditions and even at a very early stage of the disease [21, 22].

In the last decades, in line with the growing interest and efforts in the identification of reliable linguistic and acoustic biomarkers of PD, some studies demonstrated that an accurate diagnosis could be supported by the application of fully automated analytic methods and machine learning techniques to speech recordings [23]. However, most of the proposed procedures were designed for highly functional (but “artificial”) vocal paradigms such as sustained phonation, diadochokinetic tasks, syllable repetition, short sentences [24, 25, 26, 27, 28, 29]. These kinds of elicitation techniques indeed provide highly controlled signals, but such control affects phonation and may even mask features that may emerge in less controlled (semi)spontaneous connected speech. In addition, previous studies often achieve high levels of accuracy in the detection of PD speech by taking into account a very large number of features, and the classification focuses on computational aspects rather than linguistic ones [30].

In this contribution, we address the following issues:

• investigate the role of acoustic features that are usually overlooked, or at least not always or directly taken into account, by specialists for PD diagnosis;
• consider patterns that emerge from connected (read) speech rather than from the production of isolated speech units (phones, syllables, words);
• support machine learning-based methods with linguistic considerations so as to reduce the large sets of automatically extracted features to a few meaningful ones and provide an effective linguistic interpretation of the results.

3. Method

3.1. Data and Annotation

The study has been conducted on data from the Italian Parkinson’s Voice and Speech corpus [31, 32], which consists of speech data collected through different speech production tasks from three groups of Italian (Apulian) speakers: PD patients, age-matched healthy control (HC) speakers and younger HC speakers.

In particular, we considered a subset of this corpus, consisting of 25 speech samples elicited through a reading task, based on a phonemically balanced text [31], from 15 PD patients and 10 age-matched healthy speakers. Subjects in the PD group are classified by the specialists as <4 on the modified Hoehn and Yahr scale, which indicates a non-severe stage of the disease. The patients’ speech ability is evaluated, following the guidelines provided in section 3.1 (eloquence) of the Unified Parkinson’s Disease Rating Scale (UPDRS), as minimally/slightly impaired (maximum score is 4 = severe impairment). Demographic and clinical features of patients with PD and HC speakers are summarized in Table 1.

                     HC (n=10)    PD (n=15)
Age (m±SD)           68±6         64±9
Sex (M/F)            4/6          11/4
H&Y                  -            <4
UPDRS (Item 3.1)     -            1.07±1.18

Table 1: Biographical (Sex and Age) characteristics of the PD and HC speakers and clinical data (H&Y: Hoehn & Yahr scale; UPDRS: Unified Parkinson’s Disease Rating Scale) of PD speakers [32].

The considered dataset had already been the object of a spectroacoustic analysis in a previous study [22], and the acoustic signal had therefore been manually segmented and annotated into vowel (V) and consonantal (C) intervals (see Figure 1). Main descriptive statistics of the dataset are reported in Table 2.

3.2. Analysis

In this study, we intend to use the described continuous speech data for PD detection based on a reduced set of interpretable features of the acoustic signal. To this aim, we consider three types of units for feature extraction:

• Vowels (V) - in previous studies, the percentage of vocalic intervals in the speech signal was demonstrated to be informative in PD detection. So we investigate whether vowels alone contain enough information for the detection task;
• Consonant and Vowels (CV) - we extend the context of vowels to the previous consonants, obtaining a wider feature extraction window to evaluate the influence of consonants preceding vowels on PD detection;
• Phonetic Chains (PC) - lastly, we employ the phonetic chain, namely the sequence of vowels and consonants between two silent pauses. On the one hand, such units provide the widest automatically detectable window for feature extraction. On the other hand, being larger units of speech production, they should provide enough information to discriminate speaker status.
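As an illustration of how the three unit types relate to a V/C annotation, the following minimal sketch derives V, CV and PC windows from a list of labelled intervals. The tuple layout, the labels "V", "C" and "#" (silent pause) and all function names are hypothetical, and the CV rule used here (extend each vowel only to the single immediately preceding consonant) is one possible reading of the description above, not necessarily the procedure used in the study.

```python
from typing import List, Tuple

# Hypothetical annotation format: (label, start_s, end_s),
# with "V" = vowel, "C" = consonant, "#" = silent pause.
Interval = Tuple[str, float, float]
Window = Tuple[float, float]


def vowel_units(tier: List[Interval]) -> List[Window]:
    """One window per vowel interval."""
    return [(start, end) for label, start, end in tier if label == "V"]


def cv_units(tier: List[Interval]) -> List[Window]:
    """Each vowel, extended back to an immediately preceding consonant (assumption)."""
    units = []
    for i, (label, start, end) in enumerate(tier):
        if label != "V":
            continue
        if i > 0 and tier[i - 1][0] == "C":
            start = tier[i - 1][1]
        units.append((start, end))
    return units


def phonetic_chains(tier: List[Interval]) -> List[Window]:
    """Maximal stretches of V/C intervals between two silent pauses."""
    chains = []
    chain_start = None
    chain_end = None
    for label, start, end in tier:
        if label == "#":
            if chain_start is not None:
                chains.append((chain_start, chain_end))
                chain_start = None
        else:
            if chain_start is None:
                chain_start = start
            chain_end = end
    if chain_start is not None:
        chains.append((chain_start, chain_end))
    return chains


# Invented toy annotation, for illustration only.
tier = [
    ("#", 0.0, 0.2), ("C", 0.2, 0.28), ("V", 0.28, 0.40),
    ("C", 0.40, 0.47), ("C", 0.47, 0.52), ("V", 0.52, 0.66),
    ("#", 0.66, 1.0), ("V", 1.0, 1.1), ("C", 1.1, 1.2),
]

print(vowel_units(tier))
print(cv_units(tier))
print(phonetic_chains(tier))
```

Each returned window could then be passed to an acoustic feature extractor as a start/end pair, so that the same feature set is computed over progressively wider contexts.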

Based on the OpenSmile toolkit [33], we selected the eGeMAPSv02 [34] as the basic feature set, and then investigated which features could be considered as the most relevant for discrimination, considering previous literature [35] and inspection of the data with the Orange software [36].

Then, the impact of the selected features was evaluated by employing two unsupervised machine-learning techniques:

• K-Means [37], a vector-quantization method which divides n objects into k clusters based on their distance from the cluster means.
• Hierarchical Agglomerative Clustering (HAC) [38], a greedy technique that aims at grouping (or splitting) clusters based on a similarity measure. The final output is a hierarchy of clusters, which can be cut according to the number of desired clusters.
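To make the two clustering schemes concrete, here is a minimal pure-Python sketch of both. It is an illustration under stated assumptions, not the implementation used in the study: the Euclidean distance, the deterministic initialisation of K-means from the first k points, and the average-linkage criterion for HAC are all choices made here for the example, since the text does not specify them.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def dist(a: Vector, b: Vector) -> float:
    """Euclidean distance (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points: List[Vector]) -> List[float]:
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points: List[Vector], k: int, iters: int = 100) -> List[int]:
    """Lloyd-style K-means; the first k points seed the centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels: Optional[List[int]] = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = centroid(members)
    return labels if labels is not None else []


def hac(points: List[Vector], n_clusters: int) -> List[int]:
    """Agglomerative clustering with average linkage (assumed criterion)."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a: List[int], b: List[int]) -> float:
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        # Greedy step: merge the two closest clusters.
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(points)
    for cluster_id, members in enumerate(clusters):
        for idx in members:
            labels[idx] = cluster_id
    return labels


# Invented toy data: two well-separated groups in a 2-D feature space.
toy = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
print(kmeans(toy, k=2))
print(hac(toy, n_clusters=2))
```

Both functions return one cluster index per input vector. The indices are arbitrary identifiers, so comparing them with PD/HC labels requires a separate mapping step.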

These simple yet efficient techniques were employed to obtain explainable and interpretable results. The PD detection trials were conducted considering the following sets of features:

• a full feature set, i.e. the complete eGeMAPSv02 feature set (88 features) [34] plus the speakers’ sex;
• a feature subset, i.e. 18 features from the eGeMAPSv02 feature set, plus the sex (see Appendix A).
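The construction of the reduced set can be sketched as a column selection followed by standardisation. The feature names below are those listed in Appendix A; everything else is an assumption made for illustration: the dictionary-per-unit input format, the 0/1 encoding of sex, the function names, and the use of the population standard deviation when scaling to zero mean and unit variance.

```python
import math
from typing import Dict, List

# The 18 eGeMAPSv02 functionals listed in Appendix A.
SUBSET = [
    "slopeV0-500_sma3nz_amean", "slopeV0-500_sma3nz_stddevNorm",
    "alphaRatioV_sma3nz_amean", "alphaRatioV_sma3nz_stddevNorm",
    "hammarbergIndexV_sma3nz_amean", "hammarbergIndexV_sma3nz_stddevNorm",
    "spectralFlux_sma3_amean", "spectralFlux_sma3_stddevNorm",
    "loudness_sma3_amean", "loudness_sma3_percentile20.0",
    "shimmerLocaldB_sma3nz_amean", "shimmerLocaldB_sma3nz_stddevNorm",
    "mfcc1_sma3_amean", "mfcc1_sma3_stddevNorm",
    "mfcc1V_sma3nz_amean", "mfcc1V_sma3nz_stddevNorm",
    "jitterLocal_sma3nz_amean", "jitterLocal_sma3nz_stddevNorm",
]


def select_subset(features: Dict[str, float], sex: str) -> List[float]:
    """Keep the 18 subset features and append sex as 0/1 (hypothetical encoding)."""
    return [features[name] for name in SUBSET] + [1.0 if sex == "M" else 0.0]


def standardize(rows: List[List[float]]) -> List[List[float]]:
    """Scale every column to zero mean and unit variance (population variance)."""
    scaled_cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / len(col))
        # A constant column carries no information; map it to zeros.
        scaled_cols.append([(x - mean) / std if std > 0 else 0.0 for x in col])
    return [list(row) for row in zip(*scaled_cols)]


# Invented values standing in for three analysis units; not real measurements.
fake_units = [
    {name: float(i * (j + 1)) for j, name in enumerate(SUBSET)} for i in range(3)
]
fake_sex = ["M", "F", "M"]

matrix = [select_subset(f, s) for f, s in zip(fake_units, fake_sex)]
scaled = standardize(matrix)
print(len(matrix[0]))  # 18 acoustic features + sex
```

Standardising after the selection keeps each feature on a comparable scale, which matters for distance-based clustering.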

In both cases, features were normalized to zero mean and unit variance.

4. Results

The inspection conducted with the Orange software highlighted that the most relevant features for discriminating between PwPD and HC speakers are those concerning the spectral distribution (i.e., slope, alpha ratio, Hammarberg index), followed by those concerning energy and amplitude (i.e., loudness, shimmer), and frequency (MFCC). The observed features were included in the subset employed for the discrimination trials (as reported in Appendix A). Also, the table in Appendix C shows the mean values and standard deviations of these features in PC units per speaker.

Results show that classification based on the Phonetic Chain (see Figure 4) outperforms by far classification based on both V and CV. On the one hand, the HAC classifier with the full feature set reaches nearly 99% true positive detection and 85% true negative detection. On the other hand, K-means performs at its best with the feature subset, with 89% true positive and 72% true negative detection. This means that, by reducing the number of features by almost 80% with respect to the original feature set (from 89 to 19, sex included), K-means loses 10 percentage points in true positive (i.e., PD) detection and 13 percentage points in true negative (i.e., HC) detection with respect to HAC on the full feature set.
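The quantities discussed here can be reproduced with a few lines of arithmetic. The sketch below shows one way to turn two-cluster output into true positive and true negative rates, and recomputes the feature reduction and the rate differences from the figures reported in the text. How the study mapped clusters to the PD/HC classes is not stated, so the majority-agreement alignment used here, like the function names and the toy labels, is an assumption.

```python
from typing import List, Tuple


def align_two_clusters(y_true: List[int], clusters: List[int]) -> List[int]:
    """Cluster ids are arbitrary: keep or flip them, whichever agrees more with y_true."""
    flipped = [1 - c for c in clusters]
    agree = sum(t == c for t, c in zip(y_true, clusters))
    agree_flipped = sum(t == c for t, c in zip(y_true, flipped))
    return clusters if agree >= agree_flipped else flipped


def detection_rates(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Return (true positive rate, true negative rate), with 1 = PD and 0 = HC."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)


# Invented toy labels, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
raw_clusters = [0, 0, 0, 1, 1, 1, 1, 0]
tpr, tnr = detection_rates(y_true, align_two_clusters(y_true, raw_clusters))

# Arithmetic on the figures reported in the text.
full_size = 88 + 1    # eGeMAPSv02 functionals + sex
subset_size = 18 + 1  # selected functionals + sex
reduction = 1 - subset_size / full_size
tpr_gap_points = 99 - 89  # HAC (full set) vs K-means (subset), PD detection
tnr_gap_points = 85 - 72  # HAC (full set) vs K-means (subset), HC detection

print(f"TPR={tpr:.2f} TNR={tnr:.2f}")
print(f"feature reduction: {reduction:.1%}")
print(f"gaps: {tpr_gap_points} and {tnr_gap_points} percentage points")
```

Expressing the differences in percentage points avoids confusing them with relative reductions of the rates themselves.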

The vowels-based setting (see Figure 2) shows better performances with the feature subset with both K-means and HAC. However, the true negative detection rate is near 60% in the best case, while the true positive rate is at 80% in the best case.

Finally, the CV setting (see Figure 3) shows performances which are comparable to a coin toss in most cases. Only K-means based on the feature subset reaches a true positive detection rate of 81%, with a true negative detection rate of 54%.

In light of these results, we decided to also investigate the correlation between the considered features and the intelligibility score (from the above-described UPDRS) given by the specialists. As illustrated in Figure 5, no strong correlation emerges between UPDRS scores and the analysed acoustic features, with the exception of slopeV0-500, which negatively correlates with the specialists’ ratings (see Appendix B, Figure 7, for the correlation matrices concerning the features extracted from V and CV intervals).

Figure 5: Feature correlation considering PC units.
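A correlation analysis of this kind can be sketched as follows. The text does not say which coefficient was computed, so the Pearson coefficient is an assumption here, and the numbers are invented toy values rather than measurements from the corpus.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented toy values: per-speaker UPDRS item 3.1 scores and a spectral-slope feature.
updrs_scores = [0, 1, 1, 2, 3]
slope_values = [-0.010, -0.020, -0.025, -0.030, -0.045]

r = pearson(updrs_scores, slope_values)
print(f"r = {r:.2f}")  # negative: the slope decreases as the rating increases
```

Repeating the computation for every feature against the UPDRS score yields one row of a correlation matrix like the one summarised in Figure 5.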

5. Discussion and Conclusion

The present study provides relevant findings both for the development of PD detection systems and the analysis of Parkinsonian speech characteristics by integrating computational methods with domain-specific linguistic knowledge.

The correlation data between the UPDRS ratings concerning PD speakers’ speech ability and the acoustic features automatically extracted from the speech signal corroborate the observation that the specialists’ holistic assessment overlooks, or at least only partially and indirectly considers, acoustic features, which, nonetheless, prove to provide crucial information for the diagnosis.

In fact, the speech signal is affected by the condition of the muscles involved in phonation. So, if the vocal apparatus is somewhat compromised as an effect of the muscular impairment due to the disease (dysarthria), the signal should show this. Hence, the relevance of including acoustic features in the assessment of the outbreak and severity of PD.

However, fully automated extraction and treatment of speech acoustic features is usually achieved with highly complex systems whose interpretation is quite difficult both for computational scientists, who might not be familiar with PD symptoms and the linguistic value of the features of the speech signal, and for domain experts, who might not be familiar with machine learning methods. Therefore, designing models whose predictions are explainable and easily interpretable may actually be the most sensible and economical choice. In fact, this study highlights that not all the acoustic features that could be considered provide the same amount of information or are actually relevant for discrimination. Moreover, their contribution may vary as a result of the type and span of the linguistic unit used for the feature extraction.

More specifically, the classification results show that considering vowel intervals as units of reference for the feature extraction is already quite effective. Most effective is, however, considering wider contexts as provided by the inter-pausal phonetic chain intervals, whereas enlarging the vocalic intervals only to the previous consonant (CV intervals as a basic unit) turns out to be noisy rather than informative.

Then, on average, the feature subset proved to be most informative, carrying sufficient information to let the classifiers reach a reasonable detection rate in the considered medical scenario. In particular, the subset mainly includes features concerning spectral distribution, followed by those involving energy and amplitude and finally frequency features (MFCC above all).

It is worth noticing that the study has been conducted on continuous speech rather than on isolated phones, syllables or words, to get closer to the normal working dynamic of the vocal apparatus during utterance phonation and avoid artificial effects that may arise when producing single short items.

To conclude, supporting automated analytic methods and machine learning techniques with linguistic considerations allows for more accurate, economical and, most importantly, interpretable discrimination. Future work will be devoted to delving deeper into the linguistic analysis of the way the emergent features characterize PD speech and the investigation of the explainability of classification methods based on deep neural networks.

References

[1] L. M. De Lau, M. M. Breteler, Epidemiology of parkinson’s disease, The Lancet Neurology 5 (2006) 525–535.
[2] Poewe, Seppi, C. M. Tanner, G. M. Halliday, Brundin, Volkmann, A.-E. Schrag, A. E. Lang, Parkinson disease, Nature reviews Disease primers 3 (2017) 1–21.
[3] F. L. Darley, A. E. Aronson, J. R. Brown, Clusters of deviant speech dimensions in the dysarthrias, Journal of speech and hearing research 12 (1969) 462–496.
[4] Pinto, Chan, I. Guimarães, Rothe-Neves, Sadat, A cross-linguistic perspective to the study of dysarthria in parkinson’s disease, Journal of Phonetics 64 (2017) 156–167.
[5] C. G. Goetz, B. C. Tilley, S. R. Shaftman, G. T. Stebbins, Fahn, Martinez-Martin, Poewe, Sampaio, M. B. Stern, Dodel, et al., Movement disorder society-sponsored revision of the unified parkinson’s disease rating scale (mds-updrs): scale presentation and clinimetric testing results, Movement disorders: official journal of the Movement Disorder Society 23 (2008) 2129–2170.
[6] J. R. Duffy, Defining, understanding, and categorizing motor speech disorders, Motor speech disorders-substrates, differential diagnosis, and management. 3rd edn. Saint Louis: Elsevier Mosby (2013) 3–13.
[7] D. G. Hanson, B. R. Gerratt, P. H. Ward, Cinegraphic observations of laryngeal function in parkinson’s disease, The Laryngoscope 94 (1984) 348–353.
[8] Cernak, J. R. Orozco-Arroyave, Rudzicz, Christensen, J. C. Vásquez-Correa, E. Nöth, Characterisation of voice quality of parkinson’s disease using differential phonological posterior features, Computer Speech & Language 46 (2017) 196–208.
[9] Ackermann, W. Ziegler, Articulatory deficits in parkinsonian dysarthria: an acoustic analysis, Journal of Neurology, Neurosurgery & Psychiatry 54 (1991) 1093–1098.
[10] D. Duez, Acoustic analysis of occlusive weakening in parkinsonian french speech, in: International Congress of Phonetic Sciences, Université de Saarebrücken, 2007, pp. 1–4.
[11] I. Eliasova, J. Mekyska, M. Kostalova, R. Marecek, Z. Smekal, I. Rektorová, Acoustic evaluation of short-term effects of repetitive transcranial magnetic stimulation on motor aspects of speech in parkinson’s disease, Journal of Neural Transmission 120 (2013) 597–605.
[12] S. Sapir, L. O. Ramig, J. L. Spielman, C. Fox, Formant centralization ratio: A proposal for a new acoustic measure of dysarthric speech (2010).
[13] S. Skodda, W. Visser, U. Schlegel, Vowel articulation in parkinson’s disease, Journal of voice 25 (2011) 467–472.
[14] J. Rusz, R. Cmejla, T. Tykalova, H. Ruzickova, J. Klempir, V. Majerova, J. Picmausova, J. Roth, E. Ruzicka, Imprecise vowel articulation as a potential early marker of parkinson’s disease: Effect of speaking task, The Journal of the Acoustical Society of America 134 (2013) 2171–2181.
[15] A. M. Goberman, C. A. Coelho, M. P. Robb, Prosodic characteristics of parkinsonian speech: The effect of levodopa-based medication, Journal of medical speech-language pathology 13 (2005) 51–69.
[16] C. L. Ludlow, N. P. Connor, C. J. Bassich, Speech timing in parkinson’s and huntington’s disease, Brain and language 32 (1987) 195–214.
[17] H. Hirose, S. Kiritani, M. Sawashima, Velocity of articulatory movements in normal and dysarthric subjects, Folia Phoniatrica et Logopaedica 34 (1982) 210–215.
[18] H. Ackermann, J. Konczak, I. Hertrich, The temporal control of repetitive articulatory movements in parkinson’s disease, Brain and language 56 (1997) 312–319.
[19] S. Skodda, U. Schlegel, Speech rate and rhythm in parkinson’s disease, Movement disorders: official journal of the Movement Disorder Society 23 (2008) 985–992.
[20] J. M. Liss, L. White, S. L. Mattys, K. Lansford, A. J. Lotto, S. M. Spitzer, J. N. Caviness, Quantifying speech rhythm abnormalities in the dysarthrias (2009).
[21] M. Maffia, R. De Micco, M. Pettorino, M. Siciliano, A. Tessitore, A. De Meo, Speech rhythm variation in early-stage parkinson’s disease: a study on different speaking tasks, Frontiers in Psychology 12 (2021) 668291.
[22] M. Maffia, M. Pettorino, Voce, età e parkinson: questioni di ritmo, in: CLUB Working Papers in Linguistics, volume 6, Alma Mater Studiorum Università di Bologna, 2022, pp. 66–78.
[23] J. Hlavnička, R. Čmejla, T. Tykalová, K. Šonka, E. Růžička, J. Rusz, Automated analysis of connected speech reveals early biomarkers of parkinson’s disease in patients with rapid eye movement sleep behaviour disorder, Scientific reports 7 (2017) 12.
[24] J. I. Godino-Llorente, S. Shattuck-Hufnagel, J.-Y. Choi, Moro-Velázquez, J. A. Gómez-García, Towards the identification of idiopathic parkinson’s disease from the speech. new articulatory kinetic biomarkers, PloS one 12 (2017) e0189583.
[25] M. Chronowski, M. Klaczynski, M. Dec-Cwiek, K. Porebska, Parkinson’s disease diagnostics using ai and natural language knowledge transfer, arXiv preprint arXiv:2204.12559 (2022).
[26] L. Ali, C. Zhu, M. Zhou, Y. Liu, Early diagnosis of parkinson’s disease from multiple voice recordings by simultaneous sample and feature selection, Expert Systems with Applications 137 (2019) 22–28.
[27] J. S. Almeida, P. P. Rebouças Filho, T. Carneiro, W. Wei, R. Damaševičius, R. Maskeliūnas, V. H. C. de Albuquerque, Detecting parkinson’s disease with sustained phonation and speech signals using machine learning techniques, Pattern Recognition Letters 125 (2019) 55–62.
[28] R. Lamba, T. Gulati, H. F. Alharbi, A. Jain, A hybrid system for parkinson’s disease diagnosis using machine learning techniques, International Journal of Speech Technology 3 (2021) 583–593.
[29] C. O. Sakar, G. Serbes, A. Gunduz, H. C. Tunc, H. Nizam, B. E. Sakar, M. Tutuncu, T. Aydin, M. E. Isenkul, H. Apaydin, A comparative analysis of speech signal processing algorithms for parkinson’s disease classification and the use of the tunable q-factor wavelet transform, Applied Soft Computing 74 (2019) 255–263.
[30] F. Cordella, A. Paffi, A. Pallotti, Classification-based screening of parkinson’s disease patients through voice signal, in: 2021 IEEE International Symposium on Medical Measurements and Applications (MeMeA), IEEE, 2021, pp. 1–6.
[31] G. Dimauro, F. Girardi, Italian parkinson’s voice and speech, IEEE Dataport (2019).
[32] G. Dimauro, V. Di Nicola, V. Bevilacqua, D. Caivano, F. Girardi, Assessment of speech intelligibility in parkinson’s disease using a speech-to-text system, IEEE Access 5 (2017) 22199–22208.
[33] F. Eyben, M. Wöllmer, B. Schuller, Opensmile: the munich versatile and fast open-source audio feature extractor, in: Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 1459–1462.
[34] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan, et al., The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing, IEEE transactions on affective computing 7 (2015) 190–202.
[35] Moro-Velazquez, J. A. Gomez-Garcia, J. D. Arias-Londoño, Dehak, J. I. Godino-Llorente, Advances in parkinson’s disease detection and assessment using voice and speech: A review of the articulatory and phonatory aspects, Biomedical Signal Processing and Control 66 (2021) 102418.
[36] Demšar, Curk, Erjavec, Črt Gorup, Hočevar, Milutinovič, Možina, Polajnar, Toplak, Starič, Štajdohar, Umek, Žagar, Žbontar, Žitnik, Zupan, Orange: Data mining toolbox in python, Journal of Machine Learning Research 14 (2013) 2349–2353. URL: http://jmlr.org/papers/v14/demsar13a.html.
[37] J. A. Hartigan, M. A. Wong, Algorithm as 136: A k-means clustering algorithm, Journal of the royal statistical society. series c (applied statistics) 28 (1979) 100–108.
[38] Murtagh, A survey of recent advances in hierarchical clustering algorithms, The computer journal 26 (1983) 354–359.

Appendix A: Further Features Analysis

List of the features included in the considered subset of the eGeMAPSv02 features.

Features concerning the spectral distribution:
• slopeV0-500_sma3nz_amean
• slopeV0-500_sma3nz_stddevNorm
• alphaRatioV_sma3nz_amean
• alphaRatioV_sma3nz_stddevNorm
• hammarbergIndexV_sma3nz_amean
• hammarbergIndexV_sma3nz_stddevNorm
• spectralFlux_sma3_amean
• spectralFlux_sma3_stddevNorm

Features concerning energy and amplitude:
• loudness_sma3_amean
• loudness_sma3_percentile20.0
• shimmerLocaldB_sma3nz_amean
• shimmerLocaldB_sma3nz_stddevNorm

Features concerning frequency:
• mfcc1_sma3_amean
• mfcc1_sma3_stddevNorm
• mfcc1V_sma3nz_amean
• mfcc1V_sma3nz_stddevNorm
• jitterLocal_sma3nz_amean
• jitterLocal_sma3nz_stddevNorm

Appendix B: Further Results

[Table: Mean value and Standard Deviation of the most relevant features (alphaRatio, Shimmer, Loudness) in PC units per speaker. Speakers: 01PDm, 02PDm, 03PDm, 04PDf, 05PDf, 06PDf, 07PDf, 08PDm, 09PDm, 10PDm, 11PDm, 12PDm, 13PDm, 14PDm, 15PDm, 16HCf, 17HCf, 18HCm, 19HCf, 20HCf, 21HCf, 22HCm, 23HCf, 24HCm, 25HCf.]