
CNN based Parkinson's Disease Assessment using Empirical Mode Decomposition

Ayush Tripathi

Sunil Kumar Kopparapu

TCS Research & Innovation - Mumbai, Tata Consultancy Services Limited, Maharashtra, India


Parkinson's Disease (PD) is a neuro-degenerative disorder which is caused by a decrease in dopamine producing neurons in the human body and affects the body's motor system. In addition to its effect on several motor and non-motor activities of day to day life, PD causes difficulty in speech production due to reduced coordination of the muscles that control breathing, phonation, articulation and prosody. Analyzing speech allows clinicians to objectively measure the severity of PD in a non-invasive way. In this work, we propose an effective method to discriminate between PD and healthy control (HC) subjects by utilizing Empirical Mode Decomposition (EMD), a technique that decomposes a speech signal into simpler Intrinsic Mode Functions (IMFs). We train a Convolutional Neural Network (CNN) to learn significant properties from raw IMFs for the purpose of PD-HC classification. We evaluate our technique on sustained phonation speech from the Italian Parkinson's Voice and Speech database. Experimental results show that significant characteristics of Parkinsonian dysarthria can be learnt by using the raw IMFs and that the need for explicitly extracting handcrafted features could be mitigated.

Keywords: Parkinson's speech, Empirical Mode Decomposition, Intrinsic Mode Function, sustained phonation

1. Introduction

Parkinson's Disease (PD) is a neuro-degenerative disorder which is caused by a decrease in dopamine producing neurons in the human body and affects the body's motor system [1]. PD affects 1-2 per 1000 of the population at any time. The prevalence of PD increases with age and it affects roughly 1% of the population above 60 years [2]. Normal respiratory and well controlled articulatory movements are fundamental for producing well-coordinated normal speech. The common signs and symptoms of PD such as tremor, bradykinesia, rigid muscles and akinesia hamper the ability of an individual to precisely control the speech producing organs, which leads to disordered speech. This manifests in PD patients in the form of soft voice, monotone, breathiness, hoarse voice quality, imprecise articulation and a decrease in naturalness while speaking [3].

In the absence of any specific laboratory test or instruments to measure or monitor the evolution and treatment response of PD, it is extremely crucial to track the motor functions such as gait freezing and speech analysis to examine the disease. Importantly, the signs of PD are often confused with those of natural aging, hence making the diagnosis even more challenging. Clinicians widely use the Unified Parkinson's Disease Rating Scale (UPDRS) [4] for evaluation of PD. The evaluation is carried out through face to face interviews and clinical observations using a set of questions to evaluate: (a) non-motor experiences of daily living, (b) motor experiences of daily living, (c) motor examination, and (d) motor complications.

Naturally spoken speech can be analyzed in a non-invasive manner and hence the study of changes in acoustic properties of speech is a center-point of research for the measurement of symptomatic changes in PD [5]. Articulation, voice intensity, frequency spectrum, and speech intelligibility are the main acoustic parameters observed for tracking changes in speech. It was observed [6] that PD patients suffer from reduction in the range of articulatory movement which in turn leads to impaired vowel articulation. The production of vowels is a complicated process that involves precise control over the movements of the tongue, lips and jaw, creating oropharyngeal resonating cavities, which amplify certain frequency bands of the voice spectrum called formants. The possibility of using sustained phonation /a/ for discriminating PD from healthy subjects was first proposed in [7]. A set of 13 features describing different aspects of Parkinsonian speech for the task was suggested in [8]. Phonation and rhythm features [9] and other vowel features [10] to capture characteristics of PD dysarthria have been proposed in literature. An extensive feature analysis followed by a 2 stage feature selection to represent physiological aspects of PD obtained from sustained vowel /a/ and DDK task was proposed in [11]. A set of frame-level features was used to construct a Fisher Vector representation of the speech sample along with a Support Vector Machine classifier in [12]. An i-vector based approach along with a large set of acoustic features was used in [13] in order to identify the most relevant features for characterizing the disorders in speech of PD patients. Voxtester [14] is a system for assessing PD related impairment by using a wide set of parameters including: voice spectrum, formants, DDK rate, voice intensity and vocal sound pressure level.

With the advent of machine learning in all spheres of processing the trend has been to extract more and more features using signal processing in order to discriminate PD and HC subjects. In this paper, we propose a method to classify PD and HC by decomposing the speech utterance by using the Empirical Mode Decomposition (EMD) technique. EMD is the process of decomposing non-stationary time series into simpler Intrinsic Mode Functions (IMF) in the time domain. This technique has had various applications in the speech domain such as enhancement, denoising [15], formant tracking [16], pathological voice analysis [17], emotion recognition [18], glottal activity detection [19] etc. In these studies, the emphasis has been on extracting temporal and spectral features using the IMFs which are then used for classification tasks. However, to the best of our knowledge, employing raw IMFs for classification of pathological speech has not been studied. The main contribution of this paper lies in using a Convolutional Neural Network architecture to learn these features from raw IMFs without the need of explicitly extracting handcrafted features for the purpose of PD-HC classification. The approach is validated on the Italian Parkinson's Voice and Speech database.

The rest of the paper is organized as follows: Section 2 describes the database used for the experiments; we provide the description of the proposed approach in Section 3 while Section 4 details achieved results. We discuss the salient aspects of the proposed approach while also providing an analogy to the traditional feature extraction based methods in Section 5 and conclude in Section 6.

Proceedings of the CIKM 2020 Workshops, October 19-20, Galway, Ireland. Editors of the Proceedings: Stefan Conrad, Ilaria Tiddi
email: t.ayush@tcs.com (A. Tripathi); sunilkumar.kopparapu@tcs.com (S.K. Kopparapu)
url: https://www.tcs.com (S.K. Kopparapu)
orcid: 0000-0002-7944-2260 (A. Tripathi); 0000-0002-0502-527X (S.K. Kopparapu)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

2. Dataset

The Italian Parkinson's Voice and Speech database [20] consists of recordings from 28 (19 Male, 9 Female) speakers with Parkinson's Disease aged between 40 and 80 years and 22 (10 Male, 12 Female) healthy controls (HC) aged between 60 and 77 years. The utterances have been recorded in a warm, echo free and quiet room at a sampling frequency of 16 kHz by keeping the microphone at a distance of 15 to 25 centimeters from the subject. The speech intelligibility of the patients was perceptually assessed on a 5-point scale based on the UPDRS protocol. The following reading tasks were performed by the subjects:

• 2 phonations each of the vowel /a/, /e/, /i/, /o/, /u/
• execution of syllable /pa/ and /ka/ (5 sec)
• 2 readings of a phonetically balanced text
• reading of phonetically balanced words and phrases

In our study, we use a subset of this dataset, namely the sustained phonations (/a/, /e/, /i/, /o/, /u/). Depending on the severity of the condition and the speaker, the amount of time a subject can sustain a phonation is different and subsequently the lengths (in seconds) of the audio recordings are unequal. As will be discussed in Section 3, we segment the unequal length speech samples into non-overlapping segments (utterances) of 1 second duration each. In all there were 1957 utterances from PD and 1445 utterances from HC (see Table 1); this forms the data in all our experiments on the phonation data for PD-HC classification. For complete information on the recording protocol, the subjects and the tasks, please refer to [21].

Table 1: Number of 1 second utterances for PD and HC categories in the dataset.

Phonation   PD     HC
/a/         390    269
/e/         385    297
/i/         403    290
/o/         400    284
/u/         379    305
Total       1957   1445

3. Proposed Approach

The proposed PD diagnosis system consists of two major parts. First, the raw speech utterance of 1 second duration is decomposed into its Intrinsic Mode Functions (IMFs) by using the Empirical Mode Decomposition (EMD) technique. A 1D-CNN model is then trained using the raw IMFs as input for classifying the speech utterance into one of the two categories, namely, HC or PD. We now describe the signal decomposition process and the architecture of the 1D-CNN model used in our experiments.

3.1. Empirical Mode Decomposition

Empirical Mode Decomposition is an adaptive, data driven technique used to decompose non-stationary and non-linear signals into Intrinsic Mode Functions of a signal, in the time-domain itself without the requirement of any a priori basis [22]. Any function that satisfies the following two conditions is categorized as an Intrinsic Mode Function:

1. The number of extrema and the number of zero crossings in the signal must be either equal or differ at most by one, and
2. The mean value of the envelope defined by joining the points of local minima and local maxima must be zero.

In order to decompose a signal $x[n]$ into its corresponding IMFs, the signal is subjected to a sifting process, namely,

1. For the signal $x[n]$, find the locations of all local maxima and minima. Define the initial residue as $r_0[n] = x[n]$.
2. Connect all the local maxima (minima) by applying a cubic spline interpolation to obtain the upper (lower) envelope $e_u[n]$ ($e_l[n]$).
3. Compute the mean $m[n] = (e_u[n] + e_l[n]) / 2$.
4. Update the initial residue, $r_0[n] \leftarrow r_0[n] - m[n]$.
5. Repeat Steps 1 - 4 until $r_0[n]$ gets reduced to a function $h_1[n]$ which satisfies the properties of an IMF.
6. Obtain the first residue $r_1[n] = r_0[n] - h_1[n]$, where $r_0[n] = x[n]$ is the initial residue defined in Step 1.
7. Repeat Steps 1-6 with the residue $r_1[n]$ as the initial residue to find all the IMFs $h_i[n]$, $i = 1, 2, \cdots, M$.
8. Stop the process when the residue $r_M[n]$ becomes either monotonic, or a function with single maxima and minima or is a constant.

By performing the decomposition process, the signal $x[n]$ can be represented as a sum of IMFs and the final residue, namely,

$$x[n] = r_M[n] + \sum_{i=1}^{M} h_i[n] \qquad (1)$$

Figure 2: IMFs for PD and HC, (a)-(f) ((g)-(l)) represent first 5 IMFs and residue for PD (HC) speech of phonation /a/.

Note that the process of decomposing a signal into IMFs and then representing the IMFs using the instantaneous amplitude and frequency is termed as Hilbert Huang Transform (HHT). Features extracted from the IMFs can be used as complementary features to the standard signal processing practices. In this regard, HHT can be understood as a generalized Fourier Transform that represents the signal in terms of a finite number of components [23].
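The sifting procedure of Section 3.1 can be illustrated with a short, dependency-free sketch. It is not the implementation used in the paper: to stay within the Python standard library it swaps the cubic spline envelopes of Step 2 for piecewise-linear ones, and the function names, the iteration cap `max_iter`, the tolerance `tol` and the toy signal `demo` are all assumptions made for illustration. Step 6 is implemented as the current residue minus the extracted IMF, which for the first IMF equals $x[n] - h_1[n]$.

```python
import math

def local_extrema(x):
    """Step 1: indices of interior local maxima and minima."""
    maxima, minima = [], []
    for i in range(1, len(x) - 1):
        if x[i - 1] < x[i] >= x[i + 1]:
            maxima.append(i)
        elif x[i - 1] > x[i] <= x[i + 1]:
            minima.append(i)
    return maxima, minima

def envelope(x, idx):
    """Step 2 (simplified): piecewise-linear envelope through (i, x[i]), i in idx.
    The paper uses cubic splines; linear interpolation is an assumption here.
    The envelope is held constant before the first and after the last extremum."""
    env = [x[idx[0]]] * (idx[0] + 1)
    for a, b in zip(idx, idx[1:]):
        for n in range(a + 1, b + 1):
            t = (n - a) / (b - a)
            env.append((1 - t) * x[a] + t * x[b])
    env.extend([x[idx[-1]]] * (len(x) - 1 - idx[-1]))
    return env

def sift(r, max_iter=50, tol=1e-6):
    """Steps 1-5: subtract the envelope mean until it is (nearly) zero."""
    h = list(r)
    for _ in range(max_iter):
        maxima, minima = local_extrema(h)
        if not maxima or not minima:
            break
        upper, lower = envelope(h, maxima), envelope(h, minima)
        mean = [(u + l) / 2 for u, l in zip(upper, lower)]  # Step 3
        h = [v - m for v, m in zip(h, mean)]                # Step 4
        if max(abs(m) for m in mean) < tol:
            break
    return h

def emd(x, max_imfs=5):
    """Steps 6-8: peel off IMFs one at a time, curtailed at max_imfs."""
    imfs, residue = [], list(x)
    for _ in range(max_imfs):
        maxima, minima = local_extrema(residue)
        if len(maxima) <= 1 and len(minima) <= 1:  # Step 8: stop criterion
            break
        h = sift(residue)
        imfs.append(h)
        residue = [r - v for r, v in zip(residue, h)]  # Step 6
    return imfs, residue

# Toy two-tone signal with a slow trend (illustrative only, not speech data).
demo = [math.sin(2 * math.pi * 5 * n / 400)
        + 0.5 * math.sin(2 * math.pi * 40 * n / 400)
        + 0.002 * n for n in range(400)]
imfs, residue = emd(demo)
```

By construction the extracted IMFs and the final residue add back up to the input, which is the property stated in Equation (1); the quality of the individual IMFs, however, depends on the envelope interpolation and the stopping rule, both of which are simplified here.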

In general, healthy speech is more coherent than the speech of a PD patient and as a result HC speech is decomposed faster (smaller $M$) than PD speech. This observation forms the hypothesis of our work. Previous studies have focused on using handcrafted spectral and temporal features extracted from these IMFs in order to discriminate between healthy and pathological speech (see [11, 24]). In this paper, we propose a machine learning approach to use the raw IMFs in order to diagnose the presence of Parkinson's disease. The first set of results is on the sustained phonations from both PD and HC. We consider the first five IMFs, namely, $h_1[n]$ to $h_5[n]$, and the residue $r_5[n]$ as the input to our classifier.
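As a minimal sketch of how the six classifier inputs could be assembled, the snippet below assumes that the residue after five IMFs follows Equation (1) with the decomposition curtailed at five, that is $r_5[n] = x[n] - \sum_{i=1}^{5} h_i[n]$. The function name `build_cnn_inputs` and the toy data are hypothetical, and the text does not say what happens when a signal yields fewer than five IMFs, so this sketch simply raises an error in that case.

```python
def build_cnn_inputs(x, imfs, k=5):
    """Return [h_1, ..., h_k, r_k] where r_k[n] = x[n] - sum_i h_i[n].

    x    : the 1 second utterance as a sequence of samples
    imfs : IMFs of x, ordered from first to last
    k    : number of IMFs kept (five in the paper)
    """
    if len(imfs) < k:
        # Behaviour for short decompositions is not specified in the text.
        raise ValueError("need at least k IMFs")
    first_k = [list(h) for h in imfs[:k]]
    residue_k = [x[n] - sum(h[n] for h in first_k) for n in range(len(x))]
    return first_k + [residue_k]

# Toy integer "IMFs" and signal, for illustration only.
toy_imfs = [[(i + 1) * (-1) ** n for n in range(8)] for i in range(6)]
toy_x = [sum(h[n] for h in toy_imfs) + n for n in range(8)]
inputs = build_cnn_inputs(toy_x, toy_imfs)
```

Each of the six returned sequences has the same length as the utterance, so for real data every channel would hold 16000 samples.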

Figure 2 depicts the first 5 IMFs and the final residue corresponding to the sustained phonation /a/ spoken by a HC ((a)-(f)) and a PD ((g)-(l)) subject. Clearly, one can visually notice the difference between the IMFs and the residue for HC and PD speech samples. These IMFs capture the characteristics of the parent signal and hence can be employed to extract information useful for pathological speech classification. This is the difference we wish to exploit to discriminate speech uttered by PD and speech uttered by HC.

3.2. Experimental Setup

The architecture of the 1D-CNN model used for the classification task is shown in Figure 3. The input to the 1D-CNN model is the raw IMF signal. The 1D-CNN was trained using the Keras [25] deep learning library with Tensorflow [26] backend. We use speech signals (as mentioned in Table 1) of 1 second, which corresponds to 16000 samples. Each 1 second speech utterance is subjected to the EMD process and the first 5 IMFs ($h_1[n], h_2[n], \cdots, h_5[n]$) are extracted along with the final residue ($r_5[n]$). These are then fed as input to a multiple-input 1D-CNN network. Thus, the input to the network is a set of 6 vectors (time series), each of 16000 dimensions. We set the kernel size for the CNN to be 320 with a stride of 160 and the number of filters is chosen by performing a grid search to optimize the classification accuracy. The output of the CNN is then concatenated after a Global MaxPooling operation and is fed to a fully connected layer with ReLU activation function, while the number of neurons is optimized by using a grid search. For the output layer, softmax activation function is used with the output dimensions being the two classes, namely, HC and PD. The target to the model was a one-hot encoding of the health state of the individual. We trained the network using binary cross-entropy loss with Adam optimizer. We set the learning rate to the default value of 0.001. In order to obtain speaker independent results which can be scaled to populations outside the training set, we perform a leave-one-speaker-out validation of the model wherein utterances corresponding to 49 speakers are used for training the model and the model is tested on the left out speaker. For all experiments, 20% of the training data is randomly chosen for the purpose of validating the model. For the test speaker, the posterior probabilities obtained from the model output for each 1 second utterance were averaged for classification. Note that the Italian PD dataset is not large enough (as is common with pathological speech databases) to define separate train, test and validation sets; using the leave one out mechanism allows predictions for all the speakers without relying on any sort of speaker specific information.

4. Results

The experimental results using 1D-CNN obtained for leave-one-speaker-out for different phonations are tabulated in Table 2. In order to account for variations in outcomes due to random weight initialization of the 1D-CNN, we repeat the experiment 5 times and report the average accuracy obtained in Table 2. We also report the specificity and sensitivity, which are defined as the percentage of correctly classified HC and PD utterances respectively. The confusion matrix for 5 individual runs for the phonation /a/ is also shown in Table 3; as can be observed, the number of correctly recognized subjects is not significantly different, and the variation between different runs is ±2. As can be observed in Figure 2, the final residue ($r_5[n]$) is most reflective of the difference between PD and HC speech samples followed by IMFs $h_4[n]$ and $h_5[n]$. To evaluate if $r_5[n]$ by itself independently captures the discriminating properties between HC and PD, we trained a single input 1D-CNN model using $r_5[n]$ as the input, namely, all inputs were 0 except the last residue input $r_5[n]$ in Figure 3. We perform a similar analysis by training another model with inputs as signals $h_4[n]$, $h_5[n]$ and $r_5[n]$. The results obtained by using these approaches are reported in Tables 4 and 5. Clearly, the performance deteriorates (it can be observed that for the phonation /a/ there is a drop in accuracy from 76% to 69.6% and 64.4%) compared to when all the IMFs and residue are used together. Further, we combine the results obtained by using each of the individual phonations by taking a majority vote on the predictions obtained by each of the 5 different models. The class confusion matrix using this approach is presented in Table 6. We achieve an average accuracy of 85%, while the specificity and sensitivity values are 81.82% and 87.5% respectively.

The use of IMF signals as raw features in a 1D-CNN classifier shows promise to be able to discriminate PD and HC as can be seen in Table 2. To the best of our knowledge, a study on classification of PD and HC using the Italian Parkinson's Voice and Speech database has not been attempted earlier. However, our results are comparable to the state-of-the-art measures which have been validated on other datasets, for example [11, 12, 13, 27]. Note that we did not have access to these datasets to make a direct comparison. On closer observation, we observed that most of the PD patients misclassified by our proposed approach belong to the class of 11 (of the 28) PD patients in the database who were rated 0 (namely, having no speech problems) on the UPDRS test scale by the clinicians. This is consistent with the fact that assigning a precise rating (PD or HC) for these boundary cases is challenging even for trained experts, which translates to misclassification of these samples.

5. Discussion

EMD is a popular decomposition technique used to analyze non-stationary and non-linear signals. The IMFs can be used to extract features like instantaneous amplitude and frequency, marginal spectrum etc. which are relevant for pathological speech classification. However, in this paper we propose a deep architecture in the form of 1D-CNN which allows us to use the raw IMF signal instead of having to select and extract explicit features useful for pathological speech classification. It is commonly assumed that neural networks are black boxes that are unable to provide interpretable results. We attempt to explain the performance of the proposed architecture.

For the 1D-CNN, we used a kernel size of 320 with a stride of 160. In hindsight this is equivalent to extracting features from 20 ms of speech with a shift of 10 ms, which is common practice in speech processing owing to the non-stationary nature of the speech signal. Further,

• The 1D-CNN network can be assumed to be a feature extraction mechanism which, given a raw IMF (or residue), extracts a set of discriminative features. The number of filters may be interpreted as the number of features extracted from a particular input signal.
• The extracted features from input signals $h_1[n]$ - $h_5[n]$ and $r_5[n]$ are then concatenated to form a feature vector.
• The Dense layers then act as a simple binary classifier with the input as the concatenated feature vector.
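The correspondence between a kernel size of 320 with a stride of 160 and a 20 ms window with a 10 ms shift follows directly from the 16 kHz sampling rate. The short sketch below works through that arithmetic; the helper name `conv1d_frames` is hypothetical, and the frame count assumes an unpadded ("valid") convolution, which the text does not specify.

```python
SAMPLE_RATE_HZ = 16000   # sampling frequency of the recordings
KERNEL_SIZE = 320        # convolution kernel length in samples
STRIDE = 160             # convolution stride in samples

def conv1d_frames(num_samples, kernel_size, stride):
    """Number of output positions of an unpadded 1-D convolution (assumed padding)."""
    if num_samples < kernel_size:
        return 0
    return (num_samples - kernel_size) // stride + 1

# Kernel and stride expressed in milliseconds of speech.
window_ms = 1000 * KERNEL_SIZE / SAMPLE_RATE_HZ
shift_ms = 1000 * STRIDE / SAMPLE_RATE_HZ

# Number of analysis frames obtained from one 1 second utterance.
frames_per_utterance = conv1d_frames(SAMPLE_RATE_HZ, KERNEL_SIZE, STRIDE)
```

Under these assumptions each filter produces 99 values per 1 second input, and the Global MaxPooling operation keeps only the largest of them, so each filter contributes one feature per input signal.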

As one can observe, the use of raw IMFs mitigates the need to explicitly extract handcrafted features from the IMFs; the 1D-CNN architecture learns discriminating features from the raw signal to distinguish between PD and HC speech samples. For the purpose of decomposing the signal, the speech sample is segmented into fixed durations of 1 second each. This duration is long enough to capture the non-stationary aspect of speech as well as the dynamics involved in the phonation of vowel sounds.
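The segmentation into fixed 1 second utterances can be sketched as follows. The function name `segment_utterances` is hypothetical, and discarding a trailing segment shorter than 1 second is an assumption, since the text only states that the segments are non-overlapping and of 1 second each.

```python
def segment_utterances(samples, sample_rate=16000, seconds=1.0):
    """Split a recording into non-overlapping segments of fixed duration.

    A trailing remainder shorter than one segment is dropped (assumption).
    """
    seg_len = int(sample_rate * seconds)
    n_full = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]

# A silent 2.5 second recording at 16 kHz, used only to show the shapes.
recording = [0.0] * 40000
utterances = segment_utterances(recording)
```

With this rule a 2.5 second phonation contributes two utterances of 16000 samples each, which is why speakers who can sustain a vowel longer contribute more rows to Table 1.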

6. Conclusion

Parkinson's Disease is a chronic neuro-degenerative disease which is difficult to diagnose. The symptoms of PD can be mistaken for natural aging, thereby making the diagnosis very challenging. Tracking changes in speech has proven to be a useful tool for establishing a non-invasive approach to early detection of PD. In this work, we propose an efficient technique to discriminate PD and HC subjects by analyzing their speech samples of sustained phonation. Traditional approaches have focused on experimenting with handcrafted spectral and temporal features. In this paper, however, we focus on machine learning the discriminating features of speech associated with PD patients and healthy controls from the raw IMF signals. We train a 1D-CNN model using these raw IMFs to learn the discriminating properties in the signals to classify PD and HC subjects.

[1] Hoehn, Yahr, Parkinsonism: onset, progression and mortality, Neurology 17 (1967) 427-442.

[2] Tysnes, Storstein, Epidemiology of Parkinson's disease, Journal of Neural Transmission 124 (2017) 901-905. doi:10.1007/s00702-017-1686-y.

[3] A. K. Ho, Iansek, M. C., B. J.L., G. S., Speech impairment in a large sample of patients with Parkinson's disease, Behavioral Neurology 11 (1998/1999) 131-137.

[4] Fahn, R. L. Elton, Unified Parkinson's disease rating scale, Recent Developments in Parkinson's Disease, Macmillan Health Care Information 2 (1987) 153-163.

[5] Cohen, Disorders of speech and language in Parkinson's disease, Mental and Behavioral Dysfunction in Movement Disorders, M. A. Bédard, Agid, A. D. Korczyn, Lesperance, and S. Chouinard, Eds. New York, NY, USA: Humana Press, (2003) 125-134.

[6] A. K. Ho, Iansek, M. C., B. J.L., G. S., Motor instability in parkinsonian speech intensity, Neuropsychiatry, Neuropsychology and Behavioral Neurology 14 (2001) 109-116.

[7] M. A. Little, P. E. McSharry, E. J. Hunter, Spielman, L. O. Ramig, Suitability of dysphonia measurements for telemonitoring of Parkinson's disease, IEEE Transactions on Biomedical Engineering 56 (2009) 1015-1022.

[8] Novotný, Rusz, R. Čmejla, E. Růžička, Automatic evaluation of articulatory disorders in Parkinson's disease, IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (2014) 1366-1378.

[9] Rusz, Cmejla, Quantitative acoustic measurements for characterization of speech and voice disorders in early untreated Parkinson's disease, Journal of Acoustical Society of America 129 (2011) 350.

[10] Rusz, Cmejla, Imprecise vowel articulation as a potential early marker of Parkinson's disease: Effect of speaking task, Journal of Acoustical Society of America 134 (2013) 2171.

[11] Rueda, J. Vásquez-Correa, C. D. Rios-Urrego, J. R. Orozco-Arroyave, S. Krishnan, E. Noeth, Feature Representation of Pathophysiology of Parkinsonian Dysarthria, in: Proc. Interspeech 2019, 2019, pp. 3048-3052. URL: http://dx.doi.org/10.21437/Interspeech.2019-2490. doi:10.21437/Interspeech.2019-2490.