=Paper= {{Paper |id=Vol-2699/paper28 |storemode=property |title=CNN based Parkinson's Disease Assessment using Empirical Mode Decomposition |pdfUrl=https://ceur-ws.org/Vol-2699/paper28.pdf |volume=Vol-2699 |authors=Ayush Tripathi,Sunil Kumar Kopparapu |dblpUrl=https://dblp.org/rec/conf/cikm/TripathiK20 }} ==CNN based Parkinson's Disease Assessment using Empirical Mode Decomposition== https://ceur-ws.org/Vol-2699/paper28.pdf
CNN based Parkinson’s Disease Assessment using
Empirical Mode Decomposition
Ayush Tripathi^a, Sunil Kumar Kopparapu^a
^a TCS Research & Innovation - Mumbai, Tata Consultancy Services Limited, Maharashtra, India.



                                    Abstract
                                    Parkinson’s Disease (PD) is a neuro-degenerative disorder which is caused by a decrease in dopamine producing neurons
                                    in the human body and affects the body’s motor system. In addition to affecting several motor and non-motor activities
                                    of a person’s day to day life, PD patients have difficulty in speech production due to reduced coordination of the muscles
                                    that control breathing, phonation, articulation and prosody. Analyzing speech allows clinicians to objectively measure the
                                    severity of PD in a non-invasive way. In this work, we propose an effective method to discriminate between PD and healthy
                                    control (HC) subjects by utilizing a technique to decompose a speech signal into simpler Intrinsic Mode Functions called the
                                    Empirical Mode Decomposition. We train a Convolutional Neural Network (CNN) to learn significant properties from raw
                                    IMFs for the purpose of PD-HC classification. We evaluate our technique on sustained phonations speech from the Italian
                                    Parkinson’s Voice and Speech database. Experimental results show that significant characteristics of Parkinsonian dysarthria
                                    can be learnt by using the raw IMFs and the need for explicitly extracting handcrafted features could be mitigated.

                                    Keywords
                                    Parkinson’s speech, Empirical Mode Decomposition, Intrinsic Mode Function, sustained phonation


Proceedings of the CIKM 2020 Workshops, October 19-20, Galway, Ireland. Editors of the Proceedings: Stefan Conrad, Ilaria Tiddi
email: t.ayush@tcs.com (A. Tripathi); sunilkumar.kopparapu@tcs.com (S.K. Kopparapu)
url: https://www.tcs.com (S.K. Kopparapu)
orcid: 0000-0002-7944-2260 (A. Tripathi); 0000-0002-0502-527X (S.K. Kopparapu)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

Parkinson’s Disease (PD) is a neuro-degenerative disorder which is caused by a decrease in dopamine producing neurons in the human body and affects the body’s motor system [1]. PD affects 1-2 per 1000 of the population at any time. The prevalence of PD increases with age and it affects roughly 1% of the population above 60 years [2]. Normal respiratory and well controlled articulatory movements are fundamental for producing well-coordinated normal speech. The common signs and symptoms of PD such as tremor, bradykinesia, rigid muscles and akinesia hamper the ability of an individual to precisely control the speech producing organs, which leads to disordered speech. This manifests in PD patients in the form of soft voice, monotone, breathiness, hoarse voice quality, imprecise articulation and a decrease in naturalness while speaking [3].

    In the absence of any specific laboratory test or instrument to measure or monitor the evolution and treatment response of PD, it is crucial to track motor functions such as gait and speech to examine the disease. Importantly, the signs of PD are often confused with those of natural aging, making the diagnosis even more challenging. Clinicians widely use the Unified Parkinson’s Disease Rating Scale (UPDRS) [4] for evaluation of PD. The evaluation is carried out through face to face interviews and clinical observations using a set of questions to evaluate: (a) non-motor experiences of daily living, (b) motor experiences of daily living, (c) motor examination, and (d) motor complications.

    Naturally spoken speech can be analyzed in a non-invasive manner and hence the study of changes in acoustic properties of speech is a center-point of research for the measurement of symptomatic changes in PD [5]. Articulation, voice intensity, frequency spectrum, and speech intelligibility are the main acoustic parameters observed for tracking changes in speech. It was observed [6] that PD patients suffer from a reduction in the range of articulatory movement which in turn leads to impaired vowel articulation. The production of vowels is a complicated process that involves precise control over the movements of the tongue, lips and jaw, creating oropharyngeal resonating cavities, which amplify certain frequency bands of the voice spectrum called formants. The possibility of using sustained phonation /a/ for discriminating PD from healthy subjects was first proposed in [7].

    A set of 13 features describing different aspects of Parkinsonian speech for the task was suggested in [8]. Phonation and rhythm features [9] and other vowel features [10] to capture characteristics of PD dysarthria have been proposed in literature. An extensive feature
analysis followed by a 2 stage feature selection to represent physiological aspects of PD obtained from sustained vowel /a/ and the diadochokinetic (DDK) task was proposed in [11]. A set of frame-level features was used to construct a Fisher Vector representation of the speech sample along with a Support Vector Machine classifier in [12]. An i-vector based approach along with a large set of acoustic features was used in [13] in order to identify the most relevant features for characterizing the disorders in speech of PD patients. Voxtester [14] is a system for assessing PD related impairment by using a wide set of parameters including: voice spectrum, formants, DDK rate, voice intensity and vocal sound pressure level.

   With the advent of machine learning in all spheres of processing, the trend has been to extract more and more features using signal processing in order to discriminate PD and HC subjects. In this paper, we propose a method to classify PD and HC by decomposing the speech utterance using the Empirical Mode Decomposition (EMD) technique. EMD is the process of decomposing non-stationary time series into simpler Intrinsic Mode Functions (IMF) in the time domain. This technique has had various applications in the speech domain such as enhancement, denoising [15], formant tracking [16], pathological voice analysis [17], emotion recognition [18], glottal activity detection [19] etc. In these studies, the emphasis has been on extracting temporal and spectral features using the IMFs which are then used for classification tasks. However, to the best of our knowledge, employing raw IMFs for classification of pathological speech has not been studied. The main contribution of this paper lies in using a Convolutional Neural Network architecture to learn these features from raw IMFs without the need of explicitly extracting handcrafted features for the purpose of PD-HC classification. The approach is validated on the Italian Parkinson’s Voice and Speech database. The rest of the paper is organized as follows: Section 2 describes the database used for the experiments; we provide the description of the proposed approach in Section 3 while Section 4 details achieved results. We discuss the salient aspects of the proposed approach while also providing an analogy to the traditional feature extraction based methods in Section 5 and conclude in Section 6.

2. Dataset

The Italian Parkinson’s Voice and Speech database [20] consists of recordings from 28 (19 Male, 9 Female) speakers with Parkinson’s Disease aged between 40 and 80 years and 22 (10 Male, 12 Female) healthy controls (HC) aged between 60 and 77 years. The utterances have been recorded in a warm, echo free and quiet room at a sampling frequency of 16 kHz by keeping the microphone at a distance of 15 to 25 centimeters from the subject. The speech intelligibility of the patients was perceptually assessed on a 5-point scale based on the UPDRS protocol. The following reading tasks were performed by the subjects:

   • 2 phonations each of the vowel /a/, /e/, /i/, /o/, /u/
   • execution of syllable /pa/ and /ka/ (5 sec)
   • 2 readings of a phonetically balanced text
   • reading of phonetically balanced words and phrases

   In our study, we use a subset of this dataset, namely the sustained phonations (/a/, /e/, /i/, /o/, /u/). Depending on the severity of the condition and the speaker, the amount of time a subject can sustain a phonation is different and consequently the lengths (in seconds) of the audio recordings are unequal. As will be discussed in Section 3, we segment the unequal length speech samples into non-overlapping segments (utterances) of 1 second duration each. In all there were 1957 utterances from PD and 1445 utterances from HC (see Table 1); this forms the data in all our experiments on the phonation data for PD-HC classification. For complete information on the recording protocol, the subjects and the tasks, please refer to [21].

Table 1
Number of 1 second utterances for PD and HC categories in the dataset.
  Phonation      PD      HC
     /a/         390     269
     /e/         385     290
     /i/         403     297
     /o/         400     284
     /u/         379     305
    Total       1957    1445

3. Proposed Approach

The proposed PD diagnosis system consists of two major parts. First, the raw speech utterance of 1 second duration is decomposed into its Intrinsic Mode Functions (IMFs) by using the Empirical Mode Decomposition (EMD) technique. A 1D-CNN model is then trained using the raw IMFs as input for classifying the speech utterance into one of the two categories, namely, HC or PD. We now describe the signal decomposition process and the architecture of the 1D-CNN model used in our experiments.
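As an illustration of how the 1 second utterances could be obtained from recordings of unequal length, the following is a minimal sketch, not the authors' code. The function name `segment_utterances` and the choice to drop a trailing remainder shorter than one segment are assumptions; the paper only states that segments are non-overlapping, 1 second long, and sampled at 16 kHz.

```python
def segment_utterances(samples, sample_rate=16000, segment_seconds=1.0):
    """Split a recording into non-overlapping fixed-length utterances.

    Trailing samples that do not fill a whole segment are dropped
    (an assumption; the handling of remainders is not described).
    """
    segment_len = int(sample_rate * segment_seconds)
    n_segments = len(samples) // segment_len
    return [
        samples[i * segment_len:(i + 1) * segment_len]
        for i in range(n_segments)
    ]


# Hypothetical 2.5 second recording at 16 kHz (40000 samples of silence).
recording = [0.0] * 40000
utterances = segment_utterances(recording)
```

With these assumptions, a 2.5 second recording yields two utterances of 16000 samples each, and the last half second is discarded.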

3.1. Empirical Mode Decomposition
Empirical Mode Decomposition is an adaptive, data
driven technique used to decompose non-stationary
and non-linear signals into Intrinsic Mode Functions
of a signal, in the time-domain itself without the re-
quirement of any a priori basis [22]. Any function that
satisfies the following two conditions is categorized as
an Intrinsic Mode Function:
   1. The number of extrema and the number of zero
      crossings in the signal must be either equal or
      differ at most by one, and
   2. The mean value of the envelope defined by join-
      ing the points of local minima and local maxima
      must be zero.
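The first of the two conditions above can be checked numerically. The sketch below is only an illustration under simple assumptions: extrema are detected as sign changes of the first difference, zero crossings as sign changes between consecutive samples, and the helper names are hypothetical. The second condition (zero mean envelope) requires envelope interpolation and is not covered here.

```python
import numpy as np


def count_extrema_and_zero_crossings(x):
    """Count local extrema and zero crossings of a sampled signal."""
    x = np.asarray(x, dtype=float)
    d = np.diff(x)
    # A local extremum is where the slope changes sign.
    n_extrema = int(np.sum(d[:-1] * d[1:] < 0))
    # A zero crossing is where consecutive samples have opposite signs.
    n_zero_crossings = int(np.sum(x[:-1] * x[1:] < 0))
    return n_extrema, n_zero_crossings


def satisfies_first_imf_condition(x):
    """True if the two counts are equal or differ by at most one."""
    n_extrema, n_zero_crossings = count_extrema_and_zero_crossings(x)
    return abs(n_extrema - n_zero_crossings) <= 1


# Hypothetical test signals: a 5 Hz sinusoid sampled at 1 kHz, with and
# without a constant offset that lifts it away from the zero axis.
t = np.arange(1000) / 1000.0
wave = np.sin(2 * np.pi * 5 * t + 0.3)
shifted_wave = wave + 2.0
```

A pure sinusoid passes the check, while the same sinusoid with a constant offset does not, since it keeps its extrema but no longer crosses zero.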
Figure 1: Empirical Mode Decomposition of a 1 second sample.

  In order to decompose a signal 𝑠[𝑛] into its corresponding IMFs, the signal is subjected to a sifting process, namely,
    1. For the signal 𝑠[𝑛], find the locations of all local
       maxima and minima. Define initial residue as,
       𝑟0 [𝑛] = 𝑠[𝑛]
    2. Connect all the local maxima (minima) by apply-
       ing a cubic spline interpolation to obtain upper
       (lower) envelope 𝐸𝑢𝑝𝑝𝑒𝑟 (𝐸𝑙𝑜𝑤𝑒𝑟 ).
    3. Compute the mean $E_{mean} = \frac{E_{upper} + E_{lower}}{2}$

    4. Update initial residue 𝑟0 [𝑛] ← 𝑟0 [𝑛] − 𝐸𝑚𝑒𝑎𝑛
    5. Repeat Steps 1 - 4 until 𝑟0 [𝑛] = 𝑠[𝑛] gets reduced
       to a function ℎ1 [𝑛] which satisfies the properties
       of an IMF.
    6. Obtain the first residue 𝑟1 [𝑛] = 𝑠[𝑛] − ℎ1 [𝑛]
    7. Repeat Steps 1-6 with the residue 𝑟1 [𝑛] as the initial residue to find all the IMFs ℎ𝑖 [𝑛], 𝑖 = 1, 2, ⋯ , 𝐾.
    8. Stop the process when the residue 𝑟𝐾 [𝑛] becomes either monotonic, or a function with a single maximum and minimum, or a constant.

Figure 2: IMFs for PD and HC, (a)-(f) ((g)-(l)) represent first 5 IMFs and residue for PD (HC) speech of phonation /a/.
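The sifting procedure (Steps 1-8) can be sketched in a few lines of Python. This is a simplified illustration, not the implementation used in the paper: it runs a fixed number of sifting iterations instead of testing the two IMF conditions, lets the cubic splines extrapolate at the signal boundaries, and stops only when too few extrema remain or `max_imfs` is reached. The function names and the synthetic two-tone test signal are assumptions made for the example.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema


def mean_envelope(x):
    """Steps 1-3: mean of the cubic-spline upper and lower envelopes."""
    n = np.arange(len(x))
    maxima = argrelextrema(x, np.greater)[0]
    minima = argrelextrema(x, np.less)[0]
    if len(maxima) < 2 or len(minima) < 2:
        return None  # too few extrema to build both envelopes
    upper = CubicSpline(maxima, x[maxima])(n)
    lower = CubicSpline(minima, x[minima])(n)
    return (upper + lower) / 2.0


def sift(x, n_iterations=10):
    """Steps 4-5, using a fixed iteration count as a simplification."""
    h = x.copy()
    for _ in range(n_iterations):
        m = mean_envelope(h)
        if m is None:
            break
        h = h - m
    return h


def emd(signal, max_imfs=5):
    """Steps 6-8: peel off IMFs one by one and return them with the residue."""
    residue = np.asarray(signal, dtype=float)
    imfs = []
    for _ in range(max_imfs):
        if mean_envelope(residue) is None:
            break  # residue has too few extrema to continue
        imf = sift(residue)
        imfs.append(imf)
        residue = residue - imf
    return imfs, residue


# Hypothetical test signal: a 50 Hz tone plus a weaker 5 Hz tone, 1 kHz sampling.
t = np.arange(1000) / 1000.0
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 5 * t)
imfs, residue = emd(signal, max_imfs=3)
```

By construction the extracted IMFs and the final residue add back up to the input signal, which is the relation expressed by Equation (1).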
By performing the decomposition process, the signal $s[n]$ can be represented as a sum of IMFs and the final residue, namely,

$$s[n] = r_K[n] + \sum_{i=1}^{K} h_i[n] \qquad (1)$$

   Figure 1 depicts the IMFs obtained as a result of decomposing a natural speech utterance of one second duration, where the decomposition is curtailed at $K = 9$. Note that the process of decomposing a signal into IMFs and then representing the IMFs using the instantaneous amplitude and frequency is termed the Hilbert-Huang Transform (HHT). Features extracted from the IMFs can be used as complementary features to the standard signal processing practices. In this regard, HHT can be understood as a generalized Fourier Transform that represents the signal in terms of a finite number of components [23].

   In general, healthy speech is more coherent than the speech of a PD patient and as a result HC speech is decomposed faster (smaller $K$) than PD speech. This observation forms the hypothesis of our work. Previous studies have focused on using handcrafted spectral and temporal features extracted from these IMFs in order to discriminate between healthy and pathological speech (see [11, 24]). In this paper, we propose a machine learning approach to use the raw IMFs in order to diagnose the presence of Parkinson’s disease. The first set of results are on the sustained phonations from both PD and HC. We consider the first five IMFs, namely, $h_1[n]$ to $h_5[n]$, and the residue $r_5[n]$ as the input to our classifier.

Figure 3: Proposed 1D-CNN Architecture. [Six parallel branches, one for each of the inputs $h_1[n]$, $h_2[n]$, $h_3[n]$, $h_4[n]$, $h_5[n]$ and $r_5[n]$, each consisting of InputLayer, Batch Normalization, Conv1D and Global MaxPooling1D; the branch outputs go through Concatenate, followed by Dense, Dense and Prediction.]

   Figure 2 depicts the first 5 IMFs and the final residue corresponding to the sustained phonation /a/ spoken by a HC ((a)-(f)) and a PD ((g)-(l)) subject. Clearly, one can visually notice the difference between the IMFs and the residue for HC and PD speech samples. These IMFs capture the characteristics of the parent signal and hence can be employed to extract information useful for pathological speech classification. This is the
difference we wish to exploit to discriminate speech uttered by PD and speech uttered by HC.

3.2. Experimental Setup
The architecture of the 1D-CNN model used for the classification task is shown in Figure 3. The input to the 1D-CNN model is the raw IMF signal. The 1D-CNN was trained using the Keras [25] deep learning library with the TensorFlow [26] backend. We use speech signals (as mentioned in Table 1) of 1 second, which corresponds to 16000 samples. Each 1 second speech utterance is subjected to the EMD process and the first 5 IMFs ($h_1[n], h_2[n], \cdots, h_5[n]$) are extracted along with the final residue ($r_5[n]$). These are then fed as input to a multiple-input 1D-CNN network. Thus, the input to the network is a set of 6 vectors (time series), each of dimension 16000. We set the kernel size for the CNN to be 320 with a stride of 160, and the number of filters is chosen by performing a grid search to optimize the classification accuracy. The output of the CNN is then concatenated after a Global MaxPooling operation and is fed to a fully connected layer with ReLU activation function, while the number of neurons is optimized by using a grid search. For the output layer, the softmax activation function is used with the output dimensions being the two classes, namely, HC and PD. The target to the model was a one-hot encoding of the health state of the individual. We trained the network using binary cross-entropy loss with the Adam optimizer. We set the learning rate to the default value of 0.001. In order to obtain speaker independent results which can be scaled to populations outside the training set, we perform a leave-one-speaker-out validation of the model wherein utterances corresponding to 49 speakers are used for training the model and the model is tested on the left out speaker. For all experiments, 20% of the training data is randomly chosen for the purpose of validating the model. For the test speaker, the posterior probabilities obtained from the model output for each 1 second utterance were averaged for classification. Note that the Italian PD dataset is not large enough (as is common with pathological speech databases) to define separate train, test and validation sets; using a leave-one-out mechanism allows predictions for all the speakers without relying on any sort of speaker specific information.

4. Results
The experimental results using 1D-CNN obtained for leave-one-speaker-out for different phonations are tabulated in Table 2. In order to account for variations in outcomes due to random weight initialization of the 1D-CNN, we repeat the experiment 5 times and report the average accuracy obtained in Table 2. We also report the specificity and sensitivity, which are defined as the percentage of correctly classified HC and PD utterances respectively. The confusion matrix for 5 individual runs for the phonation /a/ is also shown in Table 3; as can be observed, the number of correctly recognized subjects is not significantly different, and the variation between different runs is ±2. As can be observed in Figure 2, the final residue ($r_5[n]$) is most reflective of the difference between PD and HC speech samples, followed by IMFs $h_4[n]$ and $h_5[n]$. To evaluate if $r_5[n]$ by itself independently captures the discriminating properties between HC and PD, we trained a single input 1D-CNN model using $r_5[n]$ as the input,
namely, all inputs were 0 except the last residue input $r_5[n]$ in Figure 3. We perform a similar analysis by training another model with inputs as signals $h_4[n]$, $h_5[n]$ and $r_5[n]$. The results obtained by using these approaches are reported in Tables 4 and 5. Clearly, the performance deteriorates (it can be observed that for the phonation /a/ there is a drop in accuracy from 76% to 69.6% and 64.4%) compared to when all the IMFs and residue are used together. Further, we combine the results obtained by using each of the individual phonations by taking a majority vote on the predictions obtained by each of the 5 different models. The class confusion matrix using this approach is presented in Table 6. We achieve an average accuracy of 85%, while the specificity and sensitivity values are 81.82% and 87.5% respectively.

Table 2
Accuracies for Phonation tasks (proposed approach).
  Phonation      Accuracy     Specificity     Sensitivity
     /a/          76.00          80.00           72.86
     /e/          76.40          78.57           73.64
     /i/          72.00          68.57           76.36
     /o/          72.40          68.57           77.27
     /u/          72.00          70.00           74.55
   Average        73.76          73.14           74.94

Table 3
Confusion matrix for 5 runs for the phonation /a/ (proposed approach).
                     PD                               HC
  PD     21, 20, 20, 21, 20 (72.86%)      7, 8, 8, 7, 8 (27.14%)
  HC     5, 3, 4, 5, 5 (20.0%)            17, 19, 18, 17, 17 (80.0%)

Table 4
Accuracies for Phonation tasks (using $h_4[n]$, $h_5[n]$ and $r_5[n]$).
  Phonation      Accuracy     Specificity     Sensitivity
     /a/          69.6           61.82           75.00
     /e/          72.8           71.82           73.57
     /i/          59.6           51.82           65.71
     /o/          66.4           60.00           71.43
     /u/          62.4           52.73           70.00
   Average        66.16          59.64           71.14

Table 5
Accuracies for Phonation tasks (using only residue).
  Phonation      Accuracy     Specificity     Sensitivity
     /a/          64.4           52.72           73.57
     /e/          67.2           74.54           61.43
     /i/          56.4           51.82           60.00
     /o/          62.4           55.45           67.86
     /u/          61.2           57.27           64.29
   Average        62.32          58.36           65.43

Table 6
Class confusion matrix for the classification system by using majority voting across all 5 sustained phonations.
            PD        HC
  PD       87.5      12.5
  HC       18.18     81.82

   The use of IMF signals as raw features in a 1D-CNN classifier shows promise to be able to discriminate PD and HC as can be seen in Table 2. To the best of our knowledge, a study on classification of PD and HC using the Italian Parkinson’s Voice and Speech database has not been attempted earlier. However, our results are comparable to the state-of-the-art measures which have been validated on other datasets, for example [11, 12, 13, 27]. Note that we did not have access to these datasets to make a direct comparison. On closer observation, we found that most of the PD patients misclassified by our proposed approach belong to the class of 11 (of the 28) PD patients in the database who were rated 0 (namely, having no speech problems) on the UPDRS test scale by the clinicians. This is consistent with the fact that assigning a precise rating (PD or HC) for these boundary cases is challenging even for trained experts, which translates to misclassification of these samples.

5. Discussion

EMD is a popular decomposition technique used to analyze non-stationary and non-linear signals. The IMFs can be used to extract features like instantaneous amplitude and frequency, marginal spectrum etc. which are relevant for pathological speech classification. However, in this paper we propose a deep architecture in the form of a 1D-CNN which allows us to use the raw IMF signal instead of having to select and extract explicit features useful for pathological speech classification. It is commonly assumed that neural networks are black boxes that do not yield interpretable results. We attempt to explain the performance of the proposed architecture.

   For the 1D-CNN, we used a kernel size of 320 with a stride of 160. In hindsight this is equivalent to
extracting features from 20 ms of speech with a shift     References
of 10 ms which is common practice in speech process-
ing owing to the non-stationary nature of the speech      [1] M. Hoehn, M. Yahr, Parkinsonism: onset, pro-
signal. Further,                                              gression and mortality, Neurology 17 (1967) 427–
                                                              442.
     • The 1D-CNN network can be assumed to be a [2] O. Tysnes, A. Storstein,                         Epidemiol-
       feature extraction mechanism which, given a raw        ogy of parkinson’s disease,            Journal of
       IMF (or residue), extracts a set of discriminative     Neural Transmission 124 (2017) 901–905.
       features. The number of filters may be inter-          doi:10.1007/s00702-017-1686-y.
       preted as the number of features extracted from [3] A. K. Ho, R. Iansek, M. C., B. J.L., G. S., Speech
       a particular input signal.                             impairment in a large sample of patients with
                                                              parkinson’s disease, Behavioral Neurology 11
     • The extracted features from input signals ℎ1 [𝑛]       (1998/1999) 131–137.
       - ℎ5 [𝑛] and 𝑟5 [𝑛] are then concatenated to form [4] S. Fahn, R. L. Elton, Unified parkinsons disease
       a feature vector.                                      rating scale, Recent Developments in Parkin-
     • The Dense layers then act as a simple binary           sons Disease,Macmillan Health Care Information
       classifier with the input as the concatenated fea-     2 (1987) 153–163.
       ture vector.                                       [5] H. Cohen, Disorders of speech and language
                                                              in parkinson’s disease,      Mental and Behav-
   As one can observe, the use of raw IMFs removes the need to explicitly extract handcrafted features from the IMFs; the 1D-CNN architecture learns discriminating features from the raw signal to distinguish between PD and HC speech samples. For the purpose of decomposing the signal, the speech sample is segmented into fixed durations of 1 second each. This duration is long enough to capture the non-stationary aspect of speech as well as the dynamics involved in the phonation of vowel sounds.
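A minimal sketch of the fixed-duration segmentation described above. It assumes a sampling rate of 16 kHz and that a trailing remainder shorter than 1 second is discarded; neither choice is stated here, and the function name is hypothetical. Each returned segment would then be passed to an EMD routine, which is not shown.

```python
import math


def segment_signal(samples, sample_rate_hz, segment_seconds=1.0):
    """Split a signal into consecutive, non-overlapping fixed-length segments.

    Any trailing samples that do not fill a whole segment are dropped
    (an assumption made for this sketch).
    """
    segment_length = int(round(sample_rate_hz * segment_seconds))
    if segment_length <= 0:
        raise ValueError("segment length must be at least one sample")
    n_segments = len(samples) // segment_length
    return [samples[i * segment_length:(i + 1) * segment_length]
            for i in range(n_segments)]


# Stand-in for a sustained phonation: 2.5 s of a 150 Hz sine at 16 kHz.
sample_rate_hz = 16_000  # assumed sampling rate
phonation = [math.sin(2.0 * math.pi * 150.0 * n / sample_rate_hz)
             for n in range(int(2.5 * sample_rate_hz))]

segments = segment_signal(phonation, sample_rate_hz)
print(len(segments), [len(s) for s in segments])
```

Here the 2.5 s input yields two 1-second segments and the final half second is discarded; padding the remainder instead would be an equally plausible choice.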
6. Conclusion

Parkinson’s Disease is a chronic neuro-degenerative disease which is difficult to diagnose. The symptoms of PD can be mistaken for natural aging, which makes the diagnosis very challenging. Tracking changes in speech has proven to be a useful tool for establishing a non-invasive approach to early detection of PD. In this work, we propose an efficient technique to discriminate between PD patients and HC subjects by analyzing their speech samples of sustained phonation. Traditional approaches have focused on experimenting with handcrafted spectral and temporal features. In this paper, however, we focus on learning the discriminating features of speech associated with PD patients and healthy controls directly from the raw IMF signals. We train a 1D-CNN model using these raw IMFs to learn the discriminating properties in the signals to classify PD and HC subjects.
[12] J. V. E. López, J. R. Orozco-Arroyave, G. Gosz-              non-stationary time series analysis, Proceed-
     tolya,      Assessing Parkinson’s Disease from                ings of the Royal Society of London. Series
     Speech Using Fisher Vectors, in: Proc. Inter-                 A: Mathematical, Physical and Engineering Sci-
     speech 2019, 2019, pp. 3063–3067. URL: http://dx.             ences 454 (1998) 903–995. doi:10.1098/rspa.
     doi.org/10.21437/Interspeech.2019-2217. doi:10.               1998.0193.
     21437/Interspeech.2019-2217.                             [23] R. Sharma, L. Vignolo, G. Schlotthauer,
[13] Y. Hauptman, R. Aloni-Lavi, I. Lapidot, T. Gure-              M. Colominas, H. L. Rufiner, S. Prasanna,
     vich, Y. Manor, S. Naor, N. Diamant, I. Opher,                Empirical mode decomposition for adaptive
     Identifying Distinctive Acoustic and Spectral                 am-fm analysis of speech: A review, Speech
     Features in Parkinson’s Disease, in: Proc. Inter-             Communication 88 (2017) 39 – 64. URL:
     speech 2019, 2019, pp. 2498–2502. URL: http://dx.             http://www.sciencedirect.com/science/article/
     doi.org/10.21437/Interspeech.2019-2465. doi:10.               pii/S0167639316302370.          doi:https://doi.
     21437/Interspeech.2019-2465.                                  org/10.1016/j.specom.2016.12.004.
[14] G. Dimauro, D. Caivano, V. Bevilacqua, F. Gi-            [24] M. Kaleem, B. Ghoraani, A. Guergachi, S. Krish-
     rardi, V. Napoletano, Voxtester, software for                 nan, Pathological speech signal analysis and clas-
     digital evaluation of speech changes in parkin-               sification using empirical mode decomposition,
     son disease, in: 2016 IEEE International Sympo-               Med Biol Eng Comput 51 (2013).
     sium on Medical Measurements and Applications            [25] F. Chollet, et al., Keras, https://keras.io, 2015.
     (MeMeA), 2016, pp. 1–6.                                  [26] M. Abadi, et al., TensorFlow: Large-scale
[15] G. Rilling, P. Flandrin, P. Goncalves, Empirical              machine learning on heterogeneous systems,
     mode decomposition, fractional gaussian noise                 2015. URL: https://www.tensorflow.org/, soft-
     and hurst exponent estimation, in: Proceedings.               ware available from tensorflow.org.
     (ICASSP ’05). IEEE International Conference on           [27] N. Garcia, J. C. Vásquez Correa, J. R. Orozco-
     Acoustics, Speech, and Signal Processing, 2005.,              Arroyave, E. Nöth, Multimodal i-vectors to detect
     volume 4, 2005, pp. iv/489–iv/492 Vol. 4.                     and evaluate parkinson’s disease, in: Proc. Inter-
[16] A. Bouzid, N. Ellouze, Voiced speech analysis by              speech 2018, 2018, pp. 2349–2353. URL: http://dx.
     empiricalmode decompositio, Advances in Non-                  doi.org/10.21437/Interspeech.2018-2295. doi:10.
     linear Speech Pro-cessing, Springer (2007).                   21437/Interspeech.2018-2295.
[17] B. Mijović, M. Silva, V. den B. R. H. Bergh, K. Alle-
     gaert, J. M. Aerts, D. Berckmans, V. S. Huffel, As-
     sessment of pain expression in infant cry signals
     using empirical mode decomposition., Methods
     Inf Med 49(05) (2010).
[18] L. Xiang, X. L., Speech emotion recognition using
     novel hht-teo based features, Journal of Comput-
     ers 6 (2011).
[19] R. Sharma, S. R. Mahadeva Prasanna, Character-
     izing glottal activity from speech using empiri-
     cal mode decomposition, in: 2015 Twenty First
     National Conference on Communications (NCC),
     2015, pp. 1–6.
[20] G. D. F. Girardi, Italian parkinson’s voice and
     speech, 2019. URL: http://dx.doi.org/10.21227/
     aw6b-tg17. doi:10.21227/aw6b-tg17.
[21] G. Dimauro, V. Di Nicola, V. Bevilacqua,
     D. Caivano, F. Girardi, Assessment of speech
     intelligibility in parkinson’s disease using a
     speech-to-text system, IEEE Access 5 (2017)
     22199–22208.
[22] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu,
     H. H. Shih, Q. Zheng, N.-C. Yen, C. C. Tung,
     H. H. Liu, The empirical mode decomposi-
     tion and the hilbert spectrum for nonlinear and