A Two-Step Framework for Parkinson's Disease Classification: Using Multiple One-Way ANOVA on Speech Features and Decision Trees

Gaurang Prasad,1 Thilanka Munasinghe,2 Oshani Seneviratne2
1 wikiHow  2 Rensselaer Polytechnic Institute
gaurang@wikihow.com, munast@rpi.edu, senevo@rpi.edu

AAAI Fall 2020 Symposium on AI for Social Good. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We propose a two-step classification framework to diagnose Parkinson's Disease (PD) using speech samples. In the first step, multiple one-way ANalysis Of VAriance (ANOVA) tests are used on independent subsets of vocal features to extract the best set of features from each speech processing algorithm. These extracted feature subsets are then merged with the other baseline vocal features (shimmer, jitter, pitch, harmonicity, vocal fold, and fundamental frequency parameters) to form the training feature set. In the second step, this combined training set is used to train an extreme gradient boosting (XGBoost) classification model, a decision-tree-based algorithm. Overall model performance was scored and evaluated using the Receiver Operating Characteristic Area Under Curve (ROC AUC), F-measure, Matthews Correlation Coefficient (MCC), and accuracy. The model was then compared with benchmarked statistical classifiers and other studies that use different combinations of features from this PD dataset. We apply one-way ANOVA to the different speech feature sets to extract the best features without losing useful vocal information. Our classification performance outperforms state-of-the-art PD classification models that use generic feature selection methods or use only one or more of the vocal feature subsets.

PD is one of the most common motor-system degeneration diseases and results from the loss of cells in various parts of the brain. PD's primary symptoms are tremor, slow movement, speech disorder, impaired balance, and gait problems. There are no diagnostic tests or biomarkers for PD diagnosis, because its symptoms resemble those observed in other diseases. Physicians use methods like MRI, ultrasound, and blood tests to eliminate other conditions with similar symptoms. Research has also been done to detect PD using various motor and non-motor symptoms (Tolosa et al. 2009). However, there is no standard way to diagnose PD.

PD diagnosis has typically involved measuring the severity of the symptoms using non-invasive medical techniques. Since approximately 90% of PD patients suffer from speech disorders, analyzing speech samples to study vocal impairment is considered the most common technique for PD diagnosis (Shahbakhi, Far, and Tahami 2014). The extent of vocal impairment is typically assessed using sustained vowel phonations (Little et al. 2008). Sustained vowel phonations do not capture all morphological or lexical speech features, but research shows that they are sufficient for distinguishing between PD subjects and healthy controls (Gürüler 2017). Most PD classification studies using speech features have focused on jitter, shimmer, and signal-to-noise ratio. Recent studies have also used other vocal features, such as fundamental frequency parameters, Mel-Frequency Cepstral Coefficients (MFCCs), harmonicity features, Wavelet Transform (WT)-based features, and Tunable Q-factor Wavelet Transform (TQWT)-based features, to better understand speech deterioration. TQWT was first used in 2019 for PD classification and was shown to perform better than other vocal features for PD diagnosis (Sakar et al. 2019). The performance of PD classification models depends directly on the selection of vocal features used for training them.

Past studies have used different combinations of the aforementioned features to train classifiers without any focus on extracting useful features from the different types of vocal features. This study proposes a novel two-step classification framework for PD diagnosis. The first step uses multiple one-way ANOVAs to extract vocal features from the MFCCs, WTs, and TQWTs separately. The extracted feature sets are merged with the other baseline vocal features to form the final training set. In the second step, a decision-tree-based classifier is trained on this training set to make predictions. To the best of our knowledge, this is the first PD classification study that employs a multiple-ANOVA strategy to extract the best vocal features from TQWT, MFCCs, and WTs, and combines all of them with standard baseline features like jitter, shimmer, etc., to generate an extensive training set. Our study shows that extracting features separately from each subset not only prevents loss of useful vocal/signal information but also addresses the high-dimensional nature of the dataset. Using a decision-tree-based classifier on the extracted features also handles any class imbalance without the need to oversample or undersample the dataset. Classification results obtained on the public dataset show that our proposed two-step framework outperforms current state-of-the-art models that use just one or more of the vocal feature subsets without extracting the best features from the individual algorithms.

Literature Review

There are no laboratory tests or biomarkers for the diagnosis of PD (Cova and Priori 2018). Consequently, there has been significant research in measuring the severity of symptoms to diagnose PD. Tseng et al. (2014) have shown multiple eye-tracking methods for PD diagnosis.
Jansson et al. (2015) proposed two approaches using stochastic anomaly detection in eye-tracking data. There have also been multiple studies that use gait and tremor measures to diagnose PD (Lee and Lim 2012; Manap, Tahir, and Yassin 2011).

Analyzing voice samples and their deterioration has shown great potential for advancing PD diagnosis (Ramani and Sivagami 2011). Vocal impairment has also been shown to be among the earliest symptoms of PD, detectable up to five years before clinical diagnosis (Oung et al. 2015). This aligns with clinical evidence, which shows that most PD patients exhibit vocal disorders. These studies reinforce the notion that speech samples reflect disease status once the necessary information is extracted from the vowel phonations.

There have been multiple studies on PD classification techniques using vocal features. Gürüler (2017) proposed a system using a complex-valued artificial neural network with k-means clustering and achieved an accuracy of 99.52%. Das (2010) also used neural networks and demonstrated an accuracy of 92.9%. Peker, Sen, and Delen (2015) achieved 98.1% accuracy using complex-valued neural networks with minimum Redundancy Maximum Relevance (mRMR) feature selection. Gil and Manuel (2009) achieved an accuracy of 90% using a multilayer perceptron and Support Vector Machines (SVM). Karimi Rouzbahani and Daliri (2011) used a K-Nearest Neighbor (KNN) classifier and achieved an accuracy of 93.82%. Hazan et al. (2012) proposed using a country-specific sample of the training data and achieved 94% accuracy. Many of these studies use a public dataset consisting of 195 vocal measurements belonging to 23 PD subjects and 8 healthy controls (Little et al. 2008). Another publicly available dataset used in the aforementioned studies consists of multiple speech recordings of 20 PD subjects and 20 healthy controls (Sakar et al. 2013). Since most of the proposed PD classifiers perform analysis on one of these datasets, the extracted vocal features from the speech samples largely overlap. Although high classification rates have been reported in these studies, both of these datasets are extremely small, and models trained on them are prone to overfitting to a very small sample of features. Sakar et al. (2019) have shown that the cross-validation methods used in these studies introduce biases, since the number of controls in them was minimal.

Sakar et al. (2019) collected 3 voice recordings each from 252 subjects to build a much larger dataset for PD classification. Apart from the baseline vocal features used in previous studies, they also extracted MFCCs, WTs, and, for the first time, TQWT-based features. They reported a highest classification accuracy of 86% by using an SVM-Radial Basis Function (SVM-RBF) classifier and just the MFCCs feature set. Using only the TQWT-based features, they reported the highest individual classifier accuracy of 85% with an F-measure of 0.84 using a multilayer perceptron classifier. They also demonstrated using an mRMR feature selection algorithm on the entire feature set to select the top-50 features; this improved their classification accuracy to 86% with an F-measure of 0.84 using an SVM-RBF classifier. This was the first study that used TQWT-based features for PD classification. It was also the first study to report an improvement in diagnostic accuracy by combining all features and selecting the 50 best with a feature selection algorithm. They found that MFCCs and TQWT contain complementary information, and combining them improves classification performance.

Since then, a few studies have proposed different classification methods using TQWT-based features and this larger dataset built by Sakar et al. (2019). Gunduz (2019) proposed two frameworks using Convolutional Neural Networks (CNN). The first framework combines all features and inputs them to a 9-layer CNN. The second framework passes the feature sets to parallel input layers connected to the convolution layers of the CNN. They achieved an accuracy of 84.9% using a combination of TQWT and baseline features. This was improved to 86.9% by using triple feature sets comprising TQWT, WT, and baseline features. They reported that the TQWT features had the best performance metrics among all classifiers.

Solana-Lavalle, Galán-Hernández, and Rosas-Romero (2020) proposed using a Wrapper Feature Selection method along with an SVM classifier and obtained a classification accuracy of 94.7% on the larger dataset. The feature selection method used in that study did not account for the biological and vocal features in the dataset separately; instead, it selected the best K features suited to the classifier used. Only 8 to 20 features are selected from the 754 vocal features. This leads to loss of valuable acoustic and signal information, especially from the WT and TQWT-based features, since they are extensive wavelet techniques that quantify frequency deviations in speech signals and contain 10+ original features each. Wrapper feature selection methods try to find the best set of features suited to a specific learning algorithm by evaluating combinations of features against the evaluation/performance metric, and thus there is also a high chance of overfitting to the training data.

Polat (2019) proposed a hybrid approach using a combination of the Synthetic Minority Over-sampling Technique (SMOTE) and a Random Forest Classifier (RFC). They achieved an accuracy of 87.037% without SMOTE and a higher accuracy of 94.89% by over-sampling the minority class (healthy controls) and then training an RFC. By over-sampling, this study changed the original dataset to balance the classes. Over-sampling also increases the likelihood of overfitting because it replicates datapoints of the oversampled class, and it does not consider that neighboring examples can be from different classes. Studies on class-imbalanced data have shown that SMOTE is not beneficial for high-dimensional datasets (Maldonado, López, and Vairetti 2019; Joseph 2020): it leads to overlap of the classes and additional noise in an already high-dimensional dataset (Joseph 2020).

Compared to the previous work, ours is one of the first studies to demonstrate an improved speech feature selection methodology and a robust decision-tree-based classifier that handles class imbalance without having to modify the original dataset by over-sampling or under-sampling.
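To make the wrapper-selection critique above concrete, here is a minimal sketch on synthetic data, using scikit-learn's SequentialFeatureSelector as a stand-in wrapper method (the cited studies' exact selectors are not reproduced here): each candidate feature is judged by the cross-validated score of the downstream SVM itself, which is why the chosen subset is tied to that one classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

# Small synthetic stand-in for a speech-feature dataset.
X, y = make_classification(n_samples=200, n_features=30,
                           n_informative=5, random_state=0)

# Greedy wrapper selection: features are added one at a time, and each
# candidate subset is scored by 5-fold cross-validation of the SVM itself.
sfs = SequentialFeatureSelector(SVC(kernel="rbf"),
                                n_features_to_select=8, cv=5)
sfs.fit(X, y)
print(sfs.get_support().sum())  # 8: the subset is tuned to this classifier
```

Swapping the SVC for a different estimator can change which 8 features survive, illustrating the paper's point that such subsets reflect the classifier rather than the disease.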
Dataset

The dataset we used for the analysis was gathered at the Department of Neurology of the Cerrahpasa Faculty of Medicine, Istanbul University (Sakar et al. 2019). It contains the information of 188 patients with PD (107 men and 81 women) and 64 healthy controls (23 men and 41 women), with ages varying between 41 and 82. The researchers set the microphone to 44.1 kHz, and the sustained phonation of the vowel "ahh. . . " was collected from each subject with three repetitions. These phonations were fed into the Praat acoustic analysis software to extract information about jitter, shimmer, vocal fold, fundamental frequency, harmonicity, Recurrence Period Density Entropy (RPDE), Detrended Fluctuation Analysis (DFA), and Pitch Period Entropy (PPE) from the signal. In the gathered dataset, these fundamental vocal features, along with gender, are called the baseline features.

MFCCs of a sound signal separate the impact of the vocal cords (source) and the vocal tract (filter) in the signal (Poorjam 2018). This helps detect deterioration in the movement of articulators like the tongue and lips, which are affected by PD. Higher-order MFCCs represent greater levels of spectral detail, and typically 10 to 20 MFCCs are used for speech analysis. In this dataset, there are 13 original MFCCs and 71 derived features formed from the mean and standard deviation of the original signals, in addition to the log-energy of the signal and its 1st and 2nd derivatives (Sakar et al. 2019).

WT is used to analyze signals in terms of wavelets (time- and frequency-domain limited functions) to detect regional fluctuations. WT features of the fundamental frequency of the speech signal (F0) have been used for PD diagnosis (Gunduz 2019). They capture the amount of deviation in speech samples and thus detect any distortions in the vowel phonations. A 10-level discrete WT is applied to the signals to extract WT-based features from F0 and its log transformation. This results in 182 features, including the log energy entropy and Teager-Kaiser energy of both the approximation and detail coefficients (Sakar et al. 2019).

TQWT is a discrete-time wavelet transform, like WT. TQWT uses 3 tunable parameters (Q, J, and r) that are tuned based on the behavior of the speech signal (Sakar et al. 2019). TQWT has recently been used in PD studies since it can detect distortion in vocal fold vibrations. The TQWT parameters were set by considering the time-domain characteristics of the speech signals. The tunable Q-factor is related to the number of oscillations in the signal: a high Q value is selected for signals with high oscillations in the time domain. The parameter J comes from the end of the decomposition stage of the transformation; there are J levels and J + 1 sub-bands coming from J high-pass filters and one final low-pass filter. The redundancy parameter, r, controls the excessive ringing to localize the wavelet without affecting its shape (Sakar et al. 2019). First, the value of the Q parameter is defined to control the oscillatory behavior of the wavelets. The r parameter was set equal to or greater than 3 to prevent undesired ringing in the wavelets. To find the best accuracy values of the different Q-r pairs, several levels (J) were searched in the specified intervals, and in total 432 TQWT features are extracted (Sakar et al. 2019). Table 1 describes the 4 feature subsets in this dataset and the number of features in each.

Feature Category | Description of feature-set                                  | Num. feats.
Baseline         | Jitter, shimmer, harmonicity, frequency, vocal fold, pitch  | 54
MFCC             | Speech deterioration indicator                              | 84
WT               | Fundamental frequency deviations in speech signals          | 182
TQWT             | More extensive quantification method for fundamental        | 432
                 | frequency deviations as compared to WT                      |

Table 1: Description of speech feature categories.

Figure 1: End-to-end classification framework.

Methodology

PD classification is treated as a binary classification task in which the framework takes as input the extracted speech features and predicts a class (PD/no PD). Figure 1 illustrates the end-to-end classification framework for PD diagnosis. The dataset contains 752 features in 4 feature sets: baseline features, MFCCs, WT, and TQWT. The drawback of using MFCCs, WT, and TQWT together is the 'curse of dimensionality': high-dimensional datasets lead to overfitting, obscure useful vocal information, and cause computational instability. Extracting a meaningful set of features from each feature set is important to reduce the dimensionality of the training set while still ensuring that all useful vocal features are retained; it also reduces the computational complexity of the classifier. We propose using one-way ANOVA selection schemes to extract the best-performing training features from the MFCC, WT, and TQWT feature sets. The selected features from each method are merged with the baseline features, and this merged feature set serves as the training data for the classifier. We then train an optimized XGBoost classifier on the training data and evaluate its performance against past studies and benchmarked statistical classification models.
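The first step described above can be sketched with scikit-learn's SelectKBest and the ANOVA F-test score function f_classif, both used by the paper. The feature matrices below are random stand-ins with the per-subset sizes from Table 1, and the per-subset k values (40, 75, 100) follow the counts the paper reports for MFCC, WT, and TQWT.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 756  # 252 subjects x 3 recordings

# Random stand-ins with the per-subset feature counts from Table 1.
X_baseline = rng.normal(size=(n, 54))
subsets = {"mfcc": (rng.normal(size=(n, 84)), 40),
           "wt":   (rng.normal(size=(n, 182)), 75),
           "tqwt": (rng.normal(size=(n, 432)), 100)}
y = rng.integers(0, 2, size=n)

# One independent one-way ANOVA F-test per subset: keep all 54 baseline
# features and the top-k_i features from each of the other subsets.
parts = [X_baseline]
for X, k in subsets.values():
    parts.append(SelectKBest(f_classif, k=k).fit_transform(X, y))

X_train = np.hstack(parts)
print(X_train.shape)  # (756, 269): 54 baseline + 40 MFCC + 75 WT + 100 TQWT
```

Running a separate SelectKBest per subset, rather than one over the concatenated matrix, is what keeps the baseline features intact and prevents one high-dimensional subset (TQWT) from crowding out the others.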
ANOVA Feature Selection

ANOVA is a statistical hypothesis test used to determine whether the means of two or more samples of data come from the same distribution. It is usually used in problems involving numerical inputs and a categorical target variable. There are two types of ANOVA: one-way and two-way. One-way ANOVA involves only one independent variable, while two-way ANOVA compares two independent variables.

To find how well each speech feature discriminates between the two output classes, we use a one-way ANOVA F-test. F-tests are a class of statistical tests that calculate the ratio between variance values. ANOVA tests the following null hypothesis (H0): there is no difference between the groups, and the groups have the same mean value. The alternative hypothesis (H1) is that there is a difference between the group means. The ANOVA F-test produces an F-score based on the ratio of the variance among the group means to the variance within the groups. Group means drawn from features with the same or highly similar mean values will have low between-group variance and hence a low F-score. A high F-score implies that the feature's group means differ and that the feature can better discriminate between the categories of the dependent variable. The results of this test can be used for feature selection, where features that are independent of the target variable are removed from the training set. The F-score for each speech feature is calculated as follows:

F = BGV / WGV

where the Between Group Variability (BGV) and Within Group Variability (WGV) are calculated as:

BGV = \sum_{i=1}^{K} \frac{n_i (\bar{Y}_{i\cdot} - \bar{Y})^2}{K - 1}

WGV = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \frac{(Y_{ij} - \bar{Y}_{i\cdot})^2}{N - K}

where K is the number of groups, N is the overall sample size, and n_i is the number of observations in the i-th group. Y_{ij} is the j-th observation in the i-th of the K groups, \bar{Y} is the overall mean of the variable set, and \bar{Y}_{i\cdot} is the sample mean of the i-th group. K - 1 is also defined as the degrees of freedom in some studies, referring to the maximum number of logically independent values with the freedom to vary.

The scikit-learn machine learning library provides a native implementation of the one-way ANOVA F-test (f_classif) and a SelectKBest class to pick the features with the highest F-scores. The F-test score function returns an array of F-scores, one for each speech feature; the SelectKBest class then picks the k features with the highest scores (Pedregosa et al. 2011).

Using ANOVA feature selection on the entire dataset leads to loss of vital vocal information. Each of the 54 baseline features provides fundamental and distinct speech information; removing any of them loses information that is not available in any of the other vocal feature sets. Simply selecting the best k features from the entire dataset by the highest F-scores leaves out many crucial original and derived features. This is especially observed in the high-dimensional WT and TQWT feature subsets. It can also lead to overfitting to certain derived features, or to a classification model that relies primarily on features that perform well for that specific model instead of features that represent the disease. To conserve the vital information obtained from each feature subset while also addressing the broader dimensionality problem, we extract features from each feature set separately. This ensures that the original signals are retained and focuses the search on the best-performing derived features. All baseline features are used, and the best k_i features are extracted from the MFCCs, WTs, and TQWTs, respectively. Each k_i is obtained using grid-search cross-validation, which evaluated different combinations of k_i features from each subset to find the optimal classification performance. Forty features from the MFCCs, 75 from the WT, and 100 from the TQWT were selected with the highest F-scores in their category, and these were used along with the baseline features as the training set.

XGBoost Classifier

XGBoost is a robust gradient boosting library based on ensemble tree boosting. Its fundamental function predicts a new classification membership after each iteration: predictions are made from weak classifiers and are iteratively improved, with incorrect classifications from the previous iteration receiving higher weights, forcing the model to focus on improving their performance. The final classification combines the improvements of all the previously modeled trees. XGBoost is less susceptible to overfitting because of its more robust regularization framework. An XGBoost classifier was trained on the training dataset extracted after ANOVA. XGBoost's built-in cross-validation was used at each iteration to get the optimal number of boosting iterations in a single run, and grid-search cross-validation was used to optimize the model parameters. The final hyperparameters obtained are shown in Table 2. The optimized model achieved the highest classification accuracy of 94.78%. In the following section, we evaluate our framework's performance against benchmarked statistical models and other studies on this dataset.

Parameter            | Value
Learning Rate        | 0.05
Number of Estimators | 1000
Max Depth            | 5
Min Child Weight     | 1
Gamma                | 0
Subsample            | 0.8
Col. Sample by Tree  | 0.8
Num. Thread          | 4
Scale POS Weight     | 1

Table 2: XGBoost hyperparameters.

               SVM                    RFC                    GBC
Feature Set      | AUC   F1    Acc.  | AUC   F1    Acc.  | AUC   F1    Acc.
Baseline         | 0.5   0.865 0.762 | 0.704 0.902 0.841 | 0.695 0.884 0.815
MFCC             | 0.561 0.867 0.772 | 0.723 0.904 0.846 | 0.717 0.897 0.836
WT               | 0.537 0.849 0.746 | 0.654 0.859 0.778 | 0.604 0.84  0.746
TQWT             | 0.5   0.868 0.767 | 0.82  0.932 0.894 | 0.867 0.938 0.905
Baseline + MFCC  | 0.5   0.84  0.725 | 0.724 0.887 0.825 | 0.767 0.891 0.836
Baseline + WT    | 0.529 0.822 0.709 | 0.654 0.863 0.783 | 0.673 0.869 0.764
Baseline + TQWT  | 0.5   0.834 0.714 | 0.707 0.885 0.82  | 0.728 0.886 0.825
MFCC + WT        | 0.561 0.867 0.772 | 0.723 0.904 0.846 | 0.717 0.897 0.836
MFCC + TQWT      | 0.5   0.847 0.735 | 0.799 0.925 0.883 | 0.805 0.925 0.883
WT + TQWT        | 0.509 0.839 0.725 | 0.736 0.894 0.836 | 0.742 0.893 0.836
All features     | 0.508 0.828 0.709 | 0.737 0.898 0.841 | 0.742 0.897 0.841

Table 3: Classification performance of benchmarked statistical classifiers (SVM, RFC, GBC) on different combinations of features without ANOVA.
Evaluation

Evaluation metrics are needed to assess the predictive performance of the proposed framework. Although accuracy is a common metric, it may yield misleading results in the case of an unbalanced class distribution. Metrics such as the F-measure, MCC, and ROC AUC can measure how well a classifier performs even under class imbalance. We therefore use ROC AUC, F-measure, MCC, and accuracy to evaluate the performance of the proposed framework against statistical classifiers and other studies using this dataset.

When using individual feature sets, the TQWT-based features perform better than the other feature subsets. A significant improvement in classification performance is observed when one feature set (baseline, MFCC, or WT) is complemented with TQWT features. Using ANOVA to extract the best features and then using them to train an XGBoost model performs better than the other state-of-the-art techniques proposed on this dataset. Polat's (2019) proposal to use SMOTE to over-sample the minority class and train an RFC leads to a slightly better classification accuracy (by 0.001); however, the AUC, F-measure, and MCC of Polat's model are unknown. The performance of the benchmarked classifiers (SVM, RFC, and the Gradient Boosting Classifier, GBC) on different feature combinations is shown in Table 3. The performance metrics of our proposed framework, compared to other studies, are presented in Table 4. We also demonstrate that the multi-ANOVA strategy performs better than a single ANOVA on the entire feature set.

Model/Study                                      | AUC  | F1   | Acc.  | MCC
multi-ANOVA + XGBoost (proposed framework)       | 0.91 | 0.96 | 0.947 | 0.86
Combined ANOVA + XGBoost                         | 0.89 | 0.94 | 0.928 | 0.81
Gunduz (2019): All features + CNN                | n/a  | 0.89 | 0.833 | 0.52
Gunduz (2019): All features + SVM                | n/a  | 0.91 | 0.857 | 0.59
Sakar et al. (2019): Top-50 features using       | n/a  | 0.84 | 0.86  | 0.59
  mRMR + SVM (RBF)                               |      |      |       |
Polat (2019): RFC                                | n/a  | n/a  | 0.87  | n/a

Table 4: Performance compared with other studies.

Conclusion

This paper presents a two-step classification framework to diagnose PD using a set of 753 vocal features. We propose a novel vocal-feature selection technique for PD classification using multiple one-way ANOVAs on the MFCC, WT, and TQWT feature sets. The selected features are merged with the baseline vocal and biological features to form the training set. We propose an XGBoost classifier trained on the extracted data for PD classification. The proposed framework achieves a classification accuracy of 94.71% with an F1 of 0.965 and an MCC of 0.86. We show that the proposed framework performs better than the state of the art without altering the dataset by over- or under-sampling. We demonstrate that separately extracting features from the different algorithms reduces the dimensionality without the loss of any vital speech information and performs better than a generic feature selection technique. We also show that the proposed framework performs better than the benchmarked statistical classifiers. Most literature on PD diagnosis relies on a very small sample collected from 20-30 persons; this paper demonstrates high prediction accuracy on a significantly larger dataset (252 persons), thereby validating the generalization capabilities of the model.
Using the proposed framework, clinical diagnosis of early-onset PD can be made consistent across physicians, thereby reducing the chances of misdiagnosis. Specifically, the high levels of accuracy, F1, MCC, and ROC AUC indicate that there is a negligible chance of missing a diagnosis. We have open-sourced the code used in this study in a public GitHub repository (https://github.com/Gaurangprasad/parkinson_disease_ANOVA_classifier).

References

Cova, I.; and Priori, A. 2018. Diagnostic biomarkers for Parkinson's disease at a glance: where are we? Journal of Neural Transmission 125(10): 1417–1432.

Das, R. 2010. A comparison of multiple classification methods for diagnosis of Parkinson disease. Expert Systems with Applications 37(2): 1568–1572.

Gil, D.; and Manuel, D. J. 2009. Diagnosing Parkinson by using artificial neural networks and support vector machines. Global Journal of Computer Science and Technology 9(4).

Gunduz, H. 2019. Deep learning-based Parkinson's disease classification using vocal feature sets. IEEE Access 7: 115540–115551.

Gürüler, H. 2017. A novel diagnosis system for Parkinson's disease using complex-valued artificial neural network with k-means clustering feature weighting method. Neural Computing and Applications 28(7): 1657–1666.

Hazan, H.; Hilu, D.; Manevitz, L.; Ramig, L. O.; and Sapir, S. 2012. Early diagnosis of Parkinson's disease via machine learning on speech data. In 2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel, 1–4. IEEE.

Jansson, D.; Medvedev, A.; Axelson, H.; and Nyholm, D. 2015. Stochastic anomaly detection in eye-tracking data for quantification of motor symptoms in Parkinson's disease. In Signal and Image Analysis for Biomedical and Life Sciences, 63–82. Springer.

Joseph, J. 2020. Imbalanced Data. URL https://medium.com/@jasonjoseph072/imbalanced-data-97e2e8a9e0a8.

Karimi Rouzbahani, H.; and Daliri, M. R. 2011. Diagnosis of Parkinson's disease in human using voice signals. Basic and Clinical Neuroscience 2(3): 12–20.

Lee, S.-H.; and Lim, J. S. 2012. Parkinson's disease classification using gait characteristics and wavelet-based feature extraction. Expert Systems with Applications 39(8): 7338–7344.

Little, M.; McSharry, P.; Hunter, E.; Spielman, J.; and Ramig, L. 2008. Suitability of dysphonia measurements for telemonitoring of Parkinson's disease. Nature Precedings 1–1.

Maldonado, S.; López, J.; and Vairetti, C. 2019. An alternative SMOTE oversampling strategy for high-dimensional datasets. Applied Soft Computing 76: 380–389.

Manap, H. H.; Tahir, N. M.; and Yassin, A. I. M. 2011. Statistical analysis of Parkinson disease gait classification using Artificial Neural Network. In 2011 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), 060–065. IEEE.

Oung, Q. W.; Muthusamy, H.; Lee, H. L.; Basah, S. N.; Yaacob, S.; Sarillee, M.; and Lee, C. H. 2015. Technologies for assessment of motor disorders in Parkinson's disease: a review. Sensors 15(9): 21710–21745.

Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825–2830.

Peker, M.; Sen, B.; and Delen, D. 2015. Computer-aided diagnosis of Parkinson's disease using complex-valued neural networks and mRMR feature selection algorithm. Journal of Healthcare Engineering 6.

Polat, K. 2019. A hybrid approach to Parkinson disease classification using speech signal: the combination of SMOTE and random forests. In 2019 Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT), 1–3. IEEE.

Poorjam, A. H. 2018. Why we take only 12-13 MFCC coefficients in feature extraction? URL https://www.researchgate.net/post/Why_we_take_only_12-13_MFCC_coefficients_in_feature_extraction.

Ramani, R. G.; and Sivagami, G. 2011. Parkinson disease classification using data mining algorithms. International Journal of Computer Applications 32(9): 17–22.

Sakar, B. E.; Isenkul, M. E.; Sakar, C. O.; Sertbas, A.; Gurgen, F.; Delil, S.; Apaydin, H.; and Kursun, O. 2013. Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings. IEEE Journal of Biomedical and Health Informatics 17(4): 828–834.

Sakar, C. O.; Serbes, G.; Gunduz, A.; Tunc, H. C.; Nizam, H.; Sakar, B. E.; Tutuncu, M.; Aydin, T.; Isenkul, M. E.; and Apaydin, H. 2019. A comparative analysis of speech signal processing algorithms for Parkinson's disease classification and the use of the tunable Q-factor wavelet transform. Applied Soft Computing 74: 255–263.

Shahbakhi, M.; Far, D. T.; and Tahami, E. 2014. Speech analysis for diagnosis of Parkinson's disease using genetic algorithm and support vector machine. Journal of Biomedical Science and Engineering 2014.

Solana-Lavalle, G.; Galán-Hernández, J.-C.; and Rosas-Romero, R. 2020. Automatic Parkinson disease detection at early stages as a pre-diagnosis tool by using classifiers and a small set of vocal features. Biocybernetics and Biomedical Engineering 40(1): 505–516.

Tolosa, E.; Gaig, C.; Santamaría, J.; and Compta, Y. 2009. Diagnosis and the premotor phase of Parkinson disease. Neurology 72(7 Supplement 2): S12–S20.

Tseng, P.-H.; Cameron, I. G.; Munoz, D. P.; and Itti, L. 2014. Eye-tracking method and system for screening human diseases. US Patent 8,808,195.