A Two-Step Framework for Parkinson's Disease Classification: Using Multiple One-Way ANOVA on Speech Features and Decision Trees

Gaurang Prasad,1 Thilanka Munasinghe,2 Oshani Seneviratne2
1 wikiHow  2 Rensselaer Polytechnic Institute
gaurang@wikihow.com, munast@rpi.edu, senevo@rpi.edu

AAAI Fall 2020 Symposium on AI for Social Good. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We propose a two-step classification framework to diagnose Parkinson's Disease (PD) using speech samples. In the first step, multiple one-way ANalysis Of VAriance (ANOVA) tests are used on independent subsets of vocal features to extract the best set of features from each speech processing algorithm. These extracted feature subsets are then merged with the other baseline vocal features (shimmer, jitter, pitch, harmonicity, vocal fold, and fundamental frequency parameters) to form the training feature set. In the second step, this combined training set is used to train an extreme gradient boosting (XGBoost) classification model, a decision-tree-based algorithm. Overall model performance was scored and evaluated using the Receiver Operating Characteristic Area Under Curve (ROC AUC), F-measure, Matthews Correlation Coefficient (MCC), and accuracy. The model was then compared with benchmarked statistical classifiers and other studies that use different combinations of features from this PD dataset. We apply one-way ANOVA to the different speech feature sets to extract the best features without losing useful vocal information. Our classification performance outperforms state-of-the-art PD classification models that use generic feature selection methods or use only one or more of the vocal feature subsets.

PD is one of the most common motor-system degeneration diseases and results from the loss of cells in various parts of the brain. PD's primary symptoms are tremor, slow movement, speech disorder, impaired balance, and gait problems. There are no diagnostic tests or biomarkers for PD diagnosis, because its symptoms resemble those observed in other diseases. Physicians use methods like MRI, ultrasound, and blood tests to eliminate other conditions with similar symptoms. Research has also been done to detect PD using various motor and non-motor symptoms (Tolosa et al. 2009). However, there is no standard way to diagnose PD.

PD diagnosis has typically involved measuring the severity of the symptoms using non-invasive medical techniques. Since approximately 90% of PD patients suffer from speech disorders, analyzing speech samples to study vocal impairment is considered the most common technique for PD diagnosis (Shahbakhi, Far, and Tahami 2014). The extent of vocal impairment is typically assessed using sustained vowel phonations (Little et al. 2008). Sustained vowel phonations do not capture all morphological or lexical speech features, but research shows that they are sufficient for distinguishing between PD subjects and healthy controls (Gürüler 2017). Most PD classification studies using speech features have focused on jitter, shimmer, and signal-to-noise ratio. Recent studies have also used other vocal features, such as fundamental frequency parameters, Mel-Frequency Cepstral Coefficients (MFCCs), harmonicity features, Wavelet Transform (WT)-based features, and Tunable Q-factor Wavelet Transform (TQWT)-based features, to better understand speech deterioration. TQWT was first used in 2019 for PD classification and was shown to perform better than other vocal features for PD diagnosis (Sakar et al. 2019). The performance of PD classification models depends directly on the selection of vocal features used for training them.

Past studies have used different combinations of the aforementioned features to train classifiers without any focus on extracting useful features from the different types of vocal features. This study proposes a novel two-step classification framework for PD diagnosis. The first step uses multiple one-way ANOVAs to extract vocal features from the MFCCs, WTs, and TQWTs separately. The extracted feature sets are merged with the other baseline vocal features to form the final training set. In the second step, a decision-tree-based classifier is trained on this training set to make predictions. To the best of our knowledge, this is the first PD classification study that employs a multiple-ANOVA strategy to extract the best vocal features from TQWT, MFCCs, and WTs, and combines all of them with standard baseline features like jitter, shimmer, etc., to generate an extensive training set. Our study shows that extracting features separately from each subset not only prevents loss of useful vocal/signal information but also addresses the high-dimensional nature of the dataset. Using a decision-tree-based classifier on the extracted features also handles any class imbalance without the need to oversample or undersample the dataset. Classification results obtained on the public dataset show that our proposed two-step framework outperforms current state-of-the-art models that use just one or more of the vocal feature subsets without extracting the best features from the individual algorithms.

Literature Review

There are no laboratory tests or biomarkers for the diagnosis of PD (Cova and Priori 2018). Consequently, there has been significant research in measuring the severity of symptoms to diagnose PD. Tseng et al. (2014) have shown multiple eye-tracking methods for PD diagnosis.
Jansson et al. (2015) proposed two approaches using stochastic anomaly detection in eye-tracking data. There have also been multiple studies that use gait and tremor measures to diagnose PD (Lee and Lim 2012; Manap, Tahir, and Yassin 2011).

Analyzing voice samples and their deterioration has shown great potential for advancing PD diagnosis (Ramani and Sivagami 2011). Vocal impairment has also been shown to be among the earliest symptoms of PD, detectable up to five years before clinical diagnosis (Oung et al. 2015). This aligns with clinical evidence, which shows that most PD patients exhibit vocal disorders. These studies reinforce the notion that speech samples reflect disease status once the necessary information is extracted from the vowel phonations.

There have been multiple studies on PD classification techniques using vocal features. Gürüler (2017) proposed a system using a complex-valued artificial neural network with k-means clustering and achieved an accuracy of 99.52%. Das (2010) also used neural networks and demonstrated an accuracy of 92.9%. Peker, Sen, and Delen (2015) achieved 98.1% accuracy using complex-valued neural networks with minimum Redundancy Maximum Relevance (mRMR) feature selection. Gil and Manuel (2009) achieved an accuracy of 90% using a multilayer perceptron and Support Vector Machines (SVM). Karimi Rouzbahani and Daliri (2011) used a K-Nearest Neighbor (KNN) classifier and achieved an accuracy of 93.82%. Hazan et al. (2012) proposed using a country-specific sample of the training data and achieved 94% accuracy. Many of these studies use a public dataset consisting of 195 vocal measurements belonging to 23 PD subjects and 8 healthy controls (Little et al. 2008). Another publicly available dataset used in the aforementioned studies consists of multiple speech recordings of 20 PD subjects and 20 healthy controls (Sakar et al. 2013). Since most of the proposed PD classifiers perform analysis on one of these datasets, the extracted vocal features from the speech samples largely overlap. Although high classification rates have been reported in these studies, both of these datasets are extremely small, and models trained on them are prone to overfitting to a very small sample of features. Sakar et al. (2019) have shown that the cross-validation methods used in these studies introduce biases, since the number of controls in them was minimal.

Sakar et al. (2019) collected 3 voice recordings each from 252 subjects to build a much larger dataset for PD classification. Apart from the baseline vocal features used in previous studies, they also extracted MFCCs, WTs, and, for the first time, TQWT-based features. They reported a highest classification accuracy of 86% by using an SVM-Radial Basis Function (SVM-RBF) classifier and just the MFCCs feature set. Using only the TQWT-based features, they reported the highest individual classifier accuracy of 85% with an F-measure of 0.84 using a multilayer perceptron classifier. They also demonstrated using an mRMR feature selection algorithm on the entire feature set to select the top-50 features; this improved their classification accuracy to 86% with an F-measure of 0.84 using an SVM-RBF classifier. This was the first study that used TQWT-based features for PD classification. It was also the first study to report an improvement in diagnostic accuracy by combining all features and selecting the 50 best with a feature selection algorithm. They found that MFCCs and TQWT contain complementary information, and combining them improves classification performance.

Since then, a few studies have proposed different classification methods using TQWT-based features and this larger dataset built by Sakar et al. (2019). Gunduz (2019) proposed two frameworks using Convolutional Neural Networks (CNN). The first framework combines all features and inputs them to a 9-layer CNN. The second framework passes the feature sets to parallel input layers connected to the convolution layers of the CNN. They achieved an accuracy of 84.9% using a combination of TQWT and baseline features. This was improved to 86.9% by using triple feature sets comprising TQWT, WT, and baseline features. They reported that the TQWT features had the best performance metrics among all classifiers.

Solana-Lavalle, Galán-Hernández, and Rosas-Romero (2020) proposed using a Wrapper Feature Selection method along with an SVM classifier and obtained a classification accuracy of 94.7% on the larger dataset. The feature selection method used in that study did not account for the biological and vocal features in the dataset separately; instead, it selected the best K features suited to the classifier used. Only 8 to 20 features are selected from the 754 vocal features. This leads to loss of valuable acoustic and signal information, especially from the WT and TQWT-based features, since they are extensive wavelet techniques that quantify frequency deviations in speech signals and contain 10+ original features each. Wrapper feature selection methods try to find the best set of features suited to a specific learning algorithm by evaluating combinations of features against the evaluation/performance metric, and thus there is also a high chance of overfitting to the training data.

Polat (2019) proposed a hybrid approach using a combination of the Synthetic Minority Over-sampling Technique (SMOTE) and a Random Forest Classifier (RFC). They achieved an accuracy of 87.037% without SMOTE and a higher accuracy of 94.89% by over-sampling the minority class (healthy controls) and then training an RFC. By over-sampling, this study changed the original dataset to balance the classes. Over-sampling also increases the likelihood of overfitting because it replicates datapoints of the oversampled class, and it does not consider that neighboring examples can be from different classes. Studies on class-imbalanced data have shown that SMOTE is not beneficial for high-dimensional datasets (Maldonado, López, and Vairetti 2019; Joseph 2020): it leads to overlap of the classes and additional noise in an already high-dimensional dataset (Joseph 2020).

Compared to the previous work, ours is one of the first studies to demonstrate an improved speech feature selection methodology and a robust decision-tree-based classifier that handles class imbalance without having to modify the original dataset by over-sampling or under-sampling.
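To make the wrapper-selection critique above concrete, here is a minimal sketch on synthetic data, using scikit-learn's SequentialFeatureSelector as a stand-in wrapper method (the cited studies' exact selectors are not reproduced here): each candidate feature is judged by the cross-validated score of the downstream SVM itself, which is why the chosen subset is tied to that one classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

# Small synthetic stand-in for a speech-feature dataset.
X, y = make_classification(n_samples=200, n_features=30,
                           n_informative=5, random_state=0)

# Greedy wrapper selection: features are added one at a time, and each
# candidate subset is scored by 5-fold cross-validation of the SVM itself.
sfs = SequentialFeatureSelector(SVC(kernel="rbf"),
                                n_features_to_select=8, cv=5)
sfs.fit(X, y)
print(sfs.get_support().sum())  # 8: the subset is tuned to this classifier
```

Swapping the SVC for a different estimator can change which 8 features survive, illustrating the paper's point that such subsets reflect the classifier rather than the disease.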
Dataset

The dataset we used for the analysis was gathered at the Department of Neurology of the Cerrahpasa Faculty of Medicine, Istanbul University (Sakar et al. 2019). It contains the information of 188 patients with PD (107 men and 81 women) and 64 healthy controls (23 men and 41 women), with ages varying between 41 and 82. The researchers set the microphone to 44.1 kHz, and the sustained phonation of the vowel "ahh. . . " was collected from each subject with three repetitions. These phonations were fed into the Praat acoustic analysis software to extract information about jitter, shimmer, vocal fold, fundamental frequency, harmonicity, Recurrence Period Density Entropy (RPDE), Detrended Fluctuation Analysis (DFA), and Pitch Period Entropy (PPE) from the signal. In the gathered dataset, these fundamental vocal features, along with gender, are called the baseline features.

MFCCs of a sound signal separate the impact of the vocal cords (source) and the vocal tract (filter) in the signal (Poorjam 2018). This helps detect deterioration in the movement of articulators like the tongue and lips, which are affected by PD. Higher-order MFCCs represent greater levels of spectral detail, and typically 10 to 20 MFCCs are used for speech analysis. In this dataset, there are 13 original MFCCs and 71 derived features formed from the mean and standard deviation of the original signals, in addition to the log-energy of the signal and its 1st and 2nd derivatives (Sakar et al. 2019).

WT is used to analyze signals in terms of wavelets (time- and frequency-domain limited functions) to detect regional fluctuations. WT features of the fundamental frequency of the speech signal (F0) have been used for PD diagnosis (Gunduz 2019). They capture the amount of deviation in speech samples and thus detect any distortions in the vowel phonations. A 10-level discrete WT is applied to the signals to extract WT-based features from F0 and its log transformation. This results in 182 features, including the log energy entropy and Teager-Kaiser energy of both the approximation and detail coefficients (Sakar et al. 2019).

TQWT is a discrete-time wavelet transform, like WT. TQWT uses 3 tunable parameters (Q, J, and r) that are tuned based on the behavior of the speech signal (Sakar et al. 2019). TQWT has recently been used in PD studies since it can detect distortion in vocal fold vibrations. The TQWT parameters were set by considering the time-domain characteristics of the speech signals. The tunable Q-factor is related to the number of oscillations in the signal: a high Q value is selected for signals with high oscillations in the time domain. The parameter J comes from the end of the decomposition stage of the transformation; there are J levels and J + 1 sub-bands coming from J high-pass filters and one final low-pass filter. The redundancy parameter, r, controls the excessive ringing to localize the wavelet without affecting its shape (Sakar et al. 2019). First, the value of the Q parameter is defined to control the oscillatory behavior of the wavelets. The r parameter was set equal to or greater than 3 to prevent undesired ringing in the wavelets. To find the best accuracy values of the different Q-r pairs, several levels (J) were searched in the specified intervals, and in total 432 TQWT features are extracted (Sakar et al. 2019). Table 1 describes the 4 feature subsets in this dataset and the number of features in each.

Feature Category | Description of feature-set                                  | Num. feats.
Baseline         | Jitter, shimmer, harmonicity, frequency, vocal fold, pitch  | 54
MFCC             | Speech deterioration indicator                              | 84
WT               | Fundamental frequency deviations in speech signals          | 182
TQWT             | More extensive quantification method for fundamental        | 432
                 | frequency deviations as compared to WT                      |

Table 1: Description of speech feature categories.

Figure 1: End-to-end classification framework.

Methodology

PD classification is treated as a binary classification task in which the framework takes as input the extracted speech features and predicts a class (PD/no PD). Figure 1 illustrates the end-to-end classification framework for PD diagnosis. The dataset contains 752 features in 4 feature sets: baseline features, MFCCs, WT, and TQWT. The drawback of using MFCCs, WT, and TQWT together is the 'curse of dimensionality': high-dimensional datasets lead to overfitting, obscure useful vocal information, and cause computational instability. Extracting a meaningful set of features from each feature set is important to reduce the dimensionality of the training set while still ensuring that all useful vocal features are retained; it also reduces the computational complexity of the classifier. We propose using one-way ANOVA selection schemes to extract the best-performing training features from the MFCC, WT, and TQWT feature sets. The selected features from each method are merged with the baseline features, and this merged feature set serves as the training data for the classifier. We then train an optimized XGBoost classifier on the training data and evaluate its performance against past studies and benchmarked statistical classification models.
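The first step described above can be sketched with scikit-learn's SelectKBest and the ANOVA F-test score function f_classif, both used by the paper. The feature matrices below are random stand-ins with the per-subset sizes from Table 1, and the per-subset k values (40, 75, 100) follow the counts the paper reports for MFCC, WT, and TQWT.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 756  # 252 subjects x 3 recordings

# Random stand-ins with the per-subset feature counts from Table 1.
X_baseline = rng.normal(size=(n, 54))
subsets = {"mfcc": (rng.normal(size=(n, 84)), 40),
           "wt":   (rng.normal(size=(n, 182)), 75),
           "tqwt": (rng.normal(size=(n, 432)), 100)}
y = rng.integers(0, 2, size=n)

# One independent one-way ANOVA F-test per subset: keep all 54 baseline
# features and the top-k_i features from each of the other subsets.
parts = [X_baseline]
for X, k in subsets.values():
    parts.append(SelectKBest(f_classif, k=k).fit_transform(X, y))

X_train = np.hstack(parts)
print(X_train.shape)  # (756, 269): 54 baseline + 40 MFCC + 75 WT + 100 TQWT
```

Running a separate SelectKBest per subset, rather than one over the concatenated matrix, is what keeps the baseline features intact and prevents one high-dimensional subset (TQWT) from crowding out the others.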
ANOVA Feature Selection

ANOVA is a statistical hypothesis test used to determine whether the means of two or more samples of data come from the same distribution. It is usually used in problems involving numerical inputs and a categorical target variable. There are two types of ANOVA: one-way and two-way. One-way ANOVA involves only one independent variable, while two-way ANOVA compares two independent variables.

To find how well each speech feature discriminates between the two output classes, we use a one-way ANOVA F-test. F-tests are a class of statistical tests that calculate the ratio between variance values. ANOVA tests the following null hypothesis (H0): there is no difference between the groups, and the groups have the same mean value. The alternative hypothesis (H1) is that there is a difference between the group means. The ANOVA F-test produces an F-score based on the ratio of the variance among the group means to the variance within the groups. Group means drawn from features with the same or highly similar mean values will have low between-group variance and hence a low F-score. A high F-score implies that the feature's group means differ and that the feature can better discriminate between the categories of the dependent variable. The results of this test can be used for feature selection, where features that are independent of the target variable are removed from the training set. The F-score for each speech feature is calculated as follows:

F = BGV / WGV

where the Between Group Variability (BGV) and Within Group Variability (WGV) are calculated as:

BGV = \sum_{i=1}^{K} \frac{n_i (\bar{Y}_{i\cdot} - \bar{Y})^2}{K - 1}

WGV = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \frac{(Y_{ij} - \bar{Y}_{i\cdot})^2}{N - K}

where K is the number of groups, N is the overall sample size, and n_i is the number of observations in the i-th group. Y_{ij} is the j-th observation in the i-th of the K groups, \bar{Y} is the overall mean of the variable set, and \bar{Y}_{i\cdot} is the sample mean of the i-th group. K - 1 is also defined as the degrees of freedom in some studies, referring to the maximum number of logically independent values with the freedom to vary.

The scikit-learn machine learning library provides a native implementation of the one-way ANOVA F-test (f_classif) and a SelectKBest class to pick the features with the highest F-scores. The F-test score function returns an array of F-scores, one for each speech feature; the SelectKBest class then picks the k features with the highest scores (Pedregosa et al. 2011).

Using ANOVA feature selection on the entire dataset leads to loss of vital vocal information. Each of the 54 baseline features provides fundamental and distinct speech information; removing any of them loses information that is not available in any of the other vocal feature sets. Simply selecting the best k features from the entire dataset by the highest F-scores leaves out many crucial original and derived features. This is especially observed in the high-dimensional WT and TQWT feature subsets. It can also lead to overfitting to certain derived features, or to a classification model that relies primarily on features that perform well for that specific model instead of features that represent the disease. To conserve the vital information obtained from each feature subset while also addressing the broader dimensionality problem, we extract features from each feature set separately. This ensures that the original signals are retained and focuses the search on the best-performing derived features. All baseline features are used, and the best k_i features are extracted from the MFCCs, WTs, and TQWTs, respectively. Each k_i is obtained using grid-search cross-validation, which evaluated different combinations of k_i features from each subset to find the optimal classification performance. Forty features from the MFCCs, 75 from the WT, and 100 from the TQWT were selected with the highest F-scores in their category, and these were used along with the baseline features as the training set.

XGBoost Classifier

XGBoost is a robust gradient boosting library based on ensemble tree boosting. Its fundamental function predicts a new classification membership after each iteration: predictions are made from weak classifiers and are iteratively improved, with incorrect classifications from the previous iteration receiving higher weights, forcing the model to focus on improving their performance. The final classification combines the improvements of all the previously modeled trees. XGBoost is less susceptible to overfitting because of its more robust regularization framework. An XGBoost classifier was trained on the training dataset extracted after ANOVA. XGBoost's built-in cross-validation was used at each iteration to get the optimal number of boosting iterations in a single run, and grid-search cross-validation was used to optimize the model parameters. The final hyperparameters obtained are shown in Table 2. The optimized model achieved the highest classification accuracy of 94.78%. In the following section, we evaluate our framework's performance against benchmarked statistical models and other studies on this dataset.

Parameter            | Value
Learning Rate        | 0.05
Number of Estimators | 1000
Max Depth            | 5
Min Child Weight     | 1
Gamma                | 0
Subsample            | 0.8
Col. Sample by Tree  | 0.8
Num. Thread          | 4
Scale POS Weight     | 1

Table 2: XGBoost hyperparameters.

               SVM                    RFC                    GBC
Feature Set      | AUC   F1    Acc.  | AUC   F1    Acc.  | AUC   F1    Acc.
Baseline         | 0.5   0.865 0.762 | 0.704 0.902 0.841 | 0.695 0.884 0.815
MFCC             | 0.561 0.867 0.772 | 0.723 0.904 0.846 | 0.717 0.897 0.836
WT               | 0.537 0.849 0.746 | 0.654 0.859 0.778 | 0.604 0.84  0.746
TQWT             | 0.5   0.868 0.767 | 0.82  0.932 0.894 | 0.867 0.938 0.905
Baseline + MFCC  | 0.5   0.84  0.725 | 0.724 0.887 0.825 | 0.767 0.891 0.836
Baseline + WT    | 0.529 0.822 0.709 | 0.654 0.863 0.783 | 0.673 0.869 0.764
Baseline + TQWT  | 0.5   0.834 0.714 | 0.707 0.885 0.82  | 0.728 0.886 0.825
MFCC + WT        | 0.561 0.867 0.772 | 0.723 0.904 0.846 | 0.717 0.897 0.836
MFCC + TQWT      | 0.5   0.847 0.735 | 0.799 0.925 0.883 | 0.805 0.925 0.883
WT + TQWT        | 0.509 0.839 0.725 | 0.736 0.894 0.836 | 0.742 0.893 0.836
All features     | 0.508 0.828 0.709 | 0.737 0.898 0.841 | 0.742 0.897 0.841

Table 3: Classification performance of benchmarked statistical classifiers (SVM, RFC, GBC) on different combinations of features without ANOVA.
Evaluation

Evaluation metrics are needed to assess the predictive performance of the proposed framework. Although accuracy is a common metric, it may yield misleading results in the case of an unbalanced class distribution. Metrics such as the F-measure, MCC, and ROC AUC can measure how well a classifier performs even under class imbalance. We therefore use ROC AUC, F-measure, MCC, and accuracy to evaluate the performance of the proposed framework against statistical classifiers and other studies using this dataset.

When using individual feature sets, the TQWT-based features perform better than the other feature subsets. A significant improvement in classification performance is observed when one feature set (baseline, MFCC, or WT) is complemented with TQWT features. Using ANOVA to extract the best features and then using them to train an XGBoost model performs better than the other state-of-the-art techniques proposed on this dataset. Polat's (2019) proposal to use SMOTE to over-sample the minority class and train an RFC leads to a slightly better classification accuracy (by 0.001); however, the AUC, F-measure, and MCC of Polat's model are unknown. The performance of the benchmarked classifiers (SVM, RFC, and the Gradient Boosting Classifier, GBC) on different feature combinations is shown in Table 3. The performance metrics of our proposed framework, compared to other studies, are presented in Table 4. We also demonstrate that the multi-ANOVA strategy performs better than a single ANOVA on the entire feature set.

Model/Study                                      | AUC  | F1   | Acc.  | MCC
multi-ANOVA + XGBoost (proposed framework)       | 0.91 | 0.96 | 0.947 | 0.86
Combined ANOVA + XGBoost                         | 0.89 | 0.94 | 0.928 | 0.81
Gunduz (2019): All features + CNN                | n/a  | 0.89 | 0.833 | 0.52
Gunduz (2019): All features + SVM                | n/a  | 0.91 | 0.857 | 0.59
Sakar et al. (2019): Top-50 features using       | n/a  | 0.84 | 0.86  | 0.59
  mRMR + SVM (RBF)                               |      |      |       |
Polat (2019): RFC                                | n/a  | n/a  | 0.87  | n/a

Table 4: Performance compared with other studies.

Conclusion

This paper presents a two-step classification framework to diagnose PD using a set of 753 vocal features. We propose a novel vocal-feature selection technique for PD classification using multiple one-way ANOVAs on the MFCC, WT, and TQWT feature sets. The selected features are merged with the baseline vocal and biological features to form the training set. We propose an XGBoost classifier trained on the extracted data for PD classification. The proposed framework achieves a classification accuracy of 94.71% with an F1 of 0.965 and an MCC of 0.86. We show that the proposed framework performs better than the state of the art without altering the dataset by over- or under-sampling. We demonstrate that separately extracting features from the different algorithms reduces the dimensionality without the loss of any vital speech information and performs better than a generic feature selection technique. We also show that the proposed framework performs better than the benchmarked statistical classifiers. Most literature on PD diagnosis relies on a very small sample collected from 20-30 persons; this paper demonstrates high prediction accuracy on a significantly larger dataset (252 persons), thereby validating the generalization capabilities of the model.
Using the proposed framework, clinical diagnosis of early-onset PD can be made consistent across physicians, thereby reducing the chances of misdiagnosis. Specifically, the high levels of accuracy, F1, MCC, and ROC AUC indicate that there is a negligible chance of missing a diagnosis. We have open-sourced the code used in this study in a public GitHub repository (https://github.com/Gaurangprasad/parkinson_disease_ANOVA_classifier).

References

Cova, I.; and Priori, A. 2018. Diagnostic biomarkers for Parkinson's disease at a glance: where are we? Journal of Neural Transmission 125(10): 1417–1432.

Das, R. 2010. A comparison of multiple classification methods for diagnosis of Parkinson disease. Expert Systems with Applications 37(2): 1568–1572.

Gil, D.; and Manuel, D. J. 2009. Diagnosing Parkinson by using artificial neural networks and support vector machines. Global Journal of Computer Science and Technology 9(4).

Gunduz, H. 2019. Deep learning-based Parkinson's disease classification using vocal feature sets. IEEE Access 7: 115540–115551.

Gürüler, H. 2017. A novel diagnosis system for Parkinson's disease using complex-valued artificial neural network with k-means clustering feature weighting method. Neural Computing and Applications 28(7): 1657–1666.

Hazan, H.; Hilu, D.; Manevitz, L.; Ramig, L. O.; and Sapir, S. 2012. Early diagnosis of Parkinson's disease via machine learning on speech data. In 2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel, 1–4. IEEE.

Jansson, D.; Medvedev, A.; Axelson, H.; and Nyholm, D. 2015. Stochastic anomaly detection in eye-tracking data for quantification of motor symptoms in Parkinson's disease. In Signal and Image Analysis for Biomedical and Life Sciences, 63–82. Springer.

Joseph, J. 2020. Imbalanced Data. URL https://medium.com/@jasonjoseph072/imbalanced-data-97e2e8a9e0a8.

Karimi Rouzbahani, H.; and Daliri, M. R. 2011. Diagnosis of Parkinson's disease in human using voice signals. Basic and Clinical Neuroscience 2(3): 12–20.

Lee, S.-H.; and Lim, J. S. 2012. Parkinson's disease classification using gait characteristics and wavelet-based feature extraction. Expert Systems with Applications 39(8): 7338–7344.

Little, M.; McSharry, P.; Hunter, E.; Spielman, J.; and Ramig, L. 2008. Suitability of dysphonia measurements for telemonitoring of Parkinson's disease. Nature Precedings 1–1.

Maldonado, S.; López, J.; and Vairetti, C. 2019. An alternative SMOTE oversampling strategy for high-dimensional datasets. Applied Soft Computing 76: 380–389.

Manap, H. H.; Tahir, N. M.; and Yassin, A. I. M. 2011. Statistical analysis of Parkinson disease gait classification using Artificial Neural Network. In 2011 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), 060–065. IEEE.

Oung, Q. W.; Muthusamy, H.; Lee, H. L.; Basah, S. N.; Yaacob, S.; Sarillee, M.; and Lee, C. H. 2015. Technologies for assessment of motor disorders in Parkinson's disease: a review. Sensors 15(9): 21710–21745.

Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825–2830.

Peker, M.; Sen, B.; and Delen, D. 2015. Computer-aided diagnosis of Parkinson's disease using complex-valued neural networks and mRMR feature selection algorithm. Journal of Healthcare Engineering 6.

Polat, K. 2019. A hybrid approach to Parkinson disease classification using speech signal: the combination of SMOTE and random forests. In 2019 Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT), 1–3. IEEE.

Poorjam, A. H. 2018. Why we take only 12-13 MFCC coefficients in feature extraction? URL https://www.researchgate.net/post/Why_we_take_only_12-13_MFCC_coefficients_in_feature_extraction.

Ramani, R. G.; and Sivagami, G. 2011. Parkinson disease classification using data mining algorithms. International Journal of Computer Applications 32(9): 17–22.

Sakar, B. E.; Isenkul, M. E.; Sakar, C. O.; Sertbas, A.; Gurgen, F.; Delil, S.; Apaydin, H.; and Kursun, O. 2013. Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings. IEEE Journal of Biomedical and Health Informatics 17(4): 828–834.

Sakar, C. O.; Serbes, G.; Gunduz, A.; Tunc, H. C.; Nizam, H.; Sakar, B. E.; Tutuncu, M.; Aydin, T.; Isenkul, M. E.; and Apaydin, H. 2019. A comparative analysis of speech signal processing algorithms for Parkinson's disease classification and the use of the tunable Q-factor wavelet transform. Applied Soft Computing 74: 255–263.

Shahbakhi, M.; Far, D. T.; and Tahami, E. 2014. Speech analysis for diagnosis of Parkinson's disease using genetic algorithm and support vector machine. Journal of Biomedical Science and Engineering 2014.

Solana-Lavalle, G.; Galán-Hernández, J.-C.; and Rosas-Romero, R. 2020. Automatic Parkinson disease detection at early stages as a pre-diagnosis tool by using classifiers and a small set of vocal features. Biocybernetics and Biomedical Engineering 40(1): 505–516.

Tolosa, E.; Gaig, C.; Santamaría, J.; and Compta, Y. 2009. Diagnosis and the premotor phase of Parkinson disease. Neurology 72(7 Supplement 2): S12–S20.

Tseng, P.-H.; Cameron, I. G.; Munoz, D. P.; and Itti, L. 2014. Eye-tracking method and system for screening human diseases. US Patent 8,808,195.