Towards Remote Differential Diagnosis of Mental and Neurological Disorders using Automatically Extracted Speech and Facial Features

Vanessa Richter1,†, Michael Neumann1 and Vikram Ramanarayanan1,2,*
1 Modality.AI, Inc., San Francisco, CA 94105, United States
2 University of California, San Francisco, CA 94127, United States

Abstract

Utilizing computer vision and speech signal processing to assess neurological and mental conditions remotely has the potential to help detect diseases and monitor their progression earlier and more accurately. Multimodal features have demonstrated usefulness in identifying cases with a disorder from controls across several health conditions. However, challenges arise in distinguishing between specific disorders during differential diagnosis, where shared characteristics among different disorders may complicate accurate classification. Our aim in this study was to evaluate the utility and accuracy of automatically extracted speech and facial features for differentiating between multiple disorders in a multi-class (differential diagnosis) setting using a machine learning classifier. We use datasets comprising people with depression, bulbar and limb onset amyotrophic lateral sclerosis (ALS), and schizophrenia, in addition to healthy controls. The data was collected in a real-world scenario with a multimodal dialog system, in which a virtual guide walked participants through a set of tasks that elicit speech and facial behavior. Our study demonstrates the utility of digital speech and facial biomarkers in assessing neurological and mental disorders for differential diagnosis. Furthermore, this research emphasizes the importance of combining information derived from multiple modalities for a more comprehensive understanding and classification of disorders.
Keywords: differential diagnosis, multi-class, mental disorders, neurological disorders, depression, schizophrenia, amyotrophic lateral sclerosis, digital biomarkers, dialog system, speech, facial, multimodal

Machine Learning for Cognitive and Mental Health Workshop (ML4CMH), AAAI 2024, Vancouver, BC, Canada
* Corresponding author.
† Vanessa Richter performed the work described in this paper when she was an intern at Modality.AI.
Email: vikram.ramanarayanan@modality.ai (V. Ramanarayanan)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.
1 https://www.who.int/news-room/fact-sheets/detail/mental-disorders, accessed 11/7/2022

1. Introduction

One out of eight individuals in the world lives with a mental health disorder, but most people do not have access to effective care.1 Moreover, disorders of the nervous system are the second leading cause of death globally [1]. The development of clinically valid digital biomarkers for neurological and mental disorders that can be automatically extracted could significantly improve patients' lives. This advancement has the potential to assist clinicians in achieving quicker and more reliable diagnoses by providing fast and objective insights into a patient's state. Note that the idea here is not to replace the clinician, but to provide effective and assistive tools that can help improve his/her efficiency, speed and accuracy.

Many speech and facial features have been shown to be useful in differentiating between mental and neurological disorders and healthy controls (HCs) [2]. However, it remains unclear how distinctly these features characterize a given disorder. For example, percent pause time (PPT) has been found to differ significantly between people with ALS (pALS) and HCs [3] as well as between people with depression symptoms and HCs [4]. Furthermore, a slower speaking rate differentiates pALS [5] as well as people with schizophrenia [6] from HCs.

To assess the utility of automatically computed digital biomarkers to capture specific disease attributes despite such shared characteristics, we aim to answer the following questions:

1. How accurately can a machine learning (ML) classifier differentially distinguish between multiple disorders – depression, schizophrenia, bulbar symptomatic ALS and bulbar presymptomatic ALS?
2. Which modalities and features are most useful for this multi-class classification task – overall and with respect to a given disorder – and how does that compare to a binary classification baseline (controls versus cases in each of the investigated health conditions)?

2. Related Work

Recently, digital speech and facial features have been shown to yield statistically significant differences between cases with neurological or mental disorders and healthy controls, to exhibit high specificity and sensitivity in discriminating between those groups, or to show high potential for monitoring disease progression and treatment effects [2, 3, 6, 7, 8, 9, 10, 11, 12].

Several studies have evaluated the detection of neurological and mental disorders in multi-class classification settings as compared to binary case-control studies [13, 14, 15]. Altaf et al. [13] introduced an algorithm for Alzheimer's disease (AD) detection validated on binary classification and multi-class classification of AD, normal and mild cognitive impairment (MCI). Using the bag-of-visual-words approach, the algorithm enhances texture-based features like the gray level co-occurrence matrix. It integrates clinical data, creating a hybrid feature vector from whole magnetic resonance (MR) brain images. They use the Alzheimer's Disease Neuro-imaging Initiative (ADNI) dataset and achieve 98.4% accuracy in binary AD versus normal classification and 79.8% accuracy in multi-class AD, normal, and MCI classification.

Furthermore, Hansen et al. [14] explored the potential of speech patterns as diagnostic markers for multiple neuropsychiatric conditions by examining recordings from 420 participants with major depressive disorder, schizophrenia, autism spectrum disorder, and non-psychiatric controls. Various models were trained and tested for both binary and multi-class classification tasks using speech and text features. While binary classification models exhibited comparable performance to prior research (F1: 0.54–0.92), multi-class classification showed a notable decrease in performance (F1: 0.35–0.75). The study further demonstrates that combining voice- and text-based models enhances overall performance by 9.4% F1 macro, highlighting the potential of a multimodal approach for more accurate neuropsychiatric condition classification.

While these studies show the effectiveness of different types of speech- and facial-derived features for assessing psychiatric conditions in differential diagnosis settings, none of them utilized 'in-the-wild' data collected remotely from participants' devices with a multimodal dialog system.

3. Multimodal Dialog Platform and Data Collection

Audiovisual data was collected using NEMSI (Neurological and Mental health Screening Instrument) [16], a multimodal dialog system for remote health assessments. An overview of the dataset creation process is illustrated in Figure 1.

[Figure 1: Overview of feature extraction and dataset creation.]

A virtual guide, Tina, led study participants through various tasks that are designed to elicit speech, facial, and motor behaviors. Having an interactive virtual guide to elicit participants' behavior allows for scalability while providing a natural but controlled and objective interview environment and data collection. Each session starts with a microphone, speaker, and camera check to ensure that the participant has given their device permission to access camera and microphone, is able to hear the instructions, and that the captured signal is of adequate quality. After these tests, the virtual guide involves participants in a structured conversation that consists of exercises (speaking tasks, open-ended questions, motor abilities) to elicit speech, facial and motor behaviors relevant to the type of disease being studied. In this work, we focus on tasks that were shared across multiple study protocols for different disease conditions: (a) sentence intelligibility test (SIT), (b) diadochokinesis (DDK), (c) read speech, and (d) a picture description task. For (a), participants were asked to read individual SIT sentences of varying lengths (5-15 words2), while (c) required reading a longer passage (Bamboo reading passage, 99 words). To assess DDK skills (b), participants were asked to repeat a pattern of syllables (/pa ta ka/) as fast as they can until they run out of breath, and (d) prompted users to describe a scene in a picture that was shown to them on screen. These tasks are inspired by previous work [17, 18, 19].
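The tasks above are designed to elicit speech from which timing features such as the percent pause time (PPT) mentioned in the Introduction can be computed. As a rough, self-contained illustration of what such a metric measures, the sketch below estimates PPT from a short-time energy mask; the frame length, energy threshold, and minimum pause duration are our own arbitrary choices, not values used by the platform (which computes its speech metrics with Praat, see Section 4.1).

```python
import numpy as np

def percent_pause_time(samples, sr, frame_ms=25, rel_threshold=0.1, min_pause_s=0.15):
    """Toy PPT estimate: percentage of frames belonging to low-energy
    runs at least min_pause_s long (i.e., pauses), over the whole take."""
    samples = np.asarray(samples, dtype=float)
    frame = int(sr * frame_ms / 1000)
    n = len(samples) // frame
    # short-time energy per non-overlapping frame
    energy = np.array([np.mean(samples[i * frame:(i + 1) * frame] ** 2) for i in range(n)])
    silent = energy < rel_threshold * energy.max()
    min_frames = max(1, int(min_pause_s * 1000 / frame_ms))
    pause_frames, run = 0, 0
    for s in np.append(silent, False):  # sentinel False flushes the final run
        if s:
            run += 1
        else:
            if run >= min_frames:
                pause_frames += run
            run = 0
    return 100.0 * pause_frames / n
```

A production system would use a proper voice-activity detector; this sketch is only meant to make the definition of a timing feature concrete.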
2 In the remainder of the paper, the different SIT sentence lengths are treated as separate tasks and are denoted as SIT_n, where n is the length in words.

3.1. Datasets

An overview of the data used in this study is given in Table 1. While some datasets for a disease may be small, there is a subset of tasks that are shared across research studies. Since the data is collected in the same way (remotely with a personal electronic device), we can create a larger dataset for the healthy population across studies to get a more accurate representation of the properties of normative behavior. For the larger dataset of healthy controls, we identify age-related trends as well as collinearity of features. This information is used to correct control as well as patient feature values for age effects and to remove feature redundancies.

Table 1: Cohort demographics. SD: standard deviation.

                             Participants    Sessions       Mean Age (SD)
Controls
  Female                     408 (63%)       655 (62.8%)    46.3 (16.4)
  Male                       240 (37%)       388 (37.2%)    46.2 (16.0)
  All                        648             1043           46.3 (16.2)
Schizophrenia
  Female                     10 (24.4%)      19 (26.4%)     36.1 (9.4)
  Male                       31 (75.6%)      53 (73.6%)     36.6 (10.1)
  All                        41              72             36.5 (9.9)
Depression
  Female                     66 (79.5%)      76 (79.2%)     34.6 (12.1)
  Male                       17 (20.5%)      20 (20.8%)     35.0 (10.2)
  All                        83              96             34.7 (11.7)
Bulbar Symptomatic ALS
  Female                     38 (48.1%)      67 (46.2%)     61.7 (10.8)
  Male                       41 (51.9%)      78 (53.8%)     61.3 (9.0)
  All                        79              145            61.5 (9.8)
Bulbar Presymptomatic ALS
  Female                     31 (50%)        54 (50.5%)     58.1 (10.9)
  Male                       31 (50%)        53 (49.5%)     62.2 (8.3)
  All                        62              107            60.1 (9.9)

3.1.1. Schizophrenia

Schizophrenia is a chronic brain disorder that affects approximately 24 million or 1 in 300 people (1 in 222 adults)3 worldwide. According to the American Psychiatric Association (APA), active schizophrenia may be characterized by episodes in which the affected individual cannot distinguish between real and unreal experiences.4 Among individuals with schizophrenia, psychiatric and medical comorbidities such as substance abuse, anxiety and depression are common [20, 21, 22]. Buckley et al. pointed out that depression is estimated to affect half of these patients. These comorbidities, as well as the variation in symptoms and medications, make the identification of multimodal biomarkers for schizophrenia a difficult task.

As can be seen in Table 1, we assessed 41 individuals with a diagnosis of schizophrenia at a state psychiatric facility in New York, NY. The study was approved by the Nathan S. Kline Institute for Psychiatric Research and we obtained written informed consent from all participants at the time of screening after explaining details of the study. The assessment of both patients and controls was overseen by a psychiatrist.

3.1.2. Amyotrophic Lateral Sclerosis

ALS is a neurological disease that affects nerve cells in the brain and spinal cord that control voluntary muscle movement. The disease is progressive and there is currently no cure or effective treatment to reverse its progression.5 Global estimates of ALS prevalence range from 1.9 to 6 per 100,000.6 Studies on ALS found comorbidity with dementia, parkinsonism and depressive symptoms [23]. Diekmann et al. [24] found depression to occur statistically significantly more often in pALS compared to HCs. In addition, Heidari et al. [25] found in a meta-analysis of 46 eligible studies that the pooled prevalence of depression among individuals with ALS was 34%, with mild, moderate, and severe depression rates at 29%, 16%, and 8%, respectively.

As shown in Table 1, data from 79 ALS bulbar symptomatic (BS) and 62 ALS bulbar pre-symptomatic (BP) patients were collected in cooperation with EverythingALS and the Peter Cohen Foundation7. In addition to the assessment of speech and facial behavior, participants filled out the ALS Functional Rating Scale-revised (ALSFRS-R), a standard instrument for monitoring the progression of ALS [26]. The questionnaire comprises 12 questions about physical ability, with each function's rating ranging from normal function (score 4) to severe disability (score 0). It includes four scales for different domains affected by the disorder: bulbar system, fine and gross motor skills, and respiratory function. The ALSFRS-R score is the total of the domain sub-scores, the sum ranging from 0 to 48. For this study, pALS were stratified into the following sub-cohorts based on their bulbar subscore (the first three ALSFRS-R questions): (a) BS ALS with a bulbar subscore < 12 and (b) BP ALS with a bulbar subscore = 12.

3.1.3. Depression

Depression is a common mental health disorder characterized by persistent sadness and lack of interest or pleasure in previously enjoyable activities. In addition, fatigue and poor concentration are common. The effects of depression can be long-lasting or recurrent and can drastically affect a person's ability to lead a fulfilling life. The disorder is one of the most common causes of disability in the world.8 One in six people (16.6%) will experience depression at some point in their lifetime.9

A well-established tool for assessing depression is the Patient Health Questionnaire (PHQ)-8 [27]. The PHQ-8 score ranges from 0 to 24 (a higher score indicates more severe depression symptoms). We investigated at least moderately severe depression cases, based on a cutoff of PHQ-8 ≥ 15. The data for this study, including the completion of the PHQ-8 questionnaire, was collected through crowd-sourcing, resulting in a sample of 83 individuals that scored at or above this cutoff. Statistics for this cohort are summarized in Table 1.

3 https://www.who.int/news-room/fact-sheets/detail/schizophrenia, accessed 05/19/2023
4 https://www.psychiatry.org/patients-families/schizophrenia/what-is-schizophrenia, accessed 05/19/2023
5 https://www.ninds.nih.gov/health-information/disorders/amyotrophic-lateral-sclerosis-als, accessed 05/19/2023
6 https://www.targetals.org/2022/11/22/epidemiology-of-als-incidence-prevalence-and-clusters/, accessed 05/19/2023
7 https://www.everythingals.org/research
8 https://www.who.int/health-topics/depression, accessed 06/20/2023
9 https://www.psychiatry.org/patients-families/depression/what-is-depression, accessed 06/20/2023
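The questionnaire-based cohort definitions above (ALSFRS-R bulbar stratification for pALS, PHQ-8 cutoff for depression) are simple enough to state in code. The sketch below is illustrative only; the function names and input formats are ours, not from the study.

```python
def bulbar_stratify(item_scores):
    """Assign a pALS participant to the BS or BP sub-cohort.
    item_scores: the 12 ALSFRS-R item scores, each 0 (severe) .. 4 (normal);
    the first three items form the bulbar subscore (0-12)."""
    assert len(item_scores) == 12 and all(0 <= s <= 4 for s in item_scores)
    bulbar = sum(item_scores[:3])
    total = sum(item_scores)               # full ALSFRS-R score, 0-48
    group = "BS" if bulbar < 12 else "BP"  # symptomatic if any bulbar deficit
    return group, bulbar, total

def meets_depression_cutoff(phq8_total):
    # inclusion rule for the depression cohort: PHQ-8 total (0-24) >= 15
    return phq8_total >= 15
```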
4. Methods

Our procedure is divided into the following stages: (1) feature extraction, (2) preprocessing, (3) age-correction and sex-normalization, (4) redundancy and effect size analysis, and finally (5) classification (binary and multi-class) and evaluation.

4.1. Multimodal Metrics Extraction

In this and the following sections, we use the following terminology: Metric denotes a speech or facial metric in general, and Feature denotes a specific combination of a metric extracted from a certain task, e.g. speaking rate for the SIT task.

Both speech and facial metrics were extracted from the audiovisual recordings (overview in Table 2). To extract facial metrics, we used the MediaPipe FaceMesh software10. More specifically, MediaPipe's Face Detection is based on BlazeFace [28] and determines the (x, y)-coordinates of the face for every frame. Subsequently, 468 facial landmarks are identified using MediaPipe FaceMesh. We selected 14 key landmarks to compute functionals of facial behavior. Distances between landmarks were normalized by dividing them by the intercaruncular distance. For between- as well as within-subject analyses, when the same position relative to the camera cannot be assumed, Roesler et al. [29] found this to be the most reliable method of normalization. More details and a visual depiction of the landmarks used to calculate facial features can be found in [4]. Speech metrics were computed using Praat [30] and cover different domains, such as energy, timing, voice quality and frequency.

Table 2: Overview of speech and facial metrics.

Audio
  Energy:           signal-to-noise ratio (SNR, dB)
  Timing:           speaking & articulation duration/rate (sec./WPM), percent pause time (PPT, %), canonical timing agreement (CTA, %)
  Specific to DDK:  cycle-to-cycle temporal variability (cTV, sec.), syllable rate (syl./sec.), number of syllables
  Voice quality:    shimmer (%), harmonics-to-noise ratio (HNR, dB), jitter (%)
  Frequency:        mean, min, max & standard deviation (stdev) of fundamental frequency (F0, Hz)
Video
  Jaw:              mean, min & max speed/acceleration/jerk of the jaw center (JC)
  Lower Lip:        mean, min & max speed/acceleration/jerk of the lower lip (LL)
  Mouth:            mean & max lip aperture, lip width, mouth surface area; mean mouth symmetry ratio
  Eyes:             mean & max eye opening

4.2. Preprocessing

We applied the following approach to handle missing data, which can occur for a number of reasons, including incomplete sessions, technical issues, or network problems. First, on the session level, we removed participant sessions that had more than 15% missing features. Then, on the feature level, we filtered out features with more than 10% missing values. These thresholds were determined empirically. After these removal procedures, we imputed remaining missing values with mean feature values for the respective cohort, in train and test sets separately.

4.3. Age-Correction & Sex-Normalization

Similar to the approach in Falahati et al. [31], we applied a linear correction algorithm to both patient and control data based on age-related changes in the HC cohort. By calculating age trends and coefficients on healthy controls, we aim to obtain the most accurate estimate of purely age-related changes without the confounding effects of disease-related influences. In detail, for each feature, we fit a linear regression model with age as the independent and the feature as the dependent variable, modeling the age-related changes as a linear deviation. This is done separately for males and females to obtain a sex-specific result. Then, the sex-specific regression coefficients are used to correct feature values for age by subtracting the product of coefficient and age from the feature value for each participant.

To account for sex-related differences, we applied sex-specific z-scoring to normalize the features. Z-normalization is a methodology that allows for the comparison or compilation of observations from different cohorts [32]. In addition, the normalization process ensures the comparability of features on different scales by centering the feature distributions around zero with a standard deviation of one. First, the dataset to analyze was divided into male and female participants. Then, each feature was normalized within each sex group using z-scoring.

4.4. Redundancy Analysis and Effect Sizes

To identify collinear features and reduce the high-dimensional feature space, we performed hierarchical clustering on the Spearman rank-order correlations using the age-corrected and sex-normalized larger healthy control dataset. We applied the clustering separately for speech and facial features. The clustering procedure is motivated by the approach in Ienco and Meo [33]. It is based on Ward's method [34], which aims at minimising within-cluster variance. We implemented it using the scikit-learn library11. A dendrogram was plotted to inspect the correlations between features visually and to determine a suitable distance threshold for generating feature clusters. The threshold choice was based on two major factors: (a) balance between speech and facial clusters, as we target roughly an equal number to avoid predominance of one modality over the other, and (b) expert knowledge about the different task and feature domains (e.g. timing versus voice quality features, jaw versus eye movement, or read versus free speech), which resulted in the clusters shown in Table 3 and Table 4. The clusters are used in the feature selection process as described in Section 4.5.

Table 3: Speech feature clusters identified by hierarchical clustering.

 #  Cluster domain                       Metrics                            Tasks                              # Features
 1  Energy                               SNR                                all                                  8
 2  Timing alignment                     CTA                                all                                  6
 3  Timing, pauses                       PPT                                all                                  5
 4  Timing, speaking (1)                 articulation/speaking duration     Picture Description                  2
 5  DDK articulation                     SNR, syl. rate, syl. count & cTV   DDK                                  4
 6  Timing, speaking (2)                 articulation/speaking rate/time    SIT_{5,9}                            8
 7  Timing, speaking (3)                 articulation/speaking rate/time    SIT_{7,11,13,15}, Reading passage   21
 8  DDK voice quality                    HNR, jitter & shimmer              DDK                                  3
 9  Voice quality (periodicity)          HNR                                all except DDK                       8
10  Voice quality (amplitude variation)  shimmer                            all except DDK                       8
11  Voice quality (frequency variation)  jitter                             all except DDK                       8
12  Frequency (mean, min)                min & mean F0                      all                                 16
13  Frequency (max, std)                 max & std F0                       all                                 16
    Total                                                                                                      113

Statistical tests to assess the statistical significance, as well as the magnitude and direction of effects for a given comparison, were conducted within classification folds and as part of a post hoc analysis. Effect sizes were calculated using Glass's Delta [35]. Here, only features showing statistical significance (p < 0.05) in the Mann-Whitney U-test (MWU) were considered.

10 https://google.github.io/mediapipe/
11 https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance_multicollinear.html
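The age-correction and sex-specific z-scoring of Section 4.3 can be sketched as follows: the age slope is estimated on healthy controls only (per sex), subtracted from everyone, and the residual is z-scored within each sex group. Function and argument names are ours, and the paper's exact implementation may differ; this is a minimal sketch of the procedure as described.

```python
import numpy as np

def age_correct_and_normalize(values, ages, sexes, is_control):
    """Per sex: fit feature ~ age on healthy controls only, remove the
    age trend from everyone, then z-score within the sex group."""
    values = np.asarray(values, dtype=float)
    ages = np.asarray(ages, dtype=float)
    sexes = np.asarray(sexes)
    is_control = np.asarray(is_control, dtype=bool)
    out = np.empty_like(values)
    for sex in np.unique(sexes):
        m = sexes == sex
        hc = m & is_control
        slope, _ = np.polyfit(ages[hc], values[hc], 1)  # age coefficient on HCs only
        corrected = values[m] - slope * ages[m]          # subtract the age effect
        out[m] = (corrected - corrected.mean()) / corrected.std()  # sex-specific z-score
    return out
```

Fitting the slope on controls only is what keeps disease-related change out of the estimated age trend.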
4.5. Classification

For both the binary and multi-class classification experiments, we used a multilayer perceptron (MLP), implemented using the scikit-learn library. The MLP has one hidden layer. We experimented with adding more hidden layers, but found that the minimal configuration with only one layer was beneficial in terms of performance. The hidden layer size h was determined dynamically as

    h = (f + c) / 2    (1)

where f is the number of selected features and c the number of classes. The model was trained with a maximum of 10,000 iterations to allow sufficient time for convergence during training. Model training was stopped when the loss or score was not improving by a defined tolerance threshold; here, we used scikit-learn's default of 1e-4. Additionally, the alpha parameter was set to 0.001, controlling the regularization strength to prevent overfitting. The sgd (stochastic gradient descent) solver was used for optimization during training. The batch size was set to auto, enabling the model to determine the appropriate batch size during training. We used the rectified linear unit function as the activation function.

Table 4: Facial feature clusters identified by hierarchical clustering. RP: reading passage.

 #  Cluster domain    Metrics                                      Tasks                                      # Features
 1  Lip movement (1)  speed, acc. & jerk measures                  all except DDK                              95
 2  Lip width         mean & max lip width                         all                                         18
 3  Mouth opening     mean & max lip aperture, mouth surface area  all                                         36
 4  Lip movement (2)  speed, acc. & jerk metrics                   DDK                                         12
 5  Jaw movement (1)  speed, acc. & jerk metrics                   DDK                                         12
 6  Jaw movement (2)  speed, acc. & jerk metrics                   SIT_7                                       12
 7  Jaw movement (3)  speed, acc. & jerk metrics                   SIT_5                                       12
 8  Jaw movement (4)  min & max speed, acc. & jerk metrics         Picture Description                          9
 9  Jaw movement (5)  speed, acc. & jerk metrics                   SIT_{9,11,13,15}, RP, Picture Description   63
10  Mouth symmetry    mean mouth symmetry                          all                                          9
11  Eye opening       mean and max eye opening                     all                                         18
    Total                                                                                                     296

Ten-fold cross-validation was applied for evaluation in order to maximize the utilization of data for both training and testing purposes. To avoid bias towards the majority group, we created datasets that consist of an equal number of samples in each disease condition. For each individual participant, we consider, if available, the first two sessions as data points. Because of the equality constraint, the number of data points was limited by the smallest dataset (schizophrenia). This resulted in 72 randomly selected data points per cohort, summing up to a total of 360 data points. The classification experiments are run ten times to smooth out performance variations and obtain more representative results. We split the data using scikit-learn's StratifiedGroupKFold to make sure that sessions from the same participant are either in the respective training or testing fold. In each fold, we imputed missing values and standardized features by sex using z-scoring. This was done separately for training and test sets.

As a benchmark, we evaluated binary classification performance of models aimed at distinguishing cases with a disorder from controls. Here, for each cluster of collinear features as described in Section 4.4, the one with the highest effect size was selected for the final feature set as input to the classifier. If no feature showed statistically significant differences between cases and controls in a given cluster, no feature was selected. Hence, the number of clusters determines the maximum number of features fed into the classifier. Statistical significance and effect sizes for each feature were calculated as described in the previous section.

In a second step, we performed 4-class classification, incorporating all the investigated neurological and mental disorders. Here, feature selection was done based on pairwise comparisons of all disease cohorts (e.g. Depression vs. Schizophrenia cases, Schizophrenia vs. BS ALS cases, BS ALS vs. Depression cases, and so on). We merged the selected features from these comparisons as input to the classifier. Therefore, multiple features from the same cluster could be included in one feature set. We allowed a certain amount of redundancy compared to the case-control baseline in order to account for the complexity associated with multiple comparisons. For both experiments, classification performance was evaluated in terms of F1 score, sensitivity, and specificity.
5. Results

5.1. Binary Classification Baseline

Table 5: Binary classification results. In each row, * marks the highest performance in terms of F1. HC: Healthy Controls, DEP: Depression, SCHIZ: Schizophrenia, SEN: Sensitivity, SP: Specificity.

Cohort          Speech F1   Facial F1   Speech + Facial
                                        F1      SEN    SP
DEP vs. HC      0.64        0.59        0.65*   0.65   0.65
SCHIZ vs. HC    0.82        0.64        0.83*   0.85   0.82
BP ALS vs. HC   0.54*       0.51        0.52    0.52   0.53
BS ALS vs. HC   0.84*       0.63        0.83    0.82   0.83

As can be seen in Table 5, we observe good performance in classifying controls versus BS ALS (speech features alone; F1-score: 0.84) and schizophrenia (combined speech and facial; F1-score: 0.83) cases, respectively. The binary classification of depression did not perform as well; however, it still surpassed the random chance baseline (combined speech and facial; F1-score: 0.65). The classifier struggled to distinguish controls from BP ALS cases, where we observed performance just above random chance across modalities. Furthermore, the performance with regard to sensitivity and specificity is relatively balanced across comparisons.

For depression and schizophrenia, combining speech and facial modalities resulted in improved classification performance compared to speech or facial features alone, as shown in Table 5. However, adding facial information did not enhance performance for the BP and BS ALS cohorts compared to utilizing speech features alone.

5.2. Multi-Class Classification

In the 4-class experiment aimed at discriminating between all investigated neurological and mental disorders, we achieve the best overall performance (F1-score: 0.64) by utilizing both speech and facial features, as shown in Table 6.

Table 6: Multi-class classification results. In each row, * marks the highest F1 score. HC: Healthy Controls, DEP: Depression, SCHIZ: Schizophrenia, SEN: Sensitivity, SP: Specificity.

Cohort    Speech F1   Facial F1   Speech + Facial
                                  F1      SEN    SP
SCHIZ     0.72*       0.53        0.72*   0.72   0.91
BP ALS    0.55        0.36        0.57*   0.57   0.86
BS ALS    0.62        0.47        0.64*   0.65   0.88
DEP       0.61        0.46        0.64*   0.64   0.88
Average   0.63        0.46        0.64*   0.65   0.88

Overall, the specificity (average: 0.88) for the disorders examined is considerably higher than the sensitivity (average: 0.65). This indicates that the classifier is more effective at avoiding false-positive results than at identifying true positives. In most cases, namely for BS ALS, BP ALS and depression, the per-class F1-score is highest when combining speech and facial features. There is no performance difference between using only speech or speech and facial features for identifying schizophrenia.

Figure 2 shows a confusion matrix that indicates the percentage of accurate class predictions and the classes with which they were confused.

[Figure 2: Normalized confusion matrix for 4-class classification. The x-axis shows the true labels, the y-axis the predicted ones.]

The model was most confident in detecting schizophrenia (72.22%), followed by BS ALS (64.58%) and depression (63.75%). The model faced its greatest challenge in accurately predicting BP ALS (57.22%), yet it still performs notably above chance in a 4-class classification scenario. BP ALS and depression cases were most often confused with each other. Schizophrenic patients were least often confused with other cohorts. Among the cases of BS ALS, the most frequent confusion occurred with BP ALS patients (16.11%).

The features that we identified as consistently chosen across classification folds (Table 7) are predominantly speech features from the timing, voice quality, and energy domains. In addition, two facial features are selected across folds, concerning the maximum lip width and the maximum absolute acceleration of jaw movements. We conducted a post hoc analysis of effect sizes between HC and cases with a disorder for these features to gain further insight into disorder-specific importance. Here, positive effect sizes represent feature values that are larger for cases with a disorder than for controls. Conversely, negative values represent larger feature values for controls than for cases with a disorder12. In schizophrenia, we find all of the features consistently selected across classification folds to be statistically significant when compared to HC. With respect to the other cohorts, the largest effects are shown for CTA (-1.44 for SIT_13) and speaking rate (-2.00 for RP). This shows that patients exhibit a lower CTA, a measure of phonetic alignment between their own speech and that of the virtual guide, while speaking slower. We also observed a smaller average lip width as an important feature that shows the largest effect between HC and depression cases compared to the other cohorts. This may be associated with decreased emotional expressivity, as indicated by reduced smiling and increased frowning. These findings align with previous studies highlighting similar patterns of emotional expressiveness in depression [37, 38]. Few and small differences compared to controls are revealed for BP ALS cases. This is also the cohort with the lowest performance across classification experiments. In BS ALS, we found the largest effects for SNR and speaking rate. Another feature that stood out is cTV in the DDK task, a measure that captures the temporal variability, i.e. the consistency or irregularity in the timing of speech patterns, between consecutive cycles of speech.

12 We follow the commonly used effect size magnitude thresholds as suggested in Cohen [36] – small: 0.2-0.5, medium: 0.5-0.8, and large: > 0.8.
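The per-cluster feature selection behind these effect-size analyses (Sections 4.4-4.5) combines a Mann-Whitney U-test gate with Glass's Delta, i.e. the mean difference scaled by the control-group standard deviation. A hedged sketch, with hypothetical cluster and matrix layouts of our own choosing:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def glass_delta(cases, controls):
    # Glass's Delta: mean difference scaled by the control-group SD
    return (np.mean(cases) - np.mean(controls)) / np.std(controls, ddof=1)

def select_per_cluster(clusters, case_X, ctrl_X, alpha=0.05):
    """clusters maps a cluster name to column indices of collinear features.
    Per cluster, keep the single feature with the largest |Glass's Delta|
    among those passing the Mann-Whitney U-test at p < alpha (if any)."""
    selected = {}
    for name, cols in clusters.items():
        best, best_mag = None, 0.0
        for j in cols:
            p = mannwhitneyu(case_X[:, j], ctrl_X[:, j]).pvalue
            d = glass_delta(case_X[:, j], ctrl_X[:, j])
            if p < alpha and abs(d) > best_mag:
                best, best_mag = j, abs(d)
        if best is not None:  # a cluster with no significant feature contributes nothing
            selected[name] = best
    return selected
```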
disorder cases) Features Modality Cluster domain SCHIZ BP ALS BS ALS DEP max abs acc. JC (RP) Facial Jaw movement -0.51 - N.S - max lip width (SIT 11) Facial Lip width -0.35 - 0.31 -0.44 shimmer (DDK) Speech Voice quality 0.35 - -0.63 - shimmer (SIT 5) Speech Voice quality 0.97 - -0.31 - jitter (SIT 9) Speech Voice quality 0.43 -0.20 -0.48 0.26 CTA (SIT 13) Speech Timing alignment -1.44 - -1.16 -0.31 SNR (DDK) Speech Energy 1.88 - 2.43 - speaking rate (RP) Speech Timing, speaking -2.00 - -1.84 - speaking rate (SIT 7) Speech Timing, speaking -0.73 -0.31 -1.25 0.59 HNR (DDK) Speech Voice quality 1.01 - 0.86 -0.30 HNR (SIT 15) Speech Voice quality 0.94 - 0.75 - cTV (DDK) Speech Energy & articulation skills 0.39 - 1.82 0.43 Table 7 Features selected across all multi-class classification CV folds (considering the 4 disorders) and their effect sizes as calculated between the healthy control and disorder cohorts. In each row, we highlighted the largest effect size, which were only calculated in case of statistical significance. HC: healthy controls, SCHIZ: schizophrenia, BS: bulbar symptomatic, BP: bulbar pre-symptomatic, DEP: depression, JC: jaw center, RP: reading passage While many features are shared in terms of indicating That being said, we acknowledge the importance of a signal between cases with a disorder and controls, it contextualizing the promise of such multimodal method- is mostly the magnitude of the effect that differentiates ologies for differential diagnosis with several caveats. them, as well as how they combine. However, there are First, the performance of any machine learning classi- also a few features that show a different direction of fier trained for this purpose will depend on the specific effect across cohorts. For example, in BS ALS, compared conditions being studied and the range and heterogene- to other cohorts, we observed the largest effect for ity of symptoms presented in each case. 
6. Discussion

We explored the utility of speech and facial features extracted by a multimodal dialog system for the differential classification of ALS, depression and schizophrenia. Note that the idea here is not to replace clinicians, but to provide effective and assistive tools that can help improve their efficiency, speed and accuracy. Overall, combining speech and facial information proved to be beneficial for identifying several disorders in both multi-class and binary classification experiments. In addition, our automated feature analysis indicates several features that show relevance across experiments. While some of these features are intuitively identifiable by human experts as markers of a given disorder (for example, a slower speaking rate or a lower intelligibility), such an analysis also allows the discovery of other features that might be harder to detect or identify objectively by human experts, such as quicker facial movements.

For both BS ALS and schizophrenia, our analysis demonstrates a robust discriminatory capability to effectively distinguish these cohorts from healthy controls, as well as from other neurological and mental disorders, in binary and multi-class experiments. However, the overall higher specificity of the multi-class classifier implies a robust capability to accurately identify non-cases, effectively minimizing false positives. Yet, the lower sensitivity suggests limitations in the identification of true cases for the analyzed disorders, likely due to the imposed strong restrictions. In BS ALS, speech features alone demonstrate superior performance when comparing this group with controls; yet, in the more intricate task of differential diagnosis, performance improves when speech features are combined with facial information. For schizophrenia, the combination of speech and facial modalities proves most effective in both binary and multi-class experiments. In contrast, BP ALS, which does not present with as many speech and facial motor deficits, is much less separable even in binary classification, let alone in the multi-class classification context, highlighting the challenging nature of detecting this condition. Furthermore, the classifier most frequently categorized the misidentified BS ALS cases as BP ALS. Although distinguishing BP ALS cases from controls is challenging, this outcome indicates that the classifier may be able to capture condition-specific information from features that are shared across different stages of ALS, which may have led to this confusion. Finally, in evaluating depression, the best performance in both binary and multi-class classification experiments is achieved by combining speech and facial information. The overall accuracy in discerning depression from the other cohorts is notably lower compared to schizophrenia or BS ALS; the variability introduced by the wide range and time horizon of potential symptoms present in depression, as well as medication status, might contribute to this lower differential diagnosis accuracy.

That being said, we acknowledge the importance of contextualizing the promise of such multimodal methodologies for differential diagnosis with several caveats. First, the performance of any machine learning classifier trained for this purpose will depend on the specific conditions being studied and the range and heterogeneity of symptoms presented in each case. For example, in this study we investigated four specific conditions – schizophrenia, depression, bulbar symptomatic (BS) and bulbar presymptomatic (BP) ALS – and we observed that schizophrenia (where the facial modality is particularly good at capturing characteristics exhibited therein, such as anhedonia and blunted affect) and BS ALS (which is characterized by speech motor deficits, reflected in the timing, rate and intelligibility of speech), being quite different in terms of symptom presentation, exhibit greater separability relative to other classes for differential classification. Relatedly, a significant limitation of the present study is the lack of information about co-morbidities to factor into our analysis, since the datasets were collected independently. Future research will aim to explicitly address this gap by capturing, for instance, information about co-morbid depression in ALS or schizophrenia (e.g., through PHQ-8 scales), which might help us better stratify these cohorts.

Second, this study focused on a restricted set of tasks, primarily reading and picture description assessments. These task-feature combinations alone may not fully capture the nuances of each disorder.

Third, while we focused on interpretable features in this study, less interpretable ones, such as log mel spectrograms or Mel-frequency cepstral coefficients (MFCCs), may be able to capture more nuanced and complex patterns in the data. Additionally, more sophisticated deep learning approaches for representation learning could be applied, such as ResNet-50 [39] in the facial modality. While such features can be powerful in capturing subtle details and nuances of audiovisual behavior, the inner workings of the resulting deep learning models are not easily explainable or interpretable by non-experts.

Fourth, our sample size is not representative enough to truly claim generalizability of findings. The smaller the sample, the larger the risk of model “blind spots” that in turn lead to variable estimates of true model performance on unseen real-world data, giving algorithm designers an inaccurate sense of how well a model is performing during development [40].

Our results argue for the importance of a hybrid approach to differential diagnosis going forward, combining knowledge-driven and data-driven approaches. Understanding specific disease pathologies and symptoms can in turn help in developing features and learning methodologies that lead to better separability of disease conditions. Future work will also focus on improving differential diagnosis performance in a manner that is both generalizable and explainable.

Acknowledgments

This work was funded in part by the National Institutes of Health grant R42DC019877. We thank all study participants for their time, and we gratefully acknowledge the contribution of the Peter Cohen Foundation and EverythingALS towards participant recruitment and data collection for the ALS corpus, and of Anzalee Khan and Jean-Pierre Lindenmayer at the Manhattan Psychiatric Center – Nathan Kline Institute for the schizophrenia corpus.

References

[1] V. Feigin, E. Nichols, T. Alam, M. Bannick, E. Beghi, N. Blake, W. Culpepper, E. Dorsey, A. Elbaz, R. Ellenbogen, J. Fisher, C. Fitzmaurice, G. Giussani, L. Glennie, S. James, C. Johnson, N. Kassebaum, G. Logroscino, B. Marin, T. Vos, Global, regional, and national burden of neurological disorders, 1990-2016: a systematic analysis for the Global Burden of Disease Study 2016, The Lancet Neurology 18 (2019) 459–480. doi:10.1016/S1474-4422(18)30499-X.
[2] V. Ramanarayanan, A. C. Lammert, H. P. Rowe, T. F. Quatieri, J. R. Green, Speech as a biomarker: Opportunities, interpretability, and challenges, Perspectives of the ASHA Special Interest Groups 7 (2022) 276–283.
[3] M. Neumann, O. Roesler, J. Liscombe, H. Kothare, D. Suendermann-Oeft, J. D. Berry, E. Fraenkel, R. Norel, A. Anvar, I. Navar, A. V. Sherman, J. R. Green, V. Ramanarayanan, Multimodal dialog based speech and facial biomarkers capture differential disease progression rates for ALS remote patient monitoring, in: Proceedings of the 32nd International Symposium on Amyotrophic Lateral Sclerosis and Motor Neuron Disease, Virtual, 2021.
[4] V. Richter, J. Cohen, M. Neumann, D. Black, A. Haq, J. Wright-Berryman, V. Ramanarayanan, A multimodal dialog approach to mental state characterization in clinically depressed, anxious, and suicidal populations, Frontiers in Psychology 14 (2023). URL: https://www.frontiersin.org/articles/10.3389/fpsyg.2023.1135469. doi:10.3389/fpsyg.2023.1135469.
[5] M. Neumann, O. Roesler, J. Liscombe, H. Kothare, D. Suendermann-Oeft, D. Pautler, I. Navar, A. Anvar, J. Kumm, R. Norel, E. Fraenkel, A. Sherman, J. Berry, G. Pattee, J. Wang, J. Green, V. Ramanarayanan, Investigating the utility of multimodal conversational technology and audiovisual analytic measures for the assessment and monitoring of amyotrophic lateral sclerosis at scale, in: Proceedings of Interspeech, 2021, pp. 4783–4787. doi:10.21437/Interspeech.2021-1801.
[6] V. Richter, M. Neumann, H. Kothare, O. Roesler, J. Liscombe, D. Suendermann-Oeft, S. Prokop, A. Khan, C. Yavorsky, J.-P. Lindenmayer, V. Ramanarayanan, Towards multimodal dialog-based speech & facial biomarkers of schizophrenia, in: Companion Publication of the 2022 International Conference on Multimodal Interaction, ICMI ’22 Companion, Association for Computing Machinery, New York, NY, USA, 2022, pp. 171–176. URL: https://doi.org/10.1145/3536220.3558075. doi:10.1145/3536220.3558075.
[7] H. Kothare, M. Neumann, J. Liscombe, O. Roesler, W. Burke, A. Exner, S. Snyder, A. Cornish, D. Habberstad, D. Pautler, D. Suendermann-Oeft, J. Huber, V. Ramanarayanan, Statistical and clinical utility of multimodal dialogue-based speech and facial metrics for Parkinson’s disease assessment, in: Proceedings of Interspeech, 2022, pp. 3658–3662. doi:10.21437/Interspeech.2022-11048.
[8] N. Cummins, J. Joshi, A. Dhall, V. Sethu, R. Goecke, J. Epps, Diagnosis of depression by behavioural signals: A multimodal approach, in: Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge, AVEC ’13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 11–20. URL: https://doi.org/10.1145/2512530.2512535. doi:10.1145/2512530.2512535.
[9] J. Robin, M. Xu, A. Balagopalan, J. Novikova, L. Kahn, A. Oday, M. Hejrati, S. Hashemifar, M. Negahdar, W. Simpson, E. Teng, Automated detection of progressive speech changes in early Alzheimer’s disease, Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring 15 (2023) e12445. doi:10.1002/dad2.12445.
[10] J. Hlavnička, R. Čmejla, T. Tykalová, K. Šonka, E. Růžička, J. Rusz, Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behaviour disorder, Scientific Reports 7 (2017). URL: https://api.semanticscholar.org/CorpusID:19272861.
[11] G. Stegmann, S. Charles, J. Liss, J. Shefner, S. Rutkove, V. Berisha, A speech-based prognostic model for dysarthria progression in ALS, Amyotrophic Lateral Sclerosis & Frontotemporal Degeneration (2023) 1–6. doi:10.1080/21678421.2023.2222144, advance online publication.
[12] J. R. Green, K. M. Allison, C. Cordella, B. D. Richburg, G. L. Pattee, J. D. Berry, E. A. Macklin, E. P. Pioro, R. A. Smith, Additional evidence for a therapeutic effect of dextromethorphan/quinidine on bulbar motor function in patients with amyotrophic lateral sclerosis: A quantitative speech analysis, British Journal of Clinical Pharmacology 84 (2018) 2849–2856.
[13] T. Altaf, S. M. Anwar, N. Gul, M. N. Majeed, M. Majid, Multi-class Alzheimer’s disease classification using image and clinical features, Biomedical Signal Processing and Control 43 (2018) 64–74. URL: https://www.sciencedirect.com/science/article/pii/S1746809418300508. doi:10.1016/j.bspc.2018.02.019.
[14] L. Hansen, R. Rocca, A. Simonsen, et al., Speech- and text-based classification of neuropsychiatric conditions in a multidiagnostic setting, Nature Mental Health (2023). doi:10.1038/s44220-023-00152-7.
[15] I. E. Emre, Ç. Erol, C. Taş, N. Tarhan, Multi-class classification model for psychiatric disorder discrimination, International Journal of Medical Informatics 170 (2023) 104926. URL: https://www.sciencedirect.com/science/article/pii/S1386505622002404. doi:10.1016/j.ijmedinf.2022.104926.
[16] D. Suendermann-Oeft, A. Robinson, A. Cornish, D. Habberstad, D. Pautler, D. Schnelle-Walka, F. Haller, J. Liscombe, M. Neumann, M. Merrill, O. Roesler, R. Geffarth, NEMSI: A multimodal dialog system for screening of neurological or mental conditions, in: Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, IVA ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 245–247. URL: https://doi.org/10.1145/3308532.3329415. doi:10.1145/3308532.3329415.
[17] A. K. Silbergleit, A. F. Johnson, B. H. Jacobson, Acoustic analysis of voice in individuals with amyotrophic lateral sclerosis and perceptually normal vocal quality, Journal of Voice 11 (1997) 222–231.
[18] B. Tomik, R. J. Guiloff, Dysarthria in amyotrophic lateral sclerosis: A review, Amyotrophic Lateral Sclerosis 11 (2010) 4–15.
[19] M. Novotny, J. Melechovsky, K. Rozenstoks, T. Tykalova, P. Kryze, M. Kanok, J. Klempir, J. Rusz, Comparison of automated acoustic methods for oral diadochokinesis assessment in amyotrophic lateral sclerosis, Journal of Speech, Language, and Hearing Research 63 (2020) 3453–3460. doi:10.1044/2020_JSLHR-20-00109.
[20] P. Buckley, B. Miller, D. Lehrer, D. Castle, Psychiatric comorbidities and schizophrenia, Schizophrenia Bulletin 35 (2008) 383–402. doi:10.1093/schbul/sbn135.
[21] A. I. Green, C. M. Canuso, M. J. Brenner, J. D. Wojcik, Detection and management of comorbidity in patients with schizophrenia, Psychiatric Clinics 26 (2003) 115–139.
[22] G. B. Cassano, S. Pini, M. Saettoni, P. Rucci, L. Dell’Osso, Occurrence and clinical correlates of psychiatric comorbidity in patients with psychotic disorders, Journal of Clinical Psychiatry 59 (1998) 60–68.
[23] S. Körner, K. Kollewe, J. Ilsemann, A. Karch, R. Dengler, K. Krampfl, S. Petri, Prevalence and prognostic impact of comorbidities in amyotrophic lateral sclerosis, European Journal of Neurology 20 (2012). doi:10.1111/ene.12015.
[24] K. Diekmann, M. Kuźma-Kozakiewicz, M. Piotrkiewicz, M. Gromicho, J. Grosskreutz, P. M. Andersen, M. de Carvalho, H. Uysal, A. Osmanovic, O. Schreiber-Katz, S. Petri, S. Körner, Impact of comorbidities and co-medication on disease onset and progression in a large German ALS patient group, Journal of Neurology 267 (2020). doi:10.1007/s00415-020-09799-z.
[25] M. E. Heidari, J. Nadali, A. Parouhan, M. Azarafraz, S. M. Tabatabai, S. S. N. Irvani, F. Eskandari, A. Gharebaghi, Prevalence of depression among amyotrophic lateral sclerosis (ALS) patients: A systematic review and meta-analysis, Journal of Affective Disorders 287 (2021) 182–190. doi:10.1016/j.jad.2021.03.015.
[26] J. M. Cedarbaum, N. Stambler, E. Malta, C. Fuller, D. Hilt, B. Thurmond, A. Nakanishi, The ALSFRS-R: a revised ALS functional rating scale that incorporates assessments of respiratory function, Journal of the Neurological Sciences 169 (1999) 13–21. URL: https://www.sciencedirect.com/science/article/pii/S0022510X99002105. doi:10.1016/S0022-510X(99)00210-5.
[27] K. Kroenke, T. W. Strine, R. L. Spitzer, J. B. Williams, J. T. Berry, A. H. Mokdad, The PHQ-8 as a measure of current depression in the general population, Journal of Affective Disorders 114 (2009) 163–173. URL: https://www.sciencedirect.com/science/article/pii/S0165032708002826. doi:10.1016/j.jad.2008.06.026.
[28] V. Bazarevsky, Y. Kartynnik, A. Vakunov, K. Raveendran, M. Grundmann, BlazeFace: Sub-millisecond neural face detection on mobile GPUs, CoRR abs/1907.05047 (2019). URL: http://arxiv.org/abs/1907.05047. arXiv:1907.05047.
[29] O. Roesler, H. Kothare, W. Burke, M. Neumann, J. Liscombe, A. Cornish, D. Habberstad, D. Pautler, D. Suendermann-Oeft, V. Ramanarayanan, Exploring facial metric normalization for within- and between-subject comparisons in a multimodal health monitoring agent, in: Companion Publication of the 2022 International Conference on Multimodal Interaction, ICMI ’22 Companion, Association for Computing Machinery, New York, NY, USA, 2022, pp. 160–165. URL: https://doi.org/10.1145/3536220.3558071. doi:10.1145/3536220.3558071.
[30] P. Boersma, V. Van Heuven, Speak and unSpeak with Praat, Glot International 5 (2001) 341–347.
[31] F. Falahati, D. Ferreira, J.-S. Muehlboeck, H. Soininen, P. Mecocci, B. Vellas, M. Tsolaki, I. Kłoszewska, C. Spenger, S. Lovestone, M. Eriksdotter, L.-O. Wahlund, A. Simmons, E. Westman, The effect of age correction on multivariate classification in Alzheimer’s disease, with a focus on the characteristics of incorrectly and correctly classified subjects, Brain Topography (2016). doi:10.1007/s10548-015-0455-1.
[32] J.-P. Guilloux, M. Seney, N. Edgar, E. Sibille, Integrated behavioral z-scoring increases the sensitivity and reliability of behavioral phenotyping in mice: Relevance to emotionality and sex, Journal of Neuroscience Methods 197 (2011) 21–31. doi:10.1016/j.jneumeth.2011.01.019.
[33] D. Ienco, R. Meo, Exploration and reduction of the feature space by hierarchical clustering, in: Proceedings of the 2008 SIAM International Conference on Data Mining, SIAM, 2008, pp. 577–587.
[34] J. H. Ward, Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association 58 (1963) 236–244.
[35] K. Hopkins, G. Glass, Basic Statistics for the Behavioral Sciences, Prentice-Hall, Englewood Cliffs, NJ, 1978.
[36] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed., Lawrence Erlbaum Associates, Hillsdale, NJ, 1988.
[37] S. Scherer, G. Stratou, G. Lucas, M. Mahmoud, J. Boberg, J. Gratch, L.-P. Morency, Automatic audiovisual behavior descriptors for psychological disorder analysis, Image and Vision Computing 32 (2014) 648–658.
[38] S. Sorg, C. Vögele, N. Furka, A. Meyer, Perseverative thinking in depression and anxiety, Frontiers in Psychology 3 (2012). URL: https://www.frontiersin.org/articles/10.3389/fpsyg.2012.00020. doi:10.3389/fpsyg.2012.00020.
[39] B. Li, D. Lima, Facial expression recognition via ResNet-50, International Journal of Cognitive Computing in Engineering 2 (2021). doi:10.1016/j.ijcce.2021.02.002.
[40] V. Berisha, C. Krantsevich, P. R. Hahn, S. Hahn, G. Dasarathy, P. Turaga, J. Liss, Digital medicine and the curse of dimensionality, npj Digital Medicine 4 (2021) 153.
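As a closing note on the voice-quality features recurring in the analysis (shimmer, jitter): both are cycle-to-cycle perturbation measures of the glottal cycle. A minimal sketch of the standard "local" variants, assuming period durations and per-cycle peak amplitudes have already been extracted from the recording (the study itself used Praat [30] for feature extraction; the helper names here are our own illustration):

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute difference of consecutive glottal periods,
    relative to the mean period (often reported in percent)."""
    p = np.asarray(periods, dtype=float)
    return float(np.mean(np.abs(np.diff(p))) / p.mean())

def local_shimmer(amplitudes):
    """Mean absolute difference of consecutive cycle peak amplitudes,
    relative to the mean amplitude."""
    a = np.asarray(amplitudes, dtype=float)
    return float(np.mean(np.abs(np.diff(a))) / a.mean())
```

Under the sign convention of Table 7 (disorder minus control), a negative shimmer effect would indicate smaller cycle-to-cycle amplitude variation in the disorder cohort than in controls on that task.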