<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Clinical Psychiatry</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3536220</article-id>
      <title-group>
        <article-title>Towards Remote Differential Diagnosis of Mental and Neurological Disorders using Automatically Extracted Speech and Facial Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vanessa Richter</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Neumann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vikram Ramanarayanan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Modality.AI, Inc.</institution>
          ,
          <addr-line>San Francisco, CA 94105</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California</institution>
          ,
          <addr-line>San Francisco, CA 94127</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>59</volume>
      <issue>1998</issue>
      <fpage>160</fpage>
      <lpage>165</lpage>
      <abstract>
        <p>Utilizing computer vision and speech signal processing to assess neurological and mental conditions remotely has the potential to help detect diseases or monitor their progression earlier and more accurately. Multimodal features have demonstrated usefulness in distinguishing cases with a disorder from controls across several health conditions. However, challenges arise in distinguishing between specific disorders during the process of differential diagnosis, where shared characteristics among different disorders may complicate accurate classification. Our aim in this study was to evaluate the utility and accuracy of automatically extracted speech and facial features for differentiating between multiple disorders in a multi-class (differential diagnosis) setting using a machine learning classifier. We use datasets comprising people with depression, bulbar and limb onset amyotrophic lateral sclerosis (ALS), and schizophrenia, in addition to healthy controls. The data was collected in a real-world scenario with a multimodal dialog system, where a virtual guide walked participants through a set of tasks that elicit speech and facial behavior. Our study demonstrates the utility of digital speech and facial biomarkers in assessing neurological and mental disorders for differential diagnosis. Furthermore, this research emphasizes the importance of combining information derived from multiple modalities for a more comprehensive understanding and classification of disorders.</p>
      </abstract>
      <kwd-group>
        <kwd>differential diagnosis</kwd>
        <kwd>multi-class</kwd>
        <kwd>mental disorders</kwd>
        <kwd>neurological disorders</kwd>
        <kwd>depression</kwd>
        <kwd>schizophrenia</kwd>
        <kwd>amyotrophic lateral sclerosis</kwd>
        <kwd>digital biomarkers</kwd>
        <kwd>dialog system</kwd>
        <kwd>speech</kwd>
        <kwd>facial</kwd>
        <kwd>multimodal</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>One out of eight individuals in the world lives with a</title>
        <p>
          mental health disorder, but most people do not have
access to efective care. 1 Moreover, disorders of the nervous
system are the second leading cause of death globally [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>The development of clinically valid digital biomarkers</title>
        <p>for neurological and mental disorders that can be
automatically extracted could significantly improve patients’
lives. This advancement has the potential to assist
clinicians in achieving quicker and more reliable diagnoses by
providing fast and objective insights into a patient’s state.</p>
      </sec>
      <sec id="sec-1-3">
        <title>Note that the idea here is not to replace the clinician,</title>
        <p>but to provide efective and assistive tools that can help
improve his/her eficiency, speed and accuracy.</p>
      </sec>
      <sec id="sec-1-4">
        <title>Many speech and facial features have shown to be</title>
        <p>
          useful in diferentiating between diferent mental and
neurological disorders and healthy controls (HCs) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-5">
        <title>However, it remains unclear how distinctly these fea</title>
        <p>
          tures characterize a given disorder. For example, percent
pause time (PPT) has been found to difer significantly
between people with ALS (pALS) and HCs [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] as well as
between people with depression symptoms and HCs [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-6">
        <title>Furthermore, a slower speaking rate diferentiates pALS</title>
        <p>
          [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] as well as people with schizophrenia [6] from HC.
To assess the utility of automatically computed digital
biomarkers to capture specific disease attributes despite
such shared characteristics, we aim to answer the
following questions:
        </p>
      </sec>
      <sec id="sec-1-7">
        <title>1. How accurately can a machine learning (ML)</title>
        <p>
          classifier diferentially distinguish between
multiple disorders – depression, schizophrenia,
bulbar symptomatic ALS and bulbar presymptomatic
ALS?
2. Which modalities and features are most useful for
this multi-class classification task – overall and
with respect to a given disorder – and how does
that compare to a binary classification baseline
(controls versus cases in each of the investigated
health conditions)?
Recently, digital speech and facial features have been
shown to yield statistically significant diferences
between cases with neurological or mental disorders and
healthy controls, exhibit high specificity and
sensitivity in discriminatory ability between those groups, or,
a high potential for disease progression and treatment
efect monitoring [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3, 6, 7, 8, 9, 10, 11, 12</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-8">
        <title>Several studies have evaluated the detection of neuro</title>
        <p>logical and mental disorders in multi-class
classification settings as compared to binary case-control studies
[13, 14, 15]. Altaf et al. [13] introduced an algorithm for</p>
      </sec>
      <sec id="sec-1-9">
        <title>Alzheimer’s disease (AD) detection validated on binary</title>
        <p>classification and multi-class classification of AD, normal
and mild cognitive impairment (MCI). Using the bag of
visual word approach, the algorithm enhances texture- Figure 1: Overview of feature extraction and dataset creation.
based features like the gray level co-occurrence matrix.</p>
      </sec>
      <sec id="sec-1-10">
        <title>It integrates clinical data, creating a hybrid feature vec</title>
        <p>tor from whole magnetic resonance (MR) brain images. interview environment and data collection. Each session
They use the Alzheimer’s Disease Neuro-imaging Initia- starts with a microphone, speaker, and camera check to
tive dataset (ADNI) and achieve 98.4% accuracy in binary ensure that the participant has given their device the
AD versus normal classification and 79.8% accuracy in permission to access camera and microphone, is able to
multi-class AD, normal, and MCI classification. hear the instructions and the captured signal is of
adeFurthermore, Hansen et al. [14] explored the poten- quate quality. After these tests the virtual guide involves
tial of speech patterns as diagnostic markers for mul- participants in a structured conversation that consists of
tiple neuropsychiatric conditions by examining record- exercises (speaking tasks, open-ended questions, motor
ings from 420 participants with major depressive disor- abilities) to elicit speech, facial and motor behaviors
relder, schizophrenia, autism spectrum disorder, and non- evant to the type of disease being studied. In this work,
psychiatric controls. Various models were trained and we focus on tasks that were shared across multiple study
tested for both binary and multi-class classification tasks protocols for diferent disease conditions: (a) sentence
inusing speech and text features. While binary classifica- telligibility test (SIT), (b) diadochokinesis (DDK), (c) read
tion models exhibited comparable performance to prior speech, and (d) a picture description task. For (a),
parresearch (F1: 0.54–0.92), multi-class classification showed ticipants were asked to read individual SIT sentences of
a notable decrease in performance (F1: 0.35–0.75). The varying lengths (5-15 words2), while (b) required reading
study further demonstrates that combining voice- and a longer passage (Bamboo reading passage, 99 words). To
text-based models enhances overall performance by 9.4% assess DDK skills (c), participants were asked to repeat a
F1 macro, highlighting the potential of a multimodal pattern of syllables (/pa ta ka/) as fast as they can until
approach for more accurate neuropsychiatric condition they run out of breath and (d) prompted users to describe
classification While these studies show the efectiveness a scene in a picture that was shown to them on screen.
of diferent types of speech- and facial-derived features These tasks are inspired by previous work [17, 18, 19].
for assessing psychiatric conditions in diferential
diagnosis settings, none of them utilized ’in-the-wild‘ data
collected remotely from participants devices with a mul- 3.1. Datasets
timodal dialog system.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Multimodal Dialog Platform and Data Collection</title>
      <sec id="sec-2-1">
        <title>Audiovisual data was collected using NEMSI (Neurologi</title>
        <p>cal and Mental health Screening Instrument) [16], a
multimodal dialog system for remote health assessments. An
overview of the dataset creation process is illustrated in
Figure 1. A virtual guide, Tina, led study participants
through various tasks that are designed to elicit speech,
facial, and motor behaviors. Having an interactive virtual
guide to elicit participants’ behavior allows for scalability
while providing a natural but controlled and objective</p>
      </sec>
      <sec id="sec-2-2">
        <title>An overview of the data used in this study is given in</title>
        <p>Table 1. While some datasets for a disease may be small,
there is a subset of tasks that are shared across research
studies. Since the data is collected in the same way
(remotely with a personal electronic device), we can
create a larger dataset for the healthy population across
studies to get a more accurate representation of the
properties of normative behavior. For the larger dataset
of healthy controls, we identify age-related trends as
well as collinerarity of features. This information is used
to correct control as well as patient feature values from</p>
      </sec>
      <sec id="sec-2-3">
        <title>2In the remainder of the paper, the diferent SIT sentence lengths</title>
        <p>are treated as separate tasks and are denoted as SIT_n, where n is
the length in words.</p>
        <sec id="sec-2-3-1">
          <title>Participants</title>
        </sec>
        <sec id="sec-2-3-2">
          <title>Sessions</title>
        </sec>
        <sec id="sec-2-3-3">
          <title>Mean Age (SD)</title>
          <p>Controls
Female 408 (63%) 655 (62.8%)
Male 240 (37%) 388 (37.2%)
All 648 1043
Schizophrenia
Female 10 (24.4%) 19 (26.4%)
Male 31 (75.6%) 53 (73.6%)
All 41 72
Depression
Female 66 (79.5%) 76 (79.2%)
Male 17 (20.5%) 20 (20.8%)
All 83 96
Bulbar Symptomatic ALS
Female 38 (48.1%) 67 (46.2%)
Male 41 (51.9%) 78 (53.8%)
All 79 145
Bulbar Presymptomatic ALS
Female 31 (50%) 54 (50.5%)
Male 31 (50%) 53 (49.5%)
All 62 107
46.3 (16.4)
46.2 (16.0)
46.3 (16.2)
age efects and remove feature redundancies.
3.1.1. Schizophrenia</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>Schizophrenia is a chronic brain disorder that afects</title>
        <p>approximately 24 million or 1 in 300 people (1 in 222
in adults)3 worldwide. According to the American
Psychiatric Association (APA), active schizophrenia may be
characterized by episodes in which the afected individual
cannot distinguish between real and unreal experiences.4
Among individuals with schizophrenia, psychiatric and
medical comorbidities such as substance abuse, anxiety
and depression are common [20, 21, 22]. Buckley et al.
pointed out that depression is estimated to afect half of
the patients. These comorbidities, as well as the variation
in symptoms and medications, make the identification of
multimodal biomarkers for schizophrenia a dificult task.</p>
      </sec>
      <sec id="sec-2-5">
        <title>As can be seen in Table 1, we assessed 41 individuals</title>
        <p>with a diagnosis of schizophrenia at a state psychiatric
facility in New York, NY. The study was approved by the
Nathan S. Kline Institute for Psychiatric Research and we
obtained written informed consent from all participants
at the time of screening after explaining details of the
study. The assessment of both patients and controls was
overseen by a psychiatrist.</p>
      </sec>
      <sec id="sec-2-6">
        <title>3https://www.who.int/news-room/fact-sheets/detail/</title>
        <p>schizophrenia, accessed 05/19/2023</p>
      </sec>
      <sec id="sec-2-7">
        <title>4https://www.psychiatry.org/patients-families/schizophrenia/</title>
        <p>what-is-schizophrenia, accessed 05/19/2023
ALS is a neurological disease that afects nerve cells in
the brain and spinal cord that control voluntary
muscle movement. The disease is progressive and there is
currently no cure or efective treatment to reverse its
progression.5. Global estimates of ALS prevalence range
from 1.9 to 6 per 100,000.6 Studies on ALS found
comorbidity with dementia, parkinsonism and depressive
symptoms [23]. Diekmann et al. [24] found depression
to occur statistically significantly more often in pALS
compared to HC. In addition, Heidari et al. [25] found
in a meta-analysis of 46 eligible studies that the pooled
prevalence of depression among individuals with ALS to
be 34%, with mild, moderate, and severe depression rates
at 29%, 16%, and 8%, respectively.</p>
          <p>As shown in Table 1, data from 79 ALS bulbar symptomatic (BS) and 62 ALS bulbar pre-symptomatic (BP) patients were collected in cooperation with EverythingALS and the Peter Cohen Foundation7. In addition to the assessment of speech and facial behavior, participants filled out the ALS Functional Rating Scale-Revised (ALSFRS-R), a standard instrument for monitoring the progression of ALS [26]. The questionnaire comprises 12 questions about physical ability, with each function’s rating ranging from normal function (score 4) to severe disability (score 0). It includes four scales for different domains affected by the disorder: bulbar system, fine and gross motor skills, and respiratory function. The ALSFRS-R score is the total of the domain sub-scores, the sum ranging from 0 to 48. For this study, pALS were stratified into the following sub-cohorts based on their bulbar subscore (the sum of the first three ALSFRS-R questions): (a) BS ALS with a bulbar subscore &lt; 12 and (b) BP ALS with a bulbar subscore = 12.</p>
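          <p>For illustration, this stratification rule can be written as a short sketch (the ALSFRS-R item column names q1–q3 are hypothetical):</p>
          <preformat>
# Sketch: stratify pALS into BS/BP sub-cohorts by the ALSFRS-R bulbar
# subscore (sum of the first three questions); column names hypothetical.
import pandas as pd

def stratify_als(alsfrs):
    bulbar = alsfrs[["q1", "q2", "q3"]].sum(axis=1)  # subscore in 0..12
    labels = ["BS" if score &lt; 12 else "BP" for score in bulbar]
    return pd.Series(labels, index=alsfrs.index, name="als_subcohort")
          </preformat>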
          <p>5: https://www.ninds.nih.gov/health-information/disorders/amyotrophic-lateral-sclerosis-als, accessed 05/19/2023</p>
          <p>6: https://www.targetals.org/2022/11/22/epidemiology-of-als-incidence-prevalence-and-clusters/, accessed 05/19/2023</p>
          <p>7: https://www.everythingals.org/research</p>
        </sec>
        <sec id="sec-2-5">
          <title>3.1.3. Depression</title>
          <p>Depression is a common mental health disorder characterized by persistent sadness and a lack of interest or pleasure in previously enjoyable activities. In addition, fatigue and poor concentration are common. The effects of depression can be long-lasting or recurrent and can drastically affect a person’s ability to lead a fulfilling life. The disorder is one of the most common causes of disability in the world.8 One in six people (16.6%) will experience depression at some point in their lifetime.9</p>
          <p>A well-established tool for assessing depression is the Patient Health Questionnaire (PHQ)-8 [27]. The PHQ-8 score ranges from 0 to 24 (a higher score indicates more severe depression symptoms). We investigated at least moderately severe depression cases, based on a cutoff of PHQ-8 ≥ 15. The data for this study, including the completion of the PHQ-8 questionnaire, was collected through crowd-sourcing, resulting in a sample of 83 individuals that scored at or above this cutoff. Statistics for this cohort are summarized in Table 1.</p>
          <p>8: https://www.who.int/health-topics/depression, accessed 06/20/2023</p>
          <p>9: https://www.psychiatry.org/patients-families/depression/what-is-depression, accessed 06/20/2023</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Methods</title>
      <sec id="sec-3-1">
        <title>Our procedure is divided into the following stages: (1) fea</title>
        <p>ture extraction, (2) preprocessing, (3) age-correction and
sex-normalization, (4) redundancy and efect size
analysis, and finally (5) classification (binary and multi-class)
and evaluation.
4.1. Multimodal Metrics Extraction</p>
      </sec>
      <sec id="sec-3-2">
        <title>In this and the following sections, we use the following</title>
        <p>terminology: Metric denotes a speech or facial metric in
general, and Feature denotes a specific combination of a
metric extracted from a certain task, e.g. speaking rate
for the SIT task.</p>
        <p>
          Both speech and facial metrics were extracted from
the audiovisual recordings (overview in Table 2). To
extract facial metrics, we used the Mediapipe FaceMesh
software10. More specifically, MediaPipe’s Face
Detection is based on BlazeFace [28] and determines the (x,
y)-coordinates of the face for every frame. Subsequently,
468 facial landmarks are identified using MediaPipe
FaceMesh. We selected 14 key landmarks to compute
functionals of facial behavior. Distances between
landmarks were normalized by dividing them by the
intercaruncular distance. In terms of between- as well as
within-subject analyses, when the same position
relative to the camera cannot be assumed, Roesler et al. [29]
found this to be the most reliable method of
normalization. More details and a visual depiction of the
landmarks used to calculate facial features can be found in
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Speech metrics were computed using Praat [30] and
cover diferent domains, such as energy, timing, voice
quality and frequency.
4.2. Preprocessing
        </p>
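        <p>As a minimal sketch of how such a facial metric can be computed, the snippet below derives a normalized lip aperture series from MediaPipe FaceMesh output; the landmark indices are illustrative assumptions, and the actual 14 key landmarks are documented in [4]:</p>
        <preformat>
# Sketch: lip aperture per frame, normalized by the intercaruncular
# distance (landmark indices are illustrative, cf. [4]).
import numpy as np
import mediapipe as mp

def lip_aperture_series(frames):
    apertures = []
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=False) as mesh:
        for frame in frames:                      # frames: RGB numpy arrays
            result = mesh.process(frame)
            if not result.multi_face_landmarks:
                continue                          # no face detected
            lm = result.multi_face_landmarks[0].landmark
            p = lambda i: np.array([lm[i].x, lm[i].y])
            # Inner eye corners approximate the intercaruncular distance.
            norm = np.linalg.norm(p(133) - p(362))
            apertures.append(np.linalg.norm(p(13) - p(14)) / norm)
    return np.array(apertures)

# Functionals such as the mean and max of this series yield the
# mouth metrics listed in Table 2.
        </preformat>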
      </sec>
      <sec id="sec-3-3">
        <title>We applied the following approach to handle missing</title>
        <p>data, which can occur for a number of reasons, including
incomplete sessions, technical issues, or network
problems. First, on the session level, we removed participant
10https://google.github.io/mediapipe/</p>
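        <p>A minimal sketch of this two-stage filtering and cohort-wise imputation, assuming a DataFrame with one row per session and one column per feature:</p>
        <preformat>
# Sketch: session- and feature-level missingness filters, then
# cohort-mean imputation (applied to train and test separately).
import pandas as pd

def preprocess(df, cohorts):
    # 1) Remove sessions with more than 15% missing features.
    df = df.loc[df.isna().mean(axis=1) &lt;= 0.15]
    # 2) Remove features with more than 10% missing values.
    df = df.loc[:, df.isna().mean(axis=0) &lt;= 0.10]
    # 3) Impute remaining gaps with the mean of the session's cohort.
    cohorts = cohorts.loc[df.index]
    return df.groupby(cohorts).transform(lambda col: col.fillna(col.mean()))
        </preformat>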
      </sec>
      <sec id="sec-3-4">
        <title>Similar to the approach in Falahati et al. [31], we applied</title>
        <p>a linear correction algorithm to both patient and
control data based on age-related changes in the HC cohort.
By calculating age trends and coeficients on healthy
controls, we aim to obtain the most accurate estimate
of purely age-related changes without the confounding
efects of disease-related influences. In detail, for each
feature, we fit a linear regression model to age as the
independent and the feature as the dependent variable,
modeling the age-related changes as a linear deviation.
This is done separately for males and females to obtain
a sex-specific result. Then, the sex-specific regression
coeficients are used to correct feature values for age
by subtracting the product of coeficient and age from
the feature value for each participant. To account for
sex-related diferences, we applied sex-specific z-scoring
to normalize the features. Z-normalization is a
methodology that allows for the comparison or compilation of
observations of diferent cohorts [ 32]. In addition, the
normalization process ensures the comparability of
features on diferent scales by centering the feature
distributions around zero with a standard deviation of one. First,
the dataset to analyze was divided into male and female
participants. Then, each feature was normalized within
each sex group using z-scoring.
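        <p>A sketch of the correction and normalization, assuming a tidy DataFrame with age and sex columns (the regression coefficients are fitted on healthy controls only):</p>
        <preformat>
# Sketch: subtract the HC-estimated linear age trend per feature and sex,
# then z-score within each sex group.
import numpy as np

def age_coefficients(hc, features):
    # Per-sex slope of each feature regressed on age (healthy controls).
    return {sex: {f: np.polyfit(grp["age"], grp[f], 1)[0] for f in features}
            for sex, grp in hc.groupby("sex")}

def age_correct_and_zscore(df, coefs, features):
    out = df.copy()
    for sex, grp in out.groupby("sex"):
        for f in features:
            corrected = grp[f] - coefs[sex][f] * grp["age"]
            # Sex-specific z-scoring after removing the age trend.
            out.loc[grp.index, f] = (corrected - corrected.mean()) / corrected.std()
    return out
        </preformat>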
      </sec>
      <sec id="sec-3-4">
        <title>4.4. Redundancy Analysis and Effect Sizes</title>
        <p>To identify collinear features and reduce the high-dimensional feature space, we performed hierarchical clustering on the Spearman rank-order correlations using the age-corrected and sex-normalized larger healthy control dataset. We applied the clustering for speech and facial features separately. The clustering procedure is motivated by the approach in Ienco and Meo [33]. It is based on Ward’s method [34], which aims at minimising within-cluster variance. We implemented it using the scikit-learn library11. A dendrogram was plotted to inspect the correlations between features visually and to determine a suitable distance threshold for generating feature clusters. The threshold choice was based on two major factors: (a) balance between speech and facial clusters, as we target roughly an equal number to avoid predominance of one modality over the other, and (b) expert knowledge about the different task and feature domains (e.g. timing versus voice quality features, jaw versus eye movement, or read versus free speech), which resulted in the clusters shown in Table 3 and Table 4. The clusters are used in the feature selection process as described in Section 4.5.</p>
        <p>Statistical tests to assess the statistical significance, as well as the magnitude and direction of effects for a given comparison, were conducted within classification folds and as part of a post hoc analysis. Effect sizes were calculated using Glass’s Delta [35]. Here, only features showing statistical significance (p &lt; 0.05) in the Mann-Whitney U-test (MWU) were considered.</p>
        <p>11: https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance_multicollinear.html</p>
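        <p>A sketch of the clustering step, following the scikit-learn multicollinearity example linked in footnote 11 (the distance threshold is the empirically chosen value discussed above):</p>
        <preformat>
# Sketch: Ward clustering on Spearman rank-order correlations of the
# HC feature matrix X (sessions x features); the threshold is chosen
# by inspecting the dendrogram.
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def cluster_features(X, threshold):
    corr = spearmanr(X).correlation       # feature-by-feature correlations
    corr = (corr + corr.T) / 2            # enforce symmetry
    np.fill_diagonal(corr, 1.0)
    dist = squareform(1 - np.abs(corr))   # condensed distance matrix
    linkage = hierarchy.ward(dist)
    # hierarchy.dendrogram(linkage)       # visual threshold inspection
    return hierarchy.fcluster(linkage, t=threshold, criterion="distance")
        </preformat>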
        <sec id="sec-3-4-1">
          <title>Energy</title>
          <p>o Timing
i
d
u</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>A Specific to DDK</title>
        </sec>
        <sec id="sec-3-4-3">
          <title>Voice quality</title>
        </sec>
        <sec id="sec-3-4-4">
          <title>Frequency</title>
          <p>Jaw
eo Lower Lip
id Mouth
V Eyes
signal-to-noise ratio (SNR, dB)
speaking &amp; articulation duration/rate (sec./WPM), percent pause time (PPT, %),
canonical timing agreement (CTA, %)
cycle-to-cycle temporal variability (cTV, sec.), syllable rate (syl./sec.), number of syllables
shimmer (%), harmonics-to-noise ratio (HNR, dB), jitter (%)
mean, min, max &amp; standard deviation (stdev) of fundamental frequency (F0, Hz)
mean, min &amp; max speed/acceleration/jerk of the jaw center (JC)
mean, min &amp; max speed/acceleration/jerk of the lower lip (LL)
mean &amp; max lip aperture, lip width, mouth surface area; mean mouth symmetry ratio
mean &amp; max eye opening
      </sec>
      <sec id="sec-3-5">
        <title>4.5. Classification</title>
        <p>For both the binary and multi-class classification experiments, we used a multilayer perceptron (MLP), which was implemented using the scikit-learn library. The MLP has one hidden layer. We experimented with adding more hidden layers, but found that the minimal configuration with only one layer was beneficial in terms of performance. The hidden layer size h was determined dynamically as h = (f + c) / 2 (1), where f is the number of selected features and c the number of classes. The model was trained with a maximum of 10,000 iterations to allow sufficient time for convergence during training. Model training was stopped when the loss or score was not improving by a defined tolerance threshold; here, we used scikit-learn’s default of 1e-4. Additionally, the alpha parameter was set to 0.001, controlling the regularization strength to prevent overfitting. The sgd (stochastic gradient descent) solver was used for optimization during training. The batch size was set to auto, enabling the model to determine the appropriate batch size during training. We used the rectified linear unit function as the activation function.</p>
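        <p>The corresponding scikit-learn configuration might look as follows (a sketch; only the parameters named above are set explicitly):</p>
        <preformat>
# Sketch: the MLP described above, with the hidden layer size of Eq. (1).
from sklearn.neural_network import MLPClassifier

def build_mlp(n_features, n_classes):
    hidden = (n_features + n_classes) // 2   # Eq. (1): h = (f + c) / 2
    return MLPClassifier(
        hidden_layer_sizes=(hidden,),  # single hidden layer
        activation="relu",             # rectified linear unit
        solver="sgd",                  # stochastic gradient descent
        alpha=0.001,                   # regularization strength
        batch_size="auto",
        max_iter=10_000,
        tol=1e-4,                      # scikit-learn's default tolerance
    )
        </preformat>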
        <sec id="sec-3-4-5">
          <title>Metrics</title>
          <p>SNR
CTA
PPT
articulation/speaking duration</p>
        </sec>
        <sec id="sec-3-4-6">
          <title>SNR, syl.rate, syl.count &amp; cTV articulation/speaking rate/time articulation/speaking rate/time</title>
        </sec>
        <sec id="sec-3-4-7">
          <title>Tasks all all all</title>
        </sec>
        <sec id="sec-3-4-8">
          <title>Picture Description</title>
          <p>DDK
SIT_{5,9}
SIT_{7,11,13,15},</p>
        </sec>
        <sec id="sec-3-4-9">
          <title>Reading passage</title>
          <p>DDK
all except DDK
all except DDK
all except DDK
all
all
∑︀</p>
        </sec>
        <sec id="sec-3-4-10">
          <title>Lip movement (1)</title>
        </sec>
        <sec id="sec-3-4-11">
          <title>Lip width</title>
        </sec>
        <sec id="sec-3-4-12">
          <title>Mouth opening</title>
        </sec>
        <sec id="sec-3-4-13">
          <title>Lip movement (2)</title>
        </sec>
        <sec id="sec-3-4-14">
          <title>Jaw movement (1)</title>
        </sec>
        <sec id="sec-3-4-15">
          <title>Jaw movement (2)</title>
        </sec>
        <sec id="sec-3-4-16">
          <title>Jaw movement (3)</title>
        </sec>
        <sec id="sec-3-4-17">
          <title>Jaw movement (4)</title>
        </sec>
        <sec id="sec-3-4-18">
          <title>Jaw movement (5)</title>
        </sec>
        <sec id="sec-3-4-19">
          <title>Mouth symmetry</title>
        </sec>
        <sec id="sec-3-4-20">
          <title>Eye opening</title>
          <p>all except DDK
all
all
DDK
DDK
SIT_7
SIT_5</p>
        </sec>
        <sec id="sec-3-4-21">
          <title>Picture Description</title>
          <p>SIT_{9,11,13,15}, RP,
Picture Description
all
all
∑︀</p>
        <p>Ten-fold cross-validation was applied for evaluation in order to maximize the utilization of data for both training and testing purposes. To avoid bias towards the majority group, we created datasets that consist of an equal number of samples in each disease condition. For each individual participant, we consider, if available, the first two sessions as data points. Because of the equality constraint, the number of data points was limited by the smallest dataset (schizophrenia). This resulted in 72 randomly selected data points per cohort, summing up to a total of 360 data points. The classification experiments are run ten times to smooth out performance variations and obtain more representative results. We split the data using scikit-learn’s StratifiedGroupKFold to make sure that sessions from the same participant are either in the respective training or testing fold. In each fold, we imputed missing values and standardized features by sex using z-scoring. This was done separately for training and test sets.</p>
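        <p>A sketch of the evaluation loop (array names are placeholders; per-fold imputation and z-scoring follow Sections 4.2 and 4.3):</p>
        <preformat>
# Sketch: ten-fold CV where sessions of one participant never span
# the train and test folds.
from sklearn.model_selection import StratifiedGroupKFold

def evaluate(X, y, participant_ids, build_model):
    cv = StratifiedGroupKFold(n_splits=10)
    scores = []
    for train_idx, test_idx in cv.split(X, y, groups=participant_ids):
        # Imputation, sex-wise z-scoring, and feature selection are
        # performed per fold, separately on the two partitions.
        model = build_model()
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return scores
        </preformat>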
        <p>As a benchmark, we evaluated binary classification performance of models aimed at distinguishing cases with a disorder from controls. Here, for each cluster of collinear features as described in Section 4.4, the one with the highest effect size was selected for the final feature set as input to the classifier. If no feature showed statistically significant differences between cases and controls in a given cluster, no feature was selected. Hence, the number of clusters determines the maximum number of features fed into the classifier. Statistical significance and effect sizes for each feature were calculated as described in the previous section.</p>
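        <p>The per-cluster selection rule can be sketched as follows (Glass’s Delta is computed with the control group’s standard deviation, which is how the statistic is usually defined):</p>
        <preformat>
# Sketch: per cluster, keep the feature with the largest absolute
# Glass's Delta among those significant under the Mann-Whitney U-test.
import numpy as np
from scipy.stats import mannwhitneyu

def glass_delta(cases, controls):
    return (np.mean(cases) - np.mean(controls)) / np.std(controls, ddof=1)

def select_features(cases, controls, clusters):
    # cases/controls: dicts mapping feature names to value arrays;
    # clusters: dict mapping cluster ids to lists of feature names.
    selected = []
    for feats in clusters.values():
        best, best_effect = None, 0.0
        for f in feats:
            p = mannwhitneyu(cases[f], controls[f]).pvalue
            d = glass_delta(cases[f], controls[f])
            if p &lt; 0.05 and abs(d) &gt; abs(best_effect):
                best, best_effect = f, d
        if best is not None:      # clusters without a significant
            selected.append(best)  # feature contribute nothing
    return selected
        </preformat>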
        <p>In a second step, we performed 4-class classification, incorporating all the investigated neurological and mental disorders. Here, feature selection was done based on pairwise comparisons of all disease cohorts (e.g. Depression vs. Schizophrenia cases, Schizophrenia vs. BS ALS cases, BS ALS vs. Depression cases, and so on). We merged the selected features from these comparisons as input to the classifier. Therefore, multiple features from the same cluster could be included in one feature set. We allowed a certain amount of redundancy compared to the case-control baseline in order to account for the complexity associated with multiple comparisons. For both experiments, classification performance was evaluated in terms of F1 score, sensitivity, and specificity.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Binary Classification Baseline</title>
        <table-wrap id="tab5">
          <label>Table 5</label>
          <caption><p>Binary classification results. In each row, we highlighted the highest performance in terms of F1. HC: Healthy Controls, DEP: Depression, SCHIZ: Schizophrenia, SEN: Sensitivity, SP: Specificity.</p></caption>
          <table>
            <thead>
              <tr><th>Cohort</th><th>Speech F1</th><th>Facial F1</th><th>Speech + Facial F1</th><th>SEN</th><th>SP</th></tr>
            </thead>
            <tbody>
              <tr><td>DEP vs. HC</td><td>0.64</td><td>0.59</td><td>0.65</td><td>0.65</td><td>0.65</td></tr>
              <tr><td>SCHIZ vs. HC</td><td>0.82</td><td>0.64</td><td>0.83</td><td>0.85</td><td>0.82</td></tr>
              <tr><td>BP ALS vs. HC</td><td>0.54</td><td>0.51</td><td>0.52</td><td>0.52</td><td>0.53</td></tr>
              <tr><td>BS ALS vs. HC</td><td>0.84</td><td>0.63</td><td>0.83</td><td>0.82</td><td>0.83</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>As can be seen in Table 5, we observe a good performance in classifying controls versus BS ALS (speech features alone; F1-score: 0.84) and schizophrenia (combined speech and facial; F1-score: 0.83) cases, respectively. The binary classification of depression did not perform as well; however, it still surpassed the random chance baseline (combined speech and facial; F1-score: 0.65). The classifier struggled to distinguish controls from BP ALS cases, where we observed performance just above random chance across modalities. Furthermore, the performance with regard to sensitivity and specificity is relatively balanced across comparisons.</p>
        <p>In depression and schizophrenia, combining speech and facial modalities resulted in improved classification performance compared to speech or facial features alone, as shown in Table 5. However, adding facial information did not enhance performance for the BP or BS ALS cohorts compared to utilizing speech features alone.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Multi-Class Classification</title>
        <table-wrap id="tab6">
          <label>Table 6</label>
          <caption><p>Multi-class classification results. In each row, we highlight the highest F1 score performance. HC: Healthy Controls, DEP: Depression, SCHIZ: Schizophrenia, SEN: Sensitivity, SP: Specificity.</p></caption>
          <table>
            <thead>
              <tr><th>Cohort</th><th>Speech F1</th><th>Facial F1</th><th>Speech + Facial F1</th><th>SEN</th><th>SP</th></tr>
            </thead>
            <tbody>
              <tr><td>SCHIZ</td><td>0.72</td><td>0.53</td><td>0.72</td><td>0.72</td><td>0.91</td></tr>
              <tr><td>BP ALS</td><td>0.55</td><td>0.36</td><td>0.57</td><td>0.57</td><td>0.86</td></tr>
              <tr><td>BS ALS</td><td>0.62</td><td>0.47</td><td>0.64</td><td>0.65</td><td>0.88</td></tr>
              <tr><td>DEP</td><td>0.61</td><td>0.46</td><td>0.64</td><td>0.64</td><td>0.88</td></tr>
              <tr><td>Average</td><td>0.63</td><td>0.46</td><td>0.64</td><td>0.65</td><td>0.88</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>In the 4-class experiment aimed at discriminating between all investigated neurological and mental disorders, we achieve the best overall performance (F1-score: 0.64) by utilizing both speech and facial features, as shown in Table 6. Overall, the specificity (average: 0.88) for the disorders examined is considerably higher than the sensitivity (average: 0.65). This indicates that the classifier is more effective at avoiding false-positive results than identifying true positives. In most cases, namely for BS ALS, BP ALS and depression, the per-class F1-score is highest when combining speech and facial features. There is no performance difference between using only speech or speech and facial features for identifying schizophrenia.</p>
        <fig id="fig2">
          <label>Figure 2</label>
          <caption><p>Normalized confusion matrix for the 4-class classification. The x-axis shows the true labels, the y-axis the predicted ones.</p></caption>
        </fig>
        <p>Figure 2 shows a confusion matrix that indicates the percentage of accurate class predictions and the classes with which they were confused. The model was most confident in detecting schizophrenia (72.22%), followed by BS ALS (64.58%) and depression (63.75%). The model faced its greatest challenge in accurately predicting BP ALS (57.22%), yet it still performs notably above chance in a 4-class classification scenario. BP ALS and depression cases were most often confused with each other. Schizophrenic patients were least often confused with other cohorts. Among the cases of BS ALS, the most frequent confusion occurred with BP ALS patients (16.11%).</p>
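        <p>The normalized confusion matrix of Figure 2 can be reproduced with scikit-learn (a sketch; the label order is an assumption):</p>
        <preformat>
# Sketch: row-normalized confusion matrix in percent
# (rows: true classes, columns: predicted classes).
from sklearn.metrics import confusion_matrix

LABELS = ["SCHIZ", "BP ALS", "BS ALS", "DEP"]   # assumed order

def percent_confusion(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred, labels=LABELS, normalize="true")
    return 100.0 * cm   # e.g. diagonal entries 72.22, 57.22, 64.58, 63.75
        </preformat>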
        <p>The features that we identified to be consistently chosen across classification folds (Table 7) are predominantly speech features of the timing, voice quality, and energy domains. In addition, two facial features are selected across folds, concerning the maximum lip width and the maximum absolute acceleration of jaw movements. We conducted a post hoc analysis of effect sizes between HC and cases with a disorder for these features to gain further insight into disorder-specific importance. Here, positive effect sizes represent feature values that are larger for cases with a disorder than controls. Conversely, negative values represent larger feature values for controls than cases with a disorder12. In schizophrenia, we find all of the features consistently selected across classification folds to be statistically significant when compared to HC. With respect to the other cohorts, the largest effects are shown for CTA (-1.44 for SIT_13) and speaking rate (-2.00 for RP). This shows that patients exhibit a lower CTA, a measure of phonetic alignment between their own speech and that of the virtual guide, while speaking slower. We also observed a smaller average lip width as an important feature that shows the largest effect between HC and depression cases compared to the other cohorts. This may be associated with decreased emotional expressivity, as indicated by reduced smiling and increased frowning. These findings align with previous studies highlighting similar patterns of emotional expressiveness in depression [37, 38]. Few and small differences compared to controls are revealed for BP ALS cases. This is also the cohort with the lowest performance across classification experiments. In BS ALS, we found the largest effects for SNR and speaking rate. Another feature that stood out is cTV in the DDK task, a measure that captures the temporal variability, i.e. the consistency or irregularity in the timing of speech patterns, between consecutive cycles of speech.</p>
        <p>12: We follow the commonly used effect size magnitude thresholds as suggested in Cohen [36] – small: 0.2–0.5, medium: 0.5–0.8, and large: &gt; 0.8.</p>
        <table-wrap id="tab7">
          <label>Table 7</label>
          <caption><p>Features consistently selected across classification folds and their cluster domains.</p></caption>
          <table>
            <thead>
              <tr><th>Feature</th><th>Cluster domain</th></tr>
            </thead>
            <tbody>
              <tr><td>max abs acc. JC (RP)</td><td>Jaw movement</td></tr>
              <tr><td>max lip width (SIT 11)</td><td>Lip width</td></tr>
              <tr><td>shimmer (DDK)</td><td>Voice quality</td></tr>
              <tr><td>shimmer (SIT 5)</td><td>Voice quality</td></tr>
              <tr><td>jitter (SIT 9)</td><td>Voice quality</td></tr>
              <tr><td>CTA (SIT 13)</td><td>Timing alignment</td></tr>
              <tr><td>SNR (DDK)</td><td>Energy</td></tr>
              <tr><td>speaking rate (RP)</td><td>Timing, speaking</td></tr>
              <tr><td>speaking rate (SIT 7)</td><td>Timing, speaking</td></tr>
              <tr><td>HNR (DDK)</td><td>Voice quality</td></tr>
              <tr><td>HNR (SIT 15)</td><td>Voice quality</td></tr>
              <tr><td>cTV (DDK)</td><td>Energy &amp; articulation skills</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>While many features are shared in terms of indicating a signal between cases with a disorder and controls, it is mostly the magnitude of the effect that differentiates them, as well as how they combine. However, there are also a few features that show a different direction of effect across cohorts. For example, in BS ALS, compared to other cohorts, we observed the largest effect for shimmer (DDK, -0.63), which measures the variation in amplitude of the vocal folds during the speech signal. There is no effect observed for the BP ALS or depression cohorts, while in schizophrenia, the direction of effect is the opposite (0.35).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>We explored the utility of speech and facial features extracted by a multimodal dialog system for differential classification of ALS, depression and schizophrenia. Note that the idea here is not to replace clinicians, but to provide effective and assistive tools that can help improve their efficiency, speed and accuracy. Overall, combining speech and facial information proved to be beneficial for identifying several disorders in both multi-class and binary classification experiments. In addition, our automated feature analysis indicates several features that show relevance across experiments. While some of these features are intuitively identifiable by human experts as markers of a given disorder (for example, a slower speaking rate or a lower intelligibility), such an analysis also allows discovery of other features that might be harder to detect or identify objectively by human experts, such as quicker facial movements.</p>
      <p>That being said, we acknowledge the importance of contextualizing the promise of such multimodal methodologies for differential diagnosis with several caveats. First, the performance of any machine learning classifier trained for this purpose will depend on the specific conditions being studied and the range and heterogeneity of symptoms presented in each case. For example, in this study we investigated four specific conditions – schizophrenia, depression, bulbar symptomatic (BS) and bulbar presymptomatic (BP) ALS – and we observed that schizophrenia (where the facial modality is particularly good at capturing characteristics exhibited therein, such as anhedonia, blunted affect, etc.) and BS ALS (which is characterized by speech motor deficits, reflected in the timing, rate and intelligibility of speech), quite different in terms of symptom presentation, exhibit greater separability relative to other classes for differential classification. For both BS ALS and schizophrenia, our analysis demonstrates a robust discriminatory capability to effectively distinguish these cohorts from healthy controls, as well as other neurological and mental disorders, in binary and multi-class experiments. However, the overall higher specificity of the multi-class classifier implies a robust capability to accurately identify non-cases, effectively minimizing false positives. Yet, the lower sensitivity suggests limitations in the identification of true cases for the analyzed disorders, likely due to the imposed strong restrictions. In BS ALS, speech features alone demonstrate superior performance when comparing this group with controls. Yet, in the more intricate task of differential diagnosis, performance improves when speech features are combined with facial information. For schizophrenia, the combination of speech and facial modalities proves most effective in both binary and multitask experiments. In contrast, BP ALS, which does not present with as many speech and facial motor deficits, is much less separable even in binary classification, let alone in the multi-class classification context, highlighting the challenging nature of detecting this condition. Furthermore, for the misidentified BS ALS cases, the classifier most frequently categorized them as BP ALS. Although distinguishing BP ALS cases from controls is challenging, this outcome indicates that the classifier may be able to capture condition-specific information from features that are shared across different stages of ALS, which may have led to this confusion. Finally, in evaluating depression, best performance in both binary and multi-class classification experiments is achieved by combining speech and facial information.</p>
      <sec id="sec-3-5">
        <title>The overall accuracy in discerning depression from other</title>
        <p>cohorts is notably lower compared to schizophrenia or
BS ALS. The variability introduced by the wide range
and time horizon of potential symptoms present in
depression as well as medication status might contribute
to lower diferential diagnosis accuracy. That being said,
a significant limitation of the present study is the lack
of information about co-morbidities to factor into our
analysis, since datasets were collected independently.
Future research will aim to explicitly address this gap by
capturing, for instance, information about co-morbid
depression in ALS or schizophrenia (e.g., through PHQ-8
scales), that might help us better stratify these cohorts.</p>
      </sec>
      <sec id="sec-3-6">
        <title>Second, this study focused on a restricted set of tasks,</title>
        <p>primarily focusing on reading abilities and picture
description assessments. However, these task-feature
combinations alone may not fully capture the nuances of each
disorder.</p>
        <p>Third, while we focused on interpretable features in
this study, less interpretable ones, such as log mel
spectrograms or Mel Frequency Cepstral Coeficients (MFCCs)
may be able to capture more nuanced and complex
patterns in the data. Additionally, more sophisticated deep
learning approaches for representation learning could
be applied, such as Res-Net 50 [39] in the facial
modality. While such features can be powerful in capturing
subtle details and nuances of audiovisual behavior, the
inner workings of the deep learning model are not easily
explainable or interpretable by non-experts.</p>
        <p>Fourth, our sample size is not representative enough
to truly claim generalizability of findings. The smaller
the sample, the larger the risk of having model “blind
spots” that in turn lead to variable estimates of true model
performance on unseen real world data, giving algorithm
designers an inaccurate sense of how well a model is
performing during development [40].</p>
        <p>Our results argue for the importance of a hybrid
approach to diferential diagnosis going forward, combining
knowledge-driven and data-driven approaches.
Understanding specific disease pathologies and symptoms can
in turn help in developing features and learning
methodologies that lead to better separability of disease
conditions. Future work will also focus on improving
diferential diagnosis performance in a manner that is both
generalizable and explainable.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <title>This work was funded in part by the National Institutes</title>
        <p>of Health grant R42DC019877. We thank all study
participants for their time and we gratefully acknowledge
the contribution of the Peter Cohen Foundation and
EverythingALS towards participant recruitment and data
collection for the ALS corpus and Anzalee Khan and
Jean</p>
      </sec>
      <sec id="sec-4-2">
        <title>Pierre Lindenmayer at the Manhattan Psychiatric Center – Nathan Kline Institute for the schizophrenia corpus.</title>
        <p>Investigating the utility of multimodal conversa- burg, G. L. Pattee, J. D. Berry, E. A. Macklin, E. P.
tional technology and audiovisual analytic mea- Pioro, R. A. Smith, Additional evidence for a
thersures for the assessment and monitoring of amy- apeutic efect of dextromethorphan/quinidine on
otrophic lateral sclerosis at scale, 2021, pp. 4783– bulbar motor function in patients with amyotrophic
4787. doi:10.21437/Interspeech.2021-1801. lateral sclerosis: A quantitative speech analysis,
[6] V. Richter, M. Neumann, H. Kothare, O. Roesler, British Journal of Clinical Pharmacology 84 (2018)</p>
      </sec>
      <sec id="sec-4-3">
        <title>J. Liscombe, D. Suendermann-Oeft, S. Prokop, 2849–2856.</title>
        <p>A. Khan, C. Yavorsky, J.-P. Lindenmayer, V. Ra- [13] T. Altaf, S. M. Anwar, N. Gul, M. N. Majeed,
manarayanan, Towards multimodal dialog-based M. Majid, Multi-class alzheimer’s disease
classispeech &amp; facial biomarkers of schizophrenia, in: ifcation using image and clinical features,
BiomedCompanion Publication of the 2022 International ical Signal Processing and Control 43 (2018) 64–
Conference on Multimodal Interaction, ICMI ’22 74. URL: https://www.sciencedirect.com/science/
Companion, Association for Computing Machinery, article/pii/S1746809418300508. doi:https://doi.
New York, NY, USA, 2022, p. 171–176. URL: https: org/10.1016/j.bspc.2018.02.019.
//doi.org/10.1145/3536220.3558075. doi:10.1145/ [14] L. Hansen, R. Rocca, A. Simonsen, et al.,
3536220.3558075. Speech- and text-based classification of
neu[7] H. Kothare, M. Neumann, J. Liscombe, O. Roesler, ropsychiatric conditions in a multidiagnostic
setW. Burke, A. Exner, S. Snyder, A. Cornish, D. Hab- ting, Nature Mental Health (2023). doi:10.1038/
berstad, D. Pautler, D. Suendermann-Oeft, J. Hu- s44220-023-00152-7.
ber, V. Ramanarayanan, Statistical and clini- [15] E. Emre, Erol, C. Taş, N. Tarhan, Multi-class
cal utility of multimodal dialogue-based speech classification model for psychiatric
disorand facial metrics for parkinson’s disease as- der discrimination, International Journal of
sessment, 2022, pp. 3658–3662. doi:10.21437/ Medical Informatics 170 (2023) 104926. URL:
Interspeech.2022-11048. https://www.sciencedirect.com/science/article/pii/
[8] N. Cummins, J. Joshi, A. Dhall, V. Sethu, R. Goecke, S1386505622002404. doi:https://doi.org/10.</p>
      </sec>
      <sec id="sec-4-4">
        <title>J. Epps, Diagnosis of depression by behavioural 1016/j.ijmedinf.2022.104926.</title>
        <p>signals: A multimodal approach, in: Proceed- [16] D. Suendermann-Oeft, A. Robinson, A. Cornish,
ings of the 3rd ACM International Workshop on D. Habberstad, D. Pautler, D. Schnelle-Walka,
Audio/Visual Emotion Challenge, AVEC ’13, As- F. Haller, J. Liscombe, M. Neumann, M. Merrill,
sociation for Computing Machinery, New York, O. Roesler, R. Gefarth, Nemsi: A multimodal
diaNY, USA, 2013, p. 11–20. URL: https://doi.org/ log system for screening of neurological or mental
10.1145/2512530.2512535. doi:10.1145/2512530. conditions, in: Proceedings of the 19th ACM
Inter2512535. national Conference on Intelligent Virtual Agents,
[9] J. Robin, M. Xu, A. Balagopalan, J. Novikova, IVA ’19, Association for Computing Machinery,
L. Kahn, A. Oday, M. Hejrati, S. Hashemifar, M. Ne- New York, NY, USA, 2019, p. 245–247. URL: https:
gahdar, W. Simpson, E. Teng, Automated detection //doi.org/10.1145/3308532.3329415. doi:10.1145/
of progressive speech changes in early alzheimer’s 3308532.3329415.
disease, Alzheimer’s &amp; Dementia: Diagnosis, As- [17] A. K. Silbergleit, A. F. Johnson, B. H. Jacobson,
sessment &amp; Disease Monitoring 15 (2023) e12445. Acoustic analysis of voice in individuals with
amydoi:https://doi.org/10.1002/dad2.12445. otrophic lateral sclerosis and perceptually normal
[10] J. Hlavnika, R. Cmejla, T. Tykalová, K. onka, vocal quality, Journal of Voice 11 (1997) 222–231.</p>
        <p>E. Růika, J. Rusz, Automated analysis of con- [18] B. Tomik, R. J. Guilof, Dysarthria in amyotrophic
nected speech reveals early biomarkers of parkin- lateral sclerosis: A review, Amyotrophic Lateral
son’s disease in patients with rapid eye move- Sclerosis 11 (2010) 4–15.
ment sleep behaviour disorder, Scientific Re- [19] M. Novotny, J. Melechovsky, K. Rozenstoks,
ports 7 (2017). URL: https://api.semanticscholar.org/ T. Tykalova, P. Kryze, M. Kanok, J. Klempir, J. Rusz,
CorpusID:19272861. Comparison of automated acoustic methods for
[11] G. Stegmann, S. Charles, J. Liss, J. Shefner, oral diadochokinesis assessment in amyotrophic
S. Rutkove, V. Berisha, A speech-based prognos- lateral sclerosis, Journal of speech, language, and
tic model for dysarthria progression in als, Amy- hearing research : JSLHR 63 (2020) 3453–3460.
otrophic lateral sclerosis &amp; frontotemporal degen- doi:10.1044/2020_JSLHR-20-00109.
eration (2023) 1–6. URL: https://doi.org/10.1080/ [20] P. Buckley, B. Miller, D. Lehrer, D. Castle,
Psychi21678421.2023.2222144. doi:10.1080/21678421. atric comorbidities and schizophrenia,
Schizophre2023.2222144, advance online publication. nia bulletin 35 (2008) 383–402. doi:10.1093/
[12] J. R. Green, K. M. Allison, C. Cordella, B. D. Rich- schbul/sbn135.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Feigin</surname>
          </string-name>
          , E. Nichols,
          <string-name>
            <given-names>T.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bannick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Beghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Blake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dorsey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elbaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ellenbogen</surname>
          </string-name>
          , J. Fisher, C. Fitzmaurice,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giussani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Glennie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>James</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , N. Kassebaum,
          <string-name>
            <given-names>G.</given-names>
            <surname>Logroscino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Marin</surname>
          </string-name>
          , T. Vos, Global, regional, and
          <article-title>national burden of neurological disorders,</article-title>
          <year>1990</year>
          -
          <fpage>2016</fpage>
          :
          <article-title>a systematic analysis for the global burden of disease study 2016</article-title>
          ,
          <source>The Lancet Neurology</source>
          <volume>18</volume>
          (
          <year>2019</year>
          )
          <fpage>459</fpage>
          -
          <lpage>480</lpage>
          . doi:
          <volume>10</volume>
          .1016/S1474-
          <volume>4422</volume>
          (
          <issue>18</issue>
          )
          <fpage>30499</fpage>
          -
          <lpage>X</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramanarayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Lammert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Rowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. F.</given-names>
            <surname>Quatieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <article-title>Speech as a biomarker: Opportunities, interpretability, and challenges</article-title>
          ,
          <source>Perspectives of the ASHA Special Interest Groups</source>
          <volume>7</volume>
          (
          <year>2022</year>
          )
          <fpage>276</fpage>
          -
          <lpage>283</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Roesler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liscombe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kothare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Suendermann-Oeft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Berry</surname>
          </string-name>
          , E. Fraenkel,
          <string-name>
            <given-names>R.</given-names>
            <surname>Norel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anvar</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Navar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Sherman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramanarayanan</surname>
          </string-name>
          ,
          <article-title>Multimodal dialog based speech and facial biomarkers capture diferential disease progression rates for als remote patient monitoring</article-title>
          ,
          <source>in: Proceedings of the 32nd International Symposium on Amyotrophic Lateral Sclerosis and Motor Neuron Disease, Virtual</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Richter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Haq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wright-Berryman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramanarayanan</surname>
          </string-name>
          ,
          <article-title>A multimodal dialog approach to mental state characterization in clinically depressed, anxious, and suicidal populations</article-title>
          ,
          <source>Frontiers in Psychology</source>
          <volume>14</volume>
          (
          <year>2023</year>
          ). URL: https://www.frontiersin.org/articles/ 10.3389/fpsyg.
          <year>2023</year>
          .
          <volume>1135469</volume>
          . doi:
          <volume>10</volume>
          .3389/fpsyg.
          <year>2023</year>
          .
          <volume>1135469</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Roesler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liscombe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kothare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Suendermann-Oeft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pautler</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Navar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anvar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kumm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Norel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fraenkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sherman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Berry</surname>
          </string-name>
          , G. Pattee,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramanarayanan</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>