=Paper= {{Paper |id=Vol-3649/Paper19 |storemode=property |title=Towards remote differential diagnosis of mental and neurological disorders using automatically extracted speech and facial features |pdfUrl=https://ceur-ws.org/Vol-3649/Paper19.pdf |volume=Vol-3649 |authors=Vanessa Richter,Michael Neumann,Vikram Ramanarayanan |dblpUrl=https://dblp.org/rec/conf/aaai/RichterNR24 }} ==Towards remote differential diagnosis of mental and neurological disorders using automatically extracted speech and facial features== https://ceur-ws.org/Vol-3649/Paper19.pdf
                                Towards Remote Differential Diagnosis of Mental and
                                Neurological Disorders using Automatically Extracted
                                Speech and Facial Features
                                Vanessa Richter1,† , Michael Neumann1 and Vikram Ramanarayanan1,2,*
1 Modality.AI, Inc., San Francisco, CA 94105, United States
2 University of California, San Francisco, CA 94127, United States


                                                   Abstract
Utilizing computer vision and speech signal processing to assess neurological and mental conditions remotely has the
potential to help detect diseases earlier and monitor their progression more accurately. Multimodal features have
demonstrated usefulness in distinguishing cases with a disorder from controls across several health conditions. However,
challenges arise in distinguishing between specific disorders during differential diagnosis, where shared
characteristics among different disorders may complicate accurate classification. Our aim in this study was to evaluate the
utility and accuracy of automatically extracted speech and facial features for differentiating between multiple disorders in
a multi-class (differential diagnosis) setting using a machine learning classifier. We use datasets comprising people with
depression, bulbar- and limb-onset amyotrophic lateral sclerosis (ALS), and schizophrenia, in addition to healthy controls.
The data was collected in a real-world scenario with a multimodal dialog system, in which a virtual guide walked participants
through a set of tasks that elicit speech and facial behavior. Our study demonstrates the utility of digital speech and facial
biomarkers in assessing neurological and mental disorders for differential diagnosis. Furthermore, this research emphasizes
the importance of combining information derived from multiple modalities for a more comprehensive understanding and
classification of disorders.

                                                   Keywords
                                                   differential diagnosis, multi-class, mental disorders, neurological disorders, depression, schizophrenia, amyotrophic lateral
                                                   sclerosis, digital biomarkers, dialog system, speech, facial, multimodal



1. Introduction

One out of eight individuals in the world lives with a mental health disorder, but most people do not have access to effective care.1 Moreover, disorders of the nervous system are the second leading cause of death globally [1].

The development of clinically valid digital biomarkers for neurological and mental disorders that can be automatically extracted could significantly improve patients' lives. This advancement has the potential to assist clinicians in achieving quicker and more reliable diagnoses by providing fast and objective insights into a patient's state. Note that the idea here is not to replace the clinician, but to provide effective, assistive tools that can help improve their efficiency, speed, and accuracy.

Many speech and facial features have been shown to be useful in differentiating between mental and neurological disorders and healthy controls (HCs) [2]. However, it remains unclear how distinctly these features characterize a given disorder. For example, percent pause time (PPT) has been found to differ significantly between people with ALS (pALS) and HCs [3] as well as between people with depression symptoms and HCs [4]. Furthermore, a slower speaking rate differentiates pALS [5] as well as people with schizophrenia [6] from HCs. To assess the utility of automatically computed digital biomarkers to capture specific disease attributes despite such shared characteristics, we aim to answer the following questions:

    1. How accurately can a machine learning (ML) classifier differentially distinguish between multiple disorders – depression, schizophrenia, bulbar symptomatic ALS, and bulbar presymptomatic ALS?

    2. Which modalities and features are most useful for this multi-class classification task – overall and with respect to a given disorder – and how does that compare to a binary classification baseline (controls versus cases in each of the investigated health conditions)?

Machine Learning for Cognitive and Mental Health Workshop (ML4CMH), AAAI 2024, Vancouver, BC, Canada
* Corresponding author.
† Vanessa Richter performed the work described in this paper when she was an intern at Modality.AI.
$ vikram.ramanarayanan@modality.ai (V. Ramanarayanan)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
1 https://www.who.int/news-room/fact-sheets/detail/mental-disorders, accessed 11/7/2022


2. Related Work

Recently, digital speech and facial features have been shown to yield statistically significant differences between cases with neurological or mental disorders and healthy controls, to exhibit high specificity and sensitivity in discriminating between those groups, or to show high potential for monitoring disease progression and treatment effects [2, 3, 6, 7, 8, 9, 10, 11, 12].
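Timing features such as PPT and speaking rate are derived from time-aligned speech and pause intervals. A minimal sketch of both computations (the segment format and function names are hypothetical, for illustration only):

```python
def percent_pause_time(segments):
    """Percent pause time (PPT): pause duration as a share of total duration.

    `segments` is a list of (label, start_s, end_s) tuples, where label is
    "speech" or "pause" -- a hypothetical format for illustration.
    """
    pause = sum(end - start for label, start, end in segments if label == "pause")
    total = segments[-1][2] - segments[0][1]
    return 100.0 * pause / total


def speaking_rate_wpm(n_words, segments):
    """Words per minute over the full task duration, pauses included."""
    total_s = segments[-1][2] - segments[0][1]
    return n_words / (total_s / 60.0)


segs = [("speech", 0.0, 2.0), ("pause", 2.0, 2.5), ("speech", 2.5, 4.0)]
print(percent_pause_time(segs))     # 12.5 (percent)
print(speaking_rate_wpm(12, segs))  # ~180 WPM
```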
Several studies have evaluated the detection of neurological and mental disorders in multi-class classification settings as compared to binary case-control studies [13, 14, 15]. Altaf et al. [13] introduced an algorithm for Alzheimer's disease (AD) detection, validated on binary classification and on multi-class classification of AD, normal, and mild cognitive impairment (MCI). Using the bag-of-visual-words approach, the algorithm enhances texture-based features such as the gray level co-occurrence matrix and integrates clinical data, creating a hybrid feature vector from whole magnetic resonance (MR) brain images. They use the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset and achieve 98.4% accuracy in binary AD versus normal classification and 79.8% accuracy in multi-class AD, normal, and MCI classification.

Furthermore, Hansen et al. [14] explored the potential of speech patterns as diagnostic markers for multiple neuropsychiatric conditions by examining recordings from 420 participants with major depressive disorder, schizophrenia, autism spectrum disorder, and non-psychiatric controls. Various models were trained and tested for both binary and multi-class classification tasks using speech and text features. While binary classification models exhibited performance comparable to prior research (F1: 0.54–0.92), multi-class classification showed a notable decrease in performance (F1: 0.35–0.75). The study further demonstrates that combining voice- and text-based models enhances overall performance by 9.4% macro F1, highlighting the potential of a multimodal approach for more accurate neuropsychiatric condition classification. While these studies show the effectiveness of different types of speech- and facial-derived features for assessing psychiatric conditions in differential diagnosis settings, none of them utilized 'in-the-wild' data collected remotely from participants' devices with a multimodal dialog system.

Figure 1: Overview of feature extraction and dataset creation.

3. Multimodal Dialog Platform and Data Collection

Audiovisual data was collected using NEMSI (Neurological and Mental health Screening Instrument) [16], a multimodal dialog system for remote health assessments. An overview of the dataset creation process is illustrated in Figure 1. A virtual guide, Tina, led study participants through various tasks that are designed to elicit speech, facial, and motor behaviors. Having an interactive virtual guide to elicit participants' behavior allows for scalability while providing a natural but controlled and objective interview environment and data collection. Each session starts with a microphone, speaker, and camera check to ensure that the participant has given their device permission to access the camera and microphone, is able to hear the instructions, and that the captured signal is of adequate quality. After these tests, the virtual guide involves participants in a structured conversation that consists of exercises (speaking tasks, open-ended questions, motor abilities) to elicit speech, facial, and motor behaviors relevant to the type of disease being studied. In this work, we focus on tasks that were shared across multiple study protocols for different disease conditions: (a) sentence intelligibility test (SIT), (b) diadochokinesis (DDK), (c) read speech, and (d) a picture description task. For (a), participants were asked to read individual SIT sentences of varying lengths (5-15 words2), while (c) required reading a longer passage (Bamboo reading passage, 99 words). To assess DDK skills (b), participants were asked to repeat a pattern of syllables (/pa ta ka/) as fast as they can until they run out of breath, and (d) prompted users to describe a scene in a picture that was shown to them on screen. These tasks are inspired by previous work [17, 18, 19].

2 In the remainder of the paper, the different SIT sentence lengths are treated as separate tasks and are denoted as SIT_n, where n is the length in words.

3.1. Datasets

An overview of the data used in this study is given in Table 1. While some datasets for a disease may be small, there is a subset of tasks that are shared across research studies. Since the data is collected in the same way (remotely with a personal electronic device), we can create a larger dataset for the healthy population across studies to get a more accurate representation of the properties of normative behavior. For the larger dataset of healthy controls, we identify age-related trends as well as collinearity of features. This information is used to correct control as well as patient feature values from
age effects and remove feature redundancies.

                Participants   Sessions       Mean Age (SD)
  Controls
  Female        408 (63%)      655 (62.8%)    46.3 (16.4)
  Male          240 (37%)      388 (37.2%)    46.2 (16.0)
  All           648            1043           46.3 (16.2)
  Schizophrenia
  Female        10 (24.4%)     19 (26.4%)     36.1 (9.4)
  Male          31 (75.6%)     53 (73.6%)     36.6 (10.1)
  All           41             72             36.5 (9.9)
  Depression
  Female        66 (79.5%)     76 (79.2%)     34.6 (12.1)
  Male          17 (20.5%)     20 (20.8%)     35.0 (10.2)
  All           83             96             34.7 (11.7)
  Bulbar Symptomatic ALS
  Female        38 (48.1%)     67 (46.2%)     61.7 (10.8)
  Male          41 (51.9%)     78 (53.8%)     61.3 (9.0)
  All           79             145            61.5 (9.8)
  Bulbar Presymptomatic ALS
  Female        31 (50%)       54 (50.5%)     58.1 (10.9)
  Male          31 (50%)       53 (49.5%)     62.2 (8.3)
  All           62             107            60.1 (9.9)

Table 1
Cohort demographics. SD: standard deviation.

3.1.1. Schizophrenia

Schizophrenia is a chronic brain disorder that affects approximately 24 million people, or 1 in 300 (1 in 222 among adults),3 worldwide. According to the American Psychiatric Association (APA), active schizophrenia may be characterized by episodes in which the affected individual cannot distinguish between real and unreal experiences.4 Among individuals with schizophrenia, psychiatric and medical comorbidities such as substance abuse, anxiety, and depression are common [20, 21, 22]. Buckley et al. pointed out that depression is estimated to affect half of these patients. These comorbidities, as well as the variation in symptoms and medications, make the identification of multimodal biomarkers for schizophrenia a difficult task.

As can be seen in Table 1, we assessed 41 individuals with a diagnosis of schizophrenia at a state psychiatric facility in New York, NY. The study was approved by the Nathan S. Kline Institute for Psychiatric Research and we obtained written informed consent from all participants at the time of screening, after explaining the details of the study. The assessment of both patients and controls was overseen by a psychiatrist.

3 https://www.who.int/news-room/fact-sheets/detail/schizophrenia, accessed 05/19/2023
4 https://www.psychiatry.org/patients-families/schizophrenia/what-is-schizophrenia, accessed 05/19/2023

3.1.2. Amyotrophic Lateral Sclerosis

ALS is a neurological disease that affects nerve cells in the brain and spinal cord that control voluntary muscle movement. The disease is progressive and there is currently no cure or effective treatment to reverse its progression.5 Global estimates of ALS prevalence range from 1.9 to 6 per 100,000.6 Studies on ALS found comorbidity with dementia, parkinsonism, and depressive symptoms [23]. Diekmann et al. [24] found depression to occur statistically significantly more often in pALS compared to HCs. In addition, Heidari et al. [25] found in a meta-analysis of 46 eligible studies that the pooled prevalence of depression among individuals with ALS was 34%, with mild, moderate, and severe depression rates at 29%, 16%, and 8%, respectively.

As shown in Table 1, data from 79 ALS bulbar symptomatic (BS) and 62 ALS bulbar presymptomatic (BP) patients were collected in cooperation with EverythingALS and the Peter Cohen Foundation.7 In addition to the assessment of speech and facial behavior, participants filled out the ALS Functional Rating Scale-Revised (ALSFRS-R), a standard instrument for monitoring the progression of ALS [26]. The questionnaire comprises 12 questions about physical ability, with each function's rating ranging from normal function (score 4) to severe disability (score 0). It includes four scales for different domains affected by the disorder: bulbar system, fine and gross motor skills, and respiratory function. The ALSFRS-R score is the total of the domain sub-scores, ranging from 0 to 48. For this study, pALS were stratified into the following sub-cohorts based on their bulbar subscore (the sum of the first three ALSFRS-R questions): (a) BS ALS with a bulbar subscore < 12 and (b) BP ALS with a bulbar subscore = 12.

5 https://www.ninds.nih.gov/health-information/disorders/amyotrophic-lateral-sclerosis-als, accessed 05/19/2023
6 https://www.targetals.org/2022/11/22/epidemiology-of-als-incidence-prevalence-and-clusters/, accessed 05/19/2023
7 https://www.everythingals.org/research

3.1.3. Depression

Depression is a common mental health disorder characterized by persistent sadness and lack of interest or pleasure in previously enjoyable activities. In addition, fatigue and poor concentration are common. The effects of depression can be long-lasting or recurrent and can drastically affect a person's ability to lead a fulfilling life. The disorder is one of the most common causes of disability in the world.8 One in six people (16.6%) will experience depression at some point in their lifetime.9

8 https://www.who.int/health-topics/depression, accessed 06/20/2023
9 https://www.psychiatry.org/patients-families/depression/what-is-depression, accessed 06/20/2023
A well-established tool for assessing depression is the Patient Health Questionnaire (PHQ)-8 [27]. The PHQ-8 score ranges from 0 to 24 (a higher score indicates more severe depression symptoms).

We investigated at least moderately severe depression cases, based on a cutoff of PHQ-8 ≥ 15. The data for this study, including the completion of the PHQ-8 questionnaire, was collected through crowd-sourcing, resulting in a sample of 83 individuals who scored at or above this cutoff. Statistics for this cohort are summarized in Table 1.

4. Methods

Our procedure is divided into the following stages: (1) feature extraction, (2) preprocessing, (3) age-correction and sex-normalization, (4) redundancy and effect size analysis, and finally (5) classification (binary and multi-class) and evaluation.

4.1. Multimodal Metrics Extraction

In this and the following sections, we use the following terminology: metric denotes a speech or facial measure in general, and feature denotes a specific combination of a metric extracted from a certain task, e.g. speaking rate for the SIT task.

Both speech and facial metrics were extracted from the audiovisual recordings (overview in Table 2). To extract facial metrics, we used the MediaPipe FaceMesh software.10 More specifically, MediaPipe's Face Detection is based on BlazeFace [28] and determines the (x, y)-coordinates of the face for every frame. Subsequently, 468 facial landmarks are identified using MediaPipe FaceMesh. We selected 14 key landmarks to compute functionals of facial behavior. Distances between landmarks were normalized by dividing them by the inter-caruncular distance. For between- as well as within-subject analyses, when the same position relative to the camera cannot be assumed, Roesler et al. [29] found this to be the most reliable method of normalization. More details and a visual depiction of the landmarks used to calculate facial features can be found in [4]. Speech metrics were computed using Praat [30] and cover different domains, such as energy, timing, voice quality, and frequency.

4.2. Preprocessing

We applied the following approach to handle missing data, which can occur for a number of reasons, including incomplete sessions, technical issues, or network problems. First, on the session level, we removed participant sessions that had more than 15% missing features. Then, on the feature level, we filtered out features with more than 10% missing values. These thresholds were determined empirically. After these removal steps, we imputed remaining missing values with the mean feature values of the respective cohort, in train and test sets separately.

4.3. Age-Correction & Sex-Normalization

Similar to the approach in Falahati et al. [31], we applied a linear correction algorithm to both patient and control data based on age-related changes in the HC cohort. By calculating age trends and coefficients on healthy controls, we aim to obtain the most accurate estimate of purely age-related changes without the confounding effects of disease-related influences. In detail, for each feature, we fit a linear regression model with age as the independent and the feature as the dependent variable, modeling the age-related changes as a linear deviation. This is done separately for males and females to obtain a sex-specific result. Then, the sex-specific regression coefficients are used to correct feature values for age by subtracting the product of coefficient and age from the feature value for each participant. To account for sex-related differences, we applied sex-specific z-scoring to normalize the features. Z-normalization is a methodology that allows for the comparison or compilation of observations from different cohorts [32]. In addition, the normalization process ensures the comparability of features on different scales by centering the feature distributions around zero with a standard deviation of one. First, the dataset to be analyzed was divided into male and female participants. Then, each feature was normalized within each sex group using z-scoring.

4.4. Redundancy Analysis and Effect Sizes

To identify collinear features and reduce the high-dimensional feature space, we performed hierarchical clustering on the Spearman rank-order correlations using the age-corrected and sex-normalized larger healthy control dataset. We applied the clustering to speech and facial features separately. The clustering procedure is motivated by the approach in Ienco and Meo [33]. It is based on Ward's method [34], which aims at minimising within-cluster variance. We implemented it using the scikit-learn library.11 A dendrogram was plotted to inspect the correlations between features visually and to determine a suitable distance threshold for generating feature clusters. The threshold choice was based on two major factors: (a) balance between speech and facial clusters, as we target roughly an equal number to avoid

10 https://google.github.io/mediapipe/
11 https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance_multicollinear.html
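The per-sex age correction and z-normalization of Section 4.3 can be sketched with NumPy as follows. This is a toy, single-sex example on synthetic data, not the authors' code; in the paper, the regression is fit on healthy controls only (separately per sex) and then applied to both controls and patients:

```python
import numpy as np


def fit_age_slope(age, feat):
    """Slope of a least-squares line feat ~ age, fit on healthy controls only."""
    slope, _intercept = np.polyfit(age, feat, deg=1)
    return slope


def age_correct(feat, age, slope):
    """Remove the estimated linear age effect: feat - slope * age."""
    return feat - slope * age


def z_normalize(feat, mean, std):
    """Z-score a feature using statistics from the matching sex group."""
    return (feat - mean) / std


# Toy example for one sex group: an age-dependent feature with slope 0.8.
rng = np.random.default_rng(0)
hc_age = rng.uniform(20, 80, 200)
hc_feat = 150.0 + 0.8 * hc_age + rng.normal(0, 5, 200)

slope = fit_age_slope(hc_age, hc_feat)          # ~0.8, recovered from controls
corrected = age_correct(hc_feat, hc_age, slope)
z = z_normalize(corrected, corrected.mean(), corrected.std())
```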
          Domain            Metrics
  Audio
          Energy            signal-to-noise ratio (SNR, dB)
          Timing            speaking & articulation duration/rate (sec./WPM), percent pause time (PPT, %),
                            canonical timing agreement (CTA, %)
          Specific to DDK   cycle-to-cycle temporal variability (cTV, sec.), syllable rate (syl./sec.), number of syllables
          Voice quality     shimmer (%), harmonics-to-noise ratio (HNR, dB), jitter (%)
          Frequency         mean, min, max & standard deviation (stdev) of fundamental frequency (F0, Hz)
  Video
          Jaw               mean, min & max speed/acceleration/jerk of the jaw center (JC)
          Lower Lip         mean, min & max speed/acceleration/jerk of the lower lip (LL)
          Mouth             mean & max lip aperture, lip width, mouth surface area; mean mouth symmetry ratio
          Eyes              mean & max eye opening

Table 2
Overview of speech and facial metrics.
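As an illustration of the normalization described in Section 4.1, a mouth metric such as lip aperture can be divided by the inter-caruncular distance so that it is comparable across camera distances. This is a schematic sketch; the landmark names are placeholders, not the 14 MediaPipe landmarks actually used:

```python
import math


def dist(p, q):
    """Euclidean distance between two (x, y) landmark coordinates."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def normalized_lip_aperture(landmarks):
    """Vertical lip opening divided by the inter-caruncular distance.

    `landmarks` maps names to (x, y) points; the keys are illustrative only.
    """
    scale = dist(landmarks["left_caruncle"], landmarks["right_caruncle"])
    return dist(landmarks["upper_lip"], landmarks["lower_lip"]) / scale


lm = {
    "left_caruncle": (0.40, 0.35),
    "right_caruncle": (0.60, 0.35),
    "upper_lip": (0.50, 0.60),
    "lower_lip": (0.50, 0.66),
}
print(normalized_lip_aperture(lm))  # ~0.3, invariant to uniform rescaling
```

Because both numerator and denominator scale with distance to the camera, the ratio stays constant when the face moves closer to or farther from the lens.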


   #    Cluster domain                         Metrics                            Tasks                   # Features
   1    Energy                                 SNR                                all                          8
   2    Timing alignment                       CTA                                all                          6
   3    Timing, pauses                         PPT                                all                          5
   4    Timing, speaking (1)                   articulation/speaking duration     Picture Description          2
   5    DDK articulation                       SNR, syl. rate, syl. count & cTV   DDK                          4
   6    Timing, speaking (2)                   articulation/speaking rate/time    SIT_{5,9}                    8
   7    Timing, speaking (3)                   articulation/speaking rate/time    SIT_{7,11,13,15},           21
                                                                                  Reading passage
   8    DDK voice quality                      HNR, jitter & shimmer              DDK                          3
   9    Voice quality (periodicity)            HNR                                all except DDK               8
  10    Voice quality (amplitude variation)    shimmer                            all except DDK               8
  11    Voice quality (frequency variation)    jitter                             all except DDK               8
  12    Frequency (mean, min)                  min & mean F0                      all                         16
  13    Frequency (max, std)                   max & std F0                       all                         16
        Total                                                                                                113

Table 3
Speech feature clusters identified by hierarchical clustering.
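The clustering step of Section 4.4 can be sketched in the style of the scikit-learn multicollinearity example that the paper cites: Spearman correlations are turned into a distance matrix, Ward linkage is computed, and the dendrogram is cut at a distance threshold. The data and threshold below are synthetic and illustrative:

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
# Synthetic stand-in for the feature matrix (sessions x features):
# columns 0-2 are nearly collinear, columns 3-5 are independent.
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.05 * rng.normal(size=(200, 3)),
               rng.normal(size=(200, 3))])

# Spearman rank-order correlation -> distance matrix -> Ward linkage.
corr, _ = spearmanr(X)
corr = (corr + corr.T) / 2           # enforce symmetry
np.fill_diagonal(corr, 1.0)
dist = 1.0 - np.abs(corr)            # highly correlated features are "close"
linkage = hierarchy.ward(squareform(dist, checks=False))

# Cut the dendrogram at an illustrative distance threshold.
cluster_ids = hierarchy.fcluster(linkage, t=0.5, criterion="distance")
print(cluster_ids)  # the three collinear columns share one cluster id
```

In practice, one representative feature per cluster (or a cluster aggregate) can then be carried forward, shrinking the feature space while retaining most of its information.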



predominance of one modality over the other, and (b) expert knowledge about the different task and feature domains (e.g. timing versus voice quality features, jaw versus eye movement, or read versus free speech), which resulted in the clusters shown in Table 3 and Table 4. The clusters are used in the feature selection process as described in Section 4.5.
   Statistical tests to assess the statistical significance, as well as the magnitude and direction of effects for a given comparison, were conducted within classification folds and as part of a post hoc analysis. Effect sizes were calculated using Glass's Delta [35]. Only features showing statistical significance (p < 0.05) in the Mann-Whitney U-test (MWU) were considered.

4.5. Classification

For both the binary and multi-class classification experiments, we used a multilayer perceptron (MLP) implemented with the scikit-learn library. The MLP has one hidden layer; we experimented with adding more hidden layers, but found that the minimal configuration with a single layer was beneficial in terms of performance. The hidden layer size h was determined dynamically as

    h = (f + c) / 2    (1)

where f is the number of selected features and c is the number of classes. The model was trained for a maximum of 10,000 iterations to allow sufficient time for convergence. Training was stopped when the loss or score was no longer improving by a defined tolerance threshold; here, we used scikit-learn's default of 1e-4. Additionally, the alpha parameter, which controls the regularization strength to prevent overfitting, was set to 0.001. The sgd (stochastic gradient descent) solver was used for optimization, with the batch size set to auto, enabling the model to determine an appropriate batch size during training. We used the rectified linear unit as the activation function.
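The configuration above maps directly onto scikit-learn's MLPClassifier. A minimal sketch (not the authors' exact code; the feature count f and class count c are placeholder values, and Eq. (1) is rounded down to an integer):

```python
# Sketch of the MLP configuration described above (placeholder f and c).
from sklearn.neural_network import MLPClassifier

f, c = 20, 4                       # e.g. 20 selected features, 4 classes
h = (f + c) // 2                   # hidden layer size, Eq. (1), rounded down

clf = MLPClassifier(
    hidden_layer_sizes=(h,),       # a single hidden layer
    activation="relu",             # rectified linear unit
    solver="sgd",                  # stochastic gradient descent
    alpha=1e-3,                    # L2 regularization strength
    batch_size="auto",             # batch size chosen by the library
    max_iter=10_000,               # generous iteration budget
    tol=1e-4,                      # scikit-learn's default stopping tolerance
)
```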
            #    Cluster domain         Metrics                                   Tasks                   # Features
            1    Lip movement (1)       speed, acc. & jerk measures               all except DDK             95
            2    Lip width              mean & max lip width                      all                        18
            3    Mouth opening          mean & max lip aperture,                  all                        36
                                        mouth surface area
            4    Lip movement (2)       speed, acc. & jerk metrics                DDK                        12
            5    Jaw movement (1)       speed, acc. & jerk metrics                DDK                        12
            6    Jaw movement (2)       speed, acc. & jerk metrics                SIT_7                      12
            7    Jaw movement (3)       speed, acc. & jerk metrics                SIT_5                      12
            8    Jaw movement (4)       min + max speed, acc. & jerk metrics      Picture Description        9
            9    Jaw movement (5)       speed, acc. & jerk metrics                SIT_{9,11,13,15}, RP,      63
                                                                                  Picture Description
           10    Mouth symmetry         mean mouth symmetry                       all                         9
           11    Eye opening            mean and max eye opening                  all                         18
                                                                                                  Total     296
Table 4
Facial feature clusters identified by hierarchical clustering. RP: reading passage.
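The groupings in Tables 3 and 4 combine hierarchical clustering with expert knowledge. Purely as illustration, correlation-based hierarchical clustering of collinear features can be sketched as follows; the feature names, the synthetic data, and the dendrogram cut-off of 0.5 are our hypothetical choices, not values from the study:

```python
# Illustrative sketch: hierarchically cluster features by |rank correlation|.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)),
                 columns=["speaking_rate", "articulation_rate", "jitter", "shimmer"])
X["articulation_rate"] += X["speaking_rate"]        # induce collinearity

corr = X.corr(method="spearman").abs()
dist = squareform(1.0 - corr.values, checks=False)  # distance = 1 - |rho|
Z = linkage(dist, method="average")
labels = fcluster(Z, t=0.5, criterion="distance")   # cut the dendrogram at 0.5
clusters = {c: list(X.columns[labels == c]) for c in np.unique(labels)}
```

Features whose pairwise correlation exceeds the cut-off land in the same cluster; a single representative per cluster can then be kept, as described in the feature selection step.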



   Ten-fold cross-validation was applied for evaluation in order to maximize the utilization of data for both training and testing. To avoid bias towards the majority group, we created datasets that consist of an equal number of samples in each disease condition. For each individual participant, we consider, if available, the first two sessions as data points. Because of this equality constraint, the number of data points was limited by the smallest dataset (schizophrenia). This resulted in 72 randomly selected data points per cohort, summing to a total of 360 data points. The classification experiments were run ten times to smooth out performance variations and obtain more representative results. We split the data using scikit-learn's StratifiedGroupKFold to ensure that sessions from the same participant fall entirely within either the training or the testing fold. In each fold, we imputed missing values and standardized features by sex using z-scoring; this was done separately for the training and test sets.
   As a benchmark, we evaluated the binary classification performance of models aimed at distinguishing cases with a disorder from controls. Here, for each cluster of collinear features as described in Section 4.4, the feature with the highest effect size was selected for the final feature set used as input to the classifier. If no feature in a given cluster showed statistically significant differences between cases and controls, no feature was selected from that cluster. Hence, the number of clusters determines the maximum number of features fed into the classifier. Statistical significance and effect sizes for each feature were calculated as described in the previous section.
   In a second step, we performed 4-class classification, incorporating all the investigated neurological and mental disorders. Here, feature selection was done based on pairwise comparisons of all disease cohorts (e.g. Depression vs. Schizophrenia cases, Schizophrenia vs. BS ALS cases, BS ALS vs. Depression cases, and so on). We merged the selected features from these comparisons as input to the classifier; therefore, multiple features from the same cluster could be included in one feature set. We allowed a certain amount of redundancy compared to the case-control baseline in order to account for the complexity associated with multiple comparisons. For both experiments, classification performance was evaluated in terms of F1 score, sensitivity, and specificity.


5. Results

5.1. Binary Classification Baseline

 Cohort             Speech     Facial     Speech + Facial
                      F1         F1      F1    SEN     SP
 DEP vs. HC          0.64       0.59    0.65   0.65   0.65
 SCHIZ vs. HC        0.82       0.64    0.83   0.85   0.82
 BP ALS vs. HC       0.54       0.51    0.52   0.52   0.53
 BS ALS vs. HC       0.84       0.63    0.83   0.82   0.83

Table 5
Binary classification results. In each row, the highest F1 score is highlighted.
HC: Healthy Controls, DEP: Depression, SCHIZ: Schizophrenia, SEN: Sensitivity, SP: Specificity

   As can be seen in Table 5, we observe good performance in classifying controls versus BS ALS (speech features alone; F1-score: 0.84) and schizophrenia (combined speech and facial; F1-score: 0.83) cases, respectively. The binary classification of depression did not perform as well; however, it still surpassed the random chance baseline (combined speech and facial; F1-score: 0.65). The classifier struggled to distinguish controls from BP ALS cases, where we observed performance just above
random chance across modalities. Furthermore, the performance with regard to sensitivity and specificity is relatively balanced across comparisons.
   In depression and schizophrenia, combining speech and facial modalities resulted in improved classification performance compared to speech or facial features alone, as shown in Table 5. However, adding facial information did not enhance performance for the BP and BS ALS cohorts compared to utilizing speech features alone.

5.2. Multi-Class Classification

 Cohort       Speech      Facial     Speech + Facial
                F1          F1       F1   SEN     SP
 SCHIZ         0.72        0.53     0.72  0.72   0.91
 BP ALS        0.55        0.36     0.57  0.57   0.86
 BS ALS        0.62        0.47     0.64  0.65   0.88
 DEP           0.61        0.46     0.64  0.64   0.88
 Average       0.63        0.46     0.64  0.65   0.88

Table 6
Multi-class classification results. In each row, the highest F1 score is highlighted.
HC: Healthy Controls, DEP: Depression, SCHIZ: Schizophrenia, SEN: Sensitivity, SP: Specificity

Figure 2: Normalized confusion matrix for 4-class classification. The x-axis shows the true labels, the y-axis the predicted ones.

   In the 4-class experiment aimed at discriminating between all investigated neurological and mental disorders, we achieve the best overall performance (F1-score: 0.64) by utilizing both speech and facial features, as shown in Table 6. Overall, the specificity (average: 0.88) for the disorders examined is considerably higher than the sensitivity (average: 0.65). This indicates that the classifier is more effective at avoiding false-positive results than at identifying true positives. In most cases, namely for BS ALS, BP ALS and depression, the per-class F1-score is highest when combining speech and facial features. There is no performance difference between using only speech or speech and facial features for identifying schizophrenia. Figure 2 shows a confusion matrix that indicates the percentage of accurate class predictions and the classes with which they were confused. The model was most confident in detecting schizophrenia (72.22%), followed by BS ALS (64.58%) and depression (63.75%). The model faced its greatest challenge in accurately predicting BP ALS (57.22%), yet it still performs notably above chance in a 4-class classification scenario. BP ALS and depression cases were most often confused with each other. Schizophrenic patients were least often confused with other cohorts. Among the cases of BS ALS, the most frequent confusion occurred with BP ALS patients (16.11%).
   The features that we identified as consistently chosen across classification folds (Table 7) are predominantly speech features from the timing, voice quality, and energy domains. In addition, two facial features are selected across folds, concerning the maximum lip width and the maximum absolute acceleration of jaw movements. We conducted a post hoc analysis of effect sizes between HC and cases with a disorder for these features to gain further insight into disorder-specific importance. Here, positive effect sizes represent feature values that are larger for cases with a disorder than for controls. Conversely, negative values represent larger feature values for controls than for cases with a disorder¹². In schizophrenia, we find all of the features consistently selected across classification folds to be statistically significant when compared to HC. With respect to the other cohorts, the largest effects are shown for CTA (-1.44 for SIT_13) and speaking rate (-2.00 for RP). This shows that patients exhibit a lower CTA, a measure of phonetic alignment between their own speech and that of the virtual guide, while speaking more slowly. We also observed a smaller average lip width as an important feature that shows the largest effect between HC and depression cases compared to the other cohorts. This may be associated with decreased emotional expressivity, as indicated by reduced smiling and increased frowning. These findings align with previous studies highlighting similar patterns of emotional expressiveness in depression [37, 38]. Few and small differences compared to controls are revealed for BP ALS cases. This is also the cohort with the lowest performance across classification experiments. In BS ALS, we found the largest effects for SNR and speaking rate. Another feature that stood out is cTV in the DDK task, a measure that captures the temporal variability, i.e. the consistency or irregularity in the timing of speech patterns, between consecutive cycles of speech.

¹² We follow the commonly used effect size magnitude thresholds suggested in Cohen [36] – small: 0.2–0.5, medium: 0.5–0.8, and large: > 0.8.
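The significance-gated effect sizes reported in Table 7 follow the recipe described in Section 4.4: Glass's Delta is computed only when the Mann-Whitney U-test is significant. A minimal sketch (the function name and the synthetic data are ours, not from the study):

```python
# Sketch: report Glass's Delta only if a Mann-Whitney U test is significant.
import numpy as np
from scipy.stats import mannwhitneyu

def gated_glass_delta(cases: np.ndarray, controls: np.ndarray, alpha: float = 0.05):
    """Return Glass's Delta (cases vs. controls), or None if MWU p >= alpha."""
    _, p = mannwhitneyu(cases, controls, alternative="two-sided")
    if p >= alpha:
        return None                      # not significant: no effect size reported
    # Glass's Delta: mean difference scaled by the control group's SD
    return (cases.mean() - controls.mean()) / controls.std(ddof=1)

rng = np.random.default_rng(1)
controls = rng.normal(0.0, 1.0, size=80)
cases = rng.normal(1.0, 1.0, size=80)    # clearly shifted distribution
delta = gated_glass_delta(cases, controls)
```

Positive values of the returned delta mean the feature is larger in cases than in controls, matching the sign convention used in the table.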
                                                                           Effect sizes (HC vs. disorder cases)
 Features                    Modality     Cluster domain                  SCHIZ BP ALS BS ALS               DEP
 max abs acc. JC (RP)        Facial       Jaw movement                     -0.51            -       N.S         -
 max lip width (SIT 11)      Facial       Lip width                         -0.35           -       0.31 -0.44
 shimmer (DDK)               Speech       Voice quality                      0.35           -     -0.63         -
 shimmer (SIT 5)             Speech       Voice quality                     0.97            -      -0.31        -
 jitter (SIT 9)              Speech       Voice quality                      0.43       -0.20     -0.48      0.26
 CTA (SIT 13)                Speech       Timing alignment                 -1.44            -      -1.16    -0.31
 SNR (DDK)                   Speech       Energy                             1.88           -      2.43         -
 speaking rate (RP)          Speech       Timing, speaking                 -2.00            -      -1.84        -
 speaking rate (SIT 7)       Speech       Timing, speaking                  -0.73       -0.31     -1.25      0.59
 HNR (DDK)                   Speech       Voice quality                     1.01            -       0.86    -0.30
 HNR (SIT 15)                Speech       Voice quality                     0.94            -       0.75        -
 cTV (DDK)                   Speech       Energy & articulation skills       0.39           -      1.82      0.43
Table 7
Features selected across all multi-class classification CV folds (considering the 4 disorders) and their effect sizes as calculated between the healthy control and disorder cohorts. In each row, the largest effect size is highlighted; effect sizes were only calculated in case of statistical significance.
HC: healthy controls, SCHIZ: schizophrenia, BS: bulbar symptomatic, BP: bulbar pre-symptomatic, DEP: depression, JC: jaw
center, RP: reading passage



While many features are shared in terms of indicating a signal between cases with a disorder and controls, it is mostly the magnitude of the effect that differentiates them, as well as how they combine. However, there are also a few features that show a different direction of effect across cohorts. For example, in BS ALS, compared to other cohorts, we observed the largest effect for shimmer (DDK, -0.63), which measures cycle-to-cycle variation in the amplitude of the speech signal produced by the vocal folds. There is no effect observed for the BP ALS or depression cohorts, while in schizophrenia, the direction of effect is the opposite (0.35).


6. Discussion

We explored the utility of speech and facial features extracted by a multimodal dialog system for the differential classification of ALS, depression and schizophrenia. Note that the idea here is not to replace clinicians, but to provide effective and assistive tools that can help improve their efficiency, speed and accuracy. Overall, combining speech and facial information proved to be beneficial for identifying several disorders in both multi-class and binary classification experiments. In addition, our automated feature analysis indicates several features that show relevance across experiments. While some of these features are intuitively identifiable by human experts as markers of a given disorder (for example, a slower speaking rate or lower intelligibility), such an analysis also allows discovery of other features that might be harder to detect or identify objectively by human experts, such as quicker facial movements.
   That being said, we acknowledge the importance of contextualizing the promise of such multimodal methodologies for differential diagnosis with several caveats. First, the performance of any machine learning classifier trained for this purpose will depend on the specific conditions being studied and the range and heterogeneity of symptoms presented in each case. For example, in this study we investigated four specific conditions – schizophrenia, depression, bulbar symptomatic (BS) and bulbar presymptomatic (BP) ALS – and we observed that schizophrenia (where the facial modality is particularly good at capturing characteristics exhibited therein such as anhedonia, blunted affect, etc.) and BS ALS (which is characterized by speech motor deficits, reflected in the timing, rate and intelligibility of speech), quite different in terms of symptom presentation, exhibit greater separability relative to other classes for differential classification. For both BS ALS and schizophrenia, our analysis demonstrates a robust discriminatory capability to effectively distinguish these cohorts from healthy controls, as well as from other neurological and mental disorders, in binary and multi-class experiments. However, the overall higher specificity of the multi-class classifier implies a robust capability to accurately identify non-cases, effectively minimizing false positives. Yet, the lower sensitivity suggests limitations in the identification of true cases for the analyzed disorders, likely due to the imposed strong restrictions. In BS ALS, speech features alone demonstrate superior performance when comparing this group with controls. Yet, in the more intricate task of differential diagnosis, performance improves when speech features are combined with facial information. For schizophrenia, the combination of speech and facial modalities proves most effective in both binary and multi-class experiments.
In contrast, BP ALS, which does not present with as many speech and facial motor deficits, is much less separable even in binary classification, let alone in the multi-class classification context, highlighting the challenging nature of detecting this condition. Furthermore, for the misidentified BS ALS cases, the classifier most frequently categorized them as BP ALS. Although distinguishing BP ALS cases from controls is challenging, this outcome indicates that the classifier may be able to capture condition-specific information from features that are shared across different stages of ALS, which may have led to this confusion. Finally, in evaluating depression, the best performance in both binary and multi-class classification experiments is achieved by combining speech and facial information. The overall accuracy in discerning depression from other cohorts is notably lower compared to schizophrenia or BS ALS. The variability introduced by the wide range and time horizon of potential symptoms present in depression, as well as medication status, might contribute to lower differential diagnosis accuracy. That being said, a significant limitation of the present study is the lack of information about co-morbidities to factor into our analysis, since the datasets were collected independently. Future research will aim to explicitly address this gap by capturing, for instance, information about co-morbid depression in ALS or schizophrenia (e.g., through PHQ-8 scales), which might help us better stratify these cohorts.
   Second, this study focused on a restricted set of tasks, primarily reading abilities and picture description assessments. These task-feature combinations alone may not fully capture the nuances of each disorder.
   Third, while we focused on interpretable features in this study, less interpretable ones, such as log mel spectrograms or Mel Frequency Cepstral Coefficients (MFCCs), may be able to capture more nuanced and complex patterns in the data. Additionally, more sophisticated deep learning approaches for representation learning could be applied, such as ResNet-50 [39] in the facial modality. While such features can be powerful in capturing subtle details and nuances of audiovisual behavior, the inner workings of a deep learning model are not easily explainable or interpretable by non-experts.
   Fourth, our sample size is not representative enough to truly claim generalizability of findings. The smaller the sample, the larger the risk of model “blind spots” that in turn lead to variable estimates of true model performance on unseen real-world data, giving algorithm designers an inaccurate sense of how well a model is performing during development [40].
   Our results argue for the importance of a hybrid approach to differential diagnosis going forward, combining knowledge-driven and data-driven approaches. Understanding specific disease pathologies and symptoms can in turn help in developing features and learning methodologies that lead to better separability of disease conditions. Future work will also focus on improving differential diagnosis performance in a manner that is both generalizable and explainable.


Acknowledgments

This work was funded in part by National Institutes of Health grant R42DC019877. We thank all study participants for their time, and we gratefully acknowledge the contribution of the Peter Cohen Foundation and EverythingALS towards participant recruitment and data collection for the ALS corpus, and of Anzalee Khan and Jean-Pierre Lindenmayer at the Manhattan Psychiatric Center – Nathan Kline Institute for the schizophrenia corpus.


References

 [1] V. Feigin, E. Nichols, T. Alam, M. Bannick, E. Beghi, N. Blake, W. Culpepper, E. Dorsey, A. Elbaz, R. Ellenbogen, J. Fisher, C. Fitzmaurice, G. Giussani, L. Glennie, S. James, C. Johnson, N. Kassebaum, G. Logroscino, B. Marin, T. Vos, Global, regional, and national burden of neurological disorders, 1990-2016: a systematic analysis for the Global Burden of Disease Study 2016, The Lancet Neurology 18 (2019) 459–480. doi:10.1016/S1474-4422(18)30499-X.
 [2] V. Ramanarayanan, A. C. Lammert, H. P. Rowe, T. F. Quatieri, J. R. Green, Speech as a biomarker: Opportunities, interpretability, and challenges, Perspectives of the ASHA Special Interest Groups 7 (2022) 276–283.
 [3] M. Neumann, O. Roesler, J. Liscombe, H. Kothare, D. Suendermann-Oeft, J. D. Berry, E. Fraenkel, R. Norel, A. Anvar, I. Navar, A. V. Sherman, J. R. Green, V. Ramanarayanan, Multimodal dialog based speech and facial biomarkers capture differential disease progression rates for ALS remote patient monitoring, in: Proceedings of the 32nd International Symposium on Amyotrophic Lateral Sclerosis and Motor Neuron Disease, Virtual, 2021.
 [4] V. Richter, J. Cohen, M. Neumann, D. Black, A. Haq, J. Wright-Berryman, V. Ramanarayanan, A multimodal dialog approach to mental state characterization in clinically depressed, anxious, and suicidal populations, Frontiers in Psychology 14 (2023). URL: https://www.frontiersin.org/articles/10.3389/fpsyg.2023.1135469. doi:10.3389/fpsyg.2023.1135469.
 [5] M. Neumann, O. Roesler, J. Liscombe, H. Kothare, D. Suendermann-Oeft, D. Pautler, I. Navar, A. Anvar, J. Kumm, R. Norel, E. Fraenkel, A. Sherman, J. Berry, G. Pattee, J. Wang, J. Green, V. Ramanarayanan,
     Investigating the utility of multimodal conversational technology and audiovisual analytic measures for the assessment and monitoring of amyotrophic lateral sclerosis at scale, 2021, pp. 4783–4787. doi:10.21437/Interspeech.2021-1801.
 [6] V. Richter, M. Neumann, H. Kothare, O. Roesler, J. Liscombe, D. Suendermann-Oeft, S. Prokop, A. Khan, C. Yavorsky, J.-P. Lindenmayer, V. Ramanarayanan, Towards multimodal dialog-based speech & facial biomarkers of schizophrenia, in: Companion Publication of the 2022 International Conference on Multimodal Interaction, ICMI '22 Companion, Association for Computing Machinery, New York, NY, USA, 2022, pp. 171–176. URL: https://doi.org/10.1145/3536220.3558075. doi:10.1145/3536220.3558075.
 [7] H. Kothare, M. Neumann, J. Liscombe, O. Roesler, W. Burke, A. Exner, S. Snyder, A. Cornish, D. Habberstad, D. Pautler, D. Suendermann-Oeft, J. Huber, V. Ramanarayanan, Statistical and clinical utility of multimodal dialogue-based speech and facial metrics for Parkinson's disease assessment, 2022, pp. 3658–3662. doi:10.21437/Interspeech.2022-11048.
 [8] N. Cummins, J. Joshi, A. Dhall, V. Sethu, R. Goecke, J. Epps, Diagnosis of depression by behavioural signals: A multimodal approach, in: Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge, AVEC '13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 11–20. URL: https://doi.org/10.1145/2512530.2512535. doi:10.1145/2512530.2512535.
 [9] J. Robin, M. Xu, A. Balagopalan, J. Novikova, L. Kahn, A. Oday, M. Hejrati, S. Hashemifar, M. Negahdar, W. Simpson, E. Teng, Automated detection of progressive speech changes in early Alzheimer's disease, Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring 15 (2023) e12445. doi:10.1002/dad2.12445.
[10] J. Hlavnička, R. Čmejla, T. Tykalová, K. Šonka, E. Růžička, J. Rusz, Automated analysis of connected speech reveals early biomarkers of Parkinson's disease in patients with rapid eye movement sleep behaviour disorder, Scientific Reports 7 (2017). URL: https://api.semanticscholar.org/CorpusID:19272861.
[11] G. Stegmann, S. Charles, J. Liss, J. Shefner, S. Rutkove, V. Berisha, A speech-based prognos-
     burg, G. L. Pattee, J. D. Berry, E. A. Macklin, E. P. Pioro, R. A. Smith, Additional evidence for a therapeutic effect of dextromethorphan/quinidine on bulbar motor function in patients with amyotrophic lateral sclerosis: A quantitative speech analysis, British Journal of Clinical Pharmacology 84 (2018) 2849–2856.
[13] T. Altaf, S. M. Anwar, N. Gul, M. N. Majeed, M. Majid, Multi-class Alzheimer's disease classification using image and clinical features, Biomedical Signal Processing and Control 43 (2018) 64–74. URL: https://www.sciencedirect.com/science/article/pii/S1746809418300508. doi:10.1016/j.bspc.2018.02.019.
[14] L. Hansen, R. Rocca, A. Simonsen, et al., Speech- and text-based classification of neuropsychiatric conditions in a multidiagnostic setting, Nature Mental Health (2023). doi:10.1038/s44220-023-00152-7.
[15] E. Emre, Erol, C. Taş, N. Tarhan, Multi-class classification model for psychiatric disorder discrimination, International Journal of Medical Informatics 170 (2023) 104926. URL: https://www.sciencedirect.com/science/article/pii/S1386505622002404. doi:10.1016/j.ijmedinf.2022.104926.
[16] D. Suendermann-Oeft, A. Robinson, A. Cornish, D. Habberstad, D. Pautler, D. Schnelle-Walka, F. Haller, J. Liscombe, M. Neumann, M. Merrill, O. Roesler, R. Geffarth, NEMSI: A multimodal dialog system for screening of neurological or mental conditions, in: Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, IVA '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 245–247. URL: https://doi.org/10.1145/3308532.3329415. doi:10.1145/3308532.3329415.
[17] A. K. Silbergleit, A. F. Johnson, B. H. Jacobson, Acoustic analysis of voice in individuals with amyotrophic lateral sclerosis and perceptually normal vocal quality, Journal of Voice 11 (1997) 222–231.
[18] B. Tomik, R. J. Guiloff, Dysarthria in amyotrophic lateral sclerosis: A review, Amyotrophic Lateral Sclerosis 11 (2010) 4–15.
[19] M. Novotny, J. Melechovsky, K. Rozenstoks, T. Tykalova, P. Kryze, M. Kanok, J. Klempir, J. Rusz, Comparison of automated acoustic methods for oral diadochokinesis assessment in amyotrophic lateral sclerosis, Journal of speech, language, and
     tic model for dysarthria progression in als, Amy-            hearing research : JSLHR 63 (2020) 3453–3460.
     otrophic lateral sclerosis & frontotemporal degen-           doi:10.1044/2020_JSLHR-20-00109.
     eration (2023) 1–6. URL: https://doi.org/10.1080/       [20] P. Buckley, B. Miller, D. Lehrer, D. Castle, Psychi-
     21678421.2023.2222144. doi:10.1080/21678421.                 atric comorbidities and schizophrenia, Schizophre-
     2023.2222144, advance online publication.                    nia bulletin 35 (2008) 383–402. doi:10.1093/
[12] J. R. Green, K. M. Allison, C. Cordella, B. D. Rich-         schbul/sbn135.
[21] A. I. Green, C. M. Canuso, M. J. Brenner, J. D. Woj-            health monitoring agent, in: Companion Pub-
     cik, Detection and management of comorbidity in                 lication of the 2022 International Conference on
     patients with schizophrenia, Psychiatric Clinics 26             Multimodal Interaction, ICMI ’22 Companion, As-
     (2003) 115–139.                                                 sociation for Computing Machinery, New York,
[22] G. B. Cassano, S. Pini, M. Saettoni, P. Rucci,                  NY, USA, 2022, p. 160–165. URL: https://doi.org/
     L. Dell’Osso, Occurrence and clinical correlates of             10.1145/3536220.3558071. doi:10.1145/3536220.
     psychiatric comorbidity in patients with psychotic              3558071.
     disorders, Journal of Clinical Psychiatry 59 (1998)        [30] P. Boersma, V. Van Heuven, Speak and unspeak
     60–68.                                                          with praat, Glot International 5 (2001) 341–347.
[23] S. Körner, K. Kollewe, J. Ilsemann, A. Karch, R. Den-      [31] F. Falahati, D. Ferreira, J.-S. Muehlboeck, H. Soini-
     gler, K. Krampfl, S. Petri, Prevalence and prognostic           nen, P. Mecocci, B. Vellas, M. Tsolaki, I. Kłoszewska,
     impact of comorbidities in amyotrophic lateral scle-            C. Spenger, S. Lovestone, M. Eriksdotter, L.-O.
     rosis, European journal of neurology : the official             Wahlund, A. Simmons, E. Westman, The effect
     journal of the European Federation of Neurological              of age correction on multivariate classification in
     Societies 20 (2012). doi:10.1111/ene.12015.                     alzheimer’s disease, with a focus on the characteris-
[24] K. Diekmann, M. Kuźma-Kozakiewicz, M. Pi-                       tics of incorrectly and correctly classified subjects,
     otrkiewicz, M. Gromicho, J. Grosskreutz, P. M.                  Brain Topography In-press (2016). doi:10.1007/
     Andersen, M. de carvalho, H. Uysal, A. Osman-                   s10548-015-0455-1.
     ovic, O. Schreiber-Katz, S. Petri, S. Körner, Im-          [32] J.-P. Guilloux, M. Seney, N. Edgar, E. Sibille, In-
     pact of comorbidities and co-medication on dis-                 tegrated behavioral z-scoring increases the sensi-
     ease onset and progression in a large german als                tivity and reliability of behavioral phenotyping in
     patient group, Journal of Neurology 267 (2020).                 mice: Relevance to emotionality and sex, Jour-
     doi:10.1007/s00415-020-09799-z.                                 nal of neuroscience methods 197 (2011) 21–31.
[25] M. E. Heidari, J. Nadali, A. Parouhan, M. Azarafraz,            doi:10.1016/j.jneumeth.2011.01.019.
     S. M. Tabatabai, S. S. N. Irvani, F. Eskandari,            [33] D. Ienco, R. Meo, Exploration and reduction of
     A. Gharebaghi, Prevalence of depression among                   the feature space by hierarchical clustering, in:
     amyotrophic lateral sclerosis (als) patients: A sys-            Proceedings of the 2008 SIAM International Con-
     tematic review and meta-analysis, Journal of affec-             ference on Data Mining, SIAM, 2008, pp. 577–587.
     tive disorders 287 (2021) 182–190. doi:10.1016/j.          [34] J. H. Ward, Hierarchical grouping to optimize an ob-
     jad.2021.03.015.                                                jective function, Journal of the American Statistical
[26] J. M. Cedarbaum, N. Stambler, E. Malta, C. Fuller,              Association 58 (1963) 236–244.
     D. Hilt, B. Thurmond, A. Nakanishi, The alsfrs-r:          [35] K. Hopkins, G. Glass, Basic Statistics for the Behav-
     a revised als functional rating scale that incorpo-             ioral Sciences, Prentice-Hall, Englewood Cliffs, N.J.,
     rates assessments of respiratory function, Jour-                1978.
     nal of the Neurological Sciences 169 (1999) 13–            [36] J. Cohen, Statistical Power Analysis for the Behav-
     21. URL: https://www.sciencedirect.com/science/                 ioral Sciences, 2nd ed., Lawrence Erlbaum Asso-
     article/pii/S0022510X99002105. doi:https://doi.                 ciates, Publishers, Hillsdale, NJ, 1988.
     org/10.1016/S0022-510X(99)00210-5.                         [37] S. Scherer, G. Stratou, G. Lucas, M. Mahmoud,
[27] K. Kroenke, T. W. Strine, R. L. Spitzer, J. B. Williams,        J. Boberg, J. Gratch, L.-P. Morency, Automatic audio-
     J. T. Berry, A. H. Mokdad, The phq-8 as a mea-                  visual behavior descriptors for psychological dis-
     sure of current depression in the general popula-               order analysis, Image and Vision Computing 32
     tion, Journal of Affective Disorders 114 (2009) 163–            (2014) 648–658.
     173. URL: https://www.sciencedirect.com/science/           [38] S. Sorg, C. Vögele, N. Furka, A. Meyer, Perseverative
     article/pii/S0165032708002826. doi:https://doi.                 thinking in depression and anxiety, Frontiers in Psy-
     org/10.1016/j.jad.2008.06.026.                                  chology 3 (2012). URL: https://www.frontiersin.org/
[28] V. Bazarevsky, Y. Kartynnik, A. Vakunov, K. Raveen-             articles/10.3389/fpsyg.2012.00020. doi:10.3389/
     dran, M. Grundmann, Blazeface: Sub-millisecond                  fpsyg.2012.00020.
     neural face detection on mobile gpus, CoRR                 [39] B. Li, D. Lima, Facial expression recognition via
     abs/1907.05047 (2019). URL: http://arxiv.org/abs/               resnet-50, International Journal of Cognitive Com-
     1907.05047. arXiv:1907.05047.                                   puting in Engineering 2 (2021). doi:10.1016/j.
[29] O. Roesler, H. Kothare, W. Burke, M. Neumann,                   ijcce.2021.02.002.
     J. Liscombe, A. Cornish, D. Habberstad, D. Paut-           [40] V. Berisha, C. Krantsevich, P. R. Hahn, S. Hahn,
     ler, D. Suendermann-Oeft, V. Ramanarayanan, Ex-                 G. Dasarathy, P. Turaga, J. Liss, Digital medicine
     ploring facial metric normalization for within-                 and the curse of dimensionality, NPJ digital
     and between-subject comparisons in a multimodal                 medicine 4 (2021) 153.