=Paper= {{Paper |id=Vol-3649/Paper19 |storemode=property |title=Towards remote differential diagnosis of mental and neurological disorders using automatically extracted speech and facial features |pdfUrl=https://ceur-ws.org/Vol-3649/Paper19.pdf |volume=Vol-3649 |authors=Vanessa Richter,Michael Neumann,Vikram Ramanarayanan |dblpUrl=https://dblp.org/rec/conf/aaai/RichterNR24 }} ==Towards remote differential diagnosis of mental and neurological disorders using automatically extracted speech and facial features== https://ceur-ws.org/Vol-3649/Paper19.pdf
                                Towards Remote Differential Diagnosis of Mental and
                                Neurological Disorders using Automatically Extracted
                                Speech and Facial Features
                                Vanessa Richter1,† , Michael Neumann1 and Vikram Ramanarayanan1,2,*
1 Modality.AI, Inc., San Francisco, CA 94105, United States
2 University of California, San Francisco, CA 94127, United States


                                                   Abstract
Utilizing computer vision and speech signal processing to assess neurological and mental conditions remotely has the
potential to help detect diseases earlier and monitor their progression more accurately. Multimodal features have
demonstrated usefulness in distinguishing cases with a disorder from controls across several health conditions. However,
challenges arise in distinguishing between specific disorders during differential diagnosis, where shared
characteristics among different disorders may complicate accurate classification. Our aim in this study was to evaluate the
utility and accuracy of automatically extracted speech and facial features for differentiating between multiple disorders in
a multi-class (differential diagnosis) setting using a machine learning classifier. We use datasets comprising people with
depression, bulbar- and limb-onset amyotrophic lateral sclerosis (ALS), and schizophrenia, in addition to healthy controls.
The data was collected in a real-world scenario with a multimodal dialog system, in which a virtual guide walked participants
through a set of tasks that elicit speech and facial behavior. Our study demonstrates the utility of digital speech and facial
biomarkers in assessing neurological and mental disorders for differential diagnosis. Furthermore, this research emphasizes
the importance of combining information derived from multiple modalities for a more comprehensive understanding and
classification of disorders.

                                                   Keywords
                                                   differential diagnosis, multi-class, mental disorders, neurological disorders, depression, schizophrenia, amyotrophic lateral
                                                   sclerosis, digital biomarkers, dialog system, speech, facial, multimodal



1. Introduction

One out of eight individuals in the world lives with a mental health disorder, but most people do not have access to effective care.1 Moreover, disorders of the nervous system are the second leading cause of death globally [1].

The development of clinically valid digital biomarkers for neurological and mental disorders that can be automatically extracted could significantly improve patients' lives. This advancement has the potential to assist clinicians in achieving quicker and more reliable diagnoses by providing fast and objective insights into a patient's state. Note that the idea here is not to replace the clinician, but to provide effective, assistive tools that can help improve their efficiency, speed, and accuracy.

Many speech and facial features have been shown to be useful in differentiating between mental and neurological disorders and healthy controls (HCs) [2]. However, it remains unclear how distinctly these features characterize a given disorder. For example, percent pause time (PPT) has been found to differ significantly between people with ALS (pALS) and HCs [3] as well as between people with depression symptoms and HCs [4]. Furthermore, a slower speaking rate differentiates pALS [5] as well as people with schizophrenia [6] from HCs. To assess the utility of automatically computed digital biomarkers to capture specific disease attributes despite such shared characteristics, we aim to answer the following questions:

    1. How accurately can a machine learning (ML) classifier differentially distinguish between multiple disorders – depression, schizophrenia, bulbar symptomatic ALS, and bulbar presymptomatic ALS?

    2. Which modalities and features are most useful for this multi-class classification task – overall and with respect to a given disorder – and how does that compare to a binary classification baseline (controls versus cases in each of the investigated health conditions)?

Machine Learning for Cognitive and Mental Health Workshop (ML4CMH), AAAI 2024, Vancouver, BC, Canada
* Corresponding author.
† Vanessa Richter performed the work described in this paper when she was an intern at Modality.AI.
$ vikram.ramanarayanan@modality.ai (V. Ramanarayanan)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
1 https://www.who.int/news-room/fact-sheets/detail/mental-disorders, accessed 11/7/2022


2. Related Work

Recently, digital speech and facial features have been shown to yield statistically significant differences between cases with neurological or mental disorders and healthy controls, to exhibit high specificity and sensitivity in discriminating between those groups, or to show high potential for monitoring disease progression and treatment effects [2, 3, 6, 7, 8, 9, 10, 11, 12].
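Timing features such as PPT and speaking rate are derived from time-aligned speech and pause intervals. A minimal sketch of both computations (the segment format and function names are hypothetical, for illustration only):

```python
def percent_pause_time(segments):
    """Percent pause time (PPT): pause duration as a share of total duration.

    `segments` is a list of (label, start_s, end_s) tuples, where label is
    "speech" or "pause" -- a hypothetical format for illustration.
    """
    pause = sum(end - start for label, start, end in segments if label == "pause")
    total = segments[-1][2] - segments[0][1]
    return 100.0 * pause / total


def speaking_rate_wpm(n_words, segments):
    """Words per minute over the full task duration, pauses included."""
    total_s = segments[-1][2] - segments[0][1]
    return n_words / (total_s / 60.0)


segs = [("speech", 0.0, 2.0), ("pause", 2.0, 2.5), ("speech", 2.5, 4.0)]
print(percent_pause_time(segs))     # 12.5 (percent)
print(speaking_rate_wpm(12, segs))  # ~180 WPM
```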
Several studies have evaluated the detection of neurological and mental disorders in multi-class classification settings as compared to binary case-control studies [13, 14, 15]. Altaf et al. [13] introduced an algorithm for Alzheimer's disease (AD) detection, validated on binary classification and on multi-class classification of AD, normal, and mild cognitive impairment (MCI). Using the bag-of-visual-words approach, the algorithm enhances texture-based features such as the gray level co-occurrence matrix and integrates clinical data, creating a hybrid feature vector from whole magnetic resonance (MR) brain images. They use the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset and achieve 98.4% accuracy in binary AD versus normal classification and 79.8% accuracy in multi-class AD, normal, and MCI classification.

Furthermore, Hansen et al. [14] explored the potential of speech patterns as diagnostic markers for multiple neuropsychiatric conditions by examining recordings from 420 participants with major depressive disorder, schizophrenia, autism spectrum disorder, and non-psychiatric controls. Various models were trained and tested for both binary and multi-class classification tasks using speech and text features. While binary classification models exhibited performance comparable to prior research (F1: 0.54–0.92), multi-class classification showed a notable decrease in performance (F1: 0.35–0.75). The study further demonstrates that combining voice- and text-based models enhances overall performance by 9.4% macro F1, highlighting the potential of a multimodal approach for more accurate neuropsychiatric condition classification. While these studies show the effectiveness of different types of speech- and facial-derived features for assessing psychiatric conditions in differential diagnosis settings, none of them utilized 'in-the-wild' data collected remotely from participants' devices with a multimodal dialog system.

Figure 1: Overview of feature extraction and dataset creation.

3. Multimodal Dialog Platform and Data Collection

Audiovisual data was collected using NEMSI (Neurological and Mental health Screening Instrument) [16], a multimodal dialog system for remote health assessments. An overview of the dataset creation process is illustrated in Figure 1. A virtual guide, Tina, led study participants through various tasks that are designed to elicit speech, facial, and motor behaviors. Having an interactive virtual guide to elicit participants' behavior allows for scalability while providing a natural but controlled and objective interview environment and data collection. Each session starts with a microphone, speaker, and camera check to ensure that the participant has given their device permission to access the camera and microphone, is able to hear the instructions, and that the captured signal is of adequate quality. After these tests, the virtual guide involves participants in a structured conversation that consists of exercises (speaking tasks, open-ended questions, motor abilities) to elicit speech, facial, and motor behaviors relevant to the type of disease being studied. In this work, we focus on tasks that were shared across multiple study protocols for different disease conditions: (a) sentence intelligibility test (SIT), (b) diadochokinesis (DDK), (c) read speech, and (d) a picture description task. For (a), participants were asked to read individual SIT sentences of varying lengths (5-15 words2), while (c) required reading a longer passage (Bamboo reading passage, 99 words). To assess DDK skills (b), participants were asked to repeat a pattern of syllables (/pa ta ka/) as fast as they can until they run out of breath, and (d) prompted users to describe a scene in a picture that was shown to them on screen. These tasks are inspired by previous work [17, 18, 19].

2 In the remainder of the paper, the different SIT sentence lengths are treated as separate tasks and are denoted as SIT_n, where n is the length in words.

3.1. Datasets

An overview of the data used in this study is given in Table 1. While some datasets for a disease may be small, there is a subset of tasks that are shared across research studies. Since the data is collected in the same way (remotely with a personal electronic device), we can create a larger dataset for the healthy population across studies to get a more accurate representation of the properties of normative behavior. For the larger dataset of healthy controls, we identify age-related trends as well as collinearity of features. This information is used to correct control as well as patient feature values from
age effects and remove feature redundancies.

                Participants   Sessions       Mean Age (SD)
  Controls
  Female        408 (63%)      655 (62.8%)    46.3 (16.4)
  Male          240 (37%)      388 (37.2%)    46.2 (16.0)
  All           648            1043           46.3 (16.2)
  Schizophrenia
  Female        10 (24.4%)     19 (26.4%)     36.1 (9.4)
  Male          31 (75.6%)     53 (73.6%)     36.6 (10.1)
  All           41             72             36.5 (9.9)
  Depression
  Female        66 (79.5%)     76 (79.2%)     34.6 (12.1)
  Male          17 (20.5%)     20 (20.8%)     35.0 (10.2)
  All           83             96             34.7 (11.7)
  Bulbar Symptomatic ALS
  Female        38 (48.1%)     67 (46.2%)     61.7 (10.8)
  Male          41 (51.9%)     78 (53.8%)     61.3 (9.0)
  All           79             145            61.5 (9.8)
  Bulbar Presymptomatic ALS
  Female        31 (50%)       54 (50.5%)     58.1 (10.9)
  Male          31 (50%)       53 (49.5%)     62.2 (8.3)
  All           62             107            60.1 (9.9)

Table 1
Cohort demographics. SD: standard deviation.

3.1.1. Schizophrenia

Schizophrenia is a chronic brain disorder that affects approximately 24 million people, or 1 in 300 (1 in 222 among adults),3 worldwide. According to the American Psychiatric Association (APA), active schizophrenia may be characterized by episodes in which the affected individual cannot distinguish between real and unreal experiences.4 Among individuals with schizophrenia, psychiatric and medical comorbidities such as substance abuse, anxiety, and depression are common [20, 21, 22]. Buckley et al. pointed out that depression is estimated to affect half of these patients. These comorbidities, as well as the variation in symptoms and medications, make the identification of multimodal biomarkers for schizophrenia a difficult task.

As can be seen in Table 1, we assessed 41 individuals with a diagnosis of schizophrenia at a state psychiatric facility in New York, NY. The study was approved by the Nathan S. Kline Institute for Psychiatric Research and we obtained written informed consent from all participants at the time of screening, after explaining the details of the study. The assessment of both patients and controls was overseen by a psychiatrist.

3 https://www.who.int/news-room/fact-sheets/detail/schizophrenia, accessed 05/19/2023
4 https://www.psychiatry.org/patients-families/schizophrenia/what-is-schizophrenia, accessed 05/19/2023

3.1.2. Amyotrophic Lateral Sclerosis

ALS is a neurological disease that affects nerve cells in the brain and spinal cord that control voluntary muscle movement. The disease is progressive and there is currently no cure or effective treatment to reverse its progression.5 Global estimates of ALS prevalence range from 1.9 to 6 per 100,000.6 Studies on ALS found comorbidity with dementia, parkinsonism, and depressive symptoms [23]. Diekmann et al. [24] found depression to occur statistically significantly more often in pALS compared to HCs. In addition, Heidari et al. [25] found in a meta-analysis of 46 eligible studies that the pooled prevalence of depression among individuals with ALS was 34%, with mild, moderate, and severe depression rates at 29%, 16%, and 8%, respectively.

As shown in Table 1, data from 79 ALS bulbar symptomatic (BS) and 62 ALS bulbar presymptomatic (BP) patients were collected in cooperation with EverythingALS and the Peter Cohen Foundation.7 In addition to the assessment of speech and facial behavior, participants filled out the ALS Functional Rating Scale-Revised (ALSFRS-R), a standard instrument for monitoring the progression of ALS [26]. The questionnaire comprises 12 questions about physical ability, with each function's rating ranging from normal function (score 4) to severe disability (score 0). It includes four scales for different domains affected by the disorder: bulbar system, fine and gross motor skills, and respiratory function. The ALSFRS-R score is the total of the domain sub-scores, ranging from 0 to 48. For this study, pALS were stratified into the following sub-cohorts based on their bulbar subscore (the sum of the first three ALSFRS-R questions): (a) BS ALS with a bulbar subscore < 12 and (b) BP ALS with a bulbar subscore = 12.

5 https://www.ninds.nih.gov/health-information/disorders/amyotrophic-lateral-sclerosis-als, accessed 05/19/2023
6 https://www.targetals.org/2022/11/22/epidemiology-of-als-incidence-prevalence-and-clusters/, accessed 05/19/2023
7 https://www.everythingals.org/research

3.1.3. Depression

Depression is a common mental health disorder characterized by persistent sadness and lack of interest or pleasure in previously enjoyable activities. In addition, fatigue and poor concentration are common. The effects of depression can be long-lasting or recurrent and can drastically affect a person's ability to lead a fulfilling life. The disorder is one of the most common causes of disability in the world.8 One in six people (16.6%) will experience depression at some point in their lifetime.9

8 https://www.who.int/health-topics/depression, accessed 06/20/2023
9 https://www.psychiatry.org/patients-families/depression/what-is-depression, accessed 06/20/2023
A well-established tool for assessing depression is the Patient Health Questionnaire (PHQ)-8 [27]. The PHQ-8 score ranges from 0 to 24 (a higher score indicates more severe depression symptoms).

We investigated at least moderately severe depression cases, based on a cutoff of PHQ-8 ≥ 15. The data for this study, including the completion of the PHQ-8 questionnaire, was collected through crowd-sourcing, resulting in a sample of 83 individuals who scored at or above this cutoff. Statistics for this cohort are summarized in Table 1.

4. Methods

Our procedure is divided into the following stages: (1) feature extraction, (2) preprocessing, (3) age-correction and sex-normalization, (4) redundancy and effect size analysis, and finally (5) classification (binary and multi-class) and evaluation.

4.1. Multimodal Metrics Extraction

In this and the following sections, we use the following terminology: metric denotes a speech or facial measure in general, and feature denotes a specific combination of a metric extracted from a certain task, e.g. speaking rate for the SIT task.

Both speech and facial metrics were extracted from the audiovisual recordings (overview in Table 2). To extract facial metrics, we used the MediaPipe FaceMesh software.10 More specifically, MediaPipe's Face Detection is based on BlazeFace [28] and determines the (x, y)-coordinates of the face for every frame. Subsequently, 468 facial landmarks are identified using MediaPipe FaceMesh. We selected 14 key landmarks to compute functionals of facial behavior. Distances between landmarks were normalized by dividing them by the inter-caruncular distance. For between- as well as within-subject analyses, when the same position relative to the camera cannot be assumed, Roesler et al. [29] found this to be the most reliable method of normalization. More details and a visual depiction of the landmarks used to calculate facial features can be found in [4]. Speech metrics were computed using Praat [30] and cover different domains, such as energy, timing, voice quality, and frequency.

4.2. Preprocessing

We applied the following approach to handle missing data, which can occur for a number of reasons, including incomplete sessions, technical issues, or network problems. First, on the session level, we removed participant sessions that had more than 15% missing features. Then, on the feature level, we filtered out features with more than 10% missing values. These thresholds were determined empirically. After these removal steps, we imputed remaining missing values with the mean feature values of the respective cohort, in train and test sets separately.

4.3. Age-Correction & Sex-Normalization

Similar to the approach in Falahati et al. [31], we applied a linear correction algorithm to both patient and control data based on age-related changes in the HC cohort. By calculating age trends and coefficients on healthy controls, we aim to obtain the most accurate estimate of purely age-related changes without the confounding effects of disease-related influences. In detail, for each feature, we fit a linear regression model with age as the independent and the feature as the dependent variable, modeling the age-related changes as a linear deviation. This is done separately for males and females to obtain a sex-specific result. Then, the sex-specific regression coefficients are used to correct feature values for age by subtracting the product of coefficient and age from the feature value for each participant. To account for sex-related differences, we applied sex-specific z-scoring to normalize the features. Z-normalization is a methodology that allows for the comparison or compilation of observations from different cohorts [32]. In addition, the normalization process ensures the comparability of features on different scales by centering the feature distributions around zero with a standard deviation of one. First, the dataset to be analyzed was divided into male and female participants. Then, each feature was normalized within each sex group using z-scoring.

4.4. Redundancy Analysis and Effect Sizes

To identify collinear features and reduce the high-dimensional feature space, we performed hierarchical clustering on the Spearman rank-order correlations using the age-corrected and sex-normalized larger healthy control dataset. We applied the clustering to speech and facial features separately. The clustering procedure is motivated by the approach in Ienco and Meo [33]. It is based on Ward's method [34], which aims at minimising within-cluster variance. We implemented it using the scikit-learn library.11 A dendrogram was plotted to inspect the correlations between features visually and to determine a suitable distance threshold for generating feature clusters. The threshold choice was based on two major factors: (a) balance between speech and facial clusters, as we target roughly an equal number to avoid

10 https://google.github.io/mediapipe/
11 https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance_multicollinear.html
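The per-sex age correction and z-normalization of Section 4.3 can be sketched with NumPy as follows. This is a toy, single-sex example on synthetic data, not the authors' code; in the paper, the regression is fit on healthy controls only (separately per sex) and then applied to both controls and patients:

```python
import numpy as np


def fit_age_slope(age, feat):
    """Slope of a least-squares line feat ~ age, fit on healthy controls only."""
    slope, _intercept = np.polyfit(age, feat, deg=1)
    return slope


def age_correct(feat, age, slope):
    """Remove the estimated linear age effect: feat - slope * age."""
    return feat - slope * age


def z_normalize(feat, mean, std):
    """Z-score a feature using statistics from the matching sex group."""
    return (feat - mean) / std


# Toy example for one sex group: an age-dependent feature with slope 0.8.
rng = np.random.default_rng(0)
hc_age = rng.uniform(20, 80, 200)
hc_feat = 150.0 + 0.8 * hc_age + rng.normal(0, 5, 200)

slope = fit_age_slope(hc_age, hc_feat)          # ~0.8, recovered from controls
corrected = age_correct(hc_feat, hc_age, slope)
z = z_normalize(corrected, corrected.mean(), corrected.std())
```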
          Domain            Metrics
  Audio
          Energy            signal-to-noise ratio (SNR, dB)
          Timing            speaking & articulation duration/rate (sec./WPM), percent pause time (PPT, %),
                            canonical timing agreement (CTA, %)
          Specific to DDK   cycle-to-cycle temporal variability (cTV, sec.), syllable rate (syl./sec.), number of syllables
          Voice quality     shimmer (%), harmonics-to-noise ratio (HNR, dB), jitter (%)
          Frequency         mean, min, max & standard deviation (stdev) of fundamental frequency (F0, Hz)
  Video
          Jaw               mean, min & max speed/acceleration/jerk of the jaw center (JC)
          Lower Lip         mean, min & max speed/acceleration/jerk of the lower lip (LL)
          Mouth             mean & max lip aperture, lip width, mouth surface area; mean mouth symmetry ratio
          Eyes              mean & max eye opening

Table 2
Overview of speech and facial metrics.
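As an illustration of the normalization described in Section 4.1, a mouth metric such as lip aperture can be divided by the inter-caruncular distance so that it is comparable across camera distances. This is a schematic sketch; the landmark names are placeholders, not the 14 MediaPipe landmarks actually used:

```python
import math


def dist(p, q):
    """Euclidean distance between two (x, y) landmark coordinates."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def normalized_lip_aperture(landmarks):
    """Vertical lip opening divided by the inter-caruncular distance.

    `landmarks` maps names to (x, y) points; the keys are illustrative only.
    """
    scale = dist(landmarks["left_caruncle"], landmarks["right_caruncle"])
    return dist(landmarks["upper_lip"], landmarks["lower_lip"]) / scale


lm = {
    "left_caruncle": (0.40, 0.35),
    "right_caruncle": (0.60, 0.35),
    "upper_lip": (0.50, 0.60),
    "lower_lip": (0.50, 0.66),
}
print(normalized_lip_aperture(lm))  # ~0.3, invariant to uniform rescaling
```

Because both numerator and denominator scale with distance to the camera, the ratio stays constant when the face moves closer to or farther from the lens.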


   #    Cluster domain                         Metrics                            Tasks                   # Features
   1    Energy                                 SNR                                all                          8
   2    Timing alignment                       CTA                                all                          6
   3    Timing, pauses                         PPT                                all                          5
   4    Timing, speaking (1)                   articulation/speaking duration     Picture Description          2
   5    DDK articulation                       SNR, syl. rate, syl. count & cTV   DDK                          4
   6    Timing, speaking (2)                   articulation/speaking rate/time    SIT_{5,9}                    8
   7    Timing, speaking (3)                   articulation/speaking rate/time    SIT_{7,11,13,15},           21
                                                                                  Reading passage
   8    DDK voice quality                      HNR, jitter & shimmer              DDK                          3
   9    Voice quality (periodicity)            HNR                                all except DDK               8
  10    Voice quality (amplitude variation)    shimmer                            all except DDK               8
  11    Voice quality (frequency variation)    jitter                             all except DDK               8
  12    Frequency (mean, min)                  min & mean F0                      all                         16
  13    Frequency (max, std)                   max & std F0                       all                         16
        Total                                                                                                113

Table 3
Speech feature clusters identified by hierarchical clustering.
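The clustering step of Section 4.4 can be sketched in the style of the scikit-learn multicollinearity example that the paper cites: Spearman correlations are turned into a distance matrix, Ward linkage is computed, and the dendrogram is cut at a distance threshold. The data and threshold below are synthetic and illustrative:

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
# Synthetic stand-in for the feature matrix (sessions x features):
# columns 0-2 are nearly collinear, columns 3-5 are independent.
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.05 * rng.normal(size=(200, 3)),
               rng.normal(size=(200, 3))])

# Spearman rank-order correlation -> distance matrix -> Ward linkage.
corr, _ = spearmanr(X)
corr = (corr + corr.T) / 2           # enforce symmetry
np.fill_diagonal(corr, 1.0)
dist = 1.0 - np.abs(corr)            # highly correlated features are "close"
linkage = hierarchy.ward(squareform(dist, checks=False))

# Cut the dendrogram at an illustrative distance threshold.
cluster_ids = hierarchy.fcluster(linkage, t=0.5, criterion="distance")
print(cluster_ids)  # the three collinear columns share one cluster id
```

In practice, one representative feature per cluster (or a cluster aggregate) can then be carried forward, shrinking the feature space while retaining most of its information.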



predominance of one modality over the other, and (b) expert knowledge about the different task and feature domains (e.g. timing versus voice quality features, jaw versus eye movement, or read versus free speech), which resulted in the clusters shown in Table 3 and Table 4. The clusters are used in the feature selection process as described in Section 4.5.
   Statistical tests to assess the statistical significance, as well as the magnitude and direction of effects for a given comparison, were conducted within classification folds and as part of a post hoc analysis. Effect sizes were calculated using Glass's Delta [35]. Only features showing statistical significance (p < 0.05) in the Mann-Whitney U-test (MWU) were considered.

4.5. Classification

For both the binary and multi-class classification experiments, we used a multilayer perceptron (MLP) implemented with the scikit-learn library. The MLP has one hidden layer; we experimented with adding more hidden layers, but found that the minimal configuration with a single layer was beneficial in terms of performance. The hidden layer size h was determined dynamically as

    h = (f + c) / 2    (1)

where f is the number of selected features and c is the number of classes. The model was trained for a maximum of 10,000 iterations to allow sufficient time for convergence. Training was stopped when the loss or score was no longer improving by a defined tolerance threshold; here, we used scikit-learn's default of 1e-4. Additionally, the alpha parameter, which controls the regularization strength to prevent overfitting, was set to 0.001. The sgd (stochastic gradient descent) solver was used for optimization, with the batch size set to auto, enabling the model to determine an appropriate batch size during training. We used the rectified linear unit as the activation function.
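The configuration above maps directly onto scikit-learn's MLPClassifier. A minimal sketch (not the authors' exact code; the feature count f and class count c are placeholder values, and Eq. (1) is rounded down to an integer):

```python
# Sketch of the MLP configuration described above (placeholder f and c).
from sklearn.neural_network import MLPClassifier

f, c = 20, 4                       # e.g. 20 selected features, 4 classes
h = (f + c) // 2                   # hidden layer size, Eq. (1), rounded down

clf = MLPClassifier(
    hidden_layer_sizes=(h,),       # a single hidden layer
    activation="relu",             # rectified linear unit
    solver="sgd",                  # stochastic gradient descent
    alpha=1e-3,                    # L2 regularization strength
    batch_size="auto",             # batch size chosen by the library
    max_iter=10_000,               # generous iteration budget
    tol=1e-4,                      # scikit-learn's default stopping tolerance
)
```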
            #    Cluster domain         Metrics                                   Tasks                   # Features
            1    Lip movement (1)       speed, acc. & jerk measures               all except DDK             95
            2    Lip width              mean & max lip width                      all                        18
            3    Mouth opening          mean & max lip aperture,                  all                        36
                                        mouth surface area
            4    Lip movement (2)       speed, acc. & jerk metrics                DDK                        12
            5    Jaw movement (1)       speed, acc. & jerk metrics                DDK                        12
            6    Jaw movement (2)       speed, acc. & jerk metrics                SIT_7                      12
            7    Jaw movement (3)       speed, acc. & jerk metrics                SIT_5                      12
            8    Jaw movement (4)       min + max speed, acc. & jerk metrics      Picture Description        9
            9    Jaw movement (5)       speed, acc. & jerk metrics                SIT_{9,11,13,15}, RP,      63
                                                                                  Picture Description
           10    Mouth symmetry         mean mouth symmetry                       all                         9
           11    Eye opening            mean and max eye opening                  all                         18
                                                                                                  Total     296
Table 4
Facial feature clusters identified by hierarchical clustering. RP: reading passage.
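The groupings in Tables 3 and 4 combine hierarchical clustering with expert knowledge. Purely as illustration, correlation-based hierarchical clustering of collinear features can be sketched as follows; the feature names, the synthetic data, and the dendrogram cut-off of 0.5 are our hypothetical choices, not values from the study:

```python
# Illustrative sketch: hierarchically cluster features by |rank correlation|.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)),
                 columns=["speaking_rate", "articulation_rate", "jitter", "shimmer"])
X["articulation_rate"] += X["speaking_rate"]        # induce collinearity

corr = X.corr(method="spearman").abs()
dist = squareform(1.0 - corr.values, checks=False)  # distance = 1 - |rho|
Z = linkage(dist, method="average")
labels = fcluster(Z, t=0.5, criterion="distance")   # cut the dendrogram at 0.5
clusters = {c: list(X.columns[labels == c]) for c in np.unique(labels)}
```

Features whose pairwise correlation exceeds the cut-off land in the same cluster; a single representative per cluster can then be kept, as described in the feature selection step.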



   Ten-fold cross-validation was applied for evaluation in order to maximize the utilization of data for both training and testing. To avoid bias towards the majority group, we created datasets that consist of an equal number of samples in each disease condition. For each individual participant, we consider, if available, the first two sessions as data points. Because of this equality constraint, the number of data points was limited by the smallest dataset (schizophrenia). This resulted in 72 randomly selected data points per cohort, summing to a total of 360 data points. The classification experiments were run ten times to smooth out performance variations and obtain more representative results. We split the data using scikit-learn's StratifiedGroupKFold to ensure that sessions from the same participant fall entirely within either the training or the testing fold. In each fold, we imputed missing values and standardized features by sex using z-scoring; this was done separately for the training and test sets.
   As a benchmark, we evaluated the binary classification performance of models aimed at distinguishing cases with a disorder from controls. Here, for each cluster of collinear features as described in Section 4.4, the feature with the highest effect size was selected for the final feature set used as input to the classifier. If no feature in a given cluster showed statistically significant differences between cases and controls, no feature was selected from that cluster. Hence, the number of clusters determines the maximum number of features fed into the classifier. Statistical significance and effect sizes for each feature were calculated as described in the previous section.
   In a second step, we performed 4-class classification, incorporating all the investigated neurological and mental disorders. Here, feature selection was done based on pairwise comparisons of all disease cohorts (e.g. Depression vs. Schizophrenia cases, Schizophrenia vs. BS ALS cases, BS ALS vs. Depression cases, and so on). We merged the selected features from these comparisons as input to the classifier; therefore, multiple features from the same cluster could be included in one feature set. We allowed a certain amount of redundancy compared to the case-control baseline in order to account for the complexity associated with multiple comparisons. For both experiments, classification performance was evaluated in terms of F1 score, sensitivity, and specificity.


5. Results

5.1. Binary Classification Baseline

 Cohort             Speech     Facial     Speech + Facial
                      F1         F1      F1    SEN     SP
 DEP vs. HC          0.64       0.59    0.65   0.65   0.65
 SCHIZ vs. HC        0.82       0.64    0.83   0.85   0.82
 BP ALS vs. HC       0.54       0.51    0.52   0.52   0.53
 BS ALS vs. HC       0.84       0.63    0.83   0.82   0.83

Table 5
Binary classification results. In each row, the highest F1 score is highlighted.
HC: Healthy Controls, DEP: Depression, SCHIZ: Schizophrenia, SEN: Sensitivity, SP: Specificity

   As can be seen in Table 5, we observe good performance in classifying controls versus BS ALS (speech features alone; F1-score: 0.84) and schizophrenia (combined speech and facial; F1-score: 0.83) cases, respectively. The binary classification of depression did not perform as well; however, it still surpassed the random chance baseline (combined speech and facial; F1-score: 0.65). The classifier struggled to distinguish controls from BP ALS cases, where we observed performance just above
random chance across modalities. Furthermore, the performance with regard to sensitivity and specificity is relatively balanced across comparisons.
   In depression and schizophrenia, combining speech and facial modalities resulted in improved classification performance compared to speech or facial features alone, as shown in Table 5. However, adding facial information did not enhance performance for the BP and BS ALS cohorts compared to utilizing speech features alone.

5.2. Multi-Class Classification

 Cohort       Speech      Facial     Speech + Facial
                F1          F1       F1   SEN     SP
 SCHIZ         0.72        0.53     0.72  0.72   0.91
 BP ALS        0.55        0.36     0.57  0.57   0.86
 BS ALS        0.62        0.47     0.64  0.65   0.88
 DEP           0.61        0.46     0.64  0.64   0.88
 Average       0.63        0.46     0.64  0.65   0.88

Table 6
Multi-class classification results. In each row, the highest F1 score is highlighted.
HC: Healthy Controls, DEP: Depression, SCHIZ: Schizophrenia, SEN: Sensitivity, SP: Specificity

Figure 2: Normalized confusion matrix for 4-class classification. The x-axis shows the true labels, the y-axis the predicted ones.

   In the 4-class experiment aimed at discriminating between all investigated neurological and mental disorders, we achieve the best overall performance (F1-score: 0.64) by utilizing both speech and facial features, as shown in Table 6. Overall, the specificity (average: 0.88) for the disorders examined is considerably higher than the sensitivity (average: 0.65). This indicates that the classifier is more effective at avoiding false-positive results than at identifying true positives. In most cases, namely for BS ALS, BP ALS and depression, the per-class F1-score is highest when combining speech and facial features. There is no performance difference between using only speech or speech and facial features for identifying schizophrenia. Figure 2 shows a confusion matrix that indicates the percentage of accurate class predictions and the classes with which they were confused. The model was most confident in detecting schizophrenia (72.22%), followed by BS ALS (64.58%) and depression (63.75%). The model faced its greatest challenge in accurately predicting BP ALS (57.22%), yet it still performs notably above chance in a 4-class classification scenario. BP ALS and depression cases were most often confused with each other. Schizophrenic patients were least often confused with other cohorts. Among the cases of BS ALS, the most frequent confusion occurred with BP ALS patients (16.11%).
   The features that we identified as consistently chosen across classification folds (Table 7) are predominantly speech features from the timing, voice quality, and energy domains. In addition, two facial features are selected across folds, concerning the maximum lip width and the maximum absolute acceleration of jaw movements. We conducted a post hoc analysis of effect sizes between HC and cases with a disorder for these features to gain further insight into disorder-specific importance. Here, positive effect sizes represent feature values that are larger for cases with a disorder than for controls. Conversely, negative values represent larger feature values for controls than for cases with a disorder¹². In schizophrenia, we find all of the features consistently selected across classification folds to be statistically significant when compared to HC. With respect to the other cohorts, the largest effects are shown for CTA (-1.44 for SIT_13) and speaking rate (-2.00 for RP). This shows that patients exhibit a lower CTA, a measure of phonetic alignment between their own speech and that of the virtual guide, while speaking more slowly. We also observed a smaller average lip width as an important feature that shows the largest effect between HC and depression cases compared to the other cohorts. This may be associated with decreased emotional expressivity, as indicated by reduced smiling and increased frowning. These findings align with previous studies highlighting similar patterns of emotional expressiveness in depression [37, 38]. Few and small differences compared to controls are revealed for BP ALS cases. This is also the cohort with the lowest performance across classification experiments. In BS ALS, we found the largest effects for SNR and speaking rate. Another feature that stood out is cTV in the DDK task, a measure that captures the temporal variability, i.e. the consistency or irregularity in the timing of speech patterns, between consecutive cycles of speech.

¹² We follow the commonly used effect size magnitude thresholds suggested in Cohen [36] – small: 0.2–0.5, medium: 0.5–0.8, and large: > 0.8.
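The significance-gated effect sizes reported in Table 7 follow the recipe described in Section 4.4: Glass's Delta is computed only when the Mann-Whitney U-test is significant. A minimal sketch (the function name and the synthetic data are ours, not from the study):

```python
# Sketch: report Glass's Delta only if a Mann-Whitney U test is significant.
import numpy as np
from scipy.stats import mannwhitneyu

def gated_glass_delta(cases: np.ndarray, controls: np.ndarray, alpha: float = 0.05):
    """Return Glass's Delta (cases vs. controls), or None if MWU p >= alpha."""
    _, p = mannwhitneyu(cases, controls, alternative="two-sided")
    if p >= alpha:
        return None                      # not significant: no effect size reported
    # Glass's Delta: mean difference scaled by the control group's SD
    return (cases.mean() - controls.mean()) / controls.std(ddof=1)

rng = np.random.default_rng(1)
controls = rng.normal(0.0, 1.0, size=80)
cases = rng.normal(1.0, 1.0, size=80)    # clearly shifted distribution
delta = gated_glass_delta(cases, controls)
```

Positive values of the returned delta mean the feature is larger in cases than in controls, matching the sign convention used in the table.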
                                                                           Effect sizes (HC vs. disorder cases)
 Features                    Modality     Cluster domain                  SCHIZ BP ALS BS ALS               DEP
 max abs acc. JC (RP)        Facial       Jaw movement                     -0.51            -       N.S         -
 max lip width (SIT 11)      Facial       Lip width                         -0.35           -       0.31 -0.44
 shimmer (DDK)               Speech       Voice quality                      0.35           -     -0.63         -
 shimmer (SIT 5)             Speech       Voice quality                     0.97            -      -0.31        -
 jitter (SIT 9)              Speech       Voice quality                      0.43       -0.20     -0.48      0.26
 CTA (SIT 13)                Speech       Timing alignment                 -1.44            -      -1.16    -0.31
 SNR (DDK)                   Speech       Energy                             1.88           -      2.43         -
 speaking rate (RP)          Speech       Timing, speaking                 -2.00            -      -1.84        -
 speaking rate (SIT 7)       Speech       Timing, speaking                  -0.73       -0.31     -1.25      0.59
 HNR (DDK)                   Speech       Voice quality                     1.01            -       0.86    -0.30
 HNR (SIT 15)                Speech       Voice quality                     0.94            -       0.75        -
 cTV (DDK)                   Speech       Energy & articulation skills       0.39           -      1.82      0.43
Table 7
Features selected across all multi-class classification CV folds (considering the 4 disorders) and their effect sizes as calculated between the healthy control and disorder cohorts. In each row, the largest effect size is highlighted; effect sizes were only calculated in case of statistical significance.
HC: healthy controls, SCHIZ: schizophrenia, BS: bulbar symptomatic, BP: bulbar pre-symptomatic, DEP: depression, JC: jaw
center, RP: reading passage



While many features are shared in terms of indicating a signal between cases with a disorder and controls, it is mostly the magnitude of the effect that differentiates them, as well as how they combine. However, there are also a few features that show a different direction of effect across cohorts. For example, in BS ALS, compared to other cohorts, we observed the largest effect for shimmer (DDK, -0.63), which measures cycle-to-cycle variation in the amplitude of the speech signal produced by the vocal folds. There is no effect observed for the BP ALS or depression cohorts, while in schizophrenia, the direction of effect is the opposite (0.35).


6. Discussion

We explored the utility of speech and facial features extracted by a multimodal dialog system for the differential classification of ALS, depression and schizophrenia. Note that the idea here is not to replace clinicians, but to provide effective and assistive tools that can help improve their efficiency, speed and accuracy. Overall, combining speech and facial information proved to be beneficial for identifying several disorders in both multi-class and binary classification experiments. In addition, our automated feature analysis indicates several features that show relevance across experiments. While some of these features are intuitively identifiable by human experts as markers of a given disorder (for example, a slower speaking rate or lower intelligibility), such an analysis also allows discovery of other features that might be harder to detect or identify objectively by human experts, such as quicker facial movements.
   That being said, we acknowledge the importance of contextualizing the promise of such multimodal methodologies for differential diagnosis with several caveats. First, the performance of any machine learning classifier trained for this purpose will depend on the specific conditions being studied and the range and heterogeneity of symptoms presented in each case. For example, in this study we investigated four specific conditions – schizophrenia, depression, bulbar symptomatic (BS) and bulbar presymptomatic (BP) ALS – and we observed that schizophrenia (where the facial modality is particularly good at capturing characteristics exhibited therein such as anhedonia, blunted affect, etc.) and BS ALS (which is characterized by speech motor deficits, reflected in the timing, rate and intelligibility of speech), quite different in terms of symptom presentation, exhibit greater separability relative to other classes for differential classification. For both BS ALS and schizophrenia, our analysis demonstrates a robust discriminatory capability to effectively distinguish these cohorts from healthy controls, as well as from other neurological and mental disorders, in binary and multi-class experiments. However, the overall higher specificity of the multi-class classifier implies a robust capability to accurately identify non-cases, effectively minimizing false positives. Yet, the lower sensitivity suggests limitations in the identification of true cases for the analyzed disorders, likely due to the imposed strong restrictions. In BS ALS, speech features alone demonstrate superior performance when comparing this group with controls. Yet, in the more intricate task of differential diagnosis, performance improves when speech features are combined with facial information. For schizophrenia, the combination of speech and facial modalities proves most effective in both binary and multi-class experiments.
In contrast, BP ALS, which does not present with as many speech and facial motor deficits, is much less separable even in binary classification, let alone in the multi-class classification context, highlighting the challenging nature of detecting this condition. Furthermore, for the misidentified BS ALS cases, the classifier most frequently categorized them as BP ALS. Although distinguishing BP ALS cases from controls is challenging, this outcome indicates that the classifier may be able to capture condition-specific information from features that are shared across different stages of ALS, which may have led to this confusion. Finally, in evaluating depression, the best performance in both binary and multi-class classification experiments is achieved by combining speech and facial information. The overall accuracy in discerning depression from other cohorts is notably lower compared to schizophrenia or BS ALS. The variability introduced by the wide range and time horizon of potential symptoms present in depression, as well as medication status, might contribute to lower differential diagnosis accuracy. That being said, a significant limitation of the present study is the lack of information about co-morbidities to factor into our analysis, since the datasets were collected independently. Future research will aim to explicitly address this gap by capturing, for instance, information about co-morbid depression in ALS or schizophrenia (e.g., through PHQ-8 scales), which might help us better stratify these cohorts.
   Second, this study focused on a restricted set of tasks, primarily reading abilities and picture description assessments. These task-feature combinations alone may not fully capture the nuances of each disorder.
   Third, while we focused on interpretable features in this study, less interpretable ones, such as log mel spectrograms or Mel Frequency Cepstral Coefficients (MFCCs), may be able to capture more nuanced and complex patterns in the data. Additionally, more sophisticated deep learning approaches for representation learning could be applied, such as ResNet-50 [39] in the facial modality. While such features can be powerful in capturing subtle details and nuances of audiovisual behavior, the inner workings of a deep learning model are not easily explainable or interpretable by non-experts.
   Fourth, our sample size is not representative enough to truly claim generalizability of findings. The smaller the sample, the larger the risk of model “blind spots” that in turn lead to variable estimates of true model performance on unseen real-world data, giving algorithm designers an inaccurate sense of how well a model is performing during development [40].
   Our results argue for the importance of a hybrid approach to differential diagnosis going forward, combining knowledge-driven and data-driven approaches. Understanding specific disease pathologies and symptoms can in turn help in developing features and learning methodologies that lead to better separability of disease conditions. Future work will also focus on improving differential diagnosis performance in a manner that is both generalizable and explainable.


Acknowledgments

This work was funded in part by National Institutes of Health grant R42DC019877. We thank all study participants for their time, and we gratefully acknowledge the contribution of the Peter Cohen Foundation and EverythingALS towards participant recruitment and data collection for the ALS corpus, and of Anzalee Khan and Jean-Pierre Lindenmayer at the Manhattan Psychiatric Center – Nathan Kline Institute for the schizophrenia corpus.


References

 [1] V. Feigin, E. Nichols, T. Alam, M. Bannick, E. Beghi, N. Blake, W. Culpepper, E. Dorsey, A. Elbaz, R. Ellenbogen, J. Fisher, C. Fitzmaurice, G. Giussani, L. Glennie, S. James, C. Johnson, N. Kassebaum, G. Logroscino, B. Marin, T. Vos, Global, regional, and national burden of neurological disorders, 1990-2016: a systematic analysis for the Global Burden of Disease Study 2016, The Lancet Neurology 18 (2019) 459–480. doi:10.1016/S1474-4422(18)30499-X.
 [2] V. Ramanarayanan, A. C. Lammert, H. P. Rowe, T. F. Quatieri, J. R. Green, Speech as a biomarker: Opportunities, interpretability, and challenges, Perspectives of the ASHA Special Interest Groups 7 (2022) 276–283.
 [3] M. Neumann, O. Roesler, J. Liscombe, H. Kothare, D. Suendermann-Oeft, J. D. Berry, E. Fraenkel, R. Norel, A. Anvar, I. Navar, A. V. Sherman, J. R. Green, V. Ramanarayanan, Multimodal dialog based speech and facial biomarkers capture differential disease progression rates for ALS remote patient monitoring, in: Proceedings of the 32nd International Symposium on Amyotrophic Lateral Sclerosis and Motor Neuron Disease, Virtual, 2021.
 [4] V. Richter, J. Cohen, M. Neumann, D. Black, A. Haq, J. Wright-Berryman, V. Ramanarayanan, A multimodal dialog approach to mental state characterization in clinically depressed, anxious, and suicidal populations, Frontiers in Psychology 14 (2023). URL: https://www.frontiersin.org/articles/10.3389/fpsyg.2023.1135469. doi:10.3389/fpsyg.2023.1135469.
 [5] M. Neumann, O. Roesler, J. Liscombe, H. Kothare, D. Suendermann-Oeft, D. Pautler, I. Navar, A. Anvar, J. Kumm, R. Norel, E. Fraenkel, A. Sherman, J. Berry, G. Pattee, J. Wang, J. Green, V. Ramanarayanan,
     Investigating the utility of multimodal conversational technology and audiovisual analytic measures for the assessment and monitoring of amyotrophic lateral sclerosis at scale, 2021, pp. 4783–4787. doi:10.21437/Interspeech.2021-1801.
 [6] V. Richter, M. Neumann, H. Kothare, O. Roesler, J. Liscombe, D. Suendermann-Oeft, S. Prokop, A. Khan, C. Yavorsky, J.-P. Lindenmayer, V. Ramanarayanan, Towards multimodal dialog-based speech & facial biomarkers of schizophrenia, in: Companion Publication of the 2022 International Conference on Multimodal Interaction, ICMI '22 Companion, Association for Computing Machinery, New York, NY, USA, 2022, pp. 171–176. URL: https://doi.org/10.1145/3536220.3558075. doi:10.1145/3536220.3558075.
 [7] H. Kothare, M. Neumann, J. Liscombe, O. Roesler, W. Burke, A. Exner, S. Snyder, A. Cornish, D. Habberstad, D. Pautler, D. Suendermann-Oeft, J. Huber, V. Ramanarayanan, Statistical and clinical utility of multimodal dialogue-based speech and facial metrics for Parkinson's disease assessment, 2022, pp. 3658–3662. doi:10.21437/Interspeech.2022-11048.
 [8] N. Cummins, J. Joshi, A. Dhall, V. Sethu, R. Goecke, J. Epps, Diagnosis of depression by behavioural signals: A multimodal approach, in: Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge, AVEC '13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 11–20. URL: https://doi.org/10.1145/2512530.2512535. doi:10.1145/2512530.2512535.
 [9] J. Robin, M. Xu, A. Balagopalan, J. Novikova, L. Kahn, A. Oday, M. Hejrati, S. Hashemifar, M. Negahdar, W. Simpson, E. Teng, Automated detection of progressive speech changes in early Alzheimer's disease, Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring 15 (2023) e12445. doi:10.1002/dad2.12445.
[10] J. Hlavnička, R. Čmejla, T. Tykalová, K. Šonka, E. Růžička, J. Rusz, Automated analysis of connected speech reveals early biomarkers of Parkinson's disease in patients with rapid eye movement sleep behaviour disorder, Scientific Reports 7 (2017). URL: https://api.semanticscholar.org/CorpusID:19272861.
[11] G. Stegmann, S. Charles, J. Liss, J. Shefner, S. Rutkove, V. Berisha, A speech-based prognos-
     burg, G. L. Pattee, J. D. Berry, E. A. Macklin, E. P. Pioro, R. A. Smith, Additional evidence for a therapeutic effect of dextromethorphan/quinidine on bulbar motor function in patients with amyotrophic lateral sclerosis: A quantitative speech analysis, British Journal of Clinical Pharmacology 84 (2018) 2849–2856.
[13] T. Altaf, S. M. Anwar, N. Gul, M. N. Majeed, M. Majid, Multi-class Alzheimer's disease classification using image and clinical features, Biomedical Signal Processing and Control 43 (2018) 64–74. URL: https://www.sciencedirect.com/science/article/pii/S1746809418300508. doi:10.1016/j.bspc.2018.02.019.
[14] L. Hansen, R. Rocca, A. Simonsen, et al., Speech- and text-based classification of neuropsychiatric conditions in a multidiagnostic setting, Nature Mental Health (2023). doi:10.1038/s44220-023-00152-7.
[15] E. Emre, Erol, C. Taş, N. Tarhan, Multi-class classification model for psychiatric disorder discrimination, International Journal of Medical Informatics 170 (2023) 104926. URL: https://www.sciencedirect.com/science/article/pii/S1386505622002404. doi:10.1016/j.ijmedinf.2022.104926.
[16] D. Suendermann-Oeft, A. Robinson, A. Cornish, D. Habberstad, D. Pautler, D. Schnelle-Walka, F. Haller, J. Liscombe, M. Neumann, M. Merrill, O. Roesler, R. Geffarth, NEMSI: A multimodal dialog system for screening of neurological or mental conditions, in: Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, IVA '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 245–247. URL: https://doi.org/10.1145/3308532.3329415. doi:10.1145/3308532.3329415.
[17] A. K. Silbergleit, A. F. Johnson, B. H. Jacobson, Acoustic analysis of voice in individuals with amyotrophic lateral sclerosis and perceptually normal vocal quality, Journal of Voice 11 (1997) 222–231.
[18] B. Tomik, R. J. Guiloff, Dysarthria in amyotrophic lateral sclerosis: A review, Amyotrophic Lateral Sclerosis 11 (2010) 4–15.
[19] M. Novotny, J. Melechovsky, K. Rozenstoks, T. Tykalova, P. Kryze, M. Kanok, J. Klempir, J. Rusz, Comparison of automated acoustic methods for oral diadochokinesis assessment in amyotrophic lateral sclerosis, Journal of speech, language, and
     tic model for dysarthria progression in als, Amy-            hearing research : JSLHR 63 (2020) 3453–3460.
     otrophic lateral sclerosis & frontotemporal degen-           doi:10.1044/2020_JSLHR-20-00109.
     eration (2023) 1–6. URL: https://doi.org/10.1080/       [20] P. Buckley, B. Miller, D. Lehrer, D. Castle, Psychi-
     21678421.2023.2222144. doi:10.1080/21678421.                 atric comorbidities and schizophrenia, Schizophre-
     2023.2222144, advance online publication.                    nia bulletin 35 (2008) 383–402. doi:10.1093/
[12] J. R. Green, K. M. Allison, C. Cordella, B. D. Rich-         schbul/sbn135.
[21] A. I. Green, C. M. Canuso, M. J. Brenner, J. D. Woj-            health monitoring agent, in: Companion Pub-
     cik, Detection and management of comorbidity in                 lication of the 2022 International Conference on
     patients with schizophrenia, Psychiatric Clinics 26             Multimodal Interaction, ICMI ’22 Companion, As-
     (2003) 115–139.                                                 sociation for Computing Machinery, New York,
[22] G. B. Cassano, S. Pini, M. Saettoni, P. Rucci,                  NY, USA, 2022, p. 160–165. URL: https://doi.org/
     L. Dell’Osso, Occurrence and clinical correlates of             10.1145/3536220.3558071. doi:10.1145/3536220.
     psychiatric comorbidity in patients with psychotic              3558071.
     disorders, Journal of Clinical Psychiatry 59 (1998)        [30] P. Boersma, V. Van Heuven, Speak and unspeak
     60–68.                                                          with praat, Glot International 5 (2001) 341–347.
[23] S. Körner, K. Kollewe, J. Ilsemann, A. Karch, R. Den-      [31] F. Falahati, D. Ferreira, J.-S. Muehlboeck, H. Soini-
     gler, K. Krampfl, S. Petri, Prevalence and prognostic           nen, P. Mecocci, B. Vellas, M. Tsolaki, I. Kłoszewska,
     impact of comorbidities in amyotrophic lateral scle-            C. Spenger, S. Lovestone, M. Eriksdotter, L.-O.
     rosis, European journal of neurology : the official             Wahlund, A. Simmons, E. Westman, The effect
     journal of the European Federation of Neurological              of age correction on multivariate classification in
     Societies 20 (2012). doi:10.1111/ene.12015.                     alzheimer’s disease, with a focus on the characteris-
[24] K. Diekmann, M. Kuźma-Kozakiewicz, M. Pi-                       tics of incorrectly and correctly classified subjects,
     otrkiewicz, M. Gromicho, J. Grosskreutz, P. M.                  Brain Topography In-press (2016). doi:10.1007/
     Andersen, M. de carvalho, H. Uysal, A. Osman-                   s10548-015-0455-1.
     ovic, O. Schreiber-Katz, S. Petri, S. Körner, Im-          [32] J.-P. Guilloux, M. Seney, N. Edgar, E. Sibille, In-
     pact of comorbidities and co-medication on dis-                 tegrated behavioral z-scoring increases the sensi-
     ease onset and progression in a large german als                tivity and reliability of behavioral phenotyping in
     patient group, Journal of Neurology 267 (2020).                 mice: Relevance to emotionality and sex, Jour-
     doi:10.1007/s00415-020-09799-z.                                 nal of neuroscience methods 197 (2011) 21–31.
[25] M. E. Heidari, J. Nadali, A. Parouhan, M. Azarafraz,            doi:10.1016/j.jneumeth.2011.01.019.
     S. M. Tabatabai, S. S. N. Irvani, F. Eskandari,            [33] D. Ienco, R. Meo, Exploration and reduction of
     A. Gharebaghi, Prevalence of depression among                   the feature space by hierarchical clustering, in:
     amyotrophic lateral sclerosis (als) patients: A sys-            Proceedings of the 2008 SIAM International Con-
     tematic review and meta-analysis, Journal of affec-             ference on Data Mining, SIAM, 2008, pp. 577–587.
     tive disorders 287 (2021) 182–190. doi:10.1016/j.          [34] J. H. Ward, Hierarchical grouping to optimize an ob-
     jad.2021.03.015.                                                jective function, Journal of the American Statistical
[26] J. M. Cedarbaum, N. Stambler, E. Malta, C. Fuller,              Association 58 (1963) 236–244.
     D. Hilt, B. Thurmond, A. Nakanishi, The alsfrs-r:          [35] K. Hopkins, G. Glass, Basic Statistics for the Behav-
     a revised als functional rating scale that incorpo-             ioral Sciences, Prentice-Hall, Englewood Cliffs, N.J.,
     rates assessments of respiratory function, Jour-                1978.
     nal of the Neurological Sciences 169 (1999) 13–            [36] J. Cohen, Statistical Power Analysis for the Behav-
     21. URL: https://www.sciencedirect.com/science/                 ioral Sciences, 2nd ed., Lawrence Erlbaum Asso-
     article/pii/S0022510X99002105. doi:https://doi.                 ciates, Publishers, Hillsdale, NJ, 1988.
     org/10.1016/S0022-510X(99)00210-5.                         [37] S. Scherer, G. Stratou, G. Lucas, M. Mahmoud,
[27] K. Kroenke, T. W. Strine, R. L. Spitzer, J. B. Williams,        J. Boberg, J. Gratch, L.-P. Morency, Automatic audio-
     J. T. Berry, A. H. Mokdad, The phq-8 as a mea-                  visual behavior descriptors for psychological dis-
     sure of current depression in the general popula-               order analysis, Image and Vision Computing 32
     tion, Journal of Affective Disorders 114 (2009) 163–            (2014) 648–658.
     173. URL: https://www.sciencedirect.com/science/           [38] S. Sorg, C. Vögele, N. Furka, A. Meyer, Perseverative
     article/pii/S0165032708002826. doi:https://doi.                 thinking in depression and anxiety, Frontiers in Psy-
     org/10.1016/j.jad.2008.06.026.                                  chology 3 (2012). URL: https://www.frontiersin.org/
[28] V. Bazarevsky, Y. Kartynnik, A. Vakunov, K. Raveen-             articles/10.3389/fpsyg.2012.00020. doi:10.3389/
     dran, M. Grundmann, Blazeface: Sub-millisecond                  fpsyg.2012.00020.
     neural face detection on mobile gpus, CoRR                 [39] B. Li, D. Lima, Facial expression recognition via
     abs/1907.05047 (2019). URL: http://arxiv.org/abs/               resnet-50, International Journal of Cognitive Com-
     1907.05047. arXiv:1907.05047.                                   puting in Engineering 2 (2021). doi:10.1016/j.
[29] O. Roesler, H. Kothare, W. Burke, M. Neumann,                   ijcce.2021.02.002.
     J. Liscombe, A. Cornish, D. Habberstad, D. Paut-           [40] V. Berisha, C. Krantsevich, P. R. Hahn, S. Hahn,
     ler, D. Suendermann-Oeft, V. Ramanarayanan, Ex-                 G. Dasarathy, P. Turaga, J. Liss, Digital medicine
     ploring facial metric normalization for within-                 and the curse of dimensionality, NPJ digital
     and between-subject comparisons in a multimodal                 medicine 4 (2021) 153.