<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Clinical Psychiatry</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3536220</article-id>
      <title-group>
        <article-title>Towards Remote Differential Diagnosis of Mental and Neurological Disorders using Automatically Extracted Speech and Facial Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vanessa Richter</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Neumann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vikram Ramanarayanan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Modality.AI, Inc.</institution>
          ,
          <addr-line>San Francisco, CA 94105</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California</institution>
          ,
          <addr-line>San Francisco, CA 94127</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>59</volume>
      <issue>1998</issue>
      <fpage>160</fpage>
      <lpage>165</lpage>
      <abstract>
        <p>Utilizing computer vision and speech signal processing to assess neurological and mental conditions remotely has the potential to help detect diseases or monitor their progression earlier and more accurately. Multimodal features have demonstrated usefulness in distinguishing cases with a disorder from controls across several health conditions. However, challenges arise in distinguishing between specific disorders during the process of differential diagnosis, where shared characteristics among different disorders may complicate accurate classification. Our aim in this study was to evaluate the utility and accuracy of automatically extracted speech and facial features for differentiating between multiple disorders in a multi-class (differential diagnosis) setting using a machine learning classifier. We use datasets comprising people with depression, bulbar and limb onset amyotrophic lateral sclerosis (ALS), and schizophrenia, in addition to healthy controls. The data was collected in a real-world scenario with a multimodal dialog system, where a virtual guide walked participants through a set of tasks that elicit speech and facial behavior. Our study demonstrates the utility of digital speech and facial biomarkers in assessing neurological and mental disorders for differential diagnosis. Furthermore, this research emphasizes the importance of combining information derived from multiple modalities for a more comprehensive understanding and classification of disorders.</p>
      </abstract>
      <kwd-group>
        <kwd>differential diagnosis</kwd>
        <kwd>multi-class</kwd>
        <kwd>mental disorders</kwd>
        <kwd>neurological disorders</kwd>
        <kwd>depression</kwd>
        <kwd>schizophrenia</kwd>
        <kwd>amyotrophic lateral sclerosis</kwd>
        <kwd>digital biomarkers</kwd>
        <kwd>dialog system</kwd>
        <kwd>speech</kwd>
        <kwd>facial</kwd>
        <kwd>multimodal</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>One out of eight individuals in the world lives with a</title>
        <p>
          mental health disorder, but most people do not have
access to efective care. 1 Moreover, disorders of the nervous
system are the second leading cause of death globally [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>The development of clinically valid digital biomarkers</title>
        <p>for neurological and mental disorders that can be
automatically extracted could significantly improve patients’
lives. This advancement has the potential to assist
clinicians in achieving quicker and more reliable diagnoses by
providing fast and objective insights into a patient’s state.</p>
      </sec>
      <sec id="sec-1-3">
        <title>Note that the idea here is not to replace the clinician,</title>
        <p>but to provide efective and assistive tools that can help
improve his/her eficiency, speed and accuracy.</p>
      </sec>
      <sec id="sec-1-4">
        <title>Many speech and facial features have shown to be</title>
        <p>
          useful in diferentiating between diferent mental and
neurological disorders and healthy controls (HCs) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-5">
        <title>However, it remains unclear how distinctly these fea</title>
        <p>
          tures characterize a given disorder. For example, percent
pause time (PPT) has been found to difer significantly
between people with ALS (pALS) and HCs [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] as well as
between people with depression symptoms and HCs [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-6">
        <title>Furthermore, a slower speaking rate diferentiates pALS</title>
        <p>
          [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] as well as people with schizophrenia [6] from HC.
To assess the utility of automatically computed digital
biomarkers to capture specific disease attributes despite
such shared characteristics, we aim to answer the
following questions:
        </p>
      </sec>
      <sec id="sec-1-7">
        <title>1. How accurately can a machine learning (ML)</title>
        <p>
          classifier diferentially distinguish between
multiple disorders – depression, schizophrenia,
bulbar symptomatic ALS and bulbar presymptomatic
ALS?
2. Which modalities and features are most useful for
this multi-class classification task – overall and
with respect to a given disorder – and how does
that compare to a binary classification baseline
(controls versus cases in each of the investigated
health conditions)?
Recently, digital speech and facial features have been
shown to yield statistically significant diferences
between cases with neurological or mental disorders and
healthy controls, exhibit high specificity and
sensitivity in discriminatory ability between those groups, or,
a high potential for disease progression and treatment
efect monitoring [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3, 6, 7, 8, 9, 10, 11, 12</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-8">
        <title>Several studies have evaluated the detection of neuro</title>
        <p>logical and mental disorders in multi-class
classification settings as compared to binary case-control studies
[13, 14, 15]. Altaf et al. [13] introduced an algorithm for</p>
      </sec>
      <sec id="sec-1-9">
        <title>Alzheimer’s disease (AD) detection validated on binary</title>
        <p>classification and multi-class classification of AD, normal
and mild cognitive impairment (MCI). Using the bag of
visual word approach, the algorithm enhances texture- Figure 1: Overview of feature extraction and dataset creation.
based features like the gray level co-occurrence matrix.</p>
      </sec>
      <sec id="sec-1-10">
        <title>It integrates clinical data, creating a hybrid feature vec</title>
        <p>tor from whole magnetic resonance (MR) brain images. interview environment and data collection. Each session
They use the Alzheimer’s Disease Neuro-imaging Initia- starts with a microphone, speaker, and camera check to
tive dataset (ADNI) and achieve 98.4% accuracy in binary ensure that the participant has given their device the
AD versus normal classification and 79.8% accuracy in permission to access camera and microphone, is able to
multi-class AD, normal, and MCI classification. hear the instructions and the captured signal is of
adeFurthermore, Hansen et al. [14] explored the poten- quate quality. After these tests the virtual guide involves
tial of speech patterns as diagnostic markers for mul- participants in a structured conversation that consists of
tiple neuropsychiatric conditions by examining record- exercises (speaking tasks, open-ended questions, motor
ings from 420 participants with major depressive disor- abilities) to elicit speech, facial and motor behaviors
relder, schizophrenia, autism spectrum disorder, and non- evant to the type of disease being studied. In this work,
psychiatric controls. Various models were trained and we focus on tasks that were shared across multiple study
tested for both binary and multi-class classification tasks protocols for diferent disease conditions: (a) sentence
inusing speech and text features. While binary classifica- telligibility test (SIT), (b) diadochokinesis (DDK), (c) read
tion models exhibited comparable performance to prior speech, and (d) a picture description task. For (a),
parresearch (F1: 0.54–0.92), multi-class classification showed ticipants were asked to read individual SIT sentences of
a notable decrease in performance (F1: 0.35–0.75). The varying lengths (5-15 words2), while (b) required reading
study further demonstrates that combining voice- and a longer passage (Bamboo reading passage, 99 words). To
text-based models enhances overall performance by 9.4% assess DDK skills (c), participants were asked to repeat a
F1 macro, highlighting the potential of a multimodal pattern of syllables (/pa ta ka/) as fast as they can until
approach for more accurate neuropsychiatric condition they run out of breath and (d) prompted users to describe
classification While these studies show the efectiveness a scene in a picture that was shown to them on screen.
of diferent types of speech- and facial-derived features These tasks are inspired by previous work [17, 18, 19].
for assessing psychiatric conditions in diferential
diagnosis settings, none of them utilized ’in-the-wild‘ data
collected remotely from participants devices with a mul- 3.1. Datasets
timodal dialog system.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Multimodal Dialog Platform and Data Collection</title>
      <sec id="sec-2-1">
        <title>Audiovisual data was collected using NEMSI (Neurologi</title>
        <p>cal and Mental health Screening Instrument) [16], a
multimodal dialog system for remote health assessments. An
overview of the dataset creation process is illustrated in
Figure 1. A virtual guide, Tina, led study participants
through various tasks that are designed to elicit speech,
facial, and motor behaviors. Having an interactive virtual
guide to elicit participants’ behavior allows for scalability
while providing a natural but controlled and objective</p>
      </sec>
      <sec id="sec-2-2">
        <title>An overview of the data used in this study is given in</title>
        <p>Table 1. While some datasets for a disease may be small,
there is a subset of tasks that are shared across research
studies. Since the data is collected in the same way
(remotely with a personal electronic device), we can
create a larger dataset for the healthy population across
studies to get a more accurate representation of the
properties of normative behavior. For the larger dataset
of healthy controls, we identify age-related trends as
well as collinerarity of features. This information is used
to correct control as well as patient feature values from</p>
      </sec>
      <sec id="sec-2-3">
        <title>2In the remainder of the paper, the diferent SIT sentence lengths</title>
        <p>are treated as separate tasks and are denoted as SIT_n, where n is
the length in words.</p>
        <sec id="sec-2-3-1">
          <title>Participants</title>
        </sec>
        <sec id="sec-2-3-2">
          <title>Sessions</title>
        </sec>
        <sec id="sec-2-3-3">
          <title>Mean Age (SD)</title>
          <p>Controls
Female 408 (63%) 655 (62.8%)
Male 240 (37%) 388 (37.2%)
All 648 1043
Schizophrenia
Female 10 (24.4%) 19 (26.4%)
Male 31 (75.6%) 53 (73.6%)
All 41 72
Depression
Female 66 (79.5%) 76 (79.2%)
Male 17 (20.5%) 20 (20.8%)
All 83 96
Bulbar Symptomatic ALS
Female 38 (48.1%) 67 (46.2%)
Male 41 (51.9%) 78 (53.8%)
All 79 145
Bulbar Presymptomatic ALS
Female 31 (50%) 54 (50.5%)
Male 31 (50%) 53 (49.5%)
All 62 107
46.3 (16.4)
46.2 (16.0)
46.3 (16.2)
age efects and remove feature redundancies.
3.1.1. Schizophrenia</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>Schizophrenia is a chronic brain disorder that afects</title>
        <p>approximately 24 million or 1 in 300 people (1 in 222
in adults)3 worldwide. According to the American
Psychiatric Association (APA), active schizophrenia may be
characterized by episodes in which the afected individual
cannot distinguish between real and unreal experiences.4
Among individuals with schizophrenia, psychiatric and
medical comorbidities such as substance abuse, anxiety
and depression are common [20, 21, 22]. Buckley et al.
pointed out that depression is estimated to afect half of
the patients. These comorbidities, as well as the variation
in symptoms and medications, make the identification of
multimodal biomarkers for schizophrenia a dificult task.</p>
      </sec>
      <sec id="sec-2-5">
        <title>As can be seen in Table 1, we assessed 41 individuals</title>
        <p>with a diagnosis of schizophrenia at a state psychiatric
facility in New York, NY. The study was approved by the
Nathan S. Kline Institute for Psychiatric Research and we
obtained written informed consent from all participants
at the time of screening after explaining details of the
study. The assessment of both patients and controls was
overseen by a psychiatrist.</p>
      </sec>
      <sec id="sec-2-6">
        <title>3https://www.who.int/news-room/fact-sheets/detail/</title>
        <p>schizophrenia, accessed 05/19/2023</p>
      </sec>
      <sec id="sec-2-7">
        <title>4https://www.psychiatry.org/patients-families/schizophrenia/</title>
        <p>what-is-schizophrenia, accessed 05/19/2023
ALS is a neurological disease that afects nerve cells in
the brain and spinal cord that control voluntary
muscle movement. The disease is progressive and there is
currently no cure or efective treatment to reverse its
progression.5. Global estimates of ALS prevalence range
from 1.9 to 6 per 100,000.6 Studies on ALS found
comorbidity with dementia, parkinsonism and depressive
symptoms [23]. Diekmann et al. [24] found depression
to occur statistically significantly more often in pALS
compared to HC. In addition, Heidari et al. [25] found
in a meta-analysis of 46 eligible studies that the pooled
prevalence of depression among individuals with ALS to
be 34%, with mild, moderate, and severe depression rates
at 29%, 16%, and 8%, respectively.</p>
          <p>As shown in Table 1, data from 79 ALS bulbar symptomatic (BS) and 62 ALS bulbar pre-symptomatic (BP) patients were collected in cooperation with EverythingALS and the Peter Cohen Foundation7. In addition to the assessment of speech and facial behavior, participants filled out the ALS Functional Rating Scale-Revised (ALSFRS-R), a standard instrument for monitoring the progression of ALS [26]. The questionnaire comprises 12 questions about physical ability, with each function’s rating ranging from normal function (score 4) to severe disability (score 0). It includes four scales for different domains affected by the disorder: bulbar system, fine and gross motor skills, and respiratory function. The ALSFRS-R score is the total of the domain sub-scores, the sum ranging from 0 to 48. For this study, pALS were stratified into the following sub-cohorts based on their bulbar subscore (the sum of the first three ALSFRS-R questions): (a) BS ALS with a bulbar subscore &lt; 12 and (b) BP ALS with a bulbar subscore = 12.</p>
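          <p>For illustration, this stratification rule can be written as a short sketch (the ALSFRS-R item column names q1–q3 are hypothetical):</p>
          <preformat>
# Sketch: stratify pALS into BS/BP sub-cohorts by the ALSFRS-R bulbar
# subscore (sum of the first three questions); column names hypothetical.
import pandas as pd

def stratify_als(alsfrs):
    bulbar = alsfrs[["q1", "q2", "q3"]].sum(axis=1)  # subscore in 0..12
    labels = ["BS" if score &lt; 12 else "BP" for score in bulbar]
    return pd.Series(labels, index=alsfrs.index, name="als_subcohort")
          </preformat>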
          <p>5: https://www.ninds.nih.gov/health-information/disorders/amyotrophic-lateral-sclerosis-als, accessed 05/19/2023</p>
          <p>6: https://www.targetals.org/2022/11/22/epidemiology-of-als-incidence-prevalence-and-clusters/, accessed 05/19/2023</p>
          <p>7: https://www.everythingals.org/research</p>
        </sec>
        <sec id="sec-2-5">
          <title>3.1.3. Depression</title>
          <p>Depression is a common mental health disorder characterized by persistent sadness and a lack of interest or pleasure in previously enjoyable activities. In addition, fatigue and poor concentration are common. The effects of depression can be long-lasting or recurrent and can drastically affect a person’s ability to lead a fulfilling life. The disorder is one of the most common causes of disability in the world.8 One in six people (16.6%) will experience depression at some point in their lifetime.9</p>
          <p>A well-established tool for assessing depression is the Patient Health Questionnaire (PHQ)-8 [27]. The PHQ-8 score ranges from 0 to 24 (a higher score indicates more severe depression symptoms). We investigated at least moderately severe depression cases, based on a cutoff of PHQ-8 ≥ 15. The data for this study, including the completion of the PHQ-8 questionnaire, was collected through crowd-sourcing, resulting in a sample of 83 individuals that scored at or above this cutoff. Statistics for this cohort are summarized in Table 1.</p>
          <p>8: https://www.who.int/health-topics/depression, accessed 06/20/2023</p>
          <p>9: https://www.psychiatry.org/patients-families/depression/what-is-depression, accessed 06/20/2023</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Methods</title>
      <sec id="sec-3-1">
        <title>Our procedure is divided into the following stages: (1) fea</title>
        <p>ture extraction, (2) preprocessing, (3) age-correction and
sex-normalization, (4) redundancy and efect size
analysis, and finally (5) classification (binary and multi-class)
and evaluation.
4.1. Multimodal Metrics Extraction</p>
      </sec>
      <sec id="sec-3-2">
        <title>In this and the following sections, we use the following</title>
        <p>terminology: Metric denotes a speech or facial metric in
general, and Feature denotes a specific combination of a
metric extracted from a certain task, e.g. speaking rate
for the SIT task.</p>
        <p>
          Both speech and facial metrics were extracted from
the audiovisual recordings (overview in Table 2). To
extract facial metrics, we used the Mediapipe FaceMesh
software10. More specifically, MediaPipe’s Face
Detection is based on BlazeFace [28] and determines the (x,
y)-coordinates of the face for every frame. Subsequently,
468 facial landmarks are identified using MediaPipe
FaceMesh. We selected 14 key landmarks to compute
functionals of facial behavior. Distances between
landmarks were normalized by dividing them by the
intercaruncular distance. In terms of between- as well as
within-subject analyses, when the same position
relative to the camera cannot be assumed, Roesler et al. [29]
found this to be the most reliable method of
normalization. More details and a visual depiction of the
landmarks used to calculate facial features can be found in
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Speech metrics were computed using Praat [30] and
cover diferent domains, such as energy, timing, voice
quality and frequency.
4.2. Preprocessing
        </p>
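        <p>As a minimal sketch of how such a facial metric can be computed, the snippet below derives a normalized lip aperture series from MediaPipe FaceMesh output; the landmark indices are illustrative assumptions, and the actual 14 key landmarks are documented in [4]:</p>
        <preformat>
# Sketch: lip aperture per frame, normalized by the intercaruncular
# distance (landmark indices are illustrative, cf. [4]).
import numpy as np
import mediapipe as mp

def lip_aperture_series(frames):
    apertures = []
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=False) as mesh:
        for frame in frames:                      # frames: RGB numpy arrays
            result = mesh.process(frame)
            if not result.multi_face_landmarks:
                continue                          # no face detected
            lm = result.multi_face_landmarks[0].landmark
            p = lambda i: np.array([lm[i].x, lm[i].y])
            # Inner eye corners approximate the intercaruncular distance.
            norm = np.linalg.norm(p(133) - p(362))
            apertures.append(np.linalg.norm(p(13) - p(14)) / norm)
    return np.array(apertures)

# Functionals such as the mean and max of this series yield the
# mouth metrics listed in Table 2.
        </preformat>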
      </sec>
      <sec id="sec-3-3">
        <title>We applied the following approach to handle missing</title>
        <p>data, which can occur for a number of reasons, including
incomplete sessions, technical issues, or network
problems. First, on the session level, we removed participant
10https://google.github.io/mediapipe/</p>
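        <p>A minimal sketch of this two-stage filtering and cohort-wise imputation, assuming a DataFrame with one row per session and one column per feature:</p>
        <preformat>
# Sketch: session- and feature-level missingness filters, then
# cohort-mean imputation (applied to train and test separately).
import pandas as pd

def preprocess(df, cohorts):
    # 1) Remove sessions with more than 15% missing features.
    df = df.loc[df.isna().mean(axis=1) &lt;= 0.15]
    # 2) Remove features with more than 10% missing values.
    df = df.loc[:, df.isna().mean(axis=0) &lt;= 0.10]
    # 3) Impute remaining gaps with the mean of the session's cohort.
    cohorts = cohorts.loc[df.index]
    return df.groupby(cohorts).transform(lambda col: col.fillna(col.mean()))
        </preformat>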
      </sec>
      <sec id="sec-3-4">
        <title>Similar to the approach in Falahati et al. [31], we applied</title>
        <p>a linear correction algorithm to both patient and
control data based on age-related changes in the HC cohort.
By calculating age trends and coeficients on healthy
controls, we aim to obtain the most accurate estimate
of purely age-related changes without the confounding
efects of disease-related influences. In detail, for each
feature, we fit a linear regression model to age as the
independent and the feature as the dependent variable,
modeling the age-related changes as a linear deviation.
This is done separately for males and females to obtain
a sex-specific result. Then, the sex-specific regression
coeficients are used to correct feature values for age
by subtracting the product of coeficient and age from
the feature value for each participant. To account for
sex-related diferences, we applied sex-specific z-scoring
to normalize the features. Z-normalization is a
methodology that allows for the comparison or compilation of
observations of diferent cohorts [ 32]. In addition, the
normalization process ensures the comparability of
features on diferent scales by centering the feature
distributions around zero with a standard deviation of one. First,
the dataset to analyze was divided into male and female
participants. Then, each feature was normalized within
each sex group using z-scoring.
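        <p>A sketch of the correction and normalization, assuming a tidy DataFrame with age and sex columns (the regression coefficients are fitted on healthy controls only):</p>
        <preformat>
# Sketch: subtract the HC-estimated linear age trend per feature and sex,
# then z-score within each sex group.
import numpy as np

def age_coefficients(hc, features):
    # Per-sex slope of each feature regressed on age (healthy controls).
    return {sex: {f: np.polyfit(grp["age"], grp[f], 1)[0] for f in features}
            for sex, grp in hc.groupby("sex")}

def age_correct_and_zscore(df, coefs, features):
    out = df.copy()
    for sex, grp in out.groupby("sex"):
        for f in features:
            corrected = grp[f] - coefs[sex][f] * grp["age"]
            # Sex-specific z-scoring after removing the age trend.
            out.loc[grp.index, f] = (corrected - corrected.mean()) / corrected.std()
    return out
        </preformat>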
      </sec>
      <sec id="sec-3-4">
        <title>4.4. Redundancy Analysis and Effect Sizes</title>
        <p>To identify collinear features and reduce the high-dimensional feature space, we performed hierarchical clustering on the Spearman rank-order correlations using the age-corrected and sex-normalized larger healthy control dataset. We applied the clustering for speech and facial features separately. The clustering procedure is motivated by the approach in Ienco and Meo [33]. It is based on Ward’s method [34], which aims at minimising within-cluster variance. We implemented it using the scikit-learn library11. A dendrogram was plotted to inspect the correlations between features visually and to determine a suitable distance threshold for generating feature clusters. The threshold choice was based on two major factors: (a) balance between speech and facial clusters, as we target roughly an equal number to avoid predominance of one modality over the other, and (b) expert knowledge about the different task and feature domains (e.g. timing versus voice quality features, jaw versus eye movement, or read versus free speech), which resulted in the clusters shown in Table 3 and Table 4. The clusters are used in the feature selection process as described in Section 4.5.</p>
        <p>Statistical tests to assess the statistical significance, as well as the magnitude and direction of effects for a given comparison, were conducted within classification folds and as part of a post hoc analysis. Effect sizes were calculated using Glass’s Delta [35]. Here, only features showing statistical significance (p &lt; 0.05) in the Mann-Whitney U-test (MWU) were considered.</p>
        <p>11: https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance_multicollinear.html</p>
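        <p>A sketch of the clustering step, following the scikit-learn multicollinearity example linked in footnote 11 (the distance threshold is the empirically chosen value discussed above):</p>
        <preformat>
# Sketch: Ward clustering on Spearman rank-order correlations of the
# HC feature matrix X (sessions x features); the threshold is chosen
# by inspecting the dendrogram.
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def cluster_features(X, threshold):
    corr = spearmanr(X).correlation       # feature-by-feature correlations
    corr = (corr + corr.T) / 2            # enforce symmetry
    np.fill_diagonal(corr, 1.0)
    dist = squareform(1 - np.abs(corr))   # condensed distance matrix
    linkage = hierarchy.ward(dist)
    # hierarchy.dendrogram(linkage)       # visual threshold inspection
    return hierarchy.fcluster(linkage, t=threshold, criterion="distance")
        </preformat>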
        <sec id="sec-3-4-1">
          <title>Energy</title>
          <p>o Timing
i
d
u</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>A Specific to DDK</title>
        </sec>
        <sec id="sec-3-4-3">
          <title>Voice quality</title>
        </sec>
        <sec id="sec-3-4-4">
          <title>Frequency</title>
          <p>Jaw
eo Lower Lip
id Mouth
V Eyes
signal-to-noise ratio (SNR, dB)
speaking &amp; articulation duration/rate (sec./WPM), percent pause time (PPT, %),
canonical timing agreement (CTA, %)
cycle-to-cycle temporal variability (cTV, sec.), syllable rate (syl./sec.), number of syllables
shimmer (%), harmonics-to-noise ratio (HNR, dB), jitter (%)
mean, min, max &amp; standard deviation (stdev) of fundamental frequency (F0, Hz)
mean, min &amp; max speed/acceleration/jerk of the jaw center (JC)
mean, min &amp; max speed/acceleration/jerk of the lower lip (LL)
mean &amp; max lip aperture, lip width, mouth surface area; mean mouth symmetry ratio
mean &amp; max eye opening
      </sec>
      <sec id="sec-3-5">
        <title>4.5. Classification</title>
        <p>For both the binary and multi-class classification experiments, we used a multilayer perceptron (MLP), which was implemented using the scikit-learn library. The MLP has one hidden layer. We experimented with adding more hidden layers, but found that the minimal configuration with only one layer was beneficial in terms of performance. The hidden layer size h was determined dynamically as h = (f + c) / 2 (1), where f is the number of selected features and c the number of classes. The model was trained with a maximum of 10,000 iterations to allow sufficient time for convergence during training. Model training was stopped when the loss or score was not improving by a defined tolerance threshold; here, we used scikit-learn’s default of 1e-4. Additionally, the alpha parameter was set to 0.001, controlling the regularization strength to prevent overfitting. The sgd (stochastic gradient descent) solver was used for optimization during training. The batch size was set to auto, enabling the model to determine the appropriate batch size during training. We used the rectified linear unit function as the activation function.</p>
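        <p>The corresponding scikit-learn configuration might look as follows (a sketch; only the parameters named above are set explicitly):</p>
        <preformat>
# Sketch: the MLP described above, with the hidden layer size of Eq. (1).
from sklearn.neural_network import MLPClassifier

def build_mlp(n_features, n_classes):
    hidden = (n_features + n_classes) // 2   # Eq. (1): h = (f + c) / 2
    return MLPClassifier(
        hidden_layer_sizes=(hidden,),  # single hidden layer
        activation="relu",             # rectified linear unit
        solver="sgd",                  # stochastic gradient descent
        alpha=0.001,                   # regularization strength
        batch_size="auto",
        max_iter=10_000,
        tol=1e-4,                      # scikit-learn's default tolerance
    )
        </preformat>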
        <sec id="sec-3-4-5">
          <title>Metrics</title>
          <p>SNR
CTA
PPT
articulation/speaking duration</p>
        </sec>
        <sec id="sec-3-4-6">
          <title>SNR, syl.rate, syl.count &amp; cTV articulation/speaking rate/time articulation/speaking rate/time</title>
        </sec>
        <sec id="sec-3-4-7">
          <title>Tasks all all all</title>
        </sec>
        <sec id="sec-3-4-8">
          <title>Picture Description</title>
          <p>DDK
SIT_{5,9}
SIT_{7,11,13,15},</p>
        </sec>
        <sec id="sec-3-4-9">
          <title>Reading passage</title>
          <p>DDK
all except DDK
all except DDK
all except DDK
all
all
∑︀</p>
        </sec>
        <sec id="sec-3-4-10">
          <title>Lip movement (1)</title>
        </sec>
        <sec id="sec-3-4-11">
          <title>Lip width</title>
        </sec>
        <sec id="sec-3-4-12">
          <title>Mouth opening</title>
        </sec>
        <sec id="sec-3-4-13">
          <title>Lip movement (2)</title>
        </sec>
        <sec id="sec-3-4-14">
          <title>Jaw movement (1)</title>
        </sec>
        <sec id="sec-3-4-15">
          <title>Jaw movement (2)</title>
        </sec>
        <sec id="sec-3-4-16">
          <title>Jaw movement (3)</title>
        </sec>
        <sec id="sec-3-4-17">
          <title>Jaw movement (4)</title>
        </sec>
        <sec id="sec-3-4-18">
          <title>Jaw movement (5)</title>
        </sec>
        <sec id="sec-3-4-19">
          <title>Mouth symmetry</title>
        </sec>
        <sec id="sec-3-4-20">
          <title>Eye opening</title>
          <p>all except DDK
all
all
DDK
DDK
SIT_7
SIT_5</p>
        </sec>
        <sec id="sec-3-4-21">
          <title>Picture Description</title>
          <p>SIT_{9,11,13,15}, RP,
Picture Description
all
all
∑︀</p>
        <p>Ten-fold cross-validation was applied for evaluation in order to maximize the utilization of data for both training and testing purposes. To avoid bias towards the majority group, we created datasets that consist of an equal number of samples in each disease condition. For each individual participant, we consider, if available, the first two sessions as data points. Because of the equality constraint, the number of data points was limited by the smallest dataset (schizophrenia). This resulted in 72 randomly selected data points per cohort, summing up to a total of 360 data points. The classification experiments are run ten times to smooth out performance variations and obtain more representative results. We split the data using scikit-learn’s StratifiedGroupKFold to make sure that sessions from the same participant are either in the respective training or testing fold. In each fold, we imputed missing values and standardized features by sex using z-scoring. This was done separately for training and test sets.</p>
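        <p>A sketch of the evaluation loop (array names are placeholders; per-fold imputation and z-scoring follow Sections 4.2 and 4.3):</p>
        <preformat>
# Sketch: ten-fold CV where sessions of one participant never span
# the train and test folds.
from sklearn.model_selection import StratifiedGroupKFold

def evaluate(X, y, participant_ids, build_model):
    cv = StratifiedGroupKFold(n_splits=10)
    scores = []
    for train_idx, test_idx in cv.split(X, y, groups=participant_ids):
        # Imputation, sex-wise z-scoring, and feature selection are
        # performed per fold, separately on the two partitions.
        model = build_model()
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return scores
        </preformat>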
        <p>As a benchmark, we evaluated binary classification performance of models aimed at distinguishing cases with a disorder from controls. Here, for each cluster of collinear features as described in Section 4.4, the one with the highest effect size was selected for the final feature set as input to the classifier. If no feature showed statistically significant differences between cases and controls in a given cluster, no feature was selected. Hence, the number of clusters determines the maximum number of features fed into the classifier. Statistical significance and effect sizes for each feature were calculated as described in the previous section.</p>
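        <p>The per-cluster selection rule can be sketched as follows (Glass’s Delta is computed with the control group’s standard deviation, which is how the statistic is usually defined):</p>
        <preformat>
# Sketch: per cluster, keep the feature with the largest absolute
# Glass's Delta among those significant under the Mann-Whitney U-test.
import numpy as np
from scipy.stats import mannwhitneyu

def glass_delta(cases, controls):
    return (np.mean(cases) - np.mean(controls)) / np.std(controls, ddof=1)

def select_features(cases, controls, clusters):
    # cases/controls: dicts mapping feature names to value arrays;
    # clusters: dict mapping cluster ids to lists of feature names.
    selected = []
    for feats in clusters.values():
        best, best_effect = None, 0.0
        for f in feats:
            p = mannwhitneyu(cases[f], controls[f]).pvalue
            d = glass_delta(cases[f], controls[f])
            if p &lt; 0.05 and abs(d) &gt; abs(best_effect):
                best, best_effect = f, d
        if best is not None:      # clusters without a significant
            selected.append(best)  # feature contribute nothing
    return selected
        </preformat>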
        <p>In a second step, we performed 4-class classification, incorporating all the investigated neurological and mental disorders. Here, feature selection was done based on pairwise comparisons of all disease cohorts (e.g. Depression vs. Schizophrenia cases, Schizophrenia vs. BS ALS cases, BS ALS vs. Depression cases, and so on). We merged the selected features from these comparisons as input to the classifier. Therefore, multiple features from the same cluster could be included in one feature set. We allowed a certain amount of redundancy compared to the case-control baseline in order to account for the complexity associated with multiple comparisons. For both experiments, classification performance was evaluated in terms of F1 score, sensitivity, and specificity.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Binary Classification Baseline</title>
        <table-wrap id="tab5">
          <label>Table 5</label>
          <caption><p>Binary classification results. In each row, we highlighted the highest performance in terms of F1. HC: Healthy Controls, DEP: Depression, SCHIZ: Schizophrenia, SEN: Sensitivity, SP: Specificity.</p></caption>
          <table>
            <thead>
              <tr><th>Cohort</th><th>Speech F1</th><th>Facial F1</th><th>Speech + Facial F1</th><th>SEN</th><th>SP</th></tr>
            </thead>
            <tbody>
              <tr><td>DEP vs. HC</td><td>0.64</td><td>0.59</td><td>0.65</td><td>0.65</td><td>0.65</td></tr>
              <tr><td>SCHIZ vs. HC</td><td>0.82</td><td>0.64</td><td>0.83</td><td>0.85</td><td>0.82</td></tr>
              <tr><td>BP ALS vs. HC</td><td>0.54</td><td>0.51</td><td>0.52</td><td>0.52</td><td>0.53</td></tr>
              <tr><td>BS ALS vs. HC</td><td>0.84</td><td>0.63</td><td>0.83</td><td>0.82</td><td>0.83</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>As can be seen in Table 5, we observe a good performance in classifying controls versus BS ALS (speech features alone; F1-score: 0.84) and schizophrenia (combined speech and facial; F1-score: 0.83) cases, respectively. The binary classification of depression did not perform as well; however, it still surpassed the random chance baseline (combined speech and facial; F1-score: 0.65). The classifier struggled to distinguish controls from BP ALS cases, where we observed performance just above random chance across modalities. Furthermore, the performance with regard to sensitivity and specificity is relatively balanced across comparisons.</p>
        <p>In depression and schizophrenia, combining speech and facial modalities resulted in improved classification performance compared to speech or facial features alone, as shown in Table 5. However, adding facial information did not enhance performance for the BP or BS ALS cohorts compared to utilizing speech features alone.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Multi-Class Classification</title>
        <table-wrap id="tab6">
          <label>Table 6</label>
          <caption><p>Multi-class classification results. In each row, we highlight the highest F1 score performance. HC: Healthy Controls, DEP: Depression, SCHIZ: Schizophrenia, SEN: Sensitivity, SP: Specificity.</p></caption>
          <table>
            <thead>
              <tr><th>Cohort</th><th>Speech F1</th><th>Facial F1</th><th>Speech + Facial F1</th><th>SEN</th><th>SP</th></tr>
            </thead>
            <tbody>
              <tr><td>SCHIZ</td><td>0.72</td><td>0.53</td><td>0.72</td><td>0.72</td><td>0.91</td></tr>
              <tr><td>BP ALS</td><td>0.55</td><td>0.36</td><td>0.57</td><td>0.57</td><td>0.86</td></tr>
              <tr><td>BS ALS</td><td>0.62</td><td>0.47</td><td>0.64</td><td>0.65</td><td>0.88</td></tr>
              <tr><td>DEP</td><td>0.61</td><td>0.46</td><td>0.64</td><td>0.64</td><td>0.88</td></tr>
              <tr><td>Average</td><td>0.63</td><td>0.46</td><td>0.64</td><td>0.65</td><td>0.88</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>In the 4-class experiment aimed at discriminating between all investigated neurological and mental disorders, we achieve the best overall performance (F1-score: 0.64) by utilizing both speech and facial features, as shown in Table 6. Overall, the specificity (average: 0.88) for the disorders examined is considerably higher than the sensitivity (average: 0.65). This indicates that the classifier is more effective at avoiding false-positive results than identifying true positives. In most cases, namely for BS ALS, BP ALS and depression, the per-class F1-score is highest when combining speech and facial features. There is no performance difference between using only speech or speech and facial features for identifying schizophrenia.</p>
        <fig id="fig2">
          <label>Figure 2</label>
          <caption><p>Normalized confusion matrix for the 4-class classification. The x-axis shows the true labels, the y-axis the predicted ones.</p></caption>
        </fig>
        <p>Figure 2 shows a confusion matrix that indicates the percentage of accurate class predictions and the classes with which they were confused. The model was most confident in detecting schizophrenia (72.22%), followed by BS ALS (64.58%) and depression (63.75%). The model faced its greatest challenge in accurately predicting BP ALS (57.22%), yet it still performs notably above chance in a 4-class classification scenario. BP ALS and depression cases were most often confused with each other. Schizophrenic patients were least often confused with other cohorts. Among the cases of BS ALS, the most frequent confusion occurred with BP ALS patients (16.11%).</p>
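        <p>The normalized confusion matrix of Figure 2 can be reproduced with scikit-learn (a sketch; the label order is an assumption):</p>
        <preformat>
# Sketch: row-normalized confusion matrix in percent
# (rows: true classes, columns: predicted classes).
from sklearn.metrics import confusion_matrix

LABELS = ["SCHIZ", "BP ALS", "BS ALS", "DEP"]   # assumed order

def percent_confusion(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred, labels=LABELS, normalize="true")
    return 100.0 * cm   # e.g. diagonal entries 72.22, 57.22, 64.58, 63.75
        </preformat>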
        <p>The features that we identified to be consistently chosen across classification folds (Table 7) are predominantly speech features of the timing, voice quality, and energy domains. In addition, two facial features are selected across folds, concerning the maximum lip width and the maximum absolute acceleration of jaw movements. We conducted a post hoc analysis of effect sizes between HC and cases with a disorder for these features to gain further insight into disorder-specific importance. Here, positive effect sizes represent feature values that are larger for cases with a disorder than controls. Conversely, negative values represent larger feature values for controls than cases with a disorder12. In schizophrenia, we find all of the features consistently selected across classification folds to be statistically significant when compared to HC. With respect to the other cohorts, the largest effects are shown for CTA (-1.44 for SIT_13) and speaking rate (-2.00 for RP). This shows that patients exhibit a lower CTA, a measure of phonetic alignment between their own speech and that of the virtual guide, while speaking slower. We also observed a smaller average lip width as an important feature that shows the largest effect between HC and depression cases compared to the other cohorts. This may be associated with decreased emotional expressivity, as indicated by reduced smiling and increased frowning. These findings align with previous studies highlighting similar patterns of emotional expressiveness in depression [37, 38]. Few and small differences compared to controls are revealed for BP ALS cases. This is also the cohort with the lowest performance across classification experiments. In BS ALS, we found the largest effects for SNR and speaking rate. Another feature that stood out is cTV in the DDK task, a measure that captures the temporal variability, i.e. the consistency or irregularity in the timing of speech patterns, between consecutive cycles of speech.</p>
        <p>12: We follow the commonly used effect size magnitude thresholds as suggested in Cohen [36] – small: 0.2–0.5, medium: 0.5–0.8, and large: &gt; 0.8.</p>
        <table-wrap id="tab7">
          <label>Table 7</label>
          <caption><p>Features consistently selected across classification folds and their cluster domains.</p></caption>
          <table>
            <thead>
              <tr><th>Feature</th><th>Cluster domain</th></tr>
            </thead>
            <tbody>
              <tr><td>max abs acc. JC (RP)</td><td>Jaw movement</td></tr>
              <tr><td>max lip width (SIT 11)</td><td>Lip width</td></tr>
              <tr><td>shimmer (DDK)</td><td>Voice quality</td></tr>
              <tr><td>shimmer (SIT 5)</td><td>Voice quality</td></tr>
              <tr><td>jitter (SIT 9)</td><td>Voice quality</td></tr>
              <tr><td>CTA (SIT 13)</td><td>Timing alignment</td></tr>
              <tr><td>SNR (DDK)</td><td>Energy</td></tr>
              <tr><td>speaking rate (RP)</td><td>Timing, speaking</td></tr>
              <tr><td>speaking rate (SIT 7)</td><td>Timing, speaking</td></tr>
              <tr><td>HNR (DDK)</td><td>Voice quality</td></tr>
              <tr><td>HNR (SIT 15)</td><td>Voice quality</td></tr>
              <tr><td>cTV (DDK)</td><td>Energy &amp; articulation skills</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>While many features are shared in terms of indicating a signal between cases with a disorder and controls, it is mostly the magnitude of the effect that differentiates them, as well as how they combine. However, there are also a few features that show a different direction of effect across cohorts. For example, in BS ALS, compared to other cohorts, we observed the largest effect for shimmer (DDK, -0.63), which measures the variation in amplitude of the vocal folds during the speech signal. There is no effect observed for the BP ALS or depression cohorts, while in schizophrenia, the direction of effect is the opposite (0.35).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>We explored the utility of speech and facial features extracted by a multimodal dialog system for differential classification of ALS, depression and schizophrenia. Note that the idea here is not to replace clinicians, but to provide effective and assistive tools that can help improve their efficiency, speed and accuracy. Overall, combining speech and facial information proved to be beneficial for identifying several disorders in both multi-class and binary classification experiments. In addition, our automated feature analysis indicates several features that show relevance across experiments. While some of these features are intuitively identifiable by human experts as markers of a given disorder (for example, a slower speaking rate or a lower intelligibility), such an analysis also allows discovery of other features that might be harder to detect or identify objectively by human experts, such as quicker facial movements.</p>
      <p>That being said, we acknowledge the importance of contextualizing the promise of such multimodal methodologies for differential diagnosis with several caveats. First, the performance of any machine learning classifier trained for this purpose will depend on the specific conditions being studied and the range and heterogeneity of symptoms presented in each case. For example, in this study we investigated four specific conditions – schizophrenia, depression, bulbar symptomatic (BS) and bulbar presymptomatic (BP) ALS – and we observed that schizophrenia (where the facial modality is particularly good at capturing characteristics exhibited therein, such as anhedonia, blunted affect, etc.) and BS ALS (which is characterized by speech motor deficits, reflected in the timing, rate and intelligibility of speech), quite different in terms of symptom presentation, exhibit greater separability relative to other classes for differential classification. For both BS ALS and schizophrenia, our analysis demonstrates a robust discriminatory capability to effectively distinguish these cohorts from healthy controls, as well as other neurological and mental disorders, in binary and multi-class experiments. However, the overall higher specificity of the multi-class classifier implies a robust capability to accurately identify non-cases, effectively minimizing false positives. Yet, the lower sensitivity suggests limitations in the identification of true cases for the analyzed disorders, likely due to the imposed strong restrictions. In BS ALS, speech features alone demonstrate superior performance when comparing this group with controls. Yet, in the more intricate task of differential diagnosis, performance improves when speech features are combined with facial information. For schizophrenia, the combination of speech and facial modalities proves most effective in both binary and multitask experiments. In contrast, BP ALS, which does not present with as many speech and facial motor deficits, is much less separable even in binary classification, let alone in the multi-class classification context, highlighting the challenging nature of detecting this condition. Furthermore, for the misidentified BS ALS cases, the classifier most frequently categorized them as BP ALS. Although distinguishing BP ALS cases from controls is challenging, this outcome indicates that the classifier may be able to capture condition-specific information from features that are shared across different stages of ALS, which may have led to this confusion. Finally, in evaluating depression, best performance in both binary and multi-class classification experiments is achieved by combining speech and facial information.</p>
      <sec id="sec-3-5">
        <title>The overall accuracy in discerning depression from other</title>
        <p>cohorts is notably lower compared to schizophrenia or
BS ALS. The variability introduced by the wide range
and time horizon of potential symptoms present in
depression as well as medication status might contribute
to lower diferential diagnosis accuracy. That being said,
a significant limitation of the present study is the lack
of information about co-morbidities to factor into our
analysis, since datasets were collected independently.
Future research will aim to explicitly address this gap by
capturing, for instance, information about co-morbid
depression in ALS or schizophrenia (e.g., through PHQ-8
scales), that might help us better stratify these cohorts.</p>
      </sec>
      <sec id="sec-3-6">
        <title>Second, this study focused on a restricted set of tasks,</title>
        <p>primarily focusing on reading abilities and picture
description assessments. However, these task-feature
combinations alone may not fully capture the nuances of each
disorder.</p>
        <p>Third, while we focused on interpretable features in
this study, less interpretable ones, such as log mel
spectrograms or Mel Frequency Cepstral Coeficients (MFCCs)
may be able to capture more nuanced and complex
patterns in the data. Additionally, more sophisticated deep
learning approaches for representation learning could
be applied, such as Res-Net 50 [39] in the facial
modality. While such features can be powerful in capturing
subtle details and nuances of audiovisual behavior, the
inner workings of the deep learning model are not easily
explainable or interpretable by non-experts.</p>
        <p>Fourth, our sample size is not representative enough
to truly claim generalizability of findings. The smaller
the sample, the larger the risk of having model “blind
spots” that in turn lead to variable estimates of true model
performance on unseen real world data, giving algorithm
designers an inaccurate sense of how well a model is
performing during development [40].</p>
        <p>Our results argue for the importance of a hybrid
approach to diferential diagnosis going forward, combining
knowledge-driven and data-driven approaches.
Understanding specific disease pathologies and symptoms can
in turn help in developing features and learning
methodologies that lead to better separability of disease
conditions. Future work will also focus on improving
diferential diagnosis performance in a manner that is both
generalizable and explainable.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <title>This work was funded in part by the National Institutes</title>
        <p>of Health grant R42DC019877. We thank all study
participants for their time and we gratefully acknowledge
the contribution of the Peter Cohen Foundation and
EverythingALS towards participant recruitment and data
collection for the ALS corpus and Anzalee Khan and
Jean</p>
      </sec>
      <sec id="sec-4-2">
        <title>Pierre Lindenmayer at the Manhattan Psychiatric Center – Nathan Kline Institute for the schizophrenia corpus.</title>
        <p>Investigating the utility of multimodal conversa- burg, G. L. Pattee, J. D. Berry, E. A. Macklin, E. P.
tional technology and audiovisual analytic mea- Pioro, R. A. Smith, Additional evidence for a
thersures for the assessment and monitoring of amy- apeutic efect of dextromethorphan/quinidine on
otrophic lateral sclerosis at scale, 2021, pp. 4783– bulbar motor function in patients with amyotrophic
4787. doi:10.21437/Interspeech.2021-1801. lateral sclerosis: A quantitative speech analysis,
[6] V. Richter, M. Neumann, H. Kothare, O. Roesler, British Journal of Clinical Pharmacology 84 (2018)</p>
      </sec>
      <sec id="sec-4-3">
        <title>J. Liscombe, D. Suendermann-Oeft, S. Prokop, 2849–2856.</title>
        <p>A. Khan, C. Yavorsky, J.-P. Lindenmayer, V. Ra- [13] T. Altaf, S. M. Anwar, N. Gul, M. N. Majeed,
manarayanan, Towards multimodal dialog-based M. Majid, Multi-class alzheimer’s disease
classispeech &amp; facial biomarkers of schizophrenia, in: ifcation using image and clinical features,
BiomedCompanion Publication of the 2022 International ical Signal Processing and Control 43 (2018) 64–
Conference on Multimodal Interaction, ICMI ’22 74. URL: https://www.sciencedirect.com/science/
Companion, Association for Computing Machinery, article/pii/S1746809418300508. doi:https://doi.
New York, NY, USA, 2022, p. 171–176. URL: https: org/10.1016/j.bspc.2018.02.019.
//doi.org/10.1145/3536220.3558075. doi:10.1145/ [14] L. Hansen, R. Rocca, A. Simonsen, et al.,
3536220.3558075. Speech- and text-based classification of
neu[7] H. Kothare, M. Neumann, J. Liscombe, O. Roesler, ropsychiatric conditions in a multidiagnostic
setW. Burke, A. Exner, S. Snyder, A. Cornish, D. Hab- ting, Nature Mental Health (2023). doi:10.1038/
berstad, D. Pautler, D. Suendermann-Oeft, J. Hu- s44220-023-00152-7.
ber, V. Ramanarayanan, Statistical and clini- [15] E. Emre, Erol, C. Taş, N. Tarhan, Multi-class
cal utility of multimodal dialogue-based speech classification model for psychiatric
disorand facial metrics for parkinson’s disease as- der discrimination, International Journal of
sessment, 2022, pp. 3658–3662. doi:10.21437/ Medical Informatics 170 (2023) 104926. URL:
Interspeech.2022-11048. https://www.sciencedirect.com/science/article/pii/
[8] N. Cummins, J. Joshi, A. Dhall, V. Sethu, R. Goecke, S1386505622002404. doi:https://doi.org/10.</p>
      </sec>
      <sec id="sec-4-4">
        <title>J. Epps, Diagnosis of depression by behavioural 1016/j.ijmedinf.2022.104926.</title>
        <p>signals: A multimodal approach, in: Proceed- [16] D. Suendermann-Oeft, A. Robinson, A. Cornish,
ings of the 3rd ACM International Workshop on D. Habberstad, D. Pautler, D. Schnelle-Walka,
Audio/Visual Emotion Challenge, AVEC ’13, As- F. Haller, J. Liscombe, M. Neumann, M. Merrill,
sociation for Computing Machinery, New York, O. Roesler, R. Gefarth, Nemsi: A multimodal
diaNY, USA, 2013, p. 11–20. URL: https://doi.org/ log system for screening of neurological or mental
10.1145/2512530.2512535. doi:10.1145/2512530. conditions, in: Proceedings of the 19th ACM
Inter2512535. national Conference on Intelligent Virtual Agents,
[9] J. Robin, M. Xu, A. Balagopalan, J. Novikova, IVA ’19, Association for Computing Machinery,
L. Kahn, A. Oday, M. Hejrati, S. Hashemifar, M. Ne- New York, NY, USA, 2019, p. 245–247. URL: https:
gahdar, W. Simpson, E. Teng, Automated detection //doi.org/10.1145/3308532.3329415. doi:10.1145/
of progressive speech changes in early alzheimer’s 3308532.3329415.
disease, Alzheimer’s &amp; Dementia: Diagnosis, As- [17] A. K. Silbergleit, A. F. Johnson, B. H. Jacobson,
sessment &amp; Disease Monitoring 15 (2023) e12445. Acoustic analysis of voice in individuals with
amydoi:https://doi.org/10.1002/dad2.12445. otrophic lateral sclerosis and perceptually normal
[10] J. Hlavnika, R. Cmejla, T. Tykalová, K. onka, vocal quality, Journal of Voice 11 (1997) 222–231.</p>
        <p>E. Růika, J. Rusz, Automated analysis of con- [18] B. Tomik, R. J. Guilof, Dysarthria in amyotrophic
nected speech reveals early biomarkers of parkin- lateral sclerosis: A review, Amyotrophic Lateral
son’s disease in patients with rapid eye move- Sclerosis 11 (2010) 4–15.
ment sleep behaviour disorder, Scientific Re- [19] M. Novotny, J. Melechovsky, K. Rozenstoks,
ports 7 (2017). URL: https://api.semanticscholar.org/ T. Tykalova, P. Kryze, M. Kanok, J. Klempir, J. Rusz,
CorpusID:19272861. Comparison of automated acoustic methods for
[11] G. Stegmann, S. Charles, J. Liss, J. Shefner, oral diadochokinesis assessment in amyotrophic
S. Rutkove, V. Berisha, A speech-based prognos- lateral sclerosis, Journal of speech, language, and
tic model for dysarthria progression in als, Amy- hearing research : JSLHR 63 (2020) 3453–3460.
otrophic lateral sclerosis &amp; frontotemporal degen- doi:10.1044/2020_JSLHR-20-00109.
eration (2023) 1–6. URL: https://doi.org/10.1080/ [20] P. Buckley, B. Miller, D. Lehrer, D. Castle,
Psychi21678421.2023.2222144. doi:10.1080/21678421. atric comorbidities and schizophrenia,
Schizophre2023.2222144, advance online publication. nia bulletin 35 (2008) 383–402. doi:10.1093/
[12] J. R. Green, K. M. Allison, C. Cordella, B. D. Rich- schbul/sbn135.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Feigin</surname>
          </string-name>
          , E. Nichols,
          <string-name>
            <given-names>T.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bannick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Beghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Blake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dorsey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elbaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ellenbogen</surname>
          </string-name>
          , J. Fisher, C. Fitzmaurice,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giussani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Glennie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>James</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , N. Kassebaum,
          <string-name>
            <given-names>G.</given-names>
            <surname>Logroscino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Marin</surname>
          </string-name>
          , T. Vos, Global, regional, and
          <article-title>national burden of neurological disorders,</article-title>
          <year>1990</year>
          -
          <fpage>2016</fpage>
          :
          <article-title>a systematic analysis for the global burden of disease study 2016</article-title>
          ,
          <source>The Lancet Neurology</source>
          <volume>18</volume>
          (
          <year>2019</year>
          )
          <fpage>459</fpage>
          -
          <lpage>480</lpage>
          . doi:
          <volume>10</volume>
          .1016/S1474-
          <volume>4422</volume>
          (
          <issue>18</issue>
          )
          <fpage>30499</fpage>
          -
          <lpage>X</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramanarayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Lammert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Rowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. F.</given-names>
            <surname>Quatieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <article-title>Speech as a biomarker: Opportunities, interpretability, and challenges</article-title>
          ,
          <source>Perspectives of the ASHA Special Interest Groups</source>
          <volume>7</volume>
          (
          <year>2022</year>
          )
          <fpage>276</fpage>
          -
          <lpage>283</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Roesler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liscombe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kothare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Suendermann-Oeft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Berry</surname>
          </string-name>
          , E. Fraenkel,
          <string-name>
            <given-names>R.</given-names>
            <surname>Norel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anvar</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Navar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Sherman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramanarayanan</surname>
          </string-name>
          ,
          <article-title>Multimodal dialog based speech and facial biomarkers capture diferential disease progression rates for als remote patient monitoring</article-title>
          ,
          <source>in: Proceedings of the 32nd International Symposium on Amyotrophic Lateral Sclerosis and Motor Neuron Disease, Virtual</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Richter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Haq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wright-Berryman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramanarayanan</surname>
          </string-name>
          ,
          <article-title>A multimodal dialog approach to mental state characterization in clinically depressed, anxious, and suicidal populations</article-title>
          ,
          <source>Frontiers in Psychology</source>
          <volume>14</volume>
          (
          <year>2023</year>
          ). URL: https://www.frontiersin.org/articles/ 10.3389/fpsyg.
          <year>2023</year>
          .
          <volume>1135469</volume>
          . doi:
          <volume>10</volume>
          .3389/fpsyg.
          <year>2023</year>
          .
          <volume>1135469</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Roesler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liscombe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kothare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Suendermann-Oeft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pautler</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Navar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anvar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kumm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Norel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fraenkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sherman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Berry</surname>
          </string-name>
          , G. Pattee,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramanarayanan</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>