=Paper=
{{Paper
|id=Vol-3649/Paper22
|storemode=property
|title=Prediction of Relapse in Adolescent Depression using Fusion of Video and Speech Data (short paper)
|pdfUrl=https://ceur-ws.org/Vol-3649/Paper22.pdf
|volume=Vol-3649
|authors=Christopher Lucasius,Mai Ali,Deepa Kundur,Marco Battaglia,Peter Szatmari,John Strauss
|dblpUrl=https://dblp.org/rec/conf/aaai/LucasiusAKBSS24
}}
==Prediction of Relapse in Adolescent Depression using Fusion of Video and Speech Data (short paper)==
Christopher Lucasius1,∗, Mai Ali1, Marco Battaglia2,3, John Strauss4, Peter Szatmari2,3,5 and Deepa Kundur1

1 Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada
2 Division of Child and Youth Psychiatry, Centre for Addiction and Mental Health, Toronto, Canada
3 Department of Psychiatry, University of Toronto, Toronto, Canada
4 Vancouver Island Health Authority, Vancouver, Canada
5 The Hospital for Sick Children, Toronto, Canada

Machine Learning for Cognitive and Mental Health Workshop (ML4CMH), AAAI 2024, Vancouver, BC, Canada
∗ Corresponding author.
christopher.lucasius@mail.utoronto.ca (C. Lucasius); maia.ali@mail.utoronto.ca (M. Ali); marco.battaglia@camh.ca (M. Battaglia); john.strauss@islandhealth.ca (J. Strauss); peter.szatmari@camh.ca (P. Szatmari); dkundur@ece.utoronto.ca (D. Kundur)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Abstract

This article presents an innovative approach to predicting depression relapse in adolescents. Adolescents' intensive use of video and voice-based smartphone apps presents a rich, multimodal dataset that can be utilized for this purpose. This work uses a dataset from the Depression Early Warning study conducted at the Centre for Addiction and Mental Health. The proposed framework uses a pre-trained Inception ResNet to generate embeddings of video frames and fuses them with synchronized audio features, resulting in a multimodal dataset. The combined features are processed through a Long Short-Term Memory model and a fully connected network to predict relapse of depression. An average accuracy of 0.80 highlights the effectiveness of the proposed multimodal approach and underscores its potential to effectively predict depression relapse in adolescents.

Keywords

Depression relapse, Multimodality, Inception ResNet, LSTM
1. Introduction

Depression is a worldwide, prevalent mental health disorder among adolescents. The recognition and treatment of adolescent depression hold paramount significance due to its association with substantial risks, notably suicide, which stands as the fourth leading cause of death within this demographic [1]. Disturbingly, over half of adolescents who commit suicide are reported to have been struggling with a depressive disorder [1]. Beyond this, depression in adolescents causes profound social and educational impairments, underscoring the need for timely intervention. The consequences extend to heightened rates of smoking, substance misuse, and obesity, accentuating the urgency of addressing this mental health concern [2].

Standard mental health diagnoses rely on clinical surveys that may be subject to recall bias. This approach also does not allow for timely interventions [3]. To address these limitations, diverse modalities have been proposed in the literature for timely mental health assessment and prediction. These modalities encompass physiological features such as heart rate and temperature, as well as behavioral features such as voice, facial expression, and gesture. Video chat and gaming are very popular among youth, with statistics reaching 87% in this population [4]. However, despite the widespread engagement in these activities, research exploring the use of video and speech modalities for the assessment of depression and prediction of relapse in youth is limited. This work investigates the use of speech and video for depression relapse prediction in adolescents. As far as the authors are aware, it presents the first pipeline for predicting depression relapse in adolescents using fusion of video- and speech-based features.

2. Literature Review

The use of speech and video analysis for depression prediction represents an innovative and promising approach in mental health research. Analyzing speech patterns and facial expressions can provide valuable insights into an individual's emotional and mental state. Below is a review on the use of speech and video for depression prediction.

2.1. Speech-based Depression Prediction
Several studies demonstrated that voice quality contains information about the mental state of a person and that vocal source features can be used as biomarkers of depression severity [5, 6, 7]. The work in [8] is based on a cross-sectional and longitudinal study that aimed to explore the potential of voice acoustic features as objective biomarkers for assessing depression severity and treatment effectiveness. The study identified 30 voice acoustic features associated with depression, such as Mel-cepstral (MCEP) features, Mel-scale Frequency Cepstral Coefficient deltas (MFCC-deltas), and Harmonic Model Phase Distortion Mean (HMPDM), among others. A neural network model based on Neural Architecture Search (NAS) was developed for predicting depression severity. Grid search was used to obtain the optimal model architecture, which consisted of 4 hidden layers with 32 units each. The model achieved a Mean Absolute Error (MAE) of 3.137 when predicting depression severity based on the Hamilton Depression (HAMD) Scale. Additionally, a longitudinal study investigated the changes in voice features after an Internet-based cognitive-behavioral therapy (ICBT) program, revealing four features that significantly decreased: Peak2RMS_kurtosis, MFCC_deltas_10_intercept, MFCC_delta_deltas_4_kurtosis, and MFCC_delta_deltas_9_kurtosis. This indicated their potential correlation with treatment response and improvement in depression. In [9], Vázquez-Romero et al. proposed a method for automatic classification of depression using speech and ensemble learning with Convolutional Neural Networks (CNNs). In the preprocessing phase, speech files are transformed into sequences of log-spectrograms and randomly sampled to ensure a balance between positive and negative samples. For the classification task, multiple CNNs are trained using different initializations, and their individual predictions are combined using an ensemble averaging algorithm. The predictions are then aggregated for each speaker to obtain a final decision. The performance of the proposed model was evaluated on the DAIC-WOZ dataset and compared against the AVEC-2016 models that use support vector machine (SVM) classifiers and hand-crafted features, as well as the DepAudioNet architecture, which consists of a 1D-CNN, Long Short-Term Memory (LSTM) cell, and fully connected layers. The results demonstrated a relative improvement in F1-score of 58.5%, 30.0%, and 10.2% compared to the baseline, DepAudioNet, and single 1D-CNN architecture, respectively.

2.2. Video-based Depression Prediction

Behavioral analysis of facial expressions has been studied as a source for eliciting the underlying emotional state [10]. Computer vision methods have been used to analyze facial expressions and gestures to predict the underlying mental health state of users [11]. A framework for estimating depression levels from video data using a two-stream deep spatiotemporal network was introduced in [11]. The framework combined spatial information extracted from the Inception-ResNet-v2 network with a volume local directional number (VLDN) based dynamic feature descriptor to capture facial motions. The VLDN feature map was then fed into a CNN to obtain more discriminative features. Temporal information was obtained using a multilayer Bi-LSTM which integrated the temporal median pooling (TMP) approach on the temporal fragments of spatial and temporal features. The performance of this work was benchmarked against the AVEC2013 and AVEC2014 datasets, and it achieved an MAE of 7.04 and 6.86 on AVEC2013 and AVEC2014, respectively.

Zhou et al. presented a deep regression network called DepressNet which aimed to learn a visually interpretable representation of depression from facial images [12]. Their model is based on a CNN with a global average pooling layer which is first trained with facial depression data to identify salient regions of an input image in terms of its severity score, based on the generated depression activation map (DAM). The authors proposed a multi-region DepressNet that combines multiple local deep regression models for different face regions to enhance recognition performance. The method achieved an MAE of 6.20 and 6.21 on the AVEC 2013 and 2014 datasets, respectively.

2.3. Speech and Video-based Depression Prediction

Physiological and psychological studies have identified differences in speech and facial expressions between patients with depression and healthy individuals, providing potential cues for automatic depression detection [13]. Another related work by [14] presented a depression detection model that utilizes audiovisual features extracted from video logs (vlogs) on YouTube. The model extracts eight low-level acoustic descriptors, including loudness, fundamental frequency (F0), and spectral flux, using the OpenSMILE toolkit. These features capture characteristics such as voice intensity and pitch which have been found to be relevant in detecting depression. For visual features, the model utilizes a pre-trained facial expression recognition (FER) model to extract emotional information from the vlogs. The proposed eXtreme Gradient Boosting (XGBoost) depression detection model achieved an overall accuracy of 75.85%, recall of 78.18%, precision of 76.79%, and F1 score of 77.48%. The model's performance was further analyzed across modalities: the model trained with audio features performed better than the model trained with visual features, and the best performance was achieved by the model trained on the audiovisual features. The work of Othmani et al. in [15] used deep learning techniques to recognize depression and predict relapse from audio
and visual cues extracted from videos of clinical interviews. It involves a correlation-based anomaly detection framework that compares the audiovisual patterns of depression-free subjects to those of depressed individuals. The correlation between the audiovisual encoding of a test subject and a deep audiovisual representation of depression is computed to monitor depressed subjects and predict relapse. The approach achieves promising results, with accuracies of 80.99% and 82.55% for relapse and depression prediction on the DAIC-WOZ dataset.

The existing landscape of research on adolescent depression has made significant strides in understanding the onset and symptoms of depression in this age group. However, there is a notable gap in the ability to effectively predict depression relapse from audio and video modalities. By incorporating synchronized video and speech data, this research captures a broader spectrum of behavioral and emotional cues that might signify impending relapse in adolescents. The synchronization ensures that both modalities are aligned, allowing for a detailed examination of facial expressions, body language, and speech patterns simultaneously.

3. Problem Formulation

Our work aims to classify fused video and speech features measured before a relapse event. This entails a binary classification task where the two classes are "relapse sometime in the future" and "non-relapse". This problem is significantly different from detecting the presence of depression or predicting a certain depression rating scale score. The problem of relapse prediction is more complex since it involves the direct prediction of a clinical event within a population of adolescents who are already diagnosed with Major Depressive Disorder.

4. Methods

This work uses a dataset that was collected as part of the Depression Early Warning study run at the Centre for Addiction and Mental Health (CAMH). It includes 80 video interviews collected from 52 adolescents aged 12-21 who were all diagnosed with Major Depressive Disorder.

4.1. CAMH Dataset

All participants had an initial baseline visit followed by up to 7 followup visits, each spaced apart by 3-12 months. During each visit, participants were assessed by a trained research coordinator and psychiatrist, providing psychiatric evaluations of their depressive states via the Children's Depression Rating Scale (CDRS). Participants were interviewed by the coordinator during their initial visit and followup sessions. During recorded Zoom sessions, the coordinator asked them 10 open-ended questions about their past activities and mood, resulting in 2-10 minutes of video data per session. This dataset was collected as part of an ongoing research study at CAMH and is unavailable to the public.

4.2. Definition of Relapse

While there are many definitions of relapse in depression, a commonly accepted one is given by [16], which defines a relapse in adolescents as observing a CDRS score of at most 28 during at least 12 weeks of treatment followed by an increase in CDRS to at least 40 for at least two weeks. The first period of 12 weeks corresponds to a remission stage where the depressed adolescent does not exhibit symptoms but has not yet completed treatment. The period of two weeks corresponds to a depressive episode. In this study, there can be at least a three-month break between followup visits. Hence, the timing aspect of Kennard's definition must be accordingly modified to adhere to the provided data. This work proposes a definition of relapse as a period of at least one visit with a CDRS score of at most 40 followed by one visit with a CDRS score of at least 40.
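To make the modified definition concrete, the following is a minimal Python sketch of this labeling rule applied to a participant's visit history. The function name, data layout, and constant names are illustrative, not taken from the study's codebase.

```python
# Illustrative sketch of the modified relapse rule described above.
# `cdrs_scores` holds one CDRS score per visit, ordered by visit date.

REMISSION_MAX = 40  # a score of at most 40 counts toward the remission period
RELAPSE_MIN = 40    # a score of at least 40 at the next visit marks a relapse


def label_relapse_visits(cdrs_scores: list[float]) -> list[bool]:
    """Return one label per visit: True if a relapse is observed at that visit,
    i.e., the previous visit scored <= REMISSION_MAX and this one >= RELAPSE_MIN."""
    labels = [False]  # the baseline visit cannot itself be a relapse
    for prev, curr in zip(cdrs_scores, cdrs_scores[1:]):
        labels.append(prev <= REMISSION_MAX and curr >= RELAPSE_MIN)
    return labels


# Example: remission at 35, then a jump to 45 -> relapse observed at visit 3.
print(label_relapse_visits([55, 35, 45]))  # [False, False, True]
```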
4.3. Pipeline

There are three main stages that make up the methods of this pipeline. The first consists of preprocessing the video and audio data and organizing them such that the two modalities are aligned and the labels are balanced. The second involves training models on random subsets of the training data. In the final stage, the final model that was trained on the training set is evaluated on multiple test sets, and the performance metrics are averaged across each set. A diagram summarizing the pipeline is shown in Figure 1.

4.3.1. Stage 1: Data Preparation

Each video interview is divided into segments where only the participant is speaking. Since the interviews are conducted via Zoom, the videos are also cropped such that only the participant's face is visible. Several spectral features are extracted from the audio data using the Python package librosa [17]. They include the MFCCs, fundamental frequency, chromagrams, power spectral density, and spectral rolloff. These features are computed over a rolling window that is applied across the video. The amount of overlap is chosen such that the number of windows matches the number of video frames and the windows are evenly spread out across the video.
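Below is a minimal sketch of this feature extraction step using librosa. The paper lists the features but not its exact window settings, so the hop-length choice (picked here so the number of analysis windows matches the number of video frames) and feature dimensions are assumptions.

```python
# A sketch of Stage 1 audio feature extraction, assuming the feature list above.
import librosa
import numpy as np


def extract_spectral_features(audio_path: str, n_video_frames: int) -> np.ndarray:
    y, sr = librosa.load(audio_path, sr=None)
    # Choose the hop length so that the number of analysis windows roughly
    # matches the number of video frames, as described in the text.
    hop = max(1, len(y) // n_video_frames)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=hop)
    # Power spectral density per window (power spectrogram).
    psd = np.abs(librosa.stft(y, hop_length=hop)) ** 2
    # Fundamental frequency via probabilistic YIN; unvoiced frames come back NaN.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr, hop_length=hop)

    # Truncate all feature tracks to a common length and stack per-frame rows.
    n = min(mfcc.shape[1], chroma.shape[1], rolloff.shape[1],
            psd.shape[1], len(f0), n_video_frames)
    return np.vstack([mfcc[:, :n], chroma[:, :n], rolloff[:, :n],
                      psd[:, :n], np.nan_to_num(f0[None, :n])]).T  # (frames, features)
```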
Figure 1: In Stage 1 (green block), signal processing algorithms process audio signals to generate spectral features. Video frames and spectral features are aligned and stored into train and test folds. Train folds are passed through Stage 2 (blue block). Video frames are processed with a pretrained Inception ResNet model. Resulting features (video in blue and audio in red) are fused with the spectral features and then fed through an LSTM and a fully connected network for training. Stage 3 (red block) has the same components as Stage 2, but it is only used on the test fold for evaluation purposes.
In the provided dataset, there is a significant class imbalance where the non-relapse data is heavily over-represented (96.25% non-relapse). In order to not bias the training of the models and the evaluation metrics (described in the next two sections), several training folds are prepared alongside a test fold. The folds are constructed by first randomly selecting a proportion of relapse video clips to use in the test fold. This proportion is chosen to be 30%, and it is computed based on the number of frames within each video clip. A random selection of non-relapse clips is chosen to match the number of frames of the relapse ones (rounded to the nearest whole number of clips). This completes the test fold, which is reserved for Stage 3 of the pipeline. The rest of the relapse subjects are assigned to be used by train folds in Stage 2. Non-relapse video clips are randomly sampled without replacement, where the number of clips is selected to match the number of frames of the relapse subjects. Each random sample of non-relapse clips makes up another train fold, and this process is repeated until all non-relapse clips are used.
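A simplified sketch of this fold construction follows. It assumes each clip is represented as a (clip_id, n_frames, is_relapse) tuple; the rounding and sampling details are illustrative, since the paper does not publish its implementation.

```python
# A simplified sketch of the frame-balanced fold construction described above.
import random


def build_folds(clips, test_fraction=0.30, seed=0):
    rng = random.Random(seed)
    relapse = [c for c in clips if c[2]]
    non_relapse = [c for c in clips if not c[2]]
    rng.shuffle(relapse)
    rng.shuffle(non_relapse)

    # Reserve roughly 30% of relapse frames for the test fold.
    target = test_fraction * sum(c[1] for c in relapse)
    test, acc = [], 0
    while relapse and acc < target:
        clip = relapse.pop()
        test.append(clip)
        acc += clip[1]
    # Match the reserved relapse frames with non-relapse clips of
    # similar total frame count.
    nr_acc = 0
    while non_relapse and nr_acc < acc:
        clip = non_relapse.pop()
        test.append(clip)
        nr_acc += clip[1]

    # Remaining relapse clips appear in every train fold; non-relapse clips
    # are sampled without replacement into balanced folds until exhausted.
    relapse_frames = sum(c[1] for c in relapse)
    train_folds = []
    while non_relapse:
        fold, fr = list(relapse), 0
        while non_relapse and fr < relapse_frames:
            clip = non_relapse.pop()
            fold.append(clip)
            fr += clip[1]
        train_folds.append(fold)
    return train_folds, test
```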
4.3.2. Stage 2: Training of Models

The video frames are fed into an InceptionResNet model that was pre-trained on VGGFace2 [18], a large-scale face dataset. The resulting embeddings from this network are then fused with the spectral features of the audio data. These fused features are then fed into a neural network module (named the AudioVisual Network) containing an LSTM and a fully connected network. The LSTM is used to process 16 consecutive frames of features at a time, and the resulting hidden state is then fed into the fully connected network to be classified as either relapse or non-relapse. During the training process, random segments of 16 frames are sampled from the training video clips in order to not bias the training of the network towards a certain class.

The training process is applied to each train fold, and within a given fold, it is repeated for eight epochs. After the AudioVisual Network is trained on a given fold, its saved parameters are used to continue training the network on a new fold. This is repeated until all train folds are exhausted. This allows the network to train on the entire training dataset while still keeping the classes relatively balanced.
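A minimal PyTorch sketch of the AudioVisual Network is given below. The embedding and hidden dimensions are assumptions (the paper does not report exact layer sizes), and the per-frame video embeddings are assumed to come from a VGGFace2-pretrained Inception ResNet such as the InceptionResnetV1 in the facenet-pytorch package, which the paper does not name.

```python
# A sketch of the AudioVisual Network: per-frame video embeddings are
# concatenated with spectral audio features, then passed through an LSTM
# whose final hidden state feeds a fully connected classification head.
import torch
import torch.nn as nn


class AudioVisualNetwork(nn.Module):
    def __init__(self, video_dim=512, audio_dim=64, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(video_dim + audio_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, video_emb, audio_feat):
        # video_emb: (batch, 16, video_dim) frame embeddings from the
        # pretrained Inception ResNet; audio_feat: (batch, 16, audio_dim).
        fused = torch.cat([video_emb, audio_feat], dim=-1)
        _, (h_n, _) = self.lstm(fused)  # final hidden state of the LSTM
        return self.head(h_n[-1])       # relapse vs. non-relapse logit


# Example forward pass on a random 16-frame segment.
model = AudioVisualNetwork()
logit = model(torch.randn(4, 16, 512), torch.randn(4, 16, 64))
print(logit.shape)  # torch.Size([4, 1])
```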
4.3.3. Stage 3: Evaluation of Models

After the AudioVisual Network is trained, the full architecture (including the InceptionResNet feature extractor) is evaluated on the test fold that was reserved in Stage 1. A receiver operating characteristic (ROC) analysis is carried out on the predictions and ground truth labels. The optimal threshold of the ROC curve is selected by choosing the point that maximizes the difference between the true and false positive rates. This threshold is used to compute the accuracy and MAE.

The entire process of training the models and evaluating the final one on a test fold is carried out for 10 sets of folds. This is to ensure that the reported metrics are not biased toward a certain set of subjects. The resulting performance metrics are averaged across all of the test folds.
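This threshold rule (maximizing TPR minus FPR) is the Youden J statistic. A short sketch with scikit-learn follows; the labels and scores here are toy values for illustration only.

```python
# Sketch of the Stage 3 threshold selection and metric computation.
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1])                # ground-truth relapse labels
y_score = np.array([0.1, 0.4, 0.8, 0.35, 0.2, 0.9])  # network output scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
best = np.argmax(tpr - fpr)  # operating point maximizing TPR - FPR (Youden's J)
y_pred = (y_score >= thresholds[best]).astype(int)

accuracy = (y_pred == y_true).mean()
mae = np.abs(y_pred - y_true).mean()  # MAE of the binary predictions
print(f"threshold={thresholds[best]:.2f}, accuracy={accuracy:.2f}, MAE={mae:.2f}")
```

Note that for binary labels this MAE reduces to the misclassification rate, which is consistent with Table 1, where each fold's MAE is approximately one minus its accuracy.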
5. Results and Significance

Table 1 shows the results of evaluating the trained model on the 10 test folds. Each accuracy and MAE measure was reported after finding the optimal threshold of the ROC curve.

An average accuracy of 0.80 shows that video and speech data are relatively promising in the prediction of relapse in adolescent depression. In previous work by Othmani et al. [15], the authors also predicted relapse of depression using video and speech data. Similar to our work, they also yielded accuracies at around 0.8. To the best of our knowledge, this is the only other work that used video and speech to predict relapse of depression. Our work differs from Othmani et al. in two significant ways: 1) our study focuses on adolescents and 2) the source of our data is from non-clinical interviews. These interviews allow for more conversational topics that may better mimic a real-life situation in an adolescent's everyday life. While the target population for this work includes adolescents, this framework can be extended to other depressed populations.
Table 1
Results of evaluation of final trained models on test folds

Fold     MAE    Accuracy
1        0.077  0.92
2        0.28   0.72
3        0.21   0.79
4        0.23   0.77
5        0.12   0.88
6        0.26   0.74
7        0.17   0.83
8        0.23   0.77
9        0.17   0.83
10       0.27   0.73
Average  0.21   0.80

6. Limitations and Future Work

Predicting depression from audiovisual features encounters various challenges. The subjectivity of depression labels and the heterogeneous nature of this condition make it difficult to develop a universally applicable model. Additionally, there may be ethnic and cultural biases in the data that may have impacted the model's generalizability. This work did not consider the context within which interviews were conducted. Furthermore, the exclusion of gender-based analysis is a notable limitation, potentially overlooking important nuances in how depression manifests across different genders.

Future work in predicting depression from audiovisual features will prioritize the development of gender- and context-aware models. Moreover, given the longitudinal nature of the study, a promising avenue for future work is to exploit the temporal nature of the data to track changes in audiovisual features over long periods of time. Employing an overarching time series model could enhance the understanding of the dynamic nature of depression, allowing for the development of more adaptive and personalized prediction models.

Another way to extend this work is to combine other objective sources of data that can be collected simultaneously with video and speech. One such modality includes wearable technologies, and there have been several studies on using them for the prediction of depression [19]. Using similar techniques, it may be possible to fuse audiovisual features and those derived from wearables to create a more robust predictor of adolescent depression relapse. Finally, we intend to evaluate our work using publicly available audio/video depression datasets such as AVEC.

References

[1] World Health Organization, Suicide, 2023. URL: https://www.who.int/news-room/fact-sheets/detail/suicide.
[2] A. Thapar, S. Collishaw, D. S. Pine, A. K. Thapar, Depression in adolescence, The Lancet 379 (2012) 1056–1067. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3488279/. doi:10.1016/s0140-6736(11)60871-4.
[3] N. H. Goldhaber, A. Chea, E. B. Hekler, W. Zhou, B. Fergerson, Evaluating the mental health of physician-trainees using an SMS text message-based assessment tool: Longitudinal pilot study, JMIR Formative Research 7 (2023) e45102. doi:10.2196/45102.
[4] P. Summerfield, How many kids in Canada are connecting with video games?, 2023. URL: https://mediaincanada.com/2023/01/30/how-many-kids-in-canada-are-connecting-with-video-games/.
[5] Q. Zhao, H.-Z. Fan, Y.-L. Li, L. Liu, Y.-X. Wu, Y.-L. Zhao, Z.-X. Tian, Z.-R. Wang, Y.-L. Tan, S.-P. Tan, Vocal acoustic features as potential biomarkers for identifying/diagnosing depression: A cross-sectional study, Frontiers in Psychiatry 13 (2022). doi:10.3389/fpsyt.2022.815678.
[6] D. Shin, W. I. Cho, C. H. K. Park, S. J. Rhee, M. J. Kim, H. Lee, N. S. Kim, Y. M. Ahn, Detection of minor and major depression through voice as a biomarker using machine learning, Journal of Clinical Medicine 10 (2021) 3046. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8303477/. doi:10.3390/jcm10143046.
[7] N. Cummins, S. Scherer, J. Krajewski, S. Schnieder, J. Epps, T. F. Quatieri, A review of depression and suicide risk assessment using speech analysis, Speech Communication 71 (2015) 10–49. URL: https://www.sciencedirect.com/science/article/pii/S0167639315000369. doi:10.1016/j.specom.2015.03.004.
[8] Y. Wang, L. Liang, Z. Zhang, X. Xu, R. Liu, H. Fang, R. Zhang, Y. Wei, Z. Liu, R. Zhu, X. Zhang, F. Wang, Fast and accurate assessment of depression based on voice acoustic features: a cross-sectional and longitudinal study, Frontiers in Psychiatry 14 (2023). URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10320390. doi:10.3389/fpsyt.2023.1195276.
[9] A. Vázquez-Romero, A. Gallardo-Antolín, Automatic detection of depression in speech using ensemble convolutional neural networks, Entropy 22 (2020) 688. doi:10.3390/e22060688.
[10] P. Ekman, W. V. Friesen, Facial action coding system: Investigator's guide, Consulting Psychologists Press, 1978.
[11] M. Azher Uddin, J. Bibi Joolee, Y.-K. Lee, Depression level prediction using deep spatiotemporal features and multilayer Bi-LSTM, 2022. URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8976084.
[12] X. Zhou, K. Jin, Y. Shang, G. Guo, Visually interpretable representation learning for depression recognition from facial images, IEEE Transactions on Affective Computing 11 (2020) 542–552. doi:10.1109/taffc.2018.2828819.
[13] L. He, M. Niu, P. Tiwari, P. Marttinen, R. Su, J. Jiang, C. Guo, H. Wang, S. Ding, Z. Wang, X. Pan, W. Dang, Deep learning for depression recognition with audiovisual cues: A review, Information Fusion 80 (2022) 56–86. doi:10.1016/j.inffus.2021.10.012.
[14] K. Min, J. Yoon, M. Kang, D. Lee, E. Park, J. Han, Detecting depression on video logs using audiovisual features, Humanities and Social Sciences Communications 10 (2023). URL: http://dx.doi.org/10.1057/s41599-023-02313-6. doi:10.1057/s41599-023-02313-6.
[15] A. Othmani, A. O. Zeghina, A multimodal computer-aided diagnostic system for depression relapse prediction using audiovisual cues: A proof of concept, Healthcare Analytics 2 (2022) 100090. URL: https://www.sciencedirect.com/science/article/pii/S2772442522000387. doi:10.1016/j.health.2022.100090.
[16] B. D. Kennard, T. L. Mayes, Z. Chahal, P. A. Nakonezny, A. Moorehead, G. J. Emslie, Predictors and moderators of relapse in children and adolescents with major depressive disorder, The Journal of Clinical Psychiatry 79 (2018) e1–e8. doi:10.4088/JCP.15M10330.
[17] B. McFee, C. Raffel, D. Liang, D. Ellis, M. McVicar, E. Battenberg, O. Nieto, librosa: Audio and music signal analysis in Python, in: Proceedings of the 14th Python in Science Conference, 2015. doi:10.25080/majora-7b98e3ed-003.
[18] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818–2826. doi:10.1109/CVPR.2016.308.
[19] L. Sequeira, S. Perrotta, J. LaGrassa, K. Merikangas, D. Kreindler, D. Kundur, D. Courtney, P. Szatmari, M. Battaglia, J. Strauss, Mobile and wearable technology for monitoring depressive symptoms in children and adolescents: A scoping review, Journal of Affective Disorders 265 (2020) 314–324. URL: https://www.sciencedirect.com/science/article/pii/S0165032719310304. doi:10.1016/j.jad.2019.11.156.