=Paper=
{{Paper
|id=Vol-3649/Paper22
|storemode=property
|title=Prediction of Relapse in Adolescent Depression using Fusion of Video and Speech Data (short paper)
|pdfUrl=https://ceur-ws.org/Vol-3649/Paper22.pdf
|volume=Vol-3649
|authors=Christopher Lucasius,Mai Ali,Deepa Kundur,Marco Battaglia,Peter Szatmari,John Strauss
|dblpUrl=https://dblp.org/rec/conf/aaai/LucasiusAKBSS24
}}
==Prediction of Relapse in Adolescent Depression using Fusion of Video and Speech Data (short paper)==
Christopher Lucasius1,∗, Mai Ali1, Marco Battaglia2,3, John Strauss4, Peter Szatmari2,3,5 and Deepa Kundur1

1 Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada
2 Division of Child and Youth Psychiatry, Centre for Addiction and Mental Health, Toronto, Canada
3 Department of Psychiatry, University of Toronto, Toronto, Canada
4 Vancouver Island Health Authority, Vancouver, Canada
5 The Hospital for Sick Children, Toronto, Canada

Machine Learning for Cognitive and Mental Health Workshop (ML4CMH), AAAI 2024, Vancouver, BC, Canada
∗ Corresponding author.
christopher.lucasius@mail.utoronto.ca (C. Lucasius); maia.ali@mail.utoronto.ca (M. Ali); marco.battaglia@camh.ca (M. Battaglia); john.strauss@islandhealth.ca (J. Strauss); peter.szatmari@camh.ca (P. Szatmari); dkundur@ece.utoronto.ca (D. Kundur)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Abstract

This article presents an innovative approach to predicting depression relapse in adolescents. Adolescents' intensive use of video and voice-based smartphone apps presents a rich, multimodal dataset that can be utilized for this purpose. This work uses a dataset from the Depression Early Warning study conducted at the Centre for Addiction and Mental Health. The proposed framework uses a pre-trained Inception ResNet to generate embeddings of video frames and fuses them with synchronized audio features, resulting in a multimodal dataset. The combined features are processed through a Long Short-Term Memory model and a fully connected network to predict relapse of depression. An average accuracy of 0.80 highlights the effectiveness of the proposed multimodal approach and underscores its potential to effectively predict depression relapse in adolescents.

Keywords

Depression relapse, Multimodality, Inception ResNet, LSTM
1. Introduction

Depression is a worldwide, prevalent mental health disorder among adolescents. The recognition and treatment of adolescent depression hold paramount significance due to its association with substantial risks, notably suicide, which stands as the fourth leading cause of death within this demographic [1]. Disturbingly, over half of adolescents who commit suicide are reported to have been struggling with a depressive disorder [1]. Beyond this, depression in adolescents causes profound social and educational impairments, underscoring the need for timely intervention. The consequences extend to heightened rates of smoking, substance misuse, and obesity, accentuating the urgency of addressing this mental health concern [2].

Standard mental health diagnoses rely on clinical surveys that may be subject to recall bias. This approach also does not allow for timely interventions [3]. To address these limitations, diverse modalities have been proposed in the literature for timely mental health assessment and prediction. These modalities encompass physiological features such as heart rate and temperature, as well as behavioral features such as voice, facial expression, and gesture. Video chat and gaming are very popular among youth, with statistics reaching 87% in this population [4]. However, despite the widespread engagement in these activities, research exploring the use of video and speech modalities for the assessment of depression and prediction of relapse in youth is limited. This work investigates the use of speech and video for depression relapse prediction in adolescents. As far as the authors are aware, it presents the first pipeline for predicting depression relapse in adolescents using fusion of video- and speech-based features.

2. Literature Review

The use of speech and video analysis for depression prediction represents an innovative and promising approach in mental health research. Analyzing speech patterns and facial expressions can provide valuable insights into an individual's emotional and mental state. Below is a review on the use of speech and video for depression prediction.

2.1. Speech-based Depression Prediction
Several studies demonstrated that voice quality contains information about the mental state of a person and that vocal source features can be used as biomarkers of depression severity [5, 6, 7]. The work in [8] is based on a cross-sectional and longitudinal study that aimed to explore the potential of voice acoustic features as objective biomarkers for assessing depression severity and treatment effectiveness. The study identified 30 voice acoustic features associated with depression, such as Mel-cepstral (MCEP) features, Mel-scale Frequency Cepstral Coefficient deltas (MFCC-deltas), and Harmonic Model Phase Distortion Mean (HMPDM), among others. A neural network model based on Neural Architecture Search (NAS) was developed for predicting depression severity. Grid search was used to obtain the optimal model architecture, which consisted of 4 hidden layers with 32 units each. The model achieved a Mean Absolute Error (MAE) of 3.137 when predicting depression severity based on the Hamilton Depression (HAMD) Scale. Additionally, a longitudinal study investigated the changes in voice features after an Internet-based cognitive-behavioral therapy (ICBT) program, revealing four features that significantly decreased: Peak2RMS_kurtosis, MFCC_deltas_10_intercept, MFCC_delta_deltas_4_kurtosis, and MFCC_delta_deltas_9_kurtosis. This indicated their potential correlation with treatment response and improvement in depression. In [9], Vázquez-Romero et al. proposed a method for automatic classification of depression using speech and ensemble learning with Convolutional Neural Networks (CNNs). In the preprocessing phase, speech files are transformed into sequences of log-spectrograms and randomly sampled to ensure a balance between positive and negative samples. For the classification task, multiple CNNs are trained using different initializations, and their individual predictions are combined using an ensemble averaging algorithm. The predictions are then aggregated for each speaker to obtain a final decision. The performance of the proposed model was evaluated on the DAIC-WOZ dataset and compared against the AVEC-2016 models that use support vector machine (SVM) classifiers and hand-crafted features, as well as the DepAudioNet architecture, which consists of a 1D-CNN, Long Short-Term Memory (LSTM) cell, and fully connected layers. The results demonstrated a relative improvement in F1-score of 58.5%, 30.0%, and 10.2% compared to the baseline, DepAudioNet, and single 1D-CNN architecture, respectively.

2.2. Video-based Depression Prediction

Behavioral analysis of facial expressions has been studied as a source for eliciting the underlying emotional state [10]. Computer vision methods have been used to analyze facial expressions and gestures to predict the underlying mental health state of users [11]. A framework for estimating depression levels from video data using a two-stream deep spatiotemporal network was introduced in [11]. The framework combined spatial information extracted from the Inception-ResNet-v2 network with a volume local directional number (VLDN) based dynamic feature descriptor to capture facial motions. The VLDN feature map was then fed into a CNN to obtain more discriminative features. Temporal information was obtained using a multilayer Bi-LSTM which integrated the temporal median pooling (TMP) approach on the temporal fragments of spatial and temporal features. The performance of this work was benchmarked against the AVEC2013 and AVEC2014 datasets, and it achieved an MAE of 7.04 and 6.86 on AVEC2013 and AVEC2014, respectively.

Zhou et al. presented a deep regression network called DepressNet which aimed to learn a visually interpretable representation of depression from facial images [12]. Their model is based on a CNN with a global average pooling layer which is first trained with facial depression data to identify salient regions of an input image in terms of its severity score, based on the generated depression activation map (DAM). The authors proposed a multi-region DepressNet that combines multiple local deep regression models for different face regions to enhance recognition performance. The method achieved an MAE of 6.20 and 6.21 on the AVEC 2013 and 2014 datasets, respectively.

2.3. Speech and Video-based Depression Prediction

Physiological and psychological studies have identified differences in speech and facial expressions between patients with depression and healthy individuals, providing potential cues for automatic depression detection [13]. Another related work by [14] presented a depression detection model that utilizes audiovisual features extracted from video logs (vlogs) on YouTube. The model extracts eight low-level acoustic descriptors, including loudness, fundamental frequency (F0), and spectral flux, using the OpenSMILE toolkit. These features capture characteristics such as voice intensity and pitch which have been found to be relevant in detecting depression. For visual features, the model utilizes a pre-trained facial expression recognition (FER) model to extract emotional information from the vlogs. The proposed eXtreme Gradient Boosting (XGBoost) depression detection model achieved an overall accuracy of 75.85%, recall of 78.18%, precision of 76.79%, and F1 score of 77.48%. The model's performance was further analyzed across modalities: the model trained with audio features performed better than the model trained with visual features, and the best performance was achieved by the model trained on the audiovisual features. The work of Othmani et al. in [15] used deep learning techniques to recognize depression and predict relapse from audio
and visual cues extracted from videos of clinical interviews. It involves a correlation-based anomaly detection framework that compares the audiovisual patterns of depression-free subjects to those of depressed individuals. The correlation between the audiovisual encoding of a test subject and a deep audiovisual representation of depression is computed to monitor depressed subjects and predict relapse. The approach achieves promising results, with accuracies of 80.99% and 82.55% for relapse and depression prediction on the DAIC-WOZ dataset.

The existing landscape of research on adolescent depression has made significant strides in understanding the onset and symptoms of depression in this age group. However, there is a notable gap in the ability to effectively predict depression relapse from audio and video modalities. By incorporating synchronized video and speech data, this research captures a broader spectrum of behavioral and emotional cues that might signify impending relapse in adolescents. The synchronization ensures that both modalities are aligned, allowing for a detailed examination of facial expressions, body language, and speech patterns simultaneously.

3. Problem Formulation

Our work aims to classify fused video and speech features measured before a relapse event. This entails a binary classification task where the two classes are "relapse sometime in the future" and "non-relapse". This problem is significantly different from detecting the presence of depression or predicting a certain depression rating scale score. The problem of relapse prediction is more complex since it involves the direct prediction of a clinical event within a population of adolescents who are already diagnosed with Major Depressive Disorder.

4. Methods

This work uses a dataset that was collected as part of the Depression Early Warning study run at the Centre for Addiction and Mental Health (CAMH). It includes 80 video interviews collected from 52 adolescents aged 12-21 who were all diagnosed with Major Depressive Disorder.

4.1. CAMH Dataset

All participants had an initial baseline visit followed by up to 7 followup visits, each spaced apart by 3-12 months. During each visit, participants were assessed by a trained research coordinator and psychiatrist, providing psychiatric evaluations of their depressive states via the Children's Depression Rating Scale (CDRS). Participants were interviewed by the coordinator during their initial visit and followup sessions. During recorded Zoom sessions, the coordinator asked them 10 open-ended questions about their past activities and mood, resulting in 2-10 minutes of video data per session. This dataset was collected as part of an ongoing research study at CAMH and is unavailable to the public.

4.2. Definition of Relapse

While there are many definitions of relapse in depression, a commonly accepted one is given by [16], which defines a relapse in adolescents as observing a CDRS score of at most 28 during at least 12 weeks of treatment followed by an increase in CDRS to at least 40 for at least two weeks. The first period of 12 weeks corresponds to a remission stage where the depressed adolescent does not exhibit symptoms but has not yet completed treatment. The period of two weeks corresponds to a depressive episode. In this study, there can be at least a three-month break between followup visits. Hence, the timing aspect of Kennard's definition must be accordingly modified to adhere to the provided data. This work proposes a definition of relapse as a period of at least one visit with a CDRS score of at most 40 followed by one visit with a CDRS score of at least 40.
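To make the modified definition concrete, the following is a minimal Python sketch of this labeling rule applied to a participant's visit history. The function name, data layout, and constant names are illustrative, not taken from the study's codebase.

```python
# Illustrative sketch of the modified relapse rule described above.
# `cdrs_scores` holds one CDRS score per visit, ordered by visit date.

REMISSION_MAX = 40  # a score of at most 40 counts toward the remission period
RELAPSE_MIN = 40    # a score of at least 40 at the next visit marks a relapse


def label_relapse_visits(cdrs_scores: list[float]) -> list[bool]:
    """Return one label per visit: True if a relapse is observed at that visit,
    i.e., the previous visit scored <= REMISSION_MAX and this one >= RELAPSE_MIN."""
    labels = [False]  # the baseline visit cannot itself be a relapse
    for prev, curr in zip(cdrs_scores, cdrs_scores[1:]):
        labels.append(prev <= REMISSION_MAX and curr >= RELAPSE_MIN)
    return labels


# Example: remission at 35, then a jump to 45 -> relapse observed at visit 3.
print(label_relapse_visits([55, 35, 45]))  # [False, False, True]
```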
4.3. Pipeline

There are three main stages that make up the methods of this pipeline. The first consists of preprocessing the video and audio data and organizing them such that the two modalities are aligned and the labels are balanced. The second involves training models on random subsets of the training data. In the final stage, the final model that was trained on the training set is evaluated on multiple test sets, and the performance metrics are averaged across each set. A diagram summarizing the pipeline is shown in Figure 1.

4.3.1. Stage 1: Data Preparation

Each video interview is divided into segments where only the participant is speaking. Since the interviews are conducted via Zoom, the videos are also cropped such that only the participant's face is visible. Several spectral features are extracted from the audio data using the Python package librosa [17]. They include the MFCCs, fundamental frequency, chromagrams, power spectral density, and spectral rolloff. These features are computed over a rolling window that is applied across the video. The amount of overlap is chosen such that the number of windows matches the number of video frames and the windows are evenly spread out across the video.
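Below is a minimal sketch of this feature extraction step using librosa. The paper lists the features but not its exact window settings, so the hop-length choice (picked here so the number of analysis windows matches the number of video frames) and feature dimensions are assumptions.

```python
# A sketch of Stage 1 audio feature extraction, assuming the feature list above.
import librosa
import numpy as np


def extract_spectral_features(audio_path: str, n_video_frames: int) -> np.ndarray:
    y, sr = librosa.load(audio_path, sr=None)
    # Choose the hop length so that the number of analysis windows roughly
    # matches the number of video frames, as described in the text.
    hop = max(1, len(y) // n_video_frames)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=hop)
    # Power spectral density per window (power spectrogram).
    psd = np.abs(librosa.stft(y, hop_length=hop)) ** 2
    # Fundamental frequency via probabilistic YIN; unvoiced frames come back NaN.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr, hop_length=hop)

    # Truncate all feature tracks to a common length and stack per-frame rows.
    n = min(mfcc.shape[1], chroma.shape[1], rolloff.shape[1],
            psd.shape[1], len(f0), n_video_frames)
    return np.vstack([mfcc[:, :n], chroma[:, :n], rolloff[:, :n],
                      psd[:, :n], np.nan_to_num(f0[None, :n])]).T  # (frames, features)
```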
Figure 1: In Stage 1 (green block), signal processing algorithms process audio signals to generate spectral features. Video frames and spectral features are aligned and stored into train and test folds. Train folds are passed through Stage 2 (blue block). Video frames are processed with a pretrained Inception ResNet model. Resulting features (video in blue and audio in red) are fused with the spectral features and then fed through an LSTM and a fully connected network for training. Stage 3 (red block) has the same components as Stage 2, but it is only used on the test fold for evaluation purposes.
In the provided dataset, there is a significant class imbalance where the non-relapse data is heavily over-represented (96.25% non-relapse). In order to not bias the training of the models and the evaluation metrics (described in the next two sections), several training folds are prepared alongside a test fold. The folds are constructed by first randomly selecting a proportion of relapse video clips to use in the test fold. This proportion is chosen to be 30%, and it is computed based on the number of frames within each video clip. A random selection of non-relapse clips is chosen to match the number of frames of the relapse ones (rounded to the nearest whole number of clips). This completes the test fold, which is reserved for Stage 3 of the pipeline. The rest of the relapse subjects are assigned to be used by train folds in Stage 2. Non-relapse video clips are randomly sampled without replacement, where the number of clips is selected to match the number of frames of the relapse subjects. Each random sample of non-relapse clips makes up another train fold, and this process is repeated until all non-relapse clips are used.
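A simplified sketch of this fold construction follows. It assumes each clip is represented as a (clip_id, n_frames, is_relapse) tuple; the rounding and sampling details are illustrative, since the paper does not publish its implementation.

```python
# A simplified sketch of the frame-balanced fold construction described above.
import random


def build_folds(clips, test_fraction=0.30, seed=0):
    rng = random.Random(seed)
    relapse = [c for c in clips if c[2]]
    non_relapse = [c for c in clips if not c[2]]
    rng.shuffle(relapse)
    rng.shuffle(non_relapse)

    # Reserve roughly 30% of relapse frames for the test fold.
    target = test_fraction * sum(c[1] for c in relapse)
    test, acc = [], 0
    while relapse and acc < target:
        clip = relapse.pop()
        test.append(clip)
        acc += clip[1]
    # Match the reserved relapse frames with non-relapse clips of
    # similar total frame count.
    nr_acc = 0
    while non_relapse and nr_acc < acc:
        clip = non_relapse.pop()
        test.append(clip)
        nr_acc += clip[1]

    # Remaining relapse clips appear in every train fold; non-relapse clips
    # are sampled without replacement into balanced folds until exhausted.
    relapse_frames = sum(c[1] for c in relapse)
    train_folds = []
    while non_relapse:
        fold, fr = list(relapse), 0
        while non_relapse and fr < relapse_frames:
            clip = non_relapse.pop()
            fold.append(clip)
            fr += clip[1]
        train_folds.append(fold)
    return train_folds, test
```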
4.3.2. Stage 2: Training of Models

The video frames are fed into an InceptionResNet model that was pre-trained on VGGFace2 [18], a large-scale face dataset. The resulting embeddings from this network are then fused with the spectral features of the audio data. These fused features are then fed into a neural network module (named the AudioVisual Network) containing an LSTM and a fully connected network. The LSTM is used to process 16 consecutive frames of features at a time, and the resulting hidden state is then fed into the fully connected network to be classified as either relapse or non-relapse. During the training process, random segments of 16 frames are sampled from the training video clips in order to not bias the training of the network towards a certain class.

The training process is applied to each train fold, and within a given fold, it is repeated for eight epochs. After the AudioVisual Network is trained on a given fold, its saved parameters are used to continue training the network on a new fold. This is repeated until all train folds are exhausted. This allows the network to train on the entire training dataset while still keeping the classes relatively balanced.
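A minimal PyTorch sketch of the AudioVisual Network is given below. The embedding and hidden dimensions are assumptions (the paper does not report exact layer sizes), and the per-frame video embeddings are assumed to come from a VGGFace2-pretrained Inception ResNet such as the InceptionResnetV1 in the facenet-pytorch package, which the paper does not name.

```python
# A sketch of the AudioVisual Network: per-frame video embeddings are
# concatenated with spectral audio features, then passed through an LSTM
# whose final hidden state feeds a fully connected classification head.
import torch
import torch.nn as nn


class AudioVisualNetwork(nn.Module):
    def __init__(self, video_dim=512, audio_dim=64, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(video_dim + audio_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, video_emb, audio_feat):
        # video_emb: (batch, 16, video_dim) frame embeddings from the
        # pretrained Inception ResNet; audio_feat: (batch, 16, audio_dim).
        fused = torch.cat([video_emb, audio_feat], dim=-1)
        _, (h_n, _) = self.lstm(fused)  # final hidden state of the LSTM
        return self.head(h_n[-1])       # relapse vs. non-relapse logit


# Example forward pass on a random 16-frame segment.
model = AudioVisualNetwork()
logit = model(torch.randn(4, 16, 512), torch.randn(4, 16, 64))
print(logit.shape)  # torch.Size([4, 1])
```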
4.3.3. Stage 3: Evaluation of Models

After the AudioVisual Network is trained, the full architecture (including the InceptionResNet feature extractor) is evaluated on the test fold that was reserved in Stage 1. A receiver operating characteristic (ROC) analysis is carried out on the predictions and ground truth labels. The optimal threshold of the ROC curve is selected by choosing the point that maximizes the difference between the true and false positive rates. This threshold is used to compute the accuracy and MAE.

The entire process of training the models and evaluating the final one on a test fold is carried out for 10 sets of folds. This is to ensure that the reported metrics are not biased toward a certain set of subjects. The resulting performance metrics are averaged across all of the test folds.
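This threshold rule (maximizing TPR minus FPR) is the Youden J statistic. A short sketch with scikit-learn follows; the labels and scores here are toy values for illustration only.

```python
# Sketch of the Stage 3 threshold selection and metric computation.
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1])                # ground-truth relapse labels
y_score = np.array([0.1, 0.4, 0.8, 0.35, 0.2, 0.9])  # network output scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
best = np.argmax(tpr - fpr)  # operating point maximizing TPR - FPR (Youden's J)
y_pred = (y_score >= thresholds[best]).astype(int)

accuracy = (y_pred == y_true).mean()
mae = np.abs(y_pred - y_true).mean()  # MAE of the binary predictions
print(f"threshold={thresholds[best]:.2f}, accuracy={accuracy:.2f}, MAE={mae:.2f}")
```

Note that for binary labels this MAE reduces to the misclassification rate, which is consistent with Table 1, where each fold's MAE is approximately one minus its accuracy.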
5. Results and Significance

Table 1 shows the results of evaluating the trained model on the 10 test folds. Each accuracy and MAE measure was reported after finding the optimal threshold of the ROC curve.

An average accuracy of 0.80 shows that video and speech data are relatively promising in the prediction of relapse in adolescent depression. In previous work by Othmani et al. [15], the authors also predicted relapse of depression using video and speech data. Similar to our work, they also yielded accuracies at around 0.8. To the best of our knowledge, this is the only other work that used video and speech to predict relapse of depression. Our work differs from Othmani et al. in two significant ways: 1) our study focuses on adolescents and 2) the source of our data is from non-clinical interviews. These interviews allow for more conversational topics that may better mimic a real-life situation in an adolescent's everyday life. While the target population for this work includes adolescents, this framework can be extended to other depressed populations.
Table 1
Results of evaluation of final trained models on test folds

Fold     MAE    Accuracy
1        0.077  0.92
2        0.28   0.72
3        0.21   0.79
4        0.23   0.77
5        0.12   0.88
6        0.26   0.74
7        0.17   0.83
8        0.23   0.77
9        0.17   0.83
10       0.27   0.73
Average  0.21   0.80

6. Limitations and Future Work

Predicting depression from audiovisual features encounters various challenges. The subjectivity of depression labels and the heterogeneous nature of this condition make it difficult to develop a universally applicable model. Additionally, there may be ethnic and cultural biases in the data that may have impacted the model's generalizability. This work did not consider the context within which interviews were conducted. Furthermore, the exclusion of gender-based analysis is a notable limitation, potentially overlooking important nuances in how depression manifests across different genders.

Future work in predicting depression from audiovisual features will prioritize the development of gender- and context-aware models. Moreover, given the longitudinal nature of the study, a promising avenue for future work is to exploit the temporal nature of the data to track changes in audiovisual features over long periods of time. Employing an overarching time series model could enhance the understanding of the dynamic nature of depression, allowing for the development of more adaptive and personalized prediction models.

Another way to extend this work is to combine other objective sources of data that can be collected simultaneously with video and speech. One such modality includes wearable technologies, and there have been several studies on using them for the prediction of depression [19]. Using similar techniques, it may be possible to fuse audiovisual features and those derived from wearables to create a more robust predictor of adolescent depression relapse. Finally, we intend to evaluate our work using publicly available audio/video depression datasets such as AVEC.

References

[1] World Health Organization, Suicide, 2023. URL: https://www.who.int/news-room/fact-sheets/detail/suicide.
[2] A. Thapar, S. Collishaw, D. S. Pine, A. K. Thapar, Depression in adolescence, The Lancet 379 (2012) 1056–1067. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3488279/. doi:10.1016/s0140-6736(11)60871-4.
[3] N. H. Goldhaber, A. Chea, E. B. Hekler, W. Zhou, B. Fergerson, Evaluating the mental health of physician-trainees using an SMS text message-based assessment tool: Longitudinal pilot study, JMIR Formative Research 7 (2023) e45102. doi:10.2196/45102.
[4] P. Summerfield, How many kids in Canada are connecting with video games?, 2023. URL: https://mediaincanada.com/2023/01/30/how-many-kids-in-canada-are-connecting-with-video-games/.
[5] Q. Zhao, H.-Z. Fan, Y.-L. Li, L. Liu, Y.-X. Wu, Y.-L. Zhao, Z.-X. Tian, Z.-R. Wang, Y.-L. Tan, S.-P. Tan, Vocal acoustic features as potential biomarkers for identifying/diagnosing depression: A cross-sectional study, Frontiers in Psychiatry 13 (2022). doi:10.3389/fpsyt.2022.815678.
[6] D. Shin, W. I. Cho, C. H. K. Park, S. J. Rhee, M. J. Kim, H. Lee, N. S. Kim, Y. M. Ahn, Detection of minor and major depression through voice as a biomarker using machine learning, Journal of Clinical Medicine 10 (2021) 3046. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8303477/. doi:10.3390/jcm10143046.
[7] N. Cummins, S. Scherer, J. Krajewski, S. Schnieder, J. Epps, T. F. Quatieri, A review of depression and suicide risk assessment using speech analysis, Speech Communication 71 (2015) 10–49. URL: https://www.sciencedirect.com/science/article/pii/S0167639315000369. doi:10.1016/j.specom.2015.03.004.
[8] Y. Wang, L. Liang, Z. Zhang, X. Xu, R. Liu, H. Fang, R. Zhang, Y. Wei, Z. Liu, R. Zhu, X. Zhang, F. Wang, Fast and accurate assessment of depression based on voice acoustic features: a cross-sectional and longitudinal study, Frontiers in Psychiatry 14 (2023). URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10320390. doi:10.3389/fpsyt.2023.1195276.
[9] A. Vázquez-Romero, A. Gallardo-Antolín, Automatic detection of depression in speech using ensemble convolutional neural networks, Entropy 22 (2020) 688. doi:10.3390/e22060688.
[10] P. Ekman, W. V. Friesen, Facial action coding system: Investigator's guide, Consulting Psychologists Press, 1978.
[11] M. Azher Uddin, J. Bibi Joolee, Y.-K. Lee, Depression level prediction using deep spatiotemporal features and multilayer Bi-LSTM, 2022. URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8976084.
[12] X. Zhou, K. Jin, Y. Shang, G. Guo, Visually interpretable representation learning for depression recognition from facial images, IEEE Transactions on Affective Computing 11 (2020) 542–552. doi:10.1109/taffc.2018.2828819.
[13] L. He, M. Niu, P. Tiwari, P. Marttinen, R. Su, J. Jiang, C. Guo, H. Wang, S. Ding, Z. Wang, X. Pan, W. Dang, Deep learning for depression recognition with audiovisual cues: A review, Information Fusion 80 (2022) 56–86. doi:10.1016/j.inffus.2021.10.012.
[14] K. Min, J. Yoon, M. Kang, D. Lee, E. Park, J. Han, Detecting depression on video logs using audiovisual features, Humanities and Social Sciences Communications 10 (2023). URL: http://dx.doi.org/10.1057/s41599-023-02313-6. doi:10.1057/s41599-023-02313-6.
[15] A. Othmani, A. O. Zeghina, A multimodal computer-aided diagnostic system for depression relapse prediction using audiovisual cues: A proof of concept, Healthcare Analytics 2 (2022) 100090. URL: https://www.sciencedirect.com/science/article/pii/S2772442522000387. doi:10.1016/j.health.2022.100090.
[16] B. D. Kennard, T. L. Mayes, Z. Chahal, P. A. Nakonezny, A. Moorehead, G. J. Emslie, Predictors and moderators of relapse in children and adolescents with major depressive disorder, The Journal of Clinical Psychiatry 79 (2018) e1–e8. doi:10.4088/JCP.15M10330.
[17] B. McFee, C. Raffel, D. Liang, D. Ellis, M. McVicar, E. Battenberg, O. Nieto, librosa: Audio and music signal analysis in Python, in: Proceedings of the 14th Python in Science Conference, 2015. doi:10.25080/majora-7b98e3ed-003.
[18] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818–2826. doi:10.1109/CVPR.2016.308.
[19] L. Sequeira, S. Perrotta, J. LaGrassa, K. Merikangas, D. Kreindler, D. Kundur, D. Courtney, P. Szatmari, M. Battaglia, J. Strauss, Mobile and wearable technology for monitoring depressive symptoms in children and adolescents: A scoping review, Journal of Affective Disorders 265 (2020) 314–324. URL: https://www.sciencedirect.com/science/article/pii/S0165032719310304. doi:10.1016/j.jad.2019.11.156.