<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Three-Stream Region-Aware Residual Attention Network for Facial Depression Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muhammad Turyalai Khan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Faisal Shafait</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pradorn Sureephong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Art Media and Technology, Chiang Mai University</institution>
          ,
          <country country="TH">Thailand</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>Guangdong CAS Cogniser Information Technology Co., Ltd.</institution>
          ,
          <addr-line>Guangzhou</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National University of Sciences and Technology</institution>
          ,
          <addr-line>Islamabad</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Facial expressions provide critical cues for understanding emotional and mental states, including depression. However, existing deep learning approaches often lack mechanisms to emphasize subtle yet informative features from distinct facial regions. This work introduces TS-RAN (Three-Stream Region-Aware Residual Attention Network), a novel architecture designed for facial depression recognition. TS-RAN extracts and fuses global and local features from the face, eyes, and mouth using three parallel customized residual branches, each integrated with a coordinate attention mechanism to enhance spatial feature learning. The fused representation enables a comprehensive and discriminative understanding of depressive facial cues. Experiments are conducted on the AVEC2014 (Audio-Visual Emotion Challenge 2014) and self-collected CZ2024 (Changzhou No. 2 People's Hospital) datasets. TS-RAN achieves MAE/RMSE (mean absolute error/root mean square error) of 8.04/9.65 and 6.84/8.77 on the respective datasets, demonstrating competitive performance compared to existing methods. These results highlight its potential in medical and affective computing applications.</p>
      </abstract>
      <kwd-group>
<kwd>Depression recognition</kwd>
        <kwd>facial expression recognition</kwd>
        <kwd>three-stream region-aware residual attention network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Depression is a prevalent and debilitating mental disorder, affecting over 300 million people globally and
projected to become the second leading cause of disability by 2030 [1]. Timely and accurate detection
is essential for effective treatment, yet clinical diagnosis remains largely dependent on subjective
assessments, often leading to delays or inconsistencies [2]. With the increasing availability of visual
data and advances in artificial intelligence, facial expression analysis has emerged as a promising tool
for supporting objective mental health evaluation [3, 4].</p>
      <p>Recent advances in machine learning and deep learning have enabled robust modeling across domains
such as speech analysis, biomedical signal processing, and computer vision [5, 6, 7]. In the context of
depression recognition, diverse physiological and behavioral modalities have been explored, including
electro-encephalography [8], spectroscopy [9], brain imaging [10], and eye movement tracking [11].
Among these, facial expressions are particularly advantageous due to their non-intrusive acquisition
and strong correlation with emotional and cognitive states associated with depression [12, 13].</p>
      <p>Deep learning-based facial expression analysis has emerged as a promising approach for automated
depression recognition. Early models applied pre-trained convolutional neural networks (CNNs),
such as visual geometry group (VGG) and residual network (ResNet), to extract global facial features,
sometimes incorporating temporal dynamics using optical flow or frame differencing [14, 15]. More
recent approaches emphasize regional analysis, dividing the face into patches or focusing on areas
like the eyes and mouth, combined with attention mechanisms to highlight diagnostically relevant
regions [16, 17, 18]. Multistream and spatiotemporal networks have also been proposed to integrate
local and global cues [19, 20]. Despite these advancements, many methods lack the ability to model
inter-region dependencies and to capture fine-grained cues that are critical for recognizing subtle
depressive expressions. Additionally, traditional architectures rarely leverage the complementary
nature of features derived from different facial regions, and often overlook the importance of robust
feature fusion strategies [21].</p>
      <p>To address these limitations, this work introduces the Three-Stream Region-Aware Residual Attention
Network (TS-RAN) for facial depression recognition. TS-RAN consists of three parallel residual branches,
each customized to extract features from the face, eyes, and mouth, respectively. Each stream
incorporates a coordinate attention mechanism to enhance spatial feature learning and selectively emphasize
important cues [22]. The resulting global and local features are fused into a unified representation for
comprehensive and discriminative depression analysis.</p>
      <p>The proposed method is evaluated on two benchmark datasets: Audio-Visual Emotion Challenge 2014
(AVEC2014) [23], a public dataset for affective computing, and Changzhou No. 2 People’s Hospital 2024
(CZ2024), a clinical dataset collected from Changzhou No. 2 People’s Hospital. Experimental results
show that TS-RAN achieves competitive performance in terms of mean absolute error (MAE) and root
mean square error (RMSE), outperforming several state-of-the-art methods. These findings underscore
the potential of region-aware modeling in advancing automated mental health diagnostics.</p>
      <p>The main contributions of this work are summarized as follows:
• A novel three-stream region-aware residual attention network (TS-RAN) is introduced to extract and fuse
facial features from the face, eyes, and mouth regions for depression recognition.
• Each residual stream is enhanced with coordinate attention to strengthen spatial encoding and
highlight subtle but informative facial cues.
• Comprehensive experiments are conducted on both a public dataset (AVEC2014) and a newly
collected clinical dataset (CZ2024), demonstrating the model’s robustness across controlled and
real-world conditions.</p>
      <p>The remainder of the paper is organized as follows. Section 2 presents the proposed methodology,
including preprocessing, feature extraction, and fusion. Section 3 describes the datasets, evaluation
metrics, and training setup. Section 4 reports experimental results, ablation analysis, and comparative
evaluations. Finally, Section 5 concludes the paper and outlines directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>The overall architecture of the proposed Three-Stream Region-Aware Residual Attention Network
(TS-RAN) is illustrated in Figure 1. Given an input video, frames are sampled at regular intervals
and passed through a facial landmark detection module to localize key facial regions. Three separate
branches are constructed to process the full face, eyes, and mouth independently. Each branch uses a
customized residual network (ResNet) backbone integrated with a coordinate attention (CA) mechanism
to extract region-specific features. These features are subsequently concatenated and passed through a
fully connected layer to generate a continuous depression score.</p>
      <sec id="sec-2-1">
        <title>2.1. Preprocessing</title>
        <p>Video frames are extracted at regular intervals of 5 seconds using the open-source computer vision library
(OpenCV) to reduce temporal redundancy and computational load. Each sampled frame is processed by
a multi-task cascaded convolutional network (MTCNN) to detect facial landmarks, including key points
around the eyes and mouth. Based on the landmark positions, three specific facial regions are cropped
using predefined bounding boxes: the full face, the eyes region, and the mouth region. Each cropped
region is extracted with fixed spatial dimensions and three RGB channels, resulting in face, eyes, and
mouth crops of size 3×256×256, 3×96×192, and 3×96×128, respectively, before being passed into their
corresponding feature extraction branches.</p>
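        <p>For illustration, the following sketch shows how such a pipeline could be assembled with OpenCV and the MTCNN implementation from the facenet-pytorch package; the landmark-centred crop geometry and the helper crop_around are illustrative assumptions, since the exact bounding-box rules are defined by the method itself.</p>
        <preformat>
# Hedged sketch of the preprocessing stage: sample one frame every 5 s,
# detect a face and 5-point landmarks with MTCNN, and crop the three regions.
import cv2
from facenet_pytorch import MTCNN

detector = MTCNN(select_largest=True, post_process=False)

def crop_around(img, center, w, h):
    # Fixed-size crop centred on a landmark midpoint (illustrative geometry).
    cx, cy = center
    y0, x0 = max(cy - h // 2, 0), max(cx - w // 2, 0)
    return cv2.resize(img[y0:y0 + h, x0:x0 + w], (w, h))

def sample_and_crop(video_path, interval_s=5):
    cap = cv2.VideoCapture(video_path)
    step = max(int(cap.get(cv2.CAP_PROP_FPS) * interval_s), 1)
    crops, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            boxes, _, lms = detector.detect(rgb, landmarks=True)
            if boxes is not None:
                x1, y1, x2, y2 = boxes[0].astype(int)
                le, re, nose, ml, mr = lms[0].astype(int)
                face = cv2.resize(rgb[y1:y2, x1:x2], (256, 256))   # face crop
                eyes = crop_around(rgb, (le + re) // 2, 192, 96)   # eyes crop
                mouth = crop_around(rgb, (ml + mr) // 2, 128, 96)  # mouth crop
                crops.append((face, eyes, mouth))
        idx += 1
    cap.release()
    return crops
</preformat>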
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Region-Aware Feature Extraction</title>
        <p>Each of the three facial regions is processed by a dedicated ResNet-50 architecture, modified to act as
a pure feature extractor. The classification head is removed and replaced with an identity mapping,
allowing the network to output high-dimensional feature embeddings without any class-specific bias.
Each ResNet is augmented with a coordinate attention module, enhancing its ability to capture spatial
dependencies across both horizontal and vertical dimensions.</p>
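        <p>As a minimal sketch, this head replacement can be expressed in PyTorch with torchvision's ResNet-50; weight initialization choices are omitted, and the attention insertion is sketched separately after Eq. (1).</p>
        <preformat>
# Sketch: a torchvision ResNet-50 reduced to a pure feature extractor.
import torch
import torch.nn as nn
from torchvision.models import resnet50

def make_stream_backbone():
    net = resnet50(weights=None)  # one backbone per stream (face / eyes / mouth)
    net.fc = nn.Identity()        # drop the classification head -> 2048-d output
    return net

backbone = make_stream_backbone()
feat = backbone(torch.randn(1, 3, 256, 256))  # e.g. a face crop
print(feat.shape)  # torch.Size([1, 2048])
</preformat>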
        <p>An overview of the modified feature extraction module is shown in Figure 2. This architecture extends
the standard ResNet-50 by integrating coordinate attention into each residual block, specifically after
the final convolutional layer. This modification enables the network to capture both spatial structure
and contextual dependencies across the height and width axes, thereby enhancing the representation of
depression-relevant facial cues in each region.</p>
        <p>Coordinate attention improves upon traditional channel-only attention mechanisms by preserving
positional information [22]. Instead of applying global average pooling over spatial dimensions, it
encodes directional context separately along the height and width axes. Let $X \in \mathbb{R}^{C \times H \times W}$ denote the
input feature map, where $C$, $H$, and $W$ are the number of channels, height, and width, respectively.
Coordinate attention computes:</p>
        <p>
          $z^h_c(h) = \frac{1}{W} \sum_{i=1}^{W} x_c(h, i), \qquad z^w_c(w) = \frac{1}{H} \sum_{j=1}^{H} x_c(j, w)$ (1)
        </p>
        <p>These aggregated features $z^h$ and $z^w$ are passed through shared transformations to produce attention
maps, which are then applied back to the original feature map. This allows the network to focus on
salient spatial locations relevant to depression cues.</p>
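        <p>A compact PyTorch rendering of this mechanism, following the design of Hou et al. [22], might look as follows; the channel reduction ratio is an illustrative choice.</p>
        <preformat>
# Sketch of a coordinate attention block: pool along H and W separately,
# pass both through a shared 1x1 transform, and re-weight the input.
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=32):  # reduction is illustrative
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # z^h: average over width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # z^w: average over height
        self.shared = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.Hardswish(),
        )
        self.attn_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.attn_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        zh = self.pool_h(x)                      # (n, c, h, 1)
        zw = self.pool_w(x).permute(0, 1, 3, 2)  # (n, c, w, 1)
        y = self.shared(torch.cat([zh, zw], dim=2))
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.attn_h(yh))                      # height map
        aw = torch.sigmoid(self.attn_w(yw.permute(0, 1, 3, 2)))  # width map
        return x * ah * aw  # attention maps applied back to the input
</preformat>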
        <p>Each stream outputs a 2048-dimensional feature vector corresponding to its respective region. The
use of coordinate attention ensures that both global semantics and local details are preserved, improving
the discriminative capacity of the extracted features.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Feature Fusion and Regression</title>
        <p>The high-dimensional feature embeddings generated by the face, eyes, and mouth streams, each of size
2048, are concatenated to form a unified 6144-dimensional feature vector. This fused representation
integrates complementary spatial information from global and local facial regions, capturing both
holistic appearance and subtle region-specific variations associated with depressive states. By combining
features from different facial regions, the model benefits from enhanced context awareness and improved
discriminative power.</p>
        <p>The concatenated feature vector is passed through a fully connected regression head, which maps
the fused representation to a scalar depression severity score. This final output is designed to reflect the
continuous nature of depression levels, aligning with clinical assessment standards. The regression
head consists of dense layers with non-linear activation functions, followed by a final linear layer that
outputs a single value.</p>
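        <p>A minimal sketch of this fusion-and-regression head in PyTorch is given below; the hidden width is an illustrative assumption rather than the exact configuration.</p>
        <preformat>
# Sketch: concatenate the three 2048-d stream embeddings into a 6144-d
# vector and regress a single continuous depression severity score.
import torch
import torch.nn as nn

class FusionRegressor(nn.Module):
    def __init__(self, dim=2048, hidden=512):  # hidden width is illustrative
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(3 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # final linear layer -> scalar score
        )

    def forward(self, f_face, f_eyes, f_mouth):
        fused = torch.cat([f_face, f_eyes, f_mouth], dim=1)  # (batch, 6144)
        return self.head(fused).squeeze(1)                   # (batch,)
</preformat>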
        <p>
          The model is trained to minimize the mean squared error (MSE) between the predicted and ground
truth depression scores, promoting accurate regression. Formally, the loss is defined as:

$\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$ (2)

where $y_i$ is the ground truth score, $\hat{y}_i$ is the predicted score for the $i$-th sample, and $N$ is the total
number of training samples. This loss encourages the network to produce continuous estimates that
closely match clinical annotations.
        </p>
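        <p>In PyTorch terms, Eq. (2) corresponds directly to the built-in mean-squared-error criterion:</p>
        <preformat>
# Eq. (2) as a PyTorch criterion: mean squared error over the batch.
import torch.nn as nn

criterion = nn.MSELoss()
# loss = criterion(predicted_scores, ground_truth_scores)
</preformat>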
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>
          The proposed TS-RAN architecture is evaluated on two datasets: the publicly available AVEC2014 and
the clinical CZ2024 dataset collected from Changzhou No. 2 People’s Hospital. The AVEC2014 was
released as part of the Audio-Visual Emotion Challenge, and contains 300 videos divided equally across
two tasks: Freeform and Northwind [23]. In the Freeform task, participants respond to emotionally
driven prompts such as recalling personal experiences, while the Northwind task involves reading
a neutral passage aloud. Each task contributes 150 videos, with 50 allocated to each of the training,
validation, and test sets. Video durations range from 6 to 248 seconds, recorded at 30 frames per second.
Ground truth depression scores are provided based on the Beck Depression Inventory-II (BDI-II), which
classifies severity into minimal (0–13), mild (14–19), moderate (20–28), and severe (29–63) [24]. Dataset
statistics are summarized in Table 1.
        </p>
        <p>
          CZ2024 is a clinically collected dataset comprising 327 videos of patients undergoing depression
assessment at Changzhou No. 2 People’s Hospital in Jiangsu, China. Subjects span a diverse age range of
14 to 73 years and include both male and female participants. Depression severity is annotated according
to the Hamilton Depression Rating Scale (HAMD), which classifies depression into minimal (0–7), mild
(8–19), moderate (20–34), and severe (≥35). The dataset is divided into 193 training, 95 validation, and 39
test videos. All recordings were captured in natural indoor settings, reflecting variations in lighting,
pose, and expression. Key characteristics of CZ2024 are presented in Table 2.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evaluation Metrics</title>
        <p>To assess the regression performance of the proposed TS-RAN model in predicting depression severity,
two commonly used metrics are employed: Root Mean Square Error (RMSE) and Mean Absolute Error
(MAE) [12]. These metrics are widely adopted in affective computing and regression tasks due to their
ability to capture both prediction accuracy and consistency.</p>
        <p>RMSE evaluates the average squared difference between predicted and actual scores, placing greater
emphasis on larger errors. MAE, in contrast, measures the average absolute difference and provides
a more interpretable measure of typical prediction error. Together, they offer a robust evaluation of
the model’s precision in estimating continuous depression severity scores. Mathematically, RMSE and
MAE are defined as:</p>
        <p>
          $\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} \quad (3), \qquad \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| \quad (4)$

where $N$ is the total number of samples, $y_i$ is the ground truth score, and $\hat{y}_i$ is the predicted score.
        </p>
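        <p>For reference, both metrics can be computed in a few lines of NumPy:</p>
        <preformat>
# RMSE (Eq. 3) and MAE (Eq. 4) over ground-truth and predicted score arrays.
import numpy as np

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))

def mae(y, y_hat):
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(y_hat))))
</preformat>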
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Training Details</title>
        <p>The training of TS-RAN was performed using the Python programming language, and its configuration
is summarized in Table 3. The model was optimized using the RMSprop algorithm with a smoothing
factor (alpha) of 0.9 to stabilize gradient updates. A learning rate of 0.0005 was selected to ensure a
balance between convergence speed and stability [25, 26], while mini-batches of size 24 were used to
maintain efficiency without compromising model generalization.</p>
        <p>To mitigate overfitting, a weight decay of 0.05 was applied as regularization. Additionally, a stepwise
learning rate scheduler was used, reducing the learning rate by a factor of 0.5 every 4 epochs. This
gradual adjustment strategy facilitates more refined convergence during later training stages, especially
on limited data.</p>
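        <p>Under these settings, the optimizer and scheduler could be configured as in the sketch below; the model, data loader, and epoch count are placeholders assumed to be defined elsewhere.</p>
        <preformat>
# Sketch of the training configuration described above: RMSprop with
# alpha=0.9, lr=5e-4, weight decay 0.05, and a step scheduler halving
# the learning rate every 4 epochs (batch size 24 is set on the loader).
import torch

optimizer = torch.optim.RMSprop(
    model.parameters(), lr=5e-4, alpha=0.9, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.5)

for epoch in range(num_epochs):                       # num_epochs: placeholder
    for faces, eyes, mouths, scores in train_loader:  # batches of size 24
        optimizer.zero_grad()
        preds = model(faces, eyes, mouths)            # TS-RAN forward pass
        loss = torch.nn.functional.mse_loss(preds, scores)
        loss.backward()
        optimizer.step()
    scheduler.step()
</preformat>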
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>This section presents the experimental results of the proposed TS-RAN model for facial depression
recognition. Quantitative evaluations are reported on both the AVEC2014 and CZ2024 datasets
using standard regression metrics. Ablation studies are conducted to examine the contribution of the
coordinate attention mechanism and the impact of multi-region feature fusion in enhancing model
performance. In addition, a comparative analysis with existing state-of-the-art methods is provided to
demonstrate the effectiveness of the proposed architecture.</p>
      <sec id="sec-4-1">
        <title>4.1. Quantitative Results</title>
        <p>The performance of the proposed TS-RAN model was quantitatively evaluated on both the AVEC2014
and CZ2024 datasets using mean absolute error (MAE) and root mean square error (RMSE) as evaluation
metrics. The results are summarized in Figure 3.</p>
        <p>On the AVEC2014 dataset, TS-RAN achieved an MAE of 8.04 and an RMSE of 9.65, reflecting reliable
prediction accuracy across diverse video samples and depression severity levels. On the CZ2024 clinical
dataset, the model attained even lower error rates, with an MAE of 6.84 and an RMSE of 8.77, indicating
improved generalization to real-world, clinically acquired facial recordings.</p>
        <p>These results demonstrate the model’s robustness in handling both benchmark and clinical data,
highlighting its capacity to extract discriminative features from global and local facial regions for
accurate depression severity estimation.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Ablation Study</title>
        <p>To evaluate the contribution of individual components within the TS-RAN architecture, ablation studies
were conducted on both the AVEC2014 and CZ2024 datasets. The analysis focused on two critical
aspects: the role of coordinate attention and the impact of local facial features (eyes and mouth). The
results are presented in Table 4.</p>
        <p>When the coordinate attention module was removed, the performance deteriorated on both datasets,
confirming its importance in enhancing spatial feature representation. Specifically, the MAE increased
from 8.04 to 8.23 on AVEC2014 and from 6.84 to 7.61 on CZ2024, indicating reduced accuracy in the
absence of attention-guided feature enhancement. The exclusion of local features (eyes or mouth) also
led to performance drops, though less severe than removing attention. Notably, removing both eyes and
mouth features while retaining only the global face stream resulted in higher MAE and RMSE on both
datasets, suggesting that local region cues provide complementary information critical for accurate
depression prediction.</p>
        <p>The full TS-RAN model, incorporating coordinate attention and all three facial regions, consistently
outperformed all ablated variants, confirming the effectiveness of both spatial attention and multi-region
feature fusion in facial depression recognition.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Comparison with State-of-the-Art</title>
        <p>To assess the efectiveness of the proposed TS-RAN model, performance comparisons were made against
several existing state-of-the-art methods reported on the AVEC2014 and CZ2024 datasets. The results,
shown in Table 5 and Table 6, include models that utilize various combinations of handcrafted features
and deep learning architectures.</p>
        <p>On the AVEC2014 dataset, TS-RAN achieves competitive results with an MAE of 8.04 and an RMSE
of 9.65, outperforming earlier methods such as [23] and [27]. It also performs better than a recent
Transformer-based method [28]. Furthermore, the performance of TS-RAN is close to that of other deep
models including [14, 29, 12]. Although these methods report slightly lower errors, TS-RAN maintains
robustness through its multi-region structure without requiring heavy temporal modeling.</p>
        <p>On the CZ2024 dataset, TS-RAN outperforms a hybrid Vision Transformed-based method [30] by a
noticeable margin, achieving 6.84 MAE and 8.77 RMSE compared to 7.46 MAE and 9.15 RMSE, indicating
its superior ability to generalize across clinical settings with varied facial presentations.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>This paper presented TS-RAN, a three-stream region-aware residual attention network designed for
facial depression recognition. The model extracts and fuses global and local facial features from the
full face, eyes, and mouth using customized residual networks enhanced with coordinate attention.
Experimental results on the AVEC2014 and CZ2024 datasets demonstrated that TS-RAN delivers
competitive performance, outperforming baseline approaches and showing strong generalization to
clinical data. Ablation studies validated the contribution of coordinate attention and multi-region feature
fusion in improving recognition accuracy.</p>
      <p>(Table 5 compares MAE on AVEC2014 for the baseline [23], FDHH [27], Deep CNN [14], CNN-LSTM [29],
3D-CNN [19], handcrafted and deep features [12], a Transformer [28], and TS-RAN (ours); Table 6 compares
the hybrid Vision Transformer [30] and TS-RAN (ours) on CZ2024.)</p>
      <p>Future work will explore the integration of temporal modeling using three-dimensional convolutions
or transformer-based encoders to better capture subtle changes and micro-expressions. In addition,
combining visual features with other modalities such as audio or physiological signals may further
enhance the robustness of depression estimation in real-world applications.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgments</title>
      <p>The authors would like to thank the Guangdong CAS Cogniser, NUST, and Chiang Mai University for
providing computational resources for this research. This work was supported by the
HORIZON-MSCASE-2022 project ACMod (grant 101130271). All authors declare that there is no conflict of interest. The
CZ2024 dataset is confidential and cannot be shared due to ethical restrictions.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Mathers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Loncar</surname>
          </string-name>
          ,
          <article-title>Projections of global mortality and burden of disease from 2002 to 2030, PLOS Meicine 3 (</article-title>
          <year>2006</year>
          )
          <article-title>e442</article-title>
          . doi:
          <volume>10</volume>
          .1371/journal.pmed.
          <volume>0030442</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shang</surname>
          </string-name>
          , G. Guo,
          <article-title>Visually interpretable representation learning for depression recognition from facial images</article-title>
          ,
          <source>IEEE Transactions on Afective Computing</source>
          <volume>11</volume>
          (
          <year>2018</year>
          ). doi:
          <volume>10</volume>
          . 1109/TAFFC.
          <year>2018</year>
          .
          <volume>2828819</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <surname>Facialpulse:</surname>
          </string-name>
          <article-title>An eficient rnn-based depression detection via temporal facial landmarks</article-title>
          ,
          <source>in: MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, ACM</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>320</lpage>
          . doi:
          <volume>10</volume>
          .1145/ 3664647.3681546.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>