<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Victoria Vysotska1,†, Kirill Smelyakov2,†, Anastasiya Chupryna2,†, Anastasiia Kochkina2,* ,†, Ganna Pliekhova3,†, Anton Naumov2,†</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ganna Pliekhova</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anton Naumov</string-name>
          <email>anton.naumov@nhdsl.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kharkiv National Automobile and Highway University</institution>
          ,
          <addr-line>Yaroslava Mudrogo str. 25, Kharkiv, 61002</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>Stepan Bandera Street, 12, Lviv, 79013</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Face recognition in dynamic real-world conditions is challenging because of varying lighting, partial occlusions, and pose variations. While graph convolutional networks (GCNs) and transformer-based architectures achieve high performance in addressing these issues, their black-box nature complicates interpretation. This paper proposes a methodology to enhance the explainability of dynamic face recognition models using graph-based approaches and attention mechanisms, with a view to integrating the model into a Hybrid Intellectual System. We introduce visualisation methods for key facial landmarks and frame significance in video sequences through Explainable AI (XAI) techniques, such as Attention Attribution, Feature Ablation, Grad-CAM, and Attention Rollout. Experimental results indicate that the proposed approach can improve model interpretability without compromising accuracy. This research also explores a multimodal approach that integrates Llama-based Llasa speech synthesis, combining natural language processing with visual facial expression detection.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Facial recognition (FR) in real-world dynamic conditions remains one of the most challenging and
demanding areas in computer vision. FR has a wide range of practical applications, from public
security technologies to the more detailed analysis of personal emotions and interactions in social
robotics. Significant progress has been made in recent years; however, the task remains challenging due
to unconstrained dynamic conditions: changing lighting, varying head poses, and partial occlusions.</p>
      <p>
        In computer vision, and in the Facial Expression Recognition (FER) field in particular, Convolutional
Neural Networks (CNNs) are the most frequently used models for static and dynamic 2D recognition tasks. However,
CNNs may lose performance in 3D tasks because of changes in object position. Applying Graph
Convolutional Networks (GCNs) to such tasks is therefore reasonable, as they effectively capture spatial
relationships between facial key points, allowing more precise detection of changes. At the same
time, Transformer-based networks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], initially built for natural language processing tasks, have
great potential for temporal sequence processing, capturing dynamic changes with high complexity.
      </p>
      <p>Incorporating NLP-driven models such as Llama-based Llasa into this framework enables a
multimodal approach, where facial expressions are not only recognised but also translated into
contextually appropriate speech responses. Llasa integration enables expressive speech synthesis by
matching language patterns [27-29] with emotional expressions, so that detected facial
expressions can be reflected in human-like speech. Furthermore, it will enhance affective computing,
making human-AI interactions more emotionally aware. Building on the scalability of
Llama-based architectures, Llasa provides highly accurate speech synthesis that captures not only the
lexical content but also the emotional subtext of detected facial expression patterns.</p>
      <p>ORCID: 0000-0001-6417-3689 (V. Vysotska); 0000-0001-9938-5489 (K. Smelyakov); 0000-0003-0394-9900 (A. Chupryna);
0009-0000-6679-8746 (A. Kochkina); 0000-0002-6912-6520 (G. Pliekhova); 0009-0007-9100-7978 (A. Naumov)</p>
      <p>However, networks with a complicated architecture remain "black boxes", which affects not only
performance but also end-user trust. This second aspect is vital in domains such as
medicine, security, and personal data processing.</p>
      <p>The issue of interpretability has gained increasing importance due to the growing interest in
transparency and explainability in AI-driven decision-making. Users and developers must
understand which features and frames are decisive in a model's predictions. This is why integrating
explainable AI (XAI) approaches into face recognition models is a key direction for enhancing their
practical value.</p>
      <p>Human conversation is naturally multimodal: meaning is encoded not only in text but also in
facial expression, vocal prosody and gesture. These paralinguistic features colour the literal words with
attitude and emotional tone that are essential for smooth conversational flow and mutual understanding. Dialogue
systems and social agents that rely on text alone, therefore, miss a substantial portion of the
communicative channel, leading to responses that can feel tone-deaf or robotic. Multimodal affective
language understanding seeks to fuse verbal and non-verbal cues so that artificial agents can
interpret and respond with emotionally congruent behaviour, bringing machine interaction closer to
human conversational norms.</p>
      <p>
        This study aims to develop and integrate specialised XAI methods into graph-based and
Transformer architectures to improve recognition accuracy and significantly enhance the
interpretability of decision-making processes. Specifically, the focus is on analysing and visualising
the most relevant facial landmarks and key video frames that substantially impact the final
classification outcome. In addition, this research explores the integration of SpatioTemporal Graph
Transformer (STGT) with the Llasa text-to-speech synthesis model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to create a hybrid multimodal,
emotionally adaptive AI system [13].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <sec id="sec-2-1">
        <title>2.1. Graph Methods for Dynamic Facial Expression Recognition</title>
        <p>
          Graph-based approaches leverage facial landmark points to construct a structured representation of
the face, where each node corresponds to a specific coordinate in space. This method enables
efficient modelling of spatial and temporal relationships between facial regions, which is crucial for
dynamic facial expression recognition (DFER) [15]. Compared with traditional 2D representations, moving
from 2D to 3D landmarks enhances the ability to capture subtle depth variations in facial
expressions, leading to more accurate and robust emotion classification. One of the most effective
frameworks for real-time 3D facial landmark extraction is MediaPipe [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], developed by Google. It
defines a 3D face model using 468 key points, which are detected and tracked across video sequences.
The MediaPipe Face Landmark Model normalises the X and Y coordinates to the range 0 to 1,
while the Z-coordinate is estimated relative to the X-axis using a perspective projection
camera model, with values ranging between -1 and 1. This approach ensures a consistent and stable
representation of depth, which is critical for analysing micro-expressions and subtle facial
movements in dynamic sequences. MediaPipe also detects landmarks reliably on frames
with interference, as shown in Figure 1.
        </p>
        <p>Given its high computational efficiency and real-time processing capabilities even on mobile
devices, MediaPipe serves as a strong foundation for integrating graph-based methods into Graph
Neural Networks (GNNs) [14] and Transformer models for DFER [15]. The framework's ability to
accurately capture 3D facial dynamics in real-time makes it highly suitable for graph-based facial
representation [25], enabling more expressive feature extraction while maintaining a balance
between performance and accuracy. Additionally, the integration of 3D landmarks with graph-based
models allows for improved spatial reasoning, helping the model better understand facial
deformations over time.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Implementing Transformer-based networks for DFER</title>
        <p>Transformer-based networks have recently gained attention for Dynamic Facial Expression
Recognition (DFER) due to their effectiveness in modelling long-range dependencies in sequential
data. Unlike traditional Convolutional neural networks (CNNs), which primarily focus on spatial
feature extraction, or Long Short-Term Memory networks (LSTMs) and Recurrent neural networks
(RNNs), which capture temporal patterns but often struggle with long-term dependencies,
Transformers use self-attention mechanisms to process facial expressions across multiple frames
holistically.</p>
        <p>
          DFER models are increasingly proficient at analysing spatiotemporal features [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] from video sequences.
The ViViT [23] model improves on traditional CNN-RNN hybrid approaches by employing factorised
attention mechanisms to directly extract both spatial and temporal representations, thereby
eliminating the need for recurrent layers. This allows the model to pick up subtle patterns of
change in facial expression over time.
        </p>
        <p>ViViT works by dividing video frames into patches, which are embedded in a high-dimensional
feature space. These patch embeddings then pass through spatiotemporal self-attention layers,
allowing the model to effectively learn spatial dependencies alongside temporal variations. This enables
accurate detection of subtle facial movement patterns, including micro-expressions.</p>
        <p>
          Additionally, ViViT's scalability to longer video sequences makes it well-suited for
real-world applications where emotions change dynamically. When combined with pose-invariant and
occlusion-aware learning strategies, it maintains robust performance under varying conditions. Recent
studies [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] show that ViViT outperforms traditional CNN-LSTM models and other Transformer
models when dealing with pose variations, lighting changes, and expression intensity changes,
highlighting its promise in the field of deep learning-based affective computing.
        </p>
        <p>In DFER, each video sequence can be represented as a temporal-spatial graph, with facial
landmarks serving as nodes and their relationships evolving over time. A Spatial Transformer within the
network learns to focus on critical facial regions through landmark-level attention. At the same time,
a Temporal Transformer captures the sequential dependencies between frames, allowing the system
to detect subtle transitions in expressions.</p>
        <p>This architecture is particularly beneficial for addressing challenges such as occlusions, pose
variations, and fine-grained emotional cues, ultimately enhancing recognition accuracy.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Explainable AI (XAI) for DFER</title>
        <p>
          Explainable AI (XAI) in dynamic facial expression recognition (DFER) enhances model transparency
and fosters trustworthiness by tackling challenges such as temporal dependencies, pose variations,
and micro-expressions. Traditional facial expression recognition models, such as RNNs and CNNs,
often behave as black-box classifiers, making their predictions difficult to interpret. To mitigate this
issue, techniques like LIME and SHAP [24] are employed to identify influential facial features, while
Grad-CAM [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] visualises the key facial regions that contribute to the classifications.
        </p>
        <p>
          DFER requires models capable of processing sequences over time. Transformers, which utilise
self-attention mechanisms [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], allow models to focus on critical frames, thereby improving
interpretability. Graph-based approaches represent facial landmarks as nodes in a graph neural
network (GNN) and employ attention mechanisms to dynamically highlight important regions. In
addition, adaptations of Grad-CAM for temporal-spatial graphs can generate heatmaps over time,
enhancing the explainability of the model. Prototype-based methods, such as ProtoPNet, provide
case-based reasoning, and Contrastive Explanation Methods (CEMs) help distinguish subtle
differences in expressions.
        </p>
        <p>Evaluating XAI in DFER involves human-in-the-loop studies, faithfulness tests, and ensuring
alignment with psychological theories of emotion perception. However, challenges remain in
interpreting temporal features, mitigating bias, and ensuring real-time explainability for applications
in healthcare and surveillance. Future research should aim to develop scalable, real-time XAI
techniques to further enhance the transparency and effectiveness of DFER models.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods and Materials</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset and Preprocessing</title>
        <p>The MAFW dataset served as the foundation for data preprocessing. Initially, video files were
processed using MediaPipe Face Mesh, which facilitated the extraction of a comprehensive set of 468
key points representing facial landmarks. To optimise the performance of the Transformer model, a
deliberate reduction of these key points was implemented, narrowing them down to a more
manageable 68.</p>
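        <p>The reduction step can be sketched as a fixed index selection applied to every frame. This is only an illustration: the paper does not list which 68 of the 468 Face Mesh points were kept, so the evenly spaced <monospace>SELECTED</monospace> indices and the helper <monospace>reduce_landmarks</monospace> below are hypothetical placeholders.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]

# Hypothetical index list: a real mapping would name 68 specific Face Mesh
# indices; evenly spaced indices are used here only as a placeholder.
SELECTED = [round(i * 467 / 67) for i in range(68)]

def reduce_landmarks(frame: Sequence[Point3D], keep: Sequence[int] = SELECTED) -> List[Point3D]:
    """Keep only the landmarks whose indices are listed in `keep`."""
    if len(frame) != 468:
        raise ValueError("expected 468 Face Mesh landmarks")
    return [frame[i] for i in keep]

# Synthetic frame standing in for one MediaPipe detection result.
frame = [(i / 468, 0.5, 0.0) for i in range(468)]
reduced = reduce_landmarks(frame)
```
]]></preformat>
        <p>Keeping the selection in one constant means the same 68 nodes are used for every frame, which a graph model with a fixed node set requires.</p>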
        <p>The selected key points were organised into separate datasets according to the
number of frames in each video segment. As a result, three datasets were
prepared for videos containing up to 50, 100, and 150 frames, which together
comprise 8,120 video entries. However, not every
video was paired with a class label, which reduced the final count to 7,706
videos.</p>
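        <p>A minimal sketch of this grouping is given below. It assumes that each clip goes into the smallest bucket that can hold it and that clips without a label are dropped; the paper does not say whether the three datasets are disjoint or cumulative, and the record fields (<monospace>n_frames</monospace>, <monospace>label</monospace>) are illustrative names.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Optional

BUCKETS = (50, 100, 150)  # maximum frame counts named in the text

def assign_bucket(n_frames: int) -> Optional[int]:
    """Return the smallest bucket that can hold the clip, or None if too long."""
    for limit in BUCKETS:
        if n_frames <= limit:
            return limit
    return None

def build_datasets(clips: List[dict]) -> Dict[int, List[dict]]:
    """Group labelled clips by bucket; clips without a label are skipped."""
    datasets: Dict[int, List[dict]] = {limit: [] for limit in BUCKETS}
    for clip in clips:
        bucket = assign_bucket(clip["n_frames"])
        if clip.get("label") is None or bucket is None:
            continue
        datasets[bucket].append(clip)
    return datasets

clips = [
    {"id": "a", "n_frames": 42, "label": "happiness"},
    {"id": "b", "n_frames": 97, "label": None},      # unlabelled, dropped
    {"id": "c", "n_frames": 120, "label": "anger"},
]
datasets = build_datasets(clips)
```
]]></preformat>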
        <p>The data structure prepared for model training is provided in Figure 2.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model Architecture</title>
        <p>The proposed SpatioTemporal Graph Transformer (STGT) is designed for dynamic facial recognition
by capturing both spatial and temporal dependencies in facial landmark sequences.
</p>
        <p>The Landmark Embedding Layer converts raw 3D landmark coordinates into a
higher-dimensional representation. It employs a linear transformation to map the input format of
(B, T, N, 3) into an embedding space of (B, T, N, embed_dim), where B is the batch size, T the number of frames, and N the number of landmarks.</p>
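        <p>The shape change performed by the embedding layer can be illustrated without a deep learning framework. The sketch below uses nested lists and randomly initialised weights; <monospace>make_linear</monospace>, <monospace>embed_landmarks</monospace> and the value of <monospace>EMBED_DIM</monospace> are assumptions made for the example, not the authors' implementation.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random

def make_linear(in_dim: int, out_dim: int, seed: int = 0):
    """Create weights (out_dim x in_dim) and a zero bias for a linear layer."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def embed_landmarks(batch, weight, bias):
    """Apply y = W x + b to every landmark of every frame of every clip."""
    return [
        [
            [
                [sum(w * x for w, x in zip(row, point)) + b for row, b in zip(weight, bias)]
                for point in frame
            ]
            for frame in clip
        ]
        for clip in batch
    ]

B, T, N, EMBED_DIM = 2, 4, 68, 8  # EMBED_DIM is an illustrative value
batch = [[[(0.5, 0.5, 0.0) for _ in range(N)] for _ in range(T)] for _ in range(B)]
weight, bias = make_linear(3, EMBED_DIM)
embedded = embed_landmarks(batch, weight, bias)
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]), len(embedded[0][0][0]))
```
]]></preformat>
        <p>Here <monospace>shape</monospace> is (B, T, N, EMBED_DIM): only the last axis changes, from 3 coordinates to the embedding size.</p>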
        <p>The Spatial Transformer Block applies multi-head self-attention to model the spatial
dependencies across facial landmarks within a single frame. It processes the data in the
format (B, N, T, embed_dim) to allow landmarks (nodes) to attend to one another.
Additionally, it uses Layer Normalization and Feedforward Networks to enhance the
representations.</p>
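        <p>How landmarks attend to one another within one frame can be shown with a single attention head. The following sketch is a simplification: it uses identity query, key and value projections and omits the multiple heads, Layer Normalization and feedforward network that the block is described as using.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from typing import List

Matrix = List[List[float]]

def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(nodes: Matrix):
    """Scaled dot-product attention with identity Q/K/V projections.

    Returns (output, weights); weights[i][j] is how much node i attends to node j.
    """
    dim = len(nodes[0])
    weights = []
    for query in nodes:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in nodes]
        weights.append(softmax(scores))
    output = [
        [sum(w * node[d] for w, node in zip(row, nodes)) for d in range(dim)]
        for row in weights
    ]
    return output, weights

# Three toy landmark embeddings from one frame (embed_dim = 2).
frame = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(frame)
```
]]></preformat>
        <p>The rows of <monospace>attn</monospace> are the per-landmark weights that attention-based explanations later visualise.</p>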
        <p>The Temporal Transformer Block captures sequential dependencies between frames using
stacked Transformer Encoder layers. The input, in the format (B, T, embed_dim), is passed through
these layers to model long-range dependencies. The block also incorporates gradient tracking hooks
to facilitate Grad-CAM-based interpretability.</p>
        <p>In the Classification Head, representations are aggregated using Global Average Pooling over time.
The pooled features are then passed through a fully connected layer for final classification.</p>
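        <p>The two steps of the classification head can be sketched as follows. The feature values, the three-class weight matrix and the bias are made up for the example; the actual model predicts 11 classes from learned weights.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List

def global_average_pool(frames: List[List[float]]) -> List[float]:
    """Average frame-level feature vectors over the time axis."""
    n_frames = len(frames)
    return [sum(column) / n_frames for column in zip(*frames)]

def linear(features, weight, bias):
    """Fully connected layer: one logit per class."""
    return [sum(w * f for w, f in zip(row, features)) + b for row, b in zip(weight, bias)]

frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # T=3, embed_dim=2
pooled = global_average_pool(frames)            # [3.0, 4.0]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # 3 illustrative classes
bias = [0.0, 0.0, 0.5]
logits = linear(pooled, weight, bias)
predicted = max(range(len(logits)), key=logits.__getitem__)
```
]]></preformat>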
      </sec>
      <sec id="sec-3-3">
        <title>3.3. XAI implementation</title>
        <p>In dynamic facial expression recognition (DFER) using Graph Transformer models, explainability is
essential for understanding how the model interprets spatial-temporal facial landmarks to infer
emotions. Several techniques can enhance the interpretability and trustworthiness of the model,
including Attention Attribution, Feature Ablation, Grad-CAM, and Attention Rollout.</p>
        <p>Attention Attribution [19] identifies which facial landmarks contribute
the most to the model's decision by analysing the self-attention weights in the Transformer layers.
Since the Graph Transformer processes 68 facial landmarks as nodes, Attention Attribution can
highlight which regions of the face (e.g., eyebrows, mouth, eyes) are most relevant to specific
expressions over time.</p>
        <p>Feature Ablation systematically removes or masks specific input features (facial landmarks) to
assess their impact on model predictions. This technique can be applied to subsets of facial nodes or
temporal frames to determine whether certain facial regions or time steps are crucial for emotion
classification. For instance, it can help verify whether jaw movements or subtle eye changes are more
significant for detecting emotions like anger or sadness.</p>
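        <p>A minimal sketch of landmark-group ablation follows. It assumes masking by zeroing coordinates and measures the drop in a scalar model score; <monospace>toy_model</monospace> and the landmark groupings are hypothetical stand-ins for the trained classifier and real facial regions.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Sequence

Frame = List[List[float]]  # N landmarks, each [x, y, z]

def ablate(frames: List[Frame], indices: Sequence[int]) -> List[Frame]:
    """Return a copy of the clip with the chosen landmarks zeroed in every frame."""
    masked = set(indices)
    return [
        [[0.0, 0.0, 0.0] if i in masked else list(point) for i, point in enumerate(frame)]
        for frame in frames
    ]

def ablation_scores(model, frames, groups: Dict[str, Sequence[int]]) -> Dict[str, float]:
    """Drop in the model's score when each landmark group is masked."""
    baseline = model(frames)
    return {name: baseline - model(ablate(frames, idx)) for name, idx in groups.items()}

# Toy stand-in for a trained classifier: the score depends only on landmark 0.
def toy_model(frames: List[Frame]) -> float:
    return sum(frame[0][1] for frame in frames) / len(frames)

clip = [[[0.1, 0.8, 0.0], [0.5, 0.5, 0.0], [0.9, 0.2, 0.0]] for _ in range(4)]
groups = {"mouth": [0], "eyes": [1, 2]}  # hypothetical groupings
scores = ablation_scores(toy_model, clip, groups)
```
]]></preformat>
        <p>A larger score drop marks a more influential group. The same function can mask time steps instead of landmarks if the groups index frames.</p>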
        <p>
          Grad-CAM is widely used in CNN-based vision [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] models, but it can also be adapted
to Transformer-based architectures by visualising the importance of different regions in an input
sequence. In a Graph Transformer model for DFER, Grad-CAM can generate heatmaps over
the facial graph nodes, indicating which parts of the face influence the model's classification at
different time steps.
        </p>
        <p>Attention Rollout aggregates attention maps across multiple layers of the Transformer, providing
insights into how low-level local attention in early layers evolves into global contextual attention in
deeper layers. This technique is instrumental in hybrid models combining local graph-based learning
and global Transformer attention, helping to analyse whether the model progressively
captures short-term facial micro-expressions before forming high-level temporal patterns.</p>
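        <p>The aggregation can be sketched as a product of per-layer attention maps. The example assumes head-averaged, row-normalised attention matrices and the common convention of mixing each map equally with the identity to account for residual connections; the two 2x2 maps are invented values.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_residual(attn: Matrix) -> Matrix:
    """Mix in the identity (skip connection); rows still sum to 1."""
    n = len(attn)
    return [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)] for i in range(n)]

def attention_rollout(layers: List[Matrix]) -> Matrix:
    """Multiply residual-adjusted attention maps from the first layer upwards."""
    rollout = add_residual(layers[0])
    for attn in layers[1:]:
        rollout = matmul(add_residual(attn), rollout)
    return rollout

layer1 = [[0.9, 0.1], [0.2, 0.8]]  # early layer: mostly local attention
layer2 = [[0.5, 0.5], [0.5, 0.5]]  # deeper layer: uniform attention
rollout = attention_rollout([layer1, layer2])
```
]]></preformat>
        <p>Entry <monospace>rollout[i][j]</monospace> estimates how much input token j contributes to output token i after both layers.</p>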
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Metrics</title>
        <p>The evaluation of the SpatioTemporal Graph Transformer for dynamic facial expression
recognition is carried out using multiple performance metrics that assess both classification accuracy
and model interpretability. Integrated metrics ensure that the system effectively recognises emotions
and provides insights into which spatial (facial landmarks) and temporal (frame-based changes)
patterns contribute the most to predictions.</p>
        <p>For training stability, we use Cross-Entropy Loss with class weights, as the dataset may
contain an imbalanced distribution of emotions. Class weights are computed dynamically to ensure
balanced learning.</p>
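        <p>The weighted cross-entropy loss used to handle class imbalance was calculated using the formula:
          <disp-formula id="eq-1"><label>(1)</label><tex-math>L = -\sum_{i=1}^{C} w_i \, y_i \log(\hat{y}_i)</tex-math></disp-formula>
where <inline-formula><tex-math>C</tex-math></inline-formula> is the number of emotion classes,
<inline-formula><tex-math>w_i</tex-math></inline-formula> is the weight assigned to class <inline-formula><tex-math>i</tex-math></inline-formula>,
<inline-formula><tex-math>y_i</tex-math></inline-formula> is the ground truth label (one-hot encoded), and
<inline-formula><tex-math>\hat{y}_i</tex-math></inline-formula> is the predicted probability for class <inline-formula><tex-math>i</tex-math></inline-formula>.</p>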
        <p>The loss function is designed to penalise incorrect predictions based on class imbalance, ensuring
that smaller-class emotional patterns are learned effectively.</p>
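        <p>The effect of class weighting can be checked on a toy example. The inverse-frequency scheme in <monospace>class_weights</monospace> is an assumption, since the text only states that weights are computed dynamically; the loss function handles one sample with a one-hot target, for which the sum over classes reduces to a single term.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter
from typing import List

def class_weights(labels: List[int], n_classes: int) -> List[float]:
    """Inverse-frequency weights: total / (n_classes * count)."""
    counts = Counter(labels)
    total = len(labels)
    return [total / (n_classes * counts[c]) if counts[c] else 0.0 for c in range(n_classes)]

def weighted_cross_entropy(probs: List[float], target: int, weights: List[float]) -> float:
    """Loss for one sample with a one-hot target: -w_t * log(p_t)."""
    return -weights[target] * math.log(probs[target])

labels = [0, 0, 0, 1]                 # class 1 is the minority
weights = class_weights(labels, 2)    # class 0 gets 2/3, class 1 gets 2.0
loss_major = weighted_cross_entropy([0.7, 0.3], 0, weights)
loss_minor = weighted_cross_entropy([0.7, 0.3], 1, weights)
```
]]></preformat>
        <p>With these weights an error on the minority class costs more than the same error on the majority class, which is the behaviour the loss is meant to provide.</p>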
        <p>For model performance evaluation, we use training and validation loss as baseline indicators. The
training process logs epoch-wise loss values to monitor the model's convergence. A learning rate
scheduler adjusts the learning rate dynamically when validation loss stagnates, which helps limit
overfitting. We use the confusion matrix to obtain a detailed breakdown of model predictions versus
actual labels. It highlights misclassification patterns, helping refine the feature extraction process.
The matrix is plotted using seaborn heatmaps for visual analysis of prediction distributions.</p>
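        <p>For illustration, a confusion matrix and per-class recall can be computed in a few lines. The labels below are invented and use three classes instead of the eleven in the dataset.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List

def confusion_matrix(y_true: List[int], y_pred: List[int], n_classes: int) -> List[List[int]]:
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
# Recall per class: diagonal entry divided by the row total.
per_class_recall = [row[i] / sum(row) for i, row in enumerate(cm)]
```
]]></preformat>
        <p>Off-diagonal entries show which pairs of expressions are confused. A matrix in this row-by-column form is what a heatmap plot would take as input.</p>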
        <p>
          For classification robustness, we use Grad-CAM [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] for Spatial and Temporal attention analysis.
Grad-CAM is used to visualise which facial landmarks are most influential in classification. We
create two types of heatmaps: spatial Grad-CAM, which highlights key facial features that
contribute to predictions, and temporal Grad-CAM, which identifies critical time frames where
emotion transitions occur. These visualisations provide explainability, making the model's decisions
more interpretable.
        </p>
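        <p>Grad-CAM generates attention heatmaps by computing the gradient of the largest class score
with respect to the feature maps. The Grad-CAM activation for a given location
<inline-formula><tex-math>(i, j)</tex-math></inline-formula> is:
          <disp-formula id="eq-2"><label>(2)</label><tex-math>L^{(c)}(i, j) = \mathrm{ReLU}\left(\sum_{k} \alpha_k^{c} A^{k}(i, j)\right)</tex-math></disp-formula>
where <inline-formula><tex-math>A^{k}(i, j)</tex-math></inline-formula> is the activation map of the
<inline-formula><tex-math>k</tex-math></inline-formula>-th convolutional feature map, and
<inline-formula><tex-math>\alpha_k^{c}</tex-math></inline-formula> is the importance weight computed as:
          <disp-formula id="eq-3"><label>(3)</label><tex-math>\alpha_k^{c} = \frac{1}{Z} \sum_{i, j} \frac{\partial y^{c}}{\partial A^{k}(i, j)}</tex-math></disp-formula>
Here <inline-formula><tex-math>y^{c}</tex-math></inline-formula> is the predicted score for class
<inline-formula><tex-math>c</tex-math></inline-formula>, and
<inline-formula><tex-math>Z</tex-math></inline-formula> is the total number of spatial locations.</p>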
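        <p>The Grad-CAM computation can be sketched on small arrays. The example assumes the standard formulation with a final ReLU and takes the gradients as given inputs; in a real model they would come from automatic differentiation of the class score with respect to the feature maps. All numbers are invented.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List

Map2D = List[List[float]]

def grad_cam(activations: List[Map2D], gradients: List[Map2D]) -> Map2D:
    """Combine K feature maps using gradient-averaged weights, then apply ReLU."""
    rows, cols = len(activations[0]), len(activations[0][0])
    z = rows * cols  # number of spatial locations
    # alpha_k: mean gradient of the class score over all locations of map k
    alphas = [sum(sum(r) for r in grad) / z for grad in gradients]
    return [
        [max(0.0, sum(a * act[i][j] for a, act in zip(alphas, activations))) for j in range(cols)]
        for i in range(rows)
    ]

# Two 2x2 feature maps and the matching gradients of the class score.
activations = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
gradients = [[[0.5, 0.5], [0.5, 0.5]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(activations, gradients)
```
]]></preformat>
        <p>Locations where the weighted sum is negative are clipped to zero, so the heatmap keeps only evidence in favour of the class.</p>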
        <p>For temporal Grad-CAM, this process is extended over time by computing gradients across
sequential frames.</p>
        <p>For model selection and optimisation, we keep checkpoints with the best validation loss. The
system automatically saves the best model based on validation loss. A model is only saved if it
improves validation loss and maintains a reasonable training loss (&gt;0.4) to prevent premature
convergence.</p>
        <p>By integrating classification metrics, loss analysis, confusion matrix visualisation and
interpretability techniques, we ensure a comprehensive evaluation of the hybrid SpatioTemporal
Graph Transformer and Llasa speech synthesis system. This combination ensures not
only accuracy but also the trustworthiness and interpretability of the system, making it suitable for
real-world affective computing applications.</p>
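        <p>The checkpoint rule described in this section can be expressed as a small policy object. This is a sketch of the stated rule only: the class name <monospace>CheckpointPolicy</monospace> and the loss history are illustrative, and the actual saving of model weights is left out.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Optional

class CheckpointPolicy:
    """Decide when to save, following the rule described in the text."""

    def __init__(self, min_train_loss: float = 0.4) -> None:
        self.min_train_loss = min_train_loss
        self.best_val_loss: Optional[float] = None

    def should_save(self, train_loss: float, val_loss: float) -> bool:
        # Save only if validation loss improved AND training loss is above the floor.
        improved = self.best_val_loss is None or val_loss < self.best_val_loss
        plausible = train_loss > self.min_train_loss
        if improved and plausible:
            self.best_val_loss = val_loss
            return True
        return False

policy = CheckpointPolicy()
history = [(1.2, 1.5), (0.9, 1.3), (0.8, 1.4), (0.3, 1.1)]  # (train, val) per epoch
saved = [policy.should_save(train, val) for train, val in history]
```
]]></preformat>
        <p>In this history the last epoch has the lowest validation loss but is not saved, because its training loss has fallen below the 0.4 floor.</p>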
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Integrating Model to the Hybrid Intellectual System</title>
        <p>
          To build a Hybrid Intellectual System, we will integrate the SpatioTemporal Graph Transformer
(STGT) at the perception front end alongside a speech-to-text transformer that supplies live
transcripts. The resulting affect stream and transcripts are fed to the dialogue-reasoning layer, which
drives the Llama-based Llasa speech synthesis model [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Llasa relies on language processing [9] to
ensure natural and contextually appropriate emotional output. Using self-attention mechanisms,
the STGT assigns importance to key patterns, identifies expressions (anger, happiness, sadness,
surprise, fear, etc.), and pushes small messages containing the emotion label and its confidence to the dialogue
manager, which updates 20 times per second. The manager stores the latest label in its state slot for
user emotion and uses it when prompting the underlying language model to produce an empathic
response. For instance, if the user's emotion is Sad, it asks the language model for a gentle, caring
response. For Angry, it switches to calm and solution-focused wording.
        </p>
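        <p>The message flow from the perception front end to the language model can be sketched as below. Only the Sad and Angry behaviours come from the text; the class <monospace>DialogueManager</monospace>, the confidence threshold, the default style and the prompt wording are hypothetical choices made for the example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Optional

# Hypothetical style instructions; only the Sad and Angry cases come from the text.
STYLE_BY_EMOTION = {
    "sad": "Respond in a gentle, caring tone.",
    "angry": "Respond calmly and focus on solutions.",
}
DEFAULT_STYLE = "Respond in a neutral, friendly tone."

class DialogueManager:
    def __init__(self, min_confidence: float = 0.5) -> None:
        self.min_confidence = min_confidence  # illustrative threshold
        self.user_emotion: Optional[str] = None

    def update(self, label: str, confidence: float) -> None:
        """Store the latest label; low-confidence messages are ignored."""
        if confidence >= self.min_confidence:
            self.user_emotion = label.lower()

    def build_prompt(self, transcript: str) -> str:
        """Prefix the user's words with a style instruction for the language model."""
        style = STYLE_BY_EMOTION.get(self.user_emotion, DEFAULT_STYLE)
        return f"{style}\nUser said: {transcript}"

manager = DialogueManager()
manager.update("Sad", 0.91)
manager.update("Angry", 0.20)  # ignored: below threshold
prompt = manager.build_prompt("I failed my exam.")
```
]]></preformat>
        <p>Keeping the emotion in a single state slot means a stream of 20 updates per second only changes the prompt when a sufficiently confident new label arrives.</p>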
        <p>The chatbot will be fine-tuned on videos containing real-world dialogue samples between two
people, from which it will learn behaviour patterns, and then refined with human feedback so that
its response style addresses the user's emotion rather than mirroring it.</p>
        <p>The finished sentence, along with its matching emotion label, is mapped to linguistic prosody [22]
and translated into prosodic attributes such as pitch variation, speech rate and pause insertions.</p>
        <p>Llasa incorporates text transformers (Llama-based architecture) to interpret the semantic
meaning behind the generated or predefined speech content [21]. The speech tokeniser ensures that
the emotion detected in the dialogue manager message aligns with the intonation and rhythm of the
synthesised speech.</p>
        <p>Each emotion-labelled sentence undergoes a transformation to match the phonetic and prosodic
attributes of expressive speech. For example, for angry expressions, the speech sounds sharp and has
increased volume.</p>
        <p>Llasa employs vector quantisation (VQ) codecs to convert text-based tokens into expressive
speech waveforms. Using Process Reward Models (PRMs), the system interactively refines speech
synthesis by adjusting articulation, tone and rhythm to align with both visual emotional cues and
linguistic context [10]. The final speech is temporally synchronised with an emotion label
corresponding to the user's facial expressions, ensuring that spoken words and tone appear together
naturally.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and results</title>
      <sec id="sec-4-1">
        <title>4.1. Model definition and training</title>
        <p>The original data was merged into a tabular format describing each frame of each video. An embedding layer
transformed the landmarks into higher-dimensional vectors. Facial landmarks were treated as
nodes in a graph for each frame, with multi-head attention modelling spatial dependencies.
The temporal block processed frame-level embeddings, capturing temporal relationships across frames.
The classifier head aggregated the features to predict one of 11 expression categories. Initially,
the model was trained on the dataset of sequences up to 50 frames long to establish baseline accuracy. Training then
continued on sequences of up to 100 frames to improve the model's ability to handle longer
temporal contexts. Finally, the dataset containing the longest sequences, of up to 150 frames, was used
to further optimise the handling of extended temporal contexts. The model's
output metrics are provided in Table 1. Memory-efficient training methods and computational
optimisations were applied to ensure efficiency and enhance performance.
The accuracy of the trained model is shown in Figure 3.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. XAI methods implementation</title>
        <p>To highlight the aspects that the model uses in its decision-making process, we implemented XAI
methods for a more precise analysis. Attention attribution was implemented by extracting attention
weights directly from the model's Multi-Head Attention layers. Within the Spatial Transformer
Block, we obtained attention weights for each frame, revealing how much importance the model
assigned to each facial landmark during decision-making. Similarly, in the Temporal Transformer
Block, attention weights were extracted across frames to identify key temporal segments influencing
the classification outcomes. The attention weights are shown as heatmaps of facial landmarks, clearly
indicating which points are most important for the accurate classification of facial expressions. It
provides a clearer understanding of the decision-making process in the model.</p>
        <p>A systematic feature ablation strategy was applied to test the significance of features. Specific
subsets (subgroups) of facial landmarks were gradually removed, and the resulting change in
classification accuracy was observed. This study allowed us to determine which facial features
contributed the most to the model's decision-making process. By quantifying these effects, we gain
confidence in the interpretability and robustness of the model.</p>
        <p>Grad-CAM was integrated to further enhance interpretability. Grad-CAM enabled visualisation
of the gradients flowing into the last convolutional layer (adapted to our transformer-based
approach), highlighting specific facial landmarks and regions significantly influencing classification
outcomes, as shown in Figure 4. It allowed us to visually verify the robustness of the model's
attention mechanisms and to confirm the correct identification of critical facial areas linked to
specific emotional expressions.</p>
        <p>The temporal Grad-CAM implemented in the project visualises how the model's
attention changes across the frames of a sequence. Figure 5 shows that the model at epoch 10 has
diffuse, unfocused attention, whereas the model at epoch 60 concentrates high attention over longer
contiguous periods.</p>
        <p>Layer-wise Relevance Propagation [18] was used extensively to quantify and visualise the
contributions of individual landmarks and specific frames toward classification decisions. By
propagating relevance scores from the output back to the input features, LRP [17] enabled a detailed
analysis of each landmark's influence across time, facilitating the understanding of how temporal
dynamics affect the model's predictions.</p>
        <p>Taken together, these complementary XAI techniques offer a triangulated view of the model's
decision-making process. Insights gleaned from the heatmaps and relevance scores feed directly into
an iterative training loop, guiding hyper-parameter tuning and data-augmentation choices that
further sharpen both accuracy and transparency.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and future research</title>
      <p>The results of this study demonstrate the effectiveness of the SpatioTemporal Graph Transformer
(STGT) in improving the explainability of Dynamic Facial Expression Recognition (DFER) while
maintaining high classification accuracy. By combining graph-based facial landmark processing with
Transformer-based temporal modelling, the system effectively captures both spatial dependencies and
temporal changes in facial expressions. In addition, the integration of explainability methods,
such as Grad-CAM, attention attribution, and feature ablation, provides essential
insights into the model's decision-making process, increasing interpretability and confidence in
AI-based face recognition systems.</p>
      <p>We plan to experiment with YOLO [11] for multi-face recognition and to compare its
capability with MediaPipe in dynamic facial expression recognition. MediaPipe offers lightweight
landmark detection in real time, while YOLO [12] object detection may provide higher precision in
challenging conditions. This comparison will help determine which method works best and whether
integrating both would be beneficial.</p>
      <p>Recent advancements in transformer-based architectures, such as ViViT (Video Vision
Transformer), have shown that self-attention mechanisms can significantly improve performance by
modelling long-range dependencies more effectively than CNN-RNN hybrids. ViViT segments video
frames into patch embeddings and processes both spatial and temporal information cohesively,
removing the necessity for recurrent structures. However, ViViT does not explicitly encode the
structural relationships between facial landmarks, which limits its interpretability in dynamic facial
expression recognition (DFER) tasks.</p>
      <p>A key strength of the proposed method is its ability to highlight the most influential facial landmarks
and keyframes in video sequences, so that classification is not purely
a "black box". Increased interpretability is extremely important in healthcare, security, and
human-computer interaction applications, where understanding how and why a model makes a
specific prediction is as important as its accuracy. Moreover, using attention-based explanations
allows the model to be adjusted for real-world variations such as lighting conditions,
occlusions and head position.</p>
      <p>Despite these advancements, several challenges remain. We will move from independent modules
to a unified STGT-ASR perception stack feeding a dialogue manager that, in turn, drives Llasa TTS.
Incorporating Llasa would enable multimodal emotional AI in the conversational system,
where detected facial expressions dynamically shape emotionally expressive speech output. Future
work will focus on implementing real-time synchronisation between facial expression
recognition and speech generation [26], ensuring that spoken emotions match detected facial cues.</p>
      <p>Additionally, current Grad-CAM-based explainability techniques primarily focus on spatial
attention rather than temporal dependencies in facial expressions. Future improvements should
explore temporal Grad-CAM visualisations to better understand how the model tracks emotion
transitions over time, while language-side XAI via token-level attention rollouts will expose why
the bot chooses a particular expression in response.</p>
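      <p>As a rough illustration of the attention rollout idea mentioned above, the sketch below follows the commonly used formulation in which each layer's head-averaged attention map is mixed with the identity matrix (to account for the residual connection) and the mixed maps are multiplied from the first layer to the last. The function names and the toy three-token attention maps are hypothetical; this is a minimal sketch under those assumptions, not the implementation used in this study.</p>
      <preformat>
```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def attention_rollout(layers):
    """Combine per-layer attention maps into one token-to-token map.

    Each map is assumed to be head-averaged and row-normalised.
    Every layer contributes 0.5 * A + 0.5 * I, and the contributions
    are multiplied in layer order (later layers on the left).
    """
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0)
                  for j in range(n)] for i in range(n)]
        rollout = matmul(mixed, rollout)
    return rollout


# Toy attention maps for two layers over three tokens (made-up values).
layers = [
    [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]],
    [[0.5, 0.25, 0.25], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]],
]

scores = attention_rollout(layers)
for row in scores:
    # Row i shows how much output token i draws on each input token.
    print([round(v, 3) for v in row])
```
      </preformat>
      <p>Because every mixed map remains row-normalised, each row of the result still sums to one and can be read as a relevance distribution over the input tokens.</p>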
      <p>Another promising direction is to increase the diversity of the training data. To
advance beyond clip-level emotion tagging, future work should train the pipeline on long-form
dialogue video datasets that align face video, raw speech and full transcripts across multi-sentence
turns. Resources such as MELD, IEMOCAP, CMU-MOSEI/MOSI and SEWA supply precisely this
alignment, allowing STGT to model evolving facial affect, ASR to capture prosodic cues, and the
dialogue manager to learn how emotions ebb and flow throughout an extended exchange. Leveraging
such material will equip the agent to maintain affective context over multiple turns and to generate
responses that are not merely reactive but emotionally coherent within the broader conversation.</p>
      <p>Facial expressions and their dynamic patterns can differ across age groups, ethnicities and social
contexts. The model should be trained on a broader range of samples to ensure robustness
and more reliable classification. For real-world applications, potential
biases must be minimised before deployment. Future enhancements of the model could also explore
lightweight transformer architectures or efficient attention mechanisms, such as sparse attention or
low-rank adaptations, to reduce processing overhead while maintaining accuracy.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>This study introduces a comprehensive solution for Dynamic Facial Expression Recognition (DFER)
by combining Spatio-Temporal Graph Transformer (STGT) methods with Explainable AI (XAI)
techniques and Llama-based Llasa speech synthesis. The framework captures spatial relationships
among facial landmarks and the temporal dynamics of facial movements, achieving accurate and
interpretable emotion classification.</p>
      <p>To enhance model interpretability, the research utilises advanced XAI methods such as Grad-CAM,
Attention Attribution, and Feature Ablation. These techniques allow for clear visualisation of the
facial features and temporal segments that influence the model's decisions, addressing the challenge
of interpretability in AI-driven facial recognition systems and providing valuable insights for users
and developers.</p>
      <p>Moreover, the integration of NLP-driven speech synthesis via Llasa TTS substantially enriches
the system's multimodal interaction capabilities. Recognised emotions are converted into matching
prosody, so the agent voices a response that fits the user's affect through natural and expressive
speech output. Such multimodal
integration is particularly impactful for advancing applications within human-computer interaction,
affective computing, accessibility tools, and other interactive AI-driven environments.</p>
      <p>The study demonstrates an event-driven pipeline that combines robust visual emotion
recognition with expressive speech synthesis: STGT for facial affect, a transformer ASR for live
transcripts, a dialogue manager that fuses the two, and Llasa TTS for prosody-controlled speech.
It also identifies specific areas requiring further enhancement. In particular, achieving
real-time synchronisation between the user's expressive input and the corresponding voice-modulated output
remains challenging due to computational constraints and latency issues.</p>
      <p>Research shows that transformer-based architectures surpass traditional CNNs and LSTMs in
DFER tasks. However, challenges remain with dataset biases, occlusion handling, and head pose
variations. Future research should focus on optimising real-time model performance, refining XAI
methodologies for better temporal explainability, and diversifying training datasets to ensure fair
performance across demographic groups.</p>
      <p>In conclusion, this study contributes to emotionally responsive AI by enhancing both model
accuracy and interpretability. The proposed hybrid system establishes a robust and adaptable
foundation for future exploration and developments in multimodal emotion recognition systems,
intelligent interactive agents, and a range of practical affective computing applications.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-8">
      <title>8. References</title>
      <p>[9] K. Smelyakov, D. Karachevtsev, D. Kulemza, Y. Samoilenko, O. Patlan and A. Chupryna,
"Effectiveness of Preprocessing Algorithms for Natural Language Processing Applications,"
2020 IEEE International Conference on Problems of Infocommunications. Science and
Technology (PIC S&amp;T), Kharkiv, Ukraine, 2020, pp. 187-191, doi:
10.1109/PICST51311.2020.9467919.
[10] Smelyakov, K., Chupryna, A., Darahan, D., Midina, S. Effectiveness of modern text recognition
solutions and tools for common data sources, CEUR Workshop Proceedings, 2021, 2870, pp. 154–
165, https://ceur-ws.org/Vol-2870/
[11] R. P. Narwaria, A. Ahirwar, A. K. Prajapati, A. Kumar and A. K. Tiwari, "Smart Object Detection
Using ESP32-CAM Based on YOLO Algorithm," 2024 Second International Conference on
Intelligent Cyber Physical Systems and Internet of Things (ICoICI), Coimbatore, India, 2024, pp.
817-820, doi: 10.1109/ICoICI62503.2024.10696374.
[12] Velychko, D., et al.: Image Preprocessing and YOLO Architectures for Enhanced Small and
Slow-Moving Object Detection. In: 2024 IEEE Western New York Image and Signal Processing
Workshop (WNYISPW), Rochester, NY, USA, 8 Nov 2024, pp. 1–4. IEEE
(2024). https://doi.org/10.1109/wnyispw63690.2024.10786503.
[13] Song, T., et al.: MPED: A Multimodal Physiological Emotion Database for Discrete Emotion
Recognition. IEEE Access. 7, 12177–12191 (2019). https://doi.org/10.1109/access.2019.2891579.
[14] Zhao, Q., et al.: Density Division Face Clustering Based on Graph Convolutional Networks. In:
2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–
25 Aug 2022. IEEE (2022). https://doi.org/10.1109/icpr56361.2022.9956670.
[15] Tian, Y., Li, M., Wang, D.: DFER-Net: Recognising Facial Expression In The Wild. In: 2021 IEEE
International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 Sep 2021.
IEEE (2021). https://doi.org/10.1109/icip42928.2021.9506770.
[16] Chumachenko, K., Iosifidis, A., Gabbouj, M.: MMA-DFER: MultiModal Adaptation of unimodal
models for Dynamic Facial Expression Recognition in-the-wild. In: 2024 IEEE/CVF Conference
on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–
18 June 2024, pp. 4673–4682. IEEE (2024). https://doi.org/10.1109/cvprw63382.2024.00470.
[17] Bhati, D., et al.: Neural Network Interpretability with Layer-Wise Relevance Propagation: Novel
Techniques for Neuron Selection and Visualization. In: 2025 IEEE 15th Annual Computing and
Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 6–8 Jan 2025,
pp. 00441–00447. IEEE (2025). https://doi.org/10.1109/ccwc62904.2025.10903721.
[18] Jung, Y.-J., Han, S.-H., Choi, H.-J.: Explaining CNN and RNN Using Selective Layer-Wise
Relevance Propagation. IEEE Access. 9, 18670–18681 (2021).
https://doi.org/10.1109/access.2021.3051171.
[19] Lei, S., et al.: Watch the Speakers: A Hybrid Continuous Attribution Network for Emotion
Recognition in Conversation With Emotion Disentanglement. In: 2023 IEEE 35th International
Conference on Tools with Artificial Intelligence (ICTAI), Atlanta, GA, USA, 6–8 Nov 2023. IEEE
(2023). https://doi.org/10.1109/ictai59109.2023.00133.
[20] Li, H., Miao, S., Feng, R.: DG-FPN: Learning Dynamic Feature Fusion Based on Graph
Convolution Network For Object Detection. In: 2020 IEEE International Conference on
Multimedia and Expo (ICME), London, United Kingdom, 6–10 July 2020. IEEE
(2020). https://doi.org/10.1109/icme46284.2020.9102838.
[21] Ahn, Y., Chae, J., Shin, J.W.: Text-to-Speech With Lip Synchronization Based on
Speech-Assisted Text-to-Video Alignment and Masked Unit Prediction. IEEE Signal Processing Letters.
1–5 (2025). https://doi.org/10.1109/lsp.2025.3537949.
[22] Fu, Z., et al.: Emotion recognition based on multimodal physiological signals and transfer
learning. Frontiers in Neuroscience. 16 (2022). https://doi.org/10.3389/fnins.2022.1000716.
[23] Sun, G., Lian, Z.: Deepfake Video Detection Based on the Decomposition of Spatial-Temporal
Attention Mechanism in ViViT. In: 2024 IEEE International Symposium on Parallel and
Distributed Processing with Applications (ISPA), Kaifeng, China, 30 Oct–2 Nov 2024, pp. 1629–
1634. IEEE (2024). https://doi.org/10.1109/ispa63168.2024.00221.
[24] Oveis, A.H., et al.: Explainability In Hyperspectral Image Classification: A Study of Xai Through
the Shap Algorithm. In: 2023 13th Workshop on Hyperspectral Imaging and Signal Processing:
Evolution in Remote Sensing (WHISPERS), Athens, Greece, 31 Oct–2 Nov 2023. IEEE
(2023). https://doi.org/10.1109/whispers61460.2023.10430776.
[25] Zheng, H., et al.: A separable spatial-temporal graph learning approach for skeleton-based
action recognition. IEEE Sensors Letters. 1–4 (2024). https://doi.org/10.1109/lsens.2024.3475515.
[26] S. Das and D. Das, "Natural Language Processing (NLP) Techniques: Usability in
Human-Computer Interactions," 2024 6th International Conference on Natural Language Processing
(ICNLP), Xi'an, China, 2024, pp. 783-787, doi: 10.1109/ICNLP60986.2024.10692776.
[27] J. Francis and M. Subha, "An Overview of Natural Language Processing (NLP) in Healthcare:
Implications for English Language Teaching," 2024 8th International Conference on I-SMAC (IoT
in Social, Mobile, Analytics and Cloud) (I-SMAC), Kirtipur, Nepal, 2024, pp. 824-827, doi:
10.1109/I-SMAC61858.2024.10714890.
[28] V. Vysotska, N. Sharonova, A. Chupryna, M. Shirokopetleva, O. Dolhanenko, S. Smelyakov,
Research of methods for image sharpness evaluation in photos of people, CEUR Workshop
Proceedings, Vol-3664, 2024, pp. 255–272.
[29] Q. Zhang, Z. Wang, D. Zhang, W. Niu, S. Caldwell, T. Gedeon, Y. Liu, Z. Qin, "Visual Prompting
in LLMs for Enhancing Emotion Recognition," Proceedings of the 2024 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pp. 4484-4499, doi:
10.18653/v1/2024.emnlp-main.257</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Attention is All you Need</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          . (
          <year>2017</year>
          ). https://doi.org/10.48550/arXiv.1706.03762
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Zerui</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Yan</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>STAA: Spatio-Temporal Attention Attribution for Real-Time Interpreting Transformer-based Video Models</article-title>
          . (
          <year>2024</year>
          ) URL: https://www.semanticscholar.org/reader/6bfa663955410c4f59d5b9bbdd29f8c36670c463
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Yuanyuan</given-names>
            <surname>Liu</surname>
          </string-name>
          , Wei Dai, Chuanxu Feng, Wenbin Wang, Guanghao Yin, Jiabei Zeng, and
          <string-name>
            <given-names>Shiguang</given-names>
            <surname>Shan</surname>
          </string-name>
          .
          <year>2022</year>
          .
          <article-title>MAFW: A Large-scale, Multimodal, Compound Affective Database for Dynamic Facial Expression Recognition in the Wild</article-title>
          .
          <source>In Proceedings of the 30th ACM International Conference on Multimedia (MM '22)</source>
          , October 10-14,
          <year>2022</year>
          , Lisboa, Portugal. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3503161.3548190 - https://mafwdatabase.github.io/MAFW/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>Zhen</given-names>
          </string-name>
          , Xinfa Zhu, Chi-min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi Dai, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yi-Ting Guo and Wei Xue.
          <article-title>Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis</article-title>
          . (
          <year>2025</year>
          ). Doi: https://doi.org/10.48550/arXiv.2502.04128
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Sánchez-Brizuela</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , et al.:
          <article-title>Lightweight real-time hand segmentation leveraging MediaPipe landmark detection</article-title>
          .
          <source>Virtual Reality</source>
          . (
          <year>2023</year>
          ). https://doi.org/10.1007/s10055-023-00858-0.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Byzkrovnyi</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savulioniene</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smelyakov</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakalys</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chupryna</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Comparison of Potential Road Accident Detection Algorithms for Modern Machine Vision System</article-title>
          ,
          <source>Vide. Tehnologija. Resursi - Environment, Technology, Resources</source>
          ,
          <year>2023</year>
          , 3, pp.
          <fpage>50</fpage>
          -
          <lpage>55</lpage>
          . doi: https://doi.org/10.17770/etr2023vol3.7299
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Grad-CAM: Understanding AI Models</article-title>
          .
          <source>Computers, Materials &amp; Continua</source>
          .
          <volume>76</volume>
          (
          <issue>2</issue>
          ),
          <fpage>1321</fpage>
          -
          <lpage>1324</lpage>
          (
          <year>2023</year>
          ). Doi: https://doi.org/10.32604/cmc.2023.041419.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Attention-Based Multimodal Multi-View Fusion Approach for Driver Facial Expression Recognition</article-title>
          .
          <source>IEEE Access</source>
          .
          <volume>1</volume>
          (
          <year>2024</year>
          ). Doi: https://doi.org/10.1109/access.2024.3462352
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>