<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Barbora Ľapinová</string-name>
          <email>barbora.lapinova@student.upjs.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ľubomír Antoni</string-name>
          <email>lubomir.antoni@upjs.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Šimon Horvát</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ITAT'25: Information technologies-Applications and Theory</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computer Science, Faculty of Science, Pavol Jozef Šafárik University in Košice</institution>
          ,
          <addr-line>Jesenná 5, 040 01 Košice</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Individuals with hearing or speech impairments rely on sign language as their main form of communication, yet a communication barrier between this community and the rest of the population persists. Recognizing the role of AI and deep learning in aiding communication for the deaf and hard of hearing, this paper investigates deep learning methods for isolated sign language recognition using a complex video dataset. A Convolutional Neural Network (CNN) is employed to classify signs, and a thorough analysis of the model's performance is conducted to uncover common misclassification patterns and particularly challenging sign categories. This research outlines future work integrating hand pose data to potentially enhance model robustness and accuracy. The approach presented in this paper aims to improve sign language recognition systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Sign language</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Video classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>This paper aims to contribute to the growing field of sign language recognition (SLR) by designing and evaluating a deep
learning-based approach for isolated sign language recognition using a complex, diverse video dataset. We
propose a custom Convolutional Neural Network (CNN) model and assess its performance through
experiments on different subsets of this challenging dataset. A comprehensive error analysis is conducted
using confusion matrices, identifying the worst-classified categories and the impact of label duplication
on model learning. Finally, we outline future research directions, with a focus on integrating additional
features such as hand skeleton tracking to capture fine-grained spatial information and enhance
recognition accuracy.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Sign Language Structure and Characteristics</title>
        <p>
          Sign languages are fully natural, richly expressive languages conveyed via the bodily‑visual modality.
They consist of manual elements—hand shape, orientation, movement, and location—and non‑manual
components—facial expressions, head movement, and body posture—which together form a sign and
represent a gloss. Glosses serve as proxies for signs in annotation, yet lack a universal written form and
vary widely across languages (e.g., ASL, BSL, CSL, GSL), complicating computational modeling [
          <xref ref-type="bibr" rid="ref2 ref8 ref9">2, 8, 9</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Applications of AI in Sign Language Processing</title>
        <p>
          Artificial Intelligence (AI) research in sign language has evolved into three interrelated domains:
• Sign Language Recognition (SLR): identifying signs from video or sensor input [
          <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
          ].
• Sign Language Translation (SLT): mapping from sign inputs to grammatically correct spoken
or written output [
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ].
        </p>
        <p>
          • Sign Language Generation (SLG): synthesizing sign output (e.g., via avatars) [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ].
        </p>
        <p>
          This paper focuses on isolated SLR, where each video depicts a single sign. Many studies emphasize
combining computer vision, gesture recognition, and linguistic insights, though genuine datasets
capturing varied lighting, signer appearance, and real-world contexts remain limited [
          <xref ref-type="bibr" rid="ref16 ref8">16, 8</xref>
          ]. Solving
this requires interdisciplinary collaboration and richer, real-world data sources.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Deep Learning for Video Classification</title>
        <p>Deep learning has become foundational in SLR due to its capacity to model complex visual and temporal
patterns. Key architectures include:
Feed-forward Neural Networks (FNNs), which are well-suited for static data, but fall short in
modelling spatial or temporal dependencies.</p>
        <p>
          Convolutional Neural Networks (CNNs), that exploit spatial hierarchies via convolution and
pooling, and models such as VGG, ResNet, and MobileNet are popular backbones [
          <xref ref-type="bibr" rid="ref16">17, 16</xref>
          ].
        </p>
        <p>
          For video, these models have evolved into:
• 2D CNNs applied per-frame with temporal pooling or fed into sequence models like LSTMs;
• 3D CNNs, performing spatiotemporal convolutions across frames;
• Hybrid CNN–RNN architectures, e.g., CNN + attention‑based LSTM, which have achieved
accuracies of over 84% on the challenging WLASL [18] dataset [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ];
• (2+1)D CNNs, mixing spatial and temporal convolutions, sometimes in novel fused
architectures [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>Recent advances include transformer models like SHuBERT for self‑supervised representation
learning in ASL [19], and hybrid CNN–Transformer networks for isolated Chinese SLR [20]. Techniques
emphasizing spatial‑temporal trajectory awareness, such as CorrNet+, have shown state-of-the-art
performance in continuous recognition and translation tasks [21].</p>
        <p>These developments illustrate the rapid progress in applying modern deep architectures to SLR tasks,
motivating our work with deep learning-based approaches.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>As mentioned earlier, this paper addresses the task of isolated sign language recognition, which can be
seen as a machine learning classification problem: the input is a video in which the
sign is performed, and the target output is the corresponding gloss.</p>
      <p>Our methodology began with identifying and selecting a suitable video dataset for this task, followed
by its preprocessing to prepare the data for model training. Next, we designed and implemented a
neural network architecture specifically for this task. Due to the heuristic nature of neural network
design, this process, along with its training and testing, was performed iteratively across several cycles
to optimize performance.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset Selection and Preprocessing</title>
        <p>Choosing an appropriate dataset is one of the most crucial steps in solving any machine learning
task. The most common form of input data in SLR is a video, which is represented by a sequence of
images—video frames. A number of research groups, especially in the past, have used data obtained
using various sensors. An example of such a sensor can be the data glove [22, 23], but nowadays, data
in the form of a video that can be easily captured on a smartphone is far more practical for real-time
SLR systems.</p>
        <p>Datasets for isolated and continuous sign language recognition are not as abundant as, for example,
image datasets for sign language alphabets, especially if we emphasize a sufficient number of glosses in
the lexicon of the dataset, an adequate number of video samples in the dataset for these glosses, or the
diversity of the individual videos.</p>
        <p>
          In our work, we chose the Greek Sign Language (GSL) dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. This dataset contains 40826 videos
depicting signs corresponding to 310 unique glosses. However, one of the classes in this dataset contains
only one video sample and was excluded from further processing—the resulting number of videos and
classes was 40825 and 309, respectively. Seven signers are featured in the videos, and the individual
signs may be considered common in the communication of sign language users in healthcare or
public administration.
        </p>
        <p>Although the average number of video samples per class is approximately 132—suggesting a seemingly
adequate dataset size—the distribution of samples across classes reveals significant imbalance. A more
detailed analysis yielded the following insights:
• The median and 25th percentile are both 35, indicating that at least 50% of the classes have 35
or fewer samples, which is far below the average.
• The 75th percentile is 105, meaning only 25% of the classes have more than 105 samples.
• The minimum number of samples in a class is 15, while the maximum is 2693, demonstrating
a severe skew in the class distribution.</p>
        <sec id="sec-3-1-1">
          <title>For clarity, these key statistics are summarized in the following table:</title>
          <p>Due to the range of possible values in the number of videos per class which indicates an imbalanced
dataset, the average is somewhat skewed, and more accurate information about the representation of
the number of video samples is provided by the median, the 25% quantile and the 75% quantile.</p>
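          <p>The statistics above can be recomputed directly from the per-class label counts. The following is a
minimal sketch in Python, assuming a list labels that holds one gloss label per video (the variable name
is illustrative):</p>
          <preformat>
# Per-class sample statistics for the GSL dataset (sketch).
# Assumes `labels` holds one gloss label per video.
from collections import Counter
import numpy as np

counts = np.array(sorted(Counter(labels).values()))
print("min / max:", counts.min(), counts.max())                  # 15 and 2693
print("Q1 / median / Q3:", np.percentile(counts, [25, 50, 75]))  # 35, 35, 105
print("mean:", round(counts.mean()))                             # approx. 132
          </preformat>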
          <p>The individual videos in this dataset can be found in 525 folders, with the name of each folder
specifying one of the areas covered by the signs (those that may occur in communication at the police
station, in healthcare, etc.), which sign language performer is featured in these videos, and also the
repetition order of the recording of the particular sign by the respective sign language performer, as each
of the signs is recorded multiple times for each sign language performer. An example of a folder name
might be health1_signer1_rep1_glosses—this folder contains a portion of the healthcare signs, the
videos feature the first signer, and it is the first repetition of the recordings of the given signs. In each
of these folders there are several other folders—one for each video.</p>
          <p>In the folder for one video, there are images in .jpg format, which represent the video already
divided into video frames, discarding those frames where the sign performer was idle (e.g., the beginning
and the end of the video), since the authors of this dataset have performed part of the preprocessing
(splitting the video into video frames, selecting appropriate video frames).</p>
          <p>Our preprocessing of each video involved changing its height and width to 240 × 320, and 5 equidistant
video frames were selected to represent each video to be fed into the model (e.g., for a video with 80
video frames, the frames at indices 0, 20, 40, 60, 80 were selected). The dataset contains 271 videos with
a frame count of less than five, which were excluded from further classification, resulting in 40554 final
samples. An example of a preprocessed input sequence can be seen in Figure 1.</p>
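          <p>To illustrate this step, the following minimal sketch selects five equidistant frames from a folder of
extracted .jpg frames and resizes them; OpenCV is an assumed choice of library, and the exact index
computation used in the original pipeline may differ slightly from the one shown here:</p>
          <preformat>
# Sketch of the preprocessing step: pick 5 equidistant frames, resize to 240 x 320.
# OpenCV (cv2) is an assumed dependency; the paper's own index arithmetic may differ.
import os
import cv2
import numpy as np

def load_clip(video_dir, num_frames=5, size=(320, 240)):
    frame_files = sorted(os.listdir(video_dir))
    if num_frames > len(frame_files):
        raise ValueError("too few frames")    # 271 such videos were excluded
    # Evenly spaced indices from the first to the last available frame
    indices = np.linspace(0, len(frame_files) - 1, num_frames).astype(int)
    frames = []
    for i in indices:
        img = cv2.imread(os.path.join(video_dir, frame_files[i]))
        frames.append(cv2.resize(img, size))  # cv2.resize expects (width, height)
    return np.stack(frames)                   # shape: (5, 240, 320, 3)
          </preformat>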
          <p>In this dataset, in addition to the videos themselves, there are also .csv files containing information
about which gloss, i.e. class, corresponds to each video, or information about the bounding-box
coordinates.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. CNN Model Architecture</title>
        <p>Our proposed model is a 3D CNN designed to classify sign language video sequences into 309 distinct
classes. The network accepts a video input, represented by a tensor of shape (5 × 240 × 320 × 3),
corresponding to five equidistant video frames with a spatial resolution of 240 × 320 pixels and three
color channels.</p>
        <p>The architecture begins with a 3D convolutional layer with 16 filters, each of size (3 × 3 × 3), with a
ReLU activation function. Next is a 3D max-pooling layer with a pooling window size of (1 × 2 × 2),
which performs spatial downsampling while preserving the temporal dimension. Then, a second 3D
convolutional layer with 32 filters of size (3 × 3 × 3) is applied, again with ReLU activation. This is
followed by another 3D max-pooling layer with the same pooling dimensions.</p>
        <p>The resulting feature maps are flattened into a one-dimensional vector of length 144768 and passed
through a series of fully connected layers. First is a dense layer with 10000 neurons, followed by
two additional layers with 500 and 128 units. Each layer uses ReLU activation.</p>
        <p>Finally, the network outputs a probability distribution over the 309 target classes via a dense layer with
309 neurons and softmax activation. The model is trained with a categorical cross-entropy loss function
and the Adam optimizer.</p>
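        <p>For concreteness, the architecture described above can be written down in a few lines. The following
is a minimal sketch in TensorFlow/Keras (one possible implementation choice); with the default valid
convolution padding, the flattened vector has exactly 144768 elements:</p>
        <preformat>
# Sketch of the described 3D CNN in TensorFlow/Keras (assumed framework choice).
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(5, 240, 320, 3)),            # 5 frames, 240 x 320 RGB
    layers.Conv3D(16, (3, 3, 3), activation="relu"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),        # spatial-only downsampling
    layers.Conv3D(32, (3, 3, 3), activation="relu"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    layers.Flatten(),                                # 1 x 58 x 78 x 32 = 144768
    layers.Dense(10000, activation="relu"),
    layers.Dense(500, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(309, activation="softmax"),         # one unit per gloss class
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
        </preformat>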
        <p>The proposed model is the result of experiments with different network architectures and parameter
settings. In the first, unsuccessful attempts, we also experimented, for example, with different numbers
of convolutional-pooling blocks and their hyperparameter settings.</p>
        <p>Compared to the first experimental models, a notable improvement in the proposed model's
performance occurred after removing the dropout layers; dropout usually improves network performance,
but in our case its presence led to a decrease. An enhancement was also noted after adding the two dense
layers with 10000 and 500 neurons between the flattening layer and the dense layer with 128 units. The
additional dense layers ensure a slower, more gradual reduction of the vector that is the output of the
flattening layer: the 144768-element flattened vector is reduced to a 128-element vector step by step,
rather than in a single step, before reaching the output layer. The neural network thus has more room to
select important features to use in classification.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>In order to validate the proposed model on the GSL dataset, a case study was conducted to evaluate its
performance on subsets of the dataset with respect to the number of target classes. In these subsets, we
selected samples belonging to three, ten, and then all classes of the dataset.</p>
        <p>The selection of categories for the subset of the dataset used in the three-class part of the case study was
essentially arbitrary. However, care was taken to ensure these categories contained a similar number of
samples to simulate an ideal, balanced dataset. The subset contained the following target classes:</p>
        <table-wrap id="tab-2">
          <table>
            <thead>
              <tr><th>Class</th><th>Number of samples</th></tr>
            </thead>
            <tbody>
              <tr><td>ΒΙΒΛΙO (BOOK)</td><td/></tr>
              <tr><td>ΕΝΤΑΞΕΙ (OK)</td><td/></tr>
              <tr><td>ΤΑΞΙΔΙ (TRAVEL)</td><td/></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>To create a dataset with ten output classes, seven additional labels were added to the three classes used
previously. The selected categories in this subset of the dataset displayed a less uniform representation,
yet the differences in the number of samples between classes were not significant:</p>
        <table-wrap id="tab-3">
          <table>
            <thead>
              <tr><th>Class</th><th>Number of samples</th></tr>
            </thead>
            <tbody>
              <tr><td>ΤΑΥΤOΤΗΤΑ (IDENTITY, ID CARD)</td><td/></tr>
              <tr><td>ΣΦΡΑΓΙΔΑ (SEAL, STAMP)</td><td/></tr>
              <tr><td>ΓΙΑ (FOR, ABOUT)</td><td/></tr>
              <tr><td>ΓΕΙΑ (HELLO)</td><td/></tr>
              <tr><td>ΚΙΝΗΤO (MOBILE, MOBILE PHONE)</td><td/></tr>
              <tr><td>ΕΥΧΑΡΙΣΤΩ (THANK YOU)</td><td/></tr>
              <tr><td>ΑΥΤOΣ (HE, SHE, IT, SELF)</td><td/></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Generated subsets were in all cases divided into training and test sets using a stratified hold-out
split, ensuring that class distributions were preserved, and network performance was evaluated based
on accuracy. Further analysis was conducted based on normalized confusion matrices. The network
utilized batch learning for all three parts of the case study. Table 4 provides a summary of the
various network parameters, including the proportion of samples used for testing the network, the
number of epochs for its training, and the batch sizes.</p>
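        <p>The stratified hold-out split itself is routine; a minimal sketch, assuming scikit-learn and an array
labels with one gloss per video, is shown below (the test proportion here is illustrative; the actual
proportions per part of the case study are those listed in Table 4):</p>
        <preformat>
# Sketch of the stratified hold-out split. Assumes `labels` holds one gloss per
# video; test_size=0.2 is illustrative (see Table 4 for the actual proportions).
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(len(labels))
train_idx, test_idx = train_test_split(
    indices,
    test_size=0.2,       # fraction of samples held out for testing
    stratify=labels,     # preserve the class distribution in both sets
    random_state=42,     # illustrative seed for reproducibility
)
        </preformat>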
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Performance Analysis on Dataset Subsets</title>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Analysis of Model Error (Confusion Matrix)</title>
        <p>Given that the testing results of the proposed model on the subset of the dataset with 309 target
classes were no longer as strong as in the previous two parts of the case study on smaller balanced
subsets, a normalized confusion matrix was constructed for the trained model, providing insight into
which classes were challenging for it.</p>
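        <p>A minimal sketch of this analysis, assuming scikit-learn and arrays y_true and y_pred with one gloss
label per test sample, could look as follows:</p>
        <preformat>
# Sketch of the error analysis: normalized confusion matrix and the ten weakest
# classes. Assumes `y_true` and `y_pred` hold one gloss label per test sample.
import numpy as np
from sklearn.metrics import confusion_matrix

classes = np.unique(y_true)
cm = confusion_matrix(y_true, y_pred, labels=classes, normalize="true")
per_class_acc = np.diag(cm)              # proportion of correctly classified samples
worst = np.argsort(per_class_acc)[:10]   # ten weakest classes (ties broken arbitrarily)
for i in worst:
    print(classes[i], f"{per_class_acc[i]:.2%}")
        </preformat>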
        <p>In Table 6, we present the ten classes with the weakest performance in terms of the proportion of
correctly classified samples from the class. These classes are sorted primarily by the percentage of
samples correctly classified and secondarily by the class name, in ascending order.</p>
        <p>
          As demonstrated in the table, the class ΕΓΩ(3) exhibited the poorest performance, failing to achieve
a correct classification for any of its test samples. A more detailed analysis of this class revealed that
in 100% of the cases its samples were classified as the class ΕΓΩ(1). In fact, the GSL dataset
contains three distinct versions of signs for the gloss ΕΓΩ (I, ME). The authors of the dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] state
that these variations may be attributed to regional differences. A systematic search of all the classes in
the dataset yielded three additional glosses to which multiple versions of signs belong.
        </p>
        <p>This analysis resulted in the merging of the recurring classes into one. Specifically, the considered
classes were ΕΓΩ(1), ΕΓΩ(2), ΕΓΩ(3), ΚOΚΚΙΝO(1), ΚOΚΚΙΝO(2), ΓΙΑΤΡOΣ(1), ΓΙΑΤΡOΣ(2), ΚΑΤΩ(1)
and ΚΑΤΩ(2) for glosses ΕΓΩ (I, ME), ΚOΚΚΙΝO (RED), ΓΙΑΤΡOΣ (DOCTOR) and ΚΑΤΩ (DOWN,
BELOW, UNDER) respectively. The classes representing the same gloss were merged into a single
category, such as ΕΓΩ(1), ΕΓΩ(2), and ΕΓΩ(3) into a common category ΕΓΩ, while the split of the
samples into training and test sets was preserved.</p>
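        <p>In code, this merging step amounts to a simple relabeling. The following sketch (with class names
written exactly as they appear in the dataset) maps each variant to its merged category:</p>
        <preformat>
# Sketch of the class-merging step: map variant classes of the same gloss to one
# merged category, reducing 309 classes to 304.
MERGE_MAP = {
    "ΕΓΩ(1)": "ΕΓΩ", "ΕΓΩ(2)": "ΕΓΩ", "ΕΓΩ(3)": "ΕΓΩ",
    "ΚOΚΚΙΝO(1)": "ΚOΚΚΙΝO", "ΚOΚΚΙΝO(2)": "ΚOΚΚΙΝO",
    "ΓΙΑΤΡOΣ(1)": "ΓΙΑΤΡOΣ", "ΓΙΑΤΡOΣ(2)": "ΓΙΑΤΡOΣ",
    "ΚΑΤΩ(1)": "ΚΑΤΩ", "ΚΑΤΩ(2)": "ΚΑΤΩ",
}

def merge_label(gloss):
    """Return the merged category for a gloss; unaffected classes pass through."""
    return MERGE_MAP.get(gloss, gloss)
        </preformat>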
        <table-wrap id="tab-6">
          <label>Table 6</label>
          <table>
            <thead>
              <tr><th>Class</th><th>% Correctly Classified</th></tr>
            </thead>
            <tbody>
              <tr><td>ΕΓΩ(3) (I, ME)</td><td>0.00%</td></tr>
              <tr><td>ΤΡΙΤOΝ (THIRD)</td><td/></tr>
              <tr><td>15</td><td>14.29%</td></tr>
              <tr><td>ΠΡOΣ (TO, TOWARDS)</td><td/></tr>
              <tr><td>ΕΓΩ(2) (I, ME)</td><td/></tr>
              <tr><td>ΚΕΠΑ(Δ.Α.) (DISABILITY CERTIFICATION CENTRE)</td><td/></tr>
              <tr><td>200</td><td/></tr>
              <tr><td>ΑΚOΥΩ_ΜΕΙΩΝΩ (HEARING LOSS)</td><td/></tr>
              <tr><td>ΑΥΤΗ_ΤΗ_ΣΤΙΓΜΗ (RIGHT NOW)</td><td/></tr>
              <tr><td>ΓΥΝΑΙΚOΛOΓOΣ (GYNECOLOGIST)</td><td/></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>This modification resulted in 304 classes from the previous 309 classes. The same neural network
was then trained on this modified data, with the sole change of reducing the number of output-layer
neurons from 309 to 304, again for 18 epochs. The performance of the network was evaluated using
accuracy, and a comparison of the maximum accuracy achieved during training and testing of the model
before and after merging the classes is summarized in Table 7.</p>
        <p>The maximal accuracy value achieved during the training phase was 98.87% (previously 98.17%)
and during the testing phase was 83.16% (previously 82.92%). Merging the classes thus results in a
slight increase in the maximal accuracy value. However, it should be noted that 300 other classes, in
addition to the ones that were merged, contribute to these resulting values. Therefore, an analysis of
the normalized confusion matrix was conducted again.</p>
        <table-wrap id="tab-7">
          <label>Table 7</label>
          <table>
            <thead>
              <tr><th>Phase</th><th>Accuracy before merging</th><th>Accuracy after merging</th></tr>
            </thead>
            <tbody>
              <tr><td>Train</td><td>98.17%</td><td>98.87%</td></tr>
              <tr><td>Test</td><td>82.92%</td><td>83.16%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Table 8 shows the number of classes with a correct-classification percentage under different
thresholds. It can be seen that the number of filtered classes decreased after merging the classes for
each threshold. Merging the classes, especially the problem class ΕΓΩ(3), may have given the neural
network more room to learn and improve the classification of the remaining classes.</p>
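        <p>Given the per-class accuracies from the confusion-matrix sketch above, such threshold counts reduce
to a one-liner (the thresholds here are illustrative; the actual ones are those used in Table 8):</p>
        <preformat>
# Count classes whose per-class accuracy falls below a threshold (illustrative
# thresholds; `per_class_acc` comes from the confusion-matrix sketch above).
for t in (0.25, 0.50, 0.75):
    print(t, int(np.sum(t > per_class_acc)))
        </preformat>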
        <p>Table 9 shows the ten worst-classified categories after merging the repeating classes. It is sorted in
the same way as Table 6. Except for class 15, whose percentage of correctly classified samples increased
from 14.29% to 28.57%, the structure of the worst-classified classes changed after merging the repeating
categories. For instance, the classification of classes such as ΤΡΙΤOΝ (THIRD), ΠΡOΣ (TO, TOWARDS),
and 200 has improved; these classes are no longer among the ten worst. However, after merging, the
class ΤΕΤΑΡΤOΝ (FOURTH), which was not among the ten worst-classified classes before merging,
was added. This change may not have been directly caused by the merging of the classes, since several
factors affect classification, such as the initialization of parameters in the network, which was in our
case random.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Comparison with State-of-the-Art Models</title>
        <p>
          In their paper, Adaloglou et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] present the GSL dataset for both isolated and continuous SLR. They
also use the models proposed in [24, 25, 26] for the classification tasks on this dataset. These neural
network architectures were primarily utilized for continuous SLR; however, their performance was also
evaluated on the isolated SLR version of the GSL dataset.
        </p>
        <p>
          Table 10 presents the results of our proposed model and the models introduced in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Considering
these results, it can be concluded that our solution achieves results similar to those of the far more
complex models. However, it is important to note that our model was trained and tested using different
training and testing sets than those used for the presented models. Therefore, to objectively evaluate
our model in the future, we propose training and validating all models using the same subsets of the
dataset.
        </p>
        <table-wrap id="tab-10">
          <label>Table 10</label>
          <table>
            <thead>
              <tr><th>Model</th><th>Convolution type</th><th>Accuracy</th></tr>
            </thead>
            <tbody>
              <tr><td>GoogLeNet + TConvs</td><td>(2+1)D</td><td/></tr>
              <tr><td>3D-ResNet + BLSTM</td><td>3D</td><td/></tr>
              <tr><td>I3D + BLSTM</td><td>3D</td><td/></tr>
              <tr><td>Ours (before merging repeated classes)</td><td>3D</td><td>82.92%</td></tr>
              <tr><td>Ours (after merging repeated classes)</td><td>3D</td><td>83.16%</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <sec id="sec-5-1">
        <title>5.1. Discussion of Findings</title>
        <p>In this paper, we have addressed the problem of isolated sign language recognition. To solve this task
using a deep learning approach, we proposed a 3D convolutional neural network model, selecting the
complex GSL dataset for the purpose of its training and testing.</p>
        <p>Due to the considerable number of classes in the GSL dataset, a case study was performed on the
proposed model. The model’s performance was evaluated on two smaller, balanced subsets of the
dataset, for which optimal results were obtained during training and testing.</p>
        <p>In the context of training and testing the network on the full, imbalanced dataset, the testing results
did not align with the results achieved during network training. This inconsistency may be attributed
to the uneven representation of the classes. A thorough analysis of the confusion matrix revealed that
certain glosses were associated with multiple signs, suggesting the presence of regional variations
characterized by subtle changes. As a result of this analysis, the sign variations corresponding to
the same gloss were merged into a single class. Following this transformation, an improvement in
the accuracy value was observed, accompanied by an enhancement in the analysis outcomes for the
problematic classes.</p>
        <p>The added value of this paper stems from the proposed methodology for addressing duplicate or
highly similar categories in sign language datasets. By systematically analyzing the confusion matrix,
we identified hard examples (frequently misclassified classes such as regional variants of the same gloss)
and soft examples (classes with consistent but subtle overlaps). Based on this analysis, we introduced a
principled approach for merging categories, which not only improved the recognition accuracy of our
CNN model but also provided a more realistic treatment of sign variation in computational systems. This
methodology represents a practical contribution toward building more robust SLR pipelines, especially
for datasets where gloss definitions are not strictly standardized or where regional sign variants coexist.
Beyond the scope of this study, such an approach could be generalized to other sign languages and
multimodal datasets, thereby reducing annotation noise while still respecting linguistic diversity.</p>
        <p>In consideration of the results obtained, it is possible to conclude that the proposed model can be
considered an effective tool for isolated sign language recognition systems with smaller, thematically
focused lexicons. In a real-world setting, the potential applications of this model could include the
recognition of signs in healthcare communication, where a limited, specifically defined repertoire of
signs is typically used.</p>
        <p>The proposed model is limited in two aspects. First, the training process is time-consuming. Second,
the model receives a limited number of video frames as input. For training and testing,
almost the entire GSL dataset was used, excluding videos with fewer than 5 frames and a video with no other
samples in its class, with a training set consisting of 32433 examples and a test set containing 8111
examples. The duration of one epoch was found to be approximately five hours. Therefore, another
possible future task is to optimize the model with respect to computation time, as well as to experiment
with accepting a higher number of video frames as input.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Future Work</title>
        <p>In a series of papers, including [27, 28, 29, 30, 31], supplementary information such as depth or joint
information is utilized to enhance the sign dynamics information in the neural network, yielding a
multimodal input stream.</p>
        <p>Following on from this work, initial steps were taken toward the incorporation of joint information
for the hands, i.e., hand skeletons. The creation of hand skeletons has already been explored through
experimentation with the Hands model from the MediaPipe library [32]. The result of the experimental
hand-skeleton creation for a sequence of video frames can be seen in Figure 2.</p>
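        <p>A minimal sketch of this experiment with the MediaPipe Hands model [32] is given below; the
parameter values and file name are illustrative rather than the exact configuration used:</p>
        <preformat>
# Sketch of hand-skeleton extraction with MediaPipe Hands [32]; parameter values
# and the input file name are illustrative.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

with mp_hands.Hands(static_image_mode=True,   # treat frames as independent images
                    max_num_hands=2,
                    min_detection_confidence=0.5) as hands:
    frame = cv2.imread("frame_000.jpg")       # one preprocessed video frame
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            # 21 (x, y, z) landmarks per detected hand, drawn onto the frame
            mp_drawing.draw_landmarks(frame, hand_landmarks,
                                      mp_hands.HAND_CONNECTIONS)
    cv2.imwrite("frame_000_skeleton.jpg", frame)
        </preformat>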
        <p>It is essential to acknowledge that the sequence of hand skeletons serves merely as a supplementary
piece of information, and it cannot entirely replace the input sequence of video frames. This is due to
the fact that the skeleton images neglect some characteristics of the signs, such as the facial expression
or the overall posture of the signer, and only emphasize the most basic parts of the signs—the shape of
the hands and the movement they perform.</p>
        <p>The integration of such information could potentially enhance the performance of the model by
providing an additional perspective on the signing process, particularly in capturing temporal patterns
and articulations that might otherwise be overlooked. However, the reliability and practical usage of
such an approach remains to be tested.</p>
        <p>Beyond the integration of hand pose information, another promising research direction lies in
exploiting recent advances in multimodal large language models (LLMs). For example, the SignCLIP model
leverages contrastive pretraining on paired sign-language videos and spoken-language text, learning
joint video–text embeddings that support both video-to-text retrieval and few-shot recognition tasks across
sign languages [33]. Leveraging such automatically generated textual representations could provide
complementary supervision signals for sign recognition models, potentially improving generalization
across domains and signers. Moreover, pretraining on large-scale gesture-to-text corpora may help
bridge the gap between visual sign representations and linguistic meaning, opening pathways toward
richer sign language translation systems. We thus see the integration of multimodal representation
learning, not only skeletal features but also language-guided supervision, as a highly relevant direction
for future research in sign language recognition.</p>
        <p>
          Another limitation of the present study is that we restricted our evaluation to a custom 3D CNN
model. While we compared our results to architectures reported in the literature, those models were
often trained and validated using different experimental setups, making direct comparison less reliable.
For a fairer benchmark, it will be necessary to re-implement and train alternative architectures such as
CNN–LSTM hybrids or Transformer-based models (e.g., vision transformers or multimodal transformers)
on the same dataset splits. Recent works have shown that such architectures can effectively capture
long-range temporal dependencies and multimodal context in sign language videos [
          <xref ref-type="bibr" rid="ref12">12, 34</xref>
          ]. We consider
systematic evaluation across these model families, under unified conditions, as an essential direction
for future work.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This article was supported by the Scientific Grant Agency of the Ministry of Education, Science, Research
and Sport of the Slovak Republic under contract VEGA 1/0539/25.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <sec id="sec-7-1">
        <title>The authors have not employed any Generative AI tools.</title>
        <p>CNN‑LSTM Framework Based on Attention Mechanism, Electronics 13 (2024) 1229. doi:10.
3390/electronics13071229.
[17] J. Sharma, K. S. Gill, M. Kumar, R. Rawat, Deep Learning for Sign Language Recognition: Exploring</p>
        <p>VGG16 and ResNet50 Capabilities (2024). doi:10.56155/978-81-955020-9-7-13.
[18] D. Li, C. Rodriguez, X. Yu, H. Li, Word-level deep sign language recognition from video: A new
large-scale dataset and methods comparison, in: Proceedings of the IEEE/CVF winter conference
on applications of computer vision, 2020, pp. 1459–1469.
[19] S. Gueuwou, X. Du, G. Shakhnarovich, K. Livescu, A. H. Liu, SHuBERT: Self-Supervised
Sign Language Representation Learning via Multi‑Stream Cluster Prediction, arXiv preprint
arXiv:2411.16765 (2024).
[20] S. Jing, G. Wang, H. Zhai, Q. Tao, J. Yang, B. Wang, P. Jin, Dual-view Spatio-Temporal Feature
Fusion with CNN‑Transformer Hybrid Network for Chinese Isolated Sign Language Recognition,
arXiv preprint arXiv:2506.06966 (2025).
[21] L. Hu, W. Feng, L. Gao, Z. Liu, L. Wan, CorrNet+: Sign Language Recognition and Translation via</p>
        <p>Spatial‑Temporal Correlation, arXiv preprint arXiv:2404.11111 (2024).
[22] D. L. Quam, G. B. Williams, J. R. Agnew, P. C. Browne, An experimental determination of human
hand accuracy with a dataglove, in: Proceedings of the Human Factors Society Annual Meeting,
volume 33, SAGE Publications Sage CA: Los Angeles, CA, 1989, pp. 315–319.
[23] A. Z. Shukor, M. F. Miskon, M. H. Jamaluddin, F. bin Ali, M. F. Asyraf, M. B. bin Bahar, et al., A
new data glove approach for malaysian sign language detection, Procedia Computer Science 76
(2015) 60–67.
[24] R. Cui, H. Liu, C. Zhang, A deep neural framework for continuous sign language recognition by
iterative training, IEEE Transactions on Multimedia 21 (2019) 1880–1891.
[25] J. Pu, W. Zhou, H. Li, Iterative alignment network for continuous sign language recognition, in:
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp.
4165–4174.
[26] J. Carreira, A. Zisserman, Quo vadis, action recognition? a new model and the kinetics dataset,
in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp.
6299–6308.
[27] S. Jiang, B. Sun, L. Wang, Y. Bai, K. Li, Y. Fu, Skeleton aware multi-modal sign language recognition,
in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp.
3413–3423.
[28] J. Huang, W. Zhou, H. Li, W. Li, Sign language recognition using 3d convolutional neural networks,
in: 2015 IEEE international conference on multimedia and expo (ICME), IEEE, 2015, pp. 1–6.
[29] J. Zhang, Q. Wang, Q. Wang, Z. Zheng, Multimodal fusion framework based on statistical attention
and contrastive attention for sign language recognition, IEEE Transactions on Mobile Computing
23 (2023) 1431–1443.
[30] D. Laines, M. Gonzalez-Mendoza, G. Ochoa-Ruiz, G. Bejarano, Isolated sign language recognition
based on tree structure skeleton images, in: Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, 2023, pp. 276–284.
[31] A. S. M. Miah, M. A. M. Hasan, S.-W. Jang, H.-S. Lee, J. Shin, Multi-stream general and graph-based
deep neural networks for skeleton-based sign language recognition, Electronics 12 (2023) 2841.
[32] F. Zhang, V. Bazarevsky, A. Vakunov, A. Tkachenka, G. Sung, C.-L. Chang, M. Grundmann,</p>
        <p>Mediapipe hands: On-device real-time hand tracking, arXiv preprint arXiv:2006.10214 (2020).
[33] Z. Jiang, G. Sant, A. Moryossef, M. Müller, R. Sennrich, S. Ebling, Signclip: Connecting text and
sign language by contrastive learning (2024). arXiv:2407.01264.
[34] A. Brettmann, J. Grävinghof, M. Rüschof, M. Westhues, Breaking the barriers: Video vision
transformers for word-level sign language recognition (2025). ArXiv preprint.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>S. I. Stamoulis</surname>
          </string-name>
          , Sign Language Detection,
          <source>Master's thesis</source>
          , University of West Attica,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Adaloglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chatzis</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Papastratis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stergioulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. T.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zacharopoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Xydopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Atzakas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Papazachariou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Daras</surname>
          </string-name>
          ,
          <article-title>A comprehensive study on deep learning-based methods for sign language recognition</article-title>
          ,
          <source>IEEE Transactions on Multimedia</source>
          <volume>24</volume>
          (
          <year>2021</year>
          )
          <fpage>1750</fpage>
          -
          <lpage>1762</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Perlmutter</surname>
          </string-name>
          , What is sign language,
          <source>Linguistic Society of America</source>
          (
          <year>2011</year>
          )
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Sign language recognition: A comprehensive review of traditional and deep learning approaches, datasets, and challenges</article-title>
          , IEEE Access (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Cheok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Omar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Jaward</surname>
          </string-name>
          ,
          <article-title>A review of hand gesture and sign language recognition techniques</article-title>
          ,
          <source>International Journal of Machine Learning and Cybernetics</source>
          <volume>10</volume>
          (
          <year>2019</year>
          )
          <fpage>131</fpage>
          -
          <lpage>153</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Elakkiya</surname>
          </string-name>
          ,
          <article-title>Retracted article: Machine learning based sign language recognition: a review and its research frontier</article-title>
          ,
          <source>Journal of Ambient Intelligence and Humanized Computing</source>
          <volume>12</volume>
          (
          <year>2021</year>
          )
          <fpage>7205</fpage>
          -
          <lpage>7224</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huerta-Enochian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Y.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <article-title>Shedding light on the underexplored: Tackling the minor sign language research topics</article-title>
          ,
          <source>in: Proceedings of the LREC-COLING 2024 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>147</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Buttar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Gumaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Assiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Akbar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. F.</given-names>
            <surname>Alkhamees</surname>
          </string-name>
          ,
          <article-title>Deep Learning in Sign Language Recognition: A Hybrid Approach for the Recognition of Static and Dynamic Signs</article-title>
          ,
          <source>Mathematics</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>3729</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Sultan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. M. M.</given-names>
            <surname>Zaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M. A.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <article-title>Multiple Sign Language Identification Using Deep Learning Techniques</article-title>
          ,
          <source>Sci. J. Circuits Syst. Signal Process.</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          . doi:10.11648/j.cssp.20231101.11.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>O.</given-names>
            <surname>Koller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Forster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ney</surname>
          </string-name>
          ,
          <article-title>Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers</article-title>
          ,
          <source>Computer Vision and Image Understanding</source>
          , volume
          <volume>141</volume>
          , Elsevier,
          <year>2015</year>
          , pp.
          <fpage>108</fpage>
          -
          <lpage>125</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Video-based sign language recognition without temporal segmentation</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>32</volume>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Camgoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Koller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hadfield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bowden</surname>
          </string-name>
          ,
          <article-title>Sign language transformers: Joint end-to-end sign language recognition and translation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>10023</fpage>
          -
          <lpage>10033</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Neural sign language translation based on human keypoint estimation</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>9</volume>
          (
          <year>2019</year>
          )
          <fpage>2683</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          , J. Cheng, H. Lu,
          <article-title>Spatio-temporal graph convolutional network for skeleton-based sign language recognition</article-title>
          ,
          <source>in: Proceedings of the 28th ACM International Conference on Multimedia</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>600</fpage>
          -
          <lpage>608</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Stoll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Camgoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hadfield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bowden</surname>
          </string-name>
          , Text2sign:
          <article-title>Towards sign language production using neural machine translation and generative adversarial networks</article-title>
          ,
          <source>in: International Journal of Computer Vision</source>
          , volume
          <volume>128</volume>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>2515</fpage>
          -
          <lpage>2530</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kumari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <article-title>Isolated Video‑Based Sign Language Recognition Using a Hybrid CNN‑LSTM Framework Based on Attention Mechanism</article-title>
          ,
          <source>Electronics</source>
          <volume>13</volume>
          (
          <year>2024</year>
          )
          <fpage>1229</fpage>
          . doi:10.3390/electronics13071229.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] J. Sharma, K. S. Gill, M. Kumar, R. Rawat, Deep Learning for Sign Language Recognition: Exploring VGG16 and ResNet50 Capabilities (2024). doi:10.56155/978-81-955020-9-7-13.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] D. Li, C. Rodriguez, X. Yu, H. Li, Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 1459-1469.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] S. Gueuwou, X. Du, G. Shakhnarovich, K. Livescu, A. H. Liu, SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction, arXiv preprint arXiv:2411.16765 (2024).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] S. Jing, G. Wang, H. Zhai, Q. Tao, J. Yang, B. Wang, P. Jin, Dual-view Spatio-Temporal Feature Fusion with CNN-Transformer Hybrid Network for Chinese Isolated Sign Language Recognition, arXiv preprint arXiv:2506.06966 (2025).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] L. Hu, W. Feng, L. Gao, Z. Liu, L. Wan, CorrNet+: Sign Language Recognition and Translation via Spatial-Temporal Correlation, arXiv preprint arXiv:2404.11111 (2024).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] D. L. Quam, G. B. Williams, J. R. Agnew, P. C. Browne, An experimental determination of human hand accuracy with a dataglove, in: Proceedings of the Human Factors Society Annual Meeting, volume 33, SAGE Publications, Los Angeles, CA, 1989, pp. 315-319.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] A. Z. Shukor, M. F. Miskon, M. H. Jamaluddin, F. bin Ali, M. F. Asyraf, M. B. bin Bahar, et al., A new data glove approach for Malaysian sign language detection, Procedia Computer Science 76 (2015) 60-67.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] R. Cui, H. Liu, C. Zhang, A deep neural framework for continuous sign language recognition by iterative training, IEEE Transactions on Multimedia 21 (2019) 1880-1891.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] J. Pu, W. Zhou, H. Li, Iterative alignment network for continuous sign language recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4165-4174.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] J. Carreira, A. Zisserman, Quo vadis, action recognition? A new model and the Kinetics dataset, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299-6308.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] S. Jiang, B. Sun, L. Wang, Y. Bai, K. Li, Y. Fu, Skeleton aware multi-modal sign language recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3413-3423.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] J. Huang, W. Zhou, H. Li, W. Li, Sign language recognition using 3D convolutional neural networks, in: 2015 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2015, pp. 1-6.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] J. Zhang, Q. Wang, Q. Wang, Z. Zheng, Multimodal fusion framework based on statistical attention and contrastive attention for sign language recognition, IEEE Transactions on Mobile Computing 23 (2023) 1431-1443.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] D. Laines, M. Gonzalez-Mendoza, G. Ochoa-Ruiz, G. Bejarano, Isolated sign language recognition based on tree structure skeleton images, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 276-284.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] A. S. M. Miah, M. A. M. Hasan, S.-W. Jang, H.-S. Lee, J. Shin, Multi-stream general and graph-based deep neural networks for skeleton-based sign language recognition, Electronics 12 (2023) 2841.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] F. Zhang, V. Bazarevsky, A. Vakunov, A. Tkachenka, G. Sung, C.-L. Chang, M. Grundmann, MediaPipe Hands: On-device real-time hand tracking, arXiv preprint arXiv:2006.10214 (2020).</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] Z. Jiang, G. Sant, A. Moryossef, M. Müller, R. Sennrich, S. Ebling, SignCLIP: Connecting text and sign language by contrastive learning (2024). arXiv:2407.01264.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] A. Brettmann, J. Grävinghoff, M. Rüschoff, M. Westhues, Breaking the barriers: Video vision transformers for word-level sign language recognition (2025). arXiv preprint.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>