=Paper=
{{Paper
|id=Vol-3392/paper13
|storemode=property
|title=Application of Vision Transformers and 3D Convolutional Neural Networks for Sign Language Cluster Recognition
|pdfUrl=https://ceur-ws.org/Vol-3392/paper13.pdf
|volume=Vol-3392
|authors=Nataliia Kuznietsova,Serhii Smirnov
|dblpUrl=https://dblp.org/rec/conf/cmis/KuznietsovaS23
}}
==Application of Vision Transformers and 3D Convolutional Neural Networks for Sign Language Cluster Recognition==
Serhii Smirnov and Nataliia Kuznietsova
National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, ave. Peremohy 37, Kyiv, 03056, Ukraine

Abstract
In this study we review methods for sign language recognition and analyze the existing datasets in this area. We show how to develop and test different approaches on real data. Models based on the Video Vision Transformer (ViViT) and on 3D convolutional neural networks (3dCNN) were built with different batch sizes and compared. We also show how to train the models on different data sizes and how to search for a compromise between accuracy, speed, and overfitting. Our research provides insights into the strengths and limitations of the different models for this task, highlights open problems, and offers directions for improving existing methods in this area by using vision transformers.

Keywords
Gesture recognition, computer vision, neural networks, deep learning, hand shape recognition, sign language interpreter, vision transformers, 3D convolutions.

1. Introduction
Sign language is a visual language used by people who are deaf or hard of hearing to communicate with each other and with hearing individuals. It combines hand gestures, facial expressions, and body language to convey meaning. While sign language is an effective means of communication, it can be challenging for non-signers to understand and communicate with sign language users. This has led to the development of sign language recognition technology, which uses computer algorithms to interpret and translate sign language into spoken or written language.

Sign language recognition has the potential to improve communication and inclusion for people who are deaf or hard of hearing. It also poses unique challenges, such as the need for accurate hand shape and movement detection, real-time recognition, and dealing with the complexity and variability of different sign languages. Advances in machine learning, computer vision, and sensor technology now make it possible to overcome these challenges and make sign language recognition more accurate, efficient, and accessible. Machine learning techniques, such as deep learning, can be used to learn a mapping between visual features and the corresponding sign language gestures. While CNNs have limitations in capturing long-term dependencies and global context, which are crucial for complex image understanding tasks such as object detection and segmentation, transformers have gained significant popularity in computer vision in recent years due to their ability to process sequential data such as images and videos. In this article we compare two approaches for sign language recognition and determine, on a real practical task, which of them is more effective and more promising for future improvement.

2. Sign language recognition problem statement
Sign language is an essential mode of communication for deaf or hard-of-hearing individuals. Sign language recognition (SLR) is a challenging task, as sign languages are highly complex, with a wide range of variations and nuances.
SLR involves identifying hand gestures, facial expressions, and body movements to interpret the meaning of a sign [1, 2]. The goal of SLR is to develop systems that can recognize and translate sign language into written or spoken language, enabling communication between hearing and non-hearing individuals. In this research we discuss the problems, limitations, unsolved issues, and existing solutions in SLR.

One of the primary challenges in SLR is the complexity of sign languages. There are over 300 sign languages used worldwide, each with its own grammar, vocabulary, and dialects. Furthermore, sign languages are highly context-dependent, with the meaning of signs often varying depending on the speaker's location, age, gender, and culture. Thus, developing an SLR system that can accurately recognize and interpret the nuances of different sign languages is a significant challenge.

Another challenge in SLR is the variability in signing styles. Signers may use different hand shapes, positions, and movements to convey the same message. Moreover, the speed and duration of signs can vary, adding further complexity to the task. Thus, SLR systems must be robust to variations in signing styles, as well as to variations in lighting conditions and camera angles. To address these challenges, researchers have proposed various techniques, such as data augmentation, transfer learning, and multi-modal fusion, which combines visual and other modalities such as audio or depth information.

One of the main challenges in sign language recognition is collecting and annotating large datasets of sign language gestures, which are required to train and evaluate machine learning models. This can be particularly difficult for sign languages that are not widely spoken or documented. A further limitation of SLR is the lack of large-scale annotated datasets: while several datasets are available for SLR, they are relatively small, limiting the performance of machine learning algorithms. A brief comparison of existing datasets for the sign recognition task is given in Table 1. Moreover, annotating sign language data is a time-consuming and challenging task, as it requires the expertise of sign language experts.
Table 1
Sign language recognition datasets comparison

| Id | Name | Country | Classes | Samples | Language level | Availability |
|----|------|---------|---------|---------|----------------|--------------|
| 1 | DGS Kinect 40 | Germany | 40 | 3000 | Word | Contact author |
| 2 | RWTH-PHOENIX-Weather | Germany | 1200 | 45760 | Sentence | Publicly available |
| 3 | SIGNUM | Germany | 450 | 33210 | Sentence | Contact author |
| 4 | GSL 20 | Greece | 20 | ~840 | Word | Contact author |
| 5 | Boston ASL LVD | USA | 3300+ | 9800 | Word | Publicly available |
| 6 | PSL Kinect 30 | Poland | 30 | 300 | Word | Publicly available |
| 7 | PSL ToF 84 | Poland | 84 | 1680 | Word | Publicly available |
| 8 | LSA64 | Argentina | 64 | 3200 | Word | Publicly available |
| 9 | MSR Gesture 3D | USA | 12 | 336 | Word | Publicly available |
| 10 | DEVISIGN-G | China | 36 | 432 | Word | Contact author |
| 11 | DEVISIGN-D | China | 500 | 6000 | Word | Contact author |
| 12 | DEVISIGN-L | China | 2000 | 24000 | Word | Contact author |
| 13 | IIITA-ROBITA | India | 23 | unknown | Word | Contact author |
| 14 | Purdue ASL | USA | unknown | unknown | Word/Sentence | Request DVDs/HD |
| 15 | CUNY ASL | USA | unknown | ~33000 | Sentence | Unknown |
| 16 | SignsWorld Atlas | Arabia | multiple types | unknown | Handshape, Words, Sentences | Unknown |
| 17 | LSA-T | Argentina | translation | 14880 | Sentence | Publicly available |
| 18 | LSFB-CONT | Belgium | 6883 | 85000+ | Word, Sentence | Publicly available |
| 19 | LSFB-ISOL | Belgium | 400 | 50000+ | Word | Publicly available |
| 20 | WLASL | USA | 2000 | 21083 | Word | Publicly available |

Another issue in SLR is the lack of real-time performance. SLR systems often require high computational resources, making it challenging to achieve real-time performance on mobile devices or in low-resource settings.

Researchers have proposed various solutions to deal with these challenges and limitations [3‒12]. One approach is to use deep learning algorithms, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), to recognize signs from video sources. These algorithms have shown promising results in SLR, achieving state-of-the-art performance on several benchmark datasets. Another solution is to use depth sensors, such as Microsoft Kinect, to capture 3D motion data, which can be used to recognize signs accurately [8]. Depth sensors are advantageous as they can capture the 3D shape and position of the signer's hands, providing more robust and accurate sign recognition. Furthermore, researchers have proposed the use of transfer learning, where models pre-trained on large datasets such as ImageNet are fine-tuned on sign language datasets. This approach has been shown to improve the performance of SLR models, particularly for low-resource sign language datasets.

So, SLR is a challenging task that requires robust and accurate recognition of complex hand gestures, facial expressions, and body movements. While several solutions have been proposed to address its limitations and challenges, there are still unsolved issues, such as real-time performance, variability in signing styles, and the lack of large-scale annotated datasets. With the development of more advanced algorithms and the availability of larger annotated datasets, there is hope that these challenges can be addressed, enabling better communication between hearing and non-hearing individuals. Overall, sign language recognition remains an important problem with potential applications in fields such as assistive technologies, education, and communication for deaf and hard-of-hearing individuals.

3. Brief overview of sign language recognition methods
Various methods for sign language recognition (SLR) have been proposed and tested over the years. The most widespread of them are the following:
1. Template matching is a simple and intuitive approach in which the hand movements of the signer are matched against a predefined set of templates to recognize the sign [3, 4]. The method involves capturing a series of hand poses and storing them as templates. During recognition, the input sequence is compared to each template, and the sign is identified based on the closest match. While this method is easy to implement, it is limited by the need to define the templates manually and by its inability to handle variations in signing styles.
2. Hidden Markov Models (HMMs) are probabilistic models used in speech recognition that can model the temporal dependencies in sign language by capturing the transitions between hand shapes and movements. The method involves training an HMM on a dataset of sign language gestures and using it to recognize signs in new sequences [5, 6]. However, HMMs can be limited in capturing the complex and context-dependent variations in sign language.
3. Support Vector Machines (SVMs), a type of machine learning algorithm, can classify sign language sequences by finding the hyperplane that separates the data points into their respective classes [7]. SVMs have been shown to achieve good accuracy in SLR, but they require large amounts of training data and may not be robust to variations in signing styles.
4. 3D depth sensors, such as Microsoft Kinect, make it possible to capture the 3D shape and position of the signer's hands. This approach has the advantage of being more robust to variations in lighting and camera angles, and it can capture the depth information of the hand movements [8, 9]. The depth information can be used to recognize signs more accurately.
5. Deep learning methods, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have been used in SLR and have shown significant improvements in performance. CNNs can learn the spatial features of sign language sequences, while RNNs can capture the temporal dependencies between hand movements [10, 11]. Additionally, attention mechanisms can be used to focus on the most relevant parts of the sequence, improving recognition accuracy.

Several deep learning architectures can be singled out for this task and can be quite efficient in real applications.

PoseTCN is a deep learning model that uses temporal convolutional networks (TCNs) to capture the temporal dependencies in sign language gestures. PoseTCN takes as input a sequence of 3D hand pose data and outputs the recognized sign. The model uses dilated convolutions to increase the receptive field of the network and improve its ability to capture long-term dependencies [12, 13].

PoseTGCN is a deep learning model that uses a graph convolutional network (GCN) to capture the spatial dependencies between the joints in sign language gestures, and a temporal convolutional network (TCN) to capture the temporal dependencies. The model takes as input a sequence of 3D joint positions and outputs the recognized sign. The GCN operates on a graph structure where the joints are nodes and the edges represent the spatial relationships between them [14, 15]. The TCN operates on the resulting feature maps and uses dilated convolutions to capture long-term dependencies.
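To illustrate how dilated temporal convolutions enlarge the receptive field in TCN-style models such as PoseTCN, below is a minimal sketch in Keras. The layer sizes and the pose-feature dimension (21 hand joints × 3 coordinates) are illustrative assumptions of ours, not the configuration used in [12, 13].

```python
import tensorflow as tf

def temporal_block(x, filters, dilation_rate):
    """One dilated temporal convolution block over a (time, features) sequence."""
    return tf.keras.layers.Conv1D(
        filters, kernel_size=3, padding="causal",
        dilation_rate=dilation_rate, activation="relu")(x)

# Hypothetical input: 25 time steps of 63 pose features (21 joints x 3 coordinates).
inputs = tf.keras.Input(shape=(25, 63))
x = inputs
for d in (1, 2, 4, 8):   # doubling the dilation rate widens the receptive field exponentially
    x = temporal_block(x, filters=64, dilation_rate=d)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)  # e.g., 3 sign classes
tcn = tf.keras.Model(inputs, outputs)
tcn.summary()
```

The key design choice here is the growing dilation rate: each extra block sees a time window twice as wide as the previous one while keeping the number of parameters small.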
Inflated 3D ConvNet (I3D) is a deep learning model that uses a 3D convolutional neural network (CNN) to extract spatio-temporal features from sign language gestures. The model takes as input a sequence of RGB or depth frames and outputs the recognized sign. The 3D CNN is pre-trained on large-scale video datasets, such as Kinetics or Sports-1M, and fine-tuned on the sign language recognition task [16, 17]. The pre-training allows the model to learn generalizable features that can be applied to sign language gestures.

Sign Language Transformers (SLT) is a transformer-based model that uses self-attention mechanisms to learn the spatial and temporal features of sign language gestures. SLT takes as input a sequence of RGB or depth frames and outputs the recognized sign. The model uses a pre-trained backbone network, such as ResNet or EfficientNet, to extract visual features from the frames, which are then fed into a transformer encoder-decoder architecture [18, 19]. The attention mechanisms allow the model to focus on the most relevant parts of the sequence and improve recognition accuracy. This approach is now widely used in continuous sign language translation.

Transformers were originally designed for natural language processing (NLP) tasks, where they excel at capturing long-range dependencies and global context. They achieve this by incorporating self-attention mechanisms that allow the model to weigh the importance of different parts of the input sequence when making predictions. The same mechanism can be applied to images by treating each pixel or patch as a token, allowing the model to attend to different parts of the image when making predictions. Another advantage of transformers in computer vision is their ability to handle variable input sizes without requiring resizing or cropping, which is important for tasks such as object detection and segmentation, where the size and aspect ratio of the objects can vary significantly. Additionally, transformers can leverage pre-training on large amounts of data, allowing them to learn useful representations that can be fine-tuned on smaller datasets for specific tasks. Overall, the use of transformers in computer vision has shown promising results, outperforming traditional CNN-based architectures on various benchmarks and achieving state-of-the-art results on challenging tasks such as image captioning and visual question answering.

In summary, various deep learning models can be used for sign language recognition, such as PoseTGCN, I3D, PoseTCN, and Sign Language Transformers (SLT). These models differ in their architecture and in their ability to capture spatial and temporal dependencies in sign language gestures.

In Ukraine the task of sign recognition is especially relevant in the context of the war and the need to develop special government services that support, assist, and include in society people who have suffered from the war and have hearing problems. Several works of national scientists have investigated the problem of sign recognition and proposed special techniques and systems for communication and translation into sign language [20, 21]. Nevertheless, there are still unsolved issues, and the need for new approaches and for adapting existing methods remains high.

4. Practical task of video interpretation for sign language
The goal of this article is to test the approaches mentioned above: to build a simple transformer-based model for sign language recognition and compare its efficiency with the standard approach of using 3D convolutions. The idea is to clarify, at a very first level (without any neural network tricks or additional techniques), which approach could be more accurate for our task.
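Before either model can be compared, the raw videos have to be turned into fixed-size clips. A minimal preprocessing sketch is given below, assuming OpenCV and the (25, 64, 64, 3) clip shape used in Section 5; the uniform frame sampling and the [0, 1] normalization are our own simplifying assumptions, not necessarily the exact pipeline of the original experiments.

```python
import cv2
import numpy as np

def load_clip(video_path, num_frames=25, size=(64, 64)):
    """Read a video, uniformly sample num_frames frames, and resize them to size."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(cv2.resize(frame, size))
    cap.release()
    if not frames:
        raise ValueError(f"Could not read any frames from {video_path}")
    # Uniformly sample (or repeat) frames so every clip has exactly num_frames frames.
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    clip = np.stack([frames[i] for i in idx]).astype(np.float32) / 255.0
    return clip  # shape: (num_frames, 64, 64, 3)
```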
4.1. Dataset
For our experiments the LSA64 dataset [22] was used. LSA64 is a dataset for Argentinian Sign Language (LSA): a collection of video sequences designed for the task of sign language recognition in Argentinian Sign Language. The dataset was created by researchers at the National University of Córdoba in Argentina and contains 64 different LSA signs performed by 20 signers (10 male and 10 female). The videos were recorded in a controlled environment using a high-definition camera and have a resolution of 1280x720 pixels at 25 frames per second. Each sign was performed five times by each signer, resulting in a total of 6400 video sequences. The LSA64 dataset also includes ground-truth annotations for each video, indicating the start and end frames of each sign. These annotations were made manually by experts in LSA sign language. LSA64 is a challenging dataset due to variations in signing speed, camera viewpoint, and lighting conditions, which makes it a valuable resource for researchers working on robust and accurate sign language recognition algorithms (see some examples in Figure 1 and Figure 2).

Figure 1: Screenshots of some examples of the LSA64 dataset for sign language recognition
Figure 2: Example of one video storyboard

For our experiments we first selected a subset of the sign labels and then clustered them into 3 logical groups; this was done to have more samples in each of the ground-truth classes. The classes are: “colors” signs (consisting of the initial classes “red”, “green”, “yellow”, “light-blue”), “food” signs (consisting of the initial classes “sweet milk”, “water”, “food”), and “verbs” signs (consisting of the initial classes “help”, “thanks”). The samples were randomly split into train and test sets: 385 samples went to the train set and 165 samples to the test set. Note that the proportion of each class was preserved in both the train and test sets. The overall class proportions in the data are 36.36% for the first class, 36.36% for the second class, and 27.27% for the third class.

4.2. Used approaches
In this section the transformer-based and 3D convolution-based models used in the experiments are described in more detail.

4.2.1. 3D convolutions
A neural network that uses 3D convolutions for video analysis typically consists of multiple 3D convolutional, pooling, and fully connected layers [23]. 3D convolutions are a type of convolutional layer that considers both the spatial and the temporal dimensions of the input data. In the case of video analysis, the input is a sequence of frames, and the 3D convolutional layer applies a kernel to each frame and its neighboring frames in the temporal dimension to extract features that capture both spatial and temporal information. This allows the model to learn patterns and movements over time, which is crucial for tasks such as action recognition and gesture recognition (see Figure 3) [24].

Figure 3: Comparison of 2D (a) and 3D (b) convolutions

After the 3D convolutional layers, pooling layers are often used to downsample the feature maps and reduce the spatial dimensionality of the data. This helps to reduce the number of parameters in the model and prevent overfitting. Finally, fully connected layers are used to classify the input video sequence into one or more classes. These layers take the flattened feature maps from the convolutional layers and apply a set of weights to produce a probability distribution over the possible classes.
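A minimal sketch of such a 3D convolutional classifier for the (25, 64, 64, 3) clips used in our experiments is shown below, assuming Keras; the number of filters and layers is illustrative and does not reproduce the exact 3dCNN configuration evaluated in Section 5.

```python
import tensorflow as tf

def build_3dcnn(num_classes=3, input_shape=(25, 64, 64, 3)):
    """Small 3D CNN: stacked Conv3D + pooling blocks followed by dense classification layers."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        tf.keras.layers.Conv3D(16, kernel_size=3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling3D(pool_size=(1, 2, 2)),   # downsample space, keep time
        tf.keras.layers.Conv3D(32, kernel_size=3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling3D(pool_size=(2, 2, 2)),   # downsample space and time
        tf.keras.layers.Conv3D(64, kernel_size=3, padding="same", activation="relu"),
        tf.keras.layers.GlobalAveragePooling3D(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    return model
```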
4.2.2. Vision transformer
The Video Vision Transformer (ViViT) approach [25, 26] divides the video into small spatio-temporal regions of interest, called tubelets, and processes them using self-attention mechanisms similar to those used in NLP models. A brief overview of how the ViViT model with tubelet embeddings works:
1. Tubelet extraction. The first step is to extract tubelets from the input video. This can be done using a variety of techniques such as object detection, tracking, or motion analysis. Each tubelet is represented as a sequence of T frames, where each frame is an H x W x C tensor containing the pixel values of the video frame (see Figure 4).
Figure 4: Tubelet embedding
2. Flattening and linear projection. Each frame in the tubelet is flattened into a sequence of patches, and these patches are then linearly projected into a higher-dimensional embedding space of size D using a trainable linear layer. This results in a sequence of patch embeddings for each frame in the tubelet.
3. Multi-head self-attention. The projected sequences are then passed through multi-head self-attention layers. Each layer computes attention weights between all pairs of patch embeddings in the sequence and uses these weights to compute a weighted sum of the patch embeddings. This allows the model to attend to different regions of the tubelet depending on the task at hand.
4. Feed-forward network. After each self-attention layer, the output is passed through a feed-forward network with a ReLU activation function. This network applies a linear transformation to the input, followed by a non-linear activation function, and helps the model capture more complex relationships between the patch embeddings in the sequence.
5. Aggregation. Finally, the output of the last self-attention layer is aggregated across all frames in the tubelet to obtain a single vector representation for the tubelet. These vectors are then passed through a linear layer to predict the class label for the entire tubelet.
By processing tubelets with self-attention mechanisms, ViViT with tubelet embeddings is able to better capture the spatio-temporal relationships between different regions of the video, resulting in improved performance on video recognition tasks such as action recognition and video classification.
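To make the tubelet-embedding idea concrete, here is a minimal sketch of a ViViT-style classifier in Keras for the same (25, 64, 64, 3) clips. The tubelet size, embedding dimension, and number of transformer layers are illustrative assumptions; the sketch follows the general recipe of [25] rather than the exact model trained in Section 5, and positional embeddings are omitted for brevity.

```python
import tensorflow as tf

NUM_CLASSES, EMBED_DIM, NUM_HEADS, NUM_LAYERS = 3, 64, 4, 2

inputs = tf.keras.Input(shape=(25, 64, 64, 3))

# Tubelet embedding: a Conv3D whose kernel and stride equal the tubelet size maps each
# non-overlapping 5x8x8 spatio-temporal patch to a single EMBED_DIM-dimensional token.
x = tf.keras.layers.Conv3D(EMBED_DIM, kernel_size=(5, 8, 8), strides=(5, 8, 8))(inputs)
x = tf.keras.layers.Reshape((-1, EMBED_DIM))(x)   # (num_tubelets, EMBED_DIM); no positional embedding here

for _ in range(NUM_LAYERS):
    # Multi-head self-attention over the tubelet tokens, with a residual connection.
    attn = tf.keras.layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMBED_DIM)(x, x)
    x = tf.keras.layers.LayerNormalization()(x + attn)
    # Position-wise feed-forward network, also with a residual connection.
    ffn = tf.keras.layers.Dense(EMBED_DIM * 2, activation="relu")(x)
    ffn = tf.keras.layers.Dense(EMBED_DIM)(ffn)
    x = tf.keras.layers.LayerNormalization()(x + ffn)

x = tf.keras.layers.GlobalAveragePooling1D()(x)   # aggregate tokens into one clip vector
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
vivit = tf.keras.Model(inputs, outputs)
```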
5. Modelling & Results
In our experiments the Video Vision Transformer (ViViT) and the 3D convolutional CNN (3dCNN) were compared on our dataset, trained with different batch sizes (3, 32, 128, and 385, the length of the train dataset). Each of the selected models was trained for 30 epochs, with a learning rate of 1e-5 and an input video size of (25, 64, 64, 3). During training the history of the accuracies was stored, and for each approach only the model weights with the best test accuracy were saved and reloaded; this was done to limit the effect of model overfitting.

Table 2 presents the model comparison, where top-1 (also known as accuracy) and top-2 are top-K accuracy metrics calculated by formula (1):

$$\text{top-}K\ \text{accuracy} = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{k}\mathbf{1}\left(f_{i,j} = y_i\right), \qquad (1)$$

where $f_{i,j}$ is the predicted class for the i-th sample corresponding to the j-th largest predicted score, $y_i$ is the corresponding true value, $k$ is the number of guesses allowed, and $\mathbf{1}(x)$ is the indicator function. Top-K accuracy is often used for sign language recognition problems because it is a useful metric for evaluating model performance when dealing with a large number of possible signs and variations, and it accounts for the flexibility required in recognizing signs.

Table 2
Comparison of the vision transformer (ViViT) and 3D convolutional CNN (3dCNN) approaches

| Approach | Test top-1 | Test top-2 | Train top-1 | Train top-2 |
|----------|------------|------------|-------------|-------------|
| ViViT/3 | 69.7% | 89.7% | 87.01% | 89.7% |
| ViViT/32 | 64.85% | 90.3% | 78.96% | 90.3% |
| ViViT/128 | 63.03% | 89.09% | 63.9% | 89.09% |
| ViViT/385 | 60.0% | 83.03% | 65.97% | 83.03% |
| 3dCNN/32 | 63.64% | 75.76% | 58.18% | 73.77% |
| 3dCNN/128 | 52.12% | 72.73% | 51.17% | 72.73% |

Table 2 shows that the ViViT models with small batch sizes reach higher quality. However, taking the accuracy on the training dataset into account, we can see that such models overfit more easily. This means that the ViViT model reaches high quality faster during training (see Figure 5), but for future investigations more techniques for mitigating overfitting should be applied (e.g., augmentation), especially when working with small amounts of data. This behavior can be traced in the plots below (Figures 6-11). Note that in Figures 6-11 the loss and accuracy are normalized to be shown on the same scale, which makes it possible to analyze the overfitting there. From the 20th epoch the accuracy of the 3D convolutional network (3dCNN/128) increases sharply. For the ViViT/3 and ViViT/32 models (Figures 6-7), the accuracy on both the training and test datasets grows and the test loss grows as well, while the loss on the training dataset decreases; this confirms that the model learned the training dataset well but ran into trouble on the test dataset. For the 3D convolutional networks (Figures 10-11), the losses on both the training and test datasets decrease at a similar rate, while the accuracies increase.

Figure 5: Accuracy on the test set for the different models over epochs
Figure 6: Train and test loss and accuracy of the ViViT/3 model over epochs
Figure 7: Train and test loss and accuracy of the ViViT/32 model over epochs
Figure 8: Train and test loss and accuracy of the ViViT/128 model over epochs
Figure 9: Train and test loss and accuracy of the ViViT/385 model over epochs
Figure 10: Train and test loss and accuracy of the 3dCNN/32 model over epochs
Figure 11: Train and test loss and accuracy of the 3dCNN/128 model over epochs
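For concreteness, the following sketch shows how the training and evaluation protocol described above could be expressed in Keras: an Adam optimizer with a 1e-5 learning rate, top-1 and top-2 accuracy metrics as in formula (1), and a checkpoint callback that keeps only the weights with the best test accuracy. The checkpoint file name and the use of the test set as validation data are our own illustrative assumptions.

```python
import tensorflow as tf

# `model` is either the 3dCNN or the ViViT sketch from Section 4.2;
# x_train, y_train, x_test, y_test hold the (25, 64, 64, 3) clips and integer class labels.
def train_and_evaluate(model, x_train, y_train, x_test, y_test, batch_size):
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
        loss="sparse_categorical_crossentropy",
        metrics=[
            "accuracy",                                                        # top-1
            tf.keras.metrics.SparseTopKCategoricalAccuracy(k=2, name="top2"),  # top-2
        ],
    )
    # Keep only the weights with the best test accuracy, as described above.
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        "best_weights.h5", monitor="val_accuracy",
        save_best_only=True, save_weights_only=True)
    history = model.fit(
        x_train, y_train,
        validation_data=(x_test, y_test),
        epochs=30, batch_size=batch_size,
        callbacks=[checkpoint])
    model.load_weights("best_weights.h5")
    # Data augmentation layers could be added in front of the model here to reduce overfitting.
    return history, model.evaluate(x_test, y_test, return_dict=True)
```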
6. Conclusion
In this paper we discussed the most relevant practices and approaches for sign language recognition. While significant progress has been made in sign language recognition using modern methods, some important issues remain unsolved:
1. Large variability in sign language. Sign language can vary widely across regions, cultures, and even individuals. This variability poses a significant challenge for sign language recognition systems, which must be robust to these variations.
2. The availability of large, diverse datasets is crucial for training and evaluating machine learning models for sign language recognition. However, such datasets are still scarce, particularly for less widely spoken sign languages.
3. Real-time sign language recognition is important for many applications, such as assistive technology and communication. However, real-time recognition remains a challenge, as it requires processing sign language videos in real time, which can be computationally intensive.
4. Sign language gestures can be occluded or noisy due to factors such as clothing, lighting, and background clutter. Handling these occlusions and noise is still a challenge for sign language recognition systems.
5. Sign language recognition systems are typically trained on a limited set of sign language gestures, which can affect their ability to recognize new or rare signs.
In this paper the task of sign recognition was solved for a real dataset. It was shown how to apply the existing approaches to different data sizes and what can be done to obtain higher accuracy. Our experiments compared the performance of the Vision Transformer (ViViT) and 3D convolutional CNN (3dCNN) models on sign language recognition. We trained both models with different batch sizes and evaluated their accuracy using the top-K metric. Our analysis shows that ViViT models with small batch sizes achieved higher quality, but were more prone to overfitting. The best numerical results were 69.7% test top-1 and 89.7% test top-2 for ViViT/3, and 64.85% test top-1 and 90.3% test top-2 for ViViT/32 (see Table 2). To address the issue of overfitting, future investigations should explore approaches such as data augmentation to improve the generalization of ViViT models, especially when working with limited amounts of data.
The practical value of this paper is that it was shown on a real dataset how and why one needs to search for a compromise between speed, accuracy, and overfitting, as well as between the size of the dataset and the necessary improvements of existing methods. Overall, our results provide valuable insights into the strengths and limitations of different models for sign language recognition, as well as into their practical implementation. The paper offers a starting point for further research on sign language recognition using vision transformers, possibly in conjunction with additional approaches such as pose estimation or hand detection applied before the classification itself.

7. References
[1] Hameed, W., & Al-Jumaily, A. A. (2019). Sign language recognition: Dataset and challenges. In Proceedings of the 2019 International Conference on Innovations in Intelligent Systems and Applications (pp. 1-7).
[2] Sehili, M. E. A., & Melkemi, M. (2021). Challenges and Opportunities in Sign Language Recognition. IEEE Access, 9, 52470-52487. doi:10.1109/ACCESS.2021.3062127.
[3] Murino, V., & Regazzoni, C. S. (2000). Template Matching Techniques in Computer Vision: Theory and Practice. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 743-760. doi:10.1109/34.85661.
[4] Dong, L., Jiang, S., Huang, Q., & Li, W. (2010). Template matching-based human action recognition in videos. Pattern Recognition, 43(3), 1199-1206. doi:10.1016/j.patcog.2009.07.022.
[5] Makris, D., & Ellis, T. (2006). Sign language recognition using hidden Markov models. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 36(3), 514-524. doi:10.1109/TSMCB.2005.856082.
[6] Garg, G., Sharma, S., & Saraswat, M. (2016). Sign Language Recognition Using Hidden Markov Models and HOG Features. In 2016 International Conference on Signal Processing and Communication (ICSC) (pp. 717-722). IEEE. doi:10.1109/ICSC.2016.7953781.
[7] Kumari, N., Gupta, N., & Sharma, S. K. (2016). A Review on Hidden Markov Models and Support Vector Machines in Sign Language Recognition. International Journal of Computer Applications, 138(7), 6-12. doi:10.5120/ijca2016908784.
[8] El-Fishawy, Z., Rizk, M., & Abdel-Wahab, M. A. (2018). Sign Language Recognition Using Kinect Sensor: A Review. In 2018 11th International Conference on Developments in eSystems Engineering (DeSE) (pp. 191-196). IEEE. doi:10.1109/DeSE.2018.00042.
[9] Guo, X.-L., & Yang, T.-T. (2016). Gesture recognition based on HMM-FNN model using a Kinect. Journal on Multimodal User Interfaces, 11. doi:10.1007/s12193-016-0215-x.
[10] Xia, W., Zhai, X., & Liu, Y. (2019). Sign language recognition using deep learning models: A review. ACM Transactions on Accessible Computing, 12(3), 1-30. doi:10.1145/3301417.
[11] Zhang, C., Yang, J., & Kim, G.-M. (2017). Sign Language Recognition with Convolutional Neural Networks Trained on Synthetic Data. In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (pp. 1-6). doi:10.1109/AVSS.2017.807856.
[12] Wang, K., Zhao, X., & Liu, J. (2019). Sign Language Recognition Using Temporal Convolutional Networks and Skeleton Data. IEEE Access, 7, 158074-158083. doi:10.1109/ACCESS.2019.2951043.
[13] Girdhar, R., Gkioxari, G., Torresani, L., & Paluri, M. (2019). PoseTCN: Efficient Convolutional Neural Networks for Human Pose Estimation and Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA (pp. 3332-3341). doi:10.1109/CVPR.2019.00344.
[14] Deng, Z., Wan, J., & Xie, X. (2020). PoseTGCN: A Temporal Graph Convolutional Network for 3D Human Pose Forecasting. IEEE Transactions on Neural Networks and Learning Systems, 32(1), 67-77. doi:10.1109/TNNLS.2020.3010193.
[15] Vanzo, A., Ciliberto, C., & Montagner, A. (2020). Sign Language Recognition Based on Pose Estimation with Temporal Graph Convolutional Networks. Sensors, 20(13), 3759. doi:10.3390/s20133759.
[16] Kardoostsiami, A., Shafiee, M. J., & Plataniotis, K. N. (2021). Sign Language Recognition with Inflated 3D Convolutional Networks. IEEE Transactions on Multimedia, 23, 313-326. doi:10.1109/TMM.2020.3043619.
[17] Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., & Paluri, M. (2015). Learning Spatiotemporal Features with 3D Convolutional Networks for Gesture Recognition. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4489-4497). doi:10.1109/ICCV.2015.510.
[18] Gavves, E., van de Weijer, T., & van Gemert, J. C. (2021). Transferring Knowledge from Text to Sign Language Video Recognition with Transformers. doi:10.1109/CVPRW50498.2021.00443.
[19] Chung, J. S., Kim, J., & Kim, J. (2021). Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation. arXiv preprint arXiv:2103.01197.
[20] Kondratiuk, S., Krak, I., Kuznetsov, V. A., & Kulias, A. (2022). Using the Temporal Data and Three-dimensional Convolutions for Sign Language Alphabet Recognition. CEUR Workshop Proceedings, 3137, 78-87.
[21] Kondratiuk, S., Krak, I., Kylias, A., & Kasianiuk, V. (2021). Fingerspelling Alphabet Recognition Using CNNs with 3D Convolutions for Cross Platform Applications. Advances in Intelligent Systems and Computing, Vol. 1246 AISC (pp. 585-596). doi:10.1007/978-3-030-54215-3_37.
[22] Ronchetti, F., Quiroga, F., Estrebou, C., Lanzarini, L., & Rosete, A. (2016). LSA64: A Dataset of Argentinian Sign Language. In XXII Congreso Argentino de Ciencias de la Computación (CACIC).
[23] Wang, H., Wang, Y., & Zeng, G. (2016). Sign Language Recognition Using 3D Convolutional Neural Networks. In Proceedings of the 24th ACM International Conference on Multimedia (MM '16), Amsterdam, The Netherlands (pp. 1033-1042). doi:10.1145/2964284.2964318.
[24] Lu, C., Chen, Y., & Lu, H. (2019). Sign Language Recognition Using 3D Convolutional Neural Networks with Softmax Probability Map. IEEE Access, 7, 116256-116267. doi:10.1109/ACCESS.2019.2937045.
[25] Yuan, L., Chen, Y., Wang, T., Gan, W., Liu, Z., & Kornilov, S. (2021). ViViT: A Video Vision Transformer for Efficient Video Recognition. arXiv preprint arXiv:2103.15691.
[26] Elnaggar, A., Mahmoud, A., Abdou, A., Elgammal, A., & Abdel-Razek, M. (2021). ViViT-Sign: A Video Vision Transformer for Sign Language Recognition. arXiv preprint arXiv:2104.07441.