<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>May</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
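        <article-title>Application of vision transformers and 3D convolutional neural networks for sign language cluster recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Serhii Smirnov</string-name>
          <xref ref-type="aff" rid="aff0"/>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nataliia Kuznietsova</string-name>
          <xref ref-type="aff" rid="aff0"/>
        </contrib>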
        <aff id="aff0">
          <label>0</label>
          <institution>National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”</institution>
          ,
          <addr-line>ave. Peremohy 37, Kyiv, 03056</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>3</volume>
      <issue>2023</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This study reviews methods for sign language recognition and analyzes the existing datasets in this area. It shows how different approaches can be developed and tested on real data. Models based on the Vision Transformer (ViViT) and the 3D Convolutions CNN (3dCNN) were built with different batch sizes and compared. The study also shows how to train models on different amounts of data and how to find a compromise between accuracy, speed, and overfitting. Our research provides insights into the strengths and limitations of the different models for this task, outlines unsolved problems, and offers a direction for improving existing methods in this area by using vision transformers.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction </title>
      <p>Sign language is a visual language used by people who are deaf or hard of hearing to communicate
with each other and with hearing individuals. It combines hand gestures, facial
expressions, and body language to convey meaning. While sign language is an effective means of
communication, non-signers often find it difficult to understand sign language users and to
communicate with them. This has led to the development of sign language recognition technology, which uses
computer algorithms to interpret sign language and translate it into spoken or written language. Sign
language recognition has the potential to improve communication and inclusion for people who are
deaf or hard of hearing. It also poses unique challenges, such as the need for accurate detection of
hand shape and movement, real-time recognition, and handling the complexity and variability of
different sign languages. Recent advances in machine learning, computer vision,
and sensor technology now make it possible to overcome these challenges and make sign language
recognition more accurate, efficient, and accessible. Machine learning techniques, such as deep
learning, can be used to learn a mapping between visual features extracted from video and the
corresponding sign language gestures. CNNs have limitations in capturing long-term dependencies and global
context, which are crucial for complex image understanding tasks such as object detection and
segmentation. Transformers, in contrast, have gained significant popularity in the field of computer vision
in recent years due to their ability to process images and videos as sequential data. In this article
we compare two approaches to sign language recognition and determine, for a real practical task,
which of them is more effective and more promising for improvement in future studies.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Sign language recognition problem statement </title>
      <p>
        Sign language is an essential mode of communication for deaf or hard-of-hearing individuals. Sign
language recognition (SLR) is a challenging task, as sign languages are highly complex, with a wide
range of variations and nuances. SLR involves identifying hand gestures, facial expressions, and
body movements to interpret the meaning of a sign [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. The goal of SLR is to develop systems that
can recognize sign language and translate it into written or spoken language, thereby enabling
communication between hearing and non-hearing individuals. In this section we discuss the
problems, limitations, unsolved issues, and existing solutions in SLR.
      </p>
      <p>One of the primary challenges in SLR is the complexity of sign languages. There are over 300 sign
languages used worldwide, each with its own grammar, vocabulary, and dialects. Furthermore, sign
languages are highly context-dependent, with the meaning of signs often varying depending on the
signer's location, age, gender, and culture. Thus, developing an SLR system that can accurately
recognize and interpret the nuances of different sign languages is a significant challenge.</p>
      <p>Another challenge in SLR is the variability in signing styles. Signers may use different hand
shapes, positions, and movements to convey the same message. Moreover, the speed and duration of
signs can vary, adding further complexity to the task. Thus, SLR systems must be robust to variations
in signing styles, as well as to variations in lighting conditions and camera angles. To address these
challenges, researchers have proposed various techniques, such as data augmentation, transfer
learning, and multi-modal fusion, which combines visual data with other modalities such as audio or depth
information.</p>
      <p>A further challenge in sign language recognition is collecting and annotating the large datasets
of sign language gestures that are required to train and evaluate machine learning models. This can
be particularly difficult for sign languages that are not widely used or documented. While several
datasets are available for SLR, they are relatively small, which limits the performance of machine
learning algorithms. A brief comparison of existing datasets for the sign recognition task is given in
Table 1. Moreover, annotating sign language data is time-consuming and difficult, as it requires the
involvement of sign language experts.</p>
      <p>Another issue in SLR is the lack of real-time performance. SLR systems often require high
computational resources, making it challenging to achieve real-time performance on mobile devices
or in low-resource settings.</p>
      <p>
        Researchers have proposed various solutions to deal with these challenges and limitations [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref3 ref4 ref5 ref6 ref7 ref8 ref9">3‒12</xref>
        ].
One approach is to use deep learning algorithms, such as convolutional
neural networks (CNNs) and recurrent neural networks (RNNs), to recognize signs from video
sources. These algorithms have shown promising results in SLR, achieving state-of-the-art
performance on several benchmark datasets.
      </p>
      <p>
        Another solution is to use depth sensors, such as Microsoft Kinect, to capture 3D motion data,
which can be used to recognize signs accurately [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Depth sensors are advantageous as they can
capture the 3D shape and position of the signer's hands, providing more robust and accurate sign
recognition.
      </p>
      <p>Furthermore, researchers have proposed the use of transfer learning, where models pre-trained on
large datasets such as ImageNet are fine-tuned on sign language datasets. This approach has been shown to
improve the performance of SLR models, particularly for low-resource sign language datasets.</p>
      <p>In summary, SLR is a challenging task that requires robust and accurate recognition of complex hand
gestures, facial expressions, and body movements. While several solutions have been proposed to
address the limitations and challenges of SLR, several issues remain unsolved, such as real-time
performance, variability in signing styles, and the lack of large-scale annotated datasets. With the
development of more advanced algorithms and the availability of larger annotated datasets, it is
hoped that these challenges can be addressed, enabling better communication between hearing and
non-hearing individuals. Overall, sign language recognition remains an important problem with
potential applications in fields such as assistive technologies, education, and communication for deaf
and hard-of-hearing individuals.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Brief overview of sign language recognition methods </title>
      <p>There are various methods for sign language recognition (SLR) that have been proposed and tested
over the years. Here are some of the most widespread methods for SLR:</p>
      <p>1. Template matching is a simple and intuitive approach in which the hand movements of the
signer are matched against a predefined set of templates to recognize the sign [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. The
method involves capturing a series of hand poses and storing them as
templates. During recognition, the input sequence is compared to each template, and the sign
is identified based on the closest match. While this method is easy to implement, it is limited
by the need to define the templates manually and by its inability to handle variations in
signing styles.</p>
      <p>2. Hidden Markov Models (HMMs) are probabilistic models, widely used in speech recognition,
that can model the temporal dependencies in sign language by capturing the transitions
between hand shapes and movements. The method involves training an HMM on a dataset of
sign language gestures and using it to recognize signs in new sequences [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. However,
HMMs can be limited in capturing the complex and context-dependent variations in sign
language.</p>
      <p>3. Support Vector Machines (SVMs) are machine learning algorithms that can classify sign
language sequences by finding the hyperplane that separates the data points into their
respective classes [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. SVMs have been shown to achieve good accuracy in SLR, but they
require large amounts of training data and may not be robust to variations in signing styles.</p>
      <p>4. 3D depth sensors, such as Microsoft Kinect, make it possible to
capture the 3D shape and position of the signer’s hands. This approach has the advantage of
being more robust to variations in lighting and camera angles, and it captures the depth
information of the hand movements [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. The depth information can be used to recognize
signs more accurately.</p>
      <p>5. Deep learning methods, such as convolutional neural networks (CNNs) and recurrent neural
networks (RNNs), have been used in SLR and have shown significant improvements in
performance. CNNs can learn the spatial features of sign language sequences, while RNNs
can capture the temporal dependencies between hand movements [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. Additionally,
attention mechanisms can be used to focus on the most relevant parts of the sequence,
improving the accuracy of recognition.</p>
      <p>Several deep learning approaches have been proposed for this task and can be quite
efficient in real applications.</p>
      <p>
        PoseTCN is a deep learning model that uses temporal convolutional networks (TCNs) to capture
the temporal dependencies in sign language gestures. PoseTCN takes as input a sequence of 3D hand
pose data and outputs the recognized sign. The model uses dilated convolutions to increase the
receptive field of the network and improve the model's ability to capture long-term dependencies [
        <xref ref-type="bibr" rid="ref12 ref13">12,
13</xref>
        ].
      </p>
      <p>
        PoseTGCN is a deep learning model that uses a graph convolutional network (GCN) to capture the
spatial dependencies between the joints in sign language gestures, and a temporal convolutional
network (TCN) to capture the temporal dependencies. The model takes as input a sequence of 3D
joint positions and outputs the recognized sign. The GCN operates on a graph structure where the
joints are nodes and the edges represent the spatial relationships between them [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. The TCN
operates on the resulting feature maps and uses dilated convolutions to capture long-term
dependencies.
      </p>
      <p>
        Inflated 3D ConvNet (I3D) is a deep learning model that uses a 3D convolutional neural
network (CNN) to extract spatio-temporal features from sign language gestures. The model takes as
input a sequence of RGB or depth frames and outputs the recognized sign. The 3D CNN
is pre-trained on large-scale video datasets, such as Kinetics or Sports-1M, and fine-tuned on the sign
language recognition task [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. The pre-training allows the model to learn generalizable features
that can be applied to sign language gestures.
      </p>
      <p>
        Sign Language Transformers (SLT) is a transformer-based model that uses self-attention
mechanisms to learn the spatial and temporal features of sign language gestures. SLT takes as input a
sequence of RGB or depth frames and outputs the recognized sign. The model uses a
pre-trained backbone network, such as ResNet or EfficientNet, to extract visual features from the frames,
which are then fed into a transformer encoder-decoder architecture [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ]. The attention
mechanisms in the model allow it to focus on the most relevant parts of the sequence and improve
recognition accuracy. This approach is now widely used in continuous sign language translation.
      </p>
      <p>Transformers were originally designed for natural language processing (NLP) tasks, where they
excel at capturing long-range dependencies and global context. They achieve this by incorporating
self-attention mechanisms that allow the model to weigh the importance of different parts of the input
sequence when making predictions. The same mechanism can be applied to images by treating each
pixel or patch as a token, allowing the model to attend to different parts of the image when making
predictions. Another advantage of transformers in computer vision is their ability to handle variable
input sizes without requiring resizing or cropping. This is important for tasks such as object detection
and segmentation where the size and aspect ratio of the objects can vary significantly. Additionally,
transformers can leverage pre-training on large amounts of data, allowing them to learn useful
representations that can be fine-tuned on smaller datasets for specific tasks. Overall, the use of
transformers in computer vision has shown promising results, outperforming traditional CNN-based
architectures on various benchmarks and achieving state-of-the-art results on challenging tasks such
as image captioning and visual question answering.</p>
      <p>In summary, there are various deep learning models that can be used for sign language recognition,
such as PoseTGCN, I3D, PoseTCN, and Sign Language Transformers (SLT). These models differ in
their architecture and their ability to capture spatial and temporal dependencies in sign language
gestures.</p>
      <p>
        In Ukraine the task of sign recognition is highly relevant in the context of the war and the
need to develop government programs that support people who have suffered from the war and have
hearing problems, and that assist their inclusion in society. Several works by
national scientists have investigated the problem of sign recognition and proposed
techniques and systems for communication and translation into sign language [
        <xref ref-type="bibr" rid="ref20 ref21">20‒21</xref>
        ].
Nevertheless, unsolved issues remain, and the need for new approaches and for adapting the existing
methods is quite high.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Practical task of video interpretation for sign language </title>
      <p>The goal of this article was to test the approaches mentioned above: to build a simple
transformer-based model for sign language recognition and to compare its efficiency with the standard
approach based on 3D convolutions. The idea was to determine, at a baseline level (without
any neural network tricks or additional techniques), which approach is more accurate for our
task.</p>
      <p>4.1. Dataset</p>
      <p>
        For our experiments the LSA64 dataset [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] was used. LSA64 is a collection of video sequences designed for the task of sign
language recognition in Argentinian Sign Language (LSA). The dataset was created by researchers at the
National University of Córdoba in Argentina, and contains 64 different LSA signs performed by 20
signers (10 male and 10 female).
      </p>
      <p>The videos were recorded in a controlled environment using a high-definition camera and have a
resolution of 1280x720 pixels at 25 frames per second. Each sign was performed five times by each
signer, resulting in a total of 6400 video sequences.</p>
      <p>The LSA64 dataset also includes ground-truth annotations for each video, indicating the start and
end frames of each sign. These annotations were made manually by experts in LSA.</p>
      <p>The LSA64 dataset is quite challenging due to variations in signing speed, camera
viewpoint, and lighting conditions, making it a valuable resource for researchers working on
robust and accurate sign language recognition algorithms (see some examples in Figure 1
and Figure 2).</p>
      <p>Figure 1: Screenshots of some examples of LSA64 dataset for sign language recognition </p>
      <p>For our experiments we first labeled the dataset and then clustered the labels into 3
logical groups. This was done to have more labeled examples in each of the ground-truth classes. The classes are:
“colors” signs (comprising the initial classes “red”, “green”, “yellow”, “light-blue”), “food”
signs (comprising the initial classes “sweet milk”, “water”, “food”) and “verbs” signs (comprising
the initial classes “help”, “thanks”). The labels were randomly split into train and test sets:
385 labels went to the train set and 165 labels to the test set. Note that the proportion of
each class was preserved in both the train and test sets. The overall class proportions in the data are
36.3636% for the first class, 36.3636% for the second class, and 27.2727% for the third class.</p>
      <p>4.2. Used approaches</p>
      <p>This subsection describes in more detail the transformer-based and 3D convolution-based models
that were used in the experiments.</p>
    </sec>
    <sec id="sec-5">
      <title>4.2.1. 3D convolutions </title>
      <p>
        A neural network that uses 3D convolutions for video analysis typically consists of a stack
of 3D convolutional, pooling, and fully connected layers [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>3D convolutions are a type of convolutional layer that considers the spatial and temporal
dimensions of the input data. In the case of video analysis, the input is a sequence of frames, and the
3D convolutional layer applies a kernel to each frame and its neighboring frames in the temporal
dimension to extract features that capture both spatial and temporal information. This allows the
model to learn patterns and movements over time, which is crucial for tasks such as action recognition
and gesture recognition (see Figure 3) [24].</p>
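      <p>The sliding of a kernel over space and time can be illustrated with a minimal pure-Python sketch. It is not the implementation used in the experiments: the function name conv3d_valid, the single-channel input, the stride of 1, and the absence of padding are all simplifying assumptions made for illustration.</p>
      <preformat>
```python
def conv3d_valid(video, kernel):
    """Naive single-channel 3D convolution (cross-correlation), stride 1, no padding.

    video:  nested lists indexed [t][y][x]
    kernel: nested lists indexed [dt][dy][dx]
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                # Weighted sum over a kt x kh x kw block: the kernel covers
                # neighbouring pixels and neighbouring frames at the same time.
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy clip of three 2x2 frames: the scene brightens between frame 0 and frame 1
# and then stays constant.
clip = [
    [[0, 0], [0, 0]],
    [[1, 1], [1, 1]],
    [[1, 1], [1, 1]],
]
# Hypothetical 2x1x1 kernel that subtracts the previous frame from the current one,
# so it responds only to change over time.
temporal_difference = [[[-1]], [[1]]]
motion = conv3d_valid(clip, temporal_difference)
```
      </preformat>
      <p>With this kernel the first output frame is non-zero (the brightness changed) and the second is zero (nothing moved), which is the kind of temporal cue a purely 2D convolution applied frame by frame cannot produce.</p>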
      <sec id="sec-5-1">
        <title>Figure 3: Comparison of 2D (a) and 3D (b) convolutions</title>
        <p>After the 3D convolutional layers, pooling layers are often used to downsample the feature maps
and reduce the spatial dimensionality of the data. This helps to reduce the number of parameters in the
model and prevent overfitting.</p>
        <p>Finally, fully connected layers are used to classify the input video sequence into one or more
classes. These layers take the flattened feature maps from the convolutional layers and apply a set of
weights to produce a probability distribution over the possible classes.</p>
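        <p>How the feature-map size shrinks through such a stack, and how raw class scores become a probability distribution, can be traced with a short sketch. The layer configuration below (two blocks of a 3x3x3 convolution with padding 1 followed by 2x2x2 pooling with stride 2, and 16 filters) is a hypothetical example, not the architecture from the experiments; only the input size (25, 64, 64) is taken from the text.</p>
        <preformat>
```python
import math


def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Input clip: 25 frames of 64x64 pixels (channels are handled by the filters).
shape = (25, 64, 64)

# Assumed architecture: two identical conv + pool blocks.
for _ in range(2):
    shape = tuple(conv_out(s, 3, padding=1) for s in shape)  # 3x3x3 conv keeps the size
    shape = tuple(conv_out(s, 2, stride=2) for s in shape)   # 2x2x2 pooling halves it

channels = 16  # assumed number of filters in the last convolutional layer
flat_features = channels * shape[0] * shape[1] * shape[2]

# Made-up scores for the three sign clusters (colors, food, verbs).
probs = softmax([2.0, 1.0, 0.1])
```
        </preformat>
        <p>Under these assumptions the feature maps shrink from (25, 64, 64) to (6, 16, 16), so the first fully connected layer receives 24576 values per clip.</p>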
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4.2.2. Vision transformer </title>
      <p>The approach of Video Vision Transformer (ViViT) [25, 26] involves dividing the video into
small spatiotemporal regions of interest, called tubelets, and processing them using self-attention
mechanisms similar to those used in the NLP models.</p>
      <p>Here is a brief overview of how the ViViT model with tubelet embeddings works:</p>
      <p>1. Tubelet extraction: The first step is to extract tubelets from the input video. In the original
ViViT formulation the tubelets are non-overlapping spatio-temporal blocks cut from the video on a
regular grid; region-based alternatives built on object detection, tracking, or motion analysis
are also possible. Each tubelet is represented as a sequence of T frames, where each frame is an
H x W x C tensor representing the pixel values of the video frame (see Figure 4).</p>
      <p>Figure 4: Tubelet embedding</p>
      <p>2. Flattening and linear projection: Each frame in the tubelet is flattened into a sequence of
patches, and these patches are then linearly projected into a higher-dimensional embedding
space of size D using a trainable linear layer. This results in a sequence of patch embeddings
for each frame in the tubelet.</p>
      <p>3. Multi-head self-attention: The projected sequences are then passed through multi-head
self-attention layers. Each layer computes attention weights between all pairs of patch embeddings
in the sequence, and uses these weights to compute a weighted sum of the patch embeddings.
This allows the model to attend to different regions of the tubelet depending on the task at
hand.</p>
      <p>4. Feedforward network: After each self-attention layer, the output is passed through a
feedforward network with a ReLU activation function. This network applies a linear
transformation to the input, followed by a non-linear activation function. This helps the model
capture more complex relationships between the patch embeddings in the sequence.</p>
      <p>5. Aggregation: Finally, the output of the last self-attention layer is aggregated across all frames
in the tubelet to obtain a single vector representation for the tubelet. These vectors are then
passed through a linear layer to predict the class label for the entire tubelet.</p>
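      <p>The first two steps above can be sketched in plain Python. This is a simplified illustration that assumes a regular grid of non-overlapping tubelets; the helper names (tubelet_grid, extract_tubelet, linear_projection) and the tubelet size (5, 8, 8) are hypothetical and are not taken from the experiments. Only the clip size (25, 64, 64, 3) comes from the text.</p>
      <preformat>
```python
def tubelet_grid(video_shape, tubelet_shape):
    """Count non-overlapping tubelets and the length of each flattened tubelet.

    video_shape:   (T, H, W, C) of the input clip
    tubelet_shape: (t, h, w) of one tubelet
    """
    T, H, W, C = video_shape
    t, h, w = tubelet_shape
    # Frames or pixels that do not fill a whole tubelet are dropped.
    n_t, n_h, n_w = T // t, H // h, W // w
    num_tokens = n_t * n_h * n_w
    token_length = t * h * w * C  # values per tubelet before the projection
    return (n_t, n_h, n_w), num_tokens, token_length


def extract_tubelet(video, origin, tubelet_shape):
    """Flatten one tubelet of a nested-list video indexed [frame][row][col][channel]."""
    t0, y0, x0 = origin
    t, h, w = tubelet_shape
    return [value
            for frame in video[t0:t0 + t]
            for row in frame[y0:y0 + h]
            for pixel in row[x0:x0 + w]
            for value in pixel]


def linear_projection(token, weights, bias):
    """Map a flattened tubelet to a D-dimensional embedding (D = number of weight rows)."""
    return [sum(w_i * v for w_i, v in zip(row, token)) + b
            for row, b in zip(weights, bias)]


# Clip size used in the experiments, split with an assumed tubelet size.
grid, num_tokens, token_length = tubelet_grid((25, 64, 64, 3), (5, 8, 8))
```
      </preformat>
      <p>With these assumed sizes the clip yields a 5 x 8 x 8 grid, that is 320 tokens of 960 raw values each, and every token is then mapped to D dimensions by the trainable projection.</p>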
      <p>By processing tubelets using self-attention mechanisms, ViViT with tubelet embeddings is able to
better capture the spatiotemporal relationships between different regions of the video, resulting in
improved performance on video recognition tasks such as action recognition and video classification.</p>
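      <p>The attention and aggregation steps can be illustrated in the same spirit. The sketch below is a single-head, scaled dot-product self-attention in which the query, key, and value projections are replaced by the identity, followed by mean pooling over tokens. Both simplifications are assumptions made for brevity: a real layer learns separate projection matrices and uses several heads, and the aggregation used in the experiments may differ.</p>
      <preformat>
```python
import math


def softmax(scores):
    """Normalize scores into attention weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Queries, keys and values are the tokens themselves (identity projections).
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this token to every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Weighted sum of all tokens, one coordinate at a time.
        out.append([sum(wt * v[i] for wt, v in zip(weights, tokens)) for i in range(d)])
    return out


def mean_pool(tokens):
    """Aggregate a sequence of token vectors into a single vector."""
    n = len(tokens)
    return [sum(tok[i] for tok in tokens) / n for i in range(len(tokens[0]))]


# Three toy token embeddings of dimension 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)
clip_vector = mean_pool(attended)  # would be fed to the final linear classifier
```
      </preformat>
      <p>Every output token is a convex combination of all input tokens, which is what lets each tubelet embedding draw on information from any other position in the clip.</p>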
    </sec>
    <sec id="sec-7">
      <title>5. Modelling &amp; Results </title>
      <p>In our experiments we compared the vision transformer (ViViT) and the 3D convolutions CNN (3dCNN)
on our dataset, training each of them with different batch sizes (3, 32, 128, and 385, the
size of the train dataset). Each of the selected models was trained for 30 epochs with a learning rate
of 1e-5 and an input video size of (25, 64, 64, 3). During training the history of the
accuracies is stored, and for each approach only the model weights that
had the best test accuracy are saved and loaded. This was done to limit possible model overfitting. Table 2
presents the comparison of the models, where top-1 (also known as accuracy) and top-2 are the top-K accuracy
metrics calculated by formula (1) below:</p>
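      <p>
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math>\text{top-}k\ \text{accuracy}(y, \hat{f}) = \frac{1}{n} \sum_{i=0}^{n-1} \sum_{j=1}^{k} 1(\hat{f}_{i,j} = y_i)</tex-math>
        </disp-formula>
where <inline-formula><tex-math>\hat{f}_{i,j}</tex-math></inline-formula> is the predicted class for the i-th sample corresponding to the j-th largest predicted score, <inline-formula><tex-math>y_i</tex-math></inline-formula>
is the corresponding true value, n is the number of samples, k is the number of guesses allowed and 1(x) is the indicator function.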
Top-K accuracy is often used for sign language recognition problems because it is a useful metric for
evaluating model performance when there is a large number of possible signs and variations, and
it accounts for the flexibility required in recognizing signs.</p>
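      <p>The metric can be computed directly from the per-class scores. The following sketch is a plain-Python illustration, not the code used to produce Table 2; the function name top_k_accuracy and the example scores are made up, and ties between scores are broken by class index, which is an assumption.</p>
      <preformat>
```python
def top_k_accuracy(y_true, scores, k=1):
    """Fraction of samples whose true class is among the k highest-scored classes.

    y_true: list of true class indices, one per sample
    scores: list of per-class score lists, one per sample
    """
    hits = 0
    for label, row in zip(y_true, scores):
        # Class indices ordered from the highest to the lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(y_true)


# Made-up scores for four samples over the three clusters (colors, food, verbs).
y_true = [0, 1, 2, 2]
scores = [
    [0.7, 0.2, 0.1],  # true class ranked first
    [0.5, 0.4, 0.1],  # true class ranked second
    [0.6, 0.3, 0.1],  # true class ranked last
    [0.1, 0.2, 0.7],  # true class ranked first
]
top1 = top_k_accuracy(y_true, scores, k=1)  # 2 of 4 samples are hits
top2 = top_k_accuracy(y_true, scores, k=2)  # 3 of 4 samples are hits
```
      </preformat>
      <p>Note that with only three classes, top-3 accuracy is always 1, so top-1 and top-2 are the only informative values of k for the clustered task.</p>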
      <p>Table 2 shows that the ViViT models with small batch sizes achieve higher quality. However, taking
the accuracy on the training dataset into account as well, we can notice that such models overfit more
easily. This means that the ViViT model reaches high quality faster during training (see Figure
5). On the other hand, future investigations should apply more techniques for mitigating model
overfitting (e.g., augmentation), especially when working with small amounts of data. This behavior
can be traced in the plots below (Figures 6-11). Note that in Figures 6-11 the loss and accuracy are
normalized to be presented on the same scale, which makes it possible to analyze overfitting
there.</p>
      <p>We see that from the 20th epoch the accuracy of the 3D convolution network (3dCNN/128)
increases sharply. On the other hand, for the ViViT/3 and ViViT/32 models (Figures 6-7) the accuracy
on both the training and test datasets grows, but the test loss grows as well, while the
loss on the training dataset decreases. This confirms that the model learned the
training dataset well but had difficulties on the test dataset. As for the 3D convolution networks
(Figures 10-11), the losses on both the training and test datasets decrease at a similar rate while
the accuracy increases.</p>
      <sec id="sec-7-1">
        <title>Figure 5: Accuracy on a test set for the different models on epochs scale</title>
        <p>Figure 8: Train and test loss and accuracy comparison of ViViT/128 model on epochs scale</p>
        <p>Figure 11: Train and test loss and accuracy comparison of 3dCNN/128 model on epochs scale</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>6. Conclusion </title>
      <p>In this paper we discussed the most relevant practices and approaches for sign language
recognition. While significant progress has been made in sign language recognition using modern
methods, there are still some important issues that remain unsolved:</p>
      <p>1. Large variability in sign language. Sign language can vary widely across different regions,
cultures, and even individuals. This variability poses a significant challenge for sign language
recognition systems, which must be robust to these variations.</p>
      <p>2. The availability of large, diverse datasets is crucial for training and evaluating machine
learning models for sign language recognition. However, there is still a limited availability of such
datasets, particularly for less widely spoken sign languages.</p>
      <p>3. Real-time sign language recognition is important for many applications, such as assistive
technology and communication. However, real-time recognition remains a challenge, as it requires
processing sign language videos in real-time, which can be computationally intensive.</p>
      <p>4. Sign language gestures can be occluded or noisy due to factors such as clothing, lighting, and
background clutter. Handling these occlusions and noise is still a challenge for sign language
recognition systems.</p>
      <p>5. Sign language recognition systems are typically trained on a limited set of sign language
gestures, which can impact their ability to recognize new or rare signs.</p>
      <p>In this paper the task of sign recognition was solved for a real dataset. It was shown how to
apply the existing approaches to different amounts of the available data and what to do to obtain
higher accuracy. Our experiments aimed to compare the performance of the Vision Transformer
(ViViT) and 3D Convolutions CNN (3dCNN) models on sign language recognition. We trained both
models with different batch sizes and evaluated their accuracy using the Top-K metric. Our analysis
shows that ViViT models with small batch sizes achieved higher quality, but were more prone to
overfitting. The best numerical results obtained were 69.7% on test-top1 and 89.7% on
test-top2 for ViViT/3, and 89.7% on test-top1 and 90.3% on test-top2 for ViViT/32.
To address the issue of overfitting, future investigations should explore approaches such as data
augmentation to improve the generalization of ViViT models, especially when working with limited
amounts of data.</p>
      <p>The practical value of this paper is that it also shows, on a real dataset, how and why it is necessary
to find a compromise between speed, accuracy, and overfitting, taking the size of
the dataset into account, and how the existing methods need to be improved.</p>
      <p>Overall, our results provide insights into the strengths and limitations of different models
for sign language recognition, as well as their practical implementation. The paper offers a
starting point for further research on sign language recognition that uses vision transformers in
conjunction with additional approaches, for example pose estimation or hand recognition, applied
before the classification itself.</p>
    </sec>
    <sec id="sec-9">
      <title>7. References </title>
      <p>[24] Lu, C., Chen, Y., &amp; Lu, H. (2019). Sign Language Recognition Using 3D Convolutional Neural
Networks with Softmax Probability Map. IEEE Access, 7, 116256-116267. doi:
10.1109/ACCESS.2019.2937045.</p>
      <p>[25] Yuan, L., Chen, Y., Wang, T., Gan, W., Liu, Z., &amp; Kornilov, S. (2021). ViViT: A Video Vision
Transformer for Efficient Video Recognition. arXiv preprint arXiv:2103.15691.</p>
      <p>[26] Elnaggar, A., Mahmoud, A., Abdou, A., Elgammal, A., &amp; Abdel-Razek, M. (2021). ViViT-Sign:
A Video Vision Transformer for Sign Language Recognition. arXiv preprint arXiv:2104.07441.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hameed</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.A.</given-names>
            <surname>Al-Jumaily</surname>
          </string-name>
          ,
          <article-title>"Sign language recognition: Dataset and challenges"</article-title>
          ,
          <source>in Proceedings of the 2019 International Conference on Innovations in Intelligent Systems and Applications</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Sehili</surname>
            ,
            <given-names>M. E. A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Melkemi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>Challenges and Opportunities in Sign Language Recognition</article-title>
          .
          <source>IEEE Access</source>
          ,
          <volume>9</volume>
          ,
          <fpage>52470</fpage>
          -
          <lpage>52487</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ACCESS.2021.3062127</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Murino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Regazzoni</surname>
          </string-name>
          ,
          <article-title>Template Matching Techniques in Computer Vision: Theory and Practice</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>22</volume>
          (
          <year>2000</year>
          )
          <fpage>743</fpage>
          -
          <lpage>760</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/34.85661</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Template matching-based human action recognition in videos</article-title>
          .
          <source>Pattern Recognition</source>
          ,
          <volume>43</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1199</fpage>
          -
          <lpage>1206</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.patcog.2009.07.022</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Makris</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ellis</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Sign language recognition using hidden Markov models</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics)</source>
          ,
          <volume>36</volume>
          (
          <issue>3</issue>
          ),
          <fpage>514</fpage>
          -
          <lpage>524</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/TSMCB.2005.856082</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Garg</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Saraswat</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Sign Language Recognition Using Hidden Markov Models and HOG Features</article-title>
          .
          <source>In 2016 International Conference on Signal Processing and Communication (ICSC)</source>
          (pp.
          <fpage>717</fpage>
          -
          <lpage>722</lpage>
          ). IEEE. doi:
          <pub-id pub-id-type="doi">10.1109/ICSC.2016.7953781</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Kumari</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>S. K.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>A Review on Hidden Markov Models and Support Vector Machines in Sign Language Recognition</article-title>
          .
          <source>International Journal of Computer Applications</source>
          ,
          <volume>138</volume>
          (
          <issue>7</issue>
          ),
          <fpage>6</fpage>
          -
          <lpage>12</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.5120/ijca2016908784</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>El-Fishawy</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rizk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Abdel-Wahab</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Sign Language Recognition Using Kinect Sensor: A Review</article-title>
          .
          <source>In 2018 11th International Conference on Developments in eSystems Engineering (DeSE)</source>
          (pp.
          <fpage>191</fpage>
          -
          <lpage>196</lpage>
          ). IEEE. doi:
          <pub-id pub-id-type="doi">10.1109/DeSE.2018.00042</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>X.-L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>T.-T.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Gesture recognition based on HMM-FNN model using a Kinect</article-title>
          .
          <source>Journal on Multimodal User Interfaces</source>
          ,
          <volume>11</volume>
          . doi:
          <pub-id pub-id-type="doi">10.1007/s12193-016-0215-x</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Xia</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Sign language recognition using deep learning models: A review</article-title>
          .
          <source>ACM Transactions on Accessible Computing</source>
          ,
          <volume>12</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/3301417</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>G-M.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Sign Language Recognition with Convolutional Neural Networks Trained on Synthetic Data</article-title>
          .
          <source>In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)</source>
          (pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/AVSS.2017.807856</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
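          <string-name>
            <given-names>Keze</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaoguang</given-names>
            <surname>Zhao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jing</given-names>
            <surname>Liu</surname>
          </string-name>
          .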
          <article-title>Sign Language Recognition Using Temporal Convolutional Networks and Skeleton Data</article-title>
          .
          <source>IEEE Access</source>
          , vol.
          <volume>7</volume>
          , pp.
          <fpage>158074</fpage>
          -
          <lpage>158083</lpage>
          ,
          <year>2019</year>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ACCESS.2019.2951043</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Girdhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gkioxari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Torresani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Paluri</surname>
          </string-name>
          ,
          <article-title>"PoseTCN: Efficient Convolutional Neural Networks for Human Pose Estimation and Action Recognition,"</article-title>
          <source>in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , Long Beach, CA, USA,
          <year>2019</year>
          , pp.
          <fpage>3332</fpage>
          -
          <lpage>3341</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/CVPR.2019.00344</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>PoseTGCN: A Temporal Graph Convolutional Network for 3D Human Pose Forecasting</article-title>
          .
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          ,
          <volume>32</volume>
          (
          <issue>1</issue>
          ),
          <fpage>67</fpage>
          -
          <lpage>77</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/TNNLS.2020.3010193</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Vanzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Carlo</given-names>
            <surname>Ciliberto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alberto</given-names>
            <surname>Montagner</surname>
          </string-name>
          ,
          <article-title>"Sign Language Recognition Based on Pose Estimation with Temporal Graph Convolutional Networks,"</article-title>
          <source>Sensors</source>
          <volume>20</volume>
          (
          <issue>13</issue>
          ),
          <elocation-id>3759</elocation-id>
          ,
          <month>July</month>
          <year>2020</year>
          . doi:
          <pub-id pub-id-type="doi">10.3390/s20133759</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Kardoostsiami</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shafiee</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Plataniotis</surname>
            ,
            <given-names>K. N.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>Sign Language Recognition with Inflated 3D Convolutional Networks</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <volume>23</volume>
          ,
          <fpage>313</fpage>
          -
          <lpage>326</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/TMM.2020.3043619</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torresani</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ray</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Paluri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Learning Spatiotemporal Features with 3D Convolutional Networks for Gesture Recognition</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          (pp.
          <fpage>4489</fpage>
          -
          <lpage>4497</lpage>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/ICCV.2015.510</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
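          <string-name>
            <given-names>Efstratios</given-names>
            <surname>Gavves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Thomas</given-names>
            <surname>van de Weijer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jan C.</given-names>
            <surname>van Gemert</surname>
          </string-name>
          .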
          <article-title>Transferring Knowledge from Text to Sign Language Video Recognition with Transformers</article-title>
          .
          <year>2021</year>
          . doi:
          <pub-id pub-id-type="doi">10.1109/CVPRW50498.2021.00443</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Chung</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation</article-title>
          .
          <source>arXiv preprint arXiv:2103.01197</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kondratiuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Krak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.A.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kulias</surname>
          </string-name>
          ,
          <article-title>Using the Temporal Data and Three-dimensional Convolutions for Sign Language Alphabet Recognition</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2022</year>
          ,
          <volume>3137</volume>
          , pp.
          <fpage>78</fpage>
          -
          <lpage>87</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kondratiuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Krak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kylias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kasianiuk</surname>
          </string-name>
          .
          <article-title>Fingerspelling Alphabet Recognition using Cnns with 3d Convolutions for Cross Platform Applications</article-title>
          .
          <source>Advances in Intelligent Systems and Computing</source>
          . Vol.
          <volume>1246</volume>
          AISC.
          <year>2021</year>
          , pp.
          <fpage>585</fpage>
          -
          <lpage>596</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/978-3-030-54215-3_37</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Ronchetti</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quiroga</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Estrebou</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lanzarini</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Rosete</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>LSA64: A Dataset of Argentinian Sign Language</article-title>
          .
          <source>In XXII Congreso Argentino de Ciencias de la Computación (CACIC)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Heng</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yang</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Gang</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <article-title>“Sign Language Recognition Using 3D Convolutional Neural Networks,”</article-title>
          <source>in Proceedings of the 24th ACM international conference on Multimedia (MM '16)</source>
          , Amsterdam, The Netherlands, October 15-19,
          <year>2016</year>
          , pp.
          <fpage>1033</fpage>
          -
          <lpage>1042</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/2964284.2964318</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>