Sign Language Fingerspelling Recognition using Synthetic Data

Frank Fowley1,2 and Anthony Ventresque1
1 School of Computer Science, University College Dublin
2 SFI Centre for Research Training in Digitally-Enhanced Reality (D-REAL)

Abstract. Sign Language Recognition (SLR) is a Computer Vision (CV) and Machine Learning (ML) task, with potential applications that would be beneficial to the Deaf community, which includes not only deaf persons but also hearing people who use Sign Languages. SLR is particularly challenging due to the lack of training datasets for CV and ML models, which impacts their overall accuracy and robustness. In this paper, we explore the use of synthetic images to augment a dataset of fingerspelling signs and we evaluate whether this could be used to reliably increase the performance of an SLR system. Our model is based on a pretrained convolutional network, fine-tuned using synthetic images, and tested using a corpus dataset of real recordings of native signers. A recognition accuracy of 71% was achieved using skeletal wireframe image training datasets and the MediaPipe pose estimation model in the test pipeline. This compares favourably with state-of-the-art CV models, which achieve up to 62% accuracy on "in-the-wild" fingerspelling test datasets.

Keywords: Sign Language Recognition · Synthetic Data · Data Augmentation · Convolutional Neural Network · Pose Estimation Model

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Deaf advocacy organizations maintain that the use of Sign Languages is a core right and can ensure that deaf people fully participate in society at large [1, 13]. However, in Ireland, the low number of Irish Sign Language (ISL) interpreters has led to their use being confined to important event contexts [4]. To address this issue, the development of a practical automated real-time ISL interpreter could have many applications in areas such as public service information, the Internet and social media, and in transport and medical contexts. Mobile and cloud-based interpreting applications and services could lead to increased freedom of expression, equal access to education and employment, as well as participation in cultural, sporting and entertainment activities [4].

Because of their greater degrees of articulatory freedom, Sign Languages have a richer and more complex phonology than spoken languages [11, 12]. In this paper, we focus on fingerspelling, which is used not only for spelling out proper names, place names and abbreviations, but also as hand shapes for signs, such as the "m" hand shape used in the articulation of the "mother" sign. There are 23 static letter signs and 3 moving letter signs ("J", "X" and "Z") in the ISL fingerspelling alphabet, which is shown in figure 1.

Fig. 1. The ISL Fingerspelling Alphabet - Static signs. (Source: Irish Deaf Society)

The Computer Vision task of recognising these fingerspelling signs is challenging due to the high degree of variation in signer fluency and linguistic effects such as co-articulation. SLR systems require a high degree of distortion invariance while maintaining high accuracy and low misclassification rates. Convolutional Neural Networks (CNNs) have proved useful in extracting features from images even under geometric transformations of the objects [10], but they rely heavily on large datasets to avoid over-fitting.
The effect of over-fitting is a degradation in model performance when the model is applied to unseen test data. One of the fundamental challenges in applying CV/ML to SLR is the sparsity of datasets and corpus content of sufficient scale, and in a suitable format, to be useful as training input [6]. This is partially due to the absence of written forms of Sign Languages and the costly nature of annotation [25]. Classical deep learning mitigation techniques can be used to address this issue: data augmentation, which expands the training dataset and improves the generalisation of neural networks; and transfer learning [22, 24], which refers to the use of a deep neural network, already pre-trained on one dataset, that is then fine-tuned with a new dataset and a new set of classes [18].

In this paper, we describe a novel method to overcome the lack of ISL training data by generating synthetic images at scale, applying transfer learning techniques to leverage the feature extraction capabilities of popular CNNs, and deploying current pose estimation models in the recognition process.

The paper is structured as follows: Section 2 details some related work, before we describe our research methods in Section 3. The experimental results are outlined in Section 4 and we conclude with a discussion of the results and future directions in Section 5.

2 Related Work

Many classical approaches have been used for sign and posture recognition, based on statistical pattern recognition and other shallow-learning techniques that require the initial definition of object features [13]. Farouk et al. [44] used synthetic images of ISL fingerspelling hand shapes to create a recognition model based on Principal Component Analysis (PCA). Experiments based on intrusive motion-capture equipment, such as gloves and wearable sensors, have been performed to create Sign Language interpreters [2], but they are of limited interest given the holistic nature of Sign Languages, which includes features such as facial expressions. We have not included results from works based on plethysmography or electromyographic techniques, as their use is considered by the Deaf community to be intrusive to the signer and therefore impractical [6, 41, 42].

Most published results for Sign Language fingerspelling recognition, including ISL, have been obtained in controlled environments where the training and test data are derived from the same subjects and in similar data capture conditions. The performance degradation of such models when applied to unseen, "real-world" domains, and the corresponding "domain adaptation" problem, is well documented [43]. To overcome this, our work is focused on the more challenging scenario where there is a separation between the training and test domains. The state-of-the-art figures cited in this paper were obtained from works conducted in non-controlled, "in-the-wild" settings.

Sign Language Technology
In their systematic survey of Sign Language computational research, Zeledón et al. [15] highlight the low availability of commercial applications for Sign Language translation, confined in the main to synthesis rather than recognition, and restricted to particular domains.
They report that the performance measures for state-of-the-art machine translation systems are between 70% and 80% (BLEU score)3 and between 20% and 30% (WER score)4, citing the limited availability of well-defined Sign Language grammars, as well as the challenge of properly annotating corpus recordings, as reasons for the low performance in comparison to spoken language machine translation systems. Rastgoo et al. also state that the current state-of-the-art is focused on phonetic sign recognition rather than the lexicological and semantic areas, which require more complex models [20].

3 BLEU (Bilingual Evaluation Understudy) is a standard machine translation evaluation method.
4 WER (Word Error Rate) is a measure of the changes needed in the words of a phrase to transform it into another phrase.

Bragg et al. maintain that Deaf contributors should be involved in all facets of research and development in order to "accurately represent the community, address meaningful problems, and avoid cultural appropriation" [6]. They report that most signing datasets are only partially annotated. Continuous sign recognition is the most challenging part of the translation pipeline due to the complex phonology of signing, the variance in the fluency, dexterity, age and gender of the signer, the use of slang and dialect, as well as issues of occlusion and camera quality.

Deep Learning for SLR
There has been a significant body of research published on the application of deep learning techniques to SLR [13, 20]. Shi et al. [23] achieved state-of-the-art accuracy of 62.3% recognition on a dataset of "in-the-wild" videos of American Sign Language (ASL) fingerspelling, using an attention-based recurrent neural network. Halvardsson et al. [8] apply transfer learning techniques to three CNNs (InceptionResNetV2 [33], Xception [32] and InceptionV3 [31]) to recognise static manual signs of the Swedish Sign Language fingerspelling alphabet. They obtain 85% accuracy using the InceptionV3 network with 5 fine-tuned layers, on a test dataset derived from 8 recordings of 6 subjects. Though signer-independent, their training and test datasets were created in the same controlled environment. They demonstrate that the accuracy is dependent on the number of pre-trained layers.

Synthetic Data and Transfer Learning
Transfer learning techniques have been applied to SLR with encouraging results using several popular pre-trained deep learning networks and configurations [8, 14]. Synthetic data, produced by artificial means rather than by human photography, has been used to train ML models for CV applications, to train generative models, to augment real datasets and to anonymise real data in privacy-sensitive scenarios [5]. Techniques using synthetic data have been applied to problems in object detection and segmentation, face and text recognition, image classification and pose estimation [16]. Nikolenko's review [16] suggests that the best results are obtained when combining synthetic datasets from different domains. Bayraktar et al. [3] use synthetic data to fine-tune the VGGNet [37], Inception, ResNet [36] and Xception neural networks in object detection experiments and report that a mix of real and synthetic data yields the best results. Peng et al. [19] show that texture and colour variations in training datasets are more important than pose variations. Hinterstoisser et al. [9] present an experimental technique similar to the one presented in this paper.
They propose to retain the feature extraction of the lower layers in the networks deployed, only fine-tuning the higher-order blocks with synthetic data. Rajpura et al. [21] use synthetic images generated by Blender, together with transfer learning, to fine-tune three CNNs (DetectNet, Faster R-CNN and SSD) to recognise a set of household objects, and more than doubled the model's precision score. They report that class set size and fine-tuning depth have a significant effect on performance and that the optimal accuracy is obtained by fine-tuning all of the underlying DetectNet Inception layers. Goyal et al. [7] use augmented synthetic data for segmentation models, fine-tuning the top-most five layers of a CNN with Blender-rendered synthetic images. They show significant precision score improvements after retraining the FCN-8s network with a subset of the PASCAL dataset [39], fine-tuning the result with a small synthetic dataset. They report that the model's success depends on class sample size and object type. They limit the fine-tuning depth of their experiments to account for the non-photo-realism of the synthetic images. There have also been some studies which implement pose estimation models for Sign Language recognition, mainly using the joint coordinates from the OpenPose [35] model as input data [26-29].

Contributions
Our approach differs from previous research by using larger synthetic datasets than those available in Sign Language corpora. Through the use of an automated framework, we can control the variations within the training dataset and generate ground-truth frame-level annotation automatically. Furthermore, we adopt a pose estimation model in the recognition pipeline to reduce the domain shift between training and test datasets. We use customised wireframe skeletal images to exploit the performance of current CNN models through transfer learning techniques.

3 Methodology

3.1 Programmable Pose Framework

Our approach uses data augmentation to build translational, viewpoint, size and illumination invariance into the training datasets, enabling the model to overcome the illumination, camera perspective, background, anatomical and pose variations found in real-world scenarios. We developed a framework to automate the generation of hand poses. It is based on a skeletal-rigged 3D hand avatar mesh loaded into the Blender graphics engine and can be programmed to produce synthetic data variations at scale. The skeletal armature of the hand model can be rotated and positioned by setting its constituent bones from a pre-set list of parameterised features such as "bent", "hooked", "curled", etc. An individual ISL manual shape is composed of a set of these features, which thus determines its pose. No hand-crafting or manual setting of the hand mesh is required for the animation (a minimal scripting sketch is given below). Figure 2 shows an example of pose variations where the finger rotations are varied to correspond to the differences in fluency of signers and the phonetic variants found in ISL. The graphics engine allows for variations in scene illumination and camera perspective by setting the positions of cameras and lights. The extent of the image alterations and the class balance can be controlled programmatically in the framework. The system outputs colour, greyscale, depth and skeletal wireframe images and video, as well as skeletal joint key-point coordinates corresponding to the animated hand shapes.
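To illustrate how the framework drives the rigged hand programmatically, the sketch below uses the Blender Python API (bpy) to compose a hand shape from named feature presets and render it. The armature name, bone names and angle values are hypothetical placeholders, not our exact rig or feature set; they are only meant to show the style of scripting involved.

```python
# Minimal sketch of programmatic pose generation with the Blender Python API (bpy).
# The armature name, bone names and "feature" angles below are hypothetical; they
# illustrate composing a hand shape from parameterised features ("bent", "curled", ...).
import math
import random
import bpy

FEATURES = {                      # illustrative feature presets (radians)
    "straight": 0.0,
    "bent":     math.radians(45),
    "hooked":   math.radians(80),
    "curled":   math.radians(110),
}

def set_finger(armature, finger, feature, jitter=0.05):
    """Rotate all bones of one finger according to a named feature preset."""
    angle = FEATURES[feature]
    for bone_name in (f"{finger}_proximal", f"{finger}_middle", f"{finger}_distal"):
        bone = armature.pose.bones[bone_name]
        bone.rotation_mode = "XYZ"
        # A small random perturbation models signer-to-signer variation.
        bone.rotation_euler = (angle * (1.0 + random.uniform(-jitter, jitter)), 0.0, 0.0)

def render_pose(letter_spec, out_path):
    """Apply a feature specification to the rig and render a still image."""
    arm = bpy.data.objects["HandArmature"]        # hypothetical rig name
    for finger, feature in letter_spec.items():
        set_finger(arm, finger, feature)
    bpy.context.scene.render.filepath = out_path
    bpy.ops.render.render(write_still=True)

# An (illustrative) specification of one static letter shape.
render_pose({"index": "straight", "middle": "curled", "ring": "curled",
             "pinky": "curled", "thumb": "bent"},
            "/tmp/letter_sample.png")
```

In the full framework, camera and light objects are repositioned in the same scripted way to produce the viewpoint and illumination variations described above.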
The wireframe images can be generated in the formats of the OpenPose, MediaPipe [34] and Kinect4Azure [30] pose estimation models. Examples of colour and wireframe images generated by the framework are shown below.

3.2 Training Datasets

The experiments were divided into two phases. The first phase used synthetic RGB images of hands. The second phase introduced a pose estimation model into the pipeline and used images based on the output of this model as training data.

Phase 1: Hand RGB Images
The Phase 1 experiments used training datasets of approximately 520,000 synthetic RGB images rendered in the synthetic framework, a sample of which is shown in figure 2.

Fig. 2. Blender Synthetic Framework Examples. 10 synthetic images of 'A' from different camera perspectives and illumination conditions. Poses show thumb angle variances, finger curling, finger on palm, non-visible nails and random finger bone rotations.

Phase 2: Hand Skeletal Images
The training dataset for the Phase 2 models was based on approximately 1.6 million pose wireframe images in the format produced by pose estimation models. We experimented with several modifications to the wireframe output formats to compare the effectiveness of different artificial features in the images. The rationale is to add visual (geometric or pixel-based), potentially discriminable features to the training dataset images. Figure 3 shows two such "feature injection" formats generated by the Blender engine. The best performing training dataset was based on colouring the individual fingers with evenly spaced hues, keeping all bones in the same finger the same colour, without any pixel thickness change between fingers or bones. The MediaPipe API allows its wireframe images to be output with these modifications at inference time.

Fig. 3. Phase 2 Dataset Samples. MediaPipe wireframes with feature injection. Left: 4 samples of "B" with a different hue per finger. Right: 4 samples of "W" with a different hue per finger and a different thickness per bone.

3.3 Test Dataset

The models were tested using the ISL-HS corpus of fingerspelling signs [17]. This is a dataset of ISL fingerspelled signs captured from 6 native ISL signers, composed of 3 recordings of each person, resulting in 468 videos of static ISL alphabet letters.

Fig. 4. Test Dataset - ISL-HS Corpus. Five samples showing letters 'W', 'D', 'A', 'W' and 'A', with different subjects using different finger poses for the same letter, such as curled fingers and straight fingers on palm for the two 'A' samples.

The corpus is available as a dataset of 52,688 greyscale images, with an average of approximately 2,290 samples per alphabet letter for the 23 static signs. For the test datasets used in this work, we extracted the RGB colour images from the video frames. Figure 4 shows a sample of the corpus images. For the Phase 2 experiments, which use pose model wireframe output as data, the pose estimation model was applied to the RGB ISL-HS images to create the wireframe test datasets.

3.4 Model Pipeline

The model training and evaluation pipelines are outlined in figure 5. Different open-source networks were used during the experiments, including VGGNet16, InceptionV3, Xception, ResNet152V2 and MobileNetV2. The models were trained with a learning rate of 0.0001 and used the Adam optimiser. Batch sizes of 16, 32 and 64 were compared in the hyper-parameter adjustments, with 16 proving to be optimal. A training/validation set split of 90:10 was used throughout the experiments.

Fig. 5. Training, Evaluation and Inference Pipelines.
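As a concrete illustration of this training set-up, the following is a minimal Keras sketch of transfer learning with a VGGNet16 base, the Adam optimiser, a learning rate of 0.0001, a batch size of 16 and a 90:10 training/validation split. The dataset path, image size, classifier head and the particular unfrozen blocks are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# A minimal sketch of the fine-tuning set-up described in Section 3.4, assuming a
# directory of synthetic training images. Paths, image size, classifier head and
# the choice of unfrozen VGG16 blocks are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG16, vgg16

NUM_CLASSES = 23                      # static ISL fingerspelling letters
IMG_SIZE = (224, 224)
DATA_DIR = "synthetic_wireframes/"    # hypothetical dataset location

def load_split(subset):
    # 90:10 training/validation split, batch size 16.
    ds = tf.keras.preprocessing.image_dataset_from_directory(
        DATA_DIR, validation_split=0.1, subset=subset, seed=42,
        image_size=IMG_SIZE, batch_size=16, label_mode="categorical")
    # VGG16 expects its own input preprocessing.
    return ds.map(lambda x, y: (vgg16.preprocess_input(x), y))

train_ds, val_ds = load_split("training"), load_split("validation")

base = VGG16(weights="imagenet", include_top=False, input_shape=IMG_SIZE + (3,))
# Keep the lower-level feature extractors frozen; re-train only the top
# convolutional blocks (three here, as an illustrative fine-tuning depth).
for layer in base.layers:
    layer.trainable = layer.name.startswith(("block3", "block4", "block5"))

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)
```

The number of unfrozen blocks is treated as a hyper-parameter in our experiments; its effect on accuracy is discussed in Section 4.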
3.5 Frameworks and Equipment

The Blender framework and API were used to develop the synthetic data framework. The Keras TensorFlow framework was deployed for all the deep learning pipeline tasks, along with the OpenCV2 and MediaPipe APIs for data pre-processing. Python Jupyter Notebooks were used for all experimental coding. No Keras data augmentation functions were used, since it is the role of the synthetic image framework to control the data augmentation. The experiments were carried out on a Dell Precision 5820 Tower Workstation equipped with an Intel Xeon W-2235 processor and 32GB CPU RAM, and fitted with an NVIDIA Quadro RTX5000 GPU with 16GB GPU RAM.

4 Experimental Results

The accuracy of the best performing model in Phase 1, trained on RGB images, is 33% overall, using a VGGNet16 network with its top convolutional block re-trained. While the model's confusion matrix showed some encouraging class recognition, the results also revealed significant confusion between some classes. These subsets of mutually confused classes appeared consistently across experiments with varying hyper-parameter settings. This suggests that some letters are inherently more difficult to discriminate, which reflects the actual physical similarity between some fingerspelling shapes.

To enable the models to discriminate between these, we used wireframe images in the format of pose estimation model output as our training datasets in Phase 2 (this subsequently requires the use of a pose estimation model in the test pipeline). The synthetic pose framework was extended to generate training datasets of skeletal wireframe images in the requisite format. We then manipulated the visual aspects of the images, such as the colours and widths of bones and fingers, and trained our models on these synthetic training datasets. The best performing model in Phase 2 used a VGGNet16 base network. For our tests, we used the MediaPipe pose estimation model in the test pipeline. The Phase 2 results yielded an overall recognition accuracy of 71.4% when applied to a corpus test dataset of recorded ISL fingerspelling alphabets, these images having been pre-processed into wireframe images by applying the pose model. The optimal fine-tuning depth in Phase 2 is greater than that of Phase 1, with the re-training of three convolutional blocks yielding the best results. There was a rise in accuracy of 4.7 percentage points resulting from a three-fold increase in the size of the training dataset.
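To make this pre-processing step concrete, the sketch below shows one way, assuming the MediaPipe Hands Python API and OpenCV, to convert an RGB corpus frame into a wireframe image with the per-finger hue "feature injection" used for the Phase 2 training data. The colour values, canvas size and helper name are illustrative, not our exact settings.

```python
# Minimal sketch of converting an RGB frame into a coloured skeletal wireframe
# using MediaPipe Hands. Colours, canvas size and helper name are illustrative;
# palm/wrist connections are omitted for brevity.
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands

# One hue per finger; every bone of a finger keeps the same colour (BGR).
FINGER_COLOURS = {
    "thumb": (255, 0, 0), "index": (0, 255, 0), "middle": (0, 0, 255),
    "ring": (255, 255, 0), "pinky": (0, 255, 255),
}
# MediaPipe hand landmark indices: 1-4 thumb, 5-8 index, 9-12 middle,
# 13-16 ring, 17-20 pinky.
FINGER_CONNECTIONS = {
    "thumb":  [(1, 2), (2, 3), (3, 4)],
    "index":  [(5, 6), (6, 7), (7, 8)],
    "middle": [(9, 10), (10, 11), (11, 12)],
    "ring":   [(13, 14), (14, 15), (15, 16)],
    "pinky":  [(17, 18), (18, 19), (19, 20)],
}

def rgb_frame_to_wireframe(bgr_frame, size=224):
    """Run MediaPipe Hands and draw a per-finger coloured skeleton on a blank canvas."""
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        result = hands.process(cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None                      # no hand detected in this frame
    landmarks = result.multi_hand_landmarks[0].landmark
    pts = [(int(p.x * size), int(p.y * size)) for p in landmarks]
    for finger, connections in FINGER_CONNECTIONS.items():
        for a, b in connections:
            cv2.line(canvas, pts[a], pts[b], FINGER_COLOURS[finger], thickness=2)
    return canvas
```

The important point is that the test frames end up in the same wireframe format as the synthetic training images, which is what reduces the domain shift between the two.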
Fig. 6. Phase 2 Results. Confusion Matrix. Overall Accuracy 71.4%.

The results from the best-performing VGGNet16 configuration are shown in figure 6. However, there is still marked confusion between 'E' and 'S', 'G' and 'F', and 'R' and 'U', as seen in table 1. This confusion table shows the accuracy of the model for each letter, as the percentage of correctly recognised test samples for that letter. It also shows, for each class, the three letters most often incorrectly predicted by the model, in terms of the highest percentage of test samples. For example, although the model correctly recognised 'F' in 61% of test samples, it also incorrectly classified 35% of 'F' test samples as 'G', 1% as 'Q' and 2% as 'K', showing scope for further potential improvements. These are, in effect, the three letters most "confused" by the model when recognising an 'F'. Details of the above experimental results, as well as all corresponding code, have been made available on GitHub.5

5 https://github.com/ucd-csl/ISL-SLR

Table 1. Accuracy for each letter, with the 3 most closely confused letters.

Letter   Accuracy   Top 3 Most Closely Confused Letters
A        0.46       L (0.02), E (0.15), F (0.29)
B        0.89       E (0.01), L (0.05), P (0.05)
C        0.68       E (0.04), L (0.12), Y (0.13)
D        0.97       E (0.00), T (0.00), R (0.02)
E        0.81       F (0.01), G (0.02), B (0.10)
F        0.61       Q (0.01), G (0.35), K (0.02)
G        0.85       Q (0.01), F (0.09), K (0.04)
H        0.89       B (0.01), T (0.01), D (0.10)
I        0.81       E (0.03), S (0.09), H (0.04)
K        0.83       G (0.02), Q (0.04), H (0.07)
L        0.74       I (0.00), H (0.00), F (0.26)
P        0.88       D (0.01), U (0.03), B (0.07)
Q        0.82       H (0.01), U (0.03), V (0.10)
R        0.23       T (0.04), U (0.36), D (0.31)
S        0.04       E (0.45), D (0.29), B (0.09)
T        0.12       B (0.06), D (0.27), E (0.29)
U        0.88       P (0.01), R (0.01), D (0.06)
V        0.97       D (0.00), E (0.01), U (0.02)
W        0.97       Q (0.00), B (0.00), P (0.02)
Y        0.72       A (0.05), H (0.08), F (0.09)
Overall  0.71

5 Conclusion and Future Work

The above results demonstrate that a CNN, trained solely on synthetic images, can effectively recognise isolated ISL fingerspelling signs. There is a need to resolve the recognition confusion evident in a small subset of classes, with techniques such as ensemble learning and composite models. We plan to extend the synthetic image generator and the recognition models to cater for the full set of ISL hand shapes as well as dynamic signs, and eventually to recognise continuous sign sequences. The latter will require the models to be extended with a temporal architecture such as a recurrent neural network (RNN) or LSTM [40] structure. We hope to create a corpus of native ISL signer recordings, in formats suitable for input to the deep learning models, as well as a database of annotated ISL online videos as a comprehensive "in-the-wild" test dataset. While occlusion was not a problem for one-handed fingerspelling recognition, paired synchronised depth sensors will be deployed in future pipelines, with appropriate models, to cater for this effect.

Acknowledgement

This work was conducted with the financial support of the Science Foundation Ireland Centre for Research Training in Digitally-Enhanced Reality (D-REAL) under Grant No. 18/CRT/6224. This work has also been conducted within the SignON project, which has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 101017255.

References

1. EUD Homepage, https://www.eud.eu/about-us/eud-position-paper/accessibility-information-and-communication/. Last checked 05.08.2021.
2. Mohamed Aktham Ahmed, Bilal Bahaa Zaidan, Aws Alaa Zaidan, Mahmood Maher Salih, and Muhammad Modi Bin Lakulu: A review on systems-based sensory gloves for sign language recognition state of the art between 2007 and 2017. In: Sensors, 18(7):2208, 2018.
3. Ertugrul Bayraktar, Cihat Bora Yigit, and Pinar Boyraz: A hybrid image dataset toward bridging the gap between real and simulation environments for robotics. In: MVA 2019.
4. Citizens Information Board: Information provision and access to public and social services for the Deaf Community. Government of Ireland, December 2017. https://www.citizensinformationboard.ie/downloads/social policy/. Last checked 05.08.2021.
5. Erik Bochinski, Volker Eiselein, and Tomas Sikora: Training a convolutional neural network for multi-class object detection using solely virtual world data. In: AVSS 2016.
6. Danielle Bragg, Oscar Koller, Mary Bellard, Larwan Berke, Patrick Boudreault, Annelies Braffort, Naomi Caselli, Matt Huenerfauth, Hernisa Kacorri, Tessa Verhoef, Christian Vogler, and Meredith Ringel Morris: Sign language recognition, generation, and translation: An interdisciplinary perspective. In: ASSETS 2019.
7. Manik Goyal, Param Rajpura, Hristo Bojinov, and Ravi Hegde: Dataset augmentation with synthetic images improves semantic segmentation. In: NCVPRIPG 2018.
8. Gustaf Halvardsson, Johanna Peterson, C. Soto-Valero, and Benoit Baudry: Interpretation of Swedish sign language using convolutional neural networks and transfer learning. In: SN 2021.
9. Stefan Hinterstoisser, Vincent Lepetit, Paul Wohlhart, and Kurt Konolige: On pre-trained image features and synthetic images for deep learning. In: ECCV 2018 Workshops, pages 682-697, 2019.
10. Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner: Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, 1998.
11. L. Leeson and J.I. Saeed: Irish Sign Language: A Cognitive Linguistic Account. Edinburgh University Press, 2012.
12. Patrick A. Matthews: Extending the Lexicon of Irish Sign Language (ISL) [microform]. Distributed by ERIC Clearinghouse [S.l.], 1996.
13. Ming Jin Cheok, Z. Omar, and M. Jaward: A review of hand gesture and sign language recognition techniques. In: International Journal of Machine Learning and Cybernetics, 10:131-153, 2019.
14. Boris Mocialov, Graham Turner, and Helen Hastie: Transfer learning for British Sign Language modelling, 2020.
15. Luis Naranjo-Zeledón, Jesús Peral, Antonio Ferrández, and Mario Chacón-Rivas: A systematic mapping of translation-enabling technologies for sign languages. In: Electronics, 8(9), 2019.
16. Sergey I. Nikolenko: Synthetic data for deep learning, 2019.
17. Marlon Oliveira, Houssem Chatbri, Suzanne Little, Ylva Ferstl, Noel E. O'Connor, and Alistair Sutherland: Irish sign language recognition using principal component analysis and convolutional neural networks. In: DICTA 2017.
18. Sinno Jialin Pan and Qiang Yang: A survey on transfer learning. In: TKDE, 22(10):1345-1359, 2010.
19. Xingchao Peng, Baochen Sun, Karim Ali, and Kate Saenko: Exploring invariances in deep convolutional neural networks using synthetic images, 2014.
20. R. Rastgoo, K. Kiani, and S. Escalera: Sign language recognition: A deep survey. In: ESA, 164:113794, 2021.
21. Param Rajpura, Alakh Aggarwal, Manik Goyal, Sanchit Gupta, Jonti Talukdar, Hristo Bojinov, and Ravi Hegde: Transfer learning by finetuning pretrained CNNs entirely with synthetic images. In: NCVPRIPG, pages 517-528, 2018.
22. Ling Shao, Fan Zhu, and Xuelong Li: Transfer learning for visual categorization: a survey. In: TNNLS 2015.
23. Bowen Shi, Aurora Martinez Del Rio, Jonathan Keane, Diane Brentari, Greg Shakhnarovich, and Karen Livescu: Fingerspelling recognition in the wild with iterative visual attention. In: CoRR, abs/1908.10546, 2019.
24. Karl Weiss, Taghi Khoshgoftaar, and DingDing Wang: A survey of transfer learning. In: Journal of Big Data, 3, 2016.
25. L. Leeson, J. Saeed, and D. Byrne-Dunne: Moving heads and moving hands: Developing a digital corpus of Irish Sign Language, 2006.
26. Bowen Shi, Diane Brentari, Greg Shakhnarovich, and Karen Livescu: Fingerspelling Detection in American Sign Language. In: CVPR 2021.
27. Dongxu Li, Cristian Rodriguez-Opazo, Xin Yu, and Hongdong Li: Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison. In: WACV 2020.
28. Hamid Reza Vaezi Joze and Oscar Koller: MS-ASL: A Large-Scale Data Set and Benchmark for Understanding American Sign Language. In: WACV 2020.
29. Sang-Ki Ko, Jae Gi Son, and Hyedong Jung: Sign Language Recognition with Recurrent Neural Network using Human Keypoint Detection. In: RACS 2018.
30. Kinect, https://azure.microsoft.com/en-us/services/kinect-dk/.
31. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich: Going Deeper with Convolutions, 2014.
32. François Chollet: Xception: Deep Learning with Depthwise Separable Convolutions, 2017.
33. Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, 2016.
34. MediaPipe, https://google.github.io/mediapipe/.
35. Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh: OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. In: CoRR, 2018.
36. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun: Deep Residual Learning for Image Recognition, 2015.
37. Karen Simonyan and Andrew Zisserman: Very Deep Convolutional Networks for Large-Scale Image Recognition, 2015.
38. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg: SSD: Single Shot MultiBox Detector. In: LNCS 2016.
39. Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman: The Pascal Visual Object Classes (VOC) Challenge.
40. Sepp Hochreiter and Jürgen Schmidhuber: Long Short-Term Memory. In: Neural Computation, 1997.
41. Why Sign-Language Gloves Don't Help Deaf People. The Atlantic 9 (2017), https://www.theatlantic.com/technology/archive/2017/. Last checked 28.11.2021.
42. Those Signing Gloves Are Not That Great. Language First, 2019, https://language1st.org/essays/2019/6/15/those-signing-gloves-are-not-that-great. Last checked 28.11.2021.
43. J. Charles, T. Pfister, D. Magee, D. Hogg, and A. Zisserman: Domain Adaptation for Upper Body Pose Tracking in Signed TV Broadcasts. In: BMVA 2013.
44. Mohamed Farouk, Alistair Sutherland, and Amin A. Shoukry: Nonlinearity reduction of manifolds using Gaussian blur for handshape recognition based on multi-dimensional grids. In: ICPRAM 2013.