<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Static Peruvian Sign Language Classifier Based on Manual Spelling Using a Convolutional Neural Network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gerardo Portocarrero-Banda</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eveling Gloria Castro-Gutierrez</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abdel Alejandro Portocarrero-Banda</string-name>
          <email>abdel.portocarrero@ucsm.edu.pe</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia Acra-Despradel</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Rondon</string-name>
          <email>drondon@continental.edu.pe</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hugo Guillermo Jimenez-Pacheco</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miguel Angel Ortiz-Esparza</string-name>
          <email>miguel.ortiz@cimat.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Research in Mathematics</institution>
          ,
          <addr-line>Quantum Knowledge City, Zacatecas</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Católica de Santa María</institution>
          ,
          <addr-line>Urbanización San José s/n, Arequipa, 04013</addr-line>
          ,
          <country country="PE">Perú</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad Continental</institution>
          ,
          <addr-line>Arequipa</addr-line>
          ,
          <country country="PE">Perú</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Universidad Nacional Pedro Henríquez Ureña</institution>
          ,
          <addr-line>Santo Domingo</addr-line>
          ,
          <country country="DO">Dominican Republic</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Universidad Nacional de San Agustín de Arequipa</institution>
          ,
          <country country="PE">Peru</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>People who suffer from mixed language disorders face great difficulties, having only one means of interpretive communication: sign language. A major challenge is to efficiently recognize static gestures in real environments. This research therefore presents a convolutional neural network model that recognizes and classifies the Peruvian Sign Language (PSL) using a dataset of 3025 frames, through four stages: a) generation of a dataset involving 11 gestural components per frame (invariant characteristics, sign parameters, and the gestural space), which allows greater generalization of the model compared to samples from other research; b) image preprocessing through the application of computer vision techniques and algorithms; c) application of a convolutional neural network (CNN) model architecture; and d) execution of the model on a web platform to support model testing. The proposed CNN model obtained an accuracy of 99% in training, 88% in validation, and 84% in PSL recognition in the testing stage. The present model is better prepared to recognize static signs of the PSL in real scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>Peruvian Sign Language</kwd>
        <kwd>Convolutional Neural Network</kwd>
        <kwd>Fingerspelling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Currently, in the social environment, there are barriers to communication, accessibility, and equal
opportunities for people with mixed language disorders. There are approximately 70 million
people with hearing disabilities around the world [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], causing difficulties in their interaction, teaching, and understanding. The main means of
communication for people with hearing disabilities is sign language, which uses gestures to imitate
or illustrate an object, feeling, expression, or even an action. Like spoken languages, sign language
is unfortunately not universal: there is no single language shared by deaf people around the world,
and several different sign languages have evolved independently across countries and even regions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The
last Peruvian census, in 2017, identified approximately 230,000 people with speech and hearing
disabilities using the PSL [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Fingerspelling is the process of spelling, one letter at a time,
words that do not have an existing sign, such as proper nouns (a person's name, cities,
products, etc.). This method is carried out using the hand shapes associated with the letters of the
        static alphabet. However, in addition to hand joints, non-manual aspects such as facial
expressions, arms, head, and body movements and positions play a crucial role in static sign language
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, most of the time, the hand gesture made by the dominant hand carries most of the
meaning of the sign [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Therefore, an automatic software tool that can recognize alphabetic sign
language symbols or gestures could have a great impact on the communication of individuals with
mixed language disorders in society. Much of the current research on sign language recognition
(SLR) examines various advanced artificial intelligence methods. Specifically, convolutional
neural network (CNN) techniques are applied, as in different research [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ][
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and different
computer vision algorithms are used, such as color model change [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], segmentation and identification of image depth, as in the work of [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and the application of activation functions in the CNN to enable multi-class classification, as described in the work of [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. With these aspects in mind,
this work proposes to implement a CNN model that identifies as many PSL alphabet signs as
possible, with an accuracy similar to or greater than that reported in the state of the art.
The present work is organized in six sections: Section 2 describes related work, Section 3
contains the applied methodology, Section 4 describes the experimentation, Section 5 presents
the results and discussion, and finally Section 6 presents the conclusions of the research and
future work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        In the work of [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] in 2010, the authors report the implementation of an optical system for the
recognition of image patterns using artificial neural networks to identify the sign language
alphabet, concentrating on its recognition and classification without considering the cultural
system or identity of any specific community. Two freely accessible datasets are also generated:
a) a dataset made up of 23 digital images, and b) a dataset made up of 23 images generated by the
authors of the article, both containing the same static signs of the alphabet. After processing,
they obtained an average recognition performance of 99%; however, the second-level sectioning
network has difficulties in image recognition when applying a digital transform correlator
responsible for discriminating patterns such as rotation and translation with respect to their
original position, so it is not prepared for untrained images that involve components such as
rotation or translation of gestures.
      </p>
      <p>
        Ali Karami, Bahman Zanj, and Azadeh Kiani Sarkaleh in 2011 focus on the recognition
of static gestures from the Persian Sign Language alphabet (PSLa) using a multilayer perceptron
network [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], using a dataset generated with a digital camera, made up of 640 images of the bare-hand
gesture without the use of other resources such as gloves or visual marking systems.
The network obtains an average recognition accuracy of 94.06% on the PSLa alphabet,
yielding a system that is robust for the selected PSLa images, which share only the
characteristics of their dataset.
      </p>
      <p>
        In the research of Lionel Pigou, Sander Dieleman, Pieter-Jan Kindermans, and Benjamin
Schrauwen in 2015, the authors identified communication difficulties between people with
hearing disabilities and the hearing society, and implemented an Italian Sign Language
recognition system using a Microsoft Kinect, a CNN, and GPU acceleration [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. They use the
open dataset of ChaLearn Looking At People 2014 (CLAP14), Track 3 (Gesture
Spotting), which consists of 20 gestures performed by 27 users, making up a total of 6600 samples:
4600 dedicated to training, 2000 to validation, and 3543 additional test samples that can be
included in the validation or training set, considering different aspects such as complex
backgrounds, clothing, lighting, and gestural movements. In this research, a recognition rate of
91.7% was obtained on the validation data and 95.68% on the test data, with the caveat that the
test data were used in training the model.
      </p>
      <p>
        Salem Ameen and Sunil Vadera in 2017 noted the number of individuals with hearing
disorders and their adversities; their proposal therefore consists of implementing an automatic
tool for the interpretation of the static American Sign Language (ASL) alphabet using a CNN
with two inputs dedicated to the intensity and depth of the images, aimed at classifying gestures
of the manually spelled alphabet [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. They use a free dataset of 60,000 images of the static ASL alphabet,
excluding the letters 'J' and 'Z', which require movement for their gesticulation [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
this dataset was generated by 5 different users. In this way they achieve an average accuracy
of 82% in recognizing the static ASL alphabet. The authors compare their work with the
research of Rioux-Maldague and Giguere [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], obtaining a higher recognition percentage and
identifying two types of errors through a confusion matrix: (a) symmetric errors, in which two
letters can be misclassified as each other, and (b) asymmetric errors, in which one
letter is misclassified as another but not vice versa.
      </p>
      <p>
        In 2017, it was proposed to recognize 24 classes of the PSL static alphabet through the
development of two different CNN architectures (CNN1 and CNN2) with different numbers of
layers and parameters per layer, fed using digital image processing techniques
to remove or reduce noise, improve contrast under varying lighting, separate the hand from the
image background, and finally detect and crop the region containing the hand gesture [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The
dataset is generated by the researchers and is made up of 16,200 images from 25 different users,
including invariant characteristics in different sets of frames such as scale, rotation, translation,
lighting, noise, and complex background. The performance of each CNN architecture is different,
and the following results are obtained: a) CNN1 achieves an average of 95.37% in the recognition
of test data (87.33% on the data with scale features, 94% with rotation features, 91%
with translation features, 92.67% with illumination features, 89% with noise features, and
84.44% with complex background features) and an average of 99.34% in validation;
b) CNN2 achieves an average of 96.2% in the recognition of test data (93.67% on the data with
scale features, 94.83% with rotation features, 91.66% with translation features, 93% with
illumination features, 89.33% with noise features, and 85.34% with complex background
features) and an average of 99.73% in validation.
      </p>
      <p>The research proposed by [19] highlights the infeasibility and impracticality of using devices
or equipment such as the Microsoft Kinect for sign language recognition, since they are usually
expensive and can only be used in controlled environments or under very specific requirements.
For this reason, it was proposed to recognize the static ASL alphabet using machine
learning algorithms: a) Support Vector Machine (SVM), b) K-Nearest Neighbors (KNN), and c)
Random Forest (RF), which were fed with handcrafted image features (histogram, Gabor
filter, and discrete wavelet transform). They also used a CNN as a deep learning algorithm to
compare their results. The prepared dataset is made up of 2524 images (only the gesturing hand
and random objects are considered) that cover 24 static ASL alphabet signs gestured by 5 users.
Applying the YCbCr color model, they carry out the preprocessing: a) segmentation of the hand
and b) extraction of the hand region (erosion and dilation). Their CNN model reaches the
highest accuracy, 97%.</p>
      <p>Nikhil Kasukurthi, Brij Rokad, Shiv Bidani, and Aju Dennisan in 2019 proposed a deep learning
model based on the SqueezeNet architecture for the recognition of the ASL alphabet so that it can
be executed on mobile devices [20], considering a freely available dataset of 43,986 RGB (320x320
pixel) finger images from Surrey [21]. The proposed model achieves 87.47% accuracy in
recognizing training data and 83.29% in validation, thus obtaining a model that allows
predicting sign language in real time with a SqueezeNet architecture that runs
completely on a mobile device accessible to the end user. They are still investigating improvements
in aspects such as lighting and the distance of the sign from the capturing device.</p>
      <p>Amirhossein D., Alireza T., Maryam T., and Majid M. in 2019 propose a two-stage CNN
architecture for robust hand gesture recognition, called HGR-Net, where the first stage performs
precise semantic segmentation to determine the hand regions and the second stage
identifies the gesture. Their experimentation focuses on public datasets [22], achieving a
recognition rate of 81.1% with the HGR-Net model, which occupies only 2.4 MB, allowing
flexible portability to any type of digital environment, and averages 23 ms (43 fps) in the
recognition of each gestural input.</p>
      <p>In the work of Ankita Wadhawan and Parteek Kumar in 2020, a CNN model is proposed for the
recognition of static gestures of Indian Sign Language (ISL) through in-depth experimentation
and comparison of 50 CNN models. They take a freely accessible dataset made up of 35,000 RGB
images (350 images for each sign) of various sizes and colors, taken in different environmental
conditions to help the classifier generalize better, reaching 99.72% and 99.90% in the recognition
of color and grayscale training images, respectively. This demonstrates greater accuracy compared
to other recognition systems, a) KNN (95.95%), b) SVM (97.9%), and c) ANN (98%), in the
recognition of static ISL signs with training data [23].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section describes the stages of the proposed method, which is illustrated in Figure 1.
These stages include the generation of a dataset of PSL alphabet signs, followed by the
preprocessing of the dataset used as input data (static signs of the PSL), which adjusts and
modifies each frame that makes up the dataset so that it correctly feeds the CNN model. The
following stages include the training and validation of the CNN model architecture, followed by
its testing and adjustment, to achieve correct recognition of the PSL in the execution stage of the
model on a local web platform.</p>
      <sec id="sec-3-1">
        <title>3.1. A) Generation of a Dataset</title>
        <p>This stage consists of the generation of 121 frames for each static alphabet sign of the PSL,
divided into three parts: training, validation, and test data (63%, 7%, and 30%, respectively).
Twenty-four PSL alphabet signs were considered, plus a class labeled 'nothing' that represents
frames that do not correspond to any gesture of the PSL alphabet, with the purpose of
differentiating any external gesture that does not belong to the PSL, as shown in Figure 2. A total
of 3025 frames were produced across all PSL and non-PSL classes. Therefore, 1906 frames are
allocated for model training, 211 frames for model validation, and 908 frames for evaluating the
recognition accuracy on test data.</p>
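        <p>As an illustration of how such a split could be produced, the following minimal sketch (in Python) partitions the frames of each class folder into the stated proportions. The directory layout, file naming, and random seed are assumptions made for illustration only and are not the authors' exact tooling.</p>
        <preformat>
import os
import random

# Assumed layout: dataset/CLASS_NAME/frame_xxx.png, 25 classes with 121 frames each.
DATASET_DIR = "dataset"
SPLITS = {"train": 0.63, "val": 0.07, "test": 0.30}

random.seed(42)
split_lists = {name: [] for name in SPLITS}

for class_name in sorted(os.listdir(DATASET_DIR)):
    frames = sorted(os.listdir(os.path.join(DATASET_DIR, class_name)))
    random.shuffle(frames)
    n_train = int(len(frames) * SPLITS["train"])
    n_val = int(len(frames) * SPLITS["val"])
    split_lists["train"] += [(class_name, f) for f in frames[:n_train]]
    split_lists["val"] += [(class_name, f) for f in frames[n_train:n_train + n_val]]
    split_lists["test"] += [(class_name, f) for f in frames[n_train + n_val:]]

# For 25 classes of 121 frames this yields roughly the 1906 / 211 / 908 split reported above.
print({name: len(items) for name, items in split_lists.items()})
        </preformat>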
        <p>
          The data itself maintains certain very important aspects: gestural diversity is
manifested in different perspectives and environments, so it is optimal to cover all the
components of a signing gesture. Each frame contemplates 11 components of the gesture:
a) 7 invariant characteristics (day, afternoon, and night lighting, noise or atypical aspects,
complex background, translation, and scaling) [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], b) 3 parameters of the sign (hand shape,
hand orientation, and location) [24], and c) the gestural space, which is the rectangular view of
the user's upper torso. The shape of the hand is the configuration that the hand assumes when it
begins to make the sign, orientation is the direction in which the hand turns, and location is where
the sign is formed near the signer's body. Around 75% of all signs are formed in the head and
neck areas, as they can be identified better there and are closer to reality.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. B) Pre-processing</title>
        <p>The preprocessing stage works on the generated dataset; its objective is to highlight the most
important characteristics of each frame. For this task, different filters, techniques, and
transformations are applied, which are mentioned below: 1) color model change,
2) skin color detection, 3) resolution change, and 4) normalization.</p>
        <p>
          1) Color model change: The color model of each frame is converted from RGB to YCbCr to allow
the application of the appropriate filters to identify skin color, since this model is applicable to
complex color images with uneven lighting [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ][25], where 'Y' is the luminance and 'Cb' and 'Cr' are the blue-difference and red-difference
chroma components, respectively.
        </p>
        <p>
          The YCbCr color space has the characteristics of separating chromaticity and brightness [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
When the frame includes different elements in addition to the person performing the gesture, we
choose to apply skin color segmentation using the YCbCr color model [25] for the segmentation of
the bare hand, as seen in Fig. 3.
        </p>
        <p>Therefore, the RGB color space is separated into luminance and chroma components,
obtaining the YCbCr color space. The 'Y' component corresponds to luminance, while 'Cb' and
'Cr' correspond to the color-difference chroma. The components 'Y', 'Cb', and 'Cr' can be obtained
from the RGB values [19] using Eq. 1. The prime symbol indicates that gamma correction has been
applied; gamma correction controls the overall brightness of an image. R', G', and B' nominally
vary from 0 to 1, where 0 represents the minimum intensity and 1 the maximum.</p>
        <p>Y′ = 16 + (65.481 · R′ + 128.553 · G′ + 24.966 · B′)
Cb = 128 + (−37.797 · R′ − 74.203 · G′ + 112.0 · B′)
Cr = 128 + (112.0 · R′ − 93.786 · G′ − 18.214 · B′)
(1)</p>
        <p>Figure 2: Dataset made up of frames of static gestures with bare hands according to the PSL
alphabet: A) considering 11 components of the gesture (7 invariant characteristics, 3 parameters
of the sign, and the gestural space); B) frames that do not belong to any gesture of the PSL
alphabet (e.g., a pencil).</p>
        <p>2) Skin color detection: Once the skin color range is defined according to the YCbCr model, all
the regions that fall within it are highlighted so that only these characteristics are observed within
the environment, which in most cases involves complex aspects that can obscure the information.
After obtaining the YCbCr color space values for each pixel of the image with Eq. 1, each pixel is
classified as a skin pixel or a non-skin pixel. If the values of 'Y', 'Cb', and 'Cr' for a pixel lie in
the ranges given in Eq. 2, then that pixel is a skin pixel; otherwise, it is a non-skin pixel [19].</p>
        <p>0 &lt; Y &lt; 255
133 &lt; Cb &lt; 173
77 &lt; Cr &lt; 127
(2)</p>
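        <p>As a minimal sketch of steps 1 and 2, the conversion and thresholding can be expressed with OpenCV as follows. The library choice is ours; OpenCV returns channels in Y, Cr, Cb order and uses a full-range variant of Eq. 1, while the threshold values are exactly those of Eq. 2.</p>
        <preformat>
import cv2
import numpy as np

def skin_mask(frame_bgr):
    """Segment skin-coloured pixels using the YCbCr ranges of Eq. 2 (sketch)."""
    # Convert from BGR to OpenCV's Y, Cr, Cb channel order.
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    # Eq. 2 bounds expressed in (Y, Cr, Cb) order: Y in 0..255, Cr in 77..127, Cb in 133..173.
    lower = np.array([0, 77, 133], dtype=np.uint8)
    upper = np.array([255, 127, 173], dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)
    # Keep only the skin-coloured regions, blacking out everything else (as in Fig. 4).
    return cv2.bitwise_and(frame_bgr, frame_bgr, mask=mask)
        </preformat>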
        <p>These operations are applied to all frames of the dataset after changing the color model to
YCbCr, in this case highlighting the daylight invariant characteristic, as shown in Fig. 4.</p>
        <p>As observed in each of them, the necessary aspects can be noted to cover the invariant
characteristics, the parameters of the sign, and the gestural space. The reason for discarding skin
color and keeping black-and-white frames is precisely the types of components mentioned above.
One of them is noise, which hinders efficient recognition when such elements appear in the frames
of the dataset, such as a user's glasses, rings, bracelets, or objects in the background; if they
remain they can spoil the recognition, because the only elements that should be recognized are
the static gestures of the PSL alphabet.</p>
        <p>3) Resolution Change: All frames are resized to a resolution of 224x224
pixels [22][26][27][28], which allows only the essential components of the gesture to be
retained, as shown in Figure 5.</p>
        <p>4) Normalization: All frames are normalized by dividing each pixel value by 255, the maximum
value a pixel can have, obtaining values in the range from 0 to 1, with the aim of having a lighter
dataset that can be handled more easily, as expressed in Eq. 3.</p>
        <p>Normalized Pixel Value = Pixel Value / 255 ∈ [0, 1]
(3)</p>
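        <p>Steps 3 and 4 can then be chained with the segmentation above into a single preprocessing helper. This is a sketch building on the skin_mask function shown earlier, not the authors' exact code.</p>
        <preformat>
import cv2
import numpy as np

TARGET_SIZE = (224, 224)  # resolution fed to the CNN

def preprocess(frame_bgr):
    """Full preprocessing chain: skin segmentation, resize to 224x224, normalization."""
    segmented = skin_mask(frame_bgr)              # steps 1-2 (sketched above)
    resized = cv2.resize(segmented, TARGET_SIZE)  # step 3: resolution change
    return resized.astype(np.float32) / 255.0     # step 4: pixel values in [0, 1] (Eq. 3)
        </preformat>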
      </sec>
      <sec id="sec-3-3">
        <title>3.3. C) CNN Model Architecture</title>
        <p>The proposed system is designed to recognize the static alphabet signs of the PSL. To achieve
this, the architecture of the CNN was defined according to the number of classes (alphabet signs)
to be identified, and different but related stages were established: training, validation, model
testing, and adjustment.</p>
        <p>The CNN model has 4 convolution layers (with 32, 64, 128, and 256 filters, respectively)
applying a 3x3 kernel and a ReLU activation function per convolution layer, and 4 MaxPooling
layers with a 2x2 kernel between the convolution layers, adding a dropout layer with a random
node exclusion rate of 30% [29] after the last convolution layer to avoid overfitting.</p>
        <p>A flattening layer is also defined, which delivers the feature maps to the fully
connected neural network. For the fully connected network, 1 input layer, 3 hidden layers, and an
output layer are defined (with 256, 128, 64, 32, and 25 nodes, respectively), with a ReLU activation
function in the first 4 layers and a Softmax activation function in the output layer to allow
multi-class classification. The Softmax layer produces a recognition probability for each class,
assigning the highest value to the most likely class and distributing the rest among the remaining,
less likely classes; during validation this also serves as an early-stopping criterion. A dropout
layer with a random node exclusion rate of 30% is also added, as can be seen in Figure 6.</p>
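        <p>A sketch of this architecture in Keras is shown below. The input shape and the exact placement of the second dropout layer are assumptions based on the description; the filter counts, kernel sizes, dense layer widths, activations, and dropout rate follow the text.</p>
        <preformat>
from tensorflow.keras import layers, models

NUM_CLASSES = 25  # 24 static PSL alphabet signs plus the 'nothing' class

def build_model(input_shape=(224, 224, 3)):  # channel count is an assumption
    """Sketch of the described CNN: 4 convolution blocks plus a fully connected classifier."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    # Four convolution layers with 32, 64, 128 and 256 filters, 3x3 kernels and ReLU,
    # each followed by 2x2 max pooling.
    for filters in (32, 64, 128, 256):
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.30))  # 30% random node exclusion after the last convolution block
    model.add(layers.Flatten())
    # Fully connected part: 256, 128, 64 and 32 nodes with ReLU, then a 25-way softmax output.
    for units in (256, 128, 64, 32):
        model.add(layers.Dense(units, activation="relu"))
    model.add(layers.Dropout(0.30))  # second 30% dropout layer (placement assumed)
    model.add(layers.Dense(NUM_CLASSES, activation="softmax"))
    return model
        </preformat>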
        <p>
          The experimentation was carried out on a dedicated graphics card (GPU), as is also done in
different proposals [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ][
          <xref ref-type="bibr" rid="ref18">18</xref>
          ][20][22][23][27][28][30]: an Nvidia GeForce GTX 1650 Ti with 4 GB of
VRAM, together with a 4-core CPU at 2.50 GHz (up to 4.50 GHz) and 16 GB of RAM. Different GPU
configurations were made both in the basic input/output system (BIOS) and in the operating
system, using the official resources of CUDA 11.8 with cuDNN 8.2 under TensorFlow 2.10 and a
stable version of Python 3.8.
        </p>
        <p>1) Training and Validation: The training is executed using the categorical cross-entropy loss
function and the Adam optimizer with a learning rate of 0.001. The maximum number of batches
for each training step varies depending on the amount of data, as does the number of iterations
per batch, and a callback was implemented that reports the accuracy, computed as in Eq. 4.</p>
        <p>Accuracy = (Σᵢ₌₁ⁿ Matchesᵢ / LabeledClasses) · 100
(4)</p>
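        <p>A sketch of the corresponding training setup is given below. The loss, optimizer, learning rate, and epoch count follow the text; the dataset objects and the choice of an EarlyStopping callback are assumptions made for illustration.</p>
        <preformat>
import tensorflow as tf

model = build_model()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# The text only states that an accuracy-reporting callback acting as an early stop was used;
# an EarlyStopping callback monitoring validation accuracy is one plausible realization.
callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=5, restore_best_weights=True)
]

# train_ds and val_ds are assumed tf.data.Dataset objects yielding (224x224 frames, one-hot labels).
history = model.fit(train_ds, validation_data=val_ds, epochs=75, callbacks=callbacks)
        </preformat>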
      </sec>
      <sec id="sec-3-4">
        <title>3.4. D) Execution on a Web Platform</title>
        <p>To improve testing in a real-time environment, a web platform has been developed following
an MVC (model-view-controller) architecture implemented in Django, chosen for its capabilities
for executing the previously trained, validated, and tested CNN model. The platform captures
images through the user's device, delivers them to the CNN model to perform the recognition,
and displays the result on the screen.</p>
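        <p>A minimal sketch of the server-side piece of such a platform is shown below. The module layout, request format, and helper names are illustrative assumptions; only the general flow (receive a frame, preprocess it, run the CNN, return the label) follows the description above.</p>
        <preformat>
# views.py (illustrative): receive a webcam frame, run the CNN, return the predicted sign.
import base64

import cv2
import numpy as np
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

# Hypothetical module exposing the trained model and the preprocess() helper sketched above.
from .inference import model, preprocess

# Assumed class ordering: the 24 static PSL alphabet signs plus 'nothing'.
CLASS_NAMES = list("ABCDEFGHIKLMNOPQRSTUVWXY") + ["nothing"]

def decode_frame(data_url):
    """Decode a base64 data URL sent by the browser webcam capture."""
    encoded = data_url.split(",", 1)[1]
    buffer = np.frombuffer(base64.b64decode(encoded), dtype=np.uint8)
    return cv2.imdecode(buffer, cv2.IMREAD_COLOR)

@csrf_exempt
def recognize(request):
    frame = decode_frame(request.POST["frame"])
    x = preprocess(frame)                       # preprocessing chain from Section 3.2
    probs = model.predict(x[np.newaxis, ...])   # CNN model from Section 3.3
    label = CLASS_NAMES[int(np.argmax(probs))]
    return JsonResponse({"sign": label})        # e.g. {"sign": "A"} or {"sign": "nothing"}
        </preformat>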
        <p>Its use is intended to be simple and understandable, precisely so that any user is able to use
it; it is only necessary to have a laptop or computer with a camera or webcam and the project
running locally. As shown in Figure 7, a distinction is made between what the user observes,
which is what the platform is intended for, and what happens internally: each captured frame is
sent for processing, as described in the previous stages, and delivered to the model to perform
the recognition. In this case, the identified alphabet sign is the PSL 'A', and the sign is
superimposed as text at the bottom left of the frame, achieving a satisfactory recognition of
the sign.</p>
        <p>Additionally, Figure 8 shows how a frame is captured when no PSL sign is being gestured;
the trained model interprets this and places the text "nothing" at the bottom left.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimentation</title>
      <p>This stage establishes and adjusts the parameters and functions of both the preprocessing
and the CNN model over numerous executions. For this, 4 different configuration stages were
required, in which measurements were obtained of the average validation and testing accuracy of
the model, i.e., its effectiveness when performing gesture recognition.</p>
      <sec id="sec-4-1">
        <title>4.1. A. First experimentation</title>
        <p>Performed with a dataset of 1200 frames, which make up the alphabet signs from 'A' to 'L',
excluding 'J' because it is not considered a static sign, and also considering the creation of the
class 'nothing' ('NADA'), which does not correspond to any PSL alphabet sign, in order to provide
contrast when no sign is being gestured. The different configurations for this dataset are found
in Table I.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. B. Second Experimentation</title>
        <p>Performed with a dataset of 1465 frames, which make up the alphabet signs from 'A' through
'R', excluding 'J' and 'Ñ' because they are not considered static signs. The different
configurations for this dataset are found in Table II.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. C. Third Experimentation</title>
        <p>Performed by adding 100 frames to the dataset for each of the alphabet signs 'S', 'T', 'U',
'V', 'W', 'X', and 'Y', excluding 'Z' because it is not considered a static sign; therefore, a
dataset of 2500 frames is contemplated in its totality. These configurations are found in
Table III.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. D. Fourth Experimentation</title>
        <p>This was done by adding 21 additional frames per class to the dataset, for a total of 3025
frames. For this change to be viable, the number of epochs was adjusted from 30 to 75; the other
settings are found in Table IV.</p>
        <p>All the experimentation was carried out using all the hardware resources (GPU, CPU, storage),
and the training time was shortened dramatically when using the GPU with respect to the CPU,
with an average difference of 1500%, as shown in Table V.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In the present investigation, a dataset of 3025 frames was generated, incorporating all the
factors of the gesticulation of the signs in each image. Each frame was processed to serve as input
for training the CNN model. It can be observed that among the different compositions of the CNN
model configurations there are very narrow differences; however, these disparities establish the
most optimal CNN model possible. With 10 iterations per batch, the model delivers (99, 84, 79)%
average recognition accuracy of the PSL in the training, validation, and testing stages,
respectively; with 30 iterations per batch it obtains (98, 88, 82)%; with 45 iterations per batch it
obtains (98, 89, 83)%; and with the last adjustment made, increasing the number of epochs from 30
in the first three experiments to 75 in the fourth experiment, it obtains (99, 88, 84)%. It should
be noted that with fewer iterations a higher training accuracy is obtained; however, this factor is
not as critical as validation and testing.</p>
      <p>This indicates that the number of frames to be trained per batch should be as small
as possible, so the 45 iterations per batch were maintained; consequently, 99% accuracy in
training was achieved, with an average of 88% accuracy in validation and an average of 84%
accuracy in the recognition of test data.</p>
      <p>This research is compared with the related works of other authors, as shown in Table
VI. An example of the images that make up each dataset can be seen in Fig. 9.</p>
      <p>
        Flores et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and Dadashzadeh et al. [22] worked with similar conditions in terms of
lighting during the day, afternoon, and night, and also with noise, complex background,
translation, and scaling, achieving accuracy percentages of 95.37% and 88.1%, respectively, in the
testing stage. It should be noted that Dadashzadeh et al. [22] added hand shape and location;
adding these gesture components affects the accuracy percentage.
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and [31] worked with between 6 and 9 gesture components under similar conditions in terms of
translation and scaling; however, [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and [31] added the hand shape and [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] the complex
background. Under these conditions they obtained 92.88%, 96.2%, and 73.4% accuracy in the
testing stage; as can be seen, working with more components, in this case 9 for the investigation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
also affects the accuracy percentage, decreasing its value compared to other investigations.
      </p>
      <p>The research presented here worked with a dataset made up of 11 components of the gesture:
7 invariant characteristics, 3 sign parameters, and the gestural space. Even when considering
these additional components for each frame in the dataset, 84% accuracy is obtained in the
testing stage.</p>
      <p>From the analysis of Table VI, we can establish that as more components are added to the
frames, the recognition task becomes more complex and a decreasing trend in accuracy is observed
in the testing stage; therefore, the 84% accuracy achieved by our proposal represents an advance
compared to the closest research, a model that was trained with 9 components and for which the
authors obtained 73.4%.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and future work</title>
      <p>It is difficult to find systems that recognize PSL static sign gestures with high accuracy.
The present research uses a dataset of 3025 frames that contain 11 gesture components, which
consider different aspects of real environments, and on which the CNN model was trained. The CNN
model is configured as follows: 4 convolution layers, 4 max-pooling layers, 1 dropout layer,
1 flattening layer, and 1 fully connected network with 1 input layer, 1 dropout layer, 3 hidden
layers, and 1 output layer.</p>
      <p>This configuration allowed us to achieve an accuracy of 84%. It was concluded that to have
greater coverage of real scenarios, some accuracy has to be sacrificed. Although the accuracy of
our proposal is not the highest in the literature, it is better prepared to recognize static signs
of the PSL in real scenarios.</p>
      <p>Digital image processing techniques were used; these techniques helped to better detect the
region containing the hand gesture, minimizing the error caused by the gestural components.</p>
      <p>As future work, we plan to continue training a model that is capable of recognizing static and
continuous signs of the PSL in real time.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[1] World Federation of the Deaf</source>
          .
          <article-title>(</article-title>
          <year>2016</year>
          ).
          <article-title>Advancing human rights and sign language worldwide</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X</given-names>
            <surname>Glottolog</surname>
          </string-name>
          .
          <article-title>(</article-title>
          <year>2022</year>
          ).
          <article-title>Pseudo family: Lenguaje de señas para sordos</article-title>
          .
          <source>Retrieved</source>
          <year>2022</year>
          -
          <volume>10</volume>
          -31, from https://glottolog.org/resource/languoid/id/deaf1237
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X</given-names>
            <surname>Glottolog</surname>
          </string-name>
          <article-title>Peru</article-title>
          . (
          <year>2022</year>
          ).
          <article-title>Pseudo family: Lenguaje de señas peruana</article-title>
          .
          <source>Retrieved</source>
          <year>2022</year>
          -
          <volume>10</volume>
          -31, from https://glottolog.org/resource/languoid/id/peru1235
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>DDHH. (n.d.).</surname>
          </string-name>
          <article-title>Defensoría del pueblo: debe facilitarse el aprendizaje de la lengua de señas peruana y promover la identidad lingüística y cultural de las personas sordas</article-title>
          .
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Crasborn</surname>
            ,
            <given-names>O. A.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Nonmanual Structures in Sign Language</article-title>
          .
          <source>InEncyclopedia of Language &amp; Linguistics</source>
          (pp.
          <fpage>668</fpage>
          -
          <lpage>672</lpage>
          ). Elsevier.https://doi.org/10.1016/b0-08-044854-2/
          <fpage>04216</fpage>
          -
          <lpage>4</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            , &amp;
            <surname>Martinez</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. M.</surname>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Recovering the linguisticcomponents of the manual signs in american sign language</article-title>
          .
          <source>In 2007ieee conference on advanced video and signal based surveillance</source>
          (pp.
          <fpage>447</fpage>
          -
          <lpage>452</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Ashrafuzzaman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saha</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Nur</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>Prediction of Stroke Dis-ease Using Deep CNN Based Approach</article-title>
          .
          <source>Journal of Advances in Informa-tion Technology</source>
          ,
          <volume>13</volume>
          (
          <issue>6</issue>
          ),
          <fpage>604</fpage>
          -
          <lpage>613</lpage>
          . https://doi.org/10.12720/jait.13.6.
          <fpage>604</fpage>
          -
          <lpage>613</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Al-Dmour</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tareef</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alkalbani</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hammouri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Alrahmani</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>Masked Face Detection and Recognition System Based on DeepLearning Algorithms</article-title>
          .
          <source>Journal of Advances in Information Technol- ogy</source>
          ,
          <volume>14</volume>
          (
          <issue>2</issue>
          ),
          <fpage>224</fpage>
          -
          <lpage>232</lpage>
          . https://doi.org/10.12720/jait.14.2.
          <fpage>224</fpage>
          -
          <lpage>232</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Satapathy</surname>
            ,
            <given-names>S. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S. H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y. D.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>A Survey on Artificial Intelligence in Chinese Sign Language Recogni-tion</article-title>
          .
          <source>Arabian Journal for Science and Engineering</source>
          ,
          <volume>45</volume>
          (
          <issue>12</issue>
          ),
          <fpage>9859</fpage>
          -
          <lpage>9894</lpage>
          .https://doi.org/10.1007/s13369-020-04758-2
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Deep atten- tionnetwork for joint hand gesture localization and recognition using staticRGB-D images</article-title>
          .
          <source>Information Sciences</source>
          ,
          <volume>441</volume>
          ,
          <fpage>66</fpage>
          -
          <lpage>78</lpage>
          . https://doi.org/10.1016/j.ins.
          <year>2018</year>
          .
          <volume>02</volume>
          .024
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Rao</surname>
            ,
            <given-names>G. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Syamala</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kishore</surname>
            ,
            <given-names>P. V. V</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A. S. C. S.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Deep Convolutional Neural Networks for Sign Language Recognition</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Vargas</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barba</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Mattos</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <source>Identification System of theSign Language Using Artificial Neural Networks.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Karami</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zanj</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Sarkaleh</surname>
            ,
            <given-names>A. K.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Persian sign language(PSL) recognition using wavelet transform and neural networks</article-title>
          .
          <source>ExpertSystems with Applications</source>
          ,
          <volume>38</volume>
          (
          <issue>3</issue>
          ),
          <fpage>2661</fpage>
          -
          <lpage>2667</lpage>
          . https://doi.org/10.1016/j.eswa.
          <year>2010</year>
          .
          <volume>08</volume>
          .056
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Bronstein</surname>
            ,
            <given-names>M. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agapito</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Rother</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Sign LanguageRecognition Using Convolutional Neural Networks</article-title>
          .
          <source>In Lecture Notesin Computer Science (including subseries Lecture Notes in ArtificialIntelligence and Lecture Notes in Bioinformatics)</source>
          (Vol.
          <volume>8927</volume>
          , p.
          <source>VI)</source>
          . Springer Verlag. https://doi.org/10.1007/978-3-
          <fpage>319</fpage>
          -16178-5
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Ameen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Vadera</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>A convolutional neural network to classify American Sign Language fingerspelling from depth and colour images</article-title>
          .
          <source>Expert Systems</source>
          ,
          <volume>34</volume>
          (
          <issue>3</issue>
          ). https://doi.org/10.1111/exsy.12197
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>X</given-names>
            <surname>Pugeault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            , &amp;
            <surname>Bowden</surname>
          </string-name>
          ,
          <string-name>
            <surname>R.</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Spelling it out: Real- time ASL fingerspelling recognition</article-title>
          .
          <source>2011 IEEE International Conference on Computer Vision</source>
          Workshops (
          <article-title>ICCV Workshops)</article-title>
          .
          <source>doi:10</source>
          .1109/iccvw.
          <year>2011</year>
          .6130290
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Rioux-Maldague</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Giguere</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Sign language fingerspelling classification from depth and color images using a deep belief network</article-title>
          .
          <source>Proceedings - Conference on Computer and Robot Vision</source>
          ,
          <string-name>
            <surname>CRV</surname>
          </string-name>
          <year>2014</year>
          ,
          <volume>92</volume>
          -
          <fpage>97</fpage>
          . https://doi.org/10.1109/CRV.
          <year>2014</year>
          .20
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Flores</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cutipa</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Enciso</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Application of Convo- lutional Neural Networks for Static Hand Gestures Recognition Under Different Invariant Features</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>