Real-Time Hand Gesture Recognition System

Aarti1,†, Swathi Gowroju2,∗,†, Raju Pal3,†, Vaddiraju Swathi2,† and Sirisha Yerraboina4,†

1 Lovely Professional University, Punjab, INDIA
2 Sreyas Institute of Engineering and Technology, Hyderabad, Telangana, INDIA
3 Jaypee Institute of Information Technology, Noida, Uttar Pradesh, INDIA
4 Matrusri Engineering College, Hyderabad, Telangana, INDIA

Abstract
A machine and a person can interact through hand gestures by using a hand gesture identification device. This work discusses Real-Time Hand Gesture Recognition (RTHGR) for carrying out system control actions as intended. With this application, the user's hand gestures can be detected by a webcam and basic actions taken as a result. The user makes a distinct gesture; the webcam records it, recognizes the gesture, and carries out the corresponding action from a list of recognized gestures. This process requires a binary threshold value to recognize gestures, and a neural network is employed in the proposed classification process. The effectiveness of this technique for operating various systems is assessed and compared with other hand recognition techniques.

Keywords
Gesture recognition, CNN, YOLO, Region of Interest, deep learning, object recognition

ACI'23: Workshop on Advances in Computational Intelligence at ICAIDS 2023, December 29-30, 2023, Hyderabad, India
∗ Corresponding author.
† These authors contributed equally.
aarti.1208@gmail.com (Aarti); swathigowroju@sreyas.ac.in (S. Gowroju); raju3131.pal@gmail.com (R. Pal)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction
The world is going through a technological revolution. The technological capability of human beings is advancing faster than ever, which has allowed various computer systems to become an integral part of our everyday activities. Different computer systems have different methods of interacting with users, known in technical terms as human-computer interaction (HCI). It is a very important aspect of the interactivity, usefulness, practicality, and overall user experience of a computer system. In the past few years, user experience has become the primary focus of systems designed for human use, because the effectiveness of a system is largely measured by how it interacts with the user to make the overall experience easier and more pleasant. In the last decade, hardware such as the keyboard, mouse, and touch-screen has been crucial to how people engage with technology. However, new forms of engagement tools have been created as a result of the rapid advancement of technology. In the realm of HCI, technologies like thought processing, gesture recognition, and speech recognition have advanced significantly. Gesture recognition is the one addressed in our proposed system: hand gestures are used as communication between humans and electronic devices. It differs significantly from conventional hardware-based techniques, achieving human-computer interaction on an entirely different level. The subject of computer vision and image processing has been substantially altered by convolutional neural networks (CNNs), a significant advancement in artificial intelligence and machine learning.
These neural networks were created expressly to tackle visual perception, one of the most difficult and inherently human tasks. CNNs have advanced beyond standard machine learning techniques by imitating the hierarchical, feature-driven way in which we, as humans, perceive and recognize patterns in the visual environment. CNNs were inspired by the complex workings of the human visual system. In doing so, they have not only allowed machines to "see" images but also unlocked astonishing abilities to understand, categorize, and extract valuable information from them. CNNs operate on small, overlapping regions of the image known as receptive fields, which allow the network to capture local patterns and gradually build a rich hierarchical representation of the input data. CNNs are a class of deep neural networks distinguished by their unique architecture, which includes convolutional layers and pooling layers. A wide range of fields have been significantly impacted by the introduction of CNNs: they have found use in object identification, facial recognition, autonomous cars, medical image analysis, and more.
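To make this architecture concrete, the following is a minimal sketch of such a network in Python with Keras. It is an illustration only, not the authors' published code; the layer sizes are assumptions, while the 128×128 input and the ten gesture classes match values reported in Sections 3.4 and 4.1.

# Minimal CNN sketch for ten static hand gestures (illustrative only;
# layer sizes are assumptions, not the authors' published architecture).
from tensorflow.keras import layers, models

def build_gesture_cnn(input_shape=(128, 128, 1), num_classes=10):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        # Convolutional layers scan small, overlapping receptive fields.
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),  # pooling condenses local responses
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),          # dropout, as discussed in Section 4.3
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model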
2. Literature Survey
The subject of hand gesture recognition is expanding quickly, with many implementations employing both deep learning and machine learning algorithms to identify a gesture exhibited by a human hand. The studies [1, 2] showed that CNNs, among the common and emerging machine learning designs, achieved faster rates of successfully perceiving components at negligible computational cost. The suggested strategy focused primarily on movements appearing in two sets of pictures, with and without hand gestures, and followed instances of hand occlusion across 24 movements. It used a segmentation algorithm together with back-propagation to train a multi-layer network, propagating errors backwards from the output nodes to the input nodes. Among these approaches, Hidden Markov Models (HMMs) are a popular technique employed by a number of other detection applications. The system proposed in this article operates with all of the detection modalities frequently employed by such applications, identified by studying and examining other papers: image, video, and webcam are all addressed. According to Francois et al. [3], the posture detection application they describe employs video and HMMs. Whether based on CNNs, RNNs, or some other technique, the aforementioned approaches all employ fitting techniques that reference the bounding box [4] covered in this study. The predicted output is determined by the highest confidence value, which is derived from the bounding box representing the detected data. Certain additional tools and methods connected to segmentation, general localization, and even the union of other areas aid in accomplishing the detection and recognition tasks. A fuzzy-based human behaviour recognition model was developed based on body gestures [5]. Rahim et al. [6] converted signed-language word gestures into text. The authors of that work used skin-mask segmentation and a basic CNN model to extract features. A support vector machine, a supervised learning method for regression and classification problems, was used to classify the signs' movements with 95.28% accuracy on a dataset of 10 one-handed and 8 two-handed movements [7]. Mambou et al. [8] analysed night-time indoor and outdoor hand gestures connected to sexual assault; their gesture recognition system was built on the YOLO CNN architecture, which extracted hand motions and then classified bounding-box images to raise the assault alert. Moving object classification was done using eigenfaces and optical flow approaches [9]. Decoding gestures or finger-spelling from videos, in which multiple letters are signed in series to form meaningful words, is a hard process of identifying finger spellings in uncut sign language footage; Ashiquzzaman et al. [10] addressed it with a lightweight spatial pyramid pooling (SPP) CNN model. The model performed 3 times faster than conventional models and required 65% fewer parameters than conventional classifiers. Benitez-Garcia et al. [11] used a lightweight semantic segmentation network, the Fast and Accurate Semantic Segmentation Dilated-Net, in place of Temporal Segment Networks (TSN), which are predicated on the notion of modelling long-range temporal structures, together with Spatiotemporal Shift Modules (SSM). On a dataset of thirteen gestures aimed at real-time interaction with touchless screens, they demonstrated the effectiveness of the idea. Several other CNNs [12, 13, 14, 2, 15] implemented mark-based prediction with accuracy up to 98% for various biometric applications. Most of the publications [16, 17] concentrate on three essential components of the vision-based hand-gesture identification system: data gathering, surroundings, and hand gesture representation. We have also evaluated the performance of vision-based hand-movement recognition systems in terms of recognition precision. For the signer-dependent case, a CNN was trained on a total of 21 static ISL alphabets, yielding verification and testing accuracy of 97.34% and training accuracy of 98.50%. The signer-independent identification accuracy, on the other hand, varies from 50% to 90%, with an average recognition accuracy of 78.2%, according to the studies that were chosen. Musa et al. developed models to identify and trace suspicious activities based on body movements [18, 19].

3. Proposed System
In this investigation of CNNs, we expand on their structural elements, the principles behind their success, and the various applications that make use of their capabilities. We show how these networks have changed computer vision and ushered in a new era of artificial intelligence by enabling machines to understand and interact with the visual environment. CNNs have proven to be quite successful at recognizing hand gestures: in this context they are employed to automatically extract features from pictures or video frames containing hand gestures in order to effectively categorize and interpret these motions. Fig. 1 shows the various steps of the proposed hand gesture recognition system.

Figure 1: Proposed System
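Before each stage is detailed below, a minimal sketch shows how they might be chained at run time. The webcam index, confidence gate, and gesture-to-action table are hypothetical placeholders, not the paper's implementation.

import cv2
import numpy as np
# build_gesture_cnn is the sketch from Section 1; in practice a trained
# model would be loaded instead of a freshly built one.
model = build_gesture_cnn()

ACTIONS = {0: "volume_up", 1: "volume_down"}  # hypothetical action map

def preprocess(frame):
    """Grey-scale, resize to 128x128, and scale to [0, 1]."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, (128, 128))
    return resized.astype(np.float32)[None, ..., None] / 255.0

cap = cv2.VideoCapture(0)  # default webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    probs = model.predict(preprocess(frame), verbose=0)[0]
    gesture = int(np.argmax(probs))  # highest-confidence class
    if probs[gesture] > 0.8:         # assumed confidence gate
        print(ACTIONS.get(gesture, "no_action"))
    cv2.imshow("RTHGR", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()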
3.1. Data Gathering and Pre-processing:
A dataset of hand gesture photos or video frames is normally gathered in order to train a CNN for hand gesture identification. This dataset should contain a variety of hand gestures made by various people in various situations. Preparing the data is a crucial step before training and testing. Each image was duplicated, to cover both hands, by flipping it horizontally, occasionally taking the corresponding image from both hands to make the set more precise. The proposed system uses the YOLO architecture on a dataset of about 230 images, of which 100 were utilised for testing under a 2-fold split; 15 more pictures were also captured and labelled for the testing set. Data pre-processing is essential before further processing so that we can identify the kind of data we collected and which parts will be useful for developing, evaluating, and enhancing accuracy. In the label files, A represents the X value, B the Y value, C the width, and D the height. Table 1 depicts how these files appear when we label our dataset in order to train it on the desired model. Each line includes five different fields, each with its own significance. The class ID is the first item on the left; the remaining four values are the coordinates that define the labelled box around the gesture, namely the x-axis and y-axis values specifying the position of the box, together with its width and height.

Table 1
Data acquisition from the data file of YOLO

Classification Id   A          B          C          D
0                   0.531771   0.490234   0.571875   0.794531
1                   0.498437   0.533203   0.571875   0.905469
2                   0.523438   0.579297   0.613542   0.819531
3                   0.526563   0.564453   0.473958   0.819531
4                   0.498611   0.587891   0.977778   0.792969
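For illustration, a label line such as those in Table 1 can be read back as follows. The helper below is hypothetical, and the interpretation of the four values follows the standard YOLO convention that we assume the dataset uses.

def parse_yolo_label(line):
    """Split one YOLO label line into a class id and the four
    normalised box values (A, B, C, D from Table 1)."""
    parts = line.split()
    class_id = int(parts[0])
    x, y, w, h = map(float, parts[1:5])
    return class_id, (x, y, w, h)

# First row of Table 1:
cid, box = parse_yolo_label("0 0.531771 0.490234 0.571875 0.794531")
print(cid, box)  # -> 0 (0.531771, 0.490234, 0.571875, 0.794531)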
3.2. Gesture Segmentation:
Gesture training is made more challenging by the fact that gesture data is recorded in various places, under various lighting conditions, and at various times of day. In the colour image pipeline, the RGB colour space data is transformed into the YCbCr colour space. This conversion separates the chroma components (Cb, Cr) from the brightness, effectively mitigating interference from brightness characteristics:

\[
\begin{bmatrix} Y \\ C_b \\ C_r \end{bmatrix}
=
\begin{bmatrix}
0.21 & 0.7123 & 0.0692 \\
-0.1146 & -0.3785 & 0.49 \\
0.48 & -0.4653 & -0.0456
\end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix},
\qquad
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 1.5738 \\
1 & -0.1873 & -0.4651 \\
1 & 1.8492 & 0
\end{bmatrix}
\begin{bmatrix} Y \\ C_b \\ C_r \end{bmatrix}
\tag{1}
\]

Using Eq. 1, the Cr channel is extracted. The data image is then subjected to Gaussian filtering: the 2D Gaussian distribution of Eq. 2, characterized by its bell-shaped curve, is used to limit the impact of Gaussian noise, which is essentially random noise with a mean of zero:

\[
f(i, j) = f(i)\, f(j)
= \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{(i - u_i)^2}{2\sigma_i^2}}
\cdot \frac{1}{\sqrt{2\pi}\,\sigma_j}\, e^{-\frac{(j - u_j)^2}{2\sigma_j^2}}
\tag{2}
\]

Eq. 2 belongs to the bivariate Gaussian distribution family, with i and j representing the two dimensions and u_i, u_j, σ_i, and σ_j representing the mean and standard deviation parameters for each dimension. As σ increases, the differences between the Gaussian template coefficients become smaller and the smoothing effect on the image becomes more noticeable.

3.3. Seed Filling Algorithm:
To separate touching objects in the image, the proposed system uses a mark-based watershed algorithm, which assigns pixels to regions based on proximity to markers and yields segmented regions delineated by watershed boundaries; it is applied to the image samples processed by skin colour detection. The algorithm connects neighbouring pixels with similar grey values, forming contours for image segmentation. This approach is especially effective for images with noise and irregular gradients, simplifying segmentation by highlighting distinct intensity patterns in the image. Since the standing water level of a connected component can be raised like a dam, the mark-based watershed algorithm can prevent small local edges from being merged and becoming inseparable. Through the supervision of the mark-based watershed algorithm, the gesture features may be efficiently segmented. Once the gesture features have been accurately segmented, the 8-connected seed filling method is used to fill in the sporadic gesture portions. The 8-connected seed filling algorithm is an enhancement of the four-connected filling process: starting from a seed at the centre of the region, it spreads in eight directions rather than four, expanding to cover all of the region's pixels and speeding up the process. This 8-connected seed filling is used to obtain the sign feature data.
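A hedged OpenCV sketch of Sections 3.2 and 3.3 follows. The Otsu threshold, tolerance values, and centre-of-mass seed are assumptions made for illustration, and cv2.floodFill with 8-connectivity stands in for the 8-connected seed filling described above.

import cv2
import numpy as np

def segment_gesture(bgr):
    # Sec. 3.2: convert to YCrCb and keep the Cr channel, which is far
    # less sensitive to brightness than raw RGB.
    cr = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)[:, :, 1]
    # Sec. 3.2: Gaussian filtering suppresses zero-mean Gaussian noise.
    cr = cv2.GaussianBlur(cr, (5, 5), 0)
    # Binary threshold (Otsu assumed; the paper's exact value is not given).
    _, mask = cv2.threshold(cr, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Sec. 3.3: 8-connected seed filling from a seed at the region centre.
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return mask
    seed = (int(xs.mean()), int(ys.mean()))  # assumed centre seed
    ff_mask = np.zeros((mask.shape[0] + 2, mask.shape[1] + 2), np.uint8)
    cv2.floodFill(mask, ff_mask, seed, 255,
                  loDiff=10, upDiff=10, flags=8)  # 8 = 8-connectivity
    return mask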
3.4. Normalization and Gesture Labelling:
After the image data has been filled in and segmented, a scale normalisation operation guarantees the reliability of feature extraction, and the data are labelled with clear gesture features so that the model can be trained efficiently. In the proposed method, the segmented and filled picture data are normalised to 128×128, which effectively increases the model's performance, speed, and accuracy during training and prevents gradient explosion.

3.5. Evaluation and Fine-Tuning:
Following training, the model is evaluated on a separate validation or test dataset to gauge its performance in terms of accuracy and generalization. To attain the appropriate level of accuracy, the model may need to be fine-tuned and its hyper-parameters adjusted.

3.6. Post-processing:
Post-processing techniques may be used to improve the precision of gesture detection in real-world scenarios or to smooth predictions over time.

4. Results & Analysis
The identification and classification operations were implemented on a computer with an AMD Ryzen 5 processor running a 64-bit version of Windows 10, using OpenCV and PIL for image processing and Anaconda to create an interactive interface.

4.1. Experimental Analysis:
The dataset is collected from publicly available sources, with 250 images gathered for each gesture. The case study's trained model is capable of identifying ten distinct movement types, ranging from 0 to 9. First, a total of 1980 data samples representing twenty types of gestures are gathered for each class from 10 individuals over a range of time periods. The training outcome for gestures 3, 4, and 5 is poor; the training set can be significantly improved by including more sample data. As a result, 100 samples are added to each category of gestures, which considerably raises the success rate of recognition.

Figure 2: Sample images from dataset
Figure 3: Sample data for same gesture

A comprehensive dataset is necessary to enhance the training effect. The position in the photograph will differ for the same motion when performed by different people, at different angles, and under variable lighting conditions. Fig. 3 displays various data for the same move.

In the detection rate test, 150 test specimens were collected across various time periods; 95 of the samples were successfully identified, 7 experienced recognition mistakes, and 1 did not exhibit any gesture. The outcomes in the very dark environment were the weakest, with four times as many mistakes in the prediction of gestures 2 and 3. This is because the gesture features cannot be distinguished in that setting and because gestures 4 and 5 are too similar to one another. In a typical context, the model generated by the data pre-processing method can attain an optimal prediction rate by improving the gesture characteristics. Data from four groups of samples is collected for the model's robustness and stability tests. The analysis result is shown in Fig. 4.

Figure 4: Evaluation results of proposed system

We have introduced a real-time hand gesture recognition system in this research that makes use of cutting-edge computer vision and machine learning algorithms to precisely interpret and categorize hand gestures in practical settings. Our study has significantly advanced the field of gesture recognition in a number of ways.

4.2. Robustness and Accuracy
We have shown through thorough experimentation and evaluation that our system achieves high accuracy in recognizing a wide variety of hand gestures. While CNNs excel at extracting spatial features from grid-like data such as images, RNNs are well suited to sequential data with temporal dependencies; the choice between them depends on the nature of the task and the characteristics of the input data. CNNs, among other deep learning models, have considerably increased the system's capacity to handle complicated and dynamic movements. The main goal of our research was to create a system that could process data in real time. Our system can now process and recognize gestures in real time, and such enabling technologies are likely to play a crucial role in improving user experiences and interactions in these areas. This indicates that we have successfully accomplished our goal.

4.3. Comparison of each group
In a typical context, the model produced by the data pre-processing approach can reach an optimal recognition rate by improving the gesture features. Four sample groups of test data, each with 30 test data graphs, are randomly chosen from the test dataset in order to assess the model's stability and robustness. Each data group was recognized and put to the test; the model exhibits strong resilience and stability. Table 2 presents the comparative findings.

Table 2
Accuracy comparison

Group No            Recognition Accuracy Rate (%)
The 1st Category    95.62
The 2nd Category    94.47
The 3rd Category    96.00
The 4th Category    98.12

Dropout is used to prevent over-fitting, which also accelerates the model's convergence rate. After numerous trials, we found that increasing the batch size in the training process can lower the number of epochs and accelerate training; however, the reduced number of iterations may fail to capture the data characteristics effectively, which would lower the prediction rate. After testing, a batch size of 35 aligns training time and model consistency well. Training is stopped early once the saturation point is reached.
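As a hedged sketch of this training configuration (the placeholder data and callback settings below are assumptions; only the batch size of 35, the dropout, and the early stopping come from the text):

import numpy as np
from tensorflow.keras.callbacks import EarlyStopping

# Placeholder tensors with the paper's reported shapes; real training
# would load the normalised 128x128 gesture images instead.
x_train = np.random.rand(200, 128, 128, 1).astype("float32")
y_train = np.random.randint(0, 10, size=200)
x_val, y_val = x_train[:40], y_train[:40]

# Stop once the validation loss saturates, as described above.
early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)

model = build_gesture_cnn()  # sketch from Section 1; includes Dropout
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          batch_size=35,     # value reported above
          epochs=100,        # upper bound; early stopping halts sooner
          callbacks=[early_stop])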
The loss curve comparison chart for various batch values is shown in Fig. 5.

Figure 5: Loss and accuracy graphs of training and testing

Three people are chosen at random to participate in the laboratory testing to confirm the acceptance of gesture prediction across different participants in the proposed system. Each of the 10 motions must be tested 500 times, with each tester testing each position nearly 50 times, by cross-folding into two groups and testing each gesture. If testing takes place during the day, each group must take it 15 times, and if it takes place at night, each group must take it 250 times. The three testers' respective recognition rates for the ten gestures are 94.4%, 94.6%, and 93.8%.

5. Conclusion
In this study, the gesture data is pre-processed using skin colour recognition, marker-based watershed algorithms, and seed filling algorithms in order to generate gesture data with clear gesture features. The proposed system performed accurately on the test data, achieving 98.66% under conditions of ordinary light, using training on 10 different types of gesture data following YOLO convolutional neural network pre-processing. The pre-processing technique applied in this study effectively mitigates the influence of the surrounding background on gesture recognition and detection. Notably, its implementation does not necessitate additional training or detection time. Additionally, the post-processing step would depend on the specific requirements of our gesture recognition task.
References
[1] R. Khan, N. A. Zaman, Hand gesture recognition: a literature review, International Journal of Artificial Intelligence & Applications 3 (2012).
[2] A. Swathi, Aarti, S. Kumar, A smart application to detect pupil for small dataset with low illumination, Innov. Syst. Softw. Eng. 17 (2021) 29–43.
[3] A. Caputo, A. Giachetti, S. Soso, D. Pintani, A. D'Eusanio, S. Pini, G. Borghi, A. Simoni, R. Vezzani, R. Cucchiara, A. Ranieri, F. Giannini, K. Lupinetti, M. Monti, M. Maghoumi, J. J. LaViola Jr, M.-Q. Le, H.-D. Nguyen, M.-T. Tran, SHREC 2021: Track on skeleton-based hand gesture recognition in the wild (2021). arXiv:2106.10980.
[4] M. Chmurski, G. Mauro, A. Santra, M. Zubert, G. Dagasan, Highly-optimized radar-based gesture recognition system with depthwise expansion module, Sensors (Basel) 21 (2021) 7298.
[5] P. Agrawal, V. Madaan, N. Kundu, D. Sethi, S. K. Singh, X-hubis: A fuzzy rule based human behaviour identification system based on body gestures, Indian Journal of Science and Technology (2016) 1–6.
[6] M. A. Rahim, M. R. Islam, J. Shin, Non-touch sign word recognition based on dynamic hand gesture using hybrid segmentation and CNN feature fusion, Appl. Sci. (Basel) 9 (2019) 3790.
[7] N. Mohamed, M. B. Mustafa, N. Jomhari, A review of the hand gesture recognition system: Current progress and future directions, IEEE Access 9 (2021) 157422–157436.
[8] Mambou, Krejcar, Maresova, Selamat, Kuca, Novel hand gesture alert system, Appl. Sci. (Basel) 9 (2019) 3419.
[9] P. Agrawal, R. Kaur, V. Madaan, M. S. Babu, D. Sethi, Moving object detection and recognition using optical flow and eigen face using low resolution video, Recent Advances in Computer Science and Communications (Formerly: Recent Patents on Computer Science) 13 (2020) 1180–1187.
[10] A. Ashiquzzaman, H. Lee, K. Kim, H.-Y. Kim, J. Park, J. Kim, Compact spatial pyramid pooling deep convolutional neural network based hand gestures decoder, Appl. Sci. (Basel) 10 (2020) 7898.
[11] M. Gibran, Y. Haris, N. Tsuda, Continuous finger gesture spotting and recognition based on similarities between start and end frames, IEEE Transactions on Intelligent Transportation Systems 23 (2020) 296–307.
[12] S. Gowroju, S. Kumar, Aarti, A. Ghimire, Deep neural network for accurate age group prediction through pupil using the optimized UNet model, Math. Probl. Eng. 2022 (2022) 1–24.
[13] S. Gowroju, Aarti, S. Kumar, Robust pupil segmentation using UNET and morphological image processing, in: 2021 International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC), IEEE, 2021.
[14] S. Gowroju, S. Kumar, Aarti, A. Ghimire, Deep neural network for accurate age group prediction through pupil using the optimized UNet model, Math. Probl. Eng. 2022 (2022) 1–24.
[15] S. Gowroju, Aarti, S. Kumar, Review on secure traditional and machine learning algorithms for age prediction using IRIS image, Multimed. Tools Appl. (2022).
[16] N. Mohamed, M. B. Mustafa, N. Jomhari, A review of the hand gesture recognition system: Current progress and future directions, IEEE Access 9 (2021) 157422–157436.
[17] M. Chmurski, G. Mauro, A. Santra, M. Zubert, G. Dagasan, Highly-optimized radar-based gesture recognition system with depthwise expansion module, Sensors (Basel) 21 (2021) 7298.
[18] A. S. Ben-Musa, S. K. Singh, P. Agrawal, Object detection and recognition in cluttered scene using Harris corner detection, in: 2014 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), 2014, pp. 181–184. doi:10.1109/ICCICCT.2014.6992953.
[19] A. S. B. Musa, S. K. Singh, P. Agrawal, Suspicious human activity recognition for video surveillance system, in: Proceedings of the 2014 International Conference on Control, Instrumentation, Communication and Computational Technologies, IEEE, 2014, pp. 214–218.
[20] A. Swathi, S. Kumar, V. Subbamma, S. Rani, A. Jain, R. Kumar, Emotion classification using feature extraction of facial expression, in: 2022 2nd International Conference on Technological Advancements in Computational Sciences (ICTACS), IEEE, 2022.