<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Real-time Hand Gesture Recognition System for Human-Computer and Human-Robot Interaction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valerio Ponzi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuele Iacobelli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Napoli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Janusz Starczewski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computational Intelligence, Czestochowa University of Technology</institution>
          ,
          <addr-line>al. Armii Krajowej 36, Czestochowa, 42-200</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer, Control and Management Engineering, Sapienza University of Rome</institution>
          ,
          <addr-line>Via Ariosto 25, Roma, 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute for Systems Analysis and Computer Science, Italian National Research Council</institution>
          ,
          <addr-line>Via dei Taurini 19, Roma, 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>52</fpage>
      <lpage>58</lpage>
      <abstract>
<p>The proposed hand gesture recognition (HGR) system is designed to enhance human-computer interaction (HCI) and human-robot interaction (HRI), which are crucial areas of research aimed at improving the way humans interact with computer and robot systems. With the growing need for intelligent computers and robots in a range of applications, including healthcare, manufacturing, and education, both HCI and HRI have gained significant importance. In this context, the HGR system plays a vital role by enabling natural and intuitive communication between humans and technology through hand gestures. The presented system uses a single camera and efficient image processing techniques that enable real-time gesture detection. Unlike other methods, our approach employs a basic video camera, which is widely available on most computers, eliminating the need for expensive and specialized hardware.</p>
      </abstract>
      <kwd-group>
<kwd>Hand Gesture Recognition</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Convolutional Neural Network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Hand gesture recognition (HGR) is a technology that enables the identification and interpretation of hand and finger movements in order to understand and respond to user actions. This technology analyzes the visual signals produced by hand gestures and finds the characteristic patterns connected to particular commands or actions using computer vision algorithms and machine learning techniques. With numerous applications ranging from virtual reality to industrial automation, HGR is a growing area of research and development.</p>
      <p>Hand gesture detection can be divided into two main categories: static and dynamic. Static HGR is the ability to detect the static position of the hands at a given moment. For example, it can be used to detect a hand pointing in a direction or to detect an open or closed hand. Dynamic HGR, on the other hand, refers to the ability to detect hand movements in real time, such as waving or finger movements. One of the main applications of hand gesture recognition is in human-computer interaction. Users can interact with devices in a more intuitive and natural way by employing hand gestures. For instance, without needing a real mouse or keyboard, hand gestures can be used to operate video games, move around virtual worlds, or carry out tasks on a computer screen. Controlling a computer mouse in this way offers a more flexible, intuitive, and natural way of interacting with the computer than traditional input devices, making it one of the most promising and practical applications. This technology can also benefit users with disabilities, injuries, or ergonomic issues that make it difficult or uncomfortable to use a conventional mouse, as well as those who prefer a more immersive and engaging way of navigating and manipulating digital content. Additionally, there are other potential uses for HGR in industries such as manufacturing, where it can be used to control machines and processes: workers can use hand gestures to activate machinery or control robotic arms, allowing for more efficient and safer manufacturing processes. In conclusion, HGR is a rapidly developing field that presents many chances to enhance how people interact with technology. The application-specific requirements and the trade-off between accuracy and user comfort determine the best hand gesture detection method.</p>
      <p>The paper proposes a real-time and computationally efficient hand gesture recognition system with four steps: Frame Recording, Hand Recognition, Hand Segmentation, and Gesture Recognition. It uses a simple algorithm to detect and segment hands and predict the executed gestures. In contrast to current approaches, the proposed hand gesture recognition system stands out for being less expensive and eco-friendly: instead of the complex hardware and sensors needed by traditional systems, it captures hand gestures using only a camera. This significantly lessens the requirement for additional resources, increasing the system’s sustainability and long-term cost-effectiveness.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Proposed Method</title>
      <p>a camera. This significantly lessens the requirement for
additional resources, increasing the system’s
sustainability and long-term cost-efectiveness.</p>
      <sec id="sec-2-1">
        <title>For the proposed system, a simple and eficient algorithm</title>
        <p>
          capable of working in real-time and with a small
computational efort is proposed. The system pipeline comprises
2. Related Works four main steps: Frame Recording, Hand Recognition,
Hand Segmentation, and Gesture Recognition.
SpecifThere are two main approaches to hand gesture recogni- ically, for each image captured by the camera, a hand
tion: Contact-based and Vision-based [
          <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
          ]. detection process is performed to identify the portions of
Contact-based methods involve the use of sensors on a the image where hands are present. Subsequently, a hand
glove to extract information about hand rotations, accel- segmentation step is conducted to generate a mask that
eration projections, and finger bending angles [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. This represents the shape of the detected hands. The resulting
approach can achieve high accuracy, especially after a mask is used as input for the Gesture Recognition step,
calibration process to adapt the sensors to the user’s hand. which predicts the executed gesture.
However, it can be costly and may not lead to a natural
interaction [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. On the other hand, Vision-based methods
use visual devices such as stereo cameras, time of flight
cameras, or Kinect sensors to extract depth information
and create a 3D representation of the scene. Monocular
systems with a single RGB camera have also been used in
recent periods. These methods are generally cheaper and
more adaptable than contact-based methods. Moreover
a relevant number of studies are tackling the problem
from the point of view of behavioural analysis and
theory of mind[
          <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
          ]. Over the years, various methods
have been proposed for hand gesture recognition. These
range from the simplest method of wearing a colored
glove [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] that is recognized by a video camera, to
methods that use skin color recognition [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] followed by hand
shape recognition. More advanced methods involve the
use of machine learning, such as Skeleton-Based
Recognition [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and Deep-Learning Based Recognition [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>Both contact-based and vision-based methods have their
advantages and disadvantages, and the choice of which
method to use depends on the specific application and
environment. Vision-based methods are typically used in
human-computer interaction and human-robot
interaction applications, while contact-based methods are more
commonly used in wearable devices for control purposes.</p>
      <p>Hand gesture technology has two primary areas of application: sign language recognition and video gaming. Sign language is a means of communication for individuals who are unable to speak, and it involves a sequence of hand gestures that represent letters, numbers, and expressions. Researchers have proposed several approaches for sign language recognition, including the use of gloves or uncovered hand interaction with a camera using computer vision techniques to identify the gestures [<xref ref-type="bibr" rid="ref14">14</xref>, <xref ref-type="bibr" rid="ref15">15</xref>]. In contrast, video gaming utilizes hand and body movements to interact with the game. The Microsoft Kinect for Xbox is an excellent example of gesture interaction for gaming purposes: it employs a camera placed over the screen, connected to the Xbox device through the cable port, to track the user’s hand and body movements [<xref ref-type="bibr" rid="ref16">16</xref>].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Method</title>
      <p>For the proposed system, a simple and efficient algorithm capable of working in real time and with a small computational effort is proposed. The system pipeline comprises four main steps: Frame Recording, Hand Recognition, Hand Segmentation, and Gesture Recognition. Specifically, for each image captured by the camera, a hand detection process is performed to identify the portions of the image where hands are present. Subsequently, a hand segmentation step is conducted to generate a mask that represents the shape of the detected hands. The resulting mask is used as input for the Gesture Recognition step, which predicts the executed gesture.</p>
      <fig id="fig1">
        <caption>
          <p>Figure 1: Pipeline scheme for the Hand Detection and Hand Segmentation steps.</p>
        </caption>
      </fig>
      <sec id="sec-3-1">
        <title>3.1. Hand Detection Step</title>
        <p>The Hand Detection step is implemented with the aim of generating a mask that represents the pixels corresponding to a hand in an RGB image, along with a set of points that indicate the centroids of the detected hand regions. This mask is obtained by combining two different masks, obtained respectively from color analysis in the HSV color domain and from foreground detection. The color analysis approach involves static thresholding of the image, using pre-defined skin limit values in the HSV domain that may be adjusted based on variations in skin tone or lighting conditions within the image. The threshold values for the Saturation and Value properties may vary from 0 to 255; however, for the Hue property, which represents the dominant color family, the range is limited from 6 to 28. Foreground detection is a well-established computer vision technique used to distinguish between dynamic and static pixels in image sequences by detecting moving objects: adjacent frames are analyzed to establish a model of the image’s background and identify the changes that occur. The mask generated up to this point is then applied to the original camera frame to produce the Hand Pixel Mask, which contains the pixels representing the possible detected hands.</p>
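        <p>A minimal sketch of this mask construction with OpenCV is shown below. The Hue range [6, 28] follows the text, while the Saturation and Value limits and the choice of the MOG2 background subtractor are illustrative assumptions, since the paper only states that adjacent frames are compared.</p>
        <preformat>import cv2
import numpy as np

# Background model for foreground detection (MOG2 is an assumption;
# the paper only says adjacent frames are analyzed).
bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=100, detectShadows=False)

# Pre-defined skin limits in HSV: Hue restricted to [6, 28] as in the text;
# the Saturation and Value bounds are illustrative placeholders in [0, 255].
HSV_LOWER = np.array([6, 40, 60], dtype=np.uint8)
HSV_UPPER = np.array([28, 255, 255], dtype=np.uint8)

def hand_pixel_mask(frame_bgr):
    """Combine skin-color and foreground masks, then apply them to the frame."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    skin_mask = cv2.inRange(hsv, HSV_LOWER, HSV_UPPER)
    fg_mask = bg_subtractor.apply(frame_bgr)
    combined = cv2.bitwise_and(skin_mask, fg_mask)
    # Apply the combined mask to the original frame: the result keeps only
    # the pixels that may belong to a hand (the Hand Pixel Mask).
    return cv2.bitwise_and(frame_bgr, frame_bgr, mask=combined)</preformat>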
        <p>The hands’ centroids are then determined using a clustering algorithm, specifically a k-means algorithm [<xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>], applied to the Hand Pixel Mask. However, tuning the parameter k is crucial to obtaining accurate results, and this parameter is chosen autonomously using the Elbow algorithm, which identifies the optimal value of k by minimizing the total intra-cluster distance. The Sum of Squared Distances (SSD), computed in this case as the squared sum of the distances between the pixels and their corresponding cluster centroids, is used to determine the best value for k: the process adds one cluster at a time and assesses whether the total SSD improves significantly over the previous value of k. Moreover, the distance of the hand from the camera can influence this measure, since the closer the hand is to the camera, the higher the pixel density in the image; the SSD is therefore normalized by the number of pixels present in the Hand Pixel Mask. Furthermore, the Elbow method relies on the slope of the resulting function, which represents the normalized SSD values obtained over the iterations on k. It is therefore essential to establish a slope threshold as the stopping criterion for the k-increment process once the function starts to become flat: when this occurs, the algorithm is interrupted and the previously stored value of k is returned. The threshold values for the slope and the normalized SSD play a critical role in the sensitivity of the system in detecting new clusters.</p>
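        <p>The following sketch shows how this Elbow selection of k can be implemented; the maximum k, the slope threshold, and the use of scikit-learn’s KMeans are illustrative assumptions, as the paper does not report these values.</p>
        <preformat>import numpy as np
from sklearn.cluster import KMeans

def choose_k(hand_pixels, max_k=6, slope_threshold=0.05):
    """Pick k with the Elbow criterion on the normalized SSD.

    hand_pixels: (N, 2) array of pixel coordinates from the Hand Pixel Mask.
    max_k and slope_threshold are illustrative values, not from the paper.
    """
    prev_ssd = None
    for k in range(1, max_k + 1):
        km = KMeans(n_clusters=k, n_init=10).fit(hand_pixels)
        # inertia_ is the SSD; normalize it by the number of mask pixels.
        ssd = km.inertia_ / len(hand_pixels)
        if prev_ssd is not None and prev_ssd - ssd &lt; slope_threshold:
            return k - 1  # the curve has flattened: keep the previous k
        prev_ssd = ssd
    return max_k</preformat>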
        <p>To minimize the occurrence of false-positive clusters, the proposed system includes an additional algorithm that matches the centroids computed in the current frame with those computed in the previous frame, using a distance metric such as the Euclidean distance. Since the value of k is recomputed in each frame and can vary over time, the mapping is not absolute, and it is possible for new centroid clusters to be missed in situations where the hands are in close proximity or overlapping. To make the system more robust to noisy effects, the new position of each centroid is computed as follows:</p>
        <p>newCentroidPos = centroidPos<sub>t-1</sub> + step · Δ, with Δ = detectedCentroidPos − centroidPos<sub>t-1</sub>.</p>
        <p>This approach enables the system to track the trajectory of each hand accurately in the image, even if a completely wrong observation is detected for a hand in some sporadic time steps: such an error does not significantly affect the results if enough frames per second are captured. Considerable effort was put into computing the correct hands’ centroids, since they are critical in eliminating any potential artifacts in the Hand Pixel Mask that represent other parts of the person’s skin.</p>
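        <p>A minimal sketch of this smoothed update follows; the step gain of 0.5 and the greedy nearest-neighbour matching are assumptions about details the paper leaves open.</p>
        <preformat>import numpy as np

def update_centroids(prev_centroids, detected_centroids, step=0.5):
    """Smooth each tracked centroid toward its nearest new detection.

    Implements newPos = prevPos + step * (detectedPos - prevPos);
    step=0.5 and the nearest-neighbour matching are illustrative choices.
    """
    updated = []
    for prev in prev_centroids:
        # Match by Euclidean distance to the closest detection.
        dists = np.linalg.norm(detected_centroids - prev, axis=1)
        delta = detected_centroids[np.argmin(dists)] - prev
        updated.append(prev + step * delta)
    return np.array(updated)</preformat>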
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Hand Segmentation Step</title>
        <p>The Hand Segmentation step is implemented with the aim of refining the output of the previous phase by generating an Adaptive Skin Mask that leverages the outputs of the Hand Detection step. This Adaptive Skin Mask is built using a more flexible threshold for selecting skin pixels, one that can adjust to the varying lighting conditions that may affect the hands over time, and thus provides greater flexibility than the fixed threshold used in the Hand Detection step. The Hand Pixel Mask is used to analyze the pixel distribution across various color domains, such as RGB, HSV, and YCbCr, through histogram analysis. Each domain produces a unique threshold based on the mean and variance of the found distributions, specifically:</p>
        <p>upperBound = mean + 2 · variance, lowerBound = mean − 2 · variance,</p>
        <p>and only the pixels that remain inside these bounds are considered skin pixels. By converting the original RGB image into the different domains and focusing on the region of interest (ROI) generated using the hands’ centroids, multiple masks can be generated. These masks are then combined using a logical AND, along with morphological operations, to improve the accuracy of the Adaptive Skin Mask. It is important to note that, in the case of multiple hand detections in the image, the pixel distribution analysis is performed on each ROI. This enables the system to adapt to different lighting effects that may affect the hands.</p>
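        <p>A sketch of this per-domain thresholding is given below; the bounds follow the mean ± 2 · variance rule stated above, while the 5x5 closing kernel is an illustrative choice.</p>
        <preformat>import cv2
import numpy as np

def adaptive_skin_mask(frame_bgr, hand_pixel_mask):
    """Derive mean/variance bounds per color domain and AND the masks.

    A sketch of the adaptive thresholding described above; the size of the
    morphological kernel is an illustrative choice.
    """
    domains = [
        frame_bgr,                                     # RGB (BGR order in OpenCV)
        cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV),    # HSV
        cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb),  # YCbCr
    ]
    combined = None
    for image in domains:
        skin = image[hand_pixel_mask &gt; 0]  # pixels flagged as hand
        mean, var = skin.mean(axis=0), skin.var(axis=0)
        lower = np.clip(mean - 2 * var, 0, 255).astype(np.uint8)
        upper = np.clip(mean + 2 * var, 0, 255).astype(np.uint8)
        mask = cv2.inRange(image, lower, upper)
        combined = mask if combined is None else cv2.bitwise_and(combined, mask)
    # Morphological closing removes small holes from the final mask.
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(combined, cv2.MORPH_CLOSE, kernel)</preformat>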
      <sec id="sec-2-2">
        <sec id="sec-3-3-1">
          <title>3.3.1. Dataset</title>
          <p>To train the Gesture Recognition model, a comprehensive dataset [<xref ref-type="bibr" rid="ref19">19</xref>] consisting of a total of 24,000 images of 20 distinct static hand gestures (Fig. 3) has been used. Specifically, the training dataset consists of 18,000 images, with 900 images corresponding to each gesture, while the remaining 6,000 images (300 for each gesture) are divided between the validation and test datasets. In addition, various data augmentation techniques, such as random rotation (within the range of +15° to −15°), padding, random cropping, and flipping, have been applied to increase the robustness of the trained model. However, as the images in this dataset are segmented by humans, they do not account for the potential noise that may be present in general images obtained through unsupervised algorithms. To address this limitation, salt-and-pepper noise with p=0.2 was introduced to better simulate real-world images and to increase the generalization power of the network.</p>
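          <p>The salt-and-pepper corruption can be reproduced in a few lines of NumPy, as sketched below; p=0.2 follows the text, while the equal split between salt and pepper is an assumption.</p>
          <preformat>import numpy as np

def salt_and_pepper(image, p=0.2, rng=None):
    """Corrupt an image (or mask) with salt-and-pepper noise.

    Each pixel is flipped with probability p (0.2 in the paper); half of the
    corrupted pixels become white (salt) and half black (pepper).
    """
    if rng is None:
        rng = np.random.default_rng()
    noisy = image.copy()
    r = rng.random(image.shape[:2])
    noisy[r &lt; p / 2] = 255                     # salt
    noisy[(r &gt;= p / 2) &amp; (r &lt; p)] = 0  # pepper
    return noisy</preformat>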
      </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Results</title>
      <sec id="sec-3-1">
        <title>Regarding the obtained results, a test accuracy of 93.8%</title>
        <p>was achieved after training the model for 15 epochs. The
accuracy and loss plots during the training and validation
phases are shown in Fig. 5, 6, respectively. These plots
indicate that the model was not overfitting the training
dataset. The Confusion Matrix (Fig. 7) demonstrates that
the model is highly capable of accurately predicting all
the diferent classes. The worst predicted class is the class
5, which is sometimes confused with the class 2 due to
their similarities, even under perfect conditions without
introducing noise (as shown in Fig. 3). This behavior is
also reflected in the F1 score shown in Table 1.
on testing the efectiveness of a robust convolutional
neural network (CNN) capable of extracting features even
in the presence of imprecise masks. By defining various
scenarios based on accuracy, it can be concluded that
5. Conclusions the proposed CNN model can still produce satisfactory
results in all classes.</p>
        <p>Our paper presented a potential solution for developing The proposed system could therefore have great
potenaccurate hand gesture recognition (HGR) system. Based tial for various applications from the most known such
on the results, it can be said that the proposed method has as human-computer interaction, virtual reality, and sign
shown high accuracy and real-time functionality. The language recognition to new ones. For example, during
test has indeed achieved an accuracy of 93.8% after train- the ongoing Covid-19 pandemic, a possible application
ing the model for 15 epochs. Despite the accurate detec- is the use of gesture recognition and mouse tracking in
tion and segmentation of hands, the research also focuses
hospitals, which can help reduce the spread of the virus
by minimizing contact with shared surfaces. With the
aid of this technology, hospital staf and patients can
interact with computer systems and medical equipment
without physically touching them. This can support a
more hygienic and efective hospital environment while
also assisting in the prevention of the virus and other
infectious diseases. Furthermore, individuals with physical
limitations or disabilities may benefit particularly from
the use of gesture-based interfaces because it makes it
possible for them to interact with technology in a more
organic and intuitive way. Therefore, hand gesture
recognition technology holds the promise of revolutionizing
healthcare and enhancing patient care.
52–58</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pepe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tedeschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Brandizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Human attention assessment using a machine learning approach with gan-based data augmentation technique trained using a custom dataset</article-title>
          ,
          <source>OBM Neurobiology 6</source>
          (
          <year>2022</year>
          ). doi:
          <volume>10</volume>
          . 21926/obm.neurobiol.
          <volume>2204139</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bianco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wajda</surname>
          </string-name>
          ,
          <article-title>Psychoeducative social robots for an healthier lifestyle using artificial intelligence: a case-study</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>3118</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>26</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>De Magistris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Caprari</surname>
          </string-name>
          , G. Castro,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Vision-based holistic scene understanding for context-aware humanrobot interaction</article-title>
          ,
          <source>Lecture Notes in Computer Science (including subseries Lecture Notes in Artiifcial Intelligence and Lecture Notes in Bioinformatics) 13196 LNAI</source>
          (
          <year>2022</year>
          )
          <fpage>310</fpage>
          -
          <lpage>325</lpage>
          . doi:
          <volume>10</volume>
          .1007/ 978-3-
          <fpage>031</fpage>
          -08421-8_
          <fpage>21</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brandizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gallotta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Unsupervised pose estimation by means of an innovative vision transformer</article-title>
          ,
          <source>Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 13589 LNAI</source>
          (
          <year>2023</year>
          )
          <fpage>3</fpage>
          -
          <lpage>20</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>031</fpage>
          -23480-
          <issue>4</issue>
          _
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Dipietro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Sabatini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dario</surname>
          </string-name>
          ,
          <article-title>A survey of glove-based systems and their applications</article-title>
          ,
          <source>IEEE Transactions on Systems, Man, and Cybernetics</source>
          , Part C (
          <article-title>Applications</article-title>
          and Reviews)
          <volume>38</volume>
          (
          <year>2008</year>
          )
          <fpage>461</fpage>
          -
          <lpage>482</lpage>
          . doi:
          <volume>10</volume>
          .1109/TSMCC.
          <year>2008</year>
          .
          <volume>923862</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , L.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z. Liu,</given-names>
          </string-name>
          <article-title>Survey on 3d hand gesture recognition</article-title>
          ,
          <source>IEEE transactions on circuits and systems for video technology 26</source>
          (
          <year>2015</year>
          )
          <fpage>1659</fpage>
          -
          <lpage>1673</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brandizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          , G. Galati,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Addressing vehicle sharing through behavioral analysis: A solution to user clustering using recency-frequencymonetary and vehicle relocation based on neighborhood splits</article-title>
          ,
          <source>Information (Switzerland) 13</source>
          (
          <year>2022</year>
          ). doi:
          <volume>10</volume>
          .3390/info13110511.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Marcotrigiano</surname>
          </string-name>
          , G. Stingi,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fregnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Magarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pasquale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          , G. Orsi,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>An integrated control plan in primary schools: Results of a field investigation on nutritional and hygienic features in the apulia region (southern italy)</article-title>
          ,
          <source>Nutrients</source>
          <volume>13</volume>
          (
          <year>2021</year>
          ). doi:
          <volume>10</volume>
          .3390/nu13093006.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brandizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brociek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wajda</surname>
          </string-name>
          ,
          <article-title>First studies to apply the theory of mind theory to green and smart mobility by using gaussian area clustering</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>3118</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>71</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamberti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Camastra</surname>
          </string-name>
          ,
          <article-title>Real-time hand gesture recognition using a color glove</article-title>
          ,
          <source>in: Image Analysis and Processing-ICIAP</source>
          <year>2011</year>
          : 16th International Conference, Ravenna, Italy,
          <source>September 14-16</source>
          ,
          <year>2011</year>
          , Proceedings,
          <source>Part I 16</source>
          , Springer,
          <year>2011</year>
          , pp.
          <fpage>365</fpage>
          -
          <lpage>373</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>K. B. Shaik</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Ganesan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Kalist</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Sathish</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. M. M. Jenitha</surname>
          </string-name>
          ,
          <article-title>Comparative study of skin color detection and segmentation in hsv and ycbcr color space</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>57</volume>
          (
          <year>2015</year>
          )
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Xi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Pei</surname>
          </string-name>
          , L. Liu,
          <article-title>Realtime hand tracking using kinect</article-title>
          ,
          <source>in: Proceedings of the 2nd International Conference on Digital Signal Processing, ICDSP</source>
          <year>2018</year>
          ,
          <article-title>Association for Computing Machinery</article-title>
          , New York, NY, USA,
          <year>2018</year>
          , p.
          <fpage>37</fpage>
          -
          <lpage>42</lpage>
          . URL: https://doi.org/10.1145/3193025. 3193056. doi:
          <volume>10</volume>
          .1145/3193025.3193056.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>John</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Boyali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Imanishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sanma</surname>
          </string-name>
          ,
          <article-title>Deep learning-based fast hand gesture recognition using representative frames</article-title>
          ,
          <source>in: 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. K.</given-names>
            <surname>Mitu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Bhuiyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <article-title>Hand gesture feature extraction using deep convolutional neural network for recognizing american sign language</article-title>
          ,
          <source>in: 2018 4th International Conference on Frontiers of Signal Processing (ICFSP)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>115</fpage>
          -
          <lpage>119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>R.-H. Liang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ouhyoung</surname>
          </string-name>
          ,
          <article-title>A real-time continuous gesture recognition system for sign language</article-title>
          ,
          <source>in: Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition</source>
          ,
          <year>1998</year>
          , pp.
          <fpage>558</fpage>
          -
          <lpage>567</lpage>
          . doi:
          <volume>10</volume>
          .1109/AFGR.
          <year>1998</year>
          .
          <volume>671007</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Marin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dominio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zanuttigh</surname>
          </string-name>
          ,
          <article-title>Hand gesture recognition with leap motion and kinect devices</article-title>
          ,
          <source>in: 2014 IEEE International conference on image processing (ICIP)</source>
          , IEEE,
          <year>2014</year>
          , pp.
          <fpage>1565</fpage>
          -
          <lpage>1569</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Magistris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rametta</surname>
          </string-name>
          , G. Capizzi,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Fpga implementation of a parallel dds for wide-band applications</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>3092</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ciancarelli</surname>
          </string-name>
          , G. De Magistris,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cognetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Appetito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nardi</surname>
          </string-name>
          ,
          <article-title>A gan approach for anomaly detection in spacecraft telemetries</article-title>
          ,
          <source>Lecture Notes in Networks and Systems 531 LNNS</source>
          (
          <year>2023</year>
          )
          <fpage>393</fpage>
          -
          <lpage>402</lpage>
          . doi:
          <volume>10</volume>
          .1007/ 978-3-
          <fpage>031</fpage>
          -18050-7_
          <fpage>38</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R.</given-names>
            <surname>Arya</surname>
          </string-name>
          ,
          <source>Hand gesture recognition dataset</source>
          ,
          <year>2020</year>
          . URL: https://www.kaggle.com/cihan063/ autism
          <article-title>-image-data.</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>