A Polite Robot: Visual Handshake Recognition Using Deep Learning

Liutauras Butkus, Mantas Lukoševičius
Faculty of Informatics, Kaunas University of Technology, Kaunas, Lithuania
liutauras.butkus@ktu.edu, mantas.lukosevicius@ktu.lt

Abstract—Our project was to create a demo system in which a small humanoid robot accepts an offered handshake when it sees one. The visual handshake recognition, which is the main part of the system, proved to be not an easy task. Here we describe how, and how well, we solved it using deep learning. In contrast to most gesture recognition research, we did not use depth information or videos, but worked on static images: we wanted to use a simple camera, and our gesture is rather static. We collected a special dataset for this task. Different configurations and learning algorithms of convolutional neural networks were tried. However, the biggest breakthrough came when we eliminated the background and made the model concentrate on the person in front. In addition to our experiment results, we can also share our dataset.

Keywords—image recognition, computer vision, deep learning, convolutional neural networks, robotics

I. INTRODUCTION

The goal of this project is to create a robot that can visually recognize an offered handshake and accept it. When the robot sees a person offering a handshake, it responds by stretching out its arm too. This serves as a visual and interactive demonstration, intended to get students more interested in machine learning and robotics.

For this purpose we used a small humanoid robot, a simple camera mounted on it, and deep convolutional neural networks for image recognition. The recognition, as well as its training, was done on a PC, and the command to raise the arm was sent back to the robot.

This article mainly shares our experience in developing and training the visual handshake recognition system, which proved not to be trivial. In particular, we discuss how the images were collected and preprocessed, what architecture of convolutional neural networks was used, how it was trained and tested, and what gave good and not so good results.

The document is divided into several sections. Section II reviews existing solutions to similar problems. Section III introduces our method. Section IV describes the dataset used in this study. Section V emphasizes the importance of data preprocessing before training. Section VI describes the robot interface. Sections VII and VIII provide an analysis of the results and the conclusions.

II. RELATED WORK

Virtually all vision-based hand gesture recognition systems described in the literature use (a) image sequences (videos) with (b) depth information in them; see [1] for a good recent survey. Microsoft Kinect [2] and Leap Motion [3] are two examples of popular sensors specifically designed for 3D gesture and posture tracking. While clearly both the temporal (a) and depth (b) aspects are helpful in recognizing hand gestures, our system uses neither of the two. We (b) used an inexpensive camera for simple RGB image acquisition, to make the system more accessible and the algorithms more widely applicable, e.g., in smartphones and under natural lighting. We also (a) used single frames to recognize the extended hand for the handshake, since the gesture is rather static – just holding the extended hand still – and could perhaps be called a "posture". This makes the recognition problem considerably harder.

A project somewhat similar to ours, "Gesture Recognition System using Deep Learning", was presented at the PyData Warsaw 2017 conference [4]. The author introduced a Python-based deep learning gesture recognition model that is deployed on an embedded system, works in real time, and can recognize 25 different hand gestures from a simple webcam stream. The development of the system included a large-scale crowd-sourcing operation to collect over 150,000 short video clips, a process to decide which deep learning framework to use, the development of a network architecture that classifies video clips solely from RGB input frames, the iterations necessary to make the neural network run in real time on embedded devices, and, lastly, the discovery and development of playful gesture-based applications. Their approach still differs from ours in that they used video samples (several frames at a time) as input and tried to recognize moving gestures.

There is also considerable literature close to our approach in both (a) and (b), recognizing sign language hand gestures (or rather postures) from RGB images, including with deep learning [5]. These approaches, however, usually work with images of a single hand on a uniform background, where the hand can be cropped from the image using thresholding [5], skin color [6], or by relying on the subject wearing a brightly colored glove [7].
III. OUR METHOD

Fig. 1. System training model.

Fig. 2. System running model.

The system consists of several parts, shown in Figures 1 and 2: the camera, preprocessing of the camera images, training of the convolutional neural network using deep learning, a graphical user interface, the robot interface, and the robot itself.

At first the camera was used to collect the image dataset; this is described in Section IV. After later research, described in Section V, the images in the dataset had to be preprocessed before the model could be trained, which is the next part of our system. Using the Keras library the model was created, compiled, and finally trained on the images (this process is explained in subsection C below and in Section VII). The final part is running the model to recognize new live images. For this purpose the camera interface was programmed to take a photo every 0.5 seconds; the model receives these images as input and returns the probability of seeing an offered handshake as output. If this result is above a certain threshold, the robot interface sends a command to the robot to perform the corresponding action. This part is described in more detail in Section VI.
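To make the running mode concrete, the sketch below illustrates this loop under a few assumptions: OpenCV is used for image capture, the trained Keras model is stored in a file named handshake_model.h5, and send_raise_arm_command() stands in for the robot interface described in Section VI. The file name, the preprocessing details, the threshold value, and the helper function are hypothetical; only the 0.5-second interval and the probability-threshold logic come from the description above.

```python
# A minimal sketch of the running loop described above (not our exact code).
# Assumptions: OpenCV ("cv2") is used for capture, the trained Keras model is
# stored in "handshake_model.h5", and send_raise_arm_command() is a
# hypothetical placeholder for the robot interface call (Section VI).
import time

import cv2
import numpy as np
from keras.models import load_model

MODEL_INPUT_SIZE = (64, 40)   # width x height, as used by our network
THRESHOLD = 0.5               # decision threshold (an assumed value)

def send_raise_arm_command():
    """Hypothetical stand-in for the robot interface (see Section VI)."""
    print("Robot: raising arm")

model = load_model("handshake_model.h5")
camera = cv2.VideoCapture(0)

while True:
    ok, frame = camera.read()
    if not ok:
        break
    # Resize to the network input resolution and scale pixel values to [0, 1].
    small = cv2.resize(frame, MODEL_INPUT_SIZE)
    batch = np.expand_dims(small.astype("float32") / 255.0, axis=0)
    probability = float(model.predict(batch)[0][0])
    if probability > THRESHOLD:
        send_raise_arm_command()
    time.sleep(0.5)           # one photo every 0.5 seconds
```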
A. Choice of deep learning libraries

Deep learning [8] (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, semi-supervised, or unsupervised. Deep learning models are loosely related to information processing and communication patterns in biological nervous systems, such as neural coding, which attempts to define a relationship between various stimuli and the associated neuronal responses in the brain. Deep learning architectures such as deep neural networks, deep belief networks, and recurrent neural networks [9] have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, and drug design, where they have produced results comparable to, and in some cases superior to, those of human experts.

Convolutional networks [8], also known as convolutional neural networks or CNNs, are a specialized kind of neural network for processing data that has a known grid-like topology. Examples include time-series data, which can be thought of as a 1-D grid of samples taken at regular time intervals, and image data, which can be thought of as a 2-D grid of pixels. Convolutional networks have been tremendously successful in practical applications. The name "convolutional neural network" indicates that the network employs a mathematical operation called convolution, which is a specialized kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

Keras [10] is a high-level deep learning library written in Python and capable of running on top of either the TensorFlow or the Theano deep learning library. It was developed with a focus on enabling fast experimentation: being able to go from idea to result with the least possible delay is key to doing good research. Keras allows easy and fast prototyping (through modularity, minimalism, and extensibility). It supports both convolutional networks (which we used in our solution) and recurrent networks, as well as combinations of the two. Keras also supports arbitrary connectivity schemes (including multi-input and multi-output training) and runs seamlessly on CPU and GPU.

The core data structure of Keras is a model, a way to organize layers. The main type of model is the Sequential model, a linear stack of layers. Keras' guiding principles include modularity: a model is understood as a sequence or a graph of standalone, fully configurable modules that can be plugged together with as few restrictions as possible. In particular, neural layers, cost functions, optimizers, initialization schemes, activation functions, and regularization schemes are all standalone modules that users can combine to create new models. Each module is kept short and simple, and the ability to easily create new modules allows for total expressiveness, making Keras suitable for advanced research.

B. Our convolutional neural network model

The convolutional neural network model that we used is specified in Figure 3.

Fig. 3. Our convolutional neural network model.

It takes 64x40 resolution images as input, consists of three convolutional layers, each followed by pooling, and has a single output node. We use rectified linear units in all layers except for the output node, where the activation is sigmoid.
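For illustration, a network of this shape can be defined in Keras roughly as follows. This is a minimal sketch consistent with the description above (64x40 RGB input, three convolutional layers each followed by pooling, ReLU activations, and a single sigmoid output node); the numbers of filters and the kernel sizes are assumed values, not necessarily the exact ones in Figure 3.

```python
# A minimal Keras sketch of a network with the shape described above.
# Filter counts and kernel sizes are assumed values, not necessarily the
# exact ones from Figure 3.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # Input: 64x40 RGB images (Keras expects height x width x channels).
    Conv2D(16, (3, 3), activation="relu", input_shape=(40, 64, 3)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(32, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    # Single output node with a sigmoid: probability of an offered handshake.
    Dense(1, activation="sigmoid"),
])
model.summary()
```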
C. Training process

Before starting to train the model, several parameters describing the training have to be set. The first parameter is the number of epochs. An epoch is an arbitrary milestone, generally defined as "one pass over the entire dataset", used to separate training into distinct phases, which is useful for logging and periodic evaluation. In general it determines how many times the process goes through the training set.

The second parameter is the batch size, which defines the number of samples that are propagated through the network at a time. For instance, suppose there are 200 training samples and the batch size is set to 30. The algorithm takes the first 30 samples from the training dataset and trains the network, then takes the next 30 samples and trains the network again, and so on until all samples have been propagated through the network. A problem usually arises with the last set of samples: in this example, the last 20 samples do not form a full batch of 30. The simplest solution is to just take the final 20 samples and train the network on them.

We tried different loss functions and training optimization methods. The ones that worked reasonably well in the end are reported in Section VII.

Training accuracy and training loss are calculated on the go, during training. The figures in Section VII show how well our network does on the data it is being trained on. Training accuracy usually keeps increasing throughout training.

D. Validation process

To validate the model we need a new set of images that has not been used in the training process. Validation is usually carried out together with training: after every epoch, the model is tested against a validation set, and validation loss and accuracy are calculated. These numbers tell how good the model is at predicting outputs for inputs it has never seen before. Validation accuracy increases initially and then drops as the model overfits. Overfitting happens when the model fits the training set too well; it then becomes difficult for the model to generalize to new examples that were not in the training set. For example, the model recognizes specific images in the training set instead of general patterns. The training accuracy is then higher than the accuracy on the validation/test set.

E. Testing process

To test the model we need yet another new dataset. Testing is usually run manually, by giving an image from this dataset to the trained model and inspecting the result. The result is a value that shows the probability of each output option.
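To tie these parameters together, the sketch below shows how such a model could be compiled and trained in Keras with the settings reported in Section VII (mean squared error loss, stochastic gradient descent, accuracy metric, batch size 32) and with small random image transformations in the spirit of the data augmentation [11] described in Section IV. The dataset files, the augmentation ranges, and the variable `model` (the network from the previous sketch) are assumptions for illustration only.

```python
# A minimal sketch of the training setup described above (not our exact script).
# Loss, optimizer, metric, batch size and epoch counts follow the values
# reported in Section VII; the dataset files and augmentation ranges are
# assumed placeholders. `model` is the Sequential CNN from the previous sketch.
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

# x_train: (N, 40, 64, 3) float32 images scaled to [0, 1]
# y_train: (N,) labels, 1 = offered handshake, 0 = no handshake
x_train, y_train = np.load("x_train.npy"), np.load("y_train.npy")
x_val, y_val = np.load("x_val.npy"), np.load("y_val.npy")

model.compile(loss="mean_squared_error",
              optimizer="sgd",
              metrics=["accuracy"])

# Small random transformations (rotation, translation, scaling, color shift),
# corresponding to the data augmentation described in Section IV.
augmenter = ImageDataGenerator(rotation_range=10,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               zoom_range=0.1,
                               channel_shift_range=0.1)

history = model.fit_generator(
    augmenter.flow(x_train, y_train, batch_size=32),
    steps_per_epoch=len(x_train) // 32,
    epochs=30,                       # 30 or 50 epochs in our experiments
    validation_data=(x_val, y_val))  # validation accuracy/loss after each epoch
```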
IV. DATA COLLECTION AND PREPARATION

As mentioned in the previous section, a collection of image data was needed to implement this project. Since the system only recognizes the greeting, only two outcomes are possible: the greeting is recognized or not. During the development of the whole project, more than 4,000 different images were collected for the training of the neural network, approximately 2,000 for each category. The resolution of a single image is 318x198.

As can be seen in Figure 4, each image usually contains one person, with the hand either extended or not. We also tried to capture images in as many different environments as possible. The clothing of the subjects was varied as well, to capture colors as diverse as possible. This is important to ensure that the recognition is not restricted to one particular situation.

Fig. 4. Image samples: top positive, bottom negative.

The pictures were divided into three sets: training, validation, and testing. The neural network is taught with the training data. It is then validated with the validation data to verify that the trained neural network performs recognition on new examples. The test data is intended to evaluate the final neural network's true recognition capability. In addition, data augmentation [11] was used during training, in which various small transformations were made to the images before training on them (rotation, translation, color shift, up-/down-scaling).

If there are people who are interested in this task, we can share the data with everyone who wants it.

V. BACKGROUND REMOVAL

Initially, we tried to train the neural network on the data obtained directly from the camera, without preprocessing it. However, we noticed that, in the best attempt, the model reached 78 percent training accuracy and about 64 percent validation accuracy, followed by overfitting, during which the error rate increased significantly. For this reason, it was necessary to look for ways to avoid overfitting and to increase the validation accuracy of the model. To achieve this, attempts were made to change the model's parameters, but this did not improve the result as much as expected. We then decided to process the data itself.

From the previous experiments we got the impression that the overfitting appears due to the excessive color gamut and coloring of the images. For this reason, we decided to try removing the background from the images and training the neural network on pictures without a background. However, this causes a new problem: how to detect where the background is and where the object (in this case a human) is? We decided to take a first image without a human and declare it to be the background, treating all subsequent images as objects with backgrounds. In this case the camera had to be kept in a fixed position. We were then able to subtract the two images and obtain an image without a background. Some noise usually remained in the images after the subtraction; to reduce it, we set a permissible error for the pixel RGB values. The resulting images can be seen in Figure 5.

Fig. 5. Same images with (left) and without (right) background.
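The background removal step can be summarized as follows: the first frame, taken without a person, is stored as the background; every later frame is compared to it pixel by pixel, and pixels whose RGB values differ from the background by less than the permissible error are blanked out. The sketch below is a minimal NumPy/OpenCV illustration of this idea; the tolerance value and the file names are assumed, not our exact settings.

```python
# A minimal sketch of background removal by subtracting a reference frame.
# The tolerance (permissible per-pixel RGB error) and the file names are
# assumed values for illustration.
import cv2
import numpy as np

TOLERANCE = 30  # permissible per-channel difference, an assumed value

def remove_background(image, background, tolerance=TOLERANCE):
    """Zero out pixels that are close to the reference background image."""
    diff = cv2.absdiff(image, background)
    # A pixel belongs to the foreground if any channel differs enough.
    foreground_mask = np.any(diff > tolerance, axis=2)
    result = image.copy()
    result[~foreground_mask] = 0
    return result

background = cv2.imread("background.jpg")   # first image, without a person
frame = cv2.imread("person.jpg")            # later image, person in front
cv2.imwrite("person_no_background.jpg", remove_background(frame, background))
```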
The result was obvious: it is possible to achieve 91 percent accuracy with a smooth natural background. This means that the model is precise enough to recognize the extended hand when the background behind the human is uniform, without the background having to be removed.

In order to obtain this result, we first needed to draw up a test plan that would make it clear which way of training is the most appropriate. We identified three training methods (regular training, training with removed backgrounds, and training with replaced backgrounds) and five types of validation data (background of a specific color, smooth natural background, static colorful background, changing background, and background with a few bystanders in it). All test results are presented in Section VII.

VI. INTERFACING THE ROBOT

Fig. 6. Photo of our robot and use case.

For this project the HR-OS1 Humanoid Endoskeleton robot [12] was used; it is shown in Figure 6. It has an integrated onboard Linux computer with an Intel Atom processor, which provides all the processing power needed to run the robot. The HR-OS1 is a hackable, modular, humanoid robot development platform designed from the ground up with customization and modification in mind. It has built-in software which invokes the robot's actions. The robot's software interface can be seen in Figure 7.

Fig. 7. Robot joints management interface.

The robot interface is used when the model is run to predict new images. The probability returned by the CNN for an image is sent to the robot interface. The robot interface reads this input value and, if it indicates an offered handshake, runs a command for the robot to raise its hand.
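As a purely hypothetical illustration of this last step, the sketch below assumes a simple command listener running on the robot's onboard Linux computer that accepts a plain-text command over TCP and triggers the built-in raise-arm action. The HR-OS1 framework's actual calls are not shown; the host address, port, command string, and threshold are all invented for the example.

```python
# A purely illustrative sketch of the robot-interface step described above.
# The HR-OS1's own software is not shown; we assume a hypothetical listener
# on the robot's onboard Linux computer that accepts a plain-text command
# over TCP and triggers the built-in "raise arm" action.
import socket

ROBOT_HOST = "192.168.0.42"   # hypothetical address of the robot
ROBOT_PORT = 5005             # hypothetical port of the command listener
THRESHOLD = 0.5               # assumed decision threshold

def handle_prediction(probability):
    """Send the raise-arm command if the CNN sees an offered handshake."""
    if probability > THRESHOLD:
        with socket.create_connection((ROBOT_HOST, ROBOT_PORT), timeout=2) as conn:
            conn.sendall(b"RAISE_ARM\n")

handle_prediction(0.87)   # example: a confident positive prediction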
VII. EXPERIMENT RESULTS AND ANALYSIS

In this section we explain in detail what experiments were done and what results were achieved.

As mentioned in the previous sections, the first experiments were carried out using the dataset with non-removed backgrounds for both training and validation. The other parameters were:

- Image width: 64
- Image height: 40
- Training dataset samples: 421
- Validation dataset samples: 122
- Epochs: 30
- Batch size: 32
- Model loss function: mean squared error
- Model optimizer: stochastic gradient descent
- Model metrics: accuracy

After training we got the results shown in Figure 8; after the final iteration they were:

- Training accuracy: 78%
- Training loss: 0.16
- Validation accuracy: 64%
- Validation loss: 0.19

Fig. 8. Training without removing background images: results graph.

The result shows that the model validates new images with 64% accuracy. However, when this model was tested manually with images from very different environments, the results were even worse.

The next experiment was carried out by training the model on images with removed backgrounds. The model parameters were:

- Image width: 64
- Image height: 40
- Training dataset samples: 421
- Validation dataset samples: 121
- Epochs: 50
- Batch size: 32
- Model loss function: mean squared error
- Model optimizer: stochastic gradient descent
- Model metrics: accuracy

After training we got the results shown in Figure 9; after the final iteration they were:

- Training accuracy: 91%
- Training loss: 0.07
- Validation accuracy: 82%
- Validation loss: 0.12

Fig. 9. Training with removed background images: results graph.

This time the result shows that the model validates new images with 82% accuracy, and manual testing of this model with images from very different environments showed about the same accuracy.

All our tests and their results are shown in Table I. Some of the most sophisticated training experiments were not completed, as it was not meaningful to perform them, given the poor results of the simpler training.

TABLE I. DIFFERENT TRAINING AND VALIDATION RESULTS

| 30 epochs                          | A     | B     | C         |
|------------------------------------|-------|-------|-----------|
| 1. Background of a specific color  | -     | 91/82 | 86/77     |
| 2. A smooth natural background     | 81/77 | 92/80 | 81/70     |
| 3. Static background (colorful)    | 52/54 | 62/55 | 58/52     |
| 4. Changing background             | 40/50 | 53/50 | Not tried |

Columns of the table represent how the model was trained, rows how it was validated:
A - simple training on original images;
B - trained with removed backgrounds;
C - trained with background replacements;
x/y - training accuracy / validation accuracy.

The results showed that removing the background significantly improves the recognition accuracy of the model. The best results came from the experiments where training took place on pictures with removed backgrounds and validation was done on images with removed backgrounds or smooth backgrounds as well.

VIII. DISCUSSION AND FUTURE WORK

In this work several different training experiments were performed, observing and studying the accuracy of the trained models. The experiments showed that the results depended not so much on the model used and its parameters as on the transformation of the images. Image preprocessing played a key role in achieving the best results in these experiments. The best result was reached by removing the background before training.

Our interpretation of the results is that removing the background reduces the variation in the data and makes the machine learning model focus on the person in the image. Without the background removal the models are prone to overfitting, probably basing their decisions on the wrong features of the image. It might be that a similar accuracy can be achieved without background removal, but with much more data and training, and probably more powerful models. In that case the models have to infer on their own that the person in the foreground is the most important object in the images and learn how to distinguish it. Motion or depth information, which is used in many gesture recognition systems, would also make separating the person in front from the background easier, and would likely make explicit background removal unnecessary. This is consistent with the results discussed in the related work (Section II), where other authors either use motion and/or depth information, or also crop the foreground from the background in some way.

In our approach it is not necessary to remove the background during testing/validation. The best validation results are on data with smooth natural backgrounds; the accuracy on this validation data reached 92%. A reasonable direction for future work would be to attempt to create a model that can recognize offered handshakes better in a wider range of environments.
REFERENCES

[1] M. Asadi-Aghbolaghi, A. Clapes, M. Bellantonio, and H. J. Escalante, "A survey on deep learning based approaches for action and gesture recognition in image sequences," 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), 2017. http://sunai.uoc.edu/~vponcel/doc/survey-deep-learning_fg2017.pdf
[2] Microsoft Robotics, "Kinect Sensor," accessed 05-2018. https://msdn.microsoft.com/en-us/library/hh438998.aspx
[3] Leap Motion, "Leap Motion – Developer," accessed 05-2018. https://developer.leapmotion.com/
[4] J. Materzynska, "Building a Gesture Recognition System using Deep Learning," PyData Warsaw 2017. https://medium.com/twentybn/building-a-gesture-recognition-system-using-deep-learning-video-d24f13053a1
[5] O. K. Oyedotun and A. Khashman, "Deep learning in vision-based static hand gesture recognition," Neural Computing and Applications, vol. 28, p. 3941, 2017. https://doi.org/10.1007/s00521-016-2294-8
[6] D. Núñez Fernández and B. Kwolek, "Hand Posture Recognition Using Convolutional Neural Network," Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (CIARP 2017), Lecture Notes in Computer Science, vol. 10657, Springer, Cham, 2018. http://home.agh.edu.pl/~bkw/research/pdf/2017/FernandezKwolek_CIARP2017.pdf
[7] Rosalina, L. Yusnita, N. Hadisukmana, R. B. Wahyu, R. Roestam, and Y. Wahyu, "Implementation of real-time static hand gesture recognition using artificial neural network," 4th International Conference on Computer Applications and Information Processing Technology (CAIPT), Kuta Bali, 2017. http://journal.binus.ac.id/index.php/commit/article/viewFile/2282/3245
[8] I. Goodfellow, Y. Bengio, and A. Courville, "Deep Learning," MIT Press, 2016. http://www.deeplearningbook.org/
[9] D. Britz, "Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs," accessed 05-2018. http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
[10] Keras documentation, "Why use Keras?," accessed 05-2018. https://keras.io/why-use-keras/
[11] P. Pai, "Data Augmentation Techniques in CNN using Tensorflow," accessed 05-2018. https://medium.com/ymedialabs-innovation/data-augmentation-techniques-in-cnn-using-tensorflow-371ae43d5be9
[12] Trossen Robotics, "HR-OS1 Humanoid Endoskeleton specifications," accessed 05-2018. http://www.trossenrobotics.com/HR-OS1
[13] D. Połap, M. Woźniak, C. Napoli, E. Tramontana, and R. Damaševičius, "Is the colony of ants able to recognize graphic objects?," International Conference on Information and Software Technologies, pp. 376-387, Springer, 2015.
[14] M. Woźniak, D. Połap, C. Napoli, and E. Tramontana, "Graphic object feature extraction system based on cuckoo search algorithm," Expert Systems with Applications, vol. 66, pp. 20-31, 2016.