A Polite Robot: Visual Handshake Recognition Using Deep Learning

Liutauras Butkus, Mantas Lukoševičius
Faculty of Informatics, Kaunas University of Technology, Kaunas, Lithuania
liutauras.butkus@ktu.edu, mantas.lukosevicius@ktu.lt

Abstract—Our project was to create a demo system in which a small humanoid robot accepts an offered handshake when it sees one. The visual handshake recognition, which is the main part of the system, proved to be not an easy task. Here we describe how, and how well, we solved it using deep learning. In contrast to most gesture recognition research, we did not use depth information or videos, but worked on static images: we wanted to use a simple camera, and our gesture is rather static. We collected a special dataset for this task. Different configurations and learning algorithms of convolutional neural networks were tried. However, the biggest breakthrough came when we eliminated the background and made the model concentrate on the person in front. In addition to our experiment results, we can also share our dataset.

Keywords—image recognition, computer vision, deep learning, convolutional neural networks, robotics

I. INTRODUCTION

The goal of this project is to create a robot that can visually recognize an offered handshake and accept it. When the robot sees a person offering a handshake, it responds by stretching out its arm too. This serves as a visual and interactive demonstration, intended to get students more interested in machine learning and robotics.

For this purpose we used a small humanoid robot, a simple camera mounted on it, and deep convolutional neural networks for image recognition. The recognition, as well as its training, was done on a PC, and the command to raise the arm was sent back to the robot.

This article mainly shares our experience in developing and training the visual handshake recognition system, which proved not to be trivial. In particular, we discuss how the images were collected and preprocessed, what architecture of convolutional neural networks was used, how it was trained and tested, and what gave good and not so good results.

The document is divided into several sections. Section II reviews existing solutions to similar problems. Section III introduces our method. Section IV describes the dataset used in this study. Section V emphasizes the importance of data preprocessing before training. Section VI describes the robot interface. Sections VII and VIII provide an analysis of the results and the conclusions.

II. RELATED WORK

Virtually all vision-based hand gesture recognition systems described in the literature use (a) image sequences (videos) with (b) depth information in them; see [1] for a good recent survey. Microsoft Kinect [2] and Leap Motion [3] are two examples of popular sensors specifically designed for 3D gesture and posture tracking. While clearly both the temporal (a) and depth (b) aspects are helpful in recognizing hand gestures, our system uses neither of the two. We (b) used an inexpensive camera for simple RGB image acquisition, to make the system more accessible and the algorithms more widely applicable, e.g., in smartphones and under natural lighting. We also (a) used single frames to recognize the extended hand for the handshake, since the gesture is rather static – just holding the extended hand still – and could perhaps be called a "posture". This makes the recognition problem considerably harder.

A project somewhat similar to ours, "Gesture Recognition System using Deep Learning", was presented at the PyData Warsaw 2017 conference [4]. The author introduced a Python-based deep learning gesture recognition model that is deployed on an embedded system, works in real time, and can recognize 25 different hand gestures from a simple webcam stream. The development of the system included a large-scale crowd-sourcing operation to collect over 150,000 short video clips, a process to decide which deep learning framework to use, the development of a network architecture that classifies video clips solely from RGB input frames, the iterations necessary to make the neural network run in real time on embedded devices, and, lastly, the discovery and development of playful gesture-based applications. Their approach still differs from ours in that they used video samples (several frames at a time) as input and tried to recognize moving gestures.

There is also considerable literature close to our approach in both (a) and (b), recognizing sign language hand gestures (or rather postures) from RGB images, including with deep learning [5]. These approaches, however, usually work with images of a single hand on a uniform background, where the hand can be cropped from the image using thresholding [5], skin color [6], or by relying on the subject wearing a brightly colored glove [7].
III. OUR METHOD

Fig. 1. System training model.

Fig. 2. System running model.

The system consists of several parts, shown in Figures 1 and 2: the camera, preprocessing of the camera images, training of the convolutional neural network using deep learning, a graphical user interface, the robot interface, and the robot itself.

At first the camera was used to collect the image dataset; this is described in Section IV. After later research, described in Section V, the images in the dataset had to be preprocessed before the model could be trained, which is the next part of our system. Using the Keras library the model was created, compiled, and finally trained on the images (this process is explained in subsection C below and in Section VII). The final part is running the model to recognize new live images. For this purpose the camera interface was programmed to take a photo every 0.5 seconds; the model receives these images as input and returns the probability of seeing an offered handshake as output. If this result is above a certain threshold, the robot interface sends a command to the robot to perform the corresponding action. This part is described in more detail in Section VI.
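To make the running mode concrete, the sketch below illustrates this loop under a few assumptions: OpenCV is used for image capture, the trained Keras model is stored in a file named handshake_model.h5, and send_raise_arm_command() stands in for the robot interface described in Section VI. The file name, the preprocessing details, the threshold value, and the helper function are hypothetical; only the 0.5-second interval and the probability-threshold logic come from the description above.

```python
# A minimal sketch of the running loop described above (not our exact code).
# Assumptions: OpenCV ("cv2") is used for capture, the trained Keras model is
# stored in "handshake_model.h5", and send_raise_arm_command() is a
# hypothetical placeholder for the robot interface call (Section VI).
import time

import cv2
import numpy as np
from keras.models import load_model

MODEL_INPUT_SIZE = (64, 40)   # width x height, as used by our network
THRESHOLD = 0.5               # decision threshold (an assumed value)

def send_raise_arm_command():
    """Hypothetical stand-in for the robot interface (see Section VI)."""
    print("Robot: raising arm")

model = load_model("handshake_model.h5")
camera = cv2.VideoCapture(0)

while True:
    ok, frame = camera.read()
    if not ok:
        break
    # Resize to the network input resolution and scale pixel values to [0, 1].
    small = cv2.resize(frame, MODEL_INPUT_SIZE)
    batch = np.expand_dims(small.astype("float32") / 255.0, axis=0)
    probability = float(model.predict(batch)[0][0])
    if probability > THRESHOLD:
        send_raise_arm_command()
    time.sleep(0.5)           # one photo every 0.5 seconds
```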
A. Choice of deep learning libraries

Deep learning [8] (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, semi-supervised, or unsupervised. Deep learning models are loosely related to information processing and communication patterns in biological nervous systems, such as neural coding, which attempts to define a relationship between various stimuli and the associated neuronal responses in the brain. Deep learning architectures such as deep neural networks, deep belief networks, and recurrent neural networks [9] have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, and drug design, where they have produced results comparable to, and in some cases superior to, those of human experts.

Convolutional networks [8], also known as convolutional neural networks or CNNs, are a specialized kind of neural network for processing data that has a known grid-like topology. Examples include time-series data, which can be thought of as a 1-D grid of samples taken at regular time intervals, and image data, which can be thought of as a 2-D grid of pixels. Convolutional networks have been tremendously successful in practical applications. The name "convolutional neural network" indicates that the network employs a mathematical operation called convolution, which is a specialized kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

Keras [10] is a high-level deep learning library written in Python and capable of running on top of either the TensorFlow or the Theano deep learning library. It was developed with a focus on enabling fast experimentation: being able to go from idea to result with the least possible delay is key to doing good research. Keras allows easy and fast prototyping (through modularity, minimalism, and extensibility). It supports both convolutional networks (which we used in our solution) and recurrent networks, as well as combinations of the two. Keras also supports arbitrary connectivity schemes (including multi-input and multi-output training) and runs seamlessly on CPU and GPU.

The core data structure of Keras is a model, a way to organize layers. The main type of model is the Sequential model, a linear stack of layers. Keras' guiding principles include modularity: a model is understood as a sequence or a graph of standalone, fully configurable modules that can be plugged together with as few restrictions as possible. In particular, neural layers, cost functions, optimizers, initialization schemes, activation functions, and regularization schemes are all standalone modules that users can combine to create new models. Each module is kept short and simple, and the ability to easily create new modules allows for total expressiveness, making Keras suitable for advanced research.

B. Our convolutional neural network model

The convolutional neural network model that we used is specified in Figure 3.

Fig. 3. Our convolutional neural network model.

It takes 64x40 resolution images as input, consists of three convolutional layers, each followed by pooling, and has a single output node. We use rectified linear units in all layers except for the output node, where the activation is sigmoid.
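For illustration, a network of this shape can be defined in Keras roughly as follows. This is a minimal sketch consistent with the description above (64x40 RGB input, three convolutional layers each followed by pooling, ReLU activations, and a single sigmoid output node); the numbers of filters and the kernel sizes are assumed values, not necessarily the exact ones in Figure 3.

```python
# A minimal Keras sketch of a network with the shape described above.
# Filter counts and kernel sizes are assumed values, not necessarily the
# exact ones from Figure 3.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # Input: 64x40 RGB images (Keras expects height x width x channels).
    Conv2D(16, (3, 3), activation="relu", input_shape=(40, 64, 3)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(32, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    # Single output node with a sigmoid: probability of an offered handshake.
    Dense(1, activation="sigmoid"),
])
model.summary()
```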
C. Training process

Before starting to train the model, several parameters describing the training have to be set. The first parameter is the number of epochs. An epoch is an arbitrary milestone, generally defined as "one pass over the entire dataset", used to separate training into distinct phases, which is useful for logging and periodic evaluation. In general it determines how many times the process goes through the training set.

The second parameter is the batch size, which defines the number of samples that are propagated through the network at a time. For instance, suppose there are 200 training samples and the batch size is set to 30. The algorithm takes the first 30 samples from the training dataset and trains the network, then takes the next 30 samples and trains the network again, and so on until all samples have been propagated through the network. A problem usually arises with the last set of samples: in this example, the last 20 samples do not form a full batch of 30. The simplest solution is to just take the final 20 samples and train the network on them.

We tried different loss functions and training optimization methods. The ones that worked reasonably well in the end are reported in Section VII.

Training accuracy and training loss are calculated on the go, during training. The figures in Section VII show how well our network does on the data it is being trained on. Training accuracy usually keeps increasing throughout training.

D. Validation process

To validate the model we need a new set of images that has not been used in the training process. Validation is usually carried out together with training: after every epoch, the model is tested against a validation set, and validation loss and accuracy are calculated. These numbers tell how good the model is at predicting outputs for inputs it has never seen before. Validation accuracy increases initially and then drops as the model overfits. Overfitting happens when the model fits the training set too well; it then becomes difficult for the model to generalize to new examples that were not in the training set. For example, the model recognizes specific images in the training set instead of general patterns. The training accuracy is then higher than the accuracy on the validation/test set.

E. Testing process

To test the model we need yet another new dataset. Testing is usually run manually, by giving an image from this dataset to the trained model and inspecting the result. The result is a value that shows the probability of each output option.
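To tie these parameters together, the sketch below shows how such a model could be compiled and trained in Keras with the settings reported in Section VII (mean squared error loss, stochastic gradient descent, accuracy metric, batch size 32) and with small random image transformations in the spirit of the data augmentation [11] described in Section IV. The dataset files, the augmentation ranges, and the variable `model` (the network from the previous sketch) are assumptions for illustration only.

```python
# A minimal sketch of the training setup described above (not our exact script).
# Loss, optimizer, metric, batch size and epoch counts follow the values
# reported in Section VII; the dataset files and augmentation ranges are
# assumed placeholders. `model` is the Sequential CNN from the previous sketch.
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

# x_train: (N, 40, 64, 3) float32 images scaled to [0, 1]
# y_train: (N,) labels, 1 = offered handshake, 0 = no handshake
x_train, y_train = np.load("x_train.npy"), np.load("y_train.npy")
x_val, y_val = np.load("x_val.npy"), np.load("y_val.npy")

model.compile(loss="mean_squared_error",
              optimizer="sgd",
              metrics=["accuracy"])

# Small random transformations (rotation, translation, scaling, color shift),
# corresponding to the data augmentation described in Section IV.
augmenter = ImageDataGenerator(rotation_range=10,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               zoom_range=0.1,
                               channel_shift_range=0.1)

history = model.fit_generator(
    augmenter.flow(x_train, y_train, batch_size=32),
    steps_per_epoch=len(x_train) // 32,
    epochs=30,                       # 30 or 50 epochs in our experiments
    validation_data=(x_val, y_val))  # validation accuracy/loss after each epoch
```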
IV. DATA COLLECTION AND PREPARATION

As mentioned in the previous section, a collection of image data was needed to implement this project. Since the system only recognizes the greeting, only two outcomes are possible: the greeting is recognized or not. During the development of the whole project, more than 4,000 different images were collected for the training of the neural network, approximately 2,000 for each category. The resolution of a single image is 318x198.

As can be seen in Figure 4, each image usually contains one person, with the hand either extended or not. We also tried to capture images in as many different environments as possible. The clothing of the subjects was varied as well, to capture colors as diverse as possible. This is important to ensure that the recognition is not restricted to one particular situation.

Fig. 4. Image samples: top positive, bottom negative.

The pictures were divided into three sets: training, validation, and testing. The neural network is taught with the training data. It is then validated with the validation data to verify that the trained neural network performs recognition on new examples. The test data is intended to evaluate the final neural network's true recognition capability. In addition, data augmentation [11] was used during training, in which various small transformations were made to the images before training on them (rotation, translation, color shift, up-/down-scaling).

If there are people who are interested in this task, we can share the data with everyone who wants it.

V. BACKGROUND REMOVAL

Initially, we tried to train the neural network on the data obtained directly from the camera, without preprocessing it. However, we noticed that, in the best attempt, the model reached 78 percent training accuracy and about 64 percent validation accuracy, followed by overfitting, during which the error rate increased significantly. For this reason, it was necessary to look for ways to avoid overfitting and to increase the validation accuracy of the model. To achieve this, attempts were made to change the model's parameters, but this did not improve the result as much as expected. We then decided to process the data itself.

From the previous experiments we got the impression that the overfitting appears due to the excessive color gamut and coloring of the images. For this reason, we decided to try removing the background from the images and training the neural network on pictures without a background. However, this causes a new problem: how to detect where the background is and where the object (in this case a human) is? We decided to take a first image without a human and declare it to be the background, treating all subsequent images as objects with backgrounds. In this case the camera had to be kept in a fixed position. We were then able to subtract the two images and obtain an image without a background. Some noise usually remained in the images after the subtraction; to reduce it, we set a permissible error for the pixel RGB values. The resulting images can be seen in Figure 5.

Fig. 5. Same images with (left) and without (right) background.
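The background removal step can be summarized as follows: the first frame, taken without a person, is stored as the background; every later frame is compared to it pixel by pixel, and pixels whose RGB values differ from the background by less than the permissible error are blanked out. The sketch below is a minimal NumPy/OpenCV illustration of this idea; the tolerance value and the file names are assumed, not our exact settings.

```python
# A minimal sketch of background removal by subtracting a reference frame.
# The tolerance (permissible per-pixel RGB error) and the file names are
# assumed values for illustration.
import cv2
import numpy as np

TOLERANCE = 30  # permissible per-channel difference, an assumed value

def remove_background(image, background, tolerance=TOLERANCE):
    """Zero out pixels that are close to the reference background image."""
    diff = cv2.absdiff(image, background)
    # A pixel belongs to the foreground if any channel differs enough.
    foreground_mask = np.any(diff > tolerance, axis=2)
    result = image.copy()
    result[~foreground_mask] = 0
    return result

background = cv2.imread("background.jpg")   # first image, without a person
frame = cv2.imread("person.jpg")            # later image, person in front
cv2.imwrite("person_no_background.jpg", remove_background(frame, background))
```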
The result was obvious: it is possible to achieve 91 percent accuracy with a smooth natural background. This means that the model is precise enough to recognize the extended hand when the background behind the human is uniform, without the background having to be removed.

In order to obtain this result, we first needed to draw up a test plan that would make it clear which way of training is the most appropriate. We identified three training methods (regular training, training with removed backgrounds, and training with replaced backgrounds) and five types of validation data (background of a specific color, smooth natural background, static colorful background, changing background, and background with a few bystanders in it). All test results are presented in Section VII.

VI. INTERFACING THE ROBOT

Fig. 6. Photo of our robot and use case.

For this project the HR-OS1 Humanoid Endoskeleton robot [12] was used; it is shown in Figure 6. It has an integrated onboard Linux computer with an Intel Atom processor, which provides all the processing power needed to run the robot. The HR-OS1 is a hackable, modular, humanoid robot development platform designed from the ground up with customization and modification in mind. It has built-in software which invokes the robot's actions. The robot's software interface can be seen in Figure 7.

Fig. 7. Robot joints management interface.

The robot interface is used when the model is run to predict new images. The probability returned by the CNN for an image is sent to the robot interface. The robot interface reads this input value and, if it indicates an offered handshake, runs a command for the robot to raise its hand.
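As a purely hypothetical illustration of this last step, the sketch below assumes a simple command listener running on the robot's onboard Linux computer that accepts a plain-text command over TCP and triggers the built-in raise-arm action. The HR-OS1 framework's actual calls are not shown; the host address, port, command string, and threshold are all invented for the example.

```python
# A purely illustrative sketch of the robot-interface step described above.
# The HR-OS1's own software is not shown; we assume a hypothetical listener
# on the robot's onboard Linux computer that accepts a plain-text command
# over TCP and triggers the built-in "raise arm" action.
import socket

ROBOT_HOST = "192.168.0.42"   # hypothetical address of the robot
ROBOT_PORT = 5005             # hypothetical port of the command listener
THRESHOLD = 0.5               # assumed decision threshold

def handle_prediction(probability):
    """Send the raise-arm command if the CNN sees an offered handshake."""
    if probability > THRESHOLD:
        with socket.create_connection((ROBOT_HOST, ROBOT_PORT), timeout=2) as conn:
            conn.sendall(b"RAISE_ARM\n")

handle_prediction(0.87)   # example: a confident positive prediction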
VII. EXPERIMENT RESULTS AND ANALYSIS

In this section we explain in detail what experiments were done and what results were achieved.

As mentioned in the previous sections, the first experiments were carried out using the dataset with non-removed backgrounds for both training and validation. The other parameters were:

- Image width: 64
- Image height: 40
- Training dataset samples: 421
- Validation dataset samples: 122
- Epochs: 30
- Batch size: 32
- Model loss function: mean squared error
- Model optimizer: stochastic gradient descent
- Model metrics: accuracy

After training we got the results shown in Figure 8; after the final iteration they were:

- Training accuracy: 78%
- Training loss: 0.16
- Validation accuracy: 64%
- Validation loss: 0.19

Fig. 8. Training without removing background images: results graph.

The result shows that the model validates new images with 64% accuracy. However, when this model was tested manually with images from very different environments, the results were even worse.

The next experiment was carried out by training the model on images with removed backgrounds. The model parameters were:

- Image width: 64
- Image height: 40
- Training dataset samples: 421
- Validation dataset samples: 121
- Epochs: 50
- Batch size: 32
- Model loss function: mean squared error
- Model optimizer: stochastic gradient descent
- Model metrics: accuracy

After training we got the results shown in Figure 9; after the final iteration they were:

- Training accuracy: 91%
- Training loss: 0.07
- Validation accuracy: 82%
- Validation loss: 0.12

Fig. 9. Training with removed background images: results graph.

This time the result shows that the model validates new images with 82% accuracy, and manual testing of this model with images from very different environments showed about the same accuracy.

All our tests and their results are shown in Table I. Some of the most sophisticated training experiments were not completed, as it was not meaningful to perform them, given the poor results of the simpler training.

TABLE I. DIFFERENT TRAINING AND VALIDATION RESULTS

| 30 epochs                          | A     | B     | C         |
|------------------------------------|-------|-------|-----------|
| 1. Background of a specific color  | -     | 91/82 | 86/77     |
| 2. A smooth natural background     | 81/77 | 92/80 | 81/70     |
| 3. Static background (colorful)    | 52/54 | 62/55 | 58/52     |
| 4. Changing background             | 40/50 | 53/50 | Not tried |

Columns of the table represent how the model was trained, rows how it was validated:
A - simple training on original images;
B - trained with removed backgrounds;
C - trained with background replacements;
x/y - training accuracy / validation accuracy.

The results showed that removing the background significantly improves the recognition accuracy of the model. The best results came from the experiments where training took place on pictures with removed backgrounds and validation was done on images with removed backgrounds or smooth backgrounds as well.

VIII. DISCUSSION AND FUTURE WORK

In this work several different training experiments were performed, observing and studying the accuracy of the trained models. The experiments showed that the results depended not so much on the model used and its parameters as on the transformation of the images. Image preprocessing played a key role in achieving the best results in these experiments. The best result was reached by removing the background before training.

Our interpretation of the results is that removing the background reduces the variation in the data and makes the machine learning model focus on the person in the image. Without the background removal the models are prone to overfitting, probably basing their decisions on the wrong features of the image. It might be that a similar accuracy can be achieved without background removal, but with much more data and training, and probably more powerful models. In that case the models have to infer on their own that the person in the foreground is the most important object in the images and learn how to distinguish it. Motion or depth information, which is used in many gesture recognition systems, would also make separating the person in front from the background easier, and would likely make explicit background removal unnecessary. This is consistent with the results discussed in the related work (Section II), where other authors either use motion and/or depth information, or also crop the foreground from the background in some way.

In our approach it is not necessary to remove the background during testing/validation. The best validation results are on data with smooth natural backgrounds; the accuracy on this validation data reached 92%. A reasonable direction for future work would be to attempt to create a model that can recognize offered handshakes better in a wider range of environments.
REFERENCES

[1] M. Asadi-Aghbolaghi, A. Clapes, M. Bellantonio, and H. J. Escalante, "A survey on deep learning based approaches for action and gesture recognition in image sequences," 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), 2017. http://sunai.uoc.edu/~vponcel/doc/survey-deep-learning_fg2017.pdf
[2] Microsoft Robotics, "Kinect Sensor," accessed 05-2018. https://msdn.microsoft.com/en-us/library/hh438998.aspx
[3] Leap Motion, "Leap Motion – Developer," accessed 05-2018. https://developer.leapmotion.com/
[4] J. Materzynska, "Building a Gesture Recognition System using Deep Learning," PyData Warsaw 2017. https://medium.com/twentybn/building-a-gesture-recognition-system-using-deep-learning-video-d24f13053a1
[5] O. K. Oyedotun and A. Khashman, "Deep learning in vision-based static hand gesture recognition," Neural Computing and Applications, vol. 28, p. 3941, 2017. https://doi.org/10.1007/s00521-016-2294-8
[6] D. Núñez Fernández and B. Kwolek, "Hand Posture Recognition Using Convolutional Neural Network," Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (CIARP 2017), Lecture Notes in Computer Science, vol. 10657, Springer, Cham, 2018. http://home.agh.edu.pl/~bkw/research/pdf/2017/FernandezKwolek_CIARP2017.pdf
[7] Rosalina, L. Yusnita, N. Hadisukmana, R. B. Wahyu, R. Roestam, and Y. Wahyu, "Implementation of real-time static hand gesture recognition using artificial neural network," 4th International Conference on Computer Applications and Information Processing Technology (CAIPT), Kuta Bali, 2017. http://journal.binus.ac.id/index.php/commit/article/viewFile/2282/3245
[8] I. Goodfellow, Y. Bengio, and A. Courville, "Deep Learning," MIT Press, 2016. http://www.deeplearningbook.org/
[9] D. Britz, "Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs," accessed 05-2018. http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
[10] Keras documentation, "Why use Keras?," accessed 05-2018. https://keras.io/why-use-keras/
[11] P. Pai, "Data Augmentation Techniques in CNN using Tensorflow," accessed 05-2018. https://medium.com/ymedialabs-innovation/data-augmentation-techniques-in-cnn-using-tensorflow-371ae43d5be9
[12] Trossen Robotics, "HR-OS1 Humanoid Endoskeleton specifications," accessed 05-2018. http://www.trossenrobotics.com/HR-OS1
[13] D. Połap, M. Woźniak, C. Napoli, E. Tramontana, and R. Damaševičius, "Is the colony of ants able to recognize graphic objects?," International Conference on Information and Software Technologies, pp. 376-387, Springer, 2015.
[14] M. Woźniak, D. Połap, C. Napoli, and E. Tramontana, "Graphic object feature extraction system based on cuckoo search algorithm," Expert Systems with Applications, vol. 66, pp. 20-31, 2016.