=Paper= {{Paper |id=Vol-2280/paper-11 |storemode=property |title=Hand Gesture Recognition Using Convolutional Neural Network and Histogram of Oriented Gradients Features |pdfUrl=https://ceur-ws.org/Vol-2280/paper-11.pdf |volume=Vol-2280 |authors=Alda Kika,Aldo Koni |dblpUrl=https://dblp.org/rec/conf/rtacsit/KikaK18 }} ==Hand Gesture Recognition Using Convolutional Neural Network and Histogram of Oriented Gradients Features== https://ceur-ws.org/Vol-2280/paper-11.pdf
       Hand gesture recognition using convolutional neural network
              and histogram of oriented gradients features
          Alda Kika                                                                    Aldo Koni
       Department of Informatics                                                Department of Informatics
      Faculty of Natural Sciences                                              Faculty of Natural Sciences
         University of Tirana                                                      University of Tirana
        alda.kika@fshn.edu.al                                                  aldo.koni@fshnstudent.info


Abstract

Hand gesture recognition is the core part of building a sign language
recognition system for people with hearing impairment and has wide
application in human-computer interaction. The dataset chosen for building
the hand gesture recognition model consists of fingerspelling alphabet
gestures of American Sign Language. The algorithms chosen in this study to
create the image features that train the classifier are deep features from a
pretrained convolutional neural network, AlexNet, and histogram of oriented
gradients. The feature vectors provided by the extraction methods are used as
input to train a support vector machine classifier. Testing results show that
the classifiers built with the two sets of features perform with almost the
same accuracy. The combination of histogram of oriented gradients as feature
extractor and support vector machine as classifier gives very good results
for image classification when the input dataset is small, as in our case.

1 Introduction

Gesture recognition is a very interesting field in computer vision that finds
practical application in many areas. One of these is hand gesture
recognition, one of the methods used in sign language for non-verbal
communication. A hand gesture recognition system provides a natural way of
communicating for people with hearing impairments, and also an interactive,
user-friendly way of communicating with the computer for people in general.
Convolutional neural networks are deep neural networks that have recently
reached very high performance in computer vision problems such as detection
or classification of images. On the other hand, handcrafted traditional
features such as histogram of oriented gradients, combined with a classifier,
have also proved successful in computer vision tasks. Both of these
algorithms have been used in sign language hand gesture recognition, as in
[Ame+17] and [Tav+14].
As dataset we have chosen the Massey Dataset [Bar+11], which was created for
American Sign Language fingerspelling gestures. A pretrained convolutional
neural network, AlexNet, and histogram of oriented gradients are used as
feature extractors, while a support vector machine is chosen as the
classifier. In this paper we explore these two methods for feature extraction
from a fingerspelling alphabet gesture dataset, compare them with each other
and discuss the results.
The study is divided into 5 sections. Feature extractors and the
classification algorithm are discussed in the second section. The dataset is
presented in the third section. Experiments and results are discussed in the
fourth section. Conclusions are presented in the last section.

2 Background

Feature Descriptors

Convolutional neural networks are deep learning tools that are very suitable
for computer vision tasks. They not only perform classification, but can also
learn to extract features directly from raw images [Siv+12]. They are similar
to ordinary neural networks in that they contain neurons, weights and biases
and have one or more fully connected layers, as multilayer neural networks
do; unlike them, however, they are easier to train because they have fewer
parameters.
A very important advantage of using convolutional neural networks for
computer vision tasks is that every layer learns different features of the
image. These features can be used to train the classifier.
A convolutional neural network is composed of four different layers
[Shoi+16], which are:
Convolutional layer: a set of filters slides over the image. A filter is
activated when it finds its pattern in the image.
Pooling layer: the aim of this layer is to reduce the spatial dimensions, the
number of parameters and the computation in the network. Several functions
can be used, but max pooling is the most common.
Non-linear layer: the architecture of a convolutional neural network includes
non-linear functions such as rectified linear units (ReLU), identity, tanh
and arctan, whose purpose is to introduce non-linearity into the network,
which makes training faster and more accurate.
Fully-connected layer: the neurons in this type of layer connect to every
neuron in another layer, as in ordinary neural networks.
We have used the pretrained AlexNet, a deep convolutional neural network that
was used to classify the 1.2 million high-resolution images of the ImageNet
LSVRC-2010 contest into 1000 different classes. The architecture of this
network is summarized in Figure 1 [Kri+12]. It contains eight learned layers:
five convolutional and three fully-connected.

Figure 1: The architecture of AlexNet

Histograms of oriented gradients, defined by Dalal and Triggs [Dal+05], are
general-purpose features for object detection and one of the most powerful
image descriptors. Representation through HOG has many advantages. Computing
histograms of oriented gradients on an image captures local contour
information, such as edge or gradient structure. Edges play a very important
role in computer vision tasks, and their orientation describes important
features for object detection. HOG uses the edges of objects to create the
feature set that describes the object. To calculate the HOG descriptor of an
image, the image is divided into a number of cells, and the gradients in each
cell are accumulated into orientation bins.

Below are some characteristics of each of the methods that we used to extract
the features.

Convolutional neural network:
- Convolutional neural networks are deep learning models motivated by the
  way our visual cortex operates, through the alternation of convolutional
  and pooling layers.
- They are trained feature detectors, which makes them very adaptable. This
  is the reason why they reach the highest accuracy in image detection.
- They can learn low-level features from training samples, comparable to
  those that methods such as HOG or SIFT extract.

Histogram of oriented gradients:
- It is based on first-order gradients that are accumulated in orientation
  bins.
- It is dense (it is evaluated over the whole image).
- The features extracted by histogram of oriented gradients cannot be
  learned but are handcrafted, which means that they encode information
  contained in the image, for example at corners or edges.

Classifier

Support vector machines (SVM), presented by Vapnik [Vap98], are among the
most advanced classification methods based on machine learning. Compared with
other classification methods such as decision trees or Bayesian networks,
they have the advantages of higher accuracy and a geometric interpretation.
Above all, they do not need a large amount of training data in order to avoid
overfitting [Cam+11]. Support vector machines work well in practice in
different types of applications, from the detection of digits and the
identification of faces to bioinformatics.

Classification of data is a common task in machine learning. The principle of
SVM lies in determining the classes to which the data belong. The SVM creates
a model that assigns new cases to the classes. Training the SVM involves the
optimization of a concave function, which has a single solution. Other
learning paradigms do not guarantee that the function will be
concave, resulting in different solutions depending on the initial values of
the model parameters. The data enter the model through kernels, which measure
the similarity or dissimilarity of data objects. Kernels can be constructed
for different types of objects, from continuous to discrete data and from
sequences to graphs. In this manner different types of data can be handled
with the same model, making this approach very flexible and powerful. Support
vector machines are the best-known and most widely used kernel method
[Cam+11].

3 Dataset

The chosen dataset was created at Massey University, New Zealand. It contains
2524 images created in such a manner that the hand touches all the borders of
the frame. The hands are cropped from the original image and placed on a
black background. The size of the frame is 500x500 pixels. Five users took
part in constructing the dataset. The hand gestures are based on the American
Sign Language fingerspelling alphabet. The main characteristics that
distinguish this dataset from other similar datasets are: firstly, the images
cover a large variety of hands under different illumination conditions;
secondly, the images are segmented and cropped, but not altered from the
original captured images; and thirdly, there is no need to use special gloves
or any other apparatus [Bar+11]. Figure 2 shows the process of creating the
dataset.

Figure 2: Acquired image with wrist cover. The images are segmented to obtain
the final images stored in the dataset

The names of the files follow a simple convention that can easily be used by
programmers in their scripts. For example, the convention for the file
handX_G_ILL_seg_crop_R.png is:
- X is the number of the user
- G is the gesture, from a to z
- ILL determines the illumination condition, which can be bot (bottom), top,
  left, right or diff (diffuse)
- R is the repetition of the gesture
Figure 3 presents the dataset of American fingerspelling alphabet gestures.

Figure 3: The dataset for the fingerspelling alphabet of American sign
language [Asl]

Since the two letters "j" and "z" are not static, we remove them from the
dataset. The data, grouped in 24 classes, serve for training and testing the
classifier.

4 Experiments and results

Two methods were used to extract the features from the images of the dataset:
the pretrained convolutional neural network AlexNet and histogram of oriented
gradients. Each feature set is divided into a training set and a testing set.
A classifier is constructed with each training set, giving two classifiers,
and each is then tested with the remaining features. The diagram of the
experiments is presented in Figure 4.
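The dataset preparation described in Sections 3 and 4 (reading the gesture
from the file name, removing the non-static letters "j" and "z", and dividing
the images into a training and a testing set) could be sketched as follows.
This is only an illustration: the helper names parse_name and
split_by_gesture, the 30% test fraction and the fixed seed are assumptions,
since the paper does not state how the split was made.

```python
import random
import re
from collections import defaultdict

# File-name convention quoted in Section 3: handX_G_ILL_seg_crop_R.png
NAME_RE = re.compile(
    r"^hand(?P<user>\d+)_(?P<gesture>[a-z])_(?P<ill>bot|top|left|right|diff)"
    r"_seg_crop_(?P<rep>\d+)\.png$"
)
DYNAMIC_LETTERS = {"j", "z"}  # not static, removed from the dataset


def parse_name(filename):
    """Return the fields encoded in a file name, or None if it does not match."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    return {
        "user": int(match["user"]),
        "gesture": match["gesture"],
        "illumination": match["ill"],
        "repetition": int(match["rep"]),
    }


def split_by_gesture(filenames, test_fraction=0.3, seed=0):
    """Split file names into train/test lists, class by class.

    test_fraction and seed are hypothetical choices, not values from the paper.
    Every kept class contributes at least one test image.
    """
    by_gesture = defaultdict(list)
    for name in filenames:
        info = parse_name(name)
        if info is None or info["gesture"] in DYNAMIC_LETTERS:
            continue  # skip unrelated files and the letters "j" and "z"
        by_gesture[info["gesture"]].append(name)

    rng = random.Random(seed)
    train, test = [], []
    for gesture in sorted(by_gesture):
        names = sorted(by_gesture[gesture])
        rng.shuffle(names)
        n_test = max(1, round(len(names) * test_fraction))
        test.extend(names[:n_test])
        train.extend(names[n_test:])
    return train, test
```

Calling split_by_gesture on the list of image file names yields two disjoint
lists; features would then be extracted from each list separately, so that
the two classifiers are trained and tested on the same division of the
images.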
Figure 4: Diagram of the experiments

We use top-1 and top-5 accuracies. Top-1 accuracy is the conventional
accuracy: the model's answer with the highest probability must match the
expected answer. Top-5 accuracy means that the expected answer must match one
of the model's 5 highest-probability answers.

The two classifiers were tested using the testing set, giving the following
results:

Table 1: The accuracies of the classifiers

Classifier trained on   HOG       AlexNet (CNN)
Top-1 accuracy          0.6423    0.6231
Top-5 accuracy          0.8769    0.8615

The results of the experiments show that the highest accuracy (top-1 and
top-5) is reached when the features extracted with the HOG algorithm are used
to train the classifier. Top-5 accuracy is almost the same for both models.

5 Conclusions

One of the fields of machine learning that is giving very good results in
complex data analysis is deep learning. A convolutional neural network is a
deep neural network that is used in computer vision tasks. We have used a
pretrained convolutional neural network and handcrafted histogram of oriented
gradients to extract features from a set of hand gesture images of American
fingerspelling sign language. The features were used to train a support
vector machine classifier. The classifier trained with features extracted
with histogram of oriented gradients reaches the highest top-1 and top-5
accuracy.
In the case of a convolutional neural network, the number of training samples
is very important because the network learns from them. We have used the
pretrained convolutional neural network AlexNet, which is trained with
millions of images from 1000 different categories that are clearly distinct
from each other, while sign language hand gesture categories differ very
little from one another. Histogram of oriented gradients uses predetermined
filters, while a convolutional neural network learns from the training
dataset.
Through fine-tuning with a larger sign language dataset, the pretrained
convolutional neural network would transfer its general learned recognition
capabilities to the specific features of hand gesture classes; this offers
more potential for improving the results and may inspire further research.

References

[Siv+12] M. Sivalingamaiah and B. D. V. Reddy. Texture segmentation using
multichannel Gabor filtering. IOSR Journal of Electronics and Communication
Engineering, Vol. 2, pp. 22-26, 2012.
[Shoi+16] Doaa A. Shoieb, Sherin M. Youssef, and Walid M. Aly. Computer-Aided
Model for Skin Diagnosis Using Deep Learning. Journal of Image and Graphics,
Vol. 4, No. 2, pp. 116-121, December 2016. doi: 10.18178/joig.4.2.116-121.
[Li+10] Daoliang Li, Wenzhu Yang, Sile Wang. Classification of foreign fibers
in cotton lint using machine vision and multi-class support vector machine.
Comput. Electron. Agric., 74, 274–279, 2010.
[Vap98] Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons,
New York, 1998.
[Cam+11] Colin Campbell, Yiming Ying. Learning with Support Vector Machines.
Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan &
Claypool, 2011.
[Dal+05] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for
human detection. Proc. IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, pp. 886–893, 2005. DOI: 10.1109/cvpr.2005.177.
[Bar+11] A.L.C. Barczak, N.H. Reyes, M. Abastillas, A. Piccio and T. Susnjak.
A New 2D Static Hand Gesture Colour Image Dataset for ASL Gestures. Res.
Lett. Inf. Math. Sci., Vol. 15, pp. 12–20, 2011.
[Asl] A. S. L. University. Fingerspelling.
http://www.lifeprint.com/asl101/fingerspelling/
[Kri+12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet
classification with deep convolutional neural networks. Advances in Neural
Information Processing Systems 25 (NIPS 2012), 2012.
[Ame+17] Salem Ameen and Sunil Vadera. A convolutional neural network to
classify American Sign Language fingerspelling from depth and colour images.
Expert Systems, Vol. 34, 2017.
[Tav+14] Neha V. Tavari and A. V. Deorankar. Indian Sign Language Recognition
based on Histograms of Oriented Gradient. International Journal of Computer
Science and Information Technologies, Vol. 5 (3), pp. 3657-3660, 2014.
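
As a supplementary illustration of the top-1 and top-5 accuracies used in
Section 4, the sketch below computes top-k accuracy from per-class scores. The
function name top_k_accuracy and the representation of the classifier output
as one dictionary of class scores per test image are assumptions made for
this example; the paper does not describe how the scores of the support
vector machine were obtained.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose expected label is among the k best-scored classes.

    scores: list of dicts mapping class name -> score (higher is better),
            a hypothetical representation of the classifier output.
    labels: list of expected class names, one per sample.
    """
    hits = 0
    for row, expected in zip(scores, labels):
        # Rank class names by descending score; ties keep dictionary order.
        ranked = sorted(row, key=row.get, reverse=True)[:k]
        if expected in ranked:
            hits += 1
    return hits / len(labels)
```

With k=1 this reduces to the conventional accuracy, and with k=5 it counts a
test image as correct whenever the expected letter is among the five
highest-scored classes.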