<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hand gesture recognition using convolutional neural network and histogram of oriented gradients features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alda Kika</string-name>
          <email>alda.kika@fshn.edu.al</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aldo Koni</string-name>
          <email>aldo.koni@fshnstudent.info</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics, Faculty of Natural Sciences, University of Tirana</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Hand gesture recognition is the core part of building a sign language recognition system for people with hearing impairments and has wide application in human-computer interaction. The dataset chosen for constructing the hand gesture recognition model contains fingerspelling alphabet gestures of American Sign Language. The algorithms chosen in this study to create the image features that train the classifier are deep features from a pretrained convolutional neural network, AlexNet, and histogram of oriented gradients. The feature vectors provided by the extraction methods are used as input to train a support vector machine classifier. Testing results show that the classifiers constructed with the two sets of features perform with almost the same accuracy. The combination of histogram of oriented gradients as feature extractor and support vector machine as classifier gives very good results for image classification when the input dataset is small, as in our case.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Gesture recognition is a very interesting field in
computer vision which finds practical application in
many areas. One of these is hand gesture
recognition, one of the methods used in sign language
for non-verbal communication. A hand gesture
recognition system provides a natural way of
communication for people with hearing impairments
and also an interactive, user-friendly way for people
in general to communicate with the computer.</p>
      <p>Convolutional neural networks are deep neural networks
that have recently reached very high performance in
computer vision problems such as detection or
classification of images. On the other hand, traditional
handcrafted features such as histogram of oriented gradients,
combined with a classifier, have also proved successful
in computer vision tasks. Both of these approaches have
been used in sign language hand gesture recognition, as
in [Ame+17] and [Tav+14].</p>
      <p>We have chosen as dataset the Massey dataset [Bar+11],
which was created for American Sign Language
fingerspelling gestures. A pretrained convolutional neural
network, AlexNet, and histogram of oriented gradients
are used as feature extractors, while a support vector
machine is chosen as the classifier. In this paper we
explore these two methods for feature extraction from
a fingerspelling alphabet sign language dataset,
compare them with each other and discuss the results.
The study is divided into 5 sections. Feature extractors
and the classification algorithm are discussed in the second
section. The dataset is presented in the third section.
Experiments and results are discussed in the fourth
section. Conclusions are presented in the last section.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Background</title>
      <sec id="sec-2-1">
        <title>Feature Descriptors</title>
        <p>Convolutional neural networks are deep learning tools
that are very suitable for computer vision tasks. They
not only perform classification but can also learn
to extract features directly from raw images [Siv+12].
They are similar to ordinary neural networks in that they
contain neurons, weights and biases and have one or
more fully connected layers, as multi-layer neural networks
do, but unlike them they are
easier to train because they have fewer parameters.
A very important advantage of using convolutional
neural networks for computer vision tasks is
that every layer learns different features of the
image. These features can be used to train a
classifier.</p>
        <p>A convolutional neural network is composed of four
different types of layer [Shoi+16]:</p>
        <p>Convolutional layer: a set of filters slides over the image.
A filter is activated when it finds the pattern it represents
in the image.</p>
        <p>Pooling layer: the aim of this layer is to reduce the
spatial dimensions, the number of parameters and the
amount of computation in the network. Several functions can be used,
but max pooling is the most common.</p>
        <p>Non-linear layer: the architecture of a convolutional
neural network includes non-linear functions such as
rectified linear units (ReLU), identity, tanh or arctan,
whose purpose is to introduce non-linearity
into the network, which makes training
faster and more accurate.</p>
        <p>Fully-connected layer: the neurons in this type of
layer connect to every neuron in another layer, as in
ordinary neural networks.</p>
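        <p>As an illustration only, the four layer types listed above can be sketched in plain Python. This is a minimal sketch, not the AlexNet implementation used in the study: the toy image, the hand-set filter edge_filter and the weights of the fully connected layer are assumptions chosen so that the result is easy to follow, whereas a real network learns these values during training.</p>
        <preformat><![CDATA[
```python
def conv2d(image, kernel):
    """Slide the kernel over the image (valid positions only) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Non-linear layer: negative responses are set to zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def fully_connected(inputs, weights, biases):
    """Every output neuron is connected to every input value."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


# Toy 5x5 image with a vertical edge between columns 1 and 2 (assumption).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical hand-set filter that responds to dark-to-bright vertical edges.
edge_filter = [[-1, 1], [-1, 1]]

activation = relu(conv2d(image, edge_filter))
pooled = max_pool(activation, size=2)
flat = [v for row in pooled for v in row]
# Hypothetical weights: two output neurons standing in for two classes.
scores = fully_connected(flat, weights=[[1, 1, 1, 1], [-1, -1, -1, -1]], biases=[0, 0])
```
]]></preformat>
        <p>In this toy case the filter responds only where the image goes from dark to bright, pooling halves the resolution of the response, and the fully connected layer turns the pooled responses into two class scores.</p>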
        <p>We have used the pretrained AlexNet, a deep
convolutional neural network which was used to
classify the 1.2 million high-resolution images in the
ImageNet LSVRC-2010 contest into 1000 different
classes. The architecture of this network is summarized
in Figure 1 [Kri+12]. It contains eight learned layers:
five convolutional and three fully-connected.</p>
        <p>Histograms of oriented gradients (HOG), defined by Dalal and
Triggs [Dal+05], are general-purpose features
for object detection and one of the most
powerful image descriptors. Representation
through HOG has many advantages. Applying
histogram of oriented gradients to an image captures
local contour information such as the edges and the
structure of gradients. Edges play a very
important role in computer vision tasks and their
orientation describes important features for object
detection. HOG uses the edges of an object to create
the feature set that describes it. In order to
calculate the HOG descriptor of an image, the image is
divided into a number of cells, and the gradient orientations
in each cell are accumulated into orientation bins.</p>
        <p>Below, some characteristics of each of the methods
that we used to extract the features are given.
Convolutional Neural Network:</p>
        <sec id="sec-2-1-1">
          <title>Convolutional neural networks</title>
          <p>They are mainly deep learning models
motivated by the way the visual cortex
operates, built through the alternation of
convolutional and pooling layers.
They are trainable feature detectors,
which makes them very adaptable. This is the
reason why they reach the highest accuracy
in image detection.</p>
          <p>They can learn low-level features from
training samples, comparable to those that methods such as HOG
or SIFT compute.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Histogram of oriented gradients</title>
          <p>It is based on first-order gradients that are accumulated in orientation bins.</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>It is dense (it is evaluated over the whole image).</title>
          <p>The features extracted by histogram of
oriented gradients are not learned but
handcrafted, which means that the information
is taken directly from the image, for example from its
corners or edges.</p>
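          <p>To make the idea of cells and orientation bins concrete, the following is a minimal sketch of a simplified HOG computation in plain Python. It is not the implementation used in the study: the cell size, the number of bins and the toy image are assumptions, and the block normalization and bin interpolation of the full Dalal and Triggs descriptor are deliberately left out.</p>
          <preformat><![CDATA[
```python
import math


def hog_cell_histograms(image, cell_size=4, n_bins=9):
    """Return one unsigned-orientation histogram per cell (no block normalization)."""
    height, width = len(image), len(image[0])
    cells_y, cells_x = height // cell_size, width // cell_size
    hist = [[[0.0] * n_bins for _ in range(cells_x)] for _ in range(cells_y)]
    bin_width = 180.0 / n_bins

    for y in range(cells_y * cell_size):
        for x in range(cells_x * cell_size):
            # First-order centred differences; border pixels reuse their own value.
            gx = image[y][min(x + 1, width - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, height - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat region: nothing to vote with
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            b = int(angle // bin_width) % n_bins
            # Each pixel votes for its orientation bin, weighted by gradient magnitude.
            hist[y // cell_size][x // cell_size][b] += magnitude
    return hist


def hog_descriptor(image, cell_size=4, n_bins=9):
    """Flatten the per-cell histograms into one feature vector."""
    return [v for row in hog_cell_histograms(image, cell_size, n_bins)
            for cell in row for v in cell]


# Toy 8x8 image (assumption): dark left half, bright right half, one vertical edge.
image = [[0, 0, 0, 0, 1, 1, 1, 1] for _ in range(8)]
features = hog_descriptor(image)
```
]]></preformat>
          <p>With these assumed settings the toy image yields 2x2 cells of 9 bins each, so the feature vector has 36 entries, and all gradient energy falls into the bin for horizontal gradients because the only edge is vertical.</p>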
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Classifier</title>
        <p>Support vector machines (SVM), presented by
Vapnik [Vap98], are one of the most advanced
classification methods in machine learning. Compared
with other classification methods such as
decision trees or Bayesian networks, they offer
higher accuracy and a geometric
interpretation. Above all, they do not need a large
amount of training data in order to avoid overfitting
[Cam+11]. Support vector machines work well in
practice in different types of applications, from the
detection of digits and identification of faces to
bioinformatics.</p>
        <p>Classification of data is a common task in machine
learning. The principle of SVM lies in determining
the classes to which the data belong. An SVM creates a
model that assigns new cases to the classes. Training
the SVM involves the optimization of a concave
function which has a single solution. Other learning
paradigms do not guarantee that the function is
concave, resulting in different solutions depending on
the initial values of the model parameters. The data are represented
through kernels, which measure the similarity or dissimilarity
of data objects. Kernels can be constructed for
different types of objects, from continuous to discrete
data and from sequences to graphical data. In this
manner different types of data can be handled by the
same model, making this approach very flexible
and powerful. Support vector machines are the best
known and most widely used kernel-based method [Cam+11].</p>
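        <p>The training idea described above can be illustrated with a small binary, linear SVM. This is a sketch under stated assumptions rather than the classifier used in the study: it minimises the regularised hinge loss (the primal form of the SVM objective) by batch sub-gradient descent instead of a dedicated quadratic programming solver, it uses no kernel, and the data, the function names and the values of lam, lr and epochs are hypothetical.</p>
        <preformat><![CDATA[
```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam/2 * ||w||^2 + mean(hinge loss) by batch sub-gradient descent."""
    n = len(samples)
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for x, y in zip(samples, labels):
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            if y * score >= 1:
                continue  # outside the margin: the hinge loss contributes nothing
            # Inside the margin or misclassified: push the boundary away from x.
            grad_w = [g - y * xj / n for g, xj in zip(grad_w, x)]
            grad_b -= y / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Assign a new case to class +1 or -1 according to the side of the hyperplane."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical, linearly separable 2D feature vectors with labels +1 / -1.
samples = [(2.0, 2.0), (-2.0, -2.0), (3.0, 3.0), (-3.0, -3.0), (2.0, 3.0), (-2.0, -3.0)]
labels = [1, -1, 1, -1, 1, -1]

w, b = train_linear_svm(samples, labels)
predictions = [predict(w, b, x) for x in samples]
```
]]></preformat>
        <p>A multi-class problem such as the 24 gesture classes would need several such binary classifiers combined, for example one per class against the rest; that combination is not shown here.</p>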
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Dataset</title>
      <p>The chosen dataset was created at Massey University,
New Zealand. It contains 2524 images created in such a
manner that the hand touches all the borders of the
frame. The hands are cropped from the original image and
placed on a black background. The size of the frame is
500x500 pixels. Five users took part in the construction of
the dataset. The hand gestures are based on the American
Sign Language fingerspelling alphabet.
The main characteristics that distinguish this dataset
from other similar datasets are: firstly, the images
cover a large variety of hands under different
illumination conditions; secondly, the images are
segmented and cropped, but not altered from the
original captured images; and thirdly, there is no need
to use special gloves or any other apparatus [Bar+11].
Figure 2 shows the process of creating the dataset.
The names of the files follow a simple convention that
can easily be used by programmers in their scripts.
The convention for a file name is
handX_G_ILL_seg_crop_R.png, where:</p>
      <sec id="sec-3-1">
        <title>X is the number of the user</title>
        <p>G is the gesture, from a to z.</p>
        <p>ILL determines the illumination condition,
which can be bot (bottom), top,
left, right or diff (diffuse).</p>
        <p>R is the repetition of the gesture.</p>
        <p>Figure 3 presents the American
fingerspelling alphabet gestures contained in the dataset.
Since the gestures for the two letters "j" and "z" are not static, we
remove them from the dataset. The remaining data, grouped in 24
classes, serve for training and testing the
classifier.</p>
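        <p>The naming convention and the removal of the non-static letters described above can be sketched as follows. The pattern assumes that file names follow the stated convention handX_G_ILL_seg_crop_R.png exactly, which may not hold for every copy of the dataset; the function names and the example file names are hypothetical.</p>
        <preformat><![CDATA[
```python
import re
import string

# Groups: user number, gesture letter, illumination, repetition.
FILENAME_PATTERN = re.compile(
    r"^hand(\d+)_([a-z])_(bot|top|left|right|diff)_seg_crop_(\d+)\.png$"
)
NON_STATIC = {"j", "z"}  # gestures that involve motion and are removed
STATIC_LETTERS = sorted(set(string.ascii_lowercase) - NON_STATIC)


def parse_filename(name):
    """Return the fields encoded in a file name, or None if it does not match."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    user, gesture, illumination, repetition = match.groups()
    return {
        "user": int(user),
        "gesture": gesture,
        "illumination": illumination,
        "repetition": int(repetition),
    }


def static_samples(names):
    """Keep (file name, class label) pairs for well-formed names of static gestures."""
    samples = []
    for name in names:
        fields = parse_filename(name)
        if fields is not None and fields["gesture"] not in NON_STATIC:
            samples.append((name, fields["gesture"]))
    return samples


# Hypothetical file names, built only to exercise the convention.
names = [
    "hand1_a_bot_seg_crop_1.png",
    "hand2_j_top_seg_crop_3.png",
    "hand5_z_diff_seg_crop_2.png",
    "hand3_b_left_seg_crop_4.png",
    "notes.txt",
]
samples = static_samples(names)
```
]]></preformat>
        <p>Removing "j" and "z" from the 26 letters leaves the 24 class labels held in STATIC_LETTERS.</p>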
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Experiments and results</title>
      <p>Two methods were used to extract the features from the
images of the dataset: the pretrained convolutional
neural network AlexNet and histogram of oriented
gradients. Each feature set is divided into a training set
and a testing set. Two classifiers are constructed, one with
each training set, and then tested with the remaining
features. The diagram of the experiments is presented
in Figure 4.
We use top-1 and top-5 accuracies. Top-1
accuracy is the conventional accuracy: the answer
to which the model assigns the highest probability must match the
expected answer. Top-5 accuracy means that the
expected answer must match one of the model's 5
highest-probability answers.</p>
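      <p>The two measures defined above are special cases of top-k accuracy, which can be sketched as below. The function name and the scores are hypothetical and only illustrate the definition; with just four classes in the toy data, k=2 is used in place of k=5.</p>
      <preformat><![CDATA[
```python
def top_k_accuracy(scores, expected, k=1):
    """Fraction of samples whose expected label is among the k highest-scoring classes."""
    hits = 0
    for class_scores, label in zip(scores, expected):
        # Class labels ordered from the highest to the lowest score.
        ranked = sorted(class_scores, key=class_scores.get, reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(expected)


# Hypothetical classifier scores for three test images over four classes.
scores = [
    {"a": 0.70, "b": 0.20, "c": 0.05, "d": 0.05},
    {"a": 0.40, "b": 0.35, "c": 0.20, "d": 0.05},
    {"a": 0.10, "b": 0.20, "c": 0.30, "d": 0.40},
]
expected = ["a", "b", "a"]

top1 = top_k_accuracy(scores, expected, k=1)  # only the first image counts
top2 = top_k_accuracy(scores, expected, k=2)  # the second image now counts too
```
]]></preformat>
      <p>Top-k accuracy can never be lower than top-1 accuracy, because every top-1 hit is also a top-k hit.</p>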
      <p>The two classifiers were tested using the testing set,
giving the following results:
The results of the experiments show that the highest
accuracy (Top-1 and Top-5) is reached when the
features extracted with the HOG algorithm are
used to train the classifier. Top-5 accuracy is almost
the same for both models.</p>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusions</title>
      <p>One of the fields of machine learning that is giving very
good results in complex data analysis is deep learning.
A convolutional neural network is a deep neural network
that is used in computer vision tasks. We have used a
pretrained convolutional neural network and handcrafted
histogram of oriented gradients to extract features
from a set of hand gesture images of American
Sign Language fingerspelling. The features were used
to train a support vector machine classifier. The
classifier trained with features extracted with
histogram of oriented gradients reaches the highest
top-1 and top-5 accuracy.</p>
      <p>For a convolutional neural network, the
number of training samples is very important because the
network learns from them. We have used the pretrained
convolutional neural network AlexNet, which was
trained with more than a million images from 1000 different
categories that are clearly distinct from each other,
while sign language hand gesture categories
differ very little from one another.</p>
      <p>Histogram of oriented gradients uses predetermined
filters, while convolutional neural networks learn their filters from
the training dataset.</p>
      <p>Through fine-tuning with a larger sign language
dataset, the pretrained convolutional neural network
can transfer its general learned recognition capabilities to
the specific features of hand gesture classes. This offers
potential for improving the results and motivates
further research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Siv+12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sivalingamaiah</surname>
          </string-name>
          and
          <string-name>
            <given-names>B. D. V.</given-names>
            <surname>Reddy</surname>
          </string-name>
          , “
          <article-title>Texture segmentation using multichannel Gabor filtering</article-title>
          ,”
          <source>IOSR Journal of Electronics and Communication Engineering</source>
          , Vol.
          <volume>2</volume>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>26</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Shoi+16]
          <string-name>
            <given-names>Doaa A.</given-names>
            <surname>Shoieb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sherin M.</given-names>
            <surname>Youssef</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Walid M.</given-names>
            <surname>Aly</surname>
          </string-name>
          .
          <article-title>Computer-Aided Model for Skin Diagnosis Using Deep Learning</article-title>
          .
          <source>Journal of Image and Graphics</source>
          , Vol.
          <volume>4</volume>
          , No.
          <issue>2</issue>
          , pp.
          <fpage>116</fpage>
          -
          <lpage>121</lpage>
          , December
          <year>2016</year>
          . doi: 10.18178/joig.4.2.116-121.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Li+10]
          <string-name>
            <given-names>Daoliang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wenzhu</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sile</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Classification of foreign fibers in cotton lint using machine vision and multi-class support vector machine</article-title>
          .
          <source>Comput. Electron. Agric.</source>
          ,
          <volume>74</volume>
          ,
          <fpage>274</fpage>
          -
          <lpage>279</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Vap98]
          <string-name>
            <given-names>Vladimir N.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          .
          <source>Statistical Learning Theory</source>
          . John Wiley &amp; Sons, New York,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Cam+11]
          <string-name>
            <given-names>Colin</given-names>
            <surname>Campbell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yiming</given-names>
            <surname>Ying</surname>
          </string-name>
          .
          <article-title>Learning with Support Vector Machines</article-title>
          .
          <source>Synthesis Lectures on Artificial Intelligence and Machine Learning</source>
          . Morgan &amp; Claypool,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Dal+05]
          <string-name>
            <given-names>Navneet</given-names>
            <surname>Dalal</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bill</given-names>
            <surname>Triggs</surname>
          </string-name>
          .
          <article-title>Histograms of oriented gradients for human detection</article-title>
          .
          <source>Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>886</fpage>
          -
          <lpage>893</lpage>
          ,
          <year>2005</year>
          . DOI: 10.1109/cvpr.2005.177.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Bar+11]
          <string-name>
            <given-names>A.L.C.</given-names>
            <surname>Barczak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.H.</given-names>
            <surname>Reyes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abastillas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piccio</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Susnjak</surname>
          </string-name>
          .
          <article-title>A New 2D Static Hand Gesture Colour Image Dataset for ASL Gestures</article-title>
          .
          <source>Res. Lett. Inf. Math. Sci.</source>
          , Vol.
          <volume>15</volume>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>20</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>lifeprint.com/asl101/fingerspelling/”.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Kri+12]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Geoffrey E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          <volume>25</volume>
          (NIPS'2012),
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Ame+17]
          <string-name>
            <given-names>Salem</given-names>
            <surname>Ameen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sunil</given-names>
            <surname>Vadera</surname>
          </string-name>
          .
          <article-title>A convolutional neural network to classify American Sign Language fingerspelling from depth and colour images</article-title>
          .
          <source>Expert Systems</source>
          , Vol.
          <volume>34</volume>
          .
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Tav+14]
          <string-name>
            <given-names>Neha V.</given-names>
            <surname>Tavari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Deorankar</surname>
          </string-name>
          .
          <article-title>Indian Sign Language Recognition based on Histograms of Oriented Gradient</article-title>
          .
          <source>International Journal of Computer Science and Information Technologies</source>
          , Vol.
          <volume>5</volume>
          (
          <issue>3</issue>
          ),
          <fpage>3657</fpage>
          -
          <lpage>3660</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>