=Paper=
{{Paper
|id=Vol-2280/paper-11
|storemode=property
|title=Hand Gesture Recognition Using Convolutional Neural Network and Histogram of Oriented Gradients Features
|pdfUrl=https://ceur-ws.org/Vol-2280/paper-11.pdf
|volume=Vol-2280
|authors=Alda Kika,Aldo Koni
|dblpUrl=https://dblp.org/rec/conf/rtacsit/KikaK18
}}
==Hand Gesture Recognition Using Convolutional Neural Network and Histogram of Oriented Gradients Features==
Alda Kika, Department of Informatics, Faculty of Natural Sciences, University of Tirana, alda.kika@fshn.edu.al

Aldo Koni, Department of Informatics, Faculty of Natural Sciences, University of Tirana, aldo.koni@fshnstudent.info

===Abstract===
Hand gesture recognition is the core of a sign language recognition system for people with hearing impairments and has wide application in human-computer interaction. The dataset chosen for building the hand gesture recognition model consists of the fingerspelling alphabet gestures of American Sign Language. Two methods are used to extract the image features that train the classifier: deep features from the pretrained convolutional neural network AlexNet, and histogram of oriented gradients (HOG). The feature vectors produced by the extraction methods are used as input to train a support vector machine classifier. Test results show that the classifiers built on the two feature sets reach almost the same accuracy. The combination of histogram of oriented gradients as feature extractor and support vector machine as classifier gives very good results for image classification when the input dataset is small, as in our case.

===1 Introduction===
Gesture recognition is a very interesting field of computer vision with practical applications in many areas. One of them is hand gesture recognition, which is one of the methods used in sign language for non-verbal communication. A hand gesture recognition system provides a natural way of communication for people with hearing impairments and an interactive, user-friendly way for people in general to communicate with a computer.

Convolutional neural networks are deep neural networks that have recently reached very high performance in computer vision problems such as image detection and classification. On the other hand, traditional handcrafted features such as histogram of oriented gradients, combined with a classifier, have also proved successful in computer vision tasks. Both approaches have been used in sign language hand gesture recognition, as in [Ame+17] and [Tav+14].

As dataset we have chosen the Massey dataset [Bar+11], which was created for American Sign Language fingerspelling gestures. The pretrained convolutional neural network AlexNet and histogram of oriented gradients are used as feature extractors, while a support vector machine is chosen as the classifier. In this paper we explore these two methods for feature extraction from a fingerspelling alphabet gesture dataset, compare them with each other and discuss the results.

The study is divided into 5 sections. Feature extractors and the classification algorithm are discussed in the second section. The dataset is presented in the third section. Experiments and results are discussed in the fourth section. Conclusions are presented in the last section.

===2 Background===
====Feature Descriptors====
Convolutional neural networks are deep learning tools that are very suitable for computer vision tasks. They not only perform classification but can also learn to extract features directly from raw images [Siv+12]. They resemble ordinary neural networks in that they contain neurons, weights and biases and have one or more fully connected layers, as multilayer networks do; unlike them, they are easier to train because they have fewer parameters.

A very important advantage of using a convolutional neural network for computer vision tasks is that every layer learns different features of the image. These features can be used to train a classifier.

A convolutional neural network is composed of four different types of layers [Shoi+16]:
* Convolutional layer: a set of filters slides over the image. A filter is activated when it finds its pattern in the image.
* Pooling layer: the aim of this layer is to reduce the spatial dimension, the number of parameters and the amount of computation in the network. Several functions can be used, but max pooling is the most common.
* Non-linear layer: the architecture includes activation functions such as rectified linear units (ReLU), identity, tanh or arctan. The non-linear ones introduce non-linearity into the network, which makes training faster and more accurate.
* Fully-connected layer: the neurons in this type of layer connect to every neuron in another layer, as in ordinary neural networks.

We have used the pretrained AlexNet, a deep convolutional neural network that was trained to classify the 1.2 million high-resolution images of the ImageNet LSVRC-2010 contest into 1000 different classes. The architecture of this network is summarized in Figure 1 [Kri+12]. It contains eight learned layers: five convolutional and three fully-connected.

Figure 1: The architecture of AlexNet

Histogram of oriented gradients (HOG) features, defined by Dalal and Triggs [Dal+05], are general-purpose features for object detection and one of the most powerful image descriptors. Representation through HOG has many advantages. Applying histogram of oriented gradients to an image captures local contour information, such as edges, through the structure of the gradients. Edges play a very important role in computer vision tasks, and their orientation describes important features for object detection. HOG uses the edges of the objects to create the feature set that describes the object. To calculate the HOG descriptor of an image, the image is divided into a number of cells, and the gradient orientations in each cell are accumulated into orientation bins.

Some characteristics of each of the methods that we used to extract the features are given below.

Convolutional neural network:
* Convolutional neural networks are deep learning models inspired by the way the visual cortex operates, built by alternating convolutional and pooling layers.
* They are trained feature detectors, which makes them very adaptable. This is why they reach the highest accuracy in image detection.
* They can learn from training samples low-level features of the kind that methods such as HOG or SIFT compute.

Histogram of oriented gradients:
* It is based on first-order gradients that are accumulated into different orientation bins.
* It is dense (it is evaluated over the whole image).
* The features are not learned but handcrafted, meaning that the information is taken directly from the image, for example from corners or edges.

====Classifier====
The support vector machine (SVM), introduced by Vapnik [Vap98], is one of the most advanced classification methods in machine learning. Compared with other classification methods such as decision trees or Bayesian networks, its advantages are higher accuracy and a geometric interpretation. Above all, support vector machines do not need a large amount of training data to avoid overfitting [Cam+11]. They work well in practice in many types of applications, such as digit recognition, face identification and bioinformatics.

Classification of data is a common task in machine learning. The principle of the SVM lies in determining the classes to which the data belong: the SVM builds a model that assigns new cases to the classes. Training the SVM involves the optimization of a concave function, which has a single solution. Other learning paradigms do not guarantee that the function is concave, so they may produce different solutions depending on the initial values of the model parameters. The data enter the model through kernels, which measure the similarity or dissimilarity of data objects. Kernels can be constructed for different types of objects, from continuous to discrete data and from sequences to graphs. In this manner different types of data can be handled by the same model, which makes the approach very flexible and powerful. Support vector machines are the best known and most widely used kernel method [Cam+11].

===3 Dataset===
The chosen dataset was created at Massey University, New Zealand. It contains 2524 images, created in such a manner that the hand touches all the borders of the frame. The hands are cropped from the original image and placed on a black background. The size of the frame is 500x500 pixels. The gestures were performed by 5 users and are based on the American Sign Language fingerspelling alphabet. The main characteristics that distinguish this dataset from other similar datasets are: firstly, the images cover a large variety of hands under different illumination conditions; secondly, the images are segmented and cropped but not otherwise altered from the original captured images; and thirdly, there is no need to use special gloves or any other apparatus [Bar+11]. Figure 2 shows the process of creating the dataset.

Figure 2: Acquired image with wrist cover. The images are segmented to obtain the final images stored in the dataset

The names of the files follow a simple convention that can easily be used by programmers in their scripts. For example, the convention for the file handX_G_ILL_seg_crop_R.png is:
* X is the number of the user.
* G is the gesture, from a to z.
* ILL is the illumination condition, which can be bot (bottom), top, left, right or diff (diffuse).
* R is the repetition of the gesture.

Figure 3 presents the American fingerspelling alphabet gestures of the dataset.

Figure 3: The dataset for the fingerspelling alphabet of American Sign Language [Asl]

Since the two letters "j" and "z" are not static, we remove them from the dataset. The remaining data, grouped in 24 classes, serve for training and testing the classifier.

===4 Experiments and results===
Two methods were used to extract the features from the images of the dataset: the pretrained convolutional neural network AlexNet and histogram of oriented gradients. Each feature set is divided into a training set and a testing set. One classifier is trained on each training set, giving two classifiers, and each is then tested with the remaining features. The diagram of the experiments is presented in Figure 4.

Figure 4: Diagram of the experiments

We use top-1 and top-5 accuracies. Top-1 accuracy is the conventional accuracy: the answer to which the model assigns the highest probability must match the expected answer. Top-5 accuracy means that the expected answer must match one of the model's 5 highest-probability answers.

The two classifiers were tested using the testing set, giving the following results:

{| class="wikitable"
|+ Table 1: The accuracies of the classifiers
! Classifier !! HOG !! AlexNet (CNN)
|-
| Top-1 accuracy || 0.6423 || 0.6231
|-
| Top-5 accuracy || 0.8769 || 0.8615
|}

The results of the experiments show that the highest accuracy (both top-1 and top-5) is reached when the features extracted with the HOG algorithm are used to train the classifier. Top-5 accuracy is almost the same for both models.

===5 Conclusions===
Deep learning is one of the fields of machine learning that gives very good results in complex data analysis. A convolutional neural network is a deep neural network used in computer vision tasks. We have used a pretrained convolutional neural network and the handcrafted histogram of oriented gradients to extract features from a set of hand gesture images of American fingerspelling sign language. The features were used to train a support vector machine classifier. The classifier trained with features extracted with histogram of oriented gradients reaches the highest top-1 and top-5 accuracy.

In the case of a convolutional neural network, the number of training samples is very important because the network learns from them. We used the pretrained network AlexNet, which was trained on millions of images from 1000 categories that are clearly distinct from one another, whereas sign language hand gesture categories differ very little from each other. Histogram of oriented gradients uses predetermined filters, while a convolutional neural network learns its filters from the training dataset.

Through fine-tuning with a larger sign language dataset, the pretrained convolutional neural network can transfer its general recognition capabilities to the specific features of the hand gesture classes. This leaves room for improving the results and motivates further research.

===References===
[Siv+12] M. Sivalingamaiah and B. D. V. Reddy. Texture segmentation using multichannel Gabor filtering. IOSR Journal of Electronics and Communication Engineering, Vol. 2, pp. 22-26, 2012.
[Shoi+16] Doaa A. Shoieb, Sherin M. Youssef, and Walid M. Aly. Computer-Aided Model for Skin Diagnosis Using Deep Learning. Journal of Image and Graphics, Vol. 4, No. 2, pp. 116-121, December 2016. DOI: 10.18178/joig.4.2.116-121.
[Li+10] Daoliang Li, Wenzhu Yang, and Sile Wang. Classification of foreign fibers in cotton lint using machine vision and multi-class support vector machine. Comput. Electron. Agric., Vol. 74, pp. 274–279, 2010.
[Vap98] Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.
[Cam+11] Colin Campbell and Yiming Ying. Learning with Support Vector Machines. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2011.
[Dal+05] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 886–893, 2005. DOI: 10.1109/cvpr.2005.177.
[Bar+11] A. L. C. Barczak, N. H. Reyes, M. Abastillas, A. Piccio, and T. Susnjak. A New 2D Static Hand Gesture Colour Image Dataset for ASL Gestures. Res. Lett. Inf. Math. Sci., Vol. 15, pp. 12–20, 2011.
[Asl] A. S. L. University. Fingerspelling. http://www.lifeprint.com/asl101/fingerspelling/
[Kri+12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (NIPS 2012), 2012.
[Ame+17] Salem Ameen and Sunil Vadera. A convolutional neural network to classify American Sign Language fingerspelling from depth and colour images. Expert Systems, Vol. 34, 2017.
[Tav+14] Neha V. Tavari and A. V. Deorankar. Indian Sign Language Recognition based on Histograms of Oriented Gradient. International Journal of Computer Science and Information Technologies, Vol. 5 (3), pp. 3657-3660, 2014.
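
Supplementary sketch: the top-1 and top-5 accuracies defined in Section 4 can be illustrated with a few lines of Python. This is not the code used in the experiments; the function name `top_k_accuracy`, the score layout (one row of per-class scores per test image) and the toy numbers are assumptions made only for illustration.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes.

    Hypothetical helper: scores[i][c] is the classifier's score for class c
    on sample i, and labels[i] is the index of the expected class.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the first k.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example with 3 samples and 4 classes (made-up scores, not paper data).
toy_scores = [
    [0.1, 0.6, 0.2, 0.1],  # expected class 1 is ranked first
    [0.4, 0.3, 0.2, 0.1],  # expected class 1 is ranked second
    [0.1, 0.2, 0.3, 0.4],  # expected class 0 is ranked last
]
toy_labels = [1, 1, 0]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```

With k=1 the function reduces to the conventional accuracy; with k=5 and 24 class scores per image it corresponds to the top-5 measure reported in Table 1.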