Implementation of Artificial Intelligence Methods for Virtual Reality Solutions: a Review of the Literature

Rytis Augustauskas
Department of Automation
Kaunas University of Technology, Kaunas, Lithuania
rytis.augustauskas@ktu.edu

Aurimas Kudarauskas
Department of Automation
Kaunas University of Technology, Kaunas, Lithuania
aurimas.kudarauskas@ktu.edu

Cenker Canbulut
Department of Multimedia Engineering
Kaunas University of Technology, Kaunas, Lithuania
cenker.canbulut@ktu.edu

Copyright held by the author(s).



Abstract—Today, Artificial Intelligence (AI) is used widely in data science and computer vision, and it has proven to provide state-of-the-art algorithms for classification tasks. One of the tasks that Virtual Reality applications often solve can be specified as object recognition or classification. These types of tasks benefit from the automatic feature detection provided by convolutional neural networks (CNN). This article investigates AI methods for object recognition and skeleton recognition and provides a practical guide on implementing them for the given tasks in Virtual Reality.

Keywords—CNN; neural network; VR; AI; image processing; object recognition.

I. INTRODUCTION

Nowadays, deep learning is a new and very active topic. Research done in the AI field can provide satisfactory solutions for object detection, image classification, natural language processing, and many other areas that make use of AI. One of the biggest fields of deep learning utilization is computer vision. Advanced artificial intelligence methods can detect objects, understand person movement, and interpret gestures or behavior using RGB and depth data. Modern sensors, such as Microsoft Kinect, Leap Motion, Intel RealSense, and others not mentioned here, can help to extract visual information about the scene context. A machine could be made to "understand" the scene without other sensors, using only visual data (RGB and a depth map). This is also a more natural way of interaction when it comes to understanding gestures and pose, because no other input device, such as a joystick, is involved.

In this paper, we present an overview of the newest research on deep learning utilization in the Virtual Reality field, covering data preprocessing, gesture recognition, and pose estimation methods based on neural networks. We used articles from the IEEE database due to its high acceptance requirements. The AI theme is very popular, so we included only articles written in the last two years, except for a few written in the last three years; the exception was made due to the impact those articles have made on the industry.

II. OVERVIEW

The following overview of the literature is organized in sections. Sections II-A to II-C overview generic problems related to the application of neural networks for problem solving. Section II-D describes the latest object detection methods related to person tracking. Section II-E covers state-of-the-art methods of pose and hand keypoint estimation and gesture recognition based on deep neural networks.
A. Training dataset

When working with neural networks, a training dataset is a must, since the solution relies on the standardization it provides. Preparing such a dataset can be very labor intensive. The best tradeoff between the information provided to the algorithm and the time needed for annotation is the bounding box method [1]. It is also possible to minimize labeling time by utilizing internet-generated data, which, however, must be filtered well [2]. When the problem does not have strict object classes, it is possible to use an automated class generation algorithm to eliminate the time needed for database preparation [3]. If you are working with specific objects and no large dataset is available, the neural network can be pretrained on a training dataset of a similar nature [4]. It is also worth noting that only larger networks benefit from more detailed input data [5].
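As an illustration of the bounding box method [1], the minimal Python sketch below shows one way a labeled training sample could be stored; the field names and the box convention are our own assumptions, not a format prescribed by the cited work.

    # Hypothetical bounding-box annotation for one training image.
    # Boxes are (x, y, width, height) in pixels; real formats such as
    # Pascal VOC or COCO differ in detail.
    annotation = {
        "image": "scene_0001.png",
        "objects": [
            {"label": "person", "bbox": (120, 45, 60, 170)},
            {"label": "hand",   "bbox": (150, 90, 25, 30)},
        ],
    }

    for obj in annotation["objects"]:
        x, y, w, h = obj["bbox"]
        print(obj["label"], "at", (x, y), "size", (w, h))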
B. Data preprocessing

With the wide variety of sensors used for collecting data, it is hard to obtain normalized data, and different sensors require different filtering. For image sensors, the median filter [4] or filters based on neural networks [5][6] work well. When working with moving depth sensors, a ghosting effect can appear in the point clouds; such inaccurate data can be filtered out by segmenting the point cloud with a convolutional neural network [7].
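As a minimal illustration of such sensor-specific filtering, the sketch below applies a median filter with OpenCV; the kernel size and file names are placeholder assumptions.

    import cv2

    # Read a raw sensor frame (placeholder file name).
    img = cv2.imread("frame.png")

    # 5x5 median filter, a common remedy for salt-and-pepper sensor
    # noise; the kernel size must be odd.
    denoised = cv2.medianBlur(img, 5)
    cv2.imwrite("frame_denoised.png", denoised)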
When images are used as data, the differences in object sizes must be compensated. Usually, dedicated neural networks are used to generate the regions of the picture that are most likely to contain an object [8][9]. When regions of interest (ROI) are already available, a few iterations with different scales can be made over the ROI so that small objects can also be extracted [10], as sketched below.
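A minimal sketch of such a multi-scale pass over an ROI is given below; the coordinates, the scales, and the classify step are illustrative assumptions.

    import cv2

    img = cv2.imread("frame.png")
    x, y, w, h = 100, 80, 200, 150       # region of interest (placeholder)
    roi = img[y:y + h, x:x + w]

    # Re-examine the same ROI at several scales so that small objects
    # become large enough for a fixed-input-size classifier.
    for scale in (1.0, 1.5, 2.0):
        resized = cv2.resize(roi, None, fx=scale, fy=scale)
        # detections = classify(resized)  # detector of choice goes here
        print(resized.shape)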

Another way is to use a few neural networks optimized for different scales [11]. It is also important to note that shallow CNNs perform better with small objects than deep ones, due to the information lost in the convolution layers [12]. The extraction of the most distinctive features can be improved by the regularizations of a spatial transformation branch and a Fisher encoding based multi-modal fusion branch [13]. Another effective approach to the small-scale object detection problem is the use of atrous convolutions; these convolutions adapt to different input sizes and have a constant output size [14].
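The sketch below shows an atrous (dilated) convolution in PyTorch; the channel counts and input size are arbitrary assumptions, but it illustrates how the dilated kernel enlarges the receptive field while padding keeps the output size constant.

    import torch
    import torch.nn as nn

    # 3x3 kernel with dilation=2 covers a 5x5 neighborhood; padding=2
    # keeps the spatial size of the output equal to the input.
    conv = nn.Conv2d(in_channels=64, out_channels=64,
                     kernel_size=3, padding=2, dilation=2)

    x = torch.randn(1, 64, 56, 56)
    print(conv(x).shape)  # torch.Size([1, 64, 56, 56])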
C. Optimization of CNN

Training a convolutional neural network (CNN) usually takes a lot of computational power. This can be reduced by restructuring the layers of the CNN [15]. It is also possible to reduce the computational needs of the algorithm by removing background information from the input data [16]. When the CNN algorithm is optimized towards execution speed, parallel operations should be reduced in favor of larger feature maps, or the feature maps of two different convolutional layers can be combined [17][18][19]. If working with ROI, the algorithm can be sped up by implementing a cascade filtering algorithm [20]. The search for ROI can also be improved by combining a convolutional layer map with the edge map of the same image [7]. Computing the same algorithm at several image scales takes a lot of time; instead, it is possible to use a feature map of one scale and calculate the feature maps of the other scales from it. This improves the ability to detect small objects and reduces the required computation time [21].

It is also possible to minimize the time spent selecting a CNN architecture by utilizing a performance index calculation method [22].
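As one possible reading of the feature-map combination idea [17][18][19], the PyTorch sketch below concatenates the maps of two convolutional layers after bringing them to a common resolution; the layer sizes are our own assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoLayerFusion(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(3, 16, 3, padding=1)             # fine features
            self.conv2 = nn.Conv2d(16, 32, 3, stride=2, padding=1)  # coarse features

        def forward(self, x):
            f1 = F.relu(self.conv1(x))
            f2 = F.relu(self.conv2(f1))
            f2 = F.interpolate(f2, size=f1.shape[2:])  # upsample to match f1
            return torch.cat([f1, f2], dim=1)          # 16 + 32 = 48 channels

    out = TwoLayerFusion()(torch.randn(1, 3, 64, 64))
    print(out.shape)  # torch.Size([1, 48, 64, 64])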

D. Object detection using CNN

An approach using a Convolutional Neural Network for object recognition, where the object can be a human, a hand, or any other object to be defined, gives another alternative for solving the object detection problem. Research made in Madrid proposes a deep learning-based approach using a CNN in combination with the Long Short-Term Memory (LSTM) method to recognize skeleton-based human activity and hand gestures [23]. There are many approaches to this problem based on recognition of the human skeleton [24]; however, the success of deep learning techniques started around 2012. The proposed research relies on a CNN in combination with an LSTM. As is well known, CNNs are structured to exploit the highly spatially local correlation patterns in images. In this approach, the CNN focuses on the position patterns of the skeleton joints in 3D space, after which an LSTM recurrent network is used to capture spatiotemporal patterns related to the time evolution of the 3D coordinates of the skeleton joints. The proposed approach has its input data arranged in a three-dimensional block, where one dimension of the block matches the number of skeleton joints, J.

Figure 1 - Example of a full-body skeleton (20 joints) and a hand skeleton (22 points) [23].

In the case of Figure 1, for the full-body skeleton the input data is composed of 20 joints with T time steps and three spatial coordinates; in the hand skeleton case it is 22 joints with T time steps and three spatial coordinates. At each time step the block is shifted by one frame, releasing the oldest frame and including a new one in an overlapping manner. In this approach, the CNN operates on the 3D information of the data over a temporal window (T time steps) to generate the features detected in the input block. Later, the LSTM is used to integrate the features detected in the consecutive overlapping blocks, which allows the system to maintain information beyond the last T time steps. More information about the LSTM can be found in [25]. The combined CNN and LSTM structure used for training is shown in Figure 2, and the block construction is sketched below.

Figure 2 - The structure of the network during the pre-training stage consists of a CNN attached to a LSTM [23].

Figure 3 - The structure of the network during the final stage consists of a CNN attached to a LSTM [23].
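A minimal numpy sketch of this input block construction is given below; the sequence itself is random stand-in data, and only the shapes follow the description in [23].

    import numpy as np

    J, T = 20, 8                # 20 full-body joints, T time steps per block
    num_frames = 100
    seq = np.random.rand(num_frames, J, 3).astype(np.float32)  # (frames, J, xyz)

    # Overlapping blocks shifted by one frame: each (T, J, 3) block is one
    # CNN input; the LSTM then integrates the consecutive blocks.
    blocks = np.stack([seq[i:i + T] for i in range(num_frames - T + 1)])
    print(blocks.shape)  # (93, 8, 20, 3)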
During the experimental work, 5 different datasets are used: the MSRDailyActivity3D dataset, the UTKinect-Action3D dataset, the NTU RGB+D dataset, the Montalbano V2 dataset, and the Dynamic Hand Gesture (DHG-14/28) dataset. Each dataset comes with different activities to perform, and some include human-object interactions as well. As each dataset can behave differently with the proposed method, it is necessary to use recent and widely adopted datasets to test the given method for validity. The capability of skeleton tracking also depends on the hardware architecture on which the proposed method is executed; that is, for smooth recognition and overall tracking performance, the hardware configuration of the computer should be taken into account. Although this method may require data augmentation, since the data taken from the datasets might be very small or differ in terms of color aspects, it gives good results for the human gestures desired to be tracked or captured. As a result, it can be a very good alternative for recognizing the full human body skeleton and hand gestures using the CNN with LSTM combination.
The object detection problem can also be solved by an iterative CNN. The image is split into equal boxes, each of which is assigned the class to be searched for. Step by step, the bounding boxes are moved toward the candidate of the class the box should be bounding; after some iterations, the box ends up bounding the object it was searching for. This method can increase detection speed by five times compared to Fast R-CNN [27].
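The toy sketch below mimics this iterative refinement; the regressor here is a stand-in that simply returns a fraction of the remaining offset, whereas in the cited grid-based detector the shift is predicted by the network itself.

    import numpy as np

    def regressor(box):
        # Stand-in for the CNN step that predicts how to shift a box;
        # here it just moves halfway toward a hypothetical object box.
        target = np.array([40.0, 30.0, 80.0, 60.0])  # (x, y, w, h)
        return (target - box) * 0.5

    box = np.array([0.0, 0.0, 32.0, 32.0])  # one cell of the initial grid
    for _ in range(5):
        box = box + regressor(box)
    print(box)  # has moved close to the hypothetical object box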
CNN performs well on the image classification problem, although the object detection problem also requires extracting spatial information about the object. By introducing feature maps that also include the spatial layout of the features, it is possible to increase both the speed and the accuracy of object detection compared to Fast R-CNN [26].

Typically, object recognition is performed on 2D images. It is, however, possible to perform object detection on a 3D point cloud: great accuracy can be achieved by making three projections of the object and utilizing three CNNs for classification [9].

Some objects have a wide variety of features depending on the viewing angle, and monolithic neural networks have problems detecting objects of widely diverse categories. The introduction of subcategories (S-CNN) improves the performance of object recognition in such situations [28]. The combination of an image sensor feature map and a depth sensor feature map can also produce impressive results; the obtained feature maps can be classified by support vector machines or Mondrian forest algorithms [10].
Experiments were performed to test the effectiveness of the proposed approach in terms of detecting a specific gesture of the user. The training was performed on a PC with high computational power: an Intel Xeon E5-1620v3 server clocked at 3.5 GHz with 16 GB of RAM and a 2015 NVIDIA GeForce GTX TITAN Black GPU with 6 GB of GDDR5 memory and 2880 CUDA cores.

Figure 4 - Skeleton count per second of each GPU processor.
The experimental methodology was executed by creating action subsets, where each subset contains a group of similar hand gesture motions. This way, the authors compare the accuracy of the proposed approach in terms of gesture and body recognition. In this review paper, we show the overall accuracy of the Action Subsets that were created under specific criteria by the authors. Based on these facts, we can see the average accuracy measurement as an overall value corresponding to each action subset for a given amount of training.

Table 1 - Accuracy obtained from the action subsets for a given training amount T.

    T     AS1      AS2      AS3      Average
    10    92.39    94.65    93.7     93.58
    20    93.3     94.6     99.1     95.66
    30    93.81    85.72    93.7     87.74

From Table 1 we can determine the mean value of the per-row averages to assess the consistency of the proposed approach, using the mean value function (Equation 1):

    \bar{A} = \frac{1}{n} \sum_{i=1}^{n} A_i    (1)

By calculating the mean value of the given averages we get 92.3, which shows the solidity and consistency of the proposed approach, considering the training amounts used to produce the results.
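The reported value can be checked directly from the Table 1 averages with a quick Python calculation:

    # Mean of the per-training-amount averages from Table 1.
    averages = [93.58, 95.66, 87.74]
    print(round(sum(averages) / len(averages), 2))  # 92.33, i.e. the ~92.3 above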
The authors also show extended data for each Action Subset, where each subset contains the proposed hand gesture types that are similar. In this context, one can also see the type of the proposed gesture and what error occurred during its execution and detection: if the proposed gesture was pinching and an error occurred during the execution of this gesture, the gesture that was detected instead of the proposed one is recorded. This data can be very useful for identifying what types of difficulties the proposed approach struggles with when training on the MSRDailyActivity3D dataset. The article also provides a secondary protocol where body attributes are recognized as well, but it only contains a Kinect dataset compared with the MSRDailyActivity3D dataset; this protocol contains a mixture of 11 body and hand gesture tracking tasks. As we believe the research scope was explained better, with precise conditions, in the first protocol, we keep the second protocol out of our scope; it can be investigated in the article itself [23].

Table 2 - Accuracy results for the MSR Action3D Dataset compared with other methodologies.

    Method                           AS1      AS2      AS3      Average
    Wang et al., 2016 [29]           -        -        -        97.4
    CNN + LSTM [23]                  93.3     94.6     99.1     95.7
    Chen et al., 2015 [30]           98.1     92.0     94.6     94.9
    Du et al., 2015 [31]             93.3     94.6     95.5     94.5
    Tao & Vidal, 2015 [32]           89.8     93.6     97.0     93.5
    Lillo et al., 2016 [33]          -        -        -        93.0
    Vemulapalli et al., 2014 [34]    95.3     83.9     98.2     92.5
    Ben Amor et al., 2016 [35]       -        -        -        89.0
    Li et al., 2010 [36]             72.9     71.9     79.2     74.7

As Table 2 shows, the proposed approach achieves significant accuracy on the proposed action subsets trained using CNN+LSTM. Some methodologies listed in the table do not contain the same action subsets as this research, but they also rely on similar hand gesture types, with the only difference being that the accuracy measurement is executed differently. We can conclude that the article relies on solid evidence, using well-known datasets to evaluate the approach, and the presentation of the paper shows the relevance of the reported data. In our view, the given approach can be a very good alternative for implementing CNN+LSTM to recognize body and hand gestures. We highly recommend that researchers consult the original article to observe the full scale of the described methodology. The approach can be implemented in today's games or engines to improve the effectiveness of gesture recognition and to develop further applications.



E. Pose estimation and gesture recognition

One of the problems in Virtual and Augmented Reality applications is person pose estimation and hand gesture recognition. It can be a challenging task, especially when the environment is complex. Due to its difficulty, it can be divided into several parts: person detection, joint extraction, and merging the joints into a skeleton (pose estimation). Furthermore, hand gestures can be interpreted if needed. A few years ago, the pose estimation task could already be performed with the Microsoft Kinect SDK [37], which uses RGB and depth camera data to extract the skeleton and estimate its position in 3D space. In this case, point cloud data can be very useful for distinguishing the person from the background.

Nowadays, there are even more modern approaches to the pose estimation task. With the help of deep neural networks, it is possible to extract the person and detect the joints using only RGB camera data. Wei et al. [38] and Cao et al. [39] proposed an interesting methodology to detect 2D pose, in which a convolutional neural network is utilized to detect the joints of a person. The former [38] is a state-of-the-art method on the LSP and FLIC [40] datasets.

Figure 5 - Convolutional pose machine joints detection example (video: https://www.youtube.com/watch?v=EOKfMpGLGdY).

The following year (2017), a further technique was introduced by researchers [39]. Like the method mentioned previously, it uses a convolutional neural network to detect joints from an RGB image, and the algorithm is capable of detecting more than one person in an image. It is a state-of-the-art method in performance and efficiency on the MPII [41] Multi-Person dataset, scoring 75.6% accuracy on the whole testing set. On a laptop with an Nvidia GeForce GTX-1080 GPU, the algorithm achieves 8.8 fps on a 1080x1920 frame resized to 368x654 with 19 people in it.

Figure 6 - "Realtime Multi-Person 2D Human Pose Estimation using Part Affinity Fields" pose estimation.

Another part of even deeper person behavior understanding is gesture recognition. It is a big part of VR applications, because hand gestures enable more natural interaction that does not require any input equipment; control can be performed by interpreting visual information only.

Gesture recognition in Virtual Reality applications has already been enabled by companies such as Leap Motion [42] and SoftKinetic [43]. The mentioned companies solve the problem with depth cameras (Figure 7).

Figure 7 - a) DepthSense DS325 and b) Leap Motion sensor (https://www.leapmotion.com/).
Both of the mentioned sensors can be attached to a headset or work as separate devices (Figure 8).

Figure 8 - a) standalone sensor mode (https://www.rockpapershotgun.com/tag/leap-motion-3d-jam/), b) sensor attached to a headset (https://www.vrheads.com/how-use-leap-motion-your-oculus-rift).

For hand and gesture recognition, these algorithms use depth data. Because of this technology, the sensors depend on the object's distance from the camera and have a short working range; for example, the SoftKinetic DS325 working range is 0.15-1 m.

However, hand detection and gesture recognition can be done with different techniques and data. Recently, different methods [44], [45] have been introduced. In one study [44], an algorithm to detect hand joints from RGB data is proposed. It uses convolutional neural networks for hand pose detection. The method can run in real time on a GPU, and its accuracy is as high as that of other methods that use a depth sensor for the task. Furthermore, given different viewpoints, it can produce a 3D hand pose estimation by triangulating the feeds from the different cameras (Figure 9).

Figure 9 - Hand pose estimation generated from RGB data from different viewpoints (https://www.youtube.com/watch?v=q4xbmEQp3VE).
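A minimal sketch of such two-view triangulation with OpenCV is shown below; the projection matrices and pixel coordinates are placeholder assumptions standing in for calibrated cameras and detected keypoints.

    import numpy as np
    import cv2

    # Two 3x4 projection matrices (placeholder calibration: identity
    # intrinsics, second camera shifted along the x axis).
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

    # The same hand keypoint detected in both views (2xN pixel arrays).
    pts1 = np.array([[320.0], [240.0]])
    pts2 = np.array([[300.0], [240.0]])

    X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)  # homogeneous 4x1
    print((X_h[:3] / X_h[3]).ravel())                # 3D keypoint estimate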
                                                                       introduced. Due to the necessity of having a depth camera, the
Another proposed method [45] not only detects the hand, but also recognizes hand gestures. The research uses Danish and New Zealand sign language data and the RWTH-PHOENIX-Weather 2014 [46] dataset for CNN training. It can achieve more than 100 fps on a single Nvidia GeForce GTX 980.

Figure 10 - Small part of the RWTH-PHOENIX-Weather 2014 sign language dataset.

A real implementation of the reviewed methods can be found as well: the OpenPose project [47] utilizes pose estimation [38], [39] and hand and finger joint detection [44].

Figure 11 - OpenPose project demonstration [47].
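As a sketch of how the library might be run on a video, the call below invokes the OpenPose demo binary from Python; the binary path and flag spellings are assumptions that should be checked against the installed version of [47].

    import subprocess

    # Hypothetical invocation of the OpenPose demo on an input video,
    # with hand keypoint detection enabled and JSON keypoint output.
    subprocess.run([
        "./build/examples/openpose/openpose.bin",
        "--video", "input.avi",
        "--hand",
        "--write_json", "output/",
    ], check=True)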
III. CONCLUSIONS

Large interest in Artificial Intelligence keeps the scientific community at a high research pace. We have overviewed the latest achievements in the field of AI application for virtual reality technologies and provided a systematic classification of the research papers and their contributions to the field. We have also separated general advancements of CNN algorithms from advancements directly related to VR, to provide a search tool for the relevant literature. A bounding-box annotated training set is the best compromise between the time needed for preparation and the resulting training performance. Also, if you have a small dataset, we would recommend using a pre-training technique. Depending on your technical capabilities, we highly recommend implementing one of the techniques addressing different object scales.

For pose estimation and gesture recognition, several approaches using only RGB or only depth camera data were introduced. Due to the necessity of having a depth camera, the inherent noise associated with direct lighting conditions, and the depth camera's measuring range limitations, AI methods that use only RGB data might be one of the best approaches. From the given examples, it can be seen that the AI-based techniques [38][39][44][46] perform accurately. The drawback of the mentioned methods is that they require a high-end GPU to produce more frames per second.

REFERENCES

[1] S. Hong, S. Kwak, and B. Han, "Weakly Supervised Learning with Deep Convolutional Neural Networks for Semantic Segmentation: Understanding Semantic Layout of Images with Minimum Human Supervision," IEEE Signal Process. Mag., vol. 34, no. 6, pp. 39-49, 2017.
[2] G. Zheng, M. Tan, J. Yu, Q. Wu, and J. Fan, "Fine-grained image recognition via weakly supervised click data guided bilinear CNN model," Proc. IEEE Int. Conf. Multimed. Expo, pp. 661-666, 2017.
[3] C. Hsu and C. Lin, "CNN-Based Joint Clustering and Representation Learning with Feature Drift Compensation for Large-Scale Image Data," IEEE Trans. Multimed., vol. 20, no. 2, pp. 421-429, 2017.
[4] L. Jin and H. Liang, "Deep learning for underwater image recognition in small sample size situations," OCEANS 2017 - Aberdeen, pp. 1-4, 2017.
[5] M. Valdenegro-Toro, "Best Practices in Convolutional Networks for Forward-Looking Sonar Image Recognition," 2017.
[6] K. Zhang, W. Zuo, S. Gu, and L. Zhang, "Learning Deep CNN Denoiser Prior for Image Restoration," pp. 3929-3938, 2017.
[7] F. Gomez-Donoso, A. Garcia-Garcia, J. Garcia-Rodriguez, S. Orts-Escolano, and M. Cazorla, "LonchaNet: A sliced-based CNN architecture for real-time 3D object recognition," Proc. Int. Jt. Conf. Neural Networks, vol. 2017-May, pp. 412-418, 2017.
[8] F. Beritelli, G. Capizzi, G. Lo Sciuto, C. Napoli, and F. Scaglione, "Automatic heart activity diagnosis based on gram polynomials and probabilistic neural networks," Biomedical Engineering Letters, vol. 8, no. 1, pp. 77-85, 2018.
[9] Q. Lu, C. Liu, Z. Jiang, A. Men, and B. Yang, "G-CNN: Object Detection via Grid Convolutional Neural Network," IEEE Access, vol. 5, pp. 24023-24031, 2017.
[10] T. Chen, S. Lu, and J. Fan, "S-CNN: Subcategory-aware convolutional networks for object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 8828, no. c, pp. 1-8, 2017.
[11] Y. Gao et al., "Scale optimization for full-image-CNN vehicle detection," IEEE Intell. Veh. Symp. Proc., pp. 785-791, 2017.
[12] B. Nagy and C. Benedek, "3D CNN based phantom object removing from mobile laser scanning data," Proc. Int. Jt. Conf. Neural Networks, vol. 2017-May, pp. 4429-4435, 2017.
[13] C. Eggert, S. Brehm, A. Winschel, D. Zecha, and R. Lienhart, "A closer look: Small object detection in faster R-CNN," Proc. IEEE Int. Conf. Multimed. Expo, pp. 421-426, 2017.
[14] Y. Lou, G. Fu, Z. Jiang, A. Men, and Y. Zhou, "PBG-Net: Object detection with a multi-feature and iterative CNN model," 2017 IEEE Int. Conf. Multimed. Expo Work. (ICMEW), pp. 453-458, 2017.
[15] L.-L. Wang et al., "Research on Relief Effect of Image Based on the 5 Dimension CNN," 2017, pp. 416-418.
[16] C. Termritthikun and S. Kanprachar, "Accuracy improvement of Thai food image recognition using deep convolutional neural networks," 2017 Int. Electr. Eng. Congr., pp. 1-4, 2017.
[17] C. Bentes, D. Velotto, and B. Tings, "Ship Classification in TerraSAR-X Images With Convolutional Neural Networks," IEEE J. Ocean. Eng., vol. 43, no. 1, pp. 258-266, 2017.
[18] U. Asif, M. Bennamoun, and F. Sohel, "A Multi-modal, Discriminative and Spatially Invariant CNN for RGB-D Object Labeling," IEEE Trans. Pattern Anal. Mach. Intell., vol. 8828, no. c, 2017.
[19] H. Li, Y. Huang, and Z. Zhang, "An improved faster R-CNN for same object retrieval," IEEE Access, vol. 5, no. 8, pp. 13665-13676, 2017.
[20] T. Guan and H. Zhu, "Atrous Faster R-CNN for Small Scale Object Detection," 2017 2nd Int. Conf. Multimed. Image Process., pp. 16-21, 2017.
[21] M. A. Waris, A. Iosifidis, and M. Gabbouj, "CNN-based edge filtering for object proposals," Neurocomputing, vol. 266, pp. 631-640, 2017.
[22] D. Anisimov and T. Khanova, "Towards lightweight convolutional neural networks for object detection," 2017 14th IEEE Int. Conf. Adv. Video Signal Based Surveill., pp. 1-8, 2017.
[23] J. C. Núñez, R. Cabido, J. J. Pantrigo, A. S. Montemayor, and J. F. Vélez, "Convolutional Neural Networks and Long Short-Term Memory for skeleton-based human activity and hand gesture recognition," Pattern Recognit., vol. 76, pp. 80-94, 2018.
[24] L. Lo Presti and M. La Cascia, "3D skeleton-based human action classification: A survey," Pattern Recognit., vol. 53, pp. 130-147, 2016.
[25] F. Bonanno, G. Capizzi, S. Coco, C. Napoli, A. Laudani, and G. Lo Sciuto, "Optimal thicknesses determination in a multilayer structure to improve the SPP efficiency for photovoltaic devices by an hybrid FEM-cascade neural network based approach," Int. Symp. Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM), pp. 355-362, 2014.
[26] F. Yang, W. Choi, and Y. Lin, "Exploit All the Layers: Fast and Accurate CNN Object Detector with Scale Dependent Pooling and Cascaded Rejection Classifiers," 2016 IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2129-2137, 2016.
[27] M. Najibi, M. Rastegari, and L. S. Davis, "G-CNN: an Iterative Grid Based Object Detector," 2015.
[28] Y. Liu, H. Li, J. Yan, F. Wei, X. Wang, and X. Tang, "Recurrent Scale Approximation for Object Detection in CNN," vol. 1, pp. 571-579, 2017.
[29] C. Wang, Y. Wang, and A. L. Yuille, "Mining 3D Key-Pose-Motifs for Action Recognition," 2016 IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2639-2647, 2016.
[30] C. Chen, R. Jafari, and N. Kehtarnavaz, "Action recognition from depth sequences using depth motion maps-based local binary patterns," Proc. 2015 IEEE Winter Conf. Appl. Comput. Vision (WACV), pp. 1092-1099, 2015.
[31] Y. Du, W. Wang, and L. Wang, "Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition," IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1110-1118, 2015.
[32] L. Tao and R. Vidal, "Moving Poselets: A Discriminative and Interpretable Skeletal Motion Representation for Action Recognition," Proc. IEEE Int. Conf. Comput. Vis., pp. 303-311, 2016.
[33] I. Lillo, J. C. Niebles, and A. Soto, "A Hierarchical Pose-Based Approach to Complex Action Understanding Using Dictionaries of Actionlets and Motion Poselets," pp. 1981-1990, 2016.
[34] R. Vemulapalli, F. Arrate, and R. Chellappa, "Human action recognition by representing 3D skeletons as points in a lie group," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 588-595, 2014.
[35] B. Ben Amor, J. Su, and A. Srivastava, "Action Recognition Using Rate-Invariant Analysis of Skeletal Shape Trajectories," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 1, pp. 1-13, 2016.
[36] W. Li, Z. Zhang, and Z. Liu, "Action Recognition Based on A Bag of 3D Points," Comput. Vis. Pattern Recognit. Work. (CVPRW), 2010 IEEE Comput. Soc. Conf., pp. 9-14, 2010.
[37] "Microsoft Kinect SDK," 2018. [Online]. Available: https://www.microsoft.com/en-us/download/details.aspx?id=44561.
[38] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional Pose Machines," 2016 IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
[39] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields," 2016.
[40] B. Sapp and B. Taskar, "MODEC: Multimodal Decomposable Models for Human Pose Estimation," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 3674-3681, 2013.
[41] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2D human pose estimation: New benchmark and state of the art analysis," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 3686-3693, 2014.
[42] "Leap Motion product website." [Online]. Available: https://www.leapmotion.com/.
[43] "DepthSense company website." [Online]. Available: https://www.sony-depthsensing.com.
[44] T. Simon, H. Joo, I. Matthews, and Y. Sheikh, "Hand Keypoint Detection in Single Images using Multiview Bootstrapping," pp. 1145-1153, 2017.
[45] O. Koller, H. Ney, and R. Bowden, "Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data is Continuous and Weakly Labelled," 2016 IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3793-3802, 2016.
[46] O. Koller, J. Forster, and H. Ney, "Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers," Comput. Vis. Image Underst., vol. 141, pp. 108-125, 2015.
[47] "OpenPose: Real-time multi-person keypoint detection library for body, face, and hands estimation." [Online]. Available: https://github.com/CMU-Perceptual-Computing-Lab/openpose.