<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Implementation of Artificial Intelligence Methods for Virtual Reality Solutions: a Review of the Literature</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rytis Augustauskas</string-name>
          <email>rytis.augustauskas@ktu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aurimas Kudarauskas</string-name>
          <email>aurimas.kudarauskas@ktu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cenker Canbulut</string-name>
          <email>cenker.canbulut@ktu.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Automation, Kaunas University of Technology</institution>
          ,
          <addr-line>Kaunas</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Multimedia Engineering, Kaunas University of Technology</institution>
          ,
          <addr-line>Kaunas</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>68</fpage>
      <lpage>74</lpage>
      <abstract>
        <p>Today, Artificial Intelligence (AI) is widely used in data science and computer vision, and it has proven to be the state-of-the-art approach for classification tasks. One of the tasks that Virtual Reality applications often solve can be specified as object recognition or classification. These types of tasks benefit from the automatic feature detection provided by convolutional neural networks (CNN). This article investigates AI methods for object recognition and skeleton recognition and provides a practical guide with solutions to the given tasks for Virtual Reality.</p>
      </abstract>
      <kwd-group>
        <kwd>CNN</kwd>
        <kwd>Neural network</kwd>
        <kwd>VR</kwd>
        <kwd>AI</kwd>
        <kwd>Image processing</kwd>
        <kwd>object recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>Nowadays, deep learning is a new and rapidly growing topic.
Research done in the AI field provides satisfactory solutions
for object detection, image classification, natural language
processing and many other areas. One of the biggest fields of deep
learning utilization is computer vision. Advanced artificial
intelligence methods can detect objects, understand a person's
movement, and interpret gestures or behavior using RGB and depth
data. Modern sensors, such as the Microsoft Kinect, Leap Motion,
Intel RealSense and others not mentioned here, can help to extract
visual information about the scene context. A machine can be made to
“understand” the scene without additional sensors, using visual data
alone (RGB and a depth map). This is also a more natural way of
interaction for understanding gestures and pose, because no other
input device, such as a joystick, is involved.</p>
      <p>In this paper, we present an overview of the newest research
on deep learning utilization in the Virtual Reality field, covering
data preprocessing, gesture recognition and pose estimation methods
based on neural networks. We used articles from the IEEE database due
to its high article acceptance requirements. The AI theme is very
popular, so we included only articles written over the last two
years, except for a few written within the last three years. The
exception was made due to the impact that those articles made on the
industry.</p>
    </sec>
    <sec id="sec-2">
      <title>Copyright held by the author(s). 68 II.</title>
    </sec>
    <sec id="sec-3">
      <title>II. OVERVIEW</title>
      <p>The following overview of the literature is organized in
sections. Sections II-A to II-C overview generic problems
related to the application of neural networks for problem solving.
Section II-D describes the latest methods of object detection related
to person tracking. Section II-E covers state-of-the-art methods of
pose and hand keypoint estimation and gesture recognition
done by using deep neural networks.</p>
      <sec id="sec-3-1">
        <title>A. Training dataset</title>
        <p>
          When you are working with neural networks, a training dataset
is a must, since the solution relies on the standardization it
provides. Preparing such a dataset can be very labor intensive. The
best tradeoff between the information provided to the algorithm and
the time needed for annotation is the bounding box method [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. It is also possible to minimize labor time by utilizing
internet-generated data, which, however, must be filtered well [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. When the problem does not have strict object classes, it is
possible to use an automatic class generation algorithm to remove the
time needed for database preparation [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. If you are working with specific objects and there is no
large dataset, the neural network can be pretrained on a training
dataset of similar nature [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], as the sketch below illustrates. It is also worth noting
that only larger networks will benefit from more detailed input
data [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
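        <p>
          As an illustration of the pretraining idea in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], below is a minimal fine-tuning sketch in PyTorch: a network
pretrained on a large generic dataset is adapted to a small, specific
one. The dataset path and the number of classes are hypothetical
placeholders, not values from the reviewed papers.
        </p>
        <preformat><![CDATA[
# Minimal transfer-learning sketch (PyTorch). "data/small_dataset" and
# num_classes are hypothetical placeholders.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

num_classes = 5  # hypothetical number of task-specific classes

# Load a CNN pretrained on ImageNet and replace its classifier head.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Freeze the pretrained feature extractor and train only the new
# head, which suits very small datasets.
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("data/small_dataset", transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
]]></preformat>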
      </sec>
      <sec id="sec-3-2">
        <title>B. Data preprocessing</title>
        <p>
          With the wide variety of sensors used for collecting data, it
is hard to obtain normalized data, and different sensors require
different filtering. When working with image sensors, the median
filter works well [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], as do filters based on neural networks [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ][
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. When working with moving depth sensors, you can get a
ghosting effect in the point clouds. Inaccurate data can be filtered
out by utilizing segmentation of the point cloud with a convolutional
neural network [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
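        <p>
          As a simple illustration of the median filtering step, the
sketch below denoises a sensor image with OpenCV; the file name is a
hypothetical placeholder.
        </p>
        <preformat><![CDATA[
# Median filtering sketch (OpenCV): a common denoising step before
# feeding sensor images to a network. "frame.png" is a hypothetical
# placeholder for a captured sensor frame.
import cv2

image = cv2.imread("frame.png")       # BGR image from a sensor
denoised = cv2.medianBlur(image, 5)   # 5x5 median filter kernel
cv2.imwrite("frame_denoised.png", denoised)
]]></preformat>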
        <p>
          When images are used as data, you must compensate for the
differences in object sizes. Usually, dedicated neural networks are
used to generate the regions of a picture that are most likely to
contain an object [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ][
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. When you already have regions of interest (ROI), you can
make a few iterations with different scales over the ROI, so that
small objects can also be extracted [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Another way is to use several neural networks optimized for
different scales [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. It is also important to note that shallow CNNs perform
better with small objects than deep ones, due to the information lost
in convolution layers [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The extraction of the most distinctive features can be
improved by the regularization of a spatial transformation branch and
a Fisher-encoding-based multimodal fusion branch [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Another effective approach to the small-scale object
detection problem is the use of atrous convolutions; these
convolutions adapt to different input sizes and have a constant
output size [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], as the sketch below illustrates.
        </p>
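        <p>
          A minimal sketch of an atrous (dilated) convolution in
PyTorch: dilation enlarges the receptive field without adding
parameters, and with matching padding the feature map size stays
constant. The layer sizes are illustrative assumptions.
        </p>
        <preformat><![CDATA[
# Atrous (dilated) convolution sketch (PyTorch). With padding equal to
# the dilation (3x3 kernel, stride 1), the spatial size is preserved.
import torch
import torch.nn as nn

atrous = nn.Conv2d(in_channels=3, out_channels=16,
                   kernel_size=3, dilation=2, padding=2)

x = torch.randn(1, 3, 128, 128)   # dummy RGB input
y = atrous(x)
print(y.shape)                    # torch.Size([1, 16, 128, 128])
]]></preformat>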
      </sec>
      <sec id="sec-3-3">
        <title>C. Optimization of CNN</title>
        <p>
          Usually, training a convolutional neural network (CNN) takes
a lot of computational power. This can be reduced by restructuring
the layers of the CNN [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. It is also possible to reduce the computational needs of
the algorithm by removing background information from the input
data [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. When the CNN algorithm is optimized towards execution
speed, you should reduce parallel operations and use larger feature
maps, or combine the feature maps of two different convolutional
layers [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ][
          <xref ref-type="bibr" rid="ref18">18</xref>
          ][
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], as the sketch below illustrates. If working with ROI, you
can increase algorithm speed by implementing a cascade filtering
algorithm [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. The search for ROI can also be improved by combining a
convolutional layer map with the edge map of the same image [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Running the same algorithm at different image scales takes
a lot of time. It is possible to use the feature map of one scale and
calculate the feature maps of the other scales from it. This improves
the ability to detect small objects and reduces the required
computation time [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
        </p>
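        <p>
          A minimal sketch of combining the feature maps of two
different convolutional layers, in the spirit of the speed-oriented
designs above: the deeper, coarser map is upsampled and concatenated
with the shallower, finer one. The channel and input sizes are
illustrative assumptions.
        </p>
        <preformat><![CDATA[
# Sketch of fusing feature maps from two convolutional layers
# (PyTorch). The coarse map is upsampled and concatenated with the
# finer one; channel and input sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedFeatures(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, stride=2, padding=1)   # fine map
        self.conv2 = nn.Conv2d(16, 32, 3, stride=2, padding=1)  # coarse map

    def forward(self, x):
        f1 = F.relu(self.conv1(x))                  # (N, 16, H/2, W/2)
        f2 = F.relu(self.conv2(f1))                 # (N, 32, H/4, W/4)
        f2_up = F.interpolate(f2, size=f1.shape[2:], mode="nearest")
        return torch.cat([f1, f2_up], dim=1)        # (N, 48, H/2, W/2)

features = FusedFeatures()(torch.randn(1, 3, 64, 64))
print(features.shape)  # torch.Size([1, 48, 32, 32])
]]></preformat>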
        <p>
          It is possible to minimize the time spent selecting a CNN
architecture by utilizing a performance index calculation method [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>D. Object detection using CNN</title>
        <p>
          Using a Convolutional Neural Network for object recognition,
where the object can be a human, a hand or anything else to be
defined, gives another alternative for solving the object detection
problem. Research made in Madrid proposes a deep learning-based
approach using a CNN in combination with the Long Short-Term Memory
(LSTM) method to recognize skeleton-based human activity and hand
gestures [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. There are many approaches to this problem based on
recognition of the human skeleton [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]; however, the success of deep learning techniques started
around 2012. The proposed research relies on a CNN in combination
with an LSTM. As we know, CNNs are structured to exploit highly
spatially local correlation patterns in images. In this approach, the
CNN focuses on the position patterns of the skeleton joints in 3D
space, and afterwards an LSTM recurrent network is used to capture
spatiotemporal patterns related to the time evolution of the 3D
coordinates of the skeleton joints. The proposed approach arranges
the input data in a three-dimensional block, where one dimension of
the block matches the number of skeleton joints, J.
        </p>
        <p>
          Figure 1 - Example of a full-body skeleton (20 joints) and a
hand skeleton (22 points) [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
        </p>
        <p>
          In the case of Figure 1, the full-body skeleton input data is
composed of 20 joints with T time steps and three spatial
coordinates. In the hand skeleton case, it is 22 joints with T time
steps and three spatial coordinates. At each time step, the block is
shifted by one frame, releasing the oldest frame and including a new
one in an overlapping mode. In this approach, the CNN operates on the
3D joint information together with the temporal dimension (T time
steps) to generate the features detected in the input block. Later,
the LSTM is used to integrate the features detected in consecutive
overlapping blocks, which allows the system to maintain information
beyond the last T time steps. More information about the LSTM can be
found in [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. The combined CNN and LSTM structure used for training is
shown in Figure 2, and an illustrative sketch follows the figure
captions below.
        </p>
        <p>
          Figure 2 - The structure of the network during the
pre-training stage consists of a CNN attached to an LSTM [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
        </p>
        <p>
          Figure 3 - The structure of the network during the final
stage consists of a CNN attached to an LSTM [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
        </p>
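        <p>
          A minimal sketch of the block-based CNN + LSTM idea described
above, written in PyTorch. This is an illustrative approximation, not
the exact architecture from [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]: a small CNN summarizes each block of skeleton coordinates,
and an LSTM integrates the features of consecutive overlapping
blocks. All layer sizes are assumptions.
        </p>
        <preformat><![CDATA[
# CNN + LSTM sketch for skeleton sequences (PyTorch). Illustrative
# only: a CNN summarizes each (3, T, J) block, an LSTM integrates
# consecutive overlapping blocks, and the last state is classified.
import torch
import torch.nn as nn

J, T, NUM_CLASSES = 20, 16, 10  # joints, time steps per block, classes

class SkeletonCNNLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        # Treat a block as a 3-channel (x, y, z) "image" of size T x J.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                   # (N, 32, 1, 1)
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=64,
                            batch_first=True)
        self.classifier = nn.Linear(64, NUM_CLASSES)

    def forward(self, blocks):
        # blocks: (N, num_blocks, 3, T, J) overlapping windows.
        n, b = blocks.shape[:2]
        feats = self.cnn(blocks.flatten(0, 1)).flatten(1)  # (N*b, 32)
        out, _ = self.lstm(feats.view(n, b, -1))           # (N, b, 64)
        return self.classifier(out[:, -1])                 # last block

model = SkeletonCNNLSTM()
dummy = torch.randn(2, 5, 3, T, J)   # 2 sequences, 5 blocks each
print(model(dummy).shape)            # torch.Size([2, 10])
]]></preformat>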
        <p>During the experimental work, 5 different datasets are used:
the MSRDailyActivity3D dataset, the UTKinect-Action3D dataset, the
NTU RGB+D dataset, the Montalbano V2 dataset and the Dynamic Hand
Gesture (DHG-14/28) dataset. Each dataset comes with different
activities to perform, and some include human-object interactions as
well. As each dataset can respond differently to the proposed method,
it is necessary to use recent and widely adopted datasets to test the
given method for validity. The capability of skeleton tracking also
depends on the hardware architecture on which the proposed method is
run; that said, for smooth recognition and overall tracking
performance, the hardware of the computer should be taken into
account. Although this method may require data augmentation, since
the data taken from the datasets might be very small or differ in
terms of color, it gives good results for recognizing the desired
human gestures. As a result, it can be a very good alternative for
recognizing the full human body skeleton and hand gestures using the
CNN with LSTM combination.</p>
        <p>
          The object detection problem can also be solved by an
iterative CNN. The image is split into equal boxes, with the class to
be searched for assigned to each box. Step by step, the bounding
boxes are moved toward the candidate of the class each box should be
bounding. After some iterations, a box ends up bounding the object it
was searching for. This method can increase detection speed five
times compared to Fast R-CNN [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ].
        </p>
        <p>
          CNNs perform well on the image classification problem,
although the object detection problem also requires extracting
spatial information about the object. By introducing feature maps
that also include the spatial location of the feature, it is possible
to increase both the speed and the accuracy of object detection
compared to Fast R-CNN [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ].
        </p>
        </p>
        <p>
          Typically, object recognition is performed on 2D images, but
it is also possible to perform object detection on a 3D point cloud.
You can achieve great accuracy by making three projections of the
object and utilizing three CNNs for classification [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>
          Some objects have a wide variety of features depending on the
viewing angle, and monolithic neural networks have problems detecting
objects of widely diverse categories. The introduction of
subcategories (S-CNN) improves the performance of object recognition
in such situations [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. The combination of an image sensor's feature map and a
depth sensor's feature map can produce impressive results; the
obtained feature maps can be classified by support vector machine or
Mondrian forest algorithms [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>Experiments were performed to test the effectiveness of the
proposed approach in terms of detecting a specific gesture of the
user. The training was performed on a PC with high computational
power: an Intel Xeon E5-1620 v3 server clocked at 3.5 GHz with 16 GB
RAM and a 2015 NVIDIA GeForce GTX TITAN Black GPU with 6 GB of GDDR5
memory and 2880 CUDA cores.</p>
        <p>From Table 1, we can find the mean value of the per-subset
averages to assess the consistency of the proposed approach, using
the mean value function (Equation 1).</p>
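        <p>
          Equation 1 is not reproduced legibly in the source; assuming
it is the arithmetic mean of the n per-subset averages, it reads:
        </p>
        <disp-formula>
          <tex-math><![CDATA[\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i]]></tex-math>
        </disp-formula>
        <p>
          With the recovered Table 1 averages this gives
(93.58 + 95.66 + 87.74) / 3 = 92.3, matching the value reported in
the next section.
        </p>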
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental results</title>
      <p>Table 1 (recovered fragment) - per-column average accuracies:
93.58, 95.66 and 87.74.</p>
      <p>
        By calculating the mean value of the given averages, we get
92.3, which shows the consistency of the proposed approach,
considering the amount of training performed to produce the result.
The authors also show extended data for each action subset, where
each subset contains proposed hand gesture types that are similar. In
this context, you will also see the type of the proposed gesture and
what error occurred during its execution and detection. That is, if
the proposed gesture was pinching and an error occurred during the
execution of this gesture, the gesture detected instead of the
proposed one is recorded. This data can be very useful for
identifying what types of difficulties the proposed approach
struggles with when training on the MSRDailyActivity3D dataset. The
article also provides a secondary protocol where, this time, body
attributes are also recognized, but it only contains a Kinect dataset
compared with the MSRDailyActivity3D dataset. This protocol contains
a mixture of 11 body and hand gesture tracking tasks. As we believe
the research scope was explained better, under precise conditions, in
the first protocol, we keep the second protocol out of our scope, but
it can be investigated within the article itself [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>
        The experiment methodology was executed by creating action
subsets, where each subset contains a group of similar hand gesture
motions. This way, the authors compare the accuracy of the proposed
approach in terms of gesture and body recognition. In this review
paper, we show the overall accuracy of the action subsets that were
created under specific criteria by the authors. Based on these facts,
we can see the average accuracy for each action subset, depending on
the amount of training.
      </p>
      <p>
        Table 2 (recovered fragment) - per-subset accuracies for Wang
et al., 2016 [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], CNN + LSTM [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], Chen et al., 2015 [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] and Du et al., 2015 [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]: AS1 - 93.3, 98.1, 93.3; AS2 - 94.6, 92.0, 94.6; AS3 - 99.1,
94.6, 95.5.
      </p>
      <p>
        Table 2 (continued) - average accuracies: 97.4, 95.7, 94.9 and
94.5; additional compared method: Lillo et al., 2016 [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>Comparison with other methods</title>
      <p>Other methods compared in Table 2: Vemulapalli et al., 2014 [
      <xref ref-type="bibr" rid="ref34">34</xref>
      ], Ben Amor et al., 2016 [
      <xref ref-type="bibr" rid="ref35">35</xref>
      ] and Li et al., 2010 [
      <xref ref-type="bibr" rid="ref36">36</xref>
      ].</p>
      <p>As Table 2 shows, the proposed approach achieves significant
accuracy on the proposed action subsets trained using CNN+LSTM. Some
methodologies identified in the table do not contain the same action
subsets as this research, but they also rely on similar hand gesture
types; the only difference is that the approach to measuring accuracy
is executed differently. We can conclude that the article relies on
solid facts by using well-known datasets to evaluate the approach,
and the presentation of the paper shows the relevance of the output
data. In our view, the given approach can be a very good alternative
for implementing CNN+LSTM for recognizing body and hand gestures. We
highly recommend that researchers consult the original article to
observe the full scale of the methodology. The approach can be
implemented in today's games or engines to improve the effectiveness
of gesture recognition and to develop further applications.</p>
      <sec id="sec-6-1">
        <title>E. Pose estimation and gesture recognition</title>
        <p>
          One of the problems in Virtual and Augmented Reality
applications is person pose estimation and hand gesture recognition.
It can be a challenging task, especially when the environment is
complex. Due to its difficulty, it can even be divided into several
parts: person detection, joint extraction and merging the joints into
the skeleton (pose estimation). Furthermore, hand gestures can be
interpreted, if needed. A few years ago, the pose estimation task was
already possible to perform with the Microsoft Kinect SDK [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ]. It uses RGB and depth camera data to extract the skeleton
and estimate its position in 3D space. In this case, point cloud data
can be very useful in distinguishing the person from the background.
        </p>
        <p>
          Nowadays, there are even more modern approaches to solving
the pose estimation task. With the help of deep neural networks, it
is possible to extract the person and detect joints with RGB camera
data only. Wei et al. [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ] and Cao et al. [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ] proposed an interesting methodology to detect the 2D pose. A
convolutional neural network is utilized to detect the joints of a
person. The method of [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ] is state of the art on the LSP [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ] and FLIC [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ] datasets.
        </p>
        <p>Figure 5 - Convolutional pose machine joint detection
example1</p>
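        <p>
          These methods output one belief map (heatmap) per joint. The
following minimal sketch, assuming such heatmaps as input, extracts
2D joint coordinates by taking each map's maximum; the shapes and the
threshold are illustrative assumptions.
        </p>
        <preformat><![CDATA[
# Joint extraction sketch for heatmap-based pose estimators: each
# joint has a belief map, and its 2D position is the location of the
# map's maximum. Shapes and threshold are illustrative.
import numpy as np

def extract_joints(heatmaps, threshold=0.1):
    """heatmaps: (num_joints, H, W) array of per-joint belief maps."""
    joints = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        joints.append((x, y) if hm[y, x] >= threshold else None)
    return joints

# Dummy example: 15 joints on a 46x46 belief map grid.
maps = np.random.rand(15, 46, 46).astype(np.float32)
print(extract_joints(maps)[:3])
]]></preformat>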
        <p>
          The next year (2017), a further technique was introduced by
the researchers [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ]. Same as the method mentioned previously, it uses a
convolutional neural network to detect joints from an RGB image. The
algorithm is capable of detecting more than one person in an image.
It is a state-of-the-art method in performance and efficiency on the
MPII [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ] Multi-Person dataset, scoring 75.6% accuracy on the whole
testing set. On a laptop with an Nvidia GeForce GTX-1080 GPU, the
algorithm achieves 8.8 fps on a 1080x1920 frame resized to 368x654
with 19 people in it.
        </p>
        <p>Figure 6 - “Realtime Multi-Person 2D Human Pose Estimation
using Part Affinity Fields” pose estimation.</p>
        <p>Another part of even deeper person behavior understanding is
gesture recognition. It is a big part of VR applications, because
hand gestures enable more natural interaction that does not require
any input equipment. Control can be done by interpreting visual
information only.</p>
        <p>
          Gesture recognition in Virtual Reality applications has
already been enabled by companies such as Leap Motion [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ] or SoftKinetic [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ]. The mentioned companies are solving the problem with depth
cameras (Figure 7).
        </p>
        <p>Figure 7 - a) DepthSense DS3252 and b) Leap Motion
sensor3.</p>
        <p>Both of the mentioned sensors can be attached to a headset
or work as separate devices (Figure 8).</p>
        <p>Figure 8 - a) standalone sensor mode4, b) sensor attached to
headset5.</p>
        <p>For hand and gesture recognition, the algorithms use depth
data. Because of this technology, the sensors depend on the object's
distance from the camera and have a short working range, e.g., the
SoftKinetic DS325 working range is 0.15-1 m.</p>
        <p>
          However, hand detection and gesture recognition can be done
with different techniques and data. Recently, different methods [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ], [
          <xref ref-type="bibr" rid="ref45">45</xref>
          ] have been introduced. In one study [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ], an algorithm to detect hand joints from RGB data is
proposed. It uses convolutional neural networks for hand pose
detection. The method can run in real time on a GPU, and its accuracy
is as high as that of other methods that use a depth sensor for the
task. Furthermore, it can produce a 3D hand pose estimation by
triangulating the feeds from cameras at different viewpoints (Fig.
9); a minimal triangulation sketch follows the figure caption.
        </p>
        <p>Figure 9 - Hand pose estimation generated by RGB data from
different viewpoints6</p>
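        <p>
          A minimal sketch of the triangulation step mentioned above,
assuming two calibrated cameras with known 3x4 projection matrices
and matched 2D detections of the same joint; all numeric values are
illustrative assumptions.
        </p>
        <preformat><![CDATA[
# Triangulation sketch (OpenCV): recover a 3D joint from matched 2D
# detections in two calibrated views. P1 and P2 are assumed known
# 3x4 projection matrices; the 2D points are illustrative.
import cv2
import numpy as np

P1 = np.hstack([np.eye(3), np.zeros((3, 1))])               # camera 1
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # camera 2

pts1 = np.array([[320.0], [240.0]])   # joint seen by camera 1
pts2 = np.array([[300.0], [240.0]])   # same joint seen by camera 2

point_h = cv2.triangulatePoints(P1, P2, pts1, pts2)  # homogeneous 4x1
point_3d = (point_h[:3] / point_h[3]).ravel()
print(point_3d)
]]></preformat>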
        <p>
          Another proposed method [
          <xref ref-type="bibr" rid="ref45">45</xref>
          ] not only detects the hand, but also recognizes hand
gestures. The research uses Danish and New Zealand sign language data
from the RWTH-PHOENIX-Weather 2014 [
          <xref ref-type="bibr" rid="ref46">46</xref>
          ] dataset for CNN training. It can achieve more than 100 fps
on a single Nvidia GeForce 980 GTX.
        </p>
        <p>Figure 10 - A small part of the RWTH-PHOENIX-Weather 2014
sign language dataset</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>2 https://www.leapmotion.com/</title>
      <p>3 https://www.leapmotion.com/
4 https://www.rockpapershotgun.com/tag/leap-motion-3d-jam/
5 https://www.vrheads.com/how-use-leap-motion-your-oculus-rift
6 https://www.youtube.com/watch?v=q4xbmEQp3VE
Figure 10 - Small part of RWTH-PHOENIX-Weather 2014
signs language dataset</p>
      <p>
        Real implementations of the reviewed methods can be found. The
OpenPose project [
        <xref ref-type="bibr" rid="ref47">47</xref>
        ] utilizes the pose estimation [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ],
[
        <xref ref-type="bibr" rid="ref39">39</xref>
        ] and the hand and finger joint detection [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ] methods.
      </p>
      <p>Large interest in Artificial Intelligence keeps the scientific
community at a high research pace. We have overviewed the latest
achievements in the field of AI applications for virtual reality
technologies and provided a systematic classification of research
papers and their contributions to the field. We have also separated
general advancements of CNN algorithms from advancements directly
related to VR, to provide a search tool for relevant literature. A
bounding-box-labeled training set is optimal in the tradeoff between
the time needed for preparation and the training performance. Also,
if you have a small dataset, we would recommend using the
pre-training technique. Depending on your technical capabilities, we
highly recommend implementing one of the techniques addressing
different object scales.</p>
      <p>
        For pose estimation and gesture recognition, several approaches
using only RGB or depth camera data were introduced. Due to the
necessity of having a depth camera, the inherent noise associated
with direct lighting conditions and the depth camera's measuring
range limitations, AI methods that use only RGB data might be one of
the best approaches. From the given examples, it can be seen that
AI-based techniques [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ][
        <xref ref-type="bibr" rid="ref39">39</xref>
        ][
        <xref ref-type="bibr" rid="ref44">44</xref>
        ][
        <xref ref-type="bibr" rid="ref46">46</xref>
        ] perform accurately. The drawback of the mentioned methods is
that they need a high-end GPU to produce more frames per second.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kwak</surname>
          </string-name>
          , and B. Han, “
          <article-title>Weakly Supervised Learning with Deep Convolutional Neural Networks for Semantic Segmentation: Understanding Semantic Layout of Images with Minimum Human Supervision,” IEEE Signal Process</article-title>
          . Mag., vol.
          <volume>34</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>49</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Fan</surname>
          </string-name>
          , “
          <article-title>Fine-grained image recognition via weakly supervised click data guided bilinear CNN model</article-title>
          ,
          <source>” Proc. - IEEE Int. Conf. Multimed. Expo</source>
          , no.
          <source>July</source>
          , pp.
          <fpage>661</fpage>
          -
          <lpage>666</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Member</surname>
          </string-name>
          , “
          <article-title>CNN - Based Joint Clustering and Representation Learning with Feature Drift Compensation for Large - Scale Image Data</article-title>
          ,” vol.
          <volume>20</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>421</fpage>
          -
          <lpage>429</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Jin</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Liang</surname>
          </string-name>
          , “
          <article-title>Deep learning for underwater image recognition in small sample size situations</article-title>
          ,” Ocean. 2017 - Aberdeen, no.
          <issue>61379007</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Valdenegro-Toro</surname>
          </string-name>
          , “
          <article-title>Best Practices in Convolutional Networks for Forward-Looking Sonar Image Recognition</article-title>
          ,”
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Zuo,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gu</surname>
          </string-name>
          , and L. Zhang, “
          <article-title>Learning Deep CNN Denoiser Prior for Image Restoration</article-title>
          ,” pp.
          <fpage>3929</fpage>
          -
          <lpage>3938</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Gomez-Donoso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcia-Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Garcia-Rodriguez</surname>
          </string-name>
          , S. OrtsEscolano, and M. Cazorla, “
          <article-title>LonchaNet: A sliced-based CNN architecture for real-time 3D object recognition</article-title>
          ,
          <source>” Proc. Int. Jt. Conf. Neural Networks</source>
          , vol. 2017-May, pp.
          <fpage>412</fpage>
          -
          <lpage>418</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Beritelli</surname>
          </string-name>
          , G. Capizzi,
          <string-name>
            <given-names>G. Lo</given-names>
            <surname>Sciuto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Scaglione</surname>
          </string-name>
          ,
          <article-title>"Automatic heart activity diagnosis based on gram polynomials and probabilistic neural networks"</article-title>
          .
          <source>Biomedical Engineering Letters</source>
          , vol.
          <volume>8</volume>
          , issue 1, pp.
          <fpage>77</fpage>
          -
          <lpage>85</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Men</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          , “G-CNN:
          <article-title>Object Detection via Grid Convolutional Neural Network,” IEEE Access</article-title>
          , vol.
          <volume>5</volume>
          , pp.
          <fpage>24023</fpage>
          -
          <lpage>24031</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Fan</surname>
          </string-name>
          , “S-CNN:
          <article-title>Subcategory-aware convolutional networks for object detection</article-title>
          ,
          <source>” IEEE Trans. Pattern Anal. Mach</source>
          . Intell., vol.
          <volume>8828</volume>
          , no. c, pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          et al.,
          <article-title>“Scale optimization for full-image-CNN vehicle detection</article-title>
          ,
          <source>” IEEE Intell. Veh. Symp. Proc.</source>
          , pp.
          <fpage>785</fpage>
          -
          <lpage>791</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Nagy</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Benedek</surname>
          </string-name>
          , “
          <article-title>3D CNN based phantom object removing from mobile laser scanning data</article-title>
          ,
          <source>” Proc. Int. Jt. Conf. Neural Networks</source>
          , vol. 2017-May, pp.
          <fpage>4429</fpage>
          -
          <lpage>4435</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Eggert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brehm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Winschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zecha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Lienhart</surname>
          </string-name>
          , “
          <article-title>A closer look: Small object detection in faster R-CNN,”</article-title>
          <source>Proc. - IEEE Int. Conf. Multimed. Expo</source>
          , vol.
          <volume>0</volume>
          , pp.
          <fpage>421</fpage>
          -
          <lpage>426</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Men</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , “PBG-Net:
          <article-title>Object detection with a multi-feature and iterative CNN model</article-title>
          ,”
          <source>2017 IEEE Int. Conf. Multimed. Expo Work. ICMEW</source>
          <year>2017</year>
          , no.
          <source>July</source>
          , pp.
          <fpage>453</fpage>
          -
          <lpage>458</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Le-Le</given-names>
            <surname>Wang</surname>
          </string-name>
          and L. G.-D., “
          <article-title>Research on Relief Effect of Image Based on the 5 Dimension CNN</article-title>
          .,”
          <year>2017</year>
          , pp.
          <fpage>416</fpage>
          -
          <lpage>418</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Termritthikun</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Kanprachar</surname>
          </string-name>
          , “
          <article-title>Accuracy improvement of Thai food image recognition using deep convolutional neural networks</article-title>
          ,
          <source>” 2017 Int. Electr. Eng. Congr.</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Velotto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Tings</surname>
          </string-name>
          , “
          <article-title>Ship Classification in TerraSAR-X Images With Convolutional Neural Networks,”</article-title>
          <source>IEEE J. Ocean. Eng.</source>
          , vol.
          <volume>43</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>258</fpage>
          -
          <lpage>266</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>U.</given-names>
            <surname>Asif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bennamoun</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Sohel</surname>
          </string-name>
          , “
          <article-title>A Multi-modal, Discriminative and Spatially Invariant CNN for RGB-D Object Labeling</article-title>
          ,” IEEE T. Patt. Anal. Mach. Intell., vol.
          <volume>8828</volume>
          , no. c,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <surname>Z. Zhang,</surname>
          </string-name>
          “
          <article-title>An improved faster R-CNN for same object retrieval,” IEEE Access</article-title>
          , vol.
          <volume>5</volume>
          , no.
          <issue>8</issue>
          , pp.
          <fpage>13665</fpage>
          -
          <lpage>13676</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Guan</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , “
          <article-title>Atrous Faster R-CNN for Small Scale Object Detection,” 2017 2nd Int</article-title>
          . Conf. Multimed. Image Process., pp.
          <fpage>16</fpage>
          -
          <lpage>21</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Waris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Iosifidis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Gabbouj</surname>
          </string-name>
          , “
          <article-title>CNN-based edge filtering for object proposals</article-title>
          ,
          <source>” Neurocomputing</source>
          , vol.
          <volume>266</volume>
          , pp.
          <fpage>631</fpage>
          -
          <lpage>640</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>D.</given-names>
            <surname>Anisimov</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Khanova</surname>
          </string-name>
          , “
          <article-title>Towards lightweight convolutional neural networks for object detection</article-title>
          ,
          <source>” 2017 14th IEEE Int. Conf. Adv. Video Signal Based Surveill., no. August</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Núñez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cabido</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Pantrigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Montemayor</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Vélez</surname>
          </string-name>
          , “
          <article-title>Convolutional Neural Networks and Long Short-Term Memory for skeleton-based human activity and hand gesture recognition,” Pattern Recognit</article-title>
          ., vol.
          <volume>76</volume>
          , pp.
          <fpage>80</fpage>
          -
          <lpage>94</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>L. Lo</given-names>
            <surname>Presti and M. La Cascia</surname>
          </string-name>
          , “
          <article-title>3D skeleton-based human action classification: A survey,” Pattern Recognit</article-title>
          ., vol.
          <volume>53</volume>
          , pp.
          <fpage>130</fpage>
          -
          <lpage>147</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bonanno</surname>
          </string-name>
          , G. Capizzi,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Laudani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Lo Sciuto</surname>
          </string-name>
          ,
          <article-title>"Optimal thicknesses determination in a multilayer structure to improve the SPP efficiency for photovoltaic devices by an hybrid FEM-cascade neural network based approach"</article-title>
          . in
          <source>International Symposium on Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM)</source>
          , pp.
          <fpage>355</fpage>
          -
          <lpage>362</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>F.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Choi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          , “
          <article-title>Exploit All the Layers: Fast and Accurate CNN Object Detector with Scale Dependent Pooling</article-title>
          and Cascaded Rejection Classifiers,”
          <source>2016 IEEE Conf. Comput. Vis. Pattern Recognit</source>
          ., pp.
          <fpage>2129</fpage>
          -
          <lpage>2137</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M.</given-names>
            <surname>Najibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rastegari</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Davis</surname>
          </string-name>
          , “
          <article-title>G-CNN: an Iterative Grid Based Object Detector</article-title>
          ,”
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          , “
          <article-title>Recurrent Scale Approximation for Object Detection in CNN</article-title>
          ,” vol.
          <volume>1</volume>
          , pp.
          <fpage>571</fpage>
          -
          <lpage>579</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Yuille</surname>
          </string-name>
          , “
          <article-title>Mining 3D Key-Pose-Motifs for Action Recognition</article-title>
          ,”
          <source>2016 IEEE Conf. Comput. Vis. Pattern Recognit</source>
          ., pp.
          <fpage>2639</fpage>
          -
          <lpage>2647</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jafari</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Kehtarnavaz</surname>
          </string-name>
          , “
          <article-title>Action recognition from depth sequences using depth motion maps-based local binary patterns</article-title>
          ,
          <source>” Proc. - 2015 IEEE Winter Conf. Appl. Comput. Vision</source>
          ,
          <string-name>
            <surname>WACV</surname>
          </string-name>
          <year>2015</year>
          , pp.
          <fpage>1092</fpage>
          -
          <lpage>1099</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          , “
          <source>Hierarchial Recurrent Neural Network for Skeleton Based Action Recognition,” IEEE Conf. Comput. Vis. Pattern Recognit</source>
          ., pp.
          <fpage>1110</fpage>
          -
          <lpage>1118</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>L.</given-names>
            <surname>Tao</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Vidal</surname>
          </string-name>
          , “
          <article-title>Moving Poselets: A Discriminative and Interpretable Skeletal Motion Representation for Action Recognition,”</article-title>
          <source>Proc. IEEE Int. Conf. Comput. Vis.</source>
          , vol.
          <source>2016- February</source>
          , pp.
          <fpage>303</fpage>
          -
          <lpage>311</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>I.</given-names>
            <surname>Lillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Niebles</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Soto</surname>
          </string-name>
          , “
          <article-title>A Hierarchical Pose-Based Approach to Complex Action Understanding Using Dictionaries of Actionlets</article-title>
          and Motion Poselets,” pp.
          <fpage>1981</fpage>
          -
          <lpage>1990</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>R.</given-names>
            <surname>Vemulapalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Arrate</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Chellappa</surname>
          </string-name>
          , “
          <article-title>Human action recognition by representing 3D skeletons as points in a Lie group</article-title>
          ,”
          <source>Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.</source>
          , pp.
          <fpage>588</fpage>
          -
          <lpage>595</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ben Amor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , “
          <article-title>Action Recognition Using Rate-Invariant Analysis of Skeletal Shape Trajectories</article-title>
          ,”
          <source>IEEE Trans. Pattern Anal. Mach. Intell.</source>
          , vol.
          <volume>38</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          , “
          <article-title>Action Recognition Based on a Bag of 3D Points</article-title>
          ,”
          <source>Comput. Vis. Pattern Recognit. Work. (CVPRW), 2010 IEEE Comput. Soc. Conf.</source>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>14</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <article-title>“Microsoft Kinect SDK</article-title>
          ,”
          <year>2018</year>
          . [Online]. Available: https://www.microsoft.com/en-us/download/details.aspx?id=44561.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>S.-E.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramakrishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kanade</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheikh</surname>
          </string-name>
          , “
          <article-title>Convolutional Pose Machines</article-title>
          .”
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Simon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-E.</given-names>
            <surname>Wei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheikh</surname>
          </string-name>
          , “
          <article-title>Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields</article-title>
          ,”
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sapp</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Taskar</surname>
          </string-name>
          , “
          <article-title>MODEC: Multimodal Decomposable Models for Human Pose Estimation</article-title>
          ,”
          <source>Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.</source>
          , pp.
          <fpage>3674</fpage>
          -
          <lpage>3681</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>M.</given-names>
            <surname>Andriluka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pishchulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gehler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiele</surname>
          </string-name>
          , “
          <article-title>2D human pose estimation: New benchmark and state of the art analysis</article-title>
          ,”
          <source>Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.</source>
          , pp.
          <fpage>3686</fpage>
          -
          <lpage>3693</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <article-title>“Leap Motion product website</article-title>
          .” [Online]. Available: https://www.leapmotion.com/.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>[43] “Depthsense company website.” [Online]. Available: https://www.sony-depthsensing.com.</mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>T.</given-names>
            <surname>Simon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Joo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Matthews</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheikh</surname>
          </string-name>
          , “
          <article-title>Hand Keypoint Detection in Single Images using Multiview Bootstrapping</article-title>
          ,” pp.
          <fpage>1145</fpage>
          -
          <lpage>1153</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>O.</given-names>
            <surname>Koller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ney</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Bowden</surname>
          </string-name>
          , “
          <article-title>Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data is Continuous and Weakly Labelled</article-title>
          ,”
          <source>2016 IEEE Conf. Comput. Vis. Pattern Recognit</source>
          ., pp.
          <fpage>3793</fpage>
          -
          <lpage>3802</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>O.</given-names>
            <surname>Koller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Forster</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Ney</surname>
          </string-name>
          , “
          <article-title>Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers</article-title>
          ,”
          <source>Comput. Vis. Image Underst.</source>
          , vol.
          <volume>141</volume>
          , pp.
          <fpage>108</fpage>
          -
          <lpage>125</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <article-title>“OpenPose: Real-time multi-person keypoint detection library for body, face, and hands estimation</article-title>
          .” [Online]. Available: https://github.com/CMU-Perceptual-Computing-Lab/openpose.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>