<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Lviv, Ukraine, November</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>3D Reconstruction of 2D Sign Language Dictionaries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roman Riazantsev</string-name>
          <email>riazantsev@ucu.edu.ua</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maksym Davydov</string-name>
          <email>maks.davydov@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ADVA Soft</institution>
          ,
          <addr-line>Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ukrainian Catholic University</institution>
          ,
          <addr-line>Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>1</volume>
      <fpage>5</fpage>
      <lpage>16</lpage>
      <abstract>
        <p>In this paper, we review different approaches to hand pose estimation and 3D reconstruction from a single RGB camera for converting 2D sign language dictionaries into animated 3D models. Unlike many other works aimed at real-time or near real-time translation, we focus on the quality of conversion given a large video dictionary as input. Several approaches to training and validation are considered: pose reconstruction through depth estimation, training and validation with synthetic data, and training and validation with multiple views. In addition, the work reviews various end-to-end algorithms for key point detection trained on labeled data. Based on the results of the studied models, an outline of a possible solution to the 3D reconstruction task is proposed.</p>
      </abstract>
      <kwd-group>
        <kwd>Hand pose</kwd>
        <kwd>Convolutional Neural Network (CNN)</kwd>
        <kwd>Sign Language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Today virtual and augmented reality technologies (AR/VR) are becoming more and
more popular. This trend creates a high demand for 3D image data processing, which
applies to many areas. We focus our research on the conversion of available 2D sign
language content into 3D. Our goal is to improve the quality of 3D reconstruction for
sign language video lessons. Sign language video dictionaries are widely available,
and a reliable method for converting them into 3D would create in-demand content for
AR and VR applications. People who want to learn sign language often see only
the front view of the hands provided in 2D dictionaries. However, views from other angles
carry value, as they reveal the nuances that distinguish similar words.</p>
      <p>We aim specifically at reconstructing poses from sign language videos in order to
create educational content in the future. Almost all other methods aim at
solving the problem in general, whereas we propose a solution for a specific subtask, namely
the reconstruction of sign language videos for further use in AR/VR applications.</p>
      <p>The task of pose reconstruction from a video is nontrivial and is not fully solved at
the moment. The computational problems are related to blurred frames, which are caused
by fast movement, and to complex hand poses with hand parts overlapping along
the z-axis. 3D reconstruction is often performed with depth sensors, but
far more 2D data is available, which can potentially be mapped into 3D.
In addition, an RGB camera is a more common sensor, which can be used to record new
data.</p>
      <p>
        Various datasets can be used for training and testing hand pose
reconstruction models. There are many datasets with depth-camera input and 3D key points [
        <xref ref-type="bibr" rid="ref1 ref14 ref15 ref16 ref17 ref18 ref2 ref3">1-3,
14-18</xref>
        ], and somewhat fewer datasets with 3D points and a single RGB camera [
        <xref ref-type="bibr" rid="ref10 ref11 ref13 ref19 ref20 ref4">4, 10, 11, 13,
19, 20</xref>
        ]. The lack of multi-view Ukrainian Sign Language data prompts us to create a
new dataset. Existing methods of hand pose reconstruction are reviewed in Section 2.
Section 3 outlines the proposed approach.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Background and Related Work</title>
      <sec id="sec-2-1">
        <title>Background Overview</title>
        <p>
          The task of determining the position of an object in space is not new. Over the past 20
years, a large number of works have been aimed at solving this problem [
          <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
          ]. Much
has changed with the advent of depth sensors and neural networks. These technologies
introduced new approaches to comprehensive scene analysis. Depth cameras measure
the distance to an object, which allows more accurate 3D models to be
reconstructed, while neural networks capture complex correlations in image
patterns. Since 2012, neural networks have outperformed most classical methods
in segmentation and classification problems. Many methods combine
depth-camera output with a neural network for 3D reconstruction of the body
position [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ]. These technologies are also widely applied to hands. Researchers
often use a combination of depth sensors and gloves, which record the 3D
position of the hand. Several sensors are used together to collect fully labeled training samples
for 3D reconstruction, which may include depth maps, joint angles, and 3D positions
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Related Works</title>
        <p>Most methods for 3D hand pose generation from a single RGB image can be
generalized into four stages (see Fig. 1). The first stage is the detection of hands in the input image
and cropping of the localized area, the second is the detection of hand key points in 2D, the third
is the mapping of 2D locations into 3D, and the fourth is the generation of a 3D hand model.</p>
        <p>
          Paper [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] introduces a three-stage algorithm that localizes the hands and determines
the key points in 2D in the first two stages, and computes the 3D reconstruction in the
third. The first stage is the YOLO (You Only Look Once) neural network,
which identifies the position of the hands; this part of the image is then cropped,
and the cropped sub-images are passed to the OpenPose detector. Together, these two neural networks
localize 21 2D key points in the video, which are then used as the target in an inverse
kinematics optimization problem. A distinct drawback of this method is its dependence
on the accuracy of the OpenPose detector: detection errors cause the algorithm to
optimize 3D locations against wrong 2D key points. Nevertheless, adding a hand
position from a different view makes it possible to better constrain the optimization problem,
and hence to improve the accuracy. The runtime of the method on an Nvidia GTX 1070 GPU is close
to 53 ms.
        </p>
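        <p>To illustrate how 2D key points can serve as a target for fitting a 3D model, the following minimal sketch fits only the global translation of a rigid 3D template to detected 2D key points under a pinhole camera. This is a drastically simplified stand-in for full inverse kinematics and is not the optimization used in the cited paper; the focal length and all function names are assumptions made for the example.</p>
        <preformat>
```python
# Simplified stand-in for the inverse-kinematics stage: only the global
# translation of a rigid 3D template is fitted to detected 2D key points.
# FOCAL and all function names are illustrative assumptions.

FOCAL = 500.0  # assumed pinhole focal length in pixels


def project(point, focal=FOCAL):
    """Pinhole projection of a 3D point (x, y, z) to image coordinates."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def solve_3x3(m, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    a = [row[:] + [r] for row, r in zip(m, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= factor * a[col][c]
    sol = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s = a[r][3] - sum(a[r][c] * sol[c] for c in range(r + 1, 3))
        sol[r] = s / a[r][r]
    return sol


def fit_translation(template, keypoints_2d, focal=FOCAL):
    """Least-squares translation t so that project(template + t) matches 2D.

    u * (z + tz) = focal * (x + tx) is linear in t, so each key point
    contributes two linear equations; the normal equations are solved.
    """
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for (x, y, z), (u, v) in zip(template, keypoints_2d):
        rows = [((focal, 0.0, -u), u * z - focal * x),
                ((0.0, focal, -v), v * z - focal * y)]
        for coeffs, b in rows:
            for i in range(3):
                atb[i] += coeffs[i] * b
                for j in range(3):
                    ata[i][j] += coeffs[i] * coeffs[j]
    return solve_3x3(ata, atb)
```
        </preformat>
        <p>Because the fit trusts the 2D key points completely, any detector error propagates directly into the recovered 3D position, which is the drawback noted above.</p>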
        <p>
          Publication [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] describes one of the few methods that fully reconstruct the 3D
shape of the hand. It introduces a graph convolutional neural network (CNN) for
generating a 3D mesh [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. This work uses centered images of hands as input, so hand
detection is not necessary. The first part of the approach is therefore 2D key point
detection, which is based on Stacked Hourglass Networks. The second part is the
encoding of 2D features, and the third is 3D reconstruction using the graph CNN. The
network outperforms state-of-the-art methods on the RHD [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and STB [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] datasets.
The runtime of the method on an Nvidia GTX 1080 GPU is 19.9 ms on average. The
pretrained model is available, but the training dataset is not provided.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Datasets Review</title>
        <p>
          We examined several datasets and selected the most suitable ones for our task. A large portion
of the datasets for 3D reconstruction contain depth maps and key points, but no RGB images:
NYU, ICVL, MSRA15, BigHand2.2M, SynHand5M, FHAD, MSRC (FingerPaint),
HandNet, Hands in Action, MSRA14 [
          <xref ref-type="bibr" rid="ref1 ref14 ref15 ref16 ref17 ref18 ref2 ref3 ref4">1-4, 14-18</xref>
          ]. For the problem of reconstruction
from a single image, the most appropriate datasets are those featuring both RGB records
and key points: FreiHAND, GANerated Hands, EgoDexter, SynthHands, STB,
Dexter+Object, UCI-EGO, MHP [
          <xref ref-type="bibr" rid="ref10 ref11 ref13 ref19 ref20 ref4">4, 10, 11, 13, 19, 20</xref>
          ]. A possible complication of
combining different datasets is that the number of key points, record types, and camera
parameters may not match. From the available variety of datasets, we have selected
only those with a centered hand and 21 labeled key points.
        </p>
        <p>
          The FreiHAND dataset is designed for hand pose estimation from a single
image. It contains shots with 4 different backgrounds annotated with 21 key
points in 2D and 3D. There are 130,240 training samples, that is, 32,560 images
per background [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>
          The GANerated Hands dataset contains 330,000 examples annotated with 21 key
points in 2D and 3D. The downside of this dataset is that the images are
synthetically generated and have distorted hand edges. All of them are recorded from one
viewpoint [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          SynthHands is a synthetic dataset that provides 63,530 frames
recorded from 5 views. The training examples contain both RGB and depth records and
cover hands with and without object interaction. The data are annotated with 21 points in
3D space. The hands were generated using the Unity3D engine but animated using data
captured from real motion [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>At present there is no method for accurate 3D reconstruction of 2D sign
language video content. We aim to solve this problem by introducing into the 3D
reconstruction pipeline a neural network trained on a multi-view dataset.
        </p>
        <p>Statement: using a neural network trained to make projections onto several planes
improves the quality of sign language 3D reconstruction from a video sequence.</p>
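        <p>The idea of supervising a 3D estimate through its projections onto several planes can be sketched as follows. This is only an illustration: the orthographic, axis-aligned views and all names are assumptions made for brevity, whereas real recordings would use calibrated camera projections.</p>
        <preformat>
```python
# Sketch of a multi-view reprojection loss: a 3D estimate is penalised by
# how far its projections fall from the 2D annotations in each view.
# Orthographic projections onto axis-aligned planes are an assumption made
# for brevity; calibrated cameras would use their own projection matrices.

VIEWS = {
    "front": (0, 1),  # keep x, y (drop depth z)
    "side": (2, 1),   # keep z, y
    "top": (0, 2),    # keep x, z
}


def project_to_plane(points_3d, axes):
    """Orthographic projection: keep the two listed coordinate axes."""
    return [(p[axes[0]], p[axes[1]]) for p in points_3d]


def multiview_loss(pred_3d, targets_2d):
    """Mean squared 2D distance over all annotated views and key points."""
    total, count = 0.0, 0
    for view, target in targets_2d.items():
        projected = project_to_plane(pred_3d, VIEWS[view])
        for (pu, pv), (tu, tv) in zip(projected, target):
            total += (pu - tu) ** 2 + (pv - tv) ** 2
            count += 1
    return total / count


# A depth error is invisible in the front view but penalised by the others.
truth = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)]
wrong_depth = [(0.0, 0.0, 0.0), (1.0, 2.0, 5.0)]
targets = {name: project_to_plane(truth, axes) for name, axes in VIEWS.items()}
front_only = multiview_loss(wrong_depth, {"front": targets["front"]})
all_views = multiview_loss(wrong_depth, targets)
```
        </preformat>
        <p>In this toy case the front-view loss is zero despite the wrong depth, while the loss over all three views is positive, which is the motivation for training with several views.</p>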
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Tentative Outline of the Thesis</title>
      <sec id="sec-3-1">
        <title>Methods Overview</title>
        <p>We plan to use the scheme of 3D hand pose estimation shown in Fig. 2. To improve
the performance of sign language 3D reconstruction, we are going to test various
methods for calculating intermediate results such as 2D and 3D points. We also plan to
capture a new dataset to improve the accuracy of sign language 3D reconstruction.</p>
        <p>We are considering several approaches to address the problem. We propose two ways
to redesign the second and third stages of the computational pipeline (Fig. 2).
The first solution is to introduce a pair of networks, which will estimate the key points
in 2D and 3D. As the second method, we propose to calculate not the points themselves but
the transformations of points in space with a pair of CNNs. Both methods use concatenated
information about the previous and current frames as input, namely the locations of the 2D points in the
last frame and the RGB data for the two frames. Therefore, the depth of the input is 27
(21+3+3).</p>
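        <p>The 27-channel input can be assembled as in the sketch below. Encoding each previous-frame key point as a Gaussian heatmap channel is an assumption made for the example; the text only states that the 2D locations of the last frame and the RGB data of two frames are concatenated. All function names are hypothetical.</p>
        <preformat>
```python
# Sketch of assembling the 27-channel network input described above:
# 21 key point channels from the previous frame plus RGB of two frames.
# Encoding each key point as a Gaussian heatmap is an assumption; the text
# only states that 2D locations of the last frame are part of the input.
import math

NUM_KEYPOINTS = 21


def heatmap(center, height, width, sigma=2.0):
    """Single-channel Gaussian bump centred on a 2D key point (x, y)."""
    cx, cy = center
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)] for y in range(height)]


def build_input(prev_keypoints, prev_rgb, curr_rgb):
    """Stack 21 heatmaps + 3 + 3 colour channels into a channel-first list."""
    height, width = len(prev_rgb[0]), len(prev_rgb[0][0])
    channels = [heatmap(kp, height, width) for kp in prev_keypoints]
    channels += prev_rgb + curr_rgb  # each is a list of 3 HxW planes
    return channels


# Tiny 8x8 example with blank frames and key points laid out on a grid.
blank = [[[0.0] * 8 for _ in range(8)] for _ in range(3)]
keypoints = [(i % 8, i // 8) for i in range(NUM_KEYPOINTS)]
net_input = build_input(keypoints, blank, blank)
```
        </preformat>
        <p>The resulting structure has 27 channels of identical spatial size, matching the stated input depth of 21+3+3.</p>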
        <p>
          The first method. As a basis, we take the first method [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] described in
Section 2.2. The pre-trained YOLO neural network will be used for hand
localization. We will introduce two networks, A and B, to compute the locations of 2D points
and to estimate hand key points in three-dimensional space, respectively. Using two
connected networks makes it possible to add skip connections and to increase the
complexity of the patterns extracted in the third stage by connecting the hidden layers
of the two networks.
        </p>
        <p>Let kc denote a convolutional layer with k filters and stride 1, kd - a convolutional
layer with k filters and stride 2, kr - a residual block with k filters, ku - a transposed
convolutional layer with k filters and stride 2, and kfc - a fully connected layer with k neurons.
ReLU is the activation function on all layers except the last one, which uses a sigmoid activation.
All layers have a kernel size of 3 and zero padding of size 1. The architecture
of network A is then: 64c, 128d, 256d, 256c, 256r, 256r, 256r, 256r, 256c, 256u, 128u,
64c, 21c; and the architecture of network B is: 64c, 128d, 256d, 256c, 256r, 256r, 256r,
256d, 128c, 64c, 32c, 256fc, 256fc, 21*3fc. The outputs of the fourth, fifth, and sixth
layers of network A are concatenated with the corresponding outputs of network B. These skip
connections allow the second network to use encoded information about the RGB image
when estimating the 3D points. The Adam optimizer will be used to minimize the
difference between the labeled and predicted 2D key points for network A, and
to minimize the difference between the projections of the predicted 3D points and
the known locations of their projections in different views for network B (see Fig. 3).</p>
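        <p>The layer notation above can be checked mechanically. The sketch below parses the two layer strings and traces the spatial size of the feature maps. It assumes a square input, an output padding of 1 in the transposed convolutions (so that they exactly double the resolution), and an illustrative input size of 256; none of these values is stated in the text, and the helper name is hypothetical.</p>
        <preformat>
```python
# Sketch: trace feature-map sizes through the layer strings given above.
# Assumes a square input, kernel 3, padding 1, and output_padding 1 for the
# transposed convolutions (so that they exactly double the resolution);
# the input size of 256 is illustrative, not taken from the text.
import re

NET_A = ("64c, 128d, 256d, 256c, 256r, 256r, 256r, 256r, 256c, 256u, 128u, "
         "64c, 21c")
NET_B = ("64c, 128d, 256d, 256c, 256r, 256r, 256r, 256d, 128c, 64c, 32c, "
         "256fc, 256fc, 21*3fc")


def trace(spec, size):
    """Return a list of (layer, channels, spatial_size) for each layer."""
    shapes = []
    for token in spec.split(","):
        token = token.strip()
        width, kind = re.fullmatch(r"([\d*]+)(fc|[cdru])", token).groups()
        channels = 1
        for factor in width.split("*"):  # handles "21*3"
            channels *= int(factor)
        if kind == "d":      # stride-2 convolution
            size = (size + 2 - 3) // 2 + 1
        elif kind == "u":    # stride-2 transposed convolution
            size = (size - 1) * 2 - 2 + 3 + 1
        elif kind == "fc":   # fully connected: spatial dimensions collapse
            size = 1
        shapes.append((token, channels, size))
    return shapes


shapes_a = trace(NET_A, 256)
shapes_b = trace(NET_B, 256)
```
        </preformat>
        <p>Under these assumptions network A returns to the input resolution with 21 output channels, one per key point, while network B reduces the resolution by a factor of 8 before the fully connected layers produce 63 = 21*3 values.</p>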
        <p>The second method. The second method is a modified version of the first one. We
change the architecture of networks A and B to approximate the transformation
matrices of points between frames rather than the entire 3D model. The architecture
described below calculates 21 transformation matrices. For most frames, fewer matrices
can be used to describe hand motion. We plan to train another CNN to handle these
cases.</p>
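        <p>As an illustration of describing motion by transformations rather than by absolute positions, the sketch below moves each previous-frame 3D key point with its own matrix. Treating each matrix as a 4x4 homogeneous rigid transform is an assumption; the text does not fix the parameterisation, and the helper names are hypothetical.</p>
        <preformat>
```python
# Sketch: update 3D key points with one homogeneous transform per point.
# Treating each of the 21 predicted matrices as a 4x4 rigid transform is an
# assumption; the text does not fix the matrix parameterisation.
import math


def rotation_z(angle, translation=(0.0, 0.0, 0.0)):
    """4x4 matrix: rotation about the z-axis followed by a translation."""
    c, s = math.cos(angle), math.sin(angle)
    tx, ty, tz = translation
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def apply_transforms(points, matrices):
    """Move each previous-frame point with its own transformation matrix."""
    moved = []
    for (x, y, z), m in zip(points, matrices):
        h = (x, y, z, 1.0)
        moved.append(tuple(sum(m[r][k] * h[k] for k in range(4))
                           for r in range(3)))
    return moved


# Rigid motion of the whole hand: one matrix shared by all 21 key points.
previous = [(1.0, 0.0, 0.0)] * 21
shared = rotation_z(math.pi / 2, translation=(0.0, 0.0, 0.5))
current = apply_transforms(previous, [shared] * 21)
```
        </preformat>
        <p>When the hand moves rigidly, as in this example, a single shared matrix describes all 21 points, which corresponds to the case where fewer matrices suffice.</p>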
        <p>Let kc denote a convolutional layer with k filters and stride 1, kd - a convolutional
layer with k filters and stride 2, kr - a residual block with k filters, and ku - a transposed
convolutional layer with k filters and stride 2. ReLU is the activation function on all
layers except the last one, which uses a sigmoid activation, and the set of transposed convolutions
between the networks, which use leaky ReLU activations. All layers have a kernel size of 3 and zero padding
of size 1. The architecture of network A is then: 64c, 128d, 256d, 256c,
256r, 256r, 256r, 256r, 256d, 256d, 128d, 64d, 21c; and the architecture of network B
is: 64c, 128d, 256d, 256c, 256r, 256r, 256r, 256d, 128d, 64d, 32d, 21c. The outputs
of the fourth, fifth, and sixth layers of network A are concatenated with the corresponding
outputs of network B. To concatenate the input and output of network A, we are going
to use 6 transposed convolutions with stride 2, leaky ReLU activations, and the following
numbers of channels: 32, 32, 16, 8, 4, 2 (see Fig. 4).</p>
        <p>We are going to record a video with hand movements similar to sign language gestures
from at least three cameras. We expect to improve reconstruction accuracy by training
the networks on this dataset.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Experiments and Evaluation</title>
        <p>We will train the networks on the FreiHAND, GANerated Hands, and SynthHands datasets,
and then fine-tune them on the introduced sign language dictionary dataset. The proposed
methods will be evaluated on the STB and RHD datasets. Since the accuracy of sign
language dictionary reconstruction cannot be fully evaluated with an error metric alone,
we plan to engage sign language experts to evaluate the result.</p>
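        <p>The text does not name a specific error metric. As an assumption, the sketch below shows two measures commonly reported for 3D hand pose benchmarks: the mean end-point error and the percentage of correct key points (PCK) under a distance threshold. The function names are hypothetical.</p>
        <preformat>
```python
# Sketch of two common 3D key point error measures. Which metric will be
# used is not stated in the text; both functions are illustrative.
import math


def mean_endpoint_error(pred, truth):
    """Average Euclidean distance between predicted and true key points."""
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists)


def pck(pred, truth, threshold):
    """Fraction of key points whose error does not exceed the threshold."""
    hits = sum(1 for p, t in zip(pred, truth)
               if threshold >= math.dist(p, t))
    return hits / len(pred)


# Two key points: one predicted exactly, one off by 5 units.
predicted = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
labeled = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
epe = mean_endpoint_error(predicted, labeled)
pck_at_1 = pck(predicted, labeled, threshold=1.0)
```
        </preformat>
        <p>Such numbers summarize geometric accuracy only, which is why an expert assessment of the reconstructed signs is planned in addition.</p>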
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Timeline to Completion</title>
      <p>October 2019 - Create sign language dataset. Implement and evaluate method one.
Describe results.</p>
      <p>November 2019 - Implement and evaluate method two. Describe results.</p>
      <p>December 2019 - Compare results to the state-of-the-art methods for 3D
reconstruction, formulate conclusions.</p>
      <p>January 2020 - Make final edits. Defend the thesis.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We propose methods aimed at solving the task of 3D reconstruction from video
sequences. We plan to compare the performance of multiple architectures and describe
the data pre-processing pipeline. The work aims not only at investigating sign
language reconstruction problems but also at preparing a baseline algorithm
for future VR and AR products.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Tompson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lecun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perlin</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Real-time continuous pose recovery of human hands using convolutional networks</article-title>
          .
          <source>ACM Trans on Graphics</source>
          <volume>33</volume>
          (
          <issue>5</issue>
          ), Article No.
          <elocation-id>169</elocation-id>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin Chang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tejani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          :
          <article-title>Latent regression forest: structured estimation of 3d articulated hand posture</article-title>
          .
          <source>In: 2014 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>3786</fpage>
          -
          <lpage>3793</lpage>
          . IEEE Press, New York (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Cascaded hand pose regression</article-title>
          .
          <source>In: 2015 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>824</fpage>
          -
          <lpage>832</lpage>
          . IEEE Press, New York (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Malik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elhayek</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nunnari</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varanasi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tamaddon</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heloir</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stricker</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Deephps: end-to-end estimation of 3d hand pose and shape by learning from synthetic depth</article-title>
          .
          <source>In: 2018 IEEE International Conference on 3D Vision</source>
          , pp.
          <fpage>110</fpage>
          -
          <lpage>119</lpage>
          . IEEE Press, New York (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Athitsos</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sclaroff</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Estimating 3D hand pose from a cluttered image</article-title>
          .
          <source>In: 2003 IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2003</year>
          . Proceedings. Vol.
          <volume>2</volume>
          , pp.
          <fpage>II-432</fpage>
          . IEEE Press, New York (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Pose guided structured region ensemble network for cascaded hand pose estimation</article-title>
          .
          <source>Neurocomputing</source>
          (
          <year>2019</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1016/j.neucom.2018.06.097</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Marín-Jiménez</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Romero-Ramirez</surname>
            ,
            <given-names>F.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muñoz-Salinas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Medina-Carnicer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>3D human pose estimation from depth maps using a deep combination of poses</article-title>
          .
          <source>Journal of Visual Communication and Image Representation</source>
          <volume>55</volume>
          ,
          <fpage>627</fpage>
          -
          <lpage>639</lpage>
          (
          <year>2018</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1016/j.jvcir.2018.07.010</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pollefeys</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Accurate 3d pose estimation from a single depth image</article-title>
          .
          <source>In: 2011 IEEE International Conference on Computer Vision</source>
          , pp.
          <fpage>731</fpage>
          -
          <lpage>738</lpage>
          . IEEE Press, New York (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Panteleris</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oikonomidis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argyros</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Using a single RGB frame for real time 3D hand pose estimation in the wild</article-title>
          .
          <source>In: 2018 IEEE Winter Conference on Applications of Computer Vision</source>
          , pp.
          <fpage>436</fpage>
          -
          <lpage>445</lpage>
          . IEEE Press, New York (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mueller</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernard</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sotnychenko</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehta</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sridhar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theobalt</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>GANerated hands for real-time 3d hand tracking from monocular RGB</article-title>
          .
          <source>In: 2018 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>59</lpage>
          . IEEE Press, New York (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Mueller</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehta</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sotnychenko</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sridhar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theobalt</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Real-time hand tracking under occlusion from an egocentric RGB-D sensor</article-title>
          .
          <source>In: 2017 IEEE International Conference on Computer Vision</source>
          , pp.
          <fpage>1284</fpage>
          -
          <lpage>1293</lpage>
          . IEEE Press, New York (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Zimmermann</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brox</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Learning to estimate 3D hand pose from single RGB images</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          , pp.
          <fpage>4903</fpage>
          -
          <lpage>4911</lpage>
          . IEEE Press, New York (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>3D hand pose tracking and estimation using stereo matching</article-title>
          .
          <source>arXiv preprint arXiv:1610.07214</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stenger</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          :
          <article-title>BigHand2.2m benchmark: hand pose dataset and state of the art analysis</article-title>
          .
          <source>In: 2017 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>2605</fpage>
          -
          <lpage>2613</lpage>
          . IEEE Press, New York (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Garcia-Hernando</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baek</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          :
          <article-title>First-person hand action benchmark with RGB-D videos and 3D hand pose annotations</article-title>
          .
          <source>In: 2018 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>409</fpage>
          -
          <lpage>419</lpage>
          . IEEE Press, New York (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sharp</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keskin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taylor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shotton</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rhemann</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leichter</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinnikov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freedman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Accurate, robust, and flexible real-time hand tracking</article-title>
          .
          <source>In: 33rd Annual ACM Conference on Human Factors in Computing Systems</source>
          , pp.
          <fpage>3633</fpage>
          -
          <lpage>3642</lpage>
          . ACM (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Wetzler</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Slossberg</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kimmel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Rule of thumb: deep derotation for improved fingertip detection</article-title>
          .
          <source>arXiv preprint arXiv:1507.05726</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Tzionas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srikantha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aponte</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pollefeys</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gall</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Capturing hands in action using discriminative salient points and physics simulation</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>118</volume>
          (
          <issue>2</issue>
          ),
          <fpage>172</fpage>
          -
          <lpage>193</lpage>
          (
          <year>2016</year>
          ).
          doi:
          <pub-id pub-id-type="doi">10.1007/s11263-016-0895-4</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Rogez</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khademi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Supančič</surname>
            <suffix>III</suffix>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montiel</surname>
            ,
            <given-names>J.M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>3D hand pose detection in egocentric RGB-D images</article-title>
          . In:
          <string-name>
            <surname>Agapito</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bronstein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rother</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (eds.)
          <source>ECCV 2014 Workshops</source>
          . LNCS, vol.
          <volume>8925</volume>
          , pp.
          <fpage>356</fpage>
          -
          <lpage>371</lpage>
          . Springer, Cham (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Gomez-Donoso</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Orts-Escolano</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cazorla</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Large-scale multiview 3D hand pose dataset</article-title>
          .
          <source>arXiv preprint arXiv:1707.03742</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Ge</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xue</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>3D hand shape and pose estimation from a single RGB image</article-title>
          .
          <source>In: 2019 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>10833</fpage>
          -
          <lpage>10842</lpage>
          . IEEE Press, New York (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>