Workshop on Multimodal Semantics for Robotic Systems (MuSRobS), IEEE/RSJ International Conference on Intelligent Robots and Systems 2015

Perceptive Parallel Processes Coordinating Geometry and Texture

Marco A. Gutierrez1, Rafael E. Banchs2 and Luis F. D'Haro2

Abstract— Finding and classifying specific objects is a key part of most of the tasks autonomous systems face. Being able to reach objects and find their exact location is very important for successfully achieving higher level robotic behaviors. To perform full object detection and recognition tasks in a wide environment, several perception approaches need to be brought together to achieve good performance. In this paper we present a dual parallel system for object finding in wide environments. Our system has two main parts: a texture based approach for wide scenes, composed of a multimodal deep learning neural network and a syntactic distribution based parser, and a specific geometry based process that uses three dimensional data and geometric constraints to look for specific objects and their position within a whole scene. Both processes run in parallel and complement each other to fulfill an object search and locate task. The major contribution of this paper is the combination of texture and geometry based solutions running in parallel and sharing information in real time, allowing a generic solution able to find almost any object present in a wide environment. To validate our system we test it with real environment data injected into a simulated environment. We test 25 tasks in a household environment, obtaining a 92% overall success rate and finally delivering the correct position of the object.

Fig. 1. Our system combines the best of 2D and 3D data information through two coordinated parallel processes.

*This work was supported by the A*STAR Research Attachment Programme.
1 Marco A. Gutiérrez is with the Robotics and Artificial Vision Laboratory (RoboLab), University of Extremadura, Spain. marcog@unex.es
2 Rafael E. Banchs and Luis F. D'Haro are with the HLT dept., I2R, A*STAR, Singapore. rembanchs@i2r.a-star.edu.sg, luisdhe@i2r.a-star.edu.sg

I. INTRODUCTION

A significant amount of work has been done in scene understanding from 2D images since the beginnings of computer vision research, achieving significant results. Hand-designed features such as SIFT [1], ORB [2] or HOG [3] underpin many of these successful object recognition approaches. They basically capture low-level texture information and have difficulty effectively capturing mid-level cues (like edge intersections) or high-level representations (like different object parts). Recent developments in deep learning based solutions have shown how hierarchies of features can be learned in an unsupervised manner directly from data.
Solutions based on learned features have brought significant improvements in object recognition and detection, some of them reaching success rates of around 90% on different benchmark training/testing sets (e.g. the Pascal VOC Challenge [4]). Recently, even full, semantically well structured image descriptions have been generated by the latest multimodal neural language models [5]. Still, when using 2D based scene understanding, a lot of valuable information about the shape and geometric layout of objects is not considered. Adding geometric information to these solutions could generally improve their results as well as enrich the information they deliver as an output.

On the other hand, 3D model based approaches make it easy to reason about a variety of properties such as volumes, 3D distances and local convexities. Solutions focusing on object shapes and geometric characteristics have also received intense computer vision research attention, especially due to the recent range of inexpensive and fast RGB-D sensors available on the market. 3D features such as FPFH [6] or NARF [7] are examples of robust features that describe the local geometry around points in 3D point cloud datasets. However, 3D solutions have some drawbacks when dealing with heavily cluttered scenes or very general views of the environment.

Although good solutions exist for both image and point cloud based approaches, when it comes to solving tasks in real environments a more generic approach to the problem is needed. Systems that use both 2D image based solutions and 3D geometry aware processes can provide a more general purpose robotics architecture with more reliable and richer information. Our approach combines the rich information obtained from recent multimodal neural network object classification techniques on general 2D image scenes with a 3D geometry, distance and shape aware process (figure 1). This allows us to compensate the drawbacks of each approach with the strengths of the other.

For the evaluation of our model we used a hybrid simulation-real scenario. A simulator tool was used to manage the robot movements around the environment while sensor data was injected into the system from real scenario captures. This allowed us to test our approach with real environment data, since all perception information used as an input for our application comes from real sensors. As a result we obtain quite promising results on the object finding tasks tested.

The remainder of the paper is organized as follows: in section II we provide an overview of some related works. Section III gives a detailed general description of the perception system. Sections IV and V explain more specific details regarding each of the two main processes, the texture aware process and the geometry aware one respectively. Finally, we evaluate the system with an experiment in section VI and give some conclusions and future lines of work in section VII.

Fig. 2. Overview of the architecture of the perception system.

II. RELATED WORKS

There is a wide range of research in the area of scene understanding and object recognition from 2D and point cloud data. With the increased popularity of RGB-D sensors, which bring an easy way to access RGB and depth data at the same time, several researchers have tried combining the two sources of information.

Sensor fusion approaches are the most common ones; they take both sources of data and combine them into one system to improve performance. For instance, in [8] they associate groups of pixels with 3D points into multimodal regions that they call regionlets, and then measure the structure of each regionlet using bottom-up cues from image and range features. This way they are able to determine the scene structure, separating it into meaningful parts and discarding the background clutter. Although they do not rely on any rigid assumptions about the scene like we do (we consider objects to be placed on tables), their output provides a basic structure discovery over a scene with detection of the main objects, while our solution solves a specific object search and locate task in a wider environment.

The machine learning based approaches take features from both depth and color data sources and combine them into one multimodal space in which to perform later searches for a given input. Koppula et al. [9] perform a labeling task on an over-segmented 3D RGBDSLAM sensed scenario. They build a graphical model capturing 2D image information (local color, texture, gradients of interest, etc.) as well as local shape and geometry, and geometrical context (where objects most commonly lie relative to each other). This model uses approximate inference and is trained using a maximum-margin learning approach. They show the benefits of using image and shape together against separate solutions. Also, Lai et al. [10] present an RGB-D Object Dataset and evaluate some object recognition and detection techniques on it. They combine 2D SIFT descriptors with efficient match kernel (EMK) features computed over spin images on a randomly subsampled set of 3D points. These features are then used for the evaluation of three classifiers: a linear support vector machine (LinSVM), a gaussian kernel support vector machine (kSVM) and a random forest (RF). The main difference with these works is that they restrict the search to a certain scene, while our solution provides a framework to find and locate an object in an entire household environment.
The work in [11] combines high-resolution 3D laser scans with 2D images to improve object detection. Their solution relies on a sliding window approach over a combination of visual and depth channels, using those patches to train a classifier. It solves the same problem as the one presented here, although they do not perform any optimization in terms of the path to reach the object, most probably leading to a slower solution for an object search and location task like the one explained here.

Also, in [12] binary logistic classifiers are used on 2D and 3D features. The 2D features are small patches selected from images in a training set. They then compute 3D features from distance-from-robot estimation, surface variation and orientation, and object dimensions. These features are learned by the classifier over a two-split decision for each object class. The difference with our solution is that they learn multimodal models per object, while here the RGB and point cloud data are used by two different processes and the outcomes are combined in a final solution.

III. THE PERCEPTION SYSTEM

As shown in figure 2, the system's architecture has a control manager for decisions and mediation between the two parallel perceptive processes. This manager takes care of the information shared between both processes and delivers notifications according to it.

The texture aware perceptive process (shown in green in figure 2) exploits 2D image information. It runs a multimodal neural language model as described in [13], along with a syntactic frequency distribution based parser to process and evaluate the neural network output. The second one, the geometry aware perceptive process, exploits the geometric features of the environment. It takes care of two main tasks: looking for tables in the point cloud data, and segmenting tabletop setups, recognizing the object and estimating its position through a shape and position aware histogram based feature matching system.

Figure 3 describes the states and processes of the system for a given object search and locate task. Green tasks are performed by the texture aware perceptive process while the ones in red belong to the geometry aware perceptive process. A restriction the system assumes is that objects are placed on tables. A simple task is given to the robot in the form "Look for the OBJECT", and the object name is extracted and passed to the texture aware perceptive process for the "look for places to visit" step. A list of generic images for each available place is stored in our database and evaluated by this perceptive process. A frequency-of-appearance histogram of possible objects is built for each place. Places to visit are then ordered according to the highest appearance of the target object's label in this output. Places with no appearances of the object are left to visit last and ordered randomly.

Once the list of places to visit is ready, the robot visits them in order. When the first place is reached, both processes start to work in parallel on the required object. The texture aware perceptive process provides a frequency distribution of objects in images taken from the current place, while the geometry aware one starts looking for tables in the scene point cloud data. If the object is found in a scene image and a table has been detected, the robot starts moving towards the table. Once a table is reached and the texture aware perceptive process keeps validating the appearance of the object in the scene, a tabletop segmentation process is started by the geometry aware perceptive process in order to segment, recognize and locate the object.

If no object seems to be present when the tabletop segmentation is performed, the robot continues with the next table, or with the next place in the list if no more tables are available in the current place. We only conclude that we cannot find an object once all places have been visited and no object has been found.

Fig. 3. Flow of the different states and processes of the system.
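The flow in figure 3 can be summarised as a place-ordering step followed by a visit loop. The following is a minimal Python sketch of that control flow; every helper and data structure in it (place_histograms, robot, detect_object_in_images, find_tables, segment_tabletop) is a hypothetical placeholder for the corresponding components, not the actual interfaces of our implementation.

```python
import random

def order_places(target, places, place_histograms):
    # Rank places by how often the target label appears in the
    # object-frequency histogram built from that place's generic images;
    # places where it never appears go last, in random order.
    seen = [p for p in places if place_histograms[p].get(target, 0) > 0]
    unseen = [p for p in places if place_histograms[p].get(target, 0) == 0]
    seen.sort(key=lambda p: place_histograms[p][target], reverse=True)
    random.shuffle(unseen)
    return seen + unseen

def search_and_locate(target, places, place_histograms, robot):
    for place in order_places(target, places, place_histograms):
        robot.go_to(place)
        # In the real system both perceptive processes run in parallel;
        # they are called sequentially here only for readability.
        if not detect_object_in_images(target, robot.rgb_images()):
            continue                      # object not seen here, try next place
        for table in find_tables(robot.point_cloud()):
            robot.go_to(table)
            for obj in segment_tabletop(robot.point_cloud()):
                if obj.label == target:
                    return obj.pose       # success: deliver estimated position
    return None                           # every place visited, nothing found
```

The ordering function corresponds to the "look for places to visit" state, and the inner loops to the table approach and tabletop segmentation states of figure 3.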
IV. TEXTURE AWARE PERCEPTIVE PROCESS

This texture based perceptive process is intended to get quick scene labeling from wide overviews of the environment. It contains a previously trained multimodal neural model that outputs image descriptions. Then, taking into account the top nearest descriptions in the model, a parser extracts the object candidates and builds a frequency distribution histogram of the appearances of these object class names. This frequency distribution histogram helps obtain an output that is more robust against false positives, as the objects that are actually present in the scene tend to keep appearing with higher frequency over time in the sentences, while the false positives usually have a much lower frequency.

A. Multimodal neural model

As previously mentioned, the multimodal neural model follows the structure in [13]. This is a neural model pipeline that learns multimodal representations of images and text. The pipeline uses a long short-term memory (LSTM) [14] recurrent neural network for encoding sentences. We use a convolutional network architecture provided by the Toronto Convnet [15] to extract 4096 dimensional image features for the neural model. These image features are then projected into the embedding space of the LSTM hidden states. A pairwise ranking loss is minimized in order to learn to rank images and their descriptions. For decoding, the structure-content neural language model (SC-NLM) disentangles the structure of a sentence from its content, conditioned on distributed representations produced by the encoder. Finally, the output is generated by sampling the top image descriptions from the SC-NLM.
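As an illustration of this kind of ranking objective, the numpy sketch below computes a margin-based pairwise ranking loss over a toy mini-batch of matched image and sentence embeddings; the margin value, batch size and embedding dimension are arbitrary choices for the example and are not the settings used in [13].

```python
import numpy as np

def pairwise_ranking_loss(img_emb, sen_emb, margin=0.2):
    # img_emb, sen_emb: (batch, dim) L2-normalised embeddings in the joint
    # space; row i of img_emb is the image matching sentence i of sen_emb.
    scores = img_emb @ sen_emb.T            # cosine similarities, (batch, batch)
    positives = np.diag(scores)             # similarity of the correct pairs
    # Hinge cost for ranking a wrong sentence above the correct one (cost_s)
    # and a wrong image above the correct one (cost_i).
    cost_s = np.maximum(0.0, margin - positives[:, None] + scores)
    cost_i = np.maximum(0.0, margin - positives[None, :] + scores)
    np.fill_diagonal(cost_s, 0.0)
    np.fill_diagonal(cost_i, 0.0)
    return cost_s.sum() + cost_i.sum()

# Toy usage with random, normalised embeddings (3 pairs, 4 dimensions).
rng = np.random.default_rng(0)
img = rng.normal(size=(3, 4)); img /= np.linalg.norm(img, axis=1, keepdims=True)
sen = rng.normal(size=(3, 4)); sen /= np.linalg.norm(sen, axis=1, keepdims=True)
print(pairwise_ranking_loss(img, sen))
```

Minimizing a loss of this form pushes each image closer to its own description than to any other description in the batch (and vice versa), which is what later allows candidate captions to be ranked for a new image.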
B. Syntactic frequency distribution parser

After the system obtains the top generated scene descriptions, it extracts potential object classes from them using a syntactic parser. Using the Natural Language Toolkit [16] we syntactically analyze the sentences to extract object candidates that could be present in the image. A frequency distribution histogram is computed over these object candidates. This histogram is then used to evaluate the belief that an object is present in a scene, allowing us to compare different scenes according to the probability of finding an object there and therefore to discriminate possible false positives.
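A minimal sketch of this extraction step with NLTK is shown below; the example captions are invented, and keeping every noun as an object candidate is a simplification of the parsing actually performed (recent NLTK versions also use a perceptron tagger by default rather than the maximum-entropy Treebank tagger mentioned in section VI).

```python
import nltk
from nltk import FreqDist

# The tokenizer and POS tagger models may need to be downloaded once:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

captions = [
    "a cup and a laptop sitting on a wooden table",
    "a laptop computer on a desk next to a cup",
    "a bottle on a table in a kitchen",
]

candidates = []
for sentence in captions:
    tokens = nltk.word_tokenize(sentence)
    for word, tag in nltk.pos_tag(tokens):
        if tag.startswith("NN"):          # keep nouns as object candidates
            candidates.append(word.lower())

freq = FreqDist(candidates)
print(freq.most_common(5))                # e.g. [('table', 2), ('cup', 2), ...]
```

Accumulating the distribution over several frames, as the system does, makes the labels of objects that are really present dominate the histogram while spurious words stay in the tail.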
V. GEOMETRY AWARE PERCEPTIVE PROCESS

This process exploits the geometry present in the environment to extract a wide variety of information. For our approach we have restricted the task of finding objects to objects placed on top of tables. Therefore this process performs two main tasks: one is looking for tables in broad scenes, and the other consists of a tabletop segmentation with shape based object recognition and pose estimation.

Fig. 4. Tabletop segmentation and object recognition pipeline using point cloud data.

A. Looking for tables

We describe tables as planes that are parallel to the floor and found at a height between 40 and 110 centimeters. We use RANdom SAmple Consensus (RANSAC) [17] for plane model fitting in the scene point cloud data, after first downsampling it to 1 cm. Using this algorithm we recursively look for planes matching the previously mentioned constraints and label them as tables.
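The table search can be illustrated with the Open3D library; our implementation uses other components, so this is only a functionally similar sketch. The 2 cm RANSAC inlier threshold, the minimum cloud size, the Open3D 0.10+ API and the assumption that the z axis points up are choices made for the example, while the 1 cm downsampling and the 40-110 cm height band come from the text.

```python
import numpy as np
import open3d as o3d

def find_table_planes(pcd, max_planes=5):
    # Recursively fit planes with RANSAC and keep the horizontal ones whose
    # height falls in the 40-110 cm band (assuming the z axis points up).
    cloud = pcd.voxel_down_sample(voxel_size=0.01)       # 1 cm downsampling
    tables = []
    for _ in range(max_planes):
        if len(cloud.points) < 500:                      # arbitrary stop criterion
            break
        plane, inliers = cloud.segment_plane(distance_threshold=0.02,
                                             ransac_n=3,
                                             num_iterations=1000)
        a, b, c, d = plane                               # a*x + b*y + c*z + d = 0
        normal = np.array([a, b, c])
        normal /= np.linalg.norm(normal)
        height = -d / c if abs(c) > 1e-6 else None       # z of a horizontal plane
        if abs(normal[2]) > 0.95 and height is not None and 0.40 < height < 1.10:
            tables.append(cloud.select_by_index(inliers))
        cloud = cloud.select_by_index(inliers, invert=True)  # drop plane, keep looking
    return tables

# Example: tables = find_table_planes(o3d.io.read_point_cloud("scene.pcd"))
```

Each accepted plane is removed from the cloud before the next RANSAC run, which is what makes the search recursive over the remaining points.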
B. Object recognition and pose estimation

The tabletop segmentation is used when a table is approached, in order to recognize the objects on top of it as well as to estimate their final position.

In the first part, shown in figure 4.b, the RANSAC algorithm provides us with the plane equation and the points that match that equation. Since RANSAC uses a threshold to deal with sensor noise, points matching the model are not on a perfect plane but within a certain range, so we first project these points onto the plane equation to obtain a perfect plane point cloud. Then we obtain the convex hull of this plane point cloud and build a bounding box on top of it up to a certain height. Points within the bounding box are then considered to correspond to objects sitting on top of the table. A Euclidean clustering extraction is then performed to segment the point clouds of the object candidates.

As the next step (figure 4.c), we compute the Viewpoint Feature Histograms (VFH) [18] of these point clouds and look for the nearest match in our database. For this database we have previously computed VFHs of single views of objects. These VFHs are stored and retrieved through fast approximate K-Nearest Neighbors (KNN) searches using kd-trees [19]. The construction of the tree and the search for the nearest neighbors place an equal weight on each histogram bin of the VFH and spin image features.

Finally, the system checks whether any of the labels of the recognized objects corresponds to the one we are looking for (see figure 4.d) and reports a success or not.
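The nearest-neighbour lookup can be sketched with an exact kd-tree from scipy, standing in for the approximate FLANN-style search of [19]. The database file names, their layout (one 308-bin VFH signature per stored object view, the usual VFH size in PCL) and the majority vote over the k nearest views are assumptions made only for this example.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical database layout: one row per stored object view (a VFH
# signature) plus the label of the object that view belongs to.
db_vfh = np.load("vfh_database.npy")      # shape (n_views, 308), assumed file
db_labels = np.load("vfh_labels.npy")     # shape (n_views,),     assumed file

tree = cKDTree(db_vfh)                    # equal weight on every histogram bin

def recognize(cluster_vfh, k=5):
    # Majority vote among the k nearest stored views of the cluster's VFH.
    dists, idx = tree.query(cluster_vfh, k=k)
    labels, counts = np.unique(db_labels[idx], return_counts=True)
    return labels[np.argmax(counts)], dists[0]

# label, best_distance = recognize(vfh_of_segmented_cluster)
```

A large best-match distance could additionally be used to reject clusters that are not similar enough to anything stored in the database.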
VI. EXPERIMENT

We perform several experiments sending the robot to retrieve different objects in a wide household environment. For the experiment a hybrid simulator-real data environment has been used. We used the simulator for the robot movements between places, while sensor data was acquired with real RGB and RGB-D cameras (e.g. the tabletop shown in figure 5) and matched to the specific locations on the virtual plane. When the robot needs to move around, the simulator takes care of it; once certain positions in the map are reached, the previously obtained real data is injected and used as input for the algorithms. The robot always starts at the entrance of the apartment and from there follows the most optimal path to find the object, delivering its estimated position as the final result.

Fig. 5. Example of one of the tabletop setups used in the experiment.

A. System setup

The LSTM encoder and SC-NLM decoder of the multimodal neural model have been trained using a combination of the Flickr30k dataset [20] and the Microsoft COCO dataset [21]. The 4096 dimensional image features for the multimodal neural model training are extracted using the Toronto Convnet with their provided models. The frequency histogram is built using the NLTK toolbox on the top 5 generated sentences over at least 5 frames, to achieve robustness on the objects observed. This NLTK tagging and syntactic analysis is performed using the Treebank Part of Speech Tagger (maximum entropy) they have available. For the representation of the rooms in the house, 5 generic images of different parts of a house are used for each of the places in the house: entrance, room, kitchen, living room, bathroom, patio and bedroom. These images have been selected so that they contain the usual set of items present in those rooms. For the point cloud analysis, a kd-tree stores 3729 VFHs from different views of 75 different objects. The whole system is developed using the RoboComp robotics framework [22] and the simulation is performed in a virtual scenario using the RoboComp simulator tool. See figure 6 for an overview of the simulation environment.

TABLE I
SUCCESS RATES ON THE DIFFERENT PARTS OF THE ALGORITHM (COUNTS ARE OVER 5 RUNS PER OBJECT)

Object        | 1. Places to visit ordering | 2. False negatives | 3. False positives | 4. Success rate
Cereal box    | 5                           | 0                  | 0                  | 100%
Cup           | 5                           | 0                  | 1                  | 80%
Bottle        | 5                           | 0                  | 1                  | 80%
Laptop        | 5                           | 0                  | 0                  | 100%
Monitor       | 5                           | 0                  | 0                  | 100%
Overall rate  | 100%                        | 0%                 | 8%                 | 92%

Fig. 6. An overview of the simulation household environment. The rooms are labeled as follows: 1.- Entrance, 2.- Living room, 3.- Patio, 4.- Bathroom, 5.- Hallway, 6.- Kitchen, 7.- Bedroom. Circled in yellow is the robot at its starting point.

B. Results on the experiments

We run 5 different tasks 5 times each and collect the results in table I. First we measure whether the ordering of places to visit after the "Look for places to visit" step in our system was optimal (see figure 3 for details). This turned out to work perfectly for all of our test cases, basically because some of the description pictures of the places contained those items and the texture aware perceptive process was able to detect them. It is important for this step to select a good range of images representing the different places to visit (see figure 6), especially images that clearly show the objects one can usually find in those places.

Then we count the false negative occurrences, that is, when the search finishes and no object was found. In our testing this never happened and an object was always found. However, we obtained two false positives, when the system mistook a cup for a bottle and when a bottle was mistaken for a bottle of glue. Those mistakes are basically due to the similarity in the shapes of these objects. We could avoid this in the future by reinforcing this step with other object features, especially since the objects to be found were actually present on the table being segmented at the time. The final success rate in obtaining the proper location of the object and its pose estimation is quite high, which is promising for further real applications of the system.
VII. CONCLUSIONS AND FUTURE WORK

We presented a hybrid perception system that combines 2D data based solutions and approaches using point clouds, running in parallel and sharing information in real time in order to achieve an object finding task. The system is able to successfully predict a route through the places with higher probability of containing the target object. We obtained a high rate of success in our experiments, with only two false positives among all our test cases.

An interesting future work would be to perform further testing with a wider range of objects. This could help find weak points of the system that we might not have found yet and that would be worth strengthening with more interaction between the processes. In the same line, and although the sensor data used in the testing was taken from real sensors, integrating the solution with a real robot could bring a more accurate overview of how the system performs in real environments.

The false positives obtained during the experiments are mainly due to a bad performance of the geometry aware perceptive process. Since similarity in the shape of different objects confuses the VFH search, exploiting texture based features in this last step could most probably benefit the final output of the whole system. Also, since we are using a Euclidean clustering extraction method for the objects on top of the table, our system cannot deal with heavily cluttered scenes or objects touching each other. Adding alternatives to the segmentation process could help cover a more varied range of scenarios. It would also be desirable to avoid the assumption that objects are always on tables, so we should look into new ways of scene segmentation to improve this step.

Finally, adding a learning process to the system would be an interesting enhancement. Both parallel processes could complement each other, correcting each other's mistakes and providing the fixed mistake as a new source of learning, leading to improvements in the following overall system performance.

REFERENCES

[1] D.G. Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150–1157, 1999.
[2] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, ICCV '11, pages 2564–2571, Washington, DC, USA, 2011. IEEE Computer Society.
[3] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Cordelia Schmid, Stefano Soatto, and Carlo Tomasi, editors, International Conference on Computer Vision & Pattern Recognition, volume 2, pages 886–893, INRIA Rhône-Alpes, Montbonnot, June 2005.
[4] Mark Everingham, S.M. Ali Eslami, Luc Van Gool, Christopher K.I. Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
[5] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014.
[6] R.B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (FPFH) for 3D registration. In Robotics and Automation, 2009. ICRA '09. IEEE International Conference on, pages 3212–3217, May 2009.
[7] Bastian Steder, Radu Bogdan Rusu, Kurt Konolige, and Wolfram Burgard. NARF: 3D range image features for object recognition. In Workshop on Defining and Solving Realistic Perception Problems in Personal Robotics, IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2010.
[8] Alvaro Collet, Siddhartha S. Srinivasa, and Martial Hebert. Structure discovery in multi-modal data: A region-based approach. In ICRA, pages 5695–5702. IEEE, 2011.
[9] Hema S. Koppula, Abhishek Anand, Thorsten Joachims, and Ashutosh Saxena. Semantic labeling of 3D point clouds for indoor scenes. In Advances in Neural Information Processing Systems, pages 244–252, 2011.
[10] Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. A large-scale hierarchical multi-view RGB-D object dataset. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 1817–1824. IEEE, 2011.
[11] M. Quigley, Siddharth Batra, S. Gould, E. Klingbeil, Quoc Le, Ashley Wellman, and A.Y. Ng. High-accuracy 3D sensing for mobile manipulation: Improving object detection and door opening. In Robotics and Automation, 2009. ICRA '09. IEEE International Conference on, pages 2816–2822, May 2009.
[12] Stephen Gould, Paul Baumstarck, Morgan Quigley, Andrew Y. Ng, and Daphne Koller. Integrating visual and range data for robotic object detection. In ECCV Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications (M2SFA2), 2008.
[13] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. CoRR, abs/1411.2539, 2014.
[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, November 1997.
[15] Toronto University. Convolutional Neural Nets. https://torontodeeplearning.github.io/convnet/, 2015. [Online; accessed 04-March-2015].
[16] Steven Bird. NLTK: The Natural Language Toolkit. In Proceedings of the COLING/ACL on Interactive Presentation Sessions, COLING-ACL '06, pages 69–72, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.
[17] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, June 1981.
[18] R.B. Rusu, G. Bradski, R. Thibaux, and J. Hsu. Fast 3D recognition and pose using the viewpoint feature histogram. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 2155–2162, Oct 2010.
[19] Marius Muja and David G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP International Conference on Computer Vision Theory and Applications, pages 331–340, 2009.
[20] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
[21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.
[22] Marco A. Gutiérrez, A. Romero-Garcés, P. Bustos, and J. Martínez. Progress in RoboComp. Journal of Physical Agents, 7(1), 2013.