Exploring Task-agnostic, ShapeNet-based Object Recognition for Mobile Robots

Agnese Chiatti, Knowledge Media Institute, The Open University, United Kingdom (agnese.chiatti@open.ac.uk)
Gianluca Bardaro, Knowledge Media Institute, The Open University, United Kingdom (gianluca.bardaro@open.ac.uk)
Emanuele Bastianelli, The Interaction Lab, Heriot-Watt University, United Kingdom (emanuele.bastianelli@hw.ac.uk)
Ilaria Tiddi, Faculty of Computer Science, Vrije Universiteit Amsterdam, The Netherlands (i.tiddi@vu.nl)
Prasenjit Mitra, Information Sciences and Technology, The Pennsylvania State University, Pennsylvania, USA (pmitra@ist.psu.edu)
Enrico Motta, Knowledge Media Institute, The Open University, United Kingdom (enrico.motta@open.ac.uk)

ABSTRACT
This position paper presents an attempt to improve the scalability of existing object recognition methods, which largely rely on supervision and imply a huge availability of manually-labelled data points. Moreover, in the context of mobile robotics, data sets and experimental settings are often handcrafted based on the specific task the object recognition is aimed at, e.g. object grasping. In this work, we argue instead that publicly available open data such as ShapeNet [8] can be used for object classification first, and then to link objects to their related concepts, leading to task-agnostic knowledge acquisition practices. To this aim, we evaluated five pipelines for object recognition, where target classes were all entities collected from ShapeNet and matching was based on: (i) shape-only features, (ii) RGB histogram comparison, (iii) a combination of shape and colour matching, (iv) image feature descriptors, and (v) inexact, normalised cross-correlation, resembling the deep, Siamese-like NN architecture of [31]. We discuss the relative impact of shape-derived and colour-derived features, as well as the suitability of all tested solutions for future application to real-life use cases.

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org.

1 INTRODUCTION
Autonomous sensemaking under rapidly-evolving and uncertain circumstances goes beyond building intelligent and knowledge-based systems: it requires mobile systems that are not only able to reason about their surroundings, but also to readily adapt to their context. Context is, first and foremost, bound to the physical objects spread around the observed space, all belonging to different categories and holding static or dynamic qualities, based on their evolution over time. Scalable and adaptable object recognition through mobile robots is then of crucial importance for successful knowledge acquisition and mapping from rapidly-evolving environments. In fact, accurate object recognition is the essential prerequisite to a number of applications in Robotics, including but not limited to: health and safety monitoring [2], retrieving entities across space through human instructions provided in natural language [18], pre-emptive obstacle removal, particularly in the context of elderly care [1, 20], and door-to-door garbage collection in Smart Cities [13]. In this scenario, the ability to generalise across different domains by learning features independently from the end goal, e.g., grasping or mapping, can allow agents to flexibly switch between different tasks and capability sets [32].
State-of-the-art supervised approaches to object recognition from natural scenes [23–25] imply the availability of large collections of labelled examples and lack flexibility when applied to unseen classes and mutable environments. On the other hand, fully unsupervised approaches can provide exploratory insights and guidelines that, however, require significant further tuning and error analysis. This evidence provides much incentive to explore alternative semi-supervised approaches, to balance out the accuracy and precision of the recognition process with the scalability of the achieved solution. Besides, the recent availability of open, multi-modal common sense knowledge [8, 10, 30] has expanded the opportunities to further refine, ground and enrich the extracted object entities.
To form a task-agnostic image representation that enables object recognition under varying classes and conditions, different features can come into play. For instance, chairs and plants can be discriminated from one another, in principle, thanks to their shape alone. However, coat hangers could be mistaken for plants if colours were not taken into account. Hence, the contribution of shape and colour to the resulting classification needs careful assessment, before applying Neural Net-based methods that can produce less interpretable results with respect to feature importance. Further, relying on ShapeNet-derived models [8] for similarity matching provides readily available data, already segmented and labelled, while also linking object entities with a set of related concepts, for future knowledge grounding.
Based on these premises, we investigated: (i) the relative impact of shape and colour features on the overall object recognition performance, when the presence of errors propagated from prior segmentation faults is minimised, and (ii) the scalability and performance of Siamese-like approaches already proven successful for person re-identification [31], when applied to ShapeNet-based object recognition instead. To tackle these questions, we designed five
pipelines, as the starting point to weigh up further application to images collected on a mobile robot.
In this paper, we present our main contributions, with respect to:
• Assessing the relative importance of shape-derived, colour-derived and hybrid features when similarity-based matching is applied against entities from ShapeNet.
• Evaluating the adequacy of feature descriptors in providing a less expensive and more general object representation, when applied to ShapeNet 2D views.
• Learning object similarity through inexact matching and a CNN-based architecture that shares weights in modelling the two input images, in a Siamese fashion, following an approach that has only been applied to person re-identification across successive frames, but not to task-agnostic object recognition.
For all of the above, we conclude by discussing the obtained results and the emerged challenges, which will inform future improvements to this work. All described data, implemented code and pre-trained models are available at our Github repository (https://github.com/kmi-robots/semantic-map-object-recognition).

2 BACKGROUND AND MOTIVATION
Recent advances in object recognition methods, such as YOLO [23, 24] or Faster R-CNN [25], have significantly improved performance on predetermined sets of object classes, thanks to expensive ad hoc training on manually-labelled data.
The costs and lack of flexibility associated with said solutions, especially when dealing with autonomous agents, have fuelled efforts in designing a number of unsupervised and semi-supervised methods, requiring limited labelled data points and ensuring more abstract and general data representations [4, 14, 33]. Along the same lines, recent efforts have emphasised the need for autonomous agents to recognise cross-domain objects and react flexibly to rapidly-evolving contexts [35]. Addressing similar concerns but from a different angle, other proposed methods [5] have used Generative Adversarial Learning on pre-trained Deep Neural Nets, to foster adaptability to new domains.
On the other hand, the reduced explainability of results obtained through Deep Neural Network based methods [11] suggests that a more careful analysis of the contributing features should be combined with "black-box" learning settings, and can benefit all stages of the knowledge discovery process [26]. Furthermore, identifying the most prominent descriptors has the potential to provide better insights on which modules to fine-tune when optimising the solution, in terms of both performance and computational costs. As a result, more scalable solutions also represent a more suitable alternative for mobile robot on-board installation [2, 32]. Therefore, these strategies can ultimately ensure a tradeoff between the more expensive and constrained supervised approaches and the more challenging fully-unsupervised scenarios, where objects are autonomously recognised, e.g., based on the dynamics of their environment [12, 17].
As unsupervised and semi-supervised approaches grow in number and become more established in the context of scene segmentation, object classification and object grasping, knowledge acquisition processes applied upfront, for mapping the robot environment within classes carrying semantic meaning (i.e., semantic mapping [22]), mainly rely on handcrafted knowledge bases or are often based on ARTags, to control for the complexity of autonomous object recognition and rather focus on spatial reasoning and rule implementation [2].
To the best of our knowledge, there has been no prior work in the related literature evaluating possibilities to extract general features through ShapeNet-based [8] similarity matching, for the purpose of acquiring task-agnostic knowledge. Further, relative impacts are here evaluated by isolating the classification problem from the additional noise carried over from the object segmentation routines. Thus, the integration of ShapeNet in the proposed workflow is not only motivated by the availability of the pre-segmented and pre-labelled data that comes with it, but also by the existing link between objects and their related concepts, enabling future knowledge grounding. Already used for learning object intrinsics [29] and for 3D scene understanding [33], models in ShapeNet have in fact never been used to assess the relative importance of colour- and shape-derived features when classifying objects, nor have they been combined with CNN-based inexact matching methods. Inspired by the application of Siamese-like Networks for person re-identification [31, 34], we seek to test whether similar methods can be applied to ShapeNet-based object recognition, to ultimately test their ability to scale towards more diverse classes.

3 EXPERIMENTAL SETUP
3.1 Data preparation
The experiments described in this paper involve different combinations of two main datasets. We focused on natural scenes from the NYUDepth V2 collection [21], already annotated and segmented, and on reference 2D models derived from the ShapeNet dataset.
NYUSet. NYUDepth V2 [21] comprises 1449 densely labeled pairs of aligned RGB and depth images and is provided with a MatLab Toolbox for basic data retrieval and manipulation. We implemented our own MatLab script, extending the provided methods for segmented entity extraction, in order to mask out each labelled region belonging to one of the target object classes and store them as separate RGB frames. To reduce cross-class imbalances, we further down-sampled the available chair examples to 1000 instances (see also Table 1).
ShapeNetSet. ShapeNet is a large-scale collection of richly annotated 3D models [8], organised into two subsets: (i) ShapeNetCore, covering 55 object classes with about 51,300 unique 3D models, and (ii) ShapeNetSem, consisting of 12,000 more densely-annotated models across 270 categories. For a number of 3D models, 2D views of the object surfaces are available as well. Further, ShapeNet object annotation is based on synsets, i.e., sets of synonyms defined according to the WordNet lexical database [19], and is linked with the ImageNet set as well [10].
We first selected a subset of models, i.e., two for each of the ten object classes of interest. We will refer to this subset as ShapeNetSet1, or SNS1, in the remainder of this paper. Specifically, for most classes, four 2D views of the selected model were collected, or manually derived by rotating an existing view when not available. Fewer window and door examples were included, as they represent rotation-invariant models, whereas objects that were either more
complex in nature or more highly-represented and diversified in the NYUset, such as chairs and bottles, were slightly oversampled (see Table 1). Further, we selected a second, larger subset (ShapeNetSet2, or SNS2), spread across the same object classes, with ten 2D views for each target category. Class-wise cardinalities are outlined in Table 1.

Table 1: Dataset statistics.

Object     ShapeNetSet1    ShapeNetSet2    NYUSet
Chair      14              10              1000
Bottle     12              10              920
Paper      8               10              790
Book       8               10              760
Table      8               10              726
Box        8               10              637
Window     6               10              617
Door       4               10              511
Sofa       8               10              495
Lamp       6               10              478
Total      82              100             6,934

3.2 Shape and Color Feature Matching
To tackle the first question on the relative importance of colour and shape features in recognising a specific class of objects, we conducted a first exploratory analysis on the NYUSet, i.e., we evaluated feature matching-based classification methods alone, leaving potential error propagation from segmentation faults out of the picture. On a similar note, since the segmented regions from the NYUset were extracted through a black mask, while 2D views from ShapeNet lay on a white background, the marginal noise surrounding both the input objects to classify and the reference views to match against had to be reduced. To achieve this, we (i) first converted each image to grayscale, (ii) applied global binary thresholding (or its inverse, depending on whether the input background was black or white, respectively), (iii) ran contour detection on cascade, and (iv) cropped the original RGB image to the contour of largest area.
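A minimal OpenCV sketch of this background-reduction step is shown below; the function name, the fixed threshold value of 127 and the bounding-box crop are illustrative assumptions rather than the exact settings of our pipeline.

```python
import cv2

def crop_to_largest_contour(bgr, white_background=False):
    """Steps (i)-(iv): grayscale, global (inverse) thresholding,
    contour detection, crop to the contour of largest area."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    mode = cv2.THRESH_BINARY_INV if white_background else cv2.THRESH_BINARY
    _, mask = cv2.threshold(gray, 127, 255, mode)
    # findContours returns (contours, hierarchy) in OpenCV 4.x and
    # (image, contours, hierarchy) in 3.x; [-2] covers both cases
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    if not contours:
        return bgr
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return bgr[y:y + h, x:x + w]
```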
We then framed the classification task as follows: a set of K ShapeNet models, M_c, is defined for c = 1, ..., N object classes of interest (i.e., N = 10 in this case). Let V_i be the set of 2D views available for each model m_i ∈ M_c, with i = 1, ..., K. Each input object to classify is thus matched against each single view v_j ∈ V_i, for all K models and for all N classes. The m_i determining the predicted label is then the argument optimising either a certain similarity or distance function, based on the following approaches.
Shape-only matching. Contours extracted from input samples were matched through the OpenCV built-in similarity function based on Hu moments [15], i.e. moments invariant to translation, rotation and scale. We tested three different variants of this method, with the distance metric between image moments set to the L1, L2, or L3 norm, respectively.
Colour-only matching, comparing the RGB histograms of the input image pairs. Similarly to the previous case, we relied on the OpenCV library and tested different comparison metrics, namely Correlation, Chi-square, Intersection and Hellinger distance.
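Both matchers boil down to computing a pairwise score between the query image and every reference view, and then taking the optimum over models. The sketch below illustrates this with OpenCV's built-in calls; the helper names, the 8-bin histogram size and the contour-extraction threshold are illustrative assumptions, while cv2.matchShapes (Hu-moment based, with CONTOURS_MATCH_I1/I2/I3 corresponding to the L1/L2/L3 variants) and cv2.compareHist are the library functions referred to above. OpenCV exposes the Hellinger metric as HISTCMP_BHATTACHARYYA.

```python
import cv2

def largest_contour(img, thresh=127):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    return max(contours, key=cv2.contourArea)

def shape_distance(img_a, img_b, method=cv2.CONTOURS_MATCH_I1):
    """Hu-moment based contour dissimilarity (lower = more similar)."""
    return cv2.matchShapes(largest_contour(img_a), largest_contour(img_b),
                           method, 0.0)

def colour_distance(img_a, img_b, method=cv2.HISTCMP_BHATTACHARYYA):
    """RGB histogram comparison; Bhattacharyya/Hellinger grows with dissimilarity."""
    def rgb_hist(img):
        hist = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        return cv2.normalize(hist, hist).flatten()
    return cv2.compareHist(rgb_hist(img_a), rgb_hist(img_b), method)

def predict_class(query, reference_views, distance=shape_distance):
    """reference_views: dict mapping class label -> list of 2D views (e.g. SNS1).
    Returns the label whose best-matching view minimises the distance."""
    scores = {label: min(distance(query, view) for view in views)
              for label, views in reference_views.items()}
    return min(scores, key=scores.get)
```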
Hybrid matching. The colour-only and shape-only similarity scores obtained in the previous steps were further combined, using three different objective functions. In all hybrid configurations, the selected ShapeNet model m_i was defined as:

    m_i = arg min Θ                                                  (1)

Let S and C be the scores obtained with shape-only and colour-only matching when matching all views against each input image, with α and β being their relative weights. Then, the weighted sum of scores is defined as:

    θ = αS + βC                                                      (2)

Since S is based on Hu-moment norms and should therefore be minimised, the inverse of C was taken in those cases where histogram comparison returned a similarity function with the opposite trend, i.e., for the Correlation and Intersection metrics. However, the set Θ was composed differently depending on the considered strategy. First, Θ_T included all θ_t, so that Θ_T = {θ_t : t = 1, ..., Σ_c Σ_i |V_i|}. Second, we averaged each θ by model (micro-average), creating Θ_Z and computing its arg min. For z = 1, ..., Σ_{c=1}^{N} |M_c|:

    θ_z = ( Σ_{v_j ∈ V_i} θ ) / |V_i|                                (3)

Finally, each θ was averaged by class (macro-average), before being added to Θ_C:

    θ_c = ( Σ_i Σ_{v_j ∈ V_i} θ ) / ( Σ_i |V_i| )                    (4)

We assessed results for equal importance of the contributing scores (i.e., α = 1, β = 1) and then for an increased relative importance of histogram comparison (i.e., α = 0.3, β = 0.7), based on the prior batch of tests.
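To make the three aggregation strategies concrete, a short sketch follows; storing the per-view combined scores θ in a dictionary keyed by (class, model, view) tuples is our own illustrative assumption, not the data structure of the released implementation.

```python
import numpy as np

ALPHA, BETA = 0.3, 0.7  # relative weights of shape (S) and colour (C) scores

def combine(shape_scores, colour_scores):
    """Per-view weighted sum theta = alpha*S + beta*C (Eq. 2); colour scores are
    assumed to be already inverted for metrics that grow with similarity."""
    return {key: ALPHA * shape_scores[key] + BETA * colour_scores[key]
            for key in shape_scores}

def predict(theta, strategy="weighted_sum"):
    """Winning class under Eq. 1, with theta keyed by (class, model, view)."""
    if strategy == "weighted_sum":      # arg min over every single view (Theta_T)
        return min(theta, key=theta.get)[0]
    if strategy == "micro_avg":         # average by model, then arg min (Eq. 3)
        by_model = {}
        for (cls, model, _), score in theta.items():
            by_model.setdefault((cls, model), []).append(score)
        return min(by_model, key=lambda k: np.mean(by_model[k]))[0]
    if strategy == "macro_avg":         # average by class, then arg min (Eq. 4)
        by_class = {}
        for (cls, _, _), score in theta.items():
            by_class.setdefault(cls, []).append(score)
        return min(by_class, key=lambda k: np.mean(by_class[k]))
    raise ValueError("unknown strategy: %s" % strategy)
```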
Cross-class cumulative accuracies are outlined in Table 2. Further class-wise details are left to Appendix A. In the hybrid trials, all combinations of shape-only and colour-only methods were evaluated, but we report only the configuration leading to the most consistent cumulative accuracy across all trials here, for the sake of brevity. The segmented objects in the NYUset were first matched against ShapeNetSet1 and, then, the latter was matched against different views collected under ShapeNetSet2, to control for the inherent characteristics of the NYU sample. In all described experiments, we took randomised label assignment as the reference baseline.

Table 2: Cumulative (cross-class) accuracy under comparison, for all configurations in the exploratory trials and for two data sets: (i) images in the NYUset matched against ShapeNetSet1 (SNS1), (ii) views in ShapeNetSet1 (SNS1) matched against ShapeNetSet2 (SNS2).

Approach                      NYU v. SNS1    SNS1 v. SNS2
Baseline                      0.10787        0.10
Shape only L1                 0.14350        0.18
Shape only L2                 0.14537        0.12
Shape only L3                 0.15835        0.19
Color only Correlation        0.15965        0.28
Color only Chi-square         0.14537        0.10
Color only Intersection       0.18777        0.29
Color only Hellinger          0.20637        0.32
Shape+Color (weighted sum)    0.20637        0.32
Shape+Color (micro-avg)       0.16945        0.28
Shape+Color (macro-avg)       0.16513        0.22
3.3 Matching Feature Descriptors
Based on the results obtained in Section 3.2, we then tested whether relying on more general descriptions of image features would increase the accuracy of ShapeNet-based matching. To more easily assess the marginal variation introduced by these methods with respect to the prior trials, we directly compared ShapeNetSet2 against the reference ShapeNetSet1, in a more controlled scenario.
For all these trials, we relied on OpenCV built-in methods and used brute-force matching. Using FLANN-based matching for optimised nearest-neighbour search did not lead to any performance gains compared to the brute-force approach, most likely due to the fairly limited size of the input datasets. Therefore, we refrain from reporting results obtained with FLANN-based matching.
SIFT. First introduced by Lowe [16], the SIFT algorithm is based on the main rationale of describing images through scale-invariant keypoints. We used the L2 norm as distance measure for the matching and trimmed the resulting matching keypoints to the second-nearest neighbour. A ratio test was then applied to select the best match among all reference 2D views at each iteration, as proposed in the original paper [16], setting the threshold to 0.75 and 0.5, respectively.
SURF was originally conceived to provide a more scalable alternative to SIFT, performing convolutions through square-shaped filters and, therefore, speeding up the computation [3]. Further, in SURF the keypoints are identified by maximising the determinant of the Hessian matrix for blob detection. We kept the remaining SURF settings unchanged in these trials and set the Hessian filter threshold to 400, so as not to overly reduce the output of the feature descriptor and thus retain sufficient richness in the representation.
ORB is another alternative approach to feature description, implemented in the OpenCV Labs and proposed in [28]. ORB combines FAST for corner-based keypoint detection [27] with improved feature descriptors derived from BRIEF [7], to accommodate rotation invariance. Since BRIEF descriptors are parsed to binary strings to reduce their dimensionality, we used the Hamming distance instead of the L2 norm for this latter set of experiments.
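As an illustration of this descriptor-based pipeline, the sketch below pairs ORB with brute-force Hamming matching and Lowe's ratio test; ranking reference views by the number of ratio-test survivors is an illustrative assumption on our side, and SIFT or SURF would use cv2.NORM_L2 instead of the Hamming norm.

```python
import cv2

orb = cv2.ORB_create()
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)  # brute-force matching, Hamming for ORB

def good_matches(query, view, ratio=0.75):
    """Number of keypoint matches surviving Lowe's ratio test."""
    _, des_q = orb.detectAndCompute(cv2.cvtColor(query, cv2.COLOR_BGR2GRAY), None)
    _, des_v = orb.detectAndCompute(cv2.cvtColor(view, cv2.COLOR_BGR2GRAY), None)
    if des_q is None or des_v is None:
        return 0
    pairs = matcher.knnMatch(des_q, des_v, k=2)  # two nearest neighbours per keypoint
    return sum(1 for p in pairs
               if len(p) == 2 and p[0].distance < ratio * p[1].distance)

def predict_class(query, reference_views):
    """Label of the reference 2D view collecting the most good matches."""
    best_label, best_score = None, -1
    for label, views in reference_views.items():
        for view in views:
            score = good_matches(query, view)
            if score > best_score:
                best_label, best_score = label, score
    return best_label
```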
                                                                            function and Adam optimiser.
bines FAST for corner-based keypoint detection [27] with improved
                                                                            We used ShapeNetSet2 as baseline to form a training set, compris-
feature descriptors derived from BRIEF [7], to accommodate for
                                                                            ing of 9,450 RGB image pairs, with 52% being examples of similar
rotation invariance. Since in BRIEF descriptors are parsed to bi-
                                                                            images and the remainder 48% being labelled as dissimilar pairs.
nary strings to reduce their dimensionality, we used the Hamming
                                                                            At training time, the learning rate was initialised to 0.0001 and its
distance instead of the L2 norm for this latter set of experiments.
                                                                            decay set to 1e−7. Training samples were fed in batches of size
During the evaluation, results were compared against randomised             16 to run over up to 100 epochs. An early stopping condition was
label assignment, similarly to 3.2. The obtained cumulative accuracy        defined so that training would stop if the ϵ of loss decrease was
values are summarised in Table 3. The details on class-wise results         lower than 1e−6 for more than 10 subsequent epochs. As a result,
obtained for the illustrated pipeline, are left to Appendix A.              training completed after 41 epochs, running on a NVIDIA Tesla
                                                                            P100 GPU.
3.4    Deep Neural Inexact Matching                                         Two different image sets were utilised on test: (i) 3,321 derived
Adapting the inexact-matching architecture proposed in [31] to our          from image pairs in ShapeNetSet1, and (ii) 8,200 paired examples,
purpose, we implemented a Keras pipeline on top of a Tensorflow             obtained after matching 100 images from the NYUset (where 10
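A simplified Keras sketch of this wiring is given below. The layer sizes are illustrative, and the patch-based Normalized-X-Corr block of [31] is approximated here by a location-wise correlation of L2-normalised feature maps, so this is a sketch of the overall structure (shared-weight branches, correlation, convolution and pooling, dense plus flatten, softmax) rather than our exact model, which is available in the Github repository.

```python
from keras import backend as K
from keras.layers import Input, Conv2D, MaxPooling2D, Lambda, Dense, Flatten, multiply
from keras.models import Model

def shared_branch(x, conv_a, conv_b, pool):
    """The same layer objects are applied to both inputs, so weights are shared."""
    return pool(conv_b(pool(conv_a(x))))

left = Input(shape=(60, 160, 3))   # input pairs resized to 60x160x3
right = Input(shape=(60, 160, 3))

conv_a = Conv2D(20, (5, 5), activation='relu', padding='same')
conv_b = Conv2D(25, (5, 5), activation='relu', padding='same')
pool = MaxPooling2D((2, 2))

feat_l = shared_branch(left, conv_a, conv_b, pool)
feat_r = shared_branch(right, conv_a, conv_b, pool)

# Stand-in for the Normalized-X-Corr block: correlate the two L2-normalised
# feature maps location by location (the original layer matches wider
# neighbourhoods across the two maps).
l2 = Lambda(lambda t: K.l2_normalize(t, axis=-1))
xcorr = multiply([l2(feat_l), l2(feat_r)])

x = MaxPooling2D((2, 2))(Conv2D(25, (3, 3), activation='relu', padding='same')(xcorr))
x = MaxPooling2D((2, 2))(Conv2D(25, (3, 3), activation='relu', padding='same')(x))
x = Dense(500, activation='relu')(x)      # dense layer followed by flattening,
x = Flatten()(x)                          # as described above
out = Dense(2, activation='softmax')(x)   # "similar" vs "dissimilar"

model = Model(inputs=[left, right], outputs=out)
```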
We used ShapeNetSet2 as the basis to form a training set, comprising 9,450 RGB image pairs, with 52% being examples of similar images and the remaining 48% being labelled as dissimilar pairs. At training time, the learning rate was initialised to 0.0001 and its decay set to 1e-7. Training samples were fed in batches of size 16, to run over up to 100 epochs. An early stopping condition was defined so that training would stop if the ϵ of loss decrease was lower than 1e-6 for more than 10 subsequent epochs. As a result, training completed after 41 epochs, running on a NVIDIA Tesla P100 GPU.
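Continuing the sketch above, the reported training configuration could be expressed as follows; pair_left, pair_right and pair_labels are placeholders for the 9,450 SNS2 training pairs and their one-hot similar/dissimilar labels, and the legacy Keras argument names (lr, decay) are an assumption tied to the Keras version of [9].

```python
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam

# Loss, optimiser, learning rate and decay as reported in the text
model.compile(optimizer=Adam(lr=1e-4, decay=1e-7),
              loss='categorical_crossentropy', metrics=['accuracy'])

# Stop when the training loss improves by less than 1e-6 for 10 epochs in a row
early_stop = EarlyStopping(monitor='loss', min_delta=1e-6, patience=10)

model.fit([pair_left, pair_right], pair_labels,
          batch_size=16, epochs=100, callbacks=[early_stop])
```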
Two different image sets were utilised at test time: (i) 3,321 pairs derived from images in ShapeNetSet1, and (ii) 8,200 paired examples, obtained after matching 100 images from the NYUset (where 10 were randomly picked from each of the 10 classes) with all views in ShapeNetSet1. The first experiment was conceived to check whether the Neural Network had learned to discriminate similar ShapeNet models, whereas the second one was meant to provide better insights on the results obtained in Sections 3.2 and 3.3. Experimental results on both configurations, summed up in Table 4, are discussed in the following Section.

Table 4: Class-wise evaluation of our Keras implementation of Normalized-X-Corr, on the two labeled test sets.

Dataset                    Measure      Similar    Dissimilar
ShapeNetSet1 pairs         Precision    0.09       0.00
                           Recall       1.00       0.00
                           F1-score     0.16       0.00
                           Support      295        3026
NYU+ShapeNetSet1 pairs     Precision    0.51       0.00
                           Recall       1.00       0.00
                           F1-score     0.67       0.00
                           Support      4160       4040

4 RESULTS AND DISCUSSION
Table 2 outlines how, with respect to cross-class cumulative accuracy, all configurations outperformed random label assignment. Interestingly enough, the weighted sum of shape- and colour-based scores was equal to the first-best results obtained with RGB histogram comparison alone. This could be due to a need to fine-tune the α and β parameters, or it could also indicate that colour-based features are more prominent when concluding about the recognised object. The latter hypothesis would align with another observation: shape-only trials led to the lowest cumulative accuracy values among all tested setups. These observations also hold when controlling for the input data and comparing SNS2 against SNS1 instead of the NYUset.
When taking a more careful look at class-wise results (as shown in Appendix A), one can notice how different approaches favoured different subsets of classes, when keeping the input data and boundary conditions constant, with only partial overlap across different pipelines and without any method completely outperforming the others in terms of cross-class consistency and robustness. On average, chairs were recognised more reliably than other classes, and most configurations led to unbalanced recognition, favouring certain classes at the expense of others. This also applies to the best-case scenario, when matching SNS2 against SNS1, indicating that the inadequacy of the explored methods in robustly identifying all classes is not to be ascribed solely to the quality and characteristics of the segmented areas within the NYU set. For instance, by looking at Table 8, it can be noted how the Paper and Window classes are not recognised in most cases, even though, overall, the obtained performance was higher than in Table 7, due to the fact that all compared models belonged to ShapeNet.
Based on these initial results, when representing the input image in terms of SURF, SIFT or ORB based feature descriptors, we started from matching SNS2 against SNS1, to evaluate whether further application to the NYUset was worthwhile. As shown in Table 9, results obtained for the latter configurations were not sufficient and lower than the ones obtained with the hybrid strategies (Table 8), leading to cumulative accuracies in the range of 22% to 25% (Table 3).
Besides the feature engineering trials, we re-framed similarity learning also with respect to the Siamese-like architecture introduced in [31]. Our Keras implementation of the Normalized-X-Corr model was trained to learn an optimised representation from comparing pairs of images from SNS2, i.e., 9,450 pairs roughly equally balanced between positive and negative examples, as introduced in Section 3.4. To achieve a more abstract and compact representation, where contributing features are not as clearly designed and discriminated as in the first set of experiments, the classification task was framed in binary terms at this stage. However, the tested setups led to unsatisfactory results that clearly indicate overfitting of the model (see Table 4). The even lower results obtained on the SNS1-derived test set can be further explained by looking at the imbalance between positive and negative examples, leading to a larger impact of false positives on the overall performance. The incidence of false positives is also partially caused by how the Normalized-X-Corr architecture was originally conceived [31], i.e., to match wider areas, accommodating for varying viewpoint and luminance conditions. However, the results obtained suggest that the chosen training set, which fed all possible permutations of pairs in SNS2 to also minimise the number of required input labels, did not introduce sufficient variability, resulting in representations that did not generalise, even on unseen ShapeNet models. Further, the original use case of the exploited architecture was person re-identification, hinting towards further tweaking of the framework and hyperparameter tuning to scale to multiple, and more diverse, object classes than simply human silhouettes.

5 CONCLUSION
In this work, we tested for the relative importance of shape- and colour-based features in light of both cross-class and class-wise evaluation, and in experimental settings where we could control for the introduction of segmentation faults that would normally propagate from pre-processing steps. For this reason, we relied on a combination of a pre-segmented data set, i.e., RGB images from the large NYUDepth V2 set [21], and based similarity matching on a subset of 2D models derived from the ShapeNet dataset [8]. Although features derived from comparing the RGB histograms of the input images led, on average, to more consistent performances, none of the experimented pipelines ensured satisfactory results in terms of robustness to class variation. Further, when adopting more general, scale-, shift- and rotation-invariant image representations, the accuracy of classification by similarity matching was not sufficient, even when evaluating against alternative models all belonging to ShapeNet, hence controlling for other boundary conditions. Finally, the application of the Normalized-X-Corr architecture for inexact matching [31], formerly introduced in the context of person re-identification, led to overfitting and did not allow for subsequent application on unseen data sets and real-life settings.
All these findings confirmed the need for more scalable methods, capable of leveraging labelled and unlabelled data points, when learning object similarities with respect to diverse categories and taxonomies that imply high within-class heterogeneity. We therefore intend to modify the tested architecture accordingly, to improve its flexibility, while also increasing the heterogeneity of our datasets
(e.g., by representing a higher number of classes, and by augmenting the cardinality of each class), for further application on RGB frames captured by a mobile robot in a real-life scenario.

ACKNOWLEDGMENTS
The authors would like to acknowledge the Center for Health Organization Transformation (CHOT) for graciously providing the GPU computing resources exploited for some of the experiments presented in this work, and Rajeev Bhatt Ambati (Pennsylvania State University) for his patient and kind assistance on server-access-related matters.

REFERENCES
[1] Markus Bajones, David Fischinger, Astrid Weiss, Daniel Wolf, Markus Vincze, Paloma de la Puente, Tobias Körtner, Markus Weninger, Konstantinos Papoutsakis, Damien Michel, et al. 2018. Hobbit: Providing Fall Detection and Prevention for the Elderly in the Real World. Journal of Robotics 2018 (2018).
[2] Emanuele Bastianelli, Gianluca Bardaro, Ilaria Tiddi, and Enrico Motta. 2018. Meet HanS, the Health&Safety Autonomous Inspector. CEUR Workshop Proceedings.
[3] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. 2006. SURF: Speeded up robust features. In European Conference on Computer Vision. Springer, 404–417.
[4] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. 2016. Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision. Springer, 850–865.
[5] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. 2017. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1. 7.
[6] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1994. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems. 737–744.
[7] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. 2010. BRIEF: Binary robust independent elementary features. In European Conference on Computer Vision. Springer, 778–792.
[8] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. 2015. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015).
[9] François Chollet et al. 2015. Keras: Deep learning library for Theano and TensorFlow. https://keras.io (2015).
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248–255.
[11] Chester V Dolph, Loc Tran, and Bonnie D Allen. 2018. Towards Explainability of UAV-Based Convolutional Neural Networks for Object Classification. In 2018 Aviation Technology, Integration, and Operations Conference. 4011.
[12] Thomas Fäulhammer, Rares Ambrus, Christopher Burbridge, Micheal Zillich, John Folkesson, Nick Hawes, Patric Jensfelt, and Marcus Vincze. 2017. Autonomous learning of object models on a mobile robot. IEEE Robotics and Automation Letters 2, 1 (2017), 26–33.
[13] Gabriele Ferri, Alessandro Manzi, Pericle Salvini, Barbara Mazzolai, Cecilia Laschi, and Paolo Dario. 2011. DustCart, an autonomous robot for door-to-door garbage collection: From DustBot project to the experimentation in the small town of Peccioli. In Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 655–660.
[14] Elad Hoffer and Nir Ailon. 2015. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition. Springer, 84–92.
[15] Ming-Kuei Hu. 1962. Visual pattern recognition by moment invariants. IRE Transactions on Information Theory 8, 2 (1962), 179–187.
[16] David G Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 2 (2004), 91–110.
[17] Nawel Medjkoune, Frédéric Armetta, Mathieu Lefort, and Stefan Duffner. 2017. Autonomous object recognition in videos using Siamese Neural Networks. In EUCognition Meeting (European Society for Cognitive Systems) on "Learning: Beyond Deep Neural Networks".
[18] Martino Mensio, Emanuele Bastianelli, Ilaria Tiddi, and Giuseppe Rizzo. 2018. A Multi-layer LSTM-based Approach for Robot Command Interaction Modeling. arXiv preprint arXiv:1811.05242 (2018).
[19] George A Miller. 1995. WordNet: a lexical database for English. Commun. ACM 38, 11 (1995), 39–41.
[20] Christophe Mollaret, Alhayat Ali Mekonnen, Julien Pinquier, Frédéric Lerasle, and Isabelle Ferrané. 2016. A multi-modal perception based architecture for a non-intrusive domestic assistant robot. In Human-Robot Interaction (HRI), 2016 11th ACM/IEEE International Conference on. IEEE, 481–482.
[21] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. Indoor Segmentation and Support Inference from RGBD Images. In ECCV.
[22] Andreas Nüchter and Joachim Hertzberg. 2008. Towards semantic maps for mobile robots. Robotics and Autonomous Systems 56, 11 (2008), 915–926.
[23] Joseph Redmon and Ali Farhadi. 2017. YOLO9000: better, faster, stronger. arXiv preprint (2017).
[24] Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
[25] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis & Machine Intelligence 6 (2017), 1137–1149.
[26] Petar Ristoski and Heiko Paulheim. 2016. Semantic Web in data mining and knowledge discovery: A comprehensive survey. Web Semantics: Science, Services and Agents on the World Wide Web 36 (2016), 1–22.
[27] Edward Rosten and Tom Drummond. 2006. Machine learning for high-speed corner detection. In European Conference on Computer Vision. Springer, 430–443.
[28] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. 2011. ORB: An efficient alternative to SIFT or SURF. In Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2564–2571.
[29] Jian Shi, Yue Dong, Hao Su, and Stella X Yu. 2017. Learning non-lambertian object intrinsics across ShapeNet categories. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 5844–5853.
[30] Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In AAAI. 4444–4451.
[31] Arulkumar Subramaniam, Moitreya Chatterjee, and Anurag Mittal. 2016. Deep neural networks with inexact matching for person re-identification. In Advances in Neural Information Processing Systems. 2667–2675.
[32] Ilaria Tiddi, Emanuele Bastianelli, Gianluca Bardaro, Mathieu d'Aquin, and Enrico Motta. 2017. An ontology-based approach to improve the accessibility of ROS-based robotic systems. In Proceedings of the Knowledge Capture Conference. ACM, 13.
[33] Yu Xiang and Dieter Fox. 2017. DA-RNN: Semantic mapping with data associated recurrent neural networks. arXiv preprint arXiv:1703.03098 (2017).
[34] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. 2014. Deep metric learning for person re-identification. In Pattern Recognition (ICPR), 2014 22nd International Conference on. IEEE, 34–39.
[35] Andy Zeng, Shuran Song, Kuan-Ting Yu, Elliott Donlon, Francois R Hogan, Maria Bauza, Daolin Ma, Orion Taylor, Melody Liu, Eudald Romo, et al. 2018. Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 1–8.

A TABLES OF RESULTS
The results enclosed in Tables 5, 6, and 7 refer to all exploratory tests run when images from the NYUset were matched against ShapeNetSet1 (SNS1) and evaluated class by class. Table 8 summarizes the results obtained when combining shape-only and color-only scores computed on images from ShapeNetSet2 (SNS2) and performing the matching against instances of ShapeNetSet1 (SNS1). Similarly, class-wise results in Table 9 refer to matching feature descriptors of views in SNS1 against descriptors of models in SNS2.
                          Table 5: Class-wise results obtained when matching only based on shape.

   Approach       Measure     Chair       Bottle      Paper      Book       Table      Box       Window       Door      Sofa      Lamp
                  Accuracy    0.15600     0.10543     0.11899    0.10132    0.11846    0.08948   0.08104      0.07241   0.09899   0.09414
                  Precision   0.02250     0.01399     0.01356    0.01110    0.01240    0.00822   0.00721      0.00534   0.00707   0.00649
   Baseline
                  Recall      0.15600     0.10543     0.11899    0.10132    0.11846    0.08948   0.08104      0.07241   0.09899   0.09414
                  F1-score    0.03932     0.02470     0.02434    0.02002    0.02245    0.01506   0.01324      0.00994   0.01319   0.01214
                  Accuracy    0.25900     0.39565     0.04810    0.00132    0.15702    0.00471   0.00000      0.00783   0.36768   0.06276
                  Precision   0.03735     0.05249     0.00548    0.00014    0.01644    0.00043   0.00000      0.00058   0.02625   0.00433
   L1
                  Recall      0.25900     0.39565     0.04810    0.00132    0.15702    0.00471   0.00000      0.00783   0.36768   0.06276
                  F1          0.06529     0.09269     0.00984    0.00026    0.02977    0.00079   0.00000      0.00107   0.04900   0.00809
                  Accuracy    0.08500     0.81413     0.00759    0.00132    0.03581    0.00157   0.00000      0.00978   0.24444   0.02929
                  Precision   0.01226     0.10802     0.00087    0.00014    0.00375    0.00014   0.00000      0.00072   0.01745   0.00202
   L2
                  Recall      0.08500     0.81413     0.00759    0.00132    0.03581    0.00157   0.00000      0.00978   0.24444   0.02929
                  F1          0.02143     0.19073     0.00155    0.00026    0.00679    0.00026   0.00000      0.00134   0.03258   0.00378
                  Accuracy    0.32700     0.46413     0.04557    0.00395    0.07989    0.01099   0.00162      0.00978   0.32121   0.15690
                  Precision   0.04716     0.06158     0.00519    0.00043    0.00836    0.00101   0.00014      0.00072   0.02293   0.01082
   L3
                  Recall      0.32700     0.46413     0.04557    0.00395    0.07989    0.01099   0.00162      0.00978   0.32121   0.15690
                  F1          0.08243     0.10873     0.00932    0.00078    0.01514    0.00185   0.00026      0.00134   0.04281   0.02024

               Table 6: Class-wise results obtained when comparing RGB histograms (baseline same as Table 5)

Matching metric       Measure     Chair      Bottle      Paper      Book       Table      Box       Window       Door      Sofa      Lamp
                      Accuracy    0.56500    0.04130     0.20506    0.09211    0.03581    0.06750   0.08104      0.03327   0.14949   0.12971
                      Precision   0.08148    0.00548     0.02336    0.01010    0.00375    0.00620   0.00721      0.00245   0.01067   0.00894
Correlation
                      Recall      0.56500    0.04130     0.20506    0.09211    0.03581    0.06750   0.08104      0.03327   0.14949   0.12971
                      F1          0.14243    0.00968     0.04195    0.01820    0.00679    0.01136   0.01324      0.00457   0.01992   0.01673
                      Accuracy    0.48900    0.00000     0.00000    0.00921    0.13085    0.04710   0.44408      0.00196   0.00000   0.23431
                      Precision   0.07052    0.00000     0.00000    0.00101    0.01370    0.00433   0.03952      0.00014   0.00000   0.01615
Chi-square
                      Recall      0.48900    0.00000     0.00000    0.00921    0.13085    0.04710   0.44408      0.00196   0.00000   0.23431
                      F1          0.12327    0.00000     0.00000    0.00182    0.02480    0.00792   0.07257      0.00027   0.00000   0.03022
                      Accuracy    0.57200    0.19565     0.30886    0.01447    0.03581    0.01884   0.01945      0.04892   0.38182   0.06485
                      Precision   0.08249    0.02596     0.03519    0.00159    0.00375    0.00173   0.00173      0.00361   0.02726   0.00447
Intersection
                      Recall      0.57200    0.19565     0.30886    0.01447    0.03581    0.01884   0.01945      0.04892   0.38182   0.06485
                      F1          0.14419    0.04584     0.06318    0.00286    0.00679    0.00317   0.00318      0.00672   0.05088   0.00836
                      Accuracy    0.53800    0.08370     0.38228    0.01974    0.03168    0.03925   0.44895      0.05284   0.24242   0.05649
                      Precision   0.07759    0.01110     0.01110    0.00216    0.00332    0.00361   0.03995      0.00389   0.01731   0.00389
Hellinger
                      Recall      0.53800    0.08370     0.38228    0.01974    0.03168    0.03925   0.44895      0.05284   0.24242   0.05649
                      F1          0.13562    0.01961     0.02158    0.00390    0.00601    0.00660   0.07337      0.00725   0.03231   0.00729
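The colour-based matching in Table 6 relies on standard histogram-comparison metrics, all available in OpenCV as cv2.HISTCMP_CORREL, cv2.HISTCMP_CHISQR, cv2.HISTCMP_INTERSECT and cv2.HISTCMP_HELLINGER. The sketch below, with an assumed 8x8x8 binning, illustrates the comparison step only.

    import cv2

    def rgb_histogram(img, bins=(8, 8, 8)):
        # 3D colour histogram over the full 0-255 range, normalised and flattened.
        hist = cv2.calcHist([img], [0, 1, 2], None, list(bins),
                            [0, 256, 0, 256, 0, 256])
        return cv2.normalize(hist, hist).flatten()

    def colour_distance(img_a, img_b, method=cv2.HISTCMP_HELLINGER):
        # Correlation and intersection grow with similarity, whereas
        # chi-square and Hellinger are dissimilarities (lower is better).
        return cv2.compareHist(rgb_histogram(img_a), rgb_histogram(img_b), method)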




Table 7: Class-wise results obtained when combining L3 norm-based Hu moment matching with Hellinger distance-based RGB
histogram comparison, where class labels are determined by minimising: (i) the weighted sum of scores, (ii) the micro-
average of scores, and (iii) the macro-average of scores. We report the weight configuration that yielded the most consistent
results among those tested, i.e., α = 0.3, β = 0.7. See Table 5 for the reference baseline.

  Argmin function    Measure       Chair      Bottle             Paper          Book               Table         Box              Window          Door           Sofa      Lamp
                     Accuracy      0.65300    0.14891            0.12658        0.00526            0.10055       0.02512          0.29011         0.05871        0.28081   0.20921
                     Precision     0.09417    0.01976            0.01442        0.00058            0.01053       0.00231          0.02581         0.00433        0.02005   0.01442
  Weighted Sum
                     Recall        0.65300    0.14891            0.12658        0.00526            0.10055       0.02512          0.29011         0.05871        0.28081   0.20921
                     F1            0.16461    0.03489            0.02589        0.00104            0.01906       0.00423          0.04741         0.00806        0.03742   0.02698
                     Accuracy      0.37800    0.13587            0.18861        0.02105            0.04821       0.07064          0.37925         0.10568        0.22626   0.07741
                     Precision     0.05451    0.01803            0.02149        0.00231            0.00505       0.00649          0.03375         0.00779        0.01615   0.00534
  Micro-average
                     Recall        0.37800    0.13587            0.18861        0.02105            0.04821       0.07064          0.37925         0.10568        0.22626   0.07741
                     F1            0.09529    0.03183            0.03858        0.00416            0.00914       0.01189          0.06198         0.01451        0.03015   0.00998
                     Accuracy      0.39000    0.15543            0.39241        0.00000            0.11846       0.06750          0.00000         0.00000        0.29495   0.05649
                     Precision     0.05624    0.02062            0.04471        0.00000            0.01240       0.00620          0.00000         0.00000        0.02106   0.00389
  Macro-average
                     Recall        0.39000    0.15543            0.39241        0.00000            0.11846       0.06750          0.00000         0.00000        0.29495   0.05649
                     F1            0.09831    0.03641            0.08027        0.00000            0.02245       0.01136          0.00000         0.00000        0.03931   0.00729
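The combined matcher behind Table 7 fuses a shape distance and a colour distance. Assuming that α weights the L3 Hu-moment term and β the Hellinger histogram term, the weighted-sum variant can be sketched as below (reusing rgb_histogram from the previous sketch); the micro- and macro-average variants differ only in how per-reference scores are aggregated before taking the argmin.

    import cv2

    ALPHA, BETA = 0.3, 0.7  # weight configuration reported in Table 7

    def combined_distance(query_rgb, ref_rgb, query_grey, ref_grey):
        # Both terms are dissimilarities, so the weighted sum can be minimised directly.
        shape = cv2.matchShapes(query_grey, ref_grey, cv2.CONTOURS_MATCH_I3, 0.0)
        colour = cv2.compareHist(rgb_histogram(query_rgb), rgb_histogram(ref_rgb),
                                 cv2.HISTCMP_HELLINGER)
        return ALPHA * shape + BETA * colour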

                                 Table 8: Same setup as Table 7, but matching SNS2 against SNS1.

          Argmin function        Measure      Chair        Bottle         Paper          Book         Table         Box     Window          Door         Sofa      Lamp
                                 Accuracy         0.90          0.10          0.00          0.20          0.30      0.10          0.00          0.50      0.40      0.70
                                 Precision        0.09          0.01          0.00          0.02          0.03      0.01          0.00          0.05      0.04      0.07
          Weighted Sum
                                 Recall           0.90          0.10          0.00          0.20          0.30      0.10          0.00          0.50      0.40      0.70
                                 F1               0.16          0.02          0.00          0.04          0.05      0.02          0.00          0.09      0.07      0.13
                                 Accuracy         0.80          0.10          0.00          0.30          0.20      0.20          0.10          0.60      0.30      0.20
                                 Precision        0.08          0.01          0.00          0.03          0.02      0.02          0.01          0.06      0.03      0.02
          Micro-average
                                 Recall           0.80          0.10          0.00          0.30          0.20      0.20          0.10          0.60      0.30      0.20
                                 F1               0.15          0.02          0.00          0.05          0.04      0.04          0.02          0.11      0.05      0.04
                                 Accuracy         0.70          0.60          0.00          0.00          0.10      0.10          0.00          0.00      0.60      0.10
                                 Precision        0.07          0.06          0.00          0.00          0.01      0.01          0.00          0.00      0.06      0.01
          Macro-average
                                 Recall           0.70          0.60          0.00          0.00          0.10      0.10          0.00          0.00      0.60      0.10
                                 F1               0.13          0.11          0.00          0.00          0.02      0.02          0.00          0.00      0.11      0.02

Table 9: Class-wise results obtained when matching feature descriptors derived from SIFT, SURF and ORB. We report the
configuration that yielded the most consistent results among those tested, i.e., a ratio test threshold of 0.5.

              Approach    Measure          Chair     Bottle        Paper         Book          Table         Box      Window             Door      Sofa     Lamp
                          Accuracy         0.30          0.30          0.00          0.40          0.00      0.40          0.30          0.20      0.30      0.30
                          Precision        0.03          0.03          0.00          0.04          0.00      0.04          0.03          0.02      0.03      0.03
              SIFT
                          Recall           0.30          0.30          0.00          0.40          0.00      0.40          0.30          0.20      0.30      0.30
                          F1               0.05          0.05          0.00          0.07          0.00      0.07          0.05          0.04      0.05      0.05
                          Accuracy         0.70          0.10          0.00          0.10          0.10      0.00          0.30          0.30      0.30      0.30
                          Precision        0.07          0.01          0.00          0.01          0.01      0.00          0.03          0.03      0.03      0.03
              SURF
                          Recall           0.70          0.10          0.00          0.10          0.10      0.00          0.30          0.30      0.30      0.30
                          F1               0.13          0.02          0.00          0.02          0.02      0.00          0.05          0.05      0.05      0.05
                          Accuracy         0.10          0.70          0.00          0.20          0.10      0.00          0.30          0.20      0.40      0.50
                          Precision        0.01          0.07          0.00          0.02          0.01      0.00          0.03          0.02      0.04      0.05
              ORB
                          Recall           0.10          0.70          0.00          0.20          0.10      0.00          0.30          0.20      0.40      0.50
                          F1               0.02          0.13          0.00          0.04          0.02      0.00          0.05          0.04      0.07      0.09
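The local-descriptor pipelines in Table 9 score a query against a reference by counting matches that survive a Lowe-style ratio test at the reported threshold of 0.5. The sketch below is a generic illustration rather than the exact implementation: depending on the OpenCV build, SIFT and SURF may live under cv2.xfeatures2d, and ORB descriptors require a Hamming norm.

    import cv2

    def count_good_matches(img_a, img_b, detector, norm=cv2.NORM_L2, ratio=0.5):
        # detector: e.g. cv2.SIFT_create() or cv2.ORB_create()
        # (pass norm=cv2.NORM_HAMMING for ORB's binary descriptors).
        _, des_a = detector.detectAndCompute(img_a, None)
        _, des_b = detector.detectAndCompute(img_b, None)
        if des_a is None or des_b is None:
            return 0
        matches = cv2.BFMatcher(norm).knnMatch(des_a, des_b, k=2)
        # Keep a match only if it is clearly better than the second-best candidate.
        return sum(1 for pair in matches
                   if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance)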
