Robots with Commonsense: Improving Object
Recognition through Size and Spatial Awareness
Agnese Chiatti1, Enrico Motta1 and Enrico Daga1
1 Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom


Abstract
To effectively assist us with our daily tasks, service robots need object recognition methods that perform robustly in dynamic environments. Our prior work has shown that augmenting Deep Learning (DL) methods with knowledge-based reasoning can drastically improve the reliability of object recognition systems. This paper proposes a novel method to equip DL-based object recognition with the ability to reason on the typical size and spatial relations of objects. Experiments in a real-world robotic scenario show that the proposed hybrid architecture significantly outperforms DL-only solutions.

Keywords
commonsense reasoning, visual intelligence, hybrid intelligence, service robotics




1. Introduction
Robots can be of assistance in many scenarios where it is unsafe or impractical for us to
intervene. For instance, they can help with waste disposal [1]; teleoperated patient care,
particularly when social distance needs to be maintained [2]; search and rescue operations
[3] and other tasks. However, operating in real-world scenarios is challenging because it
requires correctly interpreting the diverse data collected through the robot’s sensors [4]. Our
approach to this sensemaking problem focuses on the modality of vision. That is, our focus is
on equipping robots with high-performance Visual Intelligence, defined as "the ability to use
their vision system, reasoning components and external knowledge sources to make sense of
their environment" [5]. Consider HanS [6], the robot assistant that is currently undergoing
development at the Knowledge Media Institute (KMi). HanS’ job is to monitor the Lab in search
of potential Health and Safety (H&S) threats: e.g., a paper cup left on top of a portable heater or
a dangling cable in the middle of a corridor. As a necessary precondition to assessing correctly
the risks posed by these situations, the robot first needs to reliably recognise the different objects
involved, i.e., the paper cup and heater in question.
   Currently, the de facto methodology for tackling object recognition tasks is to rely on Deep
Learning (DL) methods. However, despite the breakthroughs that have been recently enabled

In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI
2022 Spring Symposium on Machine Learning and Knowledge Engineering for Hybrid Intelligence (AAAI-MAKE
2022), Stanford University, Palo Alto, California, USA, March 21–23, 2022.
" agnese.chiatti@open.ac.uk (A. Chiatti); enrico.motta@open.ac.uk (E. Motta); enrico.daga@open.ac.uk
(E. Daga)
 0000-0003-3594-731X (A. Chiatti); 0000-0003-0015-1592 (E. Motta); 0000-0002-3184-540 (E. Daga)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
by DL methods, there are also significant limitations. In particular, as Cantwell Smith points
out [7], state-of-the-art Artificial Intelligence (AI) technologies exhibit only a "reckoning" of
the complete, deliberative thought processes that characterise human cognition. Hence, to
shift from pattern recognition to higher-level cognition and sensemaking, robots need access
to knowledge representations that are more comprehensive, transparent and explainable than
those embedded in neural architectures [8, 9]. Consistently with this view, our previous work
[5] has emphasised the key role played by commonsense knowledge in visual intelligence -
see also [10, 11, 12, 13, 14]. In particular, as in the case of HanS, when robots are deployed in
authentic problem-solving scenarios, it is important that they are both able to learn from their
experiences and also able to assess the typicality and plausibility of a situation – e.g., that it is
more likely to find a fire extinguisher, rather than a bottle, next to a fire extinguisher sign.
   Specifically, in our prior work [5], we defined a set of required knowledge properties and
reasoning capabilities that can enhance the Visual Intelligence of a robot. Among other things,
this analysis pointed out that an awareness of both object sizes and spatial relations between
objects has the potential to significantly improve the performance of object recognition systems.
Hence, in [15], we developed a framework for representing object sizes, which can be used to
enhance DL-based object recognition in the real-world scenario of autonomous H&S monitoring.
Furthermore, in [16], we introduced a formal model of spatial representation, which fits the
requirements of robot sensemaking. In the proposed representation, formally-defined spatial
relations are mapped to a set of commonsense predicates, which are based on the types of
prepositions that we use to describe and discuss space in natural language [17].
   This paper builds on our prior work and introduces the following key contributions:

    • we provide a methodology for extracting the commonsense spatial relations defined in
      [16] from large-scale multi-modal collections. In particular, we instantiate this approach
      in the case of the Visual Genome benchmark [18];

    • we extend the work in [15] by automating the construction of a Knowledge Base (KB) about
      object sizes, capitalising on measurements gathered from ShapeNet [19] and Amazon;

    • we discuss a robust evaluation of the resulting hybrid architecture, which integrates
      DL object recognition with size and spatial reasoning modules, in a real-world robotic
      scenario. In this context, we also test two meta-reasoning strategies, which provide
      alternative ways of combining the results of the reasoning modules introduced.


2. Related work
The work presented in this paper is situated at the intersection of different sub-fields of AI. First,
we capitalise on the recent advances enabled by object recognition methods based on Deep
Learning, and, in particular, on Deep Metric Learning methods, as further motivated in Section
2.1. Second, we contribute to the growing body of literature on developing hybrid intelligent
systems, that can effectively leverage DL-based and knowledge-based components. Lastly, we
contribute to the state-of-the-art in knowledge representation, in particular with respect to
formalising and reasoning about size and spatial relations.
2.1. Object Recognition with Deep Metric Learning
Deep learning has enabled remarkable progress on several Computer Vision benchmarks
[20, 21, 22]. Object recognition methods based on Deep Learning, however, suffer from certain
drawbacks. These include - to name just a few: (i) the inability to learn from limited data; (ii)
catastrophic forgetting, i.e., insufficient ability to retain the previously-learned information;
(iii) the issue of disentanglement, i.e., limitations in discerning, recomposing, and reusing
the separable constituents of observations; (iv) generalisation issues beyond the learned data
distribution. The dependency of DL performance on the availability of large amounts of data,
in particular, has inspired the development of few-shot Metric Learning methods. Deep Metric
Learning (DML) is the task of learning a feature space where similar objects lie close to one
another and far away from dissimilar objects. Specifically, in a few-shot learning
scenario, only a few training examples are used. The typical DML setup consists of multiple
pipelines based on Convolutional Neural Networks (CNN), yielding a so-called multi-branch
network. Features extracted in each branch are compared through a similarity-based or through
a distance-based function. In the case of Siamese Networks [23], two twin branches, fed with
similar image examples, are updated with identical parameters. Triplet Networks [24] extend
[23] by additionally feeding a positive and a negative example to contrast against the
input image. Unlike traditional classification models based on DL, which are usually optimised
over a pre-determined set of object categories, DML methods are targeted at comparing objects
by similarity. Thus, the learned feature space can be more easily adapted as soon as novel object
instances are observed. This characteristic of DML methods makes them particularly suitable
for supporting robot sensemaking [25] – and indeed DML has been successfully applied in a
number of robotic applications. In [26], a Triplet Network was used to recognise customers of a
cafe in a human-robot collaboration scenario. In [27], a Siamese Network was used to derive
optimal grasping poses for picking objects through a robotic arm. Zeng et al. [28] exploited
DML to win the object stowing task at the latest Amazon Robotic Challenge. Thus, we will
consider the methods presented in [28] as DL baselines. This choice also allows us to compare
the results presented in this paper with our prior trials [15].
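For illustration, the triplet objective underlying this family of methods can be sketched as follows (PyTorch); the backbone initialisation and the margin value are illustrative and do not reproduce the exact training setup of [28].

```python
import torch.nn as nn
from torchvision import models

# Shared ResNet50 backbone producing embedding vectors. The branch weights are tied,
# so the same module is applied to the anchor, positive and negative crops.
backbone = models.resnet50()
backbone.fc = nn.Identity()  # use the pooled features as the embedding

# L2-based triplet loss; the margin is an illustrative value.
triplet_loss = nn.TripletMarginLoss(margin=0.2, p=2)

def triplet_step(anchor, positive, negative):
    """Compute the loss for one (anchor, positive, negative) image triplet."""
    emb_a, emb_p, emb_n = backbone(anchor), backbone(positive), backbone(negative)
    # Pull the positive closer to the anchor than the negative, by at least the margin.
    return triplet_loss(emb_a, emb_p, emb_n)
```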

2.2. Hybrid Intelligent Systems
The limitations of DL methods have inspired the development of hybrid methods, which enhance
DL by means of knowledge-based approaches. As described in [29], knowledge-based compo-
nents can be integrated at four different levels of a Deep Neural Network: (i) in pre-processing,
i.e., as input to the network; (ii) within the intermediate layers; (iii) as a part of the architec-
tural topology or optimisation function or (iv) after applying DL, i.e., in post-processing. First,
background knowledge can help to augment the training data. For instance, in [30], training
examples were completed with Web-retrieved images of newly-encountered objects, to facilitate
open-world object recognition. A second and increasingly popular trend in Computer Vision
fuses visual data and symbolic knowledge within multi-modal embeddings [31, 32]. Encoder
architectures have been used to combine visual features, extracted through a CNN, with a
vectorised representation of the image context in a Knowledge Graph (KG) [32]. Similarly,
in [31] object attributes have been encoded in multi-relational embeddings. However, these
methods suffer from the same disentanglement and explanation issues of traditional DL. Namely,
the contribution of the different embedding features is difficult to pinpoint. More transparent
than multi-modal embedding methods, approaches in the third group are aimed either at (i)
modelling the reasoning process through the topology of a KG [33, 34], or at (ii) introducing
modular knowledge-based components, that are trainable end-to-end [35]. For instance, [33]
proposes to learn how to optimally traverse a KG via Deep Reinforcement Learning, to clarify
the reasoning chains involved in Visual Question Answering. Another approach to hybrid
reasoning is validating the DL predictions in post-processing [36, 37, 15]. Compared to hybrid
methods based on end-to-end training, this class of methods offers the advantage of modularity,
because it is agnostic to the specific DL architecture used. Moreover, it facilitates incremental
knowledge updates by querying external KBs, as new object entities are observed. The work in
[36] considered the object predictions generated via DL as a baseline to represent the spatial
context of unknown objects. The system in [37] relied on expert knowledge to classify previ-
ously unseen classes, based on object parts detected via DL. Crucially, the post-hoc approach
makes it possible to decouple the DL predictions from the knowledge-based predictions and, thus, to more
precisely assess how the different architectural modules contribute to the overall performance.
This characteristic is an important precondition to identifying the strengths, weaknesses and
complementarities of the DL-based and knowledge-based components. In [15] we proposed a
system to validate DL predictions based on background knowledge about object sizes. Thanks
to the modularity of the approach, we were able to conduct a detailed ablation study, to evaluate
the effect of introducing size awareness over the baseline DL performance. In this paper, we
build on this work, maintaining a post-hoc approach to hybrid reasoning.

2.3. Representing and reasoning about object sizes
Cognitive studies have suggested that the human brain maintains a canonical representation of
the physical size of objects, which is functional to imagining and categorising them [38, 39, 40].
Inspired by the work of [38], [41] proposes to build a size graph, where nodes are modelled
as log-normal distributions over Web-retrieved object dimensions. Similarly, [42] proposes to
represent size as a statistical distribution of Web-retrieved measurements. [43] also collected
physical size measurements from a combination of Freebase [44], Amazon and Ebay. Since the
latter representations were aggregated from multiple measurements over large-scale databases,
they are more likely to capture the contour variance among entities of the same class [40] – e.g.,
a short novel and a dictionary are both books, despite their size differences. However, these
representations consider at most one feature contributing to size. Converging towards a single
size feature is sufficient to identify broader groups of small and large objects, but hinders the
classification of finer-grained object categories. For example, while the volume of a recycling
bin may be comparable to the volume occupied by a coat stand, we would also need to know
that the former is thicker than the latter, to discriminate between the two on the basis of size
information. In [15], we proposed a representation of size information, where the cognitive
features examined in [40] were condensed according to three coarse, class-level descriptors.
These are: surface area, thickness, and aspect ratio. However, the resulting KB of standard object
sizes, presented in [15], was manually curated. In this paper, we propose a method to extract
these features automatically, from the size measurements provided with large-scale KBs.
2.4. Representing spatial relations between objects
The problem of building machine-treatable representations of space has been actively researched
for decades, producing a plethora of formal spatial representations [45]. In robotics, much work
has been devoted to grounding perceptual observations into higher-level symbolic representa-
tions [46, 47, 48, 9]. In this context, the semantic map representational model has been widely
adopted [48, 9]. This is characterised as a map that, "in addition to spatial information about
the environment", also contains "assignments of mapped features to entities of known classes"
[46]. Attempts have been made at bridging the gap between semantic maps and formal spatial
representations [36, 49, 50]. The authors of [36] focused on representing the closeness of objects
in a semantic map through Ring Calculus. Kunze et al. [49] proposed to model spatial regions
as point-like objects on the fixed 2D plane defined by a tabletop surface. Similarly, Deeken et al.
[50] modelled a simplified scenario where the robot’s viewpoint is aligned, at any point in time,
with the coordinate system of the map, as well as with the orientation of the observed objects.
   The latter representations exhibit two main shortcomings. First, they are unsuitable to
model the case of a robot that is interpreting the environment whilst navigating it, because
they are dependent on a fixed frame of reference, or coordinate system. Second, they define
Qualitative Spatial Relations (QSR) - e.g., North of, West of, which are not aligned with natural
language discourse, thus hindering Human-Robot Interaction. For instance, we want our robot
to be capable of interpreting the instruction bring me the cup that is on the left of the keyboard.
Moreover, mapping formal QSR to everyday linguistic predicates opens up the opportunity to
reuse the spatial knowledge that is available within general-purpose Knowledge Bases, such as
Visual Genome (VG) [18]. To address these shortcomings, we proposed the spatial framework
described in [16], which consists of a set of First Order Logic (FOL) axioms that extend the
definitions of [50], to account for varying viewpoints. Moreover, we also mapped the obtained
set of relations to the everyday linguistic predicates examined in [17]. In this paper, we provide
a methodology for (i) extracting the commonsense QSR described in [16] from large-scale KBs
and (ii) integrating them in a concrete reasoning architecture.


3. Methodology
We propose to integrate an understanding of the typical size and spatial relations of objects
with DL-based object recognition, through a post-hoc reasoning approach (Figure 1). In this
setup, Deep Learning is applied first to derive a series of object predictions for each input image.
Then, the observed size and spatial context of objects are considered, to infer the final object
predictions. In Sections 3.1 and 3.2 we present a novel method to autonomously extract the
relevant background knowledge from existing KBs. Then, we devote Section 3.3 to illustrating
the different architectural modules in the proposed hybrid reasoning pipeline.

3.1. Representing the typical sizes of objects
One of the distinctive attributes of human Visual Intelligence is the ability to link the lower-level
perceptual stimuli of the visual system with higher-level taxonomies of concepts [51, 40]. Thus,
to characterise the size of objects, we first need to define a target object taxonomy. Another
[Figure 1 diagram: the robot (HanS) supplies image data with annotated 2D bounding boxes/polygons to the DL-based object recognition module (e.g., prediction: Bottle), and depth data plus the robot's location on the map to the semantic mapping module (depth image to 3D polygon conversion, 3D polygon size estimation, adding the object to the map). Selected DL predictions are passed to the knowledge-based reasoning module, where a size reasoner (querying typical sizes abstracted from Web-retrieved object dimensions, e.g., Amazon and ShapeNet) and a spatial reasoner (querying typical QSRs extracted from Visual Genome and WordNet via the commonsense QSR mapping) validate them; a meta-reasoning step combines their outputs into the new prediction: Fire extinguisher.]
Figure 1: The proposed hybrid architecture for commonsense reasoning. Deep Learning is applied first
to classify the objects in each image (top-left corner of the Figure). Then, objects are mapped in the 3D
space based on the measured Depth data (in the "Semantic mapping" block). A subset of the predicted
objects is passed to the "Knowledge-based Reasoning" module, which validates the predictions based
on size and spatial knowledge.


requirement which can be derived from [40] is that the envisioned representation of size must be
robust to contour variance - i.e., to variations in the shape and appearance of objects belonging
to the same class. Specifically, in [40], object size is characterised in terms of object extent,
surface area, natural orientation and aspect ratio. We propose to abstract these features from
the raw size measurements available within general-purpose Knowledge Bases, as described in
the following sections.

Taxonomy definition. Consistently with [15], we focus on 60 object classes that are com-
monly found at the Knowledge Media Institute, i.e., the environment where our robot assistant
is currently deployed. To identify these categories, we examined the images collected by our
robot during one of its scouting routines and identified the set of objects characterising the
target environment. Objects which are too small to be detected through the robot camera
(e.g., pens, light switches) were discarded. The remaining objects include furniture (e.g., chairs,
desks), IT equipment (e.g., desktop PCs, monitors), H&S equipment (e.g., fire extinguishers,
emergency exit signs) and other miscellaneous items. Specifically, we used domain-specific
categories for H&S equipment and resorted to more generic class names otherwise. Intuitively,
the robot is expected to recognise that academic textbooks and notebooks are both types of
books. At the same time, it is vital that the robot recognises the difference between emergency
exit signs and fire extinguisher signs.
Handling contour variance. Depending on how general the object taxonomy defined at the
previous step is, an object class may include instances of varying size. Indeed, broad semantic
concepts (e.g., chair) group objects of different shape and design (e.g., different chair models).
To account for this intra-class variability, we collect repeated measurements from large-scale
databases, as in [41, 42, 43]. Namely, for 31 out of the 60 target classes, the related object
dimensions could be found on ShapeNet [19]. Additional object dimensions were scraped from
Amazon. For the 14 remaining classes, measurements had to be collected manually, due to the
paucity of comprehensive resources encoding size [41, 5]. To filter out noisy measurements, we
applied Local Outlier Factor detection [52] across neighbour observation pairs in each class.
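This filtering step can be sketched as follows (Python/scikit-learn); the layout of the raw measurements and the neighbourhood size are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def filter_size_outliers(dims_per_class, n_neighbors=5):
    """Drop noisy (d1, d2, d3) measurements within each class via Local Outlier Factor."""
    cleaned = {}
    for cls, dims in dims_per_class.items():
        X = np.asarray(dims)                  # shape: (n_instances, 3)
        if len(X) <= n_neighbors:             # too few measurements to estimate local density
            cleaned[cls] = X
            continue
        lof = LocalOutlierFactor(n_neighbors=n_neighbors)
        inliers = lof.fit_predict(X) == 1     # fit_predict returns -1 for outliers
        cleaned[cls] = X[inliers]
    return cleaned
```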
From instance-level absolute dimensions to class-level size features. Based on [40], the
size of objects can be described in terms of area, orientation and extent, i.e., the ratio of the area
of the object to its 2D bounding box. As such, the extent feature is affected by the so-called
framing effect [38]: i.e., our perception of the size of an object is influenced by the size of the
frame bounding the object. However, thanks to the availability of absolute object dimensions
(𝑑1, 𝑑2, 𝑑3), we can avoid relying on the relative extent of objects within a frame to approximate
their size. Nevertheless, to derive the area of an object from (𝑑1, 𝑑2, 𝑑3), we need to make
hypotheses about its orientation. For instance, we may consider a chest of drawers measuring
90x120x40 cm as 90 cm wide or as 40 cm wide, based on how it was placed. Because standard
Web-retrieved measurements are represented as three object dimensions, there are only three
possible area configurations that can be modelled: 𝑑1·𝑑2, 𝑑2·𝑑3, 𝑑3·𝑑1. As a result, the remaining
dimension in each configuration indicates the depth (or thickness) of the object. To derive a
summary descriptor of these features across instances of the same class, the area and depth
values in each configuration are averaged class-wise. As the size variability across different
classes can be significant (e.g., cupboards are significantly larger than drink cans), thickness and
area values are also scaled with respect to their natural logarithm. This design choice also aligns
with the cognitive findings of [38], which suggest that prototypical sizes are "proportional to
the logarithm of the assumed size of the object".
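This feature extraction step can be sketched as follows (Python); averaging the raw dimensions before taking the logarithm, and the assumption that all measurements share the same unit, are simplifications of the procedure described above.

```python
import numpy as np

def class_size_features(dims):
    """Class-level size descriptors from instance-level dimensions (d1, d2, d3).

    Returns one (log area, log thickness) pair for each of the three possible
    orientation configurations: (d1*d2, d3), (d2*d3, d1), (d3*d1, d2).
    """
    dims = np.asarray(dims)                          # shape: (n_instances, 3)
    d1, d2, d3 = dims[:, 0], dims[:, 1], dims[:, 2]
    configurations = [(d1 * d2, d3), (d2 * d3, d1), (d3 * d1, d2)]
    # Average class-wise, then scale by the natural logarithm.
    return [(np.log(area.mean()), np.log(depth.mean())) for area, depth in configurations]
```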
Multi-feature abstraction. To group objects based on their size, we clustered the input
class-level features on a 2D histogram, where the area is quantised over 5 bins (extra-small,
small, medium, large or extra-large), and the thickness spans across 4 qualitative bins (flat, thin,
thick, or bulky). An example of object sorting along these two dimensions is available at https:
//robots.kmi.open.ac.uk/img/size_repr.pdf. Because the resulting qualitative representation
is expected to be human-understandable and intuitive, we chose the most common format of
Likert graphic rating scale for the object area, i.e., a 5-point scale with medium as a neutral
category. For the thickness feature, we instead imposed a more polarised 4-point scale, as the
excess of thickness, i.e., bulkiness, or lack thereof, i.e., flatness, are more distinctive object
traits. In addition to these automatically-generated annotations, we also manually labelled
classes as flat or non-flat. This precaution allowed us to validate the thickness properties generated
automatically. For example, if a certain object was labelled as strictly flat by the human oracle
but categorised as thick through the histograms, the system would raise a conflict and correct
the latter annotation. Similarly, a second check condition was introduced to complete sparse
annotations: e.g., if an object was annotated as both medium and extra-large, the large label
would be automatically added too1 . One last manual validation pass ensured that the relative
sorting of objects is coherent. In addition to modelling different orientation configurations, we
also considered that certain objects may be observed under natural orientations - e.g., chairs
usually stand upright [5, 40]. Objects which are aligned with their natural orientation exhibit a
distinctive aspect ratio (e.g., coat stands are typically taller than wide). Thus, we also labelled
objects as taller than wide (ttw), wider than tall (wtt), or equivalent.
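One possible quantisation scheme is sketched below (Python); the bin edges and the aspect-ratio tolerance are placeholders, since the actual thresholds follow from the 2D histogram over all class-level features.

```python
import numpy as np

AREA_BINS = ["extra-small", "small", "medium", "large", "extra-large"]   # 5-point scale
THICKNESS_BINS = ["flat", "thin", "thick", "bulky"]                      # 4-point scale

def quantise_size(log_area, log_thickness, area_edges, thickness_edges):
    """Map continuous (log area, log thickness) features to qualitative labels.

    area_edges and thickness_edges are the 4 and 3 bin boundaries, respectively.
    """
    a_idx = int(np.digitize(log_area, area_edges))
    t_idx = int(np.digitize(log_thickness, thickness_edges))
    return AREA_BINS[min(a_idx, 4)], THICKNESS_BINS[min(t_idx, 3)]

def aspect_ratio_label(height, width, tol=0.1):
    """Label an object as taller than wide (ttw), wider than tall (wtt), or equivalent."""
    if height > width * (1 + tol):
        return "ttw"
    if width > height * (1 + tol):
        return "wtt"
    return "equivalent"
```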

3.2. Representing the typical spatial relations of objects
Since robots can be mobile, and can observe objects from different viewpoints, in our earlier
work [16], we devised a set of formally-defined Qualitative Spatial Relations (QSR) that support
the modelling of different robot viewpoints. In our framework, QSR are expressed through
first-order logic (FOL) axioms and 3D object regions are "contextualised", i.e., reconciled with
the robot’s current frame of reference. Moreover, our framework also provides a mapping
from FOL axioms to the commonsense spatial predicates discussed in [17]. An excerpt of the
proposed spatial relations and mapping is shown in Figure 2. We refrain from providing the
complete definitions and FOL axioms that were presented in [16], for the sake of conciseness.
Nonetheless, we ought to emphasise that the mapping from formal spatial relations to linguistic
predicates is non-trivial, because spatial prepositions in English can be ambiguous. This issue is
particularly pronounced in the case of the "on" preposition. For instance, we say that a mug is
on the counter and that a poster is on the wall. Although "on" is used in both phrases, the spatial
interpretation of the two sentences is very different. In the former case, the mug lays on top
of the counter. In the latter case, the poster is affixed on the wall. To handle this polysemy,
we identified three common uses of the "on" preposition, through the OnTopOf, AffixedOn, and
LeansOn relations (Figure 2).
   In the context of object recognition, we capitalise on this comprehensive knowledge model
to reason on the typical spatial relations between objects. To this aim, we augment the QSR
definitions defined in [16] with a measure of their typicality. Specifically, we propose to assess
the typicality of a relation by measuring how frequently the relation occurs within a large-scale
data collection. By relying on a large-scale set, the frequency of occurrence of a relation is less
likely to reflect random effects or skews in the sample class distribution. In particular, Visual
Genome (VG) [18] is one of the most comprehensive collections of spatial relations between
objects [5]. It consists of 108,077 images, which are annotated with over 1.5 million object-object
relationships. Spatial relations are prevalent in VG, as shown by its top ten most frequent
predicates: on, has, in, wearing, wears, behind, next to, with, near, and in front of. Moreover,
Visual Genome covers the majority of object categories in our taxonomy, i.e., 58 of the 60 classes.
In the following, we focus on providing a definition of typicality, as well as a methodology to
extract typical QSR from Visual Genome. The description of the module that supports reasoning
about typical spatial relations in our architecture is instead given in Section 3.3.
     1
While this assumption is not always correct (e.g., cars can be extra-small scaled models or extra-large full-sized vehicles, even in the absence of medium-sized car models), it holds true for the class domains of this work.
Hence, here we conservatively produced a larger set of annotations to avoid any size discrepancies.
[Figure 2 diagram: the formally defined QSRs (LeftOf, RightOf, Intersects, InFrontOf, CompletelyContains, Touches, Above, Behind, Below) are mapped to the linguistic predicates Beside/Next to, On, OnTopOf, LeansOn and AffixedOn; a legend distinguishes the special cases of the "on" preposition from the formally-defined QSRs that compose each predicate.]

Figure 2: Sample mapping of formally-defined QSR to linguistic predicates. Predicates are linked to
constituent QSR through black arrows. Blue arrows indicate special cases of the "on" preposition.


Typicality definition. The canonical structure of a spatial sentence [17, 16] is a triple where
a figure object 𝑎 (i.e., the grammatical subject) and a reference object 𝑏 (i.e., the grammatical
object) are linked through a spatial relation 𝑟. An example of canonical spatial relation is coat
stand is on the floor. Let 𝑁(𝑎,𝑟,𝑏),𝐶 be the number of times that 𝑎 and 𝑏 are in relation 𝑟 within a
data collection 𝐶. For instance, let us assume that the relation coat stand is on the floor occurs
100 times throughout VG. Moreover, let 𝑁(𝑎,𝑟,_),𝐶 and 𝑁(_,𝑟,𝑏),𝐶 indicate respectively: (i) the
number of times that 𝑎 appears as the subject of predicate 𝑟, in 𝐶; (ii) the number of times that 𝑏
appears as the object of predicate 𝑟, in 𝐶. For instance, in VG, coat stand is the subject of on relations in 200
cases, and floor is the object of on relations in 500 cases. Then, the typicality of relation (𝑎, 𝑟, 𝑏)
in collection 𝐶 can be defined as:

                       typicality(𝑎,𝑟,𝑏),𝐶 = 𝑁(𝑎,𝑟,𝑏),𝐶 / (𝑁(𝑎,𝑟,_),𝐶 + 𝑁(_,𝑟,𝑏),𝐶 − 𝑁(𝑎,𝑟,𝑏),𝐶),                     (1)
   where 𝑁(𝑎,𝑟,𝑏),𝐶 ≤ 𝑁(𝑎,𝑟,_),𝐶 and 𝑁(𝑎,𝑟,𝑏),𝐶 ≤ 𝑁(_,𝑟,𝑏),𝐶 . That is, typicality is here defined
as the Jaccard similarity coefficient between 𝑎 and 𝑏, in terms of relation 𝑟. In our example, the coat stand is on the
floor relation has a typicality score of 100/(200 + 500 − 100) ≈ 0.17. The more frequently two
objects appear in relation 𝑟 across the given collection, the more typical their relation is, i.e.,
the closer to 1 will be their Jaccard similarity score. Conversely, a relation which never appears
throughout the collection yields a score of 0. From Equation 1, it follows that the atypicality of
a relation can be expressed as the complement to 1, with respect to the chosen typicality metric,
i.e., as the Jaccard distance between 𝑎 and 𝑏, in terms of relation 𝑟.
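A minimal implementation of Equation 1, reproducing the worked example above, could look as follows (Python).

```python
def typicality(n_ab, n_a, n_b):
    """Jaccard-style typicality of relation (a, r, b) in a collection C (Equation 1).

    n_ab: number of (a, r, b) occurrences; n_a: occurrences of a as subject of r;
    n_b: occurrences of b as object of r. Returns a score in [0, 1].
    """
    denominator = n_a + n_b - n_ab
    return n_ab / denominator if denominator > 0 else 0.0

# Worked example from the text: coat stand is on the floor.
assert round(typicality(100, 200, 500), 2) == 0.17
```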
Mapping VG predicates to Commonsense QSR. To derive a typicality score for the QSR
defined in [16], these QSR first need to be mapped to the spatial predicates in Visual Genome.
This process is complicated by the fact that spatial predicates in VG are often linked to incor-
rect aliases. For instance, the above and on top of predicates are treated as synonyms in VG.
                Predicate keyword    Aliases                         QSR
                on top of                                            OnTopOf
                against                                              AffixedOn ∨ LeansOn
                beside               next to, adjacent, on side of   Beside
                below                under                           Below
                above                                                Above
                right, left                                          RightOf, LeftOf
                front, behind                                        InFrontOf, Behind
                near                 by, around                      Near

Table 1
Mapping of spatial keywords in Visual Genome, and their aliases, to the Commonsense QSR of [16].


However, these two predicates carry a very different spatial connotation: only the on top of
expression implies that the two objects are in contact. To address this issue, we regrouped VG
predicates based on key terms that could be directly mapped to the commonsense QSR in [16],
as summarised in Table 1. We used keywords instead of key phrases (e.g., front instead of in
front of ), to account for the linguistic variability of VG predicates. For instance, the keyword
"front" is shared by both the in front of and the on the front of predicates, which we want to
map to the InFrontOf QSR. Here, each VG predicate is mapped to its nearest matching keyword.
In the absence of an exact match, the matching keyword of maximum length is considered, so
that, e.g., the relation "on the front of" is linked to the "front" preposition, as opposed to the
"on" preposition. At this stage, we filter out i) relations that miss either the subject, the object
or the predicate of the sentence, as well as ii) predicates expressed through the overly generic
"of" and "to" prepositions.
The case of "on". To map occurrences of "on" to the correct spatial QSR, examining textual
annotations is insufficient and further spatial inference on the input image is needed. To this
aim, the 2D bounding boxes in each VG image are added to a spatial database, implemented
in PostgreSQL 2 . The built-in spatial operators provided with the PostGIS engine3 can then be
used to conclude whether the two object regions involved in the "on" relation also overlap along
the vertical direction. Only the occurrences of "on" which imply contact along the top of the
reference object are mapped to the OnTopOf relation. The remaining occurrences are mapped
to the "against" preposition, as instances of either the AffixedOn or LeansOn relations (Table 1).
Computing typicality scores. Given the definition of typicality introduced at Equation 1, each
occurrence of a spatial relation found in VG increments the 𝑁(𝑎,𝑟,𝑏),𝐶 , 𝑁(𝑎,𝑟,_),𝐶 and 𝑁(_,𝑟,𝑏),𝐶
counts by one. Aliases in the same keyword group (Table 1) contribute to the same counts. For
instance, hits of the "around" and "by" terms count as occurrences of the "near" relation. More
broadly, all retrieved QSR, except for the "above" relation, contribute to the counts of the "near"
relation. Indeed, object-object relations appearing in a natural scene generally imply proximity
between the depicted objects. The "above" preposition, however, is an exception to this broader
norm: e.g., the sky can be above an object without being anywhere near it.
   2
       https://www.postgresql.org/
   3
       https://postgis.net/
3.3. Hybrid Architecture for Commonsense Reasoning
The proposed reasoning architecture (Figure 1), consists of the following components.
DL-based object recognition. This module classifies the objects depicted in each image. It
receives as input an RGB image collected by the robot and a set of 2D object regions. Then,
objects are classified through a Deep Neural Network that returns, for each input region,
a set of ranked object predictions. In general, this configuration is suitable for any algorithm
which provides, for each detected object: (i) the 2D region bounding the object, (ii) a set of
scored predictions, whether similarity-based or probability-based. Because the main aim of
the experiments discussed in this paper is to compare the performance of hybrid solutions
against the DML classification methods of [28], we rely on manually-annotated object regions,
to control for potential errors propagated from the image segmentation steps.
Semantic mapping. The object predictions and associated depth data are fed to a second
module, which maintains a semantic map of the robot’s environment. In practice, by measuring
the distance between the robot’s position and the surfaces reached by the laser sensor, we
can construct a PointCloud, i.e., a set of 3D geometrical points representing the input image.
Then, the 2D object regions annotated on the RGB images are projected on the PointCloud, to
derive 3D object regions and locate these regions on the geometrical map of the environment.
These 3D regions are stored in a PostgreSQL spatial database and used to estimate the object
dimensions. Lastly, the semantic map is completed by manually annotating the wall surfaces.
DL prediction selection. The next step is identifying which DL predictions need to be validated
through knowledge-based reasoning. To this aim, the DL predictions are filtered by confidence.
Specifically, because, in the considered baseline methods, object predictions are ordered by
ascending L2 distance, we feed the knowledge-based reasoner only with those predictions
whose top L2 distance, in the ranking, is greater than a threshold 𝜖. Optimal values for the 𝜖
parameter are automatically estimated, through n-fold stratified cross validation. Specifically, a
portion of the test set is held out, at each inference run, to derive the minimum L2 distance for
each class. Minimum L2 values are averaged across classes, to infer the recommended 𝜖.
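A simplified sketch of this threshold estimation is given below (Python); the data layout is an illustrative assumption.

```python
import numpy as np

def estimate_epsilon(held_out_l2):
    """Estimate the confidence threshold epsilon on a held-out split.

    held_out_l2 maps each ground-truth class to the list of top-ranked L2 distances
    observed for its examples in the held-out portion of the test set.
    """
    per_class_minima = [min(dists) for dists in held_out_l2.values() if dists]
    return float(np.mean(per_class_minima))

def needs_validation(top_l2_distance, epsilon):
    """Forward a DL prediction to the knowledge-based reasoners only if its best
    (smallest) L2 distance exceeds the estimated threshold."""
    return top_l2_distance > epsilon
```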
Size reasoner. This module validates the DL predictions on the basis of the Knowledge Base
of typical sizes constructed in Section 3.1. Hence, here the object sizes stored in the semantic
map are quantised, so that they are expressed qualitatively, in terms of surface area, thickness
and aspect ratio. The background size KB is queried to retrieve a set of object classes that are
plausible with respect to the observed size. Then, the input DL ranking of object predictions is
filtered so that only those classes deemed as valid in terms of size are retained. As a result, the
output ranking of top-K predictions may include a different set of classes from those obtained
through DL, but the original DL scores, i.e., the class order, are preserved.
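The filtering behaviour of the size reasoner can be sketched as follows (Python); indexing the size KB by an exact qualitative tuple is a simplification of the actual query mechanism.

```python
def size_validate(dl_ranking, observed_size, size_kb, top_k=5):
    """Filter a DL ranking so that only classes compatible with the observed size remain.

    dl_ranking: list of (class_name, l2_distance) pairs sorted by ascending L2 distance.
    observed_size: qualitative (area, thickness, aspect_ratio) tuple from the semantic map.
    size_kb: mapping from qualitative size tuples to the set of plausible classes.
    """
    plausible = size_kb.get(observed_size, set())
    # The original DL scores (and hence the class order) are preserved;
    # only the size-invalid classes are dropped.
    return [(cls, dist) for cls, dist in dl_ranking if cls in plausible][:top_k]
```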
Spatial reasoner. This reasoner considers the spatial relations of neighbour objects to correct
the DL predictions. In particular, for each object instance that is selected for correction, the
set of objects within a distance radius, 𝑅, is retrieved from the semantic map. We rely on
PostGIS operators to compare neighbour object regions and derive their QSR, as described in
[16]. The object neighbourhood is modelled as a directed graph, where nodes indicate object
entities and edges encode spatial relations. As such, the efficacy of the spatial reasoner also
depends on the quality of the neighbour object labels in the graph. When all object predictions
are generated automatically, DL errors can hinder the spatial reasoning results. For instance,
consider the case of Figure 1, where a fire extinguisher was mistaken for a bottle. In principle,
the proximity of a fire sign is a strong indication that the object is in fact a fire extinguisher.
However, if the fire sign was mistaken, e.g., for a monitor, the same spatial cue would become
misleading. For these reasons, in the experiments of Section 4, we control for the presence of
DL-generated object labels. For each object prediction in the ranking and spatial relation in
the graph, the typicality score is computed (Equation 1). To contain the computational cost of
this step, typicality scores are computed only for the top-K predictions. Then, only the object
predictions of non-null typicality score are retained and converted to atypicality scores, so that
they can be combined with the L2 distances in the DL ranking, as follows. Atypicality scores are
first averaged class-wise, across all observed QSR. Second, the class average is added to the L2
distance of the current prediction and re-sorted in ascending order. Hence, differently from the
size reasoner, the set of predicted classes is always identical to the top-K DL ranking, whereas
their ranked order may change. The aforementioned spatial reasoning steps are skipped in all
cases where (i) no QSR are found for a given prediction, (ii) the DL predicted that the target
region is a "person". Indeed, people can be found at various locations. Hence, the "person" class
would always rank higher than other classes, invalidating the spatial reasoner results.
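The re-ranking step can be sketched as follows (Python); the handling of predictions for which no relation has a non-null typicality score is an assumption, since the description above leaves this detail implicit.

```python
def spatial_rerank(dl_top_k, observed_qsr, typicality_scores):
    """Re-rank the top-K DL predictions using the atypicality of the observed QSR.

    dl_top_k: list of (class_name, l2_distance) pairs, ascending L2.
    observed_qsr: list of (relation, neighbour_class) pairs from the semantic map.
    typicality_scores: dict mapping (class, relation, neighbour_class) to a score in [0, 1].
    """
    reranked = []
    for cls, l2 in dl_top_k:
        scores = [typicality_scores.get((cls, rel, nb), 0.0) for rel, nb in observed_qsr]
        scores = [s for s in scores if s > 0.0]          # keep non-null typicality only
        if not scores:
            reranked.append((cls, l2))                   # no evidence: keep the DL score
            continue
        atypicality = sum(1.0 - s for s in scores) / len(scores)   # class-wise average
        reranked.append((cls, l2 + atypicality))         # combine with the L2 distance
    return sorted(reranked, key=lambda pair: pair[1])    # re-sort in ascending order
```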
Meta-reasoning. Because the size reasoner and the spatial reasoner may recommend a
different set of object classes, a rationale is needed to combine the outcomes of the two reasoners.
In the experiments of this paper, we test two strategies for leveraging these reasoners. First, we
apply the size and spatial reasoners in sequence, i.e., following a "waterfall" approach. Since
the size reasoner only applies a filter to the input ranking, the order of execution of the two
reasoners could be equivalently swapped. However, we also consider a second meta-reasoning
scenario, which we refer to, in the following, as "parallel". In this scenario, the DL and size-
validated rankings are fed to the spatial reasoner in parallel, to filter out the spatially-invalid
predictions (see also Section 4.3).


4. Experiments
The experiments of this section are aimed at answering three main research questions. First,
are the performance benefits of integrating object recognition with size awareness, as measured
in our prior work [15], preserved when the construction of the size knowledge base is automated?
Second, we examine the utility, in terms of object recognition performance, of introducing a
second reasoning component, which considers the typical spatial relations between objects. Lastly,
we ask what are the performance effects of combining both types of reasoners?
As anticipated in Section 3.3, the performance of the spatial reasoner is dependent on the quality
of the predictions generated for the nearby objects. Therefore, to control for error propagation
from the DL-based to the knowledge-based steps while addressing our research questions, we
set up two experiments. In Experiment A, objects which lie nearby each DL prediction are
represented with ground truth labels. Namely, in the fire extinguisher example of Figure 1,
the incorrect DL prediction, bottle, is associated with the spatial relation near fire extinguisher
sign, irrespective of which class the DL algorithm predicted for the fire extinguisher sign. In
Experiment B, no prior knowledge is available about the ground truth object categories. In
this realistic scenario, all object labels in the QSR graph are generated through DL.
   The remainder of this section illustrates the dataset (Section 4.1), performance metrics (Section
4.2) and object recognition methods (Section 4.3) that were considered for evaluation. Lastly,
the experimental results are presented and discussed in Section 4.4. The code supporting these
experiments is available at https://github.com/kmi-robots/spatial-KB/tree/test.

4.1. The KMi RGB-D set
For consistency with our prior trials, we consider the same dataset introduced in [15]. In brief,
images and depth data of the KMi Lab were collected through a Turtlebot mounting an Orbbec
Astra Pro RGB-Depth (RGB-D) monocular camera. Specifically, 10 image examples per class
were devoted to training the baseline models of [28] and to validating the model parameters,
through an 80-20 training-validation split. Training examples were collected against a neutral
background and cropped to reduce the marginal noise. Additionally, test examples were sampled
from one of our robot’s monitoring routines. From the initial test sample, we annotated 1414
object regions, which were labelled with respect to the 60 classes of interest.

4.2. Evaluation metrics
In the experiments of this paper, performance is measured in terms of cross-class accuracy,
precision, recall and F1 score of the top-1 object predictions in the ranking. Additionally, to
account for the natural class imbalance in the KMi set [15], we also weigh these average metrics
by class support. The quality of the top-5 predictions is evaluated in terms of mean Precision
(P@5), mean normalised Discounted Cumulative Gain (nDCG@5) and hit ratio (HR), i.e., the
number of times the correct prediction appeared among the top-5, divided by the total number
of predictions. Evaluation metrics are averaged across 7 cross-validation splits. The value of
n=7 was chosen pragmatically, to devote only a limited portion of test examples to parameter
tuning, i.e., 202 object regions out of 1414.

4.3. Ablation study
In the following, we contrast the performance of methods which are purely based on DL with
variations of the proposed hybrid system. We start by evaluating the size and spatial reasoners
individually. Then, we evaluate two meta-reasoning strategies that combine both reasoners.
N-net [28] is a Triplet Network based on three ResNet50 branches [53]. A feature space is
learned so that exemplars of visually-similar objects lie closer, in terms of L2 distance, than
dissimilar objects. This training objective is achieved by minimising the Triplet loss. At test
time, objects are matched against their nearest neighbour in the learned feature space.
K-net [28] is identical to the N-net pipeline, except a second component is added to the Triplet
loss during training. This component is based on the Softmax loss over 𝑀 pre-defined classes.
In this way, the training objective is modified to learn a feature space that can discriminate not
only similar and dissimilar objects, but also a pre-defined subset of classes.
Hybrid (size only). In this variation of the workflow proposed in Section 3.3, only the size
reasoner is considered. This method integrates knowledge of the object typical surface area,
thickness and aspect ratio, to validate the predictions generated by the DL algorithm. It resembles
the "area+thin+AR" pipeline of [15], except here both the size knowledge construction and the
estimation of the optimal confidence threshold, 𝜖, are automated.
Hybrid (spatial only). In this second variation of the proposed architecture, only the spatial
reasoner is considered. Specifically, the DL rankings selected for correction are modified based
on: (i) the observed QSR, which are extracted from the input image and organised in a graph
(Section 3.3), and (ii) the spatial typicality scores derived from Visual Genome (Section 3.2).
Specifically, L2 distances in the DL ranking are combined with the Jaccard distances indicating
the spatial atypicality of each class, yielding a modified ranking of predictions.
Hybrid (size + spatial, waterfall). In this implementation of the proposed hybrid reasoning
framework, the DL, size-based and spatial reasoners are applied in sequence. First, the DL
predictions are filtered based on the size reasoning outcomes. Then, the size-validated ranking
is passed through the spatial reasoner. At this stage, the DL-based scores in the ranking are
combined with the atypicality scores of the observed QSR.
Hybrid (size + spatial, parallel). Differently from the waterfall strategy, this method main-
tains the original DL predictions and the size-validated predictions as separate rankings. For
instance, let us consider a case where the top-5 predictions in the DL ranking are bottle, fire
extinguisher, backpack, bin, and cup - let this be ranking 𝑟1 . Once 𝑟1 is fed to the size reasoner,
only those predictions which are deemed as valid w.r.t. size are retained, yielding a second
ranking, 𝑟2 : e.g., fire extinguisher, backpack. 𝑟1 and 𝑟2 are passed in parallel to the spatial
reasoner. At this stage, the predictions of atypicality score 1 are filtered out of both rankings,
yielding 𝑟ˆ1 - e.g., fire extinguisher, bin and 𝑟ˆ2 - e.g., fire extinguisher. Lastly, majority voting is
used to derive the final predictions from 𝑟ˆ1 and 𝑟ˆ2 . In our example, the fire extinguisher class
would be selected as the final prediction, because it occurs twice. In the case of ties, only the
size-validated predictions are considered.
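The parallel voting strategy can be sketched as follows (Python); the tie-breaking rule beyond preferring the size-validated predictions is an illustrative assumption.

```python
from collections import Counter

def parallel_vote(r1_hat, r2_hat):
    """Majority voting between the spatially-filtered DL ranking (r1_hat) and the
    spatially-filtered, size-validated ranking (r2_hat). Returns the final class."""
    votes = Counter(r1_hat) + Counter(r2_hat)
    if not votes:
        return None
    top = max(votes.values())
    tied = [cls for cls, count in votes.items() if count == top]
    if len(tied) == 1:
        return tied[0]
    # Tie: prefer the highest-ranked class among the size-validated predictions.
    for cls in r2_hat:
        if cls in tied:
            return cls
    return tied[0]

# Worked example from the text:
assert parallel_vote(["fire extinguisher", "bin"], ["fire extinguisher"]) == "fire extinguisher"
```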

4.4. Results and Discussion
Results are reported in Table 2. The DL baseline results are common to both experiments, A
and B, because the DL module is independent from variations introduced in the reasoning
modules. Similarly, the "size only" results are reported only once for both experiments, as the
only difference between the two experiments pertains to the spatial reasoner. The DL results
reproduce the findings of [15]. Namely, K-net outperforms N-net, thanks to the optimisation
over a pre-defined set of object classes. Therefore, in the following, we consider the predictions
generated through K-net as input to the hybrid methods.
  Importantly, the results achieved with the size reasoner match the top-1 accuracy and
weighted R of [15] and outperform prior results on all remaining metrics. Thus, in relation to
our first research question, although part of the knowledge base construction and the parameter
tuning were automated, the baseline DL performance was enhanced by a significant margin. In
particular, the unweighted and weighted F1 scores increased by 6% and 5%, compared to K-net.
Interestingly, the precision of the size reasoner is higher than its recall. In the case of top-1
weighted metrics, this trend resembles the baseline K-net performance. Thus, a first explanation
for this evidence is that the hybrid method maintains part of the prediction patterns of the
Table 2
Experiments A and B results on the KMi RGB-D test set.
                                      Top-1 unweighted   Top-1 weighted      Top-5 unweighted
 Method                  Top-1 Acc.   P     R    F1      P     R    F1     P@5 nDCG@5 HR
 N-net [28]              .45          .34   .40   .31    .62   .45   .47   .33    .36          .63
 K-net [28]              .48          .39   .40   .34    .68   .48   .50   .38    .41          .65
 Hybrid (size only)      .51          .47   .39   .40    .69   .51   .55   .42    .44          .68
 Experiment A:
 Hybrid (spatial only)   .49          .41   .43   .37    .69   .49   .53   .38    .41          .65
 Hybrid (waterfall)      .53          .49   .39   .41    .71   .53   .57   .39    .42          .68
 Hybrid (parallel)       .55          .48   .42   .42    .71   .55   .59   .42    .45          .69
 Experiment B:
 Hybrid (spatial only)   .48          .40   .41   .35    .71   .48   .52   .38    .41          .65
 Hybrid (waterfall)      .52          .48   .39   .40    .71   .52   .56   .39    .42          .68
 Hybrid (parallel)       .54          .48   .40   .41    .71   .54   .58   .41    .44          .68


underlying DL component. However, by definition, the size reasoner is a stricter classifier
than the DL baseline: it classifies an object correctly when it observes that the object size is
valid, but also misses a portion of true positives when the measured size is found to be
non-valid. Wrong size judgements could be equally caused by (i) errors in the estimation of
the object size, by (ii) biases in the distribution of object sizes in the raw data, or by (iii) the
methodological choice to adopt discrete size categories. For instance, if an object lies at the
frontier between small and medium-sized objects, it may be wrongly categorised due to relying
on clear-cut thresholds.
Experiment A. As shown in Table 2, the spatial reasoner also contributes to a performance im-
provement over the DL baselines. The F1 scores achieved in the "spatial only" case outperformed
K-net by 3%. On average, the improvement is more modest than in the case of the size reasoner.
These results align with the hypothesis we put forward in [5]: that spatial reasoning capabilities are
relatively less impactful towards object categorisation than size reasoning. Interestingly, the
ranking quality metrics in the "spatial only" scenario are equivalent to the K-net case. This
result can be explained by considering that the spatial reasoner only re-ranks the DL predictions,
without discarding any of the originally predicted classes. As such, it may overlook potential
candidates that lie outside the top-K positions, compared to the size reasoner. Nonetheless,
this characteristic also ensures more balanced unweighted P and R scores than the "size only"
case. Combining both reasoners leads to performance that is higher than, or at least comparable to,
that achieved through the individual reasoners, across all the top-1 metrics. The DL-only baseline
was surpassed by a significant margin. The "parallel" strategy, overall, led to the highest margin
of improvement: the K-net accuracy improved by 7%; the unweighted metrics improved by 9%
(P), 2% (R) and 8% (F1); the weighted metrics increased by 3% (P), 7% (R), 9% (F1). All the top-5
quality metrics also increased by 4%. Moreover, the unweighted P and R are more balanced
in the "parallel" than in the "waterfall" scenario. Because, in the "waterfall" case, the size and
spatial reasoner are applied in sequence, the same performance trend of the "size only" case
emerges. Conversely, in the "parallel" case, when the DL and size-validated predictions are
leveraged through the spatial reasoner, the higher precision ensured by the size reasoner is
coupled with the higher recall trend of the spatial reasoner, yielding the highest F1 results across
the tested hybrid methods.
Experiment B. In this more realistic scenario, the object labels for the observed QSRs are generated automatically, through DL. Therefore, as expected, the performance of the "spatial only", "waterfall" and "parallel" methods was lower than in Experiment A. Despite this slight degradation, the hybrid methods that integrate both size and spatial awareness still ensured a significant performance improvement over the K-net baseline. Moreover, Experiment B confirmed the performance patterns already observed in Experiment A. First, the "parallel" method achieved the highest performance on the majority of evaluation metrics, with the exception of the unweighted R and P@5, which are 1% lower than for the other hybrid methods. Second, as in Experiment A, the gap between precision and recall is less pronounced in the "parallel" than in the "waterfall" scenario. This evidence suggests that the "parallel" meta-reasoning strategy was the most effective at exploiting the strengths of its constituent reasoners while compensating for their shortcomings.
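For concreteness, the snippet below contrasts how the labels of the neighbouring objects that anchor the QSRs could be obtained in the two setups; `ground_truth_label` and `dl_classifier` are hypothetical placeholders, used only to show where DL errors enter the spatial reasoning pipeline in Experiment B.

```python
# Illustrative contrast between the two evaluation setups (names are placeholders).

def qsr_anchor_labels_experiment_a(nearby_regions, ground_truth_label):
    # Experiment A: the objects surrounding the target are labelled from the
    # annotated ground truth, so the spatial reasoner sees noise-free anchors.
    return [ground_truth_label(region) for region in nearby_regions]

def qsr_anchor_labels_experiment_b(nearby_regions, dl_classifier):
    # Experiment B: anchor labels are the top-1 DL predictions, so any
    # misclassification propagates into the spatial reasoning step, which
    # explains the slight performance drop reported above.
    return [dl_classifier(region)[0][0]    # assumes a ranked (class, score) list
            for region in nearby_regions]
```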


5. Conclusion
We have presented a novel hybrid system that enhances the object recognition performance of a robot by introducing awareness of the typical size and spatial relations of objects. Because the knowledge representations proposed in this work are designed to capture the typicality of size and spatial properties, they relate to the broader objective of equipping intelligent systems with commonsense. We have also shown how commonsense knowledge can be autonomously extracted from general-purpose Knowledge Bases, namely (i) ShapeNet and Amazon, in the case of size, and (ii) Visual Genome, in the case of spatial relations.
   Furthermore, we have demonstrated how the extracted commonsense knowledge can significantly augment the performance of object recognition systems that are purely based on DL, in the real-world scenario of autonomous Health and Safety monitoring. Specifically, the most robust results were achieved when both types of knowledge-based reasoners were leveraged. The significant performance improvement over DL is the key take-away message of this study. Since the proposed solution is modular, the absolute performance figures presented here only provide a reference point: other DL solutions could naturally be integrated into our system to improve the quality of the initial predictions. In our future work, we are also interested in assessing whether a specific type of reasoner is most helpful for recognising certain subsets of classes, especially categories that are specific to Health & Safety. To this end, we will evaluate the proposed system on H&S decision-making tasks, such as verifying the correct placement of fire extinguishers. Another extension of this work concerns the problem of reconciling successive observations of the same object entity. In this paper we considered a basic semantic map model, where reasoning is performed on a frame-by-frame basis; tracking objects across frames and updating the mapped observations incrementally would further improve the robot's object recognition abilities.
References
 [1] M. J. C. Samonte, S. H. Baloloy, C. K. J. Datinguinoo, e-tapon: Solar-powered smart bin
     with path-based robotic garbage collector, in: 2021 IEEE 8th International Conference on
     Industrial Engineering and Applications (ICIEA), IEEE, 2021, pp. 181–185.
 [2] G. Yang, H. Lv, Z. Zhang, L. Yang, J. Deng, S. You, J. Du, H. Yang, Keep healthcare workers
     safe: application of teleoperated robot in isolation ward for covid-19 prevention and
     control, Chinese Journal of Mechanical Engineering 33 (2020) 1–4.
 [3] B. R. Le Comte, G. S. Gupta, M. T. Chew, Distributed sensors for hazard detection in
     an urban search and rescue operation, in: 2012 IEEE International Instrumentation and
     Measurement Technology Conference Proceedings, IEEE, 2012, pp. 2385–2390.
 [4] M. B. Alatise, G. P. Hancke, A review on challenges of autonomous mobile robot and
     sensor fusion methods, IEEE Access 8 (2020) 39830–39846.
 [5] A. Chiatti, E. Motta, E. Daga, Towards a Framework for Visual Intelligence in Service
     Robotics: Epistemic Requirements and Gap Analysis, in: Proceedings of the 17th International
     Conference on Principles of Knowledge Representation and Reasoning (KR 2020), 2020,
     pp. 905–916. doi:10.24963/kr.2020/93.
 [6] E. Bastianelli, G. Bardaro, I. Tiddi, E. Motta, Meet HanS, the health & safety autonomous
     inspector, in: International Semantic Web Conference - Demo Track, 2018.
 [7] B. Cantwell Smith, The Promise of Artificial Intelligence: Reckoning and Judgement, The
     MIT Press, Cambridge, 2019.
 [8] J. Pearl, Theoretical Impediments to Machine Learning With Seven Sparks from the Causal
     Revolution, in: Proceedings of the ACM WSDM Conference, WSDM ’18, ACM, New York,
     NY, USA, 2018, p. 3. doi:10.1145/3159652.3176182.
 [9] D. Paulius, Y. Sun, A survey of knowledge representation in service robotics, Robotics
     and Autonomous Systems 118 (2019) 13–30.
[10] E. Davis, G. Marcus, Commonsense reasoning and commonsense knowledge in artificial
     intelligence, Communications of the ACM 58 (2015) 92–103. URL: https://doi.org/10.1145/
     2701413. doi:10.1145/2701413.
[11] H. Levesque, Common Sense, the Turing Test, and the Quest for Real AI, The MIT Press,
     2017. URL: https://mitpress.mit.edu/books/common-sense-turing-test-and-quest-real-ai.
[12] P. J. Hayes, The Second Naive Physics Manifesto, in: Formal Theories of the Common
     Sense World, Ablex Publishing Corporation, 1988. URL: https://ci.nii.ac.jp/naid/10022004548/.
[13] J. McCarthy, et al., Programs with common sense, RLE and MIT computation center, 1960.
[14] A. Newell, The knowledge level, Artificial intelligence 18 (1982) 87–127.
[15] A. Chiatti, E. Motta, E. Daga, G. Bardaro, Fit to measure: Reasoning about sizes for robust
     object recognition, in: Proceedings of the AAAI 2021 Spring Symposium on Combining
     Machine Learning and Knowledge Engineering (AAAI-MAKE 2021), 2021.
[16] A. Chiatti, G. Bardaro, E. Motta, E. Daga, Commonsense spatial reasoning for visually
     intelligent agents, arXiv preprint arXiv:2104.00387 (2021).
[17] B. Landau, R. Jackendoff, "What" and "where" in spatial language and spatial cognition,
     Behavioral and Brain Sciences 16 (1993) 217–265.
[18] R. Krishna, Y. Zhu, O. Groth, J. Johnson, et al., Visual genome: Connecting language and
     vision using crowdsourced dense image annotations, International journal of computer
     vision 123 (2017) 32–73.
[19] M. Savva, A. X. Chang, P. Hanrahan, Semantically-enriched 3D models for common-sense
     knowledge, in: 2015 IEEE CVPRW Conference, 2015, pp. 24–31. doi:10.1109/CVPRW.2015.7301289.
[20] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, M. Pietikäinen, Deep Learning for
     Generic Object Detection: A Survey, International Journal of Computer Vision 128 (2020)
     261–318. doi:10.1007/s11263-019-01247-4.
[21] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional
     neural networks, Communications of the ACM 60 (2017) 84–90. URL: https://doi.org/10.
     1145/3065386. doi:10.1145/3065386.
[22] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in:
     Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
     2016, pp. 770–778. URL: https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_
     Residual_Learning_CVPR_2016_paper.html.
[23] G. Koch, R. Zemel, R. Salakhutdinov, Siamese neural networks for one-shot image recog-
     nition, in: ICML deep learning workshop, volume 2, Lille, 2015.
[24] E. Hoffer, N. Ailon, Deep Metric Learning Using Triplet Network, in: A. Feragen, M. Pelillo,
     M. Loog (Eds.), Similarity-Based Pattern Recognition, Lecture Notes in Computer Science,
     Springer, Cham, 2015, pp. 84–92. doi:10.1007/978-3-319-24261-3_7.
[25] B. J. Meyer, T. Drummond, The importance of metric learning for robotic vision: Open
     set recognition and active learning, in: 2019 International Conference on Robotics and
     Automation (ICRA), IEEE, 2019, pp. 2924–2931.
[26] G. Sawadwuthikul, T. Tothong, T. Lodkaew, P. Soisudarat, S. Nutanong, P. Manoonpong,
     N. Dilokthanakul, Visual goal human-robot communication framework with few-shot
     learning: a case study in robot waiter system, IEEE Transactions on Industrial Informatics
     (2021).
[27] K. Suzuki, Y. Yokota, Y. Kanazawa, T. Takebayashi, Online self-supervised learning for
     object picking: detecting optimum grasping position using a metric learning approach,
     in: 2020 IEEE/SICE International Symposium on System Integration (SII), IEEE, 2020, pp.
     205–212.
[28] A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu,
     E. Romo, et al., Robotic pick-and-place of novel objects in clutter with multi-affordance
     grasping and cross-domain image matching, in: 2018 IEEE International Conference on
     Robotics and Automation (ICRA), IEEE, 2018, pp. 1–8.
[29] S. Aditya, Y. Yang, C. Baral, Integrating Knowledge and Reasoning in Image Understanding,
     in: Proceedings of IJCAI 2019, 2019, pp. 6252–6259.
[30] M. Mancini, H. Karaoguz, E. Ricci, P. Jensfelt, B. Caputo, Knowledge is Never Enough:
     Towards Web Aided Deep Open World Recognition, in: IEEE ICRA Conference, 2019, p.
     9543. doi:10.1109/ICRA.2019.8793803.
[31] A. Daruna, W. Liu, Z. Kira, S. Chernova, Robocse: Robot common sense embedding,
     in: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp.
     9777–9783.
[32] G. Castellano, G. Sansaro, G. Vessio, Integrating contextual knowledge to visual features
     for fine art classification, in: Workshop on Deep Learning for Knowledge Graphs (DL4KG).
     The International Semantic Web Conference (ISWC)., 2021.
[33] R. Koner, H. Li, M. Hildebrandt, D. Das, V. Tresp, S. Günnemann, Graphhopper: Multi-hop
     scene graph reasoning for visual question answering, in: International Semantic Web
     Conference, Springer, 2021, pp. 111–127.
[34] K. Marino, R. Salakhutdinov, A. Gupta, The More You Know: Using Knowledge Graphs
     for Image Classification, in: 2017 IEEE Conference on Computer Vision and Pattern
     Recognition (CVPR), 2017, pp. 20–28. doi:10.1109/CVPR.2017.10.
[35] E. van Krieken, E. Acar, F. van Harmelen, Analyzing Differentiable Fuzzy Implications,
     in: Proceedings of the 17th International Conference on Principles of Knowledge Rep-
     resentation and Reasoning, 2020, pp. 893–903. URL: https://doi.org/10.24963/kr.2020/92.
     doi:10.24963/kr.2020/92.
[36] J. Young, L. Kunze, V. Basile, E. Cabrio, N. Hawes, B. Caputo, Semantic web-mining
     and deep vision for lifelong object discovery, in: 2017 IEEE International Conference on
     Robotics and Automation (ICRA), IEEE, 2017, pp. 2774–2779.
[37] G. J. Burghouts, F. Hillerström, Zero-detect objects without training examples by knowing
     their parts, in: AAAI Spring Symposium: Combining Machine Learning with Knowledge
     Engineering, 2021.
[38] T. Konkle, A. Oliva, Canonical visual size for real-world objects, Journal of Experi-
     mental Psychology: Human Perception and Performance 37 (2011) 23–37. doi:10.1037/
     a0020413.
[39] J. B. Julian, J. Ryan, R. A. Epstein, Coding of object size and object category in human
     visual cortex, Cerebral Cortex 27 (2017) 3095–3109.
[40] B. Long, T. Konkle, M. A. Cohen, G. A. Alvarez, Mid-level perceptual features distinguish
     objects of different real-world sizes., Journal of Experimental Psychology: General 145
     (2016) 95.
[41] H. Bagherinezhad, H. Hajishirzi, Y. Choi, A. Farhadi, Are elephants bigger than butterflies?
     reasoning about sizes of objects, in: Proceedings of the Thirtieth AAAI Conference, AAAI
     Press, 2016, pp. 3449–3456.
[42] Y. Elazar, A. Mahabal, D. Ramachandran, T. Bedrax-Weiss, D. Roth, How Large Are Lions?
     Inducing Distributions over Quantitative Attributes, in: Proceedings of the 57th Annual
     Meeting of the ACL, Association for Computational Linguistics, 2019, pp. 3973–3983.
[43] Y. Zhu, A. Fathi, L. Fei-Fei, Reasoning about Object Affordances in a Knowledge Base
     Representation, in: Computer Vision – ECCV 2014, volume 8690, Springer, Cham, 2014, pp.
     408–424. doi:10.1007/978-3-319-10605-2_27, series Title: Lecture Notes in Com-
     puter Science.
[44] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, J. Taylor, Freebase: a collaboratively created
     graph database for structuring human knowledge, in: Proceedings of the SIGMOD 2008,
     2008, pp. 1247–1250.
[45] A. G. Cohn, J. Renz, Qualitative Spatial Representation and Reasoning, in: F. van Harmelen,
     V. Lifschitz, B. Porter (Eds.), Foundations of Artificial Intelligence, volume 3 of Handbook
     of Knowledge Representation, Elsevier, 2008, pp. 551–596.
[46] A. Nüchter, J. Hertzberg, Towards semantic maps for mobile robots, Robotics and Au-
     tonomous Systems 56 (2008) 915–926.
[47] S. Coradeschi, A. Saffiotti, An introduction to the anchoring problem, Robotics and
     autonomous systems 43 (2003) 85–96.
[48] I. Kostavelis, A. Gasteratos, Semantic mapping for mobile robotics tasks: A survey, Robotics
     and Autonomous Systems 66 (2015) 86–103.
[49] L. Kunze, C. Burbridge, M. Alberti, A. Thippur, J. Folkesson, P. Jensfelt, N. Hawes, Combin-
     ing top-down spatial reasoning and bottom-up object class recognition for scene under-
     standing, in: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems,
     2014, pp. 2910–2915.
[50] H. Deeken, T. Wiemann, J. Hertzberg, Grounding semantic maps in spatial databases,
     Robotics and Autonomous Systems 105 (2018) 146–165.
[51] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, S. J. Gershman, Building machines that learn
     and think like people, Behavioral and Brain Sciences 40 (2017).
[52] M. M. Breunig, H.-P. Kriegel, R. T. Ng, J. Sander, Lof: identifying density-based local
     outliers, in: Proceedings of SIGMOD 2000, 2000, pp. 93–104.
[53] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Pro-
     ceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp.
     770–778.