=Paper=
{{Paper
|id=Vol-2846/paper7
|storemode=property
|title=Fit to Measure: Reasoning About Sizes for Robust Object Recognition
|pdfUrl=https://ceur-ws.org/Vol-2846/paper7.pdf
|volume=Vol-2846
|authors=Agnese Chiatti,Enrico Motta,Enrico Daga,Gianluca Bardaro
|dblpUrl=https://dblp.org/rec/conf/aaaiss/ChiattiMDB21
}}
==Fit to Measure: Reasoning About Sizes for Robust Object Recognition==
Agnese Chiatti, Enrico Motta, Enrico Daga and Gianluca Bardaro

Knowledge Media Institute, The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom

Abstract

Service robots can help with many of our daily tasks, especially in those cases where it is inconvenient or unsafe for us to intervene – e.g., under extreme weather conditions or when social distance needs to be maintained. However, before we can successfully delegate complex tasks to robots, we need to enhance their ability to make sense of dynamic, real-world environments. In this context, the first prerequisite to improving the Visual Intelligence of a robot is building robust and reliable object recognition systems. While object recognition solutions are traditionally based on Machine Learning, augmenting them with knowledge-based reasoners has been shown to improve their performance. In particular, based on our prior work on identifying the epistemic requirements of Visual Intelligence, we hypothesise that knowledge of the typical size of objects can significantly improve the accuracy of an object recognition system. To verify this hypothesis, in this paper we present an approach to integrating knowledge about object sizes in an ML-based architecture. Our experiments in a real-world robotic scenario show that this hybrid approach ensures a significant performance increase over state-of-the-art Machine Learning methods.

Keywords

object recognition, service robotics, hybrid AI, reasoning about sizes, cognitive systems

In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2021 Spring Symposium on Combining Machine Learning and Knowledge Engineering (AAAI-MAKE 2021), Stanford University, Palo Alto, California, USA, March 22-24, 2021.

agnese.chiatti@open.ac.uk (A. Chiatti); enrico.motta@open.ac.uk (E. Motta); enrico.daga@open.ac.uk (E. Daga); gianluca.bardaro@open.ac.uk (G. Bardaro)

0000-0003-3594-731X (A. Chiatti); 0000-0003-0015-1592 (E. Motta); 0000-0002-3184-540 (E. Daga); 0000-0002-6695-0012 (G. Bardaro)

© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

With the fast-paced advancement of the Artificial Intelligence (AI) and Robotics fields, there is an increasing potential to resort to service robots (or robot assistants) to help with daily tasks. Service robots can take on many roles. They can operate as patient carers [1], Health and Safety monitors [2], or museum and tour guides [3], to name a few. However, succeeding in the real world is a challenge, because it requires robots to make sense of the high-volume and diverse data coming through their perceptual sensors. Although different sensory modalities contribute to the robot's sensemaking abilities (e.g., touch, sound), in this work we focus on the modality of vision. From this entry point, the problem then becomes one of enabling robots to correctly interpret the stimuli of their vision system, with the support of background knowledge sources, a capability also known as Visual Intelligence [4]. The first prerequisite to Visual Intelligence is the ability to robustly recognise the different objects occupying the robot's environment (object recognition).
Let us consider the case of HanS, the Health and Safety robot inspector currently under development at the Knowledge Media Institute (KMi). HanS is expected to monitor the Lab in search of potentially dangerous situations, such as fire hazards. Imagine that HanS was observing a flammable object (e.g., a paper cup) left on top of a portable heater. To conclude that it is in the presence of a potential fire hazard, the robot would first need to recognise that the cup and the heater are there.

Currently, the most common approach to tackling object recognition tasks is applying methods which are based on Machine Learning (ML). In particular, the state-of-the-art performance is defined by the latest approaches based on Deep Learning (DL) [5, 6]. Despite their popularity, these methods have received many critiques due to their brittleness and lack of transparency [7, 8, 9]. To compensate for these limitations, a more recent trend among AI researchers has been to combine ML with knowledge-based reasoning, thus adopting a hybrid approach [10, 11]. A question remains, however, on what type of knowledge resources and reasoning capabilities should be leveraged within this new class of hybrid methods [12].

In [4], we identified a set of epistemic requirements, i.e., a set of capabilities and knowledge properties, required for service robots to exhibit Visual Intelligence. We then mapped the identified requirements to the types of classification errors emerging from one of HanS' scouting rounds, where we relied solely on Machine Learning to recognise the objects. This error analysis highlighted that, in 74% of the cases, a more accurate object classification could in principle have been achieved if the relative size of objects had been considered for their categorisation. This view is also supported by studies of human visual cognition, which suggest that our priors about object sizes are crucial to their categorisation [13, 14]. For instance, back to HanS' case, the paper cup could be mistaken for a rubbish bin, due to its shape. However, rubbish bins are typically larger than cups. With this awareness, HanS would be able to rule out object categories which, albeit visually similar to the correct class, are implausible from the standpoint of size. These elements of typicality and plausible reasoning [15] link size reasoning to the broader objective of developing AI systems which exhibit common sense [16] and intuitive Physics reasoning abilities [17, 18].

On a more practical level, knowledge representations which encode object sizes have already been applied effectively to answering questions posed in natural language, such as "is A larger than B?" [19, 20]. However, despite this body of theoretical and empirical evidence, the role of size in object recognition has received little attention in the field of Computer Vision. In this paper, we investigate the performance effects of augmenting an ML-based object recognition system with a reasoner which accounts for the size of the observed objects. In particular, we propose:

• a hybrid method to validate ML-based predictions based on the typical size of objects;
• a novel representation for size, which categorises objects based on their front surface area, depth and aspect ratio.

2. Related Work

State-of-the-art object recognition methods rely heavily on Machine Learning, as further discussed in Section 2.1.
Because of the limitations of ML methods, hybrid methods, which combine ML with background knowledge and knowledge-based reasoning, have been recently proposed (Section 2.2). In particular, our hypothesis is that awareness of object sizes has the potential to drastically improve the performance of hybrid object recognition methods [4]. Therefore, in Section 2.3, we discuss the existing approaches to representing the size of objects.

2.1. Machine Learning for Object Recognition

The impressive performance exhibited by object recognition methods based on DL has led to significant advances on several Computer Vision benchmarks [5, 21, 22]. Deep Neural Networks (NNs), however, come with their limitations. These models (i) are notoriously data-hungry, i.e., require thousands of annotated training examples to learn from, (ii) learn classification tasks offline, i.e., assume to operate in a closed world [23], and (iii) learn representational patterns automatically, by iterating over a raw input set [6]. The latter trait can drastically reduce the start-up costs of feature engineering. However, it also complicates tasks such as explaining results and integrating explicit knowledge statements in the pipeline [7, 9].

The issue of robust learning from minimal training examples has inspired the development of few-shot metric learning methods. Metric learning is the task of learning an embedding (or feature vector) space, where similar objects are mapped closer to one another than dissimilar objects. In this setup, even objects unseen at training time can be categorised, by matching the learned representations against a support (reference) image set. In particular, in a few-shot scenario, the number of training examples and support images is kept to a minimum. Deep metric learning has been applied successfully to object recognition tasks [24, 25, 26], even in real-world, robotic scenarios [27]. Koch and colleagues [24] proposed to train two identical Convolutional Neural Networks (CNN) fed with images to be matched by similarity. This twin architecture is also known as a Siamese Network. An extension of the Siamese architecture is the Triplet Network [25, 26], where the input data are fed as triplets including: (i) one image depicting a certain object class (i.e., the anchor), (ii) a positive example of the same object class, and (iii) a negative example, depicting a different object. The winning team for the object stowing task at the latest Amazon Robotic Challenge capitalised on learning weights independently on each CNN branch of a Triplet Network, producing a multi-branch architecture [27]. Hence, in what follows, we will use the two top-performing solutions in [27] as a baseline to evaluate the object recognition performance of solutions which are purely based on Machine Learning.
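To make the metric learning setup above more concrete, the snippet below is a minimal, illustrative sketch of a triplet-based embedding network and nearest-neighbour classification in PyTorch. It is not the N-net or K-net architecture of [27]: the embedding dimension, margin, and the random tensors standing in for image batches are placeholder assumptions.

```python
# Minimal sketch of deep metric learning for object recognition by similarity.
import torch
import torch.nn as nn
from torchvision import models

class EmbeddingNet(nn.Module):
    """ResNet50 backbone followed by an L2-normalised embedding layer."""
    def __init__(self, emb_dim: int = 256):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        backbone.fc = nn.Linear(backbone.fc.in_features, emb_dim)
        self.backbone = backbone

    def forward(self, x):
        return nn.functional.normalize(self.backbone(x), p=2, dim=1)

model = EmbeddingNet()
triplet_loss = nn.TripletMarginLoss(margin=0.2, p=2)

# One training step on an (anchor, positive, negative) batch of image triplets.
anchor, positive, negative = [torch.randn(4, 3, 224, 224) for _ in range(3)]
loss = triplet_loss(model(anchor), model(positive), model(negative))
loss.backward()

# At inference time, a test embedding is classified by its nearest reference
# (support) embedding in terms of L2 distance.
with torch.no_grad():
    ref_embs = model(torch.randn(10, 3, 224, 224))   # reference image set
    test_emb = model(torch.randn(1, 3, 224, 224))
    nearest = torch.cdist(test_emb, ref_embs).argmin(dim=1)
```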
2.2. Hybrid Methods for Object Recognition

Broadly speaking, hybrid reasoning methods combine knowledge-based reasoning with ML. A detailed survey of hybrid methods developed to interpret images can be found in [10, 11]. Many of these hybrid methods are specifically tailored to Deep NNs, which define the predominant approach to tackling object recognition problems. In this setup, background knowledge and knowledge-based reasoning can be integrated at four different levels of the NN [10]: (i) in pre-processing, to augment the training examples, (ii) within the intermediate layers, (iii) as part of the architectural topology or optimisation function, and (iv) in the post-processing stages, to validate the NN predictions.

Methods in the first group rely on external knowledge to compensate for the lack of training examples. In [23], auxiliary images depicting newly-encountered objects were first retrieved from the Web and then manually validated. As a result, significant supervision costs were introduced to compensate for the noisiness of data mined automatically. Other approaches have encoded the background knowledge directly in the inner layer representations of a Deep NN. In the RoboCSE framework, a set of knowledge properties of objects (i.e., their typical location, fabrication material and affordances) was represented through multi-relational embeddings [28]. This method proved effective for inferring an object's material, location and affordances from its class, but performed poorly on object categorisation tasks (i.e., when asked to infer the object's class from its properties).

More transparent and explainable than multi-relational embeddings, methods in the third group are either inspired by the topology of external knowledge graphs [29] or introduce reasoning components which are trainable end-to-end [30, 31, 32]. Graph Search Neural Networks (GSNN) [29] mirror the topology of an input knowledge graph. In Logic Tensor Networks (LTN) [30], entities are represented as distinct points in a vector space, based on a set of soft-logic assertions linking these entities. In this framework, symbolic rules (which may adhere to probabilistic logic [31]) are added as constraints to the NN optimisation function. Similarly, in [32], differentiable knowledge statements (expressed in fuzzy logic) contribute to the training loss.

Finally, the fourth family of hybrid methods uses knowledge-based reasoning to validate the object predictions generated through ML. In [33], the ML predictions are first associated with the Qualitative Spatial Relationships in the image, and also matched against the most closely related DBpedia concepts, if an unknown object is observed. As in the case of [32], methods in this group can modularly interface with different NN architectures. Moreover, they make it possible to reason on objects unseen at training time, by querying external knowledge sources. For these reasons, in the approach proposed in this paper, knowledge-based reasoning is applied after generating the ML predictions. Because our focus is on reasoning about size, we cannot directly compare our approach against the methods in [33], which focus on spatial and taxonomic reasoning (i.e., reasoning about the semantic relatedness of different object categories).

2.3. Representing the Size of Objects

Cognitive studies [13, 14] have suggested that our brain maintains a canonical representation of the physical size of objects, which is functional to their categorisation. Specifically, this prototypical size appears to be proportional to the logarithm of the known object size [13]. Successive experiments [14] have indicated that size-related features are extracted from the earliest visual processing stages, before object recognition. As such, it appears that a mid-level representation is produced to link the lower-level perceptual stimuli of our vision system to higher-level semantic concepts - e.g., our naming of objects. Remarkably, this representation of size is robust to variations in the shape and appearance of objects within a class, i.e., the contour variance of [14]. For example, a short novel and a dictionary are both books, although dictionaries are usually thicker.
Moreover, a book may be open or closed, thus exhibiting a different size.

Works in Artificial Intelligence have sought inspiration from these cognitive theories. Based on the findings of [13], the authors of [19] developed a reasoner where object sizes are represented as log-normal distributions over quantitative size measurements. Additionally, the produced distributions were used to populate the nodes of a size graph, where objects which co-occur frequently across different images are linked together. Similarly, in [20], the size of an object class was modelled quantitatively as a statistical distribution over a set of textual references to size. The size representations in [19, 20] were implemented to tackle textual reasoning problems, i.e., to autonomously answer questions such as "is A larger than B?". In the context of autonomous visual reasoning, however, the effects of integrating size awareness remain unexplored. A partial exception is the work in [34], which proposes a methodology to build a Knowledge Base of object affordances (uses or actions typically associated with an object). As the main focus of [34] is affordance reasoning rather than pure size reasoning, however, the object size was only partially represented, by resorting to the object length derived from a combination of Freebase [35], Amazon and Ebay.

All the reviewed representations [19, 20, 34] were extracted from repeated measurements retrieved from the Web. On the one hand, this approach increases the chances of capturing the contour variance within a class, because it takes multiple input measurements into account. Moreover, it reduces the cost of hardcoding a size Knowledge Base, especially given the lack of comprehensive resources encoding size, as pointed out in [19] and also highlighted by our prior KB coverage study [4]. On the other hand, however, this method is sensitive to noise. Another limitation of the reviewed methods is that size is represented in one-dimensional terms: e.g., either over volume or over length units. While relying on a single size feature is sufficient to identify broader groups of smaller and larger objects, it does not suit the task of classifying finer-grained object categories. For instance, recycling bins and coat stands may occupy a comparable volume, but bins are usually thicker than coat stands. Hence, while abstracting from lower-level size features, we also want to preserve enough information to categorise objects, i.e., produce the mid-level representation envisioned in [14]. To this aim, we propose to represent size qualitatively across three dimensions, i.e., based on the object's surface area, depth and aspect ratio. To control for noise, we start by collecting these coarse-grained annotations manually, as further discussed in the next Section.

3. Methodology

3.1. Representing Qualitative Sizes in a Knowledge Base

We identified 60 object categories which are commonly found in KMi, the setting in which we aim to deploy our robotic Health and Safety monitor, HanS. These include not only objects which are common to most office spaces (e.g., chairs, desktop computers, keyboards), but also Health and Safety equipment (e.g., fire extinguishers, emergency exit signs, fire assembly point signs), and other miscellaneous items (e.g., a foosball table, colorful hats from previous gigs of the KMi rock band).
The objective was then to label each object category with respect to a set of coarse-grained features contributing to size, namely their (i) front surface area (i.e., the product of their width by their height), (ii) depth dimension, and (iii) Aspect Ratio (AR), i.e., the ratio of their width to their height. With respect to the first dimension, objects are characterised as extra-small, small, medium, large or extra-large. Secondly, objects are categorised as flat, thin, thick, or bulky, based on their depth. Thirdly, we can further discriminate objects based on whether they are taller than wide (ttw), wider than tall (wtt), or equivalent (eq), i.e., of AR close to 1. If the first two qualitative dimensions were plotted on a Cartesian plane, a series of quadrants would emerge, as illustrated in this Figure: https://robots.kmi.open.ac.uk/img/size_repr.pdf. Then, the AR can help to further separate clusters of objects belonging to the same quadrant. For instance, doors and desks both belong to the extra-large and bulky group, but doors, contrarily to desks, are usually taller than wide.

Having defined the Cartesian plane in the supporting materials, we can manually allocate the KMi objects to each quadrant and further sort the objects lying in the same quadrant. Sorting the objects manually ensures more reliable results than if the same information was retrieved automatically. Moreover, in the proposed representation, membership of each bin is mutually non-exclusive. Thus, with this representation, even classes which are extremely variable with respect to size, such as carton boxes and power cords, can be modelled. Indeed, boxes come in all sizes and power cords come in different lengths. Moreover, a box might lay completely flat, or appear bulkier, once assembled. Similarly, power cords, which are typically thinner than other pieces of IT equipment, might appear rolled up or tangled.
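As an illustration of what such a qualitative size Knowledge Base might look like in practice, the snippet below encodes a handful of hypothetical classes with (possibly multiple) qualitative size attributes per class. The class names, bin assignments and Python structure are assumptions for illustration; this is not the released KB.

```python
# Illustrative sketch of a qualitative size KB: each class maps to one or more
# (surface area, depth, aspect ratio) bin combinations, so that classes with high
# intra-class size variability (e.g., boxes) can carry multiple entries.
SIZE_KB = {
    "fire extinguisher": [("medium", "thick", "ttw")],
    "paper cup":         [("extra-small", "thin", "ttw")],
    "rubbish bin":       [("medium", "bulky", "ttw"), ("large", "bulky", "ttw")],
    "door":              [("extra-large", "bulky", "ttw")],
    "desk":              [("extra-large", "bulky", "wtt")],
    "carton box":        [("small", "flat", "eq"), ("medium", "bulky", "eq"),
                          ("large", "bulky", "eq")],
}

def plausible_classes(area_bin, depth_bin, ar_bin):
    """Return the classes whose typical size is compatible with the observed bins."""
    return [cls for cls, entries in SIZE_KB.items()
            if any(a == area_bin and d == depth_bin and r == ar_bin
                   for a, d, r in entries)]

# e.g., a medium-sized, thick, taller-than-wide observation
print(plausible_classes("medium", "thick", "ttw"))   # ['fire extinguisher']
```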
3.2. Hybrid Reasoning Architecture

We propose a modular approach to combining knowledge-based reasoning with Machine Learning for object recognition, where knowledge of the qualitative size of objects (Section 3.1) is integrated in post-processing, after generating the ML predictions. The architectural components are organised as follows (Figure 1).

[Figure 1: The proposed architecture for hybrid object recognition. The knowledge-based reasoning module, which is aware of the typical size features of objects, validates the ML-based predictions. The depicted pipeline runs from the RGB and depth input through ML-based recognition and size estimation/quantization, to ranking validation against the size KB.]

ML-based object recognition. We rely on the state-of-the-art, ML-based object recognition methods of [27] to classify a set of pre-segmented object regions. Specifically, we classify objects by similarity to a reference image set, through a multi-branch Network. In this deep metric learning setting, predictions are ranked by increasing Euclidean (or L2) distance between each target embedding and the reference embedding set. Nevertheless, this configuration can be easily replaced by any other algorithm that provides, for each detected object, (i) a set of class predictions with an associated measure of confidence (whether similarity-based or probability-based) and (ii) the segmented image region enclosing the object.

Prediction selection. This checkpoint is conceived for assessing whether an ML-based prediction needs to be corrected or not. At the time of writing, we achieved good results simply by retaining those predictions which the ML algorithm is most confident about, and by running the remaining predictions through the size-based reasoner. Specifically, we skip the knowledge-based reasoning steps if the top-1 class in the ML ranking: (i) has a ranking score smaller than ε (i.e., the test image embedding lies near one of the reference embeddings, in terms of L2 distance); and also (ii) appears at least i times in the top-K ranking.

Size estimation. At this stage, the input depth image corresponding to each RGB scene is first converted to a 3D PointCloud representation. Then, statistical outliers are filtered out to extract the dense 3D region which best approximates the volume of the observed object. Specifically, all points which lie farther away than two standard deviations (2σ) from their n nearest neighbours are discarded. Because this outlier removal step is computationally expensive, especially for large 3D regions, we only retain 1 in every χ points from the input PointCloud. We use the Convex Hull algorithm to approximate the 3D bounding box of the object and thus estimate its x, y, z dimensions. Since the orientation of the object is not known a priori, we cannot unequivocally map any of the estimated dimensions to the object's real width and height. However, we can assume the object's depth to be the minimum of the returned x, y, z dimensions, due to the way depth measurements are collected through the sensor. Indeed, since we do not apply any 3D reconstruction mechanisms, we can expect the measured depth to underestimate the real depth occupied by the object.

Size quantization. The three dimensions obtained at the previous step are here expressed in qualitative terms, to make them comparable with the representation of Section 3.1. First, the two dimensions which were not marked as depth are multiplied together, yielding a proxy of the object's surface area. The object is then categorised as extra-small, small, medium, large or extra-large, based on a threshold set T. Second, the estimated depth dimension is labelled as flat/non-flat (based on a threshold λ0), and as flat, thin, thick, or bulky (based on a second set of thresholds Λ, where λ0 ∈ Λ). Third, hypotheses are made about whether the object is taller than wide (ttw), wider than tall (wtt), or equivalent (eq), based on a cutoff ω0 ∈ Ω. It would be unfeasible to predict the object's Aspect Ratio from the estimated 3D dimensions, without knowing its current orientation. Thus, we estimate the object's AR based on the width (w) and height (h) of the 2D bounding box:

AR = ttw, if h ≥ w and h/w ≥ ω0;
AR = wtt, if h < w and w/h ≥ ω0;
AR = eq, otherwise.    (1)

Ranking validation. The qualitative features returned by the size quantization module are then matched against the background KB of Section 3.1, to identify the set of categories which are plausible candidates for the observed object, based on their size.
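The following sketch illustrates how the size estimation and quantization steps could be implemented with the Open3D library, under stated assumptions: the camera intrinsics, depth image, 2D bounding box, downsampling factor, neighbourhood size and threshold values are placeholders, and this is not the released implementation.

```python
# Illustrative sketch of size estimation (depth crop -> point cloud -> 3D box)
# and size quantization (metric dimensions -> qualitative bins).
import numpy as np
import open3d as o3d

def estimate_dims(depth_img: np.ndarray, intrinsic: o3d.camera.PinholeCameraIntrinsic,
                  chi: int = 10, n_neighbours: int = 20):
    """Convert a cropped depth image to a point cloud, remove outliers and
    return the x, y, z extent of the oriented 3D bounding box (in metres)."""
    pcd = o3d.geometry.PointCloud.create_from_depth_image(
        o3d.geometry.Image(depth_img), intrinsic)
    pcd = pcd.uniform_down_sample(every_k_points=chi)        # keep 1 in every chi points
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=n_neighbours, std_ratio=2.0)
    hull, _ = pcd.compute_convex_hull()                      # convex hull of the dense region
    return hull.get_oriented_bounding_box().extent           # (dx, dy, dz)

# Hypothetical threshold sets (see Section 4.3 for the values used in our trials).
T = [0.007, 0.05, 0.35, 0.79]        # surface area cutoffs, in square meters
LAMBDA = [0.1, 0.2, 0.4]             # depth cutoffs, in meters
OMEGA_0 = 1.4                        # aspect ratio cutoff

def quantize(dims, bbox_w: float, bbox_h: float):
    """Map metric dimensions to the qualitative bins of Section 3.1."""
    dims = np.sort(np.asarray(dims))          # the smallest dimension is taken as depth
    depth, area = dims[0], dims[1] * dims[2]
    area_bin = ["extra-small", "small", "medium", "large", "extra-large"][
        int(np.searchsorted(T, area))]
    depth_bin = ["flat", "thin", "thick", "bulky"][int(np.searchsorted(LAMBDA, depth))]
    # Aspect ratio from the 2D bounding box, as in Eq. (1)
    if bbox_h >= bbox_w and bbox_h / bbox_w >= OMEGA_0:
        ar_bin = "ttw"
    elif bbox_h < bbox_w and bbox_w / bbox_h >= OMEGA_0:
        ar_bin = "wtt"
    else:
        ar_bin = "eq"
    return area_bin, depth_bin, ar_bin
```

The qualitative triple returned by quantize can then be looked up in the size KB (as in the plausible_classes sketch of Section 3.1) to filter the ML ranking.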
In Section 4, we test the effects of combining different size features (i.e., the surface area, thinness, and AR) to generate this candidate set. Ultimately, only those object classes in the original ML ranking which were validated as plausible are retained.

4. Experiments

4.1. Data Preparation

KMi RGB Reference Set. For training purposes, RGB images depicting the 60 target object classes were collected through a Turtlebot mounting an Orbbec Astra Pro RGB-Depth (RGB-D) monocular camera. Each object was captured against a neutral background and opportunely cropped to control for the presence of clutter and occluded parts. We collected 10 images per class, with an 80%-20% training-validation split. Specifically, 5 images per class were used as anchor examples and paired up with their most similar image among the remaining 5 examples in that class (i.e., the multi-anchor switch strategy in [27]). We also matched each anchor with the most similar image belonging to a different class. In this way, instead of picking negative examples randomly, i.e., the protocol followed in [27], we focused on triplets which are relatively harder to disambiguate.

KMi RGB-D Test Set. For testing purposes, additional RGB and depth measurements of the KMi office environment were collected during one of HanS' monitoring routines. Class cardinalities in this set reflect the natural occurrence of objects in the observed domain - e.g., HanS is more likely to spot fire extinguishers than printers on its scouting route. These data were recorded at a 640x480 resolution, i.e., the maximum resolution allowed by the depth sensor. 622 clear RGB frames, where the robot camera was still, were manually selected from the original recording. Because the recording of RGB and depth messages is asynchronous, each RGB frame had to be matched to its nearest depth image, within a time window of +/- μ. Choosing a higher value for μ increases the number of scenes for which a depth match is available, but also increases the chances that a certain scene is matched with the wrong depth measurements (e.g., because the robot has moved in the meantime). In our trials, we found that setting μ to 0.2 seconds offered a good compromise. With this setup, a depth match was found for 509 (82%) of the 622 original frames. After a visual inspection, only 2 (0.4%) of the 509 depth matches were identified as inaccurate and discarded. This RGB-D set was further pruned to exclude identical images, where neither the robot's viewpoint nor the object arrangement had changed. We annotated the remaining 213 images with respect to the 60 reference object categories. Concurrently, we also labelled the rectangular or polygonal regions bounding the objects of interest. The annotated regions were used to crop both the RGB frames and their time-synchronised depth images. Upon analysing the generated depth crops, we identified 19 object regions which did not enclose any depth measurement. This can happen, for instance, when the object falls outside the range of the depth sensor (i.e., 60 cm – 8 m). To fairly compare the performance of the size-based reasoner (which relies on depth data) against the ML baseline, we discarded the empty depth crops from our test set, leaving us with a total of 1414 object regions.
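A minimal sketch of the RGB-depth time synchronisation described above: each RGB frame is paired with the depth frame whose timestamp is closest, and the pair is kept only if the gap is within ±μ seconds. The variable names and list-based layout are assumptions for illustration, not the released preprocessing code.

```python
# Illustrative sketch: match each RGB frame to its nearest depth frame in time,
# discarding matches farther apart than mu seconds (0.2 s in our trials).
import bisect

def match_rgb_to_depth(rgb_stamps, depth_stamps, mu=0.2):
    """Return a dict {rgb timestamp: depth timestamp} for frames matched within mu."""
    depth_stamps = sorted(depth_stamps)
    matches = {}
    for t in rgb_stamps:
        i = bisect.bisect_left(depth_stamps, t)
        # candidate depth stamps immediately before and after the RGB stamp
        candidates = [depth_stamps[j] for j in (i - 1, i) if 0 <= j < len(depth_stamps)]
        if not candidates:
            continue
        best = min(candidates, key=lambda d: abs(d - t))
        if abs(best - t) <= mu:
            matches[t] = best
    return matches

# e.g., three RGB frames; only the first two find a depth frame within 0.2 s
print(match_rgb_to_depth([10.00, 10.50, 11.30], [10.05, 10.42, 10.90]))
```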
4.2. Ablation Study

In what follows, we list the different methods under evaluation and illustrate the changes introduced before each performance assessment.

Baseline Nearest Neighbour (NN). In this pipeline, feature vectors are extracted from a ResNet50 module pre-trained on ImageNet [22], without re-training on the KMi RGB reference set. Namely, a 2048-dimensional embedding is extracted for each object region in the KMi RGB-D test set and matched to its nearest embedding in the KMi RGB reference set, in terms of L2 distance. This baseline provides us with a lower bound for the ML performance, before fine-tuning on the domain of interest.

N-net is the multi-branch Network which ensured the top performance on novel object classes, i.e., classes unseen at training time, in [27]. Training hyperparameters are updated so that the Triplet Loss is minimised, i.e., to minimise the L2 distance between matching pairs, while also maximising the L2 distance between dissimilar pairs. At inference time, object regions in the KMi RGB-D test set are classified based on their nearest object in the KMi RGB reference set, as in the case of the baseline NN pipeline.

K-net is the multi-branch Network which led to the top performance on known object classes, i.e., classes seen at training time, in [27]. K-net is a variation of N-net where a second loss component is added to the Triplet Loss. This auxiliary component of the loss derives from applying a SoftMax function over M known classes to the last fully connected layer.

Hybrid (area). This configuration follows the hybrid architecture introduced in Section 3.2. However, only the object's surface area is used as a size feature to validate the ML predictions.

Hybrid (area + flat). With this ablation, we evaluate the effects of introducing a second feature to represent the size of objects. Specifically, here we consider not only the qualitative surface area of each object, but also whether it is flat or non-flat, based on its estimated depth. Then, the ML-based predictions are validated based on the set of object classes which both lie within the same area range and are also equally flat (or non-flat).

Hybrid (area + thin) is equivalent to the previous configuration, except that the depth of objects is represented on a four-class scale, i.e., as flat, thin, thick, or bulky. The purpose of this ablation is to test the utility of introducing more granular depth bins.

Hybrid (* + AR). These ablations also integrate the qualitative Aspect Ratio (i.e., taller than wide, wider than tall, or equivalent) as a third knowledge property of size.

4.3. Implementation Details

ML setup. All the tested ML models were implemented in PyTorch [36]. Images in the KMi RGB reference set were resized to 224 × 224 frames and normalised to the same distribution as the ImageNet-1000 dataset, which was used for pre-training the ResNet50 CNN backbone. The tested ablations were fine-tuned through an Adabound optimizer [37], over minibatches of 16 image triplets, with a learning rate set to start at 0.0001 and to trigger switching to Stochastic Gradient Descent optimization at 0.01. Parameters were updated for up to 1500 epochs, with early stopping whenever the validation loss had not decreased for more than 100 epochs.

Knowledge-based reasoning parameters. We relied on the Python Open3D library [38] to process the PointCloud data and estimate the 3D bounding boxes. During our trials we achieved the best results with the threshold values set as follows. In the prediction selection step, ε was set to a distance of 0.04 and i to 3 predictions, within a top-5 ranking. The three cutoff sets in the size quantization module were defined so that T = {0.007, 0.05, 0.35, 0.79} (with thresholds expressed in square meters), Λ = {0.1, 0.2, 0.4} (in meters) and Ω = {1.4}.
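Putting the prediction selection rule of Section 3.2 and these parameter values together, the checkpoint can be sketched as follows; the ranking data structure is an assumption, not the released code.

```python
# Illustrative sketch of the prediction selection checkpoint: trust the ML
# prediction when the top-1 class is both close enough (L2 distance < epsilon)
# and consistent enough (appears at least i times) in the top-K ranking.
from collections import Counter

def trust_ml_prediction(ranking, epsilon=0.04, i=3, k=5):
    """ranking: list of (class label, L2 distance) pairs sorted by ascending distance."""
    top_k = ranking[:k]
    top1_label, top1_dist = top_k[0]
    votes = Counter(label for label, _ in top_k)
    return top1_dist < epsilon and votes[top1_label] >= i

ranking = [("fire extinguisher", 0.02), ("fire extinguisher", 0.03),
           ("bottle", 0.05), ("fire extinguisher", 0.06), ("backpack", 0.09)]
# keep the ML prediction, or defer to the size-based reasoner (None)
prediction = ranking[0][0] if trust_ml_prediction(ranking) else None
```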
{| class="wikitable"
|+ Table 1: ML baseline results on the KMi RGB-D test set.
|-
! rowspan="2" | Method !! rowspan="2" | Top-1 Acc. !! colspan="3" | Top-1 unweighted !! colspan="3" | Top-1 weighted !! colspan="3" | Top-5 results (unweighted)
|-
! P !! R !! F1 !! P !! R !! F1 !! P@5 !! nDCG@5 !! Hit ratio
|-
| Baseline NN || .33 || .36 || .33 || .25 || .63 || .33 || .36 || .23 || .25 || .54
|-
| N-net || .45 || .34 || .40 || .31 || .62 || .45 || .47 || .33 || .36 || .63
|-
| K-net || .48 || .39 || .40 || .34 || .68 || .48 || .50 || .38 || .41 || .65
|}

{| class="wikitable"
|+ Table 2: Hybrid reasoning results, when correcting only the wrong predictions.
|-
! rowspan="2" | Method !! rowspan="2" | Top-1 Acc. !! colspan="3" | Top-1 unweighted !! colspan="3" | Top-1 weighted !! colspan="3" | Top-5 results (unweighted)
|-
! P !! R !! F1 !! P !! R !! F1 !! P@5 !! nDCG@5 !! Hit ratio
|-
| Hybrid (area) || .60 || .52 || .50 || .47 || .71 || .60 || .61 || .43 || .47 || .72
|-
| Hybrid (area+flat) || .60 || .55 || .50 || .48 || .72 || .60 || .62 || .44 || .47 || .72
|-
| Hybrid (area+thin) || .61 || .55 || .50 || .48 || .71 || .61 || .63 || .44 || .47 || .71
|-
| Hybrid (area+flat+AR) || .62 || .59 || .50 || .49 || .76 || .62 || .65 || .45 || .49 || .75
|-
| Hybrid (area+thin+AR) || .62 || .62 || .51 || .52 || .74 || .62 || .65 || .45 || .48 || .74
|}

As an additional output of this work, we have also publicly released the RGB-D image set, annotated knowledge properties, and code used in these experiments: https://github.com/kmi-robots/object_reasoner.

5. Results and Discussion

We measured performance on the KMi RGB-D test set (i) in terms of the cross-class Accuracy, Precision (P), Recall (R) and F1 score of the top-1 prediction in the ranking; as well as (ii) based on the top-5 predictions in the ranking, in terms of mean Precision (P@5), mean normalised Discounted Cumulative Gain (nDCG@5) and hit ratio (i.e., the ratio of the number of times the correct prediction appeared in the top-5 ranking to the total number of predictions). Specifically, the P, R and F1 metrics were aggregated across classes both before and after weighing the averages by class support (i.e., based on the number of ground truth instances in each class). Measures@5 are unweighted. When comparing the different methods under evaluation, we prioritise improvements on the weighted F1 score, which accounts for the naturally imbalanced occurrence of classes in the test set. Moreover, at comparable top-1 results, we favour methods which provide higher quality top-5 rankings. Indeed, if the correct class was not ranked first but still appeared in the top-5 ranking, it would be easier for an additional reasoner (or human oracle) to correct the prediction.

First, we evaluated all methods which are purely based on Machine Learning, i.e., before any background knowledge about the typical size of objects is integrated. The results of this first assessment are reported in Table 1. K-net is the ML baseline which led to the top performance, across all evaluation metrics. Therefore, we considered the K-net predictions as a baseline for testing all the hybrid configurations.

{| class="wikitable"
|+ Table 3: Hybrid reasoning results, when correcting only an automatically selected subset of predictions.
|-
! rowspan="2" | Method !! rowspan="2" | Top-1 Acc. !! colspan="3" | Top-1 unweighted !! colspan="3" | Top-1 weighted !! colspan="3" | Top-5 results (unweighted)
|-
! P !! R !! F1 !! P !! R !! F1 !! P@5 !! nDCG@5 !! Hit ratio
|-
| Hybrid (area) || .55 || .42 || .41 || .38 || .67 || .55 || .57 || .43 || .46 || .69
|-
| Hybrid (area+flat) || .54 || .43 || .39 || .37 || .67 || .54 || .56 || .43 || .46 || .68
|-
| Hybrid (area+thin) || .52 || .39 || .37 || .35 || .62 || .52 || .54 || .42 || .44 || .64
|-
| Hybrid (area+flat+AR) || .53 || .43 || .37 || .36 || .69 || .53 || .56 || .43 || .45 || .68
|-
| Hybrid (area+thin+AR) || .51 || .43 || .36 || .36 || .64 || .51 || .54 || .41 || .43 || .63
|}

Because the knowledge-based reasoner relies on different sub-modules, and each one of these modules is likely to propagate its own errors, we initially tested performance assuming that the ground truth predictions are known and that we can accurately discern which ML predictions need to be corrected. Although unrealistic, this best-case scenario provides us with an upper bound for the reasoner's performance and aids the analysis of errors. As shown in Table 2, simply integrating knowledge about the qualitative surface area of objects already ensured a significant performance improvement, with a 13% increase of the unweighted F1 score and an 11% increase of the weighted F1 score. Overall, the best performance, both in terms of top-1 predictions and of top-5 rankings, was achieved through the two hybrid configurations which included all the qualitative size features (i.e., surface area, thinness and AR). In particular, the unweighted F1 score increased by up to 18% and the weighted F1 score by up to 15%. Hence, the margin for improvement when complementing ML with size-based reasoning is significant. These results confirm the hypothesis laid out in [4]: the capability to compare objects by size and the access to background knowledge representing size play a crucial role in object categorisation. Notably, there is no significant difference between the results obtained when representing depth in binomial terms (i.e., as either flat or non-flat), as opposed to when more fine-grained categories are used (i.e., flat, thin, thick or bulky). Thus, we can hypothesise that the costs (and potential inaccuracies) associated with formalising additional priors for the objects' depth are not justified by a sufficient performance gain.

To capitalise on the latent performance gains highlighted in Table 2, the ML and knowledge-based outcomes need to be opportunely leveraged. To this aim, we introduced a meta-reasoning checkpoint (i.e., the prediction selection module of Section 3.2) and automatically selected a subset of ML predictions to feed to the knowledge-based reasoner. The results of this last evaluation setup are summarised in Table 3. The ML baseline was outperformed by up to 4% in terms of unweighted F1 and by up to 7% in terms of weighted F1. Moreover, introducing knowledge about the object's surface positively impacted the quality of the top-5 ranking: the mean P@5 and nDCG@5 both increased by 5%, and the hit ratio by 4%. The qualitative surface area is the feature which led to the most consistent results across the different evaluation metrics. In other words, integrating additional knowledge beyond that first feature only led to comparable results, or even degraded the performance (i.e., in the case where a four-class scale instead of a binary one is used for the object's depth). Hence, in the experimental scenario of this paper, a size representation as minimalistic as indicating whether the object exposes an extra-small, small, medium, large, or extra-large front surface area is sufficient to ensure a significant boost in performance.
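For reference, the evaluation metrics used in this section can be computed roughly as in the simplified sketch below (nDCG@5 is omitted, the per-class averaging relies on scikit-learn defaults, and the labels are made up); this is not the released evaluation code.

```python
# Illustrative sketch: unweighted ("macro") and class-support-weighted P/R/F1 on
# the top-1 predictions, plus the top-5 hit ratio.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["cup", "bin", "cup", "desk"]                      # ground truth labels
top5 = [["cup", "bin", "desk", "chair", "door"],            # top-5 ranking per test region
        ["desk", "bin", "cup", "door", "chair"],
        ["bin", "desk", "cup", "chair", "door"],
        ["desk", "cup", "bin", "door", "chair"]]
y_pred = [r[0] for r in top5]                               # top-1 predictions

macro = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
weighted = precision_recall_fscore_support(y_true, y_pred, average="weighted", zero_division=0)
hit_ratio = sum(t in r for t, r in zip(y_true, top5)) / len(y_true)

print("unweighted P/R/F1:", macro[:3])
print("weighted   P/R/F1:", weighted[:3])
print("hit ratio @5:", hit_ratio)
```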
6. Conclusion and Future Work

In this paper, we demonstrated that ML-based object recognition can be significantly augmented by a reasoning module which accounts for the typical size of objects, as hypothesised in our prior work [4]. These results are particularly promising, because they were achieved on image regions collected by a robot in its natural environment, i.e., in a more challenging setup than benchmark image collections. In the proposed approach, we relied on a novel representation of the size of objects. Differently from prior knowledge representations, here we modelled size across three dimensions (the object's front surface area, depth and aspect ratio), to further separate the object clusters. Moreover, we allowed for annotating each object class with multiple size attributes, to adequately capture the size variability within each class.

The experiments presented in this paper also highlighted a series of directions for improvement, informing our future work. First, when estimating the object size from depth data we had to deal with hardware constraints: (i) objects falling outside the range of the depth sensor were excluded; (ii) highly reflective, absorptive or transparent materials (e.g., shiny metals, glass) altered the depth measurements. As such, access to a more advanced depth sensor would further improve the performance. Second, if the object was only partially visible in the original image, the estimated measurements (albeit accurate) would fail to represent the real object's size. Thus, incorporating the capability of moving towards the target object to refine the prediction through repeated measurements (i.e., Active Vision) is likely to benefit performance.

Naturally, the relevance of background knowledge and knowledge-based reasoning for enabling Visual Intelligence spans way beyond the capability to reason about the typical size of objects. In [4], we have identified several other reasoners (e.g., spatial, compositional, motion-aware) which may enhance the robustness of state-of-the-art Machine Learning methods. Thus, in our future work, we will evaluate the performance impacts of integrating: (i) additional knowledge-based components, (ii) multiple sources of background knowledge, as well as (iii) effective meta-reasoning strategies, to reconcile the outcomes of different reasoners.

References

[1] M. Bajones, D. Fischinger, A. Weiss, D. Wolf, M. Vincze, de la Puente, et al., Hobbit: Providing Fall Detection and Prevention for the Elderly in the Real World, Journal of Robotics (2018).
[2] F. Dong, S. Fang, Y. Xu, Design and Implementation of Security Robot for Public Safety, in: 2018 International Conference on Virtual Reality and Intelligent Systems (ICVRIS), 2018, pp. 446–449.
[3] J. Waldhart, A. Clodic, R. Alami, Reasoning on Shared Visual Perspective to Improve Route Directions, in: 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 2019, pp. 1–8.
[4] A. Chiatti, E. Motta, E. Daga, Towards a Framework for Visual Intelligence in Service Robotics: Epistemic Requirements and Gap Analysis, in: Proceedings of KR 2020 - Special Session on KR & Robotics, IJCAI, 2020, pp. 905–916.
[5] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, M. Pietikäinen, Deep Learning for Generic Object Detection: A Survey, International Journal of Computer Vision 128 (2020) 261–318.
[6] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015).
[7] G. Marcus, Deep learning: A critical appraisal, arXiv preprint arXiv:1801.00631 (2018).
[8] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, S. Wermter, Continual lifelong learning with neural networks: A review, Neural Networks 113 (2019) 54–71.
[9] J. Pearl, Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution, in: Proceedings of WSDM 2018, ACM, 2018, p. 3.
[10] S. Aditya, Y. Yang, C. Baral, Integrating Knowledge and Reasoning in Image Understanding, in: Proceedings of IJCAI 2019, 2019, pp. 6252–6259.
[11] F. Gouidis, A. Vassiliades, T. Patkos, A. Argyros, N. Bassiliades, D. Plexousakis, A Review on Intelligent Object Perception Methods Combining Knowledge-based Reasoning and Machine Learning, arXiv:1912.11861 [cs] (2020).
[12] A. A. Daruna, V. Chu, W. Liu, M. Hahn, P. Khante, S. Chernova, A. Thomaz, SiRoK: Situated Robot Knowledge - Understanding the Balance Between Situated Knowledge and Variability, in: 2018 AAAI Spring Symposium Series, 2018.
[13] T. Konkle, A. Oliva, Canonical visual size for real-world objects, Journal of Experimental Psychology: Human Perception and Performance 37 (2011).
[14] B. Long, T. Konkle, M. A. Cohen, G. A. Alvarez, Mid-level perceptual features distinguish objects of different real-world sizes, Journal of Experimental Psychology: General 145 (2016) 95.
[15] E. Davis, G. Marcus, Commonsense reasoning and commonsense knowledge in artificial intelligence, Communications of the ACM 58 (2015) 92–103.
[16] H. Levesque, Common Sense, the Turing Test, and the Quest for Real AI, The MIT Press, 2017.
[17] P. J. Hayes, The Second Naive Physics Manifesto, in: Formal Theories of the Common Sense World, Ablex Publishing Corporation, 1988.
[18] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, S. J. Gershman, Building machines that learn and think like people, Behavioral and Brain Sciences 40 (2017).
[19] H. Bagherinezhad, H. Hajishirzi, Y. Choi, A. Farhadi, Are elephants bigger than butterflies? Reasoning about sizes of objects, in: Proceedings of AAAI, AAAI'16, AAAI Press, Phoenix, Arizona, 2016, pp. 3449–3456.
[20] Y. Elazar, A. Mahabal, D. Ramachandran, T. Bedrax-Weiss, D. Roth, How Large Are Lions? Inducing Distributions over Quantitative Attributes, in: Proceedings of the ACL, Association for Computational Linguistics, 2019, pp. 3973–3983.
[21] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, Communications of the ACM 60 (2017) 84–90.
[22] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in: Proceedings of CVPR, 2016, pp. 770–778.
[23] M. Mancini, H. Karaoguz, E. Ricci, P. Jensfelt, B. Caputo, Knowledge is Never Enough: Towards Web Aided Deep Open World Recognition, in: IEEE ICRA, 2019, p. 9543.
[24] G. Koch, R. Zemel, R. Salakhutdinov, Siamese neural networks for one-shot image recognition, in: ICML Deep Learning Workshop, volume 2, Lille, 2015.
[25] E. Hoffer, N. Ailon, Deep Metric Learning Using Triplet Network, Lecture Notes in Computer Science (2015) 84–92.
[26] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A Unified Embedding for Face Recognition and Clustering, in: Proceedings of the IEEE CVPR, 2015, pp. 815–823.
[27] A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo, et al., Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching, in: 2018 IEEE ICRA, IEEE, 2018, pp. 1–8.
[28] A. Daruna, W. Liu, Z. Kira, S. Chernova, RoboCSE: Robot Common Sense Embedding, in: Proceedings of ICRA, IEEE, 2019, pp. 9777–9783.
[29] K. Marino, R. Salakhutdinov, A. Gupta, The More You Know: Using Knowledge Graphs for Image Classification, in: Proceedings of IEEE CVPR, 2017, pp. 20–28.
[30] L. Serafini, A. d. Garcez, Logic Tensor Networks: Deep Learning and Logical Reasoning from Data and Knowledge, arXiv:1606.04422 [cs] (2016).
[31] R. Manhaeve, S. Dumancic, A. Kimmig, T. Demeester, L. De Raedt, DeepProbLog: Neural probabilistic logic programming, in: Advances in Neural Information Processing Systems, 2018, pp. 3749–3759.
[32] E. van Krieken, E. Acar, F. van Harmelen, Analyzing Differentiable Fuzzy Implications, in: Proceedings of KR 2020, 2020, pp. 893–903.
[33] J. Young, L. Kunze, V. Basile, E. Cabrio, N. Hawes, B. Caputo, Semantic web-mining and deep vision for lifelong object discovery, in: Proceedings of ICRA, IEEE, 2017, pp. 2774–2779.
[34] Y. Zhu, A. Fathi, L. Fei-Fei, Reasoning about Object Affordances in a Knowledge Base Representation, in: Proceedings of ECCV, volume 8690, Springer International Publishing, 2014, pp. 408–424.
[35] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, J. Taylor, Freebase: a collaboratively created graph database for structuring human knowledge, in: Proceedings of SIGMOD 2008, 2008, pp. 1247–1250.
[36] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, et al., PyTorch: An Imperative Style, High-Performance Deep Learning Library, in: Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8026–8037.
[37] L. Luo, Y. Xiong, Y. Liu, X. Sun, Adaptive Gradient Methods with Dynamic Bound of Learning Rate, in: Proceedings of ICLR, 2018.
[38] Q.-Y. Zhou, J. Park, V. Koltun, Open3D: A Modern Library for 3D Data Processing, arXiv preprint arXiv:1801.09847 (2018).