<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Making Sense of Indoor Spaces Using Semantic Web Mining and Situated Robot Perception</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jay Young</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Basile</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Suchi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lars Kunze</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nick Hawes</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Vincze</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barbara Caputo</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Oxford Robotics Institute, Dept. of Engineering Science, University of Oxford</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technische Universitat Wien</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The University of Birmingham</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Università di Roma - Sapienza</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Université Côte d'Azur</institution>
          ,
          <addr-line>Inria, CNRS, I3S</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>31</fpage>
      <lpage>40</lpage>
      <abstract>
        <p>Intelligent Autonomous Robots deployed in human environments must have an understanding of the wide range of possible semantic identities associated with the spaces they inhabit: kitchens, living rooms, bathrooms, offices, garages, etc. We believe robots should learn this information through their own exploration and situated perception in order to uncover and exploit structure in their environments, structure that may not be apparent to human engineers, or that may emerge over time during a deployment. In this work, we combine semantic web mining and situated robot perception to develop a system capable of assigning semantic categories to regions of space. This is accomplished by looking at web-mined relationships between room categories and objects identified by a Convolutional Neural Network trained on 1000 categories. Evaluated on real-world data, we show that our system exhibits several conceptual and technical advantages over similar systems, and uncovers semantic structure in the environment overlooked by ground-truth annotators.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Many tasks in Human-Robot Interaction (HRI) scenarios require autonomous
mobile service robots to relate to objects and places (or rooms) in their
environment at a semantic level. This capability is essential for interpreting task
instructions such as "Get me a mug from the kitchen" and for generating
referring expressions in real-world scenes such as "I found a red and a blue mug
in the kitchen, which one should I get?" However, in dynamic, open-world
settings such as human environments, it is simply impossible to pre-program
robots with the required knowledge about task-related objects and places in
advance. Hence, they need to be equipped with learning capabilities that allow
them to acquire knowledge of previously unknown objects and places online. In
previous work, we demonstrated how knowledge about perceived objects can be
acquired by mining textual resources [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and image databases on the Semantic
Web [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In this work, we focus on knowledge about places and investigate ways
of acquiring it using web mining and situated robot perception. In particular,
we aim to learn the semantic categories of places observed by an autonomous
mobile robot in real-world office environments.
      </p>
      <p>
        When mobile service robots are deployed in human-inhabited locations such
as offices, homes and industrial workplaces, we wish them to
be equipped with ways of learning and the ability to extend their own knowledge
on-line using information about the environment they gather through situated
experiences. This too is a difficult task, and is much more than just a matter of
data collection. Some form of semantic information is desirable too. We expect
structured and semi-structured Web knowledge sources such as DBpedia
and WordNet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to answer some of these questions. By linking robot knowledge
to entries in semantic ontologies, we can begin to exploit rich knowledge-bases
to facilitate better robot understanding of the world.
      </p>
      <p>
        One data source of interest to us is ImageNet, which is a large database
of categorised images organised using the WordNet lexical ontology. The
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] has in recent
years produced machine learning tools trained on ImageNet for object detection
and image classification. Of particular interest to us are deep learning based
approaches using Convolutional Neural Networks, trained on potentially
thousands of object categories [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This approach raises the question of how well such
predictors perform when queried with the challenging image data endemic to
mobile robot platforms, as opposed to the cleaner, and higher-resolution, data
they are typically trained and evaluated on. This domain adaptation problem is
a major difficulty in using these state-of-the-art vision techniques on robots.
Using vision techniques with (ever-growing) training sets the size of ImageNet will
allow us to extend a robot's knowledge base far beyond what it can be manually
equipped with in advance of a deployment.
      </p>
      <p>In this paper we document our work using the technologies mentioned so far
towards enabling a mobile robot to learn the semantic categories associated with
different regions of space in its environment. To do this, we employ large-scale
object recognition systems to generate semantic label hypotheses for objects
detected by robots in real-world environments. These hypotheses are linked to
structured, semantic knowledge bases such as DBpedia and WordNet, allowing us
to link a robot's situated experiences with higher-level knowledge. We then use
these object hypotheses to perform text-mining of the semantic web to produce
further hypotheses over the semantic category of particular regions of space.</p>
      <p>To summarise, this paper makes the following contributions:</p>
      <list list-type="bullet">
        <list-item><p>an unsupervised approach for learning semantic categories of indoor spaces using deep vision and semantic web mining;</p></list-item>
        <list-item><p>an evaluation of our approach on real-world robot perception data; and</p></list-item>
        <list-item><p>a proof-of-concept demonstration of how knowledge about semantic categories can be transferred to novel environments.</p></list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>Previous Work</title>
      <p>
        Space categorisation for mobile robots is an extensive, well-studied topic, and
one for which it would be impossible to provide an in-depth review in the space
available. For this, we would recommend the work of [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], which provides a
thorough survey of the wider field of robot semantic mapping to date. The majority
of work in the area of space categorisation utilises semantic cues to identify and
label regions of space such as offices, hallways, kitchens, bathrooms, laboratories,
and the partitions between them. One of the most commonly used semantic cues
is the presence of objects, and as this is also the semantic cue we use, we will
focus on this area of the work.
      </p>
      <p>
        The work of [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] realises a Bayesian approach to room categorisation, and
builds a hierarchical representation of space. This hierarchy is encoded by the
authors, who admit that their own views and experiences in regards to the
composition of these concepts could bias the system. In further work, the same authors
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] provide a more object-focused approach to space classification; however, this
again required the development and evaluation of a knowledge base linking
objects to room types. The work of Pronobis and Jensfelt [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is significant in this
area in that it integrates heterogeneous semantic cues, such as the shape, size
and appearance of rooms, with object observations. However, their system was
only capable of recognising 6 object types and 11 room categories, which again
required the gathering and annotation of much training data, and it is unclear
how well this generalises to new environments and how much re-training would
be required. Similar systems [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] exhibit the same pitfalls. The work of Hanheide
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] on the Dora platform realises a robot system capable of exploiting knowledge
about the co-occurrence of objects and rooms. This is facilitated by linkage to
the Open Mind Indoor Common Sense database, and is used for space
categorisation and to speed up object search by exploiting semantic relations between
objects and rooms.
      </p>
      <p>We argue that our approach exhibits several technical and conceptual
advantages over other pieces of work in this area:</p>
      <list list-type="bullet">
        <list-item><p>The categorisation module requires no robot perceptual data collection or training, and works fully on-line.</p></list-item>
        <list-item><p>The system is domain agnostic, not fitted to particular types of environments, room structures or organisations.</p></list-item>
        <list-item><p>We use existing, mature, tried-and-tested semantic ontologies, and as such there is no knowledge-engineering required by the system designer to use this information.</p></list-item>
        <list-item><p>The use of large-scale object recognition tools means we are not limited to a small number of objects, and the use of text-mining means we are not limited to a small number of room categories.</p></list-item>
        <list-item><p>The relations between objects and room categories are derived statistically from text mining, rather than being encoded by the developer or given by an ontology.</p></list-item>
      </list>
      <p>These key points lead to a novel way of solving the problem of space
classification on mobile robots.</p>
    </sec>
    <sec id="sec-3">
      <title>Approach Overview</title>
      <p>We use a robot platform to observe the environment at various waypoints
specified in its environment. The robot is provided with a SLAM map of its
environment, and a set of waypoints within this map. At each of these places
the robot perceives its surroundings by taking multiple views at different angles
(360°). The different views of the robot are aligned and integrated into a
consistent environment model in which object candidates are identified and clustered
into groups according to their proximity. For each object candidate, we predict
its class by using its visual appearance as an input to classifiers trained on a
large-scale object database, namely ImageNet. Based on the set of labelled (or
classified) object candidates which are in the same group, we perform a
web-based text-mining step to classify the region of space constrained by a bounding
polygon of the group of objects.</p>
      <p>In the following, we describe the individual components in more detail.</p>
    </sec>
    <sec id="sec-4">
      <title>Object Category Recognition</title>
      <p>
        Our aim is to identify the semantic labels most strongly associated with a
particular point in a robot's environment by looking at the kinds of objects that are
visible from that point. As such, it is crucial for a robot to be able to recognise
the objects that inhabit its environment. It is typical in robotics that object
recognition is facilitated by a training step prior to deployment [
        <xref ref-type="bibr" rid="ref12 ref15">12,15</xref>
        ] (though
unsupervised approaches do exist [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) whereby selected objects from the robot's
environment are learned and later re-recognised and used for space
categorisation. The advantage of this is that the robot learns to recognise objects using
models trained using its own sensors and situated conditions; however, it also
means that we must anticipate which objects a robot is likely to encounter so as
to determine which ones to learn and which to ignore. This process can also be
very time-consuming and error-prone.
      </p>
      <p>
        Previous work [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] has used Convolutional Neural Networks (CNNs) trained
on large image databases such as ImageNet, which provide databases of several
million images, for object recognition on a mobile robot. Results can vary, and
this is because the images used to train ImageNet-sourced CNNs possess very
different characteristics to those images observed by robots: robot data is often
noisy, grainy and typically low-resolution, a problem exacerbated by the difficulties
robots have in getting close to objects, especially small ones. One cause of this
is what is known as the domain adaptation problem, where the features that learning
mechanisms discover from their high-resolution training data do not robustly
and reliably map on to lower-resolution, noise-prone spaces. This is an active,
ongoing area of research in the computer vision community, the solution to which
holds the key to generic, off-the-shelf object recognition for mobile robots.
      </p>
      <p>
        We evaluated a set of state-of-the-art CNNs trained on ImageNet on a sample
(1000 object images) from one of our robot datasets. We measure our accuracy
using a WUP similarity score [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which calculates the semantic relatedness of
the ground-truth concept type against the concept predicted by the CNN by
considering the depth of their lowest common super-concept in the WordNet
ontology. A WUP score of 1.0 means two concepts are identical. The concepts
dog and cat, for instance, have a WUP relatedness score of 0.86. To compare, we
also built a wrapper for the Google Web Vision API that mapped its output to
the WordNet ontology. We evaluated Google Web Vision, the GoogleNet
CNN, and the AlexNet and ResNet152 CNNs. Our results were 0.392, 0.594,
0.590 and 0.681 respectively, given as average WUP score over the randomly
sampled 1000 images from our labelled robot dataset. As such, we chose the
ResNet152 model to work with [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
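      <p>As an illustration of the WUP measure described above, the following minimal Python sketch computes it over a small hand-written is-a hierarchy. The hierarchy, the function names and the convention that the root concept has depth 1 are assumptions made for this example only; the evaluation itself used the WordNet ontology, so the toy scores differ from the WordNet value quoted above.</p>
      <preformat><![CDATA[
```python
# Toy is-a hierarchy (hypothetical, NOT WordNet): child -> parent.
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "carnivore": "animal",
    "dog": "carnivore",
    "cat": "carnivore",
    "mug": "artifact",
}


def path_to_root(concept):
    """Return the chain of concepts from `concept` up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth of a concept, counting the root as depth 1 (an assumption)."""
    return len(path_to_root(concept))


def lowest_common_superconcept(a, b):
    """Deepest concept that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for candidate in path_to_root(b):
        if candidate in ancestors_of_a:
            return candidate
    raise ValueError("concepts share no ancestor")


def wup(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_superconcept(a, b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


def mean_wup(pairs):
    """Average WUP over (ground_truth, predicted) label pairs."""
    return sum(wup(truth, predicted) for truth, predicted in pairs) / len(pairs)
```
]]></preformat>
      <p>Under these assumptions, averaging the score over (ground-truth, predicted) label pairs with mean_wup mirrors how a single accuracy figure per classifier can be obtained from a labelled sample.</p>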
      <sec id="sec-4-1">
        <title>Scene Segmentation</title>
        <p>
          In order to identify objects we must first have an idea about where they are
in the environment. To generate object location hypotheses we make use of our
own implementation of the RGB-D depth segmentation algorithm of [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. This
is a patch-based approach, which clusters locally co-planar surfaces in RGB-D
point clouds. These initial surfaces are geometrically modelled as planes and
non-uniform rational B-splines using a best-fit approach. The adjacency relations
between those models yield a graph, and a graph-cut algorithm is applied to refine
the segmentation. Given an observation of a scene from the robot, this algorithm
returns a set of segmented candidate objects from the scene. From there, we
perform basic filtering, for instance to filter out objects that are too small or too
dark, and are likely to be erroneously segmented environmental noise. We can
then extract the 2D bounding box around each object to be passed directly to
the object recognition system.
        </p>
        <p>
There has been recent work towards developing a Semantic Web-Mining
component for mobile robot systems [
          <xref ref-type="bibr" rid="ref8 ref9">8,9</xref>
          ] which we make use of. This component
provides access to object- and scene-relevant knowledge extracted from Web
sources, and is accessed using JSON-based HTTP requests. The structure of a
request to the system describes the objects that were observed in a scene, and
has been used to identify unknown objects given their context. In this case the
service computes the semantic relatedness between each object included in the
co-occurrence structure and every object in a large set of candidate objects (the
universe) from which possible concepts are drawn. This semantic
relatedness is computed by leveraging the vectorial representation of the DBpedia
concepts provided by the NASARI resource [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The NASARI resource
represents BabelNet concepts as vectors in a high-dimensional geometric space, in
this case using Wikipedia as the source corpus. The system computes the aggregate
of the relatedness of a candidate unknown object to each of the scene objects
contained in the query, returning a ranked list of object label candidates based
on relatedness. We re-work this same approach to instead return ranked
relatedness distributions over room categories given a set of observed objects. We used
the following room categories: Kitchen, Office, Eating Area, Garage, Bathroom.
The system then provides a distribution over these categories for input sets of
objects.
        </p>
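        <p>The ranking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the web-mining service itself: the three-dimensional vectors are invented stand-ins for NASARI vectors, relatedness is taken to be cosine similarity, and the aggregate is assumed to be a sum of non-negative similarities normalised into a distribution. All function and variable names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math

# Invented toy vectors standing in for concept embeddings (assumption).
VECTORS = {
    "Kitchen": [1.0, 0.0, 0.0],
    "Office": [0.0, 1.0, 0.0],
    "Bathroom": [0.0, 0.0, 1.0],
    "mug": [0.9, 0.3, 0.1],
    "kettle": [1.0, 0.1, 0.1],
    "keyboard": [0.1, 1.0, 0.0],
}
ROOMS = ["Kitchen", "Office", "Bathroom"]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def room_distribution(object_labels, room_labels, vectors):
    """Aggregate object-to-room relatedness and normalise it to sum to 1."""
    scores = {}
    for room in room_labels:
        # Assumed aggregate: sum of similarities, clipped at zero.
        scores[room] = sum(
            max(0.0, cosine(vectors[obj], vectors[room])) for obj in object_labels
        )
    total = sum(scores.values())
    if total == 0.0:
        # No evidence at all: fall back to a uniform distribution.
        return {room: 1.0 / len(room_labels) for room in room_labels}
    return {room: score / total for room, score in scores.items()}


def rank_rooms(distribution):
    """Room labels ordered from most to least related."""
    return sorted(distribution, key=distribution.get, reverse=True)
```
]]></preformat>
        <p>With these toy vectors, a cluster containing a mug and a kettle ranks Kitchen first, while a lone keyboard ranks Office first.</p>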
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments and Results</title>
      <p>
        We employ two datasets of observations taken by our robot during two
long-term (approximately 3 months) deployments in two separate office environments a year apart.
The first dataset was labelled by a human to produce 3800 views of various
objects, with the data collection methodology following the approach of Ambrus
et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The robot is provided with a map, and a set of waypoints in the map
that it visits several times per day, performing full 360° RGB-D scans of the
environment at those points. The second dataset is as-yet unlabelled.
      </p>
      <p>We perform two main experiments. First, we demonstrate the results of our
approach on the first, human-labelled dataset gathered from site 1 (dataset G).
Since this is hand-labelled, it gives us access to a representation of the objects
encountered by the robot under ideal conditions, assuming no segmentation
errors and perfect object recognition. We sample the objects observed at
each waypoint over the period of the deployment by selecting the top-n occurring
objects, here using n = 30. From here we perform Euclidean clustering to
group objects together, producing clusters of those objects that appear within
0.5 m of one another.</p>
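      <p>The sampling and grouping steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used in the experiments: it assumes that Euclidean clustering means growing a cluster through any chain of points lying within the distance tolerance of one another, that object positions are 2D map coordinates in metres, and that the function names are free to choose.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def top_n_labels(observed_labels, n=30):
    """Return the n most frequently observed object labels at a waypoint."""
    return [label for label, _count in Counter(observed_labels).most_common(n)]


def euclidean_clusters(points, tolerance=0.5):
    """Group point indices so that each cluster is connected by hops <= tolerance."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        cluster = [seed]
        frontier = [seed]
        while frontier:
            current = frontier.pop()
            # Collect neighbours first, then move them out of the unvisited set.
            near = [
                i for i in unvisited
                if math.dist(points[current], points[i]) <= tolerance
            ]
            for i in near:
                unvisited.remove(i)
                cluster.append(i)
                frontier.append(i)
        clusters.append(sorted(cluster))
    return clusters
```
]]></preformat>
      <p>Because clusters grow through chains of neighbours, two objects further apart than the tolerance can still share a cluster if intermediate objects link them.</p>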
      <p>Each of these clusters is then incrementally sent to our text-mining module.
In return, we receive a distribution over room categories at those points in space.
After all clusters have been processed we perform a round of merging,
coalescing any clusters that possess centroids within 1.5 m of one another, and which
share the same top-ranked category. From here, we can use these new clusters to
calculate bounding polygons to produce larger, categorised spatial regions. For
a more intuitive representation, we found it helpful to include an inflation
parameter for this, because we would like to categorise the area around an object
or set of objects, which we expect is better served by a geometrical bounding
area around objects rather than treating them as points. We apply a bounding
area of 1.5 m around objects.</p>
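      <p>The merging step can be illustrated with the sketch below. It is written under several assumptions that are not taken from the original system: each cluster is a dictionary holding 2D points and a category distribution, merging is repeated until no further pair qualifies, and the bounding-polygon and inflation steps are left out. The names are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def centroid(points):
    """Mean position of a non-empty list of 2D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def top_category(distribution):
    """Category with the highest score in a {category: score} mapping."""
    return max(distribution, key=distribution.get)


def merge_clusters(clusters, max_centroid_gap=1.5):
    """Coalesce clusters with nearby centroids and the same top-ranked category.

    `clusters` is a list of {"points": [...], "distribution": {...}} dicts
    (a hypothetical structure chosen for this sketch).
    """
    merged = [
        {"points": list(c["points"]), "category": top_category(c["distribution"])}
        for c in clusters
    ]
    changed = True
    while changed:  # assumption: repeat until no pair can be merged
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i], merged[j]
                gap = math.dist(centroid(a["points"]), centroid(b["points"]))
                if a["category"] == b["category"] and gap <= max_centroid_gap:
                    a["points"].extend(b["points"])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged
```
]]></preformat>
      <p>Each merged cluster could then be passed to a polygon routine (for example a convex hull with an inflation margin) to obtain the categorised region; that choice is left open here.</p>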
      <p>In our second experiment, we perform the exact procedure described above
on data gathered from site 2 (dataset T); however, the input to the system takes
the form of dynamically segmented objects using the segmentation procedure
described previously, and using object hypotheses from the ImageNet-based CNN
approach. Since this dataset is significantly larger, we sampled from it an equal
number of observations per waypoint (4), providing us with roughly 2800
individual RGB-D clouds of scenes of the environment. Segmenting these resulted
in 85,000 segments; however, we applied a standard filtering step by ignoring any
segments that were more than 2 m away from the robot base, which filtered the
set of segments down to roughly 24,000.</p>
      <p>To evaluate our results, we provided each of the clusters of objects to five
human annotators, and asked them to identify the room categories they believed
to be most closely related to the set of objects. This was done without visual
information on the appearance of the objects or the environment in which they
were found. In the first experiment at site 1 we achieved an agreement between
the annotators and the system of 74%. In the second experiment at site 2, we
achieved an agreement of 80% between annotators and our system. In a second
round of evaluation, a different set of seven annotators were provided images
observed by the robot at each waypoint, and asked to identify the likely room
categories displayed in the images from the same set of candidate rooms provided
to the robot. We apply these ground-truth labels to the areas of space around
each waypoint. This allows us to compare these ground-truth category labels
with the labels suggested by our system. The results are shown in Figure 3. On
the map, dark blue polygons represent regions learned by our system, red squares
indicate the waypoints where the robot took observations, and light coloured
circles indicate the ground-truth label of the space around each waypoint.
Human annotators agreed on labels for these areas, so there is no variance.</p>
      <p>[Figure panels: (a) Site 1, G dataset; (b) Site 2, T dataset.]</p>
    </sec>
    <sec id="sec-6">
      <title>Discussion</title>
      <p>In our results from site 1, the system categorised three region types: kitchen,
office and eating area. Our ground-truth labellers, given the same list of candidate
rooms as the robot, only labelled kitchen and office areas. All of the office and
kitchen areas learned by the system fall into the corresponding areas labelled by
the human annotators, and represent a sub-section of that space. These were
labelled by detecting objects such as filing cabinets, computer equipment, printers,
telephones and whiteboards, which all ultimately most strongly correlated with
the office room category. But where do the eating areas come from? These areas
were labelled by detecting objects such as water bottles, coffee cups and mugs on
the desks and cabinets of workers in the deployment environment. These objects
were typically surrounded by office equipment. While comparing these region
labels to our ground-truth data would suggest the answer is wrong, we believe
that this captures a more finely-grained semantic structure in the environment
that does in fact make sense. While the regions themselves may not, to a human,
meet the requirements for a dining area, the objects encompassed within them
are far more closely linked in the data with eating areas and kitchens than they
are with computer equipment and stationery, and so the system annotates these
regions differently.</p>
      <p>At site 2 we see that the robot did not learn these characteristic eating area
regions. While inspection of the data shows that many desks do exhibit the same
structure of having mugs, cups and bottles on them in certain areas, the object
recognition system used in the second set of experiments failed to correctly
identify them. These objects are typically small, and difficult for a mobile robot
to get close to. The results for the second dataset are also noisier: there are
misclassified regions. These were caused primarily by object recognition errors,
themselves compounded by segmentation errors and sensor noise. To filter these
out, we included a filter on the system that ignored any classification result that came
back with a confidence below 0.1. Ignoring those objects completely filtered out
around 18,000 segments.</p>
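      <p>The two filters used on the site 2 data, the 2 m range limit and the 0.1 confidence threshold, amount to a simple predicate over segments. The sketch below assumes a hypothetical Segment record whose centroid is expressed in metres relative to the robot base; the field and function names are illustrative only.</p>
      <preformat><![CDATA[
```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical record for one segmented object candidate."""
    centroid: tuple      # (x, y, z) in metres, robot base at the origin (assumed)
    label: str           # label predicted by the classifier
    confidence: float    # classifier confidence in [0, 1]


def keep_segment(segment, max_range=2.0, min_confidence=0.1):
    """True if the segment is within range and confidently classified."""
    distance = math.dist(segment.centroid, (0.0, 0.0, 0.0))
    return distance <= max_range and segment.confidence >= min_confidence


def filter_segments(segments, max_range=2.0, min_confidence=0.1):
    """Keep only the segments that pass both filters."""
    return [s for s in segments if keep_segment(s, max_range, min_confidence)]
```
]]></preformat>
      <p>Keeping the thresholds as parameters makes it straightforward to study how sensitive the learned regions are to either filter.</p>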
      <p>Our system is ultimately limited by its reliance on objects to generate
hypotheses for space classification. This means that our approach is unable to
categorise areas of space such as corridors or hallways. However, it is intended
to work as a component of object-search systems, so perhaps this is not
necessary at this stage. To illustrate this, we built a query interface for the system
which takes an arbitrary object label and suggests an area of space where the
object can be found, ranking results using the semantic relations of the object
with the categories learned at each region. This allows a robot to generate priors
over possible locations of objects it has never seen before, and we view this as the
first step towards unknown object search.</p>
      <p>There are many different possible representations for the data our system
generates. We opted for a clustering and bounding-polygon based approach in order
to most clearly visualise our results, but other approaches could be used such
as flood-fill algorithms, heat-maps or potential fields. Choice of representation
should be informed by the task that is intended to make use of the information.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>In this work we presented a robot system capable of categorising regions of space
in real-world, noisy human-inhabited environments. The system used concepts
in a lexical ontology to represent object labels, and harnessed this representation
to mine relations between observed objects and room categories from corpora
of text. Transferring these relations back to the real world, we used them to
annotate the robot's world with polygons indicating specific semantic categories.
We found that the system was largely able to discover and categorise regions
similar in area to human annotators, but was also able to discover some structure
overlooked by those annotators.</p>
      <sec id="sec-7-1">
        <title>Acknowledgments</title>
        <p>The research leading to these results has received funding from EU FP7 grant
agreement No. 600623, STRANDS, and CHIST-ERA Project ALOOF.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>T.</given-names>
            <surname>Faeulhammer</surname>
          </string-name>
          , et al.: "
          <article-title>Autonomous learning of object models on a mobile robot</article-title>
          ,"
          <source>IEEE RAL</source>
          , vol. PP, no.
          <issue>99</issue>
          , pp.
          <fpage>1</fpage>
          –
          <lpage>1</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Kilgarriff</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          , "
          <article-title>WordNet: An electronic lexical database</article-title>
          ,"
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>O.</given-names>
            <surname>Russakovsky</surname>
          </string-name>
          , et al. "
          <article-title>ImageNet Large Scale Visual Recognition Challenge</article-title>
          ,"
          <source>International Journal of Computer Vision (IJCV)</source>
          , vol.
          <volume>115</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>211</fpage>
          –
          <lpage>252</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , et al. "
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          ,"
          <source>in Advances in Neural Information Processing Systems</source>
          25,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J. C.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          , Eds. Curran Associates, Inc.,
          <year>2012</year>
          , pp.
          <fpage>1097</fpage>
          –
          <lpage>1105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmer</surname>
          </string-name>
          , "
          <article-title>Verbs semantics and lexical selection</article-title>
          ," in ACL, ser.
          <source>ACL '94</source>
          . Stroudsburg, PA, USA: Association for Computational Linguistics,
          <year>1994</year>
          , pp.
          <fpage>133</fpage>
          –
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          , et al. "
          <article-title>NASARI: a novel approach to a semantically-aware representation of items</article-title>
          ," in HLT-NAACL,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Chai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          , Eds.
          <source>The Association for Computational Linguistics</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>567</fpage>
          –
          <lpage>577</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>R.</given-names>
            <surname>Ambrus</surname>
          </string-name>
          , et al., "
          <article-title>Meta-rooms: Building and maintaining long term spatial models in a dynamic world</article-title>
          ," in
          <source>2014 IEEE/RSJ International Conference on Intelligent Robots and Systems</source>
          . IEEE,
          <year>2014</year>
          , pp.
          <fpage>1854</fpage>
          -
          <lpage>1861</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>J.</given-names>
            <surname>Young</surname>
          </string-name>
          , et al., "
          <article-title>Towards lifelong object learning by integrating situated robot perception and semantic web mining</article-title>
          ," in
          <source>Proceedings of the European Conference on Artificial Intelligence (ECAI)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>J.</given-names>
            <surname>Young</surname>
          </string-name>
          , et al., "
          <article-title>Semantic Web-Mining and Deep Vision for Lifelong Object Discovery</article-title>
          ," in
          <source>Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Vasudevan</surname>
            ,
            <given-names>Shrihari</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Roland</given-names>
            <surname>Siegwart</surname>
          </string-name>
          . "
          <article-title>Bayesian space conceptualization and place classification for semantic maps in mobile robotics</article-title>
          ."
          <source>Robotics and Autonomous Systems</source>
          <volume>56</volume>
          , no.
          <issue>6</issue>
          (
          <year>2008</year>
          ):
          <fpage>522</fpage>
          -
          <lpage>537</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Vasudevan</surname>
            ,
            <given-names>Shrihari</given-names>
          </string-name>
          , et al. "
          <article-title>Cognitive maps for mobile robots - an object based approach</article-title>
          ."
          <source>Robotics and Autonomous Systems</source>
          <volume>55</volume>
          , no.
          <issue>5</issue>
          (
          <year>2007</year>
          ):
          <fpage>359</fpage>
          -
          <lpage>371</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Pronobis</surname>
            ,
            <given-names>Andrzej</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Patric</given-names>
            <surname>Jensfelt</surname>
          </string-name>
          . "
          <article-title>Large-scale semantic mapping and reasoning with heterogeneous modalities</article-title>
          ."
          <source>IEEE International Conference on Robotics and Automation</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kostavelis</surname>
            ,
            <given-names>Ioannis</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Antonios</given-names>
            <surname>Gasteratos</surname>
          </string-name>
          . "
          <article-title>Semantic mapping for mobile robotics tasks: A survey</article-title>
          ."
          <source>Robotics and Autonomous Systems</source>
          <volume>66</volume>
          (
          <year>2015</year>
          ):
          <fpage>86</fpage>
          -
          <lpage>103</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Zender</surname>
            ,
            <given-names>Hendrik</given-names>
          </string-name>
          , et al. "
          <article-title>Conceptual spatial representations for indoor mobile robots</article-title>
          ."
          <source>Robotics and Autonomous Systems</source>
          <volume>56</volume>
          , no.
          <issue>6</issue>
          (
          <year>2008</year>
          ):
          <fpage>493</fpage>
          -
          <lpage>502</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Hanheide</surname>
            ,
            <given-names>Marc</given-names>
          </string-name>
          , et al. "
          <article-title>Dora, a robot exploiting probabilistic knowledge under uncertain sensing for efficient object search</article-title>
          ." In
          <source>Proceedings of Systems Demonstration of the 21st International Conference on Automated Planning and Scheduling (ICAPS)</source>
          , Freiburg, Germany.
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Potapova</surname>
            ,
            <given-names>Ekaterina</given-names>
          </string-name>
          , et al. "
          <article-title>Attention-driven object detection and segmentation of cluttered table scenes using 2.5D symmetry</article-title>
          ." In
          <source>Robotics and Automation (ICRA), 2014 IEEE International Conference on</source>
          , pp.
          <fpage>4946</fpage>
          -
          <lpage>4952</lpage>
          . IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Kaiming</given-names>
          </string-name>
          , et al. "
          <article-title>Deep residual learning for image recognition</article-title>
          ." In
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>