<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>KILLE: learning grounded language through interaction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Simon Dobnik</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik Wouter de Graaf</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Philosophy, Linguistics &amp; Theory of Science, University of Gothenburg</institution>
          ,
          <country country="SE">Sweden</country>
        </aff>
      </contrib-group>
      </contrib-group>
      <abstract>
        <p>Testing and computational implementation of formal models of situated linguistic interaction imposes demands on computational infrastructure. We present our system called KILLE and provide a proof-of-concept evaluation of interactive situated learning of object categories and spatial relations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Contemporary approaches to the semantics of natural language
        <xref ref-type="bibr" rid="ref4 ref9">(Cooper, 2016; Fernández et al., 2011)</xref>
        are based on two important premises: (i) meanings are not universal and static but are agent-relative and are continuously adapted in interaction with other agents and the environment
        <xref ref-type="bibr" rid="ref2 ref22">(Clark, 1996; Pickering and Garrod, 2004)</xref>
        ; and (ii) meanings (sense and reference) are multi-modal: different lexical items are sensitive to different modalities in different contexts and to different degrees
        <xref ref-type="bibr" rid="ref5">(Coventry and Garrod, 2005)</xref>
        .
      </p>
      <p>
        Both aspects have shifted the focus in computational semantics from engineering formal rules that cover a domain or a fragment of linguistic data off-line to approaches that are data-driven and involve continuous online fine-tuning of the model’s parameters
        <xref ref-type="bibr" rid="ref19">(Skočaj et al., 2011; Matuszek et al., 2012)</xref>
        . In robotics this shift happened much earlier, as it quickly became apparent that robots with static models cannot deal with changes in the environment or with the environment’s uncertainty. Instead, modern robotics uses models which are learned from data and refined continuously as the robot’s interaction with the environment develops (for example,
        <xref ref-type="bibr" rid="ref6">(Dissanayake et al., 2001)</xref>
        for map building). We argue that the same paradigm should also be adopted for computational models of language. In this view the focus of building a computational system is not on designing representations but on investigating and modelling interactive strategies or dialogue games
        <xref ref-type="bibr" rid="ref13">(Kowtko et al., 1992)</xref>
        that allow the construction of such representations or the fine-tuning of their features, depending on how much of the representations is available to such a system in advance.1
      </p>
      <p>
        The interactive semantics of a computational system also has implications for the models of meaning used. The predominant semantic representations in computational semantics today are vector-space representations, which define meaning as semantic similarity between lexical items on the basis of their co-occurrence in contexts
        <xref ref-type="bibr" rid="ref27 ref3">(Turney et al., 2010; Clark, 2015)</xref>
        . Such models can be extracted from large corpora of text and are very successful in representing meaning. However, they represent meaning in an indirect way, as they never consider the relation between an expression and the situations to which that expression applies or of which it is true. The reason why words in particular linguistic contexts are lexically similar is that the words in linguistic strings as a whole refer to (more or less) the same situations, to which we do not have access, or which we ignore, when we build vector space models. In the interactive scenario described above, however, we can explore linking linguistic expressions and perceptual features directly, a process which is commonly known as grounding
        <xref ref-type="bibr" rid="ref10 ref24">(Harnad, 1990; Roy, 2002)</xref>
        . Such models are required for situated dialogue agents or conversational robots which have to link language and the situations that they jointly attend to with human conversational partners.2
        1This sounds similar to Chomsky’s innateness claim, but here we are concerned purely with engineering a system and make no claims about human cognition.
      </p>
      <p>
        Grounded meanings of linguistic descriptions such as “close to the table” and “red” correspond to some function from physical or colour space to a degree of acceptability of that description
        <xref ref-type="bibr" rid="ref12 ref17 ref19 ref20 ref24 ref8">(Logan and Sadler, 1996; Roy, 2002; Skočaj et al., 2011; Matuszek et al., 2012; Kennington and Schlangen, 2015; McMahan and Stone, 2015)</xref>
        . Cognitive structures are hierarchically organised at several representation layers which focus on and combine different modalities
        <xref ref-type="bibr" rid="ref14">(Kruijff et al., 2007)</xref>
        . Since these functions predict distributions over degrees of applicability, several descriptions may be equally applicable to the same perceptual situation: the chair can be “close to the table” or “to the left of the table”, which means that vagueness is prevalent in grounding. This, however, can be resolved through interaction by adopting appropriate interaction strategies
        <xref ref-type="bibr" rid="ref11 ref25 ref8">(Kelleher et al., 2005; Skantze et al., 2014; Dobnik et al., 2015)</xref>
        .
      </p>
      <p>
        A formal model of perceptual semantics in interaction has been the focus of Type Theory with Records (TTR)
        <xref ref-type="bibr" rid="ref4 ref15 ref7">(Cooper, 2016; Larsson, 2013; Dobnik et al., 2013)</xref>
        . Implementing, validating and testing such models imposes complex demands on computational infrastructure, in the sense that it involves connecting perceptual sensors with dialogue systems and machine-learning algorithms. Processing language in interaction also presents challenges from the computational perspective, as it is often not trivial to employ existing language-technology tools and (machine-learning) algorithms, which were developed for processing data offline, in an interactive tutoring scenario. To address both issues we have developed KILLE3 (Kinect Is Learning LanguagE), a framework for situated agents that learn grounded language incrementally and online with the help of a human tutor. This paper focuses on the construction of the KILLE framework and its properties, and it also provides a proof-of-concept evaluation of such learning of simple representations of objects and spatial relations. We hope that this framework will be a useful tool for the future study and computational modelling of language in interaction.
        2It is important to emphasise nonetheless that vector space models may provide an important source of background knowledge in such a scenario, and hence a dialogue agent does not have to learn every meaning representation through grounding. The challenges of integrating both kinds of meaning representation are a focus of ongoing research.
        3Swedish for “fellow”, “chap” or “bloke”.
      </p>
    </sec>
    <sec id="sec-2">
      <title>The KILLE system</title>
      <p>
        KILLE is a non-mobile table-top robot connecting a Kinect sensor with image processing (libfreenect), classification (clustering of visual features and location classification) and the spoken dialogue system OpenDial4
        <xref ref-type="bibr" rid="ref16">(Lison, 2013)</xref>
        , connected through the Robot Operating System (ROS)
        <xref ref-type="bibr" rid="ref23">(Quigley et al., 2009)</xref>
        . The latter is a popular robotic middleware which ensures communication between these modules. It runs on a variety of popular robotic hardware implementations, which means that our system could be ported to them without too much modification (Figure 1). We prefer a robotic middleware to systems centred around dialogue systems because it allows us to represent and exchange perceptual and linguistic information together and in the same way: there is one information state for both. In addition to the integration of these modules, our main contribution is the implementation of ROSDial, which provides an interface between OpenDial and ROS; the implementation of KILLE Core, which provides perceptual and spatial classification; and the implementation of dialogue games that interface between dialogue and perceptual classification and thereby enable incremental perceptual learning.
      </p>
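      <p>The idea of a single shared information state for perceptual and linguistic variables can be illustrated with a minimal sketch. The variable names, the update functions and the dictionary representation below are our own illustration, not the actual KILLE, OpenDial or ROS interfaces:</p>
      <preformat>
```python
# A minimal, hypothetical sketch of one shared information state that both
# perceptual and linguistic modules read from and write to (in KILLE this
# exchange happens over ROS topics). All names here are illustrative only.

state = {"last_utterance": None, "observed_object": None, "dialogue_move": None}

def on_perception(sift_descriptor_count, predicted_label):
    # A perception module publishes its classification result into the state.
    state["observed_object"] = {"label": predicted_label,
                                "n_features": sift_descriptor_count}

def on_speech(utterance):
    # The dialogue manager updates the state and selects a move that can
    # consult perceptual variables in the very same state.
    state["last_utterance"] = utterance
    if utterance.startswith("What is this") and state["observed_object"]:
        state["dialogue_move"] = "answer:" + state["observed_object"]["label"]
    else:
        state["dialogue_move"] = "clarify"

on_perception(143, "cup")
on_speech("What is this?")
print(state["dialogue_move"])  # answer:cup
```
      </preformat>
      <p>Because both modules share one state, a dialogue rule can condition directly on perceptual variables without any translation layer between the two subsystems.</p>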
      <p>The system learns to recognise objects presented to it by a human tutor from scratch. It can direct the learning by asking for more objects of a particular category if it is not able to classify them with sufficient reliability, thus filling in the missing knowledge. If more objects of a particular category are available in the scene and the system is able to recognise them, the system queries the user to describe the spatial relations between them. Each of these kinds of descriptions focuses on different perceptual features (visual features vs. location), and together they represent two fundamental linguistic semantic categories: entities and relations between entities. Overall, KILLE combines both passive (being instructed) and active (requesting examples to learn from) learning, which is incremental in terms of learning to recognise the situations denoted by the lexical categories. A demo of the system can be found at http://www.youtube.com/watch?v=AsEgaka6tH0.</p>
      <p>Below are some examples of the dialogue games that we implemented as dialogue rules in the dialogue manager and which act upon both linguistic and perceptual knowledge.</p>
      <sec id="sec-2-1">
        <title>Dialogue games</title>
        <p>4https://github.com/plison/opendial</p>
        <p>[Figure 1: the system architecture. RGB frames, depth frames and speech input are exchanged between modules as ROS topics via Roscore, together with dialogue variables and acknowledgement or recognition results.]</p>
        <p>1. Direct instruction: objects</p>
        <p>U: This is a cup.</p>
        <p>S: [Object is focused on and observed.]</p>
        <p>S: OK, I learned a cup.</p>
        <p>2. Describing objects and relations with (un)certainty</p>
        <p>U: What is this?</p>
        <p>S: [Object is focused on and observed.]</p>
        <p>S: The object is thought to be a book, but it might also be a mug.</p>
        <p>3. Reinforcement from feedback</p>
        <p>U: What is this?</p>
        <p>S: [Object is focused on and observed.]</p>
        <p>S: This is a cup.</p>
        <p>U: That is correct.</p>
        <p>S: [Observation is added to the models.]</p>
        <p>4. Self-initiative: spatial relations</p>
        <p>S: [Objects are focused on and observed.]</p>
        <p>S: One of the objects is a book, the other one is a box. Please tell me where things are.</p>
        <p>U: The book is to the left of the box.</p>
        <p>S: OK, this relation is called “to the left of”.</p>
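        <p>Dialogue games 1 and 2 above can be sketched as condition-action rules. The rule format, the regular expressions, the 0.6 confidence margin and the classifier stub below are our own illustrative assumptions, not OpenDial’s actual rule syntax:</p>
        <preformat>
```python
import re

# A rough sketch of dialogue games 1 and 2 as condition-action rules.
# The threshold and the classifier stub are illustrative assumptions.

def classify(scene):
    # Stub: would return (label, confidence) pairs from the visual classifier.
    return scene

def respond(utterance, scene):
    m = re.match(r"This is an? (\w+)\.?", utterance)
    if m:  # game 1: direct instruction stores a labelled observation
        return "OK, I learned a " + m.group(1) + "."
    if utterance.startswith("What is this"):  # game 2: hedged description
        ranked = sorted(classify(scene), key=lambda p: -p[1])
        best, second = ranked[0], ranked[1]
        if best[1] - second[1] > 0.6:
            return "This is a " + best[0] + "."
        return ("The object is thought to be a " + best[0] +
                ", but it might also be a " + second[0] + ".")
    return "Please tell me more."

print(respond("This is a cup.", []))
print(respond("What is this?", [("book", 0.5), ("mug", 0.4)]))
```
        </preformat>
        <p>When the margin between the two best candidates is small, the system hedges its answer, as in game 2 above; a large margin licenses the confident answer of game 3.</p>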
        <p>
          For visual representations we use Open Source Computer Vision (OpenCV)5
          <xref ref-type="bibr" rid="ref1">(Bradski and Kaehler, 2008)</xref>
          , a popular library for computer vision including real-time machine learning applications. Through ROS we receive real-time frames from the Kinect which include data from both the depth sensor and the visual RGB sensor. We use the depth information (which gives us a precise 3D location of the points making up an object) to detect the object in focus, and we then take the pixels representing these points in focus and detect SIFT (Scale-Invariant Feature Transform) features
          <xref ref-type="bibr" rid="ref18">(Lowe, 1999)</xref>
          over them, which are used to represent objects in our model as shown in Figure 2.
        </p>
        <p>Objects, including those that are very similar and belong to the same category, have a different number of detected SIFT descriptors depending on</p>
      </sec>
      <sec id="sec-2-2">
        <title>Feature variability</title>
        <p>5http://opencv.org</p>
        <p>[Figure 2: panels (a)-(d): example objects with detected SIFT features.]</p>
        <p>their visual properties: some objects have more visual details than others. There is a bias that objects with fewer features match objects with more (and similar-looking) features. In our interactive scenario there is also no guarantee that the same features will be detected after the object is re-introduced (or even between two successive scans), as the captured frame will be slightly different from the previously captured one because of slight changes in location, lighting and camera noise.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Interactive perceptual learning</title>
      <p>
        In the following subsections we present a proof-of-concept implementation and evaluation of perceptual learning through interaction which demonstrates the usability of the KILLE framework. Learning to recognise objects To recognise objects we developed a nearest-neighbour classification method based on the FLANN library
        <xref ref-type="bibr" rid="ref21">(Muja and Lowe, 2009)</xref>
        , which works by comparing the SIFT descriptors of the object to be classified with those of the objects in the database and returning the class of the closest matching object. In the evaluation, 10 consecutive scans are taken and their recognition scores are averaged to a single score. This improves the accuracy but increases the classification time (which is nonetheless still reasonable for the small domain of objects we are considering). The location of the recognised object is estimated by taking the locations of the twenty matched descriptors with the shortest distance.
      </p>
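      <p>The nearest-neighbour matching step can be sketched as follows. KILLE uses FLANN over 128-dimensional SIFT descriptors; for a self-contained illustration we substitute 2-D toy descriptors and brute-force search, so both the descriptors and the database below are invented:</p>
      <preformat>
```python
import math

# A simplified sketch of nearest-neighbour classification over descriptor
# sets: each query descriptor is matched to its closest database descriptor,
# and the object with the lowest average matching distance wins.

def match_score(query_descs, object_descs):
    # Average distance from each query descriptor to its closest match;
    # lower is better.
    total = 0.0
    for q in query_descs:
        total += min(math.dist(q, d) for d in object_descs)
    return total / len(query_descs)

def classify(query_descs, database):
    # Return the label of the database object whose descriptors match best.
    return min(database,
               key=lambda label: match_score(query_descs, database[label]))

database = {
    "cup":    [(0.1, 0.2), (0.9, 0.8), (0.4, 0.5)],
    "banana": [(5.0, 5.1), (5.5, 4.9), (6.0, 5.2)],
}
print(classify([(0.2, 0.2), (0.8, 0.9)], database))  # cup
```
      </preformat>
      <p>FLANN replaces the brute-force inner loop with an approximate index, which is what keeps classification time reasonable as the database grows.</p>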
      <p>To evaluate the system’s performance in an interactive tutoring scenario we chose the following 10 objects: apple, banana, teddy bear, book, cap, car, cup, can of paint, shoe and shoe-box. A human tutor successively re-introduces the same 10 objects to the system in a pre-defined order over four rounds, trying to keep the presentation as identical as possible. In each round all objects are first learned and then queried. To avoid ASR errors, text input is used both in learning and in generation.</p>
      <p>Taking the average SIFT feature matching scores over 4 rounds for each object and taking the class of the object with the highest mean score, all but one object were recognised correctly on average. However, the cap was consistently confused with the banana. There were a couple of individual confusions that were levelled out in the calculation of the average score. To test how distinct objects are from one another we calculated the difference between the matching score of the highest-ranking object of the correct category and that of the other highest-ranking candidate. If we arrange objects by this score, we get the following ranking (from most distinct to least distinct): book &gt; car &gt; shoe &gt; cup &gt; banana &gt; bear &gt; apple &gt; paint &gt; shoe-box &gt; cap. We also tested recognition of the same objects when rotated and recognition of new objects of the same category.</p>
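      <p>The distinctness score above is simply the margin between the correct class and its best competitor. A small sketch, with invented toy scores rather than the paper’s measurements:</p>
      <preformat>
```python
# Distinctness as the margin between the matching score of the correct class
# and the best competing class. The scores here are invented toy numbers.

def distinctness(scores, correct):
    # scores: {label: mean matching score}; higher means a better match.
    competitors = [s for label, s in scores.items() if label != correct]
    return scores[correct] - max(competitors)

scores = {"book": 0.9, "car": 0.3, "cup": 0.25}
print(round(distinctness(scores, "book"), 2))  # 0.6: a very distinct object
```
      </preformat>
      <p>Objects with a margin near zero (like the cap in the ranking above) are the ones most likely to be confused with a competitor.</p>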
      <p>Learning to recognise spatial relations Before spatial relations can be learned, the system must recognise the target and the landmark objects (“the gnome/TARGET is to the left of the book/LANDMARK”) both in the linguistic string and in the perceptual scene. The twenty highest-ranking SIFT features are taken for each object and their x (width), y (height) and z (depth) coordinates are averaged, giving us the centroid of the 20 most salient features of an object. The coordinate frame is then transposed to the centre of the landmark object. The location of the target relative to the landmark is fed to a linear Support Vector Classifier (SVC) with the descriptions as target classes.</p>
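      <p>The geometric preprocessing described above can be sketched as follows. The feature points below are toy values (in KILLE they come from the Kinect depth data), and the SVC step itself is only indicated in a comment:</p>
      <preformat>
```python
# A sketch of the geometric preprocessing: average the x, y, z coordinates
# of the top-ranked SIFT features into a centroid, then express the target's
# centroid relative to the landmark's centroid.

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def relative_location(target_points, landmark_points):
    t, l = centroid(target_points), centroid(landmark_points)
    # This (dx, dy, dz) vector is the feature vector given to the linear SVC,
    # with the spatial descriptions as target classes.
    return tuple(t[i] - l[i] for i in range(3))

target = [(1.0, 2.0, 3.0), (1.2, 2.2, 3.2)]
landmark = [(0.0, 0.0, 0.0), (0.2, 0.2, 0.2)]
print(relative_location(target, landmark))  # roughly (1.0, 2.0, 3.0)
```
      </preformat>
      <p>Transposing the frame to the landmark makes the classifier invariant to where the pair of objects sits on the table: only their relative configuration matters.</p>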
      <p>A human tutor taught the system by presenting it with the target object (a book) randomly 3 times at 16 different locations (2 distances/circles containing 8 points separated by 45°) in relation to the landmark (the car). The spatial descriptions that the human instructor used were to the left of, to the right of, in front of, behind, near and close to (6 in total). The performance of the system was evaluated by two human conversational partners, one of whom was also the tutor from the learning stage. The target object was randomly placed in one of the 16 locations, and each location was used twice, which gave us 32 generations. A particular location may be described with several spatial descriptions (although not all combinations of descriptions are possible), but some may be more appropriate than others. The evaluators first wrote down a description they would use to describe the scene, and then the system was queried about the location of the target, to which it provided a response. The evaluators then also recorded whether they agreed with the generation. The observed blind agreement between the evaluators is 0.5313 with κ = 0.4313, which means that choosing a spatial description is quite a subjective task. The blind agreement between the evaluators and the system is 0.2344 with κ = 0.0537. The evaluators were happy with the system’s generation in an additional 37.5% of cases, which means that the system generated an appropriate description in 60.94% of cases, which is encouraging and comparable to similar tasks in the literature. Note also that the system tried to learn continuous functions from a very small number of examples, on average only 46/6 ≈ 8 instances per description.</p>
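      <p>The agreement figures above are observed agreement and Cohen’s κ. A minimal sketch of how both are computed for two annotators; the label sequences below are toy data, not the actual evaluation annotations:</p>
      <preformat>
```python
from collections import Counter

# Observed agreement and Cohen's kappa for two annotators over the same
# items. Kappa corrects observed agreement for agreement expected by chance.

def cohen_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label
    # independently, given their individual label distributions.
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return observed, (observed - expected) / (1 - expected)

a = ["left", "left", "near", "behind"]
b = ["left", "near", "near", "behind"]
obs, kappa = cohen_kappa(a, b)
print(round(obs, 2), round(kappa, 2))  # 0.75 0.64
```
      </preformat>
      <p>The gap between observed agreement (0.5313) and κ (0.4313) for the two evaluators reflects exactly this chance correction, which is why κ is the more honest measure of how subjective the description task is.</p>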
    </sec>
    <sec id="sec-4">
      <title>Conclusion and future work</title>
      <p>In this paper we argue that there is a need for a computational infrastructure that allows us to model dynamic grounded semantics in interaction, for two reasons: (i) to verify semantic theories and (ii) to provide a platform for their computational implementation. We developed a framework called KILLE, a simple interactive “robot”, which we argue provides a good solution for modelling these aspects and at the same time can be ported to more sophisticated robotic hardware platforms. We demonstrated a proof-of-concept of learning object categories and spatial relations following the theoretical proposals in the literature. We hope that the platform will prove useful for testing further models of linguistic and perceptual interaction.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Gary</given-names>
            <surname>Bradski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Adrian</given-names>
            <surname>Kaehler</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Learning OpenCV: Computer vision with the OpenCV library.</article-title>
          <source>O'Reilly Media, Inc.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Herbert H.</given-names>
            <surname>Clark</surname>
          </string-name>
          .
          <year>1996</year>
          .
          <article-title>Using language</article-title>
          . Cambridge University Press, Cambridge.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Clark</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Vector space models of lexical meaning</article-title>
          .
          <source>In Shalom Lappin and Chris Fox</source>
          , editors,
          <source>Handbook of Contemporary Semantics - second edition</source>
          , chapter
          <volume>16</volume>
          , pages
          <fpage>493</fpage>
          -
          <lpage>522</lpage>
          . Wiley - Blackwell.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Robin</given-names>
            <surname>Cooper</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Type theory and language: From perception to linguistic communication</article-title>
          .
          <source>Draft of chapters 1-6</source>
          , 30th November.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Kenny</given-names>
            <surname>Coventry</surname>
          </string-name>
          and
          <string-name>
            <given-names>Simon</given-names>
            <surname>Garrod</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Spatial prepositions and the functional geometric framework. towards a classification of extra-geometric influences</article-title>
          .
          <source>In Laura Anne Carlson and Emile van der Zee</source>
          , editors,
          <article-title>Functional features in language and space: insights from perception, categorization, and development</article-title>
          , volume
          <volume>2</volume>
          , pages
          <fpage>149</fpage>
          -
          <lpage>162</lpage>
          . OUP.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>M. W. M. G.</given-names>
            <surname>Dissanayake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. F.</given-names>
            <surname>Durrant-Whyte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clark</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Csorba</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>A solution to the simultaneous localization and map building (SLAM) problem</article-title>
          .
          <source>IEEE Transactions on Robotics and Automation</source>
          ,
          <volume>17</volume>
          (
          <issue>3</issue>
          ):
          <fpage>229</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Simon</given-names>
            <surname>Dobnik</surname>
          </string-name>
          , Robin Cooper, and
          <string-name>
            <given-names>Staffan</given-names>
            <surname>Larsson</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Modelling language, action, and perception in Type Theory with Records</article-title>
          .
          <source>In Denys Duchier and Yannick Parmentier</source>
          , editors,
          <source>Constraint Solving and Language Processing (CSLP</source>
          <year>2012</year>
          ),
          <source>Revised Selected Papers, v8114 of LNCS</source>
          , pages
          <fpage>70</fpage>
          -
          <lpage>91</lpage>
          . Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Simon</given-names>
            <surname>Dobnik</surname>
          </string-name>
          , Christine Howes, and
          <string-name>
            <given-names>John D.</given-names>
            <surname>Kelleher</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Changing perspective: Local alignment of reference frames in dialogue</article-title>
          .
          <source>In Proceedings of goDIAL - Semdial</source>
          <year>2015</year>
          , pages
          <fpage>24</fpage>
          -
          <lpage>32</lpage>
          , Gothenburg, Sweden,
          <fpage>24</fpage>
          -
          <lpage>26th</lpage>
          August.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Raquel</given-names>
            <surname>Fernández</surname>
          </string-name>
          , Staffan Larsson, Robin Cooper, Jonathan Ginzburg, and
          <string-name>
            <given-names>David</given-names>
            <surname>Schlangen</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Reciprocal learning via dialogue interaction: Challenges and prospects</article-title>
          .
          <source>In Proceedings of the IJCAI 2011 ALIHT Workshop</source>
          , Barcelona, Catalonia, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Stevan</given-names>
            <surname>Harnad</surname>
          </string-name>
          .
          <year>1990</year>
          .
          <article-title>The symbol grounding problem</article-title>
          . Physica D,
          <volume>42</volume>
          (
          <issue>1-3</issue>
          ):
          <fpage>335</fpage>
          -
          <lpage>346</lpage>
          , June.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kelleher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Costello</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>van Genabith</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Dynamically structuring updating and interrelating representations of visual and linguistic discourse</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>167</volume>
          :
          <fpage>62</fpage>
          -
          <lpage>102</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Casey</given-names>
            <surname>Kennington</surname>
          </string-name>
          and
          <string-name>
            <given-names>David</given-names>
            <surname>Schlangen</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Simple learning and compositional application of perceptually grounded word meanings for incremental reference resolution</article-title>
          .
          <source>In ACL-IJCNLP</source>
          <year>2015</year>
          , pages
          <fpage>292</fpage>
          -
          <lpage>301</lpage>
          , Beijing, China, July. ACL.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Jacqueline C.</given-names>
            <surname>Kowtko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Stephen D.</given-names>
            <surname>Isard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Gwyneth M.</given-names>
            <surname>Doherty</surname>
          </string-name>
          .
          <year>1992</year>
          .
          <article-title>Conversational games within dialogue</article-title>
          .
          <source>HCRC research paper RP-31</source>
          , University of Edinburgh.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Geert-Jan M.</given-names>
            <surname>Kruijff</surname>
          </string-name>
          , Hendrik Zender, Patric Jensfelt, and
          <string-name>
            <given-names>Henrik I.</given-names>
            <surname>Christensen</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Situated dialogue and spatial organization: what, where... and why?</article-title>
          <source>International Journal of Advanced Robotic Systems</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
          <fpage>125</fpage>
          -
          <lpage>138</lpage>
          .
          <article-title>Special issue on human and robot interactive communication</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Staffan</given-names>
            <surname>Larsson</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Formal semantics for perceptual classification</article-title>
          .
          <source>Journal of Logic and Computation</source>
          , online:
          <fpage>1</fpage>
          -
          <lpage>35</lpage>
          , December 18.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Lison</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Structured Probabilistic Modelling for Dialogue Management</article-title>
          .
          <source>Ph.D. thesis</source>
          , Department of Informatics,
          <source>Faculty of Mathematics and Natural Sciences</source>
          , University of Oslo, 30th October.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Gordon D.</given-names>
            <surname>Logan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Daniel D.</given-names>
            <surname>Sadler</surname>
          </string-name>
          .
          <year>1996</year>
          .
          <article-title>A computational analysis of the apprehension of spatial relations</article-title>
          . In Paul Bloom, Mary A.
          <string-name>
            <surname>Peterson</surname>
          </string-name>
          , Lynn Nadel, and Merrill F. Garrett, editors,
          <source>Language and Space</source>
          , pages
          <fpage>493</fpage>
          -
          <lpage>530</lpage>
          . MIT Press, Cambridge, MA.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>David G</given-names>
            <surname>Lowe</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Object recognition from local scale-invariant features</article-title>
          .
          <source>In Proceedings of the Seventh IEEE International Conference on Computer Vision</source>
          , volume
          <volume>2</volume>
          , pages
          <fpage>1150</fpage>
          -
          <lpage>1157</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Cynthia</given-names>
            <surname>Matuszek</surname>
          </string-name>
          , Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and
          <string-name>
            <given-names>Dieter</given-names>
            <surname>Fox</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>A joint model of language and perception for grounded attribute learning</article-title>
          .
          <source>In Proceedings of ICML 2012</source>
          , Edinburgh, Scotland, June 27th - July 3rd.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Brian</given-names>
            <surname>McMahan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Stone</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A Bayesian model of grounded color semantics</article-title>
          .
          <source>Transactions of the ACL</source>
          ,
          <volume>3</volume>
          :
          <fpage>103</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Marius</given-names>
            <surname>Muja</surname>
          </string-name>
          and David G. Lowe.
          <year>2009</year>
          .
          <article-title>Fast approximate nearest neighbors with automatic algorithm configuration</article-title>
          .
          <source>VISAPP (1)</source>
          , pages
          <fpage>331</fpage>
          -
          <lpage>340</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Martin J.</given-names>
            <surname>Pickering</surname>
          </string-name>
          and
          <string-name>
            <given-names>Simon</given-names>
            <surname>Garrod</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Toward a mechanistic psychology of dialogue</article-title>
          .
          <source>Behavioral and Brain Sciences</source>
          ,
          <volume>27</volume>
          (
          <issue>2</issue>
          ):
          <fpage>169</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Morgan</given-names>
            <surname>Quigley</surname>
          </string-name>
          , Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y Ng.
          <year>2009</year>
          .
          <article-title>ROS: an open-source robot operating system</article-title>
          .
          <source>In ICRA workshop on open source software</source>
          , volume
          <volume>3</volume>
          , page
          <fpage>5</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Deb K.</given-names>
            <surname>Roy</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Learning visually-grounded words and syntax for a scene description task</article-title>
          .
          <source>Computer Speech and Language</source>
          ,
          <volume>16</volume>
          (
          <issue>3</issue>
          ):
          <fpage>353</fpage>
          -
          <lpage>385</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Gabriel</given-names>
            <surname>Skantze</surname>
          </string-name>
          , Anna Hjalmarsson, and
          <string-name>
            <given-names>Catharine</given-names>
            <surname>Oertel</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Turn-taking, feedback and joint attention in situated human-robot interaction</article-title>
          .
          <source>Speech Communication</source>
          ,
          <volume>65</volume>
          :
          <fpage>50</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Danijel</given-names>
            <surname>Skočaj</surname>
          </string-name>
          , Matej Kristan, Alen Vrečko, Marko Mahnič, Miroslav Janíček, Geert-Jan M. Kruijff, Marc Hanheide, Nick Hawes, Thomas Keller, Michael Zillich, and
          <string-name>
            <given-names>Kai</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>A system for interactive learning in dialogue with a tutor</article-title>
          .
          <source>In IROS 2011</source>
          , San Francisco, CA, USA, 25-30 September.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Peter D.</given-names>
            <surname>Turney</surname>
          </string-name>
          and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Pantel</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>From frequency to meaning: Vector space models of semantics</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          ,
          <volume>37</volume>
          (
          <issue>1</issue>
          ):
          <fpage>141</fpage>
          -
          <lpage>188</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>