<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Applying spatial knowledge from a scene description task to question answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Simon Dobnik</string-name>
          <email>simon@dobnik.net</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Philosophy, Linguistics and Theory of Science</institution>
          ,
          <addr-line>Box 200, 405 30 Göteborg</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        This presentation extends our previous work, in which we built and tested a
mobile robot that learns grounded semantic representations of spatial concepts
from human descriptions and from its own sensor perception of a room
containing real objects. The learning is performed offline as a machine learning
classification task. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] we show that the learning of spatial concepts is
successful when the classifiers are evaluated. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] we argue that classifier evaluation is
not enough to show that the robot has acquired human-like spatial knowledge which
generalises to new spatial configurations. We therefore integrated the classifiers
into our own NL generation system (pDescriber), which produces grounded
descriptions of spatial scenes such as "The table is to the left of the chair" and
allows humans to observe the acquired spatial knowledge. The discourse setting
in which these descriptions are made is identical to the one in which they were
sampled before they were learned. In this contribution we examine whether we
can use the data-sets and classifiers from the scene description task to answer
questions that (A) locate objects: "Where is the table?" - "The table is to the left
of the chair"; (B) confirm object descriptions: "Is the table to the left of the
chair?" - "No, the table is near the chair."; (C) find objects: "What is to the left
of the chair?" - "The pillars, the tyres and the wall are to the left of the chair";
and (D) reference objects: "What is the chair to the left of?" - "The chair is to
the left of the table, the desk and the wall". We see the task as an experimentally
constrained form of dialogue which contains only two dialogue acts (information
request and answer), each always performed by the same illocutionary
partner: a human directs questions to the robot. Since the dialogue is situated both
spatially and in discourse, we do expect to find effects of semantic coordination
of human observers when they interpret the robot's responses.
      </p>
      <p>Generating answers to questions (pDialogue) requires more steps than
generating descriptions, and hence more factors may influence the evaluation of spatial
knowledge. User utterances must be interpreted as questions and their content
must be matched against dialogue rules which specify how to answer them.
Most dialogue rules require the application of ML classifiers that take linguistic
descriptions and predict perceptual properties, rather than the reverse (as in pDescriber).
The classification tells us what state of perception corresponds to a description.
The dialogue rules must then issue commands that bring the robot to this state
or find a configuration of objects that holds in the state. The resulting knowledge
is used to generate natural language sentences.</p>
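      <p>As an illustration of the first step only (interpreting an utterance as one of the question types A to D), the following is a minimal sketch. The regular expressions, the name <monospace>QUESTION_PATTERNS</monospace> and the function <monospace>interpret</monospace> are hypothetical and are modelled on the example questions quoted in this abstract; they are not the dialogue rules of pDialogue.</p>
      <preformat>
```python
import re

# Hypothetical surface patterns for the four question types, written to fit
# the example questions in the abstract. Type D is tried before type C
# because both begin with "What is".
QUESTION_PATTERNS = [
    ("A", re.compile(r"^Where is (the \w+|you)\?$", re.IGNORECASE)),
    ("B", re.compile(r"^Is (the \w+) (.+?) (the \w+|you)\?$", re.IGNORECASE)),
    ("D", re.compile(r"^What is (the \w+) (.+?) of\?$", re.IGNORECASE)),
    ("C", re.compile(r"^What is (.+?) (the \w+|you)\?$", re.IGNORECASE)),
]


def interpret(utterance):
    """Return (question_type, slots) for a recognised question, else None."""
    text = utterance.strip()
    for question_type, pattern in QUESTION_PATTERNS:
        match = pattern.match(text)
        if match:
            return question_type, match.groups()
    return None


if __name__ == "__main__":
    for question in [
        "Where is the table?",
        "Is the table to the left of the chair?",
        "What is to the left of the chair?",
        "What is the chair to the left of?",
    ]:
        print(question, interpret(question))
```
      </preformat>
      <p>In such a sketch the extracted slots (located object, spatial relation, reference object) would then be passed to a rule that applies the classifiers; that part depends on the robot and is not shown.</p>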
      <p>The system was individually evaluated by 13 non-expert volunteers in a room
environment different from that used in data collection for ML. Each evaluator
considered the robot's answers to 55 questions which were scripted and were
automatically asked by the evaluation software at four distinct room locations
(L1 to L4). This ensured that various spatial and linguistic conditions were
covered. The evaluators' task was to indicate whether each of the robot's answers is an
intuitive or natural description on a scale from 1 (bad) to 5 (best). Each run
took between 45 and 60 minutes to complete. We estimated evaluator agreement by
calculating Pearson's correlation coefficient between the scores of each evaluator
per particular question-answer pair and the mean of such scores over all other
evaluators. The overall agreement of 0.583 (the mean of the correlation coefficients
from all 13 folds) shows that there is considerable consensus between the
evaluators on the performance of the system.</p>
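      <p>The agreement estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the toy score matrix are invented for the example, and the real evaluation used 13 evaluators and 55 question-answer pairs.</p>
      <preformat>
```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def leave_one_out_agreement(scores):
    """Mean correlation of each evaluator with the mean of all the others.

    scores[e][q] is the 1-5 score evaluator e gave to question-answer pair q.
    """
    n_evaluators = len(scores)
    n_items = len(scores[0])
    folds = []
    for e in range(n_evaluators):
        others = [scores[o] for o in range(n_evaluators) if o != e]
        mean_of_others = [
            sum(row[q] for row in others) / len(others) for q in range(n_items)
        ]
        folds.append(pearson(scores[e], mean_of_others))
    return sum(folds) / n_evaluators


# Made-up scores for 3 evaluators and 4 question-answer pairs.
toy_scores = [
    [5, 3, 1, 4],
    [4, 3, 2, 5],
    [5, 2, 1, 4],
]

if __name__ == "__main__":
    print(round(leave_one_out_agreement(toy_scores), 3))
```
      </preformat>
      <p>Each fold holds out one evaluator, so with 13 evaluators the reported figure is the mean of 13 correlation coefficients.</p>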
      <p>
        To estimate the accuracy of the system, the evaluator scores were normalised
to values between 0 and 1 (1=0, 2=0.25, 3=0.5, 4=0.75, 5=1) and summed.
The accuracy per question type is as follows: A - 43.5%, B - 54.2%, C - 54.7%,
D - 56.9% and mean - 52.3%. The steps involved in answering questions A are
identical to generating a description in pDescriber (59.3%) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (one or two
objects are selected at random and the relation between them is classified),
but the estimated performance of pDialogue on questions A is lower by 15.8 percentage points.
The result suggests that a new discourse setting affects the interpretation of
spatial descriptions. When the system generates a description on its own, a
human hearer understands it as a general statement about the scene that both
are observing. However, when an agent in conversation asks a question, they
expect an informative and relevant answer which helps them to interpret the
scene. Choosing a salient reference object is particularly important. Objects can
be salient in their visual properties (visual salience) or through being previously
discussed and located in dialogue (discourse salience). The modelling of both
kinds of salience is an object of our future investigations.
      </p>
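      <p>The score normalisation and the comparison with pDescriber can be illustrated with a short sketch. The mapping and the reported percentages are taken from the text; dividing the summed scores by the number of ratings to obtain a percentage is an assumption, and the individual ratings below are made up.</p>
      <preformat>
```python
# Mapping from the 1-5 rating scale to values between 0 and 1, as in the text.
NORMALISED = {1: 0.0, 2: 0.25, 3: 0.5, 4: 0.75, 5: 1.0}


def accuracy(ratings):
    """Percentage accuracy for a list of 1-5 ratings.

    Assumption: the summed normalised scores are divided by the number of
    ratings, i.e. by the maximum attainable sum.
    """
    total = sum(NORMALISED[r] for r in ratings)
    return 100.0 * total / len(ratings)


# Accuracies per question type as reported in the text (percent).
REPORTED = {"A": 43.5, "B": 54.2, "C": 54.7, "D": 56.9}
PDESCRIBER = 59.3

# The unweighted mean over the four types is consistent with the reported mean.
mean_accuracy = round(sum(REPORTED.values()) / len(REPORTED), 1)

# Gap between pDescriber and pDialogue on questions A, in percentage points.
gap_on_a = round(PDESCRIBER - REPORTED["A"], 1)

if __name__ == "__main__":
    print(accuracy([5, 3, 1, 4]))  # made-up ratings
    print(mean_accuracy, gap_on_a)
```
      </preformat>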
      <p>We also tested two other properties affecting the semantics of spatial
descriptions in a situated discourse. First, the difference in evaluation scores between
question-answer pairs that involved (a) objects that were in the robot's visual field (L1
and L2) and (b) objects that were not (L3) is statistically significant (t-test: a &gt; b;
2P = 0.000). The interlocutors expect the robot to change its orientation
towards objects referred to in questions and answers. Secondly, the difference
in evaluation scores between (a) cases where the spatial description in the question was
unambiguous in terms of the reference frame (C at L1: "What is to the left of you?",
intrinsic) and (b) cases where the question could be answered using an alternative
reference frame ("What is to the left of the chair?") is not statistically significant
(t-test: a = b; 2P = 0.61 &gt; 0.05). This shows that human observers align
to the reference frame chosen by the robot (relative to itself) and do not insist
on changing it.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Dobnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Learning spatial referential words with mobile robots</article-title>
          .
          <source>In: Proceedings of the 9th Annual CLUK Research Colloquium</source>
          . The Open University, Milton Keynes, United Kingdom
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Dobnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pulman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Human evaluation of robot-generated spatial descriptions</article-title>
          .
          <source>In: Proceedings of the Workshop on Computational Models of Spatial Language Interpretation (CoSLI)</source>
          . Portland, Oregon, USA (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>