Human evaluation of robot-generated spatial descriptions

Simon Dobnik and Stephen G Pulman
Computing Laboratory, University of Oxford
Wolfson Building, Parks Road, Oxford OX1 3QD, United Kingdom
{simon.dobnik,stephen.pulman}@comlab.ox.ac.uk
http://www.clg.ox.ac.uk

Abstract. We describe a system where the semantics of spatial referential expressions have been automatically learned by finding mappings between symbolic natural language descriptions of the environment and non-symbolic representations from the sensory data of a mobile robot used for localisation and map building (SLAM). Although the success of learning can be measured by examining classifier performance on held-out data, this does not in itself guarantee that the descriptions generated will be natural and informative for a human observer. In this paper we describe the results of an evaluation of our embodied robotic system by human observers.

Key words: spatial expressions, machine learning, mobile robots, embodied multi-modal conversational agents, evaluation

1 Introduction

A conversational robot must be able to refer to and resolve references to the environment in which it is located with its human conversational partner. Mapping between the linguistic and non-linguistic representations is commonly performed by first identifying some parameters of the physical world on the basis of psychological evidence and then integrating them into customised functions [1, 2]. However, in a real robotic system which has been primarily built for tasks such as map building, localisation and navigation, the information required by such models may not be readily available. Our approach attempts to use a simple model of space and motion that is available to a mobile robot and to show that a mapping between its representations and highly abstract natural language spatial descriptions can be learned: that the robot can display human-like performance in "understanding" and generating spatial descriptions of motion in new environments. In this paper we focus on the evaluation of the robot's performance from the point of view of a human conversational partner.

2 Learning spatial descriptions

Spatial descriptions may be about the identity of objects in a scene [3], about the spatial relations between the objects in a scene [4] or about the route that a moving object can take in a scene [5]. The scene may be a small artificial town on a table top, a building with rooms or a real town. Our scenario is a larger room, a lab, which is constrained by walls and which contains life-sized objects such as a chest, a box, a table, a pillar, a stack of tyres, a chair, a desk and shelves. The natural language descriptions that can be made in this environment belong to two categories: they can be descriptions of the robot's motion, such as "You're going forward slowly", or, when the robot is stationary, descriptions of relations between the objects in the scene, such as "The table is to the left of the chair". We consider descriptions of motion to be spatial descriptions because their meaning is also relative to the environment in which they are used.

We use an ATRV-JR mobile robot designed by iRobot which runs middleware called MOOS.^1 The system runs an odometry component which provides information about the robot's motion, such as its ⟨R-Heading⟩^2 and ⟨Speed⟩, and the SLAM localisation component [6] which uses a previously built 2-dimensional SLAM map to localise the robot.
The objects were grounded on the map manually by taking the centre point of the cloud of points representing them, for example: chair ⟨0.6234, 0.2132⟩ (⟨X⟩ and ⟨Y⟩). Our representation of the state of the robot and the space around it is thus extremely simple, but the values of such representations are very accurate.

A group of four non-expert volunteers was invited to provide linguistic spatial descriptions of the robot and its environment. Each was first familiarised with the scene, the names of the objects and the different types of motion that the robot can produce. Then they were instructed to describe the motion and the location of objects from the perspective of the robot. This ensured that all directionals were used unambiguously from a single reference frame [7]. Two datasets were created. The linguistic descriptions in the first dataset (Simple) were made by a single describer and were restricted to a pre-defined small vocabulary (16 words) that appeared as choices on a computer screen. The second dataset (All) was created by all four participants, who could use unrestricted vocabulary and sentences. Such descriptions show considerable lexical variation (46 words) but their syntactic structure is limited and in most cases similar to the examples above.^3 The two settings were intended to show the effects of subjectivity on the datasets and the models produced. To preserve the naturalness of the situation we used speech recognition (with some consequent noise in the language).

To turn MOOS log files (where both linguistic and non-linguistic information was recorded) into learning instances, a few processing steps had to be performed: the locations of objects were expressed relative to the robot (rather than as global values relative to some arbitrary point where the robot started) and their values were normalised (given the estimated size of the room or the maximum speed of the robot in the current session). This ensured that the models that were built could later be applied to new contexts. Words from natural language descriptions were tagged with one of the categories ⟨Verb⟩, ⟨Direction⟩, ⟨Heading⟩, ⟨Manner⟩ and ⟨Relation⟩, which were also the target classes to be learned. The learning was accomplished with the Weka toolkit [8], which includes a range of offline supervised classifier implementations and a common framework to represent the data and evaluate the results. Each of the target linguistic classes was learned separately and not all attributes were used in each learning exercise. For example, to learn the category ⟨Verb⟩ we only used the ⟨R-Heading⟩ and ⟨Speed⟩ attributes, and to learn the category ⟨Relation⟩ we used the attributes ⟨LO_x⟩, ⟨LO_y⟩, ⟨REFO_x⟩ and ⟨REFO_y⟩, where LO stands for a located object and REFO stands for a reference object. Including all attributes resulted in considerably lower classifier accuracy since many spurious relations were discovered.

^1 MOOS was designed by Paul M. Newman (Mobile Robotics Group, Department of Engineering, University of Oxford). We would like to thank him and members of his group for introducing us to mobile robotics.
^2 The attributes used in learning are marked with angled brackets.
^3 Complex descriptions such as "the chair is to the left of the table and behind the sofa" were simplified as two descriptions of relation.
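To make the per-category setup concrete, the following Python fragment is a minimal sketch of the idea, not code from the original system: scikit-learn's DecisionTreeClassifier stands in for Weka's J48, the attribute and class names are the ones introduced above, and the example values are invented.

```python
# A minimal sketch (not the original pipeline): normalise robot-relative
# attributes and train one decision tree per linguistic category, using
# only the attribute subset relevant to that category.  scikit-learn's
# DecisionTreeClassifier is a stand-in for Weka's J48.
from sklearn.tree import DecisionTreeClassifier

# Attribute subsets per target category, as described in the text.
ATTRIBUTES = {
    "Verb":     ["R-Heading", "Speed"],
    "Relation": ["LO_x", "LO_y", "REFO_x", "REFO_y"],
}

def normalise(instance, room_size, max_speed):
    """Scale positions by the estimated room size and speed by the
    session's maximum speed so models transfer to new contexts."""
    out = dict(instance)
    for key in ("LO_x", "LO_y", "REFO_x", "REFO_y"):
        if key in out:
            out[key] = out[key] / room_size
    if "Speed" in out:
        out["Speed"] = out["Speed"] / max_speed
    return out

def train_category(category, instances, labels, room_size, max_speed):
    """Train a decision tree for one linguistic category using only its
    attribute subset (instances: list of dicts, labels: list of words)."""
    attrs = ATTRIBUTES[category]
    X = [[normalise(i, room_size, max_speed)[a] for a in attrs] for i in instances]
    clf = DecisionTreeClassifier()
    clf.fit(X, labels)
    return clf

# Invented toy data: two robot-relative object configurations.
instances = [
    {"LO_x": 0.6, "LO_y": 0.2, "REFO_x": 1.4, "REFO_y": 0.3},
    {"LO_x": -0.8, "LO_y": 0.1, "REFO_x": 0.2, "REFO_y": 0.4},
]
labels = ["to the left of", "behind"]
relation_clf = train_category("Relation", instances, labels, room_size=8.0, max_speed=1.0)
```

Restricting each category to its own attribute subset mirrors the observation above that including all attributes invites spurious correlations.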
The classifiers that were used in the human evaluation experiments described in the following sections were produced by the J48 learner, which is Weka's implementation of the ID3/C4.5 decision tree learner [9]. Their estimated accuracies, obtained by stratified 10-fold cross-validation, are given in the last column of Table 1 for both the Simple and All datasets. Note that these are not the best values we obtained. The accuracy of the motion categories was improved by a better method of combining a set of temporally sequential observations from the robotic log into instances. We also compared the performance of different machine learning methods on our datasets.

3 Evaluation by humans

The evaluation of machine learning classifiers by stratified 10-fold cross-validation tests the degree to which the descriptions learned will generalise correctly to new cases. However, it does not tell us whether the models that are built will result in linguistic behaviour that is natural to humans. In order to find out, we carried out a user study. We integrated the classifiers into a simple system that generates descriptions, called pDescriber. This considers the current (normalised) values of the same attributes that were used in learning and predicts the linguistic target classes. If the robot is moving, it generates descriptions of motion; if it is stationary, it generates descriptions of object relations. The values of the predicted categories are applied to syntactic patterns such as "I'm ⟨Verb⟩ing" or "⟨LO⟩ is ⟨Relation⟩ the ⟨REFO⟩", which produce sentences that are subsequently pronounced by a speech synthesiser, for example "I'm reversing" and "You are behind the chair".

A new room was set up. Most of the objects were the same as in the data collection exercise but their placement was different. Five subjects were invited to the lab for approximately an hour each. None of them had participated in data collection. After being introduced to the scene, they were told to indicate whether they agreed with the description that was generated by the robot, given its current state and that of the environment. This gave us simple binary data. If they disagreed with the description, they had a chance to provide a better one. Note that the descriptions were not evaluated as whole utterances but per linguistic category. For example, for each utterance the system would query the evaluator whether "right" was a good word to describe the heading in which the robot was moving, or whether "to the left of" was a good description of the relation between the chair and the table. The evaluators were also invited to make qualitative judgements about the appropriateness of the descriptions, which we noted down. For approximately one half of the session the system used the classifiers built from the Simple dataset and for the other half it used the classifiers built from the All dataset.

4 Evaluator-system agreement

The central part of Table 1 shows the measured accuracies from each evaluator per category. As explained in the previous section, accuracy is measured as evaluator agreement with the system on the choice of description. The penultimate column contains the accuracies when all evaluators are considered together. The last column contains the estimated accuracies of the classifier that the system was using to produce these descriptions. The table is split into two parts, each containing the results from one configuration of the system (J48-Simple and J48-All).
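To make the bookkeeping behind Table 1 concrete, each accuracy is simply the proportion of binary judgements on which an evaluator agreed with the system for a given category. The sketch below is our illustration of that computation, not part of the original analysis; the data structure and example values are assumed.

```python
# Sketch of the accuracy computation behind Table 1: each judgement is
# (evaluator, category, agreed?) and accuracy is agreements / judgements.
# The judgement records here are invented for illustration.
from collections import defaultdict

def accuracy_per_category(judgements):
    """judgements: iterable of (evaluator, category, agreed: bool).
    Returns {(evaluator, category): accuracy in percent}."""
    counts = defaultdict(lambda: [0, 0])   # (evaluator, category) -> [agreed, total]
    for evaluator, category, agreed in judgements:
        counts[(evaluator, category)][0] += int(agreed)
        counts[(evaluator, category)][1] += 1
    return {key: 100.0 * agreed / total for key, (agreed, total) in counts.items()}

# Toy example: evaluator "a" agrees twice and disagrees once on <Verb>.
example = [("a", "Verb", True), ("a", "Verb", True), ("a", "Verb", False)]
print(accuracy_per_category(example))   # {('a', 'Verb'): 66.66...}
```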
Table 1. System performance vs. classifier performance

Category         Evaluators                                              Classifier
                 a        b        c        d        e        All       J48
Simple
Motion      n =  36       17       14       2        21       90        −
Verb             100      88.24    100      100      95.24    96.67     89.02
Direction        100      76.47    100      100      100      95.56     87.80
Heading          100      82.35    100      100      85.71    93.33     97.56
Manner           100      82.35    100      100      100      96.67     70.73
Relation    n =  65       23       19       53       22       182       −
Relation         67.69    65.22    68.42    66.04    59.09    65.93     75.90
All
Motion      n =  53       22       53       7        41       176       −
Verb             96.23    77.27    88.68    100      100      92.61     48.22
Direction        96.23    72.73    92.45    100      100      93.18     55.68
Heading          98.11    68.18    92.45    100      95.12    92.05     60.77
Manner           100      72.73    98.11    100      100      96.02     54.70
Relation    n =  66       28       72       110      58       334       −
Relation         72.73    57.14    44.44    70.00    43.10    59.28     69.12

How do the results from the evaluation of the system by humans and the evaluation of the underlying classifiers compare? The classifier accuracies are the average accuracies obtained through 10-fold cross-validation. In the human evaluation of the system, the accuracy is determined on an independent test set. In both cases the reported accuracy is the ratio of the number of agreements with the system, or correct classifications, to the total number of testing instances considered. There is a slight difference between the two situations in how a positive match is made. In cross-validation the correct value of the class is pre-defined and hidden from the classifier, and this is matched against the predicted class. In human evaluation an evaluator hears the generated description before they give their evaluation. This description is the one predicted by a classifier given the attributes representing the robot's current internal state. In this respect it is possible that the system unavoidably biases the evaluator, since other possible descriptions are never produced. Furthermore, when evaluating the system in this way, the observers are not always evaluating only the classifiers. For example, when generating descriptions of object relations the located and the reference objects are chosen at random and the classifier is used to predict the best relation between the two. The description may be evaluated as unsuitable because of an unfortunate choice of objects even though the spatial relation between them is correct.

A quick look at the table reveals that the evaluators considered the system performance to be better than the accuracy of the underlying classifiers on most classes of motion descriptions (J48-Simple classifier: x̄ = 86.28%, J48-Simple evaluators: x̄ = 95.56%; J48-All classifier: x̄ = 54.84%, J48-All evaluators: x̄ = 93.47%). To make the comparison easier we mark in bold the values where the opposite is true, that is, where the system is evaluated to perform worse than its classifiers. The evaluator accuracies are quite similar across categories, even for the ⟨Manner⟩ category, on which the classifiers perform less well than on others. This is more the case with the Simple configuration than with All. In contrast, the system was considered to perform less well than its classifiers on the ⟨Relation⟩ category, by approximately 10% in both cases (J48-Simple classifier: 75.90%, J48-Simple evaluators: 65.93%; J48-All classifier: 69.12%, J48-All evaluators: 59.28%).

The scores from evaluator b are lower than those from the other evaluators, particularly on the motion classes and in the J48-Simple configuration. The rows starting with n = indicate the size of the evaluation sample.
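The generation step whose outputs were being judged here, choosing a located and a reference object at random and asking the ⟨Relation⟩ classifier for the best word for the pair, could look roughly like the sketch below. This is our reconstruction under the assumptions stated earlier (a scikit-learn tree standing in for J48, invented object coordinates, the relation_clf trained in the earlier sketch); it is not the actual pDescriber code.

```python
# Sketch of relation generation as described in the text: choose a located
# object and a reference object at random, predict the best <Relation>
# word for the pair, and fill the "<LO> is <Relation> the <REFO>" pattern.
# Object coordinates are invented; relation_clf is the classifier trained
# in the earlier sketch.  In the full system the same normalisation used
# at training time would be applied to the attribute values first.
import random

objects = {                      # robot-relative (x, y), invented values
    "chair": (0.6, 0.2),
    "table": (1.4, 0.3),
    "box":   (-0.8, 0.1),
}

def describe_relation(relation_clf, objects):
    lo, refo = random.sample(list(objects), 2)
    (lo_x, lo_y), (refo_x, refo_y) = objects[lo], objects[refo]
    relation = relation_clf.predict([[lo_x, lo_y, refo_x, refo_y]])[0]
    return f"The {lo} is {relation} the {refo}"

print(describe_relation(relation_clf, objects))
# e.g. "The chair is to the left of the table"
```

The random choice of the object pair is exactly what can make a correctly predicted relation sound inappropriate to an evaluator, as discussed above.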
Although the number of descriptions that the robot generated was not strictly controlled, a reasonable sample was obtained for each evaluator. The only exception is evaluator d, who evaluated only a small number of descriptions of motion but, on the other hand, considered more descriptions of object relations.

An explanation of why the evaluators consider the system to perform better than its underlying classifiers on the motion categories but not on the relation category could be that the motion categories contain words that are less semantically restrictive. For example, the category ⟨Verb⟩ contains words such as "going", "moving" and "continuing", which all have a very similar reference for a human but not for a machine learner, where the attribute values are assumed to be discrete. Consequently, an evaluator may accept such an alternative. The categories ⟨Direction⟩, ⟨Heading⟩ and ⟨Manner⟩ contain words with clearer semantic divisions, but they all also contain the word "none", which was assigned as the value of a category in the machine learning dataset if no word for that category was present. The meaning of this word is ambiguous between a default meaning and an anaphoric meaning. For the ⟨Direction⟩ category "none" has the same meaning as "straight". However, it can also refer anaphorically to the previously generated description of direction if this has not changed. Another explanation why the results are different for descriptions of motion and object relations is that learning and generating the latter is more complex. It could be that our learning and generation models for descriptions of object relations capture human knowledge less well than the models for descriptions of motion. We discuss some qualitative evidence for this in Section 6.

5 Inter-evaluator agreement

Agreement between individual evaluators demonstrates that the system has not been tuned to the vocabulary of the describers who provided descriptions for machine learning. Disagreement may be informative too: if evaluators collectively disagree with the system, this means that the generation task is not subjective, that there exists a consensus on what is a good description in a particular context and what is not.

Unfortunately, inter-evaluator agreement cannot be established directly, for example by calculating a κ coefficient, because not all evaluators evaluated the same set of items. The evaluators considered a closed set of words produced by the system. We can expect that the agreement of a single evaluator with the system will not be identical on every word that it produces: some words are more difficult to learn than others. If so, the difference in the ratings for words should be consistent across evaluators. According to our model of agreement, an evaluator agrees with the other evaluators if their accuracy scores per word correlate with the mean of the accuracy scores per word of everyone else.

Table 2. Agreement of each evaluator with the rest of the group

Configuration   a:rest     b:rest     c:rest     d:rest     e:rest     Mean
J48-Simple      0.824**    0.382 ns   0.787**    0.907**    0.636*     0.707
J48-All         0.504*     0.048 ns   0.635**    0.756**    0.662**    0.521

Table 2 shows the Pearson's correlation coefficients r_xy obtained for each evaluator against the rest of the group, for both sets of classifiers. The last column contains the average correlation coefficient.
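The agreement measure just described, one evaluator's per-word accuracy scores correlated with the mean per-word scores of the remaining evaluators, can be sketched as follows. This is an illustrative reconstruction with invented scores, not the original analysis script; it uses scipy.stats.pearsonr, whose two-sided p-value corresponds to the two-tailed test used for Table 2.

```python
# Sketch of the inter-evaluator agreement measure: each evaluator has an
# accuracy score per word; an evaluator agrees with the group if their
# per-word scores correlate with the mean per-word scores of the others.
# The scores below are invented for illustration.
import numpy as np
from scipy.stats import pearsonr

# keys: evaluators a..e, values: accuracy per word produced by the system
scores = {
    "a": [100, 90, 80, 60],
    "b": [ 70, 95, 50, 90],
    "c": [100, 85, 75, 65],
    "d": [ 95, 90, 85, 55],
    "e": [ 90, 80, 70, 60],
}

def agreement_with_rest(evaluator, scores):
    """Pearson r (and two-sided p-value) between one evaluator's per-word
    accuracies and the mean per-word accuracies of the other evaluators."""
    own = np.array(scores[evaluator], dtype=float)
    rest = np.mean([scores[e] for e in scores if e != evaluator], axis=0)
    return pearsonr(own, rest)

for e in scores:
    r, p = agreement_with_rest(e, scores)
    print(f"{e}:rest  r = {r:.3f}  p = {p:.3f}")
```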
The asterisks indicate the statistical significance levels of the coefficients, obtained by a two-tailed t-test.^4 We can see that, except for evaluator b, there exists a moderate to high correlation between the scores of an individual evaluator and the mean scores of the rest of the group. The average correlation coefficient for the J48-Simple configuration (0.707) is greater than the average correlation coefficient for the J48-All configuration (0.521). All correlation coefficients, except in the case of evaluator b, are statistically significant at the level α = 0.05. In sum, apart from evaluator b, there is considerable consensus among the remaining four evaluators on the performance of the system. Thus, it has captured some universal knowledge.

6 Qualitative evaluation

Descriptive observations made by the evaluators are useful because they point out facts about spatial cognition and shortcomings of the system that can be further improved [10, 11].

^4 * indicates that the correlation is significant at the 0.05 level, and ** indicates that it is significant at the 0.01 level. "ns" indicates that the correlation is not significant.

Ambiguity of heading and direction. Descriptions such as "left" and "right" are ambiguous when used to refer to motion. "Moving right" can mean moving forward with a heading in the clockwise direction. It can also mean making a sudden turn to the region that is to the right of the current location and then moving straight in that direction. Similarly, "moving backward" can mean that the robot is moving in the direction that is behind its back (reversing), or that it has reversed but is now moving forward in the direction that was previously behind its back. The second of each description pair is more complex, and to learn such descriptions the learner would have to abstract over a set of actions rather than over physical descriptions of the environment. Since, while performing the second action, "right" and "backward" may refer to the same state of the robot as "straight" and "forward" in our model, the robot is likely to over-generate such descriptions in cases where the first action was not performed.

Object shape. The SLAM map used in our model does not contain abstract representations of objects but only clouds of points. Each object is represented by a centre point. While this works reasonably well for objects that are square-shaped, difficulties arise with objects that are markedly different in one dimension, such as "the wall" and "the barrier".

Switching the reference frame. Although evaluators were told that the descriptions were generated with the reference frame fixed on the robot, or from "its perspective", it was very easy for them to switch from this relative reference frame to the intrinsic reference frame fixed on the reference object. Firstly, it became apparent that some switches to the intrinsic reference frame had been learned from the training data, and such descriptions appeared appropriate in the current context. In this case, the majority of evaluators would accept such descriptions, although they should not do so according to our instructions. Secondly, properties of some objects invite human describers or observers to use the intrinsic rather than the relative reference frame. This is true for objects that are larger than the describer (walls, barriers and cupboards), have an identifiable front and are animate (another robot).
Only the intrinsic reference frame is possible when the robot describes its own location, since it cannot then serve as a reference object: "I'm in front of the chair" unambiguously means that the robot is located in the region around, and oriented by, the seating area of the chair. Note that the reference frame also applies to projective descriptions of motion.

Reference to objects outside the robot's field of vision. There was disagreement between the evaluators about whether descriptions of objects that cannot be "seen" by the robot are appropriate or not. Technically, the "vision field" of the robot is much greater than that of a human observer: it is the entire SLAM map, which represents its mental map. Humans also use mental maps to imagine configurations of objects for tasks such as navigation, and therefore descriptions of objects not in the visual focus of the describer may not be completely unnatural. In fact, the descriptions that were particularly disapproved of were those where only one of the objects was "visible".

Non-optimal choice of objects. The classifiers always attempt to predict the best description of the relation between two objects and may do so, but the description may still be judged inappropriate because of an unfortunate selection of objects. The latter could be handled by a contextual model, which our system does not implement. Given that we are primarily interested in the spatial relations themselves, the choice of objects at random seems reasonable. Some evaluators were more sympathetic to such descriptions than others. However, they all agreed that descriptions where the lack of object salience was coupled with the lack of vision-field salience were quite unacceptable.

7 Conclusion

Although our classifiers use a relatively simple (topological) representation of space, primarily intended for the localisation of a mobile robot, we can conclude that they work surprisingly well in practice in replicating human linguistic competence. They sometimes fall short because they do not have access to non-topological information such as object shape, reference frame, discourse structure for modelling salience, and world knowledge about the objects. Such data must be provided from other sources.

References

1. Regier, T., Carlson, L.A.: Grounding spatial language in perception: an empirical and computational investigation. Journal of Experimental Psychology: General 130(2), 273–298 (2001)
2. Coventry, K.R., Cangelosi, A., et al.: Spatial prepositions and vague quantifiers: implementing the functional geometric framework. In: Freksa, C., Knauff, M., et al. (eds.) Spatial Cognition, vol. IV, pp. 98–110 (2005)
3. Zender, H., Martínez-Mozos, O., et al.: Conceptual spatial representations for indoor mobile robots. Robotics and Autonomous Systems 56(6), 493–502 (2008)
4. Roy, D.K.: Learning visually-grounded words and syntax for a scene description task. Computer Speech and Language 16(3), 353–385 (2002)
5. Lauria, S., Kyriacou, T., et al.: Converting natural language route instructions into robot-executable procedures. In: Proceedings of Roman'02, pp. 223–228 (2002)
6. Bosse, M., Zlot, R.: Map matching and data association for large-scale two-dimensional laser scan-based SLAM. IJRR 27(6), 667–691 (2008)
7. Steels, L., Loetzsch, M.: Perspective alignment and spatial language. In: Coventry, K.R., Tenbrink, T., Bateman, J. (eds.) Spatial Language and Dialogue, Explorations in Language and Space, vol. 3, pp. 70–88. OUP (2009)
8. Witten, I.H., Frank, E.: Data mining: practical machine learning tools and techniques. Morgan Kaufmann, 2nd edn. (2005)
9. Quinlan, J.: C4.5: programs for machine learning. Morgan Kaufmann (1993)
10. MacMahon, M., Stankiewicz, B., Kuipers, B.: Walk the talk: connecting language, knowledge, and action in route instructions. In: Proceedings of AAAI-2006, pp. 1475–1482 (2006)
11. Moratz, R., Tenbrink, T.: Spatial reference in linguistic human-robot interaction: iterative, empirically supported development of a model of projective relations. Spatial Cognition & Computation 6(1), 63–107 (2009)