
A Probabilistic Approach to Modelling Spatial Language with Its Application To Sensor Models

Jamie Frost

jamie.frost@clg.ox.ac.uk 0

Alastair Harrison

Stephen Pulman

Paul Newman

pnewman@robots.ox.ac.uk 1

0 University of Oxford, Computational Linguistics Group, OX1 3QD, UK
1 University of Oxford, Mobile Robots Group, OX1 3PJ, UK

We examine why a probabilistic approach to modelling the various components of spatial language is the most practical for the spatial algorithms in which they can be employed, and examine such models for prepositions such as 'between' and 'by'. We provide an example of such a probabilistic treatment by exploring a novel application of spatial models: inducing the occupancy of an object in space given a description of it.

Introduction

Space occupies a privileged place in language and our cognitive systems, given the necessity to conceptualise various semantic domains. Spatial language can broadly be divided into two categories [1]: functions which map a region to some part of it, e.g. ‘the corner of the park’, and functions (in the form of spatial prepositions) which map a region to either an adjacent region, projection or axis, e.g. ‘the car between the two trees’. Approaches to implementing spatial models have fallen into two categories. [2], for example, takes a logic-based approach, using a set of predicates on objects and binary or ternary relations that connect objects to generate a description of an object that distinguishes it from others. A second approach is a numerical one which, given some reference object or objects and another ‘located’ object (see footnote 1) or point, assigns a value based on some notion of ‘satisfaction’ of the spatial relation in question. The conceptualisation of this assigned value, however, varies widely. [3] uses a ‘Potential Field Model’ characterised by potential fields which decrease away from object boundaries. [4] uses a linear function to model topological prepositions such as ‘near’, producing a value in the range [0, 1] depending on whether some point is directly by the object in question or on/beyond the horizon.

However, we argue that a conceptually more rigorous probabilistic approach is needed for all aspects of spatial language, in which the validity of some spatial or semantic proposition is determined by the likelihood that a human within the context of the expression would deem it to be true. We motivate this by the following reasons:

1. It provides a uniform treatment of confidence across both spatial and non-spatial domains; uncertainty may arise in the latter from, for example, variants of descriptive attributes (such as names). As a result these models can be used in a variety of spatial algorithms, such as searching for or describing objects and inferring the occupancy in space of an object.

2. In the last of the above applications (which will be explored in detail), as well as in other independent systems or frameworks, a probabilistic representation is often required.

3. Combining multiple spatial observations becomes more transparent: while any monotonically increasing or decreasing function is sufficient to establish a relative measure of applicability across candidate points or objects, the lack of consideration of the function’s ‘absolute’ value becomes problematic when combining data from different spatial models, for example if we were to say ‘The chair is by the table and between the cat and the rug’.

Footnote 1: We use the term ‘locative expression’ to refer to any expression whose intention is to identify the location of an object or objects (such as ‘a chair by the table’). The ‘located object’ refers to the object in question, and the ‘reference’ object(s) are others that can be used to determine the location of the located object (the table in this example).

Such an approach of assessing the ‘acceptability’ of regions given a spatial relation is based on a concept called ‘Spatial Templates’ established by [5], but a probabilistic approach puts more emphasis on absolute value. What precisely, then, do we mean by ‘human confidence’? One might think we can measure it by the probability that a given human would consider a (spatial) proposition to be true. But such a notion neglects a concept in philosophy known as subjectivism, in which rational agents can have degrees of belief in a proposition (rather than being restricted to boolean answers of ‘agree’ or ‘disagree’), and probabilities can be interpreted as the measure of such a belief. With such an assumption it is therefore sufficient to construct our models based on the ‘average degree-of-belief’ across people in some sample. Generically, this confidence can be defined as $p(\phi \mid \psi)$, where $\phi$ represents the proposition and $\psi$ represents the context. For a particular spatial model, one might use $p(\mathit{in\_front}(obj_1, obj_2) \mid x_t)$, where $x_t$ is the current position of the observer. We use $\phi_x$ as a convenience to indicate that the location (say the centre of mass) of the located object in $\phi$ is at position $x$.

In the next section we present the models we have developed for the prepositions ‘between’ and ‘by’, and then present a novel approach by which we might induce the occupancy of an object in space given a spatial description.

We carried out an online experiment in which users assessed the validity of various locative expressions given a variety of scenes. For each category of spatial relation, e.g. ‘by’ and ‘between’ (and a number of other prepositions not presented here), the user was asked to rate the extent to which they agreed with the given statement on a scale of 1 (representing ‘no’) to 7 (representing ‘yes’), each question being accompanied by a picture (see footnote 2). To produce the ‘average degree-of-belief’ we scaled the average answer to [0, 1].

[Figure 2: diagram of the ‘between’ model. Labels: reference objects; located object (centred at x); conn = connecting lines; points v1, v2 and q; distances |q - v1|, |v1 - v2| and |x - q|; regions of (1) full validity, (2) partial validity and (3) no validity.]

Our models are based on the Proximal Model as described in [7]. That is, features are based on the nearest point to the reference object, thus incorporating the shape of the object. This is in contrast to the Centre-of-Mass Model (as used in [4] for example), which treats all objects as points. This latter approach is computationally simpler and requires less data, although it can be problematic for larger objects; if for example we were to assess the acceptability of ‘you are near the park’, we would expect such a judgement to be based on proximity to the edge of the park rather than the centre.
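The difference between the two distance notions can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the rectangular ‘park’ and the observer position are hypothetical, and the mean of the vertices is used as a stand-in for the centre of mass.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b (2-D tuples)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: a and b coincide.
        return math.hypot(p[0] - a[0], p[1] - a[1])
    # Project p onto the segment and clip the projection to its end points.
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))

def proximal_distance(p, polygon):
    """Distance from p to the nearest point on the polygon boundary.

    Note: for a point inside the polygon this is still the distance
    to the boundary, not zero.
    """
    n = len(polygon)
    return min(point_segment_distance(p, polygon[k], polygon[(k + 1) % n])
               for k in range(n))

def centre_distance(p, polygon):
    """Distance from p to the mean of the vertices (assumed centre)."""
    cx = sum(v[0] for v in polygon) / len(polygon)
    cy = sum(v[1] for v in polygon) / len(polygon)
    return math.hypot(p[0] - cx, p[1] - cy)

# Hypothetical 1000 x 600 park and an observer just outside its east edge.
park = [(0.0, 0.0), (1000.0, 0.0), (1000.0, 600.0), (0.0, 600.0)]
observer = (1050.0, 300.0)
print(proximal_distance(observer, park))  # distance to the nearest edge
print(centre_distance(observer, park))    # distance to the centre
```

For this hypothetical park the observer is 50 units from the nearest edge but 550 units from the centre, which is why a centre-based measure would understate ‘near the park’.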

Spatial Models for ‘between’ and ‘by’

Between

The model we present below determines the acceptability of a proposition $\phi = \mathit{between}(a, b, c)$ such that $a$ is the located object, $b$ and $c$ are the reference objects, and the position of $a$ is $x$. We found that any point within the convex hull of the two reference objects (excluding the area of the objects themselves) was deemed to be fully valid. Outside this area, certainty degraded in proportion to the centrality of the object. Our model below quantifies these findings:

$$p(\phi_x \mid x_t) = p(\phi_x) = \begin{cases} \max\left(0,\ 1 - \dfrac{|x - q|}{\mathit{tol}}\right) & \text{if } x \notin \mathrm{Hull}(\mathit{ref}_1 \cup \mathit{ref}_2) \\ 1 & \text{otherwise} \end{cases} \qquad (1)$$

s.t.

$$q = \arg\min_{q'} \{\, |x - q'| : q' \text{ on line } l \,\}$$

$$l : (v_1, v_2) = \arg\min_{l'} \{\, |x - q'| : q' \text{ on line } l',\ l' \in \mathit{conn} \,\}$$

$$\mathit{tol} = |v_1 - v_2|\, k_1\, (\ldots |v_1 - q| \ldots)$$

$$\mathit{conn} = \{\, (v_1, v_2) \mid v_1 \in \mathit{ref}_1,\ v_2 \in \mathit{ref}_2,\ (v_1, v_2) \in \mathrm{edges}(\mathrm{Hull}(\mathit{ref}_1 \cup \mathit{ref}_2)) \,\}$$

[Figure 3: plot of confidence against distance to convex hull.]

Footnote 2: The experiment was restricted to native English speakers only, due to cross-linguistic variations in spatial coding, such as a lack of distinction between different frames of reference (that is, distinguishing between, say, the deictic interpretation of ‘in front of the tree’, based on the position of the observer, and the intrinsic interpretation based on the salient side of an object, as in ‘in front of the shop’) [6].

$x$ is the central point of the located object in question, $\mathit{ref}_1$ and $\mathit{ref}_2$ are the sets of vertices of the two reference objects, $\mathrm{Hull}(V)$ gives the convex hull of the set of vertices $V$ (thus $q$ is the nearest point on the convex hull to $x$), $\mathit{tol}$ gives the maximum allowed distance from the convex hull before the confidence score is 0, $\mathit{conn}$ is the set of 2 edges on the convex hull which connect the shapes corresponding to $\mathit{ref}_1$ and $\mathit{ref}_2$ (that is, the straight dotted lines in Fig. 2), and the function $\mathrm{edges}$ gives the edges of a polygon. $k_1$ controls the maximum tolerance permitted, as a specified proportion of the distance between the two objects, and $k_2$ controls the curvature of this ambiguous region. Via model fitting (minimising the sum of squared differences) we found that values of $k_1 = 0.55$ and $k_2 = 2.5$ yielded the best results (see Figure 3).

For the preposition ‘by’, there are 3 main variables that can influence the magnitude of the confidence score: the base width ($w$) and height ($h$) of the reference object, and the distance ($d$) from the reference object. For polygonal objects, users were given 8 different reference objects in their scenarios, of a variety of widths and heights. It was found that although the confidence score for a given distance relative to the width of the object (i.e. $d/w$) was a good starting point (see Fig. 4(a)), greater heights led to a small increase in probability. Assuming a linear relationship with height (again relative to the object width), we therefore divide by $h/w + k_h$ for some constant $k_h$ (given that flat objects such as lakes still yield a non-zero confidence score). Additionally, smaller objects tended to have a larger tolerance of distance with respect to this width, although this effect became less prominent as the width of the object became very large. Thus we multiply the distance by $\log(w + k_w)$ for some constant $k_w \geq 1$, since for very small objects we still expect some tolerance of distance.
Combining these relationships and simplifying, we suggest the following model:

$$p(\phi_{i,j} \mid x_t) = \mathrm{clamp}\left(k_c - k_m \frac{d \log(w + k_w)}{h + k_h w}\right) \qquad (2)$$

where $k_m$ and $k_c$ are the coefficients of the line used to obtain the confidence from the adjusted distance, and clamp restricts the overall value to the range [0, 1]. Fig. 4(b) shows the effect of using these transformations, with $k_w = 14$ and $k_h = 2$, resulting in values for $k_m$ and $k_c$ of 1.38 and 1.15 respectively.

[Figure 4: (a) horizontal axis: distance (/w); (b) horizontal axis: adjusted distance.]

Ultimately it is impossible to base any model of ‘by’ on physical metrics alone; the ‘use case’ of objects, i.e. the set of contexts in which an object is used, is likely to have an effect. In Fig. 4(b), for the case where the reference object was a chair, it is apparent that confidence deteriorated with distance much faster than expected. But if one considers that a chair is intended ‘to be sat on’, and therefore adjusts the recorded height $h$ to the more salient ‘seat level’, we obtain confidence values very close to the model for this example.

We now propose a method to infer the occupancy in space of a particular object given an observation in the form of a spatial description made about it. This is strongly predicated on the probabilistic treatment of our spatial models discussed earlier. Occupancy Grid Mapping is a technique employed in robotics to generate maps of an environment via noisy sensor measurements. The occupancy grid map is useful because it can subsequently be fused with other maps obtained from, say, physical sensors. The aim is to produce a posterior $p(m \mid z_{1:t}, x_{1:t})$, where $m = \{m_i\}$ is a partitioning of space into a finite grid of cells $m_i$, $z_{1:t}$ are the observations made up to time $t$, and $x_{1:t}$ are the poses of the robot at each observation. $m_i$ is the event that cell $i$ is occupied, thus $p(m_i)$ describes the probability that cell $i$ is occupied.
In the scope of this paper, we focus on how the ‘inverse sensor model’ $p(m_i \mid z_t, x_t)$ can be computed; a more detailed description of Occupancy Grid Mapping can be found in [8].
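As an illustration of how the confidence values that feed the inverse sensor model might be produced, the ‘by’ model of Equation 2 can be sketched as below. The constants are those reported in the text; the use of the natural logarithm, the units of d, w and h, and the example object dimensions are assumptions, since the text does not state them.

```python
import math

# Constants reported in the text for the 'by' model.
K_W = 14.0   # offset inside the logarithm
K_H = 2.0    # weight of the width in the denominator
K_M = 1.38   # slope of the fitted line
K_C = 1.15   # intercept of the fitted line

def clamp(value, low=0.0, high=1.0):
    """Restrict value to the range [low, high]."""
    return max(low, min(high, value))

def adjusted_distance(d, w, h):
    """Distance d rescaled by reference-object width w and height h.

    Assumption: log is the natural logarithm.
    """
    return d * math.log(w + K_W) / (h + K_H * w)

def by_confidence(d, w, h):
    """Confidence that a point at distance d is 'by' the reference object."""
    return clamp(K_C - K_M * adjusted_distance(d, w, h))

# Hypothetical reference object of width 2 and height 1, with confidence
# evaluated for a row of points at increasing distance from it.
row = [by_confidence(d, w=2.0, h=1.0) for d in (0.0, 0.5, 1.0, 1.5, 2.0)]
print(row)
```

The resulting list is non-increasing in distance, saturating at 1 close to the object and at 0 far from it, which is the behaviour the clamp is intended to enforce.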

Our aim is to compute this inverse sensor model in terms of the $p(\phi \mid x_t)$ probabilities calculated in the previous section. An important simplifying assumption we make is that locative expressions refer to a specific point in space, within the boundaries of the object in question. This seems intuitive; were we to describe a town as being ‘10km away’, it would clearly be fallacious to assume that the entirety of the town is precisely 10km away. We define a probability $p(r_{i,j} \mid x_t, z_t)$, where $r_{i,j}$ represents the event that the observer was referring to a point $(i, j)$ in their observation, and $z_t$ is the locative expression in which the position of the located object is not specified, say $\phi$ (since we consider such a position in $r_{i,j}$). We can then calculate the desired probability simply by normalising our confidence function across the space: $p(r_{i,j} \mid x_t, z_t = \phi) = \eta\, p(\phi_{i,j} \mid x_t)$, where $\eta$ is the normalising constant. Before we determine how to calculate $p(m_{i,j} \mid x_t, z_t)$, we analyse the conceptual parallelism between traditional Occupancy Grid Mapping and that employed in our linguistic context.
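The normalisation step can be sketched as follows, assuming the space has already been discretised into a rectangular grid of confidence values. The function name and the example grid are hypothetical.

```python
def normalise(confidence):
    """Scale a grid of confidence values so that its cells sum to 1.

    confidence[j][i] holds the model confidence for cell (i, j); the
    result holds the probability that the observer referred to that cell.
    """
    total = sum(sum(row) for row in confidence)
    if total == 0.0:
        raise ValueError("confidence is zero everywhere; cannot normalise")
    eta = 1.0 / total  # normalising constant
    return [[eta * c for c in row] for row in confidence]

# Hypothetical 2 x 2 grid of confidence values.
confidence = [[0.0, 0.5],
              [1.0, 0.5]]
referred = normalise(confidence)
print(referred)
```

Normalising preserves the relative ordering of cells, so the cell with full confidence remains the most likely referent.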

A Comparison of Sensor Models

On cursory inspection there are some clear similarities between the traditional occupancy grid map and our linguistic variant. Both involve the pose of some observer $x_t$ (although depending on the spatial model this is sometimes irrelevant) and some manifestation of an observation $z_t$: a physical sensor reading in the traditional approach and a locative expression in the linguistic approach. Upon closer analysis more similarities can be drawn. With a physical sensor, we expect a measurement of distance to the point being sensed to be noisy, and thus maintain a probability distribution over the precise position of this point. This corresponds to our distribution $p(r_{i,j} \mid x_t, z_t)$. For a locative expression describing a town as being 10km away, human error or rounding is likely to lead to uncertainty in the judged distance, and additionally the direction of the town is unspecified, leading to a ‘blurred doughnut’ type of distribution.
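The ‘blurred doughnut’ can be made concrete with a small sketch. The Gaussian form of the distance error, the value of sigma and the grid dimensions are assumptions chosen for illustration; they are not taken from the experiment.

```python
import math

def doughnut(size, observer, distance, sigma):
    """Distribution over a size x size grid for 'it is <distance> away'.

    Each cell is weighted by a Gaussian (an assumed error model) on the
    difference between its range from the observer and the stated
    distance. Direction is unconstrained, so the weight forms a ring.
    """
    grid = []
    for j in range(size):
        row = []
        for i in range(size):
            r = math.hypot(i - observer[0], j - observer[1])
            row.append(math.exp(-0.5 * ((r - distance) / sigma) ** 2))
        grid.append(row)
    total = sum(sum(row) for row in grid)
    return [[v / total for v in row] for row in grid]

# Observer at the centre of a 31 x 31 grid; the object is '10 cells away'.
ring = doughnut(31, (15, 15), 10.0, 1.5)
print(ring[15][25], ring[15][15])  # a cell on the ring vs the observer's cell
```

Cells lying at the stated distance in any direction receive equal weight, while the observer's own cell receives almost none.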

There are, however, a number of conceptual differences. With traditional Occupancy Grid Mapping the posterior for a cell is only updated if it was within the sensor range (i.e. we make no assumption about space outside the limited range of our sensor). With locative expressions, however, we can infer information beyond that explicitly conveyed. Suppose, for our town example, that the town was 1km in diameter, and that the distance judgement of 10km (to some point within the town) was entirely accurate. If the centre of the town was actually 10.5km away, our observation would still hold, but a point any further away could not possibly be occupied by ‘town’.

Computing the Inverse Sensor Model

We can use the above fact to compute $p(m_{i,j} \mid x_t, z_t)$ from our previously calculated values of $p(r_{i,j} \mid x_t, z_t)$. Let $Q$ be the set of possible ‘poses’ for the located object such that the point $(i, j)$ is within the object’s boundary, where a pose is the position and orientation of the object. Given our assumption that the observer referred to a point within the confines of their perceived position of the object, $Q$ represents all valid poses of the object given such a point. It follows that $p(m_{i,j} \mid x_t, z_t) = \int_{q \in Q} p(q \mid x_t, z_t)\, dq$. For each pose $q \in Q$ there is an associated frame $(i_q, j_q, \theta_q)$, where $(i_q, j_q)$ is the nominal centre of the shape in pose $q$ (say the centre of mass) and $\theta_q$ is the rotation of the object about this point. It is then possible to use $p(r_{i_q, j_q} \mid x_t, z_t)$ to refer to the probability of the object being positioned at $(i_q, j_q)$ (see Fig. 5). The pose also has a probability $p(\theta_q)$ associated with its orientation; for simplicity we assume this is independent of $x_t$ and $z_t$ (although the use of $p(\theta_q \mid z_t)$ would allow us to model, for example, observations such as ‘The boat is in front of you, facing East’). Putting this together gives us the following equation to compute the occupancy probability:

$$p(m_{i,j} \mid x_t, z_t) = \int_{q \in Q} p(r_{i_q, j_q} \mid x_t, z_t)\, p(\theta_q)\, dq \quad \text{s.t. } Q = \{\, q \mid (i, j) \in R(q) \,\} \qquad (3)$$

where $R(q)$ is the region of the located object in pose $q$.

[Figure 5: the known shape of the object (presuming availability of this data) in some pose $q$; the nominal centre $(i_q, j_q)$ of the object in this pose; the axis/rotation $\theta_q$ of the object in this pose; the point $(i, j)$ under consideration.]

Considering the pose of the located object has useful consequences; it allows us to model, for example, that vehicles are aligned to the direction of a road. Given a lack of prior shape information about the located object, and given that the above integral is somewhat intractable, a suitable approximation is to use the approximate width $W$ of the object (which can be obtained via knowledge of the class of the located object, say the usual width of a town). If we infer as little about the shape as possible, the resulting approximation of the shape is a circle of diameter $W$. Equation 3 then reduces to the following:

$$p(m_{i,j} \mid x_t, z_t) = \int_{(i', j') \in R(W/2,\, i,\, j)} p(r_{i', j'} \mid x_t, z_t)\, di'\, dj' \qquad (4)$$

s.t. $R(W/2, i, j)$ is the set of points in a circular region with centre $(i, j)$ and radius $W/2$.
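A discrete version of Equation 4 might look like the following sketch, in which the integral is replaced by a sum over grid cells whose centres lie within W/2 of the cell of interest. The grid resolution, the width measured in cells and the function name are assumptions made for illustration.

```python
def occupancy(referred, width):
    """Approximate occupancy of each cell from the 'referred point' grid.

    referred[j][i] is the probability that the observer referred to cell
    (i, j); width is the assumed object diameter W in cells. Each output
    cell sums the referred-point probability over a disc of radius W/2
    centred on that cell.
    """
    radius = width / 2.0
    reach = int(radius)  # furthest whole-cell offset that can lie in the disc
    rows, cols = len(referred), len(referred[0])
    result = [[0.0] * cols for _ in range(rows)]
    for j in range(rows):
        for i in range(cols):
            total = 0.0
            for dj in range(-reach, reach + 1):
                for di in range(-reach, reach + 1):
                    jj, ii = j + dj, i + di
                    inside_grid = 0 <= jj < rows and 0 <= ii < cols
                    inside_disc = di * di + dj * dj <= radius * radius
                    if inside_grid and inside_disc:
                        total += referred[jj][ii]
            result[j][i] = total
    return result

# Hypothetical 5 x 5 grid in which the observer certainly referred to the
# central cell, and an object two cells wide.
referred = [[0.0] * 5 for _ in range(5)]
referred[2][2] = 1.0
grid = occupancy(referred, width=2.0)
for row in grid:
    print(row)
```

With all of the referred-point probability on one cell, the occupied cells are exactly those within one cell of it, so a single point reference spreads into a region the size of the object.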

Conclusions & Future Work

In this paper we motivated a probabilistic approach to modelling spatial language that can be used in a number of algorithms, and provided an example of such an algorithm to induce a sense of ‘the space that an object occupies’ via the use of occupancy grid maps. We also presented models for the prepositions ‘by’ and ‘between’ based on the results of an online experiment. Future work is predominantly focused on further development of our dialogue manager language that interacts with these spatial models, as well as on developing further algorithms which make use of such models. For example, we have developed an algorithm that combines semantic and spatial models to provide confidence scores for arbitrarily complex locative expressions (including those based on current bounded trajectories, such as ‘the second left’). We are also investigating a measure of ‘relevance’ (one of the Gricean maxims [9]) for locative expressions, a consideration that is particularly key in generating descriptions of objects or locations.

This work has been supported by the European Commission under grant agreement number FP7-231888-EUROPA.

1. Herskovits, A. (1986) Language and Spatial Cognition, Cambridge University Press.

2. Dale, R. and Haddock, N. (1991) In Proceedings of the fifth conference on European chapter of the Association for Computational Linguistics, Morristown, NJ, USA: Association for Computational Linguistics. pp. 161-166.

3. Olivier, P. and Tsujii, J.-I. (2004) Artificial Intelligence Review 8, 147-158.

4. Kelleher, J. D. and Costello, F. J. (2009) Comput. Linguist. 35(2), 271-306.

5. Logan, G. and Sadler, D. (1996) Language and Space, chapter A computational analysis of the apprehension of spatial relations, pp. 493-529, MIT Press.

6. Levinson, S. C. and Wilkins, D. P. (2006) Grammars of Space: Explorations in Cognitive Diversity, chapter 1, pp. 45, Cambridge University Press.

7. Regier, T. and Carlson, L. A. (2001) J. Exp. Psychol. Gen. 130(2), 273-298.

8. Thrun, S. (2003) Auton. Robots 15(2), 111-127.

9. Grice, H. P. (1975) Logic and conversation. In Peter Cole and Jerry L. Morgan (ed.), Syntax and Semantics 3: Speech Acts, volume 3, pp. 41-58, New York: Academic Press.