Representing and Reasoning with Functional Knowledge for Spatial Language Understanding

Kalyan Moy Gupta1, Abraham R. Schneider1, Matthew Klenk2, Kellen Gillespie1, and Justin Karneeb1
1 Knexus Research Corporation, Springfield, VA 22153
2 Naval Research Laboratory, Washington, DC 20375
firstname.lastname@knexusresearch.com

Abstract. One of the central problems in spatial language understanding is the polysemy and vagueness of spatial terms, which cannot be resolved by lexical knowledge alone. We address this issue by developing a representation framework for functional interactions between objects and agents. We use this framework with a constraint solver to resolve and recover the meanings of spatial descriptions for object placement tasks. We illustrate our approach with an example object placement task in a virtual environment.

1 Introduction

Virtual scene (re)construction, or object placement, is a vital task in many practical applications such as background layout in 3-D animated movies, accident and crime scene simulation, and navigation map development for video games. Natural language (NL) commands can be a natural and efficient alternative to the otherwise effort-intensive manual placement of objects in a virtual world (e.g., Coyne and Sproat, 2001; Dupuy, 2001). However, machine understanding of natural language commands is notoriously difficult due to the polysemy and vagueness of spatial terms. Lexical semantic knowledge of spatial terms alone is clearly insufficient for this task; world knowledge and pragmatics must be considered to understand language in a form that can be acted upon by autonomous agents.

Over the past couple of decades, much research in spatial term semantics has focused on developing computational models that map utterances to semantics (e.g., Regier and Carlson, 2001; Coventry et al., 1994). Although such research recognizes the need for pragmatic and functional knowledge about objects, the development of computational models for representing and using such knowledge has received little attention. To address this gap, we present a framework for representing world knowledge that can be effectively translated into spatial constraints to resolve vague and underspecified natural language commands. We also present an algorithm that uses this knowledge to interpret natural language commands and perform valid, least-cost object placements.

We organize the remainder of this paper as follows. In the next section, we explain the nature of linguistic underspecification in object placement tasks. We follow this with a description of our representation framework and an algorithm that performs a linguistically commanded single object placement task. Next, we illustrate our approach with an example. Finally, we discuss the strengths and limitations of our approach and conclude the paper.

2 Vagueness in NL-Driven Object Placement

Consider the task of generating a static scene described by text utterances in a 3-D virtual environment, for example, generating a scene with "a chair in front of the table" and subsequently placing a "printer on the table." The desired rendering of the scene is shown in Figure 1.

Figure 1. Example scene imagination based on linguistic description

The central issue in such a task is interpreting vague spatial prepositions such as on and in-front-of into valid object placements. The utterance "printer on the table" can only be judged as vague when attempting to place the printer in the World.
For instance, the possible placements on the table are to the left, right, front, and back of the monitor. However, the placements in front of and behind the monitor are functionally invalid for a human user. The utterance also does not specify a suitable orientation for the printer. Without such a specification, the printer could be oriented in numerous ways in relation to the monitor and the chair, only some of which would be valid. For example, the orientation shown in Figure 1 is valid, whereas orientations such as upside down or facing the wall would be invalid. Clearly, functional knowledge of the interactions between objects must be considered to generate valid placements. The question is what the content of such functional and world knowledge should be, and how it can be used to recover the unspecified elements and generate a complete and valid specification for object placement. We answer this question in the next section.

3 Representation and Reasoning for Linguistically Commanded Object Placement

Problem Task: Given a world W containing a set of objects O located in various places, and an underspecified linguistic command requesting the placement of a target object ot in W, find a location with the least interaction cost at which to place ot. We return to the notion of interaction cost later in this section.

Functional Knowledge Representation. We introduce an autonomous agent, α, as the central element of a functional representation of the objects O and their parts in the World. Given our goal of building agents that interact with humans, our representation encodes spatial constraints accordingly. We assume that α is human-like and interacts with objects using a set of primitive actions or perceived affordances (Gibson, 1977; Norman, 2002). We introduce the following set of primitive actions:

1. Reach: the agent reaches for objects to manipulate and interact with them. Given our assumption that α is human-like, we subcategorize the reach interaction as follows:
   1.1. Reach.Arm: the agent reaches for objects with arms fully extended.
   1.2. Reach.Forearm: the agent reaches for objects with only the forearm extended.
   1.3. Reach.Foot: the agent reaches for objects with its foot.
   1.4. Reach.Assisted: the agent reaches for objects with tools.
2. See: the agent obtains visual information from objects to perform reach actions. For an agent to see objects, it must be oriented toward them. In certain situations the agent must also be able to read the information present on an object. We represent this with the read action, a tighter constraint than see:
   2.1. Read: the agent reads information present on the object, such as signs or writing. This can be further subcategorized into reading fine print, normal print, large print, poster print, and so on.

We further assume that the agent performs these activities while located at certain places in W called activity stations, S. In addition, we assume that the agent has the following human-like poses: sitting, standing, and lying down. We categorize the functional relations between objects into the following three types:

1. Support: a functional relation typically implied by the preposition "on" in English. For example, a table supports a printer, and the printer is supportedBy the table.
2. Contain: a functional relation typically implied by the preposition "in". For example, a box contains a printer, and the printer is containedBy the box.
3. Group: relates multiple objects into a spatial group. For example, a computer keyboard and a display monitor may be related to each other by a spatial group relation.

An object and its various parts may solicit different functional interaction constraints for agents. A representation of the functional interaction constraints for a printer is shown in Table 1. We assume a canonical geo-orientation for the printer, that is, it is upright. The table specifies, for example, that the printer's control panel must be readable to the agent.

Table 1. Example representation of functional interaction constraints for a printer

Object/parts       Object interaction                     Agent interaction        Pose        Activity station
Printer (parent)   -                                      Reachable.Arm; Visible   Stand; Sit  Perimeter
Control panel      supportedBy(parent)                    Readable
Connection panel   supportedBy(parent)                    Reachable.Arm
Paper tray         containedBy(parent); contains(paper)   Reachable.Arm
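To make this representation concrete, the following minimal Python sketch encodes the primitive actions, the poses, and the rows of Table 1 (where Visible and Readable correspond to the see and read primitives). All identifiers here (AgentInteraction, PartConstraints, and so on) are illustrative assumptions, not names from our implementation.

```python
from dataclasses import dataclass, field
from enum import Enum

# Primitive agent interactions and poses from Section 3 (assumed names).
class AgentInteraction(Enum):
    REACH_ARM = "Reach.Arm"
    REACH_FOREARM = "Reach.Forearm"
    REACH_FOOT = "Reach.Foot"
    REACH_ASSISTED = "Reach.Assisted"
    SEE = "See"
    READ = "Read"

class Pose(Enum):
    SIT = "sit"
    STAND = "stand"
    LIE_DOWN = "lie down"

@dataclass
class PartConstraints:
    """Functional interaction constraints for one object or part
    (one row of Table 1)."""
    name: str
    object_relations: list[str] = field(default_factory=list)  # e.g., "supportedBy(parent)"
    agent_interactions: list[AgentInteraction] = field(default_factory=list)
    poses: list[Pose] = field(default_factory=list)
    activity_station: str | None = None  # e.g., "perimeter"

# Table 1 for the printer, expressed in this representation.
printer = [
    PartConstraints("printer",
                    agent_interactions=[AgentInteraction.REACH_ARM, AgentInteraction.SEE],
                    poses=[Pose.STAND, Pose.SIT],
                    activity_station="perimeter"),
    PartConstraints("control panel", ["supportedBy(parent)"],
                    [AgentInteraction.READ]),
    PartConstraints("connection panel", ["supportedBy(parent)"],
                    [AgentInteraction.REACH_ARM]),
    PartConstraints("paper tray", ["containedBy(parent)", "contains(paper)"],
                    [AgentInteraction.REACH_ARM]),
]
```

In this form, the knowledge base KB introduced below is simply a mapping from object categories to such lists of part constraints.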
We introduce the notion of a possible interaction space, PIS, for an agent at an activity station (e.g., see Kurup & Cassimatis, 2010). As a simplification, in this paper we assume that the possible interaction space is a two-dimensional region. Figure 2 shows the PIS with its reachability, visibility, and readability spaces.

Figure 2. Possible interaction spaces for agent α and the possible linguistically constrained space for inFrontOf(Printer)

We also introduce the notion of a possible linguistically constrained space, PLCS. For example, Figure 2 shows the region selected by the function inFrontOf(Printer). The possible space resulting from multiple constraints is the intersection of the individual possible spaces. We use this approach in the linguistically commanded single object placement algorithm presented next.
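Under the two-dimensional simplification, possibility spaces can be modeled as sets of grid cells, and the composition of multiple constraints reduces to set intersection. The sketch below illustrates this before we present the algorithm; the grid discretization, the helper disc, and the concrete regions are assumptions for illustration, not our system's geometry.

```python
# Possibility spaces as sets of 2-D grid cells: an assumed discretization
# of the regions in Figure 2 (the paper treats them as abstract regions).
Cell = tuple[int, int]

def disc(center: Cell, radius: int) -> set[Cell]:
    """Grid cells within `radius` of `center`; stands in for a reach or
    visibility region around an activity station."""
    cx, cy = center
    return {(x, y)
            for x in range(cx - radius, cx + radius + 1)
            for y in range(cy - radius, cy + radius + 1)
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2}

station = (0, 0)
reach = disc(station, 2)       # Reach.Arm space at the station
visibility = disc(station, 4)  # See space at the station
pis = reach & visibility       # possible interaction space (PIS)

# PLCS for "on the table": the cells of the table surface (assumed extent).
plcs = {(x, y) for x in range(-1, 4) for y in range(1, 3)}

# The space satisfying all constraints is the intersection of the
# individual possibility spaces: PPS = PIS ∩ PLCS.
pps = pis & plcs
print(sorted(pps))  # [(-1, 1), (0, 1), (0, 2), (1, 1)]
```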
Linguistically commanded single object placement algorithm. The algorithm begins by generating the set of potential activity stations and identifying the smallest subset that satisfies the spatial constraints in the World. The functional interaction knowledge of the objects in the World is transformed into spatial constraints. Next, the algorithm uses possibility spaces to identify the candidate placements and selects the one with the least cost. We detail these steps below.

Inputs
1. O, the set of objects in W.
2. ot, the target object to be placed (e.g., Printer).
3. lcs, the linguistically expressed placement constraint (e.g., on the table).
4. KB, the functional interaction knowledge base containing the agent and object interaction knowledge for all objects in W (O and ot).
5. αpsp, the possibility space parameter for agent α, for which the minimal cost placement is to be performed.

Output
1. P, a set of placements with minimum cost of functional interaction for agent α.

Processing steps
1. Find the smallest set of activity stations, Smin, that satisfies the functional interaction constraints for all objects oi ∈ O; the constraints are retrieved from the KB for a given category of object. The candidate activity stations for an object oi are located around its perimeter.
2. Set the candidate placements CP = ∅ and the placement cost pc = 0.
3. Set the candidate stations Sc = Smin.
4. For each activity station sj ∈ Sc:
   a. Compute the possible placement space, PPSs, for the target object ot as the intersection of PIS, the possible interaction space at the activity station, and PLCS, the linguistically constrained space: PPSs = PIS ∩ PLCS.
   b. Generate candidate placements (cp) and compute their cost, c. If PPSs is not empty, candidate placements are drawn from it; only those that satisfy all the interaction constraints of ot without violating any of the existing constraints satisfied by sj are retained. Candidate placements are combinations of locations and orientations. For simplicity, we consider only four orientations of ot relative to the orientation of the agent at the activity station. The cost of a placement is 0 when one of the existing stations is used for the placement.
5. Select the minimum cost placements P:
   IF CP ≠ ∅ THEN return the minimum cost placements P ⊆ CP;
   ELSE generate new activity stations (Snew) in the neighborhood of the stations in Sc, set Sc = Snew, and IF pc = 0 THEN set pc = 1 (i.e., the cost of placement increases with the number of new activity stations); go to Step 4.
End.
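The processing steps map directly onto a search loop. The following Python sketch mirrors steps 2 through 5 under the grid-cell model above; place, pis_of, ok, and neighbors are hypothetical names, and the constraint checking that the KB would perform is abstracted behind ok.

```python
import itertools

ORIENTATIONS = range(4)  # the four orientations of ot considered in step 4b

def place(target, smin, pis_of, plcs, ok, neighbors, max_rounds=10):
    """Schematic rendering of processing steps 2-5. The callables pis_of
    (PIS of a station), ok (does a placement satisfy the target's constraints
    without violating existing ones?), and neighbors (new stations near an
    existing one) are assumed stubs backed by the KB and the geometry layer;
    max_rounds is an added safeguard not in the pseudocode above."""
    stations, cost = list(smin), 0            # steps 2-3: CP = ∅, pc = 0, Sc = Smin
    for _ in range(max_rounds):
        candidates = []
        for s in stations:                    # step 4
            pps = pis_of(s) & plcs            # step 4a: PPS = PIS ∩ PLCS
            for cell, rot in itertools.product(pps, ORIENTATIONS):
                if ok(target, cell, rot, s):  # step 4b: retain valid placements
                    candidates.append((cell, rot, cost))
        if candidates:                        # step 5: CP ≠ ∅, all at current cost
            return candidates
        stations = [n for s in stations for n in neighbors(s)]  # Snew
        if cost == 0:
            cost = 1                          # new stations make placements costlier
    return []
```

Because every candidate generated in a given round carries the same cost, returning the first non-empty round is equivalent to selecting the minimum-cost placements. In the example that follows, smin would be {s4}, and the sketch would return the zero-cost placements corresponding to cp3 and the reoriented cp2.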
Example. Consider a world W that includes a table placed against a wall with a monitor on it. In addition, it includes a chair located in front of the monitor (see Figure 3). The placement agent receives a linguistic command to place a printer, ot, in this world: "Place the printer on the table."

Figure 3. Place the printer on the table

We assume that this linguistic command (i.e., lcs) is interpreted into a semantic form and that its PLCS, the entire surface of the table, is computed. We begin with Step 1 to find Smin. The algorithm generates the potential activity stations in the world, for example, stations s1 through s7. It is easy to see that stations s1, s2, s6, and s7 do not satisfy the reachability and readability constraints for the monitor. Similarly, stations s3 and s5 fail to satisfy the readability constraint of the monitor. Notice that alternative orientations of these stations would also fail the reachability constraints of various objects. Activity station s4 is the only one that satisfies the reachability and readability constraints for the monitor and the reachability constraint of the chair (i.e., Smin = {s4} = Sc). We perform Step 4 and obtain the PPS for s4, shown in grey, as the intersection of the PIS and the PLCS (on the table). Since the PPS is not empty, we create candidate placements cp1 through cp4. Although cp1 satisfies the printer's reachability and readability constraints, it violates the monitor's readability constraint for station s4. Similarly, cp2 fails to satisfy the readability constraint for the printer; note, however, that reorienting cp2 to face the agent creates a valid placement. The candidate placement cp3 satisfies all the constraints and is valid. Placement cp4 is not in the PPS and is shown here for illustration only. Our example illustrates how the algorithm, using functional knowledge about object and agent interactions, produces two valid placements for the printer given a highly underspecified placement directive.

4 Discussion

Recent research on spatial language understanding has pointed out the need for functional representations for understanding spatial utterances. For example, Coventry and Garrod (2004) present a functional geometric framework that includes geometric and dynamic kinematic routines and object knowledge. Our approach also considers dynamic interactions and object knowledge. However, we explicitly consider the role of an agent along with a very small set of primitive interaction affordances specialized for the object placement task. Further, we present an inferencing algorithm that uses this world knowledge to perform valid object placements. Lockwood (2009) also emphasizes the need for functional knowledge but focuses on structure mapping as a means of learning functional knowledge for a scene labeling task; she does not include an interpretation method to recover the meanings of underspecified utterances. In contrast, we manually encode the affordances needed to recover underspecified spatial semantics in object placement tasks. We intend to develop methods for acquiring this interaction knowledge in future work.

Although we demonstrated the use of functional knowledge for generating valid object placements, we did not consider pragmatic and contextual elements such as the plans, goals, and situation of the agent requesting the placements. For instance, the directive "put the printer on the table" would carry different functional constraints if the requester were a mover in an office building or a warehouse rather than an office worker. We plan to extend our models to include constraint selection based on the requesting agent's goals and intentions.

5 Conclusion

Interpreting spatial descriptions and commands, such as those for object placement, poses significant challenges due to the polysemy and underspecification of spatial term semantics. To address this issue, we developed a functional interaction knowledge representation framework with a very small number of agent action primitives and object-to-object interaction primitives. We described a cost-based constraint satisfaction algorithm that uses this world knowledge for object placement. In future work, we will implement and evaluate the performance of our algorithm with varying numbers of objects in the scene and consider aspects of visual attention to resolve residual ambiguities and deictics (e.g., see Kelleher, 2003). Additionally, we will extend our approach to include the role of the goals and intentions of the requesting agent in selecting the appropriate spatial constraints for object placement.

Acknowledgements

This research was supported by the Office of Naval Research. Matthew Klenk was supported by an NRC Postdoctoral Research Fellowship. We thank the two anonymous reviewers for their comments and suggestions.

References

Coventry, K.R., & Garrod, S.C. (2004). Saying, Seeing and Acting: The Psychological Semantics of Spatial Prepositions. Hove: Psychology Press.
Coventry, K.R., Carmichael, R., & Garrod, S.C. (1994). Spatial prepositions, object-specific function and task requirements. Journal of Semantics, 11, 289-309.
Coyne, B., & Sproat, R. (2001). WordsEye: An automatic text-to-scene conversion system. In SIGGRAPH '01: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques (pp. 487-496). New York, NY: ACM.
Dupuy, S. (2001). Generating a 3-D simulation of a car accident from a written description in natural language: The CARSIM system. In Proceedings of the Workshop on Temporal and Spatial Information Processing (pp. 1-8).
Gibson, J.J. (1977). The theory of affordances. In R. Shaw & J. Bransford (Eds.), Perceiving, Acting, and Knowing: Toward an Ecological Psychology (pp. 67-82). Hillsdale, NJ: Erlbaum.
Kelleher, J.D. (2003). A perceptually based computational framework for the interpretation of spatial language. Ph.D. thesis, School of Computing, Dublin City University, Dublin, Ireland.
Kurup, U., & Cassimatis, N.L. (2010). Quantitative spatial reasoning for general intelligence. In Proceedings of the Third Conference on Artificial General Intelligence (pp. 1-6). Lugano, Switzerland.
Lockwood, K. (2009). Using analogy to model spatial language use and multimodal knowledge capture. Ph.D. thesis, Department of Computer Science, Northwestern University, Evanston, IL.
Norman, D. (2002). The Design of Everyday Things. New York, NY: Basic Books.
Regier, T., & Carlson, L. (2001). Grounding spatial language in perception: An empirical and computational investigation. Journal of Experimental Psychology: General, 130, 273-298.