Talking Robots

Emanuele Bastianelli

Ph.D. student at the Department of Civil Engineering and Computer Science Engineering, Tor Vergata University of Rome, Italy
Research associate at the Department of Computer, Control, and Management Engineering, Sapienza University of Rome, Italy
bastianelli@ing.uniroma2.it

Abstract. In recent years, robotic platforms have appeared in many research and everyday-life contexts; an easy way of interacting with them has therefore become a necessity. Human Robot Interaction is the research field that studies how robots can interact with humans in the most natural way. In this work we will present preliminary studies that we have carried out in this direction, focusing on Natural Language based interaction, with particular attention to the grounding problem. In particular, we will study how Statistical Machine Learning techniques can be applied to natural language as it is used to interact with robots. Moreover, we will also investigate how this approach can be integrated into such complex systems.

1 Introduction

Robots are slowly becoming part of everyday life, as they are being marketed for commercial applications (viz. telepresence, cleaning or entertainment). As a consequence, the ability to interact with non-expert users is becoming a key requirement. The Human Robot Interaction (HRI) field aims at realizing robotic systems that offer a level of interaction that is as natural as possible. This means providing robots with sensory systems capable of understanding and replicating human communication, such as speech, gestures, voice intonation, pragmatic interpretation, and any other non-verbal interaction. The ultimate goal of this research area is to provide robots with the ability to resolve human language references within the application context they belong to, be it the real world (e.g. assigning the right coordinates to the phrase the kitchen) or an abstract one (e.g. resolving anaphoric references in the domain of the discourse). This cognitive process, natural and implicit among human beings, is commonly called grounding.

Our research will investigate the problems related to the natural language analysis involved in the design of HRI systems. To this aim, we will explore the possibility of reusing approaches that have been widely applied to different Natural Language Understanding (NLU) tasks, testing their applicability in the HRI field. In particular, we will focus on finding a bridge between the linguistic knowledge expressed in spoken commands and the robot's representation of the world, as a support for the grounding process.

2 Motivations and Background Work

Among the different kinds of interaction treated by HRI, we will focus on those aspects involving natural language. User utterances can be recognized and transcribed by Automatic Speech Recognition (ASR) systems, which in recent years have become more and more accessible and powerful. The main issue is that, in order to translate user utterances into robotic actions, we need to understand their meaning. For instance, from the sentence "take the bottle on the table", we need to derive the command corresponding to the action of taking; moreover, we need to identify the relation holding between the bottle and the table. This semantic information can be crucial as well to ground linguistic expressions into objects as they are represented in the robot's set of beliefs (i.e. the robot's knowledge and perceptions).
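As a minimal sketch of the kind of structured meaning involved here, the fragment below (in Python; all class and field names are our own illustrative assumptions, not part of any existing system) pairs the action extracted from "take the bottle on the table" with the relation holding between the two objects:

from dataclasses import dataclass

@dataclass
class SpatialRelation:
    relation: str    # the relation holding between the objects, e.g. "on"
    trajector: str   # the located entity, e.g. "bottle"
    landmark: str    # the reference entity, e.g. "table"

@dataclass
class Command:
    action: str               # symbol of the robot action, e.g. "TAKE"
    target: str               # linguistic referent of the object acted upon
    context: SpatialRelation  # relation that helps ground the target

cmd = Command(action="TAKE", target="bottle",
              context=SpatialRelation("on", "bottle", "table"))
# Grounding then amounts to finding, in the robot's set of beliefs, the
# percept "bottle" for which the relation on(bottle, table) actually holds.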
To fill the gap between the robot's world representation and the linguistic knowledge expressed in user utterances, we need to extract the meaning of a sentence and represent it in a suitable form. Grammar-based ASR systems often offer the possibility of attaching semantic primitives to each grammar rule; the meaning representation is then obtained as the composition of all the primitives traversed during decoding. This approach has been widely adopted in the robotics field, as in [4]. Grammars, however, have the limit of covering just a segment of the language. If we want to realize more general HRI systems, and thus cover a wider range of linguistic phenomena, we need to rely on free-form speech recognition engines. Unfortunately, such systems do not provide any additional information besides the plain transcription of the utterance: a representation of its meaning can only be obtained by an external semantic parsing process. Natural Language Processing (NLP) approaches based on formal languages have found wide application in the HRI field, e.g. semantic parsing with Combinatory Categorial Grammars (CCG), as in [5], where a meaning representation based on Discourse Representation Structures [9] is obtained directly from the speech recognition output. Similarly, in [17], CCGs are used to produce a representation in terms of Hybrid Logic Dependency Semantics [16] logical forms. However, the overall attention has recently shifted towards the application of Statistical Learning techniques, reflecting the will of designing more general solutions. Several fields of research have shown a growing interest in HRI, giving the chance to apply these techniques in this area: experts with different backgrounds, mainly from Robotics, Computational Linguistics and Cognitive Science, have proposed their own approaches.

The problem of grounding natural language symbols into robot representations of the world has mostly been explored in developing systems for tasks such as Human-Augmented Mapping, or systems able to follow route instructions. In [18], MARCO, a simulated robot system able to follow route instructions in a virtual environment, is presented. Here spoken commands are parsed into compound action specifications that model which actions to take under which conditions. These structures capture the commands in route instructions by modeling the surface meaning of a sentence as a verb-argument structure, and are obtained through a natural language processing chain. This work has been continued and extended in [6], where Statistical Learning is applied to learn how to map commands into corresponding logical-form-like structures. Such a structure represents the robot instruction that can be directly executed, and it implicitly resolves the grounding of all the entities involved. The work in [22] proposes a system that learns to follow navigational natural language directions by apprenticeship, from routes through a map paired with English descriptions; a reinforcement learning algorithm is applied to determine which portions of the language describe which aspects of the route. Other works have been inspired by novel spatial semantic theories. In [14] the problem of executing natural language directions is formalized through Spatial Description Clauses (SDCs), a structure that can hold the result of the spatial semantic parsing in terms of spatial roles.
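To make the notion concrete, the following sketch (Python; the four field names follow the SDC definition in [14], while the example decomposition is our own illustration) shows a route-instruction clause factored into an SDC:

from dataclasses import dataclass
from typing import Optional

@dataclass
class SDC:
    """Spatial Description Clause, after [14]."""
    figure: str                             # entity being moved, usually the robot
    verb: Optional[str] = None              # the action, e.g. "go"
    spatial_relation: Optional[str] = None  # e.g. "through", "past", "to"
    landmark: Optional[str] = None          # reference object, e.g. "the door"

# "Go through the door" as a single SDC; a whole route instruction becomes
# a sequence of such clauses, grounded one by one against the map.
sdc = SDC(figure="you", verb="go", spatial_relation="through",
          landmark="the door")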
The same representation has been exploited in a subsequent work [21], where probabilistic graphical model theory is applied to parse each instruction into a structure called Generalized Grounding Graph (G³). Here the SDCs are used as a base component of the more general G³ structure, which represents both semantic and syntactic information of the spoken sentence. In some cases, the construction of the representation itself is taken into account, as in [11], where the robot learns the features of the environment through narrated guided tours. In this work, the robot builds both the metrical and topological representations of the environment during the tour; spatial and semantic information is then associated with the evolving situations through the labeling of events that occur during a tour, and is later attached to the nodes of a topological graph.

However, the approaches proposed so far have only taken into account single aspects of the overall linguistic analysis necessary to realize a complete grounding process (e.g. the deep analysis of spatial relations [14, 21]). The complexity of the problem is higher, and it is well described in [20]. There, it is stated that a complete natural language HRI system should be able to: (i) react in the same time frame as a human; (ii) process all stages of language processing concurrently; (iii) understand spoken language; (iv) decode multi-modal cues, such as linguistic expressions accompanied by gestures; (v) share the perspective on the world and on events with its interlocutors; (vi) initiate interaction to support bidirectional communication. All these features can constrain the possible interpretations of the language, biasing the grounding process. It follows that the level of natural language analysis needed is high and complex, as different kinds of information, corresponding to different levels of semantics, need to be extracted and provided to the system. Moreover, this results in a sophisticated interaction schema among the system modules (e.g. NLU processors, inference engines over knowledge bases, perception systems).

The need to re-elaborate the problem from this point of view is being perceived by the community. Complex architectures have already been realized for tasks such as Question Answering, where the cooperation of structured NLP modules and other processors is fundamental. In order to maximize replicability and adaptability, we argue that similar approaches should be followed in the implementation of HRI interfaces. One of our purposes is to study the applicability of robust NLP techniques that have already been adopted for other tasks.

Following this direction, as a basic step of our research we contributed to the development of a prototype robot for Human-Augmented Mapping that is being used for experimental purposes. The data gathered during the experiments will be used in this research. In the meanwhile, a corpus of spoken commands is being collected using a web interface. It contains audio files paired with the corresponding transcriptions, and each transcription is annotated according to different semantic formalisms describing the linguistic knowledge we want to capture. This corpus should become a useful resource for several tasks, e.g. training specific learning algorithms.
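To give an idea of its structure, one record of the corpus could look as follows (a sketch in Python; the schema, the file path and the specific labels are hypothetical, and the two annotation layers anticipate the formalisms introduced in Section 3):

entry = {
    "audio": "corpus/cmd_0042.wav",      # hypothetical file path
    "transcription": "go to the kitchen",
    "annotations": {
        # action-level layer: the frame evoked by the verb and its arguments
        "frames": [{"frame": "Motion", "lexical_unit": "go",
                    "roles": {"Goal": "to the kitchen"}}],
        # spatial layer: roles of the spatial expression in the sentence
        "spatial_roles": {"spatial_indicator": "to", "landmark": "kitchen"},
    },
}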
3 Theories and Methods

A hypothetical NLU processing chain for an HRI system has to deal with audio processing and transcription, meaning understanding, and dialogue management. The first module consists of the ASR engine. To improve the grounding process, this module can be extended with the capability of detecting the source of the speech: this could assist the identification of the reference point of certain spatial expressions (e.g. "the door on my right"). Morphological analysis and syntactic parsing are performed in the second step, as they can add crucial information for further semantic processing. The latter is the core of the NLU chain. Different semantic parsers can be used in parallel or in cascade, as the information generated by one parser can be useful to others. During this step, the modules can also interact with external resources, such as linguistic thesauri or knowledge bases; this can be useful for discarding unlikely interpretations, consequently leading the system to consider other hypotheses from the ASR. Finally, each utterance should be enriched with all the meaning representations needed to correctly ground it in the robot's set of beliefs. The dialogical interaction can be managed by a dialogue system that interacts with each step of the process.

In our research, we will mainly focus on the semantic analysis part of the chain. In fact, while robust tools are available for ASR (e.g. the Microsoft Speech Platform, the Google Speech API or CMU Sphinx [23]) and for morpho-syntactic analysis (e.g. Stanford CoreNLP [13]), semantic parsing must be designed from scratch. Although semantic processors exist, they are not always free and, more importantly, they offer just one level of semantic analysis. As stated in Section 2, we need different levels of information; consequently, our HRI system should rely on several semantic parsers. We then need to define which aspects of the world we want to model through semantic analysis. First, in order to be useful, a robot is expected to perform the actions corresponding to the received commands; second, these actions take place in a physical environment. Looking back at the linguistic theories that studied how these two aspects are conveyed through language, we found that Frame Semantics [10] and Holistic Spatial Semantics [24] offer models of interpretation suitable for our purposes. The first generalizes actions or, more broadly, experiences, representing them as Semantic Frames. Each frame describes a scene or the general concept behind an action, enriched by a set of semantic arguments that play specific roles with respect to the frame. Robot actions can then be linked to the semantic frame corresponding to each action. For example, in the sentence "take the book on the table", the semantic frame related to the action of Taking is evoked by the verb take; the semantic role Theme (i.e. the entity taken during the Taking action) is here expressed by the book on the table. Similarly, Holistic Spatial Semantics explains the spatial referring expressions contained in sentences in terms of spatial relations composed of spatial roles. Considering the previous example, the words book and table are related through the preposition on, which expresses the spatial relation and plays the role of Spatial Indicator, while the other two are respectively the Trajector and the Landmark. These two representations can collaborate to model the sentence meaning in a complete way.
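The example above can be made concrete by keeping the two layers side by side in a single structure, as in the sketch below (Python; the container format shown is only illustrative, and finding a proper general-purpose one is exactly the open issue discussed next):

sentence = "take the book on the table"

# Frame Semantics layer: the Taking frame evoked by "take", with its Theme.
frame_layer = {"frame": "Taking", "evoked_by": "take",
               "roles": {"Theme": "the book on the table"}}

# Holistic Spatial Semantics layer: the relation expressed by "on".
spatial_layer = {"trajector": "book",
                 "spatial_indicator": "on",
                 "landmark": "table"}

interpretation = {"sentence": sentence,
                  "frames": [frame_layer],
                  "spatial_relations": [spatial_layer]}

# Grounding can now look in the robot's beliefs for a book b and a table t
# such that on(b, t) holds, and bind b as the Theme of a Taking action.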
One more issue to be addressed is that these representations were not designed to work together, so further research is needed on a formalism that could act as a general-purpose semantic container for both.

In the first step of this research, many of the aspects reported so far have been individually examined, and solutions based on novel NLP techniques have been proposed. In [2] we propose a re-ranking approach to select the best speech transcription from the set of different hypotheses produced by an ASR system. The ranking function is learned through a Support Vector Machine (SVM) exploiting a combination of different kernels capturing syntactic and semantic aspects of the utterances (e.g. Smoothed Partial Tree Kernels [8]). Moreover, the linguistic problem of extracting semantic representations from natural language expressions has been addressed in tasks such as Semantic Role Labeling (SRL) [19] and Spatial Role Labeling (SpRL) [15]. We developed SRL [7] and SpRL [3] systems that model the problem as a sequential labeling task, exploiting specific formulations of SVMs, namely SVM-multiclass [12] and SVM-hmm [1]. These systems have not yet been used together; their application in an HRI architecture, on the robotic prototype we developed, deserves further investigation.
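As an illustration of the sequential labeling formulation, the sketch below (Python; the BIO tag set and the feature function are simplified assumptions, not the exact setup of [3] or [7]) encodes the spatial roles of the running example as a tag sequence on which a sequence classifier such as SVM-hmm can be trained:

tokens = ["take", "the", "book", "on", "the", "table"]
tags   = ["O", "O", "B-TRAJECTOR", "B-SPATIAL_INDICATOR", "O", "B-LANDMARK"]

def token_features(tokens, i):
    """Per-token features; real systems add lemmas, POS tags and kernels."""
    return {"word": tokens[i].lower(),
            "prev": tokens[i - 1].lower() if i > 0 else "<s>",
            "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>"}

# One training sequence: (features, tag) pairs over the whole sentence.
training_example = [(token_features(tokens, i), tag)
                    for i, tag in enumerate(tags)]
# SVM-hmm learns emission and transition weights, so that Viterbi decoding
# over an unseen sentence recovers its most likely tag sequence.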
Another aspect our research wants to explore is the use of dialogical mechanisms to improve the grounding process. Extracting the meaning of a sentence may be insufficient to correctly ground the linguistic references: these can refer to objects or positions in the real world as well as to entities in the abstract domain of the discourse (e.g. anaphoric references). Providing the robot with a more complex level of interaction, such as the ability to ask for clarifications about ambiguous expressions, can improve its grounding capability. Similarly, the system could use dialogue to learn user-specific linguistic references, such as new terms, particular ways of naming objects, or new syntactic forms. This dialogue would be exploited to update the general knowledge of the robot, adding new concepts to a knowledge base or feeding the ASR grammars with new rules. In our robotic platform, we started modeling the dialogue with Petri Net Plans, which can drive the overall behavior of the robot by managing the interaction among all the modules, including the NLU chain. The integrated representation of dialogue and robot actions is another issue that we intend to address in our research.

We are aware that this proposal is the starting point of an analysis that will be wider and longer. Among all the aspects that should be examined and modeled in an HRI system, we took into account only the two (i.e. actions and spatial references) that we considered fundamental to our ends. Future research might investigate the study of temporal relations expressed in natural language. In parallel, we want to investigate and foster the reuse of robust NLP solutions in a field where single aspects of the problem have been explored without converging to a common point.

References

1. Altun, Y., Tsochantaridis, I., Hofmann, T.: Hidden Markov support vector machines. In: Proceedings of ICML (2003)
2. Basili, R., Bastianelli, E., Castellucci, G., Nardi, D., Perera, V.: Kernel-based discriminative re-ranking for spoken command understanding in HRI. In: Proceedings of AI*IA '13 (2013, to appear)
3. Bastianelli, E., Croce, D., Nardi, D., Basili, R.: UNITOR-HMM-TK: Structured kernel-based learning for spatial role labeling. In: Proceedings of SemEval-2013. Atlanta, Georgia, USA (June 2013)
4. Bos, J.: Compilation of unification grammars with compositional semantics to speech recognition packages. In: COLING (2002), http://dblp.uni-trier.de/db/conf/coling/coling2002.html#Bos02
5. Bos, J., Oka, T.: A spoken language interface with a mobile robot. Artificial Life and Robotics 11(1), 42–47 (2007)
6. Chen, D.L., Mooney, R.J.: Learning to interpret natural language navigation instructions from observations. In: Proceedings of AAAI '11. pp. 859–865 (2011)
7. Croce, D., Castellucci, G., Bastianelli, E.: Structured learning for semantic role labeling. Intelligenza Artificiale 6(2), 163–176 (2012)
8. Croce, D., Moschitti, A., Basili, R.: Structured lexical similarity via convolution kernels on dependency trees. In: EMNLP. pp. 1034–1046 (2011)
9. Curran, J., Clark, S., Bos, J.: Linguistically motivated large-scale NLP with C&C and Boxer. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions. pp. 33–36. Association for Computational Linguistics, Prague, Czech Republic (June 2007), http://www.aclweb.org/anthology/P07-2009
10. Fillmore, C.J.: Frames and the semantics of understanding. Quaderni di Semantica 6(2), 222–254 (1985)
11. Hemachandra, S., Kollar, T., Roy, N., Teller, S.: Following and interpreting narrated guided tours. In: Proceedings of ICRA '11. Shanghai, China (2011)
12. Joachims, T., Finley, T., Yu, C.N.: Cutting-plane training of structural SVMs. Machine Learning 77(1), 27–59 (2009)
13. Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of ACL '03. pp. 423–430 (2003)
14. Kollar, T., Tellex, S., Roy, D., Roy, N.: Toward understanding natural language directions. In: Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI '10). pp. 259–266. IEEE Press, Piscataway, NJ, USA (2010)
15. Kordjamshidi, P., Van Otterlo, M., Moens, M.F.: Spatial role labeling: Towards extraction of spatial relations from natural language. ACM Transactions on Speech and Language Processing 8(3), 4:1–4:36 (Dec 2011), http://doi.acm.org/10.1145/2050104.2050105
16. Kruijff, G.J.M.: A Categorial-Modal Logical Architecture of Informativity: Dependency Grammar Logic & Information Structure. Ph.D. thesis, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic (April 2001)
17. Kruijff, G.J.M., Zender, H., Jensfelt, P., Christensen, H.I.: Situated dialogue and spatial organization: What, where... and why? International Journal of Advanced Robotic Systems 4(2) (2007)
18. MacMahon, M., Stankiewicz, B., Kuipers, B.: Walk the talk: Connecting language, knowledge, and action in route instructions. In: Proceedings of AAAI '06. pp. 1475–1482. AAAI Press (2006)
19. Palmer, M., Gildea, D., Xue, N.: Semantic Role Labeling. Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers (2010)
20. Scheutz, M., Cantrell, R., Schermerhorn, P.: Toward humanlike task-based dialogue processing for human robot interaction. AI Magazine 32(4), 77–84 (2011)
21. Tellex, S., Kollar, T., Dickerson, S., Walter, M.R., Banerjee, A.G., Teller, S., Roy, N.: Approaching the symbol grounding problem with probabilistic graphical models. AI Magazine 32(4), 64–76 (2011)
22. Vogel, A., Jurafsky, D.: Learning to follow navigational directions. In: Proceedings of ACL '10. pp. 806–814. Association for Computational Linguistics, Stroudsburg, PA, USA (2010)
23. Walker, W., Lamere, P., Kwok, P., Raj, B., Singh, R., Gouvea, E., Wolf, P., Woelfel, J.: Sphinx-4: A flexible open source framework for speech recognition (2004)
24. Zlatev, J.: Spatial semantics. In: Handbook of Cognitive Linguistics, pp. 318–350 (2007)