Workshop on Multimodal Semantics for Robotic Systems (MuSRobS), IEEE/RSJ International Conference on Intelligent Robots and Systems 2015

Connecting natural language to task demonstrations and low-level control of industrial robots

Maj Stenmark, Jacek Malec
Dept. of Computer Science, Lund University, Sweden

Abstract— Industrial robotics is a complex domain, not easily amenable to formalization using semantic technologies. It involves such disparate aspects of the real world as geometry, dynamics, constraint satisfaction, planning and scheduling, real-time control, robot-robot and human-robot communication and, finally, the intentions of the robot user. Representing such different kinds of knowledge is a challenge, and research on combining these topics is only in its infancy. This paper describes our attempts to combine descriptions of robot tasks using natural language with their realizations on robot hardware involving force sensing, ultimately leading to the potential of learning new robot skills employing force-based assembly. We believe it is a novel approach, opening possibilities of semantic anchoring for learning from demonstration.

I. INTRODUCTION

Recent developments in robotics, artificial intelligence and cognitive science have led to bold predictions about the soon-to-come robotization of all aspects of human life. Robots will help the elderly, perform mundane jobs no one wants, drive our cars, fill our refrigerators when needed, tirelessly rehabilitate patients in need of physical exercise, fight our wars, become our sex partners, etc., etc. Some even draw the conclusion that robots will take over Earth and turn humans into obsolete pets.

However, when observing the development of the robotics field, we realize that this perspective is rather far, far away. Service robots are clumsy and unskilled, no one trusts a robotized car, and production still relies on simple manipulators programmed in a classical manner by skilled engineers. Any attempt to instruct a robot to perform a concrete manufacturing task consists of person-weeks of work by skilled system integrator engineers, who take into account the geometrical layout of the workcell and all the objects involved, including their geometry, physical properties and, last but not least, the purpose of the task. It is the implicit knowledge that needs to be transferred into robot code which makes this task so complex.

The use of semantic technologies has been advocated for at least a decade. Unfortunately, industrial robotics is a complex domain, not easily amenable to formalization. It involves such disparate aspects of the real world as geometry, dynamics (including forceful interaction with the work objects), constraints, planning, scheduling, optimization, real-time control, robot-robot and human-robot communication and, finally, the intentions of the robot user. Representing such different kinds of knowledge is a challenge, and research on combining these topics is only in its infancy.

In particular, we have devoted recent years to understanding and describing robotic assembly, including force-based operations (snap, drill, press, etc.), using a machine-readable formalism expressing the semantics of possible robot actions. Without it there will be no possibility of creating meaningful reasoning leading from the task specification (what needs to be manufactured) to task synthesis (how this can be achieved using the available robot skills) and robust execution by the synthesized code for a particular architecture, not to mention swift error handling in case of unexpected problems, portability of a robot skill from one robot to another, and learning of new skills.

In previous research, we have focused our attention on two areas: interaction between the user and the robotic system, preferably on the user's conditions, e.g., using natural language [1], [2], and representation of force-controlled assembly operations, particularly problematic due to the inherent mix of continuous and discrete aspects [3]. Besides being able to talk with the robot about a force-controlled assembly operation, we would like it to be learnt automatically from a demonstration and represented semantically in a manner enabling portability among different robots.

So far, these kinds of systems have been developed only in research laboratories. Our own research is done in the context of several EU projects, in particular ROSETTA, PRACE and SMErobotics, aiming at developing intelligent interactive systems suitable for inexperienced users, such as SMEs. Before they reach the factory floor, though, they need to be filled with sufficient production knowledge to become useful. Knowledge acquisition is a bottleneck in developing practical systems, as it can only happen while the system is used, but the system won't be useful before it is done: a classical chicken-and-egg problem. Therefore the only viable solution is a learning system, capable of sharing its experiences by storing them in a (possibly cloud-based) knowledge base [4] and using the experiences of other robots by importing and adapting their skills. However, such a solution requires a common understanding of the contents of this knowledge base, thus a commonly agreed-upon semantics.

The work on standardization of the robotics domain is already quite well advanced. There exist ontologies for specific domains, like service robotics or surgical robotics, and a core ontology for robotics and automation (CORA) recently standardized by IEEE [5]. However, they introduce concepts in symbolic form without properly connecting them to all their denotations, e.g., robot programs instantiating the skills named in these ontologies.
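The gap just described, symbolic concepts without concrete denotations, can be made tangible with a minimal sketch. The code below is purely illustrative (all names, fields and the tiny state machine are our hypothetical examples, not part of CORA or any standard): a symbolic skill concept only becomes useful once it is bound to denotations in several modalities, such as a natural-language description and a transition system.

```python
from dataclasses import dataclass, field

@dataclass
class SkillConcept:
    """A symbolic skill name bound to denotations in several modalities.

    All field names are illustrative, not taken from any published ontology."""
    name: str
    nl_description: str = ""                          # natural-language modality
    transitions: dict = field(default_factory=dict)   # state-machine modality
    robot_code: str = ""                              # executable modality

# A tiny "ontology": a concept alone carries no denotations ...
ontology = {"SnapFit": SkillConcept(name="SnapFit")}

# ... until it is anchored in concrete representations.
ontology["SnapFit"].nl_description = "Press the switch into the box until it snaps."
ontology["SnapFit"].transitions = {
    ("approach", "contact_force_z"): "search_y",
    ("search_y", "contact_force_y"): "search_x",
    ("search_x", "snap_detected"): "done",
}

def grounded(concept: SkillConcept) -> bool:
    """A concept counts as grounded if at least one non-symbolic denotation exists."""
    return bool(concept.nl_description or concept.transitions or concept.robot_code)

print(grounded(ontology["SnapFit"]))
```

A reasoner operating over such a structure can switch between the modalities of one concept, which is the role the multiple representations play in the system described below.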
Our work addresses this problem by providing concrete denotations belonging to several modalities. As we have mentioned before, we describe robot actions using natural language, using assembly graphs, using transition systems, using the iTaSC formalism, and using the actual robot code. These multiple modalities co-exist in one system, letting the reasoner switch between representations when such a need arises.

Learning from demonstration leads to new problems in semantic anchoring of robot actions, as there is no obvious, apparent meaning in robot movements. Semantics may be either guessed, derived by inductive reasoning, or attributed post factum by humans via some form of annotation. In particular, force-based assembly is problematic, as quite often the difference between success and failure depends on a particular profile of the force signal. So far, this issue has been approached using sensor fusion techniques, without direct support from semantics. Our work attempts to remedy this situation, introducing natural language into the picture and letting assembly be not only detected via sensor readings, but also simultaneously told about.

II. RELATED WORK

In the domain of service robotics, there are some interesting frameworks for representing household tasks and environments. KnowRob [6] is a knowledge processing system that combines declarative and procedural knowledge from multiple sources, e.g., the RoboEarth [7] database and web sites. A similar project is RoboHow [8], which developed the knowledge-based reasoning service OpenEASE [9] and attempts to bridge the gap from symbolic planning to constraint-based control [10]. Ontologies for kit-building applications for industrial robots have been developed by Balakirsky et al. [11], and Carbonera et al. [12] developed an ontology for positions. We have already mentioned the standardization work of the IEEE Working Group ORA [5].

We are interested in integrating low-level statistical task representations taken from demonstrations. Such tasks can be represented by a trajectory or a force profile. The trajectories can be extracted from the demonstration by first applying segmentation algorithms and then parameterizing each segment as a trajectory. Niekum et al. [13], [14] use Beta Process Autoregressive Hidden Markov Models from Fox et al. [15] to automatically segment demonstrations and dynamic movement primitives (DMPs) [16] to represent the trajectories. Since the statistical properties of semantically different sub-tasks can be similar, they use predecessor states to refine the classification and determine the transitions in a finite state machine. Other learning methods are, for example, reinforcement learning, used by Metzen et al. [17] to learn skill templates, and Iterative Learning Control, used by Nemec et al. [18] to follow demonstrated force profiles.

One way to annotate objects and actions is to describe them using natural language. Matuszek et al. [19], Kollar et al. [20] and Landsiedel et al. [21] use natural language to describe routes, and Walter et al. [22] use language descriptions to semantically annotate maps. She et al. [23] study the dialogue system, while Cakmak [24] evaluates methods for teaching operators how to interact with a robot using kinesthetic teaching and dialogue.

Please note that our understanding of the term multimodal semantics differs from the one quite commonly encountered in the literature, see e.g. [25], where the authors aim at finding the meaning of a particular text fragment using a statistical approach grounded both in text and image corpora. However, there is no attempt to use this semantics in the reverse direction, to generate new utterances (that our robot programs would correspond to).

III. CURRENT WORK

The focus of our current work is to semantically annotate task demonstrations to enable reuse and reasoning. This involves annotating log data with quantities, units, and task states. The logs can then be used to identify force/torque and position constraints and application-specific parameter values (positions, velocities, stiffness, etc.). The demonstrations are used to segment the task into different sub-skills and to extract parameters for each skill. One approach is to describe the trajectory of the sub-skills using DMPs and then parameterize the primitives and describe them with, for example, skill type, preconditions, and postconditions.

As an example, when demonstrating picking and placing of an object, the task can be segmented into different sub-skills. First the robot approaches the object, opens the gripper, moves into the pick position, closes the gripper, retracts from the surface, moves to the place position (perhaps using via positions as well), positions the object correctly, releases it and retracts. Each segment can be described using a trajectory (e.g., a DMP) in some reference frame together with a gripper state. Multiple demonstrations can be used for each sub-skill in order to detect, for example, the relevant reference frame and the allowed gripping poses. To enable reuse, we are working on annotating the segments with allowed initial positions and gripper state, a skill type and postconditions. This will allow the planner to add required actions before or after the skill and to add error-handling procedures to the task (e.g., if the robot drops the object when transporting it, the object should be localized and picked up again). The skill also has to be parameterized so that it can be initialized correctly, for example by specifying the controller, reference frames, and velocity values.

Another example is force-controlled assembly. Here the force data is not used for sensor fusion; it is used to control the motions of the robot, and it signals failure or success of the assembly. In a snap-fit assembly skill, where two plastic pieces, a switch and a box shown in Fig. 1, are "snapped" together, the force signature indicates whether the snap occurred or not. In previous work [1] such a task could be expressed using the force constraint directly in guarded motions. Using a graphical user interface, primitive actions and skills could be combined into a sequence. An example is shown in Fig. 2. In the sequence, the box is first picked and placed on a fixture using three search motions. The first motion moves the robot down until it feels contact forces in the z-direction; then, while pressing down, it searches in the y-direction until contact with the wall; finally, it searches in the x-direction while simultaneously pressing down and towards the wall. In the sequence, pickbox, movetofixt, pickswitch and retract are position-based motions running on the native robot controller. The snapFitSkill is a reused skill, which in turn contains multiple guarded searches. From the graphical representation, the skill specification can be exported to an XML format (see the excerpt in Fig. 3) and to a runnable format, see Fig. 4. The skills are semantically annotated with sensor and controller type, and the parameters are described with units.

Fig. 1. A part of a box to the left and a switch to the right.

Fig. 2. An example of the user interface.

Fig. 3. The XML representation of the three guarded search motions created in the GUI.

Fig. 4. Part of the executable state machine.

In another assembly, a rectangular metal plate (a shield can) is inserted on a printed circuit board (PCB). The PCB is attached in a fixture, which is attached to a force sensor. The assembly starts by tilting the shield can above the PCB (see Fig. 5) and moving down until the corner touches the board. Then, the robot attaches one corner of the plate to a corner on the PCB and rotates the plate into place. The rotation is first carried out around the xy-plane of the PCB until either the long or the short side of the rectangle touches the PCB; then the last side has to be rotated into place. That is, if the longer side of the rectangle is parallel with the x-axis and a rotation around an xy-vector from the initial tilted position will align it with the PCB, the execution will branch into a rotation around the x-axis until the short side is aligned with the PCB. Otherwise, the rotation will align the short side first, as seen in Fig. 6.

Fig. 6. The short side is aligned with the PCB.

To lower the threshold for the user, we want to use natural language dialogues to describe the demonstration and extend the task. Together with the parameterized demonstrations, this will allow the user to use high-level structures such as loops and if-then-else statements, which are easily described using language but tedious or difficult to describe using demonstrations only. In our current system, the user can instruct the robot using unstructured text or dictate the task using Google dictation tools. An example instruction is displayed in Fig. 7. All parameters have default values, which makes the high-level nominal task easy and fast to generate from text. The programming interface uses language-specific statistical tools to extract the semantics of the sentences, followed by a rule-based mapping to robot skills and world objects. At the moment, English and Swedish [26] are the supported programming languages.
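The guarded search motions described above can be sketched as a simple control loop: step along an axis until the measured contact force exceeds a threshold. The sketch below is our illustrative reconstruction, not the system's actual controller code; the force-sensor and motion interfaces (`read_force`, `move`) are hypothetical, and a 1-D simulated contact stands in for the real workcell.

```python
def guarded_move(step, read_force, move, axis, direction, threshold, max_steps=1000):
    """Step along `axis` until the contact force along that axis exceeds
    `threshold` (a guarded search), or give up after `max_steps` steps."""
    for _ in range(max_steps):
        if abs(read_force(axis)) >= threshold:
            return True          # contact detected: the guard triggered
        move(axis, direction * step)
    return False                 # no contact within max_steps: report failure

# Example with a simulated 1-D world: a rigid surface at z = -10.0 mm.
pos = {"x": 0.0, "y": 0.0, "z": 0.0}
WALL_Z = -10.0

def read_force(axis):
    # Simulated stiff contact: a force appears once the tool reaches the surface.
    if axis == "z" and pos["z"] <= WALL_Z:
        return 5.0  # N
    return 0.0

def move(axis, delta):
    pos[axis] += delta

# Search downwards (negative z) until a 2 N contact force is felt.
ok = guarded_move(step=0.5, read_force=read_force, move=move,
                  axis="z", direction=-1, threshold=2.0)
print(ok, pos["z"])
```

Chaining three such calls (z, then y, then x, with the pressing directions held by the force controller) mirrors the three search motions of the box-placing sequence; in the real system the loop runs inside the external force controller rather than at this level.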
The guarded searches shown in Fig. 4 run on an external force controller (ExtCtrl), which has to be turned on and off around the execution of the force-controlled skills.

Fig. 5. In the initial position, the shield can is tilted above the PCB.

Fig. 7. The user can instruct the robot using unstructured natural language commands.

IV. CONCLUSIONS

The immediate future work involves investigating how to teach pre- and postconditions for skills learned from demonstration, to enable online reasoning. These conditions need to be anchored in sensor readings. Inductive inference is one possibility; another is to use mixed-initiative dialogue with the user, asking for guidance or confirmation; yet another is to introduce an annotation tool to be used simultaneously with the learning procedure.

It is desirable to have natural language support at all levels of the system. At the moment, we only support task instruction, but we also want to be able to describe the world and connect the perceived objects and situations to (new) semantic symbols, e.g., saying "This is a nut" after teaching the camera system to recognize an object, or describing a pallet as "empty". At the moment, the robot is a passive participant in the dialogue, only reacting to commands from the human. When interacting with non-expert users, the robot should ask questions and come with suggestions on what to do.

The next step is to introduce the possibility of extending the robot's knowledge by adding new concepts to the semantic hierarchy. This is a more complex task than the previous one, as it involves inducing relations with existing concepts and properly placing the new symbol in the IsA hierarchy.

Yet another interesting problem is to reason about "synonyms" among robot programs, i.e., syntactically different structures or programs leading to the same effect. A simple example is a "localize and pick" task that may use different kinds of sensors to localize an object, while the goal (of picking the object from its current location) is achieved irrespective of which concrete sensor is used. How do we teach the system that two skills are equivalent in such (or some other) sense? What needs to be told? What kind of reasoning is performed?

Representing knowledge about industrial processes involving semantically-capable robots is a challenge leading to fascinating questions. We are quite sure we will have a lot to do in the years to come.

ACKNOWLEDGMENTS

The research leading to these results has received partial funding from the European Union's Seventh Framework Programme under grant agreement No. 287787 (project SMErobotics) and from the European Union's H2020 programme under grant agreement No. 644938 (project SARAFun).

REFERENCES

[1] M. Stenmark, "Instructing industrial robots using high-level task descriptions," Licentiate thesis, Lund University, Department of Computer Science, Mar. 2015.
[2] M. Stenmark and J. Malec, "Knowledge-based instruction of manipulation tasks for industrial robotics," Robotics and Computer-Integrated Manufacturing, vol. 33, pp. 56-67, 2015. [Online]. Available: http://lup.lub.lu.se/record/4679243/file/4679245.pdf
[3] J. Malec, K. Nilsson, and H. Bruyninckx, "Describing assembly tasks in declarative way," in Proc. IEEE ICRA 2013 Workshop on Semantics, Identification and Control of Robot-Human-Environment Interaction, Karlsruhe, Germany, May 2013, pp. 50-53.
[4] M. Stenmark, J. Malec, K. Nilsson, and A. Robertsson, "On distributed knowledge bases for robotized small-batch assembly," IEEE Transactions on Automation Science and Engineering, vol. 12, no. 2, pp. 519-528, 2015. [Online]. Available: http://dx.doi.org/10.1109/TASE.2015.2408264
[5] "IEEE standard ontologies for robotics and automation," IEEE Standard 1872-2015, 2015.
[6] M. Tenorth and M. Beetz, "KnowRob: A knowledge processing infrastructure for cognition-enabled robots," The International Journal of Robotics Research, vol. 32, no. 5, pp. 566-590, 2013.
[7] M. Tenorth, A. Perzylo, R. Lafrenz, and M. Beetz, "Representation and exchange of knowledge about actions, objects, and environments in the RoboEarth framework," IEEE Transactions on Automation Science and Engineering, vol. 10, no. 3, pp. 643-651, July 2013.
[8] M. Tenorth, G. Bartels, and M. Beetz, "Knowledge-based specification of robot motions," in Proceedings of the European Conference on Artificial Intelligence (ECAI), 2014.
[9] M. Beetz, M. Tenorth, and J. Winkler, "Open-EASE - a knowledge processing service for robots and robotics/AI researchers," in IEEE International Conference on Robotics and Automation (ICRA), Seattle, Washington, USA, 2015.
[10] E. Scioni, G. Borghesan, H. Bruyninckx, and M. Bonfe, "Bridging the gap between discrete symbolic planning and optimization-based robot control," in 2015 IEEE International Conference on Robotics and Automation, 2015.
[11] S. Balakirsky, Z. Kootbally, C. Schlenoff, T. Kramer, and S. Gupta, "An industrial robotic knowledge representation for kit building applications," in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, Oct. 2012, pp. 1365-1370.
[12] J. Carbonera, S. Rama Fiorini, E. Prestes, V. Jorge, M. Abel, R. Madhavan, A. Locoro, P. Goncalves, T. Haidegger, M. Barreto, and C. Schlenoff, "Defining positioning in a core ontology for robotics," in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, Nov. 2013, pp. 1867-1872.
[13] S. Niekum, S. Chitta, A. Barto, B. Marthi, and S. Osentoski, "Incremental semantically grounded learning from demonstration," Berlin, Germany, 2013.
[14] S. Niekum, S. Osentoski, G. Konidaris, S. Chitta, B. Marthi, and A. G. Barto, "Learning grounded finite-state representations from unstructured demonstrations," The International Journal of Robotics Research, vol. 34, no. 2, pp. 131-157, 2015.
[15] E. B. Fox, M. I. Jordan, E. B. Sudderth, and A. S. Willsky, "Sharing features among dynamical systems with beta processes," in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, Eds. Curran Associates, Inc., 2009, pp. 549-557.
[16] A. J. Ijspeert, J. Nakanishi, and S. Schaal, "Learning attractor landscapes for learning motor primitives," in Advances in Neural Information Processing Systems 15 (NIPS 2002), 2002, pp. 1547-1554.
[17] J. Metzen, A. Fabisch, L. Senger, J. de Gea Fernández, and E. Kirchner, "Towards learning of generic skills for robotic manipulation," KI - Künstliche Intelligenz, vol. 28, no. 1, pp. 15-20, 2014.
[18] B. Nemec, F. Abu-Dakka, B. Ridge, A. Ude, J. Jorgensen, T. Savarimuthu, J. Jouffroy, H. Petersen, and N. Kruger, "Transfer of assembly operations to new workpiece poses by adaptation to the desired force profile," in Advanced Robotics (ICAR), 2013 16th International Conference on, Nov. 2013, pp. 1-7.
[19] C. Matuszek, E. Herbst, L. Zettlemoyer, and D. Fox, "Learning to parse natural language commands to a robot control system," in Experimental Robotics, ser. Springer Tracts in Advanced Robotics. Springer International Publishing, 2013, vol. 88, pp. 403-415.
[20] T. Kollar, S. Tellex, D. Roy, and N. Roy, "Grounding verbs of motion in natural language commands to robots," in Experimental Robotics, ser. Springer Tracts in Advanced Robotics. Springer Berlin Heidelberg, 2014, vol. 79, pp. 31-47.
[21] C. Landsiedel, R. de Nijs, K. Kühnlenz, D. Wollherr, and M. Buss, "Route description interpretation on automatically labeled robot maps," in Proceedings of the International Conference on Robotics and Automation (ICRA), Karlsruhe, Germany, May 2013, pp. 2251-2256.
[22] M. R. Walter, S. Hemachandra, B. Homberg, S. Tellex, and S. Teller, "Learning semantic maps from natural language descriptions," in Proceedings of the 2013 Robotics: Science and Systems IX Conference, Berlin, Germany, 2013.
[23] L. She, S. Yang, Y. Cheng, Y. Jia, J. Chai, and N. Xi, "Back to the blocks world: Learning new actions through situated human-robot dialogue," in Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). Philadelphia, PA, USA: Association for Computational Linguistics, 2014, pp. 89-97.
[24] M. Cakmak and L. Takayama, "Teaching people how to teach robots: The effect of instructional materials and dialog design," in International Conference on Human-Robot Interaction (HRI), Bielefeld, Germany, Mar. 2014.
[25] E. Bruni, N. K. Tran, and M. Baroni, "Multimodal distributional semantics," J. Artif. Int. Res., vol. 49, no. 1, pp. 1-47, Jan. 2014.
[26] M. Stenmark, "Bilingual robots: Extracting robot program statements from Swedish natural language instructions," in Proc. of The 13th Scandinavian Conf. on Artificial Intelligence, Halmstad, Sweden, 2015.