Using Image Schemata to Support Autonomous Assembly Tasks

Nikolaos Tsiogkas 1,2
1 Department of Mechanical Engineering, KU Leuven, Celestijnenlaan 300, B-3001 Heverlee (Leuven), Belgium
2 Core Lab ROB, Flanders Make, Gaston Geenslaan 8, 3001 Heverlee, Belgium
The Sixth Image Schema Day (ISD6), March 24–25, 2022, Jönköping University, Sweden
nikolaos.tsiogkas@kuleuven.be, https://www.tsiogkas.me/, ORCID 0000-0003-2842-7316 (N. Tsiogkas)

Abstract
Robots are an important part of the modern manufacturing industry. Newer generations of robots have improved sensing and control capabilities that allow humans and robots to share a workspace. These capabilities also make robots agile, enabling the manufacturing of various goods through skills developed by human experts. Unfortunately, such skills are often hard to reuse and are unable to recover from failures that were not anticipated by the designer, making them harder to deploy in dynamic environments. This work proposes the use of image schemata as skill definition primitives. Such a representation allows skills to be composable and easily reusable across domains through analogies. In addition, image-schema-based representations add a level of understanding of the skill performance and the environment, which enables recovery from unexpected errors during execution. Finally, this extended understanding can be used to explain the actions of the robot.

Keywords
Robotic assembly, Skill representation, Skill transfer

1. Introduction
Robots play an ever-increasing role in the manufacturing industry. The first generation of industrial robots consisted of static manipulators, manually preprogrammed to perform a specific task, which they would execute blindly. This prevented humans from coexisting with the robots, as it was dangerous [1]. The emergence of collaborative robots allowed humans to share the workspace with a robot, as the two can safely operate around each other, enabling closer collaboration [2]. A further improvement came with better sensing, planning, and control methods, which allow robots to perform more than one task and thus enable agile manufacturing. Multiple tasks can then be executed using a set of skills preprogrammed by a human. A skill consists of a set of actions that a robot must execute, whose correct execution is monitored using the sensors. For example, a peg-in-a-hole skill [3] is composed of an action picking the peg, an action leading the peg to the hole, and finally an action inserting the peg into the hole. The combination of such skills can lead to complex behaviours.
Despite the greater agility provided by the use of skills and better sensing, multiple challenging problems remain. Given that skills are developed by a human expert, the level of reusability and composability of each skill depends on the specific design decisions made when the skill is created. For example, if a skill makes heavy use of robot-specific primitives, such as specific joint or payload limits, it will be hard to use this skill with a different robot.
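To make the notion of a skill concrete, the following is a minimal, purely illustrative Python sketch, not taken from any of the cited frameworks, of a skill as an ordered set of monitored actions, using the peg-in-a-hole example. The names and the toy world dictionary are hypothetical stand-ins for real perception and control code.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Toy world state standing in for real sensor feedback (illustrative only).
world = {"holding_peg": False, "peg_at_hole": False, "peg_inserted": False}

@dataclass
class Action:
    """One step of a skill: a command plus a sensor-based success monitor."""
    name: str
    execute: Callable[[], None]
    succeeded: Callable[[], bool]

@dataclass
class Skill:
    """A skill is an ordered set of actions whose execution is monitored."""
    name: str
    actions: List[Action] = field(default_factory=list)

    def run(self) -> bool:
        for action in self.actions:
            action.execute()
            if not action.succeeded():
                # The monitors only cover outcomes the designer anticipated;
                # anything else makes the skill fail with no recovery.
                print(f"{self.name}: step '{action.name}' failed")
                return False
        return True

# The three actions of the peg-in-a-hole example, as trivial state updates.
def pick_peg():     world["holding_peg"] = True
def move_to_hole(): world["peg_at_hole"] = world["holding_peg"]
def insert_peg():   world["peg_inserted"] = world["peg_at_hole"]

peg_in_hole = Skill("peg_in_hole", [
    Action("pick",     pick_peg,     lambda: world["holding_peg"]),
    Action("approach", move_to_hole, lambda: world["peg_at_hole"]),
    Action("insert",   insert_peg,   lambda: world["peg_inserted"]),
])

print(peg_in_hole.run())  # True in this toy run
```

In such a hand-written skill, the actions and monitors are tied to one robot and one workpiece, which is exactly the reusability limitation discussed above.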
In addition, a skill developed to solve a specific problem can, most of the time, not be used to solve an analogous problem in a different domain. For example, the aforementioned peg-in-a-hole skill is, at a high level, analogous to inserting an item in a box, as used in kitting applications, but it cannot be reused for that purpose. This is caused mainly by the large amount of task-specific geometric information included in the skill design, as well as by hard-coded dependencies on specific sensing and monitoring capabilities of the robot. Another problem is that skills cannot react to changes in the environment unless they are explicitly coded to anticipate those changes. This means that an unexpected event can cause the skill to fail without the robot being able to recover. To cope with unexpected events, the robot would require a higher level of understanding of the situation. Such understanding is not encoded in state-of-the-art approaches for skill definition. Moreover, current approaches use various "magic numbers" that need to be hand-tuned for specific applications or robots. A higher level of understanding of the situation would allow the robot to automatically learn and configure all the relevant parameters for the task at hand. Finally, this higher level of understanding can be used by the robot to explain its behaviour.
To overcome such limitations, we believe that image schemata [4] can be used as primitives to design a set of basic robotic skills that are easy to compose into complex behaviours. Such an approach would allow the skills, and their compositions, to be easily transferred across domains through analogies, an inherent property of image schemata. In addition, using image schemata allows understanding of the situation to be incorporated into the skills, making them resilient to disturbances and enabling them to recover from failure. Moreover, this understanding can be used to monitor the quality of the performance and to actively try to avoid failures. Finally, explainability can be achieved by using the knowledge and understanding of the situation, along with the intentions of the robot.
The rest of the paper is organized as follows: Section 2 presents the relevant literature regarding robotic skill definition. In section 3 the proposed skill definition approach is presented. Finally, section 4 concludes the work by presenting potential paths for future work.

2. Relevant work
In the robotics literature multiple skill definition approaches can be found. Unfortunately, most of them require a manual implementation, are defined on robot-centric primitives, and do not consider understanding of the skill execution to mitigate and recover from errors. In [5] a skill-based control framework built on the ROS middleware is presented. Skills are defined as primitive actions that the robot has to perform, such as picking an object or placing it at a specific location. The skill definition is done in software that implements the sensing, motion control, and monitoring in a hard-coded fashion. The work of [6] defines a skill as a directed graph where a vertex is a manipulation primitive, consisting of a twist and a feed-forward wrench trajectory, and an edge is a transition between manipulation primitives triggered by a monitored condition. In addition, recovery strategies can be manually specified.
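Purely as an illustration of this graph-based representation, and not the actual implementation of [6], such a skill could be represented roughly as follows; all names and numbers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class ManipulationPrimitive:
    """Graph vertex: a desired twist plus a feed-forward wrench trajectory."""
    name: str
    twist: Tuple[float, ...]               # (vx, vy, vz, wx, wy, wz)
    wrench_traj: List[Tuple[float, ...]]   # feed-forward (fx, fy, fz, tx, ty, tz) samples

@dataclass
class SkillGraph:
    """Directed graph: edges are transitions fired by monitored conditions."""
    vertices: Dict[str, ManipulationPrimitive] = field(default_factory=dict)
    # edges[src] -> list of (condition on sensed data, destination vertex)
    edges: Dict[str, List[Tuple[Callable[[dict], bool], str]]] = field(default_factory=dict)

    def step(self, current: str, sensed: dict) -> str:
        """Move to the first primitive whose transition condition holds, else stay."""
        for condition, dest in self.edges.get(current, []):
            if condition(sensed):
                return dest
        return current

# Illustrative fragment: move down until contact is sensed, then switch to insertion.
graph = SkillGraph()
graph.vertices["approach"] = ManipulationPrimitive("approach", (0, 0, -0.01, 0, 0, 0), [])
graph.vertices["insert"] = ManipulationPrimitive("insert", (0, 0, -0.002, 0, 0, 0),
                                                 [(0.0, 0.0, -5.0, 0.0, 0.0, 0.0)])
graph.edges["approach"] = [(lambda s: s["contact_force_z"] > 2.0, "insert")]

print(graph.step("approach", {"contact_force_z": 3.5}))  # -> "insert"
```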
In [7] a domain-specific language for skill programming is defined. The skills are programmed by domain experts using a set of elemental actions, such as OpenGripper, MoveTo, etc. Once a skill is defined, control code is automatically generated. Another language-based approach for skill definition is presented in [8]. This language supports both code generation for control and symbolic representations of the skills to be used for reasoning and planning. The skill is defined using a set of robot-specific primitives that are translated to a finite state machine for execution. In [9] a domain-specific language is described that allows skills to be defined in a constraint-based manner. In addition, it allows monitors to be specified that can help with task execution, and it provides a reference implementation of a controller that can execute such tasks for a specific robot. In [10] a modelling approach for robot-agnostic reusable skills is presented. It is based on a hierarchical definition of skills, and the skills can be connected to concrete implementations that are hardware specific. The presented movement and grasping primitives require a manual skill definition by an expert, and despite the claim that the skills are reusable across domains, no such reuse is demonstrated.
The work of [11] presents a way in which image schemata can be used as an alternative to planning. There it is shown that image schemata can be used to reason about the environment, about the various alternative ways to perform an action, and about how the robot can use analogies to perform a variety of similar actions in diverse domains. It is highly relevant to the work presented in this paper, as we propose that skills can be described as a set of image schemata, which in turn can be used to automatically generate a controller performing each skill. As skills and actions can be viewed as equivalent, this work agrees on the ability to transfer a skill to multiple domains. In [12] a method that converts qualitative descriptions of scenes to quantitative equivalents is described. Such quantitative descriptions are then used in simulations to understand functional relations between objects in the world and to interpret them in a qualitative way. This is relevant to the work presented here, as understanding the functional relations will enable better monitoring of the skill execution.

3. Proposed approach
To demonstrate the use of image schemata for skill definition, and how a skill can be transferred across domains, two examples of industrial applications will be used. A specific skill that is applicable in both applications will be encoded using image schemata and presented in detail.
The first application is a benchmark assembly application, shown in figures 1a, 1b, and 1c. The aim of this application is to assemble an industrial item that requires components to be placed in a specific partial order. To complete the assembly several skills are needed. For example, to insert the pegs and the shaft in the appropriate places, a peg-in-a-hole skill is needed. In addition, there are precedence constraints that need to be satisfied during the assembly, such as lower parts supporting the placement of the higher parts.

(a) State of the item before... (b) ...during... (c) ...and after the assembly.
Figure 1: The Cranfield assembly task in various stages of execution. This task is used as a benchmark for testing solutions to the automated assembly problem using robots [13].
(a) Kitting base (b) Table feet inserted (c) Complete kit
Figure 2: Various stages of a kitting application. The base supports the other elements of the table; feet and legs are inserted subject to precedence and position constraints.

The second application is a kitting application, shown in figure 2, where the robot has to collect and place a set of items in a package. In this case a table is to be packed in a box, where the top of the table lies on the bottom of the box and the legs are placed on top using some support material. Again, there is a pick-and-place type of skill, which can be viewed as equivalent to the peg-in-a-hole skill. Their main difference is the tolerance and accuracy needed for the execution. Nevertheless, in state-of-the-art skill developments these two skills are not compatible, meaning that one cannot be used to achieve the other. However, if primitive skills were based on image schemata, their composition would be able to handle both cases, needing only a different type of monitoring and configuration.
To demonstrate skill definition using image schemata, a peg-in-a-hole skill will be detailed. The objective of the skill is to insert a peg into its respective hole, as depicted in figure 3. The skill starts by moving the manipulator towards the peg; such a motion can be described by a Source-Path-Goal image schema (3c). Once the manipulator reaches the peg, it is grasped by the end effector, so that the peg is Contained in the end effector. The next step is moving the peg until it touches the surface containing the hole, as described by the Contact image schema (3d). The peg is then moved along the surface until it cannot move further, as it is Blocked by the hole (3e-3f). Finally, the peg is rotated to a vertical position above the hole, expressed by a Verticality constraint, and inserted until a secure Link is achieved (3f-3h).
Specifying a skill as a series of image schemata has the additional benefit of automatically defining what needs to be monitored during the skill execution. These monitors can detect what went wrong while executing the required actions, allowing the robot to recover and continue. For example, if the peg falls out of the end effector while it is moved towards the hole (3e), the Containment constraint will not be satisfied. The robot can then automatically reason that it needs to grasp the peg again, and continue the execution from where it was interrupted.

(a) Initial conditions (b) Initial position of the robot (c) Motion to reach the peg (d) Grasped peg to support (e) Motion towards hole (f) Rotation for insertion (g) Final insertion (h) Releasing peg (i) Inserted peg in hole
Figure 3: Various stages of a peg-in-a-hole skill. The robot needs to grasp a peg and successfully insert it into the respective hole. Initially the robot grasps the peg and moves it towards the surface that has the hole until there is contact (3a-3d). Then it moves the peg towards the hole until the motion is blocked by the peg being partly inserted (3e). Finally, the peg is rotated, inserted, and released (3f-3i).

A similar approach can be used for placing the feet or legs of the table in the kitting application, as seen in figure 2c. By using analogies, the hole can be matched with the leg support portions of the base, and the peg with the leg itself. Then, by simply using the relevant parameters, the leg can be inserted in the correct spots using the same skill.
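The paper does not prescribe a concrete implementation. The following is a hedged Python sketch of how such a skill could be written against image schemata, with each schema supplying both the monitor and the behaviour that (re-)establishes it; all function and field names are illustrative assumptions, and the world is again a toy dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SchemaStep:
    """One image schema in the skill, with a monitor and a recovery derived from it."""
    schema: str                        # e.g. "CONTAINMENT", "CONTACT", "BLOCKAGE", "LINK"
    holds: Callable[[Dict], bool]      # schema-level condition on the (toy) world state
    establish: Callable[[Dict], None]  # action that (re-)establishes the schema

def run_schema_skill(steps: List[SchemaStep], world: Dict) -> None:
    """Execute a skill given as a sequence of image schemata.

    Each step carries its own schema-level monitor, so a violated schema
    (e.g. Containment lost when the peg is dropped) is simply re-established
    before moving on, instead of the whole skill failing."""
    for step in steps:
        while not step.holds(world):
            step.establish(world)

# Toy groundings of the schema-establishing actions as state updates.
def grasp(w):      w["contained"] = True          # peg Contained in the end effector
def to_surface(w): w["contact"] = w.get("contained", False)
def to_hole(w):    w["blocked"] = w.get("contact", False)
def insert(w):     w["linked"] = w.get("blocked", False)

# Peg-in-a-hole expressed against schemata rather than robot-specific primitives
# (Source-Path-Goal and Verticality omitted for brevity). Mapping "peg" to
# "table leg" and "hole" to "leg support" by analogy reuses the same structure
# for the kitting task; only the grounded parameters change.
peg_in_hole = [
    SchemaStep("CONTAINMENT", lambda w: w.get("contained", False), grasp),
    SchemaStep("CONTACT",     lambda w: w.get("contact", False),   to_surface),
    SchemaStep("BLOCKAGE",    lambda w: w.get("blocked", False),   to_hole),
    SchemaStep("LINK",        lambda w: w.get("linked", False),    insert),
]

run_schema_skill(peg_in_hole, {})
```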
4. Conclusion and future work
This work presented an approach to model robotic skills using image schemata as a core modelling primitive. This approach allows skills to be composable and reusable in various tasks and contexts. In addition, it allows the robot to better monitor the execution of each skill, to understand situations that can cause the execution of skills to fail, and to recover from such failures. Using such knowledge, the robot can also provide explanations regarding its actions.
One of the most fruitful directions for future work is connecting these high-level skills to concrete low-level perception and control implementations that can be used by a robot. Since each sensor and manipulator has different capabilities, different software implementations will be required to perform each skill. For that, a way to automatically define and parametrize the software used will be required. This will allow the easy deployment and adoption of such skills, as it will require minimal expert involvement.
Another interesting direction is the use of such skills to explore, interact with, and learn from the environment. As the robot understands basic concepts and can use them in skills to manipulate the environment, it can learn more about new skills, the affordances of the environment, as well as the causality chain of events. One potential approach is to allow the robot to interact with objects, and to detect and try to interpret the outcomes of its actions. A recent work in that direction is presented in [14].
Finally, an important direction for future research is the explainability of the robot's actions. Using image schemata, and the increased understanding of the world they can bring, a robot should be able to explain why it chose a specific skill, or why it performed an action that is part of a skill in a specific way. For example, if the robot executes part of the insertion in the peg-in-a-hole skill using a different angle or a different approach, because something is blocking the "usual" way, it should be able to explain why it chose this alternative. This can increase the trust of users towards the system, and thus the adoption of more robotic systems in the real world.

Acknowledgments
This work was supported by the Flanders Make/VLAIO:SBO MULTIROB project.

References
[1] M. Vasic, A. Billard, Safety issues in human-robot interactions, in: 2013 IEEE International Conference on Robotics and Automation, IEEE, 2013, pp. 197–204.
[2] A. Cherubini, R. Passama, A. Crosnier, A. Lasnier, P. Fraisse, Collaborative manufacturing with physical human–robot interaction, Robotics and Computer-Integrated Manufacturing 40 (2016) 1–13.
[3] S.-k. Yun, Compliant manipulation for peg-in-hole: Is passive compliance a key to learn contact motion?, in: 2008 IEEE International Conference on Robotics and Automation, IEEE, 2008, pp. 1647–1652.
[4] M. M. Hedblom, Image schemas and concept invention: cognitive, logical, and linguistic investigations, Springer Nature, 2020.
[5] F. Rovida, M. Crosby, D. Holz, A. S. Polydoros, B. Großmann, R. Petrick, V. Krüger, SkiROS—a skill-based robot control platform on top of ROS, in: Robot Operating System (ROS), Springer, 2017, pp. 121–160.
[6] L. Johannsmeier, M. Gerchow, S. Haddadin, A framework for robot manipulation: Skill formalism, meta learning and adaptive control, in: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 5844–5850.
[7] U. Thomas, G. Hirzinger, B. Rumpe, C. Schulze, A. Wortmann, A new skill based robot programming language using UML/P statecharts, in: 2013 IEEE International Conference on Robotics and Automation, IEEE, 2013, pp. 461–466.
[8] C. Lesire, D. Doose, C. Grand, Formalization of robot skills with descriptive and operational models, in: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2020, pp. 7227–7232.
[9] E. Aertbeliën, J. De Schutter, eTaSL/eTC: A constraint-based task specification language and robot controller using expression graphs, in: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, 2014, pp. 1540–1546.
[10] S. Profanter, A. Breitkreuz, M. Rickert, A. Knoll, A hardware-agnostic OPC UA skill model for robot manipulators and tools, in: 2019 24th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), IEEE, 2019, pp. 1061–1068.
[11] M. M. Hedblom, M. Pomarlan, R. Porzel, R. Malaka, M. Beetz, Dynamic action selection using image schema-based reasoning for robots, in: Proc. of the Joint Ontology Workshops, 2021.
[12] M. Pomarlan, J. A. Bateman, Embodied functional relations: A formal account combining abstract logical theory with grounding in simulation, in: FOIS, 2020, pp. 155–168.
[13] T. R. Savarimuthu, A. G. Buch, C. Schlette, N. Wantia, J. Roßmann, D. Martínez, G. Alenyà, C. Torras, A. Ude, B. Nemec, et al., Teaching a robot the semantics of assembly tasks, IEEE Transactions on Systems, Man, and Cybernetics: Systems 48 (2017) 670–692.
[14] M. Pomarlan, M. M. Hedblom, R. Porzel, Panta rhei: Curiosity-driven exploration to learn the image-schematic affordances of pouring liquids, in: The 29th Irish Conference on Artificial Intelligence and Cognitive Science 2021, Dublin, Republic of Ireland, December 9-10, 2021, CEUR-WS, 2021, pp. 106–117.