      The 1st International Workshop on the Semantic Descriptor, Semantic Modeling and Mapping for Humanlike
          Perception and Navigation of Mobile Robots toward Large Scale Long-Term Autonomy (SDMM19)




             Towards Explanations of Plan Execution for
                      Human-Robot Teaming

Jiyoun Moon 1 (jiyounmoon@snu.ac.kr), Daniele Magazzeni 2 (daniele.magazzeni@kcl.ac.uk), Michael Cashmore 3 (michael.cashmore@strath.ac.uk), Dorian Buksz 2 (dorian.buksz@kcl.ac.uk), Beom-Hee Lee 1 (bhlee@snu.ac.kr), Yong-Seon Moon 4 (moon@sunchon.ac.kr), Sang-Hyun Roh 5 (rsh@urc.kr)

1 Automation and Systems Research Institute, Department of Electrical and Computer Engineering, Seoul National University
2 King’s College London, London WC2R 2LS
3 University of Strathclyde, Glasgow G1 1XH
4 Department of Electronics Engineering, Sunchon National University
5 REDONE TECHNOLOGIES CO., LTD




                                                      Abstract
                      Human-robot teaming is inevitable in applications ranging from manufacturing
                      to field robotics because of its adaptability and high flexibility. To become
                      an effective team, knowledge regarding plan execution needs to be shared
                      through verbalization. In this respect, semantic scene understanding in natural
                      language is one of the most fundamental components for information sharing
                      between humans and heterogeneous robots, as it lets robots perceive the
                      surrounding environment in a form that both humans and other robots can
                      understand. In this paper, we introduce semantic scene understanding methods
                      for the verbalization of plan execution. We generate sentences and scene
                      graphs, i.e., natural-language-grounded graphs over the detected objects and
                      their relationships, from the graph map produced by a robot mapping algorithm.
                      Experiments were performed to verify the effectiveness of the proposed methods.




1     Introduction
A traditional robotic system can perform simple and repetitive tasks in well-structured environments. However,
the application of robotic systems to various fields such as medicine, manufacturing, and exploration has led to
an increasing demand for highly flexible robots that can work efficiently in uncertain environments, and such
robots have consequently received considerable attention [WZG19]. Combining the capabilities of humans, such
as adaptability, creativity, and intelligence, with the abilities of robots, such as rigidity, endurance, and speed,
can dramatically increase work efficiency [TKL+ 14]. Cooperation between humans and heterogeneous robots
can play an important role in adapting robots to unstructured and dynamic environments [COGM19].
Many algorithms have been developed to address issues such as sensing, perception, planning, control, and
safety in human-robot teaming [MdSB18, ZJSS17]. Among the various elements that need to be considered for
a human-robot system, the most important function is the verbalization of plan execution, which includes scene

Copyright © 2019 by the paper’s authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).






understanding based on natural language as illustrated in Figure 1. This can enable humans and robots to share
information in a form they can both understand, which is the most basic ability required for cooperation.

[Figure 1 overview: a PDDL mission planner, given a PDDL domain and problem, produces a sequence of actions for a motion planner, exchanging knowledge and environment information through a state storage; plan execution by the robot team (AUVs and AGVs) is verbalized through scene graph generation and natural language descriptions such as "A red cylinder on the way".]

Figure 1: Verbalization of plan execution, which comprises interpretability, transparency, explainability, and scene
understanding, plays an important role in human-robot teaming. In this paper, we focus on generating scene
graphs and language descriptions for scene understanding.

   Semantic scene understanding is the process of perceiving environmental information in natural language or
in a form from which semantic meaning can be inferred. In robotics, semantic mapping algorithms, which generate
graphs whose nodes denote the features and positions of detected objects, have been widely studied recently
[ZWS+ 18, BADP17]. The graphs generated by these algorithms are unlike the maps generated by conventional
methods, which consist of points, corners, lines, and planes. However, such semantic graphs are rarely applied to
data sharing between humans and robots, and to be useful for this purpose they need to be expressed in natural
language. Natural-language-based scene understanding has been studied in various forms in computer vision, such
as image captioning [XBK+ 15, KFF15], visual question answering [GGH+ 17], and scene graph generation
[WSW+ 17]. However, these methods are rarely applied to the semantic graph maps that robots use to represent
the environment. Moreover, they do not address the problem of mission planning, where humans and heterogeneous
robots cooperate to achieve a common goal. In this paper, we focus on scene understanding, one fundamental
element of the verbalization of plan execution, by generating scene graphs and language descriptions. A graph
convolutional neural network [DBV16] is employed to generate sentences with attention over graphs, and an
iterative message passing technique [XZCFF17] based on the gated recurrent unit (GRU) is used to generate
scene graphs. We verified the proposed algorithms through experiments.

2     Approach
This section describes two methods of scene understanding using semantic graphs for the verbalization of plan
execution. First, we generate natural-language-grounded scene graphs with detected objects as nodes and their
relationships as edges. Then, language descriptions that describe the overall scene are generated. The details are as follows.

2.1   Semantic graph generation
Semantic scene understanding based on graph maps is widely studied in robotics. However, these graph maps are
rarely used for robotic applications such as mission planning, natural language processing, or plan execution verbalization.






We address the issue of natural-language-based understanding of the surrounding scene for the verbalization
of plan execution using semantic graph maps. In this study, we assume that the graph map of the surrounding
environment is generated in advance using semantic simultaneous localization and mapping (SLAM). To construct
such a graph with semantic SLAM, the features and positions of detected objects are set as nodes. Features of
objects are used for data association in the SLAM front-end, and the position information of objects is utilized for
graph optimization in the SLAM back-end. The generated semantic graph map $G$ is illustrated in Fig. 2. In this
paper, multiple objects in the image are detected using a region proposal network [RHGS15] and encoded into
feature vectors using the convolutional neural network VGGNet [SZ14]. The vector concatenating the image feature
of the $i$-th object and the bounding box of the object is set as the feature vector $f_i^v$ of node $v_i \in \mathcal{V}$.
We set the feature vector representing the union region of two objects as $f_{ij}^e$ of edge $e_{ij} = (v_i, v_j) \in \mathcal{E}$
that connects $v_i$ and $v_j$.


[Figure 2 overview: detected objects and the union regions of object pairs are passed through a CNN for feature extraction, yielding the semantic graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with nodes $v_i$ and edges $e_{ij}$.]

Figure 2: Scene graph generation: Detected objects are encoded as graph node features. The union regions of two
objects are encoded as graph edge features. A convolutional neural network is utilized for feature encoding.
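
As a concrete illustration of this feature construction, the following Python sketch encodes object crops and union regions with a pretrained VGG network. It is a minimal sketch under our own assumptions: the pooled 512-dimensional feature size, the preprocessing, the helper names, and the recent-torchvision `weights` argument are illustrative choices, not the exact pipeline used in the paper.

```python
# Illustrative sketch of node/edge feature construction (Section 2.1).
# Assumes object bounding boxes are already available (e.g. from Faster R-CNN)
# and that `image` is a PIL image; names and sizes are our own choices.
import torch
import torchvision
from torchvision import transforms

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
pool = torch.nn.AdaptiveAvgPool2d(1)
preprocess = transforms.Compose([transforms.ToTensor(),
                                 transforms.Resize((224, 224))])

def encode_region(image, box):
    """Crop a region (x1, y1, x2, y2) and encode it into a 512-d appearance vector."""
    crop = preprocess(image.crop(tuple(int(c) for c in box))).unsqueeze(0)
    with torch.no_grad():
        return pool(vgg(crop)).flatten()

def node_feature(image, box):
    # f_i^v: appearance of object i concatenated with its bounding box coordinates
    return torch.cat([encode_region(image, box),
                      torch.tensor(box, dtype=torch.float)])

def edge_feature(image, box_i, box_j):
    # f_ij^e: appearance of the union region of objects i and j
    union = (min(box_i[0], box_j[0]), min(box_i[1], box_j[1]),
             max(box_i[2], box_j[2]), max(box_i[3], box_j[3]))
    return encode_region(image, union)
```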



2.1.1   Graph inference
We infer the optimal word for each node and edge of the generated semantic graph. Given the node and edge
features $f_i^v$ and $f_{ij}^e$, the graph inference process is as follows.

$$g^{*} = \operatorname*{argmax}_{g} \; \Pr(g \mid f_i^v, f_{ij}^e) \qquad (1)$$

$$\Pr(g \mid I, B_I) = \prod_{i \in \mathcal{V}} \prod_{j \neq i} \Pr(v_i^{class}, v_i^{bbox}, e_{ij} \mid f_i^v, f_{ij}^e) \qquad (2)$$


where $I$ is the input image, $B_I$ is the set of object bounding boxes, and $C$ and $R$ are the sets of object classes
and relationship types, with $v_i^{class} \in C$, $v_i^{bbox} \in \mathbb{R}^4$, and $e_{ij} \in R$. An iterative
message passing model [XZCFF17] is utilized for graph inference. Node message pooling finds words for nodes
from both the inbound and outbound edge states, while edge message pooling finds words for edges from both
the subject and object node states. Through repeated message pooling, we generate scene graphs comprising the
optimal word for each node and edge.
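
A minimal sketch of GRU-based iterative message passing for graph inference is given below. It loosely follows the node/edge message pooling idea of [XZCFF17]; the hidden dimension, the averaging-based pooling, the number of steps, and the class name are our own assumptions rather than the exact model.

```python
# Hedged sketch of GRU-based iterative message passing (Section 2.1.1).
import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    def __init__(self, dim=512, steps=2):
        super().__init__()
        self.node_gru = nn.GRUCell(dim, dim)   # updates node hidden states
        self.edge_gru = nn.GRUCell(dim, dim)   # updates edge hidden states
        self.steps = steps

    def forward(self, node_feats, edge_feats, edges):
        """node_feats: (N, dim), edge_feats: (E, dim), edges: list of (i, j) index pairs."""
        h_node, h_edge = node_feats.clone(), edge_feats.clone()
        for _ in range(self.steps):
            # Node message pooling: average inbound and outbound edge states.
            node_msg = torch.zeros_like(h_node)
            count = torch.zeros(h_node.size(0), 1)
            for k, (i, j) in enumerate(edges):
                node_msg[i] += h_edge[k]; count[i] += 1   # outbound edge of node i
                node_msg[j] += h_edge[k]; count[j] += 1   # inbound edge of node j
            node_msg = node_msg / count.clamp(min=1)
            # Edge message pooling: combine subject and object node states.
            edge_msg = torch.stack([(h_node[i] + h_node[j]) / 2 for i, j in edges])
            h_node = self.node_gru(node_msg, h_node)
            h_edge = self.edge_gru(edge_msg, h_edge)
        # The refined states would be fed to classifiers over C (objects) and R (relations).
        return h_node, h_edge
```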


2.1.2   Language description
We generate a language description for the semantic graph. Conventional methods based on convolutional neural
networks are rarely applicable to graph data, which are irregular and unstructured. We therefore generate sentences
using a graph convolutional neural network defined via spectral graph theory as follows.
$$H^{(l+1)} = \sigma\!\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W_g^{(l)}\right) \qquad (3)$$

                                                                                 P              (l)
where, Â = A + I is an adjacency matrix A with self-connection I. D̂ii =          j Âij and Wg    are a degree
matrix and a trainable variable, respectively. H (l) ∈ RM ×D is output of the l-th layer, where H (0) = X. In
this paper, a graph is encoded as a 1024-dimensional vector with a fully connected layer. Then, a concatenated
vector composed of a graph feature and a word is fed into a recurrent neural network to predict the probabilistic
distribution of words.
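
The following Python sketch puts Eq. (3) and the decoder together: one spectral GCN layer, mean pooling into a 1024-dimensional graph code, and a GRU that consumes the graph code concatenated with word embeddings. The 1024-dimensional code follows the text; the layer widths, ReLU as $\sigma$, mean pooling, and the class name are assumptions.

```python
# Hedged sketch of the spectral GCN layer in Eq. (3) plus a simple RNN decoder.
import torch
import torch.nn as nn

def gcn_layer(H, A, W):
    """One layer: sigma(D^-1/2 (A + I) D^-1/2 H W), with ReLU as sigma."""
    A_hat = A + torch.eye(A.size(0))
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
    D_inv_sqrt = torch.diag(d_inv_sqrt)
    return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

class GraphCaptioner(nn.Module):
    def __init__(self, node_dim, vocab_size, hidden=1024):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(node_dim, 256) * 0.01)   # GCN weights
        self.to_code = nn.Linear(256, hidden)       # graph encoded as a 1024-d vector
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)    # distribution over words per step

    def forward(self, X, A, word_ids):
        H = gcn_layer(X, A, self.W1)                # (N, 256) node representations
        g = self.to_code(H.mean(dim=0))             # pool nodes into the graph code
        w = self.embed(word_ids)                    # (B, T, 1024) word embeddings
        inp = torch.cat([w, g.expand(w.size(0), w.size(1), -1)], dim=-1)
        logits, _ = self.rnn(inp)
        return self.out(logits)                     # (B, T, vocab_size)
```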








[Figure 3(a) shows generated language descriptions such as "a stop sign is on the corner of a city street", "a truck is shown on the side of a road", and "a large white plane on a clear day". Figure 3(b) shows generated scene graphs with relations such as (leaf_1, on, leaf_2) and (light, on, building).]


              Figure 3: Simulation results: (a) language description; (b) scene graph generation


3   Experiment
We generated scene graphs and language descriptions using images of the surrounding environment obtained
by mobile robots performing a surveillance mission in a simulation environment. Ubuntu 16.04, ROS Kinetic,
and Gazebo 7 were used to set up the simulator. Two datasets were utilized for training the neural networks.
The network for language description was trained on the COCO dataset [LMB+ 14], whereas the network for
scene graph generation was trained on the Visual Genome dataset [KZG+ 17]. The COCO dataset consists of
images, object bounding boxes, and captions. The Visual Genome dataset is composed of annotations of object
relationships and object labels. As these datasets provide only images and not graph maps, we constructed
graphs before training the networks; VGGNet [SZ14] was used for graph construction. We set the maximum
number of nodes to 20 in order to cope with graphs of various sizes; when a graph had fewer than 20 nodes,
empty nodes filled with zeros were added.
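
A small sketch of this fixed-size padding is shown below, under the assumption that each graph is represented by a node-feature matrix and an adjacency matrix; the boolean mask distinguishing real nodes from padding is our own addition.

```python
# Sketch of padding graphs to a fixed size of 20 nodes with all-zero "empty" nodes.
import numpy as np

MAX_NODES = 20

def pad_graph(node_feats, adjacency):
    """node_feats: (n, d) array, adjacency: (n, n) array, with n <= MAX_NODES."""
    n, d = node_feats.shape
    padded_feats = np.zeros((MAX_NODES, d), dtype=node_feats.dtype)
    padded_adj = np.zeros((MAX_NODES, MAX_NODES), dtype=adjacency.dtype)
    padded_feats[:n] = node_feats
    padded_adj[:n, :n] = adjacency
    mask = np.zeros(MAX_NODES, dtype=bool)
    mask[:n] = True          # marks real nodes vs. empty padding nodes
    return padded_feats, padded_adj, mask
```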
   Even though the networks were trained with datasets from the real world, the proposed methods successfully
generated language descriptions and scene graphs for the simulated world, as illustrated in Fig. 3. These results
can be utilized for the verbalization of plan execution. Language description can contribute toward recovering
from mission failure, since robot failures are inevitable. For example, assume that a robot has to go to a






certain position and wait for a human to load a package. While it waits, a car blocks the path to the robot so
that the human cannot approach it; consequently, the robot fails its mission. In this case, the robot can inform
the human about the failure by describing the current situation and then move to a new position to complete the
mission. Scene graph generation can contribute toward gathering information about unseen and dynamic
environments in a compact and communicable form. For example, assume that a robot is located in a place that
a human cannot reach; the generated scene graph can then help the human identify where the robot is located.


4   Artificial Intelligence Planning
AI Planning is a branch of AI that aims to provide automation by generating a structure of actions that one
or multiple agents use to transition from an initial state to a desired goal state in a given environment. This is
achieved by creating a model of the environment. The model aims to accurately represent the capabilities of the
agent and the objects present in the environment, their attributes, as well as the relationships between them. In
particular, the model includes an initial state, possible actions that affect the state, as well as the desired goal
condition.
   A planner is used to find one or more plans. A plan is a partially ordered set of actions which, once executed,
are predicted by the model to achieve the goal condition. Typically, planners search through the state space in
order to find one or more action sequences that provide a transition from the initial state to a state in which the
goal condition holds. These forward-search planners (e.g. [CCFL10]) are equipped with various heuristics in order
to find solutions faster than by exploring every state in the state space, thus enabling their use for planning and
replanning online.
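
To make the forward-search idea concrete, here is a toy Python sketch of breadth-first search over a small, invented action model with preconditions and add/delete effects. Real planners such as [CCFL10] use heuristics and far richer action representations; the waypoint domain below is purely illustrative.

```python
# Toy forward state-space search over an invented action model (illustration only).
from collections import deque

ACTIONS = {
    # action name: (preconditions, add effects, delete effects)
    "goto_wp2": ({"at_wp1"}, {"at_wp2"}, {"at_wp1"}),
    "goto_wp3": ({"at_wp2"}, {"at_wp3"}, {"at_wp2"}),
    "load":     ({"at_wp3"}, {"loaded"}, set()),
}

def plan(initial, goal):
    """Breadth-first search from the initial state to any state satisfying the goal."""
    frontier = deque([(frozenset(initial), [])])
    visited = {frozenset(initial)}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:
            return steps
        for name, (pre, add, delete) in ACTIONS.items():
            if pre <= state:                       # action is applicable
                nxt = frozenset((state - delete) | add)
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, steps + [name]))
    return None

print(plan({"at_wp1"}, {"loaded"}))   # -> ['goto_wp2', 'goto_wp3', 'load']
```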


5   Planning and Plan Execution
Task planning for robots means planning with incomplete and unreliable data. Observations can be made from
sensors in order to update the model used for planning and execution through state estimation. An up-to-date
model for planning reduces the risk of plan failure and makes it possible to identify earlier when a plan under
execution is no longer valid. Even so, it is likely that plans will fail during execution, and in such cases it is
critical that the robotic agent is able to explain to the operator exactly why.
   The work presented in this paper can be usefully integrated with task planning in two main ways. First, the
generated scene graph can be used to update the model with new objects and relationships: relations in the
scene graph can be used to update the (spatial) predicates that describe the current state in the planner’s model.
Second, verbalization of the scene graph enhances descriptions of the state that can be used to explain why the
plan has failed. If a location has become unreachable because of an obstruction, a verbalization of the scene
graph, such as the examples in Fig. 3, can be given to an operator as an explanation of plan failure. This allows
the operator to understand how the environment differs from what was expected, and what to do next. In
this section we discuss future work in this direction.
   A team of robots can be controlled using the ROSPlan [CFL+ 15] framework for task planning in ROS. The
scene graph will be integrated with ROSPlan to perform continuous updates to the current state through the
ROSPlan sensor interface. This can automatically translate generated relations such as light on building into
predicates of the planning model. This integration has two main advantages: first, the spatial relations in the
planner’s model are kept up-to-date, which is necessary if the robot operates within a dynamic environment.
Second, newly detected objects can be immediately described in terms of their position and relation to other
objects, which is a necessary step for the planner to understand how they can be used in a plan, or what effect
they might have on the state.
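
As an illustration of such an integration, the sketch below turns scene-graph triples into PDDL-style facts that could be pushed into a planner's current state. The function name, the fact format, and the example triples are our own assumptions; this is not ROSPlan's actual sensor-interface API.

```python
# Hypothetical mapping from scene-graph triples to PDDL-style facts; the fact
# format and helper name are illustrative, not part of ROSPlan's real interface.
def triples_to_predicates(triples):
    """Map (subject, relation, object) triples to facts such as (on light building)."""
    return ["({} {} {})".format(rel, subj, obj) for subj, rel, obj in triples]

scene_graph = [("light", "on", "building"), ("leaf_1", "on", "leaf_2")]  # cf. Fig. 3(b)
for fact in triples_to_predicates(scene_graph):
    print(fact)  # these facts would be added to the planner's current state
```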
   Plan execution on board the robots will be extended to include verbalization describing the plan under
execution. This will be done by integrating the verbalization component with the plan execution components of
ROSPlan in the following two ways: first to provide verbalization of updates to the current state, and second
to provide verbalization of obstructions that prevent the robot from achieving its goal. In human-robot teaming
scenarios, it is important that the human operator is given sufficient situational awareness to judge the state of
the plan. By verbalizing the updates to the planner’s model, an operator does not have to be an expert in the
language of the domain model to understand what the robot is sensing. In addition, by verbalizing the reason for
plan failure, the operator can quickly understand which unexpected event or object has resulted in the failure
of the plan.






6   Conclusion
Verbalization of plan execution is the most fundamental component of human-robot collaboration in that it
allows information to be shared in an interpretable form toward a shared goal. In this paper, two methods of
semantic scene understanding are proposed for the verbalization of plan execution. A graph convolutional neural
network and iterative message passing are utilized to generate language descriptions and scene graphs, respectively.
The proposed methods were successfully verified in simulation.

Acknowledgement
This work was supported in part by Korea Evaluation Institute of Industrial Technology (KEIT) funded by
the Ministry of Trade, Industry & Energy (MOTIE) (No. 1415162366 and No. 141562820) and in part by a
Bio–Mimetic Robot Research Center funded by Defense Acquisition Program Administration, and by Agency
for Defense Development (UD190018ID).

References
[BADP17] Sean L Bowman, Nikolay Atanasov, Kostas Daniilidis, and George J Pappas. Probabilistic data asso-
         ciation for semantic SLAM. In Robotics and Automation (ICRA), 2017 IEEE International Conference
         on, pages 1722–1729. IEEE, 2017.
[CCFL10]    Amanda Coles, Andrew Coles, Maria Fox, and Derek Long. Forward-chaining partial-order planning.
            In ICAPS, pages 42–49, 2010.
[CFL+ 15]   M. Cashmore, M. Fox, D. Long, D. Magazzeni, B. Ridder, A. Carrera, N. Palomeras, N. Hurtós,
            and M. Carreras. ROSPlan: Planning in the Robot Operating System. In Proceedings of the International
            Conference on Automated Planning and Scheduling, ICAPS, 2015.
[COGM19] Micah Corah, Cormac O’Meadhra, Kshitij Goel, and Nathan Michael. Communication-efficient plan-
         ning and mapping for multi-robot exploration in large environments. IEEE Robotics and Automation
         Letters, 4(2):1715–1721, 2019.
[DBV16]     Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on
            graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems,
            pages 3844–3852, 2016.
[GGH+ 17] Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin,
          and Li Deng. Semantic compositional networks for visual captioning. In Proceedings of the IEEE
          conference on computer vision and pattern recognition, pages 5630–5639, 2017.
[KFF15]     Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions.
            In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137,
            2015.
[KZG+ 17] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie
          Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language
          and vision using crowdsourced dense image annotations. International Journal of Computer Vision,
          123(1):32–73, 2017.
[LMB+ 14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr
          Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European conference
          on computer vision, pages 740–755. Springer, 2014.
[MdSB18]    Dannilo Samuel Silva Miranda, Luiz Edival de Souza, and Guilherme Sousa Bastos. A ROSPlan-
            based multi-robot navigation system. In 2018 Latin American Robotic Symposium, 2018 Brazilian
            Symposium on Robotics (SBR) and 2018 Workshop on Robotics in Education (WRE), pages 248–253.
            IEEE, 2018.
[RHGS15] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object
         detection with region proposal networks. In Advances in neural information processing systems,
         pages 91–99, 2015.






[SZ14]     Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
           recognition. arXiv preprint arXiv:1409.1556, 2014.

[TKL+ 14] Stefanie Tellex, Ross A Knepper, Adrian Li, Daniela Rus, and Nicholas Roy. Asking for help using
          inverse semantics. In Robotics: Science and systems, volume 2, 2014.
[WSW+ 17] Qi Wu, Chunhua Shen, Peng Wang, Anthony Dick, and Anton van den Hengel. Image captioning
          and visual question answering based on attributes and external knowledge. IEEE transactions on
          pattern analysis and machine intelligence, 40(6):1367–1381, 2017.

[WZG19]    Nana Wang, Yi Zeng, and Jie Geng. A brief review on safety strategies of physical human-robot
           interaction. In ITM Web of Conferences, volume 25, pages 1–3. EDP Sciences, 2019.
[XBK+ 15] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich
          Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual
          attention. In International conference on machine learning, pages 2048–2057, 2015.

[XZCFF17] Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. Scene graph generation by iterative mes-
          sage passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
          pages 5410–5419, 2017.
[ZJSS17]   Shiqi Zhang, Yuqian Jiang, Guni Sharon, and Peter Stone. Multirobot symbolic planning under
           temporal uncertainty. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent
           Systems, pages 501–510. International Foundation for Autonomous Agents and Multiagent Systems,
           2017.
[ZWS+ 18] Liang Zhang, Leqi Wei, Peiyi Shen, Wei Wei, Guangming Zhu, and Juan Song. Semantic SLAM based
          on object detection and improved OctoMap. IEEE Access, 6:75545–75559, 2018.



