Commonsense Reasoning for Identifying and Understanding the Implicit Need of Help and Synthesizing Assistive Actions

Maëlic Neau1,2,3, Paulo Santos2, Anne-Gwenn Bosser1, Nathan Beu3 and Cédric Buche1,3
1 Lab-STICC/ENIB, France
2 College of Science and Engineering, Flinders University of South Australia, 1284 South Rd, Clovelly Park SA 5042, Australia
3 CNRS, International Research Lab "CROSSING", Adelaide, Australia

Abstract
Human-Robot Interaction (HRI) is an emerging subfield of service robotics. While most existing approaches rely on explicit signals (i.e. voice, gesture) to engage, the current literature lacks solutions that address implicit user needs. In this paper, we present an architecture to (a) detect a user's implicit need of help and (b) generate a set of assistive actions without prior learning. Task (a) will be performed using state-of-the-art solutions for Scene Graph Generation coupled with commonsense knowledge, whereas task (b) will be performed using additional commonsense knowledge as well as a sentiment analysis on the graph structure. Finally, we propose an evaluation of our solution using established benchmarks (e.g. the ActionGenome dataset) along with human experiments. The main motivation of our approach is the embedding of the perception-decision-action loop in a single architecture.

Keywords
Commonsense Reasoning, Knowledge Graph, Vision-to-Language, Cognitive Robotics

1. Introduction
Detecting and understanding users' intentions and needs is the fundamental backbone of service robotics. This question relates to how high-level, abstract concepts can be inferred from raw sensor data (an issue intimately related to the symbol grounding problem) [1]. Traditional approaches to this problem in robotics use explicit signals from the user such as voice [2], gesture [3] or even touch [4]. However, the deployment of service robots in assisting activities of daily life (ADL), especially for impaired or elderly people, is leading the way to more implicit interactions with autonomous agents.

Previous work has addressed the understanding of users' implicit intentions in service robotics. Some approaches use external context to predict the user's intentions [5], while others rely on gaze-based signals [6]. To the best of our knowledge, however, none of them integrates external commonsense knowledge of the world.

The present paper tackles an important part of this issue, where only non-verbal, visually observable data are taken into account. We present a system that is able to understand a user's implicit need of help in the realisation of a task and to provide a relevant assistive action, inspired by the way humans act based on commonsense reasoning.
In a general sense, the use of commonsense reasoning in the present work can be summarised with the following assumption: one factor that leads humans to assist one another in the realisation of a task is the perception of danger. For instance, humans are typically able to understand (without explicit prior learning) that anything coming out of an oven is hot, and that a person should protect their hands to avoid hurting themselves. This is connected to the following definition of commonsense reasoning, from [7], p. 170:

Commonsense causal reasoning is qualitative reasoning about the behavior of a mechanism which can be done without external memory or calculation aids, although it may draw on concepts learned from the advanced study of a particular domain, e.g. automobile mechanics, computer architecture, or medical physiology.

For instance, in the above example, we can summarise the commonsense reasoning as the following causal relationships:

$oven \xrightarrow{\text{produce}} heat$ (1)

$heat \xrightarrow{\text{capable of}} hurt\ skin$ (2)

With a refinement on the visual features, the system is able to ground commonsense knowledge in the scene as follows:

$heat \xrightarrow{\text{capable of}} hurt\ hand$ (3)

This commonsense reasoning process will also help the robotic agent to build an assistive action. In fact, in some cases, reasoning about the visual inputs alone is not sufficient to provide accurate help. Recalling the previous example, the system needs external knowledge to come up with the assistive action "bring gloves to the user", as a human would do:

$glove \xrightarrow{\text{capable of}} protect\ hand$ (4)

We believe that creating such relationships is possible by using commonsense knowledge databases such as ConceptNet [8] or ATOMIC [9], and also by following some of the ideas for combining logic reasoning with machine learning described in [10]. Another characteristic of our work is the use of Scene Graphs (SG) [11] as a tool for knowledge representation. This type of representation can easily be enriched with external knowledge databases, as they share the same data structure: the graph. Finally, commonsense reasoning is the part of our architecture that allows biases to be understood and corrected through incremental development.

Our position can be summarised as the following statement: from the analysis of human behavior, the use of state-of-the-art solutions from Vision-to-Language combined with Commonsense Reasoning will advance Cognitive Robotics.

2. Related work
The task of retrieving graph representations from still images or videos is called Scene Graph Generation (SGG); in this section, we review current approaches to SGG. To perform efficient reasoning, our solution integrates external commonsense knowledge, so we also review solutions for knowledge-graph enrichment and completion. Finally, as our reasoning system needs to provide a sentiment analysis to retrieve the possibility of risks, we review approaches to building connotation lexicons [12] (i.e. lexicons that list words with their connotative polarity).

2.1. Scene Graph Generation
Scene Graph Generation (SGG) [11] is the task of creating a grounded graph of visual entities retrieved from an image, with the goal of representing attributes, objects and their relationships in a scene. Such graphs typically contain one or more triplets of the form (head entity, relation, tail entity). Entities in a scene graph can be persons (e.g. woman), places (e.g. street), objects (e.g. jeans) or attributes (e.g. blue, long). Relations between entities can be spatial positions (e.g. in front of, behind), actions (e.g. walking) or descriptions (e.g. wearing); a minimal sketch of this triplet structure is given below.
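To make the data structure concrete, the following minimal Python sketch shows one way a scene graph of (head entity, relation, tail entity) triplets with grounded bounding boxes could be represented; the class and field names are our own illustrative assumptions, not part of any cited SGG implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative sketch only: names and fields are assumptions made for this
# example, not taken from any SGG library cited in this paper.

@dataclass
class Entity:
    label: str                       # e.g. "woman", "jeans", "street"
    bbox: Tuple[int, int, int, int]  # grounding: (x1, y1, x2, y2) in image coordinates

@dataclass
class Relation:
    head: Entity   # subject of the triplet
    label: str     # e.g. "wearing", "in front of", "walking"
    tail: Entity   # object of the triplet

@dataclass
class SceneGraph:
    entities: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

# Example: the triplet (woman, wearing, jeans)
woman = Entity("woman", (120, 40, 260, 420))
jeans = Entity("jeans", (140, 230, 250, 410))
graph = SceneGraph(entities=[woman, jeans],
                   relations=[Relation(woman, "wearing", jeans)])
```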
While recent approaches for this task may differ, the majority use object detection and region captioning as a baseline [13]. For object detection, the most reported solution is the use of a pre-trained Faster R-CNN [14], a highly efficient Convolutional Neural Network (CNN) approach with Region Of Interest (ROI) pooling for object classification. Once objects are detected, SGG solutions need to pair entities with one another and find the correct predicate to represent the relation. To do so, approaches such as Conditional Random Fields (CRF) [15] or Translational Embeddings (TransE) [16] are used. More recently, neural networks have advanced this task with RNN/LSTM-based [17] and Graph Convolutional Network (GCN) [18] approaches.

When working with videos, a more complex structure is needed to model relationships between frames. To this end, [19] use a Temporal Convolution Network (TCN) paired with a GCN for modeling within-image dependencies. In [20], the authors use Target Adaptive Context Aggregation to relate entities to their spatio-temporal context.

2.2. Commonsense Completion
There are multiple ways to use external knowledge to enrich a scene graph. The task of Commonsense Completion is introduced in [21] as the automatic completion of a knowledge graph using commonsense knowledge, in most cases retrieved from ConceptNet [8]. In this task, the knowledge is added directly to the graph, creating new nodes and edges. In COMET [22], the authors describe a model that learns to generate graph completions based on relationships between events from the ATOMIC [9] and ConceptNet [8] datasets. The method uses a Transformer architecture: the model is trained on a dataset of graphs to predict the next node, given the previous node and a relation from the set of relation types of ATOMIC. As the solution uses a Transformer architecture, the input and output are natural language sentences, with specific tokens to represent relations. We can also view commonsense completion of the SG as a form of knowledge graph fusion: [23] propose an approach that bridges knowledge graphs and scene graphs using successive message passing on a Graph Neural Network (GNN).

2.3. Word Connotations Lexicon
[12] is the first attempt to build a connotation lexicon, i.e. a lexicon that maps words to their intrinsic connotation. The proposed approach learns word connotations using connotative predicates, i.e. predicates through which connotation propagates: words frequently encountered with negatively connoted words are themselves assumed to be negatively connoted. With this method, the algorithm only needs a small seed set of labelled words and a corpus of texts to learn word connotations. In [24], the authors extend this method with induction algorithms based on graph structures, reporting the use of random walks (HITS/PageRank), label/graph propagation and constraint optimization. With this approach, [24] capture fine-grained inductions, reducing biases from the previous solution (e.g. the word "cure" is often associated with "disease" while not being negatively connoted); a toy sketch of such graph-based propagation is given below.
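As a toy illustration of graph-based connotation induction (not the exact algorithms of [12, 24]), the sketch below propagates polarity from a small seed of labelled words over a co-occurrence graph; the word list, edge weights and iteration count are invented for the example.

```python
# Toy label propagation over a word co-occurrence graph. Seeds, edges and
# weights are invented for illustration; the real algorithms in [12, 24] use
# HITS/PageRank-style walks and constraint optimisation on large corpora.

from collections import defaultdict

edges = {  # undirected co-occurrence weights
    ("suffer", "disease"): 1.0,
    ("cure", "disease"): 1.0,
    ("enjoy", "holiday"): 1.0,
    ("disease", "pain"): 1.0,
}
seeds = {"suffer": -1.0, "enjoy": +1.0}  # small labelled seed set

neighbours = defaultdict(list)
for (a, b), w in edges.items():
    neighbours[a].append((b, w))
    neighbours[b].append((a, w))

scores = defaultdict(float, seeds)
for _ in range(10):  # fixed number of propagation steps
    new_scores = {}
    for word in neighbours:
        if word in seeds:            # keep seed polarities clamped
            new_scores[word] = seeds[word]
            continue
        total = sum(w for _, w in neighbours[word])
        new_scores[word] = sum(scores[n] * w for n, w in neighbours[word]) / total
    scores.update(new_scores)

print(dict(scores))
# "disease" and "pain" drift negative, "holiday" positive; note that "cure"
# drifts negative too, exactly the kind of bias the fine-grained induction
# of [24] aims to reduce.
```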
3. Detecting and fulfilling the implicit need of help
From an input video sequence, the system builds its own representation of the task being performed using state-of-the-art approaches of Vision-to-Language. Then, it reasons on this representation using commonsense knowledge to assess risks for a human: if the risks are too high, an assistive action is performed accordingly. This work aims to solve the following questions: (a) Scene Graph enrichment with related Commonsense Knowledge; (b) Scene Graph refinement based on visual features; (c) Sentiment Analysis from the Scene Graph; (d) Action Generation from the Commonsense Scene Graph.

3.1. Scene Understanding
Efficient scene understanding from visual inputs, including human intents and object affordances, is a long-standing, unsolved challenge for Computer Vision. In this task, the use of an appropriate representation of the scene is of utmost importance. In the literature, there are two main approaches to scene representation: graphs and natural language; the former has been developed within Scene Graph Generation and the latter within Video Captioning.

We choose scene graphs over natural language captions to model our representation of the scene. Scene Graphs provide numerous advantages: each detected entity is clearly represented and grounded, relationships between the user and objects can be clearly identified, and enrichment with external knowledge is straightforward since most knowledge bases also use graph structures [8, 9]. In this section, we detail our system for understanding the scene and inferring risks for the human using SG, breaking the description down into four distinct steps (illustrated in Figure 1):

Figure 1: Our proposed architecture for scene representation and sentiment analysis.

Scene Graph Generation.
First, the representation of the relevant perceived data is critical. As a backbone, we use Faster R-CNN [14] (Figure 1, top left) to retrieve ROI features from the input scene. Given these features, we need to construct a directed graph $G$ (Figure 1, top right) composed of a set of entities $E$ and a set of relations $R$ such that:

$G = (R, E, \theta)$ (5)

where $\theta$ is an incidence function that finds the relation $r \in R$ between the head entity $h \in E$ and the tail entity $t \in E$, such that:

$\theta : \{h, t\} \in E^{2} \mapsto r$ (6)

To generate such a graph, we follow the approach from [20], which uses Target Adaptive Context Aggregation (TRACE) to embed temporal and spatial information. The approach is as follows: from the visual features, relation candidates are represented as a hierarchical relation tree (HRTree); then, the TRACE module captures temporal and spatial relationships to model the context with other frames; finally, a classification module outputs the best inference.

Scene Graph Enrichment.
Second, we enrich the graph with relevant commonsense data (Figure 1, bottom right). The challenge here is to bridge the gap between the ontology represented in the scene graph and the ontology represented in the commonsense knowledge graph. ConceptNet is a commonsense knowledge base built from multiple resources such as crowd-sourcing and expert-generated data. For each word in natural English, ConceptNet relates it to other words or groups of words using commonsense relationships such as "UsedFor" or "PartOf". For each relation, ConceptNet gives a set of connected entities. To select the most relevant one, we pick the one with the highest confidence match with the data already present in the graph. For example, the relation "knife is used for cutting vegetables" will be selected over "knife is used for stabbing" if instances of vegetables (e.g. "tomato") are already present in the graph; a minimal sketch of this selection step is given below.
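As a minimal sketch of this selection step (our own simplification, using hand-written ConceptNet-style edges rather than live queries), the snippet below ranks candidate relations for a detected object by how strongly their tails overlap with entities already present in the scene graph.

```python
# Illustrative only: the edges and the overlap heuristic are simplified
# stand-ins for ConceptNet [8] data and the confidence matching we describe.

from typing import List, Tuple

# ConceptNet-style candidate edges for the head entity "knife": (relation, tail, weight)
candidate_edges: List[Tuple[str, str, float]] = [
    ("UsedFor", "cutting vegetables", 2.0),
    ("UsedFor", "stabbing", 1.0),
    ("AtLocation", "kitchen", 1.5),
]

# Entities already grounded in the scene graph (from the running example)
scene_entities = {"person", "knife", "tomato", "table"}

# Hand-written hypernym hints linking scene entities to knowledge-base terms;
# in practice this matching would itself come from the knowledge base.
related_terms = {"vegetables": {"tomato", "carrot", "potato"}}

def match_score(tail: str, weight: float) -> float:
    """Score a candidate edge by its weight plus its overlap with the scene graph."""
    overlap = 0
    for word in tail.split():
        if word in scene_entities:
            overlap += 1
        overlap += len(related_terms.get(word, set()) & scene_entities)
    return weight + overlap

best = max(candidate_edges, key=lambda e: match_score(e[1], e[2]))
print(best)  # ("UsedFor", "cutting vegetables", 2.0) wins because "tomato" is in the scene
```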
Following [23], we then pair similarly labelled nodes from both ontologies with a new edge. All new edges are then updated using successive message passing to propagate information across the graph. Grounding is applied by linking the relevant regions of the image to the corresponding enriched commonsense knowledge [25]. For instance, this will replace the word "vegetables" in the previous example by the entity directly connected with the image, i.e. "tomato". The enrichment of commonsense knowledge is also used to interpret misused objects in a task: e.g. given the graph "PersonX is cutting a tomato with an axe", the commonsense knowledge attached to the word "axe" may be only infrequently related to the word "tomato", and the task would therefore be declared unsafe.

Sentiment Analysis on Graph.
Third, sentiment analysis is performed on information from the graph (Figure 1, bottom left). We evaluate words and their semantic connotation (e.g. the word "heat" is negatively connoted) using a connotation lexicon such as [26]. Traditional approaches to building connotation lexicons rely on word prosody in texts; we extend this representation to visual proximity, using the bounding-box coordinates associated with every entity. For instance, spatial proximity between "negative" entities and the user in the image features is highly weighted. We update the graph by adding to each node $n$ a sentiment value $S_n \in [-1, 1]$ that represents the potential risk of the entity for the human.

Decision Making.
Fourth, given the sentiment analysis, a pooling is performed with respect to the graph dependencies to retrieve a confidence value. If this value is above a pre-defined threshold, the task is declared unsafe and a decision to assist is made. This threshold is dynamic and can be adjusted given contextual information, such as the presence of a child in the scene.

3.2. Command Generation
Once the autonomous agent understands the immediate necessity of assistance, it needs to provide the appropriate help. To do so, the system needs to generate the assistive action that best increases the safety of the task, as in the example introduced in Section 1. We build what we call a "commonsense response", that is, the most probable set of actions to perform in order to help the human with the task. The goal here is to complete the weighted graph retrieved from the Sentiment Analysis with the related commonsense knowledge in order to stabilize the graph. We iterate through the graph using the same process as for Scene Graph Enrichment. The difference here is that at each iteration we also perform Sentiment Analysis and select only the positively weighted nodes. We compute the current sentiment value of the graph $G$ as follows:

$S(G) = \sum_{i=1}^{n} S_i$ (7)

where $n$ is the number of nodes and $S_i$ is the sentiment value of node $i$. At the end of the process we obtain a graph similar to the one shown in Figure 2, where the solution is the highest positively weighted node that represents an object. This object can then be found and provided by the robot to the user. If no satisfying solution is found, one approach could be to warn the user with a vocal utterance. A compact sketch of this decision and retrieval logic is given after Figure 2.

Figure 2: Solution retrieval via Scene Graph Completion.
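The following sketch (our own simplification, with hypothetical node labels, sentiment values, pooling and threshold) combines the node-level sentiment scores of Section 3.1 with the decision rule and the retrieval of the highest positively weighted object node described above.

```python
# Illustrative decision / command-generation sketch. Node labels, sentiment
# values, the pooling and the threshold are hypothetical; they are not outputs
# of the pipeline described in this paper.

nodes = {
    # label: (sentiment S_n in [-1, 1], is_object)
    "person": (0.0, False),
    "oven":   (-0.4, True),
    "heat":   (-0.8, False),
    "hand":   (-0.3, False),
    "glove":  (+0.9, True),   # node added during commonsense completion
}

RISK_THRESHOLD = 0.5  # dynamic in the full system (e.g. lowered if a child is present)

# Eq. (7): overall sentiment of the graph is the sum of the node sentiments S_i.
graph_sentiment = sum(s for s, _ in nodes.values())

# Naive pooling assumption: risk grows as the summed sentiment becomes more negative.
risk_score = -graph_sentiment

if risk_score > RISK_THRESHOLD:
    # Command generation: pick the highest positively weighted object node.
    candidates = {label: s for label, (s, is_obj) in nodes.items() if is_obj and s > 0}
    if candidates:
        target = max(candidates, key=candidates.get)
        print(f"assistive action: bring '{target}' to the user")
    else:
        print("assistive action: warn the user with a vocal utterance")
else:
    print("task considered safe, no assistance needed")
```

With the hypothetical values above, the summed sentiment is -0.6, the task is flagged as unsafe, and "glove" is retrieved as the object to bring, matching the running oven example.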
3.3. Evaluation
We will evaluate our Scene Graph Generation approach using the ActivityNet dataset [27]. This dataset contains 20k videos of 200 everyday human activities and is used as a benchmark for most SGG approaches. Recently, the ActionGenome dataset was also introduced in [28]; it captures human daily-life activities in 265k labeled frames.

To assess user acceptance and perceived performance of the system, we will conduct a series of human experiments. Humans will engage in safe and potentially dangerous scenarios in which the system will detect the risk of danger and define a remediating action to minimise it. For practical and ethical reasons, humans will perform the safe task in a real environment and the potentially dangerous task in a Virtual Reality environment in which the task is simulated. We will use the Unified Theory of Acceptance and Use of Technology (UTAUT) [29] and the Technology Acceptance Model 3 (TAM-3) [30] to evaluate acceptance of the system in these scenarios. Additionally, we will supplement these measures with qualitative feedback about the performance and actions of the system. Human-human interaction in similar scenarios will be used to evaluate the coherence of the system's assistive actions.

4. Conclusion and future work
The proposed work combines a traditional machine learning approach to generate an accurate model of the world with knowledge representation and reasoning. Our solution includes Scene Graph Generation as a high-level representation of the scene, commonsense knowledge enrichment combined with sentiment analysis to assess risk for the user and, finally, a graph completion method to retrieve relevant solutions. As a limitation, our work does not consider other factors of the implicit need of help, such as fatigue or stress. Furthermore, this proposal does not take into account the latent context of the captured sequence: for instance, in our running example it would be useful to know whether the oven is turned on or whether it was turned on earlier. As this proposal is still on-going work, a number of challenges remain open, such as the efficient fusion of scene and commonsense knowledge graphs, sentiment analysis from scene graphs, and the generation of robot commands from graph entities. All these issues will be considered in future work, along with an investigation of the limitations and confounding factors in the automatic interpretation of the implicit need of help.

5. Acknowledgments
This publication was supported by the Brittany Region.

References
[1] S. Harnad, The symbol grounding problem, Physica D: Nonlinear Phenomena 42 (1990) 335–346.
[2] Y. Tada, Y. Hagiwara, H. Tanaka, T. Taniguchi, Robust understanding of robot-directed speech commands using sequence to sequence with noise injection, Frontiers in Robotics and AI 6 (2020) 144.
[3] S. Waldherr, R. Romero, S. Thrun, A gesture based interface for human-robot interaction, Autonomous Robots 9 (2000) 151–173.
[4] G. Doisy, Sensorless collision detection and control by physical interaction for wheeled mobile robots, in: Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction, 2012, pp. 121–122.
[5] R. Liu, X. Zhang, S. Li, Use context to understand user's implicit intentions in activities of daily living, in: 2014 IEEE International Conference on Mechatronics and Automation, 2014, pp. 1214–1219.
[6] S. Li, X. Zhang, Implicit intention communication in human–robot interaction through visual behavior studies, IEEE Transactions on Human-Machine Systems 47 (2017) 437–448.
[7] B. Kuipers, Commonsense reasoning about causality: deriving behavior from structure, Artificial Intelligence 24 (1984) 169–203.
[8] R. Speer, J. Chin, C. Havasi, ConceptNet 5.5: An open multilingual graph of general knowledge, Proceedings of the AAAI Conference on Artificial Intelligence 31 (2017).
[9] M. Sap, R. L. Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A. Smith, Y. Choi, ATOMIC: An atlas of machine commonsense for if-then reasoning, Proceedings of the AAAI Conference on Artificial Intelligence 33 (2019) 3027–3035.
[10] F. van Harmelen, A. ten Teije, A boxology of design patterns for hybrid learning and reasoning systems, Journal of Web Engineering 18 (2019) 97–124. arXiv:1905.12389.
[11] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. A. Shamma, M. S. Bernstein, L. Fei-Fei, Image retrieval using scene graphs, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2015, pp. 3668–3678.
[12] S. Feng, R. Bose, Y. Choi, Learning general connotation of words using graph-based algorithms, in: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2011, pp. 1092–1103.
[13] Y. Li, W. Ouyang, B. Zhou, K. Wang, X. Wang, Scene graph generation from objects, phrases and region captions, in: 2017 IEEE International Conference on Computer Vision (ICCV), IEEE, 2017, pp. 1270–1279.
[14] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2017) 1137–1149.
[15] B. Dai, Y. Zhang, D. Lin, Detecting visual relationships with deep relational networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 3298–3308.
[16] H. Zhang, Z. Kyaw, S.-F. Chang, T.-S. Chua, Visual translation embedding network for visual relation detection, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3107–3115.
[17] R. Zellers, M. Yatskar, S. Thomson, Y. Choi, Neural motifs: Scene graph parsing with global context, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5831–5840.
[18] J. Yang, J. Lu, S. Lee, D. Batra, D. Parikh, Graph R-CNN for scene graph generation, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 670–685.
[19] R. Wang, Z. Wei, P. Li, Q. Zhang, X. Huang, Storytelling from an image stream using scene graphs, Proceedings of the AAAI Conference on Artificial Intelligence 34 (2020) 9185–9192.
[20] Y. Teng, L. Wang, Z. Li, G. Wu, Target adaptive context aggregation for video scene graph generation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13688–13697.
[21] X. Li, A. Taheri, L. Tu, K. Gimpel, Commonsense knowledge base completion, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2016, pp. 1445–1455.
[22] A. Bosselut, H. Rashkin, M. Sap, C. Malaviya, A. Celikyilmaz, Y. Choi, COMET: Commonsense transformers for automatic knowledge graph construction, arXiv:1906.05317 [cs] (2019).
[23] A. Zareian, S. Karaman, S.-F. Chang, Bridging knowledge graphs to generate scene graphs, volume 12368 of Lecture Notes in Computer Science, Springer International Publishing, 2020, pp. 606–623.
[24] S. Feng, J. S. Kang, P. Kuznetsova, Y. Choi, Connotation lexicon: A dash of sentiment beneath the surface meaning, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2013, pp. 1774–1784.
[25] A. Zareian, Z. Wang, H. You, S.-F. Chang, Learning visual commonsense for robust scene graph generation, arXiv:2006.09623 [cs] (2020).
[26] S. M. Mohammad, P. D. Turney, Crowdsourcing a word-emotion association lexicon, Computational Intelligence 29 (2013) 436–465.
[27] F. Caba Heilbron, V. Escorcia, B. Ghanem, J. C. Niebles, ActivityNet: A large-scale video benchmark for human activity understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 961–970.
[28] J. Ji, R. Krishna, L. Fei-Fei, J. C. Niebles, Action genome: Actions as compositions of spatio-temporal scene graphs, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10236–10247.
[29] V. Venkatesh, F. Davis, M. G. Morris, Dead or alive? The development, trajectory and future of technology adoption research, Journal of the Association for Information Systems 8 (2007) 1.
[30] V. Venkatesh, H. Bala, Technology acceptance model 3 and a research agenda on interventions, Decision Sciences 39 (2008) 273–315.