A Review on Intelligent Object Perception Methods Combining Knowledge-based Reasoning and Machine Learning∗

Filippos Gouidis,1† Alexandros Vassiliades,1,2† Theodore Patkos,1 Antonis Argyros,1 Nick Bassiliades,2 Dimitris Plexousakis1
1 Institute of Computer Science, Foundation for Research and Technology, Hellas
2 Aristotle University of Thessaloniki
{gouidis, patkos, argyros, dp}@ics.forth.gr, {valexande, nbassili}@csd.auth.gr

∗ This project has received funding from the Hellenic Foundation for Research and Innovation (HFRI) and the General Secretariat for Research and Technology (GSRT), under grant agreement No 188.
† Authors contributed equally.
Copyright © 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Object perception is a fundamental sub-field of Computer Vision, covering a multitude of individual areas and having contributed high-impact results. While Machine Learning has been traditionally applied to address related problems, recent studies also seek ways to integrate knowledge engineering in order to expand the level of intelligence of the visual interpretation of objects, their properties and their relations with the environment. In this paper, we attempt a systematic investigation of how knowledge-based methods contribute to diverse object perception tasks. We review the latest achievements and identify prominent research directions.

Introduction

Despite the recent sweep of progress in Machine Learning (ML), which is stirring public imagination about the capabilities of future Artificial Intelligence (AI) systems, the research community seems more composed, in part due to the realization that current achievements are largely based on engineering advancements, and only partly on novel scientific progress. Undoubtedly, the accomplishments are neither small nor temporary; in the latest "One Hundred Year Study on AI", a panel of renowned AI experts foresees tremendous impact of AI in a multitude of technological and societal domains in the next decades, fueled primarily by systems running ML algorithms (Stone et al. 2016). Yet, there is still a lot of ground to cover before we can obtain a deep understanding of how to overcome the limitations of data-driven approaches at a more generic level.

Aiming at exploiting the full potential of AI, a growing body of research is devoted to the idea of integrating ML and knowledge-based approaches. Davis and Marcus (2015), for instance, while discussing the multifaceted challenges related to automating commonsense reasoning, a crucial ability for any intelligent entity operating in real-world conditions, underline the need to combine the strengths of diverse AI approaches from these two fields. Others, as for example Bengio et al. (2019) and Pearl (2018), emphasize the inability of Deep Learning to effectively recognize cause and effect relations. Pearl, Geffner (2018) and, recently, Lenat1 suggest seeking solutions by bridging the gap between model-free, data-intensive learners and knowledge-based models, and by building on the synergy between heuristic-level and epistemological-level languages.

1 https://towardsdatascience.com/statistical-learning-and-knowledge-engineering-all-the-way-down-1bb004040114

In this paper, we review recent progress in the direction of coupling the strengths of ML and knowledge-based methods, focusing our attention on the topic of Object Perception (OP), an important sub-field of Computer Vision (CV). Tasks related to OP are at the core of a wide spectrum of practical systems, and relevant research has traditionally relied on ML to approach the related problems. The recent developments have significantly advanced the field but, interestingly, state-of-the-art studies try to integrate symbolic methods in order to achieve broader visual intelligence. It seems that it is becoming less of a paradox within the CV community that, in order to build intelligent vision systems, much of the information needed is not directly observable.
Existing surveys on the intersection of ML and knowledge engineering (e.g., (Nickel et al. 2016)) are indeed very informative, but they usually offer a high-level understanding of the challenges involved. The rich literature on CV reviews, on the other hand, adopts a more problem-specific analysis, studying in detail the requirements of each particular CV task (see e.g., (Wu et al. 2017; Herath, Harandi, and Porikli 2017; Liu et al. 2018a)). Only recently was an overview presented that shows how background knowledge can benefit tasks such as image understanding (Aditya, Yang, and Baral 2019); our goal is to explore this direction on the topic of intelligent OP, reporting state-of-the-art achievements and showing how different facets of knowledge-based research can contribute to addressing the rich diversity of OP tasks.

The rest of the paper investigates state-of-the-art literature on intelligent OP along the three pillars shown in Figure 1: i) symbolic models, further analyzed from the perspectives of expressive representations, reasoning capacity, and open-domain, Web-based knowledge; ii) commonsense knowledge exploitation, a key skill for any intelligent system; and, iii) enhanced learning ability, building on hybrid approaches. By no means should this review be considered exhaustive; inevitably, relevant studies may not have been included. Our intention is to capture a snapshot of the most recent research trends, discussing studies that improve previous results or offer new insights, hoping to provide a starting point for following the interesting progress currently underway, while giving pointers to directions that seem worth investigating. In this respect, the paper concludes with a discussion on open questions and prominent research directions. Table 1 at the end summarizes the reviewed literature.

Figure 1: Organization of thematic areas.

Exploitation of Symbolic Models

The scope of OP research ranges over a wide spectrum of problems: from object, action and affordance detection, to localization and recognition in images, to motion and structure inference in videos, to scene understanding and visual reasoning. Traditionally, OP relied on ML methodologies to find patterns in reams of data, taking as input feature vectors representing entities in terms of numeric or categorical (membership to a more general class) attributes.

In this section, we discuss how high-level knowledge related to visual entities can improve the performance of OP algorithms in manifold ways. We start by reviewing the state of the art in coupling data-driven approaches and rich knowledge representations about aspects such as context, space and affordances.

Expressive Representations of Knowledge

Although the notion of a representation is rather general, the modeling of relational knowledge in the form of individuals (entities) and their associated relations is well-studied in AI, especially in the context of symbolic representations (see for instance (van Harmelen et al. 2007; Brachman and Levesque 2004)). The goal is to offer the level of abstraction needed to design a system versatile enough to adapt to the requirements of a particular domain, yet rigid enough to be encoded in a computer program, having clear semantics and elegant properties (Dean, Allen, and Aloimonos 1995). As a result, expressiveness, i.e., what can or cannot be represented in a given model, and computation, i.e., how fast conclusions are drawn or which statements can be evaluated by algorithms guaranteed to terminate, are two, often competing, aspects to be taken into consideration.

The form of the representational model applied plays a decisive role in such considerations, affecting the richness of semantics that can be captured. While the range of models varies significantly, even relatively shallow representations have proven to offer improvements in the performance of OP methodologies (see for example (Deng et al. 2014; Zhu, Fathi, and Fei-Fei 2014; Zhu et al. 2015b; Redmon and Farhadi 2017)). Models as simple as relational tables or flat weighted graphs, but also more complex multi-relational graphs, often called Knowledge Graphs (KGs), or even semantically-rich conceptualizations with formal semantics, often called ontologies, are proposed in the relevant literature.
In the sequel, we refer to any model that offers at least a basic structuring of data as a Knowledge Base (KB).

Utilization of Contextual Knowledge

Context awareness is the ability of a system to understand the state of the environment, and to perceive the interplay of the entities inhabiting it. This feature offers advantages in accomplishing inference tasks, but also enhances a system in terms of reusability, as contextual knowledge enables it to adapt to new situations and environments that resemble known ones.

A preliminary approach towards this direction utilized a special semantic fusion network, which combined novel object- and scene-based information and was able to capture relationships between video class labels and semantic entities (objects and scenes) (Wu et al. 2016b). This was a Convolutional Neural Network (CNN) consisting of three layers, with the first layer detecting low-level features, the second layer object features and the third layer scene features. Although no symbolic representation was applied, this modeling of abstraction layers for the representation of the information depicted in an image is similar to the hierarchical structuring of knowledge used by top-down methods.

In a similar style, Liu et al. (2018b) recently attempted to address the problem of object detection by proposing an algorithm that exploits jointly the context of a visual scene and an object's relationships. Such features are typically taken into consideration in isolation by most object detection methods. The algorithm uses a CNN-based framework tailored for object detection and combines it with a graphical model designed for the inference of object states. A special graph is created for each image in the training set, with nodes corresponding to objects and edges to object relationships. The intuition behind this approach is that the graph's topology enables an object state to be determined not only by its low-level characteristics (appearance details), but also by the states of other objects interacting with it and by the overall scene context. The experimental evaluation conducted on different datasets underscored the importance of knowledge stemming from local and global context.
Currently, a collection of prominent approaches is oriented towards Gated Graph Neural Networks (GGNNs) (Li et al. 2015), a variation of Graph Neural Networks (GNNs) (Scarselli et al. 2008), in order to integrate contextual knowledge in the training of a system. GNNs are a special type of NN tailored to the learning of information encoded in a graph. In GGNNs, each node corresponds to a hidden state vector that is updated in an iterative way. Two features that differentiate GGNNs from standard Recurrent Neural Networks (RNNs), which also use the mechanism of recurrence, are that in the former information can move bi-directionally and that many nodes can be updated per step.
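To make the propagation mechanism concrete, the following minimal Python sketch implements one synchronous GGNN-style step with a GRU-like gate, in the spirit of Li et al. (2015); the toy graph, dimensions and random weights are ours for illustration and do not come from any of the systems reviewed here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(h, a_in, a_out, params):
    """One synchronous GGNN propagation step (simplified sketch).

    h      : (N, D) hidden state of every node.
    a_in   : (N, N) adjacency for incoming edges.
    a_out  : (N, N) adjacency for outgoing edges.
    params : dict of weight matrices, shapes (2D, D) and (D, D).
    """
    # Information moves bi-directionally: aggregate messages from
    # incoming and outgoing neighbours, then concatenate.
    m = np.concatenate([a_in @ h, a_out @ h], axis=1)   # (N, 2D)
    z = sigmoid(m @ params["Wz"] + h @ params["Uz"])    # update gate
    r = sigmoid(m @ params["Wr"] + h @ params["Ur"])    # reset gate
    h_tilde = np.tanh(m @ params["Wh"] + (r * h) @ params["Uh"])
    return (1 - z) * h + z * h_tilde                    # all nodes update at once

# Toy graph: 3 nodes, edges 0->1 and 1->2; hidden size D = 4.
N, D = 3, 4
rng = np.random.default_rng(0)
a_out = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
a_in = a_out.T
params = {k: rng.normal(scale=0.1, size=(2 * D, D)) for k in ("Wz", "Wr", "Wh")}
params.update({k: rng.normal(scale=0.1, size=(D, D)) for k in ("Uz", "Ur", "Uh")})
h = rng.normal(size=(N, D))
for _ in range(5):                                      # a few propagation steps
    h = ggnn_step(h, a_in, a_out, params)
```

After a few such steps, each node's state reflects its multi-hop context, which is what the affordance and situation recognition methods below exploit.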
Chuang et al. (2018) propose a GGNN-based method which exploits contextual information, such as the types of objects and their spatial relations, in order to detect action-object affordances. In the same vein, Ye et al. (2017) utilize a two-stage pipeline built around a CNN to detect functional areas in indoor scenes. More recently, Sawatzky et al. (2019) utilized a GGNN which takes into account the global context of a scene and infers the affordances of the contained objects; it also proposes the most suitable object for a specific task. The authors prove that this approach yields better results compared to methods that rely solely on the results of an object classification algorithm.

In (Li et al. 2017), an approach aiming to perform situation recognition is presented, based on the detection of human-object interactions. The goal is to predict the most representative verb to describe what is taking place in a scene, capturing also relevant semantic information (roles), such as the actor, the source and target of the action, etc. The authors utilize a GGNN, which enables combined reasoning about verbs and their roles through the iterative propagation of messages along the edges.

Spatial Contextual Knowledge

A particular type of contextual knowledge concerns the spatial properties of the entities populating a visual scene. These may involve simple spatial relations, such as "object x is usually part of object y" and "object x is usually situated near the objects x1, ..., xn", but also semantically enriched statements, such as "objects of type x are usually found inside object z (e.g., in the fridge), located in room y". Due to the ubiquitous nature of spatial data in practical domains, such relations are often captured as a separate class of context.

Semantic spatial knowledge, when fused with low-level metric information, gives great flexibility to a system. This is demonstrated in the study of Gemignani et al. (2016), where a novel representation is introduced that combines the metric information of the environment with the symbolic data that conveys meaning to the entities inhabiting it, as well as with topological graphs. Although delivering a generic model with clear semantics is not their main objective, the resulting integrated representation enables a system to perform high-level spatial reasoning, as well as to understand target locations and positions of objects in the environment.

Generality is the aim of the model proposed by Tenorth and Beetz (2017), which uses an OWL ontology combining information from OpenCyc and other Web sources that help compile new classes while forming the environmental map. The main reasoning mechanism of this study is Prolog, although probabilistic reasoners are also used to tackle fuzzy information or uncertain relations. A CV system annotates objects based on their shape, their distances and the dimension of the environment, using a monotonic Description Logic (DL) to build the environmental map. As a result, a coherent and well-formalized representation of environments is achieved, which can offer high-quality datasets for training data-driven models. The use of DL can also offer a wide spectrum of spatial reasoning capabilities with well-specified properties and formal semantics.

Adopting a different approach, the KG given in (Chen et al. 2018) enables spatial reasoning both in a local and a global context which, in turn, results in improved performance in semantic scene understanding. The proposed framework consists of two distinct modules, one focusing on local regions of the image and one dealing with the whole image. The local module is convolution-based, analyzing small regions in the image, whereas the basic component of the global module is a KG representing regions and classes as nodes, capturing spatial and semantic relations. According to the experimental evaluation performed on the ADE and Visual Genome datasets, the network achieves better performance over other CNN-based baselines for region classification, by a margin which sometimes is close to 10%. According to the ablation study, the most decisive factor for the framework's performance was the KG.
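Statements of the kind quoted at the start of this subsection are easy to serialize as weighted triples. The hypothetical micro-KB below is only a sketch of the idea; the relation names, objects and weights are invented for illustration and are not taken from the cited systems.

```python
# Hypothetical miniature spatial KB: weighted (subject, relation, object)
# statements of the form "milk is usually inside the fridge".
SPATIAL_KB = [
    ("handle", "partOf", "mug", 0.95),
    ("mug", "usuallyNear", "coffee_machine", 0.70),
    ("milk", "usuallyInside", "fridge", 0.85),
    ("fridge", "locatedIn", "kitchen", 0.90),
]

def likely_locations(obj, kb=SPATIAL_KB, threshold=0.5):
    """Return (relation, place, weight) facts that can bias a detector
    or a planner towards plausible object placements."""
    return [(rel, place, w)
            for subj, rel, place, w in kb
            if subj == obj
            and rel in ("usuallyInside", "locatedIn", "usuallyNear")
            and w >= threshold]

print(likely_locations("milk"))   # [('usuallyInside', 'fridge', 0.85)]
```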
Modeling Affordances

Building on the geometrical structure and physical properties of objects, such as rigidity and hollowness, the representation of affordances helps develop systems that can reason about how human-level tasks are performed. While ML is invaluable for automating the process of learning from examples when data is available, rich representations can generalize and reuse the obtained models in situations where data-based training is not possible.

One of the first studies that demonstrated that even very basic semantic models can improve the performance of recognizing human-object interaction was (Chao et al. 2015). The authors succeeded in boosting the performance of visual classifiers by exploiting the compositionality and concurrency of semantic concepts contained in images.

KNOWROB 2.0 (Beetz et al. 2018), which is the result of a series of research activities in the field of Cognitive Robotics, is an excellent example of integrating top-down knowledge engineering with bottom-up information structuring, involving, among others, a variety of CV tasks. A combination of KBs of different granularity helps the KNOWROB 2.0 framework capture rich models of the world. The representation of high-level knowledge is based on OWL-DL ontologies, a decidable fragment of First-Order Logic (FOL), yet adequately expressive for most practical domains. The KBs enable the system to answer questions such as "How to pick up the cup?" or "Which body part to use?". The authors provide evidence that existing methods for learning human manipulation tasks can be boosted by using symbolic-level structured knowledge.

A recently proposed novel representation model that manages to balance between concept abstraction, uncertainty modeling and scalability is given in (Daruna et al. 2019). The so-called RoboCSE framework encodes the abstract, semantic knowledge of an environment, i.e., the main concepts and their relations, such as location, material and affordance, obtained by observations, simulations, or even from external sources, into multi-relational embeddings. These embeddings are used to represent the knowledge graph of the domain in vector space, encoding vertices that represent entities as vectors and edges that represent relations as mappings. While the majority of similar approaches rely on Bayesian Logic Networks and Markov Logic Networks, suffering from well-known intractability problems, the authors prove that their model is highly scalable, robust to uncertainty, and generalizes learned semantics.
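To illustrate how relations can act as mappings in vector space, the sketch below scores triples with a TransE-style translational model. This is a deliberately simplified stand-in for the embeddings learned by RoboCSE, which builds on a different, analogy-based embedding method; the vocabulary, dimensions and random vectors are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16
# Illustrative vocabulary; a real framework learns these vectors from
# observations, simulations or external sources.
entities = {e: rng.normal(size=D) for e in ("mug", "kitchen", "ceramic")}
relations = {r: rng.normal(size=D) for r in ("atLocation", "madeOf")}

def score(h, r, t):
    """TransE-style plausibility: head + relation should land near tail,
    so a smaller distance means a more plausible triple."""
    return -np.linalg.norm(entities[h] + relations[r] - entities[t])

# After training, a query like "where is a mug?" reduces to ranking
# candidate tail entities under the atLocation relation.
candidates = ["kitchen", "ceramic"]
best = max(candidates, key=lambda t: score("mug", "atLocation", t))
print(best)
```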
Learning from demonstration, or imitation learning, is a relevant, yet broader objective, which introduces interesting opportunities and challenges to a CV system (see (Torabi, Warnell, and Stone 2019; Ravichandar et al. 2019)). Purely ML-based methods constitute the predominant research direction, and only a few state-of-the-art studies utilize knowledge-based methods, taking advantage of the reusability and generalization of the learned information. A popular choice is to deploy expressive OWL-DL (Ramirez-Amaro, Beetz, and Cheng 2017; Lemaignan et al. 2017) or pure DL (Agostini, Torras, and Woergoetter 2017) representations to capture world knowledge. The CV modules are assigned the task of extracting information about the state of the environment, the expert agent's pose and location, grasping areas of objects, affordances, shapes, etc. On top of these, the coupling with knowledge-based systems assists in visual interpretation, for example to track human motion, to semantically annotate the movement (i.e., "how the human performs the action") or to understand if a task is doable in a given setting. These studies show that such representations enable a system to reuse the learned knowledge in diverse settings and under different conditions, without having to re-train classifiers from scratch. Moreover, complex queries can be answered, a topic discussed in the next subsection.

Reasoning over Expressive KBs

Encoding knowledge in a semantically structured way is only part of the story; a rich representation model can also offer inference capabilities to a CV system, which are needed for accomplishing complex tasks, such as scene understanding, or simpler tasks under realistic conditions, such as scene analysis with occlusions, noisy or erroneous input, etc. A reasoning system can be used to connect the dots that relate concepts together when only partial observation is available, especially in data-scarce situations, where annotated data are not sufficiently many. In such situations, the compositionality of information, an inherent characteristic of the entities encountered in visual domains, can be exploited by applying reasoning mechanisms.

Complex Query Answering

Probably the field that highlights most clearly the needs and challenges faced by a CV system in answering complex queries about a visual scene is the field of Visual Question Answering (VQA). VQA was recently introduced as a collection of benchmark image-based open-domain questions that, in order to be answered, call for a deep understanding of the visual setting. VQA goes beyond traditional CV since, apart from image analysis, the proposed methods also apply a repertoire of AI techniques, such as Natural Language Processing, in order to correctly analyze the textual form of the question, and inferencing, in order to interpret the purpose and intentions of the entities acting in the scene (Krishna et al. 2017). The challenges posed by this field are complex and multifaceted, a fact which is also demonstrated by the rather poor performance of state-of-the-art systems in comparison to humans. VQA is probably the area of CV that has drawn the most inspiration from symbolic AI approaches to date.

An indicative example is the approach recently presented by Wu et al. (2018), who introduced a VQA model combining observations obtained from the image with information extracted from a general KB, namely DBpedia. Given an image-question pair, a CNN is utilized to predict a set of attributes from the image, i.e., the most recognizable objects in the image in terms of clarity and size. Consequently, a series of captions based on the attributes is generated, which is then used to extract relevant information from DBpedia through appropriately formulated queries. In a similar style, in (Narasimhan and Schwing 2018) an external RDF repository is used to retrieve properties of visual concepts, such as category, used for, created by, etc. The technique utilizes a Graph Convolution Network (GCN), a variation of GNN, before producing an answer. In both cases, the ablation analysis reveals the impact of the KB in improving performance.
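As an illustration of this kind of retrieval step, the sketch below queries the public DBpedia SPARQL endpoint for the textual description of a predicted attribute, using the SPARQLWrapper Python library; the helper name and the query shape are ours, not the exact queries formulated by the cited systems.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_comment(concept: str, lang: str = "en"):
    """Fetch the rdfs:comment abstracts of a DBpedia resource.
    Assumes the dbr: and rdfs: prefixes predefined on the endpoint."""
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        SELECT ?comment WHERE {{
            dbr:{concept} rdfs:comment ?comment .
            FILTER (lang(?comment) = "{lang}")
        }}""")
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [r["comment"]["value"] for r in rows]

# A detected attribute such as "Dog" can thus be grounded in external
# encyclopedic knowledge before answer generation.
print(dbpedia_comment("Dog")[0][:80])
```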
Other types of questions in VQA require inferencing about the properties of the objects depicted in an image. For example, queries such as "How is the man going to work?" or more complex queries, such as "When did the plane land?", have been the subject of the study presented by Krishna et al. (2017), who introduced the Visual Genome dataset and a VQA method. In fact, this is one of the first studies to bring a model trained on an RDF-based scene graph that had good recall results on all What, Where, When, Who, Why, How queries. Even further, Su et al. (2018) introduced the Visual Knowledge Memory Network (VKMN), in order to handle questions whose answers cannot be directly inferred from the image's visual content but require reasoning over structured human knowledge.

The importance of capturing the semantic knowledge in VQA collections led also to the creation of the Relation-VQA dataset (Lu et al. 2018), which extends Visual Genome with a special module measuring the semantic similarity of images. In contrast to methods mining only concepts or attributes, this model extracts relation facts related to both concepts and attributes. The experimental evaluation conducted on the VQA and COCO datasets showed that the method outperformed other state-of-the-art ones. Moreover, the ablation studies show that the incorporated semantic knowledge was crucial for the performance of the network.

Despite its increasing popularity, the VQA field is still hard to confront. The generality of existing methods is also questioned (Goyal et al. 2019). Developing generic solutions, less tightly coupled to specific datasets, will definitely benefit the pursuit towards broader visual intelligence.

Visual Reasoning

A task related to VQA that has gained popularity in recent years is that of Visual Reasoning (VR). In this case, the questions that have to be answered are more complex and require a multi-step reasoning procedure. For example, given an image containing objects of different shapes and colors, the task of recognizing the color of an object of a certain shape that lies in a certain area w.r.t. the position of another object of a certain shape and color falls into the category of VR (in this case, first the "source" object must be detected, then the "target" object, and, finally, its color must be recognized). Similar to the case of VQA, a number of VR works have drawn inspiration from symbolic AI-based ideas.

In general, many VR works are based on Neural Module Networks (NMNs), which are NNs of adaptable architecture, the topology of which is determined by the parsing of the question that has to be answered. NMNs simplify complex questions into simpler sub-questions (sub-tasks), which can be more easily addressed. The modules that constitute the NMNs are pre-defined neural networks that implement the functions required for tackling the sub-tasks, and are assembled into a layout dynamically. Central to many NMNs is the utilization of prior symbolic (structured) knowledge, which facilitates the handling of the sub-tasks.

Hu et al. (2017) propose End-to-End Module Networks as a variation of NMNs. The network first uses coarse functional expressions describing the structure of the computation required for the answering and then refines them according to the textual input, in order to assemble the network. For example, for the question "How many other objects of the same size as the purple cube exist?", a crude functional expression for counting and relocating would first be predicted as relevant to the answering of the question, which would subsequently be refined by the parameters from text analysis (in this case, one such parameter is the color of the cube).
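A toy rendering of this decomposition idea in plain Python is given below; the scene encoding, module names and hand-written layout are invented for illustration, whereas real NMNs learn to predict the layout and implement each module as a small neural network.

```python
# Symbolic toy scene; a real system would get this from a detector.
scene = [
    {"shape": "cube", "color": "purple", "size": "large"},
    {"shape": "sphere", "color": "red", "size": "large"},
    {"shape": "cube", "color": "blue", "size": "small"},
]

def filter_objs(objs, **attrs):           # module: keep matching objects
    return [o for o in objs if all(o[k] == v for k, v in attrs.items())]

def same_size(objs, anchor):              # module: relate via an attribute
    return [o for o in objs if o is not anchor and o["size"] == anchor["size"]]

def count(objs):                          # module: terminal answer
    return len(objs)

# Layout for "How many other objects of the same size as the purple
# cube exist?", composed by hand here instead of by a layout predictor.
anchor = filter_objs(scene, shape="cube", color="purple")[0]
answer = count(same_size(scene, anchor))
print(answer)                             # -> 1 (the large sphere)
```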
Similarly, Johnson et al. (2017) propose a variation of NMNs which is based on the concept of programs. Programs are symbolic structures of a certain specification, written in a Domain-Specific Language and defined by a syntax and semantics. In the context of VR, programs describe a sequence of functions that must be executed in order for an answer to be computed. During testing on the CLEVR dataset, the model exhibited notable performance, generalizing better in a variety of settings, such as for new question types and human-posed questions. Building on the notion of programs, Yi et al. (2018) further incorporated knowledge regarding the structural scene representation of the image. The method achieved near-perfect accuracy, while also providing transparency to the reasoning process.

An alternative NN-based approach for VR is found in (Santoro et al. 2017), where the incorporation of Relation Networks (RNs) in CNN and Long Short-Term Memory (LSTM) architectures is proposed. RNs are architectures whose computations focus explicitly on relational reasoning and are characterized by three important features: they can infer relations, they are data efficient, and they operate on a set of objects, a flexible symbolic input format that is agnostic to the kind of inputs it receives. For example, an object could correspond to the background, to a particular physical object, a texture, conjunctions of physical objects, etc.

To conclude, it is worth indicating also a recent trend in visual explanation approaches that couples data-driven systems with Answer Set Programming (ASP). ASP is a non-monotonic logical formalism oriented towards hard search problems. A number of studies have emerged that combine ASP abductive or inductive reasoning for the VQA domain, especially for cases when training data are not many (see e.g., (Suchan et al. 2017; Riley and Sridharan 2019; Basu, Shakerin, and Gupta 2020)).
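For flavor, the following sketch grounds and solves a tiny ASP program through clingo's Python API; the scene facts and the default rule are invented for the example and are far simpler than the encodings used in the cited studies.

```python
import clingo

# Illustrative scene facts plus a non-monotonic default: an object is
# assumed visible unless it is known to be occluded.
PROGRAM = """
object(o1). shape(o1, cube).  left_of(o1, o2).
object(o2). shape(o2, sphere).
visible(X) :- object(X), not occluded(X).
occluded(o2).
#show visible/1.
"""

ctl = clingo.Control()
ctl.add("base", [], PROGRAM)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print(m))   # prints: visible(o1)
```

Adding the fact occluded(o1) would retract visible(o1) without touching the rule, which is exactly the non-monotonic behavior that makes ASP attractive for scenes with partial observability.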
The Web as a Problem-Agnostic Source of Data

As the recent renaissance in AI is partly due to the availability of big volumes of training data, along with the computational power to analyze them, it is only reasonable to expect that data-driven approaches will turn their attention to the Web in order to collect the data needed. Although the benefits mentioned in the previous sections are still achievable, the challenges faced when using a Web repository rather than a custom-made KB are now different.

The vast majority of large-scale Web repositories are not problem-specific, containing a lot of irrelevant information for an ML system to be trained correctly. For the time being, ML systems are highly specific, excelling only when trained for a particular task and tested under conditions similar to the training ones. As a result, state-of-the-art approaches try to rely on the semantics of structured KBs, in order to filter out noisy or irrelevant knowledge, by integrating external knowledge when visual information is not sufficiently reliable for conclusion making.

Exploitation of Web-based Knowledge Graphs and Semantic Repositories

There exists a multitude of studies that use external knowledge from structured or semi-structured Web resources, in order to answer visual queries or to perform cognitive tasks. A characteristic example is found in (Li, Su, and Zhu 2017), where the ConceptNet KG, a semantic repository of commonsense Linked Open Data, is used to answer open-domain questions on entities, such as "What is the dog's favorite food?". The approach proceeds in a step-wise manner: first, visual objects and keywords are extracted from an image, using a Fast-RCNN for the objects and an LSTM for the syntactical analysis; then, queries to ConceptNet provide properties and values for the entities found in the image. When an answer is considered correct, a Dynamic Memory Network, which is an embedding vector space that contains vector representations of symbolic knowledge triples, is renewed for future encounters of the same query.
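Retrieving such properties is straightforward through ConceptNet's public REST API, documented at api.conceptnet.io; the minimal sketch below (helper name ours) fetches the edges of a given relation for a detected concept.

```python
import requests

def conceptnet_edges(concept: str, relation: str):
    """Query ConceptNet for edges of one relation starting at a concept;
    only the fields needed for the example are extracted."""
    resp = requests.get("https://api.conceptnet.io/query",
                        params={"start": f"/c/en/{concept}",
                                "rel": f"/r/{relation}"})
    return [(e["end"]["label"], e["weight"]) for e in resp.json()["edges"]]

# Properties such as UsedFor or AtLocation can then support answering
# open-domain questions about entities detected in the image.
print(conceptnet_edges("fork", "UsedFor")[:3])
```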
In a rather similar style, Wu et al. (2016a) extract properties from DBpedia, by retrieving and performing semantic analysis on the comment boxes of relevant Wikipedia pages. Here, a CNN performs object detection on the image, whereas a pre-trained RNN correlates attributes to sentence descriptions.

The approach presented in (Shah et al. 2019) is the first attempt to answer a more knowledge-intensive category of questions, such as "Who is to the left of Barack Obama?" or "Do all the people in the image have a common occupation?". These questions make reference to the named entities contained in an image, e.g., Barack Obama, White House, France, etc., and require large KBs to retrieve the relevant information. In this case, the authors choose Wikidata, an RDF repository. They first extract named entities and then try to connect them with a Wikidata entity using SPARQL queries. In addition, they extract spatial relations with other entities shown in the image and feed them to a Bi-LSTM. A multi-layered perceptron calculates the prediction for an answer, taking as input the output of the LSTM along with the SPARQL results.

Aligning Data Obtained from Diverse Online Sources

Entity resolution, also known as instance matching, concerns the task of identifying which entities across different KBs refer to the same individual. As the Web is growing in size, this problem is becoming crucial, especially in application domains that need to integrate and align knowledge obtained from various sources. An increasing number of CV studies face this problem, in an attempt to interpret visual information based on commonsense, non-visual knowledge.

Two characteristic approaches are given in (Chernova et al. 2017) and (Young et al. 2017), which try to assign labels to a visual scene using Bayesian Logic Networks (BLNs) and relying on commonsense knowledge. In (Chernova et al. 2017), knowledge is extracted from WordNet, ConceptNet, and Wikipedia. WordNet is utilized in order to disambiguate seed words returned by the CV annotator with the aid of their hypernyms. ConceptNet properties, such as IsLocatedIn or UsedFor, that may point to the location of an object, are also retrieved. With this method, the system can generate a compact semantic KB given only a small number of objects.

In (Young et al. 2017), a CNN trained on ImageNet is used to annotate objects recognized in images. The system is capable of assigning semantic categories to specific regions, by relying on DBpedia comment boxes to calculate the semantic relatedness between objects. As expected, high accuracy of such an approach is difficult to achieve, due to the diversity of information retrieved from DBpedia; consequently, smarter ways of identifying only the relevant part of the comment boxes need to be devised.

Exploitation of Commonsense Knowledge

Much of the information presented in a visual scene is not explicitly related with the features captured at the pixel level, but concerns observations implicitly depicted in images. Understanding the structure and dynamics of visual entities requires being able to interpret the semantic and commonsense (CS) features that are relevant, in addition to the low-level information obtained by photorealistic rendering techniques (Vedantam et al. 2015). This is a popular conclusion reached within the CV community in the pursuit towards achieving visual intelligence. There is a long line of studies that attempt to address the problem of extracting commonsense knowledge from visual scenes or, similarly, of utilizing commonsense inferences to improve scene understanding. In this section, we discuss state-of-the-art approaches that advance the field in these two directions.

Mining Commonsense Knowledge from Images

Even though ML is becoming part of many systems, it is still not able to easily capture CS knowledge from the perceived information. Additional techniques need to be devised to extract this valuable type of knowledge from visual scenes. A combination of textual and visual analysis, which extracts subject-predicate-object (SPO) triples about objects recognized in a scene, is addressed in certain studies, e.g., (Vedantam et al. 2015; Lin and Parikh 2015). ML classifiers for object recognition are trained on image datasets, while pre-trained NN classifiers help extract SPO triples, by considering both the entities identified by the classifiers and the textual description of the images.

In a different direction, in (Sadeghi, Kumar Divvala, and Farhadi 2015) the authors rely on Web images to verify the validity of simple phrases, such as "horses eat hay", analyzing the spatial consistency of the relative configurations of the entities and the relations involved. This unsupervised method is particularly interesting, due to the leverage it offers in automatically enriching CS repositories. In fact, the authors show how CV-based analysis can help improve recall in KBs, such as WordNet, Cyc and ConceptNet, offering a complementary and orthogonal source of evidence.

Aditya et al. (2018) address the problem of generating linguistic descriptions of images by utilizing a special type of graph, namely scene description graphs (SDGs). Such graphs are built by using both low-level information derived using perception methods and high-level features capturing CS knowledge stemming from the image annotations and lexical ontological knowledge from Web resources. SDGs produce object, scene and constituent detection tuples, accompanied by a confidence score; pre-processed background knowledge helps remove noise contained in the detection. A Bayesian Network is utilized, in order for the dependencies among co-occurring entities and knowledge regarding abstract visual concepts to be captured. Experimental evaluations of the method on the image-sentence alignment quality, i.e., how close the generated description is to the image being described, on the Flickr8k, Flickr30k and COCO datasets, showed that the method achieves comparable performance to previous state-of-the-art methods.
Commonsense Knowledge in Addressing OP Tasks

State-of-the-art CS-based methodologies improve the performance of a CV system mainly by taking into account textual descriptions about the entities found in a visual scene, or by retrieving semantic information from external sources that is relevant to the image and the task at hand.

A combination of external Web-based knowledge, text processing and vision analysis is at the core of the study presented in (Wang et al. 2018). The framework annotates objects with a Fast-RCNN trained over the MS COCO dataset. The extracted entities are enriched with (i) knowledge retrieved from Wikipedia, in order to perform entity classification; (ii) knowledge from WebChild, attempting a comparative analysis between relevant entities; and (iii) CS knowledge obtained from ConceptNet, to create a semantically rich description. The enriched entity is stored in an RDF graph and is used to address a variety of tasks. For instance, the framework has achieved improved accuracy in VQA benchmarks, but it can also be used to generate explanations for its answers. Prominent recent studies, as in (Li et al. 2019) and (Narasimhan and Schwing 2018), also build on the direction of combining textual and visual analysis with the help of knowledge obtained from CS repositories.

Another problem that researchers try to address with the help of CS knowledge is the sparsity of categorical variables in the training datasets. For example, Ramanathan et al. (2015) utilize a neural network framework that uses different types of cues (linguistic, visual and logical) in the context of human action identification. Similarly, Lu et al. (2016) exploit language priors extracted from the semantic features of an image, in order to facilitate the understanding of visual relationships. The proposed model combines a visual module tailored to the learning of visual appearance models for objects and predicates with a language module capable of detecting semantically related relationships.

More recently, Gu et al. (2019) utilize commonsense knowledge stemming from an external KB in the context of scene graph generation. Namely, a special knowledge-based feature refinement module is used, which incorporates CS knowledge from ConceptNet for the prediction of object labels, consisting of triplets containing the top-K corresponding relationships, the object entity and a weight corresponding to the frequency of the triplet. This strategy, aiming to address the long-tail distribution of relationships, differentiates the approach from the linguistic-based ones described previously, managing to showcase improvement in generalizability and accuracy.

CS knowledge is also used to tackle other CV problems, such as understanding relevant information about unknown objects existing in a visual scene. In (Icarte et al. 2017) or (Young et al. 2016), for instance, external CS Web-based repositories are used as a source for locating relevant information. The general idea in both approaches is to retrieve as much information as possible about the recognizable objects that, based on diverse metrics, are considered semantically close to the unknown ones. RelatedTo, IsA and UsedFor properties found in ConceptNet, or comment boxes retrieved from DBpedia, are all relevant knowledge that can be used for developing semantic similarity measures. Similar, to some extent, is the approach presented in (Ruiz-Sarmiento, Galindo, and Gonzalez-Jimenez 2016), which relies on RDF graphs with a probabilistic distribution over relations to capture the CS knowledge, but reverts also to a human-supervised learning approach whenever unknown objects are encountered.
Ability to Learn New Knowledge

The majority of state-of-the-art studies covered in the previous sections exploit a loosely-coupled combination of ML and knowledge-based methodologies. A tighter integration of methodologies of the two fields is expected to achieve much broader impact, especially in the process of learning. In the sequel, we consider prominent attempts towards this direction, originating either from a model-free standpoint or from a more declarative, inductive-based perspective.

Model-Free Learning

Recent studies devise methods that attempt to exploit information contained in higher-level representations, in order to improve scalability and generalization for tasks such as Zero-Shot Learning (ZSL). ZSL is the problem of recognizing objects for which no visual examples have been obtained and is typically achieved by exploring a semantic embedding space, e.g., an attribute or semantic word vector space.

For example, Fu et al. (2015) utilize a semantic class label graph, which results in a more accurate distance metric in the semantic embedding space and an improved performance in ZSL. Likewise, Xian et al. (2016) address the same problem by proposing a novel latent embedding model, which learns a compatibility function between the image and semantic (class) embeddings. The model utilizes image and class-level side-information that is either collected through human annotation or in an unsupervised way from a Web repository of text corpora.

Lee et al. (2018) propose a novel deep learning architecture for multi-label ZSL, which relies on KGs for the discovery of the relationships between multiple classes of objects. The KG is built on knowledge stemming from WordNet and contains three types of label relations: super-subordinate, positive correlation, and negative correlation. The KG is coupled to a GGNN-type module for predicting labels.

In the same vein, Wang, Ye and Gupta (2018) exploit the information contained in KGs about unseen objects, in order to infer visual attributes that enable their detection. The KG nodes correspond to semantic categories and the edges to semantic relationships, whereas the input to each node is the vector representation (semantic embedding) of each category. A GCN is used to transfer information between different categories. This way, by utilizing the semantic embeddings of a novel category, the method can link categories in the KG to familiar ones and, thus, infer its attributes. The experimental evaluation demonstrated a significant improvement on the ImageNet dataset, while the ablation studies indicated that the incorporation of KGs enabled the system to learn meaningful classifiers on top of semantic embeddings.
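A minimal numpy sketch of this propagation scheme is shown below: a two-layer GCN maps per-category word embeddings to classifier weights over a toy graph. The graph, shapes and random initialization are invented for illustration; the cited work additionally trains the stack so that the output rows for seen categories regress pretrained classifier weights, after which the rows for unseen categories can be used directly.

```python
import numpy as np

def normalize_adj(a):
    """Symmetric normalization (A+I) scaled by degree, as in standard GCNs."""
    a = a + np.eye(a.shape[0])
    d = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d[:, None] * d[None, :]

def gcn_forward(x, a_hat, weights):
    """Stacked propagation layers: H' = ReLU(A_hat @ H @ W)."""
    h = x
    for w in weights[:-1]:
        h = np.maximum(a_hat @ h @ w, 0.0)
    return a_hat @ h @ weights[-1]   # one classifier vector per category

# Toy setup: 4 categories linked in a chain, 300-d semantic embeddings
# as node inputs, 512-d visual classifiers as node outputs.
rng = np.random.default_rng(2)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 300))
Ws = [rng.normal(scale=0.01, size=(300, 128)),
      rng.normal(scale=0.01, size=(128, 512))]
classifiers = gcn_forward(X, normalize_adj(A), Ws)   # shape (4, 512)
```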
In (Marino, Salakhutdinov, and Gupta 2017), the use of structured prior knowledge led to improved performance on the task of multi-label image classification. The KG is built using WordNet for the concepts and Visual Genome for the relations among them. An interesting aspect of this study is the introduction of a novel NN architecture, the Graph Search Neural Network, as a means to efficiently incorporate large knowledge graphs, in order to be exploited for CV tasks.

Inductive Learning

The benefits of developing intelligent visual components with reasoning and learning abilities are becoming evident in domains broader than CV, such as the field of Robotics. This conclusion was nicely demonstrated in a recent special issue of the AI Journal (Rajan and Saffiotti 2017), where causality-based reasoning emerged as a key contribution. It is, therefore, interesting to investigate how the recent trend of combining knowledge-based representations with model-free models for the development of intelligent robots is making an impact in related OP research.

A highly prominent line of research for modeling uncertainty and high-level action knowledge is focusing on combining expressive logical probabilistic formalisms, ontological models and ML. In (Antanas et al. 2018a), for example, the system learns probabilistic first-order rules describing relational affordances and pre-grasp configurations from uncertain video data. It uses the ProbFOIL+ rule learner, along with a simple ontology capturing object categories.
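The flavor of such probabilistic first-order rules can be reproduced with the ProbLog Python package; the rules, probabilities and objects below are hand-written for illustration and are not the ones learned by ProbFOIL+ in the cited work.

```python
from problog.program import PrologString
from problog import get_evaluatable

# Invented relational affordance rules: a cup is very likely graspable
# when upright, much less so otherwise.
MODEL = """
0.9::graspable(X) :- cup(X), upright(X).
0.4::graspable(X) :- cup(X), \\+ upright(X).
cup(obj1). upright(obj1).
cup(obj2).
query(graspable(obj1)).
query(graspable(obj2)).
"""

result = get_evaluatable().create_from(PrologString(MODEL)).evaluate()
for query, prob in result.items():
    print(query, prob)   # graspable(obj1): 0.9, graspable(obj2): 0.4
```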
More recently, Moldovan et al. (2018) significantly extended this approach, using the Distributional Clauses (DCs) formalism, which integrates logic programming and probability theory. DCs can use both continuous and discrete variables, which is highly appropriate for modeling uncertainty, in comparison for instance to ProbLog, which is commonly found in the relevant literature. Compared to approaches that model affordances with Bayesian Networks, this approach scales much better but, most importantly, due to its relational nature, structural parts of the theory, such as the abstract action-effect rules, can be transferred to similar domains without the need to be learned again.

A similar objective is pursued by Katzouris et al. (2019), who propose an abductive-inductive incremental algorithm for learning and revising causal rules, in the form of Event Calculus programs. The Event Calculus is a highly expressive, non-monotonic formalism for capturing causal and temporal relations in dynamic domains. The approach uses the XHAIL system as a basis, but sacrifices completeness due to its incremental nature. Yet, it is able to learn weighted causal temporal rules, in the form of Markov Logic Networks, scaling up to large volumes of sequential data with a time-like structure.

Also worth mentioning is the study of Antanas et al. (2018b), which, instead of learning how to map visual perceptions to task-dependent grasps, uses a probabilistic logic module to semantically reason about the most likely object part to be grasped, given the object properties and task constraints. The approach models rules in Causal Probabilistic logic, implemented in ProbLog, in order to reason about object categories, about the most affordable tasks and about the best semantic pre-grasps.

Open Problems and Research Questions

The review of the state of the art reveals prominent solutions for various OP-related topics, as well as novel contributions that offer new insights (Table 1). The analysis can also help frame open questions towards combining ML and knowledge-based approaches in the given context.

Obtaining Human Commonsense

The exploitation of CS knowledge is a characteristic example of a still open research area. Its significance was acknowledged more than two decades ago, and the research conducted over the years contributed methods that combine the strengths of diverse fields of AI. At the same time, it is evident that there is still a long way to go; just the coupling of textual and visual embeddings, the mainstream in current VQA-related studies, has proven to be a challenging task. Further directions need to also be explored, such as performing complex forms of CS inferencing or fusing the huge volume of general knowledge that exists on the Web, while eliminating the bias of information found online.

Progress in the field of learning from demonstration can prove a vital contribution to CS inferencing and vice versa. Leaving the visual challenges involved aside, this application domain, characterized by the central role of (mostly human) agents, offers theory-building opportunities on diverse perspectives. Interaction with human users calls for intuitive means of communication, where high-level, declarative languages seem to offer a natural way of capturing human intuition. Transferring knowledge between high-level languages and low-level models is a key area of investigation for future symbiotic systems and a fruitful domain for combining data-driven and symbolic approaches.

Understanding Causality

Still, the most demanding outcomes that are expected by the integration of knowledge-based and ML methodologies concern the aspects of causality learning and explainability. Existing works on harvesting causality knowledge do not yet offer convincing models. As argued in (Pearl 2018), ML needs to go beyond the detection of associations, in order to exhibit explainability and counterfactual reasoning.

The black-box character of ML-based methods hinders the understanding of their behavior, and eventually the acceptance of such systems. For example, recent studies demonstrate the fundamental inability of neural networks to efficiently and robustly learn visual relations, which renders the high performance that networks of this type often achieve worth a closer investigation (Kim, Ricci, and Serre 2018; Rosenfeld, Zemel, and Tsotsos 2018). Advancement in exploiting CS knowledge is expected to offer significant leverage in understanding and reasoning with causal relations. And, of course, transparent reasoning is vital in understanding the abilities and constraints of existing systems. Yet, as indicated in the current review, this latter direction is still not pursued in a coordinated and structured way.

Achieving a Tighter Integration

Ultimately, unifying logical and probabilistic graphical models seems to be at the heart of handling the majority of real-world problems. Recent studies show that even a loosely-coupled integration can achieve better accuracy in classification problems with small datasets in comparison with end-to-end deep networks, and comparable accuracy with larger datasets (see e.g., (Riley and Sridharan 2019; Basu, Shakerin, and Gupta 2020)). A tighter integration is highly anticipated, as it will help build systems that learn from data, while still being able to generalize to domains other than the ones trained for. Existing solutions are indeed promising, as for example approaches based on the widely used Markov Logic, which nevertheless introduces limitations on both the theoretical and the practical level (Domingos and Lowd 2019). Its first-order nature, for instance, often contradicts the non-monotonicity met in CS domains. The support for complex tasks, such as causal, temporal or counterfactual reasoning, in a non-monotonic fashion and over rich conceptual representations, unfolds a series of research questions worth exploring in the near future.
Table 1: Overview of the reviewed literature

| Indicative Recent Literature | CV Problem Focus | ML Methods Applied | KB Methods Applied | KB Contribution | KB-ML Impact |
| --- | --- | --- | --- | --- | --- |
| (Chuang et al. 2018), (Ye et al. 2017), (Sawatzky et al. 2019), (Chao et al. 2015), (Ramanathan et al. 2015) | affordance detection | CNN, GNN, GGNN | Knowledge Graphs | 3, 4, 5, 7 | offers new insights |
| (Beetz et al. 2018), (Ramirez-Amaro, Beetz, and Cheng 2017), (Lemaignan et al. 2017), (Agostini, Torras, and Woergoetter 2017), (Moldovan et al. 2018) | affordance detection | scoring functions, probabilistic programming models, Bayesian Networks | OWL Ontology | 1, 2, 3, 4, 5, 6, 9 | offers new insights and improves SotA |
| (Icarte et al. 2017), (Redmon and Farhadi 2017), (Liu et al. 2018b) | object detection | RCNN, CNN | Knowledge Graph, BLN | 1, 3, 4, 5, 8 | offers new insights |
| (Gemignani et al. 2016), (Tenorth and Beetz 2017), (Young et al. 2016), (Beetz et al. 2018) | object detection | scoring functions, probabilistic programming models | OWL Ontology, DL, MLN | 1, 2, 3, 4, 5, 8, 9 | improves SotA |
| (Chernova et al. 2017), (Young et al. 2017), (Aditya et al. 2018) | scene understanding | probabilistic programming, Bayesian Network | BLN | 2, 3, 4, 8 | offers new insights |
| (Gu et al. 2019), (Li et al. 2017), (Chen et al. 2018) | scene understanding | GGNN | Knowledge Graph | 3, 4, 7 | improves SotA |
| (Krishna et al. 2017), (Zhu et al. 2015a), (Li et al. 2019), (Wu et al. 2016a), (Wu et al. 2018), (Li, Su, and Zhu 2017), (Sadeghi, Kumar Divvala, and Farhadi 2015), (Shah et al. 2019), (Su et al. 2018), (Narasimhan and Schwing 2018), (Wang et al. 2018) | VQA | CNN, LSTM, RCNN | Knowledge Graphs (RDF mostly) | 1, 2, 3, 4, 5, 8 | offers new insights and improves SotA |
| (Vedantam et al. 2015), (Lin and Parikh 2015) | VQA | Gaussian Mixture Model, SVM | RDF Graph | 2, 3, 4, 5 | improves SotA |
| (Lu et al. 2018) | VQA | Gated Recurrent Unit Network | RDF Graph | 1, 2, 4, 8 | offers new insights |
| (Hu et al. 2017), (Johnson et al. 2017), (Yi et al. 2018), (Santoro et al. 2017) | visual reasoning | Neural Module Network | Symbolic Programming Language | 2, 3, 5 | offers new insights |
| (Suchan et al. 2017), (Riley and Sridharan 2019), (Basu, Shakerin, and Gupta 2020) | visual reasoning, VQA | CNN, RCNN | Non-monotonic logics, ASP | 2, 3, 4, 5, 6, 7, 9 | offers new insights |
| (Marino, Salakhutdinov, and Gupta 2017), (Lee et al. 2018), (Wang, Ye, and Gupta 2018) | image classification / zero-shot recognition | GGNN, GCN | Knowledge Graph, RDF Graph | 1, 2, 5 | offers new insights and improves SotA |
| (Fu et al. 2015), (Xian et al. 2016) | image classification / zero-shot recognition | Latent embedding model, Markov Chain Process | Knowledge Graph | 1, 2, 5 | offers new insights |
| (Antanas et al. 2018a), (Antanas et al. 2018b), (Moldovan et al. 2018), (Katzouris et al. 2019) | affordance learning | scoring functions, probabilistic programming models | FOL, Causal Probabilistic Logic, MLN, Event Calculus | 1, 2, 3, 6, 7, 9 | improves SotA |

KB Contribution: 1: concept abstraction/reuse, 2: complex data querying, 3: spatial reasoning, 4: contextual reasoning, 5: relational reasoning, 6: temporal reasoning, 7: causal reasoning, 8: access to open-domain knowledge, 9: formal semantics.

Conclusions

In this paper, we reviewed approaches that rely on both knowledge-based and data-driven methods, in order to offer solutions to the field of intelligent object perception. By adopting a knowledge-driven, rather than a problem-specific, grouping, we analyzed a multitude of approaches that attempt to unify high-level knowledge with diverse machine learning systems. The review revealed open and prominent directions, showing clear evidence that hybrid methods constitute an avenue worth exploring.

References

[Aditya et al. 2018] Aditya, S.; Yang, Y.; Baral, C.; Aloimonos, Y.; and Fermüller, C. 2018. Image understanding using vision and reasoning through scene description graph. Computer Vision and Image Understanding 173:33–45.
[Aditya, Yang, and Baral 2019] Aditya, S.; Yang, Y.; and Baral, C. 2019. Integrating knowledge and reasoning in image understanding. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 6252–6259. International Joint Conferences on Artificial Intelligence Organization.
[Agostini, Torras, and Woergoetter 2017] Agostini, A.; Torras, C.; and Woergoetter, F. 2017. Efficient interactive decision-making framework for robotic applications. Artificial Intelligence 247:187–212.
[Antanas et al. 2018a] Antanas, L.; Dries, A.; Moreno, P.; and De Raedt, L. 2018a. Relational affordance learning for task-dependent robot grasping. In Lachiche, N., and Vrain, C., eds., Inductive Logic Programming, 1–15. Cham: Springer International Publishing.
[Antanas et al. 2018b] Antanas, L.; Moreno, P.; Neumann, M.; de Figueiredo, R. P.; Kersting, K.; Santos-Victor, J.; and De Raedt, L. 2018b. Semantic and geometric reasoning for robotic grasping: a probabilistic logic approach. Autonomous Robots 1–26.
[Basu, Shakerin, and Gupta 2020] Basu, K.; Shakerin, F.; and Gupta, G. 2020. AQuA: ASP-based visual question answering. In Komendantskaya, E., and Liu, Y. A., eds., Practical Aspects of Declarative Languages, 57–72. Springer International Publishing.
[Beetz et al. 2018] Beetz, M.; Beßler, D.; Haidu, A.; Pomarlan, M.; Bozcuoğlu, A. K.; and Bartels, G. 2018. KnowRob 2.0: a 2nd generation knowledge processing framework for cognition-enabled robotic agents. In 2018 IEEE ICRA, 512–519. IEEE.
[Bengio et al. 2019] Bengio, Y.; Deleu, T.; Rahaman, N.; Ke, R.; Lachapelle, S.; Bilaniuk, O.; Goyal, A.; and Pal, C. 2019. A meta-transfer objective for learning to disentangle causal mechanisms. arXiv preprint arXiv:1901.10912.
[Brachman and Levesque 2004] Brachman, R., and Levesque, H. 2004. Knowledge Representation and Reasoning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
[Chao et al. 2015] Chao, Y. W.; Wang, Z.; He, Y.; Wang, J.; and Deng, J. 2015. HICO: A benchmark for recognizing human-object interactions in images. In IEEE ICCV, 1017–1025.
[Chen et al. 2018] Chen, X.; Li, L. J.; Fei-Fei, L.; and Gupta, A. 2018. Iterative visual reasoning beyond convolutions. In IEEE CVPR, 7239–7248.
[Chernova et al. 2017] Chernova, S.; Chu, V.; Daruna, A.; Garrison, H.; Hahn, M.; Khante, P.; Liu, W.; and Thomaz, A. 2017. Situated Bayesian reasoning framework for robots operating in diverse everyday environments. In International Symposium on Robotics Research (ISRR).
[Chuang et al. 2018] Chuang, C. Y.; Li, J.; Torralba, A.; and Fidler, S. 2018. Learning to act properly: Predicting and explaining affordances from images. In IEEE CVPR, 975–983.
[Daruna et al. 2019] Daruna, A.; Liu, W.; Kira, Z.; and Chernova, S. 2019. RoboCSE: Robot common sense embedding. arXiv preprint arXiv:1903.00412.
[Davis and Marcus 2015] Davis, E., and Marcus, G. 2015. Commonsense reasoning and commonsense knowledge in artificial intelligence. Commun. ACM 58(9):92–103.
[Dean, Allen, and Aloimonos 1995] Dean, T.; Allen, J.; and Aloimonos, Y. 1995. Artificial Intelligence: Theory and Practice. Redwood City, CA, USA: Benjamin-Cummings Publishing Co., Inc.
[Deng et al. 2014] Deng, J.; Ding, N.; Jia, Y.; Frome, A.; Murphy, K.; Bengio, S.; Li, Y.; Neven, H.; and Adam, H. 2014. Large-scale object classification using label relation graphs. In ECCV, 48–64. Springer.
[Domingos and Lowd 2019] Domingos, P., and Lowd, D. 2019. Unifying logical and statistical AI with Markov logic. Communications of the ACM 62(7):74–83.
[Fu et al. 2015] Fu, Z.; Xiang, T.; Kodirov, E.; and Gong, S. 2015. Zero-shot object recognition by semantic manifold distance. In IEEE CVPR, 2635–2644.
[Geffner 2018] Geffner, H. 2018. Model-free, model-based, and general intelligence. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI'18, 10–17. AAAI Press.
[Gemignani et al. 2016] Gemignani, G.; Capobianco, R.; Bastianelli, E.; Bloisi, D. D.; Iocchi, L.; and Nardi, D. 2016. Living with robots: Interactive environmental knowledge acquisition. Robotics and Autonomous Systems 78:1–16.
[Goyal et al. 2019] Goyal, Y.; Khot, T.; Agrawal, A.; Summers-Stay, D.; Batra, D.; and Parikh, D. 2019. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. IJCV 127(4):398–414.
Knowledge Representation and [Herath, Harandi, and Porikli 2017] Herath, S.; Harandi, M.; Reasoning. San Francisco, CA, USA: Morgan Kaufmann and Porikli, F. 2017. Going deeper into action recognition: Publishers Inc. A survey. IMAVIS 60:4–21. [Hu et al. 2017] Hu, R.; Andreas, J.; Rohrbach, M.; Darrell, [Liu et al. 2018b] Liu, Y.; Wang, R.; Shan, S.; and Chen, T.; and Saenko, K. 2017. Learning to Reason: End-to-End X. 2018b. Structure Inference Net: Object Detection Us- Module Networks for Visual Question Answering. IEEE ing Scene-Level Context and Instance-Level Relationships. ICCV 2017-Octob(Figure 1):804–813. IEEE CVPR 6985–6994. [Icarte et al. 2017] Icarte, R. T.; Baier, J. A.; Ruz, C.; and [Lu et al. 2016] Lu, C.; Krishna, R.; Bernstein, M.; and Fei- Soto, A. 2017. How a general-purpose commonsense ontol- Fei, L. 2016. Visual relationship detection with language ogy can improve performance of learning-based image re- priors. Lecture Notes in Computer Science (including sub- trieval. arXiv preprint arXiv:1705.08844. series Lecture Notes in Artificial Intelligence and Lecture [Johnson et al. 2017] Johnson, J.; Hariharan, B.; van der Notes in Bioinformatics) 9905 LNCS(Figure 2):852–869. Maaten, L.; Hoffman, J.; Fei-Fei, L.; Lawrence Zitnick, C.; [Lu et al. 2018] Lu, P.; Ji, L.; Zhang, W.; Duan, N.; Zhou, and Girshick, R. 2017. Inferring and executing programs M.; and Wang, J. 2018. R-VQA: Learning visual relation for visual reasoning. In IEEE ICCV, 2989–2998. facts with semantic attention for visual question answering. [Katzouris et al. 2019] Katzouris, N.; Michelioudakis, E.; Proceedings of the ACM SIGKDD International Conference Artikis, A.; and Paliouras, G. 2019. Online learning of on Knowledge Discovery and Data Mining 1880–1889. weighted relational rules for complex event recognition. In [Marino, Salakhutdinov, and Gupta 2017] Marino, K.; Berlingerio, M.; Bonchi, F.; Gärtner, T.; Hurley, N.; and Salakhutdinov, R.; and Gupta, A. 2017. The more you Ifrim, G., eds., Machine Learning and Knowledge Discov- know: using knowledge graphs for image classification. ery in Databases, 396–413. Cham: Springer International IEEE CVPR 2017-Janua:20–28. Publishing. [Moldovan et al. 2018] Moldovan, B.; Moreno, P.; Nitti, D.; [Kim, Ricci, and Serre 2018] Kim, J.; Ricci, M.; and Serre, Santos-Victor, J.; and De Raedt, L. 2018. Relational af- T. 2018. Not-so-clevr: learning same–different rela- fordances for multiple-object manipulation. Autonomous tions strains feedforward neural networks. Interface focus Robots 42(1):19–44. 8(4):20180011. [Narasimhan and Schwing 2018] Narasimhan, M., and [Krishna et al. 2017] Krishna, R.; Zhu, Y.; Groth, O.; John- Schwing, A. G. 2018. Straight to the facts: Learning knowl- son, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, edge base retrieval for factual visual question answering. In L. J.; Shamma, D. A.; Bernstein, M. S.; and Fei-Fei, L. 2017. Proceedings of the ECCV (ECCV), 451–468. Visual Genome: Connecting Language and Vision Using [Nickel et al. 2016] Nickel, M.; Murphy, K.; Tresp, V.; and Crowdsourced Dense Image Annotations. IJCV 123(1):32– Gabrilovich, E. 2016. A review of relational machine 73. learning for knowledge graphs. Proceedings of the IEEE [Lee et al. 2018] Lee, C. W.; Fang, W.; Yeh, C. K.; and 104(1):11–33. Wang, Y. C. F. 2018. Multi-label Zero-Shot Learning with [Pearl 2018] Pearl, J. 2018. Theoretical impediments to ma- Structured Knowledge Graphs. IEEE CVPR 1576–1585. chine learning with seven sparks from the causal revolution. 
[Lemaignan et al. 2017] Lemaignan, S.; Warnier, M.; Sisbot, arXiv preprint arXiv:1801.04016. E. A.; Clodic, A.; and Alami, R. 2017. Artificial cognition [Rajan and Saffiotti 2017] Rajan, K., and Saffiotti, A., eds. for social human–robot interaction: An implementation. Ar- 2017. Special Issue on AI and Robotics, volume 247. El- tificial Intelligence 247:45–69. sevier. 1–440. [Li et al. 2015] Li, Y.; Tarlow, D.; Brockschmidt, M.; and [Ramanathan et al. 2015] Ramanathan, V.; Li, C.; Deng, J.; Zemel, R. 2015. Gated graph sequence neural networks. and Han, W. 2015. Learning semantic relationships for bet- arXiv preprint arXiv:1511.05493. ter action retrieval in images ( Supplementary ). Computer [Li et al. 2017] Li, R.; Tapaswi, M.; Liao, R.; Jia, J.; Urtasun, Vision and Pattern Recognition 1–4. R.; and Fidler, S. 2017. Situation recognition with graph [Ramirez-Amaro, Beetz, and Cheng 2017] Ramirez-Amaro, neural networks. In IEEE ICCV, 4173–4182. K.; Beetz, M.; and Cheng, G. 2017. Transferring skills [Li et al. 2019] Li, H.; Wang, P.; Shen, C.; and Hengel, A. to humanoid robots by extracting semantic representations v. d. 2019. Visual question answering as reading compre- from observations of human activities. Artificial Intelligence hension. In IEEE CVPR, 6319–6328. 247:95–118. [Li, Su, and Zhu 2017] Li, G.; Su, H.; and Zhu, W. 2017. [Ravichandar et al. 2019] Ravichandar, H.; Polydoros, A. S.; Incorporating external knowledge to answer open-domain Chernova, S.; and Billard, A. 2019. Robot learning from visual questions with dynamic memory networks. arXiv demonstration: A review of recent advances. Annual Review preprint arXiv:1712.00733. of Control, Robotics, and Autonomous Systems In Press. [Lin and Parikh 2015] Lin, X., and Parikh, D. 2015. Don’t [Redmon and Farhadi 2017] Redmon, J., and Farhadi, A. just listen, use your imagination: Leveraging visual common 2017. Yolo9000: better, faster, stronger. In IEEE CVPR, sense for non-visual tasks. In IEEE CVPR, 2984–2993. 7263–7271. [Liu et al. 2018a] Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; [Riley and Sridharan 2019] Riley, H., and Sridharan, M. Chen, J.; Liu, X.; and Pietikäinen, M. 2018a. Deep learn- 2019. Integrating non-monotonic logical reasoning and in- ing for generic object detection: A survey. arXiv preprint ductive learning with deep learning for explainable visual arXiv:1809.02165. question answering. Frontiers in Robotics and AI 6:125. [Rosenfeld, Zemel, and Tsotsos 2018] Rosenfeld, A.; mon sense through visual abstraction. In IEEE ICCV, 2542– Zemel, R.; and Tsotsos, J. K. 2018. The elephant in the 2550. room. arXiv preprint arXiv:1808.03305. [Wang et al. 2018] Wang, P.; Wu, Q.; Shen, C.; Dick, A.; and [Ruiz-Sarmiento, Galindo, and Gonzalez-Jimenez 2016] van den Hengel, A. 2018. Fvqa: Fact-based visual question Ruiz-Sarmiento, J.-R.; Galindo, C.; and Gonzalez-Jimenez, answering. IEEE Trans. on PAMI 40(10):2413–2427. J. 2016. Probability and common-sense: Tandem towards [Wang, Ye, and Gupta 2018] Wang, X.; Ye, Y.; and Gupta, A. robust robotic object recognition in ambient assisted liv- 2018. Zero-Shot Recognition via Semantic Embeddings and ing. In Ubiquitous Computing and Ambient Intelligence. Knowledge Graphs. IEEE CVPR 6857–6866. Springer. 3–8. [Wu et al. 2016a] Wu, Q.; Wang, P.; Shen, C.; Dick, A.; and [Sadeghi, Kumar Divvala, and Farhadi 2015] Sadeghi, F.; van den Hengel, A. 2016a. Ask me anything: Free-form vi- Kumar Divvala, S. K.; and Farhadi, A. 2015. 
Viske: Visual sual question answering based on knowledge from external knowledge extraction and question answering by visual sources. In IEEE CVPR, 4622–4630. verification of relation phrases. In IEEE CVPR, 1456–1464. [Wu et al. 2016b] Wu, Z.; Fu, Y.; Jiang, Y.-G.; and Sigal, L. [Santoro et al. 2017] Santoro, A.; Raposo, D.; Barrett, D. 2016b. Harnessing object and scene semantics for large- G. T.; Malinowski, M.; Pascanu, R.; Battaglia, P.; and Lilli- scale video understanding. In TheIEEE CVPR. crap, T. 2017. A simple neural network module for relational [Wu et al. 2017] Wu, Q.; Teney, D.; Wang, P.; Shen, C.; Dick, reasoning. (Nips). A.; and van den Hengel, A. 2017. Visual question answer- [Sawatzky et al. 2019] Sawatzky, J.; Souri, Y.; Grund, C.; ing: A survey of methods and datasets. CVIU 163:21–40. and Gall, J. 2019. What Object Should I Use? - Task Driven [Wu et al. 2018] Wu, Q.; Shen, C.; Wang, P.; Dick, A.; and Object Detection. Van Den Hengel, A. 2018. Image Captioning and Vi- [Scarselli et al. 2008] Scarselli, F.; Gori, M.; Tsoi, A. C.; Ha- sual Question Answering Based on Attributes and External genbuchner, M.; and Monfardini, G. 2008. The graph neural Knowledge. IEEE Trans. on PAMI 40(6):1367–1381. network model. IEEE Transactions on NN 20(1):61–80. [Xian et al. 2016] Xian, Y.; Akata, Z.; Sharma, G.; Nguyen, [Shah et al. 2019] Shah, S.; Mishra, A.; Yadati, N.; and Q.; Hein, M.; and Schiele, B. 2016. Latent embeddings for Talukdar, P. P. 2019. Kvqa: Knowledge-aware visual ques- zero-shot classification. In IEEE CVPR, 69–77. tion answering. AAAI. [Ye et al. 2017] Ye, C.; Yang, Y.; Mao, R.; Fermuller, C.; and [Stone et al. 2016] Stone, P.; Brooks, R.; Brynjolfsson, E.; Aloimonos, Y. 2017. What can i do around here? Deep Calo, R.; Etzioni, O.; Hager, G.; Hirschberg, J.; Kalyanakr- functional scene understanding for cognitive robots. IEEE ishnan, S.; Kamar, E.; Kraus, S.; Leyton-Brown, K.; Parkes, ICRA 4604–4611. D.; Press, W.; Saxenian, A.; Shah, J.; Tambe, M.; ; and [Yi et al. 2018] Yi, K.; Torralba, A.; Wu, J.; Kohli, P.; Gan, Teller, A. 2016. Artificial intelligence and life in 2030. One C.; and Tenenbaum, J. B. 2018. Neural-symbolic VQA: Dis- Hundred Year Study on Artificial Intelligence: Report of the entangling reasoning from vision and language understand- 2015-2016 Study Panel. ing. Advances in Neural Information Processing Systems [Su et al. 2018] Su, Z.; Zhu, C.; Dong, Y.; Cai, D.; Chen, Y.; 2018-December(NeurIPS):1031–1042. and Li, J. 2018. Learning Visual Knowledge Memory Net- [Young et al. 2016] Young, J.; Basile, V.; Kunze, L.; Cabrio, works for Visual Question Answering. IEEE CVPR 7736– E.; and Hawes, N. 2016. Towards lifelong object learning by 7745. integrating situated robot perception and semantic web min- [Suchan et al. 2017] Suchan, J.; Bhatt, M.; Walega, P. A.; ing. In Proceedings of the Twenty-second European Confer- and Schultz, C. P. L. 2017. Visual explanation by high- ence on Artificial Intelligence, 1458–1466. IOS Press. level abduction: On answer-set programming driven reason- [Young et al. 2017] Young, J.; Basile, V.; Suchi, M.; Kunze, ing about moving objects. CoRR abs/1712.00840. L.; Hawes, N.; Vincze, M.; and Caputo, B. 2017. Mak- [Tenorth and Beetz 2017] Tenorth, M., and Beetz, M. 2017. ing sense of indoor spaces using semantic web mining and Representations for robot knowledge in the knowrob frame- situated robot perception. In European Semantic Web Con- work. Artificial Intelligence 247:151–169. ference, 299–313. Springer. [Zhu et al. 
2015a] Zhu, Y.; Groth, O.; Bernstein, M.; and Fei- [Torabi, Warnell, and Stone 2019] Torabi, F.; Warnell, G.; Fei, L. 2015a. Visual7W: Grounded Question Answering in and Stone, P. 2019. Recent advances in imitation learn- Images. ing from observation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, [Zhu et al. 2015b] Zhu, Y.; Zhang, C.; Ré, C.; and Fei-Fei, L. IJCAI-19, 6325–6331. International Joint Conferences on 2015b. Building a large-scale multimodal knowledge base Artificial Intelligence Organization. for visual question answering. CoRR abs/1507.05670. [van Harmelen et al. 2007] van Harmelen, F.; van Harmelen, [Zhu, Fathi, and Fei-Fei 2014] Zhu, Y.; Fathi, A.; and Fei- F.; Lifschitz, V.; and Porter, B. 2007. Handbook of Knowl- Fei, L. 2014. Reasoning about object affordances in a edge Representation. San Diego, USA: Elsevier Science. knowledge base representation. In Fleet, D.; Pajdla, T.; Schiele, B.; and Tuytelaars, T., eds., ECCV, 408–424. Cham: [Vedantam et al. 2015] Vedantam, R.; Lin, X.; Batra, T.; Springer International Publishing. Lawrence Zitnick, C.; and Parikh, D. 2015. Learning com-