A Review on Intelligent Object Perception Methods Combining Knowledge-based Reasoning and Machine Learning∗

Filippos Gouidis,1† Alexandros Vassiliades,1,2† Theodore Patkos,1 Antonis Argyros,1 Nick Bassiliades,2 Dimitris Plexousakis1
1 Institute of Computer Science, Foundation for Research and Technology, Hellas
2 Aristotle University of Thessaloniki
{gouidis, patkos, argyros, dp}@ics.forth.gr, {valexande, nbassili}@csd.auth.gr

∗ This project has received funding from the Hellenic Foundation for Research and Innovation (HFRI) and the General Secretariat for Research and Technology (GSRT), under grant agreement No 188.
† Authors contributed equally.
Copyright © 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Object perception is a fundamental sub-field of Computer Vision, covering a multitude of individual areas and having contributed high-impact results. While Machine Learning has been traditionally applied to address related problems, recent studies also seek ways to integrate knowledge engineering in order to expand the level of intelligence of the visual interpretation of objects, their properties and their relations with the environment. In this paper, we attempt a systematic investigation of how knowledge-based methods contribute to diverse object perception tasks. We review the latest achievements and identify prominent research directions.

Introduction

Despite the recent sweep of progress in Machine Learning (ML), which is stirring public imagination about the capabilities of future Artificial Intelligence (AI) systems, the research community seems more composed, in part due to the realization that current achievements are largely based on engineering advancements, and only partly on novel scientific progress. Undoubtedly, the accomplishments are neither small nor temporary; in the latest "One Hundred Year Study on AI", a panel of renowned AI experts foresees tremendous impact of AI in a multitude of technological and societal domains in the next decades, fueled primarily by systems running ML algorithms (Stone et al. 2016). Yet, there is still a lot of ground to cover before we can obtain a deep understanding of how to overcome the limitations of data-driven approaches at a more generic level.

Aiming at exploiting the full potential of AI, a growing body of research is devoted to the idea of integrating ML and knowledge-based approaches. Davis and Marcus (2015), for instance, while discussing the multifaceted challenges related to automating commonsense reasoning, a crucial ability for any intelligent entity operating in real-world conditions, underline the need to combine the strengths of diverse AI approaches from these two fields. Others, as for example Bengio et al. (2019) and Pearl (2018), emphasize the inability of Deep Learning to effectively recognize cause and effect relations. Pearl, Geffner (2018) and, recently, Lenat1 suggest seeking solutions by bridging the gap between model-free, data-intensive learners and knowledge-based models, and by building on the synergy between heuristic-level and epistemological-level languages.

1 https://towardsdatascience.com/statistical-learning-and-knowledge-engineering-all-the-way-down-1bb004040114

In this paper, we review recent progress in the direction of coupling the strengths of ML and knowledge-based methods, focusing our attention on the topic of Object Perception (OP), an important sub-field of Computer Vision (CV). Tasks related to OP are at the core of a wide spectrum of practical systems, and relevant research has traditionally relied on ML to approach the related problems. The recent developments have significantly advanced the field but, interestingly, state-of-the-art studies try to integrate symbolic methods in order to achieve broader visual intelligence. It seems that it is becoming less of a paradox within the CV community that, in order to build intelligent vision systems, much of the information needed is not directly observable.
Existing surveys on the intersection of ML and knowledge engineering (e.g., (Nickel et al. 2016)) are indeed very informative, but they usually offer a high-level understanding of the challenges involved. The rich literature on CV reviews, on the other hand, adopts a more problem-specific analysis, studying in detail the requirements of each particular CV task (see e.g., (Wu et al. 2017; Herath, Harandi, and Porikli 2017; Liu et al. 2018a)). Only recently was an overview presented that shows how background knowledge can benefit tasks such as image understanding (Aditya, Yang, and Baral 2019); our goal is to explore this direction on the topic of intelligent OP, reporting state-of-the-art achievements and showing how different facets of knowledge-based research can contribute to addressing the rich diversity of OP tasks.

The rest of the paper investigates state-of-the-art literature on intelligent OP along the three pillars shown in Figure 1: i) symbolic models, further analyzed from the perspectives of expressive representations, reasoning capacity, and open-domain, Web-based knowledge; ii) commonsense knowledge exploitation, a key skill for any intelligent system; and, iii) enhanced learning ability, building on hybrid approaches. By no means should this review be considered exhaustive; inevitably, relevant studies may not have been included. Our intention is to capture a snapshot of the most recent research trends, discussing studies that improve previous results or offer new insights, hoping to provide a starting point for following the interesting progress currently underway, while giving pointers to directions that seem worth investigating. In this respect, the paper concludes with a discussion on open questions and prominent research directions. Table 1 at the end summarizes the reviewed literature.

Figure 1: Organization of thematic areas.

Exploitation of Symbolic Models

The scope of OP research ranges over a wide spectrum of problems: from object, action and affordance detection, to localization and recognition in images, to motion and structure inference in videos, to scene understanding and visual reasoning. Traditionally, OP relied on ML methodologies to find patterns in reams of data, taking as input feature vectors representing entities in terms of numeric or categorical (membership to a more general class) attributes.

In this section, we discuss how high-level knowledge related to visual entities can improve the performance of OP algorithms in manifold ways. We start by reviewing the state of the art in coupling data-driven approaches and rich knowledge representations about aspects such as context, space and affordances.

Expressive Representations of Knowledge

Although the notion of a representation is rather general, the modeling of relational knowledge in the form of individuals (entities) and their associated relations is well-studied in AI, especially in the context of symbolic representations (see for instance (van Harmelen et al. 2007; Brachman and Levesque 2004)). The goal is to offer the level of abstraction needed to design a system versatile enough to adapt to the requirements of a particular domain, yet rigid enough to be encoded in a computer program, having clear semantics and elegant properties (Dean, Allen, and Aloimonos 1995). As a result, expressiveness, i.e., what can or cannot be represented in a given model, and computation, i.e., how fast conclusions are drawn or which statements can be evaluated by algorithms guaranteed to terminate, are two, often competing, aspects to be taken into consideration.

The form of the representational model applied plays a decisive role in such considerations, affecting the richness of semantics that can be captured. While the range of models varies significantly, even relatively shallow representations have proven to offer improvements in the performance of OP methodologies (see for example (Deng et al. 2014; Zhu, Fathi, and Fei-Fei 2014; Zhu et al. 2015b; Redmon and Farhadi 2017)). Models as simple as relational tables or flat weighted graphs, but also more complex multi-relational graphs, often called Knowledge Graphs (KGs), or even semantically-rich conceptualizations with formal semantics, often called ontologies, are proposed in the relevant literature.
In the sequel, we refer to any model that offers at least a basic structuring of data as a Knowledge Base (KB).

Utilization of Contextual Knowledge

Context awareness is the ability of a system to understand the state of the environment, and to perceive the interplay of the entities inhabiting it. This feature offers advantages in accomplishing inference tasks, but also enhances a system in terms of reusability, as contextual knowledge enables it to adapt to new situations and environments that resemble known ones.

A preliminary approach towards this direction utilized a special semantic fusion network, which combined novel object- and scene-based information and was able to capture relationships between video class labels and semantic entities (objects and scenes) (Wu et al. 2016b). This was a Convolutional Neural Network (CNN) consisting of three layers, with the first layer detecting low-level features, the second layer object features and the third layer scene features. Although no symbolic representation was applied, this modeling of abstraction layers for the representation of the information depicted in an image is similar to the hierarchical structuring of knowledge used by top-down methods.

In a similar style, Liu et al. (2018b) recently attempted to address the problem of object detection by proposing an algorithm that exploits jointly the context of a visual scene and an object's relationships. Such features are typically taken into consideration in isolation by most object detection methods. The algorithm uses a CNN-based framework tailored for object detection and combines it with a graphical model designed for the inference of object states. A special graph is created for each image in the training set, with nodes corresponding to objects and edges to object relationships. The intuition behind this approach is that the graph's topology enables an object state to be determined not only by its low-level characteristics (appearance details), but also by the states of other objects interacting with it and by the overall scene context. The experimental evaluation conducted on different datasets underscored the importance of knowledge stemming from local and global context.
Currently, a collection of prominent approaches is oriented towards Gated Graph Neural Networks (GGNNs) (Li et al. 2015), a variation of Graph Neural Networks (GNNs) (Scarselli et al. 2008), in order to integrate contextual knowledge in the training of a system. GNNs are a special type of NN tailored to the learning of information encoded in a graph. In GGNNs, each node corresponds to a hidden state vector that is updated in an iterative way. Two features that differentiate GGNNs from standard Recurrent Neural Networks (RNNs), which also use the mechanism of recurrence, are that in the former information can move bi-directionally and that many nodes can be updated per step.
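To make the propagation mechanism concrete, the following minimal Python sketch implements one synchronous GGNN-style step with a GRU-like gate, in the spirit of Li et al. (2015); the toy graph, dimensions and random weights are ours for illustration and do not come from any of the systems reviewed here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(h, a_in, a_out, params):
    """One synchronous GGNN propagation step (simplified sketch).

    h      : (N, D) hidden state of every node.
    a_in   : (N, N) adjacency for incoming edges.
    a_out  : (N, N) adjacency for outgoing edges.
    params : dict of weight matrices, shapes (2D, D) and (D, D).
    """
    # Information moves bi-directionally: aggregate messages from
    # incoming and outgoing neighbours, then concatenate.
    m = np.concatenate([a_in @ h, a_out @ h], axis=1)   # (N, 2D)
    z = sigmoid(m @ params["Wz"] + h @ params["Uz"])    # update gate
    r = sigmoid(m @ params["Wr"] + h @ params["Ur"])    # reset gate
    h_tilde = np.tanh(m @ params["Wh"] + (r * h) @ params["Uh"])
    return (1 - z) * h + z * h_tilde                    # all nodes update at once

# Toy graph: 3 nodes, edges 0->1 and 1->2; hidden size D = 4.
N, D = 3, 4
rng = np.random.default_rng(0)
a_out = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
a_in = a_out.T
params = {k: rng.normal(scale=0.1, size=(2 * D, D)) for k in ("Wz", "Wr", "Wh")}
params.update({k: rng.normal(scale=0.1, size=(D, D)) for k in ("Uz", "Ur", "Uh")})
h = rng.normal(size=(N, D))
for _ in range(5):                                      # a few propagation steps
    h = ggnn_step(h, a_in, a_out, params)
```

After a few such steps, each node's state reflects its multi-hop context, which is what the affordance and situation recognition methods below exploit.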
Chuang et al. (2018) propose a GGNN-based method which exploits contextual information, such as the types of objects and their spatial relations, in order to detect action-object affordances. In the same vein, Ye et al. (2017) utilize a two-stage pipeline built around a CNN to detect functional areas in indoor scenes. More recently, Sawatzky et al. (2019) utilized a GGNN which takes into account the global context of a scene and infers the affordances of the contained objects; it also proposes the most suitable object for a specific task. The authors prove that this approach yields better results compared to methods that rely solely on the results of an object classification algorithm.

In (Li et al. 2017), an approach aiming to perform situation recognition is presented, based on the detection of human-object interactions. The goal is to predict the most representative verb to describe what is taking place in a scene, capturing also relevant semantic information (roles), such as the actor, the source and target of the action, etc. The authors utilize a GGNN, which enables combined reasoning about verbs and their roles through the iterative propagation of messages along the edges.

Spatial Contextual Knowledge

A particular type of contextual knowledge concerns the spatial properties of the entities populating a visual scene. These may involve simple spatial relations, such as "object x is usually part of object y" and "object x is usually situated near the objects x1, ..., xn", but also semantically enriched statements, such as "objects of type x are usually found inside object z (e.g., in the fridge), located in room y". Due to the ubiquitous nature of spatial data in practical domains, such relations are often captured as a separate class of context.

Semantic spatial knowledge, when fused with low-level metric information, gives great flexibility to a system. This is demonstrated in the study of Gemignani et al. (2016), where a novel representation is introduced that combines the metric information of the environment with the symbolic data that conveys meaning to the entities inhabiting it, as well as with topological graphs. Although delivering a generic model with clear semantics is not their main objective, the resulting integrated representation enables a system to perform high-level spatial reasoning, as well as to understand target locations and positions of objects in the environment.

Generality is the aim of the model proposed by Tenorth and Beetz (2017), which uses an OWL ontology combining information from OpenCyc and other Web sources that help compile new classes while forming the environmental map. The main reasoning mechanism of this study is Prolog, although probabilistic reasoners are also used to tackle fuzzy information or uncertain relations. A CV system annotates objects based on their shape, their distances and the dimension of the environment, using a monotonic Description Logic (DL) to build the environmental map. As a result, a coherent and well-formalized representation of environments is achieved, which can offer high-quality datasets for training data-driven models. The use of DL can also offer a wide spectrum of spatial reasoning capabilities with well-specified properties and formal semantics.

Adopting a different approach, the KG given in (Chen et al. 2018) enables spatial reasoning both in a local and a global context which, in turn, results in improved performance in semantic scene understanding. The proposed framework consists of two distinct modules, one focusing on local regions of the image and one dealing with the whole image. The local module is convolution-based, analyzing small regions in the image, whereas the basic component of the global module is a KG representing regions and classes as nodes, capturing spatial and semantic relations. According to the experimental evaluation performed on the ADE and Visual Genome datasets, the network achieves better performance over other CNN-based baselines for region classification, by a margin which sometimes is close to 10%. According to the ablation study, the most decisive factor for the framework's performance was the KG.
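Statements of the kind quoted at the start of this subsection are easy to serialize as weighted triples. The hypothetical micro-KB below is only a sketch of the idea; the relation names, objects and weights are invented for illustration and are not taken from the cited systems.

```python
# Hypothetical miniature spatial KB: weighted (subject, relation, object)
# statements of the form "milk is usually inside the fridge".
SPATIAL_KB = [
    ("handle", "partOf", "mug", 0.95),
    ("mug", "usuallyNear", "coffee_machine", 0.70),
    ("milk", "usuallyInside", "fridge", 0.85),
    ("fridge", "locatedIn", "kitchen", 0.90),
]

def likely_locations(obj, kb=SPATIAL_KB, threshold=0.5):
    """Return (relation, place, weight) facts that can bias a detector
    or a planner towards plausible object placements."""
    return [(rel, place, w)
            for subj, rel, place, w in kb
            if subj == obj
            and rel in ("usuallyInside", "locatedIn", "usuallyNear")
            and w >= threshold]

print(likely_locations("milk"))   # [('usuallyInside', 'fridge', 0.85)]
```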
Modeling Affordances

Building on the geometrical structure and physical properties of objects, such as rigidity and hollowness, the representation of affordances helps develop systems that can reason about how human-level tasks are performed. While ML is invaluable for automating the process of learning from examples when data is available, rich representations can generalize and reuse the obtained models in situations where data-based training is not possible.

One of the first studies that demonstrated that even very basic semantic models can improve the performance of recognizing human-object interaction was (Chao et al. 2015). The authors succeeded in boosting the performance of visual classifiers by exploiting the compositionality and concurrency of semantic concepts contained in images.

KNOWROB 2.0 (Beetz et al. 2018), which is the result of a series of research activities in the field of Cognitive Robotics, is an excellent example of integrating top-down knowledge engineering with bottom-up information structuring, involving, among others, a variety of CV tasks. A combination of KBs of different granularity helps the KNOWROB 2.0 framework capture rich models of the world. The representation of high-level knowledge is based on OWL-DL ontologies, a decidable fragment of First-Order Logic (FOL), yet adequately expressive for most practical domains. The KBs enable the system to answer questions such as "How to pick up the cup?" or "Which body part to use?". The authors provide evidence that existing methods for learning human manipulation tasks can be boosted by using symbolic-level structured knowledge.

A recently proposed novel representation model that manages to balance between concept abstraction, uncertainty modeling and scalability is given in (Daruna et al. 2019). The so-called RoboCSE framework encodes the abstract, semantic knowledge of an environment, i.e., the main concepts and their relations, such as location, material and affordance, obtained by observations, simulations, or even from external sources, into multi-relational embeddings. These embeddings are used to represent the knowledge graph of the domain in vector space, encoding vertices that represent entities as vectors and edges that represent relations as mappings. While the majority of similar approaches rely on Bayesian Logic Networks and Markov Logic Networks, suffering from well-known intractability problems, the authors prove that their model is highly scalable, robust to uncertainty, and generalizes learned semantics.
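To illustrate how relations can act as mappings in vector space, the sketch below scores triples with a TransE-style translational model. This is a deliberately simplified stand-in for the embeddings learned by RoboCSE, which builds on a different, analogy-based embedding method; the vocabulary, dimensions and random vectors are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16
# Illustrative vocabulary; a real framework learns these vectors from
# observations, simulations or external sources.
entities = {e: rng.normal(size=D) for e in ("mug", "kitchen", "ceramic")}
relations = {r: rng.normal(size=D) for r in ("atLocation", "madeOf")}

def score(h, r, t):
    """TransE-style plausibility: head + relation should land near tail,
    so a smaller distance means a more plausible triple."""
    return -np.linalg.norm(entities[h] + relations[r] - entities[t])

# After training, a query like "where is a mug?" reduces to ranking
# candidate tail entities under the atLocation relation.
candidates = ["kitchen", "ceramic"]
best = max(candidates, key=lambda t: score("mug", "atLocation", t))
print(best)
```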
Learning from demonstration, or imitation learning, is a relevant, yet broader objective, which introduces interesting opportunities and challenges to a CV system (see (Torabi, Warnell, and Stone 2019; Ravichandar et al. 2019)). Purely ML-based methods constitute the predominant research direction, and only a few state-of-the-art studies utilize knowledge-based methods, taking advantage of the reusability and generalization of the learned information. A popular choice is to deploy expressive OWL-DL (Ramirez-Amaro, Beetz, and Cheng 2017; Lemaignan et al. 2017) or pure DL (Agostini, Torras, and Woergoetter 2017) representations to capture world knowledge. The CV modules are assigned the task of extracting information about the state of the environment, the expert agent's pose and location, grasping areas of objects, affordances, shapes, etc. On top of these, the coupling with knowledge-based systems assists in visual interpretation, for example to track human motion, to semantically annotate the movement (i.e., "how the human performs the action") or to understand if a task is doable in a given setting. These studies show that such representations enable a system to reuse the learned knowledge in diverse settings and under different conditions, without having to re-train classifiers from scratch. Moreover, complex queries can be answered, a topic discussed in the next subsection.

Reasoning over Expressive KBs

Encoding knowledge in a semantically structured way is only part of the story; a rich representation model can also offer inference capabilities to a CV system, which are needed for accomplishing complex tasks, such as scene understanding, or simpler tasks under realistic conditions, such as scene analysis with occlusions, noisy or erroneous input, etc. A reasoning system can be used to connect the dots that relate concepts together when only partial observation is available, especially in data-scarce situations, where annotated data are not sufficiently many. In such situations, the compositionality of information, an inherent characteristic of the entities encountered in visual domains, can be exploited by applying reasoning mechanisms.

Complex Query Answering

Probably the field that highlights most clearly the needs and challenges faced by a CV system in answering complex queries about a visual scene is the field of Visual Question Answering (VQA). VQA was recently introduced as a collection of benchmark image-based open-domain questions that, in order to be answered, call for a deep understanding of the visual setting. VQA goes beyond traditional CV since, apart from image analysis, the proposed methods also apply a repertoire of AI techniques, such as Natural Language Processing, in order to correctly analyze the textual form of the question, and inferencing, in order to interpret the purpose and intentions of the entities acting in the scene (Krishna et al. 2017). The challenges posed by this field are complex and multifaceted, a fact which is also demonstrated by the rather poor performance of state-of-the-art systems in comparison to humans. VQA is probably the area of CV that has drawn the most inspiration from symbolic AI approaches to date.

An indicative example is the approach recently presented by Wu et al. (2018), who introduced a VQA model combining observations obtained from the image with information extracted from a general KB, namely DBpedia. Given an image-question pair, a CNN is utilized to predict a set of attributes from the image, i.e., the most recognizable objects in the image in terms of clarity and size. Consequently, a series of captions based on the attributes is generated, which is then used to extract relevant information from DBpedia through appropriately formulated queries. In a similar style, in (Narasimhan and Schwing 2018) an external RDF repository is used to retrieve properties of visual concepts, such as category, used for, created by, etc. The technique utilizes a Graph Convolution Network (GCN), a variation of GNN, before producing an answer. In both cases, the ablation analysis reveals the impact of the KB in improving performance.
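As an illustration of this kind of retrieval step, the sketch below queries the public DBpedia SPARQL endpoint for the textual description of a predicted attribute, using the SPARQLWrapper Python library; the helper name and the query shape are ours, not the exact queries formulated by the cited systems.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_comment(concept: str, lang: str = "en"):
    """Fetch the rdfs:comment abstracts of a DBpedia resource.
    Assumes the dbr: and rdfs: prefixes predefined on the endpoint."""
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        SELECT ?comment WHERE {{
            dbr:{concept} rdfs:comment ?comment .
            FILTER (lang(?comment) = "{lang}")
        }}""")
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [r["comment"]["value"] for r in rows]

# A detected attribute such as "Dog" can thus be grounded in external
# encyclopedic knowledge before answer generation.
print(dbpedia_comment("Dog")[0][:80])
```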
Other types of questions in VQA require inferencing about the properties of the objects depicted in an image. For example, queries such as "How is the man going to work?" or more complex queries, such as "When did the plane land?", have been the subject of the study presented by Krishna et al. (2017), who introduced the Visual Genome dataset and a VQA method. In fact, this is one of the first studies to bring a model trained on an RDF-based scene graph that had good recall results on all What, Where, When, Who, Why, How queries. Even further, Su et al. (2018) introduced the Visual Knowledge Memory Network (VKMN), in order to handle questions whose answers cannot be directly inferred from the image's visual content but require reasoning over structured human knowledge.

The importance of capturing the semantic knowledge in VQA collections led also to the creation of the Relation-VQA dataset (Lu et al. 2018), which extends Visual Genome with a special module measuring the semantic similarity of images. In contrast to methods mining only concepts or attributes, this model extracts relation facts related to both concepts and attributes. The experimental evaluation conducted on the VQA and COCO datasets showed that the method outperformed other state-of-the-art ones. Moreover, the ablation studies show that the incorporated semantic knowledge was crucial for the performance of the network.

Despite its increasing popularity, the VQA field is still hard to confront. The generality of existing methods is also questioned (Goyal et al. 2019). Developing generic solutions, less tightly coupled to specific datasets, will definitely benefit the pursuit towards broader visual intelligence.

Visual Reasoning

A task related to VQA that has gained popularity in recent years is that of Visual Reasoning (VR). In this case, the questions that have to be answered are more complex and require a multi-step reasoning procedure. For example, given an image containing objects of different shapes and colors, the task of recognizing the color of an object of a certain shape that lies in a certain area w.r.t. the position of another object of a certain shape and color falls into the category of VR (in this case, first the "source" object must be detected, then the "target" object, and, finally, its color must be recognized). Similar to the case of VQA, a number of VR works have drawn inspiration from symbolic AI-based ideas.

In general, many VR works are based on Neural Module Networks (NMNs), which are NNs of adaptable architecture, the topology of which is determined by the parsing of the question that has to be answered. NMNs simplify complex questions into simpler sub-questions (sub-tasks), which can be more easily addressed. The modules that constitute the NMNs are pre-defined neural networks that implement the functions required for tackling the sub-tasks, and are assembled into a layout dynamically. Central to many NMNs is the utilization of prior symbolic (structured) knowledge, which facilitates the handling of the sub-tasks.

Hu et al. (2017) propose End-to-End Module Networks as a variation of NMNs. The network first uses coarse functional expressions describing the structure of the computation required for the answering and then refines them according to the textual input, in order to assemble the network. For example, for the question "How many other objects of the same size as the purple cube exist?", a crude functional expression for counting and relocating would first be predicted as relevant to the answering of the question, which would subsequently be refined by the parameters from text analysis (in this case, one such parameter is the color of the cube).
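A toy rendering of this decomposition idea in plain Python is given below; the scene encoding, module names and hand-written layout are invented for illustration, whereas real NMNs learn to predict the layout and implement each module as a small neural network.

```python
# Symbolic toy scene; a real system would get this from a detector.
scene = [
    {"shape": "cube", "color": "purple", "size": "large"},
    {"shape": "sphere", "color": "red", "size": "large"},
    {"shape": "cube", "color": "blue", "size": "small"},
]

def filter_objs(objs, **attrs):           # module: keep matching objects
    return [o for o in objs if all(o[k] == v for k, v in attrs.items())]

def same_size(objs, anchor):              # module: relate via an attribute
    return [o for o in objs if o is not anchor and o["size"] == anchor["size"]]

def count(objs):                          # module: terminal answer
    return len(objs)

# Layout for "How many other objects of the same size as the purple
# cube exist?", composed by hand here instead of by a layout predictor.
anchor = filter_objs(scene, shape="cube", color="purple")[0]
answer = count(same_size(scene, anchor))
print(answer)                             # -> 1 (the large sphere)
```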
Similarly, Johnson et al. (2017) propose a variation of NMNs which is based on the concept of programs. Programs are symbolic structures of a certain specification, written in a Domain-Specific Language and defined by a syntax and semantics. In the context of VR, programs describe a sequence of functions that must be executed in order for an answer to be computed. During testing on the CLEVR dataset, the model exhibited notable performance, generalizing better in a variety of settings, such as for new question types and human-posed questions. Building on the notion of programs, Yi et al. (2018) further incorporated knowledge regarding the structural scene representation of the image. The method achieved near-perfect accuracy, while also providing transparency to the reasoning process.

An alternative NN-based approach for VR is found in (Santoro et al. 2017), where the incorporation of Relation Networks (RNs) in CNN and Long Short-Term Memory (LSTM) architectures is proposed. RNs are architectures whose computations focus explicitly on relational reasoning and are characterized by three important features: they can infer relations, they are data efficient, and they operate on a set of objects, a flexible symbolic input format that is agnostic to the kind of inputs it receives. For example, an object could correspond to the background, to a particular physical object, a texture, conjunctions of physical objects, etc.

To conclude, it is worth indicating also a recent trend in visual explanation approaches that couples data-driven systems with Answer Set Programming (ASP). ASP is a non-monotonic logical formalism oriented towards hard search problems. A number of studies have emerged that combine ASP abductive or inductive reasoning for the VQA domain, especially for cases when training data are not many (see e.g., (Suchan et al. 2017; Riley and Sridharan 2019; Basu, Shakerin, and Gupta 2020)).
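For flavor, the following sketch grounds and solves a tiny ASP program through clingo's Python API; the scene facts and the default rule are invented for the example and are far simpler than the encodings used in the cited studies.

```python
import clingo

# Illustrative scene facts plus a non-monotonic default: an object is
# assumed visible unless it is known to be occluded.
PROGRAM = """
object(o1). shape(o1, cube).  left_of(o1, o2).
object(o2). shape(o2, sphere).
visible(X) :- object(X), not occluded(X).
occluded(o2).
#show visible/1.
"""

ctl = clingo.Control()
ctl.add("base", [], PROGRAM)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print(m))   # prints: visible(o1)
```

Adding the fact occluded(o1) would retract visible(o1) without touching the rule, which is exactly the non-monotonic behavior that makes ASP attractive for scenes with partial observability.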
The Web as a Problem-Agnostic Source of Data

As the recent renaissance in AI is partly due to the availability of big volumes of training data, along with the computational power to analyze them, it is only reasonable to expect that data-driven approaches will turn their attention to the Web in order to collect the data needed. Although the benefits mentioned in the previous sections are still achievable, the challenges faced when using a Web repository rather than a custom-made KB are now different.

The vast majority of large-scale Web repositories are not problem-specific, containing a lot of irrelevant information for an ML system to be trained correctly. For the time being, ML systems are highly specific, excelling only when trained for a particular task and tested under conditions similar to the training ones. As a result, state-of-the-art approaches try to rely on the semantics of structured KBs, in order to filter out noisy or irrelevant knowledge, by integrating external knowledge when visual information is not sufficiently reliable for conclusion making.

Exploitation of Web-based Knowledge Graphs and Semantic Repositories

There exists a multitude of studies that use external knowledge from structured or semi-structured Web resources, in order to answer visual queries or to perform cognitive tasks. A characteristic example is found in (Li, Su, and Zhu 2017), where the ConceptNet KG, a semantic repository of commonsense Linked Open Data, is used to answer open-domain questions on entities, such as "What is the dog's favorite food?". The approach proceeds in a step-wise manner: first, visual objects and keywords are extracted from an image, using a Fast-RCNN for the objects and an LSTM for the syntactical analysis; then, queries to ConceptNet provide properties and values for the entities found in the image. When an answer is considered correct, a Dynamic Memory Network, which is an embedding vector space that contains vector representations of symbolic knowledge triples, is renewed for future encounters of the same query.
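Retrieving such properties is straightforward through ConceptNet's public REST API, documented at api.conceptnet.io; the minimal sketch below (helper name ours) fetches the edges of a given relation for a detected concept.

```python
import requests

def conceptnet_edges(concept: str, relation: str):
    """Query ConceptNet for edges of one relation starting at a concept;
    only the fields needed for the example are extracted."""
    resp = requests.get("https://api.conceptnet.io/query",
                        params={"start": f"/c/en/{concept}",
                                "rel": f"/r/{relation}"})
    return [(e["end"]["label"], e["weight"]) for e in resp.json()["edges"]]

# Properties such as UsedFor or AtLocation can then support answering
# open-domain questions about entities detected in the image.
print(conceptnet_edges("fork", "UsedFor")[:3])
```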
In a rather similar style, Wu et al. (2016a) extract properties from DBpedia, by retrieving and performing semantic analysis on the comment boxes of relevant Wikipedia pages. Here, a CNN performs object detection on the image, whereas a pre-trained RNN correlates attributes to sentence descriptions.

The approach presented in (Shah et al. 2019) is the first attempt to answer a more knowledge-intensive category of questions, such as "Who is to the left of Barack Obama?" or "Do all the people in the image have a common occupation?". These questions make reference to the named entities contained in an image, e.g., Barack Obama, White House, France, etc., and require large KBs to retrieve the relevant information. In this case, the authors choose Wikidata, an RDF repository. They first extract named entities and then try to connect them with a Wikidata entity using SPARQL queries. In addition, they extract spatial relations with other entities shown in the image and feed them to a Bi-LSTM. A multi-layered perceptron calculates the prediction for an answer, taking as input the output of the LSTM along with the SPARQL results.

Aligning Data Obtained from Diverse Online Sources

Entity resolution, also known as instance matching, concerns the task of identifying which entities across different KBs refer to the same individual. As the Web is growing in size, this problem is becoming crucial, especially in application domains that need to integrate and align knowledge obtained from various sources. An increasing number of CV studies face this problem, in an attempt to interpret visual information based on commonsense, non-visual knowledge.

Two characteristic approaches are given in (Chernova et al. 2017) and (Young et al. 2017), which try to assign labels to a visual scene using Bayesian Logic Networks (BLNs) and relying on commonsense knowledge. In (Chernova et al. 2017), knowledge is extracted from WordNet, ConceptNet, and Wikipedia. WordNet is utilized in order to disambiguate seed words returned by the CV annotator with the aid of their hypernyms. ConceptNet properties, such as IsLocatedIn or UsedFor, that may point to the location of an object, are also retrieved. With this method, the system can generate a compact semantic KB given only a small number of objects.

In (Young et al. 2017), a CNN trained on ImageNet is used to annotate objects recognized in images. The system is capable of assigning semantic categories to specific regions, by relying on DBpedia comment boxes to calculate the semantic relatedness between objects. As expected, high accuracy of such an approach is difficult to achieve, due to the diversity of information retrieved from DBpedia; consequently, smarter ways of identifying only the relevant part of the comment boxes need to be devised.

Exploitation of Commonsense Knowledge

Much of the information presented in a visual scene is not explicitly related with the features captured at the pixel level, but concerns observations implicitly depicted in images. Understanding the structure and dynamics of visual entities requires being able to interpret the semantic and commonsense (CS) features that are relevant, in addition to the low-level information obtained by photorealistic rendering techniques (Vedantam et al. 2015). This is a popular conclusion reached within the CV community in the pursuit towards achieving visual intelligence. There is a long line of studies that attempt to address the problem of extracting commonsense knowledge from visual scenes or, similarly, of utilizing commonsense inferences to improve scene understanding. In this section, we discuss state-of-the-art approaches that advance the field in these two directions.

Mining Commonsense Knowledge from Images

Even though ML is becoming part of many systems, it is still not able to easily capture CS knowledge from the perceived information. Additional techniques need to be devised to extract this valuable type of knowledge from visual scenes. A combination of textual and visual analysis, which extracts subject-predicate-object (SPO) triples about objects recognized in a scene, is addressed in certain studies, e.g., (Vedantam et al. 2015; Lin and Parikh 2015). ML classifiers for object recognition are trained on image datasets, while pre-trained NN classifiers help extract SPO triples, by considering both the entities identified by the classifiers and the textual description of the images.

In a different direction, in (Sadeghi, Kumar Divvala, and Farhadi 2015) the authors rely on Web images to verify the validity of simple phrases, such as "horses eat hay", analyzing the spatial consistency of the relative configurations of the entities and the relations involved. This unsupervised method is particularly interesting, due to the leverage it offers in automatically enriching CS repositories. In fact, the authors show how CV-based analysis can help improve recall in KBs, such as WordNet, Cyc and ConceptNet, offering a complementary and orthogonal source of evidence.

Aditya et al. (2018) address the problem of generating linguistic descriptions of images by utilizing a special type of graph, namely scene description graphs (SDGs). Such graphs are built by using both low-level information derived using perception methods and high-level features capturing CS knowledge stemming from the image annotations and lexical ontological knowledge from Web resources. SDGs produce object, scene and constituent detection tuples, accompanied by a confidence score; pre-processed background knowledge helps remove noise contained in the detection. A Bayesian Network is utilized, in order for the dependencies among co-occurring entities and knowledge regarding abstract visual concepts to be captured. Experimental evaluations of the method on the image-sentence alignment quality, i.e., how close the generated description is to the image being described, on the Flickr8k, Flickr30k and COCO datasets, showed that the method achieves comparable performance to previous state-of-the-art methods.
Commonsense Knowledge in Addressing OP Tasks

State-of-the-art CS-based methodologies improve the performance of a CV system mainly by taking into account textual descriptions about the entities found in a visual scene, or by retrieving semantic information from external sources that is relevant to the image and the task at hand.

A combination of external Web-based knowledge, text processing and vision analysis is at the core of the study presented in (Wang et al. 2018). The framework annotates objects with a Fast-RCNN trained over the MS COCO dataset. The extracted entities are enriched with (i) knowledge retrieved from Wikipedia, in order to perform entity classification; (ii) knowledge from WebChild, attempting a comparative analysis between relevant entities; and (iii) CS knowledge obtained from ConceptNet, to create a semantically rich description. The enriched entity is stored in an RDF graph and is used to address a variety of tasks. For instance, the framework has achieved improved accuracy in VQA benchmarks, but it can also be used to generate explanations for its answers. Prominent recent studies, as in (Li et al. 2019) and (Narasimhan and Schwing 2018), also build on the direction of combining textual and visual analysis with the help of knowledge obtained from CS repositories.

Another problem that researchers try to address with the help of CS knowledge is the sparsity of categorical variables in the training datasets. For example, Ramanathan et al. (2015) utilize a neural network framework that uses different types of cues (linguistic, visual and logical) in the context of human action identification. Similarly, Lu et al. (2016) exploit language priors extracted from the semantic features of an image, in order to facilitate the understanding of visual relationships. The proposed model combines a visual module tailored to the learning of visual appearance models for objects and predicates with a language module capable of detecting semantically related relationships.

More recently, Gu et al. (2019) utilize commonsense knowledge stemming from an external KB in the context of scene graph generation. Namely, a special knowledge-based feature refinement module is used, which incorporates CS knowledge from ConceptNet for the prediction of object labels, consisting of triplets containing the top-K corresponding relationships, the object entity and a weight corresponding to the frequency of the triplet. This strategy, aiming to address the long-tail distribution of relationships, differentiates the approach from the linguistic-based ones described previously, managing to showcase improvement in generalizability and accuracy.

CS knowledge is also used to tackle other CV problems, such as understanding relevant information about unknown objects existing in a visual scene. In (Icarte et al. 2017) or (Young et al. 2016), for instance, external CS Web-based repositories are used as a source for locating relevant information. The general idea in both approaches is to retrieve as much information as possible about the recognizable objects that, based on diverse metrics, are considered semantically close to the unknown ones. RelatedTo, IsA and UsedFor properties found in ConceptNet, or comment boxes retrieved from DBpedia, are all relevant knowledge that can be used for developing semantic similarity measures. Similar, to some extent, is the approach presented in (Ruiz-Sarmiento, Galindo, and Gonzalez-Jimenez 2016), which relies on RDF graphs with a probabilistic distribution over relations to capture the CS knowledge, but reverts also to a human-supervised learning approach whenever unknown objects are encountered.
Ability to Learn New Knowledge

The majority of state-of-the-art studies covered in the previous sections exploit a loosely-coupled combination of ML and knowledge-based methodologies. A tighter integration of methodologies of the two fields is expected to achieve much broader impact, especially in the process of learning. In the sequel, we consider prominent attempts towards this direction, originating either from a model-free standpoint or from a more declarative, inductive-based perspective.

Model-Free Learning

Recent studies devise methods that attempt to exploit information contained in higher-level representations, in order to improve scalability and generalization for tasks such as Zero-Shot Learning (ZSL). ZSL is the problem of recognizing objects for which no visual examples have been obtained and is typically achieved by exploring a semantic embedding space, e.g., an attribute or semantic word vector space.

For example, Fu et al. (2015) utilize a semantic class label graph, which results in a more accurate distance metric in the semantic embedding space and an improved performance in ZSL. Likewise, Xian et al. (2016) address the same problem by proposing a novel latent embedding model, which learns a compatibility function between the image and semantic (class) embeddings. The model utilizes image and class-level side-information that is either collected through human annotation or in an unsupervised way from a Web repository of text corpora.

Lee et al. (2018) propose a novel deep learning architecture for multi-label ZSL, which relies on KGs for the discovery of the relationships between multiple classes of objects. The KG is built on knowledge stemming from WordNet and contains three types of label relations: super-subordinate, positive correlation, and negative correlation. The KG is coupled to a GGNN-type module for predicting labels.

In the same vein, Wang, Ye and Gupta (2018) exploit the information contained in KGs about unseen objects, in order to infer visual attributes that enable their detection. The KG nodes correspond to semantic categories and the edges to semantic relationships, whereas the input to each node is the vector representation (semantic embedding) of each category. A GCN is used to transfer information between different categories. This way, by utilizing the semantic embeddings of a novel category, the method can link categories in the KG to familiar ones and, thus, infer its attributes. The experimental evaluation demonstrated a significant improvement on the ImageNet dataset, while the ablation studies indicated that the incorporation of KGs enabled the system to learn meaningful classifiers on top of semantic embeddings.
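A minimal numpy sketch of this propagation scheme is shown below: a two-layer GCN maps per-category word embeddings to classifier weights over a toy graph. The graph, shapes and random initialization are invented for illustration; the cited work additionally trains the stack so that the output rows for seen categories regress pretrained classifier weights, after which the rows for unseen categories can be used directly.

```python
import numpy as np

def normalize_adj(a):
    """Symmetric normalization (A+I) scaled by degree, as in standard GCNs."""
    a = a + np.eye(a.shape[0])
    d = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d[:, None] * d[None, :]

def gcn_forward(x, a_hat, weights):
    """Stacked propagation layers: H' = ReLU(A_hat @ H @ W)."""
    h = x
    for w in weights[:-1]:
        h = np.maximum(a_hat @ h @ w, 0.0)
    return a_hat @ h @ weights[-1]   # one classifier vector per category

# Toy setup: 4 categories linked in a chain, 300-d semantic embeddings
# as node inputs, 512-d visual classifiers as node outputs.
rng = np.random.default_rng(2)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 300))
Ws = [rng.normal(scale=0.01, size=(300, 128)),
      rng.normal(scale=0.01, size=(128, 512))]
classifiers = gcn_forward(X, normalize_adj(A), Ws)   # shape (4, 512)
```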
In (Marino, Salakhutdinov, and Gupta 2017), the use of structured prior knowledge led to improved performance on the task of multi-label image classification. The KG is built using WordNet for the concepts and Visual Genome for the relations among them. An interesting aspect of this study is the introduction of a novel NN architecture, the Graph Search Neural Network, as a means to efficiently incorporate large knowledge graphs, in order to be exploited for CV tasks.

Inductive Learning

The benefits of developing intelligent visual components with reasoning and learning abilities are becoming evident in domains broader than CV, such as the field of Robotics. This conclusion was nicely demonstrated in a recent special issue of the AI Journal (Rajan and Saffiotti 2017), where causality-based reasoning emerged as a key contribution. It is, therefore, interesting to investigate how the recent trend of combining knowledge-based representations with model-free models for the development of intelligent robots is making an impact in related OP research.

A highly prominent line of research for modeling uncertainty and high-level action knowledge is focusing on combining expressive logical probabilistic formalisms, ontological models and ML. In (Antanas et al. 2018a), for example, the system learns probabilistic first-order rules describing relational affordances and pre-grasp configurations from uncertain video data. It uses the ProbFOIL+ rule learner, along with a simple ontology capturing object categories.
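The flavor of such probabilistic first-order rules can be reproduced with the ProbLog Python package; the rules, probabilities and objects below are hand-written for illustration and are not the ones learned by ProbFOIL+ in the cited work.

```python
from problog.program import PrologString
from problog import get_evaluatable

# Invented relational affordance rules: a cup is very likely graspable
# when upright, much less so otherwise.
MODEL = """
0.9::graspable(X) :- cup(X), upright(X).
0.4::graspable(X) :- cup(X), \\+ upright(X).
cup(obj1). upright(obj1).
cup(obj2).
query(graspable(obj1)).
query(graspable(obj2)).
"""

result = get_evaluatable().create_from(PrologString(MODEL)).evaluate()
for query, prob in result.items():
    print(query, prob)   # graspable(obj1): 0.9, graspable(obj2): 0.4
```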
More recently, Moldovan et al. (2018) significantly extended this approach, using the Distributional Clauses (DCs) formalism, which integrates logic programming and probability theory. DCs can use both continuous and discrete variables, which is highly appropriate for modeling uncertainty, in comparison for instance to ProbLog, which is commonly found in the relevant literature. Compared to approaches that model affordances with Bayesian Networks, this approach scales much better but, most importantly, due to its relational nature, structural parts of the theory, such as the abstract action-effect rules, can be transferred to similar domains without the need to be learned again.

A similar objective is pursued by Katzouris et al. (2019), who propose an abductive-inductive incremental algorithm for learning and revising causal rules, in the form of Event Calculus programs. The Event Calculus is a highly expressive, non-monotonic formalism for capturing causal and temporal relations in dynamic domains. The approach uses the XHAIL system as a basis, but sacrifices completeness due to its incremental nature. Yet, it is able to learn weighted causal temporal rules, in the form of Markov Logic Networks, scaling up to large volumes of sequential data with a time-like structure.

Also worth mentioning is the study of Antanas et al. (2018b), which, instead of learning how to map visual perceptions to task-dependent grasps, uses a probabilistic logic module to semantically reason about the most likely object part to be grasped, given the object properties and task constraints. The approach models rules in Causal Probabilistic logic, implemented in ProbLog, in order to reason about object categories, about the most affordable tasks and about the best semantic pre-grasps.

Open Problems and Research Questions

The review of the state of the art reveals prominent solutions for various OP-related topics, as well as novel contributions that offer new insights (Table 1). The analysis can also help frame open questions towards combining ML and knowledge-based approaches in the given context.

Obtaining Human Commonsense

The exploitation of CS knowledge is a characteristic example of a still open research area. Its significance was acknowledged more than two decades ago, and the research conducted over the years contributed methods that combine the strengths of diverse fields of AI. At the same time, it is evident that there is still a long way to go; just the coupling of textual and visual embeddings, the mainstream in current VQA-related studies, has proven to be a challenging task. Further directions need to also be explored, such as performing complex forms of CS inferencing or fusing the huge volume of general knowledge that exists on the Web, while eliminating the bias of information found online.

Progress in the field of learning from demonstration can prove a vital contribution to CS inferencing and vice versa. Leaving the visual challenges involved aside, this application domain, characterized by the central role of (mostly human) agents, offers theory-building opportunities on diverse perspectives. Interaction with human users calls for intuitive means of communication, where high-level, declarative languages seem to offer a natural way of capturing human intuition. Transferring knowledge between high-level languages and low-level models is a key area of investigation for future symbiotic systems and a fruitful domain for combining data-driven and symbolic approaches.

Understanding Causality

Still, the most demanding outcomes that are expected by the integration of knowledge-based and ML methodologies concern the aspects of causality learning and explainability. Existing works on harvesting causality knowledge do not yet offer convincing models. As argued in (Pearl 2018), ML needs to go beyond the detection of associations, in order to exhibit explainability and counterfactual reasoning.

The black-box character of ML-based methods hinders the understanding of their behavior, and eventually the acceptance of such systems. For example, recent studies demonstrate the fundamental inability of neural networks to efficiently and robustly learn visual relations, which renders the high performance that networks of this type often achieve worth a closer investigation (Kim, Ricci, and Serre 2018; Rosenfeld, Zemel, and Tsotsos 2018). Advancement in exploiting CS knowledge is expected to offer significant leverage in understanding and reasoning with causal relations. And, of course, transparent reasoning is vital in understanding the abilities and constraints of existing systems. Yet, as indicated in the current review, this latter direction is still not pursued in a coordinated and structured way.

Achieving a Tighter Integration

Ultimately, unifying logical and probabilistic graphical models seems to be at the heart of handling the majority of real-world problems. Recent studies show that even a loosely-coupled integration can achieve better accuracy in classification problems with small datasets in comparison with end-to-end deep networks, and comparable accuracy with larger datasets (see e.g., (Riley and Sridharan 2019; Basu, Shakerin, and Gupta 2020)). A tighter integration is highly anticipated, as it will help build systems that learn from data, while still being able to generalize to domains other than the ones trained for. Existing solutions are indeed promising, as for example approaches based on the widely used Markov Logic, which nevertheless introduces limitations on both the theoretical and the practical level (Domingos and Lowd 2019). Its first-order nature, for instance, often contradicts the non-monotonicity met in CS domains. The support for complex tasks, such as causal, temporal or counterfactual reasoning, in a non-monotonic fashion and over rich conceptual representations, unfolds a series of research questions worth exploring in the near future.
Table 1: Overview of the reviewed literature

| Indicative Recent Literature | CV Problem Focus | ML Methods Applied | KB Methods Applied | KB Contribution | KB-ML Impact |
| --- | --- | --- | --- | --- | --- |
| (Chuang et al. 2018), (Ye et al. 2017), (Sawatzky et al. 2019), (Chao et al. 2015), (Ramanathan et al. 2015) | affordance detection | CNN, GNN, GGNN | Knowledge Graphs | 3, 4, 5, 7 | offers new insights |
| (Beetz et al. 2018), (Ramirez-Amaro, Beetz, and Cheng 2017), (Lemaignan et al. 2017), (Agostini, Torras, and Woergoetter 2017), (Moldovan et al. 2018) | affordance detection | scoring functions, probabilistic programming models, Bayesian Networks | OWL Ontology | 1, 2, 3, 4, 5, 6, 9 | offers new insights and improves SotA |
| (Icarte et al. 2017), (Redmon and Farhadi 2017), (Liu et al. 2018b) | object detection | RCNN, CNN | Knowledge Graph, BLN | 1, 3, 4, 5, 8 | offers new insights |
| (Gemignani et al. 2016), (Tenorth and Beetz 2017), (Young et al. 2016), (Beetz et al. 2018) | object detection | scoring functions, probabilistic programming models | OWL Ontology, DL, MLN | 1, 2, 3, 4, 5, 8, 9 | improves SotA |
| (Chernova et al. 2017), (Young et al. 2017), (Aditya et al. 2018) | scene understanding | probabilistic programming, Bayesian Network | BLN | 2, 3, 4, 8 | offers new insights |
| (Gu et al. 2019), (Li et al. 2017), (Chen et al. 2018) | scene understanding | GGNN | Knowledge Graph | 3, 4, 7 | improves SotA |
| (Krishna et al. 2017), (Zhu et al. 2015a), (Li et al. 2019), (Wu et al. 2016a), (Wu et al. 2018), (Li, Su, and Zhu 2017), (Sadeghi, Kumar Divvala, and Farhadi 2015), (Shah et al. 2019), (Su et al. 2018), (Narasimhan and Schwing 2018), (Wang et al. 2018) | VQA | CNN, LSTM, RCNN | Knowledge Graphs (RDF mostly) | 1, 2, 3, 4, 5, 8 | offers new insights and improves SotA |
| (Vedantam et al. 2015), (Lin and Parikh 2015) | VQA | Gaussian Mixture Model, SVM | RDF Graph | 2, 3, 4, 5 | improves SotA |
| (Lu et al. 2018) | VQA | Gated Recurrent Unit Network | RDF Graph | 1, 2, 4, 8 | offers new insights |
| (Hu et al. 2017), (Johnson et al. 2017), (Yi et al. 2018), (Santoro et al. 2017) | visual reasoning | Neural Module Network | Symbolic Programming Language | 2, 3, 5 | offers new insights |
| (Suchan et al. 2017), (Riley and Sridharan 2019), (Basu, Shakerin, and Gupta 2020) | visual reasoning, VQA | CNN, RCNN | Non-monotonic logics, ASP | 2, 3, 4, 5, 6, 7, 9 | offers new insights |
| (Marino, Salakhutdinov, and Gupta 2017), (Lee et al. 2018), (Wang, Ye, and Gupta 2018) | image classification / zero-shot recognition | GGNN, GCN | Knowledge Graph, RDF Graph | 1, 2, 5 | offers new insights and improves SotA |
| (Fu et al. 2015), (Xian et al. 2016) | image classification / zero-shot recognition | Latent embedding model, Markov Chain Process | Knowledge Graph | 1, 2, 5 | offers new insights |
| (Antanas et al. 2018a), (Antanas et al. 2018b), (Moldovan et al. 2018), (Katzouris et al. 2019) | affordance learning | scoring functions, probabilistic programming models | FOL, Causal Probabilistic Logic, MLN, Event Calculus | 1, 2, 3, 6, 7, 9 | improves SotA |

KB Contribution: 1: concept abstraction/reuse, 2: complex data querying, 3: spatial reasoning, 4: contextual reasoning, 5: relational reasoning, 6: temporal reasoning, 7: causal reasoning, 8: access to open-domain knowledge, 9: formal semantics.

Conclusions

In this paper, we reviewed approaches that rely on both knowledge-based and data-driven methods, in order to offer solutions to the field of intelligent object perception. By adopting a knowledge-driven, rather than a problem-specific, grouping, we analyzed a multitude of approaches that attempt to unify high-level knowledge with diverse machine learning systems. The review revealed open and prominent directions, showing clear evidence that hybrid methods constitute an avenue worth exploring.

References

[Aditya et al. 2018] Aditya, S.; Yang, Y.; Baral, C.; Aloimonos, Y.; and Fermüller, C. 2018. Image understanding using vision and reasoning through scene description graph. Computer Vision and Image Understanding 173:33–45.
[Aditya, Yang, and Baral 2019] Aditya, S.; Yang, Y.; and Baral, C. 2019. Integrating knowledge and reasoning in image understanding. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 6252–6259. International Joint Conferences on Artificial Intelligence Organization.
[Agostini, Torras, and Woergoetter 2017] Agostini, A.; Torras, C.; and Woergoetter, F. 2017. Efficient interactive decision-making framework for robotic applications. Artificial Intelligence 247:187–212.
[Antanas et al. 2018a] Antanas, L.; Dries, A.; Moreno, P.; and De Raedt, L. 2018a. Relational affordance learning for task-dependent robot grasping. In Lachiche, N., and Vrain, C., eds., Inductive Logic Programming, 1–15. Cham: Springer International Publishing.
[Antanas et al. 2018b] Antanas, L.; Moreno, P.; Neumann, M.; de Figueiredo, R. P.; Kersting, K.; Santos-Victor, J.; and De Raedt, L. 2018b. Semantic and geometric reasoning for robotic grasping: a probabilistic logic approach. Autonomous Robots 1–26.
[Basu, Shakerin, and Gupta 2020] Basu, K.; Shakerin, F.; and Gupta, G. 2020. AQuA: ASP-based visual question answering. In Komendantskaya, E., and Liu, Y. A., eds., Practical Aspects of Declarative Languages, 57–72. Springer International Publishing.
[Beetz et al. 2018] Beetz, M.; Beßler, D.; Haidu, A.; Pomarlan, M.; Bozcuoğlu, A. K.; and Bartels, G. 2018. KnowRob 2.0: a 2nd generation knowledge processing framework for cognition-enabled robotic agents. In 2018 IEEE ICRA, 512–519. IEEE.
[Bengio et al. 2019] Bengio, Y.; Deleu, T.; Rahaman, N.; Ke, R.; Lachapelle, S.; Bilaniuk, O.; Goyal, A.; and Pal, C. 2019. A meta-transfer objective for learning to disentangle causal mechanisms. arXiv preprint arXiv:1901.10912.
[Brachman and Levesque 2004] Brachman, R., and Levesque, H. 2004. Knowledge Representation and Reasoning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
[Chao et al. 2015] Chao, Y. W.; Wang, Z.; He, Y.; Wang, J.; and Deng, J. 2015. HICO: A benchmark for recognizing human-object interactions in images. In IEEE ICCV, 1017–1025.
[Chen et al. 2018] Chen, X.; Li, L. J.; Fei-Fei, L.; and Gupta, A. 2018. Iterative visual reasoning beyond convolutions. In IEEE CVPR, 7239–7248.
[Chernova et al. 2017] Chernova, S.; Chu, V.; Daruna, A.; Garrison, H.; Hahn, M.; Khante, P.; Liu, W.; and Thomaz, A. 2017. Situated Bayesian reasoning framework for robots operating in diverse everyday environments. In International Symposium on Robotics Research (ISRR).
[Chuang et al. 2018] Chuang, C. Y.; Li, J.; Torralba, A.; and Fidler, S. 2018. Learning to act properly: Predicting and explaining affordances from images. In IEEE CVPR, 975–983.
[Daruna et al. 2019] Daruna, A.; Liu, W.; Kira, Z.; and Chernova, S. 2019. RoboCSE: Robot common sense embedding. arXiv preprint arXiv:1903.00412.
[Davis and Marcus 2015] Davis, E., and Marcus, G. 2015. Commonsense reasoning and commonsense knowledge in artificial intelligence. Commun. ACM 58(9):92–103.
[Dean, Allen, and Aloimonos 1995] Dean, T.; Allen, J.; and Aloimonos, Y. 1995. Artificial Intelligence: Theory and Practice. Redwood City, CA, USA: Benjamin-Cummings Publishing Co., Inc.
[Deng et al. 2014] Deng, J.; Ding, N.; Jia, Y.; Frome, A.; Murphy, K.; Bengio, S.; Li, Y.; Neven, H.; and Adam, H. 2014. Large-scale object classification using label relation graphs. In ECCV, 48–64. Springer.
[Domingos and Lowd 2019] Domingos, P., and Lowd, D. 2019. Unifying logical and statistical AI with Markov logic. Communications of the ACM 62(7):74–83.
[Fu et al. 2015] Fu, Z.; Xiang, T.; Kodirov, E.; and Gong, S. 2015. Zero-shot object recognition by semantic manifold distance. In IEEE CVPR, 2635–2644.
[Geffner 2018] Geffner, H. 2018. Model-free, model-based, and general intelligence. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI'18, 10–17. AAAI Press.
[Gemignani et al. 2016] Gemignani, G.; Capobianco, R.; Bastianelli, E.; Bloisi, D. D.; Iocchi, L.; and Nardi, D. 2016. Living with robots: Interactive environmental knowledge acquisition. Robotics and Autonomous Systems 78:1–16.
[Goyal et al. 2019] Goyal, Y.; Khot, T.; Agrawal, A.; Summers-Stay, D.; Batra, D.; and Parikh, D. 2019. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. IJCV 127(4):398–414.
Knowledge Representation and [Herath, Harandi, and Porikli 2017] Herath, S.; Harandi, M.; Reasoning. San Francisco, CA, USA: Morgan Kaufmann and Porikli, F. 2017. Going deeper into action recognition: Publishers Inc. A survey. IMAVIS 60:4–21. [Hu et al. 2017] Hu, R.; Andreas, J.; Rohrbach, M.; Darrell, [Liu et al. 2018b] Liu, Y.; Wang, R.; Shan, S.; and Chen, T.; and Saenko, K. 2017. Learning to Reason: End-to-End X. 2018b. Structure Inference Net: Object Detection Us- Module Networks for Visual Question Answering. IEEE ing Scene-Level Context and Instance-Level Relationships. ICCV 2017-Octob(Figure 1):804–813. IEEE CVPR 6985–6994. [Icarte et al. 2017] Icarte, R. T.; Baier, J. A.; Ruz, C.; and [Lu et al. 2016] Lu, C.; Krishna, R.; Bernstein, M.; and Fei- Soto, A. 2017. How a general-purpose commonsense ontol- Fei, L. 2016. Visual relationship detection with language ogy can improve performance of learning-based image re- priors. Lecture Notes in Computer Science (including sub- trieval. arXiv preprint arXiv:1705.08844. series Lecture Notes in Artificial Intelligence and Lecture [Johnson et al. 2017] Johnson, J.; Hariharan, B.; van der Notes in Bioinformatics) 9905 LNCS(Figure 2):852–869. Maaten, L.; Hoffman, J.; Fei-Fei, L.; Lawrence Zitnick, C.; [Lu et al. 2018] Lu, P.; Ji, L.; Zhang, W.; Duan, N.; Zhou, and Girshick, R. 2017. Inferring and executing programs M.; and Wang, J. 2018. R-VQA: Learning visual relation for visual reasoning. In IEEE ICCV, 2989–2998. facts with semantic attention for visual question answering. [Katzouris et al. 2019] Katzouris, N.; Michelioudakis, E.; Proceedings of the ACM SIGKDD International Conference Artikis, A.; and Paliouras, G. 2019. Online learning of on Knowledge Discovery and Data Mining 1880–1889. weighted relational rules for complex event recognition. In [Marino, Salakhutdinov, and Gupta 2017] Marino, K.; Berlingerio, M.; Bonchi, F.; Gärtner, T.; Hurley, N.; and Salakhutdinov, R.; and Gupta, A. 2017. The more you Ifrim, G., eds., Machine Learning and Knowledge Discov- know: using knowledge graphs for image classification. ery in Databases, 396–413. Cham: Springer International IEEE CVPR 2017-Janua:20–28. Publishing. [Moldovan et al. 2018] Moldovan, B.; Moreno, P.; Nitti, D.; [Kim, Ricci, and Serre 2018] Kim, J.; Ricci, M.; and Serre, Santos-Victor, J.; and De Raedt, L. 2018. Relational af- T. 2018. Not-so-clevr: learning same–different rela- fordances for multiple-object manipulation. Autonomous tions strains feedforward neural networks. Interface focus Robots 42(1):19–44. 8(4):20180011. [Narasimhan and Schwing 2018] Narasimhan, M., and [Krishna et al. 2017] Krishna, R.; Zhu, Y.; Groth, O.; John- Schwing, A. G. 2018. Straight to the facts: Learning knowl- son, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, edge base retrieval for factual visual question answering. In L. J.; Shamma, D. A.; Bernstein, M. S.; and Fei-Fei, L. 2017. Proceedings of the ECCV (ECCV), 451–468. Visual Genome: Connecting Language and Vision Using [Nickel et al. 2016] Nickel, M.; Murphy, K.; Tresp, V.; and Crowdsourced Dense Image Annotations. IJCV 123(1):32– Gabrilovich, E. 2016. A review of relational machine 73. learning for knowledge graphs. Proceedings of the IEEE [Lee et al. 2018] Lee, C. W.; Fang, W.; Yeh, C. K.; and 104(1):11–33. Wang, Y. C. F. 2018. Multi-label Zero-Shot Learning with [Pearl 2018] Pearl, J. 2018. Theoretical impediments to ma- Structured Knowledge Graphs. IEEE CVPR 1576–1585. chine learning with seven sparks from the causal revolution. 
[Lemaignan et al. 2017] Lemaignan, S.; Warnier, M.; Sisbot, arXiv preprint arXiv:1801.04016. E. A.; Clodic, A.; and Alami, R. 2017. Artificial cognition [Rajan and Saffiotti 2017] Rajan, K., and Saffiotti, A., eds. for social human–robot interaction: An implementation. Ar- 2017. Special Issue on AI and Robotics, volume 247. El- tificial Intelligence 247:45–69. sevier. 1–440. [Li et al. 2015] Li, Y.; Tarlow, D.; Brockschmidt, M.; and [Ramanathan et al. 2015] Ramanathan, V.; Li, C.; Deng, J.; Zemel, R. 2015. Gated graph sequence neural networks. and Han, W. 2015. Learning semantic relationships for bet- arXiv preprint arXiv:1511.05493. ter action retrieval in images ( Supplementary ). Computer [Li et al. 2017] Li, R.; Tapaswi, M.; Liao, R.; Jia, J.; Urtasun, Vision and Pattern Recognition 1–4. R.; and Fidler, S. 2017. Situation recognition with graph [Ramirez-Amaro, Beetz, and Cheng 2017] Ramirez-Amaro, neural networks. In IEEE ICCV, 4173–4182. K.; Beetz, M.; and Cheng, G. 2017. Transferring skills [Li et al. 2019] Li, H.; Wang, P.; Shen, C.; and Hengel, A. to humanoid robots by extracting semantic representations v. d. 2019. Visual question answering as reading compre- from observations of human activities. Artificial Intelligence hension. In IEEE CVPR, 6319–6328. 247:95–118. [Li, Su, and Zhu 2017] Li, G.; Su, H.; and Zhu, W. 2017. [Ravichandar et al. 2019] Ravichandar, H.; Polydoros, A. S.; Incorporating external knowledge to answer open-domain Chernova, S.; and Billard, A. 2019. Robot learning from visual questions with dynamic memory networks. arXiv demonstration: A review of recent advances. Annual Review preprint arXiv:1712.00733. of Control, Robotics, and Autonomous Systems In Press. [Lin and Parikh 2015] Lin, X., and Parikh, D. 2015. Don’t [Redmon and Farhadi 2017] Redmon, J., and Farhadi, A. just listen, use your imagination: Leveraging visual common 2017. Yolo9000: better, faster, stronger. In IEEE CVPR, sense for non-visual tasks. In IEEE CVPR, 2984–2993. 7263–7271. [Liu et al. 2018a] Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; [Riley and Sridharan 2019] Riley, H., and Sridharan, M. Chen, J.; Liu, X.; and Pietikäinen, M. 2018a. Deep learn- 2019. Integrating non-monotonic logical reasoning and in- ing for generic object detection: A survey. arXiv preprint ductive learning with deep learning for explainable visual arXiv:1809.02165. question answering. Frontiers in Robotics and AI 6:125. [Rosenfeld, Zemel, and Tsotsos 2018] Rosenfeld, A.; mon sense through visual abstraction. In IEEE ICCV, 2542– Zemel, R.; and Tsotsos, J. K. 2018. The elephant in the 2550. room. arXiv preprint arXiv:1808.03305. [Wang et al. 2018] Wang, P.; Wu, Q.; Shen, C.; Dick, A.; and [Ruiz-Sarmiento, Galindo, and Gonzalez-Jimenez 2016] van den Hengel, A. 2018. Fvqa: Fact-based visual question Ruiz-Sarmiento, J.-R.; Galindo, C.; and Gonzalez-Jimenez, answering. IEEE Trans. on PAMI 40(10):2413–2427. J. 2016. Probability and common-sense: Tandem towards [Wang, Ye, and Gupta 2018] Wang, X.; Ye, Y.; and Gupta, A. robust robotic object recognition in ambient assisted liv- 2018. Zero-Shot Recognition via Semantic Embeddings and ing. In Ubiquitous Computing and Ambient Intelligence. Knowledge Graphs. IEEE CVPR 6857–6866. Springer. 3–8. [Wu et al. 2016a] Wu, Q.; Wang, P.; Shen, C.; Dick, A.; and [Sadeghi, Kumar Divvala, and Farhadi 2015] Sadeghi, F.; van den Hengel, A. 2016a. Ask me anything: Free-form vi- Kumar Divvala, S. K.; and Farhadi, A. 2015. 
Viske: Visual sual question answering based on knowledge from external knowledge extraction and question answering by visual sources. In IEEE CVPR, 4622–4630. verification of relation phrases. In IEEE CVPR, 1456–1464. [Wu et al. 2016b] Wu, Z.; Fu, Y.; Jiang, Y.-G.; and Sigal, L. [Santoro et al. 2017] Santoro, A.; Raposo, D.; Barrett, D. 2016b. Harnessing object and scene semantics for large- G. T.; Malinowski, M.; Pascanu, R.; Battaglia, P.; and Lilli- scale video understanding. In TheIEEE CVPR. crap, T. 2017. A simple neural network module for relational [Wu et al. 2017] Wu, Q.; Teney, D.; Wang, P.; Shen, C.; Dick, reasoning. (Nips). A.; and van den Hengel, A. 2017. Visual question answer- [Sawatzky et al. 2019] Sawatzky, J.; Souri, Y.; Grund, C.; ing: A survey of methods and datasets. CVIU 163:21–40. and Gall, J. 2019. What Object Should I Use? - Task Driven [Wu et al. 2018] Wu, Q.; Shen, C.; Wang, P.; Dick, A.; and Object Detection. Van Den Hengel, A. 2018. Image Captioning and Vi- [Scarselli et al. 2008] Scarselli, F.; Gori, M.; Tsoi, A. C.; Ha- sual Question Answering Based on Attributes and External genbuchner, M.; and Monfardini, G. 2008. The graph neural Knowledge. IEEE Trans. on PAMI 40(6):1367–1381. network model. IEEE Transactions on NN 20(1):61–80. [Xian et al. 2016] Xian, Y.; Akata, Z.; Sharma, G.; Nguyen, [Shah et al. 2019] Shah, S.; Mishra, A.; Yadati, N.; and Q.; Hein, M.; and Schiele, B. 2016. Latent embeddings for Talukdar, P. P. 2019. Kvqa: Knowledge-aware visual ques- zero-shot classification. In IEEE CVPR, 69–77. tion answering. AAAI. [Ye et al. 2017] Ye, C.; Yang, Y.; Mao, R.; Fermuller, C.; and [Stone et al. 2016] Stone, P.; Brooks, R.; Brynjolfsson, E.; Aloimonos, Y. 2017. What can i do around here? Deep Calo, R.; Etzioni, O.; Hager, G.; Hirschberg, J.; Kalyanakr- functional scene understanding for cognitive robots. IEEE ishnan, S.; Kamar, E.; Kraus, S.; Leyton-Brown, K.; Parkes, ICRA 4604–4611. D.; Press, W.; Saxenian, A.; Shah, J.; Tambe, M.; ; and [Yi et al. 2018] Yi, K.; Torralba, A.; Wu, J.; Kohli, P.; Gan, Teller, A. 2016. Artificial intelligence and life in 2030. One C.; and Tenenbaum, J. B. 2018. Neural-symbolic VQA: Dis- Hundred Year Study on Artificial Intelligence: Report of the entangling reasoning from vision and language understand- 2015-2016 Study Panel. ing. Advances in Neural Information Processing Systems [Su et al. 2018] Su, Z.; Zhu, C.; Dong, Y.; Cai, D.; Chen, Y.; 2018-December(NeurIPS):1031–1042. and Li, J. 2018. Learning Visual Knowledge Memory Net- [Young et al. 2016] Young, J.; Basile, V.; Kunze, L.; Cabrio, works for Visual Question Answering. IEEE CVPR 7736– E.; and Hawes, N. 2016. Towards lifelong object learning by 7745. integrating situated robot perception and semantic web min- [Suchan et al. 2017] Suchan, J.; Bhatt, M.; Walega, P. A.; ing. In Proceedings of the Twenty-second European Confer- and Schultz, C. P. L. 2017. Visual explanation by high- ence on Artificial Intelligence, 1458–1466. IOS Press. level abduction: On answer-set programming driven reason- [Young et al. 2017] Young, J.; Basile, V.; Suchi, M.; Kunze, ing about moving objects. CoRR abs/1712.00840. L.; Hawes, N.; Vincze, M.; and Caputo, B. 2017. Mak- [Tenorth and Beetz 2017] Tenorth, M., and Beetz, M. 2017. ing sense of indoor spaces using semantic web mining and Representations for robot knowledge in the knowrob frame- situated robot perception. In European Semantic Web Con- work. Artificial Intelligence 247:151–169. ference, 299–313. Springer. [Zhu et al. 
2015a] Zhu, Y.; Groth, O.; Bernstein, M.; and Fei- [Torabi, Warnell, and Stone 2019] Torabi, F.; Warnell, G.; Fei, L. 2015a. Visual7W: Grounded Question Answering in and Stone, P. 2019. Recent advances in imitation learn- Images. ing from observation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, [Zhu et al. 2015b] Zhu, Y.; Zhang, C.; Ré, C.; and Fei-Fei, L. IJCAI-19, 6325–6331. International Joint Conferences on 2015b. Building a large-scale multimodal knowledge base Artificial Intelligence Organization. for visual question answering. CoRR abs/1507.05670. [van Harmelen et al. 2007] van Harmelen, F.; van Harmelen, [Zhu, Fathi, and Fei-Fei 2014] Zhu, Y.; Fathi, A.; and Fei- F.; Lifschitz, V.; and Porter, B. 2007. Handbook of Knowl- Fei, L. 2014. Reasoning about object affordances in a edge Representation. San Diego, USA: Elsevier Science. knowledge base representation. In Fleet, D.; Pajdla, T.; Schiele, B.; and Tuytelaars, T., eds., ECCV, 408–424. Cham: [Vedantam et al. 2015] Vedantam, R.; Lin, X.; Batra, T.; Springer International Publishing. Lawrence Zitnick, C.; and Parikh, D. 2015. Learning com-