Visual Intelligence using Neural-Symbolic Learning and Reasoning

H.L.H. (Leo) de Penning
TNO Behaviour and Societal Sciences
Kampweg 5, Soesterberg, The Netherlands
leo.depenning@tno.nl

Abstract

The DARPA Mind's Eye program seeks to develop in machines a capability that currently exists only in animals: visual intelligence. This short paper describes the initial results of a Neural-Symbolic approach for action recognition and description, to be demonstrated at the 7th International Workshop on Neural-Symbolic Learning and Reasoning.

Introduction

Humans perform a wide range of visual tasks with ease, which no current artificial intelligence can do in a robust way. Humans have inherently strong spatial judgment and are able to learn new spatiotemporal concepts directly from visual experience. Humans can visualize scenes and objects, as well as the actions involving those objects, and possess a powerful ability to manipulate those imagined scenes mentally to solve problems.
A machine-based implementation of such abilities would require major advances in each of the following technology focus areas: robust recognition, anomaly detection, description, and gap-filling (i.e., interpolation, prediction, and postdiction). These human intelligence-inspired capabilities are envisaged in service of systems that directly support humans in complex perceptual and reasoning tasks (e.g. Unmanned Ground Vehicles).

The DARPA Mind's Eye program seeks to develop in machines a capability that currently exists only in animals: visual intelligence [Donlon, 2010]. In particular, the program pursues the capability to learn generally applicable and generative representations of action between objects in a scene, directly from visual inputs, and then reason over those learned representations.
A key distinction between this research and the state of the art in machine vision is that the latter has made continual progress in recognizing a wide range of objects and their properties, what might be thought of as the nouns in the description of a scene. The focus of Mind's Eye is to add the perceptual and cognitive underpinnings for recognizing and reasoning about the verbs in those scenes, enabling a more complete narrative of action in the visual experience.

The contribution of TNO, a Dutch research institute and one of the teams working on the Mind's Eye program, is called CORTEX and is presented in this paper. CORTEX is a Visual Intelligence (VI) system consisting of a visual processing pipeline and a reasoning component that reasons about events detected in visual inputs (e.g. from a movie or live camera) in order to: i) recognize actions in terms of verbs, ii) describe these actions in natural language, iii) detect anomalies, and iv) fill gaps (e.g. video blackouts due to missing frames, occlusion by moving objects, or entities receding behind objects).

Neural-Symbolic Cognitive Agent

To learn spatiotemporal relations between detected events (e.g. the size of bounding boxes, the speed of moving entities, changes in relative distance between entities) and verbs describing actions (e.g. fall, bounce, dig), the reasoning component uses a Neural-Symbolic Cognitive Agent (NSCA) based on a Recurrent Temporal Restricted Boltzmann Machine (RTRBM), described in [de Penning et al., 2011] and presented during the IJCAI 2011 poster session. This cognitive agent is able to learn hypotheses about temporal relations between observed events and related actions, and can express those hypotheses in temporal logic or natural language. This enables the reasoning component, and thus CORTEX, to explain and describe the cognitive underpinnings of the recognition task, as stated in the focus of the Mind's Eye program.

The hypotheses are modelled in an RTRBM, where each hidden unit Hj represents a hypothesis about a specific relation between events e and verbs v observed in the visible layer V, and hypotheses h_{t-1} that were true in the previous time frame. Based on a Bayesian inference mechanism, the NSCA can reason about observed actions by selecting the most likely hypotheses h using random Gaussian sampling of the posterior probability distribution (i.e. h ~ P(H | V = e ∧ v, H_{t-1} = h_{t-1})) and calculating the conditional probability, or likelihood, of all events and verbs assuming the selected hypotheses are true (i.e. P(V | H = h)). The difference between the detected events, the available ground truth, and the inferred events and verbs can then be used by the NSCA to train the RTRBM (i.e. update its weights) in order to improve the hypotheses.
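The inference-and-update loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the CORTEX implementation: the layer sizes, the placeholder weights, and the Bernoulli sampling of hidden units (the paper uses Gaussian sampling of the posterior) are all simplifications introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: n_v visible units (event and verb detections),
# n_h hidden units (hypotheses). In a trained NSCA these weights would be
# learned; here they are random placeholders.
n_v, n_h = 8, 4
W = rng.normal(0, 0.1, (n_v, n_h))       # visible-to-hidden weights
W_prev = rng.normal(0, 0.1, (n_h, n_h))  # hidden(t-1)-to-hidden weights
b_h = np.zeros(n_h)                      # hidden biases
b_v = np.zeros(n_v)                      # visible biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def infer_hypotheses(v, h_prev):
    """Select hypotheses h ~ P(H | V=v, H_{t-1}=h_prev) by sampling."""
    p_h = sigmoid(b_h + v @ W + h_prev @ W_prev)
    h = (rng.random(n_h) < p_h).astype(float)  # simplified Bernoulli sampling
    return h, p_h

def reconstruct(h):
    """Likelihood of all events and verbs given hypotheses: P(V | H=h)."""
    return sigmoid(b_v + h @ W.T)

# One inference step on a fabricated observation vector (events + verbs).
v_obs = rng.integers(0, 2, n_v).astype(float)
h_prev = np.zeros(n_h)
h, p_h = infer_hypotheses(v_obs, h_prev)
v_model = reconstruct(h)

# Training signal: the difference between the observed (ground-truth)
# vector and the model's reconstruction drives a contrastive weight update,
# improving the hypotheses over time.
lr = 0.01
W += lr * (np.outer(v_obs, p_h) - np.outer(v_model, p_h))
```

Per time frame, the visible layer is clamped to the detected events and verbs, hypotheses are sampled from the posterior, and the reconstruction error against ground truth updates the weights online.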
Figure 1. CORTEX Dashboard for testing and evaluation.

Experiments and Results

The CORTEX system and its reasoning component are currently being tested on a recognition task using several datasets of movies and related ground truth provided by DARPA. Figure 1 shows the CORTEX Dashboard, a user interface for testing and evaluating the CORTEX system. With the CORTEX Dashboard we are able to visualize the detected entities (depicted by bounding boxes in the upper-left image), the probabilities of related events in each frame (depicted by intensities, from green for 0 to yellow for 1, in the upper-right graph), and the related verb probabilities calculated by the reasoning component (depicted by bars in the centre graph). The bottom graph shows the precision, recall and F-measure (i.e. the harmonic mean of precision and recall) for all verbs used to evaluate the output of the reasoning component. The dashboard can also visualize the learned hypotheses and extract them in the form of temporal logic or natural language, which can be used to explain and describe the recognized actions.

Initial results show that the reasoning component is capable of learning hypotheses about events and related verbs, and that it is able to reason with these hypotheses to correctly recognize actions based on detected events. Furthermore, the results show that the reasoning component is able to recognize actions that were not in the ground truth for a specific input, but were inferred from ground truth and related event patterns in other inputs. For example, reasoning about a movie whose ground truth was a chase resulted in some parts being recognized as fall, because one of the persons tilted over when she started running, although fall was not part of the ground truth for this movie.
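The per-verb F-measure shown in the dashboard's bottom graph can be computed as follows; the counts in the usage example are fabricated for illustration only.

```python
def f_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F-measure: the harmonic mean of precision and recall, per verb."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for one verb: 8 correct detections, 2 spurious, 4 missed,
# giving precision 0.8 and recall of roughly 0.667.
score = f_measure(8, 2, 4)
```

Because the harmonic mean penalizes imbalance, a verb recognizer scores well only when precision and recall are both high, which is why the dashboard reports all three measures side by side.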
Conclusions and Future Work

With the NSCA architecture, the reasoning component is able to learn and reason about spatiotemporal events in visual inputs and recognize these in terms of actions denoted by verbs. It is also able to extract learned hypotheses on events and verbs that can be used to explain the perceptual and cognitive underpinnings of the recognition task and to support other visual intelligence tasks, like description, anomaly detection and gap-filling, yet to be developed in the CORTEX system.

References

[Donlon, 2010] James Donlon. DARPA Mind's Eye Program: Broad Agency Announcement. Arlington, USA, 2010.

[de Penning et al., 2011] Leo de Penning, Artur S. d'Avila Garcez, Luís C. Lamb, and John-Jules C. Meyer. A Neural-Symbolic Cognitive Agent for Online Learning and Reasoning. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Barcelona, Spain, 2011.