=Paper=
{{Paper
|id=None
|storemode=property
|title=Visual Intelligence using Neural-Symbolic Learning and Reasoning
|pdfUrl=https://ceur-ws.org/Vol-764/paper08.pdf
|volume=Vol-764
|dblpUrl=https://dblp.org/rec/conf/nesy/Penning11
}}
==Visual Intelligence using Neural-Symbolic Learning and Reasoning==
H.L.H. (Leo) de Penning
TNO Behaviour and Societal Sciences
Kampweg 5, Soesterberg, The Netherlands.
leo.depenning@tno.nl
Abstract

The DARPA Mind's Eye program seeks to develop in machines a capability that currently exists only in animals: visual intelligence. This short paper describes the initial results of a Neural-Symbolic approach for action recognition and description to be demonstrated at the 7th International Workshop on Neural-Symbolic Learning and Reasoning.

Introduction

Humans perform a wide range of visual tasks with ease that no current artificial intelligence can do in a robust way. Humans have inherently strong spatial judgment and are able to learn new spatiotemporal concepts directly from visual experience. Humans can visualize scenes and objects, as well as the actions involving those objects, and possess a powerful ability to manipulate those imagined scenes mentally to solve problems. A machine-based implementation of such abilities would require major advances in each of the following technology focus areas: robust recognition, anomaly detection, description, and gap-filling (i.e. interpolation, prediction, and postdiction). These are human intelligence-inspired capabilities, envisaged in service of systems that directly support humans in complex perceptual and reasoning tasks (e.g. Unmanned Ground Vehicles).

The DARPA Mind's Eye program seeks to develop in machines a capability that currently exists only in animals: visual intelligence [Donlon, 2010]. In particular, this program pursues the capability to learn generally applicable and generative representations of action between objects in a scene, directly from visual inputs, and then reason over those learned representations. A key distinction between this research and the state of the art in machine vision is that the latter has made continual progress in recognizing a wide range of objects and their properties: what might be thought of as the nouns in the description of a scene. The focus of Mind's Eye is to add the perceptual and cognitive underpinnings for recognizing and reasoning about the verbs in those scenes, enabling a more complete narrative of action in the visual experience.

The contribution of TNO, a Dutch research institute and one of the teams working on the Mind's Eye program, is called CORTEX and is presented in this paper. CORTEX is a Visual Intelligence (VI) system consisting of a visual processing pipeline and a reasoning component that is able to reason about events detected in visual inputs (e.g. from a movie or live camera) in order to: i) recognize actions in terms of verbs, ii) describe these actions in natural language, iii) detect anomalies, and iv) fill gaps (e.g. video blackouts caused by missing frames, occlusion by moving objects, or entities receding behind objects).

Neural-Symbolic Cognitive Agent

To learn spatiotemporal relations between detected events (e.g. the size of bounding boxes, the speed of moving entities, changes in relative distance between entities) and verbs describing actions (e.g. fall, bounce, dig), the reasoning component uses a Neural-Symbolic Cognitive Agent (NSCA) based on a Recurrent Temporal Restricted Boltzmann Machine (RTRBM), described in [de Penning et al., 2011] and presented during the IJCAI 2011 poster session. This cognitive agent is able to learn hypotheses about temporal relations between observed events and related actions, and can express those hypotheses in temporal logic or natural language. This enables the reasoning component, and thus CORTEX, to explain and describe the cognitive underpinnings of the recognition task, as stated in the focus of the Mind's Eye program.

The hypotheses are modelled in an RTRBM, where each hidden unit Hj represents a hypothesis about a specific relation between events e and verbs v observed in the visible layer V and hypotheses h_{t-1} that have been true in the previous time frame. Based on a Bayesian inference mechanism, the NSCA can reason about observed actions by selecting the most likely hypotheses h using random Gaussian sampling of the posterior probability distribution (i.e. h ~ P(H | V = e ∧ v, H_{t-1} = h_{t-1})) and then calculating the conditional probability, or likelihood, of all events and verbs assuming the selected hypotheses are true (i.e. P(V | H = h)). The difference between the detected events, the available ground truth, and the inferred events and verbs can be used by the NSCA to train the RTRBM (i.e. update its weights) in order to improve the hypotheses.
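The inference-and-update loop described above (sample hypotheses h from the posterior given the visible layer and the previous hypotheses, score all events and verbs via P(V | H = h), then adjust the weights toward the observed data) can be sketched roughly as follows. This is a minimal illustrative sketch, not the NSCA implementation: the class name, layer sizes, and learning rate are invented, the update is a plain one-step contrastive-divergence approximation, and simple Bernoulli sampling of the hidden units stands in for the paper's Gaussian sampling scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyRTRBM:
    """Toy RTRBM-style layer: visible units V encode detected events and
    verbs; hidden units H encode hypotheses; Wp feeds h_{t-1} forward."""

    def __init__(self, n_visible, n_hidden):
        self.W = rng.normal(0.0, 0.1, (n_hidden, n_visible))  # V -> H weights
        self.Wp = rng.normal(0.0, 0.1, (n_hidden, n_hidden))  # h_{t-1} -> H
        self.b = np.zeros(n_hidden)   # hidden bias
        self.c = np.zeros(n_visible)  # visible bias

    def infer(self, v, h_prev):
        """Sample hypotheses h ~ P(H | V=v, H_{t-1}=h_prev), then compute
        the likelihood P(V | H=h) of every event/verb unit."""
        p_h = sigmoid(self.W @ v + self.Wp @ h_prev + self.b)
        h = (rng.random(p_h.shape) < p_h).astype(float)  # stochastic selection
        p_v = sigmoid(self.W.T @ h + self.c)             # event/verb likelihoods
        return h, p_v

    def train_step(self, v, h_prev, lr=0.01):
        """One CD-1-style update: move W toward the data statistics and
        away from the model's own reconstruction of v."""
        p_h = sigmoid(self.W @ v + self.Wp @ h_prev + self.b)
        v_rec = sigmoid(self.W.T @ p_h + self.c)          # reconstructed v
        p_h_rec = sigmoid(self.W @ v_rec + self.Wp @ h_prev + self.b)
        self.W += lr * (np.outer(p_h, v) - np.outer(p_h_rec, v_rec))
        return float(np.mean((v - v_rec) ** 2))           # reconstruction error

# One frame: 6 visible units (events + verbs), 4 hypothesis units.
model = TinyRTRBM(n_visible=6, n_hidden=4)
v = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])  # observed events/verbs
h = np.zeros(4)                                # no prior hypotheses
h, p_v = model.infer(v, h)
err = model.train_step(v, h)
```

In a real sequence the sampled h would be carried into the next frame as h_{t-1}, which is how the model captures temporal relations between events and verbs.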
Figure 1. CORTEX Dashboard for testing and evaluation.
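The F-measure shown in the bottom graph of the Dashboard is the harmonic mean of precision and recall. A minimal computation per verb, with invented counts for illustration:

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-measure from true-positive, false-positive,
    and false-negative counts for one verb."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure: harmonic mean of precision and recall.
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# E.g. a verb reported 10 times, 8 of them correct, with 2 occurrences missed:
p, r, f = precision_recall_f(tp=8, fp=2, fn=2)  # -> (0.8, 0.8, 0.8)
```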
Experiments and Results

The CORTEX system and its reasoning component are currently being tested on a recognition task using several datasets of movies and related ground truth provided by DARPA. Figure 1 shows the CORTEX Dashboard, a user interface for testing and evaluation of the CORTEX system. With the CORTEX Dashboard we are able to visualize the detected entities (depicted by bounding boxes in the upper-left image), the probabilities of related events in each frame (depicted by intensities, from green for 0 to yellow for 1, in the upper-right graph), and the related verb probabilities calculated by the reasoning component (depicted by bars in the centre graph). The bottom graph shows the precision, recall, and F-measure (i.e. the harmonic mean of precision and recall) for all verbs used to evaluate the output of the reasoning component. The Dashboard can also visualize the learned hypotheses and extract them in the form of temporal logic or natural language, which can be used to explain and describe the recognized actions.

Initial results show that the reasoning component is capable of learning hypotheses about events and related verbs, and that it is able to reason with these hypotheses to correctly recognize actions based on detected events. Furthermore, the results show that the reasoning component is able to recognize actions that were not in the ground truth for that specific input but were inferred from the ground truth and related event patterns in other inputs. For example, reasoning about a movie that was trained to be recognized as a chase resulted in some parts being recognized as fall, because one of the persons was tilting over when she started running, although fall was not part of the ground truth for this movie.

Conclusions and Future Work

With the NSCA architecture, the reasoning component is able to learn and reason about spatiotemporal events in visual inputs and recognize these in terms of actions denoted by verbs. It is also able to extract learned hypotheses on events and verbs that can be used to explain the perceptual and cognitive underpinnings of the recognition task and to support other visual intelligence tasks, like description, anomaly detection, and gap-filling, yet to be developed in the CORTEX system.

References

[Donlon, 2010] James Donlon. DARPA Mind's Eye Program: Broad Agency Announcement. Arlington, USA, 2010.

[de Penning et al., 2011] Leo de Penning, Artur S. d'Avila Garcez, Luís C. Lamb, and John-Jules C. Meyer. A Neural-Symbolic Cognitive Agent for Online Learning and Reasoning. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Barcelona, Spain, 2011.