Visual Intelligence using Neural-Symbolic Learning and Reasoning

H.L.H. (Leo) de Penning
TNO Behaviour and Societal Sciences
Kampweg 5, Soesterberg, The Netherlands
leo.depenning@tno.nl

Abstract

The DARPA Mind's Eye program seeks to develop in machines a capability that currently exists only in animals: visual intelligence. This short paper describes the initial results of a Neural-Symbolic approach for action recognition and description, to be demonstrated at the 7th International Workshop on Neural-Symbolic Learning and Reasoning.

Introduction

Humans perform a wide range of visual tasks with ease, which no current artificial intelligence can do in a robust way. Humans have inherently strong spatial judgment and are able to learn new spatiotemporal concepts directly from visual experience. Humans can visualize scenes and objects, as well as the actions involving those objects, and possess a powerful ability to manipulate those imagined scenes mentally to solve problems.
A machine-based implementation of such abilities would require major advances in each of the following technology focus areas: robust recognition, anomaly detection, description, and gap-filling (i.e., interpolation, prediction, and postdiction). These human intelligence-inspired capabilities are envisaged in service of systems that directly support humans in complex perceptual and reasoning tasks (e.g. Unmanned Ground Vehicles).

The DARPA Mind's Eye program seeks to develop in machines a capability that currently exists only in animals: visual intelligence [Donlon, 2010]. In particular, the program pursues the capability to learn generally applicable and generative representations of action between objects in a scene, directly from visual inputs, and then reason over those learned representations.
A key distinction between this research and the state of the art in machine vision is that the latter has made continual progress in recognizing a wide range of objects and their properties, what might be thought of as the nouns in the description of a scene. The focus of Mind's Eye is to add the perceptual and cognitive underpinnings for recognizing and reasoning about the verbs in those scenes, enabling a more complete narrative of action in the visual experience.

The contribution of TNO, a Dutch research institute and one of the teams working on the Mind's Eye program, is called CORTEX and is presented in this paper. CORTEX is a Visual Intelligence (VI) system consisting of a visual processing pipeline and a reasoning component that reasons about events detected in visual inputs (e.g. from a movie or live camera) in order to: i) recognize actions in terms of verbs, ii) describe these actions in natural language, iii) detect anomalies, and iv) fill gaps (e.g. video blackouts due to missing frames, occlusion by moving objects, or entities receding behind objects).

Neural-Symbolic Cognitive Agent

To learn spatiotemporal relations between detected events (e.g. the size of bounding boxes, the speed of moving entities, changes in relative distance between entities) and verbs describing actions (e.g. fall, bounce, dig), the reasoning component uses a Neural-Symbolic Cognitive Agent (NSCA) based on a Recurrent Temporal Restricted Boltzmann Machine (RTRBM), described in [de Penning et al., 2011] and presented during the IJCAI 2011 poster session. This cognitive agent is able to learn hypotheses about temporal relations between observed events and related actions, and can express those hypotheses in temporal logic or natural language. This enables the reasoning component, and thus CORTEX, to explain and describe the cognitive underpinnings of the recognition task, as stated in the focus of the Mind's Eye program.

The hypotheses are modelled in an RTRBM, where each hidden unit Hj represents a hypothesis about a specific relation between events e and verbs v observed in the visible layer V, and hypotheses h_{t-1} that were true in the previous time frame. Based on a Bayesian inference mechanism, the NSCA can reason about observed actions by selecting the most likely hypotheses h using random Gaussian sampling of the posterior probability distribution (i.e. h ~ P(H | V = e ∧ v, H_{t-1} = h_{t-1})) and calculating the conditional probability, or likelihood, of all events and verbs assuming the selected hypotheses are true (i.e. P(V | H = h)). The difference between the detected events, the available ground truth, and the inferred events and verbs can then be used by the NSCA to train the RTRBM (i.e. update its weights) in order to improve the hypotheses.
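The inference-and-update loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the CORTEX implementation: the layer sizes, the placeholder weights, and the Bernoulli sampling of hidden units (the paper uses Gaussian sampling of the posterior) are all simplifications introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: n_v visible units (event and verb detections),
# n_h hidden units (hypotheses). In a trained NSCA these weights would be
# learned; here they are random placeholders.
n_v, n_h = 8, 4
W = rng.normal(0, 0.1, (n_v, n_h))       # visible-to-hidden weights
W_prev = rng.normal(0, 0.1, (n_h, n_h))  # hidden(t-1)-to-hidden weights
b_h = np.zeros(n_h)                      # hidden biases
b_v = np.zeros(n_v)                      # visible biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def infer_hypotheses(v, h_prev):
    """Select hypotheses h ~ P(H | V=v, H_{t-1}=h_prev) by sampling."""
    p_h = sigmoid(b_h + v @ W + h_prev @ W_prev)
    h = (rng.random(n_h) < p_h).astype(float)  # simplified Bernoulli sampling
    return h, p_h

def reconstruct(h):
    """Likelihood of all events and verbs given hypotheses: P(V | H=h)."""
    return sigmoid(b_v + h @ W.T)

# One inference step on a fabricated observation vector (events + verbs).
v_obs = rng.integers(0, 2, n_v).astype(float)
h_prev = np.zeros(n_h)
h, p_h = infer_hypotheses(v_obs, h_prev)
v_model = reconstruct(h)

# Training signal: the difference between the observed (ground-truth)
# vector and the model's reconstruction drives a contrastive weight update,
# improving the hypotheses over time.
lr = 0.01
W += lr * (np.outer(v_obs, p_h) - np.outer(v_model, p_h))
```

Per time frame, the visible layer is clamped to the detected events and verbs, hypotheses are sampled from the posterior, and the reconstruction error against ground truth updates the weights online.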
Figure 1. CORTEX Dashboard for testing and evaluation.

Experiments and Results

The CORTEX system and its reasoning component are currently being tested on a recognition task using several datasets of movies and related ground truth provided by DARPA. Figure 1 shows the CORTEX Dashboard, a user interface for testing and evaluating the CORTEX system. With the CORTEX Dashboard we are able to visualize the detected entities (depicted by bounding boxes in the upper-left image), the probabilities of related events in each frame (depicted by intensities, from green for 0 to yellow for 1, in the upper-right graph), and the related verb probabilities calculated by the reasoning component (depicted by bars in the centre graph). The bottom graph shows the precision, recall and F-measure (i.e. the harmonic mean of precision and recall) for all verbs used to evaluate the output of the reasoning component. The dashboard can also visualize the learned hypotheses and extract them in the form of temporal logic or natural language, which can be used to explain and describe the recognized actions.

Initial results show that the reasoning component is capable of learning hypotheses about events and related verbs, and that it is able to reason with these hypotheses to correctly recognize actions based on detected events. Furthermore, the results show that the reasoning component is able to recognize actions that were not in the ground truth for a specific input, but were inferred from ground truth and related event patterns in other inputs. For example, reasoning about a movie whose ground truth was a chase resulted in some parts being recognized as fall, because one of the persons tilted over when she started running, although fall was not part of the ground truth for this movie.
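The per-verb F-measure shown in the dashboard's bottom graph can be computed as follows; the counts in the usage example are fabricated for illustration only.

```python
def f_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F-measure: the harmonic mean of precision and recall, per verb."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for one verb: 8 correct detections, 2 spurious, 4 missed,
# giving precision 0.8 and recall of roughly 0.667.
score = f_measure(8, 2, 4)
```

Because the harmonic mean penalizes imbalance, a verb recognizer scores well only when precision and recall are both high, which is why the dashboard reports all three measures side by side.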
Conclusions and Future Work

With the NSCA architecture, the reasoning component is able to learn and reason about spatiotemporal events in visual inputs and recognize these in terms of actions denoted by verbs. It is also able to extract learned hypotheses on events and verbs that can be used to explain the perceptual and cognitive underpinnings of the recognition task and to support other visual intelligence tasks, like description, anomaly detection and gap-filling, yet to be developed in the CORTEX system.

References

[Donlon, 2010] James Donlon. DARPA Mind's Eye Program: Broad Agency Announcement. Arlington, USA, 2010.

[de Penning et al., 2011] Leo de Penning, Artur S. d'Avila Garcez, Luís C. Lamb, and John-Jules C. Meyer. A Neural-Symbolic Cognitive Agent for Online Learning and Reasoning. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Barcelona, Spain, 2011.