Neuro-symbolic Visual Reasoning for Multimedia Event Processing: Overview, Prospects and Challenges

Muhammad Jaleed Khan (m.khan12@nuigalway.ie), Edward Curry (edward.curry@nuigalway.ie)
SFI Centre for Research Training in Artificial Intelligence, Data Science Institute, National University of Ireland Galway, Galway, Ireland

Proceedings of the CIKM 2020 Workshops, October 19-20, Galway, Ireland.

Abstract

Efficient multimedia event processing is a key enabler for real-time and complex decision making in streaming media. The need for expressive queries that detect high-level, human-understandable spatial and temporal events in multimedia streams is inevitable due to the explosive growth of multimedia data in smart cities and on the internet. Recent work in stream reasoning, event processing and visual reasoning inspires the integration of visual and commonsense reasoning in multimedia event processing, which would improve the expressivity of event rules and queries. This can be achieved through careful integration of knowledge about entities, relations and rules from rich knowledge bases via reasoning over multimedia streams within an event processing engine. The prospects of neuro-symbolic visual reasoning within multimedia event processing are promising; however, there are several associated challenges, which are highlighted in this paper.

Keywords: multimedia event processing, visual reasoning, commonsense reasoning, video stream processing, spatiotemporal events

1. Introduction

The internet of multimedia things (IoMT), data analytics and artificial intelligence are continuously improving smart cities and urban environments, with ever-increasing applications ranging from traffic management to public safety. As middleware between the internet of things and real-time applications, complex event processing (CEP) systems process structured data streams from multiple producers and detect complex events queried by subscribers in real-time. The enormous increase in image and video content from surveillance cameras and other sources in IoMT applications has posed several challenges in the real-time processing of multimedia events, which motivated researchers in this area to extend the existing CEP engines and to devise new CEP frameworks that support unstructured multimedia streams. Over the past few years, several efforts have been made to mitigate these challenges, including the extension of existing CEP engines for multimedia events [1] and the development of end-to-end CEP frameworks for multimedia streams [2]. On the other hand, research in computer vision has focused on complementing object detection with human-like visual reasoning that allows for the prediction of meaningful and useful semantic relations among detected objects based on analogy and commonsense (CS) knowledge [3, 4].

In this paper, we discuss the background, prospects and challenges related to leveraging existing visual and commonsense reasoning to enhance multimedia event processing in terms of its applicability and the expressivity of multimedia event queries. The motivation for developing an end-to-end multimedia event processing system supporting automated reasoning over multimedia streams comes from its potential real-time applications in smart cities, the internet and sports. Fig. 1 shows an example of a traffic congestion event detected using visual and commonsense reasoning over the objects, and the relations among the objects, in a video stream. A conceptual-level design and a motivational example of a novel CEP framework supporting visual and commonsense reasoning are presented in Fig. 2.

2. Background

This section presents a review of recent work in stream reasoning, multimedia event processing and visual reasoning that could be complementary within a proposed neuro-symbolic multimedia event processing system with support for visual reasoning.

2.1. Reasoning over Streams and Knowledge Graph

Emerging from the semantic web, streaming data is conventionally modelled according to RDF [8], a graph representation. The real-time processing of RDF streams is performed in time-dependent windows that control access to the stream, each containing a small part of the stream over which a task needs to be performed at a certain time instant. Reasoning is performed by applying RDF Schema rules to the graph using the SPARQL query language or its variants. Reasoning over knowledge graphs (KG) provides new relations among entities that enrich the knowledge graph and improve its applicability [9]. Neuro-symbolic computing combines symbolic and statistical approaches, i.e. knowledge is represented in symbolic form, whereas learning and reasoning are performed by deep neural networks (DNN) [10]; this has shown its efficacy in object detection [11] as well as in enhanced feature learning via knowledge infusion into DNN layers from knowledge bases [12]. Temporal KGs allow time-aware representation and tracking of entities and relations [13].
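As an illustration of the window-based reasoning described above, the following minimal sketch applies RDFS subclass entailment, emulated via a SPARQL property path, over a bounded window of streamed triples. It assumes the rdflib Python library; the namespace, the Car/Vehicle schema and the window size are illustrative assumptions, not part of any of the cited systems.

```python
# A minimal sketch of window-based reasoning over an RDF stream (assumes rdflib).
from collections import deque
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/stream#")  # hypothetical namespace

class StreamWindow:
    """Keeps only the most recent triples; each window is reasoned over in isolation."""
    def __init__(self, size=100):
        self.triples = deque(maxlen=size)  # time-ordered, bounded window

    def push(self, triple):
        self.triples.append(triple)

    def graph(self):
        g = Graph()
        # Illustrative schema triple: a Car is a Vehicle.
        g.add((EX.Car, RDFS.subClassOf, EX.Vehicle))
        for t in self.triples:
            g.add(t)
        return g

def query_window(g):
    # SPARQL over the window; the rdfs:subClassOf* property path walks the
    # subclass hierarchy, emulating RDFS subclass entailment for this query.
    q = """
    PREFIX ex: <http://example.org/stream#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?obj WHERE {
        ?obj a ?cls .
        ?cls rdfs:subClassOf* ex:Vehicle .
    }
    """
    return [str(row.obj) for row in g.query(q)]

window = StreamWindow(size=100)
window.push((EX.car42, RDF.type, EX.Car))  # triple arriving from the stream
print(query_window(window.graph()))        # car42 matches via subclass reasoning
```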
Figure 1: (a) Example of a video stream in a smart city. (b) Detection of objects and relations. (c) High-level event of traffic congestion detected as a result of automated reasoning.

2.2. Multimedia Event Representation and Processing

CEP engines inherently lacked support for unstructured multimedia events, which was mitigated by a generalized approach for handling multimedia events as native events in CEP engines, as presented in [1]. Angsuchotmetee et al. [14] presented an ontological approach for modeling complex events and multimedia data with syntactic and semantic interoperability in multimedia sensor networks, which allows subscribers to define application-specific complex events while keeping the low-level network representation generic. Aslam et al. [15] leveraged domain adaptation and online transfer learning in multimedia event processing to extend support to unknown events. Knowledge graphs are suitable for semantic representation of, and reasoning over, video streams due to their scalability and maintainability [16], as demonstrated in [5]. VidCEP [2], a CEP framework for the detection of spatiotemporal video events expressed by subscriber-defined queries, includes a graph-based representation, the Video Event Query Language (VEQL) and a complex event matcher for video data.

Figure 2: (a) Conceptual-level block diagram of a CEP framework supporting visual reasoning. The input stream of images (or video frames) is received from a publisher; the objects are detected using a DNN and rule-based relations [5] are represented using a graph, which is followed by automated reasoning that adds new visual relations from a knowledge base [6] and validates those relations using commonsense knowledge [7]. The matcher performs spatial and temporal event matching on these detected objects and relations against the spatial and temporal patterns in the high-level events queried by the subscriber. (b) An example of visual reasoning in multimedia event processing. Suppose a subscriber is interested in the event where a tennis player is either "hitting" or "missing" a shot. This event is not explicitly defined via rules, but it can be predicted via automated reasoning over detected objects and predicted relations. (Image credits: Visual Genome [6])
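The following simplified sketch illustrates such a graph-based representation for a single video frame, loosely in the spirit of VEKG [5]. It assumes the networkx library; the left_of rule, the attribute names and the matcher are illustrative assumptions rather than the published VEKG/VEQL design.

```python
# A simplified per-frame event graph: detected objects as nodes,
# rule-based spatial relations as edges (assumes networkx).
import networkx as nx

def frame_graph(detections, timestamp):
    """detections: dicts like {'id': 'car_1', 'label': 'car', 'bbox': (x1, y1, x2, y2)}"""
    g = nx.MultiDiGraph(timestamp=timestamp)
    for d in detections:
        g.add_node(d["id"], label=d["label"], bbox=d["bbox"])
    # Illustrative rule: A is left_of B if A's box ends before B's box begins.
    for a in detections:
        for b in detections:
            if a is not b and a["bbox"][2] < b["bbox"][0]:
                g.add_edge(a["id"], b["id"], relation="left_of")
    return g

def match_pattern(g, subj_label, relation, obj_label):
    """Return (subject, object) pairs satisfying a simple spatial pattern."""
    return [(u, v) for u, v, attrs in g.edges(data=True)
            if attrs["relation"] == relation
            and g.nodes[u]["label"] == subj_label
            and g.nodes[v]["label"] == obj_label]

g = frame_graph([{"id": "car_1", "label": "car", "bbox": (10, 40, 60, 90)},
                 {"id": "bus_1", "label": "bus", "bbox": (80, 35, 150, 95)}],
                timestamp=0.04)
print(match_pattern(g, "car", "left_of", "bus"))  # [('car_1', 'bus_1')]
```

A stream of such timestamped per-frame graphs is the kind of structure over which a temporal matcher, like the one in VidCEP [2], could evaluate subscriber queries.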
2.3. Visual and Commonsense Reasoning

In addition to the objects and their attributes in images, the detection of relations among these objects is crucial for scene understanding, for which compositional models [17], visual phrase models [18] and DNN-based relational networks [19] are available. Visual and semantic embeddings aid large-scale visual relation detection; for example, Zhang et al. [4] employed both visual and textual features to leverage the interactions between objects for relation detection. Similarly, Peyre et al. [3] added a visual phrase embedding space during learning to enable analogical reasoning for unseen relations and to improve robustness to appearance variations of visual relations. Wan et al. [7] proposed the use of a commonsense knowledge graph along with visual features to enhance visual relation detection. Rajani et al. [20] leverage human reasoning and language models to generate human-like explanations for DNN-based commonsense question answering. There are various commonsense reasoning methods and datasets available for visual commonsense reasoning [21] and story completion [22]. Table 1 presents some knowledge bases publicly available for visual reasoning.

Table 1: Available Knowledge Bases for Visual Reasoning

| Knowledge Base | #Images | #Entity Categories | #Entity Instances | #Relation Categories | #Relation Instances |
|---|---|---|---|---|---|
| Open Images V4 [23] | 9,200,000 | 600 | 15,400,000 | 57 | 375,000 |
| YAGO 4 [24] | – | 10,124 | 64,000,000 | – | 2 billion |
| Visual Genome [6] | 108,077 | 33,877 | 3,843,636 | 42,374 | 2,269,617 |
| COCO-a [25] | 10,000 | 81 | 74,000 | 156 | 207,000 |
| VisKE [18] | – | 1,884 | – | 1,158 | 12,593 |
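A toy sketch of the commonsense validation step, loosely in the spirit of Wan et al. [7], follows: relation triples proposed by a DNN are re-ranked against prior plausibility scores from a commonsense source. The mini knowledge base and the linear weighting scheme below are illustrative assumptions, not the published method.

```python
# Re-ranking DNN-predicted relation triples with a commonsense prior.
PLAUSIBLE = {  # hypothetical commonsense triples with prior plausibility
    ("person", "holds", "racket"): 0.9,
    ("person", "hits", "ball"): 0.8,
    ("ball", "hits", "person"): 0.2,
}

def validate(predictions, alpha=0.5):
    """predictions: list of (subj_label, predicate, obj_label, dnn_score).
    Combines the detector's confidence with the commonsense prior."""
    ranked = []
    for s, p, o, score in predictions:
        prior = PLAUSIBLE.get((s, p, o), 0.05)  # small default for unseen triples
        ranked.append((s, p, o, alpha * score + (1 - alpha) * prior))
    return sorted(ranked, key=lambda t: t[3], reverse=True)

preds = [("person", "hits", "ball", 0.60), ("ball", "hits", "person", 0.55)]
print(validate(preds))  # the commonsense prior breaks the near-tie toward the former
```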
How- matching Providing a generic and human- ever, the complex relations that exist among friendly format to subscribers for writing real-world objects also depend on seman- expressive and high-level queries would re- tic facts and situational variables that can quire new constructs. Matching queries with not be explicitly specified for every possi- the low-level events and relations along with ble event as rules. The statistical reason- reasoning via knowledge bases requires ef- ing methods and knowledge bases discussed ficient retrieval within the complex event in Section 2 have great potential to com- matcher. Real-world complex events can plement the rule-based relation formation share similar patterns, occur as a cluster of in multimedia event processing by inject- similar events or occur in a hierarchical man- ing some semantic knowledge and reason- ner, which requires generalized, adaptive and ing to extract more semantically meaning- scalable spatiotemporal constructs to query ful relations among objects. This advance- such events. ment will allow subscribers to define abstract 3. Labeling and training samples of vi- or high-level human-understandable event sual relations There can be a large numbers query rules that can be decomposed into of objects and possible relations among them spatial and temporal patterns. The spatio- in images, which can result in a large num- temporal matching of the queried high-level ber of categories of relations. It is difficult Table 1 Available Knowledge Bases for Visual Reasoning Knowledge Base #Images #Entity Categories #Entity Instances #Relation Categories #Relation Instances Open Images V4 [23] 9,200,000 600 15,400,000 57 375,000 YAGO 4 [24] – 10,124 64,000,000 – 2 billion Visual Genome [6] 108,077 33,877 3,843,636 42,374 2,269,617 COCO-a [25] 10,000 81 74,000 156 207,000 VisKE [18] – 1,884 – 1,158 12,593 to annotate all possible relations and to have been explored much, which is crucial for spa- balanced categories of relations in the train- tiotemporal event processing. ing data. For example, Visual Genome [6] has a huge number of relations with unbalanced instances of each relation. Acknowledgement 4. Consistent integration of knowledge This work was conducted with the financial bases The object labels in datasets for ob- support of the Science Foundation Ireland ject detection and entity labels in knowl- Centre for Research Training in Artificial In- edge bases (e.g. person, human, man) are telligence under Grant No. 18/CRT/6223. not always the same. Similarly, knowledge bases have different labels for the same en- tity, different names for the same attribute References (e.g. birthPlace and placeOfBirth) or relation [1] A. Aslam, E. Curry, Towards a generalized approach for (e.g. ’at left’ and ’to left of’). This can cause deep neural network based event processing for the in- ternet of multimedia things, IEEE Access 6 (2018) 25573– inconsistency or redundancy while integrat- 25587. ing relations from the knowledge bases. It is [2] P. Yadav, E. Curry, Vidcep: Complex event process- ing framework to detect spatiotemporal patterns in video important to select the knowledge base and streams, in: 2019 IEEE International Conference on Big dataset that are consistent and suitable for Data (Big Data), IEEE, 2019, pp. 2513–2522. [3] J. Peyre, I. Laptev, C. Schmid, J. Sivic, Detecting unseen the combined use of both object detection visual relations using analogies, in: Proceedings of the and visual reasoning. 
3.2. Challenges

1. Suitable representation for reasoning. It is crucial to select a generalized and scalable model to represent events and to effectively perform automated reasoning that derives more meaningful and expressive spatiotemporal events.

2. Expressive query definition and matching. Providing a generic and human-friendly format to subscribers for writing expressive, high-level queries would require new constructs. Matching queries with low-level events and relations, along with reasoning via knowledge bases, requires efficient retrieval within the complex event matcher. Real-world complex events can share similar patterns, occur as a cluster of similar events or occur in a hierarchical manner, which requires generalized, adaptive and scalable spatiotemporal constructs to query such events.

3. Labeling and training samples of visual relations. There can be a large number of objects and possible relations among them in images, which can result in a large number of categories of relations. It is difficult to annotate all possible relations and to have balanced categories of relations in the training data. For example, Visual Genome [6] has a huge number of relations with unbalanced instances of each relation.

4. Consistent integration of knowledge bases. The object labels in datasets for object detection and the entity labels in knowledge bases (e.g. person, human, man) are not always the same. Similarly, knowledge bases have different labels for the same entity, different names for the same attribute (e.g. birthPlace and placeOfBirth) or relation (e.g. 'at left' and 'to left of'). This can cause inconsistency or redundancy while integrating relations from the knowledge bases. It is important to select a knowledge base and dataset that are consistent and suitable for the combined use of object detection and visual reasoning; a sketch of such label alignment follows this list.

5. Supporting rare or unseen visual relations. Apart from the common relations, very rare or unseen relations among objects also appear in certain scenes. It is nearly impossible to collect sufficient training samples for all possible seen and unseen relations. Handling such relations while evaluating the models is also a challenge.

6. Temporal processing of objects and relations. The recent methods on this subject address complex inference tasks by decomposing images or scenes into objects and visual relations among the objects. The temporal events and the temporal tracking of detected objects and predicted relations have not been explored much, which is crucial for spatiotemporal event processing.
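As noted in challenge 4, a label-alignment step is needed before triples from different sources can be merged consistently. The sketch below maps synonymous entity, attribute and relation names to canonical forms; the alias table is hand-written for illustration, whereas in practice such mappings could come from owl:sameAs links or embedding-based matching.

```python
# Normalizing synonymous labels before merging triples from several knowledge bases.
ALIASES = {  # illustrative alias table, not taken from any specific knowledge base
    "human": "person", "man": "person", "woman": "person",
    "placeOfBirth": "birthPlace",
    "at left": "to left of",
}

def canonical(term):
    return ALIASES.get(term, term)

def merge_triples(*sources):
    """Normalize and deduplicate (subject, predicate, object) triples."""
    merged = set()
    for triples in sources:
        for s, p, o in triples:
            merged.add((canonical(s), canonical(p), canonical(o)))
    return merged

kb_a = [("man", "at left", "car")]
kb_b = [("person", "to left of", "car")]
print(merge_triples(kb_a, kb_b))  # one triple after normalization, not two
```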
Acknowledgement

This work was conducted with the financial support of the Science Foundation Ireland Centre for Research Training in Artificial Intelligence under Grant No. 18/CRT/6223.

References

[1] A. Aslam, E. Curry, Towards a generalized approach for deep neural network based event processing for the internet of multimedia things, IEEE Access 6 (2018) 25573-25587.
[2] P. Yadav, E. Curry, VidCEP: Complex event processing framework to detect spatiotemporal patterns in video streams, in: 2019 IEEE International Conference on Big Data (Big Data), IEEE, 2019, pp. 2513-2522.
[3] J. Peyre, I. Laptev, C. Schmid, J. Sivic, Detecting unseen visual relations using analogies, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1981-1990.
[4] J. Zhang, Y. Kalantidis, M. Rohrbach, M. Paluri, A. Elgammal, M. Elhoseiny, Large-scale visual relationship understanding, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 9185-9194.
[5] P. Yadav, E. Curry, VEKG: Video event knowledge graph to represent video streams for complex event pattern matching, in: 2019 First International Conference on Graph Computing (GC), IEEE, 2019, pp. 13-20.
[6] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., Visual Genome: Connecting language and vision using crowdsourced dense image annotations, International Journal of Computer Vision 123 (2017) 32-73.
[7] H. Wan, J. Ou, B. Wang, J. Du, J. Z. Pan, J. Zeng, Iterative visual relationship detection via commonsense knowledge graph, in: Joint International Semantic Technology Conference, Springer, 2019, pp. 210-225.
[8] RDF 1.1 Concepts and Abstract Syntax, W3C Recommendation, 2014.
[9] X. Chen, S. Jia, Y. Xiang, A review: Knowledge reasoning over knowledge graph, Expert Systems with Applications 141 (2020) 112948.
[10] W. Li, G. Qi, Q. Ji, Hybrid reasoning in knowledge graphs: Combing symbolic reasoning and statistical reasoning, Semantic Web (2020) 1-10.
[11] Y. Fang, K. Kuan, J. Lin, C. Tan, V. Chandrasekhar, Object detection meets knowledge graphs (2017).
[12] U. Kursuncu, M. Gaur, A. Sheth, Knowledge infused learning (K-IL): Towards deep incorporation of knowledge in deep learning, arXiv preprint arXiv:1912.00512 (2019).
[13] A. García-Durán, S. Dumančić, M. Niepert, Learning sequence encoders for temporal knowledge graph completion, arXiv preprint arXiv:1809.03202 (2018).
[14] C. Angsuchotmetee, R. Chbeir, Y. Cardinale, MSSN-Onto: An ontology-based approach for flexible event processing in multimedia sensor networks, Future Generation Computer Systems 108 (2020) 1140-1158.
[15] A. Aslam, E. Curry, Reducing response time for multimedia event processing using domain adaptation, in: Proceedings of the 2020 International Conference on Multimedia Retrieval, 2020, pp. 261-265.
[16] L. Greco, P. Ritrovato, M. Vento, On the use of semantic technologies for video analysis, Journal of Ambient Intelligence and Humanized Computing (2020).
[17] Y. Li, W. Ouyang, X. Wang, X. Tang, ViP-CNN: Visual phrase guided convolutional neural network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1347-1356.
[18] F. Sadeghi, S. K. Divvala, A. Farhadi, VisKE: Visual knowledge extraction and question answering by visual verification of relation phrases, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1456-1464.
[19] B. Dai, Y. Zhang, D. Lin, Detecting visual relationships with deep relational networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3076-3086.
[20] N. F. Rajani, B. McCann, C. Xiong, R. Socher, Explain yourself! Leveraging language models for commonsense reasoning, arXiv preprint arXiv:1906.02361 (2019).
[21] R. Zellers, Y. Bisk, A. Farhadi, Y. Choi, From recognition to cognition: Visual commonsense reasoning, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[22] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, HellaSwag: Can a machine really finish your sentence?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
[23] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, et al., The Open Images Dataset V4, International Journal of Computer Vision (2020) 1-26.
[24] T. P. Tanon, G. Weikum, F. Suchanek, YAGO 4: A reason-able knowledge base, in: European Semantic Web Conference, Springer, 2020, pp. 583-596.
[25] M. R. Ronchi, P. Perona, Describing common human visual actions in images, arXiv preprint arXiv:1506.02203 (2015).