Neuro-symbolic Visual Reasoning for Multimedia Event Processing: Overview, Prospects and Challenges Muhammad Jaleed Khan, Edward Curry SFI Centre for Research Training in Artificial Intelligence, Data Science Institute, National University of Ireland Galway, Galway, Ireland. Abstract Efficient multimedia event processing is a key enabler for real-time and complex decision making in streaming media. The need for expressive queries to detect high-level human-understandable spatial and temporal events in multimedia streams is inevitable due to the explosive growth of multimedia data in smart cities and internet. The recent work in stream reasoning, event processing and visual reasoning inspires the integration of visual and commonsense reasoning in multimedia event processing, which would improve and enhance multimedia event processing in terms of expressivity of event rules and queries. This can be achieved through careful integration of knowledge about entities, relations and rules from rich knowledge bases via reasoning over multimedia streams within an event processing engine. The prospects of neuro-symbolic visual reasoning within multimedia event processing are promising, however, there are several associated challenges that are highlighted in this paper. Keywords multimedia event processing, visual reasoning, commonsense reasoning, video stream processing, spatiotemporal events 1. Introduction producers and detect complex events queried by subscribers in real-time. The enormous Internet of multimedia things (IoMT), data increase in image and video content surveil- analytics and artificial intelligence are con- lance cameras and other sources in IoMT ap- tinuously improving smart cities and urban plications posed several challenges in real- environments with their ever-increasing ap- time processing of multimedia events, which plications ranging from traffic management motivated researchers in this area to ex- to public safety. As middleware between in- tend the existing CEP engines and to devise ternet of things and real-time applications, new CEP frameworks to support unstruc- complex event processing (CEP) systems pro- tured multimedia streams. Over the past few cess structured data streams from multiple years, several efforts have been made to miti- gate the challenges in multimedia event pro- Proceedings of the CIKM 2020 Workshops, October 19-20, cessing by developing techniques for exten- Galway, Ireland. " m.khan12@nuigalway.ie (M.J. Khan); sion of existing CEP engines for multime- edward.curry@nuigalway.ie (E. Curry) dia events [1] and development of end-to- ~ https://www.linkedin.com/in/mjaleedkhan/ (M.J. end CEP frameworks for multimedia streams Khan); http://edwardcurry.org/ (E. Curry) [2]. On the other hand, the research in com-  0000-0003-4727-4722 (M.J. Khan); 0000-0001-8236-6433 (E. Curry) puter vision has focused on complimenting © 2020 Copyright for this paper by its authors. Use permit- ted under Creative Commons License Attribution 4.0 Inter- object detection with human-like visual rea- national (CC BY 4.0). CEUR Workshop Proceedings soning that allows for prediction of mean- CEUR http://ceur-ws.org (CEUR-WS.org) Workshop ISSN 1613-0073 Proceedings ingful and useful semantic relations among 2.1. Reasoning over Streams and detected objects based on analogy and com- Knowledge Graph monsense (CS) knowledge [3, 4]. Emerging from the semantic web, stream- ing data is conventionally modelled accord- 2. Background ing to RDF [8], a graph representation. The real-time processing of RDF streams is per- In this paper, we discuss the background, formed in time-dependent windows that con- prospects and challenges related to leverag- trol the access to the stream, each contain- ing the existing visual and commonsense rea- ing a small part of the stream over which soning to enhance multimedia event process- a task needs to be performed at a certain ing in terms of its applicability and expres- time instant. Reasoning is performed by ap- sivity of multimedia event queries. The mo- plying RDF Schema rules to the graph us- tivation for development of an end-to-end ing SPARQL query language or its variants. multimedia event processing system sup- Reasoning over knowledge graphs (KG) pro- porting automated reasoning over multime- vides new relations among entities to en- dia streams comes from its potential real- rich the knowledge graph and improve its time applications in smart cities, internet and applicability [9]. Neuro-symbolic comput- sports. Fig. 1 shows an example of traf- ing combines symbolic and statistical ap- fic congestion event detected using visual proaches, i.e. knowledge is represented in and commonsense reasoning over the objects symbolic form, whereas learning and reason- and relations among the objects in the video ing are performed by DNN [10], which has stream. A conceptual level design and a mo- shown its efficacy in object detection [11] as tivational example of a novel CEP framework well as enhanced feature learning via knowl- supporting visual and commonsense reason- edge infusion in DNN layers from knowledge ing is presented in Fig. 2. bases [12]. Temporal KG allows time-aware This section presents a review of the re- representation and tracking of entities and cent work in stream reasoning, multimedia relations [13]. event processing and visual reasoning that could be complementary within a proposed 2.2. Multimedia Event neuro-symbolic multimedia event processing system with support for visual reasoning. Representation and Processing CEP engines inherently lacked the support for unstructured multimedia events, which was mitigated by a generalized approach for handling multimedia events as native events in CEP engines as presented in [1]. Angsu- Figure 1: (a) Example of video stream in smart chotmetee et al. [14] has presented an city. (b) Detection of objects and relations. (c) ontological approach for modeling complex High-level event of traffic congestion detected as events and multimedia data with syntactic a result of automated reasoning. and semantic interoperability in multimedia sensor networks, which allows subscribers to define application-specific complex events while keeping the low-level network repre- Figure 2: (a) Conceptual level block diagram of a CEP framework supporting visual reasoning. The input stream of images (or video frames) is received from a publisher, the objects are detected using DNN and rule-based relations [5] are represented using a graph, which is followed by automated reasoning that adds new visual relations from a knowledge base [6] and validates those relations using commonsense knowledge [7]. The matcher performs spatial and temporal event matching on these detected objects and relations with the spatial and temporal patterns in high-level events queried by the subscriber. (b) An example of visual reasoning in multimedia event processing. Suppose a subscriber is interested in the event where tennis player is either “hitting” or “missing” a shot. This event is not explicitly defined via rules but it can be predicted via automated reasoning over detected objects and predicted relations. (Image credits: Visual Genome [6]) sentation generic. Aslam et al. [15] leveraged 2.3. Visual and Commonsense domain adaption and online transfer learn- Reasoning ing in multimedia event processing to ex- tend support for unknown events. Knowl- In addition to the objects and their attributes edge graph is suitable for semantic repre- in images, detection of relations among these sentation and reasoning over video streams objects is crucial for scene understanding due to its scalability and maintainability [16], for which compositional models [17], visual as demonstrated in [5]. VidCEP [2], a CEP phrase models [18] and DNN based relational framework for detection of spatiotemporal networks [19] are available. Visual and se- video events expressed by subscriber-defined mantic embeddings aid large scale visual re- queries, includes a graph-based representa- lation detection, such as Zhang et al. [4] tion, Video Event Query Language (VEQL) employed both visual and textual features to and a complex event matcher for video data. leverage the interactions between objects for relation detection. Similarly, Peyre et al. [3] added a visual phrase embedding space dur- ing learning to enable analogical reasoning for unseen relations and to improve robust- ness to appearance variations of visual re- lations. Table 1 presents some knowledge events will be performed on the objects, rule- bases publicly available for visual reasoning. based relations and relations extracted us- Wan et al. [7] proposed the use of common- ing visual reasoning. The subscriber will sense knowledge graph along with the visual be instantly notified of the high-level event features to enhance visual relation detection. as a combined detection of those spatial- Rajani et al. [20] leverage human reasoning temporal patterns. The idea of developing an and language models to generate human-like end-to-end multimedia event processing sys- explanations for DNN-based commonsense tem supporting visual reasoning over video question answering. There are various com- streams (Fig. 2) poses several challenges that monsense reasoning methods and datasets are discussed in the next section. This novel available for visual commonsense reasoning approach will give more expressive power to [21] and story completion [22]. subscribers in querying complex events in multimedia streams, and thus increase the scope of real-time applications of multimedia 3. Neuro-symbolic Visual event processing in smart city applications Reasoning in as well as internet media streaming applica- tions. Multimedia Event Processing 3.2. Challenges 3.1. Prospects 1. Suitable representation for reason- ing It is crucial to select a generalized and The current multimedia event representation scalable model to represent events and effec- methods use knowledge graph to represent tively perform automated reasoning to derive the detected objects, their attributes and re- more meaningful and expressive spatiotem- lations among the objects in video streams. poral events. Pre-defined spatial-temporal rules are used 2. Expressive query definition and to form relations among the objects. How- matching Providing a generic and human- ever, the complex relations that exist among friendly format to subscribers for writing real-world objects also depend on seman- expressive and high-level queries would re- tic facts and situational variables that can quire new constructs. Matching queries with not be explicitly specified for every possi- the low-level events and relations along with ble event as rules. The statistical reason- reasoning via knowledge bases requires ef- ing methods and knowledge bases discussed ficient retrieval within the complex event in Section 2 have great potential to com- matcher. Real-world complex events can plement the rule-based relation formation share similar patterns, occur as a cluster of in multimedia event processing by inject- similar events or occur in a hierarchical man- ing some semantic knowledge and reason- ner, which requires generalized, adaptive and ing to extract more semantically meaning- scalable spatiotemporal constructs to query ful relations among objects. This advance- such events. ment will allow subscribers to define abstract 3. Labeling and training samples of vi- or high-level human-understandable event sual relations There can be a large numbers query rules that can be decomposed into of objects and possible relations among them spatial and temporal patterns. The spatio- in images, which can result in a large num- temporal matching of the queried high-level ber of categories of relations. It is difficult Table 1 Available Knowledge Bases for Visual Reasoning Knowledge Base #Images #Entity Categories #Entity Instances #Relation Categories #Relation Instances Open Images V4 [23] 9,200,000 600 15,400,000 57 375,000 YAGO 4 [24] – 10,124 64,000,000 – 2 billion Visual Genome [6] 108,077 33,877 3,843,636 42,374 2,269,617 COCO-a [25] 10,000 81 74,000 156 207,000 VisKE [18] – 1,884 – 1,158 12,593 to annotate all possible relations and to have been explored much, which is crucial for spa- balanced categories of relations in the train- tiotemporal event processing. ing data. For example, Visual Genome [6] has a huge number of relations with unbalanced instances of each relation. Acknowledgement 4. Consistent integration of knowledge This work was conducted with the financial bases The object labels in datasets for ob- support of the Science Foundation Ireland ject detection and entity labels in knowl- Centre for Research Training in Artificial In- edge bases (e.g. person, human, man) are telligence under Grant No. 18/CRT/6223. not always the same. Similarly, knowledge bases have different labels for the same en- tity, different names for the same attribute References (e.g. birthPlace and placeOfBirth) or relation [1] A. Aslam, E. Curry, Towards a generalized approach for (e.g. ’at left’ and ’to left of’). This can cause deep neural network based event processing for the in- ternet of multimedia things, IEEE Access 6 (2018) 25573– inconsistency or redundancy while integrat- 25587. ing relations from the knowledge bases. It is [2] P. Yadav, E. Curry, Vidcep: Complex event process- ing framework to detect spatiotemporal patterns in video important to select the knowledge base and streams, in: 2019 IEEE International Conference on Big dataset that are consistent and suitable for Data (Big Data), IEEE, 2019, pp. 2513–2522. [3] J. Peyre, I. Laptev, C. Schmid, J. Sivic, Detecting unseen the combined use of both object detection visual relations using analogies, in: Proceedings of the and visual reasoning. IEEE International Conference on Computer Vision, 2019, pp. 1981–1990. 5. Supporting rare or unseen visual re- [4] J. Zhang, Y. Kalantidis, M. Rohrbach, M. Paluri, A. Elgam- lations Apart from the common relations, mal, M. Elhoseiny, Large-scale visual relationship under- standing, in: Proceedings of the AAAI Conference on Ar- very rare or unseen relations among objects tificial Intelligence, volume 33, 2019, pp. 9185–9194. also appear in certain scenes. It is nearly im- [5] P. Yadav, E. Curry, Vekg: Video event knowledge graph to represent video streams for complex event pattern match- possible to collect sufficient training samples ing, in: 2019 First International Conference on Graph for all possible seen and unseen relations. Computing (GC), IEEE, 2019, pp. 13–20. [6] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, Handling such relations while evaluating the S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., Visual models is also a challenge. genome: Connecting language and vision using crowd- sourced dense image annotations, International Journal 6. Temporal processing of objects and re- of Computer Vision 123 (2017) 32–73. lations The recent methods on this subject [7] H. Wan, J. Ou, B. Wang, J. Du, J. Z. Pan, J. Zeng, Iterative visual relationship detection via commonsense knowledge address complex inference tasks by decom- graph, in: Joint International Semantic Technology Con- posing images or scenes into objects and vi- ference, Springer, 2019, pp. 210–225. [8] Rdf 1.1 concepts and abstract syntax (2014). sual relations among the objects. The tem- [9] X. Chen, S. Jia, Y. Xiang, A review: Knowledge reasoning poral events and temporal tracking of the de- over knowledge graph, Expert Systems with Applications 141 (2020) 112948. tected objects and predicted relations has not [10] W. Li, G. Qi, Q. Ji, Hybrid reasoning in knowledge graphs: Combing symbolic reasoning and statistical reasoning, Se- mantic Web (2020) 1–10. [11] Y. Fang, K. Kuan, J. Lin, C. Tan, V. Chandrasekhar, Object detection meets knowledge graphs (2017). [12] U. Kursuncu, M. Gaur, A. Sheth, Knowledge infused learn- ing (k-il): Towards deep incorporation of knowledge in deep learning, arXiv preprint arXiv:1912.00512 (2019). [13] A. García-Durán, S. Dumančić, M. Niepert, Learning se- quence encoders for temporal knowledge graph comple- tion, arXiv preprint arXiv:1809.03202 (2018). [14] C. Angsuchotmetee, R. Chbeir, Y. Cardinale, Mssn-onto: An ontology-based approach for flexible event processing in multimedia sensor networks, Future Generation Com- puter Systems 108 (2020) 1140–1158. [15] A. Aslam, E. Curry, Reducing response time for multimedia event processing using domain adaptation, in: Proceed- ings of the 2020 International Conference on Multimedia Retrieval, 2020, pp. 261–265. [16] L. Greco, P. Ritrovato, M. Vento, On the use of semantic technologies for video analysis, Journal of Ambient Intel- ligence and Humanized Computing (2020). [17] Y. Li, W. Ouyang, X. Wang, X. Tang, Vip-cnn: Visual phrase guided convolutional neural network, in: Proceed- ings of the IEEE Conference on Computer Vision and Pat- tern Recognition, 2017, pp. 1347–1356. [18] F. Sadeghi, S. K. Kumar Divvala, A. Farhadi, Viske: Vi- sual knowledge extraction and question answering by vi- sual verification of relation phrases, in: CVPR 2015, ????, pp. 1456–1464. [19] B. Dai, Y. Zhang, D. Lin, Detecting visual relationships with deep relational networks, in: Proceedings of the IEEE conference on computer vision and Pattern recognition, 2017, pp. 3076–3086. [20] N. F. Rajani, B. McCann, C. Xiong, R. Socher, Explain your- self! leveraging language models for commonsense rea- soning, arXiv preprint arXiv:1906.02361 (2019). [21] R. Zellers, Y. Bisk, A. Farhadi, Y. Choi, From recognition to cognition: Visual commonsense reasoning, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [22] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, Hel- laswag: Can a machine really finish your sentence?, in: Proceedings of the 57th Annual Meeting of the Associa- tion for Computational Linguistics, 2019. [23] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Mal- loci, A. Kolesnikov, et al., The open images dataset v4, International Journal of Computer Vision (2020) 1–26. [24] T. P. Tanon, G. Weikum, F. Suchanek, Yago 4: A reason- able knowledge base, in: European Semantic Web Confer- ence, Springer, 2020, pp. 583–596. [25] M. R. Ronchi, P. Perona, Describing common human visual actions in images, arXiv preprint arXiv:1506.02203 (2015).