<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshops, October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Neuro-symbolic Visual Reasoning for Multimedia Event Processing: Overview, Prospects and Challenges</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muhammad Jaleed Khan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edward Curry</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SFI Centre for Research Training in Artificial Intelligence, Data Science Institute, National University of Ireland Galway</institution>
          ,
          <addr-line>Galway</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>1</volume>
      <fpage>9</fpage>
      <lpage>20</lpage>
      <abstract>
<p>Efficient multimedia event processing is a key enabler for real-time and complex decision making over streaming media. The need for expressive queries to detect high-level, human-understandable spatial and temporal events in multimedia streams is inevitable due to the explosive growth of multimedia data in smart cities and on the internet. Recent work in stream reasoning, event processing and visual reasoning inspires the integration of visual and commonsense reasoning into multimedia event processing, which would enhance multimedia event processing in terms of the expressivity of event rules and queries. This can be achieved through the careful integration of knowledge about entities, relations and rules from rich knowledge bases via reasoning over multimedia streams within an event processing engine. The prospects of neuro-symbolic visual reasoning within multimedia event processing are promising; however, there are several associated challenges, which are highlighted in this paper.</p>
      </abstract>
      <kwd-group>
        <kwd>multimedia event processing</kwd>
        <kwd>visual reasoning</kwd>
        <kwd>commonsense reasoning</kwd>
        <kwd>video stream processing</kwd>
        <kwd>spatiotemporal events</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Internet of multimedia things (IoMT), data analytics and artificial intelligence are continuously improving smart cities and urban environments with their ever-increasing applications, ranging from traffic management to public safety. As middleware between the internet of things and real-time applications, complex event processing (CEP) systems process structured data streams from multiple producers and detect complex events queried by subscribers in real time. The enormous increase in image and video content from surveillance cameras and other sources in IoMT applications has posed several challenges in the real-time processing of multimedia events, which motivated researchers in this area to extend the existing CEP engines and to devise new CEP frameworks that support unstructured multimedia streams. Over the past few years, several efforts have been made to mitigate the challenges in multimedia event processing by developing techniques for the extension of existing CEP engines to multimedia events [<xref ref-type="bibr" rid="ref1">1</xref>] and the development of end-to-end CEP frameworks for multimedia streams [<xref ref-type="bibr" rid="ref2">2</xref>]. On the other hand, research in computer vision has focused on complementing object detection with human-like visual reasoning, which allows the prediction of meaningful and useful semantic relations among detected objects based on analogy and commonsense (CS) knowledge [<xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>].</p>
      <sec id="sec-1-1">
        <title>Emerging from the semantic web, stream</title>
        <p>
          ing data is conventionally modelled
accord2. Background ing to RDF [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], a graph representation. The
real-time processing of RDF streams is
perIn this paper, we discuss the background, formed in time-dependent windows that
conprospects and challenges related to leverag- trol the access to the stream, each
containing the existing visual and commonsense rea- ing a small part of the stream over which
soning to enhance multimedia event process- a task needs to be performed at a certain
ing in terms of its applicability and expres- time instant. Reasoning is performed by
apsivity of multimedia event queries. The mo- plying RDF Schema rules to the graph
ustivation for development of an end-to-end ing SPARQL query language or its variants.
multimedia event processing system sup- Reasoning over knowledge graphs (KG)
proporting automated reasoning over multime- vides new relations among entities to
endia streams comes from its potential real- rich the knowledge graph and improve its
time applications in smart cities, internet and applicability [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Neuro-symbolic
computsports. Fig. 1 shows an example of traf- ing combines symbolic and statistical
apifc congestion event detected using visual proaches, i.e. knowledge is represented in
and commonsense reasoning over the objects symbolic form, whereas learning and
reasonand relations among the objects in the video ing are performed by DNN [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], which has
stream. A conceptual level design and a mo- shown its eficacy in object detection [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] as
tivational example of a novel CEP framework well as enhanced feature learning via
knowlsupporting visual and commonsense reason- edge infusion in DNN layers from knowledge
ing is presented in Fig. 2. bases [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Temporal KG allows time-aware
        </p>
        <p>
          This section presents a review of the re- representation and tracking of entities and
cent work in stream reasoning, multimedia relations [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
event processing and visual reasoning that
could be complementary within a proposed
neuro-symbolic multimedia event processing
system with support for visual reasoning.
        </p>
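        <p>A minimal sketch of this window-based processing is given below in Python using the rdflib library; the example vocabulary, the toy detections and the vehicle query are assumptions introduced purely to illustrate how RDFS-style rules and a SPARQL query could be applied to the triples that fall inside a single time window.</p>
        <preformat>
# Sketch: reasoning over one time window of an RDF stream with rdflib.
# The EX vocabulary and the triples below are illustrative assumptions,
# not part of any published multimedia event processing system.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/mm-events#")

def window_graph(triples):
    """Build an RDF graph for the triples observed in one window."""
    g = Graph()
    g.bind("ex", EX)
    # A toy RDFS hierarchy: cars and buses are both vehicles.
    g.add((EX.Car, RDFS.subClassOf, EX.Vehicle))
    g.add((EX.Bus, RDFS.subClassOf, EX.Vehicle))
    for s, p, o in triples:
        g.add((s, p, o))
    return g

# Detections emitted by upstream producers during the current window.
window = [
    (EX.obj1, RDF.type, EX.Car),
    (EX.obj2, RDF.type, EX.Bus),
    (EX.obj1, EX.onSameRoadAs, EX.obj2),
]

g = window_graph(window)

# SPARQL query over the window graph: find pairs of vehicles on the same
# road. The subclass property path stands in for RDFS-style reasoning.
q = """
    SELECT ?a ?b WHERE {
        ?a a/rdfs:subClassOf* ex:Vehicle .
        ?b a/rdfs:subClassOf* ex:Vehicle .
        ?a ex:onSameRoadAs ?b .
    }
"""
for a, b in g.query(q, initNs={"ex": EX, "rdfs": RDFS}):
    print(f"vehicles co-located in this window: {a} {b}")
        </preformat>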
        <sec id="sec-1-1-1">
          <title>2.2. Multimedia Event</title>
        </sec>
        <sec id="sec-1-1-2">
          <title>Representation and</title>
        </sec>
        <sec id="sec-1-1-3">
          <title>Processing</title>
        <p>CEP engines inherently lacked support for unstructured multimedia events, which was mitigated by a generalized approach for handling multimedia events as native events in CEP engines, as presented in [<xref ref-type="bibr" rid="ref1">1</xref>]. Angsuchotmetee et al. [<xref ref-type="bibr" rid="ref14">14</xref>] presented an ontological approach for modeling complex events and multimedia data with syntactic and semantic interoperability in multimedia sensor networks, which allows subscribers to define application-specific complex events while keeping the low-level network representation generic. Aslam et al. [<xref ref-type="bibr" rid="ref15">15</xref>] leveraged domain adaptation and online transfer learning in multimedia event processing to extend support to unknown events. A knowledge graph is suitable for semantic representation of and reasoning over video streams due to its scalability and maintainability [<xref ref-type="bibr" rid="ref16">16</xref>], as demonstrated in [<xref ref-type="bibr" rid="ref5">5</xref>]. VidCEP [<xref ref-type="bibr" rid="ref2">2</xref>], a CEP framework for the detection of spatiotemporal video events expressed by subscriber-defined queries, includes a graph-based representation, the Video Event Query Language (VEQL) and a complex event matcher for video data.</p>
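        <p>The following sketch illustrates the graph-based idea behind such representations: each frame's detections become nodes, pairwise spatial relations become edges, and a subscriber-style pattern is matched against the graph. The node labels, relation names and the simple matcher are illustrative assumptions and do not reproduce the published VEKG or VEQL implementations.</p>
        <preformat>
# Sketch of a graph-based event representation in the spirit of VEKG/VidCEP.
# Labels, relation names and the matching routine are illustrative
# assumptions; they do not reproduce the published VEQL implementation.
import networkx as nx

def frame_to_graph(detections, relations):
    """Objects become nodes, pairwise spatial relations become edges."""
    g = nx.DiGraph()
    for obj_id, label in detections:
        g.add_node(obj_id, label=label)
    for subj, rel, obj in relations:
        g.add_edge(subj, obj, relation=rel)
    return g

def match_pattern(g, subj_label, relation, obj_label):
    """Return (subject, object) pairs whose labels and edge satisfy the pattern."""
    hits = []
    for u, v, data in g.edges(data=True):
        if (data["relation"] == relation
                and g.nodes[u]["label"] == subj_label
                and g.nodes[v]["label"] == obj_label):
            hits.append((u, v))
    return hits

# One video frame: two cars, one person, plus spatial relations from a detector.
frame = frame_to_graph(
    detections=[("o1", "car"), ("o2", "car"), ("o3", "person")],
    relations=[("o1", "behind", "o2"), ("o3", "near", "o2")],
)

# A subscriber-style pattern: "car behind car", e.g. one clause of a
# congestion query expressed over the graph representation.
print(match_pattern(frame, "car", "behind", "car"))   # [('o1', 'o2')]
        </preformat>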
        </sec>
      <sec id="sec-1-1-2">
        <title>2.3. Visual and Commonsense Reasoning</title>
        <p>In addition to the objects and their attributes in images, the detection of relations among these objects is crucial for scene understanding, for which compositional models [<xref ref-type="bibr" rid="ref17">17</xref>], visual phrase models [<xref ref-type="bibr" rid="ref18">18</xref>] and DNN-based relational networks [<xref ref-type="bibr" rid="ref19">19</xref>] are available. Visual and semantic embeddings aid large-scale visual relation detection; for example, Zhang et al. [<xref ref-type="bibr" rid="ref4">4</xref>] employed both visual and textual features to leverage the interactions between objects for relation detection. Similarly, Peyre et al. [<xref ref-type="bibr" rid="ref3">3</xref>] added a visual phrase embedding space during learning to enable analogical reasoning for unseen relations and to improve robustness to appearance variations of visual relations. Table 1 presents some knowledge bases publicly available for visual reasoning. Wan et al. [<xref ref-type="bibr" rid="ref7">7</xref>] proposed the use of a commonsense knowledge graph along with visual features to enhance visual relation detection. Rajani et al. [<xref ref-type="bibr" rid="ref20">20</xref>] leverage human reasoning and language models to generate human-like explanations for DNN-based commonsense question answering. Various commonsense reasoning methods and datasets are available for visual commonsense reasoning [<xref ref-type="bibr" rid="ref21">21</xref>] and story completion [<xref ref-type="bibr" rid="ref22">22</xref>].</p>
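        <p>As a rough illustration of how commonsense knowledge could complement a purely visual relation detector, the sketch below re-ranks candidate predicates for a detected object pair using a tiny, hand-written set of commonsense triples; the triples, scores and weighting rule are assumptions made for illustration and are not the method of any cited work.</p>
        <preformat>
# Sketch: combining visual relation scores with a commonsense prior.
# The detector scores, the mini commonsense KB and the weighting are
# illustrative assumptions, not the method of any cited paper.

# Commonsense triples (subject label, predicate, object label).
COMMONSENSE = {
    ("person", "rides", "bicycle"),
    ("person", "sits on", "chair"),
    ("car", "parked on", "road"),
}

def commonsense_prior(subj, pred, obj):
    """1.0 if the triple is plausible according to the KB, else a small value."""
    return 1.0 if (subj, pred, obj) in COMMONSENSE else 0.1

def rerank(subj, obj, visual_scores, weight=0.5):
    """Blend detector confidence with the commonsense prior for each predicate."""
    blended = {
        pred: (1 - weight) * score + weight * commonsense_prior(subj, pred, obj)
        for pred, score in visual_scores.items()
    }
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)

# Visual-only scores slightly prefer the implausible predicate "wears";
# the commonsense prior pushes "rides" back to the top.
print(rerank("person", "bicycle", {"rides": 0.45, "wears": 0.55}))
        </preformat>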
    <sec id="sec-3">
      <title>Reasoning in</title>
    </sec>
    <sec id="sec-4">
      <title>Multimedia Event</title>
    </sec>
    <sec id="sec-5">
      <title>Processing</title>
        <p>The current multimedia event representation methods use a knowledge graph to represent the detected objects, their attributes and the relations among the objects in video streams. Pre-defined spatial-temporal rules are used to form relations among the objects. However, the complex relations that exist among real-world objects also depend on semantic facts and situational variables that cannot be explicitly specified as rules for every possible event. The statistical reasoning methods and knowledge bases discussed in Section 2 have great potential to complement the rule-based relation formation in multimedia event processing by injecting semantic knowledge and reasoning to extract more semantically meaningful relations among objects. This advancement will allow subscribers to define abstract or high-level, human-understandable event query rules that can be decomposed into spatial and temporal patterns. The spatiotemporal matching of the queried high-level events will be performed on the objects, the rule-based relations and the relations extracted using visual reasoning. The subscriber will be instantly notified of the high-level event as a combined detection of those spatiotemporal patterns. The idea of developing an end-to-end multimedia event processing system supporting visual reasoning over video streams (Fig. 2) poses several challenges that are discussed in the next section. This novel approach will give more expressive power to subscribers in querying complex events in multimedia streams, and will thus increase the scope of real-time applications of multimedia event processing in smart city applications as well as in internet media streaming applications.</p>
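        <p>To make this decomposition concrete, the sketch below treats a hypothetical traffic congestion subscription as a spatial pattern (many vehicles with small spacing) that must persist over a temporal window before the subscriber is notified; the thresholds, the window length and the helper functions are assumptions for illustration only.</p>
        <preformat>
# Sketch: decomposing a high-level "traffic congestion" subscription into a
# spatial pattern checked per frame and a temporal pattern checked over a
# sliding window. Thresholds, window sizes and field names are illustrative
# assumptions, not constructs of any published query language.
from collections import deque

def congested_frame(objects, min_vehicles=5, max_avg_gap=30.0):
    """Spatial pattern: many vehicles with a small average spacing (in pixels)."""
    vehicles = [o for o in objects if o["label"] in ("car", "bus", "truck")]
    if len(vehicles) &lt; min_vehicles:
        return False
    xs = sorted(o["x"] for o in vehicles)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    return sum(gaps) / len(gaps) &lt;= max_avg_gap

def monitor(frame_stream, window_size=10, min_hits=8):
    """Temporal pattern: the spatial pattern holds in most frames of a window."""
    window = deque(maxlen=window_size)
    for timestamp, objects in frame_stream:
        window.append(congested_frame(objects))
        if len(window) == window_size and sum(window) &gt;= min_hits:
            yield timestamp  # notify the subscriber of a congestion event

# frame_stream is expected to yield (timestamp, detections) pairs, where
# detections is a list of dicts such as {"label": "car", "x": 12.0}.
        </preformat>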
      <sec id="sec-5-1">
        <title>3.2. Challenges</title>
        <p>1. Suitable representation for reasoning: It is crucial to select a generalized and scalable model to represent events and to effectively perform automated reasoning that derives more meaningful and expressive spatiotemporal events.</p>
        <p>2. Expressive query definition and matching: Providing a generic and human-friendly format to subscribers for writing expressive and high-level queries would require new constructs. Matching queries against the low-level events and relations, along with reasoning via knowledge bases, requires efficient retrieval within the complex event matcher. Real-world complex events can share similar patterns, occur as a cluster of similar events or occur in a hierarchical manner, which requires generalized, adaptive and scalable spatiotemporal constructs to query such events.</p>
        <p>3. Labeling and training samples of visual relations: There can be a large number of objects and possible relations among them in images, which can result in a large number of categories of relations. It is difficult to obtain balanced categories of relations in the training data. For example, Visual Genome [<xref ref-type="bibr" rid="ref6">6</xref>] has a huge number of relations with unbalanced instances of each relation.</p>
        <p>4. Consistent integration of knowledge bases: The object labels in datasets for object detection and the entity labels in knowledge bases (e.g. person, human, man) are not always the same. Similarly, knowledge bases have different labels for the same entity, and different names for the same attribute (e.g. birthPlace and placeOfBirth) or relation (e.g. 'at left' and 'to left of'). This can cause inconsistency or redundancy while integrating relations from the knowledge bases, as illustrated in the sketch following these challenges. It is important to select a knowledge base and a dataset that are consistent and suitable for the combined use of both object detection and visual reasoning.</p>
        <p>5. Supporting rare or unseen visual relations: Apart from the common relations, very rare or unseen relations among objects also appear in certain scenes. It is nearly impossible to collect sufficient training samples for all possible seen and unseen relations. Handling such relations while evaluating the models is also a challenge.</p>
        <p>6. Temporal processing of objects and relations: The recent methods on this subject address complex inference tasks by decomposing images or scenes into objects and visual relations among the objects. The temporal events and the temporal tracking of the detected objects and predicted relations have not been explored much, which is crucial for spatiotemporal event processing.</p>
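        <p>The following sketch illustrates the label-consistency issue raised in challenge 4 by normalizing detector labels and knowledge-base names to shared canonical forms before relations are merged; the synonym tables are invented for illustration, and a real integration would require a curated alignment.</p>
        <preformat>
# Sketch: normalizing labels before integrating detector output with
# knowledge-base relations. The synonym tables are illustrative assumptions.

ENTITY_SYNONYMS = {
    "person": {"person", "human", "man", "woman", "pedestrian"},
    "car": {"car", "automobile", "motorcar"},
}

ATTRIBUTE_SYNONYMS = {
    "birthPlace": {"birthPlace", "placeOfBirth"},
    "leftOf": {"at left", "to left of", "leftOf"},
}

def canonical(label, synonym_table):
    """Map a raw label to its canonical form, or keep it unchanged."""
    for canon, variants in synonym_table.items():
        if label in variants:
            return canon
    return label

def merge_relation(triple):
    """Rewrite a (subject, predicate, object) triple into canonical vocabulary."""
    s, p, o = triple
    return (canonical(s, ENTITY_SYNONYMS),
            canonical(p, ATTRIBUTE_SYNONYMS),
            canonical(o, ENTITY_SYNONYMS))

# Two knowledge bases describe the same fact with different labels;
# after normalization the triples become identical and are not duplicated.
print(merge_relation(("human", "to left of", "automobile")))
print(merge_relation(("person", "at left", "car")))
# Both print: ('person', 'leftOf', 'car')
        </preformat>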
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <sec id="sec-6-1">
        <title>This work was conducted with the financial</title>
        <p>support of the Science Foundation Ireland</p>
      </sec>
      <sec id="sec-6-2">
        <title>Centre for Research Training in Artificial Intelligence under Grant No. 18/CRT/6223.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aslam</surname>
          </string-name>
          , E. Curry,
          <article-title>Towards a generalized approach for deep neural network based event processing for the internet of multimedia things</article-title>
          ,
          <source>IEEE Access 6</source>
          (
          <year>2018</year>
          )
          <fpage>25573</fpage>
          -
          <lpage>25587</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yadav</surname>
          </string-name>
          , E. Curry, Vidcep:
          <article-title>Complex event processing framework to detect spatiotemporal patterns in video streams</article-title>
          ,
          <source>in: 2019 IEEE International Conference on Big Data (Big Data)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>2513</fpage>
          -
          <lpage>2522</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Peyre</surname>
          </string-name>
          , I. Laptev,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sivic</surname>
          </string-name>
          ,
          <article-title>Detecting unseen visual relations using analogies</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1981</fpage>
          -
          <lpage>1990</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kalantidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Paluri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elgammal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elhoseiny</surname>
          </string-name>
          ,
          <article-title>Large-scale visual relationship understanding</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>33</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>9185</fpage>
          -
          <lpage>9194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yadav</surname>
          </string-name>
          , E. Curry, Vekg:
          <article-title>Video event knowledge graph to represent video streams for complex event pattern matching</article-title>
          ,
          <source>in: 2019 First International Conference on Graph Computing (GC)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kravitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kalantidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Shamma</surname>
          </string-name>
          , et al.,
          <article-title>Visual genome: Connecting language and vision using crowdsourced dense image annotations</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>123</volume>
          (
          <year>2017</year>
          )
          <fpage>32</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <article-title>Iterative visual relationship detection via commonsense knowledge graph</article-title>
          , in: Joint International Semantic Technology Conference, Springer,
          <year>2019</year>
          , pp.
          <fpage>210</fpage>
          -
          <lpage>225</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <article-title>RDF 1.1 Concepts and Abstract Syntax (</article-title>
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <article-title>A review: Knowledge reasoning over knowledge graph</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>141</volume>
          (
          <year>2020</year>
          )
          <fpage>112948</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <article-title>Hybrid reasoning in knowledge graphs: Combing symbolic reasoning and statistical reasoning</article-title>
          , Semantic Web (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandrasekhar</surname>
          </string-name>
          ,
          <article-title>Object detection meets knowledge graphs (</article-title>
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>U.</given-names>
            <surname>Kursuncu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <article-title>Knowledge infused learning (k-il): Towards deep incorporation of knowledge in deep learning</article-title>
          , arXiv preprint arXiv:
          <year>1912</year>
          .
          <volume>00512</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>García-Durán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumančić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Niepert</surname>
          </string-name>
          ,
          <article-title>Learning sequence encoders for temporal knowledge graph completion</article-title>
          , arXiv preprint arXiv:
          <year>1809</year>
          .
          <volume>03202</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Angsuchotmetee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chbeir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cardinale</surname>
          </string-name>
          ,
          <article-title>Mssn-onto: An ontology-based approach for flexible event processing in multimedia sensor networks</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          <volume>108</volume>
          (
          <year>2020</year>
          )
          <fpage>1140</fpage>
          -
          <lpage>1158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aslam</surname>
          </string-name>
          , E. Curry,
          <article-title>Reducing response time for multimedia event processing using domain adaptation</article-title>
          ,
          <source>in: Proceedings of the 2020 International Conference on Multimedia Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>261</fpage>
          -
          <lpage>265</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Greco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ritrovato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vento</surname>
          </string-name>
          ,
          <article-title>On the use of semantic technologies for video analysis</article-title>
          ,
          <source>Journal of Ambient Intelligence and Humanized Computing</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          , Vip-cnn:
          <article-title>Visual phrase guided convolutional neural network</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1347</fpage>
          -
          <lpage>1356</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sadeghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Kumar Divvala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          , Viske:
          <article-title>Visual knowledge extraction and question answering by visual verification of relation phrases</article-title>
          ,
          <source>in: CVPR</source>
          <year>2015</year>
          , pp.
          <fpage>1456</fpage>
          -
          <lpage>1464</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Detecting visual relationships with deep relational networks</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and Pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>3076</fpage>
          -
          <lpage>3086</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Rajani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <article-title>Explain yourself! leveraging language models for commonsense reasoning</article-title>
          , arXiv preprint arXiv:
          <year>1906</year>
          .
          <volume>02361</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zellers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>From recognition to cognition: Visual commonsense reasoning</article-title>
          ,
          <source>in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zellers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          , Hellaswag:
          <article-title>Can a machine really finish your sentence?</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kuznetsova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Alldrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uijlings</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Krasin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pont-Tuset</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kamali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Popov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Malloci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          , et al.,
          <source>The open images dataset v4</source>
          ,
          <source>International Journal of Computer Vision</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Tanon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <article-title>Yago 4: A reasonable knowledge base</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>583</fpage>
          -
          <lpage>596</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Ronchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          ,
          <article-title>Describing common human visual actions in images</article-title>
          ,
          <source>arXiv preprint arXiv:1506.02203</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>