Complex Event Processing on Real-time Video Streams

Ziyu Li
supervised by Asterios Katsifodimos, Alessandro Bozzon and Geert-Jan Houben
Delft University of Technology
Z.Li-14@tudelft.nl

Proceedings of the VLDB 2020 PhD Workshop, August 31st, 2020. Tokyo, Japan. Copyright (C) 2020 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT

Cameras are ubiquitous nowadays, and video analytics systems are widely used in surveillance, traffic control, business intelligence and autonomous driving. Some applications, e.g., detecting road congestion in traffic monitoring, require continuous and timely reporting of complex patterns. However, conventional complex event processing (CEP) systems fail to support video processing, while existing video query languages offer limited support for expressing advanced CEP queries that involve constructs such as iteration and windows.

In this PhD research, we aim to develop systems and methods to alleviate these issues. In this paper, we first identify the need for an expressive CEP language that allows users to define queries over video streams and receive fast, accurate results. To evaluate CEP queries on videos in real time and with high accuracy, we explain how a streaming query engine can be designed to provide native support for machine learning (ML) models, enabling fast and accurate inference on video streams. In addition, we describe a set of optimization problems that arise when ML models, with trade-offs in speed, accuracy, and cost, are part of a query plan. Finally, we describe how query plans on real-time videos can be optimized and deployed on edge devices with limited computational and network capabilities.

1. INTRODUCTION

Cameras installed in buildings, deployed on streets, and fitted on various devices generate vast amounts of video content on a daily basis. This video footage empowers a wide range of important applications such as indoor surveillance, traffic control, business intelligence and autonomous driving [8]. In real-world scenarios, we are generally interested in detecting complex events and receiving instant alerts, e.g., detecting road congestion in traffic monitoring and alerting to dangerous scenes in autonomous driving. Considering the urgent need for real-time response, it is infeasible to first store videos in databases. Ideally, videos would be processed on the spot.

Complex Event Processing (CEP) systems provide expressive languages to construct CEP queries for pattern detection over events. The constructs of an event algebra that fulfill the need to define complex patterns have been identified [3]. CEP query languages [4, 2, 1] define complex events to be detected and correlate them into higher-level, meaningful information in data streams. However, less attention has been paid to content-based event detection on video streams in CEP systems, since conventional CEP systems accept structured data as input.

There have been many proposals for query languages on video [13, 12]. These languages provide high-level semantics, usually in an SQL-like format, allowing users to query video content. Compared to CEP languages, these SQL-like languages provide limited support for detecting complex patterns on video content: they lack operations such as iteration and join, and they restrict windows to counting frames, which leads to constrained queries. Table 1 compares current event query languages with existing video query languages. The operators listed in the table combine those commonly used in streaming systems [3] and in video retrieval systems. From the table, we identify two research gaps: 1) CEP systems lack support for video data, while 2) multimedia retrieval languages fail to support a well-rounded set of operators for CEP. This PhD work aims to bridge these two gaps.

Table 1: Comparison between CEP and video query languages. (SIO: Single-item Operators, LO: Logic Operators, FMO: Flow Management Operators; TR: Temporal Relationship; SR: Spatial Relationship)

Columns: SIO, LO, Iterations, Windows, FMO, Aggregates, TR, SR, Data

CEP Languages
SASE+ [4]     X  Only negation  X  X  X         Structured
Flink [1]     X  X  X  X  X  X  X                Structured
TESLA [2]     X  Only negation  X  X  X         Structured
VidCEP [17]   X  X  X  X  X                      Video

Video Query Languages
CVQL [12]     X  X  X  X                         Video
SVQL [13]     X  X  X  X                         Video
FrameQL [8]   X  X  X                            Video
SVQ [16, 10]  X  X  X  X  X                      Video

Video streams are difficult to process because of their low-level features (pixel values). The state-of-the-art object detection algorithm Mask R-CNN [6] runs at 3 frames per second (fps), while real-time video arrives at 25-30 fps. Nonetheless, ML techniques such as model specialization and model compression can be applied to expand the model search space [9, 8]. Models differ in shape, size, and the classes of objects they can identify, and, most importantly, in accuracy and latency. Thus, it is essential to optimize these aspects according to user preference, i.e., quality-oriented or speed-oriented.

The objective of this PhD research is to leverage the expressiveness of streaming languages in order to construct complex event patterns over real-time video streams, and to optimize their evaluation for efficient and sound results. To achieve this goal, in this paper we describe the PhD plan, split into three main research lines. First, we identify the motivating query types that cannot be supported adequately by current video query languages. Second, we show how to exploit operators in the context of video processing by leveraging existing streaming operators and state-of-the-art computer vision techniques. Since traditional query optimizers typically do not consider trade-offs between accuracy, speed and cost, we demonstrate the need to construct an ML model search space and to navigate it using a Pareto frontier and an AND-OR graph. Finally, we consider deploying CEP query plans over video data closer to the devices where the data are generated, and we outline the challenges of optimizing such plans to execute at the edge, reducing communication and computation costs.

2. MOTIVATING EXAMPLE QUERIES

We first identify the use cases and scenarios that users may be interested in when processing real-time video streams.

Q1-Identifying an object with attributes. Detection on a single object class asks for occurrences of a specific object. This query type, the simplest event pattern, covers a wide range of applications. For instance, a road surveillance system can monitor whether a car is parked in a no-stopping lane. Likewise, detecting a traffic light along with its color is an essential task in autonomous driving. However, recent works, such as [9, 16, 17], either do not optimize the model for different scenarios or do not support the detection of additional attributes.

Q2-Identifying multiple classes of objects. What differentiates Q2 from Q1 is the detection of more than one class of object. In this query type, users are interested in detecting multiple objects. For example, in autonomous driving, the system should be alerted when multiple pedestrians are walking on the crossroad with a yellow light on [8]. Still, [17, 10] have a shortcoming here: the model is fixed, and the optimization cannot address the problem of model drift.

Q3-Temporal Relation. In this type of query, the time factor is involved in the complex event patterns, for example, detecting a sequence of events within a time interval. Our proposal for this query type distinguishes itself by the high expressiveness of window operators and the ability to detect iterated patterns. By providing broader window functionality, it becomes feasible to analyse the data streams in a timely manner, e.g., to report aggregates over time, which has not been seen in previous works [9, 10, 8, 16, 12, 13]. In addition, few existing works on video query languages mention iteration. This operator plays an important role in defining sequences of events by indicating that an event occurs multiple times in sequence.

Q4-Spatial Relation. This query type concerns the spatial relation between objects, e.g., detecting a car to the right of a lamp post. This type of query can be applied to identify human-object interaction [16] and serves as fundamental support for autonomous driving. By combining it with Q3 or other query types, it is feasible to construct more expressive queries, e.g., to detect human behaviors or track an object over time.

3. RESEARCH OVERVIEW

In this section, we highlight the challenges of CEP over real-time video streams. To process video streams in a CEP system, the challenges arise in three aspects: the query language, query planning with ML models, and optimization of inference. These are outlined below and explained in the rest of the paper.

Defining a CEP language for video processing. In CEP systems, a user can define specific queries using a CEP language or streaming operators to detect occurrences of particular patterns of (low-level) events. The event streams are then fed into the user-defined patterns. If all the conditions are met, the user receives an outcome or alarm.

This raises some questions: How can a complex event pattern over video streams be defined given the available operators? Are additional operators required to express the content of videos?

Query planning with ML models. Unlike in conventional CEP systems, the video streams generated by devices cannot be matched against patterns directly. Instead, they are first processed by cascading models, which transform the video frames into events carrying, e.g., object class, location, and object attributes.

Given a complex event pattern, how can we decompose it and solve its components, e.g., identifying color and detecting objects? How many models can be applied to address the issue? Do the state-of-the-art models fulfill the need? If not, how can we extend the model search space?

Optimizing inference. To achieve high efficiency and effectiveness, an optimization mechanism is needed that automatically assigns fast models for a given query and accuracy target. Is there an optimal solution for each query that achieves low latency and high accuracy? Is there a method that can represent the model selection process? Can we further optimize inference by pushing more computation to the edge, closer to the data source?

4. A CEP LANGUAGE FOR VIDEOS

As discussed previously, conventional CEP systems lack support for video data, while video query languages do not cater for advanced CEP needs. By combining the strengths of both, we can design a CEP language for videos. We showcase the available operators in terms of the query types described above.

The most basic operator is selection [5]. This operator applies to events and filters out those that do not satisfy the predicates. Projection is another single-item operator, together with selection. This type of operator transforms the attribute values of an event, and thus can be applied to generate video events by transforming video frames into streams of events with a set of attributes, e.g., timestamp, object class and color.

For Q3, which deals with temporal relations, the time factor plays a pivotal role. Existing video query languages [8, 17, 10] measure time by counting frames and converting the count using the rate of incoming frames in frames per second. This approach is limited and makes it hard to answer requests such as reporting on a regular basis. Our notion of the window operator, adopted from stream processing, assigns windows to events with respect to different notions of time chosen by application developers. One important supported notion of time is event time, which signifies the time at which a frame was generated at the source device. This provides us with a powerful abstraction for matching temporal relationships. Notably, different types of windows can be expressed, namely tumbling windows, sliding windows, session windows and global windows [1]. The high expressiveness of window operators hence enables timely reporting and flexible manipulation of queries. It is also worth noting that iteration, an important feature for defining repeated occurrences of a matching event, is not covered in previous works [13, 12, 8, 17, 10]. In our proposal, we will extend the temporal patterns seen in previous works by incorporating the rich definition of window operators and introducing iteration to define sequences of events.

Detecting the spatial relation between objects is an important task in video retrieval, and it differs from conventional stream processing, where events do not reveal any spatial relationship. New operators should be defined to detect such relations in video content. State-of-the-art object detectors provide information that helps identify the class of objects as well as their location within a frame, which makes it possible to answer Q4 queries.

In addition, modern streaming systems scale out by running multiple operator instances that process disjoint parts of the data on shared-nothing commodity hardware. This scalable distributed architecture is suitable for joining streams from distributed sources or distributing data to various nodes. It lays the foundation for edge computing, where a task may be divided and distributed to various computational nodes.

5. QUERY PLANNING WITH ML MODELS

In order to run queries in the CEP language, we first need to identify the models that can be applied to solve the tasks in each query, e.g., detecting color or identifying a specific object. For each task, various models and solutions are available. For example, to detect an object, both object detection and image classification can be applied. State-of-the-art object detection models include, but are not limited to, Mask R-CNN [6] and YOLOv2 [14]. AlexNet [11] and VGG-16 [15], on the other hand, are state-of-the-art deep image classification models. Beyond the above-mentioned models, ML techniques such as model specialization and model compression can also be applied to enlarge the model search space. For example, a full and complex algorithm, e.g., YOLO9000 [14], can classify or detect thousands of classes (tasks). A specialized model, in contrast, is a smaller model (i.e., with fewer layers and neurons) that mimics the behavior of a full NN model on a particular task. Sacrificing generality by restricting inference models to specific tasks can substantially reduce inference cost and latency.

To compile the query, we will construct a model search space that lists all the models available to solve a particular task. Accordingly, there are multiple options that can be selected with respect to various outcomes, e.g., accuracy and inference latency. These models differ in complexity, e.g., in the number of layers, the number of neurons in dense layers, and the ability to generalize.

6. OPTIMIZING VIDEO CEP QUERIES

As discussed above, there are various ways to solve a task, e.g., detecting an object class, and no single model outperforms all the others in terms of both accuracy and speed. Moreover, the performance of the models is data-dependent [9]. Thus, we should optimize the choice of inference models for each query and consider the trade-off between accuracy and speed when assigning the models.

Managing the trade-off between accuracy and inference throughput can be regarded as a multi-objective optimization problem (MOP). Pareto-optimal solutions are used to select candidate solutions for each task. In this case, the aim is to process the video fast and accurately, and thus the objectives are accuracy and speed. As an example, in Figure 1a, every symbol represents a specific model; the models vary in shape, size and set of tasks, but fulfill the same goal, e.g., identifying a specific object. The blue line represents the Pareto frontier. Suppose that f1 is inference time and f2 is accuracy. We want f1 to be lower and f2 to be higher. No solution in the search space is superior to a solution on the Pareto frontier in terms of both objectives. Only the models that lie on the Pareto frontier (such as OD1 and OD2) are considered for further comparison.

Figure 1: Optimization approaches. (a) Pareto frontier. (b) AND-OR graph.

After we obtain the candidate (Pareto-optimal) models for each task, the next step is to select and assign the optimal solution. In the previous step, the models are coupled with their outcomes, including inference throughput and accuracy. If a user is quality-oriented, the model with the highest accuracy is selected; if the user is speed-oriented, the fastest model is selected. To represent the solution of the task, we will apply an AND-OR graph, as shown in Figure 1b. The query is decomposed into a set of smaller problems, e.g., identifying the color red and detecting a car. The leaves of the AND-OR graph represent the distinct candidate models on the Pareto frontier, and the optimal option is passed to their parents for further analysis. The process continues until the query is solved.

7. EDGE/FOG COMPUTING

To process video streams in real time, it is feasible to reduce the transmission of data from one edge to another by offloading tasks to devices closer to the data source. However, due to the limited computation power available at edge devices, interactions with remote clusters or public clouds are inevitable. The models that are offloaded to the edge devices may thus sacrifice inference accuracy, and the complex models deployed remotely should compensate for it. In this phase, the challenge is to decide which tasks to offload and what information to transmit given the dynamic circumstances of real-world deployment.

In the edge computing paradigm, the objectives that need to be considered and optimized are no longer restricted to inference latency and accuracy, but also include wireless bandwidth, processing capacity and energy consumption [7], which makes it harder to assign optimal configurations in real time.

8. FUTURE PLAN

Building prototype. As the aim of this PhD project is to provide highly expressive semantics for users to define queries over real-time video streams, we intend to first develop a prototype that supports simple queries, e.g., Q1 and Q2. In this phase, the focus lies on the optimization module, where the trade-off between latency and accuracy is exploited. Other queries will be implemented thereafter.

Video decoding. The characteristics of the video, e.g., frame resolution and frame sampling rate, may influence model performance and should also be considered. For example, a crowded road scene, compared to a quiet neighborhood, requires a higher frame sampling rate so that no event is missed, and a higher resolution to retain the details of the content. The video characteristics will serve as important factors and will be taken into account in the configuration plan.

Adaptive configuration. Given that the scene captured by a camera changes over time, even for a static-angle camera, model drift is a major issue that affects the performance of video processing. Dynamic configuration is complex and challenging if we want to maintain real-time performance, since reconfiguration may hinder the inference procedure. In the future, we will investigate adaptive configurations given the trade-off between inference speed and the quality of the results.

Edge computing. To bring the computation closer to the edge, we will investigate how to apply an edge computing paradigm to the deployment. In this phase, the design of the edge-cloud architecture, the measurement of real-time performance, and the decision to offload tasks and commit data will be important problems going forward.

9. ACKNOWLEDGMENTS

The author would like to thank Cognizant for their support in this project. Many thanks also to Marios Fragkoulis, Christos Koutras, Agathe Balayn, Andra Ionescu and Georgios Siachamis for their valuable feedback on this work.

10. REFERENCES

[1] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache Flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4), 2015.
[2] G. Cugola and A. Margara. TESLA: a formally defined event specification language. In Proceedings of the Fourth ACM International Conference on Distributed Event-Based Systems, pages 50–61, 2010.
[3] G. Cugola and A. Margara. Processing flows of information: From data stream to complex event processing. ACM Computing Surveys (CSUR), 44(3):1–62, 2012.
[4] Y. Diao, N. Immerman, and D. Gyllstrom. SASE+: An agile language for Kleene closure over event streams. UMass Technical Report, 2007.
[5] N. Giatrakos, E. Alevizos, A. Artikis, A. Deligiannakis, and M. Garofalakis. Complex event recognition in the big data era: a survey. The VLDB Journal, 29(1):313–352, 2020.
[6] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[7] C.-C. Hung, G. Ananthanarayanan, P. Bodik, L. Golubchik, M. Yu, P. Bahl, and M. Philipose. VideoEdge: Processing camera streams using hierarchical clusters. In 2018 IEEE/ACM Symposium on Edge Computing (SEC), pages 115–131. IEEE, 2018.
[8] D. Kang, P. Bailis, and M. Zaharia. BlazeIt: Optimizing declarative aggregation and limit queries for neural network-based video analytics. arXiv preprint arXiv:1805.01046, 2018.
[9] D. Kang, J. Emmons, F. Abuzaid, P. Bailis, and M. Zaharia. NoScope: optimizing neural network queries over video at scale. arXiv preprint arXiv:1703.02529, 2017.
[10] N. Koudas, R. Li, and I. Xarchakos. Video monitoring queries. In 2020 IEEE 36th International Conference on Data Engineering (ICDE), pages 1285–1296. IEEE, 2020.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[12] T. C. Kuo and A. L. Chen. Content-based query processing for video databases. IEEE Transactions on Multimedia, 2(1):1–13, 2000.
[13] C. Lu, M. Liu, and Z. Wu. SVQL: A SQL extended query language for video databases. International Journal of Database Theory and Application, 8(3):235–248, 2015.
[14] J. Redmon and A. Farhadi. YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7263–7271, 2017.
[15] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[16] I. Xarchakos and N. Koudas. SVQ: Streaming video queries. In Proceedings of the 2019 International Conference on Management of Data, pages 2013–2016, 2019.
[17] P. Yadav and E. Curry. VidCEP: Complex event processing framework to detect spatiotemporal patterns in video streams. In 2019 IEEE International Conference on Big Data (Big Data), pages 2513–2522. IEEE, 2019.
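As an illustration of the window and iteration operators discussed in Section 4, the following minimal Python sketch groups video events into tumbling event-time windows and checks for a repeated (iterated) event. It is not the language or engine proposed in the paper: the `VideoEvent` type, its fields, and both function names are hypothetical and chosen only for this example, and a real streaming engine would evaluate such operators incrementally rather than over an in-memory list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoEvent:
    """Hypothetical event derived from one detection in a video frame."""
    event_time: float   # seconds, stamped when the frame was generated at the source
    object_class: str   # e.g. "car"
    color: str          # e.g. "red"


def tumbling_window_counts(events, size_s, object_class):
    """Count matching events per tumbling (fixed, non-overlapping) event-time window.

    Returns a dict mapping each window start time to the number of matches.
    """
    counts = defaultdict(int)
    for ev in events:
        if ev.object_class == object_class:            # selection predicate
            window_start = int(ev.event_time // size_s) * size_s
            counts[window_start] += 1
    return dict(sorted(counts.items()))


def has_iteration(events, object_class, min_times, within_s):
    """Check whether `object_class` occurs at least `min_times` times
    within any span of `within_s` seconds (a simple iteration pattern)."""
    times = sorted(ev.event_time for ev in events if ev.object_class == object_class)
    for i in range(len(times) - min_times + 1):
        if times[i + min_times - 1] - times[i] <= within_s:
            return True
    return False


if __name__ == "__main__":
    # Made-up sample stream, for illustration only.
    stream = [
        VideoEvent(0.5, "car", "red"),
        VideoEvent(1.2, "car", "blue"),
        VideoEvent(4.9, "person", "n/a"),
        VideoEvent(5.1, "car", "red"),
        VideoEvent(9.9, "car", "red"),
        VideoEvent(10.0, "car", "red"),
    ]
    print(tumbling_window_counts(stream, 5, "car"))
    print(has_iteration(stream, "car", min_times=3, within_s=5.0))
```

Because windows are keyed on `event_time` rather than on a frame counter, the same query keeps its meaning when the frame sampling rate changes, which is the advantage over frame counting that Section 4 argues for.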
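The model selection step described in Section 6 can be sketched as follows, under stated assumptions. The code computes the Pareto frontier over two objectives (inference latency f1, lower is better; accuracy f2, higher is better), picks one model per task according to a user preference, and combines the per-task choices as a single AND node over OR alternatives. The model names beyond OD1 and OD2, all latency and accuracy numbers, and the assumption that task latencies simply add up are hypothetical and serve only to make the example runnable; they are not measurements or design decisions from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    latency_ms: float   # f1: inference time, lower is better
    accuracy: float     # f2: accuracy, higher is better


def dominates(a, b):
    """True if `a` is no worse than `b` on both objectives and strictly better on one."""
    no_worse = a.latency_ms <= b.latency_ms and a.accuracy >= b.accuracy
    strictly_better = a.latency_ms < b.latency_ms or a.accuracy > b.accuracy
    return no_worse and strictly_better


def pareto_frontier(models):
    """Keep only the models that no other model dominates."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


def select_model(models, preference):
    """Pick one frontier model: 'quality' maximizes accuracy, 'speed' minimizes latency."""
    frontier = pareto_frontier(models)
    if preference == "quality":
        return max(frontier, key=lambda m: m.accuracy)
    if preference == "speed":
        return min(frontier, key=lambda m: m.latency_ms)
    raise ValueError(f"unknown preference: {preference!r}")


def plan_query(tasks, preference):
    """AND over tasks, OR over each task's candidate models.

    Assumes (for illustration) that the tasks run one after another,
    so their latencies add up.
    """
    assignment = {task: select_model(models, preference) for task, models in tasks.items()}
    total_latency = sum(m.latency_ms for m in assignment.values())
    return assignment, total_latency


# Hypothetical profiles; the numbers are invented for this example.
OBJECT_DETECTORS = [
    ModelProfile("OD1", 30.0, 0.70),
    ModelProfile("OD2", 120.0, 0.88),
    ModelProfile("OD3", 150.0, 0.80),   # dominated by OD2
    ModelProfile("OD4", 330.0, 0.93),
    ModelProfile("OD5", 40.0, 0.65),    # dominated by OD1
]
COLOR_CLASSIFIERS = [
    ModelProfile("C1", 5.0, 0.90),
    ModelProfile("C2", 12.0, 0.97),
    ModelProfile("C3", 15.0, 0.95),     # dominated by C2
]

if __name__ == "__main__":
    query = {"detect car": OBJECT_DETECTORS, "identify red color": COLOR_CLASSIFIERS}
    plan, latency = plan_query(query, "speed")
    print({task: m.name for task, m in plan.items()}, latency)
```

Pruning to the frontier before applying the preference keeps the later choice cheap: a dominated model such as the hypothetical OD3 can never be the right answer for either a quality-oriented or a speed-oriented user, so it does not need to appear as a leaf of the AND-OR graph.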