Complex Event Processing on Real-time Video Streams

Ziyu Li
supervised by Asterios Katsifodimos, Alessandro Bozzon and Geert-Jan Houben
Delft University of Technology
Z.Li-14@tudelft.nl

Proceedings of the VLDB 2020 PhD Workshop, August 31st, 2020. Tokyo, Japan. Copyright (C) 2020
for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT
Cameras are ubiquitous nowadays and video analytics systems have been widely used in
surveillance, traffic control, business intelligence and autonomous driving. Some applications,
e.g., detecting road congestion in traffic monitoring, require continuous and timely reporting of
complex patterns. However, conventional complex event processing (CEP) systems fail to support
video processing, while the existing video query languages offer limited support for expressing
advanced CEP constructs, such as iteration and windows.
   In this PhD research, we aim to develop systems and methods to alleviate these issues. In this
paper, we first identify the need for an expressive CEP language which allows users to define
queries over video streams and receive fast, accurate results. To evaluate CEP queries on videos
in real time and with high accuracy, we explain how a streaming query engine can be designed to
provide native support for machine learning (ML) models for fast and accurate inference on video
streams. In addition, we describe a set of optimization problems that arise when ML models, with
trade-offs in speed, accuracy, and cost, are part of a query plan. Finally, we describe how query
plans on real-time videos can be optimized and deployed on edge devices with limited computational
and network capabilities.

1.    INTRODUCTION
   Cameras installed in buildings, deployed on streets, and fitted on various devices generate
vast amounts of video content on a daily basis. The video footage empowers a wide range of
important applications such as indoor surveillance, traffic control, business intelligence and
autonomous driving [8]. In real-world scenarios, we are generally interested in detecting complex
events and receiving instant alerts, e.g., detecting road congestion in traffic monitoring and
alerting to dangerous scenes in autonomous driving. Considering the urgent need for real-time
response, it is infeasible to first store the videos in a database and query them later. Ideally,
videos would be processed on the spot.
   Complex Event Processing (CEP) systems provide expressive languages to construct CEP queries
for pattern detection over events. The constructs of an event algebra that fulfills the need of
defining complex patterns have been identified [3]. CEP query languages [4, 2, 1] define complex
events to be detected and correlate them to more high-level, meaningful information in data
streams. However, less attention has been paid to content-based event detection on video streams
in CEP systems, considering that conventional CEP systems accept structured data as input.
   There have been many proposals for query languages on video [13, 12]. These languages provide
high-level semantics, usually in an SQL-like format, allowing users to query video content.
Compared to CEP languages, these SQL-like languages provide limited support for detecting complex
patterns on video content: they lack operators such as iteration and join, and they restrict
windows to counting frames, which leads to constrained queries. Table 1 shows a comparison of the
current event query languages and existing video query languages. The operators listed in the
table combine the ones commonly used in streaming systems [3] and in video retrieval systems. From
the table, we identify two research gaps: 1) CEP systems lack support for video data, while 2)
multimedia retrieval languages fail to support well-rounded operators for CEP. This Ph.D. work
aims to address these two gaps.

Table 1: Comparison between CEP and video query languages. (SIO: Single-item Operators, LO: Logic
Operators, FMO: Flow Management Operators; TR: Temporal Relationship; SR: Spatial Relationship;
X: supported, -: not supported)

Language       SIO   LO              Iterations   Windows   FMO   Aggregates   TR   SR   Data
CEP Languages
SASE+ [4]      X     Only negation   X            X         -     -            X    -    Structured
Flink [1]      X     X               X            X         X     X            X    -    Structured
TESLA [2]      X     Only negation   X            X         -     -            X    -    Structured
VidCEP [17]    X     X               -            X         -     -            X    X    Video
Video Query Languages
CVQL [12]      X     X               X            -         -     -            -    X    Video
SVQL [13]      X     X               -            -         -     -            X    X    Video
FrameQL [8]    X     X               -            -         -     -            X    -    Video
SVQ [16, 10]   X     X               -            -         -     X            X    X    Video

   Video streams are difficult to process due to their low-level features (pixel values). The
state-of-the-art object detection algorithm, Mask R-CNN [6], runs at 3 frames per second (fps),
whereas real-time video arrives at 25-30 fps. Nonetheless, techniques in ML, such as model
specialization and model compression, can be applied to expand the model search space [9, 8].
Models differ in shape, size, and the classes of object they can identify, and most importantly,
in terms of accuracy and latency. Thus, it is essential to optimize these aspects in response to
user preference, i.e., quality-oriented or speed-oriented.
   The objective of this PhD research is to leverage the expressiveness of streaming languages in
order to construct complex event patterns over real-time video streams, and to optimize the
process aiming for efficient and sound results. To achieve the goal, in this paper we describe the
PhD plan, split into three main research lines. First, we identify the motivating query types that
cannot be supported adequately by current video query languages. Second, we show how to exploit
the operators in the context of video processing by leveraging the existing streaming operators
and state-of-the-art computer vision techniques. Since traditional query
optimizers typically do not consider trade-offs of accuracy, speed and cost, we demonstrate the
need to construct an ML model search space and to navigate it using a Pareto frontier and an
AND-OR graph. Finally, we consider deploying CEP query plans over video data closer to the device
where they are generated and outline the challenges for optimizing such plans to execute at the
edge, reducing communication and computation costs.

2.   MOTIVATING EXAMPLE QUERIES
   We first identify the use cases and scenarios that users may be interested in when processing
real-time video streams.
   Q1-Identifying an object with attributes. Detection of a single object class asks for the
occurrence of a specific object. This query type, as the simplest event pattern, covers a wide
range of applications. For instance, a road surveillance system can monitor whether a car is
parked in a no-stopping lane. Also, it is an essential task in autonomous driving to detect a
traffic light along with its color. However, recent works, such as [9, 16, 17], either do not
optimize the model for different scenarios or do not support the detection of additional
attributes.
   Q2-Identifying multiple classes of objects. What differentiates Q2 from Q1 is the detection of
more than one class of object. In this query type, users are interested in perceiving multiple
objects. For example, in autonomous driving, the system should be alerted when multiple
pedestrians are walking on the crossroad while the light is yellow [8]. Still, [17, 10] are
deficient in that the model is fixed and the optimization cannot solve the problem of model drift.
   Q3-Temporal Relation. In this type of query, the time factor is involved in the complex event
patterns, e.g., detecting a sequence of events within a time interval. Our proposal for this query
type distinguishes itself by the high expressiveness of window operators and the ability to detect
iterated patterns. By providing broader window functionality, it is feasible to analyse the data
streams in a timely manner, e.g., to report aggregates over time, which has not been seen in
previous works [9, 10, 8, 16, 12, 13]. In addition, few existing works on video query languages
mention iteration. This operator plays an important role in defining sequences of events by
indicating that an event occurs multiple times in sequence.
   Q4-Spatial Relation. This query type perceives the spatial relation between objects, e.g.,
detecting a car to the right of a lamp post. This type of query can be applied to identify
human-object interaction [16] and serves as fundamental support for autonomous driving. By
combining it with Q3 or other query types, it is feasible to construct more expressive queries,
e.g., detecting human behaviors or tracking an object over time.

3.   RESEARCH OVERVIEW
   In this section, we highlight the challenges of CEP over real-time video streams.
   To process video streams in a CEP system, the challenges arise in three aspects: the query
language, query planning with ML models, and query optimization on inference, which are
illustrated below and explained in the rest of the paper.
   Defining a CEP language for video processing. In CEP systems, a user can define specific
queries using a CEP language or streaming operators to detect occurrences of particular patterns
of (low-level) events. The event streams are then fed into the user-defined patterns. If all the
conditions are met, the user receives an outcome or alarm.
   This raises some questions: How can a complex event pattern over video streams be defined
given the available operators? Are any additional operators required to express the content of
videos?
   Query planning with ML models. Unlike in conventional CEP systems, the video streams generated
by devices cannot be matched against patterns directly. Instead, they are first processed by
cascading models, which transform the video frames into events, i.e., object class, location and
object attributes.
   Given a complex event pattern, how can we decompose it and solve the components, e.g.,
identifying color or detecting objects? How many models can be applied to address the issue? Do
the state-of-the-art models fulfill the need? If not, how can we extend the model search space?
   Optimizing inference. To achieve high efficiency and effectiveness, an optimization mechanism
is needed to automatically assign fast models for a given query and accuracy target. Is there an
optimal solution for each query that achieves low latency and high accuracy? Is there a method
that can represent the model selection process? Can we further optimize the inference by pushing
more computation to the edge, closer to the data source?

4.   A CEP LANGUAGE FOR VIDEOS
   As discussed previously, the conventional CEP systems lack support for video data, while video
query languages do
not cater for advanced CEP needs. By leveraging both, it is feasible to design a CEP language for
videos. We will showcase the operators that are available in terms of the query types described
above.
   The most basic operator is that of selection [5]. This operator applies to events and filters
out those that do not satisfy the predicates. Projection is another operator that belongs to
single-item operators, together with selection. This type of operator transforms the attribute
values of an event, and thus can be applied to generate video events by transforming the video
frames into streams of events that carry a set of attributes, e.g., timestamp, object class and
color.
   For Q3, which concerns temporal relations, the time factor plays a pivotal role. Existing video
query languages [8, 17, 10] measure time by counting frames and converting the count using the
rate of incoming frames, measured in frames per second. Such an approach is limited when answering
requests such as reporting on a regular basis. Our notion of the window operator, incorporated
from stream processing, assigns windows to events with respect to different notions of time chosen
by application developers. One important notion of time supported is event time, which signifies
the time a frame was generated at the source device. This provides us with a powerful abstraction
for matching temporal relationships. Notably, different types of windows can be expressed, namely
tumbling windows, sliding windows, session windows and global windows [1]. The high expressiveness
of window operators, hence, enables timely reporting and flexible manipulation of queries. Also,
it is worth noting that iteration, an important feature for defining the repeated occurrences of
a matching event, is not covered in previous works [13, 12, 8, 17, 10]. In the proposal, we will
extend the temporal patterns that have been seen in previous works by incorporating the rich
definition of window operators and introducing iteration to define sequences of events.
   Detecting the spatial relation between objects is an important task in video retrieval, which
is different from conventional stream processing, where the events do not reveal any spatial
relationship. New operators should be defined to detect such relations on video content.
State-of-the-art object detectors provide information that helps us identify the class of objects
as well as their location within a frame, which enables answering Q4 queries.
   In addition, modern streaming systems scale out by running multiple operator instances that
process disjoint parts of the data on shared-nothing commodity hardware. This scalable distributed
architecture is suitable for joining streams from distributed sources or distributing data to
various nodes. It lays the foundation for edge computing, where a task may be divided and
distributed to various computational nodes.

5.   QUERY PLANNING WITH ML MODELS
   In order to run queries in the CEP language, we first need to identify the models that can be
applied to solve the tasks in each query, e.g., detecting color or identifying a specific object.
For each task, there are various models and solutions available. For example, to detect an object,
both object detection and image classification can be applied. The state-of-the-art object
detection models include, but are not limited to, Mask R-CNN [6] and YOLOv2 [14]. On the other
hand, AlexNet [11] and VGG-16 [15] are state-of-the-art deep image classification models. Beyond
the above-mentioned models, techniques in ML, such as model specialization and model compression,
can also be applied to enlarge the model search space. For example, a full and complex algorithm,
e.g., YOLO9000 [14], can classify or detect thousands of classes (tasks). A specialized model, in
contrast, is a smaller model (i.e., with fewer layers and neurons) that mimics the behavior of a
full NN model on a particular task. Sacrificing generality by restricting inference models to
specific tasks can substantially reduce inference cost and latency.
   To compile the query, we will construct a model search space by enumerating all the models
available to solve a particular task. Accordingly, there are multiple options that can be selected
in terms of various outcomes, i.e., accuracy and inference latency. These models differ in
complexity, i.e., layers, number of neurons in dense layers and the ability to generalize.

6.   OPTIMIZING VIDEO CEP QUERIES
   As discussed above, there are various solutions to a task such as detecting an object class,
and no single model outperforms all the others in terms of both accuracy and speed. Moreover, the
performance of the models is data-dependent [9]. Thus, we should optimize the inference models for
each query and consider the trade-off between accuracy and speed when assigning the models.
   Manipulating the trade-off between accuracy and inference throughput can be regarded as a
multi-objective optimization problem (MOP). Pareto-optimal solutions are used to pre-select
candidate solutions for each task. In this case, the aim is to process the video fast and
accurately, and thus the objectives are accuracy and speed. To give an example, as shown in Figure
1a, every symbol represents a specific model. The models vary in shape, size and set of tasks, but
fulfill the same goal, e.g., identifying a specific object. The blue line represents the Pareto
frontier. Suppose that f1 is inference time and f2 is accuracy. We expect f1 to be lower and f2 to
be higher. No solution in the search space is superior to a solution on the Pareto frontier in
both objectives. Only the models that lie on the Pareto frontier (such as OD1 and OD2) are
considered for further comparison.

Figure 1: Optimization approaches. (a) Pareto frontier; (b) AND-OR graph.

   After we obtain the sub-optimal models for each task, the next step is to select and assign the
optimal solution. In the previous step, the models are coupled with the outcomes, including the
inference throughput and accuracy. If a user is quality-oriented, then the model with the highest
accuracy is selected; if the user is speed-oriented, the model with the highest throughput is
selected. To represent the solution of the
task, we will apply the AND-OR graph, as shown in Figure         9.    ACKNOWLEDGMENTS
1b. The query is decomposed into a set of smaller problems,        The author would like to thank Cognizant for their sup-
e.g., identify red color and detect a car. The leaves of the     port in this project. Many thanks also go to Marios Fragk-
AND-OR graph represent unique sub-optimal models in the          oulis, Christos Koutras, Agathe Balayn, Andra Ionescu and
Pareto frontier, and the optimal option will be sent to their    Georgios Siachamis for their valuable feedback on this work.
parents for further analysis. The process goes on until the
query is solved.
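   The two-step selection described in this section can be sketched in a few lines of Python.
This is a minimal illustration under stated assumptions, not the implementation of the proposed
system: the class and function names, the latency and accuracy numbers, and the way an AND node
combines its children (latencies add up, accuracies multiply) are all hypothetical. Only the model
labels OD1 and OD2 are borrowed from Figure 1a. The sketch first filters the models of each task
down to the Pareto frontier and then resolves an AND-OR graph bottom-up according to the user
preference.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Model:
    """Leaf of the AND-OR graph: one model with its measured outcomes."""
    name: str
    latency_ms: float  # f1, lower is better
    accuracy: float    # f2, higher is better


@dataclass(frozen=True)
class Or:
    """OR node: any one child solves the task."""
    children: Tuple["Node", ...]


@dataclass(frozen=True)
class And:
    """AND node: every child (sub-task) must be solved."""
    children: Tuple["Node", ...]


Node = Union[Model, Or, And]


def dominates(a: Model, b: Model) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a.latency_ms <= b.latency_ms and a.accuracy >= b.accuracy
    better = a.latency_ms < b.latency_ms or a.accuracy > b.accuracy
    return no_worse and better


def pareto_frontier(models: List[Model]) -> List[Model]:
    """Keep only the models that no other model dominates."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


def plan_latency(plan: List[Model]) -> float:
    # Assumption: the models of a plan run one after another on each frame.
    return sum(m.latency_ms for m in plan)


def plan_accuracy(plan: List[Model]) -> float:
    # Assumption: model errors are independent, so accuracies multiply.
    acc = 1.0
    for m in plan:
        acc *= m.accuracy
    return acc


def solve(node: Node, preference: str) -> List[Model]:
    """Resolve the graph bottom-up into the list of models to run."""
    if isinstance(node, Model):
        return [node]
    plans = [solve(child, preference) for child in node.children]
    if isinstance(node, And):
        return [m for plan in plans for m in plan]
    if preference == "quality":
        return max(plans, key=plan_accuracy)
    if preference == "speed":
        return min(plans, key=plan_latency)
    raise ValueError(f"unknown preference: {preference}")


# Hypothetical measurements for the query "detect a red car".
detectors = [Model("OD1", 30, 0.80), Model("OD2", 120, 0.92), Model("OD3", 150, 0.85)]
color_models = [Model("C1", 5, 0.90), Model("C2", 20, 0.97), Model("C3", 25, 0.88)]

red_car = And((Or(tuple(pareto_frontier(detectors))),
               Or(tuple(pareto_frontier(color_models)))))

quality_plan = [m.name for m in solve(red_car, "quality")]
speed_plan = [m.name for m in solve(red_car, "speed")]
print(quality_plan, speed_plan)
```

   With these made-up numbers, OD3 and C3 are dominated and never reach the graph. A
quality-oriented user is assigned OD2 and C2, and a speed-oriented user is assigned OD1 and C1.
A real optimizer would replace the two simple scoring functions with measured end-to-end
throughput and accuracy on the target data.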
7.   EDGE/FOG COMPUTING
   To process the video streams in real time, it is feasible to reduce the transmission of data
from one edge to another by offloading tasks to the devices closer to the data source. However,
due to the limited computation power available at edge devices, interactions with remote clusters
or public clouds are inevitable. The models that are offloaded to the edge devices may thus
sacrifice accuracy for inference speed, and the complex models deployed remotely should compensate
for it. In this phase, the challenge is to decide which tasks shall be offloaded and what
information will be transmitted given the dynamic circumstances in real-world deployment.
   In the edge computing paradigm, the objectives that need to be considered and optimized are no
longer restricted to inference latency and accuracy, but also include wireless bandwidth,
processing capacity and energy consumption [7], which increases the difficulty of assigning
optimal configurations in real time.

8.   FUTURE PLAN
   Building prototype. As the aim of this PhD project is to provide highly expressive semantics
for users to define queries over real-time video streams, we intend to first develop a prototype
that supports simple queries, e.g., Q1 and Q2. In this phase, the focus lies on the optimization
module, where the trade-off between latency and accuracy is exploited. Other queries will be
implemented thereafter.
   Video decoding. The characteristics of video, e.g., frame resolution and frame sampling rate,
may influence the model performance and should also be considered. For example, a crowded road
scene, compared to a quiet neighborhood, requires a higher frame sampling rate so that no event is
missed, and a higher resolution to retain the details of the content. The video characteristics
will serve as important factors and will be taken into account in the configuration plan.
   Adaptive configuration. Given that the scene captured by a camera changes over time, even for a
static-angle camera, model drift is a main issue that affects the performance of video processing.
Dynamic configuration is complex and challenging if we want to maintain real-time performance,
since dynamic configuration may hinder the inference procedure. In the future, we will investigate
adaptive configurations given the trade-off between the inference speed and the quality of the
results.
   Edge computing. To bring the computation closer to the edge, we will investigate how to apply
an edge computing paradigm to the deployment. In this phase, the design of the edge-cloud
architecture, the measurement of real-time performance, and the decisions on which tasks to
offload and which data to transmit will be important problems going forward.

10.    REFERENCES
 [1] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache Flink:
     Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society
     Technical Committee on Data Engineering, 36(4), 2015.
 [2] G. Cugola and A. Margara. TESLA: a formally defined event specification language. In
     Proceedings of the Fourth ACM International Conference on Distributed Event-Based Systems,
     pages 50-61, 2010.
 [3] G. Cugola and A. Margara. Processing flows of information: From data stream to complex event
     processing. ACM Computing Surveys (CSUR), 44(3):1-62, 2012.
 [4] Y. Diao, N. Immerman, and D. Gyllstrom. SASE+: An agile language for Kleene closure over
     event streams. UMass Technical Report, 2007.
 [5] N. Giatrakos, E. Alevizos, A. Artikis, A. Deligiannakis, and M. Garofalakis. Complex event
     recognition in the big data era: a survey. The VLDB Journal, 29(1):313-352, 2020.
 [6] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE
     International Conference on Computer Vision, pages 2961-2969, 2017.
 [7] C.-C. Hung, G. Ananthanarayanan, P. Bodik, L. Golubchik, M. Yu, P. Bahl, and M. Philipose.
     VideoEdge: Processing camera streams using hierarchical clusters. In 2018 IEEE/ACM Symposium
     on Edge Computing (SEC), pages 115-131. IEEE, 2018.
 [8] D. Kang, P. Bailis, and M. Zaharia. BlazeIt: Optimizing declarative aggregation and limit
     queries for neural network-based video analytics. arXiv preprint arXiv:1805.01046, 2018.
 [9] D. Kang, J. Emmons, F. Abuzaid, P. Bailis, and M. Zaharia. NoScope: optimizing neural network
     queries over video at scale. arXiv preprint arXiv:1703.02529, 2017.
[10] N. Koudas, R. Li, and I. Xarchakos. Video monitoring queries. In 2020 IEEE 36th International
     Conference on Data Engineering (ICDE), pages 1285-1296. IEEE, 2020.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep
     convolutional neural networks. In Advances in Neural Information Processing Systems, pages
     1097-1105, 2012.
[12] T. C. Kuo and A. L. Chen. Content-based query processing for video databases. IEEE
     Transactions on Multimedia, 2(1):1-13, 2000.
[13] C. Lu, M. Liu, and Z. Wu. SVQL: A SQL extended query language for video databases.
     International Journal of Database Theory and Application, 8(3):235-248, 2015.
[14] J. Redmon and A. Farhadi. YOLO9000: better, faster, stronger. In Proceedings of the IEEE
     Conference on Computer Vision and Pattern Recognition, pages 7263-7271, 2017.
[15] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
     recognition. arXiv preprint arXiv:1409.1556, 2014.
[16] I. Xarchakos and N. Koudas. SVQ: Streaming video queries. In Proceedings of the 2019
     International Conference on Management of Data, pages 2013-2016, 2019.
[17] P. Yadav and E. Curry. VidCEP: Complex event processing framework to detect spatiotemporal
     patterns in video streams. In 2019 IEEE International Conference on Big Data (Big Data),
     pages 2513-2522. IEEE, 2019.