<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Complex Event Processing on Real-time Video Streams</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Ziyu Li, supervised by Asterios Katsifodimos, Alessandro Bozzon and Geert-Jan Houben, Delft University of Technology</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>Cameras are ubiquitous nowadays, and video analytics systems are widely used in surveillance, traffic control, business intelligence and autonomous driving. Some applications, e.g., detecting road congestion in traffic monitoring, require continuous and timely reporting of complex patterns. However, conventional complex event processing (CEP) systems fail to support video processing, while existing video query languages offer limited support for expressing advanced CEP queries, such as iteration and windows. In this PhD research, we aim to develop systems and methods to alleviate these issues. In this paper, we first identify the need for an expressive CEP language which allows users to define queries over video streams and receive fast, accurate results. To evaluate CEP queries on videos in real-time and with high accuracy, we explain how a streaming query engine can be designed to provide native support of machine learning (ML) models for fast and accurate inference on video streams. In addition, we describe a set of optimization problems that arise when ML models, with trade-offs in speed, accuracy, and cost, are part of a query plan. Finally, we describe how query plans on real-time videos can be optimized and deployed on edge devices with limited computational and network capabilities.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Cameras installed in buildings, deployed on streets, and fitted on various devices generate vast amounts of video content on a daily basis. The video footage empowers a wide range of important applications such as in-door surveillance, traffic control, business intelligence and autonomous driving [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In real-world scenarios, we are generally interested in detecting complex events and receiving instant alerts, e.g., detecting road congestion in traffic monitoring and alerting to dangerous scenes in autonomous driving. Considering the urgent need for real-time response, it is infeasible to store videos in databases. Ideally, videos would be processed on the spot.
      </p>
      <p>
        Complex Event Processing (CEP) systems provide expressive languages to construct CEP queries for pattern detection over events. The constructs of an event algebra that fulfill the need of defining complex patterns have been identified [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. CEP query languages [
        <xref ref-type="bibr" rid="ref1 ref2 ref4">4, 2, 1</xref>
        ] define the complex events to be detected and correlate them into higher-level, meaningful information over data streams. However, little attention has been paid to content-based event detection on video streams in CEP systems, since conventional CEP systems accept only structured data as input.
      </p>
      <p>
        There have been many proposals for query languages on video [
        <xref ref-type="bibr" rid="ref12 ref13">13, 12</xref>
        ]. These languages provide high-level semantics, usually in an SQL-like format, allowing users to query video content. Compared to CEP languages, these SQL-like languages provide limited support for detecting complex patterns on video content: operations such as iteration and join are missing, and windows are restricted to counting frames, which leads to constrained queries. Table 1 compares current event query languages with existing video query languages. The operators listed in the table combine those commonly used in streaming systems [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and in video retrieval systems. From the table, we identify two research gaps: 1) CEP systems lack support for video data, while 2) multimedia retrieval languages fail to support a well-rounded set of CEP operators. This PhD work aims to bridge these two gaps.
      </p>
      <p>
        Video streams are difficult to process because they expose only low-level features (pixel values). The state-of-the-art object detection algorithm Mask R-CNN [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] runs at 3 frames per second (fps), while real-time video is typically produced at 25-30 fps. Nonetheless, ML techniques such as model specialization and model compression can be applied to expand the model search space [
        <xref ref-type="bibr" rid="ref8 ref9">9, 8</xref>
        ]. Models differ in shape, size, the classes of objects they can identify and, most importantly, in accuracy and latency. Thus, it is essential to optimize these aspects in response to user preference, i.e., quality-oriented or speed-oriented.
      </p>
      <p>The objective of this PhD research is to leverage the expressiveness of streaming languages in order to construct complex event patterns over real-time video streams, and to optimize the process aiming for efficient and sound results. To achieve this goal, in this paper we describe the PhD plan, split into three main research lines. First, we identify motivating query types that cannot be supported adequately by current video query languages. Second, we show how to exploit operators in the context of video processing by leveraging existing streaming operators and state-of-the-art computer vision techniques. Since traditional query optimizers typically do not consider trade-offs of accuracy, speed and cost, we demonstrate the need to construct an ML model search space and to navigate it using a Pareto frontier and an AND-OR graph. Finally, we consider deploying CEP query plans over video data closer to the devices where the data is generated, and outline the challenges of optimizing such plans to execute at the edge, reducing communication and computation costs.</p>
    </sec>
    <sec id="sec-2">
      <title>2. MOTIVATING EXAMPLE QUERIES</title>
      <p>We first identify the use cases and scenarios that users may
be interested in when processing real-time video streams.</p>
      <p>
        Q1-Identifying an object with attributes. This query type detects the occurrence of a specific object class, possibly together with its attributes. As the simplest event pattern, it covers a wide range of applications. For instance, a road surveillance system can monitor whether a car is parked in a no-stopping lane. Likewise, detecting a traffic light along with its color is an essential task in autonomous driving. However, recent works such as [
        <xref ref-type="bibr" rid="ref16 ref17 ref9">9, 16, 17</xref>
        ] either do not optimize the model for different scenarios or do not support the detection of additional attributes.
      </p>
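      <p>As an illustration only (not the proposed query language), the following minimal Python sketch evaluates a Q1-style pattern as a selection predicate over detection events; the event schema (timestamp, object class, attribute dictionary) is an assumption made for this example.</p>
      <preformat>
from dataclasses import dataclass, field

@dataclass
class DetectionEvent:
    timestamp: float                                 # event time in seconds
    object_class: str                                # e.g., "traffic_light", "car"
    attributes: dict = field(default_factory=dict)   # e.g., {"color": "red"}

def q1_predicate(event):
    """Q1: detect a traffic light whose color attribute is red."""
    return (event.object_class == "traffic_light"
            and event.attributes.get("color") == "red")

# Selection over a (toy) stream of detection events.
events = [
    DetectionEvent(0.04, "car"),
    DetectionEvent(0.08, "traffic_light", {"color": "red"}),
]
print([e for e in events if q1_predicate(e)])   # only the red traffic light remains
      </preformat>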
      <p>
        Q2-Identifying multiple classes of objects. What differentiates Q2 from Q1 is the detection of more than one class of object: users are interested in perceiving multiple objects at once. For example, in autonomous driving, the system should be alerted when multiple pedestrians are walking across the road while the light is yellow [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Still, the approaches in [
        <xref ref-type="bibr" rid="ref10 ref17">17, 10</xref>
        ] are deficient in that the model is fixed and the optimization cannot solve the problem of model drift.
      </p>
      <p>
        Q3-Temporal Relation. In this type of query, the time factor is part of the complex event pattern, e.g., detecting a sequence of events within a time interval. This query type distinguishes itself by the high expressiveness of window operators and the ability to detect iterated patterns. A broader notion of windows makes it feasible to analyse the data streams in a timely manner, i.e., to report aggregates over time, which is not offered in previous works [
        <xref ref-type="bibr" rid="ref10 ref12 ref13 ref16 ref8 ref9">9, 10, 8, 16, 12, 13</xref>
        ]. In addition, few existing video query languages mention iteration. This operator plays an important role in defining sequences of events by indicating that an event occurs multiple times in succession.
      </p>
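      <p>As a minimal sketch of the iteration idea (again over an assumed per-frame event format, not the proposed operators), the following Python snippet reports a match when the same object class is detected in at least k consecutive frames.</p>
      <preformat>
def iterated(frames, object_class, k):
    """True if object_class is detected in at least k consecutive frames.

    frames: per-frame sets of detected class labels (assumed input format).
    """
    run = 0
    for detected in frames:
        run = run + 1 if object_class in detected else 0
        if run >= k:
            return True
    return False

# A pedestrian detected in three consecutive frames triggers the pattern.
frames = [{"car"}, {"car", "pedestrian"}, {"pedestrian"}, {"pedestrian"}]
print(iterated(frames, "pedestrian", k=3))   # True
      </preformat>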
      <p>
        Q4-Spatial Relation. This query type perceives the spatial relation between objects, e.g., detecting a car to the right of a lamp post. Queries of this type can be applied to identify human-object interaction [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and serve as fundamental support for autonomous driving. Combined with Q3 or other query types, it becomes feasible to construct more expressive queries, e.g., detecting human behaviors or tracking an object over time.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. RESEARCH OVERVIEW</title>
      <p>In this section, we would like to highlight the challenges
of CEP over real-time video streams.</p>
      <p>To process video streams in a CEP system, the challenges are reflected in three aspects: query language, query planning with ML models, and query optimization for inference. These are outlined below and explained in the rest of the paper.</p>
      <p>Defining a CEP language for video processing. In CEP systems, a user can define specific queries using a CEP language or streaming operators to detect occurrences of particular patterns of (low-level) events. The event streams are then fed into the user-defined patterns; if all the conditions are met, the user receives an outcome or an alarm.</p>
      <p>This raises several questions: How can a complex event pattern be defined over video streams given the available operators? Are any additional operators required to express the content of videos?</p>
      <p>Query planning with ML models. Unlike in conventional CEP systems, the video streams generated by devices cannot be matched against patterns directly; they are first processed by cascading models that transform video frames into events, i.e., object class, location, and object attributes.</p>
      <p>Given a complex event pattern, how can we decompose it and solve its components, i.e., identifying a color or detecting an object? How many models can be applied to address each component? Do the state-of-the-art models fulfill the need? If not, how can we extend the model search space?</p>
      <p>Optimizing inference. To achieve high efficiency and effectiveness, an optimization mechanism is needed to automatically assign fast models for a given query and accuracy target. Is there an optimal solution for each query that achieves low latency and high accuracy? Is there a method that can represent the model selection process? Can we further optimize inference by pushing more computation to the edge, closer to the data source?</p>
    </sec>
    <sec id="sec-4">
      <title>4. A CEP LANGUAGE FOR VIDEOS</title>
      <p>As discussed previously, conventional CEP systems lack support for video data, while video query languages do not cater for advanced CEP needs. Leveraging both, it is feasible to design a CEP language for videos. Below, we showcase the operators that are available for the query types described above.</p>
      <p>
        The most basic operator is that of selection [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This operator applies to events and filters out those that do not satisfy its predicates. Projection is another single-item operator, together with selection. It transforms the attribute values of an event, and can thus be applied to generate video events by transforming video frames into streams of events carrying a set of attributes, i.e., timestamp, object class and color.
      </p>
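      <p>To make the projection step concrete, the following Python sketch (assuming a simple detector output format, purely for illustration) projects raw per-frame detections into a stream of attribute-carrying events and then applies a selection on top of it.</p>
      <preformat>
# Assumed raw detector output per frame: (timestamp, [(class_label, color), ...]).
frames = [(0.00, [("car", "red"), ("bus", "blue")]),
          (0.04, [("car", "white")])]

def project(frames):
    """Projection: turn raw frame detections into a flat stream of events."""
    for timestamp, detections in frames:
        for object_class, color in detections:
            yield {"timestamp": timestamp,
                   "object_class": object_class,
                   "color": color}

def select(events, predicate):
    """Selection: keep only the events that satisfy the predicate."""
    return (e for e in events if predicate(e))

red_cars = select(project(frames),
                  lambda e: e["object_class"] == "car" and e["color"] == "red")
print(list(red_cars))   # [{'timestamp': 0.0, 'object_class': 'car', 'color': 'red'}]
      </preformat>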
      <p>
        In terms of Q3, i.e., manipulating the temporal relation, the time factor plays a pivotal role. Existing video query languages [
        <xref ref-type="bibr" rid="ref10 ref17 ref8">8, 17, 10</xref>
        ] measure time by counting frames and converting using the rate of incoming frames measured in frames per second. Such an approach is limited and constrained in the requests it can answer, e.g., reporting on a regular basis. Our notion of the window operator, incorporated from stream processing, assigns windows to events with respect to different notions of time chosen by application developers. One important notion of time supported is event time, which signifies the time a frame was generated at the source device. This provides us with a powerful abstraction for matching temporal relationships. Notably, different types of windows can be expressed, namely tumbling windows, sliding windows, session windows and global windows [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The high expressiveness of window operators hence enables timely reporting and flexible manipulation of queries. It is also worth noting that iteration, an important feature for defining the repeated occurrence of a matched event, is not covered in previous works [
        <xref ref-type="bibr" rid="ref10 ref12 ref13 ref17 ref8">13, 12, 8, 17, 10</xref>
        ]. In this proposal, we will extend the temporal patterns seen in previous works by incorporating the rich definition of window operators and introducing iteration to define sequences of events.
      </p>
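      <p>A minimal sketch of such event-time windowing is shown below in Python; the event schema and the window size are assumptions made for illustration, not part of the proposed language.</p>
      <preformat>
from collections import defaultdict

def cars_per_tumbling_window(events, window_size_s):
    """Assign events to tumbling event-time windows and count cars per window.

    events: dicts with an event-time 'timestamp' (seconds) and an 'object_class'
    field; this schema is assumed for the illustration.
    """
    counts = defaultdict(int)
    for event in events:
        window_start = (event["timestamp"] // window_size_s) * window_size_s
        if event["object_class"] == "car":
            counts[window_start] += 1
    return dict(sorted(counts.items()))

events = [{"timestamp": 0.5, "object_class": "car"},
          {"timestamp": 3.2, "object_class": "car"},
          {"timestamp": 7.9, "object_class": "bus"},
          {"timestamp": 11.0, "object_class": "car"}]
print(cars_per_tumbling_window(events, window_size_s=5.0))   # {0.0: 2, 10.0: 1}
      </preformat>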
      <p>Detecting the spatial relation between objects is an important task in video retrieval, and it differs from conventional stream processing, where events do not reveal any spatial relationship. New operators should be defined to detect such relations on video content. State-of-the-art object detectors provide information that helps us identify the class of each object as well as its location within a frame, which makes it possible to answer queries of type Q4.</p>
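      <p>One way such a spatial operator could be evaluated is sketched below, under the assumption that the detector returns axis-aligned bounding boxes as (x_min, y_min, x_max, y_max) in pixel coordinates.</p>
      <preformat>
# Bounding boxes as (x_min, y_min, x_max, y_max) in pixel coordinates,
# with x growing to the right; this convention is an assumption of the sketch.

def right_of(a, b):
    """True if box a lies entirely to the right of box b."""
    return a[0] > b[2]

def vertically_overlapping(a, b):
    """True if the two boxes overlap on the vertical axis."""
    return b[3] > a[1] and a[3] > b[1]

# Q4 example: a car detected to the right of a lamp post in the same frame.
car = (400.0, 220.0, 560.0, 330.0)
lamp_post = (100.0, 50.0, 130.0, 400.0)
print(right_of(car, lamp_post) and vertically_overlapping(car, lamp_post))   # True
      </preformat>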
      <p>In addition, modern streaming systems scale out by running multiple operator instances that process disjoint parts of the data on shared-nothing commodity hardware. This scalable, distributed architecture is suitable for joining streams from distributed sources or distributing data to various nodes. It lays the foundation for edge computing, where a task may be divided and distributed across various computational nodes.</p>
    </sec>
    <sec id="sec-5">
      <title>5. QUERY PLANNING WITH ML MODELS</title>
      <p>
        In order to run queries in the CEP language, we first need to identify the models that can be applied to solve the tasks in each query, e.g., detecting a color or identifying a specific object. For each task, various models and solutions are available. For example, to detect an object, both object detection and image classification can be applied. State-of-the-art object detection models include, but are not limited to, Mask R-CNN [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and YOLOv2 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], while AlexNet [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and VGG-16 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] are state-of-the-art deep image classification models. Beyond these models, ML techniques such as model specialization and model compression can also be applied to enlarge the model search space. For example, a full, complex model, e.g., YOLO9000 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], can classify or detect thousands of classes (tasks), whereas a specialized model is a smaller model (i.e., with fewer layers and neurons) that mimics the behavior of a full NN model on a particular task. Sacrificing the generality of inference models on restricted tasks can substantially reduce inference cost and latency.
      </p>
      <p>To compile a query, we will construct a model search space by enumerating all the models available to solve a particular task. Accordingly, there are multiple options that can be selected, with different outcomes in terms of accuracy and inference latency. These models differ in complexity, i.e., number of layers, number of neurons in dense layers, and the ability to generalize.</p>
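      <p>A minimal sketch of such a search space is given below; the model names, accuracy values and latencies are placeholders rather than measurements.</p>
      <preformat>
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelSpec:
    name: str
    task: str           # sub-task the model can solve, e.g., "detect_car"
    accuracy: float     # e.g., accuracy or mAP; placeholder values below
    latency_ms: float   # mean inference latency per frame; placeholder values

# Search space grouped by task; all entries are illustrative placeholders.
SEARCH_SPACE = {
    "detect_car": [
        ModelSpec("full_detector", "detect_car", accuracy=0.95, latency_ms=300.0),
        ModelSpec("specialized_car_nn", "detect_car", accuracy=0.90, latency_ms=15.0),
        ModelSpec("compressed_detector", "detect_car", accuracy=0.88, latency_ms=40.0),
    ],
    "identify_red": [
        ModelSpec("color_histogram", "identify_red", accuracy=0.85, latency_ms=1.0),
        ModelSpec("small_color_cnn", "identify_red", accuracy=0.93, latency_ms=8.0),
    ],
}

def candidates(task):
    """All models in the search space that can solve the given task."""
    return SEARCH_SPACE.get(task, [])

print([m.name for m in candidates("detect_car")])
      </preformat>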
    </sec>
    <sec id="sec-6">
      <title>6. OPTIMIZING VIDEO CEP QUERIES</title>
      <p>
        As discussed above, there are various solutions for solving a task, e.g., detecting an object class, and no single model outperforms all the others in terms of both accuracy and speed. Moreover, the performance of the models is data-dependent [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Thus, we should optimize the choice of inference models for each query and consider the trade-off between accuracy and speed when assigning the models.
      </p>
      <p>Managing the trade-off between accuracy and inference throughput can be regarded as a multi-objective optimization problem (MOP). Pareto-optimal solutions are used to pre-select candidate solutions for each task. In this case, the aim is to process the video fast and accurately, and thus the objectives are accuracy and speed. As an example, in Figure 1a every symbol represents a specific model, varying in shape, size and set of tasks, but fulfilling the same goal, i.e., identifying a specific object. The blue line represents the Pareto frontier. Suppose that f1 is inference time and f2 is accuracy; we expect f1 to be lower and f2 to be higher. On the Pareto frontier, no solution in the search space is superior to another on the frontier in terms of both objectives. Only the models that lie on the Pareto frontier (such as OD1 and OD2) are considered for further comparison.</p>
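      <p>The pre-selection step can be sketched in a few lines of Python, shown below; the candidate models and their numbers are placeholders used only to illustrate the dominance check.</p>
      <preformat>
def pareto_frontier(models):
    """Keep the models that are not dominated by any other model.

    models: list of (name, latency_ms, accuracy) tuples. A model is dominated
    if some other model is at least as fast and at least as accurate, and
    strictly better in at least one of the two objectives.
    """
    frontier = []
    for name, lat, acc in models:
        dominated = any(
            (lat >= l2 and a2 >= acc) and (lat > l2 or a2 > acc)
            for n2, l2, a2 in models if n2 != name
        )
        if not dominated:
            frontier.append((name, lat, acc))
    return frontier

# Placeholder candidates for a single task (e.g., object detection).
candidates = [
    ("OD1", 15.0, 0.90),    # fast and fairly accurate
    ("OD2", 300.0, 0.95),   # slow but the most accurate
    ("OD3", 320.0, 0.88),   # dominated by both OD1 and OD2
]
print(pareto_frontier(candidates))   # [('OD1', 15.0, 0.9), ('OD2', 300.0, 0.95)]
      </preformat>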
      <p>Figure 1: (a) Pareto frontier; (b) AND-OR graph.</p>
      <p>After we obtain the Pareto-optimal candidate models for each task, the next step is to select and assign the final solution. In the previous step, the models were coupled with their outcomes, namely inference throughput and accuracy. If a user is quality-oriented, the model with the highest accuracy is selected, and vice versa. To represent the solution of the task, we will apply an AND-OR graph, as shown in Figure 1b. The query is decomposed into a set of smaller problems, e.g., identifying red color and detecting a car. The leaves of the AND-OR graph represent the candidate models on the Pareto frontier, and the optimal option is sent to their parents for further analysis. The process continues until the query is solved.</p>
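      <p>The selection step can be sketched as follows, using the same kind of placeholder candidates: each sub-task picks one model from its Pareto-optimal set according to the user preference, and an AND node combines the choices of its children.</p>
      <preformat>
def pick(candidates, preference):
    """Select one model from a task's Pareto-optimal candidates.

    candidates: list of (name, latency_ms, accuracy) tuples.
    preference: "quality" picks the most accurate model, "speed" the fastest.
    """
    if preference == "quality":
        return max(candidates, key=lambda m: m[2])
    return min(candidates, key=lambda m: m[1])

def solve_and_node(subtasks, preference):
    """AND node: every sub-task must be solved, so pick one model per child."""
    return {task: pick(cands, preference) for task, cands in subtasks.items()}

# Placeholder Pareto-optimal candidates for the sub-tasks of the query
# "a red car", i.e., identify red color AND detect a car.
subtasks = {
    "identify_red": [("color_histogram", 1.0, 0.85), ("small_color_cnn", 8.0, 0.93)],
    "detect_car": [("specialized_car_nn", 15.0, 0.90), ("full_detector", 300.0, 0.95)],
}
print(solve_and_node(subtasks, preference="speed"))
# {'identify_red': ('color_histogram', 1.0, 0.85),
#  'detect_car': ('specialized_car_nn', 15.0, 0.90)}
      </preformat>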
    </sec>
    <sec id="sec-7">
      <title>7. EDGE/FOG COMPUTING</title>
      <p>To process video streams in real-time, it is beneficial to reduce the transmission of data from one site to another by offloading tasks to devices closer to the data source. However, due to the limited computational power available at edge devices, interactions with remote clusters or public clouds are inevitable. The models that are offloaded to the edge devices may thus sacrifice inference accuracy, and the complex models deployed remotely should compensate for it. In this phase, the challenge is to decide which tasks shall be offloaded and what information will be transmitted, given the dynamic circumstances of real-world deployments.</p>
      <p>
        In the edge computing paradigm, the objectives that need to be considered and optimized are no longer restricted to inference latency and accuracy, but also include wireless bandwidth, processing capacity and energy consumption [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which increases the difficulty of assigning optimal configurations in real-time.
      </p>
    </sec>
    <sec id="sec-8">
      <title>8. FUTURE PLAN</title>
      <p>Building a prototype. As the aim of this PhD project is to provide highly expressive semantics for users to define queries over real-time video streams, we intend to first develop a prototype that supports simple queries, e.g., Q1 and Q2. In this phase, the focus lies on the optimization module, where the trade-off between latency and accuracy is explored. Other query types will be implemented thereafter.</p>
      <p>Video decoding. The characteristics of a video, i.e., frame resolution and frame sampling rate, may influence model performance and should also be considered. For example, a crowded road scene, compared to a quiet neighborhood, requires a more frequent frame sampling rate so that no event is missed, and a higher resolution to retain the details of the content. These video characteristics will serve as important factors and will be taken into account in the configuration plan.</p>
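      <p>As a small illustration of one such configuration knob (the frame rates below are assumptions, not recommendations), a frame-sampling stride can be derived from the source frame rate and a scene-dependent target rate.</p>
      <preformat>
def sampling_stride(source_fps, target_fps):
    """Process every n-th frame so the effective rate approximates target_fps."""
    if target_fps >= source_fps:
        return 1   # process every frame
    return max(1, round(source_fps / target_fps))

# Assumed, scene-dependent target rates for illustration only.
print(sampling_stride(source_fps=30, target_fps=10))   # busy scene: stride 3
print(sampling_stride(source_fps=30, target_fps=2))    # quiet scene: stride 15
      </preformat>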
      <p>Adaptive configuration. Given that the scene captured by a camera changes over time, even for a static-angle camera, model drift is a main issue that affects the performance of video processing. Dynamic configuration is complex and challenging if we want to maintain real-time performance, since reconfiguration may hinder the inference procedure. In the future, we will investigate adaptive configurations given the trade-off between inference speed and the quality of the results.</p>
      <p>Edge computing. To bring computation closer to the edge, we will investigate how to apply an edge computing paradigm in the deployment. In this phase, the design of the edge-cloud architecture, the measurement of real-time performance, and the decision of which tasks to offload and which data to transmit will be important problems going forward.</p>
      <p>The author would like to thank Cognizant for their support of this project, and Marios Fragkoulis, Christos Koutras, Agathe Balayn, Andra Ionescu and Georgios Siachamis for their valuable feedback on this work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Carbone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katsifodimos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ewen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Markl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Haridi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Tzoumas</surname>
          </string-name>
          .
          <article-title>Apache Flink: Stream and batch processing in a single engine</article-title>
          .
          <source>Bulletin of the IEEE Computer Society Technical Committee on Data Engineering</source>
          ,
          <volume>36</volume>
          (
          <issue>4</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cugola</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Margara</surname>
          </string-name>
          .
          <article-title>TESLA: a formally defined event specification language</article-title>
          .
          <source>In Proceedings of the Fourth ACM International Conference on Distributed Event-Based Systems</source>
          , pages
          <fpage>50</fpage>
          -
          <fpage>61</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cugola</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Margara</surname>
          </string-name>
          .
          <article-title>Processing flows of information: From data stream to complex event processing</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          ,
          <volume>44</volume>
          (
          <issue>3</issue>
          ):1-
          <fpage>62</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Diao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Immerman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Gyllstrom</surname>
          </string-name>
          . SASE+:
          <article-title>An agile language for Kleene closure over event streams</article-title>
          .
          <source>UMass Technical Report</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Giatrakos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Alevizos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Artikis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Deligiannakis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Garofalakis</surname>
          </string-name>
          .
          <article-title>Complex event recognition in the big data era: a survey</article-title>
          .
          <source>The VLDB Journal</source>
          ,
          <volume>29</volume>
          (
          <issue>1</issue>
          ):
          <volume>313</volume>
          -
          <fpage>352</fpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          , G. Gkioxari,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          .
          <article-title>Mask R-CNN</article-title>
          .
          <source>In Proceedings of the IEEE international conference on computer vision</source>
          , pages
          <volume>2961</volume>
          -
          <fpage>2969</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>C.-C. Hung</surname>
            , G. Ananthanarayanan,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Bodik</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Golubchik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Bahl</surname>
            , and
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Philipose</surname>
          </string-name>
          . VideoEdge:
          <article-title>Processing camera streams using hierarchical clusters</article-title>
          .
          <source>In 2018 IEEE/ACM Symposium on Edge Computing (SEC)</source>
          , pages
          <fpage>115</fpage>
          -
          <fpage>131</fpage>
          . IEEE,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bailis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          . BlazeIt:
          <article-title>Optimizing declarative aggregation and limit queries for neural network-based video analytics</article-title>
          .
          <source>arXiv preprint arXiv:1805.01046</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Emmons</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Abuzaid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bailis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          .
          <article-title>NoScope: optimizing neural network queries over video at scale</article-title>
          .
          <source>arXiv preprint arXiv:1703.02529</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Koudas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>and I. Xarchakos.</surname>
          </string-name>
          <article-title>Video monitoring queries</article-title>
          .
          <source>In 2020 IEEE 36th International Conference on Data Engineering (ICDE)</source>
          , pages
          <fpage>1285</fpage>
          -
          <fpage>1296</fpage>
          . IEEE,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <volume>1097</volume>
          -
          <fpage>1105</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Kuo</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Content-based query processing for video databases</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):1-
          <fpage>13</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lu</surname>
          </string-name>
          , M. Liu, and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>SVQL: A SQL extended query language for video databases</article-title>
          .
          <source>International Journal of Database Theory and Application</source>
          ,
          <volume>8</volume>
          (
          <issue>3</issue>
          ):
          <volume>235</volume>
          -
          <fpage>248</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          and
          <string-name>
            <surname>A. Farhadi.</surname>
          </string-name>
          <article-title>YOLO9000: Better, faster, stronger</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          , pages
          <volume>7263</volume>
          -
          <fpage>7271</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>I.</given-names>
            <surname>Xarchakos</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Koudas</surname>
          </string-name>
          . SVQ:
          <article-title>Streaming video queries</article-title>
          .
          <source>In Proceedings of the 2019 International Conference on Management of Data</source>
          , pages
          <year>2013</year>
          -
          <year>2016</year>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yadav</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Curry</surname>
          </string-name>
          . VidCEP:
          <article-title>Complex event processing framework to detect spatiotemporal patterns in video streams</article-title>
          .
          <source>In 2019 IEEE International Conference on Big Data (Big Data)</source>
          , pages
          <fpage>2513</fpage>
          -
          <fpage>2522</fpage>
          . IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>