<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extreme-Scale Interactive Cross-Platform Streaming Analytics - The INFORE Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonios Deligiannakis</string-name>
          <email>adeli@athenarc.gr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikos Giatrakos</string-name>
          <email>ngiatrakos@athenarc.gr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yannis Kotidis</string-name>
          <email>kotidis@aueb.gr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasilis Samoladas</string-name>
          <email>vsam@softnet.tuc.gr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alkis Simitsis</string-name>
          <email>alkis@athenarc.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Athena Research Center</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Athena Research Center, Technical University of Crete</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
<institution>Athens University of Economics and Business</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Nowadays, analytics are at the core of many emerging applications and can greatly benefit from the abundance of data and the progress in the processing capabilities of modern hardware. Still, new challenges arise with the extreme complexity of designing applications that fully exploit the potential of such offerings. In this work, we describe the main challenges in delivering advanced, cross-platform, streaming analytics at scale with the primary goal of interactivity. We present the architecture of a prototype system designed from scratch to tackle such challenges and we describe the internals of its components. Finally, we discuss key added-value synergies upon serving target analytics workflows in real-world applications and present a preliminary performance evaluation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1 INTRODUCTION</title>
      <p>
        Interactive analytics at scale lie at the core of many diverse
applications. In life sciences, studying the effect of drug applications
on simulated tumors may produce data streams of 100Gb/min [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
These streams need to get analyzed online to interactively
determine successive drug combinations. In the financial domain, NYSE
alone generates several terabytes of stock data per day [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Interactive analytics enable timely reaction to investment opportunities
or risks. In maritime surveillance, one needs to fuse heterogeneous,
voluminous data composed of position streams of thousands of
vessels, satellite images, thermal camera or acoustic signal streams
of on-site devices to investigate ongoing illegal activities at sea [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>Relevant data not only originate from a variety of heterogeneous
sources but also often arrive at multiple, geo-dispersed clusters.
In life sciences, several efforts aim at federating analytics across
data centers1. Similarly, discovering dependencies or correlations
among economies or industries involves global analysis of stock
streams originating from various local market data centers. In
maritime data analysis, streams come from a number of ground- or
satellite-based receivers before being ingested in proximate data
centers. Additionally, Big Data technologies are significantly
fragmented. Each of these scenarios involves delivering analytics over
networked clusters, each hosting a variety of Big Data platforms.</p>
      <p>Copyright © 2021 for the individual papers by the papers’ authors. Copyright © 2021
for the volume as a collection by its editors. This volume and its papers are published
under the Creative Commons License Attribution 4.0 International (CC BY 4.0).
Published in the Proceedings of the 2nd Workshop on Search, Exploration, and
Analysis in Heterogeneous Datastores, co-located with VLDB 2021 (August 16-20, 2021,
Copenhagen, Denmark) on CEUR-WS.org.</p>
      <p>Typically, such scenarios are designed as streaming analytics
workflows. Our goal is to provide a solution for executing global,
cross-platform, streaming analytics workflows over a network of
computer clusters hosting diverse Big Data technologies, with high
interactivity, i.e., continuously deriving the output of the
workflow in real time and adapting, at runtime, to ad-hoc changes required by
observed results or system performance metrics.</p>
      <p>To this end, streaming Big Data platforms such as Apache
Spark, Flink, Kafka2 approach the problem by ensuring horizontal
scalability, i.e., by scaling out (parallelizing) the computation to a
number of Virtual Machines (VMs) and providing APIs to program
distributed Big Data processing pipelines. These platforms are not
designed to support cross-platform execution scenarios. They also
provide no, or only suboptimal, support for online Machine Learning
(ML) or Data Mining (DM) operators, since the major ML APIs
they provide do not focus on streaming data. Frameworks such as
Apache Beam or Apache SAMOA3 allow for creating code that is
executable in different, supported Big Data platforms. Nevertheless,
this does not remedy the lack of resource management in
multi-cluster and cross-platform settings.</p>
      <p>
        Related systems such as Rheem [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], IReS [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], BigDawg [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
Musketeer [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] are designed towards cross-platform execution of global
workflows, but they can only handle batch, instead of stream,
processing pipelines. Even in systems like BigDawg or Rheem, which
support stream processing in S-Store [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and JavaStreams
respectively, cross-streaming platform execution remains unsupported.
      </p>
      <p>
        Our work aims at filling the gap in efficiently handling advanced,
cross-platform, interactive analytics at scale. We describe the
architecture of an open-source system, namely INFORE4, designed from
scratch for this purpose. A recent INFORE demo [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] (Best Demo
Award at CIKM 2020) showcases how code-free, cross-platform
analytics are feasible through user-friendly, graphical design and
execution of streaming workflows. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] presents the frontend, while
in the current paper we focus on the backend. We detail its key
features, three of its core components, and inter-component
synergies in real-world applications. We also present a preliminary
performance evaluation.
1https://elixir-europe.org/about-us, https://commonfund.nih.gov/bd2k/, https://cyverse.org/
2https://spark.apache.org/, https://flink.apache.org/, https://kafka.apache.org/
3https://beam.apache.org/, https://samoa.incubator.apache.org/
4https://bitbucket.org/infore_research_project/
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 OVERVIEW OF THE APPROACH</title>
      <sec id="sec-2-1">
        <title>Requirements and Key Features</title>
        <p>Based on our interaction with
real-world practitioners and applications, early in INFORE’s design,
we identified three scalability requirements that should complement
the interactivity and cross-streaming platform execution features.</p>
        <p>Enhanced horizontal scalability. Parallelization alone is not adequate
to handle extreme-scale scenarios. The number of available VMs
in private clouds has a limit when it comes to extreme scale, while
public clouds incur monetary charges.</p>
        <p>Vertical scalability. Need to scale the computation with the
number of processed streams. For instance, computing the correlation
matrix of stock data streams results in a quadratic explosion in space
and computational complexity, which is infeasible for thousands or
millions of streams in our motivating scenarios.</p>
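        <p>To make the quadratic explosion concrete, a full correlation matrix over n streams needs n(n-1)/2 pairwise computations per window; the tiny helper below (illustrative only) shows how quickly this grows:</p>

```java
// Pairwise-correlation cost: a full correlation matrix over n streams
// requires n(n-1)/2 pair computations per window, growing quadratically.
public class PairCount {
    static long pairs(long n) { return n * (n - 1) / 2; }

    public static void main(String[] args) {
        System.out.println(pairs(1_000));     // 499500
        System.out.println(pairs(1_000_000)); // 499999500000
    }
}
```

At a million streams, roughly half a trillion pair computations per window are needed, which is exactly why pruning (discussed later via DFT/LSH bucketization) is essential.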
        <p>
          Federated scalability. Need to scale the computation in settings
where data arrive at multiple, geo-dispersed sites. In such settings,
even when we do everything right within each cluster, the potential
for interactivity is network bound [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Overview of Key Components</title>
        <p>INFORE employs three core
components for realizing the aforementioned key requirements.</p>
        <p>
          Synopses Data Engine (SDE). There is a wide consensus in stream
processing [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] that approximate but rapid answers to analytics tasks,
more often than not, suffice. Synopses, such as samples, sketches,
histograms [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], provide approximate answers, with accuracy
guarantees, to popular (cardinality, frequency moment, correlation, set
membership, quantile) analytic operators. The maintenance and
use of data synopses is essential to stream processing applications
because applications get only one look at the data. Synopses are
used as an approximate version of the infinite data which, due to its
small size, can be processed and iterated upon without the need of
reading the entire, potentially infinite, data stream multiple times.
        </p>
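        <p>As a concrete instance of this one-pass, small-footprint principle, the sketch below maintains approximate stream frequencies in fixed memory. It is a minimal, self-contained illustration of a Count-Min sketch, not the SDE Library's implementation:</p>

```java
import java.util.Random;

// Minimal Count-Min sketch: one-pass frequency estimation over a stream
// using O(depth * width) memory, independent of the stream's length.
public class CountMin {
    private final int d, w;          // depth (hash rows) and width (buckets)
    private final long[][] counts;
    private final int[] seeds;

    public CountMin(int depth, int width) {
        d = depth; w = width;
        counts = new long[d][w];
        seeds = new int[d];
        Random r = new Random(42);   // fixed seed: deterministic illustration
        for (int i = 0; i < d; i++) seeds[i] = r.nextInt();
    }

    private int bucket(int row, Object key) {
        int h = key.hashCode() ^ seeds[row];
        h ^= (h >>> 16);             // cheap bit mixing
        return Math.floorMod(h, w);
    }

    // add: streaming update, O(d) work per tuple
    public void add(Object key) {
        for (int i = 0; i < d; i++) counts[i][bucket(i, key)]++;
    }

    // estimate: minimum over rows; never underestimates the true frequency
    public long estimate(Object key) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < d; i++) min = Math.min(min, counts[i][bucket(i, key)]);
        return min;
    }
}
```

The structure never underestimates a key's frequency, and its overestimation error shrinks as the width grows, which is the accuracy-vs-memory trade-off mentioned above.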
        <p>
          The SDE builds and maintains such synopses across platforms
and clusters. It achieves enhanced horizontal scalability because
synopses shed the load assigned to each VM in a cluster and reduce
memory usage. For federated scalability, it reduces communication
by transmitting compact data summaries. For interactivity, we add
a unique SDE-as-a-Service (SDEaaS) design where the SDE is a
constantly running service (a job in one or more clusters) that can
accept, at runtime, any query, synopsis creation on new streams, or
plugged-in code for new synopses, with zero downtime for running
workflows. For vertical scalability, the SDEaaS design maintains
thousands of synopses for thousands of streams, in settings where
prior efforts [
          <xref ref-type="bibr" rid="ref15 ref16 ref18 ref19">15, 16, 18, 19</xref>
          ] can maintain only a few tens of synopses.
We henceforth use the terms SDE and SDEaaS interchangeably.
        </p>
        <p>
          Online Machine Learning and Data Mining (OMLDM). OMLDM
applies distributed, online ML and DM adopting a Parameter Server
(PS) paradigm [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] (Figure 1(e)). The PS paradigm enhances
horizontal and federated scalability via the option of an asynchronous
(besides synchronous) synchronization policy to reduce the effect of
stragglers and bandwidth consumption [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], respectively.
Regarding interactivity, the OMLDM component allows users to
interactively adjust the number of learners and to deploy the current,
up-to-date ML/DM model at the PS side upon concept drifts, while
the training process in the parallel learners proceeds.
Established synergies with the SDE and the Optimizer components
also boost vertical and federated scalability.
        </p>
        <p>Optimizer Component. The Optimizer is a prerequisite for
effectively using the SDE and OMLDM Components towards
interactivity. The Optimizer picks (a) the networked cluster, (b) the Big
Data platforms available in that cluster, and (c) the provisioned
resources for each SDE, OMLDM (or other) operator of the global
workflow. A good solution is determined by taking into account
performance criteria and cost models related to interactivity, involving
throughput, (network and processing) latency, memory usage, etc.
These components cooperate as follows (more in Section 4).</p>
        <p>Synopses-based Optimization. Given a workflow, the Optimizer
produces alternative physical execution plans where it replaces
exact (e.g., count, window, correlation) operators with equivalent,
but approximate ones from the SDE. It then provides suggestions
for equivalent, but approximate, workflows accompanied by the
expected interactivity (execution speed-up) vs. accuracy trade-off.</p>
        <p>
          Boosting Vertical Scalability. The SDE Component boosts vertical
scalability of the OMLDM Component in ways that are not
possible otherwise. Indicatively, we use Discrete Fourier Transform
(DFT) and Locality Sensitive Hashing (LSH) bitmaps [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] to
bucketize (hash) streams during correlation matrix computation, outlier
detection, clustering or feature extraction. Vertical scalability is
achieved by hashing streams to learners based on their DFT or LSH
coefficients. The complexity of all these operators is reduced since
computations are pruned for streams that do not end up nearby in
the hashing space.
        </p>
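        <p>The DFT-based pruning idea can be sketched as follows: for z-normalized series, squared Euclidean distance equals 2n(1 - corr), and since the DFT is orthonormal, the distance on the first few coefficients lower-bounds the full distance, so far-apart coefficient prefixes safely rule out high correlation. This is a simplified illustration of the principle; INFORE's operators [11] differ in detail:</p>

```java
// Safe pruning of correlation candidates via leading DFT coefficients.
public class DftPrune {
    // z-normalize: zero mean, ||x||^2 = n, so distance maps to correlation
    static double[] znorm(double[] x) {
        int n = x.length;
        double mean = 0, var = 0;
        for (double v : x) mean += v;
        mean /= n;
        for (double v : x) var += (v - mean) * (v - mean);
        double sd = Math.sqrt(var / n);
        double[] z = new double[n];
        for (int t = 0; t < n; t++) z[t] = (x[t] - mean) / sd;
        return z;
    }

    // first k complex DFT coefficients with orthonormal scaling (naive O(nk))
    static double[] dftPrefix(double[] x, int k) {
        int n = x.length;
        double[] c = new double[2 * k];
        for (int f = 0; f < k; f++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double a = -2 * Math.PI * f * t / n;
                re += x[t] * Math.cos(a);
                im += x[t] * Math.sin(a);
            }
            c[2 * f] = re / Math.sqrt(n);
            c[2 * f + 1] = im / Math.sqrt(n);
        }
        return c;
    }

    // keep the pair iff the prefix distance still permits corr >= minCorr:
    // for z-normalized series, dist^2 = 2n(1 - corr)
    static boolean candidate(double[] x, double[] y, int k, double minCorr) {
        double[] cx = dftPrefix(znorm(x), k), cy = dftPrefix(znorm(y), k);
        double d2 = 0;
        for (int i = 0; i < cx.length; i++) d2 += (cx[i] - cy[i]) * (cx[i] - cy[i]);
        return d2 <= 2.0 * x.length * (1 - minCorr);
    }
}
```

Bucketizing streams by these coefficients lets learners skip exactly the pairs this test would discard.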
        <p>
          Pumped Parameter Server. With the help of the Optimizer, we
extend the OMLDM Component with an additional synchronization
policy, besides the synchronous and asynchronous
communication entailed by the classic PS design [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. We incorporate a novel
communication protocol, termed Functional Geometric Monitoring
(FGM) [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. This protocol strengthens horizontal (within a cluster)
and federated scalability by bridging the gap between synchronous
and asynchronous communication. Instead of having learners
communicate in predefined rounds/batches (synchronous) or whenever
each one is updated (asynchronous), FGM requires communication
only when a concept drift is likely to have occurred.
        </p>
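        <p>The contrast between the policies can be caricatured as follows: a learner buffers local drift and contacts the PS only when the drift leaves a "safe zone", rather than on every update (asynchronous) or every round (synchronous). This is a deliberately simplified sketch of the safe-zone idea; the actual FGM protocol [17] is substantially more involved:</p>

```java
import java.util.Arrays;

// Safe-zone style communication gate: ship the buffered drift to the PS
// only when its norm exceeds a threshold, instead of on every update.
public class DriftGate {
    private final double[] drift;     // buffered local model drift
    private final double threshold;   // safe-zone radius
    private int syncs = 0;

    public DriftGate(int dim, double threshold) {
        drift = new double[dim];
        this.threshold = threshold;
    }

    // apply a local update; returns true iff a sync with the PS was triggered
    public boolean update(double[] grad) {
        for (int i = 0; i < drift.length; i++) drift[i] += grad[i];
        double norm2 = 0;
        for (double v : drift) norm2 += v * v;
        if (Math.sqrt(norm2) > threshold) {   // safe zone violated
            Arrays.fill(drift, 0);            // ship drift, reset buffer
            syncs++;
            return true;
        }
        return false;
    }

    public int syncCount() { return syncs; }
}
```

Ten small updates trigger only a couple of syncs here, versus ten messages under a purely asynchronous policy.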
        <p>INFORE Architecture. We develop a prototype of the INFORE
system incorporating operators from SDE, OMLDM and stream
transformations provided by the streaming APIs of Big Data
platforms such as Apache Flink or Spark. An overview of the INFORE
architecture is illustrated in Figure 1(a). Currently, INFORE supports
streaming analytics workflows over Apache Flink, Spark Structured
Streaming and Kafka. Cross-(Big Data) platform and cluster
communication is performed via JSON formatted Kafka messages. In a
nutshell, INFORE works as follows:</p>
        <p>(Step 1) The user designs in a Graphical Editor (left of Figure 1(a))
the desired workflow composed of stream transformation, OMLDM
and SDE operators. These are logical operators composing the
logical view of the workflow, independent of the underlying platforms.
</p>
        <p>[Figure 1: (a) Overall INFORE Architecture; (b) SDEaaS Component (Condensed View); (c) Optimizer Component; (d) OMLDM Component (Condensed View); (e) Parameter Server View.]</p>
        <p>
(Step 2) When the designed workflow is submitted, a JSON file
is conveyed to a Manager module (middle of Figure 1(a)). Initially,
the Manager sends the JSON to the Optimizer (right of Figure 1(a)).</p>
        <p>(Step 3) The Optimizer determines the physical view of the
workflow, performing a mapping of the logical operators to physical ones
to be executed on specific network clusters and Big Data platforms,
so that various performance criteria are optimized.</p>
        <p>(Step 4) The Optimizer returns a JSON with physical operators
to the Manager, which compiles files deploying stream
transformation, OMLDM and SDE operators (top of Figure 1(a)) to be
submitted at the chosen (cluster, Big Data platform) pairs.</p>
        <p>(Step 5) The submitted jobs are monitored and performance
statistics are collected.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 KEY-COMPONENT INTERNALS</title>
      <p>This section elaborates on INFORE’s key components. For space
considerations, we describe the concepts using Flink as an example
implementation, but the design is generic to support operators in
all streaming platforms mentioned above.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 SDE Component</title>
      <p>
        The architecture of the SDE Component is illustrated in Figure 1(b).
In streaming settings, multiple workflows may run continuously
providing output streams to application interfaces for timely
decision making. The SDEaaS design we adopt (see Section 2) can
serve multiple workflows, running concurrently and continuously,
with approximate, commonly used operators (e.g., counts, sums,
frequency moments) reusing, instead of duplicating, streams and
synopses common to workflows. Moreover, for every new synopsis
maintenance request, SDEaaS reserves new execution tasks for the
operators in Figure 1(b), while without SDEaaS we would reserve
entire threads (task slots in terms of Flink). In this way, SDEaaS manages
thousands of synopses, on demand, for thousands of streams, hence
ensuring vertical scalability [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        To briefly showcase the performance benefits of our approach,
Figure 3(a) shows the results of an experiment over a cluster with
40 task slots. (The hardware details are not important for this
experiment.) We compare SDEaaS to a non-SDEaaS approach, such
as Proteus [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], that extends Flink with data synopsis utilities, but
lacks the SDEaaS paradigm. We design an experiment where we
start with maintaining 2 Count-Min sketches for frequency
estimations on the (volume, price) pairs of financial data (stocks). Then,
we express demands for maintaining more Count-Min sketches
for up to 5000 sketches/stocks. We do that without stopping the
already running synopses. Figure 3(a) shows the sum of
throughputs of all running jobs for non-SDEaaS and the throughput of
SDEaaS. The non-SDEaaS approach starts new jobs for new synopses and assigns
at least one entire task slot to each new job. Hence, the 40 task
slots in our cluster are depleted and ✘ signs in Figure 3(a) denote
that non-SDEaaS cannot maintain more than 40 synopses. SDEaaS
has no such limit since it assigns new tasks to a single running
service (job) at runtime, with zero downtime. SDEaaS also exhibits
higher throughput than non-SDEaaS due to fine-grained resource
utilization at the task, instead of task slot, level.
      </p>
      <p>
        The extensible SDE Library currently includes a variety of
synopsis techniques for cardinality, frequency moment, correlation,
set membership or quantile estimation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Please check [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] for
further details. The SDE API allows on-the-fly requests for
building/stopping synopses, plugging in code for new synopses in the SDE
Library, and ad-hoc and continuous queries. To enable the SDEaaS
design for interactivity and vertical scalability, the SDE Library is
built using subtype polymorphism and inheritance in Java. This is
the key to allow new synopsis code to be loaded at runtime and
also allows all requests to be issued on-the-fly by the Manager
module, instead of deploying separate jobs. Every new synopsis
simply extends a Synopsis class with its detailed algorithms for
add (streaming updates), estimate (query answering) and merge
for outputting an estimation for mergeable synopses [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] kept at
multiple workers or clusters.
      </p>
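      <p>The class structure just described can be sketched as follows; names are illustrative and the SDE Library's actual hierarchy may differ in detail:</p>

```java
// Subtype-polymorphism design: every synopsis extends a common base with
// add/estimate/merge, so new synopsis code can be loaded at runtime behind
// one interface that the engine already knows how to drive.
abstract class Synopsis {
    abstract void add(Object tuple);          // streaming update
    abstract Object estimate(Object query);   // query answering
    abstract Synopsis merge(Synopsis other);  // for mergeable synopses [1]
}

// A deliberately trivial concrete synopsis: a mergeable counter.
class CountSynopsis extends Synopsis {
    long n = 0;
    void add(Object tuple) { n++; }
    Object estimate(Object query) { return n; }
    Synopsis merge(Synopsis other) {
        CountSynopsis m = new CountSynopsis();
        m.n = this.n + ((CountSynopsis) other).n;
        return m;
    }
}
```

Because the engine only ever calls the base-class methods, plugging in a new synopsis at runtime never requires redeploying the running SDEaaS job.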
      <p>When a synopsis operator of the SDE is included in a workflow,
every request arriving via the RequestTopic in Figure 1(b) for
building or querying a running synopsis follows the red-colored path.
The RegisterSynopsis and RegisterRequest FlatMaps initially
produce and then look up the same keys pointing to workers that
should maintain the synopsis, each for a different purpose.
RegisterSynopsis uses these keys to direct streaming tuples, which
follow the blue-colored path, to assigned workers to update the
synopsis. RegisterRequest uses the same keys to direct queries to
those workers and derive estimations at the estimation FlatMap
by reading the status of the synopsis kept at the add FlatMap.</p>
      <p>
        When a streaming update arrives at the Kafka DataTopic, the
FlatMap termed HashData looks up the keys of workers produced
by RegisterSynopsis and directs the update to the proper workers.
There, the add FlatMap includes the update in the maintained
synopsis. For instance, in case an FM sketch is maintained [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the add
operation hashes the incoming tuple to a position of the maintained
bitmap and turns the corresponding bit to 1 if it is not already set.
Upon a query, the estimation FlatMap uses the position of the lowest
unset bit of that bitmap to derive a cardinality estimation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
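      <p>The add/estimate/merge operations just described can be written down in miniature. This single-bitmap version is illustrative only; practical FM sketches average many bitmaps to reduce variance [1], and the hash below is a placeholder:</p>

```java
// Miniature FM sketch: add sets the bit at the trailing-zero count of the
// tuple's hash; estimate reads the lowest unset bit R and returns 2^R / phi;
// merge is a logical OR (disjunction) of bitmaps, as for mergeable synopses.
public class FMSketch {
    private long bitmap = 0;
    private static final double PHI = 0.77351;   // FM correction constant

    private static int position(Object tuple) {
        int h = tuple.hashCode() * 0x9E3779B9;   // illustrative hash only
        return Integer.numberOfTrailingZeros(h == 0 ? 1 : h);
    }

    public void add(Object tuple) { bitmap |= 1L << position(tuple); }

    public double estimate() {
        int r = Long.numberOfTrailingZeros(~bitmap);  // lowest unset bit
        return Math.pow(2, r) / PHI;
    }

    // mergeable across workers/clusters: union of the bitmaps
    public void merge(FMSketch other) { bitmap |= other.bitmap; }
}
```

Merging is exactly the logical disjunction the splitter's purple and yellow paths rely on: OR-ing two workers' bitmaps yields the sketch of the union of their streams.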
      <p>
        The rest of the operators on the right-hand side of Figure 1(b) are
used for appropriately controlling the output topics. The splitter
(split operator in Flink) directs the estimation to the OutputTopic
if a synopsis is defined on a single stream handled by one worker
(green path). If a synopsis engages many streams (e.g., many stocks)
maintained at multiple workers (purple path) or multiple clusters
(yellow path – clusters communicate via the UnionTopic of Kafka)
the merge FlatMap synthesizes the estimations (for instance, executing
a logical disjunction operation on various FM sketches [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) before
outputting the estimation via OutputTopic.
      </p>
    </sec>
    <sec id="sec-5">
      <title>3.2 OMLDM Component</title>
      <p>The OMLDM component fills the gap left by existing
batch-processing-focused APIs such as Spark’s MLlib or FlinkML. Its internal
architecture is illustrated in Figure 1(d). Two types of pipelines are engaged
in the OMLDM architecture: training and prediction pipelines.</p>
      <p>Training pipelines. Training data streams follow the blue-colored
path of Figure 1(d). They are ingested via the TrainingTopic. The
rest of the training pipeline is implemented by a pair of Learner
and ParameterServer FlatMaps. We follow a Parameter Server
(PS) distributed model (Figure 1(e)), where an interactively defined
number of identical learners distribute amongst themselves the
computational load of training on an incoming data stream of
training samples. In a nutshell, PS holds and updates the state of the
learning process, including the current version of the trained model.
Learners compute the updates to the model and coordinate via the
PS. This is because the iterative nature of ML/DM computations
requires communication between all parallel workers, and this is done
by the PS in each pipeline. Training pipelines are data sinks, that
is, they do not generate an output stream. However, they do have
a Feedback Kafka topic to support iteration required by involved
tasks. The Feedback topic is also read by the predictors of the red
path, discussed below, to update their models upon a concept drift.
For instance, should training on labeled streams of a classification
task take place in parallel with prediction (tagging unlabeled streams),
upon a concept drift the PS sends a new model to the predictors
via the FeedbackTopic.</p>
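      <p>A minimal in-process picture of one synchronous PS round is given below; it is illustrative only, since the OMLDM component runs learners and the PS as Flink FlatMaps communicating over Kafka:</p>

```java
import java.util.List;

// One synchronous parameter-server round: each learner computes a local
// update on its share of the stream; the PS averages the updates into the
// global model and (conceptually) broadcasts it back to the learners.
public class MiniPS {
    private final double[] model;

    public MiniPS(int dim) { model = new double[dim]; }

    public double[] model() { return model.clone(); }

    // average the learners' local updates into the global model
    public void round(List<double[]> learnerUpdates) {
        int k = learnerUpdates.size();
        for (double[] u : learnerUpdates)
            for (int i = 0; i < model.length; i++)
                model[i] += u[i] / k;
    }
}
```

The asynchronous and FGM policies differ only in when this aggregation fires, not in what the PS computes.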
      <p>
        For training pipelines, it is the pipeline’s responsibility to
coordinate the processing between the learners and the PS. As noted in
Section 2, the supported coordination policies include synchronous,
asynchronous and FGM, a novel and bandwidth-efficient
coordination policy that we introduced in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Performance-wise, the
synchronous policy hinders enhanced horizontal
scalability because, with many learners, total utilization
is usually low even if only a few stragglers exist. The asynchronous
one is the policy of choice in large-scale ML; the processing speed is
much higher when many learners are used, and therefore training is
more scalable. The FGM policy provides both enhanced horizontal
and federated scalability because (a) the learners do not need to
await each other’s sync in rounds or contact the PS immediately
with each update, (b) communication takes place asynchronously,
but only when a concept drift may have occurred. This is achieved
in synergy with the Optimizer (see Section 4).
      </p>
      <p>Figure 1(d) shows an instance of the PS and learners running on
a Flink cluster. The messaging interface between Learner and the
PS is handled via Kafka, and hence it is generic and independent of
the underlying Big Data platform.</p>
      <p>Prediction pipelines. These are shown with the red-colored path of
Figure 1(d). Prediction pipelines do not train the models they apply;
these models are loaded at Predictor FlatMap via control
messages. A model can be changed interactively by the PS at runtime, if
there has been a concept drift, via the FeedbackTopic. Prediction
pipelines employ some model for classification, interpolation or
inference and transform input to output streams, which can be
processed further by downstream operators of a global workflow.
The control topics (green arrows and Kafka topics in Figure 1(d))
deliver control messages to and from an OMLDM job. The control
API understands Create/Read/Update/Delete (CRUD) commands
for each type of resource. For instance, using control messages the
OMLDM component will respond with the current status of the PS
and learners (e.g. how many learners are currently running) or the
currently up-to-date model at the PS.</p>
      <p>OMLDM currently supports (i) classification: Passive Aggressive
Classifier, Online Support Vector Machines and Vertical Hoeffding
Trees, (ii) clustering/outlier detection: BIRCH, Online k-Means and
StreamKM++ and (iii) regression: Passive Aggressive Regressor,
Online Ridge and Polynomial Regression, Autoregressive Integrated
Moving Average. Facilities for correlation matrix computation,
standardization and polynomial feature extraction are also available.</p>
      <p>[Figure 2: (a) Initial Workflow at the Graphical Editor; (b) Suggested Workflow via Synopses-based Optimization.]</p>
    </sec>
    <sec id="sec-6">
      <title>3.3 Optimizer Component</title>
      <p>Given a workflow composed of various operators, we have to
properly tune its execution, prescribing both the Big Data platform
and the networked computer cluster, so that good overall
performance with respect to interactivity-related metrics is achieved.</p>
      <p>The internal architecture of the INFORE Optimizer is shown
in Figure 1(c). There are a number of submodules involved before
providing logical to physical operator mappings to be deployed
for execution. First, we use a statistics collector to derive
performance measurements from each executed workflow. Statistics are
collected via JMX or Slurm5 and are ingested into an ELK stack6
while jobs are being monitored.</p>
      <p>We have developed a Benchmarking submodule that automates
the acquisition of performance metrics for every supported operator
run on diferent Big Data platforms. The Benchmarking submodule
circumvents steps 1-3 in Figure 1(a) and enables direct job submission
to the underlying clusters engaging supported operators and stream
sources. In this way, the execution of various operators and (parts of)
workflows on different (Big Data platform, cluster) pairs is a priori
benchmarked, statistics are collected and performance models are
built to avoid a cold start.</p>
      <p>
        Cost models are derived with a Bayesian Optimization approach
inspired by CherryPick [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. From the statistics we have collected
from micro-benchmarks and/or actual workflow executions,
we form a Gaussian performance/cost prediction model for the
various operators and (parts of) workflows executed on different
(Big Data platform, cluster) pairs. The cost model is incrementally
improved with new workflow executions.
      </p>
      <p>Algorithmically, the Optimizer solves a multi-criteria
optimization problem with objectives involving throughput, communication
cost, processing and network latency, CPU/GPU and memory usage
as well as accuracy for synopses-based optimization purposes. Due
to conflicting objectives, we cannot optimize them all
simultaneously. Therefore, we resort to Pareto optimal solutions obtained
by maximizing a weighted combination of these objectives under
clusters’ capacity constraints.</p>
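      <p>The scalarization step can be sketched as follows: score each candidate (platform, cluster) mapping by a weighted combination of objectives, subject to a capacity constraint, and keep the best feasible candidate. Field names and weights are illustrative, not the Optimizer's actual cost model:</p>

```java
import java.util.List;

// Weighted-combination scoring of candidate operator placements under a
// cluster capacity constraint (a single point on the Pareto frontier).
public class PlanScorer {
    static class Candidate {
        final String platform, cluster;
        final double throughput, latency, memoryGb;  // predicted by cost models
        Candidate(String p, String c, double th, double lat, double mem) {
            platform = p; cluster = c;
            throughput = th; latency = lat; memoryGb = mem;
        }
    }

    static Candidate best(List<Candidate> cands, double capacityGb,
                          double wThroughput, double wLatency) {
        Candidate best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Candidate c : cands) {
            if (c.memoryGb > capacityGb) continue;   // capacity constraint
            double score = wThroughput * c.throughput - wLatency * c.latency;
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }
}
```

Sweeping the weights traces out different Pareto-optimal plans, which is how conflicting objectives (throughput vs. latency vs. memory) are reconciled.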
      <p>
        Notably, no prior efort [
        <xref ref-type="bibr" rid="ref2 ref4 ref5 ref9">2, 4, 5, 9</xref>
        ] examined such a rich set of
optimization objectives related to interactivity. Indicatively, only
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] considers network performance metrics (but for batch
processing scenarios). To derive Pareto optimal solutions, besides a baseline,
Exhaustive Search algorithm that examines every possible mapping
of logical to physical operators in a workflow, we have developed a
novel variation of an A*-alike algorithm. The classic A* algorithm
5https://docs.oracle.com/javase/tutorial/jmx/overview/, slurm.schedmd.com/
6https://www.elastic.co/what-is/elk-stack
cannot work in our setup as is, because it only suggests optimal
paths; some logical operators would thus be left without a
physical operator mapping. Our novel A*
variation employs heuristics equivalent to those of classic A* to prune the
search space, but outputs a graph (instead of a path) with physical
operators instantiating every logical one in the submitted workflow.
This new algorithm can speed up the suggestion of the optimal
physical workflow by orders of magnitude (see Section 4). We have
also developed a Greedy and a Heuristic algorithm that leverage
parallelism and domain knowledge to boost performance further
and to enable adaptivity (i.e., provide, in real time and incrementally,
a new plan to adapt to ad hoc performance hiccups). All algorithms
employ the cost estimation models to evaluate alternative physical
operators while exploring the space of alternative mappings.
      </p>
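The key difference from path-finding A* can be sketched as a best-first search over partial mappings, where a goal state must assign a physical operator to every logical one (the `dictionary`, `cost`, and `heuristic` callables below are hypothetical stand-ins for the Optimizer's operator dictionary and its cost/lower-bound estimators):

```python
import heapq

def map_workflow(logical_ops, dictionary, cost, heuristic):
    """A*-alike sketch: best-first search over logical-to-physical mappings.
    States are partial mappings (tuples of physical operators); a solution
    covers every logical operator, not just a path. Hypothetical signatures."""
    start = ()
    frontier = [(heuristic(start, logical_ops), start)]
    while frontier:
        f, mapping = heapq.heappop(frontier)
        i = len(mapping)
        if i == len(logical_ops):  # every logical operator instantiated
            return dict(zip(logical_ops, mapping))
        for phys in dictionary[logical_ops[i]]:
            new = mapping + (phys,)
            g = sum(cost(op, p) for op, p in zip(logical_ops, new))
            heapq.heappush(frontier, (g + heuristic(new, logical_ops), new))
```

With an admissible heuristic (a lower bound on the cost of mapping the remaining operators), large parts of the space of alternative mappings are never expanded.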
    </sec>
    <sec id="sec-7">
      <title>SYNERGIES IN A REAL CASE STUDY</title>
      <p>This section describes how our prototype works in a real use case
scenario from the financial domain and how advanced synergies are
established among individual components. For illustration purposes,
we employ a single cluster equipped with Flink and Spark.</p>
      <p>The logical workflow submitted to the Optimizer is illustrated
in Figure 2(a). The workflow utilizes Level 1 and Level 2 stock
data (www.investopedia.com/terms/l/level1.asp, www.investopedia.com/terms/l/level2.asp)
to discover highly positively/negatively correlated
pairs of stocks. In Figure 2(a), data arrive at a Kafka source. The
Split operator separates Level 1 from Level 2 data. It directs Level
2 data to the bottom branch of the workflow. There, the Level 2 bids
are Filtered (i.e., for monitoring only a subset of stocks). The bids
per stock are counted using a Count logical operator. When a trade
for a stock is realized, a new Level 1 tuple is directed by Split to
the upper part of the workflow. A Project operator keeps only the
timestamp of the trade for each stock. The Join operator joins the
stock trade, Level 1 tuple with the count of bids the stock received
until the trade. The result is inserted in a time Window of recent such
counts, forming a time series. Finally, the CorrelationMatrix of
stocks’ time series is computed by the OMLDM operator.</p>
      <sec id="sec-7-1">
        <title>Synopses-based Optimization</title>
        <p>When the workflow is submitted, for every logical operator, there are two platform choices, Flink
or Spark. The Optimizer holds a dictionary to determine the
equivalence of logical operators to physical ones in either platform. In
Synopses-based Optimization, the Optimizer attempts to boost
interactivity by treating the SDE as an additional platform. That is,
its dictionary also holds mappings about which exact operators
could be replaced by approximate ones of the SDE. Therefore, the
Optimizer will finally suggest the execution plan shown in
Figure 2(b). In this plan, the Count operator has been replaced by an
SDE.CountMin sketch operator and the Window operator by SDE.DFT.
All physical operators are prescribed to be executed over Flink,
i.e., the platform where the SDE service runs.</p>
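Conceptually, the Optimizer's dictionary can be pictured as below (a hypothetical sketch; the operator names and the `candidates` helper are illustrative, not the prototype's actual API). Exact physical operators exist per platform, and synopses-based optimization adds the SDE's approximate alternatives:

```python
# Hypothetical sketch of the Optimizer's operator dictionary: each logical
# operator maps to exact physical operators per platform and, when
# synopses-based optimization is enabled, also to approximate SDE operators.
OPERATOR_DICTIONARY = {
    "Count":  ["flink.Count", "spark.Count", "SDE.CountMin"],
    "Window": ["flink.Window", "spark.Window", "SDE.DFT"],
    "Join":   ["flink.Join", "spark.Join"],
}

def candidates(logical_op, allow_approximate):
    """Physical candidates for a logical operator; SDE alternatives are
    offered only when approximate answers are acceptable."""
    opts = OPERATOR_DICTIONARY[logical_op]
    return opts if allow_approximate else [o for o in opts if not o.startswith("SDE.")]
```

Treating the SDE as one more "platform" in this dictionary is what lets the same search algorithms weigh exact against approximate operators.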
      </sec>
      <sec id="sec-7-2">
        <title>Boosting Vertical Scalability</title>
        <p>
          The SDE.DFT operator, for every window it approximates, outputs a pair of values: a
bucketID and a vector of reduced dimensionality compared to the
original window. The bucketID is based on the first few coefficients
of the DFT [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and is used as the key to hash stock streams to
learners of the CorrelationMatrix OMLDM operator (Figure 2(b)).
Reduced vector correlations are pruned for stocks hashed far away
in the hashing space since these stock streams are guaranteed to
be highly (dis)similar [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. This boosts vertical scalability for a task
of quadratic complexity in the number of input streams.
        </p>
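The bucketID idea can be sketched as hashing a window by its leading DFT coefficients (an illustrative simplification of the scheme in [11]; the `grid` cell width and function name are hypothetical):

```python
import cmath

def dft_key(window, n_coeffs=2, grid=0.5):
    """Illustrative sketch: hash a stream window to a bucketID derived from
    its first few DFT coefficients. Windows landing in far-apart grid cells
    can have their pairwise correlations pruned; 'grid' is a hypothetical
    cell width, not a parameter from the paper."""
    n = len(window)
    coeffs = []
    for k in range(1, n_coeffs + 1):  # skip the DC term for standardized data
        c = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(window)) / n
        coeffs.extend([c.real, c.imag])
    # Quantize each coefficient to a grid cell to obtain a discrete key.
    return tuple(int(v // grid) for v in coeffs)
```

Streams with identical windows always share a bucket, so they are routed to the same CorrelationMatrix learner.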
        <p>
          Pumped Parameter Server. The stream entering the
Correlation Matrix operator of OMLDM is duplicated to the training
          and prediction pipelines (Figure 1(d)). Learners (buckets in our
previous discussion) compute the pairwise correlations of the streams
they receive (the correlation coefficient and the beta regression
parameter coincide for standardized vectors), while the PS holds the
up-to-date correlation matrix, which is essentially the global model
at both the predictor and the PS sides. A concept drift for the global
model occurs when the Frobenius norm measuring the difference
between the matrix kept at the PS and the true correlation matrix
exceeds a threshold. The Optimizer decomposes this
constraint into local ones installed at each learner. Mathematical details
are described in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], but in a simple version of FGM, the Optimizer
prescribes that learners sync with the PS when a signed
distance value, computed locally at each learner, becomes positive.
Notice that the local constraints are determined by the Optimizer, which
initially prescribes and then adapts the learners at runtime.
        </p>
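A deliberately simplified sketch of such a local condition is shown below. This is NOT the FGM safe-zone machinery of [17]; it only illustrates the shape of a signed, locally checkable test, under the crude assumption that each of the k learners may drift by at most threshold/k in Frobenius norm before syncing:

```python
def frobenius(m):
    """Frobenius norm of a matrix given as a list of row lists."""
    return sum(v * v for row in m for v in row) ** 0.5

def local_condition(local_delta, threshold, num_learners):
    """Signed distance at one learner (simplified, hypothetical decomposition:
    uniform threshold/num_learners budget per learner; the actual FGM
    decomposition in [17] is considerably more refined)."""
    return frobenius(local_delta) - threshold / num_learners

def must_sync(local_delta, threshold, num_learners):
    """A learner syncs with the PS when its signed distance turns positive."""
    return local_condition(local_delta, threshold, num_learners) > 0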
      </sec>
      <sec id="sec-7-3">
        <title>Preliminary Experimental Evaluation</title>
        <p>We employ a Kafka cluster with 4 Dell PowerEdge R320 Intel Xeon E5-2430 v2 2.50GHz
machines with 32GB RAM and a cluster with 10 Dell PowerEdge
R300 Quad Core Xeon X3323 2.5GHz machines with 8GB RAM
each, for Flink and Spark Structured Streaming. We use real Level 1,
Level 2 stock data (∼5000 stocks/∼10 TB) provided by a European
financial company. We set DFT to use 8 coefficients and CountMin
to (ε = 0.002, δ = 0.01), and we use a 0.9 correlation threshold and a time
window of 5 minutes, based on input from the data provider.
Figure 3(b) shows the time it takes the two optimization algorithms
(Exhaustive and A*–alike) to produce a physical workflow from the
time they receive the logical one, when 1 (Flink), 2 (Spark, Flink) or
3 (Spark, Flink, Synopses-based Optimization) platforms are available. As the
figure demonstrates, for more than 1 platform our novel A*–alike
algorithm reduces the time to prescribe a physical workflow by 1 to
4 orders of magnitude. For 1 platform (where the Optimizer still prescribes
provisioned resources), the Exhaustive Search algorithm is faster,
since A*–alike requires more initialization time.</p>
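For context on the CountMin configuration above, the standard formulas tie the sketch's width and depth to the error bound ε and failure probability δ (a sketch of the well-known sizing rule, not code from the prototype):

```python
import math

def countmin_dims(epsilon, delta):
    """Standard CountMin sizing: width = ceil(e / epsilon) bounds the additive
    count error, depth = ceil(ln(1 / delta)) bounds the failure probability.
    Illustrative helper; the experiments use epsilon = 0.002, delta = 0.01."""
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1.0 / delta))
    return width, depth
```

With the experiment's parameters this yields a 1360-column, 5-row counter array per sketch, which is what makes the approximate Count so cheap relative to exact per-stock counting.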
        <p>In Figure 3(c) we show the scalability of the INFORE approach
versus a technique that executes the workflow of Figure 2(a) in Flink
(denoted by "Parallelism"), without synopses-based optimization,
boosted vertical scalability or the OMLDM Component. INFORE
executes the workflow of Figure 2(b). We plot the throughput ratio of
INFORE and "Parallelism" over a Naive approach that executes the
same workflow with parallelism set to 1. The two horizontal axes in
the figure show the number of processed stock streams (bottom) and
the volume in terabytes (top) ingested using the maximum Kafka
rate. Thus, the top axis accounts for horizontal scalability (volume,
speed), while the bottom axis accounts for vertical scalability.</p>
        <p>When we monitor 500 streams, i.e., 250K/2 cells in the correlation
matrix, the fact that "Parallelism" lacks synopses-based optimization
and vertical scalability makes its throughput ratio approach 1, i.e.,
that of the Naive approach. INFORE’s performance remains steady when
switching from 50 to 500 streams, improving on "Parallelism" by 3 times.
The most important findings come upon switching to 5000 stocks
(25M/2 cells in the matrix). Figure 3(c) shows that, because of the
lack of enhanced horizontal and vertical scalability, "Parallelism"
becomes equivalent to the Naive approach. INFORE exhibits more than
an order of magnitude higher throughput, with the trade-off that
highly correlated pairs are identified with 90% accuracy.</p>
        <p>For federated scalability, we measure communicated bytes among
the operators. Due to space constraints, we only report here that
INFORE can reduce communication by more than an order of
magnitude when 5000 stock streams are being monitored.
</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>CONCLUSIONS &amp; ONGOING WORK</title>
      <p>We have presented the design, internal structure, and integration
aspects of key components of INFORE, a prototype designed for
interactive analytics at scale. To our knowledge, INFORE is the first
system for cross-platform streaming analytics. We are currently enriching
the SDE and OMLDM libraries to strengthen support for the
three scalability types and to enhance analytics capabilities.</p>
    </sec>
    <sec id="sec-9">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work has received funding from the EU Horizon 2020 research
and innovation program INFORE under grant agreement No 825070.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] 2016. Data Stream Management - Processing High-Speed Data Streams. Springer. https://doi.org/10.1007/978-3-540-28608-0</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Divy Agrawal, Sanjay Chawla, Bertty Contreras-Rojas, Ahmed K. Elmagarmid, Yasser Idris, Zoi Kaoudi, Sebastian Kruse, Ji Lucas, Essam Mansour, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Saravanan Thirumuruganathan, and Anis Troudi. 2018. RHEEM: Enabling Cross-Platform Data Processing - May The Big Data Be With You! Proc. VLDB Endow. 11, 11 (2018), 1414-1427. https://doi.org/10.14778/3236187.3236195</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Omid Alipourfard, Hongqiang Harry Liu, Jianshu Chen, Shivaram Venkataraman, Minlan Yu, and Ming Zhang. 2017. CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics. In 14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017, Boston, MA, USA, March 27-29, 2017, Aditya Akella and Jon Howell (Eds.). USENIX Association, 469-482. https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/alipourfard</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Katerina Doka, Nikolaos Papailiou, Dimitrios Tsoumakos, Christos Mantas, and Nectarios Koziris. 2015. IReS: Intelligent, Multi-Engine Resource Scheduler for Big Data Analytics Workflows. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, Timos K. Sellis, Susan B. Davidson, and Zachary G. Ives (Eds.). ACM, 1451-1456. https://doi.org/10.1145/2723372.2735377</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Aaron J. Elmore, Jennie Duggan, Mike Stonebraker, Magdalena Balazinska, Ugur Çetintemel, Vijay Gadepally, Jeffrey Heer, Bill Howe, Jeremy Kepner, Tim Kraska, Samuel Madden, David Maier, Timothy G. Mattson, Stavros Papadopoulos, Jeff Parkhurst, Nesime Tatbul, Manasi Vartak, and Stan Zdonik. 2015. A Demonstration of the BigDAWG Polystore System. Proc. VLDB Endow. 8, 12 (2015), 1908-1911. https://doi.org/10.14778/2824032.2824098</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Forbes. 2021 (last accessed). https://www.forbes.com/sites/tomgroenfeldt/2013/02/14/at-nyse-the-data-deluge-overwhelms-traditional-databases/#362df2415aab</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Nikos Giatrakos, David Arnu, Theodoros Bitsakis, Antonios Deligiannakis, Minos N. Garofalakis, Ralf Klinkenberg, Aris Konidaris, Antonis Kontaxakis, Yannis Kotidis, Vasilis Samoladas, Alkis Simitsis, George Stamatakis, Fabian Temme, Mate Torok, Edwin Yaqub, Arnau Montagud, Miguel Ponce de Leon, Holger Arndt, and Stefan Burkard. 2020. INforE: Interactive Cross-platform Analytics for Everyone. In CIKM '20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020. ACM, 3389-3392. https://doi.org/10.1145/3340531.3417435</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Nikos Giatrakos, Nikos Katzouris, Antonios Deligiannakis, Alexander Artikis, Minos N. Garofalakis, George Paliouras, Holger Arndt, Raffaele Grasso, Ralf Klinkenberg, Miguel Ponce de Leon, Gian Gaetano Tartaglia, Alfonso Valencia, and Dimitrios Zissis. 2019. Interactive Extreme: Scale Analytics Towards Battling Cancer. IEEE Technol. Soc. Mag. 38, 2 (2019), 54-61. https://doi.org/10.1109/MTS.2019.2913071</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Ionel Gog, Malte Schwarzkopf, Natacha Crooks, Matthew P. Grosvenor, Allen Clement, and Steven Hand. 2015. Musketeer: all for one, one for all in data processing systems. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys 2015, Bordeaux, France, April 21-24, 2015, Laurent Réveillère, Tim Harris, and Maurice Herlihy (Eds.). ACM, 2:1-2:16. https://doi.org/10.1145/2741948.2741968</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Jeyhun Karimov, Tilmann Rabl, Asterios Katsifodimos, Roman Samarev, Henri Heiskanen, and Volker Markl. 2018. Benchmarking Distributed Stream Data Processing Systems. In 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018. IEEE Computer Society, 1507-1518. https://doi.org/10.1109/ICDE.2018.00169</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Antonis Kontaxakis, Nikos Giatrakos, and Antonios Deligiannakis. 2020. A Synopses Data Engine for Interactive Extreme-Scale Analytics. In CIKM '20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020. ACM, 2085-2088. https://doi.org/10.1145/3340531.3412154</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI '14, Broomfield, CO, USA, October 6-8, 2014, Jason Flinn and Hank Levy (Eds.). USENIX Association, 583-598. https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] John Meehan, Stan Zdonik, Shaobo Tian, Yulong Tian, Nesime Tatbul, Adam Dziedzic, and Aaron J. Elmore. 2016. Integrating real-time and batch processing in a polystore. In 2016 IEEE High Performance Extreme Computing Conference, HPEC 2016, Waltham, MA, USA, September 13-15, 2016. IEEE, 1-7. https://doi.org/10.1109/HPEC.2016.7761585</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Aristides Milios, Konstantina Bereta, Konstantinos Chatzikokolakis, Dimitris Zissis, and Stan Matwin. 2019. Automatic Fusion of Satellite Imagery and AIS data for Vessel Detection. In 22th International Conference on Information Fusion, FUSION 2019, Ottawa, ON, Canada, July 2-5, 2019. IEEE, 1-5. https://ieeexplore.ieee.org/document/9011339</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Barzan Mozafari. 2019. SnappyData. In Encyclopedia of Big Data Technologies, Sherif Sakr and Albert Y. Zomaya (Eds.). Springer. https://doi.org/10.1007/978-3-319-63962-8_258-1</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Proteus Project. 2021 (last accessed). https://github.com/proteus-h2020</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Vasilis Samoladas and Minos N. Garofalakis. 2019. Functional Geometric Monitoring for Distributed Streams. In Advances in Database Technology - 22nd International Conference on Extending Database Technology, EDBT 2019, Lisbon, Portugal, March 26-29, 2019. OpenProceedings.org, 85-96. https://doi.org/10.5441/002/edbt.2019.09</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] Stream-lib. 2021 (last accessed). https://github.com/addthis/stream-lib</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Yahoo DataSketch. 2021 (last accessed). https://datasketches.github.io/</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>