<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
<title level="a" type="main">Extreme-Scale Interactive Cross-Platform Streaming Analytics - The INFORE Approach</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Antonios</forename><surname>Deligiannakis</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Nikos</forename><surname>Giatrakos</surname></persName>
							<email>ngiatrakos@athenarc.gr</email>
						</author>
						<author>
							<persName><forename type="first">Yannis</forename><surname>Kotidis</surname></persName>
							<email>kotidis@aueb.gr</email>
						</author>
						<author>
							<persName><forename type="first">Vasilis</forename><surname>Samoladas</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Alkis</forename><surname>Simitsis</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">Athena Research Center</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="institution">Technical University of Crete</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="institution">Athena Research Center</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<orgName type="institution">Technical University of Crete</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff4">
								<orgName type="institution">Athens University of Economics and Business</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff5">
								<orgName type="institution">Athena Research Center</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff6">
								<orgName type="institution">Technical University of Crete</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff7">
								<orgName type="department">Athena Research Center</orgName>
							</affiliation>
						</author>
<title level="a" type="main">Extreme-Scale Interactive Cross-Platform Streaming Analytics - The INFORE Approach</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">B398223E46ADE3758B6DB8964E39BCAB</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T08:55+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Nowadays, analytics are at the core of many emerging applications and can greatly benefit from the abundance of data and the progress in the processing capabilities of modern hardware. Still, new challenges arise from the extreme complexity of designing applications that fully exploit the potential of such offerings. In this work, we describe the main challenges in delivering advanced, cross-platform, streaming analytics at scale with the primary goal of interactivity. We present the architecture of a prototype system designed from scratch to tackle such challenges and describe the internals of its components. Finally, we discuss key added-value synergies in serving target analytics workflows in real-world applications and present a preliminary performance evaluation.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Interactive analytics at scale lie at the core of many diverse applications. In life sciences, studying the effect of drug applications on simulated tumors may produce data streams of 100Gb/min <ref type="bibr" target="#b7">[8]</ref>. These streams need to be analyzed online to interactively determine successive drug combinations. In the financial domain, NYSE alone generates several terabytes of stock data per day <ref type="bibr" target="#b5">[6]</ref>. Interactive analytics enable timely reaction to investment opportunities or risks. In maritime surveillance, one needs to fuse heterogeneous, voluminous data composed of position streams of thousands of vessels, satellite images, and thermal camera or acoustic signal streams of on-site devices to investigate ongoing illegal activities at sea <ref type="bibr" target="#b13">[14]</ref>.</p><p>Relevant data not only originate from a variety of heterogeneous sources but also often arrive at multiple, geo-dispersed clusters. In life sciences, several efforts aim at federating analytics across data centers 1 . Similarly, discovering dependencies or correlations among economies or industries involves global analysis of stock streams originating from various local market data centers. In maritime data analysis, streams come from a number of ground- or satellite-based receivers before being ingested into proximate data centers. Additionally, Big Data technologies are significantly fragmented. Each of these scenarios involves delivering analytics over networked clusters, each hosting a variety of Big Data platforms.</p><p>Typically, such scenarios are designed as streaming analytics workflows. 
Our goal is to provide a solution for executing global, cross-platform, streaming analytics workflows over a network of computer clusters hosting diverse Big Data technologies, with high interactivity; i.e., continuously deriving the output of the workflow in real-time and adapting to ad-hoc changes required due to observed results or system performance metrics, at runtime.</p><p>To this extent, streaming Big Data platforms such as Apache Spark, Flink, Kafka 2 approach the problem by ensuring horizontal scalability, i.e., by scaling out (parallelizing) the computation to a number of Virtual Machines (VMs) and providing APIs to program distributed Big Data processing pipelines. These platforms are not designed to support cross-platform execution scenarios. They also provide no or only suboptimal support for online Machine Learning (ML) or Data Mining (DM) operators, since the major ML APIs they provide do not focus on streaming data. Frameworks such as Apache Beam or Apache SAMOA 3 allow for creating code that is executable on different, supported Big Data platforms. Nevertheless, this does not remedy the lack of resource management in multi-cluster and cross-platform settings.</p><p>Related systems such as Rheem <ref type="bibr" target="#b1">[2]</ref>, IReS <ref type="bibr" target="#b3">[4]</ref>, BigDawg <ref type="bibr" target="#b4">[5]</ref>, Musketeer <ref type="bibr" target="#b8">[9]</ref> are designed towards cross-platform execution of global workflows, but they can only handle batch, instead of stream, processing pipelines. Even in systems like BigDawg or Rheem, which support stream processing in S-Store <ref type="bibr" target="#b12">[13]</ref> and JavaStreams respectively, cross-streaming platform execution remains unsupported.</p><p>Our work aims to fill the gap in efficiently handling advanced, cross-platform, interactive analytics at scale. We describe the architecture of an open-source system, namely INFORE 4 , designed from scratch for this purpose. 
A recent INFORE demo <ref type="bibr" target="#b6">[7]</ref> (Best Demo Award at CIKM 2020) showcases how code-free, cross-platform analytics are feasible through user-friendly, graphical design and execution of streaming workflows. <ref type="bibr" target="#b6">[7]</ref> presents the frontend, while in the current paper we focus on the backend. We detail its key features, three of its core components, and inter-component synergies in real-world applications. We also present a preliminary performance evaluation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">OVERVIEW OF THE APPROACH</head><p>Requirements and Key Features. Based on our interaction with real-world practitioners and applications, early in INFORE's design we identified three scalability requirements that should complement the interactivity and cross-streaming platform execution features.</p><p>Enhanced horizontal scalability. Parallelization alone is not adequate to handle extreme-scale scenarios. The number of available VMs in private clouds has a limit when it comes to extreme scale, while scaling out on public cloud providers incurs monetary charges.</p><p>Vertical scalability. Need to scale the computation with the number of processed streams. For instance, computing the correlation matrix of stock data streams results in a quadratic explosion in space and computational complexity, which is infeasible for thousands or millions of streams in our motivating scenarios.</p><p>Federated scalability. Need to scale the computation in settings where data arrive at multiple, geo-dispersed sites. In such settings, even when we do everything right within each cluster, the potential for interactivity is network bound <ref type="bibr" target="#b9">[10]</ref>.</p><p>Overview of Key Components. INFORE employs three core components for realizing the aforementioned key requirements.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Synopses Data Engine (SDE).</head><p>There is a wide consensus in stream processing <ref type="bibr" target="#b0">[1]</ref> that approximate but rapid answers to analytics tasks, more often than not, suffice. Synopses, such as samples, sketches, histograms <ref type="bibr" target="#b0">[1]</ref>, provide approximate answers, with accuracy guarantees, to popular (cardinality, frequency moment, correlation, set membership, quantile) analytic operators. The maintenance and use of data synopses is essential to stream processing applications because applications get only one look at the data. A synopsis serves as a compact, approximate version of the data which, due to its small size, can be processed and iterated upon without reading the entire, potentially infinite, data stream multiple times.</p><p>The SDE builds and maintains such synopses across platforms and clusters. It achieves enhanced horizontal scalability because synopses shed the load assigned to each VM in a cluster and reduce memory usage. For federated scalability, it reduces communication by transmitting compact data summaries. For interactivity, we add a unique SDE-as-a-Service (SDEaaS) design where the SDE is a constantly running service (job, in one or more clusters) that can accept any query, create synopses on new streams, or plug in code for new synopses at runtime, with zero downtime for running workflows. For vertical scalability, the SDEaaS design maintains thousands of synopses for thousands of streams, in settings where prior efforts <ref type="bibr" target="#b14">[15,</ref><ref type="bibr">16,</ref><ref type="bibr">18,</ref><ref type="bibr" target="#b16">19]</ref> can maintain only a few tens of synopses. We henceforth use the terms SDE and SDEaaS interchangeably.</p><p>Online Machine Learning and Data Mining (OMLDM). 
OMLDM applies distributed, online ML and DM adopting a Parameter Server (PS) paradigm <ref type="bibr" target="#b11">[12]</ref> (Figure <ref type="figure" target="#fig_2">1</ref>(e)). The PS paradigm enhances horizontal and federated scalability via the option of an asynchronous (besides synchronous) synchronization policy to reduce the effect of stragglers and bandwidth consumption <ref type="bibr" target="#b11">[12]</ref>, respectively. Regarding interactivity, the OMLDM component allows users to interactively adjust the number of learners and also to deploy the current up-to-date ML/DM model at the PS side, should concept drifts occur, while the training process proceeds at the parallel learners. Established synergies with the SDE and the Optimizer components also boost vertical and federated scalability.</p><p>Optimizer Component. The Optimizer is a prerequisite to effectively use the SDE and the OMLDM Components towards interactivity. The Optimizer picks (a) the networked cluster, (b) the Big Data platforms available in that cluster, and (c) the provisioned resources for each SDE, OMLDM (or other) operator of the global workflow. A good solution is determined by taking into account performance criteria and cost models related to interactivity, involving throughput, (network and processing) latency, memory usage, etc. These components cooperate as follows (more in Section 4).</p><p>Synopses-based Optimization. Given a workflow, the Optimizer produces alternative physical execution plans where it replaces exact (e.g., count, window, correlation) operators with equivalent, but approximate, ones from the SDE. It then provides suggestions for equivalent, but approximate, workflows accompanied by the expected interactivity (execution speed-up) vs. accuracy trade-off.</p><p>Boosting Vertical Scalability. The SDE Component boosts vertical scalability of the OMLDM Component in ways that are not possible otherwise. 
Indicatively, we use Discrete Fourier Transform (DFT) and Locality Sensitive Hashing (LSH) bitmaps <ref type="bibr" target="#b10">[11]</ref> to bucketize (hash) streams during correlation matrix computation, outlier detection, clustering or feature extraction. Vertical scalability is achieved by hashing streams to learners based on their DFT or LSH coefficients. The complexity of all these operators is reduced since computations are pruned for streams that do not end up nearby in the hashing space.</p><p>Pumped Parameter Server. With the help of the Optimizer, we extend the OMLDM Component with an additional synchronization policy, besides the synchronous and asynchronous communication entailed by the classic PS design <ref type="bibr" target="#b11">[12]</ref>. We incorporate a novel communication protocol, termed Functional Geometric Monitoring (FGM) <ref type="bibr" target="#b15">[17]</ref>. This protocol strengthens horizontal (within a cluster) and federated scalability by bridging the gap between synchronous and asynchronous communication. Instead of having learners communicate in predefined rounds/batches (synchronous) or whenever each one is updated (asynchronous), FGM requires communication only when a concept drift is likely to have occurred.</p><p>INFORE Architecture. We develop a prototype of the INFORE system incorporating operators from SDE, OMLDM and stream transformations provided by the streaming APIs of Big Data platforms such as Apache Flink or Spark. An overview of the INFORE architecture is illustrated in Figure <ref type="figure" target="#fig_2">1</ref>   </p></div>
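To make the DFT-based bucketization concrete, the following is a minimal, illustrative Python sketch (the actual INFORE operators run as distributed Flink/Spark jobs; all function names and the quantization grid below are hypothetical). It hashes standardized stream windows to buckets by quantizing their leading DFT coefficients, so that correlation checks are pruned to within-bucket pairs:

```python
import cmath
import math

def dft_coefficients(window, k=3):
    """Leading k DFT coefficients of a standardized window (naive O(n*k) DFT)."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n) or 1.0
    z = [(x - mean) / std for x in window]  # standardize the window
    return [sum(z[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / n
            for f in range(1, k + 1)]

def bucket_id(window, k=3, grid=0.5):
    """Quantize the leading coefficients into a coarse hash bucket (the bucketID)."""
    return tuple((round(c.real / grid), round(c.imag / grid))
                 for c in dft_coefficients(window, k))

def candidate_pairs(streams, k=3, grid=0.5):
    """Prune the quadratic pair space: only same-bucket pairs are compared."""
    buckets = {}
    for name, window in streams.items():
        buckets.setdefault(bucket_id(window, k, grid), []).append(name)
    return [(a, b) for group in buckets.values()
            for i, a in enumerate(group) for b in group[i + 1:]]
```

Two streams that are equal up to shifting and positive scaling standardize to the same window and land in the same bucket, remaining correlation candidates, while streams with clearly different spectra are pruned without ever being compared.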
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">KEY-COMPONENT INTERNALS</head><p>This section elaborates on INFORE's key components. Due to space considerations, we describe the concepts using Flink as an example implementation, but the design is generic enough to support operators in all the streaming platforms mentioned above.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">SDE Component</head><p>The architecture of the SDE Component is illustrated in Figure <ref type="figure" target="#fig_2">1(b)</ref>. In streaming settings, multiple workflows may run continuously, providing output streams to application interfaces for timely decision making. The SDEaaS design we adopt (see Section 2) can serve multiple workflows, running concurrently and continuously, with approximate, commonly used operators (e.g., counts, sums, frequency moments), reusing, instead of duplicating, streams and synopses common to workflows. Moreover, for every new synopsis maintenance request, SDEaaS reserves new execution tasks for the operators in Figure <ref type="figure" target="#fig_2">1</ref>(b), while without SDEaaS we would reserve entire threads (task slots in terms of Flink). In this way, SDEaaS manages thousands of synopses, on-demand, for thousands of streams, hence ensuring vertical scalability <ref type="bibr" target="#b10">[11]</ref>.</p><p>To briefly showcase the performance benefits of our approach, Figure <ref type="figure" target="#fig_8">3</ref>(a) shows the results of an experiment over a cluster with 40 task slots. (The hardware details are not important for this experiment.) We compare SDEaaS to a non-SDEaaS approach, such as Proteus <ref type="bibr">[16]</ref>, that extends Flink with data synopsis utilities but lacks the SDEaaS paradigm. We design an experiment where we start with maintaining 2 Count-Min sketches for frequency estimations on the (volume, price) pairs of financial data (stocks). Then, we issue requests for maintaining more Count-Min sketches, up to 5000 sketches/stocks. We do that without stopping the already running synopses. Figure <ref type="figure" target="#fig_8">3(a)</ref> shows the sum of the throughputs of all running jobs for non-SDEaaS and the throughput of SDEaaS. The non-SDEaaS approach starts new jobs for new synopses and assigns at least one entire task slot to each new job. 
Hence, the 40 task slots in our cluster are depleted and the ✘ signs in Figure <ref type="figure" target="#fig_8">3</ref>(a) denote that non-SDEaaS cannot maintain more than 40 synopses. SDEaaS has no such limit since it assigns new tasks to a single running service (job) at runtime, with zero downtime. SDEaaS also exhibits higher throughput than non-SDEaaS due to fine-grained resource utilization at the task, instead of the task slot, level.</p><p>The extensible SDE Library currently includes a variety of synopsis techniques for cardinality, frequency moment, correlation, set membership or quantile estimation <ref type="bibr" target="#b0">[1]</ref>. Please check <ref type="bibr" target="#b10">[11]</ref> for further details. The SDE API allows on-the-fly requests for building/stopping synopses, plugging in code for new synopses in the SDE Library, and ad-hoc and continuous queries. To enable the SDEaaS design for interactivity and vertical scalability, the SDE Library is built using subtype polymorphism and inheritance in Java. This is the key to allowing new synopsis code to be loaded at runtime and all requests to be issued on-the-fly by the Manager module, instead of deploying separate jobs. Every new synopsis simply extends a Synopsis class with its detailed algorithms for add (streaming updates), estimate (query answering) and merge, which outputs an estimation for mergeable synopses <ref type="bibr" target="#b0">[1]</ref> kept at multiple workers or clusters.</p><p>When a synopsis operator of the SDE is included in a workflow, every request arriving via the RequestTopic in Figure <ref type="figure" target="#fig_2">1</ref>(b) for building or querying a running synopsis follows the red-colored path. The RegisterSynopsis and RegisterRequest FlatMaps initially produce and then look up the same keys pointing to workers that should maintain the synopsis, each for a different purpose. 
RegisterSynopsis uses these keys to direct streaming tuples, which follow the blue-colored path, to the assigned workers to update the synopsis. RegisterRequest uses the same keys to direct queries to those workers and derive estimations at the estimation FlatMap by reading the status of the synopsis kept at the add FlatMap.</p><p>When a streaming update arrives at the Kafka DataTopic, the FlatMap termed HashData looks up the keys of the workers produced by RegisterSynopsis and directs the update to the proper workers. There, the add FlatMap includes the update in the maintained synopsis. For instance, in case an FM sketch is maintained <ref type="bibr" target="#b0">[1]</ref>, the add operation hashes the incoming tuple to a position of the maintained bitmap and turns the corresponding bit to 1 if it is not already set. Upon a query, the estimation FlatMap uses the lowest unset bit of that bitmap to derive a cardinality estimation <ref type="bibr" target="#b0">[1]</ref>.</p><p>The rest of the operators on the right-hand side of Figure <ref type="figure" target="#fig_2">1</ref>(b) are used for appropriately controlling the output topics. The splitter (split operator in Flink) directs the estimation to the OutputTopic if a synopsis is defined on a single stream handled by one worker (green path). If a synopsis engages many streams (e.g., many stocks) maintained at multiple workers (purple path) or multiple clusters (yellow path; clusters communicate via the UnionTopic of Kafka), the merge FlatMap synthesizes the estimations (for instance, executing a logical disjunction operation on the various FM sketches <ref type="bibr" target="#b0">[1]</ref>) before outputting the estimation via the OutputTopic.</p></div>
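The add/estimate/merge contract can be sketched as follows. This is an illustrative Python rendition of the pattern only; the actual SDE Library implements it with Java subtype polymorphism inside Flink FlatMaps, and all names here beyond add, estimate and merge are hypothetical:

```python
import hashlib
from abc import ABC, abstractmethod

class Synopsis(ABC):
    """Base class mirroring the SDE contract: add, estimate, merge."""
    @abstractmethod
    def add(self, item): ...
    @abstractmethod
    def estimate(self): ...
    @abstractmethod
    def merge(self, other): ...

class FMSketch(Synopsis):
    """Flajolet-Martin sketch: cardinality estimation over a small bitmap."""
    PHI = 0.77351  # standard FM correction factor

    def __init__(self, bits=32):
        self.bits = bits
        self.bitmap = 0

    def _rho(self, item):
        # position of the least-significant set bit of a (stable) hash
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        return (h & -h).bit_length() - 1 if h else self.bits - 1

    def add(self, item):
        # turn the corresponding bit to 1 if it is not already set
        self.bitmap |= 1 << min(self._rho(item), self.bits - 1)

    def estimate(self):
        # R = position of the lowest unset bit of the bitmap
        r = 0
        while (self.bitmap >> r) & 1:
            r += 1
        return 2 ** r / self.PHI

    def merge(self, other):
        # mergeable synopses from different workers combine via bitwise OR
        merged = FMSketch(self.bits)
        merged.bitmap = self.bitmap | other.bitmap
        return merged
```

Merging two workers' sketches yields the same bitmap as a single sketch built over the union of their streams, which is exactly the property the merge FlatMap exploits across workers (purple path) and clusters (yellow path).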
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">OMLDM Component</head><p>The OMLDM component fills the gap of existing batch-processing-focused APIs such as Spark's MLlib or FlinkML. Its internal architecture is illustrated in Figure <ref type="figure" target="#fig_2">1</ref>(d), where an interactively defined number of identical learners distribute amongst themselves the computational load of training on an incoming data stream of training samples. In a nutshell, the PS holds and updates the state of the learning process, including the current version of the trained model. Learners compute the updates to the model and coordinate via the PS. This is because the iterative nature of ML/DM computations requires communication between all parallel workers, and this is done by the PS in each pipeline. Training pipelines are data sinks, that is, they do not generate an output stream. However, they do have a Feedback Kafka topic to support the iterations required by the involved tasks. The Feedback topic is also read by the predictors of the red path, discussed below, to update their models upon a concept drift. For instance, should training on the labeled streams of a classification task take place in parallel with prediction (tagging unlabeled streams), then, in case of a concept drift, the PS sends a new model to the predictors via the FeedbackTopic.</p><p>For training pipelines, it is the pipeline's responsibility to coordinate the processing between the learners and the PS. As noted in Section 2, the supported coordination policies include synchronous, asynchronous and FGM, a novel and bandwidth-efficient coordination policy that we introduced in <ref type="bibr" target="#b15">[17]</ref>. Performance-wise, the synchronous policy does not encourage enhanced horizontal scalability because, when many learners are used, the total utilization is usually low even if only a few stragglers exist. 
The asynchronous one is the policy of choice in large-scale ML; the processing speed is much higher when many learners are used, and therefore training is more scalable. The FGM policy provides both enhanced horizontal and federated scalability because (a) the learners do not need to await each other's sync in rounds or contact the PS immediately with each update, and (b) communication takes place asynchronously, but only when a concept drift may have occurred. This is achieved in synergy with the Optimizer (see Section 4).</p><p>Figure <ref type="figure" target="#fig_2">1(d)</ref> shows an instance of the PS and learners running on a Flink cluster. The messaging interface between the learners and the PS is handled via Kafka, and hence it is generic and independent of the underlying Big Data platform.</p><p>Prediction pipelines. These are shown with the red-colored path of Figure <ref type="figure" target="#fig_2">1(d)</ref>. Prediction pipelines do not train the models they apply; these models are loaded at the Predictor FlatMap via control messages. A model can be changed interactively by the PS at runtime, if there has been a concept drift, via the FeedbackTopic. Prediction pipelines employ a model for classification, interpolation or inference and transform input to output streams, which can be processed further by downstream operators of a global workflow. The control topics (green arrows and Kafka topics in Figure <ref type="figure" target="#fig_2">1(d)</ref>) deliver control messages to and from an OMLDM job. The control API understands Create/Read/Update/Delete (CRUD) commands for each type of resource. For instance, using control messages the OMLDM component will respond with the current status of the PS and learners (e.g. 
how many learners are currently running) or the currently up-to-date model at the PS.</p><p>OMLDM currently supports (i) classification: Passive Aggressive Classifier, Online Support Vector Machines and Vertical Hoeffding Trees, (ii) clustering/outlier detection: BIRCH, Online k-Means and StreamKM++ and (iii) regression: Passive Aggressive Regressor, Online Ridge and Polynomial Regression, and Autoregressive Integrated Moving Average. Facilities for correlation matrix computation, standardization and polynomial feature extraction are also available.  </p></div>
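The training-side interplay between the PS and the learners can be sketched as follows. This is a toy Python rendition of the synchronous policy with a hypothetical least-squares SGD learner; the real component runs distributed on Flink with Kafka-based messaging, and every name here is illustrative:

```python
import random

class ParameterServer:
    """Holds the global model state that learners pull from and push updates to."""
    def __init__(self, dim):
        self.model = [0.0] * dim

    def sync(self, learner_models):
        # synchronous policy: one averaging round over all learners' models
        k = len(learner_models)
        self.model = [sum(ws) / k for ws in zip(*learner_models)]

class Learner:
    """Trains a local model (here: SGD for least squares) on its stream shard."""
    def __init__(self, dim, lr=0.05):
        self.w = [0.0] * dim
        self.lr = lr

    def train(self, samples):
        for x, y in samples:
            err = sum(wi * xi for wi, xi in zip(self.w, x)) - y
            self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        return self.w

def train_rounds(ps, learners, shards, rounds):
    for _ in range(rounds):
        models = []
        for learner, shard in zip(learners, shards):
            learner.w = list(ps.model)           # pull the current global model
            models.append(list(learner.train(shard)))
        ps.sync(models)                          # push: one sync per round
    return ps.model
```

Under the asynchronous policy, each learner would push its update as soon as it is computed, and under FGM only when its local drift condition fires; only the coordination step changes, not the learner logic.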
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Optimizer Component</head><p>Given a workflow composed of various operators, we have to properly tune their execution, prescribing both the Big Data platform and the networked computer cluster, so that good overall performance with respect to interactivity-related metrics is achieved.</p><p>The internal architecture of the INFORE Optimizer is shown in Figure <ref type="figure" target="#fig_2">1(c</ref>). There are a number of submodules involved in providing the logical-to-physical operator mappings to be deployed for execution. First, we use a statistics collector to derive performance measurements from each executed workflow. Statistics are collected via JMX or Slurm <ref type="foot" target="#foot_0">5</ref> and are ingested in an ELK stack<ref type="foot" target="#foot_1">6</ref> while monitoring jobs.</p><p>We have developed a Benchmarking submodule that automates the acquisition of performance metrics for every supported operator run on different Big Data platforms. The Benchmarking submodule circumvents steps 1-3 in Figure <ref type="figure" target="#fig_2">1</ref>(a) and enables direct job submission to the underlying clusters, engaging supported operators and stream sources. In this way, the execution of various operators and (parts of) workflows on different (Big Data platform, cluster) pairs is benchmarked a priori, statistics are collected and performance models are built to avoid a cold start.</p><p>Cost models are derived with a Bayesian Optimization approach inspired by CherryPick <ref type="bibr" target="#b2">[3]</ref>. From the statistics we have collected from micro-benchmarks and/or actual workflow executions, we form a Gaussian performance/cost prediction model for the various operators and (parts of) workflows executed in different (Big Data platform, cluster) pairs. 
The cost model is incrementally improved with new workflow executions.</p><p>Algorithmically, the Optimizer solves a multi-criteria optimization problem with objectives involving throughput, communication cost, processing and network latency, CPU/GPU and memory usage, as well as accuracy for synopses-based optimization purposes. Due to conflicting objectives, we cannot optimize them all simultaneously. Therefore, we resort to Pareto optimal solutions obtained by maximizing a weighted combination of these objectives under the clusters' capacity constraints.</p><p>Notably, no prior effort [2, 4, 5, 9] examined such a rich set of optimization objectives related to interactivity. Indicatively, only <ref type="bibr" target="#b1">[2]</ref> considers network performance metrics (but for batch processing scenarios). To derive Pareto optimal solutions, besides a baseline, Exhaustive Search algorithm that examines every possible mapping of logical to physical operators in a workflow, we have developed a novel variation of an A*-like algorithm. The classic A* algorithm cannot work in our setup as is, because it can only suggest optimal paths. This would mean that some logical operators would be left without a suggested physical operator mapping. Our novel A* variation employs heuristics equivalent to those of typical A* to prune the search space, but outputs a graph (instead of a path) with physical operators instantiating every logical one in the submitted workflow. This new algorithm can speed up the suggestion of the optimal physical workflow by orders of magnitude (see Section 4). We have also developed a Greedy and a Heuristic algorithm that leverage parallelism and domain knowledge to boost the performance further and to enable adaptivity (i.e., provide, in real time and incrementally, a new plan to adapt to ad-hoc performance hiccups). 
All algorithms employ the cost estimation models to evaluate alternative physical operators while exploring the space of alternative mappings.</p></div>
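A simplified view of the baseline Exhaustive Search over a weighted combination of objectives can be sketched in Python as follows. The catalog entries, metric names, weights and capacity figures are hypothetical stand-ins for the benchmarked statistics and cost models described above:

```python
from itertools import product

def score(stats, weights):
    """Scalarize the multi-criteria objective as a weighted combination."""
    return sum(weights[m] * stats[m] for m in weights)

def exhaustive_search(workflow, catalog, weights, capacity):
    """Try every logical-to-physical mapping; keep the best feasible plan."""
    best, best_score = None, float("-inf")
    options = [catalog[op] for op in workflow]  # physical candidates per operator
    for combo in product(*options):
        used = {}
        for choice in combo:
            used[choice["cluster"]] = used.get(choice["cluster"], 0) + choice["slots"]
        if any(used[c] > capacity[c] for c in used):
            continue                            # violates a cluster's capacity
        s = sum(score(c["stats"], weights) for c in combo)
        if s > best_score:
            best, best_score = list(combo), s
    return best
```

The A*-like, Greedy and Heuristic algorithms prune this exponential space instead of enumerating it, which is where the orders-of-magnitude speedups come from.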
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">SYNERGIES IN A REAL CASE STUDY</head><p>This section describes how our prototype works in a real use case scenario from the financial domain and how advanced synergies are established among the individual components. For exposition purposes, we employ a single cluster equipped with Flink and Spark.</p><p>The logical workflow submitted to the Optimizer is illustrated in Figure <ref type="figure" target="#fig_5">2</ref>(a). The workflow of Figure <ref type="figure" target="#fig_5">2</ref>(a) utilizes Level 1, Level 2 stock data 7 to discover highly positively/negatively correlated pairs of stocks. In Figure <ref type="figure" target="#fig_5">2</ref>(a), data arrive at a Kafka source. The Split operator separates Level 1 from Level 2 data. It directs Level 2 data to the bottom branch of the workflow. There, the Level 2 bids are Filtered (i.e., for monitoring only a subset of stocks). The bids per stock are counted using a Count logical operator. When a trade for a stock is realized, a new Level 1 tuple is directed by Split to the upper part of the workflow. A Project operator keeps only the timestamp of the trade for each stock. The Join operator joins the stock trade Level 1 tuple with the count of bids the stock received until the trade. The result is inserted in a time Window of recent such counts, forming a time series. Finally, the CorrelationMatrix of the stocks' time series is computed by the respective OMLDM operator.</p><p>Synopses-based Optimization. When the workflow is submitted, for every logical operator there are two platform choices, Flink or Spark. The Optimizer holds a dictionary to determine the equivalence of logical operators to physical ones in either platform. In Synopses-based Optimization, the Optimizer attempts to boost interactivity by treating the SDE as an additional platform. That is, its dictionary also holds mappings about which exact operators can be replaced by approximate ones of the SDE. 
7 www.investopedia.com/terms/l/level1.asp, www.investopedia.com/terms/l/level2.asp   Boosting Vertical Scalability. The SDE.DFT operator of the SDE, for every window it approximates, outputs a pair of values: a bucketID and a vector of reduced dimensionality compared to the original window. The bucketID is based on the first few coefficients of the DFT <ref type="bibr" target="#b10">[11]</ref> and is used as the key to hash stock streams to the learners of the CorrelationMatrix OMLDM operator (Figure <ref type="figure" target="#fig_5">2</ref>(b)). Reduced vector correlations are pruned for stocks hashed far away in the hashing space, since these stock streams are guaranteed to be highly (dis)similar <ref type="bibr" target="#b10">[11]</ref>. This boosts vertical scalability for a task of quadratic complexity in the number of input streams.</p><p>Pumped Parameter Server. The stream entering the CorrelationMatrix operator of OMLDM is duplicated to the training and prediction pipelines (Figure <ref type="figure" target="#fig_2">1(d)</ref>). Learners (buckets in our previous discussion) compute the pairwise correlations of the streams they receive (the correlation coefficient and the beta regression parameter coincide for standardized vectors), while the PS holds the current up-to-date correlation matrix, which is essentially the global model at the predictor and the PS sides. A concept drift for the global model occurs if the Frobenius norm measuring the difference between the matrix kept at the PS and the true correlation matrix exceeds a threshold. The Optimizer decomposes this constraint into local ones installed at each learner. The mathematical details are described in <ref type="bibr" target="#b15">[17]</ref>, but in a simple version of FGM, the Optimizer will prescribe that learners sync with the PS when a signed distance value, computed locally at each learner, becomes positive. 
Notice that the local constraints are determined by the Optimizer, which initially prescribes and then adapts the learners at runtime.</p><p>Preliminary Experimental Evaluation. We employ a Kafka cluster of 4 Dell PowerEdge R320 Intel Xeon E5-2430 v2 2.50GHz machines with 32GB RAM each and, for Flink and Spark Structured Streaming, a cluster of 10 Dell PowerEdge R300 Quad Core Xeon X3323 2.5GHz machines with 8GB RAM each. We use real Level 1 and Level 2 stock data (∼5000 stocks/∼10 TB) provided by a European financial company. Based on input from the data provider, we set the DFT to use 8 coefficients, CountMin to (𝜖 = 0.002, 𝛿 = 0.01), the correlation threshold to 0.9, and the time window to 5 minutes. When we monitor 500 streams, i.e., 250K/2 cells in the correlation matrix, the fact that "Parallelism" lacks synopses-based optimization and boosted vertical scalability makes its throughput ratio approach 1, i.e., that of the Naive approach. INFORE's performance remains steady when switching from 50 to 500 streams, improving over "Parallelism" by 3 times. The most important findings come upon switching to 5000 stocks (25M/2 cells in the matrix). Figure <ref type="figure" target="#fig_8">3</ref>(c) shows that, lacking enhanced horizontal and vertical scalability, "Parallelism" becomes equivalent to the Naive approach. INFORE exhibits more than an order of magnitude higher throughput, with the trade-off that highly correlated pairs are identified with 90% accuracy.</p><p>For federated scalability, we measure the bytes communicated among the operators. Due to space constraints, we only report here that INFORE can reduce communication by more than an order of magnitude when 5000 stock streams are monitored.</p></div>
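The pruning logic of the Boosting Vertical Scalability paragraph can be sketched as follows. This is a minimal illustration of DFT-based sketching in the spirit of [11], not the actual SDE.DFT implementation; all function names are ours. It assumes standardized (zero-mean, unit-norm) windows and a norm-preserving DFT, so that the distance over the retained coefficients lower-bounds the true Euclidean distance:

```python
import numpy as np

def standardize(window):
    """Zero-mean, unit-norm version of a series, so that the correlation
    of two standardized series x, y is their inner product x . y."""
    w = np.asarray(window, dtype=float)
    w = w - w.mean()
    return w / np.linalg.norm(w)

def dft_sketch(window, k=8):
    """First k coefficients of a norm-preserving DFT of the standardized
    window: a vector of reduced dimensionality compared to the window."""
    return np.fft.fft(standardize(window), norm="ortho")[:k]

def bucket_id(sketch, cell=0.5):
    """Grid cell keyed by a leading DFT coefficient; used to hash streams
    to learners, so that similar streams land in nearby buckets."""
    lead = np.array([sketch[1].real, sketch[1].imag])
    return tuple(np.floor(lead / cell).astype(int))

def may_be_correlated(s1, s2, threshold=0.9):
    """Dropping DFT coefficients only shrinks the Euclidean distance
    (Parseval), so 1 - d^2/2 upper-bounds the true correlation of the
    standardized series: pairs whose bound falls below the threshold
    are safely pruned without computing the exact correlation."""
    d2 = np.sum(np.abs(s1 - s2) ** 2)
    return 1.0 - d2 / 2.0 >= threshold
```

Pairs of streams hashed to distant buckets skip the exact correlation computation altogether, which is what makes the quadratic task tractable.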
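The learner-side synchronization condition of the Pumped Parameter Server can be illustrated with a simple thresholded variant; the actual FGM safe function of [17] is more refined, and the class and names below are illustrative only:

```python
import numpy as np

class Learner:
    """Sketch of an FGM-style learner-side check: the learner tracks the
    drift of its local correlation estimate from the global model at the
    last synchronization and requests a sync with the parameter server
    once a locally computed signed distance becomes positive."""

    def __init__(self, global_model, drift_threshold):
        # Model shipped by the PS at the last synchronization.
        self.synced_model = np.array(global_model, dtype=float)
        # Local share of the global Frobenius-norm drift budget,
        # prescribed (and adapted at runtime) by the Optimizer.
        self.drift_threshold = float(drift_threshold)

    def update(self, local_model):
        """Return the signed distance; a positive value triggers a sync."""
        drift = np.linalg.norm(local_model - self.synced_model, "fro")
        return drift - self.drift_threshold

    def sync(self, new_global_model):
        """Called when the PS broadcasts a fresh global model."""
        self.synced_model = np.array(new_global_model, dtype=float)
```

Between synchronizations no messages are exchanged, which is where the communication savings reported for federated scalability come from.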
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">CONCLUSIONS &amp; ONGOING WORK</head><p>We have presented the design, internal structure, and integration aspects of key components of INFORE, a prototype designed for interactive analytics at scale. To our knowledge, INFORE is the first system for cross-platform streaming analytics. We are currently enriching the SDE and OMLDM libraries to strengthen the support for the three scalability types and to enhance analytics capabilities.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>(a). Currently, INFORE supports streaming analytics workflows over Apache Flink, Spark Structured Streaming and Kafka. Cross-(Big Data) platform and cluster communication is performed via JSON-formatted Kafka messages. In a nutshell, INFORE works as follows: (Step 1) The user designs in a Graphical Editor (left of Figure 1(a)) the desired workflow composed of stream transformation, OMLDM and SDE operators. These are logical operators composing the logical view of the workflow, independent of the underlying platforms.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>Param. Server View.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Overall INFORE Architecture &amp; Internal Architecture of Key Components</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>(d). Two types of pipelines are engaged in the OMLDM architecture: training and prediction pipelines. Training pipelines. Training data streams follow the blue-colored path of Figure 1(d). They are ingested via the TrainingTopic. The rest of the training pipeline is implemented by a pair of Learner and ParameterServer FlatMaps. We follow a Parameter Server (PS) distributed model (Figure 1(e))</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head></head><label></label><figDesc>(a) Initial Workflow at the Graphical Editor. (b) Suggested Workflow via Synopses-based Optimization.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Initial and transformed workflow using Synopses-based Optimization.</figDesc><graphic coords="5,56.04,88.67,221.94,63.15" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head></head><label></label><figDesc>SDEaaS vs non-SDEaaS.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head></head><label></label><figDesc>Volume (TB) x Max Kafka Ingestion Rate. (c) Horizontal &amp; Vertical Scalability.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Performance Evaluation.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head></head><label></label><figDesc>Figure 3(b) shows the time it takes the two optimization algorithms (Exhaustive and A*-alike) to produce a physical workflow from the moment they receive the logical one, when 1 (Flink), 2 (Spark, Flink) or 3 (Spark, Flink, Synopses-based Optimization) platforms are available. As the figure demonstrates, for more than 1 platform our novel A*-alike algorithm reduces the time to prescribe a physical workflow by 1 to 4 orders of magnitude. For 1 platform (where the Optimizer still prescribes provisioned resources), the Exhaustive Search algorithm is faster, since the A*-alike one requires more initialization time. In Figure 3(c) we show the scalability of the INFORE approach versus a technique that executes the workflow of Figure 2(a) in Flink (denoted by "Parallelism"), without synopses-based optimization, boosted vertical scalability or the OMLDM component. INFORE executes the workflow of Figure 2(b). We plot the throughput ratio of INFORE and "Parallelism" over a Naive approach that executes the same workflow with parallelism set to 1. The two horizontal axes in the figure show the number of processed stock streams (bottom) and the volume in terabytes (top) ingested at the maximum Kafka rate. Thus, the top axis accounts for horizontal scalability (volume, speed), while the bottom axis accounts for vertical scalability.</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_0">https://docs.oracle.com/javase/tutorial/jmx/overview/, slurm.schedmd.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_1">https://www.elastic.co/what-is/elk-stack</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENTS</head><p>This work has received funding from the EU Horizon 2020 research and innovation program INFORE under grant agreement No 825070.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<idno type="DOI">10.1007/978-3-540-28608-0</idno>
		<ptr target="https://doi.org/10.1007/978-3-540-28608-0" />
		<title level="m">Data Stream Management -Processing High-Speed Data Streams</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">RHEEM: Enabling Cross-Platform Data Processing -May The Big Data Be With You!</title>
		<author>
			<persName><forename type="first">Divy</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanjay</forename><surname>Chawla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bertty</forename><surname>Contreras-Rojas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ahmed</forename><forename type="middle">K</forename><surname>Elmagarmid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yasser</forename><surname>Idris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zoi</forename><surname>Kaoudi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sebastian</forename><surname>Kruse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ji</forename><surname>Lucas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Essam</forename><surname>Mansour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mourad</forename><surname>Ouzzani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Papotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jorge-Arnulfo</forename><surname>Quiané-Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nan</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Saravanan</forename><surname>Thirumuruganathan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anis</forename><surname>Troudi</surname></persName>
		</author>
		<idno type="DOI">10.14778/3236187.3236195</idno>
		<ptr target="https://doi.org/10.14778/3236187.3236195" />
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">11</biblScope>
			<biblScope unit="page" from="1414" to="1427" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics</title>
		<author>
			<persName><forename type="first">Omid</forename><surname>Alipourfard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hongqiang</forename><surname>Harry Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jianshu</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shivaram</forename><surname>Venkataraman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Minlan</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ming</forename><surname>Zhang</surname></persName>
		</author>
		<ptr target="https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/alipourfard" />
	</analytic>
	<monogr>
		<title level="m">14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017</title>
				<editor>
			<persName><forename type="first">Aditya</forename><surname>Akella</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Jon</forename><surname>Howell</surname></persName>
		</editor>
		<meeting><address><addrLine>Boston, MA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017-03-27">2017. March 27-29, 2017</date>
			<biblScope unit="page" from="469" to="482" />
		</imprint>
	</monogr>
	<note>USENIX Association</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">IReS: Intelligent, Multi-Engine Resource Scheduler for Big Data Analytics Workflows</title>
		<author>
			<persName><forename type="first">Katerina</forename><surname>Doka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nikolaos</forename><surname>Papailiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dimitrios</forename><surname>Tsoumakos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christos</forename><surname>Mantas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nectarios</forename><surname>Koziris</surname></persName>
		</author>
		<idno type="DOI">10.1145/2723372.2735377</idno>
		<ptr target="https://doi.org/10.1145/2723372.2735377" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data</title>
				<editor>
			<persName><forename type="first">Timos</forename><forename type="middle">K</forename><surname>Sellis</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Susan</forename><forename type="middle">B</forename><surname>Davidson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Zachary</forename><forename type="middle">G</forename><surname>Ives</surname></persName>
		</editor>
		<meeting>the 2015 ACM SIGMOD International Conference on Management of Data<address><addrLine>Melbourne, Victoria, Australia</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2015-05-31">2015. May 31 -June 4, 2015</date>
			<biblScope unit="page" from="1451" to="1456" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A Demonstration of the BigDAWG Polystore System</title>
		<author>
			<persName><forename type="first">Aaron</forename><forename type="middle">J</forename><surname>Elmore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jennie</forename><surname>Duggan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mike</forename><surname>Stonebraker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Magdalena</forename><surname>Balazinska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ugur</forename><surname>Çetintemel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vijay</forename><surname>Gadepally</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Heer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bill</forename><surname>Howe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeremy</forename><surname>Kepner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tim</forename><surname>Kraska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Samuel</forename><surname>Madden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Maier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Timothy</forename><forename type="middle">G</forename><surname>Mattson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stavros</forename><surname>Papadopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeff</forename><surname>Parkhurst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nesime</forename><surname>Tatbul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Manasi</forename><surname>Vartak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stan</forename><surname>Zdonik</surname></persName>
		</author>
		<idno type="DOI">10.14778/2824032.2824098</idno>
		<ptr target="https://doi.org/10.14778/2824032.2824098" />
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="1908" to="1911" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">At NYSE, The Data Deluge Overwhelms Traditional Databases</title>
		<ptr target="https://www.forbes.com/sites/tomgroenfeldt/2013/02/14/at-nyse-the-data-deluge-overwhelms-traditional-databases/#362df2415aab" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">INforE: Interactive Cross-platform Analytics for Everyone</title>
		<author>
			<persName><forename type="first">Nikos</forename><surname>Giatrakos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Arnu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Theodoros</forename><surname>Bitsakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Antonios</forename><surname>Deligiannakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Minos</forename><forename type="middle">N</forename><surname>Garofalakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ralf</forename><surname>Klinkenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aris</forename><surname>Konidaris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Antonis</forename><surname>Kontaxakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yannis</forename><surname>Kotidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vasilis</forename><surname>Samoladas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alkis</forename><surname>Simitsis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">George</forename><surname>Stamatakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fabian</forename><surname>Temme</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mate</forename><surname>Torok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Edwin</forename><surname>Yaqub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Arnau</forename><surname>Montagud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Miguel</forename><surname>Ponce De Leon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Holger</forename><surname>Arndt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefan</forename><surname>Burkard</surname></persName>
		</author>
		<idno type="DOI">10.1145/3340531.3417435</idno>
		<ptr target="https://doi.org/10.1145/3340531.3417435" />
	</analytic>
	<monogr>
		<title level="m">CIKM &apos;20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event</title>
				<meeting><address><addrLine>, Ireland</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2020-10-19">2020. October 19-23, 2020</date>
			<biblScope unit="page" from="3389" to="3392" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Interactive Extreme-Scale Analytics Towards Battling Cancer</title>
		<author>
			<persName><forename type="first">Nikos</forename><surname>Giatrakos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nikos</forename><surname>Katzouris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Antonios</forename><surname>Deligiannakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexander</forename><surname>Artikis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Minos</forename><forename type="middle">N</forename><surname>Garofalakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">George</forename><surname>Paliouras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Holger</forename><surname>Arndt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Raffaele</forename><surname>Grasso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ralf</forename><surname>Klinkenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Miguel</forename><surname>Ponce De Leon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gian</forename><surname>Gaetano Tartaglia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alfonso</forename><surname>Valencia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dimitrios</forename><surname>Zissis</surname></persName>
		</author>
		<idno type="DOI">10.1109/MTS.2019.2913071</idno>
		<ptr target="https://doi.org/10.1109/MTS.2019.2913071" />
	</analytic>
	<monogr>
		<title level="j">IEEE Technol. Soc. Mag</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="54" to="61" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Musketeer: all for one, one for all in data processing systems</title>
		<author>
			<persName><forename type="first">Ionel</forename><surname>Gog</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Malte</forename><surname>Schwarzkopf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Natacha</forename><surname>Crooks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matthew</forename><forename type="middle">P</forename><surname>Grosvenor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Allen</forename><surname>Clement</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Steven</forename><surname>Hand</surname></persName>
		</author>
		<idno type="DOI">10.1145/2741948.2741968</idno>
		<ptr target="https://doi.org/10.1145/2741948.2741968" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Tenth European Conference on Computer Systems, EuroSys 2015</title>
				<editor>
			<persName><forename type="first">Laurent</forename><surname>Réveillère</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Tim</forename><surname>Harris</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Maurice</forename><surname>Herlihy</surname></persName>
		</editor>
		<meeting>the Tenth European Conference on Computer Systems, EuroSys 2015<address><addrLine>Bordeaux, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015-04-21">2015. April 21-24, 2015</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page">16</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Benchmarking Distributed Stream Data Processing Systems</title>
		<author>
			<persName><forename type="first">Jeyhun</forename><surname>Karimov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tilmann</forename><surname>Rabl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Asterios</forename><surname>Katsifodimos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Roman</forename><surname>Samarev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Henri</forename><surname>Heiskanen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Volker</forename><surname>Markl</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICDE.2018.00169</idno>
		<ptr target="https://doi.org/10.1109/ICDE.2018.00169" />
	</analytic>
	<monogr>
		<title level="m">34th IEEE International Conference on Data Engineering, ICDE 2018</title>
				<meeting><address><addrLine>Paris, France</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2018-04-16">2018. April 16-19, 2018</date>
			<biblScope unit="page" from="1507" to="1518" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A Synopses Data Engine for Interactive Extreme-Scale Analytics</title>
		<author>
			<persName><forename type="first">Antonis</forename><surname>Kontaxakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nikos</forename><surname>Giatrakos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Antonios</forename><surname>Deligiannakis</surname></persName>
		</author>
		<idno type="DOI">10.1145/3340531.3412154</idno>
		<ptr target="https://doi.org/10.1145/3340531.3412154" />
	</analytic>
	<monogr>
		<title level="m">CIKM &apos;20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event</title>
				<meeting><address><addrLine>, Ireland</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2020-10-19">2020. October 19-23, 2020</date>
			<biblScope unit="page" from="2085" to="2088" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Scaling Distributed Machine Learning with the Parameter Server</title>
		<author>
			<persName><forename type="first">Mu</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><forename type="middle">G</forename><surname>Andersen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun</forename><forename type="middle">Woo</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexander</forename><forename type="middle">J</forename><surname>Smola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Amr</forename><surname>Ahmed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vanja</forename><surname>Josifovski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">James</forename><surname>Long</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eugene</forename><forename type="middle">J</forename><surname>Shekita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bor-Yiing</forename><surname>Su</surname></persName>
		</author>
		<ptr target="https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu" />
	</analytic>
	<monogr>
		<title level="m">11th USENIX Symposium on Operating Systems Design and Implementation, OSDI &apos;14</title>
				<editor>
			<persName><forename type="first">Jason</forename><surname>Flinn</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Hank</forename><surname>Levy</surname></persName>
		</editor>
		<meeting><address><addrLine>Broomfield, CO, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014-10-06">2014. October 6-8, 2014</date>
			<biblScope unit="page" from="583" to="598" />
		</imprint>
	</monogr>
	<note>USENIX Association</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Integrating real-time and batch processing in a polystore</title>
		<author>
			<persName><forename type="first">John</forename><surname>Meehan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stan</forename><surname>Zdonik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shaobo</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yulong</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nesime</forename><surname>Tatbul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adam</forename><surname>Dziedzic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aaron</forename><forename type="middle">J</forename><surname>Elmore</surname></persName>
		</author>
		<idno type="DOI">10.1109/HPEC.2016.7761585</idno>
		<ptr target="https://doi.org/10.1109/HPEC.2016.7761585" />
	</analytic>
	<monogr>
		<title level="m">IEEE High Performance Extreme Computing Conference, HPEC 2016</title>
				<meeting><address><addrLine>Waltham, MA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2016-09-13">2016. September 13-15, 2016</date>
			<biblScope unit="page" from="1" to="7" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Automatic Fusion of Satellite Imagery and AIS data for Vessel Detection</title>
		<author>
			<persName><forename type="first">Aristides</forename><surname>Milios</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Konstantina</forename><surname>Bereta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Konstantinos</forename><surname>Chatzikokolakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dimitris</forename><surname>Zissis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stan</forename><surname>Matwin</surname></persName>
		</author>
		<ptr target="https://ieeexplore.ieee.org/document/9011339" />
	</analytic>
	<monogr>
		<title level="m">22nd International Conference on Information Fusion, FUSION 2019</title>
				<meeting><address><addrLine>Ottawa, ON, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019-07-02">2019. July 2-5, 2019</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">SnappyData. In Encyclopedia of Big Data Technologies</title>
		<author>
			<persName><forename type="first">Barzan</forename><surname>Mozafari</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-63962-8_258-1</idno>
		<ptr target="https://doi.org/10.1007/978-3-319-63962-8_258-1" />
		<editor>Sherif Sakr and Albert Y. Zomaya</editor>
		<imprint>
			<date type="published" when="2019">2019</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Functional Geometric Monitoring for Distributed Streams</title>
		<author>
			<persName><forename type="first">Vasilis</forename><surname>Samoladas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Minos</forename><forename type="middle">N</forename><surname>Garofalakis</surname></persName>
		</author>
		<idno type="DOI">10.5441/002/edbt.2019.09</idno>
		<ptr target="https://doi.org/10.5441/002/edbt.2019.09" />
	</analytic>
	<monogr>
		<title level="m">Advances in Database Technology -22nd International Conference on Extending Database Technology, EDBT 2019</title>
				<meeting><address><addrLine>Lisbon, Portugal</addrLine></address></meeting>
		<imprint>
			<publisher>OpenProceedings</publisher>
			<date type="published" when="2019-03-26">2019. March 26-29, 2019</date>
			<biblScope unit="page" from="85" to="96" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Yahoo DataSketches</title>
		<ptr target="https://datasketches.github.io/" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note>last accessed</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
