<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
<title level="a" type="main">Extreme-Scale Interactive Cross-Platform Streaming Analytics - The INFORE Approach</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Antonios</forename><surname>Deligiannakis</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Nikos</forename><surname>Giatrakos</surname></persName>
							<email>ngiatrakos@athenarc.gr</email>
						</author>
						<author>
							<persName><forename type="first">Yannis</forename><surname>Kotidis</surname></persName>
							<email>kotidis@aueb.gr</email>
						</author>
						<author>
							<persName><forename type="first">Vasilis</forename><surname>Samoladas</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Alkis</forename><surname>Simitsis</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">Athena Research Center</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="institution">Technical University of Crete</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="institution">Athena Research Center</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<orgName type="institution">Technical University of Crete</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff4">
								<orgName type="institution">Athens University of Economics and Business</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff5">
								<orgName type="institution">Athena Research Center</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff6">
								<orgName type="institution">Technical University of Crete</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff7">
								<orgName type="department">Athena Research Center</orgName>
							</affiliation>
						</author>
<title level="a" type="main">Extreme-Scale Interactive Cross-Platform Streaming Analytics - The INFORE Approach</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">B398223E46ADE3758B6DB8964E39BCAB</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T08:55+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Nowadays, analytics are at the core of many emerging applications and can greatly benefit from the abundance of data and the progress in the processing capabilities of modern hardware. Still, new challenges arise from the extreme complexity of designing applications that fully exploit the potential of such offerings. In this work, we describe the main challenges in delivering advanced, cross-platform, streaming analytics at scale with the primary goal of interactivity. We present the architecture of a prototype system designed from scratch to tackle such challenges and describe the internals of its components. Finally, we discuss key added-value synergies in serving target analytics workflows in real-world applications and present a preliminary performance evaluation.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Interactive analytics at scale lie at the core of many diverse applications. In life sciences, studying the effect of drug applications on simulated tumors may produce data streams of 100Gb/min <ref type="bibr" target="#b7">[8]</ref>. These streams need to be analyzed online to interactively determine successive drug combinations. In the financial domain, NYSE alone generates several terabytes of stock data per day <ref type="bibr" target="#b5">[6]</ref>. Interactive analytics enable timely reaction to investment opportunities or risks. In maritime surveillance, one needs to fuse heterogeneous, voluminous data composed of position streams of thousands of vessels, satellite images, and thermal camera or acoustic signal streams of on-site devices to investigate ongoing illegal activities at sea <ref type="bibr" target="#b13">[14]</ref>.</p><p>Relevant data not only originate from a variety of heterogeneous sources but also often arrive at multiple, geo-dispersed clusters. In life sciences, several efforts aim at federating analytics across data centers 1 . Similarly, discovering dependencies or correlations among economies or industries involves global analysis of stock streams originating from various local market data centers. In maritime data analysis, streams come from a number of ground- or satellite-based receivers before being ingested into proximate data centers. Additionally, Big Data technologies are significantly fragmented. Each of these scenarios involves delivering analytics over networked clusters, each hosting a variety of Big Data platforms.</p><p>Typically, such scenarios are designed as streaming analytics workflows. 
Our goal is to provide a solution for executing global, cross-platform, streaming analytics workflows over a network of computer clusters hosting diverse Big Data technologies, with high interactivity; i.e., continuously deriving the output of the workflow in real-time and adapting to ad-hoc changes required due to observed results or system performance metrics, at runtime.</p><p>To this extent, streaming Big Data platforms such as Apache Spark, Flink, Kafka 2 approach the problem by ensuring horizontal scalability, i.e., by scaling out (parallelizing) the computation to a number of Virtual Machines (VMs) and providing APIs to program distributed Big Data processing pipelines. These platforms are not designed to support cross-platform execution scenarios. They also provide no or only suboptimal support for online Machine Learning (ML) or Data Mining (DM) operators, since the major ML APIs they provide do not focus on streaming data. Frameworks such as Apache Beam or Apache SAMOA 3 allow for creating code that is executable on different, supported Big Data platforms. Nevertheless, this does not remedy the lack of resource management in multi-cluster and cross-platform settings.</p><p>Related systems such as Rheem <ref type="bibr" target="#b1">[2]</ref>, IReS <ref type="bibr" target="#b3">[4]</ref>, BigDawg <ref type="bibr" target="#b4">[5]</ref>, Musketeer <ref type="bibr" target="#b8">[9]</ref> are designed towards cross-platform execution of global workflows, but they can only handle batch, instead of stream, processing pipelines. Even in systems like BigDawg or Rheem, which support stream processing in S-Store <ref type="bibr" target="#b12">[13]</ref> and JavaStreams respectively, cross-streaming platform execution remains unsupported.</p><p>Our work aims to fill the gap in efficiently handling advanced, cross-platform, interactive analytics at scale. We describe the architecture of an open-source system, namely INFORE 4 , designed from scratch for this purpose. 
A recent INFORE demo <ref type="bibr" target="#b6">[7]</ref> (Best Demo Award at CIKM 2020) showcases how code-free, cross-platform analytics are feasible through user-friendly, graphical design and execution of streaming workflows. <ref type="bibr" target="#b6">[7]</ref> presents the frontend, while in the current paper we focus on the backend. We detail its key features, three of its core components, and inter-component synergies in real-world applications. We also present a preliminary performance evaluation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">OVERVIEW OF THE APPROACH</head><p>Requirements and Key Features. Based on our interaction with real-world practitioners and applications, early in INFORE's design we identified three scalability requirements that should complement the interactivity and cross-streaming platform execution features.</p><p>Enhanced horizontal scalability. Parallelization alone is not adequate to handle extreme-scale scenarios. The number of available VMs in private clouds has a limit when it comes to extreme scale, while scaling out on public cloud providers incurs monetary charges.</p><p>Vertical scalability. Need to scale the computation with the number of processed streams. For instance, computing the correlation matrix of stock data streams results in a quadratic explosion in space and computational complexity, which is infeasible for thousands or millions of streams in our motivating scenarios.</p><p>Federated scalability. Need to scale the computation in settings where data arrive at multiple, geo-dispersed sites. In such settings, even when we do everything right within each cluster, the potential for interactivity is network bound <ref type="bibr" target="#b9">[10]</ref>.</p><p>Overview of Key Components. INFORE employs three core components for realizing the aforementioned key requirements.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Synopses Data Engine (SDE).</head><p>There is a wide consensus in stream processing <ref type="bibr" target="#b0">[1]</ref> that approximate but rapid answers to analytics tasks, more often than not, suffice. Synopses, such as samples, sketches, histograms <ref type="bibr" target="#b0">[1]</ref>, provide approximate answers, with accuracy guarantees, to popular (cardinality, frequency moment, correlation, set membership, quantile) analytic operators. The maintenance and use of data synopses is essential to stream processing applications because applications get only one look at the data. A synopsis serves as a compact, approximate version of the data which, due to its small size, can be processed and iterated upon without reading the entire, potentially infinite, data stream multiple times.</p><p>The SDE builds and maintains such synopses across platforms and clusters. It achieves enhanced horizontal scalability because synopses shed the load assigned to each VM in a cluster and reduce memory usage. For federated scalability, it reduces communication by transmitting compact data summaries. For interactivity, we add a unique SDE-as-a-Service (SDEaaS) design where the SDE is a constantly running service (job, in one or more clusters) that can accept any query, create synopses on new streams, or plug in code for new synopses at runtime, with zero downtime for running workflows. For vertical scalability, the SDEaaS design maintains thousands of synopses for thousands of streams, in settings where prior efforts <ref type="bibr" target="#b14">[15,</ref><ref type="bibr">16,</ref><ref type="bibr">18,</ref><ref type="bibr" target="#b16">19]</ref> can maintain only a few tens of synopses. We henceforth use the terms SDE and SDEaaS interchangeably.</p><p>Online Machine Learning and Data Mining (OMLDM). 
OMLDM applies distributed, online ML and DM adopting a Parameter Server (PS) paradigm <ref type="bibr" target="#b11">[12]</ref> (Figure <ref type="figure" target="#fig_2">1</ref>(e)). The PS paradigm enhances horizontal and federated scalability via the option of an asynchronous (besides synchronous) synchronization policy to reduce the effect of stragglers and bandwidth consumption <ref type="bibr" target="#b11">[12]</ref>, respectively. Regarding interactivity, the OMLDM component allows users to interactively adjust the number of learners and also to deploy the current up-to-date ML/DM model at the PS side, should concept drifts occur, while the training process proceeds at the parallel learners. Established synergies with the SDE and the Optimizer components also boost vertical and federated scalability.</p><p>Optimizer Component. The Optimizer is a prerequisite to effectively use the SDE and the OMLDM Components towards interactivity. The Optimizer picks (a) the networked cluster, (b) the Big Data platforms available in that cluster, and (c) the provisioned resources for each SDE, OMLDM (or other) operator of the global workflow. A good solution is determined by taking into account performance criteria and cost models related to interactivity, involving throughput, (network and processing) latency, memory usage, etc. These components cooperate as follows (more in Section 4).</p><p>Synopses-based Optimization. Given a workflow, the Optimizer produces alternative physical execution plans where it replaces exact (e.g., count, window, correlation) operators with equivalent, but approximate, ones from the SDE. It then provides suggestions for equivalent, but approximate, workflows accompanied by the expected interactivity (execution speed-up) vs. accuracy trade-off.</p><p>Boosting Vertical Scalability. The SDE Component boosts vertical scalability of the OMLDM Component in ways that are not possible otherwise. 
Indicatively, we use Discrete Fourier Transform (DFT) and Locality Sensitive Hashing (LSH) bitmaps <ref type="bibr" target="#b10">[11]</ref> to bucketize (hash) streams during correlation matrix computation, outlier detection, clustering or feature extraction. Vertical scalability is achieved by hashing streams to learners based on their DFT or LSH coefficients. The complexity of all these operators is reduced since computations are pruned for streams that do not end up nearby in the hashing space.</p><p>Pumped Parameter Server. With the help of the Optimizer, we extend the OMLDM Component with an additional synchronization policy, besides the synchronous and asynchronous communication entailed by the classic PS design <ref type="bibr" target="#b11">[12]</ref>. We incorporate a novel communication protocol, termed Functional Geometric Monitoring (FGM) <ref type="bibr" target="#b15">[17]</ref>. This protocol strengthens horizontal (within a cluster) and federated scalability by bridging the gap between synchronous and asynchronous communication. Instead of having learners communicate in predefined rounds/batches (synchronous) or whenever each one is updated (asynchronous), FGM requires communication only when a concept drift is likely to have occurred.</p><p>INFORE Architecture. We develop a prototype of the INFORE system incorporating operators from SDE, OMLDM and stream transformations provided by the streaming APIs of Big Data platforms such as Apache Flink or Spark. An overview of the INFORE architecture is illustrated in Figure <ref type="figure" target="#fig_2">1</ref>   </p></div>
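To make the DFT-based bucketization concrete, the following is a minimal, illustrative Python sketch (the actual INFORE operators run as distributed Flink/Spark jobs; all function names and the quantization grid below are hypothetical). It hashes standardized stream windows to buckets by quantizing their leading DFT coefficients, so that correlation checks are pruned to within-bucket pairs:

```python
import cmath
import math

def dft_coefficients(window, k=3):
    """Leading k DFT coefficients of a standardized window (naive O(n*k) DFT)."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n) or 1.0
    z = [(x - mean) / std for x in window]  # standardize the window
    return [sum(z[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / n
            for f in range(1, k + 1)]

def bucket_id(window, k=3, grid=0.5):
    """Quantize the leading coefficients into a coarse hash bucket (the bucketID)."""
    return tuple((round(c.real / grid), round(c.imag / grid))
                 for c in dft_coefficients(window, k))

def candidate_pairs(streams, k=3, grid=0.5):
    """Prune the quadratic pair space: only same-bucket pairs are compared."""
    buckets = {}
    for name, window in streams.items():
        buckets.setdefault(bucket_id(window, k, grid), []).append(name)
    return [(a, b) for group in buckets.values()
            for i, a in enumerate(group) for b in group[i + 1:]]
```

Two streams that are equal up to shifting and positive scaling standardize to the same window and land in the same bucket, remaining correlation candidates, while streams with clearly different spectra are pruned without ever being compared.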
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">KEY-COMPONENT INTERNALS</head><p>This section elaborates on INFORE's key components. Due to space considerations, we describe the concepts using Flink as an example implementation, but the design is generic enough to support operators in all the streaming platforms mentioned above.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">SDE Component</head><p>The architecture of the SDE Component is illustrated in Figure <ref type="figure" target="#fig_2">1(b)</ref>. In streaming settings, multiple workflows may run continuously, providing output streams to application interfaces for timely decision making. The SDEaaS design we adopt (see Section 2) can serve multiple workflows, running concurrently and continuously, with approximate, commonly used operators (e.g., counts, sums, frequency moments), reusing, instead of duplicating, streams and synopses common to workflows. Moreover, for every new synopsis maintenance request, SDEaaS reserves new execution tasks for the operators in Figure <ref type="figure" target="#fig_2">1</ref>(b), while without SDEaaS we would reserve entire threads (task slots in terms of Flink). In this way, SDEaaS manages thousands of synopses, on-demand, for thousands of streams, hence ensuring vertical scalability <ref type="bibr" target="#b10">[11]</ref>.</p><p>To briefly showcase the performance benefits of our approach, Figure <ref type="figure" target="#fig_8">3</ref>(a) shows the results of an experiment over a cluster with 40 task slots. (The hardware details are not important for this experiment.) We compare SDEaaS to a non-SDEaaS approach, such as Proteus <ref type="bibr">[16]</ref>, that extends Flink with data synopsis utilities but lacks the SDEaaS paradigm. We design an experiment where we start with maintaining 2 Count-Min sketches for frequency estimations on the (volume, price) pairs of financial data (stocks). Then, we issue requests for maintaining more Count-Min sketches, up to 5000 sketches/stocks. We do that without stopping the already running synopses. Figure <ref type="figure" target="#fig_8">3(a)</ref> shows the sum of the throughputs of all running jobs for non-SDEaaS and the throughput of SDEaaS. The non-SDEaaS approach starts new jobs for new synopses and assigns at least one entire task slot to each new job. 
Hence, the 40 task slots in our cluster are depleted and the ✘ signs in Figure <ref type="figure" target="#fig_8">3</ref>(a) denote that non-SDEaaS cannot maintain more than 40 synopses. SDEaaS has no such limit since it assigns new tasks to a single running service (job) at runtime, with zero downtime. SDEaaS also exhibits higher throughput than non-SDEaaS due to fine-grained resource utilization at the task, instead of the task slot, level.</p><p>The extensible SDE Library currently includes a variety of synopsis techniques for cardinality, frequency moment, correlation, set membership or quantile estimation <ref type="bibr" target="#b0">[1]</ref>. Please check <ref type="bibr" target="#b10">[11]</ref> for further details. The SDE API allows on-the-fly requests for building/stopping synopses, plugging in code for new synopses in the SDE Library, and ad-hoc and continuous queries. To enable the SDEaaS design for interactivity and vertical scalability, the SDE Library is built using subtype polymorphism and inheritance in Java. This is the key to allowing new synopsis code to be loaded at runtime and all requests to be issued on-the-fly by the Manager module, instead of deploying separate jobs. Every new synopsis simply extends a Synopsis class with its detailed algorithms for add (streaming updates), estimate (query answering) and merge, which outputs an estimation for mergeable synopses <ref type="bibr" target="#b0">[1]</ref> kept at multiple workers or clusters.</p><p>When a synopsis operator of the SDE is included in a workflow, every request arriving via the RequestTopic in Figure <ref type="figure" target="#fig_2">1</ref>(b) for building or querying a running synopsis follows the red-colored path. The RegisterSynopsis and RegisterRequest FlatMaps initially produce and then look up the same keys pointing to workers that should maintain the synopsis, each for a different purpose. 
RegisterSynopsis uses these keys to direct streaming tuples, which follow the blue-colored path, to the assigned workers to update the synopsis. RegisterRequest uses the same keys to direct queries to those workers and derive estimations at the estimation FlatMap by reading the status of the synopsis kept at the add FlatMap.</p><p>When a streaming update arrives at the Kafka DataTopic, the FlatMap termed HashData looks up the keys of the workers produced by RegisterSynopsis and directs the update to the proper workers. There, the add FlatMap includes the update in the maintained synopsis. For instance, in case an FM sketch is maintained <ref type="bibr" target="#b0">[1]</ref>, the add operation hashes the incoming tuple to a position of the maintained bitmap and turns the corresponding bit to 1 if it is not already set. Upon a query, the estimation FlatMap uses the lowest unset bit of that bitmap to derive a cardinality estimation <ref type="bibr" target="#b0">[1]</ref>.</p><p>The rest of the operators on the right-hand side of Figure <ref type="figure" target="#fig_2">1</ref>(b) are used for appropriately controlling the output topics. The splitter (split operator in Flink) directs the estimation to the OutputTopic if a synopsis is defined on a single stream handled by one worker (green path). If a synopsis engages many streams (e.g., many stocks) maintained at multiple workers (purple path) or multiple clusters (yellow path; clusters communicate via the UnionTopic of Kafka), the merge FlatMap synthesizes the estimations (for instance, executing a logical disjunction operation on the various FM sketches <ref type="bibr" target="#b0">[1]</ref>) before outputting the estimation via the OutputTopic.</p></div>
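The add/estimate/merge contract can be sketched as follows. This is an illustrative Python rendition of the pattern only; the actual SDE Library implements it with Java subtype polymorphism inside Flink FlatMaps, and all names here beyond add, estimate and merge are hypothetical:

```python
import hashlib
from abc import ABC, abstractmethod

class Synopsis(ABC):
    """Base class mirroring the SDE contract: add, estimate, merge."""
    @abstractmethod
    def add(self, item): ...
    @abstractmethod
    def estimate(self): ...
    @abstractmethod
    def merge(self, other): ...

class FMSketch(Synopsis):
    """Flajolet-Martin sketch: cardinality estimation over a small bitmap."""
    PHI = 0.77351  # standard FM correction factor

    def __init__(self, bits=32):
        self.bits = bits
        self.bitmap = 0

    def _rho(self, item):
        # position of the least-significant set bit of a (stable) hash
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        return (h & -h).bit_length() - 1 if h else self.bits - 1

    def add(self, item):
        # turn the corresponding bit to 1 if it is not already set
        self.bitmap |= 1 << min(self._rho(item), self.bits - 1)

    def estimate(self):
        # R = position of the lowest unset bit of the bitmap
        r = 0
        while (self.bitmap >> r) & 1:
            r += 1
        return 2 ** r / self.PHI

    def merge(self, other):
        # mergeable synopses from different workers combine via bitwise OR
        merged = FMSketch(self.bits)
        merged.bitmap = self.bitmap | other.bitmap
        return merged
```

Merging two workers' sketches yields the same bitmap as a single sketch built over the union of their streams, which is exactly the property the merge FlatMap exploits across workers (purple path) and clusters (yellow path).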
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">OMLDM Component</head><p>The OMLDM component fills the gap of existing batch-processing-focused APIs such as Spark's MLlib or FlinkML. Its internal architecture is illustrated in Figure <ref type="figure" target="#fig_2">1</ref>(d), where an interactively defined number of identical learners distribute amongst themselves the computational load of training on an incoming data stream of training samples. In a nutshell, the PS holds and updates the state of the learning process, including the current version of the trained model. Learners compute the updates to the model and coordinate via the PS. This is because the iterative nature of ML/DM computations requires communication between all parallel workers, and this is done by the PS in each pipeline. Training pipelines are data sinks, that is, they do not generate an output stream. However, they do have a Feedback Kafka topic to support the iterations required by the involved tasks. The Feedback topic is also read by the predictors of the red path, discussed below, to update their models upon a concept drift. For instance, should training on the labeled streams of a classification task take place in parallel with prediction (tagging unlabeled streams), then, in case of a concept drift, the PS sends a new model to the predictors via the FeedbackTopic.</p><p>For training pipelines, it is the pipeline's responsibility to coordinate the processing between the learners and the PS. As noted in Section 2, the supported coordination policies include synchronous, asynchronous and FGM, a novel and bandwidth-efficient coordination policy that we introduced in <ref type="bibr" target="#b15">[17]</ref>. Performance-wise, the synchronous policy does not encourage enhanced horizontal scalability because, when many learners are used, the total utilization is usually low even if only a few stragglers exist. 
The asynchronous one is the policy of choice in large-scale ML; the processing speed is much higher when many learners are used, and therefore training is more scalable. The FGM policy provides both enhanced horizontal and federated scalability because (a) the learners do not need to await each other's sync in rounds or contact the PS immediately with each update, and (b) communication takes place asynchronously, but only when a concept drift may have occurred. This is achieved in synergy with the Optimizer (see Section 4).</p><p>Figure <ref type="figure" target="#fig_2">1(d)</ref> shows an instance of the PS and learners running on a Flink cluster. The messaging interface between the learners and the PS is handled via Kafka, and hence it is generic and independent of the underlying Big Data platform.</p><p>Prediction pipelines. These are shown with the red-colored path of Figure <ref type="figure" target="#fig_2">1(d)</ref>. Prediction pipelines do not train the models they apply; these models are loaded at the Predictor FlatMap via control messages. A model can be changed interactively by the PS at runtime, if there has been a concept drift, via the FeedbackTopic. Prediction pipelines employ a model for classification, interpolation or inference and transform input to output streams, which can be processed further by downstream operators of a global workflow. The control topics (green arrows and Kafka topics in Figure <ref type="figure" target="#fig_2">1(d)</ref>) deliver control messages to and from an OMLDM job. The control API understands Create/Read/Update/Delete (CRUD) commands for each type of resource. For instance, using control messages the OMLDM component will respond with the current status of the PS and learners (e.g. 
how many learners are currently running) or the currently up-to-date model at the PS.</p><p>OMLDM currently supports (i) classification: Passive Aggressive Classifier, Online Support Vector Machines and Vertical Hoeffding Trees, (ii) clustering/outlier detection: BIRCH, Online k-Means and StreamKM++ and (iii) regression: Passive Aggressive Regressor, Online Ridge and Polynomial Regression, and Autoregressive Integrated Moving Average. Facilities for correlation matrix computation, standardization and polynomial feature extraction are also available.  </p></div>
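The training-side interplay between the PS and the learners can be sketched as follows. This is a toy Python rendition of the synchronous policy with a hypothetical least-squares SGD learner; the real component runs distributed on Flink with Kafka-based messaging, and every name here is illustrative:

```python
import random

class ParameterServer:
    """Holds the global model state that learners pull from and push updates to."""
    def __init__(self, dim):
        self.model = [0.0] * dim

    def sync(self, learner_models):
        # synchronous policy: one averaging round over all learners' models
        k = len(learner_models)
        self.model = [sum(ws) / k for ws in zip(*learner_models)]

class Learner:
    """Trains a local model (here: SGD for least squares) on its stream shard."""
    def __init__(self, dim, lr=0.05):
        self.w = [0.0] * dim
        self.lr = lr

    def train(self, samples):
        for x, y in samples:
            err = sum(wi * xi for wi, xi in zip(self.w, x)) - y
            self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        return self.w

def train_rounds(ps, learners, shards, rounds):
    for _ in range(rounds):
        models = []
        for learner, shard in zip(learners, shards):
            learner.w = list(ps.model)           # pull the current global model
            models.append(list(learner.train(shard)))
        ps.sync(models)                          # push: one sync per round
    return ps.model
```

Under the asynchronous policy, each learner would push its update as soon as it is computed, and under FGM only when its local drift condition fires; only the coordination step changes, not the learner logic.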
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Optimizer Component</head><p>Given a workflow composed of various operators, we have to properly tune their execution, prescribing both the Big Data platform and the networked computer cluster, so that good overall performance with respect to interactivity-related metrics is achieved.</p><p>The internal architecture of the INFORE Optimizer is shown in Figure <ref type="figure" target="#fig_2">1(c</ref>). There are a number of submodules involved in providing the logical-to-physical operator mappings to be deployed for execution. First, we use a statistics collector to derive performance measurements from each executed workflow. Statistics are collected via JMX or Slurm <ref type="foot" target="#foot_0">5</ref> and are ingested in an ELK stack<ref type="foot" target="#foot_1">6</ref> while monitoring jobs.</p><p>We have developed a Benchmarking submodule that automates the acquisition of performance metrics for every supported operator run on different Big Data platforms. The Benchmarking submodule circumvents steps 1-3 in Figure <ref type="figure" target="#fig_2">1</ref>(a) and enables direct job submission to the underlying clusters, engaging supported operators and stream sources. In this way, the execution of various operators and (parts of) workflows on different (Big Data platform, cluster) pairs is benchmarked a priori, statistics are collected and performance models are built to avoid a cold start.</p><p>Cost models are derived with a Bayesian Optimization approach inspired by CherryPick <ref type="bibr" target="#b2">[3]</ref>. From the statistics we have collected from micro-benchmarks and/or actual workflow executions, we form a Gaussian performance/cost prediction model for the various operators and (parts of) workflows executed in different (Big Data platform, cluster) pairs. 
The cost model is incrementally improved with new workflow executions.</p><p>Algorithmically, the Optimizer solves a multi-criteria optimization problem with objectives involving throughput, communication cost, processing and network latency, CPU/GPU and memory usage, as well as accuracy for synopses-based optimization purposes. Due to conflicting objectives, we cannot optimize them all simultaneously. Therefore, we resort to Pareto optimal solutions obtained by maximizing a weighted combination of these objectives under the clusters' capacity constraints.</p><p>Notably, no prior effort [2, 4, 5, 9] examined such a rich set of optimization objectives related to interactivity. Indicatively, only <ref type="bibr" target="#b1">[2]</ref> considers network performance metrics (but for batch processing scenarios). To derive Pareto optimal solutions, besides a baseline, Exhaustive Search algorithm that examines every possible mapping of logical to physical operators in a workflow, we have developed a novel variation of an A*-like algorithm. The classic A* algorithm cannot work in our setup as is, because it can only suggest optimal paths. This would mean that some logical operators would be left without a suggested physical operator mapping. Our novel A* variation employs heuristics equivalent to those of typical A* to prune the search space, but outputs a graph (instead of a path) with physical operators instantiating every logical one in the submitted workflow. This new algorithm can speed up the suggestion of the optimal physical workflow by orders of magnitude (see Section 4). We have also developed a Greedy and a Heuristic algorithm that leverage parallelism and domain knowledge to boost the performance further and to enable adaptivity (i.e., provide, in real time and incrementally, a new plan to adapt to ad-hoc performance hiccups). 
All algorithms employ the cost estimation models to evaluate alternative physical operators while exploring the space of alternative mappings.</p></div>
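A simplified view of the baseline Exhaustive Search over a weighted combination of objectives can be sketched in Python as follows. The catalog entries, metric names, weights and capacity figures are hypothetical stand-ins for the benchmarked statistics and cost models described above:

```python
from itertools import product

def score(stats, weights):
    """Scalarize the multi-criteria objective as a weighted combination."""
    return sum(weights[m] * stats[m] for m in weights)

def exhaustive_search(workflow, catalog, weights, capacity):
    """Try every logical-to-physical mapping; keep the best feasible plan."""
    best, best_score = None, float("-inf")
    options = [catalog[op] for op in workflow]  # physical candidates per operator
    for combo in product(*options):
        used = {}
        for choice in combo:
            used[choice["cluster"]] = used.get(choice["cluster"], 0) + choice["slots"]
        if any(used[c] > capacity[c] for c in used):
            continue                            # violates a cluster's capacity
        s = sum(score(c["stats"], weights) for c in combo)
        if s > best_score:
            best, best_score = list(combo), s
    return best
```

The A*-like, Greedy and Heuristic algorithms prune this exponential space instead of enumerating it, which is where the orders-of-magnitude speedups come from.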
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">SYNERGIES IN A REAL CASE STUDY</head><p>This section describes how our prototype works in a real use case scenario from the financial domain and how advanced synergies are established among the individual components. For exposition purposes, we employ a single cluster equipped with Flink and Spark.</p><p>The logical workflow submitted to the Optimizer is illustrated in Figure <ref type="figure" target="#fig_5">2</ref>(a). The workflow of Figure <ref type="figure" target="#fig_5">2</ref>(a) utilizes Level 1, Level 2 stock data 7 to discover highly positively/negatively correlated pairs of stocks. In Figure <ref type="figure" target="#fig_5">2</ref>(a), data arrive at a Kafka source. The Split operator separates Level 1 from Level 2 data. It directs Level 2 data to the bottom branch of the workflow. There, the Level 2 bids are Filtered (i.e., for monitoring only a subset of stocks). The bids per stock are counted using a Count logical operator. When a trade for a stock is realized, a new Level 1 tuple is directed by Split to the upper part of the workflow. A Project operator keeps only the timestamp of the trade for each stock. The Join operator joins the stock trade Level 1 tuple with the count of bids the stock received until the trade. The result is inserted in a time Window of recent such counts, forming a time series. Finally, the CorrelationMatrix of the stocks' time series is computed by the respective OMLDM operator.</p><p>Synopses-based Optimization. When the workflow is submitted, for every logical operator there are two platform choices, Flink or Spark. The Optimizer holds a dictionary to determine the equivalence of logical operators to physical ones in either platform. In Synopses-based Optimization, the Optimizer attempts to boost interactivity by treating the SDE as an additional platform. That is, its dictionary also holds mappings about which exact operators can be replaced by approximate ones of the SDE. 
7 www.investopedia.com/terms/l/level1.asp, www.investopedia.com/terms/l/level2.asp   Boosting Vertical Scalability. The SDE.DFT operator of the SDE, for every window it approximates, outputs a pair of values: a bucketID and a vector of reduced dimensionality compared to the original window. The bucketID is based on the first few coefficients of the DFT <ref type="bibr" target="#b10">[11]</ref> and is used as the key to hash stock streams to the learners of the CorrelationMatrix OMLDM operator (Figure <ref type="figure" target="#fig_5">2</ref>(b)). Reduced vector correlations are pruned for stocks hashed far away in the hashing space, since these stock streams are guaranteed to be highly (dis)similar <ref type="bibr" target="#b10">[11]</ref>. This boosts vertical scalability for a task of quadratic complexity in the number of input streams.</p><p>Pumped Parameter Server. The stream entering the CorrelationMatrix operator of OMLDM is duplicated to the training and prediction pipelines (Figure <ref type="figure" target="#fig_2">1(d)</ref>). Learners (buckets in our previous discussion) compute the pairwise correlations of the streams they receive (the correlation coefficient and the beta regression parameter coincide for standardized vectors), while the PS holds the current up-to-date correlation matrix, which is essentially the global model at the predictor and the PS sides. A concept drift for the global model occurs if the Frobenius norm measuring the difference between the matrix kept at the PS and the true correlation matrix exceeds a threshold. The Optimizer decomposes this constraint into local ones installed at each learner. The mathematical details are described in <ref type="bibr" target="#b15">[17]</ref>, but in a simple version of FGM, the Optimizer will prescribe that learners sync with the PS when a signed distance value, computed locally at each learner, becomes positive. 
Notice that the local constraints are determined by the Optimizer, which initially prescribes and then adapts the learners at runtime.</p><p>Preliminary Experimental Evaluation. We employ a Kafka cluster of 4 Dell PowerEdge R320 Intel Xeon E5-2430 v2 2.50GHz machines with 32GB RAM each and, for Flink and Spark Structured Streaming, a cluster of 10 Dell PowerEdge R300 Quad Core Xeon X3323 2.5GHz machines with 8GB RAM each. We use real Level 1 and Level 2 stock data (∼5000 stocks/∼10 TB) provided by a European financial company. Based on input from the data provider, we set the DFT to use 8 coefficients, CountMin to (𝜖 = 0.002, 𝛿 = 0.01), the correlation threshold to 0.9, and the time window to 5 minutes. When we monitor 500 streams, i.e., 250K/2 cells in the correlation matrix, the fact that "Parallelism" lacks synopses-based optimization and boosted vertical scalability makes its throughput ratio approach 1, i.e., that of the Naive approach. INFORE's performance remains steady when switching from 50 to 500 streams, improving over "Parallelism" by 3 times. The most important findings come upon switching to 5000 stocks (25M/2 cells in the matrix). Figure <ref type="figure" target="#fig_8">3</ref>(c) shows that, lacking enhanced horizontal and vertical scalability, "Parallelism" becomes equivalent to the Naive approach. INFORE exhibits more than an order of magnitude higher throughput, with the trade-off that highly correlated pairs are identified with 90% accuracy.</p><p>For federated scalability, we measure the bytes communicated among the operators. Due to space constraints, we only report here that INFORE can reduce communication by more than an order of magnitude when 5000 stock streams are monitored.</p></div>
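The pruning logic of the Boosting Vertical Scalability paragraph can be sketched as follows. This is a minimal illustration of DFT-based sketching in the spirit of [11], not the actual SDE.DFT implementation; all function names are ours. It assumes standardized (zero-mean, unit-norm) windows and a norm-preserving DFT, so that the distance over the retained coefficients lower-bounds the true Euclidean distance:

```python
import numpy as np

def standardize(window):
    """Zero-mean, unit-norm version of a series, so that the correlation
    of two standardized series x, y is their inner product x . y."""
    w = np.asarray(window, dtype=float)
    w = w - w.mean()
    return w / np.linalg.norm(w)

def dft_sketch(window, k=8):
    """First k coefficients of a norm-preserving DFT of the standardized
    window: a vector of reduced dimensionality compared to the window."""
    return np.fft.fft(standardize(window), norm="ortho")[:k]

def bucket_id(sketch, cell=0.5):
    """Grid cell keyed by a leading DFT coefficient; used to hash streams
    to learners, so that similar streams land in nearby buckets."""
    lead = np.array([sketch[1].real, sketch[1].imag])
    return tuple(np.floor(lead / cell).astype(int))

def may_be_correlated(s1, s2, threshold=0.9):
    """Dropping DFT coefficients only shrinks the Euclidean distance
    (Parseval), so 1 - d^2/2 upper-bounds the true correlation of the
    standardized series: pairs whose bound falls below the threshold
    are safely pruned without computing the exact correlation."""
    d2 = np.sum(np.abs(s1 - s2) ** 2)
    return 1.0 - d2 / 2.0 >= threshold
```

Pairs of streams hashed to distant buckets skip the exact correlation computation altogether, which is what makes the quadratic task tractable.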
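The learner-side synchronization condition of the Pumped Parameter Server can be illustrated with a simple thresholded variant; the actual FGM safe function of [17] is more refined, and the class and names below are illustrative only:

```python
import numpy as np

class Learner:
    """Sketch of an FGM-style learner-side check: the learner tracks the
    drift of its local correlation estimate from the global model at the
    last synchronization and requests a sync with the parameter server
    once a locally computed signed distance becomes positive."""

    def __init__(self, global_model, drift_threshold):
        # Model shipped by the PS at the last synchronization.
        self.synced_model = np.array(global_model, dtype=float)
        # Local share of the global Frobenius-norm drift budget,
        # prescribed (and adapted at runtime) by the Optimizer.
        self.drift_threshold = float(drift_threshold)

    def update(self, local_model):
        """Return the signed distance; a positive value triggers a sync."""
        drift = np.linalg.norm(local_model - self.synced_model, "fro")
        return drift - self.drift_threshold

    def sync(self, new_global_model):
        """Called when the PS broadcasts a fresh global model."""
        self.synced_model = np.array(new_global_model, dtype=float)
```

Between synchronizations no messages are exchanged, which is where the communication savings reported for federated scalability come from.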
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">CONCLUSIONS &amp; ONGOING WORK</head><p>We have presented the design, internal structure, and integration aspects of key components of INFORE, a prototype designed for interactive analytics at scale. To our knowledge, INFORE is the first system for cross-platform streaming analytics. We are currently enriching the SDE and OMLDM libraries to strengthen the support for the three scalability types and to enhance analytics capabilities.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>(a). Currently, INFORE supports streaming analytics workflows over Apache Flink, Spark Structured Streaming and Kafka. Cross-(Big Data) platform and cluster communication is performed via JSON-formatted Kafka messages. In a nutshell, INFORE works as follows: (Step 1) The user designs in a Graphical Editor (left of Figure 1(a)) the desired workflow composed of stream transformation, OMLDM and SDE operators. These are logical operators composing the logical view of the workflow, independent of the underlying platforms.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>Param. Server View.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Overall INFORE Architecture &amp; Internal Architecture of Key Components</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>(d). Two types of pipelines are engaged in the OMLDM architecture: training and prediction pipelines. Training pipelines. Training data streams follow the blue-colored path of Figure 1(d). They are ingested via the TrainingTopic. The rest of the training pipeline is implemented by a pair of Learner and ParameterServer FlatMaps. We follow a Parameter Server (PS) distributed model (Figure 1(e))</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head></head><label></label><figDesc>(a) Initial Workflow at the Graphical Editor. (b) Suggested Workflow via Synopses-based Optimization.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Initial and transformed workflow using Synopses-based Optimization.</figDesc><graphic coords="5,56.04,88.67,221.94,63.15" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head></head><label></label><figDesc>SDEaaS vs non-SDEaaS.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head></head><label></label><figDesc>Volume (TB) x Max Kafka Ingestion Rate. (c) Horizontal &amp; Vertical Scalability.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Performance Evaluation.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head></head><label></label><figDesc>Figure 3(b) shows the time it takes the two optimization algorithms (Exhaustive and A*-alike) to produce a physical workflow from the moment they receive the logical one, when 1 (Flink), 2 (Spark, Flink) or 3 (Spark, Flink, Synopses-based Optimization) platforms are available. As the figure demonstrates, for more than 1 platform our novel A*-alike algorithm reduces the time to prescribe a physical workflow by 1 to 4 orders of magnitude. For 1 platform (where the Optimizer still prescribes provisioned resources), the Exhaustive Search algorithm is faster, since the A*-alike one requires more initialization time. In Figure 3(c) we show the scalability of the INFORE approach versus a technique that executes the workflow of Figure 2(a) in Flink (denoted by "Parallelism"), without synopses-based optimization, boosted vertical scalability or the OMLDM component. INFORE executes the workflow of Figure 2(b). We plot the throughput ratio of INFORE and "Parallelism" over a Naive approach that executes the same workflow with parallelism set to 1. The two horizontal axes in the figure show the number of processed stock streams (bottom) and the volume in terabytes (top) ingested at the maximum Kafka rate. Thus, the top axis accounts for horizontal scalability (volume, speed), while the bottom axis accounts for vertical scalability.</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_0">https://docs.oracle.com/javase/tutorial/jmx/overview/, slurm.schedmd.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_1">https://www.elastic.co/what-is/elk-stack</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENTS</head><p>This work has received funding from the EU Horizon 2020 research and innovation program INFORE under grant agreement No 825070.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<idno type="DOI">10.1007/978-3-540-28608-0</idno>
		<ptr target="https://doi.org/10.1007/978-3-540-28608-0" />
		<title level="m">Data Stream Management -Processing High-Speed Data Streams</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">RHEEM: Enabling Cross-Platform Data Processing -May The Big Data Be With You!</title>
		<author>
			<persName><forename type="first">Divy</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanjay</forename><surname>Chawla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bertty</forename><surname>Contreras-Rojas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ahmed</forename><forename type="middle">K</forename><surname>Elmagarmid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yasser</forename><surname>Idris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zoi</forename><surname>Kaoudi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sebastian</forename><surname>Kruse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ji</forename><surname>Lucas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Essam</forename><surname>Mansour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mourad</forename><surname>Ouzzani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Papotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jorge-Arnulfo</forename><surname>Quiané-Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nan</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Saravanan</forename><surname>Thirumuruganathan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anis</forename><surname>Troudi</surname></persName>
		</author>
		<idno type="DOI">10.14778/3236187.3236195</idno>
		<ptr target="https://doi.org/10.14778/3236187.3236195" />
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">11</biblScope>
			<biblScope unit="page" from="1414" to="1427" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics</title>
		<author>
			<persName><forename type="first">Omid</forename><surname>Alipourfard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hongqiang</forename><surname>Harry Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jianshu</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shivaram</forename><surname>Venkataraman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Minlan</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ming</forename><surname>Zhang</surname></persName>
		</author>
		<ptr target="https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/alipourfard" />
	</analytic>
	<monogr>
		<title level="m">14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017</title>
				<editor>
			<persName><forename type="first">Aditya</forename><surname>Akella</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Jon</forename><surname>Howell</surname></persName>
		</editor>
		<meeting><address><addrLine>Boston, MA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017-03-27">2017. March 27-29, 2017</date>
			<biblScope unit="page" from="469" to="482" />
		</imprint>
	</monogr>
	<note>USENIX Association</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">IReS: Intelligent, Multi-Engine Resource Scheduler for Big Data Analytics Workflows</title>
		<author>
			<persName><forename type="first">Katerina</forename><surname>Doka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nikolaos</forename><surname>Papailiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dimitrios</forename><surname>Tsoumakos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christos</forename><surname>Mantas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nectarios</forename><surname>Koziris</surname></persName>
		</author>
		<idno type="DOI">10.1145/2723372.2735377</idno>
		<ptr target="https://doi.org/10.1145/2723372.2735377" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data</title>
				<editor>
			<persName><forename type="first">Timos</forename><forename type="middle">K</forename><surname>Sellis</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Susan</forename><forename type="middle">B</forename><surname>Davidson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Zachary</forename><forename type="middle">G</forename><surname>Ives</surname></persName>
		</editor>
		<meeting>the 2015 ACM SIGMOD International Conference on Management of Data<address><addrLine>Melbourne, Victoria, Australia</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2015-05-31">2015. May 31 -June 4, 2015</date>
			<biblScope unit="page" from="1451" to="1456" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A Demonstration of the BigDAWG Polystore System</title>
		<author>
			<persName><forename type="first">Aaron</forename><forename type="middle">J</forename><surname>Elmore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jennie</forename><surname>Duggan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mike</forename><surname>Stonebraker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Magdalena</forename><surname>Balazinska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ugur</forename><surname>Çetintemel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vijay</forename><surname>Gadepally</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Heer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bill</forename><surname>Howe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeremy</forename><surname>Kepner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tim</forename><surname>Kraska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Samuel</forename><surname>Madden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Maier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Timothy</forename><forename type="middle">G</forename><surname>Mattson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stavros</forename><surname>Papadopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeff</forename><surname>Parkhurst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nesime</forename><surname>Tatbul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Manasi</forename><surname>Vartak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stan</forename><surname>Zdonik</surname></persName>
		</author>
		<idno type="DOI">10.14778/2824032.2824098</idno>
		<ptr target="https://doi.org/10.14778/2824032.2824098" />
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="1908" to="1911" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">At NYSE, The Data Deluge Overwhelms Traditional Databases</title>
		<ptr target="https://www.forbes.com/sites/tomgroenfeldt/2013/02/14/at-nyse-the-data-deluge-overwhelms-traditional-databases/#362df2415aab" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">INforE: Interactive Cross-platform Analytics for Everyone</title>
		<author>
			<persName><forename type="first">Nikos</forename><surname>Giatrakos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Arnu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Theodoros</forename><surname>Bitsakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Antonios</forename><surname>Deligiannakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Minos</forename><forename type="middle">N</forename><surname>Garofalakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ralf</forename><surname>Klinkenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aris</forename><surname>Konidaris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Antonis</forename><surname>Kontaxakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yannis</forename><surname>Kotidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vasilis</forename><surname>Samoladas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alkis</forename><surname>Simitsis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">George</forename><surname>Stamatakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fabian</forename><surname>Temme</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mate</forename><surname>Torok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Edwin</forename><surname>Yaqub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Arnau</forename><surname>Montagud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Miguel</forename><surname>Ponce De Leon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Holger</forename><surname>Arndt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefan</forename><surname>Burkard</surname></persName>
		</author>
		<idno type="DOI">10.1145/3340531.3417435</idno>
		<ptr target="https://doi.org/10.1145/3340531.3417435" />
	</analytic>
	<monogr>
		<title level="m">CIKM &apos;20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event</title>
				<meeting><address><addrLine>, Ireland</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2020-10-19">2020. October 19-23, 2020</date>
			<biblScope unit="page" from="3389" to="3392" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Interactive Extreme-Scale Analytics Towards Battling Cancer</title>
		<author>
			<persName><forename type="first">Nikos</forename><surname>Giatrakos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nikos</forename><surname>Katzouris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Antonios</forename><surname>Deligiannakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexander</forename><surname>Artikis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Minos</forename><forename type="middle">N</forename><surname>Garofalakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">George</forename><surname>Paliouras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Holger</forename><surname>Arndt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Raffaele</forename><surname>Grasso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ralf</forename><surname>Klinkenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Miguel</forename><surname>Ponce De Leon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gian</forename><surname>Gaetano Tartaglia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alfonso</forename><surname>Valencia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dimitrios</forename><surname>Zissis</surname></persName>
		</author>
		<idno type="DOI">10.1109/MTS.2019.2913071</idno>
		<ptr target="https://doi.org/10.1109/MTS.2019.2913071" />
	</analytic>
	<monogr>
		<title level="j">IEEE Technol. Soc. Mag</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="54" to="61" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Musketeer: all for one, one for all in data processing systems</title>
		<author>
			<persName><forename type="first">Ionel</forename><surname>Gog</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Malte</forename><surname>Schwarzkopf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Natacha</forename><surname>Crooks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matthew</forename><forename type="middle">P</forename><surname>Grosvenor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Allen</forename><surname>Clement</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Steven</forename><surname>Hand</surname></persName>
		</author>
		<idno type="DOI">10.1145/2741948.2741968</idno>
		<ptr target="https://doi.org/10.1145/2741948.2741968" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Tenth European Conference on Computer Systems, EuroSys 2015</title>
				<editor>
			<persName><forename type="first">Laurent</forename><surname>Réveillère</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Tim</forename><surname>Harris</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Maurice</forename><surname>Herlihy</surname></persName>
		</editor>
		<meeting>the Tenth European Conference on Computer Systems, EuroSys 2015<address><addrLine>Bordeaux, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015-04-21">2015. April 21-24, 2015</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page">16</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Benchmarking Distributed Stream Data Processing Systems</title>
		<author>
			<persName><forename type="first">Jeyhun</forename><surname>Karimov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tilmann</forename><surname>Rabl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Asterios</forename><surname>Katsifodimos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Roman</forename><surname>Samarev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Henri</forename><surname>Heiskanen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Volker</forename><surname>Markl</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICDE.2018.00169</idno>
		<ptr target="https://doi.org/10.1109/ICDE.2018.00169" />
	</analytic>
	<monogr>
		<title level="m">34th IEEE International Conference on Data Engineering, ICDE 2018</title>
				<meeting><address><addrLine>Paris, France</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2018-04-16">2018. April 16-19, 2018</date>
			<biblScope unit="page" from="1507" to="1518" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A Synopses Data Engine for Interactive Extreme-Scale Analytics</title>
		<author>
			<persName><forename type="first">Antonis</forename><surname>Kontaxakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nikos</forename><surname>Giatrakos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Antonios</forename><surname>Deligiannakis</surname></persName>
		</author>
		<idno type="DOI">10.1145/3340531.3412154</idno>
		<ptr target="https://doi.org/10.1145/3340531.3412154" />
	</analytic>
	<monogr>
		<title level="m">CIKM &apos;20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event</title>
				<meeting><address><addrLine>, Ireland</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2020-10-19">2020. October 19-23, 2020</date>
			<biblScope unit="page" from="2085" to="2088" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Scaling Distributed Machine Learning with the Parameter Server</title>
		<author>
			<persName><forename type="first">Mu</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><forename type="middle">G</forename><surname>Andersen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun</forename><forename type="middle">Woo</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexander</forename><forename type="middle">J</forename><surname>Smola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Amr</forename><surname>Ahmed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vanja</forename><surname>Josifovski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">James</forename><surname>Long</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eugene</forename><forename type="middle">J</forename><surname>Shekita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bor-Yiing</forename><surname>Su</surname></persName>
		</author>
		<ptr target="https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu" />
	</analytic>
	<monogr>
		<title level="m">11th USENIX Symposium on Operating Systems Design and Implementation, OSDI &apos;14</title>
				<editor>
			<persName><forename type="first">Jason</forename><surname>Flinn</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Hank</forename><surname>Levy</surname></persName>
		</editor>
		<meeting><address><addrLine>Broomfield, CO, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014-10-06">2014. October 6-8, 2014</date>
			<biblScope unit="page" from="583" to="598" />
		</imprint>
	</monogr>
	<note>USENIX Association</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Integrating real-time and batch processing in a polystore</title>
		<author>
			<persName><forename type="first">John</forename><surname>Meehan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stan</forename><surname>Zdonik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shaobo</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yulong</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nesime</forename><surname>Tatbul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adam</forename><surname>Dziedzic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aaron</forename><forename type="middle">J</forename><surname>Elmore</surname></persName>
		</author>
		<idno type="DOI">10.1109/HPEC.2016.7761585</idno>
		<ptr target="https://doi.org/10.1109/HPEC.2016.7761585" />
	</analytic>
	<monogr>
		<title level="m">IEEE High Performance Extreme Computing Conference, HPEC 2016</title>
				<meeting><address><addrLine>Waltham, MA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2016-09-13">2016. September 13-15, 2016</date>
			<biblScope unit="page" from="1" to="7" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Automatic Fusion of Satellite Imagery and AIS data for Vessel Detection</title>
		<author>
			<persName><forename type="first">Aristides</forename><surname>Milios</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Konstantina</forename><surname>Bereta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Konstantinos</forename><surname>Chatzikokolakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dimitris</forename><surname>Zissis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stan</forename><surname>Matwin</surname></persName>
		</author>
		<ptr target="https://ieeexplore.ieee.org/document/9011339" />
	</analytic>
	<monogr>
		<title level="m">22nd International Conference on Information Fusion, FUSION 2019</title>
				<meeting><address><addrLine>Ottawa, ON, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019-07-02">2019. July 2-5, 2019</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">SnappyData. In Encyclopedia of Big Data Technologies</title>
		<author>
			<persName><forename type="first">Barzan</forename><surname>Mozafari</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-63962-8_258-1</idno>
		<ptr target="https://doi.org/10.1007/978-3-319-63962-8_258-1" />
		<editor>Sherif Sakr and Albert Y. Zomaya</editor>
		<imprint>
			<date type="published" when="2019">2019</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Functional Geometric Monitoring for Distributed Streams</title>
		<author>
			<persName><forename type="first">Vasilis</forename><surname>Samoladas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Minos</forename><forename type="middle">N</forename><surname>Garofalakis</surname></persName>
		</author>
		<idno type="DOI">10.5441/002/edbt.2019.09</idno>
		<ptr target="https://doi.org/10.5441/002/edbt.2019.09" />
	</analytic>
	<monogr>
		<title level="m">Advances in Database Technology -22nd International Conference on Extending Database Technology, EDBT 2019</title>
				<meeting><address><addrLine>Lisbon, Portugal</addrLine></address></meeting>
		<imprint>
			<publisher>OpenProceedings</publisher>
			<date type="published" when="2019-03-26">2019. March 26-29, 2019</date>
			<biblScope unit="page" from="85" to="96" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Yahoo DataSketches</title>
		<ptr target="https://datasketches.github.io/" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note>last accessed</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
