=Paper=
{{Paper
|id=Vol-2367/paper_14
|storemode=property
|title=How to become a (Throughput) Billionaire: The Stream Processing Engine PipeFabric
|pdfUrl=https://ceur-ws.org/Vol-2367/paper_14.pdf
|volume=Vol-2367
|authors=Constantin Pohl
|dblpUrl=https://dblp.org/rec/conf/gvd/Pohl19
}}
==How to become a (Throughput) Billionaire: The Stream Processing Engine PipeFabric==
Constantin Pohl
TU Ilmenau, Germany
constantin.pohl@tu-ilmenau.de

31st GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 11.06.2019 - 14.06.2019, Saarburg, Germany. Copyright is held by the author/owner(s).

ABSTRACT
The ability to process data in real time has gained more and more importance in recent years through the rise of IoT and Industry 4.0. Stream processing engines were developed to handle huge amounts of data with high throughput under tight latency constraints. Trends in modern hardware have led to further specializations to efficiently utilize their chances and opportunities, like parallelization to multiple cores, vectorization, or awareness of NUMA. In this paper, we present the stream processing engine PipeFabric, which is under ongoing development at our department. We describe internal concepts and stream semantics along with decisions taken in the design space. In addition, we show challenges posed by modern hardware that we are considering to improve the performance and usability of our engine. Finally, we underline the potential of PipeFabric by running parallelized queries on a single Xeon Phi processor, resulting in about 1.3 billion tuples processed per second.

Keywords
Stream Processing, SPE, PipeFabric, Xeon Phi

1. INTRODUCTION
Various applications require processing and analysis of data continuously with short response times. For example, smart manufacturing machines in Industry 4.0 use sensors to stream their status information, allowing anomalies in their behavior to be detected and corrected as fast as possible.

In the early 2000s, the first stream processing engines (often referred to as SPEs) were published, clearly outperforming relational DBMS for this task [1]. The one-tuple-at-a-time concept (also known as Volcano style from Graefe [2]) for data streaming allowed SPEs to keep individual tuple latencies low. To increase throughput in terms of tuples processed per second, micro-batching strategies as well as data parallelization by partitioning were applied and refined over time.

Recent work focuses on the exploitation of modern hardware, since memory as well as processors tend to become more and more specialized to better solve different requirements of applications. GPUs, Multi- and Manycore CPUs, FPGAs, or even Vector Engines on the processor side, and HBM, NVM, HDD, or SSD on the memory side show massively different behavior under different tasks and come with various configurations and challenges to utilize them efficiently. To combine high-throughput as well as low-latency data processing with the opportunities given by modern hardware, we introduce our SPE PipeFabric (open source: https://github.com/dbis-ilm/pipefabric) in this paper, along with its concepts and design decisions.

2. RELATED WORK
Many SPEs have been published in the past, some being frameworks developed by research groups, others being commercially used engines in industry. In this section, we give a short overview of selected SPEs, classified into the scale-up (single node) and scale-out (distributed) principle.

Aurora [3] was one of the first general-purpose SPEs that specialized in answering queries on data streams in real time. Queries are described by directed graphs, connecting different operators together and thus forming the data flow. Since Aurora was designed to run on a single node, the Borealis [4] SPE added fault tolerance and consistency to run in a distributed setting.

Other recent distributed SPEs are Apache Flink [5] (forked from the Stratosphere engine), Apache Storm [6], and Apache Spark Streaming [7]. All of them are commonly found in various companies and have a large user base. Their main goal, in addition to low latency and high throughput, is scalability along with fault tolerance within a distributed setting.
Processing Big Data under real-time constraints eventually requires distributing the computation to multiple machines in a cluster, since scale-up is limited. Nevertheless, scale-up solutions can also get very far at a fraction of the monetary cost of a distributed solution. SABER [8] and StreamBox [9] are two SPEs that are optimized to run on a single node. While SABER can be executed on heterogeneous processing units like GPUs, StreamBox runs on Manycore CPUs and supports out-of-order tuple processing.

Finally, our SPE PipeFabric can be classified into the field of scale-up SPEs, focused on the efficient execution of queries on Multicore and Manycore CPUs.

3. STREAM PROCESSING PARADIGM
With the goal of low latency in mind, the general streaming workflow follows the one-tuple-at-a-time strategy. However, if latency requirements can be relaxed, gathering tuples into batches for less communication effort (like function calls) and vectorized processing can greatly increase throughput.

PipeFabric provides a query structure called Topology (like Apache Flink), which contains one or more streaming sources, operators applied on them, and optional stream sinks. Topologies can be conceptually seen as directed acyclic graphs routing tuples through different operators (see Figure 1).

Figure 1: Example of a Stream Query

The query example in Figure 1 consists of two input data streams, two operators applied on each of them (specifying the key attribute and window semantics), combined by a join with a final grouping on a certain attribute.
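To make the Topology abstraction concrete, the following is a minimal C++ sketch of how a simple single-source query could be written. It loosely follows the fluent style used in the PipeFabric repository, but the header name, the method names (newStreamFromFile, extract, where, print), and their signatures are assumptions here and may differ from the actual API.

```cpp
#include <iostream>
#include "pfabric.hpp"   // assumed umbrella header of the PipeFabric library

using namespace pfabric;

// Illustrative tuple schema: (key, payload, value).
typedef TuplePtr<int, std::string, double> InTuple;

int main() {
  Topology t;
  auto s = t.newStreamFromFile("input.csv")          // stream source: CSV file
            .extract<InTuple>(',')                   // parse lines into typed tuples
            .where([](auto tp, bool /*outdated*/) {  // selection predicate
              return get<0>(tp) % 2 == 0;
            })
            .print(std::cout);                       // sink: write results to stdout
  t.start();                                         // launch the dataflow
  return 0;
}
```

Each chained call corresponds to one operator in the resulting dataflow graph; a two-source query as in Figure 1 would additionally create a second stream and connect both via a join operator.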
Operators (called Pipes) are connected via channels, following the publish/subscribe pattern. They can connect to any upstream operator by subscribing, publishing their own results to other operators downstream. To reduce overhead, only tuple pointers are passed between them. The following subsections briefly describe the different sources, operators, and sinks in our SPE.

3.1 Stream Sources
Stream sources produce tuples for individual queries and are thus the necessary starting points of a query. The main sources are tuples provided via different network protocols, files (or tables), other streams, or specialized sources.

Network protocols. The most common sources of tuples are servers or sensors that deliver data to be processed continuously. PipeFabric can connect via REST API, RabbitMQ, Apache Kafka, MQTT, and ZeroMQ. The protocol logic for the connection is realized internally within the parametrizable source operators, hidden from the user.

Files/Tables. Data streams can also subscribe to different files like CSVs or binaries, as well as relational tables. Therefore, a query can also run different benchmarks provided as files on the file system. A special use case are tables, e.g. from RocksDB, which also allow transactional semantics on operations under ACID guarantees.

Streams. Another source is the subscription to already defined PipeFabric streams. This allows queries to conceptually send their results as a new data stream to which other queries can subscribe.

Specialized sources. In addition, PipeFabric provides various specialized source operators for different use cases. To run the Linear Road benchmark [1], a synchronized source is provided, publishing tuples from a file in real time according to their timestamps. Another specialized source is the data generator, which continuously generates tuples according to a format specified by the user. One last example is the matrix source, sending rows, columns, or even full matrices represented as tuples.

3.2 Operators
PipeFabric supports various operator types applied to incoming tuples. The most common single-source operators are the projection of attributes, applied predicates (selections), aggregations, or groupings. Each of them has its own operator which can also be configured, e.g. to choose an aggregation type (count, sum, etc.). To join multiple sources, PipeFabric uses the non-blocking symmetric hash join in addition to the recently published ScaleJoin algorithm [10]. With a customizable operator called notify, it is also possible to apply arbitrary UDFs to incoming tuples via lambda functions.

3.3 Stream Sinks
Sinks are operators which logically terminate a stream or query. It is possible to write query results on the fly into files, tables, or a new data stream. Results can also be returned as general output, e.g. for visualization in a GUI. However, PipeFabric currently does not have its own visualization tool like, e.g., Aurora has.

4. STREAMING CONCEPTS
In this section, we further describe streaming concepts of SPEs which are also realized in PipeFabric.

Partitioning/Merging. To utilize intra-query parallelism, it is possible to create multiple instances of the same operator, splitting tuples with a partitioning function and merging the results of the partitions afterwards. This concept is shown in Figure 2.

Figure 2: Partitioning and Merge Schema

Each partition is run by a separate thread, exchanging tuples via a synchronized queue. This allows the utilization of all cores on a Multicore or even Manycore CPU, increasing throughput massively if the computational requirements within a partition are high enough to justify the synchronization effort.

Batching/Unbatching. As previously mentioned, batching tuples together reduces communication effort (especially between threads) and enables vectorized execution of operations. In PipeFabric, a batch as well as an unbatch operator is provided. The batch operator stores incoming tuple pointers internally until a given batch size is reached, forwarding the batch at once by creating and passing a batch pointer to the next operator. The unbatch operator does the opposite, extracting tuple pointers from a batch and forwarding them again one after another (a sketch of this logic is shown below).
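As a rough illustration of the batch/unbatch logic described above, the following self-contained sketch collects tuple pointers until a configured batch size is reached and forwards the batch as a single pointer. The class names and the publish callback are our own illustrative assumptions, not PipeFabric's actual operator interface.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Illustrative types; like PipeFabric, we only pass (smart) pointers between operators.
template <typename T>
using TupPtr = std::shared_ptr<T>;
template <typename T>
using Batch = std::vector<TupPtr<T>>;

// Collects incoming tuple pointers and forwards them as one batch pointer
// once the configured batch size is reached.
template <typename T>
class BatchOp {
public:
  BatchOp(std::size_t batchSize, std::function<void(std::shared_ptr<Batch<T>>)> publish)
    : batchSize_(batchSize), publish_(std::move(publish)) {
    buffer_ = std::make_shared<Batch<T>>();
    buffer_->reserve(batchSize_);
  }

  void onTuple(TupPtr<T> tp) {
    buffer_->push_back(std::move(tp));
    if (buffer_->size() >= batchSize_) {
      publish_(buffer_);                       // hand the whole batch downstream
      buffer_ = std::make_shared<Batch<T>>();  // start a fresh batch
      buffer_->reserve(batchSize_);
    }
  }

private:
  std::size_t batchSize_;
  std::function<void(std::shared_ptr<Batch<T>>)> publish_;
  std::shared_ptr<Batch<T>> buffer_;
};

// The unbatch counterpart simply iterates the batch and re-publishes single tuples.
template <typename T>
void unbatch(const Batch<T>& batch, const std::function<void(TupPtr<T>)>& publish) {
  for (const auto& tp : batch) publish(tp);
}
```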
Window. A window operator tracks incoming tuples in order to mark them as outdated after a while. Outdated tuples do not participate in stateful operations like aggregates or joins and are removed from those states for further calculations. Long-running queries can therefore discard tuples after a while, keeping the memory footprint low and manageable. The most common window algorithms are the tumbling and the sliding window. The former drops all tuples at once when its size is reached, while the latter slowly fades out tuples individually. Figure 3 visualizes the concept for a sliding window calculating an aggregate.

Figure 3: Sliding Window Semantics

Both window types can discard tuples based on time or tuple count. PipeFabric internally uses a list data structure for the window state to allow efficient appending and removal at both ends of the list.

Transaction Support. Transactions are a common concept in relational database systems. They wrap operations on tables together, providing ACID guarantees to always ensure consistency of the database. Since PipeFabric can also stream data from or to tables, it contains basic transaction support for recovery and consistency [11]. For example, a query writing to a table executes its changes through running transactions, while multiple queries reading and writing additionally need isolation to guarantee correctness.

5. MODERN HARDWARE CHALLENGES
In this section, we give an overview of our ongoing work on PipeFabric regarding modern hardware, mainly focusing on Manycore CPU utilization and support for upcoming NVM technology. Manycore CPUs provide high core numbers and thus high thread counts, resulting in challenges in terms of synchronization and thread contention. It is also important to note that intra-query parallelism through multithreading usually leads to tuples arriving out of order after partitioning, which can be a problem for queries detecting patterns over time.

On the memory layer, NVM and also high-bandwidth memory (HBM) pose new challenges for stream processing. NVM offers persistence with latencies comparable to main memory (with a read/write asymmetry), which is interesting to explore especially for transactional operations on tables. HBM, on the other hand, offers great performance for memory-bound applications at the cost of small capacity, leading to optimization problems of where to use it for the greatest performance benefit.

5.1 Adaptive Partitioning
Data stream behavior can change at runtime. This means that the number of tuples arriving per second can change, leading to different numbers of partitions that would be ideal at that moment of the query. If the degree of partitioning is too high, computing resources are wasted, which is also a common problem for cloud providers. On the other hand, if the workload is underestimated, too few partitions cannot keep up with tuple arrival rates, leading to wrong results because tuples are discarded due to full buffers.

Dynamic partitioning approaches address this problem by using a partitioning function that can be changed over time. This allows skewed streams to be countered by dynamically rerouting tuples to underutilized partitions. One step further is the adaptive partitioning strategy, where not only the function is variable but also the number of partitions can change. Figure 4 shows recent work on an adaptive partitioning strategy within PipeFabric, where the y-axis describes the number of tuples arriving per second and the number of partitions is directly converted into overall throughput to allow a better comparison.

Figure 4: Adaptive Partitioning Behavior (tuples per second of the dataset and of the partitions [tp/s overall] over time [s])

Such an approach, scaling the number of partitions up and down, raises mainly two problems:
• State migration when a partition is removed.
• The scaling decision, i.e. when to add or remove a partition.
The migration problem can be solved by stopping the query, performing the state migration, and resuming. However, this can break latency constraints, since processing cannot continue during the stop. A better proposed solution is to create a parallel state which receives duplicates of new tuples until both states are equal. Then, the original state can be dropped safely.

PipeFabric currently uses a static partitioning concept where neither the number of partitions nor the partitioning function changes. At the moment we are investigating possibilities and options to apply an adaptive approach to fully utilize a Manycore CPU under skewed data stream behavior (a sketch of a dynamic partitioner follows below).
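The sketch below illustrates the dynamic part of this idea: a partitioner whose partitioning function can be swapped at runtime by a controller thread without stopping the dataflow. It is a simplified, hypothetical example (DynamicPartitioner is not an existing PipeFabric class), and it deliberately omits the harder part of adaptive partitioning, namely changing the number of partitions and migrating state.

```cpp
#include <cstdint>
#include <functional>
#include <memory>

// Computes the target partition index for a tuple (here identified by a 64-bit key).
// The partitioning function is stored behind an atomically exchanged shared_ptr so it
// can be replaced at runtime without blocking the routing path.
class DynamicPartitioner {
public:
  using PartitionFn = std::function<std::size_t(std::uint64_t key, std::size_t numPartitions)>;

  DynamicPartitioner(std::size_t numPartitions, PartitionFn fn)
    : numPartitions_(numPartitions),
      fn_(std::make_shared<PartitionFn>(std::move(fn))) {}

  // Called by the upstream operator for every tuple.
  std::size_t route(std::uint64_t key) const {
    auto fn = std::atomic_load(&fn_);      // read the currently installed function
    return (*fn)(key, numPartitions_);
  }

  // Called by a controller thread, e.g. when the stream becomes skewed.
  void changeFunction(PartitionFn fn) {
    std::atomic_store(&fn_, std::make_shared<PartitionFn>(std::move(fn)));
  }

private:
  std::size_t numPartitions_;
  std::shared_ptr<PartitionFn> fn_;
};

// Usage: start with hash partitioning, later redirect a hot key range
// to an underutilized partition.
int main() {
  DynamicPartitioner p(4, [](std::uint64_t k, std::size_t n) { return k % n; });
  std::size_t target = p.route(42);        // 42 % 4 == 2
  p.changeFunction([](std::uint64_t k, std::size_t n) {
    return (k < 1000) ? 0 : (k % n);       // route the hot range to partition 0
  });
  target = p.route(42);                    // now 0
  (void)target;
  return 0;
}
```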
5.2 Order-Preserved Merging
As mentioned in the last section, partitioning can lead to out-of-order tuples afterwards, e.g. when a predicate within one partition drops more or fewer tuples than in another, or when a hash table for joining tuples hits a chain of cache misses on probing. Ordering the output without blocking results can be difficult.

A solution to this problem is to store incoming tuples in a separate queue per partition. The merge operation can then check and compare the first elements of all queues, forwarding the oldest tuple among them (see Figure 5 for the general idea).

Figure 5: Order-Preserved Merging

For the special case that one partition does not produce any output, it is possible to add a dummy element after a certain time, guaranteeing ordered execution only within a certain time frame. This strategy is also known as the k-slack algorithm [12].

Regarding PipeFabric, we would like to combine an order-preserving merge with an adaptive partitioning approach. This means that on a change of the partition number, the sorted merge operation has additional synchronization overhead with the adaptive partitioner, since it has to know which partition will be removed soon.

5.3 Real-time Query Modification
Even if not directly hardware-related, it would be a quality-of-life feature to be able to change a query (which possibly runs for weeks or months) without restarting it [13]. Over a longer period, the query states can become huge, especially when the query has many operators, not to mention the time lost when the state is migrated into a new query. To add real-time query modification, we would like to address the following use cases:
• Add or remove an operator within the query dataflow.
• Change the function (UDF) of a single operator.
For our SPE PipeFabric, these features need an additional controller thread that can be invoked by the user to trigger a query modification. To add or remove an operator, the previous as well as the next operator within the query need a notification to stop exchanging tuples. After that notification, the new operator must be created and connected to the others via publish/subscribe channels. When the connection is finished, new notifications must be sent to allow tuple exchange again.

When the UDF of an operator is changed, only this operator is involved in the modification. This means that the function cannot be executed while it is being changed, again making notifications necessary.

For both modifications, tuples have to be buffered while the query is changed. Ideally, the modification is done during a delay in tuple arrival of the data stream, otherwise a short blocking is inevitable.

5.4 HBM Allocation on States
One of our previous works [14] investigated the impact of HBM on different query states. We concluded that for operations with small states like aggregates it is not useful at all, while windows can benefit a little from more bandwidth. Stream sources, on the other hand, greatly benefit from HBM.

In a follow-up work, we will integrate HBM detection within our SPE along with custom HBM state allocators. In addition, we would like to add a cost model for HBM to allow a query optimizer to choose between the different memory types. Finally, since the symmetric hash join only marginally improves with more bandwidth (being mostly latency bound), we are investigating different stream join algorithms to improve bandwidth utilization, especially on a Xeon Phi processor.
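As a rough sketch of what such a custom state allocator could build on, the example below places state buffers in MCDRAM when available, using the memkind library's hbwmalloc interface (hbw_check_available, hbw_malloc, hbw_free), and falls back to regular DRAM otherwise. The surrounding HBMStateAllocator class is our own hypothetical example, not an existing PipeFabric component.

```cpp
#include <cstddef>
#include <cstdlib>
#include <hbwmalloc.h>  // memkind's high-bandwidth memory interface

// Allocates operator state in MCDRAM when it is present, otherwise in regular DRAM.
class HBMStateAllocator {
public:
  // hbw_check_available() returns 0 if high-bandwidth memory can be allocated.
  HBMStateAllocator() : hbmAvailable_(hbw_check_available() == 0) {}

  void* allocate(std::size_t bytes) {
    return hbmAvailable_ ? hbw_malloc(bytes)   // place state in MCDRAM
                         : std::malloc(bytes); // fallback: DDR4
  }

  void deallocate(void* p) {
    if (hbmAvailable_) hbw_free(p);
    else std::free(p);
  }

private:
  bool hbmAvailable_;
};
```

A cost model as outlined above would then decide which operator states (e.g. sources and windows rather than small aggregates) are handed such an allocator.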
5.5 Lockfree Data Structures
With Manycore CPUs, the degree of contention on shared data structures can nullify any advantage of parallelization. The usage of fine-grained locks or latches along with optimistic concurrency protocols can improve scalability a lot. Recent work on PipeFabric investigated lockfree data structures for states that are accessed concurrently by multiple threads.

Queues between threads are a prominent example, where specialized lockfree queues (so-called Single Producer Single Consumer (SPSC) queues) realized as ring buffers greatly enhance performance. Hash tables for joins are another common structure that benefits from the lockfree paradigm, which is not restricted to stream processing but also applies to joins on relational tables.

Lockfree programming usually has a huge disadvantage when it comes to debugging or guaranteeing thread safety. However, there are high-level abstractions in libraries like Boost (https://www.boost.org/doc/libs/1_66_0/doc/html/lockfree.html) or Intel TBB (https://software.intel.com/en-us/node/506169) which hide lockfree operations behind a user-friendly interface. Currently, PipeFabric still uses locks and latches, but to improve throughput we plan to add lockfree counterparts in the near future.
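To illustrate how such an SPSC ring buffer could replace a locked queue between a partition thread and a downstream operator, the sketch below uses Boost's boost::lockfree::spsc_queue; the producer/consumer loop around it is an illustrative assumption, not PipeFabric's actual channel implementation.

```cpp
#include <boost/lockfree/spsc_queue.hpp>
#include <cstdint>
#include <thread>

int main() {
  // Fixed-capacity lockfree ring buffer: exactly one producer and one consumer thread.
  boost::lockfree::spsc_queue<std::uint64_t, boost::lockfree::capacity<1024>> queue;
  const std::uint64_t numTuples = 100000;

  std::thread producer([&queue, numTuples] {
    for (std::uint64_t key = 0; key < numTuples; ++key) {
      while (!queue.push(key)) {    // push fails only when the ring buffer is full
        std::this_thread::yield();  // back off instead of blocking on a lock
      }
    }
  });

  std::thread consumer([&queue, numTuples] {
    std::uint64_t key;
    std::uint64_t consumed = 0;
    while (consumed < numTuples) {
      if (queue.pop(key)) {         // pop fails only when the buffer is empty
        ++consumed;                 // a real operator would process the tuple here
      } else {
        std::this_thread::yield();
      }
    }
  });

  producer.join();
  consumer.join();
  return 0;
}
```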
5.6 Multiway-Stream Join
The symmetric hash join within PipeFabric is a binary hash join, which means that it can only join two input streams. To join a higher number of stream sources, the current solution leads to a binary tree of symmetric hash join operators with the following problems:
• Individual tuple latency can become extremely bad if a tuple has to be joined repeatedly from the bottom up within the tree.
• Since intermediate join results are fully materialized in each join operator, the memory footprint of the intermediate hash tables can inevitably become very large.
Multiway join operators that can connect to any number of streams, on the other hand, look very promising, since a single join instance has many opportunities to optimize tuple storage and probe sequences. Figure 6 shows the concept of a binary join tree compared to a multiway join.

Figure 6: Binary vs. Multiway Join

To efficiently join many concurrent data streams, the join operator has to minimize probe misses to reduce contention (e.g. by only probing when matches can be found in all hash tables). In addition, different parallelization strategies (like data parallelism or fully shared states) are possible, which we plan to investigate in future work.

6. EXPERIMENTAL EVALUATION
In this section, we prove the statement in the title of this paper experimentally. First, we list our experimental setup, followed by the results and a discussion.

6.1 Setup
On the hardware side, we used a Xeon Phi Knights Landing Manycore CPU (KNL 7210) with 64 cores at 1.3 GHz, each supporting up to 4 threads due to hyperthreading. It runs in SNC4 mode, which means that the cores are distributed into four distinct regions classified as NUMA nodes. Along with the CPU comes 16 GB of on-chip HBM, the so-called Multi-Channel DRAM (MCDRAM). This MCDRAM provides over 420 GB/s memory bandwidth and is configured in Flat mode; it therefore has to be addressed manually via numactl (https://www.systutorials.com/docs/linux/man/8-numactl/) or the Memkind API (http://memkind.github.io/memkind/), otherwise it is not used at all.

We built our SPE PipeFabric with the Intel compiler version 17.0.6. The operating system is CentOS version 7 running Linux kernel 3.10. The most important compilation flags are code optimization with -O3 and -xMIC-AVX512 for auto-vectorization with the AVX512 instruction set.

6.2 How to become a Billionaire
A query that is able to process a billion tuples per second obviously needs some tuning along with simplifications. First, all input streams are fully allocated within the MCDRAM; there is no regular DDR4 RAM involved, not to speak of disks like SSDs. The stream query only applies a selection predicate on each input tuple, forwarding those tuples that satisfy the predicate to an empty UDF operator. More computations lead to more work for the threads, reducing overall throughput. The different predicates as well as their measured impact on performance are summarized in Table 1.

  Predicate      Selectivity   tp/s (256 threads)
  true           100%          720M
  key mod(2)     50%           937M
  key mod(10)    10%           1.16B
  false          0%            1.39B

Table 1: Selection Operator

Next, the query is realized with inter-query parallelism, which means that all 256 threads of the KNL run a local query version subscribed to a replicated input stream, without contention between them, to overcome the low clock frequency. We ran the query with different degrees of inter-query parallelism. The selection predicate is false; however, the query would also reach more than a billion tuples per second with 10% selectivity, as shown in Table 1. The results can be found in Figure 7.

Figure 7: Stream Query Results (output tuples per second [tp/s] over the number of inter-query threads, 8 to 256)

The scaling of throughput is close to ideal, where doubling the number of threads doubles the overall throughput. However, including concurrent actions like synchronized access to data structures reduces scalability the more threads are added. When using DDR4 memory, which provides only around 80 GB/s bandwidth, the highest throughput is reached at 64 threads with approximately 200 million tuples per second (not shown in the plot). More than 64 threads degrade performance on DDR4, since the threads exceed the bandwidth and thus idle while the memory controllers finish their requests.
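The structure of this experiment can be approximated with a small standalone micro-benchmark: each thread scans its own replica of an in-memory key array, applies the selection predicate, and passes qualifying tuples to an empty UDF. This is a simplified stand-in under our own assumptions (plain std::thread, a synthetic integer stream), not the actual PipeFabric query used for the measurements above.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
  const std::size_t numThreads = 256;          // one local query per hardware thread (KNL setting)
  const std::size_t tuplesPerThread = 1 << 24; // synthetic in-memory input per replica
  std::atomic<std::uint64_t> totalOutput{0};

  std::vector<std::thread> workers;
  auto start = std::chrono::steady_clock::now();
  for (std::size_t t = 0; t < numThreads; ++t) {
    workers.emplace_back([&totalOutput, tuplesPerThread] {
      // Replicated input stream: every thread owns a private copy of the keys,
      // so there is no contention between the local queries.
      std::vector<std::uint64_t> input(tuplesPerThread);
      std::iota(input.begin(), input.end(), 0);
      std::uint64_t passed = 0;
      for (std::uint64_t key : input) {
        if (key % 10 == 0) {                   // selection predicate, ~10% selectivity
          ++passed;                            // "empty UDF": consume the tuple, do nothing
        }
      }
      totalOutput.fetch_add(passed, std::memory_order_relaxed);
    });
  }
  for (auto& w : workers) w.join();
  auto secs = std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();

  std::cout << "processed " << numThreads * tuplesPerThread << " input tuples in " << secs
            << " s (" << (numThreads * tuplesPerThread) / secs << " tp/s), "
            << totalOutput.load() << " output tuples\n";
  return 0;
}
```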
7. CONCLUSION
In this paper, we presented PipeFabric, an SPE developed at our department with a focus on scale-up performance. First, we gave an overview of other well-known SPEs, classified by their decision on scaling up or scaling out. After that, we briefly described stream processing characteristics on the basis of PipeFabric. In addition to the basic concepts, we discussed various streaming paradigms to better utilize the given hardware, like partitioning of the data flow or batching tuples.

Then, we turned to our current research, which is heavily influenced by modern hardware. We explained the challenges posed mainly by Manycore CPUs as well as HBM, followed by our recommendations and ideas to improve our SPE. Adaptive partitioning will allow queries to scale with data stream behavior, which is even more important on a Manycore CPU that can easily provide hundreds of partitions. In combination with an order-preserving merge step, results from the partitions can be reordered again, allowing further analysis downstream (like pattern matching). For long-running queries, we plan to add query modifications in real time, where operators can be added and removed without restarting the query, as well as online-changeable UDFs. To better utilize HBM, we will add allocators accordingly, leading to a cost model for an optimizer that decides on which memory type states should be placed. To further improve throughput under high contention, lockfree counterparts to our data structures will be added. And finally, since a binary join tree utilizes bandwidth badly and hurts individual latency, we plan to investigate multiway stream joins in the future.

After the discussion of modern hardware challenges, we described our experiments, in which a stream query written in PipeFabric and running on the Xeon Phi processor processed more than a billion tuples per second. Although the query is rather synthetic, it nevertheless underlines the potential of our SPE.
8. REFERENCES
[1] Arvind Arasu, Mitch Cherniack, Eduardo F. Galvez, David Maier, Anurag Maskey, Esther Ryvkina, Michael Stonebraker, and Richard Tibbetts. Linear Road: A Stream Data Management Benchmark. In VLDB Proceedings, Toronto, Canada, August 31 - September 3, 2004, pages 480–491, 2004.
[2] Goetz Graefe. Volcano - An Extensible and Parallel Query Evaluation System. IEEE Trans. Knowl. Data Eng., 6(1):120–135, 1994.
[3] Daniel J. Abadi, Donald Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, C. Erwin, Eduardo F. Galvez, M. Hatoun, Anurag Maskey, Alex Rasin, A. Singer, Michael Stonebraker, Nesime Tatbul, Ying Xing, R. Yan, and Stanley B. Zdonik. Aurora: A Data Stream Management System. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003, page 666, 2003.
[4] Magdalena Balazinska, Hari Balakrishnan, Samuel Madden, and Michael Stonebraker. Fault-Tolerance in the Borealis Distributed Stream Processing System. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, June 14-16, 2005, pages 13–24, 2005.
[5] Paris Carbone, Stephan Ewen, Gyula Fóra, Seif Haridi, Stefan Richter, and Kostas Tzoumas. State Management in Apache Flink: Consistent Stateful Distributed Stream Processing. PVLDB, 10(12):1718–1729, 2017.
[6] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthikeyan Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy V. Ryaboy. Storm @Twitter. In SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pages 147–156, 2014.
[7] Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters. In 4th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud'12, Boston, MA, USA, June 12-13, 2012, 2012.
[8] Alexandros Koliousis, Matthias Weidlich, Raul Castro Fernandez, Alexander L. Wolf, Paolo Costa, and Peter R. Pietzuch. SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 555–569, 2016.
[9] Hongyu Miao, Heejin Park, Myeongjae Jeon, Gennady Pekhimenko, Kathryn S. McKinley, and Felix Xiaozhu Lin. StreamBox: Modern Stream Processing on a Multicore Machine. In 2017 USENIX Annual Technical Conference, USENIX ATC 2017, Santa Clara, CA, USA, July 12-14, 2017, pages 617–629, 2017.
[10] Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou, and Philippas Tsigas. ScaleJoin: A Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join. In 2015 IEEE International Conference on Big Data, Big Data 2015, Santa Clara, CA, USA, October 29 - November 1, 2015, pages 144–153, 2015.
[11] Philipp Goetze and Kai-Uwe Sattler. Snapshot Isolation for Transactional Stream Processing. In Proceedings of the 22nd International Conference on Extending Database Technology, EDBT 2019, OpenProceedings.org, March 2019.
[12] Christopher Mutschler and Michael Philippsen. Distributed Low-Latency Out-of-Order Event Processing for High Data Rate Sensor Streams. In 27th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2013, Cambridge, MA, USA, May 20-24, 2013, pages 1133–1144, 2013.
[13] Adrian Bartnik, Bonaventura Del Monte, Tilmann Rabl, and Volker Markl. On-the-fly Reconfiguration of Query Plans for Stateful Stream Processing Engines. In BTW Proceedings, March 4-8, 2019, Rostock, Germany, pages 127–146, 2019.
[14] Constantin Pohl. Stream Processing on High-Bandwidth Memory. In Proceedings of the 30th GI-Workshop Grundlagen von Datenbanken, Wuppertal, Germany, May 22-25, 2018, pages 41–46, 2018.