                   How to become a (Throughput) Billionaire:
                   The Stream Processing Engine PipeFabric

                                                          Constantin Pohl
                                                        TU Ilmenau, Germany
                                               constantin.pohl@tu-ilmenau.de


ABSTRACT

The ability to process data in real time has gained more and more importance in recent years through the rise of IoT and Industry 4.0. Stream processing engines were developed to handle huge amounts of data with high throughput under tight latency constraints. Trends in modern hardware have led to further specializations to exploit its opportunities efficiently, like parallelization to multiple cores, vectorization, or awareness of NUMA.

In this paper, we present the stream processing engine PipeFabric, which is under ongoing development at our department. We describe internal concepts and stream semantics along with decisions taken in the design space. In addition, we show challenges posed by modern hardware that we are considering to improve the performance and usability of our engine. Finally, we underline the potential of PipeFabric by running parallelized queries on a single Xeon Phi processor, resulting in about 1.3 billion tuples processed per second.

Keywords

Stream Processing, SPE, PipeFabric, Xeon Phi

1. INTRODUCTION

Various applications require processing and analysis of data continuously with short response times. For example, smart manufacturing machines of Industry 4.0 use sensors to stream their status information, allowing anomalies in their behavior to be detected and corrected as fast as possible.

In the early 2000s, the first stream processing engines (often referred to as SPEs) were published, clearly outperforming relational DBMS for this task [1]. The one-tuple-at-a-time concept (also known as Volcano style from Graefe [2]) for data streaming allowed SPEs to keep individual tuple latencies low. To increase throughput in terms of tuples processed per second, micro-batching strategies as well as data parallelization by partitioning were applied and refined over time.

Recent work focuses on the exploitation of modern hardware, since memory as well as processors tend to become more and more specialized to better solve different requirements of applications. GPUs, Multi- and Manycore CPUs, FPGAs, or even Vector Engines on the processor side, and HBM, NVM, HDD, or SSD on the memory side show massively different behavior under different tasks and come with various configurations and challenges to utilize them efficiently. To combine high-throughput as well as low-latency data processing with the opportunities given by modern hardware, we introduce our SPE PipeFabric (open source: https://github.com/dbis-ilm/pipefabric) in this paper, along with its concepts and design decisions.

2. RELATED WORK

Many SPEs have been published in the past, some being frameworks developed by research groups, others being commercially used engines in industry. In this section, we give a short overview of selected SPEs, classified into the scale-up (single node) and scale-out (distributed) principle.

Aurora [3] was one of the first general-purpose SPEs that specialized in answering queries on data streams in real time. Queries are described by directed graphs, connecting different operators together and thus forming the data flow. Since Aurora was designed to run on a single node, the Borealis [4] SPE added fault tolerance and consistency to run in a distributed setting.

Other recent distributed SPEs are Apache Flink [5] (forked from the Stratosphere engine), Apache Storm [6], and Apache Spark Streaming [7]. All of them can be commonly found in various companies and have a large user base. Their main goal, in addition to low latency and high throughput, is scalability along with fault tolerance within a distributed setting. Processing Big Data under real-time constraints eventually requires distributing the computation to multiple machines in a cluster, since scale-up is limited.

Nevertheless, scale-up solutions can also go very far for a fraction of the monetary cost of a distributed solution. SABER [8] and StreamBox [9] are two SPEs that are optimized to run on a single node. While SABER can be executed on heterogeneous processing units like GPUs, StreamBox can run on Manycore CPUs, supporting out-of-order tuple processing.

Finally, our SPE PipeFabric can be classified into the field of scale-up SPEs, focused on the efficient execution of queries on Multicore and Manycore CPUs.

31st GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 11.06.2019 - 14.06.2019, Saarburg, Germany. Copyright is held by the author/owner(s).
3. STREAM PROCESSING PARADIGM

With the goal of low latency in mind, the general streaming workflow follows the one-tuple-at-a-time strategy. However, if latency requirements can be relaxed, gathering tuples into batches reduces communication efforts (like function calls) and enables vectorized processing, which can greatly increase throughput.

PipeFabric provides a query structure called Topology (like Apache Flink), which contains one or more streaming sources, operators applied on them, and optional stream sinks. Topologies can conceptually be seen as directed acyclic graphs routing tuples through different operators (see Figure 1).

Figure 1: Example of a Stream Query

This query example consists of two input data streams, two operators applied on each of them (specifying the key attribute and window semantics), combined by a join with a final grouping on a certain attribute.

Operators (called Pipes) are connected via channels, following the publish/subscribe pattern. They can connect to any operator upstream by subscribing, publishing their own results to other operators downstream. To reduce overhead, only tuple pointers are passed between them. The following subsections briefly describe the different sources, operators, and sinks in our SPE.
3.1 Stream Sources

Stream sources produce tuples for individual queries and are therefore the necessary starting points of a query. The main sources for streaming are tuples provided via different network protocols, files (or tables), other streams, or specialized sources.

Network protocols. The most common sources for tuples are servers or sensors that deliver data to be processed continuously. PipeFabric can connect via REST API, RabbitMQ, Apache Kafka, MQTT, and ZeroMQ. The protocol logic for the connection is realized internally within the parametrizable source operators, hidden from the user.

Files/Tables. Data streams can also subscribe to different files like CSVs or binaries, as well as to relational tables. Therefore, a query can also run different benchmarks provided as files on the file system. A special use case are tables, e.g. from RocksDB, which also allow transactional semantics on operations under ACID guarantees.

Streams. Another source is the subscription to already defined PipeFabric streams. This allows queries to publish their results conceptually as a new data stream to which other queries can subscribe.

Specialized sources. In addition, PipeFabric provides various specialized source operators for different use cases. To run the Linear Road benchmark [1], a synchronized source is provided, publishing tuples from a file in real time according to their timestamps. Another specialized source is the data generator, which continuously generates tuples according to a format specified by the user. One last example is the matrix source, sending rows, columns, or even full matrices represented as tuples.

3.2 Operators

PipeFabric supports various operator types that are applied to incoming tuples. The most common single-source operators are the projection of attributes, applied predicates (selections), aggregations, and groupings. Each of them has its own operator which can also be configured, e.g. to choose an aggregation type (count, sum, etc.). To join multiple sources, PipeFabric uses the non-blocking symmetric hash join in addition to the recently published ScaleJoin algorithm [10]. With a customizable operator called notify, it is also possible to apply arbitrary UDFs to incoming tuples via lambda functions.

3.3 Stream Sinks

Sinks are operators which logically terminate a stream or query. It is possible to write query results on the fly into files, tables, or a new data stream. Results can also be returned as general output, e.g. for visualization in a GUI. However, PipeFabric currently does not have its own visualization tool like, for example, Aurora has.

4. STREAMING CONCEPTS

In this section, we further describe streaming concepts of SPEs which are also realized in PipeFabric.

Partitioning/Merging. To utilize intra-query parallelism, it is possible to create multiple instances of the same operator, splitting tuples with a partitioning function and merging the results of the partitions afterwards. This concept is shown in Figure 2.

Figure 2: Partitioning and Merge Schema

Each partition is run by a separate thread, exchanging tuples through a synchronized queue. This allows the utilization of all cores on a Multicore or even Manycore CPU, increasing throughput massively if the computational requirements within a partition are high enough to justify the synchronization efforts.
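As a sketch of how such a partitioned dataflow could look on a Topology, the following query splits an expensive per-tuple UDF into four partitions and merges their outputs again. The partitionBy and merge operator names are assumptions based on the PipeFabric examples and are meant as an illustration, not as a definitive interface description.

#include <iostream>
#include "pfabric.hpp"

using namespace pfabric;

typedef TuplePtr<int, double> InTuple;

int main() {
  PFabricContext ctx;
  auto t = ctx.createTopology();

  auto s = t->newStreamFromFile("sensor_data.csv")
    .extract<InTuple>(',')
    // route tuples to one of four partitions based on the key attribute
    .partitionBy([](auto tp) { return get<0>(tp) % 4; }, 4)
    // the partitioned operator: an arbitrary (expensive) per-tuple UDF
    .notify([](auto tp, bool) { /* heavy per-tuple computation */ })
    // collect the results of all partitions into a single stream again
    .merge()
    .print(std::cout);

  t->start();
  return 0;
}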
Batching/Unbatching. As previously mentioned, batching tuples together reduces communication efforts (especially between threads) and enables vectorized execution of operations. In PipeFabric, a batch as well as an unbatch operator is provided. The batch operator stores incoming tuple pointers internally until a given batch size is reached, then forwards the batch at once by creating and passing a batch pointer to the next operator. The unbatch operator does the opposite, extracting the tuple pointers from a batch and forwarding them again one after another.

Window. A window operator tracks incoming tuples in order to mark them as outdated after a while. Outdated tuples do not participate in stateful operations like aggregates or joins and are removed from those states for further calculations. Long-running queries can therefore discard tuples after a while, keeping the memory footprint low and manageable. The most common window algorithms are the tumbling and the sliding window. The former drops all tuples at once when its size is reached, while the latter slowly fades out tuples individually. Figure 3 visualizes the concept for a sliding window calculating an aggregate.

Figure 3: Sliding Window Semantics

Both window types can discard tuples based on time or on tuple count. PipeFabric uses a list data structure for the window state internally to allow efficient appending and removing at both ends of the list.
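The list-based window state can be sketched with a double-ended queue: new tuples are appended at the back, and tuples whose timestamp falls outside the window size are evicted from the front and reported as outdated. The following is a minimal sketch of this eviction logic, not the actual PipeFabric implementation.

#include <cstdint>
#include <deque>
#include <functional>

// Minimal sketch of a time-based sliding window state (not PipeFabric code):
// tuples are appended at the back; tuples older than `windowSize` are evicted
// from the front and passed to an outdated-callback so that downstream
// stateful operators (aggregates, joins) can remove them from their state.
template <typename Tuple>
class SlidingWindow {
public:
  using OutdatedFn = std::function<void(const Tuple&)>;

  SlidingWindow(std::uint64_t windowSize, OutdatedFn onOutdated)
    : windowSize_(windowSize), onOutdated_(std::move(onOutdated)) {}

  void insert(std::uint64_t timestamp, const Tuple& t) {
    state_.push_back({timestamp, t});
    // evict all tuples that left the window with the arrival of `t`
    while (!state_.empty() && state_.front().ts + windowSize_ < timestamp) {
      onOutdated_(state_.front().tuple);
      state_.pop_front();
    }
  }

private:
  struct Entry { std::uint64_t ts; Tuple tuple; };
  std::deque<Entry> state_;
  std::uint64_t windowSize_;
  OutdatedFn onOutdated_;
};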
Transaction Support. Transactions are a common concept in relational database systems. They wrap operations on tables together, providing ACID guarantees to always ensure the consistency of the database. Since PipeFabric can also stream data from or to tables, it contains basic transaction support for recovery and consistency [11]. For example, a query writing to a table executes its changes through running transactions, while multiple queries reading and writing additionally need isolation to guarantee correctness.

5. MODERN HARDWARE CHALLENGES

In this section, we give an overview of our ongoing work with PipeFabric regarding modern hardware, mainly focusing on Manycore CPU utilization and support for upcoming NVM technology. Manycore CPUs provide high core numbers and thus high thread counts, resulting in challenges in terms of synchronization and thread contention. It is also important to note that intra-query parallelism through multithreading usually leads to tuples arriving out of order after partitioning, which can be a problem for queries detecting patterns over time.

On the memory layer, NVM and also high-bandwidth memory (HBM) pose new challenges for stream processing. NVM offers persistence with latencies comparable to main memory (with a read/write asymmetry), which is interesting to explore especially for transactional operations on tables. HBM, on the other hand, offers great performance for applications that are memory bound, at the cost of small capacity, leading to the optimization problem of where to use it for the greatest performance benefit.

5.1 Adaptive Partitioning

Data stream behavior can change at runtime. This means that the number of tuples arriving per second can change, leading to different numbers of partitions that would be ideal at a given moment of the query. If the degree of partitioning is too high, computing resources are wasted, which is also a common problem for cloud providers. On the other hand, if the workload is underestimated, too few partitions cannot keep up with the tuple arrival rates, leading to wrong results because tuples are discarded due to full buffers.

Dynamic partitioning approaches address this problem by using a partitioning function that can be changed over time. This allows skewed streams to be countered by dynamically rerouting tuples to underutilized partitions. One step further is the adaptive partitioning strategy, where not only the function is variable but also the number of partitions can change. Figure 4 shows recent work on an adaptive partitioning strategy within PipeFabric, where the y-axis describes the number of tuples arriving per second. The partitions are directly converted into throughput to allow a better comparison.

Figure 4: Adaptive Partitioning Behavior (tuples per second over time [s]; legend: Dataset, Partitions [tp/s overall])

Such an approach, scaling the number of partitions up and down, raises mainly two problems:

• State migration when a partition is removed.

• The decision for scaling, i.e. when to add or remove a partition.

The migration problem can be solved by stopping the query, performing the state migration, and resuming. However, this can break latency constraints, since processing cannot continue during a stop. A better proposed solution is to create a parallel state which receives duplicates of new tuples until both states are equal; then the original state can be dropped safely.

PipeFabric currently uses a static partitioning concept where neither the number of partitions nor the partitioning function changes. At the moment we are investigating possibilities and options to apply an adaptive approach to fully utilize a Manycore CPU under skewed data stream behavior.
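One conceivable scaling heuristic, shown below as a conceptual sketch and not as PipeFabric code, observes the fill level of the partition input queues and adds or removes a partition when the load stays above or below a threshold.

#include <cstddef>

// Conceptual sketch of a threshold-based scaling decision for adaptive
// partitioning (not part of PipeFabric): a controller periodically samples
// the average fill level of the partition input queues (0.0 = empty,
// 1.0 = full) and derives the partition count for the next interval.
std::size_t decidePartitionCount(std::size_t current,
                                 double avgQueueFill,
                                 std::size_t maxPartitions) {
  if (avgQueueFill > 0.8 && current < maxPartitions)
    return current + 1;   // partitions fall behind: scale up
  if (avgQueueFill < 0.2 && current > 1)
    return current - 1;   // resources are wasted: scale down
  return current;         // keep the current degree of partitioning
}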
5.2 Order-Preserved Merging

As mentioned in the last section, partitioning can lead to out-of-order tuples afterwards, e.g. when a predicate within one partition drops more or fewer tuples than in another, or when a hash table for joining tuples hits a chain of cache misses on probing. Ordering the output without blocking results can be difficult.

A solution for this problem is to store incoming tuples in separate queues per partition. The merge operation can then check and compare the first elements of all queues, forwarding the oldest tuple among them (see Figure 5 for the general idea).

Figure 5: Order-Preserved Merging

For the special case that one partition does not produce any output, it is possible to add a dummy element after a certain time, guaranteeing ordered execution only within a certain time frame. This strategy is also known as the k-slack algorithm [12].

Regarding PipeFabric, we would like to combine an order-preserved merge with an adaptive partitioning approach. This means that on a change of the partition number, the sorted merge operation incurs additional synchronization overhead with the adaptive partitioner, since it has to know which partition will be removed soon.
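The merge step described above amounts to a k-way merge over the per-partition queues by timestamp. The following sketch (generic C++, not the PipeFabric operator) forwards the oldest head element across all queues and can only proceed while no queue is empty, which is exactly where the k-slack dummy elements come in.

#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

// Sketch of an order-preserving merge over per-partition queues. Each
// partition appends its output to its own queue in timestamp order; the
// merge repeatedly forwards the queue head with the smallest timestamp.
// If any queue is empty, the merge must wait (or rely on dummy/k-slack
// elements) before it can decide which tuple is globally the oldest.
template <typename Tuple>
struct Entry { std::uint64_t ts; Tuple tuple; };

template <typename Tuple>
std::optional<Entry<Tuple>> mergeNext(std::vector<std::deque<Entry<Tuple>>>& queues) {
  std::size_t oldest = queues.size();
  for (std::size_t i = 0; i < queues.size(); ++i) {
    if (queues[i].empty())
      return std::nullopt;   // cannot decide yet: one partition is silent
    if (oldest == queues.size() || queues[i].front().ts < queues[oldest].front().ts)
      oldest = i;
  }
  Entry<Tuple> e = queues[oldest].front();
  queues[oldest].pop_front();
  return e;                  // the globally oldest tuple, safe to forward
}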
5.3 Real-time Query Modification

Even if not directly hardware-related, it would be a quality-of-life feature to be able to change a query (which possibly runs for weeks or months) without restarting it [13]. Over a longer period, the query states can become huge, especially when the query has a lot of operators, not to mention the time lost when the state has to be migrated into a new query. For real-time query modification, we would like to address the following use cases:

• Adding or removing an operator within the query dataflow.

• Changing the function (UDF) of a single operator.

For our SPE PipeFabric, these features need an additional controller thread that can be invoked by the user to trigger a query modification. To add or remove an operator, the previous as well as the next operator within the query need a notification to stop exchanging tuples. After that notification, the new operator must be created and connected to the others with publish/subscribe channels. When the connection is finished, new notifications must be sent to allow tuple exchange again.

When the UDF of an operator is changed, only this operator is involved in the modification. This means that the function cannot be executed while it is being changed, again making notifications necessary.

For both modifications, tuples have to be buffered while the query is changed. Ideally, the modification is done during a delay in the tuple arrival of the data stream; otherwise a short blocking is inevitable.

5.4 HBM Allocation on States

One of our previous works [14] investigated the impact of HBM on different query states. We concluded that for operations with small states like aggregates it is not useful at all, while windows can benefit a little from more bandwidth. Stream sources, on the other hand, greatly benefit from HBM.

In a follow-up work, we will integrate HBM detection within our SPE along with the provision of custom HBM state allocators. In addition, we would like to add a cost model for HBM to allow a query optimizer to choose between the different memory types. Finally, since the symmetric hash join improves only marginally with more bandwidth (being mostly latency bound), we are investigating different stream join algorithms to improve bandwidth utilization, especially on a Xeon Phi processor.
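Such a custom state allocator could, for instance, be built on top of the memkind library's hbwmalloc interface, as in the minimal sketch below. It assumes the hbw_malloc/hbw_free calls of memkind and is not yet part of PipeFabric.

#include <cstddef>
#include <new>
#include <vector>
#include <hbwmalloc.h>   // memkind's high-bandwidth memory interface

// Minimal sketch of an HBM-backed STL allocator (assumes the memkind
// library; not yet part of PipeFabric). State containers created with this
// allocator place their memory in MCDRAM when the machine exposes it.
template <typename T>
struct HBMAllocator {
  using value_type = T;
  HBMAllocator() = default;
  template <typename U> HBMAllocator(const HBMAllocator<U>&) {}

  T* allocate(std::size_t n) {
    void* p = hbw_malloc(n * sizeof(T));
    if (!p) throw std::bad_alloc();
    return static_cast<T*>(p);
  }
  void deallocate(T* p, std::size_t) { hbw_free(p); }
};

template <typename T, typename U>
bool operator==(const HBMAllocator<T>&, const HBMAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const HBMAllocator<T>&, const HBMAllocator<U>&) { return false; }

// Example: a window state whose tuple buffer lives in high-bandwidth memory.
using HBMWindowState = std::vector<double, HBMAllocator<double>>;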
5.5 Lockfree Data Structures

With Manycore CPUs, the degree of contention on shared data structures can nullify any advantage of parallelization. The usage of fine-grained locks or latches along with optimistic concurrency protocols can improve scalability a lot. Recent work on PipeFabric investigated lockfree data structures for states that are accessed by multiple threads concurrently.

Queues between threads are a prominent example, where specialized lockfree queues (so-called Single Producer Single Consumer (SPSC) queues) realized as ring buffers greatly enhance performance. Hash tables for joins are another common structure that benefits from the lockfree paradigm, which is not restricted to stream processing but also applies to joins on relational tables.

Lockfree programming usually has a huge disadvantage when it comes to debugging or guaranteeing thread safety. However, there are high-level abstractions in libraries like Boost (https://www.boost.org/doc/libs/1_66_0/doc/html/lockfree.html) or Intel TBB (https://software.intel.com/en-us/node/506169) which hide lockfree operations behind a user-friendly interface. Currently, PipeFabric still uses locks and latches, but to improve throughput, we plan to add lockfree counterparts in the near future.
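The core of such an SPSC queue is a fixed-size ring buffer in which producer and consumer only synchronize through two atomic indices. The sketch below is a simplified illustration (single producer, single consumer, capacity a power of two), not the queue used inside PipeFabric.

#include <atomic>
#include <cstddef>
#include <optional>
#include <vector>

// Simplified lockfree SPSC ring buffer (illustration only). Exactly one
// producer calls push() and exactly one consumer calls pop(); they
// synchronize solely via two atomic indices.
template <typename T>
class SpscQueue {
public:
  explicit SpscQueue(std::size_t capacityPow2)
    : buf_(capacityPow2), mask_(capacityPow2 - 1) {}

  bool push(const T& item) {                   // producer thread only
    auto head = head_.load(std::memory_order_relaxed);
    auto tail = tail_.load(std::memory_order_acquire);
    if (head - tail == buf_.size()) return false;    // full
    buf_[head & mask_] = item;
    head_.store(head + 1, std::memory_order_release);
    return true;
  }

  std::optional<T> pop() {                     // consumer thread only
    auto tail = tail_.load(std::memory_order_relaxed);
    auto head = head_.load(std::memory_order_acquire);
    if (tail == head) return std::nullopt;            // empty
    T item = buf_[tail & mask_];
    tail_.store(tail + 1, std::memory_order_release);
    return item;
  }

private:
  std::vector<T> buf_;
  const std::size_t mask_;
  std::atomic<std::size_t> head_{0};  // next slot to write
  std::atomic<std::size_t> tail_{0};  // next slot to read
};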

5.6 Multiway-Stream Join

The symmetric hash join within PipeFabric is a binary hash join, which means that it can only join two connected input streams. To join a higher number of stream sources, the current solution leads to a binary tree of symmetric hash join operators, with the following problems:

• Individual tuple latency can become extremely bad if a tuple has to be joined repeatedly from the bottom up within the tree.

• Since intermediate join results are fully materialized in each join operator, the memory footprint of the intermediate hash tables can inevitably become very large.

Multiway join operators that can connect to any number of streams, on the other hand, look very promising, since a single join instance has a lot of opportunities to optimize tuple storage and probe sequences. Figure 6 shows the concept of a binary join tree compared to a multiway join.

Figure 6: Binary vs. Multiway Join

To efficiently join many concurrent data streams, the join operator has to minimize probe misses to reduce contention (e.g. by only probing when matches can be found in all hash tables). In addition, different parallelization strategies (like data parallelism or fully shared states) are possible, which we would like to investigate in future work.
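To make the probe-minimization idea concrete, the following sketch shows a symmetric three-way hash join that inserts a new tuple into the table of its own stream, checks key existence in all other tables first, and only then materializes result combinations containing the new tuple. It is a conceptual illustration, not a PipeFabric operator; window eviction is omitted.

#include <functional>
#include <unordered_map>

// Conceptual sketch of a symmetric three-way hash join. A new tuple is
// inserted into the hash table of its own stream; the other tables are only
// probed if the key exists in all of them, so no partial results are built.
struct Tuple { long key; /* payload omitted */ };

class ThreeWayJoin {
public:
  // callback receives the stream index of the new tuple and the joined triple
  using Output = std::function<void(int, const Tuple&, const Tuple&, const Tuple&)>;
  explicit ThreeWayJoin(Output out) : out_(std::move(out)) {}

  void process(int src, const Tuple& t) {        // src is 0, 1, or 2
    tables_[src].emplace(t.key, t);
    for (int i = 0; i < 3; ++i)                  // cheap existence check first
      if (i != src && tables_[i].count(t.key) == 0) return;
    int o1 = (src + 1) % 3, o2 = (src + 2) % 3;  // probe only for combinations
    auto r1 = tables_[o1].equal_range(t.key);    // that contain the new tuple
    auto r2 = tables_[o2].equal_range(t.key);
    for (auto a = r1.first; a != r1.second; ++a)
      for (auto b = r2.first; b != r2.second; ++b)
        out_(src, t, a->second, b->second);
  }

private:
  std::unordered_multimap<long, Tuple> tables_[3];
  Output out_;
};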
6. EXPERIMENTAL EVALUATION

In this section, we prove the statement in the title of this paper experimentally. First, we list our experimental setup, followed by the results and their discussion.

6.1 Setup

On the hardware side, we used a Xeon Phi Knights Landing Manycore CPU (KNL 7210) with 64 cores at 1.3 GHz, each supporting up to 4 threads due to hyperthreading. It runs in SNC4 mode, which means that the cores are distributed into four distinct regions classified as NUMA nodes. Along with the CPU comes 16 GB of HBM on chip, the so-called Multi-Channel DRAM (MCDRAM). This MCDRAM provides over 420 GB/s memory bandwidth and is configured in Flat mode; therefore it must be addressed manually via Numactl (https://www.systutorials.com/docs/linux/man/8-numactl/) or the Memkind API (http://memkind.github.io/memkind/), else it is not used at all.

We built our SPE PipeFabric with the Intel compiler version 17.0.6. The operating system is CentOS version 7 running Linux kernel 3.10. The most important compilation flags are -O3 for code optimization and -xMIC-AVX512 for auto-vectorization with the AVX512 instruction set.

6.2 How to become a Billionaire

A query that is able to process a billion tuples per second obviously needs some tuning along with simplifications. First, all input streams are fully allocated within the MCDRAM; there is no regular DDR4 RAM involved, not to speak of disks like SSDs. The stream query only applies a selection predicate on each input tuple, forwarding those tuples that satisfy the predicate to an empty UDF operator. More computations lead to more work for the threads, reducing the overall throughput. The different predicates as well as their measured impact on performance are summarized in Table 1.

  Predicate    | Selectivity | tp/s (256 threads)
  -------------|-------------|-------------------
  true         | 100%        | 720M
  key mod(2)   | 50%         | 937M
  key mod(10)  | 10%         | 1.16B
  false        | 0%          | 1.39B

Table 1: Selection Operator

Next, the query is realized with inter-query parallelism, which means that all 256 threads of the KNL run a local query version subscribed to a replicated input stream, without contention between them, to overcome the low clock frequency. We ran the query with different degrees of inter-query parallelism. The selection predicate is false; however, the query would also allow more than a billion tuples per second with 10% selectivity, as shown in Table 1. The results can be found in Figure 7.

Figure 7: Stream Query Results (output tuples per second vs. number of inter-query threads)

The scaling of throughput is close to ideal, where doubling the number of threads doubles the overall throughput. However, adding concurrent actions like synchronized access to data structures reduces the scalability the more threads are added. When using DDR4 memory, which has only around 80 GB/s bandwidth, the highest throughput is reached at 64 threads with approximately 200 million tuples per second (not shown in the plot). More than 64 threads degrade performance on DDR4, since the threads exceed the bandwidth and are idling while the memory controllers finish their requests.
7. CONCLUSION

In this paper, we presented PipeFabric, an SPE developed at our department with a focus on scale-up performance. First, we gave an overview of other well-known SPEs, classified by their decision to scale up or scale out. After that, we briefly described stream processing characteristics on the basis of PipeFabric. In addition to the basic concepts, we extended the section by discussing various streaming paradigms to better utilize the given hardware, like partitioning of the data flow or batching of tuples.

Then, we turned to our current research, heavily influenced by modern hardware. We explained the challenges posed mainly by Manycore CPUs as well as HBM, followed by our recommendations and ideas to improve our SPE. Adaptive partitioning will allow queries to scale with data stream behavior, which is even more important on a Manycore CPU that can easily provide hundreds of partitions. In combination with an order-preserved merge step, results from the partitioning can be reordered again, allowing further analysis downstream (like pattern matching). For long-running queries, we plan to add query modifications in real time, where operators can be added as well as removed without restarting the query, along with online-changeable UDFs. To better utilize HBM, we will add allocators accordingly, leading to a cost model for an optimizer to decide on which memory type states should be placed. To further improve throughput under high contention, lockfree counterparts to the data structures we use will be added. And finally, since a binary join tree utilizes bandwidth badly and hurts individual latency, we plan to investigate multiway stream joins in the future.

After the discussion of modern hardware challenges, we described our experiments, in which we were able to create a stream query written in PipeFabric, running on the Xeon Phi processor, that finally led to more than a billion tuples processed per second. Although the query is a rather synthetic one, it nevertheless underlines the potential of our SPE.

8. REFERENCES

[1] Arvind Arasu, Mitch Cherniack, Eduardo F. Galvez, David Maier, Anurag Maskey, Esther Ryvkina, Michael Stonebraker, and Richard Tibbetts. Linear Road: A Stream Data Management Benchmark. In VLDB Proceedings, Toronto, Canada, August 31 - September 3, 2004, pages 480–491, 2004.

[2] Goetz Graefe. Volcano - An Extensible and Parallel Query Evaluation System. IEEE Trans. Knowl. Data Eng., 6(1):120–135, 1994.

[3] Daniel J. Abadi, Donald Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, C. Erwin, Eduardo F. Galvez, M. Hatoun, Anurag Maskey, Alex Rasin, A. Singer, Michael Stonebraker, Nesime Tatbul, Ying Xing, R. Yan, and Stanley B. Zdonik. Aurora: A Data Stream Management System. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003, page 666, 2003.

[4] Magdalena Balazinska, Hari Balakrishnan, Samuel Madden, and Michael Stonebraker. Fault-Tolerance in the Borealis Distributed Stream Processing System. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, June 14-16, 2005, pages 13–24, 2005.

[5] Paris Carbone, Stephan Ewen, Gyula Fóra, Seif Haridi, Stefan Richter, and Kostas Tzoumas. State Management in Apache Flink: Consistent Stateful Distributed Stream Processing. PVLDB, 10(12):1718–1729, 2017.

[6] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthikeyan Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy V. Ryaboy. Storm @Twitter. In SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pages 147–156, 2014.

[7] Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters. In 4th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud'12, Boston, MA, USA, June 12-13, 2012, 2012.

[8] Alexandros Koliousis, Matthias Weidlich, Raul Castro Fernandez, Alexander L. Wolf, Paolo Costa, and Peter R. Pietzuch. SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 555–569, 2016.

[9] Hongyu Miao, Heejin Park, Myeongjae Jeon, Gennady Pekhimenko, Kathryn S. McKinley, and Felix Xiaozhu Lin. StreamBox: Modern Stream Processing on a Multicore Machine. In 2017 USENIX Annual Technical Conference, USENIX ATC 2017, Santa Clara, CA, USA, July 12-14, 2017, pages 617–629, 2017.

[10] Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou, and Philippas Tsigas. ScaleJoin: A Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join. In 2015 IEEE International Conference on Big Data, Big Data 2015, Santa Clara, CA, USA, October 29 - November 1, 2015, pages 144–153, 2015.

[11] Philipp Goetze and Kai-Uwe Sattler. Snapshot Isolation for Transactional Stream Processing. In Proceedings of the 22nd International Conference on Extending Database Technology, EDBT 2019. OpenProceedings.org, March 2019.

[12] Christopher Mutschler and Michael Philippsen. Distributed Low-Latency Out-of-Order Event Processing for High Data Rate Sensor Streams. In 27th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2013, Cambridge, MA, USA, May 20-24, 2013, pages 1133–1144, 2013.

[13] Adrian Bartnik, Bonaventura Del Monte, Tilmann Rabl, and Volker Markl. On-the-fly Reconfiguration of Query Plans for Stateful Stream Processing Engines. In BTW Proceedings, 4.-8. March 2019, Rostock, Germany, pages 127–146, 2019.

[14] Constantin Pohl. Stream Processing on High-Bandwidth Memory. In Proceedings of the 30th GI-Workshop Grundlagen von Datenbanken, Wuppertal, Germany, May 22-25, 2018, pages 41–46, 2018.