Rhino: Efficient Management of Very Large
Distributed State for Stream Processing Engines
[Abstract]
Bonaventura Del Monte1, Steffen Zeuch1,2, Tilmann Rabl3 and Volker Markl1,2
1 Technische Universität Berlin, Einsteinufer 17, 10587 Berlin, Germany
2 DFKI GmbH, Alt-Moabit 91c, 10559 Berlin, Germany
3 Hasso Plattner Institute, Campus II, Building F, 1st Floor, August-Bebel-Str. 88, 14482 Potsdam, Germany


           Abstract
           Scale-out stream processing engines (SPEs) power large big data applications on high-velocity
           data streams. Industrial setups require SPEs to tolerate outages and varying data rates while
           providing low-latency processing. Thus, SPEs need to transparently reconfigure stateful queries
           at runtime. However, state-of-the-art SPEs are not yet ready to handle on-the-fly reconfigurations
           of queries with terabytes of state due to three problems. First, network overhead: a reconfiguration
           involves state migration between workers over the network, which increases resource utilization
           and adds latency proportional to the state size. Second, consistency: a reconfiguration has to
           guarantee exactly-once processing semantics through consistent state management and record
           routing. Third, processing overhead: a reconfiguration must have minimal impact on the
           performance of query processing.
               Today, several industrial and research solutions provide state migration. However, these solutions
           either restrict their scope to small state sizes or offer only limited on-the-fly reconfigurations.
           Apache Flink [1], Apache Spark [2], and Apache Samza [3] enable consistency, but at the expense of
           performance and network throughput: they support large, consistent operator state but restart the
           running query upon reconfiguration. Research prototypes such as Chi [4], Megaphone [5], and SEEP [6]
           address consistency and performance but not network overhead: they enable fine-grained
           reconfigurations but only support small state sizes (i.e., tens of gigabytes).
               To bridge the gap between stateful stream processing and operational efficiency via on-the-fly query
           reconfigurations and state migration, we propose Rhino. Rhino is a library for efficient management of
           very large distributed state compatible with SPEs based on the streaming dataflow paradigm [7]. Rhino
           enables on-the-fly reconfiguration of a running query to provide resource elasticity, fault tolerance, and
           runtime query optimizations, such as load balancing, in the presence of very large distributed state (i.e.,
           up to TBs). To this end, Rhino applies a state-centric, proactive replication protocol that
           asynchronously replicates the state of a running operator on a set of SPE workers through
           incremental checkpoints. Furthermore, Rhino applies a handover protocol that smoothly migrates
           the processing and state of a running operator among workers, without impacting query execution
           and while guaranteeing exactly-once processing (see the sketch after the keywords below).
               Overall, our evaluation shows that Rhino scales to state sizes of up to TBs, reconfigures a running
           query fifteen times faster than baseline solutions, and reduces latency by three orders of magnitude
           upon a reconfiguration. Rhino was originally published as a full research paper at the 2020 ACM
           SIGMOD conference [8]. Currently, we use Rhino in our NebulaStream data management platform for
           unified Cloud and Internet-of-Things environments [9].

           Keywords
           Distributed Stream Processing, Query Optimization, Fault Tolerance
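
To make the replication and handover protocols concrete, the following sketch illustrates their core idea in Java under simplifying assumptions: a key-value operator state and reliable, synchronous delta shipping. All names, such as StateReplicator, Worker, and Router, are hypothetical illustrations and do not reflect Rhino's actual interfaces.

    // A minimal sketch of Rhino's two protocols as described in the abstract.
    // All class and method names are hypothetical, not Rhino's real API.
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public final class RhinoProtocolSketch {

        interface Worker {                        // a replica that receives state deltas
            void applyDelta(long epoch, Map<String, byte[]> delta);
        }

        interface Router {                        // routes input records to workers
            void pauseAndBuffer();                // buffer records at an epoch boundary
            void redirectTo(Worker target);       // switch the routing table
            void replayBuffered();                // flush the buffer to the new worker
        }

        /** Proactive replication: each checkpoint ships only the entries that
         *  changed since the last one (an incremental checkpoint) to a set of
         *  replicas, asynchronously to record processing. */
        static final class StateReplicator {
            private final Map<String, byte[]> state = new ConcurrentHashMap<>();
            private final Set<String> dirty = ConcurrentHashMap.newKeySet();
            private final Set<Worker> replicas;

            StateReplicator(Set<Worker> replicas) { this.replicas = replicas; }

            void put(String key, byte[] value) {  // hot path: only track the delta
                state.put(key, value);
                dirty.add(key);
            }

            void incrementalCheckpoint(long epoch) {  // off the hot path
                Map<String, byte[]> delta = new HashMap<>();
                for (String k : dirty) delta.put(k, state.get(k));
                dirty.clear();
                for (Worker r : replicas) r.applyDelta(epoch, delta);
            }
        }

        /** Handover: because a replica already holds a near-complete copy, a
         *  migration only pauses input briefly at an epoch boundary, ships the
         *  last (small) delta, and reroutes records, so no record is lost or
         *  processed twice. */
        static void handover(StateReplicator source, Worker target, Router router, long epoch) {
            router.pauseAndBuffer();              // 1. cut the input at an epoch boundary
            source.incrementalCheckpoint(epoch);  // 2. ship the final, small delta
            router.redirectTo(target);            // 3. the target resumes from that epoch
            router.replayBuffered();              // 4. buffered records are replayed exactly once
        }
    }

Because replication runs proactively, the delta shipped in step 2 stays small regardless of the total state size, which is why the migration cost in this scheme does not grow with terabytes of state.
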
Acknowledgments
This work was funded by the German Ministry for Education and Research as BIFOLD - Berlin
Institute for the Foundations of Learning and Data (ref. 01IS18025A and 01IS18037A).


References
[1] A. Alexandrov, R. Bergmann, S. Ewen, J. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich,
    U. Leser, V. Markl, F. Naumann, M. Peters, A. Rheinländer, M. Sax, S. Schelter, M. Höger,
    K. Tzoumas, D. Warneke, The Stratosphere platform for big data analytics, The VLDB
    Journal (2014).
[2] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, I. Stoica, Discretized streams: Fault-tolerant
    streaming computation at scale, in: ACM SOSP, 2013.
[3] S. A. Noghabi, K. Paramasivam, Y. Pan, N. Ramesh, J. Bringhurst, I. Gupta, R. H. Campbell,
    Samza: Stateful scalable stream processing at LinkedIn, PVLDB (2017).
[4] L. Mai, K. Zeng, R. Potharaju, L. Xu, S. Venkataraman, P. Costa, T. Kim, S. Muthukrishnan,
    V. Kuppa, S. Dhulipalla, S. Rao, Chi: A scalable and programmable control plane for
    distributed stream processing systems, PVLDB (2018).
[5] M. Hoffmann, A. Lattuada, F. McSherry, V. Kalavri, T. Roscoe, Megaphone: Latency-
    conscious state migration for distributed streaming dataflows, PVLDB (2019).
[6] R. Castro Fernandez, M. Migliavacca, E. Kalyvianaki, P. Pietzuch, Integrating scale out and
    fault tolerance in stream processing using operator state management, in: ACM SIGMOD,
    2013.
[7] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernández-Moctezuma, R. Lax,
    S. McVeety, D. Mills, F. Perry, E. Schmidt, et al., The dataflow model: a practical approach
    to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data
    processing, PVLDB 8 (2015) 1792–1803.
[8] B. Del Monte, S. Zeuch, T. Rabl, V. Markl, Rhino: Efficient management of very large
    distributed state for stream processing engines, in: Proceedings of the 2020 ACM SIGMOD
    International Conference on Management of Data, SIGMOD ’20, Association for Computing
    Machinery, New York, NY, USA, 2020, pp. 2471–2486.
[9] S. Zeuch, A. Chaudhary, B. Del Monte, H. Gavriilidis, D. Giouroukis, P. Grulich, S. Breß,
    J. Traub, V. Markl, The NebulaStream platform: Data and application management for the
    Internet of Things, in: Conference on Innovative Data Systems Research (CIDR), 2020.




LWDA’22: Lernen, Wissen, Daten, Analysen. October 05–07, 2022, Hildesheim, Germany
delmonte@tu-berlin.de (B. Del Monte); steffen.zeuch@dfki.de (S. Zeuch); tilmann.rabl@hpi.de (T. Rabl);
volker.markl@tu-berlin.de (V. Markl)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.