=Paper=
{{Paper
|id=Vol-1826/paper7
|storemode=property
|title=VISP Testbed - A Toolkit for Modeling and Evaluating Resource Provisioning Algorithms for Stream Processing Applications
|pdfUrl=https://ceur-ws.org/Vol-1826/paper7.pdf
|volume=Vol-1826
|authors=Christoph Hochreiner
|dblpUrl=https://dblp.org/rec/conf/zeus/Hochreiner17
}}
==VISP Testbed - A Toolkit for Modeling and Evaluating Resource Provisioning Algorithms for Stream Processing Applications==
Christoph Hochreiner
Distributed Systems Group, TU Wien, Vienna, Austria
c.hochreiner@infosys.tuwien.ac.at
Abstract. Inspired by the current transition towards a data-centric
society, more and more researchers investigate cost-efficient resource
provisioning algorithms for stream processing applications. While these
algorithms already cover a variety of different optimization strategies,
they are rarely benchmarked against each other, which makes it almost impossible to compare them.
To resolve this problem, we propose the VISP Testbed, a toolkit which
eases the development of new provisioning algorithms by providing a
runtime for stream processing applications, a library of stream processing
topologies, and baseline algorithms to systematically benchmark new
algorithms.
Keywords: Data Stream Processing, Testbed, Reproducibility
1 Introduction
Due to the current transition towards a data-centric society, the research area of data stream processing is gaining more and more traction to tackle the challenges regarding the volume, variety, and velocity of unbounded streaming data [10] as well as the different geographic locations of data sources [5]. A common approach to implementing stream processing applications is to decompose the data processing into single steps, i.e., operators, and to compose them into topologies as shown in Fig. 1. These topologies can then be enacted by Stream Processing Engines (SPEs), like IBM System S [3], Apache Storm [14], Apache Spark [19], CSA [12], or VISP [7].
Fig. 1. Stream Processing Topology (temperature sensor data from manufacturing machines is processed by the operators Filter Temperature (1), Analyze Production Data (2), Transform Data (3), Calculate OEE (4), and Update Dashboard (5))
The majority of these SPEs are designed to process large volumes of data; nevertheless, they often fail to adapt to varying data volumes. In typical scenarios, like the processing of sensor data originating from manufacturing machines as depicted in Fig. 1, the volume of sensor data is often subject to change due to the variable number of running machines, e.g., because of maintenance downtimes or error scenarios. This variable volume of data requires different amounts of computational resources to process the data in (near) real-time. To comply with the changing resource requirements, it is either possible to over-provision the SPE, i.e., allocate sufficient resources to cover all peaks, or to scale the SPE on demand, i.e., add computational resources only when they are required.
Fig. 2 demonstrates an exemplary scenario for an elastic SPE, which adapts at runtime to the resource requirements of the stream processing application. At the beginning, the incoming data volume is low, and it is sufficient to instantiate only one operator of each kind. After some time, the data volume increases, and individual operators have to be replicated to deal with the incoming data volume. As soon as the data volume decreases again, the SPE can release some of the replicated operators while maintaining (near) real-time data processing capabilities. Because on-demand resource provisioning is not trivial, several researchers have started to investigate different resource provisioning strategies [1, 6, 9, 17].
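To make this adaptation behavior concrete, the following minimal sketch derives the required number of replicas of a single operator from the incoming data rate and scales up or releases replicas accordingly. All names and the assumed per-replica capacity are illustrative; the sketch is not part of VISP.

```java
// Minimal sketch of the elastic scaling behaviour described above; all names
// and the per-replica capacity are illustrative assumptions, not VISP APIs.
public class ElasticScalingSketch {

    // Assumed processing capacity of a single operator replica (items/second).
    private static final double ITEMS_PER_REPLICA_PER_SECOND = 50.0;

    // Decide how many replicas of one operator are needed for a given load.
    static int requiredReplicas(double incomingItemsPerSecond) {
        return Math.max(1, (int) Math.ceil(incomingItemsPerSecond / ITEMS_PER_REPLICA_PER_SECOND));
    }

    public static void main(String[] args) {
        // Example load curve: low volume, a peak, and a decrease again (cf. Fig. 2).
        double[] incomingRate = {20, 40, 120, 200, 180, 90, 30};
        int replicas = 1;
        for (double rate : incomingRate) {
            int needed = requiredReplicas(rate);
            if (needed > replicas) {
                System.out.printf("rate=%.0f items/s -> scale up from %d to %d replicas%n", rate, replicas, needed);
            } else if (needed < replicas) {
                System.out.printf("rate=%.0f items/s -> release %d replica(s)%n", rate, replicas - needed);
            }
            replicas = needed;
        }
    }
}
```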
Although each of these optimization algorithms follows an individual optimization strategy, they share one common challenge when it comes to the evaluation of the algorithm. Up to now, none of the established SPEs provides extensive resource optimization interfaces that can be used to implement and evaluate custom resource provisioning algorithms. While dedicated benchmarking frameworks are available for other areas, like iFogSim [4] for fog environments or Gerbil [15] for semantic entity annotations, there are hardly any projects for resource provisioning algorithms in the stream processing domain. Although several benchmarking projects exist, like streaming-benchmarks (https://github.com/yahoo/streaming-benchmarks) or flotilla (https://github.com/tylertreat/Flotilla), the scientific community has picked up this topic only recently [13]. Nevertheless, to the best of our knowledge, there is no framework to benchmark resource provisioning algorithms for SPEs.
Fig. 2. Resource Requirements for a Deployed Stream Processing Topology (incoming data volume and the corresponding number of deployed stream processing operator replicas, i.e., computational resources, over time)
Due to the lack of such a framework, each research group is required to implement its own evaluation environment and to evaluate its custom algorithm against baseline algorithms instead of state-of-the-art algorithms. To fill this gap, we propose the VISP Testbed as part of our Vienna ecosystem for elastic stream processing, which not only provides a dedicated interface for new resource provisioning algorithms but also different data generation patterns, topologies, and baseline algorithms to benchmark new custom algorithms and create reproducible evaluations.
Fig. 3. VISP Testbed (Data Provider with arrival pattern generation, speed generation, and duration; VISP Runtime with resource optimization based on baseline and custom algorithms, resource provisioning, data stream processing, QoS monitoring, activity tracking, and topologies; Reporting with data aggregation, KPI calculation, and graph generation)
2 System Design
One of the most crucial challenges for realizing a testbed is the data that is used for the evaluation. This data needs to be recorded from real-world systems, which makes it difficult to obtain, because most data owners do not support the publication of their data. Although some data dumps are available for scientific purposes, e.g., those published in the course of the DEBS Grand Challenges (http://debs.org/?page_id=24), they are often only suited to evaluate the overall performance of an SPE and not the adaptation capabilities of a resource provisioning algorithm. Therefore, we decided to implement generic stream processing operators with artificial arrival patterns besides one concrete stream processing topology based on the T-Drive data set [18]. The artificial arrival patterns can be used to evaluate the adaptation capabilities of resource provisioning algorithms in a clearly defined scenario before they are evaluated against real-world scenarios without any predefined arrival patterns. The system design, as shown in Fig. 3, consists of three components, which are discussed in the remainder of this section.
2.1 Data Provider
The Data Provider component (https://github.com/visp-streaming/dataProvider) is designed to provide a variety of reproducible data streams as input for the VISP Runtime. These data streams can be configured based on their generation speed, i.e., how many data items should be generated per second, their generation duration, i.e., how long data should be generated, and the data generation pattern that should be applied. In order to simulate different load scenarios, we have selected four different arrival patterns, based on real-world file access patterns [16], as illustrated in Fig. 4. These arrival patterns pose different challenges to SPEs as well as to their resource provisioning algorithms.
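A minimal sketch of how these three settings could be expressed as a configuration object is shown below; the class and enum names as well as the example values are illustrative assumptions and do not reflect the actual Data Provider API.

```java
// Hypothetical configuration sketch for the three Data Provider settings named
// above (generation speed, generation duration, arrival pattern); names are assumed.
public class DataProviderConfigSketch {

    enum ArrivalPattern { CONSTANT, CONSTANT_INCREASE, PULSATING, SPIKES }

    static final class DataStreamConfig {
        final int itemsPerSecond;      // generation speed
        final int durationSeconds;     // generation duration
        final ArrivalPattern pattern;  // arrival pattern applied on top of the base speed

        DataStreamConfig(int itemsPerSecond, int durationSeconds, ArrivalPattern pattern) {
            this.itemsPerSecond = itemsPerSecond;
            this.durationSeconds = durationSeconds;
            this.pattern = pattern;
        }
    }

    public static void main(String[] args) {
        DataStreamConfig config = new DataStreamConfig(15, 30 * 60, ArrivalPattern.PULSATING);
        System.out.printf("Generate %d items/s for %d s using the %s pattern%n",
                config.itemsPerSecond, config.durationSeconds, config.pattern);
    }
}
```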
Fig. 4. Arrival Patterns (data volume over time for the constant arrival, constant increase, pulsating arrival, and spikes patterns)
The simplest arrival pattern is the constant arrival, which generates a constant volume of data. This pattern is predestined to test the overall functionality of a resource provisioning algorithm without generating any need for adaptation after the initial provisioning.
The constant increase pattern is designed to apply stress tests to the SPE and to the resource provisioning algorithm. This pattern is suited to determine the maximal time for which the resource provisioning algorithm can provide sufficient computational resources for the SPE to comply with given Service Level Agreements (SLAs).
The pulsating arrival pattern generates a constantly changing volume of data. This pattern requires the resource provisioning algorithm to constantly update the computational resources for the SPE and is designed to test the adaptation capabilities of the resource provisioning algorithm.
The last arrival pattern is the spikes pattern, which has similar properties to the constant arrival, apart from short data volume bursts. These short bursts pose significant challenges to any resource provisioning algorithm, because it may be prone to allocate a lot of resources for this short time instead of applying other compensation mechanisms, like tolerating SLA violations for a short time.
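The following sketch illustrates the four arrival patterns as simple rate functions over time; the base rates, the slope, the pulse shape, and the spike interval are assumed values chosen only to mimic the qualitative behavior described above and in Fig. 4.

```java
// Illustrative rate functions for the four arrival patterns; all constants are
// assumptions chosen only to resemble the curves sketched in Fig. 4.
public class ArrivalPatternSketch {

    static double constantArrival(int t)  { return 10; }                         // fixed volume
    static double constantIncrease(int t) { return 5 + 0.5 * t; }                // linear stress test
    static double pulsatingArrival(int t) { return 10 + 5 * Math.sin(t / 2.0); } // constantly changing volume
    static double spikes(int t)           { return (t % 12 == 0) ? 20 : 8; }     // short bursts on a constant base

    public static void main(String[] args) {
        System.out.println(" t constant increase pulsating spikes");
        for (int t = 0; t <= 35; t += 5) {
            System.out.printf("%2d %8.1f %8.1f %9.1f %6.1f%n",
                    t, constantArrival(t), constantIncrease(t), pulsatingArrival(t), spikes(t));
        }
    }
}
```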
2.2 VISP Runtime
The core component of the VISP Testbed is the SPE, i.e., the VISP Runtime (https://github.com/visp-streaming/runtime), as presented in our previous work [6, 7]. The VISP Runtime is designed to process incoming data according to a predefined topology, like the one shown in Fig. 1. Besides the actual processing of the incoming data stream, the VISP Runtime also takes care of the resource provisioning for operators and the resource optimization thereof based on given algorithms. The Quality of Service (QoS) monitoring component of the VISP Runtime monitors different aspects of the data processing, like the processing duration of single data items, the latency between two operators, or the individual resource consumption of a single operator. Furthermore, the VISP Runtime also features an activity tracking component, which tracks all activities within the VISP Runtime, like the upscaling or downscaling of operators and the leasing or releasing of computational resources from a cloud resource provider. These metrics can be used not only for the resource optimization but also by the Reporting component, which automatically interprets the evaluation. In order to support future evaluations, we provide a basic library of topologies as well as algorithms.
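The kind of data collected by the QoS monitoring and activity tracking components could be represented by simple records such as the following sketch; all class and field names are assumptions for illustration and not the actual VISP Runtime data model.

```java
import java.time.Instant;

// Hypothetical sketch of QoS monitoring and activity tracking records.
public class MonitoringSketch {

    // One QoS measurement for a single data item processed by one operator.
    static final class QosRecord {
        final String operatorId;
        final long processingDurationMs;        // per-item processing duration
        final long operatorToOperatorLatencyMs; // latency between two operators
        final double cpuShare;                  // resource consumption of this operator

        QosRecord(String operatorId, long processingDurationMs, long latencyMs, double cpuShare) {
            this.operatorId = operatorId;
            this.processingDurationMs = processingDurationMs;
            this.operatorToOperatorLatencyMs = latencyMs;
            this.cpuShare = cpuShare;
        }
    }

    // One entry of the activity tracking component, e.g., scaling or VM leasing.
    static final class ActivityRecord {
        final Instant timestamp;
        final String activity; // e.g., "UPSCALE filter", "LEASE vm-3", "RELEASE vm-2"

        ActivityRecord(Instant timestamp, String activity) {
            this.timestamp = timestamp;
            this.activity = activity;
        }
    }

    public static void main(String[] args) {
        QosRecord qos = new QosRecord("filter", 42, 7, 0.35);
        ActivityRecord act = new ActivityRecord(Instant.now(), "UPSCALE filter");
        System.out.println(qos.operatorId + " took " + qos.processingDurationMs + " ms; " + act.activity);
    }
}
```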
Evaluation Topologies Our basic library of topologies is motivated by the SAP R/3 reference model [2]. Although the SAP R/3 reference model was originally designed to provide a large variety of business processes for benchmarking purposes, we realized that today's topologies have similar structures and operations, like data replication (AND-operations) or conditional routes (XOR-operations). Based on this realization, we have selected 10 exemplary topologies based on our previous work [8] to evaluate different scenarios. These exemplary topologies can be equipped with different operators, such as an instant forward operator, a data aggregation operator, or a busy-waiting operator to trigger a specific CPU or memory load for each data item.
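A busy-waiting operator of the kind mentioned above could, for instance, look like the following sketch, which burns a configurable amount of CPU time per data item and then forwards the item unchanged; the class name and parameter are hypothetical and not the operators shipped with the topology library.

```java
// Sketch of a busy-waiting evaluation operator that creates a deterministic
// CPU load per data item; names and values are illustrative assumptions.
public class BusyWaitOperatorSketch {

    private final long busyMillisPerItem;

    BusyWaitOperatorSketch(long busyMillisPerItem) {
        this.busyMillisPerItem = busyMillisPerItem;
    }

    // Process one data item: forward it unchanged after a fixed amount of CPU work.
    String process(String item) {
        long end = System.nanoTime() + busyMillisPerItem * 1_000_000;
        while (System.nanoTime() < end) {
            // busy loop to keep the CPU occupied for the configured time
        }
        return item;
    }

    public static void main(String[] args) {
        BusyWaitOperatorSketch op = new BusyWaitOperatorSketch(20); // 20 ms of CPU work per item
        System.out.println(op.process("sensor-reading-1"));
    }
}
```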
Baseline Algorithms Currently, the VISP Testbed offers two baseline algorithms, whose configuration, e.g., thresholds, can be parametrized if required.
For the first algorithm, we have selected the fixed provisioning algorithm, where the resource provisioning is specified before the evaluation starts and does not change, regardless of the actual data volume. This approach allows modeling under-provisioning scenarios, where the SPE does not have enough resources at hand to process all data in (near) real-time at peak times, as well as over-provisioning scenarios, where a segment of the computational resources is not required most of the time. These two scenarios are suitable to provide a lower baseline for the QoS-related attributes, i.e., in an under-provisioning scenario, and an upper baseline for the computational cost, i.e., in an over-provisioning scenario.
For the second baseline algorithm, we have selected a threshold-based algorithm that adapts the computational resources on demand. This algorithm replicates operators when a specific Key Performance Indicator (KPI), e.g., the processing duration, exceeds a threshold and scales them down when these resources are not required anymore. This algorithm provides a more realistic baseline for custom resource provisioning algorithms than the fixed one because it is able to elastically react to varying incoming data volumes.
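The following sketch outlines the scaling decision of such a threshold-based baseline under assumed thresholds and names: it adds a replica of an operator when the observed processing duration exceeds an upper threshold and removes one when the duration stays below a lower threshold.

```java
// Sketch of the threshold-based baseline; the thresholds, the chosen KPI
// (processing duration), and the sample values are assumptions.
public class ThresholdScalingSketch {

    private static final double UPSCALE_THRESHOLD_MS = 100.0;
    private static final double DOWNSCALE_THRESHOLD_MS = 30.0;

    // Returns the new replica count for one operator based on the observed KPI.
    static int adjustReplicas(int currentReplicas, double processingDurationMs) {
        if (processingDurationMs > UPSCALE_THRESHOLD_MS) {
            return currentReplicas + 1;                      // KPI violated: add a replica
        }
        if (processingDurationMs < DOWNSCALE_THRESHOLD_MS && currentReplicas > 1) {
            return currentReplicas - 1;                      // resources no longer required
        }
        return currentReplicas;                              // keep the current provisioning
    }

    public static void main(String[] args) {
        double[] observedDurationsMs = {40, 120, 150, 80, 25, 20};
        int replicas = 1;
        for (double duration : observedDurationsMs) {
            replicas = adjustReplicas(replicas, duration);
            System.out.printf("duration=%.0f ms -> %d replica(s)%n", duration, replicas);
        }
    }
}
```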
2.3 Reporting
The final component of the VISP Testbed is the Reporting component, which is currently integrated into the VISP Runtime. This component aggregates the monitoring data from both the Data Provider and the VISP Runtime to automatically generate evaluation reports. For our evaluation reports, we distinguish between textual reports, which feature quantitative KPIs, such as the total cost for computational resources, the performed provisioning operations, or the SLA compliance of the overall data processing, and a graphical representation. For the graphical representation, the Reporting component interprets the monitoring time series of the evaluation and generates diagrams with the help of gnuplot (http://gnuplot.sourceforge.net). These graphical representations can then be used for a more detailed analysis of the resource provisioning algorithm and to identify further optimization potential.
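As an illustration of this reporting step, the following sketch aggregates a series of per-item processing durations into an SLA-compliance KPI and writes a two-column time series that gnuplot can plot; the SLA limit, the file name, and the sample data are assumptions, not output of the actual Reporting component.

```java
import java.io.IOException;
import java.io.PrintWriter;

// Sketch of a reporting step: compute one quantitative KPI (SLA compliance)
// and export a time series for graph generation with gnuplot.
public class ReportingSketch {

    public static void main(String[] args) throws IOException {
        long slaLimitMs = 100;                   // assumed SLA: process each item within 100 ms
        long[] processingDurationsMs = {40, 80, 120, 95, 60, 140, 70};

        long compliant = 0;
        for (long duration : processingDurationsMs) {
            if (duration <= slaLimitMs) {
                compliant++;
            }
        }
        double slaCompliance = 100.0 * compliant / processingDurationsMs.length;
        System.out.printf("SLA compliance: %.1f%%%n", slaCompliance);

        // Write a simple two-column time series; gnuplot can render it with
        //   plot "durations.dat" using 1:2 with lines
        try (PrintWriter out = new PrintWriter("durations.dat")) {
            for (int t = 0; t < processingDurationsMs.length; t++) {
                out.println(t + " " + processingDurationsMs[t]);
            }
        }
    }
}
```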
3 Conclusion
In this paper, we have presented the VISP Testbed, which provides a basic toolkit to conduct reproducible and comparable evaluations with the goal of supporting the design of future resource provisioning algorithms. Until now, the VISP Testbed only features a small library of topologies and baseline algorithms, but we plan to extend the topology library with concrete topologies from the manufacturing domain [11].
Furthermore, we plan to implement more resource provisioning algorithms based on our ongoing research as well as algorithms from other researchers to provide more competitive baselines. Lastly, we also want to provide an additional probing component for the VISP Testbed to identify the minimal resource requirements of a specific stream processing operator, which can then later be used by the resource provisioning algorithms.
Acknowledgments This paper is supported by TU Wien research funds and by
the Commission of the European Union within the CREMA H2020-RIA project
(Grant agreement no. 637066).
References
1. Cardellini, V., Grassi, V., Lo Presti, F., Nardelli, M.: Optimal operator placement
for distributed stream processing applications. In: Proc. of the 10th Int. Conf. on
Distributed and Event-based Systems (DEBS). pp. 69–80. ACM (2016)
2. Curran, T.A., Keller, G.: SAP R/3 Business Blueprint: Understanding the Business
Process Reference Model. Prentice Hall PTR, Upper Saddle River (1997)
3. Gedik, B., Andrade, H., Wu, K.L., Yu, P.S., Doo, M.: SPADE: The System S
Declarative Stream Processing Engine. In: 2008 ACM SIGMOD Int. Conf. on
Management of Data. pp. 1123–1134 (2008)
4. Gupta, H., Dastjerdi, A.V., Ghosh, S.K., Buyya, R.: iFogSim: A Toolkit for Modeling
and Simulation of Resource Management Techniques in Internet of Things, Edge
and Fog Computing Environments. arXiv preprint arXiv:1606.02007 (2016)
5. Hochreiner, C., Schulte, S., Dustdar, S., Lecue, F.: Elastic Stream Processing for
Distributed Environments. IEEE Internet Computing 19(6), 54–59 (2015)
6. Hochreiner, C., Vögler, M., Schulte, S., Dustdar, S.: Elastic Stream Processing
for the Internet of Things. In: 9th Int. Conf. on Cloud Computing (CLOUD). pp.
100–107. IEEE (2016)
7. Hochreiner, C., Vögler, M., Waibel, P., Dustdar, S.: VISP: An Ecosystem for
Elastic Data Stream Processing for the Internet of Things. In: 20th Int. Enterprise
Distributed Object Computing Conf. (EDOC). pp. 19–29. IEEE (2016)
8. Hoenisch, P., Schuller, D., Schulte, S., Hochreiner, C., Dustdar, S.: Optimization of
complex elastic processes. Trans. on Services Computing 9(5), 700–713 (2016)
9. Lohrmann, B., Janacik, P., Kao, O.: Elastic stream processing with latency guaran-
tees. In: 35th Int. Conf. on Dist. Comp. Systems (ICDCS). pp. 399–410 (2015)
10. McAfee, A., Brynjolfsson, E., Davenport, T.H., Patil, D., Barton, D.: Big data. The
management revolution. Harvard Bus Rev 90(10), 61–67 (2012)
11. Schulte, S., Hoenisch, P., Hochreiner, C., Dustdar, S., Klusch, M., Schuller, D.: To-
wards process support for cloud manufacturing. In: 18th Int. Enterprise Distributed
Object Computing Conf. (EDOC). pp. 142–149. IEEE (2014)
12. Shen, Z., Kumaran, V., Franklin, M.J., Krishnamurthy, S., Bhat, A., Kumar, M.,
Lerche, R., Macpherson, K.: Csa: Streaming engine for internet of things. Data
Engineering pp. 39–50 (2015)
13. Shukla, A., Simmhan, Y.: Benchmarking distributed stream processing platforms
for iot applications. In: Performance Evaluation and Benchmarking: Traditional
to Big Data to Internet of Things - 8th TPC Technology Conference, TPCTC. pp.
NN–NN (2016)
14. Toshniwal, A., Taneja, S., Shukla, A., Ramasamy, K., Patel, J.M., Kulkarni, S.,
Jackson, J., Gade, K., Fu, M., Donham, J., Bhagat, N., Mittal, S., Ryaboy, D.:
Storm@twitter. In: 2014 ACM SIGMOD Int. Conf. on Management of Data. pp.
147–156 (2014)
15. Usbeck, R., Röder, M., Ngonga Ngomo, A.C., Baron, C., Both, A., Brümmer, M.,
Ceccarelli, D., Cornolti, M., Cherix, D., Eickmann, B., et al.: GERBIL: General
entity annotator benchmarking framework. In: Proc. of the 24th Int. Conf. on World
Wide Web. pp. 1133–1143. ACM (2015)
16. Waibel, P., Hochreiner, C., Schulte, S.: Cost-efficient data redundancy in the cloud.
In: 9th International Conference on Service-Oriented Computing and Applications
(SOCA). pp. 1–9. IEEE (2016)
17. Xu, L., Peng, B., Gupta, I.: Stela: Enabling stream processing systems to scale-in
and scale-out on-demand. In: Int. Conf. on Cloud Engineering (IC2E). IEEE (2016)
18. Yuan, J., Zheng, Y., Zhang, C., Xie, W., Xie, X., Sun, G., Huang, Y.: T-drive:
Driving directions based on taxi trajectories. ACM SIGSPATIAL GIS 2010 (2010)
19. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: Cluster
computing with working sets. HotCloud 10, 10–17 (2010)