=Paper=
{{Paper
|id=Vol-1744/paper6
|storemode=property
|title=Quality Analysis and Verification of Data-Intensive Applications
|pdfUrl=https://ceur-ws.org/Vol-1744/paper6.pdf
|volume=Vol-1744
|authors=Francesco Marconi
}}
==Quality Analysis and Verification of Data-Intensive Applications==
<pdf width="1500px">https://ceur-ws.org/Vol-1744/paper6.pdf</pdf>
<pre>
           Quality Analysis and Verification of
              Data-Intensive Applications

                                  Francesco Marconi

                       DEIB, Politecnico di Milano, Milan, Italy
                            francesco.marconi@polimi.it


       Abstract Data-intensive applications are gaining momentum in multiple
       domains of science and business, thanks to the technological advancements
       of distributed systems and the increased awareness of the importance
       of data. However, the fast-paced development of such technologies is
       followed by a lack of methodological support for the design of such
       applications. Quality requirements regarding safety and data privacy are
       often overlooked by designers, increasing the number and severity of
       quality-related failures, imposing heavy refactoring of the applications.
       This work aims at providing a methodology and tool enabling the formal
       analysis and verification of quality properties in data-intensive application.
       The preliminary stage of this work is focused on safety analysis of topology-
       based streaming applications, but the approach could be extended to
       other technologies and encompass a wider range of properties.


Keywords: Data-intensive Applications, Distributed Systems, Formal Verifica-
tion, Metric Temporal Logic, Apache Storm


1    Introduction

 Big Data technologies have shown significant development and diffusion in re-
 cent years. The so called “data-intensive” applications [6], or DIAs, exploit the
 increasing maturity and accessibility of emerging technologies (such as cloud
 computing, MapReduce/Hadoop, NoSQL databases, stream processing) in order
 to process massive quantities of data in a timely manner. Since these technologies
 are inherently complex and lack of standard development methodologies, both
 academia and industry are involved in researching innovative solutions to support
 the entire life-cycle of DIAs, with particular attention to quality requirements.
     The DICE project [11] defines techniques and tools for the data-aware quality-
 driven development of DIAs. It offers tools for quality assessment, architecture
 enhancement, agile delivery and continuous testing of DIAs. As contributors to
 the DICE project we are investigating the usage of formal methods to perform
 safety and privacy analysis on the design of DIAs.
     The research question addressed by this work is:
“How can we verify quality properties, such as safety and privacy, of data-intensive
applications?”.


                                       1
Copyright © 2016 for this paper by its authors. Copying permitted for private
and academic purposes
2       Francesco Marconi

This question, having such a broad scope, triggers other challenges, such as:
(i) Which (application-dependent) properties could we verify? Associated to
which technologies? (ii) How can we devise a mechanism to automatically carry
out formal verification, allowing the designer (i.e., not a formal methods expert)
to define the application without having to directly deal with the details of the
formal model?
    Before describing our approach, it is appropriate to remark that there are
multiple dimensions to take into consideration in order to perform formal analysis
and verification on DIAs. These dimensions include: (i) functional aspects of
the application: DIAs are typically composed of multiple distributed and multi-
threaded components, each of them performing some processing on data. Relevant
features, such as operations performed, input and output data, can be considered
(and captured in the model) with different granularity. Depending on the goal
of the analysis, it might be relevant to represent the operations at the thread
level, at the component level or even at a coarser level of granularity. Details add
complexity to the model, impacting the computation time to run verification,
so a trade-off is often needed. (ii) non-functional aspects are, by definition, of
primary importance when talking about quality analysis. Different features can
be targeted, such as performance, availability, safety, privacy, etc. In our case
all the relevant non-functional aspects are related to time. (iii) infrastructural
aspects need to be considered in DIAs: applications typically run on distributed
systems that add a further layer of complexity to the analysis. Also in this case,
the appropriate choice of the abstraction level is crucial. Focusing on one or more
of these dimensions will determine the kind of analysis to undertake.


2    Approach

Given the wide range of technologies and frameworks available in the Big Data
ecosystem, we decided to focus on a specific set of applications (i.e., streaming
applications), analyze it and identify related safety issues. We then devised a
formal model capturing the behavior of those applications, and allowing for the
representation of such issues as violations of safety properties.
     Many of these applications can be described by means of topologies, that
is, directed graphs of computation. Topologies are composed of two kinds of
nodes: computational nodes, representing the logical operations performed over
data, and source nodes, representing the entry point of data into the application.
Topologies are implemented differently across the various frameworks, but are
still able to represent, with a certain degree of abstraction, most of the streaming
applications. The work currently focuses on the Apache Storm [1] technology, a
well known streaming framework that allows for parallel, distributed, real-time
processing of large-scale streaming data on horizontally scalable systems.
     In Storm computational nodes are called bolts, and source nodes are called
spouts (see Fig. 1). All the nodes have dedicated input queues from which they
read data to be processed. One of the key concerns in Storm applications is
that time-related parameters, such as emission rates of data, may induce an
                                    Quality Analysis and Verification of DIAs     3


                        Figure 1: Example of Storm topology.


excessive load on the topology by accumulating data in bolts’ queues. Since
streaming applications are supposed to be permanently active, and the streams
of data consist in potentially infinite sequences of messages, the occurrence of
such overloads may lead to serious degradation of crucial quality aspects like
latency and throughput. For this reason we decided to capture such concern as
an unwanted behavior of the system against which the designer wants to analyze
the safety of the design.
    We devised novel a model called Timed Counter Networks (TCN) [9], in
which we modeled the behavior of spouts and bolts as parametric components
that can be used as building blocks for defining topologies. The parameters
characterizing each component are essentially: the parallelism level (number of
threads executing its operations) for both spouts and bolts, the processing time
of single messages and the kind of function performed (filter, transformation,
join of multiple streams) for bolts, the average number of messages produced per
time unit for spouts. The model is expressed through Constraint LTL over-clocks
(CLTLoc) [4] metric temporal logic enriched with positive counters. The choice of
the formalism is motivated by its flexibility and by the fact that it is supported
(provided some limited extensions) by the Zot1 bounded model/satisfiability
checker. This allowed us to define a tool supported mechanism [3] for automating
the verification.
    The tool, named D-VerT2 , allows for the definition of topologies as UML
Class diagrams, conveniently annotated with a specific UML profile, and performs
a series of transformations to create the corresponding formal model and to run
verification on it (see Fig. 2). The verification outcome is presented to the user
both in a textual and in a graphical format, in order to provide a visual hint about
the potential design issues. The UML modeling, together with the automatic
transformation mechanism and the graphical display of results aim at hiding the
complexity of the underlying models and engines to the user, minimizing the
need of technical expertise in formal methods.
    In the current stage of development the tool supports verification of the
property concerning the aforementioned unbounded growth of load on bolts’
 1
     https://github.com/fm-polimi/zot
2
     https://github.com/dice-project/DICE-Verification
4        Francesco Marconi


              Figure 2: Architecture and execution flow of D-VerT


queues, i. e., “all bolts’ queues have a bounded number of elements”. We argue
that, if this property holds, then all bolts are able to process the incoming flow
of data in a timely manner. The property is disproved if there exists an execution
where at least one of the queues grows with an unbounded trend. This kind
of analysis is enabled by adapting the Bounded Satisfiability Checking (BSC)
technique used for CLTLoc [2,5] to deal with the addition of discrete counters that
makes Timed Counter Networks undecidable. More details about the verification
technique can be found in [9].
    The preliminary version of D-VerT is available either as part of the DICE
Platform3 or as a standalone application. As shown in Fig. 2, its client-server
design includes a client component, implemented as an Eclipse plug-in and a
RESTful web server component, distributed as a multi-container application.


3     Preliminary and expected results

We ran experiments on different topologies in order to have qualitative feedback
about the analysis performed by D-VerT and to collect quantitative measures
of the verification time. Some of the preliminary outcomes of such experiments
are presented in [9]. Under the qualitative point of view, the tool was able to
spot some meaningful design flaws on use-case topologies provided by industrial
partners. Execution times have shown a relevant dependency on the number of
nodes composing the topology, but also on the specific configuration of each node
(i. e., some topologies with evident design flaws took considerably less time to be
analyzed with respect to other nontrivial ones).
     Future works will be conducted on several directions. Our plans include:
(i) mitigating the state-space explosion problem registered when the model
includes too many components; (ii) investigating new technologies, such as
Apache Spark, Flink or Tez; (iii) reasoning about new properties to verify;
(iv) examining in depth the theoretical aspects of the timed counter networks
model in order to get new results in terms of soundness and completeness.
3
    https://github.com/dice-project/DICE-Platform
                                     Quality Analysis and Verification of DIAs          5

4    Related works
Various approaches exist for the formalization of distributed systems; however,
to the best of our knowledge, none focuses on topology-based DIAs. In the
context of DIAs, some works aim at verifying properties that are related to the
framework itself such as reliability and load balancing [12], or the validity of
messaging flow in MapReduce [13]. Our work is instead focused on supporting
“application-dependent” analysis. There are similarities between the analysis
proposed in our work and some aspects of the analysis of networks, as they both
deal with properties related to throughput and latency in topology-based systems.
However, the latter generally focuses on different kind of problems, such as the
verification of routing protocols [10].
    Timed counter networks are inspired from Vector Addition Systems with
States (VASS) [8] and Timed Petri Nets [7].
Acknowledgment This is a joint work with Marcello M. Bersani under the
supervision of Matteo G. Rossi, supported by Horizon 2020 project no. 644869
(DICE).


References
 1. Apache Storm, http://storm.apache.org/
 2. Bersani, M.M., Frigeri, A., Morzenti, A., Pradella, M., Rossi, M., Pietro, P.S.:
    Constraint LTL satisfiability checking without automata. J. Applied Logic 12(4),
    522–557 (2014)
 3. Bersani, M.M., Marconi, F., Rossi, M., Erascu, M.: A tool for verification of big-data
    applications. In: Proc. of QUDOS. pp. 44–45 (2016)
 4. Bersani, M.M., Rossi, M., San Pietro, P.: A tool for deciding the satisfiability of
    continuous-time metric temporal logic. In: Proc. of TIME. pp. 99–106 (2013)
 5. Bersani, M.M., Rossi, M., San Pietro, P.: A tool for deciding the satisfiability of
    continuous-time metric temporal logic. Acta Informatica 53(2), 171–206 (2016)
 6. Chen, C.P., Zhang, C.Y.: Data-intensive applications, challenges, techniques and
    technologies: A survey on big data. Information Sciences 275, 314–347 (2014)
 7. Furia, C.A., Mandrioli, D., Morzenti, A., Rossi, M.: Modeling Time in Computing.
    Monographs in Theoretical Computer Science. An EATCS Series, Springer (2012)
 8. Karp, R.M., Miller, R.E.: Parallel program schemata. Journal of Computer and
    System Sciences 3(2), 147 – 195 (1969)
 9. Marconi, F., Bersani, M., Erascu, M., Rossi, M.: Towards the formal verification
    of data-intensive applications through metric temporal logic. In: Proc. of ICFEM
    2016. LNCS, vol. 10009, pp. 1–17 (2016), to appear
10. Qadir, J., Hasan, O.: Applying formal methods to networking: Theory, techniques,
    and applications. IEEE Communications Surveys & Tutorials 17(1), 256–291 (2015)
11. The DICE Consortium: DICE Verification Repository (Jan, 2016), url: https:
    //github.com/dice-project/DICE-Verification
12. Tommaso Di Noia, M.M., Sciascio, E.D.: A computational model for mapreduce
    job flow (2014)
13. Yang, F., Su, W., Zhu, H., Li, Q.: Formalizing mapreduce with csp. In: Proceedings
    of ECBS. pp. 358–367. IEEE Computer Society (2010)

</pre>