MONDO-SAM: A Framework to Systematically Assess MDE Scalability

Benedek Izsó, Gábor Szárnyas, István Ráth and Dániel Varró
Fault Tolerant Systems Research Group
Department of Measurement and Information Systems
Budapest University of Technology and Economics
H-1117, Magyar Tudósok krt. 2., Budapest, Hungary
{izso, szarnyas, rath, varro}@mit.bme.hu ∗

∗ This work was partially supported by the CERTIMOT (ERC HU-09-01-2010-0003) and MONDO (EU ICT-611125) projects, partly during the fourth author's sabbatical.

BigMDE '14, July 24, 2014, York, UK. Copyright © 2014 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

ABSTRACT
Processing models efficiently is an important productivity factor in Model-Driven Engineering (MDE) processes. To optimize a toolchain so that it meets the scalability requirements of complex MDE scenarios, reliable performance measurements of different tools are key enablers that help select the best tool for a given workload. To enable systematic and reproducible benchmarking across different domains, scenarios and workloads, we propose MONDO-SAM, an extensible MDE benchmarking framework. Beyond providing easily reusable features for common benchmarking tasks that are based on best practices, the framework puts special emphasis on metrics, which enables scalability analysis along different problem characteristics. To illustrate the practical applicability of our proposal, we demonstrate how different variants of a model validation benchmark, featuring several MDE tools from various technological domains, have been integrated into the system.

1. INTRODUCTION
As Model-Driven Engineering (MDE) has gained mainstream momentum in complex system development domains over the past decade, the scalability issues associated with MDE tools and technologies are now well known [6]. To address these challenges, the community has responded with a multitude of benchmarks.

The majority of these efforts have been created by tool providers to measure the performance development of specific engines [8, 2]. As a notable exception, the Transformation Tool Contest (TTC) [1] attempts cross-technology comparison by proposing multiple cases which are solved by the authors of (mainly EMF-based) MDE tools. TTC cases focus on measuring query and transformation execution time against instance models of increasing size. TTC promotes reproducibility by providing pre-configured virtual machines on which the individual tools can be executed; however, the very nature of this environment and its limited resources make precise comparison difficult.

Benchmarks are also used outside of the MDE community. SP2Bench [7] and the Berlin SPARQL Benchmark (BSBM) [3] are SPARQL benchmarks over semantic databases (triple stores). The former uses RDF models based on the real-world DBLP bibliography database, while the latter is centered around an e-commerce case study. Both benchmarks scale up the size of the models (up to 25M and 150B elements, respectively); however, SP2Bench does not consider model modifications, and BSBM does not detail query and instance model complexity. SPLODGE [4] is a similar approach, in which SPARQL queries are generated systematically, based on metrics, for a predefined dataset. Queries are scaled up to three navigations (joins), but other metrics, such as the complexity of the instance model, were not investigated. A common technological characteristic of these benchmarks is that they are frequently run on very large computer systems that are not accessible to most users, or rely on commercial software components that are hard to obtain.

To summarize, currently available graph-based benchmarks are affected by two main issues: (i) technologically, they are frequently built on virtualized architectures or have exotic dependencies, making measurements hard to reproduce independently; and (ii) conceptually, they typically only analyze measurement results against a limited view of the problem: the execution time of a fixed task scaled against increasing model size. As a result, the relative complexity of current benchmarks cannot be precisely quantified, which makes them difficult to compare with each other.
In previous work [5], we found that other metrics (such as various query complexity measures, instance model characteristics, and combinations of these) can affect the results very significantly. Building on these results, in this paper we propose the extensible MONDO-SAM framework, which is integrated into the official MONDO benchmark open repository¹. MONDO-SAM provides reusable benchmarking primitives (such as metrics evaluation, time measurement and result storage) that can be flexibly organized into benchmarking workflows specific to a given case study. MONDO-SAM also provides an API so that technologically different tools can be integrated into the framework in a uniform way. A unique emphasis of the framework is its built-in support for metrics calculation, which enables the characterization of benchmarking problems as published in [5]. The built-in reporting facility makes it possible to investigate the scalability of MDE tools along different metrics in diagrams. Finally, the entire framework and the integrated case studies can be compiled and run using the Maven build system, making deployment and reproducible execution feasible in any standard, Java-enabled computing environment.

¹ http://opensourceprojects.eu/p/mondo/d31-transformation-benchmarks/

Figure 1: Benchmarking process (artefacts are generated, scenarios are executed and measured, and performance values and metrics are analyzed).

Figure 2: Benchmark framework architecture (generator, benchmark, metrics and analyzer components built around a reusable core, with domain-, language- and tool-specific modules).

2. OVERVIEW OF THE FRAMEWORK

2.1 A process model for MDE benchmarks
The benchmarking process for MDD applications is depicted in Fig. 1. The inputs of the benchmark are the instance model, the queries run on the instance model, the transformation rules or modification logic, and a scenario definition (or workflow) describing the execution sequence. A scenario can describe MDD use cases (such as model validation, model transformation or incremental code generation), including warmup and teardown operations if required. Inputs can be derived from real-world applications, or generated synthetically, which provides complete control over the benchmark. The complexity of the input is characterized by metrics, while the scenario execution implementations are instrumented to measure resource consumption (wall-clock times, memory and I/O usage). Finally, the measured values and the calculated metrics are automatically visualized on diagrams to find the fastest tool, or to identify the performance improvements of a specific tool.

2.2 Architecture
The benchmark framework consists of four components, as depicted in Fig. 2. The generator component allows the synthetic generation of benchmark inputs. Its core module handles configuration, domain-specific modules describe how the input data (such as instance models and queries) is generated, and language-specific modules serialize the generated logical artifacts into files (such as EMF models or OCL queries). The selected domain constrains the applicable languages, as the concepts of the domain description must be supported. For example, transitivity and multi-level metamodeling are not supported by EMF, but the latter is required by the e-commerce case study of BSBM. Generated models should be semantically equivalent; however, it is an open question whether structural equality should also be preserved. For example, in certain cases EMF models must have a dedicated container object with containment relations to all objects, which is not required in RDF.

2.3 Core features

Benchmark component. The benchmark component (in Fig. 2) measures the performance of different tools for given cases. A case can be defined as a quintuple (D, S, M, Q, T), where D defines the domain, S the scenario, M the modification, Q the query and T the tool. The T modules implement tool-specific glue code and select D, S, M and Q. All modules reuse common functions of the core, such as configuration handling (with default values and tool-specific extensions), wall-clock time measurement with the highest available (nanosecond) precision (which does not imply the same accuracy), and momentary memory consumption, all recorded in a central place. At runtime, the language-specific modifications (transformations), queries and instance models of the selected domain must be available.
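To make the (D, S, M, Q, T) structure more concrete, the following minimal Java sketch shows how a benchmark case could be assembled and executed against a tool-specific adapter. All type and method names are illustrative assumptions for this description and do not reproduce the actual MONDO-SAM API.

/** Hypothetical representation of one benchmark case (D, S, M, Q, T). */
public final class BenchmarkCase {
    private final String domain;        // D: e.g. "railway"
    private final String scenario;      // S: e.g. "incremental validation"
    private final String modification;  // M: the model modification to apply
    private final String query;         // Q: e.g. "RouteSensor"
    private final Tool tool;            // T: tool-specific glue code

    public BenchmarkCase(String domain, String scenario, String modification,
                         String query, Tool tool) {
        this.domain = domain;
        this.scenario = scenario;
        this.modification = modification;
        this.query = query;
        this.tool = tool;
    }

    /** Runs the case and reports wall-clock times to the central result store. */
    public void run(ResultStore results) {
        tool.initialize(domain, scenario);
        long start = System.nanoTime();               // nanosecond-precision timer
        tool.executeQuery(query);
        results.record("check", System.nanoTime() - start);

        start = System.nanoTime();
        tool.applyModification(modification);
        results.record("edit", System.nanoTime() - start);
        // further phases (recheck, teardown) would follow in a full scenario
    }
}

/** Tool-specific glue code; each integrated tool contributes one implementation. */
interface Tool {
    void initialize(String domain, String scenario);
    void executeQuery(String query);
    void applyModification(String modification);
}

/** Central store where the measured values are recorded (cf. Sec. 2.3). */
interface ResultStore {
    void record(String phase, long nanos);
}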
Model instantiator. A common concern of the generator and the benchmark module is reproducibility. Tool-specific scenario implementations are cleanly separated by the scenario interfaces, and wherever generation or execution is randomized, a pseudo-random generator is used with its seed set to a predefined value. However, nondeterministic operations (such as choosing an element from a set) and the tool implementations themselves can still disperse results between runs.
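The fixed-seed policy can be illustrated with the following small Java sketch; the class name, the default seed value and the helper method are assumptions made for illustration, not the framework's own code.

import java.util.Random;

/** Illustrative generator whose random choices are reproducible across runs. */
public class ReproducibleGenerator {
    private static final long DEFAULT_SEED = 42L; // assumed, configurable default
    private final Random random;

    public ReproducibleGenerator(long seed) {
        this.random = new Random(seed); // same seed => same generated sequence
    }

    public ReproducibleGenerator() {
        this(DEFAULT_SEED);
    }

    /** Picks a reproducible pseudo-random element index for a model of the given size. */
    public int pickElementIndex(int modelSize) {
        return random.nextInt(modelSize);
    }
}

Note that a fixed seed alone does not remove every source of nondeterminism: iteration order over hash-based collections, for instance, can still vary, which is one reason results may disperse between runs as described above.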
Metrics evaluator. To describe the benchmark inputs with quantitative values, they are characterized by metrics, which are evaluated by the metrics component. Language-specific implementations analyze model–query pairs and store the calculated metric values centrally, gathered by the core, so that they can later be analyzed together with the measured values.

Result reporting and analysis. When the measurement and metrics data become available, the analyzer component (implemented in R) automatically creates an HTML report with diagrams. To show scalability along different measures, any metric can be selected for the x axis, while the y axis represents resource consumption. Raw data can be post-processed: dimensions can be changed (e.g. converting times to milliseconds to reflect their accuracy), and derived values can be calculated (e.g. the median of the incremental recheck steps, or the total processing time).

2.4 Best Practices to Minimize Validity Threats
During the execution of the cases, noise coming from the environment should be kept to a minimum. Possible sources of noise include the caching mechanisms of various components (e.g. the file system and the database management system), the warm-up effect of the runtime environment (e.g. the Java Virtual Machine), scheduled tasks (e.g. cron) and swapping. For large heaps, the garbage collector of the JVM can block the run for minutes, so minimizing its invocations is advised; this is achieved by setting the minimal and maximal heap sizes to the same value, thus eliminating GC calls triggered by heap expansion.

In the implementation of the framework components, only the minimal amount of libraries should be loaded. On the one hand, the proper organization of dependencies is the responsibility of the developer. On the other hand, it is enforced by the framework architecture: tool-specific implementations are independent and act as entry points that call into the framework, which uses inversion of control (IoC) without additional execution environments such as OSGi.

To alleviate random disturbances, each test case is run several times (e.g. ten times) by the framework, and the results are aggregated by the analyzer.
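As a sketch of how these practices could be combined when launching a case, the following Java snippet forks a fresh JVM per repetition with equal minimum and maximum heap sizes. The class name, the heap size, the repetition count and the benchmark.jar entry point are illustrative assumptions, not the framework's actual launcher.

import java.util.ArrayList;
import java.util.List;

public class BenchmarkRunner {
    public static void main(String[] args) throws Exception {
        int repetitions = 10; // e.g. ten runs, later aggregated by the analyzer

        for (int run = 0; run < repetitions; run++) {
            List<String> command = new ArrayList<>();
            command.add("java");
            command.add("-Xms8g");            // equal minimum and maximum heap size
            command.add("-Xmx8g");            // avoids GC calls triggered by heap expansion
            command.add("-jar");
            command.add("benchmark.jar");     // hypothetical benchmark entry point
            command.add("--run=" + run);

            // A fresh JVM per run keeps caching and JIT warm-up effects from leaking
            // between repetitions (unless warm-up is part of the measured scenario).
            Process process = new ProcessBuilder(command).inheritIO().start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                System.err.println("Run " + run + " failed with exit code " + exitCode);
            }
        }
    }
}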
3. INTEGRATED CASE STUDIES
The usability of the framework is demonstrated by four examples: three variations of the previously published Train Benchmark, and a new, soon to be released model comprehension benchmark.

3.1 Basic Train Benchmark
The first version of the Train Benchmark [9] compares the performance of EMF-IncQuery with Eclipse OCL and its incremental variant, the OCL Impact Analyzer, in an incremental model validation use case. Instance models are generated from a railway domain, and four hand-written queries (of different complexity) perform model validation tasks. The scenario starts with a model loading phase, where the instance model is read from a file, followed by a check phase, where a model validation query is executed (returning the constraint-violating elements). Afterwards, to simulate a user in front of an editor, multiple (100) edits and rechecks are performed. In this case, batch and incremental validation times as well as memory consumption were measured.
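The phase structure of this scenario (load, batch check, then repeated edit–recheck rounds) can be sketched as follows in Java. The interfaces and method names are illustrative placeholders rather than the actual Train Benchmark code.

public class ValidationScenario {
    /** Minimal set of tool operations the scenario needs; each tool supplies an adapter. */
    interface ValidationTool {
        void loadModel(String path);
        int runValidationQuery(String query); // returns the number of violating elements
        void applyEdit(int iteration);
    }

    interface ResultRecorder {
        void record(String phase, long nanos);
    }

    private static final int EDIT_RECHECK_ITERATIONS = 100;

    public void run(ValidationTool tool, ResultRecorder results, String modelFile, String query) {
        long start = System.nanoTime();
        tool.loadModel(modelFile);                          // load phase
        results.record("load", System.nanoTime() - start);

        start = System.nanoTime();
        tool.runValidationQuery(query);                     // batch check phase
        results.record("check", System.nanoTime() - start);

        for (int i = 0; i < EDIT_RECHECK_ITERATIONS; i++) { // simulate a user editing the model
            start = System.nanoTime();
            tool.applyEdit(i);
            results.record("edit", System.nanoTime() - start);

            start = System.nanoTime();
            tool.runValidationQuery(query);                 // recheck phase
            results.record("recheck", System.nanoTime() - start);
        }
    }
}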
One kind of diagram displays execution times as a function of model and query metrics. Fig. 3 shows the total execution time for a specific query and scenario on a logarithmic scale for different tools. The x axis shows the model size (the number of nodes and edges), together with the number of results and the number of changes in the result set. Although model size is the most influential performance factor during the load phase, in the check phase (especially for incremental tools) other metrics become the most influential factors, such as the result set size or the number of variables in a query [5].

Figure 3: Required time to perform a task (total time of the phases for the RouteSensor query, log-log scale; x axis: nodes, edges, results and modifications; y axis: time in ms; tools: Eclipse OCL, EMF-IncQuery, Java, Drools, Sesame).

3.2 Extended Train Benchmark
The extended version, available online², introduces new languages: in addition to EMF, the RDF and GraphML model formats were added. New tools (Drools, Sesame, 4store and Neo4j) were integrated, and the queries were translated to each tool's native language. From this point on, not all tools have an in-memory implementation; some use the hard disk as storage, so in-memory file systems were used for their storage to lower the disk overhead. It should also be noted that some databases are compiled as JARs next to the benchmark code, while others use native server daemons that are also handled by the benchmark execution framework. In this case a new scenario variation is defined, where after the batch validation a larger modification is performed in a single edit phase (to simulate automatic model correction), and finally a recheck is executed.

² https://incquery.net/publications/trainbenchmark/

As the benchmark framework records the time of every check and edit, the subsequent calls can be displayed on a diagram to show how they change. Fig. 4 depicts such a case for the tools at a given model size and query. It can be observed that the first query execution is almost always the slowest, probably due to the lazy loading of classes and tool initialization. Another interesting point for the incremental EMF-IncQuery and Drools tools is around the tenth check, where evaluation times drop significantly. As the same queries are executed, this may be attributed to the changed model structure, or to the JIT compiler kicking in. The diagram also shows the warmup time required by each tool, and how it changes in stages.

Figure 4: Check time during revalidations (check times for the RouteSensor query at model size 128, logarithmic y axis; x axis: check index; tools: Drools, Eclipse OCL, EMF-IncQuery, Java, Sesame).

Figure 5: Different use cases of the framework: (a) metrics evaluation: generate, load, query, report the most influencing metrics (31 queries, EMF and RDF models, 3 tools); (b) code model: load, check, refactor, recheck, report validation performance (Java code, code patterns).

3.3 Model Metrics for Performance Prediction
In [5], the set of tools is narrowed down to a basic Java implementation, EMF-IncQuery and Sesame. However, for a modified metamodel, nine new instance models were generated (with different edge distributions), and the benchmark was extended with 31 queries scaling along five query metrics. The goal of that paper was not to compare tool performance, but to identify which metrics influence processing time and memory usage the most (see Fig. 5a).

Detailed results are available in the paper; in short, for EMF-IncQuery the number of matches, and for Sesame the number of query variables, showed a high correlation with the check time, while the model size metrics showed a low correlation, which again emphasizes the need to consider aspects other than model size.

3.4 ITM Factory
The fourth case (inspired by [10]) integrated into the framework is currently under development, and takes another domain, from the field of software comprehension. The inputs of the benchmark are not serialized models but Java projects. In the first step, the source code is read into a software model; the transformations are code edits or complex refactoring operations. After the software modifications, the correctness of the code base is validated (Fig. 5b).

In the code modeling case similar investigations can be performed; however, the processing tools should scale in the number of lines of code (and not in the number of nodes or edges). This also motivates displaying performance as a function of different metrics.
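Purely as an illustration of the code-model workflow outlined above (Fig. 5b), the following Java sketch loads a project into a model, applies a refactoring, and revalidates the code base. All types and method names are hypothetical; the actual benchmark is still under development.

public class CodeModelScenario {
    /** Hypothetical adapter for a tool that builds and queries a model of Java source code. */
    interface CodeModelTool {
        void importProject(String projectPath);       // parse Java sources into a software model
        int checkAntiPatterns();                      // validate the code base, return match count
        void applyRefactoring(String refactoringId);  // complex code edit, e.g. a rename
    }

    public void run(CodeModelTool tool, String projectPath, String refactoringId) {
        tool.importProject(projectPath);
        int before = tool.checkAntiPatterns();        // initial validation
        tool.applyRefactoring(refactoringId);         // modification phase
        int after = tool.checkAntiPatterns();         // revalidation after the modification
        System.out.println("Anti-pattern matches before/after: " + before + "/" + after);
    }
}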
4. CONCLUSION
In this paper we proposed MONDO-SAM, a framework that provides the common functions required for benchmarking, together with MDE-specific scenarios, models, queries and transformations as reusable and configurable primitives. As its main focus, integrated benchmark cases can be characterized by metrics, which enables the reporting module to analyze the scalability of tools against various complexity measures. We demonstrated the versatility of the framework by integrating previous versions of the Train Benchmark [9, 5] and a new benchmark from the code model domain.

The extensible framework, including the APIs, core components and documented samples, is available as open source code from the MONDO Git repository³.

³ https://opensourceprojects.eu/git/p/mondo/trainbenchmark

5. REFERENCES
[1] Transformation Tool Contest. www.transformation-tool-contest.eu, 2014.
[2] G. Bergmann, I. Ráth, T. Szabó, P. Torrini, and D. Varró. Incremental pattern matching for the efficient computation of transitive closure. In Sixth International Conference on Graph Transformation, volume 7562 of LNCS, pages 386–400, Bremen, Germany, 2012. Springer.
[3] C. Bizer and A. Schultz. The Berlin SPARQL Benchmark. International Journal on Semantic Web and Information Systems, 5(2), 2009.
[4] O. Görlitz, M. Thimm, and S. Staab. SPLODGE: Systematic generation of SPARQL benchmark queries for Linked Open Data. In The Semantic Web – ISWC 2012, volume 7649 of LNCS, pages 116–132. Springer Berlin Heidelberg, 2012.
[5] B. Izsó, Z. Szatmári, G. Bergmann, Á. Horváth, and I. Ráth. Towards precise metrics for predicting graph query performance. In IEEE/ACM 28th International Conference on Automated Software Engineering, pages 412–431, Silicon Valley, CA, USA, 2013. IEEE.
[6] D. S. Kolovos, L. M. Rose, N. Matragkas, R. F. Paige, E. Guerra, J. S. Cuadrado, J. De Lara, I. Ráth, D. Varró, M. Tisi, and J. Cabot. A research roadmap towards achieving scalability in model driven engineering. In Proceedings of the Workshop on Scalability in Model Driven Engineering, BigMDE '13, pages 2:1–2:10, New York, NY, USA, 2013. ACM.
[7] M. Schmidt, T. Hornung, G. Lausen, and C. Pinkel. SP2Bench: A SPARQL performance benchmark. In Proc. of the 25th International Conference on Data Engineering, pages 222–233, Shanghai, China, 2009. IEEE.
[8] M. Tichy, C. Krause, and G. Liebel. Detecting performance bad smells for Henshin model transformations. In B. Baudry, J. Dingel, L. Lucio, and H. Vangheluwe, editors, AMT@MoDELS, volume 1077 of CEUR Workshop Proceedings. CEUR, 2013.
[9] Z. Ujhelyi, G. Bergmann, Á. Hegedüs, Á. Horváth, B. Izsó, I. Ráth, Z. Szatmári, and D. Varró. EMF-IncQuery: An Integrated Development Environment for Live Model Queries. Science of Computer Programming, 2014. Accepted.
[10] Z. Ujhelyi, Á. Horváth, D. Varró, N. I. Csiszár, G. Szőke, L. Vidács, and R. Ferenc. Anti-pattern detection with model queries: A comparison of approaches. In IEEE CSMR-WCRE 2014 Software Evolution Week. IEEE, 2014.