MONDO-SAM: A Framework to Systematically Assess MDE Scalability

Benedek Izsó, Gábor Szárnyas, István Ráth and Dániel Varró
Fault Tolerant Systems Research Group
Department of Measurement and Information Systems
Budapest University of Technology and Economics
H-1117, Magyar Tudósok krt. 2., Budapest, Hungary
{izso, szarnyas, rath, varro}@mit.bme.hu ∗

∗ This work was partially supported by the CERTIMOT (ERC HU-09-01-2010-0003) and MONDO (EU ICT-611125) projects, partly during the fourth author's sabbatical.

BigMDE '14, July 24, 2014, York, UK. Copyright © 2014 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

ABSTRACT
Processing models efficiently is an important productivity factor in Model-Driven Engineering (MDE) processes. To optimize a toolchain so that it meets the scalability requirements of complex MDE scenarios, reliable performance measurements of different tools are key enablers that help select the best tool for a given workload. To enable systematic and reproducible benchmarking across different domains, scenarios and workloads, we propose MONDO-SAM, an extensible MDE benchmarking framework. Beyond providing easily reusable features for common benchmarking tasks that are based on best practices, the framework puts special emphasis on metrics, which enables scalability analysis along different problem characteristics. To illustrate the practical applicability of our proposal, we demonstrate how different variants of a model validation benchmark, featuring several MDE tools from various technological domains, have been integrated into the system.

1. INTRODUCTION
As Model-Driven Engineering (MDE) has gained mainstream momentum in complex system development domains over the past decade, the scalability issues associated with MDE tools and technologies are now well known [6]. To address these challenges, the community has responded with a multitude of benchmarks.

The majority of these efforts have been created by tool providers to measure the performance development of specific engines [8, 2]. As a notable exception, the Transformation Tool Contest (TTC) [1] attempts cross-technology comparison by proposing multiple cases which are solved by the authors of (mainly EMF-based) MDE tools. TTC cases focus on measuring query and transformation execution time against instance models of increasing size. TTC promotes reproducibility by providing pre-configured virtual machines on which the individual tools can be executed; however, the very nature of this environment and its limited resources make precise comparison difficult.

Benchmarks are also used outside of the MDE community. SP2Bench [7] and the Berlin SPARQL Benchmark (BSBM) [3] are SPARQL benchmarks over semantic databases (triple stores). The former uses RDF models based on the real-world DBLP bibliography database, while the latter is centered around an e-commerce case study. Both benchmarks scale up the size of the models (up to 25M and 150B elements, respectively); however, SP2Bench does not consider model modifications, and BSBM does not detail query and instance model complexity. SPLODGE [4] is a similar approach, in which SPARQL queries are generated systematically, based on metrics, for a predefined dataset. Queries are scaled up to three navigations (joins), but other metrics, such as the complexity of the instance model, were not investigated. A common technological characteristic of these benchmarks is that they are frequently run on very large computer systems that are not accessible to most users, or rely on commercial software components that are hard to obtain.

To summarize, currently available graph-based benchmarks are affected by two main issues: (i) technologically, they are frequently built on virtualized architectures or have exotic dependencies, making measurements hard to reproduce independently; and (ii) conceptually, they typically only analyze measurement results against a limited view of the problem: the execution time of a fixed task scaled against increasing model size. As a result, the relative complexity of current benchmarks cannot be precisely quantified, which makes them difficult to compare with each other.
In previous work [5], we found that other metrics (such as various query complexity measures, instance model characteristics, and combinations of these) can affect the results very significantly. Building on these results, in this paper we propose the extensible MONDO-SAM framework, which is integrated into the official MONDO benchmark open repository¹. MONDO-SAM provides reusable benchmarking primitives (such as metrics evaluation, time measurement and result storage) that can be flexibly organized into benchmarking workflows specific to a given case study. MONDO-SAM also provides an API so that technologically different tools can be integrated into the framework in a uniform way. A unique emphasis of the framework is its built-in support for metrics calculation, which enables the characterization of benchmarking problems as published in [5]. The built-in reporting facility makes it possible to investigate the scalability of MDE tools along different metrics in diagrams. Finally, the entire framework and the integrated case studies can be compiled and run using the Maven build system, making deployment and reproducible execution feasible in any standard, Java-enabled computing environment.

¹ http://opensourceprojects.eu/p/mondo/d31-transformation-benchmarks/

Figure 1: Benchmarking process (artefacts are generated, scenarios are executed and measured, and performance values and metrics are analyzed).

Figure 2: Benchmark framework architecture (generator, benchmark, metrics and analyzer components built around a reusable core, with domain-, language- and tool-specific modules).

2. OVERVIEW OF THE FRAMEWORK

2.1 A process model for MDE benchmarks
The benchmarking process for MDD applications is depicted in Fig. 1. The inputs of the benchmark are the instance model, the queries run on the instance model, the transformation rules or modification logic, and a scenario definition (or workflow) describing the execution sequence. A scenario can describe MDD use cases (such as model validation, model transformation or incremental code generation), including warmup and teardown operations if required. Inputs can be derived from real-world applications, or generated synthetically, which provides complete control over the benchmark. The complexity of the input is characterized by metrics, while the scenario execution implementations are instrumented to measure resource consumption (wall-clock times, memory and I/O usage). Finally, the measured values and the calculated metrics are automatically visualized on diagrams to find the fastest tool, or to identify the performance improvements of a specific tool.

2.2 Architecture
The benchmark framework consists of four components, as depicted in Fig. 2. The generator component allows the synthetic generation of benchmark inputs. Its core module handles configuration, domain-specific modules describe how the input data (such as instance models and queries) is generated, and language-specific modules serialize the generated logical artifacts into files (such as EMF models or OCL queries). The selected domain constrains the applicable languages, as the concepts of the domain description must be supported. For example, transitivity and multi-level metamodeling are not supported by EMF, but the latter is required by the e-commerce case study of BSBM. Generated models should be semantically equivalent; however, it is an open question whether structural equality should also be preserved. For example, in certain cases EMF models must have a dedicated container object with containment relations to all objects, which is not required in RDF.

2.3 Core features

Benchmark component. The benchmark component (in Fig. 2) measures the performance of different tools for given cases. A case can be defined as a quintuple (D, S, M, Q, T), where D defines the domain, S the scenario, M the modification, Q the query and T the tool. The T modules implement tool-specific glue code and select D, S, M and Q. All modules reuse common functions of the core, such as configuration handling (with default values and tool-specific extensions), wall-clock time measurement with the highest available (nanosecond) precision (which does not imply the same accuracy), and momentary memory consumption, all recorded in a central place. At runtime, the language-specific modifications (transformations), queries and instance models of the selected domain must be available.
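To make the (D, S, M, Q, T) structure more concrete, the following minimal Java sketch shows how a benchmark case could be assembled and executed against a tool-specific adapter. All type and method names are illustrative assumptions for this description and do not reproduce the actual MONDO-SAM API.

/** Hypothetical representation of one benchmark case (D, S, M, Q, T). */
public final class BenchmarkCase {
    private final String domain;        // D: e.g. "railway"
    private final String scenario;      // S: e.g. "incremental validation"
    private final String modification;  // M: the model modification to apply
    private final String query;         // Q: e.g. "RouteSensor"
    private final Tool tool;            // T: tool-specific glue code

    public BenchmarkCase(String domain, String scenario, String modification,
                         String query, Tool tool) {
        this.domain = domain;
        this.scenario = scenario;
        this.modification = modification;
        this.query = query;
        this.tool = tool;
    }

    /** Runs the case and reports wall-clock times to the central result store. */
    public void run(ResultStore results) {
        tool.initialize(domain, scenario);
        long start = System.nanoTime();               // nanosecond-precision timer
        tool.executeQuery(query);
        results.record("check", System.nanoTime() - start);

        start = System.nanoTime();
        tool.applyModification(modification);
        results.record("edit", System.nanoTime() - start);
        // further phases (recheck, teardown) would follow in a full scenario
    }
}

/** Tool-specific glue code; each integrated tool contributes one implementation. */
interface Tool {
    void initialize(String domain, String scenario);
    void executeQuery(String query);
    void applyModification(String modification);
}

/** Central store where the measured values are recorded (cf. Sec. 2.3). */
interface ResultStore {
    void record(String phase, long nanos);
}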
Model instantiator. A common concern of the generator and the benchmark module is reproducibility. Tool-specific scenario implementations are cleanly separated by the scenario interfaces, and wherever generation or execution is randomized, a pseudo-random generator is used with its seed set to a predefined value. However, nondeterministic operations (such as choosing an element from a set) and the tool implementations themselves can still disperse results between runs.
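The fixed-seed policy can be illustrated with the following small Java sketch; the class name, the default seed value and the helper method are assumptions made for illustration, not the framework's own code.

import java.util.Random;

/** Illustrative generator whose random choices are reproducible across runs. */
public class ReproducibleGenerator {
    private static final long DEFAULT_SEED = 42L; // assumed, configurable default
    private final Random random;

    public ReproducibleGenerator(long seed) {
        this.random = new Random(seed); // same seed => same generated sequence
    }

    public ReproducibleGenerator() {
        this(DEFAULT_SEED);
    }

    /** Picks a reproducible pseudo-random element index for a model of the given size. */
    public int pickElementIndex(int modelSize) {
        return random.nextInt(modelSize);
    }
}

Note that a fixed seed alone does not remove every source of nondeterminism: iteration order over hash-based collections, for instance, can still vary, which is one reason results may disperse between runs as described above.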
Metrics evaluator. To describe the benchmark inputs with quantitative values, they are characterized by metrics, which are evaluated by the metrics component. Language-specific implementations analyze model–query pairs and store the calculated metric values centrally, gathered by the core, so that they can later be analyzed together with the measured values.

Result reporting and analysis. When the measurement and metrics data become available, the analyzer component (implemented in R) automatically creates an HTML report with diagrams. To show scalability along different measures, any metric can be selected for the x axis, while the y axis represents resource consumption. Raw data can be post-processed: dimensions can be changed (e.g. converting times to milliseconds to reflect their accuracy), and derived values can be calculated (e.g. the median of the incremental recheck steps, or the total processing time).

2.4 Best Practices to Minimize Validity Threats
During the execution of the cases, noise coming from the environment should be kept to a minimum. Possible sources of noise include the caching mechanisms of various components (e.g. the file system and the database management system), the warm-up effect of the runtime environment (e.g. the Java Virtual Machine), scheduled tasks (e.g. cron) and swapping. For large heaps, the garbage collector of the JVM can block the run for minutes, so minimizing its invocations is advised; this is achieved by setting the minimal and maximal heap sizes to the same value, thus eliminating GC calls triggered by heap expansion.

In the implementation of the framework components, only the minimal amount of libraries should be loaded. On the one hand, the proper organization of dependencies is the responsibility of the developer. On the other hand, it is enforced by the framework architecture: tool-specific implementations are independent and act as entry points that call into the framework, which uses inversion of control (IoC) without additional execution environments such as OSGi.

To alleviate random disturbances, each test case is run several times (e.g. ten times) by the framework, and the results are aggregated by the analyzer.
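As a sketch of how these practices could be combined when launching a case, the following Java snippet forks a fresh JVM per repetition with equal minimum and maximum heap sizes. The class name, the heap size, the repetition count and the benchmark.jar entry point are illustrative assumptions, not the framework's actual launcher.

import java.util.ArrayList;
import java.util.List;

public class BenchmarkRunner {
    public static void main(String[] args) throws Exception {
        int repetitions = 10; // e.g. ten runs, later aggregated by the analyzer

        for (int run = 0; run < repetitions; run++) {
            List<String> command = new ArrayList<>();
            command.add("java");
            command.add("-Xms8g");            // equal minimum and maximum heap size
            command.add("-Xmx8g");            // avoids GC calls triggered by heap expansion
            command.add("-jar");
            command.add("benchmark.jar");     // hypothetical benchmark entry point
            command.add("--run=" + run);

            // A fresh JVM per run keeps caching and JIT warm-up effects from leaking
            // between repetitions (unless warm-up is part of the measured scenario).
            Process process = new ProcessBuilder(command).inheritIO().start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                System.err.println("Run " + run + " failed with exit code " + exitCode);
            }
        }
    }
}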
3. INTEGRATED CASE STUDIES
The usability of the framework is demonstrated by four examples: three variations of the previously published Train Benchmark, and a new, soon to be released model comprehension benchmark.

3.1 Basic Train Benchmark
The first version of the Train Benchmark [9] compares the performance of EMF-IncQuery with Eclipse OCL and its incremental variant, the OCL Impact Analyzer, in an incremental model validation use case. Instance models are generated from a railway domain, and four hand-written queries (of different complexity) perform model validation tasks. The scenario starts with a model loading phase, where the instance model is read from a file, followed by a check phase, where a model validation query is executed (returning the constraint-violating elements). Afterwards, to simulate a user in front of an editor, multiple (100) edits and rechecks are performed. In this case, batch and incremental validation times as well as memory consumption were measured.
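The phase structure of this scenario (load, batch check, then repeated edit–recheck rounds) can be sketched as follows in Java. The interfaces and method names are illustrative placeholders rather than the actual Train Benchmark code.

public class ValidationScenario {
    /** Minimal set of tool operations the scenario needs; each tool supplies an adapter. */
    interface ValidationTool {
        void loadModel(String path);
        int runValidationQuery(String query); // returns the number of violating elements
        void applyEdit(int iteration);
    }

    interface ResultRecorder {
        void record(String phase, long nanos);
    }

    private static final int EDIT_RECHECK_ITERATIONS = 100;

    public void run(ValidationTool tool, ResultRecorder results, String modelFile, String query) {
        long start = System.nanoTime();
        tool.loadModel(modelFile);                          // load phase
        results.record("load", System.nanoTime() - start);

        start = System.nanoTime();
        tool.runValidationQuery(query);                     // batch check phase
        results.record("check", System.nanoTime() - start);

        for (int i = 0; i < EDIT_RECHECK_ITERATIONS; i++) { // simulate a user editing the model
            start = System.nanoTime();
            tool.applyEdit(i);
            results.record("edit", System.nanoTime() - start);

            start = System.nanoTime();
            tool.runValidationQuery(query);                 // recheck phase
            results.record("recheck", System.nanoTime() - start);
        }
    }
}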
One kind of diagram displays execution times as a function of model and query metrics. Fig. 3 shows the total execution time for a specific query and scenario on a logarithmic scale for different tools. The x axis shows the model size (the number of nodes and edges), together with the number of results and the number of changes in the result set. Although model size is the most influential performance factor during the load phase, in the check phase (especially for incremental tools) other metrics become the most influential factors, such as the result set size or the number of variables in a query [5].

Figure 3: Required time to perform a task (total time of the phases for the RouteSensor query, log-log scale; x axis: nodes, edges, results and modifications; y axis: time in ms; tools: Eclipse OCL, EMF-IncQuery, Java, Drools, Sesame).

3.2 Extended Train Benchmark
The extended version, available online², introduces new languages: in addition to EMF, the RDF and GraphML model formats were added. New tools (Drools, Sesame, 4store and Neo4j) were integrated, and the queries were translated to each tool's native language. From this point on, not all tools have an in-memory implementation; some use the hard disk as storage, so in-memory file systems were used for their storage to lower the disk overhead. It should also be noted that some databases are compiled as JARs next to the benchmark code, while others use native server daemons that are also handled by the benchmark execution framework. In this case a new scenario variation is defined, where after the batch validation a larger modification is performed in a single edit phase (to simulate automatic model correction), and finally a recheck is executed.

² https://incquery.net/publications/trainbenchmark/

As the benchmark framework records the time of every check and edit, the subsequent calls can be displayed on a diagram to show how they change. Fig. 4 depicts such a case for the tools at a given model size and query. It can be observed that the first query execution is almost always the slowest, probably due to the lazy loading of classes and tool initialization. Another interesting point for the incremental EMF-IncQuery and Drools tools is around the tenth check, where evaluation times drop significantly. As the same queries are executed, this may be attributed to the changed model structure, or to the JIT compiler kicking in. The diagram also shows the warmup time required by each tool, and how it changes in stages.

Figure 4: Check time during revalidations (check times for the RouteSensor query at model size 128, logarithmic y axis; x axis: check index; tools: Drools, Eclipse OCL, EMF-IncQuery, Java, Sesame).

Figure 5: Different use cases of the framework: (a) metrics evaluation: generate, load, query, report the most influencing metrics (31 queries, EMF and RDF models, 3 tools); (b) code model: load, check, refactor, recheck, report validation performance (Java code, code patterns).

3.3 Model Metrics for Performance Prediction
In [5], the set of tools is narrowed down to a basic Java implementation, EMF-IncQuery and Sesame. However, for a modified metamodel, nine new instance models were generated (with different edge distributions), and the benchmark was extended with 31 queries scaling along five query metrics. The goal of that paper was not to compare tool performance, but to identify which metrics influence processing time and memory usage the most (see Fig. 5a).

Detailed results are available in the paper; in short, for EMF-IncQuery the number of matches, and for Sesame the number of query variables, showed a high correlation with the check time, while the model size metrics showed a low correlation, which again emphasizes the need to consider aspects other than model size.

3.4 ITM Factory
The fourth case (inspired by [10]) integrated into the framework is currently under development, and takes another domain, from the field of software comprehension. The inputs of the benchmark are not serialized models but Java projects. In the first step, the source code is read into a software model; the transformations are code edits or complex refactoring operations. After the software modifications, the correctness of the code base is validated (Fig. 5b).

In the code modeling case similar investigations can be performed; however, the processing tools should scale in the number of lines of code (and not in the number of nodes or edges). This also motivates displaying performance as a function of different metrics.
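Purely as an illustration of the code-model workflow outlined above (Fig. 5b), the following Java sketch loads a project into a model, applies a refactoring, and revalidates the code base. All types and method names are hypothetical; the actual benchmark is still under development.

public class CodeModelScenario {
    /** Hypothetical adapter for a tool that builds and queries a model of Java source code. */
    interface CodeModelTool {
        void importProject(String projectPath);       // parse Java sources into a software model
        int checkAntiPatterns();                      // validate the code base, return match count
        void applyRefactoring(String refactoringId);  // complex code edit, e.g. a rename
    }

    public void run(CodeModelTool tool, String projectPath, String refactoringId) {
        tool.importProject(projectPath);
        int before = tool.checkAntiPatterns();        // initial validation
        tool.applyRefactoring(refactoringId);         // modification phase
        int after = tool.checkAntiPatterns();         // revalidation after the modification
        System.out.println("Anti-pattern matches before/after: " + before + "/" + after);
    }
}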
4. CONCLUSION
In this paper we proposed MONDO-SAM, a framework that provides the common functions required for benchmarking, together with MDE-specific scenarios, models, queries and transformations as reusable and configurable primitives. As its main focus, integrated benchmark cases can be characterized by metrics, which enables the reporting module to analyze the scalability of tools against various complexity measures. We demonstrated the versatility of the framework by integrating previous versions of the Train Benchmark [9, 5] and a new benchmark from the code model domain.

The extensible framework, including the APIs, core components and documented samples, is available as open source code from the MONDO Git repository³.

³ https://opensourceprojects.eu/git/p/mondo/trainbenchmark

5. REFERENCES
[1] Transformation Tool Contest. www.transformation-tool-contest.eu, 2014.
[2] G. Bergmann, I. Ráth, T. Szabó, P. Torrini, and D. Varró. Incremental pattern matching for the efficient computation of transitive closure. In Sixth International Conference on Graph Transformation, volume 7562 of LNCS, pages 386–400, Bremen, Germany, 2012. Springer.
[3] C. Bizer and A. Schultz. The Berlin SPARQL Benchmark. International Journal on Semantic Web and Information Systems, 5(2), 2009.
[4] O. Görlitz, M. Thimm, and S. Staab. SPLODGE: Systematic generation of SPARQL benchmark queries for Linked Open Data. In The Semantic Web – ISWC 2012, volume 7649 of LNCS, pages 116–132. Springer Berlin Heidelberg, 2012.
[5] B. Izsó, Z. Szatmári, G. Bergmann, Á. Horváth, and I. Ráth. Towards precise metrics for predicting graph query performance. In IEEE/ACM 28th International Conference on Automated Software Engineering, pages 412–431, Silicon Valley, CA, USA, 2013. IEEE.
[6] D. S. Kolovos, L. M. Rose, N. Matragkas, R. F. Paige, E. Guerra, J. S. Cuadrado, J. De Lara, I. Ráth, D. Varró, M. Tisi, and J. Cabot. A research roadmap towards achieving scalability in model driven engineering. In Proceedings of the Workshop on Scalability in Model Driven Engineering, BigMDE '13, pages 2:1–2:10, New York, NY, USA, 2013. ACM.
[7] M. Schmidt, T. Hornung, G. Lausen, and C. Pinkel. SP2Bench: A SPARQL performance benchmark. In Proc. of the 25th International Conference on Data Engineering, pages 222–233, Shanghai, China, 2009. IEEE.
[8] M. Tichy, C. Krause, and G. Liebel. Detecting performance bad smells for Henshin model transformations. In B. Baudry, J. Dingel, L. Lucio, and H. Vangheluwe, editors, AMT@MoDELS, volume 1077 of CEUR Workshop Proceedings. CEUR, 2013.
[9] Z. Ujhelyi, G. Bergmann, Á. Hegedüs, Á. Horváth, B. Izsó, I. Ráth, Z. Szatmári, and D. Varró. EMF-IncQuery: An Integrated Development Environment for Live Model Queries. Science of Computer Programming, 2014. Accepted.
[10] Z. Ujhelyi, Á. Horváth, D. Varró, N. I. Csiszár, G. Szőke, L. Vidács, and R. Ferenc. Anti-pattern detection with model queries: A comparison of approaches. In IEEE CSMR-WCRE 2014 Software Evolution Week. IEEE, 2014.