On Evaluating Multi-Level Modeling

Colin Atkinson
University of Mannheim
Mannheim, Germany
Email: atkinson@informatik.uni-mannheim.de

Thomas Kühne
Victoria University of Wellington
Wellington, New Zealand
Email: tk@ecs.vuw.ac.nz

Abstract—Multi-Level Modeling is receiving increasing levels of interest and its active research community is continuing to make progress. However, to advance the discipline effectively it is necessary to increase industry adoption and achieve better community cohesion. We believe that the key to addressing both these challenges is to promote the creation of more comparisons in the multi-level modeling field based on meaningful, objective evaluations. In this position paper, we provide our view on what constitutes meaningful evaluations and discuss some of the issues involved in obtaining them, while presenting a broad overview of existing multi-level modeling evaluations. In particular, we emphasize the importance of understanding and managing the difference between internal and external qualities.

I. INTRODUCTION

Although Multi-level Modeling (MLM) has seen steady development over recent years, industry adoption is still virtually non-existent (a rare application of MLM in an industry setting is described in [8]). One explanation for the low adoption rate is the current unavailability of industrial-strength approaches and tools. However, even if better tool support were available, wider adoption would still be hindered by the lack of compelling evidence that switching from Two-Level Modeling (TLM) to MLM brings benefits in industrial contexts. Creating convincing comparisons would reduce this barrier and could even expedite MLM research through industrial funding.

Another obstacle to the discipline's future growth is the lack of research cohesion, which may eventually make it impossible for community members to build on each other's results. Without a sense of direction, i.e., a common understanding of the way forward, the discipline runs the risk of research diversification to a point where it loses its core focus and subsequently its critical mass. We therefore believe that objective comparisons between competing approaches are not only desirable to provide a compass for future development but may eventually become necessary for the discipline's survival.

Since convincing comparative evaluations are the key to addressing both of the aforementioned challenges, in this position paper we discuss some of the issues involved in performing such evaluations in the context of MLM. We first establish the basic parameters of meaningful, scientifically sound evaluations and then discuss a variety of concrete approaches, pointing to existing work where applicable.

II. MEANINGFUL EVALUATIONS

The effectiveness of MLM evaluations in promoting industry adoption and research cohesion depends on the extent to which they measure something of relevance. Naturally, relevance itself depends on the stakeholders and their respective goals. However, in general, a meaningful evaluation provides results that have some kind of real-world relevance. In contrast, a meaningless evaluation, e.g., one measuring the number of vowels in a language's keywords, has no such real-world relevance. A comparison based on such an evaluation would not yield any meaningful insights into which language should be preferred for achieving any reasonable real-world impact. Meaningful evaluations, on the other hand, should be designed to deliver insights that provide the basis for pragmatic guidance. In order for an evaluation to be meaningful in our sense, it must address the following aspects:

A1: Measurability. Any targeted properties must be objectively observable. Ideally, measurements should yield numeric results that are directly proportional to the property being measured, as it is then not only possible to decide which approach is better but also by how much. It is never possible to judge an approach or tool to be, e.g., "good" or "productive", without breaking down how the quality concerned manifests itself in terms of measurable properties.

A2: Conclusiveness. Measurements should yield consistent results. Repeated performances of the evaluation need not yield exactly the same outcome, but they should deliver reliable values within a given margin of error. This also excludes results with low confidence (e.g., because they lack statistical significance).

A3: Impartiality. The choice of the properties to be measured must not favor particularities of one solution that have no proven relationship to the ultimate goal. For example, a set of postulated requirements must be formulated in such a manner that they reference the problem domain and the ultimate benefits to the targeted user rather than solution details.

A4: Trueness. When using proxies (e.g., substitutes for real-world artifacts or practitioners), care must be taken to ensure that no circumstantial bias is introduced. Trueness therefore comprises at least:

A4.1: Context Relevance. Model proxies (i.e., samples used in lieu of real-world models) and the assumed operations on them should be demonstrated to be representative. Otherwise, a skewed selection could introduce undesired bias.

A4.2: Demographic Relevance. Substitute users should be demonstrated to be representative of real users. In general, it is not possible to transfer results between different bodies of users (e.g., from students to practitioners in the field).

A5: Pragmatic Relevance. Targeted properties must have a bearing on the actual needs of the intended users. This criterion is the very foundation of a meaningful evaluation. The previous aspects essentially characterize sound evaluations, whereas pragmatic relevance requires that there is an intent to measure something of pragmatic value.

It is obviously challenging to "tick" all of the above "boxes" in practice, but we feel it is useful to have a checklist that helps to document where an evaluation may be lacking.

III. INTERNAL VERSUS EXTERNAL QUALITIES

Some of the aforementioned aspects are more difficult to address than others. In order to understand why, it is important to be aware of whether an evaluation is intended to evaluate an internal quality or an external quality. We use these terms with their usual meaning in software engineering [17].
In our context, internal qualities pertain to the directly measurable properties of a model, e.g., the number of model elements, the number of constraints, the average inheritance depth, etc. External qualities, on the other hand, pertain to the experience users have when working with a model, e.g., creating it, understanding it, maintaining it, etc.

Ultimately, only the external qualities have a direct bearing on meaningful evaluations. However, due to the cost and challenges involved in assessing external qualities directly in a meaningful way, one often attempts to approximate the assessment of external qualities by assessing internal qualities instead, based on the idea that there is a correlation between internal and external qualities. It is standard practice to assume that optimizing certain internal qualities (e.g., reducing complexity) is the key to achieving certain desirable external qualities (e.g., increased maintainability). However, such an indirect evaluation of external properties is only trustworthy if the assumed underlying correlation has been demonstrated, or at least has been made plausible by compelling arguments.

Interestingly, A1 and A2 are most easily addressed by focusing on the internal qualities of an approach. Such qualities, e.g., the complexity of the models created by an approach, can typically be reliably assessed. In contrast, assessing external qualities often implies some compromise regarding A1 and A2 because sample populations may be small or certain assumptions may not generalize.

Aspects A3 and A5, on the other hand, are best addressed by focusing on the external qualities of an approach. External qualities directly reflect the utility of the approach to its users and hence avoid solution bias (A3) and intrinsically imply pragmatic relevance (A5). The increased cost involved in directly assessing external qualities relates to ensuring conclusiveness (A2) and trueness (A4). This cost is considerable and therefore represents a major hurdle for this kind of evaluation.

IV. ASSESSING INTERNAL QUALITIES

Complexity is one of the most commonly measured internal qualities since it is assumed to correlate with important external qualities such as maintainability, robustness, trustworthiness, etc. In fact, the main value proposition of MLM is its ability to reduce accidental complexity [3], i.e., the difference in complexity between an ideal model and a concrete model involving solution-induced overhead, e.g., workarounds.

A number of evaluations of multi-level modeling have been based on approximating the complexity of a model by measuring its size, that is, the number of its elements. For example, Gerbig performed a comparison based on model size in his Ph.D. thesis using a sample model from the enterprise architecture domain [7]. The MLM version of the model has 50 modeling elements while the TLM version, which uses standard workaround patterns such as the Type-Object pattern [10], has 95 modeling elements, amounting to an increase of 90%. Rossini et al. performed a similar evaluation which yielded a three-fold increase in the number of modeling elements in a two-level versus a multi-level model of their CloudML scenario [16].

The extent of the practical relevance of the above evaluations was shown by de Lara et al. by measuring the application frequency of TLM workaround techniques (cf. the "Item Descriptor" pattern [6], the "Type-Object" pattern [10], the "Adaptive Object-Model" [18], etc.) in real-world models [12]. Since these workaround techniques are responsible for the increased size of two-level models relative to their multi-level counterparts, de Lara et al. hence demonstrated that the observations made in [7], [16] apply to a wide range of modeling practice. As much as 35% of all models in some areas [12] could thus benefit from the potential size reductions.

Although the above results provide a convincing endorsement of the practical relevance of MLM, they do so only to the extent that the assumption that model size (a measure apparently equivalent to the much debated "lines of code" metric for source code) approximates model complexity is reasonable. A larger model based on a simple underlying language could conceivably be preferable to a compact model based on a complex language.
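To make the element overhead counted in the size comparisons above tangible, the following sketch shows the Type-Object workaround [10] in a two-level setting. The domain (ProductType/Product) and all names are hypothetical examples of ours, not taken from the evaluated models.

```python
# Illustrative sketch (hypothetical domain): the Type-Object workaround
# emulates a missing classification level in a two-level setting.

class ProductType:
    """Explicit 'type' object: one instance per domain-level type."""
    def __init__(self, name, tax_rate):
        self.name = name
        self.tax_rate = tax_rate  # attribute shared by all products of this type


class Product:
    """Domain-level instance, linked to its type object by hand."""
    def __init__(self, product_type, serial_no):
        self.product_type = product_type  # type/instance link maintained manually
        self.serial_no = serial_no


# Every domain concept whose instances must themselves be classified needs
# an extra *Type class plus a hand-maintained link -- the kind of
# solution-induced element overhead counted in the size comparisons above.
book = ProductType("Book", tax_rate=0.07)
copy_1 = Product(book, serial_no="B-001")
```

In a multi-level model, the classification link between `Product` and `ProductType` would be native to the language rather than modeled by hand, which is where the element savings originate.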
Going beyond assessing model size, it appears useful to consider other classic metrics [5], [13], [15] and quality attributes [4], [14]. Indeed, in his MLM vs TLM comparison, Gerbig also considered such classic metrics [7]. Overall, however, these proved to be less conclusive than model size comparisons, although he detected clear advantages for MLM with respect to coupling (the average number of distinct connected classes) and overhead (referred to as "complexity" in [7]), defined as (well-formedness rules + additional operations) / element count [7].

Given these less conclusive results (compared to model size analyses), it would be easy to be skeptical about the actual advantages offered by MLM. However, it is important to observe that these metrics were originally designed to target the type level only and thus entirely ignore the instance-level complexity caused by the application of TLM workarounds. This weakness of classic metrics for evaluating MLM is understandable given their motivation rooted in programming and/or modeling software. In these contexts, instances and their relationships are irrelevant to users. However, in many domain modeling applications instances directly represent the subject under study. In such contexts, the complexity of instance models is therefore very much a concern to users and should thus be considered in evaluations.
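To make the size, coupling, and overhead metrics discussed above concrete, the following sketch computes them for a toy model. The dictionary-based model representation and all element names are our own illustrative assumptions, not the representation used in [7].

```python
from collections import defaultdict

# Toy model representation (an assumption for illustration only):
# elements, binary references between them, and counts of auxiliary artifacts.
model = {
    "elements": ["Product", "ProductType", "Order", "Customer"],
    "references": [("Order", "Product"), ("Order", "Customer"),
                   ("Product", "ProductType")],
    "wellformedness_rules": 6,
    "additional_operations": 2,
}

# Size: the number of modeling elements.
size = len(model["elements"])

# Coupling: average number of distinct connected classes per element.
neighbours = defaultdict(set)
for src, dst in model["references"]:
    neighbours[src].add(dst)
    neighbours[dst].add(src)
coupling = sum(len(neighbours[e]) for e in model["elements"]) / size

# Overhead, as defined above:
# (well-formedness rules + additional operations) / element count.
overhead = (model["wellformedness_rules"] + model["additional_operations"]) / size

print(size, coupling, overhead)  # -> 4 1.5 2.0
```

Note that such a computation only sees the type level; extending it to instance models, as argued above, would require also counting instance elements and their links.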
Instead of focusing on model properties (e.g., model complexity), one may also consider language properties (e.g., language expressiveness). For example, Atkinson et al. based their comparison of Melanee with MetaDepth on the differences between their respective language features [2]. Grossmann et al.'s more comprehensive comparison of 21 MLM approaches [9] also involved language feature comparisons. However, Grossmann et al. additionally considered the intended target audience and the purpose of the approaches, and furthermore considered the extent to which an approach has seen industry usage. This latter consideration could be regarded as including an external quality, but without further information on how well the respective MLM approaches performed in industrial contexts it is only a good starting point for further investigations.

Ideally, feature-based comparisons should be accompanied by an analysis of the impact of the different features on users. While certain features may seem elegant, ultimately their value must be assessed by considering external qualities.

V. ASSESSING EXTERNAL QUALITIES

In order to evaluate the ultimate purpose of any approach intended to deliver value to a user, it is necessary to determine properties based on external qualities which relate to user experience. As far as we are aware, only two MLM evaluations of this kind have been performed to date. Both of them investigate model changes and can thus reasonably be regarded as evaluating (aspects of) maintainability. In his Ph.D. thesis, Gerbig performed a comparative model change analysis by counting the number of primitive change operations needed to respond to certain requirements changes [7]. It turned out that the homogeneous treatment of all classification levels and Melanee's emendation service [1] reduce the effort needed to change the multi-level version of the model compared to the two-level, EMF-based version.

Kimura et al. also used a change-based approach to compare Melanee, MetaDepth and EMF, with a particular focus on extensibility [11]. These kinds of analyses exhibit ideal measurability, reproducibility, impartiality, and pragmatic relevance. However, whether context relevance is adequately addressed depends on how representative the chosen models and editing operations are.

Another external quality which lends itself relatively straightforwardly to measurement is model robustness, i.e., the resilience of a model to user error. Here the goal would be to assess the likelihood of introducing errors when creating or maintaining models. In particular, in the context of MLM to TLM comparisons, one would expect a two-level model to suffer from more accidentally introduced errors than a corresponding multi-level model. TLM would only provide the same safeguards against the introduction of model inconsistencies if all the well-formedness constraints implied by MLM were transposed into the equivalent TLM models. One would still, however, expect a higher rate of well-formedness violations, since it is most likely easier to make mistakes in a lower-level two-level model than in a higher-level multi-level model.

The final external quality we can cover here is productivity, i.e., the speed with which users can develop or make changes to models. The underlying hypothesis of what could be referred to as cognitive challenge-based evaluations is that modeler performance is a function of the adequacy of the language/tool used. The higher the adequacy of the language/tool, the better the modeler should perform when facing standard tasks. To this end, we propose a "5C" approach, comprising the cognitive challenges listed in Table I.

TABLE I
COGNITIVE CHALLENGES OF THE "5C" APPROACH

C1 Comprehend: Demonstrate understanding of a model.
C2 Complete: Read an incomplete model and correctly add missing parts.
C3 Critique: Read a defective model and identify all issues.
C4 Correct: Read a defective model and address all issues.
C5 Create: Create a model from scratch for a specified purpose.

Assessing the adequacy of an approach would be performed by measuring completion speeds for representative concrete tasks of the above five kinds. If languages/tools actually yield different levels of productivity, one should expect to see differences in the C1-C5 completion measurements. Ideally, subjects should be chosen in such a way that results transfer to the intended user base, in order to achieve demographic relevance. Full context relevance will be very hard to achieve with this approach, as it is typically not feasible to work with realistically sized models in such experiments.
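As a sketch, such a completion-speed measurement could record one timing per challenge kind per subject; the harness below and all names in it are our own illustrative assumptions, not an artifact of any existing study.

```python
import time

# The five cognitive challenge kinds of the proposed "5C" approach.
CHALLENGES = ["Comprehend", "Complete", "Critique", "Correct", "Create"]


def run_session(subject_id, perform_task):
    """Time one subject on a task of each challenge kind.

    `perform_task` is a stand-in for the actual experimental task; it
    receives the challenge name and returns once the subject has finished.
    """
    timings = {}
    for challenge in CHALLENGES:
        start = time.perf_counter()
        perform_task(challenge)
        timings[challenge] = time.perf_counter() - start
    return {"subject": subject_id, "timings": timings}


# One completion time per challenge kind allows per-challenge comparison
# between, e.g., an MLM tool group and a TLM tool group.
session = run_session("S01", lambda challenge: None)  # no-op task for illustration
```

Comparing the per-challenge distributions across tool groups, rather than a single aggregate, would show where in the comprehend-to-create spectrum an approach helps or hinders.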
VI. CONCLUSION

The goal of this position paper has been to provide a discussion of the issues involved when aiming to perform meaningful evaluations, while providing a broad overview of the MLM evaluations that have been conducted to date. The number of already existing MLM evaluations is encouraging, and each of them represents a very useful step towards growing MLM as a discipline. However, our discussion has shown that the evaluations performed until now are overwhelmingly focused on internal rather than external qualities. Hence their pragmatic relevance, in the absence of a demonstrated strong correlation between the internal qualities they assess and the external qualities that matter to users, is limited.

It is natural that the first evaluations performed in an emerging field focus on internal qualities, as these are usually much easier to assess than external ones. However, we believe that for a) the benefits of MLM to become convincing enough to generate serious interest from industry, and b) comparative evaluations to become useful enough to maintain the cohesion and momentum the research community requires, more user-oriented evaluations focusing on external qualities will be needed.

An important initiative in this regard is the "Bicycle Challenge" proposed by the MULTI 2017 workshop as a common sample scenario, allowing various MLM approaches to be compared based on an example with practical relevance. Ideally, more such benchmarks will be designed in the future, along with agreed-upon usage scenarios, e.g., involving subsequent extensions, detecting and removing defects, etc. It will remain a challenge to distinguish models and usage scenarios that have context relevance from those that do not, but any attempts to move MLM evaluations towards directly assessing external qualities, or to strengthen the confidence in hitherto only assumed correlations between internal and external qualities, will represent significant steps forward.

REFERENCES

[1] Atkinson, C., Gerbig, R., Kennel, B.: On-the-fly emendation of multi-level models. In: Proceedings of the 8th European Conference on Modelling Foundations and Applications. pp. 194–209. ECMFA'12, Springer (2012)
[2] Atkinson, C., Gerbig, R., Lara, J.D., Guerra, E.: A feature-based comparison of Melanee and MetaDepth. In: Proceedings of the 3rd Workshop on Multi-Level Modelling co-located with the 19th ACM/IEEE International Conference MODELS 2016. CEUR Workshop Proceedings, vol. Vol-1722, pp. 25–34
[3] Atkinson, C., Kühne, T.: Reducing accidental complexity in domain models. Software and Systems Modeling 7(3), 345–359 (2008)
[4] Bansiya, J., Davis, C.G.: A hierarchical model for object-oriented design quality assessment. IEEE Trans. Softw. Eng. 28(1), 4–17 (Jan 2002)
[5] Chidamber, S.R., Kemerer, C.F.: A metrics suite for object oriented design. IEEE Transactions on Software Engineering 20(6), 476–493 (1994)
[6] Coad, P.: Object-oriented patterns. Communications of the ACM 35(9), 152–159 (Sep 1992)
[7] Gerbig, R.: Deep, Seamless, Multi-format, Multi-notation Definition and Use of Domain-specific Languages. Ph.D. thesis, University of Mannheim (2017)
[8] Igamberdiev, M., Grossmann, G., Selway, M., Stumptner, M.: An integrated multi-level modeling approach for industrial-scale data interoperability. Software & Systems Modeling pp. 1–26 (2016)
[9] Igamberdiev, M., Grossmann, G., Stumptner, M.: A feature-based categorization of multi-level modeling approaches and tools. In: Proceedings of the 3rd Workshop on Multi-Level Modelling co-located with the 19th ACM/IEEE International Conference MODELS 2016. CEUR Workshop Proceedings, vol. Vol-1722, pp. 45–55
[10] Johnson, R., Woolf, B.: Type object. In: Martin, R.C., Riehle, D., Buschmann, F. (eds.) Pattern Languages of Program Design 3, pp. 47–65. Addison-Wesley (1997)
[11] Kimura, K. et al.: An evaluation of multi-level modeling frameworks for extensible graphical editing tools. In: Proceedings of the 3rd Workshop on Multi-Level Modelling co-located with the 19th ACM/IEEE International Conference MODELS 2016. CEUR Workshop Proceedings, vol. Vol-1722, pp. 35–44
[12] Lara, J.D., Guerra, E., Cuadrado, J.S.: When and how to use multilevel modelling. ACM Transactions on Software Engineering and Methodology 24(2), 12:1–12:46 (2014)
[13] Lorenz, M., Kidd, J.: Object-oriented software metrics: a practical guide. Prentice-Hall, Inc. (1994)
[14] Ma, H., Shao, W., Zhang, L., Ma, Z., Jiang, Y.: Applying OO metrics to assess UML meta-models. In: Baar, T., Strohmeier, A., Moreira, A., Mellor, S.J. (eds.) Proceedings of UML 2004, Lisbon, Portugal, pp. 12–26. Springer (2004)
[15] Purao, S., Vaishnavi, V.: Product metrics for object-oriented systems. ACM Comput. Surv. 35(2), 191–221 (2003)
[16] Rossini, A., de Lara, J., Guerra, E., Nikolov, N.: A comparison of two-level and multi-level modelling for cloud-based applications. In: Proceedings of ECMFA 2015. pp. 18–32. LNCS 9153 (2015)
[17] Sommerville, I.: Software Engineering. Pearson, 10th edn. (2016)
[18] Yoder, J.W., Johnson, R.E.: The adaptive object-model architectural style. In: Proceedings of the 3rd IEEE/IFIP Conference on Software Architecture: System Design, Development and Maintenance. pp. 3–27. Kluwer (2002)