=Paper=
{{Paper
|id=Vol-2019/multi_10
|storemode=property
|title=On Evaluating Multi-level Modeling
|pdfUrl=https://ceur-ws.org/Vol-2019/multi_10.pdf
|volume=Vol-2019
|authors=Colin Atkinson,Thomas Kühne
|dblpUrl=https://dblp.org/rec/conf/models/0001K17
}}
==On Evaluating Multi-level Modeling==
On Evaluating Multi-Level Modeling

Colin Atkinson
University of Mannheim
Mannheim, Germany
Email: atkinson@informatik.uni-mannheim.de

Thomas Kühne
Victoria University of Wellington
Wellington, New Zealand
Email: tk@ecs.vuw.ac.nz
Abstract—Multi-Level Modeling is receiving increasing levels of interest and its active research community is continuing to make progress. However, to advance the discipline effectively it is necessary to increase industry adoption and achieve better community cohesion. We believe that the key to addressing both these challenges is to promote the creation of more comparisons in the multi-level modeling field based on meaningful objective evaluations. In this position paper, we provide our view on what constitutes meaningful evaluations and discuss some of the issues involved in obtaining them, while presenting a broad overview of existing multi-level modeling evaluations. In particular, we emphasize the importance of understanding and managing the difference between internal and external qualities.

I. INTRODUCTION

Although Multi-Level Modeling (MLM) has seen steady development over recent years, industry adoption is still virtually non-existent (a rare application of MLM in an industry setting is described in [8]). One explanation for the low adoption rate is the current unavailability of industrial-strength approaches and tools. However, even if better tool support were available, wider adoption would still be hindered by the lack of compelling evidence that switching from Two-Level Modeling (TLM) to MLM brings benefits in industrial contexts. Creating convincing comparisons would reduce this barrier and could even expedite MLM research through industrial funding.

Another obstacle to the discipline's future growth is the lack of research cohesion, which may eventually make it impossible for community members to build on each other's results. Without a sense of direction – i.e., a common understanding of the way forward – the discipline runs the risk of research diversification to a point where it loses its core focus and subsequently its critical mass. We therefore believe that objective comparisons between competing approaches are not only desirable to provide a compass for future development but may eventually become necessary for the discipline's survival.

Since convincing comparative evaluations are the key to addressing both of the aforementioned challenges, in this position paper we discuss some of the issues involved in performing such evaluations in the context of MLM. We first establish the basic parameters of meaningful, scientifically sound evaluations and then discuss a variety of concrete approaches, pointing to existing work where applicable.

II. MEANINGFUL EVALUATIONS

The effectiveness of MLM evaluations in promoting industry adoption and research cohesion depends on the extent to which they measure something of relevance. Naturally, relevance itself depends on the stakeholders and their respective goals. However, in general, a meaningful evaluation provides results that have some kind of real-world relevance. In contrast, a meaningless evaluation – e.g., measuring the number of vowels in a language's keywords – has no such real-world relevance. A comparison based on such an evaluation would not yield any meaningful insights into which language should be preferred for achieving any reasonable real-world impact. Meaningful evaluations, on the other hand, should be designed to deliver insights that provide the basis for pragmatic guidance. In order for an evaluation to be meaningful in our sense, it must address the following aspects:

A1: Measurability. Any targeted properties must be objectively observable. Ideally, measurements should yield numeric results that are directly proportional to the property being measured, as it is then not only possible to decide which approach is better but also by how much. It is never possible to judge an approach or tool to be, e.g., "good" or "productive", without breaking down how the quality concerned manifests itself in terms of measurable properties.

A2: Conclusiveness. Measurements should yield consistent results. Repeated performances of the evaluation need not yield exactly the same outcome, but they should deliver reliable values within a given margin of error. This also excludes results with low confidence (e.g., because they lack statistical significance).

A3: Impartiality. The choice of the properties to be measured must not favor particularities of one solution that have no proven relationship to the ultimate goal. For example, a set of postulated requirements must be formulated in such a manner that they reference the problem domain and the ultimate benefits to the targeted user rather than solution details.

A4: Trueness. When using proxies (e.g., substitutes for real-world artifacts or practitioners), care must be taken to ensure that no circumstantial bias is introduced. Trueness therefore comprises at least:

A4.1: Context Relevance. Model proxies (i.e., samples used in lieu of real-world models) and the assumed operations on them should be demonstrated to be representative. Otherwise, a skewed selection could introduce undesired bias.

A4.2: Demographic Relevance. Substitute users should be demonstrated to be representative of real users. In general, it is not possible to transfer results between different bodies of users (e.g., from students to practitioners in the field).
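The aspects A1 to A4.2, together with A5 (pragmatic relevance) below, form a checklist for documenting where an evaluation may be lacking. As a purely illustrative sketch of how such a checklist could be recorded, the aspect names are taken from this section, while the boolean pass/fail scoring and the example assessment are our own simplifying assumptions:

```python
# Illustrative only: aspect names come from Section II; the boolean
# pass/fail scoring and the sample assessment are assumptions, not
# part of the paper.
ASPECTS = {
    "A1": "Measurability",
    "A2": "Conclusiveness",
    "A3": "Impartiality",
    "A4.1": "Trueness: context relevance",
    "A4.2": "Trueness: demographic relevance",
    "A5": "Pragmatic relevance",
}

def lacking(assessment):
    """List the aspects a given evaluation fails to address."""
    return [f"{key} ({name})" for key, name in ASPECTS.items()
            if not assessment.get(key, False)]

# Hypothetical size-based study that used student subjects only:
study = {"A1": True, "A2": True, "A3": True, "A4.1": True}
print(lacking(study))
# ['A4.2 (Trueness: demographic relevance)', 'A5 (Pragmatic relevance)']
```

A real instrument would replace the booleans with evidence (sample sizes, significance levels, representativeness arguments), but even this minimal form makes gaps such as missing demographic relevance explicit.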
A5: Pragmatic Relevance. Targeted properties must have a bearing on the actual needs of the intended users. This criterion is the very foundation of a meaningful evaluation. The previous aspects essentially characterize sound evaluations, whereas pragmatic relevance requires that there is an intent to measure something of pragmatic value.

It is obviously challenging to "tick" all the above "boxes" in practice, but we feel it is useful to have a checklist that helps to document where an evaluation may be lacking.

III. INTERNAL VERSUS EXTERNAL QUALITIES

Some of the aforementioned aspects are more difficult to address than others. In order to understand why, it is important to be aware of whether an evaluation is intended to evaluate an internal quality or an external quality. We use these terms with their usual meaning in software engineering [17].

In our context, internal qualities pertain to the directly measurable properties of a model, e.g., number of model elements, number of constraints, average inheritance depth, etc. External qualities, on the other hand, pertain to the experience users have when working with a model, e.g., creating it, understanding it, maintaining it, etc.

Ultimately, only the external qualities have a direct bearing on meaningful evaluations. However, due to the cost and challenges involved in assessing external qualities directly in a meaningful way, one often attempts to approximate the assessment of external qualities by assessing internal qualities instead, based on the idea that there is a correlation between internal and external qualities. It is standard practice to assume that optimizing certain internal qualities (e.g., reducing complexity) is the key to achieving certain desirable external qualities (e.g., increased maintainability). However, such an indirect evaluation of external properties is only trustworthy if the assumed underlying correlation has been demonstrated, or at least has been made plausible by compelling arguments.

Interestingly, A1 & A2 are most easily addressed by focusing on the internal qualities of an approach. Such qualities, e.g., the complexity of the models created by an approach, can typically be reliably assessed. In contrast, assessing external qualities often implies some compromise in A1 & A2 because sample populations may be small or certain assumptions may not generalize.

Aspects A3 & A5, on the other hand, are best addressed by focusing on the external qualities of an approach. External qualities directly reflect the utility of the approach to its users and hence avoid solution bias (A3) plus intrinsically imply pragmatic relevance (A5). The increased cost involved in directly assessing external qualities relates to ensuring conclusiveness (A2) and trueness (A4). This cost is considerable and therefore represents a major hurdle for this kind of evaluation.

IV. ASSESSING INTERNAL QUALITIES

Complexity is one of the most commonly measured internal qualities since it is assumed to have a correlation with important external qualities such as maintainability, robustness, trustworthiness, etc. In fact, the main value proposition for MLM is its ability to reduce accidental complexity [3], i.e., the difference in complexity between an ideal model and a concrete model involving solution-induced overhead, e.g., workarounds.

A number of evaluations of multi-level modeling have been based on approximating the complexity of a model by measuring its size, that is, the number of its elements. For example, Gerbig performed a comparison based on model size in his Ph.D. thesis using a sample model from the enterprise architecture domain [7]. The MLM version of the model has 50 modeling elements while the TLM version, using standard workaround patterns such as the Type-Object pattern [10], has 95 modeling elements, amounting to an increase of 90%. Rossini et al. performed a similar evaluation which yielded a three-fold increase in the number of modeling elements in a two-level versus a multi-level model of their CloudML scenario [16].

The extent of the practical relevance of the above evaluations was shown by de Lara et al. by measuring the application frequency of TLM workaround techniques (cf. the "Item Descriptor" pattern [6], the "Type-Object" pattern [10], the "Adaptive Object-Model" [18], etc.) in real-world models [12]. Since these workaround techniques are responsible for increases in the size of two-level models relative to their multi-level counterparts, de Lara et al. hence demonstrated that the observations made in [7], [16] apply to a wide range of modeling practice. As much as 35% of all models in some areas [12] could thus benefit from the potential size reductions.

Although the above results provide a convincing endorsement of the practical relevance of MLM, they do so only to the extent that the assumption that model size (apparently equivalent to the much debated "lines of code" metric for source code) approximates model complexity is reasonable. A larger model based on a simple underlying language could conceivably be preferable to a compact model based on a complex language.

Going beyond assessing model size, it appears useful to consider other classic metrics [5], [13], [15] and quality attributes [4], [14]. Indeed, in his MLM vs. TLM comparison, Gerbig also considered such classic metrics [7]. Overall, however, these proved to be less conclusive than model size comparisons, although he detected clear advantages for MLM with respect to coupling (average number of distinct connected classes) and overhead (referred to as "complexity" in [7]), defined as (well-formedness rules + additional operations) / element count [7].

Given these less conclusive results (compared to model size analyses) it would be easy to be skeptical about the actual advantages offered by MLM. However, it is important to observe that these metrics were originally designed to target the type level only and thus entirely ignore the instance-level complexity caused by the application of TLM workarounds. This weakness of classic metrics for evaluating MLM is understandable given their motivation rooted in programming and/or modeling software. In these contexts, instances and
their relationships are irrelevant to users. However, in many domain modeling applications instances directly represent the subject under study. In such contexts, the complexity of instance models is therefore very much a concern to users and should thus be considered in evaluations.

Instead of focusing on model properties (e.g., model complexity), one may also consider language properties (e.g., language expressiveness). For example, Atkinson et al. based their comparison of Melanee with MetaDepth on the differences between their respective language features [2]. Grossmann et al.'s more comprehensive comparison of 21 MLM approaches [9] also involved language feature comparisons. However, Grossmann et al. also considered the intended target audience and the purpose of approaches, and furthermore considered the extent to which an approach has seen industry usage. This latter consideration could be regarded as including an external quality, but without further information on how well the respective MLM approaches performed in industrial contexts it is only a good starting point for further investigations.

Ideally, feature-based comparisons should be accompanied by an analysis of the impact of the different features on users. While certain features may seem elegant, ultimately their value must be assessed by considering external qualities.

V. ASSESSING EXTERNAL QUALITIES

In order to evaluate the ultimate purpose of any approach intended to deliver value to a user, it is necessary to determine properties based on external qualities which relate to user experience. As far as we are aware, only two MLM evaluations of this kind have been performed to date. Both of these investigate model changes and thus can be reasonably regarded as evaluating (aspects of) maintainability. In his Ph.D. thesis, Gerbig performed a comparative model change analysis by counting the number of primitive change operations needed to respond to certain requirements changes [7]. It turned out that a homogeneous treatment of all classification levels and Melanee's emendation service [1] reduce the effort needed to change the multi-level version of the model compared to the two-level, EMF-based version.

Kimura et al. also used a change-based approach to compare Melanee, MetaDepth and EMF, with a particular focus on extensibility [11]. These kinds of analyses exhibit ideal measurability, reproducibility, impartiality, and pragmatic relevance. However, whether context relevance is adequately addressed depends on how representative the chosen models and editing operations are.

Another external quality which lends itself relatively straightforwardly to measurement is model robustness, i.e., the resilience of a model to user error. Here the goal would be to assess the likelihood of introducing errors when creating or maintaining models. In particular, in the context of MLM to TLM comparisons, one would expect a two-level model to suffer from more accidentally introduced errors than a corresponding multi-level model. TLM would only provide the same safeguards against the introduction of model inconsistencies if all the well-formedness constraints implied by MLM are transposed into the equivalent TLM models. One would still, however, expect a higher rate of well-formedness violations, since it is most likely easier to make mistakes in a lower-level two-level model, compared to a higher-level multi-level model.

The final external quality we can cover here is productivity, i.e., the speed with which users can develop or make changes to models. The underlying hypothesis of what could be referred to as cognitive challenge-based evaluations is that modeler performance is a function of the adequacy of the language/tool used. The higher the adequacy of the language/tool, the better the modeler should perform when facing standard tasks. To this end, we propose a "5C"-approach, comprising the cognitive challenges listed in Table I.

TABLE I: Cognitive Challenges of the "5C"-Approach
C1 Comprehend: Demonstrate understanding of a model.
C2 Complete: Read an incomplete model and correctly add missing parts.
C3 Critique: Read a defective model and identify all issues.
C4 Correct: Read a defective model and address all issues.
C5 Create: Create a model from scratch for a specified purpose.

Assessing the adequacy of an approach would be performed by measuring completion speeds for representative concrete tasks of the above five kinds. If languages/tools actually yield different levels of productivity, one should expect to see differences in the C1-C5 completion measurements. Ideally, subjects should be chosen in such a way that results transfer to the intended user base in order to achieve demographic relevance. Full context relevance will be very hard to achieve with this approach as it is typically not feasible to work with realistically sized models in such experiments.

VI. CONCLUSION

The goal of this position paper has been to provide a discussion of the issues involved when aiming to perform meaningful evaluations while providing a broad overview of the MLM evaluations that have been conducted to date. The number of already existing MLM evaluations is encouraging and each of them represents a very useful step towards growing MLM as a discipline. However, our discussion has shown that the evaluations performed until now are overwhelmingly focused on internal rather than external qualities. Hence their pragmatic relevance – in the absence of the demonstration of
a strong correlation between the internal qualities they assess and the external qualities that matter to users – is limited.

It is natural that the first evaluations performed in an emerging field are focused on internal qualities, as these are usually much easier to assess than external ones. However, we believe that for a) the benefits of MLM to become convincing enough to generate serious interest from industry, and b) comparative evaluations to become useful enough to maintain the cohesion and momentum the research community requires, more user-oriented evaluations focusing on external qualities will be needed.

An important initiative in this regard is the "Bicycle Challenge" proposed by the MULTI 2017 workshop as a common sample scenario, allowing various MLM approaches to be compared based on an example with practical relevance. Ideally, more such benchmarks will be designed in the future along with agreed-upon usage scenarios, e.g., involving subsequent extensions, detecting and removing defects, etc. It will remain a challenge to distinguish models and usage scenarios that have context relevance from those that do not, but any attempts to move MLM evaluations towards directly assessing external qualities or to strengthen the confidence in hitherto only assumed correlations between internal and external qualities will represent significant steps forward.

REFERENCES

[1] Atkinson, C., Gerbig, R., Kennel, B.: On-the-fly emendation of multi-level models. In: Proceedings of the 8th European Conference on Modelling Foundations and Applications (ECMFA'12), pp. 194–209. Springer (2012)
[2] Atkinson, C., Gerbig, R., de Lara, J., Guerra, E.: A feature-based comparison of Melanee and MetaDepth. In: Proceedings of the 3rd Workshop on Multi-Level Modelling co-located with the 19th ACM/IEEE International Conference MODELS 2016. CEUR Workshop Proceedings, vol. 1722, pp. 25–34 (2016)
[3] Atkinson, C., Kühne, T.: Reducing accidental complexity in domain models. Software and Systems Modeling 7(3), 345–359 (2008)
[4] Bansiya, J., Davis, C.G.: A hierarchical model for object-oriented design quality assessment. IEEE Transactions on Software Engineering 28(1), 4–17 (2002)
[5] Chidamber, S.R., Kemerer, C.F.: A metrics suite for object oriented design. IEEE Transactions on Software Engineering 20(6), 476–493 (1994)
[6] Coad, P.: Object-oriented patterns. Communications of the ACM 35(9), 152–159 (1992)
[7] Gerbig, R.: Deep, Seamless, Multi-format, Multi-notation Definition and Use of Domain-specific Languages. Ph.D. thesis, University of Mannheim (2017)
[8] Igamberdiev, M., Grossmann, G., Selway, M., Stumptner, M.: An integrated multi-level modeling approach for industrial-scale data interoperability. Software & Systems Modeling, pp. 1–26 (2016)
[9] Igamberdiev, M., Grossmann, G., Stumptner, M.: A feature-based categorization of multi-level modeling approaches and tools. In: Proceedings of the 3rd Workshop on Multi-Level Modelling co-located with the 19th ACM/IEEE International Conference MODELS 2016. CEUR Workshop Proceedings, vol. 1722, pp. 45–55 (2016)
[10] Johnson, R., Woolf, B.: Type object. In: Martin, R.C., Riehle, D., Buschmann, F. (eds.) Pattern Languages of Program Design 3, pp. 47–65. Addison-Wesley (1997)
[11] Kimura, K., et al.: An evaluation of multi-level modeling frameworks for extensible graphical editing tools. In: Proceedings of the 3rd Workshop on Multi-Level Modelling co-located with the 19th ACM/IEEE International Conference MODELS 2016. CEUR Workshop Proceedings, vol. 1722, pp. 35–44 (2016)
[12] de Lara, J., Guerra, E., Cuadrado, J.S.: When and how to use multilevel modelling. ACM Transactions on Software Engineering and Methodology 24(2), 12:1–12:46 (2014)
[13] Lorenz, M., Kidd, J.: Object-oriented Software Metrics: A Practical Guide. Prentice-Hall (1994)
[14] Ma, H., Shao, W., Zhang, L., Ma, Z., Jiang, Y.: Applying OO metrics to assess UML meta-models. In: Baar, T., Strohmeier, A., Moreira, A., Mellor, S.J. (eds.) Proceedings of UML 2004, Lisbon, Portugal, pp. 12–26. Springer (2004)
[15] Purao, S., Vaishnavi, V.: Product metrics for object-oriented systems. ACM Computing Surveys 35(2), 191–221 (2003)
[16] Rossini, A., de Lara, J., Guerra, E., Nikolov, N.: A comparison of two-level and multi-level modelling for cloud-based applications. In: Proceedings of ECMFA 2015, LNCS 9153, pp. 18–32. Springer (2015)
[17] Sommerville, I.: Software Engineering. 10th edn. Pearson (2016)
[18] Yoder, J.W., Johnson, R.E.: The adaptive object-model architectural style. In: Proceedings of the 3rd IEEE/IFIP Conference on Software Architecture: System Design, Development and Maintenance, pp. 3–27. Kluwer (2002)