On Evaluating Multi-Level Modeling

Colin Atkinson
University of Mannheim
Mannheim, Germany
Email: atkinson@informatik.uni-mannheim.de

Thomas Kühne
Victoria University of Wellington
Wellington, New Zealand
Email: tk@ecs.vuw.ac.nz

Abstract—Multi-Level Modeling is receiving increasing levels of interest and its active research community is continuing to make progress. However, to advance the discipline effectively it is necessary to increase industry adoption and achieve better community cohesion. We believe that the key to addressing both these challenges is to promote the creation of more comparisons in the multi-level modeling field based on meaningful, objective evaluations. In this position paper, we provide our view on what constitutes meaningful evaluations and discuss some of the issues involved in obtaining them, while presenting a broad overview of existing multi-level modeling evaluations. In particular, we emphasize the importance of understanding and managing the difference between internal and external qualities.

I. INTRODUCTION

Although Multi-level Modeling (MLM) has seen steady development over recent years, industry adoption is still virtually non-existent (a rare application of MLM in an industry setting is described in [8]). One explanation for the low adoption rate is the current unavailability of industrial-strength approaches and tools. However, even if better tool support were available, wider adoption would still be hindered by the lack of compelling evidence that switching from Two-Level Modeling (TLM) to MLM brings benefits in industrial contexts. Creating convincing comparisons would reduce this barrier and could even expedite MLM research through industrial funding.

Another obstacle to the discipline's future growth is the lack of research cohesion, which may eventually make it impossible for community members to build on each other's results. Without a sense of direction, i.e., a common understanding of the way forward, the discipline runs the risk of research diversification to a point where it loses its core focus and subsequently its critical mass. We therefore believe that objective comparisons between competing approaches are not only desirable to provide a compass for future development but may eventually become necessary for the discipline's survival.

Since convincing comparative evaluations are the key to addressing both of the aforementioned challenges, in this position paper we discuss some of the issues involved in performing such evaluations in the context of MLM. We first establish the basic parameters of meaningful, scientifically sound evaluations and then discuss a variety of concrete approaches, pointing to existing work where applicable.

II. MEANINGFUL EVALUATIONS

The effectiveness of MLM evaluations in promoting industry adoption and research cohesion depends on the extent to which they measure something of relevance. Naturally, relevance itself depends on the stakeholders and their respective goals. However, in general, a meaningful evaluation provides results that have some kind of real-world relevance. In contrast, a meaningless evaluation, e.g., one measuring the number of vowels in a language's keywords, has no such real-world relevance. A comparison based on such an evaluation would not yield any meaningful insights into which language should be preferred for achieving any reasonable real-world impact. Meaningful evaluations, on the other hand, should be designed to deliver insights that provide the basis for pragmatic guidance. In order for an evaluation to be meaningful in our sense, it must address the following aspects:

A1: Measurability. Any targeted properties must be objectively observable. Ideally, measurements should yield numeric results that are directly proportional to the property being measured, as it is then not only possible to decide which approach is better but also by how much. It is never possible to judge an approach or tool to be, e.g., "good" or "productive", without breaking down how the quality concerned manifests itself in terms of measurable properties.

A2: Conclusiveness. Measurements should yield consistent results. Repeated performances of the evaluation need not yield exactly the same outcome, but they should deliver reliable values within a given margin of error. This also excludes results with low confidence (e.g., because they lack statistical significance).

A3: Impartiality. The choice of the properties to be measured must not favor particularities of one solution that have no proven relationship to the ultimate goal. For example, a set of postulated requirements must be formulated in such a manner that they reference the problem domain and the ultimate benefits to the targeted user rather than solution details.

A4: Trueness. When using proxies (e.g., substitutes for real-world artifacts or practitioners), care must be taken to ensure that no circumstantial bias is introduced. Trueness therefore comprises at least:

A4.1: Context Relevance. Model proxies (i.e., samples used in lieu of real-world models) and the assumed operations on them should be demonstrated to be representative. Otherwise, a skewed selection could introduce undesired bias.

A4.2: Demographic Relevance. Substitute users should be demonstrated to be representative of real users. In general, it is not possible to transfer results between different bodies of users (e.g., from students to practitioners in the field).

A5: Pragmatic Relevance. Targeted properties must have a bearing on the actual needs of the intended users. This criterion is the very foundation of a meaningful evaluation. The previous aspects essentially characterize sound evaluations, whereas pragmatic relevance requires that there is an intent to measure something of pragmatic value.

It is obviously challenging to "tick" all of the above "boxes" in practice, but we feel it is useful to have a checklist that helps to document where an evaluation may be lacking.

III. INTERNAL VERSUS EXTERNAL QUALITIES

Some of the aforementioned aspects are more difficult to address than others. In order to understand why, it is important to be aware of whether an evaluation is intended to evaluate an internal quality or an external quality. We use these terms with their usual meaning in software engineering [17].
In our context, internal qualities pertain to the directly measurable properties of a model, e.g., the number of model elements, the number of constraints, the average inheritance depth, etc. External qualities, on the other hand, pertain to the experience users have when working with a model, e.g., creating it, understanding it, maintaining it, etc.

Ultimately, only the external qualities have a direct bearing on meaningful evaluations. However, due to the cost and challenges involved in assessing external qualities directly in a meaningful way, one often attempts to approximate the assessment of external qualities by assessing internal qualities instead, based on the idea that there is a correlation between internal and external qualities. It is standard practice to assume that optimizing certain internal qualities (e.g., reducing complexity) is the key to achieving certain desirable external qualities (e.g., increased maintainability). However, such an indirect evaluation of external properties is only trustworthy if the assumed underlying correlation has been demonstrated, or at least has been made plausible by compelling arguments.

Interestingly, A1 and A2 are most easily addressed by focusing on the internal qualities of an approach. Such qualities, e.g., the complexity of the models created by an approach, can typically be reliably assessed. In contrast, assessing external qualities often implies some compromise regarding A1 and A2 because sample populations may be small or certain assumptions may not generalize.

Aspects A3 and A5, on the other hand, are best addressed by focusing on the external qualities of an approach. External qualities directly reflect the utility of the approach to its users and hence avoid solution bias (A3) and intrinsically imply pragmatic relevance (A5). The increased cost involved in directly assessing external qualities relates to ensuring conclusiveness (A2) and trueness (A4). This cost is considerable and therefore represents a major hurdle for this kind of evaluation.

IV. ASSESSING INTERNAL QUALITIES

Complexity is one of the most commonly measured internal qualities since it is assumed to correlate with important external qualities such as maintainability, robustness, trustworthiness, etc. In fact, the main value proposition of MLM is its ability to reduce accidental complexity [3], i.e., the difference in complexity between an ideal model and a concrete model involving solution-induced overhead, e.g., workarounds.

A number of evaluations of multi-level modeling have been based on approximating the complexity of a model by measuring its size, that is, the number of its elements. For example, Gerbig performed a comparison based on model size in his Ph.D. thesis using a sample model from the enterprise architecture domain [7]. The MLM version of the model has 50 modeling elements while the TLM version, which uses standard workaround patterns such as the Type-Object pattern [10], has 95 modeling elements, amounting to an increase of 90%. Rossini et al. performed a similar evaluation which yielded a three-fold increase in the number of modeling elements in a two-level versus a multi-level model of their CloudML scenario [16].

The extent of the practical relevance of the above evaluations was shown by de Lara et al. by measuring the application frequency of TLM workaround techniques (cf. the "Item Descriptor" pattern [6], the "Type-Object" pattern [10], the "Adaptive Object-Model" [18], etc.) in real-world models [12]. Since these workaround techniques are responsible for the increased size of two-level models relative to their multi-level counterparts, de Lara et al. hence demonstrated that the observations made in [7], [16] apply to a wide range of modeling practice. As much as 35% of all models in some areas [12] could thus benefit from the potential size reductions.

Although the above results provide a convincing endorsement of the practical relevance of MLM, they do so only to the extent that the assumption that model size (a measure apparently equivalent to the much debated "lines of code" metric for source code) approximates model complexity is reasonable. A larger model based on a simple underlying language could conceivably be preferable to a compact model based on a complex language.
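To make the element overhead counted in the size comparisons above tangible, the following sketch shows the Type-Object workaround [10] in a two-level setting. The domain (ProductType/Product) and all names are hypothetical examples of ours, not taken from the evaluated models.

```python
# Illustrative sketch (hypothetical domain): the Type-Object workaround
# emulates a missing classification level in a two-level setting.

class ProductType:
    """Explicit 'type' object: one instance per domain-level type."""
    def __init__(self, name, tax_rate):
        self.name = name
        self.tax_rate = tax_rate  # attribute shared by all products of this type


class Product:
    """Domain-level instance, linked to its type object by hand."""
    def __init__(self, product_type, serial_no):
        self.product_type = product_type  # type/instance link maintained manually
        self.serial_no = serial_no


# Every domain concept whose instances must themselves be classified needs
# an extra *Type class plus a hand-maintained link -- the kind of
# solution-induced element overhead counted in the size comparisons above.
book = ProductType("Book", tax_rate=0.07)
copy_1 = Product(book, serial_no="B-001")
```

In a multi-level model, the classification link between `Product` and `ProductType` would be native to the language rather than modeled by hand, which is where the element savings originate.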
Going beyond assessing model size, it appears useful to consider other classic metrics [5], [13], [15] and quality attributes [4], [14]. Indeed, in his MLM vs TLM comparison, Gerbig also considered such classic metrics [7]. Overall, however, these proved to be less conclusive than model size comparisons, although he detected clear advantages for MLM with respect to coupling (the average number of distinct connected classes) and overhead (referred to as "complexity" in [7]), defined as (well-formedness rules + additional operations) / element count [7].

Given these less conclusive results (compared to model size analyses), it would be easy to be skeptical about the actual advantages offered by MLM. However, it is important to observe that these metrics were originally designed to target the type level only and thus entirely ignore the instance-level complexity caused by the application of TLM workarounds. This weakness of classic metrics for evaluating MLM is understandable given their motivation rooted in programming and/or modeling software. In these contexts, instances and their relationships are irrelevant to users. However, in many domain modeling applications instances directly represent the subject under study. In such contexts, the complexity of instance models is therefore very much a concern to users and should thus be considered in evaluations.
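To make the size, coupling, and overhead metrics discussed above concrete, the following sketch computes them for a toy model. The dictionary-based model representation and all element names are our own illustrative assumptions, not the representation used in [7].

```python
from collections import defaultdict

# Toy model representation (an assumption for illustration only):
# elements, binary references between them, and counts of auxiliary artifacts.
model = {
    "elements": ["Product", "ProductType", "Order", "Customer"],
    "references": [("Order", "Product"), ("Order", "Customer"),
                   ("Product", "ProductType")],
    "wellformedness_rules": 6,
    "additional_operations": 2,
}

# Size: the number of modeling elements.
size = len(model["elements"])

# Coupling: average number of distinct connected classes per element.
neighbours = defaultdict(set)
for src, dst in model["references"]:
    neighbours[src].add(dst)
    neighbours[dst].add(src)
coupling = sum(len(neighbours[e]) for e in model["elements"]) / size

# Overhead, as defined above:
# (well-formedness rules + additional operations) / element count.
overhead = (model["wellformedness_rules"] + model["additional_operations"]) / size

print(size, coupling, overhead)  # -> 4 1.5 2.0
```

Note that such a computation only sees the type level; extending it to instance models, as argued above, would require also counting instance elements and their links.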
Instead of focusing on model properties (e.g., model complexity), one may also consider language properties (e.g., language expressiveness). For example, Atkinson et al. based their comparison of Melanee with MetaDepth on the differences between their respective language features [2]. Grossmann et al.'s more comprehensive comparison of 21 MLM approaches [9] also involved language feature comparisons. However, Grossmann et al. additionally considered the intended target audience and the purpose of the approaches, and furthermore considered the extent to which an approach has seen industry usage. This latter consideration could be regarded as including an external quality, but without further information on how well the respective MLM approaches performed in industrial contexts it is only a good starting point for further investigations.

Ideally, feature-based comparisons should be accompanied by an analysis of the impact of the different features on users. While certain features may seem elegant, ultimately their value must be assessed by considering external qualities.

V. ASSESSING EXTERNAL QUALITIES

In order to evaluate the ultimate purpose of any approach intended to deliver value to a user, it is necessary to determine properties based on external qualities which relate to user experience. As far as we are aware, only two MLM evaluations of this kind have been performed to date. Both of them investigate model changes and can thus reasonably be regarded as evaluating (aspects of) maintainability. In his Ph.D. thesis, Gerbig performed a comparative model change analysis by counting the number of primitive change operations needed to respond to certain requirements changes [7]. It turned out that the homogeneous treatment of all classification levels and Melanee's emendation service [1] reduce the effort needed to change the multi-level version of the model compared to the two-level, EMF-based version.

Kimura et al. also used a change-based approach to compare Melanee, MetaDepth and EMF, with a particular focus on extensibility [11]. These kinds of analyses exhibit ideal measurability, reproducibility, impartiality, and pragmatic relevance. However, whether context relevance is adequately addressed depends on how representative the chosen models and editing operations are.

Another external quality which lends itself relatively straightforwardly to measurement is model robustness, i.e., the resilience of a model to user error. Here the goal would be to assess the likelihood of introducing errors when creating or maintaining models. In particular, in the context of MLM to TLM comparisons, one would expect a two-level model to suffer from more accidentally introduced errors than a corresponding multi-level model. TLM would only provide the same safeguards against the introduction of model inconsistencies if all the well-formedness constraints implied by MLM were transposed into the equivalent TLM models. One would still, however, expect a higher rate of well-formedness violations, since it is most likely easier to make mistakes in a lower-level two-level model than in a higher-level multi-level model.

The final external quality we can cover here is productivity, i.e., the speed with which users can develop or make changes to models. The underlying hypothesis of what could be referred to as cognitive challenge-based evaluations is that modeler performance is a function of the adequacy of the language/tool used. The higher the adequacy of the language/tool, the better the modeler should perform when facing standard tasks. To this end, we propose a "5C" approach, comprising the cognitive challenges listed in Table I.

TABLE I
COGNITIVE CHALLENGES OF THE "5C" APPROACH

C1 Comprehend: Demonstrate understanding of a model.
C2 Complete: Read an incomplete model and correctly add missing parts.
C3 Critique: Read a defective model and identify all issues.
C4 Correct: Read a defective model and address all issues.
C5 Create: Create a model from scratch for a specified purpose.

Assessing the adequacy of an approach would be performed by measuring completion speeds for representative concrete tasks of the above five kinds. If languages/tools actually yield different levels of productivity, one should expect to see differences in the C1-C5 completion measurements. Ideally, subjects should be chosen in such a way that results transfer to the intended user base, in order to achieve demographic relevance. Full context relevance will be very hard to achieve with this approach, as it is typically not feasible to work with realistically sized models in such experiments.
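As a sketch, such a completion-speed measurement could record one timing per challenge kind per subject; the harness below and all names in it are our own illustrative assumptions, not an artifact of any existing study.

```python
import time

# The five cognitive challenge kinds of the proposed "5C" approach.
CHALLENGES = ["Comprehend", "Complete", "Critique", "Correct", "Create"]


def run_session(subject_id, perform_task):
    """Time one subject on a task of each challenge kind.

    `perform_task` is a stand-in for the actual experimental task; it
    receives the challenge name and returns once the subject has finished.
    """
    timings = {}
    for challenge in CHALLENGES:
        start = time.perf_counter()
        perform_task(challenge)
        timings[challenge] = time.perf_counter() - start
    return {"subject": subject_id, "timings": timings}


# One completion time per challenge kind allows per-challenge comparison
# between, e.g., an MLM tool group and a TLM tool group.
session = run_session("S01", lambda challenge: None)  # no-op task for illustration
```

Comparing the per-challenge distributions across tool groups, rather than a single aggregate, would show where in the comprehend-to-create spectrum an approach helps or hinders.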
VI. CONCLUSION

The goal of this position paper has been to provide a discussion of the issues involved when aiming to perform meaningful evaluations, while providing a broad overview of the MLM evaluations that have been conducted to date. The number of already existing MLM evaluations is encouraging, and each of them represents a very useful step towards growing MLM as a discipline. However, our discussion has shown that the evaluations performed until now are overwhelmingly focused on internal rather than external qualities. Hence their pragmatic relevance, in the absence of a demonstrated strong correlation between the internal qualities they assess and the external qualities that matter to users, is limited.

It is natural that the first evaluations performed in an emerging field focus on internal qualities, as these are usually much easier to assess than external ones. However, we believe that for a) the benefits of MLM to become convincing enough to generate serious interest from industry, and b) comparative evaluations to become useful enough to maintain the cohesion and momentum the research community requires, more user-oriented evaluations focusing on external qualities will be needed.

An important initiative in this regard is the "Bicycle Challenge" proposed by the MULTI 2017 workshop as a common sample scenario, allowing various MLM approaches to be compared based on an example with practical relevance. Ideally, more such benchmarks will be designed in the future, along with agreed-upon usage scenarios, e.g., involving subsequent extensions, detecting and removing defects, etc. It will remain a challenge to distinguish models and usage scenarios that have context relevance from those that do not, but any attempts to move MLM evaluations towards directly assessing external qualities, or to strengthen the confidence in hitherto only assumed correlations between internal and external qualities, will represent significant steps forward.

REFERENCES

[1] Atkinson, C., Gerbig, R., Kennel, B.: On-the-fly emendation of multi-level models. In: Proceedings of the 8th European Conference on Modelling Foundations and Applications. pp. 194–209. ECMFA'12, Springer (2012)
[2] Atkinson, C., Gerbig, R., Lara, J.D., Guerra, E.: A feature-based comparison of Melanee and MetaDepth. In: Proceedings of the 3rd Workshop on Multi-Level Modelling co-located with the 19th ACM/IEEE International Conference MODELS 2016. CEUR Workshop Proceedings, vol. Vol-1722, pp. 25–34
[3] Atkinson, C., Kühne, T.: Reducing accidental complexity in domain models. Software and Systems Modeling 7(3), 345–359 (2008)
[4] Bansiya, J., Davis, C.G.: A hierarchical model for object-oriented design quality assessment. IEEE Trans. Softw. Eng. 28(1), 4–17 (Jan 2002)
[5] Chidamber, S.R., Kemerer, C.F.: A metrics suite for object oriented design. IEEE Transactions on Software Engineering 20(6), 476–493 (1994)
[6] Coad, P.: Object-oriented patterns. Communications of the ACM 35(9), 152–159 (Sep 1992)
[7] Gerbig, R.: Deep, Seamless, Multi-format, Multi-notation Definition and Use of Domain-specific Languages. Ph.D. thesis, University of Mannheim (2017)
[8] Igamberdiev, M., Grossmann, G., Selway, M., Stumptner, M.: An integrated multi-level modeling approach for industrial-scale data interoperability. Software & Systems Modeling pp. 1–26 (2016)
[9] Igamberdiev, M., Grossmann, G., Stumptner, M.: A feature-based categorization of multi-level modeling approaches and tools. In: Proceedings of the 3rd Workshop on Multi-Level Modelling co-located with the 19th ACM/IEEE International Conference MODELS 2016. CEUR Workshop Proceedings, vol. Vol-1722, pp. 45–55
[10] Johnson, R., Woolf, B.: Type object. In: Martin, R.C., Riehle, D., Buschmann, F. (eds.) Pattern Languages of Program Design 3, pp. 47–65. Addison-Wesley (1997)
[11] Kimura, K. et al.: An evaluation of multi-level modeling frameworks for extensible graphical editing tools. In: Proceedings of the 3rd Workshop on Multi-Level Modelling co-located with the 19th ACM/IEEE International Conference MODELS 2016. CEUR Workshop Proceedings, vol. Vol-1722, pp. 35–44
[12] Lara, J.D., Guerra, E., Cuadrado, J.S.: When and how to use multilevel modelling. ACM Transactions on Software Engineering and Methodology 24(2), 12:1–12:46 (2014)
[13] Lorenz, M., Kidd, J.: Object-oriented software metrics: a practical guide. Prentice-Hall, Inc. (1994)
[14] Ma, H., Shao, W., Zhang, L., Ma, Z., Jiang, Y.: Applying OO metrics to assess UML meta-models. In: Baar, T., Strohmeier, A., Moreira, A., Mellor, S.J. (eds.) Proceedings of UML 2004, Lisbon, Portugal, pp. 12–26. Springer (2004)
[15] Purao, S., Vaishnavi, V.: Product metrics for object-oriented systems. ACM Comput. Surv. 35(2), 191–221 (2003)
[16] Rossini, A., de Lara, J., Guerra, E., Nikolov, N.: A comparison of two-level and multi-level modelling for cloud-based applications. In: Proceedings of ECMFA 2015. pp. 18–32. LNCS 9153 (2015)
[17] Sommerville, I.: Software Engineering. Pearson, 10th edn. (2016)
[18] Yoder, J.W., Johnson, R.E.: The adaptive object-model architectural style. In: Proceedings of the 3rd IEEE/IFIP Conference on Software Architecture: System Design, Development and Maintenance. pp. 3–27. Kluwer (2002)