Type Inference Using Concrete Syntax Properties in Flexible Model-Driven Engineering

Athanasios Zolotas1, Nicholas Matragkas2, Sam Devlin1, Dimitrios S. Kolovos1, and Richard F. Paige1
1 Department of Computer Science, University of York, York, UK
2 Department of Computer Science, University of Hull, Hull, UK
Email: {amz502, sam.devlin, dimitris.kolovos, richard.paige}@york.ac.uk, n.matragkas@hull.ac.uk

Abstract. In traditional Model-Driven Engineering (MDE), models are instantiated from metamodels. In contrast, in Flexible MDE, language engineers initially create example models of the envisioned metamodel. Due to the lack of a metamodel at the beginning, the example models may include errors such as missing types, typos or the use of different types to express the same domain concept. In previous work [1], an approach that uses semantic properties of the example models to infer the types of the elements that are left untyped was proposed. In this paper, we build on that approach by investigating how concrete syntax properties of the example models (such as the shape or the color of the elements) can help with type inference. We evaluate the approach on an example model. The initial results suggest that on average 64% of the nodes are correctly identified.

1 Introduction

In traditional MDE, the models instantiated using editors conform to a predefined metamodel. In contrast, in Flexible MDE, language engineers use drawing editors, like those proposed in [2], [3] and [17], to express example models that will be used to infer the envisioned metamodel. These tools are used as free-form whiteboards on which the domain experts sketch elements that represent concepts and assign types to them. As there is no metamodel to specify the semantics, this process is error prone. Drawn elements may be left untyped, or the same concept may be represented using two or more types. The first can happen either unintentionally (the engineer forgot to assign a type to the element) or intentionally (engineers leave repeatedly-expressed concepts untyped). The latter may occur either because two or more engineers are involved and thus different terminology is used or, if the process is long-running, because the engineer has forgotten the type used in the past for a specific concept.

This paper addresses the challenges associated with identifying and managing omissions during type assignment in flexible modelling by using a variation of the type inference approach proposed in [1]: missing types are inferred by computing and analysing matches between untyped and typed elements that share the same graphical characteristics. The difference from the approach proposed in [1] is that in this work we look at concrete syntax properties of the drawings and not at semantic relations. More specifically, in this approach, features that represent graphical characteristics of the nodes (i.e. the shape, color, width and height) are fed to a Classification and Regression Tree (CART) algorithm which predicts the types of the untyped nodes. We present the approach in detail, using an illustrative flexible modelling approach based on GraphML and the flexible modelling technique called Muddles [2]. We demonstrate the approach's accuracy via experiments run on an example flexible model. This approach is based on the assumption that language engineers tend to use, to the extent possible, the same concrete syntax to express the same concept in a diagram.
However, in order to explore the capabilities of the approach in cases where this assumption does not hold (or holds only partially), we add noise to the example model by applying changes to the graphical properties of its nodes (e.g. randomly changing the shape of some nodes).

2 Related Work

In [4], rules that should be taken into consideration when constructing the graphical syntax of languages are proposed. The author claims that the importance of graphical notation has so far been a neglected issue and adapts the theory of communication by Shannon and Weaver [5] to highlight that the effectiveness of the graphical syntax can be increased by choosing the most appropriate notation conventions among those that the human mind can process. In [6], Bertin identified a set of 8 visual variables that can encode and reveal, in a graphical way, information about the elements they represent or their relations.

In the field of bottom-up metamodelling, Cho et al. [7] propose an approach for semi-automatic metamodel inference using example models. In [8], example models created by domain experts can be used to infer the metamodel. In [9], evolved models that do not conform to their metamodel are used to recover the metamodel. In [2], users, using a drawing tool, define example models which are then amenable to programmatic model management scripts.

Type inference is used in programming languages. The Hindley-Milner algorithm [10], [11] and its extension by Damas and Milner [12] are the most commonly used in this domain. Code statements are reduced to simpler constructs for which a type is already known. Such approaches are challenging to apply in flexible modelling, where there is no predefined abstract syntax.

Inferring types (or metamodels) from example models is a matching problem: elements that are "sufficiently similar" may have similar or identical types; a classification of matching approaches was published in [13]. A model matching technique is used in [14] to generate traceability links: the nodes of the two models are matched by checking their name similarity. In [15], each element's unique MOF identifier is used to calculate the difference and union of models that belong to the same ancestor model. In [16], manually generated signatures are used to match elements and perform model merging. The last three approaches are of limited flexibility as they depend on names or persistent identifiers for inference. Finally, in FlexiSketch [17] the authors adapt the algorithm proposed in [18] to find possible matches between the shapes of hand-drawn elements that appear on the same canvas.

This work builds on the approach presented in [1]. There, a CART algorithm is also used to infer the types of nodes, but the data fed to the classification algorithm consist of features related to semantic aspects of the example models (the number of attributes, unique incoming and outgoing references, and unique parents and children of each node). In this work, we propose the use of features that are related to the concrete syntax of the drawn example models.

3 Background: Muddles

In Muddles [2], a GraphML-compliant drawing editor called yEd (http://www.yworks.com/en/products_yed_about.html) is used to allow language engineers to draw example models that are amenable to model management suites. The drawn elements are annotated with their types. Attributes can also be added by using the appropriate node properties input boxes. The drawing is automatically transformed to the intermediate Muddle model and can then be consumed by a model management suite like the Epsilon platform [19].
In this work, since we are interested in the graphical properties of the drawings, such as the shape, the color and the size of the drawn elements, we use the extended version of the Muddles approach presented in [20], which extracts such information. A muddling example follows for better understanding.

3.1 Example

Fig. 1. Example

The language engineer is interested in creating a language for expressing zoos. The process starts by drawing an example zoo diagram (Fig. 1(b)). Next, diagram elements are annotated with basic type information. For instance, the type of both hexagon shapes is defined as Doctor and the type of the directed edges from Doctor to Animal nodes (circles) as instances of the cures relationship. The types are not bound to the shape but to each element, meaning that in another example, or even in the same drawing, one hexagon can be of type Doctor while a different hexagon can be of type Animal. Types and properties of the types (e.g. attributes, multiplicity of edges) can also be specified. More details about these properties are presented in [2].

Model management programs can then be used to manage this diagram. For example, the following Epsilon Object Language (EOL) [21] script returns the names of all the elements of type Animal (the nodes typed as "Animal" have a String attribute named name assigned to them).

  var animals = Animal.all();
  for (a in animals) {
    ("Animal: " + a.name).println();
  }

Listing 1.1. EOL commands executed on the drawing

4 Type Inference Approach

In this section the type inference approach is presented (an overview is given in Fig. 2). The engineer creates an example diagram of the envisioned DSL using a GraphML-compliant tool (yEd). Some of the elements may be left untyped for the reasons discussed in Section 1. The mechanism proposed in this approach analyses the drawing and collects graphical information from it, which is then fed into a classification algorithm to classify the untyped elements.

Fig. 2. An overview of the proposed approach (taken from [1])

4.1 Features Collection

The classification of the elements is based on characteristics that each element has. These are called features. We call the set of all the characteristics that are collected from each node the feature signature. In addition, the type of each node is attached to the last position of the feature signature. If the element is left untyped, then this place is left empty. In this approach we propose the use of 4 graphical characteristics as features for each node. The selected characteristics are presented in Table 1. Example signatures follow for better understanding.

Table 1. Signature features for nodes

  Shape:  the shape of the node (e.g. rectangle, ellipse, etc.).
  Color:  the color of the filling of the node in HEX (e.g. #FFCC00).
  Width:  the width of the node expressed in pixels.
  Height: the height of the node expressed in pixels.

For example, the feature signature for the node named "Animal Tamara" and annotated with the type "Animal" is [ellipse, #FFCC00, 114, 112, Animal]. The first value is the shape of the node (ellipse), while the second is the color of the filling. The third and fourth features are the width and the height. The last value is the annotated type for this element. Similarly, the feature signature for the node named "Animal Joe" is [ellipse, #FFCC00, 107, 105, Animal].
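As an illustration of how such signatures could be assembled, the sketch below parses a yEd GraphML file and builds a [shape, color, width, height, type] signature per node. It is a minimal sketch in Python and is not the authors' actual script: the namespace URIs follow the usual yEd conventions, and the assumption that the annotated type is stored in a custom node property named "Type" is made for illustration only.

  # Hypothetical sketch: extract [shape, color, width, height, type] signatures
  # from a yEd GraphML drawing. Not the authors' script; element paths and the
  # "Type" property name are assumptions.
  import csv
  import xml.etree.ElementTree as ET

  NS = {"g": "http://graphml.graphdrawing.org/xmlns",
        "y": "http://www.yworks.com/xml/graphml"}

  def feature_signatures(graphml_path):
      root = ET.parse(graphml_path).getroot()
      # Map GraphML <key> ids to attribute names, so the assumed "Type" property can be read.
      key_names = {k.get("id"): k.get("attr.name") for k in root.findall("g:key", NS)}
      signatures = []
      for node in root.iter("{%s}node" % NS["g"]):
          shape_node = node.find(".//y:ShapeNode", NS)
          if shape_node is None:
              continue
          geometry = shape_node.find("y:Geometry", NS)
          fill = shape_node.find("y:Fill", NS)
          shape = shape_node.find("y:Shape", NS)
          node_type = ""                                 # left empty for untyped nodes
          for data in node.findall("g:data", NS):
              if key_names.get(data.get("key")) == "Type":
                  node_type = (data.text or "").strip()
          signatures.append([shape.get("type"),            # e.g. "ellipse"
                             fill.get("color"),            # e.g. "#FFCC00"
                             float(geometry.get("width")), # width in pixels
                             float(geometry.get("height")),# height in pixels
                             node_type])
      return signatures

  # The signatures can then be stored in a file that is fed to the classifier.
  with open("signatures.csv", "w", newline="") as f:
      csv.writer(f).writerows(feature_signatures("zoo.graphml"))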
From these two signatures one can see that elements of the same type may or may not have the same signatures. This justifies the selection of a classification algorithm rather than a simple matching algorithm, as classification algorithms do not only look for perfect matches but can also classify items by picking, each time, the features that are most important in the set they are trained on. A script that parses the diagram and constructs the feature signatures for the nodes was created. The signatures are stored in a file that is fed to the CART algorithm.

4.2 Classification

Classification algorithms are a supervised machine learning method for finding functions that map input features to a discrete output class from a finite set of possible values. They require a dataset to train on, after which they can generalise from the examples given in the training set to predict the class of new, unseen instances.

Many classification algorithms exist, some of the most established being decision trees, random forests, support vector machines and neural networks [22]. For this work we chose to use decision trees. Specifically, we used the rpart package (version 4.1-9, http://cran.r-project.org/web/packages/rpart/index.html) that implements CART [23] in R (http://www.r-project.org/). In practice, other classification algorithms (e.g. support vector machines or neural networks) can often achieve higher accuracy, but they produce a hypothesis in a form that is not as easily interpreted. Given the exploratory nature of this work, these other algorithms were not deployed, in favour of the aid to debugging provided by being able to interpret how the learnt hypothesis would classify future instances. For example, a possible decision tree for type inference is illustrated in Fig. 3. Internal nodes represent features (e.g. Shape, Colour, etc.), each branch from a node is labelled with a value of that feature, and leaf nodes represent the final classification given. To classify a new instance, start at the root node of the tree and take the branch that represents the value of that feature in the new instance. Continue to process each internal node reached in the same manner until a leaf node is reached. The predicted classification of the new instance is the value of that leaf node. For example, given the tree in Fig. 3, a new instance whose shape is not an ellipse and whose colour is different from white (#FFFFFF) is classified as Zoo (the path is highlighted in Fig. 3).

In our approach, the feature signatures list that contains the signatures of the known elements of the model provides the input features to the CART algorithm. A trained decision tree is produced which can be used to classify (identify the type of) the untyped nodes using their feature signatures. The success of a classification algorithm can be evaluated by the accuracy of the resultant model (e.g. the decision tree learnt by CART) on test data not used during training. The accuracy of a model is the sum of true positives and true negatives (i.e. all correctly classified instances) divided by the total number of instances in the test set. A single measure of accuracy can be artificially inflated due to the learnt model overfitting bias in the dataset used for training. To overcome this, k-fold cross-validation can be used [24]. This approach generates k different splits of the data into training and test sets and returns the mean accuracy over the k repeats, with each repeat using a unique split of the data.

Fig. 3. Example Decision Tree
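For illustration, the following sketch shows how a decision tree could be trained on the feature signatures of the typed nodes and then used to predict the types of the untyped ones. It is a minimal sketch that uses scikit-learn's CART-style DecisionTreeClassifier in Python as a stand-in for the rpart package actually used in this work; the file name, column layout and one-hot encoding of the categorical features are assumptions made for the example.

  # Hypothetical sketch of the classification step. The authors used the rpart
  # package in R; scikit-learn's DecisionTreeClassifier (also a CART-style
  # implementation) is used here instead.
  import pandas as pd
  from sklearn.tree import DecisionTreeClassifier

  cols = ["Shape", "Color", "Width", "Height", "Type"]
  signatures = pd.read_csv("signatures.csv", names=cols)   # assumed file layout

  # One-hot encode the categorical features; width and height stay numeric.
  X = pd.get_dummies(signatures[["Shape", "Color", "Width", "Height"]],
                     columns=["Shape", "Color"])

  # Nodes with a known type form the training set; untyped nodes are classified.
  typed = signatures["Type"].notna() & (signatures["Type"] != "")
  tree = DecisionTreeClassifier()
  tree.fit(X[typed], signatures.loc[typed, "Type"])
  predicted_types = tree.predict(X[~typed])

  # Accuracy (correctly classified instances / test set size) on held-out typed
  # nodes can be estimated with k-fold cross-validation, e.g.:
  # from sklearn.model_selection import cross_val_score
  # scores = cross_val_score(DecisionTreeClassifier(), X[typed],
  #                          signatures.loc[typed, "Type"], cv=10)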
5 Experimentation Process

In this section, the experimentation process used to evaluate the performance of the proposed approach is presented. An overview is given in Fig. 4. The code, data, results and a guide on how to use the approach can be found at http://www.zolotas.net/type-inference-graphical.

Fig. 4. The experimentation process

In order to test the proposed approach we applied the classification algorithm to a muddle. This muddle was created before commencing this research, as part of a side project to express requirements for a web application. Our experience working with Muddles suggests that it is a fairly complicated example, as it consists of more than 100 elements of 20 different types. This muddle was also one of the 11 models that were part of the experiment for the approach presented in [1]. A comparison with the 10 other models used in that experiment is not possible, as these were automatically generated using mechanisms that are biased with respect to the 4 features that we assess in this work: all of their nodes were of the same shape, color and size.

In addition, we tested the resilience of the proposed approach to human error and to the bias that our muddling habits may cause: we tend to use the same shape when we express a specific type. We need to highlight here that, in some cases, all elements of the same type share the same features, but there are types for which this is not the case. Conversely, some feature values (e.g. rectangle) were not used for only one type: elements of different types share common features in our experimentation example. Arguably, this is the case with any other relatively large muddle, as the number of available shapes and colors is in practice limited. Thus, in order to check if adhering to some basic conventions when drawing an example model is important for the accuracy of the prediction, we performed a second experiment by adding noise to some of the elements, explicitly changing their features. We did that gradually, by altering randomly one feature of none (0%) up to all (100%) of the elements of the muddle, using a step of 20% (0%, 20%, ..., 100%). A 40% noise addition means that 40% of the nodes have exactly one feature (randomly selected) changed to something else (e.g. the shape is changed from rectangle to ellipse).

A step-by-step description of the experiment follows. Initially, a script collects the features from the muddle and places them in a list that includes the feature signature of each element (step 1). As the example has 105 nodes, there are 105 feature signatures in this list. This process is repeated 6 times; a new signatures list file is created for each of the noise levels. Next, each list is randomly separated into a training and a test set (step 2 in Fig. 4). The training set contains the nodes for which, in a realistic scenario, the types are known, while the test set contains those nodes that are left untyped. In order to reach unbiased results (and avoid an unlucky or lucky random sampling), we perform the random sampling 10 times for each file (10-fold). It is also of interest to identify whether the amount of knowledge that the algorithm has about each diagram affects the success ratio. For that reason we use 7 different sampling rates, from 30% to 90%. For example, a 40% sampling rate means that 40% of the nodes are assumed to be of known type, while the rest (60%) are the nodes for which the type is unknown.
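To make steps 1 and 2 of Fig. 4 more concrete, the sketch below shows one possible way to inject noise (altering exactly one randomly chosen feature of a given percentage of the nodes) and to split a signature list into training and test sets at a given sampling rate. It is a simplified reconstruction, not the scripts used in the experiment: it reuses the hypothetical feature_signatures function from the earlier sketch, and the pools of replacement shapes and colors are assumptions.

  # Hypothetical sketch of the noise-injection and sampling steps.
  import random

  SHAPES = ["rectangle", "ellipse", "hexagon", "diamond"]   # assumed pool
  COLORS = ["#FFCC00", "#FFFFFF", "#99CCFF", "#FF0000"]     # assumed pool

  def add_noise(signatures, noise_level, rng=random):
      # Alter exactly one randomly chosen feature of noise_level percent of the nodes.
      noisy = [list(sig) for sig in signatures]
      for i in rng.sample(range(len(noisy)), int(len(noisy) * noise_level)):
          feature = rng.randrange(4)          # 0=shape, 1=color, 2=width, 3=height
          if feature == 0:
              noisy[i][0] = rng.choice([s for s in SHAPES if s != noisy[i][0]])
          elif feature == 1:
              noisy[i][1] = rng.choice([c for c in COLORS if c != noisy[i][1]])
          else:
              noisy[i][feature] = float(noisy[i][feature]) * rng.uniform(0.5, 2.0)
      return noisy

  def split(signatures, sampling_rate, rng=random):
      # A 40% sampling rate means 40% of the nodes are treated as typed (training).
      shuffled = signatures[:]
      rng.shuffle(shuffled)
      cut = int(len(shuffled) * sampling_rate)
      return shuffled[:cut], shuffled[cut:]   # (training set, test set)

  # 6 noise levels x 7 sampling rates x 10 random splits = 420 training/test couples.
  for noise in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
      noisy = add_noise(feature_signatures("zoo.graphml"), noise)
      for rate in [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
          for repeat in range(10):
              train, test = split(noisy, rate)
              # ...train the decision tree on `train` and measure its accuracy on `test`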
The generated couples of training and test sets (420 in total) are then fed to the CART algorithm (step 3). The algorithm is trained on a training set and predicts the types of the elements of the corresponding test set. The success ratio of the prediction is then calculated.

6 Results

Table 2 summarises the results of all 420 runs. Each cell in the table contains the average accuracy of the classification over the 10 runs for the specific added-noise level and sampling rate. For instance, the highlighted value 0.64 indicates that, on average, 64% of the missing types were successfully predicted for the 10 samples of the 20% added-noise model using a 70% sampling rate.

The results suggest that the accuracy scores for the original (0% noise) model are, on average, similar to the success scores of the inference approach proposed in [1] (last row of Table 2). As expected, increasing the level of noise results in worse prediction scores. The same trend is also visible when decreasing the sampling rate. The correlation coefficients Corel. 1 and Corel. 2, whose definitions follow, confirm this visual observation.

Corel. 1: How strong is the dependency between the sampling rate and the success score?
Corel. 2: How strong is the dependency between the added-noise level and the success score?

The values for Corel. 1 indicate a strong correlation for all the added-noise levels: prediction scores increase as the training sets (the nodes with known types) become larger. The same behaviour was also observed in [1]. Regarding the second correlation (Corel. 2), we observe a very strong negative correlation between the number of nodes in a drawing that have altered features and the success score, across all the sampling rates. This is evidence that following specific rules in the graphical syntax of the drawing increases the chances of correct type inference. By the term "specific rules" it is not implied that these rules should be strict. As discussed in previous sections, in the 0% added-noise example the authors use the same shapes to express the same concepts, or in other cases the same color, but not in a rigorous manner: the same graphical properties are used for different concepts, while the same concepts may have different graphical properties. We have also identified that the results related to the added-noise experimentation (especially those for 80% and 100%) are strongly influenced by the randomness in picking the feature that is altered each time. More specifically, if the random algorithm picked a lucky noise injection (changing in each node a feature that is not distinctive for its type), then the results were better.

Table 2. Results summary table (average success ratio / accuracy for different sampling rates)

Noise Level   30%    40%    50%    60%    70%    80%    90%    Avg.   Corel. 1
0%            0.55   0.58   0.61   0.63   0.67   0.71   0.71   0.637   0.99
20%           0.50   0.53   0.58   0.61   0.64   0.63   0.74   0.604   0.96
40%           0.47   0.53   0.52   0.56   0.57   0.62   0.63   0.557   0.97
60%           0.42   0.46   0.45   0.49   0.46   0.56   0.53   0.481   0.85
80%           0.35   0.38   0.44   0.43   0.50   0.44   0.56   0.443   0.89
100%          0.35   0.35   0.35   0.37   0.42   0.42   0.40   0.380   0.85
Corel. 2     -0.98  -0.98  -0.99  -0.99  -0.95  -0.98  -0.93   -       -
Muddle [1]    0.55   0.56   0.59   0.65   0.59   0.67   0.64   0.607   -

6.1 Threats to Validity

One example, created before commencing this research, was used to evaluate the approach. Although a single example may not be enough to draw safe conclusions, we believe that it provides at least preliminary evidence that concrete syntax can be used to infer the types of nodes in flexible modelling.
A related threat, which works against the approach, is that the model consists of 20 different types. According to our experience with Muddles, this is a borderline case, as engineers tend to move to more rigorous editors as their models increase in size. The results of the experiments run for the previous approach [1] suggest that as the number of types increases, the prediction accuracy decreases. In addition, the example model contained a number of highly repeated elements of the same type (e.g. nodes of type "MenuItem"). If these elements are correctly identified, the fact that they appear many times in the example leads to better success scores. However, a balancing fact is that there were also 5 types which appear fewer than 2 times, reducing the chances of predicting them correctly in case all of their instances end up in the test set (there would be no appearance in the training set, so the algorithm would not know about the existence of this specific type).

7 Conclusions and Future Work

In this work we assessed whether the concrete syntax of flexible models can be used to infer, using CART, the types of the elements that are left untyped. Experiments suggest that on average 64% of the types were correctly identified. We also experimented with the intentional addition of noise to the diagram to check how this affects the prediction accuracy. A strong correlation between the percentage of altered nodes and the accuracy was identified, providing evidence that this approach is more successful if it is used under the assumption that modellers tend to use, to the extent possible, the same graphical notation for elements of the same concept. We believe that this behaviour can be "unintentionally" replicated because of the "copy-paste" nature of muddling (e.g. create an animal node once and then copy and paste the node when it is needed again). This way the same graphical notation is used for all the elements of the same type. It is important to highlight that if the node was typed before the "copy-paste" event took place, which is not necessarily always the case, then the type is also transferred to the newly pasted nodes.

In the future, we plan to introduce and test additional features like font size and color, border size, orientation, etc. In addition, in order to check if the prediction accuracy can be further increased, our intention is to combine the concrete syntax features with the semantics-related features proposed in [1]. Finally, we plan to run user studies to further evaluate the approach on more example models developed by language engineers.

Acknowledgments

This work was carried out in cooperation with Digital Lightspeed Solutions Ltd, and was part-supported by the Engineering and Physical Sciences Research Council (EPSRC) through the Large Scale Complex IT Systems (LSCITS) initiative, and by the EU through the MONDO FP7 STREP project (#611125).

References

1. Zolotas, A., Matragkas, N., Devlin, S., Kolovos, D., Paige, R.: Type inference in flexible model-driven engineering. In Taentzer, G., Bordeleau, F., eds.: Modelling Foundations and Applications. Volume 9153 of Lecture Notes in Computer Science. Springer International Publishing (2015) 75–91
2. Kolovos, D.S., Matragkas, N., Rodríguez, H.H., Paige, R.F.: Programmatic muddle management. XM 2013–Extreme Modeling Workshop (2013) 2
3. Gabrysiak, G., Giese, H., Lüders, A., Seibel, A.: How can metamodels be used flexibly. In: Proceedings of the ICSE 2011 Workshop on Flexible Modeling Tools, Waikiki/Honolulu.
Volume 22. (2011)
4. Moody, D.L.: The physics of notations: toward a scientific basis for constructing visual notations in software engineering. IEEE Transactions on Software Engineering 35(6) (2009) 756–779
5. Shannon, C.E., Weaver, W.: The mathematical theory of communication. (2002)
6. Bertin, J.: Semiology of graphics: diagrams, networks, maps. (1983)
7. Cho, H., Gray, J., Syriani, E.: Creating visual domain-specific modeling languages from end-user demonstration. In: Modeling in Software Engineering (MISE), 2012 ICSE Workshop on, IEEE (2012) 22–28
8. Sánchez-Cuadrado, J., De Lara, J., Guerra, E.: Bottom-up meta-modelling: An interactive approach. Springer (2012)
9. Javed, F., Mernik, M., Gray, J., Bryant, B.R.: MARS: A metamodel recovery system using grammar inference. Information and Software Technology 50(9) (2008) 948–968
10. Hindley, R.: The principal type-scheme of an object in combinatory logic. Transactions of the American Mathematical Society (1969) 29–60
11. Milner, R.: A theory of type polymorphism in programming. Journal of Computer and System Sciences 17(3) (1978) 348–375
12. Damas, L., Milner, R.: Principal type-schemes for functional programs. In: Proceedings of the 9th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ACM (1982) 207–212
13. Kolovos, D.S., Di Ruscio, D., Pierantonio, A., Paige, R.F.: Different models for model matching: An analysis of approaches to support model differencing. In: Proceedings of the 2009 ICSE Workshop on Comparison and Versioning of Software Models. CVSM '09, Washington, DC, USA, IEEE Computer Society (2009) 1–6
14. Grammel, B., Kastenholz, S., Voigt, K.: Model matching for trace link generation in model-driven software development. Springer (2012)
15. Alanen, M., Porres, I.: Difference and union of models. Springer (2003)
16. Reddy, R., France, R., Ghosh, S., Fleurey, F., Baudry, B.: Model composition: a signature-based approach. In: Aspect Oriented Modeling (AOM) Workshop. (2005)
17. Wüest, D., Seyff, N., Glinz, M.: FlexiSketch: A mobile sketching tool for software modeling. In: Mobile Computing, Applications, and Services. Springer (2013) 225–244
18. Coyette, A., Schimke, S., Vanderdonckt, J., Vielhauer, C.: Trainable sketch recognizer for graphical user interface design. In: Human-Computer Interaction–INTERACT 2007. Springer (2007) 124–135
19. Paige, R.F., Kolovos, D.S., Rose, L.M., Drivalos, N., Polack, F.A.: The design of a conceptual framework and technical infrastructure for model management language engineering. In: Engineering of Complex Computer Systems, 2009 14th IEEE International Conference on, IEEE (2009) 162–171
20. Zolotas, A., Kolovos, D.S., Matragkas, N., Paige, R.F.: Assigning semantics to graphical concrete syntaxes. In: XM 2014–Extreme Modeling Workshop. 12
21. Kolovos, D.S., Paige, R.F., Polack, F.A.: The Epsilon Object Language (EOL). In: Model Driven Architecture–Foundations and Applications, Springer (2006) 128–142
22. Han, J., Kamber, M.: Data mining: concepts and techniques. Morgan Kaufmann, San Francisco, CA (2001)
23. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and regression trees. CRC Press (1984)
24. Mitchell, T.M.: Machine learning. McGraw Hill, Burr Ridge, IL (1997)