Tool Support for Model Splitting using Information Retrieval and Model Crawling Techniques

Daniel G. Strüber, Michael Lukaszczyk, Gabriele Taentzer
Philipps-University Marburg
Department for Mathematics and Computer Science
Hans-Meerwein-Str., 35032 Marburg, Germany
{strueber,lukaszcz22,taentzer}@informatik.uni-marburg.de

ABSTRACT
To facilitate collaboration in large-scale modeling scenarios, it is sometimes advisable to split a model into a set of sub-models that can be maintained and analyzed independently. Existing automated approaches to model splitting, however, suffer from insufficient consideration of the stakeholder's intentions and add a significant overhead for comprehending the created decompositions. We present a new tool that aims to create more informed model decompositions by leveraging existing domain knowledge in the form of textual descriptions. From the user perspective, the tool comprises a textual editor for assembling the descriptions and a visual editor for reviewing and post-processing the generated splitting suggestions. We preliminarily evaluate the tool in a case study involving a real-life model.

Categories and Subject Descriptors
D.2.0 [Software Engineering]: Tools; D.2.8 [Software Engineering]: Distribution, Maintenance, and Enhancement

BigMDE'14, July 24, 2014, York, UK. Copyright © 2014 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1. INTRODUCTION
As model-driven engineering is applied in ever-larger scenarios ranging over significant spans in time and space, the maintenance obstacles induced by large models increase in urgency. Large models without a proper decomposition are hard to comprehend, to change, to reuse, and to collaborate on. Even in projects where an initial decomposition is tailored with great care, changing requirements may make it necessary to refactor for a finer-grained or even orthogonal one. As the manual refactoring of large models is non-trivial and expensive, this problem calls for automation.

Earlier automated approaches to model splitting, such as those presented in [7, 12], suggest techniques based on analysis of strongly connected components or clusters, not accounting for the semantics of the split and the intention for performing it. To address this shortcoming, a recent approach proposed in [11] aims to create model decompositions from existing domain knowledge in the form of textual descriptions: The user provides a set of descriptive texts, each describing one sub-model in the target decomposition. From this input, a splitting suggestion is created using a combined information retrieval and topology analysis approach. The descriptions can be assembled from available requirement or documentation artifacts. However, the input set is not required to be complete: In fact, the approach can support the user in incrementally discovering sub-model descriptions.

The contribution of this paper is a tool and a supporting semi-automated user process that make the outlined splitting technique available to modelers. We have tested it on large meta-models on the order of 100 to 250 classifiers. As design goals, we target usability and extensibility for the splitting of instances of arbitrary meta-models. The remainder of this paper is structured as follows: In Section 2, we briefly illustrate the underlying technique. The user process is shown in Section 3. In Sections 4 and 5, we elaborate on the design goals and implementation. In Section 6, we present a case study preliminarily evaluating the proposed tool and user process. We discuss related work and conclude in Sections 7 and 8.

2. BACKGROUND
In this section, we give a brief overview of model splitting as performed by our tool. A detailed account is found in [11].

Figure 1: Underlying model splitting technique.

The technique, outlined in Fig. 1, takes three input parameters: the model to be split (in the proposed tool, an EMF meta-model), a set of textual descriptions of each target sub-model, and a completeness condition. The completeness condition specifies whether the set of sub-model descriptions is complete or partial. The technique creates a set of mappings from model elements to sub-models, called the splitting suggestion. In the case of a complete input set, each element is assigned to one sub-model. In the partial case, some elements may remain unassigned. The user can inspect the unassigned elements to discover additional sub-models and describe them, incrementally creating a complete split.

Information retrieval. To obtain an initial mapping between the model and the textual sub-model descriptions, we apply an established statistical technique from information retrieval research: Latent Semantic Analysis (LSA) [8]. For a query (e.g., a sub-model description) over a fixed set of documents (e.g., a set of model element names), LSA scores the relevance of each document to the input query. To compute the scores, queries and documents are represented as vectors and the similarity between the query vector and each document vector is computed; intuitively speaking, this is the degree to which they point in the same direction. Mathematically, this is calculated in terms of the cosine, yielding a score between 0 and 1. The vector representation is based on a metric called term frequency-inverse document frequency (tf-idf).

Model crawling. To create the splitting suggestion, we use the model elements ranked highest by LSA as seeds. Starting from these seeds, we crawl the model exhaustively to score each model element's relevance for each target sub-model. Afterwards, each model element is assigned to the sub-model it was deemed most relevant for, ties being broken randomly. Model crawling extends an approach proposed in [9]. The underlying intuition is that of a breadth-first search: We first visit and score the seeds' neighbours, then the neighbours' neighbours, et cetera. Scores of newly accessed elements are calculated based on the scores of previously scored elements. The scoring formula accounts for topological properties, such as the connectivity of newly accessed elements, and semantic implications of the respective relationship types (e.g., in meta-models, containment suggests strong connectivity).

3. USER PROCESS
The user process, shown in Fig. 2, comprises two manual tasks (2 and 4) and three automated tasks (1, 3 and 5). The manual tasks rely on human intelligence and domain knowledge. They are facilitated by textual and visual tool support. The automated tasks are triggered by context menu entries.

Figure 2: Overview.

Figure 3: Defining the splitting description.

(1) Start the splitting process. Using a context menu entry on the meta-model to be split, the user triggers the creation of a splitting description file. The splitting description is automatically opened in a textual editor, shown in Fig. 3. By default, the file contains a small usage example.

(2) Define the splitting description. Using the editor, the user assembles the descriptions of the target sub-models. For a comfortable user experience, the editor provides syntax highlighting, static validation, and folding capabilities. The textual editor is also used for configuration: By adding the keyword "partially" and defining a numerical threshold, the user can set the completeness condition in order to obtain a partial split. Furthermore, the user can fine-tune internal parameters used during the execution of the underlying technique. In Fig. 3, the weights assigned to different relationship types and the alpha exponent that shapes the scoring function are modified. However, parameter tuning is an optional feature: In [11], we identified a default combination of parameter values that, when applied to six independent class models, achieved an average accuracy of 80% in comparison to hand-tailored decompositions.

(3) Derive a splitting suggestion. Using a context menu entry on the splitting description file, the user triggers the automated creation of a splitting suggestion. A splitting suggestion comprises a set of assignment entries, each holding a link to a model element, a link to a target sub-model, and the relevance score. To compute the splitting suggestion, the technique outlined in Section 2 is applied. The splitting suggestion is persisted to the file system.

(4) Review and post-process the suggestion. To obtain visual access to the splitting suggestion, the user can now open the model in a model editor. The user activates a dedicated layer called "model splitting". This action triggers the color-coding of model elements corresponding to the splitting suggestion, shown in Fig. 4. As a further visual aid, the assignment of a model element is also displayed textually above its name. For post-processing, the user may want to change some assignments for model elements that were not assigned to the proper target sub-model. This is done using the palette tool entry "Assign". When the user reassigns a model element, the respective entry in the splitting suggestion is automatically updated. If the user is not satisfied with the results, he or she may iterate Steps 2 to 4 as often as required, tweaking the descriptions and parameter settings. One important scenario for this is the discovery of new sub-models: The user can set the completeness condition to partial in Step 2, which leads to some model elements not being assigned in Step 3. The user inspects these elements in Step 4 to create new sub-model descriptions.

Figure 4: Reviewing and post-processing the splitting suggestion.

(5) Perform splitting. Once the user is satisfied with the post-processed splitting suggestion, he or she can trigger the actual splitting. The user may choose from two context menu entries: one for splitting the input model into multiple physical resources, the other for splitting it into sub-packages within the same resource.

4. DESIGN GOALS
In this section, we briefly discuss design goals that were fundamental in the design of the proposed tool.

Extensibility. The underlying technique possesses an innate extensibility that should be carried over to the end-user. It is applicable to models conforming to arbitrary meta-models, given that they fulfill two properties: (i) Model elements must have meaningful textual descriptions that a splitting description can be matched against. (ii) Except for trivial reconciliation, constraints imposed by the meta-model must not be broken in arbitrary sub-models. We address this design goal by using a framework approach: To customize the tool for a new meta-model, the user subclasses a set of base classes. For instance, to define how input models are converted to a generic graph representation used during crawling, they subclass a class named GraphBuilder.

Usability. The design of the tool is informed by Cognitive Dimensions, a framework for the human-centered design of languages and tools [6]: Providing an editable visual layer on top of a standard editor is a major step towards visibility (visual accessibility of elements and their relations) and away from viscosity (resistance to change). Closeness of mapping is achieved by a domain-specific language for splitting descriptions with custom editor support. Premature commitment is inhibited and progressive evaluation is promoted by providing an incremental process that allows tweaking input values while receiving rapid feedback. For traceability, our file-based approach to user input allows the user to keep the splitting description and use it later, e.g., for documentation purposes.

5. IMPLEMENTATION
The Eclipse Modeling Framework (EMF) [10] is the de-facto reference implementation of the EMOF modeling standard. Consequently, it was natural for us to design the new tool as an extension for EMF. As such, it can be plugged into an existing Eclipse installation without further effort. For the splitting description editor, we leveraged the code generation facilities of Xtext [5]. We defined a simple domain-specific language for splitting descriptions. The editor with its syntax highlighting and code completion features was fully generated by Xtext. For customization, we added a couple of checks (e.g., forbidden characters, uniqueness of sub-model names). The visual splitting layer is an extension of EcoreTools 2.0 [2], which is based on the Sirius [4] framework and, as of June 2014, scheduled to be part of the new Eclipse release Luna 4.4. We used this new technology as we benefit from its support for multiple viewpoints, allowing us to tailor a splitting viewpoint to our needs.

6. CASE STUDY
In a case study, we investigated two research questions: (RQ1) How efficient is the proposed tool in comparison to manual splitting? (RQ2) Is the proposed tool usable?

6.1 Subjects and Task
Model: Extended Joomla-Specific Language (eJSL) is a meta-model for web applications based on the Joomla content management system [3]. It comprises 116 classes, 39 enumerations, 176 enumerated attributes, 41 generalizations, 145 containment references, and 47 plain references. eJSL was designed by a doctoral student affiliated with our research group, whom we shall refer to as X. X has significant experience in modeling language design. Prior to our work, X manually split eJSL into five sub-models, calling them Pages, Content, Menu, User, and Configuration. According to his account, he invested a significant effort that spanned, among other duties, the course of two weeks. He printed the diagram on paper, then cut and reassembled fragments. Afterwards, he assigned colors to model elements in the diagram editor and laid them out by hand.

Task: We instructed another software engineer, referred to as Y, to decompose eJSL using the tool. Y is a doctoral student with significant experience in modeling language design, but unrelated to eJSL and model splitting. We asked X to provide the required domain knowledge in the form of descriptive texts briefly explaining his intuitions for the hand-tailored decomposition. The descriptions, each consisting of 85 words on average, were handed to Y in a text document. The task given to Y was to create a decomposition that faithfully reflects the separation of concerns proposed by the textual descriptions. We briefly instructed Y in the usage of the tool based on the example shown in Fig. 3 and 4 and encouraged him to make use of post-processing.

6.2 Results
Efficiency. To approach (RQ1), we define efficient as requiring a minimal amount of time to create an accurate result. Positing the hand-tailored split as perfectly accurate, we measured accuracy of the tool-supported split in terms of average F-measure, considering both precision and recall. Accuracy was determined before and after post-processing: During review of the initial splitting suggestion S1, Y reassigned five model elements to create the final suggestion S2. From S1 to S2, precision increased from 82% to 86% and recall from 84% to 88%, resulting in a rise in F-measure from 83% to 87%. It took Y five minutes to create S1. The reviewing and post-processing that brought the gain of four percentage points took a further 55 minutes. Consequently, in terms of extrapolated overall amount of time, tool-supported splitting (about one hour) outperformed manual splitting (an effort spread over two weeks).

Usability. To approach (RQ2), we conducted an informal interview. Y perceived the user process as comprehensible, the description editor as easy to use and the color-coding as useful. An activity found crucial during post-processing was examining the direct neighbours of a model element. Y perceived this task as cumbersome: He often had to navigate to edge targets outside the visible scope. For future work, we aim at dedicated support for this activity: On selection, neighbourhood information should be instantly available in a tool-tip displaying the names of adjacent elements. One further suggestion by Y, the color-coding of edges, directly made it into the current version. Y also invested considerable time in layout, i.e., arranging the color-coded model elements into groups, an activity outside of the scope of this work. It is an interesting challenge to devise a layout algorithm that arranges the sub-models of a model as clusters. Inspection of the false positives and negatives in S2 revealed that 50% of them concerned enumerations and the other 50% concerned classes. Y pointed out that enumerations were hard to relate to classes visually as they are not connected by edges. We consider representing enumerated attributes as edges rather than class members in future work.

6.3 Validity
Threats to external validity, or generalizability, are the size of the input model and the size of the test group. It remains a question left to future work whether our tool scales to meta-models with significantly more elements. However, an analysis of publicly available meta-models (footnote 1) indicates the input model size to be typical for large meta-models demanding an adequate decomposition. The test group size indeed precludes claims for generality, but allows us to provide tentative evidence for critical design weaknesses and benefits. A potential threat to internal validity, or freedom from systematic error, is the flow of information from the control group to the test group. To mitigate this threat, we ascertained in consultation with X that the textual descriptions, in vagueness and level of detail, represented the intuitions for splitting before the manual split was executed.

Footnote 1: http://www.emn.fr/z-info/atlanmod/index.php/Ecore

7. RELATED WORK
In this section, we discuss related tooling. A survey of work related to the underlying approach is provided in [11].

In the Democles model composer [1], the user can iterate the lattice of all permitted decompositions by unfolding entries in a tree-like wizard. Graphical presentation of a split is provided by an add-on graph visualization library. However, this visualization is read-only and not integrated with a modeling editor, ruling out the re-assigning of model elements for post-processing as supported by the new tool.

The splitting tool proposed in [12] makes classic clustering algorithms available for EMF models. It provides a wizard for the selection and customization of algorithms. However, except for numerical input parameters, the user cannot influence the generated results. The tool provides a tree-based editor for the reassigning of model elements to target sub-models, but does not present any visual feedback.

8. CONCLUSION
In this paper, we present a tool for the splitting of large meta-models. The tool provides a textual editor that allows defining the desired target sub-models by means of textual descriptions. It generates a splitting suggestion that can be reviewed and post-processed in a visual editor. Based on the splitting suggestion, the input model can be automatically split either into multiple resources or into packages within one resource. The tool is open source and can be found, along with the models mentioned in this paper, at https://www.uni-marburg.de/fb12/swt/forschung/software. In the future, we plan to apply the technique to models other than class models, which will make it necessary to account for constraints.

9. REFERENCES
[1] Democles. http://democles.lassy.uni.lu/, May 2011.
[2] Ecoretools 2.0. http://www.eclipse.org/ecoretools/, May 2014.
[3] Joomla. http://www.joomla.org/, May 2014.
[4] Sirius. http://www.eclipse.org/sirius/, May 2014.
[5] Xtext. http://www.eclipse.org/xtext/, May 2014.
[6] T. R. G. Green and M. Petre. Usability analysis of visual programming environments: a cognitive dimensions framework. Journal of Visual Languages & Computing, 7(2):131–174, 1996.
[7] P. Kelsen, Q. Ma, and C. Glodt. Models within models: Taming model complexity using the sub-model lattice. Fundamental Approaches to Software Engineering, pages 171–185, 2011.
[8] T. K. Landauer, P. W. Foltz, and D. Laham. An Introduction to Latent Semantic Analysis. Discourse Processes, (25):259–284, 1998.
[9] M. P. Robillard. Automatic Generation of Suggestions for Program Investigation. In Proc. of ESEC/FSE-13, pages 11–20, 2005.
[10] D. Steinberg, F. Budinsky, E. Merks, and M. Paternostro. EMF: Eclipse Modeling Framework. Pearson Education, 2008.
[11] D. Strüber, J. Rubin, G. Taentzer, and M. Chechik. Splitting models using information retrieval and model crawling techniques. Fundamental Approaches to Software Engineering, pages 47–62, 2014.
[12] D. Strüber, M. Selter, and G. Taentzer. Tool support for clustering large meta-models. In Proceedings of the Workshop on Scalability in Model Driven Engineering, page 7. ACM, 2013.
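
To illustrate the information retrieval step of Section 2, the following is a minimal sketch of tf-idf weighting and cosine scoring in Python. It is an illustration under stated assumptions, not the tool's implementation: it omits the singular value decomposition that full LSA applies, it uses a smoothed idf variant chosen here for simplicity, and the function names (tokenize, tfidf_vectors, cosine, rank) and the whitespace tokenization of element names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lower-case and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Build sparse tf-idf vectors (dicts) for a list of token lists.

    Uses a smoothed idf, log(n / df) + 1, an assumption made for this sketch.
    """
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in documents]
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, element_names):
    """Score each element name against the query, highest score first."""
    docs = [tokenize(name) for name in element_names]
    vectors, idf = tfidf_vectors(docs)
    # Query terms unknown to the document set get weight 0.
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    scored = [(name, cosine(q, vec)) for name, vec in zip(element_names, vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because tf-idf weights are non-negative, every score falls between 0 and 1, matching the range stated in Section 2. The highest-ranked names would then serve as seeds for model crawling.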
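
The breadth-first intuition behind model crawling (Section 2) can be sketched as follows. This is a simplified illustration only: the actual scoring formula, including the alpha exponent and the connectivity term, is defined in [11] and is not reproduced here. The weights in WEIGHTS, the multiplicative score propagation, the deterministic tie-breaking (the paper breaks ties randomly), and the names crawl and assign are all assumptions made for this sketch.

```python
# Hypothetical weights per relationship type; the real values are
# configurable in the splitting description and defined in [11].
WEIGHTS = {"containment": 0.9, "generalization": 0.7, "reference": 0.5}


def crawl(graph, seeds):
    """Propagate relevance scores outward from the seeds, level by level.

    graph: element -> list of (neighbour, relationship_type)
    seeds: element -> initial score (e.g., taken from the retrieval step)
    A newly reached element receives the best score offered by the
    previous level (assumed formula: neighbour score * type weight).
    """
    scores = dict(seeds)
    frontier = list(seeds)
    while frontier:
        found = {}
        for current in frontier:
            for neighbour, rel in graph.get(current, []):
                if neighbour in scores:
                    continue  # already scored at an earlier level
                candidate = scores[current] * WEIGHTS[rel]
                found[neighbour] = max(found.get(neighbour, 0.0), candidate)
        scores.update(found)
        frontier = list(found)
    return scores


def assign(graph, seeds_per_submodel):
    """Assign each element to the sub-model with the highest score.

    Ties go to the sub-model listed first (the paper breaks them randomly);
    elements never reached by any crawl are mapped to None.
    """
    per_submodel = {name: crawl(graph, seeds) for name, seeds in seeds_per_submodel.items()}
    elements = set(graph) | {n for edges in graph.values() for n, _ in edges}
    assignment = {}
    for element in elements:
        best = max(per_submodel, key=lambda name: per_submodel[name].get(element, 0.0))
        reached = per_submodel[best].get(element, 0.0) > 0.0
        assignment[element] = best if reached else None
    return assignment
```

For a toy graph in which Menu contains MenuItem, MenuItem references Page, and Page contains Section, seeding the sub-model "Menu" at Menu and the sub-model "Pages" at Page assigns MenuItem to "Menu" and Section to "Pages", since containment carries a higher weight than a plain reference.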
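
As a plausibility check of the accuracy figures in Section 6.2, the F-measure can be recomputed from the reported precision and recall. The sketch below assumes the standard balanced F-measure (the harmonic mean of precision and recall) applied to the aggregate values. The paper reports an average F-measure, so this only shows that the rounded numbers are mutually consistent; it does not reproduce the original computation. The function name f_measure is chosen here for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in Section 6.2 for the suggestions S1 and S2.
f_s1 = f_measure(0.82, 0.84)  # before post-processing
f_s2 = f_measure(0.86, 0.88)  # after post-processing

print(round(f_s1 * 100), round(f_s2 * 100))  # prints "83 87"

# Time spent by Y, in minutes: creating S1 plus review and post-processing.
total_minutes = 5 + 55
print(total_minutes)  # prints "60"
```

Under this assumption, the recomputed values round to the 83% and 87% stated in the text, and the tool-supported split took one hour in total.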