A Tool for Clustering Metamodel Repositories

Francesco Basciani, Davide Di Ruscio, Juri Di Rocco, Alfonso Pierantonio
DISIM - University of L’Aquila (Italy) - Email: {name.surname@univaq.it}

Ludovico Iovino
Gran Sasso Science Institute, L’Aquila, Italy - Email: {ludovico.iovino@gssi.infn.it}

Abstract. Over the last years, several model repositories have been proposed in response to the need of the MDE community for advanced systems supporting the reuse of modeling artifacts. Modelers can interact with MDE repositories with different intents, ranging from mere repository browsing to searching for specific artifacts satisfying precise requirements. The organization and browsing facilities provided by current repositories are limited, since they do not produce structured overviews of the contained artifacts, and the categorization mechanisms (if any) are based on manual activities. When dealing with large numbers of modeling artifacts, such limitations increase the effort related to both managing and reusing artifacts stored in model repositories. By focusing on metamodel management, in this paper we propose a clustering tool for automatically organizing stored metamodels and providing users with repository overviews such as, for instance, the application domains covered by the available metamodels. The approach has been implemented and integrated in the MDEForge repository¹.

[Fig. 1. Example of classified metamodels. Labels in the figure: Database (MySQL, RelationalDBSchema), Project management (MSProject, MSProject2, Gantt), Building tools (Maven, Ant).]

I. MOTIVATION AND GOALS

The increasing adoption of Model-Driven Engineering (MDE) [19] in business organizations has led to the need to gather artifacts in model repositories [11]. Several model repositories (see [12], [13], [15], [16], [11], just to mention a few) have been introduced in the past decade. Among them, metamodel zoos (for instance the Ecore Zoo²) hold metamodels, which are typically categorized to improve search and/or browse operations. However, locating relevant information in a vast repository is intrinsically difficult, because it requires domain experts to manually annotate all metamodels in the repository with accurate metadata [4]: an activity that is time consuming and prone to errors and omissions. In fact, acquiring knowledge about a software artifact is a challenging task: it is estimated that up to 60% of software maintenance is spent on comprehension [5]. In order to mitigate the difficulties related to the manual categorization of metamodels, we propose a clustering tool for metamodel repositories: an unsupervised procedure, which automatically organizes metamodels into clusters. Mutually similar artifacts are grouped together depending on a proximity measure, whose definition can be given according to specific search and browsing requirements. The tool is based on agglomerative hierarchical clustering [14] and explores well-known proximity measures as well as metamodel-specific ones, each providing different browsing characteristics.

¹ This research was supported by the EU through the Model-Based Social Learning for Public Administrations (Learn Pad) FP7 project (619583).
² ATLAS Ecore Zoo: http://www.emn.fr/z-info/atlanmod/index.php/Zoos

II. BACKGROUND

Even though several MDE approaches have been conceived over the last years to support a wide range of model management activities, model repositories are not yet as well developed and widespread as source-code repositories [7], [4]. Most of the potential benefits of the existing online repositories remain unexploited, especially when hundreds or even thousands of modeling artifacts have to be managed. In particular, by focusing on the provided functionalities for organizing, browsing, and searching metamodels, all the available repositories are affected by the following issues:

I1. they do not provide the means to automatically produce structured overviews of the contained metamodels, which are typically shown as mere lists of stored elements and are consequently difficult to browse. Organizations like the one shown in Fig. 1 would give an overview of the metamodels stored in the considered repository, e.g., with respect to the covered application domains;

I2. none of the available repositories provides mechanisms to automatically categorize the stored artifacts, thereby making the interaction with the repositories complex. Even users that want to contribute additional artifacts have to manually annotate and classify them during the creation phase.

In the next sections we propose a tool able to address these challenges by focusing on the management of metamodels stored in publicly available repositories.

III. PROPOSED METAMODEL CLUSTERING APPROACH

In order to deal with the issues discussed in Section II, in this section we propose an unsupervised metamodel clustering mechanism that automatically organizes unstructured metamodel repositories and provides the users with overviews of the available metamodels.

[Fig. 2. MDEForge Architecture. Labels in the figure: Users, Web access, REST API; Extensions (Proximity Calculator, Clustering Creator, Clustering Visualizer, Transformation chain, Metrics Calculator, ...); Core (Transformation, Model, and Metamodel repositories).]

A. Overview

Two different user roles are involved in the proposed clustering approach, namely the Repository Maintainer and the Repository User, discussed in the following.

Repository Maintainer: the application of the whole metamodel clustering approach is performed by the maintainer of the repository, who has access to the functionalities described below.

Apply Metamodel Clustering: it represents the key functionality of the proposed clustering approach. It consists of calculating the proximity matrix representing the similarities of all the metamodels available in the repository, and then applying the clustering algorithm.

Manage Singleton Clusters: when a new metamodel is added to the repository, it may happen that, according to the proximity measure in use, it does not fit in any of the existing clusters and consequently induces the creation of a singleton cluster, i.e., a cluster consisting of only one element. The repository maintainer can periodically review all the available singleton clusters and verify whether they have been created, e.g., because the proximity measure in use has to be refined.

Refine the Proximity Measure: the proximity measure plays a key role in the whole clustering approach, and consequently its definition is an iterative process, aiming at increasing the accuracy of the automatically obtained metamodel clusters. The refinement process relies on the availability of reference data, which are typically obtained by manual activities. Such data must be approximated by the automated clustering procedure as discussed in the next section.

Repository User: similarly to what happens in the case of open source software, the availability of public model repositories can give rise to multitudes of users and developers that are willing to share their modeling artifacts. In this respect, by focusing on the metamodel clustering aspects, the proposed approach provides the users with the functionalities discussed below.

Add New Metamodel: in contrast to existing metamodel archives, users that add new metamodels to the repository can omit the specification of corresponding metadata. Even in such cases, the provided approach is able to automatically classify the new metamodels. In fact, the appropriate clusters are identified by considering the content of the metamodels without the need for additional user input. However, as previously mentioned, it might happen that newly added metamodels do not fit in any of the existing clusters. Then, the repository maintainer takes care of such situations by means of the functionality Manage Singleton Clusters previously discussed.

Visualize Metamodel Clusters: the approach produces overviews of the automatically produced metamodel clusters. Thus, in addition to the list of available metamodels, the system is able to generate graphical representations of the available metamodel clusters and also gives the means to navigate them and retrieve detailed information about their content if requested by the user.

B. Supporting tool

The proposed clustering method has been implemented as extensions of the MDEForge platform [1]. In particular, as shown in Fig. 2, MDEForge consists of core services that are provided to enable the management of modeling artifacts, namely transformations, models, and metamodels. Atop such core services, extensions can be developed to add new functionalities. Both core services and extensions are available through Web access and programmatic interfaces (API) that enable their adoption as software as a service. For instance, in [2] we propose a service to automatically compose model transformations according to user requirements. We have also developed extensions to calculate several metrics on stored artifacts, and to support the understanding of metamodel and transformation characteristics [8], [6]. In the remainder of the section, we give details about the extensions that are shown in dashed boxes in Fig. 2 and that we have developed to support the proposed metamodel clustering approach. Concerning the other services of MDEForge, the reader can refer to [1], [7].

Proximity Calculator: it plays a key role in the proposed clustering approach since it is responsible for calculating the mutual similarities between all the metamodels and thus creating a corresponding proximity matrix. Identifying the appropriate similarity measure is a difficult task that might depend on the available data set, on the considered application domain, on the goal of the analysis being performed, etc. [14]. Consequently, from an architectural point of view, the proximity calculator has been designed in terms of an interface consisting of a method calculateSimilarity(Metamodel mm1, Metamodel mm2), for which different concrete implementations can be provided. Several similarity measures are already available in the system, and we plan to experiment with and provide additional ones. In particular, several similarity measures have been proposed in the literature [3]. Among those typically applied to text documents we have considered the cosine similarity [3] and Dice's coefficient [9], with the aim of relating the similarity of two metamodels to the terms used therein and consequently to the corresponding application domains. Moreover, we have developed two additional similarity functions specifically conceived for modeling artifacts. Both of them rely on the matching models calculated by means of EMFCompare³:

i) Match-based similarity: it is defined as the total number of matched elements identified by EMFCompare divided by the total number of elements contained in the analysed pair of metamodels;

ii) Containment-based similarity: the previous index does not perform well when one of the input metamodels is contained in the other one. As an example, we can consider the full specification of UML and the UML Class Diagrams. In such cases the match-based similarity value would be very low, since the total number of matched elements would be much smaller than the total number of elements contained in the two metamodels. In order to deal with such cases, the containment-based similarity is defined as the total number of matched elements divided by the smaller of the total numbers of elements in the two input metamodels.

³ http://www.eclipse.org/emf/compare/

Clustering Creator: by using the proximity calculator previously discussed, it creates clusters of metamodels by applying the agglomerative hierarchical clustering algorithm. As to the cluster proximity calculation, which is performed during each iteration of the algorithm, it is possible to specify the distance to be used, i.e., single link, complete link, or group average.

Cluster Visualizer: it creates graphical and tabular representations of the calculated metamodel clusters. The user can explore the available metamodels by specifying the similarity measure to be applied and the threshold value used to filter the identified metamodel pairs, so that only those with a similarity value greater than the given threshold are shown. The left-hand side of Fig. 3 shows the cluster visualizer at work. In particular, the shown connected graphs represent the identified clusters, and the thickness of each edge is proportional to the proximity value of the connected metamodels, which are represented as nodes in the graph. For each cluster, the system permits retrieving additional information, as shown in the upper right-hand side of Fig. 3. In particular, given a cluster, all the contained metamodels are listed together with additional information like the most representative metamodel, i.e., the one most connected with the other ones in the cluster. Additionally, metamodels can be downloaded or even viewed by means of an integrated tree-based editor.

[Fig. 3. Sample visualizations of automatically created metamodel clusters]

IV. APPLICATION OF THE PROPOSED METAMODEL CLUSTERING APPROACH

In this section we discuss the application of the clustering approach to a concrete data set consisting of 295 metamodels retrieved from the Ecore Zoo. We have applied the clustering technique by using the four similarity functions mentioned in the previous section and by specifying different thresholds. Due to space constraints, in this section we focus on the match-based similarity measure. For the same reason, the process that we have followed to validate the developed clustering technique is also omitted. It is worth noting that the data reported in Table I can be reproduced by interacting with the cluster visualizer component discussed in the previous section, which permits selecting the similarity measure to be used and the desired similarity threshold. Then the graphical representation of the retrieved clusters is updated accordingly.

Figure 4 shows the number of clusters that are identified with respect to the chosen similarity threshold. A threshold that is too low corresponds to considering the repository population almost undistinguished, whilst a threshold that is too high returns too many clusters with too few elements.

TABLE I
MATCH-BASED SIMILARITY

Threshold   Clusters   Avg. cluster size   Max cluster size   Singleton clusters
0.1         45         6.555               228                37
0.15        96         3.072               152                76
0.2         157        1.878               72                 129
0.25        192        1.536               19                 160
0.3         214        1.378               14                 182
0.35        227        1.299               14                 201
0.4         234        1.260               14                 210
0.45        238        1.239               14                 213
0.5         245        1.204               14                 224
0.55        250        1.180               13                 232
0.6         256        1.152               12                 241
0.65        257        1.148               12                 242
0.7         259        1.139               12                 243
0.75        263        1.122               8                  246
0.8         268        1.101               6                  252
0.85        272        1.085               4                  258
0.9         280        1.054               4                  268
0.95        288        1.024               3                  282

[Fig. 4. Match-based Similarity thresholds]

V. RELATED WORK

Clustering techniques have been used in several applications including software and data comprehension. In [18] the authors present a methodology for handling the problem of database migration. The approach uses semantic clustering to facilitate the translation of an extended entity relationship schema into a schema of complex objects. They start from an Extended Entity Relationships (EER) schema to create a set of clustered schemata such that each clustered schema corresponds to a level of abstraction and grouping of the initial schema. By iteratively shrinking portions of the EER diagram into complex entities, the approach creates a schema of complex entities, hiding the details about the components. The user can select a level of clustering to show components at some degree of detail, exactly as we do in our approach. In [10] the authors use clustering techniques and Model-Driven Reverse Engineering principles for software comprehension. In particular, the authors start by extracting data from source code for the construction of the input data matrix. In the code extraction, they consider the paragraph as the smallest atomic unit, and their cluster analysis is based on the hypothesis that record fields existing in the same paragraphs can be grouped. For the data matrix, the distance chosen for the cluster identification is the Euclidean distance. The paper in [20] presents a tool for the decomposition of a meta-model into clusters of model elements. The authors claim that large-scale diagrams, representing models and metamodels, are often difficult to understand because of the lack of appropriate modularization structures that allow examining a model in sub-parts. This work provides a meaningful way to split a monolithic model into sub-models for comprehension and maintenance. The work in [17] presents a technique, which is based on metamodeling, Petri nets, and facets, for the analysis and clustering of requirements diagrams. Intuitively, the approach is able to obtain the domain description in terms of the relations and dependencies of modeled services. Then the analysis and the clustering of requirements are automatically calculated accordingly.

VI. ADDITIONAL INFORMATION

– MDEForge website and source code: http://www.mdeforge.org
– Related publications: [1], [2], [7], [8], [6]

REFERENCES

[1] F. Basciani, J. Di Rocco, D. Di Ruscio, A. Di Salle, L. Iovino, and A. Pierantonio. MDEForge: an Extensible Web-Based Modeling Platform. In Procs. of CloudMDE@MoDELS 2014, Valencia, Spain, September 30, 2014, pages 66–75, 2014.
[2] F. Basciani, D. Di Ruscio, L. Iovino, and A. Pierantonio. Automated Chaining of Model Transformations with Incompatible Metamodels. In Procs. of MODELS 2014, pages 602–618, 2014.
[3] P. Berkhin. A survey of clustering data mining techniques. In J. Kogan, C. Nicholas, and M. Teboulle, editors, Grouping Multidimensional Data, pages 25–71. Springer Berlin Heidelberg, 2006.
[4] B. Bislimovska, A. Bozzon, M. Brambilla, and P. Fraternali. Textual and Content-Based Search in Repositories of Web Application Models. ACM Transactions on the Web, 8(2):1–47, Mar. 2014.
[5] P. Bourque, R. Dupuis, A. Abran, J. W. Moore, and L. L. Tripp. The Guide to the Software Engineering Body of Knowledge. IEEE Software, 16(6):35–44, 1999.
[6] J. Di Rocco, D. Di Ruscio, L. Iovino, and A. Pierantonio. Mining metrics for understanding metamodel characteristics. In Procs. of MiSE 2014 at ICSE 2014, pages 55–60, 2014.
[7] J. Di Rocco, D. Di Ruscio, L. Iovino, and A. Pierantonio. Collaborative repositories in Model-Driven Engineering. IEEE Software, pages 28–34, May 2015.
[8] J. Di Rocco, D. Di Ruscio, L. Iovino, and A. Pierantonio. Mining Correlations of ATL Model Transformation and Metamodel Metrics. In Procs. of MiSE 2015 at ICSE 2015, 2015.
[9] L. R. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[10] O. El Beggar, B. Bousetta, and G. Taoufiq. Comparative study between clustering and model driven reverse engineering approaches. Lecture Notes on Software Engineering, 1(2), 2013.
[11] R. B. France, J. M. Bieman, S. P. Mandalaparty, B. H. C. Cheng, and A. Jensen. Repository for Model Driven Development (ReMoDD). In Procs. of ICSE 2012, pages 1471–1472. IEEE, 2012.
[12] C. Hein, T. Ritter, and M. Wagner. Model-driven tool integration with ModelBus. Workshop Future Trends of Model-Driven, 2009.
[13] T. Holmes, U. Zdun, and S. Dustdar. Automating the Management and Versioning of Service Models at Runtime to Support Service Monitoring. In EDOC, pages 211–218, Sept. 2012.
[14] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys (CSUR), 31(3):264–323, 1999.
[15] B. Karasneh and M. R. V. Chaudron. Online img2uml repository: An online repository for UML models. In Procs. of EESSMod 2013 at MoDELS 2013, pages 61–66, 2013.
[16] R. Kutsche, N. Milanovic, G. Bauhoff, T. Baum, M. Cartsburg, D. Kumpe, and J. Widiker. BIZYCLE: Model-based Interoperability Platform for Software and Data Integration. In Procs. of the MDTPI at ECMDA, 2008.
[17] O. Lopez, M. A. Laguna, and F. J. Garcia. Reuse based analysis and clustering of requirements diagrams. In Procs. of REFSQ02, pages 71–82, 2002.
[18] R. Missaoui, R. Godin, and H. Sahraoui. Migrating to an object-oriented database using semantic clustering and transformation rules. Data and Knowledge Engineering, 27(1):97–113, 1998.
[19] D. C. Schmidt. Guest Editor’s Introduction: Model-Driven Engineering. Computer, 39(2):25–31, Feb. 2006.
[20] D. Strüber, M. Selter, and G. Taentzer. Tool support for clustering large meta-models. In Procs. of BigMDE ’13, pages 7:1–7:4. ACM, 2013.
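
As an illustration of Section III-B, the match-based and containment-based similarities, together with the threshold-based grouping shown by the cluster visualizer, can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the MDEForge implementation: the function names are hypothetical, the number of matched elements is assumed to be already available as an integer (in the tool it comes from EMFCompare, and how matches are counted is not specified here), and clusters are taken to be the connected components of the graph whose edges are the metamodel pairs with similarity greater than the threshold. That grouping corresponds to a single-link criterion; complete link and group average are not covered.

```python
from typing import Dict, Iterable, List, Tuple


def match_based_similarity(matched: int, size_a: int, size_b: int) -> float:
    """Matched elements divided by the total elements of both metamodels."""
    total = size_a + size_b
    return matched / total if total else 0.0


def containment_based_similarity(matched: int, size_a: int, size_b: int) -> float:
    """Matched elements divided by the size of the smaller metamodel."""
    smaller = min(size_a, size_b)
    return matched / smaller if smaller else 0.0


def clusters_above_threshold(
    names: Iterable[str],
    similarity: Dict[Tuple[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Group metamodels linked by pairs whose similarity exceeds the threshold.

    Assumption: a cluster is a connected component of the filtered graph.
    Metamodels without any remaining edge end up in singleton clusters.
    """
    names = list(names)
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), value in similarity.items():
        if value > threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    # Largest clusters first, then alphabetical, for a stable overview.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


if __name__ == "__main__":
    # Hypothetical element counts and match counts, chosen for illustration only.
    sizes = {"UML": 200, "ClassDiagram": 40, "MySQL": 30, "Gantt": 25}
    matched = {("UML", "ClassDiagram"): 40, ("MySQL", "Gantt"): 2}
    sim = {}
    for (a, b), m in matched.items():
        sim[(a, b)] = containment_based_similarity(m, sizes[a], sizes[b])
        print(a, b, match_based_similarity(m, sizes[a], sizes[b]), sim[(a, b)])
    print(clusters_above_threshold(sizes, sim, 0.5))
```

With the hypothetical counts above, a metamodel of 40 elements fully matched inside one of 200 elements gets a match-based value of 40/240 (about 0.17) but a containment-based value of 1.0, which is the situation the containment-based similarity is meant to handle.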