A Tool for Clustering Metamodel Repositories

Francesco Basciani, Davide Di Ruscio, Juri Di Rocco, Alfonso Pierantonio
DISIM - University of L’Aquila (Italy) - Email: {name.surname@univaq.it}
Ludovico Iovino
Gran Sasso Science Institute, L’Aquila, Italy - Email: {ludovico.iovino@gssi.infn.it}


   Abstract. Over the last years, several model repositories have been proposed in response to the need of the MDE community for advanced systems supporting the reuse of modeling artifacts. Modelers can interact with MDE repositories with different intents, ranging from merely browsing the repository to searching for specific artifacts satisfying precise requirements. The organization and browsing facilities provided by current repositories are limited, since they do not produce structured overviews of the contained artifacts, and the categorization mechanisms (if any) are based on manual activities. When dealing with large numbers of modeling artifacts, such limitations increase the effort related to both managing and reusing artifacts stored in model repositories. By focusing on metamodel management, in this paper we propose a clustering tool for automatically organizing stored metamodels and providing users with repository overviews such as, for instance, the application domains covered by the available metamodels. The approach has been implemented and integrated in the MDEForge repository.¹

I. MOTIVATION AND GOALS
   The increasing adoption of Model-Driven Engineering (MDE) [19] in business organizations has led to the need to gather artifacts in model repositories [11]. Several model repositories (see [12], [13], [15], [16], [11], just to mention a few) have been introduced in the past decade. Among them, metamodel zoos (for instance, the Ecore Zoo²) hold metamodels, which are typically categorized to improve search and/or browse operations. However, locating relevant information in a vast repository is intrinsically difficult, because it requires domain experts to manually annotate all metamodels in the repository with accurate metadata [4]: an activity that is time consuming and prone to errors and omissions. In fact, acquiring knowledge about a software artifact is a challenging task: it is estimated that up to 60% of software maintenance effort is spent on comprehension [5]. In order to mitigate the difficulties related to the manual categorization of metamodels, we propose a clustering tool for metamodel repositories: an unsupervised procedure that automatically organizes metamodels into clusters. Mutually similar artifacts are grouped together depending on a proximity measure, whose definition can be given according to specific search and browsing requirements. The tool is based on agglomerative hierarchical clustering [14] and explores well-known proximity measures as well as metamodel-specific ones, each providing different browsing characteristics.

   ¹ This research was supported by the EU through the Model-Based Social Learning for Public Administrations (Learn Pad) FP7 project (619583).
   ² ATLAS Ecore Zoo: http://www.emn.fr/z-info/atlanmod/index.php/Zoos

[Fig. 1. Example of classified metamodels. Labels: Project management (MSProject, MSProject2, Gantt); Building tools (Maven, Ant); Database (MySQL, RelationalDBSchema).]

II. BACKGROUND
   Even though several MDE approaches have been conceived over the last years to support a wide range of model management activities, model repositories are not yet as well developed and widespread as source-code repositories [7], [4]. Most of the potential benefits of the existing online repositories remain unexploited, especially when hundreds or even thousands of modeling artifacts have to be managed. In particular, by focusing on the provided functionalities for organizing, browsing, and searching metamodels, all the available repositories are affected by the following issues:
I1. they do not provide the means to automatically produce structured overviews of the contained metamodels, which are typically shown as mere lists of stored elements and are consequently difficult to browse. Organizations like the one shown in Fig. 1 would give an overview of the metamodels stored in the considered repository, e.g., with respect to the covered application domains;
I2. none of the available repositories provides mechanisms to automatically categorize the stored artifacts, thereby making the interaction with the repositories complex. Even users who want to contribute additional artifacts have to manually annotate and classify them during the creation phase.
   In the next sections we propose a tool able to address these challenges by focusing on the management of metamodels stored in publicly available repositories.

III. PROPOSED METAMODEL CLUSTERING APPROACH
   In order to deal with the issues discussed in Section II, in this section we propose an unsupervised metamodel clustering mechanism that automatically organizes unstructured metamodel repositories and provides users with overviews of the available metamodels.
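As a rough illustration of what such an unsupervised mechanism involves (agglomerative hierarchical clustering driven by a proximity measure, as introduced in Section I), the following Python sketch groups a handful of toy metamodels. It is not MDEForge code. Each metamodel is reduced to a hypothetical set of terms, the proximity measure is a simple term-overlap score (Dice's coefficient over term sets) chosen only for illustration, cluster proximity uses group averaging, and merging stops once no pair of clusters reaches a similarity threshold. The metamodel names echo Fig. 1; the term sets, the threshold values, and all function names are assumptions.

```python
from itertools import combinations


def dice(a: set, b: set) -> float:
    """Dice's coefficient between two term sets (1.0 = identical, 0.0 = disjoint)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def cluster(items: dict, similarity, threshold: float) -> list:
    """Agglomerative clustering with group-average linkage.

    Starts from singleton clusters and repeatedly merges the two most
    similar clusters until no pair reaches `threshold`.
    """
    names = sorted(items)
    # Proximity matrix: pairwise similarity between all items.
    prox = {frozenset((x, y)): similarity(items[x], items[y])
            for x, y in combinations(names, 2)}
    clusters = [{n} for n in names]

    def linkage(c1, c2):
        # Group average: mean similarity over all cross-cluster pairs.
        pairs = [prox[frozenset((x, y))] for x in c1 for y in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda t: t[1])
        if best < threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


# Hypothetical term sets, invented for this example only.
repository = {
    "MSProject": {"project", "task", "resource", "calendar"},
    "MSProject2": {"project", "task", "resource", "assignment"},
    "Gantt": {"project", "task", "milestone"},
    "Maven": {"build", "plugin", "goal", "artifact"},
    "Ant": {"build", "target", "property"},
    "MySQL": {"database", "table", "column", "index"},
    "RelationalDBSchema": {"schema", "table", "column", "key"},
}

if __name__ == "__main__":
    for group in cluster(repository, dice, threshold=0.25):
        print(sorted(group))
```

With these made-up term sets and a threshold of 0.25, the sketch yields three groups that correspond to the project management, building tools, and database domains of Fig. 1. Raising the threshold to 0.6 leaves only MSProject and MSProject2 together and turns every other metamodel into a singleton cluster, which shows how sensitive the resulting overview is to the chosen threshold.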
A. Overview
   Two different user roles are involved in the proposed clustering approach, namely the Repository Maintainer and the Repository User, discussed in the following.

Repository Maintainer: the whole metamodel clustering approach is applied by the maintainer of the repository, who has access to the functionalities described below.
Apply Metamodel Clustering: it represents the key functionality of the proposed clustering approach. It consists of calculating the proximity matrix representing the similarities of all the metamodels available in the repository, and then applying the clustering algorithm.
Manage Singleton Clusters: when a new metamodel is added to the repository, it may happen that, according to the proximity measure in use, it does not fit in any of the existing clusters and consequently induces the creation of a singleton cluster, i.e., a cluster consisting of only one element. The repository maintainer can periodically consider all the available singleton clusters and verify whether they have been created because, e.g., the proximity measure in use has to be refined.
Refine the Proximity Measure: the proximity measure plays a key role in the whole clustering approach, and consequently its definition is an iterative process aiming at increasing the accuracy of the automatically obtained metamodel clusters. The refinement process relies on the availability of reference data, which are typically obtained by manual activities. Such data must be approximated by the automated clustering procedure, as discussed in the next section.

Repository User: similarly to what happens in the case of open source software, the availability of public model repositories can give rise to multitudes of users and developers who are willing to share their modeling artifacts. In this respect, by focusing on the metamodel clustering aspects, the proposed approach provides the users with the functionalities discussed below.
Add New Metamodel: in contrast to existing metamodel archives, users who add new metamodels to the repository can omit the specification of corresponding metadata. Even in such cases, the approach is able to automatically classify the new metamodels. In fact, the appropriate clusters are identified by considering the content of the metamodels, without the need for additional user input. However, as previously mentioned, newly added metamodels might not fit in any of the existing clusters. In that case, the repository maintainer takes care of the situation by means of the Manage Singleton Clusters functionality previously discussed.
Visualize Metamodel Clusters: the approach produces overviews of the automatically produced metamodel clusters. Thus, in addition to the list of available metamodels, the system is able to generate graphical representations of the available metamodel clusters, and also gives the means to navigate them and retrieve detailed information about their content if requested by the user.

[Fig. 2. MDEForge Architecture. Labels: Users; Web access; REST API; Extensions (Proximity Calculator, Clustering Creator, Clustering Visualizer, Transformation chain, Metrics Calculator, …); Repository Core (Transformation, Model, Metamodel).]

B. Supporting tool
   The proposed clustering method has been implemented as extensions of the MDEForge platform [1]. In particular, as shown in Fig. 2, MDEForge consists of core services that are provided to enable the management of modeling artifacts, namely transformations, models, and metamodels. On top of such core services, extensions can be developed to add new functionalities. Both core services and extensions are available through Web access and programmatic interfaces (API) that enable their adoption as software as a service. For instance, in [2] we propose a service to automatically compose model transformations according to user requirements. We have also developed extensions to calculate several metrics on stored artifacts, and to support the understanding of metamodel and transformation characteristics [8], [6]. In the remainder of the section, we give details about the extensions that are shown in dashed boxes in Fig. 2 and that we have developed to support the proposed metamodel clustering approach. Concerning the other services of MDEForge, the reader can refer to [1], [7].

Proximity Calculator: it plays a key role in the proposed clustering approach, since it is responsible for calculating the mutual similarities between all the metamodels and thus for creating a corresponding proximity matrix. Identifying the appropriate similarity measure is a difficult task that might depend on the available data set, on the considered application domain, on the goal of the analysis being performed, etc. [14]. Consequently, from an architectural point of view, the proximity calculator has been designed in terms of an interface consisting of a method calculateSimilarity(Metamodel mm1, Metamodel mm2), for which different concrete implementations can be provided. So far we have developed several similarity measures, which are already available in the system, and we plan to experiment with and provide additional ones. In particular, several similarity measures have been proposed in the literature [3]. Among those typically applied to text documents, we have considered cosine similarity [3] and Dice’s coefficient [9], with the aim of relating the similarity of two metamodels to the terms used therein and consequently to the corresponding application domains. Moreover, we have developed two additional similarity functions specifically conceived for modeling artifacts. Both of them rely on the matching models calculated by means of EMFCompare (http://www.eclipse.org/emf/compare/): i) Match-based similarity: it is defined as the total number of matched elements identified by EMFCompare divided by the total number of elements contained in the analysed couple
of metamodels; ii) Containment-based similarity: the previous index does not perform well when one of the input metamodels is contained in the other one. As an example, consider the full specification of UML and the UML Class Diagrams. In such cases the match-based similarity value would be very low, since the total number of matched elements would be much smaller than the total number of elements contained in the two metamodels. In order to deal with such cases, the containment-based similarity is defined as the total number of matched elements divided by the smaller of the total numbers of elements in the two input metamodels.

Clustering Creator: by using the proximity calculator previously discussed, it creates clusters of metamodels by applying the agglomerative hierarchical clustering algorithm. As to the cluster proximity calculation, which is performed during each iteration of the algorithm, it is possible to specify the distance to be used, i.e., single link, complete link, or group average.

Cluster Visualizer: it creates graphical and tabular representations of the calculated metamodel clusters. The user can explore the available metamodels by specifying the similarity measure to be applied and the threshold value used to filter the identified metamodel pairs, so that only those with a similarity value greater than the given threshold are shown. The left-hand side of Fig. 3 shows the cluster visualizer at work. In particular, the connected graphs shown represent the identified clusters, and the thickness of each edge is proportional to the proximity value of the two metamodels it connects, which are represented as nodes in the graph. For each cluster, the system permits retrieving additional information, as shown in the upper right-hand side of Fig. 3. In particular, given a cluster, all the contained metamodels are listed together with additional information such as the most representative metamodel, i.e., the one most connected with the other ones in the cluster. Additionally, metamodels can be downloaded or even viewed by means of an integrated tree-based editor.

[Fig. 3. Sample visualizations of automatically created metamodel clusters]

IV. APPLICATION OF THE PROPOSED METAMODEL CLUSTERING APPROACH
   In this section we discuss the application of the clustering approach to a concrete data set consisting of 295 metamodels retrieved from the Ecore Zoo. We have applied the clustering technique by using the four similarity functions mentioned in the previous section and by specifying different thresholds. Due to space constraints, in this section we focus on the match-based similarity measure. For the same reason, the process that we have followed to validate the developed clustering technique is also omitted. It is worth noting that the data reported in Table I can be reproduced by interacting with the cluster visualizer component discussed in the previous section, which permits selecting the similarity measure to be used and the desired similarity threshold. Then the graphical representation of the retrieved clusters is updated accordingly.
   Figure 4 shows the number of clusters that are identified with respect to the chosen similarity threshold. A threshold that is too low leaves the repository population almost undistinguished, whilst a threshold that is too high returns too many clusters with too few elements.

V. RELATED WORK
   Clustering techniques have been used in several applications including software and data comprehension. In [18] the authors present a methodology for handling the problem of database migration. The approach uses semantic clustering to facilitate the translation of an extended entity relationship schema into a schema of complex objects. They start from an Extended
Entity Relationship (EER) schema to create a set of clustered schemata such that each clustered schema corresponds to a level of abstraction and grouping of the initial schema. By iteratively shrinking portions of the EER diagram into complex entities, the approach creates a schema of complex entities, hiding the details about the components. The user can select a level of clustering to show components at some degree of detail, exactly as we do in our approach. In [10] the authors use clustering techniques and Model-Driven Reverse Engineering principles for software comprehension. In particular, the authors start by extracting data from source code for the construction of the input data matrix. In the code extraction, they consider the paragraph as the smallest atomic unit, and their cluster analysis is based on the hypothesis that record fields existing in the same paragraphs can be grouped. For the data matrix, the distance chosen for cluster identification is the Euclidean distance. The paper in [20] presents a tool for the decomposition of a metamodel into clusters of model elements. The authors claim that large-scale diagrams, representing models and metamodels, are often difficult to understand because of the lack of appropriate modularization structures that allow examining a model in sub-parts. This work provides a meaningful way to split a monolithic model into sub-models for comprehension and maintenance. The work in [17] presents a technique, based on metamodeling, Petri nets, and facets, for the analysis and clustering of requirements diagrams. Intuitively, the approach is able to obtain the domain description in terms of the relations and dependencies of modeled services. Then the analysis and the clustering of requirements are automatically calculated accordingly.

TABLE I
MATCH-BASED SIMILARITY

 Threshold   Clusters   Avg. cluster size   Max cluster size   Singleton clusters
 0.1         45         6.555               228                37
 0.15        96         3.072               152                76
 0.2         157        1.878               72                 129
 0.25        192        1.536               19                 160
 0.3         214        1.378               14                 182
 0.35        227        1.299               14                 201
 0.4         234        1.260               14                 210
 0.45        238        1.239               14                 213
 0.5         245        1.204               14                 224
 0.55        250        1.180               13                 232
 0.6         256        1.152               12                 241
 0.65        257        1.148               12                 242
 0.7         259        1.139               12                 243
 0.75        263        1.122               8                  246
 0.8         268        1.101               6                  252
 0.85        272        1.085               4                  258
 0.9         280        1.054               4                  268
 0.95        288        1.024               3                  282

[Fig. 4. Match-based Similarity thresholds]

VI. ADDITIONAL INFORMATION
 – MDEForge website and source code: http://www.mdeforge.org
 – Related publications: [1], [2], [7], [8], [6]

REFERENCES
 [1] F. Basciani, J. Di Rocco, D. Di Ruscio, A. Di Salle, L. Iovino, and A. Pierantonio. MDEForge: an Extensible Web-Based Modeling Platform. In Procs. of CloudMDE@MoDELS 2014, Valencia, Spain, September 30, 2014, pages 66–75, 2014.
 [2] F. Basciani, D. Di Ruscio, L. Iovino, and A. Pierantonio. Automated Chaining of Model Transformations with Incompatible Metamodels. In Procs. of MODELS 2014, pages 602–618, 2014.
 [3] P. Berkhin. A survey of clustering data mining techniques. In J. Kogan, C. Nicholas, and M. Teboulle, editors, Grouping Multidimensional Data, pages 25–71. Springer Berlin Heidelberg, 2006.
 [4] B. Bislimovska, A. Bozzon, M. Brambilla, and P. Fraternali. Textual and Content-Based Search in Repositories of Web Application Models. ACM Transactions on the Web, 8(2):1–47, Mar. 2014.
 [5] P. Bourque, R. Dupuis, A. Abran, J. W. Moore, and L. L. Tripp. The Guide to the Software Engineering Body of Knowledge. IEEE Software, 16(6):35–44, 1999.
 [6] J. Di Rocco, D. Di Ruscio, L. Iovino, and A. Pierantonio. Mining metrics for understanding metamodel characteristics. In Procs. of MiSE 2014 at ICSE 2014, pages 55–60, 2014.
 [7] J. Di Rocco, D. Di Ruscio, L. Iovino, and A. Pierantonio. Collaborative repositories in Model-Driven Engineering. IEEE Software, pages 28–34, May 2015.
 [8] J. Di Rocco, D. Di Ruscio, L. Iovino, and A. Pierantonio. Mining Correlations of ATL Model Transformation and Metamodel Metrics. In Procs. of MiSE 2015 at ICSE 2015, 2015.
 [9] L. R. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[10] O. El Beggar, B. Bousetta, and G. Taoufiq. Comparative study between clustering and model driven reverse engineering approaches. Lecture Notes on Software Engineering, 1(2), 2013.
[11] R. B. France, J. M. Bieman, S. P. Mandalaparty, B. H. C. Cheng, and A. Jensen. Repository for Model Driven Development (ReMoDD). In Procs. of ICSE 2012, pages 1471–1472. IEEE, 2012.
[12] C. Hein, T. Ritter, and M. Wagner. Model-driven tool integration with ModelBus. Workshop Future Trends of Model-Driven, 2009.
[13] T. Holmes, U. Zdun, and S. Dustdar. Automating the Management and Versioning of Service Models at Runtime to Support Service Monitoring. In EDOC, pages 211–218, Sept. 2012.
[14] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys (CSUR), 31(3):264–323, 1999.
[15] B. Karasneh and M. R. V. Chaudron. Online img2uml repository: An online repository for UML models. In Procs. of EESSMod 2013 at MoDELS 2013, pages 61–66, 2013.
[16] R. Kutsche, N. Milanovic, G. Bauhoff, T. Baum, M. Cartsburg, D. Kumpe, and J. Widiker. BIZYCLE: Model-based Interoperability Platform for Software and Data Integration. In Procs. of the MDTPI at ECMDA, 2008.
[17] O. Lopez, M. A. Laguna, and F. J. Garcia. Reuse based analysis and clustering of requirements diagrams. In Procs. of REFSQ02, pages 71–82, 2002.
[18] R. Missaoui, R. Godin, and H. Sahraoui. Migrating to an object-oriented database using semantic clustering and transformation rules. Data and Knowledge Engineering, 27(1):97–113, 1998.
[19] D. C. Schmidt. Guest Editor’s Introduction: Model-Driven Engineering. Computer, 39(2):25–31, Feb. 2006.
[20] D. Strüber, M. Selter, and G. Taentzer. Tool support for clustering large meta-models. In Procs. of BigMDE ’13, pages 7:1–7:4. ACM, 2013.