Tool Support for Model Splitting using Information
               Retrieval and Model Crawling Techniques

                        Daniel G. Strüber, Michael Lukaszczyk, Gabriele Taentzer
                                              Philipps-University Marburg
                                   Department for Mathematics and Computer Science
                                     Hans-Meerwein-Str., 35032 Marburg, Germany
                       {strueber,lukaszcz22,taentzer}@informatik.uni-marburg.de

ABSTRACT
To facilitate collaboration in large-scale modeling scenarios, it is sometimes advisable to
split a model into a set of sub-models that can be maintained and analyzed independently.
Existing automated approaches to model splitting, however, suffer from insufficient
consideration of the stakeholder's intentions and add a significant overhead for comprehending
the created decompositions. We present a new tool that aims to create more informed model
decompositions by leveraging existing domain knowledge in the form of textual descriptions.
From the user perspective, the tool comprises a textual editor for assembling the descriptions
and a visual editor for reviewing and post-processing the generated splitting suggestions. We
preliminarily evaluate the tool in a case study involving a real-life model.

Categories and Subject Descriptors
D.2.0 [Software Engineering]: Tools; D.2.8 [Software Engineering]: Distribution, Maintenance,
and Enhancement

1.    INTRODUCTION
As model-driven engineering is applied in ever-larger scenarios ranging over significant spans
in time and space, the maintenance obstacles induced by large models increase in urgency. Large
models without a proper decomposition are hard to comprehend, to change, to reuse, and to
collaborate on. Even in projects where an initial decomposition is tailored with great care,
changing requirements may make it necessary to refactor for a finer-grained or even orthogonal
one. As the manual refactoring of large models is non-trivial and expensive, this problem calls
for automation.

Earlier automated approaches to model splitting, such as those presented in [7, 12], suggest
techniques based on analysis of strongly connected components or clusters, not accounting for
the semantics of the split and the intention for performing it. To address this shortcoming, a
recent approach proposed in [11] aims to create model decompositions from existing domain
knowledge in the form of textual descriptions: The user provides a set of descriptive texts,
each describing one sub-model in the target decomposition. From this input, a splitting
suggestion is created using a combined information retrieval and topology analysis approach.
The descriptions can be assembled from available requirement or documentation artifacts.
However, the input set is not required to be complete: In fact, the approach can support the
user in incrementally discovering sub-model descriptions.

The contribution of this paper is a tool and supporting semi-automated user process making the
outlined splitting technique available to modelers. We have tested it on large meta-models on
the order of 100 to 250 classifiers. As design goals, we target usability and extensibility for
the splitting of instances of arbitrary meta-models. The remainder of this paper is organized as
follows: In Section 2, we briefly illustrate the underlying technique. The user process is shown
in Section 3. In Sections 4 and 5, we elaborate on the design goals and implementation. In
Section 6, we present a case study preliminarily evaluating the proposed tool and user process.
We discuss related work and conclude in Sections 7 and 8.

2.   BACKGROUND
In this section, we give a brief overview of model splitting as performed by our tool. A
detailed account is found in [11].

The technique, outlined in Fig. 1, takes three input parameters: the model to be split (in the
proposed tool, an EMF meta-model), a set of textual descriptions of each target sub-model, and
a completeness condition. The completeness condition specifies whether the set of sub-model
descriptions is complete or partial. The technique creates a set of mappings from model
elements to sub-models, called a splitting suggestion. In the case of a complete input set,
each element


BigMDE’14 July 24, 2014. York, UK
Copyright © 2014 for the individual papers by the papers’ authors. Copy-
ing permitted for private and academic purposes. This volume is published
and copyrighted by its editors.
                                                                              Figure 1: Underlying model splitting technique.
Figure 2: Overview. The user process leads from the input model to the output sub-models in
five steps: (1) Start the splitting process, (2) Define the splitting description, (3) Derive a
splitting suggestion, (4) Review and post-process the splitting suggestion, (5) Perform splitting.

Figure 3: Defining the splitting description.


is assigned to one sub-model. In the partial case, some el-
                                                                      (1) Start the splitting process. Using a context menu
ements may remain unassigned. The user can inspect the
                                                                      entry on the meta-model to be split, the user triggers the
unassigned elements to discover additional sub-models and
                                                                      creation of a splitting description file. The splitting descrip-
describe them, incrementally creating a complete split.
                                                                      tion is automatically opened in a textual editor, shown in
                                                                      Fig. 3. By default, the file contains a small usage example.
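The difference between a complete and a partial split can be illustrated with a minimal sketch. It assumes that a relevance score per model element and sub-model is already available and that the completeness condition acts as a minimum score below which an element stays unassigned. The function name `assign`, the data layout, and the example names are hypothetical and not taken from the tool.

```python
import random


def assign(scores, threshold=None, rng=random):
    """Map each element to its best-scoring sub-model.

    scores:    {element: {sub_model: relevance score}}
    threshold: None requests a complete split; otherwise an element whose
               best score lies below the threshold stays unassigned (None).
               This reading of the threshold is an assumption.
    """
    suggestion = {}
    for element, by_sub_model in scores.items():
        best = max(by_sub_model.values())
        if threshold is not None and best < threshold:
            suggestion[element] = None
            continue
        winners = [name for name, value in by_sub_model.items() if value == best]
        suggestion[element] = rng.choice(winners)  # ties are broken randomly
    return suggestion


# Hypothetical scores for three elements and two sub-models.
scores = {
    "MenuItem": {"Menu": 0.9, "User": 0.1},
    "Login": {"Menu": 0.2, "User": 0.7},
    "Template": {"Menu": 0.15, "User": 0.1},
}
complete = assign(scores)
partial = assign(scores, threshold=0.3)
unassigned = [element for element, target in partial.items() if target is None]
```

In the partial case, `unassigned` holds the elements a user would inspect when looking for additional sub-models.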
Information retrieval. To obtain an initial mapping be-
tween the model and the textual sub-model descriptions, we
                                                                      (2) Define the splitting description. Using the editor,
apply an established statistical technique from information
                                                                      the user assembles the descriptions of the target sub-models.
retrieval research: Latent Semantic Analysis (LSA) [8]. For a
                                                                      For a comfortable user experience, the editor provides syn-
query (e.g., a sub-model description) over a fixed set of doc-
                                                                      tax highlighting, static validation, and folding capabilities.
uments (e.g., a set of model element names), LSA scores the
                                                                      The textual editor is also used for configuration: Adding the
relevance of each document to the input query. To compute
                                                                      keyword partially and defining a numerical threshold, the
the scores, queries and documents are represented as vectors
                                                                      user can set the completeness condition in order to obtain
and the similarity between the query vector and each docu-
                                                                      a partial split. Furthermore, the user can fine-tune inter-
ment vector is computed – intuitively speaking, the degree to
                                                                      nal parameters used during the execution of the underlying
which they point in the same direction. Mathematically, this
                                                                      technique. In Fig. 3, the weights assigned to different rela-
is calculated in terms of the cosine, yielding a score between
                                                                      tionship types and the alpha exponent that shapes the scor-
0 and 1. The vector representation is based on a metric
                                                                      ing function are modified. However, parameter tuning is an
called term frequency-inverse document frequency (tf-idf).
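The scoring step of the information retrieval paragraph can be sketched as follows. This is a simplified illustration, not the tool's implementation: it ranks documents by the cosine between tf-idf vectors and leaves out the dimensionality reduction that full LSA applies. The function names, the smoothed idf formula, and the example texts are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return (list of tf-idf dicts, idf dict) for the given documents."""
    tokenized = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    doc_freq = Counter(term for counts in tokenized for term in counts)
    # Smoothed idf (an assumed variant) keeps every weight strictly positive.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in counts.items()} for counts in tokenized]
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, documents):
    """Score every document against the query, highest score first."""
    vectors, idf = tfidf_vectors(documents)
    q_counts = Counter(tokenize(query))
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items() if t in idf}
    scored = [(cosine(q_vec, vec), doc) for vec, doc in zip(vectors, documents)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical model element names and a sub-model description as the query.
element_names = ["menu item", "user account", "page layout"]
ranking = rank("menus and their menu item entries", element_names)
```

Because all weights are non-negative, every score falls between 0 and 1, matching the range stated above. The top-ranked entries of `ranking` would play the role of seeds for the subsequent crawling step.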
                                                                      optional feature: In [11], we identified a default combination
                                                                      of parameter values that, when applied to six independent
Model crawling. To create the splitting suggestion, we use
                                                                      class models, achieved an average accuracy of 80% in com-
the model elements ranked highest by LSA as seeds. Starting
                                                                      parison to hand-tailored decompositions.
from these seeds, we crawl the model exhaustively to score
each model element’s relevance for each target sub-model.
                                                                      (3) Derive a splitting suggestion. Using a context menu
Afterwards, each model element is assigned to the sub-model
                                                                      entry on the splitting description file, the user triggers the
it was deemed most relevant for, ties being broken randomly.
                                                                      automated creation of a splitting suggestion. A splitting
Model crawling extends an approach proposed in [9]. The
                                                                      suggestion comprises a set of assignment entries, each hold-
underlying intuition is that of a breadth-first search: We first
                                                                      ing a link to a model element, a link to a target sub-model,
visit and score the seeds’ neighbours, then the neighbours’
                                                                      and the relevance score. To compute the splitting sugges-
neighbours, et cetera. Scores of newly accessed elements are
                                                                      tion, the technique outlined in Section 2 is applied. The
calculated based on the scores of previously scored elements.
                                                                      splitting suggestion is persisted to the file system.
The scoring formula accounts for topological properties, such
as the connectivity of newly accessed elements, and seman-
                                                                      (4) Review and post-process the suggestion. To ob-
tic implications of the respective relationship types (e.g., in
                                                                      tain visual access to the splitting suggestion, the user can
meta-models, containment suggests strong connectivity).
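The breadth-first intuition behind model crawling can be sketched for a single target sub-model as follows. This is a deliberately simplified illustration: the actual scoring formula in [11] also accounts for connectivity and an alpha exponent, both omitted here. The weights, the function name `crawl`, and the example graph are hypothetical.

```python
# Hypothetical weights per relationship type; the tool's defaults are not reproduced here.
WEIGHTS = {"containment": 0.9, "reference": 0.5}


def crawl(graph, seeds):
    """Propagate seed scores through the graph, one breadth-first round at a time.

    graph: {element: [(neighbour, relationship_type), ...]}
    seeds: {element: initial relevance score}
    """
    scores = dict(seeds)
    frontier = set(seeds)
    while frontier:
        candidates = {}
        for current in frontier:
            for neighbour, rel_type in graph.get(current, []):
                if neighbour in scores:
                    continue  # already scored in an earlier round
                value = scores[current] * WEIGHTS[rel_type]
                # Assumed rule: keep the strongest score offered in this round.
                candidates[neighbour] = max(candidates.get(neighbour, 0.0), value)
        scores.update(candidates)
        frontier = set(candidates)
    return scores


# Small hypothetical meta-model fragment, with edges listed in both directions.
graph = {
    "Menu": [("MenuItem", "containment"), ("Page", "reference")],
    "MenuItem": [("Menu", "containment"), ("Link", "reference")],
    "Page": [("Menu", "reference")],
    "Link": [("MenuItem", "reference")],
}
scores = crawl(graph, {"Menu": 1.0})
```

Running one such crawl per target sub-model yields a score per element and sub-model, from which each element can then be assigned to its highest-scoring sub-model.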
                                                                      now open the model in a model editor. The user activates
                                                                      a dedicated layer called model splitting. This action trig-
3.    USER PROCESS                                                    gers the color-coding of model elements corresponding to the
The user process, shown in Fig. 2, comprises two manual               splitting suggestion, shown in Fig. 4. As further visual aid,
tasks (2 and 4) and three automated tasks (1, 3 and 5). The           the assignment of a model element is also displayed textually
manual tasks rely on human intelligence and domain knowl-             above its name. For post-processing, the user may want to
edge. They are facilitated by textual and visual tool support.        change some assignments for model elements that were not
The automated tasks are triggered by context menu entries.            assigned to the proper target sub-model. This is done using
the palette tool entry Assign. When the user reassigns a model element, the respective entry in
the splitting suggestion is automatically updated. It is worth mentioning that if the user is
not satisfied with the results, he or she may iterate Steps 2 to 4 as often as required,
tweaking the descriptions and parameter settings. One important scenario for this is the
discovery of new sub-models: The user can set the completeness condition to partial in Step 2,
which leads to some model elements not being assigned in Step 3. The user inspects these
elements in Step 4 to create new sub-model descriptions.

Figure 4: Reviewing and post-processing the splitting suggestion.

(5) Perform splitting. Given that the user is satisfied with the post-processed splitting
suggestion, the actual splitting can be triggered by the user. The user may choose from two
context menu entries: One for splitting the input model into multiple physical resources, the
other for splitting it into sub-packages within the same resource.

4.   DESIGN GOALS
In this section, we briefly discuss design goals that were fundamental in the design of the
proposed tool.

Extensibility. The underlying technique possesses an innate extensibility that should be
carried over to the end-user. It is applicable to models conforming to arbitrary meta-models,
given that they fulfill two properties: (i) Model elements must have meaningful textual
descriptions that a splitting description can be matched against. (ii) Except for trivial
reconciliation, constraints imposed by the meta-model must not be broken in arbitrary
sub-models. We address this design goal by using a framework approach: To customize the tool
for a new meta-model, the user subclasses a set of base classes. For instance, to define how
input models are converted to a generic graph representation used during crawling, they
subclass a class named GraphBuilder.

Usability. The design of the tool is informed by Cognitive Dimensions, a framework for the
human-centered design of languages and tools [6]: Providing an editable visual layer on top of
a standard editor is a major step towards visibility (visual accessibility of elements and
their relations) and away from viscosity (resistance to change). Closeness of mapping is
implemented by a domain-specific language for splitting descriptions with custom editor
support. Premature commitment is inhibited and progressive evaluation is promoted by providing
an incremental process that allows tweaking input values while receiving rapid feedback. For
traceability, our file-based approach to user input allows the splitting description to be
kept and reused later, e.g., for documentation purposes.

5.    IMPLEMENTATION
The Eclipse Modeling Framework (EMF) [10] is the de-facto reference implementation of the EMOF
modeling standard. Consequently, it was natural for us to design the new tool as an extension
for EMF. As such, it can be plugged into an existing Eclipse installation without further
effort. For the splitting description editor, we leveraged the code generation facilities of
Xtext [5]. We defined a simple domain-specific language for splitting descriptions. The editor
with its syntax highlighting and code completion features was fully generated by Xtext. For
customization, we added a couple of checks (e.g., forbidden characters, uniqueness of sub-model
names). The visual splitting layer is an extension of EcoreTools 2.0 [2], which is based on the
Sirius [4] framework and, as of June 2014, slated to be part of the new Eclipse release Luna
4.4. We used this new technology as we benefit from its support for multiple viewpoints,
allowing us to tailor a splitting viewpoint to our needs.

6.    CASE STUDY
In a case study, we investigated two research questions: (RQ1) How efficient is the proposed
tool in comparison to manual splitting? (RQ2) Is the proposed tool usable?

6.1    Subjects and Task
Model: Extended Joomla-Specific Language (eJSL) is a meta-model for web applications based on
the Joomla content management system [3]. It comprises 116 classes, 39 enumerations, 176
enumerated attributes, 41 generalizations, 145 containment references, and 47 plain references.
eJSL was designed by a doctoral student affiliated with our research group, whom we shall refer
to as X. X has significant experience in modeling language design. Prior to our work, X
manually split eJSL into five sub-models, calling them Pages, Content, Menu, User, and
Configuration. According to his account, he invested a significant effort that spanned, among
other duties, the course of two weeks. He printed the diagram on paper, then cut and
reassembled fragments. Afterwards, he assigned colors to model elements in the diagram editor
and laid them out by hand.

Task: We instructed another software engineer, referred to as Y, to decompose eJSL using the
tool. Y is a doctoral student with significant experience in modeling language design, but with
no prior involvement in eJSL or model splitting. We asked X to provide the required domain
knowledge in the form of descriptive texts briefly explaining his intuitions for the
hand-tailored decomposition. The descriptions, each consisting of 85 words on average, were
handed to Y in a text document. The task given to Y was to create a decomposition that
faithfully reflects the separation of concerns proposed by the textual descriptions. We briefly
instructed Y in the usage of the tool based on the example shown in Figs. 3 and 4 and
encouraged him to make use of post-processing.
6.2      Results
Efficiency. To approach (RQ1), we define efficient as requiring a minimal amount of time to
create an accurate result. Positing the hand-tailored split as perfectly accurate, we measured
accuracy of the tool-supported split in terms of average F-measure, considering both precision
and recall. Accuracy was determined before and after post-processing: During review of the
initial splitting suggestion S1, Y reassigned five model elements to create the final
suggestion S2. From S1 to S2, precision increased from 82% to 86% and recall from 84% to 88%,
yielding a rise in F-measure from 83% to 87%. It took Y five minutes to create S1. The
reviewing and post-processing that brought the gain of 4 percentage points took a further 55
minutes. Consequently, in terms of the extrapolated overall amount of time, tool-supported
splitting outperformed manual splitting.

Usability. To approach (RQ2), we conducted an informal interview. Y perceived the user process
as comprehensible, the description editor as easy to use, and the color-coding as useful. An
activity found crucial during post-processing was examining the direct neighbours of a model
element. Y perceived this task as cumbersome: He often had to navigate to edge targets outside
the visible scope. For future work, we aim at dedicated support for this activity: On
selection, neighbourhood information should be instantly available in a tool-tip displaying the
names of adjacent elements. One further suggestion by Y, the color-coding of edges, directly
made it into the current version. Y also invested considerable time in layout, i.e., aligning
the color-coded model elements into groups, an activity outside of the scope of this work. It
is an interesting challenge to devise a layout algorithm that aligns the sub-models of a model
as clusters. Inspection of the false positives and negatives in S2 revealed that 50% of them
concerned enumerations and the other 50% concerned classes. Y pointed out that enumerations
were hard to relate to classes visually as they are not connected by edges. We consider
representing enumerated attributes as edges rather than class members in future work.

6.3      Validity
Threats to external validity (generalizability) are the size of the input model and the size of
the test group. It remains a question left to future work whether our tool scales to
meta-models with significantly more elements. However, an analysis of publicly available
meta-models¹ indicates the input model size to be typical for large meta-models demanding an
adequate decomposition. The test group size indeed precludes claims for generality, but allows
us to provide tentative evidence for critical design weaknesses and benefits. A potential threat
to internal validity (freedom from systematic error) is the flow of information from the
control group to the test group. To mitigate this threat, we ascertained in consultation with X
that the textual descriptions, in vagueness and level of detail, represented the intuitions for
splitting before the manual split was executed.

¹ http://www.emn.fr/z-info/atlanmod/index.php/Ecore

7.     RELATED WORK
In this section, we discuss related tooling. A survey of work related to the underlying
approach is provided in [11].

In the Democles model composer [1], the user can iterate over the lattice of all permitted
decompositions by unfolding entries in a tree-like wizard. Graphical presentation of a split is
provided by an add-on graph visualization library. However, this visualization is read-only and
not integrated with a modeling editor, ruling out the re-assigning of model elements for
post-processing as supported by the new tool.

The splitting tool proposed in [12] makes classic clustering algorithms available for EMF
models. It provides a wizard for the selection and customization of algorithms. However, except
for numerical input parameters, the user cannot influence the generated results. The tool
provides a tree-based editor for the reassigning of model elements to target sub-models, but
does not present any visual feedback.

8.   CONCLUSION
In this paper, we present a tool for the splitting of large meta-models. The tool provides a
textual editor that allows defining the desired target sub-models by means of textual
descriptions. It generates a splitting suggestion that can be reviewed and post-processed in a
visual editor. Based on the splitting suggestion, the input model can be automatically split
either into multiple resources or packages within one resource. The tool is open source and can
be found, along with the models mentioned in this paper, at
https://www.uni-marburg.de/fb12/swt/forschung/software. In the future, we plan to apply the
technique to models other than class models, which will make it necessary to account for
constraints.

9.   REFERENCES
 [1] Democles. http://democles.lassy.uni.lu/, May 2011.
 [2] Ecoretools 2.0. http://www.eclipse.org/ecoretools/, May 2014.
 [3] Joomla. http://www.joomla.org/, May 2014.
 [4] Sirius. http://www.eclipse.org/sirius/, May 2014.
 [5] Xtext. http://www.eclipse.org/xtext/, May 2014.
 [6] T. R. G. Green and M. Petre. Usability analysis of visual programming environments: a
     cognitive dimensions framework. Journal of Visual Languages & Computing, 7(2):131–174, 1996.
 [7] P. Kelsen, Q. Ma, and C. Glodt. Models within models: Taming model complexity using the
     sub-model lattice. Fundamental Approaches to Software Engineering, pages 171–185, 2011.
 [8] T. K. Landauer, P. W. Foltz, and D. Laham. An Introduction to Latent Semantic Analysis.
     Discourse Processes, 25:259–284, 1998.
 [9] M. P. Robillard. Automatic Generation of Suggestions for Program Investigation. In Proc. of
     ESEC/FSE-13, pages 11–20, 2005.
[10] D. Steinberg, F. Budinsky, E. Merks, and M. Paternostro. EMF: Eclipse Modeling Framework.
     Pearson Education, 2008.
[11] D. Strüber, J. Rubin, G. Taentzer, and M. Chechik. Splitting models using information
     retrieval and model crawling techniques. Fundamental Approaches to Software Engineering,
     pages 47–62, 2014.
[12] D. Strüber, M. Selter, and G. Taentzer. Tool support for clustering large meta-models. In
     Proceedings of the Workshop on Scalability in Model Driven Engineering, page 7. ACM, 2013.