=Paper= {{Paper |id=Vol-2245/ammore_paper_2 |storemode=property |title=Model Analytics for Feature Models: Case Studies for S.P.L.O.T. Repository |pdfUrl=https://ceur-ws.org/Vol-2245/ammore_paper_2.pdf |volume=Vol-2245 |authors=Önder Babur,Loek Cleophas,Mark van den Brand |dblpUrl=https://dblp.org/rec/conf/models/BaburCB18 }} ==Model Analytics for Feature Models: Case Studies for S.P.L.O.T. Repository== https://ceur-ws.org/Vol-2245/ammore_paper_2.pdf

Model analytics for feature models:
case studies for S.P.L.O.T. repository
Önder Babur, Loek Cleophas and Mark van den Brand
Eindhoven University of Technology
Eindhoven, The Netherlands
O.Babur@tue.nl, L.G.W.A.Cleophas@tue.nl, M.G.J.v.d.Brand@tue.nl

ABSTRACT
Model-Driven Engineering and Software Product Lines promote the use of models as central artifacts for a variety of activities including domain analysis and generative software development. As these paradigms gain popularity, the number and variety of models in use increase. Several initiatives to gather models in repositories exist, such as ATL Zoo for metamodels or S.P.L.O.T. for feature models, aiming for public access and reuse. However, as those repositories are only partly or not at all curated, the growing number of models leads to problems such as duplicates (a.k.a. clones) and a lack of repository overview. This makes both repository management and model searching/reuse very hard. We address this issue for S.P.L.O.T. by adapting SAMOS, our generic model analytics framework, for feature model comparison. We perform two exploratory case studies. First, we aim to get a high-level repository overview with large clusters and their domains. Secondly, we try to get clusters of highly similar models, to be interpreted as duplicates or clones. We conclude that our approach is applicable to feature models and can improve the use and maintenance of S.P.L.O.T.

KEYWORDS
Model-driven engineering, software product lines, feature models, model comparison, vector space model, clustering, model analytics, model management.

1 INTRODUCTION
Model-Driven Engineering (MDE) and Software Product Lines (SPLs) are paradigms that rely heavily on models for a variety of activities ranging from domain analysis to software development, deployment and testing. While one of the key objectives of such paradigms is the management and reuse of increasingly complex software artifacts, the same problem emerges as they gain popularity and wider adoption: there are more, larger and more complex models in use [8].

Recently, there has been some effort to collect various models in model repositories to facilitate public access and reuse. Notable examples are the ATL Ecore Metamodel Zoo¹ and the Software Product Lines Online Tools (S.P.L.O.T.) feature model repository² [20]. One problem of such repositories arises when they are only partly curated or not curated at all.

¹ http://web.emn.fr/x-info/atlanmod/index.php?title=Ecore
² http://www.splot-research.org/

This is particularly evident in S.P.L.O.T.: a quick inspection of the individual models reveals that (1) models usually lack proper metadata on their domains, versions, etc.; and (2) there are many duplicates/clones/versions of models with no explicit relationship noted. Moreover, the number of models in the repository increases rapidly, scaling up the aforementioned problems. These have serious implications in scenarios involving both repository management and use. First of all, there is a lack of repository overview, e.g. what groups of models there are, and to which domains these belong. This type of information would enable repository exploration, facilitating model search and reuse. Secondly, as new models are added, either the model manager or the users themselves are burdened with the manual labeling of the models, e.g. with respect to their domains. And lastly, there is a considerable number of duplicate models, clones arbitrarily copy-pasted, and also various versions of the same models lying around in the repository.

These issues have been raised in the domain of MDE [7, 9]. A promising solution is the automatic comparison of models [24] for gaining some information on the repository dataset, such as grouping/subgrouping of models, proximities among models (and among groups as well) and outliers. Doing this on a large scale for hundreds of models requires techniques beyond complex and expensive pairwise comparison such as in [19]; rather it requires approximate but fast and scalable techniques. These include e.g. fragmentation of models into smaller chunks, typically via Information Retrieval (IR)-based and statistical methods such as clustering [7, 9], especially for clone detection [5].

There has been a considerable amount of work in the SPL community on feature model analysis, comparison and use of IR-based techniques, however with several important distinctions. First of all, inspecting the thorough literature study of Benavides et al. [11] on automated analysis of feature models reveals that analysis is mostly performed on a single feature model and some configuration of it, for instance to find the dead features or valid products. Other approaches involve multiple feature models as input; there, model comparison is generally understood in terms of the configuration semantics (as used by She et al. [23], in contrast with ontological semantics): feature models are transformed into logical formulas, and their pairwise relationships are reasoned about, such as generalization/specialization [25] or exact differences [3, 12]. Another approach uses EMF Compare to calculate pairwise differences between feature models [13]. An interesting take on feature model comparison is presented by Xing [26], who argues that feature models might evolve over time with changes in both the structure and feature names/descriptions, and applies their generic model differencing technique to feature models using the structural (or ontological according to [23]) information in the models. On the other hand, many researchers have proposed IR and clustering, not for comparing feature models but for comparing requirements, product descriptions, or features themselves (e.g. their names, the text in their description) for reverse engineering feature models [4, 23]. Along
a line of work mostly on model synthesis and composition [1, 2],
Bécan et al. utilize IR and NLP techniques in their interactive model synthesis tool [10]. To the best of our knowledge, there has been no comparable work in the modelling and SPL domains to cluster large numbers of feature models with our objectives and scalability.

In this paper, we attempt to apply our generic model analytics framework to compare the feature models in the S.P.L.O.T. repository. Our goals are twofold: introducing our approach to the SPL community, which we believe can benefit from the proposed techniques, and testing the genericness and extensibility of our approach for new model types and datasets. First, we extend our framework with an extraction scheme for feature models using the S.P.L.O.T. Java API for parsing Simple XML Feature Model (SXFM) files. Using many utilities of the framework, notably Natural Language Processing (NLP) tools, we test our approach on the 1034-model dataset in S.P.L.O.T. We perform two case studies: firstly, trying to get relatively large clusters and their corresponding domains in the repository; and secondly, obtaining clusters of very similar models, to be interpreted as duplicates, clones or versions. We conclude that our approach is indeed applicable to feature models and can improve the use and maintenance of S.P.L.O.T.

2 ANALYZING FEATURE MODELS
In this section we start with some preliminaries and move on to detail our approach for analyzing and comparing feature models.

2.1 SXFM Feature Models
There are many notations for feature models, starting with the original one by Kang et al. [14], later extended with cardinalities, additional constraints, attributes and so on [21, 22]. As a starting point for this study we take the SXFM notation supported by the models in S.P.L.O.T. A feature model has a feature tree with different types of features in it (Root, Solitaire), optional/mandatory modifier,

machine learning (ML) technique is called clustering. Among many clustering methods [17], there is a distinction between flat clustering, where a flat cluster labeling is done, and hierarchical clustering, where a hierarchy of proximities is produced.

Finally, n-grams [18] are used in computational linguistics to build probabilistic models of natural language text, e.g. for estimating the next word given a sequence of words, or comparing text collections based on their n-gram profiles. In essence, n-grams represent a linear encoding of structural context.

2.3 SAMOS Framework for Feature Models
Our generic model analytics framework SAMOS (Statistical Analysis of MOdelS) [5–7] applies the above ideas to models. We have so far used SAMOS for Ecore metamodels, UML class diagrams, state charts and industrial domain-specific modelling languages, in the scenarios of domain clustering, data preprocessing and filtering, and notably clone detection [5]. The workflow, as depicted in Figure 1, starts with the extraction of IR-features (note the distinction from features as in feature models) and constraints from a set of input feature models, with a traversal of those models using the SXFM Java Parser Library provided by S.P.L.O.T. We use various schemes and NLP steps such as tokenization, filtering and synonym checking to populate a VSM. As a result, each feature model is represented in the VSM as a point in a high-dimensional space, and model similarity is reduced to a distance calculation. Clustering is applied over these distances. The framework allows configuring several matching schemes (e.g. whether to ignore types, check synonyms) and weighting schemes (e.g. idf or type weights).
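The workflow outlined for SAMOS (tokenize feature names, populate a VSM, compute distances, cluster) can be illustrated with a small Python sketch. It is not SAMOS code and makes several simplifying assumptions: models are given as plain lists of feature names rather than parsed SXFM files, IR-features are single tokens (no n-grams, types or synonym checking), weighting is a smoothed idf, and clustering is a naive average-linkage agglomeration cut at a distance threshold. All function names, the toy models and the threshold value are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(name):
    """Split a feature name on case changes and non-alphanumerics; lowercase it."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def build_vsm(models):
    """Map each model (name -> list of feature names) to an idf-weighted vector."""
    counts = {m: Counter(tok for f in feats for tok in tokenize(f))
              for m, feats in models.items()}
    vocab = sorted({tok for c in counts.values() for tok in c})
    n = len(models)
    # Smoothed idf (an assumption): tokens shared by all models keep a small weight.
    idf = {t: math.log((1 + n) / (1 + sum(1 for c in counts.values() if t in c))) + 1
           for t in vocab}
    return {m: [c[t] * idf[t] for t in vocab] for m, c in counts.items()}


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def cluster(vectors, threshold):
    """Average-linkage agglomerative clustering, stopped at a distance threshold."""
    clusters = [[m] for m in sorted(vectors)]

    def linkage(a, b):
        total = sum(cosine_distance(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:
            break  # remaining clusters are too far apart to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy input (hypothetical): three tiny "feature models" as lists of feature names.
models = {
    "car_a": ["Car", "Engine", "GearBox", "ElectricEngine"],
    "car_b": ["Car", "Engine", "Gearbox", "Radio"],
    "phone": ["MobilePhone", "Screen", "Camera", "GPS"],
}

if __name__ == "__main__":
    print(cluster(build_vsm(models), threshold=0.8))
    # -> [['car_a', 'car_b'], ['phone']]
```

Stopping the merge loop at a threshold turns the hierarchy into a flat labeling. A loose threshold gives larger, domain-like groups, while a tight one keeps only near-identical models together, which mirrors the two case-study goals (repository overview versus clone detection).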
SXFM
Parser
Tokeniza