=Paper= {{Paper |id=Vol-2245/ammore_paper_2 |storemode=property |title=Model Analytics for Feature Models: Case Studies for S.P.L.O.T. Repository |pdfUrl=https://ceur-ws.org/Vol-2245/ammore_paper_2.pdf |volume=Vol-2245 |authors=Önder Babur,Loek Cleophas,Mark van den Brand |dblpUrl=https://dblp.org/rec/conf/models/BaburCB18 }} ==Model Analytics for Feature Models: Case Studies for S.P.L.O.T. Repository== https://ceur-ws.org/Vol-2245/ammore_paper_2.pdf
                                      Model analytics for feature models:
                                      case studies for S.P.L.O.T. repository
                                         Önder Babur, Loek Cleophas and Mark van den Brand
                                                     Eindhoven University of Technology
                                                         Eindhoven, The Netherlands
                                        O.Babur@tue.nl,L.G.W.A.Cleophas@tue.nl,M.G.J.v.d.Brand@tue.nl

ABSTRACT
Model-Driven Engineering and Software Product Lines promote the use of models
as central artifacts for a variety of activities including domain analysis and
generative software development. As these paradigms gain popularity, the number
and variety of models in use increase. Several initiatives to gather models in
repositories exist, such as ATL Zoo for metamodels or S.P.L.O.T. for feature
models, aiming for public access and reuse. However, as those repositories are
only partly or not at all curated, the growing number of models leads to
problems such as duplicates (a.k.a. clones) and a lack of repository overview.
This makes both repository management and model searching/reuse very hard. We
address this issue for S.P.L.O.T. by adapting SAMOS, our generic model
analytics framework, for feature model comparison. We perform two exploratory
case studies. First, we aim to obtain a high-level repository overview with
large clusters and their domains. Secondly, we try to get clusters of highly
similar models, to be interpreted as duplicates or clones. We conclude that our
approach is applicable for feature models and can improve the use and
maintenance of S.P.L.O.T.

KEYWORDS
Model-driven engineering, software product lines, feature models, model
comparison, vector space model, clustering, model analytics, model management.

1    INTRODUCTION
Model-Driven Engineering (MDE) and Software Product Lines (SPLs) are paradigms
that rely heavily on models for a variety of activities ranging from domain
analysis to software development, deployment and testing. While one of the key
objectives of such paradigms is the management and reuse of increasingly
complex software artifacts, the same problem emerges as they gain popularity
and wider adoption: there are more, larger and more complex models in use [8].
Recently, there has been some effort to collect various models in model
repositories to facilitate public access and reuse. Notable examples are the
ATL Ecore Metamodel Zoo¹ and the Software Product Lines Online Tools
(S.P.L.O.T.) feature model repository² [20]. One problem with such repositories
is that they are only partly curated, or not curated at all.

¹ http://web.emn.fr/x-info/atlanmod/index.php?title=Ecore
² http://www.splot-research.org/

This is particularly evident in S.P.L.O.T.: a quick inspection of the
individual models reveals that (1) models usually lack proper metadata on their
domains, versions, etc.; (2) there are quite a few duplicates/clones/versions
of models with no explicit relationship noted. Moreover, the number of models
in the repository increases rapidly, scaling up the aforementioned problems.
These have serious implications in scenarios involving both repository
management and use. First of all, there is a lack of repository overview, e.g.
what groups of models there are, and to which domains these belong. This type
of information would enable repository exploration, facilitating model search
and reuse. Secondly, as new models are added, either the model manager or the
users themselves are burdened with the manual labeling of the models, e.g. with
respect to their domains. And lastly, there is a considerable amount of
duplicate models, clones arbitrarily copy-pasted, and also various versions of
the same models lying around in the repository.

These issues have been raised in the domain of MDE [7, 9]. A promising solution
is the automatic comparison of models [24] for gaining some information on the
repository dataset such as grouping/subgrouping of models, proximities among
models (and groups as well) and outliers. Doing this on a large scale for
hundreds of models requires techniques beyond the complex and expensive
pairwise comparison such as in [19]; rather it requires approximate but fast
and scalable techniques. These include e.g. fragmentation of models into
smaller chunks, typically via Information Retrieval (IR)-based and statistical
methods such as clustering [7, 9], especially for clone detection [5].

There has been a considerable amount of work in the SPL community on feature
model analysis, comparison and use of IR-based techniques, however with several
important distinctions. First of all, inspecting the thorough literature study
of Benavides et al. [11] on automated analysis of feature models reveals that
analysis is mostly performed on a single feature model and some configuration
of that, for instance to find out the dead features or valid products. Other
approaches involve multiple feature models as input; there, model comparison is
generally understood in terms of the configuration semantics (as used by She et
al. [23] in contrast with ontological semantics): feature models are
transformed into logical formulas, whose pairwise relationships, such as
generalization/specialization [25] or exact differences [3, 12], are then
reasoned about. Another approach uses EMF Compare to calculate pairwise
differences between feature models [13]. An interesting take on feature model
comparison is presented by Xing [26], who argues that feature models might
evolve over time with changes in both the structure and feature
names/descriptions, and applies their generic model differencing technique to
feature models using the structural (or ontological according to [23])
information in the models. On the other hand, many researchers have proposed IR
and clustering, not for comparing feature models but requirements, product
descriptions, or features themselves (e.g. their names, the text in their
description) for reverse engineering feature models [4, 23]. Along a line of
work mostly on model synthesis and composition [1, 2],
Bécan et al. utilize IR and NLP techniques in their interactive model synthesis
tool [10]. To the best of our knowledge, there has been no comparable work in
the modelling and SPL domains to cluster large numbers of feature models with
our objectives and scalability.

machine learning (ML) technique is called clustering. Among many clustering
methods [17], there is a distinction between flat clustering, which assigns
each item a single cluster label, and hierarchical clustering, which produces a
hierarchy of proximities.
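The distinction between flat and hierarchical clustering can be illustrated with a small sketch. It is not the SAMOS implementation: the distance matrix `D` is made-up toy data for five hypothetical models, the function names `agglomerative` and `cut` are our own, and average linkage is just one of many possible linkage choices. The hierarchical step records every merge together with its distance (the hierarchy of proximities); cutting that hierarchy at a threshold yields a flat labeling.

```python
from itertools import combinations


def average_linkage(dist, a, b):
    """Mean pairwise distance between the members of clusters a and b."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))


def agglomerative(dist):
    """Hierarchical clustering over a symmetric distance matrix.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    ordered from the closest merge to the final one.
    """
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of current clusters with the smallest linkage distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(dist, *pair))
        merges.append((a, b, average_linkage(dist, a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


def cut(merges, n_items, threshold):
    """Flat clustering: keep only the merges at distance <= threshold."""
    cluster_of = {i: frozenset([i]) for i in range(n_items)}
    for a, b, distance in merges:
        if distance <= threshold:
            merged = a | b
            for item in merged:
                cluster_of[item] = merged
    ids = {}  # cluster -> small integer label, in order of first appearance
    return [ids.setdefault(cluster_of[i], len(ids)) for i in range(n_items)]


# Toy distances (assumed): models 0/1 and 2/3 are near-duplicates, 4 is loosely
# related to 0 and 1.
D = [
    [0.0, 0.1, 0.9, 0.9, 0.6],
    [0.1, 0.0, 0.9, 0.8, 0.7],
    [0.9, 0.9, 0.0, 0.05, 0.8],
    [0.9, 0.8, 0.05, 0.0, 0.8],
    [0.6, 0.7, 0.8, 0.8, 0.0],
]

merges = agglomerative(D)
tight_labels = cut(merges, len(D), threshold=0.2)
loose_labels = cut(merges, len(D), threshold=0.7)
print(tight_labels, loose_labels)
```

A low threshold keeps only the tightest groups, which is the kind of setting one would explore for duplicates or clones; a higher threshold yields fewer, larger groups, closer to a domain-level overview.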
In this paper, we attempt to apply our generic model analytics framework to
compare the feature models in the S.P.L.O.T. repository. Our goals are twofold:
introducing our approach to the SPL community, which we believe can benefit
from the proposed techniques, and testing the genericness and extensibility of
our approach for new model types and datasets. First, we extend our framework
with an extraction scheme for feature models using the S.P.L.O.T. Java API for
parsing Simple XML Feature Model (SXFM) files. Using many utilities of the
framework, notably Natural Language Processing (NLP) tools, we test our
approach on the 1034-model dataset in S.P.L.O.T. We perform two case studies:
firstly trying to get relatively large clusters and their corresponding domains
in the repository; and secondly obtaining clusters of very similar models, to
be interpreted as duplicates, clones or versions. We conclude that our approach
is indeed applicable for feature models and can improve the use and maintenance
of S.P.L.O.T.

2     ANALYZING FEATURE MODELS
In this section we start with some preliminaries and move on to detail our
approach for analyzing and comparing feature models.

2.1    SXFM Feature Models
There are many notations for feature models, starting with the original one by
Kang et al. [14], later extended with cardinalities, additional constraints,
attributes and so on [21, 22]. As a starting point for this study we take the
SXFM notation supported by the models in S.P.L.O.T. A feature model has a
feature tree with different types of features in it (Root, Solitaire),
optional/mandatory modifier,

Finally, n-grams [18] are used in computational linguistics to build
probabilistic models of natural language text, e.g. for estimating the next
word given a sequence of words, or comparing text collections based on their
n-gram profiles. In essence, n-grams represent a linear encoding of structural
context.

2.3        SAMOS Framework for Feature Models
Our generic model analytics framework SAMOS (Statistical Analysis of MOdelS)
[5–7] applies the above ideas to models. We have so far used SAMOS for Ecore
metamodels, UML class diagrams, state charts and industrial domain-specific
modelling languages, in the scenarios of domain clustering, data preprocessing
and filtering, and notably clone detection [5]. The workflow, as depicted in
Figure 1, starts with the extraction of IR-features (note the distinction with
features as in feature models) and constraints from a set of input feature
models, with a traversal of those models using the SXFM Java Parser Library
provided by S.P.L.O.T. We use various schemes and NLP steps, such as
tokenization, filtering and synonym checking, to populate a VSM. As a result,
each feature model is represented in the VSM as a point in a high-dimensional
space and model similarity is reduced to a distance calculation. Clustering is
applied over these distances. The framework allows configuring several matching
schemes (e.g. whether to ignore types, check synonyms) and weighting schemes
(e.g. idf or type weights).

[Figure 1 (workflow diagram); labels recoverable from the extraction: NLP, SXFM Parser, Tokeniza...]
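The steps named in Section 2.3 (extracting IR-features, populating a VSM, reducing similarity to a distance) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SAMOS code: each model is reduced to a hypothetical list of feature-name tokens (in practice these would come from traversing the parsed SXFM tree), the function names are our own, the smoothed idf formula `log(1 + N/df)` is one common variant chosen for the example, and cosine distance stands in for whatever distance measure the framework is configured with.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous windows of n tokens, as tuples (n=1 gives single tokens)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vsm(models, n=1):
    """Represent each model as a sparse idf-weighted vector over its n-grams."""
    counts = {name: Counter(ngrams(tokens, n)) for name, tokens in models.items()}
    # Document frequency: in how many models does each n-gram occur?
    df = Counter(term for c in counts.values() for term in c)
    n_models = len(models)
    return {
        name: {term: tf * math.log(1 + n_models / df[term]) for term, tf in c.items()}
        for name, c in counts.items()
    }


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
        math.sqrt(sum(w * w for w in v.values()))
    return 1.0 - dot / norm if norm else 1.0


# Hypothetical token lists standing in for three already-parsed feature models.
models = {
    "car_a": ["car", "engine", "gearbox", "radio"],
    "car_b": ["car", "engine", "gearbox", "gps"],
    "phone": ["phone", "camera", "gps", "radio"],
}

vsm = build_vsm(models, n=1)
d_cars = cosine_distance(vsm["car_a"], vsm["car_b"])
d_mixed = cosine_distance(vsm["car_a"], vsm["phone"])
print(d_cars, d_mixed)

# With n=2 the vector dimensions are pairs of adjacent tokens, so some of the
# ordering (structural context) of the token sequence is retained.
vsm_bigrams = build_vsm(models, n=2)
```

The two car models share most of their tokens and therefore end up closer to each other than either is to the phone model; a pairwise matrix of such distances is what a clustering step would then consume.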