Short paper

On Breaking The Curse of Dimensionality in Reverse Engineering Feature Models

Jean-Marc Davril and Patrick Heymans (University of Namur, Belgium, email: firstname.lastname@unamur.be)
Guillaume Bécan and Mathieu Acher (Inria and Irisa, Université Rennes 1, France, email: firstname.lastname@inria.fr)

Abstract. Feature models have become one of the most widely used formalisms for representing the variability among the products of a product line. The design of a feature model from a set of existing products can help stakeholders communicate about the commonalities and differences between the products, facilitate the adoption of mass customization strategies, or support the definition of the solution space of a product configurator (i.e. the sets of products that will and will not be offered to the targeted customers). As the manual construction of feature models proves to be a time-consuming and error-prone task, researchers have proposed various approaches for automatically deriving feature models from available product data. Existing reverse engineering techniques mostly rely on data mining algorithms that search for frequently occurring patterns among the features of the available product configurations. However, when the number of features is too large, the sparsity among the configurations can reduce the quality of the extracted model. In this paper, we discuss motivations for the development of dimensionality reduction techniques for product lines in order to support the extraction of feature models in the case of high-dimensional product spaces. We use a real-world dataset to illustrate the problems arising with high dimensionality and present four research questions to address them.

1 Introduction

Feature Models (FMs) were first introduced for representing the commonalities among the software systems of a software product line [15]. An FM specifies the features that form the potential products (also called configurations) of a product line and how these features can be combined to define specific products. In [4], Berger et al. survey the adoption of variability modeling techniques in industry and report that FMs are the most frequently observed notation.

Defining an FM over a set of existing configurations can bring valuable support in the adoption of mass customization strategies. Tseng and Jiao [18] define mass customization as the process of "producing goods and services to meet individual customer's needs with near mass production efficiency". A key phase in the development of a mass customization strategy is the definition of the solution space that should be offered by the provider [16] - that is, all the product configurations that should be made available in order to satisfy customer demand.

An FM is a concise representation of the solution space of a product line. It can be used to engineer mass customization systems, such as configurators [5, 9]. In this case the FM serves as the knowledge base of the configuration system [10]. While deriving a configuration system solely from a set of existing products is not desired in practice, a reverse engineered FM can be used by stakeholders to collect valuable insights about the product domain and to assess the fit between the product line solution space and customer expectations. Depending on the complexity of the solution space and the richness of the product domain, the manual construction of an FM can prove to be time-consuming and prone to errors.

For these reasons, researchers have provided significant contributions to the development of techniques for automating the extraction of FMs from sets of existing product specifications (or configurations) - e.g. see [7, 19, 1, 14, 20, 2]. Existing approaches mostly rely on logical techniques that search for statistically significant relationships between the feature values. For instance, if two feature values never occur together in the configurations, one could infer a constraint stating that these values exclude each other. Similarly, two values that frequently occur together can imply a configuration dependency between the two features.

In this paper, we discuss the pitfalls related to the extraction of FMs from product configuration data. In particular, we highlight the limitations of applying logical approaches to high-dimensional data (i.e. when the product space is defined by a very large number of features). When the number of features grows, logical methods require an increasing number of available configurations to maintain statistical significance. This means that while automatic support is desirable to help practitioners manage high-dimensional product spaces, an FM extraction process that solely relies on logical techniques does not cope well with high dimensionality. We argue that, even when a large volume of configuration data is available, one cannot consistently assume that a logical pattern extracted from the data should be used to define the boundaries of the configuration space. It follows that while it is desirable to support practitioners with logical techniques, an FM extraction process should also consider available product domain knowledge. In the following sections, we highlight the need for formal methods that operate on both logical techniques and background domain knowledge.

The remainder of the paper is organised as follows. Section 2 describes feature modeling and prior work related to the synthesis of FMs. Section 3 illustrates the curse of dimensionality in the context of the FM synthesis problem with a real-world example. In Section 4 we elaborate a research plan with four research questions.
2 Feature Model Synthesis

An FM is an explicit representation of the underlying variability among the products of a product line. It defines the set of features that compose the products and how these features can be combined into products. An FM can be represented as a tree in which nodes are features and edges between the features represent hierarchical relationships (see Figure 1). In the tree hierarchy, each parent-child decomposition constrains the valid combinations of features that can be found in product configurations. The FM in Figure 1 shows a XOR-decomposition of the feature Screen into the features High resolution and Basic (i.e. the two child features form a XOR-group), which specifies that exactly one of the two child features has to be included in each configuration. Other usual decomposition types are OR-groups and Mutex-groups, which respectively define that when the parent feature is selected, at least one, or at most one, of its child features must be included. As shown in Figure 1, filled circles and empty circles represent mandatory and optional child features respectively. It is also possible to define cross-tree constraints such as the implies relationship in Figure 1.

An FM is attributed if there are typed attributes associated with its features. Figure 1 shows an attribute size of type integer under the feature Screen. Such FMs are hereafter referred to as Attributed Feature Models (AFMs).

The semantics of an FM fm, noted [[fm]], is commonly defined as the set of products (i.e. the set of sets of features) that satisfy the constraints specified by fm [17]. Table 1 lists the three valid product configurations for the sample FM in Figure 1.

[Figure 1. A sample FM for a product line of phones: the root feature Phone has an optional child Camera and a mandatory child Screen carrying an attribute size : int; Screen is XOR-decomposed into High resolution and Basic; a cross-tree constraint states that Camera implies High resolution.]

Table 1: The valid product configurations for the FM in Figure 1

    Camera   High resolution   Basic
      X             X
                    X
                                  X
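To make this semantics concrete, the following sketch enumerates [[fm]] for the FM of Figure 1 by encoding its decomposition and cross-tree constraints as a Boolean predicate and brute-forcing all feature subsets. It is only an illustration of the definition above, written in Python for this purpose; it is not the synthesis machinery discussed in the remainder of the paper.

```python
from itertools import chain, combinations

FEATURES = ["Phone", "Camera", "Screen", "High resolution", "Basic"]

def is_valid(config):
    """Constraints of the FM in Figure 1, over a set of feature names."""
    has = config.__contains__
    return (
        has("Phone")                                      # root feature
        and has("Screen")                                 # mandatory child of Phone
        and (not has("Camera") or has("Phone"))           # child implies parent
        and (not has("High resolution") or has("Screen"))
        and (not has("Basic") or has("Screen"))
        # XOR-group: exactly one of the two children of Screen
        and (has("High resolution") ^ has("Basic"))
        # cross-tree constraint: Camera implies High resolution
        and (not has("Camera") or has("High resolution"))
    )

def semantics():
    """[[fm]]: every subset of features that satisfies all constraints."""
    subsets = chain.from_iterable(
        combinations(FEATURES, k) for k in range(len(FEATURES) + 1))
    return [set(s) for s in subsets if is_valid(set(s))]

for product in sorted(semantics(), key=len):
    print(sorted(product))
# Exactly three products are printed, matching the rows of Table 1.
```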
The FM synthesis problem consists in the extraction of an FM from an existing set of products. The synthesis can be decomposed into two steps. First, a logical formula over the inclusion of features in products is mined from the set of configurations. Then, an FM is extracted from the logical formula [7]. Many different FMs can be built for the same logical formula [19]. Therefore, the FM extraction requires heuristics for guiding the selection of the edges that will form the tree hierarchy [19, 8, 20, 2].

The initial set of configurations can be represented as a configuration matrix, which presents the features that are included in the existing configurations, as well as the values of the feature attributes. Table 2 shows a possible initial configuration matrix from which the FM in Figure 1 could be synthesised.

Table 2: A potential product matrix that could lead to the extraction of the FM in Figure 1

    Camera   High resolution   Basic   size
      X             X                    8
      X             X                    7
                    X                    6
                                  X      7
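The first of these two steps can be pictured on the Boolean part of such a matrix: a logical formula is obtained by collecting the binary relations that no configuration violates. The sketch below is a minimal illustration (the matrix literals mirror the Boolean columns of Table 2; it is not the algorithm of [7] or [3]): it keeps the requires and excludes pairs that hold with 100% confidence.

```python
# Boolean part of Table 2: one dict per product, True = feature selected.
MATRIX = [
    {"Camera": True,  "High resolution": True,  "Basic": False},
    {"Camera": True,  "High resolution": True,  "Basic": False},
    {"Camera": False, "High resolution": True,  "Basic": False},
    {"Camera": False, "High resolution": False, "Basic": True},
]

def mine_binary_constraints(matrix):
    """Return the (requires, excludes) pairs violated by no configuration."""
    features = sorted(matrix[0])
    requires, excludes = [], []
    for f in features:
        rows_with_f = [row for row in matrix if row[f]]
        for g in features:
            if f == g:
                continue
            # f requires g: whenever f is selected, g is selected as well.
            if rows_with_f and all(row[g] for row in rows_with_f):
                requires.append((f, g))
            # f excludes g: the two features never occur together.
            if f < g and not any(row[f] and row[g] for row in matrix):
                excludes.append((f, g))
    return requires, excludes

requires, excludes = mine_binary_constraints(MATRIX)
print("requires:", requires)   # [('Camera', 'High resolution')]
print("excludes:", excludes)   # [('Basic', 'Camera'), ('Basic', 'High resolution')]
```

On this toy matrix the mined relations correspond to the implies constraint and to the mutual exclusion behind the XOR-group of Figure 1; on a sparse, high-dimensional matrix the same procedure produces a much larger set of candidate constraints, which is the issue discussed in Section 3.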
There are multiple examples of prior work related to the synthesis of FMs from a logical formula, or from sets of formally defined configurations [7, 19, 1, 14, 20]. A major challenge in the synthesis of FMs is the elicitation of the FM hierarchy. She et al. [19] propose an interactive approach to recommend to users the likely parent candidates for specific features. In [8] we proposed to weight edges between features on the basis of both probabilistic dependencies between the features and similarity between their textual descriptions. We then considered the selection of the hierarchy as the computation of an optimum branching between the features [21]. The FM synthesis techniques proposed in [2] aim at producing FMs that convey a hierarchy that is conceptually sound w.r.t. ontological aspects. In [3], Bécan et al. use domain knowledge to distinguish features from attributes and propose a procedure to mine AFMs.
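As a rough illustration of this hierarchy-selection step, the sketch below weights candidate parent-child edges and asks for a maximum-weight branching over the resulting directed graph. The edge weights are invented for the example, and the sketch assumes the networkx library and its maximum_spanning_arborescence routine are available; it is a loose sketch of the idea behind [8, 21], not the implementation used there.

```python
import networkx as nx

# Candidate parent -> child edges, weighted e.g. by a combination of
# implication probability and textual similarity (values made up here).
candidate_edges = [
    ("Phone", "Screen", 0.9),
    ("Phone", "Camera", 0.8),
    ("Screen", "High resolution", 0.7),
    ("Screen", "Basic", 0.7),
    ("Camera", "High resolution", 0.4),   # competing, lower-weight parent
]

graph = nx.DiGraph()
graph.add_weighted_edges_from(candidate_edges)

# The optimum branching keeps at most one parent per feature while
# maximising the total weight of the selected tree hierarchy.
hierarchy = nx.maximum_spanning_arborescence(graph, attr="weight")
for parent, child in hierarchy.edges():
    print(f"{child} is placed under {parent}")
```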
Other works address the extraction of FMs from less structured artefacts such as textual product descriptions [23, 8, 11].

In [6] Czarnecki et al. use probabilistic logic to formalise the foundations of Probabilistic Feature Models (PFMs). The authors also propose an algorithm to build PFMs upon constraints extracted from sets of configurations. PFMs can contain soft constraints which express probabilistic dependencies between features.

3 The curse of dimensionality

A machine learning algorithm suffers from the effects of the so-called curse of dimensionality when it does not scale well with high-dimensional data. For example, performance issues can arise when the complexity of the algorithm is exponential in the number of dimensions of the dataset. High dimensionality can also impact the quality of results when some dimensions of the dataset are not relevant to the problem to be solved, and thus feed an algorithm with distracting information. Data are referred to as high-dimensional when they are embedded into a high-dimensional space. In the context of the FM synthesis problem, the data are formed by the existing product configurations. Consequently, data dimensionality is defined by the number of features, the number of attributes, and the size of the domains for the values of these attributes.

3.1 High dimensionality in FM synthesis

We now list the variability structures that are commonly mined from configuration matrices by existing FM synthesis approaches.

• Binary implications: Binary implications indicate dependencies between the feature or attribute values in the configuration matrix.
• Hierarchy: A tree hierarchy is built from the binary implications between the features. Conceptually, the hierarchy of an FM organizes the features into different levels of increasing detail. It also defines that the selection of a child feature in a configuration always implies the selection of its parent feature.
• Mandatory features: Once the hierarchy has been found, for any binary implication from a parent feature to one of its children, the child has to be made mandatory.
• Feature groups: OR-groups, XOR-groups and Mutex-groups represent how sibling features can be combined together in product configurations.
• Cross-tree constraints: In addition to the constraints represented in the feature hierarchy, cross-tree constraints such as requires or excludes relationships are mined.

In order to illustrate the curse of dimensionality in the context of the FM synthesis problem, we have applied an AFM synthesis algorithm to a real-world dataset extracted from the Best Buy product catalog. Best Buy is an American retailer that sells consumer electronics and publishes its product data on the web through an API (http://developer.bestbuy.com/). We built configuration matrices for the Best Buy data by merging extracted sets of products that have at least 75% of their features and attributes in common. A description of the AFM synthesis algorithm is out of the scope of this paper and can be found in [3].

We have considered 242 extracted configuration matrices. Table 3 shows statistics about these matrices. The number of Configurations is the number of products in a matrix, while the number of Variables is the number of columns (corresponding to features or attributes). Domain size is the number of distinct values in a column. The properties of the configuration matrices are quite representative of high-dimensional product spaces. The number of products is low w.r.t. the total number of distinct cell values. In our dataset, there are almost as many variables (columns) as configurations, and on average there are more than 5 values per variable. The application of the AFM synthesis algorithm to this dataset brings to light the need for further research efforts, as summarised below.

Table 3: Statistics on the Best Buy dataset

                          Min   Median   Mean     Max
    Configurations         11    27.0    47.1     203
    Variables (columns)    23    50.0    49.6      91
    Domain size             1    2.66    5.45    47.18

Firstly, the Best Buy configuration matrices contain empty cells. The average proportion of empty cells in the matrices is 14.4%, and in the worst case the proportion is 25.0%. The problem with empty cells is that they do not have a clearly defined semantics in terms of variability. One might consider that an empty cell denotes the absence of the corresponding feature in the configuration. However, it is unclear whether the feature is really excluded, or whether its value is simply unknown. This uncertainty is important because different interpretations of empty cells could lead to different synthesised FMs.

Another concern is the ability to distinguish features from attributes among the columns of the matrix. As for the empty cells, different heuristics for this task could result in very different synthesised FMs. Furthermore, each attribute should be associated with the appropriate parent feature, and automating this association becomes harder as the number of features and attributes grows.

One possible direction for addressing the interpretation of empty cells and the distinction between features and attributes is to rely on the specification of domain knowledge by users. This strategy would require the design of user interactions that prevent users from being overwhelmed with huge volumes of variability information, notwithstanding the large number of features and attributes in the dataset.

Another important concern is related to constraints. The number of constraints synthesised from the Best Buy matrices averages 237, with a maximum of 8906. Such large numbers of constraints put into question the validity of the extracted constraints - that is, whether these are legitimate configuration constraints w.r.t. restrictions in the product line domain. When the data is sparse, it can be hard to evaluate whether the configurations just happened to exhibit the constraint. Moreover, when the number of constraints is high, many different FMs that fit the data equally well can be derived from them. A purely statistical synthesis approach is thus limited as it cannot be used to assess the quality of the candidate FMs. Therefore, it would be useful to automatically reduce the number of irrelevant constraints, or to help users assess them. Several approaches can be considered to determine a readable subset of relevant constraints to present to users, e.g. prioritization or minimisation [22].

Our example does not illustrate the synthesis of PFMs. While the constraints mined for crisp FMs have a confidence of 100% (i.e. they cannot be violated by any product), the constraints mined for PFMs have a confidence above a predefined threshold lower than 100%. PFMs can be useful to model variability trends among the products. As in the synthesis of FMs, a high-dimensional matrix can lead to the computation of a very large number of constraints with a confidence above the predefined threshold, and thus make the elicitation of the PFM structure arduous.
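The relation between crisp and soft constraints can be sketched as follows: every candidate implication between Boolean columns is scored by its support (how many configurations select the premise) and its confidence (the fraction of those that also select the conclusion). Keeping only confidence 1.0 yields crisp constraints, while a lower threshold yields PFM-style soft constraints. The code below is an illustrative Python sketch, not the algorithm of [3] or [6]; on a sparse, high-dimensional matrix both filters let through many candidates whose support is only one or two configurations, which is precisely what makes their validity hard to judge.

```python
def score_implications(matrix, features, min_confidence=1.0):
    """List (premise, conclusion, support, confidence) for candidate implications.

    `matrix` is a list of dicts mapping feature names to booleans
    (one dict per configuration).
    """
    scored = []
    for f in features:
        rows_with_f = [row for row in matrix if row.get(f)]
        support = len(rows_with_f)
        if support == 0:
            continue  # no configuration provides evidence for f
        for g in features:
            if f == g:
                continue
            confidence = sum(1 for row in rows_with_f if row.get(g)) / support
            if confidence >= min_confidence:
                scored.append((f, g, support, confidence))
    return scored

# min_confidence=1.0 keeps crisp constraints (never violated by any product);
# e.g. min_confidence=0.9 keeps the soft constraints of a PFM. With hundreds
# of columns and only a few dozen rows, both lists grow quickly while the
# individual supports stay small.
```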
3.2 Dimensionality reduction

In machine learning, the term dimensionality reduction denotes the process of reducing the number of attributes to be considered in the data for a particular task [12]. Dimensionality reduction techniques are commonly divided into two categories: feature selection and feature extraction.

In feature selection, a subset of the original data attributes is selected (see [13]). This typically involves the identification of filtering criteria on the attributes (filter methods) or the use of the machine learning algorithm itself for ranking the relevance of the attributes (wrapper methods). In an FM synthesis process, feature selection could be achieved by choosing a subset of the features to be considered during the elicitation of the FM hierarchy. Once an initial hierarchy had been computed from the core features, the filtered features could then be appended to it.

Feature extraction consists in defining a projection from the high-dimensional space of the data to a space of lower dimension. Let us consider a product line featuring the features length, width and depth. One could define a mathematical function over the values of these three features to replace them with a new attribute size, thus reducing the number of dimensions by mapping three features to a single one. The intended benefit is to reduce the cognitive effort when configuring since (1) fewer configuration variables are presented to users; (2) the configuration variables abstract details that are typically technical, making the promise of raising the level of abstraction for domain analysts or end-users of the engineered configurator.
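As an illustration, such a projection can be as simple as a hand-written function that folds the three geometric attributes into one derived size attribute. The binning rule and the example rows below are invented for the illustration; only the idea of replacing length, width and depth by size comes from the text above.

```python
def derive_size(row):
    """Project the attributes length, width and depth onto a single size value."""
    volume = row["length"] * row["width"] * row["depth"]   # e.g. in cm^3
    if volume < 200:
        return "compact"
    if volume < 1000:
        return "medium"
    return "large"

def reduce_dimensions(matrix):
    """Replace the three geometric columns by the derived size column."""
    reduced = []
    for row in matrix:
        new_row = {k: v for k, v in row.items()
                   if k not in ("length", "width", "depth")}
        new_row["size"] = derive_size(row)
        reduced.append(new_row)
    return reduced

products = [
    {"length": 14.0, "width": 7.0, "depth": 0.8, "Camera": True},
    {"length": 30.0, "width": 21.0, "depth": 2.5, "Camera": False},
]
print(reduce_dimensions(products))
# [{'Camera': True, 'size': 'compact'}, {'Camera': False, 'size': 'large'}]
```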
4 Research Agenda

We aim to address dimensionality reduction in the synthesis of FMs in future research. To this end, we state four research questions:

• RQ1: How should empty cells in configuration matrices be interpreted during the FM synthesis? An empty cell can either represent the absence of a feature (resp. attribute) or translate a lack of knowledge about the value of a feature (resp. attribute). Acknowledging different semantics for empty cells can lead to different synthesis results (a small sketch after this list makes the difference concrete). It would be interesting to investigate the use of complementary data, such as product descriptions or user knowledge, to set these missing values.

• RQ2: How to use background knowledge in the construction of FMs? Current synthesis approaches commonly use only data for constructing FMs. However, when a configuration matrix is highly dimensional, it is possible to find many different FMs that fit the data equally well. Some researchers in the machine learning community have already considered that constructing models only from observed data is a bad practice, which has been referred to as data fishing. In order to select the right FM, we believe that practical synthesis procedures should be guided by existing knowledge about the application domain of the targeted FM (i.e. background knowledge). More specifically, background knowledge could be particularly useful for (1) differentiating the columns of a configuration matrix into features and attributes, and (2) selecting the tree hierarchy among the features of the FM.

• RQ3: How to reduce dimensionality by modeling product qualities? Configuration matrices represent products as sets of features, which refer to a large number of technical specifications (e.g. size, weight, battery life). However, customers usually communicate and reason about the products in terms of different abstractions we call product qualities (e.g. ease-of-use, portability, ergonomics). An interesting research direction is the design of formal methods for augmenting configuration matrices with product qualities. Projections of qualities onto the technical features could help define a new configuration space of lower dimensionality.

• RQ4: How to assess the quality of extracted FMs? One usually wants to elicit an FM with a set of specific tasks to solve in mind, which means that the quality of the FM should be assessed on the basis of the degree of support it provides w.r.t. these tasks. For instance, if an FM is extracted from a set of products with the aim of providing an overview of the underlying variability (domain analysis task), then the readability and the learnability of the FM are relevant quality criteria. In the case of the synthesis of an FM to model a solution space, the conformance of the FM to the existing products should be evaluated. A framework is thus required to identify the evaluation criteria of extracted FMs.
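The following sketch makes the impact of RQ1 concrete. The three-row matrix and the feature names are invented, and None marks an empty cell. Reading an empty cell as "feature absent" and reading it as "value unknown" (so that the row is not counted as evidence) lead to different sets of crisp implications, and therefore to different synthesised FMs.

```python
ABSENT, UNKNOWN = "absent", "unknown"

# None marks an empty cell in the extracted configuration matrix.
MATRIX = [
    {"WiFi": True, "Bluetooth": True},
    {"WiFi": True, "Bluetooth": None},
    {"WiFi": None, "Bluetooth": True},
]

def implications(matrix, empty_cell):
    """Mine 100%-confidence implications under a given empty-cell semantics."""
    features = sorted(matrix[0])

    def value(row, f):
        v = row[f]
        return False if (v is None and empty_cell == ABSENT) else v

    found = []
    for f in features:
        for g in features:
            if f == g:
                continue
            evidence = [row for row in matrix if value(row, f)]
            # Under the "unknown" reading, rows where g is empty are skipped.
            checked = [row for row in evidence
                       if not (row[g] is None and empty_cell == UNKNOWN)]
            if checked and all(value(row, g) for row in checked):
                found.append((f, g))
    return found

print(implications(MATRIX, ABSENT))   # []
print(implications(MATRIX, UNKNOWN))  # [('Bluetooth', 'WiFi'), ('WiFi', 'Bluetooth')]
```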
5 Conclusion

In this paper, we discussed the problems arising during the automatic synthesis of feature models from high-dimensional configuration matrices. We first framed concepts well studied in the machine learning community, such as the curse of dimensionality and dimensionality reduction, in the context of the feature model synthesis problem. Using a real-world dataset extracted from the Best Buy website, we then highlighted the steps in an attributed feature model synthesis process that require further investigation (e.g., when computing constraints). We stated research questions related to the application of FM synthesis algorithms to high-dimensional configuration matrices.

The motivation for enabling domain experts to apply dimensionality reduction on configuration matrices is to synthesize variability information that is relevant w.r.t. the intention of practitioners, and to produce more useful and readable feature models.

REFERENCES

[1] Mathieu Acher, Anthony Cleve, Gilles Perrouin, Patrick Heymans, Charles Vanbeneden, Philippe Collet, and Philippe Lahire, 'On extracting feature models from product descriptions', in Sixth International Workshop on Variability Modeling of Software-Intensive Systems. ACM, (2012).
[2] Guillaume Bécan, Mathieu Acher, Benoit Baudry, and Sana Ben Nasr, 'Breathing ontological knowledge into feature model synthesis: An empirical study', Empirical Software Engineering, 51, (2015).
[3] Guillaume Bécan, Razieh Behjati, Arnaud Gotlieb, and Mathieu Acher, 'Synthesis of attributed feature models from product descriptions', in 19th International Software Product Line Conference, Nashville, TN, USA, (2015).
[4] Thorsten Berger, Ralf Rublack, Divya Nair, Joanne M. Atlee, Martin Becker, Krzysztof Czarnecki, and Andrzej Wasowski, 'A survey of variability modeling in industrial practice', in Proceedings of the Seventh International Workshop on Variability Modelling of Software-intensive Systems, VaMoS '13, pp. 7:1–7:8, New York, NY, USA, ACM, (2013).
[5] Quentin Boucher, Gilles Perrouin, and Patrick Heymans, 'Deriving configuration interfaces from feature models: A vision paper', in Proceedings of the Sixth International Workshop on Variability Modeling of Software-Intensive Systems, pp. 37–44. ACM, (2012).
[6] Krzysztof Czarnecki, Steven She, and Andrzej Wasowski, 'Sample spaces and feature models: There and back again', in Proceedings of the 12th International Software Product Line Conference. IEEE, (2008).
[7] Krzysztof Czarnecki and Andrzej Wasowski, 'Feature diagrams and logics: There and back again', in Proceedings of the 11th International Software Product Line Conference (SPLC'07). IEEE, (2007).
[8] Jean-Marc Davril, Edouard Delfosse, Negar Hariri, Mathieu Acher, Jane Cleland-Huang, and Patrick Heymans, 'Feature model extraction from large collections of informal product descriptions', in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. ACM, (2013).
[9] Deepak Dhungana, Andreas Falkner, and Alois Haselbock, 'Configuration of cardinality-based feature models using generative constraint satisfaction', in 37th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA). IEEE, (2011).
[10] Alexander Felfernig, 'Standardized configuration knowledge representations as technological foundation for mass customization', IEEE Transactions on Engineering Management, 54(1), 41–56, (2007).
[11] Alessio Ferrari, Giorgio O. Spagnolo, and Felice Dell'Orletta, 'Mining commonalities and variabilities from natural language documents', in Proceedings of the 17th International Software Product Line Conference, pp. 116–120. ACM, (2013).
[12] Imola K. Fodor, 'A survey of dimension reduction techniques', (2002).
[13] Isabelle Guyon and André Elisseeff, 'An introduction to variable and feature selection', The Journal of Machine Learning Research, (2003).
[14] Evelyn Nicole Haslinger, Roberto Erick Lopez-Herrejon, and Alexander Egyed, 'On extracting feature models from sets of valid feature combinations', in Fundamental Approaches to Software Engineering, Springer, (2013).
[15] Kyo C. Kang, Sholom G. Cohen, James A. Hess, William E. Novak, and A. Spencer Peterson, 'Feature-oriented domain analysis (FODA) feasibility study', Technical report, DTIC Document, (1990).
[16] F.T. Piller and P. Blazek, 'Core capabilities of sustainable mass customization', in Knowledge-based Configuration: From Research to Business Cases, pp. 107–120, Morgan Kaufmann Publishers, (2014).
[17] P. Schobbens, Patrick Heymans, and J.-C. Trigaux, 'Feature diagrams: A survey and a formal semantics', in 14th IEEE International Conference on Requirements Engineering, pp. 139–148. IEEE, (2006).
[18] M.M. Tseng and J. Jiao, 'Mass customization', in Handbook of Industrial Engineering: Technology and Operations Management, Third Edition (ed. G. Salvendy), (2001).
[19] Steven She, Rafael Lotufo, Thorsten Berger, Andrzej Wasowski, and Krzysztof Czarnecki, 'Reverse engineering feature models', in Proceedings of the 33rd International Conference on Software Engineering (ICSE), pp. 461–470. IEEE, (2011).
[20] Steven She, Uwe Ryssel, Nele Andersen, Andrzej Wasowski, and Krzysztof Czarnecki, 'Efficient synthesis of feature models', Information and Software Technology, 56(9), 1122–1143, (2014).
[21] Robert Endre Tarjan, 'Finding optimum branchings', Networks, 7(1), 25–35, (1977).
[22] Alexander von Rhein, Alexander Grebhahn, Sven Apel, Norbert Siegmund, Dirk Beyer, and Thorsten Berger, 'Presence-condition simplification in highly configurable systems', in Proceedings of the International Conference on Software Engineering (ICSE), (2015).
[23] Nathan Weston, Ruzanna Chitchyan, and Awais Rashid, 'A framework for constructing semantically composable feature models from natural language requirements', in 13th International Software Product Line Conference, pp. 211–220, (2009).