Short paper

On Breaking The Curse of Dimensionality in Reverse Engineering Feature Models

Jean-Marc Davril and Patrick Heymans (University of Namur, Belgium, email: firstname.lastname@unamur.be)
Guillaume Bécan and Mathieu Acher (Inria and Irisa, Université Rennes 1, France, email: firstname.lastname@inria.fr)

Abstract. Feature models have become one of the most widely used formalisms for representing the variability among the products of a product line. The design of a feature model from a set of existing products can help stakeholders communicate about the commonalities and differences between the products, facilitate the adoption of mass customization strategies, or support the definition of the solution space of a product configurator (i.e. the sets of products that will and will not be offered to the targeted customers). As the manual construction of feature models proves to be a time-consuming and error-prone task, researchers have proposed various approaches for automatically deriving feature models from available product data. Existing reverse engineering techniques mostly rely on data mining algorithms that search for frequently occurring patterns among the features of the available product configurations. However, when the number of features is too large, the sparsity among the configurations can reduce the quality of the extracted model. In this paper, we discuss motivations for the development of dimensionality reduction techniques for product lines in order to support the extraction of feature models in the case of high-dimensional product spaces. We use a real-world dataset to illustrate the problems arising with high dimensionality and present four research questions to address them.

1 Introduction

Feature Models (FMs) were first introduced for representing the commonalities among the software systems of a software product line [15]. An FM specifies the features that form the potential products (also called configurations) of a product line and how these features can be combined to define specific products. In [4], Berger et al. survey the adoption of variability modeling techniques in industry and report that FMs are the most frequently observed notation.

Defining an FM over a set of existing configurations can bring valuable support in the adoption of mass customization strategies. Tseng and Jiao [18] define mass customization as the process of "producing goods and services to meet individual customer's needs with near mass production efficiency". A key phase in the development of a mass customization strategy is the definition of the solution space that should be offered by the provider [16] - that is, all the product configurations that should be made available in order to satisfy customer demand.

An FM is a concise representation of the solution space of a product line. It can be used to engineer mass customization systems, such as configurators [5, 9]. In this case the FM serves as the knowledge base of the configuration system [10]. While deriving a configuration system solely from a set of existing products is not desired in practice, a reverse engineered FM can be used by stakeholders to collect valuable insights about the product domain and to assess the fit between the product line solution space and customer expectations. Depending on the complexity of the solution space and the richness of the product domain, the manual construction of an FM can prove to be time-consuming and prone to errors.

For these reasons, researchers have provided significant contributions to the development of techniques for automating the extraction of FMs from sets of existing product specifications (or configurations) - e.g. see [7, 19, 1, 14, 20, 2]. Existing approaches mostly rely on logical techniques that search for statistically significant relationships between the feature values. For instance, if two feature values never occur together in the configurations, one could infer a constraint stating that these values exclude each other. Similarly, two values that frequently occur together can imply a configuration dependency between the two features.

In this paper, we discuss the pitfalls related to the extraction of FMs from product configuration data. In particular, we highlight the limitations of applying logical approaches to high-dimensional data (i.e. when the product space is defined by a very large number of features). When the number of features grows, logical methods require an increasing number of available configurations to maintain statistical significance. This means that while automatic support is desirable to help practitioners manage high-dimensional product spaces, an FM extraction process that solely relies on logical techniques does not cope well with high dimensionality. We argue that, even when a large volume of configuration data is available, one cannot consistently assume that a logical pattern extracted from the data should be used to define the boundaries of the configuration space. It follows that while it is desirable to support practitioners with logical techniques, an FM extraction process should also consider available product domain knowledge. In the following sections, we highlight the need for formal methods that operate on both logical techniques and background domain knowledge.

The remainder of the paper is organised as follows. Section 2 describes feature modeling and prior work related to the synthesis of FMs. Section 3 illustrates the curse of dimensionality in the context of the FM synthesis problem with a real-world example. In Section 4 we elaborate a research plan with four research questions.
2 Feature Model Synthesis

An FM is an explicit representation of the underlying variability among the products of a product line. It defines the set of features that compose the products and how these features can be combined into products. An FM can be represented as a tree in which nodes are features and edges between the features represent hierarchical relationships (see Figure 1). In the tree hierarchy, each parent-child decomposition constrains the valid combinations of features that can be found in product configurations. The FM in Figure 1 shows a XOR-decomposition of the feature Screen into the features High resolution and Basic (i.e. the two child features form a XOR-group), which specifies that exactly one of the two child features has to be included in each configuration. Other usual decomposition types are OR-groups and Mutex-groups, which respectively define that when the parent feature is selected, at least one, or at most one, of its child features must be included. As shown in Figure 1, filled circles and empty circles represent mandatory and optional child features respectively. It is also possible to define cross-tree constraints such as the implies relationship in Figure 1.

An FM is attributed if there are typed attributes associated with its features. Figure 1 shows an attribute size of type integer under the feature Screen. Such FMs are hereafter referred to as Attributed Feature Models (AFMs).

The semantics of an FM fm, noted [[fm]], is commonly defined as the set of products (i.e. the set of sets of features) that satisfy the constraints specified by fm [17]. Table 1 lists the three valid product configurations for the sample FM in Figure 1.

[Figure 1. A sample FM for a product line of phones: the root feature Phone has an optional child Camera and a mandatory child Screen carrying an attribute size : int; Screen is XOR-decomposed into High resolution and Basic; a cross-tree constraint states that Camera implies High resolution.]

Table 1: The valid product configurations for the FM in Figure 1

    Camera   High resolution   Basic
      X             X
                    X
                                  X
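To make this semantics concrete, the following sketch enumerates [[fm]] for the FM of Figure 1 by encoding its decomposition and cross-tree constraints as a Boolean predicate and brute-forcing all feature subsets. It is only an illustration of the definition above, written in Python for this purpose; it is not the synthesis machinery discussed in the remainder of the paper.

```python
from itertools import chain, combinations

FEATURES = ["Phone", "Camera", "Screen", "High resolution", "Basic"]

def is_valid(config):
    """Constraints of the FM in Figure 1, over a set of feature names."""
    has = config.__contains__
    return (
        has("Phone")                                      # root feature
        and has("Screen")                                 # mandatory child of Phone
        and (not has("Camera") or has("Phone"))           # child implies parent
        and (not has("High resolution") or has("Screen"))
        and (not has("Basic") or has("Screen"))
        # XOR-group: exactly one of the two children of Screen
        and (has("High resolution") ^ has("Basic"))
        # cross-tree constraint: Camera implies High resolution
        and (not has("Camera") or has("High resolution"))
    )

def semantics():
    """[[fm]]: every subset of features that satisfies all constraints."""
    subsets = chain.from_iterable(
        combinations(FEATURES, k) for k in range(len(FEATURES) + 1))
    return [set(s) for s in subsets if is_valid(set(s))]

for product in sorted(semantics(), key=len):
    print(sorted(product))
# Exactly three products are printed, matching the rows of Table 1.
```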
The FM synthesis problem consists in the extraction of an FM from an existing set of products. The synthesis can be decomposed into two steps. First, a logical formula over the inclusion of features in products is mined from the set of configurations. Then, an FM is extracted from the logical formula [7]. Many different FMs can be built for the same logical formula [19]. Therefore, the FM extraction requires heuristics for guiding the selection of the edges that will form the tree hierarchy [19, 8, 20, 2].

The initial set of configurations can be represented as a configuration matrix, which presents the features that are included in the existing configurations, as well as the values of the feature attributes. Table 2 shows a possible initial configuration matrix from which the FM in Figure 1 could be synthesised.

Table 2: A potential product matrix that could lead to the extraction of the FM in Figure 1

    Camera   High resolution   Basic   size
      X             X                    8
      X             X                    7
                    X                    6
                                  X      7
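The first of these two steps can be pictured on the Boolean part of such a matrix: a logical formula is obtained by collecting the binary relations that no configuration violates. The sketch below is a minimal illustration (the matrix literals mirror the Boolean columns of Table 2; it is not the algorithm of [7] or [3]): it keeps the requires and excludes pairs that hold with 100% confidence.

```python
# Boolean part of Table 2: one dict per product, True = feature selected.
MATRIX = [
    {"Camera": True,  "High resolution": True,  "Basic": False},
    {"Camera": True,  "High resolution": True,  "Basic": False},
    {"Camera": False, "High resolution": True,  "Basic": False},
    {"Camera": False, "High resolution": False, "Basic": True},
]

def mine_binary_constraints(matrix):
    """Return the (requires, excludes) pairs violated by no configuration."""
    features = sorted(matrix[0])
    requires, excludes = [], []
    for f in features:
        rows_with_f = [row for row in matrix if row[f]]
        for g in features:
            if f == g:
                continue
            # f requires g: whenever f is selected, g is selected as well.
            if rows_with_f and all(row[g] for row in rows_with_f):
                requires.append((f, g))
            # f excludes g: the two features never occur together.
            if f < g and not any(row[f] and row[g] for row in matrix):
                excludes.append((f, g))
    return requires, excludes

requires, excludes = mine_binary_constraints(MATRIX)
print("requires:", requires)   # [('Camera', 'High resolution')]
print("excludes:", excludes)   # [('Basic', 'Camera'), ('Basic', 'High resolution')]
```

On this toy matrix the mined relations correspond to the implies constraint and to the mutual exclusion behind the XOR-group of Figure 1; on a sparse, high-dimensional matrix the same procedure produces a much larger set of candidate constraints, which is the issue discussed in Section 3.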
There are multiple examples of prior work related to the synthesis of FMs from a logical formula, or from sets of formally defined configurations [7, 19, 1, 14, 20]. A major challenge in the synthesis of FMs is the elicitation of the FM hierarchy. She et al. [19] propose an interactive approach to recommend to users the likely parent candidates for specific features. In [8] we proposed to weight edges between features on the basis of both probabilistic dependencies between the features and similarity between their textual descriptions. We then considered the selection of the hierarchy as the computation of an optimum branching between the features [21]. The FM synthesis techniques proposed in [2] aim at producing FMs that convey a hierarchy that is conceptually sound w.r.t. ontological aspects. In [3], Bécan et al. use domain knowledge to distinguish features from attributes and propose a procedure to mine AFMs.
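As a rough illustration of this hierarchy-selection step, the sketch below weights candidate parent-child edges and asks for a maximum-weight branching over the resulting directed graph. The edge weights are invented for the example, and the sketch assumes the networkx library and its maximum_spanning_arborescence routine are available; it is a loose sketch of the idea behind [8, 21], not the implementation used there.

```python
import networkx as nx

# Candidate parent -> child edges, weighted e.g. by a combination of
# implication probability and textual similarity (values made up here).
candidate_edges = [
    ("Phone", "Screen", 0.9),
    ("Phone", "Camera", 0.8),
    ("Screen", "High resolution", 0.7),
    ("Screen", "Basic", 0.7),
    ("Camera", "High resolution", 0.4),   # competing, lower-weight parent
]

graph = nx.DiGraph()
graph.add_weighted_edges_from(candidate_edges)

# The optimum branching keeps at most one parent per feature while
# maximising the total weight of the selected tree hierarchy.
hierarchy = nx.maximum_spanning_arborescence(graph, attr="weight")
for parent, child in hierarchy.edges():
    print(f"{child} is placed under {parent}")
```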
Other works address the extraction of FMs from less structured artefacts such as textual product descriptions [23, 8, 11].

In [6] Czarnecki et al. use probabilistic logic to formalise the foundations of Probabilistic Feature Models (PFMs). The authors also propose an algorithm to build PFMs upon constraints extracted from sets of configurations. PFMs can contain soft constraints which express probabilistic dependencies between features.

3 The curse of dimensionality

A machine learning algorithm suffers from the effects of the so-called curse of dimensionality when it does not scale well with high-dimensional data. For example, performance issues can arise when the complexity of the algorithm is exponential in the number of dimensions of the dataset. High dimensionality can also impact the quality of results when some dimensions of the dataset are not relevant to the problem to be solved, and thus feed an algorithm with distracting information. Data are referred to as high-dimensional when they are embedded into a high-dimensional space. In the context of the FM synthesis problem, the data are formed by the existing product configurations. Consequently, data dimensionality is defined by the number of features, the number of attributes, and the size of the domains for the values of these attributes.

3.1 High dimensionality in FM synthesis

We now list the variability structures that are commonly mined from configuration matrices by existing FM synthesis approaches.

• Binary implications: Binary implications indicate dependencies between the feature or attribute values in the configuration matrix.
• Hierarchy: A tree hierarchy is built from the binary implications between the features. Conceptually, the hierarchy of an FM organizes the features into different levels of increasing detail. It also defines that the selection of a child feature in a configuration always implies the selection of its parent feature.
• Mandatory features: Once the hierarchy has been found, for any binary implication from a parent feature to one of its children, the child has to be made mandatory.
• Feature groups: OR-groups, XOR-groups and Mutex-groups represent how sibling features can be combined together in product configurations.
• Cross-tree constraints: In addition to the constraints represented in the feature hierarchy, cross-tree constraints such as requires or excludes relationships are mined.

In order to illustrate the curse of dimensionality in the context of the FM synthesis problem, we have applied an AFM synthesis algorithm to a real-world dataset extracted from the Best Buy product catalog. Best Buy is an American retailer that sells consumer electronics and publishes its product data on the web through an API (http://developer.bestbuy.com/). We built configuration matrices for the Best Buy data by merging extracted sets of products that have at least 75% of their features and attributes in common. A description of the AFM synthesis algorithm is out of the scope of this paper and can be found in [3].

We have considered 242 extracted configuration matrices. Table 3 shows statistics about these matrices. The number of Configurations is the number of products in a matrix, while the number of Variables is the number of columns (corresponding to features or attributes). Domain size is the number of distinct values in a column. The properties of the configuration matrices are quite representative of high-dimensional product spaces. The number of products is low w.r.t. the total number of distinct cell values. In our dataset, there are almost as many variables (columns) as configurations, and on average there are more than 5 values per variable. The application of the AFM synthesis algorithm to this dataset brings to light the need for further research efforts, as summarised below.

Table 3: Statistics on the Best Buy dataset

                          Min   Median   Mean     Max
    Configurations         11    27.0    47.1     203
    Variables (columns)    23    50.0    49.6      91
    Domain size             1    2.66    5.45    47.18

Firstly, the Best Buy configuration matrices contain empty cells. The average proportion of empty cells in the matrices is 14.4%, and in the worst case the proportion is 25.0%. The problem with empty cells is that they do not have a clearly defined semantics in terms of variability. One might consider that an empty cell denotes the absence of the corresponding feature in the configuration. However, it is unclear whether the feature is really excluded, or whether its value is simply unknown. This uncertainty is important because different interpretations of empty cells could lead to different synthesised FMs.

Another concern is the ability to distinguish features from attributes among the columns of the matrix. As for the empty cells, different heuristics for this task could result in very different synthesised FMs. Furthermore, each attribute should be associated with the appropriate parent feature, and automating this association becomes harder as the number of features and attributes grows.

One possible direction for addressing the interpretation of empty cells and the distinction between features and attributes is to rely on the specification of domain knowledge by users. This strategy would require the design of user interactions that prevent users from being overwhelmed with huge volumes of variability information, notwithstanding the large number of features and attributes in the dataset.

Another important concern is related to constraints. The number of constraints synthesised from the Best Buy matrices averages 237, with a maximum of 8906. Such large numbers of constraints put into question the validity of the extracted constraints - that is, whether these are legitimate configuration constraints w.r.t. restrictions in the product line domain. When the data is sparse, it can be hard to evaluate whether the configurations just happened to exhibit the constraint. Moreover, when the number of constraints is high, many different FMs that fit the data equally well can be derived from them. A purely statistical synthesis approach is thus limited as it cannot be used to assess the quality of the candidate FMs. Therefore, it would be useful to automatically reduce the number of irrelevant constraints, or to help users assess them. Several approaches can be considered to determine a readable subset of relevant constraints to present to users, e.g. prioritization or minimisation [22].

Our example does not illustrate the synthesis of PFMs. While the constraints mined for crisp FMs have a confidence of 100% (i.e. they cannot be violated by any product), the constraints mined for PFMs have a confidence above a predefined threshold lower than 100%. PFMs can be useful to model variability trends among the products. As in the synthesis of FMs, a high-dimensional matrix can lead to the computation of a very large number of constraints with a confidence above the predefined threshold, and thus make the elicitation of the PFM structure arduous.
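The relation between crisp and soft constraints can be sketched as follows: every candidate implication between Boolean columns is scored by its support (how many configurations select the premise) and its confidence (the fraction of those that also select the conclusion). Keeping only confidence 1.0 yields crisp constraints, while a lower threshold yields PFM-style soft constraints. The code below is an illustrative Python sketch, not the algorithm of [3] or [6]; on a sparse, high-dimensional matrix both filters let through many candidates whose support is only one or two configurations, which is precisely what makes their validity hard to judge.

```python
def score_implications(matrix, features, min_confidence=1.0):
    """List (premise, conclusion, support, confidence) for candidate implications.

    `matrix` is a list of dicts mapping feature names to booleans
    (one dict per configuration).
    """
    scored = []
    for f in features:
        rows_with_f = [row for row in matrix if row.get(f)]
        support = len(rows_with_f)
        if support == 0:
            continue  # no configuration provides evidence for f
        for g in features:
            if f == g:
                continue
            confidence = sum(1 for row in rows_with_f if row.get(g)) / support
            if confidence >= min_confidence:
                scored.append((f, g, support, confidence))
    return scored

# min_confidence=1.0 keeps crisp constraints (never violated by any product);
# e.g. min_confidence=0.9 keeps the soft constraints of a PFM. With hundreds
# of columns and only a few dozen rows, both lists grow quickly while the
# individual supports stay small.
```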
3.2 Dimensionality reduction

In machine learning, the term dimensionality reduction denotes the process of reducing the number of attributes to be considered in the data for a particular task [12]. Dimensionality reduction techniques are commonly divided into two categories: feature selection and feature extraction.

In feature selection, a subset of the original data attributes is selected (see [13]). This typically involves the identification of filtering criteria on the attributes (filter methods) or the use of the machine learning algorithm itself for ranking the relevance of the attributes (wrapper methods). In an FM synthesis process, feature selection could be achieved by choosing a subset of the features to be considered during the elicitation of the FM hierarchy. Once an initial hierarchy had been computed from the core features, the filtered features could then be appended to it.

Feature extraction consists in defining a projection from the high-dimensional space of the data to a space of lower dimension. Let us consider a product line featuring the features length, width and depth. One could define a mathematical function over the values of these three features to replace them with a new attribute size, thus reducing the number of dimensions by mapping three features to a single one. The intended benefit is to reduce the cognitive effort when configuring since (1) fewer configuration variables are presented to users; (2) the configuration variables abstract details that are typically technical, making the promise of raising the level of abstraction for domain analysts or end-users of the engineered configurator.
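As an illustration, such a projection can be as simple as a hand-written function that folds the three geometric attributes into one derived size attribute. The binning rule and the example rows below are invented for the illustration; only the idea of replacing length, width and depth by size comes from the text above.

```python
def derive_size(row):
    """Project the attributes length, width and depth onto a single size value."""
    volume = row["length"] * row["width"] * row["depth"]   # e.g. in cm^3
    if volume < 200:
        return "compact"
    if volume < 1000:
        return "medium"
    return "large"

def reduce_dimensions(matrix):
    """Replace the three geometric columns by the derived size column."""
    reduced = []
    for row in matrix:
        new_row = {k: v for k, v in row.items()
                   if k not in ("length", "width", "depth")}
        new_row["size"] = derive_size(row)
        reduced.append(new_row)
    return reduced

products = [
    {"length": 14.0, "width": 7.0, "depth": 0.8, "Camera": True},
    {"length": 30.0, "width": 21.0, "depth": 2.5, "Camera": False},
]
print(reduce_dimensions(products))
# [{'Camera': True, 'size': 'compact'}, {'Camera': False, 'size': 'large'}]
```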
4 Research Agenda

We aim to address dimensionality reduction in the synthesis of FMs in future research. To this end, we state four research questions:

• RQ1: How should empty cells in configuration matrices be interpreted during the FM synthesis? An empty cell can either represent the absence of a feature (resp. attribute) or translate a lack of knowledge about the value of a feature (resp. attribute). Acknowledging different semantics for empty cells can lead to different synthesis results (a small sketch after this list makes the difference concrete). It would be interesting to investigate the use of complementary data, such as product descriptions or user knowledge, to set these missing values.

• RQ2: How to use background knowledge in the construction of FMs? Current synthesis approaches commonly use only data for constructing FMs. However, when a configuration matrix is highly dimensional, it is possible to find many different FMs that fit the data equally well. Some researchers in the machine learning community have already considered that constructing models only from observed data is a bad practice, which has been referred to as data fishing. In order to select the right FM, we believe that practical synthesis procedures should be guided by existing knowledge about the application domain of the targeted FM (i.e. background knowledge). More specifically, background knowledge could be particularly useful for (1) differentiating the columns of a configuration matrix into features and attributes, and (2) selecting the tree hierarchy among the features of the FM.

• RQ3: How to reduce dimensionality by modeling product qualities? Configuration matrices represent products as sets of features, which refer to a large number of technical specifications (e.g. size, weight, battery life). However, customers usually communicate and reason about the products in terms of different abstractions we call product qualities (e.g. ease-of-use, portability, ergonomics). An interesting research direction is the design of formal methods for augmenting configuration matrices with product qualities. Projections of qualities onto the technical features could help define a new configuration space of lower dimensionality.

• RQ4: How to assess the quality of extracted FMs? One usually wants to elicit an FM with a set of specific tasks to solve in mind, which means that the quality of the FM should be assessed on the basis of the degree of support it provides w.r.t. these tasks. For instance, if an FM is extracted from a set of products with the aim of providing an overview of the underlying variability (domain analysis task), then the readability and the learnability of the FM are relevant quality criteria. In the case of the synthesis of an FM to model a solution space, the conformance of the FM to the existing products should be evaluated. A framework is thus required to identify the evaluation criteria of extracted FMs.
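The following sketch makes the impact of RQ1 concrete. The three-row matrix and the feature names are invented, and None marks an empty cell. Reading an empty cell as "feature absent" and reading it as "value unknown" (so that the row is not counted as evidence) lead to different sets of crisp implications, and therefore to different synthesised FMs.

```python
ABSENT, UNKNOWN = "absent", "unknown"

# None marks an empty cell in the extracted configuration matrix.
MATRIX = [
    {"WiFi": True, "Bluetooth": True},
    {"WiFi": True, "Bluetooth": None},
    {"WiFi": None, "Bluetooth": True},
]

def implications(matrix, empty_cell):
    """Mine 100%-confidence implications under a given empty-cell semantics."""
    features = sorted(matrix[0])

    def value(row, f):
        v = row[f]
        return False if (v is None and empty_cell == ABSENT) else v

    found = []
    for f in features:
        for g in features:
            if f == g:
                continue
            evidence = [row for row in matrix if value(row, f)]
            # Under the "unknown" reading, rows where g is empty are skipped.
            checked = [row for row in evidence
                       if not (row[g] is None and empty_cell == UNKNOWN)]
            if checked and all(value(row, g) for row in checked):
                found.append((f, g))
    return found

print(implications(MATRIX, ABSENT))   # []
print(implications(MATRIX, UNKNOWN))  # [('Bluetooth', 'WiFi'), ('WiFi', 'Bluetooth')]
```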
5 Conclusion

In this paper, we discussed the problems arising during the automatic synthesis of feature models from high-dimensional configuration matrices. We first framed concepts well studied in the machine learning community, such as the curse of dimensionality and dimensionality reduction, in the context of the feature model synthesis problem. Using a real-world dataset extracted from the Best Buy website, we then highlighted the steps in an attributed feature model synthesis process that require further investigation (e.g., when computing constraints). We stated research questions related to the application of FM synthesis algorithms to high-dimensional configuration matrices.

The motivation for enabling domain experts to apply dimensionality reduction on configuration matrices is to synthesize variability information that is relevant w.r.t. the intention of practitioners, and to produce more useful and readable feature models.

REFERENCES

[1] Mathieu Acher, Anthony Cleve, Gilles Perrouin, Patrick Heymans, Charles Vanbeneden, Philippe Collet, and Philippe Lahire, 'On extracting feature models from product descriptions', in Sixth International Workshop on Variability Modeling of Software-Intensive Systems. ACM, (2012).
[2] Guillaume Bécan, Mathieu Acher, Benoit Baudry, and Sana Ben Nasr, 'Breathing ontological knowledge into feature model synthesis: An empirical study', Empirical Software Engineering, 51, (2015).
[3] Guillaume Bécan, Razieh Behjati, Arnaud Gotlieb, and Mathieu Acher, 'Synthesis of attributed feature models from product descriptions', in 19th International Software Product Line Conference, Nashville, TN, USA, (2015).
[4] Thorsten Berger, Ralf Rublack, Divya Nair, Joanne M. Atlee, Martin Becker, Krzysztof Czarnecki, and Andrzej Wasowski, 'A survey of variability modeling in industrial practice', in Proceedings of the Seventh International Workshop on Variability Modelling of Software-intensive Systems, VaMoS '13, pp. 7:1–7:8, New York, NY, USA, ACM, (2013).
[5] Quentin Boucher, Gilles Perrouin, and Patrick Heymans, 'Deriving configuration interfaces from feature models: A vision paper', in Proceedings of the Sixth International Workshop on Variability Modeling of Software-Intensive Systems, pp. 37–44. ACM, (2012).
[6] Krzysztof Czarnecki, Steven She, and Andrzej Wasowski, 'Sample spaces and feature models: There and back again', in Proceedings of the 12th International Software Product Line Conference. IEEE, (2008).
[7] Krzysztof Czarnecki and Andrzej Wasowski, 'Feature diagrams and logics: There and back again', in Proceedings of the 11th International Software Product Line Conference (SPLC'07). IEEE, (2007).
[8] Jean-Marc Davril, Edouard Delfosse, Negar Hariri, Mathieu Acher, Jane Cleland-Huang, and Patrick Heymans, 'Feature model extraction from large collections of informal product descriptions', in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. ACM, (2013).
[9] Deepak Dhungana, Andreas Falkner, and Alois Haselbock, 'Configuration of cardinality-based feature models using generative constraint satisfaction', in 37th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA). IEEE, (2011).
[10] Alexander Felfernig, 'Standardized configuration knowledge representations as technological foundation for mass customization', IEEE Transactions on Engineering Management, 54(1), 41–56, (2007).
[11] Alessio Ferrari, Giorgio O. Spagnolo, and Felice Dell'Orletta, 'Mining commonalities and variabilities from natural language documents', in Proceedings of the 17th International Software Product Line Conference, pp. 116–120. ACM, (2013).
[12] Imola K. Fodor, 'A survey of dimension reduction techniques', (2002).
[13] Isabelle Guyon and André Elisseeff, 'An introduction to variable and feature selection', The Journal of Machine Learning Research, (2003).
[14] Evelyn Nicole Haslinger, Roberto Erick Lopez-Herrejon, and Alexander Egyed, 'On extracting feature models from sets of valid feature combinations', in Fundamental Approaches to Software Engineering, Springer, (2013).
[15] Kyo C. Kang, Sholom G. Cohen, James A. Hess, William E. Novak, and A. Spencer Peterson, 'Feature-oriented domain analysis (FODA) feasibility study', Technical report, DTIC Document, (1990).
[16] F.T. Piller and P. Blazek, 'Core capabilities of sustainable mass customization', in Knowledge-based Configuration: From Research to Business Cases, pp. 107–120, Morgan Kaufmann Publishers, (2014).
[17] P. Schobbens, Patrick Heymans, and J.-C. Trigaux, 'Feature diagrams: A survey and a formal semantics', in 14th IEEE International Conference on Requirements Engineering, pp. 139–148. IEEE, (2006).
[18] M.M. Tseng and J. Jiao, 'Mass customization', in Handbook of Industrial Engineering: Technology and Operations Management, Third Edition (ed. G. Salvendy), (2001).
[19] Steven She, Rafael Lotufo, Thorsten Berger, Andrzej Wasowski, and Krzysztof Czarnecki, 'Reverse engineering feature models', in Proceedings of the 33rd International Conference on Software Engineering (ICSE), pp. 461–470. IEEE, (2011).
[20] Steven She, Uwe Ryssel, Nele Andersen, Andrzej Wasowski, and Krzysztof Czarnecki, 'Efficient synthesis of feature models', Information and Software Technology, 56(9), 1122–1143, (2014).
[21] Robert Endre Tarjan, 'Finding optimum branchings', Networks, 7(1), 25–35, (1977).
[22] Alexander von Rhein, Alexander Grebhahn, Sven Apel, Norbert Siegmund, Dirk Beyer, and Thorsten Berger, 'Presence-condition simplification in highly configurable systems', in Proceedings of the International Conference on Software Engineering (ICSE), (2015).
[23] Nathan Weston, Ruzanna Chitchyan, and Awais Rashid, 'A framework for constructing semantically composable feature models from natural language requirements', in 13th International Software Product Line Conference, pp. 211–220, (2009).