Column Oriented Compilation of Variant Tables Albert Haag1 Abstract. The objective of this work is to improve variant table • query the table using a key defined in the DBMS evaluation in a configurator by compiling/compressing the tables in- • iterate over the current solution set of the table given domain re- dividually to reduce both processing time and space. The main re- strictions for the associated product properties sult is a proposed simple heuristic for decomposing the variant table I assume that an existing (legacy) configurator has already imple- into subtables and a representation linking these subtables in a di- mented this functionality. This will include means for efficiently test- rected acyclic graph (DAG). The size of the compression obtained ing membership in a domain (memberp), testing if domains intersect by this heuristic for examples used in [2, 10] is comparable to that (intersectsp), and for calculating the intersection of domains (inter- achieved there. However, a formal analysis of complexity has not section). yet been completed. A prototype implemented in Java exists. Objec- tives in designing it were to keep it completely decoupled from any particular configurator, while using little machinery in order to keep 2 Introduction and Notation software maintenance costs low. Testing both on abstract examples In [8] I look at tables that list excluded combinations of product prop- and on tables that resemble real customer data is ongoing and looks erties. I term these negative variant tables. The techniques and the promising. Non-atomic table cells (such as real intervals, or value associated prototypical implementation I present here cover that ap- sets) are supported. My approach to negative variant tables [8] has proach as well. Here, except in Section 7, I limit the exposition to been incorporated into the implementation. positive tables. For simplicity, I take a positive variant table T to be given as a 1 Preamble relational table of values (atoms). I relax the assumption about T being relational later in Section 3.3. If T has k columns and r rows Following the usage of the SAP Variant Configurator - SAP VC [4] I it is an r × k array of values: T = aij ; i = 1 . . . r; j = 1 . . . k. k term a table that lists all valid combinations of properties of a product is the arity of T . Each column is mapped to a product property such as a variant table. One use of a variant table in configuration is as a as Color, Size, . . . . The product properties are denoted by vj : j = table constraint. However, variant tables and their maintenance by 1 . . . k. modelers entail some special considerations that have implications Following notation in [8], I define the column domains πj (T ) as beyond what is formally captured by that concept: the set of all values occurring in the j-th column of T :3 r • The product properties and the values referred to have a business [ πj (T ) := {aij ∈ T } (1) meaning outside of configuration. Value domains for a product i=1 property are naturally maintained in a defined sort order • Variant tables may be stored in a database table outside the model. I define sj as the number of elements in πj (T ): A data definition in a database management system (DBMS) may sj := |πj (T )| (2) exist that defines database keys etc. • Tables will often be relational, i.e., table cells will be atomic val- I call ues (strings or numbers), but non-atomic entries may occur 2 π(T ) := π1 (T ) × . . . × πk (T ) • Customers have the tendency to maintain large wide tables, i.e., the global domain tuple for v1 , . . . , vk and I denote a run-time do- normalization is often sacrificed in favor of fewer tables. In such main restriction for the product property vj as Rj ⊆ πj (T ). cases, compression techniques seem particularly advantageous For ease of notation, I refer to any subset of π(T ) that is a Carte- • A variant tables can have its own individual update cycle. The sian product as a c-tuple. Both π(T ) itself and the tuple of run-time overall model should not need to be compiled or otherwise pro- restrictions R := R1 × . . . × Rk are c-tuples. cessed each time a change is made to a table T can be seen as a set of value tuples and, as such, T ⊆ π(T ), but T is not necessarily a c-tuple. In the special case that T itself is The relevant functionality a configurator has to provide regarding a c-tuple4 , i.e., T = π1 (T ) × . . . × πj (T ), then variant tables is to k X • restrict domains via contraint propagation (general arc consistency s := sj (3) j=1 GAC (see [3])), treating the variant table as a constraint relation. 3 π (T ) can be seen as the projection of T for the j-th column 1 SAP SE, Germany, email: albert.haag@t-online.de j 4 In this case, it holds that for any given c- tuple (run-time restriction) R 2 The SAP VC allows real-valued intervals and and sets of values in table cells. Such a table cannot be transparently stored as a relation table in a π(T ∩ R) = T ∩ R DBMS 89 Juha Tiihonen, Andreas Falkner, and Tomas Axling, Editors Proceedings of the 17th International Configuration Workshop September 10-11, 2015, Vienna, Austria values suffice to represent the N := ( kj=1 sj ) tuples in T . In this disjoint subtables based on a particular heuristic. Q case, the c-tuple π(T ) is a compressed way of representing the ar- In Section 3, I introduce the basic approach to decomposing a ta- ray aij . This observation is central to attempts to compress a given ble. In Section 4, I discuss the heuristic. My running example is a table into a disjoint union5 of as few as possible c-tuples. It has been (single) variant table listing all variants of a t-shirt. The example is utilized in various other work. Notably, I cite [10, 6] in this context. taken and adapted further from [2]. The t-shirt has three properties When T is viewed as a constraint relation, v1 . . . vk are just the Color (v1 ), Size (v2 ), and Print (v3 ) with global domains: constraint variables, and each value x ∈ πj (T ) maps to a propo- sition p(vj , x) that states that vj can be consistently assigned to x: • π1 (T ) := {Black, Blue, Red, W hite} p(vj , x) |= (vj = x). In this case, a row (tuple) ri in T directly maps • π2 (T ) := {Large, M edium, Small} to a conjunction of such propositions: ri = (ai1 , . . . , aik ) |= τi := • π3 (T ) := {M IB(Men in Black), ST W (Save the Whales)} p(v1 , ai1 ) ∧ . . . ∧ p(vk , aik ), and T itself represents the disjunction: Of the 24 possible t-shirts only 11 are valid due to constraints that Wr T |= τi . state that M IB implies black and ST W implies ¬small as de- i=1 picted in table (1). In Section 5, I extend this example to be slightly Given the definition of s in (3), there are s distinct propositions more complex. associated with T , one for each value in each πj (T ). Hence, given I have implemented a prototype in Java that meets the functionality any value assignment to these s propositions, T , seen as a logical requirements listed above. Here, I refer to it as the VDD prototype. It expression implementing the constraint relation, will evaluate to 1 functions standalone, independent of any particular configurator. A (>/true) or 0 (⊥/false), and T defines a Boolean function: feature of this implementation is that it can be selectively applied to F : 2π1 (T )×...πk (T ) → {0, 1} (4) some tables, while processing others with the existing means of the configurator. Results obtained using this prototype validate the ap- F can be represented by a BDD (Ordered Binary Decision Diagram) proach (see Section 6). In Section 7, I comment on results for double or one of its cousins [12]. BDDs have the enticing property that find- negation of variant tables, an approach I develop in [8]. Real run- ing the right ordering of their Boolean variables (the propositions time performance evaluations have not been done, but in Section 8 p(vj , x)) can lead to a very compact representation. Furthermore, I discuss what results I have. I close this paper with an outlook and this representation can potentially be found by existing agnostic op- conclusions (Section 9). timization algorithms. The complexity of this optimization is high, Finally, a disclaimer: While the motivation for this work lies in my and heuristics are employed in practice. The construction of an op- past at SAP and is based on insights and experiences with the prod- timal BDD is not suitable for configuration run-time, but must be uct configurators there [4, 7], all work on this paper was performed performed in advance. Hence, this approach is referred to as a compi- privately during the last two years after transition into partial retire- lation approach. For configuration, this has been pursued in [9]. The ment. The implementation is neither endorsed by SAP nor does it approach using multi-valued decision diagrams (MDD) [2] is related reflect ongoing SAP development. to the BDD approach. Zero-Suppressed decision diagrams (ZDDs) [12] are another flavor of BDD, which I refer to again below. From a database point of view, approaches based on compression 3 Decomposition that allow fast reading but slow writing have been developed, among 3.1 Column Oriented Decomposition them column-oriented databases [14]. The SAP HANA database sup- ports column orientation as well. My work, here, is not directly based Let s be defined as in (3). Given a table T of arity k with r rows on database techniques, although thinking about column-orientation ri = (ai1 , . . . , aik ) and selecting one of the s propositions p(vj , x) was the trigger for the heuristics detailed in Section 46 . associated with T , then T can be decomposed into those rows in Both the BDD approaches and the database approaches entail a which x occurs in column j and those where it doesn’t. Define maintenance-time transformation into a compact representation that L(T , j, x) as the sub-table of T consisting of those rows that do facilitates run-time evaluation, and both strive for a compact repre- not reference p(vj , x): sentation (“hard to write, easy to read”). This would also apply to various approaches at identifying c-tuple subsets of T whether with L(T , j, x) := {ri = (ai1 , . . . , aik ) ∈ T | aij 6= x} (5) the explicit notion of achieving compression [10] or of simply speed- ing up constraint propagation (GAC) algorithms [6]. L(T , j, x) has the same arity k as T . In the complementary sub- Thus, all these approaches could be termed as compression or table T \ L(T , j, x) all values in the j-th column are equal to x compilation or as both. Indeed, I conjecture that the ultimately by construction. Define R(T , j, x) as the sub-table of arity (k − 1) achievable results to be more or less identical, up to differences obtained by removing the j-th column from T \ L(T , j, x): forced by the respective formalism, which may be unfavorable in some circumstances. For example, my experiences suggest that a R(T , j, x) := {(aih ) ⊂ (T \ L(T , j, x)) | h 6= j} (6) BDD may be less suitable than a ZDD for compiling a single table constraint, because in the latter propositions need only be represented Given a table T and a proposition p(vj , x) associated with T , then where they occur in positive form (see [12]). The approach I follow I call L(T , j, x) defined in (5) the left sub-table of T and R(T , j, x) here is motivated by looking at ways of decomposing a table into defined in (6) the right sub-table of T . Either L(T , j, x) and/or R(T , j, x) may be empty. 5 I exclusively use the term disjoint union to refer to a union of disjoint sets The variant table of the 11 variants of the t-shirt example is shown [5]. I denote the disjoint union of two sets A and B by A∪· B which implies in Table 1. It illustrates a decomposition of this table table based that A ∩ B = ∅ 6 It would be interesting to investigate whether a column-oriented database on the proposition p(v2 , M edium). L(T , 2, M edium) is in bold- could in itself be beneficially employed in this context. This was proposed face, and R(T , 2, M edium) is underlined. (The value block of cells by colleague at SAP some time ago, but has not been followed up on {ai2 ∈ T | ai2 = M edium} is in italics.) Juha Tiihonen, Andreas Falkner, and Tomas Axling, Editors 90 Proceedings of the 17th International Configuration Workshop September 10-11, 2015, Vienna, Austria Table 1. Basic decomposition of a table Figure 2 shows an entire VDD8 .9 Color Size Print Black Small MIB 1:(2, Small) Black Medium MIB Black Large MIB Black Medium STW 2:(1, White) Black Large STW White Medium STW White Large STW 3:(3, STW) Red Medium STW Red Large STW Blue Medium STW 4:(1, Black) Blue Large STW Blue Small STW 5:(2, Medium) 4:(1, Black) The decomposition process can be continued until only empty sub- 7:(2, Large) 3:(3, STW) 5:(2, Medium) tables remain. The question, which propostion (value block) to de- compose on next at each non-empty subtable will depend a suitable 8:(1, Red) 5:(2, Medium) 6:(3, MIB) 4:(1, Black) heuristic (see Section 4). 9:(1, Blue) 7:(2, Large) 6:(3, MIB) 3.2 Variant Decision Diagram - VDD ⊤ ⊥ A variant table can be represented as a decomposition tree. The root represents the entire table. It is labeled with a first chosen proposition p(vj1 , x1 ). It has two children. One, termed the left child, represents Figure 2. VDD of t-shirt using heuristic h1 (Section 4) the left sub-table L(T , j1 , x1 ). The other, termed the right child, rep- resents the right sub-table R(T , j1 , x1 ). Each of these children can in turn have children if it can be decomposed further. An empty left child is labeled by a special symbol ⊥. An empty right child is la- In Figure 2 each node is labeled in the form hp : (j, val)i, beled by a special symbol >. (All leaves of a decomposition tree are where (j, val) is the column/value pair that denotes the proposition labeled either by ⊥ or >.) Figure 1 shows a graphic depiction7 . It p(vj , x), and p is a unique identifier for the node/proposition. The also shows the further decomposition of R(T , j1 , x1 ), here indicat- identifiers are contiguously numbered. Thus, the set of propositions ing that it has two empty children. for T can be seen as totally ordered according to this numbering. The ordering underlying the graph in Figure 2 is p(v2 , Small), p(v1 , x)W hite, p(v3 , x)ST W , p(v1 , Black), B(T,j,x) p(v2 , M edium), p(v3 , M IB), p(v2 , Large), p(v1 , Red), p(v1 , Blue) Left child: L(T,j, x) Right child: R(T,j,x) Given that such a total ordering can be identified in the decompo- ⊥ ⊤ sition, the resulting VDD may be seen as a ZDD (zero-suppressed decision diagram) [12]. The gist of the algorithms for evaluation of BDDs and their cousins given in [12], such as Algorithm C and Al- Figure 1. Basic scheme of a decomposition gorithm B would apply. However, I have not made any verbatim use of these so far.10 VDDs have certain special characteristics beyond ZDDs. A termi- nal HI-link always leads to >, and a terminal LO-link always leads Identical subtables may arise at different points in the decompo- to ⊥. This and further characteristics that are ensured by the heuristic sition tree. A goal of minimal representation is to represent these h2 given in Algorithm 1 allow certain short-cuts in the implementa- multiple occurrences only once by transforming the decomposition tion (see Sections 4 and 6). tree into a directed acyclic graph (DAG), which I call a VDD or vari- 8 By conventions established in [12], I term a link to the left child as a LO- ant decision diagram. All leaves can be identified with one of two link, drawn with a dotted line, preferably to the left, and a link to the right predefined nodes also labeled ⊥ and >. Subsequently, all nodes la- child as a HI-link, drawn with a filled line, preferably to the right. A LO- beled by the same proposition p(vj , x) that have identical children link is followed when the proposition in the node label is disbelieved. A are represented by re-using one shared node. This reduction can be HI-link is followed when it is believed. The terminal nodes ⊥ and > are accomplished by an algorithm in the spirit of Algorithm R in [12]. called sinks 9 The amount of compression achieved in figure 2 is not overwhelming. The heuristic h2 does better (see figure 3) 7 For simplicity in creating the graph, the label of the root node is given as 10 Both heuristics h1 and h2 in Section 4 are designed to guarantee such an B(T, j, x) for p(vj1 , x1 ) ordering, but this is doesn’t seem essential to the VDD approach in general 91 Juha Tiihonen, Andreas Falkner, and Tomas Axling, Editors Proceedings of the 17th International Configuration Workshop September 10-11, 2015, Vienna, Austria 3.3 Set-labeled Nodes this is not the case for a set representation of c-tuples. Hence, a VDD is a more compact representation. There is an additional reduction of a VDD I have implemented as Also, there may be external sources for set-labeled nodes if the an option. This is most easily described using the concept of an l- maintained variant table is not relational. The SAP VC allows mod- chain. Define the subgraph composed of a node and all its descen- eling variant tables with cells that contain real-valued intervals as dent nodes that can be reached from it following only LO-links as well as a condensed form, where a cell may contain a set of values. the l-chain of the node11 . Nodes in an l-chain that pertain to the Such cells can be directly represented as set-labeled nodes. same column and have the same right child can be joined into a sin- I close this section in noting that in [10] a Boolean function as in gle node. Let p(vj1 , x1 ), . . . , p(vh , xh ) be the propositions in the (4) is also used to construct a decomposition of a table into disjoint labels of members in an l-chain that can be joined. The resulting union of Cartesian products (c-tuples). There the resulting decompo- combined node is labeled with the disjunction of these propositions: sition into c-tuples is the goal. Here the VDD is the goal, as I base P := p(vj1 , x1 ) ∨ . . . ∨ p(vh , xh ). This disjunction is represented, h the evaluation on it (see Section 8). S for short, by the (non-atomic) set of values X := xh occurring i=1 in P . In the sequel, I refer to a node by 4 Heuristics For the exposition in this section, I assume T to be in relational form ν(j, X) (7) with arity k. where j is the column index of the referenced product property vj , The graph in Figure 2 is derived using on an initial heuristic h1, and X is the set of values represented by the node12 . In case the node which I tried. h1 is based on trying to minimize splitting value blocks label is a single atomic value {x}, I also denote this by ν(j, x). during decomposition. A value block for a proposition p(vj , x) is the Figure 3 is a graph of the t-shirt using set-labeled nodes13 . It is not rows in a table T that reference that proposition. Decomposing T on a reduction of Figure 2, but rather a reduction of the graph in Figure some other proposition p(vh , y) will split p(vj , x), if it has rows in 4 (see Section 4). both L(T , j, y) and R(T , j, y). Subsequently, both of these children need to be decomposed by p(vj , x), whereas a single decomposition would have handled p(vj , x) for T at its root level. It is assumed that keeping large value blocks intact is good, and the order of decompo- 1:(3, MIB)|7 sition decisions should incur as little damage to these value blocks as possible. In order to apply this heuristic, all value blocks, their sizes, and the row indices that pertain to each one are initially calculated 2:(3, STW)|6 and stored. I do not go into further detail on this here. The current implementation relies on characteristics of a VDD en- 12:(2, [Large, Medium])|5 10:(2, [Large, Medium, Small])|3 sured by the decomposition heuristic h2 given in Algorithm 1. Recall that the total number of propositions is s, defined by (3), and that the number of propositions that pertain to the j-th column is sj , defined 11:(1, [Black, Blue, Red, White])|4 6:(1, Black)|2 by (2). Algorithm 1 (Heuristic h2) F T First, define P (T ) as an ordered set of all s propositions p(vj , x) by 1. Sorting the k columns of T by sj , ascending (largest values last). p(vj , x), below, refers to the j-th column with respect to this or- Figure 3. VDD of t-shirt with set-labeled nodes dering of the columns 2. Within each column, sort the values by their business order (any defined order the implementation can readily implement). xpj refers to the p-th value in the j-th column domain πj (T ). By inspection, the complexity of the graph in Figure 3 is compara- Then, ble to that obtained for the merged MDD in [2] (Figure 2 (b) there). 1. Make a root node of the VDD for the first proposition in the first This further reduction is important from the viewpoint of com- column: p(v1 , x11 ) pression, as each path from the root of the graph to a sink > can 2. While nodes with a non-terminal subtable remain: split them al- be seen as a c-tuple in the solution of T , and a set-labeled node ways using the first proposition (value block) in their first column reduces the number of such c-tuples. It is also important from the 3. Optionally, collect members of an l-chain with the same child viewpoint of run-time performance, if set intersection (intersectsp) nodes into an aggregated set-labeled node. I refer to this variant is more efficient than multiple membership tests. A VDD using set- of the heuristic as h2∗ labeled nodes should result in similar c-tuples as the approach in [10] 4. Reduce the nodes by unifying equivalent nodes as discussed in (depending of course on the heuristic). A key difference is that VDD Section 3.2 paths may share nodes (c-tuples sharing common tails)14 , whereas Figure 4 shows the graph of Table 1 produced using Algorithm 11 A maximal l-chain would be one for a node which does not itself occur as 1 without set-labeled nodes (h2). Figure 3 shows the same graph the left child of any other node 12 For the exposition, here, X represents a finite disjunction of propositions. produced with set-labeled nodes (h2∗ ). In the implemented prototype it can also be an interval with continuous A decomposition based on Algorithm 1 has the following charac- bounds teristics: 13 The new set-labeled nodes are assigned a uniquely identifying node num- ber outside the range used for numbering the propositions • After k HI-links the >-sink is always reached. (This is trivially 14 Figure 5 has shared nodes true for all VDDs, as each row consists of k elements) Juha Tiihonen, Andreas Falkner, and Tomas Axling, Editors 92 Proceedings of the 17th International Configuration Workshop September 10-11, 2015, Vienna, Austria The new values combine with everything, except that no rows are added for combinations of M IB (print) and Y ellow (color). The 1:(3, MIB)|13 table of all variants then has 73 rows (as opposed to the 11 of table 1)16 . The graph is given in Figure 5 2:(3, STW)|12 1:(3, MIB)|12 3:(2, Large)|11 2:(3, STW)|11 4:(2, Medium)|10 3:(3, none)|10 6:(2, Small)|8 18:(2, [Large, Medium, Small])|5 6:(1, Black)|9 3:(2, Large)|5 20:(2, [Large, Medium, Small, XL, XXL])|9 19:(2, [Large, Medium, XL, XXL])|7 17:(2, [XL, XXL])|4 7:(1, Blue)|8 4:(2, Medium)|4 16:(1, [Black, Blue, DarkPurple, Red, White, Yellow])|3 11:(1, DarkPurple)|6 15:(1, [Black, DarkPurple])|2 8:(1, Red)|7 5:(2, Small)|3 F T 9:(1, White)|6 6:(1, Black)|2 Figure 5. VDD of extended t-shirt with set-labeled nodes T F Now 11 nodes are needed instead of the 6 nodes in Figure 3. The table is decomposed into 5 c-tuples. The node labeled h16 : Figure 4. Basic VDD of t-shirt using algorithm 1 (1, [Black, Blue, DarkP urple, Red, W hite, Y ellow])i is shared by three parents. • All nodes in an l-chain (i.e., linked via LO-links) will always pertain to the same column. This follows from the fact that the 6 Empirical Compression Results columns of any non-empty left subtable of a table T are the same A prototype I have implemented in Java was tested for functional as the columns of T . The heuristic says to always choose from the correctness against small exemplary tables such as the t-shirt model first column given in Table 1. This set of exemplary tables also contains some • A node pertaining to the jth column is always (j − 1) HI-links negative variant tables to test the approach in [8]. I refer to the im- distant from the root node. This follows by iterating the argument plementation as the VDD prototype. It was further tested against 238 that if a table T is decomposed by a proposition p(v1 , x) refer- relational variant tables taken from three product models used at SAP encing its first column, then the right sub-table T has the second in configurator maintenance issues. Since this data is proprietary, I column of T as its first column (by construction)) give only summary statistics on the results. Testing on publicly avail- Note that for BDDs an optimal ordering of P (T ) is important able models, such as the Renault model [1] is a next step for future to achieve a minimal graph. Thus, the search space for finding this work. is s! (s factorial). In Algorithm 1 only the order of the columns is It proved possible to successfully compile all 238 tables in this test important. It is not important how the values are ordered within the base with all of the following three approaches: column. To see this, note that the proposition p(v1 , x11 ) used in de- • the heuristic h1 used to obtain the graph in Figure 2 composition slices T horizontally15 . The slices obtained overall with • the heuristic h2 in Algorithm 1 - without merging nodes to set- respect to all values xp1 in the first column are the same, regardless labeled nodes of the order of the values in a column. Thus, the search space for • the heuristic h2∗ in Algorithm 1 - with merging nodes to set- an optimal column order is merely k!. As Algorithm 1 indicates, I labeled nodes am currently only exploring one ordering of columns, supposing that it will dominate the others. However, this still needs to be verified Table 2 gives statistics on the table size and complexity. It lists the empirically. minimal, maximal, and average values, as well as the values for the four quartiles Q1 , Q2 , Q3 , Q4 , for each of the following parameters: 5 Example of Extended T-shirt Model k (arity), r (number of rows), s (number of propositions), and N (number of cells - k ∗ r). In [8] I extended the t-shirt by adding the colors Y ellow and There is one table with arity one. This is used as a technique of DarkP urple, the sizes XL and XXL, and the print none to the dynamically defining a global domain for a product property in a global domains πj (T ) (given in Section 2): variant table, rather than doing this directly in the declaration of the product property in the model. This technique has the disadvantage π1 (T ) = {Black, Red, W hite, Blue, Y ellow, DarkP urple} that it is more difficult to ensure translation of all relevant texts asso- π2 (T ) = {Large, M edium, Small, XL, XXL} ciated with a value in multi-lingual deployments of the configuration π3 (T ) = {M IB, ST W, none} solution. It may, however, be used to dynamically restrict a large, 15 The terminology is inspired by [6] 16 I elaborate on the derivation of this example in [8] 93 Juha Tiihonen, Andreas Falkner, and Tomas Axling, Editors Proceedings of the 17th International Configuration Workshop September 10-11, 2015, Vienna, Austria some components that are non-linear to process, whereas h2 does Table 2. Size statistics on 238 SAP VC variant tables not. Not surprisingly, using merged nodes further reduces both the number of nodes (n) in the VDD as well as the number of distinct la- Range k r s N bels of the nodes (s). The times are obtained on my Apple Mac mini Minimum 1 1 2 2 Q1 2 14 17 42 with 2.5 GHz Intel Core i5 and 8GB memory. Times on other de- Q2 3 56.5 46 176 velopment PCs (both Windows and MacBook) are comparable. The Q3 5 137.5 93.75 635.5 time to compile the largest table with h1 is almost 20 minutes, but it Q4 16 21372 998 213720 takes less than one second using h2 with and without merging for the Average 4.29 238.53 79.04 1590.51 same table. Thus, compiling these tables into VDDs with h2 would Maximum 16 21372 998 213720 almost be feasible at run-time. Heuristic h2 strictly dominates h1 with respect to achieved pre-defined, enterprise-wide global domain to the requirements of a compression (smaller number of nodes) for 143 of the 238 ta- particular product. General modeling considerations and experiences bles18 . For 71 tables the same compression was achieved. For with the SAP VC are elaborated in [4]. 24 tables h1 strictly dominates h2. Table 4 compares the advan- The largest arity is 16. The associated table has only 76 rows. The tages/disadvantages of h2 over h1. The three cases are labeled largest table with 21372 rows has arity 10. The table with the largest “h2 > h1” (h2 strictly dominates h1), “h2 = h1” (indifference), number of propositions (998) has arity six and 469 rows. and “h2 < h1” (h2 is strictly dominated by h1). Table 4 lists av- Table 3 gives statistics on the achieved compression for the three erages for the following parameters: T ab, s, N , n, ∆R , and ∆t. compression techniques. (I discuss double negation separately, in T ab is the number of tables that pertain to that case. The parameters Section 7.) The variant tables are partitioned into the four quartiles s, N , and n are as defined above for Table 3. ∆R is the weighted for s (number of distinct node labels17 ). These are denoted by Qs1 , difference in reduction: ∆R = T ab ∗ (reducth2 − reducth1 ). It is Qs2 , Qs3 , and Qs4 . Averages are given for each of these four parti- positive where h2 has the advantage. ∆t is the weighted difference tions for the following parameters: s, N (number of cells in variant in compile time: ∆t = T ab ∗ (th2 − th1 ) in milli-seconds. It is nega- table), n (number of nodes), reduct (reduction: (N − n)), and t tive where h2 has the advantage. The last row gives the averages per (compilation time in milli-seconds). Explicit results are also given table ∆R/238 and ∆t/238 over all rows. for the table with largest number of cells (Max N ), largest number of propositions (Max s), and largest arity (Max k), as well as the overall Table 4. Averages for dominated heuristics average. Table 3. Average compression on 238 SAP VC variant tables Dominance T ab N s n ∆R ∆t h2 > h1 143 867.06 85.03 222.59 7702 -563065 h2 = h1 71 114.76 54.83 56.00 0 -1208 h2 < h1 24 10266.88 114.92 282.58 -992 -1206770 Range N Heur s n reduct t (msec) Average 28.19 -7446.43 h1 17 20 15.75 4 Qs1 42 h2 17 18 19.5 0 h2∗ 8 8 28.75 0 h1 46 73.5 95.5 17.5 The largest table is one where h1 strictly dominates h2. However, Qs2 176 h2 46 65 102.5 1 the compile time using h1 is almost 20 min. That for h2 is 0.6 sec. h2∗ 17 19.5 155 1 Overall, h2 seems to prevail over h1. The average gain in reduction h1 93.75 184 418.5 112.75 Qs3 635.5 h2 93.75 154.75 489.5 3 over all 238 tables is 28.19. The average gain in compilation time is h2∗ 53.75 74.5 558.5 5 7446.43 (msec). The large gain in the latter is due to the non-linear h1 998 3756 213329 1192988 performance of h1, which makes it grossly uncompetitive for large Qs4 213720 h2 998 2728 213289 598 tables. Further experiments with other column orderings and with h2∗ 988 2381 213575 659 other heuristics remains a topic of future work. h1 79.04 178.95 1411.57 5475.59 Average 1590.51 h2 79.04 149.95 1440.56 9.77 h2∗ 50.32 83.92 1506.59 12.43 h1 152 391 54793 1192988 7 Excursion: Negation Max N 213720 h2 152 431 213289 598 h2∗ 70 145 213575 659 In [8] I describe double negation of a positive table as one approach h1 998 998 868 478 to compression that is completely independent of the VDD mecha- Max s 2814 h2 998 998 868 86 nism. Being able to negate a table, i.e. calculate the complement with h2∗ 130 130 1736 91 respect to a given domain restrictions is central in that approach. h1 181 885 331 874 Max k 1216 h2 181 653 563 5 My implementation of the VDD-prototype supports the approach h2∗ 182 649 567 5 in [8], and supports negation of a VDD. A BDD can be negated by switching the sinks ⊥ and >. This doesn’t work for a VDD (and for a ZDD in general). Algorithm 2 gives the spirit of my implementation The compilation times of h2 are drastically better for large tables for negating a VDD produced using Algorithm 1 for a variant table T than those for h1. This is because the information needed by h1 has of arity k against a domain restriction tuple R. It produces a negated VDD that has set-labeled nodes. 17 For VDDs without merged nodes this is just the number of propositions. For VDDs with merged nodes this is a different number, because labels 18 As merging could potentially also be done in conjunction with heuristic for disjunctions of propositions are added, but not all original propositions h1, it is not a fair comparison to compare the number of nodes achieved still explicitly appear as node labels with h1 against that achieved with h2∗ Juha Tiihonen, Andreas Falkner, and Tomas Axling, Editors 94 Proceedings of the 17th International Configuration Workshop September 10-11, 2015, Vienna, Austria Algorithm 2 (Negation) j to determine admissibility of all occurring node labels. Given Start with the root node of V. Negate it as described below that the domains are naturally ordered, binary search can be used to speed this up. The VDD prototype also imposes an ordering on • If ν is a non-negated terminal node, i.e., ν references the last col- the node labels to facilitate this umn index k, replace it with a node ν that is assigned the comple- • Let n be the number of nodes in V. The question of which nodes mentary label LC(ν) := πj (T )\LC(ν), where LC(ν) is defined have admissible paths to >, is related to the problem of count- as the union of all values/sets in the l-chain19 of ν ing the admissible paths. After determining which node labels are • If ν is a non-negated non-terminal node that references column admissible, this has the remaining complexity of O(n) (c.f., Algo- index j < k, negate it by doing the following in order: rithm C in [12]) – negate each right child of each node in its l-chain in turn In the case that V is a VDD without set-valued nodes, i.e., all node – add a node ν⊥ to the end of the l-chain of ν , where ν⊥ repre- labels are of the form X = {x}, V ∩ R is just the solution set of sents the c-tuple LC(ν)×Rj+1 ×. . .×Rk . ν⊥ itself is labeled T ∩ R, where T is the variant table encoded by V. But, if V has with LC(ν). Its right child is a node labeled Rj+1 which has a non-atomic set-valued nodes20 T ∩ R ⊆ V ∩ R. Here, the evaluation right child Rj+2 . . . . All LO-links of added nodes point to ⊥ comes at the slight additional cost of determining the solution set by additionally intersecting V ∩ R with R.21 The intersection of two c- Prune any unneeded nodes that have an empty set as a label or have tuples is easy to calculate. Thus, this additional cost is offset, because a right child that is pruned, suitably rerouting a link to a pruned node determining the admissibility of each node is now faster (a smaller to the left child of the pruned node number of intersectsp tests hopefully offsets the otherwise greater number of memberp tests). The VDD prototype actually does the pruning of empty nodes on Real run-time measurements have not yet been performed. How- the fly. If the root node itself is pruned, the entire resulting VDD is ever, in the beginning, in trying to determine whether the VDD ap- empty after negation. proach is worthwile, I did an experimental integration with the (Java- The fact that double negation of a positive table needs to yield the based) SAP IPC configurator (see [4]) with the product models en- same solution set as the original table (see [8]) provides a straightfor- compassing the 238 tables mentioned in Section 6. The software con- ward possibility to test for the correctness of this approach to nega- figuration I used was completely non-standard, so any results are not tion. objectively meaningful, but they did encourage me to pursue this ap- In order to avoid complexity issues with very large complements, I proach. The expectations on performance gains might be roughly ori- so far applied double negation only in those cases where the number ented on the inverse of the compression ratio N/n (total number of of tuples in the complement was smaller or equal to the number of table cells/number of nodes). For heuristics h1/h2/h2∗ the averages tuples in the original table. 57 of the 238 SAP VC variant tables that for N/n are 2.2/2.4/3.9, respectively. But, this does not account for are the basis for the results I presented in Section 6 proved amenable losses due to the overhead of needing more complex set operations. to double negation in this sense. Of these, 18 did not yield a smaller In any case, real run-time measurements need to be performed as a VDD than that obtained with heuristic h2∗ . For the remaining 39 the future step. maximal gain was 4 nodes, the average gain was 1.89 nodes. I close this section with a remark on using a VDD as a simple Run-time performance tests have yet to be made, but these results database. A query with an instantiated database key is equivalent to raise the question, whether double negation will add value over di- a run-time restriction R, where the key’s properties are restricted to rect compression. However, the concept of double negation is inde- singleton values, and the other properties are unconstrained (or have pendent of the VDD concept, and could be applied on its own with- the domain that is established at the time of the query). The solution out using VDDs. Also, it remains to be seen, if the test on whether set of the VDD will contain only tuples consistent with the query constraint propagation can be gainfully applied, given in [8], proves (by construction). If, for example, the key is defined as unique in the valuable. database (and the table content is consistent with this definition), the result can contain at most the unique response (or the empty set, if 8 Evaluation of a VDD no row for the key exists in the table). For queries with non-unique database keys, the solution set needs to be intersected with R, as Given a run-time restriction R, a VDD V, and a node ν(j, X) in V discussed above. (using the notation in (7)). ν(j, X) can be marked as out if X ∩Rj = ∅. The admissible solutions of V are all paths from the root node to the sink > that do not contain any node marked out. I denote this set 9 Conclusion as V ∩ R, for short. If there are no such paths, then R is inconsistent given V. Although it remains to explore other variants of Algorithm 1 with dif- I do not go into further detail on how to determine V ∩ R. This fol- ferent orderings of the columns, the compression achieved with the lows the spirit of known algorithms for directed acyclic graphs. For current version (both h2 and h2∗ ) has been surprisingly satisfactory. example, see [12] for an exposition in the context of BDDs/ZDDs. The very short compile times may be of more practical advantage Concerning the general complexity of the calculations: than a more expensive search for heuristics that provide (somewhat) better results. However, further investigation into search and heuris- • Let sj be the number of distinct node labels pertaining to the j-th tics of an altogether different type should to be done more completely column of V (as in (2), but modified to allow for set-valued labels). and formally. This is future work. sj node-labels must be intersected with Rj for each column index 20 Either merged nodes or nodes representing continuous intervals 19 Defined in Section 3.3, the l-chain of a node is the sub-graph consisting 21 To see this, suppose that T is itself a c-tuple, so that there is at most of the node and all nodes reachable from it via LO-links, but excluding the one admissible c-tuple consisting of T itself. Obviously R may be more sink ⊥. All nodes in an l-chain reference the same column index j restrictive 95 Juha Tiihonen, Andreas Falkner, and Tomas Axling, Editors Proceedings of the 17th International Configuration Workshop September 10-11, 2015, Vienna, Austria The goal of this work was not only to find a good compression [2] H.R. Andersen, T. Hadzic, and D. Pisinger, ‘Interactive cost configura- technique, but also to provide a solution that can be used in conjunc- tion over decision diagrams’, J. Artif. Intell. Res. (JAIR), 37, 99–139, (2010). tion with a legacy configurator to enhance the handling of variant [3] C. Bessiere, ‘Constraint propagation’, in Handbook of Constraint Pro- tables, either as a whole or individually. The VDD prototype I imple- gramming, eds., F. Rossi, P. van Beek, and T. Walsh, chapter 3, Elsevier, mented uses little machinery and adds little to software maintenance (2006). (training, size of the code base, etc.). It also conveys little risk. In [4] U. Blumöhr, M. Münch, and M. Ukalovic, Variant Configuration with the event that some tables cannot be compiled to a VDD, the legacy SAP, second edition, SAP Press, Galileo Press, 2012. [5] K. Ferland, Discrete Mathematics, Cengage Learning, 2008. handling can be seamlessly kept. (This did not occur in the initial [6] Nebras Gharbi, Fred Hemery, Christophe Lecoutre, and Olivier Rous- explorations using the h1 heuristic with the test models.) The SAP sel, ‘Sliced table constraints: Combining compression and tabular re- VC is a configurator with a very large user base. Thus, any change duction’, in CPAIOR’14, pp. 120–135, (2014). to it comes with large risk. This is the type of situation I had in mind [7] A. Haag, ‘Chapter 27 - product configuration in sap: A retrospective’, in Knowledge-Based Configuration, eds., Alexander Felfernig, Lothar when designing the VDD prototype.22 Hotz, Claire Bagley, and Juha Tiihonen, 319 – 337, Morgan Kaufmann, In the course of this work I have come to believe that all of the Boston, (2014). following three approaches to speed up variant table handling and to [8] A. Haag, ‘An approach to arc consistency with negative variant tables’, look for a compact representation yield very similar results: in Proceedings of the 17th International Configuration Workshop, Vi- enna, Austria, September 10-11, 2015., pp. 81–87, (2015). • BDDs in various flavors ([9, 2]) [9] Tarik Hadzic, ‘A bdd-based approach to interactive configuration’, in • Compression into c-tuples and constraint slicing ([6, 10]) Principles and Practice of Constraint Programming - CP 2004, 10th International Conference, CP 2004, Toronto, Canada, September 27 • Read optimized databases (such as column-oriented databases - October 1, 2004, Proceedings, ed., Mark Wallace, volume 3258 of ([14])) in conjunction with a constraint propagation algorithm Lecture Notes in Computer Science, p. 797. Springer, (2004). (e.g. the STR-algorithm [13]) [10] G. Katsirelos and T. Walsh, ‘A compression algorithm for large arity extensional constraints’, in Principles and Practice of Constraint Pro- The differences are in the machinery needed for the intended de- gramming - CP 2007, 13th International Conference, CP 2007, Prov- ployment and in the heuristics that suggest themselves. Although the idence, RI, USA, September 23-27, 2007, Proceedings, ed., Christian Bessiere, volume 4741 of Lecture Notes in Computer Science, pp. 379– representation as a VDD is central to my approach at compression 393. Springer, (2007). and evaluation, functionally, the two important aspects in practice are [11] H. Keller, The Official ABAP Reference, number Bd. 1 in Galileo SAP that it functions as a (limited) replacement for a database, and that it Press, Galileo Press, 2005. performs constraint propagation. I would see as a topic of future work [12] D.E. Knuth, The Art of Computer Programming, volume 4A Combina- to both look more closely at read optimized databases and to investi- torial Algorithms Part 1, chapter Binary Decision Diagrams, 202–280, Pearson Education, Boston, 2011. gate, if the VDD approach can be extended to support more complex [13] C. Lecoutre, ‘STR2: optimized simple tabular reduction for table con- database queries. Investigating the commonality between the three straints’, Constraints, 16(4), 341–371, (2011). approaches (BDDs, compression, read-optimized databases) more [14] Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, formally could be another interesting topic. The VDD approach has Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth J. O’Neil, Patrick E. O’Neil, Alex Rasin, Nga Tran, elements of all three. and Stanley B. Zdonik, ‘C-store: A column-oriented DBMS’, in Pro- A note in closing: The compression algorithm in [10] is based on a ceedings of the 31st International Conference on Very Large Data very similar approach to table decomposition. I was not aware of this Bases, Trondheim, Norway, August 30 - September 2, 2005, eds., Kle- work until recently, so there are some unfortunate disconnects in con- mens Böhm, Christian S. Jensen, Laura M. Haas, Martin L. Kersten, ventions I use. I tend to follow [12], whereas in [10] left and right are Per-Åke Larson, and Beng Chin Ooi, pp. 553–564. ACM, (2005). used the other way around. I did adopt use of the term c-tuple. I think the column oriented view, here, is more intuitive and has resulted in more useful heuristics. Another major difference to [10], which I see, is that the I base evaluation directly on the VDD. Furthermore, the VDD supports nodes labeled with real-valued intervals, but his could also be added to [10] in a straightforward manner23 . ACKNOWLEDGEMENTS I would like to thank the anonymous reviewers and my daughter Laura for their constructive suggestions, on how to improve the intel- ligibility of this paper. I am afraid the current result may not yet meet their expectations, but I think it is a step forward from the previous version. REFERENCES [1] J. Amilhastre, H. Fargier, and P Marquis, ‘Consistency restoration and explanations in dynamic csps application to configuration’, Artif. Intell., 135(1-2), 199–234, (2002). 22 In my estimation, the effort to reimplement the current VDD functionality in SAP ABAP ([11]) for use with the SAP VC would be feasible from a cost and risk perspective 23 Basically treating an interval syntactically like a value in compilation, but evaluating it like a set at run-time Juha Tiihonen, Andreas Falkner, and Tomas Axling, Editors 96 Proceedings of the 17th International Configuration Workshop September 10-11, 2015, Vienna, Austria