Column Oriented Compilation of Variant Tables
                                                                        Albert Haag1


Abstract. The objective of this work is to improve variant table                       • query the table using a key defined in the DBMS
evaluation in a configurator by compiling/compressing the tables in-                   • iterate over the current solution set of the table given domain re-
dividually to reduce both processing time and space. The main re-                        strictions for the associated product properties
sult is a proposed simple heuristic for decomposing the variant table
                                                                                          I assume that an existing (legacy) configurator has already imple-
into subtables and a representation linking these subtables in a di-
                                                                                       mented this functionality. This will include means for efficiently test-
rected acyclic graph (DAG). The size of the compression obtained
                                                                                       ing membership in a domain (memberp), testing if domains intersect
by this heuristic for examples used in [2, 10] is comparable to that
                                                                                       (intersectsp), and for calculating the intersection of domains (inter-
achieved there. However, a formal analysis of complexity has not
                                                                                       section).
yet been completed. A prototype implemented in Java exists. Objec-
tives in designing it were to keep it completely decoupled from any
particular configurator, while using little machinery in order to keep                 2     Introduction and Notation
software maintenance costs low. Testing both on abstract examples                      In [8] I look at tables that list excluded combinations of product prop-
and on tables that resemble real customer data is ongoing and looks                    erties. I term these negative variant tables. The techniques and the
promising. Non-atomic table cells (such as real intervals, or value                    associated prototypical implementation I present here cover that ap-
sets) are supported. My approach to negative variant tables [8] has                    proach as well. Here, except in Section 7, I limit the exposition to
been incorporated into the implementation.                                             positive tables.
                                                                                           For simplicity, I take a positive variant table T to be given as a
1     Preamble                                                                         relational table of values (atoms). I relax the assumption about T
                                                                                       being relational later in Section 3.3. If T has k columns and r rows
Following the usage of the SAP Variant Configurator - SAP VC [4] I                     it is an r × k array of values: T = aij ; i = 1 . . . r; j = 1 . . . k. k
term a table that lists all valid combinations of properties of a product              is the arity of T . Each column is mapped to a product property such
as a variant table. One use of a variant table in configuration is as a                as Color, Size, . . . . The product properties are denoted by vj : j =
table constraint. However, variant tables and their maintenance by                     1 . . . k.
modelers entail some special considerations that have implications                         Following notation in [8], I define the column domains πj (T ) as
beyond what is formally captured by that concept:                                      the set of all values occurring in the j-th column of T :3
                                                                                                                             r
• The product properties and the values referred to have a business                                                          [
                                                                                                                πj (T ) :=       {aij ∈ T }                            (1)
  meaning outside of configuration. Value domains for a product                                                              i=1
  property are naturally maintained in a defined sort order
• Variant tables may be stored in a database table outside the model.                  I define sj as the number of elements in πj (T ):
  A data definition in a database management system (DBMS) may                                                         sj := |πj (T )|                                 (2)
  exist that defines database keys etc.
• Tables will often be relational, i.e., table cells will be atomic val-                   I call
  ues (strings or numbers), but non-atomic entries may occur 2                                              π(T ) := π1 (T ) × . . . × πk (T )
• Customers have the tendency to maintain large wide tables, i.e.,                     the global domain tuple for v1 , . . . , vk and I denote a run-time do-
  normalization is often sacrificed in favor of fewer tables. In such                  main restriction for the product property vj as Rj ⊆ πj (T ).
  cases, compression techniques seem particularly advantageous                            For ease of notation, I refer to any subset of π(T ) that is a Carte-
• A variant tables can have its own individual update cycle. The                       sian product as a c-tuple. Both π(T ) itself and the tuple of run-time
  overall model should not need to be compiled or otherwise pro-                       restrictions R := R1 × . . . × Rk are c-tuples.
  cessed each time a change is made to a table                                            T can be seen as a set of value tuples and, as such, T ⊆ π(T ),
                                                                                       but T is not necessarily a c-tuple. In the special case that T itself is
  The relevant functionality a configurator has to provide regarding                   a c-tuple4 , i.e., T = π1 (T ) × . . . × πj (T ), then
variant tables is to
                                                                                                                               k
                                                                                                                               X
• restrict domains via contraint propagation (general arc consistency                                                   s :=         sj                                (3)
                                                                                                                               j=1
  GAC (see [3])), treating the variant table as a constraint relation.
                                                                                       3 π (T ) can be seen as the projection of T for the j-th column
1 SAP SE, Germany, email: albert.haag@t-online.de                                          j
                                                                                       4 In this case, it holds that for any given c- tuple (run-time restriction) R
2 The SAP VC allows real-valued intervals and and sets of values in table
    cells. Such a table cannot be transparently stored as a relation table in a                                      π(T ∩ R) = T ∩ R
    DBMS


                                                                                  89                  Juha Tiihonen, Andreas Falkner, and Tomas Axling, Editors
                                                                                                    Proceedings of the 17th International Configuration Workshop
                                                                                                                         September 10-11, 2015, Vienna, Austria
 values suffice to represent the N := ( kj=1 sj ) tuples in T . In this                  disjoint subtables based on a particular heuristic.
                                               Q
 case, the c-tuple π(T ) is a compressed way of representing the ar-                        In Section 3, I introduce the basic approach to decomposing a ta-
 ray aij . This observation is central to attempts to compress a given                   ble. In Section 4, I discuss the heuristic. My running example is a
 table into a disjoint union5 of as few as possible c-tuples. It has been                (single) variant table listing all variants of a t-shirt. The example is
 utilized in various other work. Notably, I cite [10, 6] in this context.                taken and adapted further from [2]. The t-shirt has three properties
    When T is viewed as a constraint relation, v1 . . . vk are just the                  Color (v1 ), Size (v2 ), and Print (v3 ) with global domains:
 constraint variables, and each value x ∈ πj (T ) maps to a propo-
 sition p(vj , x) that states that vj can be consistently assigned to x:                 • π1 (T ) := {Black, Blue, Red, W hite}
 p(vj , x) |= (vj = x). In this case, a row (tuple) ri in T directly maps                • π2 (T ) := {Large, M edium, Small}
 to a conjunction of such propositions: ri = (ai1 , . . . , aik ) |= τi :=               • π3 (T ) := {M IB(Men in Black), ST W (Save the Whales)}
 p(v1 , ai1 ) ∧ . . . ∧ p(vk , aik ), and T itself represents the disjunction:           Of the 24 possible t-shirts only 11 are valid due to constraints that
        Wr
 T |=       τi .                                                                         state that M IB implies black and ST W implies ¬small as de-
       i=1
                                                                                         picted in table (1). In Section 5, I extend this example to be slightly
    Given the definition of s in (3), there are s distinct propositions
                                                                                         more complex.
 associated with T , one for each value in each πj (T ). Hence, given
                                                                                            I have implemented a prototype in Java that meets the functionality
 any value assignment to these s propositions, T , seen as a logical
                                                                                         requirements listed above. Here, I refer to it as the VDD prototype. It
 expression implementing the constraint relation, will evaluate to 1
                                                                                         functions standalone, independent of any particular configurator. A
 (>/true) or 0 (⊥/false), and T defines a Boolean function:
                                                                                         feature of this implementation is that it can be selectively applied to
                       F : 2π1 (T )×...πk (T ) → {0, 1}                      (4)         some tables, while processing others with the existing means of the
                                                                                         configurator. Results obtained using this prototype validate the ap-
 F can be represented by a BDD (Ordered Binary Decision Diagram)                         proach (see Section 6). In Section 7, I comment on results for double
 or one of its cousins [12]. BDDs have the enticing property that find-                  negation of variant tables, an approach I develop in [8]. Real run-
 ing the right ordering of their Boolean variables (the propositions                     time performance evaluations have not been done, but in Section 8
 p(vj , x)) can lead to a very compact representation. Furthermore,                      I discuss what results I have. I close this paper with an outlook and
 this representation can potentially be found by existing agnostic op-                   conclusions (Section 9).
 timization algorithms. The complexity of this optimization is high,                        Finally, a disclaimer: While the motivation for this work lies in my
 and heuristics are employed in practice. The construction of an op-                     past at SAP and is based on insights and experiences with the prod-
 timal BDD is not suitable for configuration run-time, but must be                       uct configurators there [4, 7], all work on this paper was performed
 performed in advance. Hence, this approach is referred to as a compi-                   privately during the last two years after transition into partial retire-
 lation approach. For configuration, this has been pursued in [9]. The                   ment. The implementation is neither endorsed by SAP nor does it
 approach using multi-valued decision diagrams (MDD) [2] is related                      reflect ongoing SAP development.
 to the BDD approach. Zero-Suppressed decision diagrams (ZDDs)
 [12] are another flavor of BDD, which I refer to again below.
    From a database point of view, approaches based on compression                       3     Decomposition
 that allow fast reading but slow writing have been developed, among                     3.1    Column Oriented Decomposition
 them column-oriented databases [14]. The SAP HANA database sup-
 ports column orientation as well. My work, here, is not directly based                  Let s be defined as in (3). Given a table T of arity k with r rows
 on database techniques, although thinking about column-orientation                      ri = (ai1 , . . . , aik ) and selecting one of the s propositions p(vj , x)
 was the trigger for the heuristics detailed in Section 46 .                             associated with T , then T can be decomposed into those rows in
    Both the BDD approaches and the database approaches entail a                         which x occurs in column j and those where it doesn’t. Define
 maintenance-time transformation into a compact representation that                      L(T , j, x) as the sub-table of T consisting of those rows that do
 facilitates run-time evaluation, and both strive for a compact repre-                   not reference p(vj , x):
 sentation (“hard to write, easy to read”). This would also apply to
 various approaches at identifying c-tuple subsets of T whether with                            L(T , j, x) := {ri = (ai1 , . . . , aik ) ∈ T   | aij 6= x}     (5)
 the explicit notion of achieving compression [10] or of simply speed-
 ing up constraint propagation (GAC) algorithms [6].                                        L(T , j, x) has the same arity k as T . In the complementary sub-
    Thus, all these approaches could be termed as compression or                         table T \ L(T , j, x) all values in the j-th column are equal to x
 compilation or as both. Indeed, I conjecture that the ultimately                        by construction. Define R(T , j, x) as the sub-table of arity (k − 1)
 achievable results to be more or less identical, up to differences                      obtained by removing the j-th column from T \ L(T , j, x):
 forced by the respective formalism, which may be unfavorable in
 some circumstances. For example, my experiences suggest that a                                 R(T , j, x) := {(aih ) ⊂ (T \ L(T , j, x)) | h 6= j}            (6)
 BDD may be less suitable than a ZDD for compiling a single table
 constraint, because in the latter propositions need only be represented                    Given a table T and a proposition p(vj , x) associated with T , then
 where they occur in positive form (see [12]). The approach I follow                     I call L(T , j, x) defined in (5) the left sub-table of T and R(T , j, x)
 here is motivated by looking at ways of decomposing a table into                        defined in (6) the right sub-table of T . Either L(T , j, x) and/or
                                                                                         R(T , j, x) may be empty.
 5 I exclusively use the term disjoint union to refer to a union of disjoint sets           The variant table of the 11 variants of the t-shirt example is shown
   [5]. I denote the disjoint union of two sets A and B by A∪· B which implies           in Table 1. It illustrates a decomposition of this table table based
   that A ∩ B = ∅
 6 It would be interesting to investigate whether a column-oriented database             on the proposition p(v2 , M edium). L(T , 2, M edium) is in bold-
   could in itself be beneficially employed in this context. This was proposed           face, and R(T , 2, M edium) is underlined. (The value block of cells
   by colleague at SAP some time ago, but has not been followed up on                    {ai2 ∈ T | ai2 = M edium} is in italics.)


Juha Tiihonen, Andreas Falkner, and Tomas Axling, Editors                           90
Proceedings of the 17th International Configuration Workshop
September 10-11, 2015, Vienna, Austria
                 Table 1. Basic decomposition of a table                              Figure 2 shows an entire VDD8 .9


                         Color           Size            Print
                         Black           Small           MIB                                                                                                              1:(2, Small)
                         Black           Medium          MIB
                         Black           Large           MIB
                         Black           Medium          STW                                                                                              2:(1, White)
                         Black           Large           STW
                         White           Medium          STW
                         White           Large           STW                                                                                               3:(3, STW)
                         Red             Medium          STW
                         Red             Large           STW
                         Blue            Medium          STW                                                         4:(1, Black)
                         Blue            Large           STW
                         Blue            Small           STW
                                                                                                    5:(2, Medium)                                          4:(1, Black)


   The decomposition process can be continued until only empty sub-                                  7:(2, Large)                      3:(3, STW)         5:(2, Medium)
tables remain. The question, which propostion (value block) to de-
compose on next at each non-empty subtable will depend a suitable
                                                                                      8:(1, Red)                    5:(2, Medium)            6:(3, MIB)                                  4:(1, Black)
heuristic (see Section 4).

                                                                                                   9:(1, Blue)                7:(2, Large)                    6:(3, MIB)

3.2    Variant Decision Diagram - VDD
                                                                                                       ⊤                                            ⊥
A variant table can be represented as a decomposition tree. The root
represents the entire table. It is labeled with a first chosen proposition
p(vj1 , x1 ). It has two children. One, termed the left child, represents                             Figure 2. VDD of t-shirt using heuristic h1 (Section 4)
the left sub-table L(T , j1 , x1 ). The other, termed the right child, rep-
resents the right sub-table R(T , j1 , x1 ). Each of these children can
in turn have children if it can be decomposed further. An empty left
child is labeled by a special symbol ⊥. An empty right child is la-                      In Figure 2 each node is labeled in the form hp : (j, val)i,
beled by a special symbol >. (All leaves of a decomposition tree are                  where (j, val) is the column/value pair that denotes the proposition
labeled either by ⊥ or >.) Figure 1 shows a graphic depiction7 . It                   p(vj , x), and p is a unique identifier for the node/proposition. The
also shows the further decomposition of R(T , j1 , x1 ), here indicat-                identifiers are contiguously numbered. Thus, the set of propositions
ing that it has two empty children.                                                   for T can be seen as totally ordered according to this numbering. The
                                                                                      ordering underlying the graph in Figure 2 is

                                                                                           p(v2 , Small), p(v1 , x)W hite, p(v3 , x)ST W , p(v1 , Black),
                                          B(T,j,x)
                                                                                           p(v2 , M edium), p(v3 , M IB), p(v2 , Large), p(v1 , Red),
                                                                                           p(v1 , Blue)
                      Left child: L(T,j, x)     Right child: R(T,j,x)

                                                                                         Given that such a total ordering can be identified in the decompo-
                                                     ⊥       ⊤                        sition, the resulting VDD may be seen as a ZDD (zero-suppressed
                                                                                      decision diagram) [12]. The gist of the algorithms for evaluation of
                                                                                      BDDs and their cousins given in [12], such as Algorithm C and Al-
               Figure 1. Basic scheme of a decomposition                              gorithm B would apply. However, I have not made any verbatim use
                                                                                      of these so far.10
                                                                                         VDDs have certain special characteristics beyond ZDDs. A termi-
                                                                                      nal HI-link always leads to >, and a terminal LO-link always leads
   Identical subtables may arise at different points in the decompo-                  to ⊥. This and further characteristics that are ensured by the heuristic
sition tree. A goal of minimal representation is to represent these                   h2 given in Algorithm 1 allow certain short-cuts in the implementa-
multiple occurrences only once by transforming the decomposition                      tion (see Sections 4 and 6).
tree into a directed acyclic graph (DAG), which I call a VDD or vari-
                                                                                      8 By conventions established in [12], I term a link to the left child as a LO-
ant decision diagram. All leaves can be identified with one of two
                                                                                        link, drawn with a dotted line, preferably to the left, and a link to the right
predefined nodes also labeled ⊥ and >. Subsequently, all nodes la-                      child as a HI-link, drawn with a filled line, preferably to the right. A LO-
beled by the same proposition p(vj , x) that have identical children                    link is followed when the proposition in the node label is disbelieved. A
are represented by re-using one shared node. This reduction can be                      HI-link is followed when it is believed. The terminal nodes ⊥ and > are
accomplished by an algorithm in the spirit of Algorithm R in [12].                      called sinks
                                                                                      9 The amount of compression achieved in figure 2 is not overwhelming. The
                                                                                        heuristic h2 does better (see figure 3)
7 For simplicity in creating the graph, the label of the root node is given as        10 Both heuristics h1 and h2 in Section 4 are designed to guarantee such an
  B(T, j, x) for p(vj1 , x1 )                                                            ordering, but this is doesn’t seem essential to the VDD approach in general


                                                                                 91                       Juha Tiihonen, Andreas Falkner, and Tomas Axling, Editors
                                                                                                        Proceedings of the 17th International Configuration Workshop
                                                                                                                             September 10-11, 2015, Vienna, Austria
 3.3     Set-labeled Nodes                                                                                 this is not the case for a set representation of c-tuples. Hence, a VDD
                                                                                                           is a more compact representation.
 There is an additional reduction of a VDD I have implemented as
                                                                                                              Also, there may be external sources for set-labeled nodes if the
 an option. This is most easily described using the concept of an l-
                                                                                                           maintained variant table is not relational. The SAP VC allows mod-
 chain. Define the subgraph composed of a node and all its descen-
                                                                                                           eling variant tables with cells that contain real-valued intervals as
 dent nodes that can be reached from it following only LO-links as
                                                                                                           well as a condensed form, where a cell may contain a set of values.
 the l-chain of the node11 . Nodes in an l-chain that pertain to the
                                                                                                           Such cells can be directly represented as set-labeled nodes.
 same column and have the same right child can be joined into a sin-
                                                                                                              I close this section in noting that in [10] a Boolean function as in
 gle node. Let p(vj1 , x1 ), . . . , p(vh , xh ) be the propositions in the
                                                                                                           (4) is also used to construct a decomposition of a table into disjoint
 labels of members in an l-chain that can be joined. The resulting
                                                                                                           union of Cartesian products (c-tuples). There the resulting decompo-
 combined node is labeled with the disjunction of these propositions:
                                                                                                           sition into c-tuples is the goal. Here the VDD is the goal, as I base
 P := p(vj1 , x1 ) ∨ . . . ∨ p(vh , xh ). This disjunction is represented,
                                                           h
                                                                                                           the evaluation on it (see Section 8).
                                                           S
 for short, by the (non-atomic) set of values X :=           xh occurring
                                                                              i=1
 in P . In the sequel, I refer to a node by                                                                4    Heuristics
                                                                                                           For the exposition in this section, I assume T to be in relational form
                                           ν(j, X)                                              (7)
                                                                                                           with arity k.
 where j is the column index of the referenced product property vj ,                                          The graph in Figure 2 is derived using on an initial heuristic h1,
 and X is the set of values represented by the node12 . In case the node                                   which I tried. h1 is based on trying to minimize splitting value blocks
 label is a single atomic value {x}, I also denote this by ν(j, x).                                        during decomposition. A value block for a proposition p(vj , x) is the
    Figure 3 is a graph of the t-shirt using set-labeled nodes13 . It is not                               rows in a table T that reference that proposition. Decomposing T on
 a reduction of Figure 2, but rather a reduction of the graph in Figure                                    some other proposition p(vh , y) will split p(vj , x), if it has rows in
 4 (see Section 4).                                                                                        both L(T , j, y) and R(T , j, y). Subsequently, both of these children
                                                                                                           need to be decomposed by p(vj , x), whereas a single decomposition
                                                                                                           would have handled p(vj , x) for T at its root level. It is assumed that
                                                                                                           keeping large value blocks intact is good, and the order of decompo-
                          1:(3, MIB)|7                                                                     sition decisions should incur as little damage to these value blocks as
                                                                                                           possible. In order to apply this heuristic, all value blocks, their sizes,
                                                                                                           and the row indices that pertain to each one are initially calculated
                  2:(3, STW)|6
                                                                                                           and stored. I do not go into further detail on this here.
                                                                                                              The current implementation relies on characteristics of a VDD en-
            12:(2, [Large, Medium])|5                     10:(2, [Large, Medium, Small])|3                 sured by the decomposition heuristic h2 given in Algorithm 1. Recall
                                                                                                           that the total number of propositions is s, defined by (3), and that the
                                                                                                           number of propositions that pertain to the j-th column is sj , defined
                    11:(1, [Black, Blue, Red, White])|4                        6:(1, Black)|2              by (2).
                                                                                                           Algorithm 1 (Heuristic h2)
                                    F                            T                                         First, define P (T ) as an ordered set of all s propositions p(vj , x) by
                                                                                                           1. Sorting the k columns of T by sj , ascending (largest values last).
                                                                                                              p(vj , x), below, refers to the j-th column with respect to this or-
              Figure 3. VDD of t-shirt with set-labeled nodes                                                 dering of the columns
                                                                                                           2. Within each column, sort the values by their business order (any
                                                                                                              defined order the implementation can readily implement). xpj
                                                                                                              refers to the p-th value in the j-th column domain πj (T ).
    By inspection, the complexity of the graph in Figure 3 is compara-                                     Then,
 ble to that obtained for the merged MDD in [2] (Figure 2 (b) there).                                      1. Make a root node of the VDD for the first proposition in the first
    This further reduction is important from the viewpoint of com-                                            column: p(v1 , x11 )
 pression, as each path from the root of the graph to a sink > can                                         2. While nodes with a non-terminal subtable remain: split them al-
 be seen as a c-tuple in the solution of T , and a set-labeled node                                           ways using the first proposition (value block) in their first column
 reduces the number of such c-tuples. It is also important from the                                        3. Optionally, collect members of an l-chain with the same child
 viewpoint of run-time performance, if set intersection (intersectsp)                                         nodes into an aggregated set-labeled node. I refer to this variant
 is more efficient than multiple membership tests. A VDD using set-                                           of the heuristic as h2∗
 labeled nodes should result in similar c-tuples as the approach in [10]                                   4. Reduce the nodes by unifying equivalent nodes as discussed in
 (depending of course on the heuristic). A key difference is that VDD                                         Section 3.2
 paths may share nodes (c-tuples sharing common tails)14 , whereas
                                                                                                              Figure 4 shows the graph of Table 1 produced using Algorithm
 11 A maximal l-chain would be one for a node which does not itself occur as
                                                                                                           1 without set-labeled nodes (h2). Figure 3 shows the same graph
    the left child of any other node
 12 For the exposition, here, X represents a finite disjunction of propositions.                           produced with set-labeled nodes (h2∗ ).
    In the implemented prototype it can also be an interval with continuous                                   A decomposition based on Algorithm 1 has the following charac-
    bounds                                                                                                 teristics:
 13 The new set-labeled nodes are assigned a uniquely identifying node num-
    ber outside the range used for numbering the propositions                                              • After k HI-links the >-sink is always reached. (This is trivially
 14 Figure 5 has shared nodes
                                                                                                             true for all VDDs, as each row consists of k elements)


Juha Tiihonen, Andreas Falkner, and Tomas Axling, Editors                                             92
Proceedings of the 17th International Configuration Workshop
September 10-11, 2015, Vienna, Austria
                                                                                                          The new values combine with everything, except that no rows are
                                                                                                          added for combinations of M IB (print) and Y ellow (color). The
                                                           1:(3, MIB)|13
                                                                                                          table of all variants then has 73 rows (as opposed to the 11 of table
                                                                                                          1)16 . The graph is given in Figure 5
                                                 2:(3, STW)|12


                                                                                                                                                                                                     1:(3, MIB)|12
                               3:(2, Large)|11


                                                                                                                                                                                    2:(3, STW)|11

                              4:(2, Medium)|10

                                                                                                                            3:(3, none)|10                                          6:(2, Small)|8                    18:(2, [Large, Medium, Small])|5


           6:(1, Black)|9                                          3:(2, Large)|5
                                                                                                              20:(2, [Large, Medium, Small, XL, XXL])|9              19:(2, [Large, Medium, XL, XXL])|7                     17:(2, [XL, XXL])|4


            7:(1, Blue)|8                                                    4:(2, Medium)|4
                                                                                                                                                            16:(1, [Black, Blue, DarkPurple, Red, White, Yellow])|3                            11:(1, DarkPurple)|6   15:(1, [Black, DarkPurple])|2


                   8:(1, Red)|7                                                     5:(2, Small)|3                                                                                    F                                                                  T


                            9:(1, White)|6                        6:(1, Black)|2                                          Figure 5.                       VDD of extended t-shirt with set-labeled nodes

               T                                       F


                                                                                                            Now 11 nodes are needed instead of the 6 nodes in Figure 3.
                                                                                                          The table is decomposed into 5 c-tuples. The node labeled h16 :
            Figure 4. Basic VDD of t-shirt using algorithm 1
                                                                                                          (1, [Black, Blue, DarkP urple, Red, W hite, Y ellow])i is shared
                                                                                                          by three parents.

• All nodes in an l-chain (i.e., linked via LO-links) will always
  pertain to the same column. This follows from the fact that the                                         6      Empirical Compression Results
  columns of any non-empty left subtable of a table T are the same                                        A prototype I have implemented in Java was tested for functional
  as the columns of T . The heuristic says to always choose from the                                      correctness against small exemplary tables such as the t-shirt model
  first column                                                                                            given in Table 1. This set of exemplary tables also contains some
• A node pertaining to the jth column is always (j − 1) HI-links                                          negative variant tables to test the approach in [8]. I refer to the im-
  distant from the root node. This follows by iterating the argument                                      plementation as the VDD prototype. It was further tested against 238
  that if a table T is decomposed by a proposition p(v1 , x) refer-                                       relational variant tables taken from three product models used at SAP
  encing its first column, then the right sub-table T has the second                                      in configurator maintenance issues. Since this data is proprietary, I
  column of T as its first column (by construction))                                                      give only summary statistics on the results. Testing on publicly avail-
   Note that for BDDs an optimal ordering of P (T ) is important                                          able models, such as the Renault model [1] is a next step for future
to achieve a minimal graph. Thus, the search space for finding this                                       work.
is s! (s factorial). In Algorithm 1 only the order of the columns is                                         It proved possible to successfully compile all 238 tables in this test
important. It is not important how the values are ordered within the                                      base with all of the following three approaches:
column. To see this, note that the proposition p(v1 , x11 ) used in de-                                   • the heuristic h1 used to obtain the graph in Figure 2
composition slices T horizontally15 . The slices obtained overall with                                    • the heuristic h2 in Algorithm 1 - without merging nodes to set-
respect to all values xp1 in the first column are the same, regardless                                      labeled nodes
of the order of the values in a column. Thus, the search space for                                        • the heuristic h2∗ in Algorithm 1 - with merging nodes to set-
an optimal column order is merely k!. As Algorithm 1 indicates, I                                           labeled nodes
am currently only exploring one ordering of columns, supposing that
it will dominate the others. However, this still needs to be verified                                        Table 2 gives statistics on the table size and complexity. It lists the
empirically.                                                                                              minimal, maximal, and average values, as well as the values for the
                                                                                                          four quartiles Q1 , Q2 , Q3 , Q4 , for each of the following parameters:
5     Example of Extended T-shirt Model                                                                   k (arity), r (number of rows), s (number of propositions), and N
                                                                                                          (number of cells - k ∗ r).
In [8] I extended the t-shirt by adding the colors Y ellow and                                               There is one table with arity one. This is used as a technique of
DarkP urple, the sizes XL and XXL, and the print none to the                                              dynamically defining a global domain for a product property in a
global domains πj (T ) (given in Section 2):                                                              variant table, rather than doing this directly in the declaration of the
                                                                                                          product property in the model. This technique has the disadvantage
    π1 (T ) = {Black, Red, W hite, Blue, Y ellow, DarkP urple}
                                                                                                          that it is more difficult to ensure translation of all relevant texts asso-
    π2 (T ) = {Large, M edium, Small, XL, XXL}                                                            ciated with a value in multi-lingual deployments of the configuration
    π3 (T ) = {M IB, ST W, none}                                                                          solution. It may, however, be used to dynamically restrict a large,
15 The terminology is inspired by [6]                                                                     16 I elaborate on the derivation of this example in [8]


                                                                                                     93                            Juha Tiihonen, Andreas Falkner, and Tomas Axling, Editors
                                                                                                                                 Proceedings of the 17th International Configuration Workshop
                                                                                                                                                      September 10-11, 2015, Vienna, Austria
                                                                                           some components that are non-linear to process, whereas h2 does
             Table 2.       Size statistics on 238 SAP VC variant tables
                                                                                           not. Not surprisingly, using merged nodes further reduces both the
                                                                                           number of nodes (n) in the VDD as well as the number of distinct la-
               Range              k        r          s         N                          bels of the nodes (s). The times are obtained on my Apple Mac mini
               Minimum            1         1         2          2
               Q1                 2        14        17         42                         with 2.5 GHz Intel Core i5 and 8GB memory. Times on other de-
               Q2                 3       56.5       46        176                         velopment PCs (both Windows and MacBook) are comparable. The
               Q3                 5      137.5      93.75     635.5                        time to compile the largest table with h1 is almost 20 minutes, but it
               Q4                 16     21372       998     213720                        takes less than one second using h2 with and without merging for the
               Average           4.29    238.53     79.04    1590.51
                                                                                           same table. Thus, compiling these tables into VDDs with h2 would
               Maximum            16     21372       998     213720
                                                                                           almost be feasible at run-time.
                                                                                              Heuristic h2 strictly dominates h1 with respect to achieved
 pre-defined, enterprise-wide global domain to the requirements of a                       compression (smaller number of nodes) for 143 of the 238 ta-
 particular product. General modeling considerations and experiences                       bles18 . For 71 tables the same compression was achieved. For
 with the SAP VC are elaborated in [4].                                                    24 tables h1 strictly dominates h2. Table 4 compares the advan-
    The largest arity is 16. The associated table has only 76 rows. The                    tages/disadvantages of h2 over h1. The three cases are labeled
 largest table with 21372 rows has arity 10. The table with the largest                    “h2 > h1” (h2 strictly dominates h1), “h2 = h1” (indifference),
 number of propositions (998) has arity six and 469 rows.                                  and “h2 < h1” (h2 is strictly dominated by h1). Table 4 lists av-
    Table 3 gives statistics on the achieved compression for the three                     erages for the following parameters: T ab, s, N , n, ∆R , and ∆t.
 compression techniques. (I discuss double negation separately, in                         T ab is the number of tables that pertain to that case. The parameters
 Section 7.) The variant tables are partitioned into the four quartiles                    s, N , and n are as defined above for Table 3. ∆R is the weighted
 for s (number of distinct node labels17 ). These are denoted by Qs1 ,                     difference in reduction: ∆R = T ab ∗ (reducth2 − reducth1 ). It is
 Qs2 , Qs3 , and Qs4 . Averages are given for each of these four parti-                    positive where h2 has the advantage. ∆t is the weighted difference
 tions for the following parameters: s, N (number of cells in variant                      in compile time: ∆t = T ab ∗ (th2 − th1 ) in milli-seconds. It is nega-
 table), n (number of nodes), reduct (reduction: (N − n)), and t                           tive where h2 has the advantage. The last row gives the averages per
 (compilation time in milli-seconds). Explicit results are also given                      table ∆R/238 and ∆t/238 over all rows.
 for the table with largest number of cells (Max N ), largest number of
 propositions (Max s), and largest arity (Max k), as well as the overall                                   Table 4.   Averages for dominated heuristics
 average.

         Table 3.     Average compression on 238 SAP VC variant tables                         Dominance    T ab            N         s         n      ∆R           ∆t
                                                                                               h2 > h1      143         867.06    85.03    222.59     7702      -563065
                                                                                               h2 = h1        71        114.76    54.83     56.00         0       -1208
                                                                                               h2 < h1        24      10266.88   114.92    282.58      -992   -1206770
   Range             N          Heur       s         n      reduct         t (msec)            Average                                                28.19    -7446.43
                                h1        17         20      15.75             4
   Qs1               42         h2        17         18       19.5             0
                                h2∗        8          8      28.75              0
                                h1        46       73.5       95.5            17.5            The largest table is one where h1 strictly dominates h2. However,
   Qs2              176         h2        46         65      102.5             1           the compile time using h1 is almost 20 min. That for h2 is 0.6 sec.
                                h2∗       17       19.5       155              1           Overall, h2 seems to prevail over h1. The average gain in reduction
                                h1       93.75      184      418.5          112.75
   Qs3              635.5       h2       93.75    154.75     489.5              3
                                                                                           over all 238 tables is 28.19. The average gain in compilation time is
                                h2∗      53.75      74.5     558.5              5          7446.43 (msec). The large gain in the latter is due to the non-linear
                                h1        998      3756     213329         1192988         performance of h1, which makes it grossly uncompetitive for large
   Qs4          213720          h2        998      2728     213289            598          tables. Further experiments with other column orderings and with
                                h2∗       988      2381     213575            659          other heuristics remains a topic of future work.
                                h1       79.04    178.95    1411.57        5475.59
   Average      1590.51         h2       79.04    149.95    1440.56           9.77
                                h2∗      50.32     83.92    1506.59          12.43
                                h1        152       391      54793         1192988         7     Excursion: Negation
   Max N        213720          h2        152       431     213289            598
                                h2∗       70        145     213575            659          In [8] I describe double negation of a positive table as one approach
                                h1        998       998        868            478          to compression that is completely independent of the VDD mecha-
   Max s            2814        h2        998       998        868             86          nism. Being able to negate a table, i.e. calculate the complement with
                                h2∗       130       130       1736             91
                                                                                           respect to a given domain restrictions is central in that approach.
                                h1        181       885        331            874
   Max k            1216        h2        181       653        563              5             My implementation of the VDD-prototype supports the approach
                                h2∗       182       649        567              5          in [8], and supports negation of a VDD. A BDD can be negated by
                                                                                           switching the sinks ⊥ and >. This doesn’t work for a VDD (and for a
                                                                                           ZDD in general). Algorithm 2 gives the spirit of my implementation
    The compilation times of h2 are drastically better for large tables                    for negating a VDD produced using Algorithm 1 for a variant table T
 than those for h1. This is because the information needed by h1 has                       of arity k against a domain restriction tuple R. It produces a negated
                                                                                           VDD that has set-labeled nodes.
 17 For VDDs without merged nodes this is just the number of propositions.
    For VDDs with merged nodes this is a different number, because labels                  18 As merging could potentially also be done in conjunction with heuristic
    for disjunctions of propositions are added, but not all original propositions              h1, it is not a fair comparison to compare the number of nodes achieved
    still explicitly appear as node labels                                                     with h1 against that achieved with h2∗


Juha Tiihonen, Andreas Falkner, and Tomas Axling, Editors                             94
Proceedings of the 17th International Configuration Workshop
September 10-11, 2015, Vienna, Austria
Algorithm 2 (Negation)                                                                   j to determine admissibility of all occurring node labels. Given
Start with the root node of V. Negate it as described below                              that the domains are naturally ordered, binary search can be used
                                                                                         to speed this up. The VDD prototype also imposes an ordering on
• If ν is a non-negated terminal node, i.e., ν references the last col-                  the node labels to facilitate this
  umn index k, replace it with a node ν that is assigned the comple-                   • Let n be the number of nodes in V. The question of which nodes
  mentary label LC(ν) := πj (T )\LC(ν), where LC(ν) is defined                           have admissible paths to >, is related to the problem of count-
  as the union of all values/sets in the l-chain19 of ν                                  ing the admissible paths. After determining which node labels are
• If ν is a non-negated non-terminal node that references column                         admissible, this has the remaining complexity of O(n) (c.f., Algo-
  index j < k, negate it by doing the following in order:                                rithm C in [12])
    – negate each right child of each node in its l-chain in turn
                                                                                          In the case that V is a VDD without set-valued nodes, i.e., all node
    – add a node ν⊥ to the end of the l-chain of ν , where ν⊥ repre-                   labels are of the form X = {x}, V ∩ R is just the solution set of
      sents the c-tuple LC(ν)×Rj+1 ×. . .×Rk . ν⊥ itself is labeled                    T ∩ R, where T is the variant table encoded by V. But, if V has
      with LC(ν). Its right child is a node labeled Rj+1 which has a                   non-atomic set-valued nodes20 T ∩ R ⊆ V ∩ R. Here, the evaluation
      right child Rj+2 . . . . All LO-links of added nodes point to ⊥                  comes at the slight additional cost of determining the solution set by
                                                                                       additionally intersecting V ∩ R with R.21 The intersection of two c-
Prune any unneeded nodes that have an empty set as a label or have                     tuples is easy to calculate. Thus, this additional cost is offset, because
a right child that is pruned, suitably rerouting a link to a pruned node               determining the admissibility of each node is now faster (a smaller
to the left child of the pruned node                                                   number of intersectsp tests hopefully offsets the otherwise greater
                                                                                       number of memberp tests).
   The VDD prototype actually does the pruning of empty nodes on
                                                                                          Real run-time measurements have not yet been performed. How-
the fly. If the root node itself is pruned, the entire resulting VDD is
                                                                                       ever, in the beginning, in trying to determine whether the VDD ap-
empty after negation.
                                                                                       proach is worthwile, I did an experimental integration with the (Java-
   The fact that double negation of a positive table needs to yield the
                                                                                       based) SAP IPC configurator (see [4]) with the product models en-
same solution set as the original table (see [8]) provides a straightfor-
                                                                                       compassing the 238 tables mentioned in Section 6. The software con-
ward possibility to test for the correctness of this approach to nega-
                                                                                       figuration I used was completely non-standard, so any results are not
tion.
                                                                                       objectively meaningful, but they did encourage me to pursue this ap-
   In order to avoid complexity issues with very large complements, I
                                                                                       proach. The expectations on performance gains might be roughly ori-
so far applied double negation only in those cases where the number
                                                                                       ented on the inverse of the compression ratio N/n (total number of
of tuples in the complement was smaller or equal to the number of
                                                                                       table cells/number of nodes). For heuristics h1/h2/h2∗ the averages
tuples in the original table. 57 of the 238 SAP VC variant tables that
                                                                                       for N/n are 2.2/2.4/3.9, respectively. But, this does not account for
are the basis for the results I presented in Section 6 proved amenable
                                                                                       losses due to the overhead of needing more complex set operations.
to double negation in this sense. Of these, 18 did not yield a smaller
                                                                                       In any case, real run-time measurements need to be performed as a
VDD than that obtained with heuristic h2∗ . For the remaining 39 the
                                                                                       future step.
maximal gain was 4 nodes, the average gain was 1.89 nodes.
                                                                                          I close this section with a remark on using a VDD as a simple
   Run-time performance tests have yet to be made, but these results
                                                                                       database. A query with an instantiated database key is equivalent to
raise the question, whether double negation will add value over di-
                                                                                       a run-time restriction R, where the key’s properties are restricted to
rect compression. However, the concept of double negation is inde-
                                                                                       singleton values, and the other properties are unconstrained (or have
pendent of the VDD concept, and could be applied on its own with-
                                                                                       the domain that is established at the time of the query). The solution
out using VDDs. Also, it remains to be seen, if the test on whether
                                                                                       set of the VDD will contain only tuples consistent with the query
constraint propagation can be gainfully applied, given in [8], proves
                                                                                       (by construction). If, for example, the key is defined as unique in the
valuable.
                                                                                       database (and the table content is consistent with this definition), the
                                                                                       result can contain at most the unique response (or the empty set, if
8     Evaluation of a VDD                                                              no row for the key exists in the table). For queries with non-unique
                                                                                       database keys, the solution set needs to be intersected with R, as
Given a run-time restriction R, a VDD V, and a node ν(j, X) in V
                                                                                       discussed above.
(using the notation in (7)). ν(j, X) can be marked as out if X ∩Rj =
∅. The admissible solutions of V are all paths from the root node to
the sink > that do not contain any node marked out. I denote this set                  9     Conclusion
as V ∩ R, for short. If there are no such paths, then R is inconsistent
given V.                                                                               Although it remains to explore other variants of Algorithm 1 with dif-
   I do not go into further detail on how to determine V ∩ R. This fol-                ferent orderings of the columns, the compression achieved with the
lows the spirit of known algorithms for directed acyclic graphs. For                   current version (both h2 and h2∗ ) has been surprisingly satisfactory.
example, see [12] for an exposition in the context of BDDs/ZDDs.                       The very short compile times may be of more practical advantage
   Concerning the general complexity of the calculations:                              than a more expensive search for heuristics that provide (somewhat)
                                                                                       better results. However, further investigation into search and heuris-
• Let sj be the number of distinct node labels pertaining to the j-th                  tics of an altogether different type should to be done more completely
  column of V (as in (2), but modified to allow for set-valued labels).                and formally. This is future work.
  sj node-labels must be intersected with Rj for each column index
                                                                                       20 Either merged nodes or nodes representing continuous intervals
19 Defined in Section 3.3, the l-chain of a node is the sub-graph consisting           21 To see this, suppose that T    is itself a c-tuple, so that there is at most
    of the node and all nodes reachable from it via LO-links, but excluding the            one admissible c-tuple consisting of T itself. Obviously R may be more
    sink ⊥. All nodes in an l-chain reference the same column index j                      restrictive


                                                                                  95                 Juha Tiihonen, Andreas Falkner, and Tomas Axling, Editors
                                                                                                   Proceedings of the 17th International Configuration Workshop
                                                                                                                        September 10-11, 2015, Vienna, Austria
    The goal of this work was not only to find a good compression                         [2] H.R. Andersen, T. Hadzic, and D. Pisinger, ‘Interactive cost configura-
 technique, but also to provide a solution that can be used in conjunc-                       tion over decision diagrams’, J. Artif. Intell. Res. (JAIR), 37, 99–139,
                                                                                              (2010).
 tion with a legacy configurator to enhance the handling of variant                       [3] C. Bessiere, ‘Constraint propagation’, in Handbook of Constraint Pro-
 tables, either as a whole or individually. The VDD prototype I imple-                        gramming, eds., F. Rossi, P. van Beek, and T. Walsh, chapter 3, Elsevier,
 mented uses little machinery and adds little to software maintenance                         (2006).
 (training, size of the code base, etc.). It also conveys little risk. In                 [4] U. Blumöhr, M. Münch, and M. Ukalovic, Variant Configuration with
 the event that some tables cannot be compiled to a VDD, the legacy                           SAP, second edition, SAP Press, Galileo Press, 2012.
                                                                                          [5] K. Ferland, Discrete Mathematics, Cengage Learning, 2008.
 handling can be seamlessly kept. (This did not occur in the initial                      [6] Nebras Gharbi, Fred Hemery, Christophe Lecoutre, and Olivier Rous-
 explorations using the h1 heuristic with the test models.) The SAP                           sel, ‘Sliced table constraints: Combining compression and tabular re-
 VC is a configurator with a very large user base. Thus, any change                           duction’, in CPAIOR’14, pp. 120–135, (2014).
 to it comes with large risk. This is the type of situation I had in mind                 [7] A. Haag, ‘Chapter 27 - product configuration in sap: A retrospective’,
                                                                                              in Knowledge-Based Configuration, eds., Alexander Felfernig, Lothar
 when designing the VDD prototype.22                                                          Hotz, Claire Bagley, and Juha Tiihonen, 319 – 337, Morgan Kaufmann,
    In the course of this work I have come to believe that all of the                         Boston, (2014).
 following three approaches to speed up variant table handling and to                     [8] A. Haag, ‘An approach to arc consistency with negative variant tables’,
 look for a compact representation yield very similar results:                                in Proceedings of the 17th International Configuration Workshop, Vi-
                                                                                              enna, Austria, September 10-11, 2015., pp. 81–87, (2015).
 • BDDs in various flavors ([9, 2])                                                       [9] Tarik Hadzic, ‘A bdd-based approach to interactive configuration’, in
 • Compression into c-tuples and constraint slicing ([6, 10])                                 Principles and Practice of Constraint Programming - CP 2004, 10th
                                                                                              International Conference, CP 2004, Toronto, Canada, September 27
 • Read optimized databases (such as column-oriented databases                                - October 1, 2004, Proceedings, ed., Mark Wallace, volume 3258 of
   ([14])) in conjunction with a constraint propagation algorithm                             Lecture Notes in Computer Science, p. 797. Springer, (2004).
   (e.g. the STR-algorithm [13])                                                         [10] G. Katsirelos and T. Walsh, ‘A compression algorithm for large arity
                                                                                              extensional constraints’, in Principles and Practice of Constraint Pro-
    The differences are in the machinery needed for the intended de-                          gramming - CP 2007, 13th International Conference, CP 2007, Prov-
 ployment and in the heuristics that suggest themselves. Although the                         idence, RI, USA, September 23-27, 2007, Proceedings, ed., Christian
                                                                                              Bessiere, volume 4741 of Lecture Notes in Computer Science, pp. 379–
 representation as a VDD is central to my approach at compression
                                                                                              393. Springer, (2007).
 and evaluation, functionally, the two important aspects in practice are                 [11] H. Keller, The Official ABAP Reference, number Bd. 1 in Galileo SAP
 that it functions as a (limited) replacement for a database, and that it                     Press, Galileo Press, 2005.
 performs constraint propagation. I would see as a topic of future work                  [12] D.E. Knuth, The Art of Computer Programming, volume 4A Combina-
 to both look more closely at read optimized databases and to investi-                        torial Algorithms Part 1, chapter Binary Decision Diagrams, 202–280,
                                                                                              Pearson Education, Boston, 2011.
 gate, if the VDD approach can be extended to support more complex                       [13] C. Lecoutre, ‘STR2: optimized simple tabular reduction for table con-
 database queries. Investigating the commonality between the three                            straints’, Constraints, 16(4), 341–371, (2011).
 approaches (BDDs, compression, read-optimized databases) more                           [14] Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen,
 formally could be another interesting topic. The VDD approach has                            Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel
                                                                                              Madden, Elizabeth J. O’Neil, Patrick E. O’Neil, Alex Rasin, Nga Tran,
 elements of all three.
                                                                                              and Stanley B. Zdonik, ‘C-store: A column-oriented DBMS’, in Pro-
    A note in closing: The compression algorithm in [10] is based on a                        ceedings of the 31st International Conference on Very Large Data
 very similar approach to table decomposition. I was not aware of this                        Bases, Trondheim, Norway, August 30 - September 2, 2005, eds., Kle-
 work until recently, so there are some unfortunate disconnects in con-                       mens Böhm, Christian S. Jensen, Laura M. Haas, Martin L. Kersten,
 ventions I use. I tend to follow [12], whereas in [10] left and right are                    Per-Åke Larson, and Beng Chin Ooi, pp. 553–564. ACM, (2005).
 used the other way around. I did adopt use of the term c-tuple. I think
 the column oriented view, here, is more intuitive and has resulted in
 more useful heuristics. Another major difference to [10], which I see,
 is that the I base evaluation directly on the VDD. Furthermore, the
 VDD supports nodes labeled with real-valued intervals, but his could
 also be added to [10] in a straightforward manner23 .

 ACKNOWLEDGEMENTS
 I would like to thank the anonymous reviewers and my daughter
 Laura for their constructive suggestions, on how to improve the intel-
 ligibility of this paper. I am afraid the current result may not yet meet
 their expectations, but I think it is a step forward from the previous
 version.

 REFERENCES
  [1] J. Amilhastre, H. Fargier, and P Marquis, ‘Consistency restoration and
      explanations in dynamic csps application to configuration’, Artif. Intell.,
      135(1-2), 199–234, (2002).
 22 In my estimation, the effort to reimplement the current VDD functionality
    in SAP ABAP ([11]) for use with the SAP VC would be feasible from a
    cost and risk perspective
 23 Basically treating an interval syntactically like a value in compilation, but
    evaluating it like a set at run-time


Juha Tiihonen, Andreas Falkner, and Tomas Axling, Editors                           96
Proceedings of the 17th International Configuration Workshop
September 10-11, 2015, Vienna, Austria