<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Uncovering Functional Dependencies in MDD-Compiled Product Catalogues</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Tarik Hadzˇic ́ and Barry O'Sullivan Cork Constraint Computation Centre Department of Computer Science, University College Cork</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Product catalogues are usually represented as tables. They are regularly updated with new variants and, over time, various forms of inconsistencies or undesirable properties can be introduced, especially when global changes are made. We argue that by compiling product catalogues into decision diagrams we can support a number of highlevel queries for detecting and checking for various forms of inconsistency, as well as verifying other properties relevant for user interaction. In particular, a class of functional dependency-based inconsistencies can be detected efficiently. An example is verifying properties such as: “each identical configuration of a product has the same price”. This paper presents a number of algorithmic advances. We illustrate the usefulness of our approach by evaluating these high-level properties over realworld publicly available product catalogues.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Uncovering functional dependencies is an important
problem in many artificial intelligence (AI) domains. Many AI
datasets are represented in tabular form, defined in terms of
a set of attributes (columns). A large dataset might be
represented in a database table or spreadsheet and determining
that one or more attributes is functionally determined by
values of other attributes can be of critical importance, e.g. in
analyzing the effect of chemical compounds on
cancerogenity or studying the shopping habits of the customers.
Functional dependencies are most commonly used in the process
of designing relational database schema.</p>
      <p>We suggest that uncovering functional dependencies can
also be useful in an online product configuration system
context. For example, various forms of catalogue consistencies
can be expressed as functional dependencies. For example,
checking whether “each identical configuration of a product
has the same price” reduces to verifying that the price
attribute is functionally determined by the remaining attributes.
This could be relevant for product catalogues that are
regularly updated with new variants, or preprocessed for various
purposes. Furthermore, functional dependencies can help to
reason about the length of user interaction with respect to the
design of the user interface. If a small subset of attributes
functionally determines all the others, then an interface that
encourages a user to first assign attributes from such a subset,
might help reduce the total number of interaction steps.</p>
      <p>In this paper we study the problem of identifying and
testing functional dependencies amongst subsets of the attributes
X that define a catalogue. Specifically, we consider
dependencies of the form Y ! x for a subset of variables Y ½ X
and variable x 2 X, which hold if and only if for any two
products p1; p2 from the catalogue, whenever p1 and p2 agree
on attributes Y , i.e. p1[Y ] ´ p2[Y ], they also agree on
attribute x, i.e. p1[x] ´ p2[x].</p>
      <p>A number of approaches have been suggested in the
database community for uncovering functional
dependencies [Huhtala et al., 1999]. However, all of them find
dependencies by operating over an explicit representation of the
catalogue, and state-of-the art approaches take linear time,
O(T ), in the number of products, T , in the catalogue
[Huhtala et al., 1999; Schlimmer, 1993]. Other approaches incur
even more complexity by using sorting, in O(T ¢log(T )) time,
or explicit comparison of all tuples, in O(T 2) time. Note that
the number of tuples can grow exponentially with the
number of attributes, however most real world catalogues are very
sparse and contain only a tiny proportion of all possible
products.</p>
      <p>The main contribution of this paper is an approach to
uncovering functional dependencies when a dataset is
compressed into a multi-valued decision diagram (MDD) rather
than as an extensionally represented table, the standard in
relational database theory. MDDs are directed acyclic graphs
that, for a sorted list of attributes, use prefix and suffix
sharing to compactly represent the data. Each product is encoded
as a path, and every edge on the path encodes an
attributevalue pair (variable assignment), xi = a. While the size of an
MDD is, in the worst-case, linear in the number of products,
they can often be exponentially smaller. We propose a number
of algorithms for uncovering functional dependencies; these
algorithms have time complexity that is quadratic in the size
of the MDD. In worst case, when there is no compression,
we get O(T 2) running time. However, whenever the MDD
is small we guarantee a sublinear-time algorithm for testing
Y ! x.</p>
      <p>We implemented the algorithms we have introduced in this
paper and applied them to several publicly available product
catalogues. As a result, we have uncovered several properties,
some of which indicate “bugs” in the datasets which have
previously gone undetected.</p>
      <p>The remainder of this paper is organised as follows.
Section 2 presents the necessary technical background required
throughout the paper. We show how functional
dependencies can be uncovered in decision diagram representations
of product catalogues in Section 3. A variety of specialised
functional dependencies, called subset-induced dependencies
are presented in Section 4. We define the notion of
approximate dependencies in Section 5, which are inexact forms of
standard functional dependencies. We report on a number of
experiments in Section 6. Section 7 presents a number of
concluding remarks and outlines directions for future work.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>In this section we provide the necessary background on
preliminary concepts and introduce our notational conventions.
2.1</p>
      <sec id="sec-2-1">
        <title>Solution Sets</title>
        <p>We are given a set of variables (product attributes) X =
fx1; : : : ; xng defined over finite domains D1; : : : ; Dn of
possible values of the attributes of a product. We are also given
a set of solutions (available products) Sol µ D1 £ : : : ; Dn,
which can be defined either explicitly, e.g. as a set of items
in a product catalogue, or implicitly, e.g. as the set of
solutions to a constraint satisfaction problem hX; D; F =def
fc1; : : : ; cmgi defining which products can be manufactured
in a factory. The latter approach is common when defining a
catalogue of configurable products. Consider for example the
following implicit representation of a T-shirt catalogue.
Example 1 (Representing a T-Shirt Catalogue). We are
interested in selecting a T-shirt which is defined by three
attributes: the color (black, white, red, or blue), the size (small,
medium, or large) and the print (“Men In Black” - MIB or
“Save The Whales” - STW). There are two rules that define
the set of valid combinations: if we choose the MIB print
then the color black has to be chosen as well, and if we
choose the small size then the STW print (including a big
picture of a whale) cannot be selected as the picture of a
whale does not fit on the small shirt. The implicit
representation (X; D; F ) of the T-shirt example consists of variables
X = fx1; x2; x3g representing color, size and print.
Variable domains are D1 = f0; 1; 2; 3g (black ; white; red ; blue),
D2 = f0; 1; 2g (small ; medium; large), and D3 = f0; 1g
(MIB ; STW ). The two rules translate to F = ff1; f2g,
where f1 is x3 = 0 ) x1 = 0 (MIB ) black ) and f2
is (x2 = 0 ) x3 6= 1) (small ) not STW ).
}</p>
        <p>The same solution space of the T-shirt example (satisfying
the configuration rules F ) can be also given explicitly as a set
of items in a catalogue, as shown in Table 1.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Decision Diagrams</title>
        <p>Decision diagrams are compressed representations of
solution sets Sol µ D1 £ : : : £ Dn. Formally, decision diagrams
are a family of rooted directed acyclic graphs (DAGs) where
each node u is labeled with a variable xi and each of its
outgoing edges e are labeled with a value a 2 Di. The decision
diagram contains one or more terminal nodes, each labeled
with a constant and having no outgoing edges. The most well
known member of this family is the binary decision diagram
(BDD) [Bryant, 1986] which is used as a compressed
representation of Boolean functions in many areas, such as
verification, model checking, VLSI design [Meinel and Theobald,
1998; Wegener, 2000; Drechsler, 2001], etc. In this paper
we will primarily operate with the following variant, called
multi-valued decision diagrams:
Definition 1 (Multi-valued Decision Diagram). An MDD M
is a rooted directed acyclic graph (V; E), where V is a set of
vertices containing the special terminal vertex 1 and a root
r 2 V . Further, var : V ! f1; : : : ; n + 1g is a labeling of all
nodes with a variable index such that var(1) = n + 1. Each
edge e 2 E is denoted with a triple (u; u0; a) of its start node
u, its end node u0 and an associated value a.</p>
        <p>We work only with ordered MDDs. A total ordering &lt;
of the variables is assumed such that for all edges (u; u0; a)
var(u) &lt; var(u0). For convenience we assume that the
variables in X are ordered according to their indices. Ordered
MDDs can be considered as being arranged in n layers of
vertices, each layer being labeled with the same variable
index. We will denote as Vi the set of all nodes labeled with
xi, Vi = fu 2 V j var(u) = ig. Similarly, we will
denote with Ei the set of all edges originating in Vi, i.e.
Ei = fe(u; u0; a) 2 E j var(u) = ig. Unless otherwise
specified, we assume that on each path from the root to the
terminal, every variable labels exactly one node. An MDD
encodes a CSP solution set Sol µ D1£: : :£Dn, defined over
variables fx1; : : : ; xng. To check whether an assignment
a = (a1; : : : ; an) 2 D1 £ : : : £ Dn is in Sol we traverse M
from the root, and at every node u labeled with variable xi, we
follow an edge labeled with ai. If there is no such edge then a
is not a solution a 62 Sol. Otherwise, if such a traversal
eventually ends in terminal 1 then a 2 Sol. We will denote with
p : u1 Ã u2 any path in MDD from u1 to u2. Also, edges
between u and u0 will be sometimes denoted as e : u ! u0.
A value a of an edge e(u; u0; a) will be sometimes denoted as
v(e), while a partial assignment associated with path p will
be denoted as v(p). We will use Ch[u] to denote the set of all
outgoing (children) edges of node u. Every path corresponds
to a unique assignment. Hence, the set of all solutions
repreu2
0 1 2
u7</p>
        <p>The MDD from Figure 1 is as large as explicit
representation in Table 1 - the number of edges is equal to the number
of attribute-value pairs. However, the critical benefit of
decision diagrams is that they can become exponentially smaller
than the size of solution set they encode by merging
isomorphic subgraphs. Two nodes u1; u2 are isomorphic if they
encode the same solution set Sol(u1) = Sol(u2). In
Figure 2 we show equivalent merged MDDs for the T-shirt
solution set. For the sake of clarity, we first indicate how the
nodes have been merged in Figure 2(a) by using the same
unique node identifiers from Figure 1. For example, nodes
fu3; u4; u5g have been merged since they have the same
solution sets f(1; 1); (2; 1)g. The same merged MDD, with new
unique node identifiers, is shown in Figure 2(b). The utility
of compressing product catalogues has already been
demonstrated in [Nicholson et al., 2006]. In this paper, unless
emphasized otherwise, by MDD we always assume an ordered
merged MDD.</p>
      </sec>
      <sec id="sec-2-3">
        <title>On the Size and Construction of Merged MDDs. Given a</title>
        <p>variable ordering there is a unique merged MDD for a given
solution set. The size of the MDD depends critically on the
ordering, and could vary exponentially. The merged MDD
representation of a solution set Sol with T entries defined
over n attributes, jSolj = T , can have at most T ¢ n edges.
However, data often exhibits sharing as illustrated in the
Tshirt example, and a merged MDD might be exponentially
sented by the MDD is Sol = fv(p) j p : r Ã 1g. In fact,
every node u 2 Vi can be associated with two subsets of
solutions. Sol(u) = fv(p) j p : u Ã 1g µ Di £ : : : £ Dn, and
Sol(r; u) = fv(p) j p : r Ã ug µ D1 £ : : : £ Di¡1:</p>
        <p>Consider the MDD in Figure 1. It represents directly the
solution set of T-shirt catalogue from Table 1. For each node
we indicate a unique identifier ui. All nodes in the same layer
correspond to the same variable. Node u1 is the root node.
Nodes in the first, second and third layer are V1 = fu1g,
V2 = fu2; u3; u4; u5g and V3 = fu6; u7; : : : ; u14g
respectively. For each edge e we indicate a value v(e). The set
of outgoing edges from, for example, node u2 is Ch[u2] =
f(u2; u6; 0); (u2; u7; 1); (u2; u8; 2)g. The solution set
associated with, for example, node u3 is the set of partial
assignments Sol(u3) = f(1; 1); (2; 1)g. There are in total eleven
paths from u1 to 1, corresponding directly to the eleven
products in Table 1.
0 1 2 3
u2
smaller. In the above example, we reduced the number of
edges from 33 to 13. The effect of merging is best illustrated
by considering two extreme cases. An MDD for solution set
D1 £ : : : £ Dn contains only n internal nodes and Pn
i=1 jDij
edges while there are exponentially more solutions T = jD1j¢
: : : ¢ jDnj. In contrast, an MDD where every two edges at
each layer are labeled with a different value is guaranteed to
provide no sharing of nodes, and it would contain T ¢ n nodes.</p>
        <p>An MDD can always be constructed for a given catalogue
in linear time O(T ). We start with a direct representation,
such as the T-shirt MDD in Figure 1, and in a single
bottomup pass detect and merge those nodes that root identical
solution spaces. However, MDDs can be constructed also
from implicit description in the form of constraint
satisfaction problem. We first create an MDD Mi for each
constraint ci, and then use pairwise conjunctions to construct
M1 ^ : : : ^ Mm. Each conjunction of two MDDs can be
performed in quadratic time and space. The size of an MDD can
grow exponentially in the number of variables, but in
practice, for many interesting constraints the size is surprisingly
small.
2.3</p>
      </sec>
      <sec id="sec-2-4">
        <title>Functional Dependencies</title>
        <p>Given solution set Sol defined over variables X =
fx1; : : : ; xng, and given solution a 2 Sol, a = (a1; : : : ; an),
we define the projection of solution a on variable xi, denoted
as a[xi], to be the value of the i-th coordinate in the tuple,
a[xi] =def ai. Similarly, we define projection of a onto a
subset of variables Y µ X, denoted as a[Y ], as a tuple of
values corresponding to variables in Y . Finally, for a given
set of solutions S µ Sol, we define the projection on subset
of variables Y , denoted as S[Y ], as a collection of all
projected tuples, S[Y ] =def fa[Y ] j a 2 Sg.</p>
        <p>For a solution set Sol, defined over variables X =
fx1; : : : ; xng we say that a variable xi is functionally
determined by a subset of variables Y µ X, denoted as Y ! xi,
if for any two solutions a1; a2 2 Sol, whenever a1 and a2
agree on variables Y (a1[Y ] = a2[Y ]), they also agree on
variable xi (a1[xi] = a2[xi]). Formally:
Y ! xi ,def 8a1;a22Sol a1[Y ] = a2[Y ] ) a1[xi] = a2[xi]:</p>
        <p>A number of approaches are known in the database
community for uncovering all minimal functional dependencies.
A core operation that is executed repeatedly is testing for
atomic functional dependencies of the form Y ! xi.
Stateof-the art approaches for testing atomic dependencies first
cluster data in equivalence classes with respect to the value
of xi, and then make multiple linear iterations through the
dataset. This incurs linear complexity in the number of
solutions O(T ) [Huhtala et al., 1999; Schlimmer, 1993].
Other approaches incur even more complexity, using sorting
O(T ¢ log(T )) or explicit comparison of all tuples O(T 2).</p>
      </sec>
      <sec id="sec-2-5">
        <title>Application to Recommendation and Configuration</title>
        <p>Detecting functional dependencies has applications in many
areas. For example, uncovering that an attribute is
functionally determined by values of other attributes can be of critical
importance in analyzing the effect of chemical compounds
on cancerogenity, studying the shopping habits of customers,
etc. In this paper however, we suggest that uncovering
functional dependencies can also be useful in a recommendation
and interactive configuration context.</p>
        <p>Firstly, various forms of catalogue consistencies can be
expressed as functional dependencies. If product catalogues are
frequently updated by the addition, removal or change of its
items, over time various forms of inconsistencies or
undesirable properties might be introduced. In particular, if care is
not taken, a change of pricing policy might result in the
addition of an item to the dataset that is already present in the
catalogue but differs in price. Furthermore, catalogue datasets are
often transformed or preprocessed for various forms of
analysis or communication. Such transformations might involve
the removal of “redundant” attributes, such as textual
descriptions. If care is not taken, non-redundant attributes might be
removed as well, thus influencing the soundness of the results
of the analysis performed over the processed dataset. Thus,
analyzing functional dependencies can help detect possible
inconsistencies. As an illustration, checking whether “each
identical configuration of a product has the same price”
reduces to verifying that the price attribute is functionally
determined by the remaining attributes.</p>
        <p>Secondly, functional dependencies can help us reason
about the length of user interaction with respect to the design
of the user interface. If a subset of attributes Y ½ X
functionally determines all the other variables, Y ! X, then a
user is guaranteed to completely specify the product as soon
as all the variables Y are assigned. Hence, an interface in
which a user is encouraged to first assign variables Y , might
help reduce the total number of interaction steps. This could
be an important addition to recent efforts towards the formal
analysis of user navigation. In [Felfernig, 2006] the author
used a formal model of the recommender process, based on
finite state automata, to support automatic debugging of faulty
models of recommender user interfaces. In [Mahmood and
Ricci, 2007] the authors presented a recommender system
that autonomously learns an adaptive interaction strategy,
using a formal model of user interaction based on Markov
decision processes. In [Hadzic and O’Sullivan, 2008] the
authors introduced critique graphs as a formalism for analyzing
various aspects of interaction in conversational recommender
systems. In particular reachability of products through
critiquing was discussed.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Functional Dependencies in MDDs</title>
      <p>The main contribution of this paper is an approach to
uncovering functional dependencies when a product (solution) set
is compressed into a multi-valued decision diagram (MDD).
As noted earlier, MDDs are, in the worst-case, linear in the
size of the catalogue O(T ), but they can often be
exponentially smaller. Since the algorithms we suggest for uncovering
functional dependencies, in this and the following sections,
have quadratic complexity in the size of the MDD, whenever
the MDD is sufficiently small we guarantee a sublinear-time
algorithm for performing atomic dependency tests Y ! x.</p>
      <p>We will first discuss how to detect directional
dependencies, which respect variable ordering and are particularly easy
to detect. We will then discuss uncovering general
dependencies of the form X n fxig ! xi.
3.1</p>
      <sec id="sec-3-1">
        <title>Directional Dependencies</title>
        <p>Directional dependencies fx1; : : : ; xi¡1g ! xi state that a
variable xi is determined by the subset of all variables
preceding it in the variable ordering of the MDD. This is a
particularly easy to detect class of dependencies as shown in the
following proposition.</p>
        <p>Proposition 1. fx1; : : : ; xi¡1g ! xi iff for all u 2 Vi, u has
only one outgoing edge jOut(u)j = 1.</p>
        <p>Proof. Let p : r Ã u be a path from root to u, e1 : u ! u1
and e2 : u ! u2 be two outgoing edges and p1 : u1 Ã 1
and p2 : u2 Ã 1 be paths from u1 and u2 to terminal
1 respectively. Then paths (p; e1; p1) and (p; e2; p2)
represent two solutions with identical assignments to variables
fx1; : : : ; xi¡1g and two different assignments to xi variable,
v(e1) 6= v(e2).</p>
        <p>Proposition 1 provides us with a simple test for checking
whether fx1; : : : ; xi¡1g ! xi. It suffices that all nodes in Vi
have exactly one outgoing edge. This can be easily checked
by verifying that the number of nodes and outgoing edges is
the same, jVij = jEij. The set of all variables implied by
variables preceding in the order are given by:</p>
        <p>ImpÁ = fxi j jVij = jEijg:
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>General Dependencies</title>
        <p>Variables determined by subsets of variables preceding in the
order, ImpÁ, do not account for all implied variables. If
variable xi is implied by any subset Y µ X n fxig then it will
be also implied by X n fxig. Therefore, the set of all implied
variables Imp is the set of all xi such that X n fxig ! xi. To
detect such variables in an MDD, we will use the following
proposition.</p>
        <p>Proposition 2. X n fxig 6! xi if and only if there is a node
u 2 Vi with two outgoing edges e1 : u ! u1, e2 : u ! u2
such that v(e1) 6= v(e2) and Sol(u1) \ Sol(u2) 6= ;.
Proof. If X n fxig 6! xi then there are two solutions a =
(a1; : : : ; an); a0 = (a01; : : : ; a0 ) differing only in the i-th
con
ordinate, ai 6= a0i. Let pa and pa0 be the paths encoding
these solutions. These paths must be of the form: pa =
(p; e1; p1), pa0 (p; e2; p2), where p is a unique path encoding
(a1; : : : ; ai¡1). Path p ends in a node u 2 Vi. Since ai 6= a0 ,
i
v(e1) 6= v(e2) and since v(p1) = v(p2) = (ai+1; : : : ; an) it
follows Sol(u1) \ Sol(u2) ¶ f(ai+1; : : : ; an)g.</p>
        <p>On the other hand, if there is u 2 Vi with two
outgoing edges e1; e2, such that v(e1) 6= v(e2) and Sol(u1) \
Sol(u2) 6= ; then we can choose two paths p1 : u1 Ã 1,
p2 : u2 Ã 1 such that v(p1) = v(p2). It suffices to
choose any path from root to u, p : r Ã u to construct
paths pa = (p; e1; p1), pa0 (p; e2; p2) which encode
solutions differing only at the i-th coordinate and thus proving
X n fxig 6! xi.</p>
        <p>Assume that for each pair of nodes in the same layer u1; u2
we have precomputed Boolean indicators D[u1; u2]</p>
        <p>D[u1; u2] = 1 , Sol(u1) \ Sol(u2) 6= ;:
Whenever we encounter a pair of nodes (u1; u2) such that
D[u1; u2] = 1, we are guaranteed that there are at least two
paths p1 : u1 Ã 1 and p2 : u2 Ã 1 encoding the same
solution v(p1) = v(p2). Given such labels, we can compute
all functionally determined variables using Algorithm 1. In
each layer we check for all pairs of edges with the same
parent e1(u; u1; a1),e2(u; u2; a2) whether D[u1; u2] = 1. As
soon as such edges are found we have proven that xi is not
implied and we may proceed to the next layer. The algorithm
runs in O(Pin=1 jVij ¢ jDij2) steps, since for each node in
each layer u 2 Vi, we compare in worst case all pairs of its
scuhcilhdrpeaniresd.geTsh,eansdpatcheerceomarpeleaxtimtyoisst OjD(iPj£in=(1jjDViijj2¡) s1i)n=c2e
we have to store Boolean indicators D[u1; u2] for each pair
(u1; u2) 2 Vi2.</p>
        <p>Algorithm 1: Compute functionally determined
variables.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Data: MDD M (V; E)</title>
        <p>Imp = X;
foreach i = 1; : : : ; n do
foreach u 2 Vi; jCh(u)j &gt; 1 do
foreach e1 : u ! u1; e2 : u ! u2 do
if v(e1) 6= v(e2) ^ D[u1; u2] = 1 then</p>
        <p>Imp Ã Imp n fxig;
go to next layer;
return Imp;</p>
        <p>Compatibility pairs D[u1; u2] can be computed in
quadratic time and space using a dynamic programming
scheme from Algorithm 2. We first initialize D[u1; u2] = 0
for all pairs of nodes, except for the terminal 1, setting
D[1; 1] = 1. We then, in a bottom-up manner, traverse the
MDD. The recursive relationship used for dynamic
programming is based on observing that D[u1; u2] = 1 iff there are
two outgoing edges e1 : u1 ! u01; e2 : u2 ! u02 such
tOha(Ptvn(e1) = v(e2) ^ D[u01; u0 ] = 1. The algorithm runs in
2
i=1 jEij2) time since in each layer Ei each pair of edges
is compared at most once. The algorithm takes £(Pi jVij2)
space, since we introduce a compatibility indicator for each
pair of nodes in the layer.</p>
        <sec id="sec-3-3-1">
          <title>Algorithm 2: Compute Boolean indicators.</title>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Data: MDD M (V; E)</title>
        <p>D[¢; ¢] = 0, D[1; 1] = 1;
foreach i = n; : : : ; 1 do
foreach (u1; u2) 2 Vi £ Vi do
if u1 = u2 then</p>
        <p>D[u1; u2] = 1;
continue;
foreach e1 : u1 ! u01; e2 : u2 ! u02 do
if v(e1) = v(e2) ^ D[u01; u02] = 1 then
D[u1; u2] = 1;
break;
return D;
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Subset-Induced Dependencies</title>
      <p>Given a subset of variables Y µ X we may want to compute
the set of all variables implied by Y :</p>
      <p>ImpY = fxi 2 X n Y j Y ! xig:
This could help us to evaluate an overall impact of a subset
of variables. We could exploit this information in a
number of different settings. In particular, if we find Y such that
ImpY = X n Y , then regardless of how we assign variables
Y , we would completely specify entire solution. This could
be important for increasing the usability of user interaction
since, if a user is assigning only variables Y we guarantee
that the number of user interactions before completely
specifying a solution is at most jY j. Furthermore, such an
information could help organize the visual layout of the variables
in a user interface. If a variable xi is determined by variables
Y , by displaying xi closer to Y in the user interface, a user
would faster evaluate implications of his assignments to Y
variables.</p>
      <p>By definition, a variable xi is not implied by Y iff there are
two solutions a1; a2 such that a1[Y ] = a2[Y ] and ai 6= a0 .
i
Recall that S[Y ] denotes a projection of set S on variables
Y . An observation that would help us detect such variables is
provided in the following proposition.</p>
      <p>Proposition 3. Y 6! xi iff there are two edges e1; e2 2 Ei,
e1 : u1 ! u01, e2 : u2 ! u02, such that v(e1) 6= v(e2)
and Sol(r; u1)[Y ] \ Sol(r; u2)[Y ] 6= ; and Sol(u01)[Y ] \
Sol(u02)[Y ] 6= ;.</p>
      <p>Assuming that for every pair of nodes u1; u2 we have
computed Boolean indicators:</p>
      <p>UY [u1; u2] = 1 , Sol(r; u1)[Y ] \ Sol(r; u2)[Y ] 6= ;
DY [u1; u2] = 1 , Sol(u1)[Y ] \ Sol(u2)[Y ] 6= ;
we could use an adaptation of Algorithm 1 to detect all
variables implied by subset Y . The adaptation is presented in
Algorithm 3.</p>
      <sec id="sec-4-1">
        <title>Algorithm 3: Compute Y -implied variables.</title>
        <p>Data: MDD M (V; E), Y ½ X, UY ; DY
ImpY = X;
foreach i = 1; : : : ; n do
foreach (u1; u2) 2 Vi £ Vi do
foreach e1 : u1 ! u01; e2 : u2 ! u02 do
if v(e1) 6= v(e2) ^ UY [u1; u2] =
1 ^ DY [u01; u02] = 1 then</p>
        <p>ImpY Ã ImpY n fxig;
go to next layer;
return ImpY ;</p>
        <p>Compatibility labels UY ; DY can be computed in quadratic
time and space using an adaptation of Algorithm 2 which
is presented in Algorithm 4. To construct a recursive
relationship on which the computation is based, for a given pair
of nodes u1; u2 2 Vj we have to differentiate between two
cases. If xj 62 Y , then DY [u1; u2] = 1 iff there are two
outgoing edges e1 : u1 ! u01; e2 : u2 ! u02 such that
DY [u01; u02] = 1 regardless of whether v(e1) = v(e2) or
v(e1) 6= v(e2). If xj 2 Y then DY [u1; u2] = 1 iff in addition
to DY [u01; u0 ] = 1 it also holds v(e1) = v(e2). Compatibility
2
labels UY are computed in an analogous manner.</p>
        <p>The algorithm runs in O(Pin=1 jEij2) time since for both
traversals, in each layer Ei each pair of edges is compared at
most once. The algorithm takes £(Pi jVij2) space, since we
introduce two Boolean compatibility indicators UY ; DY for
each pair of nodes in the layer.
4.1</p>
        <sec id="sec-4-1-1">
          <title>Finding Minimal Dependencies</title>
          <p>Given a subset of variables Y0 µ X such that X n Y0 ! Y0,
it is often required to compute the set of all minimal subsets
of variables Y µ X n Y0 that imply Y0. In other words, we
want to compute:</p>
          <p>Y = fY µ X n Y0 j Y ! Y0; s:t: 6 9Y 0½Y Y 0 ! Y0g:
There could be an exponential number of such sets, and a
number of approaches has been developed that operate on a
set-containment lattice [Huhtala et al., 1999] and avoid
unnecessary tests Y ! x. For example, whenever Y
determines x, Y ! x, all supersets of Y also determine x. A
number of similar optimizations are implemented in [Huhtala
et al., 1999]. Extending existing approaches by incorporating
our MDD-tests is an interesting direction for future research,
but falls out of the scope of this paper.
5</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Approximative Dependencies</title>
      <p>We have so far discussed only exact dependencies, i.e. we
detect only whether variable xi is determined or not. However,
a subset of variables Y might have a significant implicative</p>
      <sec id="sec-5-1">
        <title>Algorithm 4: Compute Y -Boolean indicators.</title>
        <p>Data: MDD M (V; E), Y ½ X
DY [¢; ¢] = 0, DY [1; 1] = 1;
foreach i = n; : : : ; 1 do
foreach (u1; u2) 2 Vi £ Vi do
if u1 = u2 then</p>
        <p>DY [u1; u2] = 1;
continue;
foreach e1 : u1 ! u01; e2 : u2 ! u02 do
if xi 62 Y ^ DY [u01; u02] = 1 then</p>
        <p>DY [u1; u2] = 1;
break;
else if
xi 2 Y ^ v(e1) = v(e2) ^ DY [u01; u02] = 1
then</p>
        <p>DY [u1; u2] = 1;
break;
UY [¢; ¢] = 0, UY [r; r] = 1;
foreach i = 1; : : : ; n do
foreach (u01; u02) 2 Vi+1 £ Vi+1 do
if u01 = u02 then</p>
        <p>UY [u01; u02] = 1;
continue;
foreach e1 : u1 ! u01; e2 : u2 ! u02 do
if xi 62 Y ^ UY [u1; u2] = 1 then</p>
        <p>UY [u01; u02] = 1;
break;
else if
xi 2 Y ^ v(e1) = v(e2) ^ UY [u1; u2] = 1
then</p>
        <p>UY [u01; u02] = 1;
break;
influence on a variable x even though it does not imply it
exactly. Therefore we are interested in computing a degree of
dependency Y ! x. Let d(Y; x) 2 [0::1] denote such a
degree. There is a number of ways to define it. In [Huhtala et al.,
1999] the authors use for the measure a minimal percentage
of solutions that has to be removed in order for dependency
to hold. We however use the following definition:
d(Y; x) =</p>
        <p>jSol[Y ]j
jSol[Y [ fxg]j</p>
        <p>When Y ! x, then jSol[Y ]j = jSol[Y [ fxg]j, and
d(Y; x) = 1. On the other hand, when x is completely
undetermined by Y , then every assignment to Y variables
a 2 Sol[Y ] can be combined with all values in domain for
x, Dx. In that case,</p>
        <p>d(Y; x) =
d(Y; x) 2 [</p>
        <p>1
jDxj</p>
        <p>:
1
jDxj
where larger values indicate larger degrees of dependency,
and when d(Y; x) = 1, Y ! x holds exactly. We can
interpret this measure as follows - whenever d(Y; x) = d, every
assignment to variables Y can on average be combined with
d1 values of x.</p>
        <p>Note that such a statistic can be easily computed using
a BDD-based representation of solution set Sol. A
BDDpackage BuDDy [Lind-Nielsen, online] supports both
projection and counting operations. Counting the number of
solutions in a BDD is an efficient operation - linear in the number
of nodes. While computing a BDD representation of a
projected solution space, such as Sol[Y ], can in theory increase
the size of the BDD - in practice projecting out variables is
almost always an efficient operation that decreases the number
of nodes significantly.
6</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Experimental Evaluation</title>
      <p>We have applied our techniques to four well-known product
catalogues that are frequently used in recommender system
research. These are related to digital cameras, laptop
computers, property lettings and travel [Nicholson et al., 2006]. For
the purposes of our evaluation, we analyzed the data under
the same adjustments that are usually done for experimental
evaluation. Firstly, all unique identifiers or textual
descriptions, were removed. Secondly, all declared domain values
that did not appear in at least one product, but appeared in the
specification, were also removed. If some values appeared
in datasets but were not declared, we added them to model
specification.</p>
      <p>A summary of the properties of the instances are reported
in Table 2. For each instance, Cameras, Laptops, Travel,
Lettings we show the number of rows in the initial explicit
solution set representation, the number of solutions Sol extracted
from the MDD representation, the number of variables X,
and minimal, maximal and average domain size. We
uncovered that three out of four datasets contain duplicate entries,
since the number of solutions is smaller than the number of
rows.
After compiling catalogues into MDDs, we performed a
dependency analysis X n fxig ! xi for each instance and each
attribute. A detailed analysis of product catalogues is
presented in Table 3. For each instance, we list all variables X
and for each variable xi 2 X we indicate its type
(Categorical, Numerical, Boolean), domain size and degree of
dependence:
d(X n fxig; xi) = jSol[X n fxig]j :
jSol]j</p>
      <p>Recall that the degree of dependence d(Y; x) captures how
many values of xi can on average be combined with every
assignment to variables Y . For example, if d(Y; x) = 1=2, then
every assignment to variables Y can on average be combined
with 2 values in a domain of x. If d(Y; x) = 1 then for
every assignment to Y variables there is exactly one compatible
value in domain of x and hence, x is functionally determined
by Y . We can see from the table that almost all variables are
functionally determined, and those that are not have a very
high level of functional dependency. This is not surprising
given that the price of the product is the most distinguishing
attribute - behaving similarly as a unique key in a database.
However, to our surprise, price is not functionally dependent
in any of the catalogues! This indicates that in each
catalogue there are identical configurations which have different
prices! This is particularly emphasized for the Lettings
catalogue, where degree of dependence is 0:563. This means that
each configuration of remaining attributes on average has two
different prices.</p>
      <p>After analyzing more closely the variable specification and
original and processed datasets, we discovered that a Street
attribute of the Lettings catalogue was not declared in the
dataset specification. Therefore, the preprocessing step
ignored the corresponding column. Hence, a number of
lettings with identical specifications and in the same region were
treated as identical even though they were offered in different
streets of the same region. For the same reason, the number
of duplicate entries in Table 2 of the Lettings catalogue was
disproportionably large.
7</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>In this paper we presented an approach to computing
functional dependencies over an MDD representation of a
product catalogue. We focused on verifying catalogue
consistencies as an application relevant within an online
configuration/recommendation context. Using functional
dependencies as an analytic tool we discovered that a set of publicly
available product catalogues exhibits specific characteristics
that have not been noted to-date; some of these characteristics
can be regarded as bugs in the catalogue definition or in the
preprocessing step of the catalogues. This warrants caution
in the future investigations in the area of web-based
configuration and recommender systems that rely on preprocessed
forms of product catalogues. The fact that the datasets might
violate some consistency criteria should be taken into account
when drawing conclusions from experimental evaluations.</p>
      <p>We also identified that functional dependencies can be
used to formally reason about the length of user
interaction. This topic fits within recent research about
recommender systems [Felfernig, 2006; Mahmood and Ricci, 2007;
Hadzic and O’Sullivan, 2008] but was not the focus of this
paper. Instead, it would be pursued in future work. In
addition, in the future we plan to further investigate the utility
of decision diagrams for recommending general configurable
products.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>Hadzic is supported by a Post-doctoral Research Fellowship
from the Irish Research Council for Science, Engineering and
Technology. O’Sullivan is supported by Science Foundation
Ireland (Grant Number 05/IN/I886). We would like to thank
anonymous reviewers for their useful comments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[Bryant</source>
          ,
          <year>1986</year>
          ]
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Bryant</surname>
          </string-name>
          .
          <article-title>Graph-based algorithms for boolean function manipulation</article-title>
          .
          <source>IEEE Transactions on Computers</source>
          ,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>[Drechsler</source>
          , 2001]
          <string-name>
            <given-names>Rolf</given-names>
            <surname>Drechsler</surname>
          </string-name>
          .
          <article-title>Binary decision diagrams in theory and practice</article-title>
          .
          <source>International Journal on Software Tools for Technology Transfer (STTT)</source>
          ,
          <volume>3</volume>
          (
          <issue>2</issue>
          ):
          <fpage>112</fpage>
          -
          <lpage>136</lpage>
          , May
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>[Felfernig</source>
          , 2006]
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Felfernig</surname>
          </string-name>
          .
          <article-title>Diagnosing faulty transitions in recommender user interface descriptions</article-title>
          .
          <source>In Advances in Applied Artificial Intelligence</source>
          , pages
          <fpage>869</fpage>
          -
          <lpage>878</lpage>
          . Springer-Verlag,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>[</given-names>
            <surname>Hadzic</surname>
          </string-name>
          and
          <string-name>
            <given-names>O</given-names>
            <surname>'Sullivan</surname>
          </string-name>
          , 2008]
          <string-name>
            <given-names>Tarik</given-names>
            <surname>Hadzic</surname>
          </string-name>
          and
          <string-name>
            <surname>Barry O'Sullivan.</surname>
          </string-name>
          <article-title>Critique graphs for catalogue navigation</article-title>
          .
          <source>In RecSys '08: Proceedings of the 2008 ACM conference on Recommender systems</source>
          , pages
          <fpage>115</fpage>
          -
          <lpage>122</lpage>
          , New York, NY, USA,
          <year>2008</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Huhtala et al.,
          <year>1999</year>
          ] Yka¨ Huhtala, Juha Ka¨rkka¨inen, Pasi Porkka, and
          <string-name>
            <given-names>Hannu</given-names>
            <surname>Toivonen</surname>
          </string-name>
          .
          <article-title>Tane: An efficient algorithm for discovering functional and approximate dependencies</article-title>
          .
          <source>The Computer Journal</source>
          ,
          <volume>42</volume>
          (
          <issue>2</issue>
          ):
          <fpage>100</fpage>
          -
          <lpage>111</lpage>
          ,
          <year>March 1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>[Mahmood and Ricci</source>
          , 2007]
          <string-name>
            <given-names>Tariq</given-names>
            <surname>Mahmood</surname>
          </string-name>
          and
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Ricci</surname>
          </string-name>
          .
          <article-title>Learning and adaptivity in interactive recommender systems</article-title>
          .
          <source>In ICEC '07: Proceedings of the ninth international conference on Electronic commerce</source>
          , pages
          <fpage>75</fpage>
          -
          <lpage>84</lpage>
          , New York, NY, USA,
          <year>2007</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>[Meinel and Theobald</source>
          , 1998]
          <string-name>
            <given-names>C.</given-names>
            <surname>Meinel</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Theobald</surname>
          </string-name>
          .
          <source>Algorithms and Data Structures in VLSI Design</source>
          . Springer,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Nicholson et al.,
          <year>2006</year>
          ]
          <string-name>
            <given-names>Ross</given-names>
            <surname>Nicholson</surname>
          </string-name>
          , Derek Bridge, and Nic Wilson.
          <article-title>Decision diagrams: Fast and flexible support for case retrieval and recommendation</article-title>
          .
          <source>In Proceedings of Eighth European Conference on Case-Based Reasoning (ECCBR</source>
          <year>2006</year>
          ),
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>[Schlimmer</source>
          , 1993]
          <string-name>
            <surname>Jeffrey</surname>
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Schlimmer</surname>
          </string-name>
          .
          <article-title>Efficiently inducing determinations: A complete and systematic search algorithm that uses optimal pruning</article-title>
          .
          <source>In ICML</source>
          , pages
          <fpage>284</fpage>
          -
          <lpage>290</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>[Wegener</source>
          , 2000]
          <string-name>
            <given-names>Ingo</given-names>
            <surname>Wegener</surname>
          </string-name>
          .
          <source>Branching Programs and Binary Decision Diagrams. Society for Industrial and Applied Mathematics (SIAM)</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>