Constructing CP-nets from Users' Past Behaviors

Reza Khoshkangini (University of Padova, Padova, Italy, khosh@math.unipd.it), Maria Silvia Pini (Department of Information Engineering, University of Padova, Padova, Italy, pini@dei.unipd.it), Francesca Rossi (IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, on leave from University of Padova), Dinh Van Tran (University of Padova, Padova, Italy, dinh@math.unipd.it)

ABSTRACT
While recommender systems have significantly improved the task of finding and providing the best service for users in various domains, there are still limitations in extracting users' preferences from their behaviors when they deal with a specific service provider. In this paper we propose a framework to automatically extract and learn users' conditional and qualitative preferences from their past behavior, without asking the users for any information. To do that, we construct a CP-net modeling the users' preferences via a procedure that employs multiple Information Criterion score functions within a heuristic algorithm for learning a Bayesian network. The approach has been validated experimentally on a dataset of real users and the results are promising.

KEYWORDS
Bayesian Network, CP-net, Recommender System

1 INTRODUCTION
Over the past decades, significant efforts have been undertaken by researchers, practitioners and companies to develop various types of recommender systems that meet users' requests [3, 23]. The aim of these systems is to personalize service recommendations for individual users, as well as to aggregate users' preferences in order to recommend a service to a group of users, in domains ranging from movies [26] to restaurants [14, 18], from hotels to items or products, etc.

Advances in recommender systems have been achieved by considering the context, which is basically defined as any information that can be used to characterize the situation of an entity in a particular domain [11]. On the one hand, considering the context can improve the performance of recommender systems, which enhances the satisfaction of users by properly fulfilling their demands [17]. On the other hand, it can make the recommendation task more complex, since changes in the context may cause changes in users' preferences over time. Self-adaptive recommender systems have been developed to overcome such problems in application domains where the context of the users and/or of the services can influence the recommendation [6, 18, 24].

Users' preferences are often qualitative and conditional. For example: if it is sunny, I prefer to go to a restaurant that has a garden with tables outside. CP-nets [5] are a graphical model that represents this kind of preferences in a compact and intuitive way. Briefly, a CP-net is a graph in which each node is labeled with a table describing the user's preference over the alternative values of this node, given the values of its parent nodes. An example of a CP-net is shown in Figure 1(a). CP-nets have already been used to model users' preferences in automated decision making and to model human preferences in real-world applications.

In this paper, we propose a framework to automatically construct CP-nets from users' past behaviors without demanding any information from the users. Users' past behavior (or preferences) is characterized by the set of domain features that are logged in their profiles through their interactions with the system. For example, in the restaurant recommendation domain, the users' past behavior may be defined by the restaurants previously selected by the user, where each restaurant can be seen as a combination of values for a set of features (such as price range, location, type of restaurant, etc.).

To build a CP-net from an agent's past behavior we perform the following steps.
• Given the user's history, for example the previously selected restaurants, we first identify the most important attributes that have influenced the final choices. The selected attributes become the features/nodes of the CP-net.
• Then, we divide these features into three layers: root layer, intermediate layer and target layer. The root layer contains only the root node, which is the most important feature according to both the user's preferences and the procedure used to extract the features. The root node may be, for example, the price of the restaurant, if the options a user considers when selecting a restaurant depend mainly on his budget. The target layer contains only the target node, which is the most dependent feature. The target node is selected according to domain knowledge. For instance, in the restaurant recommendation domain, the target node may be the type of the restaurant (or food), while in the movie recommendation domain, the target node may be the name of the movie. The intermediate layer contains all the other nodes, corresponding to all the other selected features.
• Now that the features are divided into three groups, we need to express the dependencies among them. We assume that there are only outgoing arcs from the root node and only ingoing arcs into the target node, while, to identify the dependency relations between the features in the intermediate layer, we learn a Bayesian Network (BN) from the user's preferences by exploiting scoring functions (such as BIC and AIC [8]) implemented within a hill climbing algorithm. Finally, from the conditional probabilities annotated with each node of the BN, we define the corresponding conditional preferences associated to every node of the CP-net. A minimal code sketch of these steps follows this list.
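To make the pipeline concrete, here is a minimal sketch in R, the language the paper itself relies on (Information Gain via the FSelector package [20], Bayesian network learning via bnlearn [28]). The data frame restaurant_log, the target column Tr and the top-6 cut-off are hypothetical placeholders, not the authors' actual configuration.

```r
# Sketch of the construction pipeline, assuming a data frame
# 'restaurant_log' with one row per past selection, a factor column
# per feature, and 'Tr' (type of restaurant) as the target.
library(FSelector)  # information gain for feature selection
library(bnlearn)    # Bayesian-network structure learning

ig    <- information.gain(Tr ~ ., restaurant_log)  # step 1: rank the features
feats <- cutoff.k(ig, 6)                           # keep the most important ones

root  <- feats[1]   # step 2: root = most informative feature
inter <- feats[-1]  #         remaining features form the intermediate layer

# Step 3: learn the intermediate-layer dependencies by hill climbing
# guided by a score function (BIC here).
bn <- hc(restaurant_log[, inter], score = "bic")

# Step 4: fit the conditional probability tables, which are later
# converted into the CP-net's conditional preference tables (Section 3.4);
# the root and target layers are then attached by thresholded
# dependency measures (Section 3.5).
cpt <- bn.fit(bn, restaurant_log[, inter])
```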
Experimental results show that the presented approach for building CP-nets from users' past behavior is promising in the restaurant and movie recommendation domains. In the experiments performed on real data-sets, the scoring function BIC turned out to be the best one.

The proposed approach to constructing CP-nets from users' histories is important for two reasons. The first is that users' preferences are essentially qualitative, and capturing these qualitative tendencies increases the knowledge the system has of the users' preferences. The second follows from the first: this construction is significantly important when the service system has to deal with context, which brings up the dynamicity challenge in a real-time recommender system [18]. Changes in the context lead to changes in the users' preferences over time. Hence, modeling users' preferences in a qualitative form enables self-adaptive recommender systems [18] to increase the users' satisfaction. For example, assume we know from the context that the road to the recommended restaurant is blocked by traffic; the system should then recommend another restaurant, starting from the built CP-nets modeling the users' preferences.

Modeling and learning users' preferences expressed via CP-nets has been studied extensively with various techniques, such as observing users or asking them multiple questions [4]. In some studies, researchers start by assuming a dependency structure (the user's CP-net) and then try to learn the user's conditional preferences [9]. Bigot et al. [4] discuss learning Probabilistic CP-nets (PCP-nets), introduced in [10], in two settings (online and off-line), using Bayesian networks to learn the PCP-nets; in both settings they assume the dependency graph is given and then ask multiple queries to the users to build up and learn the structure of the network. Similarly, Guerin et al. [15] present an algorithm for learning CP-net preferences by interacting with users rather than by using users' histories. Learning conditional preferences may be a tedious and costly task, even with acyclic CP-nets. However, the complexity of the problem can be reduced by interacting with the users to simplify the learning procedure. For instance, Koriche et al. [19] propose an approach that identifies a preference ordering over binary domains using membership queries. Similar work has been done in [12], where pairwise comparisons O_i > O_j between outcomes are used, under certain assumptions/constraints on the inputs, to learn the structure of the users' CP-nets.

The rest of the paper is organized as follows. Section 2 provides some basic notions of CP-nets and Bayesian networks. Section 3 presents our CP-net constructor, and Section 4 describes the experimental evaluation and the results. Finally, Section 5 summarizes the paper and provides some hints for future work.

2 CP-NET AND BAYESIAN NETWORK
In this section we present the key notions of CP-nets and Bayesian Networks.

CP-net: A CP-net [5] is a graphical model to represent conditional and qualitative preference relations between variables (aka features). Assume there is a set of variables V = {X_1, ..., X_n} with finite domains D(X_1), ..., D(X_n). For each variable X_i, the user specifies a set of parents P(X_i) that can affect her preferences over the values of X_i. This defines a dependency graph in which every variable X_i has P(X_i) as its immediate predecessors. For each node X_i, there is a conditional preference table that shows, for each possible combination of parent values, the preference over the values of X_i.

An example of a CP-net is shown in Figure 1(a). It contains three features (aka variables) X_1, X_2 and X_3, standing for the Price, the Quality and the Type of restaurant, respectively. X_1 is an independent variable, while X_2 depends on X_1, and X_3 depends on both X_1 and X_2. This CP-net models the fact that the user strictly prefers a restaurant with a medium price (x_1^1) to an expensive one (x_1^2), while his preference between "medium" quality (x_2^1) and "high" quality (x_2^2) is conditioned on the selected price. In addition, the preference over the type of restaurant, Chinese (x_3^1) or Mexican (x_3^2), depends on both the quality of the restaurant and the price.
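Written as ceteris-paribus preference statements, the Figure 1(a) example has the shape below. Only the unconditional statement on X_1 is spelled out in the text, so the conditional rows are illustrative assumptions about what the figure's tables might contain:

```latex
% Ceteris-paribus statements for the CP-net of Figure 1(a).
% Only the first (unconditional) statement is given in the text;
% the conditional rows are illustrative placeholders.
\begin{align*}
X_1 &:\; x_1^1 \succ x_1^2 \\
X_2 &:\; x_1^1 : x_2^1 \succ x_2^2 \qquad x_1^2 : x_2^2 \succ x_2^1 \\
X_3 &:\; x_1^1 \wedge x_2^1 : x_3^1 \succ x_3^2 \quad
      \text{(and one row for each of the other three parent contexts)}
\end{align*}
```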
Bayesian Network (BN): A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG) G = (V, E_G) [7, 16], where V is the set of features and E_G is the set of directed arcs (dependencies) between the features, e.g., X_i → X_j; a constraint in BNs forbids any directed cycle (similarly to the concept of an acyclic CP-net). The strength of the correlation between the features is defined by the second component of the BN: a set of parameters that gives the conditional probability distributions of the variables.

Considering n variables V = {X_1, ..., X_n} in a BN, the joint distribution P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) can be factorized as follows:

    P(X_1, ..., X_n) = P(X_1) × P(X_2 | X_1) × ... × P(X_n | X_1, ..., X_{n-1})
                     = ∏_i P(X_i | X_1, ..., X_{i-1})                        (1)

In a BN, the value of each node X_i ∈ V depends only on the values of its parents, so that P(X_1, X_2, ..., X_n) = ∏_i P(X_i | Pa(X_i)); a node without parents is an independent node. For each node X_i, there is a conditional probability table that shows, for each possible combination of parent values, the probability distribution over the values of X_i.

An example of a BN is shown in Figure 1(b), where the probability that a user targets a Chinese (x_3^1) or a Mexican (x_3^2) restaurant for dinner depends on the values of the node's parents, which are the Price (X_1) and the Quality (X_2) of the restaurant. It can be seen that the value of the Quality of the restaurant depends in turn on the Price (X_1). The conditional probability tables associated with the variables are also shown in Figure 1(b). For example, the table associated with variable X_1 shows that the first value in the domain of X_1 has probability 0.2. The other conditional probability tables list the probability that the child variable takes each of the values in its domain, for every combination of values in the domains of its parents.

Figure 1: A CP-net and a Bayesian Network.

3 TECHNICAL APPROACH
This section describes our approach to building a CP-net that represents the conditional and qualitative preferences of a user, starting from the user's history. For the sake of clarity, we explain how the constructor works in the following subsections through examples of preferences taken from the restaurant recommendation domain.

3.1 Feature Selection
Among the hundreds of attributes logged in the data-set, sourced from the users' interactions while searching for and selecting appropriate services (e.g., selecting the desired restaurant, in this context), we must select the features that most influence the users' preferences. Thus, given a set of variables/features (from the data set) V = {X_1, X_2, ..., X_n}, without loss of generality we assume that X_n is the variable corresponding to the target node of the CP-net, which is the most constrained variable. In this context, the target node corresponds to the type of restaurant/food, whose values may be the types of the restaurants that the user has selected before, as logged in his profile. We refer to X_n ≡ X_t as the target variable. The values of the target variable depend on the set of remaining variables V \ {X_n}.

Since every variable (among the features that a user may consider when selecting a service/restaurant, e.g., price, location, etc.) may influence the user in targeting a service/restaurant, a feature selection process is required to enhance the effectiveness of the framework. In other words, in feature selection we aim to nominate a subset V_s ⊂ V, with |V_s| = m for a given m, containing the features that most leverage the values of the target feature. Thus, to select the most important features to be considered in the constructor space, we use Information Gain¹, which is widely used in the state of the art for feature selection [21]. This method selects the feature/attribute with the greatest information gain (more details about the function can be found in [21]). Eventually, the list of features in V_s is sorted in decreasing order of importance, building the list {X_0, X_1^s, ..., X_m^s}. Then, the system attaches X_t (as the target node) at the end of the list, obtaining V = {X_0, X_1^s, ..., X_m^s, X_t}, the desired output of the feature selection process. Once this is done, the system breaks the sorted features down into three layers, detailed in the following section.

¹ To use the Information Gain function, we took advantage of the FSelector [20] library in R.
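As the footnote indicates, the feature selection step relies on the FSelector library in R. A minimal sketch of this step, with a hypothetical data frame restaurant_log, target column Tr and m = 6:

```r
# Rank the features by information gain w.r.t. the target (Section 3.1)
# and keep the top m. 'restaurant_log' and m = 6 are assumptions.
library(FSelector)

ig <- information.gain(Tr ~ ., restaurant_log)  # one importance score per feature
Vs <- cutoff.k(ig, k = 6)                       # names of the top-6 features, sorted
V  <- c(Vs, "Tr")                               # append the target node X_t at the end
```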
3.2 Layers Extraction
Given the above list V = {X_0, X_1^s, ..., X_m^s, X_t}, we aim to build a directed acyclic graph that consists of three layers: the Root layer, the Intermediate layer, and the Target layer. Hereafter, we describe in detail each layer and the links connecting the nodes inside the graph. Notice that from now on the terms "node" and "feature" refer to the same concept.

• Root Layer: this layer contains only the root node, which is the most important feature among those in the list. In other words, given the list V, the first feature X_0 is considered the root node. Since X_0 is an independent feature, it has no incoming links from the other nodes. As an example, consider the restaurant selection domain, where the options a user considers when selecting a restaurant depend mainly on his budget: in this case, the price of the restaurant could be the feature corresponding to the root node.
• Intermediate Layer: the main procedure for extracting the users' conditional preferences, on the basis of the strength of the relations between features under certain conditions or thresholds, is executed in this layer. This layer contains all the nodes except the Root and Target nodes, i.e., {X_1^s, ..., X_m^s}. To set the internal links between intermediate nodes, we need to measure the dependence between every pair of nodes (X_i^s, X_j^s). The algorithm adds a link between those nodes whose dependence value is greater than or equal to a given threshold. This threshold could be determined automatically or manually; in this paper we fix it manually, as described in Section 4. For example, assuming that the computed dependencies between the features in the Intermediate layer lie in [0, 1], we can set the threshold to Tr = 0.7.
• Target Layer: as the name implies, this layer contains the target node X_t, which represents the user's preference (the last attribute in the sorted list) in the specific domain. This layer has only incoming links, from the nodes located in the Intermediate layer or in the Root layer.

The action flow is shown in Figure 2, where the selected features in the list V (Fig. 2(a)) are segmented into three layers (Fig. 2(b)). Then, the proposed constructor integrates the layers (Fig. 2(c)) based upon the dependencies between the features in the second layer, obtained by exploiting the score functions and the algorithm explained in Section 3.3. A small code sketch of the layer split follows Figure 2.

Figure 2: The CP-net Constructor Frontend: (a) the set of features in the list V; (b) the empty starting graph with three layers; (c) the final CP-net graph.
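A trivial sketch of the layer split, assuming V is the sorted character vector of feature names from Section 3.1 with the target name last:

```r
# Split the sorted list V into the three layers of Section 3.2.
root_layer   <- V[1]                 # X_0: the most informative feature
target_layer <- V[length(V)]         # X_t: e.g., the type of restaurant
inter_layer  <- V[-c(1, length(V))]  # X_1^s ... X_m^s
```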
3.3 Feature Dependency Measurement and Constructing CP-net
Due to the similarity between the concepts described above, we exploit Bayesian network score functions to construct the main shape of the users' CP-nets in the second layer.

Many algorithms and techniques have been developed to tackle the problem of building a Bayesian Network, and their performance varies with the score function used, from data set to data set and domain to domain. Hence, we implement the framework with various kinds of score functions, to decrease misclassification and to identify the technique that best accommodates our data and provides the highest performance.

Score functions for Bayesian networks fall into two main categories: Information Theory scores, such as the Mutual Information Test (MIT), the Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC), Log-Likelihood (LL), and Normalized Maximum Likelihood (NML); and Bayesian scores, such as Bayesian Dirichlet (BD) and K2 [8].

In general, Bayesian score functions try to find a network B with maximum score S_G(B, D) given the data D [28]. In these functions, the procedure for constructing the network starts from a prior distribution and then computes the posterior probability distribution; the network with the maximum score is the best among the possible networks. Information Theory functions, instead, act on the basis of compression [2]: they attempt to recognize the most compact model over the data D, so the model with the minimum score is the best among the others.

To show how the constructor and the score functions work, we use the Bayesian Information Criterion (BIC) [22] throughout this paper, but we implement the constructor with all the applicable functions (in the second layer) to evaluate the performance of the algorithm in the specific domain.

The Bayesian Information Criterion [22], also known as the "Schwarz criterion", is the most popular of these score functions and is based on the likelihood function. Although the performance of a score function differs considerably from data set to data set, BIC has been shown to consistently outperform the other score functions in constructing Bayesian network structures on different data sets [22]. For this score function, a lower value indicates a better fit, i.e., the minimum information loss with respect to the candidate model, as shown in Equation 2:

    BIC = −2 ln(L̂) + k ln(N)                        (2)

where k is the number of free parameters to be estimated, N is the number of observations (the sample size), and L̂ is the maximum value of the likelihood function, attained at the parameter set θ̂. A lower BIC value indicates that the obtained model is more likely to be the best and true network model among the candidates. More details about how these functions work are presented in [25].
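As a concrete counterpart, bnlearn [28] can compute these scores for a candidate structure. The data frame dat is hypothetical (discrete features named after Figure 1: Pr = price, Q = quality, Tr = type). Note the sign convention: Equation (2) is minimized, while bnlearn returns a rescaled BIC (log-likelihood minus penalty) that is maximized, so the two orderings of candidate models agree.

```r
# Score a candidate Bayesian-network structure with BIC and AIC.
library(bnlearn)

dag <- model2network("[Pr][Q|Pr][Tr|Pr:Q]")  # the structure of Figure 1
score(dag, dat, type = "bic")                # BIC of this candidate
score(dag, dat, type = "aic")                # AIC, for comparison
```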
Hence, to apply the above functions and construct the desired network in the second layer, we borrow the structures shown in [1], where various algorithms are discussed, such as Simulated Annealing, heuristic algorithms, and Genetic algorithms. Taking advantage of greedy search and heuristic algorithms [13], we use the Hill Climbing (HC) algorithm to execute the BIC, AIC, MIT, LL, BD, NML, etc. functions and obtain the structure of the users' CP-nets in the second layer, following the procedure shown in Algorithm 1.

Recalling the feature selection step and the splitting of the list V into three layers, HC starts with an empty graph G in the second layer and attempts to find a model with the best score by incrementally searching among the possible models in its local neighborhood. In other words, this is an optimization procedure that begins with an arbitrary structure of the network and then tries to find a better network by incrementally improving the score: whenever a model with a better score is found, it replaces the old model. These steps are repeated until no model with a better score can be found. Although the algorithm can get stuck in a local optimum, depending on the starting point, we take advantage of its high performance and accuracy to build the network.

Algorithm 1: CP-net Constructor Algorithm
  Rnode                           ▷ root node
  Inodes                          ▷ intermediate nodes
  Tnode                           ▷ target node
  ScoreG                          ▷ score of the current graph G
  ScoreNG                         ▷ score of a new candidate graph
  function Construct:
    start with the empty graph G
    ScoreG ← Score(G)
    while the score of G can still be improved do
      apply an acyclicity-preserving operation to G: add, delete or reverse an arc
      if ScoreNG is better than ScoreG then
        G ← the new candidate; ScoreG ← ScoreNG
    DepMatrix ← dependency(Rnode, Inodes, Tnode)
    if DepMatrix(Rnode, Inodes ∪ Tnode) ≥ Tr then
      connect Rnode → Inodes and Tnode
    if DepMatrix(Inodes, Tnode) ≥ Tr then
      connect Inodes → Tnode

In the following sections we show how the probabilities of the learned network are converted into preferences (Section 3.4) and how the layers are connected to define the CP-net (Section 3.5); a concrete hill-climbing call is sketched below.
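The hill-climbing search itself is available off the shelf in bnlearn, whose hc() starts from the empty graph and greedily applies the add/delete/reverse arc operators, keeping the graph acyclic, until the score stops improving; this mirrors the search loop of Algorithm 1. The data frame dat_inter (the intermediate-layer columns only) is hypothetical:

```r
library(bnlearn)

bn_bic <- hc(dat_inter, score = "bic")     # BIC-guided search
bn_k2  <- hc(dat_inter, score = "k2")      # Bayesian K2 score
bn_ll  <- hc(dat_inter, score = "loglik")  # plain log-likelihood

arcs(bn_bic)  # the learned dependency arcs X_i -> X_j
```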
3.4 Converting Probability to Preferences
This section shows how to interpret the strength of the correlation between nodes as user preferences under the CP-net semantics. To this end, the system takes the value of a feature with the maximum probability as the user's preference. This means that, if a node in the list V, e.g., Price, has two values in its domain, such as cheap and expensive, the system picks from the conditional probability table computed for that node in the Bayesian Network the value (e.g., cheap) with the highest probability as the user's preference (see the table associated with X_2 in Figure 3).

Once the structure of the network in the second layer is obtained, the system converts the probabilities into preferences as follows. The procedure starts from the independent nodes. The most probable value of a feature is considered the preferred value among the possible values. The independent nodes (as parents) influence the values of the remaining nodes (as children) on the basis of the probability tables, as shown in Figure 3. To better understand the conversion, consider v = {X_2, X_3, X_4}, v ⊂ V, with three features and binary domains D(X_2) = {x_2^1, x_2^2}, D(X_3) = {x_3^1, x_3^2} and D(X_4) = {x_4^1, x_4^2}. As shown in Figure 3, X_2 influences X_3 and X_4 in the second layer, and similarly X_4 influences X_3; hence, X_3 is identified as the most dependent node, with its own conditional probability table. In the conditional preference table of the CP-net associated with X_2, we have x_2^2 preferred to x_2^1, since the probability of x_2^2 is 0.8 while the probability of x_2^1 is 0.2. Then, if x_2^2 is the preferred value for X_2, x_4^1, with probability 0.7, is the preferred value for X_4. Similarly, the preferred value of X_3 depends on the values of X_2 and X_4 according to the probabilities. Here we gave a simple example with binary domains, but the approach extends to any number of values in the domains of the features in the sorted list V. A code sketch of this conversion follows Figure 3.

Figure 3: From a Bayesian network to the corresponding CP-net.
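A sketch of this conversion with bnlearn, under the assumption that bn is the learned structure and dat the user's data (both hypothetical), with node names as in the X_2/X_3/X_4 example:

```r
# For every parent context, order the values of a node by decreasing
# probability; the most probable value becomes the preferred one.
library(bnlearn)

fit <- bn.fit(bn, dat)   # conditional probability tables of the BN
cpt <- fit$X3$prob       # CPT of X3: values of X3 x parent contexts

pref <- apply(cpt, seq_along(dim(cpt))[-1],
              function(p) paste(names(sort(p, decreasing = TRUE)),
                                collapse = " > "))
pref                     # one preference row per parent context, e.g. "x31 > x32"
```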
3.5 Connecting the Layers
This step integrates the Root and the Target nodes with the nodes located in the Intermediate layer. Since the Root node dominates the other nodes of the two lower layers, the system generates a matrix of dependencies between them. Practically, the aim is to find the strength of the relations between the Root and the rest of the nodes; to obtain these dependencies we use the Chi-square, Information Gain, Gain Ratio and Symmetrical Uncertainty functions [21]. The generation of the dependency matrix is broken down into two parts. First, it is applied between the Root layer and the Intermediate and Target layers together (since the Root node dominates all the nodes located in these two layers). Second, it is applied between the Intermediate layer and the Target layer. Having these two dependency matrices at hand, we set a threshold CV ∈ [0, 1], as a Confidence Value, to eliminate the links between layers that do not meet the threshold. This helps to single out the most important dependencies within the user's CP-net, i.e., those that characterize the user's preferences. The described work-flow produces the final CP-net characterizing the user's preferences.

To validate the correctness of the obtained CP-nets, we carried out cross validation on real data-sets, as detailed in the next section.

4 EXPERIMENTAL EVALUATION AND RESULTS
The main task of recommender systems is to provide the best personalized service for users in various domains. In this paper, we validate our procedure for constructing users' CP-nets from their past behaviors in the restaurant and movie recommendation domains. In the following, we describe the experimental setup in detail.

4.1 Data Collection
We exploited two real datasets of users' preferences, one over a set of restaurants in Mexico City and one over movies in Denmark and the UK, both from the UCI repository². The former records users' restaurant selections: it contains 1116 samples, each carrying 11 different features, where each sample indicates the selection of one restaurant by a user. The latter records users' movie ratings: it includes 2297 instances from around 122 unique users, where every sample holds 28 different features. Due to the page limit, we present only the features of the restaurant selection dataset, in Table 1, which shows the description and the abbreviation of each feature.

² https://archive.ics.uci.edu/ml/datasets.html

Table 1: Data-set Information

#  | Feature | Restaurant Selection: Description and domain
1  | Ub | user's budget (low, medium, high)
2  | Q  | quality of the restaurant (low, medium, high)
3  | Al | whether the restaurant serves alcohol (yes, no)
4  | Sm | smoking area inside the restaurant (yes, no)
5  | Ac | accessibility of the restaurant (easy, medium, difficult)
6  | Pr | restaurant price (low, medium, high)
7  | Ar | outdoor or indoor restaurant
8  | Os | other services, like internet, etc. (yes, no)
9  | Pk | whether the restaurant has a parking area (yes, no)
10 | Ur | restaurant rating given by users/people (low, medium, high)
11 | Tr | type of restaurant (Italian, Latin-American, Asian)

We first group the samples that belong to the same user. Next, we exclude the users who have recorded fewer than 10 selections or ratings, since a small number of samples does not allow the system to learn and make inferences efficiently. Thus, in the first dataset we select only the 18 users with at least 10 recorded samples, while the users whose ratings are logged in the movie dataset all have enough instances, so all of its unique users are included.

4.2 Evaluation & Model Selection
To achieve the best CP-net structure representing a user's preferences, we execute the algorithm with a range of parameters, where each combination of parameter values specifies a specific CP-net structure. In each round, the grid search technique is employed to find the optimal tuple of hyperparameter values, and hence the best model for inference.

Evaluation Setting for Dataset 1 (Restaurant): Since we have a limited number of samples per user in the Restaurant dataset, iterative leave-one-out cross validation is the best option for our multi-class classification problem. In each round, one sample is left out as the test set and the remaining samples are used to learn a CP-net with our proposed method and to train the model using the Bayesian method. In the following, we list the parameters and the sets of values used to construct the candidate CP-nets.

• Number of selected features: we run the algorithm with various numbers of features to find the number that provides the best result; we start from all 11 features and gradually reduce to the top 4 features in the list.
• Dependency: {0.5, 0.6, 0.7}. Since the dependencies between features are defined in the interval [0, 1], we tune this threshold, as with the number of features, to find the best results.
• Direction: {0.5, 0.6, 0.7}. We also tune the algorithm with three values to determine the direction of the dependency between the features.
• Threshold (Tr) for connecting the first and the second layer: {0.03, 0.04, 0.05}. As mentioned above, we use different functions to calculate the dependencies between layers, and thus may need different threshold values.
• Threshold (Tr) for connecting the second and the third layer: {0.03, 0.04, 0.05}.

Evaluation Setting for Dataset 2 (Movie): In contrast to Dataset 1, the unique users in the Movie dataset have a sufficient number of instances. Thus, 10-fold cross validation is used to validate the stability of the approach when fed new data: in each round, 10% of the data is left out as the test set and the remaining 90% is used to learn a CP-net with the proposed constructor and to train the model with the Bayesian method. Similar settings and parameters are used for the number of selected features, dependency, direction and thresholds. A code sketch of the Dataset 1 leave-one-out loop is given below.
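A sketch of the leave-one-out loop, assuming a per-user data frame user_log of factor features with target Tr; the learning and prediction calls use bnlearn, and all names are hypothetical:

```r
library(bnlearn)

hits <- logical(nrow(user_log))
for (i in seq_len(nrow(user_log))) {
  train <- user_log[-i, ]               # all samples but one
  test  <- user_log[i, , drop = FALSE]  # the held-out sample
  fit   <- bn.fit(hc(train, score = "bic"), train)
  # most probable value of the target given the observed features
  pred  <- predict(fit, node = "Tr", data = test, method = "bayes-lw")
  hits[i] <- as.character(pred) == as.character(test$Tr)
}
mean(hits)  # leave-one-out accuracy for this user
```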
Once the model is trained, the algorithm computes a score for each restaurant or movie (rating) option. This score shows how likely the restaurant is to be chosen according to the user's interests; the system then selects the restaurant with the highest probability value as the user's preference.

To evaluate the performance of the method we use the following well-known metrics: (i) Precision, the fraction of retrieved instances that are relevant; (ii) Recall, the fraction of relevant instances that are retrieved; and (iii) F-measure, a weighted average of precision and recall [27]. The performance on each class is computed (for every user), and the average over the performance on all classes is taken as the performance of the method for that user. The standard definitions are recalled below.
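Recalling the standard per-class definitions from [27] (assuming the balanced F_1 form of the F-measure), with TP, FP and FN the per-class true-positive, false-positive and false-negative counts:

```latex
\begin{align*}
\text{Precision} &= \frac{TP}{TP+FP}, &
\text{Recall}    &= \frac{TP}{TP+FN}, &
F\text{-measure} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}
                         {\text{Precision} + \text{Recall}}
\end{align*}
```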
4.3 Results
Table 2 presents our preliminary experimental results in the restaurant and movie recommendation domains. In particular, it shows the performance of our approach with the various score functions, measured in terms of Precision, Recall and F-measure.

Table 2: The performance of the proposed method on the two data-sets.

Dataset 1 (Restaurant):
Score Function | Precision | Recall | F-measure | Description | # of Features | Arc Dep Tr | Layer Tr
BIC    | 0.71 | 0.708 | 0.706 | applicable     | 6 | 0.7 | 0.03
AIC    | 0.70 | 0.69  | 0.694 | applicable     | 7 | 0.6 | 0.05
Loglik | x    | x     | x     | not applicable | x | x   | x
K2     | x    | x     | x     | causing cycle  | x | x   | x
MDL    | 0.54 | 0.57  | 0.555 | applicable     | 4 | 0.5 | 0.05

Dataset 2 (Movie):
Score Function | Precision | Recall | F-measure | Description | # of Features | Arc Dep Tr | Layer Tr
BIC    | 0.51 | 0.52 | 0.52  | applicable     | 8 | 0.4 | 0.03
AIC    | x    | x    | x     | causing cycle  | x | x   | x
Loglik | 0.54 | 0.43 | 0.485 | applicable     | 8 | 0.5 | 0.03
K2     | 0.55 | 0.42 | 0.485 | applicable     | 7 | 0.6 | 0.03
MDL    | x    | x    | x     | not applicable | x | x   | x

Result for Dataset 1: As shown in Table 2, among the score functions implemented within the hill-climbing algorithm, only BIC, AIC and MDL were applicable on our data-set. The parameters automatically set to generate these results vary from function to function, since the functions use different strategies to compute their scores when constructing the BN: BIC obtained its best result with 6 features, 0.7 arc dependency and 0.03 dependency between layers, while AIC obtained its best result with 7 features, 0.6 arc dependency and 0.05 layer dependency. The precision and recall values of BIC and AIC are very similar, around ~0.70, with BIC performing slightly better on both. In contrast, MDL provides poor results, with 0.54, 0.57 and 0.555 in precision, recall and F-measure, respectively.

It is noticeable that BIC, AIC and MDL with smaller or greater numbers of features (than reported in Table 2) mostly result in incomplete and/or very complex and cyclic graphs. In addition, we have not reported results for K2 and Loglik on this dataset, since with these score functions the approach often constructs CP-nets with cycles, especially after appending the nodes of the other layers in the final step.

Result for Dataset 2: Analogously to the first evaluation, the BIC function works slightly better in all metrics, with precision 0.51, recall 0.52, F-measure 0.52 and ~0.50 accuracy, w.r.t. K2 and Loglik on this dataset (Figure 4 shows a complex CP-net constructed by BIC with 8 features). However, these figures are weaker than the results of the first assessment on the Restaurant dataset.

In contrast to the first experiment, both K2 and Loglik were applicable here, reporting around ~0.49 in F-measure. These low figures may be due to the presence of 5 classes in the target layer (the target node has 5 values, Very Bad, Bad, Medium, Good and Very Good, indicating the users' ratings of movies), which makes classification and prediction arduous. Taking 20% accuracy as the baseline for random classification, the obtained figures are about 30 percentage points above the baseline, which is an admissible result. However, further investigation is required, both in design and in evaluation, to increase and assess the performance of the constructor.

Figure 4: A complex CP-net with 8 features.

5 CONCLUSION
We have presented a system for automatically constructing CP-nets modeling users' preferences from their past interactions with a service provider. To construct the user's CP-net we first construct a Bayesian network, exploiting the heuristic Hill Climbing algorithm to run various score functions, such as BIC, AIC, Loglik, K2 and MDL, in order to recognize the best graph model among all the possible models. Empirical results on the restaurant selection and movie domains have shown that this constructor may have a significant impact on enhancing users' satisfaction.

This construction may be very useful when the recommender system must deal with context and the users' preferences may change over time. We plan to integrate the proposed constructor in the Self-adaptive Context-Aware Recommender System (SaCARS) presented in [18]. This integration will allow SaCARS to learn and model the users' conditional preferences entirely from their behavior, without human interference.

REFERENCES
[1] Stuart Russell and Peter Norvig. [n. d.]. Artificial Intelligence: A Modern Approach. Prentice Hall.
[2] Hirotogu Akaike. 1998. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike. Springer, 199–213.
[3] Xavier Amatriain and Justin Basilico. 2016. Past, Present, and Future of Recommender Systems: An Industry Perspective. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys '16). ACM, New York, NY, USA, 211–214. https://doi.org/10.1145/2959100.2959144
[4] Damien Bigot, Jérôme Mengin, and Bruno Zanuttini. 2014. Learning Probabilistic CP-nets from Observations of Optimal Items. In STAIRS. 81–90.
[5] Craig Boutilier, Ronen I. Brafman, Carmel Domshlak, Holger H. Hoos, and David Poole. 2004. CP-nets: A Tool for Representing and Reasoning with Conditional Ceteris Paribus Preference Statements. J. Artif. Int. Res. 21, 1 (Feb. 2004), 135–191. http://dl.acm.org/citation.cfm?id=1622467.1622473
[6] Yuriy Brun, Ron Desmarais, Kurt Geihs, Marin Litoiu, Antonia Lopes, Mary Shaw, and Michael Smit. 2013. A design space for self-adaptive systems. In Software Engineering for Self-Adaptive Systems II. Springer, 33–50.
[7] Luis M. de Campos. 2006. A scoring function for learning Bayesian networks based on mutual information and conditional independence tests. Journal of Machine Learning Research 7 (Oct. 2006), 2149–2187.
[8] Alexandra M. Carvalho. 2009. Scoring functions for learning Bayesian networks. INESC-ID Tec. Rep. (2009).
[9] Yann Chevaleyre, Frédéric Koriche, Jérôme Lang, Jérôme Mengin, and Bruno Zanuttini. 2010. Learning ordinal preferences on multiattribute domains: The case of CP-nets. In Preference Learning. Springer, 273–296.
[10] Cristina Cornelio, Judy Goldsmith, Nicholas Mattei, Francesca Rossi, and K. Brent Venable. 2013. Dynamic and probabilistic CP-nets. MPREF 2013 (2013).
[11] Anind K. Dey. 2001. Understanding and Using Context. Personal Ubiquitous Comput. 5, 1 (Jan. 2001), 4–7. https://doi.org/10.1007/s007790170019
[12] Yannis Dimopoulos, Loizos Michael, and Fani Athienitou. 2009. Ceteris Paribus Preference Elicitation with Predictive Guarantees. In IJCAI, Vol. 9. 1–6.
[13] Thomas A. Feo and Mauricio G. C. Resende. 1995. Greedy randomized adaptive search procedures. Journal of Global Optimization 6, 2 (1995), 109–133.
[14] Yanjie Fu, Bin Liu, Yong Ge, Zijun Yao, and Hui Xiong. 2014. User preference learning with multiple information fusion for restaurant recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining. SIAM, 470–478.
[15] Joshua T. Guerin, Thomas E. Allen, and Judy Goldsmith. 2013. Learning CP-net preferences online from user queries. In International Conference on Algorithmic Decision Theory. Springer, 208–220.
[16] Finn V. Jensen. 1996. An Introduction to Bayesian Networks. Vol. 210. UCL Press, London.
[17] Reza Khoshkangini, Maria Silvia Pini, and Francesca Rossi. 2016. A design of context-aware framework for conditional preferences of group of users. In Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing. Springer, 97–112.
[18] Reza Khoshkangini, Maria Silvia Pini, and Francesca Rossi. 2016. A self-adaptive context-aware group recommender system. In Conference of the Italian Association for Artificial Intelligence. Springer, 250–265.
[19] Frédéric Koriche and Bruno Zanuttini. 2010. Learning Conditional Preference Networks. Artif. Intell. 174, 11 (July 2010), 685–703. https://doi.org/10.1016/j.artint.2010.04.019
[20] Miron B. Kursa, Witold R. Rudnicki, et al. 2010. Feature selection with the Boruta package. J. Stat. Softw. 36, 11 (2010), 1–13.
[21] Changki Lee and Gary Geunbae Lee. 2006. Information gain and divergence-based feature selection for machine learning-based text categorization. Information Processing & Management 42, 1 (2006), 155–165.
[22] Zhifa Liu, Brandon Malone, and Changhe Yuan. 2012. Empirical evaluation of scoring functions for Bayesian network model selection. BMC Bioinformatics 13, Suppl. 15 (2012), S14.
[23] Jie Lu, Dianshuang Wu, Mingsong Mao, Wei Wang, and Guangquan Zhang. 2015. Recommender System Application Developments. Decis. Support Syst. 74, C (June 2015), 12–32. https://doi.org/10.1016/j.dss.2015.03.008
[24] Alberto Marchetti-Spaccamela, Andrea Vitaletti, Luca Becchetti, and Ugo Colesanti. 2008. Self-Adaptive Recommendation Systems: Models and Experimental Analysis. In 2008 Second IEEE International Conference on Self-Adaptive and Self-Organizing Systems (SASO). 479–480. https://doi.org/10.1109/SASO.2008.55
[25] Radford M. Neal. 2012. Bayesian Learning for Neural Networks. Vol. 118. Springer Science & Business Media.
[26] Chihiro Ono, Mori Kurokawa, Yoichi Motomura, and Hideki Asoh. 2007. A Context-Aware Movie Preference Model Using a Bayesian Network for Recommendation and Promotion. In Proceedings of the 11th International Conference on User Modeling (UM '07). Springer-Verlag, Berlin, Heidelberg, 247–257. https://doi.org/10.1007/978-3-540-73078-1_28
[27] David Martin Powers. 2011. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. (2011).
[28] Marco Scutari. 2009. Learning Bayesian networks with the bnlearn R package. arXiv preprint arXiv:0908.3817 (2009).