Kernelized Collaborative Contextual Bandits

Leonardo Cella (Politecnico di Milano, Milan, Italy) — leonardo.cella@mail.polimi.it
Romaric Gaudel (Univ. Rennes, Ensai, CREST, Rennes, France) — romaric.gaudel@ensai.fr
Paolo Cremonesi (Politecnico di Milano, Milan, Italy) — paolo.cremonesi@polimi.it

ABSTRACT
We tackle the problem of recommending products in the online recommendation scenario, which arises in many real applications; the best-known and most-studied instances are news recommendation and advertising. We propose an extension of state-of-the-art bandit models that not only accounts for the interactions of different users, but also goes beyond the linearity assumption on the expected reward. As an applicative case, consider situations in which the number of actions (products) is too large to sample each of them even once, while at the same time there are several changing users to serve content to.

KEYWORDS
Recommender Systems, Contextual Bandits, Kernel Methods

ACM Reference format:
Leonardo Cella, Romaric Gaudel, and Paolo Cremonesi. 2017. Kernelized Collaborative Contextual Bandits. In Proceedings of RecSys 2017 Posters, Como, Italy, August 27-31, 2 pages.

RecSys 2017 Poster Proceedings, August 27-31, Como, Italy
© 2017 Copyright held by the author(s).

1 INTRODUCTION AND RELATED WORKS
In the Recommender Systems (RS) field, the most valuable information to rely on is user interactions. This is why Collaborative Filtering methods are the current state of the art, or at least the methods that contribute the most to recommendation quality. On the web there are many real applications, such as computational advertising, news recommendation, and online streaming, that do not fit the classical recommendation scenario: both the set of active users and the set of available products are very fluid, and therefore change over time [2].

In this online recommendation scenario, in particular when contexts are available besides item feature vectors, multi-armed bandit techniques have proven to be an excellent solution and constitute the current state of the art. Most previous efforts on contextual bandits looked at the recommendation problem from a single-user standpoint; only a few preliminary works follow the collaborative direction [6].

We also want to consider scenarios where the set of products is too large to be explored entirely, which is why we exploit kernel methods [7]. They provide a way to capture non-linear relationships between the primal context features and the observed rewards, relying on similarity information between contexts. It is also worth mentioning that there are settings where context similarities are the only available information [3]. Previous approaches to contextual bandit problems usually assume that the feature-reward mapping is linear ([1], [4], [5]); the only exception is [8].

In a nutshell, in this paper we show that Collaborative Bandits [6] can be extended, through the kernel trick, beyond the linearity assumption. Our modeling assumption is that the expected reward obtained after recommending a product can be expressed as a function of both the product features and other users' interactions with different items. To this end, we introduce a contextual multi-armed bandit model that relies on kernel methods to go beyond the linearity assumption, and on graphs to capture the collaborative aspect. The following section details the modeling assumptions, and Section 3 presents the proposed algorithm.

2 LEARNING MODEL
We assume that the learning process is divided into T discrete rounds t = 1, ..., T. At time t, the learner receives a user index u_t ∈ U = {1, ..., n} to provide a recommendation to, and a set of available contexts¹ (arms) X_t = {x_{t,1}, ..., x_{t,c_t}} ⊆ ℝ^d. The learner has to select one of the content feature vectors x_t ∈ X_t to recommend to u_t; it then observes a payoff v_t whose expectation is ϕ(x_t)⊤θ_{u_t}, where (i) we assume there exists a mapping ϕ : ℝ^d → H that maps the data to a Hilbert space, and (ii) θ_{u_t} ∈ H is unknown to the learner. More specifically, we call ℝ^d the primal space and H the associated reproducing kernel Hilbert space.

¹ Throughout the paper, we identify contexts with the concatenation of the item feature representation and real contextual properties (when available). They therefore fully characterize the properties of the available items.

The learner aims at maximizing the total payoff over the T rounds, Σ_t v_t. This goal is usually translated into minimizing (or bounding) a loss called the (pseudo-)regret, which measures the gap between the learner's policy and the optimal one (which knows the parameters (θ_u)_{u∈U}). The regret at time t is defined as

    r_t = max_{x ∈ X_t} ϕ(x)⊤θ_{u_t} − ϕ(x_t)⊤θ_{u_t}.    (1)

Let us define the kernel function k(x, x') := ϕ(x)⊤ϕ(x') for all x, x' ∈ ℝ^d. From that function, and given a dataset of t records x_1, ..., x_t ∈ ℝ^d, we define the kernel matrix K_t := (k(x_i, x_j))_{i,j ≤ t}. It is worth noting that in this setting there is no need to access the content representation itself: as clarified in the next section, the algorithm only requires the kernel value k(x, x') for any pair (x, x') of contents that have been recommended. Similarly, the estimates of the unknown parameters (θ_u)_{u∈U} are never expressed explicitly.

In order to represent the collaborative effect, we also assume that users and contents can be co-clustered, as expressed in the following.
First, for each content vector x ∈ ℝ^d, the set U can be partitioned as C(x) = (U_i^x)_{1 ≤ i ≤ m(x)}, such that (i) U = ∪_{i=1}^{m(x)} U_i^x with U_i^x ∩ U_j^x = ∅ for any 1 ≤ i < j ≤ m(x), and (ii) users belonging to the same cluster react similarly when the content with feature vector x is recommended to them: if two users u and u' belong to the same cluster U_k^x, then |ϕ(x)⊤θ_u − ϕ(x)⊤θ_{u'}| ≤ γ for some unknown gap parameter γ ≥ 0. Second, the content vectors are themselves clustered into sets X^1, X^2, ..., X^m such that two contents belonging to the same cluster induce the same clustering on U: ∀1 ≤ j ≤ m, ∀x, x' ∈ X^j, C(x) = C(x').

Clearly, the co-clustering mapping is not known and is one of the two main learning objectives of the proposed algorithm. The novelty compared to [6] is that, thanks to kernel methods, we allow the clustering over users to be determined by non-linear functions of the item features.

3 ALGORITHM
The proposed algorithm (Algorithm 1) adopts the upper-confidence-bound paradigm to manage the exploration-exploitation tradeoff. In detail, the expected reward and its corresponding confidence bound are estimated given the current estimate of the clustering, which is described later on.

Algorithm 1 Collaborative Kernelized Bandits
1: initialize the user graphs as connected over U and the item graph as connected over I
2: for t = 1, 2, ... do
3:   receive u_t ∈ U and the set of contents X_t
4:   for x ∈ X_t do  // collaborative part
5:     identify the current user cluster
6:     compute the cluster-aggregated variables v̂_{u,t,x} and σ̂_{u,t,x}
7:   select the recommended content x̄_t according to Equation (5)
8:   receive the payoff v_t
9:   update the clustering graphs

At time step t, for a user u and an item x, we first define the estimate θ̂_{u,t,x} as the solution of the optimization problem

    argmin_θ Σ_{s ∈ T_{u,t,x}} (v_s − ϕ(x̄_s)⊤θ)² + λ‖θ‖²,    (2)

where T_{u,t,x} is composed of the past time steps s at which the user u_s belonged to the same (estimated) cluster as u (given x). Denoting by Φ_{u,t,x} the matrix whose rows are the vectors {ϕ(x̄_s)⊤, s ∈ T_{u,t,x}}, by K_{u,t,x} the product Φ_{u,t,x}Φ_{u,t,x}⊤, and by r_{u,t,x} the reward vector [v_s, s ∈ T_{u,t,x}]⊤, we get θ̂_{u,t,x} = Φ_{u,t,x}⊤(K_{u,t,x} + λI)^{-1} r_{u,t,x}. This leads to the following estimate of the expected reward of content x w.r.t. user u at time t:

    v̂_{u,t,x} = ϕ(x)⊤θ̂_{u,t,x} = k_{u,t,x}⊤(K_{u,t,x} + λI)^{-1} r_{u,t,x},    (3)

where k_{u,t,x} = Φ_{u,t,x}ϕ(x) = [k(x, x̄_s), s ∈ T_{u,t,x}]⊤. Note that this expression gives v̂_{u,t,x} only in terms of past rewards and kernel values between contents.

Similarly, the confidence width on top of v̂_{u,t,x} is expressed as

    σ̂_{u,t,x} = λ^{1/2} √(k(x, x) − k_{u,t,x}⊤(K_{u,t,x} + λI)^{-1} k_{u,t,x}) · log(t + 1).    (4)

Finally, the chosen arm is the one that maximizes the upper confidence bound:

    x̄_t = argmax_{x ∈ X_t} v̂_{u_t,t,x} + η σ̂_{u_t,t,x},    (5)

where η ≥ 0 is the exploration parameter.

It remains to explain how the clusterings are estimated. The clusterings are represented by maintaining undirected graphs whose connected components represent clusters. One graph stands for contents and, for each content cluster it induces, there is one graph to cluster users. The algorithm starts with fully connected graphs: every content and every user is reachable from every other. Thereafter, after receiving the feedback v_t, we first delete edges from the user graph associated with the selected content. The deleted edges (u_t, u) are those such that

    |v̂'_{u_t,t,x̄_t} − v̂'_{u,t,x̄_t}| ≥ η' σ̂'_{u_t,t,x̄_t} + η' σ̂'_{u,t,x̄_t},    (6)

where the prime on v̂ and σ̂ denotes that the values are computed as in Equations (3) and (4), but focusing only on the past iterations concerning u or u'. The parameter η' > 0 controls the expected gap between clusters.

Finally, for each content x in the same content cluster as x̄_t, we compute the neighborhood N(x) = {u : |v̂'_{u_t,t,x} − v̂'_{u,t,x}| ≤ η' σ̂'_{u_t,t,x} + η' σ̂'_{u,t,x}}. We remove each edge (x̄_t, x) such that this neighborhood differs from the one induced by the freshly updated user graph.

4 CONCLUSIONS AND FUTURE WORKS
In this paper we showed that collaborative bandits can be extended, through the kernel trick, beyond the linearity assumption.

REFERENCES
[1] Peter Auer. 2002. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research 3, 397–422.
[2] Leonardo Cella. 2017. Modeling User Behavior with Evolving Users and Catalogs of Evolving Items. In Extended Proceedings of the 25th UMAP Conference.
[3] Y. Chen, E. K. Garcia, M. R. Gupta, A. Rahimi, and L. Cazzanti. 2009. Similarity-based Classification: Concepts and Algorithms. Journal of Machine Learning Research 10, 747–776.
[4] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. 2011. Contextual Bandits with Linear Payoff Functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS).
[5] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A Contextual-Bandit Approach to Personalized News Article Recommendation. In WWW 2010.
[6] Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. 2016. Collaborative Filtering Bandits. In SIGIR '16. ACM, 176–185.
[7] J. Shawe-Taylor and N. Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.
[8] M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. 2013. Finite-Time Analysis of Kernelised Contextual Bandits. In UAI '13. 654–663.
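For concreteness, the per-arm score built from the kernelized ridge estimate and its confidence width can be sketched in a few lines of Python. This is only a minimal illustration of the score computation: the RBF kernel, the parameter values, and the function names are assumptions for the example, not choices made in the paper, and the collaborative clustering step (which determines which past rounds feed each estimate) is left out.

```python
import numpy as np

def rbf_kernel(x, y, gamma_rbf=1.0):
    """Illustrative kernel choice (assumption); any positive-definite kernel works."""
    return np.exp(-gamma_rbf * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def ucb_score(x, past_contexts, past_rewards, t, lam=1.0, eta=1.0, kernel=rbf_kernel):
    """Score of arm x: reward estimate plus eta times the confidence width.

    past_contexts / past_rewards: the contexts recommended and payoffs observed
    at the rounds s gathered in T_{u,t,x} (here taken as given).
    """
    if len(past_contexts) == 0:
        return np.inf  # unexplored arm: force exploration
    # Kernel matrix over past contexts and kernel vector against the candidate arm
    K = np.array([[kernel(xi, xj) for xj in past_contexts] for xi in past_contexts])
    k_x = np.array([kernel(x, xi) for xi in past_contexts])
    inv = np.linalg.inv(K + lam * np.eye(len(past_contexts)))
    # Reward estimate: k^T (K + lam I)^{-1} r, as in Equation (3)
    v_hat = k_x @ inv @ np.asarray(past_rewards)
    # Confidence width, as in Equation (4)
    width = kernel(x, x) - k_x @ inv @ k_x
    sigma_hat = np.sqrt(lam) * np.sqrt(max(width, 0.0)) * np.log(t + 1)
    return v_hat + eta * sigma_hat

def select_arm(arms, past_contexts, past_rewards, t):
    """Pick the arm maximizing the upper confidence bound, as in Equation (5)."""
    scores = [ucb_score(x, past_contexts, past_rewards, t) for x in arms]
    return arms[int(np.argmax(scores))]
```

With two one-dimensional arms and symmetric past data, the arm that previously yielded the higher reward gets the higher score, since both arms share the same confidence width in that case.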