                           Kernelized Collaborative Contextual Bandits
                  Leonardo Cella                               Romaric Gaudel                                      Paolo Cremonesi
          leonardo.cella@mail.polimi.it                      romaric.gaudel@ensai.fr                          paolo.cremonesi@polimi.it
              Politecnico di Milano                         Univ. Rennes, Ensai, CREST                          Politecnico di Milano
                   Milan, Italy                                   Rennes, France                                      Milan, Italy

ABSTRACT
We tackle the problem of recommending products in the online recommendation scenario, which arises in many real applications. The most famous and explored instances are news recommendation and advertisement. In this work we propose an extension of state-of-the-art bandit models that not only takes care of different users' interactions, but also goes beyond the linearity assumption on the expected reward. As an application case, we may consider situations in which the number of actions (products) is too big to sample all of them even once, while at the same time there are several changing users to serve content to.

KEYWORDS
Recommender Systems, Contextual Bandits, Kernel Methods

ACM Reference format:
Leonardo Cella, Romaric Gaudel, and Paolo Cremonesi. 2017. Kernelized Collaborative Contextual Bandits. In Proceedings of RecSys 2017 Posters, Como, Italy, August 27-31, 2 pages.

RecSys 2017 Poster Proceedings, August 27-31, Como, Italy
© 2017 Copyright held by the author(s).

1    INTRODUCTION AND RELATED WORKS
In the Recommender Systems (RS) field, the most valuable information to rely on is user interactions. That is the reason why Collaborative Filtering methods are the current state of the art, or at least the ones that contribute the most to recommendation quality. On the web there are many real applications, such as computational advertising, news recommendation, or online streaming, that do not fit the classical recommendation scenario. Their peculiarity is that both the set of active users and the set of available products are fluid and change over time [2].
   In this online recommendation scenario, in particular when contexts are available besides item feature vectors, multi-armed bandit techniques have proven to be an excellent solution and are the current state of the art. Most of the previous efforts on contextual bandits looked at the recommendation problem from a single-user standpoint; only a few preliminary works follow the collaborative direction [6].
   With this project we also want to consider scenarios where the set of products is too big to be explored entirely; we therefore exploit kernel methods [7]. They provide a way to extract, from the primal context feature space, non-linear relationships that map the original features to the observed rewards, relying on similarity information between contexts. It is also worth mentioning that there are settings where context similarities are the only available information [3]. Previous approaches to the contextual bandit problem usually assume that the features-rewards mapping is linear ([1], [4], [5]); the only exception is [8].
   In a nutshell, in this paper we show that Collaborative Bandits [6] may be extended through the kernel trick to get rid of the linearity assumption. Our modeling assumption is that the expected reward obtained after choosing a product to recommend can be expressed as a function of both the product features and other users' interactions with different items. To do so, we introduce a contextual multi-armed bandit model that relies on kernel methods to go beyond the linearity assumption, and on graphs to take the collaborative aspect into account.
   The following section details the modeling assumptions and Section 3 presents the proposed algorithm.

2    LEARNING MODEL
We assume that the learning process can be divided into T discrete rounds t = 1, . . . , T. At time t, the learner receives a user index u_t ∈ U = {1, . . . , n} to provide a recommendation to, and a set of available contexts¹ (arms) X_t = {x_{t,1}, . . . , x_{t,c_t}} ⊆ ℝ^d. The learner has to select one of the content feature vectors x_t ∈ X_t to recommend to u_t, and then observes a payoff v_t whose expectation is ϕ(x_t)⊤ θ_{u_t}, where (i) we assume there exists a mapping ϕ : ℝ^d → H that maps the data to a Hilbert space, and (ii) θ_{u_t} ∈ H is unknown to the learner. More specifically, we call ℝ^d the primal space and H the related reproducing kernel Hilbert space.
   The learner aims at maximizing the total payoff over the T rounds, Σ_t v_t. This goal is usually translated into minimizing (or bounding) a loss called the (pseudo-)regret, which measures the gap between the learner's policy and the optimal one (aware of the parameters (θ_u)_{u∈U}). The regret at time t is defined as:

      r_t = max_{x∈X_t} ϕ(x)⊤ θ_{u_t} − ϕ(x_t)⊤ θ_{u_t}.                (1)

   Let us define the kernel function k(x, x′) := ϕ(x)⊤ ϕ(x′) for all x, x′ ∈ ℝ^d. From that function, and given a dataset composed of t records x_1, . . . , x_t ∈ ℝ^d, we define the kernel matrix K_t := (k(x_i, x_j))_{i,j≤t}.
   It is worth noting that in such a scenario there is no need to access the content representation. As we clarify in the next section, the algorithm only requires the kernel value k(x, x′) for any pair (x, x′) of contents which have been recommended. Similarly, the estimates of the unknown parameters (θ_u)_{u∈U} are never explicitly expressed.
   In order to represent the collaborative effect, we also assume that users and contents can be co-clustered as follows. First, for each content vector x ∈ ℝ^d, the set U can be clustered as C(x) = {U_i^x}_{1≤i≤m(x)}, such that (i) U = ∪_{i=1}^{m(x)} U_i^x and U_i^x ∩ U_j^x = ∅ for any 1 ≤ i < j ≤ m(x), and (ii) users belonging to the same cluster react similarly when the content with feature vector x is recommended to them. Namely, if two users u and u′ belong to the same cluster U_k^x, then |ϕ(x)⊤ θ_u − ϕ(x)⊤ θ_{u′}| ≤ γ for some unknown gap parameter γ ≥ 0. Second, the content vectors are themselves clustered into sets X_1, X_2, . . . , X_m such that two contents belonging to the same cluster induce the same clustering on U: ∀ 1 ≤ j ≤ m, ∀ x, x′ ∈ X_j, C(x) = C(x′).

¹ Throughout the paper, we identify a context with the concatenation of the item feature representation and actual contextual properties (when available); contexts therefore fully characterize the properties of the available items.

Algorithm 1 Collaborative Kernelized Bandits
 1: Initialize the user graph as connected over U and the item graph as connected over I
 2: for t = 1, 2, . . . do
 3:    receive u_t ∈ U and the set of contents X_t
 4:    for x ∈ X_t do  // Collaborative part
 5:       identify the current user cluster
 6:       compute cluster-aggregated variables v̂_{u,t,x} and σ̂_{u,t,x}
 7:    select recommended content x̄_t according to Equation (5)
 8:    receive payoff v_t
 9:    update clustering graphs
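To make the selection step concrete, the following Python sketch implements the kernelised reward estimate of Equation (3), the confidence width of Equation (4), and the UCB choice of Equation (5). It is a minimal sketch under simplifying assumptions: the cluster-aggregation of Algorithm 1 (lines 4-6) is collapsed into a single shared history, and the Gaussian kernel, the toy non-linear payoff, and all variable names are illustrative, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(a, b, gamma=1.0):
    # Gaussian kernel k(x, x') = exp(-gamma * ||x - x'||^2) -- an assumed
    # choice; any positive-definite kernel would do.
    return np.exp(-gamma * np.sum((a - b) ** 2))

def ucb_scores(history_x, history_v, arms, lam=1.0, eta=1.0, t=1):
    # Kernelised estimate (Eq. 3), confidence width (Eq. 4), and the
    # upper confidence bound maximised in Eq. (5).
    m = len(history_x)
    if m == 0:
        return np.full(len(arms), np.inf)  # no data yet: force exploration
    v = np.asarray(history_v)
    K = np.array([[rbf_kernel(xi, xj) for xj in history_x] for xi in history_x])
    inv = np.linalg.inv(K + lam * np.eye(m))
    scores = []
    for x in arms:
        k_vec = np.array([rbf_kernel(x, xs) for xs in history_x])
        v_hat = k_vec @ inv @ v                               # Eq. (3)
        width = np.sqrt(lam) * np.sqrt(                       # Eq. (4)
            max(rbf_kernel(x, x) - k_vec @ inv @ k_vec, 0.0) * np.log(t + 1.0))
        scores.append(v_hat + eta * width)
    return np.array(scores)

# Toy run: 2-d contexts and a non-linear expected payoff that a purely
# linear contextual bandit could not represent.
history_x, history_v = [], []
for t in range(1, 21):
    arms = rng.normal(size=(5, 2))                  # the arm set X_t
    chosen = int(np.argmax(ucb_scores(history_x, history_v, arms, t=t)))
    x_bar = arms[chosen]                            # selected content (Eq. 5)
    payoff = np.cos(x_bar @ x_bar) + 0.1 * rng.normal()
    history_x.append(x_bar)
    history_v.append(payoff)
```

In the full algorithm, the history used for a user u would be restricted to the time-steps T_{u,t,x} of users in u's estimated cluster, and η and λ would be tuned rather than left at these illustrative defaults.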
   Clearly the co-clustering mapping is not known, and recovering it is one of the two main learning objectives of the proposed algorithm. The novelty compared to [6] is that we assume the clustering over users is determined by non-linear functions of the item features, thanks to the application of kernel methods.

3    ALGORITHM
The proposed algorithm (Algorithm 1) adopts the upper confidence bound paradigm to manage the exploration-exploitation tradeoff. In detail, the estimation of the expected reward and of its corresponding confidence bound is done given the current estimates of the clustering, which are described later on. At time-step t, for a user u and an item x, we first define the estimate θ̂_{u,t,x} as the solution of the optimization problem

      θ̂_{u,t,x} = arg min_θ Σ_{s ∈ T_{u,t,x}} (v_s − ϕ(x̄_s)⊤ θ)² + λ ∥θ∥²,                (2)

where T_{u,t,x} is composed of the past time-steps s at which the user u_s belonged to the same (estimated) cluster as u (given x). Denoting by Φ_{u,t,x} the matrix whose rows are the vectors {ϕ(x̄_s)⊤, s ∈ T_{u,t,x}}, by K_{u,t,x} the product Φ_{u,t,x} Φ_{u,t,x}⊤, and by r_{u,t,x} the vector [v_s, s ∈ T_{u,t,x}]⊤, we get θ̂_{u,t,x} = Φ_{u,t,x}⊤ (K_{u,t,x} + λI)⁻¹ r_{u,t,x}. This leads to the following estimate for the expected reward of content x wrt. user u at time t:

      v̂_{u,t,x} = ϕ(x)⊤ θ̂_{u,t,x} = k_{u,t,x}⊤ (K_{u,t,x} + λI)⁻¹ r_{u,t,x},                (3)

where k_{u,t,x} = Φ_{u,t,x} ϕ(x) = [k(x, x̄_s), s ∈ T_{u,t,x}]⊤. Note that this equation expresses v̂_{u,t,x} only through past rewards and kernel similarities between contents.
   Similarly, the confidence interval on top of v̂_{u,t,x} is expressed as

      σ̂_{u,t,x} = λ^{1/2} √( (k(x, x) − k_{u,t,x}⊤ (K_{u,t,x} + λI)⁻¹ k_{u,t,x}) log(t + 1) ).                (4)

   Finally, the chosen arm is the one that maximizes the upper confidence bound:

      x̄_t = arg max_{x ∈ X_t} v̂_{u_t,t,x} + η σ̂_{u_t,t,x},                (5)

where η ≥ 0 is the exploration parameter.
   It remains to explain how the clusterings are estimated. The clusterings are represented by maintaining undirected graphs in which each connected component represents a cluster. One graph stands for contents, and for each content-cluster induced by it there is one graph to cluster users. The algorithm starts with fully connected graphs: all contents, and all users, are reachable from each other. Thereafter, after receiving the feedback v_t, we first delete edges from the users-graph associated to the selected content. The deleted edges (u_t, u) are those such that

      |v̂′_{u_t,t,x̄_t} − v̂′_{u,t,x̄_t}| ≥ η′ σ̂′_{u_t,t,x̄_t} + η′ σ̂′_{u,t,x̄_t},                (6)

where the prime on v̂ and σ̂ denotes that the values are computed similarly to Equations (3) and (4), while focusing only on the past iterations concerning u or u_t, respectively. The parameter η′ > 0 controls the expected gap between clusters.
   Finally, for each content x in the same content-cluster as x̄_t, we compute the neighborhood N(x) = {u : |v̂′_{u_t,t,x} − v̂′_{u,t,x}| ≤ η′ σ̂′_{u_t,t,x} + η′ σ̂′_{u,t,x}}. We remove each edge (x̄_t, x) such that this neighborhood differs from the one induced by the freshly updated users-graph.

4    CONCLUSIONS AND FUTURE WORKS
In this paper we showed that collaborative bandits may be extended, through the kernel trick, beyond the linearity assumption.

REFERENCES
[1] Peter Auer. 2002. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research. 397–422.
[2] Leonardo Cella. 2017. Modeling user behavior with evolving users and catalogs of evolving items. In Extended Proceedings of the 25th UMAP Conference.
[3] Y. Chen, E. K. Garcia, M. R. Gupta, A. Rahimi, and L. Cazzanti. 2009. Similarity-based Classification: Concepts and Algorithms. Journal of Machine Learning Research. 747–776.
[4] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. 2011. Contextual Bandits with Linear Payoff Functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS).
[5] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A Contextual-Bandit Approach to Personalized News Article Recommendation. In WWW 2010.
[6] Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. 2016. Collaborative Filtering Bandits. In SIGIR '16. ACM, 176–185.
[7] J. Shawe-Taylor and N. Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.
[8] M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. 2013. Finite-Time Analysis of Kernelised Contextual Bandits. In UAI '13: Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence. 654–663.
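As a closing illustration of the clustering estimation in Section 3 (delete users-graph edges that violate the gap test of Equation (6), then read the clusters off as connected components), consider the sketch below. The per-user primed estimates, the value of η′, and the application of the test to every user pair (rather than only to edges incident to u_t) are illustrative assumptions, not the paper's exact procedure.

```python
from itertools import combinations

# Hypothetical primed estimates v̂' and widths σ̂' for the selected content;
# in the paper they come from Eqs. (3)-(4) restricted to each user's own
# past iterations. The numbers below are made up for illustration.
v_hat = {1: 0.90, 2: 0.88, 3: 0.30, 4: 0.28}
sigma = {1: 0.05, 2: 0.05, 3: 0.05, 4: 0.05}
eta_prime = 1.0

# Fully connected users-graph, then keep only edges passing the gap test:
# an edge {u, u'} survives iff |v̂'_u - v̂'_u'| < η'σ̂'_u + η'σ̂'_u' (Eq. 6).
edges = {frozenset(p) for p in combinations(v_hat, 2)}
edges = {e for e in edges
         if abs(v_hat[min(e)] - v_hat[max(e)])
         < eta_prime * (sigma[min(e)] + sigma[max(e)])}

def components(nodes, edges):
    # Connected components of the surviving graph = estimated user clusters.
    comp = {n: {n} for n in nodes}
    for e in edges:
        a, b = tuple(e)
        merged = comp[a] | comp[b]
        for n in merged:
            comp[n] = merged
    return {frozenset(c) for c in comp.values()}

clusters = components(v_hat, edges)
```

With the numbers above, only the edges {1, 2} and {3, 4} survive, so two clusters remain; in the full algorithm these components would then determine which past iterations enter the sets T_{u,t,x} used by Equations (2)-(4).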