Kernelized Collaborative Contextual Bandits

Leonardo Cella (Politecnico di Milano, Milan, Italy) — leonardo.cella@mail.polimi.it
Romaric Gaudel (Univ. Rennes, Ensai, CREST, Rennes, France) — romaric.gaudel@ensai.fr
Paolo Cremonesi (Politecnico di Milano, Milan, Italy) — paolo.cremonesi@polimi.it

ABSTRACT
We tackle the problem of recommending products in the online recommendation scenario, which arises in many real applications; the best-known and most-studied instances are news recommendation and advertising. We propose an extension of state-of-the-art bandit models that not only accounts for the interactions of different users, but also goes beyond the linearity assumption on the expected reward. As an applicative case, consider situations in which the number of actions (products) is too large to sample each of them even once, while at the same time there are several changing users to serve content to.

KEYWORDS
Recommender Systems, Contextual Bandits, Kernel Methods

ACM Reference format:
Leonardo Cella, Romaric Gaudel, and Paolo Cremonesi. 2017. Kernelized Collaborative Contextual Bandits. In Proceedings of RecSys 2017 Posters, Como, Italy, August 27-31, 2 pages.

RecSys 2017 Poster Proceedings, August 27-31, Como, Italy
© 2017 Copyright held by the author(s).

1 INTRODUCTION AND RELATED WORKS
In the Recommender Systems (RS) field, the most valuable information to rely on is user interactions. This is why Collaborative Filtering methods are the current state of the art, or at least the methods that contribute the most to recommendation quality. On the web there are many real applications, such as computational advertising, news recommendation, and online streaming, that do not fit the classical recommendation scenario: both the set of active users and the set of available products are very fluid, and therefore change over time [2].

In this online recommendation scenario, in particular when contexts are available besides item feature vectors, multi-armed bandit techniques have proven to be an excellent solution and constitute the current state of the art. Most previous efforts on contextual bandits looked at the recommendation problem from a single-user standpoint; only a few preliminary works follow the collaborative direction [6].

We also want to consider scenarios where the set of products is too large to be explored entirely, which is why we exploit kernel methods [7]. They provide a way to capture non-linear relationships between the primal context features and the observed rewards, relying on similarity information between contexts. It is also worth mentioning that there are settings where context similarities are the only available information [3]. Previous approaches to contextual bandit problems usually assume that the feature-reward mapping is linear ([1], [4], [5]); the only exception is [8].

In a nutshell, in this paper we show that Collaborative Bandits [6] can be extended, through the kernel trick, beyond the linearity assumption. Our modeling assumption is that the expected reward obtained after recommending a product can be expressed as a function of both the product features and other users' interactions with different items. To this end, we introduce a contextual multi-armed bandit model that relies on kernel methods to go beyond the linearity assumption, and on graphs to capture the collaborative aspect. The following section details the modeling assumptions, and Section 3 presents the proposed algorithm.

2 LEARNING MODEL
We assume that the learning process is divided into T discrete rounds t = 1, ..., T. At time t, the learner receives a user index u_t ∈ U = {1, ..., n} to provide a recommendation to, and a set of available contexts¹ (arms) X_t = {x_{t,1}, ..., x_{t,c_t}} ⊆ ℝ^d. The learner has to select one of the content feature vectors x_t ∈ X_t to recommend to u_t; it then observes a payoff v_t whose expectation is ϕ(x_t)⊤θ_{u_t}, where (i) we assume there exists a mapping ϕ : ℝ^d → H that maps the data to a Hilbert space, and (ii) θ_{u_t} ∈ H is unknown to the learner. More specifically, we call ℝ^d the primal space and H the associated reproducing kernel Hilbert space.

¹ Throughout the paper, we identify contexts with the concatenation of the item feature representation and real contextual properties (when available). They therefore fully characterize the properties of the available items.

The learner aims at maximizing the total payoff over the T rounds, Σ_t v_t. This goal is usually translated into minimizing (or bounding) a loss called the (pseudo-)regret, which measures the gap between the learner's policy and the optimal one (which knows the parameters (θ_u)_{u∈U}). The regret at time t is defined as

    r_t = max_{x ∈ X_t} ϕ(x)⊤θ_{u_t} − ϕ(x_t)⊤θ_{u_t}.    (1)

Let us define the kernel function k(x, x') := ϕ(x)⊤ϕ(x') for all x, x' ∈ ℝ^d. From that function, and given a dataset of t records x_1, ..., x_t ∈ ℝ^d, we define the kernel matrix K_t := (k(x_i, x_j))_{i,j ≤ t}. It is worth noting that in this setting there is no need to access the content representation itself: as clarified in the next section, the algorithm only requires the kernel value k(x, x') for any pair (x, x') of contents that have been recommended. Similarly, the estimates of the unknown parameters (θ_u)_{u∈U} are never expressed explicitly.

In order to represent the collaborative effect, we also assume that users and contents can be co-clustered, as expressed in the following.
First, for each content vector x ∈ ℝ^d, the set U can be partitioned as C(x) = (U_i^x)_{1 ≤ i ≤ m(x)}, such that (i) U = ∪_{i=1}^{m(x)} U_i^x with U_i^x ∩ U_j^x = ∅ for any 1 ≤ i < j ≤ m(x), and (ii) users belonging to the same cluster react similarly when the content with feature vector x is recommended to them: if two users u and u' belong to the same cluster U_k^x, then |ϕ(x)⊤θ_u − ϕ(x)⊤θ_{u'}| ≤ γ for some unknown gap parameter γ ≥ 0. Second, the content vectors are themselves clustered into sets X^1, X^2, ..., X^m such that two contents belonging to the same cluster induce the same clustering on U: ∀1 ≤ j ≤ m, ∀x, x' ∈ X^j, C(x) = C(x').

Clearly, the co-clustering mapping is not known and is one of the two main learning objectives of the proposed algorithm. The novelty compared to [6] is that, thanks to kernel methods, we allow the clustering over users to be determined by non-linear functions of the item features.

3 ALGORITHM
The proposed algorithm (Algorithm 1) adopts the upper-confidence-bound paradigm to manage the exploration-exploitation tradeoff. In detail, the expected reward and its corresponding confidence bound are estimated given the current estimate of the clustering, which is described later on.

Algorithm 1 Collaborative Kernelized Bandits
1: initialize the user graphs as connected over U and the item graph as connected over I
2: for t = 1, 2, ... do
3:   receive u_t ∈ U and the set of contents X_t
4:   for x ∈ X_t do  // collaborative part
5:     identify the current user cluster
6:     compute the cluster-aggregated variables v̂_{u,t,x} and σ̂_{u,t,x}
7:   select the recommended content x̄_t according to Equation (5)
8:   receive the payoff v_t
9:   update the clustering graphs

At time step t, for a user u and an item x, we first define the estimate θ̂_{u,t,x} as the solution of the optimization problem

    argmin_θ Σ_{s ∈ T_{u,t,x}} (v_s − ϕ(x̄_s)⊤θ)² + λ‖θ‖²,    (2)

where T_{u,t,x} is composed of the past time steps s at which the user u_s belonged to the same (estimated) cluster as u (given x). Denoting by Φ_{u,t,x} the matrix whose rows are the vectors {ϕ(x̄_s)⊤, s ∈ T_{u,t,x}}, by K_{u,t,x} the product Φ_{u,t,x}Φ_{u,t,x}⊤, and by r_{u,t,x} the reward vector [v_s, s ∈ T_{u,t,x}]⊤, we get θ̂_{u,t,x} = Φ_{u,t,x}⊤(K_{u,t,x} + λI)^{-1} r_{u,t,x}. This leads to the following estimate of the expected reward of content x w.r.t. user u at time t:

    v̂_{u,t,x} = ϕ(x)⊤θ̂_{u,t,x} = k_{u,t,x}⊤(K_{u,t,x} + λI)^{-1} r_{u,t,x},    (3)

where k_{u,t,x} = Φ_{u,t,x}ϕ(x) = [k(x, x̄_s), s ∈ T_{u,t,x}]⊤. Note that this expression gives v̂_{u,t,x} only in terms of past rewards and kernel values between contents.

Similarly, the confidence width on top of v̂_{u,t,x} is expressed as

    σ̂_{u,t,x} = λ^{1/2} √(k(x, x) − k_{u,t,x}⊤(K_{u,t,x} + λI)^{-1} k_{u,t,x}) · log(t + 1).    (4)

Finally, the chosen arm is the one that maximizes the upper confidence bound:

    x̄_t = argmax_{x ∈ X_t} v̂_{u_t,t,x} + η σ̂_{u_t,t,x},    (5)

where η ≥ 0 is the exploration parameter.

It remains to explain how the clusterings are estimated. The clusterings are represented by maintaining undirected graphs whose connected components represent clusters. One graph stands for contents and, for each content cluster it induces, there is one graph to cluster users. The algorithm starts with fully connected graphs: every content and every user is reachable from every other. Thereafter, after receiving the feedback v_t, we first delete edges from the user graph associated with the selected content. The deleted edges (u_t, u) are those such that

    |v̂'_{u_t,t,x̄_t} − v̂'_{u,t,x̄_t}| ≥ η' σ̂'_{u_t,t,x̄_t} + η' σ̂'_{u,t,x̄_t},    (6)

where the prime on v̂ and σ̂ denotes that the values are computed as in Equations (3) and (4), but focusing only on the past iterations concerning u or u'. The parameter η' > 0 controls the expected gap between clusters.

Finally, for each content x in the same content cluster as x̄_t, we compute the neighborhood N(x) = {u : |v̂'_{u_t,t,x} − v̂'_{u,t,x}| ≤ η' σ̂'_{u_t,t,x} + η' σ̂'_{u,t,x}}. We remove each edge (x̄_t, x) such that this neighborhood differs from the one induced by the freshly updated user graph.

4 CONCLUSIONS AND FUTURE WORKS
In this paper we showed that collaborative bandits can be extended, through the kernel trick, beyond the linearity assumption.

REFERENCES
[1] Peter Auer. 2002. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research 3, 397–422.
[2] Leonardo Cella. 2017. Modeling User Behavior with Evolving Users and Catalogs of Evolving Items. In Extended Proceedings of the 25th UMAP Conference.
[3] Y. Chen, E. K. Garcia, M. R. Gupta, A. Rahimi, and L. Cazzanti. 2009. Similarity-based Classification: Concepts and Algorithms. Journal of Machine Learning Research 10, 747–776.
[4] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. 2011. Contextual Bandits with Linear Payoff Functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS).
[5] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A Contextual-Bandit Approach to Personalized News Article Recommendation. In WWW 2010.
[6] Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. 2016. Collaborative Filtering Bandits. In SIGIR '16. ACM, 176–185.
[7] J. Shawe-Taylor and N. Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.
[8] M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. 2013. Finite-Time Analysis of Kernelised Contextual Bandits. In UAI '13. 654–663.
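For concreteness, the per-arm score built from the kernelized ridge estimate and its confidence width can be sketched in a few lines of Python. This is only a minimal illustration of the score computation: the RBF kernel, the parameter values, and the function names are assumptions for the example, not choices made in the paper, and the collaborative clustering step (which determines which past rounds feed each estimate) is left out.

```python
import numpy as np

def rbf_kernel(x, y, gamma_rbf=1.0):
    """Illustrative kernel choice (assumption); any positive-definite kernel works."""
    return np.exp(-gamma_rbf * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def ucb_score(x, past_contexts, past_rewards, t, lam=1.0, eta=1.0, kernel=rbf_kernel):
    """Score of arm x: reward estimate plus eta times the confidence width.

    past_contexts / past_rewards: the contexts recommended and payoffs observed
    at the rounds s gathered in T_{u,t,x} (here taken as given).
    """
    if len(past_contexts) == 0:
        return np.inf  # unexplored arm: force exploration
    # Kernel matrix over past contexts and kernel vector against the candidate arm
    K = np.array([[kernel(xi, xj) for xj in past_contexts] for xi in past_contexts])
    k_x = np.array([kernel(x, xi) for xi in past_contexts])
    inv = np.linalg.inv(K + lam * np.eye(len(past_contexts)))
    # Reward estimate: k^T (K + lam I)^{-1} r, as in Equation (3)
    v_hat = k_x @ inv @ np.asarray(past_rewards)
    # Confidence width, as in Equation (4)
    width = kernel(x, x) - k_x @ inv @ k_x
    sigma_hat = np.sqrt(lam) * np.sqrt(max(width, 0.0)) * np.log(t + 1)
    return v_hat + eta * sigma_hat

def select_arm(arms, past_contexts, past_rewards, t):
    """Pick the arm maximizing the upper confidence bound, as in Equation (5)."""
    scores = [ucb_score(x, past_contexts, past_rewards, t) for x in arms]
    return arms[int(np.argmax(scores))]
```

With two one-dimensional arms and symmetric past data, the arm that previously yielded the higher reward gets the higher score, since both arms share the same confidence width in that case.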