Interaction-Grounded Learning for Recommender Systems

Jessica Maghakian^1, Kishan Panaganti^2, Paul Mineiro^3, Akanksha Saran^3 and Cheng Tan^3
^1 Stony Brook University, USA
^2 Texas A&M University, USA
^3 Microsoft Research NYC, USA

ORSUM@ACM RecSys 2022: 5th Workshop on Online Recommender Systems and User Modeling, jointly with the 16th ACM Conference on Recommender Systems, September 23rd, 2022, Seattle, WA, USA

Abstract
Recommender systems have long grappled with optimizing user satisfaction using only implicit user feedback. Many approaches in the literature rely on complicated feedback modeling and costly user studies. We propose online recommender systems as a candidate for the recently introduced Interaction-Grounded Learning (IGL) paradigm. In IGL, a learner attempts to optimize a latent reward in an environment by observing feedback with no grounding. We introduce a novel personalized variant of IGL for recommender systems that can leverage explicit and implicit user feedback to maximize user satisfaction, with no feedback signal modeling and minimal assumptions. With our empirical evaluations, which include simulations as well as experiments on real product data, we demonstrate the effectiveness of IGL for recommender systems.

Keywords
recommendation systems, interaction-grounded learning, contextual bandits, reinforcement learning

1. Introduction

The last decade has seen unprecedented growth in e-commerce, social media and digital streaming offerings, resulting in users that are overwhelmed with content and choices. Online recommender systems offer a way to alleviate this information overload and improve user experience by providing personalized content. Unfortunately, optimizing user satisfaction is challenging because explicit feedback indicating user satisfaction is rare in practice [1]. To resolve the problem of data sparsity, practitioners rely on implicit signals such as clicks [2] or dwell time [3] as a proxy for user satisfaction. However, designing an optimization objective using implicit signals is nontrivial, and many modern recommender systems suffer from the following challenges.

Challenge 1: No one implicit signal is the true user satisfaction signal. User clicks are the most readily available signal, and the Click-Through Rate (CTR) metric has become the gold standard for evaluating the performance of online recommendation systems [4]. Yet there are many instances when a user will interact via clicks and be unsatisfied with the content. The most familiar of these is clickbait, where poor quality content attracts user clicks by exploiting cognitive biases such as caption bias [5], position bias [6] or the curiosity gap [7, 8]. Optimization of the CTR will naturally promote clickbait items that provide negative user experiences and cause distrust in the recommender system [9]. Recent studies show that clicks may even be a signal of user dissatisfaction: in laboratory studies of online news reading [10] and Spotify listening sessions [11], half of the clicked-on content was actually disliked by users.

Challenge 2: Incorporating multiple implicit feedback signals requires manual feature engineering. In addition to clicks, user implicit feedback can include dwell time [3], mouse movement [12], scroll information [13] and gaze [14]. One popular approach uses dwell time to filter out noisy clicks, with the reasoning that satisfied users stay on pages longer [3]. Although the industry standard is 30+ seconds of dwell time for a "meaningful" click, this number actually varies depending on the page topic, readability and content length [15]. It is equally challenging to incorporate other signals: for example, viewport time, dwell time and scroll patterns have a complicated temporal relationship and represent preference in different phases [10]. There is an extensive body of work on modeling different implicit feedback signals [16, 17]; however, these niche models may not generalize well across a diverse user base, or stay relevant as recommender systems and their users evolve.
To tackle these challenges, we propose online recommender systems as a candidate for Interaction-Grounded Learning (IGL) [18]. IGL is a learning paradigm where a learner optimizes for latent rewards by interacting with the environment and associating observed feedback with the unobservable true reward. Although IGL was originally inspired by brain-computer interface applications, in this paper we demonstrate that the framework, when utilizing a different generative assumption and augmented with an additional latent state, is also well suited for recommendation applications. Existing approaches such as reinforcement learning and traditional contextual bandits are sensitive to the choice of reward function. However, IGL resolves the two challenges above while making minimal assumptions about the value of observed user feedback. Our new approach is able to incorporate both explicit and implicit signals, leverage ambiguous user feedback and adapt to the different ways in which users interact with the system.

Our Contributions. We introduce IGL for recommender systems, allowing us to leverage implicit and explicit feedback signals and mitigate the need for reward engineering. We present the first IGL strategy for context-dependent feedback, the first use of inverse kinematics as an IGL objective, and the first IGL strategy for more than two latent states. Using simulations and real production data, we demonstrate that recommender systems require at least 3 reward states, and that IGL is able to address both of the above challenges for modern online recommender systems.

2. Background on Interaction-Grounded Learning
Problem Statement. Consider a learner interacting with an environment and trying to optimize its policy without access to any grounding or explicit reward signal. At each time step, the stationary environment generates a context π‘₯ ∈ 𝒳, sampled i.i.d. from a distribution 𝑑_0. The learner observes the context and then selects an action π‘Ž ∈ π’œ from a finite action set. In response, the environment jointly generates a latent reward and feedback vector (π‘Ÿ, 𝑦) ∈ β„› Γ— 𝒴 conditional on (π‘₯, π‘Ž). However, the learner is only able to observe 𝑦 and not π‘Ÿ. Since the latent reward can be either deterministic or stochastic, let 𝑅(π‘₯, π‘Ž) ∢= 𝔼_(π‘₯,π‘Ž)[π‘Ÿ] denote the expected reward after choosing action π‘Ž for context π‘₯. In the IGL setting, the context space 𝒳 and feedback vector space 𝒴 can be arbitrarily large. Let πœ‹ ∈ Ξ  ∢ 𝒳 β†’ Ξ”(π’œ) denote a stochastic policy, with corresponding expected return 𝑉(πœ‹) ∢= 𝔼_{(π‘₯,π‘Ž)βˆΌπ‘‘_0Γ—πœ‹}[π‘Ÿ]. In IGL, the learner's goal is to find the optimal policy πœ‹* = argmax_{πœ‹βˆˆΞ } 𝑉(πœ‹), while only able to observe context-action-feedback (π‘₯, π‘Ž, 𝑦) triples.

In the recommender system setting, the context π‘₯ is the user, the action π‘Ž is the recommended content and the feedback 𝑦 is the user feedback. Unfortunately, existing IGL approaches [18, 19] leverage assumptions designed for classification and control tasks which are a poor fit for recommendation scenarios: (i) context-independence of the feedback and (ii) binary latent rewards.

Feedback Dependence Assumptions. It is information-theoretically impossible to solve IGL without assumptions about the relation between π‘₯, π‘Ž and 𝑦 [19]. In the first paper on IGL, the authors assumed full conditional independence of the feedback on the context and chosen action, i.e. 𝑦 βŸ‚ π‘₯, π‘Ž | π‘Ÿ. For recommender systems, this undesirably implies that all users communicate preferences identically for all content. In the following paper, Xie et al. [19] loosen full conditional independence by considering context conditional independence, i.e. 𝑦 βŸ‚ π‘₯ | π‘Ž, π‘Ÿ. For our setting, this corresponds to the user feedback varying for combinations of preference and content, but remaining consistent across all users. Neither of these two assumptions is applicable in the setting of online content recommendation, because different users interact with recommender systems in different ways. This is evidenced by our production data from a real-world image recommendation system (see Sec. 4.3) along with existing results in the literature [20, 21]. By assuming user-specific communication rather than item-specific communication, we allow for personalized reward learning.

Number of Latent Reward States. Prior work shows that the binary latent reward assumption, along with an assumption that rewards are rare under a known reference policy, is sufficient for IGL to succeed. Specifically, optimizing the contrast between a learned policy and the oblivious uniform policy succeeds when feedback is both context and action independent [18], and optimizing the contrast between the learned policy and all constant-action policies succeeds when the feedback is context independent [19].

Although the binary latent reward assumption (e.g., satisfied or dissatisfied) appears reasonable for recommendation scenarios, it fails to account for user indifference versus user dissatisfaction. This observation was first motivated by our production data, where a 2 state IGL policy would sometimes maximize feedback signals with obviously negative semantics. Assuming users ignore most content most of the time [22], negative feedback can be as difficult to elicit as positive feedback, and a 2 state IGL model is unable to distinguish between these extremes. Hence, we posit that a minimal latent state model for recommender systems involves 3 states: (i) π‘Ÿ = 1, when users are satisfied with the recommended content; (ii) π‘Ÿ = 0, when users are indifferent or inattentive; and (iii) π‘Ÿ = βˆ’1, when users are dissatisfied.
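To make the interaction protocol concrete, the following is a minimal illustrative Python sketch of the IGL loop with the 3-state latent reward: the environment draws a hidden disposition π‘Ÿ and a feedback signal 𝑦 whose distribution depends on the user and on π‘Ÿ but not on the recommended item, and the learner only ever logs (π‘₯, π‘Ž, 𝑦) triples. The two users, their communication styles and the toy reward function are invented for illustration and are not taken from our simulator or production system.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGNALS = ["like", "dislike", "click", "skip", "none"]

# Two hypothetical users with different communication styles; keys are the latent
# states r in {+1, 0, -1} and the probabilities are invented for illustration.
STYLES = {
    "user_A": {+1: [0.7, 0.0, 0.3, 0.0, 0.0], 0: [0.0, 0.0, 0.0, 0.1, 0.9], -1: [0.0, 0.3, 0.0, 0.7, 0.0]},
    "user_B": {+1: [0.1, 0.0, 0.9, 0.0, 0.0], 0: [0.0, 0.0, 0.0, 0.0, 1.0], -1: [0.0, 0.05, 0.0, 0.95, 0.0]},
}

def toy_reward(x, a):
    """Hidden disposition of user x toward item a (illustrative only)."""
    if x == "user_A":
        return +1 if a % 2 == 0 else 0
    return -1 if a == 3 else 0

def step(x, a):
    """One IGL interaction: the environment draws (r, y) but reveals only y."""
    r = toy_reward(x, a)                       # latent reward, never shown to the learner
    y = rng.choice(SIGNALS, p=STYLES[x][r])    # feedback depends on (x, r), not on a
    return y

log = [(x, a, step(x, a)) for x in STYLES for a in range(4)]
print(log)   # context-action-feedback triples; the latent reward never appears
```

Note that the two toy users express the same latent states through different signal mixes, which is exactly the user-specific (rather than item-specific) communication that motivates the assumption introduced next.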
3. Derivations

We now address the first of the previously mentioned challenges from Sec. 1. For the recommender system setting, we use the assumption that 𝑦 βŸ‚ π‘Ž | π‘₯, π‘Ÿ, namely that the feedback 𝑦 is independent of the displayed content π‘Ž given the user π‘₯ and their disposition toward the displayed content π‘Ÿ ∈ {βˆ’1, 0, 1}. Thus, we assume that users may communicate in different ways, but a given user expresses satisfaction, dissatisfaction and indifference in the same way.

The statistical dependence of 𝑦 on π‘₯ frustrates the use of learning objectives which utilize the product of marginal distributions over (π‘₯, 𝑦). Essentially, given arbitrary dependence upon π‘₯, learning must operate on each example in isolation without requiring comparison across examples. This motivates attempting to predict the current action from the current context and the currently observed feedback, i.e., inverse kinematics.

3.1. Inverse Kinematics

In this section we motivate our inverse kinematics strategy using exact expectations. When acting according to any policy 𝑃(π‘Ž|π‘₯), we can imagine trying to predict the action taken given the context and feedback; the posterior distribution is

\[
\begin{aligned}
P(a \mid y, x) &= P(a \mid x)\,\frac{P(y \mid a, x)}{P(y \mid x)} && \text{(Bayes rule)}\\
&= P(a \mid x) \sum_{r} P(r \mid a, x)\,\frac{P(y \mid r, a, x)}{P(y \mid x)} && \text{(total probability)}\\
&= P(a \mid x) \sum_{r} P(r \mid a, x)\,\frac{P(y \mid r, x)}{P(y \mid x)} && (y \perp a \mid x, r)\\
&= P(a \mid x) \sum_{r} P(r \mid a, x)\,\frac{P(r \mid y, x)}{P(r \mid x)} && \text{(Bayes rule)}\\
&= \sum_{r} P(r \mid y, x)\,\frac{P(r \mid a, x)\,P(a \mid x)}{\sum_{a'} P(r \mid a', x)\,P(a' \mid x)}. && \text{(total probability)}
\end{aligned}
\tag{1}
\]

We arrive at an inner product between a reward decoder term 𝑃(π‘Ÿ|𝑦, π‘₯) and a reward predictor term 𝑃(π‘Ÿ|π‘Ž, π‘₯)𝑃(π‘Ž|π‘₯) / βˆ‘_{π‘Ž'} 𝑃(π‘Ÿ|π‘Ž', π‘₯)𝑃(π‘Ž'|π‘₯).
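The identity in equation (1) can be checked numerically. The sketch below constructs an arbitrary toy model for a single context, chosen only so that 𝑦 βŸ‚ π‘Ž | π‘₯, π‘Ÿ holds, and verifies that the direct Bayes posterior over actions coincides with the decoder/predictor inner product; all probability tables are made up for illustration.

```python
import numpy as np

# Arbitrary toy model for a single context x: 3 actions, rewards r in {-1, 0, +1},
# 5 feedback signals; chosen only so that y is independent of a given (x, r).
P_a = np.array([0.5, 0.3, 0.2])                     # P(a|x): the (known) policy
P_r_given_a = np.array([[0.10, 0.80, 0.10],         # P(r|a,x), rows: actions, cols: r
                        [0.30, 0.60, 0.10],
                        [0.05, 0.55, 0.40]])
P_y_given_r = np.array([[0.60, 0.20, 0.10, 0.05, 0.05],   # P(y|r,x), rows: r, cols: feedback
                        [0.05, 0.05, 0.10, 0.20, 0.60],
                        [0.10, 0.60, 0.20, 0.05, 0.05]])

P_y_given_a = P_r_given_a @ P_y_given_r             # P(y|a,x) = sum_r P(r|a,x) P(y|r,x)
P_y = P_a @ P_y_given_a                             # P(y|x)
P_r = P_a @ P_r_given_a                             # P(r|x)

# Direct Bayes rule: P(a|y,x) = P(a|x) P(y|a,x) / P(y|x).
post_direct = (P_a[:, None] * P_y_given_a) / P_y[None, :]

# Equation (1): inner product of the reward decoder P(r|y,x) with the
# reward predictor P(r|a,x)P(a|x) / sum_a' P(r|a',x)P(a'|x).
decoder = (P_r[:, None] * P_y_given_r) / P_y[None, :]      # P(r|y,x), shape (r, y)
predictor = (P_r_given_a * P_a[:, None]) / P_r[None, :]    # shape (a, r)
post_eq1 = predictor @ decoder                             # shape (a, y)

assert np.allclose(post_direct, post_eq1)   # both routes give the same posterior
print(post_eq1[:, 0])                       # posterior over actions after observing feedback 0
```

This decomposition is what the inverse-kinematics strategy exploits: since 𝑃(π‘Ž|π‘₯) is known, fitting the action posterior carries information about the latent reward terms.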
3.2. Extreme Event Detection

Direct extraction of a reward predictor using maximum likelihood on the action prediction problem with equation (1) is frustrated by two identifiability issues: first, this expression is invariant to a permutation of the rewards on a context-dependent basis; and second, the relative scale of the two terms being multiplied is not uniquely determined by their product. To mitigate the first issue, we assume βˆ‘_π‘Ž 𝑃(π‘Ÿ = 0 | π‘Ž, π‘₯) 𝑃(π‘Ž|π‘₯) > 1/2, i.e., nonzero rewards are rare under 𝑃(π‘Ž|π‘₯); and to mitigate the second issue, we assume the feedback can be perfectly decoded, i.e., 𝑃(π‘Ÿ|𝑦, π‘₯) ∈ {0, 1}. Under these assumptions we have

\[
r = 0 \implies P(a \mid y, x) = \frac{P(r = 0 \mid a, x)\,P(a \mid x)}{\sum_{a'} P(r = 0 \mid a', x)\,P(a' \mid x)} \le 2\,P(r = 0 \mid a, x)\,P(a \mid x) \le 2\,P(a \mid x).
\tag{2}
\]

Equation (2) forms the basis for our extreme event detector: anytime the posterior probability of an action is predicted to be more than twice the prior probability, we deduce π‘Ÿ β‰  0.

Note that a feedback signal merely being a priori rare or frequent (i.e., the magnitude of 𝑃(𝑦|π‘₯) under the policy 𝑃(π‘Ž|π‘₯)) does not imply that observing such feedback will induce an extreme event detection; rather, the feedback must have a probability that strongly depends upon which action is taken. Because feedback is assumed conditionally independent of the action, the only way for feedback to help predict which action is played is via the (action dependence of the) latent reward.

3.3. Extreme Event Disambiguation

With 2 latent states, π‘Ÿ β‰  0 ⟹ π‘Ÿ = 1, and we can reduce to a standard contextual bandit with inferred rewards 𝟙(𝑃(π‘Ž|𝑦, π‘₯) > 2𝑃(π‘Ž|π‘₯)). With 3 latent states, π‘Ÿ β‰  0 ⟹ π‘Ÿ = Β±1, and additional information is necessary to disambiguate the extreme events. We assume partial reward information is available via a "definitely negative" function dn ∢ 𝒳 Γ— 𝒴 β†’ {βˆ’1, 0} where 𝑃(dn(π‘₯, 𝑦) = 0 | π‘Ÿ = 1) = 1 and 𝑃(dn(π‘₯, 𝑦) = βˆ’1 | π‘Ÿ = βˆ’1) > 0. This reduces extreme event disambiguation to one-sided learning [23] applied only to extreme events, where we try to predict the underlying latent state given (π‘₯, π‘Ž). We assume the partial labelling is selected completely at random [24] and treat the (constant) negative labelling propensity 𝛼 as a hyperparameter. We arrive at our 3-state reward extractor

\[
\rho(x, a, y) =
\begin{cases}
0 & P(a \mid y, x) \le 2\,P(a \mid x),\\
-1 & P(a \mid y, x) > 2\,P(a \mid x) \text{ and } \mathrm{dn}(x, y) = -1,\\
\alpha & \text{otherwise},
\end{cases}
\tag{3}
\]

which is equivalent to Zhang and Lee [25, Equation 11] scaled by 𝛼. Note that setting 𝛼 = 1 embeds 2-state IGL.

3.3.1. Implementation Notes

In practice, 𝑃(π‘Ž|π‘₯) is known but the other probabilities are estimated. 𝑃̂(π‘Ž|𝑦, π‘₯) is estimated online using maximum likelihood on the problem of predicting π‘Ž from (π‘₯, 𝑦), i.e., on a data stream of tuples ((π‘₯, 𝑦), π‘Ž). The current estimates induce 𝜌̂(π‘₯, π‘Ž, 𝑦) based upon the plug-in version of equation (3). In this manner, the original data stream of (π‘₯, π‘Ž, 𝑦) tuples is transformed into a stream of (π‘₯, π‘Ž, π‘ŸΜ‚ = 𝜌̂(π‘₯, π‘Ž, 𝑦)) tuples and reduced to a standard online contextual bandit problem.

As an additional complication, although 𝑃(π‘Ž|π‘₯) is known, it is typically a good policy under which rewards are not rare (e.g., offline learning with a good historical policy, or acting online according to the policy being learned by the IGL procedure). Therefore we use importance weighting to synthesize a uniform action distribution from the true action distribution.[1] Ultimately we arrive at the procedure of Algorithm 1.

[1] When the number of actions is changing from round to round, we use importance weighting to synthesize a non-uniform action distribution with low rewards, but we elide this detail for ease of exposition.

Algorithm 1: IGL with Inverse Kinematics and either 2 or 3 Latent States
Input: contextual bandit algorithm CB-Alg.
Input: calibrated weighted multiclass classification algorithm MC-Alg.
Input: definitely negative oracle DN.            # DN(…) = 0 for 2 state IGL
Input: negative labelling propensity 𝛼.          # 𝛼 = 1 for 2 state IGL
Input: action set size 𝐾.
 1: πœ‹ ← new CB-Alg.
 2: IK ← new MC-Alg.
 3: for 𝑑 = 1, 2, … do
 4:   Observe context π‘₯𝑑 and action set 𝐴𝑑 with |𝐴𝑑| = 𝐾.
 5:   if on-policy IGL then
 6:     𝑃(β‹…|π‘₯𝑑) ← πœ‹.predict(π‘₯𝑑, 𝐴𝑑).             # compute action distribution
 7:     Play π‘Žπ‘‘ ∼ 𝑃(β‹…|π‘₯𝑑) and observe feedback 𝑦𝑑.
 8:   else
 9:     Observe (π‘₯𝑑, π‘Žπ‘‘, 𝑦𝑑, 𝑃(β‹…|π‘₯𝑑)).
10:   𝑀𝑑 ← 1/(𝐾 𝑃(π‘Žπ‘‘|π‘₯𝑑)).                       # synthetic uniform distribution
11:   𝑃̂(π‘Žπ‘‘|𝑦𝑑, π‘₯𝑑) ← IK.predict((π‘₯𝑑, 𝑦𝑑), 𝐴𝑑, π‘Žπ‘‘).  # predict action probability
12:   if 𝐾 𝑃̂(π‘Žπ‘‘|𝑦𝑑, π‘₯𝑑) ≀ 2 then                 # π‘ŸΜ‚π‘‘ = 0
13:     πœ‹.learn(π‘₯𝑑, π‘Žπ‘‘, 𝐴𝑑, π‘Ÿπ‘‘ = 0, 𝑀𝑑)
14:   else                                       # π‘ŸΜ‚π‘‘ β‰  0
15:     if DN(…) = 0 then
16:       πœ‹.learn(π‘₯𝑑, π‘Žπ‘‘, 𝐴𝑑, π‘Ÿπ‘‘ = 𝛼, 𝑀𝑑)
17:     else                                     # definitely negative
18:       πœ‹.learn(π‘₯𝑑, π‘Žπ‘‘, 𝐴𝑑, π‘Ÿπ‘‘ = βˆ’1, 𝑀𝑑)
19:   IK.learn((π‘₯𝑑, 𝑦𝑑), 𝐴𝑑, π‘Žπ‘‘, 𝑀𝑑)
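For concreteness, the sketch below shows the per-round update of Algorithm 1 in Python under assumed interfaces: cb (the contextual bandit learner), ik (the inverse-kinematics action predictor), dn (the definitely-negative oracle) and env_feedback (the hook that returns the user's feedback) are hypothetical caller-supplied objects, and their predict/learn signatures are illustrative rather than any particular library's API.

```python
import numpy as np

def extract_reward(p_a_given_yx, prior, dn_value, alpha):
    """Plug-in version of the 3-state reward extractor rho(x, a, y) in equation (3)."""
    if p_a_given_yx <= 2.0 * prior:
        return 0.0          # no extreme event detected: r_hat = 0
    if dn_value == -1:
        return -1.0         # extreme event with "definitely negative" grounding
    return alpha            # extreme but unlabeled: credit alpha (alpha = 1 recovers 2-state IGL)

def igl_round(cb, ik, dn, alpha, x, actions, env_feedback, rng):
    """One on-policy round of Algorithm 1: act, decode the feedback into r_hat, update both learners."""
    K = len(actions)
    p = cb.predict(x, actions)                   # P(.|x_t) from the current policy
    a = rng.choice(K, p=p)                       # play a_t ~ P(.|x_t)
    y = env_feedback(x, actions[a])              # observe feedback y_t; the latent reward stays hidden

    w = 1.0 / (K * p[a])                         # importance weight synthesizing a uniform action distribution
    p_hat = ik.predict((x, y), actions)[a]       # estimated P(a_t | y_t, x_t)
    # The prior is 1/K because the inverse-kinematics learner is importance weighted toward uniform.
    r_hat = extract_reward(p_hat, 1.0 / K, dn(x, y), alpha)

    cb.learn(x, a, actions, r_hat, w)            # contextual bandit update with the inferred reward
    ik.learn((x, y), actions, a, w)              # update the action predictor on ((x_t, y_t), a_t)
    return a, y, r_hat
```

Supplying a dn oracle that always returns 0 together with 𝛼 = 1 recovers the 2-state variant evaluated as IGL-P(2) in Sec. 4.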
4. Empirical Evaluations

Due to the sensitivity around production metrics and customer segments, most experiments demonstrate qualitative effects via simulation, with simulator properties inspired by production observations. Our final experiment (Sec. 4.3) includes relative performance data from a production real-world image recommendation scenario.

Abbreviations. Algorithms are denoted by the following abbreviations: Personalized IGL for 2 latent states (IGL-P(2)); Personalized IGL for 3 latent states (IGL-P(3)).

General Evaluation Setup. At each time step 𝑑, the context π‘₯𝑑 is provided by either the simulator (Sec. 4.1-4.2) or the logged production data (Sec. 4.3). The learner then selects an action π‘Žπ‘‘ and receives feedback 𝑦𝑑. In these evaluations, each user provides feedback in exactly one interaction and different user feedback signals are mutually exclusive, so that 𝑦𝑑 is a one-hot vector. In simulated environments, the ground truth reward is sometimes used for evaluation but never revealed to the algorithm.

Simulator Design. Before the start of each experiment, user profiles with fixed latent rewards for each action are generated. The users are also assigned predetermined communication styles, so the probability of emitting a given signal conditioned on the latent reward remains static throughout the duration of the experiment. Users can provide feedback using five signals: (1) like, (2) dislike, (3) click, (4) skip and (5) none. The feedback includes a mix of explicit (likes, dislikes) and implicit (clicks, skips, none) signals. Despite receiving no human input on the assumed meaning of the implicit signals, we will demonstrate that IGL can determine which feedback signals are associated with which latent state. In addition to policy optimization, IGL can therefore also be a tool for automated feature discovery. To reveal the qualitative properties of the approach, the simulated probabilities of observing a particular feedback given the reward are chosen so that they can be perfectly decoded, i.e., each feedback signal has a nonzero emission probability in exactly one latent reward state. Production data does not obey this constraint (e.g., accidental emissions of all feedback occur at some rate); theoretical analysis of our approach without perfectly decodable rewards is a topic for future work.
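The perfect-decodability property of the simulator can be illustrated and checked programmatically. In the sketch below, a communication style assigns each of the five signals a nonzero emission probability under exactly one latent state; the specific probabilities are invented for illustration and are not the values used in our simulator.

```python
import numpy as np

SIGNALS = ["like", "dislike", "click", "skip", "none"]

# A hypothetical perfectly decodable communication style: each signal has nonzero
# emission probability under exactly one latent reward state (numbers are illustrative).
STYLE = {
    +1: np.array([0.3, 0.0, 0.7, 0.0, 0.0]),   # satisfied: like or click
     0: np.array([0.0, 0.0, 0.0, 0.0, 1.0]),   # indifferent: no feedback
    -1: np.array([0.0, 0.2, 0.0, 0.8, 0.0]),   # dissatisfied: dislike or skip
}

def is_perfectly_decodable(style):
    """True if every signal can be emitted from at most one latent reward state."""
    support = np.stack([style[r] > 0 for r in (+1, 0, -1)])
    return bool(np.all(support.sum(axis=0) <= 1))

def emit(rng, r):
    """Sample a one-hot feedback vector y given the latent reward r."""
    y = np.zeros(len(SIGNALS))
    y[rng.choice(len(SIGNALS), p=STYLE[r])] = 1.0
    return y

assert is_perfectly_decodable(STYLE)
print(emit(np.random.default_rng(0), -1))   # a one-hot "dislike" or "skip" vector
```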
4.1. Motivating the 3 State Model for Recommender Systems

We now implement Algorithm 1 for 2 latent states as IGL-P(2). The experiment here shows two results about IGL-P(2): (i) it is able to succeed when there are 2 underlying latent rewards and (ii) it can no longer do so when there are 3 latent states. Fig. 1 shows the simulator setup used, where clicks and likes are used to communicate satisfaction, and dislikes, skips and no feedback (none) convey (active or passive) dissatisfaction.

Figure 1: Simulator settings for the 2 state and 3 state latent models: (a) 2 latent state model; (b) 3 latent state model. In Fig. 1a, r = 0 corresponds to anything other than the user actively enjoying the content, whereas in Fig. 1b, lack of user enjoyment is split into indifference and active dissatisfaction.

Fig. 2 shows the distribution of rewards for IGL-P(2) as a function of the number of iterations, for both the 2 and 3 latent state models. When there are only 2 latent rewards, IGL-P(2) consistently improves; however, with 3 latent states, IGL-P(2) oscillates between π‘Ÿ = 1 and π‘Ÿ = βˆ’1, resulting in much lower average user satisfaction. The empirical results demonstrate that although IGL-P(2) can successfully identify and maximize the rare feedback signals it encounters, it is unable to distinguish between satisfied and dissatisfied users.

Figure 2: Performance of IGL-P(2) in the simulated environment: (a) two latent states; (b) three latent states. Although IGL-P(2) is successful with the 2 state simulator, it fails on the 3 state simulator and oscillates between attempting to maximize r = 1 and r = βˆ’1.

4.2. IGL-P(3): Personalized Reward Learning for Recommendations

Since IGL-P(2) is not sufficient for the recommendation system setting, we now explore the performance of IGL-P(3). Using the same simulator as Fig. 1b, we evaluated IGL-P(3). Fig. 3a shows the distribution of the rewards over the course of the experiment. IGL-P(3) quickly converged and, because of the partial negative feedback for dislikes, never attempted to maximize the π‘Ÿ = βˆ’1 state. Even though users used the ambiguous skip signal to express dissatisfaction 80% of the time, IGL-P(3) was still able to learn user preferences.

In order for IGL-P(3) to succeed, the algorithm requires direct grounding from the dislike signal. We next examined how IGL-P(3) is impacted by increased or decreased presence of user dislikes. Fig. 3b was generated by varying the probability 𝑝 of users emitting dislikes given π‘Ÿ = βˆ’1, and then averaging over 10 experiments for each choice of 𝑝. While lower dislike emission probabilities are associated with slower convergence, IGL-P(3) is able to overcome the increase in unlabeled feedback and learn to associate the skip signal with user dissatisfaction. Once the feedback decoding stabilizes, regardless of the dislike emission probability, IGL-P(3) enjoys strong performance for the remainder of the experiment.

Figure 3: Performance of IGL-P(3) in the simulated environment: (a) ground truth learning curves, P(dislike | r = βˆ’1) = 0.2; (b) effect of varying P(dislike | r = βˆ’1). In Fig. 3a, IGL-P(3) successfully maximizes user satisfaction while minimizing dissatisfaction. Fig. 3b demonstrates how IGL-P(3) is robust to varying the frequency of partial information received, although more data is needed for convergence when "definitely bad" events are less frequent.
4.3. Production Results

Our production setting is a real-world image recommendation system that serves hundreds of millions of users. In our recommendation system interface, users provide feedback in the form of clicks, likes, dislikes or no feedback. All four signals are mutually exclusive and the user only provides one feedback after each interaction. For these experiments, we use data that spans millions of interactions over a period of days. The current policy implemented in practice is a contextual bandit algorithm that utilizes a hand-engineered reward function. The production policy achieves both more click and like feedback than directly optimizing for the number of clicks or directly optimizing for the number of likes. As a result, any improvements over the production policy imply improvement over any bandit algorithm for click feedback.

We implement IGL-P(2) and IGL-P(3) and report the performance as relative lift metrics over the production baseline. Unlike the simulation setting, we no longer have access to the user's latent reward after each interaction. As a result, we evaluate the performance of the novel IGL implementations through the implicit and explicit feedback signals. An increase in both clicks and likes, and a decrease in dislikes, are considered desirable outcomes. Table 1 shows the results of our empirical study.

Table 1: Relative metrics lift over a production baseline. The production baseline uses a hand-engineered reward function which is not available to the IGL algorithms. Shown are point estimates and associated bootstrap 95% confidence regions. IGL-P(2) erroneously increases dislikes to the detriment of other metrics. IGL-P(3) directionally improves over the hand-engineered reward function.

Algorithm    Clicks                   Likes                    Dislikes
IGL-P(3)     [0.999, 1.067, 1.152]    [0.985, 1.029, 1.054]    [0.751, 1.072, 1.274]
IGL-P(2)     [0.926, 1.005, 1.091]    [0.914, 0.949, 0.988]    [1.141, 1.337, 1.557]

In the simulations, IGL-P(2) exhibited a failure mode: it reliably identified extreme events but was unable to avoid the extreme negative ones. Our production data shows a similar pathology, where IGL-P(2) receives dramatically more dislikes at the expense of likes. Although the true latent state is unknown, IGL-P(2) achieved worse performance on the explicit feedback signals, directly implying that users had fewer positive interactions and significantly more negative interactions. These results provide evidence for the >2 latent state model in real-world recommendation systems.

Although we established that users have more than two latent states, it might not be the case that 3 states are sufficient to capture the recommendation system setting. Our evaluation of IGL-P(3) on our data, however, provides evidence that 3 states are enough, and that IGL is able to succeed with the context-dependent assumptions. IGL-P(3) was able to achieve performance comparable to the production baseline, with a strong directional improvement in total clicks. This is a notable achievement, because the baseline deployed in production uses a meticulously tuned, hand-engineered reward function generated from an order of magnitude more historical data.
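For reference, the sketch below shows one generic way to produce the kind of point estimate and bootstrap 95% confidence region reported in Table 1 from per-interaction indicators (e.g., whether each interaction produced a click). It is an illustrative reconstruction of the reporting style on synthetic data, not our production evaluation pipeline, and the synthetic rates are arbitrary.

```python
import numpy as np

def relative_lift_ci(treatment, baseline, n_boot=2000, seed=0):
    """Point estimate and bootstrap 95% confidence interval for the relative lift
    in a mean metric (treatment mean / baseline mean)."""
    rng = np.random.default_rng(seed)
    t = np.asarray(treatment, dtype=float)
    b = np.asarray(baseline, dtype=float)
    lifts = np.empty(n_boot)
    for i in range(n_boot):
        lifts[i] = (rng.choice(t, size=t.size).mean()
                    / rng.choice(b, size=b.size).mean())
    lo, hi = np.percentile(lifts, [2.5, 97.5])
    return lo, t.mean() / b.mean(), hi

# Synthetic example: roughly a 5% relative lift in click rate over a baseline.
rng = np.random.default_rng(1)
baseline_clicks = rng.binomial(1, 0.20, size=20_000)
treatment_clicks = rng.binomial(1, 0.21, size=20_000)
print(relative_lift_ci(treatment_clicks, baseline_clicks))  # (lower, point, upper)
```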
5. Discussion

We presented IGL for recommender systems, an approach to producing personalized recommendations that can leverage rich and diverse types of user feedback signals. In this paper, we showed that IGL can elegantly sidestep complicated manual reward engineering and effectively learn how to maximize user satisfaction with minimal human input. We considered 5 feedback signals in this work, but IGL can easily be scaled to incorporate many more signals with little computational cost.

To complete this work, we want to theoretically investigate the approach presented here in two key directions: first, characterizing finite-sample behaviour; and second, relaxing the assumption of perfectly decodable reward.

One of the open challenges for IGL is developing effective ways of evaluating its performance given the lack of true grounding, especially in situations where explicit user feedback might not be available at all. We speculate that, due to both personalization and the "rewards are rare" prior, the latent reward inferred by IGL could prove superior in causally predicting longitudinal outcomes relative to raw feedback statistics. Because longitudinal outcomes can have facially obvious semantics (e.g., subscription renewals), this could provide an alternative grounding for evaluating IGL.

Another promising future direction is IGL for fair recommender systems. Modern systems optimize for set objectives, often marginalizing user subpopulations that interact with recommender systems in different ways [26]. Since context-dependent IGL allows for personalized reward learning, it has the potential to perform consistently and fairly across diverse subgroups of users.

Acknowledgments

This work is partially supported by the National Science Foundation (https://www.nsf.gov/) under Grant No. 1650114 (NSF Graduate Research Fellowship Program, https://www.nsfgrfp.org/).
References

[1] M. Grčar, D. Mladenič, B. Fortuna, M. Grobelnik, Data sparsity issues in the collaborative filtering framework, in: International Workshop on Knowledge Discovery on the Web, Springer, 2005, pp. 58–76.
[2] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, in: 2008 Eighth IEEE International Conference on Data Mining, IEEE, 2008, pp. 263–272.
[3] X. Yi, L. Hong, E. Zhong, N. N. Liu, S. Rajan, Beyond clicks: dwell time for personalization, in: Proceedings of the 8th ACM Conference on Recommender Systems, 2014, pp. 113–120.
[4] T. Silveira, M. Zhang, X. Lin, Y. Liu, S. Ma, How good your recommender system is? A survey on evaluations in recommendation, International Journal of Machine Learning and Cybernetics 10 (2019) 813–831.
[5] K. Hofmann, F. Behr, F. Radlinski, On caption bias in interleaving experiments, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, 2012, pp. 115–124.
[6] K. Hofmann, A. Schuth, A. Bellogin, M. de Rijke, Effects of position bias on click-based recommender evaluation, in: European Conference on Information Retrieval, Springer, 2014, pp. 624–630.
[7] M. Potthast, S. KΓΆpsel, B. Stein, M. Hagen, Clickbait detection, in: European Conference on Information Retrieval, Springer, 2016, pp. 810–817.
[8] K. Scott, You won't believe what's in this paper! Clickbait, relevance and the curiosity gap, Journal of Pragmatics 175 (2021) 53–66.
[9] W. Wang, F. Feng, X. He, H. Zhang, T.-S. Chua, Clicks can be cheating: Counterfactual recommendation for mitigating clickbait issue, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1288–1297.
[10] H. Lu, M. Zhang, S. Ma, Between clicks and satisfaction: Study on multi-phase user preferences and satisfaction for online news reading, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 435–444.
[11] H. Wen, L. Yang, D. Estrin, Leveraging post-click feedback for content recommendations, in: Proceedings of the 13th ACM Conference on Recommender Systems, 2019, pp. 278–286.
[12] M. Ishwarya, G. Swetha, S. Saptha Maaleekaa, R. Anu Grahaa, Efficient recommender system by implicit emotion prediction, in: Advances in Big Data and Cloud Computing, Springer, 2019, pp. 173–178.
[13] L. PeΕ‘ka, P. VojtΓ‘Ε‘, Estimating importance of implicit factors in e-commerce recommender systems, in: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, 2012, pp. 1–4.
[14] E. Sood, S. Tannert, D. Frassinelli, A. Bulling, N. T. Vu, Interpreting attention models with human visual attention in machine reading comprehension, arXiv preprint arXiv:2010.06396 (2020).
[15] Y. Kim, A. Hassan, R. W. White, I. Zitouni, Modeling dwell time to predict click-level satisfaction, in: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, 2014, pp. 193–202.
[16] W. Wang, F. Feng, X. He, L. Nie, T.-S. Chua, Denoising implicit feedback for recommendation, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 373–381.
[17] Z. Liang, S. Huang, X. Huang, R. Cao, W. Yu, Post-click behaviors enhanced recommendation system, in: 2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI), IEEE, 2020, pp. 128–135.
[18] T. Xie, J. Langford, P. Mineiro, I. Momennejad, Interaction-grounded learning, in: International Conference on Machine Learning, PMLR, 2021, pp. 11414–11423.
[19] T. Xie, A. Saran, D. J. Foster, L. Molu, I. Momennejad, N. Jiang, P. Mineiro, J. Langford, Interaction-grounded learning with action-inclusive feedback, arXiv preprint arXiv:2206.08364 (2022).
[20] J. Beel, S. Langer, A. NΓΌrnberger, M. Genzmehr, The impact of demographics (age and gender) and other user-characteristics on evaluating recommender systems, in: International Conference on Theory and Practice of Digital Libraries, Springer, 2013, pp. 396–400.
[21] D. Shin, How do users interact with algorithm recommender systems? The interaction of users, algorithms, and performance, Computers in Human Behavior 109 (2020) 106344.
[22] T. T. Nguyen, P.-M. Hui, F. M. Harper, L. Terveen, J. A. Konstan, Exploring the filter bubble: the effect of using recommender systems on content diversity, in: Proceedings of the 23rd International Conference on World Wide Web, 2014, pp. 677–686.
[23] J. Bekker, J. Davis, Learning from positive and unlabeled data: A survey, Machine Learning 109 (2020) 719–760.
[24] C. Elkan, K. Noto, Learning classifiers from only positive and unlabeled data, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 213–220.
[25] D. Zhang, W. S. Lee, A simple probabilistic approach to learning from positive and unlabeled examples, in: Proceedings of the 5th Annual UK Workshop on Computational Intelligence (UKCI), 2005, pp. 83–87.
[26] N. Neophytou, B. Mitra, C. Stinson, Revisiting popularity and demographic biases in recommender evaluation and effectiveness, in: European Conference on Information Retrieval, Springer, 2022, pp. 641–654.