Interaction-Grounded Learning for Recommender Systems

Jessica Maghakian^1, Kishan Panaganti^2, Paul Mineiro^3, Akanksha Saran^3 and Cheng Tan^3
^1 Stony Brook University, USA
^2 Texas A&M University, USA
^3 Microsoft Research NYC, USA

ORSUM@ACM RecSys 2022: 5th Workshop on Online Recommender Systems and User Modeling, jointly with the 16th ACM Conference on Recommender Systems, September 23rd, 2022, Seattle, WA, USA

Abstract
Recommender systems have long grappled with optimizing user satisfaction using only implicit user feedback. Many approaches in the literature rely on complicated feedback modeling and costly user studies. We propose online recommender systems as a candidate for the recently introduced Interaction-Grounded Learning (IGL) paradigm. In IGL, a learner attempts to optimize a latent reward in an environment by observing feedback with no grounding. We introduce a novel personalized variant of IGL for recommender systems that can leverage explicit and implicit user feedback to maximize user satisfaction, with no feedback signal modeling and minimal assumptions. With our empirical evaluations, which include simulations as well as experiments on real product data, we demonstrate the effectiveness of IGL for recommender systems.

Keywords
recommendation systems, interaction-grounded learning, contextual bandits, reinforcement learning

1. Introduction

The last decade has seen unprecedented growth in e-commerce, social media and digital streaming offerings, resulting in users that are overwhelmed with content and choices. Online recommender systems offer a way to alleviate this information overload and improve user experience by providing personalized content. Unfortunately, optimizing user satisfaction is challenging because explicit feedback indicating user satisfaction is rare in practice [1]. To resolve the problem of data sparsity, practitioners rely on implicit signals such as clicks [2] or dwell time [3] as a proxy for user satisfaction. However, designing an optimization objective using implicit signals is nontrivial, and many modern recommender systems suffer from the following challenges.

Challenge 1: No one implicit signal is the true user satisfaction signal. User clicks are the most readily available signal, and the Click-Through Rate (CTR) metric has become the gold standard for evaluating the performance of online recommendation systems [4]. Yet there are many instances when a user will interact via clicks and be unsatisfied with the content. The most familiar of these is clickbait, where poor quality content attracts user clicks by exploiting cognitive biases such as caption bias [5], position bias [6] or the curiosity gap [7, 8]. Optimization of the CTR will naturally promote clickbait items that provide negative user experiences and cause distrust in the recommender system [9]. Recent studies show that clicks may even be a signal of user dissatisfaction: in laboratory studies of online news reading [10] and Spotify listening sessions [11], half of the clicked-on content was actually disliked by users.

Challenge 2: Incorporating multiple implicit feedback signals requires manual feature engineering. In addition to clicks, user implicit feedback can include dwell time [3], mouse movement [12], scroll information [13] and gaze [14]. One popular approach uses dwell time to filter out noisy clicks, with the reasoning that satisfied users stay on pages longer [3]. Although the industry standard is 30+ seconds of dwell time for a "meaningful" click, this number actually varies depending on the page topic, readability and content length [15]. It is equally challenging to incorporate other signals: for example, viewport time, dwell time and scroll patterns have a complicated temporal relationship and represent preference in different phases [10]. There is an extensive body of work on modeling different implicit feedback signals [16, 17]; however, these niche models may not generalize well across a diverse user base, or stay relevant as recommender systems and their users evolve.
To tackle these challenges, we propose online recommender systems as a candidate for Interaction-Grounded Learning (IGL) [18]. IGL is a learning paradigm where a learner optimizes for latent rewards by interacting with the environment and associating observed feedback with the unobservable true reward. Although IGL was originally inspired by brain-computer interface applications, in this paper we demonstrate that the framework, when utilizing a different generative assumption and augmented with an additional latent state, is also well suited for recommendation applications. Existing approaches such as reinforcement learning and traditional contextual bandits are sensitive to the choice of reward function. However, IGL resolves the two challenges above while making minimal assumptions about the value of observed user feedback. Our new approach is able to incorporate both explicit and implicit signals, leverage ambiguous user feedback and adapt to the different ways in which users interact with the system.

Our Contributions. We introduce IGL for recommender systems, allowing us to leverage implicit and explicit feedback signals and mitigate the need for reward engineering. We present the first IGL strategy for context-dependent feedback, the first use of inverse kinematics as an IGL objective, and the first IGL strategy for more than two latent states. Using simulations and real production data, we demonstrate that recommender systems require at least 3 reward states, and that IGL is able to address both of the above challenges for modern online recommender systems.

2. Background on Interaction-Grounded Learning
Problem Statement. Consider a learner interacting with an environment and trying to optimize its policy without access to any grounding or explicit reward signal. At each time step, the stationary environment generates a context π‘₯ ∈ 𝒳, sampled i.i.d. from a distribution 𝑑_0. The learner observes the context and then selects an action π‘Ž ∈ π’œ from a finite action set. In response, the environment jointly generates a latent reward and feedback vector (π‘Ÿ, 𝑦) ∈ β„› Γ— 𝒴 conditional on (π‘₯, π‘Ž). However, the learner is only able to observe 𝑦 and not π‘Ÿ. Since the latent reward can be either deterministic or stochastic, let 𝑅(π‘₯, π‘Ž) ∢= 𝔼_(π‘₯,π‘Ž)[π‘Ÿ] denote the expected reward after choosing action π‘Ž for context π‘₯. In the IGL setting, the context space 𝒳 and feedback vector space 𝒴 can be arbitrarily large. Let πœ‹ ∈ Ξ  ∢ 𝒳 β†’ Ξ”(π’œ) denote a stochastic policy, with corresponding expected return 𝑉(πœ‹) ∢= 𝔼_{(π‘₯,π‘Ž)βˆΌπ‘‘_0Γ—πœ‹}[π‘Ÿ]. In IGL, the learner's goal is to find the optimal policy πœ‹* = argmax_{πœ‹βˆˆΞ } 𝑉(πœ‹), while only able to observe context-action-feedback (π‘₯, π‘Ž, 𝑦) triples.

In the recommender system setting, the context π‘₯ is the user, the action π‘Ž is the recommended content and the feedback 𝑦 is the user feedback. Unfortunately, existing IGL approaches [18, 19] leverage assumptions designed for classification and control tasks which are a poor fit for recommendation scenarios: (i) context-independence of the feedback and (ii) binary latent rewards.

Feedback Dependence Assumptions. It is information-theoretically impossible to solve IGL without assumptions about the relation between π‘₯, π‘Ž and 𝑦 [19]. In the first paper on IGL, the authors assumed full conditional independence of the feedback on the context and chosen action, i.e. 𝑦 βŸ‚ π‘₯, π‘Ž | π‘Ÿ. For recommender systems, this undesirably implies that all users communicate preferences identically for all content. In the following paper, Xie et al. [19] loosen full conditional independence by considering context conditional independence, i.e. 𝑦 βŸ‚ π‘₯ | π‘Ž, π‘Ÿ. For our setting, this corresponds to the user feedback varying for combinations of preference and content, but remaining consistent across all users. Neither of these two assumptions is applicable in the setting of online content recommendation, because different users interact with recommender systems in different ways. This is evidenced by our production data from a real-world image recommendation system (see Sec. 4.3) along with existing results in the literature [20, 21]. By assuming user-specific communication rather than item-specific communication, we allow for personalized reward learning.

Number of Latent Reward States. Prior work shows that the binary latent reward assumption, along with an assumption that rewards are rare under a known reference policy, is sufficient for IGL to succeed. Specifically, optimizing the contrast between a learned policy and the oblivious uniform policy succeeds when feedback is both context and action independent [18], and optimizing the contrast between the learned policy and all constant-action policies succeeds when the feedback is context independent [19].

Although the binary latent reward assumption (e.g., satisfied or dissatisfied) appears reasonable for recommendation scenarios, it fails to account for user indifference versus user dissatisfaction. This observation was first motivated by our production data, where a 2 state IGL policy would sometimes maximize feedback signals with obviously negative semantics. Assuming users ignore most content most of the time [22], negative feedback can be as difficult to elicit as positive feedback, and a 2 state IGL model is unable to distinguish between these extremes. Hence, we posit that a minimal latent state model for recommender systems involves 3 states: (i) π‘Ÿ = 1, when users are satisfied with the recommended content; (ii) π‘Ÿ = 0, when users are indifferent or inattentive; and (iii) π‘Ÿ = βˆ’1, when users are dissatisfied.
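To make the interaction protocol concrete, the following is a minimal illustrative Python sketch of the IGL loop with the 3-state latent reward: the environment draws a hidden disposition π‘Ÿ and a feedback signal 𝑦 whose distribution depends on the user and on π‘Ÿ but not on the recommended item, and the learner only ever logs (π‘₯, π‘Ž, 𝑦) triples. The two users, their communication styles and the toy reward function are invented for illustration and are not taken from our simulator or production system.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGNALS = ["like", "dislike", "click", "skip", "none"]

# Two hypothetical users with different communication styles; keys are the latent
# states r in {+1, 0, -1} and the probabilities are invented for illustration.
STYLES = {
    "user_A": {+1: [0.7, 0.0, 0.3, 0.0, 0.0], 0: [0.0, 0.0, 0.0, 0.1, 0.9], -1: [0.0, 0.3, 0.0, 0.7, 0.0]},
    "user_B": {+1: [0.1, 0.0, 0.9, 0.0, 0.0], 0: [0.0, 0.0, 0.0, 0.0, 1.0], -1: [0.0, 0.05, 0.0, 0.95, 0.0]},
}

def toy_reward(x, a):
    """Hidden disposition of user x toward item a (illustrative only)."""
    if x == "user_A":
        return +1 if a % 2 == 0 else 0
    return -1 if a == 3 else 0

def step(x, a):
    """One IGL interaction: the environment draws (r, y) but reveals only y."""
    r = toy_reward(x, a)                       # latent reward, never shown to the learner
    y = rng.choice(SIGNALS, p=STYLES[x][r])    # feedback depends on (x, r), not on a
    return y

log = [(x, a, step(x, a)) for x in STYLES for a in range(4)]
print(log)   # context-action-feedback triples; the latent reward never appears
```

Note that the two toy users express the same latent states through different signal mixes, which is exactly the user-specific (rather than item-specific) communication that motivates the assumption introduced next.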
3. Derivations

We now address the first of the previously mentioned challenges from Sec. 1. For the recommender system setting, we use the assumption that 𝑦 βŸ‚ π‘Ž | π‘₯, π‘Ÿ, namely that the feedback 𝑦 is independent of the displayed content π‘Ž given the user π‘₯ and their disposition toward the displayed content π‘Ÿ ∈ {βˆ’1, 0, 1}. Thus, we assume that users may communicate in different ways, but a given user expresses satisfaction, dissatisfaction and indifference in the same way.

The statistical dependence of 𝑦 on π‘₯ frustrates the use of learning objectives which utilize the product of marginal distributions over (π‘₯, 𝑦). Essentially, given arbitrary dependence upon π‘₯, learning must operate on each example in isolation without requiring comparison across examples. This motivates attempting to predict the current action from the current context and the currently observed feedback, i.e., inverse kinematics.

3.1. Inverse Kinematics

In this section we motivate our inverse kinematics strategy using exact expectations. When acting according to any policy 𝑃(π‘Ž|π‘₯), we can imagine trying to predict the action taken given the context and feedback; the posterior distribution is

\[
\begin{aligned}
P(a \mid y, x) &= P(a \mid x)\,\frac{P(y \mid a, x)}{P(y \mid x)} && \text{(Bayes rule)}\\
&= P(a \mid x) \sum_{r} P(r \mid a, x)\,\frac{P(y \mid r, a, x)}{P(y \mid x)} && \text{(total probability)}\\
&= P(a \mid x) \sum_{r} P(r \mid a, x)\,\frac{P(y \mid r, x)}{P(y \mid x)} && (y \perp a \mid x, r)\\
&= P(a \mid x) \sum_{r} P(r \mid a, x)\,\frac{P(r \mid y, x)}{P(r \mid x)} && \text{(Bayes rule)}\\
&= \sum_{r} P(r \mid y, x)\,\frac{P(r \mid a, x)\,P(a \mid x)}{\sum_{a'} P(r \mid a', x)\,P(a' \mid x)}. && \text{(total probability)}
\end{aligned}
\tag{1}
\]

We arrive at an inner product between a reward decoder term 𝑃(π‘Ÿ|𝑦, π‘₯) and a reward predictor term 𝑃(π‘Ÿ|π‘Ž, π‘₯)𝑃(π‘Ž|π‘₯) / βˆ‘_{π‘Ž'} 𝑃(π‘Ÿ|π‘Ž', π‘₯)𝑃(π‘Ž'|π‘₯).
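The identity in equation (1) can be checked numerically. The sketch below constructs an arbitrary toy model for a single context, chosen only so that 𝑦 βŸ‚ π‘Ž | π‘₯, π‘Ÿ holds, and verifies that the direct Bayes posterior over actions coincides with the decoder/predictor inner product; all probability tables are made up for illustration.

```python
import numpy as np

# Arbitrary toy model for a single context x: 3 actions, rewards r in {-1, 0, +1},
# 5 feedback signals; chosen only so that y is independent of a given (x, r).
P_a = np.array([0.5, 0.3, 0.2])                     # P(a|x): the (known) policy
P_r_given_a = np.array([[0.10, 0.80, 0.10],         # P(r|a,x), rows: actions, cols: r
                        [0.30, 0.60, 0.10],
                        [0.05, 0.55, 0.40]])
P_y_given_r = np.array([[0.60, 0.20, 0.10, 0.05, 0.05],   # P(y|r,x), rows: r, cols: feedback
                        [0.05, 0.05, 0.10, 0.20, 0.60],
                        [0.10, 0.60, 0.20, 0.05, 0.05]])

P_y_given_a = P_r_given_a @ P_y_given_r             # P(y|a,x) = sum_r P(r|a,x) P(y|r,x)
P_y = P_a @ P_y_given_a                             # P(y|x)
P_r = P_a @ P_r_given_a                             # P(r|x)

# Direct Bayes rule: P(a|y,x) = P(a|x) P(y|a,x) / P(y|x).
post_direct = (P_a[:, None] * P_y_given_a) / P_y[None, :]

# Equation (1): inner product of the reward decoder P(r|y,x) with the
# reward predictor P(r|a,x)P(a|x) / sum_a' P(r|a',x)P(a'|x).
decoder = (P_r[:, None] * P_y_given_r) / P_y[None, :]      # P(r|y,x), shape (r, y)
predictor = (P_r_given_a * P_a[:, None]) / P_r[None, :]    # shape (a, r)
post_eq1 = predictor @ decoder                             # shape (a, y)

assert np.allclose(post_direct, post_eq1)   # both routes give the same posterior
print(post_eq1[:, 0])                       # posterior over actions after observing feedback 0
```

This decomposition is what the inverse-kinematics strategy exploits: since 𝑃(π‘Ž|π‘₯) is known, fitting the action posterior carries information about the latent reward terms.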
3.2. Extreme Event Detection

Direct extraction of a reward predictor using maximum likelihood on the action prediction problem with equation (1) is frustrated by two identifiability issues: first, this expression is invariant to a permutation of the rewards on a context-dependent basis; and second, the relative scale of the two terms being multiplied is not uniquely determined by their product. To mitigate the first issue, we assume βˆ‘_π‘Ž 𝑃(π‘Ÿ = 0 | π‘Ž, π‘₯) 𝑃(π‘Ž|π‘₯) > 1/2, i.e., nonzero rewards are rare under 𝑃(π‘Ž|π‘₯); and to mitigate the second issue, we assume the feedback can be perfectly decoded, i.e., 𝑃(π‘Ÿ|𝑦, π‘₯) ∈ {0, 1}. Under these assumptions we have

\[
r = 0 \implies P(a \mid y, x) = \frac{P(r = 0 \mid a, x)\,P(a \mid x)}{\sum_{a'} P(r = 0 \mid a', x)\,P(a' \mid x)} \le 2\,P(r = 0 \mid a, x)\,P(a \mid x) \le 2\,P(a \mid x).
\tag{2}
\]

Equation (2) forms the basis for our extreme event detector: anytime the posterior probability of an action is predicted to be more than twice the prior probability, we deduce π‘Ÿ β‰  0.

Note that a feedback signal merely being a priori rare or frequent (i.e., the magnitude of 𝑃(𝑦|π‘₯) under the policy 𝑃(π‘Ž|π‘₯)) does not imply that observing such feedback will induce an extreme event detection; rather, the feedback must have a probability that strongly depends upon which action is taken. Because feedback is assumed conditionally independent of the action, the only way for feedback to help predict which action is played is via the (action dependence of the) latent reward.

3.3. Extreme Event Disambiguation

With 2 latent states, π‘Ÿ β‰  0 ⟹ π‘Ÿ = 1, and we can reduce to a standard contextual bandit with inferred rewards 𝟙(𝑃(π‘Ž|𝑦, π‘₯) > 2𝑃(π‘Ž|π‘₯)). With 3 latent states, π‘Ÿ β‰  0 ⟹ π‘Ÿ = Β±1, and additional information is necessary to disambiguate the extreme events. We assume partial reward information is available via a "definitely negative" function dn ∢ 𝒳 Γ— 𝒴 β†’ {βˆ’1, 0} where 𝑃(dn(π‘₯, 𝑦) = 0 | π‘Ÿ = 1) = 1 and 𝑃(dn(π‘₯, 𝑦) = βˆ’1 | π‘Ÿ = βˆ’1) > 0. This reduces extreme event disambiguation to one-sided learning [23] applied only to extreme events, where we try to predict the underlying latent state given (π‘₯, π‘Ž). We assume the partial labelling is selected completely at random [24] and treat the (constant) negative labelling propensity 𝛼 as a hyperparameter. We arrive at our 3-state reward extractor

\[
\rho(x, a, y) =
\begin{cases}
0 & P(a \mid y, x) \le 2\,P(a \mid x),\\
-1 & P(a \mid y, x) > 2\,P(a \mid x) \text{ and } \mathrm{dn}(x, y) = -1,\\
\alpha & \text{otherwise},
\end{cases}
\tag{3}
\]

which is equivalent to Zhang and Lee [25, Equation 11] scaled by 𝛼. Note that setting 𝛼 = 1 embeds 2-state IGL.

3.3.1. Implementation Notes

In practice, 𝑃(π‘Ž|π‘₯) is known but the other probabilities are estimated. 𝑃̂(π‘Ž|𝑦, π‘₯) is estimated online using maximum likelihood on the problem of predicting π‘Ž from (π‘₯, 𝑦), i.e., on a data stream of tuples ((π‘₯, 𝑦), π‘Ž). The current estimates induce 𝜌̂(π‘₯, π‘Ž, 𝑦) based upon the plug-in version of equation (3). In this manner, the original data stream of (π‘₯, π‘Ž, 𝑦) tuples is transformed into a stream of (π‘₯, π‘Ž, π‘ŸΜ‚ = 𝜌̂(π‘₯, π‘Ž, 𝑦)) tuples and reduced to a standard online contextual bandit problem.

As an additional complication, although 𝑃(π‘Ž|π‘₯) is known, it is typically a good policy under which rewards are not rare (e.g., offline learning with a good historical policy, or acting online according to the policy being learned by the IGL procedure). Therefore we use importance weighting to synthesize a uniform action distribution from the true action distribution.[1] Ultimately we arrive at the procedure of Algorithm 1.

[1] When the number of actions is changing from round to round, we use importance weighting to synthesize a non-uniform action distribution with low rewards, but we elide this detail for ease of exposition.

Algorithm 1: IGL with Inverse Kinematics and either 2 or 3 Latent States
Input: contextual bandit algorithm CB-Alg.
Input: calibrated weighted multiclass classification algorithm MC-Alg.
Input: definitely negative oracle DN.            # DN(…) = 0 for 2 state IGL
Input: negative labelling propensity 𝛼.          # 𝛼 = 1 for 2 state IGL
Input: action set size 𝐾.
 1: πœ‹ ← new CB-Alg.
 2: IK ← new MC-Alg.
 3: for 𝑑 = 1, 2, … do
 4:   Observe context π‘₯𝑑 and action set 𝐴𝑑 with |𝐴𝑑| = 𝐾.
 5:   if on-policy IGL then
 6:     𝑃(β‹…|π‘₯𝑑) ← πœ‹.predict(π‘₯𝑑, 𝐴𝑑).             # compute action distribution
 7:     Play π‘Žπ‘‘ ∼ 𝑃(β‹…|π‘₯𝑑) and observe feedback 𝑦𝑑.
 8:   else
 9:     Observe (π‘₯𝑑, π‘Žπ‘‘, 𝑦𝑑, 𝑃(β‹…|π‘₯𝑑)).
10:   𝑀𝑑 ← 1/(𝐾 𝑃(π‘Žπ‘‘|π‘₯𝑑)).                       # synthetic uniform distribution
11:   𝑃̂(π‘Žπ‘‘|𝑦𝑑, π‘₯𝑑) ← IK.predict((π‘₯𝑑, 𝑦𝑑), 𝐴𝑑, π‘Žπ‘‘).  # predict action probability
12:   if 𝐾 𝑃̂(π‘Žπ‘‘|𝑦𝑑, π‘₯𝑑) ≀ 2 then                 # π‘ŸΜ‚π‘‘ = 0
13:     πœ‹.learn(π‘₯𝑑, π‘Žπ‘‘, 𝐴𝑑, π‘Ÿπ‘‘ = 0, 𝑀𝑑)
14:   else                                       # π‘ŸΜ‚π‘‘ β‰  0
15:     if DN(…) = 0 then
16:       πœ‹.learn(π‘₯𝑑, π‘Žπ‘‘, 𝐴𝑑, π‘Ÿπ‘‘ = 𝛼, 𝑀𝑑)
17:     else                                     # definitely negative
18:       πœ‹.learn(π‘₯𝑑, π‘Žπ‘‘, 𝐴𝑑, π‘Ÿπ‘‘ = βˆ’1, 𝑀𝑑)
19:   IK.learn((π‘₯𝑑, 𝑦𝑑), 𝐴𝑑, π‘Žπ‘‘, 𝑀𝑑)
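For concreteness, the sketch below shows the per-round update of Algorithm 1 in Python under assumed interfaces: cb (the contextual bandit learner), ik (the inverse-kinematics action predictor), dn (the definitely-negative oracle) and env_feedback (the hook that returns the user's feedback) are hypothetical caller-supplied objects, and their predict/learn signatures are illustrative rather than any particular library's API.

```python
import numpy as np

def extract_reward(p_a_given_yx, prior, dn_value, alpha):
    """Plug-in version of the 3-state reward extractor rho(x, a, y) in equation (3)."""
    if p_a_given_yx <= 2.0 * prior:
        return 0.0          # no extreme event detected: r_hat = 0
    if dn_value == -1:
        return -1.0         # extreme event with "definitely negative" grounding
    return alpha            # extreme but unlabeled: credit alpha (alpha = 1 recovers 2-state IGL)

def igl_round(cb, ik, dn, alpha, x, actions, env_feedback, rng):
    """One on-policy round of Algorithm 1: act, decode the feedback into r_hat, update both learners."""
    K = len(actions)
    p = cb.predict(x, actions)                   # P(.|x_t) from the current policy
    a = rng.choice(K, p=p)                       # play a_t ~ P(.|x_t)
    y = env_feedback(x, actions[a])              # observe feedback y_t; the latent reward stays hidden

    w = 1.0 / (K * p[a])                         # importance weight synthesizing a uniform action distribution
    p_hat = ik.predict((x, y), actions)[a]       # estimated P(a_t | y_t, x_t)
    # The prior is 1/K because the inverse-kinematics learner is importance weighted toward uniform.
    r_hat = extract_reward(p_hat, 1.0 / K, dn(x, y), alpha)

    cb.learn(x, a, actions, r_hat, w)            # contextual bandit update with the inferred reward
    ik.learn((x, y), actions, a, w)              # update the action predictor on ((x_t, y_t), a_t)
    return a, y, r_hat
```

Supplying a dn oracle that always returns 0 together with 𝛼 = 1 recovers the 2-state variant evaluated as IGL-P(2) in Sec. 4.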
4. Empirical Evaluations

Due to the sensitivity around production metrics and customer segments, most experiments demonstrate qualitative effects via simulation, with simulator properties inspired by production observations. Our final experiment (Sec. 4.3) includes relative performance data from a production real-world image recommendation scenario.

Abbreviations. Algorithms are denoted by the following abbreviations: Personalized IGL for 2 latent states (IGL-P(2)); Personalized IGL for 3 latent states (IGL-P(3)).

General Evaluation Setup. At each time step 𝑑, the context π‘₯𝑑 is provided by either the simulator (Sec. 4.1-4.2) or the logged production data (Sec. 4.3). The learner then selects an action π‘Žπ‘‘ and receives feedback 𝑦𝑑. In these evaluations, each user provides feedback in exactly one interaction and different user feedback signals are mutually exclusive, so that 𝑦𝑑 is a one-hot vector. In simulated environments, the ground truth reward is sometimes used for evaluation but never revealed to the algorithm.

Simulator Design. Before the start of each experiment, user profiles with fixed latent rewards for each action are generated. The users are also assigned predetermined communication styles, so the probability of emitting a given signal conditioned on the latent reward remains static throughout the duration of the experiment. Users can provide feedback using five signals: (1) like, (2) dislike, (3) click, (4) skip and (5) none. The feedback includes a mix of explicit (likes, dislikes) and implicit (clicks, skips, none) signals. Despite receiving no human input on the assumed meaning of the implicit signals, we will demonstrate that IGL can determine which feedback signals are associated with which latent state. In addition to policy optimization, IGL can therefore also be a tool for automated feature discovery. To reveal the qualitative properties of the approach, the simulated probabilities of observing a particular feedback given the reward are chosen so that they can be perfectly decoded, i.e., each feedback signal has a nonzero emission probability in exactly one latent reward state. Production data does not obey this constraint (e.g., accidental emissions of all feedback occur at some rate); theoretical analysis of our approach without perfectly decodable rewards is a topic for future work.
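The perfect-decodability property of the simulator can be illustrated and checked programmatically. In the sketch below, a communication style assigns each of the five signals a nonzero emission probability under exactly one latent state; the specific probabilities are invented for illustration and are not the values used in our simulator.

```python
import numpy as np

SIGNALS = ["like", "dislike", "click", "skip", "none"]

# A hypothetical perfectly decodable communication style: each signal has nonzero
# emission probability under exactly one latent reward state (numbers are illustrative).
STYLE = {
    +1: np.array([0.3, 0.0, 0.7, 0.0, 0.0]),   # satisfied: like or click
     0: np.array([0.0, 0.0, 0.0, 0.0, 1.0]),   # indifferent: no feedback
    -1: np.array([0.0, 0.2, 0.0, 0.8, 0.0]),   # dissatisfied: dislike or skip
}

def is_perfectly_decodable(style):
    """True if every signal can be emitted from at most one latent reward state."""
    support = np.stack([style[r] > 0 for r in (+1, 0, -1)])
    return bool(np.all(support.sum(axis=0) <= 1))

def emit(rng, r):
    """Sample a one-hot feedback vector y given the latent reward r."""
    y = np.zeros(len(SIGNALS))
    y[rng.choice(len(SIGNALS), p=STYLE[r])] = 1.0
    return y

assert is_perfectly_decodable(STYLE)
print(emit(np.random.default_rng(0), -1))   # a one-hot "dislike" or "skip" vector
```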
4.1. Motivating the 3 State Model for Recommender Systems

We now implement Algorithm 1 for 2 latent states as IGL-P(2). The experiment here shows two results about IGL-P(2): (i) it is able to succeed when there are 2 underlying latent rewards and (ii) it can no longer do so when there are 3 latent states. Fig. 1 shows the simulator setup used, where clicks and likes are used to communicate satisfaction, and dislikes, skips and no feedback (none) convey (active or passive) dissatisfaction.

Figure 1: Simulator settings for the 2 state and 3 state latent models: (a) 2 latent state model; (b) 3 latent state model. In Fig. 1a, r = 0 corresponds to anything other than the user actively enjoying the content, whereas in Fig. 1b, lack of user enjoyment is split into indifference and active dissatisfaction.

Fig. 2 shows the distribution of rewards for IGL-P(2) as a function of the number of iterations, for both the 2 and 3 latent state models. When there are only 2 latent rewards, IGL-P(2) consistently improves; however, with 3 latent states, IGL-P(2) oscillates between π‘Ÿ = 1 and π‘Ÿ = βˆ’1, resulting in much lower average user satisfaction. The empirical results demonstrate that although IGL-P(2) can successfully identify and maximize the rare feedback signals it encounters, it is unable to distinguish between satisfied and dissatisfied users.

Figure 2: Performance of IGL-P(2) in the simulated environment: (a) two latent states; (b) three latent states. Although IGL-P(2) is successful with the 2 state simulator, it fails on the 3 state simulator and oscillates between attempting to maximize r = 1 and r = βˆ’1.

4.2. IGL-P(3): Personalized Reward Learning for Recommendations

Since IGL-P(2) is not sufficient for the recommendation system setting, we now explore the performance of IGL-P(3). Using the same simulator as Fig. 1b, we evaluated IGL-P(3). Fig. 3a shows the distribution of the rewards over the course of the experiment. IGL-P(3) quickly converged and, because of the partial negative feedback for dislikes, never attempted to maximize the π‘Ÿ = βˆ’1 state. Even though users used the ambiguous skip signal to express dissatisfaction 80% of the time, IGL-P(3) was still able to learn user preferences.

In order for IGL-P(3) to succeed, the algorithm requires direct grounding from the dislike signal. We next examined how IGL-P(3) is impacted by increased or decreased presence of user dislikes. Fig. 3b was generated by varying the probability 𝑝 of users emitting dislikes given π‘Ÿ = βˆ’1, and then averaging over 10 experiments for each choice of 𝑝. While lower dislike emission probabilities are associated with slower convergence, IGL-P(3) is able to overcome the increase in unlabeled feedback and learn to associate the skip signal with user dissatisfaction. Once the feedback decoding stabilizes, regardless of the dislike emission probability, IGL-P(3) enjoys strong performance for the remainder of the experiment.

Figure 3: Performance of IGL-P(3) in the simulated environment: (a) ground truth learning curves, P(dislike | r = βˆ’1) = 0.2; (b) effect of varying P(dislike | r = βˆ’1). In Fig. 3a, IGL-P(3) successfully maximizes user satisfaction while minimizing dissatisfaction. Fig. 3b demonstrates how IGL-P(3) is robust to varying the frequency of partial information received, although more data is needed for convergence when "definitely bad" events are less frequent.
4.3. Production Results

Our production setting is a real-world image recommendation system that serves hundreds of millions of users. In our recommendation system interface, users provide feedback in the form of clicks, likes, dislikes or no feedback. All four signals are mutually exclusive and the user only provides one feedback after each interaction. For these experiments, we use data that spans millions of interactions over a period of days. The current policy implemented in practice is a contextual bandit algorithm that utilizes a hand-engineered reward function. The production policy achieves both more click and like feedback than directly optimizing for the number of clicks or directly optimizing for the number of likes. As a result, any improvements over the production policy imply improvement over any bandit algorithm for click feedback.

We implement IGL-P(2) and IGL-P(3) and report the performance as relative lift metrics over the production baseline. Unlike the simulation setting, we no longer have access to the user's latent reward after each interaction. As a result, we evaluate the performance of the novel IGL implementations through the implicit and explicit feedback signals. An increase in both clicks and likes, and a decrease in dislikes, are considered desirable outcomes. Table 1 shows the results of our empirical study.

Table 1: Relative metrics lift over a production baseline. The production baseline uses a hand-engineered reward function which is not available to the IGL algorithms. Shown are point estimates and associated bootstrap 95% confidence regions. IGL-P(2) erroneously increases dislikes to the detriment of other metrics. IGL-P(3) directionally improves over the hand-engineered reward function.

Algorithm    Clicks                   Likes                    Dislikes
IGL-P(3)     [0.999, 1.067, 1.152]    [0.985, 1.029, 1.054]    [0.751, 1.072, 1.274]
IGL-P(2)     [0.926, 1.005, 1.091]    [0.914, 0.949, 0.988]    [1.141, 1.337, 1.557]

In the simulations, IGL-P(2) exhibited a failure mode: it reliably identified extreme events but was unable to avoid the extreme negative ones. Our production data shows a similar pathology, where IGL-P(2) receives dramatically more dislikes at the expense of likes. Although the true latent state is unknown, IGL-P(2) achieved worse performance on the explicit feedback signals, directly implying that users had fewer positive interactions and significantly more negative interactions. These results provide evidence for the >2 latent state model in real-world recommendation systems.

Although we established that users have more than two latent states, it might not be the case that 3 states are sufficient to capture the recommendation system setting. Our evaluation of IGL-P(3) on our data, however, provides evidence that 3 states are enough, and that IGL is able to succeed with the context-dependent assumptions. IGL-P(3) was able to achieve performance comparable to the production baseline, with a strong directional improvement in total clicks. This is a notable achievement, because the baseline deployed in production uses a meticulously tuned, hand-engineered reward function generated from an order of magnitude more historical data.
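For reference, the sketch below shows one generic way to produce the kind of point estimate and bootstrap 95% confidence region reported in Table 1 from per-interaction indicators (e.g., whether each interaction produced a click). It is an illustrative reconstruction of the reporting style on synthetic data, not our production evaluation pipeline, and the synthetic rates are arbitrary.

```python
import numpy as np

def relative_lift_ci(treatment, baseline, n_boot=2000, seed=0):
    """Point estimate and bootstrap 95% confidence interval for the relative lift
    in a mean metric (treatment mean / baseline mean)."""
    rng = np.random.default_rng(seed)
    t = np.asarray(treatment, dtype=float)
    b = np.asarray(baseline, dtype=float)
    lifts = np.empty(n_boot)
    for i in range(n_boot):
        lifts[i] = (rng.choice(t, size=t.size).mean()
                    / rng.choice(b, size=b.size).mean())
    lo, hi = np.percentile(lifts, [2.5, 97.5])
    return lo, t.mean() / b.mean(), hi

# Synthetic example: roughly a 5% relative lift in click rate over a baseline.
rng = np.random.default_rng(1)
baseline_clicks = rng.binomial(1, 0.20, size=20_000)
treatment_clicks = rng.binomial(1, 0.21, size=20_000)
print(relative_lift_ci(treatment_clicks, baseline_clicks))  # (lower, point, upper)
```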
5. Discussion

We presented IGL for recommender systems, an approach to producing personalized recommendations that can leverage rich and diverse types of user feedback signals. In this paper, we showed that IGL can elegantly sidestep complicated manual reward engineering and effectively learn how to maximize user satisfaction with minimal human input. We considered 5 feedback signals in this work, but IGL can easily be scaled to incorporate many more signals with little computational cost.

To complete this work, we want to theoretically investigate the approach presented here in two key directions: first, characterizing finite-sample behaviour; and second, relaxing the assumption of perfectly decodable reward.

One of the open challenges for IGL is developing effective ways of evaluating its performance given the lack of true grounding, especially in situations where explicit user feedback might not be available at all. We speculate that, due to both personalization and the "rewards are rare" prior, the latent reward inferred by IGL could prove superior in causally predicting longitudinal outcomes relative to raw feedback statistics. Because longitudinal outcomes can have facially obvious semantics (e.g., subscription renewals), this could provide an alternative grounding for evaluating IGL.

Another promising future direction is IGL for fair recommender systems. Modern systems optimize for set objectives, often marginalizing user subpopulations that interact with recommender systems in different ways [26]. Since context-dependent IGL allows for personalized reward learning, it has the potential to perform consistently and fairly across diverse subgroups of users.

Acknowledgments

This work is partially supported by the National Science Foundation (https://www.nsf.gov/) under Grant No. 1650114 (NSF Graduate Research Fellowship Program, https://www.nsfgrfp.org/).
References

[1] M. Grčar, D. Mladenič, B. Fortuna, M. Grobelnik, Data sparsity issues in the collaborative filtering framework, in: International Workshop on Knowledge Discovery on the Web, Springer, 2005, pp. 58–76.
[2] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, in: 2008 Eighth IEEE International Conference on Data Mining, IEEE, 2008, pp. 263–272.
[3] X. Yi, L. Hong, E. Zhong, N. N. Liu, S. Rajan, Beyond clicks: dwell time for personalization, in: Proceedings of the 8th ACM Conference on Recommender Systems, 2014, pp. 113–120.
[4] T. Silveira, M. Zhang, X. Lin, Y. Liu, S. Ma, How good your recommender system is? A survey on evaluations in recommendation, International Journal of Machine Learning and Cybernetics 10 (2019) 813–831.
[5] K. Hofmann, F. Behr, F. Radlinski, On caption bias in interleaving experiments, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, 2012, pp. 115–124.
[6] K. Hofmann, A. Schuth, A. Bellogin, M. de Rijke, Effects of position bias on click-based recommender evaluation, in: European Conference on Information Retrieval, Springer, 2014, pp. 624–630.
[7] M. Potthast, S. KΓΆpsel, B. Stein, M. Hagen, Clickbait detection, in: European Conference on Information Retrieval, Springer, 2016, pp. 810–817.
[8] K. Scott, You won't believe what's in this paper! Clickbait, relevance and the curiosity gap, Journal of Pragmatics 175 (2021) 53–66.
[9] W. Wang, F. Feng, X. He, H. Zhang, T.-S. Chua, Clicks can be cheating: Counterfactual recommendation for mitigating clickbait issue, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1288–1297.
[10] H. Lu, M. Zhang, S. Ma, Between clicks and satisfaction: Study on multi-phase user preferences and satisfaction for online news reading, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 435–444.
[11] H. Wen, L. Yang, D. Estrin, Leveraging post-click feedback for content recommendations, in: Proceedings of the 13th ACM Conference on Recommender Systems, 2019, pp. 278–286.
[12] M. Ishwarya, G. Swetha, S. Saptha Maaleekaa, R. Anu Grahaa, Efficient recommender system by implicit emotion prediction, in: Advances in Big Data and Cloud Computing, Springer, 2019, pp. 173–178.
[13] L. PeΕ‘ka, P. VojtΓ‘Ε‘, Estimating importance of implicit factors in e-commerce recommender systems, in: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, 2012, pp. 1–4.
[14] E. Sood, S. Tannert, D. Frassinelli, A. Bulling, N. T. Vu, Interpreting attention models with human visual attention in machine reading comprehension, arXiv preprint arXiv:2010.06396 (2020).
[15] Y. Kim, A. Hassan, R. W. White, I. Zitouni, Modeling dwell time to predict click-level satisfaction, in: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, 2014, pp. 193–202.
[16] W. Wang, F. Feng, X. He, L. Nie, T.-S. Chua, Denoising implicit feedback for recommendation, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 373–381.
[17] Z. Liang, S. Huang, X. Huang, R. Cao, W. Yu, Post-click behaviors enhanced recommendation system, in: 2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI), IEEE, 2020, pp. 128–135.
[18] T. Xie, J. Langford, P. Mineiro, I. Momennejad, Interaction-grounded learning, in: International Conference on Machine Learning, PMLR, 2021, pp. 11414–11423.
[19] T. Xie, A. Saran, D. J. Foster, L. Molu, I. Momennejad, N. Jiang, P. Mineiro, J. Langford, Interaction-grounded learning with action-inclusive feedback, arXiv preprint arXiv:2206.08364 (2022).
[20] J. Beel, S. Langer, A. NΓΌrnberger, M. Genzmehr, The impact of demographics (age and gender) and other user-characteristics on evaluating recommender systems, in: International Conference on Theory and Practice of Digital Libraries, Springer, 2013, pp. 396–400.
[21] D. Shin, How do users interact with algorithm recommender systems? The interaction of users, algorithms, and performance, Computers in Human Behavior 109 (2020) 106344.
[22] T. T. Nguyen, P.-M. Hui, F. M. Harper, L. Terveen, J. A. Konstan, Exploring the filter bubble: the effect of using recommender systems on content diversity, in: Proceedings of the 23rd International Conference on World Wide Web, 2014, pp. 677–686.
[23] J. Bekker, J. Davis, Learning from positive and unlabeled data: A survey, Machine Learning 109 (2020) 719–760.
[24] C. Elkan, K. Noto, Learning classifiers from only positive and unlabeled data, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 213–220.
[25] D. Zhang, W. S. Lee, A simple probabilistic approach to learning from positive and unlabeled examples, in: Proceedings of the 5th Annual UK Workshop on Computational Intelligence (UKCI), 2005, pp. 83–87.
[26] N. Neophytou, B. Mitra, C. Stinson, Revisiting popularity and demographic biases in recommender evaluation and effectiveness, in: European Conference on Information Retrieval, Springer, 2022, pp. 641–654.