Delayed Rewards in the context of Reinforcement Learning based Recommender Systems

Debmalya Biswas 1

1 Philip Morris Products S. A., Lausanne, Switzerland, email: debmalya.biswas@pmi.com

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. Advances in Artificial Intelligence for Healthcare, September 4, 2020, Virtual Workshop.

Abstract. We present a Reinforcement Learning (RL) based approach to implement Recommender Systems. The results are based on a real-life Wellness app that is able to provide personalized, health-related content to users in an interactive fashion. Unfortunately, current recommender systems are unable to adapt to continuously evolving features, e.g. user sentiment, and to scenarios where the RL reward needs to be computed based on multiple and unreliable feedback channels (e.g., sensors, wearables). To overcome this, we propose three constructs: (i) weighted feedback channels, (ii) delayed rewards, and (iii) rewards boosting, which we believe are essential for RL to be used in Recommender Systems. Finally, we also provide implementation details on how the Wellness app, based on Azure Personalizer, was extended to accommodate the above RL constructs.

1 INTRODUCTION

Wellness apps have historically suffered from low adoption rates. Personalized recommendations have the potential to improve adoption by making increasingly relevant and timely recommendations to users. While recommendation engines (and, consequently, the apps based on them) have grown in maturity, they still suffer from the 'cold start' problem, and from the fact that they are basically push-based mechanisms lacking the level of interactivity needed to make such apps appealing to millennials.

We present a Wellness app case study in which we applied a combination of Reinforcement Learning (RL) and Natural Language Processing (NLP)/chatbots to provide a highly personalized and interactive experience to users. We focus on the interactive aspect of the app, where the app is able to profile and converse with users in real time, providing relevant content adapted to the current sentiment and past preferences of the user.

The core of such chatbots is an intent recognition Natural Language Understanding (NLU) engine [9], which is trained with hard-coded examples of question variations. When no intent is matched with a confidence level above 30%, the chatbot returns a fallback answer (a minimal sketch of this fallback logic is given at the end of this section). The user sentiment is computed based on both the (explicit) user response and (implicit) environmental aspects, e.g. location (home, office, market, ...), temperature, lighting, time of day, weather, other family members present in the vicinity, and so on, to further adapt the chatbot response.

RL refers to a branch of Artificial Intelligence (AI) that is able to achieve complex goals by maximizing a reward function in real time. The reward function works similarly to incentivizing a child with candy and spankings: the algorithm is penalized when it takes a wrong decision and rewarded when it takes a right one; this is the reinforcement. The reinforcement aspect also allows the system to adapt faster to real-time changes in user sentiment. For a detailed introduction to RL frameworks, the interested reader is referred to [3].

Previous works have explored RL in the context of Recommender Systems [5, 7, 10], and enterprise adoption also seems to be gaining momentum with the recent availability of Cloud APIs (e.g. Azure Personalizer [2, 6]) and Google's RecSim [1]. However, these still work like typical Recommender Systems: given a user profile and categorized recommendations, the system makes a recommendation based on popularity, interests, demographics, frequency and other features. The main novelty of such systems is that they are able to identify the features (or combinations of features) of recommendations receiving higher rewards for a specific user, which can then be customized for that user to provide better recommendations. Unfortunately, this is still inefficient for real-life systems that need to adapt to continuously evolving features, e.g. user sentiment, and where the reward needs to be computed based on multiple and unreliable feedback channels (e.g., sensors, wearables).

The rest of the paper is organized as follows: Section 2 outlines the problem scenario and formulates it as an RL problem. In Section 3, we propose three RL constructs needed to overcome the above limitations: (i) weighted feedback channels, (ii) delayed rewards, and (iii) rewards boosting, which we believe are essential constructs for RL to be used in Recommender Systems. 'Delayed Rewards' in this context is different from the notion of Delayed RL [8], where rewards in the distant future are not considered as valuable as immediate rewards. In our notion of 'Delayed Rewards', a received reward is only applied after its consistency has been validated by a subsequent action. We provide implementation details in Section 4 on how to extend an RL powered Wellness app based on Azure Personalizer to accommodate the above constructs. Section 5 concludes the paper and provides some directions for future work.
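To make the intent-matching behaviour concrete, the following is a minimal Python sketch of the confidence-threshold fallback described above. The NLU engine interface (classify returning an intent and a confidence), the 0.30 threshold constant and the fallback message are illustrative assumptions, not the actual app code.

# Minimal sketch of the intent-confidence fallback (illustrative only).
# nlu_engine is assumed to expose classify(text) -> (intent, confidence).

FALLBACK_CONFIDENCE = 0.30  # intents below this confidence trigger a fallback answer

def respond(nlu_engine, user_utterance: str) -> str:
    intent, confidence = nlu_engine.classify(user_utterance)
    if confidence < FALLBACK_CONFIDENCE:
        # No intent matched with sufficient confidence: return a generic fallback.
        return "Sorry, I did not quite get that. Could you rephrase?"
    # Otherwise answer according to the matched intent (answer resolution omitted here).
    return f"[answer for intent: {intent}]"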
2 PROBLEM SCENARIO

In this section, we set the problem context and formulate it as a Reinforcement Learning problem.

2.1 Wellness App

The Wellness app supports both push based notifications, where personalized health, fitness, activity, etc. related recommendations are pushed to the user; and interactive chats, where the app reacts in response to a user query. We assume the existence of a knowledgebase KB of articles, pictures and videos, with the artifacts ranked according to their relevance to different user profiles / sentiments.

The Wellness app architecture is described in Fig. 1, which shows how the user response and environmental conditions are:

1. gathered using available sensors to compute the 'current' feedback, including environmental context (e.g. a webcam picture of the user can be used to infer the user sentiment towards a chatbot response / notification, the room lighting conditions, and other users present in the vicinity);
2. combined with the user conversation history to quantify the user sentiment curve and discount any sudden changes in sentiment due to unrelated factors;
3. aggregated into the reward value corresponding to the last chatbot response / app notification provided to the user.

[Figure 1. Wellness app architecture]

This reward value is then provided as feedback to the RL agent, to choose the next optimal chatbot response / app notification from the knowledgebase. It is worth noting here that capturing the user sentiment, especially the environmental aspects, requires a high degree of knowledge regarding the user context. As such, we need to perform this in a privacy preserving fashion. Suffice it to say here that appropriate privacy protections are provided by the 'Privacy' block in Fig. 1; further details are provided in [4].

2.2 RL Formulation

We formulate the RL Engine for the above scenario as follows (illustrated in Fig. 2):

• Action (a): An action a in this case corresponds to a KB article which is delivered to the user either as a push notification, in response to a user query, or as part of an ongoing conversation.
• Agent (A): the entity performing actions. In this case, the Agent is the app delivering actions to the users, where an action is selected based on its Policy.
• Policy (π): the strategy that the agent employs to select the next best action. Given a user profile U_p, (current) sentiment U_s, and query U_q, the Policy function computes the product of the article scores returned by the NLP and Recommendation Engines respectively, selecting the item with the highest score as the next best action:
  – The NLP Engine (NE) parses the query and outputs a score for each KB article, based on the "text similarity" of the article to the user query.
  – Similarly, the Recommendation Engine (RE) provides a score for each article based on the reward associated with that article, with respect to the user profile and sentiment.
  The Policy function π can be formalized as follows:

  \pi(U_p, U_s, U_q) = a \mid \max_a \big[ NE(a, U_q) \times RE(a, U_p, U_s) \big] \quad (1)

• Reward (r): the feedback by which we measure the success or failure of an agent's recommended action. The feedback can, for example, refer to the amount of time that a user spends reading a recommended article. We consider a 2-step reward function computation, where the feedback f_a received with respect to a recommended action is first mapped to a sentiment score, which is then mapped to a reward (see the sketch at the end of this subsection):

  r(a, f_a) = s(f_a) \quad (2)

  where r and s refer to the reward and sentiment functions, respectively. Once computed, the KB is updated with the computed reward / sentiment for the corresponding action.

[Figure 2. RL formulation]
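As a concrete illustration of the Policy and 2-step reward computation above, the following Python sketch selects the next best action as the argmax of the product of the NLP and Recommendation Engine scores (Eq. 1), and maps feedback to a reward via a sentiment function (Eq. 2). The function names (nlp_score, rec_score, sentiment) and the knowledgebase representation are illustrative assumptions.

# Illustrative sketch of the Policy (Eq. 1) and 2-step reward (Eq. 2).
# nlp_score, rec_score and sentiment are assumed scoring functions, not actual app code.

def policy(kb_articles, user_profile, user_sentiment, user_query, nlp_score, rec_score):
    # Next best action = article maximizing NE(a, U_q) * RE(a, U_p, U_s).
    return max(
        kb_articles,
        key=lambda a: nlp_score(a, user_query) * rec_score(a, user_profile, user_sentiment),
    )

def reward(action, feedback, sentiment):
    # 2-step computation: feedback -> sentiment score -> reward for the given action.
    return sentiment(feedback)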
3 RL REWARD AND POLICY EXTENSIONS

In this section, we show how the Reward and Policy functions are extended to accommodate the real-life challenges posed by our RL based Wellness app.

3.1 Weighted (Multiple) Feedback Channels

As described in Fig. 1, we consider multiple feedback channels, with feedback captured from user (edge) devices / sensors, e.g. a webcam, thermostat or smartwatch, or the camera, microphone and accelerometer embedded within the mobile device hosting the app. For instance, a webcam frame capturing the facial expression of the user and the heart rate provided by the user's smartwatch can be considered together with the user-provided text response "Thanks for the great suggestion" in computing the user sentiment towards a recommended action.

Let {f_{a_1}, f_{a_2}, ..., f_{a_n}} denote the feedback received for action a. Recall that s(f) denotes the user sentiment computed independently based on the respective sensory feedback f. The user sentiment computation can be considered as a classifier outputting a value between 1 and 10. The reward can then be computed as a weighted average of the sentiment scores, as denoted below:

r_a(\{f_{a_1}, f_{a_2}, \ldots, f_{a_n}\}) = \sum_{i=1}^{n} \big( w_i \times s(f_{a_i}) \big) \quad (3)

where the weights {w_1, w_2, ..., w_n} allow the system to harmonize the received feedback, as some feedback channels may suffer from low reliability. For instance, if f_i corresponds to a user-typed response and f_j corresponds to a webcam snapshot, then higher weightage is given to f_i. The reasoning here is that the user might be 'smiling' in the snapshot; however, the 'smile' may be due to their kid entering the room (also captured in the frame), and not necessarily a response to the received recommendation / action. At the same time, if the sentiment computed based on the user text response indicates that he/she is 'stressed', then we give higher weightage to the user's explicit (text response) feedback in this case.
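A minimal Python sketch of the weighted multi-channel reward (Eq. 3) is given below. The channel names, weights and per-channel sentiment classifiers are illustrative assumptions; in practice the weights would be tuned per channel reliability and, as discussed above, possibly adjusted per interaction.

# Illustrative weighted multi-channel reward (Eq. 3).
# Each channel maps its raw feedback to a sentiment score in [1, 10];
# channel weights reflect the assumed reliability of that channel.

def weighted_reward(feedback_by_channel, sentiment_by_channel, weights):
    """feedback_by_channel: dict channel -> raw feedback (text, image frame, ...)
    sentiment_by_channel: dict channel -> function(raw feedback) -> score in [1, 10]
    weights: dict channel -> weight, expected to sum to 1.0"""
    return sum(
        weights[ch] * sentiment_by_channel[ch](fb)
        for ch, fb in feedback_by_channel.items()
    )

# Hypothetical usage, favouring the explicit text response over the webcam snapshot:
# weighted_reward(
#     {"text": "Thanks for the great suggestion", "webcam": frame},
#     {"text": text_sentiment, "webcam": face_sentiment},
#     {"text": 0.7, "webcam": 0.3},
# )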
3.2 Delayed Rewards

A 'delayed rewards' strategy is applied in the case of reward inconsistency, where the current (computed) reward is 'negative' for an action to which the user has been known to react positively in the past, or vice versa. For instance, let us consider that the user sentiment is low for a recommendation of category 'Shopping', to which the user has been known to react very positively (to other 'Shopping' related recommendations) in the past. Given such an inconsistency, the delayed rewards strategy buffers the computed reward r_{a_t} for action a_t at time t, and provides an indication to the RL Agent-Policy (π) to try another recommendation of the same type ('Shopping'), to validate the user sentiment, before updating the rewards for both a_t and a_{t+1} at time t + 1.

To accommodate the 'delayed rewards' strategy, the rewards function is extended with a memory buffer that allows the rewards of the last m actions [a_{t+m}, a_{t+m-1}, ..., a_t] to be aggregated and applied retroactively at time (t + m). The delayed rewards function dr is denoted as follows:

dr_{a_{t_i} \in \{a_t, a_{t+1}, \ldots, a_{t+m}\}} \mid (t+m) = \sum_{i=0}^{m} \big( w_i \times r_{a_{t_i}} \big) \quad (4)

where the conditioning on (t + m) implies that the rewards for the actions [a_{t+m}, a_{t+m-1}, ..., a_t], although computed individually, can only be applied at time (t + m). As before, the respective weights w_i allow us to harmonize the effect of inconsistent feedback, where the reward for an action a_{t_i} is applied based on the reward computed for a later action a_{(t+1)_i}.

To effectively enforce the 'delayed rewards' strategy, the Policy π is also extended to recommend an action of the same type as the previously recommended action if the delay flag d is set (d = 1). The 'delayed' Policy π_{dt} is outlined below:

\pi_{dt}(U_p, U_s, U_q) =
\begin{cases}
a_t \mid r_{a_t} \approx r_{a_{t-1}}, & d = 1 \\
a \mid \max_a \big[ NE(a, U_q) \times RE(a, U_p, U_s) \big], & d = 0
\end{cases} \quad (5)

The RL formulation extended with the delayed reward / policy is illustrated in Fig. 3.

[Figure 3. Delayed Reward - Policy based RL formulation]
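The following Python sketch illustrates one way the buffering behaviour behind Eq. 4 and Eq. 5 could be realised: rewards are held in a memory buffer while the delay flag is set, and only released (as a weighted aggregate) once a follow-up reward has validated them. The buffer structure, the consistency test and the uniform weights are illustrative assumptions rather than the paper's exact implementation.

# Illustrative delayed-rewards buffer (Eq. 4) with a delay flag driving the policy (Eq. 5).

class DelayedRewardBuffer:
    def __init__(self, consistency_tol=2.0):
        self.buffer = []              # (action, reward) pairs awaiting validation
        self.consistency_tol = consistency_tol
        self.delay_flag = False       # d = 1 while validation is pending

    def observe(self, action, reward, expected_reward):
        """Buffer the reward if it is inconsistent with the user's known behaviour;
        otherwise (or once validated) release the aggregated rewards."""
        inconsistent = abs(reward - expected_reward) > self.consistency_tol
        if inconsistent and not self.delay_flag:
            # First inconsistent reward: hold it back and ask the policy to retry
            # an action of the same type (delay flag d = 1).
            self.buffer.append((action, reward))
            self.delay_flag = True
            return None
        # Validation step reached: aggregate buffered rewards with the new one.
        self.buffer.append((action, reward))
        weights = [1.0 / len(self.buffer)] * len(self.buffer)
        released = [(a, w * r) for (a, r), w in zip(self.buffer, weights)]
        self.buffer, self.delay_flag = [], False
        return released   # rewards for all buffered actions, applied retroactively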
3.3 Rewards Boosting

Rewards boosting, or rather rewards normalization, applies mainly to continuous chat interactions. In such cases, if the user sentiment for a recommended action is 'negative', it might not be the fault of the last action only. It is possible that the conversation sentiment was already degrading, and the last recommended action is simply following the downward trend. On the other hand, given a worsening conversation sentiment, a 'positive' sentiment for a recommended action implies that it had a very positive impact on the user, and hence its corresponding reward should be boosted.

For example, let us consider a ranking of the user sentiments:

Disgusted → Angry → Sad → Confused → Calm → Happy

Given this, a change from 'Disgusted' to 'Happy' would lead to a much higher (positive) boost than a (negative) change from 'Confused' to 'Sad'. The boosted reward rb_{a_t} for an action a_t at time t is computed as follows:

rb_{a_t} = \frac{1}{2} \big( r_{a_t} - rb_{a_{t-1}} \big) \times r_{a_t} \quad (6)

It is easy to see that a 'positive' r_{a_t} = 7 following a 'negative' rb_{a_{t-1}} = -5 will lead to r_{a_t} getting boosted by a factor of (1/2)(7 - (-5)) = 6. Along the same lines, a 'negative' r_{a_t} = -6 following a 'positive' rb_{a_{t-1}} = 4 will lead to r_{a_t} getting further degraded by a factor of (1/2)(-6 - 4) = -5.

We leave it as future work to extend the 'boost' function to the last n actions (instead of just the last action above). In this extended scenario, the system maintains a sentiment curve of the last n actions, and the deviation is computed with respect to a curve, instead of a discrete value. The expected benefit is that this should allow the system to react better to user sentiment trends.
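A minimal Python sketch of the boosting rule (Eq. 6), together with the worked examples above, is given below; it assumes the raw and boosted rewards are already available as numeric values (e.g. mapped from the sentiment ranking).

# Illustrative rewards boosting (Eq. 6): the boost factor is half the jump
# from the previous boosted reward to the current raw reward.

def boosted_reward(current_reward, previous_boosted_reward):
    boost_factor = 0.5 * (current_reward - previous_boosted_reward)
    return boost_factor * current_reward

# Worked examples from the text:
assert 0.5 * (7 - (-5)) == 6    # positive r_t = 7 after negative rb_{t-1} = -5 -> factor 6
assert 0.5 * (-6 - 4) == -5     # negative r_t = -6 after positive rb_{t-1} = 4 -> factor -5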
4 IMPLEMENTATION

In this section, we describe how we extended our RL powered Wellness app, based on Azure Personalizer, to accommodate the constructs outlined in the previous section. Azure Personalizer [2] is a Cloud based API providing an implementation of RL Contextual Bandits [6]. In short, Personalizer provides two primary APIs:

• Rank API: The mobile app invokes the Rank API with a list of actions and their features, and the user context and features. Given this, the Rank API returns a list of ranked actions. Internally, Personalizer uses the Explore-Exploit trade-off to rank the actions:
  – Exploit: ranks the actions based on past data (the current inference model).
  – Explore: selects a different action instead of the top action. The 'explore' percentage is a configurable parameter, and can be set along the lines of an epsilon-greedy strategy.
• Rewards API: The mobile app presents the ranked content (returned by the Rank API) and computes the reward corresponding to each action. It then invokes the Reward API to return the computed rewards to Personalizer. Personalizer correlates the action-reward pairs, updating its inference model.

We now provide details of our Wellness Recommender app (illustrated in Fig. 4). In addition to the usual article / activity recommendations, the app provides the option of a video chat, to improve the 'interactive quotient' of the app. The implementation details of the RL constructs proposed in this paper are outlined below; an illustrative sketch of the client-side rewards computation module follows this list:

[Figure 4. Wellness Recommender App screenshots]

• Multiple feedback channels: correspond in this case to the live video feed and the user's interaction with an article / activity. Given a user snapshot (captured from the live feed), the sentiment score is computed using the Azure Face API. The article / activity recommended by the app depends on both the article / activity relevance score and the 'current' user sentiment; e.g. tragic, sad, depressing, etc. related articles / activities are not shown unless the user is in a 'happy' mood (Fig. 4).
  Personalizer leaves the reward computation on the client side. We developed a Rewards Computation module (with reference to Eq. 3) that combines (i) the activity / article related score, i.e. the activity / app selected and the time spent interacting with it, together with (ii) the sentiment score computed based on the user snapshot, with a higher weightage assigned to the latter given its effectiveness in capturing the user reaction to a recommended article / activity.
• Rewards boosting: To accommodate this, we consider a ranking of the user sentiments returned by the Face API: Disgusted → Angry → Sad → Confused → Calm → Happy. The rewards boosting factor (Eq. 6) is assigned proportional to the change in user sentiment before and after displaying the recommended activity / article.
• Delayed rewards: is a core RL construct that requires updating the backend Recommendation Engine and the RL Reward and Policy functions (Eq. 4 and Eq. 5). As such, it is difficult to implement based on a Cloud API (without having direct access to the underlying Recommendation and RL Engines). We are in the process of adapting an open source Contextual Bandits implementation to provide the full 'delayed rewards' strategy.
  For now, we have only simulated the behavior of the RL Reward function (Eq. 4) on the client (app) side. We added a memory buffer to the Rewards Computation module to store the rewards computed for an iteration, until a 'similar' reward gets computed for a set of activities / articles with similar features recommended as part of another iteration. At this point, the aggregate rewards are returned to Personalizer (via the Reward API) for the recommended activities / articles of both iterations. This ensures that only consistent rewards are considered while training the Personalizer RL inference model.
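The sketch below illustrates how such a client-side Rewards Computation module could look: it combines the activity score with the snapshot sentiment (in the spirit of Eq. 3) and buffers rewards until a 'similar' one is observed, before handing them on for submission. The weights, the similarity test and the send_rewards callback are illustrative assumptions; this is not the Azure Personalizer SDK, whose actual client calls are documented separately.

# Illustrative client-side Rewards Computation module (not the actual app code).
# send_rewards is assumed to be a caller-supplied function that forwards
# (event_id, reward) pairs to Personalizer's Reward API.

class RewardsComputation:
    def __init__(self, send_rewards, activity_weight=0.4, sentiment_weight=0.6,
                 similarity_tol=1.5):
        self.send_rewards = send_rewards
        self.w_activity, self.w_sentiment = activity_weight, sentiment_weight
        self.similarity_tol = similarity_tol
        self.pending = []   # buffered (event_id, reward) awaiting a consistent follow-up

    def compute(self, activity_score, snapshot_sentiment):
        # Weighted combination in the spirit of Eq. 3, favouring the snapshot sentiment.
        return self.w_activity * activity_score + self.w_sentiment * snapshot_sentiment

    def submit(self, event_id, activity_score, snapshot_sentiment):
        reward = self.compute(activity_score, snapshot_sentiment)
        if self.pending and abs(reward - self.pending[-1][1]) <= self.similarity_tol:
            # A 'similar' reward was observed: flush the rewards of both iterations.
            for buffered in self.pending + [(event_id, reward)]:
                self.send_rewards(*buffered)
            self.pending = []
        else:
            self.pending.append((event_id, reward))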
5 CONCLUSION

In this work, we considered the implementation of an RL based Recommender System in the context of a real-life Wellness app. RL is a powerful primitive for such problems, as it allows the app to learn and adapt to user preferences / sentiment in real time. However, during the case study, we realized that current RL frameworks lack certain constructs needed for them to be applied to such Recommender Systems. To overcome this limitation, we introduced three RL constructs that we implemented for our Wellness app: (i) weighted feedback channels, (ii) delayed rewards, and (iii) rewards boosting. The proposed RL constructs are fundamental in nature, as they impact the interplay between the Reward and Policy functions, and we hope that their addition to existing RL frameworks will lead to increased enterprise adoption.

ACKNOWLEDGEMENTS

I would like to thank Louis Beck and Sami Ben Hassan for their insights and support in developing the Wellness Recommender App.

REFERENCES

[1] Google RecSim, 2020 (accessed December 9, 2020). https://opensource.google/projects/recsim.
[2] Microsoft Azure Personalizer, 2020 (accessed December 9, 2020). https://azure.microsoft.com/en-us/services/cognitive-services/personalizer/.
[3] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 2018.
[4] D. Biswas, 'Privacy Preserving Chatbot Conversations', in Proceedings of the 3rd IEEE Conference on Artificial Intelligence and Knowledge Engineering (AIKE), 2020.
[5] S. Choi, H. Ha, U. Hwang, C. Kim, J. Ha, and S. Yoon, Reinforcement Learning based Recommender System using Biclustering Technique, 2018. arXiv:1801.05532.
[6] L. Li, W. Chu, J. Langford, and R. E. Schapire, 'A Contextual-Bandit Approach to Personalized News Article Recommendation', in Proceedings of the 19th International Conference on World Wide Web (WWW), pp. 661-670, 2010.
[7] F. Liu, R. Tang, X. Li, Y. Ye, H. Chen, H. Guo, and Y. Zhang, Deep Reinforcement Learning based Recommendation with Explicit User-Item Interactions Modeling, 2019.
[8] N. J. Nilsson, Delayed-Reinforcement Learning, 2020 (accessed December 9, 2020). http://heim.ifi.uio.no/~mes/inf1400/COOL/REF/Standford/ch11.pdf.
[9] E. Ricciardelli and D. Biswas, 'Self-improving Chatbots based on Reinforcement Learning', in Proceedings of the 4th Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM), 2019.
[10] N. Taghipour, A. Kardan, and S. S. Ghidary, 'Usage-Based Web Recommendations: A Reinforcement Learning Approach', in Proceedings of the ACM Conference on Recommender Systems (RecSys), pp. 113-120, 2007.