Delayed Rewards in the context of Reinforcement Learning based Recommender Systems

Debmalya Biswas 1

1 Philip Morris Products S. A., Lausanne, Switzerland, email: debmalya.biswas@pmi.com

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. Advances in Artificial Intelligence for Healthcare, September 4, 2020, Virtual Workshop.

Abstract. We present a Reinforcement Learning (RL) based approach to implement Recommender Systems. The results are based on a real-life Wellness app that is able to provide personalized, health-related content to users in an interactive fashion. Unfortunately, current recommender systems are unable to adapt to continuously evolving features, e.g. user sentiment, and to scenarios where the RL reward needs to be computed based on multiple and unreliable feedback channels (e.g., sensors, wearables). To overcome this, we propose three constructs: (i) weighted feedback channels, (ii) delayed rewards, and (iii) rewards boosting, which we believe are essential for RL to be used in Recommender Systems. Finally, we also provide implementation details on how the Wellness app, based on Azure Personalizer, was extended to accommodate the above RL constructs.

1 INTRODUCTION

Wellness apps have historically suffered from low adoption rates. Personalized recommendations have the potential to improve adoption by making increasingly relevant and timely recommendations to users. While recommendation engines (and, consequently, the apps based on them) have grown in maturity, they still suffer from the 'cold start' problem, and from the fact that they are basically push-based mechanisms lacking the level of interactivity needed to make such apps appealing to millennials.

We present a Wellness app case study in which we applied a combination of Reinforcement Learning (RL) and Natural Language Processing (NLP)/chatbots to provide a highly personalized and interactive experience to users. We focus on the interactive aspect of the app, where the app is able to profile and converse with users in real time, providing relevant content adapted to the current sentiment and past preferences of the user.

The core of such chatbots is an intent recognition Natural Language Understanding (NLU) engine [9], which is trained with hard-coded examples of question variations. When no intent is matched with a confidence level above 30%, the chatbot returns a fallback answer (a minimal sketch of this fallback logic is given at the end of this section). The user sentiment is computed based on both the (explicit) user response and (implicit) environmental aspects, e.g. location (home, office, market, ...), temperature, lighting, time of day, weather, other family members present in the vicinity, and so on, to further adapt the chatbot response.

RL refers to a branch of Artificial Intelligence (AI) that is able to achieve complex goals by maximizing a reward function in real time. The reward function works similarly to incentivizing a child with candy and spankings: the algorithm is penalized when it takes a wrong decision and rewarded when it takes a right one; this is the reinforcement. The reinforcement aspect also allows the system to adapt faster to real-time changes in user sentiment. For a detailed introduction to RL frameworks, the interested reader is referred to [3].

Previous works have explored RL in the context of Recommender Systems [5, 7, 10], and enterprise adoption also seems to be gaining momentum with the recent availability of Cloud APIs (e.g. Azure Personalizer [2, 6]) and Google's RecSim [1]. However, these still work like typical Recommender Systems: given a user profile and categorized recommendations, the system makes a recommendation based on popularity, interests, demographics, frequency and other features. The main novelty of such systems is that they are able to identify the features (or combinations of features) of recommendations receiving higher rewards for a specific user, which can then be customized for that user to provide better recommendations. Unfortunately, this is still inefficient for real-life systems that need to adapt to continuously evolving features, e.g. user sentiment, and where the reward needs to be computed based on multiple and unreliable feedback channels (e.g., sensors, wearables).

The rest of the paper is organized as follows: Section 2 outlines the problem scenario and formulates it as an RL problem. In Section 3, we propose three RL constructs needed to overcome the above limitations: (i) weighted feedback channels, (ii) delayed rewards, and (iii) rewards boosting, which we believe are essential constructs for RL to be used in Recommender Systems. 'Delayed Rewards' in this context is different from the notion of Delayed RL [8], where rewards in the distant future are not considered as valuable as immediate rewards. In our notion of 'Delayed Rewards', a received reward is only applied after its consistency has been validated by a subsequent action. We provide implementation details in Section 4 on how to extend an RL powered Wellness app based on Azure Personalizer to accommodate the above constructs. Section 5 concludes the paper and provides some directions for future work.
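To make the intent-matching behaviour concrete, the following is a minimal Python sketch of the confidence-threshold fallback described above. The NLU engine interface (classify returning an intent and a confidence), the 0.30 threshold constant and the fallback message are illustrative assumptions, not the actual app code.

# Minimal sketch of the intent-confidence fallback (illustrative only).
# nlu_engine is assumed to expose classify(text) -> (intent, confidence).

FALLBACK_CONFIDENCE = 0.30  # intents below this confidence trigger a fallback answer

def respond(nlu_engine, user_utterance: str) -> str:
    intent, confidence = nlu_engine.classify(user_utterance)
    if confidence < FALLBACK_CONFIDENCE:
        # No intent matched with sufficient confidence: return a generic fallback.
        return "Sorry, I did not quite get that. Could you rephrase?"
    # Otherwise answer according to the matched intent (answer resolution omitted here).
    return f"[answer for intent: {intent}]"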
2 PROBLEM SCENARIO

In this section, we set the problem context and formulate it as a Reinforcement Learning problem.

2.1 Wellness App

The Wellness app supports both push based notifications, where personalized health, fitness, activity, etc. related recommendations are pushed to the user; and interactive chats, where the app reacts in response to a user query. We assume the existence of a knowledgebase KB of articles, pictures and videos, with the artifacts ranked according to their relevance to different user profiles / sentiments.

The Wellness app architecture is described in Fig. 1, which shows how the user response and environmental conditions are:

1. gathered using available sensors to compute the 'current' feedback, including environmental context (e.g. a webcam picture of the user can be used to infer the user sentiment towards a chatbot response / notification, the room lighting conditions, and other users present in the vicinity);
2. combined with the user conversation history to quantify the user sentiment curve and discount any sudden changes in sentiment due to unrelated factors;
3. aggregated into the reward value corresponding to the last chatbot response / app notification provided to the user.

[Figure 1. Wellness app architecture]

This reward value is then provided as feedback to the RL agent, to choose the next optimal chatbot response / app notification from the knowledgebase. It is worth noting here that capturing the user sentiment, especially the environmental aspects, requires a high degree of knowledge regarding the user context. As such, we need to perform this in a privacy preserving fashion. Suffice it to say here that appropriate privacy protections are provided by the 'Privacy' block in Fig. 1; further details are provided in [4].

2.2 RL Formulation

We formulate the RL Engine for the above scenario as follows (illustrated in Fig. 2):

• Action (a): An action a in this case corresponds to a KB article which is delivered to the user either as a push notification, in response to a user query, or as part of an ongoing conversation.
• Agent (A): the entity performing actions. In this case, the Agent is the app delivering actions to the users, where an action is selected based on its Policy.
• Policy (π): the strategy that the agent employs to select the next best action. Given a user profile U_p, (current) sentiment U_s, and query U_q, the Policy function computes the product of the article scores returned by the NLP and Recommendation Engines respectively, selecting the item with the highest score as the next best action:
  – The NLP Engine (NE) parses the query and outputs a score for each KB article, based on the "text similarity" of the article to the user query.
  – Similarly, the Recommendation Engine (RE) provides a score for each article based on the reward associated with that article, with respect to the user profile and sentiment.
  The Policy function π can be formalized as follows:

  \pi(U_p, U_s, U_q) = a \mid \max_a \big[ NE(a, U_q) \times RE(a, U_p, U_s) \big] \quad (1)

• Reward (r): the feedback by which we measure the success or failure of an agent's recommended action. The feedback can, for example, refer to the amount of time that a user spends reading a recommended article. We consider a 2-step reward function computation, where the feedback f_a received with respect to a recommended action is first mapped to a sentiment score, which is then mapped to a reward (see the sketch at the end of this subsection):

  r(a, f_a) = s(f_a) \quad (2)

  where r and s refer to the reward and sentiment functions, respectively. Once computed, the KB is updated with the computed reward / sentiment for the corresponding action.

[Figure 2. RL formulation]
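As a concrete illustration of the Policy and 2-step reward computation above, the following Python sketch selects the next best action as the argmax of the product of the NLP and Recommendation Engine scores (Eq. 1), and maps feedback to a reward via a sentiment function (Eq. 2). The function names (nlp_score, rec_score, sentiment) and the knowledgebase representation are illustrative assumptions.

# Illustrative sketch of the Policy (Eq. 1) and 2-step reward (Eq. 2).
# nlp_score, rec_score and sentiment are assumed scoring functions, not actual app code.

def policy(kb_articles, user_profile, user_sentiment, user_query, nlp_score, rec_score):
    # Next best action = article maximizing NE(a, U_q) * RE(a, U_p, U_s).
    return max(
        kb_articles,
        key=lambda a: nlp_score(a, user_query) * rec_score(a, user_profile, user_sentiment),
    )

def reward(action, feedback, sentiment):
    # 2-step computation: feedback -> sentiment score -> reward for the given action.
    return sentiment(feedback)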
3 RL REWARD AND POLICY EXTENSIONS

In this section, we show how the Reward and Policy functions are extended to accommodate the real-life challenges posed by our RL based Wellness app.

3.1 Weighted (Multiple) Feedback Channels

As described in Fig. 1, we consider multiple feedback channels, with feedback captured from user (edge) devices / sensors, e.g. a webcam, thermostat or smartwatch, or the camera, microphone and accelerometer embedded within the mobile device hosting the app. For instance, a webcam frame capturing the facial expression of the user and the heart rate provided by the user's smartwatch can be considered together with the user-provided text response "Thanks for the great suggestion" in computing the user sentiment towards a recommended action.

Let {f_{a_1}, f_{a_2}, ..., f_{a_n}} denote the feedback received for action a. Recall that s(f) denotes the user sentiment computed independently based on the respective sensory feedback f. The user sentiment computation can be considered as a classifier outputting a value between 1 and 10. The reward can then be computed as a weighted average of the sentiment scores, as denoted below:

r_a(\{f_{a_1}, f_{a_2}, \ldots, f_{a_n}\}) = \sum_{i=1}^{n} \big( w_i \times s(f_{a_i}) \big) \quad (3)

where the weights {w_1, w_2, ..., w_n} allow the system to harmonize the received feedback, as some feedback channels may suffer from low reliability. For instance, if f_i corresponds to a user-typed response and f_j corresponds to a webcam snapshot, then higher weightage is given to f_i. The reasoning here is that the user might be 'smiling' in the snapshot; however, the 'smile' may be due to their kid entering the room (also captured in the frame), and not necessarily a response to the received recommendation / action. At the same time, if the sentiment computed based on the user text response indicates that he/she is 'stressed', then we give higher weightage to the user's explicit (text response) feedback in this case.
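A minimal Python sketch of the weighted multi-channel reward (Eq. 3) is given below. The channel names, weights and per-channel sentiment classifiers are illustrative assumptions; in practice the weights would be tuned per channel reliability and, as discussed above, possibly adjusted per interaction.

# Illustrative weighted multi-channel reward (Eq. 3).
# Each channel maps its raw feedback to a sentiment score in [1, 10];
# channel weights reflect the assumed reliability of that channel.

def weighted_reward(feedback_by_channel, sentiment_by_channel, weights):
    """feedback_by_channel: dict channel -> raw feedback (text, image frame, ...)
    sentiment_by_channel: dict channel -> function(raw feedback) -> score in [1, 10]
    weights: dict channel -> weight, expected to sum to 1.0"""
    return sum(
        weights[ch] * sentiment_by_channel[ch](fb)
        for ch, fb in feedback_by_channel.items()
    )

# Hypothetical usage, favouring the explicit text response over the webcam snapshot:
# weighted_reward(
#     {"text": "Thanks for the great suggestion", "webcam": frame},
#     {"text": text_sentiment, "webcam": face_sentiment},
#     {"text": 0.7, "webcam": 0.3},
# )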
3.2 Delayed Rewards

A 'delayed rewards' strategy is applied in the case of reward inconsistency, where the current (computed) reward is 'negative' for an action to which the user has been known to react positively in the past, or vice versa. For instance, let us consider that the user sentiment is low for a recommendation of category 'Shopping', to which the user has been known to react very positively (to other 'Shopping' related recommendations) in the past. Given such an inconsistency, the delayed rewards strategy buffers the computed reward r_{a_t} for action a_t at time t, and provides an indication to the RL Agent-Policy (π) to try another recommendation of the same type ('Shopping'), to validate the user sentiment, before updating the rewards for both a_t and a_{t+1} at time t + 1.

To accommodate the 'delayed rewards' strategy, the rewards function is extended with a memory buffer that allows the rewards of the last m actions [a_{t+m}, a_{t+m-1}, ..., a_t] to be aggregated and applied retroactively at time (t + m). The delayed rewards function dr is denoted as follows:

dr_{a_{t_i} \in \{a_t, a_{t+1}, \ldots, a_{t+m}\}} \mid (t+m) = \sum_{i=0}^{m} \big( w_i \times r_{a_{t_i}} \big) \quad (4)

where the conditioning on (t + m) implies that the rewards for the actions [a_{t+m}, a_{t+m-1}, ..., a_t], although computed individually, can only be applied at time (t + m). As before, the respective weights w_i allow us to harmonize the effect of inconsistent feedback, where the reward for an action a_{t_i} is applied based on the reward computed for a later action a_{(t+1)_i}.

To effectively enforce the 'delayed rewards' strategy, the Policy π is also extended to recommend an action of the same type as the previously recommended action if the delay flag d is set (d = 1). The 'delayed' Policy π_{dt} is outlined below:

\pi_{dt}(U_p, U_s, U_q) =
\begin{cases}
a_t \mid r_{a_t} \approx r_{a_{t-1}}, & d = 1 \\
a \mid \max_a \big[ NE(a, U_q) \times RE(a, U_p, U_s) \big], & d = 0
\end{cases} \quad (5)

The RL formulation extended with the delayed reward / policy is illustrated in Fig. 3.

[Figure 3. Delayed Reward - Policy based RL formulation]
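The following Python sketch illustrates one way the buffering behaviour behind Eq. 4 and Eq. 5 could be realised: rewards are held in a memory buffer while the delay flag is set, and only released (as a weighted aggregate) once a follow-up reward has validated them. The buffer structure, the consistency test and the uniform weights are illustrative assumptions rather than the paper's exact implementation.

# Illustrative delayed-rewards buffer (Eq. 4) with a delay flag driving the policy (Eq. 5).

class DelayedRewardBuffer:
    def __init__(self, consistency_tol=2.0):
        self.buffer = []              # (action, reward) pairs awaiting validation
        self.consistency_tol = consistency_tol
        self.delay_flag = False       # d = 1 while validation is pending

    def observe(self, action, reward, expected_reward):
        """Buffer the reward if it is inconsistent with the user's known behaviour;
        otherwise (or once validated) release the aggregated rewards."""
        inconsistent = abs(reward - expected_reward) > self.consistency_tol
        if inconsistent and not self.delay_flag:
            # First inconsistent reward: hold it back and ask the policy to retry
            # an action of the same type (delay flag d = 1).
            self.buffer.append((action, reward))
            self.delay_flag = True
            return None
        # Validation step reached: aggregate buffered rewards with the new one.
        self.buffer.append((action, reward))
        weights = [1.0 / len(self.buffer)] * len(self.buffer)
        released = [(a, w * r) for (a, r), w in zip(self.buffer, weights)]
        self.buffer, self.delay_flag = [], False
        return released   # rewards for all buffered actions, applied retroactively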
3.3 Rewards Boosting

Rewards boosting, or rather rewards normalization, applies mainly to continuous chat interactions. In such cases, if the user sentiment for a recommended action is 'negative', it might not be the fault of the last action only. It is possible that the conversation sentiment was already degrading, and the last recommended action is simply following the downward trend. On the other hand, given a worsening conversation sentiment, a 'positive' sentiment for a recommended action implies that it had a very positive impact on the user, and hence its corresponding reward should be boosted.

For example, let us consider a ranking of the user sentiments:

Disgusted → Angry → Sad → Confused → Calm → Happy

Given this, a change from 'Disgusted' to 'Happy' would lead to a much higher (positive) boost than a (negative) change from 'Confused' to 'Sad'. The boosted reward rb_{a_t} for an action a_t at time t is computed as follows:

rb_{a_t} = \frac{1}{2} \big( r_{a_t} - rb_{a_{t-1}} \big) \times r_{a_t} \quad (6)

It is easy to see that a 'positive' r_{a_t} = 7 following a 'negative' rb_{a_{t-1}} = -5 will lead to r_{a_t} getting boosted by a factor of (1/2)(7 - (-5)) = 6. Along the same lines, a 'negative' r_{a_t} = -6 following a 'positive' rb_{a_{t-1}} = 4 will lead to r_{a_t} getting further degraded by a factor of (1/2)(-6 - 4) = -5.

We leave it as future work to extend the 'boost' function to the last n actions (instead of just the last action above). In this extended scenario, the system maintains a sentiment curve of the last n actions, and the deviation is computed with respect to a curve, instead of a discrete value. The expected benefit is that this should allow the system to react better to user sentiment trends.
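A minimal Python sketch of the boosting rule (Eq. 6), together with the worked examples above, is given below; it assumes the raw and boosted rewards are already available as numeric values (e.g. mapped from the sentiment ranking).

# Illustrative rewards boosting (Eq. 6): the boost factor is half the jump
# from the previous boosted reward to the current raw reward.

def boosted_reward(current_reward, previous_boosted_reward):
    boost_factor = 0.5 * (current_reward - previous_boosted_reward)
    return boost_factor * current_reward

# Worked examples from the text:
assert 0.5 * (7 - (-5)) == 6    # positive r_t = 7 after negative rb_{t-1} = -5 -> factor 6
assert 0.5 * (-6 - 4) == -5     # negative r_t = -6 after positive rb_{t-1} = 4 -> factor -5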
4 IMPLEMENTATION

In this section, we describe how we extended our RL powered Wellness app, based on Azure Personalizer, to accommodate the constructs outlined in the previous section. Azure Personalizer [2] is a Cloud based API providing an implementation of RL Contextual Bandits [6]. In short, Personalizer provides two primary APIs:

• Rank API: The mobile app invokes the Rank API with a list of actions and their features, and the user context and features. Given this, the Rank API returns a list of ranked actions. Internally, Personalizer uses the Explore-Exploit trade-off to rank the actions:
  – Exploit: ranks the actions based on past data (the current inference model).
  – Explore: selects a different action instead of the top action. The 'explore' percentage is a configurable parameter, and can be set along the lines of an epsilon-greedy strategy.
• Rewards API: The mobile app presents the ranked content (returned by the Rank API) and computes the reward corresponding to each action. It then invokes the Reward API to return the computed rewards to Personalizer. Personalizer correlates the action-reward pairs, updating its inference model.

We now provide details of our Wellness Recommender app (illustrated in Fig. 4). In addition to the usual article / activity recommendations, the app provides the option of a video chat, to improve the 'interactive quotient' of the app. The implementation details of the RL constructs proposed in this paper are outlined below; an illustrative sketch of the client-side rewards computation module follows this list:

[Figure 4. Wellness Recommender App screenshots]

• Multiple feedback channels: correspond in this case to the live video feed and the user's interaction with an article / activity. Given a user snapshot (captured from the live feed), the sentiment score is computed using the Azure Face API. The article / activity recommended by the app depends on both the article / activity relevance score and the 'current' user sentiment; e.g. tragic, sad, depressing, etc. related articles / activities are not shown unless the user is in a 'happy' mood (Fig. 4).
  Personalizer leaves the reward computation on the client side. We developed a Rewards Computation module (with reference to Eq. 3) that combines (i) the activity / article related score, i.e. the activity / app selected and the time spent interacting with it, together with (ii) the sentiment score computed based on the user snapshot, with a higher weightage assigned to the latter given its effectiveness in capturing the user reaction to a recommended article / activity.
• Rewards boosting: To accommodate this, we consider a ranking of the user sentiments returned by the Face API: Disgusted → Angry → Sad → Confused → Calm → Happy. The rewards boosting factor (Eq. 6) is assigned proportional to the change in user sentiment before and after displaying the recommended activity / article.
• Delayed rewards: is a core RL construct that requires updating the backend Recommendation Engine and the RL Reward and Policy functions (Eq. 4 and Eq. 5). As such, it is difficult to implement based on a Cloud API (without having direct access to the underlying Recommendation and RL Engines). We are in the process of adapting an open source Contextual Bandits implementation to provide the full 'delayed rewards' strategy.
  For now, we have only simulated the behavior of the RL Reward function (Eq. 4) on the client (app) side. We added a memory buffer to the Rewards Computation module to store the rewards computed for an iteration, until a 'similar' reward gets computed for a set of activities / articles with similar features recommended as part of another iteration. At this point, the aggregate rewards are returned to Personalizer (via the Reward API) for the recommended activities / articles of both iterations. This ensures that only consistent rewards are considered while training the Personalizer RL inference model.
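The sketch below illustrates how such a client-side Rewards Computation module could look: it combines the activity score with the snapshot sentiment (in the spirit of Eq. 3) and buffers rewards until a 'similar' one is observed, before handing them on for submission. The weights, the similarity test and the send_rewards callback are illustrative assumptions; this is not the Azure Personalizer SDK, whose actual client calls are documented separately.

# Illustrative client-side Rewards Computation module (not the actual app code).
# send_rewards is assumed to be a caller-supplied function that forwards
# (event_id, reward) pairs to Personalizer's Reward API.

class RewardsComputation:
    def __init__(self, send_rewards, activity_weight=0.4, sentiment_weight=0.6,
                 similarity_tol=1.5):
        self.send_rewards = send_rewards
        self.w_activity, self.w_sentiment = activity_weight, sentiment_weight
        self.similarity_tol = similarity_tol
        self.pending = []   # buffered (event_id, reward) awaiting a consistent follow-up

    def compute(self, activity_score, snapshot_sentiment):
        # Weighted combination in the spirit of Eq. 3, favouring the snapshot sentiment.
        return self.w_activity * activity_score + self.w_sentiment * snapshot_sentiment

    def submit(self, event_id, activity_score, snapshot_sentiment):
        reward = self.compute(activity_score, snapshot_sentiment)
        if self.pending and abs(reward - self.pending[-1][1]) <= self.similarity_tol:
            # A 'similar' reward was observed: flush the rewards of both iterations.
            for buffered in self.pending + [(event_id, reward)]:
                self.send_rewards(*buffered)
            self.pending = []
        else:
            self.pending.append((event_id, reward))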
5 CONCLUSION

In this work, we considered the implementation of an RL based Recommender System in the context of a real-life Wellness app. RL is a powerful primitive for such problems, as it allows the app to learn and adapt to user preferences / sentiment in real time. However, during the case study, we realized that current RL frameworks lack certain constructs needed for them to be applied to such Recommender Systems. To overcome this limitation, we introduced three RL constructs that we implemented for our Wellness app: (i) weighted feedback channels, (ii) delayed rewards, and (iii) rewards boosting. The proposed RL constructs are fundamental in nature, as they impact the interplay between the Reward and Policy functions, and we hope that their addition to existing RL frameworks will lead to increased enterprise adoption.

ACKNOWLEDGEMENTS

I would like to thank Louis Beck and Sami Ben Hassan for their insights and support in developing the Wellness Recommender App.

REFERENCES

[1] Google RecSim, 2020 (accessed December 9, 2020). https://opensource.google/projects/recsim.
[2] Microsoft Azure Personalizer, 2020 (accessed December 9, 2020). https://azure.microsoft.com/en-us/services/cognitive-services/personalizer/.
[3] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 2018.
[4] D. Biswas, 'Privacy Preserving Chatbot Conversations', in Proceedings of the 3rd IEEE Conference on Artificial Intelligence and Knowledge Engineering (AIKE), 2020.
[5] S. Choi, H. Ha, U. Hwang, C. Kim, J. Ha, and S. Yoon, Reinforcement Learning based Recommender System using Biclustering Technique, 2018. arXiv:1801.05532.
[6] L. Li, W. Chu, J. Langford, and R. E. Schapire, 'A Contextual-Bandit Approach to Personalized News Article Recommendation', in Proceedings of the 19th International Conference on World Wide Web (WWW), pp. 661-670, 2010.
[7] F. Liu, R. Tang, X. Li, Y. Ye, H. Chen, H. Guo, and Y. Zhang, Deep Reinforcement Learning based Recommendation with Explicit User-Item Interactions Modeling, 2019.
[8] N. J. Nilsson, Delayed-Reinforcement Learning, 2020 (accessed December 9, 2020). http://heim.ifi.uio.no/~mes/inf1400/COOL/REF/Standford/ch11.pdf.
[9] E. Ricciardelli and D. Biswas, 'Self-improving Chatbots based on Reinforcement Learning', in Proceedings of the 4th Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM), 2019.
[10] N. Taghipour, A. Kardan, and S. S. Ghidary, 'Usage-Based Web Recommendations: A Reinforcement Learning Approach', in Proceedings of the ACM Conference on Recommender Systems (RecSys), pp. 113-120, 2007.