From personalized timely notification to healthy habit formation: A feasibility study of reinforcement learning approaches on synthetic data

Aneta Lisowska¹, Szymon Wilk¹ and Mor Peleg²
¹ Institute of Computing Science, Poznań University of Technology, Poznań, Poland
² Department of Information Systems, University of Haifa, Haifa, Israel

AIxIA 2021 SMARTERCARE Workshop, November 29, 2021, Milan, IT

Abstract

Cancer patients may struggle with mental wellbeing issues such as distress and depression. As part of the CAPABLE project, we aim to develop a digital behaviour-change intervention that helps them build positive health habits and improve their wellbeing. The main challenge in evaluating the system is the lack of access to real data before the intervention starts. Therefore, we first created a simulator that mimics patient responses to activity suggestions based on Fogg's behaviour model. We then used supervised and reinforcement learning methods to learn the best times to send prompts to patients. We found that the reinforcement learning methods quickly learn not to over-notify patients and find prompting policies that are more effective at facilitating the target activity than a random notification strategy, but less effective than an adaptive supervised learning method trained to predict patient responsiveness.

Keywords

Fogg behaviour model, reinforcement learning, digital behaviour change intervention

1. Introduction

Cancer patients may struggle with mental well-being issues such as distress and depression at all stages of their cancer journey [1, 2]. Poor mental health not only impacts their quality of life but also reduces treatment adherence and cancer survival [3]. As part of the Horizon 2020 CAncer PAtient Better Life Experience (CAPABLE) project, we aim to develop a digital behaviour change intervention [4] that could help cancer patients build emotional resilience and form positive health habits. The patients will be equipped with a coaching system (Virtual Coach) with three components: a backend (with most of the logic and processing), a mobile health app (for functions that need to be performed in proximity to the patient, such as supporting and communicating) and a smartwatch (for sensing). The Virtual Coach (VC) will have the capacity to interact with the patient through notifications and will suggest multiple activities from the domains of mindfulness and positive psychology. In this first proof-of-concept work, we focus on stimulating the development of a daily walking habit. We chose this activity for initial exploration because of its evidence-based benefits to psychological wellbeing [5] and to the overall health of cancer survivors [6, 7].
We draw inspiration from Fogg's Behaviour Model [8] (see Section 2.1) to simulate the patients' responses to the notifications (triggers) sent by the VC reminding them to perform the daily activity, depending on the patient's motivation and ability; trigger, motivation and ability are the three cornerstones of Fogg's theory. The responsiveness of patients to notifications also depends on the context, e.g., the time of day [9], the location of the patient [10] and the physiological state of the patient [11] (e.g., stress level). Previously, researchers investigated the possibility of using reinforcement learning (RL) to identify the appropriate time to send notifications in order to optimize user engagement with mobile applications [12]. Our goal is to facilitate patient habit formation through timely notifications. We investigate the use of supervised and reinforcement learning approaches to learn the best pattern of interacting with patients through notifications (see Section 3). A starting point is constructing a simulator that allows us to test learning-based techniques without access to real-life data. Thus, in this feasibility study we:

• Propose a simulator that mimics patients' responsiveness to activity suggestions, based on Fogg's behaviour model and on findings from multiple health intervention studies and mobile notification research.
• Apply supervised and reinforcement learning approaches to learn the best time to send patient notifications in order to achieve positive habit formation, and compare the patients' responsiveness to prompts against that achieved with randomly timed notifications.

2. Related work

2.1. Modeling behaviour

According to Fogg, habit formation depends on three components: the person's motivation, her ability to perform the task, and the existence of a trigger reminding her to perform the target behaviour [13]. Fogg expresses this dependency as "B=MAT" [8] and suggests that behaviour occurs when the person's level of motivation and ability, together with the presence of a trigger, place her above the action threshold for performing the behaviour, which we interpret here as:

$$
\text{Behaviour} =
\begin{cases}
1 & \text{if } \text{Motivation} \times \text{Ability} \times \text{Trigger} > \text{action threshold} \\
0 & \text{otherwise}
\end{cases}
\qquad (1)
$$

Each of the three components may in turn be influenced by a range of internal (personal) and external factors. We describe them below in the context of our digital behaviour change intervention system.

Motivation. Jowsey et al. [14] interviewed Australian patients with chronic illness to gain an understanding of what motivates them to engage in self-management behaviours. The authors report that maintaining a positive attitude (i.e., positive emotional valence; Valence) was one of the most important internal factors motivating patients to control their health. An external factor impacting motivation was the presence of family and friends (Family). On the other hand, patients became demotivated when they perceived the self-management behaviour as having limited benefit (Perceived Benefit). Interestingly, sleepiness might decrease motivation for behaviours that are not oriented toward assuring sufficient sleep [15]. Conversely, sufficient sleep (Sufficient Sleep) might have a positive effect on motivation; for example, Dolsen et al. suggest that improving sleep benefits patients' adherence to treatment [16].
In our proof-of-concept behaviour simulator, the above-mentioned factors contribute equally, although we acknowledge that this is a very simplified view of motivation:

$$\text{Motivation} = \text{Valence} + \text{Family} + \text{Perceived Benefit} + \text{Sufficient Sleep} \qquad (2)$$

We adopt the same simple cumulative representation of factors for the remaining components of behaviour. Note that the behaviour equation uses multiplication because it is enough for a single component to be 0 for the target behaviour not to be performed. We assume this is not true for the computation of the components; e.g., a person might currently experience negative valence, but the remaining motivation factors might be present, and hence motivation should not be nullified.

Ability. Chan et al. found that app users are more receptive to memory-training suggestions when they are under low cognitive load [17]. High cognitive load might indicate that the person is paying attention to another demanding task, e.g., driving a car [18], and is not available to engage with the suggested activity (Load). The ability to perform a task can also be affected by self-efficacy, i.e., the perception of one's capability to execute the target activity, and by whether the person has performed this activity successfully before. Self-efficacy has been shown to relate to treatment adherence [19] (Self-efficacy). Finally, the patient's ability to perform the target behaviour may be affected by the patient being tired [20] or bored [21] of repeating the same activity (captured as Unstrained):

$$\text{Ability} = \text{Low Load} + \text{Self-efficacy} + \text{Unstrained} \qquad (3)$$

Trigger. In a laboratory study, Goyal et al. [11] found that the best moment to interrupt people is when they are in a state of increasing arousal (Arousal). Bidargaddi et al. found that notifications delivered on weekends or during midday are more effective [22]. On the other hand, participants in the Künzler et al. study were more receptive to interventions on weekdays (between 10am and 6pm) rather than on weekends [23] (Day). Saikia et al. [9] also found that the timing of the notification is important; in their study, the largest percentage of users engaged with notifications around 11am (Time). In addition, multiple studies considered the person's current location and motion activity when determining the most appropriate notification delivery context [12, 23] (Location, Motion):

$$\text{Trigger} = \text{Arousal} + \text{Day} + \text{Time} + \text{Location} + \text{Motion} \qquad (4)$$

We set the trigger to zero when the patient is sleeping so as not to interrupt sleep with activity suggestions.
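To make the simulator's decision rule concrete, the following is a minimal Python sketch of equations (1)-(4) with binary factor scores. The dataclass fields mirror the factors named above, but the field names, defaults and the threshold argument are illustrative assumptions, not the exact code of our simulator.

```python
from dataclasses import dataclass

@dataclass
class PatientFactors:
    # Motivation factors of eq. (2): 1 if the factor favours the behaviour, else 0.
    valence: int = 0
    family: int = 0
    perceived_benefit: int = 0
    sufficient_sleep: int = 0
    # Ability factors of eq. (3).
    low_load: int = 0
    self_efficacy: int = 0
    unstrained: int = 0
    # Trigger factors of eq. (4).
    arousal: int = 0
    day: int = 0
    time: int = 0
    location: int = 0
    motion: int = 0
    asleep: bool = False

def behaviour(f: PatientFactors, action_threshold: int = 20) -> int:
    """Fogg-style rule of eq. (1): 1 if M x A x T exceeds the action threshold."""
    motivation = f.valence + f.family + f.perceived_benefit + f.sufficient_sleep
    ability = f.low_load + f.self_efficacy + f.unstrained
    # The trigger is nullified while the patient sleeps, so prompts never pay off then.
    trigger = 0 if f.asleep else f.arousal + f.day + f.time + f.location + f.motion
    return int(motivation * ability * trigger > action_threshold)
```

The default threshold of 20 matches the value used for the simulated patient in Section 3.1; the worked example after Table 1 steps through one evaluation of this rule.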
Some of the above-described factors affecting motivation, ability and trigger are constant for the patient (e.g., the presence of family members); others vary with time and can be obtained through self-reporting or inferred from signals gathered by wearable devices (e.g., cognitive load [24] or stress [25]). Table 1 lists the factors affecting behaviour along with their states. To compute the behaviour components, we assign the value of 1 to the factor state that benefits the behaviour and 0 otherwise.

Table 1: Factors impacting motivation, ability and trigger

| Factor | States (scores) | Function of | Based on |
| --- | --- | --- | --- |
| Valence | positive (1), negative (0) | sleep (last 24h); physical activity (last 24h); no. notifications (last 24h) | Poor sleep quality correlates with high negative and low positive emotions [26]. Physical activity has a positive within-day association with pleasant core affects [27]. Stress is a state of high arousal and negative valence: physical inactivity increases the chances of high stress [28]. |
| Arousal | low (0), mid (1), high (0) | – | Smartphone users experience stress from receiving frequent notifications [29]. The optimal physiological arousal resulting in flow-experience lies between stress and relaxation [30]. |
| Cognitive load | low (1), high (0) | notification timing | Notifications at detected breakpoint timing resulted in lower cognitive load compared to randomly timed notifications [31]. |
| Awake | awake (1), asleep (0) | time of the day; physical activity (last 24h) | The daily amount of activity positively relates with sleep quality [32]. |
| Sufficient sleep | yes (1), no (0) | arousal; valence | High arousal is associated with reductions in slow-wave sleep and negative valence with disruptions to REM sleep [33]. |
| Perceived benefit | yes (1), no (0) | constant (between each performed activity); valence (during activity) | During-behaviour affect is predictive of concurrent and future physical activity behaviour [34]. |
| Self-efficacy (confidence) | low (0), high (1) | no. activities performed (overall) | |
| Unstrained | yes (1), no (0) | no. activities performed (last 24h); time since the activity | |
| Location | home (1), other (0) | time | |
| Motion | stationary (1), walking (0) | time; successful notifications; past walking frequency | |
| Time of the day | morning (0), midday (1), evening (0), night (0) | time (one-hour time step) | |
| Day | weekend (1), week day (0) | | |
| Has family | yes (1), no (0) | constant | |

For example, motivation = 4 if the patient is in a positive valence (1), has support from family (1), perceives benefit in performing the activity (1) and had a sufficient amount of sleep last night (1). Let's say that the patient also has a low cognitive load (1) at the moment and feels confident that they can perform the task (1), but by now they are tired of and bored with repeating the activity (0), so their ability = 2. In terms of the trigger, the patient is in a low arousal state (0), it is a weekday (0)¹, midday (1), they are at home (1) and they are sitting (1), so their trigger = 3. Thus we have motivation (4) × ability (2) × trigger (3) = 24. With an action threshold of 25, the behaviour would not be performed. But if the patient were not yet strained from repeating the activity today, their ability would be 3, the behaviour computation would yield 36, which is above the action threshold, and the behaviour would occur (activity performed).

¹ Trigger scores (except arousal) are affected by preference; e.g., if the patient prefers to perform the activity on a weekday, the score for a weekday will be 1 rather than 0.

2.2. Machine learning for notifications

Ho et al. [12] framed the task of notifying users at the right time as an RL problem. In their framework, the goal is to maximize the effectiveness of the notifications. The notification system is an agent that interacts with the app user (the environment). Based on the user's context (the state received from the environment), the agent chooses whether or not to notify (the action). The agent is rewarded when the user responds to the notification.
Ho et al. conclude that the Q-Learning (QL) [35] RL method yields a higher notification response rate on crowd-sourced data than a support vector machine or a shallow neural network trained in a supervised fashion. In a follow-up real-life five-week study, the authors compared the Advantage Actor Critic (A2C) RL approach against a Random Forest (RF) trained on data gathered with a random notification policy in the first three weeks of the study [36]. The authors concluded that A2C is more effective in reducing dismissed notifications, but the supervised approach achieved a higher notification answer rate. They suggested that the benefits of the RL method might be more prominent for users whose preferences change. We consider this condition in our experiments (see Section 4).

Künzler et al. [23] utilised RF to detect receptivity to just-in-time adaptive interventions and showed that it increased receptivity over a biased random classifier; however, they did not consider RL-based approaches. In a follow-up study, Mishra et al. [37] compared the static supervised learning approach against an adaptive one. In the adaptive setup, the supervised model is retrained every time new receptivity data become available. Mishra et al. observe that receptivity to notifications from the adaptive model improved over the course of the study. We include the adaptive supervised learning method in our experiments and compare it against RL-based approaches (see Section 3.2).

Sutton et al. [38] proposed that RL could be used to manage notifications coming from various applications. Their goal was to learn which notifications are important and which should be blocked. They created an OpenAI Gym environment for training QL and Deep QL (DQL) agents on synthetic data. When the models were applied to real data, they could predict a user's actions toward notifications better than a random benchmark. The authors suggested that synthetic data could be used for RL training in cases where the use of real data is not possible. Encouraged by Sutton et al.'s findings, we use OpenAI Gym [39] to create an environment (see Section 3.1) and synthesize patient behaviour according to the behaviour model described in Section 2.1.

3. Methods

We draw inspiration from the works described above and formulate our problem similarly to Ho et al. [36]: the VC is the agent, a patient is the environment, and the patient's context is the environment's state. However, we use a richer description of the context, which includes the patient's physiological data that will be captured by the smartwatch (the thirteen variables described below).

3.1. Patient environment

We used OpenAI Gym [39] to create a patient environment and simulate the patient's responses to prompts according to equation (1)². We simulate a patient who has a family, prefers being notified at midday, and does not want to receive more than 3 notifications daily. The patient's action threshold is set to 20³, and their initial state shows issues with sleeping (less than 5 hours a day), negative valence for the majority of the day and no physical activity; to improve the patient's sleep and mood, they are recommended to develop a walking habit.

² Simulation code at https://github.com/Capable-project/capable-rl4vc.git
³ At this threshold, 70% of runs with hourly notifications result in the target behaviour being performed at least once during the course of the intervention.
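As an illustration of this setup, below is a minimal environment skeleton written against the classic Gym API (Gymnasium's reset/step signatures differ slightly). The class and variable names are ours, the random factor draws are placeholders standing in for the context-driven factor updates of Table 1, and the reward values anticipate the scheme described in Section 3.2; the actual simulator is in the repository linked in the footnote above.

```python
import numpy as np
import gym
from gym import spaces

N_STATE_VARS = 13            # the thirteen context variables described in this section
EPISODE_HOURS = 24 * 7 * 8   # eight-week intervention with one-hour time steps

class PatientEnv(gym.Env):
    """Simulated patient; the VC (agent) decides once per hour whether to prompt."""

    def __init__(self, action_threshold=20, daily_limit=3):
        self.action_space = spaces.Discrete(2)  # 0 = stay silent, 1 = send a prompt
        self.observation_space = spaces.Box(0.0, np.inf, (N_STATE_VARS,), np.float32)
        self.action_threshold = action_threshold
        self.daily_limit = daily_limit
        self.rng = np.random.default_rng()

    def reset(self):
        self.t, self.prompts_today = 0, 0
        # Initial state: poor sleep, negative valence, no physical activity.
        self.state = np.zeros(N_STATE_VARS, dtype=np.float32)
        return self.state

    def step(self, action):
        performed = 0
        if action == 1:
            self.prompts_today += 1
            # Placeholder for eq. (1): random binary factors instead of the
            # context-driven factor updates of Table 1.
            motivation = self.rng.integers(0, 2, size=4).sum()  # eq. (2)
            ability = self.rng.integers(0, 2, size=3).sum()     # eq. (3)
            trigger = self.rng.integers(0, 2, size=5).sum()     # eq. (4)
            performed = int(motivation * ability * trigger > self.action_threshold)
        reward = self._reward(action, performed)
        self.t += 1
        if self.t % 24 == 0:
            self.prompts_today = 0  # the notification budget resets daily
        # Placeholder transition; the real simulator evolves the patient context.
        self.state = self.rng.random(N_STATE_VARS).astype(np.float32)
        return self.state, reward, self.t >= EPISODE_HOURS, {}

    def _reward(self, action, performed):
        # Reward scheme of Section 3.2: +20, -1, -10 or 0.
        if action == 0:
            return 0
        if performed:
            return 20
        return -10 if self.prompts_today > self.daily_limit else -1
```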
The patient state available to the VC includes thirteen variables: time of day, weekday, benefit score (since the last action performed), location, awake state, valence, arousal, cognitive load, motion, time since the activity was performed, number of hours slept (in the last 24h), number of times notified (in the last 24h) and number of times the patient performed the activity (in the last 24h).

3.2. Prompt strategies

We consider three prompt strategies, described below.

Random: At each time step (each hour) the VC randomly decides whether or not to send a prompt to the patient, except at night time.

Supervised Approaches: Following [36, 23, 37], we selected an RF model and train it on data gathered during the first three weeks (as in [36]) of the simulated intervention. We use the scikit-learn [40] implementation of the model with the default parameters and balanced class weighting. The model is trained in a supervised fashion on pairs of patient states and their responses to prompts (the activity was performed or not). During the sample-gathering phase, prompts are sent every hour except at night. After the initial training phase, the model prompts the patient when it predicts that they will perform the activity, given their current context (the thirteen variables). Following [37], we consider both static and adaptive model training approaches. In the former case, the model is applied as is and no further training is performed after the initial training phase. In the latter case, after each prompt, a new example describing the context and the response is added to the training set and the model is retrained.

RL Approaches: Every hour the VC decides whether or not to send a prompt using an RL model (agent). If suggested by the model, it sends the prompt to the environment (patient). The environment calculates the value of eq. (1) and, based on the result, simulates the action of walking (or not), then responds to the agent with a reward. The agent receives a reward of 20 units for a notification that results in the target walking behaviour (even if the number of prompts exceeds the daily notification limit), -1 when the agent notified the simulated patient but the patient did not perform the activity, -10 when the agent sent a notification that exceeded the tolerated number of daily notifications and the patient did not perform the activity, and 0 when the agent did not send a notification.

We employ three RL approaches: DQL, A2C and the proximal policy optimization (PPO) algorithm, which is the state of the art for many continuous control problems [41]. For all of them we use the implementation from stable-baselines3 [42] with default parameters, except that for DQL learning_starts = 24 (one day of data gathering) and for A2C and PPO n_steps = 24 (an update every day).

Figure 1: Comparison of the prompt strategies in terms of the number of prompts sent daily throughout the intervention (excluding runs with failed training).
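For concreteness, a sketch of how the three agents can be instantiated on the environment skeleton above with stable-baselines3 (the hyperparameter names learning_starts and n_steps are the library's; everything else follows the defaults, as stated above):

```python
from stable_baselines3 import DQN, A2C, PPO

env = PatientEnv()

agents = {
    # DQL gathers one simulated day of experience before learning starts.
    "DQL": DQN("MlpPolicy", env, learning_starts=24),
    # A2C and PPO update their policies once per simulated day (24 steps).
    "A2C": A2C("MlpPolicy", env, n_steps=24),
    "PPO": PPO("MlpPolicy", env, n_steps=24),
}

for name, agent in agents.items():
    agent.learn(total_timesteps=EPISODE_HOURS)  # one eight-week intervention per run
```

Each call to learn corresponds to a single simulated intervention; the experiments in Section 4 repeat this 500 times per strategy.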
4. Experiments

We simulate an eight-week-long digital intervention and compare all prompt strategies in terms of their effectiveness (ratio of activities performed to prompts sent) and potential annoyance to the user (number of prompts daily) under four conditions:

• Stable response pattern – the patient's preferences and action threshold stay constant across the intervention.
• Habituation – the patient habituates to prompts (their action threshold increases by 0.15 every day). Given how we define eq. (1) and the specific factors (all integers), this corresponds to increasing the threshold by 1 after each week.
• Preference shift – the patient changes their preferred notification time to the evening after week 5.
• Preference shift + habituation – the patient habituates to prompts and changes their preferred notification time to the evening after week 5.

Given the variation in results between runs, we ran each strategy 500 times. Figure 1 shows how the algorithms learn the number of daily notifications (fewer notifications are desirable to patients) and Figure 2 shows the mean prompt effectiveness across all runs.

After 3 weeks (21 days) of initial training, the supervised approaches (both static and adaptive) send prompts at more appropriate times than the other methods, which manifests in higher prompt effectiveness (Fig. 2). The main disadvantage of the supervised learning methods is that they might annoy the user at the initial stage of the intervention, as they require frequent prompts to gather positive and negative training samples. The model's performance also depends on how many positive samples were gathered during that initial training phase. Runs in which none of the prompts resulted in the activity being performed in the first 21 days led to failed model training (no convergence), with the model always predicting not to prompt the user (these runs are excluded from Figs. 1 and 2).

Figure 2: Comparison of the prompt strategies in terms of prompt effectiveness: (a) stable response pattern, (b) preference shift, (c) habituation, (d) habituation + preference shift. The mean is computed from runs in which at least one prompt was sent.

All the reinforcement learning approaches learned to prompt the user fewer than 3 times daily over the course of the intervention (Fig. 1). DQL reaches the desired number of prompts the fastest, within around 9 days, and its ratio of prompts resulting in the activity being performed is higher than that of randomly timed prompts (Fig. 2a). The best reinforcement learning method in our experiments is PPO, which slowly reaches the prompt effectiveness of the supervised model but requires fewer prompts in the initial three weeks of the intervention.

Changes in the simulated patient's prompt response patterns (Fig. 2b,c,d) have the most visible impact on the supervised models. Nevertheless, if initially trained successfully, the supervised methods remain more effective in prompting patients than the other approaches. The drop in response effectiveness in the case of the adaptive supervised training approach is due to reusing the training samples gathered before the simulated patient's preferences shifted; there were not enough new samples to train a new model after detecting the performance drop. Up-weighting the more recent samples during model retraining, or using an ensemble learning method designed to tackle concept drift [43], could lead to better prompt strategy adaptation, as sketched below.
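A minimal sketch of the recency up-weighting idea for the adaptive RF retraining; the half-life value and the function name are illustrative assumptions, not something we evaluated:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def retrain_with_recency(X, y, timestamps_h, half_life_hours=168.0):
    """Retrain the adaptive RF, exponentially down-weighting older samples.

    X: (n_samples, 13) patient-context matrix; y: 1 if the prompt led to the
    activity, else 0; timestamps_h: sample collection times in hours.
    """
    age = timestamps_h.max() - timestamps_h                  # hours since each sample
    weights = 0.5 ** (age / half_life_hours)                 # assumed one-week half-life
    model = RandomForestClassifier(class_weight="balanced")  # defaults, as in Section 3.2
    model.fit(X, y, sample_weight=weights)
    return model
```

After a preference shift, freshly gathered post-shift samples would then dominate the retraining signal even though the pre-shift samples remain in the training set.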
We also checked whether the simulated patient environment, even under the hardest simulation condition (habituation + preference shift), captures habit formation and, consequently, improvements in the patient's state when interventions are administered. Fig. 3 shows the patient's state without intervention, with an intervention in which prompts are sent hourly during the day, and with an intervention with timely notifications using PPO. The simulated patient changed their state as a result of the intervention with the walking activity, showing improvements in mood and in the number of hours slept. Note that in our simulation positive valence depends on both physical activity and the number of notifications (Table 1); therefore, timely notification results in a longer time spent in a positive mood than randomly timed prompts.

Figure 3: Effect of the intervention on the simulated patient.

5. Discussion

The comparison of different prompt strategies in a simulated patient environment suggests that supervised learning approaches to patient notification might send the most appropriately timed prompts. However, their training could potentially tire patients with a large number of prompts during the initial intervention stage. It might be interesting to investigate whether a smart sampling strategy, such as diversity sampling, could reduce the notification burden in the initial intervention stage while keeping model performance high for the remainder of the intervention.

In our experiments, PPO was the most effective RL-based prompting method. It almost reached the activity-performed-to-prompt ratio of the supervised models by the end of the intervention period, and it required sending fewer prompts than the supervised models overall, which might burden patients less. PPO displayed a stable improvement in prompt effectiveness over time, regardless of the simulated patient's prompt habituation or preference shift, making it a potentially interesting candidate approach for a real-life study.

All the learning methods considered in this study could benefit from a larger number of training samples, especially ones capturing the contexts in which a prompt results in the activity being performed. Here we simulated a single patient, but in reality there might be multiple patients with similar behaviour patterns, and leveraging learning across them could boost model performance. Currently, one limitation of our simulated environment is that we do not capture personality differences, which have been shown to impact user receptivity to notifications [23]. In future work, we aim to include this information in our simulation and investigate learning from multiple different users.

Acknowledgments

The CAPABLE project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 875052.

References

[1] J. C. Holland, Y. Alici, Management of distress in cancer patients, The Journal of Supportive Oncology 8 (2010) 4–12.
[2] A. Pitman, S. Suleman, N. Hyde, A. Hodgkiss, Depression and anxiety in patients with cancer, BMJ 361 (2018).
[3] M. R. DiMatteo, K. B. Haskard-Zolnierek, Impact of depression on treatment adherence and survival from cancer, Depression and Cancer (2011) 101–124.
[4] E. B. Hekler, S. Michie, M. Pavel, D. E. Rivera, L. M. Collins, H. B. Jimison, C. Garnett, S. Parral, D. Spruijt-Metz, Advancing models and theories for digital behavior change interventions, American Journal of Preventive Medicine 51 (2016) 825–832.
[5] P. Kelly, C. Williamson, A. G. Niven, R. Hunter, N. Mutrie, J. Richards, Walking on sunshine: scoping review of the evidence for walking and mental health, British Journal of Sports Medicine 52 (2018) 800–806.
[6] D. B. Wilson, J. S. Porter, G. Parker, J. Kilpatrick, Anthropometric changes using a walking intervention in African American breast cancer survivors: a pilot study, Preventing Chronic Disease 2 (2005).
[7] L. J. Frensham, G. Parfitt, J. Dollman, Effect of a 12-week online walking intervention on health and quality of life in cancer survivors: a quasi-randomized controlled trial, International Journal of Environmental Research and Public Health 15 (2018) 2081.
[8] B. J. Fogg, Tiny Habits: The Small Changes That Change Everything, Houghton Mifflin Harcourt, 2019.
[9] P. Saikia, M. Cheung, J. She, S. Park, Effectiveness of mobile notification delivery, in: 2017 18th IEEE International Conference on Mobile Data Management (MDM), IEEE, 2017, pp. 21–29.
[10] H. Sarker, M. Sharmin, A. A. Ali, M. M. Rahman, R. Bari, S. M. Hossain, S. Kumar, Assessing the availability of users to engage in just-in-time intervention in the natural environment, in: Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2014, pp. 909–920.
[11] N. Goyal, S. R. Fussell, Intelligent interruption management using electro dermal activity based physiological sensor for collaborative sensemaking, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1 (2017) 1–21.
[12] B.-J. Ho, B. Balaji, M. Koseoglu, M. Srivastava, Nurture: notifying users at the right time using reinforcement learning, in: Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, 2018, pp. 1194–1201.
[13] B. J. Fogg, A behavior model for persuasive design, in: Proceedings of the 4th International Conference on Persuasive Technology, 2009, pp. 1–7.
[14] T. Jowsey, C. Pearce-Brown, K. A. Douglas, L. Yen, What motivates Australian health service users with chronic illness to engage in self-management behaviour?, Health Expectations 17 (2014) 267–277.
[15] J. Axelsson, M. Ingre, G. Kecklund, M. Lekander, K. P. Wright Jr, T. Sundelin, Sleepiness as motivation: a potential mechanism for how sleep deprivation affects behavior, Sleep 43 (2020) zsz291.
[16] M. R. Dolsen, A. M. Soehner, C. M. Morin, L. Bélanger, M. Walker, A. G. Harvey, Sleep the night before and after a treatment session: a critical ingredient for treatment adherence?, Journal of Consulting and Clinical Psychology 85 (2017) 647.
[17] S. W. Chan, S. Sapkota, R. Mathews, H. Zhang, S. Nanayakkara, Prompto: investigating receptivity to prompts based on cognitive load from memory training conversational agent, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 4 (2020) 1–23.
[18] Y.-C. Lee, J. D. Lee, L. Ng Boyle, The interaction of cognitive load and attention-directing cues in driving, Human Factors 51 (2009) 271–280.
[19] A. Luszczynska, R. Schwarzer, The role of self-efficacy in health self-regulation, The Adaptive Self: Personal Continuity and Intentional Self-development 15 (2005) 137–152.
[20] S. F. Chastin, N. Fitzpatrick, M. Andrews, N. DiCroce, Determinants of sedentary behavior, motivation, barriers and strategies to reduce sitting time in older women: a qualitative investigation, International Journal of Environmental Research and Public Health 11 (2014) 773–791.
[21] S. Bucci, J. Ainsworth, C. Barrowclough, S. Lewis, G. Haddock, K. Berry, R. Emsley, D. Edge, M. Machin, A theory-informed digital health intervention in people with severe mental health problems, in: MEDINFO 2019: Health and Wellbeing e-Networks for All, IOS Press, 2019, pp. 526–530.
[22] N. Bidargaddi, D. Almirall, S. Murphy, I. Nahum-Shani, M. Kovalcik, T. Pituch, H. Maaieh, V. Strecher, To prompt or not to prompt? A microrandomized trial of time-varying push notifications to increase proximal engagement with a mobile health app, JMIR mHealth and uHealth 6 (2018) e10123.
[23] F. Künzler, V. Mishra, J.-N. Kramer, D. Kotz, E. Fleisch, T. Kowatsch, Exploring the state-of-receptivity for mhealth interventions, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3 (2019) 1–27.
[24] A. Lisowska, S. Wilk, M. Peleg, Is it a good time to survey you? Cognitive load classification from blood volume pulse, in: 2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS), IEEE, 2021, pp. 137–141.
[25] A. Lisowska, S. Wilk, M. Peleg, Catching patient's attention at the right time to help them undergo behavioural change: stress classification experiment from blood volume pulse, in: Proceedings of the 19th International Conference on Artificial Intelligence in Medicine, AIME 2021, Springer, 2021, pp. 72–82.
[26] C. Baglioni, K. Spiegelhalder, C. Lombardo, D. Riemann, Sleep and emotions: a focus on insomnia, Sleep Medicine Reviews 14 (2010) 227–238.
[27] A. Ivarsson, A. Stenling, K. Weman Josefsson, S. Höglind, M. Lindwall, Associations between physical activity and core affects within and across days: a daily diary study, Psychology & Health 36 (2021) 43–58.
[28] X. Zhai, M. Ye, C. Wang, Q. Gu, T. Huang, K. Wang, Z. Chen, X. Fan, Associations among physical activity and smartphone use with perceived stress and sleep quality of Chinese college students, Mental Health and Physical Activity 18 (2020) 100323.
[29] S. Yoon, S.-s. Lee, J.-m. Lee, K. Lee, Understanding notification stress of smartphone messenger app, in: CHI'14 Extended Abstracts on Human Factors in Computing Systems, 2014, pp. 1735–1740.
[30] C. Peifer, A. Schulz, H. Schächinger, N. Baumann, C. H. Antoni, The relation of flow-experience and physiological arousal under stress – can u shape it?, Journal of Experimental Social Psychology 53 (2014) 62–69.
[31] T. Okoshi, K. Tsubouchi, M. Taji, T. Ichikawa, H. Tokuda, Attention and engagement-awareness in the wild: a large-scale study with adaptive notifications, in: 2017 IEEE International Conference on Pervasive Computing and Communications (PerCom), IEEE, 2017, pp. 100–110.
[32] A. N. S. Bisson, S. A. Robinson, M. E. Lachman, Walk to a better night of sleep: testing the relationship between physical activity and sleep, Sleep Health 5 (2019) 487–494.
[33] C. P. Fairholme, R. Manber, Sleep, emotions, and emotion regulation: an overview, Sleep and Affect (2015) 45–61.
[34] D. M. Williams, S. Dunsiger, E. G. Jennings, B. H. Marcus, Does affective valence during and immediately following a 10-min walk predict concurrent and future physical activity?, Annals of Behavioral Medicine 44 (2012) 43–51.
[35] C. J. Watkins, P. Dayan, Q-learning, Machine Learning 8 (1992) 279–292.
[36] B.-J. Ho, B. Balaji, M. Koseoglu, S. Sandha, S. Pei, M. Srivastava, Quick question: interrupting users for microtasks with reinforcement learning, arXiv preprint arXiv:2007.09515 (2020).
[37] V. Mishra, F. Künzler, J.-N. Kramer, E. Fleisch, T. Kowatsch, D. Kotz, Detecting receptivity for mhealth interventions in the natural environment, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 5 (2021) 1–24.
[38] R. Sutton, K. Fraser, O. Conlan, A reinforcement learning and synthetic data approach to mobile notification management, in: Proceedings of the 17th International Conference on Advances in Mobile Computing & Multimedia, 2019, pp. 155–164.
[39] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, W. Zaremba, OpenAI Gym, 2016. arXiv:1606.01540.
[40] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[41] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017).
[42] A. Raffin, A. Hill, M. Ernestus, A. Gleave, A. Kanervisto, N. Dormann, Stable Baselines3, https://github.com/DLR-RM/stable-baselines3, 2019.
[43] B. Krawczyk, L. L. Minku, J. Gama, J. Stefanowski, M. Woźniak, Ensemble learning for data stream analysis: a survey, Information Fusion 37 (2017) 132–156.