Event-triggered reinforcement learning; an application to buildings' micro-climate control

Ashkan Haji Hosseinloo, Munther Dahleh
Laboratory for Information and Decision Systems, MIT, USA
ashkanhh@mit.edu, dahleh@mit.edu

Copyright © 2020, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Smart buildings have great potential for shaping an energy-efficient, sustainable, and more economic future for our planet, as buildings account for approximately 40% of global energy consumption. However, most learning methods for micro-climate control in buildings are based on Markov Decision Processes with fixed transition times, which suffer from high variance in the learning phase. Furthermore, the micro-climate control problem is often modeled and solved as an episodic-task problem with discounted rewards, ignoring its continuing-task nature; this can yield the wrong optimization solution. To overcome these issues we propose an event-triggered learning controller and formulate it based on Semi-Markov Decision Processes with variable transition times in an average-reward setting. We show via simulation the efficacy of our approach in controlling the micro-climate of a single-zone building.

Introduction

Buildings account for approximately 40% of global energy consumption, about half of which is used by heating, ventilation, and air conditioning (HVAC) systems, the primary means of controlling the micro-climate in buildings. Furthermore, buildings are responsible for one-third of global energy-related greenhouse gas emissions. Hence, even an incremental improvement in the energy efficiency of buildings and HVAC systems goes a long way towards a greener, more economic, and energy-efficient future. In addition to their economic and environmental impacts, HVAC systems also affect the productivity and decision-making performance of building occupants through the indoor thermal and air quality they control. For all these reasons, micro-climate control in buildings is an important problem with large-scale economic, environmental, and societal effects.

The main goal of micro-climate control in buildings is to minimize the building's (mainly the HVAC system's) energy consumption while maintaining occupants' comfort in some metric. Model-based control strategies are often inefficient in practice due to the complexity of building thermal dynamics and heterogeneous environmental disturbances (Wei, Wang, and Zhu 2017). They also rely on an accurate model of the building, which makes them resource-intensive and costly. Moreover, the need for prior modeling of the building prevents plug-and-play deployment of model-based controllers. To remedy these issues, data-driven approaches to HVAC control have attracted much interest in recent years as a route towards smart homes. Although the idea of smart homes, in which household devices (e.g. appliances, thermostats, and lights) operate efficiently in an autonomous, coordinated, and adaptive fashion, has been around for a couple of decades (Mozer 1998), its realization now looks ever more pragmatic given the immense recent advances in Internet of Things (IoT) and sensor technology (Minoli, Sohraby, and Occhiogrosso 2017). Among data-driven control approaches, reinforcement learning (RL) has received the most attention in recent years, owing to enormous algorithmic advances in the field as well as its ability to learn efficient control policies solely from experiential data via trial and error.

The Neural Network House project (Mozer 1998; Mozer and Miller 1997) is perhaps the first application of RL to building energy management. Since then, and over the past couple of decades, RL techniques ranging from tabular Q-learning (Liu and Henze 2006; Barrett and Linder 2015; Cheng et al. 2016; Chen et al. 2018) to deep RL (Wei, Wang, and Zhu 2017; Avendano et al. 2018) have been employed to optimally control the micro-climate in buildings. The control objective in all these studies is some variation of energy consumption/cost minimization subject to constraints, e.g. on occupants' comfort in some metric. More recently, policy gradient RL techniques have been adopted for the HVAC control problem. For instance, Deep Deterministic Policy Gradient (DDPG) was used in (Gao, Li, and Wen 2019) and (Li et al. 2019) to control energy consumption in a single-zone laboratory building and a two-zone data center, respectively. The reader is referred to (Hosseinloo et al. 2020) for a comprehensive literature review of RL applications in smart buildings.
Similar to many other studies applying RL in the physical sciences, the above-mentioned works share two main issues. First, they model and solve the micro-climate control problem as an episodic-task problem with discounted rewards, whereas it should be modeled as a continuing-task problem with an average-reward objective. The average reward is what actually matters in continuing-task problems, and greedily maximizing a discounted future value does not necessarily maximize the average reward (Naik et al. 2019). In particular, solutions that fundamentally rely on episodes are likely to fare worse than those that fully embrace the continuing-task setting.

Second, in all these studies the control problem is modeled as a Markov Decision Process (MDP) in which learning and decision making occur at a fixed sampling rate. Fixed time intervals between decisions (control actions) are restrictive in continuous-time problems: a large interval (low sampling rate) deteriorates control accuracy, while a small interval (high sampling rate) can drastically degrade learning quality. For instance, as reported in (Munos 2006) among others, the policy gradient estimate is subject to variance explosion as the discretization time step tends to zero; intuitively, the number of decisions made before any meaningful reward is accumulated grows without bound. Furthermore, learning and control at fixed time intervals may not be desirable in large-scale, resource-constrained wireless embedded control systems (Heemels, Johansson, and Tabuada 2012).

In this study, we eliminate these drawbacks by proposing an event-triggered learning controller in which the control problem is formulated as a Semi-Markov Decision Process (SMDP) with variable time intervals between decision epochs. The problem is posed in an RL framework as a continuing-task problem with an undiscounted average-reward objective. The rest of the paper is organized as follows. The next section explains the problem statement and the proposed controller. The SMDP formulation section describes the problem formulation and the proposed learning framework. Finally, the simulation results and closing remarks are presented in the last two sections.
Problem statement

In this study we present and explain the proposed learning method via a simplified one-zone building; however, the methods and concepts are applicable to more general settings. We study the problem of minimizing the energy consumption of a one-zone building with unknown thermal dynamics, subject to occupants' comfort constraints. For specificity, and with no loss of generality, we consider the heating problem rather than cooling. The temperature of the building evolves as

\frac{dT}{dt} = f(T; T_o, u),    (1)

where T(t) ∈ R is the building temperature, T_o(t) ∈ R is the outside temperature (a disturbance), and u(t) ∈ {0, 1} denotes the heater's ON/OFF status (the actual control action). The unknown and potentially nonlinear thermal dynamics of the system are characterized by the function f(·). Via the control action u(t) we would like to maximize the performance measure J defined as

J = \lim_{T \to \infty} \frac{1}{T} \int_0^T \left\{ r_e\, u(t) + r_c \left(T - T_d\right)^2 + r_{sw}\, \delta(t - t_{sw}) \right\} dt,    (2)

where t_sw is a time at which the controller switches from 0 to 1 (the heater switches from OFF to ON) or vice versa, and δ(·) is the Dirac delta function. The first term of the integrand penalizes energy consumption, while the second and third terms correspond to occupants' comfort: the second term penalizes temperature deviations from a desired set-point temperature T_d, and the third term discourages frequent ON/OFF switching of the heater. The relative effects of these terms are balanced by their corresponding weights r_e, r_c, and r_sw.

To reduce the space of possible control policies (laws), we constrain the optimization to a class of parameterized control policies, specifically threshold policies. This strategy is particularly beneficial in the RL framework since it can significantly reduce the learning sample complexity. We characterize the threshold policies by manifolds in the state space of the system that determine when the control action switches (e.g. ON ↔ OFF in this study). We call these manifolds switching manifolds; the control action switches only when the state trajectory hits one of them, and we refer to such hits as events. Figure 1(a) illustrates a schematic threshold policy for the one-zone building example with switching ON and OFF manifolds, while Fig. 1(b) depicts the thermal dynamics of the building temperature under such a controller. We can mathematically formulate the control action as

u(t) = \begin{cases} 0, & \text{if } T(t) \ge T_{OFF}^{\theta} \\ 1, & \text{if } T(t) \le T_{ON}^{\theta} \\ u(t^-), & \text{otherwise}, \end{cases}    (3)

where T_OFF^θ and T_ON^θ are the thresholds (manifolds) for switching OFF and ON, respectively, parameterized by the parameter vector θ. These thresholds are in general state-dependent. The goal is to find the optimal control policy u*(t) within the parameterized class, i.e. to find the optimal parameter vector θ*, that maximizes the long-run average reward (performance metric) J defined by (2) with no prior knowledge of the system dynamics. In the next section we cast this decision-making problem as an SMDP.

Figure 1: (a) Threshold policy and the switching manifolds for the 1-zone building example; (b) thermal dynamics of the building example under the threshold control policy.

SMDP formulation

By defining the switching manifolds, the control problem is reduced to learning the optimal manifolds. Once the manifolds (the θ vector) are decided, the actual control actions (u(t) ∈ {0, 1}) follow automatically from (3). We can thus think of the manifolds, and hence the θ's, as higher-level control actions and of the ON/OFF heater status u(t) as the lower-level control action. These are usually referred to as options and primitive actions, respectively, in the hierarchical RL framework (Sutton, Precup, and Singh 1999).
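For concreteness, the following minimal sketch implements the primitive (lower-level) control law of (3) for the one-zone example under the simplifying assumption of constant, state-independent thresholds; the function and variable names (threshold_action, T_on, T_off) are ours and purely illustrative, not code from the paper.

```python
def threshold_action(T, u_prev, T_on, T_off):
    """Primitive control action of Eq. (3): switch the heater OFF when the
    temperature reaches the OFF manifold, ON when it reaches the ON manifold,
    and otherwise keep the previous heater status (no event)."""
    if T >= T_off:
        return 0          # event: hit the OFF manifold, heater OFF
    if T <= T_on:
        return 1          # event: hit the ON manifold, heater ON
    return u_prev         # no event: hold the current status


# Illustrative use: the heater stays ON until the temperature first reaches
# T_off = 17.5 degC, then stays OFF until it drops back to T_on = 12.5 degC.
u = 1
for T in (13.0, 15.2, 17.6, 16.9, 14.1, 12.4):
    u = threshold_action(T, u, T_on=12.5, T_off=17.5)
```

In this view, an update of θ changes only the thresholds; the primitive ON/OFF actions between events are generated automatically by this rule.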
In this way we change our decision variable from u(t) to θ. Although we could control (i.e. set the θ values) and learn (i.e. update θ) at fixed time steps, we restrict both to the times at which events occur, i.e. when the system state trajectory hits a manifold. We do this because making too many decisions in a short period of time (with no significant accumulated reward) can result in large variance, as discussed earlier. This change in the timing of control and learning changes the underlying formulation from an MDP with fixed transition times to an SMDP with stochastic transition times.

We study the control problem in an RL framework in which an agent acts in a stochastic environment/system by sequentially choosing actions with no knowledge of the environment/system dynamics. We model the RL control problem as an SMDP, defined as a five-tuple (S, A, P, R, F), where S is the state space, A is the action space, P is the set of state-action-dependent transition probabilities, R is the reward function, and F gives the probability distribution of the transition time, a.k.a. sojourn or dwell time, for each state-action pair. Let τ_k denote the decision epochs (times), with τ_0 = 0, and let S_k ∈ S be the state at decision epoch τ_k. If the system is in state S_k = s_k at epoch τ_k and action A_k = a_k is applied, the system moves to state S_{k+1} = s_{k+1} at epoch τ_{k+1} with probability p(s_{k+1} | s_k, a_k) = P(S_{k+1} = s_{k+1} | S_k = s_k, A_k = a_k). This transition occurs within t_k time units with probability F(t_k | s_k, a_k) = Pr(τ_{k+1} − τ_k ≤ t_k | S_k = s_k, A_k = a_k). Hence, the SMDP kernel Φ(s_{k+1}, t_k) = Pr(S_{k+1} = s_{k+1}, τ_{k+1} − τ_k ≤ t_k | S_k = s_k, A_k = a_k) can be written as Φ(s_{k+1}, t_k) = p(s_{k+1} | s_k, a_k) F(t_k | s_k, a_k).

The reward function of an SMDP is in general more complex than that of an MDP. Between epochs (τ_k ≤ t′ ≤ τ_{k+1}) the system evolves according to the so-called natural process W_{t′}. Suppose the reward between two decision epochs consists of two parts: a fixed state-action-dependent reward f(s_k, a_k) and a time-continuous reward accumulated over the transition time at rate c(W_{t′}, s_k, a_k). We can then write the expected reward r(s_k, a_k) ∈ R(S_k, A_k) between the two epochs τ_k and τ_{k+1} as

r(s_k, a_k) = f(s_k, a_k) + E\left[ \int_{\tau_k}^{\tau_{k+1}} c(W_{t'}, s_k, a_k)\, dt' \,\Big|\, S_k = s_k, A_k = a_k \right].    (4)

Let us also define the average transition time starting at state s_k under action a_k as

\tau(s_k, a_k) = E\left[ \tau_{k+1} - \tau_k \mid S_k = s_k, A_k = a_k \right] = \int_0^{\infty} t\, F(dt \mid s_k, a_k).    (5)

The actions a_k of the SMDP are determined by a stochastic or deterministic policy in each state. In many real-world control problems the optimal and/or desired control policy is deterministic; hence, here we focus on deterministic policies a = µ(s) that deterministically map the state s to the action a. Furthermore, as discussed earlier, for the sake of scalability and sample efficiency we restrict the control problem to a class of policies µ_θ(s) parameterized by the parameter vector θ. With this assumption the expected rewards and transition times at each state are functions of the state and the parameter vector θ, i.e. r(s_k, a_k) = r(s_k; θ) and τ(s_k, a_k) = τ(s_k; θ). The infinite-horizon average reward can then be written as¹

J(\theta) = \lim_{n \to \infty} \frac{E\left[ \sum_{k=0}^{n} r(s_k; \theta) \right]}{E\left[ \sum_{k=0}^{n} \tau(s_k; \theta) \right]}.    (6)

¹ For the average reward to be independent of the initial state, the embedded MDP is required to be unichain.

An online learning algorithm can be devised if we can compute a good estimate of the gradient ∇_θ J in an online fashion, which can then be used to improve the policy parameters via stochastic gradient ascent. But let us first draw clear connections between the SMDP formulation of this section and the micro-climate control problem of the previous section. By defining the switching manifolds, temperature thresholds become the actions of the underlying SMDP. Take the building temperature T and the heater status h at the beginning of each epoch as the state of the system, i.e. s_k = [T_k, h_k]. Then the actions can be written as a = µ_θ(s) = h T_OFF^θ(s) + (1 − h) T_ON^θ(s), where T_OFF^θ(s) and T_ON^θ(s) are the threshold temperatures for switching the heater OFF and ON, respectively; they can in general be state-dependent. Regarding the rewards, comparing equations (2) and (4) shows that f(s_k, a_k) = r_sw and c(W_{t′}, s_k, a_k) = r_e u(t′) + r_c (T(t′) − T_d)².
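As a concrete illustration of this event-indexed bookkeeping, the sketch below assembles the per-epoch reward of (4) from a logged inter-event segment (a fixed switching term plus the time-integrated running cost of (2)) and forms the sample counterpart of the ratio in (6). It is a minimal sketch under our own naming conventions, not code from the paper.

```python
import numpy as np

def epoch_reward(u_seg, T_seg, dt, r_e, r_c, r_sw, T_d):
    """Reward accumulated between two consecutive events, cf. (2) and (4):
    a fixed switching penalty r_sw plus the running cost
    r_e*u(t) + r_c*(T(t) - T_d)**2 integrated over the sojourn interval,
    here approximated from samples logged every dt seconds."""
    u_seg, T_seg = np.asarray(u_seg), np.asarray(T_seg)
    return r_sw + np.sum(r_e * u_seg + r_c * (T_seg - T_d) ** 2) * dt

def average_reward(epoch_rewards, sojourn_times):
    """Sample counterpart of (6): total reward divided by total elapsed time."""
    return np.sum(epoch_rewards) / np.sum(sojourn_times)
```

Logging the pair (r_k, τ_k) at each event is all that is needed to evaluate a candidate θ in this way; when the dynamics are unknown, the learned models r(s; θ) and τ(s; θ) play the role of these logged quantities.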
If r(s, a) and τ(s, a) were known and we somehow had access to the system dynamics, we could estimate J(θ) for a given parameter vector θ by constructing, via simulation, a long sequence s_0, a_0, r_0, τ_0, ..., s_n, a_n, r_n, τ_n. Doing this for different values of θ would let us approximate the performance metric J as a function of θ, and this approximation could then be used to estimate the performance gradient and improve the actual policy via e.g. stochastic gradient ascent. The idea here is to construct the above-mentioned trajectory sequence with online learning of r(s, a) and τ(s, a), but without learning the system dynamics. This is possible because of our choice of policies, namely the threshold policies. Since the action a_k is a temperature threshold, the temperature at the next epoch is automatically revealed at the current epoch, i.e. T_{k+1} = a_k. Moreover, because these thresholds are switching manifolds, the heater status must switch at the next epoch, i.e. h_{k+1} = 1 − h_k. In more complex set-ups we may not be able to fully deduce the next state of the system under threshold policies; for instance, if the system state includes the electricity price, we cannot fully evaluate s_{k+1} from s_k and a_k. However, we can still construct a less accurate sequence of transitions, which may well be sufficient since we usually do not need a very accurate estimate of ∇_θ J for online learning. The online learning control described here is illustrated schematically as a block diagram in Fig. 2.

Figure 2: Block diagram of the online learning control.

Results

In this section we apply the proposed method to control the heating system of a one-zone building so as to minimize energy consumption without jeopardizing the occupants' comfort. We use a simplified linear model characterized by the first-order ordinary differential equation

C\, \frac{dT}{dt} + K (T - T_o) = u(t)\, \dot{Q}_h,    (7)

where C = 2000 kJ K⁻¹ is the building's heat capacity, K = 325 W K⁻¹ is the building's thermal conductance, and Q̇_h = 13 kW is the heater's power. As defined earlier, u(t) ∈ {0, 1} is the heater status, and T_o = −10 °C is the outdoor temperature. The reward rates are set as follows: r_sw = −0.8 unit, r_e = −1.2/3600 unit s⁻¹, and r_c = −1.2/3600 unit K⁻² s⁻¹.

The optimal controller for this example is indeed a threshold policy with constant ON/OFF thresholds. Via brute-force simulation and search, the optimal thresholds are found to be T_ON = 12.5 °C and T_OFF = 17.5 °C, with a corresponding long-run average reward of J = −3.70 unit hr⁻¹ (see Fig. 3). These are the ground-truth thresholds for the optimal control of the building, which our learning controller (Fig. 2) should recover from a stream of online data.

Figure 3: Long-run average reward (performance metric) J as a function of fixed ON/OFF temperature thresholds.

Since we know the optimal controller has fixed temperature thresholds, we can represent the control policy with only two parameters (a two-component θ vector), namely the thresholds themselves, T_ON and T_OFF.
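The brute-force baseline above can be reproduced in spirit with a short simulation of (7) under fixed thresholds followed by a grid search. This is a rough sketch with a coarse Euler step, a short horizon, and an assumed set-point T_d (the set-point value is not restated in this section), so its numerical output should not be read as the paper's result.

```python
import numpy as np

# Building and reward parameters from this section; T_d is assumed for illustration.
C, K, Q_h, T_out = 2000e3, 325.0, 13e3, -10.0     # J/K, W/K, W, degC
r_sw, r_e, r_c = -0.8, -1.2 / 3600, -1.2 / 3600   # unit, unit/s, unit/(K^2 s)
T_d = 20.0                                        # assumed set-point (degC)

def long_run_reward(T_on, T_off, T0=15.0, u0=0, dt=10.0, horizon=3 * 24 * 3600):
    """Simulate (7) under fixed thresholds with forward Euler and return the
    long-run average reward of (2) in unit/hr."""
    T, u, J = T0, u0, 0.0
    for _ in range(int(horizon / dt)):
        u_new = 0 if T >= T_off else (1 if T <= T_on else u)
        if u_new != u:                              # an event: add switching penalty
            J += r_sw
        u = u_new
        J += (r_e * u + r_c * (T - T_d) ** 2) * dt  # running cost
        T += (u * Q_h - K * (T - T_out)) / C * dt   # Euler step of (7)
    return J / (horizon / 3600.0)

# Coarse brute-force search over constant ON/OFF thresholds (kept small and
# short-horizon so the sketch stays cheap to run).
grid = np.arange(10.0, 20.5, 0.5)
J_best, T_on_best, T_off_best = max(
    (long_run_reward(lo, hi), lo, hi) for lo in grid for hi in grid if hi > lo)
print(f"J ~ {J_best:.2f} unit/hr at T_ON = {T_on_best}, T_OFF = {T_off_best}")
```

A finer grid and a longer simulation horizon would be needed to resolve the optimum to the 0.1 °C level reported above.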
For function approximation we use neural networks: one with a single hidden layer of 24 nodes for r(s, a) and τ(s, a), and another with a single hidden layer of 10 nodes for J(θ). It is worth noting that the proposed controller is an off-policy controller; a highly exploratory behaviour policy is employed in the learning simulation. Figure 4 illustrates how the controller learns the optimal thresholds in less than a week. The learnt thresholds at the end of the learning process are T_ON = 12.5 °C and T_OFF = 17.4 °C, which are almost identical to the optimal thresholds.

Figure 4: Time history of the learnt temperature thresholds during the online learning process.
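To make the learning loop concrete, the sketch below shows one way the learned models could drive the threshold update: an imagined event-indexed rollout that exploits T_{k+1} = a_k and h_{k+1} = 1 − h_k, an estimate of J(θ) as in (6), and a simple central finite-difference ascent step standing in for the paper's gradient estimator. The helper names and the finite-difference step are our own simplifications, not the authors' implementation.

```python
import numpy as np

def imagined_rollout(theta, r_hat, tau_hat, T0=15.0, h0=0, n_events=200):
    """Event-indexed rollout using only the learned models r_hat(s, a) and
    tau_hat(s, a): the next temperature equals the commanded threshold
    (T_{k+1} = a_k) and the heater status flips (h_{k+1} = 1 - h_k), so no
    building model is required. Returns an estimate of J(theta), cf. (6)."""
    T_on, T_off = theta
    T, h = T0, h0
    total_r, total_tau = 0.0, 0.0
    for _ in range(n_events):
        a = T_off if h == 1 else T_on      # a = mu_theta(s)
        s = (T, h)
        total_r += r_hat(s, a)
        total_tau += tau_hat(s, a)
        T, h = a, 1 - h                    # next state is fully determined
    return total_r / total_tau

def improve_thresholds(theta, r_hat, tau_hat, lr=0.5, eps=0.1):
    """One ascent step on the estimated J(theta) using a central
    finite-difference gradient over theta = (T_ON, T_OFF)."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (imagined_rollout(theta + e, r_hat, tau_hat)
                   - imagined_rollout(theta - e, r_hat, tau_hat)) / (2 * eps)
    return theta + lr * grad
```

In the setting above, r_hat and tau_hat would be the single-hidden-layer networks fitted from the off-policy data, with the update repeated as new events arrive.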
Conclusion

In this study we proposed an SMDP framework for RL-based control of the micro-climate in buildings. We utilized threshold policies in which learning and control take place when the thresholds are reached. This results in variable time intervals between learning and control decisions, which makes the SMDP framework well suited to this class of control problems. Using the threshold policies, we developed a model-based policy gradient RL approach for the controller and showed via simulation its efficacy in controlling the micro-climate of a single-zone building.

References

Avendano, D. N.; Ruyssinck, J.; Vandekerckhove, S.; Van Hoecke, S.; and Deschrijver, D. 2018. Data-driven optimization of energy efficiency and comfort in an apartment. In 2018 International Conference on Intelligent Systems (IS), 174–182. IEEE.

Barrett, E., and Linder, S. 2015. Autonomous HVAC control, a reinforcement learning approach. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 3–19. Springer.

Chen, Y.; Norford, L. K.; Samuelson, H. W.; and Malkawi, A. 2018. Optimal control of HVAC and window systems for natural ventilation through reinforcement learning. Energy and Buildings 169:195–205.

Cheng, Z.; Zhao, Q.; Wang, F.; Jiang, Y.; Xia, L.; and Ding, J. 2016. Satisfaction based Q-learning for integrated lighting and blind control. Energy and Buildings 127:43–55.

Gao, G.; Li, J.; and Wen, Y. 2019. Energy-efficient thermal comfort control in smart buildings via deep reinforcement learning. arXiv preprint arXiv:1901.04693.

Heemels, W.; Johansson, K. H.; and Tabuada, P. 2012. An introduction to event-triggered and self-triggered control. In 2012 IEEE 51st Conference on Decision and Control (CDC), 3270–3285. IEEE.

Hosseinloo, A. H.; Ryzhov, A.; Bischi, A.; Ouerdane, H.; Turitsyn, K.; and Dahleh, M. A. 2020. Data-driven control of micro-climate in buildings; an event-triggered reinforcement learning approach. arXiv preprint arXiv:2001.10505.

Li, Y.; Wen, Y.; Tao, D.; and Guan, K. 2019. Transforming cooling optimization for green data center via deep reinforcement learning. IEEE Transactions on Cybernetics.

Liu, S., and Henze, G. P. 2006. Experimental analysis of simulated reinforcement learning control for active and passive building thermal storage inventory: Part 1. Theoretical foundation. Energy and Buildings 38(2):142–147.

Minoli, D.; Sohraby, K.; and Occhiogrosso, B. 2017. IoT considerations, requirements, and architectures for smart buildings—energy optimization and next-generation building management systems. IEEE Internet of Things Journal 4(1):269–283.

Mozer, M. C., and Miller, D. 1997. Parsing the stream of time: The value of event-based segmentation in a complex real-world control problem. In International School on Neural Networks, Initiated by IIASS and EMFCSC, 370–388. Springer.

Mozer, M. C. 1998. The neural network house: An environment that adapts to its inhabitants. In Proc. AAAI Spring Symp. Intelligent Environments, volume 58.

Munos, R. 2006. Policy gradient in continuous time. Journal of Machine Learning Research 7(May):771–791.

Naik, A.; Shariff, R.; Yasui, N.; and Sutton, R. S. 2019. Discounted reinforcement learning is not an optimization problem. arXiv preprint arXiv:1910.02140.

Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1–2):181–211.

Wei, T.; Wang, Y.; and Zhu, Q. 2017. Deep reinforcement learning for building HVAC control. In Proceedings of the 54th Annual Design Automation Conference 2017, 22. ACM.