Event-triggered reinforcement learning; an application to buildings' micro-climate control

Ashkan Haji Hosseinloo, Munther Dahleh
Laboratory for Information and Decision Systems, MIT, USA
ashkanhh@mit.edu, dahleh@mit.edu

Copyright © 2020, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Smart buildings have great potential for shaping an energy-efficient, sustainable, and more economic future for our planet, as buildings account for approximately 40% of global energy consumption. However, most learning methods for micro-climate control in buildings are based on Markov Decision Processes with fixed transition times, which suffer from high variance in the learning phase. Furthermore, the micro-climate control problem is often modeled and solved as an episodic-task problem with discounted rewards, ignoring its continuing-task nature; this can yield the wrong optimization solution. To overcome these issues we propose an event-triggered learning controller and formulate it based on Semi-Markov Decision Processes with variable transition times in an average-reward setting. We show via simulation the efficacy of our approach in controlling the micro-climate of a single-zone building.

Introduction

Buildings account for approximately 40% of global energy consumption, about half of which is used by heating, ventilation, and air conditioning (HVAC) systems, the primary means of controlling the micro-climate in buildings. Furthermore, buildings are responsible for one-third of global energy-related greenhouse gas emissions. Hence, even an incremental improvement in the energy efficiency of buildings and HVAC systems goes a long way towards a greener, more economic, and energy-efficient future. In addition to their economic and environmental impacts, HVAC systems also affect the productivity and decision-making performance of building occupants through the indoor thermal and air quality they control. For all these reasons, micro-climate control in buildings is an important problem with large-scale economic, environmental, and societal effects.

The main goal of micro-climate control in buildings is to minimize the building's (mainly the HVAC system's) energy consumption while maintaining occupants' comfort in some metric. Model-based control strategies are often inefficient in practice due to the complexity of building thermal dynamics and heterogeneous environmental disturbances (Wei, Wang, and Zhu 2017). They also rely on an accurate model of the building, which makes them resource-intensive and costly. Moreover, the need for prior modeling of the building prevents plug-and-play deployment of model-based controllers. To remedy these issues, data-driven approaches to HVAC control have attracted much interest in recent years as a route towards smart homes. Although the idea of smart homes, in which household devices (e.g. appliances, thermostats, and lights) operate efficiently in an autonomous, coordinated, and adaptive fashion, has been around for a couple of decades (Mozer 1998), its realization now looks ever more pragmatic given the immense recent advances in Internet of Things (IoT) and sensor technology (Minoli, Sohraby, and Occhiogrosso 2017). Among data-driven control approaches, reinforcement learning (RL) has received the most attention in recent years, owing to enormous algorithmic advances in the field as well as its ability to learn efficient control policies solely from experiential data via trial and error.

The Neural Network House project (Mozer 1998; Mozer and Miller 1997) is perhaps the first application of RL to building energy management. Since then, and over the past couple of decades, RL techniques ranging from tabular Q-learning (Liu and Henze 2006; Barrett and Linder 2015; Cheng et al. 2016; Chen et al. 2018) to deep RL (Wei, Wang, and Zhu 2017; Avendano et al. 2018) have been employed to optimally control the micro-climate in buildings. The control objective in all these studies is some variation of energy consumption/cost minimization subject to constraints, e.g. on occupants' comfort in some metric. More recently, policy gradient RL techniques have been adopted for the HVAC control problem. For instance, Deep Deterministic Policy Gradient (DDPG) was used in (Gao, Li, and Wen 2019) and (Li et al. 2019) to control energy consumption in a single-zone laboratory building and a two-zone data center, respectively. The reader is referred to (Hosseinloo et al. 2020) for a comprehensive literature review of RL applications in smart buildings.
Similar to many other studies applying RL in the physical sciences, the above-mentioned works share two main issues. First, they model and solve the micro-climate control problem as an episodic-task problem with discounted rewards, whereas it should be modeled as a continuing-task problem with an average-reward objective. The average reward is what actually matters in continuing-task problems, and greedily maximizing a discounted future value does not necessarily maximize the average reward (Naik et al. 2019). In particular, solutions that fundamentally rely on episodes are likely to fare worse than those that fully embrace the continuing-task setting.

Second, in all these studies the control problem is modeled as a Markov Decision Process (MDP) in which learning and decision making occur at a fixed sampling rate. Fixed time intervals between decisions (control actions) are restrictive in continuous-time problems: a large interval (low sampling rate) deteriorates control accuracy, while a small interval (high sampling rate) can drastically degrade learning quality. For instance, as reported in (Munos 2006) among others, the policy gradient estimate is subject to variance explosion as the discretization time step tends to zero; intuitively, the number of decisions made before any meaningful reward is accumulated grows without bound. Furthermore, learning and control at fixed time intervals may not be desirable in large-scale, resource-constrained wireless embedded control systems (Heemels, Johansson, and Tabuada 2012).

In this study, we eliminate these drawbacks by proposing an event-triggered learning controller in which the control problem is formulated as a Semi-Markov Decision Process (SMDP) with variable time intervals between decision epochs. The problem is posed in an RL framework as a continuing-task problem with an undiscounted average-reward objective. The rest of the paper is organized as follows. The next section explains the problem statement and the proposed controller. The SMDP formulation section describes the problem formulation and the proposed learning framework. Finally, the simulation results and closing remarks are presented in the last two sections.
Problem statement

In this study we present and explain the proposed learning method via a simplified one-zone building; however, the methods and concepts are applicable to more general settings. We study the problem of minimizing the energy consumption of a one-zone building with unknown thermal dynamics, subject to occupants' comfort constraints. For specificity, and with no loss of generality, we consider the heating problem rather than cooling. The temperature of the building evolves as

\frac{dT}{dt} = f(T; T_o, u),    (1)

where T(t) ∈ R is the building temperature, T_o(t) ∈ R is the outside temperature (a disturbance), and u(t) ∈ {0, 1} denotes the heater's ON/OFF status (the actual control action). The unknown and potentially nonlinear thermal dynamics of the system are characterized by the function f(·). Via the control action u(t) we would like to maximize the performance measure J defined as

J = \lim_{T \to \infty} \frac{1}{T} \int_0^T \left\{ r_e\, u(t) + r_c \left(T - T_d\right)^2 + r_{sw}\, \delta(t - t_{sw}) \right\} dt,    (2)

where t_sw is a time at which the controller switches from 0 to 1 (the heater switches from OFF to ON) or vice versa, and δ(·) is the Dirac delta function. The first term of the integrand penalizes energy consumption, while the second and third terms correspond to occupants' comfort: the second term penalizes temperature deviations from a desired set-point temperature T_d, and the third term discourages frequent ON/OFF switching of the heater. The relative effects of these terms are balanced by their corresponding weights r_e, r_c, and r_sw.

To reduce the space of possible control policies (laws), we constrain the optimization to a class of parameterized control policies, specifically threshold policies. This strategy is particularly beneficial in the RL framework since it can significantly reduce the learning sample complexity. We characterize the threshold policies by manifolds in the state space of the system that determine when the control action switches (e.g. ON ↔ OFF in this study). We call these manifolds switching manifolds; the control action switches only when the state trajectory hits one of them, and we refer to such hits as events. Figure 1(a) illustrates a schematic threshold policy for the one-zone building example with switching ON and OFF manifolds, while Fig. 1(b) depicts the thermal dynamics of the building temperature under such a controller. We can mathematically formulate the control action as

u(t) = \begin{cases} 0, & \text{if } T(t) \ge T_{OFF}^{\theta} \\ 1, & \text{if } T(t) \le T_{ON}^{\theta} \\ u(t^-), & \text{otherwise}, \end{cases}    (3)

where T_OFF^θ and T_ON^θ are the thresholds (manifolds) for switching OFF and ON, respectively, parameterized by the parameter vector θ. These thresholds are in general state-dependent. The goal is to find the optimal control policy u*(t) within the parameterized class, i.e. to find the optimal parameter vector θ*, that maximizes the long-run average reward (performance metric) J defined by (2) with no prior knowledge of the system dynamics. In the next section we cast this decision-making problem as an SMDP.

Figure 1: (a) Threshold policy and the switching manifolds for the 1-zone building example; (b) thermal dynamics of the building example under the threshold control policy.

SMDP formulation

By defining the switching manifolds, the control problem is reduced to learning the optimal manifolds. Once the manifolds (the θ vector) are decided, the actual control actions (u(t) ∈ {0, 1}) follow automatically from (3). We can thus think of the manifolds, and hence the θ's, as higher-level control actions and of the ON/OFF heater status u(t) as the lower-level control action. These are usually referred to as options and primitive actions, respectively, in the hierarchical RL framework (Sutton, Precup, and Singh 1999).
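For concreteness, the following minimal sketch implements the primitive (lower-level) control law of (3) for the one-zone example under the simplifying assumption of constant, state-independent thresholds; the function and variable names (threshold_action, T_on, T_off) are ours and purely illustrative, not code from the paper.

```python
def threshold_action(T, u_prev, T_on, T_off):
    """Primitive control action of Eq. (3): switch the heater OFF when the
    temperature reaches the OFF manifold, ON when it reaches the ON manifold,
    and otherwise keep the previous heater status (no event)."""
    if T >= T_off:
        return 0          # event: hit the OFF manifold, heater OFF
    if T <= T_on:
        return 1          # event: hit the ON manifold, heater ON
    return u_prev         # no event: hold the current status


# Illustrative use: the heater stays ON until the temperature first reaches
# T_off = 17.5 degC, then stays OFF until it drops back to T_on = 12.5 degC.
u = 1
for T in (13.0, 15.2, 17.6, 16.9, 14.1, 12.4):
    u = threshold_action(T, u, T_on=12.5, T_off=17.5)
```

In this view, an update of θ changes only the thresholds; the primitive ON/OFF actions between events are generated automatically by this rule.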
In this way we change our decision variable from u(t) to θ. Although we could control (i.e. set the θ values) and learn (i.e. update θ) at fixed time steps, we restrict both to the times at which events occur, i.e. when the system state trajectory hits a manifold. We do this because making too many decisions in a short period of time (with no significant accumulated reward) can result in large variance, as discussed earlier. This change in the timing of control and learning changes the underlying formulation from an MDP with fixed transition times to an SMDP with stochastic transition times.

We study the control problem in an RL framework in which an agent acts in a stochastic environment/system by sequentially choosing actions with no knowledge of the environment/system dynamics. We model the RL control problem as an SMDP, defined as a five-tuple (S, A, P, R, F), where S is the state space, A is the action space, P is the set of state-action-dependent transition probabilities, R is the reward function, and F gives the probability distribution of the transition time, a.k.a. sojourn or dwell time, for each state-action pair. Let τ_k denote the decision epochs (times), with τ_0 = 0, and let S_k ∈ S be the state at decision epoch τ_k. If the system is in state S_k = s_k at epoch τ_k and action A_k = a_k is applied, the system moves to state S_{k+1} = s_{k+1} at epoch τ_{k+1} with probability p(s_{k+1} | s_k, a_k) = P(S_{k+1} = s_{k+1} | S_k = s_k, A_k = a_k). This transition occurs within t_k time units with probability F(t_k | s_k, a_k) = Pr(τ_{k+1} − τ_k ≤ t_k | S_k = s_k, A_k = a_k). Hence, the SMDP kernel Φ(s_{k+1}, t_k) = Pr(S_{k+1} = s_{k+1}, τ_{k+1} − τ_k ≤ t_k | S_k = s_k, A_k = a_k) can be written as Φ(s_{k+1}, t_k) = p(s_{k+1} | s_k, a_k) F(t_k | s_k, a_k).

The reward function of an SMDP is in general more complex than that of an MDP. Between epochs (τ_k ≤ t′ ≤ τ_{k+1}) the system evolves according to the so-called natural process W_{t′}. Suppose the reward between two decision epochs consists of two parts: a fixed state-action-dependent reward f(s_k, a_k) and a time-continuous reward accumulated over the transition time at rate c(W_{t′}, s_k, a_k). We can then write the expected reward r(s_k, a_k) ∈ R(S_k, A_k) between the two epochs τ_k and τ_{k+1} as

r(s_k, a_k) = f(s_k, a_k) + E\left[ \int_{\tau_k}^{\tau_{k+1}} c(W_{t'}, s_k, a_k)\, dt' \,\Big|\, S_k = s_k, A_k = a_k \right].    (4)

Let us also define the average transition time starting at state s_k under action a_k as

\tau(s_k, a_k) = E\left[ \tau_{k+1} - \tau_k \mid S_k = s_k, A_k = a_k \right] = \int_0^{\infty} t\, F(dt \mid s_k, a_k).    (5)

The actions a_k of the SMDP are determined by a stochastic or deterministic policy in each state. In many real-world control problems the optimal and/or desired control policy is deterministic; hence, here we focus on deterministic policies a = µ(s) that deterministically map the state s to the action a. Furthermore, as discussed earlier, for the sake of scalability and sample efficiency we restrict the control problem to a class of policies µ_θ(s) parameterized by the parameter vector θ. With this assumption the expected rewards and transition times at each state are functions of the state and the parameter vector θ, i.e. r(s_k, a_k) = r(s_k; θ) and τ(s_k, a_k) = τ(s_k; θ). The infinite-horizon average reward can then be written as¹

J(\theta) = \lim_{n \to \infty} \frac{E\left[ \sum_{k=0}^{n} r(s_k; \theta) \right]}{E\left[ \sum_{k=0}^{n} \tau(s_k; \theta) \right]}.    (6)

¹ For the average reward to be independent of the initial state, the embedded MDP is required to be unichain.

An online learning algorithm can be devised if we can compute a good estimate of the gradient ∇_θ J in an online fashion, which can then be used to improve the policy parameters via stochastic gradient ascent. But let us first draw clear connections between the SMDP formulation of this section and the micro-climate control problem of the previous section. By defining the switching manifolds, temperature thresholds become the actions of the underlying SMDP. Take the building temperature T and the heater status h at the beginning of each epoch as the state of the system, i.e. s_k = [T_k, h_k]. Then the actions can be written as a = µ_θ(s) = h T_OFF^θ(s) + (1 − h) T_ON^θ(s), where T_OFF^θ(s) and T_ON^θ(s) are the threshold temperatures for switching the heater OFF and ON, respectively; they can in general be state-dependent. Regarding the rewards, comparing equations (2) and (4) shows that f(s_k, a_k) = r_sw and c(W_{t′}, s_k, a_k) = r_e u(t′) + r_c (T(t′) − T_d)².
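As a concrete illustration of this event-indexed bookkeeping, the sketch below assembles the per-epoch reward of (4) from a logged inter-event segment (a fixed switching term plus the time-integrated running cost of (2)) and forms the sample counterpart of the ratio in (6). It is a minimal sketch under our own naming conventions, not code from the paper.

```python
import numpy as np

def epoch_reward(u_seg, T_seg, dt, r_e, r_c, r_sw, T_d):
    """Reward accumulated between two consecutive events, cf. (2) and (4):
    a fixed switching penalty r_sw plus the running cost
    r_e*u(t) + r_c*(T(t) - T_d)**2 integrated over the sojourn interval,
    here approximated from samples logged every dt seconds."""
    u_seg, T_seg = np.asarray(u_seg), np.asarray(T_seg)
    return r_sw + np.sum(r_e * u_seg + r_c * (T_seg - T_d) ** 2) * dt

def average_reward(epoch_rewards, sojourn_times):
    """Sample counterpart of (6): total reward divided by total elapsed time."""
    return np.sum(epoch_rewards) / np.sum(sojourn_times)
```

Logging the pair (r_k, τ_k) at each event is all that is needed to evaluate a candidate θ in this way; when the dynamics are unknown, the learned models r(s; θ) and τ(s; θ) play the role of these logged quantities.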
If r(s, a) and τ(s, a) were known and we somehow had access to the system dynamics, we could estimate J(θ) for a given parameter vector θ by constructing, via simulation, a long sequence s_0, a_0, r_0, τ_0, ..., s_n, a_n, r_n, τ_n. Doing this for different values of θ would let us approximate the performance metric J as a function of θ, and this approximation could then be used to estimate the performance gradient and improve the actual policy via e.g. stochastic gradient ascent. The idea here is to construct the above-mentioned trajectory sequence with online learning of r(s, a) and τ(s, a), but without learning the system dynamics. This is possible because of our choice of policies, namely the threshold policies. Since the action a_k is a temperature threshold, the temperature at the next epoch is automatically revealed at the current epoch, i.e. T_{k+1} = a_k. Moreover, because these thresholds are switching manifolds, the heater status must switch at the next epoch, i.e. h_{k+1} = 1 − h_k. In more complex set-ups we may not be able to fully deduce the next state of the system under threshold policies; for instance, if the system state includes the electricity price, we cannot fully evaluate s_{k+1} from s_k and a_k. However, we can still construct a less accurate sequence of transitions, which may well be sufficient since we usually do not need a very accurate estimate of ∇_θ J for online learning. The online learning control described here is illustrated schematically as a block diagram in Fig. 2.

Figure 2: Block diagram of the online learning control.

Results

In this section we apply the proposed method to control the heating system of a one-zone building so as to minimize energy consumption without jeopardizing the occupants' comfort. We use a simplified linear model characterized by the first-order ordinary differential equation

C\, \frac{dT}{dt} + K (T - T_o) = u(t)\, \dot{Q}_h,    (7)

where C = 2000 kJ K⁻¹ is the building's heat capacity, K = 325 W K⁻¹ is the building's thermal conductance, and Q̇_h = 13 kW is the heater's power. As defined earlier, u(t) ∈ {0, 1} is the heater status, and T_o = −10 °C is the outdoor temperature. The reward rates are set as follows: r_sw = −0.8 unit, r_e = −1.2/3600 unit s⁻¹, and r_c = −1.2/3600 unit K⁻² s⁻¹.

The optimal controller for this example is indeed a threshold policy with constant ON/OFF thresholds. Via brute-force simulation and search, the optimal thresholds are found to be T_ON = 12.5 °C and T_OFF = 17.5 °C, with a corresponding long-run average reward of J = −3.70 unit hr⁻¹ (see Fig. 3). These are the ground-truth thresholds for the optimal control of the building, which our learning controller (Fig. 2) should recover from a stream of online data.

Figure 3: Long-run average reward (performance metric) J as a function of fixed ON/OFF temperature thresholds.

Since we know the optimal controller has fixed temperature thresholds, we can represent the control policy with only two parameters (a two-component θ vector), namely the thresholds themselves, T_ON and T_OFF.
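The brute-force baseline above can be reproduced in spirit with a short simulation of (7) under fixed thresholds followed by a grid search. This is a rough sketch with a coarse Euler step, a short horizon, and an assumed set-point T_d (the set-point value is not restated in this section), so its numerical output should not be read as the paper's result.

```python
import numpy as np

# Building and reward parameters from this section; T_d is assumed for illustration.
C, K, Q_h, T_out = 2000e3, 325.0, 13e3, -10.0     # J/K, W/K, W, degC
r_sw, r_e, r_c = -0.8, -1.2 / 3600, -1.2 / 3600   # unit, unit/s, unit/(K^2 s)
T_d = 20.0                                        # assumed set-point (degC)

def long_run_reward(T_on, T_off, T0=15.0, u0=0, dt=10.0, horizon=3 * 24 * 3600):
    """Simulate (7) under fixed thresholds with forward Euler and return the
    long-run average reward of (2) in unit/hr."""
    T, u, J = T0, u0, 0.0
    for _ in range(int(horizon / dt)):
        u_new = 0 if T >= T_off else (1 if T <= T_on else u)
        if u_new != u:                              # an event: add switching penalty
            J += r_sw
        u = u_new
        J += (r_e * u + r_c * (T - T_d) ** 2) * dt  # running cost
        T += (u * Q_h - K * (T - T_out)) / C * dt   # Euler step of (7)
    return J / (horizon / 3600.0)

# Coarse brute-force search over constant ON/OFF thresholds (kept small and
# short-horizon so the sketch stays cheap to run).
grid = np.arange(10.0, 20.5, 0.5)
J_best, T_on_best, T_off_best = max(
    (long_run_reward(lo, hi), lo, hi) for lo in grid for hi in grid if hi > lo)
print(f"J ~ {J_best:.2f} unit/hr at T_ON = {T_on_best}, T_OFF = {T_off_best}")
```

A finer grid and a longer simulation horizon would be needed to resolve the optimum to the 0.1 °C level reported above.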
For function approximation we use neural networks: one with a single hidden layer of 24 nodes for r(s, a) and τ(s, a), and another with a single hidden layer of 10 nodes for J(θ). It is worth noting that the proposed controller is an off-policy controller; a highly exploratory behaviour policy is employed in the learning simulation. Figure 4 illustrates how the controller learns the optimal thresholds in less than a week. The learnt thresholds at the end of the learning process are T_ON = 12.5 °C and T_OFF = 17.4 °C, which are almost identical to the optimal thresholds.

Figure 4: Time history of the learnt temperature thresholds during the online learning process.
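To make the learning loop concrete, the sketch below shows one way the learned models could drive the threshold update: an imagined event-indexed rollout that exploits T_{k+1} = a_k and h_{k+1} = 1 − h_k, an estimate of J(θ) as in (6), and a simple central finite-difference ascent step standing in for the paper's gradient estimator. The helper names and the finite-difference step are our own simplifications, not the authors' implementation.

```python
import numpy as np

def imagined_rollout(theta, r_hat, tau_hat, T0=15.0, h0=0, n_events=200):
    """Event-indexed rollout using only the learned models r_hat(s, a) and
    tau_hat(s, a): the next temperature equals the commanded threshold
    (T_{k+1} = a_k) and the heater status flips (h_{k+1} = 1 - h_k), so no
    building model is required. Returns an estimate of J(theta), cf. (6)."""
    T_on, T_off = theta
    T, h = T0, h0
    total_r, total_tau = 0.0, 0.0
    for _ in range(n_events):
        a = T_off if h == 1 else T_on      # a = mu_theta(s)
        s = (T, h)
        total_r += r_hat(s, a)
        total_tau += tau_hat(s, a)
        T, h = a, 1 - h                    # next state is fully determined
    return total_r / total_tau

def improve_thresholds(theta, r_hat, tau_hat, lr=0.5, eps=0.1):
    """One ascent step on the estimated J(theta) using a central
    finite-difference gradient over theta = (T_ON, T_OFF)."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (imagined_rollout(theta + e, r_hat, tau_hat)
                   - imagined_rollout(theta - e, r_hat, tau_hat)) / (2 * eps)
    return theta + lr * grad
```

In the setting above, r_hat and tau_hat would be the single-hidden-layer networks fitted from the off-policy data, with the update repeated as new events arrive.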
Conclusion

In this study we proposed an SMDP framework for RL-based control of the micro-climate in buildings. We utilized threshold policies in which learning and control take place when the thresholds are reached. This results in variable time intervals between learning and control decisions, which makes the SMDP framework well suited to this class of control problems. Using the threshold policies, we developed a model-based policy gradient RL approach for the controller and showed via simulation its efficacy in controlling the micro-climate of a single-zone building.

References

Avendano, D. N.; Ruyssinck, J.; Vandekerckhove, S.; Van Hoecke, S.; and Deschrijver, D. 2018. Data-driven optimization of energy efficiency and comfort in an apartment. In 2018 International Conference on Intelligent Systems (IS), 174–182. IEEE.

Barrett, E., and Linder, S. 2015. Autonomous HVAC control, a reinforcement learning approach. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 3–19. Springer.

Chen, Y.; Norford, L. K.; Samuelson, H. W.; and Malkawi, A. 2018. Optimal control of HVAC and window systems for natural ventilation through reinforcement learning. Energy and Buildings 169:195–205.

Cheng, Z.; Zhao, Q.; Wang, F.; Jiang, Y.; Xia, L.; and Ding, J. 2016. Satisfaction based Q-learning for integrated lighting and blind control. Energy and Buildings 127:43–55.

Gao, G.; Li, J.; and Wen, Y. 2019. Energy-efficient thermal comfort control in smart buildings via deep reinforcement learning. arXiv preprint arXiv:1901.04693.

Heemels, W.; Johansson, K. H.; and Tabuada, P. 2012. An introduction to event-triggered and self-triggered control. In 2012 IEEE 51st Conference on Decision and Control (CDC), 3270–3285. IEEE.

Hosseinloo, A. H.; Ryzhov, A.; Bischi, A.; Ouerdane, H.; Turitsyn, K.; and Dahleh, M. A. 2020. Data-driven control of micro-climate in buildings; an event-triggered reinforcement learning approach. arXiv preprint arXiv:2001.10505.

Li, Y.; Wen, Y.; Tao, D.; and Guan, K. 2019. Transforming cooling optimization for green data center via deep reinforcement learning. IEEE Transactions on Cybernetics.

Liu, S., and Henze, G. P. 2006. Experimental analysis of simulated reinforcement learning control for active and passive building thermal storage inventory: Part 1. Theoretical foundation. Energy and Buildings 38(2):142–147.

Minoli, D.; Sohraby, K.; and Occhiogrosso, B. 2017. IoT considerations, requirements, and architectures for smart buildings—energy optimization and next-generation building management systems. IEEE Internet of Things Journal 4(1):269–283.

Mozer, M. C., and Miller, D. 1997. Parsing the stream of time: The value of event-based segmentation in a complex real-world control problem. In International School on Neural Networks, Initiated by IIASS and EMFCSC, 370–388. Springer.

Mozer, M. C. 1998. The neural network house: An environment that adapts to its inhabitants. In Proc. AAAI Spring Symp. Intelligent Environments, volume 58.

Munos, R. 2006. Policy gradient in continuous time. Journal of Machine Learning Research 7(May):771–791.

Naik, A.; Shariff, R.; Yasui, N.; and Sutton, R. S. 2019. Discounted reinforcement learning is not an optimization problem. arXiv preprint arXiv:1910.02140.

Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1–2):181–211.

Wei, T.; Wang, Y.; and Zhu, Q. 2017. Deep reinforcement learning for building HVAC control. In Proceedings of the 54th Annual Design Automation Conference 2017, 22. ACM.