<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Event-triggered reinforcement learning; an application to buildings' micro-climate control</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ashkan Haji Hosseinloo</string-name>
          <email>ashkanhh@mit.edu</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Munther Dahleh</string-name>
          <email>dahleh@mit.edu</email>
        </contrib>
        <aff>Laboratory for Information and Decision Systems, USA</aff>
      </contrib-group>
      <abstract>
        <p>Smart buildings have great potential for shaping an energy-efficient, sustainable, and more economic future for our planet, as buildings account for approximately 40% of global energy consumption. However, most learning methods for micro-climate control in buildings are based on Markov Decision Processes with fixed transition times, which suffer from high variance in the learning phase. Furthermore, the micro-climate control problem is often modeled and solved as an episodic-task problem with discounted rewards, ignoring its continuing-task nature; this can result in an incorrect solution to the optimization problem. To overcome these issues we propose an event-triggered learning control and formulate it based on Semi-Markov Decision Processes with variable transition times and in an average-reward setting. We show via simulation the efficacy of our approach in controlling the micro-climate of a single-zone building.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Buildings account for approximately 40% of global energy
consumption, about half of which is used by heating,
ventilation, and air conditioning (HVAC) systems, the
primary means of controlling the micro-climate in buildings.
Furthermore, buildings are responsible for one-third of
global energy-related greenhouse gas emissions. Hence,
even an incremental improvement in the energy efficiency
of buildings and HVAC systems goes a long way towards
building a greener, more economic, and energy-efficient
future. In addition to their economic and environmental
impacts, HVAC systems can also affect the productivity and
decision-making performance of building occupants by
controlling indoor thermal and air quality. For all these
reasons, micro-climate control in buildings is an important
issue, with large-scale economic, environmental, and
societal effects.</p>
      <p>The main goal of micro-climate control in buildings
is to minimize the building’s (mainly the HVAC’s) energy
consumption while improving occupants’ comfort in some
metric. Model-based control strategies are often inefficient
in practice due to the complexity of building thermal
dynamics and heterogeneous environmental disturbances
(Wei, Wang, and Zhu 2017). They also rely on an accurate
model of the building, which makes them resource-intensive
and costly. Moreover, the need for prior modeling of
the building prevents a plug-and-play deployment of
model-based controllers. To remedy these issues, data-driven
approaches to HVAC control have attracted much interest
in recent years as a route towards smart homes. Although
the idea of smart homes, where household devices (e.g.
appliances, thermostats, and lights) operate efficiently
in an autonomous, coordinated, and adaptive fashion, has
been around for a couple of decades (Mozer 1998), its
realization now looks ever more pragmatic given the immense
recent advances in Internet of Things (IoT) and sensor
technology (Minoli, Sohraby, and Occhiogrosso 2017). Among
data-driven control approaches, reinforcement
learning (RL) has attracted particular attention in recent years
due to enormous algorithmic advances in this field as
well as its ability to learn efficient control policies solely
from experiential data via trial and error.</p>
      <p>
        The Neural Network House project (Mozer 1998;
Mozer and Miller 1997) is perhaps the first application
of RL in a building energy management system. Since then,
over the past couple of decades, different RL
techniques, from tabular Q-learning
        <xref ref-type="bibr" rid="ref2 ref3 ref4">(Liu and Henze 2006;
Barrett and Linder 2015; Cheng et al. 2016;
Chen et al. 2018)</xref>
        to Deep RL
        <xref ref-type="bibr" rid="ref1">(Wei, Wang, and Zhu 2017;
Avendano et al. 2018)</xref>
        have been employed to optimally
control the micro-climate in buildings. The control objective
in all these studies is a variation of energy consumption/cost
minimization subject to constraints, e.g. occupants’
comfort in some metric. More recently, policy gradient RL
techniques have been adopted for the HVAC control problem.
For instance, Deep Deterministic Policy Gradient (DDPG)
was used in
        <xref ref-type="bibr" rid="ref5">(Gao, Li, and Wen 2019)</xref>
        and
        <xref ref-type="bibr" rid="ref5">(Li et al. 2019)</xref>
        to control energy consumption in a single-zone laboratory
building and a two-zone data center building, respectively. The reader
is referred to
        <xref ref-type="bibr" rid="ref7">(Hosseinloo et al. 2020)</xref>
        for a comprehensive
literature review on RL application in smart buildings.
      </p>
      <p>Similar to many other RL application studies in the physical
sciences, there are two main issues with the
above-mentioned studies. First, they model and solve the problem
of micro-climate control as an episodic-task problem
with discounted rewards, while it should be modeled as a
continuing-task problem with average reward. Average
reward is what really matters in continuing-task problems,
and greedily maximizing discounted future value does not
necessarily maximize the average reward (Naik et al. 2019).
In particular, solutions that fundamentally rely on episodes
are likely to fare worse than those that fully embrace the
continuing-task setting.</p>
      <p>
        Second, in all these studies the control problem is
modeled based on Markov Decision Processes (MDPs), where
learning and decision making occur at a fixed sampling rate.
Fixed time intervals between decisions (control actions)
are restrictive in continuous-time problems: a large interval
(low sampling rate) deteriorates the control accuracy, while
a small interval (high sampling rate) can drastically degrade
the learning quality. For instance, as reported in (Munos
2006) among others, the policy gradient estimate is subject to
variance explosion as the discretization time-step tends
to zero. The intuitive reason is that the number of decisions
taken before a meaningful reward is received grows to
infinity. Furthermore, learning and control
at fixed time intervals may not be desirable in large-scale
resource-constrained wireless embedded control systems
        <xref ref-type="bibr" rid="ref6">(Heemels, Johansson, and Tabuada 2012)</xref>
        .
      </p>
      <p>In this study, we eliminate the major drawbacks of
the learning techniques discussed above by proposing
an event-triggered learning controller in which the control
problem is formulated based on Semi-Markov Decision
Processes (SMDPs) with variable time intervals (decision
epochs). The problem is formulated in an RL framework
as a continuing-task problem with an undiscounted
average-reward optimization objective. The rest of the paper is
organized as follows. The next section explains the problem
statement and the proposed controller. The SMDP formulation
section describes the problem formulation and the proposed
learning framework. Finally, the simulation results and
concluding remarks are presented in the last two sections.</p>
    </sec>
    <sec id="sec-2">
      <title>Problem statement</title>
      <p>In this study we present and explain our proposed learning
methods via a simplified one-zone building; however, the
methods and concepts are applicable to more general
settings. Here we study the problem of minimizing the energy
consumption in a one-zone building with unknown thermal
dynamics and subject to occupants’ comfort constraints. For
specificity and with no loss of generality we consider the
heating problem rather than cooling. The temperature of the
building evolves as
dT/dt = f(T, T_o, u),    (1)
where T(t) ∈ ℝ represents the building temperature,
T_o(t) ∈ ℝ is the outside temperature (a disturbance), and
u(t) ∈ {0, 1} denotes the heater’s ON/OFF status (the actual
control action). The unknown and potentially nonlinear thermal
dynamics of the system are characterized by the function
f(·). Via the control action u(t) we would like to maximize
the performance measure J defined as
J = lim_{t_f → ∞} (1/t_f) ∫_0^{t_f} −[ r_e u(t) + r_c (T(t) − T_d)² + r_sw Σ_sw δ(t − t_sw) ] dt,    (2)
where t_sw are the times at which the controller switches from
0 to 1 (the heater switches from OFF to ON) or vice
versa, and δ(·) is the Dirac delta function. The first term
of the integrand penalizes energy consumption while
the second and third terms correspond to occupants’
comfort. Specifically, the second term penalizes temperature
deviations from a desired set-point temperature (T_d), while
the third term discourages frequent ON/OFF switching of the
heater. The relative effects of these terms are balanced by
their corresponding weights, i.e. r_e, r_c, and r_sw.</p>
      <p>To reduce the space of possible control policies (laws) we
constrain the optimization to a class of parameterized
control policies, specifically to threshold policies. This
strategy is particularly beneficial in the RL framework since it
can significantly reduce the sample complexity of learning.
We characterize the threshold policies by manifolds in
the state space of the system which determine when the
control action switches (e.g. ON/OFF in this study). We call
these manifolds switching manifolds; the control action
switches only when the state trajectory hits one of them, which
we refer to as an event. Figure 1(a) illustrates a schematic threshold
policy for the one-zone building example with switching ON
and OFF manifolds, while Fig. 1(b) depicts the thermal
dynamics of the building temperature under such a controller.
We can mathematically formulate the control action as
u(t) = { 0,      if T(t) ≥ T_OFF(θ);
         1,      if T(t) ≤ T_ON(θ);
         u(t⁻),  otherwise,    (3)
where T_OFF(θ) and T_ON(θ) are the thresholds (manifolds) for
switching OFF and ON, respectively, parameterized by the
parameter vector θ; these thresholds are in general
state-dependent. The goal is to find the optimal control policy
u*(t) within the parameterized policies, i.e. to find the
optimal parameter vector θ*, which maximizes the long-run
average reward (performance metric) J defined by (2) with
no prior knowledge of the system dynamics. In the next
section we cast this decision-making problem as an SMDP.</p>
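      <p>As a concrete illustration, the hysteresis law (3) can be sketched in a few lines of Python. This is a minimal sketch assuming constant, state-independent thresholds (the special case used later in the Results section); the function name and default values are illustrative, not part of the paper’s implementation.</p>

```python
def control_action(T, u_prev, T_on=12.5, T_off=17.5):
    """Threshold (hysteresis) policy of Eq. (3): switch the heater OFF
    when the temperature reaches T_off, ON when it falls to T_on, and
    otherwise keep the previous status u(t-)."""
    if T >= T_off:
        return 0        # hit the OFF manifold (an event)
    if T <= T_on:
        return 1        # hit the ON manifold (an event)
    return u_prev       # between the manifolds: no event, no switch
```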
    </sec>
    <sec id="sec-3">
      <title>SMDP formulation</title>
      <p>By defining the switching manifolds, the control problem
is reduced to learning the optimal manifolds. Once the
manifolds (the θ vector) are decided, the actual control
actions (u(t) ∈ {0, 1}) are automatically known based on
(3). We can thus think of the manifolds, hence the θ’s, as
the higher-level control actions and the ON/OFF heater
status (u(t)) as the lower-level control actions. These are
usually referred to as options and primitive actions in the
hierarchical RL framework (Sutton, Precup, and Singh
1999). By doing so we change our decision variables
from u(t) to θ. Although we could control (i.e. set the θ
values) and learn (i.e. learn a better θ) at fixed time steps, we
restrict them to the times when events occur, i.e. when the
system state trajectory hits a manifold. We do this because
making too many decisions in a short period of time (with
no significant accumulated reward) could result in large
variance, as discussed earlier. This change in the timing of
control and learning changes the underlying formulation
from an MDP with fixed transition times to an SMDP with
stochastic transition times.</p>
      <p>We study the control problem in an RL framework in
which an agent acts in a stochastic environment/system by
sequentially choosing actions with no knowledge of the
environment/system dynamics. We model the RL control
problem as an SMDP, defined as a five-tuple
(S, A, P, R, F), where S is the state space, A is the action
space, P is a set of state-action-dependent transition
probabilities, R is the reward function, and F is a function giving
the probability of the transition time, aka sojourn or dwell time,
for each state-action pair. Let τ_k be the decision epochs
(times), with τ_0 = 0, and let S_k ∈ S be the state variable at
decision epoch k. If the system is at state S_k = s_k at epoch
k and action A_k = a_k is applied, the system moves
to the state S_{k+1} = s_{k+1} at epoch k+1 with probability
p(s_{k+1} | s_k, a_k) = Pr(S_{k+1} = s_{k+1} | S_k = s_k, A_k = a_k).
This transition occurs within t_k time units with probability
F(t_k | s_k, a_k) = Pr(τ_{k+1} − τ_k ≤ t_k | S_k = s_k, A_k = a_k).
Hence, the SMDP kernel Pr(S_{k+1} = s_{k+1}, τ_{k+1} − τ_k ≤ t_k |
S_k = s_k, A_k = a_k) can be written as the product
p(s_{k+1} | s_k, a_k) F(t_k | s_k, a_k).</p>
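      <p>The factorization of the kernel suggests a simple way to simulate transitions: draw the next state from the embedded chain p, then draw the sojourn time from F independently. The sketch below assumes two hypothetical callables, p_next (returning a dictionary of next-state probabilities) and sample_sojourn (returning one draw of the transition time); neither is part of the paper.</p>

```python
import random

def sample_transition(p_next, sample_sojourn, s, a):
    """Sample one SMDP transition from the factored kernel
    p(s'|s,a) * F(t|s,a): next state from the embedded chain,
    then the dwell time until the next decision epoch from F."""
    probs = p_next(s, a)                 # e.g. {"s1": 0.3, "s2": 0.7}
    states = list(probs)
    s_next = random.choices(states, weights=[probs[x] for x in states])[0]
    t = sample_sojourn(s, a)             # sojourn (dwell) time
    return s_next, t
```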
      <p>The reward function for an SMDP is in general more
complex than that of an MDP. Between epochs (τ_k ≤ t′ ≤ τ_{k+1})
the system evolves based on the so-called natural process
W_{t′}. Let us suppose the reward between two decision epochs
consists of two parts: a fixed state-action-dependent reward
f(s_k, a_k), and a time-continuous reward accumulated over
the transition time at a rate c(W_{t′}, s_k, a_k). We can then
write the expected reward r(s_k, a_k) ∈ R(S_k, A_k) between
two decision epochs as
r(s_k, a_k) = f(s_k, a_k) + E[ ∫_{τ_k}^{τ_{k+1}} c(W_{t′}, s_k, a_k) dt′ | S_k = s_k, A_k = a_k ].    (4)
Let us also define the average transition time starting at state
s_k under action a_k as τ̄(s_k, a_k):
τ̄(s_k, a_k) = E[ τ_{k+1} − τ_k | S_k = s_k, A_k = a_k ] = ∫_0^∞ t F(dt | s_k, a_k).    (5)</p>
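      <p>As a sanity check on definitions (4) and (5), both quantities can be estimated by Monte Carlo for a single state-action pair. The sketch below assumes the special case where the accumulation rate c is constant over a sojourn, so the expected integral reduces to the rate times the mean sojourn time; sample_sojourn is a hypothetical draw from F.</p>

```python
def expected_reward_and_time(f_reward, c_rate, sample_sojourn, n=10000):
    """Monte-Carlo estimate of Eqs. (4)-(5) for one state-action pair,
    assuming a constant accumulation rate c over each sojourn, so that
    E[integral of c dt'] = c_rate * E[tau_{k+1} - tau_k]."""
    sojourns = [sample_sojourn() for _ in range(n)]
    tau_bar = sum(sojourns) / n       # Eq. (5): mean transition time
    r = f_reward + c_rate * tau_bar   # Eq. (4) under the constant-rate case
    return r, tau_bar
```

For example, with exponentially distributed sojourn times one could pass sample_sojourn=lambda: random.expovariate(0.5).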
      <p>The actions a_k of the SMDP are determined by a
stochastic or deterministic policy in each state. In many real-world
control problems the optimal and/or desired control
policy is deterministic. Hence, here we focus on
deterministic policies a = π(s), which deterministically map
the state s to the action a. Furthermore, as discussed
earlier, for the sake of scalability and sample efficiency we
restrict the control problem to a class of policies π_θ(s)
parameterized by the parameter vector θ. With this assumption
the expected rewards and transition times at each state
are functions of the state and the parameter vector θ,
i.e. r(s_k, a_k) = r(s_k, θ) and τ̄(s_k, a_k) = τ̄(s_k, θ). The
infinite-horizon average reward can then be written as
J(θ) = lim_{n→∞} E[ Σ_{k=0}^{n} r(s_k, θ) ] / E[ Σ_{k=0}^{n} τ̄(s_k, θ) ].    (6)</p>
      <p>An online learning algorithm can be devised if we can
compute a good estimate of the gradient ∇_θ J in an online
fashion, which can then be used to improve the policy
parameters via stochastic gradient ascent. But let us first draw
clear connections between the SMDP formulation presented
in this section and the micro-climate control problem of
the previous section. With the switching manifolds defined,
temperature thresholds become the actions of the underlying
SMDP. Let us take the building temperature (T) and the
heater status (h) at the beginning of an epoch as the state of
the system, i.e. s_k = [T_k, h_k]. Then we can write the actions as
a = π_θ(s) = h T_OFF(s) + (1 − h) T_ON(s), where T_OFF(s)
and T_ON(s) are the threshold temperatures for switching the
heater OFF and ON, respectively, and could in general
be state-dependent. Regarding the rewards, by comparing
equations (2) and (4) one can conclude that f(s_k, a_k) = −r_sw
and c(W_{t′}, s_k, a_k) = −r_e u(t′) − r_c (T(t′) − T_d)².</p>
      <p>If r(s, a) and τ̄(s, a) are known and we somehow have
access to the system dynamics, we can estimate J(θ) for a
given parameter vector θ by constructing a long sequence
s_0, a_0, r_0, τ_0, …, s_n, a_n, r_n, τ_n via simulation. (For
the average reward to be independent of the initial state,
the embedded MDP is required to be unichain.) If we do
this for different values of θ we can approximate the
performance metric J as a function of θ. We can then use this
approximation to estimate the performance gradient and use
it to improve the actual policy via e.g. stochastic gradient
ascent. The idea here is to construct the above-mentioned
trajectory sequence with online learning of r(s, a) and τ̄(s, a),
but without learning the system dynamics. This is
possible because of our choice of policies, namely the threshold
policies. Since the action a_k is a temperature threshold, the
temperature at the next epoch is automatically revealed at
the current epoch, i.e. T_{k+1} = a_k. Moreover, because these
thresholds are switching manifolds, the heater status must
switch at the next epoch, i.e. h_{k+1} = 1 − h_k. However,
in more complex set-ups we may not be able to fully
deduce the next state of the system via threshold policies. For
instance, if the system state includes the electricity price, we
cannot fully evaluate s_{k+1} based on s_k and a_k; but we can
still construct a less accurate sequence of transitions, which
can be sufficient since we usually do not need a very
accurate estimate of ∇_θ J for online learning. The online
learning control explained here is schematically illustrated in the
form of a block diagram in Fig. 2.</p>
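      <p>The rollout described above can be sketched as follows. Here r_hat and tau_hat stand for learned estimates of r(s, θ) and τ̄(s, θ) (hypothetical callables); note that the loop never touches the building dynamics: the next state follows from T_{k+1} = a_k and h_{k+1} = 1 − h_k alone, and J(θ) is estimated as the ratio of sums in Eq. (6).</p>

```python
def estimate_J(r_hat, tau_hat, theta, s0, n=1000):
    """Estimate the long-run average reward J(theta) of Eq. (6) by rolling
    out the embedded chain of the SMDP without a dynamics model."""
    T_on, T_off = theta
    T, h = s0
    total_r, total_tau = 0.0, 0.0
    for _ in range(n):
        a = h * T_off + (1 - h) * T_on  # a = pi_theta(s): the next threshold
        total_r += r_hat((T, h), theta)
        total_tau += tau_hat((T, h), theta)
        T, h = a, 1 - h                 # next state revealed by the policy itself
    return total_r / total_tau          # ratio estimator of Eq. (6)
```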
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>In this section we apply our proposed method to
control the heating system of a one-zone building in order to
minimize energy consumption without jeopardizing the
occupants’ comfort. We use a simplified linear model
characterized by a first-order ordinary differential equation:
C dT/dt + K(T − T_o) = u(t) Q̇_h,    (7)
where C = 2000 kJ K⁻¹ is the building’s heat capacity,
K = 325 W K⁻¹ is the building’s thermal conductance,
and Q̇_h = 13 kW is the heater’s power. As defined earlier,
u(t) ∈ {0, 1} is the heater status, and T_o = 10 °C is the
outdoor temperature. The reward rates are set as follows:
r_sw = 0.8 unit, r_e = 1.2/3600 unit s⁻¹, and r_c =
1.2/3600 unit K⁻² s⁻¹.</p>
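      <p>The model (7) under a fixed threshold policy is easy to reproduce with a forward-Euler simulation; the sketch below returns the heater’s duty cycle over the horizon. The time step and horizon are arbitrary choices, and C is converted to J K⁻¹ so that the units (W, s, K) are consistent.</p>

```python
def simulate(T0=15.0, t_end=24 * 3600, dt=1.0,
             C=2000e3, K=325.0, Q_h=13e3, T_o=10.0,
             T_on=12.5, T_off=17.5):
    """Forward-Euler integration of C dT/dt + K (T - T_o) = u(t) Q_h
    under the threshold policy of Eq. (3); returns the fraction of
    time the heater is ON (its duty cycle)."""
    T, u, on_time = T0, 0, 0.0
    steps = int(t_end / dt)
    for _ in range(steps):
        if T >= T_off:
            u = 0                       # OFF event
        elif T <= T_on:
            u = 1                       # ON event
        T += (u * Q_h - K * (T - T_o)) / C * dt
        on_time += u * dt
    return on_time / t_end
```

With the parameters above, the losses across the comfort band are roughly K·(T − T_o) ≈ 0.8 to 2.4 kW against a 13 kW heater, so the duty cycle settles near 10 percent.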
      <p>The optimal control for this example is indeed a threshold
policy with constant ON/OFF thresholds. Via brute-force
simulation and search, the optimal thresholds are found to
be T_ON = 12.5 °C and T_OFF = 17.5 °C, with a
corresponding long-run average reward of J = −3.70 unit hr⁻¹ (see
Fig. 3). These are the ground-truth thresholds for the optimal
control of the building, which our learning controller (Fig.
2) should learn using a stream of online data.</p>
      <p>Since we know the optimal controller has fixed
temperature thresholds, we can represent the control policy with
only two parameters (a two-component θ vector), i.e. the
thresholds themselves (T_ON and T_OFF). Also, we use neural
networks for function approximation: a network with one hidden
layer and 24 hidden nodes for r(s, a) and τ̄(s, a), and
another network with one hidden layer and 10 hidden nodes
for J(θ). It is worth noting that our proposed control is
off-policy; a very exploratory behaviour policy is
employed for the learning simulation. Figure 4 illustrates
how the controller learns the optimal thresholds in less
than a week. The learnt thresholds at the end of the learning
process are T_ON = 12.5 °C and T_OFF = 17.4 °C, which are
almost identical to the optimal thresholds.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this study we proposed an SMDP framework for
RL-based control of micro-climate in buildings. We utilized
threshold policies in which the learning and control take
place when the thresholds are reached. This results in
variable time intervals for learning and control, which
makes the SMDP framework more suitable for this class
of control problems. Using the threshold policies, we
developed a model-based policy gradient RL approach for the
controller. We showed via simulation the efficacy of our
approach in controlling the micro-climate of a single-zone
building.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Avendano</surname>
            ,
            <given-names>D. N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ruyssinck</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Vandekerckhove,
          <string-name>
            <given-names>S.; Van</given-names>
            <surname>Hoecke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ; and
            <surname>Deschrijver</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Data-driven optimization of energy efficiency and comfort in an apartment</article-title>
          .
          <source>In 2018 International Conference on Intelligent Systems (IS)</source>
          ,
          <fpage>174</fpage>
          -
          <lpage>182</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Barrett</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Linder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Autonomous hvac control, a reinforcement learning approach</article-title>
          .
          <source>In Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
          ,
          <volume>3</volume>
          -
          <fpage>19</fpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Norford</surname>
            ,
            <given-names>L. K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Samuelson</surname>
            ,
            <given-names>H. W.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Malkawi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Optimal control of hvac and window systems for natural ventilation through reinforcement learning</article-title>
          .
          <source>Energy and Buildings</source>
          <volume>169</volume>
          :
          <fpage>195</fpage>
          -
          <lpage>205</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xia</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Satisfaction based q-learning for integrated lighting and blind control</article-title>
          .
          <source>Energy and Buildings</source>
          <volume>127</volume>
          :
          <fpage>43</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and Wen,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>Energy-efficient thermal comfort control in smart buildings via deep reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1901</source>
          .04693.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Heemels</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; Johansson,
          <string-name>
            <surname>K. H.</surname>
          </string-name>
          ; and Tabuada,
          <string-name>
            <surname>P.</surname>
          </string-name>
          <year>2012</year>
          .
          <article-title>An introduction to event-triggered and self-triggered control</article-title>
          .
          <source>In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC)</source>
          ,
          <fpage>3270</fpage>
          -
          <lpage>3285</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Hosseinloo</surname>
            ,
            <given-names>A. H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ryzhov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bischi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ouerdane</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Turitsyn</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and Dahleh,
          <string-name>
            <surname>M. A.</surname>
          </string-name>
          <year>2020</year>
          .
          <article-title>Data-driven control of micro-climate in buildings; an event-triggered reinforcement learning approach</article-title>
          . arXiv preprint arXiv:
          <year>2001</year>
          .10505.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>Li, Y.; Wen, Y.; Tao, D.; and Guan, K. 2019. Transforming cooling optimization for green data center via deep reinforcement learning. IEEE Transactions on Cybernetics.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>Liu, S., and Henze, G. P. 2006. Experimental analysis of simulated reinforcement learning control for active and passive building thermal storage inventory: Part 1. Theoretical foundation. Energy and Buildings 38(2):142-147.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>Minoli, D.; Sohraby, K.; and Occhiogrosso, B. 2017. IoT considerations, requirements, and architectures for smart buildings: energy optimization and next-generation building management systems. IEEE Internet of Things Journal 4(1):269-283.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>Mozer, M. C. 1998. The neural network house: An environment that adapts to its inhabitants. In AAAI Spring Symposium on Intelligent Environments. AAAI Press.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>Mozer, M. C., and Miller, D. 1997. Parsing the stream of time: The value of event-based segmentation in a complex real-world control problem. In International School on Neural Networks, Initiated by IIASS and EMFCSC, 370-388. Springer.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>Munos, R. 2006. Policy gradient in continuous time. Journal of Machine Learning Research 7(May):771-791.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>Naik, A.; Shariff, R.; Yasui, N.; and Sutton, R. S. 2019. Discounted reinforcement learning is not an optimization problem. arXiv preprint arXiv:1910.02140.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2):181-211.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>Wei, T.; Wang, Y.; and Zhu, Q. 2017. Deep reinforcement learning for building HVAC control. In Proceedings of the 54th Annual Design Automation Conference 2017, 22. ACM.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>