<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards the Use of Quality of Service Metrics in Reinforcement Learning: A Robotics Example</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>J. F. Inglés-Romero</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>J. M. Espín</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>R. Jiménez</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>R. Font</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>C. Vicente-Chicote</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Biometric Vox S.L.</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Infomicro Comunicaciones S.L.</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Extremadura, QSEG, Escuela Politécnica de Cáceres</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>Service robots are expected to operate in real-world environments, which are inherently open-ended and present a huge number of potential situations and contingencies. This variability can be addressed by applying reinforcement learning, which enables a robot to autonomously discover an optimal behavior through trial-and-error interactions with its environment. The process is carried out by measuring the improvements the robot achieves after executing each action. In this regard, RoQME, an Integrated Technical Project of the EU H2020 RobMoSys Project, aims at providing global robot Quality-of-Service (QoS) metrics in terms of non-functional properties, such as safety, reliability, efficiency or usability. This paper presents preliminary work in which the estimation of these metrics at runtime (based on the contextual information available) is used to enrich the reinforcement learning process.</p>
      </abstract>
      <kwd-group>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Quality of Service</kwd>
        <kwd>RoQME</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With the advance of robotics and its increasingly widespread use in all kinds of
real-world applications, service robots are expected to operate (at least safely and with
reasonable performance) in different environments and situations. In this sense, a
primary goal is to produce autonomous robots, capable of interacting with their
environments and learning behaviors that allow them to improve their overall performance
over time, e.g., through trial and error. This is the idea behind Reinforcement
Learning (RL) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which offers a framework and a set of tools for the design of
sophisticated and hard-to-engineer behaviors.
      </p>
      <p>
        In RL, an agent observes its environment and interacts with it by performing an
action. After that, the environment transitions to a new state providing a reward. The
goal is to find a policy that optimizes the long-term sum of rewards. One of the
fundamental problems of RL is the so-called curse of goal specification [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Rewards are an essential part of any RL problem, as they implicitly determine the
desired behavior. However, the specification of a good reward function can be highly
complex. This is, at least in part, because it requires accurately quantifying the
rewards, which does not fit well with the natural way people express objectives.
      </p>
      <p>
        The RoQME Integrated Technical Project (ITP), funded by the EU H2020 RobMoSys
Project [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], aims at contributing a model-driven tool-chain for dealing with
system-level non-functional properties, enabling the specification of global Robot Quality of
Service (QoS) metrics. RoQME also aims at generating RobMoSys-compliant
components, ready to provide other components with QoS metrics. The estimation of
these metrics at runtime, in terms of the contextual information available, can then be
used for different purposes, e.g., as part of a reward function in RL.
      </p>
      <p>Integrating QoS metrics in the rewards can enrich the learning process by
extending the quality criteria considered, for example, with non-functional properties, such
as user satisfaction, safety, power consumption or reliability. Moreover, RoQME
provides a simple modeling language to specify QoS metrics, in which qualitative
descriptions predominate over quantitative ones. As a result, RoQME limits the use of
numbers, promoting a more natural way of expressing problems without introducing
ambiguity in its execution semantics.</p>
      <p>This paper presents a work in progress towards the use of QoS metrics in RL.
Santa Bot, a “toy” example in which a robot delivers gifts to children, will help us
illustrate the problem in simple terms.</p>
      <p>The rest of the paper is organized as follows. Section 2 describes the Santa Bot
example. Section 3 introduces the RoQME modeling language. Section 4 shows some
simulation results for the Santa Bot example. Section 5 reviews related work and,
finally, Section 6 draws some conclusions and outlines future work.
</p>
    </sec>
    <sec id="sec-2">
      <title>Santa Bot: an illustrative example</title>
      <p>This section introduces the example that will be used throughout the paper to illustrate
the proposed approach. The goal is to present the reinforcement learning problem in
simple terms to explain the role that the RoQME QoS metrics can play. The example
takes place in a shopping mall where a robot, called Santa Bot, distributes gifts to
children. In the area set up for this purpose, a number of children wait in line to
receive a gift from Santa Bot. Each child has a (finite) list of wishes, containing his/her
most desired toys in order of preference. Unfortunately, these lists were made in
secret and are unknown to Santa Bot. Santa Bot will try to guess the best gift for each
child to meet their expectations and thus maximize their joy.
</p>
      <sec id="sec-2-1">
        <title>Formalizing the example scenario</title>
        <p>Let us consider a queue of M children waiting to receive a gift, i.e., h_1,
h_2, \dots, h_M. As the order prevails, h_i will receive a gift after h_{i-1} and
before h_{i+1}. Moreover, we identify all the different types of toys with natural
numbers, i.e., Toys = \{1, 2, 3, \dots, K\}. At time t, the Santa Bot bag contains a number of
instances of each toy k, denoted by n_k^t \in \{0, 1, \dots, n_k^0\}, being n_k^0 the initial amount. As
gifts are delivered, the number of instances of a toy decreases, remaining at 0 when
the toy is no longer available; therefore, n_k^{t+1} \le n_k^t. In this scenario, the action of
Santa Bot is limited to deciding which gift is given to each child. Listing 1 specifies the
effect of this action.</p>
        <preformat>
Being h_i the child to receive a gift at time t:
  deliver gift k to child_i
  [pre-condition]  n_k^t &gt; 0
  [post-condition] n_k^{t+1} = n_k^t - 1
        </preformat>
        <p>Listing 1. Specification of the delivery action in the example scenario.</p>
          <p>Although Santa Bot can adopt different strategies for delivering the gifts in its bag,
the best approach will be the one that maximizes children's satisfaction. In this case,
satisfaction is associated with the ability to fulfill the children's wishes, expressed in
their wish lists. Thus, we consider that each child has a wish list represented by an
N-tuple (w_1, w_2, \dots, w_N), whose entries are toy identifiers (w_p \in Toys) and show uniqueness
(\forall p, q \in \{1, 2, \dots, N\}: w_p = w_q \iff p = q) and order (w_p is preferred over w_q if and only if
p &lt; q). Moreover, the function wl(h_i) = (w_1, w_2, \dots, w_N) links children to wish lists,
such that wl(h_1) = (2, 6, 1) indicates that the first child in the queue wants toys 2,
6, and 1, in order of preference.</p>
          <p>Equation 1 shows a possible definition of the satisfaction of child h_i when receiving a toy j.
This function provides a score that determines the goodness of a decision, so the
higher its value the better.</p>
          <p>S(h_i, j) = \sum_{\forall w_p \in wl(h_i)} f(p) \cdot \delta(j - w_p)    (1)</p>
          <p>Being f a decreasing positive function and \delta the Kronecker delta function, i.e., \delta(x) =
1 when x = 0, otherwise it is 0. It is worth noting that Equation 1 produces 0 if the
decision does not match any option in the wish list; otherwise, it increases as the selected
gift has a higher position in this list. Finally, the result of the problem is the entire
sequence of decisions made for all the children, i.e., d = (d_1, d_2, \dots, d_M), where d_i
indicates the toy delivered to h_i. Equation 2 shows the overall satisfaction considering
the complete sequence of decisions d.</p>
          <p>S(d) = \sum_{\forall i} S(h_i, d_i)    (2)</p>
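          <p>To make the scoring concrete, the following minimal Python sketch (our own illustration,
not part of the RoQME tooling; the names and the choice f(p) = N - p + 1, one decreasing
positive function among many, are assumptions) computes Equations 1 and 2:</p>
          <preformat>
# Minimal sketch of Equations 1 and 2 (illustrative names and f).
def satisfaction(wish_list, child, toy):
    """Equation 1: sum of f(p) * delta(toy - w_p) over the wish list."""
    wl = wish_list[child]
    n = len(wl)
    # The Kronecker delta keeps at most one matching position p.
    return sum((n - p + 1) for p, w in enumerate(wl, start=1) if w == toy)

def overall_satisfaction(wish_list, decisions):
    """Equation 2: total satisfaction over the whole sequence of decisions."""
    return sum(satisfaction(wish_list, child, toy)
               for child, toy in decisions.items())

# Example: the first child in the queue wants toys 2, 6 and 1, in order.
wish_list = {"h1": (2, 6, 1)}
assert satisfaction(wish_list, "h1", 2) == 3  # best match: f(1) = 3
assert satisfaction(wish_list, "h1", 5) == 0  # toy not on the list
          </preformat>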
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>The reinforcement learning problem</title>
        <p>
          Santa Bot poses an optimization problem whose optimal solution could be computed
using integer linear programming if all wish lists were known. However, since this is
not the case, the robot is expected to autonomously discover the optimal solution
through trial-and-error interactions with its environment. In Reinforcement Learning
(RL) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], an agent observes its environment and interacts with it by performing an
action. After that, the environment transitions to a new state providing a reward. The
goal of the algorithm is to find a policy that optimizes the long-term sum of rewards.
The main elements of an RL problem (states, transitions, actions and rewards) are
usually modeled as a Markov Decision Process (MDP) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Fig. 1 shows the basic MDP
specification for the Santa Bot example. It is worth noting that, for the sake of
simplicity, we will not delve into the details of RL and MDP.
        </p>
        <p>The Santa Bot environment considers two sets of states: In-front and Leaving. The
former indicates that a child is in front of Santa Bot waiting for a gift. In this situation,
the state is defined in terms of the gifts available (i.e., /0) and some observable
features of the child. In the example, we have supposed that the robot can perceive the
apparent age, the gender and the predominant color of the child’s clothes. Ideally,
these features will include sufficient information to trace preferences and common
tastes among children of similar appearance. Indeed, a successful learning process
should be able to detect these tendencies and exploit them by making good decisions.</p>
        <p>
          Once the robot performs the delivery action, the environment transitions from
In-front to Leaving. Note that it returns to In-front when a new child arrives. Leaving
integrates the satisfaction of the child with the gift, which is represented by a QoS
metric. This metric provides a real value in the range [0, 1] indicating how much the
child liked the gift (with 1 being the highest degree of satisfaction). The reward function
will depend on this value, e.g., see Equation 3, where the reward r changes linearly
from 0 to \beta according to the satisfaction s.
        </p>
        <p>r = \beta \cdot s,   \beta \in \mathbb{R}    (3)</p>
        <p>The reward function is an essential part of any RL problem, as it implicitly
determines the desired behavior we want to achieve in our system. However, it is very
difficult to establish a good reward mechanism in practice. Note that Equation 3
seems simple because we have moved the complexity to the specification of the QoS
metric, i.e., to how the robot measures satisfaction. The following section illustrates
how RoQME can alleviate the complexity of specifying rewards by supporting the
definition of QoS metrics.</p>
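        <p>As an illustration of how such a reward could feed the learning algorithm, the
following sketch shows a tabular epsilon-greedy Q-learning step (a simplification under
assumed constants and state encoding, not the actual implementation), where the reward
is derived from the satisfaction metric value s as in Equation 3:</p>
        <preformat>
import random

# Illustrative tabular Q-learning step; Q maps (state, action) to values.
# States would encode the observable child features and the gifts left.
ALPHA, GAMMA, EPSILON, BETA = 0.1, 0.9, 0.2, 1.0  # assumed constants

def choose_action(Q, state, actions):
    # Epsilon-greedy: explore with probability EPSILON, else exploit.
    if random.random() &lt; EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def q_update(Q, state, action, satisfaction, next_state, next_actions):
    reward = BETA * satisfaction  # Equation 3: linear in the QoS metric
    best_next = max((Q.get((next_state, a), 0.0) for a in next_actions),
                    default=0.0)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
        </preformat>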
        <p>[Fig. 1. Basic MDP specification for the Santa Bot example: the deliver (gift k)
action moves the environment from the In-front state (defined by the gifts available
and the observable child features) to the Leaving state, whose satisfaction QoS metric
determines the reward; the arrival of the next child returns it to In-front.]</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>The RoQME modeling language</title>
      <p>
RoQME aims at providing robotics engineers with a model-driven tool-chain
allowing them to: (1) specify system-level non-functional properties; and (2) generate
RobMoSys-compliant components, ready to provide other components with QoS
metrics defined on the previous non-functional properties. In the following, we use
the Santa Bot example to present the main modeling concepts of RoQME and how
they are translated into QoS metrics at runtime. More information about the RoQME
meta-model and its integration into RobMoSys can be found in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>
          The previous section left open the specification of the QoS metric for measuring the
satisfaction of a child after receiving a gift (hereinafter, simply denoted as
satisfaction). It is worth noting that a QoS metric is not a modeling concept in RoQME, but
rather a runtime artifact implicitly bound to a non-functional property.
Non-functional properties, which can be thought of as particular quality aspects of a
system, are included in the modeling language, thus, in our example, satisfaction is
modeled as a non-functional property using the keyword property (see line 4
in Listing 2). Regarding the execution semantics, a RoQME model abstracts a Belief
Network [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], in which properties are represented by unobserved Boolean variables. In
this sense, the variable associated with satisfaction would indicate whether or not
Santa Bot is optimal in terms of this property. The runtime quantification of this belief
results in the corresponding QoS metric value. For example, a resulting value of 0.67
can be understood as the probability of the gift being satisfactory for the child.
        </p>
        <p>The belief of a property (i.e., the QoS metric) fluctuates over time according to the
evidence observed by the robot in its environment (contextual information). RoQME
allows specifying observations (observation) as conditions in terms of context
variables (context), so that the detection of an observation will reinforce (or undermine)
the belief. In the belief network, observations are evidence variables that exhibit a
direct probabilistic dependence with the property.</p>
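        <p>To illustrate these execution semantics, the following sketch (a hand-rolled
simplification: the likelihood values are assumptions and the conditioning on age is
omitted) updates the belief of the satisfaction property with Bayes' rule as observations
arrive:</p>
        <preformat>
# Simplified Bayesian update of the Boolean satisfaction property.
# Likelihoods are illustrative: P(obs | satisfied), P(obs | not satisfied).
LIKELIHOODS = {
    "SURPRISE": (0.9, 0.1),  # reinforces highly
    "JOY":      (0.7, 0.3),  # reinforces
    "SADNESS":  (0.3, 0.7),  # undermines
    "ANGER":    (0.1, 0.9),  # undermines highly
}

def update_belief(prior, observations):
    """Return P(satisfied | observations), starting from the prior."""
    belief = prior
    for obs in observations:
        p_true, p_false = LIKELIHOODS[obs]
        num = p_true * belief
        belief = num / (num + p_false * (1.0 - belief))
    return belief

print(update_belief(0.5, ["SURPRISE"]))        # about 0.9
print(update_belief(0.5, ["SADNESS", "JOY"]))  # about 0.5: they cancel out
        </preformat>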
        <p>Lines 5-8 in Listing 2 show four observations for the Santa Bot example. These
observations use the following context variables: (1) face, which indicates the facial
expression of the child after receiving a gift; and (2) age, the apparent age of the child
perceived by the robot. Note that the robot will continuously feed RoQME with this
contextual information. Each observation will reinforce or undermine the child
satisfaction according to the emotion expressed by the child. We have assumed that
surprise and anger are stronger emotions than joy and sadness, and thus the first ones
should have a higher influence on satisfaction than the second ones. Moreover,
observations 1 and 4 are conditioned by age, which means that strong reactions tend to be
more or less frequent depending on the age. Note that toddlers may tend to react more
vividly than school-aged children; age is therefore used to normalize the effect of the
observation among children of different ages.</p>
        <preformat>
context face : eventType {JOY, SURPRISE, NEUTRAL, SADNESS, ANGER}
context age : enum {TODDLER, PRESCHOOLER, SCHOOL_AGED, ADOLESCENT}
context prevSatisfaction : number
property satisfaction : number prior prevSatisfaction
observation obs1 : SURPRISE reinforces satisfaction highly conditionedBy age
observation obs2 : JOY reinforces satisfaction
observation obs3 : SADNESS undermines satisfaction
observation obs4 : ANGER undermines satisfaction highly conditionedBy age
        </preformat>
        <p>Listing 2. A simple RoQME model for measuring children's satisfaction.</p>
          <p>Finally, the context variable prevSatisfaction provides the satisfaction of the child
who received the gift just before the current one. We want to use this information to
model possible influences between two consecutive children, e.g., a child expressing
anger could have an effect on the behavior of the following child. This is
implemented by defining a bias in the prior probability of satisfaction (see line 4).</p>
          <p>As we have already mentioned, a RoQME model translates all its semantics into a
belief network. Fig. 2 shows the qualitative specification of the network resulting
from the model in Listing 2. Note that, for the sake of clarity, we have omitted
probabilities. This belief network will be the “brain” of a generated RobMoSys-compliant
component aimed at measuring satisfaction. In general, the generated component will
estimate the value of each non-functional property, specified in the RoQME model,
by successively processing the available contextual information, either from internal
(e.g., robot sensors) or external (e.g., web services, other robots, etc.) sources. The
contextual information received by the component will be sequentially processed by:
(1) a context monitor that receives raw contextual data and produces context events;
(2) an event processor that searches for the event patterns specified in the RoQME
model and, when found, produces observations; and, finally (3) a probabilistic
reasoner that computes a numeric estimation for each metric. This information could
then be used by other components, e.g. the robot task sequencer could integrate the
RL process to adapt the robot behavior according to the provided QoS metrics.</p>
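          <p>A minimal sketch of this processing chain could look as follows (class and
method names are hypothetical, not the interface of the generated component; it reuses
update_belief from the earlier sketch):</p>
          <preformat>
# Sketch of the three-stage pipeline inside the generated component.
class ContextMonitor:
    def to_event(self, raw):
        # Turn raw contextual data (e.g., a sensor reading) into an event.
        return {"type": raw["kind"], "value": raw["value"]}

class EventProcessor:
    PATTERNS = {"face"}  # event patterns declared in the RoQME model
    def to_observation(self, event):
        # Emit an observation only when a modeled pattern matches.
        return event["value"] if event["type"] in self.PATTERNS else None

class ProbabilisticReasoner:
    def __init__(self, prior=0.5):
        self.belief = prior
    def process(self, observation):
        # Belief-network update (here, the update_belief sketch above).
        self.belief = update_belief(self.belief, [observation])
        return self.belief

def qos_metric(raw_stream, monitor, processor, reasoner):
    # Each new metric value can be consumed by other components.
    for raw in raw_stream:
        obs = processor.to_observation(monitor.to_event(raw))
        if obs is not None:
            yield reasoner.process(obs)
          </preformat>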
          <p>[Fig. 2. Qualitative specification of the belief network resulting from the model
in Listing 2, with nodes: Previous satisfaction, Satisfaction, Age, and Obs 1 to Obs 4.]</p>
    </sec>
    <sec id="sec-4">
      <title>Simulation results</title>
      <p>This section presents the simulation results obtained on the Santa Bot example.
Before detailing the results, let us describe the simulation setting.</p>
          <p>Queues of children. We have generated random queues of length 50 with children
showing different features (i.e., age, gender and clothes color). In particular, we have
considered four age groups: (1) toddlers, (2) preschoolers, (3) school-aged children
and (4) adolescents, with probabilities [0.4, 0.3, 0.2, 0.1], and with
uniformly distributed gender and clothes colors (5 colors).</p>
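          <p>For instance, such queues could be generated as in the following sketch (a
NumPy-based illustration with hypothetical feature encodings):</p>
          <preformat>
import numpy as np

rng = np.random.default_rng(0)
AGES = ["TODDLER", "PRESCHOOLER", "SCHOOL_AGED", "ADOLESCENT"]
COLORS = ["RED", "GREEN", "BLUE", "YELLOW", "WHITE"]  # assumed names

def random_queue(length=50):
    # Ages follow the stated distribution [0.4, 0.3, 0.2, 0.1];
    # gender and clothes color (5 colors) are uniformly distributed.
    return [{"age": str(rng.choice(AGES, p=[0.4, 0.3, 0.2, 0.1])),
             "gender": str(rng.choice(["F", "M"])),
             "color": str(rng.choice(COLORS))}
            for _ in range(length)]

queue = random_queue()
assert len(queue) == 50
          </preformat>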
          <p>Wish lists. For each child, we have produced a 3-item list from 20 different toys.
Preferences were distributed depending on the children features (age, gender and
clothes color). These dependencies create tendencies that are expected to be detected
and exploited by the learning process. The correlation is clearly visible in Fig. 3,
where the heat map represents favorite toys in relation to children features.</p>
          <p>Children reactions. As we described in Section 3, RoQME estimates satisfaction
considering as contextual information the age and the facial expression of the children.</p>
          <p>While the former has already been defined, the children reactions need to be
established. For that, we have assigned a facial expression to each child receiving a particular
gift, according to his/her features and wish list. The idea is to introduce inclinations
similar to those described in Section 3. Obviously, the algorithm will not be able
to learn if the RoQME model for satisfaction does not reflect reality.</p>
          <p>Simulations. We have executed the simulation over 1000 episodes, where each episode
consists of a new queue of 50 children. In addition, we have considered that Santa Bot
has an unlimited number of gifts available for each type of toy. The left side of Fig. 4
shows the learning process at an early stage (episode 25), in which the exploration of
the states (i.e., choosing a random action) is preferred to acquire new information and
discover the best actions. This can be seen in the upper heat map, where Santa Bot
delivers gifts following a roughly uniform approach. As for the cumulative rewards
represented in the lower heat map, they begin to show the correlations we have introduced in the
data (see Fig. 3). On the other hand, the right side of Fig. 4 shows the learning process
at an advanced stage (episode 1000), in which exploitation (i.e., choosing the best
action according to the information already learned at that moment) is preferred over
exploration. The upper heat map shows how the process seems to prioritize the gifts
that have provided greater rewards. As for the lower map, it is similar to Fig. 3, which
means the learning process was successful.</p>
          <p>Fig. 4. Top, the number of visits to each Q-matrix cell; bottom, the Q-matrix. (Left) Episode 25
of the learning algorithm. (Right) Episode 1000 of the learning algorithm.</p>
          <p>Fig. 5. (Left) Score evolution over 1000 episodes. (Right) Q-matrix after 1000 episodes, in
which the RoQME model estimates satisfaction wrongly.</p>
          <p>The left side of Fig. 5 shows the evolution of the learning process, in which the score
seems to stabilize after 400 episodes. The system has achieved an average
score of 72.03% and a maximum score of 84.67% with respect to the optimal solution
(the one computed with known wish lists). Finally, we have modified the RoQME model to
observe the effect of modeling wrongly. The right side of Fig. 5 shows that, in this
case, the process fails to learn.</p>
    </sec>
    <sec id="sec-3">
      <title>Related work</title>
      <p>
        Reinforcement learning emerged as a combination of optimal control (using dynamic
programming) and trial-and-error learning (inspired by the animal world) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In recent
years, thanks to more powerful computing systems and new deep learning
techniques, RL has received a boost in domains such as video games and
simulations [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In the field of robotics, optimization problems have a temporal structure,
for which RL techniques seem suitable. However, there may be cases with
high-dimensional continuous states and actions, where states are neither fully observable
nor noise-free. Such cases generally result in modeling errors that accumulate
over time, so training with physical agents is needed [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        In the literature, we can find numerous RL techniques applied to diverse robotic
tasks. For example, robot manipulators aimed at learning how to reach a certain
position or open a door [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7-9</xref>
        ]; or mobile robots that learn how to move in crowded
spaces [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Recently, RL has been used to teach a car how to drive in a short period of
time with just a camera and feedback of speed and steering angle [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Despite the large number of applications, one of the fundamental problems of RL
is the so-called curse of goal specification [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Rewards are a crucial part of
any RL problem, as they implicitly determine the desired behavior we want to
achieve. However, the specification of a good reward function can be highly complex.
In this sense, domain experts may be required to define the proper reward function.
      </p>
      <p>
        To alleviate this problem, different techniques have been used to define reward
functions. A perfect scenario would be a human supervising the agent and providing
feedback about the reward that the system should give according to its actions, but
this is expensive and error-prone. In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the authors use the EEG signals of a human user as a
reward function, so that the user unconsciously teaches the agent to act as desired.
      </p>
      <p>
        Another way is to first train a learner to discover which actions people consider better
than others, and then use it in RL to reward the agent as a person would.
The approach applied in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] creates a Bayesian model based on
feedback from experts. On the contrary, in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], non-expert people are considered to train
the learner with reinforcement learning techniques.
      </p>
      <p>
        Rules loosely defined at design time can also be introduced into the learning
process as a bias to enhance the behavior, as in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], where the robot has to respect
certain social navigation rules.
      </p>
      <p>
        Regarding QoS, in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] the system tries to autonomously improve the quality of the
services of the robot by adapting its component-based architecture and applying RL to
meet the user preferences. The system learns how to estimate the non-functional
properties during the process, as it has no prior knowledge about them.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and future work</title>
      <p>
        This paper presents preliminary work on the integration of RoQME QoS metrics
into the reward strategy of RL problems. Moreover, we have introduced and
formalized Santa Bot, an optimization “toy” example inspired by Santa Claus, used to
illustrate the explanations in simple terms. In the following, we highlight some remarks:
• The execution semantics of a QoS metric relies on a belief network, which is a
well-known mathematical abstraction that has been successfully applied to many
domains, such as medical diagnosis and natural language processing.
Consequently, we can benefit from existing tools and techniques that are used for the analysis
and simulation of probabilistic networks.
• The RoQME modeling language allows users to transparently specify the
qualitative part of the underlying belief network (i.e. nodes and arcs of the directed
acyclic graph). The RoQME framework is in charge of automatically completing the
quantitative part of the network (i.e., the conditional probability tables). As
quantification is often referred to as a major obstacle in building probabilistic
networks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], RoQME eases the modeling process by abstracting probabilities. In this
sense, although there are many probabilistic programming languages [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] that can
be used to specify belief networks, unlike RoQME, they usually need a detailed
specification of probabilities.
• Although the specification of RoQME QoS metrics does not need to be addressed
by domain experts, a RoQME model that does not sufficiently represent reality will
have a great impact on the learning process.
• We have simulated the Santa Bot example considering an unlimited number of
gifts. Although this relaxation of the problem has not affected the explanations, we
still plan to take more advantage of the example and to apply our approach to
more realistic robotics scenarios.
      </p>
      <p>For the future, we plan to continue exploring the potential of RoQME QoS metrics
applied to RL. We also intend to study ways of improving the QoS modeling process.</p>
      <sec id="sec-4-1">
        <title>Acknowledgements</title>
        <p>RoQME has received funding from the European Union’s H2020 Research and
Innovation Programme under grant agreement No. 732410, in the form of financial
support to third parties of the RobMoSys project.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Kober</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bagnell</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Reinforcement learning in robotics: A survey</article-title>
          .
          <source>The International Journal of Robotics Research</source>
          <volume>0</volume>
          (
          <issue>0</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          , (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. RobMoSys website, http://robmosys.eu, last accessed
          <year>2018</year>
          /07/24.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Norvig</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Artificial Intelligence: A Modern Approach</article-title>
          . 3rd edn. Upper Saddle River, NJ, USA: Prentice Hall Press, (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Vicente-Chicote</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Inglés-Romero</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Martinez</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A Component-Based and Model-Driven Approach to Deal with Non-Functional Properties through Global QoS Metrics</article-title>
          . 5th
          <source>International Workshop on Interplay of Model-Driven and Component-Based Software Engineering (ModComp)</source>
          . Copenhagen, Denmark (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Barto</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Reinforcement learning: an introduction. 1st edn</article-title>
          . The MIT Press, Cambridge, Massachusetts, USA, (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antonoglou</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wierstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Riedmiller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Playing Atari with Deep Reinforcement Learning</article-title>
          ,
          <source>NIPS Deep Learning Workshop</source>
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holly</surname>
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lillicrap</surname>
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Levine</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates</article-title>
          .
          <source>2017 IEEE International Conference on Robotics and Automation (ICRA)</source>
          ,
          <fpage>3389</fpage>
          -
          <lpage>3396</lpage>
          , (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kalakrishnan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Righetti</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pastor</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Schaal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Learning force control policies for compliant manipulation</article-title>
          .
          <source>2011 IEEE/RSJ International Conference on Intelligent Robots and Systems</source>
          (
          <year>2011</year>
          ):
          <fpage>4639</fpage>
          -
          <lpage>4644</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Yahya</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalakrishnan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chebotar</surname>
            ,
            <given-names>Y</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Levine</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Collective robot reinforcement learning with distributed asynchronous guided policy search</article-title>
          .
          <source>2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>
          (
          <year>2017</year>
          ):
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Everett</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>How</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          :
          <article-title>Socially aware motion planning with deep reinforcement learning</article-title>
          .
          <source>2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>
          (
          <year>2017</year>
          ):
          <fpage>1343</fpage>
          -
          <lpage>1350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kendall</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hawke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Janz</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mazur</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reda</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allen</surname>
            <given-names>J.M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lam</surname>
            ,
            <given-names>V.-D.</given-names>
          </string-name>
          :
          <article-title>Learning to Drive in a Day</article-title>
          . arXiv preprint arXiv:1807.00412, (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Iturrate</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montesano</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Minguez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Robot reinforcement learning using EEG-based reward signals</article-title>
          .
          <source>2010 IEEE International Conference on Robotics and Automation</source>
          (
          <year>2010</year>
          ):
          <fpage>4822</fpage>
          -
          <lpage>4829</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fern</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Tadepalli</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>A Bayesian Approach for Policy Learning from Trajectory Preference Queries</article-title>
          .
          <source>NIPS</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Christiano</surname>
            ,
            <given-names>P. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leike</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Legg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Amodei</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Deep Reinforcement Learning from Human Preferences</article-title>
          .
          <source>NIPS</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Wang</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            <given-names>W.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bouguettaya</surname>
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Adaptive Service Composition Based on Reinforcement Learning</article-title>
          . In: Maglio P.P.,
          <string-name>
            <surname>Weske</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Fantinato</surname>
            <given-names>M.</given-names>
          </string-name>
          <article-title>(eds) Service-Oriented Computing</article-title>
          .
          <source>ICSOC 2010. Lecture Notes in Computer Science</source>
          , vol.
          <volume>6470</volume>
          . Springer, Berlin, Heidelberg (2010).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16. Probabilistic-programming.org, http://probabilistic-programming.org/wiki/Home, last accessed
          <year>2018</year>
          /08/18.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>