<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Case for Robust AI in Robotics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shashank Pathak</string-name>
          <email>shashank.pathak@iit.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Pulina</string-name>
          <email>lpulina@uniss.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Armando Tacchella</string-name>
          <email>armando.tacchella@unige.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Context</institution>
          ,
          <addr-line>Motivation, Objectives</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universita degli Studi di Genova</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>iCub Facility</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Researchers envision a world wherein robots are free to interact with the external environment, which includes human beings, other living creatures, other robots and a variety of inanimate objects. It is always tacitly assumed that interactions will be smooth, i.e., that they will fulfill several desirable properties ranging from safety to appropriateness. We posit that a reasonable mathematical model to frame such a vision is that of Markov decision processes, and that ensuring smooth interactions amounts to endowing robots with control policies that are provably compliant with side conditions expressed in probabilistic temporal logic.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Context, Motivation, Objectives</title>
      <p>
        A policy π is greater than or equal to a policy π′ (π ≥ π′) exactly when
V_π(s) ≥ V_π′(s) for all s ∈ S. Given some reasonable definition of the value
V_π(s) for all s ∈ S, solving an MDP amounts to finding a policy π* such that
π* ≥ π for all possible policies π; more about MDPs and related
decision problems can be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
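      <p>As a concrete illustration of what solving an MDP means in this setting, the following minimal sketch (ours, not part of the original text) computes a dominating policy for a toy three-state MDP by value iteration, assuming a standard discounted definition of V_π(s); the transition probabilities and rewards are invented purely for illustration.</p>
      <preformat>
# Minimal sketch: value iteration on a toy MDP, assuming a discounted value V(s).
# The toy model below is invented for illustration only.
import numpy as np

gamma = 0.95                                    # discount factor (an assumption)
# p[a, s, t] = transition probability p(t | s, a); each row sums to 1
p = np.array([[[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]])
r = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])   # r[s, a]: immediate reward

V = np.zeros(p.shape[1])
for _ in range(1000):                           # iterate the Bellman optimality operator
    Q = r + gamma * np.tensordot(p, V, axes=([2], [0])).T   # Q[s, a]
    V_new = Q.max(axis=1)
    if np.allclose(V_new, V, atol=1e-10):
        break
    V = V_new
pi_star = Q.argmax(axis=1)   # a greedy policy: it dominates every other policy state-wise
print("V*:", V, "pi*:", pi_star)
      </preformat>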
      <p>
        From a modeling point of view, Markov decision processes and the associated
optimization problems capture a broad set of approaches to the analysis and
synthesis of intelligent behavior for autonomous agents, which can be put to
good use in the field of Robotics. For instance, in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] it is argued that many
problems of AI planning under uncertainty can be modeled as MDPs. In the
learning community (see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for a recent perspective), reinforcement learning
(RL) is viewed as one of the key techniques to synthesize intelligent behavior
for interactive agents, and the mathematical underpinning of RL is also given
by Markov decision processes. Even closer to field robotics, the area of optimal
control has a long tradition of leveraging MDPs for problems involving sequential
decision making under uncertainty; see, e.g., [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Given the widespread adoption
of MDPs in AI and related (sub)fields, proposing techniques to achieve robustness
of autonomous agents based on Markov decision processes is bound to have a
broad impact. We believe that Robotics might benefit the most from robust AI
techniques, since robots ought to be functional but also dependable, and the
trade-off between these two aspects needs to be fully understood and explored.
      </p>
      <p>Our key proposition is to extend the modeling framework of MDPs to one
that includes explicit side conditions expressed in probabilistic computation tree
logic (PCTL). The syntax of PCTL is defined considering the set Φ of state
formulas and the set Ψ of path formulas. Given a set of atomic propositions AP,
Φ is defined inductively as: (i) if φ ∈ AP then φ ∈ Φ; (ii) ⊤ ∈ Φ; if φ, ψ ∈ Φ
then also φ ∧ ψ ∈ Φ and ¬φ ∈ Φ; and (iii) P⋈p[ψ] ∈ Φ where ⋈ ∈ {≤, &lt;, ≥, &gt;},
p ∈ [0, 1] and ψ ∈ Ψ, where P⋈p[ψ] is the probabilistic path operator. The set Ψ
contains exactly the expressions of type X φ (next), φ U≤k ψ (bounded until) and
φ U ψ (until), where φ, ψ ∈ Φ and k ∈ ℕ; more on PCTL can be found in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Given an MDP M, a definition of value V_π(s) for all states of M, and a PCTL
formula φ, the Markov decision problem with probabilistic side condition (mdpp)
can be defined as the problem of finding π* such that π* ≥ π for all policies π
and D_{M,π*} ⊨ φ, i.e., φ is always satisfied in the discrete-time Markov chain
(DTMC) D_{M,π*} corresponding to the combination of M and π*; see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for
details about DTMCs and PCTL semantics, and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for details about combining
MDPs with policies to yield DTMCs.
      </p>
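      <p>To make the combination of an MDP with a policy concrete, the following minimal sketch (ours, a sketch of the standard construction rather than the specific one of [6]) builds the transition matrix of the induced DTMC D_{M,π} from the MDP transition probabilities and a memoryless stochastic policy; the toy numbers are invented.</p>
      <preformat>
# Minimal sketch: inducing the DTMC D_{M,pi} from an MDP M and a memoryless
# stochastic policy pi. Toy numbers are illustrative only.
import numpy as np

def induced_dtmc(p, pi):
    """p[a, s, t] holds the MDP probabilities p(t | s, a); pi[s, a] is a
    stochastic policy; the result is the DTMC matrix P[s, t]."""
    # P(t | s) = sum over a of pi(a | s) * p(t | s, a)
    return (pi[:, :, None] * p.transpose(1, 0, 2)).sum(axis=1)

p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.4, 0.6], [0.0, 1.0]]])
pi = np.array([[0.7, 0.3], [1.0, 0.0]])
P = induced_dtmc(p, pi)
assert np.allclose(P.sum(axis=1), 1.0)   # each row of a DTMC matrix sums to 1
print(P)
      </preformat>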
    </sec>
    <sec id="sec-2">
      <title>State of the art</title>
      <p>
        We can distinguish current approaches to mdpp into two broad categories, the
first one oblivious of formal techniques and the second one deeply rooted in
formal verification and reasoning. In the former category we can list
multiobjective reinforcement learning [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], with Geibel and Wysotzki's approach as a
special case [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The main idea of these approaches is to encode the requirement
expressed by φ in the value function V_π(s). Neither approach requires
knowledge of p(·|s, a), because the relevant information is learned by interacting
with (a physical realisation of) M. In this way, solving the Markov decision
problem yields a policy that most probably also satisfies φ, although formal
guarantees are not provided. Still in the first category, we can consider Gillula
and Tomlin's safe online learning [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which, albeit restricted to safety properties,
provides a mathematically precise way to combine side conditions with the online
solution of an MDP via reinforcement learning (RL) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Decision-theoretic
planning [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a.k.a. indirect RL [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], model-based RL [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], or controller synthesis
on MDPs [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is another approach in which mdpp can be formalised by taking
into account both the elements related to the optimisation of V_π and the side conditions
expressed by φ. In this case the solution is precise, but it requires knowledge
of p(·|s, a) in advance, and the policy synthesised is always deterministic. Yet
another way to incorporate side conditions is to consider the overall model as a
constrained MDP [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], where one type of cost is to be optimised while keeping the
other types of costs within a bound. As before, the approach requires knowledge
of the model. None of the methods listed above covers the case in which
p(·|s, a) is unknown and the optimal policy is stochastic. Also, the learning-based
ones leave the relation between the logical specification of φ and the functional
specification of rewards unclear.
      </p>
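      <p>The following minimal sketch (ours, not the actual algorithms of [7] or [8]) illustrates the general idea of folding the requirement expressed by φ into the quantity being optimised: a task reward and a penalty for entering error states are scalarised and then learned with ordinary tabular Q-learning; the weight xi, the state encoding and the toy transition are assumptions made for illustration.</p>
      <preformat>
# Minimal sketch: scalarising task reward and a safety penalty, then learning
# on the combined signal with standard tabular Q-learning. All numbers and the
# weight xi are invented for illustration.
def combined_reward(task_reward, in_error_state, xi=10.0):
    return task_reward - (xi if in_error_state else 0.0)

def q_update(Q, s, a, s_next, r, actions, alpha=0.1, gamma=0.95):
    # one standard Q-learning step on the scalarised reward
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

# toy usage: a transition ending in an error state is discouraged
Q = {}
q_update(Q, s=0, a=1, s_next=2, r=combined_reward(1.0, in_error_state=True), actions=(0, 1))
print(Q)   # the visited (state, action) pair now has a negative estimate
      </preformat>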
      <p>
        Another set of approaches to mdpp is based on formal methods, which
can be used to solve parts of the mdpp problem. For a known model M and
a given policy π, probabilistic model checking, supported by efficient tools like
PRISM [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] or MRMC [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], can be applied to check whether the side condition φ
is satisfied by the controller policy π. If a policy does not satisfy the PCTL
property φ, model repair [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] can be applied to modify the policy such that φ
becomes true. First, the DTMC model resulting from the MDP under
the given policy π is parameterised, using linear combinations of real-valued
parameters in the transition probabilities, where the parameter domains define
the allowed areas for the repair. Additionally, a cost function over the parameters
can be given. Model repair can then be applied to find (if it exists) a parameter
valuation within the parameter domains which, on the one hand, induces the
satisfaction of the property φ and, on the other hand, minimises the value of the
cost function, i.e., it changes the transition probabilities and thereby repairs
the DTMC at minimal cost. Unfortunately, this approach needs non-linear
optimisation and therefore does not scale to larger models. An approach that
we recently proposed with other researchers uses a greedy repair algorithm [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Instead of global optimisation, it uses local repair steps iteratively. Though it
needs to invoke probabilistic model checking repeatedly, this approach scales well
even to large models. However, it can incorporate rewards and values V_π(s) only
heuristically, in a quite restricted manner.
      </p>
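      <p>The shape of such an iterative procedure can be sketched as follows (ours, only loosely inspired by the greedy repair idea of [6]); check_property and perturb_candidates are hypothetical stand-ins for an external probabilistic model checker and for the admissible local parameter changes.</p>
      <preformat>
# Minimal sketch of a greedy, local repair loop: model-check, apply the cheapest
# admissible local change, and iterate. check_property and perturb_candidates
# are hypothetical callbacks, not a real tool's API.
def greedy_repair(dtmc, check_property, perturb_candidates, max_iters=100):
    for _ in range(max_iters):
        ok, diagnostics = check_property(dtmc)      # probabilistic model checking
        if ok:
            return dtmc                             # property satisfied: done
        steps = perturb_candidates(dtmc, diagnostics)
        if not steps:
            return None                             # no admissible repair step left
        cost, repaired = min(steps, key=lambda t: t[0])
        dtmc = repaired                             # greedily apply the cheapest step
    return None
      </preformat>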
    </sec>
    <sec id="sec-3">
      <title>Towards Robust AI in Robotics: a Challenge</title>
      <p>Potentially, an AI agent embodied in a robot may face a wide variety of scenarios,
each characterized by different safety constraints and learning objectives.</p>
      <p>[Figure 1. Two flows of activities. Left (production stage): configure robot;
robot training starts; observe trainer (acquire D1); robot training ends; simulate
learning (compute π); check M_{D1},π ⊨ φ; if the check, the learning or the data are
not OK, the configuration, the acquisition or the reward profile are improved and
the flow repeats; otherwise the robot behaves safely and is shipped. Right
(deployment stage): initialize robot; robot calibration starts; observe user
(acquire D2); robot calibration ends; simulate learning (compute π); check
M_{D2},π ⊨ φ.]</p>
      <p>To
obtain a quantitative assessment of our capability to attack the mdpp problem,
it is useful to focus on a specific scenario which contains all the basic ingredients
found in more complex ones, yet is significant and amenable to a relatively
simple implementation. In particular, the case of a single robot interacting with
a single human across a common workspace is considered. It is assumed that the
robot observes the human while she is accomplishing a given task which, at some
point, requires the robot to chip in and, e.g., finalize the task alone or help the
human to do so. The task must be learned by the robot, but RL is run offline in a
simulator to avoid the risk of injuries to the human during the trial-and-error process
which characterizes RL. As shown in Figure 1, two different flows of activities are
considered. The first one, Figure 1 (left), is thought to happen at the end of
the production stage (factory), where the robot is configured, trained and checked
by experts to accomplish a given task. The second one, Figure 1 (right), is
thought to happen during the deployment stage (e.g., a household), where the user
is allowed to (i) calibrate the robot, i.e., adapt its behavior to the contingencies
of the environment to be found at the user's place, and (ii) modify the robot's
behavior, i.e., customize the robot according to specific preferences.</p>
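      <p>In such a scenario, a side condition φ could take the form of a probabilistic until property; for instance (our illustration, not a formula from the original text), P≥0.99[¬ collision U task_completed] would require that, under the learned policy, the task is completed without ever colliding with the human, with probability at least 0.99.</p>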
      <p>
        We believe that the current state of the art is unable to solve the mdpp
problem in a totally satisfactory way in cases like the one exemplified in
Figure 1. However, there is strong potential in combining efficient but potentially
imprecise engineering approaches with precise but potentially inefficient formal
methods. For example, RL-based methods are well established for MDP controller
synthesis, where the optimality criteria are encoded by rewards and the value
function V_π(s). To assure that the controller learned by RL is safe, we could use
probabilistic model checking during RL learning. If the current (not
yet necessarily optimal) controller turns out to be unsafe, we could repair the
controller. Additionally, it might also be necessary to modify the rewards and/or
the value function to direct RL towards safe solutions. This could be done, for
example, based on probabilistic counterexamples [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. In contrast with
reward-shaping approaches that guarantee invariance of the learned optimal policy [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
such a reward or value-function repair aims to obtain a sub-optimal but safe policy.
      </p>
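      <p>A minimal sketch of the envisioned combination is given below (ours; learn_episode, induce_dtmc, model_check, repair_policy and reshape_rewards are hypothetical stand-ins for the RL, model-checking, repair and reward-modification components discussed above): the policy is learned by RL, the induced DTMC is periodically model checked against φ, and repair or reward modification is triggered whenever the check fails.</p>
      <preformat>
# Minimal sketch of the envisioned loop: RL learning with a model-checking
# safety gate. All callbacks below are hypothetical stand-ins, not a real API.
def safe_rl(mdp, phi, learn_episode, induce_dtmc, model_check,
            repair_policy, reshape_rewards, episodes=1000, check_every=50):
    policy, rewards = None, mdp.initial_rewards
    for episode in range(episodes):
        policy = learn_episode(mdp, rewards, policy)           # one RL improvement step
        if episode % check_every == 0:
            dtmc = induce_dtmc(mdp, policy)                    # combine MDP and policy
            if not model_check(dtmc, phi):                     # probabilistic model checking
                policy = repair_policy(dtmc, phi, policy)      # repair the unsafe controller
                rewards = reshape_rewards(rewards, dtmc, phi)  # steer RL towards safe solutions
    return policy
      </preformat>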
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Puterman</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          :
          <article-title>Markov Decision Processes: Discrete Stochastic Dynamic Programming</article-title>
          . John Wiley and Sons (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Boutilier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanks</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Decision-theoretic planning: Structural assumptions and computational leverage</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ) (
          <year>1999</year>
          )
          <fpage>94</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Wiering</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Otterlo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Reinforcement learning</article-title>
          .
          <source>In: Adaptation, Learning, and Optimization</source>
          . Volume
          <volume>12</volume>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bertsekas</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          :
          <article-title>Dynamic programming and optimal control</article-title>
          .
          <source>Athena Scientific</source>
          , Belmont, MA (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Baier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katoen</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          : Principles of Model Checking. The MIT Press (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Pathak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abraham</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tacchella</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katoen</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A greedy approach for the efficient repair of stochastic models</article-title>
          .
          <source>In: Proc. of NFM'15. Volume 9058 of LNCS</source>
          , Springer (
          <year>2015</year>
          )
          <fpage>295</fpage>
          -
          <lpage>309</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Natarajan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tadepalli</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Dynamic preferences in multi-criteria reinforcement learning</article-title>
          .
          <source>In: Proc. of ICML'05</source>
          ,
          ACM
          (
          <year>2005</year>
          )
          <fpage>601</fpage>
          -
          <lpage>608</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Geibel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wysotzki</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Risk-Sensitive Reinforcement Learning Applied to Control under Constraints</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>24</volume>
          (
          <year>2005</year>
          )
          <fpage>81</fpage>
          -
          <lpage>108</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gillula</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tomlin</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>Guaranteed safe online learning via reachability: tracking a ground target using a quadrotor</article-title>
          .
          <source>In: Proc. of ICRA'12</source>
          ,
          IEEE
          (
          <year>2012</year>
          )
          <fpage>2723</fpage>
          -
          <lpage>2730</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barto</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Reinforcement Learning: An Introduction</article-title>
          . MIT Press (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Drager,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Forejt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ,
            <surname>Kwiatkowska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Parker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Ujma</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          :
          <article-title>Permissive controller synthesis for probabilistic systems</article-title>
          .
          <source>In: Proc. of TACAS'14</source>
          . Springer (
          <year>2014</year>
          )
          <fpage>531</fpage>
          -
          <lpage>546</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Altman</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Constrained Markov decision processes. Volume 7</article-title>
          . CRC Press (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kwiatkowska</surname>
            ,
            <given-names>M.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Norman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parker</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>PRISM 4.0: Verification of probabilistic real-time systems</article-title>
          .
          <source>In: Proc. of CAV. Volume 6806 of LNCS</source>
          , Springer (
          <year>2011</year>
          )
          <fpage>585</fpage>
          -
          <lpage>591</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Katoen</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zapreev</surname>
            ,
            <given-names>I.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hahn</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hermanns</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>D.N.:</given-names>
          </string-name>
          <article-title>The ins and outs of the probabilistic model checker MRMC</article-title>
          .
          <source>Performance Evaluation</source>
          <volume>68</volume>
          (
          <issue>2</issue>
          ) (
          <year>2011</year>
          )
          <fpage>90</fpage>
          -
          <lpage>104</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Bartocci</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grosu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katsaros</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramakrishnan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smolka</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          :
          <article-title>Model repair for probabilistic systems</article-title>
          .
          <source>In: Proc. of TACAS. Volume 6605 of LNCS</source>
          , Springer (
          <year>2011</year>
          )
          <fpage>326</fpage>
          -
          <lpage>340</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Abraham</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Becker</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dehnert</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katoen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wimmer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Counterexample generation for discrete-time Markov models: An introductory survey</article-title>
          .
          <source>In: Proc. of SFM. Volume 8483 of LNCS</source>
          , Springer (
          <year>2014</year>
          )
          <fpage>65</fpage>
          -
          <lpage>121</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harada</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Policy invariance under reward transformations: Theory and application to reward shaping</article-title>
          .
          <source>In: ICML</source>
          . Volume
          <volume>99</volume>
          (
          <year>1999</year>
          )
          <fpage>278</fpage>
          -
          <lpage>287</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>