<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Reinforcement Learning for Autonomous Agents Exploring Environments: an Experimental Framework and Preliminary Results</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nassim Habbash</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Bottoni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Vizzari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics</institution>
          ,
          <addr-line>Systems and Communication (DISCo)</addr-line>
          ,
          <institution>University of Milano-Bicocca</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>84</fpage>
      <lpage>100</lpage>
      <abstract>
        <p>Reinforcement Learning (RL) is increasingly being investigated as an approach to achieve autonomous agents, where the term autonomous has a stronger acceptation than the currently most widespread one. On a more pragmatic level, recent developments and results in the RL area suggest that this approach might even be a promising alternative to current agent-based approaches to the modeling of complex systems. This work presents an investigation of the level of readiness of a state-of-the-art model to tackle issues of orientation and exploration of a randomly generated environment, as a toy problem to evaluate the adequacy of the RL approach to provide support to modelers in the area of complex systems simulation, and in particular pedestrian and crowd simulation. The paper presents the adopted approach and the achieved results, and discusses future developments of this line of work.</p>
      </abstract>
      <kwd-group>
        <kwd>agent-based modeling and simulation</kwd>
        <kwd>reinforcement learning</kwd>
        <kwd>complex-systems</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Reinforcement Learning (RL) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is increasingly being investigated as an approach to implement
autonomous agents, where the acceptation of the term “autonomous” is closer to Russell and
Norvig’s [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] than the most widely adopted ones in agent computing. Russell and Norvig state
that:
      </p>
      <p>A system is autonomous to the extent that its behavior is determined by its own
experience</p>
      <p>
        A certain amount of initial knowledge (in analogy to built-in reflexes in animals and
humans) is reasonable, but it should be complemented by the ability to learn. RL approaches, reinvigorated
by the energy, efforts, and promises brought by the deep learning revolution, seem one of the
most promising ways to investigate how to provide an agent with this kind of autonomy. On a more
pragmatic level, recent developments and results in the RL area suggest that this approach might
even be a promising alternative to current agent-based approaches to the modeling of complex
systems [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]: whereas currently behavioral models for agents are carefully hand-crafted, often
following a complicated interdisciplinary effort involving different roles and types of knowledge,
as well as validation processes based on the acquisition and analysis of data describing the
studied phenomenon, RL could simplify this work, focusing on the definition of an environment
representation, the definition of a model for agent perception and action, and the definition of a reward
function. The learning process could, in theory, be able to explore the potential space of
policies (i.e. agent behavioral specifications) and converge to the desired decision-making model.
While the definition of a model of the environment, as well as of agent perception and action, and
the definition of a reward function are tasks requiring substantial knowledge about the studied
domain and phenomenon, the learning process could significantly simplify the modeler's work, and
at the same time it could solve issues related to model calibration.
      </p>
      <p>
        The present work is set in this scenario: in particular, we want here to explore the level
of readiness of state-of-the-art models to tackle issues of orientation and exploration of an
environment [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] by an agent that does not have prior knowledge of its topology. The
environment is characterised by the presence of randomly generated obstacles and by a target
for the agent's movement, a goal that must be reached while, at the same time, avoiding obstacles.
This represents a toy problem allowing us to investigate the adequacy of the RL approach to
support modelers in the area of complex systems simulation, and in particular pedestrian and
crowd simulation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We adopted Proximal Policy Optimization (PPO) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and trained agents in
the above-introduced type of environment: the achieved decision-making model was evaluated
in new environments analogous to the ones employed for training, but we also evaluated the
adequacy of the final model to guide agents in different types of environment, less random and
more similar to human-built environments (i.e. including rooms, corridors, passages), to start
evaluating whether agents for simulating typical pedestrians could be trained through an RL approach.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem Statement</title>
      <p>The main focus of this paper is automatic exploration of environments without a-priori
knowledge of their topology. This is modeled through a single-agent system, where an agent is
encouraged to look out for a target placed randomly in a semi-randomly generated environment.
This environment presents an arbitrary number of obstacles placed randomly on its space. The
environment can be seen as a rough approximation of natural, mountainous terrain, or artificial,
post-disaster terrain, such as a wrecked room. The agent can observe the environment through
its front-mounted sensors and move on the XY Cartesian plane. In order to solve the problem
of automatic exploration in an ever-changing, obstacle-ridden environment, the main task is to
generalize the exploration procedure, achieving an agent able to explore environments different
from the ones it was trained on.</p>
      <p>In this paper we develop a framework around this task using Reinforcement Learning.
Section 3 provides a definition of the agent, the environment and their interactions. Section 4 goes
through Reinforcement Learning and the specific technique adopted for this work. Section 5
describes the architecture of the system, with some details on the tools used. Section 6 reports
the experimental results obtained, and Section 7 provides some considerations on the work and
possible future developments.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Agent and Environment</title>
      <sec id="sec-3-1">
        <title>3.1. Agent model</title>
        <p>The agent is modeled after a simplified rover robot with omnidirectional wheels, capable of
moving on the ground in all directions. The location of the agent in the environment is described
by the triple (x, y, θ), where (x, y) denotes its position on the XY plane, and θ denotes its
orientation. The agent size is 1x1x1 units.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Observation space</title>
        <p>The agent can observe the environment through a set of LIDARs that create an array of surveying
rays: these are time-of-flight sensors which provide information on both the distance between
the agent and the collided object and the object's type. If a ray is not long enough to reach an
object because it is too far away, the data provides the over-maximum-range information to the
agent. The standard LIDAR length is 20 units. The agent is equipped with 14 LIDARs, placed
radially on a plane starting from the middle of its front-facing side, giving it a field of view of
[−2π/3; 2π/3] for 20 units of range.</p>
        <p>More formally, we can define an observation or state as a set of tuples, as follows:
s = {(d_1, c_1), (d_2, c_2), ..., (d_n, c_n)}, s ∈ S
(1)</p>
        <p>Where n is the number of LIDARs on the agent, d_i represents the distance from the i-th
LIDAR to a colliding object in range, and c_i is the type of said object, with c_i ∈ {obstacle, target, ∅}.</p>
        <p>The observation (or state) space is hence defined as S, the set of all possible states s.</p>
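        <p>As an illustration (not the project's actual code), the following Python sketch shows how such an
observation could be assembled from the 14 simulated LIDAR rays; the HitType and LidarReading names
and the ray-casting callback are assumptions made for the example.</p>
        <preformat>
# Illustrative sketch: building the state s = {(d_1, c_1), ..., (d_n, c_n)}
# from n = 14 rays, each returning a distance and the type of the hit object.
from dataclasses import dataclass
from enum import Enum
import random

class HitType(Enum):
    NONE = 0      # ray reached maximum range without hitting anything
    OBSTACLE = 1
    TARGET = 2

@dataclass
class LidarReading:
    distance: float   # distance to the hit object, capped at the maximum range
    hit_type: HitType

MAX_RANGE = 20.0      # standard LIDAR length, in environment units
N_RAYS = 14

def collect_observation(cast_ray):
    """Collect one observation from a ray-casting callback (hypothetical hook)."""
    readings = []
    for i in range(N_RAYS):
        distance, hit_type = cast_ray(i)
        readings.append(LidarReading(min(distance, MAX_RANGE), hit_type))
    return readings

# Dummy usage with random readings standing in for the physics engine:
obs = collect_observation(lambda i: (random.uniform(0.0, 25.0), HitType.NONE))
        </preformat>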
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Action space</title>
        <p>The possible actions that the agent can perform are:
• Move forward or backward
• Move to the left or to the right, stepping aside (literally), without changing orientation
• Rotate counterclockwise or clockwise (yaw)
The agent can also combine these actions, for example going forward-right while rotating
counterclockwise.</p>
        <p>More formally, we can define the action space as:</p>
        <p>= { , , }
Where  ,  and  represent the movement on the associated axes, and
their value can be either {− 1, 0, 1}, where 0 represents no movement, and -1 and 1 represent
movement towards one or the other reference point on the axis.</p>
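        <p>A minimal sketch of this action encoding, enumerating the 27 combined actions and applying one of
them with purely illustrative step and turn sizes (in the actual simulation the movement is handled by
Unity's physics and the translation is expressed in the agent's local frame):</p>
        <preformat>
# Illustrative sketch of the discrete action encoding: each component is in {-1, 0, 1}.
from itertools import product

ACTIONS = list(product((-1, 0, 1), repeat=3))   # 27 combined actions (m_x, m_y, r_theta)

def apply_action(position, orientation, action, step=1.0, turn=5.0):
    """Apply one action tuple to an (x, y) position and an orientation in degrees."""
    m_x, m_y, r_theta = action
    x, y = position
    # note: here the translation is applied in world coordinates for brevity
    return (x + m_x * step, y + m_y * step), orientation + r_theta * turn

pos, heading = apply_action((0.0, 0.0), 90.0, (1, 0, -1))  # forward-right style move while rotating
        </preformat>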
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Environment model</title>
        <p>The environment is a flat area of 50x50 units, bounded at its extremities by walls 1 unit tall. A
set of gray cubes of 3x3x3 units each are randomly placed on this area as obstacles. The target -
the goal the agent must reach - is positioned randomly between this set of obstacles, and is an
orange cube of 3x3x3 units.</p>
        <p>The provided interactions between agent and environment are collisions. The agent collides
with another object in the environment if there is an intersection between the bounding boxes
of the two entities. The floor is excluded from collisions. If a collision happens between agent
and obstacles, the agent suffers a penalty, while if a collision happens between agent and target,
the episode ends successfully, as the agent has achieved its goal.</p>
        <p>The environment is regenerated every time an episode ends, successfully or not, so no two
identical episodes are played by the agent. This generation is parametric, allowing for a more
or less dense obstacle distribution in the environment, or a longer or shorter distance between
the agent and the target.</p>
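        <p>A hedged sketch of such a parametric generation step, with assumed parameter names and defaults
mirroring the baseline values reported in Section 6 (the actual generator is implemented in Unity):</p>
        <preformat>
# Illustrative sketch: random placement of obstacles and target on the 50x50 area,
# respecting a minimum spawn distance and an approximate agent-target distance.
import math
import random

AREA = 50.0

def far_enough(p, placed, min_dist):
    return all(math.dist(p, q) >= min_dist for q in placed)

def generate_environment(n_obstacles=10, min_spawn_dist=2.0, target_dist=45.0):
    obstacles = []
    while len(obstacles) != n_obstacles:                      # rejection sampling
        p = (random.uniform(0, AREA), random.uniform(0, AREA))
        if far_enough(p, obstacles, min_spawn_dist):
            obstacles.append(p)
    agent = (random.uniform(0, AREA), random.uniform(0, AREA))
    # target roughly target_dist units away, clamped inside the area (so approximate)
    angle = random.uniform(0, 2 * math.pi)
    target = (min(max(agent[0] + target_dist * math.cos(angle), 0.0), AREA),
              min(max(agent[1] + target_dist * math.sin(angle), 0.0), AREA))
    return {"obstacles": obstacles, "agent": agent, "target": target}

env = generate_environment()
        </preformat>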
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Reinforcement Learning</title>
      <p>
        In the past couple of years Reinforcement Learning has seen many successful and remarkable
applications in the fields of robotics and locomotion, such as [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This approach provides many
benefits: experimentation can be done in a safe, simulated environment, and it’s possible to
train models through millions of iterations of experience to learn an optimal behaviour. In some
fields - such as robot movement - the RL approach currently outperforms classic heuristic and
evolutionary methods [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Reinforcement Learning is a technique where an agent learns by interacting with the
environment. The agent ought to take actions that maximize a reward, selecting these from its past
experiences (Exploitation) and completely new choices (Exploration), making this essentially a
trial-and-error learning strategy. After sufficient training, an agent can generalize an optimal
strategy, allowing itself to actively adapt to the environment and maximize future rewards.
Generally, an RL algorithm is composed of the following components:
1. A policy function, which is a mapping between the state space and the action space of
the agent
2. A reward signal, which defines the goal of the problem, and is sent by the environment
to the agent at each time-step
3. A value function, which defines the expected future reward the agent can gain from the
current and all subsequent future states
4. A model, which defines the behaviour of the environment</p>
      <p>At any time, an agent is in a given state of the overall environment s ∈ S (that it should
be able to perceive; from now on, we can consider the state the portion of the environment
that is perceivable by the agent), and it can choose to take one of many actions a ∈ A, causing
a change to another state with a given probability P. Given an action a chosen by the
agent, the environment returns a reward signal r ∈ R as feedback on the goodness of the
action. The behaviour of the agent is regulated by what is called a policy function π, which can
be defined as
π_Θ(a|s) = P(A_t = a | S_t = s)
(3)
and represents a distribution over actions given states at time t with parameters Θ - in this case
the policy function is stochastic, as it maps over probabilities. Following is a brief introduction
to the two main algorithms used in this work.</p>
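      <p>As a minimal illustration of such a stochastic policy, the following sketch samples an action from a
softmax distribution produced by a linear score function; the actual policy in this work is the neural
network trained by PPO in ML-Agents.</p>
      <preformat>
# Minimal sketch of a stochastic policy pi_Theta(a|s) as a softmax over action scores.
import numpy as np

rng = np.random.default_rng(0)

def policy(theta, state):
    """Return a probability distribution over actions for the given state."""
    logits = theta @ state                  # one linear score per action
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    return exp / exp.sum()

def sample_action(theta, state):
    probs = policy(theta, state)
    return rng.choice(len(probs), p=probs)  # A_t sampled from pi_Theta(.|S_t)

theta = rng.normal(size=(27, 28))           # e.g. 27 actions, 14 LIDARs x (distance, type)
state = rng.normal(size=28)
a = sample_action(theta, state)
      </preformat>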
      <sec id="sec-4-1">
        <title>4.1. Proximal Policy Optimization</title>
        <p>
          RL presents a plethora of different approaches. Proximal Policy Optimization (PPO) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], the
one used in this work, is a policy gradient algorithm which works by learning the policy
function π directly. These methods have better convergence properties compared to dynamic
programming methods, but need a more abundant set of training samples. Policy gradients
work by learning the policy's parameters through a policy score function J(Θ), through which
it is then possible to apply gradient ascent to maximize the score of the policy with respect to the
policy's parameters Θ. A common way to define the policy score function is through a loss
function:
J(Θ) = E[log π_Θ(a|s) A(s, a)]
(4)
which is the expected value of the log probability of taking action a at state s times the
advantage function A, representing an estimate of the relative value of the taken action. As
such, when the advantage estimate is positive, the gradient will be positive as well; through
gradient ascent the probability of taking the correct action will increase, while in the opposite case
the probabilities of the actions associated with a negative advantage will decrease. The main issue
with this vanilla policy gradient approach is that gradient ascent might eventually lead out of
the range of states where the current experience data of the agent has been collected, changing
the policy completely. One way to solve this issue is to update the policy conservatively, so
as not to move too far in one single update. This is the solution applied by the Trust Region
Policy Optimization algorithm [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which forms the basis of PPO. PPO implements this update
constraint in its objective function through what it calls the Clipped Surrogate Objective. First,
it defines a probability ratio between the new and old policy, r(Θ) = π_Θ(a|s) / π_Θ_old(a|s), which tells
whether an action for a state is more or less likely to happen after the policy update.
PPO's loss function is then defined as:
L(Θ) = E[min(r(Θ) A, clip(r(Θ), 1 − ε, 1 + ε) A)]
(5)
        </p>
        <p>The Clipped Surrogate Objective presents two probability ratios, one non-clipped, which
is the default objective as expressed in (4) in terms of the policy ratio, and one clipped in a range.
The function presents two cases depending on whether the advantage function is positive or
negative:
1. A &gt; 0: the action taken had a better than expected effect; therefore, the new policy is
encouraged to take this action in that state;
2. A &lt; 0: the action had a negative effect on the outcome; therefore, the new policy is
discouraged from taking this action in that state.</p>
        <p>
          In both cases, because of the clipping, the probability of an action will only increase or decrease
by a factor of at most 1 ± ε, preventing the policy from being updated too much, while allowing the gradient updates to undo
bad updates (e.g. the action was good but it was accidentally made less probable) by choosing
the non-clipped objective when it is lower than the clipped one. Note that the final loss function
for PPO adds two other terms to be optimized at the same time, but we suggest the original
paper [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] for a more complete overview of PPO.
        </p>
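        <p>A minimal NumPy sketch of the Clipped Surrogate Objective of (5), ignoring the additional
value-function and entropy terms of the full PPO loss; it is illustrative only and not the ML-Agents
implementation.</p>
        <preformat>
# Clipped Surrogate Objective: E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)]
import numpy as np

def clipped_surrogate(new_logp, old_logp, advantages, epsilon=0.2):
    ratio = np.exp(new_logp - old_logp)                    # r(Theta) = pi_new / pi_old
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Toy batch: log-probabilities under the old and updated policies, plus advantages.
old_logp = np.log(np.array([0.25, 0.10, 0.60]))
new_logp = np.log(np.array([0.30, 0.05, 0.70]))
adv = np.array([1.0, -0.5, 0.2])
objective = clipped_surrogate(new_logp, old_logp, adv)     # to be maximized by gradient ascent
        </preformat>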
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Curiosity</title>
        <p>Reward sparseness is one of the main issues with RL. If an environment is defined with a sparse
reward function, the agent won't get any feedback about whether its actions at the current time
step are good or bad, but only at the end of the episode, when it has either managed to succeed
in the task or failed. This means that the reward signal is 0 most of the time, and is positive in
only a few states and actions. One simple example is the game of chess: the reward could be
obtained only at the end of the match, but at the beginning, when the reward might be 10, 50,
100 time steps away, if the agent can't receive feedback for its current actions it can only move
randomly until, by sheer luck, it manages to get a positive reward; long-range dependencies
must then be learned, leading to a complicated and potentially overly long learning process.
There are many ways to solve the problem of reward sparseness, such as reward shaping, which
requires domain-specific knowledge on the problem, or intrinsic reward signals, additional
reward signals to mitigate the sparseness of extrinsic reward signals.</p>
        <p>
          Curiosity [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] falls into the second category, and its goal is to make the agent actively seek out
and explore states of the environment that it would not otherwise explore. This is done by supplementing the
default reward signal with an additional intrinsic component which is computed by a curiosity
module. This module comprises a forward model, which takes in the state s_t and the action a_t and tries to
predict the features of the next state the agent will find itself in, Φ̂(s_t+1). The more different this
value is from the features of the real next state Φ(s_t+1), the higher the intrinsic reward.
        </p>
        <p>To avoid getting stuck in unexpected states that are produced by random processes not
influenced by the agent, the module also comprises an inverse model, which takes Φ(s_t) and
Φ(s_t+1) and tries to predict the action â_t that was taken to get from s_t to s_t+1. By training
the encoder Φ together with the inverse model, it is possible to make the extracted
features ignore those states and events that are impossible to influence, retaining only features
actually influenced by the agent's actions.</p>
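        <p>A schematic sketch of the intrinsic reward computed by such a module, using placeholder linear maps
for the encoder and the forward model (the inverse model, trained jointly with the encoder, is omitted);
the actual networks are the ones provided by the ML-Agents Curiosity implementation.</p>
        <preformat>
# Intrinsic reward as the forward model's prediction error on encoded features.
import numpy as np

rng = np.random.default_rng(1)
W_enc = rng.normal(size=(16, 28))        # placeholder encoder Phi
W_fwd = rng.normal(size=(16, 16 + 3))    # placeholder forward model: (Phi(s_t), a_t) -> Phi(s_t+1)

def encode(state):
    return np.tanh(W_enc @ state)        # Phi(s)

def intrinsic_reward(state, action, next_state, eta=0.01):
    phi_next_pred = W_fwd @ np.concatenate([encode(state), action])   # predicted features
    phi_next = encode(next_state)
    return eta * 0.5 * np.sum((phi_next_pred - phi_next) ** 2)        # prediction error

s, s_next = rng.normal(size=28), rng.normal(size=28)
a = np.array([1.0, 0.0, -1.0])
r_intrinsic = intrinsic_reward(s, a, s_next)
        </preformat>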
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. System Architecture</title>
      <p>The system has been developed using Unity3D as the main simulation platform and ML-Agents
for Reinforcement Learning (the project's source code is available on GitHub: https://github.com/nhabbash/autonomous-exploration-agent).</p>
      <p>Unity3D is a well-established game engine. It provides many out-of-the-box
functionalities, including tools to assemble a scene, the 3D engine to render it, a physics engine
to simulate object interaction under physical laws, and many plugins and utilities.
In Unity an object is defined as a GameObject, and it can have different components attached
as needed, such as RigidBodies for physical computations, Controllers for movement,
decision-making and elements of the learning system (Agent, Academy and others). An object's
life-cycle starts with an initialization followed by a cyclic refresh of its state, and the engine
provides handler methods for these phases, so they can be customized through event-driven
programming.</p>
      <p>Unity keeps track of time and events on a stratified pipeline: physics, game logic and scene
rendering logic are each computed sequentially and asynchronously:
1. Objects initialization.
2. Physics cycle (triggers, collisions, etc). May happen more than once per frame if the fixed
time-step is less than the actual frame update time.
3. Input events
4. Game logic, co-routines
5. Scene rendering
6. Decommissioning (objects destruction)</p>
      <p>One notable caveat is that physics updates may happen at a different rate than game logic.
In a game development scenario this is sometimes treated by buffering either inputs or events,
resulting in smoother physical handling, while for simulations in which a precise
correspondence between simulated and simulation time is necessary this might pose a slight inconvenience.
However, for the goals of the present work, this consideration does not represent a crucial
problem.</p>
      <p>ML-Agents is an open-source Unity3D plugin that enables games and simulations to serve as
environments for training intelligent agents. It provides implementations of
state-of-the-art Reinforcement Learning algorithms in TensorFlow, including PPO and
Curiosity. ML-Agents comes with an SDK that can be integrated seamlessly into Unity and
provides utilities for controlling the agent(s) and environment.</p>
      <sec id="sec-5-1">
        <title>5.1. System design</title>
        <p>Figure 2 represents the main actors of the system and their interactions. At the start of each
episode the environment is randomly generated according to its parameters, placing obstacles,
the target and the agent. The agent can then perform its actions, moving around the
environment, while collecting observations through its sensors and receiving rewards according
to the goodness of its actions. Physical collisions trigger negative or positive instantaneous
rewards according to the type of collision: obstacle collisions produce negative rewards, while
target collisions produce positive rewards and end the episode, as the task is successful. The
agent class, ExplorationAgent, is responsible for the agent's initialization, observation collection,
collision detection and physical movement, driven by the decisions received from the Brain
interface, which provides the actions produced by the model to the agent class. The environment
is comprised of two classes: ExplorationArea is responsible for resetting and starting every episode, rendering the
UI, logging information on the simulation process and placing every object in the environment
according to its parameters, while the Academy works in tandem with the Brain to regulate
the learning process, acting as a hub routing information - whether observations from the
environment or inferences made by the models - between the system and the RL algorithms
under the hood.</p>
        <p>Once the training phase ends, ML-Agents generates a model file which can be connected to
the brain and used directly for inference. Figure 3 explains how exactly the learning system is
structured between Unity and ML-Agents.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Reward Signal</title>
        <p>The main RL algorithm used in this work is PPO. The reward signal is composed of:
• Extrinsic signal: the standard reward produced by the environment according to the
goodness of the agent’s actions
• Intrinsic signal: the Curiosity of the agent
The extrinsic signal presents some reward shaping, and is defined as r = 5 * t − p. The p term
stands for penalty, and is a negative reward formulated as p = c * 0.1 + s * 0.001. Every
time the agent reaches a target, indicated by t (target collisions), the episode ends, and the
positive reward is 5, while if it hits an obstacle, indicated by c (obstacle collisions), it receives a
penalty of 0.1 for each collision. The agent is also penalized as time passes, receiving a 0.001
negative reward for each time step s, to incentivize the agent to solve the task faster. The intrinsic
signal is provided by the Curiosity module which, as Section 4.2 describes, supplies an additional
reward signal to encourage the exploration of unseen states.</p>
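        <p>A small sketch of this extrinsic shaping, with the symbols spelled out as function arguments:</p>
        <preformat>
# Extrinsic reward shaping: r = 5 * t - p, with p = c * 0.1 + s * 0.001.
def extrinsic_reward(target_hits, obstacle_collisions, timesteps):
    penalty = obstacle_collisions * 0.1 + timesteps * 0.001
    return 5.0 * target_hits - penalty

# An episode that reaches the target after 400 steps with 2 obstacle collisions:
r = extrinsic_reward(target_hits=1, obstacle_collisions=2, timesteps=400)   # 4.4
        </preformat>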
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Experiments</title>
      <p>Different scenarios for the analysis of the effectiveness of the proposed system have been
investigated.</p>
      <p>The main differences between the scenarios consist in:
1. Curriculum environment parameters
2. Penalty function
3. Observation source (LIDAR or camera)
4. Knowledge transferability to structured environments</p>
      <p>Comparison between the scenarios is conducted on two aspects: the first depends on the
canonical RL charts built at training time in order to define the reward, its trend over time,
its balance and other information about the system; the other aspect is an environmental
performance comparison, conducted through three performance measures pertaining to the
investigated setting, these being CPM (collisions per minute), measuring the mean number of
collisions made by the agent against obstacles, TPM (targets per minute), measuring the mean number
of goal targets successfully reached by the agent, and CPT (collisions per target), measuring
the mean number of collisions the agent makes before getting to a target. As the models of the
different scenarios have been trained with varying curricula and environments, these measures estimate
the performance of every model on a shared environment, so that the comparison between the
models happens under the same circumstances. These environmental performances have
been measured in a parallel fashion in order to gather more accurate data.</p>
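      <p>The three measures can be computed as in the following sketch; the counts and duration in the usage
line are illustrative, roughly matching the baseline figures reported below.</p>
      <preformat>
# Environmental performance measures used to compare the models.
def performance_measures(total_collisions, total_targets, minutes):
    cpm = total_collisions / minutes                 # collisions per minute
    tpm = total_targets / minutes                    # targets per minute
    cpt = total_collisions / max(total_targets, 1)   # collisions per target (guard against zero)
    return {"CPM": cpm, "TPM": tpm, "CPT": cpt}

baseline = performance_measures(total_collisions=50, total_targets=40, minutes=10)
      </preformat>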
      <p>An interactive demo of the system is also available to allow visual comparison of the different
scenarios (https://nhabbash.github.io/autonomous-exploration-agent/).</p>
      <sec id="sec-6-1">
        <title>6.1. Baseline</title>
        <p>The first experiment acts as a baseline for the other variations, and was conducted with the following
parameters:
1. Static parameters: number of obstacles 10, minimum spawn distance 2, target distance 45
2. Penalty function: p = c * 0.1 + s * 0.001
3. Observations source: LIDAR set</p>
        <p>The parameters in this setting generate fairly open environments with at most 10 obstacles.
The minimum distance between obstacles is 2 units, while the target spawns at a distance of 45
units, which is more than half of the environment size.</p>
        <p>Table 1 shows how the model collides roughly 5 times per minute and manages to reach a
target about 4 times per minute, hitting roughly 1.25 obstacles per target reached.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Curriculum</title>
        <p>The second experiment implemented curriculum learning into the training pipeline. The
curriculum was structured in seven lessons scaling along with the cumulative reward. Following
are its settings:
1. Dynamic parameters (scaled across the seven lessons): reward thresholds, number of obstacles, minimum spawn distance, target distance
2. Penalty function: p = c * 0.1 + s * 0.001
3. Observations source: LIDAR set</p>
        <p>The parameters in this setting generate an increasingly harder environment, with the target
getting farther from the agent, and the obstacles getting more cluttered and closer together.</p>
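        <p>A hedged sketch of a lesson-selection mechanism keyed on cumulative reward thresholds, in the spirit
of the curriculum described above; the thresholds and parameter values are placeholders, not the ones
actually used in the seven-lesson curriculum.</p>
        <preformat>
# Placeholder curriculum: each lesson has an entry threshold on the mean cumulative
# reward and the environment parameters it activates.
LESSONS = [
    # (entry threshold, number of obstacles, min spawn distance, target distance)
    (0.0, 4, 6.0, 20.0),
    (1.5, 7, 4.0, 35.0),
    (3.0, 10, 2.0, 45.0),
    # ... the actual curriculum used seven lessons; the values here are placeholders
]

def current_lesson(mean_cumulative_reward):
    """Return the parameters of the highest lesson whose entry threshold has been reached."""
    lesson = LESSONS[0]
    for candidate in LESSONS:
        if mean_cumulative_reward >= candidate[0]:
            lesson = candidate
    return lesson

_, n_obstacles, min_spawn_dist, target_dist = current_lesson(2.0)
        </preformat>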
        <p>Table 1 shows how the model collides roughly 10.6 times per minute and manages to reach a
target about 9.6 times per minute, hitting roughly 1.09 obstacles per target reached. This
is a significant improvement compared to the baseline: not only is the agent able to reach the
target faster, reaching 2.6 times more targets, but it also does so with fewer collisions per target.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Harder penalty</title>
        <p>The third experiment implemented curriculum learning into the training pipeline, and added a
harsher penalty for the agent. The curriculum is structured as in the previous experiment, with the
exception of the new parameter Penalty offset. Following are its settings:
1. Curriculum parameters (scaled across the seven lessons): reward thresholds, number of obstacles, minimum spawn distance, target distance, penalty offset
2. Harder penalty function: the obstacle-collision term, augmented by the Penalty offset as the curriculum progresses, plus the usual time penalty of 0.001 per time step
3. Observations source: LIDAR set</p>
        <p>As in the previous situation, the parameters generate a harder environment as the cumulative
reward increases, but this time the penalty function too increases in difficulty. The rationale of
this experiment is that, as the agent learns how to move to reach the target, it should also
learn not to collide frequently, and instead search the environment for the target smoothly.
Figure 4 shows how the different penalties relate to each other and to the Cumulative Reward lower
limit, without considering the time-based decrement (the same in all penalties).</p>
        <p>Table 1 shows how the model collides roughly 4.5 times per minute and manages to reach
a target about 3.9 times per minute, hitting roughly 1.18 obstacles per target reached.
This model obtains results similar to the baseline, while staying below the performance obtained by the
curriculum model. This may be due to the harsher penalty, which does not give the
model time to adapt to an optimal policy.</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Camera sensors</title>
        <p>The fourth experiment implemented the same curriculum and penalty as the second experiment.
The main difference consists in the use of a camera sensor instead of the LIDAR array, thus
generating images as observations.</p>
        <p>1. Curriculum parameters: same as the second experiment
2. Penalty function: same as the second experiment
3. Observations source: Camera 84x84 RGB</p>
        <p>The model performs significantly worse than the others. This is plausibly due to the low
number of training steps taken for this model (74k) compared to the others (700k), which
did not let the algorithm converge to an optimal policy. We must add that the choice not to
investigate a longer training phase is due to the fact that the need to analyse this heavier form
of input, which nonetheless might not necessarily be more informative, made the training much
more expensive in terms of computation time, so although the number of steps is lower than in the other
experiments, the overall computation time for learning is very similar. The model manages to
reach a target with roughly 2.5 obstacle hits each time, but takes roughly 3 times as long as
the optimal curriculum model.</p>
      </sec>
      <sec id="sec-6-5">
        <title>6.5. Structured environment transferability</title>
        <p>Taking the best performing model (i.e. Curriculum), this experiment consisted in testing how
well the model generalizes its task to structured environments - that is, how well what
the model learned during training in the chaotic environments transfers to structured
environments, which consist of:
1. "Rooms": two rooms are linked by a tight opening in a wall; the agent has to pass through
the opening to reach the target;
2. "Corridor": a long environment - literally a corridor - that the agent has to run across to
reach the target;
3. "Corridor tight": similar to the previous environment, but tighter;
4. "Turn": a corridor including a 90° left turn;
5. "Crossroad": a crossing of two ways, where the agent has either to go straight on, to turn
left or to turn right.</p>
        <p>These environments tested the capability of the agent to follow directions and to look for the
target in every space of the scene.</p>
        <p>The model does not perform equally well in every environment. This seems to be due to the fact that the
strategy the model has learned as optimal for the resolution of the task at hand appears to
be random exploration, as will be discussed later on. The agent seems able to follow a linear
path if the environment is wide enough to allow it to stay away from the borders, as is apparent when
comparing the Corridor and Corridor Tight outcomes: the second experiment has fewer targets and
more collisions per minute than the first one. The model's performances in Crossroad and Turn
are similar but, in the first case, the agent has to change path more often than in the second one,
so collisions happen more frequently. The Rooms experiment has both low TPM and low CPM because
the agent tends to stay in the spawn room, without attempting to pass through the door.</p>
      </sec>
      <sec id="sec-6-6">
        <title>6.6. Performance measures comparison</title>
        <p>The models have comparable performances: the curriculum, hardpenalty and camera models
reach a similar cumulative reward - with the camera ending on top, followed by the curriculum
and then the hardpenalty model. This comparison shows the first caveat of the experiments:
cumulative reward notwithstanding, the curriculum model achieved far higher environmental
performances than the other two models. The point is even clearer when considering
the baseline model, which almost perfectly converged on a higher cumulative reward: even
then, the curriculum model achieved better environmental performances, despite a lower cumulative
reward.</p>
        <p>We also performed experiments in a 3D version of the environment, whose complete
description is omitted for the sake of space, but in which (intuitively) actions included the possibility to
maneuver along the Z axis as well, significantly increasing the dimension of the state space. Such a
3D maneuvering model took a very long time to train, and its lack
of progress in the curriculum shows that it still has not converged to an optimal policy.</p>
        <p>For the models employing curriculum learning, we observed a tendency
to reach a plateau early on in terms of steps. This may be due to the manual thresholding
setup, which does not accurately increase difficulty. The only model whose learning process
managed to reach the last difficulty level is the curriculum model.</p>
        <p>It is worth noting how the reward distribution of the models capable of obtaining a working
policy shifts over time from the curiosity component to the environmental reward, indicating a significant
shift from curiosity-driven exploration to exploitation of the strategy learnt by the policy.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>The empirical results show a certain decoupling between the environmental measures and the
classical performance measures. Measuring performance in reinforcement learning settings is
well known to be tricky, as cumulative rewards, policy losses and other low-level
measures are not always able to capture the effectiveness of the agent's behaviour in the setting.</p>
      <p>In the proposed setting, the experiments showed how curriculum learning can be an effective
solution for improving the generalization capabilities of the model, significantly improving how
the agent behaves.</p>
      <p>The experiments with the proposed learning system show that the policy most commonly converged
to is essentially a random search strategy: the agent randomly explores the environment to
find the target. This is demonstrated by the behaviours of the different models - at different
levels of performance - which show the agent randomly moving between obstacles, revisiting
previously seen areas until it manages to get the target within range of its sensors. This
is probably due to the random nature of the environment generation: as no two episodes
present the same environment, the agent is not able to memorize the layout of the
environment (or portions of it), but can only generally try to avoid obstacles until the
target comes into sight.</p>
      <p>
        This consideration represents a reasonable way to interpret results in environments whose
structure is closer to the human-built environment: whenever looking around to see if the target
is finally in sight and moving towards it is possible without excessive risk of hitting an obstacle,
the process yields good results; otherwise it leads to a failure. While this does not represent
a negative result per se, it is a clear warning that the learning procedure and the environments
employed to shape the learning process can lead to unforeseen and undesirable results. It
must be stressed that, whereas this is a sort of malicious exploitation of a behavioural model
that was trained on some types of environments and that is being tested in different situations,
other state-of-the-art approaches specifically devised and trained to achieve proper pedestrian
dynamics still do not produce results that are competitive with hand-crafted models [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>Possible future developments on the RL model side are:
1. Implement memory: adding memory to the agent (in the form of an RNN module) might
allow it to form a sort of experience buffer for the current episode and to explore
the environment in a non-random fashion.
2. Rework the reward and penalty functions: the proposed reward and penalty are
rather simplistic; a possible enhancement to the penalty could be implementing soft
collisions, that is, scaling the negative reward obtained by the agent in a collision according
to the velocity of the collision - safe, soft touches can be allowed.
3. Compare different RL algorithms: different reinforcement learning algorithms (A3C,
DQN) might offer different insights on the optimal way to implement intelligent agents
in the proposed setting.</p>
      <p>On the other hand, a different training procedure, including specific environments aimed at
representing a sort of grammar of the built environment, which could be used to define a
curriculum for specifically training pedestrian agents, should be defined to more specifically
evaluate the plausibility of applying this RL approach to pedestrian modeling and
simulation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <source>Reinforcement Learning: an Introduction</source>
          , MIT press Cambridge,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Norvig</surname>
          </string-name>
          ,
          <source>Artificial Intelligence: A Modern Approach</source>
          (4th ed.), Pearson,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Manzoni</surname>
          </string-name>
          , G. Vizzari,
          <article-title>Agent based modeling and simulation: An informatics perspective</article-title>
          ,
          <source>Journal of Artificial Societies and Social Simulation</source>
          <volume>12</volume>
          (
          <year>2009</year>
          )
          <article-title>4</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Weyns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Omicini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Odell</surname>
          </string-name>
          ,
          <article-title>Environment as a first class abstraction in multiagent systems</article-title>
          ,
          <source>Autonomous Agents Multi-Agent Systems</source>
          <volume>14</volume>
          (
          <year>2007</year>
          )
          <fpage>5</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Vizzari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Crociani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandini</surname>
          </string-name>
          ,
          <article-title>An agent-based model for plausible wayfinding in pedestrian simulation</article-title>
          ,
          <source>Eng. Appl. Artif. Intell</source>
          .
          <volume>87</volume>
          (
          <year>2020</year>
          ). URL: https://doi.org/10.1016/j.engappai.2019.103241. doi:10.1016/j.engappai.2019.103241
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wolski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Klimov</surname>
          </string-name>
          ,
          <article-title>Proximal policy optimization algorithms</article-title>
          ,
          <source>CoRR abs/1707.06347</source>
          (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1707.06347. arXiv:1707.06347
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Heess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>TB</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sriram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lemmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Merel</surname>
          </string-name>
          , G. Wayne,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tassa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Erez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M. A.</given-names>
            <surname>Eslami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <article-title>Emergence of locomotion behaviours in rich environments</article-title>
          ,
          <source>CoRR abs/1707.02286</source>
          (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1707.02286. arXiv:1707.02286
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Turk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Learning symmetric and low-energy locomotion</article-title>
          ,
          <source>ACM Trans. Graph.</source>
          <volume>37</volume>
          (
          <year>2018</year>
          )
          <fpage>144:1</fpage>
          -
          <lpage>144:12</lpage>
          . URL: https://doi.org/10.1145/3197517.3201397. doi:10.1145/3197517.3201397.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Moritz</surname>
          </string-name>
          ,
          <article-title>Trust region policy optimization</article-title>
          , in:
          <string-name>
            <given-names>F. R.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015</source>
          , volume
          <volume>37</volume>
          of JMLR Workshop and Conference Proceedings, JMLR.org,
          <year>2015</year>
          , pp.
          <fpage>1889</fpage>
          -
          <lpage>1897</lpage>
          . URL: http://proceedings.mlr.press/v37/schulman15.html.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Efros</surname>
          </string-name>
          , T. Darrell,
          <article-title>Curiosity-driven exploration by self-supervised prediction</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Precup</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. W.</given-names>
            <surname>Teh</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017</source>
          , volume
          <volume>70</volume>
          of Proceedings of Machine Learning Research, PMLR,
          <year>2017</year>
          , pp.
          <fpage>2778</fpage>
          -
          <lpage>2787</lpage>
          . URL: http://proceedings.mlr.press/v70/pathak17a.html.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Martinez-Gil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lozano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fernández-Rebollo</surname>
          </string-name>
          ,
          <article-title>Emergent behaviors and scalability for multi-agent reinforcement learning-based pedestrian models</article-title>
          ,
          <source>Simul. Model. Pract. Theory</source>
          <volume>74</volume>
          (
          <year>2017</year>
          )
          <fpage>117</fpage>
          -
          <lpage>133</lpage>
          . URL: https://doi.org/10.1016/j.simpat.2017.03.003. doi:10.1016/j.simpat.2017.03
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>