<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Theory</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Curriculum-Based Reinforcement Learning for Pedestrian Simulation: Towards an Explainable Training Process</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuseppe Vizzari</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniela Briola</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Cecconello</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical Electronic and Computer Engineering, University of Catania</institution>
          ,
          <addr-line>Catania</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Informatics</institution>
          ,
          <addr-line>Systems and Communication (DISCo)</addr-line>
          ,
          <institution>University of Milano-Bicocca</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>87</volume>
      <issue>2018</issue>
      <fpage>73</fpage>
      <lpage>91</lpage>
      <abstract>
        <p>Deep Reinforcement Learning (DRL) is a promising approach in the development of autonomous agents adopted in diferent contexts, from robotic control to virtual avatars in video games. The present contribution presents an application of DRL to the context of pedestrian simulation: building on previous results, we focus on wayfinding decisions, i.e. the decisions among diferent alternative trajectories within an annotated (planar) environment comprising rooms and passages, in which the agent might need to reach specific intermediate goals before moving towards a final exit. By employing a curriculum based approach, the learning process guides agents to develop a policy leading to the exploration of the environment to reach a set of intermediate waypoints and the final movement target, irrespectively of the specific map of the environment. We discuss the adopted approach, the achieved results, and we discuss potential steps towards improving the explainability of the training process by means of formalization of scenarios included in the curriculum, and their intended training goals.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;agent-based simulation</kwd>
        <kwd>pedestrian simulation</kwd>
        <kwd>reinforcement learning</kwd>
        <kwd>curriculum learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Research in the area of pedestrian modelling has started investigating the application of Machine
Learning (ML) approaches leveraging the growing amount of data describing pedestrian and
crowd behavior1. Even assuming the validity of the experiments and observations that generated
those data (a topic that however is raising important questions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]), the data driven nature of
these supervised approaches, however, makes it dificult to achieve models characterized by
the level of generality (i.e. applicability to a relatively wide range of situations in which maybe
no experiment was carried out, and therefore there is no available data) achieved by manually
defined approaches.
      </p>
      <p>
        Recent results [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] have investigated the adoption Reinforcement Learning (RL) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a particular
type of ML approach in which agents situated in an environment explore the potential space of
the policies (i.e., agent behavioral specifications) and converge to a specific behavior assuring
the maximization of an expected cumulative reward (a feedback signal evaluating the adequacy
of agent’s behavior in a given situation). By their own nature, per se, these forms of agent
training processes do not need a huge amount of data, but they rather require a careful definition
of agent’s perceptions, actions, and the way they are evaluated within a given situation (i.e. a
reward function, more on this point later in the paper). It is not an uncommon RL workflow
to perform training on the specific problem / situation at hand, to achieve the best possible
results, and to obtain costly models which fit the situation but are not necessarily applicable
to others. Within a previous work [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] we proposed a curriculum [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] based approach, in which
agents were proposed training scenarios of growing complexity, granting both a reasonably
fast training and an interesting level of generality and direct applicability of the learned policy
without retraining on a new scenario to be investigated. The model focused on operational
level decisions, although agents were provided with basic information supporting the navigation
of simple environments comprising interconnected rooms, but they were not able to choose
among alternative paths towards a final goal.
      </p>
      <p>
        The present contribution builds on the framework discussed in the above cited paper to
investigate the possibility to train agents that are able to perceive and exploit environmental
information supporting wayfinding [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Passages are associated to information indicating if
they represent a reasonable way towards a final exit from a scenario, and also if they are to be
followed to reach intermediate goals. The learning process guides agents to develop a policy
leading to the exploration of the environment to reach a set of intermediate waypoints and the
ifnal movement target, irrespectively of the specific map of the environment. The contribution
will discuss the overall framework, the experimented training process, and the achieved results.
      </p>
      <p>The paper will describe the fundamental elements of the approach, its implementation within
a software framework employing Unity2 and ML-Agents3, describing the promising achieved
simulation results: in particular, we will show that the proposed approach is able to produce
plausible results in environments that were not used for sake of training, so the approach seems
promising at least in terms of generality. While the curriculum learning structuring implies a
goal driven approach in the design of the training process (a step of the curriculum definitely
has an intended outcome in teaching the agent how to deal with a certain situation, in pushing
it to learn to do something) this passage is mostly intuitive, vague, and definitely not formalized
as of this moment. It would be instead important, both for sake of explainability of the training
process, as well as for supporting more systematic analyses of the curriculum structure and
composition, and potentially to support its maintenance and extension, to take these aspects
into account: the paper will provide an initial discussion of how we intend to approach this
very interesting future development.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Pedestrian and crowd dynamics, as suggested in the introduction, represent an area in which
scientific research has produced valuable results, that are now being practically employed by
2https://unity.com
3https://github.com/Unity-Technologies/ml-agents
of-the–shelf tools: PTV Viswalk 4 oficially states that it employs mechanisms based on the
social force model introduced by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. An interesting and compact discussion of the field, from
a research oriented standpoint, is presented by [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], although it is really dificult to provide a
compact and yet substantial and comprehensive introduction to the field. Wayfinding and path
planning activities, for instance, are object of recent intense research, and they also to try to
consider factors like partial or imprecise knowledge of an environment (as discussed by [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]), its
dynamic level of congestion, and human factors like imitation (as proposed by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) can influence
overall observed system dynamics.
      </p>
      <p>
        Machine learning approaches have not yet delivered results able to substitute the traditional
hand crafted models adopted in commercial simulators, and they are still at the stage of active
researches. One of the first approaches, by [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], has investigated both RL techniques
(Qlearning) and a classification approach to basically choose an action among a small set of
available alternatives, based on a description of the current situation, and employing a decision
tree. More recently, diferent authors tried to frame the problem in such a way that regression
techniques could be employed, either to predict the scalar value of pedestrian’s velocity vector
(see, in particular, the work by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]) or to predict the both the walking speed and the direction
to be employed (as presented by [12]) considering the current perceived situation. The basic
idea is that, thanks to the growing availability of raw data describing pedestrian experiments
(see the above mentioned web site gathering and making available videos and tracking data
about pedestrian and crowd experiments 5), one could simply devise a deep neural network
to be trained according to the contextual situation perceived by a pedestrian and the velocity
actually adopted in the next frame of the video. While this approach is relatively straightforward,
it is quite limited in terms of the actual possibility to produce a general model of pedestrian
behaviour: even when the whole process should lead to a successful training of the network
and to achieving even very good results in the specific situations documented in the dataset,
there is actually no guarantee that the network would produce plausible movement predictions
in diferent situations not covered by the experiments.
      </p>
      <p>
        The RL approach has been recently applied again to the problem of pedestrian behavioural
modeling and simulation producing very interesting results by the already mentioned [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]:
authors clearly and very honestly discuss the limits of the achieved model. In particular,
although trained agents achieve encouraging quantitative results, also from the perspective
of capability of the model to generalize and face potentially complicated environments, social
interaction situations, and movement patterns, in some situations, they actually cannot complete
the movement they wanted to perform. We emphasize that this is completely understandable
when applying an approach that basically explores the space of potential policies for identifying
reasonable behavioral specifications in a complex situation, but still this testifies that there is
still need to perform further investigations to really evaluate the adequacy of the RL approach
to the problem of pedestrians and crowd simulation.
      </p>
      <p>A general consideration on RL compared to other ML approaches can however already be
done: on the one hand, RL requires the modelers to provide a set of assumptions, not just about
the model of perception and action of the agent. This is a cost, but it also means that the model</p>
      <sec id="sec-2-1">
        <title>4https://www.ptvgroup.com/en/solutions/products/ptv-viswalk/ 5https://ped.fz-juelich.de/da/doku.php</title>
        <p>(a)
(b)
can embed (i) concepts about how the environment is actually conceived and interpreted by the
agent in relation to its goal oriented behaviour, and (ii) an idea of what should be considered
a desirable behaviour (and just as well also what should be considered bad choices), and this
can represent a way of guiding the learning process in the large space of potential policies.
From this perspective, the presented approach is in tune with recent works on heuristics-guided
RL [13] (although we did not technically employ the techniques and framework proposed by
the authors), not just for accelerating the training process, but also to achieve a more generally
applicable behavioral model.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The Proposed Model</title>
      <sec id="sec-3-1">
        <title>3.1. Representation of the Environment</title>
        <p>For sake of simplicity in this experimental study environments are bound to be squares of 30
× 30 metres surrounded by walls. Concrete objects that prevent pedestrian passage, such as
walls, obstacles are represented in gray, while violet rectangles are intermediate and final goals,
whereas blue squares and rectangles are associated to intermediate movement goals not strictly
associated to a passage (they can represent areas in which agents need to carry out specific
actions). Blue and violet markers (in the vein of [14]), do not hinder the possibility of moving
through them, and they are essentially a modeling tool to support agent’s navigation in the
environment. The models we want to achieve represent an alternative to Unity’s path finding
and (more generally) pedestrian agent control mechanisms.</p>
        <p>The modeler must thus perform an annotation of the environment before using it in the
proposed approach; an example of an environment annotated with this rationale is shown in
Figure 1. On the left, the various objects are exemplified, while the right shows a vector of
booleans associated to every intermediate target, and more precisely to every side of them. The
semantics of this vector is that the presence of a 1 in the ℎ position implies that crossing it
from that side will lead an agent towards a specific goal: the small scale scenario depicted in
the example comprises two sub goals (respectively associated to the first and second bits) and
the final exit (associated to the third bit), so the vector is associated to three bits. The rationale
of this representation is to mimic the presence of signposting indications, guiding visitors of a
building in finding their way to reach. In the experiments in the remainder of the paper we
considered vectors including nine positions for intermediate goals plus the final exit.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Agent Perception</title>
        <p>Agents perception is supported by a set of projectors generating rays extending up to a certain
distance (14 m in these experiments) and supplying indications on what is potentially intersected
and the associated distance from the generating agent. Projectors are distributed around the
agent according to this rule:   =  ( − 1 +  * , _) where  has been set to 1.5,
max_vision to 90 and  0 to 0. As a consequence, projectors emit rays at 0°, ± 1.5°, ± 4.5°, ± 9°,
± 15°, ± 22.5°, ± 31.5°, ± 42°, ± 54°, ± 67.5°, ± 82.5° and ± 90°. Figure 1(a) graphically depicts
this distribution. There are thus 23 angles, and for each of them two rays are projected, being
associated to diferent types of information, supporting navigation among diferent rooms
(information about walls and targets), and within rooms (information about walls and goals).
Wall information is therefore currently present twice in agent’s perception (although in one
case it is in the same input in which a target towards another room might be perceived, and
in the other it is in the same input in which a goal within a room might be perceived) but,
as we will see in the experimental results discussion, this redundancy does not lead to issues
preventing training convergence.</p>
        <p>
          The overall agent’s observation is summarized in Table 1: in addition to rays, it includes
information about agent’s state such as the current velocity, a vector indicating if the goal
associated to the ℎ should be reached by the agent, and a boolean that is true when the agent
has reached all intermediate goals and should move toward the final exit. To improve the
performance of neural networks typically employed DRL algorithms all numerical observations
have been normalized in the interval [
          <xref ref-type="bibr" rid="ref1">0,1</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Action space</title>
        <p>Each agent is provided with an individual desired velocity that is drawn from a normal
distribution with average of 1.5 m/s and a standard deviation of 0.2 m/s. Each decision, and for these
experiments we decided to grant agents three decisions per second (in line with [15], combining
cognitive plausibility, quality of the achieved results, and computational costs), determines a
potential change in its velocity and this is basically what agent’s decision is all about for this
model.</p>
        <p>
          Agent’s action space has been therefore modeled as the choice of two (conceptually)
continuous values in the [
          <xref ref-type="bibr" rid="ref1">-1,1</xref>
          ] interval that are used to determine a change in velocity vector,
        </p>
        <p>Intrinsic
Walls and targets
Walls and goals</p>
        <p>Observation
Own velocity</p>
        <p>GoalsVector
AllGoalsAchieved</p>
        <p>Distance
Type/tag
Direction
Distance
Type/tag</p>
        <p>Value
Number
Boolean
Number
Number</p>
        <p>Boolean
Vector of booleans
One Hot Encoding
Vector of booleans
 =   ,   − 1 +</p>
        <p>Where  is set to 0 and  is set to 1.7 m/s. According to this equation the
agent is able to reach a complete stop or the maximum velocity is two actions (i.e. about 0.66 s).</p>
        <p>The second element of the decision determines a change in agent’s direction; in particular,
  =  − 1 + 1 * 25. The walking direction can therefore change 25° each 0.33s, that is plausible
for normal pedestrian walking, but would be probably not reasonable for modeling running
and/or sport related movements.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Reward Function</title>
        <p>The reward function is a central component in a RL approach, representing the only feedback
signal guiding the learning process. In this case, we are dealing with a particular form of decision
making, with conflicting tendencies that are generally reconciled quickly, almost unconsciously,
in a combination of individual and collective intelligence, that generally leads to sub-optimal
overall performance [16, 14].</p>
        <p>
          Given the above considerations, we hand-crafted a reward function, initially in terms of macro
components, i.e. factors generally influencing pedestrian behavior. Later on we performed a sort
of initial tuning of the related weights defining the relative importance of the diferent factors.
Unlike in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] we focused on individual motivation in exploring the environment, following
indications ofered by the environment, adjusting the current velocity so as to achieve a plausible
overall trajectory. The overall reward function is defined in Equation 2:
⎧+6
⎪
⎪
⎪
⎪⎪⎪− 3
⎪
⎪⎪⎪+0.5
⎪
⎪
⎪
⎪⎪⎪− 1.5
⎪
⎪
⎪⎪⎪− 0.5
Reward = ⎨+1
⎪
⎪⎪⎪− 0.5
⎪
⎪
⎪⎪− 0.1
⎪
⎪
⎪
⎪⎪− 0.01
⎪
⎪
⎪
⎪⎪− 0.0001
⎪
⎪
⎪
⎩− 6
        </p>
        <sec id="sec-3-4-1">
          <title>Final target reached, all subgoals reached</title>
          <p>Final target reached, at least one subgoal not reached
Valid intermediate target reached
Invalid intermediate target reached
No target in sight
Subgoal reached
Wall in proximity &lt; 0.6 m
At least one subgoal has not been reached yet
All subgoals reached no target in sight
Each step done
Reached the end of steps per episode
(2)</p>
          <p>The only ways to increase the cumulative reward are the reaching of a final target (but only
if all subgoals were reached), a valid intermediate target (meaning one that leads towards a
pursued subgoal or the final target), or a subgoal. Most actions will instead yield a negative
reward, associated to an implausible choice or simply to the fact that another decision turn has
passed (this small penalty pushes agents to avoid wasting time). Reaching the end of an episode
of training (we will provide more information about them later on) without having completed
the scenario (i.e. reaching the subgoals and then the final exit) will lead to a substantial negative
reward.</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Adopted RL algorithm</title>
        <p>Within this work we adopted Proximal Policy Optimization (PPO) [17], a state–of–the–art RL
policy–based algorithm, and in particular its implementation provided by ML-Agents 6. PPO
is a policy gradient algorithm learning directly the policy function  , selecting the action to
be carried out in a given situation, without the need of a value function (an estimation of the
expected return of an action carried out in a given state). These methods generally have better
convergence properties compared to dynamic programming methods, but they need a more
abundant set of training samples. Policy gradients work by learning the policy’s parameters
through a policy score function,  (Θ) , through which it is possible to apply gradient ascent,
maximizing the score of the policy with respect to the policy’s parameters, Θ . A common way
to define the policy score function is through a loss function:
 (Θ) =
[ Θ(|)]
(3)
which is the expected value of the log probability of taking action  at state  times the
advantage function , representing an estimate of the relative value of the taken action. When
the advantage estimate is positive, the gradient will be positive as well; through gradient ascent
the probability of taking the correct action will increase, while decreasing the probabilities of
the actions associated to negative advantage. Constructing these estimates requires exploring</p>
        <sec id="sec-3-5-1">
          <title>6https://github.com/Unity-Technologies/ml-agents</title>
          <p>the efect of actions in diferent situations, but this approach is fundamentally diferent from
supervised learning, since no annotated dataset is necessary.</p>
          <p>For this specific work, we also employed a curiosity mechanism [ 18] pushing the agent to
explore the state space more eficiently and generalize the acquired experience to unexplored
scenarios.</p>
          <p>The goal of the work was essentially to evaluate the adequacy of the approach to the problem
of achieving a proper pedestrian simulation model and we did not analyze the performance of
diferent RL algorithms yet.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. The Curriculum Based Learning Process</title>
      <p>
        The notion of Curriculum Learning [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] represents a general training strategy within machine
learning, initially conceived to reduce the training times. The underlying rationale is to present
examples (in particular labeled examples in supervised learning) in order of increasing dificulty
during training, illustrating gradually more concepts and more complications to the decision.
In the context of RL it has been employed as a transfer learning technique in RL, DRL and
Multi–Agent RL [19]: the agent can exploit experiences acquired carrying out simpler scenarios
when training to solve more complex ones, in an intra–agent transfer learning scheme. Besides
improving convergence properties and performance, in some situations it was also reported to
support a better generalization of the overall training process [20]. This aspect, coupled with
the intuitiveness and cognitive efectiveness of the approach, led us to pursue this approach
since achieving a good level of generalization of the acquired experience was also extremely
important for our problem. We wanted to train a single model directly applicable to diferent,
alternative designs on the same crowding condition, without having to perform training for
every specific design (which would lead to achieve incomparable results, since they would be
achieved by means of diferent pedestrian models).
      </p>
      <p>The finally adopted approach, therefore, proceeds training agents in a set of scenarios of
growing complexity, one at a time, but it also provides a final parallel retraining of the agent
in a selected number of scenarios before the end of the overall training, to refresh previously
acquired competences.</p>
      <p>For sake of clarity in the remainder of the section, we define more clearly some key concepts:
• scenario or step of the curriculum: the terms will be used interchangeably, and basically
they imply a specific simple environment and specific conditions for considering this part
of the training process completed;
• episode: each scenario generally needs to be experienced several times, each of them called
an episode, to accumulate experience (often associated to errors and negative rewards);
each episode has a maximum duration, but it can end with the achievement of the goal of
the scenario;
• each episode therefore leads to the achievement of a cumulative reward, that is the
summation of instant rewards achieved as a consequence of each decision and action
step;
• the above mentioned completion condition for a given scenario is expressed in terms of
a mathematical test for the evaluation of episodes cumulative rewards: for example, a
Environment</p>
      <p>StartEz</p>
      <p>CorridorObjective
StartObjectiveRandom</p>
      <p>StartPassage</p>
      <p>Path
DoubleChoice
TaskBeforeExit
2Rooms</p>
      <p>Rooms
LibrarySmall</p>
      <p>Rows
RowsMiddleSplit</p>
      <p>typical completion condition could be “the average cumulative rewards of the last 10
episodes is higher than threshold ℎ”.</p>
      <sec id="sec-4-1">
        <title>4.1. Details of the Curriculum</title>
        <p>Starting from the above considerations, we defined a specific curriculum for RL-pedestrian
agents based on this sequence of tasks of increasing complexity that are sub–goals of the overall
training. Table 2 reports the diferent environments that were included in the curriculum,
including the duration of the associated episode and a threshold for evaluating the acquired
cumulative reward. For sake of automation of the curriculum execution, we consider a step
of the curriculum to be successfully completed whenever the average cumulative reward for
trained agents, excluding the top and bottom 10% (for avoiding being excessively influenced by
a small number outliers), exceeds a this empirically defined threshold, specifically configured
for every step of the curriculum. The table also shows whose environment are included in the
ifnal retraining phase, that must be carried out before using the trained agents for simulation in
environments not yet experienced.</p>
        <p>The defined curriculum supports the acquisition of three main competences for an agent:
• the ability to steer and walk towards a target / mid-targets;
• the ability to choose among alternative paths, some of which might be preferable according
to environmental annotations;
• the ability to avoid moving toward the final target unless all intermediate goals have been
reached.</p>
        <p>We defined this sequence thanks to expertise in the context of pedestrian simulation, as well
as to a preliminary experimental phase and to the experience acquired in a previous research
(a) StartObjectiveRandom</p>
        <p>
          (b) Rooms
(c) TasksBeforeExit
(d) LibrarySmall
efort adopting a similar approach [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. An ablation study, as well as analyses of the robustness
of this training process are object of future works.
        </p>
        <p>For sake of space, we cannot describe every environment and scenario included in the
curriculum, but a selection of these training environments is shown in Figure 2.</p>
        <p>Starting from basic situations in which the agent learns how to steer towards a randomly
positioned goal, reach it, and only afterwards move towards the final exit (also randomly
positioned) (Figure 2(a)), we have situations in which the agent can choose alternative paths
in which the conflicting tendencies to save time and reach all subgoals make choices more
complicated (Figures 2(b), 2(c) and 2(d)).</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Reward Trend During Training</title>
        <p>We executed the above described curriculum training process adopting the following choices
for the configuration of the PPO algorithm implemented in ML-Agents:
• the employed neural network employed is a fully connected network with 2 hidden layers
of 256 nodes each; a larger network leads to longer training times but it does not improve
the quality of the achieved results, whereas a smaller network does not converge to a
reasonable policies;</p>
        <p>• we employed a basic PPO with curiosity mechanisms [18], combining extrinsic reward
signal (strength 1) and a very moderate intrinsic signal (strength 0.004);
• we adopted a very high number for max_steps (maximum episode duration) to let the
curriculum guide the actual training, rather than predefined parameters.</p>
        <p>The overall training time with the defined curriculum varies according to diferent factors,
but on a Windows based PC employing an AMD Ryzen 5 2600 (3.4 GHz) provided with 16 GB
RAM, employing only the CPU7 would require around 1 hour and half. Figure 3 shows the
trend of the cumulative reward; in particular, the shown value is computed averaging out the
cumulative reward achieved by agents in 12 episodes within the associated environment.</p>
        <p>The diferent colors highlight the duration of the diferent scenarios of the curriculum: as
expected the reward drops in a very significant way when agents moves on from a step of the
curriculum to the next one, but through time the training converges. It is also apparent that
diferent environments in the curriculum have a diferent dificulty, at least within this specific
training process. The TaskBeforeExit environment seems particularly challenging, at least in
terms of slow convergence to the necessary threshold for advancing to the next stage of the
curriculum: this is due to the fact that the agent repeatedly passes nearby the final exit and
therefore is tempted to move toward it despite it still has subgoals to reach. Once again, an
ablation study of the curriculum would be important and it is object of future works.</p>
        <p>7The adopted version of ML-Agents suggests doing so, since it would not properly exploit a GPU.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Analysis of Achieved Results</title>
      <p>In line with the goal of evaluating the possibility to achieve a policy and an agent model able to
generalize the experience acquired in the training process, we tested the achieved pedestrian
model in some specific environments not included in the curriculum.</p>
      <p>In particular, Figure 4 shows the “Cross” environment, in which the agent must reach two
goals situated in opposite rooms before moving toward the exit situated in the upper part of
the environment. The left diagram shows the trajectory achieved by adopting the built-in
Unity AI navigation system 8 in which specific intermediate points are indicated in terms of
absolute coordinates and the A* algorithm is used to achieve the trajectory. Strictly speaking,
this is not comparable to the one achieved through our approach, since our agent does not
know the coordinates of the subgoals. Moreover, the goal of our simulations is not to minimize
the travel time, but to achieve trajectories (and walking speeds) that are close to those that a
human pedestrian would follow. From this perspective, the results achieved in this scenario
are interesting since the pedestrian does not make sharp bends to minimize the lenght of the
trajectory reaching points very close to the walls, but it rather makes smoother bends, unless a
180° turn is the most sensible behavior (i.e. after reaching the subgoals). The diagram on the
right, showing the distance among the points in space among the two agents throughout the
simulation, highlights that the RL agent mostly difers in the observable behavior in the bends
before reaching the first subgoal and after reaching the second one.</p>
      <p>Figure 5 shows the same kind of diagrams and analyses in a much more complicated scenario,
denoted as Supermarket, resembling a mini market in which diferent regions can include
subgoals of interest to the agent. Here the agent must follow signposts associated to intermediate
targets (passages among sections of the market) and pursue the subgoals associated to its own
“shopping list”. Also in this case, the RL agent generates a longer trajectory, avoiding getting
too close to walls and taking smoother bends (in this case no 180° turn is plausible, and the only
case in which a 90° bend would be reasonable the adopted trajectory is rather smooth).</p>
      <p>
        These results by no means represent a complete validation of the model, but rather an
illustration of the achieved results, which are promising although the approach present significant
limits (first of all, unlike in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] social aspects and the fact that multiple pedestrians can be
8See https://docs.unity3d.com/Packages/com.unity.ai.navigation@1.1/manual/NavInnerWorkings.html
present at the same time in the same environment were not considered in this experiment).
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Towards an Explainable and Systematic Training Process?</title>
      <p>The proposed approach based on the exploitation of a curriculum aims at creating a unique
pedestrian model able to make the agent move in an unknown environment, exploiting its
previous experience in simpler scenarios: the choice of these simple scenarios is of paramount
importance, since they represent a set of “basic, archetypal situations” that ideally would be a
good representation of (parts of) any unknown environment.</p>
      <p>A problem in exploiting this approach is that we get a unique final model, where it is
impossible to identify the part of the model related to a specific scenario: this means that if the
model performs poorly in new unknown environments, it is plausible to consider evaluating
if it is a situation that the present curriculum does not face suficiently well or skips at all.
This means that either a scenario already present in the curriculum must be changed (either in
spatial structure or other relevant parameters, such as the threshold for completion), or a brand
new scenario must be conceived and inserted in the curriculum. This also implies identifying
the proper position in the sequence of scenarios. Of course, then, the whole training process
must be executed again to achieve a new policy, a new behavioral model for the agent. This
obviously represents a problem, both for sake of technological transfer, and especially should
this approach be adopted in a distributed computing setting or (even worse) in a federated
learning [21] context.</p>
      <p>Our proposal to tackle these issues requires the definition of a semantic and explicit
formulation of the notion of scenario, intended goals (supporting the acquisition of a given competence)
and dificulties proposed to the agent in training. This basic element could first of all make
visible the design process in the definition of the overall curriculum, making it more
understandable for the modeler and for future sharing. First, this change would enable an explanation
of the training process and a sort of justification of why the trained agent can perform some
kind of behavioral pattern. Second, in case of implausible or problematic agent performance
in some new environment, it could support the identification of scenario(s) to be changed to
better face the new challenges strengthening some trained competence, or the creation of a new
scenario to face newly identified training goals. Of course this implies remaining in the same
deep structure of the model (agent’s perceptions, actions, environmental building blocks).</p>
      <p>In the long run, this could even lead to the conception of an automated process, integrated in
the overall knowledge of the agent, supporting the self-revision of the training process or even
a federated learning scenario in which multiple training agents could share experiences.</p>
      <p>A longer term and even more challenging research direction, in the vein of recent proposals
in cognitive sciences [22], would lead to the definition of a more articulated agent architecture
comprising a knowledge-level component (potentially supported by sub-symbolic components)
(i) inspecting the environment (or portion of environment) the agent is situated into and
identifying the current situation, (ii) selecting / retrieving the most appropriate decision making
model to be employed from a repertoire of pre-trained behavioral models (even achieved by
means of diferent Machine Learning techniques or any inference mechanism for deciding how
to move within a certain environmental condition). Also, potentially the agent could share
experiences / models within a community of agents of this type by means of some collaboration
mechanism, exploiting other’s agents experience in diferent environments (so, improving the
obtained behavioral models) and speeding-up the learning phase.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>
        The paper has presented a research efort aimed at experimenting the adequacy of applying RL
techniques to pedestrian simulation, especially considering the need to achieve general models
applicable to a wide range of situations without the need of performing a training for each
analyzed scenario. In addition to the longer term goals discussed in the previous section, a list
of considerations of more immediate concern can be provided:
• we did not show a quantitative analysis of the achieved results, also for sake of space:
this analysis, representing a first step in the direction of model validation, is object of
current and future works;
• an extensive analysis of the efects of changes in RL algorithm, hyperparameters,
configuration of the curriculum: we reached the presented solution performing some comparisons
with alternative settings, but a systematic analysis of each of these aspect would require
a focused specific work;
• additional quantitative experiments to improve the evaluation of the achieved results on
the side of pedestrian simulation may be performed, possibly with a comparison with
results of experimental observations, towards a validation of the model;
• overcoming some current limits in the expressiveness of the model: we focused here on
wayfinding, while in a previous work we achieved a more general model for pedestrian
operation decisions [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Group presence and social influence [ 23], for instance, were not
considered.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>Giuseppe Vizzari and Daniela Briola carried out this work within the context of MOST -
Sustainable Mobility National Research Center and received funding from the European
Union—NextGenerationEU (National Sustainable Mobility Center CN00000023, Italian Ministry of University
and Research Decree n. 1033—17/06/2022, Spoke 8).
//doi.org/10.1080/15472450.2019.1621756. doi:10.1080/15472450.2019.1621756.
arXiv:https://doi.org/10.1080/15472450.2019.1621756.
[12] X. Zhao, L. Xia, J. Zhang, W. Song, Artificial neural network based modeling on
unidirectional and bidirectional pedestrian flow at straight corridors, Physica A:
Statistical Mechanics and its Applications 547 (2020) 123825. URL: https://www.sciencedirect.
com/science/article/pii/S0378437119321272. doi:https://doi.org/10.1016/j.physa.
2019.123825.
[13] C. Cheng, A. Kolobov, A. Swaminathan, Heuristic-guided reinforcement
learning, in: M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, J. W. Vaughan
(Eds.), Advances in Neural Information Processing Systems 34: Annual Conference
on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021,
virtual, 2021, pp. 13550–13563. URL: https://proceedings.neurips.cc/paper/2021/hash/
70d31b87bd021441e5e6bf23eb84a306-Abstract.html.
[14] L. Crociani, G. Vizzari, S. Bandini, Modeling environmental operative elements in
agent-based pedestrian simulation, Collective Dynamics 5 (2020) 508–511. URL: https:
//collective-dynamics.eu/index.php/cod/article/view/A85. doi:10.17815/CD.2020.85.
[15] S. Paris, S. Donikian, Activity-driven populace: A cognitive approach to crowd simulation,</p>
      <p>IEEE Computer Graphics and Applications 29 (2009) 34–43. doi:10.1109/MCG.2009.58.
[16] M. Haghani, M. Sarvi, Imitative (herd) behaviour in direction decision-making hinders
eficiency of crowd evacuation processes, Safety Science 114 (2019) 49–60. URL: https:
//www.sciencedirect.com/science/article/pii/S0925753518309275. doi:https://doi.org/
10.1016/j.ssci.2018.12.026.
[17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy
optimization algorithms, CoRR abs/1707.06347 (2017). URL: http://arxiv.org/abs/1707.06347.
arXiv:1707.06347.
[18] D. Pathak, P. Agrawal, A. A. Efros, T. Darrell, Curiosity-driven exploration by
selfsupervised prediction, in: D. Precup, Y. W. Teh (Eds.), Proceedings of the 34th International
Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017,
volume 70 of Proceedings of Machine Learning Research, PMLR, 2017, pp. 2778–2787. URL:
http://proceedings.mlr.press/v70/pathak17a.html.
[19] F. L. D. Silva, A. H. R. Costa, A survey on transfer learning for multiagent reinforcement
learning systems, Journal of Artificial Intelligence Research 64 (2019) 645–703. doi: 10.
1613/jair.1.11396.
[20] B. Baker, I. Kanitscheider, T. M. Markov, Y. Wu, G. Powell, B. McGrew, I. Mordatch,
Emergent tool use from multi-agent autocurricula, in: 8th International Conference on Learning
Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net,
2020. URL: https://openreview.net/forum?id=SkxpxJBKwS.
[21] C. Zhang, Y. Xie, H. Bai, B. Yu, W. Li, Y. Gao, A survey on federated learning,
KnowledgeBased Systems 216 (2021) 106775. URL: https://www.sciencedirect.com/science/article/pii/
S0950705121000381. doi:https://doi.org/10.1016/j.knosys.2021.106775.
[22] H. Mercier, D. Sperber, The Enigma of Reason, Harvard University Press, Cambridge,</p>
      <p>Massachussets, 2017.
[23] L. Crociani, Y. Zeng, G. Vizzari, S. Bandini, Shape matters: Modelling, calibrating
and validating pedestrian movement considering groups, Simulation Modelling
Prac</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Haghani</surname>
          </string-name>
          ,
          <article-title>The notion of validity in experimental crowd dynamics</article-title>
          ,
          <source>International Journal of Disaster Risk Reduction</source>
          <volume>93</volume>
          (
          <year>2023</year>
          )
          <article-title>103750</article-title>
          . URL: https://www.sciencedirect. com/science/article/pii/S2212420923002303. doi:https://doi.org/10.1016/j.ijdrr.
          <year>2023</year>
          .
          <volume>103750</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Martinez-Gil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lozano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <article-title>Emergent behaviors and scalability for multiagent reinforcement learning-based pedestrian models</article-title>
          ,
          <source>Simulation Modelling Practice and Theory</source>
          <volume>74</volume>
          (
          <year>2017</year>
          )
          <fpage>117</fpage>
          -
          <lpage>133</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/ S1569190X17300503. doi:https://doi.org/10.1016/j.simpat.
          <year>2017</year>
          .
          <volume>03</volume>
          .003.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <article-title>Reinforcement Learning, an Introduction (Second Edition)</article-title>
          , MIT Press,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Vizzari</surname>
          </string-name>
          , T. Cecconello,
          <article-title>Pedestrian simulation with reinforcement learning: A curriculumbased approach</article-title>
          ,
          <source>Future Internet</source>
          <volume>15</volume>
          (
          <year>2023</year>
          ). URL: https://www.mdpi.com/1999-5903/15/1/12. doi:
          <volume>10</volume>
          .3390/fi15010012.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Louradour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <article-title>Curriculum learning</article-title>
          ,
          <source>in: Proceedings of the 26th Annual International Conference on Machine Learning</source>
          , ICML '09,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2009</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          . URL: https://doi.org/10. 1145/1553374.1553380. doi:
          <volume>10</volume>
          .1145/1553374.1553380.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Vizzari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Crociani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandini</surname>
          </string-name>
          ,
          <article-title>An agent-based model for plausible wayfinding in pedestrian simulation</article-title>
          ,
          <source>Engineering Applications of Artificial Intelligence</source>
          <volume>87</volume>
          (
          <year>2020</year>
          )
          <article-title>103241</article-title>
          . URL: https://www.sciencedirect.com/science/article/pii/S0952197619302246. doi:https: //doi.org/10.1016/j.engappai.
          <year>2019</year>
          .
          <volume>103241</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Helbing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Molnár</surname>
          </string-name>
          ,
          <article-title>Social force model for pedestrian dynamics</article-title>
          ,
          <source>Phys. Rev. E</source>
          <volume>51</volume>
          (
          <year>1995</year>
          )
          <fpage>4282</fpage>
          -
          <lpage>4286</lpage>
          . URL: https://link.aps.org/doi/10.1103/PhysRevE.51.4282. doi:
          <volume>10</volume>
          .1103/ PhysRevE.51.4282.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Schadschneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Klingsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klüpfel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kretz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rogsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Seyfried</surname>
          </string-name>
          , Evacuation Dynamics:
          <article-title>Empirical Results, Modeling and Applications</article-title>
          , in: R. A.
          <string-name>
            <surname>Meyers</surname>
          </string-name>
          (Ed.),
          <source>Encyclopedia of Complexity and Systems Science</source>
          , Springer,
          <year>2009</year>
          , pp.
          <fpage>3142</fpage>
          -
          <lpage>3176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Andresen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chraibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Seyfried</surname>
          </string-name>
          ,
          <article-title>A representation of partial spatial knowledge: a cognitive map approach for evacuation simulations</article-title>
          ,
          <source>Transportmetrica A: Transport Science</source>
          <volume>14</volume>
          (
          <year>2018</year>
          )
          <fpage>433</fpage>
          -
          <lpage>467</lpage>
          . URL: https://doi.org/ 10.1080/23249935.
          <year>2018</year>
          .
          <volume>1432717</volume>
          . doi:
          <volume>10</volume>
          .1080/23249935.
          <year>2018</year>
          .
          <volume>1432717</volume>
          . arXiv:https://doi.org/10.1080/23249935.
          <year>2018</year>
          .
          <volume>1432717</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Junges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Klügl</surname>
          </string-name>
          ,
          <article-title>Programming agent behavior by learning is simulation models</article-title>
          ,
          <source>Applied Artificial Intelligence</source>
          <volume>26</volume>
          (
          <year>2012</year>
          )
          <fpage>349</fpage>
          -
          <lpage>375</lpage>
          . doi:
          <volume>10</volume>
          .1080/08839514.
          <year>2012</year>
          .
          <volume>652906</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tordeux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chraibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Seyfried</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schadschneider</surname>
          </string-name>
          ,
          <article-title>Prediction of pedestrian dynamics in complex architectures with artificial neural networks</article-title>
          ,
          <source>Journal of Intelligent Transportation Systems</source>
          <volume>24</volume>
          (
          <year>2020</year>
          )
          <fpage>556</fpage>
          -
          <lpage>568</lpage>
          . URL: https:
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>