<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Curriculum-Based RL for Pedestrian Simulation: Sensitivity Analysis and Hyperparameter Exploration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuseppe Vizzari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniela Briola</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Pisapia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics</institution>
          ,
          <addr-line>Systems and Communication (DISCo)</addr-line>
          ,
          <institution>University of Milano-Bicocca</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>3579</volume>
      <abstract>
        <p>Deep Reinforcement Learning (DRL) has recently shown encouraging results as a potential approach to the simulation of complex systems, in particular pedestrians and crowds. Curriculum-based approaches, in addition to reward design, represent conceptual and practical tools supporting the integration of domain knowledge and the modeler's expertise into the agent training process, significantly reducing manual modeling effort while still granting the possibility to achieve plausible results in a relatively wide set of situations. Some of the workflows proposed in the literature, however, did not systematically analyze the sensitivity of the overall approach to changes in the model and in the hyperparameters used to achieve the proposed results. The present contribution represents a step in this direction, providing a set of experiments (i) showing that curriculum-based DRL models effectively grant a higher level of generalization compared to models trained even in challenging single scenarios, at the cost of relatively little overhead, and (ii) showing the effect of changes both in the model configuration (in particular the action model) and in the hyperparameters of the learning algorithm, suggesting lines for new research in the field to overcome current limitations.</p>
      </abstract>
      <kwd-group>
        <kwd>agent-based simulation</kwd>
        <kwd>pedestrian simulation</kwd>
        <kwd>reinforcement learning</kwd>
        <kwd>curriculum learning</kwd>
        <kwd>hyperparameter exploration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Pedestrian and crowd simulation represents a simultaneously inter- and multidisciplinary research
area, gathering contributions from disciplines ranging from social psychology, to applied mathematics,
to engineering, as well as a consolidated context of application for commercial tools, used on
a daily basis by designers, planners and decision makers. Research on this topic initially developed
as an additional area of investigation within the much more consolidated transportation research,
despite the apparent difference between flows of vehicles and pedestrians, which have extremely different
constraints [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; the interest in granting designers and planners the possibility to produce plausible forecasts about the
actual utilization of space has led to the acquisition of a substantial body
of knowledge about empirical evidence, modeling and applications [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which supported an effective
technology transfer. For example, the social force model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is officially employed within PTV Viswalk,
a very successful commercial simulator.
      </p>
      <p>
        Research in this field is still very active, aiming to improve the quality of the achieved results and
to extend the range of considered phenomenologies: in particular, one research direction that has
witnessed a significant growth of attention explores the possibility to apply recent results
in Artificial Intelligence, and especially Deep Learning techniques, to the modeling of pedestrian and
crowd behaviour. The activity of a human modeler, i.e. the user of a (potentially commercial) tool, implies
a number of activities and decisions to describe how the model is applied to a specific context: typically
this means importing a CAD file of the environment being investigated (e.g. a newly designed
structure, the premises in which a crowded event takes place), and annotating it with information on how
pedestrians enter and circulate in the area (i.e. where they go, how they use the environment). Some of
these activities require information about the actual dynamics that can be observed in the environment
(e.g. typical attendance) or plausible/informed assumptions. Human intervention is therefore required at two
levels: the definition of the general model for pedestrian behaviour (e.g. the social force model) and its specific
application to a given context (e.g. types of pedestrians, their number, initial positions and respective
goals in the environment). The pedestrian dynamics research community, however, started a systematic
acquisition and sharing of empirical data from studies, observations and experiments several years ago, in an open
science effort supported by several research projects. Not surprisingly, in the past few
years several research efforts have started investigating the possibility to leverage
the available data to learn pedestrian models, first of all specific to a situation, and hopefully of more
general applicability in the future. It must be noted that this kind of activity is at the same time related
to but different from trajectory forecasting [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where the temporal window associated with the horizon of
prediction is generally limited to a few seconds, with a focus on a specific and relatively limited area.
      </p>
      <p>In line with recent research results aiming at exploiting Deep Reinforcement Learning (DRL)
techniques [5] for achieving a sufficiently general behavioural model for a pedestrian agent positioned
and moving in an environment, the present contribution employs a curriculum-based [6] approach
that, together with a careful reward function design phase, allowed us to exploit expertise on the
simulated phenomenon and to achieve a behavioural model for a pedestrian agent showing promising
results. An RL approach to pedestrian simulation was experimented with in [7]: the authors defined a
perception model providing the agent with relevant information about a finite set of nearby agents,
the nearest obstacle and the final goal, and they defined an action model that basically regulates the agent's
velocity vector in terms of angle variation and acceleration/deceleration. This approach inspired a first
version of the model discussed in the present paper, whose initial results [8] showed the practical
possibility to achieve plausible results. This contribution presents the outcomes of a set of experiments
performing a sensitivity analysis, changing some model elements, and exploring the implications of
changes in the hyperparameters of the initially proposed training process. The models we want to
achieve represent an alternative to existing path planning models and pedestrian agent control
mechanisms, in situations where the goal of the agent is to reach a final target, passing by intermediate
targets if necessary, while showing realistic pedestrian behaviour: an immediate exploitation of this work
may be its inclusion in Unity, the largely adopted Game Engine used for these experiments,
whose built-in model for moving avatars follows the shortest path, resulting in unrealistic
movement of the avatar.</p>
      <p>The paper is organized as follows: Section 2 presents the baseline RL model and Section 3 introduces
the curriculum-based approach. Section 4 provides an overview of the simulator supporting the
experiments, while Section 5 describes the rationale of the overall analysis and presents selected results,
supporting the claim that this research line is worth continuing despite the current limitations. Section
6 discusses some immediate research directions that could improve the practical applicability of the
achieved results.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Reinforcement Learning Pedestrian Model</title>
      <sec id="sec-2-1">
        <title>2.1. Representation of the Environment</title>
        <p>For the experimental study presented in this paper, we adopted environments of 20 × 20 metres
surrounded by walls, with different internal structures, although the models and simulator can work
with environments of different sizes. Walls and obstacles are represented in gray, while violet rectangles are
intermediate and final targets. These violet areas are markers: they do not prevent the possibility of
moving through them, but they are perceivable by agents as gateways, mid-sections of bends, or
exits, and they support the agent's navigation of the environment. The modeler must therefore perform an
annotation of the environment before using it, as shown for example in Figure 1(a).</p>
        <p>[Figure 1: (a) Environment elements; (b) Rays of perception]</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Agent Perception</title>
        <p>Agents perceive the environment by means of a set of projectors generating rays extending up to 14 m
(in these experiments) that provide indications on what is hit and its distance from the agent. Projectors
are distributed around the agent according to the rule α_i = min(α_{i-1} + ratio · i, max_vision), where ratio
has been set to 1.5, max_vision to 90° and α_0 to 0°. This grants a more granular perception in the direction
of movement and a sparser perception of the sides of the agent (see Fig. 1(b)). There are thus 23
angles, and for each of them two rays are projected to collect information supporting both navigation
among different rooms and within rooms, and agent avoidance. The agent is also provided with cones
in which it can detect the presence of walls and other agents, supporting close-range avoidance
behaviours.</p>
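        <p>The projector placement rule can be sketched in a few lines of Python. This is our reading of the recursive rule with the stated parameters (ratio = 1.5, max_vision = 90°, α_0 = 0°), written as a stand-alone illustration and not code taken from the simulator, which is implemented in C# within Unity:</p>

```python
# Sketch of the projector placement rule: alpha_i = min(alpha_{i-1} + ratio * i,
# max_vision), with ratio = 1.5, alpha_0 = 0 and max_vision = 90 degrees.
def projector_angles(ratio=1.5, max_vision=90.0, alpha0=0.0):
    angles = [alpha0]
    i, alpha = 1, alpha0
    while alpha < max_vision:
        # the angular gap grows with i, so perception is denser near alpha0
        alpha = min(alpha + ratio * i, max_vision)
        angles.append(alpha)
        i += 1
    # mirror the angles to the other side of the direction of movement
    return sorted({-a for a in angles} | set(angles))

angles = projector_angles()
print(len(angles))  # 23 angles, denser around the direction of movement
```

        <p>With these parameters the rule yields exactly the 23 angles mentioned in the text, spanning from −90° to +90°.</p>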
        <p>
          The overall agent's observation is summarized in Table 1: in addition to the above mentioned
information, it includes the basic agent's state (current velocity). To improve the performance of the neural
networks typically employed by DRL algorithms, all numerical observations have been normalized in the
interval [0, 1].
        </p>
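        <p>The normalization of observations can be illustrated as a simple min-max rescaling; the helper below is our own sketch (the 14 m bound comes from the maximum ray length stated above, while the function itself is not simulator code):</p>

```python
# Min-max normalization of a numerical observation to the [0, 1] interval,
# clamping values outside the known bounds.
def normalize(value, lo, hi):
    return (min(max(value, lo), hi) - lo) / (hi - lo)

print(normalize(7.0, 0.0, 14.0))  # 0.5: a ray hit at 7 m of the 14 m range
```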
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Summary of the agent's observations.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Type</th><th>Observation</th><th>Value</th></tr>
            </thead>
            <tbody>
              <tr><td>Self Information</td><td>Own velocity</td><td>Number</td></tr>
              <tr><td rowspan="2">Walls and targets</td><td>Distance</td><td>Number</td></tr>
              <tr><td>Type/tag</td><td>One Hot Encoding</td></tr>
              <tr><td rowspan="3">Pedestrian</td><td>Distance</td><td>Number</td></tr>
              <tr><td>Direction</td><td>Angle</td></tr>
              <tr><td>Speed</td><td>Scalar</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Action Model</title>
        <p>
          The regulation of the velocity vector related to the agent's movement (magnitude and direction of walking
speed) is the only action managed by the action model. In line with the literature [9], agents take
three decisions per second. Each agent is provided with an individual desired velocity v_des that is
drawn from a normal distribution with an average of 1.5 m/s and a standard deviation of 0.2 m/s. The agent's
action space has therefore been defined as the choice of two continuous values in the [-1, 1] interval
that are used to determine a change in the velocity vector, respectively for magnitude and direction. The
first element causes a change in the walking speed defined by Equation 1:
        </p>
        <p>v_t = min( v_des, max( v_{t-1} + a_speed · (v_des / 2), v_min ) )   (1)</p>
        <p>where v_min is set to 0. According to this equation, the agent is able to reach a complete stop or
the maximum velocity in two actions (i.e. about 0.66 s).</p>
        <p>The second element of the decision determines a change in the agent's direction; in particular, θ_t =
θ_{t-1} + a_dir · 25°. The walking direction can therefore change by up to 25° every 0.33 s; while this angle is plausible
for normal pedestrian walking, the value is arbitrary, and one of the experiments described
in Section 5 evaluates different values for the maximum turning angle.</p>
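        <p>As an illustration, the two-component action update can be sketched as follows; the function and parameter names are ours, and the speed equation is used as we reconstructed it from the text (a change of at most v_des/2 per decision, clamped between v_min = 0 and v_des):</p>

```python
# Sketch of the action model: a_speed and a_dir are the two continuous
# actions in [-1, 1]; v_des is the agent's desired velocity.
def apply_action(speed, heading, a_speed, a_dir, v_des, v_min=0.0, max_turn=25.0):
    # Eq. 1: speed change of at most v_des/2 per decision (3 decisions/s),
    # so a full stop or the maximum velocity is reachable in two actions
    new_speed = min(v_des, max(speed + a_speed * (v_des / 2.0), v_min))
    # second component: a turn of at most max_turn degrees per 0.33 s
    new_heading = heading + a_dir * max_turn
    return new_speed, new_heading

# from a standstill, two maximum accelerations reach v_des
s, h = apply_action(0.0, 0.0, 1.0, 0.0, v_des=1.5)
s, h = apply_action(s, h, 1.0, 0.0, v_des=1.5)
print(s)  # 1.5
```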
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Reward Function</title>
        <p>Any RL approach heavily relies on its reward function, which is the only feedback signal guiding the
learning process. The form of decision making we are dealing with is complex, comprising conflicting
tendencies (e.g. imitation versus the proxemic tendency to preserve personal space) that are generally reconciled
quickly, almost unconsciously, in a combination of individual and collective intelligence that generally
leads to sub-optimal overall performance [10].</p>
        <p>Considering this, we hand-crafted a reward function considering factors recognized as generally
influencing pedestrian behavior, and performed a tuning of the related weights defining the relative
importance of the different factors. The overall reward function, defined in Equation 2, is a piecewise
function assigning an instant reward according to the agent's current situation: reaching the final target,
reaching an intermediate target for the first time, reaching an intermediate target again, having no target
in sight, being in proximity of a wall (closer than 0.6 m), being in proximity of another pedestrian (closer
than 0.6 m, 1 m, or 1.4 m), and each elapsed timestep. Reaching targets yields positive rewards (e.g. +6
and +0.5), whereas proximity to walls and pedestrians and the elapsing of time are penalized (down to −1).</p>
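        <p>A hedged sketch of this piecewise structure follows. Only the case structure and the values +6 (final target) and +0.5 (first intermediate target) are taken from the text; every value marked as a placeholder below is an assumption for illustration, not the actual tuned weight:</p>

```python
# Illustrative sketch of the piecewise reward of Equation 2. The case
# structure follows the text; values marked "placeholder" are NOT the
# paper's tuned weights, only stand-ins to make the sketch runnable.
def step_reward(ev):
    if ev.get("final_target"):
        return 6.0            # value stated in the text
    if ev.get("intermediate_first"):
        return 0.5            # value stated in the text
    r = 0.0
    if ev.get("intermediate_again"):
        r -= 0.5              # placeholder penalty for revisiting a target
    if ev.get("no_target_in_sight"):
        r -= 0.1              # placeholder
    if ev.get("wall_dist", 10.0) < 0.6:
        r -= 0.2              # placeholder
    ped = ev.get("ped_dist", 10.0)
    if ped < 0.6:             # graded proximity penalties, all placeholders
        r -= 0.3
    elif ped < 1.0:
        r -= 0.2
    elif ped < 1.4:
        r -= 0.1
    r -= 0.01                 # placeholder per-timestep penalty
    return r
```

        <p>The graded pedestrian-proximity cases reflect the three distance thresholds of Equation 2: the closer another pedestrian, the larger the penalty.</p>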
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Adopted RL algorithm</title>
        <p>We exploited Proximal Policy Optimization (PPO) [11], a state-of-the-art policy-based RL
algorithm, as implemented by ML-Agents (https://github.com/Unity-Technologies/ml-agents). PPO is a policy gradient algorithm that directly learns the
policy function π, responsible for selecting actions in a given situation, without requiring a value
function (which estimates the expected return of an action in a given state) to drive action selection. Compared to dynamic
programming methods, which rely on value functions, policy gradient methods generally exhibit better
convergence properties but require a larger set of training samples. Policy gradient methods learn
the policy's parameters through a policy score function, denoted as J(Θ), where Θ represents the
policy's parameters. This score function is optimized using gradient ascent, aiming to maximize the
policy's performance. A common way to define the policy score function is through a loss function:</p>
        <p>L(Θ) = E[ log π_Θ(a|s) · A ]   (3)</p>
        <p>where log π_Θ(a|s) represents the log probability of taking action a given state s, and A is the
advantage function, estimating the relative value of the taken action. When the advantage estimate is
positive, the gradient is positive as well, leading to an increase in the probability of selecting the corresponding
action. Conversely, the probabilities of actions associated with negative contributions are decreased.
Through this mechanism, the policy gradually improves by iteratively updating its parameters. An
exploration of the effect of actions in different situations is therefore necessary, but the approach is
fundamentally different from supervised learning, since no annotated dataset is necessary.</p>
        <p>[Figure 2: (a) Bends with Obstacles; (b) Corridor; (c) Intersection; (d) Bidirectional Door]</p>
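        <p>The mechanism of Equation 3 can be illustrated with a toy linear-softmax policy in numpy. This is a generic policy-gradient sketch of our own, not the PPO implementation of ML-Agents (PPO adds a clipped surrogate objective and minibatch updates on top of this idea):</p>

```python
import numpy as np

# Toy illustration of the policy-gradient score L(Theta) = E[log pi(a|s) * A]:
# a linear-softmax policy updated by gradient ascent on one (s, a, A) sample.
rng = np.random.default_rng(0)
theta = np.zeros((4, 3))                      # 4 state features, 3 actions

def policy(state):
    logits = state @ theta
    e = np.exp(logits - logits.max())         # stable softmax
    return e / e.sum()

def update(state, action, advantage, lr=0.1):
    global theta
    probs = policy(state)
    # gradient of log pi(a|s) for a softmax policy: outer(s, one_hot(a) - probs)
    grad_log = np.outer(state, np.eye(3)[action] - probs)
    theta += lr * advantage * grad_log        # gradient ascent on L(Theta)

state = rng.normal(size=4)
before = policy(state)[1]
update(state, action=1, advantage=2.0)        # positive advantage...
after = policy(state)[1]
print(after > before)  # True: the probability of the advantaged action grows
```

        <p>As described in the text, a positive advantage increases the probability of the taken action, while a negative advantage decreases it.</p>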
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Curriculum Based Learning Process</title>
      <sec id="sec-3-1">
        <title>3.1. Curriculum Learning for Reinforcement Learning</title>
        <p>Curriculum Learning [6] was defined with the aim of reducing training times for supervised ML
approaches by adopting a cognitively plausible approach: the proposed examples increase in difficulty
during the training, illustrating gradually more complicated situations to the learning algorithm. Within
the RL context, it has been employed as a transfer learning technique [12]: the idea is that the agent
exploits experiences acquired in simpler scenarios when facing more complex ones within the training
process, in an intra-agent transfer learning scheme. In addition to improving convergence, it has been
reported that in some situations it supported better generalization properties in the learned policies [13].
We adopted this approach especially considering this final aspect: we wanted to train a single model
directly applicable to new environments, without having to perform training for every specific one. The
finally adopted approach first trains agents in a set of scenarios of growing complexity, one at a time,
but then it also provides a final simultaneous retraining of the agent in a selected number of scenarios
before the end of the overall training, to refresh previously acquired competences.</p>
        <p>For the sake of clarity in the remainder of the paper, we define some key concepts more precisely:
• scenario (or step) of the curriculum: a specific environment and specific conditions for considering
this part of the training process completed;
• episode: each scenario generally needs to be experienced several times, each of them called an
episode, to accumulate experience; each episode ends after a maximum time or by achieving the
goal of the scenario;
• cumulative reward: each episode leads to the achievement of a cumulative reward, that is the summation
of the instant rewards achieved as a consequence of each decision and action step;
• completion condition: for a given scenario, the above mentioned condition is modeled as a mathematical test
over the episodes' cumulative rewards: a typical completion condition could be, for example,
“the average cumulative reward of the last 10 episodes is higher than a threshold ℎ”.</p>
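        <p>The example completion condition quoted in the last item can be expressed directly in code; a minimal sketch (the window and threshold values are illustrative):</p>

```python
from collections import deque

# Sketch of the completion condition: "the average cumulative reward of the
# last `window` episodes is higher than a threshold".
def make_completion_check(window=10, threshold=5.0):
    rewards = deque(maxlen=window)
    def record(cumulative_reward):
        rewards.append(cumulative_reward)
        return (len(rewards) == window
                and sum(rewards) / window > threshold)
    return record

done = make_completion_check(window=3, threshold=1.0)
print(done(0.5), done(2.0), done(2.0))  # False False True  (mean 1.5 > 1.0)
```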
      </sec>
      <sec id="sec-3-2">
        <title>3.2. The Proposed Curriculum</title>
        <p>The proposed curriculum is described in Table 2: it includes very simple environments, delimited just by
perimeter walls, in which the agent learns how to steer to look for the final target and walk towards it;
then it has to face situations in which the environment is narrow (a basic corridor) and in which bends
are present. Then social interaction is introduced, first with agents with compatible directions,
then with conflicting ones, in geometries presenting bends and even bottlenecks. A selection of training
environments is shown in Figure 2, while the complete list of environments is described in Table 2.
Most scenarios are characterised by a certain degree of stochasticity: for instance, the initial position
(and facing) of the agent is randomly determined (within a given area), and the environment is flipped
during the training process, with the goal of avoiding overfitting. After all the environments have been
proposed and the training has brought the agent to achieve the necessary level of average accumulated
reward in a defined number of consecutive episodes, a retraining phase starts. In this phase, a selection
of scenarios is run simultaneously: this phase ensures that the experience brought by the first scenarios
is not forgotten, and that the final policy can successfully face any of the training scenarios.</p>
        <p>[Figure 3: (a) Omega bends environment; (b) Blind bend environment; (c) Double door counterflow]</p>
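        <p>The overall scheme described above (scenarios proposed one at a time until their completion conditions are met, followed by a simultaneous retraining phase) can be sketched as follows; all names are hypothetical stand-ins, since the actual training loop is driven by the ML-Agents trainer:</p>

```python
# Sketch of the curriculum training scheme. `train_episode` and `is_complete`
# are hypothetical stand-ins for the Unity/ML-Agents side of the system.
def run_curriculum(scenarios, retraining_subset, train_episode, is_complete):
    for scenario in scenarios:          # growing complexity, one at a time
        while not is_complete(scenario):
            train_episode(scenario)
    # final simultaneous retraining on a selection of scenarios, to refresh
    # competences acquired in the early steps of the curriculum
    while not all(is_complete(s) for s in retraining_subset):
        for s in retraining_subset:
            train_episode(s)
```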
        <p>To evaluate the level of generalization achieved through this learning process, we also defined a set of
test scenarios, which are not part of the curriculum and are used afterwards to evaluate the final policy. Figure 3
shows a selection of these environments: bends with different angles, in various combinations, with
crossing pedestrian flows (with different final targets) and even counterflows were considered.</p>
        <p>The choice and order of scenarios is based on knowledge about the simulated phenomena and on
preliminary tests to evaluate the practical convergence of the process, but it is arbitrary: some of the
experiments that were carried out (Section 5) were related to both the evaluation of the higher level of
generalization of the curriculum based approach and changes in the curriculum structure.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. The Simulation and Training System</title>
      <p>The system developed to experiment with the proposed approach is based on Unity (https://unity.com): in particular, the
scenarios, agents, perceptive capabilities, as well as the necessary monitoring components for acquiring
data about the pedestrian dynamics and the data structures representing concepts related to the approach
(e.g. the curriculum), are implemented as Unity Prefabs (https://docs.unity3d.com/Manual/Prefabs.html) and C# scripts. Unity does not directly include
components supporting DRL techniques, but the ML-Agents toolkit (https://github.com/Unity-Technologies/ml-agents) provides both an extension to
the Unity environment as well as a set of Python components enabling training and using DRL-based
agents. In particular, ML-Agents provides a Python (and PyTorch) based trainer able to receive inputs
associated with environmental signals (available actions, observations, and rewards), to manage DRL
learning processes. To connect Unity and the trainer, ML-Agents wraps Unity, defining a
communicator component realizing an inter-process communication with the trainer through a Python
API. The overall architecture is depicted in Figure 4.</p>
      <p>During training, scenarios are run in Unity, while in parallel the ML-Agents trainer process must
also be running, to receive and process signals from the environment and to perform DRL training. Then, the
achieved policies can be saved and used directly within Unity without the need to have the ML-Agents
trainer running (or even installed locally on the machine running the specific Unity instance), thus
realizing the aim of this work, that is, generating realistic pedestrian models to be used, for example,
in simulations where autonomous avatars are needed. We are in the process of organizing an open
repository in which we will share the developed code as well as the training and test scenarios.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Achieved Results</title>
      <p>First results of this approach confirmed the possibility to complete the training process achieving plausible
results also in situations not included in the training curriculum [8]. However, several choices were
arbitrary, and neither a sensitivity analysis for some modeling elements (in particular the maximum turning
angle) nor an exploration of the hyperparameters of the training process had been carried out. In
this work we describe a set of experiments exploring these aspects, acquiring additional insights on the
most promising ways to continue this line of research and on its current (or definitive) limits.</p>
      <p>Figure 5 summarizes the experiments that were carried out. First of all, we focused on the curriculum, backing
up the claim that the curriculum-based approach grants a higher level of generalization compared to
agents trained in a single scenario; we also evaluated the impact of the retraining phase. A second block
of training processes and experiments evaluated alternative choices for the maximum turning angle per
decision step. Then, we evaluated the impact of performing simpler or more complex training
processes by tuning the hyperparameters of the ML-Agents trainer: within this block of experiments
we also considered a discrete action model (supporting only discrete changes of the velocity vector),
as well as an increased presence of stochastic elements in the training scenarios. For the sake of space, we
will only visually present some of the achieved results and we will only comment on the results of some
experiments.</p>
      <sec id="sec-5-1">
        <title>5.1. Curriculum Related Experiments</title>
        <p>The first experiment related to the presence and structure of the curriculum in the training process
compared an agent trained with the baseline curriculum (Baseline CV in Figure 5) with an agent trained
on just the Bidirectional Door scenario (No CV in Figure 5). This experiment might seem unnecessary,
since single scenarios of the proposed curriculum are not sufficiently large and varied to represent all the
variety and difficulties proposed by the union of the scenarios. However, it would be very hard to create
a single environment presenting all of the situations faced by trained agents in the whole curriculum.
Moreover, this experiment provides a quantitative idea of the effects of the curriculum on the
training process duration, as well as of its capability to support a good level of generalization. Figure 6 (a) and
(b) describe the trend of the cumulative reward acquired by trained agents in the baseline curriculum
(Baseline CV) and in a training carried out on just the Bidirectional Door scenario (No CV): the duration
of the training process for just the Bidirectional Door scenario is comparable to the duration of the
whole curriculum-based process (roughly 3500 seconds), excluding the retraining and consolidation
phase. A more complicated single training scenario, covering additional competences more generally
compared to Bidirectional Door, which does not propose bends, would require an even longer training
process. These results corroborate the idea that a gradual acquisition of experiences simplifies and
speeds up the convergence of the overall training process.</p>
        <p>Comparing the two models in one of the test environments, we can see how the model achieved
with the No CV procedure performs poorly. Figure 6(c) shows the quality of both models when facing
the Omega bends environment, simultaneously showing trajectories and walking speeds of 50 runs in
which two agents move from the upper right corner to the lower right one: the quality is much higher
for the Baseline CV model; the agents trained solely on the Bidirectional Door scenario have learned how
to manage social interaction, at least in that kind of environment, but they have a hard time navigating
through bends without slowing down frequently and significantly.</p>
        <p>We have also evaluated the impact of the retraining and consolidation process (not reported due to
space limitations): while they surely have a noticeable cost in terms of training time, results showed that
they have an important effect in allowing the Baseline CV agents to remember how to face situations
encountered in the early steps of the curriculum.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Modeling Alternatives and Hyperparameters</title>
        <p>An arbitrary modeling choice was the maximum pedestrian turning angle per step, set to 25°:
within this experiment we trained agents that could turn up to 35° (Baseline CV 35°) and 45° (Baseline
CV 45°) per decision step. Differences in the duration of the training process were minimal, but the
observable behaviour of the achieved policies was instead very different, as shown in Figure 7. Figure 7(a)
shows the different paths followed by two agents (first and second group) simultaneously present in
the Omega bends environment (again, 50 episodes are shown): Baseline CV 35° agents have a hard time
moving in the second U turn, as well as in some of the corridors, slowing down unnecessarily and often
moving very close to the walls; Baseline CV 45° agents move more regularly, but their trajectories have
quite a high variability, and they sometimes slow down without apparent reason. Figure 7(b) shows
the frequency of adopted rotation angles: Baseline CV agents often perform small adjustments of the angle,
but they also take all-right/all-left turning decisions; Baseline CV 35° agents almost only make all-right/all-left
turning decisions; finally, Baseline CV 45° agents instead never take all-right turning decisions, and they
rarely take all-left turning decisions. The overall density of the achieved cumulative reward per episode
is shown in Figure 7(c), confirming that the 25° maximum rotation yields the best results.</p>
        <p>Instabilities in the results of some of the models described so far pushed us to evaluate
the effect of allowing a longer, more thorough training process for the neural network employed by the PPO
algorithm. We considered performing training with a smaller network (just one hidden layer, made
up of 128 units) and smaller batch size and buffer size; we also tried training the same network of the
baseline model (two hidden layers, with 256 hidden units) with a larger batch size and buffer size. The
buffer size is the number of experiences to collect before updating the policy model, and the batch size
is the number of experiences in each iteration of the gradient descent. Results of the training with the
smaller network (CV_TC_low) showed significantly lower training times, associated with good results in
test scenarios in which social interaction was absent or simple (compatible goals, few conflicts), but
bad results in scenarios in which social interactions are frequent and require slowing down and/or
performing sharp collision avoidance maneuvers. Considering the simpler structure of the network,
we also hypothesized that adopting a simpler action model, with discrete instead of continuous actions,
would improve the situation (CV_TC_low_discrete). This prediction turned out to be true, yielding improvements that led to
results comparable with the baseline model at lower training costs: the CV_TC_low_discrete training
process represents a potentially useful approach when social interactions are not frequent (i.e. very low
density scenarios).</p>
        <p>[Figure 6: (a) Cumulative reward trend Baseline CV; (b) Cumulative reward trend No CV; (c) Paths in Omega bends environment]</p>
        <p>[Figure 7: (a) Paths comparison for different turning angles in Omega bends environment; (b) Turning angle frequency; (c) Overall reward comparison]</p>
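        <p>The hyperparameters discussed in this block of experiments map onto the ML-Agents trainer configuration file. The fragment below is an illustrative sketch of that format: the behavior name PedestrianAgent and all numeric values are placeholders for exposition, not the values actually used in the reported training processes.</p>

```yaml
behaviors:
  PedestrianAgent:            # hypothetical behavior name
    trainer_type: ppo
    hyperparameters:
      batch_size: 1024        # experiences per gradient-descent iteration
      buffer_size: 10240      # experiences collected before a policy update
      learning_rate: 3.0e-4
    network_settings:
      num_layers: 2           # CV_TC_low used a single hidden layer
      hidden_units: 256       # 128 in CV_TC_low, 256 in the baseline
    max_steps: 2.0e6
```

        <p>Varying batch_size, buffer_size, num_layers and hidden_units in such a file is how the CV_TC_low and CV_TC_high variants differ from the baseline configuration.</p>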
        <p>The training process employing larger buffer and batch sizes (CV_TC_high) was expected to grant
improvements in case of intense social interaction, due to the possibility to better explore the complex
patterns arising from the interactions of multiple agents. Results, however, showed longer training times
(see Figure ??), improvements in environments with frequent and difficult social interactions, but also
instabilities even in relatively simple test environments. We interpreted this as a sign of overfitting. As
a final attempt in this set of experiments, we decided to further increase the intensity of the stochastic
elements in the individual initialization of the training scenarios, increasing the frequency and extent
of random changes in the initial position of agents, flips, and changes in the spatial structure of the
environments. Interestingly, this led to a decreased training time (from about 8000 s to less than 6500 s):
the increased variety of the situations simplified the convergence to a policy able to stably achieve
higher rewards.
[Figure 8: (a) Paths in Blind bend; (b) Paths in Double door counterflow]
Regarding the quality of the dynamics generated by the model, Figures 8(a) and 8(b) show
patterns of interactions respectively in the Blind bend (4 agents moving from the upper side and two
agents from the lower left corner) and Double door counterflow (8 agents, equally divided between the
upper and lower rooms) test scenarios, again with 50 episodes for each graph. Agents of the CV_TC_high_stoc
model have more stable trajectories, with fewer drops in walking speed, suggesting that increasing the
variety of situations encountered by the agents during training becomes even more important as the
capacity of the neural network employed by the PPO algorithm grows and the training process becomes
more thorough.</p>
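<p>The increased stochasticity in scenario initialization described above can be sketched as follows; this is a minimal Python example in which the offset extent, flip probability, and mirroring axis are illustrative assumptions, not the actual CV_TC_high_stoc settings:</p>

```python
import random

def randomize_scenario(base_positions, extent=0.5, flip_prob=0.5, rng=None):
    """Perturb the agents' initial positions and optionally mirror the
    environment, increasing the variety of training situations.

    base_positions: list of (x, y) spawn points, in metres.
    extent: maximum random offset applied to each coordinate (assumed value).
    flip_prob: probability of mirroring the scenario across the y axis.
    """
    rng = rng or random.Random()
    flipped = flip_prob > rng.random()   # decide once per training episode
    positions = []
    for x, y in base_positions:
        x += rng.uniform(-extent, extent)
        y += rng.uniform(-extent, extent)
        if flipped:
            x = -x   # mirror the spatial structure of the scenario
        positions.append((x, y))
    return positions
```

<p>Drawing such a perturbation at the start of every training episode exposes the policy to a broader distribution of situations without changing the curriculum itself.</p>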
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Limitations</title>
        <p>Even though all the performed experiments employed components, mechanisms, and functionalities
offered by Unity, in principle nothing prevents employing open-source alternatives such as Godot. In
any case, due to difficulties in managing complicated patterns of movement for the 3D models of the
agents, we do not believe that this kind of model could scale to levels of density consistently higher
than 2 pedestrians per square meter.</p>
        <p>The performed experiments do not consider situations in which agents need to reach specific
intermediate points of the environment within their movement plans. Agents trained through this process act here
and now: they essentially depend on the environment and annotations to reach the final target of their
movement. We have worked on extensions of the model supporting exploration of the environment to
reach specific intermediate targets before the final one [14], but these preliminary results do not consider
the social interaction part that is instead central in this work. An integration of these aspects would
be necessary for a realistic application of this kind of approach in real-world pedestrian simulation
systems, and it is the object of current and future work. Validation of the model represents a task that will
be more seriously tackled afterwards; the preliminary results are encouraging but still partial.</p>
        <p>We did not (yet) take a Multi-Agent Reinforcement Learning [15] perspective, and in a sense this
work represents an exploration of the limits that the single-agent approach can reach in situations that
are actually characterized by the simultaneous presence of autonomous agents influencing each other's
actions and performance.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Developments</title>
      <p>This paper presented a curriculum-based DRL approach to pedestrian modeling and simulation. The
model and training approach were presented, as well as a set of experiments (i) showing that
curriculum-based DRL models effectively grant a higher level of generalization compared to models
trained even in challenging scenarios, at the cost of relatively little overhead; and (ii) showing the effect of
changes both in the model configuration (in particular the action model) and in the hyperparameters of the
learning algorithm, and suggesting lines for new research in the field to overcome current limitations.</p>
      <p>Future work aims at extending the model to embed the capability to explore and plan paths in
the environment, granting the possibility to reach specific intermediate targets in a more complicated
environment, as well as at a more thorough validation of the achieved results, which might suggest extensions
to the current curriculum by adding further scenarios supporting the acquisition of additional
movement competences by the agents. The proposed model, as of this moment, only takes a relatively
shallow approach to the evaluation of mutual distances among pedestrians: the recent COVID-19
outbreak has shown that contextual conditions can call for a more granular and individual consideration
of interpersonal distances, potentially considering affective states [16].</p>
      <p>Besides model improvement, and in addition to supporting the activities of designers and decision makers
who need to simulate pedestrians, these models could also be used in Virtual Reality systems
for guiding avatars [17] that should exhibit plausible behaviors.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This work was partly developed within Spoke 8 (MaaS and Innovative Services) of the
National Center for Sustainable Mobility (MOST), set up under the "Piano nazionale di ripresa e resilienza
(PNRR), M4C2, investimento 1.4, 'Potenziamento strutture di ricerca e creazione di campioni nazionali
di R&amp;S su alcune Key Enabling Technologies'", funded by the European Union. Project code CN00000023,
CUP: D93C22000410001. This work was also partially supported by the MUR under the grant
"Dipartimenti di Eccellenza 2023-2027" of the Department of Informatics, Systems and Communication of the
University of Milano-Bicocca, Italy.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] <string-name><given-names>M.</given-names> <surname>Batty</surname></string-name>,
          <article-title>Agent-based pedestrian modeling</article-title>,
          <source>Environment and Planning B: Planning and Design</source>
          <volume>28</volume> (<year>2001</year>) <fpage>321</fpage>-<lpage>326</lpage>.
          doi:10.1068/b2803ed.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] <string-name><given-names>A.</given-names> <surname>Schadschneider</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Klingsch</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Klüpfel</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Kretz</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Rogsch</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Seyfried</surname></string-name>,
          <article-title>Evacuation Dynamics: Empirical Results, Modeling and Applications</article-title>,
          in: <source>Encyclopedia of Complexity and Systems Science</source>, Springer New York,
          <year>2009</year>, pp. <fpage>3142</fpage>-<lpage>3176</lpage>.
          doi:10.1007/978-0-387-30440-3_187.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] <string-name><given-names>D.</given-names> <surname>Helbing</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Molnár</surname></string-name>,
          <article-title>Social force model for pedestrian dynamics</article-title>,
          <source>Phys. Rev. E</source> <volume>51</volume> (<year>1995</year>)
          <fpage>4282</fpage>-<lpage>4286</lpage>. doi:10.1103/PhysRevE.51.4282.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] <string-name><given-names>P.</given-names> <surname>Kothari</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Kreiss</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Alahi</surname></string-name>,
          <article-title>Human trajectory forecasting in crowds: A deep learning perspective</article-title>,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          (<year>2021</year>) <fpage>1</fpage>-<lpage>15</lpage>. doi:10.1109/TITS.2021.3069362.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>