Combining Fast and Slow Thinking for Human-like
and Efficient Navigation in Constrained
Environments
M. Bergamaschi Ganapini1 , M. Campbell2 , F. Fabiano3 , L. Horesh2 , J. Lenchner2 ,
A. Loreggia4 , N. Mattei5 , F. Rossi2 , B. Srivastava6 and K. B. Venable7
1 Union College - USA
2 IBM Research - USA
3 University of Parma - Italy
4 University of Brescia - Italy
5 Tulane University - USA
6 University of South Carolina - USA
7 University of West Florida, IHMC - USA


                                         Abstract
                                         Current AI systems lack several important human capabilities, such as adaptability, generalizability, self-
                                         control, consistency, common sense, and causal reasoning. We believe that existing cognitive theories of
                                         human decision making, such as the thinking fast and slow theory, can provide insights on how to advance
                                         AI systems towards some of these capabilities. In this paper, we propose a general architecture that is based
                                         on fast/slow solvers and a metacognitive component. We then present experimental results on the behavior
                                         of an instance of this architecture, for AI systems that make decisions about navigating in a constrained
                                         environment. We show how combining the fast and slow decision modalities, which can be implemented
                                         by learning and reasoning components respectively, allows the system to evolve over time and gradually
                                         pass from slow to fast thinking with enough experience, and that this greatly helps in decision quality,
                                         resource consumption, and efficiency.

1. Introduction
AI systems have seen great advances in recent years in many applications that pervade our
everyday life. However, we are still mostly seeing instances of narrow AI that are typically focused
on a very limited set of competencies and goals, e.g., image interpretation, natural language
processing, classification, prediction, and many others. Moreover, while these successes can be
attributed to improved algorithms and techniques, they are also tightly linked to the availability
of huge datasets and computational power [1]. State-of-the-art AI still lacks many capabilities

AAAI 2022 FALL SYMPOSIUM SERIES, Thinking Fast and Slow and Other Cognitive Theories in AI, November
17-19, 2022, Westin Arlington Gateway in Arlington, Virginia, USA
" bergamam@union.edu (M. B. Ganapini); mcam@us.ibm.com (M. Campbell); francesco.fabiano@unipr.it
(F. Fabiano); lhoresh@us.ibm.com (L. Horesh); lenchner@us.ibm.com (J. Lenchner); andrea.loreggia@unibs.it
(A. Loreggia); andrea.loreggia@unibs.it (N. Mattei); Francesca.Rossi2@ibm.com (F. Rossi); BIPLAV.S@sc.edu
(B. Srivastava); bvenable@uwf.edu (K. B. Venable)
 0000-0001-8158-894X (M. Campbell); 0000-0002-1161-0336 (F. Fabiano); 0000-0001-6350-0238 (L. Horesh);
0000-0002-9427-8470 (J. Lenchner); 0000-0002-9846-0157 (A. Loreggia); 0000-0002-3569-4335 (N. Mattei);
0000-0001-8898-219X (F. Rossi); 0000-0002-7292-3838 (B. Srivastava); 0000-0002-1092-9759 (K. B. Venable)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
that would naturally be included in a notion of (human) intelligence, such as generalizability,
adaptability, robustness, explainability, causal analysis, abstraction, common sense reasoning,
ethical reasoning [2, 3], as well as a complex and seamless integration of learning and reasoning
supported by both implicit and explicit knowledge [4].
   We believe that a better study of the mechanisms that allow humans to have these capabilities
can help [5]. We focus especially on D. Kahneman’s theory of thinking fast and slow [6], and
we propose a multi-agent AI architecture (called SOFAI, for SlOw and Fast AI) where incoming
problems are solved by either System 1 (or “fast") agents (also called “solvers"), that react
by exploiting only past experience, or by System 2 (or “slow") agents, that are deliberately
activated when there is the need to reason and search for optimal solutions beyond what is
expected from the System 1 agents. Given the need to choose between these two kinds of
solvers, a meta-cognitive agent is employed, performing introspection and arbitration roles, and
assessing the need to employ System 2 solvers by considering resource constraints, abilities of
the solvers, past experience, and expected reward for a correct solution of the given problem
[7, 8]. Many approaches to the design of AI systems have been inspired by the dual-system theory
[9, 10, 11, 12, 13, 14, 15].
   In this paper we describe the SOFAI architecture, characterizing the System 1 and System
2 solvers and the role of the meta-cognitive agent, and provide motivations for the adopted
design choices. We then focus on a specific instance of the SOFAI architecture, which provides the
multi-agent platform for generating trajectories in a grid environment with penalties over states,
actions, and state features. In this instance, decisions are at the level of each move from one grid
cell to another. We show that the combination of fast and slow decision modalities, which can be
implemented as learning and reasoning components, allows the system to create trajectories that
are more similar to human-like ones than those obtained using only one of the modalities. Human-likeness
is here exemplified by the trajectories built by a Multi-alternative Decision Field Theory (MDFT)
model [16], which has been shown to mimic the way humans decide among several alternatives
(in our case, the possible moves from a grid state), taking into account non-rational behaviors
related to the similarity of the alternatives. Moreover, the SOFAI trajectories are shown to generate a better reward
and to require a shorter decision time overall. We also illustrate the evolution of the behavior
of the SOFAI system over time, showing that, just like in humans, initially the system mostly
uses the System 2 decision modality, and then passes to using mostly System 1 when enough
experience over moves and trajectories is collected.


2. Thinking Fast and Slow in AI
In this section we give an overview of the SOFAI architecture; additional details are available in
the Appendix. SOFAI is a multi-agent architecture (see Figure 1) where incoming problems are
initially handled by those System 1 (S1) solvers that possess the required skills to tackle them,
analogous to what is done by humans who first react to an external stimulus via their System 1.
Figure 1: The SOFAI architecture. (Block diagram: incoming tasks are first handled by the System 1
solver, which is based on past experiences, acts in O(1), and activates autonomously, passing a
proposed solution and confidence to the meta-cognition module; the meta-cognition module chooses
between adopting the S1 solution and activating a System 2 solver, assessing the value of success,
the available resources, and the trustworthiness of the solvers through a two-phase assessment;
System 2 solvers employ reasoning, consume more resources, and are activated only by the
meta-cognitive module; the model/solver updater autonomously updates the model of the world
(knowledge about the environment impacted by the agent's decisions), the model of self (past
decisions and their reward), and the model of others (knowledge and beliefs about other agents
impacting the same environment), and (re)trains the S1 solvers.)


2.1. Fast and Slow Solvers
As mentioned, incoming problems trigger System 1 (S1) solvers. We assume such solvers act in
constant time, i.e., their running time is not a function of the size of the input problem instance, by
relying on the past experience of the system, which is maintained in the model of self. The model
of the world contains the knowledge accumulated by the system over the external environment
and the expected tasks, while the model of others contains the knowledge and beliefs about other
agents who may act in the same environment. The model updater agent acts in the background to
keep all models updated as new knowledge of the world, of other agents, or new decisions are
generated and evaluated.
   Once an S1 solver has solved the problem (for the sake of simplicity, assume a single S1 solver),
the proposed solution and the associated confidence level are available to the meta-cognitive
(MC) module. At this point the MC agent starts its operations, with the task of choosing between
adopting the S1 solver’s solution or activating a System 2 (S2) solver. S2 agents use some form of
reasoning over the current problem and usually consume more resources (especially time) than S1
agents. Also, they never work on a problem unless they are explicitly invoked by the MC module.
   To make its decision, the MC agent assesses the current resource availability, the expected
resource consumption of the S2 solver, the expected reward for a correct solution for each
of the available solvers, as well as the solution and confidence evaluations coming from the
S1 solver. In order not to waste resources at the meta-cognitive level, the MC agent includes
two successive assessment phases, the first one faster and more approximate, related to rapid
unconscious assessment in humans [17, 18], and the second one (to be used only if needed) more
careful and resource-costly, analogous to the conscious introspective process in humans [19]. The
next section will provide more details about the internal steps of the MC agent.
   This architecture and flow of tasks allows for minimizing time to action when there is no need
for S2 processing since S1 solvers act in constant time. It also allows the MC agent to exploit the
proposed action and confidence of S1 when deciding whether to activate S2, which leads to more
informed and hopefully better decisions by the MC.
   Notice that we do not assume that S2 solvers are always better than S1 solvers, analogously
to what happens in human reasoning [20]. Take for example complex arithmetic, which usually
requires humans to employ System 2, vs perception tasks, which are typically handled by our
System 1. Similarly, in the SOFAI architecture we allow for tasks that might be better handled by
S1 solvers, especially once the system has acquired enough experience on those tasks.
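   To make the flow just described concrete, the following minimal Python sketch shows one possible way to wire a task through an S1 solver, the meta-cognitive module, and an optional S2 solver; all class and function names (Proposal, solve_task, and the solver/MC callables) are illustrative and not part of an actual SOFAI implementation.

    # Minimal sketch of the SOFAI control flow (illustrative names, not the actual implementation).
    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class Proposal:
        solution: Any        # action or plan proposed by a solver
        confidence: float    # solver's self-assessed confidence, in [0, 1]

    def solve_task(task: Any,
                   s1_solver: Callable[[Any], Proposal],
                   s2_solver: Callable[[Any], Any],
                   metacognition: Callable[[Any, Proposal], bool],
                   update_models: Callable[[Any, Any], None]) -> Any:
        """S1 reacts first in constant time; the MC module then decides whether to
        adopt its proposal or to explicitly activate S2; models are updated afterwards."""
        proposal = s1_solver(task)             # fast, experience-based reaction
        if metacognition(task, proposal):      # True = adopt S1, False = activate S2
            decision = proposal.solution
        else:
            decision = s2_solver(task)         # slower, deliberate reasoning
        update_models(task, decision)          # model/solver updater keeps models current
        return decision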

2.2. The Role of Meta-cognition
We focus on the concept of meta-cognition as initially defined by Flavell [21] and Nelson [22], that is,
the set of processes and mechanisms that could allow a computational system to both monitor and
control its own cognitive activities, processes, and structures. The goal of this form of control is
to improve the quality of the system’s decisions [23]. Among the existing computational models
of meta-cognition [24, 25, 26], we propose a centralized meta-cognitive module that exploits both
internal and external data, and arbitrates between S1 and S2 solvers in the process of solving a
single task. Notice however that this arbitration is different from an algorithm portfolio selection,
which is already successfully used to tackle many problems [27], because of the characterization
of S1 and S2 solvers and the way the MC agent controls them.
   The MC module exploits information coming from two main sources: 1) the system’s internal
models of self, world, and others; 2) the S1 solver(s), providing a proposed decision for a task,
and their confidence in the proposed decision.
   The first meta-cognitive phase (MC1) activates automatically as a new task arrives and a
solution for the problem is provided by an S1 solver. MC1 decides between accepting the solution
proposed by the S1 solver or activating the second meta-cognitive phase (MC2). To do so, MC1
compares the confidence provided by the S1 solver with the risk attitude of the system: if the
confidence is high enough, MC1 adopts the S1 solver's solution. Otherwise, it activates the next
assessment phase (MC2) to make a more careful decision. MC2 first makes sure that there are
enough resources for running S2; if not, it adopts the S1 solver's proposed solution. The rationale for this
phase of the decision process is that we envision that often the system will adopt the solution
proposed by the S1 solver, because it is good enough given the expected reward for solving the
task, or because there are not enough resources to invoke more complex reasoning.
   Contrarily to MC1, MC2 decides between accepting the solution proposed by the S1 solver or
activating an S2 solver for the task. To do this, MC2 evaluates the expected reward of using the
S2 solver in the current state to solve the given task, using information contained in the model
of self about past actions taken by this or other solvers to solve the same task, and the expected
cost of running this solver. MC2 then compares the expected reward for the S2 solver with the
expected reward of the action proposed by the S1 solver: if the expected additional reward of
running the S2 solver, as compared to using the S1 solution, is large enough, then MC2 activates
the S2 solver. Otherwise, it adopts the S1 solution.
   To evaluate the expected reward of the action proposed by S1, MC2 retrieves from the model of
self the expected immediate and future reward for the action in the current state (approximating
the forward analysis to avoid an overly costly computation), and combines this information with the
confidence the S1 solver has in the action. The rationale for the behavior of MC2 is based on the
design decision to avoid costly reasoning processes unless the additional cost is compensated by
an even greater additional expected reward for the solution that the S2 solver will identify for this
task. This is analogous to what happens in humans [7].
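   The two assessment phases can be sketched as follows; the thresholds, bookkeeping, and return values are placeholders chosen for illustration (the concrete version used in our experiments is Algorithm 1 in Section 3).

    # Illustrative sketch of the two meta-cognitive phases (placeholders, not Algorithm 1).
    def mc1(s1_confidence: float, risk_attitude: float) -> bool:
        """Fast, approximate gate: adopt the S1 solution if its confidence clears the
        system's risk attitude; otherwise defer to the more careful MC2 phase."""
        return s1_confidence >= risk_attitude

    def mc2(exp_reward_s1: float, exp_reward_s2: float,
            exp_cost_s2: float, remaining_resources: float,
            gain_tolerance: float) -> str:
        """Careful assessment: activate S2 only if it is affordable and the expected
        extra reward justifies the extra cost of reasoning."""
        if exp_cost_s2 > remaining_resources:
            return "adopt_s1"                      # not enough resources for S2
        gain = (exp_reward_s2 - exp_reward_s1) / max(exp_cost_s2, 1e-10)
        return "activate_s2" if gain > gain_tolerance else "adopt_s1"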


3. Instantiating SOFAI on Grid Navigation
In the SOFAI instance that we consider and evaluate in this paper, the decision environment is a
9 × 9 grid and the task is to generate a trajectory from an initial state 𝑆0 to a goal state 𝑆𝐺 , by
making moves from one state to an adjacent one in a sequence, while minimizing penalties.
   Such penalties are generated by constraints over
moves (there are 8 moves for each state), specific states
(grid cells), and state features (in our setting, these are
colors associated to states). For example, there could be
a penalty for moving left, for going to the cell (1,3), and
for moving to a blue state. In our specific experimental
setting, any move brings a penalty of −4, each constraint
violation gives a penalty of −50, and reaching the goal
state gives a reward of 10.
   This decision environment is non-deterministic: there is a 10% chance of failure, meaning that
the decision of moving to a certain adjacent state may result in a move to another adjacent state
chosen at random. Figure 2 shows an example of our grid decision environment.

Figure 2: Example of the constrained grid decision scenario. Black squares represent states with
penalties. Penalties are also generated when the agent moves left or bottom-right, or when it moves
to a blue or a green state. The red lines describe a set of trajectories generated by the agent (all
with the same start and end point). The strength of the red color for each move corresponds to the
number of trajectories employing that move.

   Given this decision environment, we instantiate the SOFAI architecture as follows: (1) one S1
solver, that uses information about the past trajectories to decide the next move (see below for
details); (2) one S2 solver, that uses MDFT to make the decision about the next move; (3) the MC
agent, whose behavior is described by Algorithm 1; (4) the model of the world: the grid environment;
(5) the model of self: it includes past trajectories and their features
(moves, reward, length, time); (6) no model of others.
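   Before turning to the MC agent, a minimal sketch of this decision environment is given below (9 × 9 grid, penalty of −4 per move and −50 per violated constraint, reward of 10 at the goal, 10% chance of slipping to a random adjacent cell). The class name, the representation of constraints, and the way the goal reward combines with the move penalty are illustrative assumptions.

    import random

    # Sketch of the constrained 9x9 grid environment (representation details are assumptions).
    MOVES = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]  # 8 moves

    class ConstrainedGrid:
        def __init__(self, goal, bad_states, bad_moves, bad_features, features):
            self.goal = goal                  # goal cell S_G, e.g. (8, 8)
            self.bad_states = bad_states      # set of penalized cells
            self.bad_moves = bad_moves        # set of penalized moves, as (dx, dy) tuples
            self.bad_features = bad_features  # set of penalized colors, e.g. {"blue", "green"}
            self.features = features          # dict: cell -> color

        def step(self, state, move):
            # Non-determinism: with probability 0.1 the agent slips to a random adjacent cell.
            if random.random() < 0.1:
                move = random.choice(MOVES)
            nxt = (min(max(state[0] + move[0], 0), 8),
                   min(max(state[1] + move[1], 0), 8))
            reward = -4                                          # cost of any move
            if move in self.bad_moves:
                reward -= 50                                     # constrained action
            if nxt in self.bad_states:
                reward -= 50                                     # constrained state
            if self.features.get(nxt) in self.bad_features:
                reward -= 50                                     # constrained state feature
            if nxt == self.goal:
                reward += 10   # goal reward (combining it with the move penalty is an assumption)
            return nxt, reward, nxt == self.goal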
   In Algorithm 1:
     • 𝑛𝑇𝑟𝑎𝑗(𝑠𝑥, {𝑆, 𝐴𝐿𝐿}) returns the number of times that, in state 𝑠𝑥, an action computed by
       solver 𝑆 (𝐴𝐿𝐿 means any solver) has been adopted by the system; if this count is below 𝑡1 (a
       natural number), we do not have enough experience yet.
     • 𝑝𝑎𝑟𝑡𝑅𝑒𝑤𝑎𝑟𝑑(𝑇) and 𝑎𝑣𝑔𝑅𝑒𝑤𝑎𝑟𝑑(𝑠𝑥) are respectively the partial reward of the trajectory 𝑇
       and the average partial reward that the agent usually obtains when it reaches state 𝑠𝑥: if their
       ratio is below 𝑡2 (between 0 and 1), we are performing worse than in past experience.
     • 𝑐 is the confidence of the S1 solver: if it is below 𝑡3 (between 0 and 1), the system's risk
       attitude does not tolerate this confidence level.
Algorithm 1 The MC agent
Input (Action 𝑎, Confidence 𝑐, State 𝑠𝑥, Partial Trajectory 𝑇)
 1: if 𝑛𝑇𝑟𝑎𝑗(𝑠𝑥, 𝐴𝐿𝐿) ≤ 𝑡1 or 𝑝𝑎𝑟𝑡𝑅𝑒𝑤𝑎𝑟𝑑(𝑇)/𝑎𝑣𝑔𝑅𝑒𝑤𝑎𝑟𝑑(𝑠𝑥) ≤ 𝑡2 or 𝑐 ≤ 𝑡3 then
 2:     if 𝑛𝑇𝑟𝑎𝑗(𝑠𝑥, 𝑆2) ≤ 𝑡6 then
 3:         randomly adopt S1 decision or activate S2 solver
 4:     else
 5:         𝑒𝑥𝑝𝐶𝑜𝑠𝑡𝑆2 ← 𝑒𝑥𝑝𝑇𝑖𝑚𝑒𝑆2/𝑟𝑒𝑚𝑇𝑖𝑚𝑒
 6:         if 𝑒𝑥𝑝𝐶𝑜𝑠𝑡𝑆2 ≤ 1 and (𝑒𝑥𝑝𝑅𝑒𝑤𝑎𝑟𝑑𝑆2(𝑠𝑥) − 𝑒𝑥𝑝𝑅𝑒𝑤𝑎𝑟𝑑(𝑠𝑥, 𝑎))/𝑒𝑥𝑝𝐶𝑜𝑠𝑡𝑆2 > 𝑡4 then
 7:             Set the attention weights in 𝑊
 8:             Activate the S2 solver
 9:         else
10:             Adopt S1 decision
11:         end if
12:     end if
13: else
14:     Adopt S1 decision
15: end if


  If the tests in line 1 are all passed (that is, none of the conditions is satisfied), the MC system
(MC1) adopts the S1 decision. Otherwise, it performs a more careful evaluation (MC2):
    • 𝑡6 checks that the S2 solver has enough experience. If not, a random choice between S1
      and S2 is made (line 3).
     • Otherwise, it checks whether it is convenient to activate S2 (line 6), comparing the expected
       gain in reward normalized by its cost; 𝑡4 gives the tolerance for this gain. If it is convenient,
       MC activates the S2 solver (line 8), otherwise it adopts S1's decision. In this evaluation,
       𝑒𝑥𝑝𝑇𝑖𝑚𝑒𝑆2 and 𝑟𝑒𝑚𝑇𝑖𝑚𝑒 are respectively the average amount of time taken by S2 to
       compute an action and the remaining time to complete the trajectory; 𝑒𝑥𝑝𝑅𝑒𝑤𝑎𝑟𝑑𝑆2(𝑠𝑥)
       and 𝑒𝑥𝑝𝑅𝑒𝑤𝑎𝑟𝑑(𝑠𝑥, 𝑎) are the expected reward of using S2 in state 𝑠𝑥 and the expected
       reward of adopting action 𝑎 (computed by S1) in state 𝑠𝑥. The expected reward for an
       action 𝑎 in a state 𝑠𝑥 is 𝐸(𝑅|𝑠𝑥, 𝑎) = Σ_{𝑟𝑖 ∈ 𝑅𝑠𝑥,𝑎} 𝑃(𝑟𝑖|𝑠𝑥, 𝑎) · 𝑟𝑖, where 𝑅𝑠𝑥,𝑎 is the set of
       all the rewards observed in state 𝑠𝑥 when taking action 𝑎 that are stored in the model of self,
       and 𝑃(𝑟𝑖|𝑠𝑥, 𝑎) is the probability of getting the reward 𝑟𝑖 in state 𝑠𝑥 when taking action 𝑎. As
       the expected reward depends on the past experience stored in the model of self, it is possible
       to compute a confidence as 𝑐(𝑠𝑥, 𝑎) = 𝑠𝑖𝑔𝑚𝑜𝑖𝑑((𝑟 − 0.5)/(𝜎 + 1𝑒−10)), where 𝜎 is the standard
       deviation of the rewards observed in 𝑠𝑥 when taking action 𝑎, and 𝑟 is the probability of taking
       action 𝑎 in state 𝑠𝑥.
   The S1 agent, given a state 𝑠𝑥, chooses the action that maximizes the expected reward based
on past experience, that is, argmax𝑎 (𝐸(𝑅|𝑠𝑥, 𝑎) · 𝑐(𝑠𝑥, 𝑎)). MC1 and MC2 bear some
resemblance to UCB and model-based learning in RL [28]. However, in SOFAI we decompose
some of these techniques to make decisions in a more fine-grained manner.
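   The quantities above can be computed directly from the rewards stored in the model of self. The sketch below assumes past experience is kept as a list of observed rewards per (state, action) pair, with probabilities estimated by relative frequencies; the data structures and helper names are illustrative.

    import math
    from collections import defaultdict

    # Sketch of the S1 choice rule; the storage layout of the model of self is an assumption.
    past_rewards = defaultdict(list)   # (s_x, a) -> list of rewards observed in the past
    action_counts = defaultdict(int)   # s_x -> number of decisions recorded in s_x

    def sigmoid(x: float) -> float:
        # numerically safe logistic function
        if x >= 0:
            return 1.0 / (1.0 + math.exp(-x))
        z = math.exp(x)
        return z / (1.0 + z)

    def expected_reward(s_x, a) -> float:
        rewards = past_rewards[(s_x, a)]
        if not rewards:
            return 0.0
        # E(R | s_x, a) = sum_i P(r_i | s_x, a) * r_i, with P estimated by relative frequency,
        # which reduces to the mean of the observed rewards.
        return sum(rewards) / len(rewards)

    def confidence(s_x, a) -> float:
        rewards = past_rewards[(s_x, a)]
        if not rewards:
            return 0.0
        mean = sum(rewards) / len(rewards)
        sigma = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
        r_freq = len(rewards) / max(action_counts[s_x], 1)   # probability of taking a in s_x
        return sigmoid((r_freq - 0.5) / (sigma + 1e-10))     # c(s_x, a)

    def s1_choose(s_x, actions):
        # argmax_a E(R | s_x, a) * c(s_x, a)
        return max(actions, key=lambda a: expected_reward(s_x, a) * confidence(s_x, a))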
   The S2 agent, instead, employs the MDFT machinery (see Section A.1.2) to make a decision,
where the 𝑀 matrix has two columns, containing the Q values of a nominal and a constrained
RL agent, and the attention weights 𝑊 are set in three possible ways: 1) attention to satisfying
the constraints only if we have already violated many of them (denoted by 01), 2) attention to
reaching the goal state only if the current partial trajectory is too long (denoted by 10), and 3)
attention to both goal and constraints (denoted by 02). We will call the three resulting versions
SOFAI 01, 10, and 02.
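   One possible way to assemble the S2 input in this instance is sketched below: M has one row per candidate move and two columns holding the Q-values of the nominal and constrained RL agents, while the attention distribution w depends on the SOFAI variant. The exact trigger conditions for the 01 and 10 variants, and the equal split used for 02, are our illustrative reading of the description above.

    import numpy as np

    def build_mdft_inputs(moves, q_nominal, q_constrained, s_x,
                          variant, too_many_violations, trajectory_too_long):
        """Assemble the MDFT evaluation matrix M and attention distribution w for the
        next-move decision in state s_x (trigger conditions are illustrative assumptions)."""
        # One row per candidate move; columns = (nominal Q-value, constrained Q-value).
        M = np.array([[q_nominal[(s_x, a)], q_constrained[(s_x, a)]] for a in moves])
        if variant == "01":    # attend to constraints only if many were already violated
            w = np.array([0.0, 1.0]) if too_many_violations else np.array([1.0, 0.0])
        elif variant == "10":  # attend to the goal only if the partial trajectory is too long
            w = np.array([1.0, 0.0]) if trajectory_too_long else np.array([0.0, 1.0])
        else:                  # "02": attend to both goal and constraints
            w = np.array([0.5, 0.5])
        return M, w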


4. Experimental Results
We generated at random 10 grids, and for each grid we randomly chose: initial and final states, 2
constrained actions, 6 constrained states, 12 constrained state features (6 green and 6 blue). For
each grid, we ran: (1) two reinforcement learning agents: one that tries to avoid the constraint
penalties to reach the goal (called RL Constrained), and the other that just tries to reach the goal
with no attention to the constraints (called RL Nominal). These agents will provide the baselines;
(2) the S1 solver; (3) the S2 solver (that is, MDFT): this agent will be both a component of
SOFAI and the provider of human-like trajectories; (4) SOFAI 01, SOFAI 10, and SOFAI 02.
Each agent generates 1000 trajectories. We experimented with many combinations of values
for the parameters. Here, we report the results for the following configuration: 𝑡1 = 200, 𝑡2 =
0.8, 𝑡3 = 0.4, 𝑡4 = 0, 𝑡6 = 1.
   We checked which agent generates trajectories that are more similar to the human ones
(exemplified by MDFT). Figure 3 reports the average JS-divergence between the set of trajectories
generated by MDFT and those generated by the other systems. SOFAI agents perform much better
than S1, especially in the 01 configuration.

Figure 3: Average JS divergence between the set of trajectories generated by MDFT and the other
systems.

   We then compared the three versions of SOFAI to S1 alone, S2 alone, and the two RL agents,
in terms of the length of the generated paths, total reward, and time to generate the trajectories
(see Figures 4 and 5). It is easy to see that S1 performs very badly on all three criteria, while the
other systems are comparable. Notice that RL Nominal represents a lower bound for the length
criterion and an upper bound for the reward, since it gets to the goal with no attention to satisfying
the constraints. For both reward and time, SOFAI (which combines S1 and S2) performs better
than using only S1 or only S2.
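   As a rough illustration of the similarity measure, the sketch below computes an average Jensen-Shannon divergence between two agents, assuming each agent's trajectories have been summarized as a per-state probability distribution over the 8 moves; the actual summarization used in our experiments may differ.

    import numpy as np
    from scipy.spatial.distance import jensenshannon

    def avg_js_divergence(move_dist_a, move_dist_b):
        """Average JS divergence between two agents' per-state move distributions.
        Each argument maps a state to a length-8 probability vector over the moves
        (how trajectories are turned into distributions is an assumption here)."""
        divergences = []
        for state in set(move_dist_a) & set(move_dist_b):
            # scipy returns the JS *distance* (the square root of the divergence), so square it.
            divergences.append(jensenshannon(move_dist_a[state], move_dist_b[state], base=2) ** 2)
        return float(np.mean(divergences)) if divergences else 0.0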
   We then passed from the aggregate results over all 1000 trajectories to checking the behavior
of SOFAI and the other agents over time, from trajectory 1 to 1000. The goal is to see how SOFAI
evolves in its behavior and in its decisions on how to combine the S1 and S2 agents.
Given that SOFAI 01 performs comparably or better than the other two versions, in the following
we only show the behavior of this version and will denote it simply as SOFAI.
   Figures 6 and 7 show the length, reward, and time for each of the 1000 trajectories, comparing
SOFAI to S1 and to S2. In terms of length and reward, S1 does not perform well at all, while
SOFAI and S2 are comparable. However, the time chart shows that SOFAI is much faster than S2
and over time it also becomes faster than S1, even if it uses a combination of S1 and S2. This is
due to the fact that S1 alone cannot exploit the experience gathered by S2 within SOFAI, so it
generates much worse and longer trajectories, which require much more time. Perhaps the most
Figure 4: Average length (left), reward (right), for each trajectory, aggregated over 10 grids and
1000 trajectories.




Figure 5: Average time for each trajectory, aggregated over 10 grids and 1000 trajectories.




Figure 6: Average length (left) and reward (right), for each trajectory aggregated over 10 grids.


interesting result is shown in Figure 8. The left figure shows the average time spent by S1 and S2 within SOFAI
in taking a single decision (thus a single move in the trajectory): S2 always takes more time than
S1, and this is stable over time. The right figure shows the average reward for a single move: S2
is rather stable in generating high quality moves, while S1 at first performs very badly (since
there is not enough experience yet) and later generates better moves (but still worse than S2).
The question is now: how come S1 improves so much over time? The answer is given by Figure
Figure 7: Average time to compute each trajectory aggregated over 10 grids.




Figure 8: Time to compute a move (left), average reward for a move (right) for each sub-system,
over 10 grids.




Figure 9: Average fraction of times each sub-system is used over 10 grids.


9, which shows the percentage of usage of S1 and S2 in each trajectory. As we can see, at the
beginning SOFAI uses mostly S2, since the lack of experience makes S1 untrustworthy (that is,
the MC algorithm does not lead to the adoption of the S1 decision). After a while, with enough
trajectories built by (mostly) S2 and stored in the model of self, SOFAI (more precisely, the MC
agent) can trust S1 enough to use it more often when deciding the next move, so much that after
about 450 trajectories S1 is used more often than S2. This allows SOFAI to be faster while not
degrading the reward of the generated trajectories. This behavior is similar to what happens in
humans (as described in Section A.1.1): we first tackle a non-familiar problem with our System 2,
until we have enough experience that it becomes familiar and we pass to using System 1.


5. Future Work
We presented SOFAI, a conceptual architecture inspired by the thinking fast and slow theory of
human decision making, and we described its behavior over a grid environment, showing that it
is able to combine S1 and S2 decision modalities to generate high quality decisions faster than
using just S1 or S2. We plan to generalize our work to allow for several S1 and/or S2 solvers and
several problems for the same architecture, thus tackling issues of ontology and similarity.
References
 [1] G. Marcus, The next decade in AI: Four steps towards robust artificial intelligence, arXiv
     preprint arXiv:2002.06177 (2020).
 [2] F. Rossi, N. Mattei, Building ethically bounded AI, in: Proceedings of the 33rd AAAI
     Conference on Artificial Intelligence (AAAI), 2019.
 [3] F. Rossi, A. Loreggia, Preferences and ethical priorities: thinking fast and slow in AI, in:
     Proceedings of the 18th international conference on autonomous agents and multiagent
     systems, 2019, pp. 3–4.
 [4] M. L. Littman, et al., Gathering Strength, Gathering Storms: The One Hundred Year Study
     on Artificial Intelligence (AI100) 2021 Study Panel Report, Stanford University (2021).
 [5] G. Booch, F. Fabiano, L. Horesh, K. Kate, J. Lenchner, N. Linck, A. Loreggia, K. Murgesan,
     N. Mattei, F. Rossi, B. Srivastava, Thinking fast and slow in AI, in: Proceedings of the
     AAAI Conference on Artificial Intelligence, volume 35, 2021, pp. 15042–15046.
 [6] D. Kahneman, Thinking, Fast and Slow, Macmillan, 2011.
 [7] A. Shenhav, M. M. Botvinick, J. D. Cohen, The expected value of control: an integrative
     theory of anterior cingulate cortex function, Neuron 79 (2013) 217–240.
 [8] V. A. Thompson, J. A. P. Turner, G. Pennycook, Intuition, reason, and metacognition,
     Cognitive psychology 63 (2011) 107–140.
 [9] Y. Bengio, The consciousness prior, arXiv preprint arXiv:1709.08568 (2017).
[10] G. Goel, N. Chen, A. Wierman, Thinking fast and slow: Optimization decomposition across
     timescales, in: IEEE 56th Conference on Decision and Control (CDC), IEEE, 2017, pp.
     1291–1298.
[11] D. Chen, Y. Bai, W. Zhao, S. Ament, J. M. Gregoire, C. P. Gomes, Deep reasoning networks:
     Thinking fast and slow, arXiv preprint arXiv:1906.00855 (2019).
[12] T. Anthony, Z. Tian, D. Barber, Thinking fast and slow with deep learning and tree search,
     in: Advances in Neural Information Processing Systems, 2017, pp. 5360–5370.
[13] S. Mittal, A. Joshi, T. Finin, Thinking, fast and slow: Combining vector spaces and
     knowledge graphs, arXiv preprint arXiv:1708.03310 (2017).
[14] R. Noothigattu, et al., Teaching AI agents ethical values using reinforcement learning and
     policy orchestration, IBM J. Res. Dev. 63 (2019) 2:1–2:9.
[15] A. Gulati, S. Soni, S. Rao, Interleaving fast and slow decision making, arXiv preprint
     arXiv:2010.16244 (2020).
[16] R. M. Roe, J. R. Busemeyer, J. T. Townsend, Multialternative decision field theory: A
     dynamic connectionist model of decision making., Psychological review 108 (2001) 370.
[17] R. Ackerman, V. A. Thompson, Meta-reasoning: Monitoring and control of thinking and
     reasoning, Trends in Cognitive Sciences 21 (2017) 607–617.
[18] J. Proust, The philosophy of metacognition: Mental agency and self-awareness, OUP Oxford,
     2013.
[19] P. Carruthers, Explicit nonconceptual metacognition, Philosophical Studies 178 (2021)
     2337–2356.
[20] G. Gigerenzer, H. Brighton, Homo heuristicus: Why biased minds make better inferences,
     Topics in Cognitive Science 1 (2009) 107–143.
[21] J. H. Flavell, Metacognition and cognitive monitoring: A new area of cognitive–
     developmental inquiry., American psychologist 34 (1979) 906.
[22] T. O. Nelson, Metamemory: A theoretical framework and new findings, in: Psychology of
     learning and motivation, volume 26, Elsevier, 1990, pp. 125–173.
[23] M. T. Cox, A. Raja, Metareasoning: Thinking about thinking, MIT Press, 2011.
[24] M. T. Cox, Metacognition in computation: A selected research review, Artificial intelligence
     169 (2005) 104–141.
[25] J. D. Kralik, et al., Metacognition for a common model of cognition, Procedia computer
     science 145 (2018) 730–739.
[26] I. Posner, Robots thinking fast and slow: On dual process theory and metacognition in
     embodied AI (2020).
[27] P. Kerschke, H. H. Hoos, F. Neumann, H. Trautmann, Automated algorithm selection:
     Survey and perspectives, Evolutionary computation 27 (2019) 3–45.
[28] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 2nd Edition, A
     Bradford Book, Cambridge, MA, USA, 2018.
[29] D. Kim, G. Y. Park, P. John, S. W. Lee, et al., Task complexity interacts with state-
     space uncertainty in the arbitration between model-based and model-free learning, Nature
     communications 10 (2019) 1–14.
[30] J. R. Busemeyer, J. T. Townsend, Decision field theory: a dynamic-cognitive approach to
     decision making in an uncertain environment., Psychological review 100 (1993) 432.
[31] J. M. Hotaling, J. R. Busemeyer, J. Li, Theoretical developments in decision field theory:
     Comment on tsetsos, usher, and chater (2010)., Psychological Review (2010).
A. Appendix
A.1. Background
We introduce the main ideas of the thinking fast and slow theory. We also describe the main
features of the Multi-alternative Decision Field Theory (MDFT) [16], that we will use in the
experiments (Section 3 and 4) to generate human-like trajectories in the grid environment.

A.1.1. Thinking Fast and Slow in Humans
According to Kahneman’s theory, described in his book “Thinking, Fast and Slow" [6], human
decisions are supported and guided by the cooperation of two kinds of capabilities, that for the
sake of simplicity are called systems: System 1 (“thinking fast") provides tools for intuitive,
imprecise, fast, and often unconscious decisions, while System 2 (“thinking slow") handles more
complex situations where logical and rational thinking is needed to reach a complex decision.
   System 1 is guided mainly by intuition rather than deliberation. It gives fast answers to simple
questions. Such answers are sometimes wrong, mainly because of unconscious bias or because
they rely on heuristics or other short cuts [20], and usually do not provide explanations. However,
System 1 is able to build models of the world that, although inaccurate and imprecise, can fill
knowledge gaps through causal inference, allowing us to respond reasonably well to the many
stimuli of our everyday life.
   When the problem is too complex for System 1, System 2 kicks in and solves it with access to
additional computational resources, full attention, and sophisticated logical reasoning. A typical
example of a problem handled by System 2 is solving a complex arithmetic calculation, or a
multi-criteria optimization problem. To do this, humans need to be able to recognize that a
problem goes beyond a threshold of cognitive ease and therefore see the need to activate a more
global and accurate reasoning machinery [6]. Hence, introspection and meta-cognition is essential
in this process.
   When a problem is new and difficult to solve, it is handled by System 2 [29]. However, certain
problems, over time as more experience is acquired, pass on to System 1. The procedures System
2 adopts to find solutions to such problems become part of the experience that System 1 can
later use with little effort. Thus, over time, some problems, initially solvable only by resorting to
System 2 reasoning tools, can become manageable by System 1. A typical example is reading
text in our own native language. However, this does not happen with all tasks. An example of
a problem that never passes to System 1 is finding the correct solution to complex arithmetic
questions.

A.1.2. Multi-Alternative Decision Field Theory
Multi-alternative Decision Field Theory (MDFT) [16] models human preferential choice as an
iterative cumulative process. In MDFT, an agent is confronted with multiple options and equipped
with an initial personal evaluation for them along different criteria, called attributes. For example,
a student who needs to choose a main course among those offered by the cafeteria will have
in mind an initial evaluation of the options in terms of how tasty and healthy they look. More
formally, MDFT comprises:
   Personal Evaluation: Given set of options 𝑂 = {𝑜1 , . . . , 𝑜𝑘 } and set of attributes 𝐴 =
{𝐴1 , . . . , 𝐴𝐽 }, the subjective value of option 𝑜𝑖 on attribute 𝐴𝑗 is denoted by 𝑚𝑖𝑗 and stored
in matrix M. In our example, let us assume that the cafeteria options are Salad (S), Burrito (B)
and Vegetable pasta (V). Matrix M, containing the student’s preferences, could be defined as
shown in Figure 10 (left), where rows correspond to the options (𝑆, 𝐵, 𝑉 ) and the columns to the
attributes 𝑇𝑎𝑠𝑡𝑒 and 𝐻𝑒𝑎𝑙𝑡ℎ.



Figure 10: Evaluation (M), Contrast (C), and Feedback (S) matrix.

   Attention Weights: Attention weights are used to express the attention allocated to each
attribute at a particular time 𝑡 during the deliberation. We denote them by vector W(𝑡) where 𝑊𝑗 (𝑡)
represents the attention to attribute 𝑗 at time 𝑡. We adopt the common simplifying assumption that,
at each point in time, the decision maker attends to only one attribute [16]. Thus, 𝑊𝑗(𝑡) ∈ {0, 1}
and Σ_𝑗 𝑊𝑗(𝑡) = 1, ∀𝑡, 𝑗. In our example, we have two attributes, so at any point in time 𝑡 we will
have W(𝑡) = [1, 0], or W(𝑡) = [0, 1], representing that the student is attending to, respectively,
𝑇 𝑎𝑠𝑡𝑒 or 𝐻𝑒𝑎𝑙𝑡ℎ. The attention weights change across time according to a stationary stochastic
process with probability distribution w, where 𝑤𝑗 is the probability of attending to attribute 𝐴𝑗 .
In our example, defining 𝑤1 = 0.55 and 𝑤2 = 0.45 would mean that at each point in time, the
student will be attending to 𝑇𝑎𝑠𝑡𝑒 with probability 0.55 and to 𝐻𝑒𝑎𝑙𝑡ℎ with probability 0.45.
   Contrast Matrix: Contrast matrix C is used to compute the advantage (or disadvantage) of
an option with respect to the other options. In the MDFT literature [30, 16, 31], C is defined by
contrasting the initial evaluation of one alternative against the average of the evaluations of the
others, as shown for the case with three options in Figure 10 (center).
   At any moment in time, each alternative in the choice set is associated with a valence value. The
valence for option 𝑜𝑖 at time 𝑡, denoted 𝑣𝑖 (𝑡), represents its momentary advantage (or disadvantage)
when compared with other options on some attribute under consideration. The valence vector
for 𝑘 options 𝑜1, . . . , 𝑜𝑘 at time 𝑡, denoted by the column vector V(𝑡) = [𝑣1(𝑡), . . . , 𝑣𝑘(𝑡)]^𝑇, is
formed by V(𝑡) = C × M × W(𝑡). In our example, the valence vector at any time point in which
W(𝑡) = [1, 0] is V(𝑡) = [1 − 7/2, 5 − 3/2, 2 − 6/2]^𝑇.
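   The valence computation for this example can be checked numerically as below. Only the Taste column of M is determined by the text, so the Health column is a placeholder; it does not affect the result when W(t) = [1, 0].

    import numpy as np

    # Numerical check of V(t) = C x M x W(t) for the cafeteria example.
    M = np.array([[1.0, 0.0],    # Salad      (Taste, Health) - Health values are placeholders
                  [5.0, 0.0],    # Burrito
                  [2.0, 0.0]])   # Vegetable pasta
    C = np.array([[ 1.0, -0.5, -0.5],   # each option contrasted with the mean of the others
                  [-0.5,  1.0, -0.5],
                  [-0.5, -0.5,  1.0]])
    W = np.array([1.0, 0.0])     # attending to Taste at this time step
    V = C @ M @ W
    print(V)                     # [-2.5  3.5 -1. ] == [1 - 7/2, 5 - 3/2, 2 - 6/2]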
   Preferences for each option are accumulated across the iterations of the deliberation process
until a decision is made. This is done by using Feedback Matrix S, which defines how the
accumulated preferences affect the preferences computed at the next iteration. This interaction
depends on how similar the options are in terms of their initial evaluation expressed in M.
Intuitively, the new preference of an option is affected positively and strongly by the preference
it had accumulated so far, while it is inhibited by the preference of similar options. This lateral
inhibition decreases as the dissimilarity between options increases. Figure 10 (right) shows S for
our example [31].
   At any moment in time, the preference of each alternative is calculated by P(𝑡 + 1) =
S × P(𝑡) + V(𝑡 + 1) where S × P(𝑡) is the contribution of the past preferences and V(𝑡 + 1) is the
valence computed at that iteration. Starting with P(0) = 0, preferences are then accumulated for
either a fixed number of iterations (and the option with the highest preference is selected) or until
the preference of an option reaches a given threshold. In the first case, MDFT models decision
making with a specified deliberation time, while, in the latter, it models cases where deliberation
time is unspecified and choice is dictated by the accumulated preference magnitude. In general,
different runs of the same MDFT model may return different choices due to the attention weights’
distribution. In this way, MDFT induces choice distributions over sets of options and is capable of
capturing well-known behavioral effects, such as the compromise, similarity, and attraction effects,
that have been observed in humans and that violate rationality principles [30].
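   A compact sketch of the deliberation loop described in this section is given below (fixed number of iterations, the first stopping rule mentioned above). The feedback matrix passed in should encode self-excitation and similarity-based lateral inhibition, as in Figure 10; the example matrix shown here is only a generic placeholder, not the one used in our experiments.

    import numpy as np

    def mdft_choose(M, C, S, w, n_iterations, rng=None):
        """Run the MDFT accumulation P(t+1) = S P(t) + V(t+1), with V(t) = C M W(t) and
        W(t) attending to exactly one attribute sampled from the distribution w."""
        rng = rng or np.random.default_rng()
        k, n_attributes = M.shape
        P = np.zeros(k)
        for _ in range(n_iterations):
            W = np.zeros(n_attributes)
            W[rng.choice(n_attributes, p=w)] = 1.0   # attend to one attribute at this step
            P = S @ P + C @ M @ W
        return int(np.argmax(P))                     # option with the highest accumulated preference

    # Generic placeholder feedback matrix: self-excitation with mild lateral inhibition.
    S_example = np.array([[0.95, -0.02, -0.02],
                          [-0.02, 0.95, -0.02],
                          [-0.02, -0.02, 0.95]])
    # Example call (M and C as in the worked example above, w = [0.55, 0.45]):
    # choice = mdft_choose(M, C, S_example, w=[0.55, 0.45], n_iterations=100)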