<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Little Learning Machines: Real-Time Deep Reinforcement Learning as a Casual Creativity Game</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dante Camarena</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nick Counter</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniil Markelov</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pietro Gagliano</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Don Nguyen</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rhys Becker</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fiona Firby</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zina Rahman</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Rosenbaum</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liam A. Clarke</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Skibinski</string-name>
        </contrib>
        <aff>Transitional Forms Inc., Toronto ON, Canada</aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present Little Learning Machines, a groundbreaking game that enables players to take on the role of a reinforcement learning (RL) trainer. Utilizing reward and environment modeling, players train miniature robots to perform tasks, creating an open-ended space for exploring and crafting behavior. Notably, the game introduces innovative methods for executing RL in near real-time, a significant stride in the field. We delve into the technical challenges and solutions encountered in implementing a robust and dynamic simulation for this RL platform. This paper focuses on a system description, while pointing to potential avenues for enhancements and expansions to further enrich the player experience, as well as opportunities for additional research from player feedback. This pioneering game not only demystifies RL but also serves as a versatile tool for learning, research, and creativity in the realm of artificial intelligence.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Reinforcement Learning</kwd>
        <kwd>Educational Games</kwd>
        <kwd>Video Game</kwd>
        <kwd>Interactive Learning</kwd>
        <kwd>Artificial Intelligence</kwd>
        <kwd>Neural Networks</kwd>
        <kwd>Game Development</kwd>
        <kwd>Unity Game Engine</kwd>
        <kwd>Casual Creativity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>Is it possible for a video game to make the intricate field of Deep Reinforcement Learning (Deep RL) accessible to all? From outperforming humans in intensive tasks [1] to managing complex systems [2, 3] and refining extensive language models [4], RL has demonstrated its adaptability and potential.</p>
        <p>Despite these advancements, RL continues to be a complex topic that is often taught with a strong emphasis on theory [5]. On the other end of the spectrum, popular science approaches show an overly simplified description of the RL procedure that provides a powerful intuition but struggles to communicate the iterative process and pitfalls of experiments in the field [6]. This is further complicated by the technical complexity of setting up an environment, which may be overwhelming for someone new to the field, creating an entry barrier that is hard to overcome. Interestingly, when a typical person is asked to describe RL, they often liken it to the process of training a dog, a concept most people are familiar with, as it is centered around the strong central metaphor of Reward [7].</p>
        <p>Figure 1: An Animo petting a dog.</p>
        <p>In this paper, we introduce Little Learning Machines, a game that streamlines the process of training Deep RL agents. In this game, players are introduced to a voxelized world in which they produce and train Animo: robots that behave autonomously. The player’s primary mode of interaction with this world is through the behaviour of their Animo.</p>
        <p>The game is primarily intended to provide players with a casual creativity experience [8]. That is to say, players of Animo are given a rich environment with a variety of interactions to allow them to train agents and gain autotelic enjoyment from creating and fostering new behaviours.</p>
        <p>The game is further developed to allow measured exploration of the full capabilities of the space, providing a gradual unlock and reveal system similar to that found in games like Minecraft [9] and Wobbledogs [10]. Finally, the game tells a story about the difficulty of raising little creatures, an interesting parable that allows lighthearted analysis of stochastic (random and unpredictable) agents.</p>
        <p>Through this experience, we intend to make RL accessible and engaging for everyone, regardless of their technical background. Initial playtesting has shown mastery and engagement in users ranging from middle school to post-secondary. By providing players with a practical set of tools for training their agents, the game lets them gain first-hand knowledge about the impact of their decisions on the agent’s behavior. This leads to an intuitive understanding of RL concepts such as reward functions, state and action spaces, and exploration vs exploitation trade-offs. However, these results were collected in the middle of iterative and intensive game development and are not the focus of this paper. We expect to conduct additional research on player testing and feedback in future work. What follows is a system description of the project as well as an analysis of its construction.</p>
        <p>AIIDE Workshop on Experimental Artificial Intelligence in Games, October 08, 2023, University of Utah, Utah, USA. ∗ Corresponding author. † These authors contributed equally. © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>Our studio first ran into Reinforcement Learning while developing Agence [11], a co-production with the Canadian National Film Board, in which we trained reinforcement learning agents that were the main characters of our film. Instead of directly programming them, we defined a set of objectives and used RL to allow them to develop behaviours on their own. It took a lot of experimentation to understand the effects of rewards and environment on our agents. However, as our team became familiar with the nuances of training agents, our agents’ behaviour began to feel more engaging, and we began to enjoy the experience.</p>
      <p>Throughout production of that film, we became attached to each iteration of our little agents. We were delighted to watch them overcome small obstacles and develop curious behaviours. They would often find creative and unique ways to approach the challenges we gave them. Everyone on our team, including illustrators, producers and marketing staff, began to develop hypotheses and curiosities about training agents. Their ideas, including specific reward structures or environment curricula, often led to breakthroughs in behaviour that outpaced the work of our engineers alone. As a result, we built tooling to help them become more involved in the training process. After years of work, this tooling evolved into the game that is the subject of this paper, Little Learning Machines.</p>
      <sec>
        <title>2.1. Prior work</title>
        <p>It is important to acknowledge the previous applications of training reinforcement learning models to evolve characters in games. Notable released work includes the classic life-sim game Creatures (1996) [12] as well as the god game Black &amp; White (2001) [13]. Creatures provides the player with a variety of characters to play with, but the mechanisms underlying their behaviour are often not immediately obvious to novice players. Conversely, Black &amp; White provides a single creature to train, but the simplicity of the feedback mechanism may limit player expressivity. More recent work includes projects such as ArtBot [14], which focuses on teaching fundamental AI concepts and familiarizing the player with the process of training.</p>
        <p>Little Learning Machines attempts to bridge the gaps across these games. It does so by focusing its gameplay on an explicit training/testing loop, providing the player with a rich environment where they may attempt to train many different behaviours and a variety of agents to compare and contrast, all the while guiding the player through each step.</p>
      </sec>
      <sec id="sec-2-1">
        <title>3. Gameplay</title>
        <p>The gameplay of Little Learning Machines is designed to replicate the process of solving a Reinforcement Learning problem. As such, it is divided into two steps: Training in a Simulation (“The Cloud”) and Performing in the Real World (“The Islands”). The player assumes the role of an RL trainer, and their primary task is to train their Animo. Players may set up environments with multiple Animo. While this slows down training and places more strain on the user’s computer, it can create fun experiments where agents develop competitive behaviour.</p>
        <p>As the Animo perform a variety of tasks in the Real
World, the player gains access to new items and more
Islands. This loop was instrumental in motivating
players to explore new potential for their Animo, in
gradually introducing new mechanics, techniques and items,
and in providing a helpful balance between
novelty and focus.</p>
        <sec id="sec-2-1-1">
          <title>3.1. Training Process</title>
          <sec>
            <title>3.1.1. Building an Environment</title>
            <p>To train an agent, the player must visit a special place called “The Cloud” in which agents train. Here, the player may construct small environments to train specific behaviours. The player may modify the environment terrain by adding or removing blocks. They can also place items on a 2.5D grid (items may have a vertical position on the map, but tunnels, bridges and overhangs are not allowed). The environment influences how the Animo learns, adapts, and behaves, adding an additional layer of strategy and personalization to the gameplay.</p>
          </sec>
          <sec>
            <title>3.1.2. Setting Rewards</title>
            <p>After constructing an environment and placing their Animo, the player is then required to set a reward. The reward-setting interface is described in detail below. In short, the player can set up rules to automatically provide positive or negative rewards for any events that can occur in the simulation. This allows the player to incentivize or disincentivize any action the Animo may take. Examples include collecting a crystal, standing still, lighting a tree on fire, hugging another Animo, etc.</p>
            <p>Positive rewards that the player provides are referred to as Love, and negative rewards are referred to as Fear. Once the player starts training, each step the Animo takes results in particles floating out of the Animo’s head to indicate that it received a specific reward. In the case of a negative reward, a slight shock animation makes a subtle nod to Skinner’s Operant Conditioning Mechanism [15].</p>
          </sec>
          <sec>
            <title>3.1.3. Configuring Resets</title>
            <p>As the last step before starting training, the player can configure the environment reset. Here the player can indicate how often the training environment resets, either after a number of steps or upon a condition such as all crystals being collected or all flowers being watered. The player can also apply a slight amount of randomization to their training environment, preventing the network from over-fitting [16, 17] to specific configurations of items in a level.</p>
          </sec>
          <sec>
            <title>3.1.4. Observing training</title>
            <p>Once the rewards are set, the player can begin training. While the neural network is trained in a background process, the player can observe an example of their “training cloud” simulated in front of them. As the network updates, the behaviour of the character in front of the player improves. The player is able to stop and continue training at any time. This allows players to reflect on their Animo’s performance in real time and adjust the rewards accordingly. If players aren’t satisfied with their Animo’s learning, they can tweak the rewards and observe how these changes affect the Animo’s behavior.</p>
            <p>This iterative process imitates the real-world application of RL, where iterative experimentation and incremental adjustments are key to successful learning. As a result, the player performs the role of a training curriculum, which requires them to set a goal that is just advanced enough for the robot to be able to learn. Players are free to continue training their existing model with different rewards and parameters. Continual training [18] is a way of exploring Animo’s skill transferability and adaptability.</p>
          </sec>
          <sec>
            <title>3.1.5. Resetting Animo</title>
            <p>For a number of reasons, such as catastrophic forgetting [19], reaching a local minimum [16, 17], network collapse [20] or neuron deactivation [21], the network may be unable to recover and continue learning. The player is then faced with the difficult choice of resetting the network of their Animo, which results in a freshly initialized network. This feature ensures that players always have a way to re-calibrate their strategies and try different approaches to training their Animo.</p>
          </sec>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Exploration</title>
        <sec>
          <title>3.2.1. Tutorial</title>
          <p>To begin, players start out with a single Animo on an empty island with just sand and crystals. The player is greeted by an enthusiastic teacher named Imogen who guides the player through the process of training agents. The first agent the player trains has to navigate a simple grid world and collect crystals. Due to the unorthodoxy of the training process, having Imogen explain some of the concepts was essential to giving players the right mindset to approach training.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>3.2.2. Quest Islands</title>
        <p>After the tutorial, the player encounters a set of islands, each thematically different. On each island, the player may find a new set of items, new Animo and Quests to complete. Quests help provide objectives for players who may need a bit more direction. In order to complete a quest, an Animo is required to perform a specific set of actions without user interaction.</p>
        <p>Examples of Quests include: chop 5 trees, throw the ball at the dog 3 times, collect all these crystals without stepping on any flowers, etc.</p>
        <p>The player may not make direct modifications to Quest
Islands (although they could train behaviours to do so),
resulting in some items being out of reach until certain
conditions are met on the island. Once items are within
reach of an Animo, they may be clicked to be added to the
user’s inventory. Any item they have encountered may
now be used in their training, allowing them to create
infinite copies of the item in the Cloud.</p>
      </sec>
      <sec id="sec-2-4">
        <title>3.2.3. Home Island</title>
        <p>As the player explores each island, they may bring home
Items, NPCs and Costumes from other islands. This
allows players to create spaces for their Animo to play and
interact with one another. The player is also able to train
their Animo to complete large projects and permanently
change the look of their main island.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Environment</title>
      <sec id="sec-3-1">
        <p>Animo is set in a voxel grid environment reminiscent
of popular games like Into the Breach [22] and Crypt of the
NecroDancer [23]. This choice serves to simplify the
environment dynamics, enabling real-time training of agents.
Players are able to construct their training environments
by changing grid size, cell heights, and placing unlocked
objects on the grid before starting the training in the
“Cloud”; however, they are currently unable to interact
with the environment during training.</p>
        <p>The environment described here was primarily
designed to be extremely easy to discretize and perceive,
while providing as much richness and extensibility as possible. Our
objective was to create a baseline of generic interactions
that could support a variety of dynamics for agent
interactions.</p>
        <sec id="sec-3-1-1">
          <title>4.1. Terrain</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p>The game world is a heightmap-based grid, with no tunnels
or bridges that would complicate navigation.
Each grid cell has a height, with cells beneath the
water line being classified as either shallow or deep water,
based on depth below the water line. This
straightforward layout allows players to focus on the core gameplay
mechanics, particularly training the RL agent to interact
with objects that exist on grid cells. Each grid cell can
contain objects of different types that can be broadly
categorized into Actors, Mediums, and Items (holdable
and non-holdable).</p>
        <sec id="sec-3-2-1">
          <title>4.2. Objects</title>
          <p>4.2.1. Items</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <p>Holdable items can be grabbed, dropped and often used
by actors. Examples of such items are shovels, which
decrease the height of the block in front of Animo upon
use, or an axe that can be used to destroy various objects.</p>
        <p>Non-holdable items are entities that can be interacted
with in diferent ways. For example Tree Fruits can be
planted and turned into a tree sprout, which then can
be watered to produce a Tree or Fruit Tree, that in turn
produces Tree fruits, completing the loop. Only a single
holdable or non-holdable object can occupy a cell at a
time.</p>
      </sec>
      <sec id="sec-3-4">
        <title>4.2.2. Actors</title>
        <p>Actors include Animo, Autonimo, Dogs, Snowpals, etc.
All actors other than Animo use traditional hierarchical
state machine based AI to perform their behaviours. This
provides a sharp contrast to neural network behaviour:
during the first few parts of the game, they appear
smarter than Animo. However, as the game progresses,
it becomes obvious that their programmed behaviours
are deterministic and unable to adapt. Only a single actor
can exist on a given cell, and it can hold a single item at
a time. Actors can use objects to manipulate the world
or to create new objects. Animo and Autonimo have 7
actions that they can attempt to perform: wait, move
forward, turn left, turn right, turn back, grab, and use.
Animo cannot move onto a cell more than 1 unit higher
than their current cell, but can move to any cell lower than
their current one. Upon stepping onto a cell with deep water they
become helpless and do not take actions until they get
out of it. The grab action allows Animo to pick up an object, or
exchange a held object with an object, on the same cell.
The use action can either use the held item or an object on
the cell in front of Animo.</p>
      </sec>
      <sec id="sec-3-5">
        <title>4.2.3. Mediums</title>
        <p>Mediums are objects that do not prevent other objects from being placed on the cell that they occupy. Fire and paint are examples of such objects. Mediums are primarily used as modes of interaction between items.</p>
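        <p>The actor movement rules from Section 4.2.2 can be illustrated with a minimal sketch. The names Cell, MAX_CLIMB, can_move and can_act are hypothetical and are not taken from the Animo Simulation; the sketch also assumes that "higher than 1" means a climb of more than one height unit.</p>
        <preformat>
```python
from dataclasses import dataclass

MAX_CLIMB = 1  # assumption: an actor can step up at most one height unit


@dataclass
class Cell:
    height: int
    deep_water: bool = False


def can_move(current: Cell, target: Cell) -> bool:
    """Return True if an actor on `current` may step onto `target`."""
    # Descending is always allowed; climbing is capped at MAX_CLIMB.
    return MAX_CLIMB >= target.height - current.height


def can_act(cell: Cell) -> bool:
    """Actors standing in deep water are helpless and skip their action."""
    return not cell.deep_water
```
        </preformat>
        <p>Under these assumptions, an actor at height 2 may step onto a cell of height 3 or drop to height 0, but may not climb to height 4.</p>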
        <sec id="sec-3-5-1">
          <title>4.3. Intents</title>
          <p>Each simulation step, objects generate Intents. Intents
are chains of actions that are evaluated simultaneously
and deterministically. As intents are executed, each
intent modifies a local range of grid cells. This allows
the simulation to be executed in parallel using spatial
partitioning. Intents are executed in order of their
priority, potentially producing new intents. Intents with
similar priorities that affect overlapping regions produce
conflicts. Possible intent conflicts are resolved
according to predefined rules (e.g. none of the move intents
attempting to move onto the same cell will be executed).
A simulation step completes when there are no intents left
to execute. This results in a turn-based
simulation in which the chain of causation of intents is
preserved, allowing players to set up rewards for events
that were caused only by a specific Animo’s actions. This
is important for reward attribution in environments that
feature multiple Animo training simultaneously.</p>
        </sec>
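        <p>The move-conflict rule given as an example in Section 4.3 can be sketched as follows. This is an illustration under stated assumptions rather than the simulation's actual code: the function name resolve_moves and the representation of an intent as a mapping from an actor name to a target cell are hypothetical, and priorities and chained intents are left out.</p>
        <preformat>
```python
from collections import Counter
from typing import Dict, Tuple

Position = Tuple[int, int]


def resolve_moves(intents: Dict[str, Position]) -> Dict[str, Position]:
    """Keep only the move intents whose target cell has a single claimant."""
    # Count how many actors want to enter each cell during this step.
    demand = Counter(intents.values())
    # Every intent in a conflict is dropped, so the result does not depend
    # on the order in which actors are processed.
    return {actor: cell for actor, cell in intents.items() if demand[cell] == 1}
```
        </preformat>
        <p>Dropping all conflicting moves, instead of picking a winner, is one way to keep such a step deterministic regardless of evaluation order.</p>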
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Reward UI</title>
      <sec id="sec-4-1">
        <p>Finding a way to give players control over rewards was
particularly difficult. Original drafts of the game design
had Animo requiring resources such as food or batteries,
and derived training rewards from these systems.
However, we found that such a system would limit the
kinds of behaviours that the players could create.</p>
        <p>We initially provided players with a menu listing every
interaction in the game, but navigating that menu quickly
became overwhelming. We found an elegant solution
that shows specific interactions by limiting the options
on screen to the set of items available. Furthermore,
the introduction of iconography allowed easier access
for younger players, non-English speakers and
reading-impaired playtesters.</p>
      </sec>
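      <p>Event-based reward rules of the kind the player configures can be sketched as a small data structure. The names RewardRule and step_reward, the event strings and the reward magnitudes are all hypothetical; the game's interface is graphical and its internal representation is not described in this paper.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class RewardRule:
    event: str     # hypothetical event name, e.g. "collect_crystal"
    amount: float  # positive values act as Love, negative values as Fear


def step_reward(rules: List[RewardRule], events: Iterable[str]) -> float:
    """Sum the amounts of every rule triggered by this step's events."""
    triggered = list(events)
    return sum(rule.amount for rule in rules for e in triggered if e == rule.event)


rules = [RewardRule("collect_crystal", 1.0), RewardRule("stand_still", -0.5)]
```
      </preformat>
      <p>With these example rules, a step in which the Animo collects a crystal and also stands still would yield a total reward of 0.5.</p>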
    </sec>
    <sec id="sec-5">
      <title>6. Animo observations</title>
      <p>Animo’s ability to interact with its environment and make decisions is largely dependent on what it can observe and how it observes it. One of the main challenges of crafting sensors is to find a balance between providing enough information for the agent to make meaningful decisions and not overwhelming it with too much information. Representing the environment in a way that is complementary to Animo’s network architecture is important for efficient training. Animo’s perception is designed to be pluggable, meaning that an Animo can have any number of sensors, each describing different parts of the environment independently of the others. The game comes with 9 unlockable Animo, each with their own unique way of perceiving the world around them.</p>
      <sec id="sec-5-1">
        <title>6.1. Vector Sensors</title>
        <p>Vector sensors are one-dimensional and do not preserve any data structure. They perceive every observed value independently, which makes them quick learners. However, they suffer from the curse of dimensionality and quickly lose performance as the number of observations grows. Compass sensors are an example of Vector sensors; they are designed to inform the agent about the direction to the nearest item of the player’s choice. This sensor allows the agent to be aware of objects that are out of range of other sensors.</p>
        <sec id="sec-5-1-1">
          <title>6.2. Convolutional Sensors</title>
          <p>Convolutional sensors [24] are designed to perceive image-like data and use convolutions to exploit patterns in the observations. They might be slightly slower to run, but they learn kernels that extract specific information from the observation. They perceive a patch of the grid around the agent, rotated towards the direction that the agent is facing. Each perceived cell is parsed into relevant information that usually consists of terrain (height, ground type, occupancy), actor at cell, item held by an actor at cell, medium, and object on ground.</p>
        </sec>
      </sec>
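      <p>The agent-centred, rotated grid patch perceived by a convolutional sensor can be sketched in plain Python. The function name egocentric_patch, the compass-letter facings and the padding value are assumptions made for illustration; a real sensor would emit several feature channels per cell instead of a single number.</p>
      <preformat>
```python
# Row/column step for "one cell forward" in each facing direction.
FORWARD = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}


def egocentric_patch(grid, row, col, facing, radius=1, pad=-1):
    """Return a square patch around (row, col) with `facing` pointing up."""
    f_row, f_col = FORWARD[facing]
    r_row, r_col = f_col, -f_row  # unit step towards the agent's right
    size = 2 * radius + 1
    patch = []
    for i in range(size):
        ahead = radius - i  # patch row 0 is the farthest row in front
        line = []
        for j in range(size):
            side = j - radius  # negative values are to the agent's left
            r = row + ahead * f_row + side * r_row
            c = col + ahead * f_col + side * r_col
            inside = r in range(len(grid)) and c in range(len(grid[0]))
            line.append(grid[r][c] if inside else pad)
        patch.append(line)
    return patch
```
      </preformat>
      <p>Because the patch is always expressed relative to the agent's heading, the same learned kernel can respond to "an object directly ahead" whichever way the agent is facing.</p>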
      <sec id="sec-5-2">
        <title>6.3. Attention Sensors</title>
        <sec id="sec-5-2-1">
          <p>Attention sensors [25] perceive groups of values, called
entities. This enables them to perceive a variable number
of observations and, while they are significantly slower
to run, their ability to learn is impressive, albeit fragile.
Attention sensors perceive a number of objects that are
closest to Animo. This perception system has the
advantage of only focusing on objects instead of cells, which
allows for a more compact representation compared to
the Animo Grid Sensor.</p>
        </sec>
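        <p>Selecting the objects closest to the agent as a variable-length list of entities can be sketched as follows. The function name nearest_entities, the (type, row, column) tuple layout and the use of Manhattan distance are assumptions for illustration; the paper does not specify how distance is measured or how entities are laid out.</p>
        <preformat>
```python
from typing import List, Tuple

Entity = Tuple[str, int, int]  # (object type, row, col); hypothetical layout


def nearest_entities(agent: Tuple[int, int], objects: List[Entity], k: int = 3) -> List[Entity]:
    """Return up to k objects closest to the agent, as agent-relative offsets."""
    a_row, a_col = agent
    # Manhattan distance is an assumption that suits a grid world.
    ranked = sorted(objects, key=lambda o: abs(o[1] - a_row) + abs(o[2] - a_col))
    return [(kind, r - a_row, c - a_col) for kind, r, c in ranked[:k]]
```
        </preformat>
        <p>The result has at most k entries however large the grid is, which is the compactness advantage over a per-cell representation.</p>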
      </sec>
      <sec id="sec-5-3">
        <title>6.4. Object Encodings</title>
        <p>The inputs to each sensor are required to encode the
type of objects in their perception as a part of state
observation. One-hot encoding is a common method of
approaching this task, that scales poorly as the number
of object types increases. Our sensor system allows to
easily switch between one-hot encoding, hand-crafted
object properties, binary, and ternary object type Id
encodings, allowing players to use the perception system
suitable for the task at hand.
to be engaging, it was essential to show improvement in
the agents as fast as possible. However, fast techniques
such as Q-Learning or Tabular learning were unable to
perform at the same level as Proximal Policy
Optimization.</p>
        <p>That said, implementation of such algorithms within
engine would be a significant undertaking, and one that
we would have a hard time adapting into a game. As a
6.5. Sensor Conclusions result, we had to perform significant engineering to
leverage existing RL infrastructure. Our application makes
After conducting numerous experiments, we found that use of an embedded python engine to access PyTorch
our initial hypotheses for each sensor were somewhat to train agents during gameplay. All training is local to
misguided. Just including more information was insuf- the player’s computer and has shown good performance
ifcient for better Animo performance. While we would even on hardware as dated as a Surface Pro 3.
often see improvement in performance relative to total The training architecture is designed with wall clock
training steps, this would come at a cost of increased training time being the priority. The training process is
inference time or slower model updates. This resulted built around the modified version Unity ML-Agents, a
in similar wall clock time for training of all three sen- powerful tool for training intelligent agents to be used
sor types. That said, there are nuanced diferences in with Unity Game Engine.[26] ML-Agents provides a
flexthe performance of each sensor type. Animo behaviour ible platform that supports multiple training algorithms,
is qualitatively diferent from one another. The ease of including Soft Actor-Critic (SAC) and Proximal Policy
swapping between these configurations has allowed for Optimization (PPO). Unity also provides a multi-platform
better analysis. Enthusiast players of the game may them- inference engine called Barracuda, which allows us to
selves make tweaks to sensor configurations and perhaps inference models without causing disruption to the user
be able to create further conclusions. framerate.</p>
        <p>While early prototypes used ML-Agents directly, and
7. Technical Architecture modified its internals to support runtime training, recent
versions of Little Learning Machines have been
modiAnimo is one of the first games to allow players to train ifed to use only a small fraction of ML-Agents, and train
Deep RL agents within a game. However, the process without the Unity Executable. The main vehicle for this
of training DRL Agents can takes hours, Resulting in it approach is the development of the Animo Simulation,
being slow and un-engaging. In order for training itself which allows us to run simulations without running the
game. This eliminates the overheads of running the Unity
Engine and gRPC communication protocol to generate
sample trajectories during training, that is usually the
case with the standard ML-Agents implementation.</p>
        <p>As a result, The Little Learning Machines Project
consists of 4 layers:
• A Unity Project, (the game) layer that allows us</p>
        <p>to use Unity Editor for development.
• Micro ML-Agents unity package, a lightweight</p>
        <p>fork of ML Agents’ C# side package.
• Animo Simulation, a standalone C# dll.
• Animo Trainer A custom Python trainer, that uses</p>
        <p>PythonNet to interface with Animo Simulation
directly.</p>
        <sec id="sec-5-3-1">
          <title>Above all, Little Learning Machines is the easiest in</title>
          <p>troduction to the addictive process of training your own
artificially intelligent agents. It exposes players to a loop
of training and observing ever smarter models, the
delightfully frustrating process of iterating on experiments</p>
          <p>The Animo Simulation is designed to be self-suficient, and the joy of seeing the models break through plateaus
allowing for potential use with other training algorithms. and traps. The game requires no prior experience,
reWe have conducted several experiments with DreamerV3 quires no math (except a bit of graph literacy) and no
[27], a new model-based training algorithm, to show coding experience. It even installs python and PyTorch
that Animo Simulation can be used as a configurable for you. It is the most straightforward way to
experibenchmark environment for testing various training algo- ence reinforcement learning. And it does so not just by
rithms. During training, our modified ML-Agents Python having you set the experiments and look at graphs, but
package periodically exports current models and train- by letting you see the Animo learn in front of your eyes.
ing statistics that are conveniently displayed within the We hope that it’s an inspiration for generations to come.
game. This architecture, along with python, PyTorch and While the technical potential of the project can be quite
the rest of the dependencies are installed automatically exciting, it is at the end of a day, a game made with care
during game setup. and attention, ideally enjoyed ludically by a small group</p>
          <p>Finally, the simulation uses a pluggable architecture, of individuals.
allowing for the easy design and implementation of new
Objects, Sensors, NPCs, Reward functions and Training
Algorithms. We intend on presenting the Animo Sim- Acknowledgments
ulation as a viable benchmark for new RL algorithms
in a separate conference, as it provides novel means of
evaluation for said algorithms.</p>
          <p>Micro ML-Agents can be reviewed here:
https://github.com/transformsai/micro-ml-agents.</p>
          <p>Animo Trainer Simulation can be installed here:
https://pypi.org/project/animo-trainer/. Both packages
require additional documentation.
• Modders can add new Items, objects and other</p>
          <p>content in Mods
• Creative players can come up with new
environ</p>
          <p>ments, challenges and tests for their Animo
• Competitive players can push the performance</p>
          <p>of these algorithms to their best.
• Educators can use the platform to experientially
demonstrate peculiarities of reinforcement
learning.</p>
        </sec>
        <sec id="sec-5-3-2">
          <title>Acknowledgments</title>
          <p>In memory of Anuj Patel: this project would not be
possible without his hard work.</p>
          <p>Thanks to the following people for their support
during production: Alexander Bakogeorge, Casey Bluestein,
Chloe West, David Oppenheim, Erin Ray, Eve
Cuthberson, Kory Mathewson, Manal Siddiqui, Pablo Samuel
Castro. Thanks to Level Curve Inc for their help with
Audio and Music: Eliza Daly, Matt Miller, Robby Duguay.
Thanks to Durham College for their advice and support:
Khris Finley, Richa Thomas, Ryan Miller, Yuqi “Stanley”
Zhou, Dina Samaha, Dr. Vibha Tyagi, Tejas Vyas, Saba
Siddiqi, Sharath Kumar.</p>
          <p>Thanks to the following people for their support of
the project: Adam Myhill, Darren Throop, Euro Beinat,
Kevin West, Paul Van Der Boor, Peter Vuong, Priya Ratti,
Victor Nguyen, Vivian Gagliano. This project was made
possible by generous funding from the Canada
Media Fund and Ontario Creates.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>8. Discussion and Conclusion</title>
      <p>Ultimately, Little Learning Machines is a bridge between
culture and research. It’s an experiential platform,
allowing users not just to understand RL in theory, but to gain
an intuitive grasp of its peculiarities through exposure
and experimentation. It’s a shared platform for people
to experiment and share different perspectives on RL. As
a small sample of the kinds of communities that could
interact with it, one can imagine:</p>
      <p>• RL Researchers can test and benchmark new RL
Algorithms</p>
      <p>• Enthusiast Players can fiddle with sensors and
hyperparameters
</p>
      <p>[1] OpenAI, C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Józefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. d. O. Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, S. Zhang, Dota 2 with large scale deep reinforcement learning, 2019. arXiv:1912.06680.</p>
      <p>[2] M. G. Bellemare, S. Candido, P. S. Castro, J. Gong, M. C. Machado, S. Moitra, S. S. Ponda, Z. Wang, Autonomous navigation of stratospheric balloons using reinforcement learning, Nature 588 (2020) 77–82.</p>
      <p>[3] J. Degrave, F. Felici, J. Buchli, M. Neunert, B. Tracey, F. Carpanese, T. Ewalds, R. Hafner, A. Abdolmaleki, D. de Las Casas, et al., Magnetic control of tokamak plasmas through deep reinforcement learning, Nature 602 (2022) 414–419.</p>
      <p>[4] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, R. Lowe, Training language models to follow instructions with human feedback, 2022. arXiv:2203.02155.</p>
      <p>[5] M. Morales, Grokking deep reinforcement learning, Manning Publications, 2020.</p>
      <p>[6] C. Course, Y. Bisk, J. Ashe, Reinforcement learning: Crash course ai 9, 2019. URL: https://www.youtube.com/watch?v=nIgIv4IfJ6s.</p>
      <p>[7] D. Silver, S. Singh, D. Precup, R. S. Sutton, Reward is enough, Artificial Intelligence 299 (2021) 103535.</p>
      <p>[8] K. Compton, M. Mateas, Casual creators, in: ICCC, 2015, pp. 228–235.</p>
      <p>[9] Mojang, Minecraft, Videogame, 2011. URL: https://minecraft.net.</p>
      <p>[10] T. Astle, Animal Uprising, Wobbledogs, Videogame, 2022. URL: https://wobbledogs.com/.</p>
      <p>[11] P. Gagliano, C. Blustein, D. Oppenheim, Agence, a dynamic film about (and with) artificial intelligence, in: ACM SIGGRAPH 2021 Immersive Pavilion, 2021, pp. 1–2.</p>
      <p>[12] S. Grand, D. Cliff, A. Malhotra, Creatures: Artificial life autonomous software agents for home entertainment, in: Proceedings of the first international conference on Autonomous agents, 1997, pp. 22–29.</p>
      <p>[13] Lionhead Studios, Black &amp; White, Videogame, 2001.</p>
      <p>[14] M. Zammit, I. Voulgari, A. Liapis, G. N. Yannakakis, The road to ai literacy education: from pedagogical needs to tangible game design, Academic Conferences International, 2021.</p>
      <p>[15] B. F. Skinner, Reinforcement today, American Psychologist 13 (1958) 94.</p>
      <p>[16] A. Zhang, N. Ballas, J. Pineau, A dissection of overfitting and generalization in continuous reinforcement learning, arXiv preprint arXiv:1806.07937 (2018).</p>
      <p>[17] C. Zhang, O. Vinyals, R. Munos, S. Bengio, A study on overfitting in deep reinforcement learning, arXiv preprint arXiv:1804.06893 (2018).</p>
      <p>[18] K. Khetarpal, M. Riemer, I. Rish, D. Precup, Towards continual reinforcement learning: A review and perspectives, arXiv preprint arXiv:2012.13490 (2020).</p>
      <p>[19] P. Kaushik, A. Gain, A. Kortylewski, A. Yuille, Understanding catastrophic forgetting and remembering in continual learning with optimal relevance mapping, arXiv preprint arXiv:2102.11343 (2021).</p>
      <p>[20] V. Kothapalli, Neural collapse: A review on modelling principles and generalization, 2023. arXiv:2206.04041.</p>
      <p>[21] G. Sokar, R. Agarwal, P. S. Castro, U. Evci, The dormant neuron phenomenon in deep reinforcement learning, 2023. arXiv:2302.12902.</p>
      <p>[22] Subset Games, Into The Breach, Videogame, 2018. URL: https://subsetgames.com/itb.html.</p>
      <p>[23] Brace Yourself Games, Crypt of the Necrodancer, Videogame, 2015. URL: https://braceyourselfgames.com/crypt-of-the-necrodancer/.</p>
      <p>[24] S. Albawi, T. A. Mohammed, S. Al-Zawi, Understanding of a convolutional neural network, in: 2017 International Conference on Engineering and Technology (ICET), 2017, pp. 1–6. doi:10.1109/ICEngTechnol.2017.8308186.</p>
      <p>[25] B. Baker, I. Kanitscheider, T. Markov, Y. Wu, G. Powell, B. McGrew, I. Mordatch, Emergent tool use from multi-agent autocurricula, 2020. arXiv:1909.07528.</p>
      <p>[26] A. Juliani, V.-P. Berges, E. Teng, A. Cohen, J. Harper, C. Elion, C. Goy, Y. Gao, H. Henry, M. Mattar, D. Lange, Unity: A general platform for intelligent agents, 2020. arXiv:1809.02627.</p>
      <p>[27] D. Hafner, J. Pasukonis, J. Ba, T. Lillicrap, Mastering diverse domains through world models, 2023. arXiv:2301.04104.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>