    ARO: A Memory-Based Approach to Environments with Perceptual Aliasing

              Ryan Regier, Owen Price, Alex Hadi, Zachary Faltersack and Andrew Nuxoll
                                                        University of Portland
                                                   5000 North Willamette Boulevard
                                                       Portland, Oregon 97203
                                                         engineering@up.edu




                          Abstract

  In order to integrate machine learning and knowledge engineering, it is necessary to store both learned and engineered knowledge in a common format that can be used by an agent to make intelligent decisions. Such integration is much more useful in data sparse environments, where traditional machine learning techniques fail. To this end, we define a rule structure that stores patterns of cause and effect. These rules can easily embed expert knowledge on causal relationships. We will then present ARO, an algorithm for learning and using rules to navigate data sparse environments. Our results show that ARO is more effective than other known solutions to this problem. ARO has the additional advantages that it more easily integrates with knowledge engineering techniques and uses no hyperparameters.

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

                         Background

In environments with an abundance of data, machine learning agents are effective at learning patterns, even those not known by experts in the domain. However, in data-sparse environments, traditional machine learning agents fail (Chrisman 1992). We believe that this kind of environment could be ideal for the integration of knowledge engineering with machine learning techniques. One such environment is the Blind Finite State Machine environment. In this environment, the agent navigates a finite state machine (FSM) (Hopcroft, Motwani, and Ullman 2006) to reach a goal state. The machine is designed so that the agent can reach the goal from any state (no dead ends) but otherwise the transition table is randomly generated. Upon reaching the goal the agent is immediately moved to a randomly selected, non-goal state and informed of its success. Thus the agent knows when it reaches the goal but can never actually take an action in the goal state.
   The task seems trivial at first. However, it is greatly complicated because the agent is only aware of the following:
• the FSM alphabet (available actions)
• a goal sensor indicating when it reaches the goal
Specifically, the agent is not aware of the following:
• the number of states
• what state it currently is in
• the transition table of the FSM
The agent repeatedly performs the task in a given FSM and its success is measured by how many actions it takes to reach the goal at each iteration.
   In such an environment, a traditional machine learning agent, e.g., Q-Learning (Sutton, Barto, and others 1998), is unable to learn effective behavior. Such an agent attempts to learn the action that yields the maximum expected utility for each possible sensory input. However, in this environment, all states (except the goal state) appear identical to the agent. Furthermore, there is no reward function to indicate to the agent that it has performed particularly poor actions such as looping. So, the agent can only learn a single “best” action to take in any non-goal state.
   The situation changes once the agent is given additional sensors beyond the goal sensor. Consider an environment like Blind FSM but the agent also has a binary sensor that tells it when it is currently in an odd-numbered state or an even-numbered state (presuming the states were arbitrarily enumerated). We shall henceforth refer to this as the OSFSM (Odd-Sensor Finite State Machine) environment. Traditional machine learning algorithms still perform poorly in OSFSM since they can only distinguish three categories of states: odd-numbered, even-numbered or the goal. However, as more and more sensors are provided a traditional machine learning algorithm becomes more and more effective. Thus, there is a continuum from complete state aliasing (e.g., the Blind FSM environment) to a fully observable environment.
   To be successful in the Blind FSM or OSFSM environments, an agent must map sequences of experiences to actions. An experience, in this case, is the agent’s sensory input and the action it selected. For example, the agent could learn that if it takes actions a1, a2, ..., an in sequence and does not reach the goal then it can subsequently reach the goal in a single step by taking action an+1. Furthermore, the agent could learn that certain sequences of actions have more utility than others regardless of what non-goal state it is currently at. In effect, the agent must de-alias states by using its recent memories as an identifier.
   We expect that the agent would need a significant amount of experience in order to achieve this de-aliasing. If the agent visits the same state through two different paths, the agent may be unable to tell that the states are the same and must learn separate policies for each. Overall, the agent may need to learn substantially more policies than the number of states in order to successfully navigate such an environment. Because of this hindrance, being able to incorporate existing knowledge into the agent, such as through knowledge engineering, could significantly improve the speed at which the agent learns.
   In previous research, an algorithm called MaRz (Rodriguez et al. 2017) has shown success in Blind FSM environments. MaRz performs an A*-Search (Russell and Norvig 2009) over possible sequences of actions to find the shortest universal sequence. A universal sequence is a sequence of actions that causes the agent to reach the goal from any non-goal state, and such a sequence must always exist in an FSM where every state has a path to the goal.
   However, MaRz cannot incorporate observations into its algorithm, limiting its scope. Furthermore, the best candidate we see for using knowledge engineering in the MaRz algorithm is through the heuristic function for the search. But the method for converting expert domain knowledge into a search heuristic seems non-trivial.
   Other algorithms called Nearest Sequence Memory (NSM), Utile Suffix Memory (USM) and Noisy Utile Suffix Memory (NUSM) have been shown to be effective in these kinds of environments (McCallum 1995b; 1995a; Shani and Brafman 2005). These operate by looking for the past experiences closest to the most recent events in memory. Then, these algorithms determine which action taken in these past experiences was most effective for reaching the goal. Sometimes, the agent will randomly explore instead.
   Similarly to MaRz, these algorithms do not have a clear path to integrating expert knowledge. Since all “knowledge” these algorithms learn is stored as chains of memories, it appears expert knowledge would have to be stored in a similar fashion in order for these algorithms to use it. A more convenient algorithm would expect expert knowledge to be in the format of a set of declarative facts.
   In contrast to these other algorithms, ARO generates and employs facts similar to what an expert may provide, making the learned experience of the agent and the experience of an expert interchangeable. To this end, we will define a rule, which records a known probabilistic pattern of cause and effect. This more naturally fits what expert knowledge may look like. For example:
• Performing action a will cause x.
• If you just performed action a0 and then you saw x, then you can perform action a1 to complete the task.
• If you see x, then performing action a will cause y or z to occur at random.
   All of these examples may be formatted as rules. We will formally define rules later in this paper, but for now it is imperative to more formally define the environments.
                        Environment

We will now formally classify the environment for ARO. We define the environment by a tuple (M, G, O, f). The machine M is a finite state machine with state set S, alphabet α, and transition function T such that the goal state G is in S and there is a path from every state in S to G. The observation set O is the set of possible observations (different sensor values), where G ∈ O denotes the observation of reaching the goal state. The observation function f : S → O computes what is observed in any state, where f(s) = G if and only if s = G.
   An agent in this environment is told α and G, but is otherwise provided no information about the environment. The agent is initialized in some unknown random state s0. At any time-step t when the agent is in state st, the agent will be told f(st). If st is not a goal state, the agent must provide an action at ∈ α. The agent will then transition to st+1 = T(st, at). If instead st is a goal state, the agent must provide any action, then it will be moved to a non-goal state st+1 at random. This continues until a fixed but unknown number of goals have been reached. The objective of the agent is to minimize the total number of time steps.
   In the Blind FSM environment, O = {∅, G} and f(s) = ∅ for non-goal states, where ∅ denotes the single uninformative non-goal observation. In the OSFSM environment, O = {0, 1, G}, with 0 and 1 denoting whether the state is even or odd, respectively, and f(s) returns the parity of s for non-goal states.
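As a concrete illustration of this formalism, the sketch below shows one way the tuple (M, G, O, f) could be realized in code. It is a minimal sketch under our own assumptions, not the implementation evaluated later in this paper: the class name, the choice of state 0 as the goal, and the two observation functions are ours, and a full generator would also need to verify that every state retains a path to the goal.

import random

class BlindFSMEnvironment:
    """Minimal sketch of the (M, G, O, f) tuple: a randomly generated FSM
    with a single goal state, an action alphabet, and an observation function."""

    def __init__(self, num_states, alphabet, observe):
        self.alphabet = list(alphabet)
        self.goal = 0                      # designate state 0 as G (an assumption)
        self.observe = observe             # f : S -> O; must return "G" only at the goal
        self.states = list(range(num_states))
        # Randomly generated transition table T(s, a). A real generator would
        # additionally check that every state still has a path to the goal.
        self.T = {(s, a): random.choice(self.states)
                  for s in self.states for a in self.alphabet}
        self.current = random.choice(self.states[1:])   # start in a random non-goal state

    def step(self, action):
        """Apply an action and return the next observation."""
        if self.current == self.goal:
            # Any action is accepted at the goal; the agent is then moved at random.
            self.current = random.choice(self.states[1:])
        else:
            self.current = self.T[(self.current, action)]
        return self.observe(self.current, self.goal)

def blind_observe(state, goal):
    """Blind FSM: O = {None, "G"}, with None standing in for the blank observation."""
    return "G" if state == goal else None

def odd_sensor_observe(state, goal):
    """OSFSM: O = {0, 1, "G"}, reporting the parity of non-goal states."""
    return "G" if state == goal else state % 2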
                       ARO Overview

We are now ready to present ARO, an algorithm designed for navigating environments with extreme state aliasing such as Blind FSM and OSFSM.
   ARO stores previous events in units called episodes. Each episode consists of the tuple (o, a), where o is the observation f(s) and a is the action taken after the observation. In the OSFSM environment, an example episode may be (1, b). In this episode, the agent was in an odd-numbered state and took action b. When the agent reaches the goal, the action is irrelevant, so we denote such an episode simply with “G.” In non-goal episodes, we concatenate the tuple for brevity, so the above episode would be denoted “1b”.

Rules

Using these episodes, the agent records patterns of cause and effect in a structure called a rule. Rules are denoted like this example:
                       1a → 1 : 23%.
This rule states that being in an odd state and taking action a “caused” an odd state 23% of the time. The effect part of a rule is always in this same style, but causes may be more complex. For example, consider this rule:
                      0b, 1a → 1 : 6%.
   This rule states that being in an even state and performing action b, then reaching an odd state and performing action a, has caused the agent to reach an odd state 6% of the time.
   More formally, a rule consists of three parts: a sequence of episodes of any length known as the cause sequence, an observation known as the effect, and a probability. The cause sequence of a rule may be empty, which can be interpreted as the probability that transitioning to a new state causes a particular observation. The number of episodes in the cause sequence of a rule is called the depth of a rule.
   The three examples for rules given in the background section may now be formally stated. We will use ∗ to denote any possible value.
• ∗a → x : 100%
• ∗a0, xa1 → G : 100%
• xa → y : 50% and xa → z : 50%
   As we can see from the final example, it is often useful to consider the set of all rules with the same cause sequence. This lets us consider all possible outcomes from taking a particular action. We call such a set of rules a rule set.
   We used a tree structure to store rules, but unlike similar algorithms that rely on a tree-based data structure (McCallum 1995a; Shani and Brafman 2005), this choice does not affect ARO’s functionality. So rather than explain this tree, we merely define the function GetRule, which takes as inputs both a cause sequence and an effect and produces as an output the corresponding rule. We also define the function GetRuleSet, which takes a cause sequence as an input and produces the corresponding rule set as an output.
   Since the agent does not have access to the underlying FSM, it cannot compute the probabilities for rules exactly. This leaves us with two choices for developing the rules: learned knowledge and expert knowledge. For learned knowledge, the agent estimates the probability using the rate at which the effect follows the cause sequence in its episode history. For example, if 0b occurs 10 times in memory and 0b, G occurs 3 times in memory, then the agent would use the rule
                       0b → G : 30%.
Expert knowledge can be used instead of this estimate if it is provided to the agent. GetRule does not distinguish between the source of the rule when providing it to the agent.
   As noted by a reviewer of this paper, it may be necessary to treat expert knowledge as only likely to be correct rather than certain to be correct. To accomplish this, expert knowledge can be associated with a weight, with higher weights indicating a higher confidence in the correctness of the rule. In this case, ARO can treat these weights like the frequency of the observation of this pattern. These frequencies can be added to the total frequencies of ARO’s observations in order to compute the probability. For instance, say the rule 0b → 0 : 100% is provided with a weight of 8, but ARO observes the pattern 0b, 1 twice. Then ARO would use the rules 0b → 0 : 80% and 0b → 1 : 20%. Thus, over time expert rules would either be confirmed by ARO’s observations or replaced with more accurate rules. In a sense, the weight serves as a kind of Bayesian prior on the actual probability of the effect following the cause. The implications of such a system are not fully explored here.
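The rule machinery above lends itself to a simple frequency-count realization. The sketch below uses our own names (Rule, RuleStore, record) and is not the tree-based implementation described above, but it reproduces the expert-weight arithmetic from the preceding paragraph, where a weight of 8 on 0b → 0 combined with two observations of 0b, 1 yields 80% and 20%.

from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Tuple

Episode = Tuple[object, str]          # (observation, action), e.g. (1, "b"); the goal is recorded as "G"

@dataclass(frozen=True)
class Rule:
    cause: Tuple[Episode, ...]        # the cause sequence (may be empty; its length is the rule's depth)
    effect: object                    # an observation, e.g. 0, 1, or "G"
    probability: float

class RuleStore:
    """Sketch of GetRule/GetRuleSet backed by frequency counts. Expert rules
    are folded in as weighted pseudo-counts, as described in the text."""

    def __init__(self):
        self.counts = defaultdict(Counter)    # cause sequence -> Counter of effects

    def record(self, cause, effect, weight=1):
        """Record an observed (or expert-provided, weighted) cause/effect pair."""
        self.counts[tuple(cause)][effect] += weight

    def get_rule_set(self, cause):
        """All rules sharing this cause sequence (GetRuleSet)."""
        counter = self.counts.get(tuple(cause))
        if not counter:
            return []
        total = sum(counter.values())
        return [Rule(tuple(cause), effect, n / total) for effect, n in counter.items()]

    def get_rule(self, cause, effect):
        """The single rule for this cause sequence and effect (GetRule)."""
        for rule in self.get_rule_set(cause):
            if rule.effect == effect:
                return rule
        return None

# Expert-weight example from the text: weight 8 on 0b -> 0, plus two observations of 0b, 1.
store = RuleStore()
store.record([(0, "b")], 0, weight=8)   # expert rule 0b -> 0
store.record([(0, "b")], 1)             # observed 0b followed by 1
store.record([(0, "b")], 1)
# get_rule_set([(0, "b")]) now yields 0b -> 0 : 80% and 0b -> 1 : 20%.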
   Now that we have a method for storing patterns of cause and effect, we can use this information to predict future outcomes and find the move that minimizes the expected number of steps to the goal.

Expected Value

Given a sequence of previous episodes that have occurred and an observation, we can recursively compute the expected number of moves it will take to reach the goal. This is computed under the assumption that at each choice the agent may have in the future, it will make the best possible move. The algorithm is below. (Note: the GetHeuristic function will be defined later in the paper.)
 1: function EV(Episode[] episodes, Observation o)
 2:     BestSum = ∞
 3:     BestMove = null
 4:     for a in α do
 5:         Sum = 0
 6:         Let next = (o, a) be a new episode.
 7:         Let nextSeq be episodes appended by next.
 8:         nextRuleSet = GetRuleSet(nextSeq)
 9:         if nextRuleSet is empty then          // base case
10:             Sum = GetHeuristic()
11:         end if
12:         for rule in nextRuleSet do
13:             Let e be the effect of rule.
14:             Let p be the probability of rule.
15:             if e = G then
16:                 Sum = Sum + p
17:             else
18:                 Sum = Sum + p ∗ (EV(nextSeq, e) + 1)
19:             end if
20:         end for
21:         if Sum < BestSum then
22:             BestSum = Sum
23:             BestMove = a
24:         end if
25:     end for
26:     return BestSum
27: end function
   This algorithm also computes the move it expects will take the fewest number of steps to reach the goal on average, stored in the BestMove variable. Let GetBestMove be a function that, when given the same inputs as EV, will return the value of the BestMove variable. The action returned by GetBestMove will minimize the expected number of steps to the goal based on what the agent has learned about the environment, so we would expect this to minimize the average number of steps to the goal over the long term. ARO uses this function to determine which action to take.
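For readers who prefer an executable form, the following is a direct transliteration of the EV pseudocode into Python. It is a sketch only: get_rule_set and get_heuristic are assumed to be supplied by the agent (for instance, the RuleStore and heuristic sketches elsewhere in this paper), and the string "G" stands in for the goal observation.

import math

def ev(episodes, o, alphabet, get_rule_set, get_heuristic):
    """Sketch of the EV recursion above. Returns (best_sum, best_move): the
    minimum expected number of steps to the goal and the action achieving it.
    As in the pseudocode, the recursion bottoms out when a cause sequence has
    never been seen and the rule set is empty."""
    best_sum, best_move = math.inf, None
    for a in alphabet:
        total = 0.0
        next_seq = list(episodes) + [(o, a)]
        rule_set = get_rule_set(next_seq)
        if not rule_set:
            # Base case: no rule predicts the outcome of this action,
            # so fall back on the exploration heuristic.
            total = get_heuristic()
        for rule in rule_set:
            if rule.effect == "G":
                total += rule.probability
            else:
                sub, _ = ev(next_seq, rule.effect, alphabet, get_rule_set, get_heuristic)
                total += rule.probability * (sub + 1)
        if total < best_sum:
            best_sum, best_move = total, a
    return best_sum, best_move

def get_best_move(episodes, o, alphabet, get_rule_set, get_heuristic):
    """GetBestMove: the same computation as EV, but returning the chosen action."""
    return ev(episodes, o, alphabet, get_rule_set, get_heuristic)[1]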
   There are two aspects of this policy that are not yet explained. First, we cannot compute an expected value for a move we have never taken before. This case is dealt with in line 10 of the code with the GetHeuristic function, which we will define below. Second, there are multiple sequences the agent may pass into GetBestMove. For instance, if the most recent episodes are 1a, 0a and the last observation was 0, the agent may call GetBestMove([], 0), GetBestMove([0a], 0), or GetBestMove([1a, 0a], 0). Each sequence may produce a different recommendation, and the agent must choose which recommendation to follow. These are the corresponding topics of the next two sections.

Explore Heuristic

Let us first deal with the case where we must calculate an expected value for a rule set containing no rules. This can be equivalently stated as follows: ARO has never seen the episode sequence episodes occur, then observed the observation o, and then performed some action a. Furthermore, no expert has provided knowledge on what they expect to occur in this scenario. Thus, taking action a can be thought of as an “exploration” by the agent. The agent must choose between exploring this new action and exploiting its existing knowledge to reach the goal.
   The explore-exploit tradeoff is well studied in the context of the Multi-Armed Bandit problem (Berry and Fristedt 1985) and has several known solutions (Sutton, Barto, and others 1998; Kearns and Singh 2002; Brafman and Tennenholtz 2002). However, these algorithms are not applicable because they assume that the agent always recognizes the identity of the state it has reached. Research with regard to explore-exploit in environments with extreme perceptual aliasing is more difficult. Lacking an established explore-exploit algorithm, ARO takes a simple approach that seems effective.
   To compare the value of this exploration to other actions we have taken, we must express the value of exploring in the units of expected value. It is possible to find an estimate of the expected value by using the rule set corresponding to removing the first episode from episodes. However, repeating this process can lead to infinite-depth recursion. Rather than creating a suitable base case to make the recursion finite, we instead attempt to estimate the value that this recursion will return. Let p denote the probability of reaching the goal according to the 0-deep rule → G : p. We note that if we were to continue the recursion infinitely, the probability of reaching the goal in any step should converge on p. If we estimate that this probability holds constant on every step, then the number of steps to the goal follows a geometric distribution with probability p. This gives an expected value of 1/p. To prevent dividing by 0, we add 1 to the number of goals observed when calculating p. Hence, we can now define the heuristic function:
      function GetHeuristic()
          return 1/p
      end function
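A minimal sketch of this heuristic is given below. The bookkeeping class and the exact form of p, computed here as (goals + 1) divided by (observations + 1), are our own reading, chosen to be consistent with the add-one correction above and with the worked values in the Example section (one observation and zero goals giving a heuristic of 2); the actual implementation may track these counts differently.

class ExploreHeuristic:
    """Sketch of GetHeuristic: the expected steps to the goal if the per-step
    probability of reaching it stays at p, i.e. the mean 1/p of a geometric
    distribution. One phantom goal is added so p is never zero."""

    def __init__(self):
        self.observations = 0
        self.goals = 0

    def record(self, observation):
        """Update the counts after every observation the agent receives."""
        self.observations += 1
        if observation == "G":
            self.goals += 1

    def value(self):
        # Assumed form of p; with 1 observation and 0 goals this gives 1/2,
        # so the heuristic is 2, matching the worked example later in the paper.
        p = (self.goals + 1) / (self.observations + 1)
        return 1.0 / p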
   As a benefit of this technique, we have derived a quantitative value of exploration without using hyperparameters, in contrast to traditional techniques (Sutton, Barto, and others 1998; Kearns and Singh 2002; Brafman and Tennenholtz 2002). Thus, this is particularly suitable to online learning and artificial general intelligence problems where hyperparameter tuning is not desirable. Also of note is that 1/p nearly equals the average number of steps to the goal over the lifetime of the agent (they are not equal because we must add one to the number of goals). This means that ARO’s explore-exploit solution has an intuitive description: explore if the alternative is worse than average.
   This intuitive idea appears in other contexts. Notably, an algorithm called POKER has been applied to the multi-armed bandit problem where the agent is not able to test all the arms before a time horizon is reached (Vermorel and Mohri 2005). In this context, the value of an arm is estimated based on the average value of other arms.

Sequence Selection

The agent may now calculate the best move given a sequence of recent episodes. However, different sized sequences may produce different results. If a large sequence is used, the episode sequence is more likely to be rare in memory, producing unreliable predictions. Furthermore, a long sequence may contain superfluous information, which would imply that a more common and shorter sequence would be a sufficient state identifier. With a small sequence, the episode sequence may be too common for the agent to know where it is in the machine or to determine if it is going in a loop, which could cause undesirable behavior such as infinite looping. To resolve this dilemma, ARO uses a greedy strategy of taking the episode sequence that produces the smallest expected value, and the shortest such sequence if there are multiple options. Hence, if the agent has knowledge of a fast path to the goal, it will take it regardless of sequence length.
   This alone, however, is not sufficient to prevent loops. At timestep t, say the sequence seq has the property EV(seq, o) = x, where x is the minimal expected value and o is the current observation. Hence, the action GetBestMove(seq, o) is performed. Let nextSeq be seq appended by the new episode generated by performing that action. We found that the expected value of nextSeq in timestep t + 1 can be higher than x, usually from an improbable event informing the agent that it is farther from the goal than the average case. When this occurred, we observed that the agent would often switch to a shorter sequence with a better expected value for a move recommendation. However, this is not because the agent actually knows a faster path to the goal, but because the agent effectively “forgets” that it is in a poor position. After the agent has performed a few moves, it will again learn that it is in a poor position and switch to a shorter sequence for a recommendation. By repeating this behavior, the agent gets stuck in a loop.
   To deal with this issue, we force the agent to deal with the bad event by not letting it switch to a different sequence for a move recommendation. In other words, if seq was used to find the best move at timestep t, then nextSeq must be used to find the best move at timestep t + 1. There are two exceptions to this rule:
1. The agent observes a goal. Because the agent is moved randomly upon reaching the goal state, there is no information to be gained by using a sequence with a goal observation.
2. No rules apply to the current situation using nextSeq. This implies the agent is in a novel scenario and the current sequence can provide no data to guide the agent.
   Using these procedures, we can now fully define the policy of ARO. We define the global variable prevSeq, which stores the sequence from the previous time step used to determine which action to take, and is initialized to the empty sequence. ARO’s policy is then described in the following algorithm.
 1: function POLICY(Observation o)
 2:     if o = G then
 3:         Set prevSeq to the empty sequence.
 4:         return α[0]
 5:     end if
 6:     Let e be the last episode.
 7:     Let seq be prevSeq appended by e.
 8:     if there are no rules with cause sequence seq, o∗ or seq contains the episode G then
 9:         Set prevSeq to the sequence of recent episodes such that EV(prevSeq, o) is minimized.
10:         If there is a tie, use the shortest possible sequence.
11:     else
12:         prevSeq = seq
13:     end if
14:     return GetBestMove(prevSeq, o)
15: end function
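To show where this policy sits in the overall interaction described in the Environment section, here is a hypothetical driver loop. The env and agent objects follow the interfaces sketched earlier in this paper (the agent exposing a policy(observation) method and keeping its own episode history); none of this is taken from the evaluated implementation.

def run_trial(env, agent, max_goals):
    """Sketch of the agent-environment loop: feed each observation to the
    agent's policy until a fixed number of goals has been reached, recording
    how many actions each trip to the goal required."""
    steps_per_goal, steps, goals = [], 0, 0
    observation = env.observe(env.current, env.goal)
    while goals < max_goals:
        action = agent.policy(observation)        # the POLICY function above
        observation = env.step(action)
        steps += 1
        if observation == "G":
            goals += 1
            steps_per_goal.append(steps)          # actions needed for this trip
            steps = 0
    return steps_per_goal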
Figure 1: An example FSM

                          Example

Let us use the FSM in Figure 1 as an example for ARO to navigate. Furthermore, suppose that the environment will provide an odd state sensor and that the following two rules are provided by experts on this FSM:
               → 1 : 3/7;   1a, 0b → G : 100%
Note that the first rule says odd states happen only 3/7ths of the time. This is because the goal state should occur 1/7th of the time and the goal is neither even nor odd. Finally, suppose ARO randomly begins in state 6.
   Being in an even state, ARO will be provided an observation of 0. Immediately, ARO will form the new rule → 0 : 4/7 because ARO has never seen an even state before. This probability is chosen because ARO has seen an even state 100% of the time it was not in an odd state, and based on the rule → 1 : 3/7, ARO should not be in an odd state 4/7ths of the time. At this point, ARO has seen 1 state and 0 goals, so p = 1/2 and our heuristic is 2 moves to the goal. (Recall that we add one to the number of goals when computing p.) ARO will compute an expected value of 2 for moves a and b since it has no rules to predict the outcome of either action. In case of a tie, ARO will arbitrarily pick the earlier move in the alphabet, in this case a. Before it returns, ARO will also update prevSeq to [0a]. ARO will then transition to state 4.
   ARO again receives an observation of 0. ARO will form the new rule 0a → 0 : 100% because of this observation. The heuristic will also update to 3. ARO will look for all rule sets with cause sequence 0a, 0∗ because [0a] is prevSeq. However, since no rules apply with this sequence, ARO will check all possible previous sequences to find rule sets. Those sequences are [] and [0a].
   First, ARO calls EV([], 0). The move b has no applicable rule set and thus has an expected value of 3, the heuristic. The move a will find the rule set consisting solely of the rule 0a → 0 : 100%, which will cause a recursive call EV([0a], 0). In this call, ARO will find no rule sets that apply and thus return the heuristic of 3 with BestMove set arbitrarily to a. On return, ARO adds 1 to the return value to account for the step of reaching that state and multiplies by p = 100%, so a has an overall expected value of 4. Since EV returns the minimum value, it will return 3 with BestMove set to b.
   Next, from the POLICY function, ARO will try the other sequence by calling EV([0a], 0). Note that we already evaluated the result of this call in the last paragraph, returning 3 with BestMove set to a. This overlap is very common. If EV calls are cached, it is typically only necessary to call EV a handful of times in order to generate all values needed to select an action. We also have a tie for the best value since both calls returned 3, so we select the move generated by the shortest sequence, which in this case is the move b. Finally, prevSeq is set to [0b] and ARO transitions to state 5.
   ARO will receive an observation of 1, creating the new rules 0b → 1 : 100% and 0a, 0b → 1 : 100%. Again, no rules apply, so all sequences are checked. This operation is common when ARO begins to explore but becomes rare once ARO has gathered data. Since the agent has never seen an odd state before, the only helpful rule is the expert rule 1a, 0b → G : 100%. It may appear that the agent can use this rule, but this is not the case. The agent does not know how likely it is to reach an even state by making the move a; in fact, the agent does not even know if this outcome is possible. As such, it will defer to the heuristic for computing the expected value of a and not use this rule at all. As a consequence of this fact, simpler rules are vastly more important to the agent than complex ones because the former are needed in order to capitalize on the latter. Luckily, those rules are also the easiest for the agent to accurately learn. The best move found will be a as proposed by the empty sequence, so the agent will move into state 6 with prevSeq set to [1a].
   At this point, you may notice that ARO is performing a kind of breadth-first search. With very little knowledge of how the goal is reached, this is the best strategy ARO can perform. However, once the goal has been reached, ARO will begin biasing its actions toward moves and sequences of moves that are more likely to reach the goal.
   Next, ARO receives an observation of 0 and creates the rules 1a → 0 : 100%, 0b, 1a → 0 : 100%, and 0a, 0b, 1a → 0 : 100%. These may seem redundant, but they are all stored independently so that each of their probabilities may change independently in the future. The agent may now capitalize on its expert knowledge. Using prevSeq, the agent finds b is the best move with an expected value of 1 because of the rule 1a, 0b → G : 100%. This beats move a, with an expected value the same as the heuristic at 5. ARO then takes move b to reach the goal.
   Upon reaching the goal, ARO will update and create several rules. A full list of rules that ARO knows is presented below, with rules of equal depth on the same line:
               → 1 : 3/7;   → 0 : 3/7;   → G : 1/7
0a → 0 : 100%;   0b → 1 : 50%;   0b → G : 50%;   1a → 0 : 100%
0a, 0b → 1 : 100%;   0b, 1a → 0 : 100%;   1a, 0b → G : 100%
0a, 0b, 1a → 0 : 100%;   0b, 1a, 0b → G : 100%
Even though the agent saw 3 even states and only one odd state, the expert knowledge provided has allowed the 0-deep rules to converge on the correct probabilities. The agent has very little data, so some of the rules it created are incorrect. For instance, 0a → 0 should not have a 100% probability due to state 2. But the agent has also correctly discovered patterns such as 0a, 0b → 1 : 100%, which it will be able to use immediately to help navigate the FSM.

                          Results

To assess ARO, we compared its performance in the OSFSM environment to that of two other algorithms: MaRz (Rodriguez et al. 2017) and Nearest Sequence Memory (McCallum 1995b). Since these agents cannot incorporate knowledge engineering, no rules were provided for ARO a priori. Since we would expect that being provided expert rules would improve performance, this data can be seen as a worst-case outcome for ARO.
   There were a few minor modifications to the implementation of ARO from the above specification to improve memory usage and run time. First, there is no need to record a rule if a prefix of the cause sequence is unique. If the prefix occurs again, we may look back through history to create the rule on the fly. This modification does not affect the behavior of ARO. Second, we cache the results of EV and GetBestMove after every goal, which dramatically improves computation time. This technically deviates from the specification because the value of GetHeuristic is saved and not recomputed. Since 1/p trends downward as the agent learns more, this means that the agent undervalues exploration compared to the specification. We found the overall effect on the agent to be small.
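The caching described above can be as simple as a memo table keyed by the episode sequence and observation, discarded whenever a goal is reached and the rule probabilities change. The wrapper below is our own illustration of that idea, not the implementation used for the results that follow.

class CachedEV:
    """Sketch of the caching optimization: memoize EV results and clear the
    cache after each goal, when newly learned rules may change the values."""

    def __init__(self, ev_function):
        self.ev_function = ev_function   # e.g. the ev() sketch given earlier
        self.cache = {}

    def __call__(self, episodes, observation, *args):
        key = (tuple(episodes), observation)
        if key not in self.cache:
            self.cache[key] = self.ev_function(episodes, observation, *args)
        return self.cache[key]

    def clear(self):
        """Called after every goal so stale expected values are discarded."""
        self.cache.clear()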
   Figure 2 shows the performance of all three agents in the Blind FSM environment. Figure 3 shows the results for the OSFSM environment. These results are shown for an FSM with 50 states and a three-letter alphabet (three possible actions per state). The abscissa measures successive trips to the goal. The ordinate measures the number of actions required to reach the goal on that trip.

Figure 2: Agents in the Blind FSM Environment with 50 states and an alphabet size of 3. Data averaged from 1000 trials with random FSMs.

Figure 3: Agents in the Odd-Sensor FSM Environment with 50 states and an alphabet size of 3. Data averaged from 1000 trials with random FSMs.

   Predictably, the number of steps to reach the goal declines exponentially on successive trips to the goal, indicating that the agent is learning to improve its behavior. This is true for all three algorithms. MaRz learns faster than NSM for the first few goals but converges at a much less efficient policy. ARO learns significantly faster than NSM and converges at a policy with nearly identical performance to NSM.
   ARO is able to learn near-optimal behavior much more rapidly than the other algorithms. These results were consistent across a variety of different machines ranging from 10 to 90 states and alphabets of 2 to 9 letters (not shown).

                        Future Work

This work demonstrates that ARO is effective in two specific environments with extreme state aliasing. Other tests (not in this paper) have indicated that ARO can handle other deterministic environments as well, including environments with little state aliasing.
   The most pressing concern is that it is not obvious how best to generalize ARO to accommodate a stochastic observation function, such as a function that returns 0 or 1 at random. With a random observation function, storing all possible rules can quickly become infeasible in terms of memory because the number of rules with a given depth is exponential and unbounded. This is unlike a deterministic observation function, where the number of rules with a given depth is no greater than the total number of states of the FSM: one rule corresponding to each possible initial state of the agent at the beginning of the cause sequence. Furthermore, ARO has no means of recognizing that the observations it receives are random, so it uses rules trying to predict the value of random observations as a basis to try to reach the goal. These factors cause ARO to perform poorly.
   Another potential enhancement is to allow ARO to learn other kinds of rules with more general causes or effects. For example, in our previous examples of rules we used a wildcard character for a rule. While such a rule can be provided by an expert, ARO cannot learn it and must instead learn a similar rule for every possible value of the wildcard.
   ARO may also need to be adjusted to address environments with these properties:
• a reward function rather than a single goal state
• a non-deterministic FSM
• a dynamic transition table
• continuously valued sensors
   ARO’s potential extends beyond knowledge integration. As an online algorithm, ARO can take advantage of knowledge immediately and therefore could be a step towards ways in which humans and artificially intelligent agents work collaboratively on a given task in real time.
   In addition, the development of ARO builds upon previous research exploring the application of episodic memory to artificial general intelligence (Rodriguez et al. 2017). ARO, therefore, is likely equally applicable in that context. In particular, the lack of hyperparameters means that it can be applied in a general context and does not need to be re-tuned each time it is used in a new environment.

                         References

Berry, D. A., and Fristedt, B. 1985. Bandit Problems: Sequential Allocation of Experiments (Monographs on Statistics and Applied Probability). London: Chapman and Hall 5:71–87.
Brafman, R. I., and Tennenholtz, M. 2002. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3(Oct):213–231.
Chrisman, L. 1992. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI 1992, 183–188. AAAI Press.
Hopcroft, J. E.; Motwani, R.; and Ullman, J. D. 2006. Introduction to Automata Theory, Languages, and Computation (3rd Edition). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
Kearns, M., and Singh, S. 2002. Near-optimal reinforcement learning in polynomial time. Machine Learning 49(2-3):209–232.
McCallum, A. 1995a. Instance-based utile distinctions for reinforcement learning with hidden state. In ICML, 387–395.
McCallum, R. A. 1995b. Instance-based state identification for reinforcement learning. In Advances in Neural Information Processing Systems, 377–384.
Rodriguez, C.; Marston, G.; Goolkasian, W.; Rosenberg, A.; and Nuxoll, A. 2017. The MaRz algorithm: Towards an artificial general episodic learner. In International Conference on Artificial General Intelligence, 154–163. Springer.
Russell, S., and Norvig, P. 2009. Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ, USA: Prentice Hall Press, 3rd edition.
Shani, G., and Brafman, R. I. 2005. Resolving perceptual aliasing in the presence of noisy sensors. In Advances in Neural Information Processing Systems, 1249–1256.
Sutton, R. S.; Barto, A. G.; et al. 1998. Introduction to Reinforcement Learning. Cambridge, MA: MIT Press.
Vermorel, J., and Mohri, M. 2005. Multi-armed bandit algorithms and empirical evaluation. In European Conference on Machine Learning, 437–448. Springer.