                              Hierarchical Decision Making
                                                          Matthew J. Lewis
                                                    Data Exploitation Systems
                                                  Michigan Aerospace Corporation
                                                          Ann Arbor, MI
                                                      mlewis@michaero.com


Abstract—Decisions must be made within an appropriate context; we contend that such context is best represented by a hierarchy of states. The lowest levels of this hierarchy represent the observed raw data, or specific low-level behaviors and decisions. As we ascend the hierarchy, the states become increasingly abstract, representing higher order tactics, strategies, and over-arching mission goals.

By representing the hierarchy using probabilistic graphical models, we can readily learn the structure and parameters that define a user's behavior by observing his activities over time—what data he uses, how it is visualized, and what decisions are made. Once learned, the resulting mathematical models may be combined with the techniques of reinforcement learning to predict behavior and anticipate the needs of the user, delivering appropriate data and visualizations, and recommending optimal actions.

Keywords—decision making; hierarchical hidden Markov models; reinforcement learning.

I. INTRODUCTION

Human operators, particularly in the context of military operations, must quickly make critical decisions. Although these operators have access to unprecedented volumes of diverse data sources involving media reports, financial information, imagery, signals intelligence, and human intelligence, making decisions based on such data is confounded by many factors, including: 1) limited human and computational resources; 2) difficulty synthesizing a coherent picture from volumes of manifold and (often) irrelevant data; 3) an inability to derive meaning from or detect structure in high dimensional data; and, 4) the randomness and uncertainty intrinsic to the real world. When a human operator is making a decision, much of this data is irrelevant and, worse, confusing.

To reduce the cognitive load of human decision makers and improve the quality of the decisions they make, we can develop frameworks—algorithms, APIs, and user interfaces—that detect what data is relevant and how it should be presented in a manner that is particular to the decision context. For our purposes here, a decision context specifies: 1) the goal, task, or mission relevant to our decision; 2) the reason the task is important (i.e., a global perspective); 3) any constraints and utilities associated with the decision; 4) previous decisions that were made, as well as likely future ones; and, importantly, 5) the set of candidate decisions available to the decision maker.

In fact, human operators may not explicitly conceive of this decision context, but it nevertheless serves as a useful set of latent variables, which help us intelligently aggregate relevant data and determine how best to present it to the human user. We may use machine learning techniques to identify and classify decision contexts, as well as to predict which data, in what formats and with what visual representations, are most useful.

In this paper we advocate an approach to building flexible frameworks that identify a decision context that may help anticipate the types of data the user will find useful, as well as how the data should be represented and visualized. As users interact with the framework to make decisions, entering search terms, selecting data, interacting with tables and plots, and ultimately making decisions, the framework learns and adapts, improving its predictive capabilities and honing its notion of what constitutes a decision context.

II. DECISION CONTEXT

A. The Importance of a Decision Context

How can we learn the decision context for a particular task or goal? We have, at the lowest level, the measured input associated with how the user is using a system to help make a decision. This can be much more than the keywords associated with a database search. It could include what elements of an interface the user clicks on, heat maps of cursor positions, how long a given data source is investigated, what data elements a user expands—even, if available via camera interfaces, what information the operator is actually looking at, and for how long. The details we record about how an operator interacts with a user interface (UI)—or, even better, how an entire population of operators interacts with a UI—will yield valuable, predictive insight into what data is useful and what data is quickly discarded by the operator, with respect to the estimated decision context. This information can be gleaned from nearly any system by using external monitoring software packages.

B. Decision Context as a Hierarchy

Making the leap from crude interaction measurements to understanding the intent of the human operator and the decisions he is trying to make is difficult—indeed, a problem at the core of machine learning and artificial intelligence.
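The kind of raw interaction measurements described above can be made concrete with a small sketch. The event schema and helper class below are purely illustrative assumptions for this paper's setting, not part of any system it describes:

```python
import time
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class InteractionEvent:
    """One low-level observation of operator behavior (hypothetical schema)."""
    kind: str        # e.g., "click", "search", "expand"
    target: str      # the UI element or data source involved
    timestamp: float = field(default_factory=time.time)

class InteractionLog:
    """Accumulates raw UI events: the lowest level of the proposed hierarchy."""
    def __init__(self):
        self.events = []

    def record(self, kind, target):
        self.events.append(InteractionEvent(kind, target))

    def target_counts(self):
        """A crude relevance signal: how often each data source is touched."""
        return Counter(e.target for e in self.events)

log = InteractionLog()
log.record("search", "sigint_db")
log.record("click", "media_reports")
log.record("expand", "media_reports")
print(log.target_counts())  # Counter({'media_reports': 2, 'sigint_db': 1})
```

Streams of such timestamped events would serve as the observed symbols feeding the lowest level of the hierarchy.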




                                                 STIDS 2013 Proceedings Page 162
One approach to attacking this problem is to understand the decision context as a hierarchy, wherein the lowest layers of the hierarchy represent the raw input signals—the measurements of operator interaction with the UI, and the structure and content of requested data; as we move up in the hierarchy, inputs from the lower layers are mapped to increasingly more abstract ideas—operator intent, operator confusion, mission goal, etc.—which together form the decision context. One tool for efficiently representing such hierarchies is the hierarchical hidden Markov model (HHMM) [1].

C. An Example

To take a concrete example, imagine driving a car. At the very lowest level, a driver is taking in visual and auditory information about obstacles and other cars on the road, and making numerous low-level decisions—press the gas, press the brake, turn left, and so forth. But these decisions are always being made in the context of a higher-order goal—say, navigating to the grocery store. And that goal, in turn, is made in the context of a yet higher order goal—needing to eat. Decisions are made at each level in the hierarchy, and influence the decisions made at the other levels.

Consider Fig. 1, which illustrates a portion of a highly simplified hierarchical hidden Markov model. Each blue node in the network represents a state of the system. These states are arranged into a hierarchy of levels. Suppose you are hungry, so that at the highest level in this hierarchy you are in the eat state. There is some probability you will transition to another state at this level—the sleep state, for example. More likely, however, because of your hunger, you will transition to a lower level in the hierarchy, perhaps to the node that represents the go to the grocery store state. This in turn transitions to yet a lower level, to represent increasingly specific sequences of behavior—leave the house followed by drive to the store. When behavior at one level of the hierarchy is completed (which happens when one transitions to a black node), control is returned to one level higher in the hierarchy, where it left off. Note that, in general, state transitions at lower levels occur much more quickly than they do at higher levels: things are changing quickly as we drive down the street, stopping at stop signs, taking right turns, etc. But all the while we are still in the eat state—the higher order context for our behaviors has not changed.

Fig. 1: An example of a hierarchical hidden Markov network.

Learning the structure of such a network, as well as the probabilities associated with transitions between states and levels, can be done efficiently and in an unsupervised manner using the mathematics of probabilistic graphical models [2, 3]. We have implemented these techniques in the context of autonomous vehicle control, predictive analytics, and electronic warfare.

Using these mathematical models, we may take low-level behavioral inputs and infer the higher order goals that are likely driving this behavior. Conversely, given a higher order goal, we can estimate the behaviors that will likely be used in the context of that goal. This information can be used to optimally configure a user interface, retrieve relevant data, and otherwise support operator decision making. We describe a way to achieve this optimization below.

III. LEARNING BY INTERACTING WITH THE ENVIRONMENT

A. From States to Actions

We have argued that describing decision context as a hierarchy provides a rich way to describe operator behavior. In a sense, it provides a flexible way of modeling the state of the operator—why an operator is doing something, how he is trying to accomplish it at a strategic level, and what resources he likely needs to support the effort. This representation can determine, at each time step, which decision an operator is most likely to make, based on past behavior. But this representation alone does not provide a mechanism for learning the best possible decision at a given timestep.

Consider again the car driving example: if we have an HHMM running, a few observations (our operator is in the car, he is driving toward the grocery store) will quickly establish the decision context (the operator needs to eat), and can predict that he will turn left at the upcoming intersection, because he has done so previously, and because it ultimately leads to the grocery store. But it predicts or suggests this decision only because of what has been observed in the past, not because turning left happens to be the fastest way to reach the grocery store. In order to optimize the decision making process, we harness another piece of mathematical technology, known as Reinforcement Learning.

B. Reinforcement Learning

Reinforcement Learning (RL) is a machine learning technique inspired by behavioral psychology, developed to emulate the manner in which humans learn via experience [4]. RL is concerned with teaching an agent how to interact with an environment in order to maximize a cumulative reward. RL has been successfully applied to problems across many domains, including industrial planning, autonomous vehicle control, pattern recognition, dynamic channel allocation in the cellular industry, and even the game of chess.

At each time step t of an RL algorithm, the agent finds itself in a state s_t ∈ S. When in this state, it has available a set of actions, or decisions, A(s_t). It selects an action according to a policy π(s, a), which records the probability of selecting action a given that the agent is in state s, i.e., π(s, a) = P(a | s). Because of this action, the agent transitions to a new state s_{t+1} and receives a scalar reward r_{t+1}.

The reward function is a crucial component. It may be a complex, time-dependent function of the state of the agent and his environment; it is used to encode the goals of a decision making task—that is, by optimizing this function, we achieve the planning goal. Surprisingly complex behavior can emerge from the use of even simple scalar reward functions.
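The state–action–reward cycle just defined can be sketched in a few lines. The states, actions, transition table, and probabilities below are invented for illustration (echoing the driving example); this is a toy, not the system described in the paper:

```python
import random

# A toy policy π(s, a) = P(a | s): for each state, a distribution over actions.
policy = {
    "at_intersection": {"turn_left": 0.8, "turn_right": 0.2},
    "on_road":         {"drive": 1.0},
}

# Toy environment dynamics: (state, action) -> (next state s_{t+1}, reward r_{t+1}).
dynamics = {
    ("at_intersection", "turn_left"):  ("on_road", 0.0),
    ("at_intersection", "turn_right"): ("on_road", 1.0),  # the faster route
    ("on_road", "drive"):              ("at_intersection", 0.0),
}

def select_action(state):
    """Sample a_t from the policy distribution π(s_t, ·)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return random.choices(actions, weights=weights)[0]

state = "at_intersection"
for t in range(4):
    action = select_action(state)
    state, reward = dynamics[(state, action)]
    print(t, action, state, reward)
```

The agent here merely follows a fixed policy; the question taken up next is how a good policy can be learned from the rewards themselves.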




In this instance, the reward function could be the utility of the final decision, the speed at which the decision is made, or a combination of these and other measures.

But—and this is critical—the goal of RL is not to maximize the reward received at the very next time step, but rather the total cumulative reward over all time, which we call the return:

    R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ⋯    (1)

The scalar γ ∈ [0, 1] determines the importance of future rewards relative to near-term rewards. If γ = 1, distant future rewards are as important as near-term rewards. For 0 < γ < 1, we refer to R_t as the discounted return.

The advantage of this approach is that reinforcement learning agents are not required to act to maximize short-term gains, but rather learn to act in complex ways to achieve objectives, even if they must make occasionally suboptimal decisions.

An object of fundamental interest in RL is the action-value function, which we denote by Q(s, a). The action-value function records the expected discounted return:

    Q^π(s, a) = E{R_t | s_t = s, a_t = a}    (2)

The superscript π indicates that the action-value function is relative to the policy π. It tells us the expected value of being in state s and taking action a. The goal of an RL agent is to learn this function by interacting with its environment. Once this function has been learned, determining an optimal policy π is straightforward: given we are in state s, we select the action a that maximizes the expected value—that is, we select the optimal action a* = argmax_a Q(s, a). In fact, we do not always select the optimal action, but sometimes (with probability ε) select a suboptimal action. In this way, we manage to avoid local optima in our reward functions, and may more rapidly adapt as our environment evolves in time.

We learn the Q function iteratively. We begin with an arbitrarily initialized function Q(s, a). Starting at time t, in state s_t, we select an action a using a policy derived from the function Q—i.e., we select the action with the highest expected value. Because of the action, we find ourselves in state s_{t+1}, and receive reward r_{t+1}. We update the action-value function by replacing the value of Q(s_t, a_t) with

    Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]    (3)

The scalar α ∈ [0, 1] is a learning rate, which specifies how quickly the system adapts.

After the update, we select another action, and the cycle continues. It can be demonstrated (via practical applications and formal mathematical proofs) that this iterative procedure converges to the correct value for the action-value function Q. The policy π is implicit in the action-value function, as emphasized above: when we are in state s we select a* = argmax_a Q(s, a).

Importantly, the RL algorithm does not stop learning after this convergence, for the simple reason that the environment may be changing, and the agent's behavior may need to adapt accordingly. This occurs naturally in the context of the RL algorithms because there is some nonzero probability (ε, as defined above) that we will select a non-optimal action. This ensures that we are always trying new things, and although we may not always be making the optimal decision, we can avoid making decidedly poor decisions because our environment has changed from beneath us. There is always, of course, a tradeoff between achieving optimal behavior and responding quickly to a changing environment (i.e., stability vs. maneuverability), but with these methods we can parameterize and quantify the tradeoff.

C. Bringing the Pieces Together

We have discussed two distinct pieces of mathematical equipment that may be used to deal with hierarchical decision making. For the first piece, an HHMM is used to learn the decision contexts relevant to a problem. These decision contexts define a set of states. Given a state, and a set of actions that the operator (or the computer) may take, our second piece of technology, Reinforcement Learning, sets out to learn which decisions will lead to the best outcomes, with respect to a set of reward functions.

The fact that these two pieces share a common language—they understand the operator's state and the actions he may take—indicates they may work together effectively. Given that the HHMM predicts we are in a particular state, the RL algorithms recommend an action. As a result we transition to a new state, which is estimated by the HHMM, another action is recommended, and the cycle repeats. A human operator may either execute the decision recommended by the RL algorithms, or select another action—and the system learns from the resulting state in either case.

To continue the analogy with the car: as the human operator approaches an intersection, the RL algorithm may understand that the user is in an eat state, attempting to go to the grocery store. Typically, at this point, according to the HHMM, the user turns left, but the RL algorithm may recommend right. As a result, the user arrives at the grocery store four minutes faster than usual. This reinforces the RL algorithm's decision to recommend that action, and in the future, it will preferentially recommend it. If at some point something, say construction, renders that driving route unmanageable, the RL algorithms will adapt accordingly, and perhaps again recommend taking a left hand turn at the intersection, instead.
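The cycle just described can be sketched as a tabular Q-learner paired with a stubbed-out context estimator. Everything here is an illustrative assumption: the two-action intersection, the reward values, and the `estimate_context` stub standing in for the HHMM:

```python
import random

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """The iterative update of Eq. (3) on a tabular action-value function Q."""
    q_old = Q.get((s, a), 0.0)
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = q_old + alpha * (r + gamma * best_next - q_old)

def recommend(Q, s, actions, eps=0.1):
    """ε-greedy: usually a* = argmax_a Q(s, a), occasionally explore."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def estimate_context(observations):
    """Stub standing in for the HHMM: maps raw observations to a state."""
    return "at_intersection"

actions = ["turn_left", "turn_right"]
reward = {"turn_left": 0.0, "turn_right": 1.0}   # the right turn saves time
Q = {}
random.seed(1)
for _ in range(500):
    s = estimate_context(None)        # HHMM supplies the state estimate
    a = recommend(Q, s, actions)      # RL recommends an action
    q_update(Q, s, a, reward[a], s, actions)
print(recommend(Q, "at_intersection", actions, eps=0.0))
```

After training, the greedy recommendation settles on the higher-reward right turn, and would re-adapt if the rewards changed under it, mirroring the construction example above.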




IV. CONCLUSIONS

Decisions must be made within the appropriate context, and we contend that context is best represented by a hierarchy of states. The lowest levels of this hierarchy represent the observed raw data, or specific low-level behaviors and decisions. As we ascend the hierarchy, the states become increasingly abstract, representing higher order tactics, strategies, and over-arching mission goals.

By representing this hierarchy using probabilistic graphical models, we can readily learn the structure and parameters of a user's behavior by simply observing his activities over time—what data he uses, what plots he makes and uses, etc. Once learned, the resulting mathematical models may be used to intelligently predict behavior, anticipate the needs of the user, and deliver the appropriate data, visualizations, and other resources before the user even knows he wants them.

Furthermore, given this hierarchical representation, we can use the mathematics of reinforcement learning to help the user make the best possible decision (or decisions) with respect to the specified reward functions.

A. Moving Forward

Although Reinforcement Learning methods have been successfully integrated with probabilistic graphical networks, allowing us to build autonomous decision making systems that learn and adapt from experience, and though these technologies have been applied to a number of disparate fields, including autonomous vehicle control and electronic warfare, more research needs to be done to develop these into a general framework.

In order to produce a flexible framework, we must create a method for easily defining reward functions in terms of the ultimate decision goals of the system, as well as methods for labeling the states of the decision context hierarchy. In addition, for specific uses, we must develop a consistent method for ingesting data so that it may be readily used by these algorithms.

These and other challenges represent future research efforts in this area.

REFERENCES

[1] S. Fine, Y. Singer, and N. Tishby, "The Hierarchical Hidden Markov Model: Analysis and Applications", Machine Learning, vol. 32, pp. 41–62, 1998.
[2] K. Murphy and M. Paskin, "Linear Time Inference in Hierarchical HMMs", NIPS-01 (Neural Information Processing Systems), 2001.
[3] K. Murphy, "Dynamic Bayesian Networks: Representation, Inference and Learning", PhD thesis, UC Berkeley, Computer Science Division, July 2002.
[4] R. Sutton and A. Barto, "Reinforcement Learning: An Introduction", MIT Press, Cambridge, MA, 1998.



