=Paper=
{{Paper
|id=Vol-1097/STIDS2013_P2
|storemode=property
|title=Hierarchical Decision Making
|pdfUrl=https://ceur-ws.org/Vol-1097/STIDS2013_P2_Lewis.pdf
|volume=Vol-1097
|dblpUrl=https://dblp.org/rec/conf/stids/Lewis13
}}
==Hierarchical Decision Making==
Matthew J. Lewis
Data Exploitation Systems
Michigan Aerospace Corporation
Ann Arbor, MI
mlewis@michaero.com
Abstract—Decisions must be made within an appropriate context; we contend that such a context is best represented by a hierarchy of states. The lowest levels of this hierarchy represent the observed raw data, or specific low-level behaviors and decisions. As we ascend the hierarchy, the states become increasingly abstract, representing higher-order tactics, strategies, and over-arching mission goals.

By representing the hierarchy using probabilistic graphical models, we can readily learn the structure and parameters that define a user's behavior by observing his activities over time—what data they use, how it is visualized, and what decisions are made. Once learned, the resulting mathematical models may be combined with the techniques of reinforcement learning to predict behavior and anticipate the needs of the user, delivering appropriate data and visualizations and recommending optimal actions.

Keywords—decision making; hierarchical hidden Markov models; reinforcement learning.

I. INTRODUCTION

Human operators, particularly in the context of military operations, must quickly make critical decisions. Although these operators have access to unprecedented volumes of diverse data sources involving media reports, financial information, imagery, signals intelligence, and human intelligence, making decisions based on such data is confounded by many factors, including: 1) limited human and computational resources; 2) difficulty synthesizing a coherent picture from volumes of manifold and (often) irrelevant data; 3) an inability to derive meaning from, or detect structure in, high-dimensional data; and 4) the randomness and uncertainty intrinsic to the real world. When a human operator is making a decision, much of this data is irrelevant and, worse, confusing.

To reduce the cognitive load of human decision makers and improve the quality of the decisions they make, we can develop frameworks—algorithms, APIs, and user interfaces—that detect what data is relevant and how it should be presented in a manner that is particular to the decision context. For our purposes here, a decision context specifies: 1) the goal, task, or mission relevant to our decision; 2) the reason the task is important (i.e., a global perspective); 3) any constraints and utilities associated with the decision; 4) previous decisions that were made, as well as likely future ones; and, importantly, 5) the set of candidate decisions available to the decision maker.

In fact, human operators may not explicitly conceive of this decision context, but it nevertheless serves as a useful set of latent variables, which help us intelligently aggregate relevant data and determine how to best present it to the human user. We may use machine learning techniques to identify and classify decision contexts, as well as predict which data, in what formats and with what visual representations, are most useful.

In this paper we advocate an approach to building flexible frameworks that identify a decision context that may help anticipate the types of data the user will find useful, as well as how the data should be represented and visualized. As users interact with the framework to make decisions—entering search terms, selecting data, interacting with tables and plots, and ultimately making decisions—the framework learns and adapts, improving its predictive capabilities and honing its notion of what constitutes a decision context.

II. DECISION CONTEXT

A. The Importance of a Decision Context

How can we learn the decision context for a particular task or goal? We have, at the lowest level, the measured input associated with how the user is using a system to help make a decision. This can be much more than the keywords associated with a database search. It could include what elements of an interface the user clicks on, heat maps of cursor positions, how long a given data source is investigated, what data elements a user expands—even, if available via camera interfaces, what information the operator is actually looking at, and for how long. The details we record about how an operator interacts with a user interface (UI)—or, even better, how an entire population of operators interacts with a UI—will yield valuable, predictive insight into what data is useful and what data is quickly discarded by the operator, with respect to the estimated decision context. This information can be gleaned from nearly any system by using external monitoring software packages.

B. Decision Context as a Hierarchy

Making the leap from crude interaction measurements to understanding the intent of the human operator and the decisions he is trying to make is difficult—indeed, a problem at the core of machine learning and artificial intelligence.

One approach to attacking this problem is to understand the decision context as a hierarchy, wherein the lowest layers of the hierarchy represent the raw input signals—the measurements of operator interaction with the UI, and the structure and content of requested data. As we move up in the hierarchy, inputs from the lower layers are mapped to increasingly abstract ideas—operator intent, operator confusion, mission goal, etc.—which together form the decision context. One tool for efficiently representing such hierarchies is the hierarchical hidden Markov model (HHMM) [1].

C. An Example

To take a concrete example, imagine driving a car. At the very lowest level, a driver is taking in visual and auditory information about obstacles and other cars on the road, and making numerous low-level decisions—press the gas, press the brake, turn left, and so forth. But these decisions are always being made in the context of a higher-order goal—say, navigating to the grocery store. And that goal, in turn, is pursued in the context of a yet higher-order goal—needing to eat. Decisions are made at each level in the hierarchy, and influence the decisions made at the other levels.

Consider Fig. 1, which illustrates a portion of a highly simplified hierarchical hidden Markov model. Each blue node in the network represents a state of the system. These states are arranged into a hierarchy of levels. Suppose you are hungry, so that at the highest level in this hierarchy you are in the eat state. There is some probability you will transition to another state at this level—the sleep state, for example. However, more likely, because of your hunger, you will transition to a lower level in the hierarchy, perhaps to the node that represents the go to the grocery store state. This in turn transitions to yet a lower level, to represent increasingly specific sequences of behavior—leave the house followed by drive to the store. When behavior at one level of the hierarchy is completed (which happens when one transitions to a black node), control is returned to one level higher in the hierarchy, where it left off. Note that, in general, states at lower levels change much more quickly than they do at higher levels: things are changing quickly as we drive down the street, stopping at stop signs, taking right turns, etc. But all the while we are still in the eat state—the higher-order context for our behaviors has not changed.

Fig. 1: An example of a hierarchical hidden Markov network.

Learning the structure of such a network, as well as the probabilities associated with transitions between states and levels, can be done efficiently and in an unsupervised manner using the mathematics of probabilistic graphical models [2, 3]. We have implemented these techniques in the context of autonomous vehicle control, predictive analytics, and electronic warfare.

Using these mathematical models, we may take low-level behavioral inputs and infer the higher-order goals that are likely driving this behavior. Conversely, given a higher-order goal, we can estimate the behaviors that will likely be used in the context of that goal. This information can be used to optimally configure a user interface, retrieve relevant data, and otherwise support operator decision making. We describe a way to achieve this optimization below.

III. LEARNING BY INTERACTING WITH THE ENVIRONMENT

A. From States to Actions

We have argued that describing decision context as a hierarchy provides a rich way to describe operator behavior. In a sense, it provides a flexible way of modeling the state of the operator—why an operator is doing something, how he is trying to accomplish it at a strategic level, and what resources he likely needs to support the effort. This representation can determine, at each time step, the most likely decision an operator will make, based on past behavior. But this representation alone does not provide a mechanism for learning the best possible decision at a given time step.

Consider again the car driving example: if we have an HHMM running, a few observations (our operator is in the car, he is driving toward the grocery store) will quickly establish the decision context (the operator needs to eat), and we can predict that he will turn left at the upcoming intersection, because he has done so previously, and because it ultimately leads to the grocery store. But the model predicts or suggests this decision only because of what has been observed in the past, not because turning left happens to be the fastest way to reach the grocery store. In order to optimize the decision making process, we harness another piece of mathematical technology, known as Reinforcement Learning.

B. Reinforcement Learning

Reinforcement Learning (RL) is a machine learning technique inspired by behavioral psychology, developed to emulate the manner in which humans learn via experience [4]. RL is concerned with teaching an agent how to interact with an environment in order to maximize a cumulative reward. RL has been successfully applied to problems across many domains, including industrial planning, autonomous vehicle control, pattern recognition, dynamic channel allocation in the cellular industry, and even the game of chess.

At each time step $t$ of an RL algorithm, the agent finds itself in a state $s_t \in S$. When in this state, it has available a number of actions, or decisions, $a_t$. It selects an action according to a policy $\pi(s, a)$, which records the probability of selecting action $a$ given that the agent is in state $s$, i.e., $\pi(s, a) = P(a \mid s)$. Because of this action, the agent transitions to a new state $s_{t+1}$ and receives a scalar reward $r_{t+1}$.

The reward function is a crucial component. It may be a complex, time-dependent function of the state of the agent and its environment; it is used to encode the goals of a decision making task—that is, by optimizing this function, we achieve the planning goal. Surprisingly complex behavior can emerge from the use of even simple scalar reward functions. In this instance, the reward function could be the utility of the final decision, the speed at which the decision is made, or a combination of these and other measures.

But—and this is critical—the goal of RL is not to maximize the reward received at the very next time step, but rather the total cumulative reward over all time, which we call the return:

$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$    (1)

The scalar $\gamma \in [0, 1]$ determines the importance of future rewards relative to near-term rewards. If $\gamma = 1$, distant future rewards are as important as near-term rewards. For $0 < \gamma < 1$, we refer to $R_t$ as the discounted return.

The advantage of this approach is that reinforcement learning agents are not required to act to maximize short-term gains, but rather learn to act in complex ways to achieve objectives, even if they must make occasionally suboptimal decisions.

An object of fundamental interest in RL is the action-value function, which we denote by $Q(s, a)$. The action-value function records the expected discounted return:

$Q^{\pi}(s, a) = E\{R_t \mid s_t = s, a_t = a\}$    (2)

The superscript $\pi$ indicates that the action-value function is relative to the policy $\pi$. It tells us the expected value of being in state $s$ and taking action $a$. The goal of an RL agent is to learn this function by interacting with its environment. Once this function has been learned, determining an optimal policy $\pi$ is straightforward: given that we are in state $s$, we select the action $a$ that maximizes the expected value—that is, we select the optimal action $a^* = \arg\max_a Q(s, a)$. In fact, we do not always select the optimal action, but sometimes (with probability $\epsilon$) select a suboptimal action. In this way, we manage to avoid the local minima in our reward functions, and may more rapidly adapt as our environment evolves in time.

We learn the $Q$ function iteratively. We begin with an arbitrarily initialized function $Q(s, a)$. Starting at time $t$, in state $s_t$, we select an action $a$ using a policy derived from the function $Q$—i.e., we select the action with the highest expected value. Because of the action, we find ourselves in state $s_{t+1}$ and receive reward $r_{t+1}$. We update the action-value function by replacing the value of $Q(s_t, a_t)$ with

$Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$    (3)

The scalar $\alpha \in [0, 1]$ is a learning rate, which specifies how quickly the system adapts.

After the update, we select another action, and the cycle continues. We can demonstrate (via practical applications and formal mathematical proofs) that this iterative procedure converges to the correct value for the action-value function $Q$. The policy $\pi$ is implicit in the action-value function, as emphasized above: when we are in state $s$ we select $a^* = \arg\max_a Q(s, a)$.

Importantly, the RL algorithm does not stop learning after this convergence, for the simple reason that the environment may be changing, and the agent's behavior may need to adapt accordingly. This occurs naturally in the context of the RL algorithms because there is some nonzero probability ($\epsilon$, as defined above) that we will select a non-optimal action. This ensures that we are trying new things, and although we may not always be making the optimal decision, we can avoid making decidedly poor decisions because our environment has changed from beneath us. There is always, of course, a tradeoff between achieving optimal behavior and responding quickly to a changing environment (i.e., stability vs. maneuverability), but with these methods we can parameterize and quantify the tradeoff.

C. Bringing the Pieces Together

We have discussed two distinct pieces of mathematical equipment that may be used to deal with hierarchical decision making. For the first piece, an HHMM is used to learn the decision contexts relevant to a problem. These decision contexts define a set of states. Given a state, and a set of actions that the operator (or the computer) may take, our second piece of technology, Reinforcement Learning, sets out to learn which decisions will lead to the best outcomes, with respect to a set of reward functions.

The fact that these two pieces share a common language—they understand the operator's state and the actions he may take—indicates that they may work together effectively. Given that the HHMM predicts we are in a particular state, the RL algorithms recommend an action. As a result we transition to a new state, which is estimated by the HHMM, another action is recommended, and the cycle repeats. A human operator may either execute the decision recommended by the RL algorithms, or select another action—and the system learns from the resulting state in either case.

To continue the analogy with the car: as the human operator approaches an intersection, the RL algorithm may understand that the user is in an eat state, attempting to go to the grocery store. Typically, at this point, according to the HHMM, the user turns left, but the RL algorithm may recommend right. As a result, the user arrives at the grocery store four minutes faster than usual. This reinforces the RL algorithm's decision to recommend that action, and in the future, it will preferentially recommend it. If at some point something, say construction, renders that driving route unmanageable, the RL algorithms will adapt accordingly, and perhaps again recommend taking a left-hand turn at the intersection instead.

IV. CONCLUSIONS

Decision making must occur within the appropriate context, and we contend that context is best represented by a hierarchy of states. The lowest levels of this hierarchy represent the observed raw data, or specific low-level behaviors and decisions. As we ascend the hierarchy, the states become increasingly abstract, representing higher-order tactics, strategies, and over-arching mission goals.

By representing this hierarchy using probabilistic graphical models, we can readily learn the structure and parameters of a user's behavior by simply observing their activities over time—what data they use, what plots they make and use, etc. Once learned, the resulting mathematical models may be used to intelligently predict behavior, anticipate the needs of the user, and deliver the appropriate data, visualizations, and other resources before the user even knows he wants them.

Furthermore, given this hierarchical representation, we can use the mathematics of reinforcement learning to help the user make the best possible decision (or decisions) with respect to the specified reward functions.

A. Moving Forward

Although Reinforcement Learning methods have been successfully integrated with probabilistic graphical networks, which allows us to build autonomous decision making systems that learn and adapt from experience, and though these technologies have been applied to a number of disparate fields, including autonomous vehicle control and electronic warfare, more research needs to be done to develop these into a general framework.

In order to produce a flexible framework, we must create a method for easily defining reward functions in terms of the ultimate decision goals of the system, as well as methods for labeling the states of the decision context hierarchy. In addition, for specific uses, we must develop a consistent method for ingesting data so that it may be readily used by these algorithms.

These and other challenges represent future research efforts in this area.

REFERENCES

[1] S. Fine, Y. Singer, and N. Tishby, "The Hierarchical Hidden Markov Model: Analysis and Applications", Machine Learning, vol. 32, pp. 41–62, 1998.
[2] K. Murphy and M. Paskin, "Linear Time Inference in Hierarchical HMMs", NIPS-01 (Neural Information Processing Systems), 2001.
[3] K. Murphy, "Dynamic Bayesian Networks: Representation, Inference and Learning", Ph.D. dissertation, UC Berkeley, Computer Science Division, July 2002.
[4] R. Sutton and A. Barto, "Reinforcement Learning: An Introduction", MIT Press, Cambridge, MA, 1998.
STIDS 2013 Proceedings Page 165
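To give a flavor of the hierarchy described in Section II, the following sketch samples behavior from a two-level version of the driving example: each internal state owns a child chain that runs until it reaches a terminal symbol (the black node of Fig. 1), at which point control returns to the parent. This is a deliberately simplified toy, not a full HHMM as in [1] (no learned parameters, no inference); all state names and probabilities are invented for illustration.

```python
import random

# Hypothetical hierarchy from the driving example. Each internal state maps
# to a child Markov chain: "_start" gives the initial child distribution, and
# each child maps to a {next_child: probability} table; "END" is the
# terminal (black) node that returns control to the parent level.
CHILDREN = {
    "eat":            {"_start": {"go_to_store": 1.0},
                       "go_to_store": {"END": 1.0}},
    "go_to_store":    {"_start": {"leave_house": 1.0},
                       "leave_house": {"drive_to_store": 1.0},
                       "drive_to_store": {"END": 1.0}},
    "drive_to_store": {"_start": {"turn_left": 0.5, "turn_right": 0.5},
                       "turn_left":  {"stop_at_sign": 0.5, "END": 0.5},
                       "turn_right": {"stop_at_sign": 0.5, "END": 0.5},
                       "stop_at_sign": {"turn_left": 0.3, "turn_right": 0.3,
                                        "END": 0.4}},
}

def draw(dist):
    """Sample an outcome from a {outcome: probability} table."""
    r, total = random.random(), 0.0
    for outcome, p in dist.items():
        total += p
        if r < total:
            return outcome
    return outcome  # guard against floating-point round-off

def sample(state, trace):
    """Depth-first generative walk: internal states expand into their child
    chains; primitive states (no children) are emitted as observed behavior."""
    if state not in CHILDREN:
        trace.append(state)
        return
    node = draw(CHILDREN[state]["_start"])
    while node != "END":
        sample(node, trace)
        node = draw(CHILDREN[state][node])

random.seed(1)
trace = []
sample("eat", trace)
print(trace)  # a sequence of low-level behaviors, e.g. starting with 'leave_house'
```

Note how quickly the lowest level churns through turns and stop signs while the top-level eat state never changes, mirroring the observation that lower levels of the hierarchy transition faster than higher ones.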
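As a concrete illustration of Section III, the sketch below runs the update of Eq. (3), with the epsilon-greedy action selection described in the text (this is the standard tabular Q-learning algorithm of [4]), on a toy version of the grocery-store intersection: turning right reaches the store faster and therefore earns a larger reward. The states, actions, and reward values are hypothetical, chosen only to make the example self-contained.

```python
import random

# Hypothetical toy environment: at the intersection the agent may turn left
# (slow route) or right (fast route); reaching the store yields a reward that
# is larger for the faster route.
STATES = ["intersection", "slow_route", "fast_route", "store"]
ACTIONS = {"intersection": ["left", "right"],
           "slow_route": ["continue"],
           "fast_route": ["continue"],
           "store": []}  # terminal state

def step(state, action):
    """Environment model: returns (next_state, reward)."""
    if state == "intersection":
        return ("slow_route", 0.0) if action == "left" else ("fast_route", 0.0)
    if state == "slow_route":
        return "store", 6.0   # reached the goal, but slowly
    if state == "fast_route":
        return "store", 10.0  # reached the goal four minutes faster
    raise ValueError("terminal state has no actions")

alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS[s]}

def choose_action(state):
    """Epsilon-greedy policy derived from Q: usually the best-valued action,
    but with probability epsilon a random one, so the agent keeps exploring."""
    if random.random() < epsilon:
        return random.choice(ACTIONS[state])
    return max(ACTIONS[state], key=lambda a: Q[(state, a)])

random.seed(0)
for episode in range(500):
    state = "intersection"
    while ACTIONS[state]:  # run until a terminal state is reached
        action = choose_action(state)
        next_state, reward = step(state, action)
        # Eq. (3): Q(s,a) <- Q(s,a) + alpha*[r + gamma*max_a' Q(s',a') - Q(s,a)]
        best_next = max((Q[(next_state, a)] for a in ACTIONS[next_state]),
                        default=0.0)
        Q[(state, action)] += alpha * (reward + gamma * best_next
                                       - Q[(state, action)])
        state = next_state

# After training, the learned values prefer the faster route:
# Q(intersection, right) approaches gamma*10 = 9, versus gamma*6 = 5.4 for left.
print(Q[("intersection", "right")], Q[("intersection", "left")])
```

Because epsilon stays nonzero, the agent continues to sample the left turn occasionally; if the environment later changed (say, construction slowed the right-hand route), the same update would shift the preference back, exactly the adaptation behavior discussed in the text.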