=Paper= {{Paper |id=Vol-2819/session3paper3 |storemode=property |title=Causal Learning in Modeling Multi-segment War Game Leveraging Machine Intelligence with EVE Structures |pdfUrl=https://ceur-ws.org/Vol-2819/session3paper3.pdf |volume=Vol-2819 |authors=Ying Zhao,Bruce Nagy,Tony Kendall,Riqui Schwamm }} ==Causal Learning in Modeling Multi-segment War Game Leveraging Machine Intelligence with EVE Structures== https://ceur-ws.org/Vol-2819/session3paper3.pdf
      Modeling A Multi-segment Wargame Leveraging Machine Intelligence and
                        Event-Verb-Event (EVE) Structures


                               Ying Zhao, Bruce Nagy, Tony Kendall, Riqui Schwamm *




                             Abstract

   The paper depicts a generic representation of a multi-segment wargame leveraging machine intelligence with two opposing asymmetrical players. We show an innovative Event-Verb-Event (EVE) structure that is used to represent small pieces of knowledge, actions, and tactics. We show the wargame paradigm and related machine intelligence techniques, including data mining, machine learning, and reasoning AI, which have a natural linkage to causal learning applied to this game. We also show a rule-based reinforcement learning algorithm, Soar-RL, which can modify, link, and combine a large collection of EVE rules, representing existing and new knowledge, to optimize the likelihood of winning or losing a game in the end. We show a simulation and a real-time test of the methodology.

   * This will certify that all author(s) of the above article/paper are employees of the U.S. Government and performed this work as part of their employment, and that the article/paper is therefore not subject to U.S. copyright protection. No copyright. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Proceedings of AAAI Symposium on the 2nd Workshop on Deep Models and Artificial Intelligence for Defense Applications: Potentials, Theories, Practices, Tools, and Risks, November 11-12, 2020, Virtual, published at http://ceur-ws.org. Ying Zhao, Tony Kendall, and Riqui Schwamm are with the Naval Postgraduate School, Monterey, CA; Bruce Nagy is with NAVAIR, China Lake, CA.

                          Introduction

In recent years, machine learning (ML) has successfully used back-propagation and large data sets to reach human-level performance. The techniques have been used to solve difficult pattern recognition problems as perceptive artificial intelligence, or perceptive AI, for many types of use cases. These algorithms implement learning as a process of gradual adjustment of the underlying models' parameters (Lake et al. 2016). However, although there is great potential for these techniques to be applied in real life, they have been criticized for being black boxes and for lacking an understanding of causality, which can be very important for decision makers. Reasoning AI such as reinforcement learning (Sutton and Barto 2014) and game theory (Brown and Sandholm 2017) has been successful as well in terms of producing human-level performance benchmarks. Convolutional neural networks and reinforcement learning combined can achieve the best perceptive AI for input data of imagery and acoustics, with superior human-level performance (Silver, Schrittwieser, and Simonyan 2017).

   These technologies have great potential to address the unique challenges of modeling complex functions of defense applications, including mission planning, decision making, and causal reasoning. Leveraging machine intelligence, in the sense of leveraging big databases and existing/new knowledge and tactics repositories, is critical for the future success of defense applications. For example, when warfighters make decisions, they need to take into consideration all possible states of different types of opponents and adversaries' intentions, strategies, decisions, and actions, which can be overwhelming for humans. Machine intelligence tools are needed to assist humans and reduce their cognitive load. The paper presents a use case and a real-life test with a need to elevate machine intelligence to assist mission planning, decision making, and causal learning for warfighters.

        EVEs Structures and Multi-segment Wargame

We first define a generic representation of a multi-segment wargame with two opposing asymmetrical players as shown in Figure 1. Such a wargame is divided into multiple segments with events and verbs, or actions, alternating between a self-player and an opponent. For each player, actions characterized by the verbs are grouped into a few categories, e.g., VA, VB, VC, VD, and VE. These categories can represent actions typically used in various warfare areas. Events generated by the actions or verbs happen sequentially or in parallel in each segment. Probabilistic rules E → V and V → E represent the valid moves and the probability of states. An EVE example would be: "If an opponent is found (event), then track (verb) the opponent using tool A" and "if the opponent has been successfully tracked (event), then target (verb) the opponent using tool B." Figure 2 shows a high-level list of action options for Segment 1 of a test game named "Battle Readiness Engagement Management (BREM)," shown in Figure 3. Such EVEs can be different or asymmetrical for each player, i.e., two opposing asymmetrical players have their own sets of EVE rules guiding corresponding valid moves. The verbs or actions consume time and other costs. An event represents a single measurable outcome or state after an action. Events are discrete and do not consume time, but they have value (e.g., contribution to winning or losing a game in the end). They are evaluated by a set of unifying equations to determine the expected winning, losing, or drawing status for each of the opposing asymmetrical players.

Figure 1: A wargame divided into multi-segments with events and verbs alternating with opposing asymmetrical players

Figure 2: Action options for Segment 1

Figure 3: Battle Readiness Engagement Management (BREM) Game

          Where Do EVEs Structures Come From?

EVE rules are generated top-down by experts, or learned bottom-up using unsupervised learning from historical data and knowledge repositories, as shown in Figure 4:

• The bottom-up approach applies data mining to historical unstructured text data (e.g., wargame logs) using entity/event extraction tools such as spaCy (sciSpaCy 2020) or BERT (Beltagy, Lo, and Cohan 2019) and sequential pattern/link analysis tools such as lexical link analysis. The data mining process outputs initial EVEs structures that may include ones that are not causal.

• The top-down approach ingests existing databases, such as databases about red/blue capabilities, combined with a human-experts-on-the-loop process to design the initial EVEs. Since it is a human-on-the-loop process, the resulting EVEs are causal. The human experts can also validate, filter, and combine the EVEs from the data mining process and make sure they are causal.

• The initial EVEs are the input to reinforcement learning or other machine learning algorithms to reason about the best course of actions for blue/red. The integration of machine and causal learning techniques has the potential to give the players a tactical decision edge.

   The novelty/significance of the EVEs structures is that they are used to describe small pieces of knowledge, actions, and tactics which can be systematically linked and combined to optimize a global measure of effectiveness (MOE), e.g., the likelihood of winning or losing a game in the end. Such EVEs repositories can be extremely large; some EVE rules may be outdated or inconsistent, as they can be accumulated over a long period of time. New rules and tactics from big data and new sensors need to be incorporated into current and future warfighting planning and execution. As shown in Figure 3 of the BREM game, searching, optimizing, learning, and gaming with novel courses of action using a large collection of knowledge, patterns, and rules from databases can only be made possible for warfighters with the help of computing power and machine intelligence algorithms. One novelty/significance of this paper is to show how to apply a specific form of machine learning, reinforcement learning (RL), i.e., Soar-RL, to select, modify, link, and combine the EVE rules. Finally, our wargame and machine intelligence paradigms possess natural linkages to causal learning, which is especially important to warfighting activities such as mission planning, cognitive behavior, and intent pattern recognition.
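As an illustration, the E → V and V → E probabilistic rules above can be captured in a small data structure. The following is a minimal sketch, not the paper's implementation; the class and function names (`EVERule`, `valid_verbs`) and the probability values are hypothetical, while the two rules themselves come from the paper's worked example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EVERule:
    trigger_event: str   # the E in an E -> V rule
    verb: str            # the action taken when the trigger event holds
    result_event: str    # the E in the follow-on V -> E rule
    probability: float   # assumed probability that the verb yields the result event

# The two rules from the paper's example ("found -> track", "tracked -> target").
RULES = [
    EVERule("opponent_found", "track", "opponent_tracked", 0.8),
    EVERule("opponent_tracked", "target", "opponent_targeted", 0.6),
]

def valid_verbs(current_event, rules=RULES):
    """Return the verbs (valid moves) whose trigger event matches the current state."""
    return [r.verb for r in rules if r.trigger_event == current_event]
```

Chaining such rules segment by segment is what lets small pieces of knowledge be linked into longer tactics; for example, `valid_verbs("opponent_found")` yields `["track"]`.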
                    Causal Learning Factors

Our paper first offers a paradigm, and supporting evidence, that our wargame definition has a natural linkage to causal learning in the following aspects. It relates to the three layers of a causal hierarchy (Mackenzie and Pearl 2018) (Pearl 2018) - association, intervention, and counterfactuals - as well as a few other key elements of causal learning, as detailed in the following sections.

Association

Association is the lowest level in the causality ladder hierarchy. The common consensus of statisticians is that data-driven machine intelligence analysis, including data mining and various flavors of ML, is appropriate for discovering statistical correlations from data. However, machine intelligence requires human analysts to validate the correlations to conclude, in a scalable and error-prone way, which correlations are causal and which ones are coincidental. For instance, the EVE rules can be learned and extracted from historical documents as correlations, associations, and sequential patterns. These rules are later validated by human experts in their areas of domain expertise.

Intervention

Intervention ranks higher than association in the causality ladder hierarchy; it involves taking actions and generating new data. A typical question at this level of the causality ladder would be: What will happen if we increase the intensity of an action defined by a verb? The answers to the question require more than just mining the existing data. They need new data generated in reaction to an intervention, to see whether the underlying action is the cause of the desired effect (e.g., winning the game), or how sensitive the effect is to the cause. The intervention can be modeled as an action or a verb. For example, instead of examining P(X|M), i.e., the likelihood of observable data X given the model M, one should further make sure M is actionable, or that P(X|do(M)) can be examined. The EVEs structures contain verbs as possible actions, and therefore act naturally as interventions. Soar-RL and the related global MOE designed to evaluate the effects of the interventions, or action/state combinations, are shown in Sec. .

Figure 4: Causal learning integration for the war game

Counterfactuals

The top of the causality ladder hierarchy is the typical question "What if I had acted differently?" Traditionally, the effect is defined as the difference between the outcome of an action for an entity and the outcome for the same entity without the action, i.e., P(E|C) − P(E|Not C). However, this causal effect is impossible to observe directly for the same entity; this is commonly referred to as the fundamental problem of causal inference. The predicted counterfactual outcome, or counterfactual-based model of causal inference, has led to key breakthroughs in applied statistics and game AI. The conceptual advance comes from the idea that an entity-level action or "treatment" effect, although it is unobservable, can be predicted in various ways.

   For example, the causal effect is typically measured using two randomized populations, one with the "treatment" or action (cause C) and the other one without the "treatment" or action (Not C, the control group). The two populations are randomized to ensure they are similar to each other (as if they were the same entity) on average and in all other dimensions except the treatment dimension. This is randomized controlled trial (RCT) theory, which is a standard practice in the social sciences, drug development, and clinical trials.

   With recent data-driven approaches such as data mining and machine learning, people can robustly estimate a local average treatment effect in the region of overlap between the treatment and control populations, but inferences for averages outside this zone are sensitive to the underlying machine learning algorithms. For example, people have applied non-parametric machine learning models such as nearest neighbors and random forests (Wager and Athey 2018) for better causal learning, since these methods can approximate local treatment and control populations close to a real RCT setting.

       Relations of Machine Intelligence Algorithms

Traditional statistical analysis heavily depends on hypothesis tests, where the likelihood of the data P(X|H) given a hypothesis H is compared to a null hypothesis H0. The hypothesis test directly relates to maximum likelihood estimation (MLE), where the likelihood of the data P(X|M) given a model M is estimated. The general assumption is that the model is a generative model of the data if the likelihood is maximized compared to all other models; therefore, the model is the cause of the data (effect).

Figure 5: ML algorithms and model complexity
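The RCT-style effect P(E|C) − P(E|Not C) discussed above can be illustrated with a toy estimate from two randomized groups. Everything here is synthetic and assumed for illustration (the win probabilities 0.7 and 0.4 and the function names are not from the paper):

```python
import random

random.seed(0)

def simulate_outcome(treated):
    """Synthetic game outcome: taking the action raises the win probability."""
    p_win = 0.7 if treated else 0.4
    return 1 if random.random() < p_win else 0

# Two randomized populations: with the action (C) and without it (Not C).
treated = [simulate_outcome(True) for _ in range(10000)]
control = [simulate_outcome(False) for _ in range(10000)]

# Estimated effect: P(E|C) - P(E|Not C), close to the true 0.7 - 0.4 = 0.3.
effect = sum(treated) / len(treated) - sum(control) / len(control)
print(round(effect, 2))
```

With large randomized groups the difference of means recovers the treatment effect; the local-average-treatment-effect methods cited above refine this when the two populations only partially overlap.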
The generative models of the current ML/AI (Rezende, Mohamed, and Wierstra 2014), related to the hypothesis tests and MLE, also consider, given classification labels, what the likelihood is of the data that are observed. Causal learning can be considered a subclass of generative models because, at an abstract level, it resembles how the data are actually generated, as shown in reinforcement learning (Sutton and Barto 2014).

   The current machine learning techniques are different from the maximum likelihood estimation approach; for example, neural networks and many other machine learning classifiers directly estimate posterior probabilities of the models (e.g., the probability of a classification given the data), i.e., estimate P(M|X). The posterior-probability-estimation approaches have been criticized for being black boxes, lacking explainability and causality.

   The two approaches have been competing historically. One of the important AI applications is automatic speech recognition, where Hidden Markov Models (HMMs) had been the leading approach since the late 1980s (Juang and Rabiner 1990), yet this framework has been gradually replaced with deep learning components, which are considered a better approach than HMMs for speech recognition (Hinton et al. 2012). In other words, although these black-box ML/AI techniques are not often causal or explainable, they do work well in applications in terms of producing accurate predictions and classifications for new data (Wager and Athey 2018).

   To explain this black-box effectiveness, researchers apply cross-validation, regularization, and transfer learning theories to many machine learning algorithms. The focus of supervised machine learning is to make accurate predictions or classifications for out-of-sample data, i.e., new data that do not show up in the training sample. For example, for machine learning classifiers, a typical classification error graph for training (in-sample) and test (out-of-sample) errors, with respect to the complexity of the models, e.g., measured by the so-called Vapnik-Chervonenkis dimension or VC dimension (Vapnik 2000), is shown in Figure 5. From a machine learning perspective, the out-of-sample error is always larger than the in-sample error (the so-called overfitting problem); there is an optimal model complexity d*vc which gives the lowest test error for given training data. This is usually found using cross-validation or regularization to avoid overfitting (Bishop 2007). The theories of cross-validation and regularization suggest keeping the models as small and smooth as possible so that the difference between in-sample error and out-of-sample error is minimized. Transfer learning theories suggest considering the fundamental reasons why some learning results can be transferred from one data set to another, or from one area to another. For example, in deep learning, the initial layers of neural networks learn the basic features of machine vision, e.g., edges and corners of images, which can be transferred across domains and data sets.

   When applications require causality analysis, as in the multi-segment wargame we study in this paper, we need to integrate the reasonable causal elements with data-driven machine learning approaches together, and always keep human experts on the loop for interpreting cause-and-effect relations.

   In our setting, the EVEs structures' E → V rules, especially how actions are made using these rules, belong to machine learning techniques, while the V → E rules are generative rules.

                  Reinforcement Learning

One of the most successful machine learning techniques is reinforcement learning, where an AI agent takes an action and generates a new state. It learns from reward data from the environment by modifying its internal models. Reinforcement learning is considered a causal learning model in this context since it is designed to generate the desired data (effect) by taking the right actions (cause). The EVEs structures allow performing reinforcement learning, following certain constraints, from large-scale existing knowledge bases.

Soar and Reinforcement Learning (Soar-RL)

Soar (Laird 2012) (Laird, Derbinsky, and Tinkerhess 2012) is a cognitive architecture that scalably integrates a rule-based AI system with many other capabilities, including reinforcement learning and long-term memory. The main decision cycle of Soar involves rules that propose new operators (e.g., internal decisions or external actions), as well as preferences for selecting amongst them; an architectural operator-selection process; and application rules that modify agent state. A preference is defined as the probability, contribution, or impact of reaching the desired outcome (event) if an operator is selected. The reinforcement-learning module (Soar-RL) modifies numeric preferences for selecting operators based on a reward signal, via internal or external source(s) - importantly, Soar-RL learns in an online, incremental fashion and thus does not require batch processing of (potentially big) data. Soar has been used in modeling large-scale complex cognitive functions for warfighting processes like the ones in a kill chain (Zhao, Mooren, and Derbinsky 2017).

Machine Learning in Soar-RL

A multi-segment wargame as we defined it is played by a self-player and her opponent. There are large collections of different (asymmetrical) actions (verbs) for both players based on initial collections of EVE rules. A state is the input data to a self-player that cannot be decided or controlled by herself. Part of the states may refer to the state of an opponent. An opponent may be an environmental factor such as rain or no rain. A combination of the self-player's actions and states can result in a certain reward for the self-player, for example, winning or losing a game in the end. In some cases, the environmental factors can be the opponent. The opponent of a self-player can also be a competitor or an adversary who takes deliberate actions to defeat the self-player. In any of these cases, the self-player needs to constantly simulate the behavior and intent of the potential opponent and take the best course of actions. If the opponent is hidden, information about the opponent is imperfect, or the opponent's intent changes dynamically in the game (Brown and Sandholm 2017), the self-player needs to constantly adjust, dynamically re-program, and adapt her courses of action (COAs).
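The agent-environment loop just described can be sketched in a few lines. This is a generic illustration of online, incremental value learning, not the Soar-RL implementation; the action names, reward probabilities, and learning rate are assumptions:

```python
import random

random.seed(1)

ACTIONS = ["search", "track"]
q = {a: 0.0 for a in ACTIONS}   # learned preference/value per action
alpha = 0.1                     # learning rate

def environment(action):
    """Synthetic environment (a stand-in for the opponent): 'track' pays off more often."""
    return 1.0 if random.random() < (0.8 if action == "track" else 0.3) else -1.0

for _ in range(2000):
    # epsilon-greedy: mostly exploit the current best action, occasionally explore
    a = random.choice(ACTIONS) if random.random() < 0.1 else max(q, key=q.get)
    r = environment(a)           # the action (cause) produces a reward (effect)
    q[a] += alpha * (r - q[a])   # incremental, online update - no batch processing
```

After training, `max(q, key=q.get)` is the higher-reward action; the same cause-action, effect-reward loop is what the paper exploits when it treats reinforcement learning as causal learning.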
Table 1: Asymmetric Action/State Combinations

     Self-Player                      Opponent (e.g., adversarial)
     Action/state combination d1      o1
     Action/state combination di      oj
     ...                              ...
     Action/state combination dN      oM

Table 2: Action/State Combination Components

     Action/State Combination   f1    fi    ...   fK    End Reward
     di                         1     0     ...   1     win
     ...                        ...   ...   ...   ...   not win

   Table 1 shows a self-player and an opponent taking asymmetric action/state combinations. Table 2 shows that each action/state combination can consist of multiple components. An action is a decision the self-player needs to make that can maximize her reward along the game timeline in the end. An action/state combination di consists of a sequence of components fk, each with value v1 or v0, that the self-player needs to decide. An fk with its value v1 or v0 can be an EVE rule or tactic selected from a library of rules with parameters. It can also be the state of the self-player herself (e.g., the capability of her defense) that she needs to consider when making decisions, or the state of the opponent that the self-player has to estimate from observable data (e.g., sensor data).

   We use action/state combinations instead of a course of action (COA) because an action/state combination represents more flexible sequential and parallel actions and states in a wargame, while a COA refers more to traditional sequential actions taken by warfighters.

Soar-RL Details

In Soar-RL, a preference is defined as the probability of a rule being used with respect to a total reward. Translated into the multi-segment wargame, a preference is the contribution of an EVE rule or fk being selected toward a self-player's win. Define preferences fk v1 c1, fk v0 c1, fk v1 c0, and fk v0 c0, where fk v1 c1 means "if an action/state combination component fk is included (v = 1), there is a preference (probability) fk v1 c1 for the self-player to win the game in the end (c = 1)."

   We show how the preferences can be computed for the rules. Let m be the number of rules and N the number of data points for Soar-RL to perform on-policy learning (Laird 2012) (Laird, Derbinsky, and Tinkerhess 2012):

   Q(s_t+1, a_t+1) = Q(s_t, a_t) + α[r + γ max_{a∈A} Q(s_t+1, a) − Q(s_t, a_t)]    (1)

Since we only consider an on-policy setting, or SARSA, Q(s_t+1, a) = 0, and let

   δ_t = α(r_t+1 − Q(s_t, a_t))    (2)

with learning rate α and r_t+1 = 1 for a positive reward or −1 for a negative reward. In order to converge to r* = Q(s*, a*) in Eq. (2), we ask: is there a set of preferences p1, p2, ..., pm that makes δ_t in Eq. (2) as small as possible when t → ∞?

   The total probability of winning for an action/state combination is the summation of the preferences from each of the action/state combination components (the Q-value in Eq. (1)), for any action/state combination di which consists of K components included (v = 1) and K′ components not included (v = 0). Equation (3) predicts a win in the end:

   Σ_{k=1}^{K} fk v∗ c1 > Σ_{k′=1}^{K′} fk′ v∗ c0,    (3)

where ∗ denotes value 1 or 0. The self-player gains a positive reward 1 if a correct action is taken at time t, or a negative reward −1 if a wrong action is taken. For example, if for an action/state combination the total preference added for win is 4 and for lose is 1, the predicted result would be a win. If the ground truth is indeed a win for this combination for the self-player, then each of the K win rules' preferences related to the combination is modified using a positive reward 1/K. If the ground truth is a lose for this combination, each of the same K rules' preferences is modified using a negative reward −1/K. In other words, Soar-RL always modifies the rules that are involved in the predicted win or lose. Note that some components that are not included (v = 0) can also contribute positively to the win (c = 1) in Eq. (3); this is an example of the counterfactuals considered in Soar-RL. Figure 6 (a) shows an example of the BREM game where Soar suggests an action component, i.e., "C2 (Command and Control) - Assess weapons and aircraft availability including CASREPs." Figure 6 (b) shows that Soar makes the suggestion by selecting the highest difference of the scores for class 1 (good) and class 0 (bad) among all nine possible choices of the current segment of the game. Since the score (-0.099093) is negative, Soar's predicted class is 0 (bad). Figure 6 (c) shows that Soar-RL updated the preferences of the rules to reflect the predicted class and the input components.

Figure 6: Soar example in BREM
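The prediction rule of Eq. (3) and the ±1/K preference update can be sketched as follows. This is a simplified illustration, not the Soar production-rule implementation: the component names are hypothetical, and for brevity the update is driven directly by the ground-truth outcome.

```python
def predict_win(combo, win_pref, lose_pref):
    """Eq. (3): win if the summed win-preferences of the combination's
    components exceed the summed lose-preferences."""
    return sum(win_pref[k] for k in combo) > sum(lose_pref[k] for k in combo)

def update(combo, truth_win, win_pref, lose_pref):
    """Spread a +/-1 reward equally over the K rules involved (the 1/K update)."""
    k_count = len(combo)
    for k in combo:
        if truth_win:
            win_pref[k] += 1.0 / k_count
        else:
            lose_pref[k] += 1.0 / k_count

# Toy run: component "c1" appears only in winning combinations.
win_pref = {"c1": 0.0, "c2": 0.0, "c3": 0.0}
lose_pref = {"c1": 0.0, "c2": 0.0, "c3": 0.0}
history = [(["c1", "c2"], True), (["c2", "c3"], False), (["c1", "c3"], True)]
for combo, truth in history:
    update(combo, truth, win_pref, lose_pref)
```

After these three updates `win_pref["c1"]` is 1.0 while its lose-preference stays 0, so combinations containing "c1" are predicted to win; this is the sense in which the learned preferences single out components as causes of the outcome.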
                        Simulation
The simulation case contains about 50 components for an action/state combination. The components include 30-dimensional actions as a vector Sa possibly taken by a self-player, 20-dimensional states of the opponent (o in Table 2), and 10-dimensional states of the self-player. In our representation, as in Table 2, each component of a state or action is represented as binary 1 or 0 in a Boolean lattice (Boolean 2019). The training set contains about 1 million action/state combinations tagged with win or lose for the end-game results; the test set contains about 300,000 action/state combinations. The value (good or bad) of each action/state combination is based on whether the self-player wins (good) or loses (bad) a game. Fewer than 10% of the combinations are good. There are 2^50 possible combinations of the self-player's action and state components, so the sample data sets cover only a small part of all possible combinations. The paper focuses on applying Soar-RL to learn the value (win or lose a game) function from sample action/state combinations, as shown in Eq. (4):

W in or lose = f(Sa, Ss, Os)    (4)

   When a self-player simulates and performs what-if analysis in the future, she can use Eq. (4) and optimization algorithms to search actions Sa while varying Ss, Os, or both.
   We used Soar-RL with a fixed learning rate α = 0.0004 and reward values of 1 when the value is predicted correctly compared to the ground truth and −1 when the value prediction is incorrect.

Convergence of Soar-RL
Since Soar-RL is an online, on-policy machine learning algorithm, it is important to show that the algorithm converges in theory and in practice. Figure 7 shows the convergence of the changes of the preferences for the use case by iteration 20. The convergence of the preferences can be proved using game theory and reinforcement learning theory.

Soar-RL and Counterfactuals
When Soar-RL learns/updates the rules, it learns/updates the preferences of a component as well as the preferences of the counterfactuals. In other words,
 • P(good result | component k included),
 • P(good result | component k not included),
 • P(bad result | component k included), and
 • P(bad result | component k not included),
for an action/state combination with K components included and K′ components not included, are estimated independently and used for causality reasoning to see whether a component k is a cause of a good or bad result (effect).
   We also compute

P(good result | an action/state combination)
  = Σ_{k=1}^{K} P(good result | component k included)
  + Σ_{k=1}^{K′} P(good result | component k not included)    (5)

and

P(bad result | an action/state combination)
  = Σ_{k=1}^{K} P(bad result | component k included)
  + Σ_{k=1}^{K′} P(bad result | component k not included).    (6)
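As an illustrative sketch (helper names and toy data are hypothetical, not the paper's code), the four conditional preferences and the scores of Eqs. (5)–(6) can be estimated from win/lose-tagged combinations by simple counting:

```python
# Illustrative sketch: estimate P(result | component k included / not included)
# from win/lose-tagged binary combinations, then score a new combination
# per Eqs. (5)-(6). All names and data here are hypothetical.

def estimate_preferences(samples):
    """samples: list of (components, won) with components a tuple of 0/1 flags."""
    n = len(samples[0][0])
    prefs = {}
    for k in range(n):
        for v in (0, 1):                      # component k not included / included
            subset = [won for comps, won in samples if comps[k] == v]
            p_good = sum(subset) / len(subset) if subset else 0.0
            prefs[(k, v)] = {"good": p_good, "bad": 1.0 - p_good}
    return prefs

def score(prefs, comps, result):
    """Eq. (5) for result='good', Eq. (6) for result='bad': sum preferences
    over included (v=1) and not-included (v=0) components."""
    return sum(prefs[(k, v)][result] for k, v in enumerate(comps))

# Toy data: four tagged combinations of two components.
samples = [((1, 0), 1), ((1, 1), 1), ((0, 1), 0), ((0, 0), 0)]
prefs = estimate_preferences(samples)
combo = (1, 0)
good, bad = score(prefs, combo, "good"), score(prefs, combo, "bad")
prediction = "good" if good > bad else "bad"   # difference of Eqs. (5) and (6)
```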




Figure 7: The convergence of learning preferences of the rules in Soar-RL

Figure 8: Students Test Setup at the Naval Postgraduate School (NPS)
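Returning to Eq. (4), the what-if analysis can be sketched as a search over candidate action vectors Sa with the states held fixed. The brute-force enumeration below is purely illustrative; the value function, states, and dimensions are hypothetical stand-ins, and a real 30-dimensional action space would require an optimization algorithm such as DE rather than enumeration:

```python
# Illustrative what-if search over Eq. (4): given a learned value function
# f(Sa, Ss, Os), enumerate small binary action vectors Sa and keep the best.
from itertools import product

def what_if_search(f, ss, os_, n_actions):
    """Return the action vector maximizing the learned value for fixed states."""
    best_sa, best_val = None, float("-inf")
    for sa in product((0, 1), repeat=n_actions):   # feasible only for small n
        val = f(sa, ss, os_)
        if val > best_val:
            best_sa, best_val = sa, val
    return best_sa, best_val

# Toy value function: reward including actions 0 and 2, penalize action 1.
toy_f = lambda sa, ss, os_: sa[0] + sa[2] - sa[1]
best_sa, best_val = what_if_search(toy_f, ss=(1, 0), os_=(0, 1), n_actions=3)
```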
   The difference between P(good result | an action/state combination) and P(bad result | an action/state combination) is used to predict whether an action/state combination is good or bad.

Soar-RL and Explainable AI (XAI)
Soar-RL is based on understandable EVE rules and therefore provides the advantage of explainable AI (XAI) (xai 2018). The rules used in the prediction and updating are listed in Figure 6(c); the preferences of the rules reflect the predicted class and the input components.

                   The Test at the NPS
The BREM game can be played by a human, e.g., Naval Postgraduate School (NPS) students (the blue player), against the AI assistant (the red player), as shown in Figure 3. Once we receive NPS Institutional Review Board (IRB) approval, we will organize an event and recruit students to test the BREM game and play against the AI assistant, as shown in Figure 8. There are 100 dimensional states/actions in total: 42 simulation conditions (environmental factors and logistics), 20 defender's (opponent's) states, 29 attacker's (self-player's) states, and nine self-player actions, i.e., the weapon mix. The nine self-player actions (Sa) will be compared with the AI assistant's actions powered by the algorithms. Soar-RL is used to learn the value function (win or lose) f, as shown in Figure 8. From preliminary observations among researchers playing the BREM game, we collected data from ten games in total. Before the machine learning, human players won eight games and the AI assistant won one game. The nine games were used in the machine learning combining Soar-RL with DE. After the machine learning, the AI assistant won one game that a human player lost. The lesson learned is that human players tend to use default or known actions, while the AI assistant was able to search a wide range of possible actions when performing the machine learning.

Differential Evolution (DE) Algorithms
We focus on evolutionary algorithms to optimize the nine self-player actions. Genetic algorithms (Goldberg 1989) are evolutionary algorithms that keep the metaphor of genetic reproduction with selection, mutation, and crossover; they are useful where the objective function's derivatives are not easy to compute and gradient-descent algorithms therefore cannot be applied for optimization. Differential evolution (DE) (Rocca, Oliveri, and Massa 2011) is in the same style but handles the optimization parameters as real numbers (floats or doubles). For continuous parameter optimization, DE is a derivative-free method inspired by the theory of evolution: the fittest individuals of a population are more likely to survive, so the population improves generation after generation.

                         Conclusion
In this paper, we showed the EVE structures and the integration of machine learning, causal learning, and optimization techniques used in modeling a multi-segment wargame. We showed how the EVE structures and related machine intelligence techniques are used to link, modify, and update a large collection of existing and new knowledge and tactics for a multi-segment wargame. We also illustrated the critical elements of causal learning for the EVE structures and Soar-RL. We tested the methodology with a simulation data set and a real-life game test with NPS human players. The integration of machine and causal learning techniques has the potential for a wide range of tactical decision edge applications.

                      Acknowledgements
The authors would like to thank NAVAIR China Lake, the Office of Naval Research (ONR), the NPS's Naval Research Program (NRP), and DARPA's Explainable Artificial Intelligence (XAI) program for supporting the research. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

                         References
Beltagy, I.; Lo, K.; and Cohan, A. 2019. A pretrained language model for scientific text. Retrieved from https://arxiv.org/abs/1903.10676.
Bishop, C. M. 2007. Pattern Recognition and Machine Learning. New York, NY, USA: Springer.
Boolean. 2019. Boolean lattice. Retrieved from https://www.sciencedirect.com/topics/mathematics/boolean-lattice.
Brown, N., and Sandholm, T. 2017. Safe and nested endgame solving for imperfect-information games. Proceedings of the AAAI Workshop on Computer Poker and Imperfect Information Games.
Goldberg, G. 1989. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison Wesley.
Juang, B. H., and Rabiner, L. R. 1990. Hidden Markov models for speech recognition. Technometrics 33(3):251–272.
Laird, J. E.; Derbinsky, N.; and Tinkerhess, M. 2012. Online determination of value-function structure and action-value estimates for reinforcement learning in a cognitive architecture. Advances in Cognitive Systems 2:221–238.
Laird, J. E. 2012. The Soar Cognitive Architecture. Cambridge, MA, USA: MIT Press.
Lake, B. M.; Ullman, T. D.; Tenenbaum, J. B.; and Gershman, S. J. 2016. Building machines that learn and think like people. Retrieved from https://arxiv.org/abs/1604.00289.
Mackenzie, D., and Pearl, J. 2018. The Book of Why: The New Science of Cause and Effect. New York, NY, USA: Penguin.
Pearl, J. 2018. The seven pillars of causal reasoning with reflections on machine learning. Retrieved from https://ftp.cs.ucla.edu/pub/stat ser/r481.pdf.
Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. Proceedings of the 31st International Conference on Machine Learning (ICML).
Rocca, P.; Oliveri, G.; and Massa, A. 2011. Differential evolution as applied to electromagnetics. IEEE Antennas and Propagation Magazine 53(1):38–49.
sciSpaCy. 2020. scispaCy. Retrieved from https://allenai.github.io/scispacy/.
Silver, D.; Schrittwieser, J.; and Simonyan, K. 2017. Mastering the game of Go without human knowledge. Nature 550:354–359.
Sutton, R. S., and Barto, A. G. 2014. Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press.
Vapnik, V. 2000. The Nature of Statistical Learning Theory. New York, NY, USA: Springer.
Wager, S., and Athey, S. 2018. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113(523):1228–1242.
xai. 2018. DARPA XAI. Retrieved from https://www.darpa.mil/program/explainable-artificial-intelligence.
Zhao, Y.; Mooren, E.; and Derbinsky, N. 2017. Reinforcement learning for modeling large-scale cognitive reasoning. Proceedings of the 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, 233–238.