=Paper=
{{Paper
|id=Vol-2819/session3paper3
|storemode=property
|title=Causal Learning in Modeling Multi-segment War Game Leveraging Machine Intelligence with EVE Structures
|pdfUrl=https://ceur-ws.org/Vol-2819/session3paper3.pdf
|volume=Vol-2819
|authors=Ying Zhao,Bruce Nagy,Tony Kendall,Riqui Schwamm
}}
==Causal Learning in Modeling Multi-segment War Game Leveraging Machine Intelligence with EVE Structures==
Modeling A Multi-segment Wargame Leveraging Machine Intelligence and
Event-Verb-Event (EVE) Structures
Ying Zhao, Bruce Nagy, Tony Kendall, Riqui Schwamm *
Abstract

The paper depicts a generic representation of a multi-segment wargame leveraging machine intelligence with two opposing asymmetrical players. We show an innovative Event-Verb-Event (EVE) structure that is used to represent small pieces of knowledge, actions, and tactics. We show the wargame paradigm and related machine intelligence techniques, including data mining, machine learning, and reasoning AI, which have a natural linkage to causal learning applied to this game. We also specifically show a rule-based reinforcement learning algorithm, i.e., Soar-RL, which can modify, link, and combine a large collection of EVEs rules, which represent existing and new knowledge, to optimize the likelihood of winning or losing a game in the end. We show a simulation and a real-time test for the methodology.

* This will certify that all author(s) of the above article/paper are employees of the U.S. Government and performed this work as part of their employment, and that the article/paper is therefore not subject to U.S. copyright protection. No copyright. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Proceedings of AAAI Symposium on the 2nd Workshop on Deep Models and Artificial Intelligence for Defense Applications: Potentials, Theories, Practices, Tools, and Risks, November 11-12, 2020, Virtual, published at http://ceur-ws.org. Ying Zhao, Tony Kendall, and Riqui Schwamm are with the Naval Postgraduate School, Monterey, CA; Bruce Nagy is with the NAVAIR, China Lake, CA.

Introduction

In recent years, machine learning (ML) has successfully used back-propagation and large data sets to reach human-level performance. The techniques have been used to solve difficult pattern recognition problems as perceptive artificial intelligence, or perceptive AI, for many types of use cases. These algorithms implement learning as a process of gradual adjustment of the underlying models' parameters (Lake et al. 2016). However, although there is great potential for these techniques to be applied to real life, they have been criticized for being black boxes and for lacking an understanding of causality, which can be very important for decision makers. Reasoning AI such as reinforcement learning (Sutton and Barto 2014) and game theory (Brown and Sandholm 2017) has been successful as well in terms of producing human-level performance benchmarks. Convolutional neural networks and reinforcement learning combined can achieve the best perceptive AI for imagery and acoustic input data, with superior human-level performance (Silver, Schrittwieser, and Simonyan 2017).

These technologies have great potential to address the unique challenges of modeling complex functions of defense applications, including mission planning, decision making, and causal reasoning. Leveraging machine intelligence, in the sense of leveraging big databases and existing/new knowledge and tactics repositories, is critical for the future success of defense applications. For example, when warfighters make decisions, they need to take into consideration all possible states of different types of opponents and adversaries' intentions, strategies, decisions, and actions, which can be overwhelming for humans. Machine intelligence tools are needed to assist humans and reduce their cognitive load. The paper presents a use case and a real-life test with a need to elevate machine intelligence to assist mission planning, decision making, and causal learning for warfighters.

EVEs Structures and Multi-segment Wargame

We first define a generic representation of a multi-segment wargame with two opposing asymmetrical players as shown in Figure 1. Such a wargame is divided into multiple segments with events and verbs or actions alternating between a self-player and an opponent. For each player, actions characterized by the verbs are grouped into a few categories, e.g., VA, VB, VC, VD, and VE. These categories can represent actions typically used in various warfare areas. Events generated by the actions or verbs happen sequentially or in parallel in each segment. Probabilistic rules E → V and V → E represent the valid moves and the probabilities of states. An EVE example would be: "If an opponent is found (event), then track (verb) the opponent using tool A" and "if the opponent has been successfully tracked (event), then target (verb) the opponent using tool B." Figure 2 shows a list of action options for Segment 1 at a very high level for a test game named "Battle Readiness Engagement Management (BREM)" as shown in Figure 3. Such EVEs can be different or asymmetrical for each player, i.e., the two opposing asymmetrical players have their own sets of EVEs rules guiding the corresponding valid moves.

The verbs or actions consume time and other costs. An event represents a single measurable outcome or state after an action. Events are discrete and do not consume time, but they have value (e.g., a contribution to winning or losing a game in the end). They are evaluated by a set of unifying equations to determine the expected winning, losing, or drawing status for each of the opposing asymmetrical players.
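To make the EVE structure concrete, the following is a minimal Python sketch (not from the original paper) of how an E → V → E rule with an associated probability and learned preference could be represented and chained within a segment; the class and field names are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class EVERule:
    """One Event-Verb-Event (EVE) piece of knowledge: if `trigger_event`
    is observed, taking `verb` leads to `result_event` with `probability`.
    `preference` is a learned contribution of this rule toward winning."""
    trigger_event: str
    verb: str
    result_event: str
    probability: float        # P(result_event | trigger_event, verb)
    preference: float = 0.0

# Two chained rules matching the tracking/targeting example in the text.
rules = [
    EVERule("opponent_found", "track_with_tool_A", "opponent_tracked", 0.8),
    EVERule("opponent_tracked", "target_with_tool_B", "opponent_targeted", 0.6),
]

def valid_moves(current_event, rule_set):
    """Return the E -> V rules (actions) that fire on the current event."""
    return [r for r in rule_set if r.trigger_event == current_event]

print([r.verb for r in valid_moves("opponent_found", rules)])
```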
Where Do EVEs Structures Come From?

EVEs rules are generated top-down from experts, or learned bottom-up using unsupervised learning from historical data and knowledge repositories, as shown in Figure 4:

• The bottom-up approach applies data mining to historical unstructured text data (e.g., wargame logs) using entity/event extraction tools such as spaCy (sciSpaCy 2020) or BERT (Beltagy, Lo, and Cohan 2019) and sequential pattern/link analysis tools such as lexical link analysis. The data mining process outputs initial EVEs structures that may include ones that are not causal (see the code sketch below).

• The top-down approach ingests existing databases, such as databases about red/blue capabilities, combined with a human-experts-on-the-loop process to design the initial EVEs. Since it is a human-on-the-loop process, the resulting EVEs are causal. The human experts can also validate, filter, and combine the EVEs from the data mining process and make sure they are causal.

• The initial EVEs are the input to the reinforcement learning or other machine learning algorithms to reason about the best courses of action for blue/red. The integration of machine and causal learning techniques has the potential to provide a tactical decision edge for the players.

The novelty/significance of the EVEs structures is that they are used to describe small pieces of knowledge, actions, and tactics which can be systematically linked and combined to optimize a global measure of effectiveness (MOE), e.g., the likelihood of winning or losing a game in the end. Such EVEs repositories can be extremely large; some EVEs rules may be outdated or inconsistent, as they can be accumulated over a long period of time. New rules and tactics from big data and new sensors must be incorporated into current and future warfighting planning and execution. As shown in Figure 3 of the BREM game, searching, optimizing, learning, and gaming with novel courses of action using a large collection of knowledge, patterns, and rules from databases can only be made possible for warfighters with the help of computing power and machine intelligence algorithms. One novelty/significance of this paper is to show how to apply a specific form of machine learning, reinforcement learning (RL), i.e., Soar-RL, to select, modify, link, and combine the EVEs rules. Finally, our wargame and machine intelligence paradigms possess natural linkages to causal learning that is especially important to warfighting activities such as mission planning, cognitive behavior, and intent pattern recognition.
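As a rough illustration of the bottom-up mining step above, the following Python sketch extracts candidate (event, verb) fragments from wargame log sentences using spaCy's part-of-speech tags and chains consecutive sentences into candidate EVE triples. It is a simplified assumption about such a pipeline (the approach described here also relies on sciSpaCy/BERT extraction and lexical link analysis), and the helper names and sample log text are hypothetical.

```python
import spacy

# Small general-purpose English pipeline; the richer sciSpaCy/BERT plus
# lexical-link-analysis tooling mentioned in the text is not reproduced here.
nlp = spacy.load("en_core_web_sm")

log = ("Opponent found near sector 4. Tracked the opponent using tool A. "
       "Opponent successfully tracked. Targeted the opponent using tool B.")

def sentence_fragments(text):
    """For each sentence, return its first verb lemma and the sentence text
    (a crude stand-in for an extracted event/action description)."""
    doc = nlp(text)
    frags = []
    for sent in doc.sents:
        verbs = [t.lemma_ for t in sent if t.pos_ == "VERB"]
        frags.append((verbs[0] if verbs else None, sent.text.strip()))
    return frags

def candidate_eves(frags):
    """Chain consecutive sentences into candidate Event -> Verb -> Event triples."""
    triples = []
    for (v1, s1), (v2, s2) in zip(frags, frags[1:]):
        if v2:  # the next sentence supplies the verb taken in response to event s1
            triples.append({"event": s1, "verb": v2, "next_event": s2})
    return triples

for t in candidate_eves(sentence_fragments(log)):
    print(t)
```

Candidate triples mined this way would still need the human-on-the-loop validation described in the list above before being treated as causal EVEs.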
Figure 1: A wargame divided into multi-segments with events and verbs alternating with opposing asymmetrical players

Figure 2: Action options for Segment 1

Figure 3: Battle Readiness Engagement Management (BREM) Game
Causal Learning Factors

Our paper first offers a paradigm and supporting evidence that our wargame definition has a natural linkage to causal learning in the following aspects. It relates to the three layers of a causal hierarchy (Mackenzie and Pearl 2018) (Pearl 2018) - association, intervention, and counterfactuals - as well as a few other key elements of causal learning, as detailed in the following sections.

Association

Association is the lowest level in the causality ladder hierarchy. The common consensus of statisticians is that data-driven machine intelligence analysis, including data mining and various flavors of ML, is appropriate for discovering statistical correlations from data. However, machine intelligence requires human analysts to validate the correlations and conclude which correlations are causal and which ones are coincidental, a process that is hard to scale and error-prone. For instance, the EVEs rules can be learned and extracted from historical documents as correlations, associations, and sequential patterns. These rules are later validated by human experts in their area of domain expertise.

Intervention

Intervention ranks higher than association in the causality ladder hierarchy; it involves taking actions and generating new data. A typical question at this level of the causality ladder would be: what will happen if we increase the intensity of an action defined by a verb? The answers to this question require more than just mining the existing data. They need new data generated in reaction to an intervention, to see whether the underlying action is the cause of the desired effect (e.g., winning the game) or how sensitive the effect is to the cause. The intervention can be modeled as an action or a verb. For example, instead of examining P(X|M), i.e., the likelihood of the observable data X given the model M, one should further make sure M is actionable, or that P(X|do(M)) can be examined. The EVEs structures contain verbs as possible actions and therefore act as interventions naturally. How Soar-RL and the related global MOE are designed to evaluate the effects of the interventions or action/state combinations is shown in the Soar-RL sections below.

Counterfactuals

The top of the causality ladder hierarchy addresses a typical question: "What if I had acted differently?" Traditionally, the causal effect is defined as the difference between the outcome of an action for an entity and the outcome for the same entity without the action, i.e., P(E|C) − P(E|Not C). However, this causal effect is impossible to observe directly for the same entity, which is commonly referred to as the fundamental problem of causal inference. The predicted counterfactual outcome, or counterfactual-based model of causal inference, has led to key breakthroughs in applied statistics and game AI. The conceptual advance comes from the idea that an entity-level action or "treatment" effect, although it is unobservable, can be predicted in various ways.

For example, the causal effect is typically measured using two randomized populations, one with the "treatment" or action (the cause C) and the other one without the "treatment" or action (Not C, or the control group). The two populations are randomized to ensure they are similar to each other (as if they were the same entity) on average and in all other dimensions except the treatment dimension. This is the randomized controlled trial (RCT) theory, which is a standard practice in the social sciences, drug development, and clinical trials.

With recent data-driven approaches such as data mining and machine learning, people can robustly estimate a local average treatment effect in the region of overlap between treatment and control populations, but inferences for averages outside this zone are sensitive to the underlying machine learning algorithms. For example, people have applied non-parametric machine learning models such as nearest neighbors and random forests (Wager and Athey 2018) for better causal learning, since these methods can approximate local treatment and control populations close to a real RCT setting.

Figure 4: Causal learning integration for the war game
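To ground the counterfactual discussion, here is a minimal sketch (not from the paper) of the difference-in-means treatment-effect estimate described above, together with a crude nearest-neighbor counterfactual prediction for a single entity; the synthetic data and function names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: x is a 1-D covariate, treatment raises the win probability.
n = 2000
x = rng.uniform(0.0, 1.0, n)
treated = rng.integers(0, 2, n).astype(bool)              # randomized "treatment" C
outcome = (rng.uniform(size=n) < 0.3 + 0.2 * treated + 0.1 * x).astype(float)

# Randomized-trial style estimate of the average treatment effect:
# E[outcome | C] - E[outcome | Not C].
ate = outcome[treated].mean() - outcome[~treated].mean()
print(f"estimated average treatment effect: {ate:.3f}")

def nearest_neighbor_counterfactual(x0, x, treated, outcome, want_treated):
    """Predict the unobserved outcome for an entity at covariate x0 by
    borrowing the outcome of the most similar entity in the other group."""
    pool = np.where(treated == want_treated)[0]
    j = pool[np.argmin(np.abs(x[pool] - x0))]
    return outcome[j]

# Entity-level effect for one untreated entity: predicted counterfactual minus observed.
i = int(np.where(~treated)[0][0])
effect_i = nearest_neighbor_counterfactual(x[i], x, treated, outcome, True) - outcome[i]
print(f"predicted entity-level effect for entity {i}: {effect_i:+.1f}")
```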
Relations of Machine Intelligence Algorithms

Traditional statistical analysis heavily depends on hypothesis tests, where the likelihood of the data P(X|H) given a hypothesis H is compared to a null hypothesis H0. The hypothesis test directly relates to maximum likelihood estimation (MLE), where the likelihood of the data P(X|M) given a model M is estimated. The general assumption is that the model is a generative model of the data if the likelihood is maximized compared to all other models; therefore, the model is the cause of the data (effect). The generative models of current ML/AI (Rezende, Mohamed, and Wierstra 2014), related to hypothesis tests and MLE, also consider, given classification labels, what the likelihood of the observed data is. Causal learning can be considered a subclass of generative models because, at an abstract level, it resembles how the data are actually generated, as shown in reinforcement learning (Sutton and Barto 2014).

Many current machine learning techniques are different from the maximum likelihood estimation approach; for example, neural networks and many other machine learning classifiers directly estimate posterior probabilities of the models (e.g., the probability of a classification given the data), i.e., they estimate P(M|X). The posterior-probability-estimation approaches have been criticized for being black boxes and lacking explainability and causality.

The two approaches have been competing historically. One of the important AI applications is automatic speech recognition, where Hidden Markov Models (HMMs) had been the leading approach since the late 1980s (Juang and Rabiner 1990), yet this framework has been gradually replaced with deep learning components, which are considered a better approach than HMMs for speech recognition (Hinton et al. 2012). In other words, although these black-box ML/AI techniques are often not causal or explainable, they do work well in applications in terms of producing accurate predictions and classifications for new data (Wager and Athey 2018).

To explain this black-box effectiveness, for many machine learning algorithms, researchers apply cross-validation, regularization, and transfer learning theories. The focus of supervised machine learning is to make accurate predictions or classifications for out-of-sample data, i.e., new data that do not show up in the training sample. For machine learning classifiers, a typical classification error graph for train (in-sample) and test (out-of-sample) errors with respect to the complexity of the models, e.g., measured by the so-called Vapnik-Chervonenkis (VC) dimension (Vapnik 2000), is shown in Figure 5. From the machine learning perspective, the out-of-sample error is always larger than the in-sample error (the so-called overfitting problem), and there is an optimal model complexity d*_vc which gives the lowest test error for a given training data set. This is usually found using cross-validation or regularization to avoid overfitting (Bishop 2007). The theories of cross-validation and regularization suggest keeping the models as small and smooth as possible so that the difference between in-sample error and out-of-sample error is minimized. Transfer learning theories suggest considering the fundamental reasons why some learning results can be transferred from one data set to another, or from one area to another. For example, in deep learning, the initial layers of neural networks learn the basic features of machine vision, e.g., edges and corners of images, which can be transferred across domains and data sets.

Figure 5: ML algorithms and model complexity

When applications require causality analysis, such as the multi-segment wargame we study in this paper, we need to integrate the reasonable causal elements with data-driven machine learning approaches and always keep human experts on the loop for interpreting the cause and effect relations.

In our setting, the EVEs structures' E → V rules, especially how actions are made to use these rules, belong to machine learning techniques, while the V → E rules are generative rules.

Reinforcement Learning

One of the most successful machine learning techniques is reinforcement learning, where an AI agent takes an action and generates a new state. It learns from reward data from the environment by modifying its internal models. Reinforcement learning is considered a causal learning model in this context since it is designed to generate the desired data (effect) by taking the right actions (cause). The EVEs structures allow reinforcement learning to be performed, following certain constraints, from large-scale existing knowledge bases.

Soar and Reinforcement Learning (Soar-RL)

Soar (Laird 2012) (Laird, Derbinsky, and Tinkerhess 2012) is a cognitive architecture that scalably integrates a rule-based AI system with many other capabilities, including reinforcement learning and long-term memory. The main decision cycle of Soar involves rules that propose new operators (e.g., internal decisions or external actions), as well as preferences for selecting amongst them; an architectural operator-selection process; and application rules that modify agent state. A preference is defined as the probability, contribution, or impact of reaching the desired outcome (event) if an operator is selected. The reinforcement-learning module (Soar-RL) modifies numeric preferences for selecting operators based on a reward signal, either via internal or external source(s). Importantly, Soar-RL learns in an online, incremental fashion and thus does not require batch processing of (potentially big) data. Soar has been used in modeling large-scale complex cognitive functions for warfighting processes like the ones in a kill chain (Zhao, Mooren, and Derbinsky 2017).
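The following Python sketch mimics, at a very high level, the propose/select/apply cycle with numeric preferences described above. It is not Soar syntax and not the authors' implementation; the operators, preference values, and transition table are schematic assumptions.

```python
import random

# Numeric preferences for proposed operators (verbs), keyed by (state, operator);
# in Soar these would be learned by the Soar-RL module from reward signals.
preferences = {
    ("opponent_found", "track_with_tool_A"): 0.7,
    ("opponent_found", "wait"): 0.1,
    ("opponent_tracked", "target_with_tool_B"): 0.6,
}

def propose_operators(state):
    """Proposal rules: return operators applicable in the current state."""
    return [op for (s, op) in preferences if s == state]

def select_operator(state, epsilon=0.1):
    """Architectural selection: mostly pick the highest-preference operator,
    occasionally explore (an assumption; Soar supports several policies)."""
    ops = propose_operators(state)
    if random.random() < epsilon:
        return random.choice(ops)
    return max(ops, key=lambda op: preferences[(state, op)])

def apply_operator(state, op):
    """Application rules: a toy state transition standing in for V -> E rules."""
    transitions = {
        ("opponent_found", "track_with_tool_A"): "opponent_tracked",
        ("opponent_tracked", "target_with_tool_B"): "opponent_targeted",
    }
    return transitions.get((state, op), state)

state = "opponent_found"
for _ in range(2):
    op = select_operator(state)
    state = apply_operator(state, op)
    print(op, "->", state)
```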
Machine Learning in Soar-RL

A multi-segment wargame as we defined it is played by a self-player and her opponent. There are large collections of different (asymmetrical) actions (verbs) for both players based on initial collections of EVEs rules. A state is input data to a self-player that cannot be decided or controlled by herself. Part of the states may refer to the state of an opponent. An opponent may be an environmental factor such as rain or no rain. A combination of the self-player's actions and states can result in a certain reward for the self-player, for example, winning or losing a game in the end. In some cases, the environmental factors can be the opponent. The opponent of a self-player can also be a competitor or an adversary who can take deliberate actions to defeat the self-player. In any of these cases, the self-player needs to constantly simulate the behavior and intent of the potential opponent and take the best course of actions. If the opponent is hidden, information about the opponent is imperfect, or the opponent's intent changes dynamically in the game (Brown and Sandholm 2017), the self-player's courses of action (COAs) need to be constantly adjusted, dynamically re-planned, and adapted.

Table 1 shows a self-player and an opponent taking asymmetric action/state combinations. Table 2 shows that each action/state combination can consist of multiple components. An action is a decision the self-player needs to make that can maximize her reward along the game timeline in the end. An action/state combination di consists of a sequence of components fk, each with a value v1 or v0, that the self-player needs to decide. An fk with its value v1 or v0 can be an EVEs rule or tactic selected from a library of rules with parameters. An fk with its value v1 or v0 can also be a state of the self-player herself (e.g., the capability of her defense) that she needs to consider when making decisions; an fk with its value v1 or v0 can also be a state of the opponent that the self-player has to estimate from observable data (e.g., sensor data).

We use action/state combinations instead of a course of action (COA) because an action/state combination represents more flexible sequential and parallel actions and states in a wargame, while a COA refers more to traditional sequential actions taken by warfighters.

Table 1: Asymmetric Action/State Combinations

                               Self-Player   Opponent (e.g., adversarial)
  Action/state combination     d1            o1
  Action/state combination     di            oj
  ...                          ...           ...
  Action/state combination     dN            oM

Table 2: Action/State Combination Components

  Action/State Combination   f1    fi    ...   fK    End Reward
  di                         1     0     ...   1     win
  ...                        ...   ...   ...   ...   not win

Soar-RL Details

In Soar-RL, a preference is defined as the probability of a rule being used with respect to a total reward. Translated into the multi-segment wargame, a preference is the contribution of an EVE rule or fk, when selected, to the self-player winning. Define preferences fk v1 c1, fk v0 c1, fk v1 c0, and fk v0 c0, where fk v1 c1 means "if an action/state combination component fk is included (v = 1), there is a preference (probability) fk v1 c1 for the self-player to win the game in the end (c = 1)."

We show how the preferences can be computed for the rules. Let m be the number of rules and N the number of data points for Soar-RL to perform on-policy learning (Laird 2012) (Laird, Derbinsky, and Tinkerhess 2012):

Q(s_{t+1}, a_{t+1}) = Q(s_t, a_t) + α [ r + γ max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_t) ]   (1)

Since we only consider an on-policy setting, or SARSA, Q(s_{t+1}, a) = 0, and let

δ_t = α ( r_{t+1} − Q(s_t, a_t) )   (2)

with learning rate α, and r_{t+1} = 1 for a positive reward or −1 for a negative reward. In order to converge to r* = Q(s*, a*) in Eq. (2), we ask: is there a set of preferences p1, p2, ..., pm that makes δ_t in Eq. (2) as small as possible when t → ∞?

The total probability of winning for an action/state combination is the summation of the preferences from each of the action/state combination components (the Q-value in Eq. (1)). For any action/state combination di which consists of K components included (v = 1) and K′ components not included (v = 0), Equation (3) predicts a win in the end:

Σ_{k=1}^{K} fk v* c1  >  Σ_{k′=1}^{K′} fk′ v* c0,   (3)

where * denotes the value 1 or 0. The self-player gains a positive reward of 1 if a correct action is taken at time t or a negative reward of −1 if a wrong action is taken. For example, if for an action/state combination the total preference added for win is 4 and for lose is 1, the predicted result would be win. If the truth is indeed a win for this combination for the self-player, then each of the K win rules' preferences related to the combination is modified using a positive reward 1/K. If the ground truth is a loss for this combination, each of the same K rules' preferences is modified using a negative reward −1/K. In other words, Soar-RL always modifies the rules that are involved in the predicted win or lose. Note that some components that are not included (v = 0) can also contribute positively to the win (c = 1) in Eq. (3); this is an example of the counterfactuals considered in Soar-RL. Figure 6 (a) shows an example of the BREM game where Soar suggests an action component, i.e., "C2 (Command and Control) - Assess weapons and aircraft availability including CASREPs." Figure 6 (b) shows that Soar makes the suggestion by selecting the highest difference of the scores for class 1 (good) and class 0 (bad) among all nine possible choices of the current segment of the game. Since the score (-0.099093) is negative, Soar's predicted class is 0 (bad). Figure 6 (c) shows that Soar-RL updated the preferences of the rules reflecting the predicted class and the input components.

Figure 6: Soar example in BREM
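A minimal sketch of the preference bookkeeping implied by Eqs. (1)-(3) follows. It is our own illustrative Python, assuming a tabular preference per (component, included-or-not, win-or-lose) and the ±1/K update described above; it is not the authors' Soar production rules, and the demo data are made up.

```python
from collections import defaultdict

# preference[(k, v, c)]: contribution of component k with value v (1=included,
# 0=not included) toward outcome c (1=win, 0=lose); cf. the fk v c preferences.
preference = defaultdict(float)

def predict_win(combo):
    """Eq. (3): predict a win when the summed 'win' preferences of the
    components exceed their summed 'lose' preferences."""
    win_score = sum(preference[(k, v, 1)] for k, v in enumerate(combo))
    lose_score = sum(preference[(k, v, 0)] for k, v in enumerate(combo))
    return win_score > lose_score, win_score - lose_score

def update(combo, won, alpha=0.0004):
    """SARSA-style step (Eq. (2)): spread a +/-1 reward over the components
    that supported the predicted outcome, scaled by the learning rate."""
    predicted_win, _ = predict_win(combo)
    c = 1 if predicted_win else 0
    supporting = list(enumerate(combo))
    reward = 1.0 if won == predicted_win else -1.0
    for k, v in supporting:
        preference[(k, v, c)] += alpha * reward / len(supporting)

# Tiny demo on 4-component combinations tagged with ground-truth win/lose.
data = [([1, 0, 1, 1], True), ([0, 0, 1, 0], False), ([1, 1, 1, 0], True)]
for _ in range(200):
    for combo, won in data:
        update(combo, won)
print([predict_win(c)[0] for c, _ in data])
```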
Simulation

The simulation case contains about 50 components for an action/state combination. The 50 components include 30 dimensions of actions as a vector Sa possibly taken by a self-player, 20 dimensions of states of the opponent (o in Table 2), and 10 dimensions of states of the self-player. In our representation, as in Table 2, each component of a state or action is represented as a binary 1 or 0 in a Boolean lattice (Boolean 2019). The training set contains about 1 million action/state combinations that have win and lose tagged for the end-game results. The test set contains about 300,000 action/state combinations. The value (good or bad) of each action/state combination is based on whether the self-player wins (good) or loses (bad) a game. Less than 10% of the combinations are good combinations. There are 2^50 possible combinations for the self-player's action and state components; the sample data sets are only a small part of all the possible combinations. The paper focuses on applying Soar-RL to learn the value (win or lose a game) function from sample action/state combinations as shown in Eq. (4):

Win or lose = f(Sa, Ss, Os)   (4)

When a self-player simulates and performs what-if analysis in the future, she can use Eq. (4) and optimization algorithms to search actions Sa while varying Ss, Os, or both. We used Soar-RL with a fixed learning rate α = 0.0004 and reward values of 1 when the value is predicted correctly compared to the ground truth and −1 when the value prediction is incorrect compared to the ground truth.
diction is incorrect compared to the ground truth.
X
= P (bad result|component k included)
k=1 (6)
Convergence of Soar-RL 0
K
Since Soar-RL is an online on-policy machine learning al-
X
+ P (bad result|component k not included).
gorithm, it is important to show the algorithm converges in
k=1
theory and in practice. Figure 7 shows the convergence of the
changes of the preferences for the use case when the itera-
tion is 20. The convergence of the preferences can be proved
using the game theory and reinforcement learning theory.
Figure 7: The convergence of learning preferences of the Figure 8: Students Test Setup at the Naval Postgraduate
rules in Soar-RL School (NPS)
The difference between techniques used to link, modify, and update a large collec-
P (good result|an action/state combination) and tion of existing and new knowledge and tactics for a multi-
P (bad result|an action/state combination) segment wargame. We also illustrated the critical elements
is used to predict if an action combination is good or bad. of causal learning for the EVEs structures and Soar-RL. We
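The four conditional estimates and the sums in Eqs. (5)-(6) can be sketched as simple counting over tagged combinations. This is only an illustrative reading of the equations in plain Python, with made-up data, rather than the Soar-RL implementation.

```python
from collections import Counter

# Tagged training combinations: (binary component vector, good result?).
data = [([1, 0, 1], True), ([1, 1, 0], True), ([0, 1, 1], False), ([0, 0, 1], False)]

hits = Counter()    # counts of good results per (component k, value v)
totals = Counter()  # counts of observations per (component k, value v)
for combo, good in data:
    for k, v in enumerate(combo):
        totals[(k, v)] += 1
        if good:
            hits[(k, v)] += 1

def p_good(k, v):
    """Estimated P(good result | component k has value v), 0.5 if unseen."""
    return hits[(k, v)] / totals[(k, v)] if totals[(k, v)] else 0.5

def scores(combo):
    """Eqs. (5)-(6): sum the per-component conditional estimates for the
    components as they appear (included or not) in this combination."""
    good = sum(p_good(k, v) for k, v in enumerate(combo))
    bad = sum(1.0 - p_good(k, v) for k, v in enumerate(combo))
    return good, bad

for combo, _ in data:
    g, b = scores(combo)
    print(combo, "predicted", "good" if g > b else "bad")
```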
Soar-RL and Explainable AI (XAI)

Soar-RL is also based on understandable EVEs rules and therefore provides the advantage of explainable AI (XAI) (xai 2018). The rules used in the prediction and updating are listed in Figure 6 (c).

The Test at the NPS

The BREM game can be played by a human, e.g., Naval Postgraduate School (NPS) students (i.e., the blue player), against the AI assistant (i.e., the red player) as shown in Figure 3. Once we receive the NPS Institutional Review Board (IRB) approval, we will organize an event and recruit students who will test the BREM game and play against the AI assistant as shown in Figure 8. There are 100 dimensions of states/actions in total: 42 simulation conditions (environmental factors and logistics), 20 defender's states (the opponent's), 29 attacker's states (the self-player's), and nine self-player actions, i.e., the weapon mix. The nine self-player's actions (Sa) will be compared with the AI assistant's actions powered by the algorithms. Soar-RL is used to learn the value function (win or lose) f as shown in Figure 8. From preliminary observations among researchers playing the BREM game, we collected data from ten games in total. Before the machine learning, human players won eight games and the AI assistant won one game. The nine games were used in the machine learning combining Soar-RL with DE. After the machine learning, the AI assistant won one game that a human player lost. A lesson learned is that human players tend to use default or known actions, while the AI assistant was able to search a wide range of possible actions when performing the machine learning.

Figure 8: Students Test Setup at the Naval Postgraduate School (NPS)

Differential Evolution (DE) Algorithms

We focus on evolutionary algorithms for optimizing the nine self-player's actions. Genetic algorithms are evolutionary algorithms (Goldberg 1989) that keep the metaphor of genetic reproduction with selection, mutation, and crossover; they are used where the objective function's derivatives are not easy to compute and therefore gradient-descent types of algorithms cannot be applied for optimization. Differential evolution (DE) (Rocca, Oliveri, and Massa 2011) is in the same style; however, it deals with the optimization parameters as real numbers, i.e., floats or doubles. For continuous parameter optimization, DE is a derivative-free optimization method inspired by the theory of evolution: the fittest individuals of a population are more likely to survive in the future, and the population improves generation after generation.
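For concreteness, here is a generic differential evolution minimizer in Python following the standard mutation/crossover/selection scheme sketched above; the objective, bounds, and hyperparameters are illustrative assumptions, not the BREM configuration.

```python
import numpy as np

def differential_evolution(obj, bounds, pop_size=30, F=0.8, CR=0.9, iters=200, seed=0):
    """Classic DE/rand/1/bin: mutate with a + F*(b - c), binomially cross over,
    and keep the trial vector only if it improves the objective."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    dim = len(bounds)
    pop = rng.uniform(lo, hi, size=(pop_size, dim))
    fit = np.array([obj(x) for x in pop])
    for _ in range(iters):
        for i in range(pop_size):
            others = [j for j in range(pop_size) if j != i]
            a, b, c = pop[rng.choice(others, 3, replace=False)]
            mutant = np.clip(a + F * (b - c), lo, hi)
            cross = rng.random(dim) < CR
            cross[rng.integers(dim)] = True       # ensure at least one gene crosses
            trial = np.where(cross, mutant, pop[i])
            f_trial = obj(trial)
            if f_trial < fit[i]:                  # selection step
                pop[i], fit[i] = trial, f_trial
    best = int(np.argmin(fit))
    return pop[best], fit[best]

# Toy objective standing in for "minimize the predicted chance of losing"
# over a 9-dimensional action-mix vector (an assumption for illustration).
objective = lambda x: float(np.sum((x - 0.3) ** 2))
bounds = np.array([[0.0, 1.0]] * 9)
x_best, f_best = differential_evolution(objective, bounds)
print(x_best.round(2), f_best)
```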
Conclusion

In this paper, we showed the EVEs structures and the integration of machine learning, causal learning, and optimization techniques used in modeling a multi-segment wargame. We showed how the EVEs structures and related machine intelligence techniques are used to link, modify, and update a large collection of existing and new knowledge and tactics for a multi-segment wargame. We also illustrated the critical elements of causal learning for the EVEs structures and Soar-RL. We tested the methodology with a simulation data set and a real-life game test with NPS human players. The integration of machine and causal learning techniques has the potential for a wide range of tactical decision edge applications.

Acknowledgements

The authors would like to thank NAVAIR China Lake, the Office of Naval Research (ONR), the NPS's Naval Research Program (NRP), and DARPA's Explainable Artificial Intelligence (XAI) program for supporting the research. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

References

Beltagy, I.; Lo, K.; and Cohan, A. 2019. A pretrained language model for scientific text. Retrieved from https://arxiv.org/abs/1903.10676.

Bishop, C. M. 2007. Pattern Recognition and Machine Learning. New York, NY, USA: Springer.

Boolean. 2019. Boolean lattice. Retrieved from https://www.sciencedirect.com/topics/mathematics/boolean-lattice.

Brown, N., and Sandholm, T. 2017. Safe and nested endgame solving for imperfect-information games. Proceedings of the AAAI Workshop on Computer Poker and Imperfect Information Games.

Goldberg, G. 1989. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison Wesley.

Juang, B. H., and Rabiner, L. R. 1990. Hidden Markov models for speech recognition. Technometrics 33(3):251–272.

Laird, J. E.; Derbinsky, N.; and Tinkerhess, M. 2012. Online determination of value-function structure and action-value estimates for reinforcement learning in a cognitive architecture. Advances in Cognitive Systems 2:221–238.

Laird, J. E. 2012. The Soar Cognitive Architecture. Cambridge, MA, USA: MIT Press.

Lake, B. M.; Ullman, T. D.; Tenenbaum, J. B.; and Gershman, S. J. 2016. Building machines that learn and think like people. Retrieved from https://arxiv.org/abs/1604.00289.

Mackenzie, D., and Pearl, J. 2018. The Book of Why: The New Science of Cause and Effect. New York, NY, USA: Penguin.

Pearl, J. 2018. The seven pillars of causal reasoning with reflections on machine learning. Retrieved from https://ftp.cs.ucla.edu/pub/stat_ser/r481.pdf.

Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. Proceedings of the 31st International Conference on Machine Learning (ICML).

Rocca, P.; Oliveri, G.; and Massa, A. 2011. Differential evolution as applied to electromagnetics. IEEE Antennas and Propagation Magazine 53(1):38–49.

sciSpaCy. 2020. scispacy. Retrieved from https://allenai.github.io/scispacy/.

Silver, D.; Schrittwieser, J.; and Simonyan, K. 2017. Mastering the game of Go without human knowledge. Nature 550:354–359.

Sutton, R. S., and Barto, A. G. 2014. Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press.

Vapnik, V. 2000. The Nature of Statistical Learning Theory. New York, NY, USA: Springer.

Wager, S., and Athey, S. 2018. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113(523):1228–1242.

xai. 2018. DARPA XAI. Retrieved from https://www.darpa.mil/program/explainable-artificial-intelligence.

Zhao, Y.; Mooren, E.; and Derbinsky, N. 2017. Reinforcement learning for modeling large-scale cognitive reasoning. Proceedings of the 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, 233–238.