                 Importance Sampling to Identify
       Empirically Valid Policies and their Critical Decisions

                    Song Ju, Shitian Shen, Hamoon Azizsoltani, Tiffany Barnes, Min Chi
                                                           Department of Computer Science
                                                            North Carolina State University
                                                                  Raleigh, NC 27695
                                      {sju2, sshen, hazizso, tmbarnes, mchi}@ncsu.edu

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
In this work, we investigated off-policy policy evaluation (OPE) metrics to evaluate Reinforcement Learning (RL) induced policies and to identify critical decisions in the context of Intelligent Tutoring Systems (ITSs). We explore the use of three common Importance Sampling based OPE metrics in two deployment settings to evaluate four RL-induced policies for a logic ITS. The two deployment settings explore the impact of using original or normalized rewards, and the impact of transforming deterministic policies into stochastic ones. Our results show that Per-Decision Importance Sampling (PDIS), using soft-max and original rewards, is the best metric, and the only metric that reached 100% alignment between the theoretical and the empirical classroom evaluation results. Furthermore, we used PDIS to identify what we call critical decisions in RL-induced policies, where the policies identify large differences between decisions. We found that the students who received more critical decisions significantly outperformed those who received fewer; more importantly, this result only holds for the policy identified as effective by PDIS, not for the ineffective ones.

Keywords
Reinforcement Learning, Off-policy Policy Evaluation, Importance Sampling

1.    INTRODUCTION
Intelligent Tutoring Systems (ITSs) are a type of highly interactive e-learning environment that facilitates learning by providing step-by-step support and contextualized feedback to individual students [12, 30]. These step-by-step behaviors can be viewed as a sequential decision process in which, at each step, the system chooses an action (e.g., give a hint, show an example) from a set of options; pedagogical strategies are the policies used to decide what action to take next in the face of alternatives. Reinforcement Learning (RL) offers one of the most promising approaches to data-driven decision-making applications, and RL algorithms are designed to induce effective policies that determine the best action for an agent to take in any given situation so as to maximize some predefined cumulative reward. A number of researchers have studied the application of existing RL algorithms to improve the effectiveness of ITSs [3, 26, 21, 4, 28, 8, 9, 34]. While promising, such RL work faces at least two major challenges, discussed below.

One challenge is a lack of reliable yet robust evaluation metrics for RL policy evaluation. Generally speaking, there are two major categories of RL: online and offline. In the former, the agent learns while interacting with the environment; in the latter, the agent learns the policy from pre-collected data. Online RL algorithms are generally appropriate for domains where interacting with simulations or the actual environment is computationally cheap and feasible. For domains such as e-learning, on the other hand, building accurate simulations or simulated students is especially challenging because human learning is a rather complex, poorly understood process. Moreover, learning policies while interacting with students may not be feasible and, more importantly, may not be ethical. Therefore, to improve student learning, much prior work has applied offline RL approaches to induce effective pedagogical strategies. This is done by first collecting a training corpus, and the success of offline RL is often heavily dependent on the quality of that corpus. One common convention is to collect an exploratory corpus by training a group of students on an ITS that makes random yet reasonable decisions and then to apply RL to induce pedagogical policies from that corpus. An empirical study is then conducted in which a new group of human subjects interacts with different versions of the system, where the only difference among the versions is the policy employed by the ITS, and the students' performance is statistically compared. Due to cost limitations, typically only the best RL-induced policy is deployed and compared against some baseline policies. On the other hand, we often have a large number of RL algorithms (and associated hyperparameter settings), and it is unclear which will work best in our setting. In these high-stakes situations, one needs confidence in the RL-induced policy before risking deployment. Therefore, we need reliable yet robust evaluation metrics that can evaluate RL-induced policies, without collecting new data, before they are tested in the real world. This type of evaluation is called off-policy evaluation (OPE) because the policy used to collect the training data, referred to as the behavior policy, is different from the RL-induced policy, referred to as the target policy, which is to be evaluated. To find reliable yet robust OPE metrics, we explored three Importance Sampling based off-policy evaluation metrics.
The second RL challenge is a lack of interpretability of the RL-induced policies. Compared with the amount of research done on applying RL to induce policies, relatively little work has been done to analyze, interpret, or explain RL-induced policies. While traditional hypothesis-driven, cause-and-effect approaches offer clear conceptual and causal insights that can be evaluated and interpreted, RL-induced policies are often large, cumbersome, and difficult to understand. The space of possible policies is exponential in the number of domain features, so it is difficult to draw general conclusions from them to advance our understanding of the domain. This raises a major open question: how can we identify the critical system interactive decisions that are linked to student learning? In this work, we tried to identify key decisions by taking advantage of the reliable OPE metrics we discovered and the properties of the policies we induced.

2.    MOTIVATION
Just as assessment sits at the epicenter of educational research [2], policy evaluation is the central concern among the many stakeholders in applying offline RL to ITSs. As educational assessment should reflect and reinforce the educational goals that society deems valuable, our policy evaluation metrics should reflect the effectiveness of the induced policies. While various RL approaches such as policy iteration and policy search have shown great promise, existing RL approaches tend to perform poorly when they are actually implemented and evaluated in the real world.

In a series of prior studies on a logic ITS, RL and Markov Decision Processes (MDPs) were applied to induce four different pedagogical policies, named MDP1-MDP4 respectively, for one type of tutorial decision: whether to provide students with a Worked Example (WE) or to ask them to engage in Problem Solving (PS). In WEs, the tutor presents an expert solution to a problem step by step, while in PSs, students are required to complete the problem with the tutor's support. When inducing each of the four policies, we explored different feature selection methods and used Expected Cumulative Reward (ECR) to evaluate the RL-induced policies. The ECR of a policy is calculated by averaging over the value function of the initial states; generally speaking, the higher the ECR value of a policy, the better the policy is supposed to perform.

Figure 1: Post-test Score vs. ECR, showing no seeming direct relationship.

Figure 1 shows the ECRs (blue dashed line) of the four RL-induced policies, MDP1-MDP4 (x-axis), and the empirical results of student learning performance under the corresponding policies. Here the learning performance is measured by an in-class post-test taken after students were trained on the tutor with the corresponding policies (the means and standard errors of post-test scores are shown with a red solid line). Figure 1 shows that our theoretical evaluation (ECR) does not match the empirical (post-test) evaluation, in that there is no clear relationship between the blue ECR line and the corresponding red post-test line across the four policies. This result suggests that ECR is not a reliable OPE metric for evaluating RL-induced policies in ITSs. Indeed, Mandel et al. [15] pointed out that ECR tends to be biased and statistically inconsistent, and thus it may not be an appropriate OPE metric in high-stakes domains. In recent years, many state-of-the-art OPE metrics have been proposed, and many of them are based on Importance Sampling.

Importance Sampling (IS) is a classic OPE method for evaluating a target policy on existing data obtained from an alternate behavior policy, and thus it can be handily applied to the task of evaluating the effectiveness of an offline RL-induced policy using pre-existing historical training datasets. Many IS-based OPE metrics have been proposed and explored, and they have been shown to perform well in simulation environments such as Grid World or Bandit settings [29, 6]. Among them, three IS-based OPE metrics, the original IS, Weighted IS (WIS), and Per-Decision IS (PDIS), are the most widely used. However, real-world human-agent interactive applications such as ITSs are much more complicated due to 1) individual differences, noise, and randomness during the interaction processes, 2) the large state space that can impact student learning, and 3) long trajectories due to the nature of the learning process.

In this work, we investigated the three IS-based offline OPE metrics on MDP1-MDP4 to determine whether they are indeed effective OPE metrics for evaluating the four RL-induced policies mentioned above. We believe an OPE metric is effective if and only if the theoretical results from the OPE evaluations are completely aligned with the empirical results from the classroom studies. Therefore, we explored different deployment settings for the IS-based metrics from two aspects: one is the transformation function used to convert the RL-induced deterministic policy into the stochastic policy used in IS-based metrics, and the other is the reward function: the original reward function vs. the normalized reward function, where the latter is supposed to reduce variance. Our results showed that the theoretical and empirical evaluation results are aligned to varying degrees for different deployment settings using different IS-based metrics. Only when using the soft-max transformation function and the original reward function can the theoretical results of PDIS reach 100% agreement with the empirical results. Based on the results from the OPE metrics, we further explored using the properties of the RL-induced policies to identify critical decisions, and our results showed that critical decisions can be identified using the theoretically "effective" policy identified by PDIS with the soft-max transformation and the original reward function.

In summary, we make the following contributions:

   • We directly compared three IS-based policy evaluation metrics against the empirical results from real classroom studies across four different RL-induced policies. Our results showed that PDIS is the best one and that its results can align with the empirical results.

   • As far as we know, this is the first study to compare different deployment settings (original/normalized rewards or deterministic/stochastic policy transformation) on IS-based policy evaluation metrics. Our results showed that these settings have a direct impact on the effectiveness of the evaluation metrics. Only PDIS with the soft-max transformation and the original reward function agreed 100% with the empirical results.

   • We investigated using information from the RL-induced policies to identify critical decisions to shed some light on the induced policies. As far as we know, this is the first attempt to differentiate critical decisions from trivial ones.
3.    RELATED WORK
3.1    Empirical Studies Applying RL to ITSs
In recent years, a number of researchers have applied RL to induce effective pedagogical policies for ITSs [3, 5, 13, 23]. Some previous work treated the user-system interactions as fully observable processes by applying Markov Decision Processes (MDPs) [14, 27, 1], while others utilized partially observable MDPs (POMDPs) [24, 32, 33, 15, 4] and, more recently, deep RL frameworks [31, 18]. Most of the previous work, including this work, took the offline RL approach in that it followed three general steps: 1) collect an exploratory corpus by training a relatively large group of real and/or simulated students on an ITS that makes random yet reasonable and rational tutorial decisions; 2) apply RL to induce a pedagogical policy directly from the historical exploratory corpus; and 3) implement the RL-induced pedagogical policy back into the ITS and evaluate its effectiveness using simulations and/or real classroom settings to investigate whether RL fulfills its promise. Oftentimes, when the RL-induced policies are implemented and evaluated in the real world, they tend to perform poorly compared to their theoretical performance.

Iglesias et al. [10] applied Q-learning to generate a policy in an ITS that teaches students database design. Their goal was to provide students with direct navigational support through the system's content. In the training phase, they used simulated students to induce an RL policy and evaluated the induced policy online based upon three self-defined measures. In the test phase, they evaluated the induced policy with real students. The results showed that while students using the induced policies had more effective usage behaviors than their non-policy peers, there was no significant difference in student learning performance. Chi et al. [17] applied an offline RL approach to induce pedagogical policies for improving the effectiveness of an ITS that teaches students college physics. They used ECR to evaluate the induced policy in the training phase. However, when they applied the induced policy to real students, it did not outperform the random policy in the empirical evaluation. Similarly, Shen et al. [26] explored immediate and delayed rewards based on learning gain and implemented an offline MDP framework on a rule-based ITS for deductive logic. They selected the policy with the highest ECR and deployed it in an ITS for real students. The empirical results showed that the RL policies were no more effective than the random baseline policy. In addition, Rowe et al. [22] investigated an MDP framework for tutorial planning in a game-based learning system. They used ECR to theoretically evaluate the induced policies, but in an empirical study with real students they found that students in the induced planner condition had significantly different behavior patterns from the control group, yet no significant difference was found between the two groups on the post-test [23].

In short, prior work on applying offline RL in ITSs primarily explored ECR as the OPE metric, and one common phenomenon is that the theoretical evaluation results do not align with the empirical results from real students.

3.2    OPE Metrics
OPE is used to evaluate the performance of a target policy given historical data generated by an alternative behavior policy. A good OPE metric is especially important for real-world applications where the deployment of a bad or inefficient policy can be costly [19]. ECR is one of the most widely used OPE metrics, and it is designed especially for the MDP framework. Tetreault et al. [11] estimated the reliability of ECRs by repeated sampling to estimate confidence intervals for ECRs. In simulation studies, they showed that the policy induced using the confidence interval of ECR performed more reliably than the baseline policies, but this phenomenon did not hold when evaluating RL-induced policies in ITSs in empirical studies [17].

Importance Sampling (IS) [7] is a widely used OPE metric that considers the mathematical characteristics of the decision-making process and can be applied to any MDP, POMDP, or deep RL framework. Precup [20] proposed four IS-based OPE metrics: IS, weighted importance sampling (WIS), per-decision importance sampling (PDIS), and weighted per-decision importance sampling (WPDIS). They used the IS-based estimators for policy evaluation in Q-learning and then compared the effectiveness of the estimators on a series of 100 randomly constructed MDPs based on the mean squared error (MSE). Their results showed that IS made the Q-learning process converge slowly and caused high variance, WIS performed better than IS, PDIS performed inconsistently, and WPDIS performed the worst. Similarly, Thomas [29] compared the performance of several IS estimators using mean squared error in a grid-world simulation, showing that PDIS outperformed all others.

In summary, previous work has explored the effectiveness of IS and its variants in simulation studies, which motivated the work reported here. Different from previous work, we mainly focus on comparing the theoretical evaluation with the empirical evaluation in order to determine whether IS-based methods are indeed reliable and robust for ITSs.

4.    MARKOV DECISION PROCESS & RL
Some of the prior work on applying RL to induce pedagogical policies used the Markov Decision Process (MDP) framework. An MDP can be seen as a 4-tuple ⟨S, A, T, R⟩, where S denotes the observable state space, which is defined by a set of features that represent the interactive learning environment, and A denotes the space of possible actions for the agent to execute. The reward function R represents the immediate or delayed feedback from the environment with respect to the agent's action(s); r(s, a, s') denotes the expected reward of transitioning from state s to state s' by taking action a. Once ⟨S, A, R⟩ is defined, T represents the transition probabilities, where P(s, a, s') = Pr(s'|s, a) is the probability of transitioning from state s to state s' by taking action a, and it can easily be estimated from the training corpus. The optimal policy π for an MDP can be generated via dynamic programming approaches such as Value Iteration. This algorithm operates by finding the optimal value for each state, V*(s), which is the expected discounted reward that the agent will gain if it starts in s and follows the optimal policy to the goal. Generally speaking, V*(s) can be obtained from the optimal value function for each state-action pair, Q*(s, a), which is defined as the expected discounted reward the agent will gain if it takes action a in state s and follows the optimal policy to the end. The optimal value function Q*(s, a) can be obtained by iteratively updating Q(s, a) via equation 1 until convergence:

    Q(s,a) := \sum_{s'} P(s,a,s') \left[ r(s,a,s') + \gamma \max_{a'} Q(s',a') \right]        (1)

where 0 ≤ γ ≤ 1 is a discount factor. When the process converges, the optimal policy π* can be induced from the optimal Q-value function Q*(s, a), represented as:

    \pi^*(s) = \arg\max_{a} Q^*(s,a)        (2)

where π* is the deterministic policy that maps a given state to an action. In the context of an ITS, this induced policy represents the pedagogical strategy, specifying tutorial actions based on the current state.
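The updates in equations 1 and 2 can be written down in a few lines of code. The sketch below is a minimal illustration under our own assumptions (tabular transition and reward arrays, a toy problem size, and a convergence tolerance we chose); it is not the authors' implementation.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Tabular value iteration over Q(s, a) (equation 1).

    P[s, a, s2] -- estimated transition probability P(s, a, s')
    R[s, a, s2] -- estimated expected reward r(s, a, s')
    Returns the converged Q table and the greedy deterministic policy
    pi*(s) = argmax_a Q*(s, a) (equation 2).
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)                       # V*(s) = max_a Q(s, a)
        Q_new = (P * (R + gamma * V)).sum(axis=2)   # Bellman optimality backup
        if np.max(np.abs(Q_new - Q)) < tol:     # converged
            return Q_new, Q_new.argmax(axis=1)
        Q = Q_new

# Toy usage with 3 states and 2 actions (e.g., 0 = PS, 1 = WE).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))      # random but valid transitions
R = rng.uniform(-1, 1, size=(3, 2, 3))          # random rewards
Q_star, pi_star = value_iteration(P, R)
```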
5.    THREE OPE METRICS & TWO SETTINGS
The following terms will be used throughout this paper.

   • H = s_1 --(a_1, r_1)--> s_2 --(a_2, r_2)--> s_3 --(a_3, r_3)--> ... s_L denotes one student-system interaction trajectory, and H^L denotes a trajectory with length L.

   • G(H^L) = \sum_{t=1}^{L} \gamma^{t-1} r_t is the discounted return of the trajectory H^L, which generally reflects how good the trajectory is supposed to be.

   • D = {H_1, H_2, H_3, ..., H_n} denotes the historical dataset containing n student-system interaction trajectories.

   • π_b denotes the behavior policy carried out for collecting the historical data D.

   • π_e denotes the target policy to be evaluated.

   • ρ(π_e) represents the estimated performance of π_e.

5.1    Three IS-based OPE Metrics
Importance Sampling (IS) is an approximation method that allows the estimation of an expectation under a distribution p from samples generated from a different distribution q. Suppose that we have a sample space x and a random variable f(x), which is a measurable function from the sample space x to another measurable space. We want to estimate the expectation of f(x) over a strictly positive probability density function p(x). Suppose also that we cannot directly sample from the distribution p(x), but we can draw Independent and Identically Distributed (IID) samples from the probability density function q(x) and evaluate f(x) for these samples. The expectation of f(x) over the probability density function p(x) can be calculated as:

    E_p[f(x)] = \int f(x)\, p(x)\, dx        (3)
              = \int f(x) \frac{p(x)}{q(x)} q(x)\, dx        (4)
              = E_q\!\left[ f(x) \frac{p(x)}{q(x)} \right]        (5)

where p is known as the target distribution, q is the sampling distribution, E_p[f(x)] is the expectation of f(x) under p, p(x)/q(x) is the likelihood ratio weight, and E_q[f(x) p(x)/q(x)] is the expectation of f(x) p(x)/q(x) under q. We can thus approximate the expectation of f(x) over the probability density function p(x) using samples drawn from the probability density function q(x). In the context of OPE, the target distribution p is determined by the target policy, and the sampling distribution q is determined by the behavior policy.

Following the general IS technique, we approximated the expected reward of the target policy using the relative probability of the target and behavior policies. Because of the nature of the underlying MDP, samples in RL are sequential. Therefore, we assumed that each trajectory is a sequence of events whose probability density function is determined by its corresponding policy. Assuming independence among trajectories and following the multiplication rule of independent events, the probability of occurrence of a trajectory H^L under a policy π is

    P_\pi(H^L) = P_1(s_1) \prod_{t=1}^{L} \pi(a_t|s_t)\, P_T(s_{t+1}|s_t, a_t)        (6)

where P_π(H^L) is the probability of occurrence of the trajectory H^L following a policy π, P_1(s_1) is the probability of occurrence of the state s_1 at the beginning of the trajectory, and P_T is the state transition probability function.

Similar to the original IS technique, the ratio of the probabilities of each trajectory occurring under the target policy π_e and under the behavior policy π_b is used as the likelihood ratio weight for that trajectory; note that the unknown terms P_1 and P_T in equation 6 cancel in this ratio, leaving only the policy probabilities. Thus, following the importance sampling technique, the importance sampling discounted return is defined as:

    IS(\pi_e|H^L, \pi_b) = \frac{P_{\pi_e}(H^L)}{P_{\pi_b}(H^L)} \cdot G(H^L)        (7)
                         = \frac{\prod_{t=1}^{L} \pi_e(a_t|s_t)}{\prod_{t=1}^{L} \pi_b(a_t|s_t)} \cdot G(H^L)        (8)

Substituting G(H^L) with the discounted return, we have:

    IS(\pi_e|H^L, \pi_b) = \left( \prod_{t=1}^{L} \frac{\pi_e(a_t|s_t)}{\pi_b(a_t|s_t)} \right) \left( \sum_{t=1}^{L} \gamma^{t-1} r_t \right)        (9)

After obtaining the individual IS estimator for each trajectory, we can calculate the expected reward of the dataset D by averaging the individual IS estimators over all trajectories:

    IS(\pi_e|D) = \frac{1}{n_D} \sum_{i=1}^{n_D} \left( \prod_{t=1}^{L_i} \frac{\pi_e(a_t^i|s_t^i)}{\pi_b(a_t^i|s_t^i)} \right) \left( \sum_{t=1}^{L_i} \gamma^{t-1} r_t^i \right)        (10)

where s_t^i, a_t^i, and r_t^i refer to the i-th trajectory at time t, and n_D is the number of trajectories in D.
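Equations 9 and 10 translate directly into code. The sketch below is an illustration under our own assumptions about the data layout (each trajectory is a list of (state, action, reward) tuples, and the policies are functions returning action probabilities); the names and the toy data are ours, not the paper's.

```python
from typing import Callable, List, Tuple

Trajectory = List[Tuple[int, int, float]]        # (state, action, reward) per step
Policy = Callable[[int, int], float]             # pi(a | s) -> probability

def is_return(traj: Trajectory, pi_e: Policy, pi_b: Policy, gamma: float = 0.9) -> float:
    """Importance-sampled discounted return of one trajectory (equation 9)."""
    ratio, g, discount = 1.0, 0.0, 1.0
    for s, a, r in traj:
        ratio *= pi_e(a, s) / pi_b(a, s)         # cumulative likelihood ratio
        g += discount * r                        # discounted return G(H^L)
        discount *= gamma
    return ratio * g

def is_dataset(data: List[Trajectory], pi_e: Policy, pi_b: Policy, gamma: float = 0.9) -> float:
    """Average of the per-trajectory IS estimators over the dataset D (equation 10)."""
    return sum(is_return(t, pi_e, pi_b, gamma) for t in data) / len(data)

# Example: a random behavior policy over two actions (0 = PS, 1 = WE)
# and a hypothetical stochastic target policy.
pi_b = lambda a, s: 0.5
pi_e = lambda a, s: 0.9 if a == (s % 2) else 0.1
demo = [[(0, 0, 1.0), (1, 1, -0.5), (2, 0, 2.0)]]
print(is_dataset(demo, pi_e, pi_b))
```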
Weighted Importance Sampling (WIS) is a variant of the IS estimator; it is biased but consistent, and it has lower variance than IS. It normalizes the IS estimator in order to produce that lower variance. First, it calculates the weight W_D of the dataset as the sum of the likelihood ratios of the trajectories, as shown in equation 11. Then, it normalizes the per-trajectory IS estimator, as shown in equation 12. Finally, the WIS of the dataset is simply the weighted average of the estimated returns of the trajectories in D, as shown in equation 13.

    W_D = \sum_{i=1}^{n_D} \prod_{t=1}^{L_i} \frac{\pi_e(a_t^i|s_t^i)}{\pi_b(a_t^i|s_t^i)}        (11)

    WIS(\pi_e|H^L, \pi_b) = \frac{IS(\pi_e|H^L, \pi_b)}{W_D}        (12)

    WIS(\pi_e|D) = \frac{ \sum_{i=1}^{n_D} \left( \prod_{t=1}^{L_i} \frac{\pi_e(a_t^i|s_t^i)}{\pi_b(a_t^i|s_t^i)} \right) \left( \sum_{t=1}^{L_i} \gamma^{t-1} r_t^i \right) }{ \sum_{i=1}^{n_D} \prod_{t=1}^{L_i} \frac{\pi_e(a_t^i|s_t^i)}{\pi_b(a_t^i|s_t^i)} }        (13)

Per-Decision Importance Sampling (PDIS) is also a variant of IS. Like IS, it is unbiased and consistent. IS has very high variance because, for each reward r_t, it uses the likelihood ratio of the entire trajectory. However, the reward at step t should only depend on the steps up to t. The variance can therefore be reduced by weighting the reward r_t with the likelihood ratio of the trajectory only up to step t; that is, the importance weight for a reward at step t is \prod_{j=1}^{t} \frac{\pi_e(a_j|s_j)}{\pi_b(a_j|s_j)}. The individual PDIS estimator for a trajectory is given in equation 14, and the PDIS for the whole historical dataset D is given in equation 15.

    PDIS(\pi_e|H^L, \pi_b) = \sum_{t=1}^{L} \gamma^{t-1} \left( \prod_{j=1}^{t} \frac{\pi_e(a_j|s_j)}{\pi_b(a_j|s_j)} \right) r_t        (14)

    PDIS(\pi_e|D) = \frac{1}{n_D} \sum_{i=1}^{n_D} \sum_{t=1}^{L_i} \gamma^{t-1} \left( \prod_{j=1}^{t} \frac{\pi_e(a_j^i|s_j^i)}{\pi_b(a_j^i|s_j^i)} \right) r_t^i        (15)
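Reusing the trajectory representation from the IS sketch above, WIS (equations 11-13) and PDIS (equations 14-15) only change how the likelihood ratios are combined. The following is again a sketch under the same assumed data layout, not the authors' code.

```python
def wis_dataset(data, pi_e, pi_b, gamma: float = 0.9) -> float:
    """Weighted IS (equation 13): IS returns normalized by the sum of trajectory ratios."""
    ratios, returns = [], []
    for traj in data:
        ratio, g, discount = 1.0, 0.0, 1.0
        for s, a, r in traj:
            ratio *= pi_e(a, s) / pi_b(a, s)
            g += discount * r
            discount *= gamma
        ratios.append(ratio)
        returns.append(g)
    w_d = sum(ratios)                            # equation 11
    return sum(w * g for w, g in zip(ratios, returns)) / w_d

def pdis_return(traj, pi_e, pi_b, gamma: float = 0.9) -> float:
    """Per-Decision IS for one trajectory (equation 14): each reward r_t is weighted
    only by the likelihood ratio of the steps up to t."""
    total, ratio, discount = 0.0, 1.0, 1.0
    for s, a, r in traj:
        ratio *= pi_e(a, s) / pi_b(a, s)         # ratio over steps 1..t
        total += discount * ratio * r
        discount *= gamma
    return total

def pdis_dataset(data, pi_e, pi_b, gamma: float = 0.9) -> float:
    """PDIS averaged over the dataset D (equation 15)."""
    return sum(pdis_return(t, pi_e, pi_b, gamma) for t in data) / len(data)
```

In the alignment test described later in Section 7.1, per-trajectory estimates such as these serve as each student's theoretical reward.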
5.2    Two Different Settings
To evaluate the three IS-based metrics, we explored different deployment settings from two aspects: one is the transformation function used to convert the RL-induced deterministic policy into the stochastic policy required by the IS-based metrics, and the other is the reward function: the original reward function vs. the normalized reward.

5.2.1    Three Transformation Functions:
As described above, IS-based metrics require the policy to be stochastic, but our MDP-induced policies are deterministic, i.e., given the current state s, the agent should take the deterministic optimal action a* following the optimal policy π*. To transform the deterministic policy into a stochastic policy, we explore three types of transformation functions: Hard-code, Q-proportion, and Soft-max (a short code sketch of all three follows the list). The basic idea behind them is that, for any given state s, the stochastic probability assigned to an action a should reflect its value Q(s, a).

   1. Hard-code Transformation

        \pi(a|s) = \begin{cases} 1 - \varepsilon & \text{if } a \text{ is the optimal action} \\ \varepsilon & \text{otherwise} \end{cases}        (16)

      In eqn 16, ε is a fixed small probability, e.g., ε = 0.001, which is assigned to the action with the smaller Q-value, while 1 − ε is assigned to the action with the largest Q-value.

   2. Q-proportion Transformation

        \pi(a|s) = \frac{Q(s,a)}{\sum_{a' \in A} Q(s,a')}        (17)

      As shown in eqn 17, the probability of taking action a in state s is the proportion of a's Q-value among all possible actions a' in the state s. Thus, the action with the highest Q-value is guaranteed to have the highest probability. In practice, because some of the Q-values are smaller than 0, we add a constant to the Q-values in the same state so that Q(s, a) ≥ 1.

   3. Soft-max Transformation

        \pi(a|s) = \frac{e^{\theta \cdot Q(s,a)}}{\sum_{a' \in A} e^{\theta \cdot Q(s,a')}}        (18)

      Soft-max is a classical function used to calculate the probability distribution of one event over n possible events. In our application, given a state s, the soft-max function calculates the probability of action a over all possible actions using equation 18. The main advantages of soft-max are that the output probabilities lie between 0 and 1, they sum to 1, and it can handle negative Q-values. θ is a weight parameter applied to the Q-value.
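Below is a minimal sketch of the three transformations, assuming a NumPy vector q that holds the Q-values of all actions in one state; the shift used for Q-proportion and the default θ are illustrative choices, not values reported in the paper.

```python
import numpy as np

def hard_code(q: np.ndarray, eps: float = 0.001) -> np.ndarray:
    """Equation 16: 1 - eps for the arg-max action, eps otherwise.
    Assumes the two-action (PS vs. WE) setting, so the probabilities sum to 1."""
    p = np.full(len(q), eps)
    p[np.argmax(q)] = 1.0 - eps
    return p

def q_proportion(q: np.ndarray) -> np.ndarray:
    """Equation 17, after shifting so that every Q-value is at least 1."""
    shifted = q - q.min() + 1.0
    return shifted / shifted.sum()

def soft_max(q: np.ndarray, theta: float = 1.0) -> np.ndarray:
    """Equation 18, with the usual max-subtraction for numerical stability."""
    z = theta * q
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

q = np.array([-3.2, 1.7])        # Q(s, PS), Q(s, WE) for some state s
print(hard_code(q), q_proportion(q), soft_max(q))
```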
5.2.2    Two Types of Reward Functions:
The effectiveness of RL-induced policies is very sensitive to the reward function. In our application, the range of the reward function is very large, [−200, 200], which may cause large variance for IS, especially when a trajectory is long. One effective way to reduce this variance is to use normalized rewards. Therefore, both original rewards and normalized rewards are considered. More specifically, the normalized reward z is defined as z = (x − min(x)) / (max(x) − min(x)), where min(x) and max(x) are the minimum and maximum values of the original reward function x, so that z ∈ [0, 1].
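The min-max normalization above is a one-liner; the snippet below simply makes the scaling explicit for a vector of rewards (the variable names are ours).

```python
import numpy as np

def normalize_rewards(r: np.ndarray) -> np.ndarray:
    """Min-max normalization of the original rewards into [0, 1] (Section 5.2.2)."""
    return (r - r.min()) / (r.max() - r.min())

print(normalize_rewards(np.array([-200.0, -50.0, 0.0, 200.0])))   # -> [0. 0.375 0.5 1.]
```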


6.    ITS & FOUR MDP POLICIES
6.1    Our Logic Tutor
Figure 2 shows the interface of the logic tutor, a data-driven ITS used in the undergraduate Discrete Mathematics (DM) course at a large university. It [16] provides students with a graph-based representation of logic proofs, which allows students to solve problems by applying rules to derive new statements, represented as nodes. The system automatically verifies proofs and provides immediate feedback on logical errors. Every problem in the tutor can be presented in the form of either WE or PS. By focusing on the pedagogical decisions of WE vs. PS, the tutor allows us to strictly control the content to be equivalent for all students.

Figure 2: Tutor Problem Solving Interface

6.2    Four MDP Policies and Empirical Study
Four MDP policies, MDP1 - MDP4, were induced from an exploratory pre-collected dataset following different feature selection procedures. Detailed descriptions of our policy induction process are provided in [26, 25] and must be omitted here because of page limits. The effectiveness of each MDP policy was empirically evaluated against the Random policy in strictly controlled studies during three consecutive semesters. In each strictly controlled study, students were randomly assigned to one of two conditions: an MDP policy, or a Random baseline policy that makes random yet reasonable decisions, since both PS and WE are always considered to be reasonable educational interventions in our learning context. Moreover, all students went through an identical procedure on the tutor, and the only difference was the pedagogical policy employed. After completing the tutor, students took a post-test consisting of two proof questions on a midterm exam; each was worth 16 points and was graded by one TA using a rubric. Overall, no significant difference was found between our four RL-induced policies and the Random policy on students' post-test scores across all studies.

There are many possible explanations for these results. First, while the Random policy is generally a weak policy for many RL tasks, in our situation both of our action choices, WE and PS, are considered reasonable and, more importantly, at each decision point there is a 50% chance that the random policy would carry out the better of the two. Second, non-significant statistical results do not mean non-existence: a small sample size may play an important role in limiting the significance of statistical comparisons. A post hoc power analysis revealed that, in order to detect a significant difference at the 5% level with 80% power, MDP1 vs. Random needed a total sample of 1382 students; MDP2 vs. Random needed 1700 students; MDP3 vs. Random needed 212 students; and MDP4 vs. Random needed 394 students. However, in each empirical study our student sample sizes were much smaller: 59, 50, 57, and 84, respectively. Last but not least, it turned out that all four RL-induced policies were only partially carried out. All of the training problems in our tutor are organized into six strictly ordered levels, and in each level students are required to complete 3-4 problems. In level 1, all participants receive the same set of PS problems, and in levels 2-6 our tutor has two hard-coded action-based constraints that are required by the class instructors: students must complete at least one PS and one WE, and the last problem on each level must be PS. Therefore, over the entire training process, only ∼50% of the actions are actually decided by the pedagogical policy, and the rest are decided by hard-coded system rules.

In short, despite the fact that ECR indicated that our four RL-induced policies should be more effective than the Random policy, the empirical results showed otherwise, for various potential reasons. We therefore explored other OPE metrics to evaluate MDP1 - MDP4.

7.    EXPERIMENT SETUP
Our dataset contains the interaction logs of a total of 450 students involved in the strictly controlled studies mentioned above. The goals of our experiment were to: 1) investigate whether any of the three IS-based metrics can align the theoretical and empirical results for our four MDP policies, and 2) identify critical decisions that are linked to student learning.

7.1    Three IS-based Metrics Evaluation
We now describe how we determine whether or not the three IS-based metrics can align the theoretical results with the empirical results for MDP1-MDP4.

For a given RL-induced policy π, we first split all students into High vs. Low groups based on the actual carry-out percentages according to π. Since there are only two tutorial choices, WE vs. PS, there is some probability that each actual decision the tutor made agrees with the decision prescribed by π; for the Random policy, for example, that probability is 50-50. In other words, we can measure each trajectory by the percentage of the tutorial decisions that agree with π. If π is indeed effective, we would expect that the more the tutorial decisions in a trajectory agree with π, the better the corresponding student's performance would be. We thus treat all 450 students' interaction log data equally, regardless of their originally assigned conditions, and for each student-ITS interaction log we calculate a carry-out percentage for π using the formula: percentage = N_agree / N_total, where N_total is the total number of tutorial decisions in the trajectory and N_agree is the number of decisions that agree with π. Then, students are divided into High Carry-out (High) vs. Low Carry-out (Low) groups by a median split on the carry-out percentages.

Then we empirically evaluate the effectiveness of π by checking whether there is a significant difference between the High and Low groups on their post-test scores. Similarly, we compare the High and Low groups' theoretical results. To do so, for each student-ITS interaction trajectory, we estimate its reward by exploring the different combinations of the three IS-based OPE metrics with the three policy transformation functions and the two reward functions. More specifically, we treat all the trajectories as generated by the Random policy, regardless of their original behavior policy. So for each student, we have a total of 18 theoretical evaluations for a given π. If π is indeed effective and our OPE metric is reliable, we would expect that the more the tutorial decisions in a trajectory agree with π, the higher the corresponding theoretical rewards would be, and vice versa.

Finally, for each of the 18 OPE metric settings, we conduct an alignment test between the theoretical and empirical results for π. This is done by comparing the empirically evaluated results and the theoretical rewards under the corresponding OPE metric. More specifically, they are considered to be aligned when:

   1. both the empirical and the theoretical results were not significant, that is, p ≥ 0.05; or

   2. both results were significant and the direction of the comparison was the same, that is, p < 0.05 and the signs of the t values are both positive or both negative.

All the remaining cases are considered as not aligned. Thus, for each of the 18 OPE metric settings, we can test whether its theoretical results align with the empirical results for π. Since we have four RL-induced policies, MDP1-MDP4, a robust and reliable OPE metric should align the two types of evaluation results across all four policies.
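The alignment test can be summarized in a short script. The sketch below is our own illustration of the procedure described above; the data structures (per-student decision lists, post-test scores, per-student theoretical rewards) and the use of SciPy's independent-samples t-test are assumptions, not the authors' code.

```python
import numpy as np
from scipy import stats

def carry_out_percentage(decisions, policy):
    """Fraction of tutorial decisions that agree with policy pi.

    decisions: list of (state, action) pairs actually carried out for one student.
    policy:    dict mapping state -> action prescribed by pi.
    """
    agree = sum(1 for s, a in decisions if policy.get(s) == a)
    return agree / len(decisions)

def median_split(values):
    """Boolean mask: True for the High group (above the median)."""
    values = np.asarray(values, dtype=float)
    return values > np.median(values)

def aligned(post_test, theoretical, high_mask, alpha=0.05):
    """Alignment test: both comparisons non-significant, or both significant
    with t statistics of the same sign."""
    t_emp, p_emp = stats.ttest_ind(post_test[high_mask], post_test[~high_mask])
    t_the, p_the = stats.ttest_ind(theoretical[high_mask], theoretical[~high_mask])
    if p_emp >= alpha and p_the >= alpha:
        return True
    return p_emp < alpha and p_the < alpha and np.sign(t_emp) == np.sign(t_the)

# Hypothetical usage for one policy pi and one OPE setting:
# high = median_split([carry_out_percentage(d, pi) for d in all_decisions])
# print(aligned(np.array(post_scores), np.array(pdis_scores), high))
```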

7.2    Critical Decision Identification
Next, we will explain how the critical interactive decisions are
identified and empirically examined. Note that, there may be               Figure 3: Q-value difference in MDP1-MDP4
critical decisions over which the RL policies have no influence.
Hence, we focus only on interactive decisions that are critical.
For many RL algorithms, the fundamental approach to inducing an optimal policy can be seen as recursively estimating the Q-values Q(s, a) for every state-action pair until the Bellman equation converges. More specifically, Q(s, a) is defined as the expected discounted reward the agent will gain if it takes action a in state s and follows the corresponding optimal policy to the end. Thus, for a given state s, a large difference between the values of Q(s, "PS") and Q(s, "WE") indicates that it is more important for the ITS to follow the optimal decision in that state. We therefore used the absolute difference between the Q-values of each state s to identify critical decisions. Our procedure can be divided into two steps.

Step 1: Identify Critical Decisions. Given an MDP policy, for each state we calculated the absolute Q-value difference between the two actions (PS vs. WE) associated with it. Figure 3 shows the Q-value difference (y-axis) for each state (x-axis), sorted in descending order, for the MDP1-MDP4 policies respectively. It clearly shows that, across the four MDP policies, the Q-value differences can vary greatly from state to state. We used the median Q-value difference to split the states into critical vs. non-critical states: the states with the larger Q-value differences were critical and the rest were non-critical. For a given RL-induced policy, the critical decisions are defined as those decisions made in critical states where the actual carried-out tutorial action agreed with the corresponding policy.

Step 2: Evaluate Critical Decisions. For each of the four RL-induced policies, we counted the number of critical decisions that each student encountered during his/her training. Then, for each policy, students were split into More vs. Less groups by a median split on the number of critical decisions experienced during training. A t-test was conducted on the post-test scores of the More vs. Less groups to investigate whether the students with More critical decisions would indeed perform better than those with Less.
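To make the procedure concrete, the sketch below mirrors Steps 1 and 2 directly. It is an illustrative sketch, not the system's actual implementation: the data layout (a Q-table mapping each state to its two Q-values, a per-student log of (state, action) pairs, and a `post_test` dictionary of scores) and the handling of ties at the median are our own assumptions.

```python
import numpy as np
from scipy import stats

def critical_states(q_table):
    """Step 1: median split on |Q(s, 'PS') - Q(s, 'WE')| across states."""
    diffs = {s: abs(q["PS"] - q["WE"]) for s, q in q_table.items()}
    cutoff = np.median(list(diffs.values()))
    return {s for s, d in diffs.items() if d > cutoff}

def count_critical_decisions(decision_log, policy, crit_states):
    """Decisions made in critical states where the carried-out action
    agreed with the policy's prescribed action."""
    return sum(1 for s, a in decision_log
               if s in crit_states and a == policy[s])

def evaluate_critical_decisions(q_table, policy, logs, post_test):
    """Step 2: median split students into More vs. Less critical
    decisions and t-test their post-test scores."""
    crit = critical_states(q_table)
    counts = {sid: count_critical_decisions(log, policy, crit)
              for sid, log in logs.items()}
    cutoff = np.median(list(counts.values()))
    more = [post_test[sid] for sid, c in counts.items() if c > cutoff]
    less = [post_test[sid] for sid, c in counts.items() if c <= cutoff]
    return stats.ttest_ind(more, less)
```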
                                                                      For the last two metrics, IS and WIS, the results are exactly
                                                                      the same because WIS is much like multiplying IS by a con-
                                                                      stant and this kind of re-scale won’t change the result of the
                                                                      t-tests. The metric with the best performance is PDIS with
                                                                      Soft-max policy transformation and the original reward, whose
8. RESULTS                                                            performance is 100%. This means that all the t-test results on
8.1 Three IS-based Metrics Evaluation                                 the PDIS predictions aligned with those on the empirical results
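For reference, the per-student estimates behind this comparison can be computed with the standard trajectory-wise IS, WIS, and PDIS estimators, sketched below. This is a generic illustration rather than the paper's implementation (the input layout and the assumption of one trajectory per student are ours, and the exact estimator definitions appear earlier in the paper), but it also makes visible why IS and WIS must agree here: the per-student WIS values are just the IS values divided by a dataset-wide constant, which leaves the t statistic unchanged.

```python
import numpy as np

def ope_estimates(trajectories, gamma=1.0):
    """One trajectory per student; each step is a triple
    (pi_e, pi_b, r): target-policy prob., behavior-policy prob.,
    and the immediate reward of the action actually taken."""
    is_vals, pdis_vals, ratios = [], [], []
    for traj in trajectories:
        rho, ret, pdis = 1.0, 0.0, 0.0
        for t, (pi_e, pi_b, r) in enumerate(traj):
            rho *= pi_e / pi_b               # cumulative importance ratio
            ret += (gamma ** t) * r
            pdis += (gamma ** t) * rho * r   # per-decision weighting
        is_vals.append(rho * ret)            # ordinary importance sampling
        pdis_vals.append(pdis)
        ratios.append(rho)
    is_vals = np.array(is_vals)
    # WIS rescales every IS estimate by the same constant (the mean ratio),
    # so a t-test on WIS values is identical to a t-test on IS values.
    wis_vals = is_vals / np.mean(ratios)
    return is_vals, wis_vals, np.array(pdis_vals)
```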
Table 1: Empirical Post-test Evaluation Results for High and Low Carry-out Groups

Policy   High (mean (sd))   Low (mean (sd))    T-test Result
MDP1     79.06 (24.64)      83.13 (23.69)      t(448) = −1.78, p = .076
MDP2     81.46 (24.90)      80.44 (23.66)      t(448) = .44, p = .658
MDP3     82.55 (23.48)      79.30 (25.00)      t(448) = 1.24, p = .156
MDP4     83.43 (22.83)      78.43 (25.44)      t(448) = 2.19, p = .029**

** denotes significance at p < 0.05.
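For concreteness, the High vs. Low comparison in Table 1 can be sketched as follows. The carry-out measure used here (the fraction of a student's tutorial decisions in which the executed action matched the policy's action) and the median split are our reading of the procedure described earlier in the paper, with hypothetical variable names.

```python
import numpy as np
from scipy import stats

def empirical_evaluation(policy, logs, post_test):
    """Split students into High vs. Low carry-out groups for a policy
    and compare their post-test scores with a t-test."""
    carry_out = {sid: np.mean([a == policy[s] for s, a in log])
                 for sid, log in logs.items()}
    cutoff = np.median(list(carry_out.values()))
    high = [post_test[sid] for sid, c in carry_out.items() if c > cutoff]
    low = [post_test[sid] for sid, c in carry_out.items() if c <= cutoff]
    return stats.ttest_ind(high, low)
```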



Table 2: Policy Transformation and Normalization Impacts on IS Metric Alignment to Post-test Outcomes

Policy Transformation   Rewards      IS     WIS    PDIS
Soft-max                Original     50%    50%    100%*
Soft-max                Normalized   50%    50%    50%
Q-proportion            Original     25%    25%    75%
Q-proportion            Normalized   25%    25%    25%
Hard-code               Original     75%    75%    75%
Hard-code               Normalized   75%    75%    75%
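One plausible reading of the three policy transformations in Table 2 is sketched below. The exact functional forms and constants (the Soft-max temperature, the Hard-code probability, and the shift used here to keep Q-proportion non-negative) are defined earlier in the paper, so the versions below are illustrative assumptions only.

```python
import numpy as np

def soft_max(q_values, temperature=1.0):
    """pi(a|s) proportional to exp(Q(s,a) / temperature)."""
    z = np.exp((q_values - np.max(q_values)) / temperature)
    return z / z.sum()

def q_proportion(q_values):
    """pi(a|s) proportional to Q(s,a), shifted to be non-negative."""
    shifted = q_values - np.min(q_values)
    if shifted.sum() == 0:                 # equal Q-values: uniform policy
        return np.ones_like(q_values, dtype=float) / len(q_values)
    return shifted / shifted.sum()

def hard_code(q_values, p_best=0.9):
    """Fixed probability on the greedy action, rest spread uniformly."""
    probs = np.full(len(q_values), (1 - p_best) / (len(q_values) - 1))
    probs[np.argmax(q_values)] = p_best
    return probs
```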


Table 3: Detailed IS-based Metrics Evaluation Results Using Original Reward

Transformation   Policy   Empirical Result             IS & WIS Result              PDIS Result
Soft-max         MDP1     t(448) = −1.78, p = .076     t(448) = .89, p = .376       t(448) = .89, p = .375
Soft-max         MDP2     t(448) = .44, p = .658       t(448) = 2.60, p = .010**    t(448) = 1.84, p = .067
Soft-max         MDP3     t(448) = 1.42, p = .156      t(448) = 2.50, p = .013**    t(448) = .53, p = .594
Soft-max         MDP4     t(448) = 2.19, p = .029**    t(448) = 2.18, p = .030**    t(448) = 3.23, p = .001**
Q-proportion     MDP1     t(448) = −1.78, p = .076     t(448) = 3.78, p < .001**    t(448) = 1.77, p = .077
Q-proportion     MDP2     t(448) = .44, p = .658       t(448) = 3.26, p = .001**    t(448) = 2.13, p = .034**
Q-proportion     MDP3     t(448) = 1.42, p = .156      t(448) = 2.71, p = .007**    t(448) = .31, p = .760
Q-proportion     MDP4     t(448) = 2.19, p = .029**    t(448) = 3.69, p < .001**    t(448) = 2.32, p = .021**
Hard-code        MDP1     t(448) = −1.78, p = .076     t(448) = 1.19, p = .233      t(448) = .96, p = .337
Hard-code        MDP2     t(448) = .44, p = .658       t(448) = .71, p = .479       t(448) = .84, p = .404
Hard-code        MDP3     t(448) = 1.42, p = .156      t(448) = 2.34, p = .020**    t(448) = 2.06, p = .040**
Hard-code        MDP4     t(448) = 2.19, p = .029**    t(448) = 2.83, p = .005**    t(448) = 3.73, p < .001**

Electric-blue cells denote that the theoretical t-test results align with the empirical t-test results (Column 3); grey cells denote misaligned t-test results.



Table 3 shows the detailed evaluation results for the metrics using the original reward, providing the t-test results for comparing post-test scores between the High and Low carry-out groups. The first column in Table 3 shows the type of policy transformation applied. The second column shows the four MDP policies considered when splitting the dataset into High and Low carry-out groups. The third column shows the t-test results of the empirical evaluation of High vs. Low carry-out, which served as the ground truth. The fourth and fifth columns show the t-test results for the predictions of the three IS metrics: IS, WIS, and PDIS, respectively. In Table 3, electric-blue cells denote that the theoretical t-test results align with the empirical t-test results (Column 3), while grey cells denote mismatched t-test results. From this table, we can see that only PDIS with the Soft-max transformation and the original reward results in all four t-tests aligning with the corresponding empirical results. IS and WIS are more likely to predict a significant difference between the High and Low groups. Meanwhile, the Q-proportion transformation tends to make the metrics predict more significant differences, while Hard-code tends to predict fewer.

In summary, when comparing carry-out groups under each RL-induced policy, our results showed that for the MDP4 policy the students in the High carry-out group significantly outperformed the students in the Low group, while no significant difference was found for the other three policies. This suggests that MDP4 is an effective policy in that the more it is carried out, the better it performs. However, the partial carry-out situation reduced the power of the MDP4 policy so that it did not significantly outperform the baseline random policy. When comparing the empirical evaluation results with the theoretical evaluation results, PDIS is the best of the three IS-based metrics, reaching 100% agreement.
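The alignment percentages in Table 2 follow directly from the t-tests in Table 3: for each of the four policies, a metric is counted as aligned when its t-test reaches the same significance conclusion (at p < 0.05) as the empirical t-test. The check below reproduces the 100% cell for PDIS with Soft-max and the original reward using the p-values from Table 3; treating alignment as a significance-only match is our reconstruction of the scoring rule.

```python
def alignment(empirical_p, metric_p, alpha=0.05):
    """Percent of policies where the metric's significance conclusion
    matches the empirical (ground-truth) one."""
    matches = [(empirical_p[m] < alpha) == (metric_p[m] < alpha)
               for m in empirical_p]
    return 100.0 * sum(matches) / len(matches)

# p-values from Table 3 (Soft-max transformation, original reward):
empirical = {"MDP1": .076, "MDP2": .658, "MDP3": .156, "MDP4": .029}
pdis_soft = {"MDP1": .375, "MDP2": .067, "MDP3": .594, "MDP4": .001}
print(alignment(empirical, pdis_soft))   # 100.0, matching Table 2
```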
Table 4: Critical Decision Evaluation Results

Policy   More (mean (sd))   Less (mean (sd))   T-test Result
MDP1     78.49 (23.08)      83.01 (25.44)      t(448) = −1.97, p = .049
MDP2     83.22 (23.53)      78.86 (24.79)      t(448) = 1.91, p = .057
MDP3     79.45 (24.59)      82.08 (24.00)      t(448) = −1.14, p = .257
MDP4     83.54 (22.89)      78.74 (25.21)      t(448) = 2.10, p = .036**

** denotes significance at p < 0.05.


Our results suggested that proper deployment settings have an impact on the performance of IS-based metrics. When transforming a deterministic policy into a stochastic one, Soft-max is the best transformation, Q-proportion is the worst, and Hard-code is stable. The comparison between the original and normalized rewards indicates that the original reward better reflects the empirical results, despite having larger variance.

8.2 Critical Decision Identification
Recall that each policy determines its own critical decisions: states with larger differences between the Q-values of the possible actions are considered critical, and for each policy we split the students according to whether they received More or Less decisions aligned with the critical decisions. Table 4 shows the t-test results comparing the post-test scores of the More vs. Less critical decisions groups. The first column indicates the MDP policy used to identify the critical decisions. The second and third columns show the average post-test scores, reported as mean (sd), of the students in the More and Less groups. The fourth column shows the t-test result comparing the post-test scores of the two groups. The MDP1 row shows a significant difference between the More and Less groups for the MDP1 policy, t(448) = −1.97, p = .049; however, the students in the More group performed worse than those in the Less group. The MDP2 and MDP3 rows show no significant difference between the two groups in terms of post-test scores. Finally, the MDP4 policy shows a significant difference between the two groups, t(448) = 2.10, p = .036, meaning that students with More critical decisions performed significantly better than students with Less.

For the MDP4 policy, the identified critical decisions comprised 25% of all decisions. This shows that, although critical decisions are a small proportion of all decisions, they can significantly impact the outcome. The results also show that the Q-values of the MDP4 policy can be used to identify critical decisions that align with the empirical results, whereas those of the other three policies cannot. Based on the results in Table 1, MDP4 was identified as the only effective policy, since in its empirical post-test results the students in the High carry-out group performed significantly better than those in the Low carry-out group. The critical decisions results suggest that MDP4 is also the only policy for which larger differences in Q-values had larger impacts on post-test results. Taken together, these results suggest that only the Q-values of effective policies identify decisions that impact actual post-test performance. This further inspires us to investigate whether we could verify the effectiveness of a policy in reverse: given a policy, if the decisions with larger Q-value differences are significantly linked to student performance, then the policy may be more likely to be effective.

9. CONCLUSION AND FUTURE WORK
In this work, we explored three IS-based OPE metrics with two deployment settings in a real-world application. By comparing the effectiveness of four RL-induced policies empirically and theoretically, our results showed that PDIS is the best metric for interactive e-learning systems and that appropriate deployment settings (i.e., where policy decisions are carried out) are required to achieve reliable, robust evaluations. We also proposed a method to identify critical decisions from the Q-value differences in a policy. To verify the method, we investigated the relationship between the number of identified critical decisions and student post-test scores. The results revealed that the identified critical decisions are significantly linked to student learning and, further, that critical decisions can be identified by an effective policy but not by ineffective ones. In the future, we will apply the PDIS metric with the Soft-max transformation and original rewards to help us induce better RL policies that further improve students' learning in the ITS. When inducing policies, we will also consider constraints to avoid the partial carry-out situation that limits the impact a policy can have on outcomes. Furthermore, with the identification of critical decisions, we can reduce the size of the resulting policies still further by focusing only on the most important decisions rather than meaningless ones.