Importance Sampling to Identify Empirically Valid Policies and their Critical Decisions

Song Ju, Shitian Shen, Hamoon Azizsoltani, Tiffany Barnes, Min Chi
Department of Computer Science, North Carolina State University, Raleigh, NC 27695
{sju2, sshen, hazizso, tmbarnes, mchi}@ncsu.edu

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
In this work, we investigated off-policy policy evaluation (OPE) metrics to evaluate Reinforcement Learning (RL) induced policies and to identify critical decisions in the context of Intelligent Tutoring Systems (ITSs). We explore the use of three common Importance Sampling based OPE metrics in two deployment settings to evaluate four RL-induced policies for a logic ITS. The two deployment settings explore the impact of using original or normalized rewards and the impact of transforming deterministic policies into stochastic ones. Our results show that Per-Decision Importance Sampling (PDIS), using the soft-max transformation and original rewards, is the best metric and the only one that reached 100% alignment between the theoretical and the empirical classroom evaluation results. Furthermore, we used PDIS to identify what we call critical decisions in RL-induced policies, i.e., decisions for which the policies identify large differences between the alternatives. We found that the students who received more critical decisions significantly outperformed those who received fewer; more importantly, this result holds only for the policy that PDIS identified as effective, not for the ineffective ones.

Keywords
Reinforcement Learning, Off-policy Policy Evaluation, Importance Sampling

1. INTRODUCTION
Intelligent Tutoring Systems (ITSs) are a type of highly interactive e-learning environment that facilitates learning by providing step-by-step support and contextualized feedback to individual students [12, 30]. These step-by-step behaviors can be viewed as a sequential decision process in which, at each step, the system chooses an action (e.g., give a hint, show an example) from a set of options; pedagogical strategies are the policies used to decide what action to take next in the face of alternatives. Reinforcement Learning (RL) offers one of the most promising approaches to data-driven decision-making applications: RL algorithms are designed to induce effective policies that determine the best action for an agent to take in any given situation so as to maximize some predefined cumulative reward. A number of researchers have studied the application of existing RL algorithms to improve the effectiveness of ITSs [3, 26, 21, 4, 28, 8, 9, 34]. While promising, such RL work faces at least the two major challenges discussed below.

One challenge is a lack of reliable yet robust metrics for RL policy evaluation. Generally speaking, there are two major categories of RL: online and offline. In the former, the agent learns while interacting with the environment; in the latter, the agent learns the policy from pre-collected data. Online RL algorithms are generally appropriate for domains where interacting with simulations or actual environments is computationally cheap and feasible. On the other hand, for domains such as e-learning, building accurate simulations or simulated students is especially challenging because human learning is a complex, poorly understood process. Moreover, learning policies while interacting with students may not be feasible and, more importantly, may not be ethical. Therefore, to improve student learning, much prior work applied offline RL approaches to induce effective pedagogical strategies. This is done by first collecting a training corpus, and the success of offline RL is often heavily dependent on the quality of that corpus. One common convention is to collect an exploratory corpus by training a group of students on an ITS that makes random yet reasonable decisions and then to apply RL to induce pedagogical policies from that corpus. An empirical study is then conducted with a new group of human subjects interacting with different versions of the system. The only difference among the system versions is the policy employed by the ITS, and the students' performance is then statistically compared. Due to cost limitations, typically only the best RL-induced policy is deployed and compared against some baseline policies. On the other hand, we often have a large number of RL algorithms (and associated hyperparameter settings), and it is unclear which will work best in our setting. In these high-stakes situations, one needs confidence in an RL-induced policy before risking deployment. Therefore, we need reliable yet robust evaluation metrics that can evaluate RL-induced policies without collecting new data, before the policies are tested in the real world. This type of evaluation is called off-policy evaluation (OPE) because the policy used to collect the training data, referred to as the behavior policy, is different from the RL-induced policy to be evaluated, referred to as the target policy. To find reliable yet robust OPE metrics, we explored three Importance Sampling based off-policy evaluation metrics.

The second RL challenge is a lack of interpretability of the RL-induced policies. Compared with the amount of research done on applying RL to induce policies, relatively little work has been done to analyze, interpret, or explain RL-induced policies. While traditional hypothesis-driven, cause-and-effect approaches offer clear conceptual and causal insights that can be evaluated and interpreted, RL-induced policies are often large, cumbersome, and difficult to understand. The space of possible policies is exponential in the number of domain features, and it is therefore difficult to draw general conclusions from them to advance our understanding of the domain. This raises a major open question: how can we identify the critical system interactive decisions that are linked to student learning? In this work, we tried to identify key decisions by taking advantage of the reliable OPE metrics we discovered and the properties of the policies we induced.
2. MOTIVATION
Just as assessment sits at the epicenter of educational research [2], policy evaluation is the central concern among the many stakeholders in applying offline RL to ITSs. As educational assessment should reflect and reinforce the educational goals that society deems valuable, our policy evaluation metrics should reflect the effectiveness of the induced policies. While various RL approaches such as policy iteration and policy search have shown great promise, existing RL approaches tend to perform poorly when they are actually implemented and evaluated in the real world.

In a series of prior studies on a logic ITS, RL and Markov Decision Processes (MDPs) were applied to induce four different pedagogical policies, named MDP1-MDP4 respectively, for one type of tutorial decision: whether to provide students with a Worked Example (WE) or to ask them to engage in Problem Solving (PS). In WEs, the tutor presents an expert solution to a problem step by step, while in PS, students are required to complete the problem with the tutor's support. When inducing each of the four policies, we explored different feature selection methods and used Expected Cumulative Reward (ECR) to evaluate the RL-induced policies. The ECR of a policy is calculated by averaging over the value function of the initial states; generally speaking, the higher the ECR value of a policy, the better the policy is supposed to perform.

[Figure 1: Post-test Score vs. ECR, showing no seeming direct relationship.]

Figure 1 shows the ECRs (blue dashed line) of the four RL-induced policies, MDP1-MDP4 (x-axis), and the empirical student learning performance (solid red line) under the corresponding policies. Here learning performance is measured by an in-class post-test taken after students were trained on the tutor with the corresponding policy (the mean and standard errors of post-test scores are shown with the red solid line). Figure 1 shows that our theoretical evaluation (ECR) does not match the empirical (post-test) evaluation: there is no clear relationship between the ECRs (blue line) and the corresponding post-test scores (red line) across the four policies. This result shows that ECR is not a reliable OPE metric for evaluating RL-induced policies in ITSs. Indeed, Mandel et al. [15] pointed out that ECR tends to be biased and statistically inconsistent, and thus may not be the appropriate OPE metric in high-stakes domains. In recent years, many state-of-the-art OPE metrics have been proposed, and many of them are based on Importance Sampling.

Importance Sampling (IS) is a classic OPE method for evaluating a target policy on existing data obtained from an alternate behavior policy, and thus it can be handily applied to the task of evaluating the effectiveness of an offline RL-induced policy using pre-existing historical training datasets. Many IS-based OPE metrics have been proposed and explored, and they have been shown to perform well in simulation environments such as Grid World or bandit settings [29, 6]. Among them, three IS-based OPE metrics, the original IS, Weighted IS (WIS), and Per-Decision IS (PDIS), are the most widely used. However, real-world human-agent interactive applications such as ITSs are much more complicated due to 1) individual differences, noise, and randomness during the interaction processes, 2) the large state space that can impact student learning, and 3) long trajectories due to the nature of the learning process.

In this work, we investigated the three IS-based offline OPE metrics on MDP1-MDP4 to determine whether they are indeed effective OPE metrics for evaluating the four RL-induced policies mentioned above. We believe an OPE metric is effective if and only if the theoretical results from the OPE evaluations are completely aligned with the empirical results from the classroom studies. Therefore, we explored different deployment settings for the IS-based metrics along two dimensions: one is the transformation function used to convert the RL-induced deterministic policy into the stochastic policy required by IS-based metrics, and the other is the reward function, i.e., the original reward function vs. the normalized reward function; the latter is supposed to reduce variance. Our results showed that the theoretical and empirical evaluation results are only more or less aligned across the different deployment settings and IS-based metrics; only when using the soft-max transformation function and the original reward function do the theoretical results of PDIS reach 100% agreement with the empirical results.

Based on the results from the OPE metrics, we further explored using the properties of the RL-induced policies to identify critical decisions, and our results showed that critical decisions can be identified using the theoretically "effective" policies identified by PDIS with the soft-max transformation and the original reward function. In summary, we make the following contributions:

• We directly compared three IS-based policy evaluation metrics against the empirical results from real classroom studies across four different RL-induced policies. Our results showed that PDIS is the best and that its results can align with the empirical results.

• As far as we know, this is the first study to compare different deployment settings (original/normalized rewards, deterministic/stochastic policy transformation) for IS-based policy evaluation metrics. Our results showed that the settings have a direct impact on the effectiveness of the evaluation metrics; only PDIS with the soft-max transformation and the original reward function agreed 100% with the empirical results.

• We investigated using information from the RL-induced policies to identify critical decisions in order to shed some light on the induced policies. As far as we know, this is the first attempt to differentiate critical decisions from trivial ones.
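Since ECR serves as the theoretical baseline throughout this comparison, we note that it can be estimated simply as the induced value function averaged over the initial states of the training trajectories. The following minimal sketch is our own illustration (not the authors' code), assuming the initial state of each trajectory is known.

```python
import numpy as np

def expected_cumulative_reward(V, initial_states):
    """ECR: the induced value function V(s) averaged over the initial
    states observed in the training corpus (higher is assumed better)."""
    return float(np.mean([V[s] for s in initial_states]))
```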
3. RELATED WORK
3.1 Empirical Studies Applying RL to ITSs
In recent years, a number of researchers have applied RL to induce effective pedagogical policies for ITSs [3, 5, 13, 23]. Some previous work treated the user-system interactions as fully observable processes by applying Markov Decision Processes (MDPs) [14, 27, 1], while others utilized partially observable MDPs (POMDPs) [24, 32, 33, 15, 4] and, more recently, deep RL frameworks [31, 18]. Most of the previous work, including this work, took the offline RL approach in that it followed three general steps: 1) collect an exploratory corpus by training a relatively large group of real and/or simulated students on an ITS that makes random yet reasonable and rational tutorial decisions; 2) apply RL to induce a pedagogical policy directly from the historical exploratory corpus; and 3) implement the RL-induced pedagogical policy back into the ITS and evaluate its effectiveness using simulations and/or real classroom settings to investigate whether RL fulfills its promise. Oftentimes, when the RL-induced policies are implemented and evaluated in the real world, they perform poorly compared to their theoretical performance.

Iglesias et al. [10] applied Q-learning to generate a policy in an ITS that teaches students database design. Their goal was to provide students with direct navigational support through the system's content. In the training phase, they used simulated students to induce an RL policy and evaluated the induced policy online based upon three self-defined measures. In the test phase, they evaluated the induced policy with real students. The results showed that while students using the induced policies had more effective usage behaviors than their non-policy peers, there was no significant difference in student learning performance. Chi et al. [17] applied an offline RL approach to induce pedagogical policies for improving the effectiveness of an ITS that teaches students college physics. They used ECR to evaluate the induced policy in the training phase; however, when they applied the induced policy to real students, it did not outperform the random policy in the empirical evaluation. Similarly, Shen et al. [26] explored immediate and delayed rewards based on learning gain and implemented an offline MDP framework on a rule-based ITS for deductive logic. They selected the policy with the highest ECR and deployed it in the ITS with real students. The empirical results showed that the RL policies were no more effective than the random baseline policy. In addition, Rowe et al. [22] investigated an MDP framework for tutorial planning in a game-based learning system. They used ECR to theoretically evaluate the induced policies, but in an empirical study with real students, they found that students in the induced-planner condition had significantly different behavior patterns from the control group, while no significant difference was found between the two groups on the post-test [23].

In short, prior work on applying offline RL in ITSs primarily explored ECR as the OPE metric, and one common phenomenon is that the theoretical evaluation results do not align with the empirical results from real students.

3.2 OPE Metrics
OPE is used to evaluate the performance of a target policy given historical data generated by an alternative behavior policy. A good OPE metric is especially important for real-world applications where the deployment of a bad or inefficient policy can be costly [19]. ECR is one of the most widely used OPE metrics and is designed especially for the MDP framework. Tetreault et al. [11] estimated the reliability of ECRs by repeated sampling to estimate confidence intervals for ECRs. In simulation studies, they showed that the policy induced using the confidence interval of ECR performed more reliably than the baseline policies, but this phenomenon did not hold when the RL-induced policies were evaluated empirically in ITSs [17].

Importance Sampling (IS) [7] is a widely used OPE metric which considers the mathematical characteristics of the decision-making process and can be applied to any MDP, POMDP, or deep RL framework. Precup [20] proposed four IS-based OPE metrics: IS, weighted importance sampling (WIS), per-decision importance sampling (PDIS), and weighted per-decision importance sampling (WPDIS). They used the IS-based estimators for policy evaluation in Q-learning and then compared the effectiveness of the estimators on a series of 100 randomly constructed MDPs based on the mean squared error (MSE). Their results showed that IS made the Q-learning process converge slowly and caused high variance; WIS performed better than IS, but PDIS performed inconsistently and WPDIS performed the worst. Similarly, Thomas [29] compared the performance of several IS estimators using mean squared error in a grid-world simulation, showing that PDIS outperformed all the others.

In summary, previous work has explored the effectiveness of IS and its variants in simulation studies, which motivated the work reported here. Different from previous work, we mainly focus on comparing the theoretical evaluation with the empirical evaluation in order to determine whether IS-based methods are indeed reliable and robust for ITSs.
4. MARKOV DECISION PROCESS & RL
Some of the prior work on applying RL to induce pedagogical policies used the Markov Decision Process (MDP) framework. An MDP can be seen as a 4-tuple $\langle S, A, T, R \rangle$, where $S$ denotes the observable state space, defined by a set of features that represent the interactive learning environment, and $A$ denotes the space of possible actions for the agent to execute. The reward function $R$ represents the immediate or delayed feedback from the environment with respect to the agent's action(s); $r(s,a,s')$ denotes the expected reward of transitioning from state $s$ to state $s'$ by taking action $a$. Once $\langle S, A, R \rangle$ is defined, $T$ represents the transition probabilities, where $P(s,a,s') = \Pr(s'|s,a)$ is the probability of transitioning from state $s$ to state $s'$ by taking action $a$; it can be easily estimated from the training corpus. The optimal policy $\pi$ for an MDP can be generated via dynamic programming approaches such as Value Iteration. This algorithm operates by finding the optimal value of each state, $V^*(s)$, which is the expected discounted reward the agent will gain if it starts in $s$ and follows the optimal policy to the goal. Generally speaking, $V^*(s)$ can be obtained from the optimal value function for each state-action pair, $Q^*(s,a)$, defined as the expected discounted reward the agent will gain if it takes action $a$ in state $s$ and follows the optimal policy to the end. The optimal value function $Q^*(s,a)$ can be obtained by iteratively updating $Q(s,a)$ via Equation 1 until convergence:

$$Q(s,a) := \sum_{s'} P(s,a,s')\left[\, r(s,a,s') + \gamma \max_{a'} Q(s',a') \,\right] \qquad (1)$$

where $0 \le \gamma \le 1$ is a discount factor. When the process converges, the optimal policy $\pi^*$ corresponding to the optimal Q-value function $Q^*(s,a)$ can be induced as:

$$\pi^*(s) = \arg\max_{a} Q^*(s,a) \qquad (2)$$

where $\pi^*$ is the deterministic policy that maps a given state to an action. In the context of an ITS, this induced policy represents the pedagogical strategy by specifying the tutorial action for the current state.
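To make Equations 1 and 2 concrete, the following is a minimal sketch of tabular value iteration with greedy policy extraction. It is an illustration rather than the implementation used in this work, and it assumes that the transition probabilities and expected rewards have already been estimated from the training corpus and that every successor state also appears in the state list.

```python
def value_iteration(states, actions, P, r, gamma=0.9, tol=1e-6):
    """Tabular Q-value iteration (Equation 1) and greedy policy extraction
    (Equation 2).

    P[(s, a)] is a dict {s_next: probability} estimated from the corpus;
    r[(s, a, s_next)] is the expected immediate reward.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:
                new_q = sum(
                    p * (r[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for s2, p in P[(s, a)].items()
                )
                delta = max(delta, abs(new_q - Q[(s, a)]))
                Q[(s, a)] = new_q
        if delta < tol:          # stop once the Bellman backup has converged
            break
    # Equation 2: deterministic greedy policy
    policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
    return Q, policy
```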
5. THREE OPE METRICS & TWO SETTINGS
The following terms will be used throughout this paper.

• $H = s_1 \xrightarrow{a_1, r_1} s_2 \xrightarrow{a_2, r_2} s_3 \xrightarrow{a_3, r_3} \cdots s_L$ denotes one student-system interaction trajectory, and $H_L$ denotes a trajectory of length $L$.

• $G(H_L) = \sum_{t=1}^{L} \gamma^{t-1} r_t$ is the discounted return of the trajectory $H_L$, which generally reflects how good the trajectory is supposed to be.

• $D = \{H_1, H_2, H_3, \dots, H_n\}$ denotes the historical dataset containing $n$ student-system interaction trajectories.

• $\pi_b$ denotes the behavior policy carried out for collecting the historical data $D$.

• $\pi_e$ denotes the target policy to be evaluated.

• $\rho(\pi_e)$ represents the estimated performance of $\pi_e$.

5.1 Three IS-based OPE Metrics
Importance Sampling (IS) is an approximation method that allows the estimation of the expectation of a distribution $p$ from samples generated from a different distribution $q$. Suppose that we have a sample space $x$ and a random variable $f(x)$, a measurable function from that sample space to another measurable space. We want to estimate the expectation of $f(x)$ over a strictly positive probability density function $p(x)$. Suppose also that we cannot directly sample from the distribution $p(x)$, but we can draw Independent and Identically Distributed (IID) samples from the probability density function $q(x)$ and evaluate $f(x)$ for these samples. The expectation of $f(x)$ over the probability density function $p(x)$ can be calculated as:

$$E_p[f(x)] = \int f(x)\,p(x)\,dx \qquad (3)$$
$$= \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx \qquad (4)$$
$$= E_q\!\left[f(x)\,\frac{p(x)}{q(x)}\right] \qquad (5)$$

where $p$ is known as the target distribution, $q$ is the sampling distribution, $E_p[f(x)]$ is the expectation of $f(x)$ under $p$, $p(x)/q(x)$ is the likelihood ratio weight, and $E_q[f(x)\,p(x)/q(x)]$ is the expectation of $f(x)\,p(x)/q(x)$ under $q$. We can then approximate the expectation of $f(x)$ over the probability density function $p(x)$ using samples drawn from the probability density function $q(x)$. In the context of OPE, the target distribution $p$ describes an event whose density is determined by the target policy, and the sampling distribution $q$ describes an event whose density is determined by the behavior policy.

Following the general IS technique, we approximated the expected reward of the target policy using the relative probability of the target and behavior policies. Because of the nature of the underlying MDP, samples in RL are sequential; therefore, we assumed that each trajectory is a sequence of events whose probability density function is determined by its corresponding policy. Assuming independence among trajectories and following the multiplication rule of independent events, the probability of occurrence of a trajectory $H_L$ under policy $\pi$ is

$$P_\pi(H_L) = P_1(s_1)\prod_{t=1}^{L}\pi(a_t|s_t)\,P_T(s_{t+1}|s_t,a_t) \qquad (6)$$

where $P_\pi(H_L)$ is the probability of occurrence of the trajectory $H_L$ under policy $\pi$, $P_1(s_1)$ is the probability of the state $s_1$ at the beginning of the trajectory, and $P_T$ is the state transition probability function.

Similar to the original IS technique, the relative probability of each trajectory occurring under the target policy $\pi_e$ versus the behavior policy $\pi_b$ is used as the likelihood ratio weight for that trajectory. Thus, following the importance sampling technique, the importance sampling discounted return is defined as:

$$IS(\pi_e|H_L,\pi_b) = \frac{P_{\pi_e}(H_L)}{P_{\pi_b}(H_L)}\cdot G(H_L) \qquad (7)$$
$$= \frac{\prod_{t=1}^{L}\pi_e(a_t|s_t)}{\prod_{t=1}^{L}\pi_b(a_t|s_t)}\cdot G(H_L) \qquad (8)$$

since the initial-state and transition probabilities are shared by both policies and cancel. Substituting $G(H_L)$ with the discounted return, we have:

$$IS(\pi_e|H_L,\pi_b) = \left(\prod_{t=1}^{L}\frac{\pi_e(a_t|s_t)}{\pi_b(a_t|s_t)}\right)\left(\sum_{t=1}^{L}\gamma^{t-1}r_t\right) \qquad (9)$$

After obtaining the individual IS estimator for each trajectory, we can calculate the expected reward of the dataset $D$ by averaging the individual IS estimators over the trajectories:

$$IS(\pi_e|D) = \frac{1}{n_D}\sum_{i=1}^{n_D}\left(\prod_{t=1}^{L_i}\frac{\pi_e(a_t^i|s_t^i)}{\pi_b(a_t^i|s_t^i)}\right)\left(\sum_{t=1}^{L_i}\gamma^{t-1}r_t^i\right) \qquad (10)$$

where $s_t^i$, $a_t^i$, and $r_t^i$ refer to the $i$-th trajectory at time $t$ and $n_D$ is the number of trajectories in $D$.

Weighted Importance Sampling (WIS) is a variant of the IS estimator; it is biased but consistent and has lower variance than IS. It normalizes the IS estimate in order to produce a lower variance. First, it calculates the weight $W_D$ for the dataset as the sum of the likelihood ratios of the trajectories, as shown in Equation 11. Then, it normalizes the IS estimator as shown in Equation 12. Finally, the WIS of the dataset is simply the weighted average of the estimated rewards of the trajectories in $D$, as shown in Equation 13.

$$W_D = \sum_{i=1}^{n_D}\prod_{t=1}^{L_i}\frac{\pi_e(a_t^i|s_t^i)}{\pi_b(a_t^i|s_t^i)} \qquad (11)$$

$$WIS(\pi_e|H_L,\pi_b) = \frac{IS(\pi_e|H_L,\pi_b)}{W_D} \qquad (12)$$

$$WIS(\pi_e|D) = \frac{\sum_{i=1}^{n_D}\left(\prod_{t=1}^{L_i}\frac{\pi_e(a_t^i|s_t^i)}{\pi_b(a_t^i|s_t^i)}\right)\left(\sum_{t=1}^{L_i}\gamma^{t-1}r_t^i\right)}{\sum_{i=1}^{n_D}\prod_{t=1}^{L_i}\frac{\pi_e(a_t^i|s_t^i)}{\pi_b(a_t^i|s_t^i)}} \qquad (13)$$

Per-Decision Importance Sampling (PDIS) is also a variant of IS; like IS, it is unbiased and consistent. IS has very high variance because for each reward $r_t$ it uses the likelihood ratio of the entire trajectory; however, the reward at step $t$ should only depend on the previous steps. The variance can therefore be reduced by weighting the reward $r_t$ with the likelihood ratio of the trajectory only up to step $t$, i.e., the importance weight for a reward at step $t$ is $\prod_{j=1}^{t}\frac{\pi_e(a_j|s_j)}{\pi_b(a_j|s_j)}$. The individual PDIS estimator for a trajectory is given in Equation 14, and the PDIS for the whole historical dataset $D$ is given in Equation 15.

$$PDIS(\pi_e|H_L,\pi_b) = \sum_{t=1}^{L}\gamma^{t-1}\left(\prod_{j=1}^{t}\frac{\pi_e(a_j|s_j)}{\pi_b(a_j|s_j)}\right)r_t \qquad (14)$$

$$PDIS(\pi_e|D) = \frac{1}{n_D}\sum_{i=1}^{n_D}\sum_{t=1}^{L_i}\gamma^{t-1}\left(\prod_{j=1}^{t}\frac{\pi_e(a_j^i|s_j^i)}{\pi_b(a_j^i|s_j^i)}\right)r_t^i \qquad (15)$$
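The three estimators above can be computed directly from logged trajectories. The sketch below is our own illustration (not the authors' code); it assumes each trajectory is a list of (state, action, reward) tuples and that the behavior policy assigns non-zero probability to every logged action.

```python
import numpy as np

def is_wis_pdis(trajectories, pi_e, pi_b, gamma=0.9):
    """Compute IS (Eq. 10), WIS (Eq. 13), and PDIS (Eq. 15) for a dataset.

    trajectories: list of trajectories, each a list of (s, a, r) tuples.
    pi_e(a, s), pi_b(a, s): action probabilities under the target and
    behavior policies.
    """
    per_traj_is, weights, per_traj_pdis = [], [], []
    for traj in trajectories:
        rho = 1.0      # cumulative likelihood ratio up to step t
        ret = 0.0      # discounted return G(H_L)
        pdis = 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(a, s) / pi_b(a, s)
            ret += (gamma ** t) * r
            pdis += (gamma ** t) * rho * r   # per-decision weighting (Eq. 14)
        weights.append(rho)                  # full-trajectory ratio
        per_traj_is.append(rho * ret)        # Eq. 9
        per_traj_pdis.append(pdis)
    is_hat = np.mean(per_traj_is)                      # Eq. 10
    wis_hat = np.sum(per_traj_is) / np.sum(weights)    # Eq. 13
    pdis_hat = np.mean(per_traj_pdis)                  # Eq. 15
    return is_hat, wis_hat, pdis_hat
```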
5.2 Two Different Settings
To evaluate the three IS-based metrics, we explored different deployment settings along two dimensions: one is the transformation function used to convert the RL-induced deterministic policy into the stochastic policy required by the IS-based metrics, and the other is the reward function: the original reward function vs. the normalized reward function.

5.2.1 Three Transformation Functions:
As described above, IS-based metrics require the policy to be stochastic, but our MDP-induced policies are deterministic, i.e., given the current state $s$, the agent should take the deterministic optimal action $a^*$ following the optimal policy $\pi^*$. To transform the deterministic policy into a stochastic policy, we explore three types of transformation functions: Hard-code, Q-proportion, and Soft-max. The basic idea behind them is that, for any given state $s$, the stochastic probability assigned to an action $a$ should reflect its value $Q(s,a)$.

1. Hard-code Transformation

$$\pi(a|s) = \begin{cases} 1-\varepsilon & \text{optimal action} \\ \varepsilon & \text{otherwise} \end{cases} \qquad (16)$$

In Equation 16, $\varepsilon$ is a fixed small probability, e.g., $\varepsilon = 0.001$, assigned to the actions with smaller Q-values, while $1-\varepsilon$ is assigned to the action with the largest Q-value.

2. Q-proportion Transformation

$$\pi(a|s) = \frac{Q(s,a)}{\sum_{a'\in A} Q(s,a')} \qquad (17)$$

As shown in Equation 17, the probability of taking action $a$ in state $s$ is the proportion of $a$'s Q-value among all possible actions $a'$ in state $s$. Thus, the action with the highest Q-value is guaranteed to have the highest probability. In practice, because some Q-values are smaller than 0, we add a constant to the Q-values within the same state so that $Q(s,a) \ge 1$.

3. Soft-max Transformation

$$\pi(a|s) = \frac{e^{\theta\cdot Q(s,a)}}{\sum_{a'\in A} e^{\theta\cdot Q(s,a')}} \qquad (18)$$

Soft-max is a classical function used to calculate a probability distribution over $n$ possible events. In our application, given state $s$, the soft-max function calculates the probability of action $a$ over all possible actions using Equation 18. The main advantages of soft-max are that the output probabilities lie between 0 and 1 and sum to 1, and that it can handle negative Q-values. Here $\theta$ is a weight parameter on the Q-values.

5.2.2 Two Types of Reward Functions:
The effectiveness of RL-induced policies is very sensitive to the reward function. In our application, the range of the reward function is very large, $[-200, 200]$, which may cause large variance for IS, especially when a trajectory is long. One effective way to reduce this variance is to use normalized rewards; therefore, both original rewards and normalized rewards are considered. More specifically, the normalized reward $z$ is defined as $z = \frac{x - \min(x)}{\max(x) - \min(x)}$, where $\min$ and $\max$ are the minimum and maximum values of the original reward function $x$, so that $z \in [0,1]$.
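A minimal sketch of the three transformations (Equations 16-18) and of the min-max reward normalization is given below; the function and variable names are our own, and the two-action setting (PS vs. WE) is assumed.

```python
import numpy as np

ACTIONS = ("PS", "WE")  # problem solving vs. worked example

def hard_code(Q, s, eps=0.001):
    """Equation 16: 1 - eps for the greedy action, eps otherwise."""
    best = max(ACTIONS, key=lambda a: Q[(s, a)])
    return {a: (1.0 - eps if a == best else eps) for a in ACTIONS}

def q_proportion(Q, s):
    """Equation 17: probabilities proportional to (shifted) Q-values."""
    q = np.array([Q[(s, a)] for a in ACTIONS])
    q = q - q.min() + 1.0          # shift so every Q-value is >= 1
    return dict(zip(ACTIONS, q / q.sum()))

def soft_max(Q, s, theta=1.0):
    """Equation 18: Boltzmann distribution over Q-values."""
    q = np.array([theta * Q[(s, a)] for a in ACTIONS])
    e = np.exp(q - q.max())        # subtract the max for numerical stability
    return dict(zip(ACTIONS, e / e.sum()))

def normalize_rewards(rewards):
    """Section 5.2.2: min-max normalization of rewards into [0, 1]."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.min()) / (r.max() - r.min())
```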
6. ITS & FOUR MDP POLICIES
6.1 Our Logic Tutor
Figure 2 shows the interface of the logic tutor, a data-driven ITS used in the undergraduate Discrete Mathematics (DM) course at a large university. The tutor [16] provides students with a graph-based representation of logic proofs which allows students to solve problems by applying rules to derive new statements, represented as nodes. The system automatically verifies proofs and provides immediate feedback on logical errors. Every problem in the tutor can be presented in the form of either WE or PS. By focusing on the pedagogical decisions of WE vs. PS, the tutor allows us to strictly control the content to be equivalent for all students.

[Figure 2: Tutor Problem Solving Interface]

6.2 Four MDP Policies and Empirical Study
Four MDP policies, MDP1-MDP4, were induced from an exploratory pre-collected dataset following different feature selection procedures. Detailed descriptions of our policy induction process are given in [26, 25] and must be omitted here because of page limits. The effectiveness of each MDP policy was empirically evaluated against the Random policy in strictly controlled studies during three consecutive semesters. In each strictly controlled study, students were randomly assigned to two conditions: an MDP policy or a Random baseline policy, which makes random yet reasonable decisions because both PS and WE are always considered to be reasonable educational interventions in our learning context. Moreover, all students went through the identical procedure on the tutor, and the only difference was the pedagogical policy employed. After completing the tutor, students took a post-test which involved two proof questions in a midterm exam; each question was worth 16 points and was graded by one TA using a rubric. Overall, no significant difference was found between our four RL-induced policies and the Random policy on students' post-test scores across all studies.

There are many possible explanations for these results. First, while the Random policy is generally a weak policy for many RL tasks, in our situation both action choices, WE and PS, are considered reasonable and, more importantly, at each decision point there is a 50% chance that the random policy carries out the better of the two. Second, non-significant statistical results do not mean non-existence: a small sample size may limit the significance of statistical comparisons. A post hoc power analysis revealed that, in order to be detected as significant at the 5% level with 80% power, MDP1 vs. Random needed a total sample of 1382 students, MDP2 vs. Random needed 1700 students, MDP3 vs. Random needed 212 students, and MDP4 vs. Random needed 394 students; however, in each empirical study our student sample sizes were much smaller: 59, 50, 57, and 84 respectively. And last but not least, it turned out that all four RL-induced policies were only partially carried out. All of the training problems in our tutor are organized into six strictly ordered levels, and in each level students are required to complete 3-4 problems. In level 1, all participants receive the same set of PS problems, and in levels 2-6 our tutor has two hard-coded action-based constraints required by the class instructors: students must complete at least one PS and one WE, and the last problem in each level must be PS. Therefore, over the entire training process, only ∼50% of the actions are actually decided by the pedagogical policy; the rest are decided by hard-coded system rules.
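As an aside, the kind of post hoc power analysis described above can be reproduced with a standard power calculator. The sketch below uses statsmodels with a hypothetical effect size, since the observed effect sizes are not reported here; it is an illustration, not the authors' computation.

```python
from statsmodels.stats.power import TTestIndPower

# Hypothetical standardized effect size (Cohen's d); illustrative only.
effect_size = 0.15

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                                   power=0.80, ratio=1.0,
                                   alternative='two-sided')
print(f"required per-group n: {n_per_group:.0f}, total: {2 * n_per_group:.0f}")
```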
In short, despite the fact that ECR indicated that our four RL-induced policies should be more effective than the Random policy, the empirical results showed otherwise for various potential reasons. We therefore explored other OPE metrics to evaluate MDP1-MDP4.

7. EXPERIMENT SETUP
Our dataset contains a total of 450 students' interaction logs collected in the strictly controlled studies mentioned above. The goals of our experiment were to: 1) investigate whether any of the three IS-based metrics can align the theoretical and empirical results for our four MDP policies, and 2) identify critical decisions that are linked to student learning.

7.1 Three IS-based Metrics Evaluation
We first describe how we determine whether or not the three IS-based metrics can align the theoretical results with the empirical results for MDP1-MDP4.

For a given RL-induced policy π, we first split all students into High vs. Low groups based on the actual carry-out percentages according to π. Since there are only two tutorial choices, WE vs. PS, there is some probability that each actual decision the tutor made agrees with the decision prescribed by π; for the Random policy, for example, the probability is 50-50. In other words, we can measure each trajectory by the percentage of tutorial decisions that agree with π. If π is indeed effective, we would expect that the more the tutorial decisions in a trajectory agree with π, the better the corresponding student's performance would be. We thus treat all 450 students' interaction logs equally, regardless of their originally assigned conditions, and for each student-ITS interaction log we calculate a carry-out percentage for π using the formula: percentage = N_agree / N_total, where N_total is the total number of tutorial decisions in the trajectory and N_agree is the number of decisions that agree with π. Students are then divided into High Carry-out (High) vs. Low Carry-out (Low) groups by a median split on the carry-out percentages.

Then we empirically evaluate the effectiveness of π by checking whether there is a significant difference between the High and Low groups on their post-test scores. Similarly, we compare the High and Low groups' theoretical results. To do so, for each student-ITS interaction trajectory, we estimate its reward by exploring different combinations of the three IS-based OPE metrics with the three policy transformation functions and the two reward functions; more specifically, we treat all trajectories as generated by the Random policy regardless of their original behavior policy. So for each student, we have a total of 18 theoretical evaluations for a given π. If π is indeed effective and our OPE metric is reliable, we would expect that the more the tutorial decisions in a trajectory agree with π, the higher the corresponding theoretical rewards would be, and vice versa.
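The carry-out computation and median split described above can be sketched as follows. This is our own illustration with a hypothetical log format; `theoretical_score` stands in for any one of the 18 OPE settings.

```python
import numpy as np

def carry_out_percentage(trajectory, policy):
    """Fraction of tutorial decisions that agree with policy pi.

    trajectory: list of (state, action, reward) tuples (hypothetical format);
    policy: dict mapping state -> the action pi would take.
    """
    agree = sum(1 for s, a, _ in trajectory if policy.get(s) == a)
    return agree / len(trajectory)

def median_split(values):
    """Boolean mask: True for the High group, False for the Low group."""
    values = np.asarray(values, dtype=float)
    return values >= np.median(values)

# Usage sketch: `logs` is a list of per-student trajectories, `pi` is the
# MDP policy under evaluation, and `theoretical_score` is one of the 18
# OPE settings (e.g., PDIS with soft-max transformation, original reward).
# percentages = [carry_out_percentage(t, pi) for t in logs]
# high_mask = median_split(percentages)
# theory = [theoretical_score(t) for t in logs]
```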
Finally, for each of the 18 OPE metric settings, we conduct an alignment test between the theoretical and the empirical results on π. This is done by comparing the empirically evaluated results and the theoretical rewards under the corresponding OPE metric. More specifically, they are considered to be aligned when:

1. Both the empirical and the theoretical results were not significant, that is, p ≥ 0.05; or

2. Both results were significant and the direction of the comparison was the same, that is, p < 0.05 and the signs of the t values are both positive or both negative.

All remaining cases are considered not aligned. Thus, for each of the 18 OPE metric settings, we can test whether its theoretical results align with the empirical results for π. Since we have four RL-induced policies, MDP1-MDP4, a robust and reliable OPE metric should align the two types of evaluation results across all four policies.

7.2 Critical Decision Identification
Next, we explain how the critical interactive decisions are identified and empirically examined. Note that there may be critical decisions over which the RL policies have no influence; hence, we focus only on interactive decisions that are critical. For many RL algorithms, the fundamental approach to inducing an optimal policy can be seen as recursively estimating the Q-values Q(s,a) for every state-action pair until the Bellman equation converges. More specifically, Q(s,a) is defined as the expected discounted reward the agent will gain if it takes action a in state s and follows the corresponding optimal policy to the end. Thus, for a given state s, a large difference between the values of Q(s,"PS") and Q(s,"WE") indicates that it is more important for the ITS to follow the optimal decision in state s. We therefore used the absolute difference between the Q-values for each state s to identify critical decisions. Our procedure can be divided into two steps.

Step 1: Identify Critical Decisions. Given an MDP policy, for each state we calculated the absolute Q-value difference between the two actions (PS vs. WE) associated with it. Figure 3 shows the Q-value difference (y-axis) for each state (x-axis), sorted in descending order, for the MDP1-MDP4 policies respectively. It clearly shows that, across the four MDP policies, the Q-value differences for different states can vary greatly. We used the median Q-value difference to split the states into critical vs. non-critical states: the states with the larger Q-value differences were critical states and the rest were non-critical ones. For a given RL-induced policy, the critical decisions are defined as those made in critical states where the actually carried-out tutorial action agreed with the corresponding policy.

[Figure 3: Q-value difference in MDP1-MDP4]

Step 2: Evaluate Critical Decisions. For each of the four RL-induced policies, we counted the number of critical decisions that each student encountered during his/her training. Then, for each policy, students were split into More vs. Less groups by a median split on the number of critical decisions experienced during training. A t-test was conducted on the post-test scores of the More vs. Less groups to investigate whether the students with More critical decisions would indeed perform better than those with Less.
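A minimal sketch of Steps 1 and 2 is shown below; it is our own illustration, and the state encoding, log format, and helper names are hypothetical.

```python
import numpy as np
from scipy import stats

def critical_states(Q, states):
    """Step 1: states whose |Q(s,'PS') - Q(s,'WE')| exceeds the median gap."""
    gaps = {s: abs(Q[(s, "PS")] - Q[(s, "WE")]) for s in states}
    median_gap = np.median(list(gaps.values()))
    return {s for s, g in gaps.items() if g > median_gap}

def count_critical_decisions(trajectory, policy, crit_states):
    """Number of decisions made in critical states that agree with the policy."""
    return sum(1 for s, a, _ in trajectory
               if s in crit_states and policy.get(s) == a)

def evaluate_more_vs_less(counts, post_tests):
    """Step 2: median split on critical-decision counts, then a t-test."""
    counts = np.asarray(counts)
    post_tests = np.asarray(post_tests, dtype=float)
    more = counts >= np.median(counts)
    return stats.ttest_ind(post_tests[more], post_tests[~more])
```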
8. RESULTS
8.1 Three IS-based Metrics Evaluation
Table 1 shows the empirical evaluation results comparing the High vs. Low carry-out groups' post-test scores for each RL-induced policy. The motivation is that, if a policy is indeed effective, students in the High group should significantly outperform their Low peers on the post-test. In Table 1, the first column indicates the name of the RL-induced policy; columns 2 and 3 show the mean and standard deviation of the classroom post-test scores for the High and Low carry-out groups, respectively; and the last column shows the t-test results comparing the post-test scores between the groups. Rows 2-4 show that there is no significant difference between the High vs. Low groups in terms of post-test scores for the MDP1, MDP2, and MDP3 policies, but there is a significant difference between the two groups for the MDP4 policy (row 5): t(448) = 2.19, p = .029. This result suggests that, among the four MDP policies, only MDP4 seems to be effective in that the students in MDP4's High carry-out group performed significantly better than those in the Low carry-out group.

Table 1: Empirical Post-test Evaluation Results for High and Low Carry-out Groups

Policy | High | Low | T-test Result
MDP1 | 79.06 (24.64) | 83.13 (23.69) | t(448) = −1.78, p = .076
MDP2 | 81.46 (24.90) | 80.44 (23.66) | t(448) = .44, p = .658
MDP3 | 82.55 (23.48) | 79.30 (25.00) | t(448) = 1.24, p = .156
MDP4 | 83.43 (22.83) | 78.43 (25.44) | t(448) = 2.19, p = .029**

Bold and ** denote significance at p < 0.05.

Table 2 shows the overall IS-based metric evaluation results, i.e., the impact of the policy transformations and of original versus normalized rewards on the outcome of each IS metric. The first column indicates the type of policy transformation applied and the second column shows whether the rewards are normalized. The third through fifth columns show the performance of each IS metric, where performance is the percent alignment between the IS policy predictions and the empirical post-test results. Among the three policy transformations, Q-proportion is the worst, with none of its six performance results better than the corresponding results of Soft-max or Hard-code. Hard-code performs slightly better than Soft-max in most cases but never reaches a 100% match. For reward normalization, the original reward performs better than the normalized reward for the PDIS metric, but there was no effect on IS or WIS.

Comparing the three IS metrics, PDIS shows the greatest performance, with all 12 of its performance results being the best. For the other two metrics, IS and WIS, the results are exactly the same because WIS is essentially IS multiplied by a constant, and this kind of rescaling does not change the results of the t-tests. The metric with the best performance is PDIS with the Soft-max policy transformation and the original reward, whose performance is 100%: all of the t-test results on the PDIS predictions aligned with those on the empirical results in terms of significance.

Table 2: Policy Transformation and Normalization Impacts on IS Metric Alignment to Post-test Outcomes

Policy Transformation | Rewards | IS | WIS | PDIS
Soft-max | Original | 50% | 50% | 100%*
Soft-max | Normalized | 50% | 50% | 50%
Q-proportion | Original | 25% | 25% | 75%
Q-proportion | Normalized | 25% | 25% | 25%
Hard-code | Original | 75% | 75% | 75%
Hard-code | Normalized | 75% | 75% | 75%

Table 3 shows the detailed results for the metric evaluations using the original reward, providing t-test results when comparing the High and Low carry-out groups. The first column in Table 3 shows the type of policy transformation function applied. The second column shows the four MDP policies considered when splitting the dataset into High and Low carry-out groups. The third column shows the t-test results of the empirical evaluation of High vs. Low carry-out, which served as the ground truth. The fourth and fifth columns show the t-test results for the predictions of the three IS metrics: IS, WIS, and PDIS, respectively. In the table, electric-blue cells denote that the theoretical t-test results align with the empirical t-test results (column 3), while grey cells denote misaligned t-test results.

Table 3: Detailed IS-based Metrics Evaluation Results Using Original Reward

Transform | Policy | Empirical Result | IS & WIS Result | PDIS Result
Soft-max | MDP1 | t(448) = −1.78, p = .076 | t(448) = .89, p = .376 | t(448) = .89, p = .375
Soft-max | MDP2 | t(448) = .44, p = .658 | t(448) = 2.60, p = .010** | t(448) = 1.84, p = .067
Soft-max | MDP3 | t(448) = 1.42, p = .156 | t(448) = 2.50, p = .013** | t(448) = .53, p = .594
Soft-max | MDP4 | t(448) = 2.19, p = .029** | t(448) = 2.18, p = .030** | t(448) = 3.23, p = .001**
Q-proportion | MDP1 | t(448) = −1.78, p = .076 | t(448) = 3.78, p < .001** | t(448) = 1.77, p = .077
Q-proportion | MDP2 | t(448) = .44, p = .658 | t(448) = 3.26, p = .001** | t(448) = 2.13, p = .034**
Q-proportion | MDP3 | t(448) = 1.42, p = .156 | t(448) = 2.71, p = .007** | t(448) = .31, p = .760
Q-proportion | MDP4 | t(448) = 2.19, p = .029** | t(448) = 3.69, p < .001** | t(448) = 2.32, p = .021**
Hard-code | MDP1 | t(448) = −1.78, p = .076 | t(448) = 1.19, p = .233 | t(448) = .96, p = .337
Hard-code | MDP2 | t(448) = .44, p = .658 | t(448) = .71, p = .479 | t(448) = .84, p = .404
Hard-code | MDP3 | t(448) = 1.42, p = .156 | t(448) = 2.34, p = .020** | t(448) = 2.06, p = .040**
Hard-code | MDP4 | t(448) = 2.19, p = .029** | t(448) = 2.83, p = .005** | t(448) = 3.73, p < .001**

Electric-blue cells denote that the theoretical t-test results align with the empirical t-test results (column 3); grey cells denote misaligned t-test results.

From this table, we can see that only PDIS with the soft-max transformation and the original reward results in all four t-tests aligning with the corresponding empirical results. IS and WIS are more likely to predict a significant difference between High vs. Low. Meanwhile, the Q-proportion transformation tends to cause the metrics to predict more significant differences, while Hard-code tends to predict fewer.

In summary, when comparing groups within each RL-induced policy, our results showed that for the MDP4 policy the students in the High carry-out group significantly outperformed the students in the Low group, while no significant difference was found for the other three policies. This suggests that the MDP4 policy is an effective policy in that the more it is carried out, the better it performs; however, the partial carry-out situation reduced the power of the MDP4 policy, so that it did not significantly outperform the baseline Random policy. When comparing the empirical evaluation results with the theoretical evaluation results, PDIS is the best of the three IS-based metrics, reaching 100% agreement. Our results suggest that proper deployment settings have an impact on the performance of IS-based metrics: when transforming the deterministic policy into a stochastic policy, soft-max is the best, Q-proportion is the worst, and Hard-code is stable. The comparison between the original reward and the normalized reward indicates that the original reward better reflects the empirical results despite having larger variance.
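For concreteness, the alignment test of Section 7.1 that underlies Tables 2 and 3 can be sketched as follows; this is our own illustration, where the group arrays are assumed to hold per-student post-test scores and per-trajectory OPE estimates.

```python
import numpy as np
from scipy import stats

def aligned(post_high, post_low, theory_high, theory_low, alpha=0.05):
    """Check whether the empirical and theoretical comparisons align.

    post_*: post-test scores of the High/Low carry-out groups;
    theory_*: per-trajectory OPE scores (e.g., PDIS) of the same groups.
    """
    t_emp, p_emp = stats.ttest_ind(post_high, post_low)
    t_theo, p_theo = stats.ttest_ind(theory_high, theory_low)
    if p_emp >= alpha and p_theo >= alpha:      # criterion 1: both non-significant
        return True
    if p_emp < alpha and p_theo < alpha:        # criterion 2: both significant,
        return np.sign(t_emp) == np.sign(t_theo)  # same direction
    return False
```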
8.2 Critical Decision Identification
Recall that each policy defines its own critical decisions: decisions whose Q-values differ more between the possible actions are considered critical, and for each policy we split the students according to whether they received More or Less decisions aligned with the critical decisions. Table 4 shows the t-test results comparing the post-test scores between the More vs. Less critical decisions groups. The first column indicates the MDP policy considered when identifying the critical decisions. The second and third columns show the average post-test scores of students in the More and Less groups, shown as mean (sd). The fourth column shows the t-test results when comparing the post-test scores of the More and Less critical decisions groups. The MDP1 row shows a significant difference between the More vs. Less critical decisions groups for the MDP1 policy, with t(448) = −1.97, p = .049; however, the students in the More group performed worse than those in the Less critical decisions group. The MDP2 and MDP3 rows show that there is no significant difference between the two critical decisions groups in terms of post-test scores for the MDP2 or MDP3 policies. Finally, the MDP4 policy shows a significant difference between the two groups, t(448) = 2.10, p = .036, which means that students with More critical decisions performed significantly better than students with Less.

Table 4: Critical Decision Evaluation Results

Policy | More | Less | T-test Result
MDP1 | 78.49 (23.08) | 83.01 (25.44) | t(448) = −1.97, p = .049
MDP2 | 83.22 (23.53) | 78.86 (24.79) | t(448) = 1.91, p = .057
MDP3 | 79.45 (24.59) | 82.08 (24.00) | t(448) = −1.14, p = .257
MDP4 | 83.54 (22.89) | 78.74 (25.21) | t(448) = 2.10, p = .036**

Bold and ** denote significance at p < 0.05.

For the MDP4 policy, the identified critical decisions comprised 25% of all decisions. This shows that, although critical decisions are a small proportion of all decisions, they can significantly impact the outcome. The results also show that the Q-values in the MDP4 policy can be used to identify critical decisions aligned with the empirical results, whereas those of the other three policies cannot. Based on the results in Table 1, MDP4 was identified as the only effective policy, since its empirical post-test results aligned, with students in the High carry-out group performing significantly better than those in the Low carry-out group. The critical decisions results suggest that MDP4 is also the only policy for which larger differences in Q-values had larger impacts on post-test results. Taken together, these results suggest that only the Q-values of effective policies identify decisions that impact actual post-test performance. This further inspires us to investigate whether we could verify the effectiveness of a policy in reverse: given a policy, if the decisions with larger Q-value differences are significantly linked to student performance, then this policy may be more likely to be effective.

9. CONCLUSION AND FUTURE WORK
In this work, we explored three IS-based OPE metrics with two deployment settings in a real-world application. By comparing the effectiveness of four RL-induced policies empirically and theoretically, our results showed that PDIS is the best metric for interactive e-learning systems and that appropriate deployment settings (i.e., the policy transformation and reward setting) are required to achieve reliable, robust evaluations. We also proposed a method to identify critical decisions by the Q-value differences in a policy. In order to verify our method, we investigated the relationship between the number of identified critical decisions and student post-test scores. The results revealed that the identified critical decisions are significantly linked to student learning and, further, that critical decisions can be identified by an effective policy but not by ineffective policies. In the future, we will apply the PDIS metric with the soft-max transformation and original rewards to help us induce better RL policies that further improve students' learning in the ITS. Also, when inducing policies, we will consider constraints to avoid the partial carry-out situation that limits the impact a policy can have on outcomes. Furthermore, with the identification of critical decisions, we can reduce the size of the resulting policies still further by focusing only on the most important decisions rather than on trivial ones.

10. REFERENCES
[1] J. Beck, B. P. Woolf, and C. R. Beal. ADVISOR: A machine learning architecture for intelligent tutor construction. In AAAI/IAAI, pages 552–557, 2000.
[2] J. D. Bransford and D. L. Schwartz. Rethinking transfer: A simple proposal with multiple implications. Review of Research in Education, pages 61–100, 1999.
[3] M. Chi, K. VanLehn, D. Litman, and P. Jordan. Empirically evaluating the application of reinforcement learning to the induction of effective and adaptive pedagogical strategies. User Modeling and User-Adapted Interaction, 21(1-2):137–180, 2011.
[4] B. Clement et al. A comparison of automatic teaching strategies for heterogeneous student populations. In Proceedings of the 9th International Conference on Educational Data Mining (EDM), 2016.
[5] S. Doroudi, K. Holstein, V. Aleven, and E. Brunskill. Towards understanding how to leverage sense-making, induction and refinement, and fluency to improve robust learning. JEDM, 2015.
[6] M. Dudik, J. Langford, and L. Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, 2011.
[7] J. Hammersley and D. Handscomb. General principles of the Monte Carlo method. Springer, 1964.
[8] A. Iglesias, P. Martínez, R. Aler, and F. Fernández. Learning teaching strategies in an adaptive and intelligent educational system through reinforcement learning. Applied Intelligence, 31(1):89–106, 2009.
[9] A. Iglesias, P. Martínez, R. Aler, and F. Fernández. Reinforcement learning of pedagogical policies in adaptive and intelligent educational systems. Knowledge-Based Systems, 22(4):266–270, 2009.
[10] A. Iglesias, P. Martínez, R. Aler, and F. Fernández. Reinforcement learning of pedagogical policies in adaptive and intelligent educational systems. Knowledge-Based Systems, 22(4):266–270, 2009.
[11] J. R. Tetreault, D. Bohus, and D. J. Litman. Estimating the reliability of MDP policies: A confidence interval approach. In Proceedings of NAACL-HLT, pages 276–283. Association for Computational Linguistics, 2007.
[12] K. R. Koedinger, J. R. Anderson, W. H. Hadley, and M. A. Mark. Intelligent tutoring goes to school in the big city. International Journal of Artificial Intelligence in Education, 8:30–43, 1997.
[13] K. R. Koedinger, E. Brunskill, R. S. Baker, E. A. McLaughlin, and J. Stamper. New potentials for data-driven intelligent tutoring system development and optimization. AI Magazine, 34(3):27–41, 2013.
[14] E. Levin, R. Pieraccini, and W. Eckert. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing, 8(1):11–23, 2000.
[15] T. Mandel, Y.-E. Liu, S. Levine, E. Brunskill, and Z. Popovic. Offline policy evaluation across representations with applications to educational games. In AAMAS, pages 1077–1084, 2014.
[16] B. Mostafavi, Z. Liu, and T. Barnes. Data-driven proficiency profiling. In Proceedings of the 8th International Conference on Educational Data Mining, 2015.
[17] M. Chi, K. VanLehn, D. J. Litman, and P. W. Jordan. Empirically evaluating the application of reinforcement learning to the induction of effective and adaptive pedagogical strategies. User Modeling and User-Adapted Interaction, 21(1-2):137–180, 2011.
[18] K. Narasimhan, T. Kulkarni, and R. Barzilay. Language understanding for text-based games using deep reinforcement learning. arXiv preprint arXiv:1506.08941, 2015.
[19] P. S. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.
[20] D. Precup, R. S. Sutton, and S. P. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pages 759–766, 2000.
[21] A. N. Rafferty, E. Brunskill, T. L. Griffiths, and P. Shafto. Faster teaching via POMDP planning. Cognitive Science, 40(6):1290–1332, 2016.
[22] J. P. Rowe and J. C. Lester. Optimizing player experience in interactive narrative planning: A modular reinforcement learning approach. In Proceedings of the Tenth International Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE), pages 160–166, 2014.
[23] J. P. Rowe and J. C. Lester. Improving student problem solving in narrative-centered learning environments: A modular reinforcement learning framework. In AIED, pages 419–428. Springer, 2015.
[24] N. Roy, J. Pineau, and S. Thrun. Spoken dialogue management using probabilistic reasoning. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 93–100. Association for Computational Linguistics, 2000.
[25] S. Shen and M. Chi. Aim low: Correlation-based feature selection for model-based reinforcement learning. In EDM, pages 507–512, 2016.
[26] S. Shen and M. Chi. Reinforcement learning: the sooner the better, or the later the better? In Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization, pages 37–44. ACM, 2016.
[27] S. Singh, D. Litman, M. Kearns, and M. Walker. Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system. Journal of Artificial Intelligence Research, 16:105–133, 2002.
[28] J. C. Stamper, M. Eagle, T. Barnes, and M. Croy. Experimental evaluation of automatic hint generation for a logic tutor. In International Conference on Artificial Intelligence in Education, pages 345–352. Springer, 2011.
[29] P. Thomas. Safe reinforcement learning. PhD thesis, 2015.
[30] K. VanLehn. The behavior of tutoring systems. International Journal of Artificial Intelligence in Education, 16(3):227–265, 2006.
[31] P. Wang, J. Rowe, W. Min, B. Mott, and J. Lester. Interactive narrative personalization with deep reinforcement learning. In IJCAI, 2017.
[32] J. D. Williams and S. Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393–422, 2007.
[33] B. Zhang, Q. Cai, J. Mao, E. Chang, and B. Guo. Spoken dialogue management as planning and acting under uncertainty. In INTERSPEECH, pages 2169–2172, 2001.
[34] G. Zhou et al. Towards closing the loop: Bridging machine-induced pedagogical policies to learning theories. In EDM, 2017.