Importance Sampling to Identify Empirically Valid Policies and their Critical Decisions

Song Ju, Shitian Shen, Hamoon Azizsoltani, Tiffany Barnes, Min Chi
Department of Computer Science, North Carolina State University, Raleigh, NC 27695
{sju2, sshen, hazizso, tmbarnes, mchi}@ncsu.edu

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
In this work, we investigated off-policy policy evaluation (OPE) metrics to evaluate Reinforcement Learning (RL) induced policies and to identify critical decisions in the context of Intelligent Tutoring Systems (ITSs). We explore the use of three common Importance Sampling based OPE metrics in two deployment settings to evaluate four RL-induced policies for a logic ITS. The two deployment settings explore the impact of using original or normalized rewards and the impact of transforming deterministic policies into stochastic ones. Our results show that Per-Decision Importance Sampling (PDIS), using the soft-max transformation and original rewards, is the best metric and the only one that reached 100% alignment between the theoretical and the empirical classroom evaluation results. Furthermore, we used PDIS to identify what we call critical decisions in RL-induced policies, i.e., decisions for which the policies identify large differences between the alternatives. We found that the students who received more critical decisions significantly outperformed those who received fewer; more importantly, this result holds only for the policy that PDIS identified as effective, not for the ineffective ones.

Keywords
Reinforcement Learning, Off-policy Policy Evaluation, Importance Sampling

1. INTRODUCTION
Intelligent Tutoring Systems (ITSs) are a type of highly interactive e-learning environment that facilitates learning by providing step-by-step support and contextualized feedback to individual students [12, 30]. These step-by-step behaviors can be viewed as a sequential decision process in which, at each step, the system chooses an action (e.g., give a hint, show an example) from a set of options; pedagogical strategies are the policies used to decide what action to take next in the face of alternatives. Reinforcement Learning (RL) offers one of the most promising approaches to data-driven decision-making applications: RL algorithms are designed to induce effective policies that determine the best action for an agent to take in any given situation so as to maximize some predefined cumulative reward. A number of researchers have studied the application of existing RL algorithms to improve the effectiveness of ITSs [3, 26, 21, 4, 28, 8, 9, 34]. While promising, such RL work faces at least the two major challenges discussed below.

One challenge is a lack of reliable yet robust metrics for RL policy evaluation. Generally speaking, there are two major categories of RL: online and offline. In the former, the agent learns while interacting with the environment; in the latter, the agent learns the policy from pre-collected data. Online RL algorithms are generally appropriate for domains where interacting with simulations or actual environments is computationally cheap and feasible. On the other hand, for domains such as e-learning, building accurate simulations or simulated students is especially challenging because human learning is a complex, poorly understood process. Moreover, learning policies while interacting with students may not be feasible and, more importantly, may not be ethical. Therefore, to improve student learning, much prior work applied offline RL approaches to induce effective pedagogical strategies. This is done by first collecting a training corpus, and the success of offline RL is often heavily dependent on the quality of that corpus. One common convention is to collect an exploratory corpus by training a group of students on an ITS that makes random yet reasonable decisions and then to apply RL to induce pedagogical policies from that corpus. An empirical study is then conducted with a new group of human subjects interacting with different versions of the system. The only difference among the system versions is the policy employed by the ITS, and the students' performance is then statistically compared. Due to cost limitations, typically only the best RL-induced policy is deployed and compared against some baseline policies. On the other hand, we often have a large number of RL algorithms (and associated hyperparameter settings), and it is unclear which will work best in our setting. In these high-stakes situations, one needs confidence in an RL-induced policy before risking deployment. Therefore, we need reliable yet robust evaluation metrics that can evaluate RL-induced policies without collecting new data, before the policies are tested in the real world. This type of evaluation is called off-policy evaluation (OPE) because the policy used to collect the training data, referred to as the behavior policy, is different from the RL-induced policy to be evaluated, referred to as the target policy. To find reliable yet robust OPE metrics, we explored three Importance Sampling based off-policy evaluation metrics.

The second RL challenge is a lack of interpretability of the RL-induced policies. Compared with the amount of research done on applying RL to induce policies, relatively little work has been done to analyze, interpret, or explain RL-induced policies. While traditional hypothesis-driven, cause-and-effect approaches offer clear conceptual and causal insights that can be evaluated and interpreted, RL-induced policies are often large, cumbersome, and difficult to understand. The space of possible policies is exponential in the number of domain features, and it is therefore difficult to draw general conclusions from them to advance our understanding of the domain. This raises a major open question: how can we identify the critical system interactive decisions that are linked to student learning? In this work, we tried to identify key decisions by taking advantage of the reliable OPE metrics we discovered and the properties of the policies we induced.
2. MOTIVATION
Just as assessment sits at the epicenter of educational research [2], policy evaluation is the central concern among the many stakeholders in applying offline RL to ITSs. As educational assessment should reflect and reinforce the educational goals that society deems valuable, our policy evaluation metrics should reflect the effectiveness of the induced policies. While various RL approaches such as policy iteration and policy search have shown great promise, existing RL approaches tend to perform poorly when they are actually implemented and evaluated in the real world.

In a series of prior studies on a logic ITS, RL and Markov Decision Processes (MDPs) were applied to induce four different pedagogical policies, named MDP1-MDP4 respectively, for one type of tutorial decision: whether to provide students with a Worked Example (WE) or to ask them to engage in Problem Solving (PS). In WEs, the tutor presents an expert solution to a problem step by step, while in PS, students are required to complete the problem with the tutor's support. When inducing each of the four policies, we explored different feature selection methods and used Expected Cumulative Reward (ECR) to evaluate the RL-induced policies. The ECR of a policy is calculated by averaging over the value function of the initial states; generally speaking, the higher the ECR value of a policy, the better the policy is supposed to perform.

[Figure 1: Post-test Score vs. ECR, showing no seeming direct relationship.]

Figure 1 shows the ECRs (blue dashed line) of the four RL-induced policies, MDP1-MDP4 (x-axis), and the empirical student learning performance (solid red line) under the corresponding policies. Here learning performance is measured by an in-class post-test taken after students were trained on the tutor with the corresponding policy (the mean and standard errors of post-test scores are shown with the red solid line). Figure 1 shows that our theoretical evaluation (ECR) does not match the empirical (post-test) evaluation: there is no clear relationship between the ECRs (blue line) and the corresponding post-test scores (red line) across the four policies. This result shows that ECR is not a reliable OPE metric for evaluating RL-induced policies in ITSs. Indeed, Mandel et al. [15] pointed out that ECR tends to be biased and statistically inconsistent, and thus may not be the appropriate OPE metric in high-stakes domains. In recent years, many state-of-the-art OPE metrics have been proposed, and many of them are based on Importance Sampling.

Importance Sampling (IS) is a classic OPE method for evaluating a target policy on existing data obtained from an alternate behavior policy, and thus it can be handily applied to the task of evaluating the effectiveness of an offline RL-induced policy using pre-existing historical training datasets. Many IS-based OPE metrics have been proposed and explored, and they have been shown to perform well in simulation environments such as Grid World or bandit settings [29, 6]. Among them, three IS-based OPE metrics, the original IS, Weighted IS (WIS), and Per-Decision IS (PDIS), are the most widely used. However, real-world human-agent interactive applications such as ITSs are much more complicated due to 1) individual differences, noise, and randomness during the interaction processes, 2) the large state space that can impact student learning, and 3) long trajectories due to the nature of the learning process.

In this work, we investigated the three IS-based offline OPE metrics on MDP1-MDP4 to determine whether they are indeed effective OPE metrics for evaluating the four RL-induced policies mentioned above. We believe an OPE metric is effective if and only if the theoretical results from the OPE evaluations are completely aligned with the empirical results from the classroom studies. Therefore, we explored different deployment settings for the IS-based metrics along two dimensions: one is the transformation function used to convert the RL-induced deterministic policy into the stochastic policy required by IS-based metrics, and the other is the reward function, i.e., the original reward function vs. the normalized reward function; the latter is supposed to reduce variance. Our results showed that the theoretical and empirical evaluation results are only more or less aligned across the different deployment settings and IS-based metrics; only when using the soft-max transformation function and the original reward function do the theoretical results of PDIS reach 100% agreement with the empirical results.

Based on the results from the OPE metrics, we further explored using the properties of the RL-induced policies to identify critical decisions, and our results showed that critical decisions can be identified using the theoretically "effective" policies identified by PDIS with the soft-max transformation and the original reward function. In summary, we make the following contributions:

• We directly compared three IS-based policy evaluation metrics against the empirical results from real classroom studies across four different RL-induced policies. Our results showed that PDIS is the best and that its results can align with the empirical results.

• As far as we know, this is the first study to compare different deployment settings (original/normalized rewards, deterministic/stochastic policy transformation) for IS-based policy evaluation metrics. Our results showed that the settings have a direct impact on the effectiveness of the evaluation metrics; only PDIS with the soft-max transformation and the original reward function agreed 100% with the empirical results.

• We investigated using information from the RL-induced policies to identify critical decisions in order to shed some light on the induced policies. As far as we know, this is the first attempt to differentiate critical decisions from trivial ones.
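Since ECR serves as the theoretical baseline throughout this comparison, we note that it can be estimated simply as the induced value function averaged over the initial states of the training trajectories. The following minimal sketch is our own illustration (not the authors' code), assuming the initial state of each trajectory is known.

```python
import numpy as np

def expected_cumulative_reward(V, initial_states):
    """ECR: the induced value function V(s) averaged over the initial
    states observed in the training corpus (higher is assumed better)."""
    return float(np.mean([V[s] for s in initial_states]))
```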
3. RELATED WORK
3.1 Empirical Studies Applying RL to ITSs
In recent years, a number of researchers have applied RL to induce effective pedagogical policies for ITSs [3, 5, 13, 23]. Some previous work treated the user-system interactions as fully observable processes by applying Markov Decision Processes (MDPs) [14, 27, 1], while others utilized partially observable MDPs (POMDPs) [24, 32, 33, 15, 4] and, more recently, deep RL frameworks [31, 18]. Most of the previous work, including this work, took the offline RL approach in that it followed three general steps: 1) collect an exploratory corpus by training a relatively large group of real and/or simulated students on an ITS that makes random yet reasonable and rational tutorial decisions; 2) apply RL to induce a pedagogical policy directly from the historical exploratory corpus; and 3) implement the RL-induced pedagogical policy back into the ITS and evaluate its effectiveness using simulations and/or real classroom settings to investigate whether RL fulfills its promise. Oftentimes, when the RL-induced policies are implemented and evaluated in the real world, they perform poorly compared to their theoretical performance.

Iglesias et al. [10] applied Q-learning to generate a policy in an ITS that teaches students database design. Their goal was to provide students with direct navigational support through the system's content. In the training phase, they used simulated students to induce an RL policy and evaluated the induced policy online based upon three self-defined measures. In the test phase, they evaluated the induced policy with real students. The results showed that while students using the induced policies had more effective usage behaviors than their non-policy peers, there was no significant difference in student learning performance. Chi et al. [17] applied an offline RL approach to induce pedagogical policies for improving the effectiveness of an ITS that teaches students college physics. They used ECR to evaluate the induced policy in the training phase; however, when they applied the induced policy to real students, it did not outperform the random policy in the empirical evaluation. Similarly, Shen et al. [26] explored immediate and delayed rewards based on learning gain and implemented an offline MDP framework on a rule-based ITS for deductive logic. They selected the policy with the highest ECR and deployed it in the ITS with real students. The empirical results showed that the RL policies were no more effective than the random baseline policy. In addition, Rowe et al. [22] investigated an MDP framework for tutorial planning in a game-based learning system. They used ECR to theoretically evaluate the induced policies, but in an empirical study with real students, they found that students in the induced-planner condition had significantly different behavior patterns from the control group, while no significant difference was found between the two groups on the post-test [23].

In short, prior work on applying offline RL in ITSs primarily explored ECR as the OPE metric, and one common phenomenon is that the theoretical evaluation results do not align with the empirical results from real students.

3.2 OPE Metrics
OPE is used to evaluate the performance of a target policy given historical data generated by an alternative behavior policy. A good OPE metric is especially important for real-world applications where the deployment of a bad or inefficient policy can be costly [19]. ECR is one of the most widely used OPE metrics and is designed especially for the MDP framework. Tetreault et al. [11] estimated the reliability of ECRs by repeated sampling to estimate confidence intervals for ECRs. In simulation studies, they showed that the policy induced using the confidence interval of ECR performed more reliably than the baseline policies, but this phenomenon did not hold when the RL-induced policies were evaluated empirically in ITSs [17].

Importance Sampling (IS) [7] is a widely used OPE metric which considers the mathematical characteristics of the decision-making process and can be applied to any MDP, POMDP, or deep RL framework. Precup [20] proposed four IS-based OPE metrics: IS, weighted importance sampling (WIS), per-decision importance sampling (PDIS), and weighted per-decision importance sampling (WPDIS). They used the IS-based estimators for policy evaluation in Q-learning and then compared the effectiveness of the estimators on a series of 100 randomly constructed MDPs based on the mean squared error (MSE). Their results showed that IS made the Q-learning process converge slowly and caused high variance; WIS performed better than IS, but PDIS performed inconsistently and WPDIS performed the worst. Similarly, Thomas [29] compared the performance of several IS estimators using mean squared error in a grid-world simulation, showing that PDIS outperformed all the others.

In summary, previous work has explored the effectiveness of IS and its variants in simulation studies, which motivated the work reported here. Different from previous work, we mainly focus on comparing the theoretical evaluation with the empirical evaluation in order to determine whether IS-based methods are indeed reliable and robust for ITSs.
4. MARKOV DECISION PROCESS & RL
Some of the prior work on applying RL to induce pedagogical policies used the Markov Decision Process (MDP) framework. An MDP can be seen as a 4-tuple $\langle S, A, T, R \rangle$, where $S$ denotes the observable state space, defined by a set of features that represent the interactive learning environment, and $A$ denotes the space of possible actions for the agent to execute. The reward function $R$ represents the immediate or delayed feedback from the environment with respect to the agent's action(s); $r(s,a,s')$ denotes the expected reward of transitioning from state $s$ to state $s'$ by taking action $a$. Once $\langle S, A, R \rangle$ is defined, $T$ represents the transition probabilities, where $P(s,a,s') = \Pr(s'|s,a)$ is the probability of transitioning from state $s$ to state $s'$ by taking action $a$; it can be easily estimated from the training corpus. The optimal policy $\pi$ for an MDP can be generated via dynamic programming approaches such as Value Iteration. This algorithm operates by finding the optimal value of each state, $V^*(s)$, which is the expected discounted reward the agent will gain if it starts in $s$ and follows the optimal policy to the goal. Generally speaking, $V^*(s)$ can be obtained from the optimal value function for each state-action pair, $Q^*(s,a)$, defined as the expected discounted reward the agent will gain if it takes action $a$ in state $s$ and follows the optimal policy to the end. The optimal value function $Q^*(s,a)$ can be obtained by iteratively updating $Q(s,a)$ via Equation 1 until convergence:

$$Q(s,a) := \sum_{s'} P(s,a,s')\left[\, r(s,a,s') + \gamma \max_{a'} Q(s',a') \,\right] \qquad (1)$$

where $0 \le \gamma \le 1$ is a discount factor. When the process converges, the optimal policy $\pi^*$ corresponding to the optimal Q-value function $Q^*(s,a)$ can be induced as:

$$\pi^*(s) = \arg\max_{a} Q^*(s,a) \qquad (2)$$

where $\pi^*$ is the deterministic policy that maps a given state to an action. In the context of an ITS, this induced policy represents the pedagogical strategy by specifying the tutorial action for the current state.
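To make Equations 1 and 2 concrete, the following is a minimal sketch of tabular value iteration with greedy policy extraction. It is an illustration rather than the implementation used in this work, and it assumes that the transition probabilities and expected rewards have already been estimated from the training corpus and that every successor state also appears in the state list.

```python
def value_iteration(states, actions, P, r, gamma=0.9, tol=1e-6):
    """Tabular Q-value iteration (Equation 1) and greedy policy extraction
    (Equation 2).

    P[(s, a)] is a dict {s_next: probability} estimated from the corpus;
    r[(s, a, s_next)] is the expected immediate reward.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:
                new_q = sum(
                    p * (r[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for s2, p in P[(s, a)].items()
                )
                delta = max(delta, abs(new_q - Q[(s, a)]))
                Q[(s, a)] = new_q
        if delta < tol:          # stop once the Bellman backup has converged
            break
    # Equation 2: deterministic greedy policy
    policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
    return Q, policy
```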
5. THREE OPE METRICS & TWO SETTINGS
The following terms will be used throughout this paper.

• $H = s_1 \xrightarrow{a_1, r_1} s_2 \xrightarrow{a_2, r_2} s_3 \xrightarrow{a_3, r_3} \cdots s_L$ denotes one student-system interaction trajectory, and $H_L$ denotes a trajectory of length $L$.

• $G(H_L) = \sum_{t=1}^{L} \gamma^{t-1} r_t$ is the discounted return of the trajectory $H_L$, which generally reflects how good the trajectory is supposed to be.

• $D = \{H_1, H_2, H_3, \dots, H_n\}$ denotes the historical dataset containing $n$ student-system interaction trajectories.

• $\pi_b$ denotes the behavior policy carried out for collecting the historical data $D$.

• $\pi_e$ denotes the target policy to be evaluated.

• $\rho(\pi_e)$ represents the estimated performance of $\pi_e$.

5.1 Three IS-based OPE Metrics
Importance Sampling (IS) is an approximation method that allows the estimation of the expectation of a distribution $p$ from samples generated from a different distribution $q$. Suppose that we have a sample space $x$ and a random variable $f(x)$, a measurable function from that sample space to another measurable space. We want to estimate the expectation of $f(x)$ over a strictly positive probability density function $p(x)$. Suppose also that we cannot directly sample from the distribution $p(x)$, but we can draw Independent and Identically Distributed (IID) samples from the probability density function $q(x)$ and evaluate $f(x)$ for these samples. The expectation of $f(x)$ over the probability density function $p(x)$ can be calculated as:

$$E_p[f(x)] = \int f(x)\,p(x)\,dx \qquad (3)$$
$$= \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx \qquad (4)$$
$$= E_q\!\left[f(x)\,\frac{p(x)}{q(x)}\right] \qquad (5)$$

where $p$ is known as the target distribution, $q$ is the sampling distribution, $E_p[f(x)]$ is the expectation of $f(x)$ under $p$, $p(x)/q(x)$ is the likelihood ratio weight, and $E_q[f(x)\,p(x)/q(x)]$ is the expectation of $f(x)\,p(x)/q(x)$ under $q$. We can then approximate the expectation of $f(x)$ over the probability density function $p(x)$ using samples drawn from the probability density function $q(x)$. In the context of OPE, the target distribution $p$ describes an event whose density is determined by the target policy, and the sampling distribution $q$ describes an event whose density is determined by the behavior policy.

Following the general IS technique, we approximated the expected reward of the target policy using the relative probability of the target and behavior policies. Because of the nature of the underlying MDP, samples in RL are sequential; therefore, we assumed that each trajectory is a sequence of events whose probability density function is determined by its corresponding policy. Assuming independence among trajectories and following the multiplication rule of independent events, the probability of occurrence of a trajectory $H_L$ under policy $\pi$ is

$$P_\pi(H_L) = P_1(s_1)\prod_{t=1}^{L}\pi(a_t|s_t)\,P_T(s_{t+1}|s_t,a_t) \qquad (6)$$

where $P_\pi(H_L)$ is the probability of occurrence of the trajectory $H_L$ under policy $\pi$, $P_1(s_1)$ is the probability of the state $s_1$ at the beginning of the trajectory, and $P_T$ is the state transition probability function.

Similar to the original IS technique, the relative probability of each trajectory occurring under the target policy $\pi_e$ versus the behavior policy $\pi_b$ is used as the likelihood ratio weight for that trajectory. Thus, following the importance sampling technique, the importance sampling discounted return is defined as:

$$IS(\pi_e|H_L,\pi_b) = \frac{P_{\pi_e}(H_L)}{P_{\pi_b}(H_L)}\cdot G(H_L) \qquad (7)$$
$$= \frac{\prod_{t=1}^{L}\pi_e(a_t|s_t)}{\prod_{t=1}^{L}\pi_b(a_t|s_t)}\cdot G(H_L) \qquad (8)$$

since the initial-state and transition probabilities are shared by both policies and cancel. Substituting $G(H_L)$ with the discounted return, we have:

$$IS(\pi_e|H_L,\pi_b) = \left(\prod_{t=1}^{L}\frac{\pi_e(a_t|s_t)}{\pi_b(a_t|s_t)}\right)\left(\sum_{t=1}^{L}\gamma^{t-1}r_t\right) \qquad (9)$$

After obtaining the individual IS estimator for each trajectory, we can calculate the expected reward of the dataset $D$ by averaging the individual IS estimators over the trajectories:

$$IS(\pi_e|D) = \frac{1}{n_D}\sum_{i=1}^{n_D}\left(\prod_{t=1}^{L_i}\frac{\pi_e(a_t^i|s_t^i)}{\pi_b(a_t^i|s_t^i)}\right)\left(\sum_{t=1}^{L_i}\gamma^{t-1}r_t^i\right) \qquad (10)$$

where $s_t^i$, $a_t^i$, and $r_t^i$ refer to the $i$-th trajectory at time $t$ and $n_D$ is the number of trajectories in $D$.

Weighted Importance Sampling (WIS) is a variant of the IS estimator; it is biased but consistent and has lower variance than IS. It normalizes the IS estimate in order to produce a lower variance. First, it calculates the weight $W_D$ for the dataset as the sum of the likelihood ratios of the trajectories, as shown in Equation 11. Then, it normalizes the IS estimator as shown in Equation 12. Finally, the WIS of the dataset is simply the weighted average of the estimated rewards of the trajectories in $D$, as shown in Equation 13.

$$W_D = \sum_{i=1}^{n_D}\prod_{t=1}^{L_i}\frac{\pi_e(a_t^i|s_t^i)}{\pi_b(a_t^i|s_t^i)} \qquad (11)$$

$$WIS(\pi_e|H_L,\pi_b) = \frac{IS(\pi_e|H_L,\pi_b)}{W_D} \qquad (12)$$

$$WIS(\pi_e|D) = \frac{\sum_{i=1}^{n_D}\left(\prod_{t=1}^{L_i}\frac{\pi_e(a_t^i|s_t^i)}{\pi_b(a_t^i|s_t^i)}\right)\left(\sum_{t=1}^{L_i}\gamma^{t-1}r_t^i\right)}{\sum_{i=1}^{n_D}\prod_{t=1}^{L_i}\frac{\pi_e(a_t^i|s_t^i)}{\pi_b(a_t^i|s_t^i)}} \qquad (13)$$

Per-Decision Importance Sampling (PDIS) is also a variant of IS; like IS, it is unbiased and consistent. IS has very high variance because for each reward $r_t$ it uses the likelihood ratio of the entire trajectory; however, the reward at step $t$ should only depend on the previous steps. The variance can therefore be reduced by weighting the reward $r_t$ with the likelihood ratio of the trajectory only up to step $t$, i.e., the importance weight for a reward at step $t$ is $\prod_{j=1}^{t}\frac{\pi_e(a_j|s_j)}{\pi_b(a_j|s_j)}$. The individual PDIS estimator for a trajectory is given in Equation 14, and the PDIS for the whole historical dataset $D$ is given in Equation 15.

$$PDIS(\pi_e|H_L,\pi_b) = \sum_{t=1}^{L}\gamma^{t-1}\left(\prod_{j=1}^{t}\frac{\pi_e(a_j|s_j)}{\pi_b(a_j|s_j)}\right)r_t \qquad (14)$$

$$PDIS(\pi_e|D) = \frac{1}{n_D}\sum_{i=1}^{n_D}\sum_{t=1}^{L_i}\gamma^{t-1}\left(\prod_{j=1}^{t}\frac{\pi_e(a_j^i|s_j^i)}{\pi_b(a_j^i|s_j^i)}\right)r_t^i \qquad (15)$$
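The three estimators above can be computed directly from logged trajectories. The sketch below is our own illustration (not the authors' code); it assumes each trajectory is a list of (state, action, reward) tuples and that the behavior policy assigns non-zero probability to every logged action.

```python
import numpy as np

def is_wis_pdis(trajectories, pi_e, pi_b, gamma=0.9):
    """Compute IS (Eq. 10), WIS (Eq. 13), and PDIS (Eq. 15) for a dataset.

    trajectories: list of trajectories, each a list of (s, a, r) tuples.
    pi_e(a, s), pi_b(a, s): action probabilities under the target and
    behavior policies.
    """
    per_traj_is, weights, per_traj_pdis = [], [], []
    for traj in trajectories:
        rho = 1.0      # cumulative likelihood ratio up to step t
        ret = 0.0      # discounted return G(H_L)
        pdis = 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(a, s) / pi_b(a, s)
            ret += (gamma ** t) * r
            pdis += (gamma ** t) * rho * r   # per-decision weighting (Eq. 14)
        weights.append(rho)                  # full-trajectory ratio
        per_traj_is.append(rho * ret)        # Eq. 9
        per_traj_pdis.append(pdis)
    is_hat = np.mean(per_traj_is)                      # Eq. 10
    wis_hat = np.sum(per_traj_is) / np.sum(weights)    # Eq. 13
    pdis_hat = np.mean(per_traj_pdis)                  # Eq. 15
    return is_hat, wis_hat, pdis_hat
```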
5.2 Two Different Settings
To evaluate the three IS-based metrics, we explored different deployment settings along two dimensions: one is the transformation function used to convert the RL-induced deterministic policy into the stochastic policy required by the IS-based metrics, and the other is the reward function: the original reward function vs. the normalized reward function.

5.2.1 Three Transformation Functions:
As described above, IS-based metrics require the policy to be stochastic, but our MDP-induced policies are deterministic, i.e., given the current state $s$, the agent should take the deterministic optimal action $a^*$ following the optimal policy $\pi^*$. To transform the deterministic policy into a stochastic policy, we explore three types of transformation functions: Hard-code, Q-proportion, and Soft-max. The basic idea behind them is that, for any given state $s$, the stochastic probability assigned to an action $a$ should reflect its value $Q(s,a)$.

1. Hard-code Transformation

$$\pi(a|s) = \begin{cases} 1-\varepsilon & \text{optimal action} \\ \varepsilon & \text{otherwise} \end{cases} \qquad (16)$$

In Equation 16, $\varepsilon$ is a fixed small probability, e.g., $\varepsilon = 0.001$, assigned to the actions with smaller Q-values, while $1-\varepsilon$ is assigned to the action with the largest Q-value.

2. Q-proportion Transformation

$$\pi(a|s) = \frac{Q(s,a)}{\sum_{a'\in A} Q(s,a')} \qquad (17)$$

As shown in Equation 17, the probability of taking action $a$ in state $s$ is the proportion of $a$'s Q-value among all possible actions $a'$ in state $s$. Thus, the action with the highest Q-value is guaranteed to have the highest probability. In practice, because some Q-values are smaller than 0, we add a constant to the Q-values within the same state so that $Q(s,a) \ge 1$.

3. Soft-max Transformation

$$\pi(a|s) = \frac{e^{\theta\cdot Q(s,a)}}{\sum_{a'\in A} e^{\theta\cdot Q(s,a')}} \qquad (18)$$

Soft-max is a classical function used to calculate a probability distribution over $n$ possible events. In our application, given state $s$, the soft-max function calculates the probability of action $a$ over all possible actions using Equation 18. The main advantages of soft-max are that the output probabilities lie between 0 and 1 and sum to 1, and that it can handle negative Q-values. Here $\theta$ is a weight parameter on the Q-values.

5.2.2 Two Types of Reward Functions:
The effectiveness of RL-induced policies is very sensitive to the reward function. In our application, the range of the reward function is very large, $[-200, 200]$, which may cause large variance for IS, especially when a trajectory is long. One effective way to reduce this variance is to use normalized rewards; therefore, both original rewards and normalized rewards are considered. More specifically, the normalized reward $z$ is defined as $z = \frac{x - \min(x)}{\max(x) - \min(x)}$, where $\min$ and $\max$ are the minimum and maximum values of the original reward function $x$, so that $z \in [0,1]$.
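A minimal sketch of the three transformations (Equations 16-18) and of the min-max reward normalization is given below; the function and variable names are our own, and the two-action setting (PS vs. WE) is assumed.

```python
import numpy as np

ACTIONS = ("PS", "WE")  # problem solving vs. worked example

def hard_code(Q, s, eps=0.001):
    """Equation 16: 1 - eps for the greedy action, eps otherwise."""
    best = max(ACTIONS, key=lambda a: Q[(s, a)])
    return {a: (1.0 - eps if a == best else eps) for a in ACTIONS}

def q_proportion(Q, s):
    """Equation 17: probabilities proportional to (shifted) Q-values."""
    q = np.array([Q[(s, a)] for a in ACTIONS])
    q = q - q.min() + 1.0          # shift so every Q-value is >= 1
    return dict(zip(ACTIONS, q / q.sum()))

def soft_max(Q, s, theta=1.0):
    """Equation 18: Boltzmann distribution over Q-values."""
    q = np.array([theta * Q[(s, a)] for a in ACTIONS])
    e = np.exp(q - q.max())        # subtract the max for numerical stability
    return dict(zip(ACTIONS, e / e.sum()))

def normalize_rewards(rewards):
    """Section 5.2.2: min-max normalization of rewards into [0, 1]."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.min()) / (r.max() - r.min())
```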
6. ITS & FOUR MDP POLICIES
6.1 Our Logic Tutor
Figure 2 shows the interface of the logic tutor, a data-driven ITS used in the undergraduate Discrete Mathematics (DM) course at a large university. The tutor [16] provides students with a graph-based representation of logic proofs which allows students to solve problems by applying rules to derive new statements, represented as nodes. The system automatically verifies proofs and provides immediate feedback on logical errors. Every problem in the tutor can be presented in the form of either WE or PS. By focusing on the pedagogical decisions of WE vs. PS, the tutor allows us to strictly control the content to be equivalent for all students.

[Figure 2: Tutor Problem Solving Interface]

6.2 Four MDP Policies and Empirical Study
Four MDP policies, MDP1-MDP4, were induced from an exploratory pre-collected dataset following different feature selection procedures. Detailed descriptions of our policy induction process are given in [26, 25] and must be omitted here because of page limits. The effectiveness of each MDP policy was empirically evaluated against the Random policy in strictly controlled studies during three consecutive semesters. In each strictly controlled study, students were randomly assigned to two conditions: an MDP policy or a Random baseline policy, which makes random yet reasonable decisions because both PS and WE are always considered to be reasonable educational interventions in our learning context. Moreover, all students went through the identical procedure on the tutor, and the only difference was the pedagogical policy employed. After completing the tutor, students took a post-test which involved two proof questions in a midterm exam; each question was worth 16 points and was graded by one TA using a rubric. Overall, no significant difference was found between our four RL-induced policies and the Random policy on students' post-test scores across all studies.

There are many possible explanations for these results. First, while the Random policy is generally a weak policy for many RL tasks, in our situation both action choices, WE and PS, are considered reasonable and, more importantly, at each decision point there is a 50% chance that the random policy carries out the better of the two. Second, non-significant statistical results do not mean non-existence: a small sample size may limit the significance of statistical comparisons. A post hoc power analysis revealed that, in order to be detected as significant at the 5% level with 80% power, MDP1 vs. Random needed a total sample of 1382 students, MDP2 vs. Random needed 1700 students, MDP3 vs. Random needed 212 students, and MDP4 vs. Random needed 394 students; however, in each empirical study our student sample sizes were much smaller: 59, 50, 57, and 84 respectively. And last but not least, it turned out that all four RL-induced policies were only partially carried out. All of the training problems in our tutor are organized into six strictly ordered levels, and in each level students are required to complete 3-4 problems. In level 1, all participants receive the same set of PS problems, and in levels 2-6 our tutor has two hard-coded action-based constraints required by the class instructors: students must complete at least one PS and one WE, and the last problem in each level must be PS. Therefore, over the entire training process, only ∼50% of the actions are actually decided by the pedagogical policy; the rest are decided by hard-coded system rules.
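As an aside, the kind of post hoc power analysis described above can be reproduced with a standard power calculator. The sketch below uses statsmodels with a hypothetical effect size, since the observed effect sizes are not reported here; it is an illustration, not the authors' computation.

```python
from statsmodels.stats.power import TTestIndPower

# Hypothetical standardized effect size (Cohen's d); illustrative only.
effect_size = 0.15

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                                   power=0.80, ratio=1.0,
                                   alternative='two-sided')
print(f"required per-group n: {n_per_group:.0f}, total: {2 * n_per_group:.0f}")
```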
In short, despite the fact that ECR indicated that our four RL-induced policies should be more effective than the Random policy, the empirical results showed otherwise for various potential reasons. We therefore explored other OPE metrics to evaluate MDP1-MDP4.

7. EXPERIMENT SETUP
Our dataset contains a total of 450 students' interaction logs collected in the strictly controlled studies mentioned above. The goals of our experiment were to: 1) investigate whether any of the three IS-based metrics can align the theoretical and empirical results for our four MDP policies, and 2) identify critical decisions that are linked to student learning.

7.1 Three IS-based Metrics Evaluation
We first describe how we determine whether or not the three IS-based metrics can align the theoretical results with the empirical results for MDP1-MDP4.

For a given RL-induced policy π, we first split all students into High vs. Low groups based on the actual carry-out percentages according to π. Since there are only two tutorial choices, WE vs. PS, there is some probability that each actual decision the tutor made agrees with the decision prescribed by π; for the Random policy, for example, the probability is 50-50. In other words, we can measure each trajectory by the percentage of tutorial decisions that agree with π. If π is indeed effective, we would expect that the more the tutorial decisions in a trajectory agree with π, the better the corresponding student's performance would be. We thus treat all 450 students' interaction logs equally, regardless of their originally assigned conditions, and for each student-ITS interaction log we calculate a carry-out percentage for π using the formula: percentage = N_agree / N_total, where N_total is the total number of tutorial decisions in the trajectory and N_agree is the number of decisions that agree with π. Students are then divided into High Carry-out (High) vs. Low Carry-out (Low) groups by a median split on the carry-out percentages.

Then we empirically evaluate the effectiveness of π by checking whether there is a significant difference between the High and Low groups on their post-test scores. Similarly, we compare the High and Low groups' theoretical results. To do so, for each student-ITS interaction trajectory, we estimate its reward by exploring different combinations of the three IS-based OPE metrics with the three policy transformation functions and the two reward functions; more specifically, we treat all trajectories as generated by the Random policy regardless of their original behavior policy. So for each student, we have a total of 18 theoretical evaluations for a given π. If π is indeed effective and our OPE metric is reliable, we would expect that the more the tutorial decisions in a trajectory agree with π, the higher the corresponding theoretical rewards would be, and vice versa.
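The carry-out computation and median split described above can be sketched as follows. This is our own illustration with a hypothetical log format; `theoretical_score` stands in for any one of the 18 OPE settings.

```python
import numpy as np

def carry_out_percentage(trajectory, policy):
    """Fraction of tutorial decisions that agree with policy pi.

    trajectory: list of (state, action, reward) tuples (hypothetical format);
    policy: dict mapping state -> the action pi would take.
    """
    agree = sum(1 for s, a, _ in trajectory if policy.get(s) == a)
    return agree / len(trajectory)

def median_split(values):
    """Boolean mask: True for the High group, False for the Low group."""
    values = np.asarray(values, dtype=float)
    return values >= np.median(values)

# Usage sketch: `logs` is a list of per-student trajectories, `pi` is the
# MDP policy under evaluation, and `theoretical_score` is one of the 18
# OPE settings (e.g., PDIS with soft-max transformation, original reward).
# percentages = [carry_out_percentage(t, pi) for t in logs]
# high_mask = median_split(percentages)
# theory = [theoretical_score(t) for t in logs]
```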
Finally, for each of the 18 OPE metric settings, we conduct an alignment test between the theoretical and the empirical results on π. This is done by comparing the empirically evaluated results and the theoretical rewards under the corresponding OPE metric. More specifically, they are considered to be aligned when:

1. Both the empirical and the theoretical results were not significant, that is, p ≥ 0.05; or

2. Both results were significant and the direction of the comparison was the same, that is, p < 0.05 and the signs of the t values are both positive or both negative.

All remaining cases are considered not aligned. Thus, for each of the 18 OPE metric settings, we can test whether its theoretical results align with the empirical results for π. Since we have four RL-induced policies, MDP1-MDP4, a robust and reliable OPE metric should align the two types of evaluation results across all four policies.

7.2 Critical Decision Identification
Next, we explain how the critical interactive decisions are identified and empirically examined. Note that there may be critical decisions over which the RL policies have no influence; hence, we focus only on interactive decisions that are critical. For many RL algorithms, the fundamental approach to inducing an optimal policy can be seen as recursively estimating the Q-values Q(s,a) for every state-action pair until the Bellman equation converges. More specifically, Q(s,a) is defined as the expected discounted reward the agent will gain if it takes action a in state s and follows the corresponding optimal policy to the end. Thus, for a given state s, a large difference between the values of Q(s,"PS") and Q(s,"WE") indicates that it is more important for the ITS to follow the optimal decision in state s. We therefore used the absolute difference between the Q-values for each state s to identify critical decisions. Our procedure can be divided into two steps.

Step 1: Identify Critical Decisions. Given an MDP policy, for each state we calculated the absolute Q-value difference between the two actions (PS vs. WE) associated with it. Figure 3 shows the Q-value difference (y-axis) for each state (x-axis), sorted in descending order, for the MDP1-MDP4 policies respectively. It clearly shows that, across the four MDP policies, the Q-value differences for different states can vary greatly. We used the median Q-value difference to split the states into critical vs. non-critical states: the states with the larger Q-value differences were critical states and the rest were non-critical ones. For a given RL-induced policy, the critical decisions are defined as those made in critical states where the actually carried-out tutorial action agreed with the corresponding policy.

[Figure 3: Q-value difference in MDP1-MDP4]

Step 2: Evaluate Critical Decisions. For each of the four RL-induced policies, we counted the number of critical decisions that each student encountered during his/her training. Then, for each policy, students were split into More vs. Less groups by a median split on the number of critical decisions experienced during training. A t-test was conducted on the post-test scores of the More vs. Less groups to investigate whether the students with More critical decisions would indeed perform better than those with Less.
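A minimal sketch of Steps 1 and 2 is shown below; it is our own illustration, and the state encoding, log format, and helper names are hypothetical.

```python
import numpy as np
from scipy import stats

def critical_states(Q, states):
    """Step 1: states whose |Q(s,'PS') - Q(s,'WE')| exceeds the median gap."""
    gaps = {s: abs(Q[(s, "PS")] - Q[(s, "WE")]) for s in states}
    median_gap = np.median(list(gaps.values()))
    return {s for s, g in gaps.items() if g > median_gap}

def count_critical_decisions(trajectory, policy, crit_states):
    """Number of decisions made in critical states that agree with the policy."""
    return sum(1 for s, a, _ in trajectory
               if s in crit_states and policy.get(s) == a)

def evaluate_more_vs_less(counts, post_tests):
    """Step 2: median split on critical-decision counts, then a t-test."""
    counts = np.asarray(counts)
    post_tests = np.asarray(post_tests, dtype=float)
    more = counts >= np.median(counts)
    return stats.ttest_ind(post_tests[more], post_tests[~more])
```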
8. RESULTS
8.1 Three IS-based Metrics Evaluation
Table 1 shows the empirical evaluation results comparing the High vs. Low carry-out groups' post-test scores for each RL-induced policy. The motivation is that, if a policy is indeed effective, students in the High group should significantly outperform their Low peers on the post-test. In Table 1, the first column indicates the name of the RL-induced policy; columns 2 and 3 show the mean and standard deviation of the classroom post-test scores for the High and Low carry-out groups, respectively; and the last column shows the t-test results comparing the post-test scores between the groups. Rows 2-4 show that there is no significant difference between the High vs. Low groups in terms of post-test scores for the MDP1, MDP2, and MDP3 policies, but there is a significant difference between the two groups for the MDP4 policy (row 5): t(448) = 2.19, p = .029. This result suggests that, among the four MDP policies, only MDP4 seems to be effective in that the students in MDP4's High carry-out group performed significantly better than those in the Low carry-out group.

Table 1: Empirical Post-test Evaluation Results for High and Low Carry-out Groups

Policy | High | Low | T-test Result
MDP1 | 79.06 (24.64) | 83.13 (23.69) | t(448) = −1.78, p = .076
MDP2 | 81.46 (24.90) | 80.44 (23.66) | t(448) = .44, p = .658
MDP3 | 82.55 (23.48) | 79.30 (25.00) | t(448) = 1.24, p = .156
MDP4 | 83.43 (22.83) | 78.43 (25.44) | t(448) = 2.19, p = .029**

Bold and ** denote significance at p < 0.05.

Table 2 shows the overall IS-based metric evaluation results, i.e., the impact of the policy transformations and of original versus normalized rewards on the outcome of each IS metric. The first column indicates the type of policy transformation applied and the second column shows whether the rewards are normalized. The third through fifth columns show the performance of each IS metric, where performance is the percent alignment between the IS policy predictions and the empirical post-test results. Among the three policy transformations, Q-proportion is the worst, with none of its six performance results better than the corresponding results of Soft-max or Hard-code. Hard-code performs slightly better than Soft-max in most cases but never reaches a 100% match. For reward normalization, the original reward performs better than the normalized reward for the PDIS metric, but there was no effect on IS or WIS.

Comparing the three IS metrics, PDIS shows the greatest performance, with all 12 of its performance results being the best. For the other two metrics, IS and WIS, the results are exactly the same because WIS is essentially IS multiplied by a constant, and this kind of rescaling does not change the results of the t-tests. The metric with the best performance is PDIS with the Soft-max policy transformation and the original reward, whose performance is 100%: all of the t-test results on the PDIS predictions aligned with those on the empirical results in terms of significance.

Table 2: Policy Transformation and Normalization Impacts on IS Metric Alignment to Post-test Outcomes

Policy Transformation | Rewards | IS | WIS | PDIS
Soft-max | Original | 50% | 50% | 100%*
Soft-max | Normalized | 50% | 50% | 50%
Q-proportion | Original | 25% | 25% | 75%
Q-proportion | Normalized | 25% | 25% | 25%
Hard-code | Original | 75% | 75% | 75%
Hard-code | Normalized | 75% | 75% | 75%

Table 3 shows the detailed results for the metric evaluations using the original reward, providing t-test results when comparing the High and Low carry-out groups. The first column in Table 3 shows the type of policy transformation function applied. The second column shows the four MDP policies considered when splitting the dataset into High and Low carry-out groups. The third column shows the t-test results of the empirical evaluation of High vs. Low carry-out, which served as the ground truth. The fourth and fifth columns show the t-test results for the predictions of the three IS metrics: IS, WIS, and PDIS, respectively. In the table, electric-blue cells denote that the theoretical t-test results align with the empirical t-test results (column 3), while grey cells denote misaligned t-test results.

Table 3: Detailed IS-based Metrics Evaluation Results Using Original Reward

Transform | Policy | Empirical Result | IS & WIS Result | PDIS Result
Soft-max | MDP1 | t(448) = −1.78, p = .076 | t(448) = .89, p = .376 | t(448) = .89, p = .375
Soft-max | MDP2 | t(448) = .44, p = .658 | t(448) = 2.60, p = .010** | t(448) = 1.84, p = .067
Soft-max | MDP3 | t(448) = 1.42, p = .156 | t(448) = 2.50, p = .013** | t(448) = .53, p = .594
Soft-max | MDP4 | t(448) = 2.19, p = .029** | t(448) = 2.18, p = .030** | t(448) = 3.23, p = .001**
Q-proportion | MDP1 | t(448) = −1.78, p = .076 | t(448) = 3.78, p < .001** | t(448) = 1.77, p = .077
Q-proportion | MDP2 | t(448) = .44, p = .658 | t(448) = 3.26, p = .001** | t(448) = 2.13, p = .034**
Q-proportion | MDP3 | t(448) = 1.42, p = .156 | t(448) = 2.71, p = .007** | t(448) = .31, p = .760
Q-proportion | MDP4 | t(448) = 2.19, p = .029** | t(448) = 3.69, p < .001** | t(448) = 2.32, p = .021**
Hard-code | MDP1 | t(448) = −1.78, p = .076 | t(448) = 1.19, p = .233 | t(448) = .96, p = .337
Hard-code | MDP2 | t(448) = .44, p = .658 | t(448) = .71, p = .479 | t(448) = .84, p = .404
Hard-code | MDP3 | t(448) = 1.42, p = .156 | t(448) = 2.34, p = .020** | t(448) = 2.06, p = .040**
Hard-code | MDP4 | t(448) = 2.19, p = .029** | t(448) = 2.83, p = .005** | t(448) = 3.73, p < .001**

Electric-blue cells denote that the theoretical t-test results align with the empirical t-test results (column 3); grey cells denote misaligned t-test results.

From this table, we can see that only PDIS with the soft-max transformation and the original reward results in all four t-tests aligning with the corresponding empirical results. IS and WIS are more likely to predict a significant difference between High vs. Low. Meanwhile, the Q-proportion transformation tends to cause the metrics to predict more significant differences, while Hard-code tends to predict fewer.

In summary, when comparing groups within each RL-induced policy, our results showed that for the MDP4 policy the students in the High carry-out group significantly outperformed the students in the Low group, while no significant difference was found for the other three policies. This suggests that the MDP4 policy is an effective policy in that the more it is carried out, the better it performs; however, the partial carry-out situation reduced the power of the MDP4 policy, so that it did not significantly outperform the baseline Random policy. When comparing the empirical evaluation results with the theoretical evaluation results, PDIS is the best of the three IS-based metrics, reaching 100% agreement. Our results suggest that proper deployment settings have an impact on the performance of IS-based metrics: when transforming the deterministic policy into a stochastic policy, soft-max is the best, Q-proportion is the worst, and Hard-code is stable. The comparison between the original reward and the normalized reward indicates that the original reward better reflects the empirical results despite having larger variance.
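For concreteness, the alignment test of Section 7.1 that underlies Tables 2 and 3 can be sketched as follows; this is our own illustration, where the group arrays are assumed to hold per-student post-test scores and per-trajectory OPE estimates.

```python
import numpy as np
from scipy import stats

def aligned(post_high, post_low, theory_high, theory_low, alpha=0.05):
    """Check whether the empirical and theoretical comparisons align.

    post_*: post-test scores of the High/Low carry-out groups;
    theory_*: per-trajectory OPE scores (e.g., PDIS) of the same groups.
    """
    t_emp, p_emp = stats.ttest_ind(post_high, post_low)
    t_theo, p_theo = stats.ttest_ind(theory_high, theory_low)
    if p_emp >= alpha and p_theo >= alpha:      # criterion 1: both non-significant
        return True
    if p_emp < alpha and p_theo < alpha:        # criterion 2: both significant,
        return np.sign(t_emp) == np.sign(t_theo)  # same direction
    return False
```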
8.2 Critical Decision Identification
Recall that each policy defines its own critical decisions: decisions whose Q-values differ more between the possible actions are considered critical, and for each policy we split the students according to whether they received More or Less decisions aligned with the critical decisions. Table 4 shows the t-test results comparing the post-test scores between the More vs. Less critical decisions groups. The first column indicates the MDP policy considered when identifying the critical decisions. The second and third columns show the average post-test scores of students in the More and Less groups, shown as mean (sd). The fourth column shows the t-test results when comparing the post-test scores of the More and Less critical decisions groups. The MDP1 row shows a significant difference between the More vs. Less critical decisions groups for the MDP1 policy, with t(448) = −1.97, p = .049; however, the students in the More group performed worse than those in the Less critical decisions group. The MDP2 and MDP3 rows show that there is no significant difference between the two critical decisions groups in terms of post-test scores for the MDP2 or MDP3 policies. Finally, the MDP4 policy shows a significant difference between the two groups, t(448) = 2.10, p = .036, which means that students with More critical decisions performed significantly better than students with Less.

Table 4: Critical Decision Evaluation Results

Policy | More | Less | T-test Result
MDP1 | 78.49 (23.08) | 83.01 (25.44) | t(448) = −1.97, p = .049
MDP2 | 83.22 (23.53) | 78.86 (24.79) | t(448) = 1.91, p = .057
MDP3 | 79.45 (24.59) | 82.08 (24.00) | t(448) = −1.14, p = .257
MDP4 | 83.54 (22.89) | 78.74 (25.21) | t(448) = 2.10, p = .036**

Bold and ** denote significance at p < 0.05.

For the MDP4 policy, the identified critical decisions comprised 25% of all decisions. This shows that, although critical decisions are a small proportion of all decisions, they can significantly impact the outcome. The results also show that the Q-values in the MDP4 policy can be used to identify critical decisions aligned with the empirical results, whereas those of the other three policies cannot. Based on the results in Table 1, MDP4 was identified as the only effective policy, since its empirical post-test results aligned, with students in the High carry-out group performing significantly better than those in the Low carry-out group. The critical decisions results suggest that MDP4 is also the only policy for which larger differences in Q-values had larger impacts on post-test results. Taken together, these results suggest that only the Q-values of effective policies identify decisions that impact actual post-test performance. This further inspires us to investigate whether we could verify the effectiveness of a policy in reverse: given a policy, if the decisions with larger Q-value differences are significantly linked to student performance, then this policy may be more likely to be effective.

9. CONCLUSION AND FUTURE WORK
In this work, we explored three IS-based OPE metrics with two deployment settings in a real-world application. By comparing the effectiveness of four RL-induced policies empirically and theoretically, our results showed that PDIS is the best metric for interactive e-learning systems and that appropriate deployment settings (i.e., the policy transformation and reward setting) are required to achieve reliable, robust evaluations. We also proposed a method to identify critical decisions by the Q-value differences in a policy. In order to verify our method, we investigated the relationship between the number of identified critical decisions and student post-test scores. The results revealed that the identified critical decisions are significantly linked to student learning and, further, that critical decisions can be identified by an effective policy but not by ineffective policies. In the future, we will apply the PDIS metric with the soft-max transformation and original rewards to help us induce better RL policies that further improve students' learning in the ITS. Also, when inducing policies, we will consider constraints to avoid the partial carry-out situation that limits the impact a policy can have on outcomes. Furthermore, with the identification of critical decisions, we can reduce the size of the resulting policies still further by focusing only on the most important decisions rather than on trivial ones.

10. REFERENCES
[1] J. Beck, B. P. Woolf, and C. R. Beal. ADVISOR: A machine learning architecture for intelligent tutor construction. In AAAI/IAAI, pages 552–557, 2000.
[2] J. D. Bransford and D. L. Schwartz. Rethinking transfer: A simple proposal with multiple implications. Review of Research in Education, pages 61–100, 1999.
[3] M. Chi, K. VanLehn, D. Litman, and P. Jordan. Empirically evaluating the application of reinforcement learning to the induction of effective and adaptive pedagogical strategies. User Modeling and User-Adapted Interaction, 21(1-2):137–180, 2011.
[4] B. Clement et al. A comparison of automatic teaching strategies for heterogeneous student populations. In Proceedings of the 9th International Conference on Educational Data Mining (EDM), 2016.
[5] S. Doroudi, K. Holstein, V. Aleven, and E. Brunskill. Towards understanding how to leverage sense-making, induction and refinement, and fluency to improve robust learning. JEDM, 2015.
[6] M. Dudik, J. Langford, and L. Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, 2011.
[7] J. Hammersley and D. Handscomb. General principles of the Monte Carlo method. Springer, 1964.
[8] A. Iglesias, P. Martínez, R. Aler, and F. Fernández. Learning teaching strategies in an adaptive and intelligent educational system through reinforcement learning. Applied Intelligence, 31(1):89–106, 2009.
[9] A. Iglesias, P. Martínez, R. Aler, and F. Fernández. Reinforcement learning of pedagogical policies in adaptive and intelligent educational systems. Knowledge-Based Systems, 22(4):266–270, 2009.
[10] A. Iglesias, P. Martínez, R. Aler, and F. Fernández. Reinforcement learning of pedagogical policies in adaptive and intelligent educational systems. Knowledge-Based Systems, 22(4):266–270, 2009.
[11] J. R. Tetreault, D. Bohus, and D. J. Litman. Estimating the reliability of MDP policies: A confidence interval approach. In Proceedings of NAACL-HLT, pages 276–283. Association for Computational Linguistics, 2007.
[12] K. R. Koedinger, J. R. Anderson, W. H. Hadley, and M. A. Mark. Intelligent tutoring goes to school in the big city. International Journal of Artificial Intelligence in Education, 8:30–43, 1997.
[13] K. R. Koedinger, E. Brunskill, R. S. Baker, E. A. McLaughlin, and J. Stamper. New potentials for data-driven intelligent tutoring system development and optimization. AI Magazine, 34(3):27–41, 2013.
[14] E. Levin, R. Pieraccini, and W. Eckert. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing, 8(1):11–23, 2000.
[15] T. Mandel, Y.-E. Liu, S. Levine, E. Brunskill, and Z. Popovic. Offline policy evaluation across representations with applications to educational games. In AAMAS, pages 1077–1084, 2014.
[16] B. Mostafavi, Z. Liu, and T. Barnes. Data-driven proficiency profiling. In Proceedings of the 8th International Conference on Educational Data Mining, 2015.
[17] M. Chi, K. VanLehn, D. J. Litman, and P. W. Jordan. Empirically evaluating the application of reinforcement learning to the induction of effective and adaptive pedagogical strategies. User Modeling and User-Adapted Interaction, 21(1-2):137–180, 2011.
[18] K. Narasimhan, T. Kulkarni, and R. Barzilay. Language understanding for text-based games using deep reinforcement learning. arXiv preprint arXiv:1506.08941, 2015.
[19] P. S. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.
[20] D. Precup, R. S. Sutton, and S. P. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pages 759–766, 2000.
[21] A. N. Rafferty, E. Brunskill, T. L. Griffiths, and P. Shafto. Faster teaching via POMDP planning. Cognitive Science, 40(6):1290–1332, 2016.
[22] J. P. Rowe and J. C. Lester. Optimizing player experience in interactive narrative planning: A modular reinforcement learning approach. In Proceedings of the Tenth International Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE), pages 160–166, 2014.
[23] J. P. Rowe and J. C. Lester. Improving student problem solving in narrative-centered learning environments: A modular reinforcement learning framework. In AIED, pages 419–428. Springer, 2015.
[24] N. Roy, J. Pineau, and S. Thrun. Spoken dialogue management using probabilistic reasoning. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 93–100. Association for Computational Linguistics, 2000.
[25] S. Shen and M. Chi. Aim low: Correlation-based feature selection for model-based reinforcement learning. In EDM, pages 507–512, 2016.
[26] S. Shen and M. Chi. Reinforcement learning: the sooner the better, or the later the better? In Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization, pages 37–44. ACM, 2016.
[27] S. Singh, D. Litman, M. Kearns, and M. Walker. Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system. Journal of Artificial Intelligence Research, 16:105–133, 2002.
[28] J. C. Stamper, M. Eagle, T. Barnes, and M. Croy. Experimental evaluation of automatic hint generation for a logic tutor. In International Conference on Artificial Intelligence in Education, pages 345–352. Springer, 2011.
[29] P. Thomas. Safe reinforcement learning. PhD thesis, 2015.
[30] K. VanLehn. The behavior of tutoring systems. International Journal of Artificial Intelligence in Education, 16(3):227–265, 2006.
[31] P. Wang, J. Rowe, W. Min, B. Mott, and J. Lester. Interactive narrative personalization with deep reinforcement learning. In IJCAI, 2017.
[32] J. D. Williams and S. Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393–422, 2007.
[33] B. Zhang, Q. Cai, J. Mao, E. Chang, and B. Guo. Spoken dialogue management as planning and acting under uncertainty. In INTERSPEECH, pages 2169–2172, 2001.
[34] G. Zhou et al. Towards closing the loop: Bridging machine-induced pedagogical policies to learning theories. In EDM, 2017.