<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MOCHI: an Offline Evaluation Framework for Educational Recommendations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chunpai Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shaghayegh Sahebi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Brusilovsky</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University at Albany, State University of New York</institution>
          ,
          <addr-line>Albany, New York</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Pittsburgh</institution>
          ,
          <addr-line>Pittsburgh, Pennsylvania</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Evaluating recommendation algorithms with long-term independently-measured rewards, such as educational recommender systems, has proven to be a difficult task, especially using offline data. While many use model-based evaluation strategies to evaluate such recommender systems, we argue that these strategies are unreliable, particularly due to biases introduced via simulation and reward estimation models. In this paper, we showcase this argument by experimenting with a state-of-the-art model-based evaluation model and presenting its flaws. Next, we propose MOCHI, an offline model-free evaluation framework that can be used on sparser collected data with longer trajectories. We experiment with MOCHI and show how it can be used to effectively evaluate educational recommender policies with long-term goals.</p>
      </abstract>
      <kwd-group>
        <kwd>educational recommender systems</kwd>
        <kwd>offline evaluation</kwd>
        <kwd>instructional sequencing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Related Work</title>
      <p>
        With the rise of online education, the size of classes and, as a result, the need to provide
automatic guidance for students grows. Educational recommender systems and instructional
sequencing policies aim to provide the best learning materials to students during their studies
in online learning platforms [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Despite the considerable application of these algorithms
in online education platforms, efectively evaluating them has been proven challenging [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Particularly, because of scalability problems in case-based user studies and equity considerations
in online A/B testing in this domain, ofline evaluation strategies are essential for educational
recommender systems.
      </p>
      <p>
        We note that having a delayed and independently-measured utility or reward is one of the main
reasons this evaluation is challenging. Unlike consumer-based and commercial systems
that aim to serve users’ interests, the main purpose of educational systems is for students to
learn. Research has shown that students’ interest-based behaviors may not be aligned with their
learning goals and can potentially work against them [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. As a result, the widely used implicit
and explicit observable feedback, such as user ratings and clicks, is not an adequate success
indicator for evaluating an educational recommender system. Instead, long-term learning
measures, such as post-test score and learning gain, are more reliable for evaluating educational
recommender systems’ effectiveness in student learning.
      </p>
      <p>These more reliable scores are, however, evaluated independently of the student trajectory
items and, as a result, are not directly observable from the trace data. For example, post-test scores
are student grades in a test that is administered at the end of the course and may not include
any problems that the student had practiced before. Additionally, these measures are delayed,
as they are collected at the end of the student trajectory and can change as the students interact
with the recommended learning materials. Accordingly, for a successful offline evaluation of
educational recommenders using these measures, not only are full user trajectories needed,
but the training traces should also have vast coverage of all possibly observable trajectories to
facilitate a generalizable offline evaluation. These problems complicate the offline evaluation of
educational recommender systems. A similar challenge exists in other recommender systems
with delayed independently-measured rewards, such as health or weight-loss applications.</p>
      <p>
        To avoid these problems, recent educational recommendation literature has sought
model-based evaluation that simulates student trajectories and estimates the potential final reward for
offline evaluation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Model-based evaluation methods train a student model using the users’
historical performance data from the logged system and then use it as a simulator, coupled with
a reward model, to estimate the performance of a target policy [
        <xref ref-type="bibr" rid="ref6 ref7 ref8 ref9">7, 8, 9, 6</xref>
        ]. Yet, these evaluation
strategies suffer from many problems. Most importantly, they rely on two estimators that can
induce major errors. To simulate student trajectories, a student knowledge model should be used
to estimate student knowledge and performance on the recommended items. Naturally, these
models have an estimation bias that is exacerbated by the length of the simulated student
trajectory. The reward estimator is usually trained using the learned student knowledge model
parameters to predict the student’s post-test score or knowledge gain. Again, not only can the reward
estimator model itself be biased, but relying on error-prone estimated parameters can also
intensify such a bias. Recently, efforts such as the Robust Evaluation Matrix (REM) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] have aimed
to address some of these issues by testing the proposed policies against multiple student simulation
models. REM has an innovative approach to exploring experiment-worthy policies. It evaluates
the policies in a conservative way: policy 𝒜 is considered to be better than policy ℬ only if
policy 𝒜 outperforms policy ℬ under all student simulation models. So, if a policy appears to be
better than the others according to REM results, it would be a policy worth exploring in
practice. Yet, REM suffers from other uncertainties that we present in this paper.
      </p>
      <p>
        A more elegant model-free solution is importance sampling, which is popularly used in the
field of off-policy evaluation in reinforcement learning [
        <xref ref-type="bibr" rid="ref11 ref12 ref13 ref14 ref15">11, 12, 13, 14, 15</xref>
        ]. The main idea is to
re-weight the pre-collected reward from the logged policy to compute an unbiased estimate of
the expected reward under a new compatible policy. However, this method requires the trajectories
generated by the new policy to preexist in the old data generated by the logged policy. Otherwise,
importance sampling could yield an estimate with large variance, especially when we have
very sparse observed rewards from the logged system. In other words, the existing
model-free evaluations are not applicable to the offline evaluation of educational recommendation
systems with long trajectories and sparse rewards. Given these challenges, having an offline
evaluation framework that can handle delayed independently-measured rewards, is independent
of recommendation and student models, allows for multiple item recommendations and user
choice, and does not rely only on superficial observed interest-based measures is essential
for educational recommender systems.
      </p>
      <p>In this paper, we first examine the REM framework and demonstrate the need for a model-free
evaluation framework by showing the problems that arise in such model-based methods. Next,
we propose Model-free Offline Correlational HIt (MOCHI), a model-free evaluation framework
that can work with delayed independently-measured rewards and long trajectories. MOCHI
evaluates whether higher degrees of following a recommender system’s non-trivial suggestions are
associated with higher independently-measured rewards. Experimenting with offline
data from a real-world online education platform, we show that the results generated by our
proposed framework are in accordance with our expectations. Additionally, we present ways to
interpret MOCHI’s results. Our proposed evaluation framework is algorithm-agnostic and can
be used to evaluate any adaptive or non-adaptive educational recommendation or instructional
sequencing algorithm that either suggests or mandates the next item for the students to work on.
It limits neither the number of items recommended to students nor their trajectory length.
Most importantly, it can be used in any application domain, such as health and fitness, in
which a delayed long-term independently-measured reward is required rather than immediate
superficial observations.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Model-Based Evaluation Challenges: A Case Study</title>
      <p>
        In this section, we investigate the Robust Evaluation Matrix (REM) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], to argue that the existing
evaluation methods are not sufficient to validate the effectiveness of educational
recommendation policies.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Dataset</title>
        <p>We use the data from the MasteryGrids platform 1, collected during the Spring 2012, Fall 2012,
and Spring 2013 semesters of the introductory Java course with the same curriculum at the
University of Pittsburgh. MasteryGrids is an open-learner interface for an intelligent tutoring
system, in which students can practice with various kinds of problems and annotated examples.
In this paper, we use student trajectories in solving problems that ask the students to read a
code snippet and answer simple questions, such as the final output or a variable’s value. The
items to be recommended in this system are these problems. In MasteryGrids, the programming
problems are ordered from left to right by 21 curriculum topics. The topics cover a wide
range of concepts, from the simple "Variables" to the more complex "Wrapper Classes",
and are ordered by a domain expert. Each topic includes multiple problems. Although students
can freely select any problem to work on, they typically follow the interface’s topic order. The
students take a pre-test before starting their class and a post-test at the end of their course. We
normalize the pre-test and post-test scores to be between zero and one. Also, we calculate
students’ knowledge gain by subtracting their pre-test score from their post-test score. Score
distributions are presented in Figure 1. In total, trajectories of 86 students with their pre-test
and post-tests are available in the dataset. Descriptive statistics of the dataset are shown in
Table 1.</p>
        <sec id="sec-2-1-1">
          <title>1http://adapt2.sis.pitt.edu/wiki/Mastery_Grids_Interface</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Setup and Expectations</title>
        <p>Model-based evaluations in the education domain use a student model as a simulator to estimate
the students’ performance under the policies that are being evaluated. Since they are
simulation-based, they also have to estimate the delayed reward or utility for each student using a reward
model. Accordingly, in our experiment setup, we use three different student models to evaluate
five recommendation policies. Additionally, we use multiple reward models to evaluate how a
reward model would affect the evaluation results. We simulate 1,000 student trajectories and
sample the trajectory lengths and pretest scores from the pre-collected training data.
Simulator Student Models. We use the following models as student simulators:
• Bayesian Knowledge Tracing (BKT) is a pioneering model based on hidden Markov
models that estimates a student’s probability of success based on the probability that the
student has learned a topic (mastery) [16].
• Deep Knowledge Tracing (DKT) is an LSTM [17] that predicts a student’s correctness
probability according to their knowledge estimate [18].
• Dynamic Key-Value Memory Network (DKVMN) is a sequential key-value memory
based deep model that represents the student’s estimated knowledge in each latent
concept [19]. This model has not been used in previous model-based evaluations.
Reward Models. The reward models aim to estimate the final delayed reward for simulated
students. They are trained on the simulator parameters learned from the training
students’ trajectories as their independent variables and the training students’ final utilities
(e.g., post-test scores) as the dependent variables. Since different simulator student models
have different parameters, we have different parametrizations of the independent variables for
their reward models. Additionally, the model used to estimate the final reward from the
independent parameters can differ. We use two different reward models for each student
simulator: a linear regression and a ridge regression model. Linear regression is selected
following REM. Ridge regression is chosen to increase linear regression’s generalizability.</p>
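        <p>To make the simulator concrete, the following is a minimal sketch of a single Bayesian Knowledge Tracing update; the learn/slip/guess parameter values are illustrative placeholders, not fitted values from this study.</p>

```python
def bkt_update(p_mastery, correct, learn=0.1, slip=0.1, guess=0.2):
    """One BKT step: condition the mastery probability on the observed
    answer (Bayes rule with slip/guess noise), then apply the learning
    transition. Parameter values here are illustrative placeholders."""
    if correct:
        # P(mastered | correct answer)
        cond = p_mastery * (1 - slip) / (
            p_mastery * (1 - slip) + (1 - p_mastery) * guess)
    else:
        # P(mastered | incorrect answer)
        cond = p_mastery * slip / (
            p_mastery * slip + (1 - p_mastery) * (1 - guess))
    # chance of learning the topic on this attempt
    return cond + (1 - cond) * learn
```

        <p>Iterating this update over a student’s attempts yields the per-topic mastery estimates that the adaptive policies below consume.</p>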
        <p>
          We also define different independent variables for each student model. For BKT, following [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ],
we use the trained parameters to infer the mastery probabilities of the 21 topics in MasteryGrids.
So, the BKT model’s knowledge representation is a 21-dimensional vector. To train the reward
model for the BKT simulator, we fit the regression model with the concatenation of the pre-test score
and the knowledge vector as independent variables to predict the post-test score. Knowledge
gain can be calculated as the difference between the estimated post-test and pre-test
scores. For DKT, we use the last estimated student knowledge state vector at the final attempt
concatenated with the pre-test score as the input for the reward model. For DKVMN, we compute
the knowledge state at each time step, represented by a 10-dimensional vector [19, 20], 10 being
the discovered latent concept size, and concatenate it with the student’s pre-test score.
Recommendation Policies. We consider the following standard educational policies that
were evaluated in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]:
• Random: is a non-adaptive policy that randomly recommends k learning resources to
each student at each attempt.
• InstructSeq: is based on the designed instructional order and the student’s performance.
        </p>
        <p>If the student answers the current question correctly, the next k following questions in
the instruction sequence will be suggested. Otherwise, the current question and the next
k − 1 following questions are recommended.
• Mastery: is an adaptive policy that leverages the BKT 2 student model to estimate the student’s
mastery probability at each attempt. It selects the top-k questions that are the farthest
from the predefined mastery threshold level, which is usually set to 95%.
• HighProbCorr: is also an adaptive BKT-based policy that selects the next top-k questions
that have the highest probabilities of being answered correctly by the student, based on their
current mastery probability estimates.
• Myopic: is another adaptive BKT-based policy that selects the next top-k questions that
could lead to the largest estimated reward for the student.</p>
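        <p>As one illustration of these selection rules, the Mastery policy can be sketched as follows, assuming a dictionary of per-question BKT mastery estimates as input (names are ours, not from the original implementation).</p>

```python
def mastery_policy(p_mastery, k=1, threshold=0.95):
    """Mastery policy sketch: recommend the top-k questions whose
    estimated mastery probability is farthest below the threshold.
    p_mastery maps question id -> BKT mastery estimate (assumed input)."""
    return sorted(p_mastery, key=lambda q: threshold - p_mastery[q],
                  reverse=True)[:k]
```

        <p>HighProbCorr would instead sort by the estimated correctness probability itself, and Myopic by the estimated reward of each candidate question.</p>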
        <p>Expectations. Among the above policies, we expect the InstructSeq policy to be a
reasonably good policy, since the course topic sequence has been designed by domain experts.
In particular, we expect InstructSeq to lead to better learning in students, compared to the
Random policy. In addition, since the Mastery policy suggests the items which the student
is least likely to have mastered, we anticipate it will recommend very difficult problems. Since
the course covers a vast variety of topics, with the most difficult ones presented at
the end of the semester, we expect this policy to always recommend the most difficult course
problems to all students. Similarly, we anticipate the HighProbCorr policy will always suggest
the easiest course problems to all students, since they are the most likely to be solved by them.
As a result, we expect InstructSeq to also be better than both the Mastery and HighProbCorr
policies.</p>
        <sec id="sec-2-2-1">
          <title>2https://github.com/myudelson/hmm-scalable</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Experimental Results of Robust Evaluation Matrix</title>
        <p>Here, we first present the predictive accuracies of the different student and reward models. Then,
we present REM evaluation results with different setups to show how student and reward model
variations can affect the model-based evaluations. Since in REM students are assumed to always
follow the recommended item, we recommend the top-1 problem.</p>
        <p>Predictive Accuracy of Student Models. We train the three student models with 5-fold
user-stratified cross-validation, and report the predictive accuracy of each model in Table 2.
ROC-AUC stands for the area under the receiver operating characteristic curve, and PR-AUC
denotes the area under the precision-recall curve. As we can see, they all have reasonably good
predictive accuracy, with the performance order DKT &gt; DKVMN &gt; BKT.
Predictive Accuracy of Reward Models. We estimate each reward model’s prediction error
via 5-fold user-stratified cross-validation. The results are shown in Table 3. We can see that
ridge regression’s results are significantly better than linear regression’s only for the DKT student
model. This overfitting of linear regression can be explained by the large knowledge state
representation size in DKT, which results in a high number of reward model parameters.</p>
        <p>Reward Model with Linear Regression (RMSE / MAE): BKT: 0.1558 ± 0.0207 / 0.1300 ± 0.0158; DKT: 0.3743 ± 0.2216 / 0.2812 ± 0.1700; DKVMN: 0.1508 ± 0.0270 / 0.1209 ± 0.0237.</p>
        <p>Reward Model with Ridge Regression (RMSE / MAE): BKT: 0.1428 ± 0.0355 / 0.1175 ± 0.0282; DKT: 0.1366 ± 0.0309 / 0.1101 ± 0.0259; DKVMN: 0.1456 ± 0.0284 / 0.1209 ± 0.0204.</p>
        <p>
REM Results with Linear Regression Reward Model. Table 4 presents the expected reward
and its standard deviation over 1000 simulations under each combination of student simulation
model and recommendation policy. We also show the heatmap of pair-wise Cohen’s d effect
sizes in Fig 2. Conventionally, d = 0.2, 0.5, and 0.8 respectively represent a ’small’, ’medium’,
and ’large’ effect size [21]. As we can see, for the BKT student simulation model, the Random policy
has a similar effect to InstructSeq, Mastery, and HighProbCorr; the Mastery policy is equivalent
to HighProbCorr; and only the Myopic policy is the most different from all, with the highest
reward. On the other hand, for the DKT student model, we can conclude that InstructSeq
contributes equally to HighProbCorr, and the Mastery policy results in the highest reward.
For DKVMN, we have Myopic=InstructSeq=HighProbCorr. We can clearly see that different
simulator student models result in significantly different evaluations. REM selects one policy as
the best (worst) policy only if it is the best (worst) in all simulation model experiments. Overall,
since the policy performance is not consistent across different student models, REM will not
conclude that any of the policies are better or worse than others.</p>
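        <p>For reference, the pairwise effect size used above can be sketched as Cohen’s d with a pooled standard deviation; this is a standard formulation, shown here under the assumption that each input is the list of simulated rewards for one policy.</p>

```python
from statistics import mean, stdev

def cohens_d(a, b):
    """Pairwise Cohen's d with a pooled standard deviation, comparing the
    simulated rewards of two policies (d of roughly 0.2 / 0.5 / 0.8 is
    conventionally read as a small / medium / large effect)."""
    na, nb = len(a), len(b)
    pooled = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
              / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled
```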
        <p>REM Results with Ridge Regression Reward Model. Similarly, we show the estimated
rewards for ridge-regression-based reward models in Table 5 and the pairwise Cohen’s d effect sizes
in Fig. 3. Again, with the BKT simulation model, the Myopic policy achieves the best rewards.
With DKT, we can see that Mastery is better than HighProbCorr. But, with DKVMN, we see
the reverse effect. The differences between policies are smaller with DKT and DKVMN,
with the Cohen’s d effect usually being smaller for DKVMN. As a result, similar to the previous
analysis, REM will not conclude that any of the policies are better or worse than others.</p>
        <p>Comparing the results of Tables 4 and 5 for each simulation model, the linear regression
and ridge regression reward models can also lead to very different and even contradictory
conclusions. For example, for DKVMN with the linear regression model, the Mastery policy is better
than Random and InstructSeq. But, with ridge regression, Mastery is not different from the Random
and InstructSeq policies.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Lessons Learned from Robust Evaluation Matrix</title>
        <p>As we have seen in our experiment results, REM evaluation can be inconclusive and highly
variable depending on the simulation and reward models. It also contradicts our expectations in
Section 2.2. Here we discuss the potential problems that may lead to misleading results in REM.</p>
        <p>
          As we have seen, using different simulation models can lead to different results in REM.
The problem is that, in practice, it is not clear how many simulators and which classes of
student models should be employed to be confident about REM results or any other model-based
evaluations. Even if one policy is consistently better than others in all the applied simulation
models, it may be contradicted in a new simulation model. Predictive accuracy could be one
criterion for student model selection, as suggested in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. However, since we have no standard
for identifying poor simulators, there is no guarantee that a particular student model with high
predictive accuracy will always generate trustworthy results. Additionally, as we have seen in
our experiments, the results of the DKT simulator, which had the best predictive performance,
still contradicted our expectations from the policies.
        </p>
        <p>Another similar issue comes from the reward modeling. Reward modeling is needed here
to estimate the long-term independently-measured reward. But, simply relying on the predictive
accuracy of the reward model is inadequate, as we do not know how good a reward model
needs to be. As we can see in Table 3, the reported expected prediction error is much higher
than the standard error in REM shown in Table 4 and Table 5, which could be an indicator
of a poor reward model. In addition, in the MasteryGrids dataset, we only observe a single
reward (post-test score) for each trajectory, which is extremely sparse. As we can see, too
many factors and variables affect the results of model-based evaluations. As a result, a reliable
model-free offline evaluation strategy with fewer variations is needed to evaluate long-term
independently-measured reward policies, such as educational recommender systems.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Model-free Offline Correlational HIt (MOCHI)</title>
      <p>In this section, we present MOCHI, our model-free offline evaluation framework to validate the
effectiveness of algorithms with long-term independently-measured rewards, such as general
educational recommender systems. We assume that a user who often follows a more effective
recommendation policy will have a higher final reward, such as a higher final course grade,
compared to a student who adopts a less effective one. Given an offline previously collected
dataset, our first counterfactual question is how to quantify whether a student would follow the target
recommendation policy if it had never been applied to the student. The second question is how
significant the student’s following of the target policy is, compared to having no, or a different,
baseline policy. Finally, the third question is how to determine the effectiveness of a policy,
given the significance of the student trajectories that would have followed it had it been applied. In
the following sections, we introduce our solutions to these questions as building blocks of the
MOCHI framework.</p>
      <sec id="sec-3-1">
        <title>3.1. Average Discounted Cumulative Hit</title>
        <p>We consider an educational recommender system that ranks the top-k most useful items and
learning resources for the target student at every student attempt. For instructional sequencing
algorithms that only suggest one item to students, k = 1. According to our assumption, we
would expect the students who studied the higher-ranked recommended learning resources
to have better academic performance than those who chose the lower-ranked ones, or did not
follow the recommendations. Hence, we need an “agreement” measure to determine how well a
student follows the high-ranked items recommended by an algorithm. Having a rank-based
measure is particularly important in applications with smaller offline trajectory datasets in
which users choose to interact with one or a few items at a time and the datasets may not cover
all the possible combinations of trajectories. In that case, it is important to know how close the
recommender system was to suggesting the user’s top choices, even if they were not ranked in the
highest possible positions. Inspired by the Discounted Cumulative Gain (DCG), we design a new
metric called Discounted Cumulative Hit (DCH) in Equation 1, that provides such a measure for
one set of k recommended items to one target student at one attempt.</p>
        <p>DCH = ∑_{i=1}^{k} hit(i) / log₂(i + 1) (1)</p>
        <p>Here, hit(i) = 1 if the target student had worked on the i-th learning resource from the
top-k recommendations, and 0 otherwise. According to Equation 1, the lower the selected
learning material is ranked in the recommended item list, the lower the DCH will be (on a logarithmic
scale). Note that the DCG measure cannot be used in our problem directly, since its main
assumption is that a gold-standard item ranking is available from the user to be compared to
the list of recommended items. In other words, to use DCG, users are assumed to be the best
judges of their own interests and to provide their interests in an ordinal format. However, in the
education domain, such a ranked list of learning resources by students at every step of their
trajectories is not available.</p>
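        <p>Equation 1 can be transcribed in a few lines; here `recommended` is the ranked top-k list for one attempt and `worked_on` is the item the student actually attempted (the names are ours, not from the framework).</p>

```python
from math import log2

def dch(recommended, worked_on):
    """Discounted Cumulative Hit (Equation 1): each hit contributes
    1 / log2(rank + 1), so an item chosen from a lower rank counts less."""
    return sum(1.0 / log2(rank + 1)
               for rank, item in enumerate(recommended, start=1)
               if item == worked_on)  # hit(i) = 1 when the student chose rank i
```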
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Inverse Probability Weighted Discounted Cumulative Hit</title>
        <p>So far, DCH only measures how much an algorithm’s suggestion agrees with one attempt of
a student. DCH can be used in a controlled online experiment to compare how much students’
choices agree with one recommender algorithm versus another. However, it is not adequate for
offline evaluation, as the data is collected without the target algorithms’ recommendations
being presented to the user. In other words, if the data was collected in a system with no
recommender algorithms, DCH cannot distinguish whether the agreement arises because the recommended
item is something that the students would have selected even if it had not been recommended
to them. We call recommending such an item a trivial recommendation. For example, a
mandatory reading that is completed by everyone at the beginning of the course can be a
trivial recommendation. Ideally, recommending a non-trivial item with a high utility should be
more valuable than recommending a trivial item. Additionally, in educational systems with a
predetermined order of topics, some students select the items within that fixed topic order. This
can create a bias in the collected data, even if no recommender algorithm is used in the system.</p>
        <p>
          In order to reduce this bias from the logged data, we borrow the inverse propensity scoring
idea from importance sampling-based off-policy evaluation [
          <xref ref-type="bibr" rid="ref11">22, 11</xref>
          ] and normalize the DCH
score by a propensity score p. This propensity score discounts the calculated agreement between
the recommended item and the user-selected item by how trivial the item is. We call it inverse
probability weighted DCH (IPW-DCH), and formally define it as in Equation 2, where p_i is the
propensity of the i-th recommended item.
        </p>
        <p>IPW-DCH = ∑_{i=1}^{k} (1 / p_i) · hit(i) / log₂(i + 1) (2)</p>
        <p>In our experiments, with no recommendation algorithm active at the time of data collection, we
simply use the bi-gram item probability as the propensity score. Given the current working
item i in the training data, we use the conditional probability of the next item j, as in Equation 3. It
can be interpreted as the sequential item popularity in the dataset.</p>
        <p>p_{j|i} = (# times question i is followed by question j) / (# times question i appears). (3)</p>
        <p>Note that p_{j|i} optimistically attributes all encounters of the target student following item
j after item i to the system bias. Consequently, IPW-DCH is a pessimistic indicator of how
frequently or favorably a student would “follow” the recommendation generated by the target
policy. The propensity score can be defined according to the application domain and data
collection setup. For example, if a baseline recommender algorithm is active during the data
collection, the propensity score should be updated to include the bias introduced by this baseline algorithm in
addition to the system bias.</p>
        <p>IPW-DCH focuses on one student’s interaction in one attempt. But a student has a sequence
of attempts, and IPW-DCH should be extended to represent the whole student trajectory. Since
different students have different trajectory lengths, we average all IPW-DCH scores of a student
trajectory to represent their Average IPW-DCH score.</p>
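        <p>Putting Equations 2 and 3 together with the trajectory averaging, a minimal sketch follows; the function and variable names are ours, and `eps` is an assumed small floor for item pairs never seen in the logged data (the paper does not specify how unseen pairs are handled).</p>

```python
from collections import Counter
from math import log2

def bigram_propensity(trajectories):
    """Equation 3: p[j|i] = #(question i followed by j) / #(question i),
    estimated from the logged student trajectories."""
    pair, uni = Counter(), Counter()
    for traj in trajectories:
        for i, j in zip(traj, traj[1:]):
            pair[(i, j)] += 1
            uni[i] += 1
    return {ij: c / uni[ij[0]] for ij, c in pair.items()}

def ipw_dch(prev_item, recommended, worked_on, propensity, eps=1e-6):
    """Equation 2: DCH with each hit re-weighted by the inverse propensity
    of choosing that item right after prev_item."""
    score = 0.0
    for rank, item in enumerate(recommended, start=1):
        if item == worked_on:
            p = propensity.get((prev_item, item), eps)  # assumed floor for unseen pairs
            score += (1.0 / p) / log2(rank + 1)
    return score

def average_ipw_dch(trajectory, recommend, propensity):
    """Average IPW-DCH over one student's trajectory; `recommend` maps the
    current item to the target policy's ranked top-k list."""
    scores = [ipw_dch(prev, recommend(prev), nxt, propensity)
              for prev, nxt in zip(trajectory, trajectory[1:])]
    return sum(scores) / len(scores)
```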
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Correlation between Following Policy and Reward</title>
        <p>Finally, with an effective educational recommender system, we expect the students who usually
follow the recommendations to have a higher long-term utility. In the education domain, such
utility would be better academic performance or a higher knowledge gain. Therefore, in the
end, we evaluate our proposed model based on the correlation between Average IPW-DCH and
students’ academic performance. A stronger positive correlation indicates better performance
on the task of sequential educational learning material recommendation. Particularly, we use
Spearman’s rank correlation coefficient in our experiments, which is defined as below, where n
is the number of test students, and d_i is the difference in the ranks of the i-th student in Average
IPW-DCH and real rewards.</p>
        <p>ρ = 1 − 6 ∑ d_i² / (n (n² − 1)) (4)</p>
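        <p>Equation 4 can be transcribed directly, assuming no tied ranks (with ties, one would fall back to a general rank-correlation implementation such as SciPy’s `spearmanr`):</p>

```python
def spearman_rho(avg_ipw_dch, rewards):
    """Equation 4: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is
    the difference between student i's rank by Average IPW-DCH and their
    rank by the real reward. Assumes no tied values."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    n = len(rewards)
    d2 = sum((x - y) ** 2
             for x, y in zip(ranks(avg_ipw_dch), ranks(rewards)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```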
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. MOCHI Experiment Results</title>
      <p>To demonstrate our proposed evaluation framework, we run it on the real trajectories of the
MasteryGrids data for recommending k = 1 and k = 3 items at each step. The results are
presented in Table 6.</p>
      <p>First, we check the Average DCH and IPW-DCH values. As we can see, the Average DCH
value for InstructSeq is higher than for all other policies. However, the standard deviation of
Average DCH is also the largest for InstructSeq, meaning that not all students follow this topic
sequence, and some may need more guidance than the predefined topic order. The next-highest
Average DCH belongs to HighProbCorr, showing that a few students tend to solve easier problems.
Looking at Average IPW-DCH, the values are again highest for InstructSeq, meaning that, although
InstructSeq suggests items from the predefined topic sequence, this suggestion is not trivial for all
students. Comparing the Random and HighProbCorr policies, although HighProbCorr has a higher
Average DCH, its Average IPW-DCH is lower than Random’s. This shows the non-triviality of
random suggestions compared to the HighProbCorr ones. Interestingly, unlike with 1-item lists,
the Myopic policy has a higher Average IPW-DCH than HighProbCorr with 3-item lists. This
suggests that the Myopic policy offers more non-trivial, interesting suggestions in the second- or
third-ranked recommendations.</p>
      <p>Looking at the correlation values, we can see that InstructSeq has the highest correlations
of Average DCH and IPW-DCH with both post-test and knowledge-gain scores. Notably,
its Average DCH and IPW-DCH values are significantly (p-value &lt; 0.1) correlated with post-test
scores. This means that students who followed the InstructSeq policy had higher post-test
scores. Next, the Myopic policy’s Average IPW-DCH has a significant (p-value &lt; 0.1) positive
correlation with students’ post-test scores with 3-item lists, meaning that for the Myopic policy
to help students, it needs to suggest more items to them. The most reliable correlation with
the knowledge-gain score is the positive relationship with Average IPW-DCH for 1-item lists.
The rest of the correlations are insignificant and inconclusive, with large p-values. This may be
due to the low number of data points, also reflected in the low Average DCH values, but it may
also reflect the ineffectiveness of the studied policies on student performance. Overall, our results
show that InstructSeq is better than the other policies when considering post-test score rewards.
This is in agreement with our expectations in Section 2.2.</p>
      <p>Additionally, for InstructSeq, the correlation with post-test scores is lower with 1-item lists
than with 3-item lists, whereas the correlation with knowledge gain is higher with 1-item lists
than with 3-item lists. This indicates that students with high post-test scores and high knowledge
gain benefit more from 1-item lists, while students with high post-test scores but low knowledge
gain benefit more from 3-item lists. In other words, a stricter recommendation (a single item) is
needed for the success of students with lower prior knowledge, while for students with already
high prior knowledge, more freedom (a 3-item list) can be more beneficial.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this paper, we investigated the state-of-the-art offline evaluation method, Robust Evaluation
Matrix, on a real-world educational dataset. We found that model-based evaluations are not
reliable, and that their results can be contradictory and highly dependent on the student simulation
and reward models. We concluded that a model-free evaluation method is necessary, especially
for domains with delayed, independently-measured rewards. We also proposed MOCHI, a
model-free offline evaluation framework, as an additional tool for validating recommendation
policies: it does not rely on estimation models, can evaluate list recommendations, and uses
only the collected offline data. In our experiments, we showed how MOCHI’s results can be
interpreted, and that our proposed metric meets the expected results and can serve as an auxiliary
tool for the offline evaluation of educational recommender systems. MOCHI’s limitations include
the difficulty of working with policies that have very few instances of agreement with student
trajectories in the offline data, which results in insignificant correlations. In future work, we
would like to investigate our proposed method further with online experiments.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to thank Dr. Shayan Doroudi for providing and discussing his implementation of
REM with us. This paper is based upon work supported by the National Science Foundation
under Grant No. 2047500.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Doroudi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Aleven</surname>
          </string-name>
          , E. Brunskill,
          <article-title>Where's the Reward? a review of reinforcement learning for instructional sequencing</article-title>
          ,
          <source>International Journal of Artificial Intelligence in Education</source>
          <volume>29</volume>
          (
          <year>2019</year>
          )
          <fpage>568</fpage>
          -
          <lpage>620</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.-I.</given-names>
            <surname>Dascalu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-N.</given-names>
            <surname>Bodea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Mihailescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Tanase</surname>
          </string-name>
          , P. Ordoñez de Pablos,
          <article-title>Educational recommender systems and their application in lifelong learning</article-title>
          ,
          <source>Behaviour &amp; information technology 35</source>
          (
          <year>2016</year>
          )
          <fpage>290</fpage>
          -
          <lpage>297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Erdt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rensing</surname>
          </string-name>
          ,
          <article-title>Evaluating recommender systems for technology enhanced learning: a quantitative survey</article-title>
          ,
          <source>IEEE Transactions on Learning Technologies</source>
          <volume>8</volume>
          (
          <year>2015</year>
          )
          <fpage>326</fpage>
          -
          <lpage>344</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.-Y.</given-names>
            <surname>Teng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. P.-Y.</given-names>
            <surname>Ting</surname>
          </string-name>
          , K.-T. Chuang, H. Liu,
          <article-title>Interactive unknowns recommendation in e-learning systems</article-title>
          ,
          <source>in: 2018 IEEE International Conference on Data Mining (ICDM)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>497</fpage>
          -
          <lpage>506</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirzaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sahebi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brusilovsky</surname>
          </string-name>
          ,
          <article-title>Detecting trait vs. performance student behavioral patterns using discriminative non-negative matrix factorization</article-title>
          ,
          <source>in: The 33rd International FLAIRS Conference</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Boyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lester</surname>
          </string-name>
          ,
          <article-title>Evaluating state representations for reinforcement learning of turn-taking policies in tutorial dialogue</article-title>
          ,
          <source>in: Proceedings of the SIGDIAL 2013 Conference</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>339</fpage>
          -
          <lpage>343</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chi</surname>
          </string-name>
          , K. VanLehn, D. Litman,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <article-title>Empirically evaluating the application of reinforcement learning to the induction of effective and adaptive pedagogical strategies</article-title>
          ,
          <source>User Modeling and User-Adapted Interaction</source>
          <volume>21</volume>
          (
          <year>2011</year>
          )
          <fpage>137</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Rafferty</surname>
          </string-name>
          , E. Brunskill,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Griffiths</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shafto</surname>
          </string-name>
          ,
          <article-title>Faster teaching via pomdp planning</article-title>
          ,
          <source>Cognitive science 40</source>
          (
          <year>2016</year>
          )
          <fpage>1290</fpage>
          -
          <lpage>1332</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lester</surname>
          </string-name>
          ,
          <article-title>Optimizing player experience in interactive narrative planning: A modular reinforcement learning approach</article-title>
          ,
          <source>in: Tenth Artificial Intelligence and Interactive Digital Entertainment Conference</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Doroudi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Aleven</surname>
          </string-name>
          , E. Brunskill,
          <article-title>Robust evaluation matrix: Towards a more principled offline exploration of instructional policies</article-title>
          ,
          <source>in: Proceedings of the fourth (2017) ACM conference on learning@ scale</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Voloshin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <article-title>Empirical study of off-policy policy evaluation for reinforcement learning</article-title>
          , arXiv preprint arXiv:1911.06854 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Langford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Schapire</surname>
          </string-name>
          ,
          <article-title>A contextual-bandit approach to personalized news article recommendation</article-title>
          ,
          <source>in: Proceedings of the 19th international conference on World wide web</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>661</fpage>
          -
          <lpage>670</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Langford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Unbiased ofline evaluation of contextual-bandit-based news article recommendation algorithms</article-title>
          ,
          <source>in: Proceedings of the fourth ACM international conference on Web search and data mining</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>297</fpage>
          -
          <lpage>306</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Baraniuk</surname>
          </string-name>
          ,
          <article-title>A contextual bandits framework for personalized learning action selection</article-title>
          ,
          <source>in: Proceedings of the 9th International Conference on Educational Data Mining (EDM-2016)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Interactive collaborative filtering</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM international conference on Information &amp; Knowledge Management</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1411</fpage>
          -
          <lpage>1420</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] A. T. Corbett, J. R. Anderson,
          <article-title>Knowledge tracing: Modeling the acquisition of procedural knowledge</article-title>
          ,
          <source>User Modeling and User-Adapted Interaction</source>
          <volume>4</volume>
          (
          <year>1994</year>
          )
          <fpage>253</fpage>
          -
          <lpage>278</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] S. Hochreiter, J. Schmidhuber,
          <article-title>Long short-term memory</article-title>
          ,
          <source>Neural Computation</source>
          <volume>9</volume>
          (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. Guibas, J. Sohl-Dickstein,
          <article-title>Deep knowledge tracing</article-title>
          ,
          <source>in: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>505</fpage>
          -
          <lpage>513</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] J. Zhang, X. Shi, I. King, D.-Y. Yeung,
          <article-title>Dynamic key-value memory networks for knowledge tracing</article-title>
          ,
          <source>in: Proceedings of the 26th International Conference on World Wide Web</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>765</fpage>
          -
          <lpage>774</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] C. Wang, S. Zhao, S. Sahebi,
          <article-title>Learning from non-assessed resources: Deep multi-type knowledge tracing</article-title>
          ,
          <source>in: Proceedings of the 14th International Conference on Educational Data Mining (EDM-2021)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21] D. Lakens,
          <article-title>Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs</article-title>
          ,
          <source>Frontiers in Psychology</source>
          <volume>4</volume>
          (
          <year>2013</year>
          )
          <fpage>863</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22] J. P. Hanna, S. Niekum, P. Stone,
          <article-title>Importance sampling policy evaluation with an estimated behavior policy</article-title>
          , arXiv preprint arXiv:1806.01347 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>