The Effect of Variations of Prior on Knowledge Tracing

Matti Nelimarkka
School of Information, UC Berkeley
102 South Hall, Berkeley, California 94720-4600
Helsinki Institute for Information Technology HIIT, Aalto University
PO Box 15600, 00076 Aalto, Finland
matti.nelimarkka@hiit.fi

Madeeha Ghori
Department of Electrical Engineering and Computer Sciences, UC Berkeley
387 Soda Hall, Berkeley, California 94720-1776
madeeha.ghori@berkeley.edu

ABSTRACT
Knowledge tracing is a method which approximates a student's knowledge state using a Bayesian network. As the applications of this method increase, it is vital to understand the limits of this approximation. We are interested in how well knowledge tracing performs when students' prior knowledge of the topic is extremely high or low. Our results indicate that the estimates become more erroneous when prior knowledge is extremely high (prior = 0.90).

Keywords
bayesian knowledge tracing, personalization, prior, parameter estimation

1. INTRODUCTION
The Bayesian Knowledge Tracing (BKT) algorithm was developed in 1994 in an effort to model students' changing knowledge state during skill acquisition [5]. The idea is to infer students' knowledge – a hidden variable – from observed answers to a set of questions. The algorithm tracks the change in this probability distribution over time using a simple Bayes' net. The model is often presented as four parameters: prior, learn, guess and slip (see Figure 1; a single update step is sketched at the end of this section). Prior is the probability that the student knows the material initially, before any practice; learn is the probability that a student who did not have the skill acquires it by doing an exercise; guess is the probability of accidentally answering a question correctly without the skill; and slip is the probability of accidentally answering incorrectly despite having the skill.

Figure 1: The model of knowledge tracing

Knowledge tracing is the most prominent method used to model student knowledge acquisition and is used in most intelligent learning systems. These systems have been said to outperform humans since 2001 [3] and have been used in the real world to tutor students [4]. For these reasons it is important to fully understand the strengths and limitations of knowledge tracing before applying it more widely in the classroom. As the parameters of the model are not known in advance, they must be estimated from the given data. Previous research has demonstrated that the accuracy of parameter estimation – and therefore of knowledge tracing – can be improved by applying different heuristics [17, 13] or methods [16, 18], including personalizing the model for each user [20, 8] or extending the data used for analysis [15, 6, 1].

Our work starts from a different premise: how robust is the BKT approach to variation in the parameter space? Our special interest is in the prior variable, which corresponds to a student's knowledge of the topic before answering a question. In any classroom, MOOC or otherwise, some students will come in with a better understanding of the material than others. It is therefore important to study how well knowledge tracing estimates parameters when the prior is extremely high or low.

If knowledge tracing models are inaccurate in modelling students with a certain prior parameter, then smart tutors and other systems designed to help those students learn will be less effective. This is especially concerning if the students being modelled inaccurately are those doing poorly in the class, since smart tutors exist above all to help them.
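To make the four parameters concrete, the following minimal sketch implements one step of the standard BKT update: Bayes' rule conditions the knowledge estimate on the observed answer, and the learn parameter then drives the knowledge transition. The function name and example values are our own illustration, not part of any particular tutoring system.

```python
# A minimal sketch of the standard single-step BKT update,
# assuming the four-parameter formulation described above.

def bkt_update(p_know: float, correct: bool,
               learn: float, guess: float, slip: float) -> float:
    """Return P(know) after observing one answer."""
    if correct:
        # Bayes rule: a correct answer is either known-and-not-slipped
        # or unknown-and-guessed.
        evidence = p_know * (1 - slip) + (1 - p_know) * guess
        posterior = p_know * (1 - slip) / evidence
    else:
        evidence = p_know * slip + (1 - p_know) * (1 - guess)
        posterior = p_know * slip / evidence
    # Learning transition: an unknown skill may be acquired this step.
    return posterior + (1 - posterior) * learn

# Example: a student with prior 0.15 answers the first question correctly.
p = bkt_update(0.15, True, learn=0.10, guess=0.10, slip=0.05)
```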
2. PREVIOUS WORK
For the purposes of this work, we briefly summarize three kinds of methods previously applied to improve the prediction capabilities of BKT models. These methods are, however, insufficient to address the practical problem described above, which motivates our own experiment.

2.1 Individualization
Yudelson et al. [20] experimented with individualization by bringing student-specific parameters into the BKT algorithm on a larger scale. They split the usual skill-specific BKT parameters into two components: one skill-specific and one student-specific. They then built several individualized BKT models, adding student-specific parameters in batches and examining the effect each addition had on the model's performance. They found that student-specific prior parameters did not provide a vast improvement, whereas a student-specific learn parameter significantly improved the model's prediction accuracy.

Pardos and Heffernan furthered this line of work by developing a method of formulating the individualization within the Bayes' net framework [11]. Especially interesting for our work are the different prior values and methods suggested for this individualization. Pardos observes that models using student-specific priors based on students' prior knowledge clearly outperform the traditional knowledge tracing approach. This contrasts with Yudelson et al.'s findings [20], but it still underscores the importance of individualization in the BKT algorithm.

Related to individualization per user, there has also been discussion of using different values per resource. It can be argued that different exercises teach different topics [7, 14]. This can be further used to individualize the model for different topics, an approach which has gained initial support in empirical studies [14].

2.2 Enhancing the data
The second approach is to enhance the data used for prediction. In its simplest form, this can be done by adding additional relevant data, such as data from past years, to the analysis [15]. Others have explored adding general domain-related knowledge to the models, and suggest that this indeed improves the estimates [6].

However, the current direction in enhanced data relates to information available on user interaction – especially in MOOC environments where it is possible to access this kind of data. To illustrate, Baker, Corbett, and Aleven [1] explore interactions with the learning system and other non-exercise data, such as time spent on answering and help requests, to distinguish slips from guesses.

We applaud these efforts and acknowledge that data other than just student responses may indeed help to detect both the cases where initial knowledge (prior) is high and where it is low, instead of tweaking the EM algorithm further.

2.3 Improving the methods
There are several heuristics currently used to enhance the BKT algorithm. One such heuristic involves requiring the sum of slip and guess to be less than or equal to 1 [17]. Other work determined that the starting parameter estimates can affect where the algorithm converges. To improve the accuracy of the convergence, it was suggested that starting parameters be selected from a Dirichlet distribution derived from the data set [2, 13]; a sketch of this idea closes this section.

There have also been efforts to explore other machine learning methods on educational data. Initial trials born in the KDD Cup competition used a medley of random forests and other machine learning algorithms, but these methods have proven largely unsuccessful [16, 18].

The knowledge tracing community, while accepting the validity of some of these heuristics [9, 12], has criticized their inability to provide any insight into the student learning model. Individualization, however, has the potential to improve the BKT algorithm while also providing a pedagogical explanation for said improvements.
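To illustrate that heuristic, the sketch below draws EM starting points from Beta distributions (the two-category case of the Dirichlet) whose means are derived from the data set. The concentration value `strength`, the statistic used to centre the prior, and all names are our assumptions, not the exact procedure of [2, 13].

```python
# A minimal sketch of Dirichlet-style initialization for EM restarts,
# in the spirit of [2, 13]. Concentration and centring statistics are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sample_start(mean: float, strength: float = 20.0) -> float:
    """Draw one BKT starting parameter from a Beta distribution
    whose mean is a dataset-derived estimate."""
    a = mean * strength
    b = (1 - mean) * strength
    return rng.beta(a, b)

# e.g. centre the prior on the observed first-response accuracy
first_correct_rate = 0.42  # hypothetical dataset statistic
starts = [{"prior": sample_start(first_correct_rate),
           "learn": sample_start(0.15),
           "guess": sample_start(0.10),
           "slip":  sample_start(0.08)} for _ in range(10)]
```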
3. METHODOLOGY
We began by generating datasets with specific, known ground-truth parameters in order to simulate groups of students at different knowledge levels. We then ran expectation maximization (EM) on these datasets and allowed knowledge tracing to calculate its own estimated parameters. Finally, we compared these estimated parameters to the original ones used for generation to determine whether the accuracy of the parameter estimation depends on the ground-truth parameters.

Figure 2: The approach used in this study

3.1 Generating the Data
As our goal was to determine how the ground-truth prior affects parameter estimation, we varied the prior used to synthesize the data sets. We used six different priors (0.15, 0.30, ..., 0.75, 0.90) and two variations each of learn, slip and guess (0.10 and 0.20 for learn and guess; 0.05 and 0.10 for slip; see Table 1), giving a total of 48 parameter combinations. Each data set consists of 10,000 students and 20 observations per student. To increase the variation, we generated 6 datasets per condition; the generative process is sketched at the end of this section. This kind of simulation approach has previously been used to evaluate the success of Bayesian machine learning methods [8].

Table 1: Ground Truth Parameter Sets

                    prior   learn   guess   slip
Set 1.1 ... 1.6     0.15    0.10    0.10    0.05
Set 2.1 ... 2.6     0.30    0.10    0.10    0.05
Set 3.1 ... 3.6     0.15    0.20    0.10    0.05
...
Set 48.1 ... 48.6   0.90    0.20    0.20    0.10

3.2 Analysis Procedure
For each data set, we estimated the parameters with the expectation maximization (EM) algorithm, using the fastHMM implementation [10]. The parameter estimation was conducted as a grid search over ten starting parameter values, and the best-fitting model was selected by log likelihood.

Using our 288 data sets, we can compare the estimates and ground truths for each parameter and analyze the accuracy of the estimates. We apply root-mean-square error (RMSE) and visualizations in this analysis. Using RMSE, we can see whether certain ground truths lend themselves to more accurate estimation.
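For concreteness, the following sketch shows the data synthesis of Section 3.1 under standard BKT generative semantics: each simulated student starts in the known state with probability prior, emits answers through guess and slip, and transitions to the known state via learn. The parameter values follow Table 1; names and structure are our own illustration, and the actual estimation step uses the fastHMM implementation [10], which is not sketched here.

```python
# A minimal sketch of the data synthesis described in Section 3.1,
# assuming standard BKT generative semantics.
import numpy as np

rng = np.random.default_rng(0)

def simulate(prior, learn, guess, slip, n_students=10_000, n_obs=20):
    """Return an (n_students, n_obs) 0/1 matrix of simulated answers."""
    data = np.zeros((n_students, n_obs), dtype=int)
    for s in range(n_students):
        known = rng.random() < prior          # initial knowledge state
        for t in range(n_obs):
            p_correct = (1 - slip) if known else guess
            data[s, t] = rng.random() < p_correct
            if not known:                     # learning transition
                known = rng.random() < learn
    return data

# One of the 48 ground-truth conditions, e.g. Set 1.x in Table 1:
answers = simulate(prior=0.15, learn=0.10, guess=0.10, slip=0.05)
```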
4. RESULTS
First, let us explore the parameter estimation in detail. The average RMSE in the data (Table 2) indicates that the prediction quality decreases as the prior increases; the variance of the RMSE also increases. This indicates that predictions with higher priors are, first, more erroneous and, second, converge in a larger area, producing variance. To confirm these observations, we conducted Wilcoxon-Mann-Whitney tests to explore whether the computed RMSEs differed in a statistically significant manner. As shown in Table 3, the RMSEs computed from the datasets with prior 0.15 differ significantly from all other groups, and those with prior 0.90 differ significantly from all groups except 0.45 (p < 0.05). We therefore conclude that the EM algorithm performs badly when the prior is high.

Table 2: The mean and variance of the root-mean-square errors per prior

Ground truth prior   mean RMSE   var RMSE
0.15                 0.056639    0.000594
0.30                 0.069073    0.001137
0.45                 0.070005    0.000584
0.60                 0.074044    0.001874
0.75                 0.075946    0.002229
0.90                 0.085257    0.004876

Table 3: Significant differences between the RMSEs (p-values)

        0.15   0.30      0.45      0.60      0.75      0.90
0.15    1      < 0.001   < 0.001   < 0.001   < 0.001   < 0.001
0.30           1         0.347     0.614     0.967     0.014
0.45                     1         0.660     0.125     0.081
0.60                               1         0.744     0.035
0.75                                         1         0.007
0.90                                                   1

To further understand this phenomenon, we explored the estimates per parameter. The errors per parameter are shown in Figure 3. The mean estimates stay fairly close to zero, though a higher prior does affect variance: as the ground-truth prior increases, the variance of guess and learn increases while the variance of prior decreases. In theory, a lower variance in the prior prediction should imply a more accurate prior estimate. However, as we saw in Table 2, this is not actually the case: the prior estimate gets less accurate as the value of the ground-truth prior increases. Figure 3 echoes the results of Table 2: the prediction accuracy decreases when the prior reaches 0.60 and continues to decrease as the prior increases.

Figure 3: Predicting parameters with different values of prior

Figure 4 shows the log likelihood for each of the parameter combinations we analyzed. We see a slight but non-significant increase in the log likelihoods as the prior increases, suggesting that the model is performing better – even while our RMSE error indicator demonstrates otherwise. It is also noteworthy that when slip is 0.10, all log likelihoods range between -65500 and -65250, but when slip is 0.05, all log likelihoods range between -40000 and -35750, indicating that the slip value had a dramatic effect on the model estimation accuracy.

Figure 4: Log likelihoods with different parameters

5. IMPLICATIONS
Our findings indicate that there are higher errors in the parameter estimates when the prior is high (0.90). This is probably due to the lack of evidence available for the HMM to attribute to the learn and guess parameters. One way to examine the impact of these errors is to study students' subjective experience in different conditions [19]. As our data is synthetic, we cannot measure the time students lose due to such errors, as examined by Yudelson and Koedinger [19]. Instead, we explore the difference in the number of questions students need to answer to achieve mastery learning – for our purposes, an estimated knowledge above 95%, assuming that the student answers every question correctly.

Examining the case of high prior knowledge, when the true learn was 0.1, we observed that the majority of students needed more than 5 answers to achieve mastery: of the 168 predicted value sets available, only 24 achieved mastery within 5 responses. For high learn (0.2) the situation was not significantly better: 56 value sets achieved mastery within 5 responses. This indicates that the estimation errors had a significant impact on student learning, and it highlights the importance of this study.
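As an illustration of this mastery criterion, the sketch below counts how many consecutive correct answers are needed before the estimated knowledge exceeds 0.95 under a given parameter set. The function and the 20-question cap (matching our simulated sequences) are our own illustration, not the exact procedure used to produce the counts above.

```python
# A minimal sketch of the mastery criterion: consecutive correct
# answers needed before P(know) exceeds 0.95 under one parameter set.

def steps_to_mastery(prior, learn, guess, slip,
                     threshold=0.95, max_steps=20):
    p = prior
    for step in range(1, max_steps + 1):
        # Bayes update after a correct answer, then learning transition.
        post = p * (1 - slip) / (p * (1 - slip) + (1 - p) * guess)
        p = post + (1 - post) * learn
        if p >= threshold:
            return step
    return None  # mastery not reached within the cap

# e.g. an under-estimated prior forces extra questions before mastery:
steps_to_mastery(prior=0.30, learn=0.10, guess=0.10, slip=0.05)
```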
6. CONCLUSIONS
We started this study with the motivation to explore how well the knowledge tracing method performs when the prior is high or low; this performance has practical implications when applying the approach in a heterogeneous classroom where students arrive with very different knowledge of the domain. We studied this empirically by generating 288 different synthetic datasets and exploring the difference between the predicted parameters and the parameters used to generate each dataset.

Our results indicated a slight increase in the estimation error when the prior was 0.90, which we mostly attribute to higher error in the learn and guess parameters. This observation was statistically significant and is most likely due to the fact that students with high priors produce less information for the HMM to use in estimating the guess and learn parameters.

We explored the influence these errors had on the estimated probability of knowledge and observed that they significantly reduced the speed at which students achieved mastery learning. This result implies that more work needs to be done to detect students with high prior knowledge in order to cater to their learning needs.

Acknowledgments
This work was conducted during the UC Berkeley School of Information class "INFO290: Machine learning in education" instructed by Zach Pardos. We thank the course staff and peers for their support with the presentation.

References
[1] Ryan S. J. d. Baker, Albert T. Corbett, and Vincent Aleven. More accurate student modeling through contextual estimation of slip and guess probabilities in bayesian knowledge tracing. In Beverley P. Woolf, Esma Aïmeur, Roger Nkambou, and Susanne Lajoie, editors, Intelligent Tutoring Systems, volume 5091 of Lecture Notes in Computer Science, pages 406–415. Springer Berlin Heidelberg, 2008.
[2] Joseph E. Beck and Kai-min Chang. Identifiability: A fundamental problem of student modeling. Pages 137–146, 2007. doi: 10.1007/978-3-540-73078-1_17.
[3] Albert Corbett. Cognitive computer tutors: Solving the two-sigma problem. In User Modeling 2001, volume 2109 of Lecture Notes in Computer Science, pages 137–147. Springer Berlin Heidelberg, 2001.
[4] Albert Corbett, Megan McLaughlin, and K. Christine Scarpinatto. Modeling student knowledge: Cognitive tutors in high school and college. User Modeling and User-Adapted Interaction, 10(2-3):81–108, 2000.
[5] Albert T. Corbett and John R. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4):253–278, 1994.
[6] Albert T. Corbett and Akshat Bhatnagar. Student modeling in the ACT programming tutor: Adjusting a procedural learning model with declarative knowledge. Courses and Lectures – International Centre for Mechanical Sciences, pages 243–254, 1997.
[7] Tanja Käser, Severin Klingler, Alexander Gerhard Schwing, and Markus Gross. Beyond knowledge tracing: Modeling skill topologies with bayesian networks. In Stefan Trausan-Matu, Kristy Elizabeth Boyer, Martha Crosby, and Kitty Panourgia, editors, Intelligent Tutoring Systems, volume 8474 of Lecture Notes in Computer Science, pages 188–198. Springer International Publishing, 2014.
[8] Z. A. Pardos and N. T. Heffernan. Navigating the parameter space of Bayesian Knowledge Tracing models: Visualizations of the convergence of the Expectation Maximization algorithm. In Proceedings of the 3rd International Conference on Educational Data Mining, 2010.
[9] Z. A. Pardos and N. T. Heffernan. Using HMMs and bagged decision trees to leverage rich features of user and skill from an intelligent tutoring system dataset. Journal of Machine Learning Research W & CP, 2010. URL http://people.csail.mit.edu/zp/papers/pardos_JMLR_in_press.pdf.
[10] Z. A. Pardos, M. J. Johnson, et al. Scaling cognitive modeling to massive open environments. TOCHI Special Issue on Learning at Scale, in preparation.
[11] Zachary A. Pardos and Neil T. Heffernan. Modeling individualization in a bayesian networks implementation of knowledge tracing. In Paul De Bra, Alfred Kobsa, and David Chin, editors, User Modeling, Adaptation, and Personalization, volume 6075 of Lecture Notes in Computer Science, pages 255–266. Springer Berlin Heidelberg, 2010. ISBN 978-3-642-13469-2.
[12] Zachary A. Pardos, Sujith M. Gowda, Ryan S. J. d. Baker, and Neil T. Heffernan. The sum is greater than the parts. ACM SIGKDD Explorations Newsletter, 13(2):37, May 2012. ISSN 1931-0145. doi: 10.1145/2207243.2207249. URL http://dl.acm.org/citation.cfm?doid=2207243.2207249.
[13] Dovan Rai, Yue Gong, and Joseph E. Beck. Using dirichlet priors to improve model parameter plausibility. International Working Group on Educational Data Mining, 2009.
[14] Leena Razzaq, Neil T. Heffernan, Mingyu Feng, and Zachary A. Pardos. Developing fine-grained transfer models in the ASSISTment system. Technology, Instruction, Cognition & Learning, 5(3):1–16, 2007.
[15] Steven Ritter, Thomas K. Harris, Tristan Nixon, Daniel Dickison, R. Charles Murray, and Brendon Towle. Reducing the knowledge tracing space. International Working Group on Educational Data Mining, 2009.
[16] A. Töscher and Michael Jahrer. Collaborative filtering applied to educational data mining. Journal of Machine Learning Research, 2010.
[17] Brett van de Sande. Properties of the Bayesian Knowledge Tracing model. Journal of Educational Data Mining, 5(2):1–10, 2013.
[18] Hsiang-Fu Yu, Hung-Yi Lo, Hsun-Ping Hsieh, Jing-Kai Lou, Todd G. McKenzie, Jung-Wei Chou, Po-Han Chung, Chia-Hua Ho, Chun-Fu Chang, Yin-Hsuan Wei, et al. Feature engineering and classifier ensemble for KDD Cup 2010. JMLR: Workshop and Conference Proceedings, 1, 2010.
[19] Michael V. Yudelson and Kenneth R. Koedinger. Estimating the benefits of student model improvements on a substantive scale. In Proceedings of the 6th International Conference on Educational Data Mining, 2013.
[20] Michael V. Yudelson, Kenneth R. Koedinger, and Geoffrey J. Gordon. Individualized bayesian knowledge tracing models. In Artificial Intelligence in Education, pages 171–180. Springer, 2013.