=Paper= {{Paper |id=Vol-1183/bkt20y_paper06 |storemode=property |title= The Effect of Variations of Prior on Knowledge Tracing |pdfUrl=https://ceur-ws.org/Vol-1183/bkt20y_paper06.pdf |volume=Vol-1183 |dblpUrl=https://dblp.org/rec/conf/edm/NelimarkkaG14 }} == The Effect of Variations of Prior on Knowledge Tracing== https://ceur-ws.org/Vol-1183/bkt20y_paper06.pdf
     The Effect of Variations of Prior on Knowledge Tracing

                         Matti Nelimarkka                                         Madeeha Ghori
              School of Information, UC Berkeley                     Department of Electrical Engineering and
                           102 South Hall                               Computer Sciences, UC Berkeley
                 Berkeley, California 94720-4600                                  387 Soda Hall
        Helsinki Institute for Information Technology HIIT,             Berkeley, California 94720-17761
                          Aalto University                              madeeha.ghori@berkeley.edu
                           PO Box 15600
                        Aalto, Finland 00076
                    matti.nelimarkka@hiit.fi

ABSTRACT
Knowledge tracing is a method that approximates a student's knowledge state using a Bayesian network. As the applications of this method increase, it is vital to understand the limits of this approximation. We are interested in how well knowledge tracing performs when students' prior knowledge of the topic is extremely high or low. Our results indicate that the estimates become more erroneous when prior knowledge is extremely high (prior = 0.90).

Keywords
bayesian knowledge tracing, personalization, prior, parameter estimation

1.   INTRODUCTION
The Bayesian Knowledge-Tracing (BKT) algorithm was developed in 1995 in an effort to model students' changing knowledge state during skill acquisition [5]. The idea is to interpret students' knowledge – a hidden variable – based on observed answers to a set of questions. The algorithm tracks the change in this probability distribution over time using a simple Bayes' net. The model is often presented as four parameters: prior, learn, guess and slip (see Figure 1). Prior refers to the probability that the student knows the material initially, before acquiring any skills; learn indicates that the student did not have the skill initially but acquired it through doing the exercise; guess refers to accidentally answering a question correctly; and slip to accidentally answering it wrong.

Knowledge tracing is the most prominent method used to model student knowledge acquisition and is used in most intelligent learning systems. These systems have been said to be outperforming humans since 2001 [3] and have been used in the real world to tutor students [4]. For these reasons it is important to fully understand the strengths and limitations of knowledge tracing before applying it more widely in the classroom. As the parameters of the model are not known, there is a need to estimate these parameters from the given data. Previous research has demonstrated that the accuracy of parameter estimation – and therefore of knowledge tracing – can be improved by applying different heuristics [17, 13] or methods [16, 18], including personalizing the model for each user [20, 8] or extending the data used for analysis [15, 6, 1].

Our work starts from a different premise: how robust is the BKT approach to variation in the parameter space? Our special interest is in the prior variable, which corresponds to a student's knowledge of the topic before answering a question. In any classroom, MOOC or otherwise, some students will come in with a better understanding of the material than others. Therefore it is important to study the effectiveness of knowledge tracing parameter estimation when prior is extremely high or low.

If knowledge tracing models are inaccurate in modelling students with a certain prior parameter, then smart tutors and other systems designed to help those students learn will be less effective. This is especially problematic if the students being modelled inaccurately are those doing poorly in the class, as the smart tutors exist to help them the most.

Figure 1: The model of knowledge tracing
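These four parameters define a simple recursive update: after each observed answer, the knowledge estimate is conditioned on the answer and then passed through the learning transition. The following is a minimal sketch of that standard update (our illustration, not code from any of the systems cited; the parameter values are only examples):

```python
def bkt_update(p_know, correct, learn, guess, slip):
    """One BKT step: condition P(known) on an observed answer, then
    apply the learning transition with probability `learn`."""
    if correct:
        # The student knew the skill and did not slip, or guessed correctly.
        posterior = (p_know * (1 - slip)) / (
            p_know * (1 - slip) + (1 - p_know) * guess)
    else:
        # The student knew the skill but slipped.
        posterior = (p_know * slip) / (
            p_know * slip + (1 - p_know) * (1 - guess))
    # An unlearned skill is acquired with probability `learn`.
    return posterior + (1 - posterior) * learn

# Illustration: a low-prior student answering correctly until the
# knowledge estimate crosses a 0.95 mastery threshold.
p, steps = 0.15, 0
while p < 0.95:
    p = bkt_update(p, True, learn=0.10, guess=0.10, slip=0.05)
    steps += 1
```

With a low guess probability, each correct answer is strong evidence and the estimate climbs quickly; the same kind of loop underlies the mastery-learning cutoffs discussed later in the paper.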
2.    PREVIOUS WORK
For the purposes of this work, we briefly summarize three methods previously applied to improve the prediction capabilities of BKT models. However, these methods are insufficient to address the practical problem described above, creating the need for our own experiment.

2.1    Individualization
Yudelson et al. [20] experimented with individualization by
bringing student-specific parameters into the BKT algorithm
on a larger scale. They split the usual skill-specific BKT
parameters into two components: one skill-specific and one
student-specific. They then built several individualized BKT
models and added student-specific parameters in batches,
examining the effect each addition had on the model’s per-
formance. They found that student-specific prior parame-
ters did not provide a vast improvement. However, student-
specific learning provided a significant improvement to the
model’s prediction accuracy.
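One way to picture such a skill/student split – a hypothetical sketch of the general idea, not Yudelson et al.'s actual formulation; the log-odds combination and all names below are our assumptions – is to offset a skill-level parameter by a per-student term:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def individualized_param(skill_value, student_offset):
    """Combine a skill-level probability with a per-student offset on the
    log-odds scale; the result is always a valid probability."""
    return sigmoid(logit(skill_value) + student_offset)

# A skill-level learn rate of 0.10 adjusted for two hypothetical students:
fast = individualized_param(0.10, +1.0)  # quicker learner, > 0.10
slow = individualized_param(0.10, -1.0)  # slower learner, < 0.10
```

A zero offset recovers the plain skill-level parameter, matching the intuition that individualization refines, rather than replaces, the skill-specific BKT model.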

Pardos and Heffernan furthered the experiment by developing a method of formulating the individualization within the Bayes' Net framework [11]. Especially interesting in terms of our work are the different prior values and the methods suggested for this individualization. Pardos observes that models taking student-specific priors based on students' prior knowledge clearly outperform the traditional knowledge tracing approach. This contrasts with Yudelson et al.'s findings [20], but it still underscores the importance of individualization in the BKT algorithm.

Related to individualization per user, there has been discussion on using different values per resource. It can be argued that different exercises teach different topics [7, 14]. This can be further used to individualize the model for different topics, an approach which has gained initial support in empirical studies [14].

2.2    Enhancing the data
The second approach to improving these methods relates to enhancing the data used for prediction. In its simplest form, this can be done by adding additional relevant data, such as data from past years, to the analysis [15]. Others have explored the possibility of adding more general domain-related knowledge to the models, and suggest that this indeed improves the estimates [6].

However, the current direction in enhanced data relates to information available on user interaction – especially in MOOC environments, where it is possible to access this kind of data. To illustrate, Baker, Corbett, and Aleven [1] explore interactions with the learning system and other non-exercise-related data, such as time spent on answering and asking for help, to determine the difference between slips and guesses.

We applaud these efforts and acknowledge that data other than just student responses may indeed help to detect both the cases where initial knowledge (prior) is high and where it is low, instead of tweaking the EM algorithm further.

2.3    Improving the methods
There are several heuristics currently used to enhance the BKT algorithm. One such heuristic involves expecting the sum of slip and guess to be less than or equal to 1 [17]. Other work determined that the starting parameter estimates could affect where the algorithm converges. To improve the accuracy of the convergence, it was suggested that starting parameters be selected from a Dirichlet distribution derived from the data set [2, 13].

There have also been efforts to explore other machine learning methods on educational data. Initial trials born in the KDD Cup competition use a medley of random forests and other machine learning algorithms, but these methods have proven largely unsuccessful [16, 18].

The knowledge tracing community, while accepting the validity of some of these heuristics [9, 12], has criticized their inability to provide any insight into the student learning model. Individualization, however, has the potential to improve the BKT algorithm while also providing a pedagogical explanation for said improvements.

3.   METHODOLOGY
We began by generating datasets with specific known initial parameters in order to simulate groups of students at different knowledge levels. We then ran expectation maximization (EM) on these datasets and allowed knowledge tracing to calculate its own estimated parameters. We then compared these estimated parameters to the original ones used for generation to determine whether the accuracy of the parameter estimation depends on the initial parameters.

Figure 2: The approach used in this study

Table 1: Ground Truth Parameter Sets

                      prior   learn   guess   slip
Set 1.1 ... 1.6        0.15    0.10    0.10   0.05
Set 2.1 ... 2.6        0.30    0.10    0.10   0.05
Set 3.1 ... 3.6        0.15    0.20    0.10   0.05
  ...
Set 48.1 ... 48.6      0.90    0.20    0.20   0.10
3.1    Generating the Data
As our goal was to determine how the prior ground truth affects parameter estimation, we varied the prior used to synthesize the data sets. We used six different priors (0.15, 0.30, ..., 0.75, 0.90) and two variations each of learn, slip and guess¹ (see Table 1), for a total of 48 variations of these parameters. Each of these data sets consists of 10,000 students and 20 observations per student. To increase the variation, we generated 6 datasets per condition. This kind of simulated approach has been previously used to evaluate the success of Bayesian machine learning methods [8].
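This generation step amounts to sampling from the BKT model as a two-state hidden Markov model with no forgetting. A minimal sketch of how it can be done (our illustration, not the code used in the study; the dataset sizes follow the description above, and `rmse` mirrors the comparison described in Section 3.2):

```python
import random

def generate_student(prior, learn, guess, slip, n_obs=20, rng=random):
    """Sample one student's correct/incorrect sequence from the BKT model."""
    knows = rng.random() < prior             # latent initial knowledge state
    responses = []
    for _ in range(n_obs):
        if knows:
            responses.append(rng.random() > slip)   # correct unless a slip
        else:
            responses.append(rng.random() < guess)  # correct only by guessing
            knows = rng.random() < learn            # learning transition
    return responses

def generate_dataset(prior, learn, guess, slip, n_students=10_000):
    return [generate_student(prior, learn, guess, slip)
            for _ in range(n_students)]

def rmse(truth, estimate):
    """Root-mean-square error between ground-truth and estimated parameters."""
    return (sum((t - e) ** 2 for t, e in zip(truth, estimate))
            / len(truth)) ** 0.5

# One of the 48 ground-truth conditions of Table 1:
data = generate_dataset(prior=0.15, learn=0.10, guess=0.10, slip=0.05)
```

Parameters re-estimated from `data` with any EM-based HMM fit (the paper uses the fastHMM implementation [10]) can then be compared to the generating values with `rmse`. Note that once a skill is known it stays known; whether the learning transition fires before or after the emission is a modeling choice, and this sketch applies it after.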
¹ Variations were 0.10 and 0.20 for learn and guess, and 0.05 and 0.10 for slip.

3.2    Analysis Procedure
For each data set, we estimated the parameters using the expectation maximization (EM) algorithm with the fastHMM implementation [10]. The parameter estimation was conducted using a grid search with ten parameters, and the best-fitting model was selected using the log likelihood.

Using our 288 data sets, we can compare the estimates and ground truths for each parameter and analyze the accuracy of the estimates. We apply the standard root-mean-square error (RMSE) measure and visualizations in our analysis. Using RMSE, we are able to see whether certain ground truths lend themselves to more accurate estimations.

4.    RESULTS
First, let us explore the parameter estimation in detail. The average RMSE measurements in the data (Table 2) indicate that the prediction quality decreases as the prior increases; there is also an increase in the variance of the RMSE. This indicates that the predictions with higher priors are, first, more erroneous and, second, converge in a larger area, resulting in variance. To confirm our observations, we conducted Wilcoxon-Mann-Whitney tests to explore whether the computed RMSEs differed in a statistically significant manner. As shown in Table 3, the RMSEs computed from the data sets with priors 0.15 and 0.90 both differ significantly from those of the other datasets (p < 0.05). Therefore we conclude that the EM algorithm performs badly when prior is high.

Table 2: The mean and variance of the root-mean-square errors per prior

 Ground truth prior    mean RMSE    var RMSE
 0.15                   0.056639    0.000594
 0.30                   0.069073    0.001137
 0.45                   0.070005    0.000584
 0.60                   0.074044    0.001874
 0.75                   0.075946    0.002229
 0.90                   0.085257    0.004876

Table 3: Significant differences between the RMSEs

         0.15      0.30      0.45      0.60      0.75      0.90
 0.15       1   < 0.001   < 0.001   < 0.001   < 0.001   < 0.001
 0.30                 1     0.347     0.614     0.967     0.014
 0.45                           1     0.660     0.125     0.081
 0.60                                     1     0.744     0.035
 0.75                                               1     0.007
 0.90                                                         1

To further understand this phenomenon, we explore the estimates per parameter. The errors per parameter are shown in Figure 3. The mean estimates are consistently close to zero, though a higher prior does affect the variance. As the ground truth prior increases, the variance of guess and learn increases while the variance of prior decreases. In theory, a smaller variance in the prior prediction should imply a more accurate prior estimate. However, as we saw in Table 2, this is not actually the case. The prior estimate gets less accurate as the value of the ground truth prior increases. In Figure 3 we can see again some of the results we saw in Table 2: the prediction accuracy decreases when prior is 0.6 and continues to decrease as prior increases.

Figure 3: Predicting parameters with different values of prior

Figure 4 shows the log likelihood for each of the parameter combinations we analyzed. We see a slight but non-significant increase in the log likelihoods, suggesting that the model is performing better – even while our RMSE error indicator demonstrates otherwise. It is also noteworthy that when slip is 0.10, all log likelihoods range between -65500 and -65250, but when slip is 0.05, all log likelihoods range between -40000 and -35750, indicating that the slip value had a dramatic effect on the model estimation accuracy.

Figure 4: Log likelihoods with different parameters

5.    IMPLICATIONS
Our findings indicate that there are higher errors in the parameter estimations when prior is high (0.90). This is probably due to the lack of evidence available for the HMM to attribute to the learn and guess parameters. One approach to examining the impact of these errors is to look at the students' subjective experience in different conditions [19]. As our data is synthetic, we cannot measure the time consumed by students due to errors, as examined by Yudelson & Koedinger [19]. Instead we explore the difference in the number of questions students need to answer to achieve mastery learning – for our purposes, a knowledge probability above 95% – assuming that the students answer each question correctly.

Examining the case of high prior knowledge, when the true learning was 0.1, we observed that the majority of students needed to answer more than 5 times to achieve mastery (of the 168 predicted value sets available, only 24 achieved mastery), and for the high learning rate (0.2) the situation was not


significantly better – there, 56 value sets achieved mastery within 5 responses. This indicates that the impact on students' learning was indeed significant, and it highlights the importance of this study.

6.   CONCLUSIONS
We started this study with the motivation to explore how well the knowledge tracing method performs when the prior is high or low; this performance has practical implications when applying the approach in a heterogeneous classroom where students arrive with highly different knowledge of the domain. We studied this empirically by generating 288 different synthetic datasets and exploring the difference between the predicted parameters and the parameters used to generate the datasets.

Our results indicated a slight increase in the estimation error when prior was 0.90, which we mostly attribute to higher error in the learn and guess parameters. This observation was statistically significant and most likely due to the fact that students with higher priors produce less information to be used by the HMM to estimate the guess and learn parameters.

We explored the influence these errors had on the probability of knowledge and observed that these errors significantly reduced the speed at which students achieved mastery learning. This result therefore implies that more work needs to be done to detect those with high prior knowledge in order to cater to their learning needs.

Acknowledgments
This work was conducted during the UC Berkeley School of Information class "INFO290: Machine learning in education", instructed by Zach Pardos. We thank the course staff and peers for their support and feedback on the presentation.

References
 [1] Ryan S.J.d. Baker, Albert T. Corbett, and Vincent Aleven. More accurate student modeling through contextual estimation of slip and guess probabilities in bayesian knowledge tracing. In Beverley P. Woolf, Esma Aïmeur, Roger Nkambou, and Susanne Lajoie, editors, Intelligent Tutoring Systems, volume 5091 of Lecture Notes in Computer Science, pages 406–415. Springer Berlin Heidelberg, 2008.
 [2] Joseph E Beck and Kai-min Chang. Identifiability: A fundamental problem of student modeling. pages 137–146, 2007. doi: 10.1007/978-3-540-73078-1_17.
 [3] Albert Corbett. Cognitive computer tutors: Solving the two-sigma problem. In User Modeling 2001, volume 2109 of Lecture Notes in Computer Science, pages 137–147. Springer Berlin Heidelberg, 2001.
 [4] Albert Corbett, Megan McLaughlin, and K Christine Scarpinatto. Modeling student knowledge: Cognitive tutors in high school and college. User Modeling and User-Adapted Interaction, 10(2-3):81–108, 2000.
 [5] Albert T Corbett and John R Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4):253–278, 1994.
 [6] Albert T Corbett and Akshat Bhatnagar. Student modeling in the ACT Programming Tutor: Adjusting a procedural learning model with declarative knowledge. Courses and Lectures – International Centre for Mechanical Sciences, pages 243–254, 1997.
 [7] Tanja Käser, Severin Klingler, Alexander Gerhard Schwing, and Markus Gross. Beyond knowledge tracing: Modeling skill topologies with bayesian networks. In Stefan Trausan-Matu, Kristy Elizabeth Boyer, Martha Crosby, and Kitty Panourgia, editors, Intelligent Tutoring Systems, volume 8474 of Lecture Notes in Computer Science, pages 188–198. Springer International Publishing, 2014.
 [8] Z. A. Pardos and N. T. Heffernan. Navigating the parameter space of Bayesian Knowledge Tracing models: Visualizations of the convergence of the Expectation
    Maximization algorithm. In Proceedings of the 3rd In-
    ternational Conference on Educational Data Mining,
    2010.
 [9] Z. A. Pardos and N. T. Heffernan. Using HMMs and bagged
     decision trees to leverage rich features of user and skill
     from an intelligent tutoring system dataset. Jour-
     nal of Machine Learning Research W & CP, 2010. URL
     http://people.csail.mit.edu/zp/papers/pardos_JMLR_in_press.pdf.
[10] Z. A. Pardos, M. J. Johnson, et al. Scaling cogni-
     tive modeling to massive open environments. TOCHI
     Special Issue on Learning at Scale, (in preparation).
[11] Zachary A. Pardos and Neil T. Heffernan. Modeling in-
     dividualization in a bayesian networks implementation
     of knowledge tracing. In Paul De Bra, Alfred Kobsa, and
     David Chin, editors, User Modeling, Adaptation, and
     Personalization, volume 6075 of Lecture Notes in Com-
     puter Science, pages 255–266. Springer Berlin Heidel-
     berg, 2010. ISBN 978-3-642-13469-2.
[12] Zachary A. Pardos, Sujith M. Gowda, Ryan S.J.d.
     Baker, and Neil T. Heffernan. The sum is
     greater than the parts. ACM SIGKDD Explo-
     rations Newsletter, 13(2):37, May 2012. ISSN
     19310145. doi: 10.1145/2207243.2207249. URL
     http://dl.acm.org/citation.cfm?doid=2207243.2207249.
[13] Dovan Rai, Yue Gong, and Joseph E Beck. Using dirich-
     let priors to improve model parameter plausibility. In-
     ternational Working Group on Educational Data Min-
     ing, 2009.
[14] Leena Razzaq, Neil T Heffernan, Mingyu Feng, and
     Zachary A Pardos. Developing Fine-Grained Transfer
     Models in the ASSISTment System. Technology, In-
     struction, Cognition & Learning, 5(3):1–16, 2007.
[15] Steven Ritter, Thomas K Harris, Tristan Nixon, Daniel
     Dickison, R Charles Murray, and Brendon Towle. Re-
     ducing the knowledge tracing space. International
     Working Group on Educational Data Mining, 2009.
[16] A Toscher and Michael Jahrer. Collaborative filtering
     applied to educational data mining. Journal of Machine
     Learning Research, 2010.
[17] Brett van De Sande. Properties of the Bayesian Knowl-
     edge Tracing Model. Journal of Educational Data Min-
     ing, 5(2):1–10, 2013.
[18] Hsiang-Fu Yu, Hung-Yi Lo, Hsun-Ping Hsieh, Jing-
     Kai Lou, Todd G McKenzie, Jung-Wei Chou, Po-Han
     Chung, Chia-Hua Ho, Chun-Fu Chang, Yin-Hsuan Wei,
     et al. Feature engineering and classifier ensemble for
     kdd cup 2010. JMLR: Workshop and Conference Pro-
     ceedings, 1, 2010.
[19] Michael V Yudelson and Kenneth R Koedinger. Esti-
     mating the benefits of student model improvements on
     a substantive scale. In Proceedings of the 6th Interna-
     tional Conference on Educational Data Mining, 2013.
[20] Michael V Yudelson, Kenneth R Koedinger, and Ge-
     offrey J Gordon. Individualized bayesian knowledge
     tracing models. In Artificial Intelligence in Education,
     pages 171–180. Springer, 2013.