Choosing Sample Size for Knowledge Tracing Models∗

Derrick Coetzee
University of California, Berkeley
dcoetzee@berkeley.edu

∗This work was published at the BKT20y Workshop in conjunction with Educational Data Mining 2014. The author waives all rights to this work under Creative Commons CC0 1.0.




ABSTRACT
An important question in the practical application of Bayesian knowledge tracing models is determining how much data is needed to infer parameters accurately. If training data is inadequate, even a perfect inference algorithm will produce parameters with poor predictive power. In this work, we describe an empirical study using synthetic data that provides estimates of the accuracy of inferred parameters based on factors such as the number of students used to train the model and the values of the underlying generating parameters. We find that the standard deviation of the error is roughly proportional to 1/√n, where n is the sample size, and that model parameters near 0 and 1 are easier to learn accurately.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data Mining

General Terms
Measurement, Theory.

Keywords
Educational data mining, knowledge tracing, sample size

1. INTRODUCTION
Simple Bayesian knowledge tracing (BKT) models a student's observed responses to a sequence of items as a Markov process, with their knowledge state as a hidden underlying variable. If values are given for the four standard parameters (learning rate, prior, guess, and slip), the likelihood of a particular set of response sequences can be computed. Using standard search procedures such as expectation maximization (EM), the parameter set giving the highest likelihood for a given set of sequences can be determined, provided that the procedure converges to the global maximum.

However, even if the procedure identifies the global maximum correctly and precisely, the resulting parameters may not reflect the actual parameters that generated the data; this is a sampling error effect. It is clearest with very small samples, such as samples of size 1, but exists with larger samples as well. Empirical studies with synthetic data generated from known parameters show that the inferred parameters for a given data set can differ substantially from the generating parameters, and this same issue would arise in real settings. An understanding of the magnitude of sampling error in a particular scenario can help to explain why the resulting model does or does not make effective predictions. Moreover, by providing a means to describe the distribution of possible generating parameter values, the uncertainty of calculations based on those parameters, such as predictions, can also be determined.
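To make the model concrete, the following Python sketch shows how a single student's response sequence can be generated from the four parameters, and how the likelihood of an observed sequence can be computed with the standard two-state forward recursion. This is an illustrative sketch only, not the code used in this study (which relies on the fastHMM library described in section 3); the function names and the five-item sequence length are conventions of this sketch.

    # Illustrative sketch only; not the implementation used in this study.
    import random

    def generate_responses(learn, prior, guess, slip, n_items=5, rng=random):
        """Sample one synthetic student's correct/incorrect responses under BKT."""
        known = rng.random() < prior          # initial knowledge state
        responses = []
        for _ in range(n_items):
            p_correct = (1 - slip) if known else guess
            responses.append(rng.random() < p_correct)
            if not known and rng.random() < learn:
                known = True                  # learning transition after the item
        return responses

    def sequence_likelihood(responses, learn, prior, guess, slip):
        """Likelihood of a response sequence via the forward recursion
        over the hidden known/unknown state (no forgetting)."""
        p_known = prior
        likelihood = 1.0
        for correct in responses:
            if correct:
                p_obs = p_known * (1 - slip) + (1 - p_known) * guess
                post_known = p_known * (1 - slip) / p_obs
            else:
                p_obs = p_known * slip + (1 - p_known) * (1 - guess)
                post_known = p_known * slip / p_obs
            likelihood *= p_obs
            p_known = post_known + (1 - post_known) * learn
        return likelihood

    # Example: one synthetic student under learn=0.2, prior=0.4, guess=0.14, slip=0.05
    seq = generate_responses(0.2, 0.4, 0.14, 0.05)
    print(seq, sequence_likelihood(seq, 0.2, 0.4, 0.14, 0.05))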
2. RELATED WORK
For simple problems, such as identifying the mean value of a parameter in a population, or the proportion of the population falling into a subgroup, there are simple and well-understood statistical approaches for determining sample size based on statistical power. Such analytic approaches are not immediately applicable to the problem of minimizing the HMM error function because of its complexity and high dimensionality.

Falakmasir et al. [2] have noted that training time increases linearly with the size of the training set. Choosing an appropriate sample size for a desired level of accuracy can thus help to reduce training time, which is important both for research and in some real-time interactive tutor applications.

Nooraei et al. [3] found that using only the 15 most recent data points from each student to train a knowledge tracing model yielded root mean-square error during prediction comparable to using the student's full history. For one data set, the 5 most recent items sufficed. Our study conversely does not vary the number of items per student, but instead varies the number of students and the four parameters generating the data. By allowing sample size to be reduced to meet a desired accuracy, our work offers an orthogonal method of further reducing training time.

Van de Sande [8] has suggested that as samples become larger, models with small parameter sets may no longer be rich enough to capture the sample's complexity. Thus our exclusive reliance on a simple four-parameter BKT model even for very large samples is a limitation of our approach.
[Figure 1 plot: histogram of the inferred learning rate (x-axis: inferred learning rate, 0.150 to 0.245; y-axis: proportion of samples, 0 to 0.20).]

Figure 1: Given the fixed model learn=0.2, prior=0.4, guess=0.14, slip=0.05, we generated 10000 samples with 1000 students each, and for each, inferred all four parameters using EM. The distribution of the inferred learning rate parameter over the samples is shown above. The mean differs by 3 × 10⁻⁶ from the true generating parameter 0.2. The standard deviation is 0.01121, and the orange line shows the expected height of each bar if the proportions precisely followed a normal distribution. SciPy's normaltest [7] rejects that the distribution is perfectly normal (p < 0.0002), and a small amount of negative (left) skew is visible; the median is 0.00016 smaller than the mean. But the distribution is close enough to normal for our purposes.

3. METHODOLOGY
In our experiments we relied on a simple standard Bayesian knowledge tracing model with four parameters: learning rate, prior, guess, and slip. There is only one value for each parameter, and no specialization by student or problem. Each synthetic student responded to five items; we do not vary this number in this study, since Nooraei et al. [3] report that increasing it has diminishing returns, but future work may investigate it.

We generate separate datasets for each of our experiments. In each case, we enumerate a sequence of models (each specified by values for learn, prior, guess, slip, and sample size), and for each of those models, we generate a large number of random samples consistent with that model. For example, for a particular model, we may generate 1000 samples each containing 1000 students.

We then run EM on each sample to find the parameter set giving the maximum likelihood value. All parameters are permitted to vary during the search. EM is started at the generating parameters and run until fully converged (within 10⁻¹²) or until 100 iterations are complete. Starting at the generating parameters is not feasible in a realistic setting, but here it allows EM to run quickly and consistently reach the global optimum. As shown in Figure 1, the parameter values inferred from these samples approximate a normal distribution with a mean equal to the generating parameter.

Finally, we take all samples generated from a single model and, for each parameter, record the mean and standard deviation of the inferred values for that parameter. We chose the number of samples generated for each model large enough that these statistics remain stable under repeated runs. Mean values for each parameter were consistently near the generating parameter, typically within at most 0.1 standard deviations. Standard deviation provides an estimate of variation in the inferred parameter values, and is plotted. Different models yield different standard deviation values.

Because of the very large number of large samples involved in this approach, we use the fastHMM C++ BKT library designed by Pardos and Johnson [5], invoked from a Matlab script, to quickly generate datasets and perform EM.
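The per-model loop described above can be summarized by the short Python sketch below. It is a simplified stand-in for our actual pipeline (fastHMM invoked from Matlab): fit_bkt_em is a hypothetical placeholder for an EM fitting routine returning the four estimates, and generate_responses is the generator sketched in the introduction.

    # Simplified stand-in for the actual fastHMM/Matlab pipeline.
    import numpy as np
    from scipy import stats

    def simulate_model(learn, prior, guess, slip, n_students, n_samples, fit_bkt_em):
        """Generate n_samples synthetic datasets from one model, fit each with EM,
        and summarize the spread of the inferred parameters."""
        inferred = []  # one (learn, prior, guess, slip) estimate per sample
        for _ in range(n_samples):
            sample = [generate_responses(learn, prior, guess, slip)
                      for _ in range(n_students)]
            inferred.append(fit_bkt_em(sample))  # hypothetical EM fit
        inferred = np.array(inferred)
        means = inferred.mean(axis=0)
        stds = inferred.std(axis=0, ddof=1)
        # Normality check on the inferred learning rates (cf. Figure 1).
        normality_p = stats.normaltest(inferred[:, 0]).pvalue
        return means, stds, normality_p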
3.1 Varying one parameter
In our first experiment, we start with typical, plausible values for all four parameters: learn=0.2, prior=0.4, guess=0.14, slip=0.05. These values are consistent with prior work that found large guess and slip values (> 0.5) to be implausible in most scenarios [6], and in our 5-problem scenario, the chance of learning the material by the end is about 67%, which is reasonable.

Then, for each of the four parameters, we hold the other parameters at their single plausible value, and vary the remaining parameter from 0 to 1 in steps of 0.01. This results in 404 total parameter sets.

For each parameter set, we generate 1000 random samples of 1000 students each. In this experiment, the number of students is fixed at 1000, which is large enough to consistently produce a standard deviation not exceeding 0.03; this avoids the boundary effects near 0 and 1 that would occur for very small samples.

In this experiment, we focus on the variance of our estimates of the parameter that is being varied, and do not consider variance of the other (fixed) parameters.

3.2 Interactions between parameters
In this experiment, similar to the first, we hold three parameters fixed (learn=0.2, prior=0.4, guess=0.14), and vary slip between 0 and 1 in steps of 0.01. This gives 101 parameter sets. For each, we generate 1000 random samples of 1000 students each. However, in this experiment we examine variance of our estimates of all four parameters, rather than just the one being varied (slip). This experiment helps to demonstrate to what extent varying one parameter can affect the difficulty of accurately inferring other parameters.

3.3 Varying sample size
In our third experiment, we fix the value of all four parameters, but vary the sample size in powers of two from 2 to 2097152. For sample sizes below 10000, we generate 1000 samples of that size, while for those above we generate 100 samples. The parameter values are heuristically chosen based on the prior experiments above to generate large error values (but not necessarily the worst possible error). We examine how variation of our estimates of all four parameters varies with sample size, and identify any trends.
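For concreteness, the parameter sets enumerated by these three experiments can be written down as follows; the dictionary representation and names are illustrative conventions of this sketch, not data structures from our pipeline.

    # Illustrative enumeration of the experimental parameter grids.
    import numpy as np

    BASE = dict(learn=0.2, prior=0.4, guess=0.14, slip=0.05)
    STEPS = np.round(np.arange(0.0, 1.0 + 1e-9, 0.01), 2)  # 0.00, 0.01, ..., 1.00

    # Experiment 3.1: vary each parameter in turn -> 4 * 101 = 404 parameter sets
    grid_3_1 = [{**BASE, name: float(v), "n_students": 1000}
                for name in BASE for v in STEPS]

    # Experiment 3.2: vary slip only -> 101 parameter sets
    grid_3_2 = [{**BASE, "slip": float(v), "n_students": 1000} for v in STEPS]

    # Experiment 3.3: fixed parameters, sample sizes in powers of two
    sizes_3_3 = [2 ** k for k in range(1, 22)]  # 2, 4, ..., 2097152

    print(len(grid_3_1), len(grid_3_2), sizes_3_3[-1])  # 404 101 2097152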
[Figure 2 plot: standard deviation of inferred parameters (y-axis, 0 to 0.035) vs. generating parameter value (x-axis, 0 to 1), with one curve each for learning rate, prior, guess, and slip.]

Figure 2: Variation of inferred parameters, based on the underlying generating parameter. For each curve, all parameters other than the one being examined are fixed at plausible values. Values near 0 and 1 are the easiest to infer accurately, and each parameter exhibits a unique pattern.

[Figure 3 plot: standard deviation of inferred parameters (y-axis, 0 to 0.1) vs. slip parameter value (x-axis, 0 to 1), with one curve each for learning rate, prior, guess, and slip.]

Figure 3: As the slip parameter is varied and the other parameters are held fixed (learn=0.2, prior=0.4, guess=0.14), the error in our inference of all other parameters varies in a strong and complex fashion, indicating interactions in the inference of different parameters.

3.4 Interaction between sample size and parameters
In our final experiment, we vary both the learning rate (from 0 to 1 in steps of 0.01) and the sample size (between the values 1000, 10000, and 100000) at the same time. This enables us to examine whether there is any interaction between parameters and sample size. For 1000 and 10000 students we use 1000 samples, while for 100000 students we use 100 samples, to reduce runtime.

4. RESULTS

4.1 Varying one parameter
As described in section 3.1, in this experiment we vary each parameter between 0 and 1 while holding the other parameters fixed, and examine how the variation in our inference of that parameter changes with its value. As shown in Figure 2, parameters with values near 0 or 1 are easier to estimate accurately, while those with values in the 0.4 to 0.8 range are more difficult to infer. Each parameter exhibits a unique pattern, with prior behaving worst for small values, guess behaving worst for values in the middle, and learning rate performing worst for the largest values. Slip is unique in having two peaks in its curve, near 0.5 and 0.8.

4.2 Interactions between parameters
As described in section 3.2, in this experiment we vary slip between 0 and 1 while keeping the other parameters fixed, and examine how the variation of all four inferred parameters varies, as shown in Figure 3. All variance values exhibit a strong, complex dependence on the slip parameter; in particular there is a dramatic and unexpected drop from large variance to small variance around slip=0.85. We conclude that the variance of an inferred parameter depends not only on the value of that parameter, but also on the values of other parameters.

4.3 Varying sample size
We fix the parameters at the values empirically determined in section 4.1 to give maximum variance (roughly based on the maximums of the curves, with prior and guess at 0.5, and learning rate and slip at 0.67). Because section 4.2 suggests that there are interactions between parameters, this may not give the worst-case variance possible over all combinations, but it is a reasonable starting point for realistic values.

As described in section 3.3, sample size is varied in powers of two from 2 to 2097152. Figure 4 shows the result, suggesting that (except for very small samples) the standard deviation of the error is roughly proportional to n^(-0.5), or 1/√n, where n is the sample size. For these particular parameter values, slip is consistently inferred most accurately, learning rate is inferred least accurately, and guess and prior are between the two and are similar.
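The 1/√n trend can be quantified by fitting a line to log standard deviation versus log sample size, as with the trend line annotated in Figure 4 (y = 0.4215x^(-0.533), R² = 0.9963). The sketch below shows one way to perform such a fit with NumPy; the inputs are placeholders, not the measured values from our experiment.

    import numpy as np

    def fit_power_law(sample_sizes, std_devs):
        """Fit std_dev ~ a * n**b by linear regression in log-log space;
        b near -0.5 corresponds to the 1/sqrt(n) trend."""
        log_n = np.log10(np.asarray(sample_sizes, dtype=float))
        log_s = np.log10(np.asarray(std_devs, dtype=float))
        b, log_a = np.polyfit(log_n, log_s, 1)
        residuals = log_s - (b * log_n + log_a)
        r2 = 1 - residuals.var() / log_s.var()  # goodness of fit in log-log space
        return 10 ** log_a, b, r2

    # Placeholder usage with hypothetical measurements:
    # sizes = [2 ** k for k in range(4, 22)]
    # stds = [...]  # standard deviation of an inferred parameter at each size
    # a, b, r2 = fit_power_law(sizes, stds)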
4.4 Interaction between sample size and parameters
In our final experiment, as described in section 3.4, we vary both the learning rate and the sample size at the same time. The standard deviation curves for the three sample sizes are then plotted on the same plot, each divided by the 1/√n factor, where n is the sample size, as shown in Figure 5. The curves are nearly identical, and we find no evidence of interaction between parameters and sample size, but we cannot rule out interaction for other combinations of parameter values. This also offers additional evidence for the 1/√n trend from the previous section.

5. DISCUSSION
Because accuracy is good for parameter values near 0 and 1, boundary effects (in which the distribution of error is skewed because values outside of the 0-1 range are not permitted) are not a serious concern for large enough samples.

Interactions between parameters are complex, suggesting that attempting to characterize error in each parameter independently is unlikely to yield good predictions of error. Moreover, attempts to model these interactions analytically may be challenging because they cannot be fit well by low-degree polynomials. A more viable strategy is to form a conservative estimate of error by conducting a grid search of parameter sets that are plausible in a given scenario. On the other hand, once the range of variances at a particular (sufficiently large) sample size is characterized, Figure 4 and Figure 5 show that altering the sample size has a uniform and predictable effect on the error.

The main result, that standard deviation is proportional to 1/√n, suggests that, in order to decrease the margin of error in the estimate of a parameter by a factor of 2, an increase in sample size by a factor of 4 is required. Additionally, Figure 4 shows that achieving even a single valid significant digit in the learning rate requires sample sizes of 1000 students or more. This suggests that studies using BKT with fewer than 1000 students should be considered carefully for sampling error.

[Figure 4 plot: log-log plot of standard deviation of inferred parameters (y-axis, 0.0001 to 1) vs. sample size in number of students (x-axis, 1E+0 to 1E+6), with one curve each for prior, learn, guess, and slip, and trend line y = 0.4215x^(-0.533), R² = 0.9963.]

Figure 4: Accuracy of inferred parameters, based on sample size (training set size), with fixed parameters (prior=guess=0.5, learning=slip=0.67). This is a log-log plot, and (once the y = 0.1 level is reached) the lines each remain straight with a slope of roughly -0.5. This suggests that the standard deviation of the error is roughly proportional to 1/√n, where n is the sample size.

[Figure 5 plot: standard deviation of inferred learning rates, normalized by sample size (y-axis, 0 to 0.8) vs. learning rate (x-axis, 0 to 1), with one curve each for 1000, 10000, and 100000 students.]

Figure 5: Here we vary the learning rate from 0 to 1, and also vary the sample size between the values 1000, 10000, and 100000. The resulting standard deviations are divided by 1/√n to normalize for the improvement in error due to increased sample size. The resulting curves are nearly identical; the curve for 100000 students appears noisier only because of a lower number of samples (100 instead of 1000). We find no evidence of interaction between sample size and the learning rate.

5.1 Confidence Intervals and Decreasing Training Time
As noted in Figure 1, provided that the sample size is large enough, the distribution of samples is approximated well by a normal distribution, and the standard deviations computed in synthetic simulations such as the preceding ones can be used to compute confidence intervals containing the true generating parameters (e.g. 95% of possible values are within two standard deviations). Parameters used in these simulations can be set using domain knowledge, by conservatively selecting values that give poor accuracy, or both.

To use our results to decrease training time for a large data set, one approach is to create many small samples (e.g. 100 of size 1000) by sampling uniformly at random with replacement from the full data set. By training on these, we can estimate the variance of our estimates of each parameter at a sample size of 1000. Then, given a desired level of accuracy and a desired probability of achieving it, we can use the 1/√n relationship to estimate the best final sample size. If the estimated sample size exceeds the data size, this suggests that more data needs to be gathered.
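A minimal sketch of this procedure follows, assuming the pilot standard deviations have already been estimated (for example, by applying the simulate_model sketch above to resampled subsets of size 1000); the function names and the two-standard-deviation interval are illustrative choices, not a prescribed procedure.

    def confidence_interval(estimate, std_dev, k=2.0):
        """Approximate interval for the generating parameter, assuming
        near-normal error; k=2 gives roughly a 95% interval."""
        return estimate - k * std_dev, estimate + k * std_dev

    def required_sample_size(pilot_n, pilot_std, target_std):
        """Scale the pilot sample size by the 1/sqrt(n) relationship:
        std ~ c / sqrt(n), so halving the error needs 4x the students."""
        return int(round(pilot_n * (pilot_std / target_std) ** 2))

    # Example: a pilot at 1000 students gives std 0.03 for the learning rate;
    # reaching std 0.01 would require roughly 9000 students.
    print(required_sample_size(1000, 0.03, 0.01))  # -> 9000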
6. IDENTIFIABILITY PROBLEM
Although we have in this work considered a particular generating parameter set to be the correct and desired parameters, BKT exhibits an Identifiability Problem [1] in which there is an infinite family of four-parameter solutions that make the same predictions. This creates the risk that a solution that appears to be far from the generating parameters is actually very close to an equivalent parameter set (or that an equivalent solution is).

Van de Sande [9] more specifically characterized BKT (in its HMM form) as a three-parameter system in which two systems having the same slip, learning rate, and A value will yield the same predictions, where A is given by

A = (1 − slip − guess)(1 − prior).
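Under this characterization, two parameter sets sharing slip, learning rate, and A should be observationally equivalent. The sketch below constructs such an alternative set from a given one, with the new guess chosen freely within the valid range; it is only an illustration of the relationship above, not code from Van de Sande [9] or from our study.

    def equivalent_parameters(learn, prior, guess, slip, new_guess):
        """Illustration only: return another parameter set with the same slip,
        learning rate, and A = (1 - slip - guess)(1 - prior)."""
        A = (1 - slip - guess) * (1 - prior)
        new_prior = 1 - A / (1 - slip - new_guess)
        if not 0 <= new_prior <= 1:
            raise ValueError("new_guess does not yield a valid prior")
        return learn, new_prior, new_guess, slip

    # Example: starting from learn=0.2, prior=0.4, guess=0.14, slip=0.05
    print(equivalent_parameters(0.2, 0.4, 0.14, 0.05, new_guess=0.10))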

One way to address the issue is to perform both data generation and parameter search in this reduced three-parameter system; this would be similar to our current approach, but error in the A parameter is more difficult to interpret. Intuitively, we expect search in a lower-dimensional space to give better accuracy with the same amount of data. However, Van de Sande also notes that the algorithm form of BKT has no analytic solution, and so the degree to which BKT is underdetermined may depend on the specific application.

Beyond the underdetermined nature of BKT, there are also information-theoretic bounds that limit the accuracy of inferring parameters regardless of the system. In particular, given a collection of at least k different parameter sets, and student data that can only take on fewer than k values, there is no procedure that can reliably infer the generating parameters without error. As the size of the data decreases, the minimum possible error increases. Although these bounds are general, they typically apply only to very small data sets.
7. CONCLUSIONS AND FUTURE WORK
We have only explored a small part of the space of input parameters that can affect inferred parameter accuracy; the possible interactions between parameters are complex and not fully understood. It would also be useful to examine different sizes of problem sets, scenarios where different students complete different numbers of problems, models where parameters such as learning rate and guess/slip are per problem, and models where priors are measured per student (as in Pardos and Heffernan [4]).

Although it seems intuitive that insufficient sample size can lead to poor parameter estimates with poor predictive power, this deserves verification: it is not clear which errors will damage prediction and which are benign. An empirical synthetic study that examines prediction accuracy could assess this cheaply. Going a step further, it would be useful to simulate an interactive tutoring system and assess a cost function that penalizes the system both for incorrect assessment of mastery and for failing to assess mastery when it is reached. By applying weights to these error types, the simulation could represent the real-world cost of inaccurate parameters in such a system.

Another important direction is extending our results to real-world data. There are a few approaches. One is to use a very large real-world data set and use its inferred parameters as the ground-truth generating parameters, then examine smaller subsets to determine whether parameters are inferred less accurately. If the BKT model is appropriate, we expect to observe similar relationships between sample size and variance as with our synthetic data. This approach can be compared to one experiment of Ritter et al. [6] (Figure 4), in which they took a large real data set and computed mean-squared error using the best-fit parameters on subsets with smaller numbers of students, ranging from 5 to 500.

There are other approaches to real-world validity. A second would be a survey of prior BKT applications, to identify whether there is a consistent relationship between sample size and reported prediction accuracy. A third would be a controlled experiment in which two groups of very different sizes each use an ITS, the BKT model is trained on the resulting data, and then the groups continue to use the ITS and their learning performance is examined (note however that asymmetric group sizes limit statistical power).

Finally, an analytical model that can explain some of our empirical results, such as the skewed normal distribution of inferred parameter values, the improvements in parameter inference near 0 and 1 parameter values, or the 1/√n relationship between sample size and standard deviation, would be a valuable contribution.

8. ACKNOWLEDGMENTS
We thank Zachary A. Pardos for his fastHMM C++ BKT library [5], for providing helpful comments on this work, and for designing the assignment which inspired it.

9. REFERENCES
[1] J. E. Beck and K.-M. Chang. Identifiability: A fundamental problem of student modeling. In Proceedings of the 11th International Conference on User Modeling, UM '07, pages 137–146, Berlin, Heidelberg, 2007. Springer-Verlag.
[2] M. H. Falakmasir, Z. A. Pardos, G. J. Gordon, and P. Brusilovsky. A spectral learning approach to knowledge tracing. In 6th International Conference on Educational Data Mining (EDM 2013), pages 28–35. International Educational Data Mining Society, 2013.
[3] B. B. Nooraei, Z. A. Pardos, N. T. Heffernan, and R. S. J. de Baker. Less is more: Improving the speed and prediction power of knowledge tracing by using less data. In M. Pechenizkiy, T. Calders, C. Conati, S. Ventura, C. Romero, and J. C. Stamper, editors, EDM, pages 101–110. www.educationaldatamining.org, 2011.
[4] Z. A. Pardos and N. T. Heffernan. Modeling individualization in a bayesian networks implementation of knowledge tracing. In P. D. Bra, A. Kobsa, and D. N. Chin, editors, UMAP, volume 6075 of Lecture Notes in Computer Science, pages 255–266. Springer, 2010.
[5] Z. A. Pardos and M. J. Johnson. Scaling cognitive modeling to massive open environments (in preparation). 2015.
[6] S. Ritter, T. K. Harris, T. Nixon, D. Dickison, R. C. Murray, and B. Towle. Reducing the knowledge tracing space. In T. Barnes, M. C. Desmarais, C. Romero, and S. Ventura, editors, EDM, pages 151–160. www.educationaldatamining.org, 2009.
[7] SciPy v0.13.0 Reference Guide: scipy.stats.normaltest. http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html, May 2013. [Online; accessed 24-April-2014].
[8] B. Van de Sande. Applying three models of learning to individual student log data. In 6th International Conference on Educational Data Mining (EDM 2013), pages 193–199. International Educational Data Mining Society, 2013.
[9] B. Van de Sande. Properties of the bayesian knowledge tracing model. Journal of Educational Data Mining, 5(2):1–10, 2013.