Choosing Sample Size for Knowledge Tracing Models∗

Derrick Coetzee
University of California, Berkeley
dcoetzee@berkeley.edu

∗This work was published at the BKT20y Workshop in conjunction with Educational Data Mining 2014. The author waives all rights to this work under Creative Commons CC0 1.0.

ABSTRACT
An important question in the practical application of Bayesian knowledge tracing models is determining how much data is needed to infer parameters accurately. If training data is inadequate, even a perfect inference algorithm will produce parameters with poor predictive power. In this work, we describe an empirical study using synthetic data that provides estimates of the accuracy of inferred parameters based on factors such as the number of students used to train the model, and the values of the underlying generating parameters. We find that the standard deviation of the error is roughly proportional to 1/√n, where n is the sample size, and that model parameters near 0 and 1 are easier to learn accurately.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data Mining

General Terms
Measurement, Theory

Keywords
Educational data mining, knowledge tracing, sample size

1. INTRODUCTION
Simple Bayesian knowledge tracing (BKT) models a student's observed responses to a sequence of items as a Markov process, with their knowledge state as a hidden underlying variable. If values are given for the four standard parameters (learning rate, prior, guess, and slip), the likelihood of a particular set of response sequences can be computed. Using standard search procedures like expectation maximization (EM), the parameter set giving the highest likelihood for a given set of sequences can be determined, provided that the procedure converges to the global maximum.

However, even if the procedure identifies the global maximum correctly and precisely, the resulting parameters may not reflect the actual parameters that generated the data; this is a sampling error effect. It's clearest with very small samples, such as samples of size 1, but exists with larger samples as well. Empirical studies with synthetic data generated from known parameters show that the inferred parameters for a given data set can differ substantially from the generating parameters, and this same issue would arise in real settings. An understanding of the magnitude of sampling error in a particular scenario can help to explain why the resulting model does or does not make effective predictions. Moreover, by providing a means to describe the distribution of possible generating parameter values, the uncertainty of calculations based on those parameters, such as predictions, can also be determined.
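To make the model concrete, the following is a minimal sketch of the standard four-parameter BKT likelihood computation for a single response sequence. This is not the implementation used in our experiments (which relies on the fastHMM library described in section 3); the function name and example values are illustrative.

```python
def sequence_likelihood(responses, learn, prior, guess, slip):
    """Likelihood of one student's responses (1=correct, 0=incorrect)
    under the standard four-parameter BKT model (illustrative sketch)."""
    p_know = prior                      # P(student knows the skill)
    likelihood = 1.0
    for r in responses:
        p_correct = p_know * (1 - slip) + (1 - p_know) * guess
        likelihood *= p_correct if r == 1 else 1 - p_correct
        # Bayesian update of the knowledge state given the observation.
        if r == 1:
            p_know = p_know * (1 - slip) / p_correct
        else:
            p_know = p_know * slip / (1 - p_correct)
        # Learning transition after each practice opportunity.
        p_know += (1 - p_know) * learn
    return likelihood

# Example: likelihood of the sequence wrong, wrong, right, right, right
# under the plausible parameters later used in section 3.1.
print(sequence_likelihood([0, 0, 1, 1, 1], 0.2, 0.4, 0.14, 0.05))
```

The likelihood of a full set of sequences is the product of the per-student likelihoods; EM searches for the parameter set maximizing it.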
2. RELATED WORK
For simple problems, such as identifying the mean value of a parameter in a population, or the proportion of the population falling into a subgroup, there are simple and well-understood statistical approaches for determining sample size based on statistical power. Such analytic approaches are not immediately applicable to the problem of minimizing the HMM error function because of its complexity and high dimensionality.

Falakmasir et al [2] have noted that training time increases linearly with the size of the training set. Choosing an appropriate sample size for a certain desired level of accuracy can thus help to reduce training time, which is important both for research and in some real-time interactive tutor applications.

Nooraei et al [3] found that using only the 15 most recent data points from each student to train a knowledge tracing model yielded root mean-square error during prediction comparable to using the student's full history. For one data set, the 5 most recent items sufficed. Our study conversely does not vary the number of items per student, but instead varies the number of students and the four parameters generating the data. By allowing sample size to be reduced to meet a desired accuracy, our work offers an orthogonal method of further reducing training time.

Van de Sande [8] has suggested that as samples become larger, models with small parameter sets may no longer be rich enough to capture the sample's complexity. Thus our exclusive reliance on a simple four-parameter BKT model even for very large samples is a limitation of our approach.

3. METHODOLOGY
In our experiments we relied on a simple standard Bayesian knowledge tracing model with four parameters: learning rate, prior, guess, and slip. There is only one value for each parameter, and no specialization by student or problem. Each synthetic student responded to five items; we do not vary this parameter in this study, since Nooraei et al [3] report that increasing it has diminishing returns, but future work may investigate it.

We generate separate datasets for each of our experiments. In each case, we enumerate a sequence of models (each specified by values for learn, prior, guess, slip, and sample size), and for each of those models, we generate a large number of random samples consistent with that model. For example, for a particular model, we may generate 1000 samples each containing 1000 students.
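A simplified sketch of this generation step, assuming the standard BKT sampling process (the actual experiments use fastHMM [5]; all names here are illustrative):

```python
import random

def simulate_student(learn, prior, guess, slip, n_items=5):
    """Sample one synthetic student's response sequence (1=correct)."""
    knows = random.random() < prior
    responses = []
    for _ in range(n_items):
        if knows:
            responses.append(0 if random.random() < slip else 1)
        else:
            responses.append(1 if random.random() < guess else 0)
        # A student who doesn't know the skill may learn it at each step.
        if not knows and random.random() < learn:
            knows = True
    return responses

def generate_sample(params, n_students=1000):
    """One random sample: a list of response sequences, one per student."""
    return [simulate_student(*params) for _ in range(n_students)]
```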
We then run EM on each sample to find the parameter set giving the maximum likelihood value. All parameters are permitted to vary during the search. EM is started at the generating parameters and run until fully converged (within 10^-12, or until 100 iterations are complete). Starting at the generating parameters is not feasible in a realistic setting, but here it allows EM to run quickly and consistently reach the global maximum. As shown in Figure 1, the parameter values inferred from these samples approximate a normal distribution with a mean equal to the generating parameter.

Finally, we take all samples generated from a single model and, for each parameter, record the mean and standard deviation of the inferred values for that parameter. We chose the number of samples generated for each model large enough that these statistics remain stable under repeated runs. Mean values for each parameter were consistently near the generating parameter, typically within at most 0.1 standard deviations. Standard deviation provides an estimate of variation in the inferred parameter values, and is plotted. Different models yield different standard deviation values.

Because of the very large number of large samples involved in this approach, we use the fastHMM C++ BKT library designed by Pardos and Johnson [5] to quickly generate datasets and perform EM, invoked from a Matlab script.

[Figure 1 (histogram; x-axis: inferred learning rate, y-axis: proportion of samples): Given the fixed model learn=0.2, prior=0.4, guess=0.14, slip=0.05, we generated 10000 samples with 1000 students each, and for each, inferred all four parameters using EM. The distribution of the inferred learning rate parameter over the samples is shown. The mean differs by 3 × 10^-6 from the true generating parameter 0.2. The standard deviation is 0.01121, and the orange line shows the expected height of each bar if the proportions precisely followed a normal distribution. Scipy's normaltest [7] rejects that the distribution is perfectly normal (p < 0.0002), and a small amount of negative (left) skew is visible; the median is 0.00016 smaller than the mean. But the distribution is close enough to normal for our purposes.]
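The statistics reported in the Figure 1 caption can be checked with scipy.stats.normaltest [7]. A sketch, assuming the inferred learning rates have been collected into a text file (the file name is hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical file holding the learning rate inferred from each of the
# 10000 synthetic samples behind Figure 1.
rates = np.loadtxt("inferred_learn_rates.txt")

print("mean:", rates.mean())          # near the generating value 0.2
print("std dev:", rates.std(ddof=1))  # roughly 0.011 at 1000 students
stat, p = stats.normaltest(rates)     # D'Agostino-Pearson test
print("p-value:", p)                  # p < 0.05 rejects exact normality
```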
3.1 Varying one parameter
In our first experiment, we start with typical, plausible values for all four parameters: learn=0.2, prior=0.4, guess=0.14, slip=0.05. These values are consistent with prior work that found large guess and slip values (> 0.5) to be implausible in most scenarios [6], and in our 5-problem scenario, the chance of learning the material by the end is about 67%, which is reasonable.

Then, for each of the four parameters, we hold the other parameters at their single plausible value, and vary the remaining parameter from 0 to 1 in steps of 0.01. This results in 404 total parameter sets. For each parameter set, we generate 1000 random samples of 1000 students each. The number of students is fixed at 1000, which is large enough to consistently produce a standard deviation not exceeding 0.03; this avoids the boundary effects near 0 and 1 that would occur for very small samples.

In this experiment, we focus on the variance of our estimates of the parameter that is being varied, and don't consider variance of the other (fixed) parameters.

3.2 Interactions between parameters
In this experiment, similar to the first, we hold three parameters fixed (learn=0.2, prior=0.4, guess=0.14), and vary slip between 0 and 1 in steps of 0.01. This gives 101 parameter sets. For each, we generate 1000 random samples of 1000 students each. However, in this experiment we examine the variance of our estimates of all four parameters, rather than just the one being varied (slip). This experiment helps to demonstrate to what extent varying one parameter can affect the difficulty of accurately inferring other parameters.

3.3 Varying sample size
In our third experiment, we fix the value of all four parameters, but vary the sample size in powers of two from 2 to 2097152. For sample sizes below 10000, we generate 1000 samples of that size, while for those above we generate 100 samples. The parameter values are heuristically chosen based on the prior experiments above to generate large error values (but not necessarily the worst possible error). We examine how the variation of our estimates of all four parameters varies with sample size, and identify any trends.

3.4 Interaction between sample size and parameters
In our final experiment, we vary both the learning rate (from 0 to 1 in steps of 0.01) and the sample size (between the values 1000, 10000, and 100000) at the same time. This enables us to examine whether there is any interaction between parameters and sample size. For 1000 and 10000 students we use 1000 samples, while for 100000 students we use 100 samples, to reduce runtime.
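All four experiments share the same outer loop: enumerate parameter sets, generate samples, fit each with EM, and record statistics. A sketch for the single-parameter sweep of section 3.1, where fit_bkt is a placeholder for an EM fitter (our experiments use fastHMM [5], not this stub) and generate_sample is the sketch from section 3:

```python
import numpy as np

def fit_bkt(sample, init):
    """Placeholder for an EM fitter; any BKT EM implementation
    (e.g. the fastHMM C++ library [5]) could be substituted."""
    raise NotImplementedError

def run_sweep(vary_index, fixed=(0.2, 0.4, 0.14, 0.05),
              n_samples=1000, n_students=1000):
    """Vary one parameter from 0 to 1 in steps of 0.01, holding the
    others fixed; record the mean and std dev of its inferred value."""
    results = []
    for value in np.arange(0.0, 1.005, 0.01):
        params = list(fixed)
        params[vary_index] = value
        estimates = []
        for _ in range(n_samples):
            sample = generate_sample(params, n_students)
            # EM is started at the generating parameters (section 3).
            estimates.append(fit_bkt(sample, init=params)[vary_index])
        results.append((value, np.mean(estimates),
                        np.std(estimates, ddof=1)))
    return results
```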
4. RESULTS
4.1 Varying one parameter
As described in section 3.1, in this experiment we vary each parameter between 0 and 1 while holding the other parameters fixed, and examine how the variation in our inference of that parameter changes with its value. As shown in Figure 2, parameters with values near 0 or 1 are easier to accurately estimate, while those with values in the 0.4 to 0.8 range are more difficult to infer. Each parameter exhibits a unique pattern, with prior behaving worst for small values, guess behaving worst for values in the middle, and learning rate performing worst for the largest values. Slip is unique in having two peaks in its curve, near 0.5 and 0.8.

[Figure 2 (line plot; x-axis: generating parameter value, y-axis: std dev of inferred parameters; one curve each for learning rate, prior, guess, slip): Variation of inferred parameters, based on the underlying generating parameter. For each curve, all parameters other than the one being examined are fixed at plausible values. Values near 0 and 1 are the easiest to infer accurately, and each parameter exhibits a unique pattern.]

4.2 Interactions between parameters
As described in section 3.2, in this experiment we vary slip between 0 and 1 while keeping the other parameters fixed, and examine how the variation of all four inferred parameters varies, as shown in Figure 3. All variance values exhibit a strong, complex dependence on the slip parameter; in particular, there is a dramatic and unexpected drop from large variance to small variance around slip=0.85. We conclude that the variance of an inferred parameter depends not only on the value of that parameter, but also on the values of other parameters.

[Figure 3 (line plot; x-axis: slip parameter value, y-axis: std dev of inferred parameters): As the slip parameter is varied and the other parameters are held fixed (learn=0.2, prior=0.4, guess=0.14), the error in our inference of all other parameters varies in a strong and complex fashion, indicating interactions in the inference of different parameters.]

4.3 Varying sample size
We fix the parameters at the values empirically determined in section 4.1 to give maximum variance (roughly based on the maximums of the curves, with prior and guess at 0.5, and learning rate and slip at 0.67). Because section 4.2 suggests that there are interactions between parameters, this may not give the worst-case variance possible over all combinations, but it is a reasonable starting point for realistic values.

As described in section 3.3, sample size is varied in powers of two from 2 to 2097152. Figure 4 shows the result, suggesting that (except for very small samples) the standard deviation of the error is roughly proportional to n^-0.5, or 1/√n, where n is the sample size. For these particular parameter values, slip is consistently inferred most accurately, learning rate is inferred least accurately, and guess and prior are between the two and are similar.

[Figure 4 (log-log plot; x-axis: sample size in students, 10^0 to 10^6; y-axis: std dev of inferred parameters; fitted trend y = 0.4215x^-0.533, R² = 0.9963): Accuracy of inferred parameters, based on sample size (training set size), with fixed parameters (prior=guess=0.5, learning=slip=0.67). Once the y = 0.1 level is reached, the lines each remain straight with a slope of roughly -0.5. This suggests that the standard deviation of the error is roughly proportional to 1/√n, where n is the sample size.]

4.4 Interaction between sample size and parameters
In our final experiment, as described in section 3.4, we vary both the learning rate and the sample size at the same time. The standard deviation curves for the three sample sizes are then plotted on the same plot, each divided by the 1/√n factor, where n is the sample size, as shown in Figure 5. The curves are nearly identical, and we find no evidence of interaction between parameters and sample size, but we can't rule out interaction for other combinations of parameter values. This also offers additional evidence for the 1/√n trend from the previous section.

[Figure 5 (line plot; x-axis: learning rate, y-axis: std dev of inferred learning rates, normalized by sample size; one curve per sample size): Here we vary learning rate from 0 to 1, and also vary sample size between the values 1000, 10000, and 100000. The resulting standard deviations are divided by 1/√n to normalize for the improvement in error due to increased sample size. The resulting curves are nearly identical; the curve for 100000 students appears noisier only because of a lower number of samples (100 instead of 1000). We find no evidence of interaction between sample size and the learning rate.]

5. DISCUSSION
Because accuracy is good for parameter values near 0 and 1, this implies that for large enough samples, boundary effects (in which the distribution of error is skewed because values outside of the 0-1 range are not permitted) are not a serious concern.

Interactions between parameters are complex, suggesting that attempting to characterize error in each parameter independently is unlikely to yield good predictions of error. Moreover, attempts to model these interactions analytically may be challenging because they cannot be fit well by low-degree polynomials. A more viable strategy is to form a conservative estimate of error by conducting a grid search of parameter sets that are plausible in a given scenario. On the other hand, once the range of variances at a particular (sufficiently large) sample size is characterized, Figure 4 and Figure 5 show that altering the sample size has a uniform and predictable effect on the error.

The main result, that standard deviation is proportional to 1/√n, suggests that in order to decrease the margin of error in the estimate of a parameter by a factor of 2, an increase in sample size by a factor of 4 is required. Additionally, Figure 4 shows that achieving even a single valid significant digit in the learning rate requires sample sizes of 1000 students or more. This suggests that studies using BKT with fewer than 1000 students should be considered carefully for sampling error.

5.1 Confidence Intervals and Decreasing Training Time
As noted in Figure 1, provided that the sample size is large enough, the distribution of inferred parameters is approximated well by a normal distribution, and the standard deviations computed in synthetic simulations such as the preceding ones can be used to compute confidence intervals containing the true generating parameters (e.g. 95% of possible values are within two standard deviations). Parameters used in these simulations can be set either by using domain knowledge, and/or by conservatively selecting values that give poor accuracy.

To use our results to decrease training time for a large data set, one approach is to create many small samples (e.g. 100 of size 1000) by sampling uniformly at random with replacement from the full data set. By training on these, we can estimate the variance of our estimates of each parameter at a sample size of 1000. Then, given a desired level of accuracy and a desired probability of achieving it, we can use 1/√n to estimate the best final sample size. If the estimated sample size exceeds the data size, this suggests that more data needs to be gathered.
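A sketch of this extrapolation step, assuming the 1/√n trend holds and the pilot standard deviation has already been estimated from subsamples (the function and example values are illustrative):

```python
import math

def required_sample_size(sd_pilot, n_pilot, target_halfwidth, z=1.96):
    """Extrapolate the number of students needed for a desired 95%
    confidence half-width, assuming the 1/sqrt(n) trend from Figure 4:
    sd(n) = sd_pilot * sqrt(n_pilot / n)."""
    target_sd = target_halfwidth / z
    return math.ceil(n_pilot * (sd_pilot / target_sd) ** 2)

# Example: if pilot subsamples of 1000 students give sd = 0.011 for the
# learning rate, a 95% interval of +/-0.005 needs roughly 18600 students.
print(required_sample_size(0.011, 1000, 0.005))   # -> 18594
```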
6. IDENTIFIABILITY PROBLEM
Although we have in this work considered a particular generating parameter set to be the correct and desired parameters, BKT exhibits an Identifiability Problem [1] in which there is an infinite family of four-parameter solutions that make the same predictions. This creates the risk that a solution that appears to be far from the generating parameters is actually very close to an equivalent parameter set (or that an equivalent solution is).

Van de Sande [9] more specifically characterized BKT (in its HMM form) as a three-parameter system in which two systems having the same slip, learning rate, and A value will yield the same predictions, where A is given by

    A = (1 − slip − guess)(1 − prior).

One way to address the issue is to perform both data generation and parameter search in this reduced three-parameter system; this would be similar to our current approach, but error in the A parameter is more difficult to interpret. Intuitively, we expect search in a lower-dimensional space to give better accuracy with the same amount of data. However, Van de Sande also notes that the algorithm form of BKT has no analytic solution, and so the degree to which BKT is underdetermined may depend on the specific application.

Beyond the underdetermined nature of BKT, there are also information-theoretic bounds that limit the accuracy of inferring parameters regardless of the system. In particular, given a collection of at least k different parameter sets, and student data that can only take on fewer than k values, there is no procedure that can reliably infer the generating parameters without error. As the size of the data continues to decrease, the minimum possible error increases. Although these bounds are general, they typically apply only to very small data sets.
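To illustrate Van de Sande's characterization, the following sketch constructs a second four-parameter set with the same slip, learning rate, and A value as the parameters from section 3.1. The construction is illustrative only; it simply inverts the definition of A for a new guess value.

```python
def A_value(learn, prior, guess, slip):
    """Van de Sande's combined parameter: systems with equal slip,
    learning rate, and A yield the same predictions [9]."""
    return (1 - slip - guess) * (1 - prior)

def equivalent_params(learn, prior, guess, slip, new_guess):
    """A different four-parameter set sharing the same slip, learning
    rate, and A value (valid when the resulting prior lies in [0, 1])."""
    A = A_value(learn, prior, guess, slip)
    new_prior = 1 - A / (1 - slip - new_guess)
    return (learn, new_prior, new_guess, slip)

base = (0.2, 0.4, 0.14, 0.05)                    # parameters from section 3.1
other = equivalent_params(*base, new_guess=0.2)  # prior becomes 0.352
print(A_value(*base), A_value(*other))           # both 0.486, up to rounding
```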
7. CONCLUSIONS AND FUTURE WORK
We've only explored a small part of the space of input parameters that can affect inferred parameter accuracy; the possible interactions between parameters are complex and not fully understood. It would also be useful to examine different sizes of problem sets, scenarios where different students complete different numbers of problems, models where parameters such as learning rate and guess/slip are per problem, and models where priors are measured per student (as in Pardos and Heffernan [4]).

Although it seems intuitive that insufficient sample size can lead to poor parameter estimates with poor predictive power, this deserves verification: it's not clear which errors will damage prediction and which are benign. An empirical synthetic study that examines prediction accuracy could assess this cheaply. Going a step further, it would be useful to simulate an interactive tutoring system and assess a cost function that penalizes the system both for incorrect assessment of mastery, and for failing to assess mastery when it is reached. By applying weights to these error types, the simulation could represent the real-world cost of inaccurate parameters in such a system.

Another important direction is extending our results to real-world data. There are a few approaches. One is to use a very large real-world data set and use its inferred parameters as the ground-truth generating parameters, then examine smaller subsets to determine whether parameters are inferred less accurately. If the BKT model is appropriate, we expect to observe similar relationships between sample size and variance as with our synthetic data. This approach can be compared to one experiment of Ritter [6] (Figure 4), in which they took a large real data set and computed mean-squared error using the best-fit parameters on subsets with smaller numbers of students, ranging from 5 to 500.

There are other approaches to real-world validity. One would be a survey of prior BKT applications, to identify whether there is a consistent relationship between sample size and reported prediction accuracy. A third approach would be a controlled experiment in which two groups of very different sizes each use an ITS, the BKT is trained on the resulting data, and then the groups continue to use the ITS and their learning performance is examined (note however that asymmetric group sizes limit statistical power).

Finally, an analytical model that can explain some of our empirical results, such as the skewed normal distribution of inferred parameter values, the improvements in parameter inference near 0 and 1 parameter values, or the 1/√n relationship between sample size and standard deviation, would be a valuable contribution.

8. ACKNOWLEDGMENTS
We thank Zachary A. Pardos for his fastHMM C++ BKT library [5], for providing helpful comments on this work, and for designing the assignment which inspired it.

9. REFERENCES
[1] J. E. Beck and K.-M. Chang. Identifiability: A fundamental problem of student modeling. In Proceedings of the 11th International Conference on User Modeling, UM '07, pages 137-146, Berlin, Heidelberg, 2007. Springer-Verlag.
[2] M. H. Falakmasir, Z. A. Pardos, G. J. Gordon, and P. Brusilovsky. A spectral learning approach to knowledge tracing. In 6th International Conference on Educational Data Mining (EDM 2013), pages 28-35. International Educational Data Mining Society, 2013.
[3] B. B. Nooraei, Z. A. Pardos, N. T. Heffernan, and R. S. J. de Baker. Less is more: Improving the speed and prediction power of knowledge tracing by using less data. In M. Pechenizkiy, T. Calders, C. Conati, S. Ventura, C. Romero, and J. C. Stamper, editors, EDM, pages 101-110. www.educationaldatamining.org, 2011.
[4] Z. A. Pardos and N. T. Heffernan. Modeling individualization in a Bayesian networks implementation of knowledge tracing. In P. D. Bra, A. Kobsa, and D. N. Chin, editors, UMAP, volume 6075 of Lecture Notes in Computer Science, pages 255-266. Springer, 2010.
[5] Z. A. Pardos and M. J. Johnson. Scaling cognitive modeling to massive open environments (in preparation). 2015.
[6] S. Ritter, T. K. Harris, T. Nixon, D. Dickison, R. C. Murray, and B. Towle. Reducing the knowledge tracing space. In T. Barnes, M. C. Desmarais, C. Romero, and S. Ventura, editors, EDM, pages 151-160. www.educationaldatamining.org, 2009.
[7] SciPy v0.13.0 Reference Guide: scipy.stats.normaltest. http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html, May 2013. [Online; accessed 24-April-2014].
[8] B. Van de Sande. Applying three models of learning to individual student log data. In 6th International Conference on Educational Data Mining (EDM 2013), pages 193-199. International Educational Data Mining Society, 2013.
[9] B. Van de Sande. Properties of the Bayesian knowledge tracing model. Journal of Educational Data Mining, 5(2):1-10, 2013.