Choosing Sample Size for Knowledge Tracing Models∗

Derrick Coetzee
University of California, Berkeley
dcoetzee@berkeley.edu

∗This work was published at the BKT20y Workshop in conjunction with Educational Data Mining 2014. The author waives all rights to this work under Creative Commons CC0 1.0.

ABSTRACT
An important question in the practical application of Bayesian knowledge tracing models is determining how much data is needed to infer parameters accurately. If training data is inadequate, even a perfect inference algorithm will produce parameters with poor predictive power. In this work, we describe an empirical study using synthetic data that provides estimates of the accuracy of inferred parameters based on factors such as the number of students used to train the model, and the values of the underlying generating parameters. We find that the standard deviation of the error is roughly proportional to 1/√n, where n is the sample size, and that model parameters near 0 and 1 are easier to learn accurately.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data Mining

General Terms
Measurement, Theory

Keywords
Educational data mining, knowledge tracing, sample size

1. INTRODUCTION
Simple Bayesian knowledge tracing (BKT) models a student's observed responses to a sequence of items as a Markov process, with their knowledge state as a hidden underlying variable. If values are given for the four standard parameters (learning rate, prior, guess, and slip), the likelihood of a particular set of response sequences can be computed. Using standard search procedures like expectation maximization (EM), the parameter set giving the highest likelihood for a given set of sequences can be determined, provided that the procedure converges to the global maximum.

However, even if the procedure identifies the global maximum correctly and precisely, the resulting parameters may not reflect the actual parameters that generated the data; this is a sampling error effect. It's clearest with very small samples, such as samples of size 1, but exists with larger samples as well. Empirical studies with synthetic data generated from known parameters show that the inferred parameters for a given data set can differ substantially from the generating parameters, and this same issue would arise in real settings. An understanding of the magnitude of sampling error in a particular scenario can help to explain why the resulting model does or does not make effective predictions. Moreover, by providing a means to describe the distribution of possible generating parameter values, the uncertainty of calculations based on those parameters, such as predictions, can also be determined.
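To make the model concrete, the following is a minimal sketch of the standard four-parameter BKT likelihood computation for a single response sequence. This is not the implementation used in our experiments (which relies on the fastHMM library described in section 3); the function name and example values are illustrative.

```python
def sequence_likelihood(responses, learn, prior, guess, slip):
    """Likelihood of one student's responses (1=correct, 0=incorrect)
    under the standard four-parameter BKT model (illustrative sketch)."""
    p_know = prior                      # P(student knows the skill)
    likelihood = 1.0
    for r in responses:
        p_correct = p_know * (1 - slip) + (1 - p_know) * guess
        likelihood *= p_correct if r == 1 else 1 - p_correct
        # Bayesian update of the knowledge state given the observation.
        if r == 1:
            p_know = p_know * (1 - slip) / p_correct
        else:
            p_know = p_know * slip / (1 - p_correct)
        # Learning transition after each practice opportunity.
        p_know += (1 - p_know) * learn
    return likelihood

# Example: likelihood of the sequence wrong, wrong, right, right, right
# under the plausible parameters later used in section 3.1.
print(sequence_likelihood([0, 0, 1, 1, 1], 0.2, 0.4, 0.14, 0.05))
```

The likelihood of a full set of sequences is the product of the per-student likelihoods; EM searches for the parameter set maximizing it.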
2. RELATED WORK
For simple problems, such as identifying the mean value of a parameter in a population, or the proportion of the population falling into a subgroup, there are simple and well-understood statistical approaches for determining sample size based on statistical power. Such analytic approaches are not immediately applicable to the problem of minimizing the HMM error function because of its complexity and high dimensionality.

Falakmasir et al [2] have noted that training time increases linearly with the size of the training set. Choosing an appropriate sample size for a certain desired level of accuracy can thus help to reduce training time, which is important both for research and in some real-time interactive tutor applications.

Nooraei et al [3] found that using only the 15 most recent data points from each student to train a knowledge tracing model yielded root mean-square error during prediction comparable to using the student's full history. For one data set, the 5 most recent items sufficed. Our study conversely does not vary the number of items per student, but instead varies the number of students and the four parameters generating the data. By allowing sample size to be reduced to meet a desired accuracy, our work offers an orthogonal method of further reducing training time.

Van de Sande [8] has suggested that as samples become larger, models with small parameter sets may no longer be rich enough to capture the sample's complexity. Thus our exclusive reliance on a simple four-parameter BKT model even for very large samples is a limitation of our approach.

3. METHODOLOGY
In our experiments we relied on a simple standard Bayesian knowledge tracing model with four parameters: learning rate, prior, guess, and slip. There is only one value for each parameter, and no specialization by student or problem. Each synthetic student responded to five items; we do not vary this parameter in this study, since Nooraei et al [3] report that increasing it has diminishing returns, but future work may investigate it.

We generate separate datasets for each of our experiments. In each case, we enumerate a sequence of models (each specified by values for learn, prior, guess, slip, and sample size), and for each of those models, we generate a large number of random samples consistent with that model. For example, for a particular model, we may generate 1000 samples each containing 1000 students.
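A simplified sketch of this generation step, assuming the standard BKT sampling process (the actual experiments use fastHMM [5]; all names here are illustrative):

```python
import random

def simulate_student(learn, prior, guess, slip, n_items=5):
    """Sample one synthetic student's response sequence (1=correct)."""
    knows = random.random() < prior
    responses = []
    for _ in range(n_items):
        if knows:
            responses.append(0 if random.random() < slip else 1)
        else:
            responses.append(1 if random.random() < guess else 0)
        # A student who doesn't know the skill may learn it at each step.
        if not knows and random.random() < learn:
            knows = True
    return responses

def generate_sample(params, n_students=1000):
    """One random sample: a list of response sequences, one per student."""
    return [simulate_student(*params) for _ in range(n_students)]
```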
We then run EM on each sample to find the parameter set giving the maximum likelihood value. All parameters are permitted to vary during the search. EM is started at the generating parameters and run until fully converged (within 10^-12, or until 100 iterations are complete). Starting at the generating parameters is not feasible in a realistic setting, but here it allows EM to run quickly and consistently reach the global maximum. As shown in Figure 1, the parameter values inferred from these samples approximate a normal distribution with a mean equal to the generating parameter.

Finally, we take all samples generated from a single model and, for each parameter, record the mean and standard deviation of the inferred values for that parameter. We chose the number of samples generated for each model large enough that these statistics remain stable under repeated runs. Mean values for each parameter were consistently near the generating parameter, typically within at most 0.1 standard deviations. Standard deviation provides an estimate of variation in the inferred parameter values, and is plotted. Different models yield different standard deviation values.

Because of the very large number of large samples involved in this approach, we use the fastHMM C++ BKT library designed by Pardos and Johnson [5] to quickly generate datasets and perform EM, invoked from a Matlab script.

[Figure 1 (histogram; x-axis: inferred learning rate, y-axis: proportion of samples): Given the fixed model learn=0.2, prior=0.4, guess=0.14, slip=0.05, we generated 10000 samples with 1000 students each, and for each, inferred all four parameters using EM. The distribution of the inferred learning rate parameter over the samples is shown. The mean differs by 3 × 10^-6 from the true generating parameter 0.2. The standard deviation is 0.01121, and the orange line shows the expected height of each bar if the proportions precisely followed a normal distribution. Scipy's normaltest [7] rejects that the distribution is perfectly normal (p < 0.0002), and a small amount of negative (left) skew is visible; the median is 0.00016 smaller than the mean. But the distribution is close enough to normal for our purposes.]
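The statistics reported in the Figure 1 caption can be checked with scipy.stats.normaltest [7]. A sketch, assuming the inferred learning rates have been collected into a text file (the file name is hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical file holding the learning rate inferred from each of the
# 10000 synthetic samples behind Figure 1.
rates = np.loadtxt("inferred_learn_rates.txt")

print("mean:", rates.mean())          # near the generating value 0.2
print("std dev:", rates.std(ddof=1))  # roughly 0.011 at 1000 students
stat, p = stats.normaltest(rates)     # D'Agostino-Pearson test
print("p-value:", p)                  # p < 0.05 rejects exact normality
```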
3.1 Varying one parameter
In our first experiment, we start with typical, plausible values for all four parameters: learn=0.2, prior=0.4, guess=0.14, slip=0.05. These values are consistent with prior work that found large guess and slip values (> 0.5) to be implausible in most scenarios [6], and in our 5-problem scenario, the chance of learning the material by the end is about 67%, which is reasonable.

Then, for each of the four parameters, we hold the other parameters at their single plausible value, and vary the remaining parameter from 0 to 1 in steps of 0.01. This results in 404 total parameter sets. For each parameter set, we generate 1000 random samples of 1000 students each. The number of students is fixed at 1000, which is large enough to consistently produce a standard deviation not exceeding 0.03; this avoids the boundary effects near 0 and 1 that would occur for very small samples.

In this experiment, we focus on the variance of our estimates of the parameter that is being varied, and don't consider variance of the other (fixed) parameters.

3.2 Interactions between parameters
In this experiment, similar to the first, we hold three parameters fixed (learn=0.2, prior=0.4, guess=0.14), and vary slip between 0 and 1 in steps of 0.01. This gives 101 parameter sets. For each, we generate 1000 random samples of 1000 students each. However, in this experiment we examine the variance of our estimates of all four parameters, rather than just the one being varied (slip). This experiment helps to demonstrate to what extent varying one parameter can affect the difficulty of accurately inferring other parameters.

3.3 Varying sample size
In our third experiment, we fix the value of all four parameters, but vary the sample size in powers of two from 2 to 2097152. For sample sizes below 10000, we generate 1000 samples of that size, while for those above we generate 100 samples. The parameter values are heuristically chosen based on the prior experiments above to generate large error values (but not necessarily the worst possible error). We examine how the variation of our estimates of all four parameters varies with sample size, and identify any trends.

3.4 Interaction between sample size and parameters
In our final experiment, we vary both the learning rate (from 0 to 1 in steps of 0.01) and the sample size (between the values 1000, 10000, and 100000) at the same time. This enables us to examine whether there is any interaction between parameters and sample size. For 1000 and 10000 students we use 1000 samples, while for 100000 students we use 100 samples, to reduce runtime.
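All four experiments share the same outer loop: enumerate parameter sets, generate samples, fit each with EM, and record statistics. A sketch for the single-parameter sweep of section 3.1, where fit_bkt is a placeholder for an EM fitter (our experiments use fastHMM [5], not this stub) and generate_sample is the sketch from section 3:

```python
import numpy as np

def fit_bkt(sample, init):
    """Placeholder for an EM fitter; any BKT EM implementation
    (e.g. the fastHMM C++ library [5]) could be substituted."""
    raise NotImplementedError

def run_sweep(vary_index, fixed=(0.2, 0.4, 0.14, 0.05),
              n_samples=1000, n_students=1000):
    """Vary one parameter from 0 to 1 in steps of 0.01, holding the
    others fixed; record the mean and std dev of its inferred value."""
    results = []
    for value in np.arange(0.0, 1.005, 0.01):
        params = list(fixed)
        params[vary_index] = value
        estimates = []
        for _ in range(n_samples):
            sample = generate_sample(params, n_students)
            # EM is started at the generating parameters (section 3).
            estimates.append(fit_bkt(sample, init=params)[vary_index])
        results.append((value, np.mean(estimates),
                        np.std(estimates, ddof=1)))
    return results
```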
4. RESULTS
4.1 Varying one parameter
As described in section 3.1, in this experiment we vary each parameter between 0 and 1 while holding the other parameters fixed, and examine how the variation in our inference of that parameter changes with its value. As shown in Figure 2, parameters with values near 0 or 1 are easier to accurately estimate, while those with values in the 0.4 to 0.8 range are more difficult to infer. Each parameter exhibits a unique pattern, with prior behaving worst for small values, guess behaving worst for values in the middle, and learning rate performing worst for the largest values. Slip is unique in having two peaks in its curve, near 0.5 and 0.8.

[Figure 2 (line plot; x-axis: generating parameter value, y-axis: std dev of inferred parameters; one curve each for learning rate, prior, guess, slip): Variation of inferred parameters, based on the underlying generating parameter. For each curve, all parameters other than the one being examined are fixed at plausible values. Values near 0 and 1 are the easiest to infer accurately, and each parameter exhibits a unique pattern.]

4.2 Interactions between parameters
As described in section 3.2, in this experiment we vary slip between 0 and 1 while keeping the other parameters fixed, and examine how the variation of all four inferred parameters varies, as shown in Figure 3. All variance values exhibit a strong, complex dependence on the slip parameter; in particular, there is a dramatic and unexpected drop from large variance to small variance around slip=0.85. We conclude that the variance of an inferred parameter depends not only on the value of that parameter, but also on the values of other parameters.

[Figure 3 (line plot; x-axis: slip parameter value, y-axis: std dev of inferred parameters): As the slip parameter is varied and the other parameters are held fixed (learn=0.2, prior=0.4, guess=0.14), the error in our inference of all other parameters varies in a strong and complex fashion, indicating interactions in the inference of different parameters.]

4.3 Varying sample size
We fix the parameters at the values empirically determined in section 4.1 to give maximum variance (roughly based on the maximums of the curves, with prior and guess at 0.5, and learning rate and slip at 0.67). Because section 4.2 suggests that there are interactions between parameters, this may not give the worst-case variance possible over all combinations, but it is a reasonable starting point for realistic values.

As described in section 3.3, sample size is varied in powers of two from 2 to 2097152. Figure 4 shows the result, suggesting that (except for very small samples) the standard deviation of the error is roughly proportional to n^-0.5, or 1/√n, where n is the sample size. For these particular parameter values, slip is consistently inferred most accurately, learning rate is inferred least accurately, and guess and prior are between the two and are similar.

[Figure 4 (log-log plot; x-axis: sample size in students, 10^0 to 10^6; y-axis: std dev of inferred parameters; fitted trend y = 0.4215x^-0.533, R² = 0.9963): Accuracy of inferred parameters, based on sample size (training set size), with fixed parameters (prior=guess=0.5, learning=slip=0.67). Once the y = 0.1 level is reached, the lines each remain straight with a slope of roughly -0.5. This suggests that the standard deviation of the error is roughly proportional to 1/√n, where n is the sample size.]

4.4 Interaction between sample size and parameters
In our final experiment, as described in section 3.4, we vary both the learning rate and the sample size at the same time. The standard deviation curves for the three sample sizes are then plotted on the same plot, each divided by the 1/√n factor, where n is the sample size, as shown in Figure 5. The curves are nearly identical, and we find no evidence of interaction between parameters and sample size, but we can't rule out interaction for other combinations of parameter values. This also offers additional evidence for the 1/√n trend from the previous section.

[Figure 5 (line plot; x-axis: learning rate, y-axis: std dev of inferred learning rates, normalized by sample size; one curve per sample size): Here we vary learning rate from 0 to 1, and also vary sample size between the values 1000, 10000, and 100000. The resulting standard deviations are divided by 1/√n to normalize for the improvement in error due to increased sample size. The resulting curves are nearly identical; the curve for 100000 students appears noisier only because of a lower number of samples (100 instead of 1000). We find no evidence of interaction between sample size and the learning rate.]

5. DISCUSSION
Because accuracy is good for parameter values near 0 and 1, this implies that for large enough samples, boundary effects (in which the distribution of error is skewed because values outside of the 0-1 range are not permitted) are not a serious concern.

Interactions between parameters are complex, suggesting that attempting to characterize error in each parameter independently is unlikely to yield good predictions of error. Moreover, attempts to model these interactions analytically may be challenging because they cannot be fit well by low-degree polynomials. A more viable strategy is to form a conservative estimate of error by conducting a grid search of parameter sets that are plausible in a given scenario. On the other hand, once the range of variances at a particular (sufficiently large) sample size is characterized, Figure 4 and Figure 5 show that altering the sample size has a uniform and predictable effect on the error.

The main result, that standard deviation is proportional to 1/√n, suggests that in order to decrease the margin of error in the estimate of a parameter by a factor of 2, an increase in sample size by a factor of 4 is required. Additionally, Figure 4 shows that achieving even a single valid significant digit in the learning rate requires sample sizes of 1000 students or more. This suggests that studies using BKT with fewer than 1000 students should be considered carefully for sampling error.

5.1 Confidence Intervals and Decreasing Training Time
As noted in Figure 1, provided that the sample size is large enough, the distribution of inferred parameters is approximated well by a normal distribution, and the standard deviations computed in synthetic simulations such as the preceding ones can be used to compute confidence intervals containing the true generating parameters (e.g. 95% of possible values are within two standard deviations). Parameters used in these simulations can be set either by using domain knowledge, and/or by conservatively selecting values that give poor accuracy.

To use our results to decrease training time for a large data set, one approach is to create many small samples (e.g. 100 of size 1000) by sampling uniformly at random with replacement from the full data set. By training on these, we can estimate the variance of our estimates of each parameter at a sample size of 1000. Then, given a desired level of accuracy and a desired probability of achieving it, we can use 1/√n to estimate the best final sample size. If the estimated sample size exceeds the data size, this suggests that more data needs to be gathered.
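A sketch of this extrapolation step, assuming the 1/√n trend holds and the pilot standard deviation has already been estimated from subsamples (the function and example values are illustrative):

```python
import math

def required_sample_size(sd_pilot, n_pilot, target_halfwidth, z=1.96):
    """Extrapolate the number of students needed for a desired 95%
    confidence half-width, assuming the 1/sqrt(n) trend from Figure 4:
    sd(n) = sd_pilot * sqrt(n_pilot / n)."""
    target_sd = target_halfwidth / z
    return math.ceil(n_pilot * (sd_pilot / target_sd) ** 2)

# Example: if pilot subsamples of 1000 students give sd = 0.011 for the
# learning rate, a 95% interval of +/-0.005 needs roughly 18600 students.
print(required_sample_size(0.011, 1000, 0.005))   # -> 18594
```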
6. IDENTIFIABILITY PROBLEM
Although we have in this work considered a particular generating parameter set to be the correct and desired parameters, BKT exhibits an Identifiability Problem [1] in which there is an infinite family of four-parameter solutions that make the same predictions. This creates the risk that a solution that appears to be far from the generating parameters is actually very close to an equivalent parameter set (or that an equivalent solution is).

Van de Sande [9] more specifically characterized BKT (in its HMM form) as a three-parameter system in which two systems having the same slip, learning rate, and A value will yield the same predictions, where A is given by

    A = (1 − slip − guess)(1 − prior).

One way to address the issue is to perform both data generation and parameter search in this reduced three-parameter system; this would be similar to our current approach, but error in the A parameter is more difficult to interpret. Intuitively, we expect search in a lower-dimensional space to give better accuracy with the same amount of data. However, Van de Sande also notes that the algorithm form of BKT has no analytic solution, and so the degree to which BKT is underdetermined may depend on the specific application.

Beyond the underdetermined nature of BKT, there are also information-theoretic bounds that limit the accuracy of inferring parameters regardless of the system. In particular, given a collection of at least k different parameter sets, and student data that can only take on fewer than k values, there is no procedure that can reliably infer the generating parameters without error. As the size of the data continues to decrease, the minimum possible error increases. Although these bounds are general, they typically apply only to very small data sets.
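To illustrate Van de Sande's characterization, the following sketch constructs a second four-parameter set with the same slip, learning rate, and A value as the parameters from section 3.1. The construction is illustrative only; it simply inverts the definition of A for a new guess value.

```python
def A_value(learn, prior, guess, slip):
    """Van de Sande's combined parameter: systems with equal slip,
    learning rate, and A yield the same predictions [9]."""
    return (1 - slip - guess) * (1 - prior)

def equivalent_params(learn, prior, guess, slip, new_guess):
    """A different four-parameter set sharing the same slip, learning
    rate, and A value (valid when the resulting prior lies in [0, 1])."""
    A = A_value(learn, prior, guess, slip)
    new_prior = 1 - A / (1 - slip - new_guess)
    return (learn, new_prior, new_guess, slip)

base = (0.2, 0.4, 0.14, 0.05)                    # parameters from section 3.1
other = equivalent_params(*base, new_guess=0.2)  # prior becomes 0.352
print(A_value(*base), A_value(*other))           # both 0.486, up to rounding
```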
7. CONCLUSIONS AND FUTURE WORK
We've only explored a small part of the space of input parameters that can affect inferred parameter accuracy; the possible interactions between parameters are complex and not fully understood. It would also be useful to examine different sizes of problem sets, scenarios where different students complete different numbers of problems, models where parameters such as learning rate and guess/slip are per problem, and models where priors are measured per student (as in Pardos and Heffernan [4]).

Although it seems intuitive that insufficient sample size can lead to poor parameter estimates with poor predictive power, this deserves verification: it's not clear which errors will damage prediction and which are benign. An empirical synthetic study that examines prediction accuracy could assess this cheaply. Going a step further, it would be useful to simulate an interactive tutoring system and assess a cost function that penalizes the system both for incorrect assessment of mastery, and for failing to assess mastery when it is reached. By applying weights to these error types, the simulation could represent the real-world cost of inaccurate parameters in such a system.

Another important direction is extending our results to real-world data. There are a few approaches. One is to use a very large real-world data set and use its inferred parameters as the ground-truth generating parameters, then examine smaller subsets to determine whether parameters are inferred less accurately. If the BKT model is appropriate, we expect to observe similar relationships between sample size and variance as with our synthetic data. This approach can be compared to one experiment of Ritter [6] (Figure 4), in which they took a large real data set and computed mean-squared error using the best-fit parameters on subsets with smaller numbers of students, ranging from 5 to 500.

There are other approaches to real-world validity. One would be a survey of prior BKT applications, to identify whether there is a consistent relationship between sample size and reported prediction accuracy. A third approach would be a controlled experiment in which two groups of very different sizes each use an ITS, the BKT is trained on the resulting data, and then the groups continue to use the ITS and their learning performance is examined (note however that asymmetric group sizes limit statistical power).

Finally, an analytical model that can explain some of our empirical results, such as the skewed normal distribution of inferred parameter values, the improvements in parameter inference near 0 and 1 parameter values, or the 1/√n relationship between sample size and standard deviation, would be a valuable contribution.

8. ACKNOWLEDGMENTS
We thank Zachary A. Pardos for his fastHMM C++ BKT library [5], for providing helpful comments on this work, and for designing the assignment which inspired it.

9. REFERENCES
[1] J. E. Beck and K.-M. Chang. Identifiability: A fundamental problem of student modeling. In Proceedings of the 11th International Conference on User Modeling, UM '07, pages 137-146, Berlin, Heidelberg, 2007. Springer-Verlag.
[2] M. H. Falakmasir, Z. A. Pardos, G. J. Gordon, and P. Brusilovsky. A spectral learning approach to knowledge tracing. In 6th International Conference on Educational Data Mining (EDM 2013), pages 28-35. International Educational Data Mining Society, 2013.
[3] B. B. Nooraei, Z. A. Pardos, N. T. Heffernan, and R. S. J. de Baker. Less is more: Improving the speed and prediction power of knowledge tracing by using less data. In M. Pechenizkiy, T. Calders, C. Conati, S. Ventura, C. Romero, and J. C. Stamper, editors, EDM, pages 101-110. www.educationaldatamining.org, 2011.
[4] Z. A. Pardos and N. T. Heffernan. Modeling individualization in a Bayesian networks implementation of knowledge tracing. In P. D. Bra, A. Kobsa, and D. N. Chin, editors, UMAP, volume 6075 of Lecture Notes in Computer Science, pages 255-266. Springer, 2010.
[5] Z. A. Pardos and M. J. Johnson. Scaling cognitive modeling to massive open environments (in preparation). 2015.
[6] S. Ritter, T. K. Harris, T. Nixon, D. Dickison, R. C. Murray, and B. Towle. Reducing the knowledge tracing space. In T. Barnes, M. C. Desmarais, C. Romero, and S. Ventura, editors, EDM, pages 151-160. www.educationaldatamining.org, 2009.
[7] SciPy v0.13.0 Reference Guide: scipy.stats.normaltest. http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html, May 2013. [Online; accessed 24-April-2014].
[8] B. Van de Sande. Applying three models of learning to individual student log data. In 6th International Conference on Educational Data Mining (EDM 2013), pages 193-199. International Educational Data Mining Society, 2013.
[9] B. Van de Sande. Properties of the Bayesian knowledge tracing model. Journal of Educational Data Mining, 5(2):1-10, 2013.