<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Choosing Sample Size for Knowledge Tracing Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Derrick Coetzee</string-name>
          <email>dcoetzee@berkeley.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of California</institution>
          ,
          <addr-line>Berkeley</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>An important question in the practical application of Bayesian knowledge tracing models is determining how much data is needed to infer parameters accurately. If training data is inadequate, even a perfect inference algorithm will produce parameters with poor predictive power. In this work, we describe an empirical study using synthetic data that provides estimates of the accuracy of inferred parameters based on factors such as the number of students used to train the model and the values of the underlying generating parameters. We find that the standard deviation of the error is roughly proportional to 1/√n, where n is the sample size, and that model parameters near 0 and 1 are easier to learn accurately.</p>
      </abstract>
      <kwd-group>
        <kwd>Educational data mining</kwd>
        <kwd>knowledge tracing</kwd>
        <kwd>sample size</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Simple Bayesian knowledge tracing models a student's
observed responses to a sequence of items as a Markov process,
with their knowledge state as a hidden underlying variable.
If values are given for the four standard parameters,
learning rate, prior, guess, and slip, the likelihood of a particular
set of response sequences can be computed. Using standard
search procedures like expectation maximization (EM), the
parameter set giving the highest likelihood for a given set of
sequences can be determined, provided that the procedure
converges to the global maximum.</p>
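      <p>The likelihood computation mentioned above can be sketched as follows. This is a minimal illustration of the standard HMM forward recursion for a two-state model without forgetting, not the implementation used in this study; the function names <monospace>bkt_sequence_likelihood</monospace> and <monospace>log_likelihood</monospace> are hypothetical, and responses are assumed to be coded as 0 (incorrect) or 1 (correct).</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def bkt_sequence_likelihood(responses, prior, learn, guess, slip):
    """Likelihood of one 0/1 response sequence under four-parameter BKT.

    Runs the HMM forward recursion with two hidden states (unknown, known)
    and no forgetting. Names and encoding are illustrative assumptions.
    """
    p_known = prior  # P(skill known before the current item, given past responses)
    likelihood = 1.0
    for response in responses:
        p_correct = p_known * (1.0 - slip) + (1.0 - p_known) * guess
        p_observed = p_correct if response == 1 else 1.0 - p_correct
        if p_observed == 0.0:
            return 0.0  # the sequence is impossible under these parameters
        likelihood *= p_observed
        # Condition the knowledge estimate on the observed response (Bayes' rule).
        if response == 1:
            p_known = p_known * (1.0 - slip) / p_observed
        else:
            p_known = p_known * slip / p_observed
        # Apply the learning transition before the next item.
        p_known = p_known + (1.0 - p_known) * learn
    return likelihood


def log_likelihood(sequences, prior, learn, guess, slip):
    """Log-likelihood of a set of independent response sequences."""
    total = 0.0
    for seq in sequences:
        p = bkt_sequence_likelihood(seq, prior, learn, guess, slip)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total
```
]]></preformat>
      <p>A search procedure such as EM would then look for the parameter values that maximize <monospace>log_likelihood</monospace> over the observed sequences.</p>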
      <p>However, even if the procedure identifies the global
maximum correctly and precisely, the resulting parameters may
not reflect the actual parameters that generated the data;
this is a sampling error effect. It is clearest with very small
samples, such as samples of size 1, but exists with larger
samples as well. Empirical studies with synthetic data generated
from known parameters show that the inferred parameters
for a given data set can differ substantially from the
generating parameters, and this same issue would arise in real
settings. An understanding of the magnitude of sampling
error in a particular scenario can help to explain why the
resulting model does or does not make effective predictions.
Moreover, by providing a means to describe the distribution
of possible generating parameter values, the uncertainty of
calculations based on those parameters, such as predictions,
can also be determined.</p>
      <p>This work was published at the BKT20y Workshop in
conjunction with Educational Data Mining 2014. The author waives
all rights to this work under Creative Commons CC0 1.0.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>For simple problems, such as identifying the mean value of
a parameter in a population, or the proportion of the
population falling into a subgroup, there are simple and
well-understood statistical approaches for determining sample
size based on statistical power. Such analytic approaches
are not immediately applicable to the problem of
minimizing the HMM error function because of its complexity and
high dimensionality.</p>
      <p>
        Falakmasir et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] have noted that training time increases
linearly with the size of the training set. Choosing an
appropriate sample size for a certain desired level of accuracy
can thus help to reduce training time, which is important
both for research and in some real-time interactive tutor
applications.
      </p>
      <p>
        Nooraei et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] found that using only the 15 most recent
data points from each student to train a knowledge
tracing model yielded root mean-square error during prediction
comparable to using the student's full history. For one data
set, the 5 most recent items sufficed. Our study, conversely,
does not vary the number of items per student, but instead
varies the number of students and the four parameters
generating the data. By allowing sample size to be reduced
to meet a desired accuracy, our work offers an orthogonal
method of further reducing training time.
      </p>
      <p>
        Van de Sande [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] has suggested that as samples become larger,
models with small parameter sets may no longer be rich
enough to capture the sample's complexity. Thus our
exclusive reliance on a simple four-parameter BKT model even
for very large samples is a limitation of our approach.
      </p>
      <p>[Figure: histogram; horizontal axis: inferred learning rate.]</p>
    </sec>
    <sec id="sec-3">
      <title>3. METHODOLOGY</title>
      <p>
        In our experiments we relied on a simple standard Bayesian
knowledge tracing model with four parameters: learning
rate, prior, guess, and slip. There is only one value for
each parameter, and no specialization by student or
problem. Each synthetic student responded to five items; we
do not vary this parameter in this study, since Nooraei et
al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] report that increasing this parameter has diminishing
returns, but future work may investigate it.
      </p>
      <p>We generate separate datasets for each of our experiments.
In each case, we enumerate a sequence of models (each
specified by values for learn, prior, guess, slip, and sample size), and
for each of those models, we generate a large number of
random samples consistent with that model. For example,
for a particular model, we may generate 1000 samples each
containing 1000 students.</p>
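      <p>The sampling step described above can be sketched as follows. This is an illustrative stand-in written with Python's standard library, not the fastHMM code used in the study; the function names and the 0/1 response coding are assumptions made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def generate_student(rng, n_items, prior, learn, guess, slip):
    """Simulate one student's 0/1 responses under four-parameter BKT."""
    known = rng.random() < prior  # initial knowledge state
    responses = []
    for _ in range(n_items):
        p_correct = (1.0 - slip) if known else guess
        responses.append(1 if rng.random() < p_correct else 0)
        # An unlearned student may learn after each item; there is no forgetting.
        if not known and rng.random() < learn:
            known = True
    return responses


def generate_sample(rng, n_students, n_items, prior, learn, guess, slip):
    """One sample: a list of response sequences, one per synthetic student."""
    return [
        generate_student(rng, n_items, prior, learn, guess, slip)
        for _ in range(n_students)
    ]


# One sample of 1000 students answering five items each.
rng = random.Random(0)  # fixed seed so the example is repeatable
sample = generate_sample(rng, 1000, 5, prior=0.4, learn=0.2, guess=0.14, slip=0.05)
```
]]></preformat>
      <p>Repeating the last call 1000 times for the same model would give the 1000 samples of 1000 students used as the example above.</p>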
      <p>We then run EM on each sample to find the parameter set
giving the maximum likelihood value. All parameters are
permitted to vary during the search. EM is run starting
at the generating parameters and run until fully converged
(within 10<sup>-12</sup> or until 100 iterations are complete).
Starting at the generating parameters is not feasible in a realistic
setting, but here it allows EM to run quickly and
consistently reach the global maximum. As shown in Figure 1, the
parameter values inferred from these samples approximate
a normal distribution with a mean equal to the generating
parameter.</p>
      <p>
        Finally, we take all samples generated from a single model
and, for each parameter, record the mean and standard
deviation of the inferred values for that parameter. We chose the
number of samples generated for each model large enough
so that these statistics remain stable under repeated runs.
Mean values for each parameter were consistently near the
generating parameter, typically within 0.1 standard
deviations. Standard deviation provides an estimate of
variation in the inferred parameter values, and is plotted.
Different models yield different standard deviation values.
Because of the very large number of large samples involved
in this approach, we use the fastHMM C++ BKT library
designed by Pardos and Johnson [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to quickly generate
datasets and perform EM, invoked from a Matlab script.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Varying one parameter</title>
      <p>
        In our first experiment, we start with typical, plausible
values for all four parameters: learn=0.2, prior=0.4, guess=0.14,
slip=0.05. These values are consistent with prior work that
found large guess and slip values (&gt; 0.5) to be implausible in
most scenarios [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and in our 5-problem scenario, the chance
of learning the material by the end is about 67%, which is
reasonable.
      </p>
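      <p>One reading consistent with the 67% figure (an assumption, since the calculation is not spelled out here) is that a student who starts without the skill has five independent opportunities to learn it, each succeeding with probability learn=0.2:</p>
      <p>
        <disp-formula><tex-math><![CDATA[1 - (1 - \mathit{learn})^{5} = 1 - 0.8^{5} = 1 - 0.32768 \approx 0.67]]></tex-math></disp-formula>
      </p>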
      <p>Then, for each of the four parameters, we hold the other
parameters at their single plausible value, and vary the
remaining parameter from 0 to 1 in steps of 0.01. This results
in 404 total parameter sets.</p>
      <p>For each parameter set, we generate 1000 random samples
of 1000 students each. In this experiment, the number of
students is fixed at 1000, which is large enough to
consistently produce a standard deviation not exceeding 0.03;
this avoids the boundary effects near 0 and 1 that would
occur for very small samples.</p>
      <p>In this experiment, we focus on the variance of our estimates
of the parameter that is being varied, and do not consider
variance of the other (fixed) parameters.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Interactions between parameters</title>
      <p>In this experiment, similar to the first, we hold three
parameters fixed (learn=0.2, prior=0.4, guess=0.14), and vary
slip between 0 and 1 in steps of 0.01. This gives 101
parameter sets. For each, we generate 1000 random samples of
1000 students each. However, in this experiment we
examine variance of our estimates of all four parameters, rather
than just the one being varied (slip). This experiment helps
to demonstrate to what extent varying one parameter can
affect the difficulty of accurately inferring other parameters.</p>
    </sec>
    <sec id="sec-6">
      <title>3.3 Varying sample size</title>
      <p>In our third experiment, we fix the value of all four
parameters, but vary the sample size in powers of two from
2 to 2097152. For sample sizes below 10000, we generate
1000 samples of that size, while for those above we generate
100 samples. The parameter values are heuristically chosen
based on the prior experiments above to generate large error
values (but not necessarily the worst possible error). We
examine how variation of our estimates of all four parameters
varies with sample size, and identify any trends.</p>
      <p>[Figure: vertical axis: standard deviation of inferred parameters, 0 to 0.1.]</p>
    </sec>
    <sec id="sec-7">
      <title>3.4 Interaction between sample size and parameters</title>
      <p>In our final experiment, we vary both the learning rate (from
0 to 1 in steps of 0.01) and the sample size (taking the
values 1000, 10000, and 100000) at the same time. This enables us
to examine whether there is any interaction between
parameters and sample size. For 1000 and 10000 students we use
1000 samples, while for 100000 students we use 100 samples,
to reduce runtime.</p>
    </sec>
    <sec id="sec-8">
      <title>4. RESULTS</title>
    </sec>
    <sec id="sec-9">
      <title>4.1 Varying one parameter</title>
      <p>As described in section 3.1, in this experiment we vary each
parameter between 0 and 1 while holding the other
parameters fixed, and examine how the variation in our inference
of that parameter changes with its value. As shown in
Figure 2, parameters with values near 0 or 1 are easier to
accurately estimate, while those with values in the 0.4 to 0.8
range are more difficult to infer. Each parameter exhibits a
unique pattern, with prior behaving worst for small values,
guess behaving worst for values in the middle, and learning
rate performing worst for the largest values. Slip is unique
in having two peaks in its curve near 0.5 and 0.8.</p>
    </sec>
    <sec id="sec-10">
      <title>4.2 Interactions between parameters</title>
      <p>As described in section 3.2, in this experiment we vary slip
between 0 and 1 while keeping the other parameters fixed,
and examine how the variation of all four inferred
parameters varies, as shown in Figure 3. All variance values exhibit
a strong, complex dependence on the slip parameter; in
particular, there is a dramatic and unexpected drop from large
variance to small variance around slip=0.85. We conclude
that the variance of an inferred parameter depends not only
on the value of that parameter, but also on the values of other
parameters.</p>
    </sec>
    <sec id="sec-11">
      <title>4.3 Varying sample size</title>
      <p>We fix the parameters at the values empirically determined
in section 4.1 to give maximum variance (roughly based on
the maxima of the curves, with prior and guess at 0.5, and
learning rate and slip at 0.67). Because section 4.2 suggests
that there are interactions between parameters, this may not
give the worst-case variance possible of all combinations, but
it is a reasonable starting point for realistic values.
As described in section 3.3, sample size is varied in powers of
two from 2 to 2097152. Figure 4 shows the result, suggesting
that (except for very small samples) the standard deviation
of the error is roughly proportional to n<sup>-0.5</sup>, or 1/√n, where
n is the sample size. For these particular parameter values,
slip is consistently inferred most accurately, learning rate is
inferred least accurately, and guess and prior are between
the two and are similar.</p>
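      <p>A power-law trend of this kind can be checked by fitting a straight line to the logarithms of sample size and standard deviation; a slope near -0.5 corresponds to 1/√n. The sketch below is one way to do this with ordinary least squares. The function name <monospace>fit_power_law</monospace> is hypothetical, and the input values are made up to follow 0.4/√n exactly; they are not measurements from this study.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def fit_power_law(sizes, std_devs):
    """Least-squares fit of log(sd) = log(c) + k * log(n); returns (c, k)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(s) for s in std_devs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx                          # exponent of the power law
    c = math.exp(mean_y - k * mean_x)      # multiplicative constant
    return c, k


# Made-up illustrative values that follow 0.4 / sqrt(n) exactly; not study data.
sizes = [2 ** e for e in range(7, 22)]
std_devs = [0.4 / math.sqrt(n) for n in sizes]
c, k = fit_power_law(sizes, std_devs)
```
]]></preformat>
      <p>With measured standard deviations in place of the made-up ones, the fitted exponent k indicates how closely the data follow the 1/√n trend.</p>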
    </sec>
    <sec id="sec-12">
      <title>4.4 Interaction between sample size and parameters</title>
      <p>In our final experiment, as described in section 3.4, we vary
both the learning rate and the sample size at the same time.
The standard deviation curves for the three sample sizes are
then plotted on the same plot, each divided by the 1/√n
factor, where n is the sample size, as shown in Figure 5.
The curves are nearly identical, and we find no evidence
of interaction between parameters and sample size, but we
cannot rule out interaction for other combinations of
parameter values. This also offers additional evidence for the 1/√n
trend from the previous section.</p>
    </sec>
    <sec id="sec-13">
      <title>5. DISCUSSION</title>
      <p>Because accuracy is good for parameter values near 0 and 1,
boundary effects (in which the distribution of error is skewed
because values outside of the 0-1 range are not permitted) are
not a serious concern for large enough samples.</p>
      <p>Interactions between parameters are complex, suggesting
that attempting to characterize error in each parameter
independently is unlikely to yield good predictions of error.
Moreover, attempts to model these interactions analytically
may be challenging because they cannot be fit well by
low-degree polynomials. A more viable strategy is to form a
conservative estimate of error by conducting a grid search
of parameter sets that are plausible in a given scenario. On
the other hand, once the range of variances at a particular
(sufficiently large) sample size is characterized, Figure 4 and
Figure 5 show that altering the sample size has a uniform
and predictable effect on the error.</p>
      <p>[Figure: horizontal axis: sample size (number of students), logarithmic scale from 1E+0 to 1E+6; fitted trend line y = 0.4215x<sup>-0.533</sup>, R² = 0.9963.]</p>
      <p>The main result that standard deviation is proportional to
1/√n suggests that, in order to decrease the margin of error
in the estimate of a parameter by a factor of 2, an increase
in sample size by a factor of 4 is required. Additionally,
Figure 4 shows that achieving even a single valid significant
digit in the learning rate requires sample sizes of 1000
students or more. This suggests that studies using BKT with
fewer than 1000 students should be considered carefully for
sampling error.</p>
    </sec>
    <sec id="sec-14">
      <title>5.1 Confidence Intervals and Decreasing Training Time</title>
      <p>As shown in Figure 1, provided that the sample size is large
enough, the distribution of samples is approximated well
by a normal distribution, and the standard deviations
computed in synthetic simulations such as the preceding ones
can be used to compute confidence intervals containing the
true generating parameters (e.g. 95% of possible values are
within two standard deviations). Parameters used in these
simulations can be set by using domain knowledge,
by conservatively selecting values that give poor
accuracy, or both.</p>
      <p>To use our results to decrease training time for a large data
set, one approach is to create many small samples (e.g. 100
of size 1000) by sampling uniformly randomly with
replacement from the full data set. By training on these, we can
estimate the variance of our estimates of each parameter at a
sample size of 1000. Then, given a desired level of accuracy
and a desired probability of achieving it, we can use 1/√n
to estimate the best final sample size. If the estimated
sample size exceeds the data size, this suggests that more data
needs to be gathered.</p>
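      <p>The procedure above can be sketched as follows, assuming that the standard deviation scales exactly as 1/√n and that the desired accuracy is stated as a margin that should contain z standard deviations (z = 2 for roughly 95%, following the normal approximation above). All function names are hypothetical, and the step that fits BKT to each small sample is left out because it depends on the library in use.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import random
import statistics


def bootstrap_samples(rng, students, n_samples, sample_size):
    """Draw n_samples samples of sample_size students, uniformly with replacement."""
    return [rng.choices(students, k=sample_size) for _ in range(n_samples)]


def estimate_sd(estimates):
    """Sample standard deviation of one parameter's estimates across the small samples."""
    return statistics.stdev(estimates)


def required_sample_size(pilot_size, pilot_sd, margin, z=2.0):
    """Smallest n at which z standard deviations fit inside the margin.

    Assumes sd(n) = pilot_sd * sqrt(pilot_size / n), i.e. the 1/sqrt(n) trend.
    """
    target_sd = margin / z
    n = pilot_size * (pilot_sd / target_sd) ** 2
    return math.ceil(n - 1e-9)  # small guard against floating-point noise


# Hypothetical numbers: sd of 0.05 observed at 1000 students, desired margin 0.05.
needed = required_sample_size(pilot_size=1000, pilot_sd=0.05, margin=0.05)
```
]]></preformat>
      <p>In this hypothetical case the standard deviation must be halved, so the estimated sample size is four times the pilot size. If the result exceeds the number of students available, more data would need to be gathered.</p>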
    </sec>
    <sec id="sec-15">
      <title>6. IDENTIFIABILITY PROBLEM</title>
      <p>
        Although we have in this work considered a particular
generating parameter set to be the correct and desired
parameters, BKT exhibits an Identifiability Problem [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] in which
there is an infinite family of four-parameter solutions that
make the same predictions. This creates the risk that a
solution that appears to be far from the generating parameters
is actually very close to a parameter set equivalent to the
generating one (or has an equivalent solution that is close to
the generating parameters).
      </p>
      <p>
        Van de Sande [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] more specifically characterized BKT (in
its HMM form) as a three-parameter system in which two
systems having the same slip, learning rate, and A value will
yield the same predictions, where A is given by
      </p>
      <p>
        <disp-formula><tex-math><![CDATA[A = (1 - \mathit{slip} - \mathit{guess})(1 - \mathit{prior}).]]></tex-math></disp-formula>
      </p>
      <p>One way to address the issue is to perform both data
generation and parameter search in this reduced three-parameter
system; this would be similar to our current approach, but
error in the A parameter is more difficult to interpret.
Intuitively, we expect search in a lower-dimensional space to
give better accuracy with the same amount of data.
However, Van de Sande also notes that the algorithm form of
BKT has no analytic solution, and so the degree to which
BKT is underdetermined may depend on the specific
application.</p>
      <p>[Figure: legend: 1000 students, 10000 students, 100000 students; horizontal axis: learning rate, 0 to 1.]</p>
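      <p>The role of A can be illustrated numerically. The sketch below assumes A = (1 - slip - guess)(1 - prior) and checks only the marginal probability of a correct response at each step, averaged over the hidden knowledge state; it does not compare full sequence likelihoods. The function names and the second parameter set are hypothetical choices for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def a_value(prior, guess, slip):
    """A = (1 - slip - guess) * (1 - prior)."""
    return (1.0 - slip - guess) * (1.0 - prior)


def prior_for_a(a, guess, slip):
    """Prior that yields the given A for a chosen guess and slip."""
    return 1.0 - a / (1.0 - slip - guess)


def p_correct_by_step(n_items, prior, learn, guess, slip):
    """Marginal P(correct) at each step, without conditioning on past responses."""
    p_known = prior
    probs = []
    for _ in range(n_items):
        probs.append(p_known * (1.0 - slip) + (1.0 - p_known) * guess)
        p_known = p_known + (1.0 - p_known) * learn  # learning transition
    return probs


# First set: the plausible values from section 3.1.
a = a_value(prior=0.4, guess=0.14, slip=0.05)
first = p_correct_by_step(5, prior=0.4, learn=0.2, guess=0.14, slip=0.05)

# Second set: same slip, learning rate, and A, but a different guess and prior.
other_prior = prior_for_a(a, guess=0.3, slip=0.05)
second = p_correct_by_step(5, prior=other_prior, learn=0.2, guess=0.3, slip=0.05)

max_gap = max(abs(x - y) for x, y in zip(first, second))
```
]]></preformat>
      <p>Here <monospace>max_gap</monospace> is zero up to floating-point rounding, even though the two sets differ in guess and prior, which is the sense in which such sets are indistinguishable by these per-step predictions.</p>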
      <p>Beyond the underdetermined nature of BKT, there are also
information-theoretic bounds that limit the accuracy of
inferring parameters regardless of the system. In particular,
given a collection of at least k different parameter sets, and
student data that can take on fewer than k values, there is
no procedure that can reliably infer the generating
parameters without error. As the size of the data continues to
decrease, the minimum possible error increases. Although
these bounds are general, they typically apply only to very
small data sets.</p>
    <sec id="sec-16">
      <title>7. CONCLUSIONS AND FUTURE WORK</title>
      <p>
        We have only explored a small part of the space of input
parameters that can affect inferred parameter accuracy; the
possible interactions between parameters are complex and
not fully understood. It would also be useful to examine
different sizes of problem sets, scenarios where different
students complete different numbers of problems, models where
parameters such as learning rate and guess/slip are per
problem, and models where priors are measured per student (as
in Pardos and Heffernan [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]).
      </p>
      <p>Although it seems intuitive that insufficient sample size can
lead to poor parameter estimates with poor predictive power,
this deserves verification: it is not clear which errors will
damage prediction and which are benign. An empirical
synthetic study that examines prediction accuracy could assess
this cheaply. Going a step further, it would be useful to
simulate an interactive tutoring system and assess a cost
function that penalizes the system both for incorrect
assessment of mastery and for failing to assess mastery when it
is reached. By applying weights to these error types, the
simulation could represent the real-world cost of inaccurate
parameters in such a system.</p>
      <p>
        Another important direction is extending our results to
real-world data. There are a few approaches. One is to use a
very large real-world data set and use its inferred
parameters as the ground-truth generating parameters, then
examine smaller subsets to determine whether parameters are
inferred less accurately. If the BKT model is appropriate,
we expect to observe similar relationships between sample
size and variance as with our synthetic data. This approach
can be compared to one experiment of Ritter et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] (Figure 4),
in which they took a large real data set and computed
mean-squared error using the best-fit parameters on subsets with
smaller numbers of students, ranging from 5 to 500.
There are other approaches to real-world validity. One would
be a survey of prior BKT applications, to identify whether
there is a consistent relationship between sample size and
reported prediction accuracy. A third approach would be a
controlled experiment in which two groups of very different
sizes each use an ITS, the BKT model is trained on the
resulting data, and then the groups continue to use the ITS and
their learning performance is examined (note, however, that
asymmetric group sizes limit statistical power).
      </p>
      <p>Finally, an analytical model that can explain some of our
empirical results (such as the skewed normal distribution
of inferred parameter values, the improvements in
parameter inference near 0 and 1 parameter values, or the 1/√n
relationship between sample size and standard deviation)
would be a valuable contribution.</p>
    </sec>
    <sec id="sec-17">
      <title>8. ACKNOWLEDGMENTS</title>
      <p>
        We thank Zachary A. Pardos for his fastHMM C++ BKT
library [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], for providing helpful comments on this work, and
for designing the assignment which inspired it.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Beck</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.-M.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Identifiability: A fundamental problem of student modeling</article-title>
          .
          <source>In Proceedings of the 11th International Conference on User Modeling</source>
          ,
          <source>UM '07</source>
          , pages
          <fpage>137</fpage>
          –
          <lpage>146</lpage>
          , Berlin, Heidelberg,
          <year>2007</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Falakmasir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. A.</given-names>
            <surname>Pardos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Gordon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Brusilovsky</surname>
          </string-name>
          .
          <article-title>A spectral learning approach to knowledge tracing</article-title>
          .
          <source>In 6th International Conference on Educational Data Mining (EDM</source>
          <year>2013</year>
          )., pages
          <fpage>28</fpage>
          –
          <lpage>35</lpage>
          .
          <string-name>
            <surname>International Educational Data Mining Society</surname>
          </string-name>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B. B.</given-names>
            <surname>Nooraei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. A.</given-names>
            <surname>Pardos</surname>
          </string-name>
          , N. T. Heffernan, and R. S. J. de Baker.
          <article-title>Less is more: Improving the speed and prediction power of knowledge tracing by using less data</article-title>
          . In M. Pechenizkiy,
          <string-name>
            <given-names>T.</given-names>
            <surname>Calders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Conati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ventura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Romero</surname>
          </string-name>
          , and J. C. Stamper, editors,
          <source>EDM</source>
          , pages
          <fpage>101</fpage>
          –
          <lpage>110</lpage>
          . www.educationaldatamining.org,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z. A.</given-names>
            <surname>Pardos</surname>
          </string-name>
          and
          <string-name>
            <given-names>N. T.</given-names>
            <surname>Heffernan</surname>
          </string-name>
          .
          <article-title>Modeling individualization in a Bayesian networks implementation of knowledge tracing</article-title>
          . In P. D.
          <string-name>
            <surname>Bra</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kobsa</surname>
            , and
            <given-names>D. N</given-names>
          </string-name>
          . Chin, editors,
          <source>UMAP</source>
          , volume
          <volume>6075</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>255</fpage>
          –
          <lpage>266</lpage>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z. A.</given-names>
            <surname>Pardos</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          .
          <article-title>Scaling cognitive modeling to massive open environments (in preparation)</article-title>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ritter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nixon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dickison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Murray</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Towle</surname>
          </string-name>
          .
          <article-title>Reducing the knowledge tracing space</article-title>
          . In T. Barnes,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Desmarais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Romero</surname>
          </string-name>
          , and S. Ventura, editors,
          <source>EDM</source>
          , pages
          <fpage>151</fpage>
          –
          <lpage>160</lpage>
          . www.educationaldatamining.org,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          SciPy v0.13.0 Reference Guide: scipy.stats.normaltest. http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html, May
          <year>2013</year>
          . [Online; accessed 24-April-2014].
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Van de Sande</surname>
          </string-name>
          .
          <article-title>Applying three models of learning to individual student log data</article-title>
          .
          <source>In 6th International Conference on Educational Data Mining (EDM</source>
          <year>2013</year>
          )., pages
          <fpage>193</fpage>
          –
          <lpage>199</lpage>
          .
          <string-name>
            <surname>International Educational Data Mining Society</surname>
          </string-name>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Van de Sande</surname>
          </string-name>
          .
          <article-title>Properties of the bayesian knowledge tracing model</article-title>
          .
          <source>Journal of Educational Data Mining</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ):
          <fpage>1</fpage>
          –
          <lpage>10</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>