=Paper= {{Paper |id=Vol-1183/bkt20y_paper05 |storemode=property |title= Is this Data for Real? |pdfUrl=https://ceur-ws.org/Vol-1183/bkt20y_paper05.pdf |volume=Vol-1183 |dblpUrl=https://dblp.org/rec/conf/edm/Rosenberg-KimaP14 }} == Is this Data for Real?== https://ceur-ws.org/Vol-1183/bkt20y_paper05.pdf
Is this Data for Real?

Rinat B. Rosenberg-Kima, University of California, Berkeley, rosenbergkima@berkeley.edu
Zachary Pardos, University of California, Berkeley, pardos@berkeley.edu

ABSTRACT
Simulated data plays a central role in Educational Data Mining and in particular in Bayesian Knowledge Tracing (BKT) research. The initial motivation for this paper was to try to answer the question: given two datasets, could you tell which of them is real and which of them is simulated? The ability to answer this question may provide an additional indication of the goodness of the model: if it is easy to discern simulated data from real data, that could indicate that the model does not provide an authentic representation of reality, whereas if it is hard to tell the real and simulated data apart, that might indicate that the model is indeed authentic. In this paper we describe an initial analysis that was performed in an attempt to address this question. Additional findings that emerged during this exploration are discussed as well.

Keywords
Bayesian Knowledge Tracing (BKT), simulated data, parameter space.

1. INTRODUCTION
Simulated data has been playing an increasingly central role in Educational Data Mining [1] and Bayesian Knowledge Tracing (BKT) research [1, 4]. For example, simulated data was used to explore the convergence properties of BKT models [5], an important area of investigation given the identifiability issues of the model [3]. In this paper, we would like to approach simulated data from a slightly different angle. In particular, we claim that the question, “given two datasets could you tell which of them is real and which of them is simulated?”, is interesting as it can be used to evaluate the goodness of a model and may potentially serve as an alternative metric to RMSE, AUC, and others. We start approaching this problem in this paper by comparing simulated data to real data with Knowledge Tracing as the model.

Knowledge Tracing (KT) models are widely used by cognitive tutors to estimate the latent skills of students [6]. Knowledge tracing is a Bayesian model, which assumes that each skill has 4 parameters: two knowledge parameters, initial (prior) knowledge and learn rate, and two performance parameters, guess and slip. KT in its simplest form assumes a single point estimate for prior knowledge and learn rate for all students, and similarly identical guess and slip rates for all students. Simulated data has been used to estimate the parameter space and in particular to answer questions that relate to the goal of maximizing the log likelihood (LL) of the model given parameters and data, and improving prediction power [7], [8], [9].
In this paper we would like to use the KT model as a framework for comparing the characteristics of simulated data to real data, and in particular to see whether it is possible to distinguish between the real and sim datasets.

2. DATA SETS
To compare simulated data to real data we started with two real datasets generated from the ASSISTments software¹ (specifically, datasets G6.207-exact.txt with 776 students and G6.259-exact.txt with 212 students) from a previous BKT study [10]. Both datasets consist of 6 questions in linear order where all students answer all questions. Next, we generated synthetic, simulated data using the best fitting parameters that were found for the real data as the generating parameters. In this way we generated a simulated version of dataset G6.207 and a simulated version of dataset G6.259 that had the exact same number of questions and number of students, and were generated with what appear to be the best fitting parameters. The specific best fitting parameters that were found for each dataset and were used to generate the simulated data are presented in Table 1.

Table 1. Best fitting parameters for each dataset. These parameters were used to generate the simulated datasets.

Dataset   N     Prior   Learn   Guess   Slip
G6.207    776   .453    .068    .270    .156
G6.259    212   .701    .044    .243    .165

3. METHODOLOGY
We are interested in finding out whether it is possible to distinguish between the simulated data and the real data. The approach we took was to calculate LL over a grid covering the whole parameter space (prior, learn, guess, and slip). We hypothesized that the LL pattern of the simulated data and the real data would differ across the parameter space. For each of the matrices we conducted a grid search with intervals of .04 that generated 25 intervals for each parameter and 390,625 total combinations of prior, learn, guess, and slip. For each one of the combinations LL was calculated and placed in a four dimensional matrix. We used fastBKT [11] to (a) calculate the best fitting parameters of the real datasets, (b) generate simulated data, and (c) calculate the LL of the parameter space. Additional code in Matlab and R was written to put all the pieces together². In particular, we calculated the LL for all the combinations of two parameters where the other two parameters were fixed to the best fitting value. In an additional analysis, we let all parameters be free and took the average LL for all combinations of two parameters, collapsed over the space of the other two parameters not visualized. The motivation for this was to visualize the error space interactions in the four dimensions of the model.

¹ Data can be obtained here: http://people.csail.mit.edu/zp/
² Matlab and R code will be available here: http://myweb.fsu.edu/rr05/
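As an illustration of the model and procedure described in Sections 1 to 3, the following is a minimal Python sketch of the standard four-parameter BKT likelihood and of generating simulated responses from given parameters. It is not fastBKT [11], which the analysis actually used. The function names (`sequence_likelihood`, `log_likelihood`, `simulate`), the no-forgetting assumption, and the 0/1 list-of-lists data layout are assumptions made for illustration.

```python
import math
import random


def sequence_likelihood(responses, prior, learn, guess, slip):
    """Probability of one student's 0/1 response sequence under standard BKT.

    Assumes no forgetting and one learning opportunity after every question.
    """
    p_known = prior
    likelihood = 1.0
    for correct in responses:
        p_correct = p_known * (1.0 - slip) + (1.0 - p_known) * guess
        p_obs = p_correct if correct else 1.0 - p_correct
        if p_obs <= 0.0:
            return 0.0  # observation impossible under these parameters
        likelihood *= p_obs
        # Posterior probability that the skill was known, given the response.
        evidence = p_known * ((1.0 - slip) if correct else slip)
        posterior = evidence / p_obs
        # Learning transition before the next question.
        p_known = posterior + (1.0 - posterior) * learn
    return likelihood


def log_likelihood(dataset, prior, learn, guess, slip):
    """Sum of per-student log likelihoods (LL) for a list of response sequences."""
    total = 0.0
    for responses in dataset:
        p = sequence_likelihood(responses, prior, learn, guess, slip)
        if p <= 0.0:
            return float("-inf")
        total += math.log(p)
    return total


def simulate(n_students, n_questions, prior, learn, guess, slip, seed=0):
    """Draw simulated 0/1 responses from BKT with the given generating parameters."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_students):
        known = rng.random() < prior
        row = []
        for _ in range(n_questions):
            p_correct = (1.0 - slip) if known else guess
            row.append(1 if rng.random() < p_correct else 0)
            if not known and rng.random() < learn:
                known = True
        data.append(row)
    return data
```

For example, `simulate(776, 6, 0.453, 0.068, 0.270, 0.156)` would produce a dataset with the same shape as G6.207 using the generating parameters of Table 1, and `log_likelihood` could then be evaluated on it at any parameter combination.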
 Figure 1.a (left). Heat maps of LL of real assistment dataset G6-207 (k=776 students) and a corresponding simulated data that was
 generated with the best fitting parameters of the real dataset. The two parameters not in each figure were fixed to the best
 parameters. Blue areas indicate high LL, and red areas indicate lower LL. Circles indicate maximum LL of the given matrix, and
 triangles indicate the best fitting parameters to the real data (that were also used to generate the simulated data). In this case the
 triangles and circles fit the same point.
 Figure 1.b (right). Heat maps of delta LL between real dataset G6-207 and the corresponding simulated data that was generated
 with the best fitting parameters of the real dataset. The two parameters not in each figure were fixed to the best parameters. Blue
 areas indicate high difference between the real and sim LL, and red areas indicate lower difference. Circles indicate minimum
 absolute delta of the given matrix, and triangles indicate the best fitting parameters to the real data.
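Each panel of Figure 1.a is a two-parameter slice of the four dimensional LL matrix described in Section 3. A minimal Python sketch of building such a grid and extracting a slice is given below. The placement of the 25 grid values (cell midpoints 0.02, 0.06, ..., 0.98) is an assumption, since the text only states the .04 interval; `ll_fn` is a hypothetical stand-in for whatever routine returns the LL of a dataset at one parameter combination (the paper used fastBKT [11]); and the dictionary-of-index-tuples layout is an illustrative choice.

```python
from itertools import product

# Axis order of the 4-D grid: 0 = prior, 1 = learn, 2 = guess, 3 = slip.
PARAMS = ("prior", "learn", "guess", "slip")


def grid_values(step=0.04):
    """Assumed grid: midpoints of cells of width `step` on the interval (0, 1)."""
    n = round(1.0 / step)
    return [round(step * (i + 0.5), 10) for i in range(n)]


def ll_grid(ll_fn, values):
    """Evaluate ll_fn(prior, learn, guess, slip) at every grid point.

    Returns a dict keyed by index 4-tuples (i_prior, i_learn, i_guess, i_slip).
    """
    n = len(values)
    return {
        idx: ll_fn(*(values[i] for i in idx))
        for idx in product(range(n), repeat=4)
    }


def nearest_index(values, x):
    """Index of the grid value closest to x (e.g. a best fitting parameter)."""
    return min(range(len(values)), key=lambda i: abs(values[i] - x))


def fixed_slice(grid, n, row, col, fixed):
    """2-D matrix over axes `row` and `col`.

    `fixed` maps the two remaining axis numbers to the grid index they are held at.
    """
    matrix = []
    for r in range(n):
        line = []
        for c in range(n):
            idx = [0, 0, 0, 0]
            idx[row], idx[col] = r, c
            for axis, i in fixed.items():
                idx[axis] = i
            line.append(grid[tuple(idx)])
        matrix.append(line)
    return matrix
```

With 25 values per parameter the full grid has 25^4 = 390,625 entries, matching the count in Section 3. A (slip x prior) slice would fix learn and guess at the grid indices nearest their best fitting values, for example `fixed_slice(grid, 25, 3, 0, {1: nearest_index(values, 0.068), 2: nearest_index(values, 0.270)})`.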

4. DOES THE LL OF SIM vs. REAL DATA LOOK DIFFERENT?
Our initial thinking was that, as we are using a simple BKT model, it does not authentically reflect reality in all its detail, and therefore we would observe different patterns of LL across the parameter space between the real data and the simulated data. The LL space of simulated data in [5] was quite striking in its smooth surface, but the appearance of real data was left as an open research question.

4.1 Does the LL of sim vs. real data look different across two-parameter grids?
First, we calculated the LL over all the combinations of two parameters for dataset G6.207 where the other two parameters were fixed to the best fitting value. For example, when we calculated LL for the combination of slip and prior (top right panel in Figure 1.a), we fixed learn and guess to be .068 and .270 respectively. To our great surprise, when we plotted heat maps of the LL matrices of the real data and the simulated data (Figure 1.a; real data is presented in the upper triangle and simulated (sim) data is presented in the lower triangle) we received what appear to be identical matrices (for example, the upper right heat map is the (slip x prior) LL matrix of the real data, whereas the lowest left heat map is the (slip x prior) LL matrix of the sim data).
The extent of the similarity between the matrices was surprising, and in order to get a better picture of the differences between them we plotted heat maps of the deltas between the real data and the simulated data (LL_RealData - LL_SimData) for each matrix. Even though the matrices appear to be identical, as can be seen in Figure 1.b there is in fact a difference between the LL of the matrices, although it is not a big difference compared to the values of LL. Another surprising finding was that the LL of the real data was in many cases higher than the LL of the sim data. We expected that the model would better explain the sim data, as it should not contain the additional noise expected in reality, and therefore that the LL of the sim data should be higher, yet the findings were not consistent with this expectation.
Another interesting finding was that the location of the ground truth (the triangle) in most of the cases resulted in a smaller delta between the real and the sim data, although not in all cases (e.g., guess x slip). Note that the circles in Figure 1.b indicate the minimum absolute difference in LL between the real and the sim data, and this point is usually not located at the exact ground truth (except for learn x guess).
Another interesting finding can be seen in Figure 1.a, slip vs. guess. Much attention has been given to this LL space, which revealed the apparent co-linearity of BKT with two primary areas of convergence, the upper right area being a false, or “implausible”, converging area as defined by [3]. What is interesting in this figure is that despite what appear to be two global maxima, the point with the best LL in this dataset is in fact in the lower region for both sim and real data.
Next we conducted the same analysis with the second dataset.
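The delta heat maps and the circle markers described above amount to an element-wise subtraction followed by a search for the entry with the smallest absolute value. A minimal Python sketch, assuming each LL matrix is stored as a list of rows (a layout chosen here for illustration; the helper names are hypothetical):

```python
def delta_matrix(real, sim):
    """Element-wise LL_RealData - LL_SimData for two equally sized 2-D matrices."""
    return [
        [r - s for r, s in zip(real_row, sim_row)]
        for real_row, sim_row in zip(real, sim)
    ]


def argmin_abs(matrix):
    """(row, col) of the entry with the smallest absolute value."""
    cells = (
        (abs(value), r, c)
        for r, row in enumerate(matrix)
        for c, value in enumerate(row)
    )
    _, r, c = min(cells)
    return r, c


# Toy numbers only, not values from the paper.
toy_real = [[-10.0, -12.0], [-11.0, -15.0]]
toy_sim = [[-13.0, -11.0], [-11.5, -20.0]]
toy_delta = delta_matrix(toy_real, toy_sim)
```

A positive entry in the delta matrix means the real data had the higher LL at that grid point; `argmin_abs(toy_delta)` gives the position that would be marked with a circle in a delta heat map.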
Figure 2.a (left) Heat maps of delta LL between real dataset G6-259 (k=212 students) and the corresponding simulated data that
was generated with the best fitting parameters of the real dataset. The two parameters not in each figure were fixed to the best
parameters. Blue areas indicate high difference between the real and sim LL, and red areas indicate lower difference. Circles
indicate maximum LL of the given matrix, and triangles indicate the best fitting parameters to the real data.
Figure 2.b (right). Heat maps of delta LL between real assistment dataset G6-259 and the corresponding simulated data that was
generated with the best fitting parameters of the real dataset. The average is across the two parameters not in each figure. Blue
areas indicate high difference between the real and sim LL, and red areas indicate lower difference. Circles indicate minimum
absolute delta of the given matrix, and triangles indicate the best fitting parameters to the real data.



Even though the G6-259 dataset was significantly smaller than the first dataset, we received very similar results, with surprisingly similar heat maps for the sim and real data (see Figure 2.a). As in the first dataset, notice that even though the LL heat maps look very similar, there is a difference in the delta heat maps (see Figure 2.b). Nevertheless, there is an interesting difference between the two datasets. Concretely, unlike the bigger dataset (G6-207), in G6-259 the LL of the sim data was actually higher than that of the real data in most cases.

4.2 What if we average LL over 2 parameters across all the combinations of the other 2 parameters?
We were interested in finding out what the heat maps would look like if we do not fix the other two parameters to their best fit, but rather average the LL across the entire space of the other two parameters. For example, to calculate the matrix of guess and slip we calculated a matrix of guess and slip LL for each combination of learn and prior (25 x 25 = 625 matrices) instead of only one matrix for the best fit learn and prior. Then, we took the average of all these matrices for each combination of guess and slip (see Figure 3.a). The results are both surprising and interesting. As for (guess x slip), we no longer receive the two maxima (global and local) that we received when learn and prior were fixed to the best fit parameters. Another interesting finding is the relationship between the average maximum across the other two parameters and the overall best fit parameters for the given two parameters. For example, if we look at the heat map of matrix (learn x prior) we can see that there is not a big difference between the average maximum point (white circle) and the overall best fit parameters (white triangle). This may indicate that changing guess and slip will not affect the values of learn and prior that maximize the LL, and therefore might suggest independence. If we look at (guess x learn), we see that changes in prior and slip will again not have an impact on the best fit value of guess; however, they will affect the value of learn. Then again, if we look at the heat map of (prior x guess), we see that both prior and guess are sensitive to changes in learn and slip. Yet again, the extremely surprising part of these results is that the sim data appear to be almost identical to the real data. It is possible to see from Figure 3.b, though, that there are indeed differences between the simulated data and the real data and, like before, the LL of the real data is higher than that of the sim data in the larger dataset.
As for the fixed matrices, we received similar LL matrices for the smaller dataset (G6-259) (see Figure 4.a). In addition, as before, the LL of the sim data for this dataset was higher than that of the real data (the opposite direction of the larger dataset G6-207). Another interesting finding for this dataset can be seen in the (guess x slip) matrices (Figure 4.b). Notice that while the sim data converged to the lower point of the blue area, the real data converged to the higher point. Nevertheless, this only happened in the averaged matrices and not in the fixed ones.
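The averaging described in this section collapses the four dimensional LL matrix over the two parameters that are not plotted. A minimal Python sketch follows, assuming the grid is stored as a dictionary keyed by index 4-tuples in the order (prior, learn, guess, slip); this layout and the helper names are illustrative assumptions, not the authors' Matlab and R code.

```python
def averaged_matrix(grid, n, row, col):
    """Average a 4-D grid over the two axes other than `row` and `col`.

    `grid` maps index 4-tuples to LL values and `n` is the number of grid
    values per parameter, so each output cell is the mean of n * n entries.
    """
    sums = [[0.0] * n for _ in range(n)]
    for idx, value in grid.items():
        sums[idx[row]][idx[col]] += value
    cells_per_entry = n * n
    return [[s / cells_per_entry for s in line] for line in sums]


def argmax(matrix):
    """(row, col) of the largest entry, i.e. the circle marker in an LL heat map."""
    best = max(
        (value, r, c)
        for r, line in enumerate(matrix)
        for c, value in enumerate(line)
    )
    return best[1], best[2]


# Toy 2 x 2 x 2 x 2 grid with made-up values, only to exercise the functions.
toy_grid = {
    (p, l, g, s): p + 10 * l + 100 * g + 1000 * s
    for p in range(2) for l in range(2) for g in range(2) for s in range(2)
}
toy_avg = averaged_matrix(toy_grid, 2, 2, 3)  # rows = guess, columns = slip
```

With 25 values per parameter, each cell of the averaged (guess x slip) matrix is the mean over the 25 x 25 = 625 combinations of learn and prior mentioned above. Comparing `argmax` of the averaged matrix with the grid position of the best fitting parameters corresponds to comparing the circle and the triangle in Figure 3.a.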
Figure 3.a (left). Heat maps of average LL of real assistment dataset G6-207 (k=776 students) and a corresponding simulated data
that was generated with the best fitting parameters of the real dataset. The average is across the two parameters not in each figure.
Blue areas indicate high LL, and red areas indicate lower LL. Circles indicate maximum LL of the given matrix, and triangles
indicate the best fitting parameters to the real data (that were also used to generate the simulated data).
Figure 3.b (right). Heat maps of delta LL between real assistment dataset G6-207 and the corresponding simulated data that was
generated with the best fitting parameters of the real dataset. The average is across the two parameters not in each figure. Blue
areas indicate high difference between the real and sim LL, and red areas indicate lower difference. Circles indicate minimum
absolute delta of the given matrix, and triangles indicate the best fitting parameters to the real data.




Figure 4.a (left). Heat maps of average LL of real assistment dataset G6-259 (k=212 students) and a corresponding simulated data
that was generated with the best fitting parameters of the real dataset. The average is across the two parameters not in each figure.
Blue areas indicate high LL, and red areas indicate lower LL. Circles indicate maximum LL of the given matrix, and triangles
indicate the best fitting parameters to the real data (that were also used to generate the simulated data).
Figure 4.b (right). Heat maps of delta LL between real assistment dataset G6-259 and the corresponding simulated data that was
generated with the best fitting parameters of the real dataset. The average is across the two parameters not in each figure.
5. DISCUSSION AND FUTURE WORK
The initial motivation of this paper was to find out whether it is possible to discern real data from sim data. If for a given model it is possible to tell sim data apart from real data, then the authenticity of the model can be questioned. This line of thinking is particularly typical of the use of simulation in science contexts, where different models are used to generate simulated data, and if the simulated data has a good fit to the real phenomenon at hand, then it may be possible to claim that the model provides an authentic explanation of the system [12]. We believe that it may be possible to derive a new metric for evaluating the goodness of a model by comparing simulated data from this model to real data.

In this work we explored similarities between simulated and real data. Nevertheless, we are yet to answer the question “is this data for real?”. In other words, what we still did not do in this work is come up with an algorithm that can take a dataset and determine whether it is real or simulated. Another way to think of it is to come up with an algorithm that can tell us whether it is possible to discern real and simulated data, and to use it as an indication of the goodness of the model. We found differences between the real and sim data, but are they strong enough to be noticed by such an algorithm in a consistent way? In future work we plan to further investigate this question by creating a training set of multiple real datasets and sim datasets and using machine learning techniques to extract from this training set a learning algorithm that can take a dataset as input and determine whether it is real or sim. We argue that if such an algorithm can be found, it is an indication that the underlying model can be improved. In future work we also plan to compare different variations of the KT model and contrast their resulting simulated data with real data. In particular, we plan to generate a more complex set of simulated data that is based on a more complex model (e.g., a different learning rate for different types of questions), and then use it as “real” data with the (wrong) assumption that the model is simple (the standard BKT model). This simulates a scenario where the real data is indeed grounded in a more complex model than our assumptions, and lets us see what results a learning algorithm yields when it compares this “real” data to sim data.
In addition, this paper raises interesting questions that we did not think of while trying to answer our initial question. For example, it seems that there is potential to dive deeper into the average LL (Figures 3 and 4) and find out more about the relationships and dependencies between the different parameters. Another question that emerged is how it could be that the simulated data had lower LL than the real data in the bigger dataset yet higher LL in the smaller dataset. Further analysis is needed to answer these questions.
Last but not least, given the remarkable resemblance between the sim data and the real data, these initial findings provide an indication that the BKT model is a model with a very strong hold in reality.

6. REFERENCES
[1] R. S. Baker and K. Yacef, “The state of educational data mining in 2009: A review and future visions,” J. Educ. Data Min., vol. 1, no. 1, pp. 3–17, 2009.
[2] M. C. Desmarais and I. Pelczer, “On the Faithfulness of Simulated Student Performance Data,” in EDM, 2010, pp. 21–30.
[3] J. E. Beck and K. Chang, “Identifiability: A fundamental problem of student modeling,” in User Modeling 2007, Springer, 2007, pp. 137–146.
[4] Z. A. Pardos and M. V. Yudelson, “Towards Moment of Learning Accuracy,” in AIED 2013 Workshops Proceedings Volume 4, 2013, p. 3.
[5] Z. A. Pardos and N. T. Heffernan, “Navigating the parameter space of Bayesian Knowledge Tracing models: Visualizations of the convergence of the Expectation Maximization algorithm,” in EDM, 2010, pp. 161–170.
[6] A. T. Corbett and J. R. Anderson, “Knowledge tracing: Modeling the acquisition of procedural knowledge,” User Model. User-Adapt. Interact., vol. 4, no. 4, pp. 253–278, 1994.
[7] S. Ritter, T. K. Harris, T. Nixon, D. Dickison, R. C. Murray, and B. Towle, “Reducing the Knowledge Tracing Space,” Int. Work. Group Educ. Data Min., 2009.
[8] R. S. d Baker, A. T. Corbett, S. M. Gowda, A. Z. Wagner, B. A. MacLaren, L. R. Kauffman, A. P. Mitchell, and S. Giguere, “Contextual slip and prediction of student performance after use of an intelligent tutor,” in User Modeling, Adaptation, and Personalization, Springer, 2010, pp. 52–63.
[9] R. S. Baker, A. T. Corbett, and V. Aleven, “More accurate student modeling through contextual estimation of slip and guess probabilities in bayesian knowledge tracing,” in Intelligent Tutoring Systems, 2008, pp. 406–415.
[10] Z. A. Pardos and N. T. Heffernan, “Modeling individualization in a bayesian networks implementation of knowledge tracing,” in User Modeling, Adaptation, and Personalization, Springer, 2010, pp. 255–266.
[11] Z. A. Pardos and M. J. Johnson, “Scaling Cognitive Modeling to Massive Open Environments (in preparation),” TOCHI Spec. Issue Learn. Scale.
[12] U. Wilensky, “GasLab—an Extensible Modeling Toolkit for Connecting Micro-and Macro-properties of Gases,” in Modeling and simulation in science and mathematics education, Springer, 1999, pp. 151–178.