=Paper=
{{Paper
|id=Vol-1183/bkt20y_paper05
|storemode=property
|title= Is this Data for Real?
|pdfUrl=https://ceur-ws.org/Vol-1183/bkt20y_paper05.pdf
|volume=Vol-1183
|dblpUrl=https://dblp.org/rec/conf/edm/Rosenberg-KimaP14
}}
== Is this Data for Real?==
Rinat B. Rosenberg-Kima (University of California, Berkeley, rosenbergkima@berkeley.edu) and Zachary Pardos (University of California, Berkeley, pardos@berkeley.edu)

=== ABSTRACT ===
Simulated data plays a central role in Educational Data Mining and in particular in Bayesian Knowledge Tracing (BKT) research. The initial motivation for this paper was to try to answer the question: given two datasets, could you tell which of them is real and which of them is simulated? The ability to answer this question may provide an additional indication of the goodness of the model. If it is easy to discern simulated data from real data, that could indicate that the model does not provide an authentic representation of reality, whereas if it is hard to tell the real and simulated data apart, that might indicate that the model is indeed authentic. In this paper we describe an initial analysis that was performed in an attempt to address this question. Additional findings that emerged during this exploration are discussed as well.

=== Keywords ===
Bayesian Knowledge Tracing (BKT), simulated data, parameter space.

=== 1. INTRODUCTION ===
Simulated data has been increasingly playing a central role in Educational Data Mining [1] and Bayesian Knowledge Tracing (BKT) research [1, 4]. For example, simulated data was used to explore the convergence properties of BKT models [5], an important area of investigation given the identifiability issues of the model [3]. In this paper, we would like to approach simulated data from a slightly different angle. In particular, we claim that the question, “given two datasets, could you tell which of them is real and which of them is simulated?”, is interesting as it can be used to evaluate the goodness of a model and may potentially serve as an alternative metric to RMSE, AUC, and others. We would like to start approaching this problem in this paper by comparing simulated data to real data with Knowledge Tracing as the model.

Knowledge Tracing (KT) models are widely used by cognitive tutors to estimate the latent skills of students [6]. Knowledge tracing is a Bayesian model, which assumes that each skill has 4 parameters: two knowledge parameters, the initial (prior) knowledge and the learn rate, and two performance parameters, guess and slip. KT in its simplest form assumes a single point estimate for prior knowledge and learn rate for all students, and similarly identical guess and slip rates for all students. Simulated data has been used to estimate the parameter space and in particular to answer questions that relate to the goal of maximizing the log likelihood (LL) of the model given parameters and data, and improving prediction power [7], [8], [9].

In this paper we would like to use the KT model as a framework for comparing the characteristics of simulated data to real data, and in particular to see whether it is possible to distinguish between the real and sim datasets.

=== 2. DATA SETS ===
To compare simulated data to real data we started with 2 real datasets generated from the ASSISTments software<sup>1</sup> (specifically, datasets G6.207-exact.txt with 776 students and G6.259-exact.txt with 212 students) from a previous BKT study [10]. Both datasets consist of 6 questions in linear order where all students answer all questions. Next, we generated synthetic, simulated data using the best fitting parameters that were found for the real data as the generating parameters. In this way we generated a simulated version of dataset G6.207 and a simulated version of dataset G6.259 that had the exact same number of questions and number of students as the real datasets, and that were generated with what appear to be the best fitting parameters. The specific best fitting parameters that were found for each dataset and were used to generate the simulated data are presented in Table 1.

{| class="wikitable"
|+ Table 1. Best fitting parameters for each dataset. These parameters were used to generate the simulated datasets.
! Dataset !! N !! Prior !! Learn !! Guess !! Slip
|-
| G6.207 || 776 || .453 || .068 || .270 || .156
|-
| G6.259 || 212 || .701 || .044 || .243 || .165
|}

=== 3. METHODOLOGY ===
We are interested in finding out whether it is possible to distinguish between the simulated data and the real data. The approach we took was to calculate LL over a grid covering the whole parameter space (prior, learn, guess, and slip). We hypothesized that the LL pattern of the simulated data and of the real data would be different across the parameter space. For each of the matrices we conducted a grid search with intervals of .04, which generated 25 intervals for each parameter and 390,625 total combinations of prior, learn, guess, and slip. For each one of the combinations LL was calculated and placed in a four dimensional matrix. We used fastBKT [11] to (a) calculate the best fitting parameters of the real datasets, (b) generate simulated data, and (c) calculate the LL of the parameter space. Additional code in Matlab and R was written to put all the pieces together<sup>2</sup>. In particular, we calculated the LL for all the combinations of two parameters where the other two parameters were fixed to the best fitting value. In an additional analysis, we let all parameters be free and took the average LL for all combinations of two parameters, collapsed over the space of the other two parameters not visualized. The motivation for this was to visualize the error space interactions in the four dimensions of the model.

<sup>1</sup> Data can be obtained here: http://people.csail.mit.edu/zp/

<sup>2</sup> Matlab and R code will be available here: http://myweb.fsu.edu/rr05/

Figure 1.a (left). Heat maps of LL of real ASSISTments dataset G6-207 (k=776 students) and a corresponding simulated dataset that was generated with the best fitting parameters of the real dataset. The two parameters not in each figure were fixed to the best parameters.
Blue areas indicate high LL, and red areas indicate lower LL. Circles indicate maximum LL of the given matrix, and triangles indicate the best fitting parameters to the real data (that were also used to generate the simulated data). In this case the triangles and circles fit the same point.

Figure 1.b (right). Heat maps of delta LL between real dataset G6-207 and the corresponding simulated data that was generated with the best fitting parameters of the real dataset. The two parameters not in each figure were fixed to the best parameters. Blue areas indicate high difference between the real and sim LL, and red areas indicate lower difference. Circles indicate minimum absolute delta of the given matrix, and triangles indicate the best fitting parameters to the real data.

=== 4. DOES THE LL OF SIM vs. REAL DATA LOOK DIFFERENT? ===
Our initial thinking was that, as we are using a simple BKT model, it is not authentically reflecting reality in all its detail and therefore we would observe different patterns of LL across the parameter space between the real data and the simulated data. The LL space of simulated data in [5] was quite striking in its smooth surface, but the appearance of real data was left as an open research question.

==== 4.1 Does the LL of sim vs. real data look different across two-parameter grids? ====
First, we calculated the LL over all the combinations of two parameters for dataset G6.207 where the other two parameters were fixed to the best fitting value. For example, when we calculated LL for the combination of slip and prior (top right figure in Figure 1.a), we fixed learn and guess to be .068 and .270 respectively. To our great surprise, when we plotted heat maps of the LL matrices of the real data and the simulated data (Figure 1.a; real data is presented in the upper triangle and simulated (sim) data is presented in the lower triangle) we received what appear to be identical matrices (for example, the upper right heat map is the (slip x prior) LL matrix of the real data, whereas the lowest left heat map is the (slip x prior) LL matrix of the sim data).

The extent of the similarity between the matrices was surprising, and in order to get a better picture of the differences between them we plotted heat maps of the deltas between the real data and the simulated data (LL_RealData - LL_SimData) for each matrix. Even though the matrices appear to be identical, as can be seen in Figure 1.b there is in fact a difference between the LL of the matrices, although it is not a big difference compared to the values of LL. Another surprising finding was that the LL of the real data was in many cases higher than the LL of the sim data. We expected that the model would better explain the sim data, as it should not contain the additional noise expected in reality, and therefore that the LL of the sim data would be higher, yet the findings were not consistent with this expectation.

Another interesting finding was that the location of the ground truth (the triangle) in most of the cases resulted in a smaller delta between the real and the sim data, although not in all cases (e.g., guess x slip). Note that the circles in Figure 1.b indicate the minimum absolute difference in LL between the real and the sim data, and this point is usually not located at the exact ground truth (except for learn x guess).

Another interesting finding can be seen in Figure 1.a, slip vs. guess. Much attention has been given to this LL space, which revealed the apparent co-linearity of BKT with two primary areas of convergence, the upper right area being a false, or “implausible”, converging area as defined by [3]. What is interesting in this figure is that despite what appear to be two global maxima, the point with the best LL in this dataset is in fact in the lower region for both sim and real data.

Next we conducted the same analysis with the second dataset.
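The grid computation described in Section 3 and the two-parameter comparisons discussed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the fastBKT, Matlab, or R code used for the paper. The function names are hypothetical, the "real" array is a synthetic placeholder (the ASSISTments files are not loaded here), the best fit is taken as the grid maximum rather than a fastBKT fit, and a coarse 5-value grid stands in for the 25 values per parameter so the example runs quickly.

```python
import numpy as np

def simulate_bkt(n_students, n_questions, prior, learn, guess, slip, rng):
    """Sample 0/1 responses from a standard BKT model (no forgetting)."""
    known = rng.random(n_students) < prior
    data = np.empty((n_students, n_questions), dtype=int)
    for t in range(n_questions):
        p_correct = np.where(known, 1.0 - slip, guess)
        data[:, t] = rng.random(n_students) < p_correct
        # Students who do not yet know the skill learn it with probability `learn`.
        known = known | (rng.random(n_students) < learn)
    return data

def bkt_log_likelihood(data, prior, learn, guess, slip):
    """Log likelihood of all response sequences (forward recursion, summed over students)."""
    p_know = np.full(data.shape[0], prior, dtype=float)
    ll = 0.0
    for t in range(data.shape[1]):
        correct = data[:, t] == 1
        p_correct = p_know * (1.0 - slip) + (1.0 - p_know) * guess
        ll += np.log(np.where(correct, p_correct, 1.0 - p_correct)).sum()
        # Posterior probability of knowing the skill given the observed response.
        posterior = np.where(correct,
                             p_know * (1.0 - slip) / p_correct,
                             p_know * slip / (1.0 - p_correct))
        p_know = posterior + (1.0 - posterior) * learn
    return ll

def ll_grid(data, values):
    """Four dimensional LL matrix with axes (prior, learn, guess, slip)."""
    n = len(values)
    grid = np.empty((n, n, n, n))
    for idx in np.ndindex(n, n, n, n):
        grid[idx] = bkt_log_likelihood(data, *values[list(idx)])
    return grid

def pair_maps(grid, best_idx, a, b):
    """2-D LL maps over axes a < b: others fixed at best fit, and others averaged out."""
    others = tuple(ax for ax in range(4) if ax not in (a, b))
    index = [slice(None)] * 4
    for ax in others:
        index[ax] = best_idx[ax]
    return grid[tuple(index)], grid.mean(axis=others)

rng = np.random.default_rng(0)
# Coarse demo grid (assumption). A 25-value grid with spacing .04 could be, for
# example, np.arange(25) * 0.04 + 0.02; the exact grid points are not given in the text.
values = np.linspace(0.1, 0.9, 5)

# Placeholder for a real dataset: 776 students, 6 questions (Table 1, G6.207 values).
real = simulate_bkt(776, 6, 0.453, 0.068, 0.270, 0.156, rng)
grid_real = ll_grid(real, values)

# Best fit taken as the grid maximum (assumption), then used to generate the sim data.
best_idx = np.unravel_index(np.argmax(grid_real), grid_real.shape)
best_params = values[list(best_idx)]
sim = simulate_bkt(*real.shape, *best_params, rng)
grid_sim = ll_grid(sim, values)

PRIOR, LEARN, GUESS, SLIP = range(4)
fixed_real, avg_real = pair_maps(grid_real, best_idx, GUESS, SLIP)
fixed_sim, avg_sim = pair_maps(grid_sim, best_idx, GUESS, SLIP)
delta_fixed = fixed_real - fixed_sim   # LL_RealData - LL_SimData, other two fixed
delta_avg = avg_real - avg_sim         # same delta, other two averaged out
print("best grid parameters:", best_params)
print("largest absolute delta (fixed):", np.abs(delta_fixed).max())
```

Each pair of parameters yields one such 2-D map, so looping `pair_maps` over all six axis pairs would give the matrices that the heat maps visualize. Keeping the full four dimensional matrix in memory means both the fixed and the averaged views are simple slices and means of the same array.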
Figure 2.a (left). Heat maps of delta LL between real dataset G6-259 (k=212 students) and the corresponding simulated data that was generated with the best fitting parameters of the real dataset. The two parameters not in each figure were fixed to the best parameters. Blue areas indicate high difference between the real and sim LL, and red areas indicate lower difference. Circles indicate maximum LL of the given matrix, and triangles indicate the best fitting parameters to the real data.

Figure 2.b (right). Heat maps of delta LL between real ASSISTments dataset G6-259 and the corresponding simulated data that was generated with the best fitting parameters of the real dataset. The average is across the two parameters not in each figure. Blue areas indicate high difference between the real and sim LL, and red areas indicate lower difference. Circles indicate minimum absolute delta of the given matrix, and triangles indicate the best fitting parameters to the real data.

Even though the G6-259 dataset was significantly smaller than the first dataset, we received very similar results, with surprisingly similar heat maps for the sim and real data (see Figure 2.a). As in the first dataset, notice that even though the LL heat maps look very similar, there is a difference in the delta heat maps (see Figure 2.b). Nevertheless, there is an interesting difference between the two datasets. Concretely, unlike the bigger dataset (G6-207), in G6-259 the LL of the sim data was actually higher than that of the real data in most cases.

==== 4.2 What if we average LL over 2 parameters across all the combinations of the other 2 parameters? ====
We were interested in finding out what the heat maps would look like if we do not fix the other two parameters to their best fit values, but rather average the LL across the entire space of the other two parameters. For example, to calculate the matrix of guess and slip we calculated a matrix of guess and slip LL for each combination of learn and prior (25 x 25 = 625 matrices) instead of only one matrix for the best fit learn and prior. Then, we took the average of all these matrices for each combination of guess and slip (see Figure 3.a). The results are both surprising and interesting. For (guess x slip), we no longer receive the two maxima (global and local) that we received when learn and prior were fixed to the best fit parameters. Another interesting finding is the relationship between the average maximum across the other two parameters and the overall best fit parameters for the given two parameters. For example, if we look at the heat map of matrix (learn x prior) we can see that there is not a big difference between the average maximum point (white circle) and the overall best fit parameters (white triangle). This may indicate that changing guess and slip will not affect the values of learn and prior that maximize the LL, which might suggest independence. If we look at (guess x learn), we see that changes in prior and slip will again not have an impact on the best fit value of guess; however, they will affect the value of learn. Then again, if we look at the heat map of (prior x guess), we see that both prior and guess are sensitive to changes in learn and slip. Yet again, the extremely surprising part of these results is that the sim data appear to be almost identical to the real data. It is possible to see from Figure 3.b, though, that there are indeed differences between the simulated data and the real data and that, as before, the LL of the real data is higher than that of the sim data in the larger dataset.

As for the fixed matrices, we received similar LL matrices for the smaller dataset (G6-259) (see Figure 4.a). In addition, as before, the LL of the sim data for this dataset was higher than that of the real data (the opposite direction to the larger dataset G6-207). Another interesting finding for this dataset can be seen in the (guess x slip) matrices (Figure 4.b). Notice that while the sim data converged to the lower point of the blue area, the real data converged to the higher point. Nevertheless, this only happened in the averaged matrices and not in the fixed ones.

Figure 3.a (left). Heat maps of average LL of real ASSISTments dataset G6-207 (k=776 students) and a corresponding simulated dataset that was generated with the best fitting parameters of the real dataset. The average is across the two parameters not in each figure. Blue areas indicate high LL, and red areas indicate lower LL. Circles indicate maximum LL of the given matrix, and triangles indicate the best fitting parameters to the real data (that were also used to generate the simulated data).

Figure 3.b (right). Heat maps of delta LL between real ASSISTments dataset G6-207 and the corresponding simulated data that was generated with the best fitting parameters of the real dataset. The average is across the two parameters not in each figure. Blue areas indicate high difference between the real and sim LL, and red areas indicate lower difference. Circles indicate minimum absolute delta of the given matrix, and triangles indicate the best fitting parameters to the real data.

Figure 4.a (left). Heat maps of average LL of real ASSISTments dataset G6-259 (k=212 students) and a corresponding simulated dataset that was generated with the best fitting parameters of the real dataset. The average is across the two parameters not in each figure. Blue areas indicate high LL, and red areas indicate lower LL. Circles indicate maximum LL of the given matrix, and triangles indicate the best fitting parameters to the real data (that were also used to generate the simulated data).

Figure 4.b (right).
Heat maps of delta LL between real ASSISTments dataset G6-259 and the corresponding simulated data that was generated with the best fitting parameters of the real dataset. The average is across the two parameters not in each figure.

=== 5. DISCUSSION AND FUTURE WORK ===
The initial motivation of this paper was to find out whether it is possible to discern a real dataset from a sim dataset. If for a given model it is possible to tell a sim dataset apart from a real dataset, then the authenticity of the model can be questioned. This line of thinking is typical of the use of simulation in science, where different models are used to generate simulated data, and if the simulated data has a good fit to the real phenomena at hand, then it may be possible to claim that the model provides an authentic explanation of the system [12]. We believe that it may be possible to generate a new metric for evaluating the goodness of a model by comparing simulated data from this model to real data.

In this work we explored similarities between simulated and real data. Nevertheless, we are yet to answer the question “is this data for real?”. In other words, what we still did not do in this work is come up with an algorithm that can take a dataset and determine whether it is real or simulated. Another way to think of it is to come up with an algorithm that can tell us whether it is possible to discern real and simulated data and use it as an indication of the goodness of the model. We found differences between the real and sim data, but are they strong enough to be noticed by such an algorithm in a consistent way? In future work we plan to further investigate this question by creating a training set of multiple real datasets and sim datasets and using machine learning techniques to extract from this training set a learning algorithm that can take a dataset as input and determine whether it is real or sim. We argue that if such an algorithm can be found, it is an indication that the underlying model can be improved. In future work we also plan to compare different variations of the KT model and contrast their resulting simulated data with real data. In particular we plan to generate a more complex set of simulated data that is based on a more complex model (e.g., different learning rates for different types of questions), and then use it as “real” data with the (wrong) assumption that the model is simple (standard BKT model). This simulates a scenario where the real data is grounded in a more complex model than our assumptions, and lets us see what results a learning algorithm that compares this “real” data to sim data will yield.

In addition, this paper raises interesting questions that we did not think of while trying to answer our initial question. For example, it seems like there is potential to dive deeper into the average LL (Figures 3 and 4) and find out more about the relationships and dependencies between the different parameters. Another question that emerged is how could it be that the simulated data had lower LL than the real data in the bigger dataset yet higher LL in the smaller dataset? Further analysis is needed to answer these questions. Last but not least, given the remarkable resemblance between the sim data and the real data, these initial findings provide an indication that the BKT model is a model with a very strong hold in reality.

=== 6. REFERENCES ===
[1] R. S. Baker and K. Yacef, “The state of educational data mining in 2009: A review and future visions,” J. Educ. Data Min., vol. 1, no. 1, pp. 3–17, 2009.

[2] M. C. Desmarais and I. Pelczer, “On the Faithfulness of Simulated Student Performance Data,” in EDM, 2010, pp. 21–30.

[3] J. E. Beck and K. Chang, “Identifiability: A fundamental problem of student modeling,” in User Modeling 2007, Springer, 2007, pp. 137–146.

[4] Z. A. Pardos and M. V. Yudelson, “Towards Moment of Learning Accuracy,” in AIED 2013 Workshops Proceedings Volume 4, 2013, p. 3.

[5] Z. A. Pardos and N. T. Heffernan, “Navigating the parameter space of Bayesian Knowledge Tracing models: Visualizations of the convergence of the Expectation Maximization algorithm,” in EDM, 2010, pp. 161–170.

[6] A. T. Corbett and J. R. Anderson, “Knowledge tracing: Modeling the acquisition of procedural knowledge,” User Model. User-Adapt. Interact., vol. 4, no. 4, pp. 253–278, 1994.

[7] S. Ritter, T. K. Harris, T. Nixon, D. Dickison, R. C. Murray, and B. Towle, “Reducing the Knowledge Tracing Space,” Int. Work. Group Educ. Data Min., 2009.

[8] R. S. d Baker, A. T. Corbett, S. M. Gowda, A. Z. Wagner, B. A. MacLaren, L. R. Kauffman, A. P. Mitchell, and S. Giguere, “Contextual slip and prediction of student performance after use of an intelligent tutor,” in User Modeling, Adaptation, and Personalization, Springer, 2010, pp. 52–63.

[9] R. S. Baker, A. T. Corbett, and V. Aleven, “More accurate student modeling through contextual estimation of slip and guess probabilities in bayesian knowledge tracing,” in Intelligent Tutoring Systems, 2008, pp. 406–415.

[10] Z. A. Pardos and N. T. Heffernan, “Modeling individualization in a bayesian networks implementation of knowledge tracing,” in User Modeling, Adaptation, and Personalization, Springer, 2010, pp. 255–266.

[11] Z. A. Pardos and M. J. Johnson, “Scaling Cognitive Modeling to Massive Open Environments (in preparation),” TOCHI Spec. Issue Learn. Scale.

[12] U. Wilensky, “GasLab—an Extensible Modeling Toolkit for Connecting Micro-and Macro-properties of Gases,” in Modeling and simulation in science and mathematics education, Springer, 1999, pp. 151–178.