=Paper=
{{Paper
|id=Vol-1183/bkt20y_paper02
|storemode=property
|title= A Unified 5-Dimensional Framework for Student Models
|pdfUrl=https://ceur-ws.org/Vol-1183/bkt20y_paper02.pdf
|volume=Vol-1183
|dblpUrl=https://dblp.org/rec/conf/edm/XuM14
}}
== A Unified 5-Dimensional Framework for Student Models==
Yanbo Xu and Jack Mostow, Carnegie Mellon University, Project LISTEN, RI-NSH 4103, 5000 Forbes Ave, Pittsburgh, PA 15213, {yanbox, mostow}@cs.cmu.edu

=== Abstract ===
This paper defines 5 key dimensions of student models: whether and how they model time, skill, noise, latent traits, and multiple influences on student performance. We use this framework to characterize and compare previous student models, analyze their relative accuracy, and propose novel models suggested by gaps in the multi-dimensional space. To illustrate the generative power of this framework, we derive one such model, called HOT-DINA (Higher Order Temporal, Deterministic Input, Noisy-And), and evaluate it on synthetic and real data. We show that it predicts student performance better than previous methods, and analyze when and why.

=== Keywords ===
Knowledge tracing, Item Response Theory, temporal models, higher order latent trait models, multiple subskills, DINA.

=== 1. Introduction ===
Morphological analysis [1] is a general method for exploring a space of possible designs by identifying key attributes, specifying possible values for each attribute, and considering different combinations of choices for the attributes. Structuring the space in this manner compares different designs in terms of which attribute values they share, and which ones differ. Characterizing the space of existing designs in terms of these attributes exposes gaps in the space, suggesting novel combinations to explore.

Some prior work on student modeling has used this approach to characterize spaces of possible knowledge tracing models. Knowledge tracing (KT) [2] generally has 4 or 5 parameters: the probability slip of failing on a known skill; the probability guess of succeeding on an unknown skill; the probability knew of knowing a skill before practicing it; the transition probability learn from not knowing the skill to knowing it; and sometimes the transition probability forget from knowing the skill to not knowing it, usually assumed to be zero.

Mostow et al. [3] defined a space of alternative parameterizations of a given KT model, based on whether they assigned each knowledge tracing parameter a single overall value, a distinct value for each individual student and/or skill, or different values for different categories of students and/or skills. Thus the number of values to fit is 4 if using a single global value for each parameter, but with separate probabilities for each (student, skill) pair, the number of values to fit is 4 × #students × #skills. This work ordered the space of possible parameterizations of a single model by the number of values to fit.

Xu and Mostow [4] factored the space of different knowledge tracing models in terms of three attributes: how to fit their parameters, how to predict students’ performance from their estimated knowledge, and how to update those estimates based on observed performance. We will use this factoring in Section 3.2.

Section 2 introduces the proposed framework. Section 3 describes HOT-DINA, a novel knowledge tracing method that the framework inspired. Sections 4 and 5 evaluate HOT-DINA on synthetic and real data, respectively. Section 6 concludes.

=== 2. A Unified 5-Dimensional Framework ===
We characterize student models in terms of these five dimensions:

Temporal effect: skills time-invariant vs. time-varying.
* Static, e.g. IRT [5] and PFA [6]
* 2 or more fixed time points, e.g. at pre- and post-test
* Dynamic, e.g. KT [2]

Skill dimensionality: single skill vs. multiple skills at a step.

Credit assignment: how credit (or blame) is allocated among influences on the observed success (or failure) of a step. Mostow et al. [3] define a space of KT parameterizations. Corbett and Anderson [2] originally fit KT per skill. Pardos and Heffernan [7] individualized KT and fit parameters per student. Wang and Heffernan [8] simultaneously fit KT per student and per skill. In contrast, multiple-skills models require combination functions to assign credit or blame among the skills. Product KT [9] assigns full responsibility to each skill and multiplies the estimates. Conjunctive KT [10] assigns fair credit or blame to skills and multiplies the estimates. Weakest KT [11] credits or blames the weakest skill and takes the minimum of the estimates. LR-DBN [12] apportions credit or blame and performs logistic regression over the estimates. We summarize credit assignment methods as:
* Contingency table
** Per student
** Per skill
** Per (student, skill) pair
** Per student + per skill
* Binary or probabilistic
** Conjunctive (min)
** Independent (product)
** Disjunctive (max)
* Other
** Compensatory (+)
** Mixture (weighted average)
** Logistic regression (sigmoid)

Higher order: treat static student properties as latent traits or not. We say IRT [5] models “higher order” effects because it estimates static student proficiencies independent of skill properties such as skill difficulty in 1PL (1 Parameter Logistic), skill discrimination in 2PL, and skill guess rate in 3PL. De la Torre [13] first combined IRT with static Cognitive Diagnosis Models such as NIDA (Noisy Inputs, Deterministic And Gate) [14-16] and DINA (Deterministic Inputs, Noisy And Gate), and proposed higher order latent trait models (HO-NIDA and HO-DINA). Xu and Mostow [17] used IRT to estimate the probability of knowing a skill initially in a higher order knowledge tracing model (HO-KT).

Noise: how to represent errors in the model, or discrepancies between what a student knows versus does. KT assumes students may guess a step correctly even though they don’t know its underlying skill(s), or slip at a step even though they know its skill(s). Such “noise” is also characterized in other models, including single-skill KT variants such as PPS (Prior Per Student) [7] and SSM (Student Skill Model) [8], and IRT models such as 3PL. NIDO and DINO respectively add noise either before or after combining estimates of multiple skills. We refer to these noise modeling methods as:
* None
* Slip/Guess
* NIDO (noisy input, deterministic output)
* DINO (deterministic input, noisy output)

Table 1 summarizes student models in the proposed unified 5-dimensional framework. Note that we only discuss known cognitive models (e.g. Q-matrix) in this paper, so we omit methods that discover unknown cognitive models [18, 19].

{| class="wikitable"
|+ Table 1. A unified 5-dimensional framework for student models
! Student models !! Temporal effect !! Skill dimensionality !! Credit assignment !! Higher order effect !! Noise model
|-
| IRT 1PL (Rasch model) [5] || Static || Single skill || Per student + per skill || Latent trait || None
|-
| IRT 2PL (2 Parameter Logistic) [5] || Static || Single skill || Per student + per skill || Latent trait || None
|-
| IRT 3PL (3 Parameter Logistic) [5] || Static || Single skill || Per student + per skill || Latent trait || Slip/Guess
|-
| LLM (Linear Logistic Model) [16] || Static || Multiple skills || Sigmoid || No latent trait || None
|-
| LFA (Learning Factor Analysis) [20] || Static || Multiple skills || Sigmoid || No latent trait || None
|-
| PFA (Performance Factor Analysis) [6] || Static || Multiple skills || Sigmoid || No latent trait || None
|-
| NIDA [14-16] || Static || Multiple skills || Product || No latent trait || NIDO
|-
| DINA [14-16] || Static || Multiple skills || Product || No latent trait || DINO
|-
| LLTM (Linear Logistic Test Model) [21] || Static || Multiple skills || Sigmoid || Latent trait || None
|-
| HO-NIDA [13] || Static || Multiple skills || Product || Latent trait || NIDO
|-
| HO-DINA [13] || Static || Multiple skills || Product || Latent trait || DINO
|-
| KT [2] || Dynamic || Single skill || Per skill || No latent trait || Slip/Guess
|-
| PPS (Prior Per Student) [7] || Dynamic || Single skill || Per student || No latent trait || Slip/Guess
|-
| SSM (Student Skill Model) [8] || Dynamic || Single skill || Per student + per skill || No latent trait || Slip/Guess
|-
| HO-KT [17] || Dynamic || Single skill || Per student + per skill || Latent trait || Slip/Guess
|-
| DIR (Dynamic IRT 1PL) [22] || Dynamic || Single skill || Per student + per skill || Latent trait || None
|-
| KT+NIDA [23] || Dynamic || Multiple skills || Product || No latent trait || NIDO
|-
| Product KT [9] || Dynamic || Multiple skills || Product || No latent trait || NIDO
|-
| CKT [10] || Dynamic || Multiple skills || Product || No latent trait || NIDO
|-
| Weakest KT [11] || Dynamic || Multiple skills || Minimum || No latent trait || DINO
|-
| KT+DINA [23] || Dynamic || Multiple skills || Product || No latent trait || DINO
|-
| LR-DBN [12] || Dynamic || Multiple skills || Sigmoid || No latent trait || DINO
|-
| HOT-NIDA [Section 3] || Dynamic || Multiple skills || Product || Latent trait || NIDO
|-
| HOT-DINA [Section 3] || Dynamic || Multiple skills || Product || Latent trait || DINO
|}

{| class="wikitable"
|+ Table 2. Comparative framework to train, predict and update multiple-skills models
! Student models !! Train !! Predict !! Update
|-
| CKT || Train skills separately. Assign each skill full responsibility. || Multiply skill estimates. || Update skills together. Bayes’ equations assign responsibility.
|-
| Product KT || Train skills separately. Assign each skill full responsibility. || Multiply skill estimates. || Update skills separately, each with full responsibility.
|-
| Weakest KT (Blame weakest, credit rest) || Train skills separately. Assign each skill full responsibility. || Minimum of skill estimates. || Update skills separately, each with full responsibility.
|-
| Weakest KT (Update weakest skill) || Train skills separately. Assign each skill full responsibility. || Minimum of skill estimates. || Update only the weakest skill.
|-
| HOT-NIDA, HOT-DINA [Section 3.2] || Train skills together. Assign each skill full responsibility. || Multiply skill estimates. || Update skills together, each with full responsibility.
|-
| KT+NIDA/DINA || Train skills together. Assign each skill full responsibility. || Multiply skill estimates. || Update skills together, each with full responsibility.
|-
| LR-DBN || Train skills together. Logistic regression assigns responsibility. || Logistic regression on skill estimates. || Update skills together. Logistic regression assigns responsibility.
|}

Table 2 (adapted from [4]) expands Credit assignment in terms of how to train, predict and update skills, e.g. to assign full responsibility to every skill, blame the weakest skill and credit the rest, update only the weakest skill, or use a logistic function.

The tables suggest transformations of models along the dimensions in the framework. For example, Dynamic IRT [22] varies student proficiency by time, transforming static IRT to dynamic. KT+NIDA/DINA [23] varies skill estimates by time, transforming static NIDA/DINA to dynamic. HO-NIDA/DINA/KT adds latent traits, transforming NIDA/DINA/KT to higher order. LLM [16] and LLTM [21] change the combination function, transforming conjunctive models to logistic models. In Section 3 we generate a novel student model by transforming HO-KT to a multi-skill model.

=== 3. A Higher-Order Temporal Student Model to Trace Multiple Skills: HOT-DINA ===
Xu and Mostow [17] extended the static IRT model into HO-KT (Higher Order Knowledge Tracing), which accounts for skill-specific learning by using the static IRT model to estimate the probability Pr(knew) of knowing a skill before practicing it. By generalizing to steps that require conjunctions of multiple skills, we arrive at a combined model we call HOT-DINA (Higher Order Temporal, Deterministic Input, Noisy-And). Note we can transform it into HOT-NIDA simply by changing its noise type.

==== 3.1 HOT-DINA = IRT + KT + DINA ====
Let <math>\{Y^{(0)}, Y^{(1)}, \ldots, Y^{(t)}, \ldots\}</math> denote a sequential dataset recorded by an intelligent tutoring system, where <math>Y_{nj}^{(t)} = 1</math> iff student n correctly performs a step that requires skill j at time t. KT is a Hidden Markov Model (HMM) that models a binary hidden state <math>K^{(t)}</math> indicating if the student knows the skill at time t. The probability of knowing the skill is knew at time t = 0, and then changes based on the student’s observed performance on the skill, according to the standard KT parameters slip, guess, learn, and forget (usually set to zero). KT can fit these four parameters (taking forget = 0) for each (student, skill) pair, but the resulting large number of values to fit is likely to cause over-fitting. Thus, Corbett and Anderson [2] originally proposed to estimate knew per student, and learn, guess and slip per skill.

IRT assumes a latent trait that represents a student’s underlying proficiency in all the skills. For example, the Two-Parameter Logistic (2PL) IRT model assumes that the probability of a student’s correct response is a logistic function of a unidimensional student proficiency θ with two skill-specific parameters: discriminability a and difficulty b (see Equation 1).

<math>P(Y = 1) = \frac{1}{1 + \exp(-1.7\,a\,(\theta - b))}</math>

Equation 1. The logistic function of the 2PL model

The two skill parameters determine the shape of the IRT curve. As a student’s proficiency increases beyond the skill difficulty, the student’s chance of performing correctly surpasses 50%. The skill discriminability reflects how fast the logit (log odds) increases or decreases when the proficiency changes. Thus IRT fits parameters individually on each dimension, without losing the information from the other.

HO-KT uses 2PL to estimate knew in KT, by fitting student-specific proficiency <math>\theta_n</math>, skill discriminability <math>a_j</math> and skill difficulty <math>b_j</math>. It then uses KT to trace each skill, by fitting skill-specific <math>learn_j</math>, <math>guess_j</math> and <math>slip_j</math>. Thus, HO-KT models students’ initial overall knowledge before they practice any skills; then it updates its estimates of students’ knowledge of each individual skill by observing additional practice on the skill. It also models two attributes of the skills, difficulty and discriminability, which are assumed to be constants that do not change over time.

To incorporate DINA into HO-KT, we still model a hidden binary state in each step to indicate whether a student knows the overall skill used in the step, denoted as <math>\eta_{nj}^{(t)}</math> for student n with skill j at time t. However, we also model a hidden binary state <math>\alpha_{nk}^{(t)}</math> to indicate whether student n knows skill k at time t. Given a matrix <math>Q = \{Q_{jk}\}</math>, indicating whether the overall skill j requires skill k, we conjoin the skills as follows:

<math>\eta_{nj}^{(t)} = \prod_{k} \left(\alpha_{nk}^{(t)}\right)^{Q_{jk}}</math>

Equation 2. Conjunction of skills in HOT-DINA

This formula gives us the DINA (Deterministic Input, Noisy-And gate) structure [15], with the conjunction as the “and” gate and guess and slip as the noise. Thus by combining HO-KT with DINA, we obtain the HOT-DINA higher order temporal model to trace multiple skills. Figure 1 shows how the plate diagram for HOT-DINA integrates IRT, KT, and DINA.

Figure 1. Graphical representation of Higher-Order Temporal DINA (HOT-DINA) to trace multiple skills

Given η as a conjunction of α, the likelihood of Y given η, the conditional independence of <math>\alpha^{(0)}</math> given θ, and of <math>\alpha^{(t)}</math> given <math>\alpha^{(t-1)}</math>, the posterior distribution of θ, a, b, α, η, learn (l), guess (g) and slip (s) given Y is

<math>P(\boldsymbol{\theta}, \boldsymbol{a}, \boldsymbol{b}, \boldsymbol{\alpha}, \boldsymbol{\eta}, \boldsymbol{l}, \boldsymbol{g}, \boldsymbol{s} \mid \boldsymbol{Y}) \propto L(\boldsymbol{Y} \mid \boldsymbol{g}, \boldsymbol{s}, \boldsymbol{\eta}, \boldsymbol{\alpha})\, P(\boldsymbol{\alpha}^{(0)} \mid \boldsymbol{\theta}, \boldsymbol{a}, \boldsymbol{b}) \left( \prod_{t \ge 1} P(\boldsymbol{\alpha}^{(t)} \mid \boldsymbol{\alpha}^{(t-1)}, \boldsymbol{l}) \right) P(\boldsymbol{\theta})\, P(\boldsymbol{a})\, P(\boldsymbol{b})\, P(\boldsymbol{l})\, P(\boldsymbol{g})\, P(\boldsymbol{s})</math>

Equation 3 shows the formula for using 2PL to estimate the probability knew of a student knowing a skill at time t = 0:

<math>P(knew_{nk}) = P(\alpha_{nk}^{(0)} = 1) = \frac{1}{1 + \exp(-1.7\,a_k\,(\theta_n - b_k))}</math>

Equation 3. 2PL to estimate knew in HOT-DINA
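As a minimal sketch of the 2PL estimate of knew and of the conjunction of subskills described in Section 3.1, the Python below evaluates both for a single student. The function names `knew_2pl` and `conjoin`, and all numeric values, are hypothetical illustrations and are not taken from the paper's implementation.

```python
import math


def knew_2pl(theta, a, b):
    """2PL probability of knowing a subskill at t = 0.

    theta: student proficiency, a: discriminability, b: difficulty.
    """
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))


def conjoin(alpha, q_row):
    """DINA conjunction: eta = product over k of alpha_k ** Q_jk.

    alpha: binary subskill states (0 or 1) for one student at one time.
    q_row: binary row of the Q matrix for one overall skill j.
    Subskills with Q_jk = 0 contribute a factor of 1 (0 ** 0 == 1 in Python).
    """
    eta = 1
    for alpha_k, q_jk in zip(alpha, q_row):
        eta *= alpha_k ** q_jk
    return eta


if __name__ == "__main__":
    # Illustrative values only (assumed, not fitted).
    print(knew_2pl(theta=0.5, a=1.5, b=-0.95))
    # The student knows subskills 1 and 3 but not subskill 2.
    alpha = [1, 0, 1]
    print(conjoin(alpha, [1, 0, 1]))  # step requires subskills 1 and 3
    print(conjoin(alpha, [1, 1, 0]))  # step requires subskills 1 and 2
```

When proficiency equals difficulty the logistic term is exactly one half, which matches the remark that the chance of success passes 50% once proficiency exceeds difficulty.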
Equation 4 shows the formula for tracing the skills with skill-specific learn and zero forget:

<math>P(\alpha_{nk}^{(t)} = 1 \mid \alpha_{nk}^{(t-1)} = 0) = learn_k</math>

<math>P(\alpha_{nk}^{(t)} = 0 \mid \alpha_{nk}^{(t-1)} = 1) = forget_k = 0</math>

Equation 4. Knowledge tracing of skills in HOT-DINA

Equation 5 shows the likelihood of a student’s performance given the hidden state <math>\eta^{(t)}</math> and the skill-specific guess and slip:

<math>L(Y_{nj}^{(t)} = 1 \mid \eta_{nj}^{(t)}) = guess_j^{\,1 - \eta_{nj}^{(t)}} \times (1 - slip_j)^{\eta_{nj}^{(t)}}</math>

<math>L(Y_{nj}^{(t)} = 0 \mid \eta_{nj}^{(t)}) = (1 - guess_j)^{1 - \eta_{nj}^{(t)}} \times slip_j^{\,\eta_{nj}^{(t)}}</math>

Equation 5. Likelihood in HOT-DINA

==== 3.2 How to Train, Predict, and Update ====
Following the organization of Table 2, Section 3.2.1 details how HOT-DINA trains the skills together and assigns each skill full responsibility; Section 3.2.2 specifies how HOT-DINA predicts student performance by using a product of skill estimates; and Section 3.2.3 shows how HOT-DINA updates the weakest skill.

===== 3.2.1 Training the model with MCMC =====
We estimate the parameters of HOT-DINA using Markov Chain Monte Carlo (MCMC) methods, which require that we specify the prior distributions and constraints for every parameter. We assume that student general proficiency <math>\theta_n</math> is normally distributed with mean 0 and standard deviation 1. The skill discrimination <math>a_k</math> is positive and uniformly distributed between 0 and 2.5, while the skill difficulty <math>b_k</math> is also normally distributed with mean 0 and standard deviation 1. Learn has prior Beta(1,1), whereas guess and slip have uniform priors from 0 to 0.4. Thus, the priors on each parameter are:

<math>\theta_n \sim Normal(0, 1)</math>

<math>b_k \sim Normal(0, 1)</math>

<math>a_k \sim Uniform(0, 2.5)</math>

<math>learn_k \sim Beta(1, 1)</math>

<math>guess_j \sim Uniform(0, 0.4)</math>

<math>slip_j \sim Uniform(0, 0.4)</math>

We use the following conditional distributions for each node:

<math>\alpha_{nk}^{(0)} \mid \theta_n \sim Bernoulli\left(\{1 + \exp(-1.7\,a_k\,(\theta_n - b_k))\}^{-1}\right)</math>

<math>\alpha_{nk}^{(t)} \mid \alpha_{nk}^{(t-1)} = 0 \sim Bernoulli(learn_k)</math>

<math>\alpha_{nk}^{(t)} \mid \alpha_{nk}^{(t-1)} = 1 \sim Bernoulli(1)</math>

<math>Y_{nj}^{(t)} \mid \eta_{nj}^{(t)} = 0 \sim Bernoulli(guess_j)</math>

<math>Y_{nj}^{(t)} \mid \eta_{nj}^{(t)} = 1 \sim Bernoulli(1 - slip_j)</math>

===== 3.2.2 Predicting student performance =====
For inference, we introduce uncertainty to <math>\eta_{nj}</math>, and rewrite Equation 2 as follows:

<math>P(\eta_{nj}^{(0)} = 1) = \prod_{k} \left( \frac{1}{1 + \exp(-1.7\,a_k\,(\theta_n - b_k))} \right)^{Q_{jk}}</math>

<math>P(\eta_{nj}^{(t)} = 1) = \prod_{k} \left( P(\alpha_{nk}^{(t)} = 1) \right)^{Q_{jk}} \quad \text{for } t = 1, 2, 3, \ldots</math>

Equation 6. Conjunction of skills in HOT-DINA inference

Then we predict student performance by using Equation 7:

<math>P(Y_{nj}^{(t)} = 1) = (1 - slip_j)\, P(\eta_{nj}^{(t)} = 1) + guess_j \left(1 - P(\eta_{nj}^{(t)} = 1)\right)</math>

Equation 7. Prediction in HOT-DINA

===== 3.2.3 Updating estimated skills =====
We update the estimates of latent states η and α after observing actual student performance. The estimate of knowing a skill or a subskill should increase if the student performed correctly at the step. It is easy to update a skill by using Bayes’ rule, as shown in Equation 8. The posterior <math>P(\eta_{nj}^{(t)} = 1 \mid Y_{nj}^{(t)} = 1)</math> should be higher than <math>P(\eta_{nj}^{(t)} = 1)</math> if and only if <math>(1 - slip_j) > guess_j</math>.

<math>P(\eta_{nj}^{(t)} = 1 \mid Y_{nj}^{(t)} = 1) = \frac{P(Y_{nj}^{(t)} = 1 \mid \eta_{nj}^{(t)} = 1)\, P(\eta_{nj}^{(t)} = 1)}{P(Y_{nj}^{(t)} = 1)} = \frac{(1 - slip_j)\, P(\eta_{nj}^{(t)} = 1)}{(1 - slip_j)\, P(\eta_{nj}^{(t)} = 1) + guess_j \left(1 - P(\eta_{nj}^{(t)} = 1)\right)}</math>

Equation 8. Bayes’ rule to update η in HOT-DINA

Although we could update HOT-DINA by assigning full responsibility to each skill, it would be interesting to update the weakest (or say hardest) skill since HOT-DINA fits the parameter ‘difficulty’ for each skill. Thus, we update the skill that is the hardest among all the required skills in a step:

<math>P(\eta_{nj}^{(t)} = 1 \mid Y_{nj}^{(t)} = 1) = P(\alpha_{nk'}^{(t)} = 1 \mid Y_{nj}^{(t)} = 1) \prod_{k \neq k'} P(\alpha_{nk}^{(t)} = 1) \quad \text{for } k' = \arg\max_{k:\, Q_{jk} = 1} b_k</math>

Equation 9. Update the hardest skill in HOT-DINA

In short, we extend HO-KT to the HOT-DINA higher order temporal model, which traces multiple skills. We use the MCMC algorithm to estimate the parameters, and update the estimates of a student knowing a skill given observed student performance. How well does the HOT-DINA model work? To evaluate it, we performed a simulation study. Section 4 describes the study and reports its results.

=== 4. Simulation Study ===
To study the behavior of HOT-DINA, we generated synthetic training data for it according to the priors and conditional distributions defined in Section 3.2.1. Section 4.1 describes the synthetic data. One purpose of this experiment was to test how accurately MCMC can recover the parameters of HOT-DINA, as Section 4.2 reports. It is important not only to test how well a method works, but to analyze when and why. Thus another purpose was to determine how many students and observations are needed to estimate the difficulty and discriminability of a given number of skills, as Section 4.3 explains.

==== 4.1 Synthetic Data ====
We use the following procedure to generate the synthetic data, with all the variables as defined in Section 3.2:

1. We chose K = 4 and J = 14, which results in a 14 × 4 Q matrix. The Q matrix, shown transposed below, indicates that we generate the 14 overall skills as the combinations of one, two, or three of the 4 subskills.

<math>Q^{T} = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 1 & 1 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 1 & 0 & 1 & 1 & 1 \end{bmatrix}</math>

2. We randomly generated <math>\theta_n</math> from Normal(0,1) for n = 1, ..., N.

3. We chose a, b and l as shown in Table 3.

{| class="wikitable"
|+ Table 3. True value of skill-specific discrimination, difficulty and learning rate in synthetic data simulation
! k !! 1 !! 2 !! 3 !! 4
|-
| a || 1.5 || 1.2 || 1.9 || 1.0
|-
| b || -0.95 || 1.42 || -0.66 || 0.50
|-
| learn || 0.8 || 0.6 || 0.5 || 0.3
|}

4. We randomly generated g and 1-s from Unif(0, 0.4) and Unif(0.6, 1) respectively, as shown in Table 4.

{| class="wikitable"
|+ Table 4. True value of skill-specific guess and not slip parameters in synthetic data simulation
! j !! 1 !! 2 !! 3 !! 4 !! 5 !! 6 !! 7
|-
| guess || 0.35 || 0.40 || 0.13 || 0.15 || 0.29 || 0.39 || 0.10
|-
| 1-slip || 0.67 || 0.66 || 0.67 || 0.90 || 0.65 || 0.60 || 0.61
|-
! j !! 8 !! 9 !! 10 !! 11 !! 12 !! 13 !! 14
|-
| guess || 0.40 || 0.15 || 0.16 || 0.38 || 0.11 || 0.26 || 0.35
|-
| 1-slip || 0.81 || 0.74 || 0.76 || 0.73 || 0.83 || 0.89 || 0.85
|}

5. We chose N = 100, T = 100, randomly picked one skill at each step, and simulated sequential data of size 10,000.

==== 4.2 Results ====
We used OpenBUGS [24] to implement the MCMC algorithm of HOT-DINA. We chose 5 chains starting at different initial points. We monitored the estimates of skill discrimination <math>\hat{a}</math> and difficulty <math>\hat{b}</math> to check their convergence, which we judged by whether all the chains appeared to overlap each other. As a result, we ran the simulation for 10,000 iterations with a burn-in of 3000.

Table 5 reports the sample means and their 95% confidence intervals for the parameter estimates <math>\hat{a}</math>, <math>\hat{b}</math> and <math>\widehat{learn}</math> respectively. We also report the Monte Carlo error (MC error) and sample standard deviation (s.d.) to assess the accuracy of the posterior estimates for each parameter. MC error, which is an estimate of the difference between the estimated posterior mean (i.e. the sample mean) and the true posterior mean, should be less than 5% of the s.d. in order to obtain an accurate posterior estimate.

{| class="wikitable"
|+ Table 5. Estimates of skill-specific discrimination, difficulty, and learning rate (N = 100, T = 100, K = 4, J = 14)
! k !! a !! <math>\hat{a}</math> (95% C.I.) !! s.d. !! MC_error
|-
| 1 || 1.50 || 1.33 (0.36, 2.43) || 0.65 || 0.03216
|-
| 2 || 1.20 || 1.23 (0.12, 2.43) || 0.72 || 0.03561
|-
| 3 || 1.90 || 1.85 (0.22, 2.73) || 0.64 || 0.03146
|-
| 4 || 1.00 || 0.98 (0.19, 2.12) || 0.58 || 0.02870
|-
! k !! b !! <math>\hat{b}</math> (95% C.I.) !! s.d. !! MC_error
|-
| 1 || -0.95 || -0.95 (-2.15, -0.04) || 0.50 || 0.02339
|-
| 2 || 1.42 || 1.51 (0.90, 2.21) || 0.45 || 0.01936
|-
| 3 || -0.66 || -0.69 (-1.81, -0.63) || 0.42 || 0.01990
|-
| 4 || 0.5 || 0.5 (0.05, 1.18) || 0.38 || 0.01691
|-
! k !! learn !! <math>\widehat{learn}</math> (95% C.I.) !! s.d. !! MC_error
|-
| 1 || 0.8 || 0.81 (0.48, 0.99) || 0.13 || 0.006599
|-
| 2 || 0.6 || 0.60 (0.52, 0.70) || 0.05 || 0.002132
|-
| 3 || 0.5 || 0.57 (0.38, 0.84) || 0.11 || 0.005432
|-
| 4 || 0.3 || 0.29 (0.25, 0.33) || 0.02 || 7.79E-04
|}

We calculated the Root Mean Squared Error (RMSE) of the estimates of the continuous variables <math>\widehat{guess}</math>, <math>1 - \widehat{slip}</math>, and <math>\hat{\theta}</math>. We report the accuracy of recovering the true value of the latent binary variable α in Table 6.

{| class="wikitable"
|+ Table 6. Estimation RMSE of skill-specific guess, not slip, and student-specific proficiency; prediction accuracy of a student mastering a subskill (N = 100, T = 100, K = 4, J = 14)
! !! <math>\widehat{guess}</math> !! <math>1 - \widehat{slip}</math> !! <math>\hat{\theta}</math>
|-
| RMSE || 0.0103 || 0.0196 || 0.9183
|-
! !! colspan="3" | <math>\hat{\alpha}</math>
|-
| Accuracy || colspan="3" | 99.38%
|}

From the results, we can see that the MCMC algorithm accurately recovered the parameters we used in generating the synthetic data for HOT-DINA. In addition to seeing how accurately it can estimate the parameters, we are also interested in finding out how many observations would be sufficient for the training algorithm to recover the hidden variables. Therefore, we conducted the study we now describe in Section 4.3.

==== 4.3 Study Design ====
HOT-DINA requires data from enough students to rate the difficulty and discriminability of each skill, and data on enough skills to estimate the proficiency of each student. So we fixed the number of skills at K = 4, and varied the number of students N or the number of steps observed from each student T, to discover how many observations would be sufficient to estimate the parameters. In particular, we evaluated each model on how accurately it estimated the latent binary state α, which indicates if a student masters a skill. We generated the data by using the same parameters as in Section 4.1. Besides the general HOT-DINA model that accounts for multiple skills, we also studied the single-skill model by shrinking the number of skills J to equal K and setting Q to an identity matrix, which reduces the HOT-DINA model to a HO-KT model.

We increased N, the number of students, from 10 to 1000, and T, the number of observations per student, from 5 to 100. Table 7 and Table 8 respectively show the accuracy of estimating the latent state α in HO-KT and HOT-DINA. Both tables show a trend of increasing accuracy when N or T increases (though at the cost of longer training time, roughly O(N²×T)).

{| class="wikitable"
|+ Table 7. Accuracy of estimating the latent binary states α with different N and T (K = J = 4)
! N \ T !! 5 !! 10 !! 20 !! 50 !! 100
|-
| 10 || 71.01% || 80.81% || 83.01% || 93.11% || 96.16%
|-
| 20 || 72.32% || 82.74% || 86.52% || 94.06% || 97.33%
|-
| 50 || 73.58% || 83.79% || 87.34% || 95.27% || 98.90%
|-
| 100 || 77.55% || 84.43% || 88.08% || 95.81% || 99.41%
|-
| 200 || 76.52% || 84.02% || 89.48% || 97.26% || NA
|-
| 500 || 78.13% || 84.34% || 92.50% || NA || NA
|-
| 1000 || 80.10% || 84.59% || NA || NA || NA
|}

Due to the lack of sampling ability of OpenBUGS for high dimensional dynamic models, we have no available scores to show for N×T bigger than 10,000. We can see that the multiple-skill model predicts better than the single-skill model because the average number of observations per skill in the former is larger than in the latter. As observed in both tables, it is more efficient to increase T than N to get a better estimate. Both of the models reach the best prediction accuracy score (> 99%) when N = 100 and T = 100. In order to obtain an accuracy > 90% for K = 4 skills, the least amount of data we need for HO-KT is N = 10 with T ≈ 50 observations, as shown in Table 7; for HOT-DINA it is N = 10 with T > 20 observations, as shown in Table 8.

{| class="wikitable"
|+ Table 8. Accuracy of estimating the latent binary states α with different N and T (K = 4, J = 14)
! N \ T !! 5 !! 10 !! 20 !! 50 !! 100
|-
| 10 || 72.07% || 75.57% || 91.14% || 96.90% || 98.10%
|-
| 20 || 74.32% || 83.60% || 91.56% || 97.46% || 98.53%
|-
| 50 || 76.55% || 84.71% || 92.62% || 97.52% || 98.98%
|-
| 100 || 77.80% || 86.82% || 93.83% || 97.67% || 99.82%
|-
| 200 || 79.92% || 88.78% || 94.26% || 99.41% || NA
|-
| 500 || 82.15% || 89.95% || 98.61% || NA || NA
|-
| 1000 || 83.58% || 92.34% || NA || NA || NA
|}

Next we apply the proposed model to real data logged by an algebra tutor. We evaluate the model fit and compare it against two baselines.

=== 5. Evaluation on Real Data ===
We apply HOT-DINA to a real dataset from the Algebra Cognitive Tutor® [25]. Because of limited time, we chose a subset of the data, by crossing out the “isolated” algebra tutor steps. An “isolated” step here means a step that requires one skill all its own. We grouped the remaining steps that require the same multiple skills into one skill, resulting in J = 15 distinct skills that require K = 12 subskills. Following the study design in Section 4.3, we randomly chose N = 50 students with T = 100 in order to obtain enough data for the MCMC estimation.

{| class="wikitable"
|+ Table 9. Data split of the Algebra Tutor data: training on I and IV, and testing on II and III
! !! Skill group A !! Skill group B
|-
| Student group A || I || II
|-
| Student group B || III || IV
|}

We split the 50 students into two groups of 25, and split the 15 skills into two groups of 8 and 7. As shown in Table 9, we combined data from I (student-group-A practicing on skill-group-A) and IV (student-group-B practicing on skill-group-B) to obtain the training data. Accordingly, we combined the data from II and III to obtain the test data. As a benefit of the data split, we are able to test the models on unseen students for the same group of skills, and also test on the unseen skills for the same group of students.

We compared HOT-DINA with the conjunctive minimum KT model [11] since it showed the best prediction accuracy among all the previous KT based methods [4]. It fits KT parameters by blaming each skill that is required at a step, predicts student performance by the weakest skill, and updates only the weakest skill. Accordingly, we updated the most difficult skill in HOT-DINA as discussed in Section 3.2.3. As two baseline models, we fit per-skill KT and per-student KT. Comparing HOT-DINA with these two baselines also allows us to discuss some more interesting research questions later in this section.

Table 10 and Table 11 respectively show the models’ prediction accuracy and log-likelihood on the test data. We report the majority class because of the unbalanced data. HOT-DINA beat the two baselines in predicting student performance, and also obtained the maximum log-likelihood on the test data. The per-student KT model obtained the worst scores on both measures. It predicted student performance almost as poorly as majority class because it misclassified almost all the data in the minority class.

{| class="wikitable"
|+ Table 10. Comparison of prediction accuracy on real test data
! !! Overall Accuracy !! Accuracy on Correct Steps !! Accuracy on Incorrect Steps
|-
| HOT-DINA || 82.48% || 96.63% || 27.27%
|-
| Per-skill KT || 80.87% || 94.02% || 29.60%
|-
| Per-student KT || 79.63% || 99.74% || 1.20%
|-
| Majority class || 79.60% || 100.00% || 0.00%
|}

{| class="wikitable"
|+ Table 11. Comparison of log-likelihood on real test data
! !! Log-likelihood
|-
| HOT-DINA || -2021.04
|-
| Per-skill KT || -2075.67
|-
| Per-student KT || -2464.74
|}

We are also interested in three other hypotheses comparing HOT-DINA with KT. We describe them, test them, and show the results as follows.

1. HOT-DINA should predict early steps more accurately than KT since its estimate of knew reflects both skill difficulty and student proficiency, not just one or the other. In fact HOT-DINA beat KT throughout, as Figure 2 shows.

Figure 2. Accuracy on student’s 1st, 2nd, 3rd, … test steps (series: per-student KT, per-subskill KT, HOT-DINA, majority class)

2. HOT-DINA should beat KT on sparsely trained skills thanks to student proficiency estimates based on other skills. As Figure 3 shows, HOT-DINA tied or beat KT throughout.

Figure 3. Skills sorted by amount of training data (series: per-subskill KT, HOT-DINA, majority class)

3. HOT-DINA should beat KT on sparsely trained students thanks to skill difficulty and discriminability estimates based on other students. As Figure 4 shows, HOT-DINA beat KT throughout.

Figure 4. Students sorted by amount of training data (series: per-student KT, HOT-DINA, majority class)

Thus, HOT-DINA outperformed the two baselines in model fit. It also beat them as specified by the three hypotheses above.

=== 6. Contributions, limitations, future work ===
In this paper we make several contributions. We defined a 5-dimensional framework for student models. We showed how numerous student models fit into it. We described the new combination of IRT, KT, and DINA it suggests in the form of HOT-DINA. We specified how to train HOT-DINA by using MCMC, how to test it by predicting student performance, and how to update estimated skills based on observed performance.

HOT-DINA uses IRT to estimate knew based on student proficiency and skill difficulty. Thus it does not need training data on every (student, skill) pair, since it can estimate student proficiency based on other skills, and skill difficulty and discriminability based on other students. Likewise, it should estimate knew more accurately than KT for skills and students with sparse training data. HOT-DINA uses KT to model learning over time, and DINA to model combination of multiple skills underlying observed steps (unlike conventional KT and with fewer parameters than CKT [10] or LR-DBN [12]).

Tracing multiple skills underlying an observed step requires allocating responsibility among them for its success or failure. DINA simply conjoins them, a common method but inferior to others. Future work includes using the best known method [4], which we didn’t use here because the logistic regression it performs is non-trivial to integrate with MCMC.

We evaluated HOT-DINA on synthetic and real data, not only showing that it predicts student performance better than previous methods, but analyzing when and why.

We reported a simulation study to test if training could recover model parameters, and to determine the amount of data needed. HOT-DINA requires data on enough students and skills to estimate their proficiency and difficulty, respectively. We explored how its accuracy varies with the number of test steps and the amount of training data per student and per skill. These analyses were correlational, based on variations that happened to occur in the training data. Future work should invest in the computation required to vary the amount of training data to establish its true causal effect on accuracy.

Evaluation on real data from an algebra tutor showed that HOT-DINA achieved higher predictive accuracy and log likelihood than KT with parameters fit per student or per skill. This evaluation was limited to a single data set and two baselines (not counting majority class). Future work should compare HOT-DINA to other methods (notably the Student Skill model [8], which is similar in spirit) and on data from other tutors.

We assumed that student proficiency is one-dimensional. Future work can test if k dimensions capture enough additional variance to make it worthwhile to fit k times as many parameters.

Finally, our choice of 5 dimensions is useful but limiting. Additional dimensions may provide useful finer-grained insights into the models covered by the current framework, and expand it to encompass other types of student models, e.g. where the cognitive model is unknown and must be discovered [18, 19].

=== Acknowledgments ===
This work was supported in part by the National Science Foundation through Grants 1124240 and 1121873 to Carnegie Mellon University. The opinions expressed are those of the authors and do not necessarily represent the views of the National Science Foundation or U.S. government. We thank Ken Koedinger for his algebra tutor data.

=== References ===
[1] Zwicky, F. Discovery, Invention, Research - Through the Morphological Approach. 1969, Toronto: The Macmillan Company.

[2] Corbett, A. and J. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 1995. 4: p. 253-278.

[3] Mostow, J., Y. Xu, and M. Munna. Desperately Seeking Subscripts: Towards Automated Model Parameterization. Proceedings of the 4th International Conference on Educational Data Mining, 283-287. 2011. Eindhoven, Netherlands.

[4] Xu, Y. and J. Mostow. Comparison of methods to trace multiple subskills: Is LR-DBN best? [Best Student Paper Award]. Proceedings of the Fifth International Conference on Educational Data Mining, 41-48. 2012. Chania, Crete, Greece.

[5] Hambleton, R.K., H. Swaminathan, and H.J. Rogers. Fundamentals of Item Response Theory. Measurement Methods for the Social Science. 1991, Newbury Park, CA: Sage Press.

[6] Pavlik Jr., P.I., H. Cen, and K.R. Koedinger. Performance factors analysis - a new alternative to knowledge tracing. Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED09), 531-538. 2009.

[7] Pardos, Z. and N. Heffernan. Modeling individualization in a Bayesian networks implementation of knowledge tracing. Proceedings of the 18th International Conference on User Modeling, Adaptation and Personalization, 255-266. 2010. Big Island, Hawaii.

[8] Wang, Y. and N.T. Heffernan. The student skill model. Intelligent Tutoring Systems - 11th International Conference, 399-404. 2012. Chania, Crete, Greece. Springer.

[9] Cen, H., K.R. Koedinger, and B. Junker. Comparing Two IRT Models for Conjunctive Skills. Ninth International Conference on Intelligent Tutoring Systems, 796-798. 2008. Montreal.

[10] Koedinger, K.R., P.I. Pavlik, J. Stamper, T. Nixon, and S. Ritter. Avoiding problem selection thrashing with conjunctive knowledge tracing. In Proceedings of the 4th International Conference on Educational Data Mining. 2011: Eindhoven, NL, p. 91-100.

[11] Gong, Y., J.E. Beck, and N.T. Heffernan. Comparing knowledge tracing and performance factor analysis by using multiple model fitting procedures. Proceedings of the 10th International Conference on Intelligent Tutoring Systems, 35-44. 2010. Pittsburgh, PA. Springer Berlin / Heidelberg.

[12] Xu, Y. and J. Mostow. Using logistic regression to trace multiple subskills in a dynamic Bayes net. Proceedings of the 4th International Conference on Educational Data Mining, 241-

[13] de la Torre, J. and J.A. Douglas. Higher-order latent trait models for cognitive diagnosis. Psychometrika, 2004. 69(3): p. 333-353.

[14] Junker, B. and K. Sijtsma. Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 2001. 25(3): p. 258-272.

[15] de la Torre, J. DINA Model and Parameter Estimation: A Didactic. Journal of Educational and Behavioral Statistics, 2009. 34(1): p. 115-130.

[16] Maris, E. Estimating multiple classification latent class models. Psychometrika, 1999. 64(2): p. 197–212.

[17] Xu, Y. and J. Mostow. Using item response theory to refine knowledge tracing. In Proceedings of the 6th International Conference on Educational Data Mining, S.K. D’Mello, R.A. Calvo, and A. Olney, Editors. 2013, International Educational Data Mining Society: Memphis, TN, p. 356-357.

[18] González-Brenes, J.P. and J. Mostow. What and when do students learn? Fully data-driven joint estimation of cognitive and student models. In Proceedings of the 6th International Conference on Educational Data Mining, S.K. D’Mello, R.A. Calvo, and A. Olney, Editors. 2013, International Educational Data Mining Society: Memphis, TN, p. 236-239.

[19] González-Brenes, J.P. and J. Mostow. Dynamic cognitive tracing: towards unified discovery of student and cognitive models. Proceedings of the Fifth International Conference on Educational Data Mining. 2012. Chania, Crete, Greece.

[20] Cen, H., K. Koedinger, and B. Junker. Learning factors analysis – a general method for cognitive model evaluation and improvement. Proceedings of the 8th International Conference on Intelligent Tutoring Systems, 164-175. 2006. Jhongli, Taiwan.

[21] Fischer, G.H. The linear logistic test model. In G.H. Fischer and I.W. Molenaar, Editors, Rasch Models: Foundations, Recent Developments, and Applications, 131-155. Springer: New York, 1995.

[22] Wang, X., J.O. Berger, and D.S. Burdick. Bayesian analysis of dynamic item response models in educational testing. Annals of Applied Statistics, 2013. 7(1): p. 126-153.

[23] Studer, C. Incorporating Learning Over Time into the Cognitive Assessment Framework. Unpublished PhD, Carnegie Mellon University, Pittsburgh, PA, 2012.

[24] Lunn, D., D. Spiegelhalter, A. Thomas, and N. Best. The BUGS project: Evolution, critique and future directions. Statistics in Medicine, 2009. 28: p. 3049–306.

[25] Koedinger, K.R., R.S.J.d. Baker, K. Cunningham, A. Skogsholm, B. Leber, and J. Stamper.
A data repository for the 245. 2011. Eindhoven, Netherlands. EDM community: the PSLC DataShop. In C. Romero, et al., Editors, Handbook of Educational Data Mining, 43-55. CRC Press: Boca Raton, FL, 2010.