=Paper=
{{Paper
|id=Vol-1183/bkt20y_paper02
|storemode=property
|title= A Unified 5-Dimensional Framework for Student Models
|pdfUrl=https://ceur-ws.org/Vol-1183/bkt20y_paper02.pdf
|volume=Vol-1183
|dblpUrl=https://dblp.org/rec/conf/edm/XuM14
}}
== A Unified 5-Dimensional Framework for Student Models==
<pdf width="1500px">https://ceur-ws.org/Vol-1183/bkt20y_paper02.pdf</pdf>
<pre>
   A Unified 5-Dimensional Framework for Student Models
                                                  Yanbo Xu and Jack Mostow
                                            Carnegie Mellon University Project LISTEN
                                                          RI-NSH 4103
                                             5000 Forbes Ave, Pittsburgh, PA 15213
                                                 {yanbox, mostow}@cs.cmu.edu


ABSTRACT                                                               model by the number of values to fit.
This paper defines 5 key dimensions of student models: whether         Xu and Mostow [4] factored the space of different knowledge
and how they model time, skill, noise, latent traits, and multiple     tracing models in terms of three attributes: how to fit their
influences on student performance. We use this framework to            parameters, how to predict students’ performance from their
characterize and compare previous student models, analyze their        estimated knowledge, and how to update those estimates based on
relative accuracy, and propose novel models suggested by gaps in       observed performance. We will use this factoring in Section 3.2.
the multi-dimensional space. To illustrate the generative power of
this framework, we derive one such model, called HOT-DINA              Section 2 introduces the proposed framework. Section 3 describes
(Higher Order Temporal, Deterministic Input, Noisy-And) and            HOT-DINA, a novel knowledge tracing method that the
evaluate it on synthetic and real data. We show it predicts student    framework inspired. Sections 4 and 5 evaluate HOT-DINA on
performance better than previous methods, when, and why.               synthetic and real data, respectively. Section 6 concludes.

Keywords                                                               2. A Unified 5-Dimensional Framework
                                                                       We characterize student models in terms of these five dimensions:
Knowledge tracing, Item Response Theory, temporal models,
higher order latent trait models, multiple subskills, DINA.            Temporal effect: skills time-invariant vs. time-varying.
                                                                          •   Static, e.g. IRT [5] and PFA [6]
1. Introduction                                                           •   2 or more fixed time points, e.g. at pre- and post-test
Morphological analysis [1] is a general method for exploring a            •   Dynamic, e.g. KT [2]
space of possible designs by identifying key attributes, specifying    Skill dimensionality: single skill vs. multiple skills at a step.
possible values for each attribute, and considering different
combinations of choices for the attributes. Structuring the space      Credit assignment: how credit (or blame) is allocated among
in this manner compares different designs in terms of which            influences on the observed success (or failure) of a step. Mostow
attribute values they share, and which ones differ. Characterizing     et al. [3] define a space of KT parameterizations. Corbett and
the space of existing designs in terms of these attributes exposes     Anderson [2] originally fit KT per skill. Pardos and Heffernan [7]
gaps in the space, suggesting novel combinations to explore.           individualized KT and fit parameters per student. Wang and
                                                                       Heffernan [8] simultaneously fit KT per student and per skill. In
Some prior work on student modeling has used this approach to          contrast, multiple-skills models require combination functions to
characterize spaces of possible knowledge tracing models.              assign credit or blame among the skills. Product KT [9] assigns
Knowledge tracing (KT) [2] generally has 4 or 5 parameters: the        full responsibility to each skill and multiplies the estimates.
probability slip of failing on a known skill; the probability guess    Conjunctive KT [10] assigns fair credit or blame to skills and
of succeeding on an unknown skill; the probability knew of             multiplies the estimates. Weakest KT [11] credits or blames the
knowing a skill before practicing it; the transition probability       weakest skill and takes the minimum of the estimates. LR-DBN
learn from not knowing the skill to knowing it; and sometimes the      [12] apportions credit or blame and performs logistic regression
transition probability forget from knowing the skill to not            over the estimates. We summarize credit assignment methods as:
knowing it, usually assumed to be zero.                                      •    Contingency table
Mostow et al. [3] defined a space of alternative parameterizations                o Per student
of a given KT model, based on whether they assigned each                          o Per skill
knowledge tracing parameter a single overall value, a distinct                    o Per <student, skill>
value for each individual student and/or skill, or different values               o Per student + per skill
for different categories of students and/or skills. Thus the number          •    Binary or probabilistic
of values to fit is 4 if using a single global value for each                     o Conjunctive (min)
parameter, but with separate probabilities for each <student, skill>              o Independent (product)
pair, the number of values to fit is 4 × # students × # skills. This              o Disjunctive (max)
work ordered the space of possible parameterizations of a single             •    Other
                                                                                  o Compensatory (+)
                                                                                  o Mixture (weighted average)
                                                                                  o Logistic regression (sigmoid)
                                                                       Higher order: treat static student properties as latent traits or not.
                                                                       We say IRT [5] models “higher order” effects because it estimates
                                                                       static student proficiencies independent of skill properties such as
                                                                       skill difficulty in 1PL (1 Parameter Logistic), skill discrimination
                                                                       in 2PL, and skill guess rate in 3PL. De la Torre [13] first
                                                                       combined IRT with static Cognitive Diagnosis Models such as
NIDA (Noisy Inputs, Deterministic And Gate) [14-16] and DINA                and DINO respectively add noise either before or after combining
(Deterministic Inputs, Noisy And Gate), and proposed higher                 estimates of multiple skills. We refer to these noise modeling
order latent trait models (HO-NIDA and HO-DINA). Xu and                     methods as:
Mostow [17] used IRT to estimate the probability of knowing a                    •    None
skill initially in a higher order knowledge tracing model (HO-KT).               •    Slip/Guess
Noise: how to represent errors in model, or discrepancies between                •    NIDO (noisy input, deterministic output)
what a student knows versus does. KT assumes students may                        •    DINO (deterministic input, noisy output)
guess a step correctly even though they don’t know its underlying           Table 1 summarizes student models in the proposed unified 5-
skill(s), or slip at a step even though they know its skill(s). Such        dimensional framework. Note that we only discuss known
“noise” is also characterized in other models, including single-            cognitive models (e.g. Q-matrix) in this paper, so we omit
skill KT variants such as PPS (Prior Per Student) [7] and SSM               methods that discover unknown cognitive models [18, 19].
(Student Skill Model) [8], and IRT models such as 3PL. NIDO
                                    Table 1. A unified 5-dimensional framework for student models
                                                     Temporal           Skill                Credit          Higher order
                    Student models                                                                                              Noise model
                                                       effect       dimensionality         assignment           effect
       IRT 1PL (Rasch model) [5]
                                                                                        Per student +                              None
       IRT 2PL (2 Parameter Logistic) [5]                              Single skill                           Latent trait
                                                                                          per skill
       IRT 3PL (3 Parameter Logistic) [5]                                                                                       Slip/Guess
       LLM (Linear Logistic Model) [16]
       LFA (Learning Factor Analysis) [20]                                                  Sigmoid                                None
       PFA (Performance Factor Analysis) [6]            Static                                               No latent trait
       NIDA [14-16]                                                                                                               NIDO
                                                                    Multiple skills         Product
       DINA [14-16]                                                                                                               DINO
       LLTM (Linear Logistic Test Model) [21]                                               Sigmoid                               None
       HO-NIDA [13]                                                                                           Latent trait        NIDO
                                                                                            Product
       HO-DINA [13]                                                                                                               DINO
       KT [2]                                                                               Per skill
       PPS (Prior Per Student) [7]                                                         Per student       No latent trait
       SSM (Student Skill Model) [8]                                                                                            Slip/Guess
                                                                       Single skill
                                                                                        Per student +
       HO-KT [17]
                                                                                          Per skill           Latent trait
       DIR (Dynamic IRT 1PL) [22]                                                                                                  None
       KT+NIDA [23]
       Product KT [9]                                 Dynamic                               Product
                                                                                                                                  NIDO
       CKT [10]
                                                                                                             No latent trait
       Weakest KT [11]                                                                     Minimum
                                                                    Multiple skills
       KT+DINA [23]                                                                         Product
                                                                                                                                  DINO
       LR-DBN [12]                                                                         Sigmoid
       HOT-NIDA [Section 3]                                                                                                       NIDO
                                                                                            Product           Latent trait
       HOT-DINA [Section 3]                                                                                                       DINO
                         Table 2. Comparative framework to train, predict and update multiple-skills models

         Student models                      Train                               Predict                               Update
                                                                                                           Update skills together. Bayes’
               CKT
                                                                        Multiply skill estimates.         equations assign responsibility.
           Product KT

           Weakest KT                Train skills separately.                                            Update skills separately, each with
         (Blame weakest,             Assign each skill full                                                     full responsibility.
            credit rest)                 responsibility.                   Minimum of skill
                                                                              estimates.
           Weakest KT
         (Update weakest
               skill)
                                                                                                           Update only the weakest skill.
           HOT-NIDA
           HOT-DINA                  Train skills together.
           [Section 3.2]             Assign each skill full             Multiply skill estimates.
                                        responsibility.                                                  Update skills together, each with
        KT+NIDA/DINA
                                                                                                                full responsibility.
                                 Train skills together. Logistic         Logistic regression on           Update skills together. Logistic
             LR-DBN
                               regression assigns responsibility.           skill estimates.             regression assigns responsibility.
Table 2 (adapted from [4]) expands Credit assignment in terms           knowledge of each individual skill by observing additional
of how to train, predict and update skills, e.g. to assign full         practice on the skill. It also models two attributes of the skills,
responsibility to every skill, blame the weakest skill and credit       difficulty and discriminability, which are assumed to be
the rest, update only the weakest skill, or use a logistic function.    constants that do not change over time.
The tables suggest transformations of models along the                  To incorporate DINA into HO-KT, we still model a hidden
dimensions in the framework. For example, Dynamic IRT [22]              binary state in each step to indicate whether a student knows the
varies student proficiency by time, transforming static IRT to          overall skill used in the step, denoted as ηnj(t) for student n with
dynamic. KT+NIDA/DINA [23] varies skill estimates by time,              skill j at time t. However, we also model a hidden binary state
transforming static NIDA/DINA to dynamic. HO-                           αnk(t) to indicate whether student n knows skill k at time t. Given
NIDA/DINA/KT          adds    latent    traits,   transforming          a matrix Q = {Qjk}, indicating whether the overall skill j
NIDA/DINA/KT to higher order. LLM [16] and LLTM [21]                    requires skill k, we conjoin the skills as follows:
change the combination function, transforming conjunctive
models to logistic models. In Section 3 we generate a novel             \eta_{nj}^{(t)} = \prod_{k=1}^{K} \left( \alpha_{nk}^{(t)} \right)^{Q_{jk}}
student model by transforming HO-KT to a multi-skill model.
3. A Higher-Order Temporal Student Model                                       Equation 2. Conjunction of skills in HOT-DINA
to Trace Multiple Skills: HOT-DINA                                      This formula gives us the DINA (Deterministic Input, Noisy-
Xu and Mostow [17] extended the static IRT model into HO-KT             And gate) structure [15], with the conjunction as the “and” gate
(Higher Order Knowledge Tracing), which accounts for skill-             and guess and slip as the noise. Thus by combining HO-KT with
specific learning by using the static IRT model to estimate the         DINA, we obtain the HOT-DINA higher order temporal model
probability Pr(knew) of knowing a skill before practicing it. By        to trace multiple skills. Figure 1 shows how the plate diagram
generalizing to steps that require conjunctions of multiple skills,     for HOT-DINA integrates IRT, KT, and DINA.
we arrive at a combined model we call HOT-DINA (Higher
Order Temporal, Deterministic Input, Noisy-And). Note we can
transform it into HOT-NIDA simply by changing its noise type.
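The Deterministic Input, Noisy-And idea can be illustrated with a
minimal sketch. It assumes binary skill states and one binary row of
a Q matrix saying which skills a step requires. The function names
(conjoin, p_correct) and the example values are illustrative
assumptions, not taken from the paper.

```python
from typing import Sequence


def conjoin(alpha: Sequence[int], q_row: Sequence[int]) -> int:
    """Deterministic 'and' gate: 1 iff every skill required by q_row is known."""
    eta = 1
    for a_k, q_jk in zip(alpha, q_row):
        # A skill that is not required (q_jk = 0) contributes a factor of 1,
        # because x ** 0 == 1 in Python even when x == 0.
        eta *= a_k ** q_jk
    return eta


def p_correct(eta: int, guess: float, slip: float) -> float:
    """Noisy output: guess if some required skill is unknown, else 1 - slip."""
    return (1.0 - slip) if eta == 1 else guess


# Hypothetical student who knows skills 1, 3 and 4 but not skill 2.
alpha = [1, 0, 1, 1]
print(conjoin(alpha, [1, 0, 1, 0]))  # step needs skills 1 and 3
print(conjoin(alpha, [1, 1, 0, 0]))  # step needs skills 1 and 2
```

Putting the noise after the conjunction is what makes this a DINA-style
sketch; a NIDA-style variant would instead apply noise to each skill
before combining them.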
3.1 HOT-DINA = IRT + KT + DINA
Let {Y(0), Y(1) , …, Y(t), …} denote a sequential dataset recorded
by an intelligent tutoring system, where Ynj(t) = 1 iff student n
correctly performs a step that requires skill j at time t. KT is a
Hidden Markov Model (HMM) that models a binary hidden
state K(t) indicating if the student knows the skill at time t. The
probability of knowing the skill is knew at time t = 0, and then
changes based on the student’s observed performance on the
skill, according to the standard KT parameters slip, guess, learn,
and forget (usually set to zero).
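As an illustration of how the four KT parameters interact, here is a
minimal sketch of one step of standard knowledge tracing for a single
skill. The function names and the numeric values are illustrative
assumptions, and all probabilities are assumed to lie strictly
between 0 and 1 so that the denominator is never zero.

```python
def kt_predict(p_know: float, guess: float, slip: float) -> float:
    """Probability of a correct response given the current knowledge estimate."""
    return p_know * (1.0 - slip) + (1.0 - p_know) * guess


def kt_update(p_know: float, correct: bool, guess: float, slip: float,
              learn: float, forget: float = 0.0) -> float:
    """Bayes update on the observed response, then the learn/forget transition."""
    if correct:
        num = p_know * (1.0 - slip)
        den = num + (1.0 - p_know) * guess
    else:
        num = p_know * slip
        den = num + (1.0 - p_know) * (1.0 - guess)
    posterior = num / den
    # Hidden-state transition between this step and the next one.
    return posterior * (1.0 - forget) + (1.0 - posterior) * learn


# Hypothetical parameter values, chosen only for illustration.
p = 0.5  # knew
for observed in [True, True, False]:
    print(round(kt_predict(p, guess=0.2, slip=0.1), 3))
    p = kt_update(p, observed, guess=0.2, slip=0.1, learn=0.3)
```

With forget left at its default of zero, the estimate can drop after an
incorrect response only through the Bayes step, never through the
transition.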
KT can fit these four parameters (taking forget = 0) for each
<student, skill> pair, but the resulting large number of values to
fit is likely to cause over-fitting. Thus, Corbett and Anderson [2]
originally proposed to estimate knew per student, and learn,
guess and slip per skill. IRT assumes a latent trait that represents
a student’s underlying proficiency in all the skills. For example,
the Two-Parameter Logistic (2PL) IRT model assumes that the
probability of a student’s correct response is a logistic function
of a unidimensional student proficiency θ with two skill-specific
parameters: discriminability a and difficulty b (see Equation 1).
            P(Y = 1) = \frac{1}{1 + \exp(-1.7\,a(\theta - b))}
        Equation 1. The logistic function of 2PL model
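Equation 1 can be evaluated directly. The sketch below is a minimal
illustration, with a hypothetical function name and made-up parameter
values; it does not guard against overflow for extreme inputs.

```python
import math


def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """2PL probability of a correct response, with the 1.7 scaling constant."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))


# Illustrative values: one skill with discriminability 1.2 and difficulty 0.5.
for theta in (-1.0, 0.5, 2.0):
    print(theta, round(p_correct_2pl(theta, a=1.2, b=0.5), 3))
```

When the proficiency equals the difficulty (theta = b), the exponent is
zero and the probability is exactly 0.5, whatever the discriminability.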
The two skill parameters determine the shape of the IRT curve.
As a student’s proficiency increases beyond the skill difficulty,
the student’s chance of performing correctly surpasses 50%. The
skill discriminability reflects how fast the logit (log odds)
increases or decreases when the proficiency changes. Thus IRT
fits parameters individually on each dimension, without losing
the information from the other. HO-KT uses 2PL to estimate
knew in KT, by fitting student specific proficiency θn, skill
discriminability aj and skill difficulty bj. It then uses KT to trace       Figure 1. Graphical representation of Higher-Order
each skill, by fitting skill-specific learnj, guessj and slipj. Thus,       Temporal DINA (HOT-DINA) to trace multiple skills
HO-KT models students’ initial overall knowledge before they
practice any skills; then it updates its estimates of students’
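Equation 3 shows the formula for using 2PL to estimate the
probability knew of a student knowing a skill at time t = 0:

        P(knew_{nk}) = P(\alpha_{nk}^{(0)} = 1)
                     = \frac{1}{1 + \exp(-1.7\,a_k(\theta_n - b_k))}

        Equation 3. 2PL to estimate knew in HOT-DINA

Equation 4 shows the formula for tracing the skills with skill-
specific learn and zero forget:

        P(\alpha_{nk}^{(t)} = 1 \mid \alpha_{nk}^{(t-1)} = 0) = learn_k
        P(\alpha_{nk}^{(t)} = 0 \mid \alpha_{nk}^{(t-1)} = 1) = forget_k = 0

      Equation 4. Knowledge tracing of skills in HOT-DINA

Equation 5 shows the likelihood of a student’s performance
given the hidden state η(t) and the skill-specific guess and slip:

        L(Y_{nj}^{(t)} = 1 \mid \eta_{nj}^{(t)}) = guess_j^{1 - \eta_{nj}^{(t)}} \times (1 - slip_j)^{\eta_{nj}^{(t)}}
        L(Y_{nj}^{(t)} = 0 \mid \eta_{nj}^{(t)}) = (1 - guess_j)^{1 - \eta_{nj}^{(t)}} \times slip_j^{\eta_{nj}^{(t)}}

                 Equation 5. Likelihood in HOT-DINA
                                                                            Given η as a conjunction of α, the likelihood of Y given η, the
                                                                            conditional independence of α(0) given θ, and of α(t) given α(t-1),
                                                                            the posterior distribution of θ, a, b, α, η, learn (l), guess (g) and
                                                                            slip (s) given Y is

                                                                            P(\boldsymbol{\theta}, \mathbf{a}, \mathbf{b}, \boldsymbol{\alpha}, \boldsymbol{\eta}, \mathbf{l}, \mathbf{g}, \mathbf{s} \mid \mathbf{Y})
                                                                              \propto L(\mathbf{Y} \mid \mathbf{g}, \mathbf{s}, \boldsymbol{\eta}, \boldsymbol{\alpha}) \, P(\boldsymbol{\alpha}^{(0)} \mid \boldsymbol{\theta}, \mathbf{a}, \mathbf{b})
                                                                              \left( \prod_{t=1}^{T} P(\boldsymbol{\alpha}^{(t)} \mid \boldsymbol{\alpha}^{(t-1)}, \mathbf{l}) \right)
                                                                              P(\boldsymbol{\theta}) P(\mathbf{a}) P(\mathbf{b}) P(\mathbf{l}) P(\mathbf{g}) P(\mathbf{s})

                                                                            3.2.2 Predicting student performance
                                                                            For inference, we introduce uncertainty to ηnj, and rewrite
                                                                            Equation 2 as follows:

                                                                            P(\eta_{nj}^{(0)} = 1) = \prod_{k=1}^{K} \left( \frac{1}{1 + \exp(-1.7\,a_k(\theta_n - b_k))} \right)^{Q_{jk}}
                                                                            P(\eta_{nj}^{(t)} = 1) = \prod_{k=1}^{K} \left( P(\alpha_{nk}^{(t)} = 1) \right)^{Q_{jk}}   for t = 1, 2, 3, …

                                                                                    Equation 6. Conjunction of skills in HOT-DINA inference

                                                                            Then we predict student performance by using Equation 7:

                                                                            P(Y_{nj}^{(t)} = 1) = (1 - slip_j) \, P(\eta_{nj}^{(t)} = 1) + guess_j \left( 1 - P(\eta_{nj}^{(t)} = 1) \right)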
3.2 How to Train, Predict, and Update
Following the organization of Table 2, Section 3.2.1 details how                                    Equation 7. Prediction in HOT-DINA
HOT-DINA trains the skills together and assigns each skill full             3.2.3 Updating estimated skills
responsibility; Section 3.2.2 specifies how HOT-DINA predicts               We update the estimates of latent states η and α after observing
student performance by using a product of skill estimates; and              actual student performance. The estimate of knowing a skill or a
Section 3.2.3 shows how HOT-DINA updates the weakest skill.                 subskill should increase if the student performed correctly at the
3.2.1 Training the model with MCMC                                          step. It is easy to update a skill by using Bayes’ rule, as shown in
We estimate the parameters of HOT-DINA using Markov Chain                   Equation 8. The posterior P(ηnj(t) = 1|Ynj(t) = 1) should be higher
Monte Carlo (MCMC) methods, which require that we specify                   than P(ηnj(t) = 1) if and only if (1-slipj) > guessj.
the prior distributions and constraints for every parameter. We
assume that student general proficiency θn is normally
distributed with mean 0 and standard deviation 1. The skill
discrimination ak is positive and uniformly distributed between 0
and 2.5, while the skill difficulty bk is also normally distributed
with mean 0 and standard deviation 1. Learn has prior Beta
(1,1), whereas guess and slip have uniform priors from 0 to 0.4.

Thus, the priors on each parameter are:

        \theta_n \sim \mathrm{Normal}(0, 1)
        b_k \sim \mathrm{Normal}(0, 1)
        a_k \sim \mathrm{Uniform}(0, 2.5)
        learn_k \sim \mathrm{Beta}(1, 1)
        guess_j \sim \mathrm{Uniform}(0, 0.4)
        slip_j \sim \mathrm{Uniform}(0, 0.4)

We use the following conditional distributions for each node:

        \alpha_{nk}^{(0)} \mid \theta_n \sim \mathrm{Bernoulli}(\{1 + \exp(-1.7\,a_k(\theta_n - b_k))\}^{-1})
        \alpha_{nk}^{(t)} \mid \alpha_{nk}^{(t-1)} = 0 \sim \mathrm{Bernoulli}(learn_k)
        \alpha_{nk}^{(t)} \mid \alpha_{nk}^{(t-1)} = 1 \sim \mathrm{Bernoulli}(1)
        Y_{nj}^{(t)} \mid \eta_{nj}^{(t)} = 0 \sim \mathrm{Bernoulli}(guess_j)
        Y_{nj}^{(t)} \mid \eta_{nj}^{(t)} = 1 \sim \mathrm{Bernoulli}(1 - slip_j)
                                                                            P(\eta_{nj}^{(t)} = 1 \mid Y_{nj}^{(t)} = 1)
                                                                              = \frac{P(Y_{nj}^{(t)} = 1 \mid \eta_{nj}^{(t)} = 1) \, P(\eta_{nj}^{(t)} = 1)}{P(Y_{nj}^{(t)} = 1)}
                                                                              = \frac{(1 - slip_j) \, P(\eta_{nj}^{(t)} = 1)}{(1 - slip_j) \, P(\eta_{nj}^{(t)} = 1) + guess_j \left( 1 - P(\eta_{nj}^{(t)} = 1) \right)}

                                                                                    Equation 8. Bayes’ rule to update η in HOT-DINA
                                                                            Although we could update HOT-DINA by assigning full
                                                                            responsibility to each skill, it would be interesting to update the
                                                                            weakest (that is, hardest) skill, since HOT-DINA fits the
                                                                            parameter ‘difficulty’ for each skill. Thus, we update the skill
                                                                            that is the hardest among all the required skills in a step:

                                                                            P(\eta_{nj}^{(t)} = 1 \mid Y_{nj}^{(t)} = 1)
                                                                              = P(\alpha_{nk'}^{(t)} = 1 \mid Y_{nj}^{(t)} = 1) \prod_{k \neq k'} P(\alpha_{nk}^{(t)} = 1)
                                                                            for k' = \arg\max_{k: Q_{jk} = 1} b_k.

                                                                                    Equation 9. Update the hardest skill in HOT-DINA
                                                                            In short, we extend HO-KT to the HOT-DINA higher order
                                                                            temporal model, which traces multiple skills. We use the
                                                                            MCMC algorithm to estimate the parameters, and update the
                                                                            estimates of a student knowing a skill given observed student
                                                                            performance. How well does the HOT-DINA model work? To
evaluate it, we performed a simulation study. Section 4 now           standard deviation (s.d.) to assess the accuracy of the posterior
describes the study and reports its results.                          estimates for each parameter. MC error, which is an estimate of
                                                                      the difference between the estimated posterior mean (i.e. the
4. Simulation Study
To study the behavior of HOT-DINA, we generated synthetic training data for it according to
the priors and conditional distributions defined in Section 3.2.1. Section 4.1 describes the
synthetic data. One purpose of this experiment was to test how accurately MCMC can recover
the parameters of HOT-DINA, as Section 4.2 reports. It is important not only to test how well
a method works, but to analyze when and why. Thus another purpose was to determine how many
students and observations are needed to estimate the difficulty and discriminability of a
given number of skills, as Section 4.3 explains.

4.1 Synthetic Data
We use the following procedure to generate the synthetic data, with all the variables as
defined in Section 3.2:

1.    We chose K = 4 and J = 14, which results in a 14 × 4 Q matrix. The Q matrix, whose
      transpose is shown below, indicates that we generate the 14 skills as all combinations
      of one, two, or three of the 4 subskills.

            1 0 0 0 1 1 1 0 0 0 1 1 1 0
      Q^T = 0 1 0 0 1 0 0 1 1 0 1 0 1 1
            0 0 1 0 0 1 0 1 0 1 1 1 0 1
            0 0 0 1 0 0 1 0 1 1 0 1 1 1

2.    We randomly generated θn from Normal(0, 1) for n = 1, ..., N.

3.    We chose a, b and l (the learning rate, learn) as shown in Table 3.

      Table 3. True value of skill-specific discrimination, difficulty
             and learning rate in synthetic data simulation
                 k         1      2       3       4
                 a        1.5    1.2     1.9     1.0
                 b       -0.95   1.42   -0.66    0.50
               learn      0.8    0.6     0.5     0.3

4.    We randomly generated g and 1-s from Unif(0, 0.4) and Unif(0.6, 1) respectively, as
      shown in Table 4.

      Table 4. True value of skill-specific guess and not slip
             parameters in synthetic data simulation
        j       1        2      3      4      5       6      7
      guess    0.35     0.40   0.13   0.15   0.29    0.39   0.10
      1-slip   0.67     0.66   0.67   0.90   0.65    0.60   0.61
        j       8        9      10     11     12      13     14
      guess    0.40     0.15   0.16   0.38   0.11    0.26   0.35
      1-slip   0.81     0.74   0.76   0.73   0.83    0.89   0.85

5.    We chose N = 100 and T = 100, randomly picked one skill at each step, and simulated
      sequential data of size N × T = 10,000.

4.2 Results
We used OpenBUGS [24] to implement the MCMC algorithm of HOT-DINA. We chose 5 chains starting
at different initial points. We monitored the estimates of skill discrimination a and
difficulty b to check their convergence, which we judged to be reached when all the chains
overlapped each other. Accordingly, we ran the simulation for 10,000 iterations with a
burn-in of 3000.

Table 5 reports the sample means and their 95% confidence intervals for the parameter
estimates of a, b and learn respectively. We also report the Monte Carlo error (MC error)
and sample [...] sample mean) and the true posterior mean, should be less than 5% of the s.d.
in order to obtain an accurate posterior estimate.

Table 5. Estimates of skill-specific discrimination, difficulty,
    and learning rate (N = 100, T = 100, K = 4, J = 14)
  k     true a      sample mean of a (95% C.I.)        s.d.       MC_error
  1       1.50        1.33 (0.36, 2.43)                0.65       0.03216
  2       1.20        1.23 (0.12, 2.43)                0.72       0.03561
  3       1.90        1.85 (0.22, 2.73)                0.64       0.03146
  4       1.00        0.98 (0.19, 2.12)                0.58       0.02870
  k     true b      sample mean of b (95% C.I.)        s.d.       MC_error
  1      -0.95       -0.95 (-2.15, -0.04)              0.50       0.02339
  2       1.42        1.51 (0.90, 2.21)                0.45       0.01936
  3      -0.66       -0.69 (-1.81, -0.63)              0.42       0.01990
  4       0.5         0.5 (0.05, 1.18)                 0.38       0.01691
  k     true learn  sample mean of learn (95% C.I.)    s.d.       MC_error
  1       0.8         0.81 (0.48, 0.99)                0.13       0.006599
  2       0.6         0.60 (0.52, 0.70)                0.05       0.002132
  3       0.5         0.57 (0.38, 0.84)                0.11       0.005432
  4       0.3         0.29 (0.25, 0.33)                0.02       7.79E-04

We calculated the Root Mean Squared Error (RMSE) of the estimates of the continuous variables
guess, 1-slip, and θ. We report the accuracy of recovering the true value of the latent
binary variable α in Table 6.

Table 6. Estimation RMSE of skill-specific guess, not slip, and student-specific proficiency;
prediction accuracy of a student mastering a subskill (N = 100, T = 100, K = 4, J = 14)
                   guess         1-slip          θ
  RMSE            0.0103        0.0196        0.9183
                                   α
  Accuracy                      99.38%

From the results, we can see that the MCMC algorithm accurately recovered the parameters we
used in generating the synthetic data for HOT-DINA. In addition to seeing how accurately it
can estimate the parameters, we are also interested in finding out how many observations
would be sufficient for the training algorithm to recover the hidden variables. Therefore,
we conducted the study we now describe in Section 4.3.

4.3 Study Design
HOT-DINA requires data from enough students to rate the difficulty and discriminability of
each skill, and data on enough skills to estimate the proficiency of each student. So we
fixed the number of skills at K = 4, and varied the number of students N or the number of
steps observed from each student T, to discover how many observations would be sufficient to
estimate the parameters. In particular, we evaluated each model on how accurately it
estimated the latent binary state α, which indicates if a student masters a skill. We
generated the data by using the same parameters as in Section 4.1. Besides the general
HOT-DINA model that accounts for multiple skills, we also studied the single-skill model by
shrinking the number of skills J to equal K and setting Q to an identity matrix. This
specializes the HOT-DINA model to an HO-KT model.
We increased N, the number of students, from 10 to 1000, and T, the number of observations
per student, from 5 to 100. Table
7 and Table 8 respectively show the accuracy of estimating the latent state α in HO-KT and
HOT-DINA. Both tables show a trend of increasing accuracy when N or T increases (though at
the cost of longer training time, roughly O(N²×T)).

  Table 7. Accuracy of estimating the latent binary states α
             with different N and T (K = J = 4)
     N \ T        5         10        20        50        100
       10       71.01%    80.81%    83.01%    93.11%    96.16%
       20       72.32%    82.74%    86.52%    94.06%    97.33%
       50       73.58%    83.79%    87.34%    95.27%    98.90%
      100       77.55%    84.43%    88.08%    95.81%    99.41%
      200       76.52%    84.02%    89.48%    97.26%      NA
      500       78.13%    84.34%    92.50%      NA        NA
     1000       80.10%    84.59%      NA        NA        NA

Due to the lack of sampling ability of OpenBUGS for high dimensional dynamic models, we have
no available scores to show for N×T bigger than 10,000. We can see that the multiple skill
model generally predicts better than the single-skill model because the average number of
observations per skill in the former is larger than in the latter. As observed in both
tables, it is more efficient to increase T than N to get a better estimate. Both of the
models reach the best prediction accuracy score (> 99%) when N = 100 and T = 100. In order to
obtain an accuracy > 90% for K = 4 skills, the least amount of data we need for HO-KT is
N = 10 with T ≈ 50 observations, as shown in Table 7, and for HOT-DINA is N = 10 with T ≈ 20
observations, as shown in Table 8.

  Table 8. Accuracy of estimating the latent binary states α
            with different N and T (K = 4, J = 14)
     N \ T        5         10        20        50        100
       10       72.07%    75.57%    91.14%    96.90%    98.10%
       20       74.32%    83.60%    91.56%    97.46%    98.53%
       50       76.55%    84.71%    92.62%    97.52%    98.98%
      100       77.80%    86.82%    93.83%    97.67%    99.82%
      200       79.92%    88.78%    94.26%    99.41%      NA
      500       82.15%    89.95%    98.61%      NA        NA
     1000       83.58%    92.34%      NA        NA        NA

Next we apply the proposed model to real data logged by an algebra tutor. We evaluate the
model fit and compare it against two baselines.

5. Evaluation on Real Data
We apply HOT-DINA to a real dataset from the Algebra Cognitive Tutor® [25]. Because of
limited time, we chose a subset of the data by excluding the “isolated” algebra tutor steps.
An “isolated” step here means a step that requires one skill all its own. We grouped the
remaining steps that require the same multiple skills into one skill, resulting in J = 15
distinct skills that require K = 12 subskills. Following the study design in Section 4.3, we
randomly chose N = 50 students with T = 100 in order to obtain enough data for the MCMC
estimation.

  Table 9. Data split of the Algebra Tutor data: training on I
                and IV, and testing on II and III
                           Skill group A     Skill group B
     Student group A             I                II
     Student group B            III               IV

We split the 50 students into two groups of 25, and split the 15 skills into two groups of 8
and 7. As shown in Table 9, we combined data from I (student-group-A practicing on
skill-group-A) and IV (student-group-B practicing on skill-group-B) to obtain the training
data. Accordingly, we combined the data from II and III to obtain the test data. As a benefit
of the data split, we are able to test the models on unseen students for the same group of
skills, and also test on the unseen skills for the same group of students.

We compared HOT-DINA with the conjunctive minimum KT model [11] since it showed the best
prediction accuracy among all the previous KT based methods [4]. It fits KT parameters by
blaming each skill that is required at a step, predicts student’s performance by the weakest
skill, and updates only the weakest skill. Accordingly, we updated the most difficult skill
in HOT-DINA as discussed in Section 3.2.3. As two baseline models, we fit per-skill KT and
per-student KT. Comparing HOT-DINA with these two baselines also allows us to discuss some
more interesting research questions later in this section.
Table 10 and Table 11 respectively show the models’ prediction accuracy and log-likelihood on
the test data. We report the majority class because of the unbalanced data. HOT-DINA beat the
two baselines in overall accuracy of predicting the student performance, and also obtained
the maximum log-likelihood on the test data. The per-student KT model obtained the worst
scores on both measures. It predicted student performance almost as poorly as majority class
because it misclassified almost all the data in the minority class.

  Table 10. Comparison of prediction accuracy on real test data
                        Overall      Accuracy on      Accuracy on
                        Accuracy     Correct Steps    Incorrect Steps
     HOT-DINA            82.48%         96.63%           27.27%
     Per-skill KT        80.87%         94.02%           29.60%
     Per-student KT      79.63%         99.74%            1.20%
     Majority class      79.60%        100.00%            0.00%

  Table 11. Comparison of log-likelihood on real test data
                        Log-likelihood
     HOT-DINA             -2021.04
     Per-skill KT         -2075.67
     Per-student KT       -2464.74

We are also interested in three other hypotheses comparing HOT-DINA with KT. We describe
them, test them, and show the results as follows.

1.     HOT-DINA should predict early steps more accurately than KT since its estimate of knew
       reflects both skill difficulty and student proficiency, not just one or the other. In
       fact HOT-DINA beat KT throughout, as Figure 2 shows.
[Figure 2: accuracy (y-axis, 70% to 95%) by test step number (x-axis, 1 to 150) for
Per-student KT, Per-subskill KT, HOT-DINA, and Majority class]
      Figure 2. Accuracy on student’s 1st, 2nd, 3rd, … test steps

2.       HOT-DINA should beat KT on sparsely trained skills thanks to student proficiency
         estimates based on other skills. As Figure 3 shows, HOT-DINA tied or beat KT
         throughout.

[Figure 3: accuracy (y-axis, 0% to 100%) per skill, with skills labeled by amount of training
data (21 to 1864), for Per-subskill KT, HOT-DINA, and Majority class]
           Figure 3. Skills sorted by amount of training data

3.       HOT-DINA should beat KT on sparsely trained students thanks to skill difficulty and
         discriminability estimates based on other students. As Figure 4 shows, HOT-DINA
         beat KT throughout.

[Figure 4: accuracy (y-axis, 60% to 100%) per student, with students labeled by amount of
training data (16 to 130), for Per-student KT, HOT-DINA, and Majority class]
        Figure 4. Students sorted by amount of training data

Thus, HOT-DINA outperformed the two baselines in model fit. It also beat them as specified by
the three hypotheses above.

6. Contributions, limitations, future work
In this paper we make several contributions. We defined a 5-dimensional framework for student
models. We showed how numerous student models fit into it. We described the new combination
of IRT, KT, and DINA it suggests in the form of HOT-DINA. We specified how to train HOT-DINA
by using MCMC, how to test it by predicting student performance, and how to update estimated
skills based on observed performance.

HOT-DINA uses IRT to estimate knew based on student proficiency and skill difficulty. Thus it
does not need training data on every <student, skill> pair, since it can estimate student
proficiency based on other skills, and skill difficulty and discriminability based on other
students. Likewise, it should estimate knew more accurately than KT for skills and students
with sparse training data. HOT-DINA uses KT to model learning over time, and DINA to model
combination of multiple skills underlying observed steps (unlike conventional KT and with
fewer parameters than CKT [10] or LR-DBN [12]).

Tracing multiple skills underlying an observed step requires allocating responsibility among
them for its success or failure. DINA simply conjoins them, a common method but inferior to
others. Future work includes using the best known method [4], which we didn’t use here
because the logistic regression it performs is non-trivial to integrate with MCMC.

We evaluated HOT-DINA on synthetic and real data, not only showing that it predicts student
performance better than previous methods, but analyzing when and why.

We reported a simulation study to test if training could recover model parameters, and to
determine the amount of data needed. HOT-DINA requires data on enough students and skills to
estimate their proficiency and difficulty, respectively. We explored how its accuracy varies
with the number of test steps and the amount of training data per student and per skill.
These analyses were correlational, based on variations that happened to occur in the training
data. Future work should invest in the computation required to vary the amount of training
data to establish its true causal effect on accuracy.

Evaluation on real data from an algebra tutor showed that HOT-DINA achieved higher predictive
accuracy and log likelihood than KT with parameters fit per student or per skill. This
evaluation was limited to a single data set and two baselines (not counting majority class).
Future work should compare HOT-DINA to other methods, notably the Student Skill model [8],
which is similar in spirit, and on data from other tutors.

We assumed that student proficiency is one-dimensional. Future work can test if k dimensions
capture enough additional variance to make it worthwhile to fit k times as many parameters.

Finally, our choice of 5 dimensions is useful but limiting. Additional dimensions may provide
useful finer-grained insights into the models covered by the current framework, and expand it
to encompass other types of student models, e.g. where the cognitive model is unknown and
must be discovered [18, 19].

ACKNOWLEDGMENTS
This work was supported in part by the National Science Foundation through Grants 1124240 and
1121873 to Carnegie Mellon University. The opinions expressed are those of the authors and do
not necessarily represent the views of the National Science Foundation or U.S. government. We
thank Ken Koedinger for his algebra tutor data.
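
ILLUSTRATIVE SKETCH (SECTION 4.1)
The synthetic-data procedure of Section 4.1 can be sketched in Python as below. This is a
minimal illustration under stated assumptions, not the authors' OpenBUGS implementation. The
values of a, b and learn come from Table 3, and guess and 1-slip are redrawn from the stated
uniform ranges rather than copied from Table 4. Two parts are assumptions because Section 3.2
is not reproduced here: the logistic (2PL-style) form used for initial mastery, and the rule
that each required, unmastered subskill is learned with probability learn after a step, with
no forgetting. The function names build_q and simulate are hypothetical.

```python
import itertools
import math
import random

K = 4  # number of subskills


def build_q(k=K, max_size=3):
    """Rows of Q: every combination of 1..max_size of the k subskills (14 rows for k=4)."""
    rows = []
    for size in range(1, max_size + 1):
        for combo in itertools.combinations(range(k), size):
            rows.append([1 if i in combo else 0 for i in range(k)])
    return rows


def simulate(n_students=100, n_steps=100, seed=0):
    """Generate (student, step, skill, correct) tuples following steps 1-5 of Section 4.1."""
    rng = random.Random(seed)
    q = build_q()                                   # step 1
    a = [1.5, 1.2, 1.9, 1.0]                        # step 3, Table 3: discrimination
    b = [-0.95, 1.42, -0.66, 0.50]                  # step 3, Table 3: difficulty
    learn = [0.8, 0.6, 0.5, 0.3]                    # step 3, Table 3: learning rate
    guess = [rng.uniform(0.0, 0.4) for _ in q]      # step 4
    not_slip = [rng.uniform(0.6, 1.0) for _ in q]   # step 4
    data = []
    for n in range(n_students):
        theta = rng.gauss(0.0, 1.0)                 # step 2: proficiency ~ Normal(0, 1)
        # ASSUMPTION: initial mastery of subskill k is logistic in a_k * (theta - b_k).
        alpha = [
            rng.random() < 1.0 / (1.0 + math.exp(-a[k] * (theta - b[k])))
            for k in range(K)
        ]
        for t in range(n_steps):
            j = rng.randrange(len(q))               # step 5: one random skill per step
            # DINA conjunction: the step is "known" only if all required subskills are.
            eta = all(alpha[k] for k in range(K) if q[j][k])
            p_correct = not_slip[j] if eta else guess[j]
            data.append((n, t, j, int(rng.random() < p_correct)))
            # ASSUMPTION: required, unmastered subskills are learned w.p. learn[k].
            for k in range(K):
                if q[j][k] and not alpha[k] and rng.random() < learn[k]:
                    alpha[k] = True
    return q, data
```

Calling simulate() with the defaults yields N × T = 10,000 observations over the 14 skills,
matching the size stated in step 5; the order of the rows of Q may differ from the column
order of the matrix printed in Section 4.1.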
REFERENCES                                                         [13] de la Torre, J. and J.A. Douglas. Higher-order latent trait
                                                                   models for cognitive diagnosis. Psychometrika 2004. 69(3): p.
[1] Zwicky, F. Discovery, Invention, Research - Through the        333-353.
Morphological Approach. 1969, Toronto: The Macmillan
Company.                                                           [14] Junker, B. and K. Sijtsma. Cognitive assessment models
                                                                   with few assumptions, and connections with nonparametric item
[2] Corbett, A. and J. Anderson. Knowledge tracing: Modeling       response theory. Applied Psychological Measurement, 2001.
the acquisition of procedural knowledge. User modeling and         25(3): p. 258-272.
user-adapted interaction, 1995. 4: p. 253-278.
                                                                   [15] de la Torre, J. DINA Model and Parameter Estimation: A
[3] Mostow, J., Y. Xu, and M. Munna. Desperately Seeking           Didactic. Journal of Educational and Behavioral Statistics, 2009.
Subscripts:   Towards Automated Model Parameterization.            34(1): p. 115-130.
Proceedings of the 4th International Conference on Educational
Data Mining, 283-287. 2011. Eindhoven, Netherlands.                [16] Maris, E. Estimating multiple classification latent class
                                                                   models. Psychometrika, 1999. 64(2): p. 197–212.
[4] Xu, Y. and J. Mostow. Comparison of methods to trace
multiple subskills: Is LR-DBN best? [Best Student Paper            [17] Xu, Y. and J. Mostow. Using item response theory to
Award]. Proceedings of the Fifth International Conference on       refine knowledge tracing. In Proceedings of the 6th
Educational Data Mining, 41-48. 2012. Chania, Crete, Greece.       International Conference on Educational Data Mining, S.K.
                                                                   D’Mello, R.A. Calvo, and A. Olney, Editors. 2013, International
[5] Hambleton, R.K., H. Swaminathan, and H.J. Rogers.              Educational Data Mining Society: Memphis, TN, p. 356-357.
Fundamentals of Item Response Theory. Measurement Methods
for the Social Science. 1991, Newbury Park, CA: Sage Press.        [18] González-Brenes, J.P. and J. Mostow. What and when do
                                                                   students learn? Fully data-driven joint estimation of cognitive
[6] Pavlik Jr., P.I., H. Cen, and K.R. Koedinger. Performance      and student models. In Proceedings of the 6th International
factors analysis - a new alternative to knowledge tracing.         Conference on Educational Data Mining, S.K. D’Mello, R.A.
Proceedings of the 14th International Conference on Artificial     Calvo, and A. Olney, Editors. 2013, International Educational
Intelligence in Education (AIED09), 531-538. 2009.                 Data Mining Society: Memphis, TN, p. 236-239.

[7] Pardos, Z. and N. Heffernan. Modeling individualization in     [19] González-Brenes, J.P. and J. Mostow. Dynamic cognitive
a Bayesian networks implementation of knowledge tracing.           tracing: towards unified discovery of student and cognitive
Proceedings of the 18th International Conference on User           models. Proceedings of the Fifth International Conference on
Modeling, Adaptation and Personalization, 255-266. 2010. Big       Educational Data Mining. 2012. Chania, Crete, Greece.
Island, Hawaii.
                                                                   [20] Cen, H., K. Koedinger, and B. Junker. Learning factors
[8] Wang, Y. and N.T. Heffernan. The student skill model.          analysis – a general method for cognitive model evaluation and
Intelligent Tutoring Systems - 11th International Conference,      improvement. Proceedings of the 8th International Conference
399-404. 2012. Chania, Crete, Greece. Springer.                    on Intelligent Tutoring Systems, 164-175. 2006. Jhongli,
                                                                   Taiwan.
[9] Cen, H., K.R. Koedinger, and B. Junker. Comparing Two
IRT Models for Conjunctive Skills. Ninth International             [21] Fischer, G.H. The linear logistic test model. In G.H.
Conference on Intelligent Tutoring Systems, 796-798. 2008.         Fischer and I.W. Molenaar, Editors, Rasch Models:
Montreal.                                                          Foundations, Recent Developments, and Applications, 131-155.
                                                                   Springer: New York, 1995.
[10] Koedinger, K.R., P.I. Pavlik, J. Stamper, T. Nixon, and S.
Ritter. Avoiding problem selection thrashing with conjunctive      [22] Wang, X., J.O. Berger, and D.S. Burdick. Bayesian
knowledge tracing. In Proceedings of the 4th International         analysis of dynamic item response models in educational testing.
Conference on Educational Data Mining. 2011: Eindhoven, NL,        Annals of Applied Statistics, 2013. 7(1): p. 126-153.
p. 91-100.
                                                                   [23] Studer, C. Incorporating Learning Over Time into the
[11] Gong, Y., J.E. Beck, and N.T. Heffernan. Comparing            Cognitive Assessment Framework. Unpublished PhD, Carnegie
knowledge tracing and performance factor analysis by using         Mellon University, Pittsburgh, PA, 2012.
multiple model fitting procedures. Proceedings of the 10th
International Conference on Intelligent Tutoring Systems, 35-44.   [24] Lunn, D., D. Spiegelhalter, A. Thomas, and N. Best. The
2010. Pittsburgh, PA. Springer Berlin / Heidelberg.                BUGS project: Evolution, critique and future directions.
                                                                   Statistics in Medicine, 2009. 28: p. 3049–3067.
[12] Xu, Y. and J. Mostow. Using logistic regression to trace
multiple subskills in a dynamic Bayes net. Proceedings of the      [25] Koedinger, K.R., R.S.J.d. Baker, K. Cunningham, A.
4th International Conference on Educational Data Mining, 241-      Skogsholm, B. Leber, and J. Stamper. A data repository for the
245. 2011. Eindhoven, Netherlands.                                 EDM community: the PSLC DataShop. In C. Romero, et al.,
                                                                   Editors, Handbook of Educational Data Mining, 43-55. CRC
                                                                   Press: Boca Raton, FL, 2010.

</pre>