<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Incorporating Wide Context Information for Deep Knowledge Tracing using Attentional Bi-interaction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Raghava Krishnan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Janmajay Singh</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Masahiro Sato</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qian Zhang</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomoko Ohkuma</string-name>
        </contrib>
        <aff>Fuji Xerox Co., Ltd., Yokohama, Japan</aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>33</volume>
      <fpage>750</fpage>
      <lpage>757</lpage>
      <abstract>
        <p>Online learning platforms, also known as Computer Aided Education systems, have recently grown in importance owing to their ability to personalize study plans in accordance with individual student requirements. These platforms model the student knowledge state from student responses, most recently with the popular Deep Knowledge Tracing (DKT) technique. Using context information has also proven effective in various predictive problems, prompting learning platforms to store a variety of context features about a student's performance history. An example context is response time, where shorter times to answer questions may indicate higher mastery of a skill. It is therefore crucial to incorporate context features in the most effective way possible. Most research in DKT either uses no context features, or uses a set of context features that spans only a narrow range of student characteristics. To overcome this, we identify a wide set of context features and incorporate their interactions into the DKT model. We then observe the effects of incorporating these additional context feature interactions and also propose an adaptive weighting technique that learns appropriate context feature interaction weights. These techniques are compared with state-of-the-art baselines, and their performances are evaluated using AUC scores.</p>
      </abstract>
      <kwd-group>
        <kwd>Computer Aided Education</kwd>
        <kwd>Adaptive learning</kwd>
        <kwd>personalization</kwd>
        <kwd>sequential modeling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Computer Aided Education (CAE) systems aim to personalize the study plan of a user to best
suit their needs. This is achieved through the process of Knowledge Tracing, where the
current knowledge state of the user is estimated using the history of their interactions with the
system, and this estimated knowledge state is used to predict the future performance of the user.
Accurately predicted future student performances are then used as cues to better personalize
the study plan of each user. In addition to a history of user responses, CAE systems usually
also store additional metadata related to user performance history, like response time, type of
question, number of attempts, etc. Adomavicius et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] refer to this additional information
as contexts or context features, and give an interactional view of context as having a cyclical
relationship with an underlying activity. In our case, the activity is a student’s response and
context is the additional information.
      </p>
      <p>
        A popular approach to knowledge tracing in recent years has been Deep
Knowledge Tracing [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (DKT) which learns a continuous representation of the knowledge state as
compared to the discrete variable representation used in Bayesian Knowledge Tracing [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]
(BKT). The drawback of DKT is that it uses only the history of student responses, while there
are other factors that affect a student's performance on an online learning platform,
such as forgetting and learning ability. This is partially overcome in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which uses a technique
called bi-interaction in a framework called Bi-interaction Deep Knowledge Tracing (BIDKT).
This technique incorporates context in the form of second-degree interactions between the input
question response and context features relating to forgetting behavior, where an interaction is
the inner product or Hadamard product of the embedding vectors of the context features. This
technique showed that using context feature interactions is an effective way of incorporating
the above-mentioned factors into a knowledge tracing model.
      </p>
      <p>A drawback of BIDKT was its use of only a small (narrow) set of additional context features, in
this case those describing student forgetting behavior. While including these few features led to a
reasonable improvement in performance, it is of interest whether the trend would continue
as more related contexts were identified and included in the model. Additionally, the
bi-interaction technique assigns the same weight to all feature interactions. This can
be a problem when a larger set of context features is used, as important interactions may be
diluted among unimportant ones, resulting in either saturation or even a drop in
model performance.</p>
      <p>
        In this paper we posit that using additional (wide) context features should lead to improved
performance of knowledge tracing models. We further hypothesize that existing models may
not be well suited to effectively use additional contexts, since they do not weigh contexts by
their importance. To verify these ideas, we first identify additional contexts that may
provide important cues for predicting future student performance. We then analyze how the
current best model's performance changes with wider contexts. Finally, we propose a new
technique, modifying BIDKT to adaptively learn weights for contexts via an attention network
similar to [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and compare it with the identified baselines.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Knowledge Tracing: Since the emergence of Long Short-Term Memory (LSTM) networks,
the Deep Knowledge Tracing model has been the most popular knowledge tracing technique
[
        <xref ref-type="bibr" rid="ref2 ref6 ref8">2, 6, 8</xref>
        ]. There have been variations and extensions of DKT such as [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] that use Memory
Networks to model individual skill levels more effectively, while [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] use hop LSTMs to select
only the relevant past exercises when estimating the current skill level. There have also been efforts
to separately model the student’s ability in the Dynamic Key-Value Memory Networks for
Knowledge Tracing framework [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Although most efforts at knowledge tracing use only
sequential models [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] use Convolutional Neural Networks for knowledge tracing, while
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] uses sequential models such as LSTM to estimate parameters of IRT. There have also been
a few attempts at using Attention Networks in knowledge tracing. Pandey et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] use
a self-attention mechanism to identify relevant Knowledge Components from past learning
interactions of the student.
      </p>
      <p>
        Using Context Features in Knowledge Tracing: Given the success of using context
features and their interactions in other domains [
        <xref ref-type="bibr" rid="ref16 ref17 ref18 ref19 ref20 ref21 ref7">16, 17, 18, 19, 20, 7, 21</xref>
        ], there have recently been
efforts in knowledge tracing to incorporate context features into predictive models as well. Sun
et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] use a wide variety of context features for the task of knowledge tracing; they
achieve this by ensembling one of several algorithms, such as a Decision Tree, Support Vector
Machine, or Linear Regressor, with the Dynamic Key-Value Memory Network architecture of
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Zhang et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] propose an Autoencoder architecture to reduce the dimensionality of a
large number of features being input to DKT. Attention Networks have been used to
incorporate context features as well. Pandey et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] use a self-attention mechanism to incorporate
contextual information relating to exercise relations and forgetting.
      </p>
      <p>
        There have also been efforts to incorporate context features in the form of interactions for
the task of knowledge tracing. Vie et al. [24] use Factorization Machines to model the
interactions between a wide variety of features. Nagatani et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] on the other hand model feature
interactions using bi-interaction, a variant of Factorization Machines, and additionally feed
these interactions to an LSTM, achieving reasonable results. The model in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] also forms the
basis of our work. Our proposed model aims to improve the model proposed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] by
increasing the variety of context features and also by proposing a technique to utilize the additional
context information in an effective way using the attention mechanism from [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. While there
have been other efforts at using contextual features in the DKT framework [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and the
Factorization Machines framework [24], this is the first attempt at using attentional bi-interaction to
incorporate context feature interactions into the DKT model.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Background</title>
      <p>In this section, we will provide some background to the domain of knowledge tracing and also
describe the architectures of the DKT and BIDKT models. These models are the basis for our
proposed architecture.</p>
      <sec id="sec-3-1">
        <title>3.1. Knowledge Tracing</title>
        <p>Knowledge tracing is the process of estimating a student's current knowledge state and using
it to predict future performance. Given a sequence of past learning attempts x_0, …, x_t, we need
to predict the student's performance for attempt x_{t+1}. In general, an attempt x_t = (q_t, a_t) is
defined as a tuple that contains the skill set id q_t of the question at time step t and whether the
student's response a_t to the question is correct or not. Here, q_t is a skill set
id from a set of skills S, and a_t is a binary variable. We need to predict a_{t+1} for q_{t+1}.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Deep Knowledge Tracing</title>
        <p>
          Deep Knowledge Tracing (DKT), shown in Figure 1(a), models students' knowledge state
transitions using an LSTM, which is a modified version of the RNN. The architecture of the DKT
model is from [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], where, at time step t, the knowledge state is represented as h_t ∈ ℝ^d, with d
the hidden state dimension. Figure 1(a) shows the 2 processes the model
performs, i.e. estimating the current knowledge state and predicting future
performance.
        </p>
        <p>In the case of DKT, the input x_t is a one-hot vector indexing the Cartesian product of q_t
and a_t. x_t is then embedded into a dense real-valued vector e_t. During the knowledge state
estimation process, for a given input x_t = (q_t, a_t) at each time step t, the knowledge state h_t
is updated: h_t is estimated from the embedded vector e_t and the previous
knowledge state h_{t−1} using the LSTM module. For the prediction process, the output layer is
implemented as a linear layer with sigmoid activation. The predicted probabilities of correct
responses to all skill sets, y_t ∈ ℝ^{|S|}, form the model output.</p>
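<p>As a concrete illustration, the Cartesian-product input encoding described above can be sketched in NumPy. The index arithmetic (correctness offset times number of skills, plus the skill id) is one common convention and an assumption of this sketch, not necessarily the exact scheme of [2]:</p>

```python
import numpy as np

def encode_attempt(q, a, num_skills):
    """One-hot encode an attempt (q, a) over the Cartesian product of
    skill id and correctness: index q if incorrect, num_skills + q if
    correct (one common convention, assumed here)."""
    x = np.zeros(2 * num_skills)
    x[a * num_skills + q] = 1.0
    return x

# Example: skill id 3 answered correctly, with 5 skills in total.
x = encode_attempt(q=3, a=1, num_skills=5)
# x has length 10 and a single 1 at index 1 * 5 + 3 = 8
```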
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Bi-Interaction Deep Knowledge Tracing</title>
        <p>
          Bi-Interaction Deep Knowledge Tracing (BIDKT) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], shown in Figure 1(b), is an extension of the
DKT model, which integrates interactions between the input question response and context
features related to forgetting behavior into the DKT model. The context features used were
repeated time gap, sequence time gap and past trial counts which have been described further
in Section 5.2.
        </p>
        <p>The input to the RNN module, v_t, is computed using an integration technique called
bi-interaction: v_t is the sum of interactions between e_t, the embedded dense real-valued
vector of the input x_t, and c_t^i, the embedded dense real-valued vector of the i-th context
feature:
v_t = Σ_{i=1}^{n} e_t ⊙ c_t^i.   (1)
Here, n is the number of context features. The current knowledge state is computed using the
previous knowledge state h_{t−1} and the product of integration v_t as:
h_t = LSTM(v_t, h_{t−1}).   (2)
To predict the student's performance at the next attempt, the interaction between the current
knowledge state and the context at the next attempt, c_{t+1}^i, is computed:
v′_t = Σ_{i=1}^{n} h_t ⊙ c_{t+1}^i.   (3)
The context embedding parameters are shared between the current knowledge state estimation
step and the future performance prediction step.
And finally the probability of answering correctly, y_t ∈ ℝ^{|S|}, is computed as:
y_t = σ(W v′_t + b),   (4)
where σ(⋅) is the sigmoid function, W ∈ ℝ^{|S|×d} is the weight matrix, and b ∈ ℝ^{|S|} is the
bias vector of the output. The implementation of the output layer is similar to that of DKT.</p>
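<p>For intuition, the bi-interaction integration described above can be sketched in a few lines of NumPy. Variable names are our own, and the snippet covers only the pooling step, not the full BIDKT model:</p>

```python
import numpy as np

def bi_interaction(e, contexts):
    """Sum of Hadamard products between the input embedding e
    and each context embedding c_i (the integration step)."""
    return sum(e * c for c in contexts)

rng = np.random.default_rng(0)
d = 4                                                # embedding dimension
e = rng.normal(size=d)                               # embedded input e_t
contexts = [rng.normal(size=d) for _ in range(3)]    # n = 3 context features

v = bi_interaction(e, contexts)
# By distributivity, this equals e ⊙ (sum of the context embeddings).
assert np.allclose(v, e * np.sum(contexts, axis=0))
```

Because the input embedding is common to every interaction, the sum factorizes, which is what makes bi-interaction pooling cheap to compute.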
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Approach</title>
      <p>Our proposed model, Attentional Bi-Interaction Deep Knowledge Tracing (ABIDKT), shown in
Figure 1(c), is an extension of the BIDKT model that weights the interactions between the skill
id and context features in the BIDKT model. The original BIDKT model uses a narrow set
of context features related only to the long-term trait of forgetting. Our goal is to use a
wider set of features so as to better estimate the knowledge state and accurately predict future
performance. The additional context features are wins, fails, question type, previous attempt
response time, and difference in previous attempt response time, which are described in
detail in Section 5.2.</p>
      <p>
        The wins and fails context features have been picked up from [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], as that paper notes that
these features can be a good indication of the student's learning ability. The question type
feature is important because each question type is associated with a different level of difficulty, and
therefore this feature serves as a strong indicator of correct response probability. The previous
attempt response time feature was picked up from [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and the difference in previous attempt
response time feature was used because preliminary analysis showed that it is a good indicator
of skill mastery.
      </p>
      <p>However, using this wider set of features could lead to the issue of the important interactions
being averaged out. Therefore, the ABIDKT model uses an attention network in a modified
integration technique to weight the important interactions and ensure that they do not get
averaged out.</p>
      <p>In this case, the input to the RNN module, v_t, is computed using a modified integration
technique called attentional bi-interaction. In this integration method, v_t is the weighted sum
of the interactions between e_t, the embedded dense real-valued vector of the input x_t, and
c_t^i, the embedded dense real-valued vector of each context feature:
v_t = Σ_{i=1}^{n} a_i (e_t ⊙ c_t^i),   (5)
where n is the number of context features and a_i ∈ ℝ is the normalized attention weight of the
i-th interaction, calculated using the attention layer as in Eqs. (8) and (9) below.</p>
      <p>Similar to BIDKT, the current knowledge state is computed using the previous knowledge
state h_{t−1} and the product of integration v_t as:
h_t = LSTM(v_t, h_{t−1}).   (6)
To predict the student's performance at the next attempt, the weighted interaction between
the current knowledge state and the context at the next attempt is computed as:
v′_t = Σ_{i=1}^{n} a_i (h_t ⊙ c_{t+1}^i).   (7)
The attention score a′_i and the attention weight a_i, normalized by the softmax function, are
computed as:
a′_i = h_a^T (W_a (e_t ⊙ c_t^i) + b_a),   (8)
a_i = exp(a′_i) / Σ_{j=1}^{n} exp(a′_j).   (9)</p>
      <p>The probability of a correct answer, y_t ∈ ℝ^{|S|}, is computed in the same way as for BIDKT, and the
implementation of the output layer is also the same as in the DKT and BIDKT models. Similar
to the architecture of BIDKT, the parameters of the context embedding are shared between the
current knowledge state estimation step and the future performance prediction step in the
ABIDKT model as well. In the case of the attention network parameters, 2 variations were
experimented with: one where the attention network parameters are shared and the other
where the parameters are not shared.</p>
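<p>The attentional integration described above can be sketched as follows. The ReLU inside the scorer follows the attention network of [7] and, like the variable names and dimensions, is an assumption of this sketch rather than a definitive implementation:</p>

```python
import numpy as np

def softmax(z):
    z = z - z.max()                       # numerical stability
    ez = np.exp(z)
    return ez / ez.sum()

def attentional_bi_interaction(e, contexts, W_a, b_a, h_a):
    """Weight each interaction e ⊙ c_i by a softmax-normalized
    attention score before pooling (ReLU scorer assumed)."""
    inter = np.stack([e * c for c in contexts])      # (n, d) interactions
    scores = np.maximum(inter @ W_a.T + b_a, 0.0)    # (n, k) hidden layer
    a = softmax(scores @ h_a)                        # (n,) attention weights
    return (a[:, None] * inter).sum(axis=0), a

rng = np.random.default_rng(1)
d, k, n = 4, 3, 5                                    # dims and feature count
e = rng.normal(size=d)                               # input embedding
contexts = [rng.normal(size=d) for _ in range(n)]    # context embeddings
W_a = rng.normal(size=(k, d))                        # attention weight matrix
b_a = rng.normal(size=k)                             # attention bias
h_a = rng.normal(size=k)                             # projection vector

v, a = attentional_bi_interaction(e, contexts, W_a, b_a, h_a)
assert np.isclose(a.sum(), 1.0)                      # weights sum to one
```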
      <p>The training parameters for BIDKT are the skill id (q_t) embedding matrix, the weights of
the RNN, the prediction weight W and bias b, and the embedding matrix for the context
information. In the case of ABIDKT we additionally have to train the weight W_a, bias b_a and
parameter h_a of the attention layer. These parameters are jointly learned by minimizing a
standard cross-entropy loss between the predicted probability of correctly answering the next
question for the skill id q_{t+1} and the true label a_{t+1}:
L = − Σ_t ( a_{t+1} log(y_t ⋅ δ(q_{t+1})) + (1 − a_{t+1}) log(1 − y_t ⋅ δ(q_{t+1})) ),</p>
      <p>where δ(q_{t+1}) is the one-hot encoding indicating which skill id is answered at the next time step t + 1.</p>
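<p>A minimal NumPy sketch of this loss, assuming the model output collects per-skill probabilities and the one-hot mask selects the next skill (function and variable names are ours):</p>

```python
import numpy as np

def kt_loss(y_pred, q_next, a_next, num_skills):
    """Cross-entropy on the predicted probability for the skill
    answered at the next step, selected via a one-hot mask."""
    delta = np.eye(num_skills)[q_next]        # one-hot rows delta(q_{t+1})
    p = (y_pred * delta).sum(axis=1)          # predicted prob. for q_{t+1}
    return -np.sum(a_next * np.log(p) + (1 - a_next) * np.log(1 - p))

y_pred = np.array([[0.8, 0.2],                # per-skill predictions y_t
                   [0.3, 0.9]])
q_next = np.array([0, 1])                     # skill ids answered next
a_next = np.array([1, 0])                     # true correctness labels
loss = kt_loss(y_pred, q_next, a_next, num_skills=2)
# loss = -(log 0.8 + log 0.1)
```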
      <p>The training process for the ABIDKT model is the same as the training process for BIDKT
and DKT. The main difference between the models lies in the set of trainable parameters.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>
        Experiments were conducted to compare the performance of the proposed ABIDKT architecture
with BIDKT and DKT under different combinations of context features. The experiments were
designed to verify the following 2 hypotheses:
1. The bi-interaction technique used in the BIDKT architecture cannot effectively leverage
a wider set of context features than those used in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
2. Weighting context feature interactions using an attention network ensures that
performance does not saturate even as the number of context features increases.
5-fold cross-validation was performed using a 70% : 10% : 20% ratio for the train:validation:test
split, as done in the experimental setting of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The details of the datasets used, experiments
conducted and results obtained are given below.
      </p>
      <sec id="sec-5-1">
        <title>5.1. Datasets</title>
        <p>The datasets chosen for the experiments are the Assistments 2012-2013 [25] dataset which
contains information about students studying school level Mathematics with multiple question
types, and the Slepemapy.cz [26] dataset which contains data from an online platform that
teaches primary school Geography mainly consisting of 2 question types.</p>
        <p>Dataset statistics: Assistments 2012-2013 has 5,818,868 records, 45,675 users and 266 items;
slepemapy.cz has 10,087,305 records, 87,952 users and 1,458 items.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Preprocessing</title>
        <p>
          Records where a user made only a single attempt at a single skill set item were removed, as
in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Additionally, a few noisy records with negative response times were removed.
Continuous-valued context features were preprocessed and discretized for use in the BIDKT and
ABIDKT models. Further details are as follows:
1. repeated time gap: the difference in time stamp between the current and
previous attempt of the same skill, in minutes.
2. sequence time gap: the difference in time stamp between the current
and previous attempt (independent of skill id), in minutes.
3. past trial counts: the number of times the same skill has been
attempted in the past.
4. wins: the count of correct responses in past trials of the same skill.
5. fails: the count of incorrect responses in past trials of the same skill.
6. question type:
• a discrete value in the range 0-5 for the Assistments 2012-2013 dataset.
        </p>
        <p>• a binary value (0 or 1) for the Slepemapy.cz dataset.
7. previous attempt response time: the time taken to respond on the previous attempt of
the same skill, in seconds.
8. difference in previous attempt response time: the difference between the response
times of the last 2 attempts of the same skill, in seconds.</p>
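<p>The count-based features above (past trial counts, wins, fails) can be computed in a single pass over a student's chronological attempt log. This sketch assumes the counts exclude the current attempt:</p>

```python
from collections import defaultdict

def count_features(attempts):
    """For a chronological list of (skill_id, correct) attempts, return
    (past trial counts, wins, fails) at each step; counts exclude the
    current attempt (an assumption of this sketch)."""
    trials, wins = defaultdict(int), defaultdict(int)
    feats = []
    for skill, correct in attempts:
        feats.append((trials[skill], wins[skill], trials[skill] - wins[skill]))
        trials[skill] += 1
        wins[skill] += correct
    return feats

log = [(7, 1), (7, 0), (3, 1), (7, 1)]
feats = count_features(log)
# feats[3] == (2, 1, 1): skill 7 was attempted twice before, 1 win, 1 fail
```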
        <p>
          All features except for question type were discretized on a log2 scale. The repeated time
gap, sequence time gap and past trial counts features are the same as the context features used
in BIDKT [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The additional context features were determined based on features commonly
available across datasets and commonly used in the KT literature [
          <xref ref-type="bibr" rid="ref11 ref8">8, 11</xref>
          ].
        </p>
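<p>A minimal sketch of log2-scale discretization for the continuous features; the exact bin boundaries used in [6] may differ, so the boundary handling here is an assumption:</p>

```python
import math

def log2_bin(x):
    """Discretize a non-negative continuous feature on a log2 scale:
    0 maps to bin 0, and values in [2^(k-1), 2^k) map to bin k.
    The boundary handling is an assumption of this sketch."""
    return 0 if x <= 0 else int(math.log2(x)) + 1

bins = [log2_bin(v) for v in [0, 1, 3, 8, 100]]
# bins == [0, 1, 2, 4, 7]
```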
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Hyper-parameters</title>
        <p>The set of hyperparameters that maximized the average AUC over 5-fold cross-validation was
chosen for the final model implementations. Final results on the corresponding test sets are
reported using the AUC metric.</p>
        <p>
          The hyper-parameters were set as follows:
1. learning rate: varying the learning rate did not have a significant effect on the maximum
value of AUC. Learning rate values between 0.001 and 1 were tried, at approximate
multiples of 3, i.e. 0.001, 0.003, 0.01, 0.03, etc. The value was then
fine-tuned around the best performing value. Finally, the learning rate was set to 0.7 for
the Assistments 2012-2013 dataset, except for the DKT architecture where it was fixed at
0.5. For the Slepemapy.cz dataset, the learning rate was fixed at 0.9 for all architectures.
2. hidden layer dimension: different values between 10 and 100
were tried in steps of 10, and the value was empirically set to 30 for all variations
of architectures and datasets.
3. dropout: the dropout value was set using the best value from the
experiments in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], i.e. 0.3.
4. weight decay: weight decay values were varied between 10^−6 and 10^−3 at multiples of
10, i.e. 10^−5, 10^−4, and the best value varied between folds in the k-fold
cross-validation.
5. mini-batch size: this value was set to 100 for both datasets. For the Slepemapy.cz dataset,
although the batch size in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] was set to 30, we set it to 100 to speed up processing.
6. epochs: the number of epochs was set to 1.5 times the maximum number of epochs needed
for AUC to converge across all 5 folds for the BIDKT architecture. The number of
epochs was set to 600 and 200 for the Assistments 2012-2013 and Slepemapy.cz datasets
respectively, and the highest test AUC score among these epochs was reported.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results and Discussion</title>
      <p>The results shown are for the DKT, BIDKT and 2 variations of the ABIDKT architecture.
The results for the BIDKT and ABIDKT architectures are shown for different numbers of
features. The feature combinations are:
• forgetting: repeated time gap + sequence time gap + past trial counts
• forgetting+3: forgetting features + wins + fails + question type
• forgetting+5: forgetting+3 + previous attempt response time + difference in previous
attempt response time
The variations of the ABIDKT architecture are as follows:
• ABIDKT-SP: the parameters of the attention network and bi-interaction layer are shared
between the knowledge state estimation step and the future performance prediction step,
similar to the BIDKT architecture.
• ABIDKT: the parameters of the bi-interaction layer are shared between the knowledge
state estimation step and the future performance prediction step, while the parameters
of the attention network are trained independently.</p>
      <p>Figures 2(a) and 2(b) show the average test AUC across 5 folds for different
combinations of features incorporated in the different architectures. From the results we
observe that sharing trainable parameters between the knowledge state estimation step
and the future performance prediction step (ABIDKT-SP) does not have a significant impact on
performance, although not sharing parameters (ABIDKT) does perform marginally better
as the number of features is increased, for both datasets. The main takeaways from the
results are as follows:</p>
      <p>
        Hyperparameter Tuning and Reproducibility. All baselines were reproduced, and their
hyperparameters were tuned using the same methodology as for the proposed model. We found that
our tuning method led to an AUC improvement of 0.7% for both models on the Assistments
2012-2013 dataset compared to the values stated in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. For Slepemapy.cz, while the DKT result could be
reproduced, we could not match the AUC for BIDKT, primarily because the batch size
mentioned in the paper was too small and computation proved very time consuming. While
setting a larger batch size led to a more reasonable runtime, the model saw
a 1.1% drop in AUC.
      </p>
      <p>Effect of Additional Context Features. Including wide context features led to improvements
in AUC for both the BIDKT and ABIDKT models on both datasets, suggesting that the identified
features encapsulate information indicative of future student performance. Also, in support
of hypothesis 1 stated in Section 5, the improvement in BIDKT tapered off, with
negligible change when the number of features was increased from 5 to 8.</p>
      <p>Effect of Attention Layers. Contrary to hypothesis 2, adaptively learned context weights in
the form of attention layers did not provide a substantial improvement in model performance,
instead consistently achieving an AUC 0.1% lower than the BIDKT counterpart. This may be
because the added context features are not large in number, while attention layers involve more
trainable parameters. Trained on the same amount of data, the benefit of fewer trainable
parameters in BIDKT outweighs the adaptive weight assignments learned by the attention layers.</p>
      <p>We conducted further analysis by computing the micro-AUC, binning the predictions
by past trial counts as shown in Figure 3, and computing the percentage improvement
of the ABIDKT architecture over the BIDKT architecture for each bin. The bin sizes were
chosen to balance the number of samples in each bin. This analysis was performed on the
Assistments 2012-2013 dataset, as this is a Mathematics tutoring dataset where each user is bound
to have a large number of trials. From these results we observe that for low trial counts
ABIDKT does not show an improvement over BIDKT, but as the number of trials increases, the
percentage improvement of ABIDKT over BIDKT also steadily increases for all sets of features.
This suggests that ABIDKT may be useful on datasets where each student has a large number
of trials.</p>
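<p>The binned micro-AUC analysis can be reproduced with a rank-based AUC. The bin edges below are illustrative, not those of Figure 3, and ties in scores are ignored for brevity:</p>

```python
import numpy as np

def auc(labels, scores):
    """AUC via the Mann-Whitney rank statistic (ties ignored)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def binned_auc(labels, scores, trial_counts, edges):
    """Micro-AUC within each past-trial-count bin (edges illustrative)."""
    out = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (trial_counts >= lo) & (trial_counts < hi)
        # skip bins without both a positive and a negative sample
        if m.sum() > 1 and labels[m].min() != labels[m].max():
            out[(lo, hi)] = auc(labels[m], scores[m])
    return out

labels = np.array([1, 0, 1, 0, 1, 0])               # response correctness
scores = np.array([0.9, 0.1, 0.8, 0.4, 0.7, 0.3])   # model predictions
trial_counts = np.array([0, 1, 5, 6, 12, 20])       # past trial counts
per_bin = binned_auc(labels, scores, trial_counts, edges=[0, 5, 10, 50])
# every bin here is perfectly ranked, so each micro-AUC is 1.0
```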
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>The focus of this paper was to observe the effect of using a wider range of context features in the
BIDKT model and to propose techniques to effectively incorporate them. We first identified a
wider set of context features and incorporated them in the BIDKT model. Experimental results
on 2 datasets showed that increasing the number of context features improves the performance
of BIDKT significantly, but the performance begins to taper off as the number of features is
increased from 5 to 8. We postulated that this could be because important feature
interactions are diluted by other, unimportant ones. To overcome this drawback, we proposed a
technique that adaptively learns the weights of feature interactions and incorporated it as an
attention layer in the BIDKT model. Experimental results of these models on 2 datasets show
that this weighting technique was not sufficient to improve the performance of our models.
This could be because we were trying to learn additional parameters from the same amount
of data. We therefore analyzed the performance of our model across different trial counts and
found that our model does outperform BIDKT when the number of past trial counts is high.</p>
      <p>In future work, we first plan to evaluate our models on datasets with a higher number
of trials per student. We also plan to modify the attention architecture and see whether this
can perform better than the ABIDKT model. Additionally, we plan to try these approaches in
an architecture where skill is modeled separately, as in Dynamic Key-Value Memory Networks
for Knowledge Tracing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Adomavicius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mobasher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ricci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tuzhilin</surname>
          </string-name>
          ,
          <article-title>Context-aware recommender systems</article-title>
          ,
          <source>AI Magazine</source>
          <volume>32</volume>
          (
          <issue>1</issue>
          )
          <fpage>67</fpage>
          -
          <lpage>80</lpage>
          . URL: https://aaai.org/ojs/index.php/aimagazine/article/view/2364. doi:10.1609/aimag.v32i3.2364.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Piech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ganguli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Guibas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sohl-Dickstein</surname>
          </string-name>
          ,
          <article-title>Deep knowledge tracing</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>505</fpage>
          -
          <lpage>513</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Corbett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <article-title>Knowledge tracing: Modeling the acquisition of procedural knowledge</article-title>
          ,
          <source>User Modeling and User-Adapted Interaction</source>
          <volume>4</volume>
          (
          <year>1994</year>
          )
          <fpage>253</fpage>
          -
          <lpage>278</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Khajah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. V.</given-names>
            <surname>Lindsey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Mozer</surname>
          </string-name>
          ,
          <article-title>How deep is knowledge tracing?</article-title>
          ,
          <source>arXiv preprint arXiv:1604.02416</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Yudelson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Koedinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <article-title>Individualized bayesian knowledge tracing models</article-title>
          ,
          <source>in: International conference on artificial intelligence in education</source>
          , Springer,
          <year>2013</year>
          , pp.
          <fpage>171</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Nagatani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ohkuma</surname>
          </string-name>
          ,
          <article-title>Augmenting knowledge tracing by considering forgetting behavior</article-title>
          ,
          <source>in: The World Wide Web Conference</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3101</fpage>
          -
          <lpage>3107</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-S.</given-names>
            <surname>Chua</surname>
          </string-name>
          ,
          <article-title>Attentional factorization machines: Learning the weight of feature interactions via attention networks</article-title>
          ,
          <source>arXiv preprint arXiv:1708.04617</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Botelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. T.</given-names>
            <surname>Hefernan</surname>
          </string-name>
          ,
          <article-title>Incorporating rich features into deep knowledge tracing</article-title>
          ,
          <source>in: Proceedings of the Fourth (2017) ACM Conference on Learning@ Scale</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-Y.</given-names>
            <surname>Yeung</surname>
          </string-name>
          ,
          <article-title>Dynamic key-value memory networks for knowledge tracing</article-title>
          ,
          <source>in: Proceedings of the 26th international conference on World Wide Web</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>765</fpage>
          -
          <lpage>774</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Abdelrahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Knowledge tracing with sequential key-value memory networks</article-title>
          ,
          <source>in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>175</fpage>
          -
          <lpage>184</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Minn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Desmarais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Dynamic student classification on memory networks for knowledge tracing</article-title>
          ,
          <source>in: Pacific-Asia Conference on Knowledge Discovery and Data Mining</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>163</fpage>
          -
          <lpage>174</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Convolutional knowledge tracing: Modeling individualization in student learning process</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '20, Association for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , pp.
          <fpage>1857</fpage>
          -
          <lpage>1860</lpage>
          . URL: https://doi.org/10.1145/3397271.3401288. doi:10.1145/3397271.3401288.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Deep knowledge tracing with convolutions</article-title>
          ,
          <year>2020</year>
          . arXiv:2008.01169.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.-K.</given-names>
            <surname>Yeung</surname>
          </string-name>
          ,
          <article-title>Deep-IRT: Make deep learning based knowledge tracing explainable using item response theory</article-title>
          , arXiv preprint arXiv:1904.11738
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karypis</surname>
          </string-name>
          ,
          <article-title>A self-attentive model for knowledge tracing</article-title>
          , CoRR abs/1907.06837 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1907.06837. arXiv:1907.06837.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Volinsky</surname>
          </string-name>
          ,
          <article-title>Matrix factorization techniques for recommender systems</article-title>
          ,
          <source>Computer</source>
          <volume>42</volume>
          (
          <year>2009</year>
          )
          <fpage>30</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          ,
          <article-title>Factorization machines</article-title>
          ,
          <source>in: 2010 IEEE International Conference on Data Mining</source>
          , IEEE,
          <year>2010</year>
          , pp.
          <fpage>995</fpage>
          -
          <lpage>1000</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Sparse factorization machines for clickthrough rate prediction</article-title>
          ,
          <source>in: 2016 IEEE 16th International Conference on Data Mining (ICDM)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>400</fpage>
          -
          <lpage>409</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Field-weighted factorization machines for click-through rate prediction in display advertising</article-title>
          ,
          <source>in: Proceedings of the 2018 World Wide Web Conference</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1349</fpage>
          -
          <lpage>1357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>N.</given-names>
            <surname>Gui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>AFS: An attention-based mechanism for supervised feature selection</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>33</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>3705</fpage>
          -
          <lpage>3713</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Autofis: Automatic feature interaction selection in factorization models for click-through rate prediction</article-title>
          , arXiv preprint arXiv:2003.11235 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <article-title>Muti-behavior features based knowledge tracking using decision tree improved DKVMN</article-title>
          ,
          <source>in: Proceedings of the ACM Turing Celebration Conference-China</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>RKT: Relation-aware self-attention for knowledge tracing</article-title>
          , arXiv preprint arXiv:2008.12736 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>