Investigate Effectiveness of Code Features in Knowledge Tracing Task on Novice Programming Course

Poorvaja Penmetsa, Yang Shi, Thomas Price
North Carolina State University
ppenmet@ncsu.edu, yshi26@ncsu.edu, twprice@ncsu.edu

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Predicting student performance has been a major task in student modeling. Specifically, in open-ended domains such as computer science classes, student submissions contain more information, but they also require more advanced analysis methods to extract this information. Traditional student modeling approaches use knowledge components (KCs) to predict a student's success on specific practiced skills. These approaches are useful and necessary in helping learning environments such as Intelligent Tutoring Systems (ITS) personalize feedback and hints and identify struggling students. However, when working with programming data, code features provide more information than skill tags representing KCs, and this information is not leveraged by traditional KC models. This work incorporates an implicit representation of KCs into a student model by including features extracted from students' code, using data from an undergraduate introductory programming course. This representation is then evaluated with deep learning predictive models to investigate how well they can leverage code features to model student knowledge, and it is compared against other learning models. The study shows a modest but consistent improvement in models that use time-sequential data with even the simplest code features, implying that these aspects may improve student modelling.

Keywords
Student Modeling, Code Features, Knowledge Tracing, Knowledge Components, LSTM, DKT

1. INTRODUCTION
Modeling student learning activities can improve their learning outcomes, and automatically achieving this requires the system to model students' knowledge by tracking Knowledge Components (KCs) [2]. A knowledge component is an acquired concept that a learner uses to accomplish a task [8]. An exercise problem may consist of one or a combination of interrelated KCs depending on the domain, and multiple KCs are usually present in computer science problems involving programming [7]. Traditional KC models developed by domain experts used KCs to represent a student's knowledge state and are dependent on experts' knowledge [5].

Knowledge Tracing (KT) tracks KCs to model students' knowledge. It is a method that uses data from students' previous submissions on different problems (features) to predict their performance on future problems (labels) [3]. Historically, student performance is defined as the academic performance of a student, and KCs are represented with skill IDs [11, 6]. Our work builds data-driven KT models in computer science (CS) classes and uses preliminary features from student code submissions to further tune and optimize standard KT models and make predictions on student struggle.

Previous work has used data-driven methods to tune the parameters of a KT model; for example, Corbett and Anderson developed LISP tutors [6]. In their work, they used a model called Bayesian Knowledge Tracing (BKT) to monitor a student's changing knowledge state during programming assignments, updating an estimate of the student's learned skills and modeling their knowledge state as students completed problems. Some recent works optimized the implementations of BKT; for example, pyBKT [4] uses expectation maximization to fit parameters.

This standard BKT model has been advanced by many newer models such as Deep Knowledge Tracing (DKT) [12] and the Item Difficulty Effect Model (KT-IDEM) [11], as they incorporate many different factors or use more complex model structures. Out of these improved models, we use DKT as the model for our proof of concept, since it is widely used as a baseline model and can also integrate student code as features. Some recent works advanced DKT by adding more variants to the model [9, 10], with DKT as their base framework.
We compare DKT and BKT models in our work, along with classical deep learning models such as a multi-layer perceptron (MLP) and Long Short-Term Memory (LSTM) networks, to see whether code features improve the ability of KT models and whether they also have a positive effect on static, non-time-sequential models. While skill IDs are hardly available for BKT without manual labeling, we use the problem or assignment IDs to serve as the KCs there [15, 12]. In DKT's case, the classical model can already use the IDs as inputs, automatically extracting knowledge components to be tracked [12].

While compared in our work, these standard KT models also have two important limitations when code features are added: 1) Need for extra accommodation: standard KT models use correctness on KCs as features, and they do not accept code features in their original structure. There are some recent works adding textual information to KT models [15, 14], but they work with non-programming data. Some other works such as PKT [16] and Code DKT [17] use programming data, but they use data from an hour-long programming activity (Hour of Code) rather than a university course. 2) Definition of student performance: these models usually define the binary score as the student's academic performance. However, students usually have a biased success rate in CS programming classes (in our case, > 84%), as they are allowed to submit multiple times and receive feedback immediately. Predicting whether students successfully finished the problem is less meaningful than directly detecting their struggle, as they may spend a lot of effort before reaching the goal.

The usage of code features, however, is not rare in the general CS education or software engineering domains. Recent works have used sub-trees [19], Bag of Words [1], and automatically learned representations [13] for student progress modeling, grade prediction, or bug discovery tasks. Their models cannot be directly applied to sequential tasks such as KT. In our work, we use term frequency-inverse document frequency (TF-IDF) as the base features for this exploratory study.

We build preliminary KT models to predict whether novice university CS students struggle, with the integration of code features. To verify whether code features can improve the KT models, we experiment with TF-IDF features, simply to show possible effects of using code features in this task, and compare against other baseline models. To address the aforementioned limitations of classical KT models, we define our research question as: Do code features improve the ability to model student struggle patterns in CS in time-sequential and non-time-sequential KT models? Our results suggest that simple code features coupled with time-sequential information have the potential to improve KT model performance on student struggle prediction tasks.
2. METHODS
2.1 Data
The data used in this work is the CodeWorkout dataset (https://pslcdatashop.web.cmu.edu/Project?id=585) from an undergraduate programming course at a large, public university in the southeast United States, collected in Spring 2019. The dataset includes code submissions from 5 assignments, 50 problems, and 413 students, recording students' attempts at implementing functions in Java. Each assignment contains 10 problems. A student could submit any number of attempts for a problem; each attempt's code is run against given test cases and given a score from 0 to 1 based on the test results. Our statistics show that not all 413 students worked on every problem and that most students eventually get a full score (score = 1). Assignments are ordered by time stamp, and problems are grouped in each assignment. Different assignments covered different concepts, according to their problem descriptions, and the complexity of problems increases over time. The dataset also provides the descriptions of the problems. For example, early problems include simple concepts such as if conditions, while a later problem in the same assignment may ask about nested-if conditions.

Each assignment is a combination of multiple different concepts, with at least one new or more advanced concept being introduced in each assignment. The entire course covers a wide range of introductory concepts including conditional statements, loops, strings, and arrays, which suggests that the knowledge tracing task across all problems is non-trivial.

The traditional goal of KT is to model knowledge to predict student performance. Historically, this has been interpreted as whether or not a student gets a problem correct. However, in many programming classes, including the one analyzed here, most students (at least 84% of students in this dataset) got a full score on the problems they attempted. So, it is more important to predict student struggle instead of success. In this programming setting, students can use multiple attempts and finally succeed, which allows us to use their number of attempts as a quantitative measure of their struggle. We define a struggling student as someone who uses more attempts than at least T% of the students. To find the threshold T, we explored the data to find a division into two relatively separate groups of students: those who required a lot of effort to complete the problem and those who did not struggle. We found that the 75th percentile of attempts for each problem is a good division for most problems. A student is identified as struggling on a problem if they: 1) did not pass all the test cases, or 2) passed all test cases but took more than the 75th percentile of attempts for that particular problem. So a student who scored 1 on a problem might have also struggled with that problem. The prediction that a student is struggling is the positive class, which is also the minority class.
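To make the labeling rule concrete, the sketch below shows one way this struggle label could be derived with pandas. The column names (student_id, problem_id, attempts, max_score) and file name are illustrative assumptions, not the dataset's actual schema.

```python
import pandas as pd

# One row per (student, problem): attempt count and best unit-test score.
# Column names are illustrative, not the actual CodeWorkout schema.
subs = pd.read_csv("codeworkout_per_problem.csv")

# Per-problem 75th percentile of attempts.
threshold = subs.groupby("problem_id")["attempts"].transform(lambda s: s.quantile(0.75))

# Struggle = never passed all test cases, or needed more attempts than the
# 75th percentile of attempts for that problem.
subs["struggle"] = ((subs["max_score"] < 1.0) |
                    (subs["attempts"] > threshold)).astype(int)
```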
2.2 Feature Representations
2.2.1 TF-IDF Features
We use simple code features in the form of Term Frequency-Inverse Document Frequency (TF-IDF) weights, a text analysis technique that represents documents based on their term frequencies. The algorithm uses Term Frequency (TF) to find the frequency of a word in a document and normalizes these weights with Inverse Document Frequency (IDF), based on the frequency of the word in the corpus. The final weights reflect the importance of terms for a particular document.

This method is more commonly found in work involving processing of text rather than code. However, the novice programming concepts present in this data can be partially represented by keywords, since the problems focus on familiarizing students with syntax. More complex concepts, such as stacks and queues, are not part of the course curriculum. The vocabulary used in our work consists of Java keywords such as 'for', 'if', and 'public' that represent novice concepts a beginner programming course might include. Other keywords such as 'throw' and 'extends' are also included to identify students who might have had a background in programming; these students may not struggle with the problems in this course compared with other students.

The models in this experiment predict at the problem level, so each problem done by a student is represented as one input. We define a corpus as the attempts across all students who worked on problem p − 1 and use the TF-IDF vector of the last attempt a student made on this problem as the code feature input. In this way, the frequencies of keywords for each attempt at a problem are calculated, and the TF-IDF weights for the keywords in a student's best attempt are included in the code features.

Although other, more complex extraction methods such as code2vec, ASTNN, and pq-grams exist that extract deeper code and structural information, we use TF-IDF for this exploratory study. Our goal is not to use complicated vectorization approaches, but rather to examine whether a simple yet effective approach can represent information before moving on to more complex approaches. TF-IDF fits this goal because it not only simplifies the model architecture, but also puts more weight on relatively rarer keywords, which may help in identifying the concepts in a problem.
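A minimal sketch of this feature extraction with scikit-learn is shown below. The keyword list is only an illustrative subset; the paper's vocabulary contains 63 Java keywords, which are not all listed here, and the exact tokenization the authors used may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative subset of the 63-keyword Java vocabulary (assumption).
JAVA_KEYWORDS = ["for", "if", "else", "while", "public", "static", "return",
                 "int", "boolean", "new", "throw", "extends"]

def tfidf_features(last_attempts):
    """last_attempts: list of code strings, one per student, for one problem.
    Returns a (num_students, vocab_size) matrix of TF-IDF weights."""
    vectorizer = TfidfVectorizer(vocabulary=JAVA_KEYWORDS,
                                 token_pattern=r"[A-Za-z_]+",
                                 lowercase=False)
    return vectorizer.fit_transform(last_attempts)
```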
2.2.2 Feature Groups and Usage
This experiment uses 4 sets of models: 1) the standard BKT model [4], 2) DKT models [12] with and without code features, 3) MLP-based models with and without code features, and 4) LSTM-based models with and without code features. Each model throughout the experiment uses student struggle on past problems (student performance on problems 1 to p − 1), along with other features, to predict the labels, i.e., future student performance on problem p. We extracted the following features:

ID-Label Group: The Problem ID, Assignment ID, Uniform ID, and the binary label of whether the student struggled on problems 1 to p − 1. The Uniform ID scheme employs the same ID for all problems, with one ID representing all KCs, so only the label changes.

Attempt-Score Group: The number of attempts the student made (with any score) and the maximum score achieved on unit tests in any attempt for problems 1 to p − 1.

Code Feature Group: The TF-IDF features of the last attempt at problems 1 to p − 1 (TF-IDF weights).

Because of the lack of IDs directly representing KCs in the data, we experiment with three different levels of IDs (Problem ID, Assignment ID, Uniform ID). The BKT model uses the Uniform ID along with the labels of past struggle, assuming that one ID represents all the KCs in the dataset. The DKT set of models is compared using features across the entire ID-Label group.

The Attempt-Score group is used in the MLP- and LSTM-based models. Because of these features (attempts and score), these models are named Attempt-Score MLP (AS-MLP) and Attempt-Score LSTM (AS-LSTM). The Code Feature group is used for MLP, DKT, and LSTM models. We refer to all models that do not use TF-IDF features as baseline models.

2.3 Models
2.3.1 BKT
We used the pyBKT implementation of the standard BKT model [4]. This model tracks student knowledge with the probability that a student has learned a skill. The output of one state is directly used as the input to the next state, making it a time-sequential model. For this reason, it requires inputs to be ordered sequentially. The purpose of this model in this experiment is to apply a standard KT model to the current data. The input to pyBKT is two dimensional, including the Uniform ID and the label for problem p − 1.
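A minimal sketch of fitting such a model with the pyBKT library is shown below. The dataframe layout and column names are assumptions made for illustration; pyBKT maps custom column names through its defaults argument, and the exact preprocessing used in the paper may differ.

```python
import pandas as pd
from pyBKT.models import Model

# One row per (student, problem), ordered by submission time. 'outcome' is the
# binary struggle label and 'skill' holds the single Uniform ID; these column
# names are illustrative, not from the paper.
df = pd.read_csv("struggle_sequences.csv")
df["skill"] = "uniform"  # one ID standing in for all KCs

bkt = Model(seed=42, num_fits=5)
bkt.fit(data=df, defaults={"user_id": "student_id",
                           "skill_name": "skill",
                           "correct": "outcome"})
print(bkt.evaluate(data=df, metric="auc"))
```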
2.3.2 DKT
We used the classical DKT implementation from the GitHub repository at https://github.com/seewoo5/KT as the standard DKT model, with the Problem, Assignment, and Uniform IDs mentioned in the section above. We added the TF-IDF features to these models to create DKT with TF-IDF features and compared against the versions without TF-IDF features. Our DKT uses an embedding layer to initialize a random vector representing each input feature and updates this vector as the model trains. This is followed by an LSTM and a linear layer. The number of embeddings is one more than twice the number of questions (50), and the embedding dimension is 20 for D02 and 200 for D01 and D03. The batch size is 192 with a sequence length of 50 and a learning rate of 0.001. The models with code features take two inputs: the TF-IDF weights from the code and the ID associated with the code. DKT with Problem ID and TF-IDF features (D04) and DKT with Uniform ID and TF-IDF features (D05) use a batch size of 192, a sequence length of 50, and a learning rate of 0.00001.

The general DKT model is designed to process time-series data (i.e., multiple problem attempts over time) and uses LSTM networks, whose recurrent structure serves the same purpose. When predicting the label for student s and problem p, the input sequence includes the features for problems 1 to p − 1 done by student s. We used a fixed input sequence length of 50, the maximum number of problems a student can do, with 0-padding when needed. The padding makes sure that the input always has a fixed length. In addition to the ID and/or TF-IDF features, the DKT models require a student representation to measure student performance, in this case student struggle. We refer to this representation as the struggle feature, with which the model identifies the student's struggle on past problems. In both the DKT and BKT sets of models, this struggle feature is binary because it represents whether or not a student struggled on the past problems.

In the DKT models without code features, the models incorporate the struggle feature for problems 1 to p − 1 to predict for problem p. There are three of these models, using problem ID (D01), uniform ID (D02), or assignment ID (D03). Another two DKT models use code features; besides the code features, one uses the problem ID and the other uses the uniform ID. Both models, D04 and D05, combine the TF-IDF vector (length 63) with the embedded vector of ID and struggle (length 20), resulting in an LSTM input dimension of 63 + 20 = 83.
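The sketch below illustrates this architecture (embedding of the ID/struggle code, concatenation with the TF-IDF vector, then an LSTM and a linear layer) under our own assumptions; the hidden size, output head, and exact input encoding are illustrative rather than taken from the authors' implementation.

```python
import torch
import torch.nn as nn

class DKTWithCode(nn.Module):
    """Sketch of the D04/D05-style model: embed the (ID, struggled) code,
    concatenate the TF-IDF vector, then apply an LSTM and a linear layer."""
    def __init__(self, num_questions=50, embed_dim=20, tfidf_dim=63, hidden_dim=100):
        super().__init__()
        # 2 * num_questions + 1 embeddings: one per (question, struggled?) pair
        # plus a padding index, matching the count described in the text.
        self.embed = nn.Embedding(2 * num_questions + 1, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim + tfidf_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_questions)  # per-problem struggle logits

    def forward(self, id_struggle_seq, tfidf_seq):
        # id_struggle_seq: (batch, 50) integer codes; tfidf_seq: (batch, 50, 63)
        x = torch.cat([self.embed(id_struggle_seq), tfidf_seq], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)
```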
2.3.3 Multilayer Perceptron Models
The Attempt-Score MLP (AS-MLP) models with (M02) and without (M01) code features do not take a sequential input and do not explicitly use the output of one state as input to the next, so they are not time-sequential. However, the neural network structure allows the MLP to incorporate code features. The purpose of these models is therefore to compare the effect of performance features and code features in a static model on this struggle prediction task. Both AS-MLP models share the same architecture, with three linear layers. M01 uses a batch size of 260 and a learning rate of 0.001, and the AS-MLP model with code features (M02) uses the same batch size with a learning rate of 0.0001.

The inputs for the AS-MLP models are less complex than those of any LSTM. Unlike the DKT input structure, the AS-MLP models do not use an empty sequence to predict for the first problem a student has attempted. In the AS-MLP model without code features (M01), the input is two dimensional, containing the attempts and maximum score for problem p − 1, while the MLP model with code features has an input dimension of 63 + 2, where 63 is the fixed length of the TF-IDF code features.

2.3.4 LSTM Models
We created two Attempt-Score LSTM (AS-LSTM) models: AS-LSTM with (L02) and without (L01) TF-IDF features. These models incorporate performance features (attempts and score) instead of DKT's struggle feature. The AS-LSTM input is similar to the DKT input, except that AS-LSTM does not predict on the first problem. In L01, both features represent the student's performance on past problems, and in L02, these performance features are appended to the TF-IDF features. So L01 also takes a two-dimensional input, while L02 takes a vector of size 63 + 2. Both AS-LSTM models share the same architecture of one LSTM layer followed by a linear layer. L01 uses a learning rate of 0.001 and a batch size of 260 over 100 epochs. L02 uses a learning rate of 0.000001 and a batch size of 260 over 250 epochs.

We used an 8:2 train-test split and performed repeated resampling for the AS-MLP and AS-LSTM models, while we used 5-fold cross validation for the DKT and BKT models, reporting average AUC, recall, precision, and F1 scores in the results section.
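For concreteness, minimal PyTorch sketches of the two Attempt-Score architectures described above follow. The hidden sizes and activation functions are our own assumptions; the paper only specifies the layer counts and input dimensions.

```python
import torch.nn as nn

class ASMLP(nn.Module):
    """Three linear layers over (attempts, max score) for problem p-1;
    in_dim=65 for the variant that appends the 63-dim TF-IDF vector (M02)."""
    def __init__(self, in_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))  # struggle logit

    def forward(self, x):
        return self.net(x)

class ASLSTM(nn.Module):
    """One LSTM layer over the sequence of past (attempts, score[, TF-IDF])
    vectors, followed by a linear layer (L01/L02)."""
    def __init__(self, in_dim=2, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, seq):          # seq: (batch, time, in_dim)
        h, _ = self.lstm(seq)
        return self.out(h[:, -1])    # struggle logit for the next problem
```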
3. RESULTS AND DISCUSSION
TF-IDF features and time-sequential information: Table 1 shows four pairs of models: D01 and D04, D02 and D05, M01 and M02, and L01 and L02, which are identical except that the latter models (D04, D05, M02, and L02) include TF-IDF features. Before adding TF-IDF, the DKT models are biased towards label 1, while after adding TF-IDF features, the models are biased towards label 0. So we mainly use AUC, which is symmetric with respect to bias towards either class, to compare these models' performances in this section.

Models | Precision (1) | Recall (1) | F1 Score (1) | Macro F1 | AUC | ACC
BKT Uniform ID | 0.57 ± 0.00 | 0.46 ± 0.01 | 0.51 ± 0.00 | 0.65 ± 0.00 | 0.64 ± 0.00 | 0.70 ± 0.00
MLP No TF-IDF (M01) | 0.35 ± 0.00 | 0.83 ± 0.01 | 0.50 ± 0.00 | 0.54 ± 0.00 | 0.70 ± 0.00 | 0.56 ± 0.00
MLP TF-IDF (M02) | 0.35 ± 0.01 | 0.80 ± 0.03 | 0.47 ± 0.00 | 0.55 ± 0.02 | 0.69 ± 0.02 | 0.56 ± 0.00
DKT Problem ID (D01) | 0.37 ± 0.02 | 0.89 ± 0.01 | 0.52 ± 0.02 | 0.45 ± 0.05 | 0.71 ± 0.00 | 0.46 ± 0.04
DKT Uniform ID (D02) | 0.42 ± 0.01 | 0.82 ± 0.01 | 0.56 ± 0.00 | 0.57 ± 0.02 | 0.71 ± 0.00 | 0.57 ± 0.02
DKT Assignment ID (D03) | 0.40 ± 0.01 | 0.88 ± 0.00 | 0.55 ± 0.00 | 0.51 ± 0.01 | 0.72 ± 0.00 | 0.51 ± 0.01
DKT TF-IDF & Problem ID (D04) | 0.55 ± 0.01 | 0.33 ± 0.02 | 0.41 ± 0.01 | 0.63 ± 0.00 | 0.73 ± 0.00 | 0.76 ± 0.00
DKT TF-IDF & Uniform ID (D05) | 0.60 ± 0.02 | 0.34 ± 0.02 | 0.44 ± 0.02 | 0.64 ± 0.00 | 0.75 ± 0.00 | 0.77 ± 0.00
LSTM No TF-IDF (L01) | 0.50 ± 0.00 | 0.52 ± 0.07 | 0.50 ± 0.05 | 0.69 ± 0.02 | 0.74 ± 0.00 | 0.73 ± 0.03
LSTM TF-IDF (L02) | 0.42 ± 0.03 | 0.74 ± 0.10 | 0.52 ± 0.00 | 0.62 ± 0.04 | 0.76 ± 0.00 | 0.64 ± 0.00

Table 1: Results of models with or without code features. Bold figures indicate the best value for the corresponding metric.

The results in Table 1 show that D04, D05, and L02 can distinguish between the classes slightly better than D01, D02, or L01. L02 performs slightly better than D04 and D05, possibly because its performance features are not binary, unlike DKT's struggle feature. Looking at the confusion matrices of D04 and D01 in Table 2, it can be calculated that, with TF-IDF, the DKT model is 6.2 times more likely to mark an at-risk student as needing help than a non-at-risk student, by examining the ratio (TP/FN) : (FP/TN); without TF-IDF, it is only almost twice as likely. Similarly, the confusion matrices of D02 and D05 (Table 4) show that D05 is 5.5 times more likely to mark an at-risk student as needing help, while D02 is 3.9 times as likely. Meanwhile, L01 and L02 (Table 3) have a smaller difference, with L01 about 4.5 times as likely to mark an at-risk student as needing help and L02 about 4.7 times as likely.

D04: TP = 301, FN = 575, FP = 238, TN = 2224
D01: TP = 398, FN = 454, FP = 395, TN = 2008

Table 2: DKT Problem ID without (D01) and with (D04) TF-IDF features: confusion matrices.

L02: TP = 541, FN = 311, FP = 618, TN = 1785
L01: TP = 398, FN = 454, FP = 395, TN = 2008

Table 3: AS-LSTM without (L01) and with (L02) TF-IDF features: confusion matrices.

D05: TP = 298, FN = 578, FP = 211, TN = 2251
D02: TP = 888, FN = 220, FP = 1124, TN = 1106

Table 4: DKT Uniform ID without (D02) and with (D05) TF-IDF features: confusion matrices.
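As a quick check of how this statistic can be computed (our reading of the ratio, not code from the paper), the D05 entries from Table 4 reproduce the reported value:

```python
# D05 confusion matrix entries from Table 4.
TP, FN, FP, TN = 298, 578, 211, 2251

# How much more often at-risk students are flagged than non-at-risk students.
ratio = (TP / FN) / (FP / TN)
print(round(ratio, 1))  # 5.5, matching the value reported for D05
```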
It has been an open question whether the addition of code features would help; for example, Wang et al. found that a large number of code features can lead models to overfit to the training data [18]. This could have happened with D05 and L02: their LSTM input dimensions increased at least 4-fold from D02 and L01 respectively, making it possible for the models to overfit. However, in our experiments both L02 and D05 show a 3% to 6% improvement in their AUC, suggesting that the models benefited from the additional complexity.

For the AS-MLP models, M02 did not improve with TF-IDF features and performed similarly to M01, as shown in Table 1. This suggests that time-sequential information is important for KT tasks and that TF-IDF features alone cannot improve model performance.

The current model performances are not ready for deployment in real educational settings. This modest performance was expected, because examining the effect of TF-IDF features in modeling student knowledge is a very difficult task due to the open-ended nature of the domain and the wide range of concepts in the dataset. Although the performance of the models with TF-IDF features was modest, there is still a 3% to 6% increase in the AUC of these models compared to the baselines. This improvement marks potential for more complex code features.

3.1 Discussion
Difficulty of Knowledge Tracing for Programming: As hypothesized, this is a difficult KT task because of the wide range of concepts covered throughout the course. As students progress through the different problems and assignments, new concepts are introduced, which the models try to infer either through an ID or through student code features. Results across all models in Table 1 show that neither IDs nor code features are particularly successful in representing or inferring KCs. Of all the models in the experiment, BKT is the worst performing model with an AUC of 0.64, which suggests that this particular definition of student performance (labels) may not be the best fit for a standard BKT model.

Because there is a lack of tags to directly represent KCs, we use naive ways to represent them in the three baseline DKT models: D01 treats each problem as a separate concept, D03 treats each assignment as a separate concept, and D02 treats all problems as one concept. As shown in Table 1, they are all biased towards label 1 (struggling student). Further examination of D01 - D03 in Table 1 shows that they perform similarly, with D03 having the best AUC and D02 the highest macro F1 score. Because these ID schemes have similar results, none appears clearly better at representing concepts.

Why were DKT and the other KT approaches unsuccessful?: The original DKT work that the D01 code is based on states that the algorithm can leverage skill or KC tags but does not need them to make predictions [12]. Looking at the results in Table 1, it is clear that the D01 model, which uses the problem ID as a feature and incorporates the number of attempts in the labels, has mediocre results. The only difference between the D01 model and the original DKT model lies in the definition of student performance. However, there are other differences in the data used. The original DKT work used non-programming data in which the problem IDs represented the skills or KCs better. The original work also uses far more data, from online courses, than is available from a formal education setting, so there is more data representing each KC: where the smallest dataset in [12] has 4K students and 200K entries, the data used here has 400 students and 16K entries (after cleaning). The original work used each attempt as an entry, while in this work only the last attempt is used. Moreover, students who worked on CodeWorkout did not all do the problems in the same order within an assignment. This makes it very difficult to predict the importance of a problem's features based on past problems for all students.

Our results suggest that there is some improvement, but only for models that use time-sequential information. There is a 3% - 6% increase in AUC for LSTM-based models that use TF-IDF features compared with their counterparts that do not. Considering the difficulty of this task, this improvement suggests that more complex models may improve performance further.

When code features are added to static models such as AS-MLP (M02), there is no improvement in performance. Comparing M02 to the time-sequential models with code features (D04, D05, L02), there is a 7% to 10% increase in AUC, suggesting that the impact of code features is greater in the presence of time-series data. However, only one non-time-sequential model was implemented with code features, so it is not clear whether these results are robust.

4. LIMITATIONS AND FUTURE WORK
One limitation of this work is that the vocabulary used for TF-IDF may not generalize to more advanced curricula. The second limitation is that we only use one type of non-time-sequential model and one type of time-sequential model to compare code features, so the results may be specific to LSTM- or MLP-based models. In the future, we plan to work with expert-based features such as pq-grams and to incorporate other models into our experiment.

In conclusion, in this work we experimented with models with and without both simple code features and time-sequential information. The results show that even these simple code features can affect model performance and that time-sequential information is important when using these features. These experiments mark the potential that code features have for representing CS KCs over an entire course.
5. REFERENCES
[1] B. Akram et al. Assessment of students' computer science focal knowledge, skills, and abilities in game-based learning environments. 2019.
[2] V. Aleven. Rule-Based Cognitive Modeling for Intelligent Tutoring Systems, pages 33–62. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
[3] J. Anderson, A. Corbett, K. Koedinger, and R. Pelletier. Cognitive tutors: Lessons learned. Journal of the Learning Sciences, 4:167–207, 1995.
[4] A. Badrinath, F. Wang, and Z. Pardos. pyBKT: An accessible Python library of Bayesian knowledge tracing models, 2021.
[5] R. Clark, D. Feldon, J. J. G. Van Merrienboer, K. Yates, and S. Early. Cognitive task analysis. Handbook of Research on Educational Communications and Technology, pages 577–593, 2008.
[6] A. T. Corbett and J. R. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4:253–278, 1994.
[7] B. Haberman and O. Muller. Teaching abstraction to novices: Pattern-based and ADT-based problem-solving processes. In 2008 38th Annual Frontiers in Education Conference, pages F1C-7. IEEE, 2008.
[8] K. Koedinger, A. Corbett, and C. Perfetti. The knowledge-learning-instruction (KLI) framework: Toward bridging the science-practice chasm to enhance robust student learning. Cognitive Science, 36:757–798, 2012.
[9] S. Pandey and G. Karypis. A self-attentive model for knowledge tracing. CoRR, abs/1907.06837, 2019.
[10] S. Pandey and J. Srivastava. RKT: Relation-aware self-attention for knowledge tracing. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020.
[11] Z. Pardos and N. Heffernan. KT-IDEM: Introducing item difficulty to the knowledge tracing model. In User Modeling, Adaptation and Personalization (UMAP), pages 243–254, 2011.
[12] C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. J. Guibas, and J. Sohl-Dickstein. Deep knowledge tracing. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
[13] Y. Shi, K. Shah, W. Wang, S. Marwan, P. Penmetsa, and T. Price. Toward semi-automatic misconception discovery using code embeddings. In LAK21: 11th International Learning Analytics and Knowledge Conference, pages 606–612, 2021.
[14] D. Shin, Y. Shim, H. Yu, S. Lee, B. Kim, and Y. Choi. SAINT+: Integrating temporal features for EdNet correctness prediction. CoRR, abs/2010.12042, 2020.
[15] Y. Su, Q. Liu, Q. Liu, Z. Huang, Y. Yin, E. Chen, C. H. Q. Ding, S. Wei, and G. Hu. Exercise-enhanced sequential modeling for student performance prediction. In AAAI, pages 2435–2443, 2018.
[16] L. Wang, A. Sy, L. Liu, and C. Piech. Deep knowledge tracing on programming exercises. In Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale, L@S '17, pages 201–204, New York, NY, USA, 2017. Association for Computing Machinery.
[17] L. Wang, A. Sy, L. Liu, and C. Piech. Deep knowledge tracing on programming exercises. pages 201–204. Association for Computing Machinery, 2017.
[18] W. Wang, Y. Rao, Y. Shi, A. Milliken, C. Martens, T. Barnes, and T. Price. Comparing feature engineering approaches to predict complex programming behaviors. 2020.
[19] R. Zhi, T. Price, N. Lytle, Y. Dong, and T. Barnes. Reducing the state space of programming problems through data-driven feature detection. 2018.