How productive are homework and elective practice?
Applying a post hoc modeling of student knowledge in a
large, introductory computing course

Max Fowler (mfowler5@illinois.edu), Binglin Chen (chen386@illinois.edu),
Matthew West (mwest@illinois.edu), Craig Zilles (zilles@illinois.edu)
University of Illinois, Urbana, IL, USA

ABSTRACT

In this paper, we attempt to estimate how much learning happens in required practice activities (homework) relative to elective practice activities (studying). This analysis is done in the context of a large enrollment (N = 601) introductory programming course that made heavy use of auto-grading, randomizing question (item) generators. Because these item generators (and other problems) were used as homework, on practice exams, and as part of exams, a given student may have encountered the same generator multiple times during the class, providing snapshots of the evolution of the student's ability to complete that problem correctly.

We use a post hoc model of "this-item-correct" prediction to estimate individual student knowledge on each attempt of a given question. Across five exams, correctness tracing attributes 57-65% of the learning that occurs to the homework period and the remainder to elective practice (the study period).

Keywords

assessment; CS1; exams; student learning; homework

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. INTRODUCTION

A well-designed course provides students with many opportunities to learn (e.g., readings, direct instruction, activities with peers, homework). While summative assessment allows us to estimate how much learning has occurred, it doesn't shed light on where the learning happened. If we could attribute learning to the activities in which it occurred, this would allow teachers to increase their use of effective activities and deprecate ineffective ones. Our goal as educators is to engage students, to assist both them and us in diagnosing their progress, and to provide formative experiences during their learning careers [8, 15, 21, 34].

In most courses, the bulk of the students' time is spent outside of course meetings, either completing homework or performing elective practice (studying). It has been shown that well-formed homework has a positive impact on student performance and motivation [5, 6, 14, 22]. There is, however, disagreement among experts in the learning and assessment communities on how to craft good homework [2, 37]. Studying is usually motivated by a desire to score well on exams and does not typically have a grade associated with it [33].

We were curious to explore the degree to which we can attribute student learning to each of two kinds of formative practice activities: required homework and elective practice performed prior to a summative assessment. Additionally, as our course utilizes multiple types of questions, we were curious to know whether student experiences differed between types. To do so after the completion of the course, we use a post hoc knowledge estimation method developed by Chen et al. [9]. This method, which we call "correctness tracing" (CT) as shorthand, models student learning as the likelihood of students getting specific questions correct on a given attempt for those questions. The method estimates the chance of a student getting "this-item-correct" for a given item (question) at every attempt the student makes on that item, for all items.

We apply CT to student submission data from an on-campus introductory programming course. The course used randomly selected questions from question pools and random item generators for exam creation, with many of the questions appearing previously on homework (and optional practice exams) as a studying motivator for students. We use data from student homework, practice exams, and these proctored exams to build a cohesive snapshot of student experience with the same questions in multiple contexts. Specifically, we share our experience investigating students' learning in this fashion to address the following questions:
RQ1: How much learning happens during required practice activities (homework) relative to elective practice (studying)?

RQ2: Does student learning differ based on the type of question asked (e.g., multiple-choice vs. short answer)?

The rest of our paper is organized as follows. Section 2 describes related work on student learning and knowledge tracing. Section 3 discusses the course from which we collected data and the handling of that data. In Section 4, we explain the assumptions behind CT and detail our use of the method. We follow with our results from the modeling in Section 5 and with interpretation and limitations in Section 6. We conclude in Section 7.

2. RELATED WORK

2.1 How are students learning on homework and through studying?

How students learn is an area of significant study. We are specifically interested here in how formative assessment (e.g., homework) helps students learn. Historically, formative assessment is claimed to benefit student learning, although there is little consensus on what exactly makes good formative assessment [7]. There is evidence, however, that frequent and distributed practice, such as frequent testing, boosts student achievement and learning [1, 4, 24, 32].

Research on homework often considers benefits to students' motivation and self-regulatory ability as opposed to just content learning. Ramdass and Zimmerman used correlational studies to show that homework leads to higher self-regulatory abilities and traits, like time management and self-efficacy [29]. Similarly, Bembenutty and White showed that students who approach homework with help-seeking attitudes and as motivating exercises displayed stronger academic performance [5].

Mandatory homework is found to be beneficial in existing research, but in large part due to feedback. Gutarts and Bains found that homework that provides feedback appears to enhance student performance [14]. However, Johnson and McKenzie found that while mandatory homework may incentivize homework-related motivation and learning, it was not correlated with exam performance in their macroeconomics course [17]. Ryan and Hemmes found homework was correlated with improved quiz performance, but that points are a necessary contingency to get students to do homework, with feedback-only approaches reducing student engagement [31].

The benefits of studying are less clearly defined. Chew suggests the benefit of study can be improved by teaching students how to study, and that expecting students to know how without designing assignments and material to aid their studying may be a mistake on the part of some instructors [11]. Fakcharoenphol et al. found that there was a learning increase in studying old exams with solutions and feedback, but that this learning may be shallow [13].

The idea that studying itself may be comparatively shallow is supported in the literature on long-term retention. Karpicke and Blunt found that the retrieval practice from exams was superior for learning to elaborative studying processes [18]. Additionally, Roediger and Nestojko found that, while studying did improve long-term retention of concepts, retrieval during testing still had superior results [30].

2.2 Knowledge tracing and student modeling

There is a wealth of work on different methods of tracing student knowledge and modeling student learning and student behavior. Many of these stem from Corbett and Anderson's original knowledge tracing paper [12]. Since the original tracing paper, there has been more work on dealing with issues such as student slip and guess behavior, the benefits and traceability of learning resources, and other parts of students' learning environments. Pelanek's significant review shows how learner modeling has grown to encompass domain knowledge structuring, learner clustering, student observations, and more just over the last decade [27]. We address a few below.

Pardos and Heffernan modeled individualized learning in Bayesian knowledge tracing (BKT) [25]. In their method, students' skills were used to set each student's individualized knowledge for more accurate individual knowledge tracing. They later introduced individual item difficulty as a way to make knowledge tracing more robust to unseen items [26]. As opposed to skills being used for individual student priors, Khajah et al. used latent factors pulled from student populations to predict individual student performance [20]. Other approaches use machine learning methods to estimate student guess or slip chances as opposed to students having not yet learned course material [3].

Deep learning methods have also been applied to knowledge tracing in deep knowledge tracing (DKT) [28]. Additions to DKT include prerequisite modeling in students' concepts [10], problem-level features like time to complete and student hint usage [39], and dynamic student grouping based on performance [23]. There is some evidence to suggest that, while DKT is powerful, BKT can similarly be extended and that the gains do not require "deep" learning techniques explicitly [19]. Additionally, methods such as predictive failure analysis can perform similarly to DKT so long as care is taken to structure data appropriately [38].

3. DATA COLLECTION

Our data was collected in a large enrollment, introductory programming course for non-CS majors in Fall 2019. The course had 601 total students, with 246 women and 355 men. The majority of students who took the course were freshmen (67%) and sophomores (21%). The course predominantly taught Python programming with some coverage of basic Excel and HTML/web concepts.

3.1 Course context
The course was organized as a flipped class that covered one major topic each week. Students were expected to complete readings in an interactive textbook and an assignment consisting of true/false and multiple-choice questions prior to lecture. The weekly 90-minute lecture used peer instruction to reinforce concepts, and the weekly 80-minute lab consisted of practice activities students could complete individually or in pairs, supervised by course staff. Finally, each topic culminated with a weekly homework assignment that consisted of a mix of short answer (e.g., "What is the value of the variable x after the following piece of code executes?", "Write a statement that removes the 4th element of a list called 'animals'.") and small programming (i.e., no more than a small function) questions.

Due to the size of the course, almost all of the homework activities were auto-graded. The course used the open-source assessment platform PrairieLearn [35, 36] for all homework and other assessments. PrairieLearn both instantly grades student submissions and provides automatic feedback. Homework assignments were configured for students to be fearless: there was no penalty for wrong answers, only points to gain as they got answers correct. On homework, this allowed students to practice with course content repeatedly until they got the correct answer. Students were able to repeat questions until they earned full credit and revisit questions at any point for studying purposes.

Many of the homework questions were item generators that could produce many possible questions of similar difficulty on the same topic [16]. The true/false and multiple-choice item generators randomly selected items from pre-populated pools of questions. Short answer questions are randomly parameterized (e.g., changing the list a student has to read or changing the method applied to a given list). To encourage mastery, homework often expected students to correctly answer these item generators multiple times. Weekly homework assignments typically included 12 to 30 items or item generators, and students needed to complete 90% of them to achieve a full score on the homework.

The course's primary means of summative assessment was five proctored exams. All the exams had a 50-minute fixed time limit, except for the final exam (E4), which allowed for 3 hours. All but the first exam were worth a significant portion (≥ 10%) of the course grade. These exams were conducted in a proctored computer lab with student-scheduled exam times within a three-day window [40–42]. Students were given access to a Python interpreter and Python's documentation, but no other resources were provided. The exam schedule is given in Figure 1.

Figure 1: Every three weeks the course had a proctored exam (E0 to E4). Weight relative to the final course grade is provided as a percentage. [The figure shows a timeline of weeks 1-15 with E0 (2%), E1 (10%), E2 (10%), E3 (10%), and E4 (20%).]

Exams featured all four kinds of questions seen on homework (T/F, MC, short answer, programming), except for E0, which did not have programming questions. Each exam consisted of 20–30 question slots (41 on the final). Each slot drew randomly from a pool of questions on a given topic with similar difficulties. Most questions permitted students to attempt them multiple times with a score penalty for each subsequent incorrect attempt until chances to earn credit were exhausted.

Because of the course's heavy use of item generators and to motivate students to take homework seriously, a significant fraction of the exams were drawn from the course's pre-lecture and homework assignments. In general, 85–90% of the pools on the exam were drawn from questions previously on homework, and exam-only "hidden" questions were written with similar form and content to previous homework questions. Prior to each exam, students were provided access to a practice exam generator that was similar to the actual exam generator, but without the hidden questions. Reused programming questions are largely recall exercises, as most do not feature random generation. Short answer questions are transfer tasks, as they are all parameterized and no two instances of the question should possess the same exact parameters and the same expected student answer.

In spite of the exams including a large fraction of previously seen material, we don't believe that rote memorization was a useful strategy for these exams due to their heavy use of randomization and question pools combined with a large number of questions (20–30) on the exam. True/false and multiple-choice slots on the exam generally drew from pools of 20 to 100 questions, while short answer and programming question slots had pool sizes of 5 to 12. In addition, short answer item generators typically produce at least dozens of meaningfully different variants.

3.2 Homework and study periods

The decision for exams to mostly use the same questions as homework assignments and practice generators created an interesting context for attributing student learning. Specifically, we could analyze student performance on homework assignments, practice exams, and actual exams to observe how students' ability to answer these questions improved as they engaged with course material. We pulled all student submissions from PrairieLearn for the entire semester, keeping only submissions for any questions that appeared on both homework and exams.

We cleaned this data set by removing students who had not completed all of the exams, retaining 584 of the 601 students. In total, we retained 1,064,547 individual submissions across homework, optional practice, and exams. Each submission's score ranges from 0 (incorrect) to 1 (full credit), with scores in-between indicating partial credit.
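As an illustration of this filtering, a minimal pandas sketch is shown below; the column names (student_id, question_id, exam_id, assessment_type) are hypothetical placeholders rather than the actual PrairieLearn export schema.

    import pandas as pd

    # Hypothetical flat export of submissions; one row per submission.
    subs = pd.read_csv("submissions.csv")

    # Keep only students with submissions on all five proctored exams (E0-E4).
    exam_subs = subs[subs.assessment_type == "exam"]
    exams_taken = exam_subs.groupby("student_id")["exam_id"].nunique()
    complete_students = exams_taken[exams_taken == 5].index
    subs = subs[subs.student_id.isin(complete_students)]

    # Keep only questions that appeared on both homework and an exam.
    hw_questions = set(subs.loc[subs.assessment_type == "homework", "question_id"])
    exam_questions = set(subs.loc[subs.assessment_type == "exam", "question_id"])
    subs = subs[subs.question_id.isin(hw_questions & exam_questions)]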
We subdivide our analysis of the course by exam, focusing on the three-week window preceding each of the five exams. As shown in Figure 2, each exam is comprehensive, including material that was present on previous exams. For this analysis, we focus solely on the content introduced since the previous exam to see how practice during the homework and study periods contributes to learning for the material's first summative assessment.

Figure 2: Exams are cumulative and largely drawn from item generators and questions previously on homework. Hidden questions only appear on exams. New questions were on homework since the previous exam, while old questions were previously on earlier homework and one or more previous exams. While the fraction of exam slots dedicated to old questions does increase as the semester progresses, this figure is somewhat deceptive because old pools typically have many more questions than new and hidden pools, except on the final (E4) where each week's material is represented equally. [The figure shows the question pool composition of each exam, E0 to E4, split into hidden (not on homework), new (on homework since the previous exam), and old (on homework and on a previous exam) questions.]

Each student submission is assigned to one of three periods: homework, study, and exam (Figure 3):
   • The homework period includes all the submissions to homework on or before the homework due date. Submissions in this period represent required practice; while students are allowed as many submissions as they need to get full credit, there is a deadline to receive that credit.

   • The study period includes all submissions on practice exam generators as well as any submissions on homework after the homework deadline. The homework system remains open and students can repeat problems and complete any problems not previously completed (only 90% of questions are needed to achieve a full homework score). Submissions in this period are elective practice, bearing no credit directly.

   • The exam period includes the submissions on the actual exam.

Figure 3: We subdivide the students' practice into two periods: the homework period is all homework attempts before the deadline. The study period is all attempts on practice exams and any homework attempts after the deadline. [The figure shows a timeline running from when homework is assigned, through the homework deadline, to the start of the exam, with the homework, study, and exam periods marked along it.]

The above periods are coarsely defined to capture the difference between the time spent on required practice with homework assignments and any additional practice following the homework deadline. For our context, problems being completed by students on practice exams as well as after a homework deadline are both elective activities and are suitable to be counted together.

For our analysis, we also tag each student's first attempt on each question on homework, so that we can estimate the student's ability to solve that question gained before attempting the question the first time (e.g., from readings, lecture, or solving other problems). A breakdown of the number of submissions during each period is provided in Figure 4. The decrease in submissions throughout the semester in the homework and studying buckets is a result of homework shifting toward fewer, more difficult problems as the semester progresses.
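As a concrete reading of these definitions, the bucketing can be expressed as a small rule; the sketch below is illustrative, and the timestamp fields and on_practice_exam flag are simplifications of the actual submission metadata.

    from datetime import datetime

    def classify_period(submission_time: datetime,
                        hw_deadline: datetime,
                        exam_start: datetime,
                        on_practice_exam: bool = False) -> str:
        """Assign a single submission to the homework, study, or exam period."""
        if submission_time >= exam_start:
            return "exam"       # submissions made during the proctored exam
        if on_practice_exam or submission_time > hw_deadline:
            return "study"      # elective practice: practice exams or post-deadline homework
        return "homework"       # required practice, on or before the homework deadline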
4. METHODS

To analyze the evolution of student knowledge from homework to exam time, we track student learning at the granularity of individual item generators. This is clearly a significant approximation to reality for two reasons: 1) because of pools (of true/false and multiple-choice questions) and parameter randomization (for short answer questions) there is some variation between instances of a given item generator, and 2) there are relationships between item generators (e.g., practice on a programming question relating to loops would likely improve students' ability to complete a short answer question related to loops, and vice versa).

Nevertheless, for our purposes, we believe this approach is viable. The items of each item generator were considered sufficiently similar by the instructor to be fungible with respect to the exams. Furthermore, the method is robust to whether learning occurs between subsequent attempts on the same problem or from students attempting a problem, trying new problems, and returning again to an older problem. If the student learns significantly by completing many other homework problems between two attempts at a given problem during the homework period, we can still correctly attribute the learning to having taken place during the homework period. As such, we made no attempt at topic modeling in this work.

4.1 Correctness tracing: post hoc modeling for student knowledge

In general, knowledge tracing (KT) techniques were developed as predictors of student performance or estimators of the latent knowledge state of students. KT is used either to estimate a student's likelihood of getting the next attempt correct based on previous attempts, adjusting after each success and failure as the student engages with an assessment, or to track changes in students' latent knowledge over time. Much of the difficulty of KT techniques results from attempting to instantaneously obtain a signal of student knowledge as students are engaging with learning opportunities. In our case, we already have all the data from the course, as the course has ended, and do not need an instantaneous, updating measure of student knowledge. Instead, we desire to perform a post hoc analysis of students' submissions to estimate how their learning changed over an entire course's worth of data. Our chosen method, CT, measures students' knowledge as demonstrated by an increase, over time, in the likelihood that they would get given items correct.
Figure 4: The submission count per period. In total, there are 1,064,547 submissions in our data set. As the semester progressed, homework had fewer but harder problems, which accounts for the reduction in submissions. [The figure shows a bar chart of submission counts in the First, Homework, Studying, and Exam buckets for each of Exams 0 through 4.]

The method presented by Chen et al. [9] can be summarized by the following formulation:

    optimize:    L(p_1, ..., p_n; x_1, ..., x_n)
    subject to:  0 ≤ p_i ≤ 1  for all i                          (1)
                 p_i ≤ p_j    for all i < j

where x_1, ..., x_n is the result of a series of submissions, which are either 1 (correct) or 0 (incorrect), and the method tries to find a series of predictions p_1, ..., p_n that optimizes the loss function, under the constraints that: (1) p_1, ..., p_n are between 0 and 1, as they represent an estimate of the instantaneous probability that the student would get each attempt correct, and (2) p_1, ..., p_n are monotonically non-decreasing, which is based on the assumptions that the attempts are made over a short enough time period that forgetting is insignificant and that additional practice would not hurt a student's ability to answer these questions. Since the homework, practice, and exam attempts occurred over a three-week window, during which a great deal of related practice took place, we believe these assumptions are reasonable. Rather than having a model with explicit parameters as found in BKT, the method calculates the probabilities p_1, ..., p_n by optimizing them directly for the target loss function. Chen et al. have shown that minimizing root-mean-square error (RMSE) and maximizing log-likelihood yield the same optimal solution under the constraints specified in Equation 1.

We chose to use CT over BKT or DKT as it nicely fit our use case. The CT method is able to finely locate and predict the "jumps" in a student's likelihood of getting a question correct when analyzing the data in a post hoc fashion, which may be too precise a transition for usual predictive knowledge tracing. For our purposes, a high-accuracy, post hoc model was ideal for analyzing changing student knowledge as a historical trend from our course's data.

One important weakness of CT, however, is that it is prone to underestimate student knowledge on an incorrect first attempt because the optimizer sets the probability of correctness to be zero so as to minimize error on that attempt. Similarly, the probability on a correct final attempt will always be estimated as 1.0, which may be an overestimate. This potentially could be remedied by adding additional constraints to the method (e.g., limiting the rate of increase), but we did not attempt such constraints in this work.

4.2 Demonstrating CT using "Harlow", a sample student

To clarify our use of CT, we present a walk-through of how the method models our data for one individual and two questions from Exam 3; the individual was selected randomly from students whose behavior allows for representative variety in CT's estimates. We refer to the student as Harlow, a name that was not present in the actual class. On Exam 3, two of the questions that were randomly selected for Harlow to complete were the programming question progLargestLessThanValue and the short answer question valueOfListReordering.

Harlow had notably different experiences with these questions; Figure 5 plots the correctness of Harlow's individual submissions as dots that are color coded based on the period in which the submission occurred. With progLargestLessThanValue, Harlow made two attempts on homework to get the question correct once, got it correct once on a practice exam with a single attempt, and tried it twice on Exam 3 without getting a correct answer. With valueOfListReordering, Harlow had 9 attempts on homework with 6 correct submissions, 4 encounters across two practice exams for 2 correct submissions total, and a correct answer as the only attempt on Exam 3.
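For a single attempt sequence, Equation 1 with a squared-error loss is an instance of bounded isotonic regression, so each trace can be fit with the standard pool-adjacent-violators algorithm. The sketch below does this with scikit-learn's isotonic regression (an illustrative choice, not the implementation of Chen et al. [9]) and applies it to Harlow's progLargestLessThanValue attempts as described above; the fitted probability never rises above 0.5, consistent with Figure 5.

    # A minimal sketch of correctness tracing for one student on one item:
    # fit Equation 1 with a squared-error loss via isotonic regression.
    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def correctness_trace(outcomes):
        """Fit non-decreasing probabilities p_1..p_n in [0, 1] to a 0/1
        outcome sequence x_1..x_n, minimizing squared error (RMSE)."""
        x = np.asarray(outcomes, dtype=float)
        attempt_index = np.arange(len(x))  # attempts in time order
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=True)
        return iso.fit_transform(attempt_index, x)

    # Harlow's attempts on progLargestLessThanValue, in order:
    # homework: incorrect, correct; practice exam: correct; exam: incorrect, incorrect.
    print(correctness_trace([0, 1, 1, 0, 0]))  # -> [0.0, 0.5, 0.5, 0.5, 0.5]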
Figure 5: The results of running CT on Harlow's answers to our two selected questions. Harlow eventually appeared to learn how to do valueOfListReordering. However, Harlow's ability to complete the programming question never stabilized, so the model never attributed more than a 50% chance that Harlow had learned the question's material. [The figure, titled "Harlow's Knowledge Over Time", has one panel per question (progLargestLessThanValue and valueOfListReordering); each plots Harlow's submissions, colored by period (First, Homework, Studying, Exam), and the fitted chance of getting the next item correct against the submission number.]

Figure 6: The changing average "this-item-correct" chance from CT per period. CT suggests the majority of student learning is occurring during the homework period, although the study period is also significant. [The figure plots, for each of Exams 0 through 4, the average chance of getting the item correct at six points: First, End Homework, Start Studying, End Studying, First Exam, and End Exam.]

Figure 5 also shows the result of running CT as a line indicating the instantaneous estimate of Harlow's likelihood of getting the question correct. In both cases, Harlow got the first attempt wrong, so the model assigns Harlow's likelihood of getting the question correct as 0%, so as to minimize the error on that attempt. For the remaining attempts, the model computes a likelihood of correctness for each attempt that minimizes the error for those correct and incorrect attempts, constrained to be non-decreasing. Because Harlow's last three attempts at valueOfListReordering were all correct, the model decides that Harlow has mastered the question, with a 100% likelihood of getting the question correct.

We ran CT for each student on each question independently. From each trace, we extract six estimates of the student's likelihood of getting a question right: their first and last attempts in the homework period (First, End Homework), their first and last attempts in the study period (Start Studying, End Studying), and their first and last attempts on the exam (First Exam, End Exam). Any student without a submission in a period (i.e., students who did not study or students who did not get that question on their exam) has their previous submission to that point in the timeline used, in compliance with CT's assumption that students do not forget. We then average these likelihoods across all students and all questions for a given exam period. This allows us to explore the changing student knowledge as an average for all the students in a course across the different learning opportunities presented by homework, studying, and assessment.
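The extraction of these six estimates, and the comparison of homework and study gains reported in Section 5, can be sketched as follows. The representation of a trace as time-ordered (period, probability) pairs, the default of 0 before any submission, and the measurement of each period's gain from its first to its last estimate are simplifications for illustration.

    PERIODS = ("homework", "study", "exam")

    def six_estimates(trace):
        """trace: time-ordered list of (period, p) pairs from correctness tracing.
        Returns (first, last) estimates per period, carrying the most recent
        estimate forward when a period has no submissions (no forgetting)."""
        estimates = {}
        last_seen = 0.0  # simplification: assume 0 before any submission
        for period in PERIODS:
            ps = [p for per, p in trace if per == period]
            first = ps[0] if ps else last_seen
            last = ps[-1] if ps else last_seen
            estimates[period] = (first, last)
            last_seen = last
        return estimates

    def attribute_learning(all_traces):
        """all_traces: list of six_estimates() results, one per student-question pair.
        Returns the share of pre-exam learning attributed to homework vs. study."""
        hw_gain = sum(t["homework"][1] - t["homework"][0] for t in all_traces)
        study_gain = sum(t["study"][1] - t["study"][0] for t in all_traces)
        total = hw_gain + study_gain
        return hw_gain / total, study_gain / total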
5. RESULTS

5.1 CT attributes significant learning to both the homework and study period; homework contributes slightly more

The results of running CT are shown in Figure 6. From the slopes of the lines, it can be seen that CT estimates that more learning is occurring (i.e., the change in student likelihood of correct attempts is larger) during the homework period than the study period. The plot suggests that the course material tends to get more difficult as the semester progresses, with the initial and final likelihood of correctness both decreasing as we move from Exam 0 to 3. Furthermore, the lines for Exams 0 through 3 show almost identical trends. Exam 4, the final exam, behaves differently from the other four exams, which we'll consider in the discussion section.

Figure 7 plots the change in likelihood of correctness from the beginning to the end of each period. When we compare the pre-exam increase in student knowledge (as measured by likelihood of correctness) between the homework and study period, CT attributes 57–65% of the learning to the homework period and 35–43% to the study period, across the five exams. The CT method also attributes some learning to the exam period, which we'll consider in the discussion section.

5.2 Learning trends are largely independent of question type

To address RQ2, we disaggregated the exam data sets by question type to see whether there was any notable difference between types. For this analysis, we omitted Exam 0, as Exam 0 did not feature programming questions.

Figure 8 shows the per-question type CT results. The only notable finding is that different questions start at different levels of initial student knowledge and end with different amounts of knowledge, which changes the starting and ending points in Figure 8. Because of this, different questions drop off faster than others in terms of how much is learned during the practice period. Generally, students have less to learn with true/false and multiple-choice questions through the practice period than they do on programming and short answer questions, although all question types experience a learning drop-off through to the exam.

6. DISCUSSION AND LIMITATIONS

6.1 RQ1: Students in this course learn slightly more during the homework period than the study period
Figure 7: The average change in student knowledge by period according to CT. The largest change occurs during the homework period, with a smaller change from study, and the smallest on exams. [The figure plots the change in average student knowledge during the Homework, Studying, and Exam periods for each of Exams 0 through 4.]

[Figure 8: "Average 'This-Item-Correct' Chance By Period and Question Type (All Exams)"; one panel per exam (Exams 1 through 4), with separate lines for programming, short answer, true/false, and multiple-choice questions, plotting the average chance of getting the item correct at First, End Homework, Start Studying, End Studying, First Exam, and End Exam.]



                                                                                                                                                                0.2                                                                               0.2

                                                                                                                                                                  0.0st               rk              g0.2                                0.4 m              t 0.6 rk                       0.8g                      m 1.0
                                                                                                                                                                   Fir              wo              in            ing           xa
                                                                                                                                                                                                                                   m
                                                                                                                                                                                                                                             xa         Fir
                                                                                                                                                                                                                                                            s       o            yin
                                                                                                                                                                                                                                                                                    g
                                                                                                                                                                                                                                                                                             yin           am      xa
                                                                                                                                                                                 e           tu  dy          tudy             tE          dE                      ew         tud         tud            Ex       dE
                                                                                                                                                                               om                                          s           En                      om                                    st       En
                                                                                                                                                                            dH           rt S       En
                                                                                                                                                                                                         dS             Fir                                dH           rt S       En
                                                                                                                                                                                                                                                                                      dS         Fir
                                                                                                                                                                      En              Sta                                                               En          Sta
CT attributes more learning to the mandatory homework period in this particular course. This is represented as the largest increase in student knowledge from students’ first homework submission to their last. That gives us some confidence that a course with significant homework opportunities provides students with productive chances to learn, rather than just inundating them with “busy work.”
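To make the attribution concrete, the following is a minimal sketch (not the paper’s actual pipeline) of how the per-period changes summarized in Figure 7 could be computed from CT’s per-attempt estimates. The record layout, checkpoint names, and numbers are hypothetical; the only assumption carried over from the text is that CT yields a “this-item-correct” probability for each student–item attempt.

```python
# Minimal sketch: attribute change in estimated knowledge to course periods.
# Each record is a hypothetical (student, item, checkpoint, p_correct) tuple,
# where p_correct is the CT-estimated chance of answering the item correctly
# at that checkpoint. Values below are illustrative, not the course data.
from collections import defaultdict
from statistics import mean

CHECKPOINTS = ["first_attempt", "end_homework", "end_studying", "end_exam"]

records = [
    ("s1", "gen_loops", "first_attempt", 0.35),
    ("s1", "gen_loops", "end_homework", 0.70),
    ("s1", "gen_loops", "end_studying", 0.80),
    ("s1", "gen_loops", "end_exam", 0.84),
    ("s2", "gen_loops", "first_attempt", 0.20),
    ("s2", "gen_loops", "end_homework", 0.55),
    ("s2", "gen_loops", "end_studying", 0.72),
    ("s2", "gen_loops", "end_exam", 0.75),
]

# Index the estimates by (student, item) so consecutive checkpoints can be differenced.
knowledge = defaultdict(dict)
for student, item, checkpoint, p_correct in records:
    knowledge[(student, item)][checkpoint] = p_correct

# Change during a period = estimate at its end minus estimate at the previous checkpoint.
periods = list(zip(CHECKPOINTS[:-1], CHECKPOINTS[1:]))  # homework, studying, exam
period_changes = defaultdict(list)
for estimates in knowledge.values():
    for start, end in periods:
        if start in estimates and end in estimates:
            period_changes[end].append(estimates[end] - estimates[start])

for start, end in periods:
    label = end.replace("end_", "")  # homework, studying, exam
    print(f"average change during {label}: {mean(period_changes[end]):+.3f}")
```

With these toy numbers the homework period shows the largest average gain and the exam period the smallest, mirroring the pattern reported in Figure 7.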

Interestingly, CT also indicates that performance on the exam is better than at the end of the study period. There are a few possible explanations for this. The most likely is that, given the higher stakes of the exam, students try harder, producing a higher correct rate for the model to observe. In addition, some of the score improvement observed on the exam could be attributed to the last pre-exam practice attempt if, for example, a student got the question wrong but learned from seeing the correct answer. It might also simply be an artifact of CT, since any student with both incorrect and correct attempts on a given exam question will have learning attributed to them. Finally, actual learning might be occurring during the exam. In any case, the amount of “learning” attributed to the exam period is fairly negligible.

Importantly, one should not generalize from these results about the learning potential of homework relative to elective practice for all courses. We expect that courses that assign less homework might observe less learning during the homework period, with students compensating by studying more, thereby shifting more of the learning into that optional studying. It could also be the case that there are diminishing returns on each attempt at a specific question, with the first attempt providing the most learning benefit, then the second, and so on, decreasing with each attempt from homework through the study period. It is reassuring, though, to see that this course’s homework and study opportunities (i.e., the practice exam generators made available to students) both appear to contribute significantly to student learning.

6.2   RQ2: All question types show similar learning trends

When we disaggregate the analysis by question type, the general shape and progression of results is the same for every question type on the corresponding exam. Question types start at different levels of student knowledge, but this appears to be mostly a function of the difficulty of the problem type: programming and short-answer questions, which require more actual coding on the students’ part, tended to start and end lower.

The lack of different behavior when we disaggregate by question type is more interesting than it may initially appear. It means that the “shape” of student learning does not differ significantly with the question type. Given this, homework and additional studying appear to have the same impact on student results regardless of the kind of question. This does imply diminishing returns on easier question types over the period compared to harder ones, but it is not a deficiency in how homework and practice help on question types where students still have learning left to do.
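As a companion to that comparison, the short sketch below illustrates one way to check the “same shape, different starting point” observation: shift each question type’s average CT trajectory so that it starts at zero and compare the remaining gains. The question types, checkpoints, and values here are invented for illustration, not taken from the course data.

```python
# Minimal sketch: compare the *shape* of learning trajectories across question types
# by removing each type's starting level, leaving only the gains over the periods.
# Question types, checkpoints, and values are illustrative, not the paper's data.
CHECKPOINTS = ["first_attempt", "end_homework", "end_studying", "first_exam", "end_exam"]

trajectories = {
    "multiple_choice": [0.65, 0.85, 0.90, 0.92, 0.93],
    "programming":     [0.30, 0.60, 0.72, 0.75, 0.78],
}

for qtype, values in trajectories.items():
    gains = [round(v - values[0], 2) for v in values]  # trajectory relative to its start
    print(qtype, dict(zip(CHECKPOINTS, gains)))
```

If the gain vectors look alike across types, the trajectories differ mainly in their starting point, which is the pattern described above.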
6.3   Limitations

There are some obvious limitations to the current work. First, our findings about the relative learning during the homework and study periods cannot be assumed to generalize to other course contexts. Courses with different homework, study materials, and exam structures will likely have different breakdowns of learning in each phase.

Second, CT is a fairly coarse measure of learning. Scores as a performance indicator are not, on their own, proof of student learning. Additionally, CT’s potential for underestimating the likelihood of correctness on first attempts (by strictly optimizing for RMSE) could make the model overestimate the learning that occurs in the first few attempts, most of which fall in the homework period. We do not have confidence that these measures of learning are particularly precise. While we omit it from the paper, we also ran a regression model to estimate the learning in the same periods of the course. The regression generally showed the same trends as CT, giving us more confidence in CT’s results.

Finally, these methods do not disambiguate between learning that happens during the homework and studying periods and learning that comes specifically from the homework and elective practice problems themselves. There are notable reasons to believe that students also learn significantly from reading the textbook, engaging in active learning exercises, and, perhaps, even from listening to the lecturer speak. The learning that occurs during these activities is attributed to the period in which it occurs, rather than to the specific task.

7.   CONCLUSION

In this work, we explored the degree to which we can attribute student learning between required homework and elective study performed prior to a summative assessment. To analyze learning, we used a post hoc method of “this-item-correct” likelihood (correctness tracing) to estimate student knowledge. We found that (required) homework and (elective) studying both contributed significantly to student learning, with homework contributing slightly more. Further, despite using multiple question types, we found that the most notable difference between question types is where student knowledge starts, not the shape of their learning improvements.

We think these results show that frequent, exam-relevant homework and highly accessible means of study (e.g., practice exam generators) are both effective means of facilitating student learning, and we believe these findings could generalize to other contexts. The magnitude of learning from each component may differ, but courses with similar homework and studying opportunities will hopefully see similar learning gains during each period.

There remain areas for future work. Regarding data, we only use students’ submissions to questions that also appear on homework. Some ability to include other learning events, such as reading a textbook, would give a clearer picture of students’ learning process. Additionally, some topic-level labeling might allow us to include questions unique to exams in our data and analysis.

With respect to CT’s model, we made no attempt to compensate for the method’s tendency to underestimate on initial incorrect attempts. Future work could investigate constraining this behavior by limiting the allowable slope. Further, there is room to adapt the model to use a richer source of information than students’ correctness on submissions, for example by fitting a similar optimization on students’ knowledge as estimated by methods such as Item Response Theory (IRT).
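To illustrate the kind of slope constraint mentioned above, here is a hypothetical post hoc sketch that caps how much the estimated knowledge may rise between consecutive attempts. It is not the authors’ method; in practice such a constraint would more likely be imposed inside CT’s optimization itself, and the max_slope value and example numbers are invented.

```python
# Hypothetical sketch of the proposed slope constraint: cap how much the estimated
# knowledge may rise between consecutive attempts, so a single wrong-then-right pair
# cannot produce an arbitrarily steep jump. max_slope and the estimates are illustrative.
def limit_slope(estimates, max_slope=0.25):
    """Return a copy of the per-attempt knowledge estimates with rises clamped."""
    constrained = [estimates[0]]
    for value in estimates[1:]:
        ceiling = constrained[-1] + max_slope
        constrained.append(min(value, ceiling))
    return constrained

# Example: an unconstrained fit that jumps from 0.10 to 0.90 after one attempt.
print(limit_slope([0.10, 0.90, 0.95, 0.97]))  # -> [0.1, 0.35, 0.6, 0.85]
```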
8.   REFERENCES

[1] E. Bailey, J. Jensen, J. Nelson, H. Wiberg, and J. Bell. Weekly formative exams and creative grading enhance student learning in an introductory biology course. CBE—Life Sciences Education, 16(1):ar2, 2017.
[2] J.-A. Baird, D. Andrich, T. N. Hopfenbeck, and G. Stobart. Assessment and learning: fields apart? Assessment in Education: Principles, Policy & Practice, 24(3):317–350, July 2017.
[3] R. S. J. d. Baker, A. T. Corbett, and V. Aleven. More accurate student modeling through contextual estimation of slip and guess probabilities in Bayesian knowledge tracing. In B. P. Woolf, E. Aïmeur, R. Nkambou, and S. Lajoie, editors, Intelligent Tutoring Systems, pages 406–415, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.
[4] G. Başol and G. Johanson. Effectiveness of frequent testing over achievement: A meta analysis study. Journal of Human Sciences, 6(2):99–121, July 2009.
[5] H. Bembenutty and M. C. White. Academic performance and satisfaction with homework completion among college students. Learning and Individual Differences, 24:83–88, Apr. 2013.
[6] J. Bempechat. The motivational benefits of homework: a social-cognitive perspective. Theory Into Practice, 43(3):189–196, Aug. 2004.
[7] R. E. Bennett. Formative assessment: a critical review. Assessment in Education: Principles, Policy & Practice, 18(1):5–25, Feb. 2011.
[8] P. Black and D. Wiliam. Developing the theory of formative assessment. Educational Assessment, Evaluation and Accountability (formerly: Journal of Personnel Evaluation in Education), 21(1):5, Jan. 2009.
[9] B. Chen, M. West, and C. B. Zilles. Towards a model-free estimate of the limits to student modeling accuracy. In K. E. Boyer and M. Yudelson, editors, Proceedings of the 11th International Conference on Educational Data Mining, EDM 2018, Buffalo, NY, USA, July 15-18, 2018. International Educational Data Mining Society (IEDMS), 2018.
[10] P. Chen, Y. Lu, V. W. Zheng, and Y. Pian. Prerequisite-driven deep knowledge tracing. In 2018 IEEE International Conference on Data Mining (ICDM), pages 39–48, 2018.
[11] S. L. Chew. Helping students to get the most out of studying. Acknowledgments and Dedication, page 215, 2014.
[12] A. T. Corbett and J. R. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4):253–278, Dec. 1994.
[13] W. Fakcharoenphol, E. Potter, and T. Stelzer. What students learn when studying physics practice exam problems. Phys. Rev. ST Phys. Educ. Res., 7:010107, May 2011.
[14] B. Gutarts and F. Bains. Does mandatory homework have a positive effect on student achievement for college students studying calculus? Mathematics and Computer Education, 44(3):232–244, Fall 2010.
[15] M. K. Hartwig and J. Dunlosky. Study strategies of college students: Are self-testing and scheduling related to achievement? Psychonomic Bulletin and Review, 19:126–134, 2012.
[16] S. Irvine and P. Kyllonen. Item Generation for Test Development. Lawrence Erlbaum Associates, 2002.
[17] J. A. Johnson and R. McKenzie. The effect on student performance of web-based learning and homework in microeconomics. Journal of Economics and Economic Education Research, 14(2):115–125, 2013.
[18] J. D. Karpicke and J. R. Blunt. Retrieval practice produces more learning than elaborative studying with concept mapping. Science, 331(6018):772–775, 2011.
[19] M. Khajah, R. V. Lindsey, and M. C. Mozer. How deep is knowledge tracing? CoRR, abs/1604.02416, 2016.
[20] M. Khajah, R. Wing, R. Lindsey, and M. Mozer. Integrating latent-factor and knowledge-tracing models to predict individual differences in learning. In Educational Data Mining 2014. Citeseer, 2014.
[21] J. Laverty, S. Underwood, R. Matz, L. Posey, J. Carmel, M. Caballero, C. L. Fata-Hartley, D. Ebert-May, S. E. Jardeleza, and M. M. Cooper. Characterizing college science assessments: The three-dimensional learning assessment protocol. PLoS ONE, 11(9):e0162333, 2016.
[22] P. Magalhães, D. Ferreira, J. Cunha, and P. Rosário. Online vs traditional homework: A systematic review on the benefits to students' performance. Computers & Education, 152:103869, July 2020.
[23] S. Minn, Y. Yu, M. C. Desmarais, F. Zhu, and J. Vie. Deep knowledge tracing and dynamic student classification for knowledge tracing. In 2018 IEEE International Conference on Data Mining (ICDM), pages 1182–1187, 2018.
[24] J. W. Morphew, M. Silva, G. Herman, and M. West. Frequent mastery testing with second-chance exams leads to enhanced student learning in undergraduate engineering. Applied Cognitive Psychology, 34(1):168–181, 2020.
[25] Z. A. Pardos and N. T. Heffernan. Modeling individualization in a Bayesian networks implementation of knowledge tracing. In P. De Bra, A. Kobsa, and D. Chin, editors, User Modeling, Adaptation, and Personalization, pages 255–266, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
[26] Z. A. Pardos and N. T. Heffernan. KT-IDEM: Introducing item difficulty to the knowledge tracing model. In J. A. Konstan, R. Conejo, J. L. Marzo, and N. Oliver, editors, User Modeling, Adaption and Personalization, pages 243–254, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.
[27] R. Pelánek. Bayesian knowledge tracing, logistic models, and beyond: an overview of learner modeling techniques. User Modeling and User-Adapted Interaction, 27(3):313–350, Dec. 2017.
[28] C. Piech, J. Spencer, J. Huang, S. Ganguli, M. Sahami, L. Guibas, and J. Sohl-Dickstein. Deep knowledge tracing. arXiv preprint arXiv:1506.05908, 2015.
[29] D. Ramdass and B. J. Zimmerman. Developing self-regulation skills: The important role of homework. Journal of Advanced Academics, 22(2):194–218, 2011.
[30] H. L. Roediger and J. F. Nestojko. The relative benefits of studying and testing on long-term retention. Cognitive modeling in perception and memory: A festschrift for Richard M. Shiffrin, pages 99–111, 2015.
[31] C. S. Ryan and N. S. Hemmes. Effects of the contingency for homework submission on homework submission and quiz performance in a college course. Journal of Applied Behavior Analysis, 38(1):79–88, 2005.
[32] M. L. Still and J. D. Still. Contrasting traditional in-class exams with frequent online testing. Journal of Teaching and Learning with Technology, 4(2):30, 2015.
[33] B. W. Tuckman. Using tests as an incentive to motivate procrastinators to study. The Journal of Experimental Education, 66(2):141–147, 1998.
[34] C. K. Waugh and N. E. Gronlund. Assessment of Student Achievement (10th Edition). Pearson, 2012.
[35] M. West, G. L. Herman, and C. Zilles. PrairieLearn: Mastery-based online problem solving with adaptive scoring and recommendations driven by machine learning. In 2015 ASEE Annual Conference & Exposition, Seattle, Washington, 2015. ASEE Conferences.
[36] M. West, N. Walters, M. Silva, T. Bretl, and C. Zilles. Integrating diverse learning tools using the PrairieLearn platform. In Seventh SPLICE Workshop at SIGCSE 2021 (Virtual event), March 2021.
[37] D. Wiliam. What is assessment for learning? Studies in Educational Evaluation, 37(1):3–14, 2011.
[38] X. Xiong, S. Zhao, E. G. Van Inwegen, and J. E. Beck. Going deeper with deep knowledge tracing. International Educational Data Mining Society, 2016.
[39] L. Zhang, X. Xiong, S. Zhao, A. Botelho, and N. T. Heffernan. Incorporating rich features into deep knowledge tracing. In Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale, L@S '17, pages 169–172, New York, NY, USA, 2017. Association for Computing Machinery.
[40] C. Zilles, R. T. Deloatch, J. Bailey, B. B. Khattar, W. Fagen, C. Heeren, D. Mussulman, and M. West. Computerized testing: A vision and initial experiences. In American Society for Engineering Education (ASEE) Annual Conference, 2015.
[41] C. Zilles, M. West, G. Herman, and T. Bretl. Every university should have a computer-based testing facility. In Proceedings of the 11th International Conference on Computer Supported Education (CSEDU), May 2019.
[42] C. Zilles, M. West, D. Mussulman, and T. Bretl. Making testing less trying: Lessons learned from operating a Computer-Based Testing Facility. In 2018 IEEE Frontiers in Education (FIE) Conference, San Jose, California, 2018.