=Paper=
{{Paper
|id=Vol-3051/CSEDM_1
|storemode=property
|title=How productive are homework and elective practice? Applying a post hoc modeling of student knowledge in a large, introductory computing course (Full Paper)
|pdfUrl=https://ceur-ws.org/Vol-3051/CSEDM_1.pdf
|volume=Vol-3051
|authors=Max Fowler,Binglin Chen,Matthew West,Craig Zilles
|dblpUrl=https://dblp.org/rec/conf/edm/FowlerC0Z21
}}
==How productive are homework and elective practice? Applying a post hoc modeling of student knowledge in a large, introductory computing course (Full Paper)==
How productive are homework and elective practice? Applying a post hoc modeling of student knowledge in a large, introductory computing course

Max Fowler, University of Illinois, Urbana, IL, USA (mfowler5@illinois.edu)
Binglin Chen, University of Illinois, Urbana, IL, USA (chen386@illinois.edu)
Matthew West, University of Illinois, Urbana, IL, USA (mwest@illinois.edu)
Craig Zilles, University of Illinois, Urbana, IL, USA (zilles@illinois.edu)

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

In this paper, we attempt to estimate how much learning happens in required practice activities (homework) relative to elective practice activities (studying). This analysis is done in the context of a large-enrollment (N = 601) introductory programming course that made heavy use of auto-graded, randomizing question (item) generators. Because these item generators (and other problems) were used as homework, on practice exams, and as part of exams, a given student may have encountered the same generator multiple times during the class, providing snapshots of the evolution of the student's ability to complete that problem correctly.

We use a post hoc model of "this-item-correct" prediction to estimate individual student knowledge on each attempt of a given question. Across five exams, correctness tracing attributes 57–65% of the learning that occurs to the homework period and the remainder to elective practice (the study period).

Keywords: assessment; CS1; exams; student learning; homework

1. INTRODUCTION

A well-designed course provides students with many opportunities to learn (e.g., readings, direct instruction, activities with peers, homework). While summative assessment allows us to estimate how much learning has occurred, it doesn't shed light on where the learning happened. If we could attribute learning to the activities in which it occurred, this would allow teachers to increase their use of effective activities and deprecate ineffective ones. Our goal as educators is to engage students, to assist both them and us in diagnosing their progress, and to provide formative experiences during their learning careers [8, 15, 21, 34].

In most courses, the bulk of the students' time is spent outside of course meetings, either completing homework or performing elective practice (studying). It has been shown that well-formed homework has a positive impact on student performance and motivation [5, 6, 14, 22]. There are, however, disagreements among experts in the learning and assessment communities on how to craft good homework [2, 37]. Studying is usually motivated by a desire to score well on exams and does not typically have a grade associated with it [33].

We were curious to explore the degree to which we can attribute student learning between two kinds of formative practice activities: required homework and elective practice performed prior to a summative assessment. Additionally, as our course utilizes multiple types of questions, we were curious to know whether student experiences differed between types. To do so after the completion of the course, we use a post hoc knowledge estimation method developed by Chen et al. [9]. This method, which we call "correctness tracing" (CT) for short, models student learning as the likelihood of students getting specific questions correct on a given attempt. The method estimates the chance of a student getting "this-item-correct" for a given item (question) at every attempt the student makes on that item, for all items.

We apply CT to student submission data from an on-campus introductory programming course.
The course used randomly selected questions from question pools and random item generators for exam creation, with many of the questions appearing previously on homework (and optional practice exams) as a studying motivator for students. We use data from student homework, practice exams, and these proctored exams to build a cohesive snapshot of student experience with the same questions in multiple contexts. Specifically, we share our experience investigating students' learning in this fashion to address the following questions:

RQ1: How much learning happens during required practice activities (homework) relative to elective practice (studying)?

RQ2: Does student learning differ based on the type of questions (e.g., multiple-choice vs. short answer) asked?

The rest of our paper is organized as follows. Section 2 describes related work on student learning and knowledge tracing. Section 3 discusses the course from which we collected data and the handling of that data. In Section 4, we explain the assumptions behind CT and detail our use of the method. We follow with our results from the modeling in Section 5 and with interpretation and limitations in Section 6. We conclude in Section 7.

2. RELATED WORK

2.1 How are students learning on homework and through studying?

How students learn is an area of significant study. We are specifically interested here in how formative assessment (e.g., homework) helps students learn. Historically, formative assessment is claimed to benefit student learning, although there is little consensus on what exactly makes good formative assessment [7]. There is evidence, however, that frequent and distributed practice, such as frequent testing, boosts student achievement and learning [1, 4, 24, 32].

Research on homework often considers benefits to students' motivation and self-regulatory ability as opposed to just content learning. Ramdass and Zimmerman used correlational studies to show that homework leads to higher self-regulatory abilities and traits, like time management and self-efficacy [29]. Similarly, Bembenutty and White showed that students who approach homework with help-seeking attitudes and as motivating exercises displayed stronger academic performance [5].

Mandatory homework is found to be beneficial in existing research, but in large part due to feedback. Gutarts and Bains found that homework that provides feedback appears to enhance student performance [14]. However, Johnson and McKenzie found that while mandatory homework may incentivize homework-related motivation and learning, it was not correlated with exam performance in their macroeconomics course [17]. Ryan and Hemmes found homework was correlated with improved quiz performance, but that points are a necessary contingency to get students to do homework, with feedback-only approaches reducing student engagement [31].

The benefits of studying are less clearly defined. Chew suggests the benefit of study can be improved by teaching students how to study, and that expecting students to know how without designing assignments and material to aid their studying may be a mistake on the part of some instructors [11]. Fakcharoenphol et al. found that there was a learning increase from studying old exams with solutions and feedback, but that this learning may be shallow [13].

The idea that studying itself may be comparatively shallow is supported in the literature on long-term retention. Karpicke and Blunt found that retrieval practice from exams was superior for learning compared to elaborative studying processes [18]. Additionally, Roediger and Nestojko found that, while studying did improve long-term retention of concepts, retrieval during testing still had superior results [30].
2.2 Knowledge tracing and student modeling

There is a wealth of work on different methods of tracing student knowledge and modeling student learning and student behavior. Many of these stem from Corbett and Anderson's original knowledge tracing paper [12]. Since that original paper, there has been further work on issues such as student slip and guess behavior, the benefits and traceability of learning resources, and other parts of students' learning environments. Pelánek's review shows how learner modeling has grown to encompass domain knowledge structuring, learner clustering, student observations, and more just over the last decade [27]. We address a few of these below.

Pardos and Heffernan modeled individualized learning in Bayesian knowledge tracing (BKT) [25]. In their method, students' skills were used to set each student's individualized knowledge for more accurate individual knowledge tracing. They later introduced individual item difficulty as a way to make knowledge tracing more robust to unseen items [26]. As opposed to using skills for individual student priors, Khajah et al. used latent factors pulled from student populations to predict individual student performance [20]. Other approaches use machine learning methods to estimate student guess or slip chances as opposed to students having not yet learned course material [3].

Deep learning methods have also been applied to knowledge tracing in deep knowledge tracing (DKT) [28]. Additions to DKT include prerequisite modeling of students' concepts [10], problem-level features like time to complete and student hint usage [39], and dynamic student grouping based on performance [23]. There is some evidence to suggest that, while DKT is powerful, BKT can be similarly extended and that the gains do not require "deep" learning techniques explicitly [19]. Additionally, methods such as predictive failure analysis can perform similarly to DKT so long as care is taken to structure data appropriately [38].

3. DATA COLLECTION

Our data was collected in a large-enrollment, introductory programming course for non-CS majors in Fall 2019. The course had 601 total students, with 246 women and 355 men. The majority of students who took the course were freshmen (67%) and sophomores (21%). The course predominantly taught Python programming with some coverage of basic Excel and HTML/web concepts.

3.1 Course context

The course was organized as a flipped class that covered one major topic each week. Students were expected to complete readings in an interactive textbook and an assignment consisting of true/false and multiple-choice questions prior to lecture.
The weekly 90-minute lecture used peer instruction to reinforce concepts, and the weekly 80-minute lab consisted of practice activities students could complete individually or in pairs, supervised by course staff. Finally, each topic culminated in a weekly homework assignment that consisted of a mix of short answer questions (e.g., "What is the value of the variable x after the following piece of code executes?", "Write a statement that removes the 4th element of a list called 'animals'.") and small programming questions (i.e., no more than a small function).

Due to the size of the course, almost all of the homework activities were auto-graded. The course used the open-source assessment platform PrairieLearn [35, 36] for all homework and other assessments. PrairieLearn both instantly grades student submissions and provides automatic feedback. Homework assignments were configured for students to be fearless: there was no penalty for wrong answers, only points to gain as they got answers correct. This allowed students to practice with course content repeatedly until they got the correct answer. Students were able to repeat questions until they earned full credit and revisit questions at any point for studying purposes.

Many of the homework questions were item generators that could produce many possible questions of similar difficulty on the same topic [16]. The true/false and multiple-choice item generators randomly selected items from pre-populated pools of questions. Short answer questions are randomly parameterized (e.g., changing the list a student has to read or changing the method applied to a given list).
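To make the idea of a parameterized short answer generator concrete, the following is a minimal hypothetical sketch in the spirit of the "animals" example above; the function and field names are ours, and this is not the course's actual PrairieLearn question code.

```python
import random

def generate_list_removal_variant(rng=random):
    """Hypothetical short answer item generator: each call builds a fresh
    prompt about removing a random element from a randomly generated list,
    plus the list state used to auto-grade a correct answer."""
    animals = rng.sample(["cat", "dog", "emu", "fox", "hen", "owl", "pig"], k=5)
    index = rng.randint(0, len(animals) - 1)
    prompt = (f"Given animals = {animals}, write a statement that removes "
              f"the element at index {index} of the list 'animals'.")
    expected = list(animals)
    del expected[index]  # the list a correct student answer should produce
    return {"prompt": prompt, "expected_list": expected}

# Two calls yield two different but equivalent instances of the same item.
print(generate_list_removal_variant())
print(generate_list_removal_variant())
```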
To encourage mastery, homework often expected students to correctly answer these item generators multiple times. Weekly homework assignments typically included 12 to 30 items or item generators, and students needed to complete 90% of them to achieve a full score on the homework.

The course's primary means of summative assessment was five proctored exams. All the exams had a 50-minute fixed time limit, except for the final exam (E4), which allowed 3 hours. All but the first exam were worth a significant portion (≥ 10%) of the course grade. These exams were conducted in a proctored computer lab with student-scheduled exam times within a three-day window [40–42]. Students were given access to a Python interpreter and Python's documentation, but no other resources were provided. The exam schedule is given in Figure 1.

Figure 1: Every three weeks the course had a proctored exam (E0 to E4). Weight relative to the final course grade is provided as a percentage: E0 (2%), E1 (10%), E2 (10%), E3 (10%), E4 (20%).

Exams featured all four kinds of questions seen on homework (T/F, MC, short answer, programming), except for E0, which did not have programming questions. Each exam consisted of 20–30 question slots (41 on the final). Each slot drew randomly from a pool of questions on a given topic with similar difficulties. Most questions permitted students to attempt them multiple times, with a score penalty for each subsequent incorrect attempt until chances to earn credit were exhausted.

Because of the course's heavy use of item generators and to motivate students to take homework seriously, a significant fraction of the exams was drawn from the course's pre-lecture and homework assignments. In general, 85–90% of the pools on the exam were drawn from questions previously on homework, and exam-only "hidden" questions were written with similar form and content to previous homework questions. Prior to each exam, students were provided access to a practice exam generator that was similar to the actual exam generator, but without the hidden questions. Reused programming questions are largely recall exercises, as most do not feature random generation. Short answer questions are transfer tasks, as they are all parameterized and no two instances of the question should possess the same exact parameters and the same expected student answer.

In spite of the exams including a large fraction of previously seen material, we don't believe that rote memorization was a useful strategy for these exams due to their heavy use of randomization and question pools combined with a large number of questions (20–30) on the exam. True/false and multiple-choice slots on the exam generally drew from pools of 20 to 100 questions, while short answer and programming question slots had pool sizes of 5 to 12. In addition, short answer item generators typically produce at least dozens of meaningfully different variants.

3.2 Homework and study periods

The decision for exams to mostly use the same questions as homework assignments and practice generators created an interesting context for attributing student learning. Specifically, we could analyze student performance on homework assignments, practice exams, and actual exams to observe how students' ability to answer these questions improved as they engaged with course material. We pulled all student submissions from PrairieLearn for the entire semester, keeping only submissions for questions that appeared on both homework and exams.

We cleaned this data set by removing students who had not completed all of the exams, retaining 584 of the 601 students. In total, we retained 1,064,547 individual submissions across homework, optional practice, and exams. Each submission's score ranges from 0 (incorrect) to 1 (full credit), with scores in between indicating partial credit.
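The cleaning described above amounts to two filters over the submission log. A minimal pandas sketch is shown below; the file names and column names (student_id, question_id, context, score) are hypothetical stand-ins for illustration, not the actual PrairieLearn export schema or our analysis code.

```python
import pandas as pd

# Hypothetical export of all submissions for the semester.
subs = pd.read_csv("submissions.csv")    # student_id, question_id, context, score, timestamp
exams = pd.read_csv("exam_sittings.csv") # student_id, exam_id

# Keep only students who sat all five exams (584 of the 601 students in our data).
exams_per_student = exams.groupby("student_id")["exam_id"].nunique()
complete_students = exams_per_student[exams_per_student == 5].index
subs = subs[subs["student_id"].isin(complete_students)]

# Keep only questions that appeared both on homework and on an exam.
contexts = subs.groupby("question_id")["context"].agg(set)
shared = contexts[contexts.apply(lambda s: {"homework", "exam"} <= s)].index
subs = subs[subs["question_id"].isin(shared)]

# Scores are normalized: 0 = incorrect, 1 = full credit, in between = partial credit.
assert subs["score"].between(0, 1).all()
```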
We subdivide our analysis of the course by exam, focusing on the three-week window preceding each of the five exams. As shown in Figure 2, each exam is comprehensive, including material that was present on previous exams. For this analysis, we focus solely on the content introduced since the previous exam to see how practice during the homework and study periods contributes to learning for the material's first summative assessment.

Figure 2: Exams are cumulative and largely drawn from item generators and questions previously on homework. Hidden questions only appear on exams. New questions were on homework since the previous exam, while old questions were previously on earlier homework and one or more previous exams. While the fraction of exam slots dedicated to old questions does increase as the semester progresses, this figure is somewhat deceptive because old pools typically have many more questions than new and hidden pools, except on the final (E4) where each week's material is represented equally.

Each student submission is assigned to one of three periods: homework, study, and exam (Figure 3):

• The homework period includes all the submissions to homework on or before the homework due date. Submissions in this period represent required practice; while students are allowed as many submissions as they need to get full credit, there is a deadline to receive that credit.

• The study period includes all submissions on practice exam generators as well as any submissions on homework after the homework deadline. The homework system remains open, and students can repeat problems and complete any problems not previously completed (only 90% of questions are needed to achieve a full homework score). Submissions in this period are elective practice, bearing no credit directly.

• The exam period includes the submissions on the actual exam.

Figure 3: We subdivide the students' practice into two periods: the homework period is all homework attempts before the deadline; the study period is all attempts on practice exams and any homework attempts after the deadline.

These periods are coarsely defined to capture the difference between the time spent on required practice with the homework assignment and any additional practice following the homework deadline. For our context, problems completed by students on practice exams as well as after a homework deadline are both elective activities and are suitable to be counted together. A minimal sketch of this bucketing appears after Figure 4.

For our analysis, we also tag each student's first attempt on each question on homework, so that we can estimate the student's ability to solve that question gained before attempting the question the first time (e.g., from readings, lecture, or solving other problems). A breakdown of the number of submissions during each period is provided in Figure 4. The decrease in submissions throughout the semester in the homework and studying buckets is a result of homework shifting toward fewer, more difficult problems as the semester progresses.

Figure 4: The submission count per period. In total, there are 1,064,547 submissions in our data set. As the semester progressed, homework had fewer but harder problems, which accounts for the reduction in submissions.
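The sketch referenced above shows one way the period assignment could be implemented from a submission's assessment type and timestamp; the field names and record layout are assumptions for illustration, not the actual data pipeline.

```python
from datetime import datetime

def assign_period(submission: dict, hw_deadline: datetime) -> str:
    """Assign one submission to the homework, study, or exam period.

    `submission` is assumed to have an 'assessment' field ('homework',
    'practice_exam', or 'exam') and a 'time' field (a datetime); these
    field names are illustrative, not the actual export schema.
    """
    if submission["assessment"] == "exam":
        return "exam"
    if submission["assessment"] == "practice_exam":
        return "study"
    # Homework submissions: required practice up to the deadline,
    # elective practice afterwards.
    return "homework" if submission["time"] <= hw_deadline else "study"

# Example: a homework submission made after the deadline counts as studying.
hw_deadline = datetime(2019, 10, 4, 23, 59)
late_hw = {"assessment": "homework", "time": datetime(2019, 10, 6, 14, 30)}
print(assign_period(late_hw, hw_deadline))  # -> "study"
```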
4. METHODS

To analyze the evolution of student knowledge from homework to exam time, we track student learning at the granularity of individual item generators. This is clearly a significant approximation to reality for two reasons: 1) because of pools (of true/false and multiple-choice questions) and parameter randomization (for short answer questions) there is some variation between instances of a given item generator, and 2) there are relationships between item generators (e.g., practice on a programming question relating to loops would likely improve students' ability to complete a short answer question related to loops, and vice versa).

Nevertheless, for our purposes, we believe this approach is viable. The items of each item generator were considered sufficiently similar by the instructor to be fungible with respect to the exams. Furthermore, the method is robust to whether learning occurs between subsequent attempts on the same problem or from students attempting a problem, trying new problems, and returning again to an older problem. If the student learns significantly by completing many other homework problems between two attempts at a given problem during the homework period, we can still correctly attribute the learning to having taken place during the homework period. As such, we made no attempt at topic modeling in this work.

4.1 Correctness tracing: post hoc modeling for student knowledge

In general, knowledge tracing (KT) techniques were developed as predictors of student performance or estimators of the latent knowledge state of students. KT is used either to estimate a student's likelihood of getting the next attempt correct based on previous attempts, adjusting after each success and failure as the student engages with an assessment, or to track changes in students' latent knowledge over time. Much of the difficulty of KT techniques results from attempting to instantaneously obtain a signal of student knowledge as students are engaging with learning opportunities. In our case, we already have all the data from the course as the course has ended and do not need an instantaneous, updating measure of student knowledge. Instead, we desire to perform a post hoc analysis of students' submissions to estimate how their learning changed over an entire course's worth of data. Our chosen method, CT, measures students' knowledge as demonstrated by an increase in the likelihood that they would get given items correct more frequently over time.

The method presented by Chen et al. [9] can be summarized by the following formulation:

    optimize:    L(p_1, ..., p_n; x_1, ..., x_n)
    subject to:  0 ≤ p_i ≤ 1   for all i                    (1)
                 p_i ≤ p_j     for all i < j

where x_1, ..., x_n is the result of a series of submissions, each of which is either 1 (correct) or 0 (incorrect), and the method tries to find a series of predictions p_1, ..., p_n that optimizes the loss function, under the constraints that: (1) p_1, ..., p_n are between 0 and 1, as they represent an estimate of the instantaneous probability that the student would get each attempt correct, and (2) p_1, ..., p_n are monotonically non-decreasing, which is based on the assumptions that the attempts are made over a short enough time period that forgetting is insignificant and that additional practice would not hurt a student's ability to answer these questions. Since the homework, practice, and exam attempts occurred over a three-week window, during which there was a lot of related practice, we believe these assumptions are reasonable.
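Because the loss is optimized subject only to the box and monotonicity constraints above, and (as noted below) a squared-error/RMSE objective is used, the fit amounts to an isotonic least-squares regression that a standard pool-adjacent-violators solver can compute. The following is a minimal sketch of that reading of Equation 1 using scikit-learn's IsotonicRegression; it is our reconstruction for illustration, not the authors' released implementation.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def correctness_trace(outcomes):
    """Fit non-decreasing per-attempt correctness probabilities p_1..p_n
    to a 0/1 outcome sequence x_1..x_n by least squares (Equation 1).

    Because the outcomes lie in [0, 1] and isotonic regression returns
    block means of the outcomes, the fitted values automatically satisfy
    0 <= p_i <= 1 as well as p_i <= p_j for i < j.
    """
    x = np.asarray(outcomes, dtype=float)
    attempts = np.arange(len(x))
    return IsotonicRegression(increasing=True).fit_transform(attempts, x)

# Two misses followed by two correct answers keep a sharp jump:
print(correctness_trace([0, 0, 1, 1]))  # -> [0.  0.  1.  1.]
# Alternating outcomes get pooled into an intermediate estimate:
print(correctness_trace([0, 1, 0, 1]))  # -> [0.  0.5 0.5 1. ]
```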
Rather than having a model with explicit parameters as found in BKT, the method calculates the probabilities p_1, ..., p_n by optimizing them directly for the target loss function. Chen et al. have shown that minimizing root-mean-square error (RMSE) and maximizing log-likelihood yield the same optimal solution under the constraints specified in Equation 1.

We chose to use CT over BKT or DKT as it nicely fit our use case. The CT method is able to finely locate and predict the "jumps" in a student's likelihood of getting a question correct when analyzing the data in a post hoc fashion, which may be too precise a transition for usual predictive knowledge tracing. For our purposes, a high-accuracy, post hoc model was ideal for analyzing changing student knowledge as a historical trend from our course's data.

One important weakness of CT, however, is that it is prone to underestimating student knowledge on an incorrect first attempt because the optimizer sets the probability of correctness to zero so as to minimize error on that attempt. Similarly, the probability on a correct final attempt will always be estimated as 1.0, which may be an overestimate. This could potentially be remedied by adding additional constraints to the method (e.g., limiting the rate of increase), but we did not attempt such constraints in this work.
Because Harlow’s four exams, which we’ll consider in the discussion section. last three attempts at valueOfListReordering were all cor- rect, the model decides that Harlow has mastered the ques- Figure 7 plots the change in likelihood of correctness from tion with a 100% likelihood of getting the question correct. the beginning to the end of each period. When we compare the pre-exam increase in student knowledge (as measured by We ran CT for each student on each question independently. likelihood of correctness) between the homework and study From each trace, we extract six estimates of the student’s period, CT attributes 57–65% of the learning to the home- likelihood of getting a question right: their first and last work period and 35–43% to the study period, across the five attempts in the homework period (First, End Homework), exams. The CT method also attributes some learning to the their first and last attempts in the study period (Start Study- exam period, which we’ll consider in the discussion section. ing, End Studying), and their first and last attempts on the exam (First Exam, End Exam). Any student without a sub- mission in that period (i.e., students who did not study or 5.2 Learning trends are largely independent students who did not get that question on their exam) has of question type their previous submission to that point in the timeline used To address RQ2, we disaggregated the exam data sets by in compliance with CT’s assumption that students do not question type to see whether there was any notable difference forget. We then average these likelihoods across all students between types. For this analysis, we omitted Exam 0, as and all questions for a given exam period. This allows us to Exam 0 did not feature programming questions. explore the changing student knowledge as an average for all the students in a course across the different learning oppor- Figure 8 shows the per-question type CT results. The only tunities presented by homework, studying, and assessment. notable finding is that different questions start at different levels of initial student knowledge and end with different 5. RESULTS amounts of knowledge, which changes the starting and end- ing points in Figure 8. Because of this, different questions 5.1 CT attributes significant learning to both drop off faster than others in terms of how much is learned the homework and study period; home- during the practice period. Generally, students have less to work contributes slightly more learn with true/false and multiple-choice questions through the practice period than they do on programming and short The results of running CT are shown in Figure 6. From answer questions, although all question types experience a the slopes of the lines, it can be seen that CT estimates learning drop-off through to the exam. that more learning is occurring (i.e., the change in student likelihood of correct attempts is larger) during the homework period than the study period. The plot suggests that the 6. DISCUSSION AND LIMITATIONS course material tends to get more difficult as the semester 6.1 RQ1: Students in this course learn slightly progresses, with the initial and final likelihood of correctness both decreasing as we move from Exam 0 to 3. Furthermore, more during the homework period than the lines for Exams 0 through 3 show almost identical trends. 
We ran CT for each student on each question independently. From each trace, we extract six estimates of the student's likelihood of getting a question right: their first and last attempts in the homework period (First, End Homework), their first and last attempts in the study period (Start Studying, End Studying), and their first and last attempts on the exam (First Exam, End Exam). Any student without a submission in a period (i.e., students who did not study or students who did not get that question on their exam) has their previous submission to that point in the timeline used, in compliance with CT's assumption that students do not forget. We then average these likelihoods across all students and all questions for a given exam period. This allows us to explore the changing student knowledge as an average for all the students in a course across the different learning opportunities presented by homework, studying, and assessment.

5. RESULTS

5.1 CT attributes significant learning to both the homework and study periods; homework contributes slightly more

The results of running CT are shown in Figure 6. From the slopes of the lines, it can be seen that CT estimates that more learning is occurring (i.e., the change in student likelihood of correct attempts is larger) during the homework period than the study period. The plot suggests that the course material tends to get more difficult as the semester progresses, with the initial and final likelihood of correctness both decreasing as we move from Exam 0 to 3. Furthermore, the lines for Exams 0 through 3 show almost identical trends. Exam 4, the final exam, behaves differently from the other four exams, which we consider in the discussion section.

Figure 6: The changing average "this-item-correct" chance from CT per period. CT suggests the majority of student learning is occurring during the homework period, although the study period is also significant.

Figure 7 plots the change in likelihood of correctness from the beginning to the end of each period. When we compare the pre-exam increase in student knowledge (as measured by likelihood of correctness) between the homework and study periods, CT attributes 57–65% of the learning to the homework period and 35–43% to the study period, across the five exams. The CT method also attributes some learning to the exam period, which we consider in the discussion section.

Figure 7: The average change in student knowledge by period according to CT. The largest change occurs during the homework period, with a smaller change from study, and the smallest on exams.

5.2 Learning trends are largely independent of question type

To address RQ2, we disaggregated the exam data sets by question type to see whether there was any notable difference between types. For this analysis, we omitted Exam 0, as Exam 0 did not feature programming questions.

Figure 8 shows the per-question-type CT results. The only notable finding is that different questions start at different levels of initial student knowledge and end with different amounts of knowledge, which changes the starting and ending points in Figure 8. Because of this, different questions drop off faster than others in terms of how much is learned during the practice period. Generally, students have less to learn with true/false and multiple-choice questions through the practice period than they do on programming and short answer questions, although all question types experience a learning drop-off through to the exam.

Figure 8: The changing average "this-item-correct" chance from CT for each question type from Exams 1 to 4. Different question types start out with higher assumed learning, which suggests more students got those questions right on their first attempt.

6. DISCUSSION AND LIMITATIONS

6.1 RQ1: Students in this course learn slightly more during the homework period than the study period

CT attributes more learning to the mandatory homework period in this particular course. This is represented as the largest increase in student knowledge from the first homework submission to the last. That gives us some confidence that a course with significant homework opportunities does provide students with productive chances to learn as opposed to just inundating students with "busy work."

Interestingly, CT also indicates performance on the exam is better than at the end of the study period. There are a few possible explanations for this. The most likely explanation is that, given the higher stakes of the exam, students are trying harder, resulting in a higher correct rate that is being observed by the model. In addition, some of the score improvements observed on the exam could be attributed to the last pre-exam practice attempt if, for example, the student got the question wrong but learned from seeing the correct answer. This might also just be an artifact of CT, as any student who has both incorrect and correct attempts on a given question on the exam will have learning attributed to them. Finally, actual learning might be occurring during the exam. The amount of "learning" attributed to the exam period is, however, fairly negligible.

Importantly, one should not attempt to generalize about the learning potential of homework relative to elective practice for all courses from these results. We expect that courses that assign less homework might observe less learning during the homework period and students might compensate by studying more, thereby making more of the learning occur during that optional studying. It could also be the case that there are diminishing returns on each attempt on a specific question, with the first attempt providing the most learning benefit, then the second, decreasing further with each attempt from homework through the study period. It is reassuring, though, to see that this course's homework and study opportunities (i.e., the practice exam generators made available to students) both appear to contribute significantly to student learning.

6.2 RQ2: All question types show similar learning trends

When we disaggregate the analysis by question type, the general shape and progression of results is the same for every question type compared to the source exam. Different questions start with lower amounts of student knowledge, but this appears to mostly be a function of the difficulty of the problem's type: programming and short answer questions, which require more actual coding on the students' part, tended to start and end lower.

The lack of different behavior when we disaggregate by question type is more interesting than it may initially appear. This means that the "shape" of student learning does not differ significantly with the question type.
Given this, it appears that homework and additional studying have the same impact on student results regardless of the kind of question. This does mean there are diminishing returns on easier question types over the period compared to harder ones, but not a deficiency in how homework and practice help on question types where students still have learning they can do.

6.3 Limitations

There are some obvious limitations to the current work. First, our findings about the relative learning during the homework and study periods cannot be assumed to generalize to other course contexts. Courses with different homework, study materials, and exam structures will likely have different breakdowns of learning in each phase.

Second, CT is a fairly coarse measure of learning. Scores as a performance indicator are not alone proof of student learning. Additionally, CT's potential for underestimating the likelihood of correctness of first attempts (by strictly optimizing for RMSE) could make the model overestimate the learning that is occurring in the first few attempts, which is likely occurring in the homework period. We do not have confidence that these measures of learning are particularly precise. While we omit it from the paper, we also ran a regression model to estimate the learning in the same periods of the course. The regression generally showed the same trends as CT, giving us more confidence in CT's results.

Finally, these methods do not disambiguate between learning that happens during the homework and studying periods and learning that occurs specifically from homework and elective practice problems. There are notable reasons to believe that students are learning significantly from reading the textbook, engaging in active learning exercises, and, perhaps, even from listening to the lecturer speak.
The learning that occurs during these activities is attributed to the period in which it occurs, rather than to the specific task.

7. CONCLUSION

In this work, we explored the degree to which we can attribute student learning between required homework and elective study performed prior to a summative assessment. To analyze learning, we used a post hoc method of "this-item-correct" likelihood (correctness tracing) to estimate student knowledge. We found that (required) homework and (elective) studying both contributed significantly to student learning, with homework contributing slightly more. Further, despite using multiple question types, we found the most notable difference between question types is where student knowledge starts, not the shape of their learning improvements.

We think that our results show that frequent, exam-relevant homework and highly accessible means for study (e.g., practice exam generators) are both effective means of facilitating student learning, and we believe that these findings could generalize to other contexts. The magnitude of learning from each component may differ, but courses with similar homework and studying opportunities will hopefully see similar learning gains during each period.

There remain areas for future work. Considering data, we only use students' submissions to questions that also appear on homework. Some ability to include other learning events, such as reading a textbook, would give a clearer picture of students' learning process. Additionally, some topic-level labeling might allow us to include questions unique to exams in our data and analysis.

With respect to CT's model, we made no attempt to compensate for the method's tendency to underestimate on initial incorrect attempts. Future work could investigate constraining this behavior by limiting the allowable slope. Further, there is room to adapt the model to using a richer source of information than students' correctness on submissions, for example by fitting a similar optimization on students' knowledge as estimated by methods such as Item Response Theory (IRT).

8. REFERENCES

[1] E. Bailey, J. Jensen, J. Nelson, H. Wiberg, and J. Bell. Weekly formative exams and creative grading enhance student learning in an introductory biology course. CBE—Life Sciences Education, 16(1):ar2, 2017.
[2] J.-A. Baird, D. Andrich, T. N. Hopfenbeck, and G. Stobart. Assessment and learning: fields apart? Assessment in Education: Principles, Policy & Practice, 24(3):317–350, July 2017.
[3] R. S. J. d. Baker, A. T. Corbett, and V. Aleven. More accurate student modeling through contextual estimation of slip and guess probabilities in Bayesian knowledge tracing. In B. P. Woolf, E. Aïmeur, R. Nkambou, and S. Lajoie, editors, Intelligent Tutoring Systems, pages 406–415, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.
[4] G. Başol and G. Johanson. Effectiveness of frequent testing over achievement: A meta analysis study. Journal of Human Sciences, 6(2):99–121, July 2009.
[5] H. Bembenutty and M. C. White. Academic performance and satisfaction with homework completion among college students. Learning and Individual Differences, 24:83–88, Apr. 2013.
[6] J. Bempechat. The motivational benefits of homework: a social-cognitive perspective. Theory Into Practice, 43(3):189–196, Aug. 2004.
[7] R. E. Bennett. Formative assessment: a critical review. Assessment in Education: Principles, Policy & Practice, 18(1):5–25, Feb. 2011.
[8] P. Black and D. Wiliam. Developing the theory of formative assessment. Educational Assessment, Evaluation and Accountability (formerly: Journal of Personnel Evaluation in Education), 21(1):5, Jan. 2009.
[9] B. Chen, M. West, and C. B. Zilles. Towards a model-free estimate of the limits to student modeling accuracy. In K. E. Boyer and M. Yudelson, editors, Proceedings of the 11th International Conference on Educational Data Mining, EDM 2018, Buffalo, NY, USA, July 15-18, 2018. International Educational Data Mining Society (IEDMS), 2018.
[10] P. Chen, Y. Lu, V. W. Zheng, and Y. Pian. Prerequisite-driven deep knowledge tracing. In 2018 IEEE International Conference on Data Mining (ICDM), pages 39–48, 2018.
[11] S. L. Chew. Helping students to get the most out of studying. Acknowledgments and Dedication, page 215, 2014.
[12] A. T. Corbett and J. R. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4):253–278, Dec. 1994.
[13] W. Fakcharoenphol, E. Potter, and T. Stelzer. What students learn when studying physics practice exam problems. Phys. Rev. ST Phys. Educ. Res., 7:010107, May 2011.
[14] B. Gutarts and F. Bains. Does mandatory homework have a positive effect on student achievement for college students studying calculus? Mathematics and Computer Education, 44(3):232–244, Fall 2010.
[15] M. K. Hartwig and J. Dunlosky. Study strategies of college students: Are self-testing and scheduling related to achievement? Psychonomic Bulletin and Review, 19:126–134, 2012.
[16] S. Irvine and P. Kyllonen. Item Generation for Test Development. Lawrence Erlbaum Associates, 2002.
[17] J. A. Johnson and R. McKenzie. The effect on student performance of web-based learning and homework in microeconomics. Journal of Economics and Economic Education Research, 14(2):115–125, 2013.
[18] J. D. Karpicke and J. R. Blunt. Retrieval practice produces more learning than elaborative studying with concept mapping. Science, 331(6018):772–775, 2011.
[19] M. Khajah, R. V. Lindsey, and M. C. Mozer. How deep is knowledge tracing? CoRR, abs/1604.02416, 2016.
[20] M. Khajah, R. Wing, R. Lindsey, and M. Mozer. Integrating latent-factor and knowledge-tracing models to predict individual differences in learning. In Educational Data Mining 2014. Citeseer, 2014.
[21] J. Laverty, S. Underwood, R. Matz, L. Posey, J. Carmel, M. Caballero, C. L. Fata-Hartley, D. Ebert-May, S. E. Jardeleza, and M. M. Cooper. Characterizing college science assessments: The three-dimensional learning assessment protocol. PLoS ONE, 11(9):e0162333, 2016.
[22] P. Magalhães, D. Ferreira, J. Cunha, and P. Rosário. Online vs traditional homework: A systematic review on the benefits to students' performance. Computers & Education, 152:103869, July 2020.
[23] S. Minn, Y. Yu, M. C. Desmarais, F. Zhu, and J. Vie. Deep knowledge tracing and dynamic student classification for knowledge tracing. In 2018 IEEE International Conference on Data Mining (ICDM), pages 1182–1187, 2018.
[24] J. W. Morphew, M. Silva, G. Herman, and M. West. Frequent mastery testing with second-chance exams leads to enhanced student learning in undergraduate engineering. Applied Cognitive Psychology, 34(1):168–181, 2020.
[25] Z. A. Pardos and N. T. Heffernan. Modeling individualization in a Bayesian networks implementation of knowledge tracing. In P. De Bra, A. Kobsa, and D. Chin, editors, User Modeling, Adaptation, and Personalization, pages 255–266, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
[26] Z. A. Pardos and N. T. Heffernan. KT-IDEM: Introducing item difficulty to the knowledge tracing model. In J. A. Konstan, R. Conejo, J. L. Marzo, and N. Oliver, editors, User Modeling, Adaption and Personalization, pages 243–254, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.
[27] R. Pelánek. Bayesian knowledge tracing, logistic models, and beyond: an overview of learner modeling techniques. User Modeling and User-Adapted Interaction, 27(3):313–350, Dec. 2017.
[28] C. Piech, J. Spencer, J. Huang, S. Ganguli, M. Sahami, L. Guibas, and J. Sohl-Dickstein. Deep knowledge tracing. arXiv preprint arXiv:1506.05908, 2015.
[29] D. Ramdass and B. J. Zimmerman. Developing self-regulation skills: The important role of homework. Journal of Advanced Academics, 22(2):194–218, 2011.
[30] H. L. Roediger and J. F. Nestojko. The relative benefits of studying and testing on long-term retention. In Cognitive modeling in perception and memory: A festschrift for Richard M. Shiffrin, pages 99–111, 2015.
[31] C. S. Ryan and N. S. Hemmes. Effects of the contingency for homework submission on homework submission and quiz performance in a college course. Journal of Applied Behavior Analysis, 38(1):79–88, 2005.
[32] M. L. Still and J. D. Still. Contrasting traditional in-class exams with frequent online testing. Journal of Teaching and Learning with Technology, 4(2):30, 2015.
[33] B. W. Tuckman. Using tests as an incentive to motivate procrastinators to study. The Journal of Experimental Education, 66(2):141–147, 1998.
[34] C. K. Waugh and N. E. Gronlund. Assessment of Student Achievement (10th Edition). Pearson, 2012.
[35] M. West, G. L. Herman, and C. Zilles. PrairieLearn: Mastery-based online problem solving with adaptive scoring and recommendations driven by machine learning. In 2015 ASEE Annual Conference & Exposition, Seattle, Washington, 2015. ASEE Conferences.
[36] M. West, N. Walters, M. Silva, T. Bretl, and C. Zilles. Integrating diverse learning tools using the PrairieLearn platform. In Seventh SPLICE Workshop at SIGCSE 2021 (Virtual event), March 2021.
[37] D. Wiliam. What is assessment for learning? Studies in Educational Evaluation, 37(1):3–14, 2011.
[38] X. Xiong, S. Zhao, E. G. Van Inwegen, and J. E. Beck. Going deeper with deep knowledge tracing. International Educational Data Mining Society, 2016.
[39] L. Zhang, X. Xiong, S. Zhao, A. Botelho, and N. T. Heffernan. Incorporating rich features into deep knowledge tracing. In Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale, L@S '17, pages 169–172, New York, NY, USA, 2017. Association for Computing Machinery.
[40] C. Zilles, R. T. Deloatch, J. Bailey, B. B. Khattar, W. Fagen, C. Heeren, D. Mussulman, and M. West. Computerized testing: A vision and initial experiences. In American Society for Engineering Education (ASEE) Annual Conference, 2015.
[41] C. Zilles, M. West, G. Herman, and T. Bretl. Every university should have a computer-based testing facility. In Proceedings of the 11th International Conference on Computer Supported Education (CSEDU), May 2019.
[42] C. Zilles, M. West, D. Mussulman, and T. Bretl. Making testing less trying: Lessons learned from operating a Computer-Based Testing Facility. In 2018 IEEE Frontiers in Education (FIE) Conference, San Jose, California, 2018.