=Paper=
{{Paper
|id=Vol-3051/CSEDM_1
|storemode=property
|title=How productive are homework and elective practice? Applying a post hoc modeling of student knowledge in a large, introductory computing course (Full Paper)
|pdfUrl=https://ceur-ws.org/Vol-3051/CSEDM_1.pdf
|volume=Vol-3051
|authors=Max Fowler,Binglin Chen,Matthew West,Craig Zilles
|dblpUrl=https://dblp.org/rec/conf/edm/FowlerC0Z21
}}
==How productive are homework and elective practice? Applying a post hoc modeling of student knowledge in a large, introductory computing course (Full Paper)==
How productive are homework and elective practice? Applying a post hoc modeling of student knowledge in a large, introductory computing course

Max Fowler, University of Illinois, Urbana, IL, USA (mfowler5@illinois.edu)
Binglin Chen, University of Illinois, Urbana, IL, USA (chen386@illinois.edu)
Matthew West, University of Illinois, Urbana, IL, USA (mwest@illinois.edu)
Craig Zilles, University of Illinois, Urbana, IL, USA (zilles@illinois.edu)

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

In this paper, we attempt to estimate how much learning happens in required practice activities (homework) relative to elective practice activities (studying). This analysis is done in the context of a large-enrollment (N = 601) introductory programming course that made heavy use of auto-graded, randomizing question (item) generators. Because these item generators (and other problems) were used as homework, on practice exams, and as part of exams, a given student may have encountered the same generator multiple times during the class, providing snapshots of the evolution of the student's ability to complete that problem correctly.

We use a post hoc model of "this-item-correct" prediction to estimate individual student knowledge on each attempt of a given question. Across five exams, correctness tracing attributes 57–65% of the learning that occurs to the homework period and the remainder to elective practice (the study period).

Keywords: assessment; CS1; exams; student learning; homework

1. INTRODUCTION

A well-designed course provides students with many opportunities to learn (e.g., readings, direct instruction, activities with peers, homework). While summative assessment allows us to estimate how much learning has occurred, it doesn't shed light on where the learning happened. If we could attribute learning to the activities in which it occurred, this would allow teachers to increase their use of effective activities and deprecate ineffective ones. Our goal as educators is to engage students, to assist both them and us in diagnosing their progress, and to provide formative experiences during their learning careers [8, 15, 21, 34].

In most courses, the bulk of the students' time is spent outside of course meetings, either completing homework or performing elective practice (studying). It has been shown that well-formed homework has a positive impact on student performance and motivation [5, 6, 14, 22]. There are, however, disagreements among experts in the learning and assessment communities on how to craft good homework [2, 37]. Studying is usually motivated by a desire to score well on exams and does not typically have a grade associated with it [33].

We were curious to explore the degree to which we can attribute student learning between two kinds of formative practice activities: required homework and elective practice performed prior to a summative assessment. Additionally, as our course utilizes multiple types of questions, we were curious to know whether student experiences differed between types. To do so after the completion of the course, we use a post hoc knowledge estimation method developed by Chen et al. [9]. This method, which we call "correctness tracing" (CT) for short, models student learning as the likelihood of students getting specific questions correct on a given attempt. The method estimates the chance of a student getting "this-item-correct" for a given item (question) at every attempt the student makes on that item, for all items.

We apply CT to student submission data from an on-campus introductory programming course.
The course used randomly selected questions from question pools and random item generators for exam creation, with many of the questions appearing previously on homework (and optional practice exams) as a studying motivator for students. We use data from student homework, practice exams, and these proctored exams to build a cohesive snapshot of student experience with the same questions in multiple contexts. Specifically, we share our experience investigating students' learning in this fashion to address the following questions:

RQ1: How much learning happens during required practice activities (homework) relative to elective practice (studying)?

RQ2: Does student learning differ based on the type of questions (e.g., multiple-choice vs. short answer) asked?

The rest of our paper is organized as follows. Section 2 describes related work on student learning and knowledge tracing. Section 3 discusses the course from which we collected data and the handling of that data. In Section 4, we explain the assumptions behind CT and detail our use of the method. We follow with our results from the modeling in Section 5 and with interpretation and limitations in Section 6. We conclude in Section 7.

2. RELATED WORK

2.1 How are students learning on homework and through studying?

How students learn is an area of significant study. We are specifically interested here in how formative assessment (e.g., homework) helps students learn. Historically, formative assessment is claimed to benefit student learning, although there is little consensus on what exactly makes good formative assessment [7]. There is evidence, however, that frequent and distributed practice, such as frequent testing, boosts student achievement and learning [1, 4, 24, 32].

Research on homework often considers benefits to students' motivation and self-regulatory ability as opposed to just content learning. Ramdass and Zimmerman used correlational studies to show that homework leads to higher self-regulatory abilities and traits, like time management and self-efficacy [29]. Similarly, Bembenutty and White showed that students who approach homework with help-seeking attitudes and as motivating exercises displayed stronger academic performance [5].

Mandatory homework is found to be beneficial in existing research, but in large part due to feedback. Gutarts and Bains found that homework that provides feedback appears to enhance student performance [14]. However, Johnson and McKenzie found that while mandatory homework may incentivize homework-related motivation and learning, it was not correlated with exam performance in their macroeconomics course [17]. Ryan and Hemmes found homework was correlated with improved quiz performance, but that points are a necessary contingency to get students to do homework, with feedback-only approaches reducing student engagement [31].

The benefits of studying are less clearly defined. Chew suggests the benefit of study can be improved by teaching students how to study, and that expecting students to know how without designing assignments and material to aid their studying may be a mistake on the part of some instructors [11]. Fakcharoenphol et al. found that there was a learning increase from studying old exams with solutions and feedback, but that this learning may be shallow [13].

The idea that studying itself may be comparatively shallow is supported in the literature on long-term retention. Karpicke and Blunt found that retrieval practice from exams was superior for learning compared to elaborative studying processes [18]. Additionally, Roediger and Nestojko found that, while studying did improve long-term retention of concepts, retrieval during testing still had superior results [30].
2.2 Knowledge tracing and student modeling

There is a wealth of work on different methods of tracing student knowledge and modeling student learning and student behavior. Many of these stem from Corbett and Anderson's original knowledge tracing paper [12]. Since that original paper, there has been further work on issues such as student slip and guess behavior, the benefits and traceability of learning resources, and other parts of students' learning environments. Pelánek's review shows how learner modeling has grown to encompass domain knowledge structuring, learner clustering, student observations, and more just over the last decade [27]. We address a few of these below.

Pardos and Heffernan modeled individualized learning in Bayesian knowledge tracing (BKT) [25]. In their method, students' skills were used to set each student's individualized knowledge for more accurate individual knowledge tracing. They later introduced individual item difficulty as a way to make knowledge tracing more robust to unseen items [26]. As opposed to using skills for individual student priors, Khajah et al. used latent factors pulled from student populations to predict individual student performance [20]. Other approaches use machine learning methods to estimate student guess or slip chances as opposed to students having not yet learned course material [3].

Deep learning methods have also been applied to knowledge tracing in deep knowledge tracing (DKT) [28]. Additions to DKT include prerequisite modeling of students' concepts [10], problem-level features like time to complete and student hint usage [39], and dynamic student grouping based on performance [23]. There is some evidence to suggest that, while DKT is powerful, BKT can be similarly extended and that the gains do not require "deep" learning techniques explicitly [19]. Additionally, methods such as predictive failure analysis can perform similarly to DKT so long as care is taken to structure data appropriately [38].

3. DATA COLLECTION

Our data was collected in a large-enrollment, introductory programming course for non-CS majors in Fall 2019. The course had 601 total students, with 246 women and 355 men. The majority of students who took the course were freshmen (67%) and sophomores (21%). The course predominantly taught Python programming with some coverage of basic Excel and HTML/web concepts.

3.1 Course context

The course was organized as a flipped class that covered one major topic each week. Students were expected to complete readings in an interactive textbook and an assignment consisting of true/false and multiple-choice questions prior to lecture.
The weekly 90-minute lecture used peer instruction to reinforce concepts, and the weekly 80-minute lab consisted of practice activities students could complete individually or in pairs, supervised by course staff. Finally, each topic culminated in a weekly homework assignment that consisted of a mix of short answer questions (e.g., "What is the value of the variable x after the following piece of code executes?", "Write a statement that removes the 4th element of a list called 'animals'.") and small programming questions (i.e., no more than a small function).

Due to the size of the course, almost all of the homework activities were auto-graded. The course used the open-source assessment platform PrairieLearn [35, 36] for all homework and other assessments. PrairieLearn both instantly grades student submissions and provides automatic feedback. Homework assignments were configured for students to be fearless: there was no penalty for wrong answers, only points to gain as they got answers correct. This allowed students to practice with course content repeatedly until they got the correct answer. Students were able to repeat questions until they earned full credit and revisit questions at any point for studying purposes.

Many of the homework questions were item generators that could produce many possible questions of similar difficulty on the same topic [16]. The true/false and multiple-choice item generators randomly selected items from pre-populated pools of questions. Short answer questions are randomly parameterized (e.g., changing the list a student has to read or changing the method applied to a given list).
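To make the idea of a parameterized short answer generator concrete, the following is a minimal hypothetical sketch in the spirit of the "animals" example above; the function and field names are ours, and this is not the course's actual PrairieLearn question code.

```python
import random

def generate_list_removal_variant(rng=random):
    """Hypothetical short answer item generator: each call builds a fresh
    prompt about removing a random element from a randomly generated list,
    plus the list state used to auto-grade a correct answer."""
    animals = rng.sample(["cat", "dog", "emu", "fox", "hen", "owl", "pig"], k=5)
    index = rng.randint(0, len(animals) - 1)
    prompt = (f"Given animals = {animals}, write a statement that removes "
              f"the element at index {index} of the list 'animals'.")
    expected = list(animals)
    del expected[index]  # the list a correct student answer should produce
    return {"prompt": prompt, "expected_list": expected}

# Two calls yield two different but equivalent instances of the same item.
print(generate_list_removal_variant())
print(generate_list_removal_variant())
```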
To encourage mastery, homework often expected students to correctly answer these item generators multiple times. Weekly homework assignments typically included 12 to 30 items or item generators, and students needed to complete 90% of them to achieve a full score on the homework.

The course's primary means of summative assessment was five proctored exams. All the exams had a 50-minute fixed time limit, except for the final exam (E4), which allowed 3 hours. All but the first exam were worth a significant portion (≥ 10%) of the course grade. These exams were conducted in a proctored computer lab with student-scheduled exam times within a three-day window [40–42]. Students were given access to a Python interpreter and Python's documentation, but no other resources were provided. The exam schedule is given in Figure 1.

Figure 1: Every three weeks the course had a proctored exam (E0 to E4). Weight relative to the final course grade is provided as a percentage: E0 (2%), E1 (10%), E2 (10%), E3 (10%), E4 (20%).

Exams featured all four kinds of questions seen on homework (T/F, MC, short answer, programming), except for E0, which did not have programming questions. Each exam consisted of 20–30 question slots (41 on the final). Each slot drew randomly from a pool of questions on a given topic with similar difficulties. Most questions permitted students to attempt them multiple times, with a score penalty for each subsequent incorrect attempt until chances to earn credit were exhausted.

Because of the course's heavy use of item generators and to motivate students to take homework seriously, a significant fraction of the exams was drawn from the course's pre-lecture and homework assignments. In general, 85–90% of the pools on the exam were drawn from questions previously on homework, and exam-only "hidden" questions were written with similar form and content to previous homework questions. Prior to each exam, students were provided access to a practice exam generator that was similar to the actual exam generator, but without the hidden questions. Reused programming questions are largely recall exercises, as most do not feature random generation. Short answer questions are transfer tasks, as they are all parameterized and no two instances of the question should possess the same exact parameters and the same expected student answer.

In spite of the exams including a large fraction of previously seen material, we don't believe that rote memorization was a useful strategy for these exams due to their heavy use of randomization and question pools combined with a large number of questions (20–30) on the exam. True/false and multiple-choice slots on the exam generally drew from pools of 20 to 100 questions, while short answer and programming question slots had pool sizes of 5 to 12. In addition, short answer item generators typically produce at least dozens of meaningfully different variants.

3.2 Homework and study periods

The decision for exams to mostly use the same questions as homework assignments and practice generators created an interesting context for attributing student learning. Specifically, we could analyze student performance on homework assignments, practice exams, and actual exams to observe how students' ability to answer these questions improved as they engaged with course material. We pulled all student submissions from PrairieLearn for the entire semester, keeping only submissions for questions that appeared on both homework and exams.

We cleaned this data set by removing students who had not completed all of the exams, retaining 584 of the 601 students. In total, we retained 1,064,547 individual submissions across homework, optional practice, and exams. Each submission's score ranges from 0 (incorrect) to 1 (full credit), with scores in between indicating partial credit.
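The cleaning described above amounts to two filters over the submission log. A minimal pandas sketch is shown below; the file names and column names (student_id, question_id, context, score) are hypothetical stand-ins for illustration, not the actual PrairieLearn export schema or our analysis code.

```python
import pandas as pd

# Hypothetical export of all submissions for the semester.
subs = pd.read_csv("submissions.csv")    # student_id, question_id, context, score, timestamp
exams = pd.read_csv("exam_sittings.csv") # student_id, exam_id

# Keep only students who sat all five exams (584 of the 601 students in our data).
exams_per_student = exams.groupby("student_id")["exam_id"].nunique()
complete_students = exams_per_student[exams_per_student == 5].index
subs = subs[subs["student_id"].isin(complete_students)]

# Keep only questions that appeared both on homework and on an exam.
contexts = subs.groupby("question_id")["context"].agg(set)
shared = contexts[contexts.apply(lambda s: {"homework", "exam"} <= s)].index
subs = subs[subs["question_id"].isin(shared)]

# Scores are normalized: 0 = incorrect, 1 = full credit, in between = partial credit.
assert subs["score"].between(0, 1).all()
```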
We subdivide our analysis of the course by exam, focusing on the three-week window preceding each of the five exams. As shown in Figure 2, each exam is comprehensive, including material that was present on previous exams. For this analysis, we focus solely on the content introduced since the previous exam to see how practice during the homework and study periods contributes to learning for the material's first summative assessment.

Figure 2: Exams are cumulative and largely drawn from item generators and questions previously on homework. Hidden questions only appear on exams. New questions were on homework since the previous exam, while old questions were previously on earlier homework and one or more previous exams. While the fraction of exam slots dedicated to old questions does increase as the semester progresses, this figure is somewhat deceptive because old pools typically have many more questions than new and hidden pools, except on the final (E4) where each week's material is represented equally.

Each student submission is assigned to one of three periods: homework, study, and exam (Figure 3):

• The homework period includes all the submissions to homework on or before the homework due date. Submissions in this period represent required practice; while students are allowed as many submissions as they need to get full credit, there is a deadline to receive that credit.

• The study period includes all submissions on practice exam generators as well as any submissions on homework after the homework deadline. The homework system remains open, and students can repeat problems and complete any problems not previously completed (only 90% of questions are needed to achieve a full homework score). Submissions in this period are elective practice, bearing no credit directly.

• The exam period includes the submissions on the actual exam.

Figure 3: We subdivide the students' practice into two periods: the homework period is all homework attempts before the deadline; the study period is all attempts on practice exams and any homework attempts after the deadline.

These periods are coarsely defined to capture the difference between the time spent on required practice with the homework assignment and any additional practice following the homework deadline. For our context, problems completed by students on practice exams as well as after a homework deadline are both elective activities and are suitable to be counted together. A minimal sketch of this bucketing appears after Figure 4.

For our analysis, we also tag each student's first attempt on each question on homework, so that we can estimate the student's ability to solve that question gained before attempting the question the first time (e.g., from readings, lecture, or solving other problems). A breakdown of the number of submissions during each period is provided in Figure 4. The decrease in submissions throughout the semester in the homework and studying buckets is a result of homework shifting toward fewer, more difficult problems as the semester progresses.

Figure 4: The submission count per period. In total, there are 1,064,547 submissions in our data set. As the semester progressed, homework had fewer but harder problems, which accounts for the reduction in submissions.
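The sketch referenced above shows one way the period assignment could be implemented from a submission's assessment type and timestamp; the field names and record layout are assumptions for illustration, not the actual data pipeline.

```python
from datetime import datetime

def assign_period(submission: dict, hw_deadline: datetime) -> str:
    """Assign one submission to the homework, study, or exam period.

    `submission` is assumed to have an 'assessment' field ('homework',
    'practice_exam', or 'exam') and a 'time' field (a datetime); these
    field names are illustrative, not the actual export schema.
    """
    if submission["assessment"] == "exam":
        return "exam"
    if submission["assessment"] == "practice_exam":
        return "study"
    # Homework submissions: required practice up to the deadline,
    # elective practice afterwards.
    return "homework" if submission["time"] <= hw_deadline else "study"

# Example: a homework submission made after the deadline counts as studying.
hw_deadline = datetime(2019, 10, 4, 23, 59)
late_hw = {"assessment": "homework", "time": datetime(2019, 10, 6, 14, 30)}
print(assign_period(late_hw, hw_deadline))  # -> "study"
```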
4. METHODS

To analyze the evolution of student knowledge from homework to exam time, we track student learning at the granularity of individual item generators. This is clearly a significant approximation to reality for two reasons: 1) because of pools (of true/false and multiple-choice questions) and parameter randomization (for short answer questions) there is some variation between instances of a given item generator, and 2) there are relationships between item generators (e.g., practice on a programming question relating to loops would likely improve students' ability to complete a short answer question related to loops, and vice versa).

Nevertheless, for our purposes, we believe this approach is viable. The items of each item generator were considered sufficiently similar by the instructor to be fungible with respect to the exams. Furthermore, the method is robust to whether learning occurs between subsequent attempts on the same problem or from students attempting a problem, trying new problems, and returning again to an older problem. If the student learns significantly by completing many other homework problems between two attempts at a given problem during the homework period, we can still correctly attribute the learning to having taken place during the homework period. As such, we made no attempt at topic modeling in this work.

4.1 Correctness tracing: post hoc modeling for student knowledge

In general, knowledge tracing (KT) techniques were developed as predictors of student performance or estimators of the latent knowledge state of students. KT is used either to estimate a student's likelihood of getting the next attempt correct based on previous attempts, adjusting after each success and failure as the student engages with an assessment, or to track changes in students' latent knowledge over time. Much of the difficulty of KT techniques results from attempting to instantaneously obtain a signal of student knowledge as students are engaging with learning opportunities. In our case, we already have all the data from the course as the course has ended and do not need an instantaneous, updating measure of student knowledge. Instead, we desire to perform a post hoc analysis of students' submissions to estimate how their learning changed over an entire course's worth of data. Our chosen method, CT, measures students' knowledge as demonstrated by an increase in the likelihood that they would get given items correct more frequently over time.

The method presented by Chen et al. [9] can be summarized by the following formulation:

    optimize:    L(p_1, ..., p_n; x_1, ..., x_n)
    subject to:  0 ≤ p_i ≤ 1   for all i                    (1)
                 p_i ≤ p_j     for all i < j

where x_1, ..., x_n is the result of a series of submissions, each of which is either 1 (correct) or 0 (incorrect), and the method tries to find a series of predictions p_1, ..., p_n that optimizes the loss function, under the constraints that: (1) p_1, ..., p_n are between 0 and 1, as they represent an estimate of the instantaneous probability that the student would get each attempt correct, and (2) p_1, ..., p_n are monotonically non-decreasing, which is based on the assumptions that the attempts are made over a short enough time period that forgetting is insignificant and that additional practice would not hurt a student's ability to answer these questions. Since the homework, practice, and exam attempts occurred over a three-week window, during which there was a lot of related practice, we believe these assumptions are reasonable.
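Because the loss is optimized subject only to the box and monotonicity constraints above, and (as noted below) a squared-error/RMSE objective is used, the fit amounts to an isotonic least-squares regression that a standard pool-adjacent-violators solver can compute. The following is a minimal sketch of that reading of Equation 1 using scikit-learn's IsotonicRegression; it is our reconstruction for illustration, not the authors' released implementation.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def correctness_trace(outcomes):
    """Fit non-decreasing per-attempt correctness probabilities p_1..p_n
    to a 0/1 outcome sequence x_1..x_n by least squares (Equation 1).

    Because the outcomes lie in [0, 1] and isotonic regression returns
    block means of the outcomes, the fitted values automatically satisfy
    0 <= p_i <= 1 as well as p_i <= p_j for i < j.
    """
    x = np.asarray(outcomes, dtype=float)
    attempts = np.arange(len(x))
    return IsotonicRegression(increasing=True).fit_transform(attempts, x)

# Two misses followed by two correct answers keep a sharp jump:
print(correctness_trace([0, 0, 1, 1]))  # -> [0.  0.  1.  1.]
# Alternating outcomes get pooled into an intermediate estimate:
print(correctness_trace([0, 1, 0, 1]))  # -> [0.  0.5 0.5 1. ]
```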
Rather than having a model with explicit parameters as found in BKT, the method calculates the probabilities p_1, ..., p_n by optimizing them directly for the target loss function. Chen et al. have shown that minimizing root-mean-square error (RMSE) and maximizing log-likelihood yield the same optimal solution under the constraints specified in Equation 1.

We chose to use CT over BKT or DKT as it nicely fit our use case. The CT method is able to finely locate and predict the "jumps" in a student's likelihood of getting a question correct when analyzing the data in a post hoc fashion, which may be too precise a transition for usual predictive knowledge tracing. For our purposes, a high-accuracy, post hoc model was ideal for analyzing changing student knowledge as a historical trend from our course's data.

One important weakness of CT, however, is that it is prone to underestimating student knowledge on an incorrect first attempt because the optimizer sets the probability of correctness to zero so as to minimize error on that attempt. Similarly, the probability on a correct final attempt will always be estimated as 1.0, which may be an overestimate. This could potentially be remedied by adding additional constraints to the method (e.g., limiting the rate of increase), but we did not attempt such constraints in this work.
Because Harlow’s four exams, which we’ll consider in the discussion section. last three attempts at valueOfListReordering were all cor- rect, the model decides that Harlow has mastered the ques- Figure 7 plots the change in likelihood of correctness from tion with a 100% likelihood of getting the question correct. the beginning to the end of each period. When we compare the pre-exam increase in student knowledge (as measured by We ran CT for each student on each question independently. likelihood of correctness) between the homework and study From each trace, we extract six estimates of the student’s period, CT attributes 57–65% of the learning to the home- likelihood of getting a question right: their first and last work period and 35–43% to the study period, across the five attempts in the homework period (First, End Homework), exams. The CT method also attributes some learning to the their first and last attempts in the study period (Start Study- exam period, which we’ll consider in the discussion section. ing, End Studying), and their first and last attempts on the exam (First Exam, End Exam). Any student without a sub- mission in that period (i.e., students who did not study or 5.2 Learning trends are largely independent students who did not get that question on their exam) has of question type their previous submission to that point in the timeline used To address RQ2, we disaggregated the exam data sets by in compliance with CT’s assumption that students do not question type to see whether there was any notable difference forget. We then average these likelihoods across all students between types. For this analysis, we omitted Exam 0, as and all questions for a given exam period. This allows us to Exam 0 did not feature programming questions. explore the changing student knowledge as an average for all the students in a course across the different learning oppor- Figure 8 shows the per-question type CT results. The only tunities presented by homework, studying, and assessment. notable finding is that different questions start at different levels of initial student knowledge and end with different 5. RESULTS amounts of knowledge, which changes the starting and end- ing points in Figure 8. Because of this, different questions 5.1 CT attributes significant learning to both drop off faster than others in terms of how much is learned the homework and study period; home- during the practice period. Generally, students have less to work contributes slightly more learn with true/false and multiple-choice questions through the practice period than they do on programming and short The results of running CT are shown in Figure 6. From answer questions, although all question types experience a the slopes of the lines, it can be seen that CT estimates learning drop-off through to the exam. that more learning is occurring (i.e., the change in student likelihood of correct attempts is larger) during the homework period than the study period. The plot suggests that the 6. DISCUSSION AND LIMITATIONS course material tends to get more difficult as the semester 6.1 RQ1: Students in this course learn slightly progresses, with the initial and final likelihood of correctness both decreasing as we move from Exam 0 to 3. Furthermore, more during the homework period than the lines for Exams 0 through 3 show almost identical trends. 
We ran CT for each student on each question independently. From each trace, we extract six estimates of the student's likelihood of getting a question right: their first and last attempts in the homework period (First, End Homework), their first and last attempts in the study period (Start Studying, End Studying), and their first and last attempts on the exam (First Exam, End Exam). Any student without a submission in a period (i.e., students who did not study or students who did not get that question on their exam) has their previous submission to that point in the timeline used, in compliance with CT's assumption that students do not forget. We then average these likelihoods across all students and all questions for a given exam period. This allows us to explore the changing student knowledge as an average for all the students in a course across the different learning opportunities presented by homework, studying, and assessment.

5. RESULTS

5.1 CT attributes significant learning to both the homework and study periods; homework contributes slightly more

The results of running CT are shown in Figure 6. From the slopes of the lines, it can be seen that CT estimates that more learning is occurring (i.e., the change in student likelihood of correct attempts is larger) during the homework period than the study period. The plot suggests that the course material tends to get more difficult as the semester progresses, with the initial and final likelihood of correctness both decreasing as we move from Exam 0 to 3. Furthermore, the lines for Exams 0 through 3 show almost identical trends. Exam 4, the final exam, behaves differently from the other four exams, which we consider in the discussion section.

Figure 6: The changing average "this-item-correct" chance from CT per period. CT suggests the majority of student learning is occurring during the homework period, although the study period is also significant.

Figure 7 plots the change in likelihood of correctness from the beginning to the end of each period. When we compare the pre-exam increase in student knowledge (as measured by likelihood of correctness) between the homework and study periods, CT attributes 57–65% of the learning to the homework period and 35–43% to the study period, across the five exams. The CT method also attributes some learning to the exam period, which we consider in the discussion section.

Figure 7: The average change in student knowledge by period according to CT. The largest change occurs during the homework period, with a smaller change from study, and the smallest on exams.

5.2 Learning trends are largely independent of question type

To address RQ2, we disaggregated the exam data sets by question type to see whether there was any notable difference between types. For this analysis, we omitted Exam 0, as Exam 0 did not feature programming questions.

Figure 8 shows the per-question-type CT results. The only notable finding is that different questions start at different levels of initial student knowledge and end with different amounts of knowledge, which changes the starting and ending points in Figure 8. Because of this, different questions drop off faster than others in terms of how much is learned during the practice period. Generally, students have less to learn with true/false and multiple-choice questions through the practice period than they do on programming and short answer questions, although all question types experience a learning drop-off through to the exam.

Figure 8: The changing average "this-item-correct" chance from CT for each question type from Exams 1 to 4. Different question types start out with higher assumed learning, which suggests more students got those questions right on their first attempt.

6. DISCUSSION AND LIMITATIONS

6.1 RQ1: Students in this course learn slightly more during the homework period than the study period

CT attributes more learning to the mandatory homework period in this particular course. This is represented as the largest increase in student knowledge from the first homework submission to the last. That gives us some confidence that a course with significant homework opportunities does provide students with productive chances to learn as opposed to just inundating students with "busy work."

Interestingly, CT also indicates performance on the exam is better than at the end of the study period. There are a few possible explanations for this. The most likely explanation is that, given the higher stakes of the exam, students are trying harder, resulting in a higher correct rate that is being observed by the model. In addition, some of the score improvements observed on the exam could be attributed to the last pre-exam practice attempt if, for example, the student got the question wrong but learned from seeing the correct answer. This might also just be an artifact of CT, as any student who has both incorrect and correct attempts on a given question on the exam will have learning attributed to them. Finally, actual learning might be occurring during the exam. The amount of "learning" attributed to the exam period is, however, fairly negligible.

Importantly, one should not attempt to generalize about the learning potential of homework relative to elective practice for all courses from these results. We expect that courses that assign less homework might observe less learning during the homework period and students might compensate by studying more, thereby making more of the learning occur during that optional studying. It could also be the case that there are diminishing returns on each attempt on a specific question, with the first attempt providing the most learning benefit, then the second, decreasing further with each attempt from homework through the study period. It is reassuring, though, to see that this course's homework and study opportunities (i.e., the practice exam generators made available to students) both appear to contribute significantly to student learning.

6.2 RQ2: All question types show similar learning trends

When we disaggregate the analysis by question type, the general shape and progression of results is the same for every question type compared to the source exam. Different questions start with lower amounts of student knowledge, but this appears to mostly be a function of the difficulty of the problem's type: programming and short answer questions, which require more actual coding on the students' part, tended to start and end lower.

The lack of different behavior when we disaggregate by question type is more interesting than it may initially appear. This means that the "shape" of student learning does not differ significantly with the question type.
Given this, it appears that homework and additional studying have the same impact on student results regardless of the kind of question. This does mean there are diminishing returns on easier question types over the period compared to harder ones, but not a deficiency in how homework and practice help on question types where students still have learning they can do.

6.3 Limitations

There are some obvious limitations to the current work. First, our findings about the relative learning during the homework and study periods cannot be assumed to generalize to other course contexts. Courses with different homework, study materials, and exam structures will likely have different breakdowns of learning in each phase.

Second, CT is a fairly coarse measure of learning. Scores as a performance indicator are not alone proof of student learning. Additionally, CT's potential for underestimating the likelihood of correctness of first attempts (by strictly optimizing for RMSE) could make the model overestimate the learning that is occurring in the first few attempts, which is likely occurring in the homework period. We do not have confidence that these measures of learning are particularly precise. While we omit it from the paper, we also ran a regression model to estimate the learning in the same periods of the course. The regression generally showed the same trends as CT, giving us more confidence in CT's results.

Finally, these methods do not disambiguate between learning that happens during the homework and studying periods and learning that occurs specifically from homework and elective practice problems. There are notable reasons to believe that students are learning significantly from reading the textbook, engaging in active learning exercises, and, perhaps, even from listening to the lecturer speak.
The learning that occurs during these activities is attributed to the period in which it occurs, rather than to the specific task.

7. CONCLUSION

In this work, we explored the degree to which we can attribute student learning between required homework and elective study performed prior to a summative assessment. To analyze learning, we used a post hoc method of "this-item-correct" likelihood (correctness tracing) to estimate student knowledge. We found that (required) homework and (elective) studying both contributed significantly to student learning, with homework contributing slightly more. Further, despite using multiple question types, we found the most notable difference between question types is where student knowledge starts, not the shape of their learning improvements.

We think that our results show that frequent, exam-relevant homework and highly accessible means for study (e.g., practice exam generators) are both effective means of facilitating student learning, and we believe that these findings could generalize to other contexts. The magnitude of learning from each component may differ, but courses with similar homework and studying opportunities will hopefully see similar learning gains during each period.

There remain areas for future work. Considering data, we only use students' submissions to questions that also appear on homework. Some ability to include other learning events, such as reading a textbook, would give a clearer picture of students' learning process. Additionally, some topic-level labeling might allow us to include questions unique to exams in our data and analysis.

With respect to CT's model, we made no attempt to compensate for the method's tendency to underestimate on initial incorrect attempts. Future work could investigate constraining this behavior by limiting the allowable slope. Further, there is room to adapt the model to using a richer source of information than students' correctness on submissions, for example by fitting a similar optimization on students' knowledge as estimated by methods such as Item Response Theory (IRT).

8. REFERENCES

[1] E. Bailey, J. Jensen, J. Nelson, H. Wiberg, and J. Bell. Weekly formative exams and creative grading enhance student learning in an introductory biology course. CBE—Life Sciences Education, 16(1):ar2, 2017.
[2] J.-A. Baird, D. Andrich, T. N. Hopfenbeck, and G. Stobart. Assessment and learning: fields apart? Assessment in Education: Principles, Policy & Practice, 24(3):317–350, July 2017.
[3] R. S. J. d. Baker, A. T. Corbett, and V. Aleven. More accurate student modeling through contextual estimation of slip and guess probabilities in Bayesian knowledge tracing. In B. P. Woolf, E. Aïmeur, R. Nkambou, and S. Lajoie, editors, Intelligent Tutoring Systems, pages 406–415, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.
[4] G. Başol and G. Johanson. Effectiveness of frequent testing over achievement: A meta analysis study. Journal of Human Sciences, 6(2):99–121, July 2009.
[5] H. Bembenutty and M. C. White. Academic performance and satisfaction with homework completion among college students. Learning and Individual Differences, 24:83–88, Apr. 2013.
[6] J. Bempechat. The motivational benefits of homework: a social-cognitive perspective. Theory Into Practice, 43(3):189–196, Aug. 2004.
[7] R. E. Bennett. Formative assessment: a critical review. Assessment in Education: Principles, Policy & Practice, 18(1):5–25, Feb. 2011.
[8] P. Black and D. Wiliam. Developing the theory of formative assessment. Educational Assessment, Evaluation and Accountability (formerly: Journal of Personnel Evaluation in Education), 21(1):5, Jan. 2009.
[9] B. Chen, M. West, and C. B. Zilles. Towards a model-free estimate of the limits to student modeling accuracy. In K. E. Boyer and M. Yudelson, editors, Proceedings of the 11th International Conference on Educational Data Mining, EDM 2018, Buffalo, NY, USA, July 15-18, 2018. International Educational Data Mining Society (IEDMS), 2018.
[10] P. Chen, Y. Lu, V. W. Zheng, and Y. Pian. Prerequisite-driven deep knowledge tracing. In 2018 IEEE International Conference on Data Mining (ICDM), pages 39–48, 2018.
[11] S. L. Chew. Helping students to get the most out of studying. Acknowledgments and Dedication, page 215, 2014.
[12] A. T. Corbett and J. R. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4):253–278, Dec. 1994.
[13] W. Fakcharoenphol, E. Potter, and T. Stelzer. What students learn when studying physics practice exam problems. Phys. Rev. ST Phys. Educ. Res., 7:010107, May 2011.
[14] B. Gutarts and F. Bains. Does mandatory homework have a positive effect on student achievement for college students studying calculus? Mathematics and Computer Education, 44(3):232–244, Fall 2010.
[15] M. K. Hartwig and J. Dunlosky. Study strategies of college students: Are self-testing and scheduling related to achievement? Psychonomic Bulletin and Review, 19:126–134, 2012.
[16] S. Irvine and P. Kyllonen. Item Generation for Test Development. Lawrence Erlbaum Associates, 2002.
[17] J. A. Johnson and R. McKenzie. The effect on student performance of web-based learning and homework in microeconomics. Journal of Economics and Economic Education Research, 14(2):115–125, 2013.
[18] J. D. Karpicke and J. R. Blunt. Retrieval practice produces more learning than elaborative studying with concept mapping. Science, 331(6018):772–775, 2011.
[19] M. Khajah, R. V. Lindsey, and M. C. Mozer. How deep is knowledge tracing? CoRR, abs/1604.02416, 2016.
[20] M. Khajah, R. Wing, R. Lindsey, and M. Mozer. Integrating latent-factor and knowledge-tracing models to predict individual differences in learning. In Educational Data Mining 2014. Citeseer, 2014.
[21] J. Laverty, S. Underwood, R. Matz, L. Posey, J. Carmel, M. Caballero, C. L. Fata-Hartley, D. Ebert-May, S. E. Jardeleza, and M. M. Cooper. Characterizing college science assessments: The three-dimensional learning assessment protocol. PLoS ONE, 11(9):e0162333, 2016.
[22] P. Magalhães, D. Ferreira, J. Cunha, and P. Rosário. Online vs traditional homework: A systematic review on the benefits to students' performance. Computers & Education, 152:103869, July 2020.
[23] S. Minn, Y. Yu, M. C. Desmarais, F. Zhu, and J. Vie. Deep knowledge tracing and dynamic student classification for knowledge tracing. In 2018 IEEE International Conference on Data Mining (ICDM), pages 1182–1187, 2018.
[24] J. W. Morphew, M. Silva, G. Herman, and M. West. Frequent mastery testing with second-chance exams leads to enhanced student learning in undergraduate engineering. Applied Cognitive Psychology, 34(1):168–181, 2020.
[25] Z. A. Pardos and N. T. Heffernan. Modeling individualization in a Bayesian networks implementation of knowledge tracing. In P. De Bra, A. Kobsa, and D. Chin, editors, User Modeling, Adaptation, and Personalization, pages 255–266, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
[26] Z. A. Pardos and N. T. Heffernan. KT-IDEM: Introducing item difficulty to the knowledge tracing model. In J. A. Konstan, R. Conejo, J. L. Marzo, and N. Oliver, editors, User Modeling, Adaption and Personalization, pages 243–254, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.
[27] R. Pelánek. Bayesian knowledge tracing, logistic models, and beyond: an overview of learner modeling techniques. User Modeling and User-Adapted Interaction, 27(3):313–350, Dec. 2017.
[28] C. Piech, J. Spencer, J. Huang, S. Ganguli, M. Sahami, L. Guibas, and J. Sohl-Dickstein. Deep knowledge tracing. arXiv preprint arXiv:1506.05908, 2015.
[29] D. Ramdass and B. J. Zimmerman. Developing self-regulation skills: The important role of homework. Journal of Advanced Academics, 22(2):194–218, 2011.
[30] H. L. Roediger and J. F. Nestojko. The relative benefits of studying and testing on long-term retention. In Cognitive modeling in perception and memory: A festschrift for Richard M. Shiffrin, pages 99–111, 2015.
[31] C. S. Ryan and N. S. Hemmes. Effects of the contingency for homework submission on homework submission and quiz performance in a college course. Journal of Applied Behavior Analysis, 38(1):79–88, 2005.
[32] M. L. Still and J. D. Still. Contrasting traditional in-class exams with frequent online testing. Journal of Teaching and Learning with Technology, 4(2):30, 2015.
[33] B. W. Tuckman. Using tests as an incentive to motivate procrastinators to study. The Journal of Experimental Education, 66(2):141–147, 1998.
[34] C. K. Waugh and N. E. Gronlund. Assessment of Student Achievement (10th Edition). Pearson, 2012.
[35] M. West, G. L. Herman, and C. Zilles. PrairieLearn: Mastery-based online problem solving with adaptive scoring and recommendations driven by machine learning. In 2015 ASEE Annual Conference & Exposition, Seattle, Washington, 2015. ASEE Conferences.
[36] M. West, N. Walters, M. Silva, T. Bretl, and C. Zilles. Integrating diverse learning tools using the PrairieLearn platform. In Seventh SPLICE Workshop at SIGCSE 2021 (Virtual event), March 2021.
[37] D. Wiliam. What is assessment for learning? Studies in Educational Evaluation, 37(1):3–14, 2011.
[38] X. Xiong, S. Zhao, E. G. Van Inwegen, and J. E. Beck. Going deeper with deep knowledge tracing. International Educational Data Mining Society, 2016.
[39] L. Zhang, X. Xiong, S. Zhao, A. Botelho, and N. T. Heffernan. Incorporating rich features into deep knowledge tracing. In Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale, L@S '17, pages 169–172, New York, NY, USA, 2017. Association for Computing Machinery.
[40] C. Zilles, R. T. Deloatch, J. Bailey, B. B. Khattar, W. Fagen, C. Heeren, D. Mussulman, and M. West. Computerized testing: A vision and initial experiences. In American Society for Engineering Education (ASEE) Annual Conference, 2015.
[41] C. Zilles, M. West, G. Herman, and T. Bretl. Every university should have a computer-based testing facility. In Proceedings of the 11th International Conference on Computer Supported Education (CSEDU), May 2019.
[42] C. Zilles, M. West, D. Mussulman, and T. Bretl. Making testing less trying: Lessons learned from operating a Computer-Based Testing Facility. In 2018 IEEE Frontiers in Education (FIE) Conference, San Jose, California, 2018.