Evaluating an Instrumented Python CS1 Course

Austin Cory Bart, Teomara Rutherford, James Skripchuk
University of Delaware, Newark, DE
acbart@udel.edu, teomara@udel.edu, jskrip@udel.edu

ABSTRACT
The CS1 course is a critical experience for most novice programmers, requiring significant time and effort to overcome the inherent challenges. Ever-increasing enrollments mean that instructors have less insight into their students and can provide less individualized instruction. Automated programming environments and grading systems are one mechanism to scale CS1 instruction, but these new technologies can sometimes make it difficult for the instructor to gain insight into their learners. However, learning analytics collected by these systems can be used to make up some of the difference. This paper describes the process of mining a heavily-instrumented CS1 course to leverage fine-grained evidence of student learning. The existing Python-based curriculum was already heavily integrated with a web-based programming environment that captured keystroke-level student coding snapshots, along with various other forms of automated analyses. A Design-Based Research approach was taken to collect, analyze, and evaluate the data, with the intent to derive meaningful conclusions about the student experience and develop evidence-based improvements for the course. In addition to modeling our process, we report on a number of results regarding the persistence of student mistakes, measurements of student learning and errors, the association between student learning and student effort and procrastination, and places where we might be able to accelerate our curriculum's pacing. We hope that these results, as well as our generalized approach, can guide larger community efforts around systematic course analysis and revision.

Categories and Subject Descriptors
Social and professional topics [Professional topics]: Computing education; Information systems [Information systems applications]: Data mining

Keywords
cs1, modeling, dbr, python, procrastination

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. INTRODUCTION
The first Computer Science course (CS1) can be a challenging experience for novices given the constraints of a semester [22], but success in CS1 is critical for computer science students, as it sets a foundation for subsequent classes. Large amounts of practice and feedback are critical to this experience, so that learners can overcome programming misconceptions [17, 20] and develop effective schema. Instructors have a key role in developing materials to support learners' productive struggle. Recently, however, scaling enrollments [26] and the move to remote/hybrid learning environments have shifted much of this work away from interacting with individual students towards interacting with systems (which in turn interact with the students directly). For example, programming autograders [19] remove the instructor from the grading process, automatically assessing and sometimes even providing feedback directly to the learner.

Although these systems scale the learning process, they can inhibit the evaluation and revision of course materials. Instructors do not have as many first-hand interactions with students or the artifacts that they produce. When homeworks and exams are no longer hand-graded, teachers may not be as directly motivated to review each submission. Similarly, when automated feedback systems are effectively supporting students, teachers will have fewer opportunities to get direct insight into what issues students are encountering. This knowledge of the students' experience is critical to gauge the effectiveness of the course materials.
Instructors need a new model to guide their revision decisions. We propose that instructors follow a Design-Based Research (DBR) approach [8, 3] to iteratively improve their courses. In particular, course development should be seen as an iterative and statistical Instructional Design process: each semester, a curriculum is built and presented to learners as an intervention, data is generated and collected as learners interact, that data is analyzed to discover shortcomings and successes of the intervention, and then modifications to the "protocol" are identified for the next iteration of the study. Instructional Design models provide a systematic framework for this development process, but the DBR approach augments this to emphasize the statistical and theory-driven nature of the evaluation process. Fortunately, the same autograding tools that scale practice and feedback opportunities for students can also be used to collect many kinds of learning analytics, permitting the use of educational data mining to garner insights into learning [2].

In this paper, we present our experience of evaluating a CS1 course that has been heavily instrumented to provide rich data on student actions. Our goal is not to prove that our curriculum was a "success" or "failure" as a whole, but to empirically judge specific pieces and identify components that should be modified or maintained. We draw upon programming snapshot data, non-programming autograded question logs, surveys, exam data, and human assessments to produce a diverse dataset. In addition to sharing our conclusions about the state of our course, we believe that we present a formative model for other instructors who wish to evaluate their courses systematically. In fact, our specific analyses are recorded in a Jupyter Notebook (https://github.com/acbart/csedm20-paper-cs1-analysis). Our hope is that others will use our own analyses as a baseline to develop their own questions, and to motivate others to approach their courses with a more systematic, empirical method.
2. THEORIES AND RELATED WORK
The central premise of our approach is inspired by Design-Based Research, which has been well established in the education literature for decades. Those interested in an introduction to DBR can refer to [8]. Briefly, there are several key tenets: 1) Development is an iterative process of design, intervention, collection, and analysis. 2) Educational interventions cannot be decontextualized from their setting. 3) Processes from all phases of development must be captured and provided sufficient context to ensure reproducibility and replication. 4) Developing learning experiences cannot be separated from developing theories about learning. 5) Results from an intervention must inform the next iteration and be communicated out to broader stakeholders.

Messy authenticity is inherent in this process, and naturally limits the theoretical extent of findings in a DBR process. Therefore, any conclusions derived should not be seen as broadly applicable, but only meaningful for the context in which they were developed. Although theories of learning are generated from DBR, this is less true for early iterations. True success for a course is a moving target. As the curriculum improves and students overcome misconceptions faster, more material can be added. Over time, the curriculum necessarily needs to be updated and assignments refreshed. Further, courses often need to be adapted for new audiences with different demographics and prior experiences. Given that the DBR model strongly incorporates context, these realities can be accounted for at some level.

DBR has been somewhat underused in Computing Education Research (CER). Recently, Nelson and Ko (2018) made a strong argument that CER should almost exclusively follow Design-Based Research methodologies [27], for three reasons: 1) to avoid splitting attention between advancing theory vs. design, 2) the field has not generated enough domain-specific theories, and 3) theory has sometimes been used to impede effective design-based research in the peer review process. Many of the recommendations made in the paper echo the tenets of DBR listed above and are consistent with our vision for communicating our course designs. In fact, their paper was a major guiding inspiration.

Another major inspiration for our approach is Guzdial's 2013 paper evaluating their decade-long Computational Thinking course ("MediaComp") [15]. Although over a longer time scale, this paper takes a scientific, cohesive look at their course using a DBR lens. They critically evaluate what worked and contextualize all their findings by their design. They begin with a set of hypotheses about what aspects of the course will be effective, and then systematically review data collected from the offerings to accept or reject those hypotheses. Their conclusions, while not transcendent, are impactful for anyone modeling themselves after their context.

In computing education, programming log data has been used to make various kinds of predictions and evaluations of student learning [16]. Applications include predicting student performance in subsequent courses [10], identifying learners who need additional support [30], modelling student strategies as they work on programming problems [23], and evaluating students over the course of a semester [6, 24]. These approaches tend to rely on vast datasets or seek to derive conclusions that are predictive, highly transferable, or are about individual students. Although such research work is valuable, our goal is distinctive. We recognize that each course offering has an important local context that cannot be factored out, and that collecting sufficient evidence over time inhibits the process of iterative course design. Rather than developing generalizable theories or predicting performance, we seek actionable data from a single semester that an instructor can use to evaluate and redesign their course.

Effenberger et al. [11] are perhaps an example more closely aligned with our own research goals. Rather than evaluating students, their work sought to evaluate four programming problems in a course. Their results suggest that despite commonalities in the tasks, the problems' characteristics were considerably different, underscoring the danger of treating questions as interchangeable in course evaluation.

The process of systematic course revision is similar to the ID+KC model by Gusukuma (2018), which combines formal Instructional Design methodology with a cognitive student model based on Knowledge Components [14]. Instead of focusing on a student model, however, we focus on components of the instruction, such as the learning objectives. Still, the systematic process of data collection and analysis to inform revision is common between our methods.

3. CURRICULUM AND TECHNOLOGY
In this section, we describe the course's curriculum and technology. DBR necessitates a clear enough description of the curriculum to understand the evaluation conducted, so we cannot avoid low-level details; the context matters. We have attempted to separate out, however, the specific experiential details of our intervention (i.e., the course offering), which are described in Section 4.
As a starting point, we based our course on the PythonSneks curriculum (https://acbart.github.io/python-sneks/). This curriculum has students move through a large sequence of almost 50 lessons over the course of a semester, with each lesson focused on a particular introductory programming topic. Each lesson is composed of a set of learning objectives, the lesson presentation, a mastery-based quiz, and a set of programming problems. We have made a number of modifications to the materials reported in [4], such as the introduction of static typing and increasing the emphasis on functional design to better suit CS1 for Computer Science majors. A full listing of all the learning objectives covered is available (https://tinyurl.com/csedm2020-sneks-los).

Learning Management System: The course was delivered through Canvas, which was our university's Learning Management System. All material, including quizzes, programming assignments, and exams, was directly available in Canvas (either natively or through LTI).

Lesson Presentation: The lessons were PowerPoint slides with a recorded voice-over, embedded as a YouTube video directly into a Canvas Page. The content of these slides is transcribed directly below the video, including any code with proper syntax highlighting. Finally, PDF versions of all the slides with their transcriptions are also available.

Mastery Quizzes: After the presentations, students are presented with a Canvas Quiz containing a series of True/False, Matching, Multiple Choice, and Fill-in-the-Blank questions. This assignment is presented in a mastery style, where learners can make repeated attempts until they earn a satisfactory grade. Each of the 200+ questions is annotated with a specific identifier. These quizzes are 10% of students' grade.

Although Canvas provides an interface to visualize statistics about individual quiz questions, this is obfuscated by the students' multiple attempts: only the final grade is shown, so instructors cannot see how difficult a question was for a student. To provide greater detail in an instructor-friendly report, the Canvas API was used to pull all submission attempts for each student. The scripts used in analysis and an example of the instructor report are publicly available (https://github.com/acbart/canvas-grading-reports).
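As an illustration of this kind of report generation, the sketch below pages through the Canvas quiz submissions endpoint and prints each student's attempt count and score. It is a minimal sketch rather than our actual reporting scripts (which are linked above); the institution URL, the token handling, the course and quiz IDs, and the specific fields kept are assumptions.

    # Minimal sketch: pull quiz submission records from the Canvas REST API.
    # The base URL, token, IDs, and fields kept are assumptions for illustration.
    import requests

    CANVAS_URL = "https://canvas.example.edu/api/v1"   # hypothetical institution URL
    TOKEN = "..."                                      # an instructor-generated API token
    HEADERS = {"Authorization": f"Bearer {TOKEN}"}

    def get_paginated(url, params=None):
        """Follow Canvas's Link-header pagination and yield each JSON page."""
        while url:
            response = requests.get(url, headers=HEADERS, params=params)
            response.raise_for_status()
            yield response.json()
            url = response.links.get("next", {}).get("url")
            params = None  # subsequent page URLs already carry their parameters

    def quiz_submissions(course_id, quiz_id):
        """Collect every student's submission record for one quiz."""
        url = f"{CANVAS_URL}/courses/{course_id}/quizzes/{quiz_id}/submissions"
        records = []
        for page in get_paginated(url, params={"per_page": 100}):
            records.extend(page.get("quiz_submissions", []))
        return records

    # Example usage with hypothetical course and quiz IDs.
    for sub in quiz_submissions(course_id=12345, quiz_id=67890):
        print(sub.get("user_id"), sub.get("attempt"), sub.get("score"))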
Programming Problems: Additionally, most lessons contain two to eight programming problems delivered through a web-based Python coding environment [5]. These problems were also presented in a mastery style, allowing learners to spend as much time as they want until the deadline. These problems are 15% of students' final course grade. The environment has a dual block/text interface, although students were discouraged from using the block interface past the first two weeks of programming activities. The environment naturally records all student interactions in the ProgSnap2 format [28], making it readily accessible for our evaluation.

Students were also required to install (and eventually use) a desktop Python programming environment, Thonny [1]. Students largely used Thonny for their programming projects, particularly the final project, although a small number chose to use the environment to write code for other assignments. The Thonny environment was not instrumented to collect log data, but students were required to submit their projects through the autograder in Canvas; therefore, submission data should not be affected by the relatively small number of students who used Thonny.

When students submitted a solution to a programming problem, the system evaluated their work using an instructor-authored script written using the Pedal autograding framework [13]. This system generates feedback to learners and calculates a correctness grade (usually 0 or 1, although partial credit was possible on exams). The existing curriculum had a large quantity of autograded programming problems, some of which needed to be updated based on our changes.
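To make the shape of this feedback concrete, the following sketch shows the kind of check such a script performs: call the student's submitted function on instructor-chosen cases, produce one short message, and record a 0-or-1 correctness grade. This is a generic illustration under our own assumptions (a hypothetical add(a, b) problem and a pre-executed student namespace), not Pedal's actual API; the real checks are written with Pedal's feedback functions [13].

    # Generic illustration (not Pedal's API) of an instructor-authored check:
    # run the student's function on instructor cases, return (grade, message).
    def grade_submission(student_namespace):
        """student_namespace: dict of names defined by executing the student's code."""
        cases = [((3, 4), 7), ((0, 0), 0), ((-2, 5), 3)]   # hypothetical tests for add(a, b)
        add = student_namespace.get("add")
        if not callable(add):
            return 0, "You need to define a function named add."
        for args, expected in cases:
            try:
                actual = add(*args)
            except Exception as error:
                return 0, f"Calling add{args} raised an error: {error!r}"
            if actual != expected:
                return 0, f"add{args} returned {actual!r}, but {expected!r} was expected."
        return 1, "Great work!"

    # Example: execute a (hypothetical) submission and grade it.
    submission = "def add(a, b):\n    return a + b"
    namespace = {}
    exec(submission, namespace)
    print(grade_submission(namespace))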
Exams: There were two midterm exams and a final exam. These exams were all divided into two parts: 1) multiple-choice/true-false/matching/etc. questions, and 2) autograded programming questions. For the latter, students were given five or six programming problems that they could move freely between. These problems were automatically graded and given partial credit (20% for correctly specifying the header, and the remaining points allocated based on the percentage of passing instructor unit tests). Both parts were presented in Canvas through the systems students were already familiar with, but students were not allowed to use the internet or to Google. Students took the exam at a proctored testing center and had two hours. They were only allowed to bring a single sheet of hand-written notes. Multiple versions of each exam question were created and drawn from a pool at random, so that no two students had the exact same exam.

Projects: There were six projects throughout the semester, although the first two were very small and heavily scaffolded. The final project was relatively open-ended and meant to be summative, while the middle three projects allowed more mixed forms of support. Although students were largely expected to produce their own code, they were encouraged to seek help as needed from the instructional staff. For the final project, students used the Python Arcade library (https://arcade.academy/) to create a game. Because students were not previously taught Arcade, two weeks were allocated for students to work collaboratively on extending sample games with new functionality. Then, they individually built one of 12 games.

4. INTERVENTION
In this section, we describe the specific intervention context in more detail. The curriculum and technology were used in the Fall 2019 semester at an R1 university in the eastern United States for a CS1 course that was required for Computer Science majors in their first semester. An IRB-approved research protocol was followed. At the beginning of the semester, students were asked to provide consent via a survey, with 103 students agreeing out of 136 (for a 75.7% consent rate). A separate survey was also administered at the beginning of the semester to collect various demographic data (summarized in Table 1, only for consenting students) relating to gender, race, and prior coding experience.

                                  Percentage   Number
  Identifies as Woman                 19%        20
  Black Student                        6%         6
  No Prior Coding Experience          37%        38
  Total number of students           100%       103

Table 1: Demographic Data for Intervention

Instructional Staff: The course was taught by a single instructor. He managed a team of 12 undergraduate teaching assistants. These TAs varied from CS sophomores to seniors, and not all of them had taken the curriculum before. However, they were all selected by the instructor for both their knowledge and amiability. All members of the instructional staff hosted office hours. The TAs were also responsible for grading certain aspects of the projects (e.g., test quality, documentation quality, code quality), although this amounted to relatively little of the students' final course grade. The instructor met with these TAs every other week for an hour to discuss the state of the course and provide training on pedagogy, inclusivity, etc.

Structure: The lecture met Monday-Wednesday-Friday for 50 minutes across three separate sections. The sections were led by the same instructor, but were taught at different times of day (mid-morning, noon, and afternoon). The instructor did not attempt to provide the exact same experience to all three sections; if a mistake was made in the morning section, they attempted to avoid that mistake later. Typically, the first lecture session of a module started with 15-30 minutes of review of the material guided by clickers, and then students spent the rest of the module's class time working on assignments. There were several special in-class assignments such as worksheets, coding challenges, and readings. The lab met on Thursdays for 1.5 hours. Students worked on open assignments with the support of two TAs, who would actively walk around and answer questions.

5. RESULTS AND ANALYSIS
Our ultimate goal is to evaluate the course and identify aspects that were successful and unsuccessful. First, we consider basic final course outcomes. Then, we use the programming log data to analyze students' behavioral outcomes from the semester. We dive deeper into this data to characterize the feedback that was delivered to students over the semester. We look at fine-grained data from both parts of the final exam to develop a list of problematic subskills, and then review more of the programming log data in light of these results. We particularly focus our efforts on subskills related to defining functions, to tighten our analysis.

The instructor's naive perception of the course was that things were largely successful, except for the final project. Insufficient time was given to the students to learn the game development API, and instructor expectations were a bit high (which was adjusted for in the grading, but may have caused students undue stress). However, the material prior to the final project went smoothly. Office hours were rarely overfilled, with the exception of week 4 (the module introducing Functions), which had one lesson too many; this was resolved by making the last programming assignment optional (Programming 25: Functional Decomposition).

5.1 Basic Course Outcomes
As a starting point, we consider basic course-level outcomes, the kind that could be determined even without the extra instrumentation. These include the overall course grades, the major grade categories, and the university-administered course evaluations. The total number of failing grades and course withdrawals (DFW rate) was 14.5%, considered acceptable by the instructor.

[Figure 1: Exam and Final Project Grade Distributions. Histograms of Midterm 1, Midterm 2, Final Exam, and Final Project scores.]

Figure 1 gives histograms for Midterm 1 and 2, Final Exam, and Final Project scores. There was considerably more variance in the final project scores than the exams, possibly due to the issues outlined before. The fact that many students failed to produce a final project may be evidence that the assignment had unreasonable expectations.

A Kruskal-Wallis test was used to analyze final exam scores by demographics. There were no significant differences for gender, but a large difference for black students (H(1)=6.39, p=.01) and a smaller difference for prior programming experience (H(1)=5.51, p=.02). The students without prior experience scored about 12% lower on average, while the black students scored about 41% lower. Given the concerning spread here, we review this data with more context in the next section before drawing any conclusions.
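For readers who wish to replicate this kind of subgroup comparison, a minimal sketch is shown below, assuming a pandas DataFrame of consenting students with hypothetical columns for the final exam score and each demographic variable.

    # Minimal sketch of the subgroup comparison, assuming a DataFrame `outcomes`
    # with hypothetical columns "final_exam", "gender", "race", "prior_experience".
    import pandas as pd
    from scipy.stats import kruskal

    def compare_groups(frame, outcome, group):
        """Run a Kruskal-Wallis test of `outcome` across the levels of `group`."""
        samples = [values[outcome].dropna() for _, values in frame.groupby(group)]
        statistic, p_value = kruskal(*samples)
        return statistic, p_value

    # Example usage (illustrative only):
    # outcomes = pd.read_csv("final_grades.csv")
    # for column in ["gender", "race", "prior_experience"]:
    #     h, p = compare_groups(outcomes, "final_exam", column)
    #     print(f"{column}: H={h:.2f}, p={p:.3f}")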
The university-run course evaluations from students yielded positive but simplistic results. Both the course and the instructor were separately rated on a 5-point Likert scale (Poor ... Excellent). Both the course (Mdn=5, M=4.62, SD=0.77) and the instructor (Mdn=5, M=4.70, SD=0.67) achieved very high results, but ultimately this tells us little about the students' experience. Course evaluation data is known to contain bias and provide limited information [7, 25]; these results must be taken in context with other sources of data. Note that because the course evaluations are anonymous, they cannot be cross-referenced with other data. A review of the students' free response answers reveals many were unhappy with the Final Project. In fact, the word "Arcade" appears in 41 of the 86 text responses, often as their only comment. Although this helps us see a major point of failure in our curriculum, it highlights the need for alternative evaluation mechanisms. Relying solely on students' final perceptions leaves us vulnerable to student biases.

5.2 Time Spent Programming
The keystroke-level log data allows us to determine a number of interesting metrics beyond what is available from our grading spreadsheet. As a simple starting point, using the timestamps of the programming logs we can get a measure of how early students started working on assignments and the total time they spent. Earliness was measured by taking each submission event across the entire course, finding the difference between it and the relevant assignment's deadline, and averaging those durations together within each student. Hours Spent was measured by grouping all the events in the logs by student, finding the difference with the next adjacent event (clipping to a maximum of 30 seconds, to account for breaks), and summing these durations.
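A minimal sketch of these two metrics is shown below. It assumes the ProgSnap2 main event table has been loaded into a pandas DataFrame and that a deadline lookup is available; the column names ("SubjectID", "EventType", "ServerTimestamp", "AssignmentID") and the submit event label are assumptions about our particular export rather than requirements of the format.

    # Sketch of the Earliness and Hours Spent metrics from a ProgSnap2-style
    # event table; column names and the submit label are assumptions.
    import pandas as pd

    def earliness_hours(events, deadlines, submit_event="Submit", gap_cap="30s"):
        events = events.copy()
        events["ServerTimestamp"] = pd.to_datetime(events["ServerTimestamp"])

        # Earliness: average (deadline - submission time) per student, in hours.
        submits = events[events["EventType"] == submit_event].copy()
        submits["deadline"] = pd.to_datetime(submits["AssignmentID"].map(deadlines))
        submits["early"] = submits["deadline"] - submits["ServerTimestamp"]
        earliness = submits.groupby("SubjectID")["early"].mean().dt.total_seconds() / 3600

        # Hours Spent: sum the gaps between adjacent events, clipped at the cap.
        events = events.sort_values(["SubjectID", "ServerTimestamp"])
        gaps = events.groupby("SubjectID")["ServerTimestamp"].diff()
        gaps = gaps.clip(upper=pd.Timedelta(gap_cap))
        hours_spent = gaps.groupby(events["SubjectID"]).sum().dt.total_seconds() / 3600

        return pd.DataFrame({"earliness_hours": earliness, "hours_spent": hours_spent})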
[Figure 2: Comparison of Earliness, Time Spent, and Final Exam Score. (a) Earliness vs. Final Exam Score, (b) Hours Spent vs. Final Exam Score, (c) Earliness vs. Hours Spent.]

Figures 2a, 2b, and 2c show marginal plots between earliness, hours spent, and final exam grade. Spearman's Rho was used to calculate the correlation between each pair of outcomes. Consistent with Kazerouni [18], earliness (a measure of procrastination) had a significant medium correlation with exam scores (rs = .49, p < .001), while time spent was only modestly correlated (rs = -.32, p = .001). Interestingly, there was no significant correlation between students' time spent and their procrastination (rs = -0.09, p = .36).
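The corresponding correlation calls are straightforward with SciPy; the sketch below reuses the hypothetical metrics frame from the previous sketch, joined with final exam grades.

    # Minimal sketch of the Spearman correlations reported above.
    from scipy.stats import spearmanr

    def report_spearman(frame, x, y):
        rho, p = spearmanr(frame[x], frame[y], nan_policy="omit")
        print(f"{x} vs {y}: rs={rho:.2f}, p={p:.3f}")

    # Example usage (illustrative only):
    # metrics["final_exam"] = final_exam_scores   # aligned by student
    # report_spearman(metrics, "earliness_hours", "final_exam")
    # report_spearman(metrics, "hours_spent", "final_exam")
    # report_spearman(metrics, "hours_spent", "earliness_hours")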
Analyzing behavioral outcomes by demographics indicated no differences, with the exception of total hours spent between women vs. men (H(1)=9.77, p=0.002) and between students with vs. without prior experience (H(1)=7.28, p=0.007). This comparison is visualized in Figures 3a and 3b. Women and students with no prior experience spent, on average, about 8 and 5 hours more than their counterparts. Importantly, this means that there was no significant difference between subgroups in how early students started.

[Figure 3: Hours Spent by Demographics. (a) Hours Spent by Gender, (b) Hours Spent by Prior Experience.]

Given the difference in final exam scores, black students appear poorly served by the current curriculum. On average, these students spent as much time as their peers on assignments, but their final exam scores were lower than students outside of this category. Given the evidence for the continued education debt owed to non-White students (Ladson-Billings, 2006) [21], more work is needed to identify both potentially problematic structural elements of the course and how the course can better draw on student strengths to produce more equitable outcomes.

Figure 4 visualizes the total time spent by students per week on the programming problems. The data collected raises an interesting question: how many hours should we ideally expect students to spend on our courses? At our institution, the guidance from the administration (https://tinyurl.com/csedm2020-udel-credit-policy) is that in a three-credit course like this one, students should spend 45 hours in class and 90 hours outside of class over the course of the 15-week semester. The median time spent in our course by a given student on all the programming assignments was 19 hours, while the highest time spent by any individual student was just over 42 hours. This does not take into account time spent outside the coding environment (e.g., working on projects in Thonny), working on quizzes, or reading/watching the lesson presentations. However, some students did complete their projects in the online environment, and we expect most of those activities to take considerably less time than the programming activities. This may suggest that we are not asking our students to dedicate as much time as we might.

[Figure 4: Time Spent per Week of Semester. Hours spent on programming problems for each of the 16 weeks.]

Looking at specific time periods within the data, we see that students spent less time programming in the first weeks of the course, around the midpoint, and near the end of the programming problems (the last few weeks took place outside of the online coding environment). Especially for the earlier material, it is likely that the pace can be accelerated.

5.3 Error Classification
Table 2 gives the percentage of different feedback messages that students received on programming problems, as a percentage of all the feedback events received. Our numbers vary from those reported by Smith and Rixner (2019) [29], possibly because of our very different approach to feedback and the affordances of our programming environment.

  Category       Subcategory                 Percentage
  Instructor                                    37.8%
                 Problem Specific               32.1%
                 Not Enough Student Tests        1.0%
                 Not Printing Answer             0.8%
  Analyzer                                      22.1%
                 Initialization Problem          6.9%
                 Unused Variable                 5.9%
                 Multiple Return Types           2.9%
                 Incompatible Types              1.4%
                 Parameter Type Mismatch         1.0%
                 Overwritten Variable            0.6%
                 Read out of scope               0.5%
  Correct                                       17.8%
  Syntax                                        11.4%
                 No Source Code                  0.3%
  Runtime                                        7.8%
                 TypeError                       3.7%
                 NameError                       1.0%
                 AttributeError                  0.8%
                 ValueError                      0.4%
                 KeyError                        0.4%
                 IndexError                      0.4%
  Student        Student Tests Failing           2.9%
                 Instructions                    2.4%
  System Error                                   0.8%

Table 2: Frequency of Error Messages by Category
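Percentages like those in Table 2 can be tabulated directly from the feedback log once each event has been labeled; the sketch below assumes hypothetical "Category" and "Subcategory" columns on the feedback events.

    # Sketch of tabulating Table 2-style percentages from labeled feedback events.
    # The "Category" and "Subcategory" column names are our assumption.
    import pandas as pd

    def feedback_breakdown(feedback_events):
        """Return percentage of all feedback events per category and subcategory."""
        by_category = (feedback_events["Category"]
                       .value_counts(normalize=True)
                       .mul(100).round(1))
        by_subcategory = (feedback_events.groupby(["Category", "Subcategory"])
                          .size()
                          .div(len(feedback_events))
                          .mul(100).round(1))
        return by_category, by_subcategory

    # Weekly ratios (as in Figure 5) can be computed the same way by also
    # grouping on a week-of-semester column derived from the event timestamps.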
One of the most notable departures is the Analyzer and Problem-Specific Instructor feedback categories. Our autograding system is capable of overriding error messages. In particular, one of its key features is a type inferencer and flow analyzer that automatically provides more readable and targeted error messages. The subcategories give examples of the kinds of errors produced: Initialization Problem (using a variable that was not previously defined) frequently supersedes the classic NameError, for example. Meanwhile, some issues have no corresponding runtime error, such as Unused Variable (never reading a variable that was previously written to). The Analyzer gives more than a fifth of all feedback delivered to students, suggesting its role is significant. Further work is needed to evaluate the quality of this feedback and its impact on students' learning.

The Problem-Specific Instructor feedback category is opaque. Given that this represents almost a third of the feedback, it is unhelpful that the category cannot be easily broken down further. Sampling the logs' text, we see examples like students failing instructor unit tests, a reminder to call a function just once, and a suggestion to avoid a specific subscript index. Although the autograder is a powerful mechanism for delivering contextualized help to students, the lack of organization severely limits the automated analysis possible. As part of our process in the future, we intend to annotate feedback in our autograding scripts with identifiers.

[Figure 5: Ratio of Feedback Types by Week of Semester. (a) Ratio of Correct Submissions, (b) Ratio of Syntax Errors, (c) Ratio of Runtime Errors, each by week of the semester.]

Figure 5a gives the ratio of correct submission events in the log data over each week of the semester. Early on, students complete problems with fewer attempts. This might explain the steady growth in the ratios of different kinds of feedback over time, as evidenced by Figures 5b and 5c. It is interesting to observe that the runtime error frequency grows almost linearly over the course of the semester, with the exception of week 7 (a peak week for the Analyzer feedback). We hypothesize this is the result of some more carefully-refined instructor feedback available during that week.

5.4 Final Exam Conceptual Questions
The first part of the final exam was composed of conceptual questions for topics across the curriculum, drawn largely from the quiz questions students had already seen. In this section, we review the quiz report to determine the topics that students struggled with. Most students performed relatively well across the questions, so we focus on errors where more than 20% of the students had incorrect answers.

Students largely had no issues with questions involving evaluating expressions. A small exception to this is students' struggle with Equality vs. Order of different types. In Python, as in many languages, it is not an error to check whether two things of different types are equal (although that comparison will always produce False); however, it is an error to compare their order (with the less-than/greater-than operators). In fact, 58% of students got this specific question wrong.
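To make the distinction concrete, the snippet below shows the behavior in question (a plain Python illustration, not an exam item):

    # Comparing values of different types: equality is allowed, ordering is not.
    print(3 == "3")    # False: values of different types are simply unequal
    try:
        print(3 < "3")
    except TypeError as error:
        print("Ordering raised:", error)   # '<' not supported between int and str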
There were three questions related to tracing complex control flow for loops, if statements, and functions. Tracing seemed to pose difficulties, with between 25-40% of the students getting these questions wrong. We believe that more emphasis should be placed on tracing in the curriculum; there are quiz questions and a worksheet dedicated to the topic, but there are opportunities to expand this material. Tracing has been a recent area of focus, with promising approaches by Xie et al. [31] and Cunningham et al. [9].

Dictionaries also posed significant trouble for students. Dictionaries come up later in the course, represent more complex reality, and conflate syntactic operations with lists. In fact, this last point is evidenced by data. In a question comparing the relative speed of traversing lists and dictionaries, 50% of the students got one variant of a True/False question wrong (so they might as well have been guessing).

Again, the point of our analysis is not necessarily to develop a validated examination instrument or to distill an authoritative set of misconceptions. Instead, we seek to demonstrate the insight we have garnered from reviewing our exam. With these simple percentages, we have found targets.

5.5 Deeper Dive on Functions
In looking over the second part of the final exam questions, we are faced with a tremendous number of concepts integrated into each problem. In fact, with over 261 learning objectives in the course, analyzing the entire set is an overwhelming prospect. To scope our analysis for this paper, we decided to focus on a subset of skills related to Functions that we felt we could clearly identify with computational analysis and that the instructor felt, a priori, they had seen students struggle with over the course of the semester. Table 3 gives the percentages and quantity of students who successfully demonstrated each subskill on each exam.

  Subskill             Description                                               1st Exam     2nd Exam      Final Exam
  Header Definition    Defined the function header with correct syntax           83.5% (86)   84.5% (87)    91.3% (94)
  Provided Types       Provided types for all parameters and the return          40.8% (42)   45.6% (47)    37.9% (39)
  Parameter Overwrite  Did not assign literal values to parameters in the body   88.3% (91)   98.1% (101)   99.0% (102)
  Return/Print         Did not print without returning                           80.6% (83)   89.3% (92)    91.3% (94)
  Parameters/Input     Did not use the input function instead of parameters      96.1% (99)   100.0% (103)  99.0% (102)
  Unit Testing         Wrote unit tests                                          88.3% (91)   79.6% (82)    67.0% (69)
  Decomposition        Separated work into a helper function                      1.0% (1)    17.5% (18)    19.4% (20)

Table 3: Percentage of Students Demonstrating Subskill across Exams

Header Definition: Even though we had not observed many students struggling with syntax during the semester, we felt it critical to analyze the incidence of submitted code that had malformed headers. Although the numbers were a little higher than expected, we are not terribly concerned: reviewing the submissions, many seemed like simple typos (e.g., a missing colon) that were relatively easily fixed.

Provided Types: Students were not required or encouraged to provide types in their headers during the exam. In fact, since the advanced feedback features were turned off, their feedback would not actually reference any parameter or return types they specified (as long as the code was syntactically correct). We did not assess the correctness of their provided types, merely their existence. In the final exam, the number of students who annotated their parameter types falls off sharply after the first three questions (moving from about 50% down to 20%). We offer two explanations: first, the fourth question is one of the most difficult in the entire course, so students may have been distracted by its difficulty. Second, the last questions all involve more complicated nested data types (e.g., lists of dictionaries) that were too troublesome for the students to specify.

Parameter Overwriting: This misconception is one that the instructors were very concerned with, having observed it repeatedly among certain students early in the semester (and being concerned with its persistence). Applying the parameter overwriting pattern to the rest of the submissions over the entire semester, we found that the behavior trails off over the course of the semester. By the final exam, almost no students were making this particular mistake. Although the instructor believes that more can be done up front to avoid this critical misconception, it is comforting that the existing curriculum seems to largely address it by the end.
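As an example of how such a subskill can be identified computationally, the sketch below flags the parameter-overwriting pattern from Table 3 (assigning a literal value to a parameter inside the function body) using Python's ast module. It is a simplified sketch of our own, not the exact detector used in the analysis notebook.

    # Detect "parameter overwrite": a function parameter reassigned a literal
    # constant in the function body. Simplified sketch for illustration.
    import ast

    def overwritten_parameters(source):
        """Return (function name, parameter) pairs that were overwritten."""
        problems = []
        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                parameters = {arg.arg for arg in node.args.args}
                for statement in ast.walk(node):
                    if isinstance(statement, ast.Assign) and isinstance(statement.value, ast.Constant):
                        for target in statement.targets:
                            if isinstance(target, ast.Name) and target.id in parameters:
                                problems.append((node.name, target.id))
        return problems

    # Example: this submission overwrites its parameter, ignoring the caller's argument.
    submission = "def add_tax(price):\n    price = 100\n    return price * 1.06"
    print(overwritten_parameters(submission))   # [('add_tax', 'price')]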
Return/Print: We observed that some students struggle to differentiate between the concepts of return statements and print calls. However, students were largely successful with this subskill, despite a quarter of students getting a related (more abstract) version of this subskill wrong on part 1 of the final exam. It seems that although troublesome for a small clutch of students, most are able to eventually separate these concepts in their code.

Parameters/Input: Similar to students' issues with returning vs. printing, some students were observed in individual sessions mixing up parameters and the input function (which was presented as a very distinctive way that data could enter a function). However, it appears that this was truly isolated to just a few students.

Functional Decomposition: Largely inspired by Fisler's [12] success in overcoming the difficulties of the Rainfall problem, Functional Decomposition was taught as a method for processing complex data. Students had previously been taught 8 different looping patterns (e.g., accumulating, mapping, filtering). A number of assignments required students to decompose problems. Therefore, it is somewhat disappointing that so few students chose to leverage decomposition (particularly since the harder final exam problems were naturally susceptible to a decomposition approach). In addition to the Midterm 2 and final exam questions, we also took a closer look at an earlier open-ended programming problem that was particularly complex and well-suited to decomposition. In these problems, there seemed to be a pattern of students being more successful when they leveraged decomposition. Although not conclusive, this supports the hypothesis that decomposition may be an effective strategy.

                            Decomposed          Monolithic
                            Pass     Fail       Pass     Fail
  Earlier Problem            37        8         29       27
  Midterm 2 Question 5       13        5         40       43
  Final Exam Question 4       7        8         42       44
  Total                      57       21        111      114
                           18.8%     6.9%      36.6%    37.6%

Table 4: Students' Use of Decomposition over Time

Unit Testing: Given that students were not required to unit test their code on the final exam, we were pleased to find that many students wrote unit tests anyway. Interestingly, though, the percentage of students who used this strategy decreased over the course of the semester, even as the programming problems became more difficult. We hypothesize that since the later exam problems involve complex nested data, students either did not feel comfortable generating test data or they felt that it would not be an efficient use of their time. We believe that we need to sell the concept more: rather than thinking that writing test cases would be a detriment to their success, students should see tests as one of the most direct paths to completion.
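The remaining submission-level subskills in Table 3 can be checked in a similar AST-based way. The sketch below tests whether a submission decomposes work into a helper function that another function actually calls, and whether it contains any assert-based unit tests; again, this is a simplified illustration rather than the exact classifier behind Tables 3 and 4.

    # Simplified checks for two Table 3 subskills: Decomposition (a user-defined
    # helper called by another user-defined function) and Unit Testing (at least
    # one assert statement). Illustrative only.
    import ast

    def uses_decomposition(source):
        tree = ast.parse(source)
        defined = {node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)}
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                for call in ast.walk(node):
                    if (isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
                            and call.func.id in defined and call.func.id != node.name):
                        return True
        return False

    def wrote_unit_tests(source):
        return any(isinstance(node, ast.Assert) for node in ast.walk(ast.parse(source)))

    submission = """
    def total(prices):
        return sum(prices)

    def average(prices):
        return total(prices) / len(prices)

    assert average([1, 2, 3]) == 2
    """
    cleaned = "\n".join(line[4:] if line.startswith("    ") else line
                        for line in submission.splitlines())
    print(uses_decomposition(cleaned), wrote_unit_tests(cleaned))   # True True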
6. DISCUSSION
Reviewing our findings, we made several decisions about places to modify our curriculum. The log data suggests that some of the earlier material can be accelerated, so that more time can ultimately be allocated to week 4 (critical material covering functions). We also believe we need to spend more time throughout the semester convincing students that subskills like decomposition and unit testing can help them solve challenging questions, although follow-up analyses will be needed to confirm this theory. Finally, we must come up with new ways to support some of our demographic subgroups, given that outcomes in that area are not yet equal.

Better structure in our existing data sources might help in future analyses. For example, although each quiz question was labeled with a unique identifier, we realized during analysis that we really needed every quiz answer (and in some cases, sets of answers) to have a unique identifier as well. In particular, some questions had multiple parts, or different answers yielded information about different misconceptions. In a similar vein, annotating instructor feedback for the programming problems would have substantially increased the differentiation of our feedback messages.

More metadata about each identifier would also help efforts to cross-reference and cluster related problems (especially over time). This is a non-trivial effort, given the quantity of course materials present in the curriculum. As a starting point, we believe this effort should probably be focused on certain major learning objectives and topics (e.g., functions) that are particularly worthy of attention based on the formative evaluation conducted here.

We expect that before the next iteration of our analyses, we need to develop more hypotheses up front for guidance. A considerable amount of time was spent performing exploratory analyses, trying different approaches and seeing what emerged from the data. Although helpful as we oriented ourselves, the data dredging that can emerge may yield false conclusions that are not actually worth investing in. Finally, while we attempted to follow a replicable process in our data collection and analysis, we believe more should be done to streamline and package our data pipeline to encourage replication and reproduction.

7. CONCLUSION
In this paper, we have described our evaluation of data from a heavily-instrumented CS1 course. Our goal was less about judging the course overall, and more about finding specific areas of improvement and success. We feel that course evaluation is less about the end goal and more about small iterative augmentations that accumulate over time. To structure our approach, we followed a loose Design-Based Research model supported by educational data mining. In our experience, the high volume and variety of data sources can be very helpful in understanding the successes and failures of the course, although it does pose difficulties for analysis. As always, a better pipeline could help make sense of these data and results more quickly, possibly even during the semester. However, in the immediate term, our data analysis contributes to the community's knowledge of students and ideally provides a model for others to follow. In general, we hope to encourage increased rigor in course evaluation as we integrate data-rich tools into our courses.
8. REFERENCES
[1] A. Annamaa. Introducing Thonny, a Python IDE for learning programming. In Proceedings of the 15th Koli Calling Conference on Computing Education Research, pages 117–121, 2015.
[2] R. S. Baker and P. S. Inventado. Educational data mining and learning analytics. In Learning Analytics, pages 61–75. Springer, 2014.
[3] S. Barab and K. Squire. Design-based research: Putting a stake in the ground. The Journal of the Learning Sciences, 13(1):1–14, 2004.
[4] A. C. Bart, A. Sarver, M. Friend, and L. Cox II. PythonSneks: An open-source, instructionally-designed introductory curriculum with action-design research. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education, 2019.
[5] A. C. Bart, J. Tibau, E. Tilevich, C. A. Shaffer, and D. Kafura. BlockPy: An open access data-science environment for introductory programmers. Computer, 50(5):18–26, 2017.
[6] K. Buffardi. Assessing individual contributions to software engineering projects with git logs and user stories. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education, pages 650–656, 2020.
[7] S. E. Carrell and J. E. West. Does professor quality matter? Evidence from random assignment of students to professors. Journal of Political Economy, 118(3):409–432, 2010.
[8] Design-Based Research Collective. Design-based research: An emerging paradigm for educational inquiry. Educational Researcher, 32(1):5–8, 2003.
[9] K. Cunningham, S. Blanchard, B. Ericson, and M. Guzdial. Using tracing and sketching to solve programming problems: Replicating and extending an analysis of what students draw. In Proceedings of the 2017 ACM Conference on International Computing Education Research, pages 164–172, 2017.
[10] N. Diana, M. Eagle, J. Stamper, S. Grover, M. Bienkowski, and S. Basu. Measuring transfer of data-driven code features across tasks in Alice. 2018.
[11] T. Effenberger, J. Cechák, and R. Pelánek. Difficulty and complexity of introductory programming problems. 2019.
[12] K. Fisler. The recurring rainfall problem. In Proceedings of the Tenth Annual Conference on International Computing Education Research, pages 35–42, 2014.
[13] L. Gusukuma, A. C. Bart, and D. Kafura. Pedal: An infrastructure for automated feedback systems. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education, pages 1061–1067, 2020.
[14] L. Gusukuma, A. C. Bart, D. Kafura, J. Ernst, and K. Cennamo. Instructional design + knowledge components: A systematic method for refining instruction. In Proceedings of the 49th ACM Technical Symposium on Computer Science Education, pages 338–343, 2018.
[15] M. Guzdial. Exploring hypotheses about media computation. In Proceedings of the Ninth Annual International ACM Conference on International Computing Education Research, pages 19–26, 2013.
[16] P. Ihantola, A. Vihavainen, A. Ahadi, M. Butler, J. Börstler, S. H. Edwards, E. Isohanni, A. Korhonen, A. Petersen, K. Rivers, et al. Educational data mining and learning analytics in programming: Literature review and case studies. In Proceedings of the 2015 ITiCSE on Working Group Reports, pages 41–63, 2015.
[17] L. C. Kaczmarczyk, E. R. Petrick, J. P. East, and G. L. Herman. Identifying student misconceptions of programming. In Proceedings of the 41st ACM Technical Symposium on Computer Science Education, pages 107–111, 2010.
[18] A. M. Kazerouni, S. H. Edwards, and C. A. Shaffer. Quantifying incremental development practices and their relationship to procrastination. In Proceedings of the 2017 ACM Conference on International Computing Education Research, pages 191–199, 2017.
[19] H. Keuning, J. Jeuring, and B. Heeren. Towards a systematic review of automated feedback generation for programming exercises. In Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education, pages 41–46, 2016.
[20] E. Kurvinen, N. Hellgren, E. Kaila, M.-J. Laakso, and T. Salakoski. Programming misconceptions in an introductory level programming course exam. In Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education, pages 308–313, 2016.
[21] G. Ladson-Billings. From the achievement gap to the education debt: Understanding achievement in U.S. schools. Educational Researcher, 35(7):3–12, 2006.
[22] A. Luxton-Reilly. Learning to program is easy. In Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education, pages 284–289, 2016.
[23] P. Mandal and I.-H. Hsiao. Using differential mining to explore bite-size problem solving practices. In Educational Data Mining in Computer Science Education (CSEDM) Workshop, 2018.
[24] C. Matthies, R. Teusner, and G. Hesse. Beyond surveys: Analyzing software development artifacts to assess teaching efforts. In 2018 IEEE Frontiers in Education Conference (FIE), pages 1–9. IEEE, 2018.
[25] K. M. Mitchell and J. Martin. Gender bias in student evaluations. PS: Political Science & Politics, 51(3):648–652, 2018.
[26] National Academies of Sciences, Engineering, and Medicine. Assessing and Responding to the Growth of Computer Science Undergraduate Enrollments. National Academies Press, 2018.
[27] G. L. Nelson and A. J. Ko. On use of theory in computing education research. In Proceedings of the 2018 ACM Conference on International Computing Education Research, pages 31–39, 2018.
[28] T. W. Price, D. Hovemeyer, K. Rivers, B. A. Becker, et al. ProgSnap2: A flexible format for programming process data. In The 9th International Learning Analytics & Knowledge Conference, Tempe, Arizona, March 2019.
[29] R. Smith and S. Rixner. The error landscape: Characterizing the mistakes of novice programmers. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education, pages 538–544, 2019.
[30] Y. Vance Paredes, D. Azcona, I.-H. Hsiao, and A. F. Smeaton. Predictive modelling of student reviewing behaviors in an introductory programming course. 2018.
[31] B. Xie, G. L. Nelson, and A. J. Ko. An explicit strategy to scaffold novice program tracing. In Proceedings of the 49th ACM Technical Symposium on Computer Science Education, pages 344–349, 2018.