              Evaluating an Instrumented Python CS1 Course

                   Austin Cory Bart, Teomara Rutherford, and James Skripchuk
                            University of Delaware, Newark, DE
                     acbart@udel.edu, teomara@udel.edu, jskrip@udel.edu



ABSTRACT
The CS1 course is a critical experience for most novice programmers, requiring significant time and effort to overcome the inherent challenges. Ever-increasing enrollments mean that instructors have less insight into their students and can provide less individualized instruction. Automated programming environments and grading systems are one mechanism to scale CS1 instruction, but these new technologies can sometimes make it difficult for the instructor to gain insight into their learners. However, learning analytics collected by these systems can be used to make up some of the difference. This paper describes the process of mining a heavily-instrumented CS1 course to leverage fine-grained evidence of student learning. The existing Python-based curriculum was already heavily integrated with a web-based programming environment that captured keystroke-level student coding snapshots, along with various other forms of automated analyses. A Design-Based Research approach was taken to collect, analyze, and evaluate the data, with the intent to derive meaningful conclusions about the student experience and develop evidence-based improvements for the course. In addition to modeling our process, we report on a number of results regarding the persistence of student mistakes, measurements of student learning and errors, the association between student learning and student effort and procrastination, and places where we might be able to accelerate our curriculum's pacing. We hope that these results, as well as our generalized approach, can guide larger community efforts around systematic course analysis and revision.

Categories and Subject Descriptors
Social and professional topics [Professional topics]: Computing education; Information systems [Information systems applications]: Data mining

Keywords
cs1, modeling, dbr, python, procrastination

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. INTRODUCTION
The first Computer Science course (CS1) can be a challenging experience for novices given the constraints of a semester [22], but success in CS1 is critical for computer science students, as it sets a foundation for subsequent classes. Large amounts of practice and feedback are critical to this experience, so that learners can overcome programming misconceptions [17, 20] and develop effective schema. Instructors have a key role in developing materials to support learners' productive struggle. Recently, however, scaling enrollments [26] and the move to remote/hybrid learning environments have shifted much of this work away from interacting with individual students towards interacting with systems (which in turn interact with the students directly). For example, programming autograders [19] remove the instructor from the grading process, automatically assessing and sometimes even providing feedback directly to the learner.

Although these systems scale the learning process, they can inhibit the evaluation and revision of course materials. Instructors do not have as many first-hand interactions with students or the artifacts that they produce. When homework and exams are no longer hand-graded, teachers may not be as directly motivated to review each submission. Similarly, when automated feedback systems are effectively supporting students, teachers will have fewer opportunities to get direct insight into what issues students are encountering. This knowledge of the students' experience is critical to gauge the effectiveness of the course materials. Instructors need a new model to guide their revision decisions.

We propose that instructors follow a Design-Based Research (DBR) approach [8, 3] to iteratively improve their course. In particular, course development should be seen as an iterative and statistical Instructional Design process; each semester, a curriculum is built and presented to learners as an intervention, data is generated and collected as learners interact, that data is analyzed to discover shortcomings and successes of the intervention, and then modifications to the "protocol" are identified for the next iteration of the study. Instructional Design models provide a systematic framework for this development process, but the DBR approach augments this to emphasize the statistical and theory-driven nature of the evaluation process. Fortunately, the same autograding tools that scale practice and feedback opportunities for students can also be used to collect many kinds of learning analytics, permitting the use of educational data mining to garner insights into learning [2].
In this paper, we present our experience of evaluating a CS1 course that has been heavily instrumented to provide rich data on student actions. Our goal is not to prove that our curriculum was a "success" or "failure" as a whole, but to empirically judge specific pieces and identify components that should be modified or maintained. We draw upon programming snapshot data, non-programming autograded question logs, surveys, exam data, and human assessments to produce a diverse dataset. In addition to sharing our conclusions about the state of our course, we believe that we present a formative model for other instructors who wish to evaluate their courses systematically. In fact, our specific analyses are recorded in a Jupyter Notebook (https://github.com/acbart/csedm20-paper-cs1-analysis). Our hope is that others will use our analyses as a baseline for developing their own questions, and that our example will motivate others to approach their courses with a more systematic, empirical method.

2. THEORIES AND RELATED WORK
The central premise of our approach is inspired by Design-Based Research, which has been well established in the education literature for decades. Those interested in an introduction to DBR can refer to [8]. Briefly, there are several key tenets: 1) Development is an iterative process of design, intervention, collection, and analysis. 2) Educational interventions cannot be decontextualized from their setting. 3) Processes from all phases of development must be captured and provided sufficient context to ensure reproducibility and replication. 4) Developing learning experiences cannot be separated from developing theories about learning. 5) Results from an intervention must inform the next iteration and be communicated out to broader stakeholders.

Messy authenticity is inherent in this process, and naturally limits the theoretical extent of findings in a DBR process. Therefore, any conclusions derived should not be seen as broadly applicable, but only meaningful for the context in which they were developed. Although theories of learning are generated from DBR, this is less true for early iterations. True success for a course is a moving target. As the curriculum improves and students overcome misconceptions faster, more material can be added. Over time, the curriculum necessarily needs to be updated and assignments refreshed. Further, courses often need to be adapted for new audiences with different demographics and prior experiences. Given that the DBR model strongly incorporates context, these realities can be accounted for at some level.

DBR has been somewhat underused in Computing Education Research (CER). Recently, Nelson and Ko (2018) made a strong argument that CER should almost exclusively follow Design-Based Research methodologies [27], for three reasons: 1) to avoid splitting attention between advancing theory and advancing design, 2) the field has not generated enough domain-specific theories, and 3) theory has sometimes been used to impede effective design-based research in the peer review process. Many of the recommendations made in the paper echo the tenets of DBR listed above and are consistent with our vision for communicating our course designs. In fact, their paper was a major guiding inspiration.

Another major inspiration for our approach is Guzdial's 2013 paper evaluating their decade-long Computational Thinking course ("MediaComp") [15]. Although it covers a longer time scale, the paper takes a scientific, cohesive look at the course using a DBR lens. They critically evaluate what worked and contextualize all their findings by their design. They begin with a set of hypotheses about which aspects of the course will be effective, and then systematically review data collected from the offerings to accept or reject those hypotheses. Their conclusions, while not transcendent, are impactful for anyone modeling themselves after their context.

In computing education, programming log data has been used to make various kinds of predictions and evaluations of student learning [16]. Applications include predicting student performance in subsequent courses [10], identifying learners who need additional support [30], modelling student strategies as they work on programming problems [23], and evaluating students over the course of a semester [6, 24]. These approaches tend to rely on vast datasets or seek to derive conclusions that are predictive, highly transferable, or are about individual students. Although such research is valuable, our goal is distinct. We recognize that each course offering has an important local context that cannot be factored out, and that collecting sufficient evidence over time inhibits the process of iterative course design. Rather than developing generalizable theories or predicting performance, we seek actionable data from a single semester that an instructor can use to evaluate and redesign their course.

Effenberger et al. [11] provide perhaps the example most closely aligned with our own research goals. Rather than evaluating students, their work sought to evaluate four programming problems in a course. Their results suggest that despite commonalities in the tasks, the problems' characteristics were considerably different, underscoring the danger of treating questions as interchangeable in course evaluation.

The process of systematic course revision is similar to the ID+KC model by Gusukuma (2018), which combines formal Instructional Design methodology with a cognitive student model based on Knowledge Components [14]. Instead of focusing on a student model, however, we focus on components of the instruction, such as the learning objectives. Still, the systematic process of data collection and analysis to inform revision is common between our methods.

3. CURRICULUM AND TECHNOLOGY
In this section, we describe the course's curriculum and technology. DBR necessitates a clear enough description of the curriculum to understand the evaluation conducted, so we cannot avoid low-level details; the context matters. We have attempted to separate out, however, the specific experiential details of our intervention (i.e., the course offering), which are described in Section 4.

As a starting point, we based our course on the PythonSneks curriculum (https://acbart.github.io/python-sneks/). This curriculum has students move through a large sequence of almost 50 lessons over the course of a semester, with each lesson focused on a particular introductory programming topic. Each lesson is composed of a set of learning objectives, the lesson presentation, a mastery-based quiz, and a set of programming problems.
We have made a number of modifications to the materials reported in [4], such as introducing static typing and increasing the emphasis on functional design to better suit CS1 for Computer Science majors. A full listing of all the learning objectives covered is available (https://tinyurl.com/csedm2020-sneks-los).

Learning Management System: The course was delivered through Canvas, which was our university's Learning Management System. All material, including quizzes, programming assignments, and exams, was directly available in Canvas (either natively or through LTI).

Lesson Presentation: The lessons were PowerPoint slides with a recorded voice-over, embedded as a YouTube video directly into a Canvas Page. The content of these slides is transcribed directly below the video, including any code with proper syntax highlighting. Finally, PDF versions of all the slides with their transcriptions are also available.

Mastery Quizzes: After the presentations, students are presented with a Canvas Quiz containing a series of True/False, Matching, Multiple Choice, and Fill-in-the-blank questions. This assignment is presented in a mastery style, where learners can make repeated attempts until they earn a satisfactory grade. Each of the 200+ questions is annotated with a specific identifier. These quizzes are 10% of students' grade.

Although Canvas provides an interface to visualize statistics about individual quiz questions, this is obfuscated by students' multiple attempts; only the final grade is shown, so instructors cannot see how difficult a question was for a student. To provide greater detail in an instructor-friendly report, the Canvas API was used to pull all submission attempts for each student. The scripts used in the analysis and an example of the instructor report are publicly available (https://github.com/acbart/canvas-grading-reports).
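To illustrate the report-generation step, the sketch below pulls quiz submission attempts through the Canvas REST API and flattens them into a single table. The endpoint paths and response fields are assumptions based on the public Canvas API documentation rather than a copy of our scripts, and the base URL, token, and course ID are placeholders.

import requests
import pandas as pd

BASE_URL = "https://canvas.example.edu/api/v1"     # placeholder institution URL
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}  # placeholder access token
COURSE_ID = 12345                                  # placeholder course id

def paginate(url, params=None):
    """Yield each JSON page, following Canvas's Link-header pagination."""
    while url:
        response = requests.get(url, headers=HEADERS, params=params)
        response.raise_for_status()
        yield response.json()
        url = response.links.get("next", {}).get("url")
        params = None  # the "next" link already carries the query string

rows = []
for quiz_page in paginate(f"{BASE_URL}/courses/{COURSE_ID}/quizzes", {"per_page": 100}):
    for quiz in quiz_page:
        submissions_url = f"{BASE_URL}/courses/{COURSE_ID}/quizzes/{quiz['id']}/submissions"
        for submission_page in paginate(submissions_url, {"per_page": 100}):
            for submission in submission_page.get("quiz_submissions", []):
                rows.append({
                    "quiz": quiz["title"],
                    "student": submission["user_id"],
                    "attempts": submission["attempt"],
                    "score": submission["score"],
                })

attempts = pd.DataFrame(rows)
# Summarize how much retrying each quiz required across the class.
print(attempts.groupby("quiz")["attempts"].describe())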
Programming Problems: Additionally, most lessons contain two to eight programming problems delivered through a web-based Python coding environment [5]. These problems were also presented in a mastery style, allowing learners to spend as much time as they want until the deadline. These problems are 15% of students' final course grade. The environment has a dual block/text interface, although students were discouraged from using the block interface past the first two weeks of programming activities. The environment naturally records all student interactions in ProgSnap2 format [28], making it readily accessible for our evaluation.

Students were also required to install (and eventually use) a desktop Python programming environment, Thonny [1]. Students largely used Thonny for their programming projects, particularly the final project, although a small number chose to use the environment to write code for other assignments. The Thonny environment was not instrumented to collect log data, but students were required to submit their projects through the autograder in Canvas; therefore, submission data should not be affected by the relatively small number of students who used Thonny.

When students submitted a solution to a programming problem, the system evaluated their work using an instructor-authored script written using the Pedal autograding framework [13]. This system generates feedback to learners and calculates a correctness grade (usually 0 or 1, although partial credit was possible on exams). The existing curriculum had a large quantity of autograded programming problems, some of which needed to be updated based on our changes.

Exams: There were two midterm exams and a final exam. These exams were all divided into two parts: 1) multiple-choice/true-false/matching/etc. questions, and 2) autograded programming questions. For the latter, students were given five to six programming problems that they could move freely between. These problems were automatically graded and given partial credit (20% for correctly specifying the header, and the remaining points allocated based on the percentage of passing instructor unit tests). Both parts were presented in Canvas through the systems students were already familiar with, but students were not allowed to use the internet or Google. Students took the exam at a proctored testing center and had two hours. They were only allowed to bring a single sheet of hand-written notes. Multiple versions of each exam question were created and drawn from a pool at random, so that no two students had the exact same exam.

Projects: There were six projects throughout the semester, although the first two were very small and heavily scaffolded. The final project was relatively open-ended and meant to be summative, but the middle three projects allowed more mixed forms of support. Although students were largely expected to produce their own code, they were encouraged to seek help as needed from the instructional staff. For the final project, students used the Python Arcade library (https://arcade.academy/) to create a game. Because students were not previously taught Arcade, two weeks were allocated for students to work collaboratively on extending sample games with new functionality. Then, they individually built one of 12 games.

4. INTERVENTION
In this section, we describe the specific intervention context in more detail. The curriculum and technology were used in the Fall 2019 semester at an R1 university in the eastern United States for a CS1 course that was required for Computer Science majors in their first semester. An IRB-approved research protocol was followed. At the beginning of the semester, students were asked to provide consent via a survey, with 103 students agreeing out of 136 (a 75.7% consent rate). A separate survey was also administered at the beginning of the semester to collect various demographic data (summarized in Table 1, only for consenting students) relating to gender, race, and prior coding experience.

                                  Percentage   Number
    Identifies as Woman                  19%       20
    Black Student                         6%        6
    No Prior Coding Experience           37%       38
    Total number of students            100%      103

    Table 1: Demographic Data for Intervention
Instructional Staff: The course was taught by a single instructor. He managed a team of 12 undergraduate teaching assistants. These TAs varied from CS sophomores to seniors, and not all of them had taken the curriculum before. However, they were all selected by the instructor for both their knowledge and their amiability. All members of the instructional staff hosted office hours. The TAs were also responsible for grading certain aspects of the projects (e.g., test quality, documentation quality, code quality), although this amounted to relatively little of the students' final course grade. The instructor met with these TAs every other week for an hour to discuss the state of the course and provide training on pedagogy, inclusivity, etc.

Structure: The lecture met Monday-Wednesday-Friday for 50 minutes across three separate sections. The sections were led by the same instructor, but were taught at different times of day (mid-morning, noon, and afternoon). The instructor did not attempt to provide the exact same experience to all three sections; if a mistake was made in the morning section, they attempted to avoid that mistake later. Typically, the first lecture session of a module started with 15-30 minutes of review of the material guided by clickers, and then students spent the rest of the module's class time working on assignments. There were several special in-class assignments such as worksheets, coding challenges, and readings. The lab met on Thursdays for 1.5 hours. Students worked on open assignments with the support of two TAs, who would actively walk around and answer questions.

5. RESULTS AND ANALYSIS
Our ultimate goal is to evaluate the course and identify aspects that were successful and unsuccessful. First, we consider basic final course outcomes. Then, we use the programming log data to analyze students' behavioral outcomes from the semester. We dive deeper into this data to characterize the feedback that was delivered to students over the semester. We look at fine-grained data from both parts of the final exam to develop a list of problematic subskills, and then review more of the programming log data in light of these results. We particularly focus our efforts on subskills related to defining functions, to tighten our analysis.

The instructor's naive perception of the course was that things were largely successful, except for the final project. Insufficient time was given to the students to learn the game development API, and instructor expectations were a bit high (which was adjusted for in the grading, but may have caused students undue stress). However, the material prior to the final project went smoothly. Office hours were rarely overfilled, with the exception of week 4 (the module introducing Functions), which had one lesson too many; this was resolved by making the last programming assignment optional (Programming 25: Functional Decomposition).

5.1 Basic Course Outcomes
As a starting point, we consider basic course-level outcomes, the kind that could be determined even without the extra instrumentation. These include the overall course grades, the major grade categories, and the university-administered course evaluations. The rate of failing grades and course withdrawals (DFW rate) was 14.5%, considered acceptable by the instructor.

Figure 1: Exam and Final Project Grade Distributions (histograms of Midterm 1, Midterm 2, Final Exam, and Final Project scores, 0-100)

Figure 1 gives histograms for Midterm 1 and 2, Final Exam, and Final Project scores. There was considerably more variance in the final project scores than in the exams, possibly due to the issues outlined before. The fact that many students failed to produce a final project may be evidence that the assignment had unreasonable expectations.

A Kruskal-Wallis test was used to analyze final exam scores by demographics. There were no significant differences for gender, but there was a large difference for Black students (H(1)=6.39, p=.01) and a smaller difference for prior programming experience (H(1)=5.51, p=.02). The students without prior experience scored about 12% lower on average, while the Black students scored about 41% lower. Given the concerning spread here, we review this data with more context in the next section before drawing any conclusions.
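A minimal sketch of this demographic comparison, assuming the consenting students' grades and survey responses have been joined into a single (hypothetical) CSV with columns final_exam, gender, race, and prior_experience:

import pandas as pd
from scipy.stats import kruskal

# Hypothetical export joining exam grades with the demographic survey.
students = pd.read_csv("consenting_students.csv")

def compare_groups(df, outcome, group_col):
    """Kruskal-Wallis H-test of an outcome across the values of one grouping column."""
    groups = [scores.dropna().values for _, scores in df.groupby(group_col)[outcome]]
    return kruskal(*groups)

for demographic in ["gender", "race", "prior_experience"]:
    h, p = compare_groups(students, "final_exam", demographic)
    print(f"{demographic}: H={h:.2f}, p={p:.3f}")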
The university-run course evaluations from students yielded positive but simplistic results. Both the course and the instructor were separately rated on a 5-point Likert scale (Poor ... Excellent). Both the course (Mdn=5, M=4.62, SD=0.77) and the instructor (Mdn=5, M=4.70, SD=0.67) achieved very high results, but ultimately this tells us little about the students' experience. Course evaluation data is known to contain bias and provide limited information [7, 25]; these results must be taken in context with other sources of data. Note that because the course evaluations are anonymous, they cannot be cross-referenced with other data. A review of the students' free response answers reveals that many were unhappy with the Final Project. In fact, the word "Arcade" appears in 41 of the 86 text responses, often as the student's only comment. Although this helps us see a major point of failure in our curriculum, it highlights the need for alternative evaluation mechanisms. Relying solely on students' final perceptions leaves us vulnerable to student biases.

5.2 Time Spent Programming
The keystroke-level log data allows us to determine a number of interesting metrics beyond what is available from our grading spreadsheet. As a simple starting point, using the timestamps of the programming logs, we can get a measure of how early students started working on assignments and of the total time they spent.
Earliness was measured by taking each submission event across the entire course, finding the difference between this and the relevant assignment's deadline, and averaging those durations together within each student. Hours Spent was measured by grouping all the events in the logs by student, finding the difference with the next adjacent event (clipping to a maximum of 30 seconds, to account for breaks), and summing these durations.

Figure 2: Comparison of Earliness, Time Spent, and Final Exam Score: (a) Earliness vs. Final Exam Score, (b) Hours Spent vs. Final Exam Score, (c) Earliness vs. Hours Spent

Figures 2a, 2b, and 2c show marginal plots of earliness, hours spent, and final exam grade. Spearman's Rho was used to calculate the correlation between each pair of outcomes. Consistent with Kazerouni [18], earliness (a measure of procrastination) had a significant medium correlation with exam scores (rs = .49, p < .001), while time spent was only modestly correlated (rs = −.32, p = .001). Interestingly, there was no significant correlation between students' time spent and their procrastination (rs = −0.09, p = .36).
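A minimal sketch of how these metrics and correlations can be computed with pandas, assuming a ProgSnap2-style main event table and a separate deadline lookup; the column and file names here are assumptions, not our published pipeline:

import pandas as pd
from scipy.stats import spearmanr

events = pd.read_csv("MainTable.csv", parse_dates=["ClientTimestamp"])
deadlines = pd.read_csv("deadlines.csv", parse_dates=["Deadline"])  # AssignmentID, Deadline

# Earliness: average gap (in hours) between each submission and its assignment's deadline.
submits = events[events["EventType"] == "Submit"].merge(deadlines, on="AssignmentID")
submits["gap_hours"] = (submits["Deadline"] - submits["ClientTimestamp"]).dt.total_seconds() / 3600
earliness = submits.groupby("SubjectID")["gap_hours"].mean()

# Hours Spent: sum the gaps between adjacent events, clipping each gap to 30 seconds
# so that breaks do not inflate the estimate (as described above).
events = events.sort_values(["SubjectID", "ClientTimestamp"])
gaps = events.groupby("SubjectID")["ClientTimestamp"].diff().dt.total_seconds()
hours_spent = gaps.clip(upper=30).groupby(events["SubjectID"]).sum() / 3600

behavior = pd.DataFrame({"earliness": earliness, "hours_spent": hours_spent}).dropna()
# The behavioral metrics can then be joined with exam grades and correlated, e.g.:
rho, p = spearmanr(behavior["earliness"], behavior["hours_spent"])
print(f"earliness vs. hours spent: rs={rho:.2f}, p={p:.3f}")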
Figure 3: Hours Spent by Demographics: (a) Hours Spent by Gender, (b) Hours Spent by Prior Experience

Analyzing behavioral outcomes by demographics indicated no differences, with the exception of total hours spent between women vs. men (H(1)=9.77, p=0.002) and between students with vs. without prior experience (H(1)=7.28, p=0.007). This comparison is visualized in Figures 3a and 3b. Women and students with no prior experience spent, on average, about 8 and 5 hours more than their counterparts. Importantly, this means that there was no significant difference between subgroups in how early students started.

Given the difference in final exam scores, Black students appear poorly served by the current curriculum. On average, these students spent as much time as their peers on assignments, but their final exam scores were lower than those of students outside of this category. Given the evidence for the continued education debt owed to non-White students (Ladson-Billings, 2006) [21], more work is needed to identify both potentially problematic structural elements of the course and how the course can better draw on student strengths to produce more equitable outcomes.

Figure 4 visualizes the total time spent by students per week on the programming problems. The data collected raises an interesting question: how many hours should we ideally expect students to spend on our courses? At our institution, the guidance from the administration (https://tinyurl.com/csedm2020-udel-credit-policy) is that in a three-credit course like this one, students should spend 45 hours in class and 90 hours outside of class over the course of the 15-week semester. The median time spent in our course by a given student on all the programming assignments was 19 hours, while the highest time spent by any individual student was just over 42 hours. This does not take into account time spent outside the coding environment (e.g., working on projects in Thonny), working on quizzes, or reading/watching the lesson presentations. However, some students did complete their projects in the online environment, and we expect most of those activities to take considerably less time than the programming activities. This may suggest that we are not asking our students to dedicate as much time as we might.
Figure 4: Time Spent per Week of Semester (hours spent per week, weeks 1-16)

Looking at specific time periods within the data, we see that students spent less time programming in the first weeks of the course, around the midpoint, and near the end of the programming problems (the last few weeks took place outside of the online coding environment). Especially for the earlier material, it is likely that the pace can be accelerated.

5.3 Error Classification
Table 2 gives the distribution of the different feedback messages that students received on programming problems, as a percentage of all feedback events received. Our numbers vary from those reported by Smith and Rixner (2019) [29], possibly because of our very different approach to feedback and the affordances of our programming environment.

    Category       Subcategory                 Percentage
    Instructor                                      37.8%
                   Problem Specific                 32.1%
                   Not Enough Student Tests          1.0%
                   Not Printing Answer                .8%
    Analyzer                                        22.1%
                   Initialization Problem            6.9%
                   Unused Variable                   5.9%
                   Multiple Return Types             2.9%
                   Incompatible Types                1.4%
                   Parameter Type Mismatch           1.0%
                   Overwritten Variable               .6%
                   Read out of scope                  .5%
    Correct                                         17.8%
    Syntax                                          11.4%
                   No Source Code                     .3%
    Runtime                                          7.8%
                   TypeError                         3.7%
                   NameError                         1.0%
                   AttributeError                     .8%
                   ValueError                         .4%
                   KeyError                           .4%
                   IndexError                         .4%
    Student        Student Tests Failing             2.9%
    Instructions                                     2.4%
    System Error                                      .8%

    Table 2: Frequency of Error Messages by Category

The most notable departures are the Analyzer and Problem-specific Instructor feedback categories. Our autograding system is capable of overriding error messages. In particular, one of its key features is a type inferencer and flow analyzer that automatically provides more readable and targeted error messages. The subcategories give examples of the kinds of errors produced: Initialization Problem (using a variable that was not previously defined) frequently supersedes the classic NameError, for example. Meanwhile, some issues have no corresponding runtime error, such as Unused Variable (never reading a variable that was previously written to). The Analyzer gives more than a fifth of all feedback delivered to students, suggesting its role is significant. Further work is needed to evaluate the quality of this feedback and its impact on students' learning.

The Problem-specific Instructor feedback category is opaque. Given that this represents almost a third of the feedback, it is unhelpful that the category cannot be easily broken down further. Sampling the logs' text, we see examples like students failing instructor unit tests, a reminder to call a function just once, and a suggestion to avoid a specific subscript index. Although the autograder is a powerful mechanism for delivering contextualized help to students, the lack of organization severely limits the automated analysis possible. As part of our process in the future, we intend to annotate feedback in our autograding scripts with identifiers.

Figure 5a gives the ratio of correct submission events in the log data over each week of the semester. Early on, students complete problems with fewer attempts. This might explain the steady growth in the ratios of different kinds of feedback over time, as evidenced by Figures 5b and 5c. It is interesting to observe that the runtime error frequency grows almost linearly over the course of the semester, with the exception of week 7 (a peak week for the Analyzer feedback). We hypothesize this is the result of some more carefully-refined instructor feedback available during that week.

Figure 5: Ratio of Feedback Types by Week of Semester: (a) Ratio of Correct Submissions, (b) Ratio of Syntax Errors, (c) Ratio of Runtime Errors
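The weekly breakdown behind Figure 5 can be sketched as follows, assuming each submission event has already been labeled with one of the feedback categories from Table 2; the DataFrame, column names, and start date below are hypothetical:

import pandas as pd

submissions = pd.read_csv("submission_events.csv", parse_dates=["ClientTimestamp"])

semester_start = pd.Timestamp("2019-08-26")  # placeholder first day of classes
submissions["week"] = ((submissions["ClientTimestamp"] - semester_start).dt.days // 7) + 1

# Ratio of each feedback category within each week of the semester.
weekly = (submissions
          .groupby("week")["FeedbackCategory"]
          .value_counts(normalize=True)
          .rename("ratio")
          .reset_index())

# For example, the per-week ratio of correct submissions (Figure 5a):
print(weekly[weekly["FeedbackCategory"] == "Correct"])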
5.4 Final Exam Conceptual Questions
The first part of the final exam was composed of conceptual questions on topics across the curriculum, drawn largely from the quiz questions students had already seen. In this section, we review the quiz report to determine the topics that students struggled with. Most students performed relatively well across the questions, so we focus on the questions that students most often answered incorrectly.

Students largely had no issues with questions involving evaluating expressions. A small exception to this is students' struggle with Equality vs. Order of different types. In Python, as in many languages, it is not an error to check whether two things of different types are equal (although that comparison will always produce false); however, it is an error to compare their order (with the less-than/greater-than operators). In fact, 58% of students got this specific question wrong.

There were three questions related to tracing complex control flow for loops, if statements, and functions. Tracing seemed to pose difficulties, with between 25% and 40% of the students getting these questions wrong. We believe that more emphasis should be placed on tracing in the curriculum; there are quiz questions and a worksheet dedicated to the topic, but there are opportunities to expand this material. Tracing has been a recent area of focus, with promising approaches by Xie et al. [31] and Cunningham et al. [9].
Dictionaries also posed significant trouble for students. Dictionaries come up later in the course, represent more complex real-world data, and have syntactic operations that are easily conflated with those of lists. In fact, this last point is evidenced by the data: in a question comparing the relative speed of traversing lists and dictionaries, 50% of the students got one variant of a True/False question incorrect (so they might as well have been guessing).

Again, the point of our analysis is not necessarily to develop a validated examination instrument or to distill an authoritative set of misconceptions. Instead, we seek to demonstrate the insight we have garnered from reviewing our exam. With these simple percentages, we have found targets.

5.5 Deeper Dive on Functions
In looking over the second part of the final exam questions, we are faced with a tremendous number of concepts integrated into each problem. In fact, with over 261 learning objectives in the course, analyzing the entire set is an overwhelming prospect. To scope our analysis for this paper, we decided to focus on a subset of skills related to Functions that we felt we could clearly identify with computational analysis and that the instructor felt, a priori, they had seen students struggle with over the course of the semester. Table 3 gives the percentages and quantities of students who successfully demonstrated each subskill on each exam.

Header Definition: Even though we had not observed many students struggling with syntax during the semester, we felt it critical to analyze the incidence of submitted code that had malformed headers. Although the numbers were a little higher than expected, we are not terribly concerned: reviewing the submissions, many seemed like simple typos (e.g., a missing colon) that were relatively easily fixed.

Provided Types: Students were not required or encouraged to provide types in their headers during the exam. In fact, since the advanced feedback features were turned off, their feedback would not actually reference any parameter or return types they specified (as long as the code was syntactically correct). We did not assess the correctness of their provided types, merely their existence. In the final exam, the number of students who annotated their parameter types falls off sharply after the first three questions (moving from about 50% down to 20%). We offer two explanations: first, the fourth question is one of the most difficult in the entire course, so students may have been distracted by its difficulty. Second, the last questions all involve more complicated nested data types (e.g., lists of dictionaries) that were too troublesome for the students to specify.

Parameter Overwriting: This misconception is one that the instructors were very concerned with, having observed it repeatedly among certain students early in the semester (and having been concerned with its persistence). Applying the parameter overwriting pattern to the rest of the submissions over the entire semester, we found that the behavior trails off over the course of the semester. By the final exam, almost no students were making this particular mistake. Although the instructor believes that more can be done up front to avoid this critical misconception, it is comforting that the existing curriculum seems to largely address it by the end.

Return/Print: We observed that some students struggle to differentiate between the concepts of return statements and print calls. However, students were largely successful with this subskill, despite a quarter of students getting a related (more abstractly rendered) version of this subskill wrong on part 1 of the final exam. It seems that although these concepts are troublesome for a small clutch of students, most are able to eventually separate them in their code.

Parameters/Input: Similar to students' issues with returning vs. printing, some students were observed in individual sessions mixing up parameters and the input function (which was presented as a very distinctive way that data could enter a function). However, it appears that this was truly isolated to just a few students.
             Subskill      Description                                                1st Exam       2nd Exam       Final Exam
    Header Definition      Defined the function header with correct syntax           83.5% (86)      84.5% (87)      91.3% (94)
      Provided Types       Provided types for all parameters and the return          40.8% (42)      45.6% (47)      37.9% (39)
 Parameter Overwrite       Did not assign literal values to parameters in the body   88.3% (91)     98.1% (101)     99.0% (102)
        Return/Print       Did not print without returning                           80.6% (83)      89.3% (92)      91.3% (94)
    Parameters/Input       Did not use the input function instead of parameters      96.1% (99)    100.0% (103)     99.0% (102)
         Unit Testing      Wrote unit tests                                          88.3% (91)      79.6% (82)      67.0% (69)
       Decomposition       Separated work into a helper function                       1.0% (1)      17.5% (18)      19.4% (20)

                             Table 3: Percentage of Students Demonstrating Subskill across Exams
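To illustrate how subskills like these can be identified computationally, the following sketch implements two of the checks (parameter overwriting and decomposition into a helper function) with Python's ast module; these are simplified stand-ins for, not copies of, the scripts used in our analysis.

import ast

def overwrites_parameter(source: str) -> bool:
    """True if any function assigns a literal value to one of its own parameters."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # malformed headers are handled by a separate check
    for func in ast.walk(tree):
        if isinstance(func, ast.FunctionDef):
            params = {arg.arg for arg in func.args.args}
            for node in ast.walk(func):
                if isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
                    for target in node.targets:
                        if isinstance(target, ast.Name) and target.id in params:
                            return True
    return False

def uses_helper_function(source: str) -> bool:
    """True if a student-defined function calls another student-defined function."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    defined = {f.name for f in ast.walk(tree) if isinstance(f, ast.FunctionDef)}
    for func in ast.walk(tree):
        if isinstance(func, ast.FunctionDef):
            for node in ast.walk(func):
                if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                        and node.func.id in defined and node.func.id != func.name):
                    return True
    return False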


8 different looping patterns (e.g., accumulating, mapping, filtering). A number
of assignments required students to decompose problems. Therefore, it is
somewhat disappointing that so few students chose to leverage decomposition
(particularly since the harder final exam problems were naturally amenable to a
decomposition approach). In addition to the midterm 2 and final exam questions,
we also took a closer look at an earlier open-ended programming problem that
was particularly complex and well-suited to decomposition. Across these
problems, students who decomposed their solutions tended to be more successful
than those who wrote monolithic code (Table 4). Although not conclusive, this
pattern supports the hypothesis that decomposition may be an effective
strategy.

                            Decomposed        Monolithic
                            Pass    Fail      Pass    Fail
     Earlier Problem          37       8        29      27
     Midterm 2 Question 5     13       5        40      43
     Final Exam Question 4     7       8        42      44
     Total                    57      21       111     114
                           18.8%    6.9%     36.6%   37.6%

         Table 4: Student Use of Decomposition over Time
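To illustrate the distinction tallied in Table 4, the sketch below contrasts a monolithic solution with a decomposed one. The problem and code are invented for exposition rather than taken from the exams; we count a solution as "decomposed" when the work is split across a helper function in this way.

    # Monolithic: one function performs both the filtering and the averaging.
    def average_passing_monolithic(scores: list) -> float:
        total = 0
        count = 0
        for score in scores:
            if score >= 60:
                total = total + score
                count = count + 1
        if count == 0:
            return 0.0
        return total / count

    # Decomposed: a helper isolates the filtering step, which can then be
    # reasoned about (and tested) on its own.
    def passing_scores(scores: list) -> list:
        return [score for score in scores if score >= 60]

    def average_passing(scores: list) -> float:
        passing = passing_scores(scores)
        if not passing:
            return 0.0
        return sum(passing) / len(passing)

    assert average_passing([50, 70, 90]) == 80.0
    assert average_passing_monolithic([50, 70, 90]) == 80.0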
Unit Testing: Given that students were not required to unit test their code on
the final exam, we were pleased to find that many students wrote unit tests
anyway. Interestingly, though, the percentage of students who used this
strategy decreased over the course of the semester, even as the programming
problems became more difficult. We hypothesize that because the later exam
problems involve complex nested data, students either did not feel comfortable
generating test data or felt that doing so would not be an efficient use of
their time. We believe we need to sell the concept more convincingly: rather
than viewing test cases as a detriment to their success, students should see
tests as one of the most direct paths to completion.
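To illustrate that test data for the nested-data problems need not be elaborate, consider the following hypothetical example (the function and records are our own, not an exam question): a single small literal structure is enough to pin down the expected behavior.

    # A hypothetical nested-data function of the kind seen late in the course.
    def count_completed(course: dict) -> int:
        """Count how many assignments in a course record are marked complete."""
        completed = 0
        for assignment in course["assignments"]:
            if assignment["complete"]:
                completed = completed + 1
        return completed

    # One small, hand-built literal serves as a unit test.
    tiny_course = {
        "title": "CS1",
        "assignments": [
            {"name": "Loops Lab", "complete": True},
            {"name": "Functions Lab", "complete": False},
            {"name": "Final Project", "complete": True},
        ],
    }
    assert count_completed(tiny_course) == 2
    assert count_completed({"title": "Empty", "assignments": []}) == 0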
6. DISCUSSION
Reviewing our findings, we made several decisions about places to modify our
curriculum. The log data suggests that some of the earlier material can be
accelerated, so that more time can ultimately be allocated to week 4 (the
critical material covering functions). We also believe we need to spend more
time throughout the semester convincing students that subskills like
decomposition and unit testing can help them solve challenging questions,
although follow-up analyses will be needed to confirm this theory. Finally, we
must find new ways to support some of our demographic subgroups, given that
outcomes for those groups are not yet equitable.

Better structure in our existing data sources might also help future analyses.
For example, although each quiz question was labeled with a unique identifier,
we realized during analysis that we really needed every quiz answer (and in
some cases, sets of answers) to have a unique identifier as well. In
particular, some questions had multiple parts, or different answers yielded
information about different misconceptions. In a similar vein, annotating
instructor feedback for the programming problems would have substantially
increased the differentiation of our feedback messages.
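As a sketch of the kind of structure we have in mind, a record like the following would give every answer its own identifier and tie it to a misconception. The field names and values here are hypothetical illustrations, not our production schema.

    # Hypothetical record layout for a single quiz question.
    question_record = {
        "question_id": "quiz04_q02",
        "objective": "functions.return_vs_print",
        "answers": [
            {"answer_id": "quiz04_q02_a", "correct": True,  "misconception": None},
            {"answer_id": "quiz04_q02_b", "correct": False,
             "misconception": "prints_instead_of_returning"},
            {"answer_id": "quiz04_q02_c", "correct": False,
             "misconception": "confuses_parameters_with_input"},
        ],
    }

    def answers_with_misconception(record: dict, misconception: str) -> list:
        # With per-answer identifiers, misconception-level analyses become
        # simple lookups rather than manual recoding.
        return [answer["answer_id"]
                for answer in record["answers"]
                if answer["misconception"] == misconception]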
More metadata about each identifier would also help efforts to cross-reference
and cluster related problems (especially over time). This is a non-trivial
effort, given the quantity of course materials in the curriculum. As a starting
point, we believe this effort should be focused on the major learning
objectives and topics (e.g., functions) that the formative evaluation conducted
here marked as particularly worthy of attention.

We expect that, before the next iteration of our analyses, we will need to
develop more hypotheses up front for guidance. A considerable amount of time
was spent on exploratory analyses, trying different approaches and seeing what
emerged from the data. Although this was helpful as we oriented ourselves, the
data dredging that can result may yield false conclusions that are not actually
worth investing in. Finally, while we attempted to follow a replicable process
in our data collection and analysis, we believe more should be done to
streamline and package our data pipeline to encourage replication and
reproduction.

7. CONCLUSION
In this paper, we have described our evaluation of data from a
heavily-instrumented CS1 course. Our goal was less about judging the course
overall and more about finding specific areas of improvement and success. We
feel that course evaluation is less about an end goal and more about small
iterative improvements that accumulate over time. To structure our approach, we
followed a loose Design-Based Research model supported by educational data
mining. In our experience, the high volume and variety of data sources can be
very helpful in understanding the successes and failures of a course, although
it also poses difficulties for analysis. As always, a better pipeline could
help make sense of these data and results more quickly, possibly even during
the semester. In the immediate term, however, our data analysis contributes to
the community's knowledge of students and, ideally, provides a model for others
to follow. In general, we hope to encourage increased rigor in course
evaluation as we integrate data-rich tools into our courses.
8. REFERENCES
[1] A. Annamaa. Introducing Thonny, a Python IDE for learning programming. In Proceedings of the 15th Koli Calling Conference on Computing Education Research, pages 117–121, 2015.
[2] R. S. Baker and P. S. Inventado. Educational data mining and learning analytics. In Learning Analytics, pages 61–75. Springer, 2014.
[3] S. Barab and K. Squire. Design-based research: Putting a stake in the ground. The Journal of the Learning Sciences, 13(1):1–14, 2004.
[4] A. C. Bart, A. Sarver, M. Friend, and L. Cox II. Pythonsneks: an open-source, instructionally-designed introductory curriculum with action-design research. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education, 2019.
[5] A. C. Bart, J. Tibau, E. Tilevich, C. A. Shaffer, and D. Kafura. Blockpy: An open access data-science environment for introductory programmers. Computer, 50(5):18–26, 2017.
[6] K. Buffardi. Assessing individual contributions to software engineering projects with git logs and user stories. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education, pages 650–656, 2020.
[7] S. E. Carrell and J. E. West. Does professor quality matter? Evidence from random assignment of students to professors. Journal of Political Economy, 118(3):409–432, 2010.
[8] D.-B. R. Collective. Design-based research: An emerging paradigm for educational inquiry. Educational Researcher, 32(1):5–8, 2003.
[9] K. Cunningham, S. Blanchard, B. Ericson, and M. Guzdial. Using tracing and sketching to solve programming problems: replicating and extending an analysis of what students draw. In Proceedings of the 2017 ACM Conference on International Computing Education Research, pages 164–172, 2017.
[10] N. Diana, M. Eagle, J. Stamper, S. Grover, M. Bienkowski, and S. Basu. Measuring transfer of data-driven code features across tasks in Alice. 2018.
[11] T. Effenberger, J. Čechák, and R. Pelánek. Difficulty and complexity of introductory programming problems. 2019.
[12] K. Fisler. The recurring rainfall problem. In Proceedings of the Tenth Annual Conference on International Computing Education Research, pages 35–42, 2014.
[13] L. Gusukuma, A. C. Bart, and D. Kafura. Pedal: An infrastructure for automated feedback systems. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education, pages 1061–1067, 2020.
[14] L. Gusukuma, A. C. Bart, D. Kafura, J. Ernst, and K. Cennamo. Instructional design + knowledge components: A systematic method for refining instruction. In Proceedings of the 49th ACM Technical Symposium on Computer Science Education, pages 338–343, 2018.
[15] M. Guzdial. Exploring hypotheses about media computation. In Proceedings of the Ninth Annual International ACM Conference on International Computing Education Research, pages 19–26, 2013.
[16] P. Ihantola, A. Vihavainen, A. Ahadi, M. Butler, J. Börstler, S. H. Edwards, E. Isohanni, A. Korhonen, A. Petersen, K. Rivers, et al. Educational data mining and learning analytics in programming: Literature review and case studies. In Proceedings of the 2015 ITiCSE on Working Group Reports, pages 41–63, 2015.
[17] L. C. Kaczmarczyk, E. R. Petrick, J. P. East, and G. L. Herman. Identifying student misconceptions of programming. In Proceedings of the 41st ACM Technical Symposium on Computer Science Education, pages 107–111, 2010.
[18] A. M. Kazerouni, S. H. Edwards, and C. A. Shaffer. Quantifying incremental development practices and their relationship to procrastination. In Proceedings of the 2017 ACM Conference on International Computing Education Research, pages 191–199, 2017.
[19] H. Keuning, J. Jeuring, and B. Heeren. Towards a systematic review of automated feedback generation for programming exercises. In Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education, pages 41–46, 2016.
[20] E. Kurvinen, N. Hellgren, E. Kaila, M.-J. Laakso, and T. Salakoski. Programming misconceptions in an introductory level programming course exam. In Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education, pages 308–313, 2016.
[21] G. Ladson-Billings. From the achievement gap to the education debt: Understanding achievement in U.S. schools. Educational Researcher, 35(7):3–12, 2006.
[22] A. Luxton-Reilly. Learning to program is easy. In Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education, pages 284–289, 2016.
[23] P. Mandal and I.-H. Hsiao. Using differential mining to explore bite-size problem solving practices. In Educational Data Mining in Computer Science Education (CSEDM) Workshop, 2018.
[24] C. Matthies, R. Teusner, and G. Hesse. Beyond surveys: analyzing software development artifacts to assess teaching efforts. In 2018 IEEE Frontiers in Education Conference (FIE), pages 1–9. IEEE, 2018.
[25] K. M. Mitchell and J. Martin. Gender bias in student evaluations. PS: Political Science & Politics, 51(3):648–652, 2018.
[26] National Academies of Sciences, Engineering, and Medicine and others. Assessing and responding to the growth of computer science undergraduate enrollments. National Academies Press, 2018.
[27] G. L. Nelson and A. J. Ko. On use of theory in computing education research. In Proceedings of the 2018 ACM Conference on International Computing Education Research, pages 31–39, 2018.
[28] T. W. Price, D. Hovemeyer, K. Rivers, B. A. Becker, et al. Progsnap2: A flexible format for programming process data. In The 9th International Learning Analytics & Knowledge Conference, Tempe, Arizona, 4–8 March 2019.
[29] R. Smith and S. Rixner. The error landscape: Characterizing the mistakes of novice programmers. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education, pages 538–544, 2019.
[30] Y. Vance Paredes, D. Azcona, I.-H. Hsiao, and A. F. Smeaton. Predictive modelling of student reviewing behaviors in an introductory programming course. 2018.
[31] B. Xie, G. L. Nelson, and A. J. Ko. An explicit strategy to scaffold novice program tracing. In Proceedings of the 49th ACM Technical Symposium on Computer Science Education, pages 344–349, 2018.