Evaluating an Instrumented Python CS1 Course

Austin Cory Bart, Teomara Rutherford, James Skripchuk
University of Delaware, Newark, DE
acbart@udel.edu, teomara@udel.edu, jskrip@udel.edu

ABSTRACT
The CS1 course is a critical experience for most novice programmers, requiring significant time and effort to overcome the inherent challenges. Ever-increasing enrollments mean that instructors have less insight into their students and can provide less individualized instruction. Automated programming environments and grading systems are one mechanism to scale CS1 instruction, but these new technologies can sometimes make it difficult for the instructor to gain insight into their learners. However, learning analytics collected by these systems can be used to make up some of the difference. This paper describes the process of mining a heavily-instrumented CS1 course to leverage fine-grained evidence of student learning. The existing Python-based curriculum was already heavily integrated with a web-based programming environment that captured keystroke-level student coding snapshots, along with various other forms of automated analyses. A Design-Based Research approach was taken to collect, analyze, and evaluate the data, with the intent to derive meaningful conclusions about the student experience and develop evidence-based improvements for the course. In addition to modeling our process, we report on a number of results regarding the persistence of student mistakes, measurements of student learning and errors, the association between student learning and student effort and procrastination, and places where we might be able to accelerate our curriculum's pacing. We hope that these results, as well as our generalized approach, can guide larger community efforts around systematic course analysis and revision.

Categories and Subject Descriptors
Social and professional topics [Professional topics]: Computing education; Information systems [Information systems applications]: Data mining

Keywords
cs1, modeling, dbr, python, procrastination

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. INTRODUCTION
The first Computer Science course (CS1) can be a challenging experience for novices given the constraints of a semester [22], but success in CS1 is critical for computer science students, as it sets a foundation for subsequent classes. Large amounts of practice and feedback are critical to this experience, so that learners can overcome programming misconceptions [17, 20] and develop effective schema. Instructors have a key role in developing materials to support learners' productive struggle. Recently, however, scaling enrollments [26] and the move to remote/hybrid learning environments have shifted much of this work away from interacting with individual students towards interacting with systems (which in turn interact with the students directly). For example, programming autograders [19] remove the instructor from the grading process, automatically assessing and sometimes even providing feedback directly to the learner.

Although these systems scale the learning process, they can inhibit the evaluation and revision of course materials. Instructors do not have as many first-hand interactions with students or the artifacts that they produce. When homeworks and exams are no longer hand-graded, teachers may not be as directly motivated to review each submission. Similarly, when automated feedback systems are effectively supporting students, teachers will have fewer opportunities to get direct insight into what issues students are encountering. This knowledge of the students' experience is critical to gauge the effectiveness of the course materials.
Instructors need a new model to guide their revision decisions. We propose that instructors follow a Design-Based Research (DBR) approach [8, 3] to iteratively improve their courses. In particular, course development should be seen as an iterative and statistical Instructional Design process: each semester, a curriculum is built and presented to learners as an intervention, data is generated and collected as learners interact, that data is analyzed to discover shortcomings and successes of the intervention, and then modifications to the "protocol" are identified for the next iteration of the study. Instructional Design models provide a systematic framework for this development process, but the DBR approach augments this to emphasize the statistical and theory-driven nature of the evaluation process. Fortunately, the same autograding tools that scale practice and feedback opportunities for students can also be used to collect many kinds of learning analytics, permitting the use of educational data mining to garner insights into learning [2].

In this paper, we present our experience of evaluating a CS1 course that has been heavily instrumented to provide rich data on student actions. Our goal is not to prove that our curriculum was a "success" or "failure" as a whole, but to empirically judge specific pieces and identify components that should be modified or maintained. We draw upon programming snapshot data, non-programming autograded question logs, surveys, exam data, and human assessments to produce a diverse dataset. In addition to sharing our conclusions about the state of our course, we believe that we present a formative model for other instructors who wish to evaluate their courses systematically. In fact, our specific analyses are recorded in a Jupyter Notebook (https://github.com/acbart/csedm20-paper-cs1-analysis). Our hope is that others will use our own analyses as a baseline to develop their own questions, and to motivate others to approach their courses with a more systematic, empirical method.
2. THEORIES AND RELATED WORK
The central premise of our approach is inspired by Design-Based Research, which has been well established in the education literature for decades. Those interested in an introduction to DBR can refer to [8]. Briefly, there are several key tenets: 1) Development is an iterative process of design, intervention, collection, and analysis. 2) Educational interventions cannot be decontextualized from their setting. 3) Processes from all phases of development must be captured and provided sufficient context to ensure reproducibility and replication. 4) Developing learning experiences cannot be separated from developing theories about learning. 5) Results from an intervention must inform the next iteration and be communicated out to broader stakeholders.

Messy authenticity is inherent in this process, and naturally limits the theoretical extent of findings in a DBR process. Therefore, any conclusions derived should not be seen as broadly applicable, but only meaningful for the context in which they were developed. Although theories of learning are generated from DBR, this is less true for early iterations. True success for a course is a moving target. As the curriculum improves and students overcome misconceptions faster, more material can be added. Over time, the curriculum necessarily needs to be updated and assignments refreshed. Further, courses often need to be adapted for new audiences with different demographics and prior experiences. Given that the DBR model strongly incorporates context, these realities can be accounted for at some level.

DBR has been somewhat underused in Computing Education Research (CER). Recently, Nelson and Ko (2018) made a strong argument that CER should almost exclusively follow Design-Based Research methodologies [27], for three reasons: 1) to avoid splitting attention between advancing theory vs. design, 2) the field has not generated enough domain-specific theories, and 3) theory has sometimes been used to impede effective design-based research in the peer review process. Many of the recommendations made in the paper echo the tenets of DBR listed above and are consistent with our vision for communicating our course designs. In fact, their paper was a major guiding inspiration.

Another major inspiration for our approach is Guzdial's 2013 paper evaluating their decade-long Computational Thinking course ("MediaComp") [15]. Although over a longer time scale, this paper takes a scientific, cohesive look at their course using a DBR lens. They critically evaluate what worked and contextualize all their findings by their design. They begin with a set of hypotheses about what aspects of the course will be effective, and then systematically review data collected from the offerings to accept or reject those hypotheses. Their conclusions, while not transcendent, are impactful for anyone modeling themselves after their context.

In computing education, programming log data has been used to make various kinds of predictions and evaluations of student learning [16]. Applications include predicting student performance in subsequent courses [10], identifying learners who need additional support [30], modelling student strategies as they work on programming problems [23], and evaluating students over the course of a semester [6, 24]. These approaches tend to rely on vast datasets or seek to derive conclusions that are predictive, highly transferable, or are about individual students. Although such research work is valuable, our goal is distinctive. We recognize that each course offering has an important local context that cannot be factored out, and that collecting sufficient evidence over time inhibits the process of iterative course design. Rather than developing generalizable theories or predicting performance, we seek actionable data from a single semester that an instructor can use to evaluate and redesign their course.

Effenberger et al. [11] are perhaps an example more closely aligned with our own research goals. Rather than evaluating students, their work sought to evaluate four programming problems in a course. Their results suggest that despite commonalities in the tasks, the problems' characteristics were considerably different, underscoring the danger of treating questions as interchangeable in course evaluation.

The process of systematic course revision is similar to the ID+KC model by Gusukuma (2018), which combines formal Instructional Design methodology with a cognitive student model based on Knowledge Components [14]. Instead of focusing on a student model, however, we focus on components of the instruction, such as the learning objectives. Still, the systematic process of data collection and analysis to inform revision is common between our methods.

3. CURRICULUM AND TECHNOLOGY
In this section, we describe the course's curriculum and technology. DBR necessitates a clear enough description of the curriculum to understand the evaluation conducted, so we cannot avoid low-level details; the context matters. We have attempted to separate out, however, the specific experiential details of our intervention (i.e., the course offering), which are described in Section 4.
As a starting point, we based our course on the PythonSneks curriculum (https://acbart.github.io/python-sneks/). This curriculum has students move through a large sequence of almost 50 lessons over the course of a semester, with each lesson focused on a particular introductory programming topic. Each lesson is composed of a set of learning objectives, the lesson presentation, a mastery-based quiz, and a set of programming problems. We have made a number of modifications to the materials reported in [4], such as the introduction of static typing and increasing the emphasis on functional design to better suit CS1 for Computer Science majors. A full listing of all the learning objectives covered is available (https://tinyurl.com/csedm2020-sneks-los).

Learning Management System: The course was delivered through Canvas, which was our university's Learning Management System. All material, including quizzes, programming assignments, and exams, was directly available in Canvas (either natively or through LTI).

Lesson Presentation: The lessons were PowerPoint slides with a recorded voice-over, embedded as a YouTube video directly into a Canvas Page. The content of these slides is transcribed directly below the video, including any code with proper syntax highlighting. Finally, PDF versions of all the slides with their transcriptions are also available.

Mastery Quizzes: After the presentations, students are presented with a Canvas Quiz containing a series of True/False, Matching, Multiple Choice, and Fill-in-the-Blank questions. This assignment is presented in a mastery style, where learners can make repeated attempts until they earn a satisfactory grade. Each of the 200+ questions is annotated with a specific identifier. These quizzes are 10% of students' grade.

Although Canvas provides an interface to visualize statistics about individual quiz questions, this is obfuscated by the students' multiple attempts: only the final grade is shown, so instructors cannot see how difficult a question was for a student. To provide greater detail in an instructor-friendly report, the Canvas API was used to pull all submission attempts for each student. The scripts used in analysis and an example of the instructor report are publicly available (https://github.com/acbart/canvas-grading-reports).
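As an illustration of this kind of report generation, the sketch below pages through the Canvas quiz submissions endpoint and prints each student's attempt count and score. It is a minimal sketch rather than our actual reporting scripts (which are linked above); the institution URL, the token handling, the course and quiz IDs, and the specific fields kept are assumptions.

    # Minimal sketch: pull quiz submission records from the Canvas REST API.
    # The base URL, token, IDs, and fields kept are assumptions for illustration.
    import requests

    CANVAS_URL = "https://canvas.example.edu/api/v1"   # hypothetical institution URL
    TOKEN = "..."                                      # an instructor-generated API token
    HEADERS = {"Authorization": f"Bearer {TOKEN}"}

    def get_paginated(url, params=None):
        """Follow Canvas's Link-header pagination and yield each JSON page."""
        while url:
            response = requests.get(url, headers=HEADERS, params=params)
            response.raise_for_status()
            yield response.json()
            url = response.links.get("next", {}).get("url")
            params = None  # subsequent page URLs already carry their parameters

    def quiz_submissions(course_id, quiz_id):
        """Collect every student's submission record for one quiz."""
        url = f"{CANVAS_URL}/courses/{course_id}/quizzes/{quiz_id}/submissions"
        records = []
        for page in get_paginated(url, params={"per_page": 100}):
            records.extend(page.get("quiz_submissions", []))
        return records

    # Example usage with hypothetical course and quiz IDs.
    for sub in quiz_submissions(course_id=12345, quiz_id=67890):
        print(sub.get("user_id"), sub.get("attempt"), sub.get("score"))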
Programming Problems: Additionally, most lessons contain two to eight programming problems delivered through a web-based Python coding environment [5]. These problems were also presented in a mastery style, allowing learners to spend as much time as they want until the deadline. These problems are 15% of students' final course grade. The environment has a dual block/text interface, although students were discouraged from using the block interface past the first two weeks of programming activities. The environment naturally records all student interactions in the ProgSnap2 format [28], making it readily accessible for our evaluation.

Students were also required to install (and eventually use) a desktop Python programming environment, Thonny [1]. Students largely used Thonny for their programming projects, particularly the final project, although a small number chose to use the environment to write code for other assignments. The Thonny environment was not instrumented to collect log data, but students were required to submit their projects through the autograder in Canvas; therefore, submission data should not be affected by the relatively small number of students who used Thonny.

When students submitted a solution to a programming problem, the system evaluated their work using an instructor-authored script written using the Pedal autograding framework [13]. This system generates feedback to learners and calculates a correctness grade (usually 0 or 1, although partial credit was possible on exams). The existing curriculum had a large quantity of autograded programming problems, some of which needed to be updated based on our changes.
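To make the shape of this feedback concrete, the following sketch shows the kind of check such a script performs: call the student's submitted function on instructor-chosen cases, produce one short message, and record a 0-or-1 correctness grade. This is a generic illustration under our own assumptions (a hypothetical add(a, b) problem and a pre-executed student namespace), not Pedal's actual API; the real checks are written with Pedal's feedback functions [13].

    # Generic illustration (not Pedal's API) of an instructor-authored check:
    # run the student's function on instructor cases, return (grade, message).
    def grade_submission(student_namespace):
        """student_namespace: dict of names defined by executing the student's code."""
        cases = [((3, 4), 7), ((0, 0), 0), ((-2, 5), 3)]   # hypothetical tests for add(a, b)
        add = student_namespace.get("add")
        if not callable(add):
            return 0, "You need to define a function named add."
        for args, expected in cases:
            try:
                actual = add(*args)
            except Exception as error:
                return 0, f"Calling add{args} raised an error: {error!r}"
            if actual != expected:
                return 0, f"add{args} returned {actual!r}, but {expected!r} was expected."
        return 1, "Great work!"

    # Example: execute a (hypothetical) submission and grade it.
    submission = "def add(a, b):\n    return a + b"
    namespace = {}
    exec(submission, namespace)
    print(grade_submission(namespace))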
Exams: There were two midterm exams and a final exam. These exams were all divided into two parts: 1) multiple-choice/true-false/matching/etc. questions, and 2) autograded programming questions. For the latter, students were given five or six programming problems that they could move freely between. These problems were automatically graded and given partial credit (20% for correctly specifying the header, and the remaining points allocated based on the percentage of passing instructor unit tests). Both parts were presented in Canvas through the systems students were already familiar with, but students were not allowed to use the internet or to Google. Students took the exam at a proctored testing center and had two hours. They were only allowed to bring a single sheet of hand-written notes. Multiple versions of each exam question were created and drawn from a pool at random, so that no two students had the exact same exam.

Projects: There were six projects throughout the semester, although the first two were very small and heavily scaffolded. The final project was relatively open-ended and meant to be summative, while the middle three projects allowed more mixed forms of support. Although students were largely expected to produce their own code, they were encouraged to seek help as needed from the instructional staff. For the final project, students used the Python Arcade library (https://arcade.academy/) to create a game. Because students were not previously taught Arcade, two weeks were allocated for students to work collaboratively on extending sample games with new functionality. Then, they individually built one of 12 games.

4. INTERVENTION
In this section, we describe the specific intervention context in more detail. The curriculum and technology were used in the Fall 2019 semester at an R1 university in the eastern United States for a CS1 course that was required for Computer Science majors in their first semester. An IRB-approved research protocol was followed. At the beginning of the semester, students were asked to provide consent via a survey, with 103 students agreeing out of 136 (for a 75.7% consent rate). A separate survey was also administered at the beginning of the semester to collect various demographic data (summarized in Table 1, only for consenting students) relating to gender, race, and prior coding experience.

                                  Percentage   Number
  Identifies as Woman                 19%        20
  Black Student                        6%         6
  No Prior Coding Experience          37%        38
  Total number of students           100%       103

Table 1: Demographic Data for Intervention

Instructional Staff: The course was taught by a single instructor. He managed a team of 12 undergraduate teaching assistants. These TAs varied from CS sophomores to seniors, and not all of them had taken the curriculum before. However, they were all selected by the instructor for both their knowledge and amiability. All members of the instructional staff hosted office hours. The TAs were also responsible for grading certain aspects of the projects (e.g., test quality, documentation quality, code quality), although this amounted to relatively little of the students' final course grade. The instructor met with these TAs every other week for an hour to discuss the state of the course and provide training on pedagogy, inclusivity, etc.

Structure: The lecture met Monday-Wednesday-Friday for 50 minutes across three separate sections. The sections were led by the same instructor, but were taught at different times of day (mid-morning, noon, and afternoon). The instructor did not attempt to provide the exact same experience to all three sections; if a mistake was made in the morning section, they attempted to avoid that mistake later. Typically, the first lecture session of a module started with 15-30 minutes of review of the material guided by clickers, and then students spent the rest of the module's class time working on assignments. There were several special in-class assignments such as worksheets, coding challenges, and readings. The lab met on Thursdays for 1.5 hours. Students worked on open assignments with the support of two TAs, who would actively walk around and answer questions.

5. RESULTS AND ANALYSIS
Our ultimate goal is to evaluate the course and identify aspects that were successful and unsuccessful. First, we consider basic final course outcomes. Then, we use the programming log data to analyze students' behavioral outcomes from the semester. We dive deeper into this data to characterize the feedback that was delivered to students over the semester. We look at fine-grained data from both parts of the final exam to develop a list of problematic subskills, and then review more of the programming log data in light of these results. We particularly focus our efforts on subskills related to defining functions, to tighten our analysis.

The instructor's naive perception of the course was that things were largely successful, except for the final project. Insufficient time was given to the students to learn the game development API, and instructor expectations were a bit high (which was adjusted for in the grading, but may have caused students undue stress). However, the material prior to the final project went smoothly. Office hours were rarely overfilled, with the exception of week 4 (the module introducing Functions), which had one lesson too many; this was resolved by making the last programming assignment optional (Programming 25: Functional Decomposition).

5.1 Basic Course Outcomes
As a starting point, we consider basic course-level outcomes, the kind that could be determined even without the extra instrumentation. These include the overall course grades, the major grade categories, and the university-administered course evaluations. The total number of failing grades and course withdrawals (DFW rate) was 14.5%, considered acceptable by the instructor.

[Figure 1: Exam and Final Project Grade Distributions. Histograms of Midterm 1, Midterm 2, Final Exam, and Final Project scores.]

Figure 1 gives histograms for Midterm 1 and 2, Final Exam, and Final Project scores. There was considerably more variance in the final project scores than the exams, possibly due to the issues outlined before. The fact that many students failed to produce a final project may be evidence that the assignment had unreasonable expectations.

A Kruskal-Wallis test was used to analyze final exam scores by demographics. There were no significant differences for gender, but a large difference for black students (H(1)=6.39, p=.01) and a smaller difference for prior programming experience (H(1)=5.51, p=.02). The students without prior experience scored about 12% lower on average, while the black students scored about 41% lower. Given the concerning spread here, we review this data with more context in the next section before drawing any conclusions.
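For readers who wish to replicate this kind of subgroup comparison, a minimal sketch is shown below, assuming a pandas DataFrame of consenting students with hypothetical columns for the final exam score and each demographic variable.

    # Minimal sketch of the subgroup comparison, assuming a DataFrame `outcomes`
    # with hypothetical columns "final_exam", "gender", "race", "prior_experience".
    import pandas as pd
    from scipy.stats import kruskal

    def compare_groups(frame, outcome, group):
        """Run a Kruskal-Wallis test of `outcome` across the levels of `group`."""
        samples = [values[outcome].dropna() for _, values in frame.groupby(group)]
        statistic, p_value = kruskal(*samples)
        return statistic, p_value

    # Example usage (illustrative only):
    # outcomes = pd.read_csv("final_grades.csv")
    # for column in ["gender", "race", "prior_experience"]:
    #     h, p = compare_groups(outcomes, "final_exam", column)
    #     print(f"{column}: H={h:.2f}, p={p:.3f}")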
The university-run course evaluations from students yielded positive but simplistic results. Both the course and the instructor were separately rated on a 5-point Likert scale (Poor ... Excellent). Both the course (Mdn=5, M=4.62, SD=0.77) and the instructor (Mdn=5, M=4.70, SD=0.67) achieved very high results, but ultimately this tells us little about the students' experience. Course evaluation data is known to contain bias and provide limited information [7, 25]; these results must be taken in context with other sources of data. Note that because the course evaluations are anonymous, they cannot be cross-referenced with other data. A review of the students' free response answers reveals many were unhappy with the Final Project. In fact, the word "Arcade" appears in 41 of the 86 text responses, often as their only comment. Although this helps us see a major point of failure in our curriculum, it highlights the need for alternative evaluation mechanisms. Relying solely on students' final perceptions leaves us vulnerable to student biases.

5.2 Time Spent Programming
The keystroke-level log data allows us to determine a number of interesting metrics beyond what is available from our grading spreadsheet. As a simple starting point, using the timestamps of the programming logs we can get a measure of how early students started working on assignments and the total time they spent. Earliness was measured by taking each submission event across the entire course, finding the difference between it and the relevant assignment's deadline, and averaging those durations together within each student. Hours Spent was measured by grouping all the events in the logs by student, finding the difference with the next adjacent event (clipping to a maximum of 30 seconds, to account for breaks), and summing these durations.
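A minimal sketch of these two metrics is shown below. It assumes the ProgSnap2 main event table has been loaded into a pandas DataFrame and that a deadline lookup is available; the column names ("SubjectID", "EventType", "ServerTimestamp", "AssignmentID") and the submit event label are assumptions about our particular export rather than requirements of the format.

    # Sketch of the Earliness and Hours Spent metrics from a ProgSnap2-style
    # event table; column names and the submit label are assumptions.
    import pandas as pd

    def earliness_hours(events, deadlines, submit_event="Submit", gap_cap="30s"):
        events = events.copy()
        events["ServerTimestamp"] = pd.to_datetime(events["ServerTimestamp"])

        # Earliness: average (deadline - submission time) per student, in hours.
        submits = events[events["EventType"] == submit_event].copy()
        submits["deadline"] = pd.to_datetime(submits["AssignmentID"].map(deadlines))
        submits["early"] = submits["deadline"] - submits["ServerTimestamp"]
        earliness = submits.groupby("SubjectID")["early"].mean().dt.total_seconds() / 3600

        # Hours Spent: sum the gaps between adjacent events, clipped at the cap.
        events = events.sort_values(["SubjectID", "ServerTimestamp"])
        gaps = events.groupby("SubjectID")["ServerTimestamp"].diff()
        gaps = gaps.clip(upper=pd.Timedelta(gap_cap))
        hours_spent = gaps.groupby(events["SubjectID"]).sum().dt.total_seconds() / 3600

        return pd.DataFrame({"earliness_hours": earliness, "hours_spent": hours_spent})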
[Figure 2: Comparison of Earliness, Time Spent, and Final Exam Score. (a) Earliness vs. Final Exam Score, (b) Hours Spent vs. Final Exam Score, (c) Earliness vs. Hours Spent.]

Figures 2a, 2b, and 2c show marginal plots between earliness, hours spent, and final exam grade. Spearman's Rho was used to calculate the correlation between each pair of outcomes. Consistent with Kazerouni [18], earliness (a measure of procrastination) had a significant medium correlation with exam scores (rs = .49, p < .001), while time spent was only modestly correlated (rs = -.32, p = .001). Interestingly, there was no significant correlation between students' time spent and their procrastination (rs = -0.09, p = .36).
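The corresponding correlation calls are straightforward with SciPy; the sketch below reuses the hypothetical metrics frame from the previous sketch, joined with final exam grades.

    # Minimal sketch of the Spearman correlations reported above.
    from scipy.stats import spearmanr

    def report_spearman(frame, x, y):
        rho, p = spearmanr(frame[x], frame[y], nan_policy="omit")
        print(f"{x} vs {y}: rs={rho:.2f}, p={p:.3f}")

    # Example usage (illustrative only):
    # metrics["final_exam"] = final_exam_scores   # aligned by student
    # report_spearman(metrics, "earliness_hours", "final_exam")
    # report_spearman(metrics, "hours_spent", "final_exam")
    # report_spearman(metrics, "hours_spent", "earliness_hours")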
Analyzing behavioral outcomes by demographics indicated no differences, with the exception of total hours spent between women vs. men (H(1)=9.77, p=0.002) and between students with vs. without prior experience (H(1)=7.28, p=0.007). This comparison is visualized in Figures 3a and 3b. Women and students with no prior experience spent, on average, about 8 and 5 hours more than their counterparts. Importantly, this means that there was no significant difference between subgroups in how early students started.

[Figure 3: Hours Spent by Demographics. (a) Hours Spent by Gender, (b) Hours Spent by Prior Experience.]

Given the difference in final exam scores, black students appear poorly served by the current curriculum. On average, these students spent as much time as their peers on assignments, but their final exam scores were lower than students outside of this category. Given the evidence for the continued education debt owed to non-White students (Ladson-Billings, 2006) [21], more work is needed to identify both potentially problematic structural elements of the course and how the course can better draw on student strengths to produce more equitable outcomes.

Figure 4 visualizes the total time spent by students per week on the programming problems. The data collected raises an interesting question: how many hours should we ideally expect students to spend on our courses? At our institution, the guidance from the administration (https://tinyurl.com/csedm2020-udel-credit-policy) is that in a three-credit course like this one, students should spend 45 hours in class and 90 hours outside of class over the course of the 15-week semester. The median time spent in our course by a given student on all the programming assignments was 19 hours, while the highest time spent by any individual student was just over 42 hours. This does not take into account time spent outside the coding environment (e.g., working on projects in Thonny), working on quizzes, or reading/watching the lesson presentations. However, some students did complete their projects in the online environment, and we expect most of those activities to take considerably less time than the programming activities. This may suggest that we are not asking our students to dedicate as much time as we might.

[Figure 4: Time Spent per Week of Semester. Hours spent on programming problems for each of the 16 weeks.]

Looking at specific time periods within the data, we see that students spent less time programming in the first weeks of the course, around the midpoint, and near the end of the programming problems (the last few weeks took place outside of the online coding environment). Especially for the earlier material, it is likely that the pace can be accelerated.

5.3 Error Classification
Table 2 gives the percentage of different feedback messages that students received on programming problems, as a percentage of all the feedback events received. Our numbers vary from those reported by Smith and Rixner (2019) [29], possibly because of our very different approach to feedback and the affordances of our programming environment.

  Category       Subcategory                 Percentage
  Instructor                                    37.8%
                 Problem Specific               32.1%
                 Not Enough Student Tests        1.0%
                 Not Printing Answer             0.8%
  Analyzer                                      22.1%
                 Initialization Problem          6.9%
                 Unused Variable                 5.9%
                 Multiple Return Types           2.9%
                 Incompatible Types              1.4%
                 Parameter Type Mismatch         1.0%
                 Overwritten Variable            0.6%
                 Read out of scope               0.5%
  Correct                                       17.8%
  Syntax                                        11.4%
                 No Source Code                  0.3%
  Runtime                                        7.8%
                 TypeError                       3.7%
                 NameError                       1.0%
                 AttributeError                  0.8%
                 ValueError                      0.4%
                 KeyError                        0.4%
                 IndexError                      0.4%
  Student        Student Tests Failing           2.9%
                 Instructions                    2.4%
  System Error                                   0.8%

Table 2: Frequency of Error Messages by Category
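Percentages like those in Table 2 can be tabulated directly from the feedback log once each event has been labeled; the sketch below assumes hypothetical "Category" and "Subcategory" columns on the feedback events.

    # Sketch of tabulating Table 2-style percentages from labeled feedback events.
    # The "Category" and "Subcategory" column names are our assumption.
    import pandas as pd

    def feedback_breakdown(feedback_events):
        """Return percentage of all feedback events per category and subcategory."""
        by_category = (feedback_events["Category"]
                       .value_counts(normalize=True)
                       .mul(100).round(1))
        by_subcategory = (feedback_events.groupby(["Category", "Subcategory"])
                          .size()
                          .div(len(feedback_events))
                          .mul(100).round(1))
        return by_category, by_subcategory

    # Weekly ratios (as in Figure 5) can be computed the same way by also
    # grouping on a week-of-semester column derived from the event timestamps.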
One of the most notable departures is the Analyzer and Problem-Specific Instructor feedback categories. Our autograding system is capable of overriding error messages. In particular, one of its key features is a type inferencer and flow analyzer that automatically provides more readable and targeted error messages. The subcategories give examples of the kinds of errors produced: Initialization Problem (using a variable that was not previously defined) frequently supersedes the classic NameError, for example. Meanwhile, some issues have no corresponding runtime error, such as Unused Variable (never reading a variable that was previously written to). The Analyzer gives more than a fifth of all feedback delivered to students, suggesting its role is significant. Further work is needed to evaluate the quality of this feedback and its impact on students' learning.

The Problem-Specific Instructor feedback category is opaque. Given that this represents almost a third of the feedback, it is unhelpful that the category cannot be easily broken down further. Sampling the logs' text, we see examples like students failing instructor unit tests, a reminder to call a function just once, and a suggestion to avoid a specific subscript index. Although the autograder is a powerful mechanism for delivering contextualized help to students, the lack of organization severely limits the automated analysis possible. As part of our process in the future, we intend to annotate feedback in our autograding scripts with identifiers.

[Figure 5: Ratio of Feedback Types by Week of Semester. (a) Ratio of Correct Submissions, (b) Ratio of Syntax Errors, (c) Ratio of Runtime Errors, each by week of the semester.]

Figure 5a gives the ratio of correct submission events in the log data over each week of the semester. Early on, students complete problems with fewer attempts. This might explain the steady growth in the ratios of different kinds of feedback over time, as evidenced by Figures 5b and 5c. It is interesting to observe that the runtime error frequency grows almost linearly over the course of the semester, with the exception of week 7 (a peak week for the Analyzer feedback). We hypothesize this is the result of some more carefully-refined instructor feedback available during that week.

5.4 Final Exam Conceptual Questions
The first part of the final exam was composed of conceptual questions for topics across the curriculum, drawn largely from the quiz questions students had already seen. In this section, we review the quiz report to determine the topics that students struggled with. Most students performed relatively well across the questions, so we focus on errors where more than 20% of the students had incorrect answers.

Students largely had no issues with questions involving evaluating expressions. A small exception to this is students' struggle with Equality vs. Order of different types. In Python, as in many languages, it is not an error to check whether two things of different types are equal (although that comparison will always produce False); however, it is an error to compare their order (with the less-than/greater-than operators). In fact, 58% of students got this specific question wrong.
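To make the distinction concrete, the snippet below shows the behavior in question (a plain Python illustration, not an exam item):

    # Comparing values of different types: equality is allowed, ordering is not.
    print(3 == "3")    # False: values of different types are simply unequal
    try:
        print(3 < "3")
    except TypeError as error:
        print("Ordering raised:", error)   # '<' not supported between int and str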
There were three questions related to tracing complex control flow for loops, if statements, and functions. Tracing seemed to pose difficulties, with between 25-40% of the students getting these questions wrong. We believe that more emphasis should be placed on tracing in the curriculum; there are quiz questions and a worksheet dedicated to the topic, but there are opportunities to expand this material. Tracing has been a recent area of focus, with promising approaches by Xie et al. [31] and Cunningham et al. [9].

Dictionaries also posed significant trouble for students. Dictionaries come up later in the course, represent more complex reality, and conflate syntactic operations with lists. In fact, this last point is evidenced by data. In a question comparing the relative speed of traversing lists and dictionaries, 50% of the students got one variant of a True/False question wrong (so they might as well have been guessing).

Again, the point of our analysis is not necessarily to develop a validated examination instrument or to distill an authoritative set of misconceptions. Instead, we seek to demonstrate the insight we have garnered from reviewing our exam. With these simple percentages, we have found targets.

5.5 Deeper Dive on Functions
In looking over the second part of the final exam questions, we are faced with a tremendous number of concepts integrated into each problem. In fact, with over 261 learning objectives in the course, analyzing the entire set is an overwhelming prospect. To scope our analysis for this paper, we decided to focus on a subset of skills related to Functions that we felt we could clearly identify with computational analysis and that the instructor felt, a priori, they had seen students struggle with over the course of the semester. Table 3 gives the percentages and quantity of students who successfully demonstrated each subskill on each exam.

  Subskill             Description                                               1st Exam     2nd Exam      Final Exam
  Header Definition    Defined the function header with correct syntax           83.5% (86)   84.5% (87)    91.3% (94)
  Provided Types       Provided types for all parameters and the return          40.8% (42)   45.6% (47)    37.9% (39)
  Parameter Overwrite  Did not assign literal values to parameters in the body   88.3% (91)   98.1% (101)   99.0% (102)
  Return/Print         Did not print without returning                           80.6% (83)   89.3% (92)    91.3% (94)
  Parameters/Input     Did not use the input function instead of parameters      96.1% (99)   100.0% (103)  99.0% (102)
  Unit Testing         Wrote unit tests                                          88.3% (91)   79.6% (82)    67.0% (69)
  Decomposition        Separated work into a helper function                      1.0% (1)    17.5% (18)    19.4% (20)

Table 3: Percentage of Students Demonstrating Subskill across Exams

Header Definition: Even though we had not observed many students struggling with syntax during the semester, we felt it critical to analyze the incidence of submitted code that had malformed headers. Although the numbers were a little higher than expected, we are not terribly concerned: reviewing the submissions, many seemed like simple typos (e.g., a missing colon) that were relatively easily fixed.

Provided Types: Students were not required or encouraged to provide types in their headers during the exam. In fact, since the advanced feedback features were turned off, their feedback would not actually reference any parameter or return types they specified (as long as the code was syntactically correct). We did not assess the correctness of their provided types, merely their existence. In the final exam, the number of students who annotated their parameter types falls off sharply after the first three questions (moving from about 50% down to 20%). We offer two explanations: first, the fourth question is one of the most difficult in the entire course, so students may have been distracted by its difficulty. Second, the last questions all involve more complicated nested data types (e.g., lists of dictionaries) that were too troublesome for the students to specify.

Parameter Overwriting: This misconception is one that the instructors were very concerned with, having observed it repeatedly among certain students early in the semester (and being concerned with its persistence). Applying the parameter overwriting pattern to the rest of the submissions over the entire semester, we found that the behavior trails off over the course of the semester. By the final exam, almost no students were making this particular mistake. Although the instructor believes that more can be done up front to avoid this critical misconception, it is comforting that the existing curriculum seems to largely address it by the end.
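As an example of how such a subskill can be identified computationally, the sketch below flags the parameter-overwriting pattern from Table 3 (assigning a literal value to a parameter inside the function body) using Python's ast module. It is a simplified sketch of our own, not the exact detector used in the analysis notebook.

    # Detect "parameter overwrite": a function parameter reassigned a literal
    # constant in the function body. Simplified sketch for illustration.
    import ast

    def overwritten_parameters(source):
        """Return (function name, parameter) pairs that were overwritten."""
        problems = []
        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                parameters = {arg.arg for arg in node.args.args}
                for statement in ast.walk(node):
                    if isinstance(statement, ast.Assign) and isinstance(statement.value, ast.Constant):
                        for target in statement.targets:
                            if isinstance(target, ast.Name) and target.id in parameters:
                                problems.append((node.name, target.id))
        return problems

    # Example: this submission overwrites its parameter, ignoring the caller's argument.
    submission = "def add_tax(price):\n    price = 100\n    return price * 1.06"
    print(overwritten_parameters(submission))   # [('add_tax', 'price')]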
Return/Print: We observed that some students struggle to differentiate between the concepts of return statements and print calls. However, students were largely successful with this subskill, despite a quarter of students getting a related (more abstract) version of this subskill wrong on part 1 of the final exam. It seems that although troublesome for a small clutch of students, most are able to eventually separate these concepts in their code.

Parameters/Input: Similar to students' issues with returning vs. printing, some students were observed in individual sessions mixing up parameters and the input function (which was presented as a very distinctive way that data could enter a function). However, it appears that this was truly isolated to just a few students.

Functional Decomposition: Largely inspired by Fisler's [12] success in overcoming the difficulties of the Rainfall problem, Functional Decomposition was taught as a method for processing complex data. Students had previously been taught 8 different looping patterns (e.g., accumulating, mapping, filtering). A number of assignments required students to decompose problems. Therefore, it is somewhat disappointing that so few students chose to leverage decomposition (particularly since the harder final exam problems were naturally susceptible to a decomposition approach). In addition to the Midterm 2 and final exam questions, we also took a closer look at an earlier open-ended programming problem that was particularly complex and well-suited to decomposition. In these problems, there seemed to be a pattern of students being more successful when they leveraged decomposition. Although not conclusive, this supports the hypothesis that decomposition may be an effective strategy.

                            Decomposed          Monolithic
                            Pass     Fail       Pass     Fail
  Earlier Problem            37        8         29       27
  Midterm 2 Question 5       13        5         40       43
  Final Exam Question 4       7        8         42       44
  Total                      57       21        111      114
                           18.8%     6.9%      36.6%    37.6%

Table 4: Students' Use of Decomposition over Time

Unit Testing: Given that students were not required to unit test their code on the final exam, we were pleased to find that many students wrote unit tests anyway. Interestingly, though, the percentage of students who used this strategy decreased over the course of the semester, even as the programming problems became more difficult. We hypothesize that since the later exam problems involve complex nested data, students either did not feel comfortable generating test data or they felt that it would not be an efficient use of their time. We believe that we need to sell the concept more: rather than thinking that writing test cases would be a detriment to their success, students should see tests as one of the most direct paths to completion.
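The remaining submission-level subskills in Table 3 can be checked in a similar AST-based way. The sketch below tests whether a submission decomposes work into a helper function that another function actually calls, and whether it contains any assert-based unit tests; again, this is a simplified illustration rather than the exact classifier behind Tables 3 and 4.

    # Simplified checks for two Table 3 subskills: Decomposition (a user-defined
    # helper called by another user-defined function) and Unit Testing (at least
    # one assert statement). Illustrative only.
    import ast

    def uses_decomposition(source):
        tree = ast.parse(source)
        defined = {node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)}
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                for call in ast.walk(node):
                    if (isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
                            and call.func.id in defined and call.func.id != node.name):
                        return True
        return False

    def wrote_unit_tests(source):
        return any(isinstance(node, ast.Assert) for node in ast.walk(ast.parse(source)))

    submission = """
    def total(prices):
        return sum(prices)

    def average(prices):
        return total(prices) / len(prices)

    assert average([1, 2, 3]) == 2
    """
    cleaned = "\n".join(line[4:] if line.startswith("    ") else line
                        for line in submission.splitlines())
    print(uses_decomposition(cleaned), wrote_unit_tests(cleaned))   # True True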
6. DISCUSSION
Reviewing our findings, we made several decisions about places to modify our curriculum. The log data suggests that some of the earlier material can be accelerated, so that more time can ultimately be allocated to week 4 (critical material covering functions). We also believe we need to spend more time throughout the semester convincing students that subskills like decomposition and unit testing can help them solve challenging questions, although follow-up analyses will be needed to confirm this theory. Finally, we must come up with new ways to support some of our demographic subgroups, given that outcomes in that area are not yet equal.

Better structure in our existing data sources might help in future analyses. For example, although each quiz question was labeled with a unique identifier, we realized during analysis that we really needed every quiz answer (and in some cases, sets of answers) to have a unique identifier as well. In particular, some questions had multiple parts, or different answers yielded information about different misconceptions. In a similar vein, annotating instructor feedback for the programming problems would have substantially increased the differentiation of our feedback messages.

More metadata about each identifier would also help efforts to cross-reference and cluster related problems (especially over time). This is a non-trivial effort, given the quantity of course materials present in the curriculum. As a starting point, we believe this effort should probably be focused on certain major learning objectives and topics (e.g., functions) that are particularly worthy of attention based on the formative evaluation conducted here.

We expect that before the next iteration of our analyses, we need to develop more hypotheses up front for guidance. A considerable amount of time was spent performing exploratory analyses, trying different approaches and seeing what emerged from the data. Although helpful as we oriented ourselves, the data dredging that can emerge may yield false conclusions that are not actually worth investing in. Finally, while we attempted to follow a replicable process in our data collection and analysis, we believe more should be done to streamline and package our data pipeline to encourage replication and reproduction.

7. CONCLUSION
In this paper, we have described our evaluation of data from a heavily-instrumented CS1 course. Our goal was less about judging the course overall, and more about finding specific areas of improvement and success. We feel that course evaluation is less about the end goal and more about small iterative augmentations that accumulate over time. To structure our approach, we followed a loose Design-Based Research model supported by educational data mining. In our experience, the high volume and variety of data sources can be very helpful in understanding the successes and failures of the course, although it does pose difficulties for analysis. As always, a better pipeline could help make sense of these data and results more quickly, possibly even during the semester. However, in the immediate term, our data analysis contributes to the community's knowledge of students and ideally provides a model for others to follow. In general, we hope to encourage increased rigor in course evaluation as we integrate data-rich tools into our courses.
8. REFERENCES
[1] A. Annamaa. Introducing Thonny, a Python IDE for learning programming. In Proceedings of the 15th Koli Calling Conference on Computing Education Research, pages 117–121, 2015.
[2] R. S. Baker and P. S. Inventado. Educational data mining and learning analytics. In Learning Analytics, pages 61–75. Springer, 2014.
[3] S. Barab and K. Squire. Design-based research: Putting a stake in the ground. The Journal of the Learning Sciences, 13(1):1–14, 2004.
[4] A. C. Bart, A. Sarver, M. Friend, and L. Cox II. PythonSneks: An open-source, instructionally-designed introductory curriculum with action-design research. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education, 2019.
[5] A. C. Bart, J. Tibau, E. Tilevich, C. A. Shaffer, and D. Kafura. BlockPy: An open access data-science environment for introductory programmers. Computer, 50(5):18–26, 2017.
[6] K. Buffardi. Assessing individual contributions to software engineering projects with git logs and user stories. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education, pages 650–656, 2020.
[7] S. E. Carrell and J. E. West. Does professor quality matter? Evidence from random assignment of students to professors. Journal of Political Economy, 118(3):409–432, 2010.
[8] Design-Based Research Collective. Design-based research: An emerging paradigm for educational inquiry. Educational Researcher, 32(1):5–8, 2003.
[9] K. Cunningham, S. Blanchard, B. Ericson, and M. Guzdial. Using tracing and sketching to solve programming problems: Replicating and extending an analysis of what students draw. In Proceedings of the 2017 ACM Conference on International Computing Education Research, pages 164–172, 2017.
[10] N. Diana, M. Eagle, J. Stamper, S. Grover, M. Bienkowski, and S. Basu. Measuring transfer of data-driven code features across tasks in Alice. 2018.
[11] T. Effenberger, J. Cechák, and R. Pelánek. Difficulty and complexity of introductory programming problems. 2019.
[12] K. Fisler. The recurring rainfall problem. In Proceedings of the Tenth Annual Conference on International Computing Education Research, pages 35–42, 2014.
[13] L. Gusukuma, A. C. Bart, and D. Kafura. Pedal: An infrastructure for automated feedback systems. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education, pages 1061–1067, 2020.
[14] L. Gusukuma, A. C. Bart, D. Kafura, J. Ernst, and K. Cennamo. Instructional design + knowledge components: A systematic method for refining instruction. In Proceedings of the 49th ACM Technical Symposium on Computer Science Education, pages 338–343, 2018.
[15] M. Guzdial. Exploring hypotheses about media computation. In Proceedings of the Ninth Annual International ACM Conference on International Computing Education Research, pages 19–26, 2013.
[16] P. Ihantola, A. Vihavainen, A. Ahadi, M. Butler, J. Börstler, S. H. Edwards, E. Isohanni, A. Korhonen, A. Petersen, K. Rivers, et al. Educational data mining and learning analytics in programming: Literature review and case studies. In Proceedings of the 2015 ITiCSE on Working Group Reports, pages 41–63, 2015.
[17] L. C. Kaczmarczyk, E. R. Petrick, J. P. East, and G. L. Herman. Identifying student misconceptions of programming. In Proceedings of the 41st ACM Technical Symposium on Computer Science Education, pages 107–111, 2010.
[18] A. M. Kazerouni, S. H. Edwards, and C. A. Shaffer. Quantifying incremental development practices and their relationship to procrastination. In Proceedings of the 2017 ACM Conference on International Computing Education Research, pages 191–199, 2017.
[19] H. Keuning, J. Jeuring, and B. Heeren. Towards a systematic review of automated feedback generation for programming exercises. In Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education, pages 41–46, 2016.
[20] E. Kurvinen, N. Hellgren, E. Kaila, M.-J. Laakso, and T. Salakoski. Programming misconceptions in an introductory level programming course exam. In Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education, pages 308–313, 2016.
[21] G. Ladson-Billings. From the achievement gap to the education debt: Understanding achievement in U.S. schools. Educational Researcher, 35(7):3–12, 2006.
[22] A. Luxton-Reilly. Learning to program is easy. In Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education, pages 284–289, 2016.
[23] P. Mandal and I.-H. Hsiao. Using differential mining to explore bite-size problem solving practices. In Educational Data Mining in Computer Science Education (CSEDM) Workshop, 2018.
[24] C. Matthies, R. Teusner, and G. Hesse. Beyond surveys: Analyzing software development artifacts to assess teaching efforts. In 2018 IEEE Frontiers in Education Conference (FIE), pages 1–9. IEEE, 2018.
[25] K. M. Mitchell and J. Martin. Gender bias in student evaluations. PS: Political Science & Politics, 51(3):648–652, 2018.
[26] National Academies of Sciences, Engineering, and Medicine. Assessing and Responding to the Growth of Computer Science Undergraduate Enrollments. National Academies Press, 2018.
[27] G. L. Nelson and A. J. Ko. On use of theory in computing education research. In Proceedings of the 2018 ACM Conference on International Computing Education Research, pages 31–39, 2018.
[28] T. W. Price, D. Hovemeyer, K. Rivers, B. A. Becker, et al. ProgSnap2: A flexible format for programming process data. In The 9th International Learning Analytics & Knowledge Conference, Tempe, Arizona, March 2019.
[29] R. Smith and S. Rixner. The error landscape: Characterizing the mistakes of novice programmers. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education, pages 538–544, 2019.
[30] Y. Vance Paredes, D. Azcona, I.-H. Hsiao, and A. F. Smeaton. Predictive modelling of student reviewing behaviors in an introductory programming course. 2018.
[31] B. Xie, G. L. Nelson, and A. J. Ko. An explicit strategy to scaffold novice program tracing. In Proceedings of the 49th ACM Technical Symposium on Computer Science Education, pages 344–349, 2018.