<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating an Instrumented Python CS1 Course</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Austin Cory Bart</string-name>
          <email>acbart@udel.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Teomara Rutherford</string-name>
          <email>teomara@udel.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>James Skripchuk</string-name>
          <email>jskrip@udel.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Delaware</institution>
          ,
          <addr-line>Newark, DE</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The CS1 course is a critical experience for most novice programmers, requiring significant time and effort to overcome the inherent challenges. Ever-increasing enrollments mean that instructors have less insight into their students and can provide less individualized instruction. Automated programming environments and grading systems are one mechanism to scale CS1 instruction, but these new technologies can sometimes make it difficult for the instructor to gain insight into their learners. However, learning analytics collected by these systems can be used to make up some of the difference. This paper describes the process of mining a heavily-instrumented CS1 course to leverage fine-grained evidence of student learning. The existing Python-based curriculum was already heavily integrated with a web-based programming environment that captured keystroke-level student coding snapshots, along with various other forms of automated analyses. A Design-Based Research approach was taken to collect, analyze, and evaluate the data, with the intent to derive meaningful conclusions about the student experience and develop evidence-based improvements for the course. In addition to modeling our process, we report on a number of results regarding the persistence of student mistakes, measurements of student learning and errors, the association between student learning and student effort and procrastination, and places where we might be able to accelerate our curriculum's pacing. We hope that these results, as well as our generalized approach, can guide larger community efforts around systematic course analysis and revision.</p>
      </abstract>
      <kwd-group>
        <kwd>cs1</kwd>
        <kwd>modeling</kwd>
        <kwd>dbr</kwd>
        <kwd>python</kwd>
        <kwd>procrastination</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>Social and professional topics [Professional topics]:
Computing education; Information systems [Information
systems applications]: Data mining</p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under
the Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>
        The first Computer Science course (CS1) can be a
challenging experience for novices, given the constraints of a
semester [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], but success in CS1 is critical for computer
science students, as it sets a foundation for subsequent classes.
Ample practice and feedback are essential to this
experience, so that learners can overcome programming
misconceptions [
        <xref ref-type="bibr" rid="ref17 ref20">17, 20</xref>
        ] and develop effective schemas.
Instructors have a key role in developing materials to support
learners' productive struggle. Recently, however, scaling
enrollments [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] and the move to remote/hybrid learning
environments have shifted much of this work away from interacting
with individual students towards interacting with systems
(which in turn interact with the students directly). For
example, programming autograders [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] remove the instructor
from the grading process, automatically assessing and
sometimes even providing feedback directly to the learner.
Although these systems scale the learning process, they can
inhibit the evaluation and revision of course materials.
Instructors do not have as many first-hand interactions with
students or the artifacts that they produce. When
homework and exams are no longer hand-graded, teachers may
not be as directly motivated to review each submission.
Similarly, when automated feedback systems are effectively
supporting students, teachers will have fewer opportunities to
get direct insight into the issues students are
encountering. This knowledge of the students' experience is critical to
gauge the effectiveness of the course materials. Instructors
need a new model to guide their revision decisions.
We propose that instructors follow a Design-Based Research (DBR)
approach [
        <xref ref-type="bibr" rid="ref3 ref8">8, 3</xref>
        ] to iteratively improve their courses. In
particular, course development should be seen as an iterative
and statistical Instructional Design process; each semester,
a curriculum is built and presented to learners as an
intervention, data is generated and collected as learners interact,
that data is analyzed to discover shortcomings and successes
of the intervention, and then modifications to the "protocol"
are identified for the next iteration of the study.
Instructional Design models provide a systematic framework for
this development process, but the DBR approach augments
this to emphasize the statistical and theory-driven nature
of the evaluation process. Fortunately, the same
autograding tools that scale practice and feedback opportunities for
students can also be used to collect many kinds of learning
analytics, permitting the use of educational data mining to
garner insights into learning [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>In this paper, we present our experience of evaluating a CS1
course that has been heavily instrumented to provide rich
data on student actions. Our goal is not to prove that our
curriculum was a "success" or "failure" as a whole, but to
empirically judge specific pieces and identify components that
should be modified or maintained. We draw upon
programming snapshot data, non-programming autograded question
logs, surveys, exam data, and human assessments to produce
a diverse dataset. In addition to sharing our conclusions
about the state of our course, we believe that we present a
formative model for other instructors who wish to evaluate
their courses systematically. In fact, our specific analyses are
recorded in a Jupyter Notebook
(https://github.com/acbart/csedm20-paper-cs1-analysis). Our hope is that others
will use our analyses as a baseline to develop their own
questions, and that our example motivates others to approach their courses
with a more systematic, empirical method.</p>
    </sec>
    <sec id="sec-3">
      <title>2. THEORIES AND RELATED WORK</title>
      <p>
        The central premise of our approach is inspired by
Design-Based Research, which has been well established in the
education literature for decades. Those interested in an
introduction to DBR can refer to [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Briefly, there are several
key tenets: 1) Development is an iterative process of design,
intervention, collection, and analysis. 2) Educational
interventions cannot be decontextualized from their setting. 3)
Processes from all phases of development must be captured
and provided sufficient context to ensure reproducibility and
replication. 4) Developing learning experiences cannot be
separated from developing theories about learning. 5)
Results from an intervention must inform the next iteration
and be communicated out to broader stakeholders.
Messy authenticity is inherent in this process, and it naturally
limits the theoretical reach of findings in a DBR process.
Therefore, any conclusions derived should not be seen as
broadly applicable, but only meaningful for the context in
which they were developed. Although theories of learning
are generated from DBR, this is less true for early iterations.
True success for a course is a moving target. As the
curriculum improves and students overcome misconceptions faster,
more material can be added. Over time, the curriculum
necessarily needs to be updated and assignments refreshed.
Further, courses often need to be adapted for new audiences
with different demographics and prior experiences. Given that
the DBR model strongly incorporates context, these
realities can be accounted for at some level.
      </p>
      <p>
        DBR has been somewhat underused in Computing
Education Research (CER). Recently, Neslon and Ko (2018) made
a strong argument that CE research should almost
exclusively follow Design-Based Research methodologies [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], for
three reasons: 1) to avoid splitting attention between
advancing theory vs. design, 2) the field has not generated enough
domain-specific theories, and 3) theory has sometimes been
used to impede effective design-based research in the peer
review process. Many of the recommendations made in their
paper echo the tenets of DBR listed above and are
consistent with our vision for communicating our course designs.
In fact, their paper was a major guiding inspiration.
Another major inspiration for our approach is Guzdial's 2013
paper evaluating their decade-long Computational Thinking
course ("MediaComp") [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Although it covers a longer time scale,
this paper takes a scientific, cohesive look at their course
using a DBR lens. They critically evaluate what worked and
contextualize all their findings by their design. They begin
with a set of hypotheses about which aspects of the course
will be effective, and then systematically review data
collected from the offerings to accept or reject those hypotheses.
Their conclusions, while not transcendent, are impactful for
anyone modeling themselves after their context.
In computing education, programming log data has been
used to make various kinds of predictions and evaluations
of student learning [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Applications include predicting
student performance in subsequent courses [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], identifying
learners who need additional support [30], modelling
student strategies as they work on programming problems [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ],
and evaluating students over the course of a semester [
        <xref ref-type="bibr" rid="ref24 ref6">6, 24</xref>
        ].
These approaches tend to rely on vast datasets or seek to
derive conclusions that are predictive, highly transferable, or
about individual students. Although such research
is valuable, its goal is distinct from ours. We recognize that each
course offering has an important local context that cannot
be factored out, and that collecting sufficient evidence over
time inhibits the process of iterative course design. Rather
than developing generalizable theories or predicting
performance, we seek actionable data from a single semester that an
instructor can use to evaluate and redesign their course.
Effenberger et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] are perhaps an example more closely
aligned with our own research goals. Rather than
evaluating students, their work sought to evaluate four
programming problems in a course. Their results suggest that,
despite commonalities in the tasks, the problems'
characteristics were considerably different, underscoring the danger of
treating questions as interchangeable in course evaluation.
The process of systematic course revision is similar to the
ID+KC model by Gusukuma (2018), which combines formal
Instructional Design methodology with a cognitive student
model based on Knowledge Components [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Instead of
focusing on a student model, however, we focus on components
of the instruction such as the learning objectives. Still, the
systematic process of data collection and analysis to inform
revision is common between our methods.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. CURRICULUM AND TECHNOLOGY</title>
      <p>In this section, we describe the course's curriculum and
technology. DBR necessitates a clear enough description of the
curriculum to understand the evaluation conducted, so we
cannot avoid low-level details: the context matters. We
have attempted, however, to separate out the specific
experiential details of our intervention (i.e., the course offering),
which are described in Section 4.</p>
      <p>
        As a starting point, we based our course on the PythonSneks
curriculum (https://acbart.github.io/python-sneks/). This
curriculum has students move through a large sequence of
almost 50 lessons over the course of a semester, with each
lesson focused on a particular introductory programming topic.
Each lesson is composed of a set of learning objectives, the
lesson presentation, a mastery-based quiz, and a set of
programming problems. We have made a number of modifications
to the materials reported in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], such as the introduction of static typing and an
increased emphasis on functional design, to better suit a CS1 for
Computer Science majors. A full listing of all the learning
objectives covered is available online
(https://tinyurl.com/csedm2020-sneks-los).
      </p>
      <p>Learning Management System: The course was
delivered through Canvas, our university's Learning
Management System. All material, including quizzes,
programming assignments, and exams, was directly available
in Canvas (either natively or through LTI).</p>
      <p>
        Lesson Presentation: The lessons were PowerPoint slides
with a recorded voice-over, embedded as a YouTube video
directly into a Canvas Page. The content of these slides
is transcribed directly below the video, including any code
with proper syntax highlighting. Finally, PDF versions of
all the slides with their transcriptions are also available.</p>
      <p>Mastery Quizzes: After the presentations, students are
presented with a Canvas Quiz containing a series of True/False,
Matching, Multiple Choice, and Fill-in-the-blank questions.
This assignment is presented in a mastery style, where
learners can make repeated attempts until they earn a
satisfactory grade. Each of the 200+ questions is annotated with a
specific identifier. These quizzes are 10% of students' grade.
Although Canvas provides an interface to visualize statistics
about individual quiz questions, this is obfuscated by the
students' multiple attempts: only the final grade is shown,
so instructors cannot see how difficult a question was for a
student. To provide greater detail in an instructor-friendly
report, the Canvas API was used to pull all submission
attempts for each student. The scripts used in the analysis and
an example of the instructor report are publicly available
(https://github.com/acbart/canvas-grading-reports); a minimal
sketch of this retrieval appears below.</p>
      <p>
        Programming Problems: Additionally, most lessons
contain two to eight programming problems delivered through a web-based
Python coding environment [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. These problems were also
presented in a mastery style, allowing learners to spend as
much time as they want until the deadline. These
problems are 15% of students' final course grade. The
environment has a dual block/text interface, although students
were discouraged from using the block interface past the
first two weeks of programming activities. The environment
naturally records all student interactions in ProgSnap2
format [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], making it readily accessible for our evaluation.
Students were also required to install (and eventually use)
a desktop Python programming environment, Thonny [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
Students largely used Thonny for their programming projects,
particularly the final project, although a small number chose
to use the environment to write code for other assignments.
The Thonny environment was not instrumented to collect
log data, but students were required to submit their projects
through the autograder in Canvas; therefore, submission data
should not be affected by the relatively small number of
students who used Thonny.
        </p>
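      <p>For illustration, the following is a minimal sketch of pulling every quiz submission through the Canvas REST API, as used to build the instructor report mentioned above. The endpoint and pagination follow Canvas's public API; the base URL, IDs, and token are placeholders, and the authors' actual reporting scripts live in the linked repository.</p>
      <preformat>
import requests

BASE = "https://canvas.example.edu/api/v1"   # placeholder Canvas instance
HEADERS = {"Authorization": "Bearer TOKEN"}  # placeholder API token

def quiz_submissions(course_id, quiz_id):
    """Yield every quiz submission, following Canvas's pagination links."""
    url = f"{BASE}/courses/{course_id}/quizzes/{quiz_id}/submissions"
    params = {"per_page": 100}
    while url:
        resp = requests.get(url, headers=HEADERS, params=params)
        resp.raise_for_status()
        yield from resp.json()["quiz_submissions"]
        url = resp.links.get("next", {}).get("url")  # None when exhausted
        params = None  # the 'next' URL already embeds its query string

for sub in quiz_submissions(12345, 67890):  # placeholder course/quiz IDs
    print(sub["user_id"], sub["attempt"], sub["score"])
      </preformat>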
        <p>
          When students submitted a solution to a programming
problem, the system evaluated their work using an
instructor-authored script written using the Pedal autograding
framework [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. This system generates feedback to learners and
calculates a correctness grade (usually 0 or 1, although
partial credit was possible on exams). The existing curriculum
had a large quantity of autograded programming problems,
some of which needed to be updated based on our changes.</p>
        <p>Exams: There were two midterm exams and a final exam.
These exams were all divided into two parts: 1)
multiple-choice/true-false/matching/etc. questions, and 2) autograded
programming questions. For the latter, students were given
five to six programming problems that they could move freely
between. These problems were automatically graded and
given partial credit (20% for correctly specifying the header,
and the remaining points allocated based on the percentage
of passing instructor unit tests). Both parts were presented
in Canvas through the systems students were already
familiar with, but students were not allowed to use the internet
or to Google. Students took the exam at a proctored testing
center and had two hours. They were only allowed to bring
a single sheet of hand-written notes. Multiple versions of
each exam question were created and drawn from a pool at
random, so that no two students had the exact same exam.</p>
        <p>Projects: There were six projects throughout the semester,
although the first two were very small and heavily scaffolded.
The final project was relatively open-ended and meant to
be summative, but the middle three projects allowed more
mixed forms of support. Although students were largely
expected to produce their own code, they were encouraged to
seek help as needed from the instructional staff. For the final
project, students used the Python Arcade library
(https://arcade.academy/) to create a game. Because students
were not previously taught Arcade, two weeks were allocated
for students to work collaboratively on extending sample
games with new functionality.
Then, they individually built one of 12 games.
        </p>
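        <p>The exam partial-credit rule above is simple enough to state as a small function; here is a minimal sketch (the function name and signature are ours, not the course autograder's):</p>
        <preformat>
def exam_partial_credit(header_correct: bool, tests_passed: int, tests_total: int) -> float:
    """20% for a correctly specified header; the remaining 80% scales
    with the fraction of passing instructor unit tests."""
    header_points = 0.2 if header_correct else 0.0
    test_points = 0.8 * (tests_passed / tests_total) if tests_total else 0.0
    return header_points + test_points

assert exam_partial_credit(True, 3, 4) == 0.2 + 0.8 * 0.75  # 0.8
        </preformat>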
    </sec>
    <sec id="sec-5">
      <title>4. INTERVENTION</title>
      <p>In this section, we describe the specific intervention context
in more detail. The curriculum and technology were used
in the Fall 2019 semester at an R1 university in the
eastern United States for a CS1 course that was required for
Computer Science majors in their first semester. An
IRB-approved research protocol was followed. At the beginning
of the semester, students were asked to provide consent via
a survey, with 103 students agreeing out of 136 (for a 75.7%
consent rate). A separate survey was also administered at
the beginning of the semester to collect various demographic
data (summarized in Table 1, only for consenting students)
relating to gender, race, and prior coding experience.</p>
      <sec id="sec-5-1">
        <title>Percentage</title>
        <p>19%
6%
37%
100%</p>
      </sec>
      <sec id="sec-5-2">
        <title>Identi es as Woman</title>
        <p>Black Student
No Prior Coding Experience</p>
        <p>Total number of students</p>
        <sec id="sec-5-2-1">
          <title>5https://arcade.academy/</title>
      <p>Instructional Staff: The course was taught by a single
instructor, who managed a team of 12 undergraduate
teaching assistants (TAs). These TAs ranged from CS sophomores to
seniors, and not all of them had taken the curriculum
before. However, they were all selected by the instructor for
both their knowledge and their amiability. All members of the
instructional staff hosted office hours. The TAs were also
responsible for grading certain aspects of the projects (e.g.,
test quality, documentation quality, code quality), although
this amounted to relatively little of the students' final course
grade. The instructor met with the TAs every other week
for an hour to discuss the state of the course and provide
training on pedagogy, inclusivity, etc.</p>
      <p>Structure: The lecture met Monday-Wednesday-Friday for
50 minutes across three separate sections. The sections were
led by the same instructor, but were taught at different times
of day (mid-morning, noon, and afternoon). The instructor
did not attempt to provide the exact same experience to all
three sections: if a mistake was made in the morning
section, they attempted to avoid that mistake later. Typically,
the first lecture session of a module started with 15-30
minutes of review of the material guided by clickers, and then
students spent the rest of the module's class time working
on assignments. There were several special in-class
assignments such as worksheets, coding challenges, and readings.
The lab met on Thursdays for 1.5 hours. Students worked on
open assignments with the support of two TAs, who would
actively walk around and answer questions.</p>
    </sec>
    <sec id="sec-6">
      <title>5. RESULTS AND ANALYSIS</title>
      <p>Our ultimate goal is to evaluate the course and identify
aspects that were successful and unsuccessful. First, we
consider basic final course outcomes. Then, we use the
programming log data to analyze students' behavioral
outcomes from the semester. We dive deeper into this data to
characterize the feedback that was delivered to students over
the semester. We look at fine-grained data from both parts
of the final exam to develop a list of problematic subskills,
and then review more of the programming log data in light of
these results. We particularly focus our efforts on subskills
related to defining functions, to tighten our analysis.</p>
      <p>The instructor's naive perception of the course was that
things were largely successful, except for the final project.
Insufficient time was given to the students to learn the game
development API, and instructor expectations were a bit
high (which was adjusted for in the grading, but may have
caused students undue stress). However, the material prior
to the final project went smoothly. Office hours were rarely
overfilled, with the exception of week 4 (the module
introducing Functions), which had one lesson too many; this was
resolved by making the last programming assignment
optional (Programming 25: Functional Decomposition).</p>
    </sec>
    <sec id="sec-7">
      <title>5.1 Basic Course Outcomes</title>
      <p>As a starting point, we consider basic course-level outcomes,
the kind that could be determined even without the extra
instrumentation. These include the overall course grades,
the major grade categories, and the university-administered
course evaluations. The overall rate of failing
grades and course withdrawals (DFW rate) was
14.5%, which the instructor considered acceptable.</p>
      <p>
        A Kruskal-Wallis test was used to analyze final exam scores
by demographics. There were no significant differences for
gender, but a large difference for Black students (H(1)=6.39,
p=.01) and a smaller difference for prior programming
experience (H(1)=5.51, p=0.02). The students without prior
experience scored about 12% lower on average, while
Black students scored about 41% lower. Given the
concerning spread here, we review this data with more context in
the next section before drawing any conclusions.
The university-run course evaluations from students yielded
positive but simplistic results. Both the course and the
instructor were separately rated on a 5-point Likert scale
(Poor ... Excellent). Both the course (Mdn=5, M=4.62,
SD=0.77) and the instructor (Mdn=5, M=4.70, SD=0.67)
achieved very high ratings, but ultimately this tells us
little about the students' experience. Course evaluation data
is known to contain bias and provide limited information [
        <xref ref-type="bibr" rid="ref25 ref7">7, 25</xref>
        ];
these results must be taken in context with other sources of
data. Note that because the course evaluations are
anonymous, they cannot be cross-referenced with other data. A
review of the students' free-response answers reveals many
were unhappy with the final project. In fact, the word
"Arcade" appears in 41 of the 86 text responses, often as a student's
only comment. Although this helps us see a major point of
failure in our curriculum, it highlights the need for
alternative evaluation mechanisms. Relying solely on students' final
perceptions leaves us vulnerable to student biases.
      </p>
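      <p>For illustration, a minimal sketch of the Kruskal-Wallis comparisons reported above, assuming a hypothetical spreadsheet with a final exam score column and a boolean prior-experience column (the file and column names are ours):</p>
      <preformat>
import pandas as pd
from scipy.stats import kruskal

df = pd.read_csv("grades.csv")  # hypothetical export of course outcomes

with_prior = df.loc[df["prior_experience"], "final_exam"]
without_prior = df.loc[~df["prior_experience"], "final_exam"]

stat, p = kruskal(with_prior, without_prior)
print(f"H(1)={stat:.2f}, p={p:.3f}")
      </preformat>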
    </sec>
    <sec id="sec-8">
      <title>5.2 Time Spent Programming</title>
      <p>The keystroke-level log data allows us to determine a
number of interesting metrics beyond what is available from our
grading spreadsheet. As a simple starting point, using the
timestamps of the programming logs we can get a measure
of how early students started working on assignments and
total time spent.</p>
      <p>[Figure: (a) Earliness vs. Final Exam Score; (b) Hours Spent vs. Final Exam Score; (c) Earliness vs. Hours Spent]</p>
      <p>Earliness was measured by taking each
submission event across the entire course, finding the
difference between it and the relevant assignment's deadline,
and averaging those durations together within each student.
Hours Spent was measured by grouping all the events in the
logs by student, finding the difference with the next
adjacent event (clipping each gap to a maximum of 30 seconds, to
account for breaks), and summing these durations.</p>
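      <p>For illustration, the following is a minimal sketch of computing both behavioral metrics, assuming a hypothetical ProgSnap2-style event table with SubjectID, ServerTimestamp, EventType, and AssignmentID columns, plus a separate per-assignment deadline file (the file and column names are ours):</p>
      <preformat>
import pandas as pd

events = pd.read_csv("MainTable.csv", parse_dates=["ServerTimestamp"])
deadlines = pd.read_csv("deadlines.csv", parse_dates=["Deadline"])  # one row per AssignmentID

# Earliness: average gap between each submission and its assignment's deadline.
subs = events[events["EventType"] == "Submit"].merge(deadlines, on="AssignmentID")
gap = (subs["Deadline"] - subs["ServerTimestamp"]).dt.total_seconds() / 3600
earliness = gap.groupby(subs["SubjectID"]).mean()

# Hours Spent: sum gaps between adjacent events, clipping each to 30 seconds.
events = events.sort_values(["SubjectID", "ServerTimestamp"])
gaps = events.groupby("SubjectID")["ServerTimestamp"].diff().dt.total_seconds()
hours_spent = gaps.clip(upper=30).groupby(events["SubjectID"]).sum() / 3600
      </preformat>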
      <p>Analyzing behavioral outcomes by demographics indicated
no differences, with the exception of total hours spent
between women vs. men (H(1)=9.77, p=0.002) and between
students with vs. without prior experience (H(1)=7.28, p=0.007).
This comparison is visualized in Figures 3a and 3b. Women
and students with no prior experience spent, on average,
about 8 and 5 hours more than their counterparts, respectively.
Importantly, this means that there was no significant difference in
how early students started between subgroups.</p>
      <p>
        Given the difference in final exam scores, Black students
appear poorly served by the current curriculum. On average,
these students spent as much time as their peers on
assignments, but their final exam scores were lower than those of
students outside this category. Given the evidence for the
continued education debt owed to non-White students
(Ladson-Billings, 2006) [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], more work is needed to identify both
potentially problematic structural elements of the course
and how the course can better draw on student strengths
to produce more equitable outcomes.
      </p>
      <p>[Figure 3: Total Hours Spent, (a) by Gender, (b) by Prior Experience]</p>
      <p>Figure 4 visualizes the total time spent by students per week
on the programming problems. The data collected raises an
interesting question: how many hours should we ideally
expect students to spend on our courses? At our institution,
the guidance from the administration
(https://tinyurl.com/csedm2020-udel-credit-policy) is that in a
three-credit course like this one, students should spend 45 hours
in class and 90 hours outside of class over the course of the
15-week semester. The median time spent in our course by
a given student on all the programming assignments was
19 hours, while the highest time spent by any individual
student was just over 42 hours. This does not take into
account time spent outside the coding environment (e.g.,
working on projects in Thonny), working on quizzes, and
reading/watching the lesson presentations. However, some
students did complete their projects in the online
environment, and we expect most of those activities to take
considerably less time than the programming activities. This may
suggest that we are not asking our students to dedicate as
much time as we might.</p>
      <p>Looking at specific time periods within the data, we see
that students spent less time programming in the first weeks
of the course, around the midpoint, and near the end of
the programming problems (the last few weeks took place
outside of the online coding environment). Especially for the
earlier material, it is likely that the pace can be accelerated.</p>
    </sec>
    <sec id="sec-9">
      <title>5.3 Error Classification</title>
      <p>
        Table 2 breaks down the different feedback messages
that students received on programming problems as a
percentage of all the feedback events received. Our numbers
vary from those reported by Smith and Rixner (2019) [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ],
possibly because of our very different approach to feedback
and the affordances of our programming environment.
One of the most notable departures is the Analyzer and
Problem-specific Instructor feedback categories. Our
autograding system is capable of overriding error messages. In
particular, one of its key features is a type inferencer and
flow analyzer that automatically provides more readable and
targeted error messages. The subcategories give examples
of the kinds of errors produced: Initialization Problem
(using a variable that was not previously defined) frequently
supersedes the classic NameError, for example. Meanwhile,
some issues have no corresponding runtime error, such as
Unused Variable (never reading a variable that was
previously written to). The Analyzer produces more than a fifth of
all feedback delivered to students, suggesting its role is
significant. Further work is needed to evaluate the quality of
this feedback and its impact on students' learning.</p>
      <p>The Problem-specific Instructor feedback category is opaque.
Given that this represents almost a third of the feedback, it
is unhelpful that the category cannot be easily broken down
further. Sampling the logs' text, we see examples like
students failing instructor unit tests, a reminder to call a
function just once, and a suggestion to avoid a specific subscript
index. Although the autograder is a powerful mechanism for
delivering contextualized help to students, the lack of
organization severely limits the automated analysis possible. As
part of our process in the future, we intend to annotate the
feedback in our autograding scripts with identifiers.
      </p>
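      <p>For illustration, a minimal sketch of how the percentages in Table 2 can be tallied, assuming a hypothetical export with one labeled feedback event per row (the file and column names are ours):</p>
      <preformat>
import pandas as pd

feedback = pd.read_csv("feedback_events.csv")  # columns: SubjectID, Category, Subcategory

# Share of each feedback category across all feedback events.
percentages = (feedback["Category"]
               .value_counts(normalize=True)
               .mul(100)
               .round(1))
print(percentages)  # the Analyzer should account for over a fifth, per the text
      </preformat>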
      <sec id="sec-9-1">
        <title>Analyzer</title>
      </sec>
      <sec id="sec-9-2">
        <title>Correct Syntax</title>
      </sec>
      <sec id="sec-9-3">
        <title>Runtime</title>
      </sec>
      <sec id="sec-9-4">
        <title>Student Instructions System Error</title>
      </sec>
      <sec id="sec-9-5">
        <title>Subcategory</title>
      </sec>
      <sec id="sec-9-6">
        <title>Problem Speci c Not Enough Student Tests Not Printing Answer</title>
      </sec>
      <sec id="sec-9-7">
        <title>Initialization Problem</title>
        <p>Unused Variable
Multiple Return Types
Incompatible Types
Parameter Type Mismatch
Overwritten Variable
Read out of scope</p>
      </sec>
      <sec id="sec-9-8">
        <title>No Source Code</title>
      </sec>
      <sec id="sec-9-9">
        <title>TypeError</title>
        <p>NameError
AttributeError
ValueError
KeyError
IndexError
Student Tests Failing</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>5.4 Final Exam Conceptual Questions</title>
      <p>The first part of the final exam was composed of
conceptual questions on topics across the curriculum, drawn largely
from the quiz questions students had already seen. In this
section, we review the quiz report to determine the topics
that students struggled with. Most students performed
relatively well across the questions, so we focus on errors where
more than 80% of the students had incorrect answers.</p>
      <p>Students largely had no issues with questions involving
evaluating expressions. A small exception to this is students'
struggle with equality vs. order comparisons across different types. In
Python, as in many languages, it is not an error to check whether
two things of different types are equal (although that
comparison will always produce False); however, it is an error to
compare their order (with the less-than/greater-than operators). In
fact, 58% of students got this specific question wrong.</p>
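      <p>As a minimal illustration of the equality vs. order distinction (our own snippet, not an exam question):</p>
      <preformat>
# Equality across types is legal in Python and simply produces False:
print(5 == "5")   # False

# Ordering across incompatible types raises an error:
print(5 &lt; "5")    # TypeError: '&lt;' not supported between 'int' and 'str'
      </preformat>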
      <p>There were three questions related to tracing complex
control flow through loops, if statements, and functions. Tracing
seemed to pose difficulties, with between 25-40% of the
students getting these questions wrong. We believe that more
emphasis should be placed on tracing in the curriculum;
there are quiz questions and a worksheet dedicated to the
topic, but there are opportunities to expand this material.
Tracing has been a recent area of focus, with promising
approaches by Xie et al. [31] and Cunningham et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. A hypothetical example of such a tracing task appears
at the end of this section.</p>
      <p>Dictionaries also posed significant trouble for students.
Dictionaries come up later in the course, represent more
complex realities, and conflate syntactic operations with those
of lists. In fact, this last point is evidenced by the data. In a
question comparing the relative speed of traversing lists and
dictionaries, 50% of the students got one variant of a True/False
question incorrect (so they might as well have been guessing).</p>
      <p>Again, the point of our analysis is not necessarily to develop
a validated examination instrument or to distill an
authoritative set of misconceptions. Instead, we seek to
demonstrate the insight we have garnered from reviewing our exam.
With these simple percentages, we have found targets.</p>
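      <p>To make the tracing difficulty concrete, here is a hypothetical example of the kind of control-flow question students found troublesome (not an actual exam item): students must predict the printed output.</p>
      <preformat>
def mystery(values):
    total = 0
    for v in values:
        if v % 2 == 0:
            total += v      # even values are accumulated
        else:
            total -= 1      # odd values decrement the total
    return total

print(mystery([3, 4, 7, 10]))  # prints 12: -1 + 4 - 1 + 10
      </preformat>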
    </sec>
    <sec id="sec-11">
      <title>5.5 Deeper Dive on Functions</title>
      <p>In looking over the second part of the final exam questions,
we are faced with a tremendous number of concepts
integrated into each problem. In fact, with over 261 learning
objectives in the course, analyzing the entire set is an
overwhelming prospect. To scope our analysis for this paper, we
focus on the function-definition subskills listed in Table 3.</p>
      <p>[Table 3: Function-definition subskills. Descriptions:
Defined the function header with correct syntax;
Provided types for all parameters and the return;
Did not assign literal values to parameters in the body;
Did not print without returning;
Did not use the input function instead of parameters;
Wrote unit tests;
Separated work into a helper function]</p>
      <p>Header Definition: Even though we had not observed
many students struggling with syntax during the semester,
we felt it critical to analyze the incidence of submitted code
that had malformed headers. Although the numbers were a
little higher than expected, we are not terribly concerned:
reviewing the submissions, many seemed like simple typos
(e.g., a missing colon) that were relatively easily fixed.</p>
      <p>Provided Types: Students were not required or
encouraged to provide types in their headers during the exam. In
fact, since the advanced feedback features were turned off,
their feedback would not actually reference any parameter
or return types they specified (as long as the code was
syntactically correct). We did not assess the correctness of their
provided types, merely their existence. In the final exam,
the number of students who annotated their parameter types
falls off sharply after the first three questions (moving from
about 50% down to 20%). We offer two explanations: first,
the fourth question is one of the most difficult in the entire
course, so students may have been distracted by its
difficulty. Second, the last questions all involve more
complicated nested data types (e.g., lists of dictionaries) that were
too troublesome for the students to specify.</p>
      <p>Parameter Overwriting: This misconception is one that
the instructors were very concerned with, having observed
it repeatedly among certain students early in the semester
(and been concerned with its persistence). Applying the
parameter overwriting pattern to the rest of the submissions over
the entire semester, we found that the behavior trails off
over the course of the semester. By the final exam, almost
no students were making this particular mistake. Although
the instructor believes that more can be done up front to
avoid this critical misconception, it is comforting that the
existing curriculum seems to largely address it by the end.
A sketch of how this pattern can be detected appears at the
end of this section.</p>
      <p>Return/Print: We observed that some students struggle
to differentiate between the concepts of return statements
and print calls. However, students were largely
successful with this subskill, despite a quarter of students getting
a related (more abstract) version of this subskill
wrong on part 1 of the final exam. It seems that although
troublesome for a small clutch of students, most are able to
eventually separate these concepts in their code.</p>
      <p>Parameters/Input: Similar to students' issues with
returning vs. printing, some students were observed in
individual sessions mixing up parameters and the input function
(which was presented as a very distinctive way that data
could enter a function). However, it appears that this was
truly isolated to just a few students.</p>
      <p>
        Functional Decomposition: Largely inspired by Fisler's [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
success in overcoming the difficulties of the Rainfall problem,
functional decomposition was taught as a method for
processing complex data. Students had previously been taught
8 different looping patterns (e.g., accumulating, mapping,
filtering). A number of assignments required students to
decompose problems. Therefore, it is somewhat
disappointing that so few students chose to leverage decomposition
(particularly since the harder final exam problems were
naturally susceptible to a decomposition approach). In addition
to the midterm 2 and final exam questions, we also took a
closer look at an earlier open-ended programming problem
that was particularly complex and well-suited to
decomposition. In these problems, there seemed to be a pattern of
students being more successful when they leveraged
decomposition. Although not conclusive, this supports the
hypothesis that decomposition may be an effective strategy.
      </p>
      <p>Unit Testing: Given that students were not required to
unit test their code on the final exam, we were pleased to
find that many students wrote unit tests anyway.
Interestingly, though, the percentage of students who used this
strategy decreased over the course of the semester, even as
the programming problems became more difficult. We
hypothesize that since the later exam problems involve
complex nested data, students either did not feel comfortable
generating test data or they felt that it would not be an
efficient use of their time. We believe that we need to sell the
concept more: rather than thinking that writing test cases
would be a detriment to their success, students should see
tests as one of the most direct paths to completion.</p>
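      <p>For illustration, the following is a minimal sketch of detecting the parameter-overwriting pattern (assigning a literal value to a parameter in the function body) with Python's ast module. This is our own illustration, not the course's actual Pedal-based check.</p>
      <preformat>
import ast

def overwrites_parameter(source: str) -> bool:
    """Return True if any function assigns a literal to one of its parameters."""
    tree = ast.parse(source)
    for func in (n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)):
        params = {arg.arg for arg in func.args.args}
        for node in ast.walk(func):
            if isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
                for target in node.targets:
                    if isinstance(target, ast.Name) and target.id in params:
                        return True
    return False

# The classic mistake: ignoring the argument and hard-coding a value.
print(overwrites_parameter("def add(a, b):\n    a = 5\n    return a + b"))  # True
      </preformat>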
    </sec>
    <sec id="sec-12">
      <title>6. DISCUSSION</title>
      <p>Reviewing our findings, we made several decisions about
places to modify our curriculum. The log data suggests that
some of the earlier material can be accelerated, so that more
time can ultimately be allocated to week 4 (critical
material covering functions). We also believe we need to spend
more time throughout the semester convincing students that
subskills like decomposition and unit testing can help them
solve challenging questions, although follow-up analyses will
be needed to confirm this theory. Finally, we must come
up with new ways to support some of our demographic
subgroups, given that outcomes in that area are not yet equal.</p>
      <p>Better structure in our existing data sources might help
future analyses. For example, although each quiz question
was labeled with a unique identifier, we realized during
analysis that we really needed every quiz answer (and in some
cases, sets of answers) to have a unique identifier as well. In
particular, some questions had multiple parts, or different
answers yielded information about different misconceptions.
In a similar vein, annotating instructor feedback for the
programming problems would have substantially increased the
differentiation of our feedback messages.</p>
      <p>More metadata about each identifier would also help efforts
to cross-reference and cluster related problems (especially
over time). This is a non-trivial effort, given the quantity
of course materials present in the curriculum. As a
starting point, we believe this effort should probably be focused
on certain major learning objectives and topics (e.g.,
functions) that are particularly worthy of attention based on the
formative evaluation conducted here.</p>
      <p>We expect that before our next iteration of these analyses,
we will need to develop more hypotheses up front for guidance.
A considerable amount of time was spent performing
exploratory analyses, trying different approaches and seeing
what emerged from the data. Although helpful as we
oriented ourselves, the data dredging that can emerge from this may
yield false conclusions that are not actually worth
investing in. Finally, while we attempted to follow a replicable
process in our data collection and analysis, we believe more
should be done to streamline and package our data pipeline
to encourage replication and reproduction.</p>
    </sec>
    <sec id="sec-13">
      <title>7. CONCLUSION</title>
      <p>In this paper, we have described our evaluation of data from
a heavily-instrumented CS1 course. Our goal was less about
judging the course overall, and more about finding specific
areas of improvement and success. We feel that course
evaluation is less about the end-goal and more about small
iterative augmentations that accumulate over time. To structure
our approach, we followed a loose Design-Based Research
model supported by educational data mining. In our
experience, the high volume and variety of data sources can
be very helpful in understanding the successes and failures
of the course, although they do pose difficulties for
analysis. As always, a better pipeline could help make sense of
these data and results more quickly, possibly even during
the semester. However, in the immediate term, our data
analysis contributes to the community's knowledge of
students and ideally provides a model for others to follow.
In general, we hope to encourage increased rigor in course
evaluation as we integrate data-rich tools into our courses.
</p>
      <p>[29] S. Smith and S. Rixner. The error landscape: Characterizing the mistakes of novice programmers. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education, pages 538-544, 2019.</p>
      <p>[30] Y. Vance Paredes, D. Azcona, I.-H. Hsiao, and A. F. Smeaton. Predictive modelling of student reviewing behaviors in an introductory programming course. 2018.</p>
      <p>[31] B. Xie, G. L. Nelson, and A. J. Ko. An explicit strategy to scaffold novice program tracing. In Proceedings of the 49th ACM Technical Symposium on Computer Science Education, pages 344-349, 2018.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Annamaa</surname>
          </string-name>
          .
          <article-title>Introducing Thonny, a Python IDE for learning programming</article-title>
          .
          <source>In Proceedings of the 15th Koli Calling Conference on Computing Education Research</source>
          , pages
          <fpage>117</fpage>
          -
          <lpage>121</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Baker</surname>
          </string-name>
          and
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Inventado</surname>
          </string-name>
          .
          <article-title>Educational data mining and learning analytics</article-title>
          .
          <source>In Learning Analytics</source>
          , pages
          <fpage>61</fpage>
          -
          <lpage>75</lpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Barab</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Squire</surname>
          </string-name>
          .
          <article-title>Design-based research: Putting a stake in the ground</article-title>
          .
          <source>The Journal of the Learning Sciences</source>
          ,
          <volume>13</volume>
          (1):
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Bart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sarver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Friend</surname>
          </string-name>
          , and
          <string-name>
            <surname>L. Cox II</surname>
          </string-name>
          .
          <article-title>PythonSneks: an open-source, instructionally-designed introductory curriculum with action-design research</article-title>
          .
          <source>In Proceedings of the 50th ACM Technical Symposium on Computer Science Education</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Bart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tibau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tilevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Shaffer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Kafura</surname>
          </string-name>
          .
          <article-title>BlockPy: An open access data-science environment for introductory programmers</article-title>
          .
          <source>Computer</source>
          ,
          <volume>50</volume>
          (
          <issue>5</issue>
          ):
          <fpage>18</fpage>
          -
          <lpage>26</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Buffardi</surname>
          </string-name>
          .
          <article-title>Assessing individual contributions to software engineering projects with git logs and user stories</article-title>
          .
          <source>In Proceedings of the 51st ACM Technical Symposium on Computer Science Education</source>
          , pages
          <fpage>650</fpage>
          -
          <lpage>656</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Carrell</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. E.</given-names>
            <surname>West</surname>
          </string-name>
          .
          <article-title>Does professor quality matter? Evidence from random assignment of students to professors</article-title>
          .
          <source>Journal of Political Economy</source>
          ,
          <volume>118</volume>
          (
          <issue>3</issue>
          ):
          <fpage>409</fpage>
          -
          <lpage>432</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Design-Based Research Collective</surname>
          </string-name>
          .
          <article-title>Design-based research: An emerging paradigm for educational inquiry</article-title>
          .
          <source>Educational Researcher</source>
          ,
          <volume>32</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cunningham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Blanchard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ericson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Guzdial</surname>
          </string-name>
          .
          <article-title>Using tracing and sketching to solve programming problems: replicating and extending an analysis of what students draw</article-title>
          .
          <source>In Proceedings of the 2017 ACM Conference on International Computing Education Research</source>
          , pages
          <fpage>164</fpage>
          -
          <lpage>172</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Diana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eagle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stamper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Grover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bienkowski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Basu</surname>
          </string-name>
          .
          <article-title>Measuring transfer of data-driven code features across tasks in Alice</article-title>
          .
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Effenberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cechak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Pelanek</surname>
          </string-name>
          .
          <article-title>Difficulty and complexity of introductory programming problems</article-title>
          .
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Fisler</surname>
          </string-name>
          .
          <article-title>The recurring rainfall problem</article-title>
          .
          <source>In Proceedings of the tenth annual conference on International computing education research</source>
          , pages
          <fpage>35</fpage>
          -
          <lpage>42</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gusukuma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Bart</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Kafura</surname>
          </string-name>
          .
          <article-title>Pedal: An infrastructure for automated feedback systems</article-title>
          .
          <source>In Proceedings of the 51st ACM Technical Symposium on Computer Science Education</source>
          , pages
          <fpage>1061</fpage>
          -
          <lpage>1067</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gusukuma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Bart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kafura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ernst</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Cennamo</surname>
          </string-name>
          .
          <article-title>Instructional design + knowledge components: A systematic method for refining instruction</article-title>
          .
          <source>In Proceedings of the 49th ACM Technical Symposium on Computer Science Education</source>
          , pages
          <fpage>338</fpage>
          -
          <lpage>343</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Guzdial</surname>
          </string-name>
          .
          <article-title>Exploring hypotheses about media computation</article-title>
          .
          <source>In Proceedings of the ninth annual international ACM conference on International computing education research</source>
          , pages
          <fpage>19</fpage>
          -
          <lpage>26</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ihantola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vihavainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ahadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Butler</surname>
          </string-name>
          , J. Borstler,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Edwards</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Isohanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Petersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rivers</surname>
          </string-name>
          , et al.
          <article-title>Educational data mining and learning analytics in programming: Literature review and case studies</article-title>
          .
          <source>In Proceedings of the 2015 ITiCSE on Working Group Reports</source>
          , pages
          <fpage>41</fpage>
          -
          <lpage>63</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Kaczmarczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. R.</given-names>
            <surname>Petrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>East</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Herman</surname>
          </string-name>
          .
          <article-title>Identifying student misconceptions of programming</article-title>
          .
          <source>In Proceedings of the 41st ACM technical symposium on Computer science education</source>
          , pages
          <fpage>107</fpage>
          -
          <lpage>111</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Kazerouni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Edwards</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Shaffer</surname>
          </string-name>
          .
          <article-title>Quantifying incremental development practices and their relationship to procrastination</article-title>
          .
          <source>In Proceedings of the 2017 ACM Conference on International Computing Education Research</source>
          , pages
          <fpage>191</fpage>
          -
          <lpage>199</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>H.</given-names>
            <surname>Keuning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jeuring</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Heeren</surname>
          </string-name>
          .
          <article-title>Towards a systematic review of automated feedback generation for programming exercises</article-title>
          .
          <source>In Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education</source>
          , pages
          <fpage>41</fpage>
          -
          <lpage>46</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kurvinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hellgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kaila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-J.</given-names>
            <surname>Laakso</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Salakoski</surname>
          </string-name>
          .
          <article-title>Programming misconceptions in an introductory level programming course exam</article-title>
          .
          <source>In Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education</source>
          , pages
          <fpage>308</fpage>
          -
          <lpage>313</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ladson-Billings</surname>
          </string-name>
          .
          <article-title>From the achievement gap to the education debt: Understanding achievement in U.S. schools</article-title>
          .
          <source>Educational researcher</source>
          ,
          <volume>35</volume>
          (
          <issue>7</issue>
          ):
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Luxton-Reilly</surname>
          </string-name>
          .
          <article-title>Learning to program is easy</article-title>
          .
          <source>In Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education</source>
          , pages
          <fpage>284</fpage>
          -
          <lpage>289</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mandal</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.-H.</given-names>
            <surname>Hsiao</surname>
          </string-name>
          .
          <article-title>Using differential mining to explore bite-size problem solving practices</article-title>
          .
          <source>In Educational Data Mining in Computer Science Education (CSEDM) Workshop</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>C.</given-names>
            <surname>Matthies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Teusner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Hesse</surname>
          </string-name>
          .
          <article-title>Beyond surveys: analyzing software development artifacts to assess teaching efforts</article-title>
          .
          <source>In 2018 IEEE Frontiers in Education Conference (FIE)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . IEEE,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Martin</surname>
          </string-name>
          .
          <article-title>Gender bias in student evaluations</article-title>
          .
          <source>PS: Political Science &amp; Politics</source>
          ,
          <volume>51</volume>
          (
          <issue>3</issue>
          ):
          <fpage>648</fpage>
          -
          <lpage>652</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          National Academies of Sciences, Engineering, and Medicine and others.
          <source>Assessing and responding to the growth of computer science undergraduate enrollments</source>
          . National Academies Press,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Nelson</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Ko</surname>
          </string-name>
          .
          <article-title>On use of theory in computing education research</article-title>
          .
          <source>In Proceedings of the 2018 ACM Conference on International Computing Education Research</source>
          , pages
          <fpage>31</fpage>
          -
          <lpage>39</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>T. W.</given-names>
            <surname>Price</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovemeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rivers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Becker</surname>
          </string-name>
          , et al.
          <article-title>ProgSnap2: A flexible format for programming process data</article-title>
          .
          <source>In The 9th International Learning Analytics &amp; Knowledge Conference, Tempe, Arizona, 4-8 March 2019</source>
          ,
          <year>2019</year>
          .
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>R.</given-names>
            <surname>Smith</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rixner</surname>
          </string-name>
          .
          <article-title>The error landscape: Characterizing the mistakes of novice programmers</article-title>
          .
          <source>In Proceedings of the 50th ACM Technical Symposium on Computer Science Education</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>