Immediate Data-Driven Positive Feedback Increases Engagement on Programming Homework for Novices

Samiha Marwan (samarwan@ncsu.edu), Thomas W. Price (twprice@ncsu.edu), Min Chi (mchi@ncsu.edu), Tiffany Barnes (tmbarnes@ncsu.edu)
North Carolina State University, Raleigh, NC, USA

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Learning theories and psychological research show that positive feedback during practice can increase learners' motivation, and correlates with their learning. In our prior work, we built a system that provides immediate positive feedback using expert-authored features, and found a promising impact on students' performance and engagement with the system. However, scaling this expert-feedback system to new programming tasks requires extensive human effort. In this paper, we present a system that provides automated, data-driven, immediate positive feedback (DD-IPF) to students while programming. This system uses a data-driven feature detector that automatically detects feature completion in the current student's code, based on features learned from historical student data. To explore the impact of DD-IPF on students' programming behavior, we performed a quasi-experimental study across two semesters in a block-based programming class. Our results showed that students with DD-IPF were more engaged, as measured by time spent on the programming task, and also showed marginal improvement in their grades, compared to students in a prior semester solving the same task without feedback. This suggests that positive feedback based on data-driven feature detection can provide benefits in student engagement and performance. We conclude with design recommendations for data-driven programming feedback systems.

1. INTRODUCTION
Intelligent tutoring systems (ITSs) are software systems that have been shown to improve student performance and learning by providing individualized feedback to each student, as human tutors do [26]. One particularly effective form of human tutor feedback is positive feedback, which can improve students' affective outcomes [5] and correlates with their learning [10]. Mitrovic et al. noted that ITSs "that teach primarily by addressing errors and misconceptions might become yet more helpful if extended with a positive feedback capability" [18]. However, there are two barriers to using positive feedback effectively in ITSs.

First, little work has empirically evaluated the effect of positive feedback in ITSs, especially in traditional classrooms, and it is unclear how it will affect students in practice. On one hand, one might think that students who have already correctly completed a step do not need additional feedback, since they are already on the right track. On the other hand, considering that novices have not yet established their programming understanding, it might be particularly beneficial for them to get positive reinforcement when they create a correct step. Novices have a range of prior knowledge and confidence levels, and unexpected results in the programming environment may cause students to question their knowledge. Providing students with feedback that confirms their correct steps could therefore reduce their uncertainty [18], and this may also improve their performance [17], but further empirical evaluations are needed to test this. Second, positive feedback is rarely present in current programming ITSs, since correct steps can be difficult to detect in programming problems, due to the huge space of possible solution strategies and the sparse, diverse data available from students.

In this work, we present a system that provides data-driven immediate positive feedback (DD-IPF) to novice students while programming.
In this system, we combine automated data-driven feature detection with low-effort expert human labelling to provide high-quality data-driven feedback. To do so, the DD-IPF system uses a data-driven feature detector algorithm that learns common code structures (features) present in correct solutions from prior student data, and then detects when these features are completed or broken in a student's code. We combined these features into a set of objectives with meaningful labels designed by human experts. We integrated our DD-IPF system into a block-based programming environment. As shown in Figure 1, the interface of our DD-IPF system has two components: a progress panel that displays human labels of the data-driven features of a programming task and shows when they are completed, and pop-up messages triggered by the completion of these features or by a lack of progress.

We performed a quasi-experimental pilot study across two semesters to investigate our primary research question: How does the data-driven immediate positive feedback system impact student engagement and performance while programming? Through log data analysis, we found that students who received DD-IPF spent significantly more time with the system, and had somewhat better performance than students who did not work with the DD-IPF system. We argue that this increase in time shows that students were more engaged and worked more on the programming assignment, which had historically seen low engagement, and that this resulted in an increase in students' scores.

In summary, this work makes contributions to educational data mining and computing education through: (1) an approach that combines a data-driven feature detector with human labelling to provide data-driven immediate positive feedback (DD-IPF), (2) a controlled quasi-experimental study that suggests that DD-IPF in classrooms can increase students' engagement with the programming environment and has the potential to improve their performance as well, and (3) recommendations on the design of data-driven feedback systems for programming.
2. RELATED WORK ON POSITIVE FEEDBACK
Intelligent tutoring systems are developed to adopt human strategies to improve students' learning, especially through adaptive feedback [9, 27, 12]. For example, positive feedback is given to students when they complete a problem-solving step appropriately, e.g. "Good move" [3, 10, 12]. Empirical studies of human tutoring dialogues show that positive feedback occurs eight times as often as negative feedback [10], and correlates with learning [13, 10, 6, 7]. More importantly, human tutors find positive feedback to be an effective motivational strategy that improves student confidence [16]. In programming, while various learning environments provide feedback through compiler messages [22, 4], error detectors [1, 25], autograders [2, 14], or hints [23, 20], far less work has been devoted to integrating features that detect students' correct moves and provide positive feedback as human tutors do.

To our knowledge, only two studies have focused on automated positive feedback for programming: Fossati et al.'s iList tutor for data structures [12], and Mitrovic et al.'s SQL tutor for databases [18]. In the iList tutor, Fossati et al. provided students with positive feedback by calculating the goodness and uncertainty of students' moves. The authors detected a good move if it improved the student's probability of reaching the correct solution, and detected uncertainty if a student spent more time at a given point than prior students did. If a student's move is detected as both good and uncertain, iList provides positive feedback. Fossati et al. found that the iList tutor with positive feedback improved learning, and students liked it more than iList without positive feedback. The SQL tutor is a constraint-based tutor using a knowledge base of 700 constraints. When a student submits their solution, the detected satisfied constraints can trigger positive feedback to students. An evaluation of the SQL tutor showed that positive feedback helped students master skills in less time. However, in both the iList and SQL tutor studies, positive feedback is just one of several supports provided, and the studies do not attempt to separate the impact of the positive feedback from the overall system.

In addition, the timing of positive feedback can be important for retention. The SQL tutor, and programming autograders like Lambda for Snap! [2], provide positive feedback only when students submit their code, and usually this feedback indicates which constraints or test cases are satisfied. This means that students who become discouraged or confused during programming may never receive any positive feedback at all. While prior work shows that immediate feedback allows students to finish problems quickly [8], there is less evidence on the impact of immediate positive feedback on students in programming. Our DD-IPF system continuously and adaptively confirms when students complete (or break) meaningful objectives through its progress panel. Our system also includes personalized pop-up messages addressing the user as "you" and tailored to our student population, since personalization is key in effective human tutoring dialogues [5, 10] and has been shown to improve novices' learning [15, 19].

3. DATA-DRIVEN FEATURE DETECTOR (DDFD) ALGORITHM
In our prior work [29], we introduced a data-driven feature detector (DDFD) algorithm for block-based programming exercises. In this context, a feature corresponds to a meaningful objective or property of a correct solution. In brief, the algorithm works as follows. First, it learns common features from existing, correct student solutions. Each feature is represented as a set of code blocks, referred to as a code shape. For example, using a 'pen down' block followed by a 'move' block is a necessary feature for any drawing task. Second, the algorithm detects the presence or absence of each feature in each student's code, generating a boolean array that represents the code's feature state. For example, if the algorithm learns four features for a given exercise, and for a given student's code it detects only the first and last features, then the algorithm will output {1, 0, 0, 1} as the code's feature state.
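To make the detection step concrete, the sketch below shows one minimal way a feature-state computation could be implemented. It is an illustration under simplifying assumptions: a code shape is reduced to an ordered tuple of block names, and the function names (contains_shape, feature_state) are ours; the actual DDFD algorithm [29] operates on Snap! abstract syntax trees rather than flat block lists.

```python
# Illustrative sketch of the DDFD detection step (not the authors' implementation).
# A "code shape" is approximated as an ordered tuple of block names that must
# appear, in order, somewhere in the student's flattened script.

def contains_shape(blocks, shape):
    """Return True if the block names in `shape` occur in order within `blocks`."""
    i = 0
    for block in blocks:
        if block == shape[i]:
            i += 1
            if i == len(shape):
                return True
    return False

def feature_state(blocks, shapes):
    """Boolean array marking which learned features are present in the code."""
    return [1 if contains_shape(blocks, s) else 0 for s in shapes]

# Example with four learned features, of which only the first and last are detected,
# reproducing the {1, 0, 0, 1} feature state described above.
shapes = [("pen down", "move"), ("repeat", "repeat"), ("move", "turn"), ("change", "repeat")]
student_blocks = ["pen down", "move", "change", "repeat", "move"]
print(feature_state(student_blocks, shapes))   # [1, 0, 0, 1]
```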
This DDFD algorithm motivated us to build a system that provides immediate positive feedback and can be easily scaled to various programming tasks, for three reasons. First, the DDFD algorithm can detect features at any stage of a student's code, whether the code is complete or incomplete, and can therefore provide immediate feedback. Second, the presence of features represents student progress towards a correct solution, and can therefore support positive feedback. Third, it is designed to generate features for a variety of programming tasks, as long as historical student data exists, making it scalable to various contexts and tasks. However, the DDFD algorithm suffers from two limitations that make it hard to deploy in practice. First, because features are code shapes that are generated automatically, they are not labelled with meaningful names that help students understand how they are making progress. Second, the generated features are too specific and may need to be further clustered into a smaller set of features, to limit their number and allow more concrete positive feedback. In the next section, we describe how we took advantage of the DDFD algorithm and addressed its limitations to develop a system that provides data-driven immediate positive feedback to students while programming. Rather than being fully data-driven or fully expert-authored, this system applies a small amount of expert curation to a largely data-driven process to achieve scalable, higher-quality results.

4. DATA-DRIVEN IMMEDIATE POSITIVE FEEDBACK (DD-IPF) SYSTEM
To build the DD-IPF system, we first had to apply the DDFD algorithm to a new dataset and overcome DDFD's two main limitations: unlabeled features, and too many features to present to students. To do so, we first used three semesters of previous correct student solutions to generate a set of code shapes (features) of correct solutions for a programming exercise (described in Section 5.2). As expected (and as shown in the first column of Table 1), each of the seven generated features was too narrow to be used as an objective on its own (e.g. F1: create a procedure), and the features lacked labels that could break the larger task down into smaller, meaningful objectives. Therefore, the first author combined these seven features into four, and provided each with a label. The last author briefly reviewed the combination and suggested a few minor wording changes.
Table 1 shows the mapping of the seven data-driven features to four meaningful objectives with human labels. The features were combined, ordered, and labeled based on their relevance and importance. Our goal was to make each label clear, meaningful, and concrete. Note that not every aspect of the data-driven features is reflected in the objective labels, so there is not a direct correspondence between the objective label shown to students and the process that the detector applies to detect it. We decided to limit the number of objectives to four, and judged that it was more important to have independent objectives that students could understand than to tell students exactly which constructs make up each objective. As we discuss in more detail later, listing just a few objectives necessarily leaves out information; no algorithm can correctly detect every possible solution; and the detectors encode more complex information than novice programmers can understand. Given these important and inherent limitations of automated feedback, we felt clarity and brevity were of high value.

Table 1: Generated data-driven features and their corresponding objectives and human labels.
Objective 1 = F1 + F2: "Make a Squiral custom block and use it in your code."
  F1. Create a procedure OR a 'ReceiveGo' block.
  F2. Add the procedure on the stage.
Objective 2 = F3: "The Squiral custom block rotates the correct number of times."
  F3. Have a 'multiply' block with a variable in a 'repeat' block OR two nested 'repeat' blocks.
Objective 3 = F4 + F5: "The length of each side of the Squiral is based on a variable."
  F4. Add a parameter in the procedure.
  F5. Add a variable in the 'move' block AND a variable in the 'repeat' block.
Objective 4 = F6 + F7: "The length of the Squiral increases with each side."
  F6. Have a 'move' and a 'turn' block inside a 'repeat' block AND a 'pen down' block.
  F7. Change a variable value inside a 'repeat' block.
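As a sketch of how this mapping could be applied at runtime, the snippet below combines a seven-element feature state into the four objectives of Table 1. It reflects our reading of the table ("+" meaning both features must be detected) rather than the system's actual implementation, and the alternative ("OR") clauses inside F1 and F3 are assumed to be resolved inside the detector itself.

```python
# Illustrative mapping from a 7-element DDFD feature state to the four labeled
# objectives of Table 1 (our reading of the table, not verified system code).

OBJECTIVES = {
    "Make a Squiral custom block and use it in your code":           (1, 2),  # F1 + F2
    "The Squiral custom block rotates the correct number of times":  (3,),    # F3
    "The length of each side of the Squiral is based on a variable": (4, 5),  # F4 + F5
    "The length of the Squiral increases with each side":            (6, 7),  # F6 + F7
}

def objective_state(feature_state):
    """Map a feature state (list of 0/1 values for F1..F7) to objective completion."""
    return {label: all(feature_state[f - 1] for f in fs) for label, fs in OBJECTIVES.items()}

# Example: F1, F2, and F3 detected; only Objectives 1 and 2 are marked complete.
print(objective_state([1, 1, 1, 0, 0, 0, 0]))
```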
We strove to design the interface of the DD-IPF system to help students track their progress, with the goal of improving their performance and increasing their motivation to complete the programming task. The DD-IPF system consists of two main features that continuously provide adaptive, positive feedback to students based on their code edits: 1) a progress panel and 2) pop-up messages, shown in Figure 1. These two components are designed together to comprise a positive feedback system for open-ended programming. The progress panel shows students the human labels of the data-driven features (objectives) for a task, and whether each is complete or broken, since prior research suggests that students who are uncertain often delete their correct code [11]. Initially, all the objectives are deactivated. Once an objective is detected to be completed, it provides positive feedback by turning green, but if it is detected to be broken, it turns red, as shown in the bottom right of Figure 1. The pop-up messages provide positive messages about students' accomplishments, whenever they complete an objective or fix a broken one, as shown in the top left of Figure 1. To increase students' motivation, the system also provides motivational pop-up messages if a student makes no progress for more than three minutes (a threshold based on instructors' feedback). We integrated the DD-IPF system into iSnap, a block-based programming environment that extends the Snap! programming environment [20].

Figure 1: The data-driven immediate positive feedback (DD-IPF) system integrated into iSnap, a block-based programming environment. The script area, where students add code blocks, is shown in the bottom-left (blue box); the pop-up message in the top-left (orange box); the stage, where students see their output, in the top-right (green box); and the progress panel in the bottom-right (yellow box).

We note that our current version of the DD-IPF system has a similar interface to our adaptive feedback system in [17], but the approach to generating positive feedback is different. Specifically, in our prior work we used expert-authored autograders to monitor students' progress instead of the data-driven feature detector algorithm presented in our current work (Section 3). This adaptation is important for scaling the system to more programming tasks.
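The behavior of these two components can be summarized by the event-handling sketch below: on each code edit the objective states are recomputed, newly completed objectives turn green and trigger a congratulatory pop-up, broken objectives turn red, and a timer triggers a motivational pop-up after three minutes without progress. This is a schematic reconstruction of the behavior described above, with hypothetical class, method, and message names; it is not the iSnap source code.

```python
# Schematic reconstruction of the DD-IPF feedback loop (hypothetical names and
# message wording; the real system is implemented inside the iSnap interface).
import time

NO_PROGRESS_SECONDS = 3 * 60    # "no progress for more than three minutes"

class ProgressPanel:
    def __init__(self, objective_labels):
        # Each objective starts deactivated, then turns green (complete) or red (broken).
        self.state = {label: "inactive" for label in objective_labels}
        self.last_progress = time.time()

    def on_code_edit(self, objective_state):
        """Called after every edit with {label: completed?} from the feature detector."""
        for label, completed in objective_state.items():
            previous = self.state[label]
            if completed and previous != "complete":
                self.state[label] = "complete"          # panel entry turns green
                self.show_popup(f"Nice work! You completed: {label}")
                self.last_progress = time.time()
            elif not completed and previous == "complete":
                self.state[label] = "broken"            # panel entry turns red

    def on_timer_tick(self):
        """Called periodically; sends a motivational pop-up if the student seems stuck."""
        if time.time() - self.last_progress > NO_PROGRESS_SECONDS:
            self.show_popup("Keep going! Check the progress panel for your next objective.")
            self.last_progress = time.time()

    def show_popup(self, message):
        print(message)                                  # stand-in for the iSnap pop-up UI
```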
5. CLASSROOM STUDY
Our goal in this study is to evaluate the impact of data-driven immediate positive feedback on students in a classroom setting. This study seeks to answer our primary research question: How does the data-driven immediate positive feedback system impact student engagement and performance while programming?

5.1 Population
The participants of this study were enrolled in two semesters of an introductory programming course for computer science non-majors, both taught by the same instructor at a large southeastern university in the United States. The Spring 2019 class had 48 students and the Spring 2020 class had 42 students. We adopted a quasi-experimental design, where the experimental (Spring 2020) group used iSnap with access to the DD-IPF system on one assignment (Squiral, described below), and the control (Spring 2019) group completed the same assignment in iSnap but without the DD-IPF system. Due to deployment issues, the first 15 students to complete the assignment (36%) in the Spring 2020 class did not use the DD-IPF system, and we therefore excluded them, resulting in 27 students in the experimental group. Since this exclusion may have biased our sample in the experimental group, we also removed the first 36% of students (17) in the Spring 2019 class, resulting in 33 students in the control group.

5.2 Procedure
This study took place during the second programming homework in the CS0 classroom. The programming task is called Squiral, and it asks students to create a method with one parameter, r, that draws a square-shaped spiral with r rotations. Figure 2 shows one possible solution to Squiral and its output. The instructor gave students one week to submit this homework. In both semesters, the programming environment provided students with access to on-demand next-step hints, which give students single edits that can possibly bring their code closer to a correct solution; this was independent of their access to the DD-IPF. Our prior evaluations of expert-authored positive feedback suggest that it provides complementary benefits to hints and does not conflict with them [17].

Figure 2: One possible solution to the Squiral programming exercise (left) and its output (right).
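For readers unfamiliar with the task, the sketch below gives a rough Python turtle analogue of the intended Squiral behavior. The actual assignment is completed with Snap! blocks (Figure 2 shows a block-based solution), and the specific starting length and increment used here are arbitrary choices for illustration.

```python
# Rough Python-turtle analogue of the Squiral task; the real assignment uses
# Snap! blocks, and the starting length and increment of 10 are illustrative only.
import turtle

def squiral(r):
    """Draw a square-shaped spiral with r rotations, growing the side length each side."""
    length = 10
    turtle.pendown()
    for _ in range(4 * r):        # four sides per rotation
        turtle.forward(length)
        turtle.right(90)
        length += 10              # the length of the Squiral increases with each side

squiral(5)
turtle.done()
```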
5.3 Results
In this section we report the impact of using our DD-IPF system on the time students took to finish the homework exercise and on the score of their submitted solution, as assessed by a rubric.

Time: We measured a student's total time from when they began to program to the time when they either successfully completed the programming task or submitted it incorrectly. We did this because some students who completed the task continued to work afterwards, and we did not include that additional time in their total time. We found that the average time (in minutes) spent by students in the experimental group (Med = 34.75; M = 42.29; IQR = 28.04) was much greater than that spent by the control group (Med = 13.54; M = 20.14; IQR = 17.81), as shown in Plot A of Figure 3. A t-test¹ shows that this difference is significant with a strong effect size (t(40.78) = 3.96; p < 0.01; Cohen's d = 1.08).

Score: To grade students' submissions, we used a rubric for the Squiral exercise created by researchers in prior work. This rubric consists of six items, each worth two points, for a total of 12 points. Two researchers in block-based programming graded students' submitted code across the two semesters². We found that the average score of students in the experimental group (Med = 91.7%; Mean = 86.4%; SD = 14.5%) was higher than that of the control group (Med = 83.3%; Mean = 76.9%; SD = 25.5%), as shown in Plot B of Figure 3. A Mann-Whitney U test³ shows that this difference is not significant, but has a medium effect size (p = 0.19; Cohen's d = 0.45).

¹ Our use of t-tests indicates the data was normally distributed.
² We note that we found all students submitted their code for grading.
³ Our use of non-parametric tests indicates the data was non-normal.

Figure 3: Boxplots comparing time (in minutes) to complete the task (left) and score percent (right) between the control and experimental groups.

6. DISCUSSION
In this section we discuss our primary research question: How does the data-driven immediate positive feedback system impact student engagement and performance while programming? We found that the DD-IPF system increased the amount of time students spent engaged with the programming homework, and we found suggestive evidence that it can improve students' programming performance as well.

In our study we found that students who used the DD-IPF system (experimental group) took more than double the time on average, compared to students in the control group, to complete their homework. While increasing time on task is sometimes considered a negative outcome (i.e. decreased learning efficiency), for this homework we believe that our results suggest that the DD-IPF system increased students' engagement with the assignment. Squiral is a challenging assignment for students, which should take more than the median 13.5 minutes spent by the control group to complete correctly. However, this task was a programming homework, where students were not observed and did not have easy access to an instructor for assistance. We found that of the 15 students in the control group who submitted their homework in less than the median time of 13.5 minutes, 11 (73.3%) had incorrect submissions. This suggests that many students in the control group spent too little time and submitted incomplete or incorrect work. Overall, the control group had lower performance (average grade 77%) than the experimental group (average grade 86%). By contrast, only 2 students in the experimental group spent less than 13.5 minutes. We hypothesize that the DD-IPF system helped keep students engaged by continuously informing students in the experimental group about their progress, such as how far they were from completing the homework, which might have motivated them to keep working to get all the objectives marked correct in the progress panel. However, we cannot directly investigate this hypothesis with the current study (e.g. by correlating student time and performance). For example, higher-performing students may take less time to complete the assignment (creating a negative correlation), even if any individual student may perform better by taking more time.

However, we note that some of the increase in students' time engaged with the assignment was not productive (e.g. 2 students spent over 90 minutes). Some of this may have resulted from errors in the data-driven feature detection. As shown in Table 1, a few of the data-driven features do not correspond well to a concrete assignment objective. For example, data-driven Feature 1 requires the use of a 'ReceiveGo' block (which can be used to start a script) to complete Objective 1; however, this is not necessary according to the instructions. It is simply a feature that most students used in the dataset that trained the DD-IPF system. Based on our manual investigation of the experimental group's log data, we found a number of times where the system mislabeled a student's progress. For example, some students finished the programming task, but not all the objectives were detected by the system. As a result, a few students kept working for more time despite having completed the task, which was not the case for students in the control group. We provide specific case studies of instances in which the DD-IPF system provided incorrect feedback, and of how students responded, in [24].

We also found that students with the DD-IPF system may have had improved performance, since students in the experimental group achieved higher average scores (almost 9.5% higher) than those in the control group. We conclude that the DD-IPF system may have increased students' scores because it increased their working time in the programming environment. The ability of the progress panel to confirm correct steps might have increased students' motivation and reduced their uncertainty about their moves, leading to improved performance. These results are consistent with the "uncertainty reduction" hypothesis presented by Mitrovic et al. [18], which suggests that positive feedback helps students continue working because it reduces their uncertainty about their code edits.

While the availability of on-demand hints might have affected our results somewhat, students in both semesters had access to hints, so the differences between semesters were due primarily to the DD-IPF system. Interestingly, when we compared hint usage across the two semesters, we found suggestive evidence that the DD-IPF system might have increased students' hint usage. We found that the average number of hints requested by the experimental group (Mean = 17; Med = 11) was higher than that of the control group (Mean = 10; Med = 6.5). In addition, we found that the average percent of followed hints in the experimental group (Mean = 66.64%; Med = 70%) was significantly higher than that in the control group (Mean = 39.91%; Med = 38.3%). We hypothesize a few possible implications of these results. First, the progress panel may have motivated students to seek out and follow hints in order to achieve the incomplete objectives it showed. Second, students in the experimental group might have followed more hints because the progress panel helped them understand how hints relate to a specific objective. Lastly, students might have trusted the system more because the DD-IPF helped them see some intelligence and intentionality behind the system. This addresses a concern raised by prior work, which suggests that students do not trust automated feedback because they do not believe the system really understands what they are doing, or doubt its ability to offer useful help [21].
7. RECOMMENDATIONS FOR DATA-DRIVEN PROGRAMMING FEEDBACK
Based on our current and prior work, we have several recommendations for designing data-driven feedback for programming [29, 17, 24]. Because of the very large, sparse solution spaces for programming, we should always expect some inadequacies in any data-driven features or detectors for programs. For example, in our current dataset we found cases where the DD-IPF system incorrectly detected the completion of an objective, and others where the student completed an objective but it was not detected. These flaws point to a tradeoff between the more easily generated, data-driven features used in this work and the expert-authored features that we used in our prior work [17]. A data-driven positive feedback system is likely to learn less generalizable features from prior students, and it still needs to be labeled and curated for presentation, but it can be scaled to various programming exercises with less human effort. In contrast, expert-authored feedback can be designed to have understandable and generalizable features, but it is hard to scale to various programming exercises and requires extensive time and effort to create autograders (e.g. with static analysis [17, 28, 2]). In this work we combined both approaches, using data-driven feature detection to account for the diversity of student solutions, and a human labeling process to make the features more general, independent, and understandable.

We recommend that data-driven feedback approaches be designed iteratively, learning and implementing an initial set of feature detectors from a given dataset, and iteratively detecting new features after each new dataset is added. For example, despite extracting features from over 100 prior student solutions to the Squiral programming exercise (which can be solved in 7-11 lines of code), in this study we found two students who used a completely new strategy, not represented in the prior data. In particular, one of these two students created a procedure (i.e. a custom block) to draw a single side of a Squiral and called this procedure in an inner loop of the main procedure; as a result, the DD-IPF system only detected the completion of the two objectives that matched the features it had learned from prior students' data, as shown in Figure 4. This behavior should be expected and planned for, since the process of providing automated data-driven feedback is inherently uncertain. To mitigate this, we propose combining iterative cycles of data-driven feature detection with expert authoring to achieve the best of both worlds: building a system that can intelligently address the diverse but correct ways that students solve problems (fitting correct prior solutions), while benefiting from human expertise in communication (labeling the objectives). There is also a need to investigate how and when to communicate the fact that feature detection will always be imperfect for open-ended tasks in programming. We plan to explore ways to explain how the system works to future students, both to promote learning and to mitigate potential harms from incorrect feedback.

Figure 4: A novel student solution with a new and correct, but undetected, strategy, where only 2 objectives were detected (see the progress panel on the right).

We also recommend learning features from student data in an offline, section-by-section fashion, grouping students who took the same course with the same instructor at the same time. Objective features should be reviewed and edited to be as independent as possible, and labeled so that students can understand what each one means. Note, however, that we do not believe that every feature detected must be fully explained by experts. Detectors should be trained on prior data and cross-validated with testing and training groups from different sections. This is because we do not anticipate being able to generate highly accurate data-driven feature detectors that could be added to the system without expert review. Therefore, validation should be conducted to match the way our system is used in practice: trained on one dataset and used in a separate, later section. Furthermore, instructors can have a large impact on how students approach problems, so each section of a class is very likely to differ significantly due to that factor alone.

8. LIMITATIONS & CONCLUSION
This study has four primary limitations. First, this was a quasi-experimental study, and therefore there might be other differences between semesters that affected our results. However, when we compared the students excluded from both semesters (due to the deployment problems mentioned in Section 5.1), we found that the time taken by the excluded students (the first 36% to complete or submit the programming task) in Spring 2019, as well as their scores, were very close to those of the excluded (first 36%) students in Spring 2020. This suggests that the differences we found in our study results were likely due to the DD-IPF system, rather than inherent differences between semesters. The second limitation is that, due to the structure of the programming course, we were not able to measure learning with pre/post tests. We hypothesize that the DD-IPF system can improve learning, since it provides immediate feedback that confirms that the programming steps a student just completed are those that contribute to the specific objective. The third limitation is that we evaluated the DD-IPF system on only one programming homework. The second and third limitations, however, are somewhat addressed in our prior work [17], making us optimistic that they can be successfully addressed in future studies. Our prior work shows that, when compared to students in a control group with no positive feedback, students who used the expert-authored positive feedback system performed better on two tasks when they had access to the feedback, and continued to perform better on a third, more difficult task without positive feedback [17]. Finally, we acknowledge that our DD-IPF system includes features other than positive feedback: it breaks down the programming task into smaller objectives and provides corrective feedback when objectives are broken. Our study presents the results of the whole system, but we argue that the most salient aspect of the system was its focus on providing immediate positive feedback during problem solving. In our future work, we hope to conduct a larger-scale study with different treatment groups to evaluate individual features of the DD-IPF system.

To conclude, we developed a system that combines a data-driven feature detector with human labelling to provide data-driven immediate positive feedback (DD-IPF) in a block-based programming environment. We conducted a quasi-experimental classroom study to evaluate the impact of DD-IPF on students working on a programming homework assignment. We found evidence that the DD-IPF system increased students' engagement with the programming task, and that it has the potential to improve students' programming performance. We also provided recommendations to computing education researchers on how to design better data-driven feedback systems. In our future work, we plan to improve the accuracy of the DD-IPF system, test it across several programming exercises, and evaluate its impact on students' cognitive and affective outcomes. We also plan to research ways to counteract the inadequacies of automated feedback with interface design and opportunities for self-explanation prompts, to promote user trust and learning.

9. ACKNOWLEDGEMENTS
This material is based upon work supported by the National Science Foundation under grant 1623470. The authors would also like to thank Preya Shabrina for her help with data analysis.

10. REFERENCES
[1] J. R. Anderson, A. T. Corbett, K. R. Koedinger, and R. Pelletier. Cognitive tutors: Lessons learned. The Journal of the Learning Sciences, 4(2):167-207, 1995.
[2] M. Ball. Lambda: An autograder for Snap. Master's thesis, EECS Department, University of California, Berkeley, 2018.
[3] D. Barrow, A. Mitrovic, S. Ohlsson, and M. Grimley. Assessing the impact of positive feedback in constraint-based tutors. In International Conference on Intelligent Tutoring Systems, pages 250-259. Springer, 2008.
[4] B. A. Becker, G. Glanville, R. Iwashima, C. McDonnell, K. Goslin, and C. Mooney. Effective compiler error message enhancement for novice programming students. Computer Science Education, 26(2-3):148-175, 2016.
[5] K. E. Boyer, R. Phillips, M. D. Wallis, M. A. Vouk, and J. C. Lester. Learner characteristics and feedback in tutorial dialogue. In Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications, pages 53-61. Association for Computational Linguistics, 2008.
[6] W. L. Cade, J. L. Copeland, N. K. Person, and S. K. D'Mello. Dialogue modes in expert tutoring. In International Conference on Intelligent Tutoring Systems, pages 470-479. Springer, 2008.
[7] L. Chen, B. Di Eugenio, D. Fossati, S. Ohlsson, and D. Cosejo. Exploring effective dialogue act sequences in one-on-one computer science tutoring dialogues. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications, pages 65-75. Association for Computational Linguistics, 2011.
[8] A. Corbett and J. R. Anderson. Locus of feedback control in computer-based tutoring: Impact on learning rate, achievement and attitudes. In Proceedings of the SIGCHI Conference on Human Computer Interaction, pages 245-252, 2001.
[9] B. Di Eugenio, D. Fossati, S. Haller, D. Yu, and M. Glass. Be brief, and they shall learn: Generating concise language feedback for a computer tutor. International Journal of Artificial Intelligence in Education, 18(4):317-345, 2008.
[10] B. Di Eugenio, D. Fossati, S. Ohlsson, and D. Cosejo. Towards explaining effective tutorial dialogues. In Annual Meeting of the Cognitive Science Society, pages 1430-1435, 2009.
[11] Y. Dong, S. Marwan, V. Catete, T. Price, and T. Barnes. Defining tinkering behavior in open-ended block-based programming assignments. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education, pages 1204-1210. ACM, 2019.
[12] D. Fossati, B. Di Eugenio, S. Ohlsson, C. Brown, and L. Chen. Data driven automatic feedback generation in the iList intelligent tutoring system. Technology, Instruction, Cognition and Learning, 10(1):5-26, 2015.
[13] D. Fossati, B. Di Eugenio, S. Ohlsson, C. W. Brown, L. Chen, D. G. Cosejo, et al. I learn from you, you learn from me: How to make iList learn from students. In AIED, pages 491-498, 2009.
[14] D. E. Johnson. Itch: Individual testing of computer homework for Scratch assignments. In Proceedings of the 47th ACM Technical Symposium on Computing Science Education, pages 223-227. ACM, 2016.
[15] M. J. Lee and A. J. Ko. Personifying programming tool feedback improves novice programmers' learning. In Proceedings of the Seventh International Workshop on Computing Education Research, pages 109-116. ACM, 2011.
[16] M. R. Lepper, M. Woolverton, D. L. Mumme, and J. Gurtner. Motivational techniques of expert human tutors: Lessons for the design of computer-based tutors. Computers as Cognitive Tools, 1993:75-105, 1993.
[17] S. Marwan, G. Gao, S. Fisk, T. W. Price, and T. Barnes. Adaptive immediate feedback can improve novice programming engagement and intention to persist in computer science. In Proceedings of the International Computing Education Research Conference (forthcoming), 2020.
[18] A. Mitrovic, S. Ohlsson, and D. K. Barrow. The effect of positive feedback in a constraint-based intelligent tutoring system. Computers & Education, 60(1):264-272, 2013.
[19] R. Moreno and R. E. Mayer. Personalized messages that promote science learning in virtual environments. Journal of Educational Psychology, 96(1):165, 2004.
[20] T. W. Price, Y. Dong, and D. Lipovac. iSnap: Towards intelligent tutoring in novice programming environments. In Proceedings of the ACM Technical Symposium on Computer Science Education, 2017.
[21] T. W. Price, Z. Liu, V. Catete, and T. Barnes. Factors influencing students' help-seeking behavior while programming with human and computer tutors. In Proceedings of the International Computing Education Research Conference, 2017.
[22] P. C. Rigby and S. Thompson. Study of novice programmers using Eclipse and Gild. In Proceedings of the 2005 OOPSLA Workshop on Eclipse Technology eXchange, pages 105-109. ACM, 2005.
[23] K. Rivers and K. R. Koedinger. Data-driven hint generation in vast solution spaces: A self-improving Python programming tutor. International Journal of Artificial Intelligence in Education, 27(1):37-64, 2017.
[24] P. Shabrina, S. Marwan, T. W. Price, M. Chi, and T. Barnes. The impact of data-driven positive programming feedback: When it helps, what happens when it goes wrong, and how students respond. In Educational Data Mining in Computer Science Education (CSEDM) Workshop @ EDM'20, 2020.
[25] D. Sleeman, A. E. Kelly, R. Martinak, R. D. Ward, and J. L. Moore. Studies of diagnosis and remediation with high school algebra students. Cognitive Science, 13(4):551-568, 1989.
[26] K. VanLehn. The behavior of tutoring systems. International Journal of Artificial Intelligence in Education, 16:227-265, 2006.
[27] K. VanLehn. The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4):197-221, 2011.
[28] W. Wang, R. Zhi, A. Milliken, N. Lytle, and T. W. Price. Crescendo: Engaging students to self-paced programming practices. In Proceedings of the ACM Technical Symposium on Computer Science Education, 2020.
[29] R. Zhi, T. W. Price, N. Lytle, and T. Barnes. Reducing the state space of programming problems through data-driven feature detection. In Educational Data Mining in Computer Science Education (CSEDM) Workshop @ EDM'18, 2018.