Immediate Data-Driven Positive Feedback Increases Engagement on Programming Homework for Novices

Samiha Marwan (samarwan@ncsu.edu), Thomas W. Price (twprice@ncsu.edu), Min Chi (mchi@ncsu.edu), Tiffany Barnes (tmbarnes@ncsu.edu)
North Carolina State University, Raleigh, NC, USA

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Learning theories and psychological research show that positive feedback during practice can increase learners' motivation, and correlates with their learning. In our prior work, we built a system that provides immediate positive feedback using expert-authored features, and found a promising impact on students' performance and engagement with the system. However, scaling this expert-feedback system to new programming tasks requires extensive human effort. In this paper, we present a system that provides automated, data-driven, immediate positive feedback (DD-IPF) to students while programming. This system uses a data-driven feature detector that automatically detects feature completion in the current student's code, based on features learned from historical student data. To explore the impact of DD-IPF on students' programming behavior, we performed a quasi-experimental study across two semesters in a block-based programming class. Our results showed that students with DD-IPF were more engaged, as measured by time spent on the programming task, and also showed marginal improvement in their grades, compared to students in a prior semester solving the same task without feedback. This suggests that positive feedback based on data-driven feature detection can provide benefits in student engagement and performance. We conclude with design recommendations for data-driven programming feedback systems.

1. INTRODUCTION
Intelligent tutoring systems (ITSs) are software systems that have been shown to improve student performance and learning by providing individualized feedback to each student, as human tutors do [26]. One particularly effective form of human tutor feedback is positive feedback, which can improve students' affective outcomes [5] and correlates with their learning [10]. Mitrovic et al. noted that ITSs "that teach primarily by addressing errors and misconceptions might become yet more helpful if extended with a positive feedback capability" [18]. However, there are two barriers to using positive feedback effectively in ITSs.

First, little work has empirically evaluated the effect of positive feedback in ITSs, especially in traditional classrooms, and it is unclear how it will affect students in practice. On one hand, one might think that students who have already correctly completed a step do not need additional feedback, since they are already on the right track. On the other hand, considering that novices have not yet established their programming understanding, it might be particularly beneficial for them to get positive reinforcement when they create a correct step. Novices have a range of prior knowledge and confidence levels, and unexpected results in the programming environment may cause students to question their knowledge. Providing students with feedback that confirms their correct steps could therefore reduce their uncertainty [18], and this may also improve their performance [17], but further empirical evaluations are needed to test this. Second, positive feedback is rarely present in current programming ITSs, since correct steps can be difficult to detect in programming problems, due to the huge space of possible solution strategies and the sparse, diverse data available from students.

In this work, we present a system that provides data-driven immediate positive feedback (DD-IPF) to novice students while programming.
In this system, we combine automated data-driven feature detection with low-effort expert human labelling to provide high-quality data-driven feedback. To do so, the DD-IPF system uses a data-driven feature detector algorithm that learns common code structures (features) present in correct solutions from prior student data, and then detects when these features are completed or broken in a student's code. We combined these features into a set of objectives with meaningful labels designed by human experts. We integrated our DD-IPF system into a block-based programming environment. As shown in Figure 1, the interface of our DD-IPF system has two components: a progress panel that displays human labels of the data-driven features of a programming task and shows when they are completed, and pop-up messages triggered by the completion of these features or by a lack of progress.

We performed a quasi-experimental pilot study across two semesters to investigate our primary research question: How does the data-driven immediate positive feedback system impact student engagement and performance while programming? Through log data analysis, we found that students who received DD-IPF spent significantly more time with the system, and had somewhat better performance than students who did not work with the DD-IPF system. We argue that this increase in time shows that students were more engaged and worked more on the programming assignment, which had historically seen low engagement, and that this resulted in an increase in students' scores.

In summary, this work makes contributions to educational data mining and computing education through: (1) an approach that combines a data-driven feature detector with human labelling to provide data-driven immediate positive feedback (DD-IPF), (2) a controlled quasi-experimental study that suggests that DD-IPF in classrooms can increase students' engagement with the programming environment and has the potential to improve their performance as well, and (3) recommendations on the design of data-driven feedback systems for programming.
2. RELATED WORK ON POSITIVE FEEDBACK
Intelligent tutoring systems are developed to adopt human strategies to improve students' learning, especially through adaptive feedback [9, 27, 12]. For example, positive feedback is given to students when they complete a problem-solving step appropriately, e.g. "Good move" [3, 10, 12]. Empirical studies of human tutoring dialogues show that positive feedback occurs eight times as often as negative feedback [10], and correlates with learning [13, 10, 6, 7]. More importantly, human tutors find positive feedback to be an effective motivational strategy that improves student confidence [16]. In programming, while various learning environments provide feedback through compiler messages [22, 4], error detectors [1, 25], autograders [2, 14], or hints [23, 20], far less work has been devoted to integrating features that detect students' correct moves and provide positive feedback as human tutors do.

To our knowledge, only two studies have focused on automated positive feedback for programming: Fossati et al.'s iList tutor for data structures [12], and Mitrovic et al.'s SQL tutor for databases [18]. In the iList tutor, Fossati et al. provided students with positive feedback by calculating the goodness and uncertainty of students' moves. The authors detected a good move if it improved the student's probability of reaching the correct solution, and detected uncertainty if a student spent more time at a given point than prior students did. If a student's move is detected as both good and uncertain, iList provides positive feedback. Fossati et al. found that the iList tutor with positive feedback improved learning, and students liked it more than iList without positive feedback. The SQL tutor is a constraint-based tutor using a knowledge base of 700 constraints. When a student submits their solution, the detected satisfied constraints can trigger positive feedback to students. An evaluation of the SQL tutor showed that positive feedback helped students master skills in less time. However, in both the iList and SQL tutor studies, positive feedback is just one of several supports provided, and the studies do not attempt to separate the impact of the positive feedback from the overall system.

In addition, the timing of positive feedback can be important for retention. The SQL tutor, and programming autograders like Lambda for Snap! [2], provide positive feedback only when students submit their code, and usually this feedback indicates which constraints or test cases are satisfied. This means that students who become discouraged or confused during programming may never receive any positive feedback at all. While prior work shows that immediate feedback allows students to finish problems quickly [8], there is less evidence on the impact of immediate positive feedback on students in programming. Our DD-IPF system continuously and adaptively confirms when students complete (or break) meaningful objectives through its progress panel. Our system also includes personalized pop-up messages addressing the user as "you" and tailored to our student population, since personalization is key in effective human tutoring dialogues [5, 10] and has been shown to improve novices' learning [15, 19].

3. DATA-DRIVEN FEATURE DETECTOR (DDFD) ALGORITHM
In our prior work [29], we introduced a data-driven feature detector (DDFD) algorithm for block-based programming exercises. In this context, a feature corresponds to a meaningful objective or property of a correct solution. In brief, the algorithm works as follows. First, it learns common features from existing, correct student solutions. Each feature is represented as a set of code blocks, referred to as a code shape. For example, using a 'pen down' block followed by a 'move' block is a necessary feature for any drawing task. Second, the algorithm detects the presence or absence of each feature in each student's code, generating a boolean array that represents the code's feature state. For example, if the algorithm learns four features for a given exercise, and for a given student's code it detects only the first and last features, then the algorithm will output {1, 0, 0, 1} as the code's feature state.
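To make the detection step concrete, the sketch below shows one minimal way a feature-state computation could be implemented. It is an illustration under simplifying assumptions: a code shape is reduced to an ordered tuple of block names, and the function names (contains_shape, feature_state) are ours; the actual DDFD algorithm [29] operates on Snap! abstract syntax trees rather than flat block lists.

```python
# Illustrative sketch of the DDFD detection step (not the authors' implementation).
# A "code shape" is approximated as an ordered tuple of block names that must
# appear, in order, somewhere in the student's flattened script.

def contains_shape(blocks, shape):
    """Return True if the block names in `shape` occur in order within `blocks`."""
    i = 0
    for block in blocks:
        if block == shape[i]:
            i += 1
            if i == len(shape):
                return True
    return False

def feature_state(blocks, shapes):
    """Boolean array marking which learned features are present in the code."""
    return [1 if contains_shape(blocks, s) else 0 for s in shapes]

# Example with four learned features, of which only the first and last are detected,
# reproducing the {1, 0, 0, 1} feature state described above.
shapes = [("pen down", "move"), ("repeat", "repeat"), ("move", "turn"), ("change", "repeat")]
student_blocks = ["pen down", "move", "change", "repeat", "move"]
print(feature_state(student_blocks, shapes))   # [1, 0, 0, 1]
```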
This DDFD algorithm motivated us to build a system that provides immediate positive feedback and can be easily scaled to various programming tasks, for three reasons. First, the DDFD algorithm can detect features at any stage of a student's code, whether the code is complete or incomplete, and can therefore provide immediate feedback. Second, the presence of features represents student progress towards a correct solution, and can therefore support positive feedback. Third, it is designed to generate features for a variety of programming tasks, as long as historical student data exists, making it scalable to various contexts and tasks. However, the DDFD algorithm suffers from two limitations that make it hard to deploy in practice. First, because features are code shapes that are generated automatically, they are not labelled with meaningful names that help students understand how they are making progress. Second, the generated features are too specific and may need to be further clustered into a smaller set of features, to limit their number and allow more concrete positive feedback. In the next section, we describe how we took advantage of the DDFD algorithm and addressed its limitations to develop a system that provides data-driven immediate positive feedback to students while programming. Rather than being fully data-driven or fully expert-authored, this system applies a small amount of expert curation to a largely data-driven process to achieve scalable, higher-quality results.

4. DATA-DRIVEN IMMEDIATE POSITIVE FEEDBACK (DD-IPF) SYSTEM
To build the DD-IPF system, we first had to apply the DDFD algorithm to a new dataset and overcome DDFD's two main limitations: unlabeled features, and too many features to present to students. To do so, we first used three semesters of previous correct student solutions to generate a set of code shapes (features) of correct solutions for a programming exercise (described in Section 5.2). As expected (and as shown in the first column of Table 1), each of the seven generated features was too narrow to be used as an objective on its own (e.g. F1: create a procedure), and the features lacked labels that could break the larger task down into smaller, meaningful objectives. Therefore, the first author combined these seven features into four, and provided each with a label. The last author briefly reviewed the combination and suggested a few minor wording changes.
Table 1 shows the mapping of the seven data-driven features to four meaningful objectives with human labels. The features were combined, ordered, and labeled based on their relevance and importance. Our goal was to make each label clear, meaningful, and concrete. Note that not every aspect of the data-driven features is reflected in the objective labels, so there is not a direct correspondence between the objective label shown to students and the process that the detector applies to detect it. We decided to limit the number of objectives to four, and judged that it was more important to have independent objectives that students could understand than to tell students exactly which constructs make up each objective. As we discuss in more detail later, listing just a few objectives necessarily leaves out information; no algorithm can correctly detect every possible solution; and the detectors encode more complex information than novice programmers can understand. Given these important and inherent limitations of automated feedback, we felt clarity and brevity were of high value.

Table 1: Generated data-driven features and their corresponding objectives and human labels.
Objective 1 = F1 + F2: "Make a Squiral custom block and use it in your code."
  F1. Create a procedure OR a 'ReceiveGo' block.
  F2. Add the procedure on the stage.
Objective 2 = F3: "The Squiral custom block rotates the correct number of times."
  F3. Have a 'multiply' block with a variable in a 'repeat' block OR two nested 'repeat' blocks.
Objective 3 = F4 + F5: "The length of each side of the Squiral is based on a variable."
  F4. Add a parameter in the procedure.
  F5. Add a variable in the 'move' block AND a variable in the 'repeat' block.
Objective 4 = F6 + F7: "The length of the Squiral increases with each side."
  F6. Have a 'move' and a 'turn' block inside a 'repeat' block AND a 'pen down' block.
  F7. Change a variable value inside a 'repeat' block.
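As a sketch of how this mapping could be applied at runtime, the snippet below combines a seven-element feature state into the four objectives of Table 1. It reflects our reading of the table ("+" meaning both features must be detected) rather than the system's actual implementation, and the alternative ("OR") clauses inside F1 and F3 are assumed to be resolved inside the detector itself.

```python
# Illustrative mapping from a 7-element DDFD feature state to the four labeled
# objectives of Table 1 (our reading of the table, not verified system code).

OBJECTIVES = {
    "Make a Squiral custom block and use it in your code":           (1, 2),  # F1 + F2
    "The Squiral custom block rotates the correct number of times":  (3,),    # F3
    "The length of each side of the Squiral is based on a variable": (4, 5),  # F4 + F5
    "The length of the Squiral increases with each side":            (6, 7),  # F6 + F7
}

def objective_state(feature_state):
    """Map a feature state (list of 0/1 values for F1..F7) to objective completion."""
    return {label: all(feature_state[f - 1] for f in fs) for label, fs in OBJECTIVES.items()}

# Example: F1, F2, and F3 detected; only Objectives 1 and 2 are marked complete.
print(objective_state([1, 1, 1, 0, 0, 0, 0]))
```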
We strove to design the interface of the DD-IPF system to help students track their progress, with the goal of improving their performance and increasing their motivation to complete the programming task. The DD-IPF system consists of two main features that continuously provide adaptive, positive feedback to students based on their code edits: 1) a progress panel and 2) pop-up messages, shown in Figure 1. These two components are designed together to comprise a positive feedback system for open-ended programming. The progress panel shows students the human labels of the data-driven features (objectives) for a task, and whether each is complete or broken, since prior research suggests that students who are uncertain often delete their correct code [11]. Initially, all the objectives are deactivated. Once an objective is detected to be completed, it provides positive feedback by turning green, but if it is detected to be broken, it turns red, as shown in the bottom right of Figure 1. The pop-up messages provide positive messages about students' accomplishments, whenever they complete an objective or fix a broken one, as shown in the top left of Figure 1. To increase students' motivation, the system also provides motivational pop-up messages if a student makes no progress for more than three minutes (a threshold based on instructors' feedback). We integrated the DD-IPF system into iSnap, a block-based programming environment that extends the Snap! programming environment [20].

Figure 1: The data-driven immediate positive feedback (DD-IPF) system integrated into iSnap, a block-based programming environment. The script area, where students add code blocks, is shown in the bottom-left (blue box); the pop-up message in the top-left (orange box); the stage, where students see their output, in the top-right (green box); and the progress panel in the bottom-right (yellow box).

We note that our current version of the DD-IPF system has a similar interface to our adaptive feedback system in [17], but the approach to generating positive feedback is different. Specifically, in our prior work we used expert-authored autograders to monitor students' progress instead of the data-driven feature detector algorithm presented in our current work (Section 3). This adaptation is important for scaling the system to more programming tasks.
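The behavior of these two components can be summarized by the event-handling sketch below: on each code edit the objective states are recomputed, newly completed objectives turn green and trigger a congratulatory pop-up, broken objectives turn red, and a timer triggers a motivational pop-up after three minutes without progress. This is a schematic reconstruction of the behavior described above, with hypothetical class, method, and message names; it is not the iSnap source code.

```python
# Schematic reconstruction of the DD-IPF feedback loop (hypothetical names and
# message wording; the real system is implemented inside the iSnap interface).
import time

NO_PROGRESS_SECONDS = 3 * 60    # "no progress for more than three minutes"

class ProgressPanel:
    def __init__(self, objective_labels):
        # Each objective starts deactivated, then turns green (complete) or red (broken).
        self.state = {label: "inactive" for label in objective_labels}
        self.last_progress = time.time()

    def on_code_edit(self, objective_state):
        """Called after every edit with {label: completed?} from the feature detector."""
        for label, completed in objective_state.items():
            previous = self.state[label]
            if completed and previous != "complete":
                self.state[label] = "complete"          # panel entry turns green
                self.show_popup(f"Nice work! You completed: {label}")
                self.last_progress = time.time()
            elif not completed and previous == "complete":
                self.state[label] = "broken"            # panel entry turns red

    def on_timer_tick(self):
        """Called periodically; sends a motivational pop-up if the student seems stuck."""
        if time.time() - self.last_progress > NO_PROGRESS_SECONDS:
            self.show_popup("Keep going! Check the progress panel for your next objective.")
            self.last_progress = time.time()

    def show_popup(self, message):
        print(message)                                  # stand-in for the iSnap pop-up UI
```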
5. CLASSROOM STUDY
Our goal in this study is to evaluate the impact of data-driven immediate positive feedback on students in a classroom setting. This study seeks to answer our primary research question: How does the data-driven immediate positive feedback system impact student engagement and performance while programming?

5.1 Population
The participants of this study were enrolled in two semesters of an introductory programming course for computer science non-majors, both taught by the same instructor at a large southeastern university in the United States. The Spring 2019 class had 48 students and the Spring 2020 class had 42 students. We adopted a quasi-experimental design, where the experimental (Spring 2020) group used iSnap with access to the DD-IPF system on one assignment (Squiral, described below), and the control (Spring 2019) group completed the same assignment in iSnap but without the DD-IPF system. Due to deployment issues, the first 15 students to complete the assignment (36%) in the Spring 2020 class did not use the DD-IPF system, and we therefore excluded them, resulting in 27 students in the experimental group. Since this exclusion may have biased our sample in the experimental group, we also removed the first 36% of students (17) in the Spring 2019 class, resulting in 33 students in the control group.

5.2 Procedure
This study took place during the second programming homework in the CS0 classroom. The programming task is called Squiral, and it asks students to create a method with one parameter, r, that draws a square-shaped spiral with r rotations. Figure 2 shows one possible solution to Squiral and its output. The instructor gave students one week to submit this homework. In both semesters, the programming environment provided students with access to on-demand next-step hints, which give students single edits that can possibly bring their code closer to a correct solution; this was independent of their access to the DD-IPF. Our prior evaluations of expert-authored positive feedback suggest that it provides complementary benefits to hints and does not conflict with them [17].

Figure 2: One possible solution to the Squiral programming exercise (left) and its output (right).
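For readers unfamiliar with the task, the sketch below gives a rough Python turtle analogue of the intended Squiral behavior. The actual assignment is completed with Snap! blocks (Figure 2 shows a block-based solution), and the specific starting length and increment used here are arbitrary choices for illustration.

```python
# Rough Python-turtle analogue of the Squiral task; the real assignment uses
# Snap! blocks, and the starting length and increment of 10 are illustrative only.
import turtle

def squiral(r):
    """Draw a square-shaped spiral with r rotations, growing the side length each side."""
    length = 10
    turtle.pendown()
    for _ in range(4 * r):        # four sides per rotation
        turtle.forward(length)
        turtle.right(90)
        length += 10              # the length of the Squiral increases with each side

squiral(5)
turtle.done()
```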
5.3 Results
In this section we report the impact of using our DD-IPF system on the time students took to finish the homework exercise and on the score of their submitted solution, as assessed by a rubric.

Time: We measured a student's total time from when they began to program to the time when they either successfully completed the programming task or submitted it incorrectly. We did this because some students who completed the task continued to work afterwards, and we did not include that additional time in their total time. We found that the average time (in minutes) spent by students in the experimental group (Med = 34.75; M = 42.29; IQR = 28.04) was much greater than that spent by the control group (Med = 13.54; M = 20.14; IQR = 17.81), as shown in Plot A of Figure 3. A t-test¹ shows that this difference is significant with a strong effect size (t(40.78) = 3.96; p < 0.01; Cohen's d = 1.08).

Score: To grade students' submissions, we used a rubric for the Squiral exercise created by researchers in prior work. This rubric consists of six items, each worth two points, for a total of 12 points. Two researchers in block-based programming graded students' submitted code across the two semesters². We found that the average score of students in the experimental group (Med = 91.7%; Mean = 86.4%; SD = 14.5%) was higher than that of the control group (Med = 83.3%; Mean = 76.9%; SD = 25.5%), as shown in Plot B of Figure 3. A Mann-Whitney U test³ shows that this difference is not significant, but has a medium effect size (p = 0.19; Cohen's d = 0.45).

¹ Our use of t-tests indicates the data was normally distributed.
² We note that we found all students submitted their code for grading.
³ Our use of non-parametric tests indicates the data was non-normal.

Figure 3: Boxplots comparing time (in minutes) to complete the task (left) and score percent (right) between the control and experimental groups.

6. DISCUSSION
In this section we discuss our primary research question: How does the data-driven immediate positive feedback system impact student engagement and performance while programming? We found that the DD-IPF system increased the amount of time students spent engaged with the programming homework, and we found suggestive evidence that it can improve students' programming performance as well.

In our study we found that students who used the DD-IPF system (experimental group) took more than double the time on average, compared to students in the control group, to complete their homework. While increasing time on task is sometimes considered a negative outcome (i.e. decreased learning efficiency), for this homework we believe that our results suggest that the DD-IPF system increased students' engagement with the assignment. Squiral is a challenging assignment for students, which should take more than the median 13.5 minutes spent by the control group to complete correctly. However, this task was a programming homework, where students were not observed and did not have easy access to an instructor for assistance. We found that of the 15 students in the control group who submitted their homework in less than the median time of 13.5 minutes, 11 (73.3%) had incorrect submissions. This suggests that many students in the control group spent too little time and submitted incomplete or incorrect work. Overall, the control group had lower performance (average grade 77%) than the experimental group (average grade 86%). By contrast, only 2 students in the experimental group spent less than 13.5 minutes. We hypothesize that the DD-IPF system helped keep students engaged by continuously informing students in the experimental group about their progress, such as how far they were from completing the homework, which might have motivated them to keep working to get all the objectives marked correct in the progress panel. However, we cannot directly investigate this hypothesis with the current study (e.g. by correlating student time and performance). For example, higher-performing students may take less time to complete the assignment (creating a negative correlation), even if any individual student may perform better by taking more time.

However, we note that some of the increase in students' time engaged with the assignment was not productive (e.g. 2 students spent over 90 minutes). Some of this may have resulted from errors in the data-driven feature detection. As shown in Table 1, a few of the data-driven features do not correspond well to a concrete assignment objective. For example, data-driven Feature 1 requires the use of a 'ReceiveGo' block (which can be used to start a script) to complete Objective 1; however, this is not necessary according to the instructions. It is simply a feature that most students used in the dataset that trained the DD-IPF system. Based on our manual investigation of the experimental group's log data, we found a number of times where the system mislabeled a student's progress. For example, some students finished the programming task, but not all the objectives were detected by the system. As a result, a few students kept working for more time despite having completed the task, which was not the case for students in the control group. We provide specific case studies of instances in which the DD-IPF system provided incorrect feedback, and of how students responded, in [24].

We also found that students with the DD-IPF system may have had improved performance, since students in the experimental group achieved higher average scores (almost 9.5% higher) than those in the control group. We conclude that the DD-IPF system may have increased students' scores because it increased their working time in the programming environment. The ability of the progress panel to confirm correct steps might have increased students' motivation and reduced their uncertainty about their moves, leading to improved performance. These results are consistent with the "uncertainty reduction" hypothesis presented by Mitrovic et al. [18], which suggests that positive feedback helps students continue working because it reduces their uncertainty about their code edits.

While the availability of on-demand hints might have affected our results somewhat, students in both semesters had access to hints, so the differences between semesters were due primarily to the DD-IPF system. Interestingly, when we compared hint usage across the two semesters, we found suggestive evidence that the DD-IPF system might have increased students' hint usage. We found that the average number of hints requested by the experimental group (Mean = 17; Med = 11) was higher than that of the control group (Mean = 10; Med = 6.5). In addition, we found that the average percent of followed hints in the experimental group (Mean = 66.64%; Med = 70%) was significantly higher than that in the control group (Mean = 39.91%; Med = 38.3%). We hypothesize a few possible implications of these results. First, the progress panel may have motivated students to seek out and follow hints in order to achieve the incomplete objectives it showed. Second, students in the experimental group might have followed more hints because the progress panel helped them understand how hints relate to a specific objective. Lastly, students might have trusted the system more because the DD-IPF helped them see some intelligence and intentionality behind the system. This addresses a concern raised by prior work, which suggests that students do not trust automated feedback because they do not believe the system really understands what they are doing, or doubt its ability to offer useful help [21].
7. RECOMMENDATIONS FOR DATA-DRIVEN PROGRAMMING FEEDBACK
Based on our current and prior work, we have several recommendations for designing data-driven feedback for programming [29, 17, 24]. Because of the very large, sparse solution spaces for programming, we should always expect some inadequacies in any data-driven features or detectors for programs. For example, in our current dataset we found cases where the DD-IPF system incorrectly detected the completion of an objective, and others where the student completed an objective but it was not detected. These flaws point to a tradeoff between the more easily generated, data-driven features used in this work and the expert-authored features that we used in our prior work [17]. A data-driven positive feedback system is likely to learn less generalizable features from prior students, and it still needs to be labeled and curated for presentation, but it can be scaled to various programming exercises with less human effort. In contrast, expert-authored feedback can be designed to have understandable and generalizable features, but it is hard to scale to various programming exercises and requires extensive time and effort to create autograders (e.g. with static analysis [17, 28, 2]). In this work we combined both approaches, using data-driven feature detection to account for the diversity of student solutions, and a human labeling process to make the features more general, independent, and understandable.

We recommend that data-driven feedback approaches be designed iteratively, learning and implementing an initial set of feature detectors from a given dataset, and iteratively detecting new features after each new dataset is added. For example, despite extracting features from over 100 prior student solutions to the Squiral programming exercise (which can be solved in 7-11 lines of code), in this study we found two students who used a completely new strategy, not represented in the prior data. In particular, one of these two students created a procedure (i.e. a custom block) to draw a single side of a Squiral and called this procedure in an inner loop of the main procedure; as a result, the DD-IPF system only detected the completion of the two objectives that matched the features it had learned from prior students' data, as shown in Figure 4. This behavior should be expected and planned for, since the process of providing automated data-driven feedback is inherently uncertain. To mitigate this, we propose combining iterative cycles of data-driven feature detection with expert authoring to achieve the best of both worlds: building a system that can intelligently address the diverse but correct ways that students solve problems (fitting correct prior solutions), while benefiting from human expertise in communication (labeling the objectives). There is also a need to investigate how and when to communicate the fact that feature detection will always be imperfect for open-ended tasks in programming. We plan to explore ways to explain how the system works to future students, both to promote learning and to mitigate potential harms from incorrect feedback.

Figure 4: A novel student solution with a new and correct, but undetected, strategy, where only 2 objectives were detected (see the progress panel on the right).

We also recommend learning features from student data in an offline, section-by-section fashion, grouping students who took the same course with the same instructor at the same time. Objective features should be reviewed and edited to be as independent as possible, and labeled so that students can understand what each one means. Note, however, that we do not believe that every feature detected must be fully explained by experts. Detectors should be trained on prior data and cross-validated with testing and training groups from different sections. This is because we do not anticipate being able to generate highly accurate data-driven feature detectors that could be added to the system without expert review. Therefore, validation should be conducted to match the way our system is used in practice: trained on one dataset and used in a separate, later section. Furthermore, instructors can have a large impact on how students approach problems, so each section of a class is very likely to differ significantly due to that factor alone.

8. LIMITATIONS & CONCLUSION
This study has four primary limitations. First, this was a quasi-experimental study, and therefore there might be other differences between semesters that affected our results. However, when we compared the students excluded from both semesters (due to the deployment problems mentioned in Section 5.1), we found that the time taken by the excluded students (the first 36% to complete or submit the programming task) in Spring 2019, as well as their scores, were very close to those of the excluded (first 36%) students in Spring 2020. This suggests that the differences we found in our study results were likely due to the DD-IPF system, rather than inherent differences between semesters. The second limitation is that, due to the structure of the programming course, we were not able to measure learning with pre/post tests. We hypothesize that the DD-IPF system can improve learning, since it provides immediate feedback that confirms that the programming steps a student just completed are those that contribute to the specific objective. The third limitation is that we evaluated the DD-IPF system on only one programming homework. The second and third limitations, however, are somewhat addressed in our prior work [17], making us optimistic that they can be successfully addressed in future studies. Our prior work shows that, when compared to students in a control group with no positive feedback, students who used the expert-authored positive feedback system performed better on two tasks when they had access to the feedback, and continued to perform better on a third, more difficult task without positive feedback [17]. Finally, we acknowledge that our DD-IPF system includes features other than positive feedback: it breaks down the programming task into smaller objectives and provides corrective feedback when objectives are broken. Our study presents the results of the whole system, but we argue that the most salient aspect of the system was its focus on providing immediate positive feedback during problem solving. In our future work, we hope to conduct a larger-scale study with different treatment groups to evaluate individual features of the DD-IPF system.

To conclude, we developed a system that combines a data-driven feature detector with human labelling to provide data-driven immediate positive feedback (DD-IPF) in a block-based programming environment. We conducted a quasi-experimental classroom study to evaluate the impact of DD-IPF on students working on a programming homework assignment. We found evidence that the DD-IPF system increased students' engagement with the programming task, and that it has the potential to improve students' programming performance. We also provided recommendations to computing education researchers on how to design better data-driven feedback systems. In our future work, we plan to improve the accuracy of the DD-IPF system, test it across several programming exercises, and evaluate its impact on students' cognitive and affective outcomes. We also plan to research ways to counteract the inadequacies of automated feedback with interface design and opportunities for self-explanation prompts, to promote user trust and learning.

9. ACKNOWLEDGEMENTS
This material is based upon work supported by the National Science Foundation under grant 1623470. The authors would also like to thank Preya Shabrina for her help with data analysis.

10. REFERENCES
[1] J. R. Anderson, A. T. Corbett, K. R. Koedinger, and R. Pelletier. Cognitive tutors: Lessons learned. The Journal of the Learning Sciences, 4(2):167-207, 1995.
[2] M. Ball. Lambda: An autograder for Snap. Master's thesis, EECS Department, University of California, Berkeley, 2018.
[3] D. Barrow, A. Mitrovic, S. Ohlsson, and M. Grimley. Assessing the impact of positive feedback in constraint-based tutors. In International Conference on Intelligent Tutoring Systems, pages 250-259. Springer, 2008.
[4] B. A. Becker, G. Glanville, R. Iwashima, C. McDonnell, K. Goslin, and C. Mooney. Effective compiler error message enhancement for novice programming students. Computer Science Education, 26(2-3):148-175, 2016.
[5] K. E. Boyer, R. Phillips, M. D. Wallis, M. A. Vouk, and J. C. Lester. Learner characteristics and feedback in tutorial dialogue. In Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications, pages 53-61. Association for Computational Linguistics, 2008.
[6] W. L. Cade, J. L. Copeland, N. K. Person, and S. K. D'Mello. Dialogue modes in expert tutoring. In International Conference on Intelligent Tutoring Systems, pages 470-479. Springer, 2008.
[7] L. Chen, B. Di Eugenio, D. Fossati, S. Ohlsson, and D. Cosejo. Exploring effective dialogue act sequences in one-on-one computer science tutoring dialogues. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications, pages 65-75. Association for Computational Linguistics, 2011.
[8] A. Corbett and J. R. Anderson. Locus of feedback control in computer-based tutoring: Impact on learning rate, achievement and attitudes. In Proceedings of the SIGCHI Conference on Human Computer Interaction, pages 245-252, 2001.
[9] B. Di Eugenio, D. Fossati, S. Haller, D. Yu, and M. Glass. Be brief, and they shall learn: Generating concise language feedback for a computer tutor. International Journal of Artificial Intelligence in Education, 18(4):317-345, 2008.
[10] B. Di Eugenio, D. Fossati, S. Ohlsson, and D. Cosejo. Towards explaining effective tutorial dialogues. In Annual Meeting of the Cognitive Science Society, pages 1430-1435, 2009.
[11] Y. Dong, S. Marwan, V. Catete, T. Price, and T. Barnes. Defining tinkering behavior in open-ended block-based programming assignments. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education, pages 1204-1210. ACM, 2019.
[12] D. Fossati, B. Di Eugenio, S. Ohlsson, C. Brown, and L. Chen. Data driven automatic feedback generation in the iList intelligent tutoring system. Technology, Instruction, Cognition and Learning, 10(1):5-26, 2015.
[13] D. Fossati, B. Di Eugenio, S. Ohlsson, C. W. Brown, L. Chen, D. G. Cosejo, et al. I learn from you, you learn from me: How to make iList learn from students. In AIED, pages 491-498, 2009.
[14] D. E. Johnson. Itch: Individual testing of computer homework for Scratch assignments. In Proceedings of the 47th ACM Technical Symposium on Computing Science Education, pages 223-227. ACM, 2016.
[15] M. J. Lee and A. J. Ko. Personifying programming tool feedback improves novice programmers' learning. In Proceedings of the Seventh International Workshop on Computing Education Research, pages 109-116. ACM, 2011.
[16] M. R. Lepper, M. Woolverton, D. L. Mumme, and J. Gurtner. Motivational techniques of expert human tutors: Lessons for the design of computer-based tutors. Computers as Cognitive Tools, 1993:75-105, 1993.
[17] S. Marwan, G. Gao, S. Fisk, T. W. Price, and T. Barnes. Adaptive immediate feedback can improve novice programming engagement and intention to persist in computer science. In Proceedings of the International Computing Education Research Conference (forthcoming), 2020.
[18] A. Mitrovic, S. Ohlsson, and D. K. Barrow. The effect of positive feedback in a constraint-based intelligent tutoring system. Computers & Education, 60(1):264-272, 2013.
[19] R. Moreno and R. E. Mayer. Personalized messages that promote science learning in virtual environments. Journal of Educational Psychology, 96(1):165, 2004.
[20] T. W. Price, Y. Dong, and D. Lipovac. iSnap: Towards intelligent tutoring in novice programming environments. In Proceedings of the ACM Technical Symposium on Computer Science Education, 2017.
[21] T. W. Price, Z. Liu, V. Catete, and T. Barnes. Factors influencing students' help-seeking behavior while programming with human and computer tutors. In Proceedings of the International Computing Education Research Conference, 2017.
[22] P. C. Rigby and S. Thompson. Study of novice programmers using Eclipse and Gild. In Proceedings of the 2005 OOPSLA Workshop on Eclipse Technology eXchange, pages 105-109. ACM, 2005.
[23] K. Rivers and K. R. Koedinger. Data-driven hint generation in vast solution spaces: A self-improving Python programming tutor. International Journal of Artificial Intelligence in Education, 27(1):37-64, 2017.
[24] P. Shabrina, S. Marwan, T. W. Price, M. Chi, and T. Barnes. The impact of data-driven positive programming feedback: When it helps, what happens when it goes wrong, and how students respond. In Educational Data Mining in Computer Science Education (CSEDM) Workshop @ EDM'20, 2020.
[25] D. Sleeman, A. E. Kelly, R. Martinak, R. D. Ward, and J. L. Moore. Studies of diagnosis and remediation with high school algebra students. Cognitive Science, 13(4):551-568, 1989.
[26] K. VanLehn. The behavior of tutoring systems. International Journal of Artificial Intelligence in Education, 16:227-265, 2006.
[27] K. VanLehn. The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4):197-221, 2011.
[28] W. Wang, R. Zhi, A. Milliken, N. Lytle, and T. W. Price. Crescendo: Engaging students to self-paced programming practices. In Proceedings of the ACM Technical Symposium on Computer Science Education, 2020.
[29] R. Zhi, T. W. Price, N. Lytle, and T. Barnes. Reducing the state space of programming problems through data-driven feature detection. In Educational Data Mining in Computer Science Education (CSEDM) Workshop @ EDM'18, 2018.