Next Steps for Next-step Hints: Lessons Learned from Teacher Evaluations of Automatic Programming Hints

Benjamin Paaßen (corresponding author), Institute of Informatics, Humboldt-University of Berlin, benjamin.paassen@hu-berlin.de
Jessica McBroom, School of Computer Science, The University of Sydney, jmcb6755@uni.sydney.edu.au
Bryn Jeffries, Grok Learning, bryn@groklearning.com
Irena Koprinska, School of Computer Science, The University of Sydney, irena.koprinska@sydney.edu.au
Kalina Yacef, School of Computer Science, The University of Sydney, kalina.yacef@sydney.edu.au

ABSTRACT
Next-step programming hints have attracted considerable research attention in recent years, with many new techniques being developed for a variety of contexts. However, evaluating next-step hints is still a challenge. We performed a pilot study in which teachers (N = 7) rated automatic next-step hints, both quantitatively and qualitatively, providing reasons for their ratings. Additionally, we asked teachers to write a free-form hint themselves. We found that teachers tended to prefer higher-level hints over syntax-based hints, and that the differences between hint techniques were often less important to teachers than the format of the generated hints. Based on these results, we propose modifications to next-step hint strategies to increase their similarity to human teacher feedback, and suggest this as a potential avenue for improving their effectiveness.

Keywords
computer science education, next-step hints, data-driven feedback, teacher evaluation

1. INTRODUCTION
To support students in solving practical programming tasks, many automatic feedback strategies provide next-step hints, i.e. they select a target program that is closer to a correct solution and provide feedback based on the contrast between the student's current program and the target program (e.g. [3, 6, 10, 11, 13-15]). Next-step hints are compelling because they do not require teacher intervention. Instead, they utilize historical student data and, as such, can be fully automated [10]. However, it remains challenging to evaluate next-step hints. Price et al. [14] found at least three different criteria to grade next-step hints: how often they are available (coverage), how they impact student outcomes, such as task completion speed and learning gain, and how well they align with expert opinions. Importantly, the relation between these criteria is not trivial, and different ways to present next-step hints can influence their effect. For example, Marwan et al. [8] found that adding textual explanations improved hint quality in expert eyes but did not influence student outcomes.

Our main contribution in this paper is to combine quantitative ratings with qualitative explanations. In other words, we do not only investigate differences in teacher ratings, but also why teachers preferred some hints over others. To this end, we performed a survey with N = 7 teachers, asking them to grade next-step hints generated by three different methods across three programming tasks in Python. Our overarching research questions are:

RQ1 Do teachers' ratings differ between hint methods?

RQ2 Do automatic hints align with teacher hints?

RQ3 What are teachers' reasons for preferring some hints over others?

This paper is set out as follows: Section 2 discusses related work in more detail, Section 3 describes the setup of our study, Section 4 describes the results and, finally, Sections 5-6 discuss and summarize the implications of our work.
2. RELATED WORK
Prior work on evaluating next-step hints broadly falls into three categories: technical criteria, outcomes for students, and expert opinions [14].

Technical criteria are mostly concerned with the availability of hints and motivated by the cold start problem, i.e. the problem that data-driven hint generation requires a certain amount of data to become possible [1]. Over the years, this problem has arguably become less critical as multiple methods are now available which require very little training data, such as [6, 10, 11, 13, 15]. In this paper we restrict ourselves to these methods and therefore omit such criteria.

Regarding student outcomes, prior studies have already shown that data-driven next-step hints can yield similar learning gains to working with human teachers [5], can improve solution quality [2], and can improve completion speed [4]. The challenge in applying such criteria is that they require a study design in which an intervention group works on-line on a task with hint support, which was beyond the scope of our pilot study.

An alternative which requires fewer resources is offered by expert opinions, i.e. ratings by experienced programming teachers on the quality of hints. In particular, Price et al. [13] have suggested three scales (relevance, progress, and interpretability) to grade hint quality and have shown that expert ratings on these scales are related to the likelihood of students accepting hints in the future. Further, both Piech et al. [12] and Price et al. [14] asked teachers to generate next-step hints themselves and evaluated the overlap between the teacher hints and automatic hints as a measure of quality. Importantly, a next-step hint may be affected not only by the selected target program but also by how the hint is presented. For example, Marwan et al. [8] found that adding textual explanations improved expert quality ratings, but not student outcomes.

In our work, we combine aspects of this prior work with qualitative questions. In particular, we use a variation of the three scales of Price et al. [13] for quantitative ratings of hint quality and let teachers provide their own hints to evaluate overlap, akin to [12, 14]. Additionally, we ask teachers to provide a textual explanation for why they would give a hint and why they would choose not to give one of the automatic hints.

3. METHOD
In this section, we cover the setup for our survey, beginning with the programming data sets we used, followed by the mechanism to select specific examples, the hint methods, and the recruitment for the survey itself.

3.1 Programming data sets
In order to provide realistic stimulus material, we selected our programs from three real-world, large-scale data sets of program traces in introductory programming. Namely, we considered data from the 2018 (beginner challenge) and 2019 (beginner and intermediate challenges) National Computer Science School (NCSS, https://ncss.edu.au), an introductory computer science curriculum for (mostly Australian) school children in grades 5-10. 12,876 students were enrolled in the beginners 2018 challenge, 11,181 students in the beginners 2019 challenge, and 7,854 students in the intermediate 2019 challenge. Each challenge consisted of about twenty-five programming tasks in ascending difficulty, each of which was annotated with unit tests. In all cases, we only considered submissions, i.e. programs that students deliberately submitted for evaluation against unit tests.

3.2 Example selection
Our goal in this study was to evaluate the quality of automatic hints in a range of realistic situations where students were likely to need help and where feedback generation was non-trivial. For the purpose of this study, we considered a program as indicative of help-need if at least five students who submitted this program failed the same number of unit tests or more in the next step of their development. This is in line with prior work by [9] and [7], who both suggest that help is needed if students repeatedly fail to make progress.

As a proxy for non-triviality we considered the tree edit distance to the top-100 most frequent submissions for the same programming task. If this tree edit distance is low, providing automatic hints is simple: we can retrieve the nearest neighbor according to tree edit distance and use a successful continuation of this nearest neighbor as a hint, as suggested by [6]. However, if this distance is high, we are in a region of the space of possible programs that is not frequently visited by students and, hence, harder to cover for an automatic hint system.

In the end, we selected for each of the three challenges the program which maximized the tree edit distance to frequent programs and indicated help-need. The resulting submissions are shown in Figure 1, alongside a description of the respective programming task and an example solution.
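For concreteness, the selection procedure can be sketched in a few lines of Python. The sketch below is an illustration only, not the code we used: it assumes that each student trace is available as a list of (source code, number of failed unit tests) pairs, that tree_edit_distance stands in for any tree edit distance over parsed programs, and that the distance to the frequent submissions is aggregated as a minimum.

    from collections import Counter

    def select_example(traces, tree_edit_distance, min_stuck_students=5, top_k=100):
        # Help-need criterion: count how many students who submitted a given
        # program failed the same number of unit tests or more in their next step.
        stuck_counts = Counter()
        frequencies = Counter()
        for trace in traces:
            for (code, failed), (_, next_failed) in zip(trace, trace[1:]):
                frequencies[code] += 1
                if next_failed >= failed:
                    stuck_counts[code] += 1
        help_need = [code for code, n in stuck_counts.items()
                     if n >= min_stuck_students]

        # Non-triviality criterion: tree edit distance to the top-100 most
        # frequent submissions (aggregated here as the minimum distance).
        frequent = [code for code, _ in frequencies.most_common(top_k)]

        def distance_to_frequent(code):
            return min(tree_edit_distance(code, other) for other in frequent)

        # Select the help-need program that frequency-based hints cover worst.
        return max(help_need, key=distance_to_frequent)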
3.3 Hint generation
We considered three techniques to produce next-step hints.

Firstly, we used one-nearest-neighbor (1NN) prediction [6], i.e. we selected the nearest neighbor to the help-seeking program in the training data and recommended its successor. Distance was measured according to the tree edit distance, as used e.g. by [10, 15].

Secondly, we used the continuous hint factory (CHF) [10], which extends the one-nearest-neighbor approach by computing a weighted average of multiple close neighbors and then constructs the program which is closest to this weighted average. Since this construction occurs in the space of syntax trees, it does not come with variable or function names attached. We therefore consider two versions: For the first two tasks, we present an 'abstract' program version where all variables and functions are named 'x'. For the last task, we instead use the nearest neighbor in the training data to the weighted average.

Finally, we used the ast2vec neural network [11] to first translate the student's current program into a vector, then predict how this vector should change via linear regression, and decode this predicted vector back into a syntax tree. To provide function names as well, we trained a classifier that mapped ast2vec encodings of subtrees to typical function names in the training data, and we automatically copied variable names and strings from the student's current program, as suggested by [11].
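To illustrate the simplest of these methods, the following sketch (again an illustration under simplifying assumptions, not the actual implementations of [6, 10, 11, 15]) generates a 1NN hint: it retrieves the training submission closest to the student's program under the tree edit distance and returns the program that the corresponding student submitted next. CHF and ast2vec replace this retrieval step by interpolation in the edit-distance space and by neural encoding and decoding, respectively, as described above.

    def one_nearest_neighbor_hint(student_code, training_traces, tree_edit_distance):
        # Find the training submission closest to the student's program (under
        # the tree edit distance) that still has a successor in its trace, and
        # return that successor as the next-step hint.
        best_hint, best_distance = None, float('inf')
        for trace in training_traces:
            for current, successor in zip(trace, trace[1:]):
                d = tree_edit_distance(student_code, current)
                if d < best_distance:
                    best_distance, best_hint = d, successor
        return best_hint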
recipe task
"You're opening a boutique pie shop. You have lots of crazy pie ideas, but you need to keep them secret! Write a program that asks for a pie idea, and encodes it as the numeric code for each letter, using the ord function. Print the code for each letter on a new line."

recipe submission

    msg = input('Pie idea: ')
    code = ord(msg[0])
    code1 = ord(msg[1])
    code2 = ord(msg[2])
    code3 = ord(msg[3])
    code4 = ord(msg[4])
    code5 = ord(msg[5])
    print(code)
    print(code1)
    print(code2)
    print(code3)
    print(code4)
    print(code5)

recipe solution

    msg = input('Pie idea: ')
    for a in msg:
        print(ord(a))

anagram task
"Let's make a computer program that only knows how to say 'hi': It doesn't matter what you type in, it should still print 'Hi, I am a computer'! To make it a bit more exciting, though, we'll add an Easter Egg. The word 'anagram' should trigger a secret message:

    Hi, I am a computer!
    What are you? anagram
    Nag a ram!
    Hi, I am a computer!"

anagram submission

    print('Hi, I am a computer!')
    raining = input('What are you? ')
    if raining == 'a dog':
        print('Hi, I am a computer!')
    if raining == 'anagram':
        print('Nag a ram!')
        print('Hi, I am a computer!')
    if raining == 'a person':
        print('Hi, I am a computer!')

anagram solution

    print('Hi, I am a computer!')
    word = input('What are you? ')
    if word == 'anagram':
        print('Nag a ram!')
    print('Hi, I am a computer!')

scoville task
"The Scoville scale measures the spiciness of chilli peppers or other spicy foods in Scoville heat units (SHU). For example, a jalapeño has a range between 1,000 to 10,000, and a habanero is between 100,000 and 350,000! Different people have different tolerances to eating chilli peppers. Nam's parents cook with a lot of chilli, and so she enjoys eating foods with a SHU value less than 10000. Michael likes it less spicy, and only enjoys eating foods with a SHU value less than or equal to 120. Write a program to read in the SHU value for some food, and print out who will enjoy the food. For example:

    What is the SHU value? 5000
    Nam will enjoy this!

and another example:

    What is the SHU value? 120
    Michael will enjoy this!
    Nam will enjoy this!

If neither Michael nor Nam will enjoy the food, your program should output:

    This food is too spicy for everyone :("

scoville submission

    jalapeno = int(input("What is the SHU value? "))
    if jalapeno <= 10000:
        print('Nam will enjoy this!')
    else:
        print('This food is too spicy for everyone!')

    sugary = int(input("What is the SHU value? "))
    if sugary <= 120:
        print('Nam will enjoy this!')
        print('Michael will enjoy this!')
    else:
        print('This food is too spicy for everyone!')

scoville solution

    shu = int(input("What is the SHU value? "))

    if shu < 10000:
        print('Nam will enjoy this!')
    if shu < 120:
        print('Michael will enjoy this!')
    if shu >= 10000:
        print('This food is too spicy for everyone :(')

Figure 1: The three program examples in our study (recipe, anagram, and scoville), each with the task description the students received, the student submission, and a correct solution.
In all cases, the hint was formatted as a program which the student might try next to improve their current program. We used a random sample of 30 student traces from the same task as training data to simulate a 'cold start' with only a classroom-sized training data set.

The resulting hints of all three methods for all three tasks are shown in Figure 2.

recipe hints

1NN
    1  a = input('Pie idea: ')
    2  b = [ord(x) for x in a]
    3  print(*b, sep="\n")

CHF
    1  x = x('<string>')
    2  x = x(x[0])
    3  x = x(x[0])
    4  x = x(x[0])
    5  x = x(x[0])
    6  for x in x
    7      x(x, keyword='<string>')

ast2vec
    1  msg = input('Pie idea: ')
    2  code = ord(msg[0])
    3  code1 = ord(msg[1])
    4  code2 = ord(msg[2])
    5  code3 = ord(msg[3])
    6  code4 = ord(msg[4])
    7  print(code1)
    8  print(code1)

anagram hints

1NN
    1  print('Hi, I am a computer!')
    2  computer = input('What are you? ')
    3  if computer == 'anagram':
    4      print('Nag a ram!')
    5  print('Hi, I am a computer!')

CHF
    1  x('<string>')
    2  x = x('<string>')
    3  if x == '<string>':
    4      x('<string>')
    5  if x == '<string>':
    6      x('<string>')
    7      x(x)
    8  if x == '<string>':
    9      x('<string>')

ast2vec
    1  raining = input('What are you? ')
    2  if raining == 'anagram':
    3      print('Nag a ram!')
    4      print('Hi, I am a computer!')
    5  if raining == 'a person':
    6      print('Hi, I am a computer!')

scoville hints

1NN
    1  msg = int(input('What is the SHU value? '))
    2  if msg < 10000:
    3      print('Nam will enjoy this!')
    4  else:
    5      print('This food is too spicy for everyone :(')
    6  if msg <= 120:
    7      print('Michael will enjoy this!')

CHF
    1  shu = int(input('What is the SHU value? '))
    2  if shu < 10000:
    3      print('Nam will enjoy this!')
    4  if shu <= 120:
    5      print('Michael will enjoy this!')
    6  else:
    7      print('This food is too spicy for everyone :(')

ast2vec
    1  sugary = int(input('What is the SHU value? '))
    2  if sugary >= 10000:
    3      print('Nam will enjoy this!')
    4  if sugary < 0:
    5      print('What is the SHU value? ')
    6  if sugary < 120:
    7      print('Nam will enjoy this!')

Figure 2: The hints of all three methods (1NN, CHF, and ast2vec) for all three student submissions from Figure 1.

3.4 Survey and recruitment
For our study, we recruited N = 7 teachers from programming courses in Australia and Germany. Recruitment was performed via e-mail lists with a survey link. Teachers could then voluntarily and anonymously complete the survey in Microsoft Forms. Participants were first asked about their experience as programming teachers. Six participants responded that they had taught more than three courses, and one participant that they had taught between one and three courses. Participants with no experience were excluded from the study.

We acknowledge that our recruitment strategy has limitations: While we can be reasonably certain that only experienced programming teachers took part, we have no information about the specific courses they taught and whether that matches up with the kind of programming task we investigated in our study. Further, self-selection bias may have occurred as we did not employ a random recruitment strategy.

Next, we presented the first programming task (the recipe task) with the official task description from the National Computer Science School, the example solution, and the student's submission (refer to Figure 1, top left). We asked the teachers whether they thought the student needed help in this situation (on a four-point scale), why (as a free text field), and what edit they would recommend to guide the student (free text field). Further, we presented the three automatic hints in Figure 2 (top left) and asked the teachers how relevant each hint was, how useful it was, and how much the student could learn from it, all on a five-point scale. We defined 'relevant' as 'addressing the core problem of the student's program' and 'useful' as 'getting closer to a correct solution'. These scales correspond to the scales of relevance and progress proposed by [13]. We replaced the 'interpretability' scale of [13] with 'learning' to encourage the teachers to reflect on the learning impact a hint may have.

Finally, we asked the teachers whether they would prefer not to give any of the hints and why (free text). We repeated all questions for the anagram and scoville tasks. To avoid ordering bias, the order of hint methods was shuffled randomly for each participant.
4. RESULTS
We present our results beginning with the teachers' assessment of whether hints were needed at all, followed by the hints given by teachers, and we conclude with the teachers' assessment of the automatic hints.

4.1 Help-need assessment
For each of the programs in Figure 1, we first asked the teachers how much help a student in this situation would need on a four-point scale, ranging from 0 ("The student is on the way to a correct solution and does not need help.") to 3 ("The student seems to have crucial misconceptions and should start from scratch.").

[Figure 3: bar chart; x-axis: help-need rating 0-3, y-axis: number of teachers, one bar group per task]
Figure 3: Teachers' assessment of help-need for the three submissions from Figure 1. 0 corresponds to 'no help needed', 3 to 'the student should start from scratch'. The y-axis corresponds to the number of teachers.

Figure 3 shows the distribution of responses for each task. Most teachers agreed on response 1 for all tasks ("The student is on the way to a correct solution but could benefit from a hint."). This indicates that our automatic selection indeed identified examples which indicated help-need.

We also asked teachers why they believed that the student did or did not need help. In response to this question, most teachers appeared to analyze which high-level concepts the student had already understood - judging from their program - and which concepts were still missing or misunderstood. For example, one teacher responded for the recipe task: "They know how to input, they know how to do the ord, they know they need to move through the string, they have just forgotten that there is a 'short cut' to do this in a loop.", and another teacher responded for the anagram task: "The student is not seeing the general rule of the program and is trying to cover possible cases by hand." Further, multiple teachers responded to this question with suggestions on how to provide further guidance and help to the student, such as "It just hasn't clicked that there are other possible inputs that they need to account for. Keeping this fresh in their mind should quickly lead to a solution." or "I think they should mess around a bit more, but they should get a hint that the input function should only be used once for this problem.".

4.2 Categories of Teacher Hints
Next, we asked teachers how they would recommend editing the student's program next. Interestingly, teachers generally did not give hints on a syntactic level. In fact, some of them stated explicitly that they thought this was not helpful (e.g. "I would not give students exact syntax because then they blindly follow without understanding"). However, in some cases, they did suggest lines to delete. We found that teacher feedback tended to fit into four general categories:

A  suggesting a missing concept, such as a for-loop, an else statement, or a combination of if-statements. For example, "When you have a line of code that you are repeating it can be useful to use a loop like 'for' or 'while'. This also allows you to repeat the code within the body of the loop for a flexible number of times."

B  explaining or hinting at situations when the program will not work as expected. E.g. "You're almost there. However, it will only do the right thing when the user writes 'a dog', 'anagram' or 'a person'. You can improve it so that it says 'Hi, I am a computer!' every time, no matter what the user says."

C  suggesting the student solve a simpler problem first. E.g. "Suggest that they delete all but the first line and try and print out each letter one at a time."

D  suggesting that the student has something unnecessary in their program, or directly telling them to delete it. E.g. "Remove the two irrelevant if statements leaving only the correct 'easter egg' statement".

Table 1 shows a classification of the teacher hints into these four categories. We observe that each hint could be classified in at least one category.

Table 1: Teacher hint types.

    teacher:    1    2    3    4    5    6     7
    recipe      A    A    B    B    A    C,D   A,B
    anagram     B    B    A    D    B    B     B
    scoville    A    A    A    D    B    D     A,D

4.3 Assessment of automatic hints
Our third set of questions for each task concerned the rating of the automatic hints (refer to Figure 2) according to relevance, usefulness, and learning, each on a five-point scale from -2 ("not at all") to +2 ("very"). Table 2 shows the average ratings (± standard deviation) given by the teachers.

Table 2: Average teacher ratings (± std.) for each of the hints from Figure 2.

    recipe
    method     relevant        useful          learning
    1NN         1.43 ± 0.73     0.86 ± 1.12    −0.14 ± 1.12
    CHF        −1.71 ± 0.45    −1.86 ± 0.35    −1.86 ± 0.35
    ast2vec    −1.29 ± 0.88    −1.71 ± 0.45    −1.43 ± 0.73

    anagram
    1NN         1.86 ± 0.35     1.43 ± 0.73     0.14 ± 0.99
    CHF        −2.00 ± 0.00    −2.00 ± 0.00    −1.86 ± 0.35
    ast2vec    −0.29 ± 1.28    −0.43 ± 0.90    −0.71 ± 1.03

    scoville
    1NN         1.14 ± 0.35     1.14 ± 0.83     0.14 ± 1.12
    CHF         1.43 ± 0.49     1.57 ± 0.73     0.57 ± 1.29
    ast2vec     0.29 ± 1.28    −0.29 ± 1.58    −0.29 ± 1.28

We observe that 1NN is generally regarded as relevant and useful, which can be explained by the fact that it always recommended a correct solution for the tasks in our examples (refer to Figures 1 and 2). However, teachers believed that students would not learn particularly much from these hints (rating around zero). For CHF, we observe strongly negative ratings in all three criteria for the recipe and anagram tasks, where CHF did not provide function and variable names (refer to Figure 2), but positive scores for the scoville task, where it selected a correct solution. On that task, it received even higher scores than 1NN. Ast2vec received negative scores on the recipe task, and scores around zero for all criteria on all other tasks.

Finally, we asked teachers if there were any hints they would prefer not to give, and why. Figure 4 shows how often each method was named for each task (where lower is better). Regarding the reasoning, one teacher always responded that "I would not give students exact syntax because then they blindly follow without understanding.", which excludes all automatic hints. Further, 1NN was often named because it "shows a valid solution. Students may copy the hint code and then use it - without understanding what it does.". The hints provided by CHF for the first two tasks did not include variable and function names (refer to Figure 2), which led teachers to exclude it because it "does not look syntactically valid, and is incomprehensible.". Ast2vec was named least often, albeit by a narrow margin. Reasons for naming it were that it is "not helpful for developing student understanding", would require additional explanation, or even "harm the learning of the student."

[Figure 4: bar chart; x-axis: task, y-axis: number of teachers, one bar per hint method]
Figure 4: The number of teachers (y-axis) who would prefer not to give a certain hint according to hint method (color) and task (x-axis).
5. DISCUSSION
In this section, we interpret the results in light of our three research questions: Do ratings differ between methods? Do automatic hints align with teacher hints? And what are teachers' reasons for preferring some hints over others?

RQ1: Do teachers' ratings differ between hint methods? We do observe systematic differences between hint methods in terms of ratings. However, these differences appear to be driven less by the underlying algorithm than by two factors: a) whether a correct solution or only a partial solution was selected, and b) whether the hint was presented as a program with function and variable names or not. Teachers only gave high scores for usefulness and relevance if a correct solution was given, and gave very low scores if function and variable names were missing. They never gave high scores for learning.

RQ2: Do automatic hints align with teacher hints? We observe that teacher hints do not directly align with automatic hints, because teachers generally suggested hints which were higher-level, like adding missing concepts or deleting lines to get to a more compact program. More generally, we identified four categories (refer to Section 4.2) which cover the kinds of edits teachers would have given themselves. To align automatic hints with teacher hints, we could employ automatic heuristics which post-process next-step hints, such as the following (strategies A and D are sketched in code after the list):

A - Missing Concepts: Compare the nodes of the student's current AST to those of the next step, then suggest concepts corresponding to missing nodes.

B - Mishandled Situations: Apply test cases to the current and the predicted program, then suggest the student focus on inputs that work for the next step but not the current step.

C - Simpler Problem: Similarly to A, first identify missing concepts in the student's work, then suggest easier programming tasks that contain this concept (e.g. from earlier in the course).

D - Deletions: Compare the current step to the next step and, if lines are deleted, ask the student if these lines are necessary.
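To make strategies A and D more tangible, the following sketch (our illustration, under simplifying assumptions) compares the student's current program with the predicted next step using Python's built-in ast module. The CONCEPT_NAMES mapping and the line-based deletion check are placeholders for illustration; a real post-processing step would compare syntax trees more carefully, and strategy B would additionally run the task's unit tests against both programs.

    import ast
    from collections import Counter

    # Hypothetical mapping from AST node types to concept names used in hint texts.
    CONCEPT_NAMES = {"For": "a for-loop", "While": "a while-loop",
                     "If": "an if-statement", "FunctionDef": "a function definition"}

    def node_counts(code):
        # Count AST node types in a program; assumes the program parses.
        return Counter(type(node).__name__ for node in ast.walk(ast.parse(code)))

    def missing_concept_hints(current_code, next_step_code):
        # Strategy A: suggest concepts that appear in the next step
        # but not yet in the student's program.
        missing = node_counts(next_step_code) - node_counts(current_code)
        return [f"Maybe you could try {CONCEPT_NAMES[n]}."
                for n in missing if n in CONCEPT_NAMES]

    def deletion_hints(current_code, next_step_code):
        # Strategy D: if the next step drops lines from the current program,
        # ask the student whether those lines are really necessary.
        current_lines = [l.strip() for l in current_code.splitlines() if l.strip()]
        next_lines = set(l.strip() for l in next_step_code.splitlines())
        dropped = [l for l in current_lines if l not in next_lines]
        if dropped:
            return [f"Do you really need the line '{dropped[0]}'? "
                    "Can you think of a way to make your program shorter?"]
        return []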
Applying these strategies, the automated hints for the tasks in Figure 2 might then become the hints shown in Table 3, which align better with the hints given by teachers.

Table 3: Abstracted hints based on the four hint types from Table 1.

    recipe hints
    A  (1NN, CHF)           Maybe you could try a for-loop.
    B  (1NN)                What happens when the user types 'apple'?
    C  (1NN, CHF)           Try doing this task on for-loops first.
    D  (1NN, CHF, ast2vec)  Can you think of a way to reduce the number of print statements?

    anagram hints
    B  (1NN)                What happens when the user types 'cat'?
    D  (CHF, ast2vec)       Can you think of a way to use fewer if statements?

    scoville hints
    B  (1NN, CHF)           What happens when the user types 120?
    D  (1NN, CHF, ast2vec)  How can you reduce the number of input calls? (and/or) Try using only one variable.

We note in passing that abstraction may also make it easier to provide helpful hints because the hint method does not need to get every detail (such as function or variable names) right, merely the rough direction. For example, we notice that ast2vec uses the wrong string in line 5 of its hint and the wrong comparison constant in line 4 of Figure 2 (bottom). This would not be an issue in the abstracted hint.

Still, we acknowledge that this approach has limitations: For strategies A and C, we implicitly assume that 'concepts' coincide with syntactic building blocks, e.g. 'have you tried adding a for loop?'. This assumption likely breaks down in more advanced programming classes.

RQ3: What are teachers' reasons for preferring some hints over others? Teachers mostly explained hints by concepts that were still missing in a program (like loops), undesired functional behavior for additional cases, or superfluous code. These reasons align with the hints the teachers gave. Importantly, many teachers also emphasized that the student had already gotten many things right, indicating that teachers were motivated to preserve the progress that had already been made. The main reason for not providing a hint appeared to be that the teachers were concerned that the hint may not lead to any learning, either because the hint was syntactic, and hence not abstract enough, because the hint was a correct solution, or because the hint was 'incomprehensible' due to missing function or variable names. Overall, the reasons provided by teachers underline our finding that teachers do not only care about the content of a next-step hint but also, perhaps mainly, how it is communicated, i.e. with a high-level explanation instead of a syntactic edit and without revealing the solution.

6. CONCLUSION
We performed a survey with N = 7 teachers to evaluate hints from three hint methods (1NN, CHF, and ast2vec) and to investigate three research questions: Do quantitative ratings differ between methods? Do automatic hints align with teacher hints? And what are teachers' reasons for preferring some hints over others?

We found that teachers generally had a low opinion of syntactic next-step hints, irrespective of the method. Differences in ratings could be explained by two factors: whether the hint was a correct solution (then it was regarded as relevant and useful) or not (then all ratings were around zero), and whether the hint used human-readable variable and function names (if not, the ratings became strongly negative).

Instead of syntactic hints, teachers preferred higher-level hints which suggested a missing concept, pointed out inputs which were not appropriately covered by the current program, suggested a simpler problem first, or proposed to remove superfluous lines.

Finally, the main concern of teachers for not giving hints was students' learning. They disregarded both syntactic hints and showing a correct solution because they were concerned that students might naïvely apply the hint without reflecting on it sufficiently. This points to a potential gap in current next-step hint approaches, which are mainly focused on suggesting changes, but less on inviting reflection or abstracting to a higher level. For introductory programming courses, it may be sufficient to post-process syntax-level hints with simple heuristics, such as the ones proposed in the previous section. Additionally, one can introduce textual explanations as suggested by [8]. However, further research is needed to investigate whether it is sufficient to change how next-step hints are communicated or whether deeper changes in hint methods are necessary to achieve a better alignment with the pedagogical expertise of programming teachers. Finally, an evaluation study with students is still required to make sure that any refined hint strategy yields better student outcomes compared to current hint strategies.

7. ACKNOWLEDGMENTS
Funding by the German Research Foundation (DFG) under grant number PA 3460/2-1 is gratefully acknowledged.
8. REFERENCES
[1] T. Barnes and J. Stamper. Toward automatic hint generation for logic proof tutoring using historical student data. In B. P. Woolf, E. Aïmeur, R. Nkambou, and S. Lajoie, editors, Proceedings of the International Conference on Intelligent Tutoring Systems (ITS 2008), pages 373–382, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.
[2] R. R. Choudhury, H. Yin, and A. Fox. Scale-driven automatic hint generation for coding style. In A. Micarelli, J. Stamper, and K. Panourgia, editors, Intelligent Tutoring Systems, pages 122–132, Cham, 2016. Springer International Publishing.
[3] S. Chow, K. Yacef, I. Koprinska, and J. Curran. Automated data-driven hints for computer programming students. In Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization (UMAP 2017), pages 5–10, 2017.
[4] A. T. Corbett and J. R. Anderson. Locus of feedback control in computer-based tutoring: Impact on learning rate, achievement and attitudes. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 245–252, 2001.
[5] D. Fossati, B. Di Eugenio, S. Ohlsson, C. Brown, and L. Chen. Data driven automatic feedback generation in the iList intelligent tutoring system. Technology Instruction, Cognition and Learning, 10(1):5–26, 2015.
[6] S. Gross and N. Pinkwart. How do learners behave in help-seeking when given a choice? In C. Conati, N. Heffernan, A. Mitrovic, and M. F. Verdejo, editors, Proceedings of the 17th International Conference on Artificial Intelligence in Education (AIED 2015), pages 600–603, 2015.
[7] M. Maniktala, C. Cody, A. Isvik, N. Lytle, M. Chi, T. Barnes, et al. Extending the hint factory for the assistance dilemma: A novel, data-driven help-need predictor for proactive problem-solving help. Journal of Educational Data Mining, 12(4):24–65, 2020.
[8] S. Marwan, N. Lytle, J. J. Williams, and T. Price. The impact of adding textual explanations to next-step hints in a novice programming environment. In Proceedings of the 2019 ACM Conference on Innovation and Technology in Computer Science Education (ITiCSE '19), pages 520–526, New York, NY, USA, 2019. Association for Computing Machinery.
[9] J. McBroom, B. Paassen, B. Jeffries, I. Koprinska, and K. Yacef. Progress networks as a tool for analysing student programming difficulties. In C. Szabo and J. Sheard, editors, Proceedings of the Twenty-Third Australasian Computing Education Conference (ACE '21), pages 158–167. Association for Computing Machinery, 2021.
[10] B. Paaßen, B. Hammer, T. Price, T. Barnes, S. Gross, and N. Pinkwart. The continuous hint factory - providing hints in vast and sparsely populated edit distance spaces. Journal of Educational Data Mining, 10(1):1–35, 2018.
[11] B. Paaßen, J. McBroom, B. Jeffries, I. Koprinska, and K. Yacef. ast2vec: Utilizing recursive neural encodings of Python programs. Journal of Educational Data Mining, 2021. In press.
[12] C. Piech, M. Sahami, J. Huang, and L. Guibas. Autonomously generating hints by inferring problem solving policies. In G. Kiczales, D. Russell, and B. Woolf, editors, Proceedings of the Second ACM Conference on Learning @ Scale (L@S 2015), pages 195–204, 2015.
[13] T. Price, R. Zhi, and T. Barnes. Evaluation of a data-driven feedback algorithm for open-ended programming. In X. Hu, T. Barnes, and P. Inventado, editors, Proceedings of the 10th International Conference on Educational Data Mining (EDM 2017), pages 192–197, 2017.
[14] T. W. Price, Y. Dong, R. Zhi, B. Paaßen, N. Lytle, V. Cateté, and T. Barnes. A comparison of the quality of data-driven programming hint generation algorithms. International Journal of Artificial Intelligence in Education, 29(3):368–395, 2019.
[15] K. Rivers and K. R. Koedinger. Data-driven hint generation in vast solution spaces: a self-improving Python programming tutor. International Journal of Artificial Intelligence in Education, 27(1):37–64, 2017.