Next Steps for Next-step Hints: Lessons Learned from Teacher Evaluations of Automatic Programming Hints

Benjamin Paaßen* (Institute of Informatics, Humboldt-University of Berlin, benjamin.paassen@hu-berlin.de)
Jessica McBroom (School of Computer Science, The University of Sydney, jmcb6755@uni.sydney.edu.au)
Bryn Jeffries (Grok Learning, bryn@groklearning.com)
Irena Koprinska (School of Computer Science, The University of Sydney, irena.koprinska@sydney.edu.au)
Kalina Yacef (School of Computer Science, The University of Sydney, kalina.yacef@sydney.edu.au)

*Corresponding Author

Joint Proceedings of the Workshops of the 14th International Conference on Educational Data Mining (EDM 2021); Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

ABSTRACT
Next-step programming hints have attracted considerable research attention in recent years, with many new techniques being developed for a variety of contexts. However, evaluating next-step hints is still a challenge. We performed a pilot study in which teachers (N = 7) rated automatic next-step hints, both quantitatively and qualitatively, providing reasons for their ratings. Additionally, we asked teachers to write a free-form hint themselves. We found that teachers tended to prefer higher-level hints over syntax-based hints, and that the differences between hint techniques were often less important to teachers than the format of the generated hints. Based on these results, we propose modifications to next-step hint strategies to increase their similarity to human teacher feedback, and suggest this as a potential avenue for improving their effectiveness.

Keywords
computer science education, next-step hints, data-driven feedback, teacher evaluation

1. INTRODUCTION
To support students in solving practical programming tasks, many automatic feedback strategies provide next-step hints, i.e. they select a target program that is closer to a correct solution and provide feedback based on the contrast between the student's current program and the target program (e.g. [3, 6, 10, 11, 13-15]). Next-step hints are compelling because they do not require teacher intervention. Instead, they utilize historical student data and, as such, can be fully automated [10]. However, it remains challenging to evaluate next-step hints. Price et al. [14] identified at least three different criteria for grading next-step hints: how often they are available (coverage), how they impact student outcomes, such as task completion speed and learning gain, and how well they align with expert opinions. Importantly, the relation between these criteria is not trivial, and different ways of presenting next-step hints can influence their effect. For example, Marwan et al. [8] found that adding textual explanations improved hint quality in expert eyes but did not influence student outcomes.

Our main contribution in this paper is to combine quantitative ratings with qualitative explanations. In other words, we do not only investigate differences in teacher ratings, but also why teachers preferred some hints over others. To this end, we performed a survey with N = 7 teachers, asking them to grade next-step hints generated by three different methods across three programming tasks in Python. Our overarching research questions are:

RQ1 Do teachers' ratings differ between hint methods?

RQ2 Do automatic hints align with teacher hints?

RQ3 What are teachers' reasons for preferring some hints over others?

This paper is structured as follows: Section 2 discusses related work in more detail, Section 3 describes the setup of our study, Section 4 describes the results and, finally, Sections 5-6 discuss and summarize the implications of our work.

2. RELATED WORK
Prior work on evaluating next-step hints broadly falls into three categories: technical criteria, outcomes for students, and expert opinions [14].

Technical criteria are mostly concerned with the availability of hints and are motivated by the cold start problem, i.e.
the problem that data-driven hint generation requires a certain set of data to become possible [1]. Over the years, this problem has arguably become less critical as multiple methods are now available which require very little training data, such as [6, 10, 11, 13, 15]. In this paper we restrict ourselves to these methods and therefore omit such criteria.

Regarding student outcomes, prior studies have already shown that data-driven next-step hints can yield similar learning gains to working with human teachers [5], can improve solution quality [2], and can improve completion speed [4]. The challenge in applying such criteria is that they require a study design in which an intervention group works online on a task with hint support, which was beyond the scope of our pilot study.

An alternative which requires fewer resources is offered by expert opinions, i.e. ratings by experienced programming teachers on the quality of hints. In particular, Price et al. [13] have suggested three scales (relevance, progress, and interpretability) to grade hint quality and have shown that expert ratings on these scales are related to the likelihood of students accepting hints in the future. Further, both Piech et al. [12] and Price et al. [14] asked teachers to generate next-step hints themselves and evaluated the overlap between the teacher hints and automatic hints as a measure of quality. Importantly, a next-step hint may be affected not only by the selected target program but also by how the hint is presented. For example, Marwan et al. [8] found that adding textual explanations improved expert quality ratings, but not student outcomes.

In our work, we combine aspects of this prior work with qualitative questions. In particular, we use a variation of the three scales of Price et al. [13] for quantitative ratings of hint quality and let teachers provide their own hints to evaluate overlap, akin to [12, 14]. Additionally, we ask teachers to provide a textual explanation for why they would give a hint and why they would choose not to give one of the automatic hints.

3. METHOD
In this section, we cover the setup for our survey, beginning with the programming data sets we used, followed by the mechanism to select specific examples, the hint methods, and the recruitment for the survey itself.

3.1 Programming data sets
In order to provide realistic stimulus material, we selected our programs from three real-world, large-scale data sets of program traces in introductory programming. Namely, we considered data from the 2018 (beginner challenge) and 2019 (beginner and intermediate challenges) National Computer Science School (NCSS)¹, an introductory computer science curriculum for (mostly Australian) school children in grades 5-10. 12,876 students were enrolled in the beginners 2018 challenge, 11,181 students in the beginners 2019 challenge, and 7,854 students in the intermediate 2019 challenge. Each challenge consisted of about twenty-five programming tasks of ascending difficulty, each of which was annotated with unit tests. In all cases, we only considered submissions, i.e. programs that students deliberately submitted for evaluation against unit tests.

¹ https://ncss.edu.au

3.2 Example selection
Our goal in this study was to evaluate the quality of automatic hints in a range of realistic situations where students were likely to need help and where feedback generation was non-trivial. For the purpose of this study, we considered a program as indicative of help-need if at least five students who submitted this program failed the same or more unit tests in the next step of their development. This is in line with prior work of [9] and [7], who both suggest that help is needed if students repeatedly fail to make progress.

As a proxy for non-triviality, we considered the tree edit distance to the top-100 most frequent submissions for the same programming task. If this tree edit distance is low, providing automatic hints is simple: we can retrieve the nearest neighbor according to tree edit distance and use a successful continuation of this nearest neighbor as a hint, as suggested by [6]. However, if this distance is high, we are in a region of the space of possible programs that is not frequently visited by students and, hence, harder to cover for an automatic hint system.

In the end, we selected for each of the three challenges the program which maximized the tree edit distance to frequent programs and indicated help-need. The resulting submissions are shown in Figure 1, alongside a description of the respective programming task and an example solution.
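To make the selection procedure concrete, the following is a minimal sketch of how it could be implemented. The trace format (a list of (program source, number of failed unit tests) pairs per student) and the crude node-multiset distance are illustrative stand-ins of our own, not the original pipeline, which used a proper tree edit distance on syntax trees.

import ast
from collections import Counter

def tree_distance(src_a, src_b):
    # Cheap stand-in for a true tree edit distance (e.g. Zhang-Shasha):
    # the symmetric difference of AST node-type multisets. It ignores
    # tree structure but keeps the sketch self-contained.
    count_a = Counter(type(n).__name__ for n in ast.walk(ast.parse(src_a)))
    count_b = Counter(type(n).__name__ for n in ast.walk(ast.parse(src_b)))
    return sum(((count_a - count_b) + (count_b - count_a)).values())

def select_example(traces, top_k=100, min_students=5):
    # traces: list of student traces, each a list of
    # (program_source, num_failed_unit_tests) pairs in submission order.
    counts = Counter(prog for trace in traces for prog, _ in trace)
    frequent = [prog for prog, _ in counts.most_common(top_k)]

    # Help-need criterion: at least min_students students who submitted
    # this program failed the same number of unit tests or more on
    # their *next* submission.
    stuck = Counter()
    for trace in traces:
        for (prog, failed), (_, failed_next) in zip(trace, trace[1:]):
            if failed_next >= failed:
                stuck[prog] += 1
    help_need = [prog for prog, n in stuck.items() if n >= min_students]

    # Non-triviality criterion: pick the help-need program farthest from
    # all frequent submissions, i.e. the hardest case for data-driven
    # hint generation.
    return max(help_need,
               key=lambda p: min(tree_distance(p, q) for q in frequent))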

3.3 Hint generation
We considered three techniques to produce next-step hints.

Firstly, we used one-nearest neighbor (1NN) prediction [6], i.e. we selected the nearest neighbor to the help-seeking program in the training data and recommended its successor. Distance was measured according to the tree edit distance, as used e.g. by [10, 15].
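A minimal sketch of this 1NN scheme is given below. It reuses the trace format and the stand-in tree_distance function from the previous sketch, both of which are our own illustrative assumptions.

def one_nn_hint(student_prog, traces, distance):
    # Find the training submission most similar to the student's current
    # program and recommend the submission its author made next.
    best_next, best_dist = None, float("inf")
    for trace in traces:
        for (prog, _), (next_prog, _) in zip(trace, trace[1:]):
            d = distance(student_prog, prog)
            if d < best_dist:
                best_next, best_dist = next_prog, d
    return best_next

# Usage with hypothetical data:
# hint = one_nn_hint(current_submission, traces, tree_distance)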

Secondly, we used the continuous hint factory (CHF) [10], which extends the one-nearest neighbor approach by computing a weighted average of multiple close neighbors and then constructing the program which is closest to this weighted average. Since this construction occurs in the space of syntax trees, it does not come with variable or function names attached. We therefore consider two versions: For the first two tasks, we present an 'abstract' program version where all variables and functions are named 'x'. For the last task, we instead use the nearest neighbor in the training data to the weighted average.
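CHF proper solves a reconstruction problem in the space of syntax trees; the sketch below only illustrates the weighted-average idea behind it. We assume, for simplicity, uniform weights over the k nearest successors and an RBF kernel over the stand-in tree distance; the kernel trick then yields the distance of any candidate to the implicit average, and we select the closest existing successor rather than constructing a new tree, which is a deliberate simplification.

import math

def chf_like_hint(student_prog, traces, distance, k=5, sigma=2.0):
    # Collect (submission, successor) pairs, sorted by similarity to
    # the student's current program.
    pairs = [(prog, nxt) for trace in traces
             for (prog, _), (nxt, _) in zip(trace, trace[1:])]
    pairs.sort(key=lambda pn: distance(student_prog, pn[0]))
    successors = [nxt for _, nxt in pairs[:k]]

    def kernel(a, b):
        # RBF kernel over the tree distance, assumed to act like an
        # inner product in some feature space.
        return math.exp(-distance(a, b) ** 2 / (2 * sigma ** 2))

    # Squared distance of phi(p) to the uniform average of the
    # phi(successor), computed purely from kernel evaluations.
    w = 1.0 / len(successors)
    const = w * w * sum(kernel(a, b) for a in successors for b in successors)

    def dist_to_average_sq(p):
        return kernel(p, p) - 2 * w * sum(kernel(p, x) for x in successors) + const

    # CHF constructs a *new* program near the average; as a stand-in we
    # merely return the existing successor closest to it.
    return min(successors, key=dist_to_average_sq)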
Finally, we used the ast2vec neural network [11] to first translate the student's current program into a vector, then predict how this vector should change via linear regression, and decode this predicted vector back into a syntax tree. To provide function names as well, we trained a classifier that mapped ast2vec encodings for subtrees to typical function names in the training data, and we automatically copied variable names and strings from the student's current program, as suggested by [11].
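The vector-space step of this approach can be illustrated with a short numpy sketch. The encoder and decoder are the core of ast2vec and are not reproduced here: we assume the encodings are already given as arrays, and we 'decode' by returning the known program whose encoding is nearest to the predicted vector, again a deliberate simplification.

import numpy as np

def fit_step_regression(X_current, X_next):
    # Fit a linear map W with X_current @ W ~ X_next via least squares.
    # X_current, X_next: (n, d) arrays holding vector encodings of
    # programs and of their respective next steps (assumed given,
    # e.g. computed with ast2vec).
    W, *_ = np.linalg.lstsq(X_current, X_next, rcond=None)
    return W

def vector_space_hint(x, W, candidate_vectors, candidate_programs):
    # Predict where the student's encoding x should move next, then
    # return the candidate program whose encoding is closest to the
    # prediction (a stand-in for ast2vec's neural decoder).
    target = x @ W
    dists = np.linalg.norm(candidate_vectors - target, axis=1)
    return candidate_programs[int(np.argmin(dists))]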
In all cases, the hint was formatted as a program which the student might try next to improve their current program. We used a random sample of 30 student traces from the same task as training data to simulate a 'cold start' with only a classroom-sized training data set. The resulting hints of all three methods for all three tasks are shown in Figure 2.
recipe task
You're opening a boutique pie shop. You have lots of crazy pie ideas, but you need to keep them secret! Write a program that asks for a pie idea, and encodes it as the numeric code for each letter, using the ord function. Print the code for each letter on a new line.

recipe submission
 1  msg = input('Pie idea: ')
 2  code = ord(msg[0])
 3  code1 = ord(msg[1])
 4  code2 = ord(msg[2])
 5  code3 = ord(msg[3])
 6  code4 = ord(msg[4])
 7  code5 = ord(msg[5])
 8  print(code)
 9  print(code1)
10  print(code2)
11  print(code3)
12  print(code4)
13  print(code5)

recipe solution
 1  msg = input('Pie idea: ')
 2  for a in msg:
 3      print(ord(a))

anagram task
Let's make a computer program that only knows how to say 'hi': It doesn't matter what you type in, it should still print 'Hi, I am a computer'! To make it a bit more exciting, though, we'll add an Easter Egg. The word 'anagram' should trigger a secret message:
Hi, I am a computer!
What are you? anagram
Nag a ram!
Hi, I am a computer!

anagram submission
 1  print('Hi, I am a computer!')
 2  raining = input('What are you? ')
 3  if raining == 'a dog':
 4      print('Hi, I am a computer!')
 5  if raining == 'anagram':
 6      print('Nag a ram!')
 7      print('Hi, I am a computer!')
 8  if raining == 'a person':
 9      print('Hi, I am a computer!')

anagram solution
 1  print('Hi, I am a computer!')
 2  word = input('What are you? ')
 3  if word == 'anagram':
 4      print('Nag a ram!')
 5  print('Hi, I am a computer!')

scoville task
The Scoville scale measures the spiciness of chilli peppers or other spicy foods in Scoville heat units (SHU). For example, a jalapeño has a range between 1,000 to 10,000, and a habanero is between 100,000 and 350,000! Different people have different tolerances to eating chilli peppers. Nam's parents cook with a lot of chilli, and so she enjoys eating foods with a SHU value less than 10000. Michael likes it less spicy, and only enjoys eating foods with a SHU value less than or equal to 120. Write a program to read in the SHU value for some food, and print out who will enjoy the food. For example:
What is the SHU value? 5000
Nam will enjoy this!
and another example:
What is the SHU value? 120
Michael will enjoy this!
Nam will enjoy this!
If neither Michael nor Nam will enjoy the food, your program should output: This food is too spicy for everyone :(

scoville submission
 1  jalapeno = int(input("What is the SHU value? "))
 2  if jalapeno <= 10000:
 3      print('Nam will enjoy this!')
 4  else:
 5      print('This food is too spicy for everyone!')
 6
 7  sugary = int(input("What is the SHU value? "))
 8  if sugary <= 120:
 9      print('Nam will enjoy this!')
10      print('Michael will enjoy this!')
11  else:
12      print('This food is too spicy for everyone!')

scoville solution
 1  shu = int(input("What is the SHU value? "))
 2
 3  if shu < 10000:
 4      print('Nam will enjoy this!')
 5  if shu < 120:
 6      print('Michael will enjoy this!')
 7  if shu >= 10000:
 8      print('This food is too spicy for everyone :(')

Figure 1: The three program examples in our study (recipe, anagram, and scoville), each with the task description the students received, the student submission, and a correct solution.
recipe hints

1NN
1  a = input('Pie idea: ')
2  b = [ord(x) for x in a]
3  print(*b, sep="\n")

CHF
1  x = x('<string>')
2  x = x(x[0])
3  x = x(x[0])
4  x = x(x[0])
5  x = x(x[0])
6  for x in x
7  x(x, keyword='<string>')

ast2vec
1  msg = input('Pie idea: ')
2  code = ord(msg[0])
3  code1 = ord(msg[1])
4  code2 = ord(msg[2])
5  code3 = ord(msg[3])
6  code4 = ord(msg[4])
7  print(code1)
8  print(code1)

anagram hints

1NN
1  print('Hi, I am a computer!')
2  computer = input('What are you? ')
3  if computer == 'anagram':
4      print('Nag a ram!')
5  print('Hi, I am a computer!')

CHF
1  x('<string>')
2  x = x('<string>')
3  if x == '<string>':
4      x('<string>')
5  if x == '<string>':
6      x('<string>')
7      x(x)
8  if x == '<string>':
9      x('<string>')

ast2vec
1  raining = input('What are you? ')
2  if raining == 'anagram':
3      print('Nag a ram!')
4      print('Hi, I am a computer!')
5  if raining == 'a person':
6      print('Hi, I am a computer!')

scoville hints

1NN
1  msg = int(input('What is the SHU value? '))
2  if msg < 10000:
3      print('Nam will enjoy this!')
4  else:
5      print('This food is too spicy for everyone :(')
6  if msg <= 120:
7      print('Michael will enjoy this!')

CHF
1  shu = int(input('What is the SHU value? '))
2  if shu < 10000:
3      print('Nam will enjoy this!')
4  if shu <= 120:
5      print('Michael will enjoy this!')
6  else:
7      print('This food is too spicy for everyone :(')

ast2vec
1  sugary = int(input('What is the SHU value? '))
2  if sugary >= 10000:
3      print('Nam will enjoy this!')
4  if sugary < 0:
5      print('What is the SHU value? ')
6  if sugary < 120:
7      print('Nam will enjoy this!')

Figure 2: The hints of all three methods (1NN, CHF, and ast2vec) for all three student submissions from Figure 1.
3.4 Survey and recruitment
For our study, we recruited N = 7 teachers from programming courses in Australia and Germany. Recruitment was performed via e-mail lists with a survey link. Teachers could then voluntarily and anonymously complete the survey in Microsoft Forms. Participants were first asked about their experience as programming teachers. Six participants responded that they had taught more than three courses, and one participant that they had taught between one and three courses. Participants with no experience were excluded from the study.

We acknowledge that our recruitment strategy has limitations: While we can be reasonably certain that only experienced programming teachers took part, we have no information about the specific courses they taught and whether that matches up with the kind of programming task we investigated in our study. Further, self-selection bias may have occurred as we did not employ a random recruitment strategy.

Next, we presented the first programming task (the recipe task) with the official task description from the National Computer Science School, the example solution, and the student's submission (refer to Figure 1, top left). We asked the teachers whether they thought the student needed help in this situation (on a four-point scale), why (as a free text field), and what edit they would recommend to guide the student (free text field). Further, we presented the three automatic hints in Figure 2 (top left) and asked the teachers how relevant each hint was, how useful it was, and how much the student could learn from it, all on a five-point scale. We defined 'relevant' as 'addressing the core problem of the student's program' and 'useful' as 'getting closer to a correct solution'. These scales correspond to the scales of relevance and progress proposed by [13]. We replaced the 'interpretability' scale of [13] with 'learning' to encourage the teachers to reflect on the learning impact a hint may have.

Finally, we asked the teachers whether they would prefer not to give any of the hints and why (free text). We repeated all questions for the anagram and scoville tasks. To avoid ordering bias, the order of hint methods was shuffled randomly for each participant.

4. RESULTS
We present our results beginning with the teachers' assessment of whether hints were needed at all, followed by the hints given by teachers, and we conclude with the teachers' assessment of the automatic hints.

4.1 Help-need assessment
For each of the programs in Figure 1, we first asked the teachers how much help a student in this situation would need on a four-point scale, ranging from 0 ("The student is on the way to a correct solution and does not need help.") to 3 ("The student seems to have crucial misconceptions and should start from scratch.").

[Figure 3: Teachers' assessment of help-need (x-axis, 0-3) for the three submissions from Figure 1 (recipe, anagram, scoville). 0 corresponds to 'no help needed', 3 to 'the student should start from scratch'. The y-axis shows the number of teachers.]

Figure 3 shows the distribution of responses for each task. Most teachers agreed on response 1 for all tasks ("The student is on the way to a correct solution but could benefit from a hint."). This indicates that our automatic selection indeed identified examples which indicated help-need.

We also asked teachers why they believed that the student did or did not need help. In response to this question, most teachers appeared to analyze which high-level concepts the student had already understood, judging from their program, and which concepts were still missing or were misunderstood. For example, one teacher responded for the recipe task: "They know how to input, they know how to do the ord, they know they need to move through the string, they have just forgotten that there is a 'short cut' to do this in a loop.", and another teacher responded for the anagram task: "The student is not seeing the general rule of the program and is trying to cover possible cases by hand." Further, multiple teachers responded to this question with suggestions on how to provide further guidance and help to the student, such as "It just hasn't clicked that there are other possible inputs that they need to account for. Keeping this fresh in their mind should quickly lead to a solution." or "I think that they should mess around a bit more, but they should get a hint that the input function should only be used once for this problem.".

4.2 Categories of teacher hints
Next, we asked teachers what edit they would recommend for the student's program. Interestingly, teachers generally did not give hints on a syntactic level. In fact, some of them stated explicitly that they thought this was not helpful (e.g. "I would not give students exact syntax because then they blindly follow without understanding"). However, in some cases, they did suggest lines to delete. We found that teacher feedback tended to fit into four general categories:
A suggesting a missing concept, such as a for-loop, an else statement, or a combination of if-statements. For example, "When you have a line of code that you are repeating it can be useful to use a loop like 'for' or 'while'. This also allows you to repeat the code within the body of the loop for a flexible number of times."

B explaining or hinting at situations when the program will not work as expected. E.g. "You're almost there. However, it will only do the right thing when the user writes 'a dog', 'anagram' or 'a person'. You can improve it so that it says 'Hi, I am a computer!' every time, no matter what the user says."

C suggesting the student solve a simpler problem first. E.g. "Suggest that they delete all but the first line and try and print out each letter one at a time."

D suggesting that the student has something unnecessary in their program, or directly telling them to delete it. E.g. "Remove the two irrelevant if statements leaving only the correct 'easter egg' statement".

Table 1 shows a classification of the teacher hints into these four categories. We observe that each hint could be classified in at least one category.

Table 1: Teacher hint types
                  teacher
task       1   2   3   4   5   6     7
recipe     A   A   B   B   A   C,D   A,B
anagram    B   B   A   D   B   B     B
scoville   A   A   A   D   B   D     A,D

4.3 Assessment of automatic hints
Our third set of questions for each task concerned the rating of the automatic hints (refer to Figure 2) according to relevance, usefulness, and learning, each on a five-point scale from -2 ("not at all") to +2 ("very"). Table 2 shows the average ratings (± standard deviation) given by the teachers.

Table 2: Average teacher ratings (± std.) for each of the hints from Figure 2.
method    relevant       useful         learning
recipe
1NN       1.43 ± 0.73    0.86 ± 1.12    −0.14 ± 1.12
CHF       −1.71 ± 0.45   −1.86 ± 0.35   −1.86 ± 0.35
ast2vec   −1.29 ± 0.88   −1.71 ± 0.45   −1.43 ± 0.73
anagram
1NN       1.86 ± 0.35    1.43 ± 0.73    0.14 ± 0.99
CHF       −2.00 ± 0.00   −2.00 ± 0.00   −1.86 ± 0.35
ast2vec   −0.29 ± 1.28   −0.43 ± 0.90   −0.71 ± 1.03
scoville
1NN       1.14 ± 0.35    1.14 ± 0.83    0.14 ± 1.12
CHF       1.43 ± 0.49    1.57 ± 0.73    0.57 ± 1.29
ast2vec   0.29 ± 1.28    −0.29 ± 1.58   −0.29 ± 1.28

We observe that 1NN is generally regarded as relevant and useful, which can be explained by the fact that it always recommended a correct solution for the tasks in our examples (refer to Figures 2 and 1). However, teachers believed that students would not learn particularly much from these hints (ratings around zero). For CHF, we observe strongly negative ratings in all three criteria for the recipe and anagram tasks, where CHF did not provide function and variable names (refer to Figure 2), but positive scores for the scoville task, where it selected a correct solution. On that task, it received even higher scores than 1NN. Ast2vec received negative scores on the recipe task, and scores around zero for all criteria on all other tasks.

Finally, we asked teachers if there were any hints they would prefer not to give, and why. Figure 4 shows how often each method was named for each task (where lower is better).

[Figure 4: The number of teachers (y-axis) who would prefer not to give a certain hint, by hint method (1NN, CHF, ast2vec) and task (recipe, anagram, scoville; x-axis).]

Regarding the reasoning, one teacher always responded that "I would not give students exact syntax because then they blindly follow without understanding.", which excludes all automatic hints. Further, 1NN was often named because it "shows a valid solution. Students may copy the hint code and then use it - without understanding what it does.". The hints provided by CHF for the first two tasks did not include variable and function names (refer to Figure 2), which led teachers to exclude it because it "does not look syntactically valid, and is incomprehensible.". Ast2vec was named least often, albeit by a narrow margin. Reasons for naming it were that it is "not helpful for developing student understanding", would require additional explanation, or would even "harm the learning of the student."

5. DISCUSSION
In this section, we interpret the results in light of our three research questions: Do ratings differ between methods? Do automatic hints align with teacher hints? And what are teachers' reasons for preferring some hints over others?

RQ1: Do teachers' ratings differ between hint methods? We do observe systematic differences between hint methods in terms of ratings. However, these differences appear to be driven less by the underlying algorithm than by two factors: a) whether a correct solution or only a partial solution was selected, and b) whether the hint was presented as a program with function and variable names or not. Teachers only gave high scores for usefulness and relevance if a correct solution was given, and gave very low scores if function and variable names were missing. They never gave high scores for learning.

RQ2: Do automatic hints align with teacher hints? We observe that teacher hints do not directly align with automatic hints because teachers generally suggested hints which were higher-level, like adding missing concepts or deleting lines to get to a more compact program. More generally, we identified four categories (refer to Section 4.2) which cover the kinds of edits teachers would have given themselves. To align automatic hints with teacher hints, we could employ automatic heuristics which post-process next-step hints, such as the following (a code sketch of two of these heuristics is given after the list):

A - Missing Concepts Compare the nodes of the student's current AST to those of the next step, then suggest concepts corresponding to missing nodes.

B - Mishandled Situations Apply test cases to the current and the predicted program, then suggest the student focus on inputs that work for the next step but not for the current step.

C - Simpler Problem Similarly to A, first identify missing concepts in the student's work, then suggest easier programming tasks that contain this concept (e.g. from earlier in the course).

D - Deletions Compare the current step to the next step and, if lines are deleted, ask the student if these lines are necessary.
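As a proof of concept, the following is a minimal sketch of heuristics A and D on top of Python's ast module. The concept vocabulary is a hypothetical mapping covering only a few node types; heuristic B would additionally require executing the unit tests, which we omit here.

import ast

# Hypothetical mapping from AST node types to curriculum concepts;
# a real system would cover the full vocabulary of the course.
CONCEPT_NAMES = {
    "For": "a for-loop",
    "While": "a while-loop",
    "If": "an if-statement",
    "FunctionDef": "a function definition",
}

def missing_concept_hints(current_src, next_src):
    # Heuristic A: suggest concepts whose node types occur in the
    # predicted next step but not in the student's current program.
    current = {type(n).__name__ for n in ast.walk(ast.parse(current_src))}
    nxt = {type(n).__name__ for n in ast.walk(ast.parse(next_src))}
    return [f"Maybe you could try {CONCEPT_NAMES[c]}."
            for c in nxt - current if c in CONCEPT_NAMES]

def deletion_hint(current_src, next_src):
    # Heuristic D: if the next step drops lines of the current program,
    # ask the student whether those lines are really necessary.
    next_lines = {line.strip() for line in next_src.splitlines()}
    dropped = [line.strip() for line in current_src.splitlines()
               if line.strip() and line.strip() not in next_lines]
    if dropped:
        return "Do you really need the following lines? " + " / ".join(dropped)
    return None

def abstracted_hints(current_src, next_src):
    hints = missing_concept_hints(current_src, next_src)
    deletion = deletion_hint(current_src, next_src)
    if deletion is not None:
        hints.append(deletion)
    return hints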
Applying these strategies, the automated hints for the tasks in Figure 2 might then become the hints shown in Table 3, which align better with the hints given by teachers.

Table 3: Abstracted hints based on the four hint types from Table 1.
     method              abstracted hint
recipe hints
A    1NN, CHF            Maybe you could try a for-loop.
B    1NN                 What happens when the user types 'apple'?
C    1NN, CHF            Try doing this task on for-loops first
D    1NN, CHF, ast2vec   Can you think of a way to reduce the number of print statements?
anagram hints
B    1NN                 What happens when the user types 'cat'?
D    CHF, ast2vec        Can you think of a way to use fewer if statements?
scoville hints
B    1NN, CHF            What happens when the user types 120?
D    1NN, CHF, ast2vec   How can you reduce the number of input calls? (and/or) Try using only one variable.

We note in passing that abstraction may also make it easier to provide helpful hints because the hint method does not need to get every detail (such as function or variable names) right, merely the rough direction. For example, we notice that ast2vec uses the wrong string in line 5 of its hint and the wrong comparison constant in line 4 (Figure 2, bottom). This would not be an issue in the abstracted hint.

Still, we acknowledge that this approach has limitations: For strategies A and C, we implicitly assume that 'concepts' coincide with syntactic building blocks, e.g.: 'have you tried adding a for loop?'. This assumption likely breaks down in more advanced programming classes.

RQ3: What are teachers' reasons for preferring some hints over others? Teachers mostly explained hints by concepts that were still missing in a program (like loops), undesired functional behavior for additional cases, or superfluous code. These reasons align with the hints the teachers gave. Importantly, many teachers also emphasized that the student had already gotten many things right, indicating that teachers were motivated to preserve the progress that had already been made. The main reason for not providing a hint appeared to be that the teachers were concerned that the hint may not lead to any learning, either because the hint was syntactic, and hence not abstract enough, because the hint was a correct solution, or because the hint was 'incomprehensible' due to missing function or variable names. Overall, the reasons provided by teachers underline our finding that teachers do not only care about the content of a next-step hint but also, perhaps mainly, about how it is communicated, i.e. with a high-level explanation instead of a syntactic edit and without revealing the solution.

6. CONCLUSION
We performed a survey with N = 7 teachers to evaluate hints from three hint methods (1NN, CHF, and ast2vec) and to investigate three research questions: Do quantitative ratings differ between methods? Do automatic hints align with teacher hints? And what are teachers' reasons for preferring some hints over others?

We found that teachers generally had a low opinion of syntactic next-step hints, irrespective of the method. Differences in ratings could be explained by two factors: whether the hint was a correct solution (then it was regarded as relevant and useful) or not (then all ratings were around zero), and whether the hint used human-readable variable and function names or not (otherwise the ratings became strongly negative).

Instead of syntactic hints, teachers preferred higher-level hints which suggested a missing concept, pointed out inputs which were not appropriately covered by the current program, suggested a simpler problem first, or proposed to remove superfluous lines.

Finally, the main concern of teachers for not giving hints was students' learning. They disregarded both syntactic hints as well as showing a correct solution because they were concerned that students might naïvely apply the hint without reflecting on it sufficiently. This points to a potential gap in current next-step hint approaches, which are mainly focused on suggesting changes, but less on inviting reflection or abstracting to a higher level. For introductory programming courses, it may be sufficient to just post-process syntax-level hints with simple heuristics, such as the ones proposed in the previous section. Additionally, one can introduce textual explanations as suggested by [8]. However, further research is needed to investigate whether it is sufficient to change how next-step hints are communicated or whether deeper changes in hint methods are necessary to achieve a better alignment with the pedagogical expertise of programming teachers. Finally, an evaluation study with students is still required to make sure that any refined hint strategy yields better student outcomes compared to current hint strategies.

7. ACKNOWLEDGMENTS
Funding by the German Research Foundation (DFG) under grant number PA 3460/2-1 is gratefully acknowledged.
8. REFERENCES
[1] T. Barnes and J. Stamper. Toward automatic hint generation for logic proof tutoring using historical student data. In B. P. Woolf, E. Aïmeur, R. Nkambou, and S. Lajoie, editors, Proceedings of the International Conference on Intelligent Tutoring Systems (ITS 2008), pages 373-382, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.
[2] R. R. Choudhury, H. Yin, and A. Fox. Scale-driven automatic hint generation for coding style. In A. Micarelli, J. Stamper, and K. Panourgia, editors, Intelligent Tutoring Systems, pages 122-132, Cham, 2016. Springer International Publishing.
[3] S. Chow, K. Yacef, I. Koprinska, and J. Curran. Automated data-driven hints for computer programming students. In Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization (UMAP 2017), pages 5-10, 2017.
[4] A. T. Corbett and J. R. Anderson. Locus of feedback control in computer-based tutoring: Impact on learning rate, achievement and attitudes. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 245-252, 2001.
[5] D. Fossati, B. Di Eugenio, S. Ohlsson, C. Brown, and L. Chen. Data driven automatic feedback generation in the iList intelligent tutoring system. Technology Instruction, Cognition and Learning, 10(1):5-26, 2015.
[6] S. Gross and N. Pinkwart. How do learners behave in help-seeking when given a choice? In C. Conati, N. Heffernan, A. Mitrovic, and M. F. Verdejo, editors, Proceedings of the 17th International Conference on Artificial Intelligence in Education (AIED 2015), pages 600-603, 2015.
[7] M. Maniktala, C. Cody, A. Isvik, N. Lytle, M. Chi, T. Barnes, et al. Extending the hint factory for the assistance dilemma: A novel, data-driven help-need predictor for proactive problem-solving help. Journal of Educational Data Mining, 12(4):24-65, 2020.
[8] S. Marwan, N. Lytle, J. J. Williams, and T. Price. The impact of adding textual explanations to next-step hints in a novice programming environment. In Proceedings of the 2019 ACM Conference on Innovation and Technology in Computer Science Education, ITiCSE '19, pages 520-526, New York, NY, USA, 2019. Association for Computing Machinery.
[9] J. McBroom, B. Paassen, B. Jeffries, I. Koprinska, and K. Yacef. Progress networks as a tool for analysing student programming difficulties. In C. Szabo and J. Sheard, editors, Proceedings of the Twenty-Third Australasian Computing Education Conference (ACE '21), pages 158-167. Association for Computing Machinery, 2021.
[10] B. Paaßen, B. Hammer, T. Price, T. Barnes, S. Gross, and N. Pinkwart. The continuous hint factory - providing hints in vast and sparsely populated edit distance spaces. Journal of Educational Data Mining, 10(1):1-35, 2018.
[11] B. Paaßen, J. McBroom, B. Jeffries, I. Koprinska, and K. Yacef. ast2vec: Utilizing recursive neural encodings of Python programs. Journal of Educational Data Mining, 2021. In press.
[12] C. Piech, M. Sahami, J. Huang, and L. Guibas. Autonomously generating hints by inferring problem solving policies. In G. Kiczales, D. Russell, and B. Woolf, editors, Proceedings of the Second ACM Conference on Learning @ Scale (L@S 2015), pages 195-204, 2015.
[13] T. Price, R. Zhi, and T. Barnes. Evaluation of a data-driven feedback algorithm for open-ended programming. In X. Hu, T. Barnes, and P. Inventado, editors, Proceedings of the 10th International Conference on Educational Data Mining (EDM 2017), pages 192-197, 2017.
[14] T. W. Price, Y. Dong, R. Zhi, B. Paaßen, N. Lytle, V. Cateté, and T. Barnes. A comparison of the quality of data-driven programming hint generation algorithms. International Journal of Artificial Intelligence in Education, 29(3):368-395, 2019.
[15] K. Rivers and K. R. Koedinger. Data-driven hint generation in vast solution spaces: a self-improving Python programming tutor. International Journal of Artificial Intelligence in Education, 27(1):37-64, 2017.