Next Steps for Next-step Hints: Lessons Learned from Teacher Evaluations of Automatic Programming Hints

Benjamin Paaßen (corresponding author), Institute of Informatics, Humboldt-University of Berlin, benjamin.paassen@hu-berlin.de
Jessica McBroom, School of Computer Science, The University of Sydney, jmcb6755@uni.sydney.edu.au
Bryn Jeffries, Grok Learning, bryn@groklearning.com
Irena Koprinska, School of Computer Science, The University of Sydney, irena.koprinska@sydney.edu.au
Kalina Yacef, School of Computer Science, The University of Sydney, kalina.yacef@sydney.edu.au

ABSTRACT
Next-step programming hints have attracted considerable research attention in recent years, with many new techniques being developed for a variety of contexts. However, evaluating next-step hints is still a challenge. We performed a pilot study in which teachers (N = 7) rated automatic next-step hints, both quantitatively and qualitatively, providing reasons for their ratings. Additionally, we asked teachers to write a free-form hint themselves. We found that teachers tended to prefer higher-level hints over syntax-based hints, and that the differences between hint techniques were often less important to teachers than the format of the generated hints. Based on these results, we propose modifications to next-step hint strategies to increase their similarity to human teacher feedback, and suggest this as a potential avenue for improving their effectiveness.

Keywords
computer science education, next-step hints, data-driven feedback, teacher evaluation

1. INTRODUCTION
To support students in solving practical programming tasks, many automatic feedback strategies provide next-step hints, i.e. they select a target program that is closer to a correct solution and provide feedback based on the contrast between the student's current program and the target program (e.g. [3, 6, 10, 11, 13-15]). Next-step hints are compelling because they do not require teacher intervention. Instead, they utilize historical student data and, as such, can be fully automated [10]. However, it remains challenging to evaluate next-step hints. Price et al. [14] found at least three different criteria to grade next-step hints: how often they are available (coverage), how they impact student outcomes, such as task completion speed and learning gain, and how well they align with expert opinions. Importantly, the relation between these criteria is not trivial, and different ways to present next-step hints can influence their effect. For example, Marwan et al. [8] found that adding textual explanations improved hint quality in expert eyes but did not influence student outcomes.

Our main contribution in this paper is to combine quantitative ratings with qualitative explanations. In other words, we do not only investigate differences in teacher ratings, but also why teachers preferred some hints over others. To this end, we performed a survey with N = 7 teachers, asking them to grade next-step hints generated by three different methods across three programming tasks in Python. Our overarching research questions are:

RQ1 Do teachers' ratings differ between hint methods?

RQ2 Do automatic hints align with teacher hints?

RQ3 What are teachers' reasons for preferring some hints over others?

This paper is set out as follows: Section 2 discusses related work in more detail, Section 3 describes the setup of our study, Section 4 describes the results and, finally, Sections 5-6 discuss and summarize the implications of our work.
2. RELATED WORK
Prior work on evaluating next-step hints broadly falls into three categories: technical criteria, outcomes for students, and expert opinions [14].

Technical criteria are mostly concerned with the availability of hints and motivated by the cold start problem, i.e. the problem that data-driven hint generation requires a certain amount of data to become possible [1]. Over the years, this problem has arguably become less critical as multiple methods are now available which require very little training data, such as [6, 10, 11, 13, 15]. In this paper we restrict ourselves to these methods and therefore omit such criteria.

Regarding student outcomes, prior studies have already shown that data-driven next-step hints can yield similar learning gains to working with human teachers [5], can improve solution quality [2], and can improve completion speed [4]. The challenge in applying such criteria is that they require a study design in which an intervention group works on-line on a task with hint support, which was beyond the scope of our pilot study.

An alternative which requires fewer resources is offered by expert opinions, i.e. ratings by experienced programming teachers on the quality of hints. In particular, Price et al. [13] have suggested three scales (relevance, progress, and interpretability) to grade hint quality and have shown that expert ratings on these scales are related to the likelihood of students accepting hints in the future. Further, both Piech et al. [12] and Price et al. [14] asked teachers to generate next-step hints themselves and evaluated the overlap between the teacher hints and automatic hints as a measure of quality. Importantly, a next-step hint may be affected not only by the selected target program but also by how the hint is presented. For example, Marwan et al. [8] found that adding textual explanations improved expert quality ratings, but not student outcomes.

In our work, we combine aspects of this prior work with qualitative questions. In particular, we use a variation of the three scales of Price et al. [13] for quantitative ratings of hint quality and let teachers provide their own hints to evaluate overlap, akin to [12, 14]. Additionally, we ask teachers to provide a textual explanation for why they would give a hint and why they would choose not to give one of the automatic hints.

3. METHOD
In this section, we cover the setup for our survey, beginning with the programming data sets we used, followed by the mechanism to select specific examples, the hint methods, and the recruitment for the survey itself.

3.1 Programming data sets
In order to provide realistic stimulus material, we selected our programs from three real-world, large-scale data sets of program traces in introductory programming. Namely, we considered data from the 2018 (beginner challenge) and 2019 (beginner and intermediate challenges) National Computer Science School (NCSS, https://ncss.edu.au), an introductory computer science curriculum for (mostly Australian) school children in grades 5-10. 12,876 students were enrolled in the beginners 2018 challenge, 11,181 students in the beginners 2019 challenge, and 7,854 students in the intermediate 2019 challenge. Each challenge consisted of about twenty-five programming tasks in ascending difficulty, each of which was annotated with unit tests. In all cases, we only considered submissions, i.e. programs that students deliberately submitted for evaluation against unit tests.

3.2 Example selection
Our goal in this study was to evaluate the quality of automatic hints in a range of realistic situations where students were likely to need help and where feedback generation was non-trivial. For the purpose of this study, we considered a program as indicative of help-need if at least five students who submitted this program failed the same number of unit tests or more in the next step of their development. This is in line with prior work by [9] and [7], who both suggest that help is needed if students repeatedly fail to make progress.

As a proxy for non-triviality we considered the tree edit distance to the top-100 most frequent submissions for the same programming task. If this tree edit distance is low, providing automatic hints is simple: we can retrieve the nearest neighbor according to tree edit distance and use a successful continuation of this nearest neighbor as a hint, as suggested by [6]. However, if this distance is high, we are in a region of the space of possible programs that is not frequently visited by students and, hence, harder to cover for an automatic hint system.

In the end, we selected for each of the three challenges the program which maximized the tree edit distance to frequent programs and indicated help-need. The resulting submissions are shown in Figure 1, alongside a description of the respective programming task and an example solution.
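For concreteness, the selection procedure can be sketched in a few lines of Python. The sketch below is an illustration only, not the code we used: it assumes that each student trace is available as a list of (source code, number of failed unit tests) pairs, that tree_edit_distance stands in for any tree edit distance over parsed programs, and that the distance to the frequent submissions is aggregated as a minimum.

    from collections import Counter

    def select_example(traces, tree_edit_distance, min_stuck_students=5, top_k=100):
        # Help-need criterion: count how many students who submitted a given
        # program failed the same number of unit tests or more in their next step.
        stuck_counts = Counter()
        frequencies = Counter()
        for trace in traces:
            for (code, failed), (_, next_failed) in zip(trace, trace[1:]):
                frequencies[code] += 1
                if next_failed >= failed:
                    stuck_counts[code] += 1
        help_need = [code for code, n in stuck_counts.items()
                     if n >= min_stuck_students]

        # Non-triviality criterion: tree edit distance to the top-100 most
        # frequent submissions (aggregated here as the minimum distance).
        frequent = [code for code, _ in frequencies.most_common(top_k)]

        def distance_to_frequent(code):
            return min(tree_edit_distance(code, other) for other in frequent)

        # Select the help-need program that frequency-based hints cover worst.
        return max(help_need, key=distance_to_frequent)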
3.3 Hint generation
We considered three techniques to produce next-step hints.

Firstly, we used one-nearest-neighbor (1NN) prediction [6], i.e. we selected the nearest neighbor to the help-seeking program in the training data and recommended its successor. Distance was measured according to the tree edit distance, as used e.g. by [10, 15].

Secondly, we used the continuous hint factory (CHF) [10], which extends the one-nearest-neighbor approach by computing a weighted average of multiple close neighbors and then constructs the program which is closest to this weighted average. Since this construction occurs in the space of syntax trees, it does not come with variable or function names attached. We therefore consider two versions: For the first two tasks, we present an 'abstract' program version where all variables and functions are named 'x'. For the last task, we instead use the nearest neighbor in the training data to the weighted average.

Finally, we used the ast2vec neural network [11] to first translate the student's current program into a vector, then predict how this vector should change via linear regression, and decode this predicted vector back into a syntax tree. To provide function names as well, we trained a classifier that mapped ast2vec encodings of subtrees to typical function names in the training data, and we automatically copied variable names and strings from the student's current program, as suggested by [11].
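To illustrate the simplest of these methods, the following sketch (again an illustration under simplifying assumptions, not the actual implementations of [6, 10, 11, 15]) generates a 1NN hint: it retrieves the training submission closest to the student's program under the tree edit distance and returns the program that the corresponding student submitted next. CHF and ast2vec replace this retrieval step by interpolation in the edit-distance space and by neural encoding and decoding, respectively, as described above.

    def one_nearest_neighbor_hint(student_code, training_traces, tree_edit_distance):
        # Find the training submission closest to the student's program (under
        # the tree edit distance) that still has a successor in its trace, and
        # return that successor as the next-step hint.
        best_hint, best_distance = None, float('inf')
        for trace in training_traces:
            for current, successor in zip(trace, trace[1:]):
                d = tree_edit_distance(student_code, current)
                if d < best_distance:
                    best_distance, best_hint = d, successor
        return best_hint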
recipe task
"You're opening a boutique pie shop. You have lots of crazy pie ideas, but you need to keep them secret! Write a program that asks for a pie idea, and encodes it as the numeric code for each letter, using the ord function. Print the code for each letter on a new line."

recipe submission

    msg = input('Pie idea: ')
    code = ord(msg[0])
    code1 = ord(msg[1])
    code2 = ord(msg[2])
    code3 = ord(msg[3])
    code4 = ord(msg[4])
    code5 = ord(msg[5])
    print(code)
    print(code1)
    print(code2)
    print(code3)
    print(code4)
    print(code5)

recipe solution

    msg = input('Pie idea: ')
    for a in msg:
        print(ord(a))

anagram task
"Let's make a computer program that only knows how to say 'hi': It doesn't matter what you type in, it should still print 'Hi, I am a computer'! To make it a bit more exciting, though, we'll add an Easter Egg. The word 'anagram' should trigger a secret message:

    Hi, I am a computer!
    What are you? anagram
    Nag a ram!
    Hi, I am a computer!"

anagram submission

    print('Hi, I am a computer!')
    raining = input('What are you? ')
    if raining == 'a dog':
        print('Hi, I am a computer!')
    if raining == 'anagram':
        print('Nag a ram!')
        print('Hi, I am a computer!')
    if raining == 'a person':
        print('Hi, I am a computer!')

anagram solution

    print('Hi, I am a computer!')
    word = input('What are you? ')
    if word == 'anagram':
        print('Nag a ram!')
    print('Hi, I am a computer!')

scoville task
"The Scoville scale measures the spiciness of chilli peppers or other spicy foods in Scoville heat units (SHU). For example, a jalapeño has a range between 1,000 to 10,000, and a habanero is between 100,000 and 350,000! Different people have different tolerances to eating chilli peppers. Nam's parents cook with a lot of chilli, and so she enjoys eating foods with a SHU value less than 10000. Michael likes it less spicy, and only enjoys eating foods with a SHU value less than or equal to 120. Write a program to read in the SHU value for some food, and print out who will enjoy the food. For example:

    What is the SHU value? 5000
    Nam will enjoy this!

and another example:

    What is the SHU value? 120
    Michael will enjoy this!
    Nam will enjoy this!

If neither Michael nor Nam will enjoy the food, your program should output:

    This food is too spicy for everyone :("

scoville submission

    jalapeno = int(input("What is the SHU value? "))
    if jalapeno <= 10000:
        print('Nam will enjoy this!')
    else:
        print('This food is too spicy for everyone!')

    sugary = int(input("What is the SHU value? "))
    if sugary <= 120:
        print('Nam will enjoy this!')
        print('Michael will enjoy this!')
    else:
        print('This food is too spicy for everyone!')

scoville solution

    shu = int(input("What is the SHU value? "))

    if shu < 10000:
        print('Nam will enjoy this!')
    if shu < 120:
        print('Michael will enjoy this!')
    if shu >= 10000:
        print('This food is too spicy for everyone :(')

Figure 1: The three program examples in our study (recipe, anagram, and scoville), each with the task description the students received, the student submission, and a correct solution.
In all cases, the hint was formatted as a program which the student might try next to improve their current program. We used a random sample of 30 student traces from the same task as training data to simulate a 'cold start' with only a classroom-sized training data set.

The resulting hints of all three methods for all three tasks are shown in Figure 2.

recipe hints

1NN
    1  a = input('Pie idea: ')
    2  b = [ord(x) for x in a]
    3  print(*b, sep="\n")

CHF
    1  x = x('<string>')
    2  x = x(x[0])
    3  x = x(x[0])
    4  x = x(x[0])
    5  x = x(x[0])
    6  for x in x
    7      x(x, keyword='<string>')

ast2vec
    1  msg = input('Pie idea: ')
    2  code = ord(msg[0])
    3  code1 = ord(msg[1])
    4  code2 = ord(msg[2])
    5  code3 = ord(msg[3])
    6  code4 = ord(msg[4])
    7  print(code1)
    8  print(code1)

anagram hints

1NN
    1  print('Hi, I am a computer!')
    2  computer = input('What are you? ')
    3  if computer == 'anagram':
    4      print('Nag a ram!')
    5  print('Hi, I am a computer!')

CHF
    1  x('<string>')
    2  x = x('<string>')
    3  if x == '<string>':
    4      x('<string>')
    5  if x == '<string>':
    6      x('<string>')
    7      x(x)
    8  if x == '<string>':
    9      x('<string>')

ast2vec
    1  raining = input('What are you? ')
    2  if raining == 'anagram':
    3      print('Nag a ram!')
    4      print('Hi, I am a computer!')
    5  if raining == 'a person':
    6      print('Hi, I am a computer!')

scoville hints

1NN
    1  msg = int(input('What is the SHU value? '))
    2  if msg < 10000:
    3      print('Nam will enjoy this!')
    4  else:
    5      print('This food is too spicy for everyone :(')
    6  if msg <= 120:
    7      print('Michael will enjoy this!')

CHF
    1  shu = int(input('What is the SHU value? '))
    2  if shu < 10000:
    3      print('Nam will enjoy this!')
    4  if shu <= 120:
    5      print('Michael will enjoy this!')
    6  else:
    7      print('This food is too spicy for everyone :(')

ast2vec
    1  sugary = int(input('What is the SHU value? '))
    2  if sugary >= 10000:
    3      print('Nam will enjoy this!')
    4  if sugary < 0:
    5      print('What is the SHU value? ')
    6  if sugary < 120:
    7      print('Nam will enjoy this!')

Figure 2: The hints of all three methods (1NN, CHF, and ast2vec) for all three student submissions from Figure 1.

3.4 Survey and recruitment
For our study, we recruited N = 7 teachers from programming courses in Australia and Germany. Recruitment was performed via e-mail lists with a survey link. Teachers could then voluntarily and anonymously complete the survey in Microsoft Forms. Participants were first asked about their experience as programming teachers. Six participants responded that they had taught more than three courses, and one participant that they had taught between one and three courses. Participants with no experience were excluded from the study.

We acknowledge that our recruitment strategy has limitations: While we can be reasonably certain that only experienced programming teachers took part, we have no information about the specific courses they taught and whether that matches up with the kind of programming task we investigated in our study. Further, self-selection bias may have occurred as we did not employ a random recruitment strategy.

Next, we presented the first programming task (the recipe task) with the official task description from the National Computer Science School, the example solution, and the student's submission (refer to Figure 1, top left). We asked the teachers whether they thought the student needed help in this situation (on a four-point scale), why (as a free text field), and what edit they would recommend to guide the student (free text field). Further, we presented the three automatic hints in Figure 2 (top left) and asked the teachers how relevant each hint was, how useful it was, and how much the student could learn from it, all on a five-point scale. We defined 'relevant' as 'addressing the core problem of the student's program' and 'useful' as 'getting closer to a correct solution'. These scales correspond to the scales of relevance and progress proposed by [13]. We replaced the 'interpretability' scale of [13] with 'learning' to encourage the teachers to reflect on the learning impact a hint may have.

Finally, we asked the teachers whether they would prefer not to give any of the hints and why (free text). We repeated all questions for the anagram and scoville tasks. To avoid ordering bias, the order of hint methods was shuffled randomly for each participant.
4. RESULTS
We present our results beginning with the teachers' assessment of whether hints were needed at all, followed by the hints given by teachers, and we conclude with the teachers' assessment of the automatic hints.

4.1 Help-need assessment
For each of the programs in Figure 1, we first asked the teachers how much help a student in this situation would need on a four-point scale, ranging from 0 ("The student is on the way to a correct solution and does not need help.") to 3 ("The student seems to have crucial misconceptions and should start from scratch.").

[Figure 3: bar chart; x-axis: help-need rating 0-3, y-axis: number of teachers, one bar group per task]
Figure 3: Teachers' assessment of help-need for the three submissions from Figure 1. 0 corresponds to 'no help needed', 3 to 'the student should start from scratch'. The y-axis corresponds to the number of teachers.

Figure 3 shows the distribution of responses for each task. Most teachers agreed on response 1 for all tasks ("The student is on the way to a correct solution but could benefit from a hint."). This indicates that our automatic selection indeed identified examples which indicated help-need.

We also asked teachers why they believed that the student did or did not need help. In response to this question, most teachers appeared to analyze which high-level concepts the student had already understood - judging from their program - and which concepts were still missing or misunderstood. For example, one teacher responded for the recipe task: "They know how to input, they know how to do the ord, they know they need to move through the string, they have just forgotten that there is a 'short cut' to do this in a loop.", and another teacher responded for the anagram task: "The student is not seeing the general rule of the program and is trying to cover possible cases by hand." Further, multiple teachers responded to this question with suggestions on how to provide further guidance and help to the student, such as "It just hasn't clicked that there are other possible inputs that they need to account for. Keeping this fresh in their mind should quickly lead to a solution." or "I think they should mess around a bit more, but they should get a hint that the input function should only be used once for this problem.".

4.2 Categories of Teacher Hints
Next, we asked teachers how they would recommend editing the student's program next. Interestingly, teachers generally did not give hints on a syntactic level. In fact, some of them stated explicitly that they thought this was not helpful (e.g. "I would not give students exact syntax because then they blindly follow without understanding"). However, in some cases, they did suggest lines to delete. We found that teacher feedback tended to fit into four general categories:

A  suggesting a missing concept, such as a for-loop, an else statement, or a combination of if-statements. For example, "When you have a line of code that you are repeating it can be useful to use a loop like 'for' or 'while'. This also allows you to repeat the code within the body of the loop for a flexible number of times."

B  explaining or hinting at situations when the program will not work as expected. E.g. "You're almost there. However, it will only do the right thing when the user writes 'a dog', 'anagram' or 'a person'. You can improve it so that it says 'Hi, I am a computer!' every time, no matter what the user says."

C  suggesting the student solve a simpler problem first. E.g. "Suggest that they delete all but the first line and try and print out each letter one at a time."

D  suggesting that the student has something unnecessary in their program, or directly telling them to delete it. E.g. "Remove the two irrelevant if statements leaving only the correct 'easter egg' statement".

Table 1 shows a classification of the teacher hints into these four categories. We observe that each hint could be classified in at least one category.

Table 1: Teacher hint types.

    teacher:    1    2    3    4    5    6     7
    recipe      A    A    B    B    A    C,D   A,B
    anagram     B    B    A    D    B    B     B
    scoville    A    A    A    D    B    D     A,D

4.3 Assessment of automatic hints
Our third set of questions for each task concerned the rating of the automatic hints (refer to Figure 2) according to relevance, usefulness, and learning, each on a five-point scale from -2 ("not at all") to +2 ("very"). Table 2 shows the average ratings (± standard deviation) given by the teachers.

Table 2: Average teacher ratings (± std.) for each of the hints from Figure 2.

    recipe
    method     relevant        useful          learning
    1NN         1.43 ± 0.73     0.86 ± 1.12    −0.14 ± 1.12
    CHF        −1.71 ± 0.45    −1.86 ± 0.35    −1.86 ± 0.35
    ast2vec    −1.29 ± 0.88    −1.71 ± 0.45    −1.43 ± 0.73

    anagram
    1NN         1.86 ± 0.35     1.43 ± 0.73     0.14 ± 0.99
    CHF        −2.00 ± 0.00    −2.00 ± 0.00    −1.86 ± 0.35
    ast2vec    −0.29 ± 1.28    −0.43 ± 0.90    −0.71 ± 1.03

    scoville
    1NN         1.14 ± 0.35     1.14 ± 0.83     0.14 ± 1.12
    CHF         1.43 ± 0.49     1.57 ± 0.73     0.57 ± 1.29
    ast2vec     0.29 ± 1.28    −0.29 ± 1.58    −0.29 ± 1.28

We observe that 1NN is generally regarded as relevant and useful, which can be explained by the fact that it always recommended a correct solution for the tasks in our examples (refer to Figures 1 and 2). However, teachers believed that students would not learn particularly much from these hints (rating around zero). For CHF, we observe strongly negative ratings in all three criteria for the recipe and anagram tasks, where CHF did not provide function and variable names (refer to Figure 2), but positive scores for the scoville task, where it selected a correct solution. On that task, it received even higher scores than 1NN. Ast2vec received negative scores on the recipe task, and scores around zero for all criteria on all other tasks.

Finally, we asked teachers if there were any hints they would prefer not to give, and why. Figure 4 shows how often each method was named for each task (where lower is better). Regarding the reasoning, one teacher always responded that "I would not give students exact syntax because then they blindly follow without understanding.", which excludes all automatic hints. Further, 1NN was often named because it "shows a valid solution. Students may copy the hint code and then use it - without understanding what it does.". The hints provided by CHF for the first two tasks did not include variable and function names (refer to Figure 2), which led teachers to exclude it because it "does not look syntactically valid, and is incomprehensible.". Ast2vec was named least often, albeit by a narrow margin. Reasons for naming it were that it is "not helpful for developing student understanding", would require additional explanation, or even "harm the learning of the student."

[Figure 4: bar chart; x-axis: task, y-axis: number of teachers, one bar per hint method]
Figure 4: The number of teachers (y-axis) who would prefer not to give a certain hint according to hint method (color) and task (x-axis).
5. DISCUSSION
In this section, we interpret the results in light of our three research questions: Do ratings differ between methods? Do automatic hints align with teacher hints? And what are teachers' reasons for preferring some hints over others?

RQ1: Do teachers' ratings differ between hint methods? We do observe systematic differences between hint methods in terms of ratings. However, these differences appear to be driven less by the underlying algorithm than by two factors: a) whether a correct solution or only a partial solution was selected, and b) whether the hint was presented as a program with function and variable names or not. Teachers only gave high scores for usefulness and relevance if a correct solution was given, and gave very low scores if function and variable names were missing. They never gave high scores for learning.

RQ2: Do automatic hints align with teacher hints? We observe that teacher hints do not directly align with automatic hints, because teachers generally suggested hints which were higher-level, like adding missing concepts or deleting lines to get to a more compact program. More generally, we identified four categories (refer to Section 4.2) which cover the kinds of edits teachers would have given themselves. To align automatic hints with teacher hints, we could employ automatic heuristics which post-process next-step hints, such as the following (strategies A and D are sketched in code after the list):

A - Missing Concepts: Compare the nodes of the student's current AST to those of the next step, then suggest concepts corresponding to missing nodes.

B - Mishandled Situations: Apply test cases to the current and the predicted program, then suggest the student focus on inputs that work for the next step but not the current step.

C - Simpler Problem: Similarly to A, first identify missing concepts in the student's work, then suggest easier programming tasks that contain this concept (e.g. from earlier in the course).

D - Deletions: Compare the current step to the next step and, if lines are deleted, ask the student if these lines are necessary.
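To make strategies A and D more tangible, the following sketch (our illustration, under simplifying assumptions) compares the student's current program with the predicted next step using Python's built-in ast module. The CONCEPT_NAMES mapping and the line-based deletion check are placeholders for illustration; a real post-processing step would compare syntax trees more carefully, and strategy B would additionally run the task's unit tests against both programs.

    import ast
    from collections import Counter

    # Hypothetical mapping from AST node types to concept names used in hint texts.
    CONCEPT_NAMES = {"For": "a for-loop", "While": "a while-loop",
                     "If": "an if-statement", "FunctionDef": "a function definition"}

    def node_counts(code):
        # Count AST node types in a program; assumes the program parses.
        return Counter(type(node).__name__ for node in ast.walk(ast.parse(code)))

    def missing_concept_hints(current_code, next_step_code):
        # Strategy A: suggest concepts that appear in the next step
        # but not yet in the student's program.
        missing = node_counts(next_step_code) - node_counts(current_code)
        return [f"Maybe you could try {CONCEPT_NAMES[n]}."
                for n in missing if n in CONCEPT_NAMES]

    def deletion_hints(current_code, next_step_code):
        # Strategy D: if the next step drops lines from the current program,
        # ask the student whether those lines are really necessary.
        current_lines = [l.strip() for l in current_code.splitlines() if l.strip()]
        next_lines = set(l.strip() for l in next_step_code.splitlines())
        dropped = [l for l in current_lines if l not in next_lines]
        if dropped:
            return [f"Do you really need the line '{dropped[0]}'? "
                    "Can you think of a way to make your program shorter?"]
        return []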
Applying these strategies, the automated hints for the tasks in Figure 2 might then become the hints shown in Table 3, which align better with the hints given by teachers.

Table 3: Abstracted hints based on the four hint types from Table 1.

    recipe hints
    A  (1NN, CHF)           Maybe you could try a for-loop.
    B  (1NN)                What happens when the user types 'apple'?
    C  (1NN, CHF)           Try doing this task on for-loops first.
    D  (1NN, CHF, ast2vec)  Can you think of a way to reduce the number of print statements?

    anagram hints
    B  (1NN)                What happens when the user types 'cat'?
    D  (CHF, ast2vec)       Can you think of a way to use fewer if statements?

    scoville hints
    B  (1NN, CHF)           What happens when the user types 120?
    D  (1NN, CHF, ast2vec)  How can you reduce the number of input calls? (and/or) Try using only one variable.

We note in passing that abstraction may also make it easier to provide helpful hints because the hint method does not need to get every detail (such as function or variable names) right, merely the rough direction. For example, we notice that ast2vec uses the wrong string in line 5 of its hint and the wrong comparison constant in line 4 of Figure 2 (bottom). This would not be an issue in the abstracted hint.

Still, we acknowledge that this approach has limitations: For strategies A and C, we implicitly assume that 'concepts' coincide with syntactic building blocks, e.g. 'have you tried adding a for loop?'. This assumption likely breaks down in more advanced programming classes.

RQ3: What are teachers' reasons for preferring some hints over others? Teachers mostly explained hints by concepts that were still missing in a program (like loops), undesired functional behavior for additional cases, or superfluous code. These reasons align with the hints the teachers gave. Importantly, many teachers also emphasized that the student had already gotten many things right, indicating that teachers were motivated to preserve the progress that had already been made. The main reason for not providing a hint appeared to be that the teachers were concerned that the hint may not lead to any learning, either because the hint was syntactic, and hence not abstract enough, because the hint was a correct solution, or because the hint was 'incomprehensible' due to missing function or variable names. Overall, the reasons provided by teachers underline our finding that teachers do not only care about the content of a next-step hint but also, perhaps mainly, how it is communicated, i.e. with a high-level explanation instead of a syntactic edit and without revealing the solution.

6. CONCLUSION
We performed a survey with N = 7 teachers to evaluate hints from three hint methods (1NN, CHF, and ast2vec) and to investigate three research questions: Do quantitative ratings differ between methods? Do automatic hints align with teacher hints? And what are teachers' reasons for preferring some hints over others?

We found that teachers generally had a low opinion of syntactic next-step hints, irrespective of the method. Differences in ratings could be explained by two factors: whether the hint was a correct solution (then it was regarded as relevant and useful) or not (then all ratings were around zero), and whether the hint used human-readable variable and function names (if not, the ratings became strongly negative).

Instead of syntactic hints, teachers preferred higher-level hints which suggested a missing concept, pointed out inputs which were not appropriately covered by the current program, suggested a simpler problem first, or proposed to remove superfluous lines.

Finally, the main concern of teachers for not giving hints was students' learning. They disregarded both syntactic hints and showing a correct solution because they were concerned that students might naïvely apply the hint without reflecting on it sufficiently. This points to a potential gap in current next-step hint approaches, which are mainly focused on suggesting changes, but less on inviting reflection or abstracting to a higher level. For introductory programming courses, it may be sufficient to post-process syntax-level hints with simple heuristics, such as the ones proposed in the previous section. Additionally, one can introduce textual explanations as suggested by [8]. However, further research is needed to investigate whether it is sufficient to change how next-step hints are communicated or whether deeper changes in hint methods are necessary to achieve a better alignment with the pedagogical expertise of programming teachers. Finally, an evaluation study with students is still required to make sure that any refined hint strategy yields better student outcomes compared to current hint strategies.

7. ACKNOWLEDGMENTS
Funding by the German Research Foundation (DFG) under grant number PA 3460/2-1 is gratefully acknowledged.
8. REFERENCES
[1] T. Barnes and J. Stamper. Toward automatic hint generation for logic proof tutoring using historical student data. In B. P. Woolf, E. Aïmeur, R. Nkambou, and S. Lajoie, editors, Proceedings of the International Conference on Intelligent Tutoring Systems (ITS 2008), pages 373–382, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.
[2] R. R. Choudhury, H. Yin, and A. Fox. Scale-driven automatic hint generation for coding style. In A. Micarelli, J. Stamper, and K. Panourgia, editors, Intelligent Tutoring Systems, pages 122–132, Cham, 2016. Springer International Publishing.
[3] S. Chow, K. Yacef, I. Koprinska, and J. Curran. Automated data-driven hints for computer programming students. In Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization (UMAP 2017), pages 5–10, 2017.
[4] A. T. Corbett and J. R. Anderson. Locus of feedback control in computer-based tutoring: Impact on learning rate, achievement and attitudes. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 245–252, 2001.
[5] D. Fossati, B. Di Eugenio, S. Ohlsson, C. Brown, and L. Chen. Data driven automatic feedback generation in the iList intelligent tutoring system. Technology Instruction, Cognition and Learning, 10(1):5–26, 2015.
[6] S. Gross and N. Pinkwart. How do learners behave in help-seeking when given a choice? In C. Conati, N. Heffernan, A. Mitrovic, and M. F. Verdejo, editors, Proceedings of the 17th International Conference on Artificial Intelligence in Education (AIED 2015), pages 600–603, 2015.
[7] M. Maniktala, C. Cody, A. Isvik, N. Lytle, M. Chi, T. Barnes, et al. Extending the hint factory for the assistance dilemma: A novel, data-driven help-need predictor for proactive problem-solving help. Journal of Educational Data Mining, 12(4):24–65, 2020.
[8] S. Marwan, N. Lytle, J. J. Williams, and T. Price. The impact of adding textual explanations to next-step hints in a novice programming environment. In Proceedings of the 2019 ACM Conference on Innovation and Technology in Computer Science Education (ITiCSE '19), pages 520–526, New York, NY, USA, 2019. Association for Computing Machinery.
[9] J. McBroom, B. Paassen, B. Jeffries, I. Koprinska, and K. Yacef. Progress networks as a tool for analysing student programming difficulties. In C. Szabo and J. Sheard, editors, Proceedings of the Twenty-Third Australasian Computing Education Conference (ACE '21), pages 158–167. Association for Computing Machinery, 2021.
[10] B. Paaßen, B. Hammer, T. Price, T. Barnes, S. Gross, and N. Pinkwart. The continuous hint factory - providing hints in vast and sparsely populated edit distance spaces. Journal of Educational Data Mining, 10(1):1–35, 2018.
[11] B. Paaßen, J. McBroom, B. Jeffries, I. Koprinska, and K. Yacef. ast2vec: Utilizing recursive neural encodings of Python programs. Journal of Educational Data Mining, 2021. In press.
[12] C. Piech, M. Sahami, J. Huang, and L. Guibas. Autonomously generating hints by inferring problem solving policies. In G. Kiczales, D. Russell, and B. Woolf, editors, Proceedings of the Second ACM Conference on Learning @ Scale (L@S 2015), pages 195–204, 2015.
[13] T. Price, R. Zhi, and T. Barnes. Evaluation of a data-driven feedback algorithm for open-ended programming. In X. Hu, T. Barnes, and P. Inventado, editors, Proceedings of the 10th International Conference on Educational Data Mining (EDM 2017), pages 192–197, 2017.
[14] T. W. Price, Y. Dong, R. Zhi, B. Paaßen, N. Lytle, V. Cateté, and T. Barnes. A comparison of the quality of data-driven programming hint generation algorithms. International Journal of Artificial Intelligence in Education, 29(3):368–395, 2019.
[15] K. Rivers and K. R. Koedinger. Data-driven hint generation in vast solution spaces: a self-improving Python programming tutor. International Journal of Artificial Intelligence in Education, 27(1):37–64, 2017.