An Exploration of Data-Driven Hint Generation in an Open-Ended Programming Problem

Thomas W. Price
North Carolina State University
890 Oval Drive
Raleigh, NC 27606
twprice@ncsu.edu

Tiffany Barnes
North Carolina State University
890 Oval Drive
Raleigh, NC 27606
tmbarnes@ncsu.edu

ABSTRACT
Data-driven systems can provide automated feedback in the form of hints to students working in problem solving environments. Programming problems present a unique challenge to these systems, in part because of the many ways in which a single program can be written. This paper reviews current strategies for generating data-driven hints for programming problems and examines their applicability to larger, more open-ended problems, with multiple, loosely ordered goals. We use this analysis to suggest directions for future work to generate hints for these problems.

1. INTRODUCTION
Adaptive feedback is one of the hallmarks of an Intelligent Tutoring System. This feedback often takes the form of hints, pointing a student to the next step in solving a problem. While hints can be authored by experts or generated by a solver, more recent data-driven approaches have shown that this feedback can be automatically generated from previous students' solutions to a problem. The Hint Factory [18] has successfully generated data-driven hints in a number of problem solving domains, including logic proofs [2], linked list problems [5] and a programming game [12]. The Hint Factory operates on a representation of a problem called an interaction network [4], a directed graph where each vertex represents a student's state at some point in the problem solving process, and each edge represents a student's action that alters that state. A solution is represented as a path from the initial state to a goal state. A student requesting a hint is matched to a previously observed state and directed on a path to a goal state. The Hint Factory takes advantage of the intuition that students with the same initial state and objective will follow similar paths, producing a well-connected interaction network. When this occurs, even a relatively small sample of student solutions can be enough to provide hints to the majority of new students [1].
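To make the interaction network concrete, the minimal Python sketch below stores observed transitions between states and, for a matched state, suggests the most frequently taken next state. The class and method names are our own illustration; the Hint Factory itself selects next states with a value-based policy over the network rather than raw edge frequency [13, 18].

    from collections import defaultdict

    class InteractionNetwork:
        """A minimal interaction network: vertices are (hashable) student
        states, edges are observed transitions between consecutive states."""

        def __init__(self):
            # state -> next state -> number of students who took that step
            self.edges = defaultdict(lambda: defaultdict(int))
            self.goals = set()

        def add_solution_path(self, states, reached_goal):
            """Record one student's sequence of states."""
            for cur, nxt in zip(states, states[1:]):
                self.edges[cur][nxt] += 1
            if reached_goal:
                self.goals.add(states[-1])

        def hint(self, state):
            """Return a suggested next state, or None if the student's
            current state has never been observed before."""
            successors = self.edges.get(state)
            if not successors:
                return None
            return max(successors, key=successors.get)

The case where hint() returns None, because the student's state was never observed, is exactly the matching problem that the rest of this paper is concerned with.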
While problems in many domains result in well-connected networks, this is not always the case. Programming problems have a large, often infinite, space of possible states. Even a relatively simple programming problem may have many unique goal states, each with multiple possible solution paths, leaving little overlap among student solutions. Despite this challenge, a number of attempts have been made to adapt the Hint Factory to programming problems [9, 12, 16]. While these approaches have been generally successful, they are most effective on small, well-structured programming problems, where the state space cannot grow too large. This paper explores the opposite type of problem: one that has a large state space, multiple loosely ordered goals and unstructured output, and that involves creative design. Each of these attributes poses a challenge to current data-driven hint generation techniques, but they are also the attributes that make such problems interesting, useful and realistic. In this paper, we will refer to these as open-ended programming problems. We investigate the applicability of current techniques to these problems and suggest areas for future research.

The primary contributions of this paper are 1) a review of current data-driven hint generation methods for programming problems, 2) an analysis of those methods' applicability to an open-ended problem and 3) a discussion of the challenges that need to be addressed before we can expect to generate hints for similar problems.

2. CURRENT APPROACHES
Current approaches to generating data-driven programming hints, including alternatives to the Hint Factory [10, 13, 17], can be broken down into three primary components:

1. A representation of a student's state and a method for determining when one state can be reached from another, meaning they are connected in the network
2. An algorithm that, given a student's current state, constructs an optimal path to a goal state. Often we simplify this to the problem of picking the first state on that path
3. A method to present this path, or next state, to the student in the form of a hint

For the purposes of this paper, we limit our discussion to the first step in this process. (For a good discussion of the second step, see an analysis by Piech et al. [13], comparing path selection algorithms.) While in some domains this first step is straightforward, in programming tasks, especially open-ended problems, it is likely the most challenging. The simplest approach is to take periodic snapshots of a student's code and treat these as states, connecting consecutive snapshots in the network. However, because two students' programs are unlikely to match exactly, this approach is likely to produce a very sparse, poorly connected network, making it difficult to match new students to prior solution attempts. A variety of techniques have been presented to address this problem, which can be grouped into three main strategies: canonicalization, connecting states and alternative state definitions.

2.1 Canonicalization
Canonicalization is the process of putting code states into a standardized form, often by removing semantically unimportant information, so that trivial differences do not prevent two states from matching. Rivers and Koedinger [15] present a method for canonicalizing student code by first representing it as an Abstract Syntax Tree (AST). Once in this form, they apply a number of functions to canonicalize the code, including normalizing arithmetic and boolean operators, removing unreachable and unused code, and inlining helper functions. After performing this canonicalization on a set of introductory programming problems, they found that a median 70% of states had at least one match in the network. Jin et al. [9] represent a program's state as a Linkage Graph, where each vertex is a code statement, and each directed edge represents an ordering dependency, determined by which variables are read and assigned to in each statement. This state representation allows the Hint Factory to ignore statement orderings which are not important to the execution of the program. Lazar and Bratko [10] use the actual text of Prolog code to represent a student's state, and then canonicalize the code by removing whitespace and normalizing variable names.
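As an illustration of this style of canonicalization, the toy sketch below renames identifiers to canonical placeholders and sorts the operands of commutative operators in a Python AST. The transformer and its two transformations are our own simplification of the approach of Rivers and Koedinger [15], which also removes unreachable and unused code and inlines helper functions.

    import ast

    class Canonicalizer(ast.NodeTransformer):
        """Toy canonicalizer: rename identifiers to canonical placeholders and
        sort the operands of commutative operators, so that trivially
        different programs map to the same state."""

        def __init__(self):
            self.names = {}

        def visit_Name(self, node):
            # Note: this renames all identifiers, including function names,
            # which a real canonicalizer would treat more carefully.
            canonical = self.names.setdefault(node.id, "v%d" % len(self.names))
            return ast.copy_location(ast.Name(id=canonical, ctx=node.ctx), node)

        def visit_BinOp(self, node):
            self.generic_visit(node)
            if isinstance(node.op, (ast.Add, ast.Mult)):  # treated as commutative
                node.left, node.right = sorted([node.left, node.right], key=ast.dump)
            return node

    def canonical_state(code):
        """Map source code to a canonical string used as a network state."""
        tree = Canonicalizer().visit(ast.parse(code))
        return ast.dump(tree)

    # Example: these two trivially different programs now share one state.
    # canonical_state("x = a + b") == canonical_state("y = b + a")  -> True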
2.2 Connecting States
Even with canonicalization, a student requesting a hint may not match any existing state in the network. In this case, we can look for a similar state and create a connection between them. Rivers and Koedinger [16] use normalized string edit distance as a similarity metric between two program states. They connect any two states in the network which have at least 90% similarity, even if no historical data connect these states. Additionally, they use a technique called path construction to generate new solution paths from a given state to a nearby, unconnected goal state by searching for a series of insertions, deletions and edits to their AST that will transform it into the goal state [17]. They also use this method to discover new goal states which may be closer to the student's current state. Jin et al. [9] use a similar technique to transform their Linkage Graphs to better match the current state of a student when no direct matches can be found in the interaction network. Piech et al. [13] use path construction to interpolate between two consecutive states on a solution path which differ by more than one edit. This is useful for smoothing data when student code is recorded in snapshots that are too far apart.
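As a concrete illustration, the sketch below connects any two code states whose normalized edit-distance similarity is at least 90%, following the description in [16]. The similarity definition (one minus Levenshtein distance over the longer length) and the network representation are our own simplification.

    def edit_distance(a, b):
        """Standard Levenshtein distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def similarity(a, b):
        """Normalized similarity in [0, 1]; 1 means identical code states."""
        if not a and not b:
            return 1.0
        return 1.0 - edit_distance(a, b) / max(len(a), len(b))

    def connect_similar_states(states, network, threshold=0.9):
        """Add an edge between every pair of states that is at least
        `threshold` similar, even if no student moved between them."""
        for i, s in enumerate(states):
            for t in states[i + 1:]:
                if similarity(s, t) >= threshold:
                    network.setdefault(s, set()).add(t)
                    network.setdefault(t, set()).add(s)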
2.3 Alternate State Definitions
Another approach is to forego the traditional code-based representation of a student's state, and use an alternate definition. Hicks and Peddycord [7, 12] used the Hint Factory to generate hints for a programming game called Bots, in which the player writes a program to direct a robot through various tasks in a 3D level. They chose to represent the state of a player's program as the final state of the game world after the program was executed. They compared the availability of hints when using this "world state" model with a traditional "code state" model, and found that using world states significantly reduced the total number of states and increased the availability of hints. The challenge with this approach, as noted by the authors, is the generation of actionable hints. A student may be more capable of making a specific change to her code than determining how to effect a specific change in the code's output.

3. AN OPEN-ENDED PROBLEM
The above techniques have all shown success on smaller, well-structured problems, with ample data. We want to investigate their applicability to an open-ended problem, as described in Section 1, where this is not the case. The purpose of this paper is not to create actionable hints, nor are we attempting to show the failures of current methods by applying them to an overly challenging task. Rather, our purpose is exploratory, using a small dataset to identify areas of possible future work, and challenges of which to be mindful when moving forward with hint generation research.

We collected data from a programming activity completed by 6th grade students in a STEM outreach program called SPARCS [3]. The program, which meets for half-day sessions approximately once a month during the school year, consists of lessons designed and taught by undergraduate and graduate students to promote technical literacy. The class consisted of 17 students, 12 male and 5 female.

The activity was a programming exercise based on an Hour of Code activity from the Beauty and Joy of Computing curriculum [6]. It was a tutorial designed to introduce novices to programming for the first time. The exercise had users create a simple web-based game, similar to whack-a-mole, in which players attempt to click on a sprite as it jumps around the screen to win points. The exercise was split into 9 objectives, with tutorial text at each stage. Students were not required to finish an objective before proceeding. A finished project required the use of various programming concepts, including events, loops, variables and conditionals. The students used a drag-and-drop, block-based programming language called Tiled Grace [8], which is similar to Scratch [14]. The user writes a program by arranging code blocks, which correspond directly to constructs in the Grace programming language. The editor also supports switching to textual coding, but this feature was disabled. A screenshot of the activity can be seen in Figure 1.

Figure 1: A screenshot of the Hour of Code activity that students completed. Students received instructions on the left panel. The center panel was a work area, and students could drag blocks in from the bottom-right panel. The top-right panel allowed students to test their games.

During the activity, the students were allowed to go through the exercise at their own pace. If they had questions, the students were allowed to ask for help from the student volunteers. Students were stopped after 45 minutes of work. Snapshots of a student's code were saved each time it was run and periodically throughout the session. Occasional technical issues did occur in both groups. One student had severe technical issues, and this student's data was not analyzed (and is not reflected in the counts above). Students produced on average 148.5 unique code states and accomplished between 1 and 6 of the activity's objectives, averaging 3.2 objectives per student.

4. ANALYSIS
We attempted to understand the applicability of each of the techniques discussed in Section 2 to our dataset. However, since the output of our program was a game that involved nondeterminism, we felt it would be inappropriate to attempt to represent a program's state as the result of its execution. We therefore focused on the first two strategies, canonicalization and connecting states.

4.1 Canonicalization
Our initial representation of a student's code state was a tree, where each code block was a node, and its children included any blocks that were nested inside of it. In this way, our representation was similar to Rivers and Koedinger's ASTs. To get a baseline for the sparsity of our dataset, we first analyzed the code states without performing any canonicalization. We calculated the total number of states in the interaction network and the percentage which were only reached by one student. Of those states reached by multiple students, we calculated the mean and median number of students who reached them. We also calculated the percentage of each student's states that were unreached by any other student in the dataset.

We then canonicalized the data by removing variable names and the values of number and string literals. Our problem featured very few arithmetic or logical operators, and these were generally not nested, so we did not normalize them, as suggested by Rivers and Koedinger [15]. We reran our analyses on the canonicalized data. To ensure that we had effectively removed all unimportant ordering information, we recursively sorted the children of each node in the tree. This effectively removed any ordering information from a student's code state, and kept only hierarchical information. This is somewhat more extreme than the Linkage Graphs of Jin et al. [9], and it does allow two meaningfully different code states to be merged in the process. We therefore see this as an upper bound on the value of removing unimportant orderings from a code state. We recomputed our metrics for the ordered-canonicalized interaction network as well. The results can be seen in Table 1. Our later analyses use the unsorted, canonicalized code representation.

Table 1: Various measures of the sparseness of the interaction network for the raw, canonicalized, and ordered-canonicalized state representations. Mean and median NU counts refer to the number of students who reached each non-unique state.

    Measure                Raw      Canonical   Ordered
    Total States           2380     1781        1656
    % Unique               97.5%    94.8%       92.8%
    Mean NU Count          3.44     3.95        2.82
    Median NU Count        2        2           2
    Mean % Path Unique     89.9%    83.0%       78.9%
      (std. dev.)          (6.67)   (10.5)      (13.3)

These results indicate that canonicalization does little to reduce the sparsity of the state space, with students spending most of their time in states that no other student has seen. For comparison, recall Rivers and Koedinger found 70% of states in a simple programming problem had a match after canonicalization [15], though they were using a much larger dataset. In our dataset, it is unlikely that we would be able to find a direct path from a new student's state to a goal state in order to suggest a hint.
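For concreteness, the sketch below applies these canonicalization steps to a block tree, assuming each block is represented as a (type, attributes, children) triple; this node format and the function name are illustrative, not the actual Tiled Grace data model.

    def canonicalize(node, sort_children=False):
        """Canonicalize a block tree given as (type, attributes, children).
        Variable names and literal values (stored in attributes) are dropped;
        if sort_children is True, children are recursively sorted, discarding
        ordering entirely (the ordered-canonicalized condition in Table 1)."""
        block_type, _attrs, children = node
        canon_children = [canonicalize(c, sort_children) for c in children]
        if sort_children:
            canon_children = sorted(canon_children)
        # Keep only the block type and the canonicalized children.
        return (block_type, tuple(canon_children))

    # Two states match if their canonical forms are equal, e.g.
    # canonicalize(("say", {"text": "Hello"}, [])) ==
    # canonicalize(("say", {"text": "Hi"}, []))  -> True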
4.2 Connecting States
To address this, we explored the feasibility of connecting a new student's state to a similar, existing state in the network. It is unclear how close two code states should be before it is appropriate to connect them as in [16], or to generate a path between them as in [17]. It certainly depends on the state representation and distance metric used. Rather than identifying a cutoff and measuring how often these techniques could be applied, we chose to visualize the distance between two students and make qualitative observations. Because our code states were already represented as trees, we used Tree Edit Distance (TED) as a distance metric. While Rivers and Koedinger reported better success with Levenshtein distance [16], we believe that TED is the most appropriate distance metric for block code, where tree edit operations correspond directly to user actions.

For each pair of students, A and B, we created an N by M distance matrix D, where N is the number of states in A's solution path, and M is the number of states in B's solution path. D_{i,j} = d(A_i, B_j), where d is the TED distance function, A_i is the ith state of A and B_j is the jth state of B. We used the RTED algorithm [11] to calculate the distance function, putting a weight of 1.0 on insertions, deletions and replacements. We also omitted any state which was identical to its predecessor state. We normalized these values by dividing by the maximum value of D_{i,j}, and plotted the result as an image. Three such images can be seen in Figure 2.

Figure 2: Three distance matrices, each comparing two students, where each pixel represents the TED between two states. Lighter shades of gray indicate smaller distances. The green/yellow line shows the path through the matrix with minimized total distance, with yellow shades indicating smaller distances. Red crosses indicate where both students met an objective. The top-left figure compares two students as they completed objectives 1-4 in the exercise; in this example, the minimum-distance line crosses each objective. The top-right figure also depicts two students completing objective 4, but the darker colors and straighter line indicate less alignment. The bottom figure depicts two students completing objectives 1-6, with high alignment.
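The distance matrices can be computed as in the sketch below. Here tree_edit_distance stands in for any unit-cost TED implementation (we used RTED [11], a Java library); the helper name, the dictionary-free interface and the NumPy usage are illustrative.

    import numpy as np

    def distance_matrix(path_a, path_b, tree_edit_distance):
        """Pairwise TED between every state in A's path and every state in
        B's path. Assumes both paths are non-empty. (For the images in
        Figure 2, the matrix was additionally divided by its maximum.)"""
        # Drop states identical to their immediate predecessor, as in our analysis.
        path_a = [s for i, s in enumerate(path_a) if i == 0 or s != path_a[i - 1]]
        path_b = [s for i, s in enumerate(path_b) if i == 0 or s != path_b[i - 1]]
        return np.array([[tree_edit_distance(a, b) for b in path_b] for a in path_a],
                        dtype=float)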
We also calculated the "path" through this matrix that passes through the least total distance. This does not represent a path through the interaction network, but rather an alignment between the states of student A and those of student B. Each pixel of the line represents a pairing of a state from A with a state from B, such that these pairings are contiguous and represent the smallest total distance. While we do not suggest applying this directly as a strategy for hint generation, it serves as a useful visual indicator of the compatibility of two students for hinting purposes.
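This minimum total-distance alignment can be computed with a dynamic program in the style of dynamic time warping, as sketched below. The allowed steps (advance in A's path, in B's path, or in both) are an illustrative choice; other contiguity constraints are possible.

    import numpy as np

    def min_cost_alignment(D):
        """Return a minimum total-distance alignment through distance matrix D
        as a list of (i, j) index pairs. Each step advances in A's path, B's
        path, or both, so the pairings stay contiguous."""
        n, m = D.shape
        cost = np.full((n, m), np.inf)
        cost[0, 0] = D[0, 0]
        for i in range(n):
            for j in range(m):
                if i == j == 0:
                    continue
                best_prev = min(cost[i - 1, j] if i > 0 else np.inf,
                                cost[i, j - 1] if j > 0 else np.inf,
                                cost[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
                cost[i, j] = D[i, j] + best_prev
        # Trace the path back from the bottom-right corner.
        path, i, j = [(n - 1, m - 1)], n - 1, m - 1
        while (i, j) != (0, 0):
            moves = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
            i, j = min((p for p in moves if p[0] >= 0 and p[1] >= 0),
                       key=lambda p: cost[p])
            path.append((i, j))
        return list(reversed(path))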
An alternate approach would have been to pair each state in the interaction network with its closest pair from any other student, and use this as a measure of how sparse the network was. We chose to compare whole students, rather than individual states, because we felt that state-level pairing could lead to strange hinting behavior. Imagine, for instance, that a student requests a hint, which initially points to a state from student B, but at the very next step requests a hint that points instead to student C. Perhaps the attributes that make the student's state similar to that of B are different from those that make the state similar to C. The resulting hints would be at best confusing, and at worst conflicting.

A visual inspection of these matrices reveals that while many student pairs are quite divergent, some show a notable closeness throughout the exercise. In order to quantify these results, we developed a set of distance metrics between students. First, the distance matrix and the minimum-distance "path" were calculated for the two students. The path is comprised of pairs of states, and for each pair, we recorded the tree edit distance between the states. From this list of distances, we calculated the mean, median and maximum distances between the two students. We looked at each objective in the exercise, and isolated the relevant subpath of each student who completed that objective. We paired each of these subpaths with the most similar subpath in the set, using the mean, median and max distance metrics. Table 2 shows the average values of these minimized pairs of students, using each metric. Objective 3 and objectives 7-9 were omitted, as too few students completed them.

Table 2: For each objective, average distances in tree edits (and standard deviations) of minimum-distance student pairs, using the mean, median and max metrics. For reference, the final column represents the average maximum distance each student moved from the start state while completing the objective.

    Objective   Mean          Median        Max            Farthest
    1           0.25 (0.27)   0.15 (0.29)   0.76 (0.56)    2.23 (0.75)
    2           4.88 (3.93)   4.95 (4.34)   9.18 (5.74)    12.73 (6.10)
    4           4.92 (2.77)   4.83 (2.78)   10.11 (3.69)   14.67 (4.77)
    5           7.79 (1.32)   7.75 (1.41)   13.17 (1.72)   18.17 (1.72)
    6           7.49 (1.11)   7.76 (1.37)   13.17 (0.98)   18.67 (1.75)
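Building on the two sketches above (distance_matrix and min_cost_alignment, assumed to be in scope), the following illustrates how a pair of per-objective subpaths can be scored and how each subpath can be paired with its minimum-distance partner, as summarized in Table 2. The dictionary format for subpaths and the function names are our own illustration.

    import statistics

    def pair_scores(D, alignment):
        """Mean, median and max TED along an alignment path. D should hold
        raw (unnormalized) TED values."""
        dists = [D[i, j] for i, j in alignment]
        return {"mean": statistics.mean(dists),
                "median": statistics.median(dists),
                "max": max(dists)}

    def closest_partner(subpaths, distance_fn, metric="mean"):
        """For each per-objective subpath (student id -> list of states), find
        the most similar subpath from another student under the chosen metric."""
        best = {}
        for a, path_a in subpaths.items():
            candidates = []
            for b, path_b in subpaths.items():
                if a == b:
                    continue
                D = distance_matrix(path_a, path_b, distance_fn)
                score = pair_scores(D, min_cost_alignment(D))[metric]
                candidates.append((score, b))
            best[a] = min(candidates)
        return best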
5. DISCUSSION
We have attempted to apply a meaningful canonicalization to our state space, which did serve to reduce the number of states by 30.4%. However, as seen in Table 1, even after the strongest canonicalization, over 90% of the states in the interaction network had only been reached by one student, with an average 78.9% of the states in each student's solution path being unique to that student. It seems that our approach to canonicalization is insufficient to produce a meaningful reduction of the state space, though it is possible a more stringent canonicalization would be more effective.

Connecting existing states seems to be a more promising approach, giving us the ability to link new states to previously observed states, even when they do not match exactly. Our distance matrices indicate that some students take parallel, or slowly diverging, solution paths, which suggests that they may be useful to each other in the context of hinting. As shown in Figure 2, students are often closest together when completing the same objective. This may seem self-evident, but it does indicate that our distance metric is meaningful. It is more difficult to put the actual TED values into context. Students get, on average, farther away from their closest paired student as they complete more objectives, but this average distance does not exceed 8 tree edits during the first 6 objectives. To put that number into context, during this same time students do not, on average, get more than 19 tree edits away from the start state. This suggests that there is certainly hint-relevant knowledge in these paired students, but that it may be difficult to harness this knowledge to generate a hint.

5.1 Limitations
It is important to note that this analysis is an exploratory case study, and makes no strong claims, only observations. We studied data from only 17 novice programmers, and the problem we analyzed was highly complex, involving multiple control structures, loosely ordered objectives, and unstructured output. This makes the problem quite dissimilar from previous problems that have been studied in the context of hint generation, making it difficult to determine which observations should be generalized.

5.2 Future Work
Rivers and Koedinger [16], as well as Jin et al. [9], note the limitations of their methods for larger problems, and each suggests that breaking a problem down into subproblems would help to address this. Lazar and Bratko [10] attempted this in their Prolog tutor by constructing hints for individual lines of code, which were treated as independent subproblems. Similarly, we may be able to isolate the subsection of a student's code that is currently relevant, and treat this as an independent problem. The hierarchical nature of block-based coding environments lends itself to this practice, making it an appealing direction for future work.

Our work makes the simplifying assumption that unweighted TED is a reliable distance metric for code, but future work should investigate alternative metrics. This might include a weighted TED metric, which assigns different costs to insertions, deletions and replacements, or even to different types of nodes (e.g. deleting a for-loop node might cost more than a function call node). Regardless of the metric used, once two proximate states are identified, it is still an open question how this information can best be used for hint generation. It is possible to construct a path between the states and direct a student along this path. However, future work might also investigate how to extract hint-relevant information from one state and apply it to a similar state directly.

Because of the nature of our problem's output, we did not explore non-code-based state representations, as described in Section 2.3. It would still be worth investigating how these might be applied to open-ended problems. For instance, a code state could be represented as a boolean vector, indicating whether the code has passed a series of unit tests, and hints could direct the student to the current flaw in their program. However, creating actionable hints from this information would pose a significant challenge.
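As one hypothetical version of this idea, the sketch below encodes a code state as a boolean vector of unit-test outcomes and points the student to the first objective whose test fails. The test list and hint messages are placeholders, not part of our study, and turning such a flaw report into an actionable hint remains the open challenge noted above.

    def test_vector(program, tests):
        """Represent a code state as a tuple of pass/fail outcomes over an
        ordered battery of tests (e.g. one or more per objective)."""
        return tuple(bool(test(program)) for test in tests)

    def flaw_hint(program, tests, messages):
        """Direct the student to the first objective whose test fails.
        `tests` and `messages` are parallel, hypothetical lists."""
        for passed, message in zip(test_vector(program, tests), messages):
            if not passed:
                return message
        return "All objectives passed so far - keep going!"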
6. REFERENCES
[1] T. Barnes and J. Stamper. Toward automatic hint generation for logic proof tutoring using historical student data. In Intelligent Tutoring Systems (ITS), pages 373–382, 2008.
[2] T. Barnes, J. Stamper, L. Lehman, and M. Croy. A pilot study on logic proof tutoring using hints generated from historical student data. In Proceedings of the 1st Annual International Conference on Educational Data Mining (EDM), pages 1–5, 2008.
[3] V. Cateté, K. Wassell, and T. Barnes. Use and development of entertainment technologies in after school STEM program. In Proceedings of the 45th ACM Technical Symposium on Computer Science Education, pages 163–168, 2014.
[4] M. Eagle, M. Johnson, and T. Barnes. Interaction networks: Generating high level hints based on network community clustering. In Proceedings of the International Conference on Educational Data Mining (EDM), pages 164–167, 2012.
[5] D. Fossati, B. Di Eugenio, and S. Ohlsson. I learn from you, you learn from me: How to make iList learn from students. In Artificial Intelligence in Education (AIED), 2009.
[6] D. Garcia, B. Harvey, L. Segars, and C. How. AP CS Principles pilot at University of California, Berkeley. ACM Inroads, 3(2), 2012.
[7] A. Hicks, B. Peddycord III, and T. Barnes. Building games to learn from their players: Generating hints in a serious game. In Intelligent Tutoring Systems (ITS), pages 312–317, 2014.
[8] M. Homer and J. Noble. Combining tiled and textual views of code. In Proceedings of the 2nd IEEE Working Conference on Software Visualization, 2014.
[9] W. Jin, T. Barnes, and J. Stamper. Program representation for automatic hint generation for a data-driven novice programming tutor. In Intelligent Tutoring Systems (ITS), 2012.
[10] T. Lazar and I. Bratko. Data-driven program synthesis for hint generation in programming tutors. In Intelligent Tutoring Systems (ITS). Springer, 2014.
[11] M. Pawlik and N. Augsten. RTED: A robust algorithm for the tree edit distance. Proceedings of the VLDB Endowment, 5(4):334–345, 2011.
[12] B. Peddycord III, A. Hicks, and T. Barnes. Generating hints for programming problems using intermediate output. In Proceedings of the 7th International Conference on Educational Data Mining (EDM), pages 92–98, 2014.
[13] C. Piech, M. Sahami, J. Huang, and L. Guibas. Autonomously generating hints by inferring problem solving policies. In Learning at Scale (LAS), 2015.
[14] M. Resnick, J. Maloney, A. Monroy-Hernández, N. Rusk, E. Eastmond, K. Brennan, A. Millner, E. Rosenbaum, J. Silver, B. Silverman, and Y. Kafai. Scratch: Programming for all. Communications of the ACM, 52(11):60–67, 2009.
[15] K. Rivers and K. Koedinger. A canonicalizing model for building programming tutors. In Intelligent Tutoring Systems (ITS), 2012.
[16] K. Rivers and K. Koedinger. Automatic generation of programming feedback: A data-driven approach. In The First Workshop on AI-supported Education for Computer Science (AIEDCS), 2013.
[17] K. Rivers and K. Koedinger. Automating hint generation with solution space path construction. In Intelligent Tutoring Systems (ITS), pages 329–339, 2014.
[18] J. Stamper, M. Eagle, T. Barnes, and M. Croy. Experimental evaluation of automatic hint generation for a logic tutor. International Journal of Artificial Intelligence in Education, 22(1):3–17, 2013.