An Exploration of Data-Driven Hint Generation in an Open-Ended Programming Problem

Thomas W. Price
North Carolina State University
890 Oval Drive
Raleigh, NC 27606
twprice@ncsu.edu

Tiffany Barnes
North Carolina State University
890 Oval Drive
Raleigh, NC 27606
tmbarnes@ncsu.edu

ABSTRACT
Data-driven systems can provide automated feedback in the form of hints to students working in problem solving environments. Programming problems present a unique challenge to these systems, in part because of the many ways in which a single program can be written. This paper reviews current strategies for generating data-driven hints for programming problems and examines their applicability to larger, more open-ended problems, with multiple, loosely ordered goals. We use this analysis to suggest directions for future work to generate hints for these problems.

1. INTRODUCTION
Adaptive feedback is one of the hallmarks of an Intelligent Tutoring System. This feedback often takes the form of hints, pointing a student to the next step in solving a problem. While hints can be authored by experts or generated by a solver, more recent data-driven approaches have shown that this feedback can be automatically generated from previous students' solutions to a problem. The Hint Factory [18] has successfully generated data-driven hints in a number of problem solving domains, including logic proofs [2], linked list problems [5] and a programming game [12]. The Hint Factory operates on a representation of a problem called an interaction network [4], a directed graph where each vertex represents a student's state at some point in the problem solving process, and each edge represents a student's action that alters that state. A solution is represented as a path from the initial state to a goal state. A student requesting a hint is matched to a previously observed state and directed on a path to a goal state. The Hint Factory takes advantage of the intuition that students with the same initial state and objective will follow similar paths, producing a well-connected interaction network. When this occurs, even a relatively small sample of student solutions can be enough to provide hints to the majority of new students [1].
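To make the interaction network concrete, the minimal Python sketch below stores observed transitions between states and, for a matched state, suggests the most frequently taken next state. The class and method names are our own illustration; the Hint Factory itself selects next states with a value-based policy over the network rather than raw edge frequency [13, 18].

    from collections import defaultdict

    class InteractionNetwork:
        """A minimal interaction network: vertices are (hashable) student
        states, edges are observed transitions between consecutive states."""

        def __init__(self):
            # state -> next state -> number of students who took that step
            self.edges = defaultdict(lambda: defaultdict(int))
            self.goals = set()

        def add_solution_path(self, states, reached_goal):
            """Record one student's sequence of states."""
            for cur, nxt in zip(states, states[1:]):
                self.edges[cur][nxt] += 1
            if reached_goal:
                self.goals.add(states[-1])

        def hint(self, state):
            """Return a suggested next state, or None if the student's
            current state has never been observed before."""
            successors = self.edges.get(state)
            if not successors:
                return None
            return max(successors, key=successors.get)

The case where hint() returns None, because the student's state was never observed, is exactly the matching problem that the rest of this paper is concerned with.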
While problems in many domains result in well-connected networks, this is not always the case. Programming problems have a large, often infinite, space of possible states. Even a relatively simple programming problem may have many unique goal states, each with multiple possible solution paths, leaving little overlap among student solutions. Despite this challenge, a number of attempts have been made to adapt the Hint Factory to programming problems [9, 12, 16]. While these approaches have been generally successful, they are most effective on small, well-structured programming problems, where the state space cannot grow too large. This paper explores the opposite type of problem: one that has a large state space, multiple loosely ordered goals and unstructured output, and that involves creative design. Each of these attributes poses a challenge to current data-driven hint generation techniques, but they are also the attributes that make such problems interesting, useful and realistic. In this paper, we will refer to these as open-ended programming problems. We investigate the applicability of current techniques to these problems and suggest areas for future research.

The primary contributions of this paper are 1) a review of current data-driven hint generation methods for programming problems, 2) an analysis of those methods' applicability to an open-ended problem and 3) a discussion of the challenges that need to be addressed before we can expect to generate hints for similar problems.

2. CURRENT APPROACHES
Current approaches to generating data-driven programming hints, including alternatives to the Hint Factory [10, 13, 17], can be broken down into three primary components:

1. A representation of a student's state and a method for determining when one state can be reached from another, meaning they are connected in the network
2. An algorithm that, given a student's current state, constructs an optimal path to a goal state. Often we simplify this to the problem of picking the first state on that path
3. A method to present this path, or next state, to the student in the form of a hint

For the purposes of this paper, we limit our discussion to the first step in this process. (For a good discussion of the second step, see an analysis by Piech et al. [13], comparing path selection algorithms.) While in some domains this first step is straightforward, in programming tasks, especially open-ended problems, it is likely the most challenging. The simplest approach is to take periodic snapshots of a student's code and treat these as states, connecting consecutive snapshots in the network. However, because two students' programs are unlikely to match exactly, this approach is likely to produce a very sparse, poorly connected network, making it difficult to match new students to prior solution attempts. A variety of techniques have been presented to address this problem, which can be grouped into three main strategies: canonicalization, connecting states and alternative state definitions.

2.1 Canonicalization
Canonicalization is the process of putting code states into a standardized form, often by removing semantically unimportant information, so that trivial differences do not prevent two states from matching. Rivers and Koedinger [15] present a method for canonicalizing student code by first representing it as an Abstract Syntax Tree (AST). Once in this form, they apply a number of functions to canonicalize the code, including normalizing arithmetic and boolean operators, removing unreachable and unused code, and inlining helper functions. After performing this canonicalization on a set of introductory programming problems, they found that a median 70% of states had at least one match in the network. Jin et al. [9] represent a program's state as a Linkage Graph, where each vertex is a code statement, and each directed edge represents an ordering dependency, determined by which variables are read and assigned to in each statement. This state representation allows the Hint Factory to ignore statement orderings which are not important to the execution of the program. Lazar and Bratko [10] use the actual text of Prolog code to represent a student's state, and then canonicalize the code by removing whitespace and normalizing variable names.
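As an illustration of this style of canonicalization, the toy sketch below renames identifiers to canonical placeholders and sorts the operands of commutative operators in a Python AST. The transformer and its two transformations are our own simplification of the approach of Rivers and Koedinger [15], which also removes unreachable and unused code and inlines helper functions.

    import ast

    class Canonicalizer(ast.NodeTransformer):
        """Toy canonicalizer: rename identifiers to canonical placeholders and
        sort the operands of commutative operators, so that trivially
        different programs map to the same state."""

        def __init__(self):
            self.names = {}

        def visit_Name(self, node):
            # Note: this renames all identifiers, including function names,
            # which a real canonicalizer would treat more carefully.
            canonical = self.names.setdefault(node.id, "v%d" % len(self.names))
            return ast.copy_location(ast.Name(id=canonical, ctx=node.ctx), node)

        def visit_BinOp(self, node):
            self.generic_visit(node)
            if isinstance(node.op, (ast.Add, ast.Mult)):  # treated as commutative
                node.left, node.right = sorted([node.left, node.right], key=ast.dump)
            return node

    def canonical_state(code):
        """Map source code to a canonical string used as a network state."""
        tree = Canonicalizer().visit(ast.parse(code))
        return ast.dump(tree)

    # Example: these two trivially different programs now share one state.
    # canonical_state("x = a + b") == canonical_state("y = b + a")  -> True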
2.2 Connecting States
Even with canonicalization, a student requesting a hint may not match any existing state in the network. In this case, we can look for a similar state and create a connection between them. Rivers and Koedinger [16] use normalized string edit distance as a similarity metric between two program states. They connect any two states in the network which have at least 90% similarity, even if no historical data connect these states. Additionally, they use a technique called path construction to generate new solution paths from a given state to a nearby, unconnected goal state by searching for a series of insertions, deletions and edits to their AST that will transform it into the goal state [17]. They also use this method to discover new goal states which may be closer to the student's current state. Jin et al. [9] use a similar technique to transform their Linkage Graphs to better match the current state of a student when no direct matches can be found in the interaction network. Piech et al. [13] use path construction to interpolate between two consecutive states on a solution path which differ by more than one edit. This is useful for smoothing data when student code is recorded in snapshots that are too far apart.
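As a concrete illustration, the sketch below connects any two code states whose normalized edit-distance similarity is at least 90%, following the description in [16]. The similarity definition (one minus Levenshtein distance over the longer length) and the network representation are our own simplification.

    def edit_distance(a, b):
        """Standard Levenshtein distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def similarity(a, b):
        """Normalized similarity in [0, 1]; 1 means identical code states."""
        if not a and not b:
            return 1.0
        return 1.0 - edit_distance(a, b) / max(len(a), len(b))

    def connect_similar_states(states, network, threshold=0.9):
        """Add an edge between every pair of states that is at least
        `threshold` similar, even if no student moved between them."""
        for i, s in enumerate(states):
            for t in states[i + 1:]:
                if similarity(s, t) >= threshold:
                    network.setdefault(s, set()).add(t)
                    network.setdefault(t, set()).add(s)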
2.3 Alternate State Definitions
Another approach is to forego the traditional code-based representation of a student's state, and use an alternate definition. Hicks and Peddycord [7, 12] used the Hint Factory to generate hints for a programming game called Bots, in which the player writes a program to direct a robot through various tasks in a 3D level. They chose to represent the state of a player's program as the final state of the game world after the program was executed. They compared the availability of hints when using this "world state" model with a traditional "code state" model, and found that using world states significantly reduced the total number of states and increased the availability of hints. The challenge with this approach, as noted by the authors, is the generation of actionable hints. A student may be more capable of making a specific change to her code than determining how to effect a specific change in the code's output.

3. AN OPEN-ENDED PROBLEM
The above techniques have all shown success on smaller, well-structured problems, with ample data. We want to investigate their applicability to an open-ended problem, as described in Section 1, where this is not the case. The purpose of this paper is not to create actionable hints, nor are we attempting to show the failures of current methods by applying them to an overly challenging task. Rather, our purpose is exploratory, using a small dataset to identify areas of possible future work, and challenges of which to be mindful when moving forward with hint generation research.

We collected data from a programming activity completed by 6th grade students in a STEM outreach program called SPARCS [3]. The program, which meets for half-day sessions approximately once a month during the school year, consists of lessons designed and taught by undergraduate and graduate students to promote technical literacy. The class consisted of 17 students, 12 male and 5 female.

The activity was a programming exercise based on an Hour of Code activity from the Beauty and Joy of Computing curriculum [6]. It was a tutorial designed to introduce novices to programming for the first time. The exercise had users create a simple web-based game, similar to whack-a-mole, in which players attempt to click on a sprite as it jumps around the screen to win points. The exercise was split into 9 objectives, with tutorial text at each stage. Students were not required to finish an objective before proceeding. A finished project required the use of various programming concepts, including events, loops, variables and conditionals. The students used a drag-and-drop, block-based programming language called Tiled Grace [8], which is similar to Scratch [14]. The user writes a program by arranging code blocks, which correspond directly to constructs in the Grace programming language. The editor also supports switching to textual coding, but this feature was disabled. A screenshot of the activity can be seen in Figure 1.

Figure 1: A screenshot of the Hour of Code activity that students completed. Students received instructions on the left panel. The center panel was a work area, and students could drag blocks in from the bottom-right panel. The top-right panel allowed students to test their games.

During the activity, the students were allowed to go through the exercise at their own pace. If they had questions, the students were allowed to ask for help from the student volunteers. Students were stopped after 45 minutes of work. Snapshots of a student's code were saved each time it was run and periodically throughout the session. Occasional technical issues did occur in both groups. One student had severe technical issues, and this student's data was not analyzed (and is not reflected in the counts above). Students produced on average 148.5 unique code states and accomplished between 1 and 6 of the activity's objectives, averaging 3.2 objectives per student.

4. ANALYSIS
We attempted to understand the applicability of each of the techniques discussed in Section 2 to our dataset. However, since the output of our program was a game that involved nondeterminism, we felt it would be inappropriate to attempt to represent a program's state as the result of its execution. We therefore focused on the first two strategies, canonicalization and connecting states.

4.1 Canonicalization
Our initial representation of a student's code state was a tree, where each code block was a node, and its children included any blocks that were nested inside of it. In this way, our representation was similar to Rivers and Koedinger's ASTs. To get a baseline for the sparsity of our dataset, we first analyzed the code states without performing any canonicalization. We calculated the total number of states in the interaction network and the percentage which were only reached by one student. Of those states reached by multiple students, we calculated the mean and median number of students who reached them. We also calculated the percentage of each student's states that were unreached by any other student in the dataset.

We then canonicalized the data by removing variable names and the values of number and string literals. Our problem featured very few arithmetic or logical operators, and these were generally not nested, so we did not normalize them, as suggested by Rivers and Koedinger [15]. We reran our analyses on the canonicalized data. To ensure that we had effectively removed all unimportant ordering information, we recursively sorted the children of each node in the tree. This effectively removed any ordering information from a student's code state, and kept only hierarchical information. This is somewhat more extreme than the Linkage Graphs of Jin et al. [9], and it does allow two meaningfully different code states to be merged in the process. We therefore see this as an upper bound on the value of removing unimportant orderings from a code state. We recomputed our metrics for the ordered-canonicalized interaction network as well. The results can be seen in Table 1. Our later analyses use the unsorted, canonicalized code representation.

Table 1: Various measures of the sparseness of the interaction network for the raw, canonicalized, and ordered-canonicalized state representations. Mean and median NU counts refer to the number of students who reached each non-unique state.

    Measure                Raw      Canonical   Ordered
    Total States           2380     1781        1656
    % Unique               97.5%    94.8%       92.8%
    Mean NU Count          3.44     3.95        2.82
    Median NU Count        2        2           2
    Mean % Path Unique     89.9%    83.0%       78.9%
      (std. dev.)          (6.67)   (10.5)      (13.3)

These results indicate that canonicalization does little to reduce the sparsity of the state space, with students spending most of their time in states that no other student has seen. For comparison, recall Rivers and Koedinger found 70% of states in a simple programming problem had a match after canonicalization [15], though they were using a much larger dataset. In our dataset, it is unlikely that we would be able to find a direct path from a new student's state to a goal state in order to suggest a hint.
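For concreteness, the sketch below applies these canonicalization steps to a block tree, assuming each block is represented as a (type, attributes, children) triple; this node format and the function name are illustrative, not the actual Tiled Grace data model.

    def canonicalize(node, sort_children=False):
        """Canonicalize a block tree given as (type, attributes, children).
        Variable names and literal values (stored in attributes) are dropped;
        if sort_children is True, children are recursively sorted, discarding
        ordering entirely (the ordered-canonicalized condition in Table 1)."""
        block_type, _attrs, children = node
        canon_children = [canonicalize(c, sort_children) for c in children]
        if sort_children:
            canon_children = sorted(canon_children)
        # Keep only the block type and the canonicalized children.
        return (block_type, tuple(canon_children))

    # Two states match if their canonical forms are equal, e.g.
    # canonicalize(("say", {"text": "Hello"}, [])) ==
    # canonicalize(("say", {"text": "Hi"}, []))  -> True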
4.2 Connecting States
To address this, we explored the feasibility of connecting a new student's state to a similar, existing state in the network. It is unclear how close two code states should be before it is appropriate to connect them as in [16], or to generate a path between them as in [17]. It certainly depends on the state representation and distance metric used. Rather than identifying a cutoff and measuring how often these techniques could be applied, we chose to visualize the distance between two students and make qualitative observations. Because our code states were already represented as trees, we used Tree Edit Distance (TED) as a distance metric. While Rivers and Koedinger reported better success with Levenshtein distance [16], we believe that TED is the most appropriate distance metric for block code, where tree edit operations correspond directly to user actions.

For each pair of students, A and B, we created an N by M distance matrix D, where N is the number of states in A's solution path, and M is the number of states in B's solution path. D_{i,j} = d(A_i, B_j), where d is the TED distance function, A_i is the ith state of A and B_j is the jth state of B. We used the RTED algorithm [11] to calculate the distance function, putting a weight of 1.0 on insertions, deletions and replacements. We also omitted any state which was identical to its predecessor state. We normalized these values by dividing by the maximum value of D_{i,j}, and plotted the result as an image. Three such images can be seen in Figure 2.

Figure 2: Three distance matrices, each comparing two students, where each pixel represents the TED between two states. Lighter shades of gray indicate smaller distances. The green/yellow line shows the path through the matrix with minimized total distance, with yellow shades indicating smaller distances. Red crosses indicate where both students met an objective. The top-left figure compares two students as they completed objectives 1-4 in the exercise; in this example, the minimum-distance line crosses each objective. The top-right figure also depicts two students completing objective 4, but the darker colors and straighter line indicate less alignment. The bottom figure depicts two students completing objectives 1-6, with high alignment.
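The distance matrices can be computed as in the sketch below. Here tree_edit_distance stands in for any unit-cost TED implementation (we used RTED [11], a Java library); the helper name, the dictionary-free interface and the NumPy usage are illustrative.

    import numpy as np

    def distance_matrix(path_a, path_b, tree_edit_distance):
        """Pairwise TED between every state in A's path and every state in
        B's path. Assumes both paths are non-empty. (For the images in
        Figure 2, the matrix was additionally divided by its maximum.)"""
        # Drop states identical to their immediate predecessor, as in our analysis.
        path_a = [s for i, s in enumerate(path_a) if i == 0 or s != path_a[i - 1]]
        path_b = [s for i, s in enumerate(path_b) if i == 0 or s != path_b[i - 1]]
        return np.array([[tree_edit_distance(a, b) for b in path_b] for a in path_a],
                        dtype=float)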
We also calculated the "path" through this matrix that passes through the least total distance. This does not represent a path through the interaction network, but rather an alignment between the states of student A and those of student B. Each pixel of the line represents a pairing of a state from A with a state from B, such that these pairings are contiguous and represent the smallest total distance. While we do not suggest applying this directly as a strategy for hint generation, it serves as a useful visual indicator of the compatibility of two students for hinting purposes.
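This minimum total-distance alignment can be computed with a dynamic program in the style of dynamic time warping, as sketched below. The allowed steps (advance in A's path, in B's path, or in both) are an illustrative choice; other contiguity constraints are possible.

    import numpy as np

    def min_cost_alignment(D):
        """Return a minimum total-distance alignment through distance matrix D
        as a list of (i, j) index pairs. Each step advances in A's path, B's
        path, or both, so the pairings stay contiguous."""
        n, m = D.shape
        cost = np.full((n, m), np.inf)
        cost[0, 0] = D[0, 0]
        for i in range(n):
            for j in range(m):
                if i == j == 0:
                    continue
                best_prev = min(cost[i - 1, j] if i > 0 else np.inf,
                                cost[i, j - 1] if j > 0 else np.inf,
                                cost[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
                cost[i, j] = D[i, j] + best_prev
        # Trace the path back from the bottom-right corner.
        path, i, j = [(n - 1, m - 1)], n - 1, m - 1
        while (i, j) != (0, 0):
            moves = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
            i, j = min((p for p in moves if p[0] >= 0 and p[1] >= 0),
                       key=lambda p: cost[p])
            path.append((i, j))
        return list(reversed(path))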
An alternate approach would have been to pair each state in the interaction network with its closest pair from any other student, and use this as a measure of how sparse the network was. We chose to compare whole students, rather than individual states, because we felt that state-level pairing could lead to strange hinting behavior. Imagine, for instance, that a student requests a hint, which initially points to a state from student B, but at the very next step requests a hint that points instead to student C. Perhaps the attributes that make the student's state similar to that of B are different from those that make the state similar to C. The resulting hints would be at best confusing, and at worst conflicting.

A visual inspection of these matrices reveals that while many student pairs are quite divergent, some show a notable closeness throughout the exercise. In order to quantify these results, we developed a set of distance metrics between students. First, the distance matrix and the minimum-distance "path" were calculated for the two students. The path is comprised of pairs of states, and for each pair, we recorded the tree edit distance between the states. From this list of distances, we calculated the mean, median and maximum distances between the two students. We looked at each objective in the exercise, and isolated the relevant subpath of each student who completed that objective. We paired each of these subpaths with the most similar subpath in the set, using the mean, median and max distance metrics. Table 2 shows the average values of these minimized pairs of students, using each metric. Objective 3 and objectives 7-9 were omitted, as too few students completed them.

Table 2: For each objective, average distances in tree edits (and standard deviations) of minimum-distance student pairs, using the mean, median and max metrics. For reference, the final column represents the average maximum distance each student moved from the start state while completing the objective.

    Objective   Mean          Median        Max            Farthest
    1           0.25 (0.27)   0.15 (0.29)   0.76 (0.56)    2.23 (0.75)
    2           4.88 (3.93)   4.95 (4.34)   9.18 (5.74)    12.73 (6.10)
    4           4.92 (2.77)   4.83 (2.78)   10.11 (3.69)   14.67 (4.77)
    5           7.79 (1.32)   7.75 (1.41)   13.17 (1.72)   18.17 (1.72)
    6           7.49 (1.11)   7.76 (1.37)   13.17 (0.98)   18.67 (1.75)
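Building on the two sketches above (distance_matrix and min_cost_alignment, assumed to be in scope), the following illustrates how a pair of per-objective subpaths can be scored and how each subpath can be paired with its minimum-distance partner, as summarized in Table 2. The dictionary format for subpaths and the function names are our own illustration.

    import statistics

    def pair_scores(D, alignment):
        """Mean, median and max TED along an alignment path. D should hold
        raw (unnormalized) TED values."""
        dists = [D[i, j] for i, j in alignment]
        return {"mean": statistics.mean(dists),
                "median": statistics.median(dists),
                "max": max(dists)}

    def closest_partner(subpaths, distance_fn, metric="mean"):
        """For each per-objective subpath (student id -> list of states), find
        the most similar subpath from another student under the chosen metric."""
        best = {}
        for a, path_a in subpaths.items():
            candidates = []
            for b, path_b in subpaths.items():
                if a == b:
                    continue
                D = distance_matrix(path_a, path_b, distance_fn)
                score = pair_scores(D, min_cost_alignment(D))[metric]
                candidates.append((score, b))
            best[a] = min(candidates)
        return best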
5. DISCUSSION
We have attempted to apply a meaningful canonicalization to our state space, which did serve to reduce the number of states by 30.4%. However, as seen in Table 1, even after the strongest canonicalization, over 90% of the states in the interaction network had only been reached by one student, with an average 78.9% of the states in each student's solution path being unique to that student. It seems that our approach to canonicalization is insufficient to produce a meaningful reduction of the state space, though it is possible a more stringent canonicalization would be more effective.

Connecting existing states seems to be a more promising approach, giving us the ability to link new states to previously observed states, even when they do not match exactly. Our distance matrices indicate that some students take parallel, or slowly diverging, solution paths, which suggests that they may be useful to each other in the context of hinting. As shown in Figure 2, students are often closest together when completing the same objective. This may seem self-evident, but it does indicate that our distance metric is meaningful. It is more difficult to put the actual TED values into context. Students get, on average, farther away from their closest paired student as they complete more objectives, but this average distance does not exceed 8 tree edits during the first 6 objectives. To put that number into context, during this same time students do not, on average, get more than 19 tree edits away from the start state. This suggests that there is certainly hint-relevant knowledge in these paired students, but that it may be difficult to harness this knowledge to generate a hint.

5.1 Limitations
It is important to note that this analysis is an exploratory case study, and makes no strong claims, only observations. We studied data from only 17 novice programmers, and the problem we analyzed was highly complex, involving multiple control structures, loosely ordered objectives, and unstructured output. This makes the problem quite dissimilar from previous problems that have been studied in the context of hint generation, making it difficult to determine which observations should be generalized.

5.2 Future Work
Rivers and Koedinger [16], as well as Jin et al. [9], note the limitations of their methods for larger problems, and each suggests that breaking a problem down into subproblems would help to address this. Lazar and Bratko [10] attempted this in their Prolog tutor by constructing hints for individual lines of code, which were treated as independent subproblems. Similarly, we may be able to isolate the subsection of a student's code that is currently relevant, and treat this as an independent problem. The hierarchical nature of block-based coding environments lends itself to this practice, making it an appealing direction for future work.

Our work makes the simplifying assumption that unweighted TED is a reliable distance metric for code, but future work should investigate alternative metrics. This might include a weighted TED metric, which assigns different costs to insertions, deletions and replacements, or even to different types of nodes (e.g. deleting a for-loop node might cost more than a function call node). Regardless of the metric used, once two proximate states are identified, it is still an open question how this information can best be used for hint generation. It is possible to construct a path between the states and direct a student along this path. However, future work might also investigate how to extract hint-relevant information from one state and apply it to a similar state directly.

Because of the nature of our problem's output, we did not explore non-code-based state representations, as described in Section 2.3. It would still be worth investigating how these might be applied to open-ended problems. For instance, a code state could be represented as a boolean vector, indicating whether the code has passed a series of unit tests, and hints could direct the student to the current flaw in their program. However, creating actionable hints from this information would pose a significant challenge.
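As one hypothetical version of this idea, the sketch below encodes a code state as a boolean vector of unit-test outcomes and points the student to the first objective whose test fails. The test list and hint messages are placeholders, not part of our study, and turning such a flaw report into an actionable hint remains the open challenge noted above.

    def test_vector(program, tests):
        """Represent a code state as a tuple of pass/fail outcomes over an
        ordered battery of tests (e.g. one or more per objective)."""
        return tuple(bool(test(program)) for test in tests)

    def flaw_hint(program, tests, messages):
        """Direct the student to the first objective whose test fails.
        `tests` and `messages` are parallel, hypothetical lists."""
        for passed, message in zip(test_vector(program, tests), messages):
            if not passed:
                return message
        return "All objectives passed so far - keep going!"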
6. REFERENCES
[1] T. Barnes and J. Stamper. Toward automatic hint generation for logic proof tutoring using historical student data. In Intelligent Tutoring Systems (ITS), pages 373–382, 2008.
[2] T. Barnes, J. Stamper, L. Lehman, and M. Croy. A pilot study on logic proof tutoring using hints generated from historical student data. In Proceedings of the 1st Annual International Conference on Educational Data Mining (EDM), pages 1–5, 2008.
[3] V. Cateté, K. Wassell, and T. Barnes. Use and development of entertainment technologies in after school STEM program. In Proceedings of the 45th ACM Technical Symposium on Computer Science Education, pages 163–168, 2014.
[4] M. Eagle, M. Johnson, and T. Barnes. Interaction networks: Generating high level hints based on network community clustering. In Proceedings of the International Conference on Educational Data Mining (EDM), pages 164–167, 2012.
[5] D. Fossati, B. Di Eugenio, and S. Ohlsson. I learn from you, you learn from me: How to make iList learn from students. In Artificial Intelligence in Education (AIED), 2009.
[6] D. Garcia, B. Harvey, L. Segars, and C. How. AP CS Principles pilot at University of California, Berkeley. ACM Inroads, 3(2), 2012.
[7] A. Hicks, B. Peddycord III, and T. Barnes. Building games to learn from their players: Generating hints in a serious game. In Intelligent Tutoring Systems (ITS), pages 312–317, 2014.
[8] M. Homer and J. Noble. Combining tiled and textual views of code. In Proceedings of the 2nd IEEE Working Conference on Software Visualization, 2014.
[9] W. Jin, T. Barnes, and J. Stamper. Program representation for automatic hint generation for a data-driven novice programming tutor. In Intelligent Tutoring Systems (ITS), 2012.
[10] T. Lazar and I. Bratko. Data-driven program synthesis for hint generation in programming tutors. In Intelligent Tutoring Systems (ITS). Springer, 2014.
[11] M. Pawlik and N. Augsten. RTED: A robust algorithm for the tree edit distance. Proceedings of the VLDB Endowment, 5(4):334–345, 2011.
[12] B. Peddycord III, A. Hicks, and T. Barnes. Generating hints for programming problems using intermediate output. In Proceedings of the 7th International Conference on Educational Data Mining (EDM), pages 92–98, 2014.
[13] C. Piech, M. Sahami, J. Huang, and L. Guibas. Autonomously generating hints by inferring problem solving policies. In Learning at Scale (LAS), 2015.
[14] M. Resnick, J. Maloney, A. Monroy-Hernández, N. Rusk, E. Eastmond, K. Brennan, A. Millner, E. Rosenbaum, J. Silver, B. Silverman, and Y. Kafai. Scratch: Programming for all. Communications of the ACM, 52(11):60–67, 2009.
[15] K. Rivers and K. Koedinger. A canonicalizing model for building programming tutors. In Intelligent Tutoring Systems (ITS), 2012.
[16] K. Rivers and K. Koedinger. Automatic generation of programming feedback: A data-driven approach. In The First Workshop on AI-supported Education for Computer Science (AIEDCS), 2013.
[17] K. Rivers and K. Koedinger. Automating hint generation with solution space path construction. In Intelligent Tutoring Systems (ITS), pages 329–339, 2014.
[18] J. Stamper, M. Eagle, T. Barnes, and M. Croy. Experimental evaluation of automatic hint generation for a logic tutor. International Journal of Artificial Intelligence in Education, 22(1):3–17, 2013.