=Paper=
{{Paper
|id=Vol-1907/3_mici_osborn
|storemode=property
|title=Evaluating a Solver-Aided Puzzle Design Tool
|pdfUrl=https://ceur-ws.org/Vol-1907/3_mici_osborn.pdf
|volume=Vol-1907
|authors=Joseph Osborn,Michael Mateas
|dblpUrl=https://dblp.org/rec/conf/chi/OsbornM17
}}
==Evaluating a Solver-Aided Puzzle Design Tool==
Joseph C. Osborn and Michael Mateas
Expressive Intelligence Studio, University of California at Santa Cruz, Santa Cruz, USA
jcosborn@soe.ucsc.edu, michaelm@soe.ucsc.edu

Abstract. Puzzle game levels generally each admit a particular set of intended solutions so that the game designer can control difficulty and the introduction of novel concepts. Catching unintended solutions can be difficult for humans, but promising connections with software model checking have been explored by previous work in educational puzzle games. With this in mind, we prototyped a design support tool for the PuzzleScript game engine based on finding and visualizing shortest solution paths. We evaluated the tool with a larger population of novice designers on a fixed level design task aimed at removing shortcuts. Surprisingly, we found no difference in task performance given an oracle for shortest solutions; this paper explores these results and possible explanations for this phenomenon. A video recording of the tool in use is available at https://archive.org/details/PuzzlescriptAssistantDemonstration.

Author Keywords: Design support; game design; puzzle games

ACM Classification Keywords: H.5.2 [Information Interfaces and Presentation (e.g. HCI)]: User Interfaces; D.2.5 [Software Engineering]: Testing and Debugging

Copyright © 2017 for this paper is held by the author(s). Proceedings of MICI 2017: CHI Workshop on Mixed-Initiative Creative Interfaces.

===Introduction===
Designing games is difficult for many of the same reasons that computer programming is difficult: game designers, like programmers, define an initial state and operations on that state given user input, and these operations can interact in unexpected ways, or the initial state may be misconfigured. Existing tools for analyzing and testing programs mainly focus on functional requirements, i.e. input/output behavior. In games, however, nonfunctional requirements such as teaching a progression of skills play a comparatively greater role, necessitating new design-focused analysis tools.

Accordingly, several researchers have proposed tools and techniques to assist game designers. Some tools generate elaborated alternatives for the designer's consideration (e.g. candidate game levels [10, 5]), while others show the consequences of design decisions (e.g. solution or reachable-space visualizations [8, 1]). We follow the latter class of tools, providing design-time analysis of PuzzleScript levels.

The literature makes the reasonable assumption that such tools will aid the game design process and help designers create higher-quality designs. While many game design tools have had some form of user evaluation, we are aware of no controlled studies that compare design outcomes between two groups of users, one group using a tool and one not (the closest might be [2], which measured no difference in engagement time between fully- and partially-human-authored game level progressions). This means there is little evidence supporting the fundamental assumption behind game design support tools. We present the negative result of our attempt to gather such evidence to prompt discussion on measuring the leverage such tools provide. In the remainder of this paper we present our tool, the experiment, and a discussion of our results.

===PuzzleScript Analyzer===
The PuzzleScript Analyzer (PSA) can find solutions to levels for any game written in the programming language PuzzleScript [4], a domain-specific language for puzzle games. Solution search is done via A*; it is similar to [6], but with a different heuristic and more attention to performance. PSA is integrated into PuzzleScript's web-based editor and is automatically run when the rules or levels change (see Fig. 1). Designers can scrub back and forth through a solution's steps using the mouse (see Fig. 2) and load up the currently displayed state for interactive play by clicking on it. PSA alerts the designer if the shortest solution changes in length or becomes invalid as the game level and rules are changed (see Fig. 3).

Figure 1: PuzzleScript and the PuzzleScript Analyzer.

Figure 2: A (shortcut) solution to the first level displayed in the tool. Note the position of the mouse cursor, currently scrubbing through the steps.

Figure 3: The same solution display updated after a level design repair. The old solution is marked as unavailable with a red border.
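To make the kind of search involved concrete, the sketch below shows a shortest-solution search in this style. It is only an illustration under assumed interfaces: the PuzzleState type, its key, isWin, legalMoves, applyMove, and heuristic members, and the node budget are hypothetical stand-ins, not PSA's actual implementation.

<syntaxhighlight lang="typescript">
// Minimal sketch of a shortest-solution search over puzzle states.
// PuzzleState and its members are hypothetical stand-ins for the engine's
// real state representation and rule application.

type Move = "up" | "down" | "left" | "right" | "action";

interface PuzzleState {
  key(): string;              // canonical serialization, used to detect revisits
  isWin(): boolean;           // all win conditions satisfied
  legalMoves(): Move[];
  applyMove(m: Move): PuzzleState;
  heuristic(): number;        // admissible estimate of remaining moves
}

// Returns a shortest sequence of moves that solves the level, or null.
function shortestSolution(start: PuzzleState, maxNodes = 100_000): Move[] | null {
  interface Node { state: PuzzleState; path: Move[]; g: number; f: number; }
  const open: Node[] = [{ state: start, path: [], g: 0, f: start.heuristic() }];
  const bestG = new Map<string, number>([[start.key(), 0]]);
  let expanded = 0;

  while (open.length > 0 && expanded++ < maxNodes) {
    // A real implementation would use a priority queue; a sort keeps the sketch short.
    open.sort((a, b) => a.f - b.f);
    const node = open.shift()!;
    if (node.state.isWin()) return node.path;

    for (const move of node.state.legalMoves()) {
      const next = node.state.applyMove(move);
      const g = node.g + 1;
      const key = next.key();
      if (g >= (bestG.get(key) ?? Infinity)) continue; // already reached as cheaply
      bestG.set(key, g);
      open.push({ state: next, path: [...node.path, move], g, f: g + next.heuristic() });
    }
  }
  return null; // no solution found within the node budget
}
</syntaxhighlight>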
From a requirements standpoint, our work is most similar to two design tools for the game Refraction, which teaches fractional arithmetic. The first tool evaluates puzzles to ensure that every solution satisfies designer-provided properties [9], while the second generates progressions of levels to teach designer-provided concept sequences [2].

Following the Refraction tools and e-mail interviews we conducted with five PuzzleScript users, we believed visualizing the shortest solution would help designers. Three of our interviewees asserted that shortcuts—solutions which did not require learning or using desired concepts—were a problem, especially when levels were meant to teach concepts in order. More formally, shortcuts (or workarounds) are a class of design bug where the designer's intended solution is either not optimal or is not the only solution. PSA helps find true shortcuts, i.e. those which both circumvent and are shorter than the intended solution.

We could not simply visualize solutions by superposing the player's path on the level: puzzles are often transformed during play, and players might control multiple characters. Instead, we provide an interface for scrubbing through solution steps to see intermediate puzzle states. Moreover, puzzle game rules and levels both undergo iteration; following Inform 7's Skein [7], we inform the user when previously-seen shortest solutions are no longer optimal or valid.
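Given a found shortest solution and a description of the intended solution, the "true shortcut" criterion above reduces to a short comparison. The sketch below is only our reading of that definition (PSA does not accept intended solutions as formal input, a limitation discussed later); Move is the hypothetical move type from the previous sketch, and the prefix test is a stand-in for a real notion of "circumventing" the intended concepts.

<syntaxhighlight lang="typescript">
// Illustration of the "true shortcut" criterion: the found solution both
// circumvents the intended solution and is strictly shorter than it.
// `Move` is the hypothetical move type used in the search sketch above.

function isTrueShortcut(found: Move[], intended: Move[]): boolean {
  const shorter = found.length < intended.length;
  // "Circumvents" is approximated here as "is not a prefix of the intended
  // solution"; a real check might instead compare which rules or concepts fire.
  const circumvents = !found.every((m, i) => m === intended[i]);
  return shorter && circumvents;
}
</syntaxhighlight>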
===Experiment===
We evaluated PSA on a population of 195 novice puzzle designers (game design undergraduates) on a fixed level design task split into two conditions: one without and one with the tool. The control group used PuzzleScript's standard editor, while the experimental group also had the level highlighting and solution viewer interfaces shown in Fig. 1. Novice designers in an artificial environment are not an ideal population, since they might not be comfortable with or interested in the design task. On the other hand, we hoped that our tool could help even novice designers work more effectively. To account for differences in PuzzleScript experience, every user received a 1-hour lecture on PuzzleScript in the week prior to performing the task. On the day each group performed the task, we gave an additional 1-hour introductory workshop on PuzzleScript.

Our experimental approach differs from previous evaluations of mixed-initiative co-creative interfaces, e.g. [11]. First, PSA certainly meets that work's requirements of co-creativity: unanticipated proposed solutions can prompt lateral thinking, and scrubbing through solutions aids diagrammatic reasoning, potentially replacing (some) manual playtesting as part of the iterative design cycle. This indeed was the case: reduced manual playtesting and increased level edit counts were the biggest differences between our experimental and control groups. Unlike [11], we are more interested in evaluating the final artifacts than the creative process itself. Moreover, our goal is not to support unconstrained creativity, but creative solutions to the challenge of avoiding shortcuts or workarounds in puzzles. This may be more relevant to junior designers, or in games where levels have specific learning objectives that must be satisfied, e.g. in educational games or to ensure players learn necessary skills before proceeding.

For our purposes, the task of puzzle design is the arrangement of elements from a fixed set. We devised a small puzzle with four rules:

1. Push boxes by walking into them.
2. Flip switches to toggle which of the red and blue colored bridges are up or down.
3. Some switches have adjacent black targets, which must be filled with a box in order to use the switch.
4. If a box is on a bridge and the bridge is toggled down, the player may walk over the lowered box.

For the experiment, we defined two levels (Figs. 4, 5). The first was intended to teach rules 1-3, and the second rule 4. Two bugs were intentionally inserted into each level: one which allowed skipping a large part of the level, and another which violated the provided description of the intended solution. Users were given the game rules, levels, and intended solution for each level; the experimental group was also given PSA and instructions for its use. Halfway through the task (at 20 out of 40 minutes), a hint was given describing the bugs in each level so that we could assess the effectiveness of the tool in solving design problems even for users who did not find the bugs themselves. For each level, the user was asked to describe their belief that they had fixed the design problems on a 3-point scale.

Figure 4: The first level used in the experiment.

Figure 5: The second level used in the experiment.

Both groups used a version of PuzzleScript's editor instrumented to anonymously record level changes, game play, and other actions. Of our 195 users, 107 made a good-faith effort to complete the task (i.e., they edited the level text at all and marked the task as completed). Solutions were scored on a 0-3 scale based on how many design flaws were fixed and how completely they were fixed; participants were unaware of the criteria. For level 1, a solution scored 3 points if all boxes were on targets, 2 if one box was on a blue bridge, 1 if one box was on a red bridge, and 0 otherwise. For level 2, it scored 3 points if four boxes were on targets and one was on a blue bridge, 2 if two boxes were on targets and one was on a red bridge, 1 if at least one box was on a red bridge, and 0 otherwise. We asked users to make minimal repairs and not to add or remove crates, targets, or bridges, so as not to end up with levels that were unrecognizable to our solution metric.
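Read literally, this rubric amounts to a small decision procedure over the repaired level. The sketch below is our own rendering of it; the FinalState counts are hypothetical helpers, since the paper does not specify how the box and bridge positions were extracted for grading.

<syntaxhighlight lang="typescript">
// Sketch of the scoring rubric. The counting fields are hypothetical stand-ins
// for whatever inspection of the repaired level the graders actually performed.

interface FinalState {
  boxesOnTargets: number;
  boxesOnBlueBridges: number;
  boxesOnRedBridges: number;
  totalBoxes: number;
}

function scoreLevel1(s: FinalState): number {
  if (s.boxesOnTargets === s.totalBoxes) return 3; // all boxes on targets
  if (s.boxesOnBlueBridges >= 1) return 2;         // one box on a blue bridge
  if (s.boxesOnRedBridges >= 1) return 1;          // one box on a red bridge
  return 0;
}

function scoreLevel2(s: FinalState): number {
  if (s.boxesOnTargets === 4 && s.boxesOnBlueBridges === 1) return 3;
  if (s.boxesOnTargets === 2 && s.boxesOnRedBridges === 1) return 2;
  if (s.boxesOnRedBridges >= 1) return 1;
  return 0;
}
</syntaxhighlight>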
Unfortunately, from our original 195 participants, we had to throw out data for 127 of them for the following reasons: a bug in our telemetry code failed to collect telemetry for some participants; in some cases we could not reconstruct the final game definition from the recorded sequence of edit operations; and in some cases the levels were left in unsolvable states (we had no sound way of evaluating the solution quality of unsolvable levels). Ultimately, we were able to analyze 68 of the resulting activity records, of which 34 came from each group. All 68 of these users completed both tasks, and our figures and tables refer to those 68 users. We also asked survey questions assessing proficiency at solving and designing puzzles, at using PuzzleScript, and at computer programming; responses were similar across the groups.

===Evaluation===
Our central hypothesis was that the use of PSA would lead to better solutions faster. Surprisingly, we found no significant effect on solution quality across the two conditions (see Fig. 6). The mean quality scores were essentially the same, and a Mann-Whitney U-test across the groups verified that even the small observed differences were insignificant (p = 1.0 for level 1 and p = 0.92 for level 2).

Figure 6: Solution quality by condition and level. The median solution for level 1 was worse than that for level 2, with no substantial differences between the two groups.

Our second hypothesis was that the experimental group would show more consistent self-assessment of solution quality. We derived an error measure for each user by normalizing the reported confidence value for each level to the 0-3 range used for level score, then taking the difference between that confidence value and the actual quality score. This measure also showed essentially no difference between the two groups, either in confidence or in error.
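For concreteness, the error measure described above is just a rescaling followed by a signed difference. The sketch below assumes the confidence was reported on a 1-3 scale and uses a linear mapping onto 0-3; both are our assumptions rather than details stated in the study.

<syntaxhighlight lang="typescript">
// Sketch of the self-assessment error measure: rescale the reported confidence
// (assumed here to be 1-3) onto the 0-3 quality scale, then take the signed
// difference from the graded quality score. The exact rescaling used in the
// study is not specified; this linear mapping is an assumption.

function confidenceError(reportedConfidence: number, qualityScore: number): number {
  const normalized = ((reportedConfidence - 1) / 2) * 3; // 1 -> 0, 2 -> 1.5, 3 -> 3
  return normalized - qualityScore; // positive = overconfident, negative = underconfident
}
</syntaxhighlight>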
Did the PSA group use the tool at all? The telemetry shows that they did. The experimental group spent less time manually playtesting the level and more time modifying it, performing on average 500 fewer game moves than the control group (about a 30% reduction), with a two-sample t-test yielding a strongly significant p-value < 0.005. Everyone in the experimental group scrubbed through solutions, though very few clicked to load a solution step in the editor (those who did load steps did so very frequently, suggesting that the feature is simply non-obvious).

We also saw a near doubling in level edit operations among the experimental group, with a two-sample t-test giving a significant p-value < 0.05. This suggests that the experimental group performed much more iteration on their puzzle levels, which conventional wisdom suggests would yield better solutions. We then looked for a significant correlation between level edit counts and level quality. There was a significant moderate increase in the first level's solution quality with increased edit counts in the experimental group (increased scores by about 0.5, p < 0.005), but nothing that held across both levels or across the two groups (all other effects were small and insignificant, with p > 0.3).

===Discussion===
It was clear that the PSA group used the tool, yet we saw no difference in performance. PSA encouraged increased iteration on levels, but this did not lead to superior solutions! This invalidates our hypotheses and some assumptions behind previous work, so we looked for other explanations.

Was the population ill-suited to the task? We only evaluated PSA on novice designers. We did ask for self-assessments of puzzle solving and design proficiency, but found no correlation between these measures and solution quality. This suggests either that the task does not measure these skills or that the students' self-assessments were inaccurate. The self-assessments were not informative, since puzzle familiarity did not correlate strongly with either confidence or solution quality. If our population truly consists only of novices, then PSA does not help novice designers on the level repair task; if the users were a mix of novices and experts, then PSA does not help anyone on the level repair task and the task is equally hard for novices and experts. The former possibility seems more likely, though it remains surprising. Future work with expert users could help answer this question.

Was the study designed poorly? While we tried to pick a natural task—removing unwanted solutions from a puzzle level—some artificiality was unavoidable if we wanted to compare results across two groups. This may point to an innate challenge in evaluating creativity support tools, but our study did have some specific avoidable issues. For example, we provided a hint that described the level design bugs; this meant that if PSA played a role in finding (as opposed to resolving) the bugs, that effect may have been reduced. On the other hand, most users finished in under 30 minutes, so this is possible but seems unlikely (see Fig. 7). In fact, the experimental group took on average 17% longer to complete the task (a two-sample t-test yielded p < 0.05).

Figure 7: Time taken to complete the task in seconds for each condition. The experimental group took about 8 minutes longer on average to complete the tasks.

Is the tool UI unhelpful? It could be that having solutions provided automatically leads users, perhaps especially novices, to pay less attention and think less about their decisions. This could account for the increased edit count among the experimental group to no apparent benefit, as initial wrong guesses regarding the true design issue required corrections. Given that PSA's interface does show specific steps to produce problematic solutions, this seems unlikely to us. Another issue with PSA is that it does not foreground the specific instant of departure from the intended solution, or indeed validate the intended solution at all. While specifying intended solutions formally is a lot of work, it stands to reason that a tool which accepted and enforced those specifications would be more useful than a tool which merely shows solutions and asks users to check that they are acceptable. Still, one might expect that something is better than nothing, which is not borne out by our study.

This leaves us with two questions:

* Does increased iteration on puzzle levels make them safer with respect to intended solutions?
* Does puzzle design benefit with respect to solution-safety from the use of automatic solution finding?

If the answers to these questions are negative, we must ask about the role of design support tools, especially design validation tools, in the game design process. The second question in particular seems to be a central assumption of many computational game design aides, and either it is false in general—a claim for which this paper is weak evidence, but evidence nonetheless—or it does not hold in our evaluation scheme. This may be because the task is too easy or because solvers do not adequately support the specific subtasks explored in this evaluation.
We intend to perform more targeted evaluations of PSA to explore these possibilities. For immediate future work, this evaluation should be conducted with populations of expert and novice users, perhaps using other puzzle games, other level design bugs, or all of the above. After all, it has been shown for some creative tasks that novices benefit from tool support [3]; perhaps the support that PSA provides does not help novices effectively. It may also be the case that PSA best supports puzzle rule design iteration as opposed to puzzle level design iteration, or that it helps prevent bugs from being added rather than helping to remove bugs. PSA's utility as an automated regression testing tool was not evaluated in this study. Integrating specific support for enforcing intended solutions into PSA could help answer some of the study design questions above, as could improving the user interface based on small-scale user studies.

===References===
[1] Aaron William Bauer and Zoran Popović. 2012. RRT-Based Game Level Analysis, Visualization, and Visual Refinement. In Eighth Artificial Intelligence and Interactive Digital Entertainment Conference.
[2] Eric Butler, Adam M. Smith, Yun-En Liu, and Zoran Popović. 2013. A mixed-initiative tool for designing level progressions in games. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology. ACM, 377–386.
[3] Nicholas Davis, Alexander Zook, Brian O'Neill, Brandon Headrick, Mark Riedl, Ashton Grosz, and Michael Nitsche. 2013. Creativity support for novice digital filmmaking. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 651–660.
[4] Stephen Lavelle. 2013. PuzzleScript. http://puzzlescript.net.
[5] Antonios Liapis, Georgios N. Yannakakis, and Julian Togelius. 2013. Sentient Sketchbook: Computer-aided game level authoring. In Proceedings of the Eighth International Conference on the Foundations of Digital Games. 213–220.
[6] Chong-U Lim and D. Fox Harrell. 2014. An approach to general videogame evaluation and automatic generation using a description language. In 2014 IEEE Conference on Computational Intelligence and Games. IEEE, 1–8.
[7] Aaron Reed. 2010. Creating Interactive Fiction with Inform 7. Cengage Learning.
[8] Mohammad Shaker, Noor Shaker, and Julian Togelius. 2013. Ropossum: An Authoring Tool for Designing, Optimizing and Solving Cut the Rope Levels. In Proceedings of the Ninth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment. AAAI Press.
[9] Adam M. Smith, Eric Butler, and Zoran Popović. 2013. Quantifying over play: Constraining undesirable solutions in puzzle design. In FDG. 221–228.
[10] Gillian Smith, Jim Whitehead, and Michael Mateas. 2010. Tanagra: A mixed-initiative level design tool. In Proceedings of the Fifth International Conference on the Foundations of Digital Games. ACM, 209–216.
[11] Georgios N. Yannakakis, Antonios Liapis, and Constantine Alexopoulos. 2014. Mixed-initiative co-creativity. In Proceedings of the Ninth International Conference on the Foundations of Digital Games.