Evaluating a Solver-Aided Puzzle Design Tool

Joseph C. Osborn
Michael Mateas
Expressive Intelligence Studio
University of California at Santa Cruz
Santa Cruz, USA
jcosborn@soe.ucsc.edu
michaelm@soe.ucsc.edu

Abstract
Puzzle game levels generally each admit a particular set of intended solutions so that the game designer can control difficulty and the introduction of novel concepts. Catching unintended solutions can be difficult for humans, but promising connections with software model checking have been explored by previous work in educational puzzle games. With this in mind, we prototyped a design support tool for the PuzzleScript game engine based on finding and visualizing shortest solution paths. We evaluated the tool with a larger population of novice designers on a fixed level design task aimed at removing shortcuts. Surprisingly, we found no difference in task performance given an oracle for shortest solutions; this paper explores these results and possible explanations for this phenomenon.

A video recording of the tool in use is available at https://archive.org/details/PuzzlescriptAssistantDemonstration.

Author Keywords
Design support; game design; puzzle games

ACM Classification Keywords
H.5.2. [Information Interfaces and Presentation (e.g. HCI)]: User Interfaces; D.2.5 [Software Engineering]: Testing and Debugging

Copyright © 2017 for this paper is held by the author(s).
Proceedings of MICI 2017: CHI Workshop on Mixed-Initiative Creative Interfaces.
Introduction
Designing games is difficult for many of the same reasons that computer programming is difficult: game designers, like programmers, define an initial state and operations on that state given user input, and these operations can interact in unexpected ways or the initial state may be misconfigured. Existing tools for analyzing and testing programs mainly focus on functional requirements, i.e. input/output behavior. In games, however, nonfunctional requirements such as teaching a progression of skills play a comparatively greater role, necessitating new design-focused analysis tools.

Accordingly, several researchers have proposed tools and techniques to assist game designers. Some tools generate elaborated alternatives for the designer’s consideration (e.g. candidate game levels [10, 5]), while others show the consequences of design decisions (e.g. solution or reachable space visualizations [8, 1]). We follow the latter class of tools, providing design-time analysis of PuzzleScript levels.

The literature makes the reasonable assumption that such tools will aid the game design process and help designers create higher-quality designs. While many game design tools have had some form of user evaluation, we are aware of no controlled studies that compare design outcomes between two groups of users, one group using a tool and one not (the closest might be [2], which measured no difference in engagement time between fully- and partially-human-authored game level progressions). This means there is little evidence supporting the fundamental assumption behind game design support tools. We present the negative result of our attempt to gather such evidence to prompt discussion on measuring the leverage such tools provide. In the remainder of this paper we present our tool, the experiment, and a discussion of our results.

PuzzleScript Analyzer
The PuzzleScript Analyzer (PSA) can find solutions to levels for any game written in the programming language PuzzleScript [4], a domain-specific language for puzzle games. Solution search is done via A*; it is similar to [6], but with a different heuristic and more attention to performance. PSA is integrated into PuzzleScript’s web-based editor and is automatically run when the rules or levels change (see Fig. 1). Designers can scrub back and forth through a solution’s steps using the mouse (see Fig. 2) and load up the currently displayed state for interactive play by clicking on it. PSA alerts the designer if the shortest solution changes in length or becomes invalid as the game level and rules are changed (see Fig. 3).

Figure 1: PuzzleScript and the PuzzleScript Analyzer.

Figure 2: A (shortcut) solution to the first level displayed in the tool. Note the position of the mouse cursor, currently scrubbing through the steps.

Figure 3: The same solution display updated after a level design repair. The old solution is marked as unavailable with a red border.

From a requirements standpoint, our work is most similar to two design tools for the game Refraction, which teaches fractional arithmetic. The first tool evaluates puzzles to ensure that every solution satisfies designer-provided properties [9], while the second generates progressions of levels to teach designer-provided concept sequences [2].
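To make the search concrete, here is a minimal sketch of shortest-solution search in the A* style PSA uses. The callback names (`legal_moves`, `apply_move`, `is_win`) and the state representation are illustrative assumptions rather than PSA's actual interface, and PSA's heuristic is not reproduced here; any admissible heuristic (even one that always returns 0, which degenerates to uniform-cost search) preserves shortest-solution optimality.

```python
import heapq

def shortest_solution(initial_state, is_win, legal_moves, apply_move, heuristic):
    """A* over puzzle states: returns the shortest input sequence that
    wins, or None if the level is unsolvable. States must be hashable
    snapshots of the level. Illustrative sketch, not PSA's implementation."""
    counter = 0  # tie-breaker so states themselves never need comparing
    frontier = [(heuristic(initial_state), 0, counter, initial_state, [])]
    best_g = {initial_state: 0}  # cheapest known move count to each state
    while frontier:
        f, g, _, state, path = heapq.heappop(frontier)
        if is_win(state):
            return path  # with an admissible heuristic, this is shortest
        if g > best_g.get(state, float("inf")):
            continue  # stale entry superseded by a cheaper path
        for move in legal_moves(state):
            nxt = apply_move(state, move)
            if g + 1 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g + 1
                counter += 1
                heapq.heappush(frontier, (g + 1 + heuristic(nxt),
                                          g + 1, counter, nxt, path + [move]))
    return None
```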
Following the Refraction tools, and informed by e-mail interviews we conducted with five PuzzleScript users, we believed visualizing the shortest solution would help designers. Three of our interviewees asserted that shortcuts—solutions which did not require learning or using desired concepts—were a problem, especially when levels were meant to teach concepts in order. More formally, shortcuts (or workarounds) are a class of design bug where the designer’s intended solution is either not optimal or is not the only solution. PSA helps find true shortcuts, i.e. those which both circumvent and are shorter than the intended solution.
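Stated as code, the definition reads as follows. The `uses_concepts` predicate is hypothetical: PSA has no concept annotations, so circumvention is left to the designer's judgment and only the length comparison is automated.

```python
def is_true_shortcut(found, intended, uses_concepts):
    """A found solution is a true shortcut when it circumvents the
    intended solution (here: skips the concepts the level was meant to
    exercise) and is strictly shorter. `uses_concepts` is a hypothetical
    designer-supplied predicate, not a PSA feature."""
    return (not uses_concepts(found)) and len(found) < len(intended)
```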
We could not simply visualize solutions by superposing the player’s path on the level: puzzles are often transformed during play, and players might control multiple characters. Instead, we provide an interface for scrubbing through solution steps to see intermediate puzzle states. Moreover, puzzle game rules and levels both undergo iteration; following Inform 7’s Skein [7], we inform the user when previously-seen shortest solutions are no longer optimal or valid.
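A minimal sketch of this invalidation check, reusing the assumed `legal_moves`, `apply_move`, and `is_win` callbacks from the search sketch above; the three-way classification is our framing of the alert, not PSA's internal representation.

```python
def check_stored_solution(stored, level_start, legal_moves, apply_move,
                          is_win, shortest_len):
    """Replay a previously-seen shortest solution against the edited
    rules/level and classify it. shortest_len is assumed to be the length
    of the current shortest solution found by search (None if unsolvable)."""
    state = level_start
    for move in stored:
        if move not in legal_moves(state):
            return "invalid"   # a rule or level change broke this step
        state = apply_move(state, move)
    if not is_win(state):
        return "invalid"       # the replay completes but no longer wins
    if shortest_len is not None and shortest_len < len(stored):
        return "suboptimal"    # a strictly shorter solution now exists
    return "valid"
```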

Our experimental approach differs from previous evaluations of mixed-initiative co-creative interfaces in e.g. [11]. First, PSA certainly meets that work’s requirements of co-creativity: unanticipated proposed solutions can prompt lateral thinking, and scrubbing through solutions aids diagrammatic reasoning, potentially replacing (some) manual playtesting as part of the iterative design cycle. This indeed was the case: reduced manual playtesting and increased level edit counts were the biggest differences between our experimental and control groups.

Unlike [11], we are more interested in evaluating the final artifacts than the creative process itself. Moreover, our goal is not to support unconstrained creativity, but creative solutions to the challenge of avoiding shortcuts or workarounds in puzzles. This may be more relevant to junior designers or in games where levels have specific learning objectives that must be satisfied, e.g. in educational games or to ensure players learn necessary skills before proceeding.

Experiment
We evaluated PSA on a population of 195 novice puzzle designers (game design undergraduates) on a fixed level design task split into two conditions: one without and one with the tool. The control group used PuzzleScript’s standard editor, while the experimental group also had the solution viewer and level-highlighting interfaces shown in Fig. 1. Novice designers in an artificial environment are not an ideal population, since they might not be comfortable with or interested in the design task. On the other hand, we hoped that our tool could help even novice designers work more effectively. To account for differences in PuzzleScript experience, every user received a 1-hour lecture on PuzzleScript in the week prior to performing the task. On the day each group performed the task, we gave an additional 1-hour introductory workshop on PuzzleScript.

For our purposes, the task of puzzle design is the arrangement of elements from a fixed set. We devised a small puzzle with four rules:

1. Push boxes by walking into them.
2. Flip switches to toggle which of the red and blue colored bridges are up or down.
3. Some switches have adjacent black targets, which must be filled with a box in order to use the switch.
4. If a box is on a bridge and the bridge is toggled down, the player may walk over the lowered box.

For the experiment, we defined two levels (Figs. 4, 5). The first was intended to teach rules 1-3, and the second, rule 4.

Figure 4: The first level used in the experiment.

Figure 5: The second level used in the experiment.
Two bugs were intentionally inserted into each level: one which allowed skipping a large part of the level, and another which violated the provided description of the intended solution. Users were given the game rules, levels, and intended solution for each level; the experimental group was also given PSA and instructions for its use. Halfway through the task (at 20 out of 40 minutes), a hint was given describing the bugs in each level so that we could discover the effectiveness of the tool in solving design problems even for users who didn’t find the bugs themselves. For each level, the user was asked to rate their belief that they had fixed the design problems on a 3-point scale.

Both groups used a version of PuzzleScript’s editor instrumented to anonymously record level changes, game play, and other actions. Of our 195 users, 107 made a good-faith effort to complete the task (i.e., they edited the level text at all and marked the task as completed). Solutions were scored on a 0-3 scale based on how many design flaws were fixed and how completely they were fixed (participants were unaware of the criteria). We asked users to make minimal repairs and not to add or remove crates, targets, or bridges, so as not to end up with levels that were unrecognizable to our solution metric.

Scoring:
Level 1: 3 points if all boxes on targets; 2 if one box on a blue bridge; 1 if one box on a red bridge; otherwise, 0.
Level 2: 3 points if four boxes on targets and one on a blue bridge; 2 if two boxes on targets and one on a red bridge; 1 if at least one box on a red bridge; 0 otherwise.
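The rubric above translates directly into code. The predicates over the final level state (box counts and positions) are assumptions about how a solution metric might inspect a level, not the study's actual scoring script.

```python
def score_level_1(all_boxes_on_targets, box_on_blue_bridge, box_on_red_bridge):
    """Level 1: 3 if all boxes on targets; 2 if one box on a blue bridge;
    1 if one box on a red bridge; otherwise 0."""
    if all_boxes_on_targets:
        return 3
    if box_on_blue_bridge:
        return 2
    if box_on_red_bridge:
        return 1
    return 0

def score_level_2(boxes_on_targets, box_on_blue_bridge, box_on_red_bridge):
    """Level 2: 3 if four boxes on targets and one on a blue bridge; 2 if
    two boxes on targets and one on a red bridge; 1 if at least one box
    on a red bridge; otherwise 0."""
    if boxes_on_targets >= 4 and box_on_blue_bridge:
        return 3
    if boxes_on_targets >= 2 and box_on_red_bridge:
        return 2
    if box_on_red_bridge:
        return 1
    return 0
```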
Unfortunately, we had to throw out data for 127 of our original 195 participants for the following reasons: a bug in our telemetry code that failed to collect telemetry for some participants; cases where we could not reconstruct the final game definition from the recorded sequence of edit operations; and cases where the levels were left in unsolvable states (we had no sound way of evaluating the solution quality of unsolvable levels). Ultimately, we were able to analyze 68 of the resulting activity records, of which 34 came from each group. All 68 completed both tasks, and our figures and tables refer to those 68 users. We also asked survey questions assessing proficiency at solving and designing puzzles, using PuzzleScript, and computer programming; responses were similar across the groups.

Evaluation
Our central hypothesis was that the use of PSA would lead to better solutions faster. Surprisingly, we found no significant effect on solution quality across the two conditions (see Fig. 6). The mean quality scores were basically the same, and a Mann-Whitney U-test across the groups verified that even the small observed differences were insignificant (p = 1.0 for level 1 and p = 0.92 for level 2).

Figure 6: Solution quality by condition and level (histograms of l1-quality and l2-quality counts for each condition). The median solution for level 1 was worse than that for level 2, with no substantial differences between the two groups.

We also derived an error measure for each user by normalizing the reported confidence value for each level to the 0-3 range used for the level score, then taking the difference between that confidence value and the actual quality score. Our second hypothesis was that the experimental group would show more consistent self-assessment of solution quality. This measure, too, showed essentially no difference between the two groups, either in confidence or in error.
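To make the two measures concrete, the sketch below computes the per-user error measure and the group comparison. The mapping from the 3-point confidence scale onto the 0-3 quality range and the use of scipy are our own illustration; the scores shown are synthetic placeholders, not study data.

```python
from scipy.stats import mannwhitneyu

def error_measure(confidence_1_to_3, quality_0_to_3):
    """Normalize a 1-3 confidence rating onto the 0-3 quality scale and
    return the signed difference from the actual quality score. The 1-3
    input range is an assumption about the survey instrument."""
    normalized = (confidence_1_to_3 - 1) * 1.5  # maps 1..3 onto 0..3
    return normalized - quality_0_to_3

# Illustrative group comparison on level-1 quality (synthetic scores):
control = [0, 1, 1, 2, 3, 2]
experimental = [1, 1, 0, 2, 2, 3]
stat, p = mannwhitneyu(control, experimental, alternative="two-sided")
print(f"U = {stat}, p = {p:.2f}")
```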
Did the PSA group use the tool at all? The telemetry shows that they did. The experimental group spent less time manually playtesting the level and more time modifying it, performing on average 500 fewer game moves than the control group (about a 30% reduction); a two-sample t-test yielded a strongly significant p-value < 0.005. Everyone in the experimental group scrubbed through solutions, though very few clicked to load a solution step in the editor (those who did load steps did so very frequently, suggesting that the feature is simply non-obvious).

We also saw a near doubling of level edit operations in the experimental group, and a two-sample t-test gave a significant p-value < 0.05. This suggests that the experimental group performed much more iteration on their puzzle levels, which conventional wisdom suggests would yield better solutions. We then looked for a significant correlation between level edit counts and level quality. There was a significant moderate increase in the first level’s solution quality with increased edit counts in the experimental group (scores increased by about 0.5, p < 0.005), but nothing that held across both levels or across the two groups (all small and insignificant effects with p > 0.3).
Discussion
It was clear that the PSA group used the tool, yet we saw no difference in performance. PSA encouraged increased iteration on levels, but this did not lead to superior solutions! This invalidates our hypotheses and some assumptions behind previous work, so we looked for other explanations.

Was the population ill-suited to the task? We only evaluated PSA on novice designers. We did ask for self-assessments of puzzle-solving and design proficiency, but found no correlation between these measures and solution quality. This suggests either that the task does not measure these skills or that the students’ self-assessments were inaccurate. The self-assessments were not informative, since puzzle familiarity did not correlate strongly with either confidence or solution quality. If our population truly consists only of novices, then PSA does not help novice designers on the level repair task; if the users were a mix of novices and experts, PSA does not help anyone on the level repair task and the task is equally hard for novices and experts. The former possibility seems more likely, though it remains surprising. Future work with expert users could help answer this question.

Was the study designed poorly? While we tried to pick a natural task—removing unwanted solutions from a puzzle level—some artificiality was unavoidable if we wanted to compare results across two groups. This may point to an innate challenge in evaluating creativity support tools, but our study did have some specific avoidable issues. For example, we provided a hint that described the level design bugs; this meant that if PSA played a role in finding (as opposed to resolving) the bugs, that effect may have been reduced. On the other hand, most users finished in under 30 minutes, so this is possible but seems unlikely (see Fig. 7). In fact, the experimental group took on average 17% longer to complete the task (a two-sample t-test yielded p < 0.05).

Figure 7: Time taken to complete the task in seconds for each condition. The experimental group took about 8 minutes longer on average to complete the tasks.

Is the tool UI unhelpful? It could be that having solutions provided automatically leads users, perhaps especially novices, to pay less attention and think less about their decisions. This could account for the increased edit count in the experimental group to no apparent benefit, as initial wrong guesses about the true design issue required corrections. Given that PSA’s interface does show the specific steps that produce problematic solutions, this seems unlikely to us. Another issue with PSA is that it does not foreground the specific instant of departure from the intended solution, or indeed validate the intended solution at all. While specifying intended solutions formally is a lot of work, it stands to reason that a tool which accepted and enforced those specifications would be more useful than a tool which merely shows solutions and asks users to check that they are acceptable. Still, one might expect that something is better than nothing, which is not borne out by our study. This leaves us with two questions:

• Does increased iteration on puzzle levels make them safer with respect to intended solutions?

• Does puzzle design benefit, with respect to solution-safety, from the use of automatic solution finding?

If the answers to these questions are negative, we must ask about the role of design support tools, especially design validation tools, in the game design process. The second question in particular seems to be a central assumption of many computational game design aids, and either it is false in general—a claim for which this paper is weak evidence, but evidence nonetheless—or it does not hold in our evaluation scheme. This may be because the task is too easy or because solvers do not adequately support the specific subtasks explored in this evaluation.
We intend to perform more targeted evaluations of PSA to explore these possibilities. For immediate future work, this evaluation should be conducted with populations of expert and novice users, perhaps using other puzzle games, other level design bugs, or all of the above. After all, it has been shown for some creative tasks that novices benefit from tool support [3]—perhaps the support that PSA provides does not help novices effectively. It may also be the case that PSA best supports puzzle rule design iteration as opposed to puzzle level design iteration, or that it helps prevent bugs from being added rather than helping to remove them. PSA’s utility as an automated regression testing tool was not evaluated in this study. Integrating specific support for enforcing intended solutions into PSA could help answer some of the study design questions above, as could improving the user interface based on small-scale user studies.
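As a sketch of what foregrounding the instant of departure might look like (a feature discussed above but not implemented), the function below locates the first step at which a found solution diverges from a designer-specified intended solution; the input-sequence representation is an assumption. A scrubbing interface like PSA's could then jump directly to this step.

```python
def divergence_step(found, intended):
    """Return the index of the first step where a found solution departs
    from the intended one, or None if one is a prefix of the other.
    Solutions are assumed to be sequences of player inputs; illustrative
    sketch only, not an existing PSA feature."""
    for i, (a, b) in enumerate(zip(found, intended)):
        if a != b:
            return i
    return None
```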
References
[1] Aaron William Bauer and Zoran Popović. 2012. RRT-Based Game Level Analysis, Visualization, and Visual Refinement. In Eighth Artificial Intelligence and Interactive Digital Entertainment Conference.
[2] Eric Butler, Adam M. Smith, Yun-En Liu, and Zoran Popović. 2013. A mixed-initiative tool for designing level progressions in games. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology. ACM, 377–386.
[3] Nicholas Davis, Alexander Zook, Brian O’Neill, Brandon Headrick, Mark Riedl, Ashton Grosz, and Michael Nitsche. 2013. Creativity support for novice digital filmmaking. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 651–660.
[4] Stephen Lavelle. 2013. PuzzleScript. http://puzzlescript.net.
[5] Antonios Liapis, Georgios N. Yannakakis, and Julian Togelius. 2013. Sentient Sketchbook: Computer-aided game level authoring. In Proceedings of the Eighth International Conference on the Foundations of Digital Games. 213–220.
[6] Chong-U Lim and D. Fox Harrell. 2014. An approach to general videogame evaluation and automatic generation using a description language. In 2014 IEEE Conference on Computational Intelligence and Games. IEEE, 1–8.
[7] Aaron Reed. 2010. Creating Interactive Fiction with Inform 7. Cengage Learning.
[8] Mohammad Shaker, Noor Shaker, and Julian Togelius. 2013. Ropossum: An Authoring Tool for Designing, Optimizing and Solving Cut the Rope Levels. In Proceedings of the Ninth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment. AAAI Press.
[9] Adam M. Smith, Eric Butler, and Zoran Popović. 2013. Quantifying over play: Constraining undesirable solutions in puzzle design. In FDG. 221–228.
[10] Gillian Smith, Jim Whitehead, and Michael Mateas. 2010. Tanagra: A mixed-initiative level design tool. In Proceedings of the Fifth International Conference on the Foundations of Digital Games. ACM, 209–216.
[11] Georgios N. Yannakakis, Antonios Liapis, and Constantine Alexopoulos. 2014. Mixed-initiative co-creativity. In Proceedings of the Ninth International Conference on the Foundations of Digital Games.