Evaluating a Solver-Aided Puzzle Design Tool

Joseph C. Osborn
Michael Mateas
Expressive Intelligence Studio
University of California at Santa Cruz
Santa Cruz, USA
jcosborn@soe.ucsc.edu
michaelm@soe.ucsc.edu

Abstract
Puzzle game levels generally each admit a particular set of intended solutions so that the game designer can control difficulty and the introduction of novel concepts. Catching unintended solutions can be difficult for humans, but promising connections with software model checking have been explored by previous work in educational puzzle games. With this in mind, we prototyped a design support tool for the PuzzleScript game engine based on finding and visualizing shortest solution paths. We evaluated the tool with a larger population of novice designers on a fixed level design task aimed at removing shortcuts. Surprisingly, we found no difference in task performance given an oracle for shortest solutions; this paper explores these results and possible explanations for this phenomenon.

A video recording of the tool in use is available at https://archive.org/details/PuzzlescriptAssistantDemonstration.

Author Keywords
Design support; game design; puzzle games

ACM Classification Keywords
H.5.2. [Information Interfaces and Presentation (e.g. HCI)]: User Interfaces; D.2.5 [Software Engineering]: Testing and Debugging

Copyright © 2017 for this paper is held by the author(s).
Proceedings of MICI 2017: CHI Workshop on Mixed-Initiative Creative Interfaces.
Introduction
Designing games is difficult for many of the same reasons that computer programming is difficult: game designers, like programmers, define an initial state and operations on that state given user input, and these operations can interact in unexpected ways or the initial state may be misconfigured. Existing tools for analyzing and testing programs mainly focus on functional requirements, i.e. input/output behavior. In games, however, nonfunctional requirements such as teaching a progression of skills play a comparatively greater role, necessitating new design-focused analysis tools.

Accordingly, several researchers have proposed tools and techniques to assist game designers. Some tools generate elaborated alternatives for the designer’s consideration (e.g. candidate game levels [10, 5]), while others show the consequences of design decisions (e.g. solution or reachable space visualizations [8, 1]). We follow the latter class of tools, providing design-time analysis of PuzzleScript levels.

The literature makes the reasonable assumption that such tools will aid the game design process and help designers create higher-quality designs. While many game design tools have had some form of user evaluation, we are aware of no controlled studies that compare design outcomes between two groups of users, one group using a tool and one not (the closest might be [2], which measured no difference in engagement time between fully- and partially-human-authored game level progressions). This means there is little evidence supporting the fundamental assumption behind game design support tools. We present the negative result of our attempt to gather such evidence to prompt discussion on measuring the leverage such tools provide. In the remainder of this paper we present our tool, the experiment, and a discussion of our results.

PuzzleScript Analyzer
The PuzzleScript Analyzer (PSA) can find solutions to levels for any game written in the programming language PuzzleScript [4], a domain-specific language for puzzle games. Solution search is done via A*; it is similar to [6], but with a different heuristic and more attention to performance. PSA is integrated into PuzzleScript’s web-based editor and is automatically run when the rules or levels change (see Fig. 1). Designers can scrub back and forth through a solution’s steps using the mouse (see Fig. 2) and load up the currently displayed state for interactive play by clicking on it. PSA alerts the designer if the shortest solution changes in length or becomes invalid as the game level and rules are changed (see Fig. 3).

Figure 1: PuzzleScript and the PuzzleScript Analyzer.

Figure 2: A (shortcut) solution to the first level displayed in the tool. Note the position of the mouse cursor, currently scrubbing through the steps.

Figure 3: The same solution display updated after a level design repair. The old solution is marked as unavailable with a red border.

From a requirements standpoint, our work is most similar to two design tools for the game Refraction, which teaches fractional arithmetic. The first tool evaluates puzzles to ensure that every solution satisfies designer-provided properties [9], while the second generates progressions of levels to teach designer-provided concept sequences [2].
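To make the search concrete, here is a minimal sketch of shortest-solution search in the A* style PSA uses. The callback names (`legal_moves`, `apply_move`, `is_win`) and the state representation are illustrative assumptions rather than PSA's actual interface, and PSA's heuristic is not reproduced here; any admissible heuristic (even one that always returns 0, which degenerates to uniform-cost search) preserves shortest-solution optimality.

```python
import heapq

def shortest_solution(initial_state, is_win, legal_moves, apply_move, heuristic):
    """A* over puzzle states: returns the shortest input sequence that
    wins, or None if the level is unsolvable. States must be hashable
    snapshots of the level. Illustrative sketch, not PSA's implementation."""
    counter = 0  # tie-breaker so states themselves never need comparing
    frontier = [(heuristic(initial_state), 0, counter, initial_state, [])]
    best_g = {initial_state: 0}  # cheapest known move count to each state
    while frontier:
        f, g, _, state, path = heapq.heappop(frontier)
        if is_win(state):
            return path  # with an admissible heuristic, this is shortest
        if g > best_g.get(state, float("inf")):
            continue  # stale entry superseded by a cheaper path
        for move in legal_moves(state):
            nxt = apply_move(state, move)
            if g + 1 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g + 1
                counter += 1
                heapq.heappush(frontier, (g + 1 + heuristic(nxt),
                                          g + 1, counter, nxt, path + [move]))
    return None
```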
Following the Refraction tools, and informed by e-mail interviews we conducted with five PuzzleScript users, we believed visualizing the shortest solution would help designers. Three of our interviewees asserted that shortcuts—solutions which did not require learning or using desired concepts—were a problem, especially when levels were meant to teach concepts in order. More formally, shortcuts (or workarounds) are a class of design bug where the designer’s intended solution is either not optimal or is not the only solution. PSA helps find true shortcuts, i.e. those which both circumvent and are shorter than the intended solution.
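Stated as code, the definition reads as follows. The `uses_concepts` predicate is hypothetical: PSA has no concept annotations, so circumvention is left to the designer's judgment and only the length comparison is automated.

```python
def is_true_shortcut(found, intended, uses_concepts):
    """A found solution is a true shortcut when it circumvents the
    intended solution (here: skips the concepts the level was meant to
    exercise) and is strictly shorter. `uses_concepts` is a hypothetical
    designer-supplied predicate, not a PSA feature."""
    return (not uses_concepts(found)) and len(found) < len(intended)
```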
We could not simply visualize solutions by superposing the player’s path on the level: puzzles are often transformed during play, and players might control multiple characters. Instead, we provide an interface for scrubbing through solution steps to see intermediate puzzle states. Moreover, puzzle game rules and levels both undergo iteration; following Inform 7’s Skein [7], we inform the user when previously-seen shortest solutions are no longer optimal or valid.
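A minimal sketch of this invalidation check, reusing the assumed `legal_moves`, `apply_move`, and `is_win` callbacks from the search sketch above; the three-way classification is our framing of the alert, not PSA's internal representation.

```python
def check_stored_solution(stored, level_start, legal_moves, apply_move,
                          is_win, shortest_len):
    """Replay a previously-seen shortest solution against the edited
    rules/level and classify it. shortest_len is assumed to be the length
    of the current shortest solution found by search (None if unsolvable)."""
    state = level_start
    for move in stored:
        if move not in legal_moves(state):
            return "invalid"   # a rule or level change broke this step
        state = apply_move(state, move)
    if not is_win(state):
        return "invalid"       # the replay completes but no longer wins
    if shortest_len is not None and shortest_len < len(stored):
        return "suboptimal"    # a strictly shorter solution now exists
    return "valid"
```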

Our experimental approach differs from previous evaluations of mixed-initiative co-creative interfaces in e.g. [11]. First, PSA certainly meets that work’s requirements of co-creativity: unanticipated proposed solutions can prompt lateral thinking, and scrubbing through solutions aids diagrammatic reasoning, potentially replacing (some) manual playtesting as part of the iterative design cycle. This indeed was the case: reduced manual playtesting and increased level edit counts were the biggest differences between our experimental and control groups.

Unlike [11], we are more interested in evaluating the final artifacts than the creative process itself. Moreover, our goal is not to support unconstrained creativity, but creative solutions to the challenge of avoiding shortcuts or workarounds in puzzles. This may be more relevant to junior designers or in games where levels have specific learning objectives that must be satisfied, e.g. in educational games or to ensure players learn necessary skills before proceeding.

Experiment
We evaluated PSA on a population of 195 novice puzzle designers (game design undergraduates) on a fixed level design task split into two conditions: one without and one with the tool. The control group used PuzzleScript’s standard editor, while the experimental group also had the solution viewer and level-highlighting interfaces shown in Fig. 1. Novice designers in an artificial environment are not an ideal population, since they might not be comfortable with or interested in the design task. On the other hand, we hoped that our tool could help even novice designers work more effectively. To account for differences in PuzzleScript experience, every user received a 1-hour lecture on PuzzleScript in the week prior to performing the task. On the day each group performed the task, we gave an additional 1-hour introductory workshop on PuzzleScript.

For our purposes, the task of puzzle design is the arrangement of elements from a fixed set. We devised a small puzzle with four rules:

1. Push boxes by walking into them.
2. Flip switches to toggle which of the red and blue colored bridges are up or down.
3. Some switches have adjacent black targets, which must be filled with a box in order to use the switch.
4. If a box is on a bridge and the bridge is toggled down, the player may walk over the lowered box.

For the experiment, we defined two levels (Figs. 4, 5). The first was intended to teach rules 1-3, and the second, rule 4.

Figure 4: The first level used in the experiment.

Figure 5: The second level used in the experiment.
Two bugs were intentionally inserted into each level: one which allowed skipping a large part of the level, and another which violated the provided description of the intended solution. Users were given the game rules, levels, and intended solution for each level; the experimental group was also given PSA and instructions for its use. Halfway through the task (at 20 out of 40 minutes), a hint was given describing the bugs in each level so that we could discover the effectiveness of the tool in solving design problems even for users who didn’t find the bugs themselves. For each level, the user was asked to rate their belief that they had fixed the design problems on a 3-point scale.

Both groups used a version of PuzzleScript’s editor instrumented to anonymously record level changes, game play, and other actions. Of our 195 users, 107 made a good-faith effort to complete the task (i.e., they edited the level text at all and marked the task as completed). Solutions were scored on a 0-3 scale based on how many design flaws were fixed and how completely they were fixed (participants were unaware of the criteria). We asked users to make minimal repairs and not to add or remove crates, targets, or bridges, so as not to end up with levels that were unrecognizable to our solution metric.

Scoring:
Level 1: 3 points if all boxes on targets; 2 if one box on a blue bridge; 1 if one box on a red bridge; otherwise, 0.
Level 2: 3 points if four boxes on targets and one on a blue bridge; 2 if two boxes on targets and one on a red bridge; 1 if at least one box on a red bridge; 0 otherwise.
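The rubric above translates directly into code. The predicates over the final level state (box counts and positions) are assumptions about how a solution metric might inspect a level, not the study's actual scoring script.

```python
def score_level_1(all_boxes_on_targets, box_on_blue_bridge, box_on_red_bridge):
    """Level 1: 3 if all boxes on targets; 2 if one box on a blue bridge;
    1 if one box on a red bridge; otherwise 0."""
    if all_boxes_on_targets:
        return 3
    if box_on_blue_bridge:
        return 2
    if box_on_red_bridge:
        return 1
    return 0

def score_level_2(boxes_on_targets, box_on_blue_bridge, box_on_red_bridge):
    """Level 2: 3 if four boxes on targets and one on a blue bridge; 2 if
    two boxes on targets and one on a red bridge; 1 if at least one box
    on a red bridge; otherwise 0."""
    if boxes_on_targets >= 4 and box_on_blue_bridge:
        return 3
    if boxes_on_targets >= 2 and box_on_red_bridge:
        return 2
    if box_on_red_bridge:
        return 1
    return 0
```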
Unfortunately, we had to throw out data for 127 of our original 195 participants for the following reasons: a bug in our telemetry code that failed to collect telemetry for some participants; cases where we could not reconstruct the final game definition from the recorded sequence of edit operations; and cases where the levels were left in unsolvable states (we had no sound way of evaluating the solution quality of unsolvable levels). Ultimately, we were able to analyze 68 of the resulting activity records, of which 34 came from each group. All 68 completed both tasks, and our figures and tables refer to those 68 users. We also asked survey questions assessing proficiency at solving and designing puzzles, using PuzzleScript, and computer programming; responses were similar across the groups.

Evaluation
Our central hypothesis was that the use of PSA would lead to better solutions faster. Surprisingly, we found no significant effect on solution quality across the two conditions (see Fig. 6). The mean quality scores were basically the same, and a Mann-Whitney U-test across the groups verified that even the small observed differences were insignificant (p = 1.0 for level 1 and p = 0.92 for level 2).

Figure 6: Solution quality by condition and level (histograms of l1-quality and l2-quality counts for each condition). The median solution for level 1 was worse than that for level 2, with no substantial differences between the two groups.

We also derived an error measure for each user by normalizing the reported confidence value for each level to the 0-3 range used for the level score, then taking the difference between that confidence value and the actual quality score. Our second hypothesis was that the experimental group would show more consistent self-assessment of solution quality. This measure, too, showed essentially no difference between the two groups, either in confidence or in error.
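To make the two measures concrete, the sketch below computes the per-user error measure and the group comparison. The mapping from the 3-point confidence scale onto the 0-3 quality range and the use of scipy are our own illustration; the scores shown are synthetic placeholders, not study data.

```python
from scipy.stats import mannwhitneyu

def error_measure(confidence_1_to_3, quality_0_to_3):
    """Normalize a 1-3 confidence rating onto the 0-3 quality scale and
    return the signed difference from the actual quality score. The 1-3
    input range is an assumption about the survey instrument."""
    normalized = (confidence_1_to_3 - 1) * 1.5  # maps 1..3 onto 0..3
    return normalized - quality_0_to_3

# Illustrative group comparison on level-1 quality (synthetic scores):
control = [0, 1, 1, 2, 3, 2]
experimental = [1, 1, 0, 2, 2, 3]
stat, p = mannwhitneyu(control, experimental, alternative="two-sided")
print(f"U = {stat}, p = {p:.2f}")
```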
Did the PSA group use the tool at all? The telemetry shows that they did. The experimental group spent less time manually playtesting the level and more time modifying it, performing on average 500 fewer game moves than the control group (about a 30% reduction); a two-sample t-test yielded a strongly significant p-value < 0.005. Everyone in the experimental group scrubbed through solutions, though very few clicked to load a solution step in the editor (those who did load steps did so very frequently, suggesting that the feature is simply non-obvious).

We also saw a near doubling of level edit operations in the experimental group, and a two-sample t-test gave a significant p-value < 0.05. This suggests that the experimental group performed much more iteration on their puzzle levels, which conventional wisdom suggests would yield better solutions. We then looked for a significant correlation between level edit counts and level quality. There was a significant moderate increase in the first level’s solution quality with increased edit counts in the experimental group (scores increased by about 0.5, p < 0.005), but nothing that held across both levels or across the two groups (all small and insignificant effects with p > 0.3).
Discussion
It was clear that the PSA group used the tool, yet we saw no difference in performance. PSA encouraged increased iteration on levels, but this did not lead to superior solutions! This invalidates our hypotheses and some assumptions behind previous work, so we looked for other explanations.

Was the population ill-suited to the task? We only evaluated PSA on novice designers. We did ask for self-assessments of puzzle-solving and design proficiency, but found no correlation between these measures and solution quality. This suggests either that the task does not measure these skills or that the students’ self-assessments were inaccurate. The self-assessments were not informative, since puzzle familiarity did not correlate strongly with either confidence or solution quality. If our population truly consists only of novices, then PSA does not help novice designers on the level repair task; if the users were a mix of novices and experts, PSA does not help anyone on the level repair task and the task is equally hard for novices and experts. The former possibility seems more likely, though it remains surprising. Future work with expert users could help answer this question.

Was the study designed poorly? While we tried to pick a natural task—removing unwanted solutions from a puzzle level—some artificiality was unavoidable if we wanted to compare results across two groups. This may point to an innate challenge in evaluating creativity support tools, but our study did have some specific avoidable issues. For example, we provided a hint that described the level design bugs; this meant that if PSA played a role in finding (as opposed to resolving) the bugs, that effect may have been reduced. On the other hand, most users finished in under 30 minutes, so this is possible but seems unlikely (see Fig. 7). In fact, the experimental group took on average 17% longer to complete the task (a two-sample t-test yielded p < 0.05).

Figure 7: Time taken to complete the task in seconds for each condition. The experimental group took about 8 minutes longer on average to complete the tasks.

Is the tool UI unhelpful? It could be that having solutions provided automatically leads users, perhaps especially novices, to pay less attention and think less about their decisions. This could account for the increased edit count in the experimental group to no apparent benefit, as initial wrong guesses about the true design issue required corrections. Given that PSA’s interface does show the specific steps that produce problematic solutions, this seems unlikely to us. Another issue with PSA is that it does not foreground the specific instant of departure from the intended solution, or indeed validate the intended solution at all. While specifying intended solutions formally is a lot of work, it stands to reason that a tool which accepted and enforced those specifications would be more useful than a tool which merely shows solutions and asks users to check that they are acceptable. Still, one might expect that something is better than nothing, which is not borne out by our study. This leaves us with two questions:

• Does increased iteration on puzzle levels make them safer with respect to intended solutions?

• Does puzzle design benefit, with respect to solution-safety, from the use of automatic solution finding?

If the answers to these questions are negative, we must ask about the role of design support tools, especially design validation tools, in the game design process. The second question in particular seems to be a central assumption of many computational game design aids, and either it is false in general—a claim for which this paper is weak evidence, but evidence nonetheless—or it does not hold in our evaluation scheme. This may be because the task is too easy or because solvers do not adequately support the specific subtasks explored in this evaluation.
We intend to perform more targeted evaluations of PSA to explore these possibilities. For immediate future work, this evaluation should be conducted with populations of expert and novice users, perhaps using other puzzle games, other level design bugs, or all of the above. After all, it has been shown for some creative tasks that novices benefit from tool support [3]—perhaps the support that PSA provides does not help novices effectively. It may also be the case that PSA best supports puzzle rule design iteration as opposed to puzzle level design iteration, or that it helps prevent bugs from being added rather than helping to remove them. PSA’s utility as an automated regression testing tool was not evaluated in this study. Integrating specific support for enforcing intended solutions into PSA could help answer some of the study design questions above, as could improving the user interface based on small-scale user studies.
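As a sketch of what foregrounding the instant of departure might look like (a feature discussed above but not implemented), the function below locates the first step at which a found solution diverges from a designer-specified intended solution; the input-sequence representation is an assumption. A scrubbing interface like PSA's could then jump directly to this step.

```python
def divergence_step(found, intended):
    """Return the index of the first step where a found solution departs
    from the intended one, or None if one is a prefix of the other.
    Solutions are assumed to be sequences of player inputs; illustrative
    sketch only, not an existing PSA feature."""
    for i, (a, b) in enumerate(zip(found, intended)):
        if a != b:
            return i
    return None
```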
References
[1] Aaron William Bauer and Zoran Popović. 2012. RRT-Based Game Level Analysis, Visualization, and Visual Refinement. In Eighth Artificial Intelligence and Interactive Digital Entertainment Conference.
[2] Eric Butler, Adam M. Smith, Yun-En Liu, and Zoran Popović. 2013. A mixed-initiative tool for designing level progressions in games. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology. ACM, 377–386.
[3] Nicholas Davis, Alexander Zook, Brian O’Neill, Brandon Headrick, Mark Riedl, Ashton Grosz, and Michael Nitsche. 2013. Creativity support for novice digital filmmaking. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 651–660.
[4] Stephen Lavelle. 2013. PuzzleScript. http://puzzlescript.net.
[5] Antonios Liapis, Georgios N. Yannakakis, and Julian Togelius. 2013. Sentient Sketchbook: Computer-aided game level authoring. In Proceedings of the Eighth International Conference on the Foundations of Digital Games. 213–220.
[6] Chong-U Lim and D. Fox Harrell. 2014. An approach to general videogame evaluation and automatic generation using a description language. In 2014 IEEE Conference on Computational Intelligence and Games. IEEE, 1–8.
[7] Aaron Reed. 2010. Creating Interactive Fiction with Inform 7. Cengage Learning.
[8] Mohammad Shaker, Noor Shaker, and Julian Togelius. 2013. Ropossum: An Authoring Tool for Designing, Optimizing and Solving Cut the Rope Levels. In Proceedings of the Ninth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment. AAAI Press.
[9] Adam M. Smith, Eric Butler, and Zoran Popović. 2013. Quantifying over play: Constraining undesirable solutions in puzzle design. In FDG. 221–228.
[10] Gillian Smith, Jim Whitehead, and Michael Mateas. 2010. Tanagra: A mixed-initiative level design tool. In Proceedings of the Fifth International Conference on the Foundations of Digital Games. ACM, 209–216.
[11] Georgios N. Yannakakis, Antonios Liapis, and Constantine Alexopoulos. 2014. Mixed-initiative co-creativity. In Proceedings of the Ninth International Conference on the Foundations of Digital Games.