<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Mixed-Initiative Creative Interfaces via Expressive Range Coverage Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Max Kreminski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isaac Karth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Mateas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Noah Wardrip-Fruin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of California</institution>
          ,
          <addr-line>Santa Cruz, 1156 High St, Santa Cruz, CA 95064</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We introduce expressive range coverage analysis (ERaCA): a technique for evaluating mixed-initiative creative interfaces (MICIs) in which creative responsibility is shared between a human user and a generative model. ERaCA revolves around the examination of a small number of human-created artifacts in the context of a visualization of the broader expressive range from which these artifacts were sampled. As a pilot study of our approach, we apply ERaCA to the evaluation of Redactionist, a MICI for erasure poetry creation, and find that ERaCA allows us to visually answer questions about how thoroughly users explore the underlying model's expressive range; whether users produce artifacts that are typical or unusual from the underlying model's perspective; whether different users of a single MICI tend to produce similar or different artifacts; whether a MICI tends to promote divergent or convergent thinking; and how a single user's artifacts evolve as they continue to use a MICI over time.</p>
      </abstract>
      <kwd-group>
<kwd>expressive range analysis</kwd>
        <kwd>mixed-initiative co-creativity</kwd>
        <kwd>creativity support tools</kwd>
        <kwd>evaluation methods</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Mixed-initiative creative interfaces [1, 2], or MICIs, are a genre of creativity support tools [3] in which creative responsibility is shared between a human user and an artificially intelligent system. Many MICIs consist of two layers: an underlying generative model that defines a possibility space of artifacts—sometimes learned from a corpus of training data [4], sometimes defined by a set of rules or constraints [5]—and a supervening mechanism for navigating this space to locate artifacts that match a user’s prompt or intent. In MICIs that take this approach, an artifact’s creation is synonymous with its discovery and selection by a user.</p>
      <p>Evaluating the effectiveness of these systems can be difficult, in part because neither the user nor the generative model is solely responsible for the artifacts produced [6]. In particular, a skilled user may be able to coax compelling artifacts from even the most unwieldy MICI, making it difficult to characterize how effectively a MICI supports its users in realizing their creative goals. Additionally, insofar as these tools often lead users to create artifacts that they would not have thought to create before, it is difficult to compare a MICI-plus-user system with the unassisted user in terms of creative capabilities, because the user’s original creative intent can be substantially shaped or modified by their interaction with the tool. As a result, assessments of MICIs often focus on evaluating the subjective perception of creativity support from the user’s perspective [6]. The artifacts that users produce are comparatively rarely evaluated, and even when they are evaluated, discerning what role the MICI played in shaping these artifacts may still be out of reach.</p>
      <p>Though evaluating creativity is difficult in general [7], researchers have developed a number of effective approaches to the evaluation of computationally creative systems [8, 9] in which creative responsibility is attributed primarily or solely to the machine [10]. In particular, a technique known as expressive range analysis (ERA) [11] can be used to characterize the behavior of a generative model by visualizing its possibility space. This makes it easy to visually compare the expressive range of different generative models that produce the same kind of artifact—and to describe a generative model in terms of its grain, or the characteristics of the artifacts that it tends to produce [12].</p>
      <p>However, because ERA relies on the rapid generation and characterization of a very large number of artifacts [13], this method of evaluation cannot straightforwardly be applied to mixed-initiative creative collaborations. When a human user must be involved in the production of every artifact, it becomes prohibitively time-consuming to produce the hundreds or thousands of artifacts that ERA demands. As a result, although ERA is frequently applied to the evaluation of end-to-end computationally creative systems, including the generative models underlying some MICIs [14], its application to understanding the influence of MICI design on user behavior and user experience has remained limited.</p>
      <p>Joint Proceedings of the ACM IUI Workshops 2022, March 2022, Helsinki, Finland. mkremins@ucsc.edu (M. Kreminski); ikarth@ucsc.edu (I. Karth); mmateas@ucsc.edu (M. Mateas); nwardrip@ucsc.edu (N. Wardrip-Fruin). https://mkremins.github.io (M. Kreminski). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <p>In this paper, we propose a new technique for evaluating MICIs—expressive range coverage analysis (ERaCA)—that extends ERA to a co-creative context by visualizing a small number of co-created artifacts in the context of the broader expressive range from which these artifacts were sampled. ERaCA applies a set of quantitative artifact evaluation metrics to the simultaneous assessment of many model-created artifacts and a handful of co-created artifacts, then produces a visualization of the results, allowing us to visually answer such questions as:
• Does a MICI allow its users to access the entirety of the underlying generative model’s expressive range, or only a limited subset?
• How typical or unusual are the artifacts created by a user in the context of the broader expressive range?
• Are all of a MICI’s users drawn toward the same parts of its expressive range, or do different users typically explore different regions of the possibility space?
• As users continue to interact with a MICI, do the artifacts they produce tend to get closer together or further apart within the expressive range? In other words, does the MICI tend to promote convergent or divergent thinking?
• More generally, as users continue to interact with a MICI, what trends appear in a single user’s artifacts over time?</p>
      <p>We demonstrate ERaCA via a pilot study in which we
apply the ERaCA method to the evaluation of
Redactionist, a MICI for erasure poetry creation. The resulting
visualizations provide preliminary answers to several of
the above questions based on data collected from a small
number of users. Altogether, the argument for our
approach can be summed up as follows: we learn more
about a MICI from inspecting co-created artifacts in
the context of the underlying expressive range than
we do from inspecting both co-created artifacts and
the underlying expressive range individually.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
<p>Expressive range analysis (ERA) [11] is a visualization-based approach to understanding and evaluating the effectiveness of generative models. Application of ERA follows a four-step approach:
1. Determine appropriate quantitative metrics for the kinds of artifacts that the generative model will produce. Ideally these metrics are computationally inexpensive to evaluate, so that they can be efficiently applied to a large number of individual artifacts.
2. Generate a large number of artifacts using the generative model to collect a representative sample of the model’s output, using the metrics defined in step 1 to evaluate each artifact.
3. Visualize the results of evaluation, typically as a set of two-dimensional histograms in which pairs of metrics are plotted against one another to showcase artifact density in different “slices” of the overall expressive range.
4. Analyze the impact of parameters passed to the generative model on the resulting expressive range, allowing for the visual determination of how different parameters influence the artifacts that the model produces.</p>
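The four steps above can be sketched in a few dozen lines. Everything in this sketch is a hypothetical stand-in: the toy "generative model" and both metrics are invented for illustration (real ERA metrics are domain-specific), and a plotting library would normally render the binned grid as a histogram image.

```python
import random

VOCAB = ["moon", "stone", "river", "glass", "quiet", "ember"]

def generate_artifact(rng):
    """Stand-in generative model: an 'artifact' is just a list of words."""
    return [rng.choice(VOCAB) for _ in range(rng.randint(3, 12))]

# Step 1: cheap quantitative metrics, one number per artifact.
def length_in_chars(artifact):
    return sum(len(word) for word in artifact)

def unique_word_ratio(artifact):
    return len(set(artifact)) / len(artifact)

# Step 2: generate a large sample and score every artifact.
rng = random.Random(0)
sample = [generate_artifact(rng) for _ in range(5000)]
scores = [(length_in_chars(a), unique_word_ratio(a)) for a in sample]

# Step 3: bin one metric pair into a 2D histogram (normally rendered as an
# image, with one such plot per pair of metrics).
def histogram2d(points, bins=10):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_lo, x_span = min(xs), max(xs) - min(xs)
    y_lo, y_span = min(ys), max(ys) - min(ys)
    grid = [[0] * bins for _ in range(bins)]
    for x, y in points:
        col = min(int((x - x_lo) / x_span * bins), bins - 1)
        row = min(int((y - y_lo) / y_span * bins), bins - 1)
        grid[row][col] += 1
    return grid

grid = histogram2d(scores)
# Step 4 would regenerate `grid` under different model parameters and
# visually compare the resulting histograms.
```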
      <p>Although ERA has been integrated into tools for
human creators of generative models [15], extended in
various ways [13, 16], and applied to domains as
wide-ranging as emergent narrative [17] and road network
generation [18], it has several important limitations. In
particular, conventional ERA is data-hungry and poorly
suited to the evaluation of small numbers of artifacts,
which has prevented its application to creative contexts
in which artifacts are individually time-consuming or
costly to generate [13]—as is often the case when human
users are involved in the creative process.</p>
      <p>However, the ideas captured by ERA remain important
to the evaluation of tools for human-AI creative
collaboration. Among nine potential pitfalls for co-creative
systems discussed by Buschek et al. [19], at least five
(“Invisible AI boundaries”, “Lack of expressive interaction”,
“Agony of choice”, “Time waster”, and “AI bias”) can be
viewed as stemming from either an insufficiently wide
expressive range; an expressive range that does not overlap
well with user desires; or a flawed user interface for
accessing the available expressive range. Evaluations based
exclusively on self-reported subjective user experience
can produce misleading results [20], leading some to
suggest that inspection of co-created artifacts is also needed
to arrive at a holistic picture of a co-creative system’s
success or failure [21, 22, 23, 24]—but even these hybrid
evaluations cannot clearly diagnose whether a MICI’s
weaknesses are due to the underlying generative model
or the interface through which the model is accessed.
And some studies of user behavior in MICIs have
suggested that some users are motivated by a drive to explore
the extremes of a MICI’s expressive range [25],
necessitating the comparison of co-created artifacts against
the expressive range to verify these findings. In sum,
these difficulties all point to a common unmet need: an
evaluation method for MICIs that can illuminate the
relationship between individual co-created artifacts and the
MICI’s overall expressive range.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Expressive Range Coverage Analysis</title>
      <p>Expressive range coverage analysis (ERaCA) is a new
evaluation technique for mixed-initiative creative
interfaces (MICIs) in which a human user and a generative
model share creative responsibility for the discovery and
selection of artifacts from a large possibility space.
ERaCA builds on ERA, but also extends the evaluation
process by incorporating the solicitation and examination of
a small number of co-created artifacts (i.e., artifacts made
or discovered by human study participants through their
interaction with the MICI) in the context of the
generative model’s expressive range.</p>
      <p>ERaCA as a process consists of seven steps:
1. Determine appropriate quantitative metrics for the kinds of artifacts that the generative model will produce.
2. Generate a large number of artifacts using the generative model, and evaluate each artifact using the metrics defined in step 1.
3. Visualize the results of evaluation.
4. Solicit the co-creation of a small number of artifacts by human study participants, ideally drawn from among the MICI’s target user base, and evaluate these artifacts using the same metrics that are used to evaluate the purely machine-created ones.
5. Visualize the location of co-created artifacts within the context of the larger possibility space, for instance as a set of scatterplots drawn directly on top of the two-dimensional histograms created in step 3.
6. (Optional) Construct per-user visualizations of the user’s trajectory within the possibility space, using a color gradient to indicate the order in which artifacts were created on the plot. We discuss this visualization approach in greater detail in section 5.4, and an example can be seen in Figure 6.
7. Visually analyze the results to make determinations about users’ coverage of and trajectory within the generative model’s possibility space.
Steps 1-3 of this process are the same as for ERA, while steps 4-7 (which rely on incorporation of co-created artifacts into the evaluation process) are unique to ERaCA.</p>
    </sec>
    <sec id="sec-4a">
      <title>4. Pilot Study Procedure</title>
      <p>In preparation for a larger-scale user study to be conducted in the future, we ran a small-scale pilot study to test and illustrate our approach. Our pilot study used the ERaCA method to evaluate the mixed-initiative erasure poetry creation tool Redactionist on the basis of artifacts created by four participants (all coauthors of this paper) from a single fixed paragraph of source text.</p>
      <sec id="sec-4a-1">
        <title>4.1. Redactionist</title>
        <p>Redactionist [26], previously known as Blackout [27], is a browser-based casual-creator [28] MICI that helps users create English-language erasure poetry by interactively removing most of the words from a user-provided source text. Once given a source text, Redactionist uses a rules-based generative model (adapted from an earlier model created by Liza Daly [29]) to generate a large number of potential erasure poems that could be created from the text. Then it provides the user with an interface for navigating this space of potential poems by toggling whether specific words should be present in the final poem. A screenshot of Redactionist’s interface, showing a half-constructed poem, can be seen in Figure 1.</p>
        <p>Given a source text, Redactionist’s rules look for poems that take the form of several short and grammatically correct declarative sentences—one sentence per paragraph of input text. For instance, one of Redactionist’s rules—the grammatical pattern ARTICLE NOUN VERB ARTICLE ADJECTIVE NOUN—would find and match sequences of words such as “the poem conceals an elusive metaphor” within a paragraph of source text, with any other words in the source text paragraph being erased. The words in each matched sentence might be separated by any number of other words, as long as they occur in the correct sequence within a single paragraph of the source text. The version of Redactionist used here contains 136 rules, each of which matches sentences of a particular form.</p>
      </sec>
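The in-order subsequence matching that this kind of rule performs can be sketched as follows. This is a hypothetical illustration, not Redactionist's actual implementation: the tagger output, the greedy first-match strategy, and all names here are assumptions (a full implementation would likely enumerate every match, since Redactionist generates many candidate poems per rule).

```python
# Hypothetical sketch of matching one grammatical rule as an in-order
# subsequence of a part-of-speech-tagged paragraph; every word not in the
# match would be erased.

def match_rule(tagged_words, pattern):
    """Return source-text indices of the first greedy in-order match of
    `pattern`, or None if the paragraph contains no match."""
    kept = []
    for i, (_, tag) in enumerate(tagged_words):
        if len(kept) == len(pattern):
            break
        if tag == pattern[len(kept)]:
            kept.append(i)
    return kept if len(kept) == len(pattern) else None

tagged = [("the", "ARTICLE"), ("weary", "ADJECTIVE"), ("poem", "NOUN"),
          ("quietly", "ADVERB"), ("conceals", "VERB"), ("an", "ARTICLE"),
          ("elusive", "ADJECTIVE"), ("metaphor", "NOUN")]
rule = ["ARTICLE", "NOUN", "VERB", "ARTICLE", "ADJECTIVE", "NOUN"]
kept = match_rule(tagged, rule)
poem = " ".join(tagged[i][0] for i in kept)
# poem == "the poem conceals an elusive metaphor"
```

Note how "weary" and "quietly" are skipped: matched words need only occur in the correct order, not adjacently, which is what lets a single rule match many different erasures of the same paragraph.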
      <sec id="sec-4-1">
        <title>4.2. Data Collection</title>
        <p>Due to logistical constraints (described further in section 6.1), the four coauthors of this paper served as our pilot study participants. Each participant was instructed to use Redactionist with a fixed source text (a one-paragraph excerpt from a transcript of a talk by Allison Parrish on computational poetry [30]) to create a sequence of ten short erasure poems. To ensure that participants were not composing their poems with a particular metric or evaluation criterion in mind, we avoided deciding what metrics would be used to evaluate the poems until after the data had been collected, and we did not confer with one another about our aesthetic intentions for the poems we had made.</p>
        <p>In addition to these 40 co-created poems, we also gathered and analyzed the complete set of 57,195 potential poems that the Redactionist generative model considers to be possible erasures of the fixed input text. This larger set of poems, which we call the “full poemspace”, forms the backdrop for our analysis: by comparing the 40 co-created poems to the full poemspace, we can identify the co-created poems as typical or atypical in various ways and analyze the extent to which the co-created poems cover (or fail to cover) the full poemspace. For some generative models, it may be easier to instead establish a backdrop set of artifacts by uniformly sampling many (but not all) possible artifacts for the given user input; the details of this sampling vary depending on how the generative model is implemented.</p>
      </sec>
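The sampling fallback mentioned above might look like the following sketch. The sampler is a hypothetical stand-in (here, poems are random index subsets of a fixed size); for a real model, `sample_poem` would need to draw uniformly from the model's actual possibility space, which is the implementation-specific part.

```python
# Sketch of building a backdrop set by sampling rather than enumeration,
# for models whose full possibility space is too large to enumerate.
import random

def sample_backdrop(sample_one, n, seed=0):
    """Draw `n` artifacts independently to stand in for the full space."""
    rng = random.Random(seed)
    return [sample_one(rng) for _ in range(n)]

# Hypothetical stand-in sampler: keep `keep` word indexes out of a
# `source_len`-word source text, chosen uniformly at random.
def sample_poem(rng, source_len=40, keep=6):
    return sorted(rng.sample(range(source_len), keep))

backdrop = sample_backdrop(sample_poem, 1000)
```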
      <sec id="sec-4-2">
        <title>4.3. Artifact Evaluation Metrics</title>
        <p>ERaCA, like ERA, uses several domain-specific
quantitative metrics to characterize each of the artifacts produced
by a generative model or creative collaboration. Erasure
poetry is an unusual form of poetry that has not been
investigated much in the scholarly literature [31, 32, 33],
and the short, single-declarative-sentence poems
produced by Redactionist given a single paragraph of input
text do not contain many of the features (such as end
rhyme) that are most widely studied in the analysis of
poetry. Consequently, rather than drawing directly on
metrics that have been defined for more conventional
forms of poetry [34], we instead defined several
preliminary but easy-to-implement metrics of our own that
attempt to capture key aesthetic features of erasure
poetry as a form. These metrics include:</p>
        <p>Average word position within the source text.
Erasure poems are characterized partly by the visual spacing
of the non-erased words within the source text. Since
Redactionist represents poems internally as a set of
numerical indexes into the source text pointing to the
user-selected words, averaging these indexes together can
give a simple approximation of whether a poem mostly
contains words taken from near the start, middle, or end
of the source text.</p>
        <p>Distance between the poem’s first and last words
within the source text. This metric can be used to
differentiate poems that draw exclusively from one narrow
region within the source text from poems that draw from a
larger span. It is especially useful when applied alongside
the previous metric to identify where in the source text
the user focused their attention when selecting words to
retain.</p>
        <p>Poem length in characters. This metric counts the
total number of characters in the selected words that
comprise the poem. Many erasure poems attempt to
visually overwhelm the reader with the sheer amount of
text that is erased [33]; counting non-erased characters
relative to a fixed source text length works as a loose
proxy for the proportion of the source text that is erased.</p>
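Under the representation described above (a poem as a set of numerical indexes into the tokenized source text), these three positional metrics each reduce to a line or two. The sample source text and poem below are hypothetical.

```python
# Sketches of the three positional metrics, assuming a poem is a set of
# word indexes into the tokenized source text (as described in the paper).

source_words = "the weary poem quietly conceals an elusive metaphor".split()

def avg_word_position(poem_indexes):
    """Roughly where in the source text the poem's words sit."""
    return sum(poem_indexes) / len(poem_indexes)

def dist_between_first_and_last_words(poem_indexes):
    """Span of source text the poem draws from."""
    return max(poem_indexes) - min(poem_indexes)

def poem_length_in_chars(poem_indexes, words):
    """Total characters retained; a proxy for how much text is erased."""
    return sum(len(words[i]) for i in poem_indexes)

poem = {0, 2, 4, 5, 6, 7}  # "the poem conceals an elusive metaphor"
avg_word_position(poem)                   # 4.0
dist_between_first_and_last_words(poem)   # 7
poem_length_in_chars(poem, source_words)  # 32
```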
        <p>Average English-corpus word frequency of the
words selected for inclusion in the poem. This metric
attempts to quantify how unusual a given poem’s word
choices are in the context of the English language as a
whole, under the logic that retained words in erasure
poems are often chosen with the intent to surprise the
reader. For English word frequency data, we used the
SUBTLEX-US dataset of film and television subtitles [36]—
specifically the word frequency per 1,000,000 words
measure (SUBTLWF), as given by the file that contains word
frequency data for all 74,286 distinct words that appear
within the dataset.</p>
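Given a lookup table of per-million-word frequencies, this metric is a simple mean. The frequency values below are made up for illustration (the real values would come from the SUBTLWF column of the dataset), and the zero default for out-of-vocabulary words is an assumption.

```python
# Sketch of the English-corpus word frequency metric; the table values are
# illustrative placeholders, not real SUBTLWF numbers.
SUBTLWF = {"the": 50000.0, "poem": 7.5, "conceals": 0.25,
           "an": 1300.0, "elusive": 1.25, "metaphor": 2.0}

def avg_english_word_freq(poem_words, freq_table, default=0.0):
    """Mean corpus frequency of the poem's words; lower values mean the
    poem's word choices are more unusual for English overall."""
    total = sum(freq_table.get(w.lower(), default) for w in poem_words)
    return total / len(poem_words)

avg_english_word_freq(["elusive", "metaphor"], SUBTLWF)  # 1.625
```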
        <p>Average within-poemspace word frequency of the
words selected for inclusion in the poem. This metric
attempts to quantify how unusual a given poem’s word
choices are in the context of the complete poemspace, with
each word’s frequency determined by counting how often
it appears in the complete set of poems that the
generative model is able to create from this source text. Because
the meaning of an erasure poem is partly defined in
relation to the meaning of its source text [31], including the
alternative erasures of the same source text that might
have been performed, it makes sense to consider the
individual poem’s relationship to the full poemspace as a
potential aesthetic measure.</p>
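This metric can be sketched as below. One reading is assumed here: each word's frequency is the number of poems in the poemspace containing it (counting each poem once), and a three-poem toy poemspace stands in for Redactionist's 57,195 poems.

```python
# Sketch of the within-poemspace word frequency metric, using a toy
# poemspace; word frequency is taken as document frequency (one count per
# poem containing the word), which is an assumption.
from collections import Counter

poemspace = [["the", "poem", "conceals"],
             ["the", "elusive", "metaphor"],
             ["the", "poem", "sings"]]

word_counts = Counter(w for poem in poemspace for w in set(poem))

def avg_poemspace_word_freq(poem_words):
    """Mean poemspace frequency of the poem's words; lower values mean
    word choices the model itself would rarely make."""
    return sum(word_counts[w] for w in poem_words) / len(poem_words)

avg_poemspace_word_freq(["the", "conceals"])  # (3 + 1) / 2 = 2.0
```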
        <p>Average word pair probability within the poemspace
across all word pairs in the poem. The probability of a
word pair ⟨a, b⟩ is the probability that, given word a is
present in a poem, word b is also present within that
same poem. Like the word frequency metrics, this
metric attempts to capture the surprising quality of word
choices in many human-created erasure poems; here, it
is particularly useful for identifying poems that contain
pairs of words that the generative model would not often
use together when unguided by a human user.</p>
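The conditional probability behind this metric, estimated over the full poemspace, can be sketched as follows. Treating the pairs as ordered and averaging over all ordered pairs is an assumption; the same toy poemspace as above stands in for the real one.

```python
# Sketch of the word pair probability metric over a toy poemspace:
# pair_probability(a, b) estimates P(b in poem | a in poem).
from itertools import permutations

poemspace = [["the", "poem", "conceals"],
             ["the", "elusive", "metaphor"],
             ["the", "poem", "sings"]]

def pair_probability(a, b):
    with_a = [p for p in poemspace if a in p]
    return sum(1 for p in with_a if b in p) / len(with_a)

def avg_word_pair_probability(poem_words):
    """Mean pair probability over all ordered word pairs in the poem;
    low values flag pairings the model rarely produces on its own."""
    pairs = list(permutations(poem_words, 2))
    return sum(pair_probability(a, b) for a, b in pairs) / len(pairs)

pair_probability("the", "poem")  # 2/3: "poem" appears in 2 of 3 poems with "the"
result = avg_word_pair_probability(["the", "poem"])
```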
<p>Letter repetition score. This metric counts all of the
unique letters in a poem and divides this count by the
total number of letters in the poem. Poems receive a low
score if they reuse the same letter many times, and a
high score if they reuse letters infrequently. This score
is intended as a loose proxy for sound reuse, an aesthetic
quality of poems related to how similar the words in the
poem sound to one another when pronounced. Sound
devices [34] such as assonance, consonance, alliteration,
and rhyme are all varieties of sound reuse. Low letter
repetition scores may indicate intentional selection of
words that sound similar to one another, while very high
letter repetition scores may indicate intentional selection
of words that phonetically clash.</p>
        <p>We also defined minimum and maximum variants of
each metric that reports an average value—for instance,
metrics that report the probability score of the most and
least likely word pairs in each poem, to accompany the
metric that reports the average probability of all of a
poem’s word pairs. However, for reasons of space, we do
not report results related to these metrics here.</p>
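The letter repetition score is the simplest of the metrics, and can be sketched directly from its definition; the sample word lists are hypothetical.

```python
# Sketch of the letter repetition score: unique letters divided by total
# letters, so heavy letter reuse (a loose proxy for sound reuse) yields a
# low score and varied lettering yields a high one.

def letter_repetition_score(poem_words):
    letters = [ch for word in poem_words for ch in word.lower() if ch.isalpha()]
    return len(set(letters)) / len(letters)

letter_repetition_score(["sad", "sands", "stand"])  # heavy reuse, low score
letter_repetition_score(["vexing", "jackdaw"])      # little reuse, high score
```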
      </sec>
      <sec id="sec-4-3">
        <title>4.4. Data and Code Availability</title>
          <p>All data for this study (including the participant-created poems and the full poemspace), as well as the code that we used to run the analysis and generate our visualizations, is available online: https://github.com/mkremins/redactionist-eraca.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>Examination of the visualizations we created allows us to characterize Redactionist’s effects on users in terms of the artifacts they tend to create. Below, we briefly discuss some of the key findings from our pilot study.</p>
      <sec id="sec-5-1">
        <sec id="sec-5-1-1">
          <title>5.1. Users collectively explore most of the model’s expressive range</title>
          <p>At a high level, inspection of the metric pair
visualizations in the corner plot (Figure 2) shows that the four
participants collectively created artifacts that cover the
generative model’s expressive range well. Although the
densest clusters of co-created artifacts within the
possibility space mostly do not align with the densest clusters
of possible machine-generated artifacts, the placement of
co-created artifacts across the possibility space suggests
that users are capable of creating poems that occupy any
point within the generative model’s expressive range as
defined by these metrics. This provides evidence that the
Redactionist interface is successful at exposing the full
possibility space of the underlying generative model to its
users: no regions of the possibility space are inaccessible
to users due to interface limitations.</p>
          <p>A particularly good example of expressive range
coverage can be seen in the visualization of the
poemLengthInChars and avgEnglishWordFreq metric pair
(Figure 3). Although co-created poems largely fall outside of
the densest parts of the possibility space, and although
some co-created poems stand out as extreme outliers
relative to the possibility space as a whole, the overall
distribution of co-created artifacts shows that users can
access the entirety of the possibility space.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>5.2. Co-created artifacts are disproportionately unusual</title>
<p>Further inspection of the corner plot (Figure 2) shows that
co-created artifacts rarely occupy the densest parts of the
generative model’s expressive range, and that they are
unusually likely to be outliers in comparison to most
possible model-created poems. This is backed up by closer
examination of individual metric pairs: for instance, Figure 4
shows that co-created artifacts are much more likely
than model-created artifacts to contain unusual individual
words and word pairs (from the model’s perspective).
This may suggest that the generative model’s expressive
range contains many poems that human users would tend
to reject as unsuitable, leading to a focusing of human
attention on poems that are considered to be outliers.</p>
        </sec>
        <sec id="sec-5-1-3">
          <title>5.4. Redactionist tends to promote convergent thinking over divergent</title>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>One question that it would be useful to answer about</title>
        <p>MICIs involves the tendency of the MICI’s design to
pro5.3. Diferent users explore diferent mote divergent or convergent thinking within a single
portions of the expressive range user: do users tend to jump around between very
diferent regions of the possibility space, or do users tend to
We can also see from the corner plot (Figure 2) that difer- select a single region of the possibility space and then
ent users tend to explore diferent portions of the expres- “mine it out” by creating several artifacts all drawn from
sive range. Each participant’s co-created artifacts tend that same region? This question can be answered to some
to cluster together, allowing for the visual determination extent with a standard scatterplot overlay, but coloring
of each participant’s “style” in terms of the metrics we the points representing a single user’s artifacts in the
defined. Figure 5 shows this especially well: the visual order that these artifacts were created (according to a
clustering of poems created by each participant is highly color gradient) can further enable us to discern whether
evident here, suggesting that each participant tended to artifacts drawn from a particular region of the possibility
behave diferently when deciding where in the text they space were created contiguously or noncontiguously. We
should select words from. Participants P1, P3, and P4 call these augmented scatterplots “trajectory
visualizaall tended to pick a relatively narrow “window” within tions”, because they attempt to illuminate a single user’s
the source text and construct poems from several close- trajectory through the possibility space over time; an
together words, but P4 tended to draw from near the example trajectory visualization can be seen in Figure 6.
start of the source text; P1 tended to draw from near the Side-by-side per-user trajectory visualizations for the
end; and P3 moved throughout the source text while still avgWordPosition and
distBetweenFirstAndLastselecting mostly close-together words for each individ- Words metrics (Figure 7) shows that Redactionist users
ual poem. Meanwhile, participant P2 tended to create tend to converge on a specific approach to selecting
poems that drew words from all throughout the source words from the source text for inclusion in poems,
estext, resulting in unusually high distBetweenFirst- sentially choosing a “home region” within the source
AndLastWords scores relative to the other participants. text that they repeatedly revisit for multiple poems over
the course of a single session. Specifically, by examining
the order in which poems were created alongside their
positioning within the expressive range, we can see that
all four participants created at least three poems that
fall within a visually distinct region of the expressive
range from a source text location perspective; that two 5.5. Users experiment with highly
of these participants (P2 and P4) created an even larger unusual word choices before
number of poems sampled largely from similar locations regressing to the mean
within the source text; and that these poems were not
created in immediate sequence with one another, indi- We hypothesized that, as users are exposed to more of
cating that the user’s preference for a particular “home the generative model’s choices and explore a wider
varilocation” endures over the course of a session rather than ety of the words available to them, they might be driven
disappearing after a few successive poems are sampled toward selecting more unusual words over time—both
from the same region. from the perspective of the Redactionist poemspace (i.e.,</p>
        <p>The tendency of Redactionist users to work conver- avoiding words that tend to be used very frequently in
gently may be partly attributable to interface design. In generated poems) and from the perspective of the
EnRedactionist, once you have locked in a large number glish language as a whole (i.e., preferring words that
of words to finish a poem, it is easier to change only a occur less frequently in a corpus of general English
few of these selections than to change a large number of them at once. Additionally, the actual word attached to a span of selectable text is not made visible to users until they hover over this span. Consequently, users often take small, incremental steps within the possibility space and less frequently make the large jumps needed to switch from one region of the space to another—and even when they do make larger jumps, they tend to anchor their jumps on potentially selectable words that they had used in poems previously. Insofar as these behaviors are attributable to the user’s inadvertent fixation on a narrow region of the expressive range rather than intentional commitment to certain design choices [37], this analysis suggests the possibility of user interface features that deliberately encourage users to work divergently: for instance, an option to randomly select a new set of words containing none of the words that are currently selected, or a process that randomly highlights a nearby selectable word that a user has not yet used in any poems.</p>
        <p>Examination of trajectory visualizations for the avgPoemspaceWordFreq and avgEnglishWordFreq metric pair, however, does not show this expected trend—see Figure 8. Instead, we observe that all four participants at some point during their session experimented with the selection of highly unlikely words, but that no participant remained consistently focused on the selection of highly unlikely words afterward.</p>
        <p>In particular, in the bottom left-hand corner of their respective trajectory visualizations, we can see that three of four participants (P2, P3 and P4) all discovered a region of poemspace in which the poems contain words that are highly unlikely from both a poemspace word frequency and English word frequency perspective. Each of these participants created two poems within this region of poemspace; for P2 and P4 one of these poems was created shortly after the other, while for P3 these poems were separated in time by several others. However, none of these participants’ penultimate or final poems fall within this region, suggesting that none of these participants were primarily attempting to optimize for surprising word choice over the course of their session.</p>
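<p>The avgPoemspaceWordFreq and avgEnglishWordFreq metrics discussed above can be made concrete with a small sketch. The following is not the implementation used in this study; the function name, the add-one smoothing, and the toy frequency table are illustrative assumptions, and in practice the English-frequency counts would come from a corpus-derived norm such as [36].</p>

```python
import math

def avg_log_word_freq(poem_words, freq_counts, total_count):
    """Average log-probability of a poem's words under a unigram
    frequency table; higher (less negative) means commoner words.
    Unseen words are handled with simple add-one smoothing."""
    log_probs = [
        math.log((freq_counts.get(w.lower(), 0) + 1) /
                 (total_count + len(freq_counts)))
        for w in poem_words
    ]
    return sum(log_probs) / len(log_probs)

# Toy frequency table (illustrative counts, not real corpus norms).
counts = {"the": 1000, "of": 700, "night": 40, "sea": 30, "umbral": 1}
total = sum(counts.values())

common = avg_log_word_freq(["the", "of", "sea"], counts, total)
rare = avg_log_word_freq(["umbral", "night", "sea"], counts, total)
assert common > rare  # a poem of commoner words scores higher
```

<p>Plotting one such score against another for every sampled artifact yields the kind of two-dimensional metric pair whose trajectory visualizations are examined here.</p>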
        <p>This may be an instance of the curiosity-driven
behavior previously observed in some MICI users [25]:
deliberate probing of the MICI in an efort to discover the edges
of the possibility space. This explanation may also help
to explain why P4’s final poem in particular is visibly an
extreme outlier on the avgPoemspaceWordFreq
metric, containing much more common English words on
average than any other co-created poem: all of the
participants were driven by curiosity to some extent, but P4
was especially successful in probing the extreme corners
of the possibility space.</p>
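<p>The notion of an “extreme outlier” used above can be operationalized in several ways; the z-score rule sketched below is our own illustrative choice rather than an analysis step from this study, and the metric scores in the example are hypothetical.</p>

```python
def flag_outliers(values, threshold=2.0):
    """Return indices of values whose population z-score magnitude
    exceeds the threshold; assumes the values are not all identical."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [i for i, v in enumerate(values) if abs(v - mean) > threshold * std]

# Hypothetical avgPoemspaceWordFreq scores for ten poems; the last
# score sits far from the rest, standing in for a poem like P4's final one.
scores = [0.31, 0.29, 0.33, 0.30, 0.28, 0.32, 0.31, 0.30, 0.29, 0.62]
assert flag_outliers(scores) == [9]
```

<p>Flagging points this way could complement visual inspection of the plots when deciding which co-created artifacts merit closer qualitative examination.</p>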
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Limitations</title>
      <sec id="sec-6-1">
        <title>6.1. Pilot Study Limitations</title>
        <p>Our four participants for the ERaCA pilot study presented here were all members of this paper’s authorship team. We took this unusual approach because obtaining IRB approval for collection of user data at scale was not possible prior to the workshop submission deadline, due partly to the late-breaking nature of this work and partly to ongoing pandemic-related IRB reviewing backlogs. The small number of participants limits generalizability of the study’s results, and there was obviously an incentive for authors to try to “behave interestingly” while using the MICI so that publishable results would emerge. We tried to mitigate this potential source of bias (in particular by avoiding selection of poem evaluation metrics until after the data collection was complete), but this attempt at establishing a firewall between data collection and analysis is clearly imperfect. In the near future, we plan to run a larger user study (with a larger number of non-coauthor participants) to validate and expand on our findings. In the meantime, however, because the primary goal of this paper is to introduce the idea of expressive range coverage analysis and present a minimal case study of its application, we believe that our pilot study results are sufficient to illustrate the methodology.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Visualization Limitations</title>
        <p>The visualizations that we presented here use only color to indicate which user created each artifact (in multiuser visualizations) and the order in which artifacts were created (in single-user trajectory visualizations). This limits the accessibility of these visualizations to users who have difficulty perceiving color [38]. Future work should explore the use of shape, pattern, or another redundant visual channel alongside color in the co-created artifacts visualization layer. Particularly for trajectory visualizations, we suspect there may be value in shaping each data point as a small arrowhead pointing in the direction of the next data point in sequence, so that the order in which a user created their artifacts can be visually analyzed more easily.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Limitations of ERaCA as a Method</title>
        <p>Like ERA, ERaCA is a qualitative and visual evaluation technique. It is not capable of producing a single summary value that tells you how good a MICI is—but it does illuminate the MICI’s influence on users and co-created artifacts in useful ways, especially when the information that ERaCA provides is considered in terms of the MICI’s overall goals. It may be the case that ERaCA is best employed alongside other user-centered evaluation methods, such as the think-aloud method [39] and interviews [20], to provide an additional channel of information. For instance, there may be potential value in showing ERaCA plots to study participants in a debriefing interview after a conventional user study session, using the plots as prompts or visual aids to elicit remarks or insights from participants about specific aspects of their experience.</p>
        <p>Also like ERA, ERaCA relies on domain-specific artifact evaluation metrics to characterize artifacts in a particular creative domain. A few standard metrics [40] are widely used to evaluate 2D platformer game levels, and metrics for several other domains [18, 41, 17] have also been defined. However, there are many domains for which appropriate metrics have not yet been developed, necessitating additional work before ERaCA can be applied to these domains.</p>
        <p>Finally, ERaCA can only be applied to MICIs where the underlying generative model is capable of producing complete artifacts without human input. Fortunately, many recently developed MICIs for a wide variety of creative domains—including sketching [42], creature design [43], prose-level creative writing [44, 45], plot-level storytelling [46], poetry [47], instrumental music [48], songwriting [49], game design [50, 51], and level design [52]—follow this architectural pattern. However, ERaCA may not be as readily applicable to the evaluation of MICIs for domains such as physical crafts, in which the generative models employed by MICIs often cannot produce complete artifacts on their own due to the need for human involvement in the physicalization of generated designs [53, 54].</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>Expressive range coverage analysis (ERaCA) is a potentially powerful new methodology for the evaluation of mixed-initiative creative interfaces (MICIs). However, it still needs to be evaluated at a greater scale; visually polished to improve visualization legibility; integrated with other approaches to MICI evaluation, including conventional user studies; and extended to many new creative domains. We are excited to undertake many of these efforts in the future and intend to adopt ERaCA in the evaluation of our own co-creative systems going forward.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This paper was partly inspired by Gillian Smith’s questions about evaluation during Max Kreminski’s advancement to candidacy. We hope that ERaCA represents a step toward a method of evaluating co-creative systems that better reflects what we value about co-creativity.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[1] S. Deterding, J. Hook, R. Fiebrink, M. Gillies, J. Gow, M. Akten, G. Smith, A. Liapis, K. Compton, Mixed-initiative creative interfaces, in: Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, 2017, pp. 628–635.</p>
      <p>[2] A. Liapis, G. N. Yannakakis, C. Alexopoulos, P. Lopes, Can computers foster human users’ creativity? Theory and praxis of mixed-initiative co-creativity, Digital Culture &amp; Education (DCE) 8 (2016) 136–152.</p>
      <p>[3] B. Shneiderman, Creativity support tools: Accelerating discovery and innovation, Communications of the ACM 50 (2007) 20–32.</p>
      <p>[4] A. Summerville, S. Snodgrass, M. Guzdial, C. Holmgård, A. K. Hoover, A. Isaksen, A. Nealen, J. Togelius, Procedural content generation via machine learning (PCGML), IEEE Transactions on Games 10 (2018) 257–270.</p>
      <p>[5] A. M. Smith, M. Mateas, Answer set programming for procedural content generation: A design space approach, IEEE Transactions on Computational Intelligence and AI in Games 3 (2011) 187–200.</p>
      <p>[6] P. Karimi, K. Grace, M. L. Maher, N. Davis, Evaluating creativity in computational co-creative systems, in: Proceedings of the 9th International Conference on Computational Creativity, 2018, pp. 104–111.</p>
      <p>[7] E. A. Carroll, C. Latulipe, R. Fung, M. Terry, Creativity factor evaluation: towards a standardized survey metric for creativity support, in: Proceedings of the Seventh ACM Conference on Creativity and Cognition, 2009, pp. 127–136.</p>
      <p>[8] C. Lamb, D. G. Brown, C. L. Clarke, Evaluating computational creativity: An interdisciplinary tutorial, ACM Computing Surveys (CSUR) 51 (2018).</p>
      <p>[9] A. Jordanous, Evaluating evaluation: Assessing progress and practices in computational creativity research, in: Computational Creativity, Springer, 2019, pp. 211–236.</p>
      <p>[10] S. Colton, G. A. Wiggins, Computational creativity: The final frontier?, in: ECAI 2012 - 20th European Conference on Artificial Intelligence, IOS Press, 2012, pp. 21–26.</p>
      <p>[11] G. Smith, J. Whitehead, Analyzing the expressive range of a level generator, in: Proceedings of the 2010 Workshop on Procedural Content Generation in Games, 2010.</p>
      <p>[12] M. Kreminski, M. Mateas, Toward narrative instruments, in: International Conference on Interactive Digital Storytelling, Springer, 2021, pp. 499–508.</p>
      <p>[13] A. Summerville, Expanding expressive range: Evaluation methodologies for procedural content generation, in: Fourteenth Artificial Intelligence and Interactive Digital Entertainment Conference, 2018.</p>
      <p>[14] G. Smith, J. Whitehead, M. Mateas, Tanagra: Reactive planning and constraint solving for mixed-initiative level design, IEEE Transactions on Computational Intelligence and AI in Games 3 (2011) 201–215.</p>
      <p>[15] M. Cook, J. Gow, G. Smith, S. Colton, Danesh: Interactive tools for understanding procedural content generators, IEEE Transactions on Games (2021).</p>
      <p>[16] S. Snodgrass, A. Summerville, S. Ontañón, Studying the effects of training data on machine learning-based procedural content generation, in: Thirteenth Artificial Intelligence and Interactive Digital Entertainment Conference, 2017.</p>
      <p>[17] Q. Kybartas, C. Verbrugge, J. Lessard, Tension space analysis for emergent narrative, IEEE Transactions on Games 13 (2020) 146–159.</p>
      <p>[18] E. Teng, R. Bidarra, A semantic approach to patch-based procedural generation of urban road networks, in: Proceedings of the 12th International Conference on the Foundations of Digital Games, 2017.</p>
      <p>[19] D. Buschek, L. Mecke, F. Lehmann, H. Dang, Nine potential pitfalls when designing human-AI co-creative systems, in: Joint Proceedings of the ACM IUI 2021 Workshops, 2021.</p>
      <p>[20] A. Adams, P. Lunt, P. Cairns, A qualitative approach to HCI research, in: Research Methods for Human-Computer Interaction, Cambridge University Press, 2008, pp. 138–157.</p>
      <p>[21] A. Kantosalo, Human-Computer Co-Creativity: Designing, Evaluating and Modelling Computational Collaborators for Poetry Writing, Ph.D. thesis, University of Helsinki, 2019.</p>
      <p>[22] J. Kim, M. L. Maher, S. Siddiqui, Studying the impact of AI-based inspiration on human ideation in a co-creative design system, in: Joint Proceedings of the ACM IUI 2021 Workshops, 2021.</p>
      <p>[23] M. Kreminski, B. Samuel, E. Melcer, N. Wardrip-Fruin, Evaluating AI-based games through retellings, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 15, 2019, pp. 45–51.</p>
      <p>[24] M. P. Eladhari, Re-tellings: the fourth layer of narrative as an instrument for critique, in: International Conference on Interactive Digital Storytelling, Springer, 2018, pp. 65–78.</p>
      <p>[25] M. J. Nelson, S. E. Gaudl, S. Colton, S. Deterding, Curious users of casual creators, in: Proceedings of the 13th International Conference on the Foundations of Digital Games, 2018.</p>
      <p>[26] M. Kreminski, M. Mateas, Reflective creators, in: International Conference on Computational Creativity, 2021.</p>
      <p>[27] M. Kreminski, I. Karth, N. Wardrip-Fruin, Generators that read, in: Proceedings of the 14th International Conference on the Foundations of Digital Games, 2019.</p>
      <p>[28] K. Compton, M. Mateas, Casual creators, in: International Conference on Computational Creativity, 2015, pp. 228–235.</p>
      <p>[29] L. Daly, The days left forebodings and water, https://lizadaly.com/pages/blackout, 2016.</p>
      <p>[30] A. Parrish, Exploring (semantic) space with (literal) robots, http://opentranscripts.org/transcript/semantic-space-literal-robots, 2015.</p>
      <p>[31] T. Macdonald, A brief history of erasure poetics, Jacket Magazine 38 (2009).</p>
      <p>[32] B. McHale, Poetry under erasure, in: Theory into Poetry: New Approaches to the Lyric, Rodopi Amsterdam, 2005, pp. 277–301.</p>
      <p>[33] B. C. Cooney, “Nothing is left out”: Kenneth Goldsmith’s Sports and erasure poetry, jml: Journal of Modern Literature 37 (2014) 16–33.</p>
      <p>[34] J. Kao, D. Jurafsky, A computational analysis of style, affect, and imagery in contemporary poetry, in: Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature, 2012, pp. 8–17.</p>
      <p>[35] D. Foreman-Mackey, corner.py: Scatterplot matrices in Python, The Journal of Open Source Software 1 (2016) 24.</p>
      <p>[36] M. Brysbaert, B. New, Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English, Behavior Research Methods 41 (2009) 977–990.</p>
      <p>[37] J. S. Gero, Fixation and commitment while designing and its measurement, The Journal of Creative Behavior 45 (2011) 108–115.</p>
      <p>[38] W3C Web Content Accessibility Guidelines Working Group, Use of color: Understanding SC 1.4.1, https://www.w3.org/TR/UNDERSTANDING-WCAG20/visual-audio-contrast-without-color.html, 2016.</p>
      <p>[39] K. A. Ericsson, H. A. Simon, Protocol Analysis: Verbal Reports as Data, MIT Press, 1984.</p>
      <p>[40] A. Canossa, G. Smith, Towards a procedural evaluation technique: Metrics for level design, in: The 10th International Conference on the Foundations of Digital Games, 2015.</p>
      <p>[41] A. Liapis, G. N. Yannakakis, J. Togelius, Sentient Sketchbook: computer-assisted game level authoring, in: Proceedings of the 8th International Conference on the Foundations of Digital Games, 2013.</p>
      <p>[42] J. E. Fan, M. Dinculescu, D. Ha, collabdraw: an environment for collaborative sketching with an artificial agent, in: Proceedings of the 2019 Conference on Creativity and Cognition, 2019, pp. 556–561.</p>
      <p>[43] Z. Epstein, O. Boulais, S. Gordon, M. Groh, Interpolating GANs to scaffold autotelic creativity, in: Joint Workshops of the International Conference on Computational Creativity, 2020.</p>
      <p>[44] A. Calderwood, V. Qiu, K. I. Gero, L. B. Chilton, How novelists use generative language models: An exploratory user study, in: Joint Proceedings of the Workshops on Human-AI Co-Creation with Generative Models and User-Aware Conversational Agents, 2020.</p>
      <p>[45] M. Roemmele, A. S. Gordon, Automated assistance for creative writing with an RNN language model, in: Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion, 2018.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[46] M. Kreminski, M. Dickinson, M. Mateas, N. Wardrip-Fruin, …, 2020.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[47] H. G. Oliveira, T. Mendes, A. Boavida, A. Nakamura, … generation, Cognitive Systems Research 54 (2019) 199–216.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[48] R. Louie, A. Coenen, C. Z. Huang, M. Terry, C. J. …, … of the 2020 CHI Conference on Human Factors in Computing Systems, 2020.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[49] M. Ackerman, D. Loker, Algorithmic songwriting …, 2017.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[50] M. Nelson, S. Colton, E. Powley, S. Gaudl, P. Ivey, … design, in: Proceedings of the CHI'17 Workshop on Mixed-Initiative Creative Interfaces, 2017.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[51] M. Kreminski, M. Dickinson, J. Osborn, A. Summerville, … Entertainment, volume 16, 2020, pp. 102–108.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[52] M. Guzdial, N. Liao, J. Chen, S.-Y. Chen, S. Shah, … Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[53] L. Albaugh, S. E. Hudson, L. Yao, L. Devendorf, In-… Designing Interactive Systems, 2020, pp. 1033–1046.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[54] A. Sullivan, Embroidered Ephemera: Crafting qual-… Conference on Computational Creativity, 2020.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>