<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Mixed-Initiative Creative Interfaces via Expressive Range Coverage Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Max Kreminski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isaac Karth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Mateas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Noah Wardrip-Fruin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of California</institution>
          ,
          <addr-line>Santa Cruz, 1156 High St, Santa Cruz, CA 95064</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We introduce expressive range coverage analysis (ERaCA): a technique for evaluating mixed-initiative creative interfaces (MICIs) in which creative responsibility is shared between a human user and a generative model. ERaCA revolves around the examination of a small number of human-created artifacts in the context of a visualization of the broader expressive range from which these artifacts were sampled. As a pilot study of our approach, we apply ERaCA to the evaluation of Redactionist, a MICI for erasure poetry creation, and find that ERaCA allows us to visually answer questions about how thoroughly users explore the underlying model's expressive range; whether users produce artifacts that are typical or unusual from the underlying model's perspective; whether different users of a single MICI tend to produce similar or different artifacts; whether a MICI tends to promote divergent or convergent thinking; and how a single user's artifacts evolve as they continue to use a MICI over time.</p>
      </abstract>
      <kwd-group>
<kwd>expressive range analysis</kwd>
        <kwd>mixed-initiative co-creativity</kwd>
        <kwd>creativity support tools</kwd>
        <kwd>evaluation methods</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Mixed-initiative creative interfaces [1, 2], or MICIs, are a genre of creativity support tools [3] in which creative responsibility is shared between a human user and an artificially intelligent system. Many MICIs consist of two layers: an underlying generative model that defines a possibility space of artifacts—sometimes learned from a corpus of training data [4], sometimes defined by a set of rules or constraints [5]—and a supervening mechanism for navigating this space to locate artifacts that match a user’s prompt or intent. In MICIs that take this approach, an artifact’s creation is synonymous with its discovery and selection by a user.</p>
      <p>Evaluating the effectiveness of these systems can be difficult, in part because neither the user nor the generative model is solely responsible for the artifacts produced [6]. In particular, a skilled user may be able to coax compelling artifacts from even the most unwieldy MICI, making it difficult to characterize how effectively a MICI supports its users in realizing their creative goals. Additionally, insofar as these tools often lead users to create artifacts that they would not have thought to create before, it is difficult to compare a MICI-plus-user system with the unassisted user in terms of creative capabilities, because the user’s original creative intent can be substantially shaped or modified by their interaction with the tool. As a result, assessments of MICIs often focus on evaluating the subjective perception of creativity support from the user’s perspective [6]. The artifacts that users produce are comparatively rarely evaluated, and even when they are evaluated, discerning what role the MICI played in shaping these artifacts may still be out of reach.</p>
      <p>Though evaluating creativity is difficult in general [7], researchers have developed a number of effective approaches to the evaluation of computationally creative systems [8, 9] in which creative responsibility is attributed primarily or solely to the machine [10]. In particular, a technique known as expressive range analysis (ERA) [11] can be used to characterize the behavior of a generative model by visualizing its possibility space. This makes it easy to visually compare the expressive range of different generative models that produce the same kind of artifact—and to describe a generative model in terms of its grain, or the characteristics of the artifacts that it tends to produce [12].</p>
      <p>However, because ERA relies on the rapid generation and characterization of a very large number of artifacts [13], this method of evaluation cannot straightforwardly be applied to mixed-initiative creative collaborations. When a human user must be involved in the production of every artifact, it becomes prohibitively time-consuming to produce the hundreds or thousands of artifacts that ERA demands. As a result, although ERA is frequently applied to the evaluation of end-to-end computationally creative systems, including the generative models underlying some MICIs [14], its application to understanding the influence of MICI design on user behavior and user experience has remained limited.</p>
      <p>Joint Proceedings of the ACM IUI Workshops 2022, March 2022, Helsinki, Finland. mkremins@ucsc.edu (M. Kreminski); ikarth@ucsc.edu (I. Karth); mmateas@ucsc.edu (M. Mateas); nwardrip@ucsc.edu (N. Wardrip-Fruin). https://mkremins.github.io (M. Kreminski). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <p>In this paper, we propose a new technique for evaluating MICIs—expressive range coverage analysis (ERaCA)—that extends ERA to a co-creative context by visualizing a small number of co-created artifacts in the context of the broader expressive range from which these artifacts were sampled. ERaCA applies a set of quantitative artifact evaluation metrics to the simultaneous assessment of many model-created artifacts and a handful of co-created artifacts, then produces a visualization of the results, allowing us to visually answer such questions as:
• Does a MICI allow its users to access the entirety of the underlying generative model’s expressive range, or only a limited subset?
• How typical or unusual are the artifacts created by a user in the context of the broader expressive range?
• Are all of a MICI’s users drawn toward the same parts of its expressive range, or do different users typically explore different regions of the possibility space?
• As users continue to interact with a MICI, do the artifacts they produce tend to get closer together or further apart within the expressive range? In other words, does the MICI tend to promote convergent or divergent thinking?
• More generally, as users continue to interact with a MICI, what trends appear in a single user’s artifacts over time?</p>
      <p>We demonstrate ERaCA via a pilot study in which we
apply the ERaCA method to the evaluation of
Redactionist, a MICI for erasure poetry creation. The resulting
visualizations provide preliminary answers to several of
the above questions based on data collected from a small
number of users. Altogether, the argument for our
approach can be summed up as follows: we learn more
about a MICI from inspecting co-created artifacts in
the context of the underlying expressive range than
we do from inspecting both co-created artifacts and
the underlying expressive range individually.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
<p>Expressive range analysis (ERA) [11] is a visualization-based approach to understanding and evaluating the effectiveness of generative models. Application of ERA follows a four-step approach:
1. Determine appropriate quantitative metrics for the kinds of artifacts that the generative model will produce. Ideally these metrics are computationally inexpensive to evaluate, so that they can be efficiently applied to a large number of individual artifacts.
2. Generate a large number of artifacts using the generative model to collect a representative sample of the model’s output, using the metrics defined in step 1 to evaluate each artifact.
3. Visualize the results of evaluation, typically as a set of two-dimensional histograms in which pairs of metrics are plotted against one another to showcase artifact density in different “slices” of the overall expressive range.
4. Analyze the impact of parameters passed to the generative model on the resulting expressive range, allowing for the visual determination of how different parameters influence the artifacts that the model produces.</p>
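The four steps above can be sketched in a few dozen lines. Everything in this sketch is a hypothetical stand-in: the toy "generative model" and both metrics are invented for illustration (real ERA metrics are domain-specific), and a plotting library would normally render the binned grid as a histogram image.

```python
import random

VOCAB = ["moon", "stone", "river", "glass", "quiet", "ember"]

def generate_artifact(rng):
    """Stand-in generative model: an 'artifact' is just a list of words."""
    return [rng.choice(VOCAB) for _ in range(rng.randint(3, 12))]

# Step 1: cheap quantitative metrics, one number per artifact.
def length_in_chars(artifact):
    return sum(len(word) for word in artifact)

def unique_word_ratio(artifact):
    return len(set(artifact)) / len(artifact)

# Step 2: generate a large sample and score every artifact.
rng = random.Random(0)
sample = [generate_artifact(rng) for _ in range(5000)]
scores = [(length_in_chars(a), unique_word_ratio(a)) for a in sample]

# Step 3: bin one metric pair into a 2D histogram (normally rendered as an
# image, with one such plot per pair of metrics).
def histogram2d(points, bins=10):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_lo, x_span = min(xs), max(xs) - min(xs)
    y_lo, y_span = min(ys), max(ys) - min(ys)
    grid = [[0] * bins for _ in range(bins)]
    for x, y in points:
        col = min(int((x - x_lo) / x_span * bins), bins - 1)
        row = min(int((y - y_lo) / y_span * bins), bins - 1)
        grid[row][col] += 1
    return grid

grid = histogram2d(scores)
# Step 4 would regenerate `grid` under different model parameters and
# visually compare the resulting histograms.
```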
      <p>Although ERA has been integrated into tools for
human creators of generative models [15], extended in
various ways [13, 16], and applied to domains as
wide-ranging as emergent narrative [17] and road network
generation [18], it has several important limitations. In
particular, conventional ERA is data-hungry and poorly
suited to the evaluation of small numbers of artifacts,
which has prevented its application to creative contexts
in which artifacts are individually time-consuming or
costly to generate [13]—as is often the case when human
users are involved in the creative process.</p>
      <p>However, the ideas captured by ERA remain important
to the evaluation of tools for human-AI creative
collaboration. Among nine potential pitfalls for co-creative
systems discussed by Buschek et al. [19], at least five
(“Invisible AI boundaries”, “Lack of expressive interaction”,
“Agony of choice”, “Time waster”, and “AI bias”) can be
viewed as stemming from either an insufficiently wide
expressive range; an expressive range that does not overlap
well with user desires; or a flawed user interface for
accessing the available expressive range. Evaluations based
exclusively on self-reported subjective user experience
can produce misleading results [20], leading some to
suggest that inspection of co-created artifacts is also needed
to arrive at a holistic picture of a co-creative system’s
success or failure [21, 22, 23, 24]—but even these hybrid
evaluations cannot clearly diagnose whether a MICI’s
weaknesses are due to the underlying generative model
or the interface through which the model is accessed.
And some studies of user behavior in MICIs have
suggested that some users are motivated by a drive to explore
the extremes of a MICI’s expressive range [25],
necessitating the comparison of co-created artifacts against
the expressive range to verify these findings. In sum,
these difficulties all point to a common unmet need: an
evaluation method for MICIs that can illuminate the
relationship between individual co-created artifacts and the
MICI’s overall expressive range.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Expressive Range Coverage Analysis</title>
      <p>Expressive range coverage analysis (ERaCA) is a new
evaluation technique for mixed-initiative creative
interfaces (MICIs) in which a human user and a generative
model share creative responsibility for the discovery and
selection of artifacts from a large possibility space.
ERaCA builds on ERA, but also extends the evaluation
process by incorporating the solicitation and examination of
a small number of co-created artifacts (i.e., artifacts made
or discovered by human study participants through their
interaction with the MICI) in the context of the
generative model’s expressive range.</p>
      <p>ERaCA as a process consists of seven steps:
1. Determine appropriate quantitative metrics for the kinds of artifacts that the generative model will produce.
2. Generate a large number of artifacts using the generative model, and evaluate each artifact using the metrics defined in step 1.
3. Visualize the results of evaluation.
4. Solicit the co-creation of a small number of artifacts by human study participants, ideally drawn from among the MICI’s target user base, and evaluate these artifacts using the same metrics that are used to evaluate the purely machine-created ones.
5. Visualize the location of co-created artifacts within the context of the larger possibility space, for instance as a set of scatterplots drawn directly on top of the two-dimensional histograms created in step 3.
6. (Optional) Construct per-user visualizations of the user’s trajectory within the possibility space, using a color gradient to indicate the order in which artifacts were created on the plot. We discuss this visualization approach in greater detail in section 5.4, and an example can be seen in Figure 6.
7. Visually analyze the results to make determinations about users’ coverage of and trajectory within the generative model’s possibility space.
Steps 1-3 of this process are the same as for ERA, while steps 4-7 (which rely on incorporation of co-created artifacts into the evaluation process) are unique to ERaCA.</p>
    </sec>
    <sec id="sec-4a">
      <title>4. Pilot Study Procedure</title>
      <p>In preparation for a larger-scale user study to be conducted in the future, we ran a small-scale pilot study to test and illustrate our approach. Our pilot study used the ERaCA method to evaluate the mixed-initiative erasure poetry creation tool Redactionist on the basis of artifacts created by four participants (all coauthors of this paper) from a single fixed paragraph of source text.</p>
      <sec id="sec-4a-1">
        <title>4.1. Redactionist</title>
        <p>Redactionist [26], previously known as Blackout [27], is a browser-based casual-creator [28] MICI that helps users create English-language erasure poetry by interactively removing most of the words from a user-provided source text. Once given a source text, Redactionist uses a rules-based generative model (adapted from an earlier model created by Liza Daly [29]) to generate a large number of potential erasure poems that could be created from the text. Then it provides the user with an interface for navigating this space of potential poems by toggling whether specific words should be present in the final poem. A screenshot of Redactionist’s interface, showing a half-constructed poem, can be seen in Figure 1.</p>
        <p>Given a source text, Redactionist’s rules look for poems that take the form of several short and grammatically correct declarative sentences—one sentence per paragraph of input text. For instance, one of Redactionist’s rules—the grammatical pattern ARTICLE NOUN VERB ARTICLE ADJECTIVE NOUN—would find and match sequences of words such as “the poem conceals an elusive metaphor” within a paragraph of source text, with any other words in the source text paragraph being erased. The words in each matched sentence might be separated by any number of other words, as long as they occur in the correct sequence within a single paragraph of the source text. The version of Redactionist used here contains 136 rules, each of which matches sentences of a particular form.</p>
      </sec>
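The in-order subsequence matching that this kind of rule performs can be sketched as follows. This is a hypothetical illustration, not Redactionist's actual implementation: the tagger output, the greedy first-match strategy, and all names here are assumptions (a full implementation would likely enumerate every match, since Redactionist generates many candidate poems per rule).

```python
# Hypothetical sketch of matching one grammatical rule as an in-order
# subsequence of a part-of-speech-tagged paragraph; every word not in the
# match would be erased.

def match_rule(tagged_words, pattern):
    """Return source-text indices of the first greedy in-order match of
    `pattern`, or None if the paragraph contains no match."""
    kept = []
    for i, (_, tag) in enumerate(tagged_words):
        if len(kept) == len(pattern):
            break
        if tag == pattern[len(kept)]:
            kept.append(i)
    return kept if len(kept) == len(pattern) else None

tagged = [("the", "ARTICLE"), ("weary", "ADJECTIVE"), ("poem", "NOUN"),
          ("quietly", "ADVERB"), ("conceals", "VERB"), ("an", "ARTICLE"),
          ("elusive", "ADJECTIVE"), ("metaphor", "NOUN")]
rule = ["ARTICLE", "NOUN", "VERB", "ARTICLE", "ADJECTIVE", "NOUN"]
kept = match_rule(tagged, rule)
poem = " ".join(tagged[i][0] for i in kept)
# poem == "the poem conceals an elusive metaphor"
```

Note how "weary" and "quietly" are skipped: matched words need only occur in the correct order, not adjacently, which is what lets a single rule match many different erasures of the same paragraph.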
      <sec id="sec-4-1">
        <title>4.2. Data Collection</title>
        <p>Due to logistical constraints (described further in section 6.1), the four coauthors of this paper served as our pilot study participants. Each participant was instructed to use Redactionist with a fixed source text (a one-paragraph excerpt from a transcript of a talk by Allison Parrish on computational poetry [30]) to create a sequence of ten short erasure poems. To ensure that participants were not composing their poems with a particular metric or evaluation criterion in mind, we avoided deciding what metrics would be used to evaluate the poems until after the data had been collected, and we did not confer with one another about our aesthetic intentions for the poems we had made.</p>
        <p>In addition to these 40 co-created poems, we also gathered and analyzed the complete set of 57,195 potential poems that the Redactionist generative model considers to be possible erasures of the fixed input text. This larger set of poems, which we call the “full poemspace”, forms the backdrop for our analysis: by comparing the 40 co-created poems to the full poemspace, we can identify the co-created poems as typical or atypical in various ways and analyze the extent to which the co-created poems cover (or fail to cover) the full poemspace. For some generative models, it may be easier to instead establish a backdrop set of artifacts by uniformly sampling many (but not all) possible artifacts for the given user input; the details of this sampling vary depending on how the generative model is implemented.</p>
      </sec>
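The sampling fallback mentioned above might look like the following sketch. The sampler is a hypothetical stand-in (here, poems are random index subsets of a fixed size); for a real model, `sample_poem` would need to draw uniformly from the model's actual possibility space, which is the implementation-specific part.

```python
# Sketch of building a backdrop set by sampling rather than enumeration,
# for models whose full possibility space is too large to enumerate.
import random

def sample_backdrop(sample_one, n, seed=0):
    """Draw `n` artifacts independently to stand in for the full space."""
    rng = random.Random(seed)
    return [sample_one(rng) for _ in range(n)]

# Hypothetical stand-in sampler: keep `keep` word indexes out of a
# `source_len`-word source text, chosen uniformly at random.
def sample_poem(rng, source_len=40, keep=6):
    return sorted(rng.sample(range(source_len), keep))

backdrop = sample_backdrop(sample_poem, 1000)
```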
      <sec id="sec-4-2">
        <title>4.3. Artifact Evaluation Metrics</title>
        <p>ERaCA, like ERA, uses several domain-specific
quantitative metrics to characterize each of the artifacts produced
by a generative model or creative collaboration. Erasure
poetry is an unusual form of poetry that has not been
investigated much in the scholarly literature [31, 32, 33],
and the short, single-declarative-sentence poems
produced by Redactionist given a single paragraph of input
text do not contain many of the features (such as end
rhyme) that are most widely studied in the analysis of
poetry. Consequently, rather than drawing directly on
metrics that have been defined for more conventional
forms of poetry [34], we instead defined several
preliminary but easy-to-implement metrics of our own that
attempt to capture key aesthetic features of erasure
poetry as a form. These metrics include:</p>
        <p>Average word position within the source text.
Erasure poems are characterized partly by the visual spacing
of the non-erased words within the source text. Since
Redactionist represents poems internally as a set of
numerical indexes into the source text pointing to the
user-selected words, averaging these indexes together can
give a simple approximation of whether a poem mostly
contains words taken from near the start, middle, or end
of the source text.</p>
        <p>Distance between the poem’s first and last words
within the source text. This metric can be used to
differentiate poems that draw exclusively from one narrow
region within the source text from poems that draw from a
larger span. It is especially useful when applied alongside
the previous metric to identify where in the source text
the user focused their attention when selecting words to
retain.</p>
        <p>Poem length in characters. This metric counts the
total number of characters in the selected words that
comprise the poem. Many erasure poems attempt to
visually overwhelm the reader with the sheer amount of
text that is erased [33]; counting non-erased characters
relative to a fixed source text length works as a loose
proxy for the proportion of the source text that is erased.</p>
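Under the representation described above (a poem as a set of numerical indexes into the tokenized source text), these three positional metrics each reduce to a line or two. The sample source text and poem below are hypothetical.

```python
# Sketches of the three positional metrics, assuming a poem is a set of
# word indexes into the tokenized source text (as described in the paper).

source_words = "the weary poem quietly conceals an elusive metaphor".split()

def avg_word_position(poem_indexes):
    """Roughly where in the source text the poem's words sit."""
    return sum(poem_indexes) / len(poem_indexes)

def dist_between_first_and_last_words(poem_indexes):
    """Span of source text the poem draws from."""
    return max(poem_indexes) - min(poem_indexes)

def poem_length_in_chars(poem_indexes, words):
    """Total characters retained; a proxy for how much text is erased."""
    return sum(len(words[i]) for i in poem_indexes)

poem = {0, 2, 4, 5, 6, 7}  # "the poem conceals an elusive metaphor"
avg_word_position(poem)                   # 4.0
dist_between_first_and_last_words(poem)   # 7
poem_length_in_chars(poem, source_words)  # 32
```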
        <p>Average English-corpus word frequency of the
words selected for inclusion in the poem. This metric
attempts to quantify how unusual a given poem’s word
choices are in the context of the English language as a
whole, under the logic that retained words in erasure
poems are often chosen with the intent to surprise the
reader. For English word frequency data, we used the
SUBTLEX-US dataset of film and television subtitles [36]—
specifically the word frequency per 1,000,000 words
measure (SUBTLWF), as given by the file that contains word
frequency data for all 74,286 distinct words that appear
within the dataset.</p>
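Given a lookup table of per-million-word frequencies, this metric is a simple mean. The frequency values below are made up for illustration (the real values would come from the SUBTLWF column of the dataset), and the zero default for out-of-vocabulary words is an assumption.

```python
# Sketch of the English-corpus word frequency metric; the table values are
# illustrative placeholders, not real SUBTLWF numbers.
SUBTLWF = {"the": 50000.0, "poem": 7.5, "conceals": 0.25,
           "an": 1300.0, "elusive": 1.25, "metaphor": 2.0}

def avg_english_word_freq(poem_words, freq_table, default=0.0):
    """Mean corpus frequency of the poem's words; lower values mean the
    poem's word choices are more unusual for English overall."""
    total = sum(freq_table.get(w.lower(), default) for w in poem_words)
    return total / len(poem_words)

avg_english_word_freq(["elusive", "metaphor"], SUBTLWF)  # 1.625
```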
        <p>Average within-poemspace word frequency of the
words selected for inclusion in the poem. This metric
attempts to quantify how unusual a given poem’s word
choices are in the context of the complete poemspace, with
each word’s frequency determined by counting how often
it appears in the complete set of poems that the
generative model is able to create from this source text. Because
the meaning of an erasure poem is partly defined in
relation to the meaning of its source text [31], including the
alternative erasures of the same source text that might
have been performed, it makes sense to consider the
individual poem’s relationship to the full poemspace as a
potential aesthetic measure.</p>
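This metric can be sketched as below. One reading is assumed here: each word's frequency is the number of poems in the poemspace containing it (counting each poem once), and a three-poem toy poemspace stands in for Redactionist's 57,195 poems.

```python
# Sketch of the within-poemspace word frequency metric, using a toy
# poemspace; word frequency is taken as document frequency (one count per
# poem containing the word), which is an assumption.
from collections import Counter

poemspace = [["the", "poem", "conceals"],
             ["the", "elusive", "metaphor"],
             ["the", "poem", "sings"]]

word_counts = Counter(w for poem in poemspace for w in set(poem))

def avg_poemspace_word_freq(poem_words):
    """Mean poemspace frequency of the poem's words; lower values mean
    word choices the model itself would rarely make."""
    return sum(word_counts[w] for w in poem_words) / len(poem_words)

avg_poemspace_word_freq(["the", "conceals"])  # (3 + 1) / 2 = 2.0
```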
        <p>Average word pair probability within the poemspace
across all word pairs in the poem. The probability of a
word pair ⟨a, b⟩ is the probability that, given word a is
present in a poem, word b is also present within that
same poem. Like the word frequency metrics, this
metric attempts to capture the surprising quality of word
choices in many human-created erasure poems; here, it
is particularly useful for identifying poems that contain
pairs of words that the generative model would not often
use together when unguided by a human user.</p>
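The conditional probability behind this metric, estimated over the full poemspace, can be sketched as follows. Treating the pairs as ordered and averaging over all ordered pairs is an assumption; the same toy poemspace as above stands in for the real one.

```python
# Sketch of the word pair probability metric over a toy poemspace:
# pair_probability(a, b) estimates P(b in poem | a in poem).
from itertools import permutations

poemspace = [["the", "poem", "conceals"],
             ["the", "elusive", "metaphor"],
             ["the", "poem", "sings"]]

def pair_probability(a, b):
    with_a = [p for p in poemspace if a in p]
    return sum(1 for p in with_a if b in p) / len(with_a)

def avg_word_pair_probability(poem_words):
    """Mean pair probability over all ordered word pairs in the poem;
    low values flag pairings the model rarely produces on its own."""
    pairs = list(permutations(poem_words, 2))
    return sum(pair_probability(a, b) for a, b in pairs) / len(pairs)

pair_probability("the", "poem")  # 2/3: "poem" appears in 2 of 3 poems with "the"
result = avg_word_pair_probability(["the", "poem"])
```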
<p>Letter repetition score. This metric counts all of the
unique letters in a poem and divides this count by the
total number of letters in the poem. Poems receive a low
score if they reuse the same letter many times, and a
high score if they reuse letters infrequently. This score
is intended as a loose proxy for sound reuse, an aesthetic
quality of poems related to how similar the words in the
poem sound to one another when pronounced. Sound
devices [34] such as assonance, consonance, alliteration,
and rhyme are all varieties of sound reuse. Low letter
repetition scores may indicate intentional selection of
words that sound similar to one another, while very high
letter repetition scores may indicate intentional selection
of words that phonetically clash.</p>
        <p>We also defined minimum and maximum variants of
each metric that reports an average value—for instance,
metrics that report the probability score of the most and
least likely word pairs in each poem, to accompany the
metric that reports the average probability of all of a
poem’s word pairs. However, for reasons of space, we do
not report results related to these metrics here.</p>
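The letter repetition score is the simplest of the metrics, and can be sketched directly from its definition; the sample word lists are hypothetical.

```python
# Sketch of the letter repetition score: unique letters divided by total
# letters, so heavy letter reuse (a loose proxy for sound reuse) yields a
# low score and varied lettering yields a high one.

def letter_repetition_score(poem_words):
    letters = [ch for word in poem_words for ch in word.lower() if ch.isalpha()]
    return len(set(letters)) / len(letters)

letter_repetition_score(["sad", "sands", "stand"])  # heavy reuse, low score
letter_repetition_score(["vexing", "jackdaw"])      # little reuse, high score
```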
      </sec>
      <sec id="sec-4-3">
        <title>4.4. Data and Code Availability</title>
          <p>All data for this study (including the participant-created poems and the full poemspace), as well as the code that we used to run the analysis and generate our visualizations, is available online: https://github.com/mkremins/redactionist-eraca.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>Examination of the visualizations we created allows us to characterize Redactionist’s effects on users in terms of the artifacts they tend to create. Below, we briefly discuss some of the key findings from our pilot study.</p>
      <sec id="sec-5-1">
        <sec id="sec-5-1-1">
          <title>5.1. Users collectively explore most of the model’s expressive range</title>
          <p>At a high level, inspection of the metric pair
visualizations in the corner plot (Figure 2) shows that the four
participants collectively created artifacts that cover the
generative model’s expressive range well. Although the
densest clusters of co-created artifacts within the
possibility space mostly do not align with the densest clusters
of possible machine-generated artifacts, the placement of
co-created artifacts across the possibility space suggests
that users are capable of creating poems that occupy any
point within the generative model’s expressive range as
defined by these metrics. This provides evidence that the
Redactionist interface is successful at exposing the full
possibility space of the underlying generative model to its
users: no regions of the possibility space are inaccessible
to users due to interface limitations.</p>
          <p>A particularly good example of expressive range
coverage can be seen in the visualization of the
poemLengthInChars and avgEnglishWordFreq metric pair
(Figure 3). Although co-created poems largely fall outside of
the densest parts of the possibility space, and although
some co-created poems stand out as extreme outliers
relative to the possibility space as a whole, the overall
distribution of co-created artifacts shows that users can
access the entirety of the possibility space.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>5.2. Co-created artifacts are disproportionately unusual</title>
<p>Further inspection of the corner plot (Figure 2) shows that
co-created artifacts rarely occupy the densest parts of the
generative model’s expressive range, and that they are
unusually likely to be outliers in comparison to most
possible model-created poems. This is backed up by closer
examination of individual metric pairs: for instance, Figure 4
shows that co-created artifacts are much more likely
than model-created artifacts to contain unusual individual
words and word pairs (from the model’s perspective).
This may suggest that the generative model’s expressive
range contains many poems that human users would tend
to reject as unsuitable, leading to a focusing of human
attention on poems that are considered to be outliers.</p>
        </sec>
        <sec id="sec-5-1-3">
          <title>5.4. Redactionist tends to promote convergent thinking over divergent</title>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>One question that it would be useful to answer about</title>
        <p>MICIs involves the tendency of the MICI’s design to
pro5.3. Diferent users explore diferent mote divergent or convergent thinking within a single
portions of the expressive range user: do users tend to jump around between very
diferent regions of the possibility space, or do users tend to
We can also see from the corner plot (Figure 2) that difer- select a single region of the possibility space and then
ent users tend to explore diferent portions of the expres- “mine it out” by creating several artifacts all drawn from
sive range. Each participant’s co-created artifacts tend that same region? This question can be answered to some
to cluster together, allowing for the visual determination extent with a standard scatterplot overlay, but coloring
of each participant’s “style” in terms of the metrics we the points representing a single user’s artifacts in the
defined. Figure 5 shows this especially well: the visual order that these artifacts were created (according to a
clustering of poems created by each participant is highly color gradient) can further enable us to discern whether
evident here, suggesting that each participant tended to artifacts drawn from a particular region of the possibility
behave diferently when deciding where in the text they space were created contiguously or noncontiguously. We
should select words from. Participants P1, P3, and P4 call these augmented scatterplots “trajectory
visualizaall tended to pick a relatively narrow “window” within tions”, because they attempt to illuminate a single user’s
the source text and construct poems from several close- trajectory through the possibility space over time; an
together words, but P4 tended to draw from near the example trajectory visualization can be seen in Figure 6.
start of the source text; P1 tended to draw from near the Side-by-side per-user trajectory visualizations for the
end; and P3 moved throughout the source text while still avgWordPosition and
distBetweenFirstAndLastselecting mostly close-together words for each individ- Words metrics (Figure 7) shows that Redactionist users
ual poem. Meanwhile, participant P2 tended to create tend to converge on a specific approach to selecting
poems that drew words from all throughout the source words from the source text for inclusion in poems,
estext, resulting in unusually high distBetweenFirst- sentially choosing a “home region” within the source
AndLastWords scores relative to the other participants. text that they repeatedly revisit for multiple poems over
the course of a single session. Specifically, by examining
the order in which poems were created alongside their
positioning within the expressive range, we can see that
all four participants created at least three poems that
fall within a visually distinct region of the expressive
range from a source text location perspective; that two 5.5. Users experiment with highly
of these participants (P2 and P4) created an even larger unusual word choices before
number of poems sampled largely from similar locations regressing to the mean
within the source text; and that these poems were not
created in immediate sequence with one another, indi- We hypothesized that, as users are exposed to more of
cating that the user’s preference for a particular “home the generative model’s choices and explore a wider
varilocation” endures over the course of a session rather than ety of the words available to them, they might be driven
disappearing after a few successive poems are sampled toward selecting more unusual words over time—both
from the same region. from the perspective of the Redactionist poemspace (i.e.,</p>
        <p>The tendency of Redactionist users to work conver- avoiding words that tend to be used very frequently in
gently may be partly attributable to interface design. In generated poems) and from the perspective of the
EnRedactionist, once you have locked in a large number glish language as a whole (i.e., preferring words that
of words to finish a poem, it is easier to change only a occur less frequently in a corpus of general English
few of these selections than to change a large number of them at once. Additionally, the actual word attached to a span of selectable text is not made visible to users until they hover over this span. Consequently, users often take small, incremental steps within the possibility space and less frequently make the large jumps needed to switch from one region of the space to another—and even when they do make larger jumps, they tend to anchor their jumps on potentially selectable words that they had used in poems previously. Insofar as these behaviors are attributable to the user’s inadvertent fixation on a narrow region of the expressive range rather than intentional commitment to certain design choices [37], this analysis suggests the possibility of user interface features that deliberately encourage users to work divergently: for instance, an option to randomly select a new set of words containing none of the words that are currently selected, or a process that randomly highlights a nearby selectable word that a user has not yet used in any poems.</p>
        <p>Examination of trajectory visualizations for the avgPoemspaceWordFreq and avgEnglishWordFreq metric pair, however, does not show this expected trend—see Figure 8. Instead, we observe that all four participants at some point during their session experimented with the selection of highly unlikely words, but that no participant remained consistently focused on the selection of highly unlikely words afterward.</p>
        <p>In particular, in the bottom left-hand corner of their respective trajectory visualizations, we can see that three of four participants (P2, P3 and P4) all discovered a region of poemspace in which the poems contain words that are highly unlikely from both a poemspace word frequency and English word frequency perspective. Each of these participants created two poems within this region of poemspace; for P2 and P4 one of these poems was created shortly after the other, while for P3 these poems were separated in time by several others. However, none of these participants’ penultimate or final poems fall within this region, suggesting that none of these participants were primarily attempting to optimize for surprising word choice over the course of their session.</p>
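<p>The avgPoemspaceWordFreq and avgEnglishWordFreq metrics discussed above can be made concrete with a small sketch. The following is not the implementation used in this study; the function name, the add-one smoothing, and the toy frequency table are illustrative assumptions, and in practice the English-frequency counts would come from a corpus-derived norm such as [36].</p>

```python
import math

def avg_log_word_freq(poem_words, freq_counts, total_count):
    """Average log-probability of a poem's words under a unigram
    frequency table; higher (less negative) means commoner words.
    Unseen words are handled with simple add-one smoothing."""
    log_probs = [
        math.log((freq_counts.get(w.lower(), 0) + 1) /
                 (total_count + len(freq_counts)))
        for w in poem_words
    ]
    return sum(log_probs) / len(log_probs)

# Toy frequency table (illustrative counts, not real corpus norms).
counts = {"the": 1000, "of": 700, "night": 40, "sea": 30, "umbral": 1}
total = sum(counts.values())

common = avg_log_word_freq(["the", "of", "sea"], counts, total)
rare = avg_log_word_freq(["umbral", "night", "sea"], counts, total)
assert common > rare  # a poem of commoner words scores higher
```

<p>Plotting one such score against another for every sampled artifact yields the kind of two-dimensional metric pair whose trajectory visualizations are examined here.</p>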
        <p>This may be an instance of the curiosity-driven
behavior previously observed in some MICI users [25]:
deliberate probing of the MICI in an efort to discover the edges
of the possibility space. This explanation may also help
to explain why P4’s final poem in particular is visibly an
extreme outlier on the avgPoemspaceWordFreq
metric, containing much more common English words on
average than any other co-created poem: all of the
participants were driven by curiosity to some extent, but P4
was especially successful in probing the extreme corners
of the possibility space.</p>
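<p>The notion of an “extreme outlier” used above can be operationalized in several ways; the z-score rule sketched below is our own illustrative choice rather than an analysis step from this study, and the metric scores in the example are hypothetical.</p>

```python
def flag_outliers(values, threshold=2.0):
    """Return indices of values whose population z-score magnitude
    exceeds the threshold; assumes the values are not all identical."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [i for i, v in enumerate(values) if abs(v - mean) > threshold * std]

# Hypothetical avgPoemspaceWordFreq scores for ten poems; the last
# score sits far from the rest, standing in for a poem like P4's final one.
scores = [0.31, 0.29, 0.33, 0.30, 0.28, 0.32, 0.31, 0.30, 0.29, 0.62]
assert flag_outliers(scores) == [9]
```

<p>Flagging points this way could complement visual inspection of the plots when deciding which co-created artifacts merit closer qualitative examination.</p>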
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Limitations</title>
      <sec id="sec-6-1">
        <title>6.1. Pilot Study Limitations</title>
        <p>Our four participants for the ERaCA pilot study presented here were all members of this paper’s authorship team. We took this unusual approach because obtaining IRB approval for collection of user data at scale was not possible prior to the workshop submission deadline, due partly to the late-breaking nature of this work and partly to ongoing pandemic-related IRB reviewing backlogs. The small number of participants limits generalizability of the study’s results, and there was obviously an incentive for authors to try to “behave interestingly” while using the MICI so that publishable results would emerge. We tried to mitigate this potential source of bias (in particular by avoiding selection of poem evaluation metrics until after the data collection was complete), but this attempt at establishing a firewall between data collection and analysis is clearly imperfect. In the near future, we plan to run a larger user study (with a larger number of non-coauthor participants) to validate and expand on our findings. In the meantime, however, because the primary goal of this paper is to introduce the idea of expressive range coverage analysis and present a minimal case study of its application, we believe that our pilot study results are sufficient to illustrate the methodology.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Visualization Limitations</title>
        <p>The visualizations that we presented here use only color to indicate which user created each artifact (in multiuser visualizations) and the order in which artifacts were created (in single-user trajectory visualizations). This limits the accessibility of these visualizations to users who have difficulty perceiving color [38]. Future work should explore the use of shape, pattern, or another redundant visual channel alongside color in the co-created artifacts visualization layer. Particularly for trajectory visualizations, we suspect there may be value in shaping each data point as a small arrowhead pointing in the direction of the next data point in sequence, so that the order in which a user created their artifacts can be visually analyzed more easily.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Limitations of ERaCA as a Method</title>
        <p>Like ERA, ERaCA is a qualitative and visual evaluation technique. It is not capable of producing a single summary value that tells you how good a MICI is—but it does illuminate the MICI’s influence on users and co-created artifacts in useful ways, especially when the information that ERaCA provides is considered in terms of the MICI’s overall goals. It may be the case that ERaCA is best employed alongside other user-centered evaluation methods, such as the think-aloud method [39] and interviews [20], to provide an additional channel of information. For instance, there may be potential value in showing ERaCA plots to study participants in a debriefing interview after a conventional user study session, using the plots as prompts or visual aids to elicit remarks or insights from participants about specific aspects of their experience.</p>
        <p>Also like ERA, ERaCA relies on domain-specific artifact evaluation metrics to characterize artifacts in a particular creative domain. A few standard metrics [40] are widely used to evaluate 2D platformer game levels, and metrics for several other domains [18, 41, 17] have also been defined. However, there are many domains for which appropriate metrics have not yet been developed, necessitating additional work before ERaCA can be applied to these domains.</p>
        <p>Finally, ERaCA can only be applied to MICIs where the underlying generative model is capable of producing complete artifacts without human input. Fortunately, many recently developed MICIs for a wide variety of creative domains—including sketching [42], creature design [43], prose-level creative writing [44, 45], plot-level storytelling [46], poetry [47], instrumental music [48], songwriting [49], game design [50, 51], and level design [52]—follow this architectural pattern. However, ERaCA may not be as readily applicable to the evaluation of MICIs for domains such as physical crafts, in which the generative models employed by MICIs often cannot produce complete artifacts on their own due to the need for human involvement in the physicalization of generated designs [53, 54].</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>Expressive range coverage analysis (ERaCA) is a potentially powerful new methodology for the evaluation of mixed-initiative creative interfaces (MICIs). However, it still needs to be evaluated at a greater scale; visually polished to improve visualization legibility; integrated with other approaches to MICI evaluation, including conventional user studies; and extended to many new creative domains. We are excited to undertake many of these efforts in the future and intend to adopt ERaCA in the evaluation of our own co-creative systems going forward.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This paper was partly inspired by Gillian Smith’s questions about evaluation during Max Kreminski’s advancement to candidacy. We hope that ERaCA represents a step toward a method of evaluating co-creative systems that better reflects what we value about co-creativity.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[1] S. Deterding, J. Hook, R. Fiebrink, M. Gillies, J. Gow, M. Akten, G. Smith, A. Liapis, K. Compton, Mixed-initiative creative interfaces, in: Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, 2017, pp. 628–635.</p>
      <p>[2] A. Liapis, G. N. Yannakakis, C. Alexopoulos, P. Lopes, Can computers foster human users’ creativity? Theory and praxis of mixed-initiative co-creativity, Digital Culture &amp; Education (DCE) 8 (2016) 136–152.</p>
      <p>[3] B. Shneiderman, Creativity support tools: Accelerating discovery and innovation, Communications of the ACM 50 (2007) 20–32.</p>
      <p>[4] A. Summerville, S. Snodgrass, M. Guzdial, C. Holmgård, A. K. Hoover, A. Isaksen, A. Nealen, J. Togelius, Procedural content generation via machine learning (PCGML), IEEE Transactions on Games 10 (2018) 257–270.</p>
      <p>[5] A. M. Smith, M. Mateas, Answer set programming for procedural content generation: A design space approach, IEEE Transactions on Computational Intelligence and AI in Games 3 (2011) 187–200.</p>
      <p>[6] P. Karimi, K. Grace, M. L. Maher, N. Davis, Evaluating creativity in computational co-creative systems, in: Proceedings of the 9th International Conference on Computational Creativity, 2018, pp. 104–111.</p>
      <p>[7] E. A. Carroll, C. Latulipe, R. Fung, M. Terry, Creativity factor evaluation: towards a standardized survey metric for creativity support, in: Proceedings of the Seventh ACM Conference on Creativity and Cognition, 2009, pp. 127–136.</p>
      <p>[8] C. Lamb, D. G. Brown, C. L. Clarke, Evaluating computational creativity: An interdisciplinary tutorial, ACM Computing Surveys (CSUR) 51 (2018).</p>
      <p>[9] A. Jordanous, Evaluating evaluation: Assessing progress and practices in computational creativity research, in: Computational Creativity, Springer, 2019, pp. 211–236.</p>
      <p>[10] S. Colton, G. A. Wiggins, Computational creativity: The final frontier?, in: ECAI 2012 - 20th European Conference on Artificial Intelligence, IOS Press, 2012, pp. 21–26.</p>
      <p>[11] G. Smith, J. Whitehead, Analyzing the expressive range of a level generator, in: Proceedings of the 2010 Workshop on Procedural Content Generation in Games, 2010.</p>
      <p>[12] M. Kreminski, M. Mateas, Toward narrative instruments, in: International Conference on Interactive Digital Storytelling, Springer, 2021, pp. 499–508.</p>
      <p>[13] A. Summerville, Expanding expressive range: Evaluation methodologies for procedural content generation, in: Fourteenth Artificial Intelligence and Interactive Digital Entertainment Conference, 2018.</p>
      <p>[14] G. Smith, J. Whitehead, M. Mateas, Tanagra: Reactive planning and constraint solving for mixed-initiative level design, IEEE Transactions on Computational Intelligence and AI in Games 3 (2011) 201–215.</p>
      <p>[15] M. Cook, J. Gow, G. Smith, S. Colton, Danesh: Interactive tools for understanding procedural content generators, IEEE Transactions on Games (2021).</p>
      <p>[16] S. Snodgrass, A. Summerville, S. Ontañón, Studying the effects of training data on machine learning-based procedural content generation, in: Thirteenth Artificial Intelligence and Interactive Digital Entertainment Conference, 2017.</p>
      <p>[17] Q. Kybartas, C. Verbrugge, J. Lessard, Tension space analysis for emergent narrative, IEEE Transactions on Games 13 (2020) 146–159.</p>
      <p>[18] E. Teng, R. Bidarra, A semantic approach to patch-based procedural generation of urban road networks, in: Proceedings of the 12th International Conference on the Foundations of Digital Games, 2017.</p>
      <p>[19] D. Buschek, L. Mecke, F. Lehmann, H. Dang, Nine potential pitfalls when designing human-AI co-creative systems, in: Joint Proceedings of the ACM IUI 2021 Workshops, 2021.</p>
      <p>[20] A. Adams, P. Lunt, P. Cairns, A qualitative approach to HCI research, in: Research Methods for Human-Computer Interaction, Cambridge University Press, 2008, pp. 138–157.</p>
      <p>[21] A. Kantosalo, Human-Computer Co-Creativity: Designing, Evaluating and Modelling Computational Collaborators for Poetry Writing, Ph.D. thesis, University of Helsinki, 2019.</p>
      <p>[22] J. Kim, M. L. Maher, S. Siddiqui, Studying the impact of AI-based inspiration on human ideation in a co-creative design system, in: Joint Proceedings of the ACM IUI 2021 Workshops, 2021.</p>
      <p>[23] M. Kreminski, B. Samuel, E. Melcer, N. Wardrip-Fruin, Evaluating AI-based games through retellings, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 15, 2019, pp. 45–51.</p>
      <p>[24] M. P. Eladhari, Re-tellings: the fourth layer of narrative as an instrument for critique, in: International Conference on Interactive Digital Storytelling, Springer, 2018, pp. 65–78.</p>
      <p>[25] M. J. Nelson, S. E. Gaudl, S. Colton, S. Deterding, Curious users of casual creators, in: Proceedings of the 13th International Conference on the Foundations of Digital Games, 2018.</p>
      <p>[26] M. Kreminski, M. Mateas, Reflective creators, in: International Conference on Computational Creativity, 2021.</p>
      <p>[27] M. Kreminski, I. Karth, N. Wardrip-Fruin, Generators that read, in: Proceedings of the 14th International Conference on the Foundations of Digital Games, 2019.</p>
      <p>[28] K. Compton, M. Mateas, Casual creators, in: International Conference on Computational Creativity, 2015, pp. 228–235.</p>
      <p>[29] L. Daly, The days left forebodings and water, https://lizadaly.com/pages/blackout, 2016.</p>
      <p>[30] A. Parrish, Exploring (semantic) space with (literal) robots, http://opentranscripts.org/transcript/semantic-space-literal-robots, 2015.</p>
      <p>[31] T. Macdonald, A brief history of erasure poetics, Jacket Magazine 38 (2009).</p>
      <p>[32] B. McHale, Poetry under erasure, in: Theory into Poetry: New Approaches to the Lyric, Rodopi Amsterdam, 2005, pp. 277–301.</p>
      <p>[33] B. C. Cooney, “Nothing is left out”: Kenneth Goldsmith’s Sports and erasure poetry, jml: Journal of Modern Literature 37 (2014) 16–33.</p>
      <p>[34] J. Kao, D. Jurafsky, A computational analysis of style, affect, and imagery in contemporary poetry, in: Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature, 2012, pp. 8–17.</p>
      <p>[35] D. Foreman-Mackey, corner.py: Scatterplot matrices in Python, The Journal of Open Source Software 1 (2016) 24.</p>
      <p>[36] M. Brysbaert, B. New, Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English, Behavior Research Methods 41 (2009) 977–990.</p>
      <p>[37] J. S. Gero, Fixation and commitment while designing and its measurement, The Journal of Creative Behavior 45 (2011) 108–115.</p>
      <p>[38] W3C Web Content Accessibility Guidelines Working Group, Use of color: Understanding SC 1.4.1, https://www.w3.org/TR/UNDERSTANDING-WCAG20/visual-audio-contrast-without-color.html, 2016.</p>
      <p>[39] K. A. Ericsson, H. A. Simon, Protocol Analysis: Verbal Reports as Data, MIT Press, 1984.</p>
      <p>[40] A. Canossa, G. Smith, Towards a procedural evaluation technique: Metrics for level design, in: The 10th International Conference on the Foundations of Digital Games, 2015.</p>
      <p>[41] A. Liapis, G. N. Yannakakis, J. Togelius, Sentient Sketchbook: computer-assisted game level authoring, in: Proceedings of the 8th International Conference on the Foundations of Digital Games, 2013.</p>
      <p>[42] J. E. Fan, M. Dinculescu, D. Ha, collabdraw: an environment for collaborative sketching with an artificial agent, in: Proceedings of the 2019 Conference on Creativity and Cognition, 2019, pp. 556–561.</p>
      <p>[43] Z. Epstein, O. Boulais, S. Gordon, M. Groh, Interpolating GANs to scaffold autotelic creativity, in: Joint Workshops of the International Conference on Computational Creativity, 2020.</p>
      <p>[44] A. Calderwood, V. Qiu, K. I. Gero, L. B. Chilton, How novelists use generative language models: An exploratory user study, in: Joint Proceedings of the Workshops on Human-AI Co-Creation with Generative Models and User-Aware Conversational Agents, 2020.</p>
      <p>[45] M. Roemmele, A. S. Gordon, Automated assistance for creative writing with an RNN language model, in: Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion, 2018.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[46] M. Kreminski, M. Dickinson, M. Mateas, N. Wardrip-Fruin, …, 2020.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[47] H. G. Oliveira, T. Mendes, A. Boavida, A. Nakamura, … generation, Cognitive Systems Research 54 (2019) 199–216.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[48] R. Louie, A. Coenen, C. Z. Huang, M. Terry, C. J. …, … of the 2020 CHI Conference on Human Factors in Computing Systems, 2020.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[49] M. Ackerman, D. Loker, Algorithmic songwriting …, 2017.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[50] M. Nelson, S. Colton, E. Powley, S. Gaudl, P. Ivey, … design, in: Proceedings of the CHI'17 Workshop on Mixed-Initiative Creative Interfaces, 2017.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[51] M. Kreminski, M. Dickinson, J. Osborn, A. Summerville, … Entertainment, volume 16, 2020, pp. 102–108.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[52] M. Guzdial, N. Liao, J. Chen, S.-Y. Chen, S. Shah, … Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[53] L. Albaugh, S. E. Hudson, L. Yao, L. Devendorf, In-… Designing Interactive Systems, 2020, pp. 1033–1046.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[54] A. Sullivan, Embroidered Ephemera: Crafting qual-… Conference on Computational Creativity, 2020.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>