Axes of Characterizing Generative Systems: A Taxonomy of Approaches to Expressive Range Analysis

Josiah Boucher¹,*, Gillian Smith¹,† and Yunus Doğan Telliel¹,†

¹ Worcester Polytechnic Institute (WPI), 100 Institute Road, Worcester, MA, U.S.A.

Abstract
Seeking to leverage Expressive Range Analysis (ERA)—a method of characterizing generative systems—for analysis of generative AI (GAI) systems and their outputs, this paper categorizes the approaches that have been used for ERA. We present a taxonomy with five axes of characterization that may be applied to ERA methodologies: content agnostic vs semantic; quantitative vs qualitative; product vs process; objective vs subjective; and automated vs manual. While ERA has traditionally been limited to the domain of Procedural Content Generation (PCG) in video game development, we recognize parallels between PCG and GAI and hope to see an expansion of the application of ERA into the domain of GAI. Serving this goal, these axes provide metrics through which to categorize, compare, and explore approaches.

Keywords
Expressive Range Analysis, Procedural Content Generation, Generative AI, Evaluating Generative Systems, Taxonomy

11th Experimental Artificial Intelligence in Games Workshop, November 19, 2024, Lexington, Kentucky, USA.
* Corresponding author.
† These authors contributed equally.
Email: jdboucher@wpi.edu (J. Boucher); gmsmith@wpi.edu (G. Smith); ydtelliel@wpi.edu (Y. D. Telliel)
ORCID: 0009-0002-6865-9031 (J. Boucher); 0000-0002-2765-7702 (G. Smith); 0000-0001-5651-7349 (Y. D. Telliel)

1. Introduction

Expressive Range Analysis (ERA) is a method of characterizing the nature and shape of a generative model in terms of its outputs. What sorts of outputs is a generator capable of? What impact does changing the inputs of a generator have on its outputs? How do biases manifest in the generative outputs, and how do the inputs influence these biases? ERA is well suited for answering these questions [1]. ERA comes from the domain of Procedural Content Generation (PCG) in video game development: the practice of leveraging algorithmic methods to design and produce artifacts for use in games across a variety of contexts, such as levels or maps [2].

We recognize common threads between PCG and generative AI (GAI) systems—defined here as the process of using generative technologies such as large language models to produce content, often using natural language prompts as human-provided input—including the high-level functionality, motivations, use cases, and shortcomings of both domains (for details, see Section 2). Because of these similarities, we seek to expand and leverage ERA—which has not only proven useful for recognizing bias within, and categorizing, the generative outputs of PCG systems, but also resembles some existing approaches for GAI analysis [3]—for analysis of GAI system outputs, especially text-to-X, large language model (LLM)-based applications (such as the text-to-text ChatGPT [4], the text-to-image Stable Diffusion [5], and the text-to-speech ElevenLabs [6]).

The application of ERA in the domain of GAI is not a one-to-one translation from how the method is used in PCG. Use of GAI tools commonly requires more complicated, varied, and linguistic inputs than PCG, which tends to operate from numerical randomization as a starting point for much of its generation. Because these inputs have a major impact on the outputs of GAI systems [7], selection of these inputs is an important consideration for applying ERA to GAI: some decision points for input selection could include whether to handcraft inputs or sample the latent space, and whether the inputs should be restricted in a way that targets a specific domain of outputs (e.g. a prompt to output a haiku would be very different from one that might produce a cooking recipe). Furthermore, ERA requires the application of metrics to large quantities of content outputted from the system in question. What metrics are used and how they are determined is a major component of ERA, often directly determining the value provided by the analysis. Because GAI outputs tend to occupy a broad domain of applications and mediums, unique challenges arise when considering these metrics, further complicating the application of ERA in this context.

Responding to 1) these parallels between PCG and GAI, and 2) the massive increase of scope presented by GAI, this paper presents a taxonomy for categorizing approaches to ERA. This taxonomy includes vocabulary to better distinguish examples from existing work in PCG, as well as a compass to guide further exploration of using this method in the rapidly expanding domain of GAI. The goal of this taxonomy is to push the boundaries of what ERA may be used for and how it may be applied.
2. Related Work

This work operates at the intersection of procedural content generation and generative AI. Guzdial provides a valuable bridge between these domains with the lens of human-AI interactive generation [8]. While Guzdial uses this lens to frame PCG and GAI as essentially the same process, we view PCG as a broader term describing the process of content generation at its highest level, and GAI as a more descriptive term identifying a subset of generative practices that use specific technologies. Framing GAI as a subset of PCG—or using Guzdial's lens of human-AI interactive generation [8]—broadens the scope of understanding for both domains and allows the application of evaluation methods from PCG for analysis of GAI systems.

In particular, we are interested in leveraging expressive range analysis for evaluation of GAI systems [1]. Withington et al. list expressive range analysis as one of 12 "features" used for comparison of PCG systems [9]. We consider ERA a promising method of GAI analysis compared to alternative PCG-evaluation practices due to its flexibility in applying a broad range of analysis metrics to its evaluated systems, particularly highlighting its strength in identifying bias and system tendencies, including as influenced by user input [1]. Withington et al. identify weaknesses in current evaluation methods for PCG systems, suggesting they could be mitigated by diverse research frameworks and by promoting the reuse of methodology where possible [9]. This paper seeks to answer this call by broadening the applicability of ERA.

We seek to do so by presenting a taxonomic framework for categorizing approaches used for ERA, drawing from similar work in PCG such as Togelius et al.'s [10] and Smith's [2] taxonomies of PCG, as well as Withington et al.'s modern taxonomy of available evaluation approaches [9].

Further outlining the connection between GAI and PCG, we consider the motivations and use-cases for such systems. PCG and GAI are both used to automatically generate large amounts of content, and to increase the variety of content [11, 12]. PCG is also used for assistive tools, helping with tasks such as level creation [13], and support tools have emerged to make these systems easier to understand [14]. LLM applications like ChatGPT [4] and Stable Diffusion [5] are promoted for their potential to reduce labor costs, enable new business models, and increase access to content creation; Cook claims all of these as common motivations for studying AI in games [15]. We also acknowledge the use of machine learning in PCG [16, 17] as a point that highlights the connection between PCG and GAI.

We also recognize parallel claims of GAI and PCG capabilities—and shortcomings thereof—that may provide useful points of interest for continuing this investigation. Developing a useful PCG generator often takes just as much, or more, effort than hand-crafted alternatives [18], despite the allure of rapid content production. Furthermore, the variety content generators are able to produce can be limited, as players are often capable of identifying patterns between generated content [19]. While generative AI offers the potential to expand the type of content that can be generated—such as more complex generative audio and music used to increase the variety and interactivity of game music compared to non-generative methods [20]—as well as the range of users that can make use of generators, we also view the drawbacks of PCG as rich sites of investigation in GAI. ERA has proven a useful method for identifying such weaknesses in the domain of PCG, and we hope to see its effectiveness applied to GAI as well.

GAI technologies have prompted significant concern, e.g. regarding their social and environmental impact and the underlying politics of AI design, implementation, and advocacy [21]. Additional concerns include embedded systemic biases, risk of plagiarism and misinformation, and human and environmental costs [22]. Jiang et al. find that artists, in particular, identify harms to professional reputation, intellectual property, and financial risk [23]. Because of GAI's recognized potential for harm, we value tools and methods for understanding and analyzing these systems and their outputs.

3. The Problem Space

Summarizing the process of Expressive Range Analysis, this form of inquiry operates in three steps [1]:

1. Start with a set of images or similar data (e.g. levels, maps, etc.) produced by the generator being analyzed.
2. Determine useful metrics for the goal of analysis (e.g. the linearity or leniency of a level) and apply them to the data.
3. Produce a visualization of the metered data space.
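To ground these steps, the following is a minimal sketch of the pipeline in Python. The toy generator and both metric implementations are hypothetical stand-ins (ERA prescribes the three steps, not any particular generator or metrics), and the final plot follows the common practice of visualizing the metric space as a two-dimensional histogram [1].

```python
import numpy as np
import matplotlib.pyplot as plt

def generate_level(rng):
    """Hypothetical stand-in generator: a level as a 1D terrain height profile."""
    return np.cumsum(rng.integers(-1, 2, size=50))

def linearity(level):
    """R^2 of a least-squares line fit; 1.0 means perfectly linear geometry."""
    x = np.arange(len(level))
    slope, intercept = np.polyfit(x, level, 1)
    residual = level - (slope * x + intercept)
    total = np.sum((level - level.mean()) ** 2)
    return 1.0 - np.sum(residual ** 2) / total if total > 0 else 1.0

def leniency(level):
    """Toy proxy: the share of steps that do not descend."""
    return np.mean(np.diff(level) >= 0)

# Step 1: produce a large sample of generator output.
rng = np.random.default_rng(0)
levels = [generate_level(rng) for _ in range(1000)]

# Step 2: apply the chosen metrics to every artifact.
scores = np.array([[linearity(lv), leniency(lv)] for lv in levels])

# Step 3: visualize the metric space as a 2D histogram (the expressive range).
plt.hist2d(scores[:, 0], scores[:, 1], bins=20, range=[[0, 1], [0, 1]])
plt.xlabel("linearity")
plt.ylabel("leniency")
plt.title("Expressive range of the toy generator")
plt.show()
```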
Two challenges arise when considering this method for GAI generators as opposed to PCG generators. First, producing a set of data to analyze tends to be more complicated. GAI text-to-X applications require linguistic input, leading to larger variance of the input space in terms of both quantity and meaning. Further complicating the issue, PCG tools are typically custom-made for specific use-cases [18], whereas modern GAI tools (such as ChatGPT [4] or Google's Gemini [24]) are commonly presented as general-purpose. While random inputs could be provided in place of handcrafted ones, or the latent space could be randomly sampled, these approaches would produce undesirable outputs, lacking the targeted focus of a particular use-case. The general-purpose nature of GAI systems adds noise to the output space when looking at these tools for specific use-cases, since outputs that do not fit the specific use-case become irrelevant. Guzdial summarizes this aspect as GAI tool developers seeking to expand the possible valid outputs for a particular tool to include all possible valid outputs for every tool [8]. This noise presents additional challenges: how do you narrow the scope of the output to only match the relevant use-case? What prompts do you provide as input to produce a suitable set of data to categorize the generator as a whole? Guzdial's human-centered input alignment [8] may prove a useful attribute of consideration in responding to this challenge.

The second challenge that arises from considering ERA for use in analyzing GAI systems comes from the metrics of analysis. ERA can be used alongside any metrics, depending on what a generative system is being analyzed for. These metrics must be applied to large quantities of data, in some cases encompassing the entire generative output of a system [1]. Furthermore, many examples of ERA employ automatic processing of data, but not all data and research goals benefit from automatable metrics, which tend to be quantitative in nature. The challenge here is twofold: how do you determine semantically meaningful metrics for increasingly variable data? And, accounting for practical workload and feasibility, how do you apply these metrics to increasingly multitudinous data?

We treat producing an intended result as the motivation behind creating and using generators. While some generators offer promises of increased speed and efficiency, those claims are not always accurate [18]; despite this, generators are valued by many as useful tools. We recognize the purpose of ERA as evaluating the success of that motivation. Using ERA is akin to asking: has this generator successfully captured what it was designed to capture? With this framing in mind, our motivation for this paper is to facilitate the answering of that question in a greater variety of contexts. The challenge in the case of GAI is, in short: how?

4. Framework

Here, we present a taxonomy of ERA methods, seeking to provide language and framing to better address the challenges of applying ERA to a greater variety of contexts—especially GAI systems. We have defined five axes for characterizing ERA methodologies: quantitative vs qualitative, product vs process, objective vs subjective, content agnostic vs semantic, and automated vs manual.

We considered three factors as a basis for defining these axes: the origin and definition of ERA, the adopted practice of this method, and its potential for further expansion and refinement. We considered the origin of ERA [1], identifying decision points in how the method may be applied. We also considered how ERA has been adopted in research endeavors and sought to challenge assumptions that have become commonplace. Finally, this taxonomy is also a result of our efforts to address challenges we faced in applying ERA to GAI. We hope to further refine these axes and their definitions in future work.

4.1. Quantitative vs Qualitative

The metrics that are applied to the generator-produced data may be quantitative, qualitative, or varying degrees of mixed-method. Many traditional ERA approaches utilize quantitative metrics, as this approach is often more suitable for automatically applying metrics to data and producing a descriptive visualization. Smith and Whitehead use two metrics for applying ERA to Launchpad—linearity and leniency [1]. Both of these metrics are quantitative, because they are described, measured, and depicted numerically.

Interestingly, Kreminski et al. [25] identify determining quantitative metrics as an essential step of ERA, which is consistent with the examples of the method's initial presentation [1]; it is therefore unsurprising that qualitative metrics are relatively under-utilized in ERA. This taxonomy challenges this assumption—which is present across many ERA-focused research endeavors—instead suggesting an expansion of valid data collection efforts.

While there is not a strong foundation of qualitative metrics applied to ERA in PCG, some hypothetical qualitative metrics could come from methods like user surveys or interviews, such as those found in play-testing efforts to evaluate elements of a game like challenge or engagement. These metrics could be visually represented with tools such as word clouds, affinity charts, or heat-maps that highlight the frequency of thematic elements within the data, for example.
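As one hypothetical illustration of this qualitative direction, the sketch below tallies manually assigned thematic codes from play-test feedback gathered under two input conditions and renders the frequencies as a heat-map. Every condition, tag, and value here is invented for illustration rather than drawn from any existing study.

```python
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical coded play-test feedback: each free-text response has been
# manually tagged with thematic codes during qualitative analysis.
tagged = {
    "sparse prompt": ["confusing", "repetitive", "challenging", "repetitive"],
    "detailed prompt": ["engaging", "challenging", "engaging", "novel"],
}
themes = sorted({tag for tags in tagged.values() for tag in tags})

# Build a condition-by-theme frequency matrix.
freq = np.array([[Counter(tags)[t] for t in themes] for tags in tagged.values()])

# Heat-map of thematic frequency per input condition.
fig, ax = plt.subplots()
im = ax.imshow(freq, cmap="viridis")
ax.set_xticks(range(len(themes)), labels=themes, rotation=45, ha="right")
ax.set_yticks(range(len(tagged)), labels=list(tagged))
fig.colorbar(im, label="tag frequency")
fig.tight_layout()
plt.show()
```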
4.2. Product vs Process

Because our goal is to evaluate whether a generator is successful in capturing what it was designed to do, it is useful to consider both the product of a generator and the process of producing that output. The product of a generator is its output: an image, the answer to a question, or a video game level are all examples of this. Analyzing the product is useful for identifying what a generator is capable of—in terms of quality, variety, etc.—and what sorts of biases are present in the outputs. Smith and Whitehead [1] present a product-focused approach, as their linearity and leniency metrics consider only the levels produced by the generative system. Withington's [26] approach is also product-focused, since this work considers the differences between outputs of a generator rather than the process of producing those outputs.

The process entails the experience of producing that output. Common narratives of generative systems sell them as faster and more efficient than manual alternatives. Looking at the process allows us to evaluate those claims, using metrics such as the time it takes to get a desirable output or how many iterations of prompts/generations it takes to produce such an output. Kreminski et al. [25] arguably present a process-focused approach, because their motivation for analysis seeks to evaluate the usability experience of a generative system. However, the metrics used are primarily focused on the product of the generator. A more process-centric example would include metrics that evaluate procedural aspects such as the time it takes to generate an output, or the number of generative attempts before a user finds a suitable option.

Shaker et al. [11] highlight generator reliability as one important piece of PCG evaluation. Looking at this aspect provides a mixed-method approach that tends to use product to reveal something about the process.
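The process-centric metrics named above are straightforward to instrument. Below is a hedged sketch of one way to do so: it wraps a stand-in generator (the generate function and the acceptance test are placeholders, not any particular system's API) and records the attempt count and wall-clock time until an acceptable output appears. The resulting records can feed the same visualization step used for product metrics.

```python
import random
import time

def generate(prompt):
    """Placeholder generator call; swap in any real system here."""
    time.sleep(random.uniform(0.01, 0.05))  # simulate generation latency
    return {"quality": random.random()}

def attempts_until_accepted(prompt, accept, max_attempts=50):
    """Record process metrics: attempts and seconds until acceptance."""
    start = time.perf_counter()
    for attempt in range(1, max_attempts + 1):
        if accept(generate(prompt)):
            break
    return {"attempts": attempt, "seconds": time.perf_counter() - start}

# Collect process-side data points over many simulated sessions.
records = [
    attempts_until_accepted("a lenient level", accept=lambda o: o["quality"] > 0.8)
    for _ in range(100)
]
print(sum(r["attempts"] for r in records) / len(records), "mean attempts")
```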
4.3. Objective vs Subjective

This axis is concerned with the nature of the analysis metric: is a given metric verifiable based on factual evidence, or does it vary with perspective, based in emotion or opinion? Smith and Whitehead's two metrics for Launchpad again provide a useful example. Linearity, as described in their paper, is an objective metric that describes the factual "profile" of platforming levels (how well the geometry of the level fits a straight line) in the evaluated system [1]. Leniency, however, is identified as a subjective score, based on an intuitive sense of how lenient the components of a level are towards a player [1]. Subjective metrics are relatively underrepresented in PCG, but may prove useful for evaluating GAI systems, especially as they are applied in creative domains. We consider user experience analysis a useful point of reference for how subjective metrics may be used in data analysis and visualization [27, 28].
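One hypothetical way to operationalize a subjective metric, borrowing from user experience practice [27], is to collect per-artifact ratings from several raters and summarize both the central tendency and the disagreement between raters. The sketch below does this with invented Likert-scale leniency scores; none of the values come from a real study.

```python
import numpy as np

# Invented subjective ratings: four raters score each level's "leniency"
# on a 1-7 Likert scale.
ratings = np.array([
    [6, 5, 7, 6],  # level A
    [2, 4, 3, 2],  # level B
    [5, 5, 6, 4],  # level C
])

# A subjective metric is summarized across perspectives: the mean gives a
# usable per-level score, while the spread makes the subjectivity visible.
means = ratings.mean(axis=1)
spreads = ratings.std(axis=1)
for i, (m, s) in enumerate(zip(means, spreads)):
    print(f"level {chr(65 + i)}: mean leniency {m:.2f}, rater spread {s:.2f}")
```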
4.4. Content Agnostic vs Semantic

This axis is concerned with the data itself and the value sought from analysis. Semantic approaches seek to find meaning within the context of the inputs and outputs of the generator. Smith and Whitehead [1] present a semantic approach, as their metrics are intended to allow comparison of the generated content. Lucas and Volz [29] provide another example of a semantic approach, as theirs is also intended to compare generated content.

Content agnostic approaches seek to find meaning in the generator, regardless of its inputs or outputs—though the inputs and outputs may be useful for analysis. While Kreminski et al. [25] evaluate the product of the generator, their interests lie in the users of the generative system—answering questions such as how thoroughly they explore the generative range—rather than finding meaning from the generative output itself, making their approach content agnostic. Withington's exploration of quality-diversity algorithms [26] presents another content-agnostic example, because the focus is not on the details of specific outputs, but rather on measuring the differences between them.
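To make the semantic side of this axis concrete, the sketch below gives a simplified reading of Lucas and Volz's tile-pattern KL-divergence [29], comparing the distributions of small tile windows drawn from two level sets. The window size, smoothing constant, and toy levels are illustrative choices of ours, not the authors' published configuration.

```python
import math
from collections import Counter

def pattern_counts(level, w=2):
    """Count every w-by-w window of tiles in a level (a list of rows)."""
    counts = Counter()
    for r in range(len(level) - w + 1):
        for c in range(len(level[0]) - w + 1):
            counts[tuple(level[r + i][c:c + w] for i in range(w))] += 1
    return counts

def tile_pattern_kl(levels_p, levels_q, w=2, eps=1e-5):
    """KL(P || Q) between smoothed tile-pattern distributions of two level sets."""
    p, q = Counter(), Counter()
    for lv in levels_p:
        p.update(pattern_counts(lv, w))
    for lv in levels_q:
        q.update(pattern_counts(lv, w))
    support = set(p) | set(q)
    p_total = sum(p.values()) + eps * len(support)
    q_total = sum(q.values()) + eps * len(support)
    kl = 0.0
    for pattern in support:
        px = (p[pattern] + eps) / p_total
        qx = (q[pattern] + eps) / q_total
        kl += px * math.log(px / qx)
    return kl

# Toy levels as rows of characters: '-' is empty space, 'X' is solid ground.
flat = ["----", "XXXX"]
bumpy = ["--X-", "XXXX"]
print(tile_pattern_kl([flat], [bumpy]))  # small positive divergence
```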
4.5. Automated vs Manual

This axis is concerned with the process of applying metrics to the data. Automated processes are typically conducted computationally, making them especially suitable for quantitative analysis metrics and often desirable for the promise of reduced processing time or effort. Smith and Whitehead [1], Kreminski et al. [25], and Kybartas et al. [30], among others, all present automated approaches for applying ERA metrics to their data.

Manual processes require human labor for applying analysis metrics to each piece of data. While this tends to be more time-consuming, it also allows closer scrutiny of elements that are difficult to capture without direct human intervention. Manual processes are often unsuitable for the large quantities of data that ERA typically processes, but methods such as crowd-sourced photogrammetry—e.g. as used for identifying information about wildlife populations [31]—may hypothetically be leveraged to manually process such data. Human subject experiments, such as those commonly used in narrative generation projects [32], are another example of a manual approach.

5. Discussion

Framing GAI under the lens of PCG—or framing both as the same process, e.g. through the lens of human-AI interaction generation [8]—incorporates this new technology into an established domain of research that has ERA as a design-focused method for evaluation. In connecting GAI and PCG, though, we have identified a need for both an improved vocabulary for describing ERA and an expanded potential scope for ERA that challenges existing methods. This taxonomy thus further expands the potential range of analytical applications for ERA, providing language to better describe and imagine its use-cases and to identify its historical gaps. This expansion responds to existing weak points in PCG evaluation—such as those identified by Withington et al. [9]—by increasing the diversity and re-usability of ERA as a framework for generative analysis.

Further, this taxonomy allows for better description of the flexibility and space occupied by broadly applicable aspects of evaluation, such as those presented in Guzdial's human-AI interaction generation [8]. For example, Guzdial's call for human-centered input alignment considers the relationship between valid system inputs and user preferences regarding those inputs. Using the vocabulary from our taxonomy, this is a process-focused, content-agnostic, subjective approach, because meaning is found according to individual perceptions without concern for the generative output. Such a metric could be applied to data quantitatively or qualitatively, using either an automated or manual approach. Guzdial's adaptability similarly considers the process of generation and user perceptions [8], and has similar axis placement to human-centered input alignment—though it is objective rather than subjective, since adaptability was a predetermined aspect of the generative process, rather than a variable expressed by human interpretation. Novelty, however, is an objective, product-focused metric, because it is based in the observed possible generative outputs.

It is our hope that using the vocabulary of this taxonomy can provide some clarity to the research community on how they are using ERA for evaluation, as well as identify potential new approaches for ERA.

6. Conclusions, Limitations, & Future Work

This paper explores the intersection of PCG and GAI, using expressive range analysis as a common method for analyzing generative systems and their outputs. We present a taxonomic framework for categorizing existing and imagined ERA inquiries, hoping to allow more effective navigation of this space and to leverage PCG tools for the study of GAI technologies and systems.

This taxonomy opens interesting possibilities for future applications of ERA. What do qualitative applications of ERA look like? How would manually applying metrics to data compare to more commonly applied automated processes? These are relatively underexplored areas of ERA, and the possibility of applying this method to generative systems makes these questions more compelling.

The scope of this paper is limited to theory. This thread of research would benefit from a more complete, systematic review of ERA projects that places existing work on the axes of this taxonomy and further informs the chosen axes. As an extension of this research, we also see value in performing ERA on GAI tools using different combinations of axis placement—especially including qualitative and manual approaches.

Acknowledgments

This material is based upon work partially supported by the National Science Foundation (NSF) under Grant No. DGE-1922761. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.
References

[1] G. Smith, J. Whitehead, Analyzing the expressive range of a level generator, in: Proceedings of the 2010 Workshop on Procedural Content Generation in Games, 2010, pp. 1–7.
[2] G. Smith, Understanding procedural content generation: a design-centric analysis of the role of PCG in games, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2014, pp. 917–926.
[3] N. Deckers, J. Peters, M. Potthast, Manipulating embeddings of stable diffusion prompts, arXiv preprint arXiv:2308.12059 (2023).
[4] OpenAI, ChatGPT, https://chat.openai.com/, 2024.
[5] Stability AI, Stable Diffusion, https://stability.ai/stable-image, 2024.
[6] ElevenLabs, ElevenLabs, https://elevenlabs.io/, 2024.
[7] P. Korzynski, G. Mazurek, P. Krzypkowska, A. Kurasinski, Artificial intelligence prompt engineering as a new digital competence: Analysis of generative AI technologies such as ChatGPT, Entrepreneurial Business and Economics Review 11 (2023) 25–37.
[8] M. Guzdial, Human-AI interaction generation: A connective lens for generative AI and procedural content generation, in: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24), 2024.
[9] O. Withington, M. Cook, L. Tokarchuk, On the evaluation of procedural level generation systems, in: Proceedings of the 19th International Conference on the Foundations of Digital Games, 2024, pp. 1–10.
[10] J. Togelius, G. N. Yannakakis, K. O. Stanley, C. Browne, Search-based procedural content generation: A taxonomy and survey, IEEE Transactions on Computational Intelligence and AI in Games 3 (2011) 172–186.
[11] N. Shaker, J. Togelius, M. J. Nelson, Procedural content generation in games (2016).
[12] M. Hendrikx, S. Meijer, J. Van Der Velden, A. Iosup, Procedural content generation for games: A survey, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 9 (2013) 1–22.
[13] A. Liapis, G. N. Yannakakis, J. Togelius, Sentient Sketchbook: computer-assisted game level authoring (2013).
[14] M. Cook, J. Gow, G. Smith, S. Colton, Danesh: Interactive tools for understanding procedural content generators, IEEE Transactions on Games 14 (2021) 329–338.
[15] M. Cook, Optimists at heart: Why do we research game AI?, in: 2022 IEEE Conference on Games (CoG), IEEE, 2022, pp. 560–567.
[16] A. Summerville, Expanding expressive range: Evaluation methodologies for procedural content generation, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 14, 2018, pp. 116–122.
[17] A. Summerville, S. Snodgrass, M. Guzdial, C. Holmgård, A. K. Hoover, A. Isaksen, A. Nealen, J. Togelius, Procedural content generation via machine learning (PCGML), IEEE Transactions on Games 10 (2018) 257–270.
[18] T. Short, T. Adams, Procedural Generation in Game Design, CRC Press, 2017.
[19] E. Short, Bowls of Oatmeal and Text Generation, https://emshort.blog/2016/09/21/bowls-of-oatmeal-and-text-generation/, 2016.
[20] C. Plut, P. Pasquier, Generative music in video games: State of the art, challenges, and prospects, Entertainment Computing 33 (2020) 100337.
[21] E. M. Bender, T. Gebru, A. McMillan-Major, S. Shmitchell, On the dangers of stochastic parrots: Can language models be too big?, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Association for Computing Machinery, New York, NY, USA, 2021, pp. 610–623.
[22] J. E. Fischer, Generative AI considered harmful, in: Proceedings of the 5th International Conference on Conversational User Interfaces, Association for Computing Machinery, New York, NY, USA, 2023.
[23] H. H. Jiang, L. Brown, J. Cheng, M. Khan, A. Gupta, D. Workman, A. Hanna, J. Flowers, T. Gebru, AI art and its impact on artists, in: Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, Association for Computing Machinery, New York, NY, USA, 2023, pp. 363–374.
[24] Google, Gemini, https://gemini.google.com/, 2024.
[25] M. Kreminski, I. Karth, M. Mateas, N. Wardrip-Fruin, Evaluating mixed-initiative creative interfaces via expressive range coverage analysis, in: IUI Workshops, 2022, pp. 34–45.
[26] O. Withington, Illuminating Super Mario Bros: quality-diversity within platformer level generation, in: Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion, 2020, pp. 223–224.
[27] A. Bangor, P. T. Kortum, J. T. Miller, An empirical evaluation of the System Usability Scale, International Journal of Human–Computer Interaction 24 (2008) 574–594.
[28] S. Djamasbi, Eye tracking and web experience, AIS Transactions on Human-Computer Interaction 6 (2014) 37–54.
[29] S. M. Lucas, V. Volz, Tile pattern KL-divergence for analysing and evolving game levels, in: Proceedings of the Genetic and Evolutionary Computation Conference, 2019, pp. 170–178.
[30] B. A. Kybartas, C. Verbrugge, J. Lessard, Tension space analysis for emergent narrative, IEEE Transactions on Games 13 (2020) 146–159.
[31] S. A. Wood, P. W. Robinson, D. P. Costa, R. S. Beltran, Accuracy and precision of citizen scientist animal counts from drone imagery, PLoS ONE 16 (2021) e0244040.
[32] R. Sanghrajka, E. Lang, R. M. Young, Generating quest representations for narrative plans consisting of failed actions, in: Proceedings of the 16th International Conference on the Foundations of Digital Games, 2021, pp. 1–10.