<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on Artificial Intelligence and Creativity, Santiago de Compostela (Spain)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>The creative psychometric item generator: a framework for item generation and validation using large language models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonio Laverghetta Jr.</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Luchini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Averie Linnell</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roni Reiter-Palmer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roger Beaty</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Psychology, The Pennsylvania State University</institution>
          ,
          <addr-line>201 Old Main, University Park, Pennsylvania</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Psychology, University of Nebraska at Omaha</institution>
          ,
          <addr-line>6001 Dodge Street, Omaha, Nebraska</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>9</fpage>
      <lpage>24</lpage>
      <abstract>
        <p>Increasingly, large language models (LLMs) are being used to automate workplace processes requiring a high degree of creativity. While much prior work has examined the creativity of LLMs, there has been little research on whether they can generate valid creativity assessments for humans, despite the increasingly central role of creativity in modern economies. We develop a psychometrically inspired framework for creating test items (questions) for a classic free-response creativity test: the creative problem-solving (CPS) task. Our framework, the creative psychometric item generator (CPIG), uses a mixture of LLM-based item generators and evaluators to iteratively develop new prompts for writing CPS items, such that items from later iterations elicit more creative responses from test takers. We find strong empirical evidence that CPIG generates valid and reliable items and that this effect is not attributable to known biases in the evaluation process. Our findings have implications for employing LLMs to automatically generate valid and reliable creativity tests for humans and AI.</p>
      </abstract>
      <kwd-group>
        <kwd>automated item generation</kwd>
        <kwd>prompt engineering</kwd>
        <kwd>artificial intelligence</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Creativity is considered one of the primary factors that determine individual [2] and organizational
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] success in the modern economy. This is due to improved automation of routine tasks [4], the
increasing complexity and ambiguity of problems organizations face, and the projected growth of the
creative sectors of the economy [5]. As such, the development of validated creativity tests has
become increasingly important. Nevertheless, generating new creativity assessments remains a
resource-intensive process requiring many hours of trial and error to develop suitable items (questions). Such
items can be highly complex, requiring participants to reason about intricate scenarios or design
solutions to ambiguous problems [1], and therefore are difficult for even subject matter experts to develop.
      </p>
      <p>With the introduction of modern large language models (LLMs) [6, 7], the ability of AI to
automatically develop novel creativity tests appears increasingly plausible [8], and LLMs are already being
used to automatically generate items measuring a variety of cognitive skills [9, 10, 11]. Applying
similar ideas in creativity assessment could provide a method to generate valid and reliable creativity tests
at scale, which would be beneficial for assessing creativity in both humans and AI. However, doing so
may also be contentious for some, given the broader debate on whether AI can be creative. Despite
some evidence pointing towards AI creativity, whether AI-generated ideas are truly novel remains a
hotly debated topic [12, 13]. Some research suggests that using LLMs may lower the diversity of ideas
produced over time, resulting in reduced collective novelty [14, 15]. Public perception of the creativity
of AI also remains mixed; humans tend to view creative works produced by AI as less novel than those
produced by other humans [14], and this could be problematic if humans become aware that they are
being given AI-generated creativity tests. Broader research in social psychology has found that LLMs
produce highly similar responses to questions regarding political orientation, moral philosophy, and
other complex constructs that usually exhibit high variability in humans [16]. Collectively, these
results point to a diminished diversity of thought in LLMs, which has important implications for whether
and how LLMs should be used to automate creativity assessment.</p>
      <p>How can we employ LLMs in designing items for measuring creativity without compromising the
validity of any conclusions drawn from such items? We approach this from a psychometric perspective;
psychometrics is both a field dedicated to measuring psychological constructs in humans and the source of a
rich body of work measuring similar constructs in AI [17, 18, 19]. When measuring a construct like
creativity, psychometrics requires that any measurement be both valid and reliable: it must accurately
measure the intended construct and give consistent results over repeated measurements.
Accomplishing this involves developing tests whose items accurately measure the construct, which historically
was done by human experts. Can we use LLMs to generate high-quality items for measuring
creativity? If so, this would be invaluable not only for the study of human creativity, but it might also allow
us to measure creativity more accurately in LLMs, which would be a boon for assessing AI
creativity. Nevertheless, no prior work has investigated whether LLMs can automatically generate creativity
assessments.</p>
      <p>In this paper, we develop a framework to extend item generation into the creativity domain: the
creative psychometric item generator (CPIG). CPIG relies on structured prompting and psychometrically
based exemplar selection to generate items for the creative problem-solving (CPS) task, an influential
test of creativity [20]. Our framework is iterative and allows us to continuously refine the same item
based on automated validity metrics until reaching a desired level of quality. While other works have
explored how to use LLMs to solve [21] and generate [22] CPS-like items, none to our knowledge has
examined how to generate psychometrically rigorous assessments of creativity. We find that CPIG-generated
items are just as valid and reliable as those written by humans. Remarkably, LLM solutions
to CPIG items also appear to become more original over successive rounds of generation, suggesting a
possible method to boost the creativity of generative AI via carefully designed items.</p>
      <p>We make the following contributions:
1. We develop CPIG, a new framework for generating creativity items using LLMs.¹
2. Through a series of experiments, we confirm that CPIG-generated items are just as valid as those
written by humans, and that our metrics for validity are robust to known biases in the scoring
process.
¹Code and supplementary materials will be provided at: https://osf.io/umnk5/</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>Creativity is thought to comprise multiple facets, including originality (the novelty of an idea) and
effectiveness (how useful or relevant the idea is), among others [23]. Past work has demonstrated
that human judgments of originality are an effective predictor of the creativity of ideas [23]. As such,
the value of a creativity test rests on its capacity to elicit many original responses [24]. To measure
originality, researchers historically relied on human judgments performed by trained raters, a method
called the Consensual Assessment Technique (CAT) [25]. In the CAT, human raters are instructed to
read a series of ideas and assess their originality on a Likert scale. Although effective, human scoring
is not efficient, as the recruitment and training of human raters is often costly and prone to errors.
More recently, automated creativity assessment tools have been developed, including finetuning LLMs
to predict human creativity ratings [1]. Highly accurate models have been reported, often matching
or surpassing the agreement between human raters, which makes it practical to evaluate the quality
of creative responses at scale.</p>
      <p>From a psychometric perspective, measuring an individual’s creativity requires developing
structured tasks to evaluate how well they can produce ideas that are both original and high quality. We
focus on a CPS task as the basis for our experiments. In this task, a participant is given a scenario involving
a dilemma to be solved (e.g., a coworker’s roommate is causing problems at work, and it may put both
of their jobs at risk), and they must produce a creative solution to this dilemma [1]. Scenarios are
ambiguous by design, with many possible solutions, and reflect creative thinking in day-to-day settings.
We focus on this CPS task due to its popularity as a creativity test and the availability of automated
and psychometrically validated models for assessing the originality of CPS responses [1]. However,
because many creative tasks can be evaluated in terms of originality, our methods are extensible to
other tasks that can be automatically scored.</p>
    </sec>
    <sec id="sec-3">
      <title>3. The architecture of CPIG</title>
      <p>
        We take a psychometric approach to generating CPS items, inspired by recent work on automatically
generating psychometrically valid test items [11, 9, 1
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We use LLMs to act as item generators to write
the items, item response generators to create human-like solutions to the items, and item scorers to
score the originality of LLM responses using psychometrically validated metrics. We hypothesize that
originality in item responses provides a proxy for item quality: items with high quality should enable
more creative responses and will tend to elicit better originality scores on average than those that are
of lower quality. Optimizing for originality thus provides a way to generate higher quality items that
can better tap the creative potential of subjects. Figure 1 shows an overview of CPIG.
      </p>
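      <p>The interplay of item generator, item response generator, and item scorer can be sketched as follows. This is a minimal illustration, not the paper’s implementation: the three LLM roles are passed in as plain functions, and greedy top-k selection stands in for the full shot selection strategies of Section 3.3.</p>

```python
from statistics import mean

def cpig_iteration(generate_item, generate_responses, score_response,
                   word_lists, exemplars, k=4):
    """One CPIG round: write an item per word list (conditioned on the
    current exemplars), score synthetic responses to each item, and keep
    the k items with the highest mean originality as the next round's
    exemplars (greedy selection)."""
    items = [generate_item(words, exemplars) for words in word_lists]
    scored = [(mean(score_response(r) for r in generate_responses(item)), item)
              for item in items]
    scored.sort(reverse=True)  # highest mean originality first
    return [item for _, item in scored[:k]]
```

      <p>Running this repeatedly, feeding the returned exemplars into the next call, mirrors the iterative refinement loop.</p>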
      <sec id="sec-3-1">
        <title>3.1. Item generation</title>
        <p>Automatically generating valid CPS items is a non-trivial task, as the items must describe sufficiently
complex scenarios to allow a wide variety of responses while also being sufficiently ambiguous that no
single solution is canonically more “correct” than the others. Furthermore, we also want scenarios to
describe a wide range of situations to avoid generating an item pool revolving around a narrow range
of topics. We thus develop a multi-stage prompting method.²</p>
        <p>First, before any runs of CPIG, we prompt gpt-3.5-turbo to generate lists of words, where each
list contains three names, a place, and an action (e.g., “Mark”, “beach”, “Amy”, “Lucas”, “swimming”).
The goal behind this step is to make the item generation task more concrete; rather than prompting the
item generator LLMs to design scenarios without any additional context, we instead use the word lists
as criteria that must be satisfied (e.g., the final scenario must contain all the names from the word list).
This is meant to both simplify generation by breaking it down into multiple steps and help maximize
diversity in scenario content by using different word lists to ensure no two item generation prompts
are the same. We have gpt-3.5-turbo generate ten word lists at once to help eliminate redundant lists
and query the model five times to generate 50 lists in total. We set the max number of tokens to 2048
and the temperature to 1.0, leaving other parameters at their defaults. We use this process to generate
lists covering a wide variety of semantic content that we manually checked to confirm they obeyed
the specified format. We use these word lists throughout all trials of CPIG.
²All prompts used throughout CPIG are listed in the supplementary material.</p>
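        <p>As a shallow illustration of the format check on generated lists (the completion format and the parsing below are our assumptions; the paper verified list content manually):</p>

```python
def parse_word_lists(raw: str):
    """Split a model completion into candidate word lists, keeping only
    lines with exactly five comma-separated entries (three names, a
    place, and an action). Which entry plays which role is not checked
    here; that was done by manual review."""
    lists = []
    for line in raw.splitlines():
        entries = [w.strip().strip('"') for w in line.split(",") if w.strip()]
        if len(entries) == 5:
            lists.append(entries)
    return lists
```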
        <p>We use these word lists in the item generation prompt, where we instruct item generator LLMs to
design CPS items using the contents of the word list provided. We provide LLMs with generation
guidelines and examples of CPS items written by experts. For each trial, we attempt to generate one
scenario for each word list. However, the generated items may fail basic validity checks for a variety
of reasons, so to mitigate this, we develop a list of rules to drop generations that are likely low quality:
1. We compute item readability using Flesch’s reading ease [26] and drop scenarios with scores
lower than 45 (considered very difficult to read). We note that this metric requires a minimum
string length to compute, so we also require that scenarios be at least 140 tokens long. We use
the NLTK word tokenizer to ensure a consistent token count.³
2. From preliminary trials, we find that LLMs sometimes generate scenarios with priming effects,
steering participants toward specific solutions. Examples of this include generating a list of
possible solutions or setting up the scenario as a dichotomy (“Should I do X or Y?”). Based on
the content of such scenarios, we developed a list of strings that indicate possible priming and
drop scenarios that contain any such string. Specifically, we drop scenarios containing “on the
one hand,” “on the other hand,” “dilemma,” “must navigate,” “must decide,” “has to decide,” and “is
torn between.” We do not claim that this list is comprehensive, but we found that it eliminated
most priming in generated scenarios.
3. To prevent LLMs from generating irrelevant content after the scenario, we instruct them to
always generate “I am finished with this scenario.” at the end. We drop scenarios that lack this
string.</p>
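        <p>The three drop rules can be combined into a single filter, sketched below. The whitespace tokenizer and the naive syllable counter are rough stand-ins for the NLTK tokenizer and a full Flesch implementation used in the paper:</p>

```python
import re

# Priming strings from rule 2; any match drops the scenario.
PRIMING = ["on the one hand", "on the other hand", "dilemma", "must navigate",
           "must decide", "has to decide", "is torn between"]
TERMINATOR = "I am finished with this scenario."

def count_syllables(word):
    """Naive vowel-group count; a stand-in for a proper syllable counter."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch's formula: 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

def passes_validity_checks(scenario, min_tokens=140, min_readability=45.0):
    """Apply drop rules 1-3: termination string, length, priming, readability."""
    if not scenario.rstrip().endswith(TERMINATOR):   # rule 3
        return False
    body = scenario.replace(TERMINATOR, "")
    if len(body.split()) < min_tokens:               # rule 1, length floor
        return False
    if any(p in body.lower() for p in PRIMING):      # rule 2
        return False
    return flesch_reading_ease(body) >= min_readability  # rule 1
```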
        <p>Importantly, our goal behind this quality control was not to identify every possible error that might
occur in the items, as we expect human experts will make the final decision on which items to include
in a creativity assessment [9]. Rather, we use it to reduce the number of items that need to be examined
by eliminating those that are unlikely to be valid. We attempt to generate a scenario a maximum of 10
times for each word list and drop the list if the LLM fails to generate a valid scenario on all attempts.
We strip extra newlines and whitespace surrounding the scenario, as well as any text after the termination
string (including the string itself).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Item response generation</title>
        <p>Once we have LLM-generated items, we must evaluate whether they elicit creative responses. LLMs
have proven adept at modeling psychometric data [19] and are competent as human simulacra for
sociological modeling [27], so we use LLMs to generate synthetic responses to each item. A potential
challenge here is that the item response generator LLMs may suggest similar solutions to the same
item [14]. We account for this by adopting several prompting styles meant to increase the variation in
the LLM responses: a baseline prompt where the LLM is asked to provide a creative solution to the item
(with no further context), a demographic prompt where the LLM is provided demographic data about a
hypothetical participant that it is meant to simulate while responding (e.g., “You are a Hispanic woman
who works in real estate”), and a psychometric prompt where we replace the prior demographic data
with statements sourced from psychometric inventories strongly correlated with creative performance.</p>
        <p>For demographic and psychometric prompts, we construct a pool of participant creativity profiles to
draw from based on responses to prior creativity studies [1]. These responses include differing
occupations and responses to psychometric assessments, which we reason would increase the variability
in the output of the item response generator LLMs. We provide demographic data in the prompt using
either a variable format (e.g., “You are an Asian man”) or as demographically relevant names.
Demographic variables, including name, ethnicity, and gender, were taken from the New York City Health
Department 2016 census of baby names⁴, and last names specifically were taken from the Decennial
Census Survey⁵ from the United States Census Bureau. We selected the three most common first and
last names associated with each demographic variable for a total of 20 first names and 20 last names.
We extract data for the psychometric prompts from a series of validated scales measuring constructs
related to creativity. We employed scales tapping creative self-efficacy [28], creativity anxiety [29],
creative mindset [30], openness to experience [31], tolerance for ambiguity [32], cynicism [33], and
the RIASEC interest types [34].
³https://www.nltk.org/api/nltk.tokenize.word_tokenize</p>
        <p>In each prompting style, the model is provided a CPS item after the task instructions and
demographic/psychometric profile (if applicable), and we process the generated response by removing extra
newlines and whitespace. Because response generation is a much simpler task than
item generation, we do not include additional content validity checks. We generate between 10 and 20
responses for each item. For the demographic and psychometric prompts, we sample a participant
profile at random each time.</p>
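        <p>A sketch of how the three prompting styles might be assembled; the instruction wording and the two-profile pool below are illustrative placeholders, not the paper’s actual prompts or profile data:</p>

```python
import random

# Illustrative stand-ins; the paper samples real profiles from prior studies.
PROFILES = [
    {"demographic": "You are a Hispanic woman who works in real estate.",
     "psychometric": "You strongly agree that you trust your ability to solve problems creatively."},
    {"demographic": "You are an Asian man who works as a nurse.",
     "psychometric": "You somewhat agree that new situations make you anxious."},
]

def build_response_prompt(item, style, rng):
    """Assemble an item response prompt in one of the three styles from
    Section 3.2, sampling a participant profile at random when needed."""
    instructions = "Provide a creative solution to the following scenario."
    if style == "baseline":
        context = ""
    elif style in ("demographic", "psychometric"):
        context = rng.choice(PROFILES)[style] + "\n"
    else:
        raise ValueError(f"unknown style: {style}")
    return f"{instructions}\n{context}{item}"
```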
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Item scoring and selection</title>
        <p>
          Each LLM-generated item response is then scored using the methodology developed by [1], which
trained roberta-base [3
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to predict mean originality scores of responses to CPS items. Specifically,
this model was trained on a dataset annotated by experts for originality, who scored each response
using a five-point Likert scale. They used a test set comprising originality scores to CPS items not seen
during training and obtained a 0.41 Pearson correlation with human ratings. We use this model to
score the originality of each CPIG item, which we use to select 𝑘 items to include as exemplars in the
next round of item generation. We develop several shot selection strategies for choosing exemplars,
which we discuss below. Additionally, we include a baseline that simply chooses 𝑘 items at random.
3.3.1. Greedy
This approach simply selects the 𝑘 items with the highest originality scores. Specifically, we take the
mean of the originality scores of all the responses per item and sort the resulting scores to select the 𝑘
items with the highest scores.
3.3.2. Constraint satisfaction
A challenge with the greedy approach is that it may choose highly similar items if they all score high on
originality. Indeed, we found in preliminary trials that cosine similarity scores between all pairs of the 𝑘
items tend to increase over iterations, sometimes drastically. To address this, we develop another shot
selection method that instead finds a set of 𝑘 items that maximize originality and minimize similarity,
which we treat as a constraint satisfaction problem. For each iteration of CPIG, we have a set of
exemplars from the prior iteration⁶ with a mean originality score 𝑜 and a mean semantic similarity 𝑠
(the mean cosine similarity scores between all pairs of items). Additionally, we include thresholds 𝜖ₒ
and 𝜖ₛ that define a tolerance above 𝑜 and below 𝑠 for the new set of exemplars. We then search
for a set of size 𝑘 from the generated item pool at the current iteration, with mean originality 𝑜′ and
mean similarity 𝑠′, that satisfies:
𝑜′ > 𝑜 ∨ |𝑜 − 𝑜′| ≤ 𝜖ₒ (1)
𝑠′ < 𝑠 ∨ |𝑠 − 𝑠′| ≤ 𝜖ₛ (2)
⁶We still employ the greedy approach for the first iteration, as we don’t yet have values to compare against.
        </p>
        <p>We use Sentence Transformers [36] and all-MiniLM-L6-v2 to compute 𝑜′ and 𝑠′, and we search for
all matching sets across all unique combinations of size 𝑘 from the item pool. We return the set with the
highest originality score; further details on this method and the chosen values for 𝜖ₒ and 𝜖ₛ are provided in
the supplementary material.</p>
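        <p>A brute-force version of this search can be sketched as follows; the argument names mirror the notation above (𝑜, 𝑠, 𝜖ₒ, 𝜖ₛ), and a precomputed similarity matrix stands in for the Sentence Transformers embeddings:</p>

```python
from itertools import combinations
from statistics import mean

def select_exemplars(originality, sim, k, prev_o, prev_s, eps_o, eps_s):
    """Return the size-k item set with the highest mean originality o'
    whose mean pairwise similarity s' satisfies constraints (1) and (2):
        o' > prev_o  or  |prev_o - o'| <= eps_o
        s' < prev_s  or  |prev_s - s'| <= eps_s
    `originality[i]` is item i's mean originality; `sim[i][j]` is the
    cosine similarity between items i and j (k >= 2 assumed)."""
    best, best_o = None, float("-inf")
    for cand in combinations(range(len(originality)), k):
        o = mean(originality[i] for i in cand)
        s = mean(sim[i][j] for i, j in combinations(cand, 2))
        if (o > prev_o or abs(prev_o - o) <= eps_o) and \
           (s < prev_s or abs(prev_s - s) <= eps_s):
            if o > best_o:
                best, best_o = cand, o
    return best
```

        <p>Exhaustive enumeration is feasible here because the per-iteration item pool is small (at most one item per word list, i.e., 50) and 𝑘 = 4.</p>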
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Implementation details</title>
        <p>We implement CPIG using LangChain⁷ and utilize a variety of chat-based open-source and commercial
LLMs, including LLama-2 (7b, 13b, and 70b) [37], Vicuna-1.5 (7b and 13b) [38], and Claude-3-haiku.⁸
All open-source models are implemented using Transformers [39]. We set the temperature to 1.0 across
all trials to increase variation in the generated items and responses while leaving other text generation
parameters at their defaults. We select four items to use as exemplars for all shot selection methods to
ensure item generation prompts do not become too long and because we find this is sufficient to
ensure variation in item content. We cap item generation to a maximum of 768 tokens and item response
generation to 350 tokens, as responses to CPS items tend to be much shorter than the items themselves.
We run each CPIG trial for five iterations, using three random seeds for every hyperparameter
combination. We use the same LLM for item generation and item response generation for each open-source
model trial and use LLama-7b for response generation when using Claude-3-haiku for item
generation. We provide a table listing all trials in the supplementary materials. We run experiments on
three Nvidia RTX A6000 GPUs with 48GB of video memory each. We apply 4-bit quantization to all
supported models.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>We present a comprehensive picture of how effective the different components of CPIG are at
generating items that maximize the originality of the output from item response generator LLMs. This
includes both ablations on the effect of the different prompting strategies and shot selection methods,
as well as human review of the quality of the generated items. For any ablation that requires
computing semantic similarity, we use Sentence Transformers [36] and all-MiniLM-L6-v2 as the embedding
model. All density plots employ kernel density estimation [40].
⁷https://www.langchain.com/
⁸https://www.anthropic.com/news/claude-3-family</p>
      <sec id="sec-4-1">
        <title>4.1. Originality of LLM responses</title>
        <p>Figure 2 shows originality scores for all runs that do not use random shot selection, broken down by
model type. Critically, regardless of the item generator, CPIG consistently improves originality scores
of responses by the last round of item generation, in some cases more than doubling the score
compared to the first round. The difference in mean scores was significant in t-tests for both demographic
(𝑝 ≪ 0.001) and psychometric (𝑝 ≪ 0.001) prompting styles and hence holds regardless of the
specific prompting strategy used for item response generation. This demonstrates that CPIG-generated
items can elicit more creative responses from the item response generator LLMs. However, a potential
confound when scoring originality is that the metric is influenced by the length of the response, with
longer solutions typically being scored as more original [1]. We find that LLM responses are, on
average, much longer than those of humans, leaving open the possibility that the increase in originality is
driven purely by more elaboration in the response. We check for this by computing the Pearson
correlation between response length and originality for every generation model and the items generated
on the last round (not including random shot selection). Results are shown in Figure 3. As expected,
length is at least partially correlated with originality for all generation models, though there is significant
variation in the strength of this correlation. Importantly, however, the correlations remain weak
overall and do not rise above 0.3 in either direction for most LLMs, suggesting that the increases in
originality are not only due to increasing response length.</p>
        <p>
          (a) Distributions of originality scores, broken down by item response prompting strategy. As a point of
comparison, we also plot the originality scores of the human participants used to train the scoring model
from [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], but note that they are not given the same items generated by CPIG.
        </p>
        <p>(b) Cosine similarity scores between all pairs of items
from the last round of generation, for both greedy
shot selection and constraint satisfaction.</p>
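        <p>The length confound check from Section 4.1 amounts to a Pearson correlation between token counts and originality scores; a minimal version is sketched below, with a whitespace tokenizer as a simplification of the tokenization used in the paper:</p>

```python
from statistics import mean, pstdev

def length_originality_r(responses, scores):
    """Pearson r between response length (in whitespace tokens) and
    originality score."""
    lengths = [len(r.split()) for r in responses]
    mx, my = mean(lengths), mean(scores)
    cov = mean((x - mx) * (y - my) for x, y in zip(lengths, scores))
    return cov / (pstdev(lengths) * pstdev(scores))
```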
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Relationship between originality and similarity</title>
        <p>While improvements in response originality denote an increase in item quality, it remains unclear
whether the item generator LLMs converge onto a few similar yet high-quality scenarios, or how these
variables relate to each other in the generated item pool. We explore this by plotting a joint histogram
of originality and similarity scores⁹ for all generated items, broken down by shot selection method, in
Figure 4. Darker cells in this figure indicate a higher frequency of a particular originality-similarity
combination. We observe that random shot selection obtains the worst combination of results: not
only are most items low on originality, but the distribution also peaks the highest on similarity. Both
greedy shot selection and constraint satisfaction achieve lower similarity and higher originality, and do
so consistently. As the originality of items produced using these strategies increases, their similarity
scores remain generally static, indicating that improvements in originality do not come at the expense
of more redundant items.</p>
        <p>One notable trend is that greedy shot selection seems to have lower similarity scores on average,
despite constraint satisfaction being designed to minimize similarity. However, for this figure, we
dropped all items whose similarity to any other item is above 0.95 to make computing the joint
histogram more manageable. In Figure 5, we graph the univariate histogram of cosine similarity scores
for both greedy and constraint satisfaction, and this time, include all the items that are generated in the
last round. Although both methods generate some item pairs with cosine similarities of 1.0, there are
many more such items for greedy shot selection, indicating a much larger fraction of extremely similar
item content. Interestingly, greedy also peaks at a higher density than constraint satisfaction toward
the lower end of the distribution. This likely reflects the balancing act required for constraint
satisfaction; selecting items to maximize originality may sometimes require increases in similarity, though
the method still succeeds in eliminating most duplicate content.
⁹Measured as the mean cosine similarity between each item and every other item.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Effect of item response prompting style</title>
        <p>Humans typically exhibit high variability in the originality of their responses to CPS items [1]. The
different item response prompting strategies we develop are meant to induce a similar degree of variation,
and we examine how effective they are in Figure 5. Compared to the no-context baseline, where the
item response generator LLMs are simply instructed to answer the item, both demographic and
psychometric prompting strategies exhibit higher variance and heavier tails in the originality distribution,
better reflecting the trends from human participants. Both curves still have lower variance than
humans and much higher peaks in originality scores, so it appears there remains headroom for alignment
between LLM and human psychometric properties. The main challenge here again relates to
elaboration in the response; while human participants often give short solutions, LLMs tend to provide very
elaborate responses that embed multiple solutions simultaneously. Fully overcoming this challenge
requires more sophisticated prompting and perhaps additional finetuning on human responses to align
with our preferences for this task, but we leave this to future work.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Human content review</title>
        <p>The prior results demonstrate that, with carefully chosen prompts and few-shot exemplars, CPIG can
generate items that elicit more original responses from LLM test takers. But is this trend due to
improvements in item quality or some other artifact of the generation process? We explore this by recruiting
human annotators to rate the quality of the CPIG items.</p>
        <p>We recruited five annotators with prior experience in rating for creativity studies. Annotators rated
each item in terms of itcosmplexity anddificulty , where we define complexity as how manydemands
were present in the item and dificulty as how many of those demands directly compete with each
other, such that a solution that attempts to solve one might come at the expense of another. We define
demands as any relevant information in the scenario that could be used to construct a creative solution
Demands could include challenges to overcome in the scenario or resource constraints, among many
others. We selected these facets to cover the most important factors to rate to ensure content validit
in the items based on our expertise in creativity assessment and preliminary examinations of the items
generated byCPIG. Both facets were rated on a five-point Likert scale, with one being too simple/easy,
ifve being too complex/dificult, and three having the right amount of complexity/dificulty. This scale
allowed us to account for both extremes of item content; items that are too complex or dificult might
cause human participants to give up prematurely, while items that are too simplistic or easy are unlikely
to require much creativity to solve. We designed a rubric that annotators used to rate each item,
including definitions for complexity and dificulty. The annotators were first shown the rubric and
allowed to ask any questions they had about the task. Then, together with one of the authors, the
annotators rated ten practice items. Finally, the annotators, in combination with two of the authors,
rated the remaining items via a missing data approach, where annotators only rated a subset of the
CPIG items. This approach allowed us to achieve maximum coverage of all items while limiting rating time
and making the annotation workload manageable. Each annotator rated between 200 and 245
LLM-written items, including items from the first and last rounds of CPIG. Annotators were only provided
the text of each item and were blinded to all other related details. For instance, annotators were not
informed of which items belonged to which round of CPIG.</p>
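        <p>The missing data design described above can be sketched as a simple round-robin assignment. This is an illustrative reconstruction only, not the authors' actual procedure; the function name and the workload numbers are hypothetical, and the sketch assumes the number of raters per item does not exceed the annotator pool.</p>

```python
from itertools import cycle

def assign_ratings(items, annotators, raters_per_item):
    # Hypothetical sketch of a missing-data rating design: each item is
    # rated by raters_per_item distinct annotators, and cycling through
    # the annotator pool keeps every annotator's workload balanced.
    # Assumes raters_per_item <= len(annotators).
    turn = cycle(annotators)
    workload = {a: [] for a in annotators}
    for item in items:
        chosen = set()
        while len(chosen) < raters_per_item:
            chosen.add(next(turn))
        for annotator in chosen:
            workload[annotator].append(item)
    return workload
```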
        <p>We obtained intraclass correlations of 0.52 for complexity and 0.49 for difficulty, for absolute
agreement on the average ratings, indicating a modest rater agreement.10 We plot in Figure 6 the
distributions of complexity and difficulty scores for items from the first and last rounds. For complexity,
we see a definite improvement by round five, with a much larger fraction of items achieving the
ideal complexity level than was present in round one. Trends are more static for difficulty, as the
distributions are quite similar to each other, especially at the ideal difficulty level. Collectively, the content
review indicated that CPIG items are generally of high quality and that later iterations result in definite
improvements for at least some facets of item quality.</p>
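        <p>For readers unfamiliar with the agreement statistic above, an ICC(2, k) (two-way random effects, absolute agreement, average of k raters) can be computed as below. This is a generic plain-Python sketch assuming a fully crossed rating matrix; it is not the study's analysis pipeline, which handled missing data.</p>

```python
def icc2k(ratings):
    # ratings: one row per item, one column per rater (fully crossed).
    # Returns ICC(2, k): two-way random effects, absolute agreement,
    # reliability of the average of the k raters.
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    item_means = [sum(row) / k for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Mean squares from the two-way ANOVA decomposition.
    ms_items = k * sum((m - grand) ** 2 for m in item_means) / (n - 1)
    ms_raters = n * sum((m - grand) ** 2 for m in rater_means) / (k - 1)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_err = (ss_total - (n - 1) * ms_items - (k - 1) * ms_raters) / ((n - 1) * (k - 1))
    return (ms_items - ms_err) / (ms_items + (ms_raters - ms_err) / n)
```

Note that absolute agreement penalizes a constant offset between raters, which consistency-based ICC variants would ignore.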
        <p>We include two items generated by Llama-13b in Table 1, both using the same word list. While
even items generated in the first round exhibit many desirable qualities, we see key improvements
over iterations. Although the round one item (top row in the table) sets up what could be a complex
scenario, it remains unclear what the exact problem is other than that Noah is being asked to do
“extra work” for a customer. The round five scenario (bottom row) makes this clear: a new family
is causing problems by stealing plants. This scenario also introduces added complexity by including
new characters with interwoven relationships, hence adding more competing demands that need to be
considered. The scenario is still not perfect as not all the information appears especially relevant, but
overall, it does appear to be both more original and of higher quality.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Related work</title>
      <sec id="sec-5-1">
        <title>5.1. Psychometric AI</title>
        <p>Psychometric analysis of language models has seen growing interest in NLP research [11, 19, 41, 18,
42, 43]. Measurement models from psychometrics provide a strong test bed for evaluating language
understanding in LLMs [18], making psychometrics a valuable tool for building better NLP test sets.
However, LLMs are also valuable for modeling psychometric properties exhibited by humans on both
cognitive [19] and non-cognitive [10] assessments, spurring interest in how LLMs might model human
response data more broadly [44]. One rapidly growing research area is automated item generation,
where LLMs are used to create new test items for standardized assessments with little or no human
intervention [9, 11]. Several works have proposed frameworks similar to ours, where multiple LLMs are
used to iteratively generate and evaluate new test items [45, 17]. However, this research has focused
almost entirely on generating multiple-choice items, where the range of possible responses is inherently
restricted.
10 This was expected, as rating creativity can be highly subjective, so it is challenging to achieve stronger rater agreement.
Additionally, the constructs targeted by such frameworks are either purely cognitive (with
an objectively correct answer) or non-cognitive (open to interpretation based on individual differences).
Creativity does not neatly fit into either mold: there is an aspect of “correctness” when judging CPS
responses, as the goal is to present a viable solution, yet how solutions are compared against each other
in terms of originality is often open to rater interpretation [46]. Our work thus moves psychometric
AI in a new direction to examine constructs outside the narrow scope explored in prior work.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Prompt engineering for psychometric assessment</title>
        <p>
          An often-overlooked aspect of AI-based test development is prompt engineering: the process of
developing prompts for LLMs that yield strong performance on the task of interest. Many studies rely
on manual prompt tuning to adapt LLMs to a specific cognitive or psychometric task, which has
allowed for the successful replication of many classic results from cognitive psychology [47] and has
yielded high-quality items for various assessments [10]. A typical design pattern for such prompts is
to use a format that aligns closely with how the actual task is presented to humans as if to simulate an
experimental session44[]. However, greater care must be taken in the prompt design than might be
necessary for other applications, as LLMs appear susceptible to more biases in task instructions than
humans [48]. A starting point for addressing this could be to employ methods for prompt optimization,
which have been widely successful in improving the performance of LLMs on NLP tasks [49]. These
techniques, while powerful, typically rely on information-theoretic metrics for assessing prompt
quality, often resulting in uninterpretable prompts [50]. A few works have explored how to create prompt
optimization methods employing psychometrics as optimization targets by combining LLM item
generators with discriminative models trained to predict item alignment with a target construct [45], or
by incorporating standard metrics for reliability and validity to assess the quality of an LLM’s
generations [
          <xref ref-type="bibr" rid="ref1">1, 17</xref>
          ]. Even in these cases, the prompt itself usually remains static. CPIG provides a structured
method for prompt mutation via the selection of exemplars that demonstrate evidence of validity on
the task of interest.
        </p>
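        <p>A minimal sketch of this kind of exemplar-driven prompt mutation is given below. All names (mutate_prompt, scored_items) are hypothetical, and the plain top-k selection by score stands in for CPIG's actual exemplar selection, which uses evidence of validity on the task of interest.</p>

```python
def mutate_prompt(base_instruction, scored_items, n_exemplars=3):
    # Rebuild the generation prompt around the highest-scoring items.
    # scored_items: list of (item_text, score) pairs, e.g. originality
    # scores from an automated scoring model. Illustrative only.
    best = sorted(scored_items, key=lambda pair: pair[1], reverse=True)[:n_exemplars]
    exemplars = "\n\n".join(f"Example:\n{text}" for text, _ in best)
    return f"{base_instruction}\n\n{exemplars}\n\nNew item:"
```

Each generation round can then re-rank the pool and rebuild the prompt, so the exemplars drift toward whatever the scorer rewards.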
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We propose CPIG, a framework for generating creativity items using LLMs. By combining
state-of-the-art models for response scoring with methods for item generation, we find that CPIG can generate items
that improve the originality of LLM responses over time, which in turn points to increased creativity in
their solutions. This trend is not attributable to known biases in the scoring model, and human raters
find CPIG items to be of high quality.</p>
      <p>While our results are promising, our analysis also has limitations. In developing CPIG, we focused
primarily on originality as the metric to optimize. While originality is a crucial facet of creativity, it
is just one metric for judging creative outputs. Depending on the context, other metrics, such as an
output’s quality or relevance, may be more important to evaluate, and future work should extend our
framework to optimize multiple criteria simultaneously. The quality of the generated items depends
directly on the item evaluation, which was accomplished through automated scoring that, while
effective, is not without limitations [1]. Developing more robust evaluations requires layering multiple
quality control checks on top of each other, perhaps by employing separate LLM judges to rate the
quality of the items directly and provide structured feedback on how to improve the items. Though
we performed a content review on the CPIG items, it remains unclear how effective they would be
when administered to human participants without conducting further studies. As such, we
caution against using the items from CPIG until they have undergone more extensive review. Finally,
we must acknowledge biases in the LLMs, which may have influenced item generation. The data for
our scoring model was curated using raters from a Western background [1], making the possibility of
bias even more likely. Addressing this requires curating originality scores representing a more diverse
slate of cultural views and developing bias mitigation strategies during item generation to ensure the
evaluation remains fair.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The research described herein was sponsored by the U.S. Army Research Institute for the Behavioral
and Social Sciences, Department of the Army (Contract No. W911NF-23-C-0040 P00001). The views
expressed in this article are those of the authors and do not reflect the official policy or position of the
Department of the Army, DoD, or the U.S. Government.</p>
      <p>[7] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.
[8] J. Rafner, R. E. Beaty, J. C. Kaufman, T. Lubart, J. Sherson, Creativity in the age of generative ai, Nature Human Behaviour 7 (2023) 1836–1838.
[9] A. A. von Davier, A. Runge, Y. Park, Y. Attali, J. Church, G. LaFlair, The item factory, Machine Learning, Natural Language Processing, and Psychometrics (2024) 1.
[10] P. Lee, S. Fyfe, M. Son, Z. Jia, Z. Yao, A paradigm shift from “human writing” to “machine generation” in personality test development: An application of state-of-the-art natural language processing, Journal of Business and Psychology 38 (2023) 163–190.
[11] A. Laverghetta Jr., J. Licato, Generating better items for cognitive assessments using large language models, in: E. Kochmar, J. Burstein, A. Horbach, R. Laarmann-Quante, N. Madnani, A. Tack, V. Yaneva, Z. Yuan, T. Zesch (Eds.), Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 414–428.
[12] S. Saebø, H. Brovold, On the stochastics of human and artificial creativity, arXiv preprint arXiv:2403.06996 (2024).
[13] G. Franceschelli, M. Musolesi, On the creativity of large language models, arXiv preprint arXiv:2304.00008 (2023).
[14] B. R. Anderson, J. H. Shah, M. Kreminski, Homogenization effects of large language models on human creative ideation, arXiv preprint arXiv:2402.01536 (2024).
[15] A. R. Doshi, O. Hauser, Generative artificial intelligence enhances creativity, Available at SSRN (2023).
[16] P. S. Park, P. Schoenegger, C. Zhu, Diminished diversity-of-thought in a standard large language model, Behavior Research Methods (2024) 1–17.
[17] Y. Attali, A. Runge, G. T. LaFlair, K. Yancey, S. Goodwin, Y. Park, A. A. von Davier, The interactive reading task: Transformer-based automatic item generation, Frontiers in Artificial Intelligence 5 (2022) 903077.
[18] C. Vania, P. M. Htut, W. Huang, D. Mungra, R. Y. Pang, J. Phang, H. Liu, K. Cho, S. Bowman, Comparing test sets with item response theory, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 1141–1158.
[19] A. Laverghetta Jr, A. Nighojkar, J. Mirzakhalov, J. Licato, Can transformer language models predict psychometric properties?, in: Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics, 2021, pp. 12–25.
[20] R. Reiter-Palmon, M. Y. Illies, L. Kobe Cross, C. Buboltz, T. Nimps, Creativity and domain specificity: The effect of task type on multiple indexes of creative problem-solving., Psychology of Aesthetics, Creativity, and the Arts 3 (2009) 73.
[21] S. R. Rick, G. Giacomelli, H. Wen, R. J. Laubacher, N. Taubenslag, J. L. Heyman, M. S. Knicker, Y. Jeddi, H. Maier, S. Dwyer, et al., Supermind ideator: Exploring generative ai to support creative problem-solving, arXiv preprint arXiv:2311.01937 (2023).
[22] Y. Tian, A. Ravichander, L. Qin, R. L. Bras, R. Marjieh, N. Peng, Y. Choi, T. L. Griffiths, F. Brahman, Macgyver: Are large language models creative problem solvers?, arXiv preprint arXiv:2311.09682 (2023).
[23] J. Diedrich, M. Benedek, E. Jauk, A. C. Neubauer, Are creative ideas novel and useful?, Psychology of Aesthetics, Creativity, and the Arts 9 (2015) 35.
[24] M. A. Runco, G. J. Jaeger, The standard definition of creativity, Creativity Research Journal 24 (2012) 92–96.
[25] P. J. Silvia, B. P. Winterstein, J. T. Willse, C. M. Barona, J. T. Cram, K. I. Hess, J. L. Martinez, C. A. Richard, Assessing creativity with divergent thinking tasks: exploring the reliability and validity of new subjective scoring methods., Psychology of Aesthetics, Creativity, and the Arts 2 (2008) 68.
[26] J. Kincaid, R. P. Fishburne Jr, R. L. Rogers, B. S. Chissom, Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel, Naval Technical Training Command, Millington TN, Research Branch (1975).
[27] S. Sun, E. Lee, D. Nan, X. Zhao, W. Lee, B. J. Jansen, J. H. Kim, Random silicon sampling: Simulating human sub-population opinion using a large language model based on group-level demographic information, arXiv preprint arXiv:2402.18144 (2024).
[28] M. Karwowski, Did curiosity kill the cat? relationship between trait curiosity, creative self-efficacy and creative personal identity, Europe’s Journal of Psychology 8 (2012) 547–558.
[29] R. J. Daker, R. A. Cortes, I. M. Lyons, A. E. Green, Creativity anxiety: Evidence for anxiety that is specific to creative thinking, from stem to the arts., Journal of Experimental Psychology: General 149 (2020) 42.
[30] M. Karwowski, Creative mindsets: Measurement, correlates, consequences., Psychology of Aesthetics, Creativity, and the Arts 8 (2014) 62.
[31] C. G. DeYoung, L. C. Quilty, J. B. Peterson, J. R. Gray, Openness to experience, intellect, and cognitive ability, Journal of Personality Assessment 96 (2014) 46–52.
[32] A. Furnham, T. Ribchester, Tolerance of ambiguity: A review of the concept, its measurement and applications, Current Psychology 14 (1995) 179–199.
[33] K. S. Mitchell, R. Reiter-Palmon, Malevolent creativity: personality, process, and the larger creativity field, in: Creativity and Morality, Elsevier, 2023, pp. 47–68.
[34] P. I. Armstrong, S. X. Day, J. P. McVay, J. Rounds, Holland’s riasec model as an integrative framework for individual differences., Journal of Counseling Psychology 55 (2008) 1.
[35] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[36] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019. URL: http://arxiv.org/abs/1908.10084.
[37] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023).
[38] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, E. P. Xing, Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023.
[39] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Q. Liu, D. Schlangen (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45.
[40] E. Parzen, On estimation of a probability density function and mode, The Annals of Mathematical Statistics 33 (1962) 1065–1076.
[41] A. Laverghetta Jr, A. Nighojkar, J. Mirzakhalov, J. Licato, Predicting human psychometric properties using computational language models, in: The Annual Meeting of the Psychometric Society, Springer, 2021, pp. 151–169.
[42] Y. Li, Y. Huang, H. Wang, X. Zhang, J. Zou, L. Sun, Quantifying ai psychology: A psychometrics benchmark for large language models, arXiv preprint arXiv:2406.17675 (2024).
[43] J. He-Yueya, W. A. Ma, K. Gandhi, B. W. Domingue, E. Brunskill, N. D. Goodman, Psychometric alignment: Capturing human knowledge distributions via language models, arXiv preprint arXiv:2407.15645 (2024).
[44] M. Tavast, A. Kunnari, P. Hämäläinen, Language models can generate human-like self-reports of emotion, in: 27th International Conference on Intelligent User Interfaces, 2022, pp. 69–72.
[45] I. Hernandez, W. Nie, The ai-ip: Minimizing the guesswork of personality scale item development through artificial intelligence, Personnel Psychology 76 (2023) 1011–1035.
[46] M. Benedek, C. Mühlmann, E. Jauk, A. C. Neubauer, Assessment of divergent thinking by means of the subjective top-scoring method: Effects of the number of top-ideas and time-on-task on reliability and validity., Psychology of Aesthetics, Creativity, and the Arts 7 (2013) 341.
[47] A. Ushio, L. Espinosa Anke, S. Schockaert, J. Camacho-Collados, BERT is to NLP what AlexNet is to CV: Can pre-trained language models identify analogies?, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 3609–3624.
[48] A. Gupta, X. Song, G. Anumanchipalli, Investigating the applicability of self-assessment tests for personality measurement of large language models, arXiv preprint arXiv:2309.08163 (2023).
[49] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, J. Ba, Large language models are human-level prompt engineers, arXiv preprint arXiv:2211.01910 (2022).
[50] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Computing Surveys 55 (2023) 1–35.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Luchini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. T.</given-names>
            <surname>Maliakkal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. V.</given-names>
            <surname>DiStefano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Patterson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Beaty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Reiter-Palmon</surname>
          </string-name>
          ,
          <article-title>Automatic scoring of creative problem-solving with large language models: A comparison of originality and quality ratings (</article-title>
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Makó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Illéssy</surname>
          </string-name>
          , Automation, creativity, and
          <article-title>the future of work in europe: A comparison between the old and new member states with a special focus on hungary</article-title>
          ,
          <source>MTA Társadalomtudományi Kutatóközpont Kisebbsegkutató Intézet</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Tsegaye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <article-title>The antecedent impact of culture and economic growth on nationscreativity and innovation capability</article-title>
          ,
          <source>Creativity Research Journal</source>
          <volume>31</volume>
          (
          <year>2019</year>
          )
          <fpage>215</fpage>
          -
          <lpage>222</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Manyika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Miremadi</surname>
          </string-name>
          ,
          <article-title>Four fundamentals of workplace automation</article-title>
          ,
          <source>McKinsey Quarterly</source>
          <volume>29</volume>
          (
          <year>2015</year>
          )
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Amabile</surname>
          </string-name>
          , Creativity,
          <source>artificial intelligence, and a world of surprises, Academy of Management Discoveries</source>
          <volume>6</volume>
          (
          <year>2020</year>
          )
          <fpage>351</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hudson</surname>
          </string-name>
          , E. Adeli,
          <string-name>
            <given-names>R.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Arora</surname>
          </string-name>
          , S. von Arx,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bohg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosselut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Brunskill</surname>
          </string-name>
          , et al.,
          <article-title>On the opportunities and risks of foundation models</article-title>
          ,
          <source>arXiv preprint arXiv:2108.07258</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          , P. Shyam,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>