=Paper=
{{Paper
|id=Vol-3878/36_main_long
|storemode=property
|title=Comparing Large Language Models Verbal Creativity to Human Verbal Creativity
|pdfUrl=https://ceur-ws.org/Vol-3878/36_main_long.pdf
|volume=Vol-3878
|authors=Anca Dinu,Andra Florescu
|dblpUrl=https://dblp.org/rec/conf/clic-it/DinuF24
}}
==Comparing Large Language Models Verbal Creativity to Human Verbal Creativity==
<pdf width="1500px">https://ceur-ws.org/Vol-3878/36_main_long.pdf</pdf>
<pre>
                                Comparing Large Language Models verbal creativity to
                                human verbal creativity
                                Anca Dinu1,*,†, Andra Maria Florescu1,*,†
                                1 University of Bucharest, Șoseaua Panduri 90, Sector 5, Bucharest, 050663, Romania


                                                 Abstract
                                                 This study investigates the verbal creativity differences and similarities between Large Language Models and humans, based
                                                 on their answers given to the integrated verbal creativity test in [1]. Since this article reported a very small difference of
                                                 scores in favour of the machines, the aim of the present work is to thoroughly analyse the data through four methods: scoring
                                                 the uniqueness of the answers of one human or one machine compared to all the others, semantic similarity clustering, binary
                                                 classification and manual inspection of the data. The results showed that humans and machines are on a par in terms of
                                                 uniqueness scores, that humans and machines group in two well defined clusters based on semantics similarities between
                                                 documents comprising all the answers of an individual (human or machine), per tasks and overall, and that the separate
                                                 answers can be automatically classified in human answers and LLM answers with traditional machine learning methods, with
                                                 F1 scores ranging from 68 to 74. The manual analysis supported the insight gained from the automated methods in that LLMs
                                                 behave human-like while performing creativity tasks, but there are still some important distinctive features to tell them apart.

                                                 Keywords
                                                 creativity assessment, LLM creativity, verbal creativity, semantic similarity clustering


                                1. Introduction                                                                    so on. A good survey on LLMs’ verbal creativity is [8].
                                                                                                                   Since work on LLMs creativity is just at the beginning,
                                Creativity has made it possible for humanity to survive                            there is a need for methods, resources, and evaluation
                                and develop since prehistoric times. Despite the per-                              to better understand LLMs’ creative abilities and their
                                ception that some people are more creative than others,                            differences and similarities with human creative traits.
                                many psychologists argue that everyone has the capacity                               In a recent article, [1] designed a verbal creativity test,
                                for creativity or that creativity is innate and encoded in                         integrating a wide range of tasks and criteria inspired
                                human nature [2].                                                                  from psychological creativity testing, and administered
                                   Creativity is inherently interdisciplinary, involving do-                       it to both humans and LLMs. The scope of this paper
                                mains like psychology, cognitive sciences, philosophy,                             is to analyze the answers given by LLMs and human re-
                                arts, engineering, mathematics, or computer science. Re-                           spondents to this previous study, for a direct comparison
                                cently, it has become a field of interest in GenerativeAI                          of human and machine verbal creativity. To this end, we
                                (GenAI) [3] in general, and in particular, in Large Lan-                           will compute uniqueness scores, cluster the individual
                                guage Models (LLMs) [4].                                                           answers per task and overall, perform supervised binary
                                   However, much of the current research in genera-                                classification with classic machine learning methods on
                                tive models [5] is concerned with constraining them so                             all answers and manually analyze some of the data par-
                                they do not harm people, so they are well-behaved, fac-                            ticularities.
                                tual, non-hallucinating, non-biased, non-negative, non-
                                misleading, non-toxic, etc., and for a good reason. In con-
                                trast, fewer studies (see section 2) focus on encouraging 2. Theoretical background and
                                them to be original, unconstrained, or creative, although                                                    previous work
                                computational creativity, as a research field, dates back
                                to the late ’90s [6], [7] with various disciplines including The formal study of creativity and of its mechanisms and
                                creative writing, music, or graphics, utilizing artificial processes started with J.P. Guilford’s plea for creativity
                                intelligence, particularly neural networks, heuristics, and in the 1950s [9]. Since then, thousands of articles
                                CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, and books have been published on different aspects of
                                Dec 04 — 06, 2024, Pisa, Italy                                                                         creativity [10].
                                *
                                  Corresponding author.                                                                                   Creativity is a notoriously hard-to-define notion, be-
                                †
                                  These authors contributed equally.                                                                   cause it is trans-disciplinary, branched in a variety of
                                $ anca.dinu@lls.unibuc.ro (A. Dinu);                                                                   domains. It can also be of many kinds like verbal, graph-
                                andra-maria.florescu@s.unibuc.ro (A. M. Florescu)                                                      ical, musical, or kinetic creativity. While the last three
                                 0000-0002-4611-3516 (A. Dinu); 0009-0007-1949-9867
                                (A. M. Florescu)
                                                                                                                                       kinds of creativity are related to arts, verbal creativity is
                                          © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License the most general kind, expressing the overall creativity
                                          Attribution 4.0 International (CC BY 4.0).


CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073
of ideas.                                                              5. the Consequences, for which one should guess the
   Regardless of the domain perspective and of the kind                   effects of a specified situation , and
of creativity, a basic idea in defining it, common to most             6. Divergent Association (DAT), where the respon-
of the definitions, is that creativity represents the ability             dent has to produce seven nouns that are max-
of an individual to come up with something original                       imally semantically different, in all their senses
or innovative, of good quality, and appropriate, based                    and uses.
on prior knowledge [11]. One can be creative, but lack
                                                                In [1], ten LLMs and ten humans were tested on this
appropriateness of the idea or artifact produced, hence
                                                             verbal creativity test, including the six tasks above. The
diminishing its quality in terms of creativity.
                                                             authors stated that their goal was to test the creativity
   Another related aspect of creativity, as stated by [12], is
                                                             of the selected LLMs in their default architecture, and,
represented by two types of thinking during the creative
                                                             thus, they did not change any settings that could have
process:
                                                             modified the creativity level, such as temperature or top-
         • divergent thinking, which concentrates on the nu- K. The collected answers given to this test are the input
           merous ideas appearing during a creative task, data for the present article.
           and

         • convergent thinking, which restricts them to the        3. Analysis
           only best-fitted or appropriate ones. So, even if
           an idea or artifact might seem creative from a          Creativity assessment is usually performed with human
           divergent perspective if it is unreasonable to the      evaluators who take into account the four creativity cri-
           point of being completely unrelated to the initial      teria formulated by [9, 12]:
           creativity task to begin with, the overall creativity       1. originality: uniqueness of the creative answers,
           level drastically diminishes.                               2. flexibility: how semantically distant the answers
                                                                          are,
   With the recent rise of generative models like LLMs
                                                                       3. elaboration: how detailed are the answers, and
such as Chat GPT1 or Copilot, the interest in compu-
                                                                       4. fluency: how many answers are given.
tational creativity peaked, in an attempt to harvest the
creative potential of the machines, in spite of many chal-            [1] automatically evaluated the verbal creativity by
lenges such as safety, ethical problems, methodological            using the Open Creativity Scoring with AI (OCSAI) tool
norms, evaluating standards, etc.                                  [16], an open-source software that uses traditional seman-
   Previous studies on machine creativity are fragmented:          tic distance and fine-tuned GPT for scoring the creativity
some are task-specific, like, for instance, using just role-       between the prompt and the answer. The results showed
plays[13], or just storytelling [14], while others focus           a slightly better score of the overall verbal creativity, com-
on just one LLM [4], or just on one type of creativity             puted as the mean of the scores for all the 6 tasks, for
assessment [15].                                                   the machines, with a value of 0.58, compared to humans,
   In this study, we address this research gap by                  with 0.51. Given that the difference is of just 0.07,
analyzing the creative responses to a wide range of tasks, of a    one of our goals for this study is to analyze more in-depth
considerable number of LLMs, from [1], who proposed                the differences and similarities of the answers of humans
a comprehensive assessment benchmark for testing the               and machines to the verbal creativity test, looking specif-
verbal creativity of both LLMs and humans, alike. It               ically for distinctive features, rather than raw scores. The
consists of six tasks, inspired from human psychology:             ten selected LLMs from the previous study were accessed
        1. Alternative Uses (AUT), where the test taker is         via: HuggingChat2 (LLAma-3-70B, Mixtral-8x7B3 ), via
           asked to come up with uncommon uses for an              Hugging Space 4 (Cohere- c4ai-command-r-plus, Yichat-
           ordinary object,                                        34B), locally (Falcon through GPT4All5 ), or directly from
                                                                   their web pages (Copilot(Balanced Mode) 6 ), Gemini-free
        2. Instances, for which the aim is to name as many
                                                                   version7 , Jais-30B8 , Youchat from You.com-Smart mode9 ,
           things as one can think of that have a common
                                                                   Character AI (Character Assistant10 ).
           feature,
                                                                   2
        3. the Similarities, which consists of stating as many       https://huggingface.co/chat/models/
                                                                   3
           as possible commonalities of two specified ob-            No longer supported
                                                                   4
                                                                     https://huggingface.co/spaces
           jects,                                                  5
                                                                     https://gpt4all.io/index.html
        4. the Causes, where the aim is to guess the cause         6
                                                                     https://www.bing.com/chat?form=NTPCHB
                                                                   7
           of a given situation,                                     https://gemini.google.com/app
                                                                   8
                                                                     https://auth.arabic-gpt.ai/
1                                                                  9
    https://help.openai.com/en/articles/                             https://you.com/?chatMode=default
                                                                   10
    6825453-chatgpt-release-notes                                     https://c.ai/c/YntB_ZeqRq2l_aVf2gWDCZl4oBttQzDvhj9cXafWcF8
   The humans were non-native fluent English speakers
who responded to the verbal creativity test as volunteers,
either in a lab or at their homes by completing a Google
Form. Their background was all academic, from students,
undergraduates, graduates and professors, the average
age being 26.
   We implemented all the experiments in Google Colab11
and we have used three LLMs to assist us with the codes:
Claude12 , Copilot13 and Gemini14 , in a setting of mostly
zero-shot prompt engineering, with the standard settings
and parameters.
   For data analysis, we used Python and the following         Figure 1: Ranking of uniqueness scores for humans and
libraries: spaCy15 , Scikit-learn16 , Matplotlib17 , NumPy18 , machines
and Pandas19 .

3.1. Data                                                   since one of their goals was to evaluate the answers fully
The database of verbal creativity answers contains 4530     automatically. Nevertheless, the uniqueness of the
answers, totalling 13714 words. The test was organized in   answers of an individual constitutes an important clue to
6 tasks. Five out of the six tasks have five items each and their creativity. Hence, to better understand the unique-
a maximum of 10 answers per item. An answer can have ness trait of both humans and machines, we computed
a maximum of 5 words. The sixth task, DAT, consists         uniqueness scores as follows.
only of one item of 10 single-words answers, but only the      We grouped the creativity test answers of both hu-
most semantically different 7 out of the ten given by the   mans   and machines in separate files, each containing all
respondents were taken into account by the DAT web          the answers     of a particular individual. We thus obtained
      20
page . That amounts to 2570 answers for the machines,       20 answer     files, 10 for humans and 10 for LLMs. After
which responded always with the maximum number of           removing     the  stop words, we generated embeddings for
answers, 10, even if the instruction was the same for both  each  file, and  then  we  computed their pairwise semantic
humans and machines to give between 1 and 10 answers        similarity, using the spaCy library. The uniqueness scores
per task. The human respondents gave any number of were obtained as the inverse of the average semantic sim-
answers in the range 1 to 10, obtaining thus 1960 human ilarity scores between an individual and all the others.
answers. As such, the database is unbalanced, with          The ranking obtained in decreasing order of uniqueness
almost a third more machine answers compared to             is depicted in figure 1, where one can see that the
human answers.                                              humans (in green) and the machines (in red) are mostly
                                                            intermingling.
                                                               This uniform distribution of humans and machines
3.2. Uniqueness scores for the answers of in terms of uniqueness scores shows that humans and
       humans and machines to the verbal                    machines are on a par in this respect.
      creativity test
One of the criteria for assessing creativity in psychology    3.3. Semantic similarity clustering of the
is the degree of originality of the answers of one individ-        answers of humans and machines
ual, compared to the answers of all the other individuals.    The aim of this experiment was to investigate if individual
The evaluation of this criterion is done manually and         humans and individual machines cluster together, based
is time-consuming, since it includes assessing not only       on semantic similarity of their answers to the creativity
word similarities, but also similarities between ideas of     test. We used the word embedding of the 20 individual
the different individuals. [1] did not use this criterion,    files described in subsection 3.2. To reduce the dimension-
11
   https://colab.research.google.com/                         ality of the vector space for the 2D plot, we used Principal
12
   https://claude.ai/chat/                                    Component Analysis (PCA), from the spaCy library.
13
   https://www.microsoft.com/en-us/microsoft-copilot             In figure 2 we can see how the LLMs (dots in red)
14
   https://gemini.google.com/app/                             perfectly cluster together, just as the humans (dots in
15
   https://spacy.io/
16
   https://scikit-learn.org/stable/                           green) do, considering all responses to the six tasks. This
17
   https://matplotlib.org/                                    result indicates that from a semantic perspective, humans
18
   https://numpy.org/                                         and LLMs generate creative answers differently, or at
19
   https://pandas.pydata.org/                                 least that there are discriminating features to distinguish
20
   https://www.datcreativity.com/
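
The uniqueness computation described in subsection 3.2 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that one embedding vector per respondent file is already available (for example a spaCy document vector), that the pairwise similarity is cosine similarity, and that "inverse" means 1 / mean similarity rather than 1 - mean similarity. The function name `uniqueness_scores` and the toy vectors are hypothetical.

```python
import numpy as np

def uniqueness_scores(vectors):
    """Return 1 / (mean cosine similarity to every other respondent).

    Assumes each mean similarity is positive, which holds for the toy
    vectors below but is not guaranteed for arbitrary embeddings.
    """
    X = np.asarray(vectors, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = unit @ unit.T                          # pairwise cosine similarities
    # Drop each respondent's similarity with itself before averaging
    mean_sim = (sim.sum(axis=1) - np.diag(sim)) / (len(X) - 1)
    return 1.0 / mean_sim

# Hypothetical stand-ins for the 20 per-respondent document vectors
toy_vectors = {
    "Human A": [1.0, 0.0],
    "Human B": [1.0, 0.0],
    "LLM A": [0.6, 0.8],
}
scores = uniqueness_scores(list(toy_vectors.values()))
# Ranking in decreasing order of uniqueness
ranking = sorted(zip(toy_vectors, scores), key=lambda pair: pair[1], reverse=True)
```

In this toy case the respondent whose vector points away from the other two gets the highest score, which mirrors the observation in subsection 3.5 that a high uniqueness score reflects distance from the group, not necessarily quality.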
Figure 2: Semantic similarity clusters of answers for all tasks
Figure 3: Semantic similarity clusters of answers for Alternative Uses
Figure 4: Semantic similarity clusters of answers for Instances
Figure 5: Semantic similarity clusters of answers for Similarities
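
The 2D projection used for the cluster plots in subsection 3.3 can be sketched as below. This is an illustrative sketch under stated assumptions: PCA is implemented here directly with NumPy's singular value decomposition instead of a library PCA class, the function name `pca_2d` is hypothetical, and the toy vectors stand in for the 20 per-respondent embeddings.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Hypothetical embeddings: two respondents per group, separated on one axis
toy_embeddings = [
    [0.0, 0.0, 0.0],    # human
    [0.0, 1.0, 0.0],    # human
    [10.0, 0.0, 0.0],   # LLM
    [10.0, 1.0, 0.0],   # LLM
]
coords = pca_2d(toy_embeddings)   # shape (4, 2), one 2D point per respondent
```

The resulting `coords` array can be passed to a scatter plot, colouring each point by respondent type. The sign of each principal axis is arbitrary, so only the relative positions of the points carry meaning.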


between the two.                                            In this binary classification experiment, we investigated
   We also plotted the clusters per answers to a specific   if they also have distinctive features at the answer level.
task, for all the 6 tasks, in figures 3, 4, 5, 6, 7, and 8. Gen-
                                                            For this, we trained several traditional machine learning
erally, the answers of the humans and of the machines       (ML) classifiers to discriminate between the answers of
clearly clustered by their kind, with the exception of the  humans and of LLMs to the verbal creativity test. The
task Instances, where the humans and the LLMs were          two classes were represented by all the answers of the
interposed, meaning that the semantic content of their      humans and, separately, by all the answers of the LLMs,
answers was not specific to any of the two classes. A       with one answer per line, excluding the DAT task, since
bit of mixing appeared also in Divergent Association Task   it only required enumerating words. As the LLMs al-
(DAT). The not so clear separation of humans and ma-        ways gave the maximum number of answers required in
chines for Instances and DAT tasks might result from the    the test, the dataset was unbalanced (2500 answers for
fact that the responses to these particular tasks are inher-LLMs and 1890 for humans). To address this problem of
ently very short, of just one or two words for Instances    unbalanced dataset, we implemented a simple random
task and of just one word for the DAT.                      under-sampling technique, thus obtaining 1890 answers
                                                            for each class, humans and LLMs. We then employed the
3.4. Binary classification of human and                     Term Frequency-Inverse Document Frequency (TF-IDF)
                                                            vectorization technique to convert the text data into nu-
      machine creativity answers                            merical features. The vectorizer used a maximum of 1000
As the clusterization experiment suggested, the answers features, for capturing all important aspects and dealing
to the verbal creativity test are almost linearly separable with computational complexity. Stratified sampling was
in two classes (humans and machines) at individual level. used to ensure a dataset split for an 80/20 training and
Table 1
Binary classification scores

                         SVM                        NaïveBayes                    RandomForest
              Prec.   Rec.   F1     Acc.     Prec.   Rec.   F1     Acc.     Prec.   Rec.   F1     Acc.
  Humans      0.78    0.60   0.68   0.71     0.70    0.83   0.76   0.74     0.67    0.80   0.73   0.71
  LLMs        0.67    0.83   0.74   0.71     0.79    0.65   0.71   0.74     0.76    0.61   0.68   0.71
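
The classification pipeline described in subsection 3.4 (random under-sampling, TF-IDF with at most 1000 features, stratified 80/20 split) can be sketched as follows. This is a minimal sketch, not the authors' implementation: the choice of `MultinomialNB` as the Naïve Bayes variant, the seed, the helper names `undersample` and `train_and_evaluate`, and the toy answers are all assumptions made for illustration. The vectorizer is fitted on the training split only, which is a design choice of this sketch and may differ from the original experiment.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def undersample(human, llm, seed=0):
    """Randomly shrink the larger class to the size of the smaller one."""
    rng = random.Random(seed)
    n = min(len(human), len(llm))
    return rng.sample(human, n), rng.sample(llm, n)

def train_and_evaluate(human, llm, seed=0):
    human, llm = undersample(human, llm, seed)
    texts = human + llm
    labels = ["human"] * len(human) + ["llm"] * len(llm)
    # Stratified 80/20 split keeps both classes equally represented
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    vectorizer = TfidfVectorizer(max_features=1000)
    classifier = MultinomialNB()
    classifier.fit(vectorizer.fit_transform(x_train), y_train)
    predictions = classifier.predict(vectorizer.transform(x_test))
    return accuracy_score(y_test, predictions), len(x_train), len(x_test)

# Invented toy answers, one per list element (not taken from the dataset)
toy_human = [f"use it as a toy {i}" for i in range(10)]
toy_llm = [f"innovative modular solution {i}" for i in range(15)]
accuracy, n_train, n_test = train_and_evaluate(toy_human, toy_llm)
```

With the sizes reported in the text, under-sampling to 1890 answers per class and splitting 80/20 gives 1512 training and 378 test answers per class.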


Figure 6: Semantic similarity clusters of answers for Causes    Figure 8: Semantic similarity clusters of answers for DAT


                                                                ilarity between the answers of humans and machines
                                                                that prevents the models from learning to discriminate better
                                                                between human and machine answers. Further experiments
                                                                are needed to see whether enlarging the dataset or
                                                                experimenting with SOTA transformers raises the
                                                                performance considerably.

                                                                3.5. General considerations
                                                               We manually inspected the first two most unique LLMs
                                                               and humans to see what makes their answers so differ-
                                                               ent from the others but also investigated the uniqueness
                                                               scores correlation with the quality and creativity.
Figure 7: Semantic similarity clusters of answers for             The first positioned on the uniqueness ranking, the
Consequences                                                   LLM Jais, had the tendency to respond to the Similarity
                                                               task with words obtained by nominalization (deriving
                                                               nouns from verbs), like, for instance, "dependency", "cu-
testing ratio. Thus, training and testing sets contained       riosity", "belonging", and "growth", as opposed to all the
the same number of samples for each category, e.g. 1512        other LLMs, which responded with regular nouns. It also
answers for training, and 378 answers for testing.             tended to use answers that started with the same prefix:
   In table 1, we give the best three classifier methods,      “Unfiltered”, "Unmatched", "Unrestricted", and "Unyield-
with precision, recall, accuracies, and F1 scores. The         ing", and to use the same word followed by other words,
NaïveBayes classifier obtained the highest accuracy, of        like in, for instance, "Thought policing", "Thoughtful
0.74, followed closely (0.03 lower) by both the Support        shopping", and "Thought clones". In this respect, Jais
Vector Machine (SVM) classifier and the Random Forest          gave the most unique answers, which, obviously, were
classifier, with an accuracy of 0.71.                          not also the most creative.
   This moderate performance of the ML models sug-                The second positioned on the uniqueness ranking, Hu-
gests that either the dataset is too small for the models      man 3, started the majority of their answers with "use"
to perform better, or that there is a fair amount of sim-      or "use it as". This respondent also repeated the starting
point of most of their answers, like in "what...", "getting     4. Ethical considerations
a ...", "where ...", "in a...". These features seem enough to
score highly w.r.t. uniqueness, but fail to correlate with      We did not use or disclose any personal data from the hu-
the quality of the creativity.                                  man participants, who remained completely anonymous
   This inspection shows that the most unique answers           and took part in this research as volunteers. There are no
are not necessarily the most creative. If the bulk of the       ethical concerns with regard to publishing this research.
respondents give good-quality answers, that might re-
sult in a high uniqueness score for lower-quality or less
creative responses.
                                                                5. Limitations
   We also checked the appropriateness of the answers           The dataset for this research was small and slightly unbal-
given by both humans and machines, which is an im-              anced since the humans answered based on their mood
portant requirement of genuine creativity, as mentioned         or capabilities, while the LLMs answered strictly with a
in section 1. Creativity requires divergent thinking, but       maximum of ten answers per task.
true creativity emerges when convergent thinking also              Also, the sample pool is quite small, as there were only
restricts the divergence to only those responses that are       ten humans and ten LLMs involved, so the results might
appropriate for the creative assignment [12].                   be unstable when enlarging the dataset.
   In general, humans gave fairly suitable answers. In-            Due to lack of space, this study focuses more on au-
stead, not all the LLMs managed to generate all the an-         tomated methods of analysis, than on manual analysis,
swers in an appropriate manner. For instance, for the           thus lacking a more in-depth insight into the patterns of
Consequences task, for the item "There is a virus and only      the collected answers to the verbal creativity test from
children survive", Gemini, although it responded creatively,    both humans and machines.
failed to also respond suitably. This model gave four              Finally, this study compares the creativity answers
out of the ten answers that are either paradoxical, or          of humans and LLMs in English, but the human partici-
non-sensical, in a situation that clearly implies that only     pants to the test were non-native (fluent) English speak-
children are alive, so there are no adults around: "Toy Fac-    ers, which can potentially decrease their creativity score,
tories booming", "Geriatric Theme Parks", "Grandparents         compared to scores they could obtain in their own native
raise parents", "Parents taught by Tablets".                    language.
   Another manual scrutiny focused on analyzing the
similar or the different patterns of LLMs and humans
when responding to a particular task. We found that             6. Conclusion and future works
several LLMs answered the Divergent Association Task
with the same word among the seven required ones.               This study showed that there are some differences be-
For instance, "Serendipity" was used by three models.           tween human and machine answers given to a verbal
This phenomenon is not specific only to the machines.           creativity test, but also plenty of similarities.
For the Guessing Causes Task, Human 3 and Human                    The LLMs’ answers vary much like the humans’ answers.
4 produced similar answers, like, for instance, both            Individual, unique answers, w.r.t. the set of
gave the answer "earthquake", or produced the same              all answers are produced by both humans and machines
idea, like "green lights"/"because of green lights", "eating    alike, with no noticeable difference.
something bad"/"they ate something bad", "St Patrick’s             Still, at a semantic level, humans and machines gener-
Day"/"St. Patrick’s day party", "poor construction"/"faulty     ally group together as individuals.
structural integrity", "looking at screens too much"/"too          The performance of automatic classification between
much screen time".                                              human and machine answers is moderate and leaves
   Also, we noticed some peculiarities of individual LLMs,      room for improvement.
such as Falcon’s generation of only words starting with            The general findings of this study indicate that LLMs’
the letter "a" for DAT, or Cohere’s generation of only op-      creative capabilities are comparable with human abilities
posite words for this task: "love", "hate", "peace”, "chaos".   and, as such, they could be put to good use in the creative
   Moreover, humans seem more personally involved in            domain. Humans "just" need to adapt to their usage, mind
answering than LLMs, which tend to give only general            the ethics and safety issues, and discern the information
answers to the tasks, with some exceptions. Some LLMs           at every step, instead of blindly using them.
seem to respond "humanly", even producing humor and                In future work, we will focus on expanding the dataset,
figurative speech, while others only respond quite stan-        by adding more LLMs’ and humans’ answers to the test,
dard or "robotic".                                              for a better statistical coverage.
   Overall, the LLMs’s distribution is similar with the            Also, we aim to manually investigate more in-depth
humans’ distribution, varying from one individual to            the database, to look for more systematic patterns for
another.                                                        both humans and machines.
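   As a rough illustration of the uniqueness comparison summarized above, the following
Python sketch computes, for each individual, the share of answers that no other individual
gave. It is not the scoring procedure used in this study: the function names, the
exact-match normalization, and the toy answer lists (the grouping of quoted answers per
individual is invented) are assumptions made for illustration only.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase an answer and collapse whitespace (exact-match assumption)."""
    return " ".join(answer.lower().split())


def uniqueness_scores(answers_by_individual: dict[str, list[str]]) -> dict[str, float]:
    """Return, per individual, the fraction of distinct answers nobody else gave."""
    normalized = {
        who: {normalize(a) for a in answers}
        for who, answers in answers_by_individual.items()
    }
    # Number of individuals that produced each normalized answer.
    owners = Counter(a for answers in normalized.values() for a in answers)
    return {
        who: (sum(owners[a] == 1 for a in answers) / len(answers)) if answers else 0.0
        for who, answers in normalized.items()
    }


# Hypothetical toy data: which individual gave which answer is invented here.
toy_answers = {
    "model_a": ["Serendipity", "love", "hate"],
    "model_b": ["serendipity", "peace", "chaos"],
    "human_x": ["earthquake", "poor construction"],
}
scores = uniqueness_scores(toy_answers)
```

   Exact matching treats paraphrases such as "green lights" and "because of green lights"
as different answers, so a semantic similarity measure would be needed to capture answers
that share the same idea.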
   As creativity remains a domain with endless possibilities, we also plan to investigate
other aspects of LLMs’ creativity, such as language or image.
   Another future approach worthy of pursuing is using Deep Learning approaches instead of
traditional Machine Learning approaches for the binary classification task, or using
metrics specific to LLM-generated tasks.

7. Appendix Verbal Creativity Test

There are 6 types of creativity assessments in this test.
Note: Be as creative, original, and innovative as possible. Pay attention to the word and
answer limit! Try to think of as many answers as possible within the limit!

   1. Alternative uses Test
   Name up to ten unusual uses for the following five items. Use a maximum of five words.
   Give one answer per line.
     1. Lipstick
     2. Avocado
     3. Whistle
     4. Chalk
     5. Pantyhose

   2. Instances
   Use a maximum of five words per answer. Give one answer per line. Name up to 10 things
   that:
     1. Things that can harm one’s self-esteem
     2. Things that you have control of in your life
     3. Situations where it is good to be loud
     4. Things that can flow
     5. Things that you can mark on a map

   3. Similarities
   How are the following 2 terms alike? Use a maximum of three words to describe a common
   feature of the following pair of words. Give one answer per line. Give up to ten answers:
     1. Prison & School
     2. Eyes & Ears
     3. House & Den
     4. Earthquake & Tornado
     5. Baby & Cub

   4. Causes
     1. Crash of a building
     2. Everybody turns green at a party
     3. Social media disappears
     4. Humanity becomes shortsighted
     5. Your hat does not fit you anymore

   5. Consequences
     1. There is a mutation and men are the ones giving birth
     2. There is a virus and only children survive
     3. People can read each other’s thoughts
     4. You wake up as your child self
     5. AI replaces teachers and professors

   6. Divergent Association Task (DAT)
   Write ten words that are as different from each other as possible, in all meanings and
   uses of the words.
   Rules: Only single words in English. Only nouns (e.g., things, objects, concepts). No
   proper nouns (e.g., no specific people or places). No specialized vocabulary (e.g., no
   technical terms). Think of the words on your own (e.g., do not just look at objects in
   your surroundings).

Acknowledgments

This work was supported by a mobility project of the Romanian Ministry of Research,
Innovation and Digitization, CNCS - UEFISCDI, project number PN-IV-P2-2.2-MC-2024-0589,
within PNCDI IV.

References

 [1] A. Dinu, A. M. Florescu, An integrated benchmark for
     verbal creativity testing of llms and humans, in:
     Proceedings of the 28th International Conference
     on Knowledge-Based and Intelligent Information &
     Engineering Systems (KES 2024), 2024. Accepted.
 [2] M. Csikszentmihalyi, Creativity: Flow and the
     Psychology of Discovery and Invention, first ed.,
     HarperCollins Publishers, New York, NY, 1996.
 [3] A. R. Doshi, O. Hauser, Generative artificial
     intelligence enhances creativity but reduces the
     diversity of novel content, Science Advances 10 (2023)
     eadn5290. URL: https://ssrn.com/abstract=4535536.
     doi:10.2139/ssrn.4535536.
 [4] E. E. Guzik, C. Byrge, C. Gilde, The originality of
     machines: Ai takes the torrance test, Journal of
     Creativity 33 (2023) 100065. URL:
     https://www.sciencedirect.com/science/article/pii/S2713374523000249.
     doi:10.1016/j.yjoc.2023.100065.
 [5] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei,
     I. Sutskever, Language models are unsupervised
     multitask learners, 2019.
 [6] M. Boden, The Creative Mind: Myths and Mechanisms,
     Routledge, 2004.
 [7] N. Anantrasirichai, D. Bull, Artificial intelligence
     in the creative industries: a review, Artificial
     Intelligence Review 55 (2021) 589–656.
 [8] X. Jiang, Y. Tian, F. Hua, C. Xu, Y. Wang, J. Guo, A
     survey on large language model hallucination via a
     creativity perspective, 2024. arXiv:2402.06647.
 [9] J. P. Guilford, Creativity, American Psychologist (1950).
[10] E. Carayannis (Ed.), Encyclopedia of Creativity,
     Invention, Innovation and Entrepreneurship,
     Springer International Publishing, 2013.
[11] J. Kaufman, R. Sternberg (Eds.), The Cambridge
     Handbook of Creativity, Cambridge Handbooks in
     Psychology, Cambridge University Press, 2010.
[12] J. P. Guilford, The nature of human intelligence,
     McGraw-Hill series in psychology, McGraw-Hill,
     New York, 1967.
[13] Y. Zhao, R. Zhang, W. Li, D. Huang, J. Guo, S. Peng,
     Y. Hao, Y. Wen, X. Hu, Z. Du, Q. Guo, L. Li, Y. Chen,
     Assessing and understanding creativity in large
     language models, 2024. arXiv:2401.12491.
[14] T. Chakrabarty, P. Laban, D. Agarwal, S. Muresan,
     C.-S. Wu, Art or artifice? large language
     models and the false promise of creativity, 2024.
     arXiv:2309.14556.
[15] D. Cropley, Is artificial intelligence more creative
     than humans?: ChatGPT and the divergent
     association task, Learning Letters 2 (2023) 13.
     URL: https://learningletters.org/index.php/learn/article/view/13.
     doi:10.59453/ll.v2.13.
[16] P. Organisciak, S. Acar, D. Dumas, K. Berthiaume,
     Beyond semantic distance: Automated scoring of
     divergent thinking greatly improves with large
     language models, Thinking Skills and Creativity 49
     (2023) 101356. URL:
     https://www.sciencedirect.com/science/article/pii/S1871187123001256.
     doi:10.1016/j.tsc.2023.101356.

</pre>