AI Writers and Critics: An Exploratory Study on Creative Content Generation and Evaluation by Large Language Models

Shraddha Vijay Pawar*,†, Savita Bhat†, Ganesh Prasath and Shirish Karande
TCS Research

CREAI 2024 - Workshop on Artificial Intelligence and Creativity
* Corresponding author.
† These authors contributed equally.
shraddhavijay.pawar@tcs.com (S. V. Pawar); savita.bhat@tcs.com (S. Bhat); ganesh.prasathr@tcs.com (G. Prasath); shirish.karande@tcs.com (S. Karande)
© Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Recently, large language models (LLMs) have demonstrated promising potential in creative writing tasks, including stories and poems. However, existing studies have often focused on limited tasks and have not fully explored the capabilities of these models across a diverse range of creative writing forms. This paper presents a comprehensive analysis of the performance of 11 LLMs across various creative writing tasks, including poem writing, blog creation, ad copy generation, short movie scriptwriting, and news article production. Our research aims to evaluate whether larger models consistently produce superior content or whether smaller models can achieve comparable results. We examine different prompting styles and multi-agent frameworks to understand their impact on output quality. Using detailed rubrics based on a defined set of criteria (innovation, adherence, coherence, expressiveness, conformity, and diversity), we systematically assess the generated content. The study highlights the effectiveness of using multiple LLMs as evaluators to enhance the reliability of content evaluation. Our findings provide insights into the capabilities and limitations of LLMs in creative tasks, suggesting avenues for future research to improve their creative potential and evaluation methodologies.

Keywords
Large Language Models, Machine Creativity, Generative Artificial Intelligence, Creative Writing Tasks, AI-Generated Content, LLM-Based Evaluations, Content Quality Assessment

1. Introduction
Content creation spans various textual forms, each demanding a unique approach and skill set, from news articles and academic papers to blogs, songs, poetry, and storytelling [1] [2]. Poetry uses aesthetic and rhythmic language to convey deeper meanings, while story writing focuses on plot and character development. Songwriting blends lyrics with music for an auditory experience, and ad copy aims to persuade using rhetorical devices. Blogs serve personal expression or information sharing, scriptwriting crafts visual storytelling, dialogue writing hones conversational skills, and news writing informs with factual reporting [3] [4]. These diverse forms enrich our cultural landscape and support various professions, requiring a nuanced understanding of language, audience, and purpose.

The evolution of AI and NLP technologies, particularly advancements in LLMs like GPT-3 and GPT-4, has significantly impacted content writing, enabling these models to generate coherent and contextually relevant text and making them powerful tools for diverse applications. To understand the writing quality produced by LLMs, our research considers various writing tasks, each requiring specific skills. We develop criteria common to these tasks and use them to design comprehensive rubrics for each problem statement to assess the generated content.
By evaluating 11 diverse models of various sizes, we explore the capabilities of both large and small models. Our objective is to determine whether larger models consistently produce superior-quality content or whether smaller models can achieve comparable results. Additionally, we examine different prompting styles and multi-agent frameworks to understand how prompt engineering affects the LLMs' output quality. This methodology systematically determines which models excel in particular tasks and how various prompting approaches influence content generation effectiveness.

2. Related work
LLMs have been increasingly utilized in creative writing tasks, with notable studies demonstrating their potential and limitations. For instance, [5] showcased the beneficial role of AI in translation and reviewing to enhance writing. However, their focus was limited to generating drafts based on user-provided plots in short fiction and non-fiction writing tasks using ChatGPT-3.5. This narrow scope underscores the need for broader exploration across various creative writing tasks, a gap that our research addresses. Similarly, [6] conducted a comparative study of GPT-3 with other state-of-the-art models such as KGGPT2, HINT, PROGEN, and MTCL. Their findings highlighted GPT-3's superiority in story generation, particularly the text-davinci-001 variant. However, they acknowledged the advancements in newer models like GPT-4 and Llama 3, which their study did not cover. Our research builds on this by assessing these newer models to evaluate their enhanced content generation capabilities.

We considered the comprehensive evaluation by [7] of LLMs for English content writing, which covered both open-source and closed-source models. Their study was confined to a single problem statement for narrative writing using zero-shot prompting. In our study, we use their human evaluation results as a benchmark and explore more complex prompting styles such as chain of thought (CoT) [8] and Reason and Act (ReAct) [9] to enhance content generation. The examination of creativity in LLMs by [10] is relevant for its consideration of value, novelty, and surprise through Margaret Boden's creativity theories. They argued that truly creative processes in LLMs require attributes like motivation, thinking, and perception, which are currently lacking. This insight motivated us to explore advanced prompting styles to better evaluate LLMs' creative capabilities.

The study by [11] highlights the challenge of non-expert judges in assessing creativity, who often favor novice work over professional outputs. They examined metrics such as Ritchie's model [12], Pease et al.'s criteria, Colton's creative tripod [13], and the IDEA model [14]. We integrated these with Boden's criteria [15] to design effective rubrics for evaluating creativity in LLMs. The framework introduced by [16] for evaluating LLM creativity through verbal question-and-answer formats, using a modified Torrance Tests of Creative Thinking, offers a comprehensive approach; however, it emphasizes the need to assess creativity beyond verbal formats. Our study addresses this by exploring a broader range of writing tasks, including poem writing, ad copy creation, blog writing, short movie scriptwriting, and news articles. Finally, [17] introduced the TTCW framework to evaluate creativity in short fiction writing by both human authors and LLMs.
Their findings showed that LLM-generated stories passed fewer TTCW tests than human-authored ones, highlighting a gap in creative proficiency. This insight led us to extend our experimentation with LLMs to assess creativity in various other writing tasks beyond story writing.

3. Approach and Methodology
Our methodology evaluates LLM writing quality across various tasks using common criteria. We assess 11 models of different sizes and explore different prompting styles and techniques. This process identifies the best models for specific tasks and the impact of prompt engineering on content quality. Refer to Figure 1 for the details of the process.

Figure 1: Overview of the Approach & Methodology

3.1. Selection of Writing Tasks
We selected tasks that highlight the unique capabilities of LLMs in content generation, each designed to assess specific aspects of the model's creative and analytical skills. Poem Writing involves problem statements that require the model to evoke particular emotions such as melancholy, resilience, and playfulness, using tones like reflective, whimsical, or serious. Ad Copy Creation challenges the model to craft persuasive and concise messages, demanding a precise balance of action-oriented language and emotional appeal. Blog Writing requires the model to sustain reader engagement over varied lengths, blending informative and narrative styles while maintaining a consistent tone that can range from analytical to conversational. Short Movie Scriptwriting emphasizes the model's ability to create compelling narratives, requiring effective representation of character development, dialogue authenticity, and thematic depth. Lastly, News Article Writing focuses on the model's capacity to present factual information with clarity and neutrality, analyzing its ability to convey events accurately while maintaining an objective and accessible tone.

3.2. Criteria for Content Assessment
To comprehensively assess creativity in language models across diverse writing tasks, we utilize several established theoretical models, including Ritchie's Model [12], Pease et al.'s criteria, Boden's criteria [15], the IDEA Model by Isaksen and Dorval, and Colton's Creative Tripod [13]. By studying these models, we developed the customized set of criteria given below, applicable across the chosen writing skills.

• Innovation: Evaluates the introduction of new, creative ideas and their unique implementation within the content. Derived from Ritchie's Model, Pease et al.'s Novelty, and Boden's Novelty, emphasizing the novelty, uniqueness, and unexpected insights in content.
• Adherence: Assesses the depth and significance of the content, ensuring it offers meaningful and imaginative insights. Inspired by Boden's Value and Colton's Imagination, focusing on the usefulness, depth, and imaginative quality of the content.
• Coherence: Examines the logical flow and clarity of the content, ensuring it is well-structured and understandable. Based on the IDEA Model's Evaluation and Colton's Skill, integrating refinement of ideas and technical proficiency.
• Expressiveness: Measures the emotional and aesthetic impact of the content, evaluating its ability to evoke feelings and surprise. Rooted in Colton's Appreciation and Boden's Surprise, focusing on emotional connection and unexpected elements.
• Conformity: Assesses how well the content adheres to structural, stylistic, and genre-specific requirements. Derived from the IDEA Model's Intention and Action, ensuring alignment with genre-specific conventions and effective implementation of ideas.
• Diversity: Evaluates the range and variation in outputs for the same problem statement, ensuring distinct and novel solutions. Inspired by Boden's Novelty and Surprise and Ritchie's Originality, focusing on generating varied and original content.

We use these criteria as a high-level foundation to establish a hierarchical structure for content evaluation, meticulously developing task-specific sub-criteria based on this foundation. Each set of task-specific sub-criteria then guides the creation of detailed rubrics with specific questions tailored to each user prompt or problem statement, ensuring a nuanced approach to content evaluation. For example, in the task of poem writing, sub-criteria such as 'Sound and Rhythm' and 'Structure and Form' reflect the unique aspects of poetic expression. Similarly, for blogs, sub-criteria like 'Clarity and Structure' and 'Engagement' are designed to capture the analytical depth and reader involvement essential for effective blog writing. Each task-specific sub-criterion is operationalized through specific questions within the rubrics designed for individual problem statements. For instance, a problem statement about writing a reflective poem on a farmer's struggle includes a question under the sub-criterion 'Adherence to Theme': 'Did the poem effectively address the theme of a farmer's dedication during tough times?' This question is assessed using a rating scale that evaluates the depth of perseverance and hope portrayed. Similarly, specific questions are designed for each problem statement corresponding to the task-specific sub-criteria, ensuring that every aspect of the content is meticulously evaluated. This structured approach ensures a comprehensive and consistent evaluation of content, enabling detailed comparison across models and tasks.

3.3. Large Language Models
We consider a diverse array of LLMs to evaluate their performance across various content generation tasks, involving both proprietary and open-source models. Our study evaluates 11 LLMs, including proprietary models such as OpenAI's GPT-3.5 and GPT-4 [18], Anthropic's Claude 2 [19], and Google's Gemini-Pro [20]. It also includes open-source models such as meta-llama/Llama-3-70b-chat-hf [21], google/gemma-7b [22], mistralai/Mixtral-8x7B-Instruct-v0.1 [23], Qwen/Qwen1.5-72B-Chat [24], Qwen/Qwen1.5-7B-Chat, meta-llama/Llama-2-7b-chat-hf [25], and lmsys/vicuna-13b-v1.5. By analyzing these models, we aim to determine whether proprietary models are superior or whether open-source models of various sizes are competent enough to perform well on different writing tasks.

3.4. Prompting Techniques
We investigate the impact of different prompting techniques on the output quality of LLMs for various writing tasks, employing both "user prompts" and "system prompts." User prompts are direct problem statements specifying the content the model needs to generate, such as creating a poem or writing a blog. In contrast, system prompts are guiding instructions or contexts that direct the model's approach, such as adopting a specific persona or style. A minimal illustration of this split is given below.
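The following sketch shows how a user prompt and a system prompt might be combined in a chat-style API request. It is illustrative only: the model name, prompt strings, and client setup are placeholders, not the exact prompts or deployment used in the study.

# Illustrative sketch: prompt texts and model name are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

user_prompt = "Write a reflective poem about a farmer's dedication during tough times."

# Zero-shot: only the user prompt (problem statement) is sent.
zero_shot_messages = [{"role": "user", "content": user_prompt}]

# Role-induced: a system prompt assigns a persona before the same user prompt.
role_induced_messages = [
    {"role": "system", "content": "You are a creative poet who writes evocative, well-structured verse."},
    {"role": "user", "content": user_prompt},
]

response = client.chat.completions.create(
    model="gpt-4", messages=role_induced_messages, temperature=1.0
)
print(response.choices[0].message.content)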
We explore three primary prompting techniques. In Zero-Shot Prompting, only the user prompt is provided, allowing us to assess the LLM's intrinsic ability to interpret and respond without additional guidance and to evaluate its baseline capability to produce coherent and contextually appropriate content. Contextual Role-Induced Prompting enhances the zero-shot method by adding a system message that defines a specific role or context for the LLM, such as a "creative poet," to guide the model's creative process; this encourages adherence to thematic and stylistic demands and enhances the depth and quality of the generated content. Chain of Thought (CoT) Prompting is a structured approach that makes the LLM engage in a two-phase process of planning and execution: a system message first encourages the model to methodically plan its response before generating the content, fostering a disciplined and systematic generation process and promoting clarity. Inspired by the "Tree of Thoughts" paper [26], this setting evaluates the model's ability to plan and execute tasks effectively.

3.5. Multi-Agent and Retrieval-Augmented Generation (RAG) Approaches
We explore a multi-agent framework and Retrieval-Augmented Generation (RAG) [27] to understand how they enhance content generation. In campaign or ad video scriptwriting, specialized agents focus on specific elements such as plot writing, character development, taglines, scene description, and dialogue. We analyze whether the collaboration of specialists improves the content. For news generation, we use a RAG approach combined with a multi-agent framework to produce accurate and up-to-date articles. A News Collector agent gathers the latest information, while an Article Writer agent synthesizes it into a coherent piece. This method is applied only to the closed-source LLMs, as the open-source models in our study do not support function calling, which is essential for integrating real-time data. This setup assesses whether models accurately use the retrieved data for facts or introduce inaccuracies.

3.6. Evaluation Methodology
We use LLMs to evaluate the generated content, in part to understand their effectiveness in assessing creative writing tasks. Two prior studies used GPT-4 for evaluating generated content and stories but did not justify choosing GPT-4 over other models like Claude 3, Gemini 1.5 Pro, and Llama 3 70B. Additionally, no benchmarks exist for evaluating LLMs' performance in content writing tasks such as poems, blogs, and stories.

3.6.1. Selection of evaluator LLM
We referred to the recent work in [7], which evaluated the quality of stories generated by 12 LLMs and 5 human writers. Ten honors and postgraduate Creative Writing students rated these stories using a detailed rubric. We used this corpus and the expert ratings as a benchmark to select the best content evaluator LLM. We used four LLMs (GPT-4, Claude 3, Gemini 1.5 Pro, and Llama 3-70B) to rate all 65 stories from the corpus using the same rubric used by the human experts, ensuring unbiased evaluation by not disclosing the generator LLM of each story. Their study demonstrated a variance of up to 30% in scores assigned by different human evaluators, while consistently distinguishing between low-quality and high-quality content. We adopted a similar setup, considering two temperatures for each evaluator LLM: t=0 and t=1. For each evaluator LLM, we calculated the average rating and standard deviation for the 13 candidate story writers across the 5 sets and compared these metrics to the human expert ratings using the Euclidean distance

distance = √((x_LLM − x_human)² + (y_LLM − y_human)²)

where x_LLM and y_LLM represent the average score and standard deviation of the LLM's ratings, and x_human and y_human represent the average score and standard deviation of the human expert ratings. This distance measures each evaluator LLM's proximity to the human benchmark.
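As a minimal illustration of this selection step, the sketch below computes the distance for one evaluator LLM from per-story ratings; the rating arrays are placeholder values, not the actual scores from the study.

import numpy as np

# Placeholder ratings for the same set of stories; real values come from the
# expert annotations in [7] and from the evaluator LLM's rubric-based scores.
human_ratings = np.array([72.0, 65.0, 80.0, 58.0, 91.0])
llm_ratings = np.array([70.0, 65.0, 84.0, 61.0, 88.0])

# Summarise each rater by (average score, standard deviation).
x_llm, y_llm = llm_ratings.mean(), llm_ratings.std()
x_human, y_human = human_ratings.mean(), human_ratings.std()

# Euclidean distance between the two (mean, std) points: smaller values mean
# the evaluator LLM behaves more like the human experts.
distance = np.sqrt((x_llm - x_human) ** 2 + (y_llm - y_human) ** 2)
print(f"distance to human benchmark: {distance:.2f}")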
The results, visualized in Figure 2, show the average story ratings and standard deviations for each model. The Euclidean distance calculations, presented in Table 1, indicate that Llama 3-70B and Gemini 1.5 Pro have the closest alignment with human ratings.

Table 1: D is the average Euclidean distance between the ratings given by each evaluator LLM and the human ratings, indicating the proximity of each LLM's ratings to the human benchmark. The table includes distances for the temperature 0 and 1 settings, as well as their average (D - overall).

Model      D for t=0   D for t=1   D - overall
Claude 3   23.69       19.47       22.24
GPT-4      17.29       19.68       18.84
Llama 3    12.44       13.93       13.50
Gemini     14.98       11.84       13.53

Figure 2: Comparison of story ratings by humans and the four LLMs. The plot shows the average story ratings with standard deviations, illustrating the alignment of each LLM's ratings with human ratings.

These results highlight that Gemini 1.5 Pro and Llama 3-70B are the most similar to human evaluators, making them strong candidates for content evaluation tasks. We utilized Gemini 1.5 Pro, Llama 3-70B, and GPT-4 for evaluating poems, blogs, and ad copies, ensuring a comprehensive assessment of how these LLMs understand and rate high-quality and low-quality content. Each piece was evaluated at temperature settings of 0 and 1 to capture scoring variability. For news article evaluation, we exclusively used GPT-4 due to its real-time internet access and superior fact-checking capabilities, ensuring accuracy and relevance. For short movie script generation, Gemini 1.5 Pro and Llama 3-70B served as evaluators.

Due to a lack of expert evaluators, formal human evaluations were not conducted. However, to understand human preferences for AI-generated versus human-generated content, we conducted a survey using Google Forms with 50 participants. This survey aimed to gauge human preferences rather than serve as a formal evaluation. Without disclosing whether the content was human- or AI-written, all participants were presented with the same form containing two poems, two blogs, and two news articles. Each type included one human-written and one LLM-generated piece addressing the same problem statement. Participants were asked to select their preferred creation for each set, allowing for a direct comparison of preferences across different content types.

4. Experiment Setup
For our experiments with open-source LLMs, we use the Together AI service, while interactions with GPT-4 and GPT-3.5 are conducted via the OpenAI Playground. Claude is accessed through the Anthropic Playground, and Gemini-Pro and Gemini-1.5-Pro are used through Google AI Studio. For implementing agent frameworks, we develop an agent team using CrewAI [28], utilizing the API services provided by the aforementioned platforms.
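As an illustration of the agent-team setup, the sketch below shows a minimal CrewAI configuration in the spirit of our news pipeline, with a collector agent followed by a writer agent run sequentially. The roles, goals, topic, and omitted search-tool wiring are simplified placeholders rather than our exact configuration.

# Minimal CrewAI sketch (roles, goals, and topic are placeholders; the actual
# setup additionally attaches a web-search tool to the collector agent and
# configures the backend LLM per platform).
from crewai import Agent, Task, Crew, Process

collector = Agent(
    role="News Collector",
    goal="Gather the latest verified information on the given topic",
    backstory="A diligent researcher who compiles up-to-date facts with sources.",
)
writer = Agent(
    role="Article Writer",
    goal="Synthesize the collected information into a clear, neutral news article",
    backstory="An experienced journalist who writes factual, accessible copy.",
)

collect_task = Task(
    description="Collect recent facts and sources about renewable energy policy.",
    expected_output="A bullet list of verified facts with their sources.",
    agent=collector,
)
write_task = Task(
    description="Write a short news article using only the collected facts.",
    expected_output="A complete news article with a headline and body.",
    agent=writer,
)

# Sequential process: the writer runs after the collector, consuming its output.
crew = Crew(agents=[collector, writer], tasks=[collect_task, write_task],
            process=Process.sequential)
print(crew.kickoff())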
Additionally, for news searches, we utilize the DuckDuckGo [29] service to fetch articles from the internet. The CrewAI team was configured to operate sequentially for news search tasks and hierarchically for movie script generation.

Figure 3: (a) Poems, (b) Blogs, (c) Ad-Copies. Box plots displaying performance scores of LLMs across three writing tasks (blog, poem, and ad copy writing) and three prompting strategies (Zero Shot, Zero Shot with Role Induced, Zero Shot CoT). Evaluations were conducted using three LLMs (Llama-3, GPT-4, and Gemini 1.5-Pro) at temperature settings 0 and 1. For each problem statement, scores from different rubric items are totaled per evaluator LLM; these totals are then converted to percentages and combined across all three LLMs for both temperature settings. Boxes represent the median and quartiles (Q1-Q3), while whiskers extend to 1.5 times the interquartile range (IQR). Outliers and means are indicated as black hollow circles and red diamonds, respectively. The plot highlights the comparative effectiveness of prompting strategies and the performance variability between and within models across tasks.

We consider three to four problem statements for every writing task, employing three types of prompting methods, and execute each problem statement five times per prompting type. This approach ensures consistency and diversity in understanding instructions. Consequently, we generated 660 poems, 495 blogs, and 495 ads across 11 LLMs. However, due to resource limitations and infrastructure costs, only one-fifth of these outputs (132 poems, 99 blogs, and 99 ads) were evaluated. Each piece of content was assessed by GPT-4, Gemini 1.5 Pro, and Llama 3-70B, with the evaluators set at temperature values of 0 and 1, resulting in six evaluations per content piece. For movie script generation, the six LLMs used were GPT-4, GPT-3.5, Claude 2, Gemini-Pro, Qwen 1.5-72B, and Llama 3-70B, producing 18 scripts evaluated by GPT-4, Gemini 1.5 Pro, and Llama 3-70B. The multi-agent setup for movie scripts and news articles utilized the CrewAI framework, where the agents were ReAct-based. In the news generation task, four LLMs (GPT-4, GPT-3.5, Claude 2, and Gemini-Pro) produced 12 articles, which were evaluated exclusively by GPT-4 due to its web searching capabilities. We calculate BERT scores [30] across the five iterations of running each prompt in each setting to assess the diversity of the outputs, determining whether the LLMs generate varied content or repeat consistent core ideas across the iterations; a sketch of this diversity computation is given below.
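The diversity results in Section 5 are reported as dissimilarity percentages. One plausible reading, assumed here, is dissimilarity = 100 × (1 − mean pairwise BERTScore F1) over all pairs of the five generations; the candidate texts below are placeholders.

from itertools import combinations
from bert_score import score

# Placeholder generations: in the study these are the five outputs produced for
# the same problem statement under the same prompting setting.
generations = [
    "A quiet field at dawn, the farmer bends to work...",
    "Rain refuses the thirsty rows, yet his hands keep faith...",
    "Morning mist, a rusted plough, and stubborn hope...",
    "The harvest thins, but songs still rise from the dry earth...",
    "Dust on his palms, he plants tomorrow anyway...",
]

# Score every unordered pair of generations against each other.
pairs = list(combinations(generations, 2))
cands = [a for a, _ in pairs]
refs = [b for _, b in pairs]
_, _, f1 = score(cands, refs, lang="en", verbose=False)

# Higher dissimilarity means more varied outputs for the same prompt.
dissimilarity = 100 * (1 - f1.mean().item())
print(f"pairwise dissimilarity: {dissimilarity:.2f}%")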
5. Results and Discussion
Our analysis reveals several notable insights into the performance of various LLMs across multiple creative writing tasks.

5.1. LLMs as content writers
Across the writing tasks, LLMs show significant variability in content quality, topic adherence, clarity, and emotional impact. GPT-4 and Qwen 1.5-72B excel in thematic consistency, emotional depth, sound, and rhythm in poems. Claude 2 produces well-structured and clear content but sometimes refuses to generate advertisements or handle sensitive topics. Llama 3-70B maintains clarity, persuasion, and flow in blogs and ads, and also performs well in poems. Gemma-7B struggles significantly across all tasks, failing in stanza structures, language appropriateness, and clarity, resulting in free-form text and incoherent outputs. Mixtral-8x7B, Vicuna-13b, and Llama-2-7b generally adhere to problem statements but sometimes struggle to maintain clarity in intricate themes or longer formats. Qwen 1.5-7B-Chat performs moderately well but occasionally produces content in Chinese, affecting clarity. Gemini Pro excels in blogs and ads, maintaining structural integrity and clarity, but falls short in sound, rhythm, and emotional impact in poems. GPT-3.5 shows moderate adherence to problem statements and moderate quality in content generation for poems, ads, and blogs, but struggles to maintain consistent clarity and coherence. For blog writing, most LLMs fulfilled the word limit requirements for shorter texts (less than 500 words), but all struggled with higher word limits such as 2,000 words. Additionally, almost all LLMs failed to count words accurately, indicating a limitation in word count accuracy. However, Claude 2, Llama 3-70B, and Qwen 1.5-72B generated longer contexts more effectively than the others.

5.2. Effect of Prompting Strategies
In poem writing, Zero Shot Role Induced prompting consistently outperforms the other strategies, significantly enhancing thematic consistency, emotional depth, and structural elements such as sound and rhythm. This strategy benefits models like GPT-4, Llama 3-70B, and Qwen 1.5-72B the most, as illustrated in Figure 3a, where these models show marked improvements in performance scores. For ad copy generation, Zero Shot Role Induced prompting is again the leading strategy, improving clarity, engagement, and visual appeal, with GPT-4, Llama 3-70B, and Claude 2 showing the most significant gains, as depicted in Figure 3c. In blog writing, both Zero Shot and Zero Shot Role Induced prompting perform well, with GPT-4 and Llama 3-70B excelling in clarity and structural coherence. However, Chain of Thought (CoT) prompting performs poorly in blog writing because it consumes many tokens for planning, resulting in shorter and less comprehensive outputs, as can be observed in the comparative effectiveness of prompting strategies shown in Figure 3b. Smaller models like Gemma-7B and Llama-2-7b often fail to plan content properly, leading to inconsistent outputs, while models like Vicuna-13b, Mixtral-8x7B, and Qwen 1.5-7B-Chat sometimes struggle to execute their plans effectively.

5.3. Factual Consistency in News
For news article generation, our manual analysis indicates that GPT-3.5 occasionally introduced extraneous facts when the internet search by the news searcher agent was inadequate. In contrast, Claude 2 exhibited a different behavior: if the news searcher agent did not gather sufficient information, the writer agent would refuse to perform the writing task due to the lack of necessary data. On the other hand, both Gemini Pro and GPT-4 demonstrated overall better performance, integrating accurate and relevant information into the news articles.

5.4. Multi-Agent Collaboration
Regarding movie script generation, GPT-4, Llama 3-70B, and Qwen 1.5-72B slightly outperformed the other models in both single- and multi-agent settings, as shown in Figure 4. However, Claude 2 agents, despite seeking extensive information about plots, characters, and scenes, did not show a significant impact on output quality. GPT-3.5, on the other hand, showed a noticeable decline in performance when operating as a single agent compared to its multi-agent setup, indicating a potential reliance on collaborative workflows to optimize output.
A notable shortfall across all models was their lack of creativity and out-of-the-box thinking, often reflecting stereotypical mindsets. For example, in generating SnoreAway ad scripts, all models predominantly depicted men as the snorers, overlooking women. The generated plots, scenes, and dialogues were generally generic and lacked expert-level imagination. Interestingly, despite the general superiority of multi-agent setups shown in Figure 4, the experiments revealed that in some cases, such as with GPT-4 and Llama 3-70B, the single agent's performance was relatively competitive, showing only a slight drop of about 5-15% in scores compared to the multi-agent counterparts. This suggests that under certain conditions, single zero-shot agents can nearly match the performance of multi-agent setups, offering a resource-efficient alternative with lower token consumption. This observed efficiency warrants further investigation into whether optimizing single agents for multi-task roles could reduce resource consumption without compromising output quality.

Figure 4: Box plot summarizing AI model performance under single- and multi-agent settings. The plot displays the median, quartiles (Q1-Q3), and 1.5 IQR whiskers, with outliers as black hollow circles. Red diamonds mark the means. Evaluations involved two LLMs (Llama-3 and Gemini 1.5-Pro) at temperature settings 0 and 1.

Figure 5: Heatmap illustrating the summarized performance of LLMs, obtained by aggregating scores bottom-up within our evaluation hierarchy (as described in Section 3.2). Scores start from specific questions addressing task-specific sub-criteria for all tasks and models across two temperatures. These are then aggregated to represent each task-specific sub-criterion and finally compiled into the high-level criteria, illustrating the comprehensive performance across all tasks.

The heatmap analysis in Figure 5 highlights that GPT-4 and Llama 3-70B are the overall best-performing LLMs, while Gemma-7B consistently underperforms across tasks. The evaluators, themselves LLMs, rated high-quality content positively and low-quality content negatively, demonstrating their potential for reliable content evaluation. Notably, using multiple evaluator LLMs enhances the reliability of the results. As depicted in Figure 5, innovation is the lowest-rated criterion, indicating that the generated content lacked originality and out-of-the-box thinking, tending to be basic and easily reproducible. To measure diversity, we calculated BERT scores across the five generations made for each problem statement to understand the dissimilarity of outputs for the same prompt in the same setting. Higher dissimilarity values, as seen with Llama 3-70B (19.23%) and Mistral-7B (17.94%), indicate that these LLMs are capable of generating diverse and novel content pieces for the same user prompt. Other models show dissimilarity in the range of approximately 11% to 17.5%. The overall performance of open-source models ranged from average to good, suggesting that they are viable options for creative tasks. Additionally, the LLMs showed a consistent ability to adhere to structural and stylistic guidelines, indicating their proficiency in producing coherent and well-organized content. This suggests that while they may not yet excel in innovative thinking, they are competent in following detailed instructions and maintaining clarity and structure in their outputs.
In the human-preference survey described in Section 3.6, respondents tended to select the poem they found easier to understand. For the first poem's problem statement, 52.9% favored the AI-generated version, appreciating its clarity, while 53.8% preferred the human-written version of the second poem for its simplicity. For blogs, preference leaned towards LLM-generated content, with 62.7% favoring AI for the first blog and 58.8% for the second. Similarly, 54.9% preferred the AI-generated news articles for both problem statements, valuing their clarity and engagement. Although the differences are not large, AI-generated content was preferred about as often as human-created content. These findings suggest that content clarity strongly drives preferences, with respondents favoring clearer presentations from both humans and LLMs.

6. Conclusion
Our study demonstrates the capabilities and limitations of various LLMs across multiple creative writing tasks. Our comprehensive evaluation highlights the strengths of models like GPT-4 and Llama 3-70B, while also identifying areas for improvement, particularly in fostering creativity and originality: the models reliably follow detailed instructions and maintain clarity and structure, but rarely produce genuinely original, out-of-the-box content. Interestingly, our results indicate that proprietary models are not always significantly superior, with open-source models like Qwen and Llama 3 outperforming GPT-3.5 and Gemini in several tasks, demonstrating that mid-range open-source models can perform competitively. The use of multiple LLMs as evaluators has proven reliable for content assessment, emphasizing the potential of automated evaluations. However, the study is limited by the absence of formal human-based evaluations and its reliance on a small human-preference survey. Future research should include more diverse prompting techniques, explore alternative content evaluation strategies, and assess advanced Retrieval-Augmented Generation (RAG) systems beyond the classic RAG methodology employed here. Additionally, the potential of single-agent setups versus multi-agent frameworks warrants further investigation to optimize content quality across multiple subtasks.

7. Acknowledgments
We thank Mr. Hari Narayan, a researcher at TCS, for his valuable ideas, insights, and reference implementation of the LLM-as-judge-based content evaluation experimental setup.

References
[1] M. Donovan, Types of creative writing | Writing Forward, 2021. URL: https://www.writingforward.com/creative-writing/types-of-creative-writing#google_vignette.
[2] R. Lowe, Different writing styles: Exploration of 9 powerful artistic forms, 2024. URL: https://thewritingking.com/different-writing-styles/.
[3] K. Cummins, Text types and different styles of writing: The complete guide, 2024. URL: https://literacyideas.com/different-text-types/#poetry.
[4] D. Zelikman, 10 types of creative writing (with examples you'll love), 2021. URL: https://blog.reedsy.com/guide/creative-writing/types-and-examples/.
[5] T. Chakrabarty, V. Padmakumar, F. Brahman, S. Muresan, Creativity support in the age of large language models: An empirical study involving emerging writers, ArXiv abs/2309.12570 (2023). URL: https://api.semanticscholar.org/CorpusID:262217523.
[6] Z. Xie, T. Cohn, J. H. Lau, The next chapter: A study of large language models in storytelling, in: C. M. Keet, H.-Y. Lee, S. Zarrieß (Eds.), Proceedings of the 16th International Natural Language Generation Conference, Association for Computational Linguistics, Prague, Czechia, 2023, pp. 323–351. URL: https://aclanthology.org/2023.inlg-main.23. doi:10.18653/v1/2023.inlg-main.23.
[7] C. Gómez-Rodríguez, P. Williams, A confederacy of models: A comprehensive evaluation of LLMs on creative writing, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 14504–14528. URL: https://aclanthology.org/2023.findings-emnlp.966. doi:10.18653/v1/2023.findings-emnlp.966.
[8] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, 2023. arXiv:2201.11903.
[9] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, Y. Cao, ReAct: Synergizing reasoning and acting in language models, 2023. arXiv:2210.03629.
[10] C. Lamb, D. G. Brown, C. L. A. Clarke, Human competence in creativity evaluation, in: International Conference on Innovative Computing and Cloud Computing, 2015. URL: https://api.semanticscholar.org/CorpusID:14806090.
[11] G. Franceschelli, M. Musolesi, Creativity and machine learning: A survey, ACM Computing Surveys (2024). URL: http://dx.doi.org/10.1145/3664595. doi:10.1145/3664595.
[12] F. Pereira, M. Mendes, P. Gervás, A. Cardoso, Experiments with assessment of creative systems: An application of Ritchie's criteria, 2005.
[13] A. Jordanous, A standardised procedure for evaluating creative systems: Computational creativity evaluation based on what it is to be creative, Cognitive Computation 4 (2012). doi:10.1007/s12559-012-9156-1.
[14] S. Colton, J. Charnley, A. Pease, Computational creativity theory: The FACE and IDEA descriptive models, in: Proceedings of the 2nd International Conference on Computational Creativity, ICCC 2011, 2011, pp. 90–95. URL: https://computationalcreativity.net/iccc2011/proceedings/index.html.
[15] M. A. Boden, Understanding creativity, J. Creat. Behav. 26 (1992) 213–217.
[16] Y. Zhao, R. Zhang, W. Li, D. Huang, J. Guo, S. Peng, Y. Hao, Y. Wen, X. Hu, Z. Du, Q. Guo, L. Li, Y. Chen, Assessing and understanding creativity in large language models, ArXiv abs/2401.12491 (2024). URL: https://api.semanticscholar.org/CorpusID:267094860.
[17] T. Chakrabarty, P. Laban, D. Agarwal, S. Muresan, C.-S. Wu, Art or artifice? Large language models and the false promise of creativity, 2024. arXiv:2309.14556.
[18] OpenAI, GPT-4 technical report, 2024. arXiv:2303.08774.
[19] Anthropic, Introducing the next generation of Claude, n.d. URL: https://www.anthropic.com/news/claude-3-family.
[20] Gemini Team, Gemini: A family of highly capable multimodal models, 2024. arXiv:2312.11805.
[21] W. Huang, X. Ma, H. Qin, X. Zheng, C. Lv, H. Chen, J. Luo, X. Qi, X. Liu, M. Magno, How good are low-bit quantized LLaMA3 models? An empirical study, 2024. arXiv:2404.14047.
[22] Gemma Team, Gemma: Open models based on Gemini research and technology, 2024. arXiv:2403.08295.
[23] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, 2023. arXiv:2310.06825.
[24] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang, J. Yang, S. Yang, Y. Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y. Zhang, Z. Zhang, C. Zhou, J. Zhou, X. Zhou, T. Zhu, Qwen technical report, 2023. arXiv:2309.16609.
[25] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. arXiv:2302.13971.
[26] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, K. Narasimhan, Tree of thoughts: Deliberate problem solving with large language models, 2023. arXiv:2305.10601.
[27] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, H. Wang, Retrieval-augmented generation for large language models: A survey, 2024. arXiv:2312.10997.
[28] J. Moura, crewAI: Framework for orchestrating role-playing, autonomous AI agents, n.d. URL: https://github.com/joaomdmoura/CrewAI.
[29] duckduckgo-search, n.d. URL: https://pypi.org/project/duckduckgo-search/. Accessed 29-05-2024.
[30] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, 2020. arXiv:1904.09675.