Enhancing the Reliability of LLM-based Systems for Survey Generation through Distributional Drift Detection

Vinicius Monteiro de Lira 1,∗,†, Antonio Maiorino 1,†, and Peng Jiang 2,†
1 SurveyMonkey, Padua, Italy
2 SurveyMonkey, San Mateo, California, USA

KiL'24: Workshop on Knowledge-infused Learning, co-located with the 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain
∗ Corresponding author.
† These authors contributed equally.
vmonteirodelira@surveymonkey.com (V. M. d. Lira); amaiorino@surveymonkey.com (A. Maiorino); pjiang@surveymonkey.com (P. Jiang)
ORCID: 0000-0002-7580-1756 (V. M. d. Lira)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract
Evaluating Large Language Model (LLM)-based systems is a recurrent challenge in modern machine learning research and development. It is crucial to ensure that any changes made to the production environment will not negatively impact user experience, and clever evaluation techniques are especially important when updated models or prompts create disparities within the system. Since we released the feature that helps our customers create surveys from textual prompts in 2023, we have iteratively improved several parts of the system, such as the prompts, the LLM models, and the system's internal logic. To measure the impact of these changes, we propose a comprehensive framework for assessing surveys generated by LLMs, focusing on data drift analyses based on survey metadata features. By leveraging this approach, we can effectively identify and address potential areas of concern related to model performance, enhancing the reliability and usability of LLM-based systems for survey generation tasks.

Keywords
LLMs, Survey Generation, Reliability, Distribution Drifts

1. Introduction

We are the global leader in survey software, with our flagship platform enabling the collection of over 20 million answers per day across a vast variety of domains. One of the major goals of our service is to help customers create high-quality surveys by leveraging the wealth of research that internal teams have accumulated over the course of many years in the industry. This translates into the continuous development of features aimed at helping customers create effective surveys that allow them to learn what they are interested in by asking the best possible questions to their audience.

One of the latest features released for this purpose is called Build with AI (BWAI). This feature was released to all users of the platform near the end of 2023 and leverages Large Language Models (LLMs) to allow users to build high-quality surveys through a conversational interface, where users can specify what they want to learn about their audience through a textual description (a prompt), which is then used by the system to generate a survey with relevant questions and context.
Since this application involves generating a long text based on concise instructions provided in a short "seed" input text, it can be particularly challenging because of the very nature of the task, which is more akin to "creative writing" than to other Natural Language Processing (NLP) use cases where "correct" and "wrong" labels could typically be identified. In fact, in this use case it is much harder to determine what a "good" or "bad" generation would look like, as there are many possible examples of good surveys which could be generated starting from the same input prompt.

This paper presents a novel contribution in the form of a comprehensive framework for evaluating surveys generated by LLMs, specifically addressing challenges in survey generation tasks. The framework exploits survey metadata to facilitate data drift analysis, enabling the identification and mitigation of potential issues related to model performance. By systematically analyzing survey metadata and detecting distributional drift, the framework assesses the behavior of LLM-based systems for survey generation.

2. Related Work

The Survey Generation domain that we studied in this work poses its own set of challenges, since typically there is not a "correct" or "wrong" survey; rather, the goodness of the model lies in its ability to follow the instructions specified by the user, while also trying to produce interesting ideas for potentially useful survey questions. This puts our model in an area closer to use cases such as brainstorming and creative writing than to other, more studied areas such as Question Answering, Intent Recognition, and Summarization, where some "ground truths" are usually available and can be used to evaluate the level of quality of the generated text. The lack of ground truth, combined with the lack of standardized metrics for open-ended tasks, makes evaluation even more difficult in our scenario.

As pointed out in [1], these kinds of use cases are often missing from popular benchmarks such as HELM [2], since most of these tend to focus on verifiable, closed-ended, and automated metrics. For reliably evaluating open-ended use cases, researchers and practitioners often hire human raters, as for example done by the authors of [1], who hired 10 human raters to evaluate several LLMs on Creative Writing tasks. While human evaluation is currently still the most effective and reliable way of evaluating such open-ended tasks, human involvement also makes the process much longer and more expensive.

Some strategies proposed in the literature to automatically evaluate the quality of open-ended use cases include measuring the degree of "text quality" through metrics such as text readability and diversity. In [3] the authors distinguish between "reference-based metrics", where the output generated by the model is compared to a similar output written by a human, and "reference-free metrics", where the quality of the outputs is measured directly. Examples of the latter group are n-gram based metrics such as Lexical Repetition [4] and Distinct-3 (D-3) [5], descriptive statistics such as text length, Self-BLEU (SBL) [6], and BARTScore (BAS) [7]. Nonetheless, the authors also report that these metrics often do not seem to agree with each other, and they complement their assessment with human-based measurements.

The authors of [8] analyze the NLG evaluation landscape from another angle that has become more widespread after the advent of large-scale powerful models, which is LLM-based evaluation. These techniques involve using LLMs themselves as "judges" for generated text, and include Scoring, Comparison, Ranking, and Boolean QA among the strategies used to constrain LLMs to output close-ended scores. This direction is exciting because it seems like a promising way to automate evaluation tasks that were previously very hard to automate without models capable of understanding all the nuances in the generated examples, but it also comes with its own challenges and limitations. For example, the authors of [9] showed how the position of texts in pairwise comparisons can influence the outcomes of evaluation results when using GPT models. Other limitations are that LLMs can give higher scores to more verbose and long-winded sentences [10], and also prefer responses generated by themselves as opposed to other LLMs [11].

Another variation of evaluation strategy still based on LLMs is fine-tuning specialized, open-source models specifically for evaluation purposes. This pattern typically involves crafting high-quality evaluation datasets (either synthetically with a powerful LLM, or through human curation), which are then used to fine-tune LLMs to try to distill the human evaluators' knowledge, as in [12].

In summary, for most use cases involving the generation of open-ended text where no ground truth is available, the usual process involves combining some of the "automated" strategies mentioned above with human-based evaluation, with varying weight given to the automated vs. human evaluation based on the particular needs. When the tasks are broader, more nuanced, and "vague", or for tasks where a detailed explanation of the evaluation scores is needed, human evaluation is typically given more weight.
3. Methodology: Evaluation Framework for Survey Generation

Before diving deep into the architecture of our framework, we introduce and formalize a few basic concepts.

3.1. Background: Basic Concepts

A survey is a questionnaire used to collect data from a group of people to gather information, opinions, or feedback on a particular topic or subject. We formally introduce a survey as:

Definition 1 (Survey). A Survey typically consists of several questions designed to gather specific information from respondents. We define a survey as a tuple <h, l> where h represents the survey title and l is the list of questions composing the survey.

In turn, a single survey question can be defined as:

Definition 2 (Survey question). We define a Survey question as a tuple <t, k, o> where t represents the survey question text, k is its type drawn from a predefined taxonomy K, and o represents the list of answer options. Examples of survey question types belonging to K include: open-ended questions, Net Promoter Score (NPS) questions, contact information questions, rating questions, and more. Except for the open-ended questions, a survey question usually has a list of user-defined answer options amongst which the respondent may choose when responding to the question. For example, in the question "What's your work status?", possible answer options could be: "Employed", "Self-employed", "Interning", "Part-time", and "Unemployed".
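To make Definitions 1 and 2 concrete, the sketch below models a survey and a survey question as simple Python data structures. The class and field names are illustrative choices for this paper, not the schema used by our platform, and the question-type strings stand in for the taxonomy K.

```python
# A minimal sketch of Definitions 1 and 2 as Python data structures.
# Names are illustrative, not the production schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SurveyQuestion:
    text: str                  # t: the question text
    question_type: str         # k: drawn from a predefined taxonomy K, e.g. "open_ended", "nps", "single_choice"
    answer_options: List[str] = field(default_factory=list)  # o: empty for open-ended questions

@dataclass
class Survey:
    title: str                        # h: the survey title
    questions: List[SurveyQuestion]   # l: the list of questions

# Example instance for the work-status question mentioned above.
work_status = SurveyQuestion(
    text="What's your work status?",
    question_type="single_choice",
    answer_options=["Employed", "Self-employed", "Interning", "Part-time", "Unemployed"],
)
survey = Survey(title="Employment survey", questions=[work_status])
```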
In our platform, users can leverage the BWAI feature to automatically generate surveys. This process involves users providing their survey intent through a written text (the prompt). Using LLMs, we can streamline the process and allow users to generate high-quality surveys with minimal effort, improving the overall user experience. We formalize a user prompt as follows:

Definition 3 (User prompt). The User prompt embodies the user's intention when creating a survey. Through text, the user can articulate the desired structure and content of the survey.

These generated surveys are designed to align with our established standards, which are the culmination of years of research on best practices and recommendations for creating surveys for large audiences. Our aim is to leverage our domain knowledge to help users create high-quality surveys. To achieve this, we incorporate elements of our guidelines and best practices directly into the system prompt, which defines the "behaviour" of the LLM.

Definition 4 (System prompt). The System prompt serves as the blueprint for instructing the LLM on generating surveys in accordance with elements of our established standards.

Nevertheless, we acknowledge the challenge of ensuring that LLM models accurately follow our instructions, given their inherently unpredictable behavior. We formalize this problem as follows:

Definition 5 (Survey Generation Reliability Problem). Given a user prompt p_u, a system prompt p_s, and a generative model g, our objective is to automatically generate a survey s. The generated survey s should accurately reflect the user's intent as specified in p_u, while also adhering to the survey standards and guidelines detailed in p_s.

3.2. Framework architecture

In order to continuously improve the quality of the surveys generated with BWAI we typically work in iterative cycles, which may introduce new issues while addressing existing ones, potentially impacting model quality. These issues can arise mainly due to changes in the prompts to accommodate new functionalities, or due to switches and upgrades in the generative models at the core of the feature.

To mitigate this risk, we propose a testing framework with automatic tests to ensure expected model behaviors, aiding in risk assessment regarding survey standards and increasing our confidence when evaluating model updates. Unlike traditional machine learning problems such as classification or regression tasks, which have well-defined test sets (ground truth), generative features lack this, necessitating such a framework. Its scope is not to monitor data, but to validate model functionality after changes in the BWAI components. The ultimate goal is to maintain a reliable user experience for our customers, safeguarding against the deployment of new model versions that could introduce unforeseen behavior.

Figure 1 illustrates the comprehensive workflow implemented in our Survey Generation Testing Framework. The User prompts, as described in Definition 3, represent authentic prompts logged in our platform, conveying users' intentions for survey creation. Highlighted in blue are the pivotal components utilized by the BWAI tool for survey generation: the system prompt (as defined in Definition 4) and the generative model (e.g., GPT models or open-source LLMs like Llama, Mistral, etc.). These components constitute the fundamental dimensions of our framework. Generated surveys are leveraged for metadata feature extraction and distributional drift tests. With varied settings of system prompts and generative models, the Survey Generation Testing Framework conducts pairwise analyses to discern drifts between these configurations. These steps are detailed in the next two subsections.

Figure 1: Survey Generation Testing Framework overall workflow.

3.3. Survey metadata features

To measure the impact of our developments on the outputs of the system, we define several metadata features that are computed on sets of surveys generated with different configurations of the BWAI feature. All the metadata features used are based on attributes of the surveys. In Table 1, we outline the complete set of metadata features used in our framework, along with some relevant information. The first column indicates the name of the feature, while the second one specifies the aggregation function applied to the data.

Table 1: List of survey metadata features supported by our framework.

    Feature Name                      Aggregation
    n_contact_info_questions          Count
    n_open_ended_questions            Count
    n_nps_questions                   Count
    n_multiple_selection_questions    Count
    n_closed_ended_questions          Count
    n_generated_questions             Count
    n_unsupported_questions           Count
    n_single_choice_questions         Count
    n_characters_in_survey            Count
    n_words_in_survey                 Count
    n_unsupported_questions           Count
    std_n_words_per_question          Std
    avg_word_length                   Mean
    avg_n_answer_options              Mean
    avg_n_words_per_question          Mean
    avg_n_words_per_answer_option     Mean
    max_word_length                   Max
    any_special_character             Any
    score_flesch_kincaid              Count
    dist_unigrams                     Count
    dist_bigrams                      Count

For example, given the list of questions Q of a given survey, the feature n_open_ended_questions is defined as a simple Count of the "Open Ended" questions in the survey:

    n_open_ended_questions = Σ_{q_i ∈ Q} 1{type(q_i) = "open_ended"}

Most of the features are based on numerical attributes of the survey, with the only exceptions being the feature any_special_character, which is a Boolean attribute, and the features dist_unigrams and dist_bigrams, which are both categorical attributes. One noteworthy feature is score_flesch_kincaid, representing the Flesch-Kincaid Grade Level metric as defined in [13].
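As an illustration of how such metadata features can be derived from a generated survey, the following sketch computes a handful of the features in Table 1 on top of the Survey structure sketched in Section 3.1. The helper name, the whitespace tokenization, and the special-character check are assumptions made for the example, not the production implementation.

```python
# A sketch computing a few of the metadata features in Table 1 for a single
# generated survey. Assumes the Survey/SurveyQuestion dataclasses sketched earlier.
from statistics import mean

def extract_metadata_features(survey: "Survey") -> dict:
    question_word_counts = [len(q.text.split()) for q in survey.questions]
    all_words = [w for q in survey.questions for w in q.text.split()]
    answer_option_counts = [len(q.answer_options) for q in survey.questions if q.answer_options]
    return {
        # Count aggregations
        "n_generated_questions": len(survey.questions),
        "n_open_ended_questions": sum(1 for q in survey.questions if q.question_type == "open_ended"),
        "n_nps_questions": sum(1 for q in survey.questions if q.question_type == "nps"),
        "n_words_in_survey": len(all_words),
        # Mean / Max aggregations
        "avg_n_words_per_question": mean(question_word_counts) if question_word_counts else 0.0,
        "avg_n_answer_options": mean(answer_option_counts) if answer_option_counts else 0.0,
        "max_word_length": max((len(w) for w in all_words), default=0),
        # Any aggregation (illustrative check over the question texts only)
        "any_special_character": any(
            not c.isalnum() and not c.isspace()
            for q in survey.questions for c in q.text
        ),
    }
```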
3.4. Distributional drift tests

We calculate the distribution of each metadata feature. To detect distributional shifts between two different configurations of system prompt and generative model for a specific metadata feature, we compute the Population Stability Index (PSI).

The PSI is a synthetic measure of how much a population has shifted over time or between two different samples of a population. It achieves this by categorizing the two distributions into buckets and assessing the percentage of items in each bucket, culminating in a single scalar value that indicates the disparity between the populations [14]. We use the popular PSI formula:

    PSI = Σ_{i=1}^{n} (P_t^i − P_b^i) · ln(P_t^i / P_b^i)

Where:
• P_t^i is the proportion of the population in the i-th bin (or segment) at time t (typically the test or current time period).
• P_b^i is the proportion of the population in the i-th bin (or segment) at the baseline time period (typically the training or historical time period).
• n is the total number of bins (or segments) in the distribution.

The typical interpretations of PSI outcomes are as follows:
• PSI < 0.1: indicates no significant population change.
• 0.1 ≤ PSI < 0.2: reflects a moderate population change.
• PSI ≥ 0.2: signifies a significant population change.

For our framework, we use 0.2 as the threshold (λ) for the PSI score. Therefore, any value above this threshold is reported as a FAILED test, indicating significant changes in the distributions. For better clarity, Algorithm 1 presents the drift test function utilized in our framework.

Algorithm 1: Metadata drift test algorithm
    procedure DRIFT_TEST(m, λ)
        ▷ Inputs: m is a given survey metadata feature distribution; λ is the PSI threshold.
        if PSI(m) > λ then
            return FAIL
        else
            return PASS
        end if
    end procedure
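The sketch below shows one possible implementation of the PSI formula and of the drift test in Algorithm 1 for a numerical metadata feature. The number of buckets and the small epsilon used to guard against empty buckets are illustrative choices; the paper does not prescribe them.

```python
# A minimal sketch of the PSI computation and the drift test of Algorithm 1.
# Bucket count and epsilon are illustrative choices, not values from the paper.
import numpy as np

def psi(baseline: np.ndarray, test: np.ndarray, n_bins: int = 10, eps: float = 1e-4) -> float:
    # Bucket both samples using bin edges derived from the baseline distribution.
    edges = np.histogram_bin_edges(baseline, bins=n_bins)
    p_b, _ = np.histogram(baseline, bins=edges)
    p_t, _ = np.histogram(test, bins=edges)
    # Convert counts to proportions, guarding against empty buckets.
    p_b = np.clip(p_b / p_b.sum(), eps, None)
    p_t = np.clip(p_t / p_t.sum(), eps, None)
    return float(np.sum((p_t - p_b) * np.log(p_t / p_b)))

def drift_test(baseline: np.ndarray, test: np.ndarray, lam: float = 0.2) -> str:
    """Algorithm 1: flag a metadata feature as drifted when PSI exceeds lambda."""
    return "FAIL" if psi(baseline, test) > lam else "PASS"
```

With lam set to 0.2, this matches the interpretation thresholds above: a PSI at or above 0.2 is treated as a significant change and the test fails.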
4. Experimental results

4.1. Experiment setup

The BWAI system is made up of two primary elements: (a) the system prompt and (b) the generative model. When the feature was released in late 2023, the first version relied on GPT-3.5 Turbo as the core LLM and used a version of the system (including prompts and logic) we will refer to as v1. Later we updated several components of the system, and we will refer to this updated version as v2. We have also experimented with GPT-4 Turbo as a base LLM.

Given this context, in this paper we present real-case tests conducted using two system versions (v1 and v2) and two models for analysis: GPT-3.5 Turbo (GPT3.5) and GPT-4 Turbo (GPT4), both under the "0125" release from the OpenAI API.

BWAI configuration (B_SysPrompt^GenModel). Our objective is to evaluate the differences in survey generation across various combinations of generative models (i.e., GPT3.5 and GPT4) and system prompt versions (i.e., v1 and v2). We list below all the pairs of configurations that we focused on in this analysis. The idea is to have at least one common element in the tuple (i.e., either the prompt or the generative model) to assess the impact when transitioning between versions:

1. <B_v1^GPT3.5, B_v2^GPT3.5>
2. <B_v1^GPT3.5, B_v1^GPT4>
3. <B_v2^GPT3.5, B_v2^GPT4>
4. <B_v1^GPT4, B_v2^GPT4>

System prompts differences. Regarding the differences between the v1 and v2 system prompts, we summarize some of the key improvements that the v2 prompt introduces over the previous version:

• Addition of multilingual support
• Improved output formatting instructions
• Improved instructions to encourage the system to comply with survey research best practices (i.e., avoid open-ended questions where not necessary, order questions from general to specific, etc.)
• Addition of specific instructions to improve creativity
• Longer prompt with much more structure (from 416 to 690 tokens)
• Support for additional use cases such as survey forms

User prompts collection. In order to measure the differences across the system configurations outlined above, we selected a subset of 3185 input prompts collected from real customers who interacted with the BWAI system and consented to let us use their prompts to improve our system. The selection started from the full set of user prompts collected between October 2023 and January 2024, applying the following filters in sequence, with filtering boundaries and parameters determined through ad-hoc analyses to exclude poor-quality samples (a sketch of this filtering sequence is shown after the list):

1. Drop duplicates;
2. Drop input prompts which contain PII or sensitive information, as flagged by our internal privacy-preservation pipelines;
3. Select only inputs written in English;
4. Drop inputs shorter than 200 characters or longer than 500 characters;
5. Drop inputs which led to generated surveys with an outlier number of questions (i.e., <5 or >12).
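A minimal sketch of this filtering sequence is given below. The PII check and the language detector are stubbed out, standing in for the internal privacy-preservation pipeline and a language-identification step; the mapping from a prompt to the number of questions in its generated survey is assumed to be available.

```python
# A sketch of the five-step filtering sequence applied to the raw user prompts.
# contains_pii and detect_language are placeholders, not real APIs.
def contains_pii(prompt: str) -> bool:
    return False          # placeholder for the internal privacy-preservation check

def detect_language(prompt: str) -> str:
    return "en"           # placeholder for a language-identification step

def filter_prompts(prompts, n_questions_by_prompt):
    seen, kept = set(), []
    for p in prompts:
        if p in seen:                                   # 1. drop duplicates
            continue
        seen.add(p)
        if contains_pii(p):                             # 2. drop PII / sensitive prompts
            continue
        if detect_language(p) != "en":                  # 3. keep English prompts only
            continue
        if not (200 <= len(p) <= 500):                  # 4. drop prompts shorter than 200 or longer than 500 characters
            continue
        # 5. drop prompts whose generated surveys have an outlier number of questions
        if not (5 <= n_questions_by_prompt.get(p, 0) <= 12):
            continue
        kept.append(p)
    return kept
```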
4.2. Drift tests: Overall results

For each of the metadata features introduced in Table 1, we perform the distributional drift tests. The number of passed and failed drift tests is shown in Table 2. A failed test means that there is a drift in the output: as introduced in Section 3.4, drift is measured using the Population Stability Index of the two distributions.

Table 2: Overall results of drift tests.

    Experiment                      FAIL    PASS
    <B_v1^GPT3.5, B_v2^GPT3.5>       5      16
    <B_v1^GPT3.5, B_v1^GPT4>        13       8
    <B_v2^GPT3.5, B_v2^GPT4>        16       5
    <B_v1^GPT4, B_v2^GPT4>           9      12

We observe a significant increase in the number of failed tests when transitioning from the GPT3.5 to the GPT4 model. Specifically, there were 16 failed tests when comparing these two models under the system prompt version v2, and 13 failures under version v1.
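To illustrate how the pass/fail summary in Table 2 can be assembled, the sketch below runs the drift test for every configuration pair and every metadata feature, and tallies FAIL/PASS counts. It reuses the psi and drift_test helpers sketched in Section 3.4; the configuration labels and the features_by_config mapping are illustrative.

```python
# A sketch of the pairwise analysis behind Table 2. Assumes drift_test from the
# earlier sketch is in scope and that metadata feature values have already been
# collected for each configuration.
from collections import Counter

CONFIG_PAIRS = [
    ("v1_gpt35", "v2_gpt35"),
    ("v1_gpt35", "v1_gpt4"),
    ("v2_gpt35", "v2_gpt4"),
    ("v1_gpt4", "v2_gpt4"),
]

def summarize_drift(features_by_config, lam: float = 0.2):
    """features_by_config: config name -> {feature name -> array of feature values}."""
    results = {}
    for base, test in CONFIG_PAIRS:
        counts = Counter(
            drift_test(features_by_config[base][f], features_by_config[test][f], lam)
            for f in features_by_config[base]
        )
        results[(base, test)] = (counts["FAIL"], counts["PASS"])
    return results
```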
4.3. Drift tests: per feature

In this section, we conduct a detailed examination of the experiment results presented in the previous section, focusing on the analysis of the actual drift scores of the metadata features across the different BWAI configurations. Table 3 provides a comprehensive overview of the PSI drift scores for the metadata features. We focus only on the ones having at least one failed case.

One noteworthy case involves the metadata feature n_contact_info_questions, which exhibits a PSI score of 3.778. The histograms for this metadata feature are shown in Figure 2. This indicates a significant drift primarily due to the transition of models (i.e., from GPT3.5 to GPT4), without any modifications to the prompts, as in both cases the v1 system prompt was used. In turn, for v2, the highest drift was observed for the metadata feature n_generated_questions, with a PSI equal to 2.950. This also happens when upgrading the generative model from GPT3.5 to GPT4.

When assessing changes induced solely by changes in the system prompts (i.e., transitioning from v1 to v2) while maintaining the same generative model, overall lower metadata drift scores are observed. Specifically, when utilizing the GPT3.5 model, the highest score among these cases was reported for the metadata feature n_multiple_selection_questions (0.642). Conversely, with the GPT4 model as the generative model, the highest score resurfaced for the metadata feature n_contact_info_questions (1.540). The histograms for this case are shown in Figure 3.

Figure 2: Histograms for the metadata feature n_contact_info_questions extracted from surveys generated using GPT4 and GPT3.5 with v1 only.

Figure 3: Histograms for the metadata feature n_contact_info_questions extracted from surveys generated using GPT4 with v1 and v2 prompts.

In practice, this framework serves as a valuable tool for assessing whether intended modifications to system prompts translate effectively into the survey generation process. For instance, the updated system prompt includes specific instructions to nudge the LLM to generate questions which include answer options (as opposed to open-ended questions, which do not). One of these question types is the Multiple Selection question type, and the impact of these instructions between v1 and v2 can be seen in the scores in Table 3, on the row corresponding to the feature n_multiple_selection_questions, as well as in Figure 4, where it is clearly shown that the v2 prompt tends to generate more multiple-selection questions than the v1 prompt. Also, through the detection of distributional drift of the survey metadata features, we can identify and mitigate potential issues, thereby avoiding unexpected behaviors of the feature.

Figure 4: Histograms for the metadata feature n_multiple_selection_questions extracted from surveys generated using GPT4 with v1 and v2 prompts.

Table 3: PSI scores per feature. We show only the features that failed in at least one of the BWAI configurations (B).

    Features                         <B_v1^GPT3.5, B_v2^GPT3.5>   <B_v1^GPT3.5, B_v1^GPT4>   <B_v2^GPT3.5, B_v2^GPT4>   <B_v1^GPT4, B_v2^GPT4>
    avg_n_answer_options                 0.405    0.478    0.424    0.351
    avg_n_words_per_answer_option        0.642    0.311    0.339    0.622
    avg_n_words_per_question             0.000    0.526    0.392    0.367
    drift:bigrams_distribution           0.565    0.860    0.795    0.673
    drift:unigrams_distribution          0.205    0.247    0.222    0.217
    max_word_length                      0.000    0.000    0.219    0.000
    n_closed_ended_questions             0.000    0.318    1.203    0.447
    n_contact_info_questions             0.000    3.778    0.529    1.540
    n_generated_questions                0.000    2.624    2.950    0.000
    n_multiple_selection_questions       0.991    0.000    0.395    1.271
    n_nps_questions                      0.000    1.899    2.300    0.000
    n_open_ended_questions               0.000    0.352    0.846    0.000
    n_single_choice_questions            0.000    0.000    0.504    0.000
    n_words_in_survey                    0.000    1.535    2.068    0.000
    std_n_words_per_question             0.000    1.173    0.419    0.599
    n_characters_in_survey               0.000    1.446    1.871    0.000

5. Conclusion

In this study, we proposed a comprehensive evaluation framework to enhance the reliability of Large Language Model (LLM)-based systems for survey generation tasks. By addressing the challenges associated with accurately following user prompts and maintaining consistency with established standards, the framework functions as a protective barrier, effectively setting guardrails to preempt unforeseen behaviors of our BWAI tool. Through the detection of distributional drift of the survey metadata features, the framework acts as a guiding compass for data scientists to investigate and address any unintended deviations in the application's behavior, thereby ensuring its stability and reliability.

Our experimental results demonstrate the effectiveness of the proposed framework in evaluating survey generation metadata features across different configurations of system prompts and generative models. We observed significant differences in survey outputs when transitioning between different versions of LLM models, highlighting the importance of comprehensive evaluation in adapting to model updates. Furthermore, our analysis revealed nuanced insights into the impact of system prompt versions on survey generation quality, underscoring the need for careful consideration of both prompt design and model selection in ensuring reliable survey generation.

As future work, we aim to integrate automated evaluation strategies to assess the "quality" of the generated surveys. In this scenario, the emphasis shifts from leveraging metadata features to compare differences across different system versions to analyzing the survey content itself. One promising direction is to use LLMs to act as preliminary inspectors of survey quality. This could significantly accelerate our quality assessment process, which currently relies heavily on human evaluation.
References

[1] C. Gómez-Rodríguez, P. Williams, A confederacy of models: a comprehensive evaluation of llms on creative writing, 2023. arXiv:2310.08433.
[2] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cosgrove, C. D. Manning, C. Ré, D. Acosta-Navas, D. A. Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong, H. Ren, H. Yao, J. Wang, K. Santhanam, L. Orr, L. Zheng, M. Yuksekgonul, M. Suzgun, N. Kim, N. Guha, N. Chatterji, O. Khattab, P. Henderson, Q. Huang, R. Chi, S. M. Xie, S. Santurkar, S. Ganguli, T. Hashimoto, T. Icard, T. Zhang, V. Chaudhary, W. Wang, X. Li, Y. Mai, Y. Zhang, Y. Koreeda, Holistic evaluation of language models, 2023. arXiv:2211.09110.
[3] Z. Xie, T. Cohn, J. H. Lau, The next chapter: A study of large language models in storytelling, 2023. arXiv:2301.09790.
[4] Z. Shao, M. Huang, J. Wen, W. Xu, X. Zhu, Long and diverse text generation with planning-based hierarchical variational model, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3257–3268. URL: https://aclanthology.org/D19-1321. doi:10.18653/v1/D19-1321.
[5] J. Li, M. Galley, C. Brockett, J. Gao, B. Dolan, A diversity-promoting objective function for neural conversation models, in: K. Knight, A. Nenkova, O. Rambow (Eds.), Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, California, 2016, pp. 110–119. URL: https://aclanthology.org/N16-1014. doi:10.18653/v1/N16-1014.
[6] Y. Zhu, S. Lu, L. Zheng, J. Guo, W. Zhang, J. Wang, Y. Yu, Texygen: A benchmarking platform for text generation models, 2018. arXiv:1802.01886.
[7] W. Yuan, G. Neubig, P. Liu, Bartscore: Evaluating generated text as text generation, 2021. arXiv:2106.11520.
[8] M. Gao, X. Hu, J. Ruan, X. Pu, X. Wan, Llm-based nlg evaluation: Current status and challenges, 2024. arXiv:2402.01383.
[9] J. Wang, Y. Liang, F. Meng, B. Zou, Z. Li, J. Qu, J. Zhou, Zero-shot cross-lingual summarization via large language models, 2023. arXiv:2302.14229.
[10] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. arXiv:2306.05685.
[11] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, C. Zhu, G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023. arXiv:2303.16634.
[12] W. Xu, D. Wang, L. Pan, Z. Song, M. Freitag, W. Y. Wang, L. Li, Instructscore: Explainable text generation evaluation with fine-grained feedback, 2023. arXiv:2305.14282.
[13] J. Kincaid, Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel, Research Branch report, Chief of Naval Technical Training, Naval Air Station Memphis, 1975. URL: https://books.google.it/books?id=4tjroQEACAAJ.
[14] R. Taplin, C. Hunt, The population accuracy index: A new measure of population stability for model monitoring, Risks 7 (2019) 53.