<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Enhancing the Reliability of LLMs-based Systems for Survey Generation through Distributional Drift Detection</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Vinicius</forename><surname>Monteiro De Lira</surname></persName>
							<email>vmonteirodelira@surveymonkey.com</email>
							<affiliation key="aff0">
								<orgName type="institution">SurveyMonkey</orgName>
								<address>
									<settlement>Padua</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Antonio</forename><surname>Maiorino</surname></persName>
							<email>amaiorino@surveymonkey.com</email>
							<affiliation key="aff0">
								<orgName type="institution">SurveyMonkey</orgName>
								<address>
									<settlement>Padua</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Peng</forename><surname>Jiang</surname></persName>
							<email>pjiang@surveymonkey.com</email>
							<affiliation key="aff1">
								<orgName type="institution">SurveyMonkey</orgName>
								<address>
									<settlement>San Mateo</settlement>
									<region>California</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Enhancing the Reliability of LLMs-based Systems for Survey Generation through Distributional Drift Detection</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">005251FD0235C71934079AACDAF6A62D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:57+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>LLMs</term>
					<term>Survey Generation</term>
					<term>Reliability</term>
					<term>Distribution Drifts</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Evaluating Large Language Model (LLM)-based systems is a recurrent challenge in modern machine learning research and development. It is crucial to ensure that any changes made in the production environments will not negatively impact user experience, and clever evaluation techniques are especially important when updated models or prompts create disparities within the system. Since we released the feature to help our customers create surveys with textual prompts in 2023, we have iteratively improved several parts of the system such as the prompts, the LLM models and the system's internal logic. To measure the impact of these changes, we propose a comprehensive framework for assessing surveys generated by LLMs, focusing on data drift analyses based on survey metadata features. By leveraging this approach, we can effectively identify and address potential areas of concern related to model performance, enhancing the reliability and usability of LLM-based systems for survey generation tasks.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>We are the global leader in survey software, with our flagship platform enabling the collection of over 20 million answers per day across a vast variety of domains. One of the major goals of our service is to help customers create high-quality surveys by leveraging the wealth of research that internal teams have accumulated over the course of many years in the industry. This translates into the continuous development of features aimed at helping customers create effective surveys that allow them to learn what they're interested in by asking the best possible questions to their audience.</p><p>One of the latest features released for this purpose is called Build with AI (BWAI). This feature was released to all users of the platform near the end of 2023 and leverages Large Language Models (LLMs) to allow users to build high-quality surveys through a conversational interface, where users can specify what they want to learn about their audience through a textual description (a prompt), which will be used by the system to generate a survey with relevant questions and context.</p><p>Since this application involves generating a long text based on concise instructions provided in a short "seed" input text, it can be particularly challenging because of the very nature of the task, which is more akin to "creative writing" than to other Natural Language Processing (NLP) use cases where "correct" and "wrong" labels could typically be identified. In fact, in this use case it is much harder to determine what a "good" or "bad" generation would look like, as there are many possible examples of good surveys which could be generated starting from the same input prompt.</p><p>This paper presents a novel contribution in the form of a comprehensive framework for evaluating surveys generated by LLMs, specifically addressing challenges in survey generation tasks. 
The framework exploits survey metadata to facilitate data drift analysis, enabling the identification and mitigation of potential issues related to model performance. By systematically analyzing survey metadata and detecting distributional drift, the framework assesses the behavior of LLM-based systems for survey generation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>The Survey Generation domain that we studied in this work poses its own set of challenges since typically there is not a "correct" or "wrong" survey; rather, the quality of the model lies in its ability to follow the instructions specified by the user, while also trying to produce interesting ideas for potentially useful survey questions.</p><p>This puts our model in an area closer to use cases such as brainstorming and creative writing than to other, more studied areas such as Question Answering, Intent Recognition and Summarization, where some "ground truths" are usually available and can be used to evaluate the level of quality of the generated text. The lack of ground truth combined with the lack of standardized metrics for open-ended tasks makes evaluation even more difficult in our scenario.</p><p>As pointed out in <ref type="bibr" target="#b0">[1]</ref>, these kinds of use cases are often missing from popular benchmarks such as HELM <ref type="bibr" target="#b1">[2]</ref>, since most of these tend to focus on verifiable, closed-ended and automated metrics. For reliably evaluating open-ended use cases, researchers and practitioners often hire human raters, as done, for example, by the authors of <ref type="bibr" target="#b0">[1]</ref>, who hired 10 human raters to evaluate several LLMs on Creative Writing tasks. While a human evaluation is currently still the most effective and reliable way of evaluating such open-ended tasks, human involvement also makes the process much longer and more expensive. Some strategies proposed in the literature to automatically evaluate the quality of open-ended use cases include measuring the degree of "text quality" through metrics such as text readability and diversity. 
In <ref type="bibr" target="#b2">[3]</ref> the authors distinguish between "reference-based metrics", where the output generated by the model is compared to a similar output written by a human, and "reference-free metrics", where the quality of the outputs is measured directly, with some examples of the latter group being n-gram based metrics such as "Lexical Repetition" <ref type="bibr" target="#b3">[4]</ref> and "Distinct-3 (D-3)" <ref type="bibr" target="#b4">[5]</ref>, descriptive statistics such as text length, Self-BLEU (SBL) <ref type="bibr" target="#b5">[6]</ref> and BARTScore (BAS) <ref type="bibr" target="#b6">[7]</ref>. Nonetheless, the authors also report that these metrics often do not seem to agree with each other, and they complement their assessment with human-based measurements.</p><p>The authors of <ref type="bibr" target="#b7">[8]</ref> analyze the NLG evaluation landscape from another angle that has become more widespread since the advent of large-scale, powerful models: LLM-based evaluation. These techniques involve using LLMs themselves as "judges" for generated text and include Scoring, Comparison, Ranking, and Boolean QA among the strategies used to constrain LLMs to output close-ended scores. This direction is exciting because it seems like a promising way to automate evaluation tasks that were previously very hard to automate without models capable of understanding all the nuances in the generated examples, but it also comes with its own challenges and limitations. For example, the authors of <ref type="bibr" target="#b8">[9]</ref> showed how the position of texts in pairwise comparisons can influence the outcomes of evaluation results when using GPT models. 
Other limitations are that LLMs can give higher scores to more verbose and long-winded sentences <ref type="bibr" target="#b9">[10]</ref>, and also prefer responses generated by themselves as opposed to other LLMs <ref type="bibr" target="#b10">[11]</ref>.</p><p>Another LLM-based evaluation strategy consists of fine-tuning specialized, open-source models specifically for evaluation purposes. This pattern typically involves crafting high-quality evaluation datasets (either synthetically with a powerful LLM, or through human curation), which are then used to fine-tune LLMs to distill the human evaluators' knowledge, as in <ref type="bibr" target="#b11">[12]</ref>.</p><p>In summary, for most use cases involving the generation of open-ended text where no ground truth is available, the usual process involves combining some of the "automated" strategies mentioned above with human-based evaluation, with varying weight given to automated versus human evaluation based on the particular needs. When the tasks are broader, more nuanced, and "vague", or for tasks where a detailed explanation of the evaluation scores is needed, human evaluation is typically given more weight.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology: Evaluation Framework for Survey Generation</head><p>Before diving deep into the architecture of our framework, we introduce and formalize a few basic concepts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Background: Basic Concepts</head><p>A survey is a questionnaire used to collect data from a group of people to gather information, opinions, or feedback on a particular topic or subject. We formally introduce a survey as:</p><p>Definition 1 (Survey). A Survey typically consists of several questions designed to gather specific information from respondents. We define a survey as a tuple &lt; ℎ, 𝑙 &gt; where h represents the survey title and l is the list of questions composing the survey.</p><p>In turn, a single survey question can be defined as:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 2 (Survey question). We define a Survey question as a tuple &lt; 𝑡, 𝑘, 𝑜 &gt; where t represents the survey question text, k is its type drawn from a predefined taxonomy 𝐾, and o represents the list of answer options. Examples of survey question types belonging to</head><p>𝐾 include: open-ended questions, Net Promoter Score (NPS) questions, contact information questions, rating questions, and more. Except for open-ended questions, a survey question usually has a list of user-defined answer options from which the respondent may choose when responding to the question. For example, in the question "What's your work status?", possible answer options could be: "Employed", "Self-employed", "Interning", "Part-time", and "Unemployed".</p><p>In our platform, users can leverage the BWAI feature to automatically generate surveys. This process involves users providing their survey intent through a written text (the prompt). Using LLMs, we can streamline the process and allow users to generate high-quality surveys with minimal effort, improving the user experience. We formalize a user prompt as follows: Definition 3 (User prompt). The User prompt embodies the user's intention when creating a survey. Through text, the user can articulate the desired structure and content of the survey.</p><p>These generated surveys are designed to align with our established standards, which are the culmination of years of research on best practices and recommendations for creating surveys for large audiences. Our aim is to leverage our domain knowledge to help users create high-quality surveys. To achieve this, we incorporate elements of our guidelines and best practices directly into the system prompt, which defines the "behaviour" of the LLM.</p></div>
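The tuples of Definitions 1 and 2 can be sketched as simple data structures. This is an illustrative sketch, not the authors' implementation; the class names and the concrete question-type taxonomy are our assumptions based on the question types mentioned in the text.

```python
from dataclasses import dataclass, field

# Hypothetical taxonomy K, based on the question types named in Definition 2.
QUESTION_TYPES = {"open_ended", "nps", "contact_info",
                  "single_choice", "multiple_selection", "rating"}

@dataclass
class SurveyQuestion:
    """A survey question as the tuple <t, k, o> of Definition 2."""
    text: str                                      # t: the question text
    qtype: str                                     # k: type drawn from K
    options: list = field(default_factory=list)    # o: the answer options

    def __post_init__(self):
        if self.qtype not in QUESTION_TYPES:
            raise ValueError(f"unknown question type: {self.qtype}")

@dataclass
class Survey:
    """A survey as the tuple <h, l> of Definition 1."""
    title: str          # h: the survey title
    questions: list     # l: the list of SurveyQuestion objects
```

For example, the work-status question from the text would be a `SurveyQuestion` of type `single_choice` with the five listed answer options.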
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 4 (System prompt). The System prompt serves as the blueprint for instructing the LLM on generating surveys in accordance with elements of our established standards.</head><p>Nevertheless, we acknowledge the challenge of ensuring that LLM models accurately follow our instructions, given their inherently unpredictable behavior. We formalize this problem as follows:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 5 (Survey Generation Reliability Problem).</head><p>Given a user prompt 𝑝 𝑢 , a system prompt 𝑝 𝑠 , and a generative model 𝑔, our objective is to automatically generate a survey 𝑠. The generated survey 𝑠 should accurately reflect the user's intent as specified in 𝑝 𝑢 , while also adhering to the survey standards and guidelines detailed in 𝑝 𝑠 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Framework architecture</head><p>In order to continuously improve the quality of the surveys generated with BWAI we typically work in iterative cycles, which may introduce new issues while addressing existing ones, potentially impacting model quality. These issues can arise mainly due to changes in the prompts to accommodate new functionalities or due to switches and upgrades in the generative models at the core of the feature.</p><p>To mitigate this risk, we propose a testing framework with automatic tests to ensure expected model behaviors, aiding in risk assessment regarding survey standards and increasing our confidence when evaluating model updates. Unlike traditional machine learning problems such as classification or regression tasks, which have well-defined test sets (ground truth), generative features lack this, making such a framework necessary. Its scope is not to monitor data, but to validate model functionality after changes in the BWAI components. The ultimate goal is to maintain a reliable user experience for our customers, safeguarding against the deployment of new model versions that could introduce unforeseen behavior.</p><p>Figure <ref type="figure" target="#fig_0">1</ref> illustrates the comprehensive workflow implemented in our Survey Generation Testing Framework. The user prompts, as described in Definition 3, represent authentic prompts logged in our platform, conveying users' intentions for survey creation. Highlighted in blue are the pivotal components utilized by the BWAI tool for survey generation: the system prompt (as defined in Definition 4) and the generative model (e.g., GPT models or open-source LLMs like Llama, Mistral, etc.). These components constitute the fundamental dimensions of our framework. Generated surveys are leveraged for metadata feature extraction and distributional drift tests. 
With varied settings of system prompts and generative models, the Survey Generation Testing Framework conducts pairwise analyses to discern drifts between these configurations. These steps are better detailed in the next two subsections.</p></div>
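The pairwise analysis step can be sketched as a small orchestration loop over precomputed feature tables. This is a hedged sketch under our own assumptions: the dictionary layout, the function names, and the injected `drift_score` callable are illustrative, not part of the authors' system.

```python
from itertools import combinations

def run_pairwise_drift_tests(feature_tables, drift_score, threshold=0.2):
    """Compare every pair of BWAI configurations and report drifted features.

    feature_tables: {config_name: {feature_name: list of per-survey values}}
    drift_score:    callable(baseline_values, test_values) -> float
    """
    report = {}
    for (name1, tab1), (name2, tab2) in combinations(feature_tables.items(), 2):
        # A test FAILS when the drift score reaches the threshold.
        failed = [feat for feat in tab1
                  if drift_score(tab1[feat], tab2[feat]) >= threshold]
        report[(name1, name2)] = failed
    return report
```

In the paper's setting, `drift_score` would be the PSI of Section 3.4 and each table would hold the Table 1 metadata features extracted from surveys generated by one configuration.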
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Survey metadata features</head><p>To measure the impact of our developments on the outputs of the system, we define several metadata features that are computed on sets of surveys generated with different configurations of the BWAI feature.</p><p>All the metadata features used are based on some attributes of the surveys. In Table <ref type="table" target="#tab_0">1</ref>, we outline the complete set of metadata features used in our framework, along with some relevant information. The first column indicates the name of the feature, while the second specifies the aggregation function applied to the data.</p><p>For example, given a list of questions 𝑄 for a given survey, the feature n_open_ended_questions is defined as a simple Count which counts the number of "Open Ended" questions in a survey:</p><formula xml:id="formula_0">𝑛_𝑜𝑝𝑒𝑛_𝑒𝑛𝑑𝑒𝑑_𝑞𝑢𝑒𝑠𝑡𝑖𝑜𝑛𝑠 = ∑_{𝑞ᵢ ∈ 𝑄} 1{type(𝑞ᵢ) = "open_ended"}</formula><p>Most of the features are based on numerical attributes of the survey, the only exceptions being the feature 𝑎𝑛𝑦_𝑠𝑝𝑒𝑐𝑖𝑎𝑙_𝑐ℎ𝑎𝑟𝑎𝑐𝑡𝑒𝑟, which is a Boolean attribute, and the features 𝑑𝑖𝑠𝑡_𝑢𝑛𝑖𝑔𝑟𝑎𝑚𝑠 and 𝑑𝑖𝑠𝑡_𝑏𝑖𝑔𝑟𝑎𝑚𝑠, which are both categorical attributes. One noteworthy feature is score_flesch_kincaid, representing the Flesch-Kincaid Grade Level metric as defined in <ref type="bibr" target="#b12">[13]</ref>.</p></div>
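A few of the Table 1 features can be computed with straightforward aggregations. The sketch below is illustrative rather than the authors' code; the dictionary-based question representation and the character set treated as "non-special" are our assumptions.

```python
from statistics import mean

def survey_metadata_features(questions):
    """Compute a subset of the Table 1 metadata features for one survey.

    questions: list of dicts with "text", "type" and "options" keys.
    """
    words = [w for q in questions for w in q["text"].split()]
    return {
        "n_generated_questions": len(questions),
        "n_open_ended_questions": sum(
            1 for q in questions if q["type"] == "open_ended"),
        "n_words_in_survey": len(words),
        "avg_word_length": mean(len(w) for w in words),
        "avg_n_answer_options": mean(
            len(q["options"]) for q in questions),
        # Boolean feature: any character outside a permissive allow-list
        # (the allow-list is an assumption for this sketch).
        "any_special_character": any(
            not (c.isalnum() or c.isspace() or c in "?',.-")
            for q in questions for c in q["text"]),
    }
```

Running the extractor on every survey generated under one BWAI configuration yields the per-feature distributions that feed the drift tests of the next subsection.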
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Distributional drift tests</head><p>We calculate the distribution for each metadata feature. To detect distributional shifts between two different configurations of system prompt and generative model for a specific metadata feature, we compute the Population Stability Index (PSI). The PSI is a synthetic measure of how much a population has shifted over time or between two different samples of a population. It achieves this by categorizing the two distributions into buckets and assessing the percentage of items in each bucket, culminating in a single scalar value that indicates the disparity between the populations <ref type="bibr" target="#b13">[14]</ref>. We use the popular PSI formula:</p><formula xml:id="formula_1">𝑃𝑆𝐼 = ∑ᵢ₌₁ⁿ (𝑃ᵢᵗ − 𝑃ᵢᵇ) ⋅ ln(𝑃ᵢᵗ / 𝑃ᵢᵇ)</formula><p>Where:</p><p>• 𝑃ᵢᵗ is the proportion of the population in the i-th bin (or segment) at time t (typically the test or current time period).</p><p>• 𝑃ᵢᵇ is the proportion of the population in the i-th bin (or segment) at the baseline time period (typically the training or historical time period).</p><p>• n is the total number of bins (or segments) in the distribution.</p><p>The typical interpretations of PSI outcomes are as follows:</p><p>• PSI &lt; 0.1: Indicates no significant population change.</p><p>• 0.1 ≤ PSI &lt; 0.2: Reflects a moderate population change.</p><p>• PSI ≥ 0.2: Signifies a significant population change.</p><p>For our framework, we use 0.2 as the threshold (𝜆) for the PSI score. Therefore, any value at or above this threshold results in a FAILED test, indicating significant changes in the distributions. For better clarity, Algorithm 1 presents the drift test function utilized in our framework.</p></div>
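The PSI formula and the λ = 0.2 threshold above can be implemented directly. This is a minimal sketch rather than the paper's Algorithm 1: the shared bin edges, the number of bins, and the epsilon guard against empty buckets are our assumptions.

```python
import numpy as np

def psi(baseline, test, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one feature."""
    # Bin both samples with edges computed over their union, so the
    # i-th bucket means the same thing in both distributions.
    edges = np.histogram_bin_edges(np.concatenate([baseline, test]),
                                   bins=n_bins)
    count_b, _ = np.histogram(baseline, bins=edges)
    count_t, _ = np.histogram(test, bins=edges)
    # Convert counts to proportions, guarding against empty buckets
    # (ln and division would otherwise blow up).
    p_b = np.clip(count_b / count_b.sum(), eps, None)
    p_t = np.clip(count_t / count_t.sum(), eps, None)
    return float(np.sum((p_t - p_b) * np.log(p_t / p_b)))

def drift_test(baseline, test, threshold=0.2):
    """Return (score, verdict) using the paper's 0.2 threshold."""
    score = psi(baseline, test)
    return score, "FAILED" if score >= threshold else "PASSED"
```

Two samples drawn from the same distribution give a PSI near zero (PASSED), while a clearly shifted sample pushes the score past 0.2 (FAILED).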
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Experiment setup</head><p>The BWAI system is made up of two primary elements: (a) the system prompt and (b) the generative model. When the feature was released in late 2023, the first version relied on GPT-3.5 Turbo as the core LLM and used a version of the system (including prompts and logic) we will refer to as v1. Later we updated several components of the system, and we will refer to this updated version as v2. We have also experimented with GPT-4 Turbo as the base LLM.</p><p>Given this context, in this paper we present real-case tests conducted using two system versions (v1 and v2) and two models for analysis: GPT-3.5 Turbo (GPT3.5) and GPT-4 Turbo (GPT4), both under the "0125" release from the OpenAI API.</p><p>BWAI configuration (ℬ 𝐺𝑒𝑛𝑀𝑜𝑑𝑒𝑙 𝑆𝑦𝑠𝑃𝑟𝑜𝑚𝑝𝑡 ). Our objective is to evaluate the differences in survey generation across various combinations of generative models (i.e. GPT3.5 and GPT4) and system prompt versions (i.e. v1 and v2). We list all the pairs of evaluations that we focused on in this analysis. The idea is to have at least one common element in the tuple (i.e. either the prompt or the generative model) to assess the impact when transitioning between versions:</p><formula xml:id="formula_2">1. &lt;ℬ 𝐺𝑃𝑇 3.5 v1 , ℬ 𝐺𝑃𝑇 3.5 v2 &gt; 2. &lt;ℬ 𝐺𝑃𝑇 3.5 v1 , ℬ 𝐺𝑃𝑇 4 v1 &gt; 3. &lt;ℬ 𝐺𝑃𝑇 3.5 v2 , ℬ 𝐺𝑃𝑇 4 v2 &gt; 4. &lt;ℬ 𝐺𝑃𝑇 4 v1 , ℬ 𝐺𝑃𝑇 4 v2 &gt;</formula><p>System prompts differences. Regarding the differences between the v1 and v2 system prompts, we summarize some of the key improvements that the v2 prompt introduces over the previous version:</p><p>• Addition of multilingual support • Improved output formatting instructions • Improved instructions to encourage the system to comply with survey research best practices (i.e. avoid open-ended questions where not necessary, order questions from general to specific, etc.) 
• Addition of specific instructions to improve creativity • Longer prompt with much more structure in the system prompt (416 to 690 tokens) • Support for additional use cases such as survey forms User prompts collection. In order to measure the differences across the system configurations outlined above we selected a subset of 3185 input prompts which have been collected from real customers who have interacted with the BWAI system and consented to let us use their prompts to improve our system. The selection has been done starting from the full set of user prompts collected between October 2023 and January 2024 and applying the following filters in sequence (with filtering boundaries and parameters determined through ad-hoc analyses to exclude poor quality samples):</p><p>1. Drop duplicates; 2. Drop input prompts which contain PII or sensitive information as flagged by our internal privacypreservation pipelines; 3. Select only inputs written in English; 4. Drop inputs shorter than 200 characters and longer than 500 characters; 5. Drop inputs which led to generated surveys with an outlier number of questions (i.e. &lt;5 or &gt;12). Table <ref type="table">2</ref> Overall results of drift tests</p></div>
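The five-step filtering sequence above can be sketched as a simple pipeline. This is an illustrative sketch only: the PII check and the language detector are stubbed out as injected callables, since the paper relies on internal privacy-preservation pipelines and does not name its English-detection method.

```python
def filter_prompts(records, flag_pii, is_english):
    """Apply the five filters in sequence to collected user prompts.

    records:    list of {"prompt": str, "n_questions": int} dicts,
                where n_questions is the size of the generated survey.
    flag_pii:   callable(str) -> bool, stub for the internal PII pipeline.
    is_english: callable(str) -> bool, stub for language detection.
    """
    seen, kept = set(), []
    for r in records:
        p = r["prompt"]
        if p in seen:                         # 1. drop duplicates
            continue
        seen.add(p)
        if flag_pii(p):                       # 2. drop PII / sensitive text
            continue
        if not is_english(p):                 # 3. keep English inputs only
            continue
        if not 200 <= len(p) <= 500:          # 4. character-length bounds
            continue
        if not 5 <= r["n_questions"] <= 12:   # 5. drop outlier surveys
            continue
        kept.append(r)
    return kept
```

Applying these filters in order to the October 2023 to January 2024 prompt log is what produced the 3185-prompt evaluation set described above.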
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Drift tests: Overall results</head><p>For each of the metadata features introduced in Table <ref type="table" target="#tab_0">1</ref>, we perform the distributional drift tests.</p><p>The number of passed and failed drift tests is shown in Table <ref type="table">2</ref>. A failed test means that there is a drift in the output. As introduced in Section 3.4, the test measures drift using the Population Stability Index of the two distributions.</p><p>We observe a significant increase in the number of failed tests when transitioning from the GPT3.5 to the GPT4 model. Specifically, there were 16 failed tests when comparing these two models for the system prompt version v2, and 13 failures for version v1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Drift tests: per feature</head><p>In this section, we conduct a detailed examination of the experiment results presented in the previous section, focusing on the analysis of actual drift scores of metadata features across the different BWAI configurations. Table <ref type="table">3</ref> provides a comprehensive overview of the PSI drift scores for the metadata features. We focus only on those with at least one failed case.</p><p>One noteworthy case involves the metadata feature n_contact_info_questions, which exhibits a PSI score of 3.778. The histograms for this metadata feature are shown in Figure <ref type="figure" target="#fig_2">2</ref>. This indicates a significant drift primarily due to the transition of models (i.e., from GPT3.5 to GPT4), without any modifications in prompts, since in both cases the v1 system prompt was used. In turn, for v2, the highest drift was observed for the metadata feature n_generated_questions with a PSI of 2.950. This happens when upgrading the generative model from GPT3.5 to GPT4.</p><p>When assessing changes induced solely by changes in the system prompts (i.e., transitioning from v1 to v2) while maintaining the same generative model, overall lower metadata drift scores are observed. Specifically, when utilizing the GPT3.5 model, the highest score among these cases was reported for the metadata feature n_multiple_selection_questions (0.642). Conversely, with the GPT4 model as the generative model, the highest score resurfaced for the metadata feature n_contact_info_questions (1.540). The histograms for this case are shown in Figure <ref type="figure" target="#fig_3">3</ref>. In practice, this framework serves as a valuable tool for assessing whether intended modifications to system prompts translate effectively into the survey generation process. 
For instance, the updated system prompt includes specific instructions to nudge the LLM to generate questions which include answer options (as opposed to open-ended questions, which do not). One such question type is the Multiple Selection question type, and the impact of these instructions between v1 and v2 can be seen in the scores in Table <ref type="table">3</ref> on the row corresponding to the feature 𝑛_𝑚𝑢𝑙𝑡𝑖𝑝𝑙𝑒_𝑠𝑒𝑙𝑒𝑐𝑡𝑖𝑜𝑛_𝑞𝑢𝑒𝑠𝑡𝑖𝑜𝑛𝑠, as well as in Figure <ref type="figure" target="#fig_4">4</ref>, which clearly shows that the v2 prompt tends to generate more 𝑚𝑢𝑙𝑡𝑖𝑝𝑙𝑒_𝑠𝑒𝑙𝑒𝑐𝑡𝑖𝑜𝑛 questions than the v1 prompt. Also, through the detection of distributional drift of the survey metadata features, we can identify and mitigate potential issues, thereby avoiding unexpected behaviors of the feature. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>In this study, we proposed a comprehensive evaluation framework to enhance the reliability of Large Language Model (LLM)-based systems for survey generation tasks. By addressing the challenges associated with accurately following user prompts and maintaining consistency with established standards, the framework functions as a protective barrier, effectively setting guardrails to preempt unforeseen behaviors of our BWAI tool. Through the detection of distributional drift of the survey metadata features, the framework acts as a guiding compass for data scientists to investigate and address any unintended deviations in the application's behavior, thereby ensuring its stability and reliability.</p><p>Our experimental results demonstrate the effectiveness of the proposed framework in evaluating survey generation metadata features across different configurations of system prompts and generative models. We observed significant differences in survey outputs when transitioning between different versions of LLM models, highlighting the importance of comprehensive evaluation in adapting to model updates. Furthermore, our analysis revealed nuanced insights into the impact of system prompt versions on survey generation quality, underscoring the need for careful consideration of both prompt design and model selection in ensuring reliable survey generation.</p><p>As future work, we aim to integrate automated evaluation strategies to assess the "quality" of the generated surveys. In this scenario, the emphasis shifts from leveraging metadata features to compare system versions toward analyzing the survey content itself. One promising direction is to use LLMs to act as preliminary inspectors of survey quality. 
This could significantly accelerate our quality assessment process, which currently relies heavily on human evaluation.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Survey Generation Testing Framework overall workflow</figDesc><graphic coords="4,119.52,84.19,354.20,67.14" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Histograms for the metadata feature n_contact_info_questions extracted from surveys generated using GPT4 and GPT3.5 with v1 only</figDesc><graphic coords="6,98.44,84.19,183.03,135.24" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Histograms for the metadata feature n_contact_info_questions extracted from surveys generated using GPT4 with v1 and v2 prompts</figDesc><graphic coords="6,98.44,270.43,183.03,135.24" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Histograms for the metadata feature n_multiple_selection_questions extracted from surveys generated using GPT4 with v1 and v2 prompts</figDesc><graphic coords="6,311.77,84.19,183.03,135.24" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc></figDesc><table><row><cell>Feature Name</cell><cell>Aggregation</cell></row><row><cell>n_contact_info_questions</cell><cell>Count</cell></row><row><cell>n_open_ended_questions</cell><cell>Count</cell></row><row><cell>n_nps_questions</cell><cell>Count</cell></row><row><cell>n_multiple_selection_questions</cell><cell>Count</cell></row><row><cell>n_closed_ended_questions</cell><cell>Count</cell></row><row><cell>n_generated_questions</cell><cell>Count</cell></row><row><cell>n_unsupported_questions</cell><cell>Count</cell></row><row><cell>n_single_choice_questions</cell><cell>Count</cell></row><row><cell>n_characters_in_survey</cell><cell>Count</cell></row><row><cell>n_words_in_survey</cell><cell>Count</cell></row><row><cell>std_n_words_per_question</cell><cell>Std</cell></row><row><cell>avg_word_length</cell><cell>Mean</cell></row><row><cell>avg_n_answer_options</cell><cell>Mean</cell></row><row><cell>avg_n_words_per_question</cell><cell>Mean</cell></row><row><cell>avg_n_words_per_answer_option</cell><cell>Mean</cell></row><row><cell>max_word_length</cell><cell>Max</cell></row><row><cell>any_special_character</cell><cell>Any</cell></row><row><cell>score_flesch_kincaid</cell><cell>Count</cell></row><row><cell>dist_unigrams</cell><cell>Count</cell></row><row><cell>dist_bigrams</cell><cell>Count</cell></row></table><note>List of survey metadata features supported by our framework.</note></figure>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>[Table 3: PSI drift scores of the metadata features across the four BWAI configuration pairs]</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Gómez-Rodríguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Williams</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.08433</idno>
		<title level="m">A confederacy of models: a comprehensive evaluation of LLMs on creative writing</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bommasani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tsipras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Soylu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yasunaga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Narayanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Newman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cosgrove</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ré</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Acosta-Navas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Hudson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zelikman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Durmus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ladhak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Santhanam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Orr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yuksekgonul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Suzgun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Guha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Chatterji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Khattab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Henderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Santurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ganguli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hashimoto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Icard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Koreeda</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2211.09110</idno>
		<title level="m">Holistic evaluation of language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">The next chapter: A study of large language models in storytelling</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Cohn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Lau</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2301.09790</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Long and diverse text generation with planning-based hierarchical variational model</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Shao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D19-1321</idno>
		<ptr target="https://aclanthology.org/D19-1321" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Inui</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Jiang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Ng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Wan</surname></persName>
		</editor>
		<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)<address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3257" to="3268" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A diversity-promoting objective function for neural conversation models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Galley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Brockett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Dolan</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N16-1014</idno>
		<ptr target="https://aclanthology.org/N16-1014" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Knight</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Nenkova</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">O</forename><surname>Rambow</surname></persName>
		</editor>
		<meeting>the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>San Diego, California</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="110" to="119" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1802.01886</idno>
		<title level="m">Texygen: A benchmarking platform for text generation models</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Neubig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2106.11520</idno>
		<title level="m">BARTScore: Evaluating generated text as text generation</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ruan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Pu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.01383</idno>
		<title level="m">LLM-based NLG evaluation: Current status and challenges</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.14229</idno>
		<title level="m">Zero-shot cross-lingual summarization via large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-L</forename><surname>Chiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">P</forename><surname>Xing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Stoica</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2306.05685</idno>
		<title level="m">Judging LLM-as-a-judge with MT-Bench and Chatbot Arena</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Iter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.16634</idno>
		<title level="m">G-Eval: NLG evaluation using GPT-4 with better human alignment</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Freitag</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.14282</idno>
		<title level="m">InstructScore: Explainable text generation evaluation with fine-grained feedback</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Derivation of New Readability Formulas: (automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kincaid</surname></persName>
		</author>
		<ptr target="https://books.google.it/books?id=4tjroQEACAAJ" />
		<imprint>
			<date type="published" when="1975">1975</date>
		</imprint>
		<respStmt>
			<orgName>Research Branch report, Chief of Naval Technical Training, Naval Air Station Memphis</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">The population accuracy index: A new measure of population stability for model monitoring</title>
		<author>
			<persName><forename type="first">R</forename><surname>Taplin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hunt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Risks</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page">53</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
