<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ACM KDD Conference, August</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Enhancing the Reliability of LLMs-based Systems for Survey Generation through Distributional Drift Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Vinicius Monteiro de Lira</string-name>
<email>vmonteirodelira@surveymonkey.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Antonio Maiorino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peng Jiang</string-name>
          <email>pjiang@surveymonkey.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>LLMs, Survey Generation, Reliability, Distribution Drifts</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SurveyMonkey</institution>
          ,
          <addr-line>Padua</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>SurveyMonkey</institution>
          ,
          <addr-line>San Mateo, California</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>26</volume>
      <issue>2024</issue>
      <abstract>
<p>Evaluating Large Language Model (LLM)-based systems is a recurrent challenge in modern machine learning research and development. It is crucial to ensure that any changes made in the production environments will not negatively impact user experience, and clever evaluation techniques are especially important when updated models or prompts create disparities within the system. Since we released the feature to help our customers create surveys with textual prompts in 2023, we have iteratively improved several parts of the system such as the prompts, the LLMs and the system's internal logic. To measure the impact of these changes, we propose a comprehensive framework for assessing surveys generated by LLMs, focusing on data drift analyses based on survey metadata features. By leveraging this approach, we can effectively identify and address potential areas of concern related to model performance, enhancing the reliability and usability of LLM-based systems for survey generation tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>LLMs</kwd>
        <kwd>Survey Generation</kwd>
        <kwd>Reliability</kwd>
        <kwd>Distribution Drifts</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>possible questions to their audience.</p>
      <p>We are the global leader in survey software, with our flag- (NLP) use cases where “correct” and “wrong” labels could
ship platform enabling the collection of over 20 milliontypically be identified. In fact, in this use case it is much
answers per day across a vast variety of domains. Onhearder to determine what a “good” or “bad” generation
of the major goals of our service is to help customerwsould look like, as there are many possible examples of
create high-quality surveys by leveraging the wealth ogfood surveys in which could be generated starting from
research that internal teams have accumulated over tthhee same input prompt.
course of many years in the industry. This translates intoThis paper presents a novel contribution in the form
the continuous development of features aimed at helopf- a comprehensive framework for evaluating surveys
ing customers create efective surveys that allow themgenerated by LLMs, specifically addressing challenges
to learn what they’re interested in by asking the beinstsurvey generation tasks. The framework exploits
survey metadata to facilitate data drift analysis, enabling
conversational interface, where users can specify what
to allow users to build high quality surveys through faor survey generation.
they want to learn about their audience through a textual
description (aprompt), which will be used by the system 2. Related Work
to generate a survey with relevant questions and context.</p>
      <p>One of the latest features released for this purptohsee identification and mitigation of potential issues
reis called Build with AI (BWAI). This feature has been lated to model performance. By systematically analyzing
released to all the users of the platform near the esnudrvey metadata and detecting distributional drift, the
of 2023 and leverages Large Language Models (LLMs) framework assesses the behavior of LLM-based systems</p>
      <p>Since this application involves generating a long texTthe Survey Generation domain that we studied in this
based on concise instructions provided in a short ”seedw”ork poses its own set of challenges since typically there
input text, it can be particularly challenging because oifs not a ”correct” or ”wrong” survey, but rather the
goodKiL’24: Workshop on Knowledge-infused Learning co-located with tions specified by the user, while also trying to produce
nEvelop-O
∗Corresponding author.
†These authors contributed equally.</p>
      <p>This puts our model in an area closer to use cases such
as brainstorming and creative writing than to other, more
studied areas such as Question Answering, Intent
Recognition and Summarization, where some “ground truths”
of quality of the generated text. The lack of ground truotrhthrough human curation), which are then used to
combined with the lack of standardized metrics for op efinne--tune LLMs to try and distill the Human Evaluators’
ended tasks makes evaluation even more dificult in our knowledge, as in [12].
scenario. In summary, for most use cases involving the
genera</p>
      <p>As pointed out in [1], these kinds of use cases are often missing from popular benchmarks such as HELM [2], since most of these tend to focus on verifiable, closed-ended and automated metrics. For reliably evaluating open-ended use cases, researchers and practitioners often hire human raters, as for example done by the authors of [1], who hired 10 human raters to evaluate several LLMs on creative writing tasks. While human evaluation is currently still the most effective and reliable way of evaluating such open-ended tasks, human involvement also makes the process much longer and more expensive.</p>
      <p>Some strategies proposed in the literature to automatically evaluate the quality of open-ended use cases include measuring the degree of "text quality" through metrics such as text readability and diversity. In [3] the authors distinguish between "reference-based metrics", where the output generated by the model is compared to a similar output written by a human, and "reference-free metrics", where the quality of the outputs is measured directly, with some examples of the latter group being n-gram based metrics such as "Lexical Repetition" [4] and "Distinct-3 (D-3)" [5], descriptive statistics such as text length, Self-BLEU (SBL) [6] and BARTScore (BAS) [7]. Nonetheless, the authors also report that these metrics often do not seem to agree with each other, and they complement their assessment with human-based measurements.</p>
      <p>The authors of [8] analyze the NLG evaluation landscape from another angle that is becoming more widespread after the advent of big-scale powerful models, which is LLM-based evaluation. These techniques involve using LLMs themselves as "judges" for generated text and include Scoring, Comparison, Ranking, and Boolean QA among the strategies used to constrain LLMs to output close-ended scores. This direction is exciting because it seems like a promising way to automate evaluation tasks that were previously very hard to automate without models capable of understanding all the nuances in the generated examples, but it also comes with its own challenges and limitations. For example, the authors of [9] showed how the position of texts in pairwise comparisons can influence the outcomes of evaluation results when using GPT models. Other limitations are that LLMs can give higher scores to more verbose and long-winded sentences [10], and also prefer responses generated by themselves as opposed to other LLMs [11].</p>
      <p>Another variation of evaluation strategy still based on LLMs is represented by fine-tuning specialized, open-source models specifically for evaluation purposes. This pattern typically involves crafting high-quality evaluation datasets (either synthetically with a powerful LLM or through human curation), which are then used to fine-tune LLMs to try and distill the human evaluators' knowledge, as in [12].</p>
      <p>In summary, for most use cases involving the generation of open-ended text where no ground truth is available, the usual process involves combining some of the "automated" strategies mentioned above with human-based evaluation, with varying weight given to automated versus human evaluation based on the particular needs. When the tasks are broader, more nuanced, and "vague", or for tasks where a detailed explanation of the evaluation scores is needed, human evaluation is typically given more weight.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology: Evaluation Framework for Survey Generation</title>
      <p>Before diving deep into the architecture of our framework, we introduce and formalize a few basic concepts.</p>
      <sec id="sec-3-1">
        <title>3.1. Background: Basic Concepts</title>
        <p>A survey is a questionnaire used to collect data from a group of people to gather information, opinions, or feedback on a particular topic or subject. We formally introduce a survey as:</p>
        <p>Definition 1 (Survey). A Survey typically consists of several questions designed to gather specific information from respondents. We define a survey as a tuple &lt;h, l&gt;, where h represents the survey title and l is the list of questions composing the survey.</p>
        <p>In turn, a single survey question can be defined as:</p>
        <p>Definition 2 (Survey question). We define a Survey question as a tuple &lt;t, k, o&gt;, where t represents the survey question text, k is its type drawn from a predefined taxonomy, and o represents the list of answer options. Examples of survey question types belonging to this taxonomy include: open-ended questions, Net Promoter Score (NPS) questions, contact information questions, rating questions, and more. Except for the "Open-ended" questions, a survey question usually has a list of user-defined answer options amongst which the respondent may choose to respond to the question. For example, in the question "What's your work status?", possible answer options could be: "Employed", "Self-employed", "Interning", "Part-time", and "Unemployed".</p>
        <p>In our platform, users can leverage the BWAI feature to automatically generate surveys. This process involves users providing their survey intent through a written text (the prompt). Using LLMs, we can streamline the process and allow users to generate high-quality surveys with minimal effort, increasing the level of our user experience. We formalize a user prompt as follows:</p>
        <p>Definition 3 (User prompt). The User prompt embodies the user's intention when creating a survey. Through text, the user can articulate the desired structure and content of the survey.</p>
        <p>These generated surveys are designed to align with our established standards, which are the culmination of years of research on best practices and recommendations for creating surveys for large audiences. Our aim is to leverage our domain knowledge to help users create high-quality surveys. To achieve this, we incorporate elements of our guidelines and best practices directly into the system prompt, which defines the "behaviour" of the LLM.</p>
        <p>Definition 4 (System prompt). The System prompt serves as the blueprint for instructing the LLM on generating surveys in accordance with elements of our established standards.</p>
        <p>Nevertheless, we acknowledge the challenge of ensuring that LLM models accurately follow our instructions, given their inherently unpredictable behavior. We formalize this problem as follows:</p>
        <p>Definition 5 (Survey Generation Reliability Problem). Given a user prompt, a system prompt, and a generative model, our objective is to automatically generate a survey. The generated survey should accurately reflect the user's intent as specified in the user prompt, while also adhering to the survey standards and guidelines detailed in the system prompt.</p>
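        <p>To make the concepts above concrete, the following minimal Python sketch models a survey and its questions along the lines of Definitions 1 and 2. The class and field names are ours and are chosen only for illustration; they do not reflect the internal BWAI implementation, and the taxonomy shown lists only a few of the question types mentioned above.</p>
        <preformat>
from dataclasses import dataclass, field
from typing import List

# Illustrative subset of the question-type taxonomy of Definition 2.
QUESTION_TYPES = {"open_ended", "nps", "contact_info", "rating", "multiple_selection"}

@dataclass
class SurveyQuestion:
    """A survey question as the tuple &lt;t, k, o&gt; of Definition 2."""
    text: str                                           # t: the question text
    qtype: str                                          # k: the question type
    options: List[str] = field(default_factory=list)    # o: answer options (empty for open-ended)

@dataclass
class Survey:
    """A survey as the tuple &lt;h, l&gt; of Definition 1."""
    title: str                                           # h: the survey title
    questions: List[SurveyQuestion] = field(default_factory=list)  # l: the list of questions

# Example instance mirroring the "What's your work status?" question above.
work_status = SurveyQuestion(
    text="What's your work status?",
    qtype="multiple_selection",
    options=["Employed", "Self-employed", "Interning", "Part-time", "Unemployed"],
)
survey = Survey(title="Employee survey", questions=[work_status])
        </preformat>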
      <sec id="sec-2-1">
        <title>3.3. Survey metadata features</title>
        <p>Nevertheless, we acknowledge the challenge of ensurT-o measure the impact of our developments on the
outing that LLM models accurately follow our instructionpsu, ts of the system, we define several metadata features
given their inherently unpredictable behavior. We fotrh-at are computed on sets of surveys generated with
malize this problem as follows: diferent configurations of theBWAI feature.
Definition 5 (Survey Generation Reliability Proble.m) All the metadata features used are based on some
atGiven a user prompt   , a system prompt   , and a genera- tributes of the surveys. In Tab1l,ewe outline the
comtive model  , our objective is to automatically generate a plete set of metadata features used in our framework,
survey  . The generated survey  should accurately reflect along with some relevant information. The column first
the user’s intent as specified in   , while also adhering to column indicates the name of the feature, while the
secthe survey standards and guidelines detailed in   . ond one specifies the aggregation function applied to the
data.</p>
        <p>For example, given the list of questions l of a given survey, the feature n_open_ended_questions is defined as a simple Count which counts the number of "Open Ended" questions in the survey: n_open_ended_questions = &#8721;_{q &#8712; l} 1{type(q) = "open_ended"}.</p>
        <p>Most of the features are based on numerical attributes of the survey, with the only exceptions being a Boolean feature and two categorical features. One noteworthy feature is score_flesch_kincaid, representing the Flesch-Kincaid Grade Level metric as defined in [13].</p>
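        <p>The sketch below illustrates how count-based metadata features such as n_open_ended_questions could be computed from one generated survey, following the count formula above. The dictionary-based survey representation and the helper name are assumptions made for illustration; the authoritative feature set is the one listed in Table 1.</p>
        <preformat>
from collections import Counter
from typing import Dict

def extract_metadata_features(survey: Dict) -> Dict[str, float]:
    """Illustrative count-based metadata features for one generated survey.

    The survey is represented as {"title": str, "questions": [{"type": str, ...}, ...]},
    mirroring Definitions 1 and 2.
    """
    types = Counter(q["type"] for q in survey["questions"])
    return {
        "n_generated_questions": len(survey["questions"]),
        "n_open_ended_questions": types["open_ended"],            # the count formula above
        "n_contact_info_questions": types["contact_info"],
        "n_multiple_selection_questions": types["multiple_selection"],
        # score_flesch_kincaid would additionally require a readability library.
    }

survey = {
    "title": "Customer feedback",
    "questions": [
        {"type": "open_ended", "text": "What could we improve?"},
        {"type": "multiple_selection", "text": "Which features do you use?"},
        {"type": "contact_info", "text": "How can we reach you?"},
    ],
}
print(extract_metadata_features(survey))
# {'n_generated_questions': 3, 'n_open_ended_questions': 1, 'n_contact_info_questions': 1, ...}
        </preformat>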
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Distributional drift tests</title>
        <p>We calculate the distribution of each metadata feature. To detect distributional shifts between two different configurations of system prompt and generative model for a specific metadata feature, we compute the Population Stability Index (PSI). The PSI is a synthetic measure of how much a population has shifted over time or between two different samples of a population. It achieves this by categorizing the two distributions into buckets and assessing the percentage of items in each bucket, culminating in a single scalar value that indicates the disparity between the populations [14]. We use the popular PSI formula, summing over the B buckets: PSI = &#8721;_{i=1}^{B} (p_i &#8722; q_i) &#8901; ln(p_i / q_i), where p_i is the proportion of the population in the i-th bin (or segment) at time t (typically the test or current time period), and q_i is the proportion of the population in the i-th bin (or segment) at the baseline time period (typically the training or historical time period).</p>
        <p>We set a threshold on the PSI score: any value above the threshold results in a FAILED test, indicating significant changes in the distributions. For better clarity, Algorithm 1 presents the drift test function utilized in our framework.</p>
        <p>Algorithm 1: Metadata drift test algorithm</p>
        <preformat>
procedure drift_test(F, T)
    &#9655; Inputs: F is a given survey metadata feature distribution; T is the PSI threshold.
    if PSI(F) &gt; T then
        return FAIL
    else
        return PASS
    end if
end procedure
        </preformat>
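        <p>As an illustration of the PSI computation and of the drift test in Algorithm 1, the sketch below bins a baseline and a test sample of one metadata feature and applies a threshold. The binning strategy, the smoothing constant and the threshold value (0.2, a common rule of thumb) are illustrative assumptions; the paper does not prescribe these exact choices.</p>
        <preformat>
import numpy as np

def psi(baseline, test, n_bins=10, eps=1e-4):
    """Population Stability Index between two samples of one metadata feature."""
    # Common bin edges built from the pooled values (quantile bins are an alternative).
    edges = np.histogram_bin_edges(np.concatenate([baseline, test]), bins=n_bins)
    p = np.histogram(test, bins=edges)[0] / len(test) + eps          # test / current period
    q = np.histogram(baseline, bins=edges)[0] / len(baseline) + eps  # baseline period
    return float(np.sum((p - q) * np.log(p / q)))

def drift_test(baseline, test, threshold=0.2):
    """Algorithm 1: FAIL when the PSI between the two samples exceeds the threshold."""
    return "FAIL" if psi(baseline, test) > threshold else "PASS"

# Example: values of n_open_ended_questions for surveys from two configurations.
baseline_counts = np.array([1, 0, 2, 1, 1, 0, 1, 2, 0, 1])
test_counts = np.array([3, 2, 4, 3, 2, 3, 4, 2, 3, 3])
print(psi(baseline_counts, test_counts), drift_test(baseline_counts, test_counts))
        </preformat>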
        <sec id="sec-2-1-1">
          <title>Stability Index (PSI).</title>
          <p>The PSI is a synthetic measure of how much a pop- end procedure
ulation has shifted over time or between two diferent
samples of a population. It achieves this by
categorizing the two distributions into buckets and assessing the
percentage of items in each bucket, culminating in a sin4-. Experimental results
gle scalar value that indicates the disparity between the
populations [14]. We use the popular PSI formula:</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>4.1. Experiment setup:</title>
        <p>The BWAI system is made up of two primary elements: (a) the system prompt and (b) the generative model. When the feature was released in late 2023, the first version relied on GPT3.5-Turbo as the core LLM and used a version of the system (including prompts and logic) we will refer to as v1. Later we updated several components of the system, and we will refer to this updated version as v2. Also, we have experimented with GPT4-Turbo as a base LLM.</p>
        <p>Given this context, in this paper we present real-case tests conducted using two system versions (v1 and v2) and two models for analysis: GPT-3.5 Turbo (GPT3.5) and GPT-4 Turbo (GPT4), both under the "0125" release from the OpenAI API. Our objective is to evaluate the differences in survey generation across various combinations of generative models (i.e. GPT3.5 and GPT4) and pairs of system prompt versions (i.e. v1 and v2). We list all the pairs of BWAI configurations (denoted &#8492;, indexed by system prompt version and generative model) that we focused on in this analysis. The idea is to have at least one common element in the pair (i.e. either the prompt or the generative model) to assess the impact when transitioning between versions:</p>
        <list list-type="order">
          <list-item><p>&lt;&#8492;_v1^3.5, &#8492;_v2^3.5&gt;</p></list-item>
          <list-item><p>&lt;&#8492;_v1^3.5, &#8492;_v1^4&gt;</p></list-item>
          <list-item><p>&lt;&#8492;_v2^3.5, &#8492;_v2^4&gt;</p></list-item>
          <list-item><p>&lt;&#8492;_v1^4, &#8492;_v2^4&gt;</p></list-item>
        </list>
        <p>System prompts differences. Regarding the differences between the v1 and v2 system prompts, we summarize some of the key improvements that the v2 prompt introduces over the previous version:</p>
        <list list-type="bullet">
          <list-item><p>Addition of multilingual support</p></list-item>
          <list-item><p>Improved output formatting instructions</p></list-item>
          <list-item><p>Improved instructions to encourage the system to comply with survey research best practices (i.e. avoid open-ended questions where not necessary, order questions from general to specific, etc.)</p></list-item>
          <list-item><p>Addition of specific instructions to improve creativity</p></list-item>
          <list-item><p>Longer prompt with much more structure in the system prompt (416 to 690 tokens)</p></list-item>
          <list-item><p>Support for additional use cases such as survey forms</p></list-item>
        </list>
        <p>User prompts collection. In order to measure the differences across the system configurations outlined above, we selected a subset of 3185 input prompts which have been collected from real customers who have interacted with the BWAI system and consented to let us use their prompts to improve our system. The selection was done starting from the full set of user prompts collected between October 2023 and January 2024 and applying the following filters in sequence (with filtering boundaries and parameters determined through ad-hoc analyses to exclude poor quality samples):</p>
        <list list-type="order">
          <list-item><p>Drop duplicates;</p></list-item>
          <list-item><p>Drop input prompts which contain PII or sensitive information as flagged by our internal privacy-preservation pipelines;</p></list-item>
          <list-item><p>Select only inputs written in English;</p></list-item>
          <list-item><p>Drop inputs shorter than 200 characters or longer than 500 characters;</p></list-item>
          <list-item><p>Drop inputs which led to generated surveys with an outlier number of questions (i.e. &lt;5 or &gt;12).</p></list-item>
        </list>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Drift tests: Overall results</title>
        <p>For each of the metadata features introduced, we perform the distribution drift tests. The number of passed and failed drift tests is shown in Table 2. A failed test means that there is a drift in the output. As introduced in Section 3.4, it measures drift by using the Population Stability Index of the two distributions.</p>
        <p>We observe a significant increase in the number of failed tests when transitioning from the GPT3.5 to the GPT4 model. Specifically, there were 16 failed tests when comparing these two models for the system prompt version v2, and 13 failures for version v1.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Overall results of drift tests.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Experiment</th>
                <th>Failed tests</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>&lt;&#8492;_v1^3.5, &#8492;_v2^3.5&gt;</td><td>5</td></tr>
              <tr><td>&lt;&#8492;_v1^3.5, &#8492;_v1^4&gt;</td><td>13</td></tr>
              <tr><td>&lt;&#8492;_v2^3.5, &#8492;_v2^4&gt;</td><td>16</td></tr>
              <tr><td>&lt;&#8492;_v1^4, &#8492;_v2^4&gt;</td><td>9</td></tr>
            </tbody>
          </table>
        </table-wrap>
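        <p>As an illustration of how the per-feature drift tests are aggregated into the counts of Table 2, the sketch below thresholds a set of per-feature PSI scores for one configuration pair. The threshold and all scores except the 3.778 value reported in the next subsection are hypothetical.</p>
        <preformat>
def summarize_drift_tests(psi_scores, threshold=0.2):
    """Aggregate per-feature PSI scores for one pair of BWAI configurations into
    PASS/FAIL counts as reported in Table 2 (the threshold here is illustrative)."""
    results = {name: ("FAIL" if score > threshold else "PASS")
               for name, score in psi_scores.items()}
    n_fail = sum(r == "FAIL" for r in results.values())
    return results, n_fail, len(results) - n_fail

# Hypothetical per-feature scores for one configuration pair (v1 prompt, GPT3.5 vs GPT4);
# the 3.778 value echoes the n_contact_info_questions case discussed in Section 4.3.
scores = {"n_contact_info_questions": 3.778,
          "n_generated_questions": 0.05,
          "score_flesch_kincaid": 0.31}
per_feature, n_failed, n_passed = summarize_drift_tests(scores)
print(per_feature, n_failed, n_passed)  # two FAILED tests and one PASSED with these scores
        </preformat>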
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Drift tests: per feature</title>
        <p>In this section, we conduct a detailed examination of the experiment results presented in the previous section, focusing on the analysis of the actual drift scores of the metadata features across the different BWAI configurations. Table 3 provides a comprehensive overview of the PSI drift scores for the metadata features. We focus only on the ones having at least one failed case.</p>
        <p>One noteworthy case involves the metadata feature n_contact_info_questions, which exhibits a PSI score of 3.778. The histograms for this metadata feature are shown in Figure 2. This indicates a significant drift primarily due to the transition of models (i.e., from GPT3.5 to GPT4), without any modifications in prompts, as in both cases the v1 system prompt was used. In turn, for v2, the highest drift was observed for the metadata feature n_generated_questions, with PSI equal to 2.950. This happens when upgrading the generative model from GPT3.5 to GPT4.</p>
        <p>When assessing changes induced solely by changes in the system prompts (i.e., transitioning from v1 to v2) while maintaining the same generative model, overall lower metadata drift scores are observed. Specifically, when utilizing the GPT3.5 model, the highest score among these cases was reported for the metadata feature n_multiple_selection_questions (0.642). Conversely, with the GPT4 model as the generative model, the highest score resurfaced for the metadata feature n_contact_info_questions (1.540). The histograms for this case are shown in Figure 3.</p>
        <fig id="fig3">
          <label>Figure 3</label>
          <caption>
            <p>Histograms for the metadata feature n_contact_info_questions extracted from surveys generated using GPT4 with v1 and v2 prompts.</p>
          </caption>
        </fig>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion</title>
      <sec id="sec-3-1">
        <title>In this study, we proposed a comprehensive evaluation</title>
        <p>framework to enhance the reliability of Large Language
Model (LLM)-based systems for survey generation tasks.</p>
      <p>In practice, this framework serves as a valuable tool for assessing whether intended modifications to system prompts translate effectively into the survey generation process. For instance, the updated system prompt includes specific instructions to nudge the LLM to generate questions which include answer options (as opposed to open-ended questions, which do not). One of these question types is the Multiple Selection question type, and the impact of these instructions between v1 and v2 can be seen in the scores in Table 3 on the row corresponding to the feature n_multiple_selection_questions, as well as in Figure 4, where it is clearly shown that the v2 prompt tends to generate more Multiple Selection questions than the v1 prompt. Also, through the detection of distributional drift of the survey metadata features, we can identify and mitigate potential issues, thereby avoiding unexpected behaviors of the feature.</p>
      <p>Our experimental results demonstrate the effectiveness of the proposed framework in evaluating survey generation metadata features across different configurations of system prompts and generative models. We observed significant differences in survey outputs when transitioning between different versions of LLM models, highlighting the importance of comprehensive evaluation in adapting to model updates. Furthermore, our analysis revealed nuanced insights into the impact of system prompt versions on survey generation quality, underscoring the need for careful consideration of both prompt design and model selection in ensuring reliable survey generation.</p>
      <p>As future work, we aim to integrate automated evaluation strategies to assess the "quality" of the generated surveys. In this scenario, the emphasis shifts from leveraging metadata features to compare differences across different system versions to analyzing the survey content itself. One promising direction is to use LLMs to act as preliminary inspectors of survey quality. This could significantly accelerate our quality assessment process, which currently relies heavily on human evaluation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>O.</given-names>
            <surname>Rambow</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2016 Con</source>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gómez-Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <article-title>A confederacy ference of the North American Chapter of the As-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>creative writing, 2023a.rXiv:2310</source>
          .
          <fpage>08433</fpage>
          .
          <string-name>
            <surname>Language</surname>
            <given-names>Technologies</given-names>
          </string-name>
          , Association for Computa[2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsipras</surname>
          </string-name>
          , D. Soylu, tional Linguistics, San Diego, California,
          <year>2016</year>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Yasunaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Ku-
          <volume>110</volume>
          -
          <fpage>119</fpage>
          . URL: https://aclanthology.org/N16-101.
          <fpage>4</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>mar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Newman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , C. Cos- doi:10.18653/v1/
          <fpage>N16</fpage>
          -1014.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>grove</surname>
          </string-name>
          , C. D.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Ré</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
            Acosta-Navas, [6]
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , J. Wang,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>hak</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Rong</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
          </string-name>
          , K. San- generation models,
          <year>2018a</year>
          .rXiv:
          <year>1802</year>
          .
          <year>01886</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>thanam</surname>
            ,
            <given-names>L.</given-names>
            Orr, L.
          </string-name>
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Yuksekgonul</surname>
            , M. Suz- [7]
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Yuan</surname>
          </string-name>
          , G. Neubig, P. Liu, Bartscore: Eval-
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>gun</surname>
            ,
            <given-names>N.</given-names>
            Kim, N.
          </string-name>
          <string-name>
            <surname>Guha</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Chatterji</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <article-title>Khattab, uating generated text as text generation</article-title>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Henderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Xie</surname>
          </string-name>
          , S. San- arXiv:
          <fpage>2106</fpage>
          .
          <fpage>11520</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>turkar</surname>
            , S. Ganguli,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Hashimoto</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Icard</surname>
            , T. Zhang, [8]
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Ruan</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Pu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Wan</surname>
          </string-name>
          , Llm-based
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mai</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Zhang,</surname>
          </string-name>
          <article-title>nlg evaluation: Current status</article-title>
          and challenges,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koreeda</surname>
          </string-name>
          ,
          <article-title>Holistic evaluation of language models</article-title>
          ,
          <source>arXiv:2402</source>
          .
          <fpage>01383</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <year>2023</year>
          . arXiv:
          <volume>2211</volume>
          .
          <fpage>09110</fpage>
          . [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qu</surname>
          </string-name>
          , [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Lau</surname>
          </string-name>
          ,
          <article-title>The next chapter: A J. Zhou, Zero-shot cross-lingual summarization via</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>study of large language models in storytelling</article-title>
          ,
          <source>2023. large language models</source>
          ,
          <year>2023</year>
          .arXiv:
          <volume>2302</volume>
          .
          <fpage>14229</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <source>arXiv:2301</source>
          .
          <fpage>09790</fpage>
          . [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , W.-L. Chiang,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          , S. Zhuang, [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <surname>Long Z. Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Zhuang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>E. P.</given-names>
          </string-name>
          <string-name>
            <surname>Xing</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <article-title>hierarchical variational model</article-title>
          , in: K. Inui,
          <string-name>
            <surname>J. Jiang,</surname>
          </string-name>
          <article-title>a-judge with mt-bench and chatbot arena</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <surname>X.</surname>
          </string-name>
          Wan (Eds.),
          <source>Proceedings of the 2019 Con- arXiv:2306</source>
          .
          <fpage>05685</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>ference on Empirical Methods in Natural Language</source>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , G-
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <article-title>Processing and the 9th International Joint Con- eval: Nlg evaluation using gpt-4 with better human</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <source>ference on Natural Language Processing (EMNLP- alignment</source>
          ,
          <year>2023</year>
          .arXiv:
          <volume>2303</volume>
          .
          <fpage>16634</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          IJCNLP), Association for Computational Linguis-[12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Freitag</surname>
          </string-name>
          , W. Y.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>tics</surname>
          </string-name>
          , Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3257</fpage>
          -
          <lpage>3268</lpage>
          . URL:
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          , Instructscore: Explainable text gener-
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          https://aclanthology.org/D19-132.
          <year>1doi</year>
          :
          <fpage>10</fpage>
          .18653/
          <article-title>ation evaluation with finegrained feedback</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          v1/
          <fpage>D19</fpage>
          -1321. arXiv:
          <volume>2305</volume>
          .
          <fpage>14282</fpage>
          . [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Galley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brockett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dolan</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kincaid</surname>
          </string-name>
          , Derivation of New Readability Formulas:
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>