Enhancing the Reliability of LLM-based Systems for Survey Generation through Distributional Drift Detection

Vinicius Monteiro de Lira 1,∗,†, Antonio Maiorino 1,†, and Peng Jiang 2,†
1 SurveyMonkey, Padua, Italy
2 SurveyMonkey, San Mateo, California, USA

KiL'24: Workshop on Knowledge-infused Learning, co-located with the 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain
∗ Corresponding author.
† These authors contributed equally.
vmonteirodelira@surveymonkey.com (V. M. d. Lira); amaiorino@surveymonkey.com (A. Maiorino); pjiang@surveymonkey.com (P. Jiang)
ORCID: 0000-0002-7580-1756 (V. M. d. Lira)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract
Evaluating Large Language Model (LLM)-based systems is a recurrent challenge in modern machine learning research and development. It is crucial to ensure that any changes made to the production environment will not negatively impact user experience, and clever evaluation techniques are especially important when updated models or prompts create disparities within the system. Since we released the feature that helps our customers create surveys from textual prompts in 2023, we have iteratively improved several parts of the system, such as the prompts, the LLM models, and the system's internal logic. To measure the impact of these changes, we propose a comprehensive framework for assessing surveys generated by LLMs, focusing on data drift analyses based on survey metadata features. By leveraging this approach, we can effectively identify and address potential areas of concern related to model performance, enhancing the reliability and usability of LLM-based systems for survey generation tasks.

Keywords
LLMs, Survey Generation, Reliability, Distribution Drifts

1. Introduction

We are the global leader in survey software, with our flagship platform enabling the collection of over 20 million answers per day across a vast variety of domains. One of the major goals of our service is to help customers create high-quality surveys by leveraging the wealth of research that internal teams have accumulated over the course of many years in the industry. This translates into the continuous development of features aimed at helping customers create effective surveys that allow them to learn what they are interested in by asking the best possible questions to their audience.

One of the latest features released for this purpose is called Build with AI (BWAI). This feature was released to all users of the platform near the end of 2023 and leverages Large Language Models (LLMs) to allow users to build high-quality surveys through a conversational interface, where users can specify what they want to learn about their audience through a textual description (a prompt), which is then used by the system to generate a survey with relevant questions and context.
Since this application involves generating a long text based on concise instructions provided in a short "seed" input text, it can be particularly challenging because of the very nature of the task, which is more akin to "creative writing" than to other Natural Language Processing (NLP) use cases where "correct" and "wrong" labels could typically be identified. In fact, in this use case it is much harder to determine what a "good" or "bad" generation would look like, as there are many possible examples of good surveys which could be generated starting from the same input prompt.

This paper presents a novel contribution in the form of a comprehensive framework for evaluating surveys generated by LLMs, specifically addressing challenges in survey generation tasks. The framework exploits survey metadata to facilitate data drift analysis, enabling the identification and mitigation of potential issues related to model performance. By systematically analyzing survey metadata and detecting distributional drift, the framework assesses the behavior of LLM-based systems for survey generation.

2. Related Work

The Survey Generation domain that we studied in this work poses its own set of challenges, since typically there is not a "correct" or "wrong" survey; rather, the goodness of the model lies in its ability to follow the instructions specified by the user, while also trying to produce interesting ideas for potentially useful survey questions. This puts our model in an area closer to use cases such as brainstorming and creative writing than to other, more studied areas such as Question Answering, Intent Recognition, and Summarization, where some "ground truths" are usually available and can be used to evaluate the level of quality of the generated text. The lack of ground truth, combined with the lack of standardized metrics for open-ended tasks, makes evaluation even more difficult in our scenario.

As pointed out in [1], these kinds of use cases are often missing from popular benchmarks such as HELM [2], since most of these tend to focus on verifiable, closed-ended, and automated metrics. For reliably evaluating open-ended use cases, researchers and practitioners often hire human raters, as for example done by the authors of [1], who hired 10 human raters to evaluate several LLMs on Creative Writing tasks. While human evaluation is currently still the most effective and reliable way of evaluating such open-ended tasks, human involvement also makes the process much longer and more expensive.

Some strategies proposed in the literature to automatically evaluate the quality of open-ended use cases include measuring the degree of "text quality" through metrics such as text readability and diversity. In [3] the authors distinguish between "reference-based metrics", where the output generated by the model is compared to a similar output written by a human, and "reference-free metrics", where the quality of the outputs is measured directly. Examples of the latter group are n-gram based metrics such as Lexical Repetition [4] and Distinct-3 (D-3) [5], descriptive statistics such as text length, Self-BLEU (SBL) [6], and BARTScore (BAS) [7]. Nonetheless, the authors also report that these metrics often do not seem to agree with each other, and they complement their assessment with human-based measurements.

The authors of [8] analyze the NLG evaluation landscape from another angle that has become more widespread after the advent of large-scale powerful models, which is LLM-based evaluation. These techniques involve using LLMs themselves as "judges" for generated text, and include Scoring, Comparison, Ranking, and Boolean QA among the strategies used to constrain LLMs to output close-ended scores. This direction is exciting because it seems like a promising way to automate evaluation tasks that were previously very hard to automate without models capable of understanding all the nuances in the generated examples, but it also comes with its own challenges and limitations. For example, the authors of [9] showed how the position of texts in pairwise comparisons can influence the outcomes of evaluation results when using GPT models. Other limitations are that LLMs can give higher scores to more verbose and long-winded sentences [10], and also prefer responses generated by themselves as opposed to other LLMs [11].

Another variation of evaluation strategy still based on LLMs is fine-tuning specialized, open-source models specifically for evaluation purposes. This pattern typically involves crafting high-quality evaluation datasets (either synthetically with a powerful LLM, or through human curation), which are then used to fine-tune LLMs to try to distill the human evaluators' knowledge, as in [12].

In summary, for most use cases involving the generation of open-ended text where no ground truth is available, the usual process involves combining some of the "automated" strategies mentioned above with human-based evaluation, with varying weight given to the automated vs. human evaluation based on the particular needs. When the tasks are broader, more nuanced, and "vague", or for tasks where a detailed explanation of the evaluation scores is needed, human evaluation is typically given more weight.
3. Methodology: Evaluation Framework for Survey Generation

Before diving deep into the architecture of our framework, we introduce and formalize a few basic concepts.

3.1. Background: Basic Concepts

A survey is a questionnaire used to collect data from a group of people to gather information, opinions, or feedback on a particular topic or subject. We formally introduce a survey as:

Definition 1 (Survey). A Survey typically consists of several questions designed to gather specific information from respondents. We define a survey as a tuple <h, l> where h represents the survey title and l is the list of questions composing the survey.

In turn, a single survey question can be defined as:

Definition 2 (Survey question). We define a Survey question as a tuple <t, k, o> where t represents the survey question text, k is its type drawn from a predefined taxonomy K, and o represents the list of answer options. Examples of survey question types belonging to K include: open-ended questions, Net Promoter Score (NPS) questions, contact information questions, rating questions, and more. Except for the open-ended questions, a survey question usually has a list of user-defined answer options amongst which the respondent may choose when responding to the question. For example, in the question "What's your work status?", possible answer options could be: "Employed", "Self-employed", "Interning", "Part-time", and "Unemployed".
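To make Definitions 1 and 2 concrete, the sketch below models a survey and a survey question as simple Python data structures. The class and field names are illustrative choices for this paper, not the schema used by our platform, and the question-type strings stand in for the taxonomy K.

```python
# A minimal sketch of Definitions 1 and 2 as Python data structures.
# Names are illustrative, not the production schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SurveyQuestion:
    text: str                  # t: the question text
    question_type: str         # k: drawn from a predefined taxonomy K, e.g. "open_ended", "nps", "single_choice"
    answer_options: List[str] = field(default_factory=list)  # o: empty for open-ended questions

@dataclass
class Survey:
    title: str                        # h: the survey title
    questions: List[SurveyQuestion]   # l: the list of questions

# Example instance for the work-status question mentioned above.
work_status = SurveyQuestion(
    text="What's your work status?",
    question_type="single_choice",
    answer_options=["Employed", "Self-employed", "Interning", "Part-time", "Unemployed"],
)
survey = Survey(title="Employment survey", questions=[work_status])
```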
In our platform, users can leverage the BWAI feature to automatically generate surveys. This process involves users providing their survey intent through a written text (the prompt). Using LLMs, we can streamline the process and allow users to generate high-quality surveys with minimal effort, improving the overall user experience. We formalize a user prompt as follows:

Definition 3 (User prompt). The User prompt embodies the user's intention when creating a survey. Through text, the user can articulate the desired structure and content of the survey.

These generated surveys are designed to align with our established standards, which are the culmination of years of research on best practices and recommendations for creating surveys for large audiences. Our aim is to leverage our domain knowledge to help users create high-quality surveys. To achieve this, we incorporate elements of our guidelines and best practices directly into the system prompt, which defines the "behaviour" of the LLM.

Definition 4 (System prompt). The System prompt serves as the blueprint for instructing the LLM on generating surveys in accordance with elements of our established standards.

Nevertheless, we acknowledge the challenge of ensuring that LLM models accurately follow our instructions, given their inherently unpredictable behavior. We formalize this problem as follows:

Definition 5 (Survey Generation Reliability Problem). Given a user prompt p_u, a system prompt p_s, and a generative model g, our objective is to automatically generate a survey s. The generated survey s should accurately reflect the user's intent as specified in p_u, while also adhering to the survey standards and guidelines detailed in p_s.

3.2. Framework architecture

In order to continuously improve the quality of the surveys generated with BWAI we typically work in iterative cycles, which may introduce new issues while addressing existing ones, potentially impacting model quality. These issues can arise mainly due to changes in the prompts to accommodate new functionalities, or due to switches and upgrades in the generative models at the core of the feature.

To mitigate this risk, we propose a testing framework with automatic tests to ensure expected model behaviors, aiding in risk assessment regarding survey standards and increasing our confidence when evaluating model updates. Unlike traditional machine learning problems such as classification or regression tasks, which have well-defined test sets (ground truth), generative features lack this, necessitating such a framework. Its scope is not to monitor data, but to validate model functionality after changes in the BWAI components. The ultimate goal is to maintain a reliable user experience for our customers, safeguarding against the deployment of new model versions that could introduce unforeseen behavior.

Figure 1 illustrates the comprehensive workflow implemented in our Survey Generation Testing Framework. The User prompts, as described in Definition 3, represent authentic prompts logged in our platform, conveying users' intentions for survey creation. Highlighted in blue are the pivotal components utilized by the BWAI tool for survey generation: the system prompt (as defined in Definition 4) and the generative model (e.g., GPT models or open-source LLMs like Llama, Mistral, etc.). These components constitute the fundamental dimensions of our framework. Generated surveys are leveraged for metadata feature extraction and distributional drift tests. With varied settings of system prompts and generative models, the Survey Generation Testing Framework conducts pairwise analyses to discern drifts between these configurations. These steps are detailed in the next two subsections.

Figure 1: Survey Generation Testing Framework overall workflow.

3.3. Survey metadata features

To measure the impact of our developments on the outputs of the system, we define several metadata features that are computed on sets of surveys generated with different configurations of the BWAI feature. All the metadata features used are based on attributes of the surveys. In Table 1, we outline the complete set of metadata features used in our framework, along with some relevant information. The first column indicates the name of the feature, while the second one specifies the aggregation function applied to the data.

Table 1: List of survey metadata features supported by our framework.

    Feature Name                      Aggregation
    n_contact_info_questions          Count
    n_open_ended_questions            Count
    n_nps_questions                   Count
    n_multiple_selection_questions    Count
    n_closed_ended_questions          Count
    n_generated_questions             Count
    n_unsupported_questions           Count
    n_single_choice_questions         Count
    n_characters_in_survey            Count
    n_words_in_survey                 Count
    n_unsupported_questions           Count
    std_n_words_per_question          Std
    avg_word_length                   Mean
    avg_n_answer_options              Mean
    avg_n_words_per_question          Mean
    avg_n_words_per_answer_option     Mean
    max_word_length                   Max
    any_special_character             Any
    score_flesch_kincaid              Count
    dist_unigrams                     Count
    dist_bigrams                      Count

For example, given the list of questions Q of a given survey, the feature n_open_ended_questions is defined as a simple Count of the "Open Ended" questions in the survey:

    n_open_ended_questions = Σ_{q_i ∈ Q} 1{type(q_i) = "open_ended"}

Most of the features are based on numerical attributes of the survey, with the only exceptions being the feature any_special_character, which is a Boolean attribute, and the features dist_unigrams and dist_bigrams, which are both categorical attributes. One noteworthy feature is score_flesch_kincaid, representing the Flesch-Kincaid Grade Level metric as defined in [13].
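As an illustration of how such metadata features can be derived from a generated survey, the following sketch computes a handful of the features in Table 1 on top of the Survey structure sketched in Section 3.1. The helper name, the whitespace tokenization, and the special-character check are assumptions made for the example, not the production implementation.

```python
# A sketch computing a few of the metadata features in Table 1 for a single
# generated survey. Assumes the Survey/SurveyQuestion dataclasses sketched earlier.
from statistics import mean

def extract_metadata_features(survey: "Survey") -> dict:
    question_word_counts = [len(q.text.split()) for q in survey.questions]
    all_words = [w for q in survey.questions for w in q.text.split()]
    answer_option_counts = [len(q.answer_options) for q in survey.questions if q.answer_options]
    return {
        # Count aggregations
        "n_generated_questions": len(survey.questions),
        "n_open_ended_questions": sum(1 for q in survey.questions if q.question_type == "open_ended"),
        "n_nps_questions": sum(1 for q in survey.questions if q.question_type == "nps"),
        "n_words_in_survey": len(all_words),
        # Mean / Max aggregations
        "avg_n_words_per_question": mean(question_word_counts) if question_word_counts else 0.0,
        "avg_n_answer_options": mean(answer_option_counts) if answer_option_counts else 0.0,
        "max_word_length": max((len(w) for w in all_words), default=0),
        # Any aggregation (illustrative check over the question texts only)
        "any_special_character": any(
            not c.isalnum() and not c.isspace()
            for q in survey.questions for c in q.text
        ),
    }
```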
3.4. Distributional drift tests

We calculate the distribution of each metadata feature. To detect distributional shifts between two different configurations of system prompt and generative model for a specific metadata feature, we compute the Population Stability Index (PSI).

The PSI is a synthetic measure of how much a population has shifted over time or between two different samples of a population. It achieves this by categorizing the two distributions into buckets and assessing the percentage of items in each bucket, culminating in a single scalar value that indicates the disparity between the populations [14]. We use the popular PSI formula:

    PSI = Σ_{i=1}^{n} (P_t^i − P_b^i) · ln(P_t^i / P_b^i)

Where:
• P_t^i is the proportion of the population in the i-th bin (or segment) at time t (typically the test or current time period).
• P_b^i is the proportion of the population in the i-th bin (or segment) at the baseline time period (typically the training or historical time period).
• n is the total number of bins (or segments) in the distribution.

The typical interpretations of PSI outcomes are as follows:
• PSI < 0.1: indicates no significant population change.
• 0.1 ≤ PSI < 0.2: reflects a moderate population change.
• PSI ≥ 0.2: signifies a significant population change.

For our framework, we use 0.2 as the threshold (λ) for the PSI score. Therefore, any value above this threshold is reported as a FAILED test, indicating significant changes in the distributions. For better clarity, Algorithm 1 presents the drift test function utilized in our framework.

Algorithm 1: Metadata drift test algorithm
    procedure DRIFT_TEST(m, λ)
        ▷ Inputs: m is a given survey metadata feature distribution; λ is the PSI threshold.
        if PSI(m) > λ then
            return FAIL
        else
            return PASS
        end if
    end procedure
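The sketch below shows one possible implementation of the PSI formula and of the drift test in Algorithm 1 for a numerical metadata feature. The number of buckets and the small epsilon used to guard against empty buckets are illustrative choices; the paper does not prescribe them.

```python
# A minimal sketch of the PSI computation and the drift test of Algorithm 1.
# Bucket count and epsilon are illustrative choices, not values from the paper.
import numpy as np

def psi(baseline: np.ndarray, test: np.ndarray, n_bins: int = 10, eps: float = 1e-4) -> float:
    # Bucket both samples using bin edges derived from the baseline distribution.
    edges = np.histogram_bin_edges(baseline, bins=n_bins)
    p_b, _ = np.histogram(baseline, bins=edges)
    p_t, _ = np.histogram(test, bins=edges)
    # Convert counts to proportions, guarding against empty buckets.
    p_b = np.clip(p_b / p_b.sum(), eps, None)
    p_t = np.clip(p_t / p_t.sum(), eps, None)
    return float(np.sum((p_t - p_b) * np.log(p_t / p_b)))

def drift_test(baseline: np.ndarray, test: np.ndarray, lam: float = 0.2) -> str:
    """Algorithm 1: flag a metadata feature as drifted when PSI exceeds lambda."""
    return "FAIL" if psi(baseline, test) > lam else "PASS"
```

With lam set to 0.2, this matches the interpretation thresholds above: a PSI at or above 0.2 is treated as a significant change and the test fails.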
4. Experimental results

4.1. Experiment setup

The BWAI system is made up of two primary elements: (a) the system prompt and (b) the generative model. When the feature was released in late 2023, the first version relied on GPT-3.5 Turbo as the core LLM and used a version of the system (including prompts and logic) we will refer to as v1. Later we updated several components of the system, and we will refer to this updated version as v2. We have also experimented with GPT-4 Turbo as a base LLM.

Given this context, in this paper we present real-case tests conducted using two system versions (v1 and v2) and two models for analysis: GPT-3.5 Turbo (GPT3.5) and GPT-4 Turbo (GPT4), both under the "0125" release from the OpenAI API.

BWAI configuration (B_SysPrompt^GenModel). Our objective is to evaluate the differences in survey generation across various combinations of generative models (i.e., GPT3.5 and GPT4) and system prompt versions (i.e., v1 and v2). We list below all the pairs of configurations that we focused on in this analysis. The idea is to have at least one common element in the tuple (i.e., either the prompt or the generative model) to assess the impact when transitioning between versions:

1. <B_v1^GPT3.5, B_v2^GPT3.5>
2. <B_v1^GPT3.5, B_v1^GPT4>
3. <B_v2^GPT3.5, B_v2^GPT4>
4. <B_v1^GPT4, B_v2^GPT4>

System prompts differences. Regarding the differences between the v1 and v2 system prompts, we summarize some of the key improvements that the v2 prompt introduces over the previous version:

• Addition of multilingual support
• Improved output formatting instructions
• Improved instructions to encourage the system to comply with survey research best practices (i.e., avoid open-ended questions where not necessary, order questions from general to specific, etc.)
• Addition of specific instructions to improve creativity
• Longer prompt with much more structure (from 416 to 690 tokens)
• Support for additional use cases such as survey forms

User prompts collection. In order to measure the differences across the system configurations outlined above, we selected a subset of 3185 input prompts collected from real customers who interacted with the BWAI system and consented to let us use their prompts to improve our system. The selection started from the full set of user prompts collected between October 2023 and January 2024, applying the following filters in sequence, with filtering boundaries and parameters determined through ad-hoc analyses to exclude poor-quality samples (a sketch of this filtering sequence is shown after the list):

1. Drop duplicates;
2. Drop input prompts which contain PII or sensitive information, as flagged by our internal privacy-preservation pipelines;
3. Select only inputs written in English;
4. Drop inputs shorter than 200 characters or longer than 500 characters;
5. Drop inputs which led to generated surveys with an outlier number of questions (i.e., <5 or >12).
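A minimal sketch of this filtering sequence is given below. The PII check and the language detector are stubbed out, standing in for the internal privacy-preservation pipeline and a language-identification step; the mapping from a prompt to the number of questions in its generated survey is assumed to be available.

```python
# A sketch of the five-step filtering sequence applied to the raw user prompts.
# contains_pii and detect_language are placeholders, not real APIs.
def contains_pii(prompt: str) -> bool:
    return False          # placeholder for the internal privacy-preservation check

def detect_language(prompt: str) -> str:
    return "en"           # placeholder for a language-identification step

def filter_prompts(prompts, n_questions_by_prompt):
    seen, kept = set(), []
    for p in prompts:
        if p in seen:                                   # 1. drop duplicates
            continue
        seen.add(p)
        if contains_pii(p):                             # 2. drop PII / sensitive prompts
            continue
        if detect_language(p) != "en":                  # 3. keep English prompts only
            continue
        if not (200 <= len(p) <= 500):                  # 4. drop prompts shorter than 200 or longer than 500 characters
            continue
        # 5. drop prompts whose generated surveys have an outlier number of questions
        if not (5 <= n_questions_by_prompt.get(p, 0) <= 12):
            continue
        kept.append(p)
    return kept
```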
4.2. Drift tests: Overall results

For each of the metadata features introduced in Table 1, we perform the distributional drift tests. The number of passed and failed drift tests is shown in Table 2. A failed test means that there is a drift in the output: as introduced in Section 3.4, drift is measured using the Population Stability Index of the two distributions.

Table 2: Overall results of drift tests.

    Experiment                      FAIL    PASS
    <B_v1^GPT3.5, B_v2^GPT3.5>       5      16
    <B_v1^GPT3.5, B_v1^GPT4>        13       8
    <B_v2^GPT3.5, B_v2^GPT4>        16       5
    <B_v1^GPT4, B_v2^GPT4>           9      12

We observe a significant increase in the number of failed tests when transitioning from the GPT3.5 to the GPT4 model. Specifically, there were 16 failed tests when comparing these two models under the system prompt version v2, and 13 failures under version v1.
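To illustrate how the pass/fail summary in Table 2 can be assembled, the sketch below runs the drift test for every configuration pair and every metadata feature, and tallies FAIL/PASS counts. It reuses the psi and drift_test helpers sketched in Section 3.4; the configuration labels and the features_by_config mapping are illustrative.

```python
# A sketch of the pairwise analysis behind Table 2. Assumes drift_test from the
# earlier sketch is in scope and that metadata feature values have already been
# collected for each configuration.
from collections import Counter

CONFIG_PAIRS = [
    ("v1_gpt35", "v2_gpt35"),
    ("v1_gpt35", "v1_gpt4"),
    ("v2_gpt35", "v2_gpt4"),
    ("v1_gpt4", "v2_gpt4"),
]

def summarize_drift(features_by_config, lam: float = 0.2):
    """features_by_config: config name -> {feature name -> array of feature values}."""
    results = {}
    for base, test in CONFIG_PAIRS:
        counts = Counter(
            drift_test(features_by_config[base][f], features_by_config[test][f], lam)
            for f in features_by_config[base]
        )
        results[(base, test)] = (counts["FAIL"], counts["PASS"])
    return results
```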
4.3. Drift tests: per feature

In this section, we conduct a detailed examination of the experiment results presented in the previous section, focusing on the analysis of the actual drift scores of the metadata features across the different BWAI configurations. Table 3 provides a comprehensive overview of the PSI drift scores for the metadata features. We focus only on the ones having at least one failed case.

One noteworthy case involves the metadata feature n_contact_info_questions, which exhibits a PSI score of 3.778. The histograms for this metadata feature are shown in Figure 2. This indicates a significant drift primarily due to the transition of models (i.e., from GPT3.5 to GPT4), without any modifications to the prompts, as in both cases the v1 system prompt was used. In turn, for v2, the highest drift was observed for the metadata feature n_generated_questions, with a PSI equal to 2.950. This also happens when upgrading the generative model from GPT3.5 to GPT4.

When assessing changes induced solely by changes in the system prompts (i.e., transitioning from v1 to v2) while maintaining the same generative model, overall lower metadata drift scores are observed. Specifically, when utilizing the GPT3.5 model, the highest score among these cases was reported for the metadata feature n_multiple_selection_questions (0.642). Conversely, with the GPT4 model as the generative model, the highest score resurfaced for the metadata feature n_contact_info_questions (1.540). The histograms for this case are shown in Figure 3.

Figure 2: Histograms for the metadata feature n_contact_info_questions extracted from surveys generated using GPT4 and GPT3.5 with v1 only.

Figure 3: Histograms for the metadata feature n_contact_info_questions extracted from surveys generated using GPT4 with v1 and v2 prompts.

In practice, this framework serves as a valuable tool for assessing whether intended modifications to system prompts translate effectively into the survey generation process. For instance, the updated system prompt includes specific instructions to nudge the LLM to generate questions which include answer options (as opposed to open-ended questions, which do not). One of these question types is the Multiple Selection question type, and the impact of these instructions between v1 and v2 can be seen in the scores in Table 3, on the row corresponding to the feature n_multiple_selection_questions, as well as in Figure 4, where it is clearly shown that the v2 prompt tends to generate more multiple-selection questions than the v1 prompt. Also, through the detection of distributional drift of the survey metadata features, we can identify and mitigate potential issues, thereby avoiding unexpected behaviors of the feature.

Figure 4: Histograms for the metadata feature n_multiple_selection_questions extracted from surveys generated using GPT4 with v1 and v2 prompts.

Table 3: PSI scores per feature. We show only the features that failed in at least one of the BWAI configurations (B).

    Features                         <B_v1^GPT3.5, B_v2^GPT3.5>   <B_v1^GPT3.5, B_v1^GPT4>   <B_v2^GPT3.5, B_v2^GPT4>   <B_v1^GPT4, B_v2^GPT4>
    avg_n_answer_options                 0.405    0.478    0.424    0.351
    avg_n_words_per_answer_option        0.642    0.311    0.339    0.622
    avg_n_words_per_question             0.000    0.526    0.392    0.367
    drift:bigrams_distribution           0.565    0.860    0.795    0.673
    drift:unigrams_distribution          0.205    0.247    0.222    0.217
    max_word_length                      0.000    0.000    0.219    0.000
    n_closed_ended_questions             0.000    0.318    1.203    0.447
    n_contact_info_questions             0.000    3.778    0.529    1.540
    n_generated_questions                0.000    2.624    2.950    0.000
    n_multiple_selection_questions       0.991    0.000    0.395    1.271
    n_nps_questions                      0.000    1.899    2.300    0.000
    n_open_ended_questions               0.000    0.352    0.846    0.000
    n_single_choice_questions            0.000    0.000    0.504    0.000
    n_words_in_survey                    0.000    1.535    2.068    0.000
    std_n_words_per_question             0.000    1.173    0.419    0.599
    n_characters_in_survey               0.000    1.446    1.871    0.000

5. Conclusion

In this study, we proposed a comprehensive evaluation framework to enhance the reliability of Large Language Model (LLM)-based systems for survey generation tasks. By addressing the challenges associated with accurately following user prompts and maintaining consistency with established standards, the framework functions as a protective barrier, effectively setting guardrails to preempt unforeseen behaviors of our BWAI tool. Through the detection of distributional drift of the survey metadata features, the framework acts as a guiding compass for data scientists to investigate and address any unintended deviations in the application's behavior, thereby ensuring its stability and reliability.

Our experimental results demonstrate the effectiveness of the proposed framework in evaluating survey generation metadata features across different configurations of system prompts and generative models. We observed significant differences in survey outputs when transitioning between different versions of LLM models, highlighting the importance of comprehensive evaluation in adapting to model updates. Furthermore, our analysis revealed nuanced insights into the impact of system prompt versions on survey generation quality, underscoring the need for careful consideration of both prompt design and model selection in ensuring reliable survey generation.

As future work, we aim to integrate automated evaluation strategies to assess the "quality" of the generated surveys. In this scenario, the emphasis shifts from leveraging metadata features to compare differences across different system versions to analyzing the survey content itself. One promising direction is to use LLMs to act as preliminary inspectors of survey quality. This could significantly accelerate our quality assessment process, which currently relies heavily on human evaluation.
References

[1] C. Gómez-Rodríguez, P. Williams, A confederacy of models: a comprehensive evaluation of llms on creative writing, 2023. arXiv:2310.08433.
[2] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cosgrove, C. D. Manning, C. Ré, D. Acosta-Navas, D. A. Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong, H. Ren, H. Yao, J. Wang, K. Santhanam, L. Orr, L. Zheng, M. Yuksekgonul, M. Suzgun, N. Kim, N. Guha, N. Chatterji, O. Khattab, P. Henderson, Q. Huang, R. Chi, S. M. Xie, S. Santurkar, S. Ganguli, T. Hashimoto, T. Icard, T. Zhang, V. Chaudhary, W. Wang, X. Li, Y. Mai, Y. Zhang, Y. Koreeda, Holistic evaluation of language models, 2023. arXiv:2211.09110.
[3] Z. Xie, T. Cohn, J. H. Lau, The next chapter: A study of large language models in storytelling, 2023. arXiv:2301.09790.
[4] Z. Shao, M. Huang, J. Wen, W. Xu, X. Zhu, Long and diverse text generation with planning-based hierarchical variational model, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3257–3268. URL: https://aclanthology.org/D19-1321. doi:10.18653/v1/D19-1321.
[5] J. Li, M. Galley, C. Brockett, J. Gao, B. Dolan, A diversity-promoting objective function for neural conversation models, in: K. Knight, A. Nenkova, O. Rambow (Eds.), Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, California, 2016, pp. 110–119. URL: https://aclanthology.org/N16-1014. doi:10.18653/v1/N16-1014.
[6] Y. Zhu, S. Lu, L. Zheng, J. Guo, W. Zhang, J. Wang, Y. Yu, Texygen: A benchmarking platform for text generation models, 2018. arXiv:1802.01886.
[7] W. Yuan, G. Neubig, P. Liu, Bartscore: Evaluating generated text as text generation, 2021. arXiv:2106.11520.
[8] M. Gao, X. Hu, J. Ruan, X. Pu, X. Wan, Llm-based nlg evaluation: Current status and challenges, 2024. arXiv:2402.01383.
[9] J. Wang, Y. Liang, F. Meng, B. Zou, Z. Li, J. Qu, J. Zhou, Zero-shot cross-lingual summarization via large language models, 2023. arXiv:2302.14229.
[10] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. arXiv:2306.05685.
[11] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, C. Zhu, G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023. arXiv:2303.16634.
[12] W. Xu, D. Wang, L. Pan, Z. Song, M. Freitag, W. Y. Wang, L. Li, Instructscore: Explainable text generation evaluation with fine-grained feedback, 2023. arXiv:2305.14282.
[13] J. Kincaid, Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel, Research Branch report, Chief of Naval Technical Training, Naval Air Station Memphis, 1975. URL: https://books.google.it/books?id=4tjroQEACAAJ.
[14] R. Taplin, C. Hunt, The population accuracy index: A new measure of population stability for model monitoring, Risks 7 (2019) 53.