<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ACM KDD Conference, August</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Enhancing the Reliability of LLMs-based Systems for Survey Generation through Distributional Drift Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vinicius Monteiro deLira</string-name>
          <email>vmonteirodelira@surveymonkey.co</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>AntonioMaiorino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peng Jiang</string-name>
          <email>pjiang@surveymonkey.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>LLMs, Survey Generation, Reliability, Distribution Drifts</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SurveyMonkey</institution>
          ,
          <addr-line>Padua</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>SurveyMonkey</institution>
          ,
          <addr-line>San Mateo, California</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>26</volume>
      <issue>2024</issue>
      <abstract>
        <p>Evaluating Large Language Model (LLM)-based systems is a recurrent challenge in modern machine learning research and development. It is crucial to ensure that any changes made in the production environments will not negatively impact user experience, and clever evaluation techniques are especially important when updated models or prompts create disparities within the system. Since we released the feature to help our customers create surveys with textual prompts in 2023, we have iteratively improved several parts of the system such as the prompts, the LLM models and the system's internal logic. To measure the impact of these changes, we propose a comprehensive framework for assessing surveys generated by LLMs, focusing on data drift analyses based on survey metadata features. By leveraging this approach, we can efectively identify and address potential areas of concern related to model performance, enhancing the reliability and usability of LLM-based systems for survey generation tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>ative writing” than to other Natural Language Processing</kwd>
        <kwd>ness of the model lies in its ability to follow the instruc-</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>possible questions to their audience.</p>
      <p>We are the global leader in survey software, with our flag- (NLP) use cases where “correct” and “wrong” labels could
ship platform enabling the collection of over 20 milliontypically be identified. In fact, in this use case it is much
answers per day across a vast variety of domains. Onhearder to determine what a “good” or “bad” generation
of the major goals of our service is to help customerwsould look like, as there are many possible examples of
create high-quality surveys by leveraging the wealth ogfood surveys in which could be generated starting from
research that internal teams have accumulated over tthhee same input prompt.
course of many years in the industry. This translates intoThis paper presents a novel contribution in the form
the continuous development of features aimed at helopf- a comprehensive framework for evaluating surveys
ing customers create efective surveys that allow themgenerated by LLMs, specifically addressing challenges
to learn what they’re interested in by asking the beinstsurvey generation tasks. The framework exploits
survey metadata to facilitate data drift analysis, enabling
conversational interface, where users can specify what
to allow users to build high quality surveys through faor survey generation.
they want to learn about their audience through a textual
description (aprompt), which will be used by the system 2. Related Work
to generate a survey with relevant questions and context.</p>
      <p>One of the latest features released for this purptohsee identification and mitigation of potential issues
reis called Build with AI (BWAI). This feature has been lated to model performance. By systematically analyzing
released to all the users of the platform near the esnudrvey metadata and detecting distributional drift, the
of 2023 and leverages Large Language Models (LLMs) framework assesses the behavior of LLM-based systems</p>
      <p>Since this application involves generating a long texTthe Survey Generation domain that we studied in this
based on concise instructions provided in a short ”seedw”ork poses its own set of challenges since typically there
input text, it can be particularly challenging because oifs not a ”correct” or ”wrong” survey, but rather the
goodKiL’24: Workshop on Knowledge-infused Learning co-located with tions specified by the user, while also trying to produce
nEvelop-O
∗Corresponding author.
†These authors contributed equally.</p>
      <p>This puts our model in an area closer to use cases such
as brainstorming and creative writing than to other, more
studied areas such as Question Answering, Intent
Recognition and Summarization, where some “ground truths”
of quality of the generated text. The lack of ground truotrhthrough human curation), which are then used to
combined with the lack of standardized metrics for op efinne--tune LLMs to try and distill the Human Evaluators’
ended tasks makes evaluation even more dificult in our knowledge, as in [12].
scenario. In summary, for most use cases involving the
genera</p>
      <p>
        As pointed out in1[] these kinds of use cases are often tion of open-ended text where no ground truth is
availmissing from popular benchmarks such as HELM2[], able the usual process involves combining some of the
since most of these tend to focus on verifiable, closed“-automated” strategies mentioned above with
Humanended and automated metrics. For reliably evaluatinbgased evaluation, with varying weight given to the
Auopen-ended use cases researchers and practitioners oftentomated vs Human evaluation based on the particular
hire human raters, as for example done by the authors onfeeds. When the tasks are broader, more nuanced, and
[
        <xref ref-type="bibr" rid="ref24">1</xref>
        ] who hired 10 human raters to evaluate several LLM“svague”, or for tasks where a detailed explanation of the
on Creative Writing tasks. While a human evaluationevaluation scores is needed, human evaluation is
typiis currently still the most efective and reliable way ofcally given more weight.
evaluating such open-ended tasks, human involvement
also makes the process much longer and expensive.
      </p>
      <p>Some strategies proposed in the literature to autom3a.t-Methodology: Evaluation
ically evaluate the quality of open-ended use cases in- Framework for Survey
clude measuring the degree of “text quality” through Generation
metrics such as text readability and diversity3.]Inth[e
authors distinguish between “reference-based metricBse”,fore diving deep into the architecture of our
framewhere the output generated by the model is compared wtoork, we introduce and formalize a few basic concepts.
a similar output written by a human, and “reference-free
metrics”, where the quality of the outputs is measured
directly, with some examples of the latter group bein3g.1. Background: Basic Concepts
n-gram based metrics such as ”Lexical Repetition4”] [ A survey is a questionnaire used to collect data from
and ”Distinct-3 (D-3)5[]”, descriptive statistics such asa group of people to gather information, opinions, or
text length, Self-BLEU (SBL)6][ and BARTScore (BAS) feedback on a particular topic or subject. We formally
[7]. Nonetheless, the authors also report that these mientt-roduce asurvey as:
rics often do not seem to agree with each other, and they
complement their assessment with human-based meaD- efinition 1 (Survey). A Survey typically consists of
sevsurements. eral questions designed to gather specific information from</p>
      <p>The authors of 8[] analyze the NLG evaluation respondents. We define a survey as a tuple &lt; ℎ,  &gt; where
landscape from another angle that’s becoming moreh represents the survey title and l is the list of questions
widespread after the advent of big-scale powerful models,composing the survey.
which is the LLM-based evaluation. These techniques
involve using LLMs themselves as “judges” for gener- In turn, a singlesurvey question can be defined as:
ated text and include Scoring, Comparison, RankingD,efinition 2 (Survey question.) We define a Survey
quesand Boolean QA among the strategies used to constraintiton as a tuple &lt; , ,  &gt; where t represents the survey
quesLLMs to output close-ended scores. This direction is etxi-on text, k is its type drawn from a predefined taxonomy
citing because it seems like a promising way to automate</p>
      <p>, and o represents the list of answer options. Examples of
evaluation tasks that were previously very hard to autsuor-vey question types belonging to  include: open-ended
mate without models capable of understanding all thqeuestions, Net Promoter Score (NPS) questions, contact
innuances in the generated examples, but it also comesformation questions, rating questions, and more. Except
with its own challenges and limitations. For examplef,or the ”Open-ended” questions, a survey question usually
the authors of9[] showed how the position of texts inhas a list of user-defined answer options amongst which
pairwise comparisons can influence the outcomes of eval- the respondent may choose to respond to the question. For
uation results when using GPT models. Other limitationesxample, in the question ”What’s your work status?”,
possiare that LLMs can give higher scores to more verbose anbdle answer options could be: ”Employed”, ”Self-employed”,
long-winded sentences 1[0], and also prefer responses ”Interning”, ”Part-time”, and ”Unemployed”.
generated by themselves as opposed to other LLM11s][.</p>
      <p>Another variation of Evaluation strategy still basedIn our platform, users can leverage tBhWeAI feature
on LLMs is represented by fine-tuning specialized, open- to automatically generate surveys. This process involves
source models specifically for evaluation purposes. This users providing their survey intent through a written text
pattern typically involves crafting high-quality Evalua-(the prompt). Using LLMs, we can streamline the process
tion datasets (either synthetically with a powerful LLaMn d, allow users to generate high-quality surveys with
Definition 4 (System prompt.) The System prompt serves
as the blueprint for instructing the LLM on generating
surveys in accordance with elements of our established
standards.
minimal efort, increasing the level of our user experiences.afeguarding against the deployment of new model
verWe formalize a user prompt as follows: sions that could introduce unforeseen behavior.
Figure1 illustrates the comprehensive workflow
imDefinition 3 (User prompt.) The User prompt embodies plemented in our Survey Generation Testing Framework .
the user’s intention when creating a survey. Through text, The User prompts, as described in Definitio3n, represent
the user can articulate the desired structure and content of authentic prompts logged in our platform, conveying
the survey. users’ intentions for survey creation. Highlighted in</p>
      <p>These generated surveys are designed to align witbhlue are the pivotal components utilized by tBhWeAI
our established standards, which are the culmination otfool for survey generation: the system prompt (as
deyears of research on best practices and recommendationfinsed in 4) and the generative model (e.g., GPT models
for creating surveys for large audiences. Our aim is toor open-source LLMs like llama, mistral, etc.). These
leverage our domain knowledge to help users to creatceomponents constitute the fundamental dimensions of
high-quality surveys. To achieve this, we incorporateour framework. Generated surveys are leveraged for
elements of our guidelines and best practices directmlyetadata feature extraction and distributional drift tests.
into the system prompt which defines the ”behaviour” ofWith varied settings of system prompts and generative
the LLM. models, the Survey Generation Testing Framework
conducts pairwise analyses to discern drifts between these
configurations. These steps are better detailed in the next
two subsections.</p>
      <sec id="sec-2-1">
        <title>3.3. Survey metadata features</title>
        <p>Nevertheless, we acknowledge the challenge of ensurT-o measure the impact of our developments on the
outing that LLM models accurately follow our instructionpsu, ts of the system, we define several metadata features
given their inherently unpredictable behavior. We fotrh-at are computed on sets of surveys generated with
malize this problem as follows: diferent configurations of theBWAI feature.
Definition 5 (Survey Generation Reliability Proble.m) All the metadata features used are based on some
atGiven a user prompt   , a system prompt   , and a genera- tributes of the surveys. In Tab1l,ewe outline the
comtive model  , our objective is to automatically generate a plete set of metadata features used in our framework,
survey  . The generated survey  should accurately reflect along with some relevant information. The column first
the user’s intent as specified in   , while also adhering to column indicates the name of the feature, while the
secthe survey standards and guidelines detailed in   . ond one specifies the aggregation function applied to the
data.</p>
        <p>For example, given a list of questio ns for a given
3.2. Framework architecture survey, the featurne_open_ended_questions is defined
In order to continuously improve the quality of the suars-a simple Count which counts the number of ”Open
veys generated witBhWAI we typically work in iterativeEnded” questions in a survey:
cycles, which may introduce new issues while addressing
existing ones, potentially impacting model quality. These
issues can arise mainly due to changes in the prompts _ _ _ = ∑ 1{type(  ) =”open_ended”}
to accommodate new functionalities or due to switches   ∈
and upgrades in the generative models at the core of theMost of the features are based on numerical attributes
feature. of the survey, with the only exceptions being represented</p>
        <p>To mitigate this risk, we propose a testing framewobryk the feature _ _ℎ , which is a Boolean
with automatic tests to ensure expected model behavioartst,ribute, and the features_ and  _
aiding in risk assessment regarding survey standardwshich are both categorical attributes. One
noteworand increasing our confidence when evaluating model thy feature is thsceore_flesch_kincaid , representing the
updates. Unlike traditional machine learning problemFslesch-Kincaid Grade Level metric as defined in1[3].
such as classification or regression tasks which have
welldefined test sets (ground truth), generative features la3ck.4. Distributional drift tests
this, necessitating such a framework. Its scope is not to
monitor data, but to validate model functionality dueWteo calculate the distribution for each metadata feature.
changes in theBWAI components. The ultimate goal is To detect distributional shifts between two diferent
conto maintain a reliable user experience for our customerfigsu,rations of system prompt and generative model for
the PSI score. Therefore, any value above the score is
called a FAILED test, indicating significant changes in the
distributions. For better clarity, Algori1tphrmesents the
drift test function algorithm utilized in our framework.</p>
        <p>Algorithm 1 Metadata drift test algorithm
 is a given survey metadata feature distribution.
procedure drift_test(,  )
▷ Inputs:
 is the PSI threshold.</p>
        <p>if PSI() &gt;  then</p>
        <p>returnFAIL
else
end if
returnPASS
a specific metadata feature, we compute the Population</p>
        <sec id="sec-2-1-1">
          <title>Stability Index (PSI).</title>
          <p>The PSI is a synthetic measure of how much a pop- end procedure
ulation has shifted over time or between two diferent
samples of a population. It achieves this by
categorizing the two distributions into buckets and assessing the
percentage of items in each bucket, culminating in a sin4-. Experimental results
gle scalar value that indicates the disparity between the
populations [14]. We use the popular PSI formula:</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>4.1. Experiment setup:</title>
        <p>Where:

=1
  =
∑(   − 

 ) ⋅ln (

 



)
•    is the proportion of the population in the i-twhill refer to avs1. Later we updated several components
bin (or segment) at time t (typically the testoofrthe system, and we will refer to this updated version
current time period).
•    is the proportion of the population in theai-base LLM.</p>
        <p>th bin (or segment) at the baseline time period Given this context, in this paper we present real-case
(typically the training or historical time periodt)e.sts conducted using two system versiovn1sa(nd v2)
The BWAI system is made up of two primary elements: (a)
the system prompt and (b) the generative model. When
the feature was released in late 2023 the first version was
relying on GPT3.5-Turbo as the core LLM and used a
version of the system (including prompts and logic) we
as v2. Also, we have experimented with GPT4-Turbo as
and two models for analysis: GTP-3.5 Turbo (GPT3.5) 4.2. Drift tests: Overall results
and GPT-4 Turbo (GPT4), both under the ”0125” release
ftrooBemWvatAlhuIeactOoepntefighnueArdaItiAfeioPrnIe.n(ceℬsin surve)y. gOeunreroabtjeicotnivaecriosss &lt;Eℬxpve31r.5ime,nℬtv32.5 &gt; FA5IL
various combinations of generative models (i.e. GPT3.5 &lt;ℬv31.5 , ℬv41 &gt; 13
WanedliGstPTa4ll)tahnedpsayisrtseomf epvraolmuapttiovnesrstihoants w(ie.ve1f.oacnudsevd2)o.n in &lt;&lt;ℬℬvv3421.5 , ℬ,ℬv42v42 &gt;&gt; 196
this analysis. The idea is to have at least one commonTable 2
element in the tuple (i.e. either the prompt or the gOenve-rall results of drift tests
erative model) to assess the impact when transitioning
between versions:</p>
        <sec id="sec-2-2-1">
          <title>For each of the metadata features introduce1d, wine</title>
          <p>1. &lt;ℬv31.5 , ℬv32.5 &gt; perform the distribution drift tests.
2. &lt;ℬv31.5 , ℬv41 &gt; The number of passed and failed drift tests is shown
3. &lt;ℬv32.5 , ℬv32.5 &gt; in table2. A failed test means that there is a drift in
4. &lt;ℬv41 , ℬv42 &gt; the output. As introduced in the sect3i.o4n,it measures
drift by using the Population Stability Index of the two</p>
          <p>System prompts diferences. Regarding the difer- distributions.
ences between thev1 and v2 system prompts, we summa- We observe a significant increase in the number of
rize some of the key improvements that the v2 promfpatiled tests when transitioning from the GPT3.5 to the
introduces over the previous version: GPT4 model. Specifically, there were 16 failed tests when
comparing these two models for the system prompt
version v2, and 13 failures for versiovn1.
• Addition of multilingual support
• Improved output formatting instructions
• Improved instructions to encourage the system
to comply with survey research best practices (i.4e..3. Drift tests: per feature
avoid open-ended questions where not necessaryI,n this section, we conduct a detailed examination of
order questions from general to specific, etc.) the experiment results presented in the previous section,
• Addition of specific instructions to improve cref-ocusing on the analysis of actual drift scores of metadata
ativity features across the diferenBtWAI configurations. Table
• Longer prompt with much more structure in th3eprovides a comprehensive overview of the PSI drift
system prompt (416 to 690 tokens) scores for the metadata features. We focus only on the
• Support for additional use cases such as surveyones having at least a failed case.</p>
          <p>forms One noteworthy case involves the metadata feature
User prompts collection. In order to measure the dif-n_contact_ info_questions, which exhibits a PSI score of
ferences across the system configurations outlined above3.778. The histograms for this metadata feature is shown
we selected a subset of 3185 input prompts which havein Figure2. This indicates a significant drift primarily due
been collected from real customers who have interactetdo the transition of models (i.e., from GPT3.5 to GPT4),
with theBWAI system and consented to let us use theiwrithout any modifications in prompts where in both
prompts to improve our system. The selection has beecnases thev1 system prompt was used. In turn, fvo2r,
the highest drift was observed for the metadata feature
done starting from the full set of user prompts collected
between October 2023 and January 2024 and applying then_generated_questions with PSI equals to 2.950. This
hapfollowing filters in sequence (with filtering boundaries pens when upgrading the generative model from GPT3.5
and parameters determined through ad-hoc analyses ttoo GPT4.
exclude poor quality samples): When assessing changes induced solely by changes
in the system prompts (i.e., transitioning from v1 to
1. Drop duplicates; v2) while maintaining the same generative model,
over2. Drop input prompts which contain PII or sensitiveall lower metadata drift scores are observed.
Specifinformation as flagged by our internal privacy-ically, when utilizing the GPT3.5 model, the highest
preservation pipelines; score among these cases was reported for the
meta3. Select only inputs written in English; data featurne_multiple_selection_questions (0.642).
Con4. Drop inputs shorter than 200 characters anvdersely, with the GPT4 model as the generative model,
longer than 500 characters; the highest score resurfaced for the metadata feature
5. Drop inputs which led to generated surveys witnh_contact_info_questions (1.540). The histograms for this
an outlier number of questions (i.&lt;e.5 or &gt;12). case are shown on Figure3.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion</title>
      <sec id="sec-3-1">
        <title>In this study, we proposed a comprehensive evaluation</title>
        <p>framework to enhance the reliability of Large Language
Model (LLM)-based systems for survey generation tasks.</p>
        <p>By addressing the challenges associated with accurately
following user prompts and maintaining consistency
with established standards, the framework functions as
a protective barrier, efectively setting guardrails to
preempt unforeseen behaviors of our BWAI tool. Through
the detection of distributional drift of the survey
metadata features, the framework acts as a guiding compass
Figure 3: Histograms for the metadata feature n_con- for data scientists to investigate and address any
unintact_info_questions extracted from surveys generated using tended deviations in the application’s behavior, thereby
GPT4 with v1 and v2 prompts ensuring its stability and reliability.</p>
        <p>Our experimental results demonstrate the
efectiveness of the proposed framework in evaluating survey
generation metadata features across diferent
configu</p>
        <p>In practice, this framework serves as a valuable tool
for assessing whether intended modifications to systemrations of system prompts and generative models. We
observed significant diferences in survey outputs when
prompts translate efectively into the survey generation
process. For instance, the updated system prompt itnr-ansitioning between diferent versions of LLM models,
cludes specific instructions to nudge the LLM to gen-highlighting the importance of comprehensive
evaluation in adapting to model updates. Furthermore, our
erate questions which include answer options (as
opposed to open-ended questions which do not). Oneanalysis revealed nuanced insights into the impact of
system prompt versions on survey generation quality,
of these question types is represented by the
Multi</p>
        <p>underscoring the need for careful consideration of both
ple Selection question type, and the impact of these
instructions between V1 and V2 can be seen on theprompt design and model selection in ensuring reliable
survey generation.
scores in Table3 on the row corresponding to the
feature _ _ _ , as well as in Figure4 atAiosnfsuttruarteewgioerskt,owaessaeimssttohein”tqeugarlaittye”aouftotmheatgeedneervaatlue-d
where it is clearly shown that the V2 prompt tends stuorveys. In this scenario, the emphasis shifts from
levergenerate mor e _ questions than the V1 aging metadata features to compare diferences across
prompt. Also, through the detection of distributional</p>
        <p>diferent system versions to analyzing the survey
condrift of the survey metadata features, we can identify and
mitigate potential issues, thereby avoiding unexpectetdent itself. One promising direction is to use LLMs to act
behaviors of the feature. as preliminary inspectors of survey quality. This could
significantly accelerate our quality assessment process,
which currently relies heavily on human evaluation.
Reading Ease Formula) for Navy Enlisted Personnel,
Research Branch report, Chief of Naval Technical
Training, Naval Air Station Memphis, 1975. URL:
https://books.google.it/books?id=4tjroQEACA.AJ
[14] R. Taplin, C. Hunt, The population accuracy index:</p>
        <p>A new measure of population stability for model
monitoring, Risks 7 (2019) 53.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>O.</given-names>
            <surname>Rambow</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2016 Con</source>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gómez-Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <article-title>A confederacy ference of the North American Chapter of the As-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>creative writing, 2023a.rXiv:2310</source>
          .
          <fpage>08433</fpage>
          .
          <string-name>
            <surname>Language</surname>
            <given-names>Technologies</given-names>
          </string-name>
          , Association for Computa[2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsipras</surname>
          </string-name>
          , D. Soylu, tional Linguistics, San Diego, California,
          <year>2016</year>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Yasunaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Ku-
          <volume>110</volume>
          -
          <fpage>119</fpage>
          . URL: https://aclanthology.org/N16-101.
          <fpage>4</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>mar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Newman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , C. Cos- doi:10.18653/v1/
          <fpage>N16</fpage>
          -1014.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>grove</surname>
          </string-name>
          , C. D.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Ré</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
            Acosta-Navas, [6]
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , J. Wang,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>hak</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Rong</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
          </string-name>
          , K. San- generation models,
          <year>2018a</year>
          .rXiv:
          <year>1802</year>
          .
          <year>01886</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>thanam</surname>
            ,
            <given-names>L.</given-names>
            Orr, L.
          </string-name>
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Yuksekgonul</surname>
            , M. Suz- [7]
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Yuan</surname>
          </string-name>
          , G. Neubig, P. Liu, Bartscore: Eval-
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>gun</surname>
            ,
            <given-names>N.</given-names>
            Kim, N.
          </string-name>
          <string-name>
            <surname>Guha</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Chatterji</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <article-title>Khattab, uating generated text as text generation</article-title>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Henderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Xie</surname>
          </string-name>
          , S. San- arXiv:
          <fpage>2106</fpage>
          .
          <fpage>11520</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>turkar</surname>
            , S. Ganguli,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Hashimoto</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Icard</surname>
            , T. Zhang, [8]
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Ruan</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Pu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Wan</surname>
          </string-name>
          , Llm-based
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mai</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Zhang,</surname>
          </string-name>
          <article-title>nlg evaluation: Current status</article-title>
          and challenges,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koreeda</surname>
          </string-name>
          ,
          <article-title>Holistic evaluation of language models</article-title>
          ,
          <source>arXiv:2402</source>
          .
          <fpage>01383</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <year>2023</year>
          . arXiv:
          <volume>2211</volume>
          .
          <fpage>09110</fpage>
          . [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qu</surname>
          </string-name>
          , [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Lau</surname>
          </string-name>
          ,
          <article-title>The next chapter: A J. Zhou, Zero-shot cross-lingual summarization via</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>study of large language models in storytelling</article-title>
          ,
          <source>2023. large language models</source>
          ,
          <year>2023</year>
          .arXiv:
          <volume>2302</volume>
          .
          <fpage>14229</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <source>arXiv:2301</source>
          .
          <fpage>09790</fpage>
          . [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , W.-L. Chiang,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          , S. Zhuang, [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <surname>Long Z. Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Zhuang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>E. P.</given-names>
          </string-name>
          <string-name>
            <surname>Xing</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <article-title>hierarchical variational model</article-title>
          , in: K. Inui,
          <string-name>
            <surname>J. Jiang,</surname>
          </string-name>
          <article-title>a-judge with mt-bench and chatbot arena</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <surname>X.</surname>
          </string-name>
          Wan (Eds.),
          <source>Proceedings of the 2019 Con- arXiv:2306</source>
          .
          <fpage>05685</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>ference on Empirical Methods in Natural Language</source>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , G-
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <article-title>Processing and the 9th International Joint Con- eval: Nlg evaluation using gpt-4 with better human</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <source>ference on Natural Language Processing (EMNLP- alignment</source>
          ,
          <year>2023</year>
          .arXiv:
          <volume>2303</volume>
          .
          <fpage>16634</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          IJCNLP), Association for Computational Linguis-[12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Freitag</surname>
          </string-name>
          , W. Y.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>tics</surname>
          </string-name>
          , Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3257</fpage>
          -
          <lpage>3268</lpage>
          . URL:
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          , Instructscore: Explainable text gener-
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          https://aclanthology.org/D19-132.
          <year>1doi</year>
          :
          <fpage>10</fpage>
          .18653/
          <article-title>ation evaluation with finegrained feedback</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          v1/
          <fpage>D19</fpage>
          -1321. arXiv:
          <volume>2305</volume>
          .
          <fpage>14282</fpage>
          . [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Galley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brockett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dolan</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kincaid</surname>
          </string-name>
          , Derivation of New Readability Formulas:
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>