<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bhashithe Abeysinghe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruhan Circi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>American Institutes for Research</institution>
          ,
          <addr-line>Arlington, VA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Chatbots have been an interesting application of natural language generation since its inception. With novel transformer-based generative AI methods, building a chatbot has become trivial, and chatbots targeted at specific domains, for example medicine and psychology, are implemented rapidly. This, however, should not distract from the need to evaluate chatbot responses, especially because the natural language generation community does not entirely agree on how to effectively evaluate such applications. In this work we discuss this issue further, with attention to the increasingly popular LLM-based evaluations and how they correlate with human evaluations. Additionally, we introduce a comprehensive factored evaluation mechanism that can be utilized in conjunction with both human and LLM-based evaluations. We present the results of an experimental evaluation conducted using this scheme in one of our chatbot implementations, which consumed educational reports, and subsequently compare automated, traditional human, factored human, and factored LLM evaluation. Results show that factor-based evaluation produces better insights into which aspects need to be improved in LLM applications, and further strengthens the argument for using human evaluation in critical spaces where the main functionality is not direct retrieval.</p>
      </abstract>
      <kwd-group>
        <kwd>LLM</kwd>
        <kwd>Human Evaluation</kwd>
        <kwd>Evaluation Challenges</kwd>
        <kwd>factor based evaluation</kwd>
        <kwd>LLM Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The landscape of chatbot development is rapidly evolving, propelled by advancements in
Large Language Model (LLM) APIs. While the pace of development is exciting, there is a gap
between building an LLM-powered application and building a reliable system with LLMs. Closing
this gap requires carefully considering whether the final product satisfies all requirements and
evaluating it to test its alignment with performance and ethical standards. As highlighted by
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], this evaluation process should encompass both a technical assessment and a trust-oriented
framework; it is essential to strike a balance between operational efficiency and responsible
usage.
      </p>
      <p>
        This process is further complicated by common pitfalls of LLMs: several authors [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
        ]
identify areas where LLMs make mistakes, such as hallucination, tone, and output formatting.
      </p>
      <p>Effective evaluation helps improve and maintain validity and consistency and avoid these
common pitfalls. The development of an effective evaluation system is timely for researchers
and developers alike, given the proliferation of LLM-based generative applications such as
chatbots.</p>
      <p>
        The development cycle of a generic LLM-based application typically covers three phases: a)
selection of the LLM, b) iterative development of the application, and c) operational deployment of
the app. The evaluation of LLMs themselves, as discussed in various papers [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], is beyond
the scope of this brief. However, it is essential to note that the quality of the base LLM is a
fundamental component in leveraging its capabilities effectively and minimizing risk in the
resulting application. For applications, developers may follow different development approaches
(e.g., fine-tuning, chaining, prompting, Retrieval Augmented Generation (RAG), LLM search
combined with knowledge graphs, etc.), and each approach demands tailored evaluation steps,
e.g., the quality of data used in fine-tuning or prompting styles [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], or chunk size and quantity in
RAG [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This paper explores three fundamental approaches for evaluating the final response
(i.e., output) generated by LLM-based chatbots, namely automated metrics, human evaluation,
and LLM-based evaluation. With respect to human evaluation, we investigate preferential
and factored evaluation methods.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        Chatbots interact with users in order to resolve user queries. Some chatbots are
domain specific [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] while others are general purpose [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Evaluating a chatbot largely
hinges on its intended use and specialization. In reviewing 16 papers on this
topic, we summarized several key components that require attention for the evaluation; among
these, a clear definition of the chatbot's intended purpose (i.e., the use case, which specifies the
business goal or client expectations and how users interact with the app) is critical. Such clarity
enables a focused evaluation of whether the chatbot attains its designated purpose.
      </p>
      <p>The components described in Table 1 suggest that chatbots can be evaluated along different
aspects (also known as factors or dimensions), such as their ability to answer users' queries
completely, their linguistic effectiveness, and their ability to recall information (either through
information retrieval or memory). Additional metrics may include the system's response time,
usability, and intuitiveness.</p>
      <p>
        Currently, there are no common methods or agreed-upon best practices that are robust
enough to evaluate LLM-based applications. As pointed out in almost all prior work on
this topic, a notable challenge is the lack of consensus on appropriate evaluation criteria and
metrics. Therefore, researchers and developers bear the responsibility of choosing the evaluation
methods most appropriate for their unique application. This responsibility may not
only increase development timelines but may also lead to underpowered statistical evaluations
[
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. A recurring issue with automated metrics is that they produce inconsistent results
and may not always correlate with human evaluation, yet many still prefer them
because they are readily available and easily repeatable [
        <xref ref-type="bibr" rid="ref14 ref15 ref16 ref17 ref18">14, 15, 16, 17, 18</xref>
        ]. This
is not the case with human evaluation, which is expensive and will not be repeatable in the same
context even with the same humans [
        <xref ref-type="bibr" rid="ref13 ref19 ref20">19, 13, 20</xref>
        ]. We must also acknowledge work in which
generative AI models are used at the evaluation step, such as ChatEval, GPTScore,
and ARES [
        <xref ref-type="bibr" rid="ref21 ref22 ref23">21, 22, 23</xref>
        ], which are novel applications of LLMs. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] discusses “bot-play”,
where an already-evaluated LLM is used to evaluate a new, unevaluated LLM. When
considering LLM-based evaluators, one must ensure the evaluator LLM produces acceptable
and accurate decisions to a given threshold.
      </p>
      <p>
        Human evaluation remains the most widely accepted form of evaluation in research studies
despite frequent reports of underpowered results [
        <xref ref-type="bibr" rid="ref13 ref25">25, 13</xref>
        ]. There have been several calls
for the standardization of human evaluation methods [
        <xref ref-type="bibr" rid="ref20 ref26">26, 20</xref>
        ], but its costly nature often leads
researchers to report on systems with statistically insufficient power. Additionally, the sensitivity
of human evaluators to the framing of questions (negative or positive) is reported
to influence outcomes [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. For conversational or dialogue systems, the common standard of
human evaluation is quality rated on Likert scales. Quality can vary across tasks, and it encompasses
multiple factors such as correctness, relevance, informativeness, consistency, understanding,
etc. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] suggest using a minimum of 100 questions rated on 5- or 7-point Likert scales to
evaluate multiple dimensions. This is a difficult goal to achieve given the expensive
nature of human evaluation.
      </p>
      <p>
        The variability in expert opinions has led to multiple recommendations for refining human
evaluation approaches. Engaging at least four experts is recommended, and more is preferable
for robust results [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. However, using expert evaluations may not always be productive,
particularly if the system is not designed for expert use [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. In cases where the number of
available experts is limited, a comparative (also known as preferential) evaluation approach
is often preferred. Additionally, it is advisable to involve about 10 to 60 non-expert users -
the intended end-users of the system - in the evaluation process and to report the Inter
Annotator Agreement (IAA) for reliability (refer to Table 3 in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] for best practices).
It is also imperative to use external evaluators who have not taken part in the conversation to
judge it [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] discusses the complexities in explaining human evaluations,
noting that individuals with varying levels of expertise can provide divergent assessments of
the same response; this again shows the importance of employing many humans with varying
expertise to evaluate such a system completely.
      </p>
      <p>In summarizing insights from the reviewed research articles, it is evident that human evaluation
remains a common and indispensable element in the evaluation pipeline of chatbot systems,
albeit implemented at different stages. Additionally, a diverse selection of metrics is frequently
employed to assess various aspects of chatbot responses. Utilizing evaluator LLMs seems to
be a promising approach that warrants exploration due to its potential to offer efficient and
scalable evaluation. While the current focus is on the evaluation itself, a potentially critical factor,
often overlooked, is the nature of the data used for testing and evaluation; many papers lack
specificity regarding the types of questions posed to chatbots. We propose that incorporating
a range of question types, informed by cognitive psychology frameworks such as Bloom’s
Taxonomy, could significantly enhance the systematic evaluation of chatbot responses and the
insights drawn from such an evaluation.</p>
      <p>To experiment with the evaluation procedures, we first implement a chatbot (Figure 2). This
implementation follows industry standards such as Retrieval Augmented Generation (RAG) and
vector databases. The chatbot, EdTalk, aims to assist users in navigating
and comprehending lengthy reports by harnessing the power of LLMs, with the goals of
minimal hallucination and strict adherence to factual information from its knowledge base. The
goal of this chatbot is to make educational reports such as the Condition of Education accessible
to a wide range of readers; hence, the chatbot’s knowledge base is built from the said reports. By
evaluating EdTalk, we investigate whether this chatbot aligns with its initial goals. Simultaneously we
determine whether the chatbot is able to consistently follow the goals across the different types of questions
in Bloom’s Taxonomy. Later we compare the results from various evaluation procedures,
including automated, human, and LLM-based, to find which is more informative with respect to
the development of this chatbot.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation procedures</title>
      <p>We understand that chatbots, like any software, have an iterative implementation in which
developers update the components that make up the chatbot. Each of these
components, and the full system, need to be evaluated for reliability and performance. In this
section we dive into the various evaluation procedures we conducted and briefly explain how they
were implemented. We focus only on utterance-based evaluation, meaning that we shall
only investigate procedures built to look at the responses of the chatbot. The performance of other
components, such as the semantic search used for retrieval in RAG, is not in scope
for this investigation.</p>
      <p>
        To conduct the evaluation we employed the services of 5 humans. Initially, one of the human
evaluators, having access to the content to be evaluated, generated 40 questions based on
Bloom’s Taxonomy [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. The purpose behind adopting Bloom’s Taxonomy was to determine
the efficacy of the chatbot in responding to different types of questions. This approach adds
another unique dimension to the evaluation process, enabling us to evaluate the quality of the
chatbot’s responses against different types of questions. It should be noted that the specific
questions used in the evaluation were dependent on the use case of the chatbot implementation
and have not been disclosed in this article.
      </p>
      <p>A pair of humans - hereafter known as annotators - then write their own responses to the
above questions. Later, another pair - hereafter known as evaluators - determines the quality of
the responses. Both pairs consist of an expert and a novice. An expert is someone who has
been working with these reports for more than 2 years; a novice is new to the area but has
some experience with the content.</p>
      <sec id="sec-3-1">
        <title>3.1. Automated evaluation</title>
        <p>
          Selecting an automated evaluation model is a crucial step. We do not select n-gram based
methods because of the issues the literature points out; instead, we utilize embedding based
methods. In that regard we believe BLEURT [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] to be the best of the selection. We must not forget that this method may still sometimes
produce inconsistent results, but as it is repeatable, it can be used at the rapid development stage
to test parameters such as chunk sizes, overlap sizes, etc. BLEURT requires a reference text and a
generated text to compute similarity, and [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] suggests using a specific checkpoint to achieve the best comparison
(https://github.com/google-research/bleurt?tab=readme-ov-file#checkpoints); an example of the
reference text (expert response) and the generated text (generated response) is given in
Table 2. Evaluating whether the chatbot responses are similar to the annotators’ responses is
straightforward with BLEURT.
        </p>
        <p>
          Expert response (reference text, Table 2): Schools experienced many other challenges—besides
staffing-related ones—during their recovery from the coronavirus pandemic. Some of these other
issues facing public schools included reported increases in student and teacher absenteeism,
student socioemotional and behavioral development, and an increase in the percentage of
students seeking mental health services from school, as compared with before the coronavirus
pandemic.
        </p>
        <p>
          Generated response (Table 2): Schools faced many challenges during the pandemic besides
staffing-related issues. Some of these challenges included reported increases in student and
teacher absenteeism, student socioemotional and behavioral development, and an increase in the
percentage of students seeking mental health services from school, as compared with before the
coronavirus pandemic. In particular, in April 2022, some 70 percent of public schools reported
that the percentage of students who had sought mental health services from school had increased
since the start of the coronavirus pandemic. Overall, only 12 percent of schools strongly agreed
and 44 percent moderately agreed that their school was able to effectively provide mental health
services to all students in need. So, schools faced challenges related to student mental health,
absenteeism, and socioemotional and behavioral development during the pandemic.
        </p>
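        <p>The automated loop described above can be sketched as follows. Because running BLEURT requires downloading its released checkpoint, this sketch substitutes a simple bag-of-words cosine similarity as a stand-in scorer; the function names, the tuple layout, and the stand-in metric are our own illustration, not the paper’s code.</p>
        <preformat>
```python
from collections import Counter
from math import sqrt

def similarity(reference: str, candidate: str) -> float:
    """Cosine similarity over word counts -- a stand-in for the BLEURT
    scorer, which would compare the same (reference, candidate) pair."""
    a = Counter(reference.lower().split())
    b = Counter(candidate.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def mean_score_by_type(items):
    """items: (bloom_type, human_reference, generated_response) triples.
    Returns the mean similarity per Bloom's Taxonomy question type."""
    buckets = {}
    for bloom, ref, gen in items:
        buckets.setdefault(bloom, []).append(similarity(ref, gen))
    return {bloom: sum(s) / len(s) for bloom, s in buckets.items()}
```
        </preformat>
        <p>With the real checkpoint, only the scoring call changes; the per-question-type aggregation, which produces numbers of the kind reported in Table 4, stays the same.</p>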
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Human evaluation</title>
        <p>
          Human evaluation, on the other hand, is more complex. Traditional human evaluation
is typically a preferential rating of which response a human prefers. While
this is an acceptable measure [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], it may still miss insights in the results. We conduct this
traditional preferential evaluation first. The humans do not need
to be domain experts to conduct this type of evaluation [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ].
        </p>
        </p>
        <p>
          We then enlist evaluators to rate the chatbot’s responses to the previously created questions.
Rating is conducted on a few factors [
          <xref ref-type="bibr" rid="ref13 ref22">22, 13</xref>
          ]. We carefully select these factors so that we
can effectively evaluate many aspects of the chatbot; many of the selected factors were
inspired by [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. We develop a 5-point Likert scale-based questionnaire through which we collect
expert ratings for the chatbot responses.
        </p>
        <p>Instructions on how to perform the ratings were given to the evaluators beforehand. Table 3 shows
what questions an evaluator should ask before rating a response on a criterion. The criteria
are set up so that a response with all the accurate and relevant information, without unnecessary
information, presented in the clearest and most concise manner, is rated high. We also take hallucinations
into account; this covers most quality criteria a generative AI application should
be checked for. Evaluators were also free to refer to the text the questions are based on, but we
did not make the annotator responses available to the evaluators. We gave example
ratings for a few questions and responses which were not part of the 40 selected above; these
included examples for ratings 1, 3, and 5. Evaluators were free to determine how to assign the
intermediate ratings.</p>
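        <p>The 5-point ratings collected this way can be aggregated per factor and question type; a minimal sketch, in which the tuple layout and function name are our own illustration rather than code from the study, follows. Medians match how the factored results are later reported.</p>
        <preformat>
```python
from statistics import median

def factor_medians(ratings):
    """ratings: (factor, bloom_type, likert_value) tuples from the
    5-point questionnaire. Returns the median rating for each
    (factor, question type) cell."""
    cells = {}
    for factor, bloom, value in ratings:
        cells.setdefault((factor, bloom), []).append(value)
    return {cell: median(values) for cell, values in cells.items()}
```
        </preformat>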
      </sec>
      <sec id="sec-3-3">
        <title>3.3. LLM-based evaluation</title>
        <p>
          This evaluation procedure is relatively new, and there is currently limited
literature available to support its reliability as compared to human evaluation. One purpose
of this study is to contribute to the existing literature by comparing human-based evaluation
with LLM-based evaluation. We used the same instructions that were given to
human evaluators to prompt the LLM for evaluation. In addition, examples for each Likert scale
value were provided to ensure that the LLM was aligned with the evaluation criteria; this is
the only difference from the human instructions, as humans did not receive examples for all
Likert scale values. The evaluation prompt included the question, the facts retrieved from the content,
and the response generated by the chatbot, as per the methodology proposed by [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. The
responses were evaluated for one factor at a time, and the generated evaluation responses
were processed to extract the corresponding Likert scale values from the LLM. The LLM evaluators did not have
access to the annotator responses created for the automated evaluation step, but they
did have access to the content of the document. This allowed us to compare the
LLM-based evaluation with the human evaluation in a similar light.
        </p>
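        <p>The prompt assembly and Likert extraction described above can be sketched as below. The study’s actual prompts are not disclosed, so the wording, function names, and the regex-based extraction are illustrative assumptions; only the listed ingredients (instructions, question, retrieved facts, chatbot response, one factor at a time) come from the text.</p>
        <preformat>
```python
import re

def build_eval_prompt(instructions, question, facts, response):
    """Assemble a single-factor evaluation prompt from the pieces the
    paper lists: evaluator instructions, the question, the retrieved
    facts, and the chatbot's response. Wording here is illustrative."""
    return (
        instructions + "\n"
        "Question: " + question + "\n"
        "Facts: " + "; ".join(facts) + "\n"
        "Chatbot response: " + response + "\n"
        "Reply with a rating from 1 to 5."
    )

def extract_likert(llm_reply):
    """Pull the first 1-5 integer out of the evaluator LLM's free-form
    reply; returns None when no rating can be found."""
    match = re.search(r"\b([1-5])\b", llm_reply)
    return int(match.group(1)) if match else None
```
        </preformat>
        <p>Parsing defensively matters here: evaluator LLMs often wrap the rating in explanation, and a response with no extractable rating should be flagged rather than silently scored.</p>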
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In this section, the results of all evaluation procedures are compared and contrasted. The
purpose is to gain an understanding of what was learned from each experiment and to identify
any advantages or disadvantages associated with each method. Bloom’s Taxonomy is used to
make comparisons, but the specific types within the taxonomy are not explained in this work.</p>
      <p>
        Table 4 presents the results captured by the automated evaluation experiment. As explained
in the previous sections, here we use BLEURT [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] as the metric to compute the similarity of
the generated response to a human written answer. This evaluation can be conducted
rapidly if the human written responses are readily available: the human needs
to write the response only once, and the evaluation can be run repeatedly after the
parameters of the application are altered. It is not clear, however, how to compare two BLEURT scores
for a similar task where multiple reference texts are used. Upon inspection and comparison of
the BLEURT values, we noted that for some question types the expert and novice fell into similar
ranges. For both humans, the generated response has a lower similarity on Evaluate questions.
For Apply questions, the expert’s similarity is 0.44 while the novice’s is 0.24. The highest similarities
for both humans were reported on Understand questions.
      </p>
      <p>We conducted traditional human evaluation through preferential rating first; this type of
evaluation does not require domain experts and is much faster than the
other human evaluation methods. Here we find that the chatbot’s answers are preferred only 47%
of the time (on average); Table 5 presents the results broken down by the same Bloom’s Taxonomy
types. This measure does not reveal anything about which areas need improvement in order
to perform better, which is typically why the community prefers factored human evaluation.</p>
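      <p>The preferential tally behind these numbers reduces to a simple win-rate computation; the sketch below is our own illustration of that bookkeeping, with a hypothetical vote encoding rather than the study’s data format.</p>
      <preformat>
```python
def preference_rates(votes):
    """votes: (bloom_type, preferred) pairs, where preferred is 'chatbot'
    or 'human'. Returns the chatbot's overall win share and its share per
    Bloom's Taxonomy question type (the breakdown reported in Table 5)."""
    wins, per_type = [], {}
    for bloom, preferred in votes:
        win = 1 if preferred == "chatbot" else 0
        wins.append(win)
        per_type.setdefault(bloom, []).append(win)
    overall = sum(wins) / len(wins)
    return overall, {b: sum(v) / len(v) for b, v in per_type.items()}
```
      </preformat>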
      <p>Table 7 reports the results of the factored evaluation for both the human and LLM procedures.
Since we used Likert scales to capture ratings, we report the results via the medians of
each factor and question type. The visualized results are displayed in Figure 1, which clearly
highlights the notable differences between novices and experts in their approaches to response
analysis. The graph underscores the importance of recognizing individual variations in cognitive
processing and interpretation of information.</p>
      <p>
        Using the factored human evaluation procedure, we were able to experimentally establish
previously elusive facts about the generative application. When we initially conducted simple
automated and human (preferential) evaluation, without breaking questions down by Bloom’s
Taxonomy, we got only one measure of whether the chatbot works within the parameters of an
acceptable application. This is usually not enough to understand the underlying complex issues
of LLMs, or whether they are present in the LLM-powered application. RAG systems are
built to retrieve information which is available in context. This means that when posed with
Remember questions, they must perform well; but as the results from the expert show, EdTalk
does not perform well on Remember questions (Table 7 and Figure 1). They also show that the
chatbot responses are not consistent enough to say anything conclusive about the other question types.
This result reveals that while RAG chatbots should be great at answering retrieval based questions,
they sometimes do not work as intended from the perspective of a human. We also note that the
automated evaluation with BLEURT showed similar patterns for each question type,
but when we take the novice into account, the similarity is no longer present. One
advantage of this type of evaluation is that we can now check the inter-rater reliability,
which we show in Table 6. We notice here the major issue pointed out by much prior work:
humans not agreeing in their reviews. By categorizing questions into factors, we also
notice that human agreement is moderate on Clarity while all other factors show low agreement.
One disadvantage we notice is the difficulty of repeating the evaluation effort: the same humans
may rate these responses differently if we change the order or the framing of the questions in
the questionnaire [
        <xref ref-type="bibr" rid="ref13 ref25">13, 25</xref>
        ].
      </p>
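      <p>Inter-rater reliability of the kind reported in Table 6 can be computed with an agreement coefficient such as Cohen’s kappa; the paper does not name its exact coefficient, so the two-rater sketch below is an assumption for illustration.</p>
      <preformat>
```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (here, two
    evaluators' Likert ratings for one factor). 1.0 is perfect agreement;
    values near 0 mean roughly chance-level agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```
      </preformat>
      <p>For ordinal Likert data, a weighted variant (penalizing a 1-vs-5 disagreement more than a 4-vs-5 one) would arguably be more appropriate than the unweighted form shown here.</p>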
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>
        The goal of this work is to illustrate how challenging it is to evaluate an LLM based application,
especially a chatbot, with current methodologies including automated, human, and
LLM procedures. We first demonstrate that there are advantages and disadvantages to all three
of these approaches. We also note the differences in the results gained from the three evaluation
procedures: there is very little correlation between them, and it would be difficult to
recommend any single one. We also observed that the expert’s evaluation results are a bit stricter,
generally resulting in lower scores for many factors. The novice viewed the chatbot
in a more favorable light, and we notice the slightly elevated scores. Using an LLM to evaluate the
chatbot responses seems unreliable, as the LLM scores its own responses highly. In our
experimental case, we used the same LLM (GPT-3.5) both to generate the responses and as the
evaluator LLM. This is not the ideal setting: as [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] points out, an unevaluated LLM must be evaluated using an already-evaluated LLM or a higher order
LLM. Given this situation of uncertain evaluations from any procedure, we should not distract
the readers from the need for evaluating. To improve the reliability of evaluation, we suggest
increasing the number of humans used in the factored human evaluation. Enlisting a wide
range of expertise would also create a smoother picture of the results; however, this would increase
the expense of the evaluation. As [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] suggests, enlisting a larger number of intended
users of a chatbot would still not be ideal, as these users may also create confusion about what is
correct and what is not. Allowing untrained humans to make judgments on the factors will not
yield the most accurate results, similar to the case we have with the LLM results in Figure 1.
      </p>
      <p>
        One deciding factor is repeatability and the budget available for
evaluating a chatbot. In this regard we note that while automated procedures are repeatable, the low
reliability of these metrics makes a case against them. Human evaluation is considered the gold
standard; while that can be true, research indicates that human disagreement is a greater
issue, and we also see this issue in Table 6. LLM evaluators are a novel adaptation of
LLMs; their greatest obstacle right now is the lack of research supporting their reliability.
We observe that in some cases LLM evaluators give responses similar to human evaluators.
But this is not always the case: in most instances LLM evaluators tend to be overly confident
that the response is correct. We cannot dismiss the promise of LLM evaluators, as we can set
various personalities and rapidly obtain multiple versions of an evaluation [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], but it must also
be explored whether a person of such expertise would rate the same response in
a similar way. Further research needs to be conducted into understanding how LLMs can help us
evaluate LLMs.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We thank Abhinav Cheruvu for helping with the implementation of the chatbot, and Tabitha Tezil, Erika
Kessler, and Jijun Zhang for helping with the human evaluation.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Prompts</title>
      <p>This section lists the prompts that have been used in this work. We first give the prompt
utilized in the RAG process in the chatbot, for clarity, and then a sample prompt that
was used for the LLM-based evaluation.</p>
      <sec id="sec-7-1">
        <title>A.1. RAG Prompt</title>
        <p>The user asks the question &lt;question&gt;. Here are some facts that
could be used to support the question, &lt;facts delimited by
semicolons&gt;.</p>
      </sec>
      <sec id="sec-7-2">
        <title>A.2. LLM Evaluator Prompt</title>
        <p>Here we only include the prompt used with the “Correctness” criterion; similar
prompts can be drawn for the others.
5. Document Scores: Keep a record of the scores and feedback
for reference.</p>
        <p>This can be helpful for tracking progress and ensuring
consistency in your evaluations.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lakkaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kundu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joshi</surname>
          </string-name>
          , Evaluating Chatbots to Promote Users' Trust - Practices and Open Problems,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2309.05680. arXiv:2309.05680 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I. O.</given-names>
            <surname>Gallegos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Barrow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Tanjim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dernoncourt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <source>Bias and Fairness in Large Language Models: A Survey</source>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2309.00770. doi:
          <volume>10</volume>
          .48550/arXiv.2309.00770, arXiv:
          <fpage>2309</fpage>
          .00770 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          , W. Ma,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qin</surname>
          </string-name>
          , T. Liu,
          <source>A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions</source>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2311.05232. doi:
          <volume>10</volume>
          .48550/arXiv.2311.05232, arXiv:
          <fpage>2311</fpage>
          .05232 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of Hallucination in Natural Language Generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3571730. doi:
          <volume>10</volume>
          .1145/3571730.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaddour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mozes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bradley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raileanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>McHardy</surname>
          </string-name>
          ,
          <source>Challenges and Applications of Large Language Models</source>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2307.10169. doi:
          <volume>10</volume>
          .48550/arXiv.2307.10169, arXiv:
          <fpage>2307</fpage>
          .10169 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jin</surname>
          </string-name>
          , C. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <surname>Supryadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <source>Evaluating Large Language Models: A Comprehensive Survey</source>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2310.19736. doi:
          <volume>10</volume>
          .48550/arXiv.2310.19736, arXiv:
          <fpage>2310</fpage>
          .19736 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsipras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Soylu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yasunaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cosgrove</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Acosta-Navas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hudson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zelikman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Durmus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ladhak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Santhanam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Orr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yuksekgonul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Suzgun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chatterji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Henderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Santurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ganguli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Icard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Y. Koreeda,
          <source>Holistic Evaluation of Language Models</source>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2211.09110. doi:
          <volume>10</volume>
          .48550/arXiv.2211.09110, arXiv:
          <fpage>2211</fpage>
          .09110 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Nori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Carignan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Edgar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fusi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>McKinney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. O.</given-names>
            <surname>Ness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>White</surname>
          </string-name>
          , E. Horvitz,
          <source>Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine</source>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2311.16452. doi:
          <volume>10</volume>
          .48550/arXiv.2311.16452, arXiv:
          <fpage>2311</fpage>
          .16452 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <source>Retrieval-Augmented Generation for Large Language Models: A Survey</source>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2312.10997. doi:
          <volume>10</volume>
          .48550/arXiv.2312.10997, arXiv:
          <fpage>2312</fpage>
          .10997 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abd-Alrazaq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Safi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alajlani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Warren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Househ</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Denecke</surname>
          </string-name>
          , Technical Metrics Used to Evaluate Health Care Chatbots: Scoping Review,
          <source>Journal of Medical Internet Research</source>
          <volume>22</volume>
          (
          <year>2020</year>
          )
          <fpage>e18301</fpage>
          . URL: http://www.jmir.org/2020/6/e18301/. doi:
          <volume>10</volume>
          .2196/18301.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <article-title>Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality</article-title>
          , LMSYS Org,
          <year>2023</year>
          . URL: https://lmsys.org/blog/2023-03-30-vicuna.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Card</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Henderson</surname>
          </string-name>
          , U. Khandelwal,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mahowald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <source>With Little Power Comes Great Responsibility</source>
          ,
          <year>2020</year>
          . URL: http://arxiv.org/abs/2010.06595, arXiv: 2010.06595 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C. van der</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gatt</surname>
          </string-name>
          , E. Van Miltenburg,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wubben</surname>
          </string-name>
          , E. Krahmer,
          <article-title>Best practices for the human evaluation of automatically generated text</article-title>
          ,
          <source>in: Proceedings of the 12th International Conference on Natural Language Generation</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>355</fpage>
          -
          <lpage>368</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <article-title>METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Goldstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Voss</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization</source>
          , Association for Computational Linguistics
          , Ann Arbor, Michigan,
          <year>2005</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          . URL: https://aclanthology.org/W05-0909.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A Package for Automatic Evaluation of Summaries</article-title>
          , in:
          <source>Text Summarization Branches Out</source>
          , Association for Computational Linguistics
          , Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL: https://aclanthology.org/W04-1013.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W.-J. Zhu,
          <article-title>Bleu: a Method for Automatic Evaluation of Machine Translation</article-title>
          , in:
          <string-name>
            <given-names>P.</given-names>
            <surname>Isabelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Charniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics
          , Philadelphia, Pennsylvania, USA,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . URL: https://aclanthology.org/P02-1040. doi:
          <volume>10</volume>
          .3115/1073083.1073135.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sellam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <source>BLEURT: Learning Robust Metrics for Text Generation</source>
          ,
          <year>2020</year>
          . URL: http://arxiv.org/abs/2004.04696. doi:
          <volume>10</volume>
          .48550/arXiv.2004.04696, arXiv: 2004.04696 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          , Y. Artzi,
          <source>BERTScore: Evaluating Text Generation with BERT</source>
          ,
          <year>2020</year>
          . URL: http://arxiv.org/abs/1904.09675. doi:
          <volume>10</volume>
          .48550/arXiv.1904.09675, arXiv: 1904.09675 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Finch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Finch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems</article-title>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2212.09180, arXiv:
          <fpage>2212</fpage>
          .09180 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>C. van der</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. van</given-names>
            <surname>Miltenburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Krahmer</surname>
          </string-name>
          ,
          <article-title>Human evaluation of automatically generated text: Current trends and best practice guidelines</article-title>
          ,
          <source>Computer Speech &amp; Language</source>
          <volume>67</volume>
          (
          <year>2021</year>
          )
          <fpage>101151</fpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S088523082030084X. doi:
          <volume>10</volume>
          .1016/j.csl.2020.101151.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.-M.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate</article-title>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2308 .07201, arXiv:
          <fpage>2308</fpage>
          .07201 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
<string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-K.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>GPTScore: Evaluate as You Desire</article-title>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2302.04166, arXiv:2302.04166 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Saad-Falcon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
<article-title>ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems</article-title>
          ,
          <year>2024</year>
          . URL: http://arxiv.org/abs/2311.09476, arXiv:2311.09476 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>E.</given-names>
            <surname>Svikhnushina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <article-title>Approximating Online Human Evaluation of Social Chatbots with Prompting</article-title>
, in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Stoyanchev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schlangen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Dusek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kennington</surname>
          </string-name>
,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alikhani</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue</source>
          , Association for Computational Linguistics, Prague, Czechia,
          <year>2023</year>
          , pp.
          <fpage>268</fpage>
          -
          <lpage>281</lpage>
. URL: https://aclanthology.org/2023.sigdial-1.25.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>E.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>August</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Serrano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Haduong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gururangan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
,
          <article-title>All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text</article-title>
          ,
          <year>2021</year>
          . URL: http://arxiv.org/abs/2107.00061, arXiv:2107.00061 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
<string-name>
            <given-names>D. M.</given-names>
            <surname>Howcroft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rieser</surname>
          </string-name>
          ,
          <article-title>What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think</article-title>
          , in:
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Online and Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>8932</fpage>
          -
          <lpage>8939</lpage>
          . URL: https://aclanthology.org/2021.emnlp-main.703. doi:10.18653/v1/2021.emnlp-main.703.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schoch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ji</surname>
          </string-name>
,
          <article-title>“This is a Problem, Don't You Agree?” Framing and Bias in Human Evaluation for Natural Language Generation</article-title>
          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Dušek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gkatzia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Konstas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>van Miltenburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Santhanam</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 1st Workshop on Evaluating NLG Evaluation</source>
          , Association for Computational Linguistics, Online (Dublin, Ireland),
          <year>2020</year>
          , pp.
          <fpage>10</fpage>
          -
          <lpage>16</lpage>
          . URL: https://aclanthology.org/2020.evalnlgeval-1.2.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>V.</given-names>
            <surname>Vijayaraghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Cooper</surname>
          </string-name>
          ,
<string-name>
            <surname>R. L. J.</surname>
          </string-name>
          ,
          <article-title>Algorithm Inspection for Chatbot Performance Evaluation</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>171</volume>
          (
          <year>2020</year>
          )
          <fpage>2267</fpage>
          -
          <lpage>2274</lpage>
          . URL: https://linkinghub.elsevier.com/retrieve/pii/S1877050920312370. doi:10.1016/j.procs.2020.04.245.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>P.</given-names>
            <surname>Armstrong</surname>
          </string-name>
          ,
          <source>Bloom's Taxonomy</source>
          ,
          <year>2010</year>
. URL: https://cft.vanderbilt.edu/guides-sub-pages/blooms-taxonomy/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>