<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Bhashithe</forename><surname>Abeysinghe</surname></persName>
							<email>babeysinghe@air.org</email>
							<affiliation key="aff0">
								<orgName type="institution">American Institutes for Research</orgName>
								<address>
									<settlement>Arlington</settlement>
									<region>VA</region>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ruhan</forename><surname>Circi</surname></persName>
							<email>rcirci@air.org</email>
							<affiliation key="aff0">
								<orgName type="institution">American Institutes for Research</orgName>
								<address>
									<settlement>Arlington</settlement>
									<region>VA</region>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="laboratory">The First Workshop on Large Language Models for Evaluation in Information Retrieval</orgName>
								<address>
									<addrLine>18 July 2024</addrLine>
									<settlement>Washington</settlement>
									<region>DC</region>
									<country key="US">United States</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">F3B5F57BD932DB891A97CC13A9B4D8D4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:26+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>LLM</term>
					<term>Human Evaluation</term>
					<term>Evaluation Challenges</term>
					<term>factor based evaluation</term>
					<term>LLM Evaluation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Chatbots have been an interesting application of natural language generation since the field's inception. With novel transformer-based Generative AI methods, building chatbots has become trivial. Chatbots targeted at specific domains, for example medicine and psychology, are being implemented rapidly. This, however, should not distract from the need to evaluate chatbot responses, especially because the natural language generation community does not entirely agree on how to evaluate such applications effectively. In this work we discuss the issue further in light of the increasingly popular LLM-based evaluations and how they correlate with human evaluations. Additionally, we introduce a comprehensive factored evaluation mechanism that can be utilized in conjunction with both human and LLM-based evaluations. We present the results of an experimental evaluation conducted using this scheme on one of our chatbot implementations, which consumes educational reports, and subsequently compare automated evaluation, traditional human evaluation, factored human evaluation, and factored LLM evaluation. Results show that factor-based evaluation produces better insights into which aspects need to be improved in LLM applications and further strengthens the argument for using human evaluation in critical spaces where the main functionality is not direct retrieval.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The landscape of chatbot development is rapidly evolving, propelled by advancements in Large Language Model (LLM) APIs. While the pace of development is exciting, there is a gap between building an LLM-powered application and building a reliable system with LLMs. Closing this gap requires carefully considering whether the final product satisfies all requirements and evaluating it to test its alignment with performance and ethical standards. As highlighted by <ref type="bibr" target="#b0">[1]</ref>, this evaluation process should encompass both a technical assessment and a trust-oriented framework. It is essential to ensure a balance between operational efficiency and responsible usage.</p><p>This process is further complicated by common pitfalls in LLMs: several authors <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5</ref>] mention areas where LLMs can make mistakes, such as hallucination, tone, and output formatting. Effective evaluation can help to improve and maintain validation and consistency and so avoid these pitfalls. The development of an effective evaluation system is timely for researchers and developers alike, given the proliferation of LLM-based generative applications such as chatbots.</p><p>The development cycle of a generic LLM-based application typically covers three phases: a) selection of the LLM, b) iterative development of the application, and c) operational deployment of the application. The evaluation of LLMs themselves, as discussed in various papers <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>, is beyond the scope of this brief. However, it is essential to note that the quality of the base LLM is a fundamental component in leveraging its capabilities effectively and minimizing risk in the resulting application. 
For applications, developers may follow different development approaches (e.g., fine-tuning, chaining, prompting, Retrieval Augmented Generation (RAG), LLM search combined with knowledge graphs, etc.), and each approach demands tailored evaluation steps, e.g., the quality of data used in fine-tuning or prompting styles <ref type="bibr" target="#b7">[8]</ref>, or chunk size and quantity in RAG <ref type="bibr" target="#b8">[9]</ref>. This paper explores three fundamental approaches for evaluating the final response (i.e., output) generated by LLM-based chatbots, namely automated metrics, human evaluation, and LLM-based evaluation. With respect to human evaluation, we investigate preferential evaluation and factored evaluation methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head><p>Chatbots interact with users in order to resolve user queries. Some chatbots are domain specific <ref type="bibr" target="#b9">[10]</ref> while others are general-purpose chatbots <ref type="bibr" target="#b10">[11]</ref>. Evaluating a chatbot largely hinges on its intended use and specialization. In reviewing 16 papers on this topic, we summarized several key components that require attention in the evaluation; among these, a clear definition of the chatbot's intended purpose (i.e., the use case, which specifies the business goal or client expectations, and the user interaction with the app) is critical. Such clarity enables a focused evaluation of whether the chatbot attains its designated purpose.</p><p>The components described in Table <ref type="table">1</ref> suggest that chatbots can be evaluated on different criteria (also known as factors or dimensions), such as their ability to answer the users' queries completely, their linguistic effectiveness, and their ability to recall information (either through information retrieval or memory). Additional metrics may include the system's response time, usability, and intuitiveness.</p><p>Currently, there are no common methods or agreed-upon best practices that are robust enough to evaluate LLM-based applications. As pointed out in almost all the prior work on this topic, a notable challenge is the lack of consensus on appropriate evaluation criteria and metrics. Therefore, researchers and developers bear the responsibility of choosing the evaluation methods that are most appropriate for their unique application. This responsibility may not only lengthen development timelines but may also lead to underpowered statistical evaluations <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref>. 
A recurring issue with automated metrics is that they produce inconsistent results and may not always correlate with human evaluation. Many still prefer to use them because they are readily available and easily repeatable <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b17">18]</ref>. This is not the case with human evaluation, which is expensive and will not be repeatable in the same context even if one uses the same humans <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b19">20]</ref>. We must also acknowledge work in which generative AI models are used at the evaluation step, such as ChatEval, GPTScore and ARES <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b22">23]</ref>, which are novel applications of LLMs. <ref type="bibr" target="#b23">[24]</ref> discusses "bot-play", where an already evaluated LLM is used to evaluate a new, un-evaluated LLM. When considering LLM-based evaluators, one must make sure the evaluator LLM produces acceptable and accurate decisions to a given threshold.</p><p>Human evaluation remains the most widely accepted form of evaluation in research studies despite frequent reports of underpowered results <ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b12">13]</ref>. There have been several calls for the standardization of human evaluation methods <ref type="bibr" target="#b25">[26,</ref><ref type="bibr" target="#b19">20]</ref>, but the costly nature of human evaluation often leads researchers to report on systems with statistically insufficient power. Additionally, the sensitivity of human evaluators to the framing of questions (framed negatively or positively) is reported to influence outcomes <ref type="bibr" target="#b27">[27]</ref>. 
For conversational or dialogue systems, the common standard of human evaluation is Quality rated on Likert scales. Quality can vary across tasks, and it encompasses multiple factors such as correctness, relevance, informativeness, consistency, understanding, etc. <ref type="bibr" target="#b18">[19]</ref>. <ref type="bibr" target="#b12">[13]</ref> suggest using a minimum of 100 questions rated on 5- or 7-point Likert scales to evaluate multiple dimensions. This seems to be a difficult goal to achieve due to the expensive nature of human evaluation.</p><p>The variability in expert opinions has led to multiple recommendations for refining human evaluation approaches. Engaging at least four experts is recommended, but more is preferable for robust results <ref type="bibr" target="#b19">[20]</ref>. However, using expert evaluations may not always be productive, particularly if the system is not designed for expert use <ref type="bibr" target="#b24">[25]</ref>. In cases where the number of available experts is limited, a comparative (also known as preferential) evaluation approach is often preferred. Additionally, it is advisable to involve about 10 to 60 non-expert users (the intended end-users of the system) in the evaluation process and to ensure that the Inter Annotator Agreement (IAA) is reported for reliability (refer to Table <ref type="table" target="#tab_1">3</ref> in <ref type="bibr" target="#b12">[13]</ref> for best practices). It is also imperative to use external evaluators who have not taken part in the conversation to judge the conversation <ref type="bibr" target="#b18">[19]</ref>. 
<ref type="bibr" target="#b28">[28]</ref> discusses the complexities in explaining human evaluations, noting that individuals with varying levels of expertise can provide divergent assessments of the same response. This again shows the importance of employing many humans with varying expertise to completely evaluate such a system.</p><p>In summarizing insights from the reviewed research articles, it is evident that human evaluation remains a common and indispensable element in the evaluation pipeline of chatbot systems, albeit implemented at different stages. Additionally, a diverse selection of metrics is frequently employed to assess various aspects of chatbot responses. Utilizing evaluator LLMs seems to be a promising approach that warrants exploration due to its potential to offer efficient and scalable evaluation. While the current focus is on the evaluation itself, a potentially critical and often overlooked factor is the nature of the data used for testing and evaluation; many papers lack specificity regarding the types of questions posed to chatbots. We propose that incorporating a range of question types, informed by cognitive psychology frameworks such as Bloom's Taxonomy, could significantly enhance the systematic evaluation of chatbot responses and the insights drawn from such an evaluation.</p><p>To experiment with the evaluation procedures, we first implement a chatbot (Figure <ref type="figure" target="#fig_1">2</ref>). This implementation follows common industry practices such as Retrieval Augmented Generation (RAG) and vector databases. The chatbot, EdTalk, aims to assist users in navigating and comprehending lengthy reports by harnessing the power of LLMs; its goals are minimal hallucination and strict adherence to factual information from its knowledge base. The purpose of this chatbot is to make educational reports such as the Condition of Education accessible to a wide range of readers. 
Hence, the chatbot's knowledge base is built from these reports. By evaluating EdTalk, we investigate whether this chatbot aligns with its initial goals. Simultaneously, we examine whether the chatbot is able to consistently follow these goals for the different types of questions in Bloom's Taxonomy. Later we compare the results from the various evaluation procedures, including automated, human and LLM-based, to find which is more informative with respect to the development of this chatbot.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Evaluation procedures</head><p>We understand that chatbots, like any software, will have an iterative implementation in which developers update the components that make up the chatbot. Each of these components and the full system need to be evaluated for reliability and performance. In this section we describe the evaluation procedures we conducted and briefly explain how they were implemented. We focus only on utterance-based evaluation, meaning that we investigate only procedures that are built to look at the responses of the chatbot. The performance of other components, such as the semantic search used for retrieval in RAG, is not in scope for this investigation.</p><p>To conduct the evaluation we employ the services of 5 humans. Initially, one of the human evaluators, having access to the content to be evaluated, generated 40 questions based on Bloom's Taxonomy <ref type="bibr" target="#b29">[29]</ref>. The purpose behind adopting Bloom's Taxonomy was to determine the efficacy of the chatbot in responding to different types of questions. This approach adds another unique dimension to the evaluation process, enabling us to evaluate the quality of the chatbot's responses against different types of questions. It should be noted that the specific questions used in the evaluation were dependent on the use case of the chatbot implementation and have not been disclosed in this article.</p><p>Then a pair of humans, hereafter known as annotators, write their own responses to the above questions. Later another pair, hereafter known as evaluators, determines the quality of the responses. Each pair consists of an expert and a novice. An expert is someone who has been working with these reports for more than 2 years, and a novice is new to the area but has some experience with the content.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Automated evaluation</head><p>Selecting an automated evaluation model is a crucial step. We do not select n-gram based methods because of the issues that the literature points out; instead, we utilize embedding-based methods. In that regard we believe BLEURT <ref type="bibr" target="#b16">[17]</ref> to be the best of the available options. We must not forget that this method would still sometimes produce inconsistent results, but as it is repeatable, it can be used at the rapid development stage to test parameters such as chunk sizes, overlap sizes, etc. BLEURT requires a reference text and a generated text to compute similarity, and <ref type="bibr" target="#b16">[17]</ref> suggests using a specific checkpoint to achieve the best comparison <ref type="foot" target="#foot_0">1</ref>. An example of the reference text (expert response) and the generated text (generated response) is given in Table <ref type="table" target="#tab_0">2</ref>. Evaluating whether the chatbot responses are similar to those of the annotators is straightforward with BLEURT.</p><p>Expert response (Table 2): Schools experienced many other challenges, besides staffing-related ones, during their recovery from the coronavirus pandemic. Some of these other issues facing public schools included reported increases in student and teacher absenteeism, student socioemotional and behavioral development, and an increase in the percentage of students seeking mental health services from school, as compared with before the coronavirus pandemic.</p><p>Generated response (Table 2): Schools faced many challenges during the pandemic besides staffing-related issues. Some of these challenges included reported increases in student and teacher absenteeism, student socioemotional and behavioral development, and an increase in the percentage of students seeking mental health services from school, as compared with before the coronavirus pandemic. In particular, in April 2022, some 70 percent of public schools reported that the percentage of students who had sought mental health services from school had increased since the start of the coronavirus pandemic. Overall, only 12 percent of schools strongly agreed and 44 percent moderately agreed that their school was able to effectively provide mental health services to all students in need. So, schools faced challenges related to student mental health, absenteeism, and socioemotional and behavioral development during the pandemic.</p></div>
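<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough sketch of how such a repeatable check could be wired up, the following Python groups similarity scores by Bloom's Taxonomy question type. The names mean_score_by_type and token_overlap, the item layout, and the sample texts are all assumptions made for illustration. In particular, token_overlap is a simple placeholder and not BLEURT; it is only there so that the sketch runs without a model checkpoint.</p>
<p>

```python
from collections import defaultdict
from statistics import mean
from typing import Callable


def token_overlap(reference: str, candidate: str) -> float:
    """Placeholder similarity (Jaccard overlap of lowercased tokens).

    This is NOT BLEURT; it only keeps the sketch runnable without a checkpoint.
    """
    ref_tokens = set(reference.lower().split())
    cand_tokens = set(candidate.lower().split())
    union = ref_tokens | cand_tokens
    if not union:
        return 0.0
    return len(ref_tokens.intersection(cand_tokens)) / len(union)


def mean_score_by_type(
    items: list[dict],
    score_fn: Callable[[str, str], float] = token_overlap,
) -> dict[str, float]:
    """Average the similarity of generated vs. reference text per question type."""
    scores = defaultdict(list)
    for item in items:
        scores[item["type"]].append(score_fn(item["reference"], item["generated"]))
    return {qtype: mean(values) for qtype, values in scores.items()}


# Hypothetical items: one annotator reference and one chatbot response each.
sample = [
    {"type": "Remember", "reference": "schools faced staffing issues",
     "generated": "schools faced staffing issues"},
    {"type": "Evaluate", "reference": "absenteeism increased",
     "generated": "mental health services expanded"},
]
by_type = mean_score_by_type(sample)
```

</p>
<p>In practice score_fn would wrap a BLEURT scorer loaded from the recommended checkpoint. Because the references are written once, the same call can be repeated after each change to chunk size or overlap.</p></div>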
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Human evaluation</head><p>Human evaluation, on the other hand, is more complex. Traditional human evaluation is typically a preferential rating of which response a human prefers. While this is an acceptable measure <ref type="bibr" target="#b12">[13]</ref>, it may still miss insights from the results. We conduct this traditional preferential evaluation first to start the human evaluation. The humans do not need to be experts in the domain to conduct this type of evaluation <ref type="bibr" target="#b24">[25]</ref>.</p><p>Then we enlist evaluators to rate the responses of the chatbot to the previously created questions. Ratings are given on a few factors <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b12">13]</ref>. We carefully select these factors so that we can effectively evaluate many aspects of the chatbot; many of the selected factors were inspired by <ref type="bibr" target="#b12">[13]</ref>. We develop a 5-point Likert scale-based questionnaire with which we collect expert ratings for the chatbot responses.</p><p>Instructions on how to perform the ratings were given to the evaluators beforehand. Table <ref type="table" target="#tab_1">3</ref> shows the questions an evaluator should ask before rating a response on a criterion. The criteria are set up so that a response with all the accurate and relevant information, without unnecessary information, presented in the most clear and concise manner is rated high. We also take hallucinations into account; together this covers most quality criteria a generative AI application should meet. Evaluators were also free to refer to the text on which the questions are based, but we did not make the Annotator responses available to the Evaluators. We gave example ratings for a few questions and responses that were not part of the 40 selected above; these included examples for ratings 1, 3 and 5. 
Evaluators were free to determine how to assign the intermediate ratings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">LLM-based evaluation</head><p>LLM-based evaluation is a relatively new procedure, and there is currently limited literature available to support its reliability as compared to human evaluation. The purpose of this study is to contribute to the existing literature by comparing human-based evaluation with LLM-based evaluation. The researchers used the same instructions that were given to the human evaluators to prompt the LLM for evaluation. In addition, examples for each Likert scale value were provided to ensure that the LLM was aligned with the evaluation criteria. This is the only difference from the human instructions, as humans did not receive examples for all Likert scale values. The evaluation prompt included the question, facts retrieved from the content, and the response generated by the chatbot, as per the methodology proposed by <ref type="bibr" target="#b22">[23]</ref>. The responses were evaluated for one factor at a time, and the generated evaluation responses were processed to extract Likert ratings on the same scale used by the human evaluators. The LLM evaluator did not have access to the Annotator responses used in the automated evaluation step, but it did have access to the content of the document. This allowed the researchers to compare the LLM-based evaluation with the human evaluation in a similar light.</p></div>
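<div xmlns="http://www.tei-c.org/ns/1.0"><p>The prompting and post-processing steps described above might look roughly like the following Python sketch. The exact prompt layout and extraction logic used in this study are not reproduced here: build_prompt, extract_likert, and the convention that the evaluator ends its reply with "Rating: N" are assumptions made for illustration only.</p>
<p>

```python
import re
from typing import Optional

# Assumption: the evaluator LLM is asked to end its reply with "Rating: N".
RATING_PATTERN = re.compile(r"rating\s*[:=]?\s*([1-5])\b", re.IGNORECASE)


def build_prompt(instructions: str, facts: list[str], question: str, answer: str) -> str:
    """Assemble one evaluation prompt for a single factor (layout is illustrative)."""
    return (
        f"{instructions}\n\n"
        f"Facts: {'; '.join(facts)}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Finish with a line of the form 'Rating: N' where N is 1 to 5."
    )


def extract_likert(reply: str) -> Optional[int]:
    """Return the last 1-5 rating found in the reply, or None if none is present."""
    matches = RATING_PATTERN.findall(reply)
    return int(matches[-1]) if matches else None
```

</p>
<p>Returning None rather than a default value keeps unparseable replies visible, so they can be re-prompted or excluded instead of silently skewing the medians.</p></div>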
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>In this section, the results of all evaluation procedures are compared and contrasted. The purpose is to gain an understanding of what was learned from each experiment and to identify any advantages or disadvantages associated with each method. Bloom's Taxonomy is used to make comparisons, but the specific types within the taxonomy are not explained in this work. Table <ref type="table" target="#tab_2">4</ref> presents the results of the automated evaluation experiment. As explained in the previous sections, we use BLEURT <ref type="bibr" target="#b16">[17]</ref> as the metric to compute the similarity of the generated response to a human-written answer. This evaluation can be conducted rapidly if the human-written responses are readily available: the human needs to write the response only once, after which the evaluation can be run repeatedly as the parameters of the application are altered. It is not clear how to compare two BLEURT scores for a similar task when multiple reference texts are used. Upon inspection and comparison of the BLEURT values, we noted that for some question types the expert and novice scores fell into similar ranges. For both humans, the generated response has a lower similarity on Evaluate questions. For Apply questions, the similarity to the expert's response is 0.44, while the similarity to the novice's is 0.24. The highest similarities for both humans were reported on Understand questions.</p><p>We conducted traditional human evaluation through preferential rating first. This type of evaluation does not require domain experts and is much faster than the other human evaluation methods. Here we find that the chatbot's answers are preferred only 47% of the time on average; Table <ref type="table" target="#tab_3">5</ref> presents the results broken down by Bloom's Taxonomy type. 
This measure does not reveal anything about which areas need improvement in order to perform better, which is typically why the community prefers factored human evaluation.</p><p>Table <ref type="table" target="#tab_4">7</ref> reports the results of the factored evaluation for both the human and LLM procedures. Since we used Likert scales to capture ratings, we report the results as medians for each factor and question type. The results are visualized in Figure <ref type="figure" target="#fig_0">1</ref>, which highlights the notable differences between novices and experts in their approaches to response analysis. The graph underscores the importance of recognizing individual variations in cognitive processing and interpretation of information.</p><p>Using the factored human evaluation procedure, we were able to uncover experimentally previously elusive facts about the generative application. With the initial automated and preferential human evaluations, if we do not break the questions down by Bloom's Taxonomy, we get only one measure to test whether the chatbot works within the parameters of an acceptable application. This is usually not enough to understand the underlying complex issues of LLMs and whether they are present in the LLM-powered application. RAG systems are built to retrieve information that is available in context. This means that they should perform well when posed with Remember questions, but as the results from the expert show, EdTalk does not perform well on Remember questions (Table <ref type="table" target="#tab_4">7</ref> and Figure <ref type="figure" target="#fig_0">1</ref>). The results also show that the chatbot responses are not consistent enough to say anything about the other question types. This result reveals that while RAG chatbots should be great at answering retrieval-based questions, they sometimes do not work as intended from the perspective of a human. 
We also note that the automated evaluation with BLEURT showed similar patterns for each question type, but when we take the novice into account, the similarity is no longer present. One advantage of this type of evaluation is that we can check the inter-rater reliability, which we show in Table <ref type="table">6</ref>. Here we notice the major issue pointed out in much prior work: humans do not agree in their reviews. By breaking the ratings down by factor, we also notice that human agreement is moderate for Clarity but low for all other factors. One disadvantage we notice is the difficulty of repeating the evaluation effort: the same humans may rate these responses differently if we change the order or the framing of the questions in the questionnaire <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b24">25]</ref>.</p></div>
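<div xmlns="http://www.tei-c.org/ns/1.0"><p>The aggregation into medians and the agreement check between two evaluators can be sketched as below. This is only an illustration under stated assumptions: the text does not say which agreement statistic underlies Table 6, so unweighted Cohen's kappa is used here as one common choice, and the function names and the rating tuples are hypothetical.</p>
<p>

```python
from collections import Counter, defaultdict
from statistics import median


def medians_by_group(ratings: list[tuple[str, str, int]]) -> dict[tuple[str, str], float]:
    """Median Likert rating for each (factor, question type) pair."""
    grouped = defaultdict(list)
    for factor, qtype, score in ratings:
        grouped[(factor, qtype)].append(score)
    return {key: median(values) for key, values in grouped.items()}


def cohen_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty item list")
    n = len(rater_a)
    # Share of items on which the two raters gave the same rating.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

</p>
<p>Because Likert ratings are ordinal, a weighted variant of kappa that penalizes near misses less than distant ones may be a better fit; the unweighted form is shown only because it is the simplest to verify by hand.</p></div>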
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion</head><p>The goal of this work is to illustrate how challenging it is to evaluate an LLM-based application, in particular a chatbot, with current methodologies including automated, human and LLM procedures. We first demonstrate that there are advantages and disadvantages in all three of these approaches. We also note the differences among the results obtained from the three evaluation procedures: there is very little correlation between these results, and it would be difficult to recommend one over the others. We also observed that the expert's evaluation was somewhat stricter and generally resulted in lower scores for many factors. The novice viewed the chatbot in a more favorable light, and we notice slightly elevated scores. Using an LLM to evaluate the chatbot responses does not seem reliable, as the LLM scores its own responses highly. In our experimental case, we used the same LLM (GPT-3.5) to generate the responses and as the evaluator LLM. This is not the ideal setting: as <ref type="bibr" target="#b23">[24]</ref> points out, an LLM that has not been evaluated must be evaluated using an already evaluated LLM or a higher-order LLM. Given this situation of uncertain evaluations from any procedure, we should not distract the readers from the need for evaluating. To improve the reliability of evaluation, we suggest increasing the number of humans used in the factored human evaluation. Enlisting a wide range of expertise would also create a smoothed view of the results; however, this would increase the cost of the evaluation. As <ref type="bibr" target="#b12">[13]</ref> suggests, enlisting a larger number of intended users of a chatbot would still not be ideal, as these users may also create confusion about what is correct and what is not. 
Allowing untrained humans to make judgments on the factors will not yield the most accurate results, similar to the case we have with the LLM results in Figure <ref type="figure" target="#fig_0">1</ref>.</p><p>One deciding factor would be repeatability and the amount of funds available for evaluating a chatbot. In this regard we note that while automated procedures are repeatable, the low reliability of these metrics makes a case against them. Human evaluation is considered the gold standard; while that can be true, research indicates that human disagreement is a greater issue, and we also observe this in Table <ref type="table">6</ref>. LLM evaluators are a novel adaptation of LLMs, and their greatest weakness right now is the lack of research supporting their reliability. We observe that in some cases LLM evaluators give ratings similar to those of human evaluators. This is not always the case, however: in most instances LLM evaluators tend to be overly confident that the response is correct. We cannot reject the promise of LLM evaluators, as we can set various personalities and obtain various versions of the evaluation rapidly <ref type="bibr" target="#b20">[21]</ref>, but it must also be explored whether a person with the corresponding expertise would rate the same response in a similar way. Further research needs to be conducted to understand how LLMs can help us evaluate LLMs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Prompts</head><p>This section notes the prompts that have been used in this work. We first note the prompt that has been utilized in the RAG process in the chatbot for clarity and then a sample prompt that was used with the LLM evaluator.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1. RAG Prompt</head><p>The user asks the question &lt;question&gt;. Here are some facts that could be used to support the question, &lt;facts delimited by semicolons&gt;.</p><p>You must first investigate if it is possible to support an answer with the available facts. If you do not have facts to support an answer, step by step explaining your reasoning behind each action you must come up with a answer by processing, applying and evaluating facts as needed. Otherwise you must only respond with "I dont know" and do not output anything else.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2. LLM Evaluator Prompt</head><p>Here we include only the prompt used with the "Correctness" criterion; similar prompts can be written for the other criteria.</p><p>You are an expert education researcher. You are given a set of facts, a question that relates to the text of these facts and an answer for the given question. Your task is to evaluate if the answer is a good answer to the given question based off of a criterion and also considering the facts. Evaluation steps: 1. Read the facts: Start by carefully reading the facts provided. Understand the context, main points, and any relevant details. 2. Analyze the Question: Examine the question that relates to the facts.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Median of Likert scale ratings of each evaluator. Each spoke shows how an evaluator rated a response based on the question type from Bloom's Taxonomy.</figDesc><graphic coords="2,89.29,84.19,416.70,164.78" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Screen capture of the EdTalk chatbot answering a question</figDesc><graphic coords="5,184.25,438.36,226.77,208.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>Ensure you have a clear understanding of what the question is asking for. 3. Review the Answer: Carefully read the answer provided and assess it based only on the following criterion: Correctness: Does the answer provide accurate information based on the paragraph text?</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2</head><label>2</label><figDesc>Scenario from the Condition of Education 2023 report: the example question, the expert annotator's response, and the generated response. Similar response pairs are used in the BLEURT evaluation.</figDesc><table><row><cell>Question</cell><cell>Expert response</cell><cell>Generated response</cell></row><row><cell>What challenges did schools face during the pandemic?</cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3</head><label>3</label><figDesc>Criteria for the Likert scale questionnaire</figDesc><table><row><cell>Criterion</cell><cell>Description</cell></row><row><cell>Relevance</cell><cell>Are the facts presented required by the question?</cell></row><row><cell>Informativeness</cell><cell>Are all the facts called for by the question presented in the response?</cell></row><row><cell>Correctness</cell><cell>How correct is the generated response?</cell></row><row><cell>Clarity</cell><cell>Does the question call for a certain formatting of the answer, or is the response brief or verbose?</cell></row><row><cell>Hallucination</cell><cell>Does the answer contain hallucinated references, information, etc.?</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 4</head><label>4</label><figDesc>Automated evaluation results: each generated answer is compared against a human-written answer (Expert or Novice), and the BLEURT score is reported.</figDesc><table><row><cell>Type</cell><cell>Expert</cell><cell>Novice</cell></row><row><cell>Remember</cell><cell>0.45</cell><cell>0.40</cell></row><row><cell>Understand</cell><cell>0.61</cell><cell>0.55</cell></row><row><cell>Apply</cell><cell>0.44</cell><cell>0.24</cell></row><row><cell>Analyze</cell><cell>0.47</cell><cell>0.41</cell></row><row><cell>Evaluate</cell><cell>0.22</cell><cell>0.31</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 5</head><label>5</label><figDesc>Percentage of cases in which the generated response was preferred in the preferential rating evaluation</figDesc><table><row><cell>Type</cell><cell>Generated response preference</cell></row><row><cell>Remember</cell><cell>31%</cell></row><row><cell>Understand</cell><cell>100%</cell></row><row><cell>Apply</cell><cell>0%</cell></row><row><cell>Analyze</cell><cell>57%</cell></row><row><cell>Evaluate</cell><cell>33%</cell></row></table></figure>
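<div xmlns="http://www.tei-c.org/ns/1.0"><p>The percentages in Table 5 are simple proportions per question type. A minimal sketch of how such a preference rate could be computed is given below, assuming each vote is recorded as a (question type, preferred-generated flag) pair. The function name preference_rate, the vote format, and the sample votes are hypothetical and are not the study's data.</p><p>
```python
from collections import defaultdict


def preference_rate(votes: list[tuple[str, bool]]) -> dict[str, float]:
    """Percentage of votes per question type that preferred the generated response."""
    totals = defaultdict(int)
    preferred = defaultdict(int)
    for question_type, chose_generated in votes:
        totals[question_type] += 1
        preferred[question_type] += int(chose_generated)
    return {q: 100.0 * preferred[q] / totals[q] for q in totals}


if __name__ == "__main__":
    # Made-up votes for illustration only.
    sample_votes = [
        ("Apply", False),
        ("Understand", True),
        ("Analyze", True),
        ("Analyze", False),
    ]
    print(preference_rate(sample_votes))
```
</p><p>With only a handful of votes per question type, rates such as 0% or 100% can arise from very few judgments, so the number of votes behind each percentage is worth reporting alongside it.</p></div>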
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 7</head><label>7</label><figDesc>Factored evaluation results; median across question type. Higher is better.</figDesc><table><row><cell></cell><cell>Type</cell><cell>Correctness</cell><cell>Informativeness</cell><cell>Relevance</cell><cell>Clarity</cell><cell>Hallucinations</cell></row><row><cell></cell><cell>Remember</cell><cell>2</cell><cell>2</cell><cell>3</cell><cell>2</cell><cell>3</cell></row><row><cell></cell><cell>Understand</cell><cell>5</cell><cell>4</cell><cell>4</cell><cell>2</cell><cell>3</cell></row><row><cell>Expert</cell><cell>Apply</cell><cell>3.5</cell><cell>3.5</cell><cell>3</cell><cell>3</cell><cell>2</cell></row><row><cell></cell><cell>Analyze</cell><cell>4</cell><cell>4</cell><cell>4</cell><cell>4</cell><cell>5</cell></row><row><cell></cell><cell>Evaluate</cell><cell>2</cell><cell>3</cell><cell>3</cell><cell>4</cell><cell>1</cell></row><row><cell></cell><cell>Remember</cell><cell>5</cell><cell>4</cell><cell>3</cell><cell>4</cell><cell>3</cell></row><row><cell></cell><cell>Understand</cell><cell>3</cell><cell>3</cell><cell>2</cell><cell>2</cell><cell>2</cell></row><row><cell>Novice</cell><cell>Apply</cell><cell>4</cell><cell>2.5</cell><cell>3.5</cell><cell>2.5</cell><cell>2</cell></row><row><cell></cell><cell>Analyze</cell><cell>4</cell><cell>4</cell><cell>5</cell><cell>4</cell><cell>5</cell></row><row><cell></cell><cell>Evaluate</cell><cell>4</cell><cell>4</cell><cell>4</cell><cell>4</cell><cell>4</cell></row><row><cell></cell><cell>Remember</cell><cell>4</cell><cell>3</cell><cell>4</cell><cell>5</cell><cell>5</cell></row><row><cell></cell><cell>Understand</cell><cell>4</cell><cell>2</cell><cell>4</cell><cell>5</cell><cell>5</cell></row><row><cell>LLM</cell><cell>Apply</cell><cell>5</cell><cell>5</cell><cell>4.5</cell><cell>5</cell><cell>4</cell></row><row><cell></cell><cell>Analyze</cell><cell>5</cell><cell>5</cell><cell>5</cell><cell>5</cell><cell>5</cell></row><row><cell></cell><cell>Evaluate</cell><cell>4</cell><cell>3</cell><cell>4</cell><cell>5</cell><cell>5</cell></row></table></figure>
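<div xmlns="http://www.tei-c.org/ns/1.0"><p>One way to read Table 7 is to compare an evaluator's medians with the expert's, question type by question type. The sketch below does this for the Correctness column only, using the medians copied from the table. The helper name rating_gap and the choice of a simple difference (and its mean) as a summary are our assumptions for illustration, not an analysis reported in the paper.</p><p>
```python
from statistics import mean

QUESTION_TYPES = ["Remember", "Understand", "Apply", "Analyze", "Evaluate"]

# Median Correctness ratings copied from Table 7, in QUESTION_TYPES order.
CORRECTNESS = {
    "Expert": [2, 5, 3.5, 4, 2],
    "Novice": [5, 3, 4, 4, 4],
    "LLM": [4, 4, 5, 5, 4],
}


def rating_gap(evaluator: str, reference: str = "Expert") -> dict[str, float]:
    """Per-question-type difference between two evaluators' median ratings."""
    return {
        question_type: a - b
        for question_type, a, b in zip(
            QUESTION_TYPES, CORRECTNESS[evaluator], CORRECTNESS[reference]
        )
    }


if __name__ == "__main__":
    gaps = rating_gap("LLM")
    print(gaps)
    # Rough descriptive summary only; medians of ordinal ratings are not interval data.
    print("mean gap:", mean(gaps.values()))
```
</p><p>A positive gap means the evaluator rated the responses higher than the expert did for that question type. For the LLM evaluator the Correctness gap is positive for every question type except Understand.</p></div>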
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/google-research/bleurt?tab=readme-ov-file#checkpoints</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We thank Abhinav Cheruvu for helping with the implementation of the chatbot, and Tabitha Tezil, Erika Kessler, and Jijun Zhang for helping with the human evaluation.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Evaluating Chatbots to Promote Users&apos; Trust - Practices and Open Problems</title>
		<author>
			<persName><forename type="first">B</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lakkaraju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Koppel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Narayanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kundu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Joshi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.05680</idno>
		<ptr target="http://arxiv.org/abs/2309.05680" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Bias and Fairness in Large Language Models: A Survey</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">O</forename><surname>Gallegos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Rossi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Barrow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Tanjim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dernoncourt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">K</forename><surname>Ahmed</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2309.00770</idno>
		<idno type="arXiv">arXiv:2309.00770</idno>
		<ptr target="http://arxiv.org/abs/2309.00770" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions</title>
		<author>
			<persName><forename type="first">L</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Liu</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2311.05232</idno>
		<idno type="arXiv">arXiv:2311.05232</idno>
		<ptr target="http://arxiv.org/abs/2311.05232" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Survey of Hallucination in Natural Language Generation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Frieske</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ishii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Bang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
		<idno type="DOI">10.1145/3571730</idno>
		<ptr target="https://dl.acm.org/doi/10.1145/3571730" />
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page" from="1" to="38" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Challenges and Applications of Large Language Models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kaddour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Harris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mozes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Bradley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Raileanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>McHardy</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2307.10169</idno>
		<idno type="arXiv">arXiv:2307.10169</idno>
		<ptr target="http://arxiv.org/abs/2307.10169" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Evaluating Large Language Models: A Comprehensive Survey</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><surname>Supryadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Xiong</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2310.19736</idno>
		<idno type="arXiv">arXiv:2310.19736</idno>
		<ptr target="http://arxiv.org/abs/2310.19736" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Holistic Evaluation of Language Models</title>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bommasani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tsipras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Soylu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yasunaga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Narayanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Newman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cosgrove</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ré</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Acosta-Navas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Hudson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zelikman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Durmus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ladhak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Santhanam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Orr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yuksekgonul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Suzgun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Guha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Chatterji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Khattab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Henderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Santurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ganguli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hashimoto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Icard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Koreeda</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2211.09110</idno>
		<idno type="arXiv">arXiv:2211.09110</idno>
		<ptr target="http://arxiv.org/abs/2211.09110" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine</title>
		<author>
			<persName><forename type="first">H</forename><surname>Nori</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">T</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Carignan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Edgar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Fusi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>King</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Larson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>McKinney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">O</forename><surname>Ness</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Poon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Usuyama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>White</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Horvitz</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2311.16452</idno>
		<idno type="arXiv">arXiv:2311.16452</idno>
		<ptr target="http://arxiv.org/abs/2311.16452" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2312.10997</idno>
		<idno type="arXiv">arXiv:2312.10997</idno>
		<ptr target="http://arxiv.org/abs/2312.10997" />
		<title level="m">Retrieval-Augmented Generation for Large Language Models: A Survey</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Technical Metrics Used to Evaluate Health Care Chatbots: Scoping Review</title>
		<author>
			<persName><forename type="first">A</forename><surname>Abd-Alrazaq</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Safi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Alajlani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Warren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Househ</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Denecke</surname></persName>
		</author>
		<idno type="DOI">10.2196/18301</idno>
		<ptr target="http://www.jmir.org/2020/6/e18301/" />
	</analytic>
	<monogr>
		<title level="j">Journal of Medical Internet Research</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="page">e18301</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<ptr target="https://lmsys.org/blog/2023-03-30-vicuna" />
		<title level="m">Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality | LMSYS Org</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Card</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Henderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Khandelwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Mahowald</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.06595</idno>
		<ptr target="http://arxiv.org/abs/2010.06595" />
		<title level="m">With Little Power Comes Great Responsibility</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Best practices for the human evaluation of automatically generated text</title>
		<author>
			<persName><forename type="first">C</forename><surname>Van Der Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gatt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Van Miltenburg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wubben</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Krahmer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th International Conference on Natural Language Generation</title>
				<meeting>the 12th International Conference on Natural Language Generation</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="355" to="368" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments</title>
		<author>
			<persName><forename type="first">S</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W05-0909" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Goldstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Voss</surname></persName>
		</editor>
		<meeting>the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization<address><addrLine>Ann Arbor, Michigan</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="65" to="72" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">ROUGE: A Package for Automatic Evaluation of Summaries</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W04-1013" />
	</analytic>
	<monogr>
		<title level="m">Text Summarization Branches Out</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="74" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Bleu: a Method for Automatic Evaluation of Machine Translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.3115/1073083.1073135</idno>
		<ptr target="https://aclanthology.org/P02-1040" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">P</forename><surname>Isabelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Charniak</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Lin</surname></persName>
		</editor>
		<meeting>the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics<address><addrLine>Philadelphia, Pennsylvania, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">BLEURT: Learning Robust Metrics for Text Generation</title>
		<author>
			<persName><forename type="first">T</forename><surname>Sellam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Parikh</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2004.04696</idno>
		<idno type="arXiv">arXiv:2004.04696</idno>
		<ptr target="http://arxiv.org/abs/2004.04696" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">BERTScore: Evaluating Text Generation with BERT</title>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kishore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Artzi</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.1904.09675</idno>
		<idno type="arXiv">arXiv:1904.09675</idno>
		<ptr target="http://arxiv.org/abs/1904.09675" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Don&apos;t Forget Your ABC&apos;s: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Finch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Finch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Choi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2212.09180</idno>
		<ptr target="http://arxiv.org/abs/2212.09180" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Human evaluation of automatically generated text: Current trends and best practice guidelines</title>
		<author>
			<persName><forename type="first">C</forename><surname>Van Der Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gatt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Van Miltenburg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Krahmer</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.csl.2020.101151</idno>
		<ptr target="https://www.sciencedirect.com/science/article/pii/S088523082030084X" />
	</analytic>
	<monogr>
		<title level="j">Computer Speech &amp; Language</title>
		<imprint>
			<biblScope unit="volume">67</biblScope>
			<biblScope unit="page">101151</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate</title>
		<author>
			<persName><forename type="first">C.-M</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Xue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.07201</idno>
		<ptr target="http://arxiv.org/abs/2308.07201" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">GPTScore: Evaluate as You Desire</title>
		<author>
			<persName><forename type="first">J</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-K</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.04166</idno>
		<ptr target="http://arxiv.org/abs/2302.04166" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems</title>
		<author>
			<persName><forename type="first">J</forename><surname>Saad-Falcon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Khattab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Potts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zaharia</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2311.09476</idno>
		<ptr target="http://arxiv.org/abs/2311.09476" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Approximating Online Human Evaluation of Social Chatbots with Prompting</title>
		<author>
			<persName><forename type="first">E</forename><surname>Svikhnushina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Pu</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2023.sigdial-1.25" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Stoyanchev</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Joty</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Schlangen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">O</forename><surname>Dušek</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Kennington</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Alikhani</surname></persName>
		</editor>
		<meeting>the 24th Meeting of the Special Interest Group on Discourse and Dialogue, Association for Computational Linguistics<address><addrLine>Prague, Czechia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="268" to="281" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">All That&apos;s &apos;Human&apos; Is Not Gold: Evaluating Human Evaluation of Generated Text</title>
		<author>
			<persName><forename type="first">E</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>August</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Serrano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Haduong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gururangan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2107.00061</idno>
		<ptr target="http://arxiv.org/abs/2107.00061" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Howcroft</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Rieser</surname></persName>
		</author>
		<title level="m">What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think</title>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<idno type="DOI">10.18653/v1/2021.emnlp-main.703</idno>
		<ptr target="https://aclanthology.org/2021.emnlp-main.703" />
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
		<editor>
			<persName><forename type="first">M.-F</forename><surname>Moens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">W.-T</forename><surname>Yih</surname></persName>
		</editor>
		<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>Online and Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="8932" to="8939" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">&quot;This is a Problem, Don&apos;t You Agree?&quot; Framing and Bias in Human Evaluation for Natural Language Generation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Schoch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ji</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2020.evalnlgeval-1.2" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st Workshop on Evaluating NLG Evaluation, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">O</forename><surname>Dušek</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Gehrmann</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Gkatzia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Konstas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Van Miltenburg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Santhanam</surname></persName>
		</editor>
		<meeting>the 1st Workshop on Evaluating NLG Evaluation, Association for Computational Linguistics<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="10" to="16" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Algorithm Inspection for Chatbot Performance Evaluation</title>
		<author>
			<persName><forename type="first">V</forename><surname>Vijayaraghavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">B</forename><surname>Cooper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">L J</forename></persName>
		</author>
		<idno type="DOI">10.1016/j.procs.2020.04.245</idno>
		<ptr target="https://linkinghub.elsevier.com/retrieve/pii/S1877050920312370" />
	</analytic>
	<monogr>
		<title level="j">Procedia Computer Science</title>
		<imprint>
			<biblScope unit="volume">171</biblScope>
			<biblScope unit="page" from="2267" to="2274" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">Bloom&apos;s Taxonomy</title>
		<author>
			<persName><forename type="first">P</forename><surname>Armstrong</surname></persName>
		</author>
		<ptr target="https://cft.vanderbilt.edu/guides-sub-pages/blooms-taxonomy/" />
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
