<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Evaluating LLMs&apos; Performance At Automatic Short-Answer Grading</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Rositsa</forename><forename type="middle">V</forename><surname>Ivanova</surname></persName>
							<email>rositsa.ivanova@unisg.ch</email>
							<affiliation key="aff0">
								<orgName type="institution">University of St. Gallen</orgName>
								<address>
									<settlement>St. Gallen</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Siegfried</forename><surname>Handschuh</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of St. Gallen</orgName>
								<address>
									<settlement>St. Gallen</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Evaluating LLMs&apos; Performance At Automatic Short-Answer Grading</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">BF908378F517581AAA18175EA24DB18F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:38+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>automatic short-answer grading</term>
					<term>large language models</term>
					<term>automated scoring</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In recent years, the use of Large Language Models (LLMs) has become more accessible and widespread. With free-of-charge access, people have begun applying the models to tasks well beyond next-word prediction. In an exploratory study, we take a closer look at the use of LLMs for Automatic Short Answer Grading. We compare the grading of short-answer tasks by two human graders to that of an LLM. We discuss the results and present examples of observed shortcomings in the annotation and grading.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Large Language Models (LLMs) have become our assistants in many everyday activities. Over the last few years, the speed at which new models are developed has become overwhelming to daily users, researchers, politicians, and lawmakers struggling to keep up with all options and opportunities <ref type="bibr" target="#b0">[1]</ref>. Yet, their application has been explored and accepted in various domains <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>.</p><p>Automatic Short Answer Grading (ASAG) systems emerged as an educational technology, addressing the need for efficient assessment methods in both online and traditional educational environments, long before the hype around LLMs <ref type="bibr" target="#b4">[5]</ref>. The primary objective of ASAG systems is to automatically evaluate and score students' responses to short-answer questions. The difficulty of the task arises from the brevity of the texts, often just a few words, and thus the limited available context <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>. One approach to ASAG for closed-ended questions is the comparison of the student answer to a predefined correct answer <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9]</ref>. Developments in ASAG have been heavily influenced by advances in Natural Language Processing (NLP) and Machine Learning <ref type="bibr" target="#b9">[10]</ref>.</p><p>Accordingly, LLMs have found their applications in the creation of datasets and tools.
While they are of great help for generic tasks such as answering questions or writing text <ref type="bibr" target="#b10">[11]</ref>, they often fall short when applied to domain-specific tasks <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14]</ref>. One primary concern is the risk that LLMs amplify biases present in their training data <ref type="bibr" target="#b14">[15]</ref>. Further, ensuring the factual accuracy and relevance of the content generated by LLMs remains a challenge <ref type="bibr" target="#b15">[16]</ref>. Previous attempts using Retrieval-Augmented Generation have incorporated external sources to enrich LLMs' answers with knowledge, improving their factual grounding and thus the safety of answers <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b18">19]</ref>. However, such approaches rely on knowledge databases and annotated datasets to learn from, which underlines the critical importance of creating high-quality gold-standard datasets <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21]</ref>.</p><p>We explore the use of LLMs for the automated grading of short-answer texts as an example of a complex task that requires understanding a brief answer given nothing more than a sample solution. Our exploratory study addresses the question of whether LLMs have implicitly learned to perform well on specific NLP tasks (e.g. ASAG). We believe that understanding the shortcomings of LLMs is one of many steps towards developing annotation approaches better suited to LLM support in the process of automated grading.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Experiment</head><p>We compare the grading of students' answers to exam questions by two human graders to that of a popular, widely used, free-of-charge LLM (i.e. ChatGPT-3.5). We acknowledge that the chosen model is merely one amongst many, each with individual strengths and weaknesses, and that it is continuously updated. However, given the widespread use of the model across various domains and the exploratory scope of this study, we build our use case on ChatGPT-3.5 while pointing out the limitations of our choice.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Human annotation</head><p>The initial dataset of this experiment was created in two steps. First, Mohler and Mihalcea <ref type="bibr" target="#b21">[22]</ref> graded the assignments of undergraduate students in an introductory computer science (CS) course. The 630 short answers given by 30 students were evaluated by two graduate CS students on an interval scale from 0 to 5. The second dataset extended the former to a total of 2 273 short answers <ref type="bibr" target="#b22">[23]</ref>. The new texts were graded by the same two people, this time on a scale from 0 to 10, with half points given in some cases. Converting this scale to the equivalent 0-to-5 range led to rational grades in increments of 0.25 for some of the answers. For the purpose of our study, we kept only the answers that received a whole-number grade, 89% (2 022) of all answers, as we deemed comparing grades of differing initial granularity (i.e. only whole numbers in the first part and a mix in the second) to introduce unnecessary bias.</p></div>
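The filtering step described above can be sketched as follows. The mapping from the 0-to-10 scale is assumed here to be a simple halving, which is consistent with half points on the original scale becoming 0.25 increments on the converted one; the variable names are illustrative.

```python
def convert_and_filter(grades_0_to_10):
    """Scale 0-10 grades (possibly with half points) to the 0-5 range
    and keep only answers that end up with a whole-number grade."""
    converted = [g / 2 for g in grades_0_to_10]   # 0-10 -> 0-5, step 0.25
    return [g for g in converted if g == int(g)]  # drop fractional grades

# Half-point grades (9.5, 7.5) and odd grades (5) become fractional
# after conversion and are dropped.
print(convert_and_filter([10, 9.5, 8, 7.5, 6, 5]))  # [5.0, 4.0, 3.0]
```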
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ChatGPT</head><p>The prompt consisted of an instruction including the grading scale, the original question, the desired correct answer, and the student answer. To gain better insight into the grading decisions, we requested a text comment alongside each grade.</p></div>
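The prompt structure described above might be assembled as in the following sketch. The exact instruction wording used in the study is not reproduced here; all strings are illustrative assumptions.

```python
def build_prompt(question, reference_answer, student_answer):
    """Assemble a grading prompt: instruction incl. scale, question,
    reference answer, and the answer to be graded (illustrative wording)."""
    return (
        "Grade the following student answer on a scale from 0 to 5, "
        "where 5 is fully correct, by comparing it to the reference answer. "
        "Reply with the grade and a short comment justifying it.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Student answer: {student_answer}"
    )

prompt = build_prompt(
    "What is the base case for a recursive implementation of merge sort?",
    "A list of size 1, which is already sorted.",
    "One element is sorted.",
)
print(prompt)
```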
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Results</head><p>We compared the grading of the human annotators and ChatGPT in multiple steps and using various approaches. First, we compare the grades given by the first grader (H1) and the second grader (H2). Second, we compare each of them individually to the scores assigned automatically by ChatGPT. For the three pairs, we derive a simple percentage of inter-annotator agreement (IAA), evaluate the agreement beyond chance (Kappa Score), the agreement weighted by the severity of disagreement (Weighted Kappa Score), and the linear correlation between the scorings (Pearson's Correlation Coefficient). A detailed discussion of the choice of correlation metric is provided by the dataset creators <ref type="bibr" target="#b21">[22]</ref>. Table <ref type="table" target="#tab_0">1</ref> depicts the results for each pair and score. The agreement between the two human annotators (i.e. H1 &amp; H2) served as a benchmark for expected IAA. The Inter-annotator Score was 60.88%, indicating that the human annotators agreed on grades more than half of the time. The Kappa Score (0.295) indicates an agreement below moderate (0.41-0.60), underlined by the Weighted Kappa Score of 0.395, which shows a slightly better but still modest agreement. Considering the applied grading scale, however, the Pearson's Correlation Coefficient (0.586) reflects a moderate positive correlation between the two sets of grades.</p><p>On the contrary, the comparison between each human annotator and ChatGPT (i.e. H1 &amp; ChatGPT; H2 &amp; ChatGPT) reveals a lower level of agreement. For H1 &amp; ChatGPT, the Inter-annotator Score, the Kappa Score, and the Weighted Kappa Score indicate minimal agreement beyond what would be expected by chance. A surprisingly high value is achieved for the Pearson's Correlation Coefficient at 0.628, suggesting a stronger correlation.
One explanation for this could be the different grade distributions of H1 and H2. The agreement between the second human annotator (H2) and ChatGPT was even lower on all measures, yet here too the Pearson's Correlation Coefficient remained high, indicating a moderate correlation despite the low agreement scores.</p><p>In addition to the evaluation of the three pairs, we created a subset of the initial dataset (1 231 answers) containing the instances on which H1 and H2 agreed (i.e. H*). We view these instances as examples of answers that were graded more objectively and for which the assignment of a grade may be more straightforward. We calculated the IAA measures for this subset against ChatGPT. This yielded an Inter-annotator Score of 33.96%, the highest achieved by any pair including ChatGPT. However, here too the Kappa and Weighted Kappa Scores remained noticeably low. This suggests that even when the humans were in agreement, ChatGPT's grading did not align closely with the human consensus. The Pearson's Correlation Coefficient was 0.537, indicating a moderate positive correlation but not a strong agreement.</p><p>In summary, while we observe a moderate level of agreement between the human annotators, the agreement between ChatGPT and the humans is considerably lower. However, the Pearson's Correlation Coefficients suggest there is still a moderate positive relationship between the grading patterns of humans and ChatGPT. The results indicate that while ChatGPT can follow a grading pattern similar to that of humans to some extent, the consistency of its grades with the human annotators varies and is generally lower than the human-human agreement.</p></div>
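The four agreement measures above can be computed with small pure-Python implementations; a sketch follows. The linear weighting used for the weighted kappa is one common choice and is an assumption here.

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of items on which two graders gave the same grade."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b, weight=None):
    """Cohen's kappa; with weight='linear', a linearly weighted kappa
    that penalizes disagreements by their distance |i - j|."""
    labels = sorted(set(a) | set(b))
    n = len(a)
    pa, pb = Counter(a), Counter(b)
    if weight is None:
        po = percent_agreement(a, b)                       # observed
        pe = sum(pa[l] * pb[l] for l in labels) / n ** 2   # by chance
        return (po - pe) / (1 - pe)
    maxd = max(labels) - min(labels)
    do = sum(abs(x - y) for x, y in zip(a, b)) / (n * maxd)
    de = sum(pa[i] * pb[j] * abs(i - j)
             for i in labels for j in labels) / (n ** 2 * maxd)
    return 1 - do / de

def pearson(a, b):
    """Pearson's correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)
```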
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Discussion</head><p>Bias. In our reduced dataset, the grades of H1 and H2 overlapped in only 60.88% of the cases. In the remaining cases, H2 demonstrated a bias by assigning the higher grade to 76.61% of the answers. While Mohler et al. <ref type="bibr" target="#b22">[23]</ref> describe this as a "real-world [issue] associated with the task of grading", such subjectivity can also be perceived as a strength of human annotation. Plank <ref type="bibr" target="#b23">[24]</ref> criticizes the assumption that a single gold label should be assigned to each instance, as it diminishes the variety in opinions and interpretations of human language. Particularly when creating new gold standards, such richness in the annotation may be an essential step towards reducing bias in models trained on them <ref type="bibr" target="#b24">[25]</ref>. In this context, we observe that ChatGPT assigned lower grades than H1 and H2 in 79.56% and 94.03%, respectively, of all cases of disagreement.</p></div>
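Disagreement-direction figures of the kind reported above can be computed with a small helper; a sketch with illustrative grade lists:

```python
def disagreement_bias(g1, g2):
    """Among items where two graders disagree, return the share for
    which the first grader was higher and the share for which the
    second grader was higher."""
    diff = [(x, y) for x, y in zip(g1, g2) if x != y]
    if not diff:
        return 0.0, 0.0
    n = len(diff)
    return (sum(x > y for x, y in diff) / n,
            sum(y > x for x, y in diff) / n)

h1 = [5, 4, 3, 5, 2]  # illustrative grades, not the study's data
h2 = [5, 5, 4, 4, 3]
print(disagreement_bias(h1, h2))  # (0.25, 0.75)
```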
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Question / Answers</head><p>[Fragment of Table 2: pairs of similar student answers with the grades assigned by H1, H2, and ChatGPT. Q1: "What is the base case for a recursive implementation of merge sort?" (answers: "Best case is one element."; "One element is sorted."). Q2: "When does C++ create a default constructor?". Q3: "What is the role of a header-file?" (answer: "To allow the compiler to recognize the classes when used elsewhere.").]</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Examples of similar short answers that received different grades from ChatGPT. Note: Typos in the student answers are present in the original data.</p><p>Inconsistency. Next, we took a closer look at exam tasks that students answered very similarly, yet which received different grades. We manually grouped similar answers to the same questions. While we discovered some inconsistencies in the human annotation within these groups, ChatGPT provided varying grades and differing justifications within nearly all of the answer groups. Table <ref type="table">2</ref> provides three such examples. In Q1 and Q2, both graders consistently assigned the highest mark to the pairs of similar answers; in both cases, ChatGPT gave different marks. Similar observations were made by Duong and Solomon <ref type="bibr" target="#b25">[26]</ref>, in particular when the authors asked the same questions multiple times. Filighera et al. <ref type="bibr" target="#b26">[27]</ref> discuss how easily LLMs can be manipulated via minor changes in the syntax of an answer (e.g. adding adjectives and adverbs). Depending on the manipulation, Filighera et al. <ref type="bibr" target="#b27">[28]</ref> found that students even managed to pass a 50% threshold on an exam "without answering a single question correctly". This underlines the difficulty of automating tasks such as ASAG. Such variations can be crucial when two answers are assessed as equivalent by a human yet distinguished by an LLM due to differences a human would consider negligible (e.g. an extra whitespace character or a period at the end of an answer).</p><p>The third example (Q3) depicts a case where one of the human annotators also graded the answers differently despite the high similarity of the texts. As mentioned by the authors of the initial dataset, one of the graders (i.e. H2) frequently assigned higher grades.
In addition, H2 also tended to grade similar answers differently more frequently than H1, for whom this was a rare exception. These results indicate that there may be a need for finer-grained grading (i.e. annotation) guidelines to reduce the discrepancies between graders.</p><p>The results shed light on some issues associated with human annotation. One noteworthy issue is the low inter-annotator scores achieved by the human annotators. Previous work has suggested the use of finer-grained and more precise annotation guidelines to achieve higher annotation accuracy <ref type="bibr" target="#b28">[29,</ref><ref type="bibr" target="#b29">30]</ref>. Additionally, human annotation can be time-consuming and costly <ref type="bibr" target="#b30">[31]</ref>, which leads dataset creators to look for alternatives such as the use of LLMs.</p><p>Large Language Models (LLMs) like ChatGPT present their own set of challenges. One issue is that closed-source models like GPT-3.5 are fundamentally different from their successors (e.g. GPT-4), making it difficult to understand and predict their behavior. While open-source models are accessible, they often remain large 'black boxes' that are challenging to interpret or understand fully <ref type="bibr" target="#b31">[32]</ref>. Providing more precise instructions to LLMs could potentially improve their performance. Yet, we need to consider the risk that they may still miss nuances that are easily spotted by human annotators, especially in complex or subtle domains. Lastly, the use of LLMs such as ChatGPT requires a substantial computational infrastructure <ref type="bibr" target="#b32">[33,</ref><ref type="bibr" target="#b14">15]</ref>, posing the question of whether the same (if not better) performance can be achieved without their excessive use.</p></div>
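The manual grouping of similar answers described above can be approximated programmatically; the sketch below is hypothetical (the study grouped answers by hand) and illustrates the kind of normalization a human grader applies implicitly, such as ignoring case and trailing periods.

```python
from collections import defaultdict

def normalize(answer):
    # ignore case, extra whitespace, and a trailing period
    return " ".join(answer.lower().rstrip(" .").split())

def group_answers(answers):
    """Group answers that are identical up to trivial surface variation."""
    groups = defaultdict(list)
    for a in answers:
        groups[normalize(a)].append(a)
    return groups

groups = group_answers([
    "One element is sorted.",
    "one element is sorted",
    "A list of size 1",
])
# two groups: the first two answers collapse into one
```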
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>Our exploratory comparison shows that ChatGPT's grading of short answers aligns only weakly with that of human graders. Generalization of these results to other domains may not be trivial; however, they already hint at the need for further research into the potential use of LLMs as an aid for domain-specific tasks such as ASAG. At this stage, we believe that the human ability to interpret and detect nuances in brief answers remains unmatched. Due to the complexity of the task, its time-intensive nature, and the costs associated with manual annotation, the use of LLMs as support in the annotation process for domain-specific datasets should be explored further.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Evaluation of inter-annotator performance. ChatGPT is the automated grading by GPT-3.5, H1 and H2 represent the human annotators, and H* is the subset of instances where H1 and H2 gave the same score. The highest scores for each measure are presented in bold.</figDesc><table><row><cell>Pair</cell><cell cols="4">Inter-ann. Score Kappa Score Weighted Kappa Score Pearson's Corr. Coef.</cell></row><row><cell>H1 &amp; H2</cell><cell>60.88%</cell><cell>0.295</cell><cell>0.395</cell><cell>0.586</cell></row><row><cell>H1 &amp; ChatGPT</cell><cell>30.56%</cell><cell>0.120</cell><cell>0.364</cell><cell>0.628</cell></row><row><cell>H2 &amp; ChatGPT</cell><cell>27.10%</cell><cell>0.050</cell><cell>0.189</cell><cell>0.519</cell></row><row><cell>H* &amp; ChatGPT</cell><cell>33.96%</cell><cell>0.050</cell><cell>0.186</cell><cell>0.537</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The rapid competitive economy of machine learning development: a discussion on the social risks and benefits</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Walter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">AI and Ethics</title>
		<imprint>
			<biblScope unit="page" from="1" to="14" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Health care trainees&apos; and professionals&apos; perceptions of ChatGPT in improving medical knowledge training: rapid survey study</title>
		<author>
			<persName><forename type="first">J.-M</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F.-C</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-M</forename><surname>Chu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-T</forename><surname>Chang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Medical Internet Research</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page">e49385</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">What is the impact of ChatGPT on education? A rapid review of the literature</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">K</forename><surname>Lo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Education Sciences</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page">410</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">The programmer&apos;s assistant: Conversational interaction with a large language model for software development</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">I</forename><surname>Ross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Martinez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Houde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Muller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Weisz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th International Conference on Intelligent User Interfaces</title>
				<meeting>the 28th International Conference on Intelligent User Interfaces</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="491" to="514" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Using lexical semantic techniques to classify free-responses</title>
		<author>
			<persName><forename type="first">J</forename><surname>Burstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wolff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lu</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Automatic grading of Portuguese short answers using a machine learning approach</title>
		<author>
			<persName><forename type="first">L</forename><surname>Galhardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">C T</forename><surname>De Souza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Brancher</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Anais Estendidos do XVI Simpósio Brasileiro de Sistemas de Informação</title>
				<imprint>
			<publisher>SBC</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="109" to="124" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A transformer for SAG: What does it grade?</title>
		<author>
			<persName><forename type="first">N</forename><surname>Willms</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Padó</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 11th Workshop on NLP for Computer Assisted Language Learning</title>
				<meeting>the 11th Workshop on NLP for Computer Assisted Language Learning</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="114" to="122" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A review of an information extraction technique approach for automatic short answer grading</title>
		<author>
			<persName><forename type="first">U</forename><surname>Hasanah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">E</forename><surname>Permanasari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Kusumawardani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">S</forename><surname>Pribadi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2016 1st International Conference on Information Technology, Information Systems and Electrical Engineering (ICITISEE)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="192" to="196" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">An automatic short-answer grading model for semi-open-ended questions</title>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Zhuang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Interactive learning environments</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="177" to="190" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">On deep learning approaches to automated assessment: Strategies for short answer grading</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ahmed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joorabchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Hayes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CSEDU</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="85" to="94" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">&quot;What can ChatGPT do?&quot; Analyzing early reactions to the innovative AI chatbot on Twitter</title>
		<author>
			<persName><forename type="first">V</forename><surname>Taecharungroj</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Big Data and Cognitive Computing</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page">35</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Selection-inference: Exploiting large language models for interpretable logical reasoning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Creswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Shanahan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Higgins</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Eleventh International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Zerotop: Zero-shot task-oriented semantic parsing using large language models</title>
		<author>
			<persName><forename type="first">D</forename><surname>Mekala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wolfe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Empirical Methods in Natural Language Processing</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Progprompt: Generating situated robot task plans using large language models</title>
		<author>
			<persName><forename type="first">I</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Blukis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mousavian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tremblay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Thomason</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Garg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICRA</title>
		<imprint>
			<biblScope unit="page" from="11523" to="11530" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">On the dangers of stochastic parrots: Can language models be too big?</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Bender</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gebru</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mcmillan-Major</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shmitchell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 ACM conference on fairness, accountability, and transparency</title>
				<meeting>the 2021 ACM conference on fairness, accountability, and transparency</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="610" to="623" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Assessing the factual accuracy of generated text</title>
		<author>
			<persName><forename type="first">B</forename><surname>Goodrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Saleh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery &amp; data mining</title>
				<meeting>the 25th ACM SIGKDD international conference on knowledge discovery &amp; data mining</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="166" to="175" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">SimLex-999: Evaluating semantic models with (genuine) similarity estimation</title>
		<author>
			<persName><forename type="first">F</forename><surname>Hill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Reichart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Korhonen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="page" from="665" to="695" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Retrieval-augmented generation for knowledge-intensive NLP tasks</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Piktus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Petroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Karpukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Küttler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-T</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rocktäschel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="9459" to="9474" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Re2G: Retrieve, rerank, generate</title>
		<author>
			<persName><forename type="first">M</forename><surname>Glass</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Rossiello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">F M</forename><surname>Chowdhury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Naik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gliozzo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Carpuat</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M.-C</forename><surname>De Marneffe</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><forename type="middle">V</forename><surname>Meza Ruiz</surname></persName>
		</editor>
		<meeting>the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics<address><addrLine>Seattle, United States</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="2701" to="2715" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Publicly available clinical BERT embeddings</title>
		<author>
			<persName><forename type="first">E</forename><surname>Alsentzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Murphy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Boag</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-H</forename><surname>Weng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Naumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Redmond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">B</forename><surname>McDermott</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NAACL HLT</title>
		<imprint>
			<biblScope unit="page">72</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">On the effectiveness of pre-trained language models for legal natural language processing: An empirical study</title>
		<author>
			<persName><forename type="first">D</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Schilder</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="75835" to="75858" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Text-to-text semantic similarity for automatic short answer grading</title>
		<author>
			<persName><forename type="first">M</forename><surname>Mohler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mihalcea</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)</title>
				<meeting>the 12th Conference of the European Chapter of the ACL (EACL 2009)</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="567" to="575" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Learning to grade short answer questions using semantic similarity measures and dependency graph alignments</title>
		<author>
			<persName><forename type="first">M</forename><surname>Mohler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bunescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mihalcea</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies</title>
				<meeting>the 49th annual meeting of the association for computational linguistics: Human language technologies</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="752" to="762" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">The &quot;problem&quot; of human label variation: On ground truth in data, modeling and evaluation</title>
		<author>
			<persName><forename type="first">B</forename><surname>Plank</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2022 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="10671" to="10682" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">ChatGPT for good? On opportunities and challenges of large language models for education</title>
		<author>
			<persName><forename type="first">E</forename><surname>Kasneci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Seßler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Küchemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bannert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dementieva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Fischer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Gasser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Groh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Günnemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hüllermeier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Learning and Individual Differences</title>
		<imprint>
			<biblScope unit="volume">103</biblScope>
			<biblScope unit="page">102274</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Analysis of large-language model versus human performance for genetics questions</title>
		<author>
			<persName><forename type="first">D</forename><surname>Duong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">D</forename><surname>Solomon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">European Journal of Human Genetics</title>
		<imprint>
			<biblScope unit="page" from="1" to="3" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Cheating automatic short answer grading with the adversarial usage of adjectives and adverbs</title>
		<author>
			<persName><forename type="first">A</forename><surname>Filighera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ochs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Steuer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tregel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Artificial Intelligence in Education</title>
		<imprint>
			<biblScope unit="page" from="1" to="31" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Fooling automatic short answer grading systems</title>
		<author>
			<persName><forename type="first">A</forename><surname>Filighera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Steuer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Rensing</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on artificial intelligence in education</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="177" to="190" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rigouts Terryn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Hoste</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Lefever</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language Resources and Evaluation</title>
		<imprint>
			<biblScope unit="volume">54</biblScope>
			<biblScope unit="page" from="385" to="418" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Comparing annotated datasets for named entity recognition in English literature</title>
		<author>
			<persName><forename type="first">R</forename><surname>Ivanova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Van Erp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kirrane</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirteenth Language Resources and Evaluation Conference</title>
				<meeting>the Thirteenth Language Resources and Evaluation Conference</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3788" to="3797" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Exploiting debate portals for semi-supervised argumentation mining in user-generated web discourse</title>
		<author>
			<persName><forename type="first">I</forename><surname>Habernal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2015 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="2127" to="2137" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Artetxe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Dewan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Diab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">V</forename><surname>Lin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2205.01068</idno>
		<title level="m">OPT: Open pre-trained transformer language models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">It&apos;s not just size that matters: Small language models are also few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Schick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schütze</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="2339" to="2352" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
