Evaluating LLMs’ Performance At Automatic Short-Answer Grading

Rositsa V. Ivanova¹, Siegfried Handschuh¹
¹ University of St. Gallen, Switzerland

Abstract
In recent years, the use of Large Language Models (LLMs) has become more accessible and widespread. With free-of-charge access options, people have begun applying the models to various tasks beyond next-word prediction. In an exploratory study, we take a closer look at the use of LLMs for Automatic Short Answer Grading. We compare the grading of short-answer tasks by two human graders to that of an LLM. We discuss the results and present examples of observed shortcomings in the annotation and grading.

Keywords
automatic short-answer grading, large language models, automated scoring

1. Introduction
Large Language Models (LLMs) have become our assistants in many everyday activities. Over the last few years, the speed at which new models are developed has become overwhelming to daily users, researchers, politicians, and lawmakers struggling to keep up with all options and opportunities [1]. Yet, their application has been explored and accepted in various domains [2, 3, 4].

Automatic Short Answer Grading (ASAG) systems emerged as an educational technology addressing the need for efficient assessment methods in both online and traditional educational environments long before the hype around LLMs [5]. The primary objective of ASAG systems is to automatically evaluate and score students’ responses to short-answer questions. The difficulty of the task arises from the brevity of the texts, often only a few words, and thus the limited context [6, 7]. One approach to ASAG for closed-ended questions is to compare the student answer to a predefined correct answer [8, 9]. The developments in ASAG have been heavily influenced by advancements in Natural Language Processing (NLP) and Machine Learning [10].
Accordingly, LLMs have found their applications in the creation of datasets and tools. While they are of great help for generic tasks such as answering questions or writing text [11], they often fall short when applied to domain-specific tasks [12, 13, 14]. One primary concern is the risk that LLMs amplify biases present in their training data [15]. Further, it is challenging to ensure the factual accuracy and relevance of the content generated by LLMs [16]. Previous attempts using Retrieval-Augmented Generation have been made to incorporate external sources and enrich LLMs’ answers with knowledge, improving the factual grounding and thus the safety of answers [17, 18, 19]. However, such approaches rely on knowledge databases and annotated datasets to learn from, which underlines the critical importance of creating high-quality gold-standard datasets [20, 21].

EvalLAC’24: Workshop on Automatic Evaluation of Learning and Assessment Content, July 08, 2024, Recife, Brazil
rositsa.ivanova@unisg.ch (R. V. Ivanova)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

We explore the use of LLMs for the automated grading of short-answer texts as an example of a complex task that requires an understanding of a brief answer without receiving more than a sample solution. Our exploratory study aims to address the question of whether LLMs have implicitly learned to perform well on specific NLP tasks (e.g. ASAG). We believe that understanding the shortcomings of LLMs is one of many steps towards developing more suitable annotation approaches in which LLMs support the process of automated grading.

2. Experiment
We compare the grading of students’ answers to exam questions done by two people to that of a popular, widely used and free-of-charge LLM (i.e. ChatGPT-3.5).
We acknowledge that the chosen model is merely one amongst many, which all have their individual strengths and weaknesses, and that it is being continuously updated. However, due to the widespread use of the model in various domains and the exploratory scope of this study, we build our use case on ChatGPT-3.5, while pointing out the limitations of our choice.

Human annotation
The initial dataset of this experiment was created in two steps. First, Mohler and Mihalcea [22] graded the assignments of undergraduate students in an introductory computer science (CS) course. The 630 short answers given by 30 students were evaluated by two graduate CS students on an interval scale from 0 to 5. The second dataset extended the former by expanding the total number of short answers to 2 273 [23]. The new texts were graded by the same two people. The grading scale ranged from 0 to 10 and in some cases the graders gave half points. The conversion of this scale to an equivalent from 1 to 5 led to the use of rational numbers in increments of 0.25 for some of the grades. For the purpose of our study, we kept only the answers that received a whole-number grade: we deemed that comparing grades of differing initial granularity (i.e. only whole numbers for the first part and a mix for the second) would introduce unnecessary bias, and 89% (2 022 answers) of the answers received whole-number grades.

ChatGPT
The prompt consisted of instructions, including the grading scale, the initial question, the desired correct answer, and the student answer. To gain better insight into the grading decisions, we requested a text comment for each grade selection.

3. Results
We compared the grading of the human annotators and ChatGPT in multiple steps and using various approaches. First, we compare the grades given to the answers by the first grader (H1) and the second grader (H2).
Second, we compare them individually to the score automatically assigned by ChatGPT. For the three pairs, we derive a simple percentage of inter-annotator agreement (IAA), evaluate the agreement beyond chance (Kappa Score), the agreement with a focus on the severity of disagreement (Weighted Kappa Score), and the linear correlation between the scores (Pearson’s Correlation Coefficient). A detailed discussion on the choice of correlation metric is provided by the dataset creators [22].

Pair            Inter-annotator Score   Kappa Score   Weighted Kappa Score   Pearson’s Corr. Coef.
H1 & H2         60.88%                  0.295         0.395                  0.586
H1 & ChatGPT    30.56%                  0.120         0.364                  0.628
H2 & ChatGPT    27.10%                  0.050         0.189                  0.519
H* & ChatGPT    33.96%                  0.050         0.186                  0.537

Table 1
Evaluation of inter-annotator performance. ChatGPT is the automated grading by GPT-3.5, H1 and H2 represent the human annotators, and H* is the subset of instances where H1 and H2 gave the same score. The highest scores for each measure are 60.88%, 0.295 and 0.395 (all H1 & H2), and 0.628 (H1 & ChatGPT).

Table 1 depicts the results for each pair and score. The agreement between the two human annotators (i.e. H1 & H2) served as a benchmark for the expected IAA. The Inter-annotator Score was 60.88%, indicating that the two human annotators agreed on grades more than half of the time. The Kappa Score (0.295) indicates an agreement below moderate (0.41-0.60), underlined by the Weighted Kappa Score at 0.395, which shows a slightly better but still modest agreement. However, considering the applied grading scale, the Pearson’s Correlation Coefficient (0.586) reflects a moderate positive correlation between the two sets of grades. In contrast, the comparison between each human annotator and ChatGPT (i.e. H1 & ChatGPT; H2 & ChatGPT) reveals a lower level of agreement. For H1 & ChatGPT, the Inter-annotator Score, the Kappa Score and the Weighted Kappa Score indicate a minimal agreement beyond what would be expected by chance.
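The four measures named above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function names, the toy grade lists, and the choice of linear weights for the Weighted Kappa Score are assumptions, since the weighting scheme is not stated in the text.

```python
from collections import Counter
from math import sqrt


def percent_agreement(a, b):
    """Share of items on which both graders gave exactly the same grade."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b, labels, weighted=False):
    """Cohen's kappa for two graders.

    With weighted=True, disagreements are weighted linearly by their
    distance on the grading scale (an assumption; quadratic weights are
    another common choice). Undefined (division by zero) if both graders
    always use one and the same label.
    """
    n = len(a)
    k = len(labels)
    index = {label: i for i, label in enumerate(labels)}
    observed = Counter((index[x], index[y]) for x, y in zip(a, b))
    row = Counter(index[x] for x in a)
    col = Counter(index[y] for y in b)

    def weight(i, j):
        if weighted:
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    # Observed disagreement vs. disagreement expected if graders were independent.
    d_obs = sum(weight(i, j) * c for (i, j), c in observed.items()) / n
    d_exp = sum(weight(i, j) * row[i] * col[j]
                for i in range(k) for j in range(k)) / (n * n)
    return 1.0 - d_obs / d_exp


def pearson(a, b):
    """Pearson's correlation coefficient between two grade lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical toy grades on the 0-5 scale (not taken from the dataset).
h1 = [5, 5, 3, 4, 2, 5, 0, 1]
gpt = [2, 4, 2, 4, 2, 3, 0, 1]
scale = list(range(6))

print(percent_agreement(h1, gpt))
print(cohen_kappa(h1, gpt, scale))
print(cohen_kappa(h1, gpt, scale, weighted=True))
print(pearson(h1, gpt))
```

The sketch also shows why the measures can diverge: exact-match agreement and unweighted kappa penalise a grade that is off by one as heavily as one that is off by five, whereas weighted kappa and Pearson's coefficient reward grades that move in the same direction.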
A surprisingly high value is achieved for the Pearson’s Correlation Coefficient at 0.628, suggesting a stronger correlation. One explanation for this could be the different distributions of the grading of H1 and H2. The agreement between the second human annotator (H2) and ChatGPT was even lower for all of the measures, yet here too the Pearson’s Correlation Coefficient remained high, indicating a moderate correlation despite the low agreement scores.

In addition to the evaluation for the three pairs, we created a subset of the initial dataset (with 1 231 answers), where H1 and H2 agreed on the grade (i.e. H*). We view these instances as examples of answers that were graded more objectively and where the assignment of the grade may be more straightforward. We calculated the IAA measures for the subset against ChatGPT. This yielded an Inter-annotator Score of 33.96%, which is the highest of the scores achieved by pairs including ChatGPT. However, here too the Kappa and the Weighted Kappa Scores remained noticeably lower. This suggests that even when humans were in agreement, ChatGPT’s grading did not closely align with the human consensus. The Pearson’s Correlation Coefficient was 0.537, indicating a moderate positive correlation but not a strong agreement.

In summary, while we observe a moderate level of agreement between human annotators, the agreement between ChatGPT and the humans is considerably lower. However, the Pearson’s Correlation Coefficients suggest there is still a moderate positive relationship in the grading patterns between humans and ChatGPT. The results indicate that while ChatGPT can follow a grading pattern similar to humans to some extent, the consistency of these grades with human annotators varies and is generally lower than the human-human agreement levels.

4. Discussion
Bias. In our reduced dataset, the grading of H1 and H2 overlapped in only 60.88% of the cases.
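The consensus subset H* and the overlap between two graders described above can be sketched as follows. This is only an illustration under stated assumptions: the function names and the toy grade lists are hypothetical and do not come from the dataset or from the authors' code.

```python
def consensus_subset(h1, h2, other):
    """Keep only items where H1 and H2 agree.

    Returns (consensus grade, other grader's grade) pairs, i.e. the
    H* subset paired with a third grader such as ChatGPT.
    """
    return [(a, c) for a, b, c in zip(h1, h2, other) if a == b]


def overlap_share(a, b):
    """Share of items on which both graders gave the same grade."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def share_higher(a, b):
    """Among items where a and b disagree, the share where b graded higher.

    Returns None if the two graders never disagree.
    """
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    if not diffs:
        return None
    return sum(y > x for x, y in diffs) / len(diffs)


# Hypothetical toy grades on the 0-5 scale (not taken from the dataset).
h1 = [5, 3, 4, 2, 5]
h2 = [5, 4, 5, 1, 5]
gpt = [4, 2, 4, 2, 5]

print(consensus_subset(h1, h2, gpt))
print(overlap_share(h1, h2))
print(share_higher(h1, h2))
```

Restricting the direction measure to the disagreeing items keeps it independent of the overlap share, so a grader who rarely disagrees but is always higher when they do is still visible.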
In the remaining cases, H2 showed a bias toward higher grades, giving a higher grade than H1 to 76.61% of the answers. While Mohler et al. [23] describe this as a “real-world [issue] associated with the task of grading”, such subjectivity can also be perceived as the strength of human annotation. Plank [24] criticizes the assumption that a single gold label should be assigned to instances, as it diminishes the variety in opinions and interpretations of human language. Particularly when creating new gold standards, such richness in the annotation may be an essential step towards reducing bias in models trained on them [25]. In this context, we observe that ChatGPT assigned lower grades than H1 and H2 in 79.56% and 94.03% of all cases of disagreement, respectively.

Question / Answers                                                        H1   H2   ChatGPT
Q1: What is the base case for a recursive implementation of merge sort?
  Best case is one element. One element is sorted.                        5    5    2
  A list size of 1, where it is already sorted.                           5    5    4
Q2: When does C++ create a default constructor?
  whenevery you dont specifiy your own                                    5    5    2
  When you dont specify any constructors.                                 5    5    4
Q3: What is the role of a header-file?
  To allow the compiler to recognize the classes when used elsewhere.     3    4    2
  Allow compiler to recognize the classes when used elsewhere             3    3    4

Table 2
Examples of similar short answers that received different grades from ChatGPT. Note: Typos in the student answers are present in the original data.

Inconsistency. Next, we took a closer look at the exam tasks that were answered very similarly by students, yet received different grades. We manually grouped similar answers to the same questions. While we discovered some inconsistencies in the human annotation within these groups, ChatGPT provided various grades and differing justifications for the assigned grade within nearly all of the answer groups. Table 2 provides three such examples.
In Q1 and Q2, both graders consistently assigned the highest mark to the pairs of similar answers. In both cases ChatGPT gave different marks. Similar observations have been made by Duong and Solomon [26], in particular when the authors asked the same questions multiple times. Filighera et al. [27] discuss potential weaknesses of LLMs that can easily be manipulated via minor changes in the syntax of an answer (e.g. adding adjectives and adverbs). Depending on the manipulation, Filighera et al. [28] discovered that students can even pass a 50% threshold on an exam “without answering a single question correctly”. This underlines the difficulty of automating tasks such as ASAG. Such variations can be crucial when two answers are assessed as equivalent by a human, yet distinguished by an LLM due to differences which a human would consider negligible (e.g. an extra space or a period at the end of an answer).

The third example (Q3) depicts a case where one of the annotators also graded the answers differently, despite the high similarity of the texts. As mentioned by the authors of the initial dataset, one of the graders (i.e. H2) frequently assigned higher grades. In addition, H2 also tended to grade similar answers differently more often than H1, for whom this was a rare exception. These results indicate that there may be a need for finer-grained grading (i.e. annotation) guidelines to reduce the discrepancies between graders.

The results shed light on some issues associated with human annotation. One noteworthy issue is the low inter-annotator scores achieved by human annotators. Previous work has suggested the use of finer-grained and more precise annotation guidelines to achieve higher annotation accuracy [29, 30]. Additionally, human annotation can be time-consuming and costly [31], which leaves dataset creators looking for alternatives such as the use of LLMs.

Large Language Models (LLMs) like ChatGPT present their own set of challenges.
One issue is that closed-source models like GPT-3.5 are fundamentally different from their successors (e.g., GPT-4), making it difficult to understand and predict their behavior. While open-source models are accessible, they often become large ’black boxes’ that are challenging to interpret or understand fully [32]. Providing more precise instructions to LLMs could potentially improve their performance. Yet, we need to consider the risk that they may still miss nuances that are easily spotted by human annotators, especially in complex or subtle domains. Lastly, the use of LLMs such as ChatGPT requires a substantial computational infrastructure [33, 15], raising the question of whether the same (if not better) performance can be achieved without their excessive use.

5. Conclusion
Large Language Models (LLMs) like ChatGPT present their own set of challenges. Closed-source models like GPT-3.5 are fundamentally different from their successors (e.g., GPT-4), making it difficult to understand and predict their behavior. While open-source models are accessible, they often become large ’black boxes’ that are challenging to interpret or understand fully. Providing more precise instructions to LLMs could potentially improve their performance. Yet, we need to consider the risk that they may still miss nuances that are easily spotted by human annotators, especially in complex or subtle domains.

Generalization of the results to other domains may not be trivial; however, the results of this study already hint at the need for further research into the potential use of LLMs as an aid for domain-specific tasks such as ASAG. At this stage we believe that the ability of humans to interpret and detect nuances in brief answers remains unmatched. Due to the complexity of the task, its time-intensive nature, and the costs associated with manual annotation, the use of LLMs as support in the annotation process for domain-specific datasets should be explored further.

References
[1] Y. Walter, The rapid competitive economy of machine learning development: a discussion on the social risks and benefits, AI and Ethics (2023) 1–14.
[2] J.-M. Hu, F.-C. Liu, C.-M. Chu, Y.-T. Chang, Health care trainees’ and professionals’ perceptions of ChatGPT in improving medical knowledge training: rapid survey study, Journal of Medical Internet Research 25 (2023) e49385.
[3] C. K. Lo, What is the impact of ChatGPT on education? A rapid review of the literature, Education Sciences 13 (2023) 410.
[4] S. I. Ross, F. Martinez, S. Houde, M. Muller, J. D. Weisz, The programmer’s assistant: Conversational interaction with a large language model for software development, in: Proceedings of the 28th International Conference on Intelligent User Interfaces, 2023, pp. 491–514.
[5] J. Burstein, S. Wolff, C. Lu, Using lexical semantic techniques to classify free-responses, Springer, 1999.
[6] L. Galhardi, R. C. T. de Souza, J. Brancher, Automatic grading of Portuguese short answers using a machine learning approach, in: Anais Estendidos do XVI Simpósio Brasileiro de Sistemas de Informação, SBC, 2020, pp. 109–124.
[7] N. Willms, U. Padó, A transformer for SAG: What does it grade?, in: Proceedings of the 11th Workshop on NLP for Computer Assisted Language Learning, 2022, pp. 114–122.
[8] U. Hasanah, A. E. Permanasari, S. S. Kusumawardani, F. S. Pribadi, A review of an information extraction technique approach for automatic short answer grading, in: 2016 1st International Conference on Information Technology, Information Systems and Electrical Engineering (ICITISEE), IEEE, 2016, pp. 192–196.
[9] L. Zhang, Y. Huang, X. Yang, S. Yu, F. Zhuang, An automatic short-answer grading model for semi-open-ended questions, Interactive Learning Environments 30 (2022) 177–190.
[10] A. Ahmed, A. Joorabchi, M. J. Hayes, On deep learning approaches to automated assessment: Strategies for short answer grading, CSEDU (2) (2022) 85–94.
[11] V. Taecharungroj, “What can ChatGPT do?” Analyzing early reactions to the innovative AI chatbot on Twitter, Big Data and Cognitive Computing 7 (2023) 35.
[12] A. Creswell, M. Shanahan, I. Higgins, Selection-inference: Exploiting large language models for interpretable logical reasoning, in: The Eleventh International Conference on Learning Representations, 2022.
[13] D. Mekala, J. Wolfe, S. Roy, ZeroTOP: Zero-shot task-oriented semantic parsing using large language models, in: Conference on Empirical Methods in Natural Language Processing, 2022.
[14] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, A. Garg, ProgPrompt: Generating situated robot task plans using large language models, in: ICRA, 2023, pp. 11523–11530.
[15] E. M. Bender, T. Gebru, A. McMillan-Major, S. Shmitchell, On the dangers of stochastic parrots: Can language models be too big?, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021, pp. 610–623.
[16] B. Goodrich, V. Rao, P. J. Liu, M. Saleh, Assessing the factual accuracy of generated text, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 166–175.
[17] F. Hill, R. Reichart, A. Korhonen, SimLex-999: Evaluating semantic models with (genuine) similarity estimation, Computational Linguistics 41 (2015) 665–695.
[18] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.
[19] M. Glass, G. Rossiello, M. F. M. Chowdhury, A. Naik, P. Cai, A. Gliozzo, Re2G: Retrieve, rerank, generate, in: M. Carpuat, M.-C. de Marneffe, I. V. Meza Ruiz (Eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 2701–2715.
[20] E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, W. Redmond, M. B. McDermott, Publicly available clinical BERT embeddings, NAACL HLT 2019 (2019) 72.
[21] D. Song, S. Gao, B. He, F. Schilder, On the effectiveness of pre-trained language models for legal natural language processing: An empirical study, IEEE Access 10 (2022) 75835–75858.
[22] M. Mohler, R. Mihalcea, Text-to-text semantic similarity for automatic short answer grading, in: Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), 2009, pp. 567–575.
[23] M. Mohler, R. Bunescu, R. Mihalcea, Learning to grade short answer questions using semantic similarity measures and dependency graph alignments, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 752–762.
[24] B. Plank, The “problem” of human label variation: On ground truth in data, modeling and evaluation, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 10671–10682.
[25] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier, et al., ChatGPT for good? On opportunities and challenges of large language models for education, Learning and Individual Differences 103 (2023) 102274.
[26] D. Duong, B. D. Solomon, Analysis of large-language model versus human performance for genetics questions, European Journal of Human Genetics (2023) 1–3.
[27] A. Filighera, S. Ochs, T. Steuer, T. Tregel, Cheating automatic short answer grading with the adversarial usage of adjectives and adverbs, International Journal of Artificial Intelligence in Education (2023) 1–31.
[28] A. Filighera, T. Steuer, C. Rensing, Fooling automatic short answer grading systems, in: International Conference on Artificial Intelligence in Education, Springer, 2020, pp. 177–190.
[29] A. Rigouts Terryn, V. Hoste, E. Lefever, In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora, Language Resources and Evaluation 54 (2020) 385–418.
[30] R. Ivanova, M. Van Erp, S. Kirrane, Comparing annotated datasets for named entity recognition in English literature, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 3788–3797.
[31] I. Habernal, I. Gurevych, Exploiting debate portals for semi-supervised argumentation mining in user-generated web discourse, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 2127–2137.
[32] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al., OPT: Open pre-trained transformer language models, arXiv preprint arXiv:2205.01068 (2022).
[33] T. Schick, H. Schütze, It’s not just size that matters: Small language models are also few-shot learners, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 2339–2352.