=Paper=
{{Paper
|id=Vol-3909/Paper_22.pdf
|storemode=property
|title=Features of the Practical Use of LLM for Generating Quiz
|pdfUrl=https://ceur-ws.org/Vol-3909/Paper_22.pdf
|volume=Vol-3909
|authors=Oleh Ilarionov,Hanna Krasovska,Iryna Domanetska,Olena Fedusenko
|dblpUrl=https://dblp.org/rec/conf/iti2/IlarionovKDF24
}}
==Features of the Practical Use of LLM for Generating Quiz==
Features of the Practical Use of LLM for Generating Quiz

Oleh Ilarionov1, Hanna Krasovska1, Iryna Domanetska1, Olena Fedusenko1

1 Taras Shevchenko National University of Kyiv, Volodymyrs'ka str. 64/13, Kyiv, 01601, Ukraine

Information Technology and Implementation (IT&I-2024), November 20-21, 2024, Kyiv, Ukraine
Corresponding author. These authors contributed equally.
oleg.ilarionov@knu.ua (O. Ilarionov); hanna.krasovska@knu.ua (H. Krasovska); irinadomanetskaya@gmail.com (I. Domanetska); elvenff@gmail.com (O. Fedusenko)
0000-0002-7435-3533 (O. Ilarionov); 0000-0003-1986-6130 (H. Krasovska); 0000-0002-8629-9933 (I. Domanetska); 0000-0002-5782-5922 (O. Fedusenko)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The article explores the possibilities of using large language models (LLMs) such as GPT, Claude, Copilot and Gemini for automated test task generation in education. The ability of these models to generate different types of tasks, including multiple-choice, open-ended and fill-in-the-blank items, is assessed, as is their compliance with educational standards and with cognitive levels according to Bloom's taxonomy. A comparative analysis of the quality of the generated tests in terms of complexity, structure and adaptability is carried out. The limitations of the models in generating tasks at higher cognitive levels are identified, and recommendations for their integration into educational platforms are given. The results of the study can improve the process of assessing students' knowledge and promote the development of adaptive learning.

Keywords: LLM, test generation, Bloom's taxonomy, educational standards, test automation

1. Introduction

Thanks to their ability to create new content based on existing data, generative artificial intelligence models are opening up new opportunities in many industries, from business to art, science and education. They allow routine tasks to be automated and speed up processes that used to require considerable time and creative effort. The impact of generative AI is becoming increasingly visible in the educational environment. Generative AI tools can be used for a variety of educational purposes, making the educational process more individualised, adaptive and efficient, and providing access to education for a wider range of people, including those with disabilities. Already, generative AI models are helping teachers generate multimodal teaching materials, adjust lesson plans, select relevant literature, and create tasks, scenarios or simulations that help students develop analytical and research skills.

One of the key elements of the educational process is the quality control of students' knowledge and skills, as it allows teachers not only to assess the level of learning but also to identify gaps and improve teaching methods. Control provides feedback between teachers and students, encouraging the latter to learn more actively and develop themselves. In recent years, one of the most common methods of assessing students' knowledge has been the test form of control, which has a number of significant advantages: objectivity of assessment, speed and convenience of testing, coverage of a large amount of material, variety of task formats, possibility of analysing statistics, transparency and clarity, standardisation of assessment, adaptability, etc. However, testing, although an effective control method, has its limitations: tests cannot always adequately assess the depth of understanding of the material or the practical skills of students. Therefore, it is important that test tasks are well thought out, as their quality directly affects the results of control.

Recently, special attention has been paid to large language models (LLMs), which are designed to process and generate texts, can solve various tasks (translation, text creation, emotion analysis, answering questions, etc.) and have great potential in the field of test task automation. They can greatly simplify the work of teachers and create adaptive, diverse and personalised tests, which improves the quality of student knowledge control. LLMs can create tasks of various formats, such as closed-ended questions with one or more correct answers, matching questions, fill-in-the-blank tasks, open-ended questions, and tests of logical thinking and analysis. LLMs can not only generate questions but also provide detailed explanations for correct and incorrect answers, which allows students to better understand their mistakes and improves their learning. However, the question arises as to how effectively different LLMs generate such test tasks and whether the tests they create meet generally accepted pedagogical standards.

2. Literature review

2.1. Opportunities for LLM in education

Large language models (LLMs) offer great prospects for improving the efficiency of the educational process, in particular through the automation of tasks that previously required significant time and intellectual resources. Research shows that LLMs can adapt to different learning contexts by automating the development of test tasks, personalising educational materials, and improving access to knowledge for students with different learning needs [5].

The GPT, Claude, Copilot and Gemini models can generate both simple and complex test items, including multiple-choice, matching and open-ended questions. Studies of GPT-4 have shown that this model has a high ability to adapt the complexity of tasks, which allows teachers to create questions at different cognitive levels according to Bloom's taxonomy [5, 6]. Claude, on the other hand, demonstrates strengths in ethical and safe content generation, which is especially important in educational environments focused on preventing bias and harmful materials [5, 6].

Automating the creation of training materials and tests with LLMs reduces the workload of teachers, freeing up time for individual student support. Another important aspect is the ability to create adaptive tasks that adjust to the level of knowledge and learning pace of each participant. This helps to increase motivation to learn, as students receive immediate feedback and can identify gaps in their knowledge at the early stages of learning [7].

Despite these advantages, the issue of integrating LLMs into learning platforms remains relevant. Teachers need to learn how to properly formulate queries to the models to ensure the relevance of the results obtained. Researchers also draw attention to the limitations of the free versions of LLMs, which may restrict their use in educational institutions, especially when processing large amounts of textual data or graphical content [5, 8].
However, the prospects for the development of these technologies, in particular in terms of improving the accuracy and reliability of the models, open up new horizons for innovation in education. Thus, the use of LLMs in education allows for an integrated approach to learning, combining the automation of routine processes with increased individualisation of the educational experience. This helps to create conditions for more effective knowledge control and the development of students' analytical skills, which is critical for modern education.

A query optimisation algorithm was used to ensure the relevance and quality of the generated tasks. It includes several key stages: defining the goal, forming a role for the LLM, checking for key details in the request, and eliminating ambiguities before sending it (Figure 1) [9, 10].

Figure 1: Algorithm for generating an optimal query for large language models

2.2. Analysing the correspondence of tests to Bloom's taxonomy

Bloom's taxonomy is a widely recognised tool for assessing the cognitive complexity of learning tasks, and its use in knowledge testing provides structure and consistency in testing different levels of understanding. The taxonomy divides cognitive processes into six levels: memorising, understanding, applying, analysing, evaluating and creating [5]. To assess whether tests generated by large language models (LLMs) meet the standards of Bloom's taxonomy, it is important to consider how well the test questions cover these cognitive levels and whether they promote critical thinking and analytical skills.

Research on GPT-4 has shown that this model is capable of generating questions corresponding to different levels of Bloom's taxonomy, including questions that involve basic memorisation of information (e.g., definitions of concepts or terms), as well as more complex analysis and synthesis tasks that require a deeper understanding of the topic [5, 7]. For example, GPT-4 can generate questions that require applying knowledge in new contexts, such as problem solving or comparing concepts, which belong to the application and analysis levels of Bloom's taxonomy.

The Claude model also demonstrates the ability to generate questions at different cognitive levels, but its focus is mainly on basic-level tasks such as memorisation and comprehension. The analysed tests generated by Claude show a tendency to create questions that require students to reproduce factual information or explain simple concepts, with less attention to tasks that require evaluation or the creation of new solutions [5].

An important characteristic of LLM-generated tests is their ability to adapt the level of difficulty to different levels of student knowledge. For example, the GPT and Copilot models can generate adaptive tests that match both basic and advanced cognitive levels. This allows teachers to create tests that gradually increase in difficulty, starting with simple memorisation questions and ending with analysis and evaluation tasks that require deeper processing of the material [7].

It is important to note, however, that the ability of the models to generate questions at the highest cognitive levels, such as creating new concepts or evaluating solutions, is still limited. For example, only a few models, such as GPT-4, are able to effectively formulate tasks that include elements of synthesis and critical evaluation, while other models, such as Claude and Gemini, mainly focus on lower cognitive level tasks [6, 7].
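As a purely illustrative aside (not part of the cited studies, which rely on expert judgement), the kind of correspondence check discussed above can be roughly approximated by a keyword heuristic. In the sketch below the cue-verb lists and the example question are assumptions chosen for demonstration only:

```python
# A toy heuristic for tagging a generated question with an approximate Bloom
# level by matching characteristic cue verbs. The verb lists are illustrative
# assumptions, not a validated instrument; expert review remains authoritative.

BLOOM_CUES = {
    "memorising":    ["define", "list", "name", "identify"],
    "understanding": ["explain", "describe", "summarise", "classify"],
    "applying":      ["apply", "use", "solve", "demonstrate"],
    "analysing":     ["analyse", "compare", "contrast", "differentiate"],
    "evaluating":    ["evaluate", "justify", "assess", "critique"],
    "creating":      ["create", "design", "propose", "construct"],
}

def estimate_bloom_level(question: str) -> str:
    """Return the highest Bloom level whose cue verbs occur in the question."""
    text = question.lower()
    level_found = "memorising"  # default to the lowest level
    for level, cues in BLOOM_CUES.items():  # dict order runs from lowest to highest
        if any(cue in text for cue in cues):
            level_found = level
    return level_found

print(estimate_bloom_level(
    "Compare the roles of actors and use cases in a UML use case diagram."
))  # prints "analysing" for this example question
```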
Thus, the analysis of the correspondence of LLM-generated tests to Bloom's taxonomy shows that these models are able to cover different cognitive levels, but the complexity and variety of tasks vary depending on the specific model. The use of LLMs to create tests opens up prospects for flexible adaptation of learning tasks, which contributes to improving the quality of student assessment.

2.3. Limitations of models in task generation

Despite their considerable potential, large language models (LLMs) have a number of limitations in generating test items that should be taken into account when implementing them in the educational process. These limitations can affect the quality of the created tests, as well as the effectiveness of their use in different learning contexts. The main challenges relate both to technical aspects of the models and to pedagogical constraints.

1. Limited context size and task types. LLMs such as GPT-4, Claude or Gemini process text within a predefined context size, which limits their ability to generate tasks based on large learning materials. For example, when dealing with long lecture materials, the model may lose relevant details or create incomplete questions that do not cover all the necessary information [6]. Some models, such as Copilot and Gemini, further limit text processing in their free versions, forcing users to manually copy content into queries, which reduces their usability [5].

2. Problems with the validity and relevance of questions. Generated tasks do not always fully meet the learning objectives and may not cover the full range of competencies required by the educational programme. Such tasks often have reduced construct validity, as the models can formulate questions without a deep understanding of the subject area or course specifics [7]. In addition, the possibility of randomly guessing the correct answer is especially relevant for multiple-choice tasks if the answer options are not sufficiently differentiated or plausible.

3. Lack of specialisation and domain knowledge. LLMs are able to generate tasks across various disciplines, but their performance may decrease in specific subject areas that require expert knowledge. For example, when creating tasks in programming or medicine, the models can generate questions that contain inaccuracies or do not match the difficulty level of the course [8]. Claude and Copilot demonstrate a limited ability to create tasks at high cognitive levels, such as evaluation and the creation of new solutions, which may reduce their effectiveness for advanced courses.

4. Ethical and methodological limitations. Another important issue is the risk of bias in the questions, as LLMs are trained on large amounts of data that may contain cultural or gender stereotypes. Claude has built-in mechanisms to minimise such risks, but it is not always possible to completely avoid bias [7]. In addition, the tasks created by LLMs may not take into account the individual needs of students with special educational needs, which can limit the accessibility and fairness of testing.

5. Technical limitations and integration. Integration of LLMs into learning platforms can also be difficult due to technical limitations, such as limited access to free versions of the models or the complexity of setting up APIs for automatic task generation (a minimal integration sketch is given after this list). Teachers may need additional training to formulate high-quality queries, which increases the complexity of using these tools in everyday practice [6, 8].
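To make the integration point in item 5 concrete, the following minimal sketch shows one possible way to request quiz questions programmatically. It assumes the OpenAI Python SDK (v1.x) and an illustrative model identifier; it is not the setup used in this study, which worked through the models' free web interfaces:

```python
# A minimal integration sketch (an assumption, not the study's setup):
# requesting quiz questions from an LLM through the OpenAI Python SDK (v1.x).
# The model name, prompt wording and output format are illustrative choices.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_quiz(lecture_text: str, n_questions: int = 20) -> str:
    """Ask the model for multiple-choice questions based on lecture_text."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model identifier
        messages=[
            {"role": "system",
             "content": "You are a virtual teacher who writes test questions "
                        "covering all levels of Bloom's taxonomy."},
            {"role": "user",
             "content": f"Create {n_questions} multiple-choice questions with "
                        f"brief justifications based on this lecture:\n\n{lecture_text}"},
        ],
    )
    return response.choices[0].message.content

# Hypothetical usage:
# print(generate_quiz(open("lecture_uml.txt", encoding="utf-8").read()))
```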
Thus, the limitations of LLMs in generating tasks relate both to their technical characteristics and to pedagogical aspects. Nevertheless, the development of new algorithms and the improvement of the models open up opportunities to overcome these challenges in the future.

3. Objective

General-purpose LLMs available in free versions were chosen for the study: GPT-4 (OpenAI, chatgpt.com), Claude (Anthropic, claude.ai), Copilot (Microsoft, copilot.cloud.microsoft) and Gemini (Google DeepMind, gemini.google.com). These models were chosen for their availability, popularity, and ability to generate different types of tasks (e.g., multiple-choice questions, open-ended questions, matching tasks). This decision allowed us to focus on the possibilities of using models that do not require additional hardware or payment, which is an essential factor for the widespread introduction of such tools in educational institutions. The use of freely available models allows us to evaluate their potential for automating the creation of test tasks without the need for significant investment in resources.

4. Methodology

4.1. Research models and tools

General-purpose LLMs in their free versions were chosen for the study: GPT-4 (OpenAI, chatgpt.com), Claude (Anthropic, claude.ai), Copilot (Microsoft, copilot.cloud.microsoft) and Gemini (Google DeepMind, gemini.google.com). These models were chosen for their availability, popularity, and ability to generate different types of tasks. This allowed us to focus on analysing the possibilities of using the models and assessing their potential for automated test task creation without monthly fees or investments in additional hardware or other resources, which is an essential factor for the widespread introduction of such tools in the educational process.

4.2. Selecting learning content for test task generation

To generate test tasks, a fragment of lecture material in Ukrainian on the discipline "Technology of creating software products", intended for 2nd-year students of the speciality "Computer Science", was used. The text document had the following parameters:
● Characters without spaces: 9710;
● Word count: 1523;
● Format: PDF, size 361 KB.
The content covered the basic concepts of UML use case diagrams, which provides sufficient depth to test the models' ability to generate questions at different cognitive levels.

4.3. The procedure for creating and optimising queries

To ensure the relevance and quality of the generated tasks, we used the query optimisation algorithm (Figure 1). The following key steps were taken before sending the request:
● Objective: to receive 20 test questions that cover the entire content of the lecture and correspond to Bloom's taxonomy.
● Formation of a role for the LLM: the models were set up as a "virtual teacher", able to explain the material and create questions based on the reading.
● Checking for clarity and consistency: the request was checked for logical errors and ambiguities before being sent.
The procedure for creating and optimising queries was based on the algorithm (Figure 1) to obtain relevant test tasks for the uploaded lecture fragment. An example of an optimised query is shown in Figure 2; a code sketch of the same steps is given after the figure.

Figure 2: Optimised query for test task generation
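The query-construction steps listed above can be read as a small data structure plus validation checks. The sketch below is one hypothetical rendering of the algorithm in Figure 1; the field names, checks and example values are chosen for illustration and are not taken from the study:

```python
# A sketch of the query-optimisation steps in Figure 1 / Section 4.3.
# The structure, validation rules and example values are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class QuizQuery:
    objective: str       # what the test should achieve
    role: str            # role assigned to the LLM, e.g. "virtual teacher"
    source_text: str     # lecture fragment the questions must cover
    n_questions: int = 20

    def validate(self) -> list[str]:
        """Return a list of problems to resolve before sending the query."""
        issues = []
        if not self.objective.strip():
            issues.append("objective is missing")
        if not self.source_text.strip():
            issues.append("no learning content attached")
        if self.n_questions <= 0:
            issues.append("number of questions must be positive")
        return issues

    def render(self) -> str:
        """Assemble the final prompt once validation passes."""
        return (
            f"Act as a {self.role}. {self.objective} "
            f"Generate {self.n_questions} test questions strictly based on the "
            f"following material, covering all levels of Bloom's taxonomy:\n\n"
            f"{self.source_text}"
        )

query = QuizQuery(
    objective="Assess students' understanding of UML use case diagrams.",
    role="virtual teacher",
    source_text="<lecture fragment>",
)
assert not query.validate(), query.validate()  # stop if key details are missing
prompt = query.render()
```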
4.4. Assessing the quality of generated tasks

Several criteria were used to assess the quality of the tasks:
1. Compliance with Bloom's taxonomy: we assessed whether the questions corresponded to different cognitive levels, from memorisation to creation.
2. Structure and clarity: the clarity of wording and the presence of explanations for correct and incorrect answers were analysed.
3. Variety of task types: the ability of the models to generate questions of different formats (multiple choice, open-ended, etc.) was compared.
4. Completion time: how quickly students could complete the test within the given time was taken into account.
5. Validity and discriminative power: expert analysis was used to check the extent to which the tasks meet the educational objectives and discriminate between students with different levels of knowledge.

4.5. Collecting and comparing data

For each model, 20 test questions were generated based on the same training content. The generated questions were compared by the following parameters:
● average length of the question and justification (in characters);
● median number of words in a question;
● level of difficulty of the questions according to Bloom's taxonomy.

4.6. Analysis of reliability and practical limitations

After the tasks were generated, an expert analysis was conducted to ensure that they met the learning objectives and were clear to students. The usability of each model in the educational process was also taken into account. Particular attention was paid to technical limitations, such as the amount of text that can be processed and the ability to integrate the models into existing testing platforms.

5. Data and analysis

5.1. Comparative analysis of model capabilities

To compare the capabilities of the GPT-4, Claude, Copilot and Gemini models, 20 test tasks were generated on the basis of the same training content. The models were evaluated according to the following parameters: the number of aspects of the topic covered, compliance with Bloom's taxonomy, the variety of task formats, and the level of question complexity.

Table 1 summarises the ability of the models to process text and graphic information in the free mode. As can be seen, only GPT-4 and Claude can process uploaded text files, while Copilot and Gemini require manual text input.

Table 1. Types of data processed free of charge

Model     Text data   Image
GPT-4     +           +
Claude    +           +
Copilot   -           +
Gemini    -           +

5.2. Variety of test item formats

Table 2 compares the types of test tasks that the models can generate without specialised queries. Of the 10 possible formats, GPT-4 supports all of them, while Copilot and Gemini demonstrate limited functionality.

Table 2. Types of tasks available for generation

Format of tasks           GPT-4   Claude   Copilot   Gemini
Multiple choice           +       +        +         +
Several correct answers   +       +        +         +
True/False                +       +        +         -
Open answer               +       -        +         +
Matching tasks            +       +        +         +
Sequence tasks            +       -        +         +
Filling in the blanks     +       +        +         -
Crosswords                +       -        -         -
Essay                     +       +        +         -
Graphic tasks             +       -        -         -

5.3. Assessment of the quality and complexity of tasks

A comparison of the quality of questions generated by GPT-4 and Claude revealed a difference in the level of difficulty and depth of topic coverage. GPT-4 demonstrated a tendency to create questions at higher cognitive levels, including analysis and synthesis, while Claude focuses on memorisation and comprehension.
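The length and word-count parameters defined in Section 4.5 and reported below can be reproduced with a few lines of Python. In this sketch the data structure and field names are illustrative assumptions rather than the study's actual tooling:

```python
# A short sketch for the comparison parameters in Section 4.5: average
# question/justification length and median word count. Field names are assumed.
from statistics import mean, median

def describe(tasks: list[dict]) -> dict:
    """tasks: [{"question": str, "justification": str}, ...]"""
    return {
        "avg_question_chars": mean(len(t["question"]) for t in tasks),
        "avg_justification_chars": mean(len(t["justification"]) for t in tasks),
        "median_words_per_question": median(len(t["question"].split()) for t in tasks),
    }

# Hypothetical usage with two generated question sets:
# stats_gpt4 = describe(tasks_gpt4)
# stats_claude = describe(tasks_claude)
```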
The following parameters were chosen to assess the quality of the generated tasks: the average length of the question and of the explanation were used as indicators of the structure and level of detail of the answers, which affects the clarity and completeness of the information provided to students, and the median number of words in a question was chosen to assess the conciseness and clarity of the wording. The values of the selected parameters are presented in Table 3.

Table 3. Statistical analysis of task characteristics

Parameter                                      GPT-4   Claude   Difference (GPT-4 - Claude)
Average question length (characters)           85      110      -25
Average length of justification (characters)   120     150      -30
Median number of words per question             15       18      -3

Figure 3 shows an example of a GPT-4-generated test question that demonstrates the structure of the question and the explanation for the answer.

Figure 3: An example of generating a test question for a prompt in GPT-4

Figure 4 shows an example of a test question created by Claude that illustrates a different approach to question generation and level of detail.

Figure 4: An example of generating a test question for a prompt in Claude

5.4. Correspondence of tasks to Bloom's taxonomy

The analysis of the generated tasks showed that GPT-4 covers all levels of Bloom's taxonomy, including the highest levels, evaluation and creation. Claude, on the other hand, mostly generates questions at the basic levels (knowledge and understanding). This shows that GPT-4 is more flexible in generating tasks for different learning contexts (Figure 5).

Figure 5: Graphs comparing question difficulty (a) and coverage of use case aspects for the tests generated by GPT and Claude

5.5. Additional features of the models

GPT-4 provides more advanced functionality by allowing justifications to be added to answers. Claude is less detailed in explaining correct and incorrect options, which may limit its effectiveness for training purposes that require in-depth feedback.

5.6. Analysing the validity of generated tasks

A comprehensive approach was used to assess the validity of the tasks, covering several key aspects: content validity, construct validity, clarity of wording, relevance of answer options, and absence of ambiguity. Each criterion was assessed by experts on a five-point scale (1 - very low validity, 5 - very high validity). Leading academic staff with many years of teaching experience in computer science disciplines and practical experience in applying object-oriented analysis and design using UML were involved as experts in the quality assessment of the generated tasks. The scores for the GPT-4 and Claude models are presented in Table 4.

Table 4. Assessing the validity of tasks generated by GPT-4 and Claude

Criterion                              GPT-4   Claude   Difference (Claude - GPT-4)
Content validity                       4.5     4.5      0
Construct validity                     4.0     4.5      +0.5
Clarity of wording                     4.5     4.5      0
No ambiguity                           4.0     4.5      +0.5
Relevance of answer options            4.5     4.5      0
Cognitive level (according to Bloom)   3.5     4.0      +0.5
Average score                          4.17    4.42     +0.25

The analysis showed that the Claude model received higher scores for construct validity and absence of ambiguity, which indicates the clarity and relevance of its questions. At the same time, GPT-4 demonstrated flexibility in generating tasks at different cognitive levels, although some of them may contain minor ambiguities.
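The averaging behind Table 4 can be reproduced with a short helper. In the sketch below the per-expert scores are made-up placeholders rather than the study's data; only the aggregation logic (mean per criterion, then mean of criteria) is illustrated:

```python
# A sketch of aggregating expert validity ratings as in Table 4.
# The rating values below are illustrative placeholders, not the study's data.
from statistics import mean

def average_validity(ratings: dict[str, list[float]]) -> float:
    """ratings maps each criterion to the scores (1-5) given by the experts."""
    per_criterion = {criterion: mean(scores) for criterion, scores in ratings.items()}
    return round(mean(per_criterion.values()), 2)

example_ratings = {
    "content_validity":   [5, 4, 4],   # hypothetical expert scores
    "construct_validity": [4, 4, 4],
    "clarity_of_wording": [5, 4, 5],
}
print(average_validity(example_ratings))
```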
6. Conclusion

This study demonstrates the high potential of large language models (LLMs) for automating the creation of test tasks in the educational process. However, the results also revealed differences in the capabilities and effectiveness of the models. GPT-4 and Claude showed high performance in generating tasks, but each of them has its own advantages and limitations that affect its application.

GPT-4 demonstrated the greatest flexibility in creating tasks of different formats and at different cognitive levels according to Bloom's taxonomy. Its ability to generate complex questions, including those requiring analysis and synthesis, makes this model suitable for use in curricula focused on the development of analytical thinking. At the same time, GPT-4 revealed some shortcomings related to possible ambiguities in the questions, which require additional verification by teachers.

Claude, in turn, received the highest scores for construct validity and clarity of task wording. This indicates its effectiveness in creating questions of basic and medium difficulty. However, this model demonstrated a limited ability to formulate tasks at the highest cognitive levels (synthesis and evaluation), which may reduce its effectiveness for advanced courses.

Copilot and Gemini are less versatile than GPT-4 and Claude, in part because of the limited number of task formats available in the free mode. However, these models can be useful for highly specialised tasks, such as programming testing or visual element integration.

The study also showed that the correct formulation of queries is an important factor in obtaining relevant answers from the models. Teachers need to take into account both the limitations of the models (for example, the amount of text being processed) and the peculiarities of generating questions at different cognitive levels.

Thus, the use of LLMs to create test tasks is a promising area of educational technology development. It is important to note that the study used widely available models in free versions that do not require specialised hardware. This demonstrates that automated test task creation can be affordable for educational institutions with limited resources, as well as for teachers who want to use modern technologies without additional costs. The choice of a particular model should be based on the purpose of the test and the level of complexity of the tasks.

Declaration on Generative AI

The authors have not employed any Generative AI tools.

References

[1]. Laban, P., Wu, C.-S., Murakhovs'ka, L., Liu, W., & Xiong, C. (2022). Quiz Design Task: Helping Teachers Create Quizzes with Automated Question Generation. In Findings of the Association for Computational Linguistics: NAACL 2022 (pp. 102-111). Seattle, United States: Association for Computational Linguistics. https://aclanthology.org/2022.findings-naacl.9/
[2]. Kwan, C.C.L. (2024). Exploring ChatGPT-Generated Assessment Scripts of Probability and Engineering Statistics from Bloom's Taxonomy. In S.K.S. Cheung, F.L. Wang, N. Paoprasert, P. Charnsethikul, K.C. Li, K. Phusavat (Eds.), Technology in Education. Innovative Practices for the New Normal. ICTE 2023. Communications in Computer and Information Science (vol. 1974, pp. 275-286). Singapore: Springer. https://doi.org/10.1007/978-981-99-8255-4_24
[3]. Bharatha, A., Ojeh, N., Rabbi, A.M.F., Campbell, M.H., Krishnamurthy, K., Layne-Yarde, R.N., Kumar, A., Springer, D.C.R., Connell, K.L., & Majumder, M.A.A. (2024). Comparing the Performance of ChatGPT-4 and Medical Students on MCQs at Various Levels of Bloom's Taxonomy. Advances in Medical Education and Practice, 15, 393-400. https://doi.org/10.2147/AMEP.S457408
[4]. Herrmann-Werner, A., Festl-Wietek, T., Holderried, F., Herschbach, L., Griewatz, J., Masters, K., Zipfel, S., & Mahling, M. (2024). Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: A Mixed-Methods Study. Journal of Medical Internet Research, 26, e52113. https://doi.org/10.2196/52113
[5]. Aboalela, R.A. (2023). ChatGPT for generating questions and assessments based on accreditations. In ACITY 13th International Conference on Advances in Computing and Information Technology (pp. 1-12). https://arxiv.org/abs/2312.00047
[6]. Agarwal, M., Goswami, A., & Sharma, P. (2023). Evaluating ChatGPT-3.5 and Claude-2 in Answering and Explaining Conceptual Medical Physiology Multiple-Choice Questions. Cureus, 15(9), e46222. https://doi.org/10.7759/cureus.46222
[7]. Brame, C. (2013). Writing good multiple choice test questions. Retrieved from https://cft.vanderbilt.edu/guides-sub-pages/writing-good-multiple-choice-test-questions/
[8]. Haladyna, T.M., Downing, S.M., & Rodriguez, M.C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334. https://doi.org/10.1207/S15324818AME1503_5
[9]. Amatriain, X. (2024). Prompt Design and Engineering: Introduction and Advanced Methods. ArXiv. https://arxiv.org/abs/2401.14423
[10]. Tran, A., Angelikas, K., Rama, E., Okechukwu, C., Smith, D., & Macneil, S. (2023). Generating Multiple Choice Questions for Computing Courses Using Large Language Models. In IEEE Frontiers in Education Conference (pp. 1-8). https://bit.ly/3AE4YOc