<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Kyiv, Ukraine
Corresponding author.
These authors contributed equally.
oleg.ilarionov@knu.ua (O. Ilarionov); hanna.krasovska@knu.ua (H. Krasovska); irinadomanetskaya@gmail.com
(I. Domanetska); elvenff@gmail.com (O. Fedusenko)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Features of the practical use of LLMs for generating quizzes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oleh Ilarionov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanna Krasovska</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iryna Domanetska</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olena Fedusenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>Volodymyrs'ka str. 64/13, Kyiv, 01601</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>The article explores the possibilities of using large language models (LLMs) such as GPT, Claude, Copilot and Gemini for automated test task generation in the field of education. The ability of these models to generate different types of tasks, including multiple-choice, open-ended and fill-in-the-blank, as well as their compliance with educational standards and cognitive levels according to Bloom's taxonomy, is assessed. A comparative analysis of the quality of the generated tests in terms of complexity, structure and adaptability is carried out. The limitations of the models for generating tasks of higher cognitive levels are identified, and recommendations for their integration into educational platforms are given. The results of the study can improve the process of assessing students' knowledge and promote the development of adaptive learning.</p>
      </abstract>
      <kwd-group>
        <kwd>LLM</kwd>
        <kwd>test generation</kwd>
        <kwd>Bloom's taxonomy</kwd>
        <kwd>educational standards</kwd>
        <kwd>test automation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Thanks to their ability to create new content from existing data, generative AI models are opening up new opportunities across many industries, from business to art, science and education, automating routine tasks and speeding up processes that used to require substantial time and creative effort.</p>
      <p>The impact of generative AI is becoming increasingly visible in the educational environment.
Generative AI tools can be used for a variety of educational purposes, making the educational process
more individualised, adaptive and efficient, providing access to education for a wider range of people,
including those with disabilities.</p>
      <p>Already, generative AI models are helping teachers generate multimodal teaching materials, adjust lesson plans, select relevant literature, and create tasks, scenarios, or simulations that help students develop analytical and research skills.</p>
      <p>It should be noted that one of the key elements of the educational process is quality control of students' knowledge and skills, as it allows not only assessing the level of learning but also identifying gaps and improving teaching methods. Control provides feedback between teachers and students, encouraging the latter to study more actively and develop themselves.</p>
      <p>In recent years, one of the most common methods of assessing students' knowledge in modern
education has been the test form of control, which has a number of significant advantages: objectivity
of assessment, speed and convenience of testing, coverage of a large amount of material, variety of
task formats, possibility of analysing statistics, transparency and clarity, standardisation of
assessment, adaptability, etc. However, testing, although an effective control method, has its
limitations: tests cannot always adequately assess the depth of understanding of the material or
practical skills of students. Therefore, it is important that test tasks are well thought out, as their
quality directly affects the results of control.</p>
      <p>Recently, special attention has been paid to LLMs (Large Language Models), which are designed
to process and generate texts and can solve various tasks: translation, text creation, emotion analysis,
answering questions, etc., and have great potential in the field of test task automation. They can
greatly simplify the work of teachers, create adaptive, diverse and personalised tests, which improves
the quality of student knowledge control. LLMs can create tasks of various formats, such as closed-ended questions with one or more correct answers; matching questions; fill-in-the-blank tasks; open-ended questions; and tests of logical thinking and analysis. LLMs can not only generate questions but also provide detailed explanations for correct and incorrect answers, which allows students to better understand their mistakes and improves their learning.</p>
      <p>However, the question arises as to how effectively different LLMs generate such test tasks and
whether the tests they create meet generally accepted pedagogical standards.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature review</title>
      <sec id="sec-2-1">
        <title>2.1. Opportunities for LLM in education</title>
        <p>
          Large language models (LLMs) offer great prospects for improving the efficiency of the educational process, in particular through the automation of tasks that previously required significant time and intellectual resources. Research shows that LLMs can adapt to different learning contexts by automating the development of test tasks, personalising educational materials, and improving access to knowledge for students with different learning needs [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          The GPT, Claude, Copilot and Gemini models can generate both simple and complex test items, including multiple-choice, matching and open-ended questions. Studies of GPT-4 have shown that this model has a high ability to adapt the complexity of tasks, which allows teachers to create questions of different cognitive levels according to Bloom's taxonomy [
          <xref ref-type="bibr" rid="ref5 ref6">5,6</xref>
          ]. Claude, on the other hand, demonstrates strengths in ethical and safe content generation, which is especially important in educational environments focused on preventing bias and harmful materials [
          <xref ref-type="bibr" rid="ref5 ref6">5,6</xref>
          ].
        </p>
        <p>
          Automation of the creation of training materials and tests using LLM reduces the workload of
teachers, freeing up time to work on individual student support. Another important aspect is the
ability to create adaptive tasks that adjust to the level of knowledge and learning pace of each
participant. This helps to increase motivation to learn, as students receive immediate feedback and
can identify gaps in their knowledge at the early stages of learning [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>
          Despite these advantages, the issue of integrating LLMs into learning platforms remains relevant.
Teachers need to learn how to properly formulate queries to the models to ensure the relevance of
the results obtained. Researchers also draw attention to the limitations of free versions of LLMs,
which may restrict their use in educational institutions, especially when processing large amounts
of textual data or graphical content [
          <xref ref-type="bibr" rid="ref5 ref8">5,8</xref>
          ]. However, the prospects for the development of these
technologies, in particular in terms of improving the accuracy and reliability of models, open up new
horizons for innovation in education.
        </p>
        <p>Thus, the use of LLMs in education allows for an integrated approach to learning, combining the
automation of routine processes with the increased individualisation of educational experience. This
helps to create conditions for more effective knowledge control and the development of students'
analytical skills, which is critical for modern education.</p>
        <p>
          A query optimisation algorithm was used to ensure the relevance and quality of the received
tasks. It included several key stages: defining the goal, forming a role for the LLM, checking for key
details in the request, and eliminating ambiguities before sending it (Figure 1) [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Analysing the correspondence of tests to Bloom's taxonomy</title>
        <p>
          Bloom's Taxonomy is a widely recognised tool for assessing the level of cognitive complexity of
learning tasks, and its use in knowledge testing provides structure and consistency in testing
different levels of understanding. The taxonomy divides cognitive processes into six levels:
memorising, understanding, applying, analysing, evaluating and creating [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. To assess whether tests
generated by large language models (LLMs) meet the standards of Bloom's Taxonomy, it is
important to consider how well the test questions cover these cognitive levels and whether they
promote critical thinking and analytical skills.
        </p>
        <p>
          Research on GPT-4 has shown that this model is capable of generating questions that correspond
to different levels of Bloom's Taxonomy, including questions that involve basic memorisation of
information (e.g., definitions of concepts or terms), as well as more complex analysis and synthesis
tasks that require a deeper understanding of the topic [
          <xref ref-type="bibr" rid="ref5 ref7">5,7</xref>
          ]. For example, GPT-4 can generate
questions that require applying knowledge in new contexts, such as problem solving or comparing
concepts that belong to the application and analysis level of Bloom's Taxonomy.
        </p>
        <p>
          The Claude model also demonstrates the ability to generate questions of different cognitive levels,
but its focus is mainly on basic-level tasks such as memorisation and comprehension. The analysed
tests generated by Claude show a tendency to create questions that require students to reproduce
factual information or explain simple concepts, with less attention to tasks that require evaluation
or the creation of new solutions [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          An important characteristic of LLM-generated tests is their ability to adapt the level of difficulty
to different levels of student knowledge. For example, the GPT and Copilot models can generate
adaptive tests that match both basic and advanced cognitive levels. This allows teachers to create
tests that gradually increase the level of difficulty, starting with simple memorisation questions and
ending with analysis and evaluation tasks that require deeper processing of the material [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>
          It is important to note, however, that the ability of models to generate questions at the highest
cognitive levels, such as creating new concepts or evaluating solutions, is still limited. For example,
only a few models, such as GPT-4, are able to effectively formulate tasks that include elements of
synthesis and critical evaluation, while other models, such as Claude and Gemini, mainly focus on
lower cognitive level tasks [
          <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
          ].
        </p>
        <p>Thus, the analysis of the correspondence of LLM-generated tests to Bloom's taxonomy shows that
these models are able to cover different cognitive levels, but the level of complexity and variety of
tasks vary depending on the specific model. The use of LLMs to create tests opens up prospects for
flexible adaptation of learning tasks, which contributes to improving the quality of student
assessment.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Limitations of models in task generation</title>
        <p>Despite its considerable potential, the use of large language models (LLMs) for generating test items
has a number of limitations that should be taken into account when implementing them in the
educational process. These limitations can affect the quality of the created tests, as well as the
effectiveness of their use in different learning contexts. The main challenges relate to both technical
aspects of the models and pedagogical limitations.</p>
        <p>
          1. Limitation of context size and task types. LLMs, such as GPT-4, Claude or Gemini, process text
within a predefined context size, which limits their ability to generate tasks based on large learning
materials. For example, when dealing with long lecture materials, the model may lose relevant details
or create incomplete questions that do not cover all the necessary information [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Some models, such
as Copilot and Gemini, further limit text processing in the free versions, forcing users to manually
copy content into queries, which reduces their usability [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          2. Problems with the validity and relevance of questions. Generated tasks do not always fully
meet the learning objectives and may not cover the full range of competencies required by the
educational programme. Such tasks often have reduced construct validity, as models can formulate
questions without taking into account a deep understanding of the subject area or course specifics [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
In addition, the possibility of randomly guessing the correct answer is especially relevant for
multiple-choice tasks if the answer options are not sufficiently differentiated or plausible.
        </p>
        <p>
          3. Lack of specialisation and domain knowledge. LLMs are able to generate tasks from various
disciplines, but their performance may decrease in specific subject areas that require expert
knowledge. For example, when creating tasks in programming or medicine, models can generate
questions that contain inaccuracies or do not match the level of difficulty of the course [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Claude
and Copilot demonstrate limited ability to create high cognitive level tasks, such as evaluation and
new solution creation, which may reduce their effectiveness for advanced courses.
        </p>
        <p>
          4. Ethical and methodological limitations. Another important issue is the risk of bias in the
questions, as LLMs are trained on large amounts of data that may contain cultural or gender
stereotypes. Claude has built-in mechanisms to minimise such risks, but it is not always possible to
completely avoid bias [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. In addition, the tasks created by LLMs may not take into account the
individual needs of students with special educational needs, which can limit the accessibility and
fairness of testing.
        </p>
        <p>
          5. Technical limitations and integration. Integration of LLMs into learning platforms can also be
difficult due to technical limitations, such as limited access to free versions of models or the
complexity of setting up APIs for automatic task generation. Teachers may need to undergo
additional training to generate high-quality queries, which increases the complexity of using these
tools in everyday practice [
          <xref ref-type="bibr" rid="ref6 ref8">6, 8</xref>
          ].
        </p>
        <p>Thus, the limitations of LLM in generating tasks are related to both their technical characteristics
and pedagogical aspects. Nevertheless, the development of new algorithms and the improvement of
models open up opportunities to overcome these challenges in the future.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Objective</title>
      <p>General-purpose LLMs were chosen for the study: GPT-4 (OpenAI - chatgpt.com); Claude
(Anthropic - claude.ai); Copilot (Microsoft - copilot.cloud.microsoft) and Gemini (Google DeepMind
- gemini.google.com), which are available in free versions. These models were chosen due to their
availability, popularity, and ability to generate different types of tasks (e.g., multiple choice
questions, open-ended questions, matching tasks). This decision allowed us to focus on the
possibilities of using models that do not require additional hardware or payment costs, which is an
essential factor for the widespread introduction of such tools in educational institutions. The use of
available models allows us to evaluate their potential for automating the creation of test tasks
without the need for significant investment in resources.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Research models and tools</title>
        <p>General-purpose LLMs in their free versions were chosen for the study: GPT-4 (OpenAI - chatgpt.com); Claude (Anthropic - claude.ai); Copilot (Microsoft - copilot.cloud.microsoft) and
Gemini (Google DeepMind - gemini.google.com). These models were chosen for their availability,
popularity, and ability to generate different types of tasks. This allowed us to focus on analysing the
possibilities of using the models and assessing their potential for automated test task creation
without the need for monthly fees or investments in additional hardware or other resources, which
is an essential factor for the widespread introduction of such tools in the educational process of
educational institutions.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Selecting learning content for test task generation</title>
        <p>To generate test tasks, a fragment of lecture material in Ukrainian on the discipline "Technology of creating software products", intended for 2nd-year students of the speciality "Computer Science", was used. The text document had the following parameters:</p>
        <list list-type="bullet">
          <list-item><p>Characters without spaces: 9710;</p></list-item>
          <list-item><p>Word count: 1523;</p></list-item>
          <list-item><p>Format: PDF, size 361 KB.</p></list-item>
        </list>
        <p>The content covered the basic concepts of UML use case diagrams, which provides sufficient depth to test the models' ability to generate questions at different cognitive levels.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. The procedure for creating and optimising queries</title>
        <p>To ensure the relevance and quality of the tasks received, we used a query optimisation algorithm (Figure 1). The following key steps were taken into account before sending the request:</p>
        <list list-type="bullet">
          <list-item><p>Objective: to receive 20 test questions that cover the entire content of the lecture and correspond to Bloom's taxonomy.</p></list-item>
          <list-item><p>Formation of a role for the LLM: the models were set up as a "virtual teacher", able to explain the material and create questions based on the reading.</p></list-item>
          <list-item><p>Checking for clarity and consistency: the request was checked for logical errors and ambiguities before being sent.</p></list-item>
        </list>
        <p>The procedure for creating and optimising queries was formed on the basis of the algorithm (Fig.
1) to obtain relevant test tasks for the downloaded lecture fragment. An example of an optimised
query is shown in Figure 2.</p>
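        <p>The query-optimisation steps above can be sketched as a small prompt-assembly routine. This is an illustrative sketch only, not the authors' actual tooling; the template wording and the build_quiz_prompt helper are assumptions.</p>

```python
# Illustrative sketch of the query-optimisation steps: define the goal,
# assign the model a role, and check the request for missing details
# before sending it. The template wording is hypothetical.

def build_quiz_prompt(lecture_text, num_questions=20):
    role = "You are a virtual teacher who explains material and writes quizzes."
    goal = (f"Create {num_questions} test questions that cover the entire "
            "lecture below and correspond to Bloom's taxonomy.")
    prompt = "\n\n".join([role, goal, "LECTURE:", lecture_text])
    # Clarity/consistency check: refuse to send a request missing key details.
    for required in ("Bloom", "LECTURE:", str(num_questions)):
        assert required in prompt, f"missing detail: {required}"
    return prompt
```

        <p>In practice, such a prompt would then be submitted to the chosen model's chat interface or API.</p>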
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Assessing the quality of generated tasks</title>
        <p>Several criteria were used to assess the quality of the tasks:</p>
        <list list-type="order">
          <list-item><p>Compliance with Bloom's Taxonomy: whether the questions corresponded to different cognitive levels, from memorisation to creation.</p></list-item>
          <list-item><p>Structure and clarity: the clarity of wording and the presence of explanations for correct and incorrect answers.</p></list-item>
          <list-item><p>Variety of task types: the ability of the models to generate questions of different formats (multiple choice, open-ended, etc.).</p></list-item>
          <list-item><p>Completion time: how quickly students could complete the test within the given time.</p></list-item>
          <list-item><p>Validity and discriminative power: expert analysis of the extent to which the tasks meet the educational objectives and discriminate between students with different levels of knowledge.</p></list-item>
        </list>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Collecting and comparing data</title>
        <p>For each model, 20 test questions were generated based on the same training content. The generated questions were compared by the following parameters:</p>
        <list list-type="bullet">
          <list-item><p>Average length of the question and justification (in characters);</p></list-item>
          <list-item><p>The median number of words in a question;</p></list-item>
          <list-item><p>The level of difficulty of the questions according to Bloom's taxonomy.</p></list-item>
        </list>
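        <p>The first two comparison parameters can be computed directly; a minimal sketch, with sample questions invented for illustration:</p>

```python
# Average question length in characters and median word count per question.
# The sample questions are invented for illustration.
from statistics import mean, median

questions = [
    "What is a use case diagram?",
    "Name the main elements of a UML use case diagram.",
    "How does an actor differ from a system boundary?",
]

avg_length_chars = mean(len(q) for q in questions)
median_words = median(len(q.split()) for q in questions)
```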
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Analysis of reliability and practical limitations</title>
        <p>After the tasks were generated, an expert analysis was conducted to ensure that they met the learning
objectives and were clear to students. The usability of each model in the educational process was
also taken into account. Particular attention was paid to technical limitations, such as the amount of
text to be processed and the ability to integrate models into existing testing platforms.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Data and analysis</title>
      <sec id="sec-5-1">
        <title>5.1. Comparative analysis of model capabilities</title>
        <p>To compare the capabilities of the GPT-4 (chatgpt.com), Claude, Copilot and Gemini models, 20 test
tasks were generated on the basis of the same training content. The models were evaluated according
to the following parameters: the number of aspects of the topic covered, compliance with Bloom's
taxonomy, the variety of task formats, and the level of question complexity.</p>
        <p>Table 1 summarises the ability of the models to process text and graphic information in the free
mode. As the table shows, only GPT-4 and Claude can process uploaded text files, while Copilot and
Gemini require manual text input.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Variety of test item formats</title>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Assessment of the quality and complexity of tasks</title>
        <p>A comparison of the quality of questions generated by GPT-4 and Claude revealed a difference in
the level of difficulty and depth of topic coverage. GPT-4 demonstrated a tendency to create higher
cognitive level questions, including analysis and synthesis, while Claude focuses on memorisation
and comprehension. The following parameters were chosen to assess the quality of the generated
tasks: the average length of the question and explanation were used as indicators of the
structuredness and level of detail of the answers, which affects the clarity and completeness of the
information provided to students; and the median value of the number of words in the question was
chosen to assess the conciseness and clarity of the wording. The details of the selected parameters
are presented in Table 3 below.</p>
        <p>Figure 4 shows an example of a test question created by Claude that illustrates a different
approach to question generation and level of detail.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Correspondence of tasks to Bloom's taxonomy</title>
        <p>The analysis of the generated tasks showed that GPT-4 covers all levels of Bloom's Taxonomy,
including the highest levels - evaluation and creation. Claude, on the other hand, mostly generates
questions on the basic levels (knowledge and understanding). This shows that GPT-4 is more flexible
in generating tasks for different learning contexts (Figure 5).</p>
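        <p>One rough way to tag a generated question with a Bloom level is a verb-keyword heuristic. The verb lists and the classify_bloom helper below are illustrative assumptions, not the analysis method used in this study, which relied on expert review.</p>

```python
# Hypothetical verb-keyword heuristic for assigning a Bloom's-taxonomy
# level to a question; real assessments require expert judgement.
BLOOM_VERBS = {
    "remember": ("define", "list", "name", "what is"),
    "understand": ("explain", "describe", "summarise"),
    "apply": ("use", "apply", "solve"),
    "analyse": ("compare", "differ", "classify"),
    "evaluate": ("justify", "assess", "which is better"),
    "create": ("design", "propose", "construct"),
}

def classify_bloom(question):
    q = question.lower()
    # Scan from the highest level down so richer tasks win ties.
    for level in ("create", "evaluate", "analyse", "apply", "understand", "remember"):
        if any(verb in q for verb in BLOOM_VERBS[level]):
            return level
    return "unclassified"
```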
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Additional features of the models</title>
        <p>GPT-4 provides more advanced functionality, adding justifications to its answers.
Claude is less detailed in explaining correct and incorrect options, which may limit its effectiveness
for training purposes that require in-depth feedback.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.6. Analysing the validity of generated tasks</title>
        <p>A comprehensive approach was used to assess the validity of the tasks, covering several key aspects:
content validity, construct validity, clarity of wording, relevance of answer options, and absence of
ambiguity.</p>
        <p>Each criterion was assessed by experts on a five-point scale (1 - very low validity, 5 - very high
validity). Leading academic staff with many years of teaching experience in computer science
disciplines and practical experience in applying object-oriented analysis and design using UML were
involved as experts in the quality assessment of the generated tasks. The scores for the GPT-4 and
Claude models are presented in Table 4.</p>
        <p>The analysis showed that the Claude model received higher scores for construct validity and lack of ambiguity, which indicates the clarity and relevance of its questions. At the same time, GPT-4
demonstrated flexibility in generating tasks at different cognitive levels, although some of them may
have minor ambiguities.</p>
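        <p>The five-criterion expert assessment can be aggregated into an overall score per model; a minimal sketch with invented ratings that do not reproduce Table 4:</p>

```python
# Aggregating expert validity ratings (1-5 scale) over the five criteria
# into a mean score per model. The numbers below are invented.
from statistics import mean

criteria = ("content validity", "construct validity", "clarity",
            "relevance of options", "absence of ambiguity")

ratings = {
    "GPT-4":  {"content validity": 5, "construct validity": 4, "clarity": 4,
               "relevance of options": 4, "absence of ambiguity": 3},
    "Claude": {"content validity": 4, "construct validity": 5, "clarity": 4,
               "relevance of options": 4, "absence of ambiguity": 5},
}

overall = {model: mean(scores[c] for c in criteria)
           for model, scores in ratings.items()}
```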
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This study demonstrates the high potential of large language models (LLMs) for automating
the creation of test tasks in the educational process. However, the results also revealed a difference
in the capabilities and effectiveness of different models. GPT-4 and Claude have shown high
performance in generating tasks, but each of them has its own advantages and limitations that affect
their application.</p>
      <p>GPT-4 has demonstrated the greatest flexibility in creating tasks of different formats and at
different cognitive levels, according to Bloom's Taxonomy. Its ability to generate complex questions,
including those requiring analysis and synthesis, makes this model suitable for use in curricula
focused on the development of analytical thinking. At the same time, GPT-4 revealed some
shortcomings related to possible ambiguities in the questions, which requires additional verification
by teachers.</p>
      <p>Claude, in turn, received the highest scores for construct validity and clarity of task wording. This
indicates its effectiveness in creating questions of basic and medium difficulty. However, this model
demonstrated a limited ability to formulate tasks of the highest cognitive levels (synthesis and
evaluation), which may reduce its effectiveness for advanced courses.</p>
      <p>Copilot and Gemini are less versatile than GPT-4 and Claude, in part because of the limited
number of available task formats in the free mode. However, these models can be useful for highly
specialised tasks, such as programming testing or visual element integration.</p>
      <p>The study also revealed that the correct formulation of queries is an important factor in obtaining
relevant answers from models. Teachers need to take into account both the limitations of the models
(for example, the amount of text being processed) and the peculiarities of generating questions at
different cognitive levels.</p>
      <p>Thus, the use of LLMs to create test tasks is a promising area of educational technology
development. It is important to note that the study used commonly used models in free versions that
do not require specialised hardware. This demonstrates that automated test task creation can be
affordable for educational institutions with limited resources, as well as for teachers who want to
use modern technologies without additional costs. The choice of a particular model should be based
on the purpose of the test and the level of complexity of the tasks.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Philippe Laban, Chien-Sheng Wu, Lidiya Murakhovs'ka, Wenhao Liu, and Caiming Xiong (2022). Quiz Design Task: Helping Teachers Create Quizzes with Automated Question Generation. In Findings of the Association for Computational Linguistics: NAACL 2022 (pp. 102-111). Seattle, United States: Association for Computational Linguistics. https://aclanthology.org/2022.findings-naacl.9/</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Kwan, C.C.L. (2024). Exploring ChatGPT-Generated Assessment Scripts of Probability and Engineering Statistics from Bloom's Taxonomy. In S.K.S. Cheung, F.L. Wang, N. Paoprasert, P. Charnsethikul, K.C. Li, K. Phusavat (Eds.), Technology in Education. Innovative Practices for the New Normal. ICTE 2023. Communications in Computer and Information Science (vol. 1974, pp. 275-286). Singapore: Springer. https://doi.org/10.1007/978-981-99-8255-4_24</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] <string-name><surname>Bharatha</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Ojeh</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Rabbi</surname>, <given-names>A.M.F.</given-names></string-name>, <string-name><surname>Campbell</surname>, <given-names>M.H.</given-names></string-name>, <string-name><surname>Krishnamurthy</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Layne-Yarde</surname>, <given-names>R.N.</given-names></string-name>, <string-name><surname>Kumar</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Springer</surname>, <given-names>D.C.R.</given-names></string-name>, <string-name><surname>Connell</surname>, <given-names>K.L.</given-names></string-name>, &amp; <string-name><surname>Majumder</surname>, <given-names>M.A.A.</given-names></string-name> (<year>2024</year>). <article-title>Comparing the Performance of ChatGPT-4 and Medical Students on MCQs at Various Levels of Bloom's Taxonomy</article-title>. <source>Advances in Medical Education and Practice</source>, <volume>15</volume>, <fpage>393</fpage>-<lpage>400</lpage>. https://doi.org/10.2147/AMEP.S457408
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] <string-name><surname>Herrmann-Werner</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Festl-Wietek</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Holderried</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Herschbach</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Griewatz</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Masters</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Zipfel</surname>, <given-names>S.</given-names></string-name>, &amp; <string-name><surname>Mahling</surname>, <given-names>M.</given-names></string-name> (<year>2024</year>). <article-title>Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: A Mixed-Methods Study</article-title>. <source>Journal of Medical Internet Research</source>, <volume>26</volume>, e52113. https://doi.org/10.2196/52113
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] <string-name><surname>Aboalela</surname>, <given-names>R.A.</given-names></string-name> (<year>2023</year>). <article-title>ChatGPT for generating questions and assessments based on accreditations</article-title>. In <source>ACITY 13th International Conference on Advances in Computing and Information Technology</source> (pp. <fpage>1</fpage>-<lpage>12</lpage>). https://arxiv.org/abs/2312.00047
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] <string-name><surname>Agarwal</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Goswami</surname>, <given-names>A.</given-names></string-name>, &amp; <string-name><surname>Sharma</surname>, <given-names>P.</given-names></string-name> (<year>2023</year>, September 29). <article-title>Evaluating ChatGPT-3.5 and Claude-2 in Answering and Explaining Conceptual Medical Physiology Multiple-Choice Questions</article-title>. <source>Cureus</source>, <volume>15</volume>(<issue>9</issue>), e46222. https://doi.org/10.7759/cureus.46222
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] <string-name><surname>Brame</surname>, <given-names>C.</given-names></string-name> (<year>2013</year>). <article-title>Writing good multiple choice test questions</article-title>. Retrieved from https://cft.vanderbilt.edu/guides-sub-pages/writing-good-multiple-choice-test-questions/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] <string-name><surname>Haladyna</surname>, <given-names>T.M.</given-names></string-name>, <string-name><surname>Downing</surname>, <given-names>S.M.</given-names></string-name>, &amp; <string-name><surname>Rodriguez</surname>, <given-names>M.C.</given-names></string-name> (<year>2002</year>). <article-title>A review of multiple-choice item-writing guidelines for classroom assessment</article-title>. <source>Applied Measurement in Education</source>, <volume>15</volume>(<issue>3</issue>), <fpage>309</fpage>-<lpage>334</lpage>. https://doi.org/10.1207/S15324818AME1503_5
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] <string-name><surname>Amatriain</surname>, <given-names>X.</given-names></string-name> (<year>2024</year>). <article-title>Prompt Design and Engineering: Introduction and Advanced Methods</article-title>. ArXiv. https://arxiv.org/abs/2401.14423
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] <string-name><surname>Tran</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Angelikas</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Rama</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Okechukwu</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Smith</surname>, <given-names>D.</given-names></string-name>, &amp; <string-name><surname>MacNeil</surname>, <given-names>S.</given-names></string-name> (<year>2023</year>). <article-title>Generating Multiple Choice Questions for Computing Courses Using Large Language Models</article-title>. In <source>IEEE Frontiers in Education Conference</source> (pp. <fpage>1</fpage>-<lpage>8</lpage>). https://bit.ly/3AE4YOc
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>