<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging LLMs for Adaptive Testing and Learning in Taiwan Adaptive Learning Platform (TALP)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bor-Chen Kuo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frederic T. Y. Chang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zong-En Bai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Empowering Education with LLMs - the Next-Gen Interface and Content Generation</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LLMs</institution>
          ,
          <addr-line>Adaptive learning, Chatbot, Learning platform, GPT</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National Taichung University of Education</institution>
          ,
          <addr-line>140 Minsheng Rd., West Dist., Taichung City, 403514</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Artificial Intelligence (AI) and Large Language Models (LLMs) have gained prominence in the educational context, revolutionizing various aspects of teaching and learning. This study focuses on the feasibility of integrating LLMs into the Taiwan Adaptive Learning Platform (TALP) to improve its current adaptive mechanism and enhance the learning experience of students. Through an in-depth exploration, the study identifies several potential benefits of incorporating LLMs into TALP. Firstly, by harnessing the power of LLMs and combining them with the existing knowledge structure in TALP, qualitative responses from open-ended questions can be analyzed more effectively. This enables a more precise assessment of students' understanding and significantly reduces the number of unnecessary testing items, saving valuable time and resources. Additionally, the integration of a chatbot into TALP's diagnostic report provides an innovative approach for scaffolding during remediation. The chatbot can engage in Socratic interactions with students, guiding them through the learning process and addressing misconceptions in real-time. This personalized support fosters a deeper understanding of the material and facilitates more effective remediation. Furthermore, the study highlights the potential of LLMs in detecting and addressing individual learning weaknesses. By leveraging the deep interaction capabilities of LLMs, TALP can analyze student responses and identify cross-grade misconceptions more efficiently. This study also provides examples of how GPT-3.5 can be applied for the above purposes. Finally, the implementation of LLMs in TALP also presents challenges, which are discussed. In conclusion, integrating LLMs into TALP holds great potential to enhance its adaptive mechanism, provide personalized learning experiences, and address individual learning weaknesses.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the introduction of Large Language Models (LLMs), particularly ChatGPT, Artificial
Intelligence (AI) has become increasingly involved in the educational context. Several studies
have sought to apply LLMs in education for various purposes, including tutoring, homework
assistance, language learning, writing aid, personalized learning, and interactive learning [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ].
Currently, it is difficult to predict whether LLMs like ChatGPT, or their future iterations, will fully
replace teachers. However, we are more interested in exploring how the application of LLMs can
enhance the effectiveness of current educational tools. Serving 2.8 million registered users from
grades 1 to 12, the Taiwan Adaptive Learning Platform (TALP) is the official learning platform of
the Ministry of Education (MOE) in Taiwan. A unique feature of TALP is its use of AI to provide
individual learning paths for personalized learning. According to a large-scale survey conducted
by the MOE of Taiwan, this platform has been highly effective in enhancing students' academic
achievement and promoting self-regulated learning [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In this study, we aim to explore the
feasibility of introducing LLMs to TALP and investigate if such implementation can enhance
TALP's existing adaptive mechanism.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. The application of LLMs in TALP</title>
      <p>In the following four sections, our study will delve into the existing adaptive mechanism of TALP.
Subsequently, we will propose the implementation of LLM technology to augment the efficacy of
adaptive testing and learning within TALP. We will also present some examples of how LLMs
could be applied in TALP to facilitate learning. Finally, we will discuss potential obstacles that
might arise during the integration of LLMs into TALP.</p>
      <sec id="sec-2-1">
        <title>2.1 . The Current adaptive mechanism in TALP</title>
        <p>The conceptual framework of the adaptive mechanism in TALP involves two main steps: applying
adaptive tests to diagnose learning weaknesses, and offering an individual learning path based
on the diagnosis to remedy learning mistakes. The current adaptive testing in TALP applies
rulebased AI technology, which is guided by the responses of test-takers in multiple-choice items. For
example, as shown in Figure 1(a), the testing system will select a question related to the
highestlevel concept (A) from the question database for the test taker to answer. If the test taker answers
question A incorrectly, the testing system, based on the rule-base, will then select questions
related to the lower-level concepts (B and C) for further testing. If the test taker answers question
B correctly, it is then predicted that they would answer the sub-concepts of B (D and E) correctly
as well, so there is no need for them to answer these questions. However, if the test taker answers
question C incorrectly, the system will subsequently present questions related to the
subconcepts of C (F, G, and H) for the test taker to answer.</p>
        <p>As highlighted by Wu, Kuo, &amp; Yang (2012), employing AI algorithms that incorporate
knowledge structure with ordering theory in diagnostic tests offers several advantages,
including: 1. tracing learning paths across students, 2. visualizing learning paths, and 3.
eliminating unnecessary test items during diagnosis. Wu, Kuo &amp; Wang (2017) have demonstrated
that the high effectiveness and efficiency of knowledge structure can increase the accuracy of
identifying learning weaknesses by up to 90%, while simultaneously reducing up to 80% of
unnecessary items during testing.</p>
        <p>The knowledge structure in TALP, as shown in Figure 2(a), resembles a sky map composed of
knowledge nodes. When students complete an adaptive diagnostic test in TALP, the results are
reflected in the color of the nodes and sub-nodes in the knowledge structure. Nodes colored green
indicate that students have mastered the skills, while those in orange reveal areas the students
have yet to master. The individual learning path is plotted by connecting the orange nodes in the
knowledge structure, as shown in Figure 2(a). Each subskill within TALP includes an instructional
video, in-video quizzes, exercises, and dynamic assessments aimed at correcting mistakes. Once
students competently complete watching the videos and pass the tests, the color of the nodes
turns green. The learning path can also be converted into a diagnostic report as illustrated in
Figure 2(b). This report not only indicates progress along the learning path but also displays the
percentage of completion for the instructional video, quizzes, exercises, and dynamic assessment.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Elevating the efficiency in diagnosing learning weakness by LLMs</title>
        <p>
          In the past, multiple-choice items were the preferred format for computer-aided testing due to
their straightforward nature (right or wrong). Evaluating open-ended questions, which provide
more qualitative and richly informative responses, posed a significant challenge due to the
limitations of computer technology [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The advantage of LLMs is their ability to offer a service
for automated response analysis, which can examine and evaluate the responses to open-ended
questions from test-takers. Open-ended questions can reveal arithmetic processes, providing rich
information for LLMs to directly identify misconceptions. For example, if concept A in Figure 1
involves the four arithmetic operations, concept B refers to arithmetic operations involving
addition and subtraction, and concept C indicates arithmetic operations in multiplication and
division. In the current TALP system, if the test taker answers a question related to concept A
incorrectly, the system will provide two items related to concepts B and C, respectively. Due to
the nature of open-ended questions, the answer includes the calculation process, which can
effectively demonstrate the level of mastery in arithmetic. However, it's important to consider
scenarios where a student excels in addition and subtraction but struggles with multiplication
and division, as illustrated in Figure 1(b). TALP plans to incorporate LLM technology into
adaptive testing, enabling it to assess students' responses to open-ended questions similarly to
how a teacher would evaluate them. In the scenario depicted in Figure 1, the utilization of LLMs
in the TALP adaptive testing system has the potential to save an additional two items.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Better Scaffolding and diagnosis in remediation by LLMs</title>
      <p>
        Although the current learning resources in TALP are abundant, with instructional videos and the
assessment module to remediate learning weaknesses in the diagnostic report, the importance of
social processes in learning should be addressed. As identified in Vygotsky’s sociocultural theory,
learning is essentially a social process; guidance from teachers or collaboration with peers is vital
[
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ]. Many researchers have endeavored to simulate tutors using technology, such as Intelligent
Tutoring Systems (ITS)[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, the level of engagement and feedback provided by these
systems has remained unsatisfactory. The current diagnostic report of TALP can list the desired
learning material for an individual, but it cannot provide instant feedback anytime during
remediation.
      </p>
      <p>
        Significant improvements were not realized until the advent of Large Language Models (LLMs).
Some researchers have utilized BERT to solve mathematical problems and attempted to provide
students with feedback by assessing their responses. In the realm of automatic item generation,
researchers historically relied on item templates [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], but with the introduction of LLMs, some
have begun to generate items based on students’ responses to test questions [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. While early
LLMs may have made some progress in these areas, it is noteworthy that no LLMs were able to
integrate the above tasks, especially in the context of Traditional Chinese. However, with the
advent of GPT-3.5, it has become feasible to implement a chatbot in TALP's diagnostic report that
can interact with students while addressing their learning weaknesses.
      </p>
      <p>
        Figure 3 illustrates how the TALP system utilizes GPT, providing it with pertinent remedial
information. These prompts are critical to successfully diagnosing mathematical problems,
interacting with students, offering feedback and instructions, as well as generating test items. In
our design, we employ the framework presented in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for our prompts, which encompasses a
cognitive model, an item model, and an instructional procedure. The cognitive model represents
the knowledge structure along with its associated notes. Meanwhile, the item model refers to the
test items, and the instructional procedure outlines how GPT will engage with students. The
above prompt aims to help GPT understand students more and provide feedback with better
quality. In the planned diagnostic report of TALP, chatbot for remediation is optional. Once
students use chatbot for learning, TALP will open a chat box where students can do their quizzes
or assessment with chatbot. In the settings of instructional procedure, chatbot will intact with
students by Socratic methods, instead of providing direct answers, the chatbot uses probing
questions to guide students in discovering knowledge, examining their performance, and
engaging in logical reasoning. Once students have completed the remedial tasks scaffolded by the
chatbot, the TALP system collects the dialogue information. This information is then utilized by
GPT to generate customized assessment items, specifically targeting the individual student's
learning weaknesses. The purpose is to assess and evaluate whether students have achieved a
thorough mastery of the required competence. As depicted in Figure 3, when students answer
correctly, the TALP system will guide them in remediating higher-level misconceptions. When
students are unable to provide a correct answer, the chatbot guides them towards a more
indepth remediation at a lower level. While the existing diagnostic system in TALP offers
crossgrade precision with commendable accuracy, we believe integrating it with LLMs could further
improve the results. By enabling deeper interaction with students, the collaboration of LLMs with
the current rule-based TALP AI system could offer more nuanced diagnostics of learning
weaknesses.
      </p>
      <p>Of the numerous large language models available on the market, GPT-3.5 emerges as our top
choice, especially due to its efficient and fluent handling of Traditional Chinese content. This
selection was also necessitated by the current unavailability of GPT-4 for TALP. To assess its
capabilities, we conducted an initial test using 569 5th-grade test items derived from TALP. The
mathematical problem-solving accuracy rate was 79% initially, which, while promising, indicated
potential for further enhancement. By integrating both the cognitive and item model into the
prompts, the accuracy witnessed a significant rise to 96%. Intriguingly, when we employed
GPT4 prompting with both the cognitive model and item model, the accuracy impressively peaked at
100%. The above results suggest that while GPT-4 stands out as the superior engine, GPT-3.5 can
also deliver comparable outcomes when provided with carefully structured prompts, thereby
effectively meeting our requirements.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Examples by applying GPT-3.5</title>
      <p>In the following examples, we will illustrate the feasibility of employing GPT-3.5 to enhance
TALP’s diagnostic reports in mathematics by: (1) pinpointing the students' learning weaknesses
and saving testing items; (2) scaffolding their learning through interaction with a chatbot; and (3)
generating assessment items. The domain knowledge pertains to fifth-grade level understanding
of ratios and their practical applications in everyday life, encompassing concepts such as
'percentage' and 'discount'.</p>
      <p>
        To achieve the aforementioned goals, the input of appropriate prompts is crucial for the
success of our task. Two prompts are required: one for the assessment item and domain
knowledge, and another for the instructional procedure. Given that our task involves automated
item generation and automated rating, it is imperative to have well-defined cognitive and item
models [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], as outlined in Table 1, to clarify the testing domain knowledge. Table 1 also shows
how indicators of knowledge structure are utilized for the cognitive model.
      </p>
      <p>In the section related to the item model, we input data as multiple-choice questions,
comprising stems, options, and answers. This comprehensive information significantly aids
GPT3.5 in understanding the context of tests. Our initial trials have shown that structuring prompts
in multiple-choice format can substantially improve the accuracy of the feedback provided.
Additionally, the distractors included in multiple-choice items serve to effectively illustrate
common misconceptions to GPT-3.5.</p>
      <p>Lastly, establishing an instructional procedure is essential for the chatbot to effectively guide
students through the Zone of Proximal Development (ZPD). A Socratic interaction approach is
applied to scaffold and guide students towards understanding. The prompt for the instructional
procedure can be found in Table 1.
Knowledge Structure Note 5-n-14
5-n-14-S01: Understand the concept of ratio as "the amount of a part compared to the total."
5-n-14-S02: Able to solve problems related to ratios in daily life.
5-n-14-S03: Understand percentages as a commonly used representation of ratios.
5-n-14-S04: Able to solve problems related to percentages in daily life.
5-n-14-S05: Able to solve applied problems related to percentages in daily life (including
discounts and increases).
5-n-14-S06: Proficient in converting between commonly used percentages and fractions.
The hierarchical knowledge structure in 5-n-14 is as follows: At the top is S06, which is
preceded by S05. S05 then precedes both S04 and S02. Continuing down the hierarchy, S04 is
above S03, while S03 and S02 are both positioned over S01.</p>
      <p>Item model
Stem:”15/16=( )%，( )的答案應該是多少呢？”
Option: (1) 93.75 (2) 9375 (3) 0.9375 (4) 9.375
Answer: (1)</p>
      <p>Instructional procedure
Prompts to GPT3.5:
Analyze students' mistakes using indicators of knowledge structure, and pinpoint which area
they are struggling with. Subsequently, employ scaffolding. Rather than directly providing the
correct answer, use the Socratic method to guide students in thinking and explaining. Based on
the students’ responses, offer explanations and guidance. Depending on the student's response,
generate a testing item similar to the one in the item model to assess the level of learning. If the
answer is correct, it is assumed that the student has grasped the material; if the answer is
incorrect, continue providing guidance until the correct answer is given.</p>
      <p>In Figure 4(a), the chatbot displays a question labeled as 5-n-14-S06 (shown in Figure 4(b)) in
the knowledge structure for the student to solve. Based on the students' responses, the chatbot
identifies that their learning weakness, attributed to failing 5-n-14-S06, is rooted in 5-n-14-S03.
In the knowledge structure, 5-n-14-S03 is the competence of understanding percentages as a
commonly used representation of ratios. In the other word, it refers to convert a decimal into a
percentage. In contrast, the previous TALP AI rule-based diagnostic system would require testing
the students on 5-n-14-S05, 5-n-14-S04, and 5-n-14-S02 to pinpoint the actual learning deficiency
located in 5-n-14-S03. Employing GPT-3.5 as an automated rater streamlines the diagnostic
process by reducing the number of test items needed.
In the chatbox (as depicted in Figure 5), the chatbot interacts with students by providing
instructions for remediation. In this instance, the student was identified as having difficulty
converting a decimal to a percentage. The chatbot sought to teach the student how to convert a
decimal into a percentage. It demonstrated this by showing that multiplying a decimal by 100 and
appending the percentage symbol to the result accomplishes the conversion. As shown in Figure
5, though the chatbot demonstrated the method of converting a decimal into a percentage, the
student still had doubts regarding this demonstration. To address these doubts, the chatbot,
within the chatbox, used the Socratic method to guide the student towards understanding the
concept. This was achieved by providing additional explanations and posing simple questions for
the student to answer. After completing the instruction, the chatbot generates a testing item
similar to the original one in Figure 4(a) to assess whether the student has mastered the concept
(as seen in Figure 6). This process aims to test whether students can complete tasks
independently without assistance from the chatbot.</p>
      <p>Figure 5: The interaction between student and chatbot in the chatbox</p>
    </sec>
    <sec id="sec-5">
      <title>5. The challenges in implementing LLMs in TALP</title>
      <p>
        As the previous discussion shown, LLMs is so potential to improve the current adaptive
mechanism in TALP. Combining with the knowledge structure, TALP equipped with LLMs
technology may save more unnecessary items than before. Integrating a chatbot into the
diagnostic report can create an improved scaffold for remediation, facilitated by Socratic
interactions. Additionally, the chatbot's deep interaction capabilities can enable further diagnosis
of the student's understanding. To accomplish the aforementioned objectives, the achievement
relies on the accuracy and precision of GPT in interpreting and providing answers to the learning
content and testing items, especially in mathematics and science. Evidently, the API of GPT-3.5 is
accessible for constructing chatbots within the platform. However, its precision and accuracy in
problem-solving and interpretation of mathematical symbols are areas that still require
improvement [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Even if GPT-4 were currently available, its superior accuracy and precision in
5th-grade mathematics, as shown in our initial test, do not guarantee equivalent performance at
the high school level. The problem-solving capabilities of GPT-4 would still need to be
demonstrated and validated in this more advanced context. The cost associated with GPT-4 poses
a significant challenge that needs to be addressed, particularly for operating a learning platform
like TALP, which is fully funded by the Ministry of Education (MOE) and renowned for providing
free usage to grade 1-12 students.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>We gratefully acknowledge the National Science and Technology Council of Taiwan, along with
the Ministry of Education, for their generous financial support and steadfast endorsement of this
study under the Project: MOST 108-2511-H-142 -005 -MY3. Their generosity and commitment
have been instrumental in the advancement of this study. We extend our heartfelt appreciation
for their steadfast belief in our research endeavors.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kasneci</surname>
          </string-name>
          et al.,
          <article-title>"ChatGPT for good? On opportunities and challenges of large language models for education,"</article-title>
          <source>Learning and Individual Differences</source>
          , vol.
          <volume>103</volume>
          , p.
          <fpage>102274</fpage>
          ,
          <year>2023</year>
          /04/01/ 2023, doi: https://doi.org/10.1016/j.lindif.
          <year>2023</year>
          .
          <volume>102274</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fraiwan</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Khasawneh</surname>
          </string-name>
          ,
          <article-title>"A Review of ChatGPT Applications in Education, Marketing</article-title>
          , Software Engineering, and Healthcare: Benefits, Drawbacks, and Research Directions,
          <article-title>"</article-title>
          <source>arXiv preprint arXiv:2305.00237</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          et al.,
          <article-title>"Can large language models provide feedback to students? A case study on ChatGPT," 2023</article-title>
          . doi:
          <volume>10</volume>
          .35542/osf.io/hcgzj.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Huang</surname>
          </string-name>
          , H.-Y. (
          <year>2022</year>
          , December 14).
          <article-title>Analyzing the Effectiveness of the Taiwan Adaptive Learning Platform on Learning Outcomes using Educational Data and Data Mining Techniques</article-title>
          .
          <article-title>Paper presented at the 2022 Self-Regulated Learning Festival and Learning Analytics Seminar</article-title>
          , Kaohsiung, Taiwan.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stevenson</surname>
          </string-name>
          , "Shermis,
          <string-name>
            <given-names>MD</given-names>
            , &amp;
            <surname>Burstein</surname>
          </string-name>
          ,
          <string-name>
            <surname>J</surname>
          </string-name>
          .(Eds)(
          <year>2013</year>
          ).
          <article-title>Handbook of Automated Essay Evaluation: Current applications and new directions,"</article-title>
          <source>Journal of Writing Research</source>
          , vol.
          <volume>5</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>239</fpage>
          -
          <lpage>243</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>McLeod</surname>
          </string-name>
          ,
          <article-title>"</article-title>
          <source>Vygotsky's Zone of Proximal Development and Scaffolding"</source>
          ,
          <year>2023</year>
          . URL:https://www.simplypsychology.
          <article-title>org/zone-of-proximaldevelopment.html?ref=brainscape-academy.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Vygotsky</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Cole</surname>
          </string-name>
          ,
          <article-title>Mind in society: Development of higher psychological processes</article-title>
          . Harvard university press,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>K.-C. Pai</surname>
            ,
            <given-names>B.-C.</given-names>
          </string-name>
          <string-name>
            <surname>Kuo</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-H. Liao</surname>
          </string-name>
          , and Y.
          <string-name>
            <surname>-M. Liu</surname>
          </string-name>
          ,
          <article-title>"An application of Chinese dialogue-based intelligent tutoring system in remedial instruction for mathematics learning,"</article-title>
          <source>Educational Psychology</source>
          , vol.
          <volume>41</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>137</fpage>
          -
          <lpage>152</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <article-title>"Federated Prompting and Chain-of-Thought Reasoning for Improving LLMs Answering,"</article-title>
          <source>arXiv preprint arXiv:2304.13911</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Gierl</surname>
          </string-name>
          and
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Haladyna</surname>
          </string-name>
          ,
          <article-title>"Using weak and strong theory to create item models for automatic item generation: Some practical guidelines with examples," in Automatic item generation:</article-title>
          <source>Routledge</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>"Automatic Math Word Problem Generation With TopicExpression Co-Attention Mechanism and Reinforcement Learning,"</article-title>
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          , vol.
          <volume>30</volume>
          , pp.
          <fpage>1061</fpage>
          -
          <lpage>1072</lpage>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Shen</surname>
          </string-name>
          et al.,
          <article-title>"Mathbert: A pre-trained language model for general nlp tasks in mathematics education,"</article-title>
          <source>arXiv preprint arXiv:2106.07340</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>