<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hannah-Beth Clark</string-name>
          <email>hannah-beth.clark@thenational.academy</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Owen Henkel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Benton</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Margaux Dowland</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Reka Budai</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ibrahim Kaan Keskin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emma Searle</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthew Gregory</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Hodierne</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>William Gayne</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John Roberts</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Education, University of Oxford</institution>
          ,
          <addr-line>15 Norham Gardens, Oxford OX2 6PY</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Oak National Academy</institution>
          ,
          <addr-line>1 Scott Place, 2 Hardman Street, Manchester, M3 3AA</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Second International Workshop on Generative AI for Learning Analytics</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Designing AI tools for use in educational settings presents distinct challenges; the need for accuracy is heightened, safety is imperative and pedagogical rigour is crucial. As a publicly funded body in the UK, Oak National Academy is in a unique position to innovate within this field as we have a comprehensive curriculum of approximately 13,000 open education resources (OER) for all National Curriculum subjects, designed and quality-assured by expert, human teachers. This has provided the corpus of content needed for building a high-quality AI-powered lesson planning tool, Aila, that is free to use and therefore accessible to all teachers across the country. Furthermore, using our evidence-informed curriculum principles, we have codified and exemplified each component of lesson design. To assess the quality of lessons produced by Aila at scale, we have developed an AI-powered autoevaluation agent, facilitating informed improvements to enhance output quality. Through comparisons between human and auto-evaluations, we have begun to refine this agent further to increase its accuracy, measured by its alignment with an expert human evaluator. In this paper we present this iterative evaluation process through an illustrative case study focused on one quality benchmark - the level of challenge within multiple-choice quizzes. We also explore the contribution that this may make to similar projects and the wider sector.</p>
      </abstract>
      <kwd-group>
        <kwd>AI-powered lesson planning</kwd>
        <kwd>Open education resources</kwd>
        <kwd>LLM as a judge</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Augmenting generative AI models with a high-quality corpus in a retrieval database for use in RAG can improve
accuracy from 67% to 92% [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this paper, we describe our approach to designing Aila, our AI lesson
assistant, and the auto-evaluation agent built alongside it to assess the accuracy, quality and safety of the
lessons Aila produces. We also present empirical data from a case study assessing the effectiveness of
this auto-evaluation agent.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. System Design</title>
      <p>Aila is designed to emulate the thought process of an experienced teacher as they plan a lesson. It
is intentionally designed not to be a ‘single-shot’ tool that creates a lesson in one click, but instead
supports teacher agency through enabling them to adapt and edit the lesson step-by-step to better suit
their students (see Figure 1).</p>
      <p>
        Our underlying content, alongside the codification of good practice in lesson design, has enabled
us to use several techniques to raise the quality of Aila’s outputs. These include retrieval augmented
generation (RAG), to provide relevant context for the output [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and more specifically content anchoring,
to improve lesson quality by instructing the model to respond within the bounds of specified content (i.e.
an existing Oak lesson) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]; prompt engineering, to focus the response of the underlying Large Language
Model (LLM) according to our codified definition of a high-quality lesson; and decision-making by the
teacher at a granular level to act as the human in the loop [
        <xref ref-type="bibr" rid="ref10">10, 12</xref>
        ].
      </p>
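        <p>As an illustration of content anchoring, the sketch below (hypothetical function and field names, not Aila’s actual prompt) shows how a retrieved Oak lesson can be embedded in the prompt so that the model is instructed to respond within the bounds of that existing content:</p>

```python
# Illustrative sketch of content anchoring (hypothetical names): the
# retrieved, quality-assured lesson is placed in the prompt and the model
# is told to stay within its bounds.

def build_anchored_prompt(user_request: str, anchor_lesson: dict) -> str:
    """Combine a teacher's request with an existing lesson used as the anchor."""
    anchor_text = "\n".join(
        f"{section}: {content}" for section, content in anchor_lesson.items()
    )
    return (
        "You are a lesson-planning assistant. Stay within the bounds of the "
        "quality-assured lesson below; do not introduce content beyond it.\n\n"
        f"--- Anchor lesson ---\n{anchor_text}\n\n"
        f"--- Teacher request ---\n{user_request}"
    )

prompt = build_anchored_prompt(
    "Adapt the starter quiz for lower-attaining pupils",
    {"Title": "Fractions of amounts",
     "Key learning point": "Find a unit fraction of a quantity"},
)
```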
      <p>
        To enable us to understand the effectiveness of these techniques by evaluating Aila’s outputs quickly
and efficiently, we have built an auto-evaluation agent, using LLM as a Judge methodology [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which
is based on Oak’s curriculum principles [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Each lesson is currently evaluated using a series of
auto-evaluation prompts, assessing 24 quality and accuracy benchmarks, such as cultural bias, minimally
different quiz answers or the progression of quiz difficulty (for the full list, see Appendix A). This has
enabled us to evaluate the impact of the changes we make to improve Aila and compare the results,
such as using different models as the underlying LLM, testing new versions of Aila before release, and
identifying particular areas for development, which is the focus of this paper.
      </p>
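      <p>A minimal LLM-as-a-judge loop can be sketched as follows (an illustrative shape only; <monospace>call_llm</monospace> stands in for any chat-completion client, and the JSON reply format is an assumption, not the agent’s actual prompt):</p>

```python
# Minimal LLM-as-a-judge sketch: one benchmark prompt asks the judge for a
# 1-5 Likert score and a justification; repeated samples are averaged to
# smooth out judge variance.

import json

def judge_lesson(lesson: str, benchmark_prompt: str, call_llm, n_samples: int = 10):
    """Score one lesson against one benchmark, averaging repeated judge samples."""
    scores, justifications = [], []
    for _ in range(n_samples):
        reply = call_llm(
            f"{benchmark_prompt}\n\nLesson:\n{lesson}\n\n"
            'Reply as JSON: {"score": <1-5>, "justification": "<reason>"}'
        )
        parsed = json.loads(reply)
        scores.append(int(parsed["score"]))
        justifications.append(parsed["justification"])
    return sum(scores) / len(scores), justifications
```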
    </sec>
    <sec id="sec-3">
      <title>3. Case Study</title>
      <sec id="sec-3-1">
        <title>3.1. Task Description</title>
        <p>Aila produces diverse educational resources, including lesson plans and classroom materials. We
wanted to understand how closely aligned the auto-evaluation agent was with qualified teachers. To
do this we first created a dataset of 2249 user-created Aila lessons, and 2736 lessons produced by Aila
without user input or content anchoring (i.e. single shot), totalling 4985 lessons. The lessons were
across all four key stages (i.e. for ages 5-16 years) and included maths, English, history, geography
and science. The auto-evaluation model (gpt-4o-2024-08-06, temperature: 0.5) scored the lessons on
19 Likert criteria (using a 1-5 scale, see Figure 2) and 5 boolean criteria (true or false), each with their
respective justifications.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Analysis</title>
        <p>Our initial analysis focused on MCQs that teachers scored as 1, 3, and 5 to understand weak, average,
and strong distractor quality, conducting a thematic analysis of the teachers’ justifications for these
scores. We limited our thematic analysis to these three categories to provide clear benchmarks for
quality assessment and to identify distinctive characteristics at each level of performance. We then
identified exemplar MCQs to supplement the amended auto-evaluation prompts.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Results</title>
        <p><bold>3.3.1. What makes a generated distractor high- or low-quality in relation to providing an
appropriate level of challenge?</bold></p>
        <p>Appendix B summarises the key rating justification themes given by the human evaluators. The most
common reason for distractors being low-quality was having the opposite sentiment to the correct
answer (e.g. the correct answer is a positive trait and the distractors are all negative traits). Other reasons
included having a different grammatical structure to the correct answer, as well as the correct answer
repeating words from the question but the distractors not doing so. For distractors to be high-quality, they
should fall into the same category as the correct answer, relate to a common theme, include common
misconceptions and have a similar grammatical structure.</p>
        <p><bold>3.3.2. How well aligned were the auto-evaluation agent and the human evaluators?</bold></p>
        <p>Figure 3 highlights how the auto-evaluation agent applied excessively strict criteria compared
to the human evaluator, rating a large number of quiz questions as having low-quality distractors. It
justified the low scores by claiming that the answer options were conceptually very different, thereby
lacking the necessary challenge for the specified key stage. There was also an overemphasis on what
was expected of students at each key stage in terms of challenging deeper understanding.</p>
        <p>We used the thematic analysis findings to update the prompt with additional guidance defining a
high-quality distractor, and as a result, the auto-evaluation scores and human evaluation scores became
more aligned (see Table 1). We calculated the Mean Squared Error (MSE) using the mean of the 10 scores
given by the auto-evaluation agent per evaluation. The mean-based MSE decreased from 3.81 to 2.94 (p-value
= 0.00679), which is statistically significant (p &lt; 0.05). We also calculated several other evaluation
metrics, including the Quadratic Weighted Kappa (QWK), which increased from 0.17 to 0.32,
indicating a moderate to large and statistically significant improvement in agreement (see Appendix C).</p>
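        <p>As a sketch of how these two agreement metrics can be computed (stdlib-only; the data and exact implementation are illustrative, not the authors’ code):</p>

```python
# Agreement metrics over paired (human, model) ratings, where the model
# rating for QWK is an integer in the 1-5 range (e.g. the rounded mean of
# repeated judge samples).

def mean_squared_error(human, model):
    return sum((h - m) ** 2 for h, m in zip(human, model)) / len(human)

def quadratic_weighted_kappa(a, b, min_r=1, max_r=5):
    """Cohen's kappa with quadratic weights over integer ratings in [min_r, max_r]."""
    n = max_r - min_r + 1
    obs = [[0.0] * n for _ in range(n)]
    for x, y in zip(a, b):
        obs[x - min_r][y - min_r] += 1
    total = len(a)
    hist_a = [sum(row) for row in obs]        # marginal counts, rater a
    hist_b = [sum(col) for col in zip(*obs)]  # marginal counts, rater b
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2               # quadratic disagreement weight
            num += w * obs[i][j] / total                  # observed weighted disagreement
            den += w * hist_a[i] * hist_b[j] / total ** 2  # disagreement expected by chance
    return 1 - num / den
```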
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>Through an illustrative case study, we have demonstrated the potential of using an auto-evaluation
agent to drive improvement in the quality of AI-generated lessons and resources, as well as how
the effectiveness of this agent can be improved by drawing on the specific teaching expertise of human
evaluators. Thematic analysis of rating justifications allowed us to codify what high and low-quality
distractors look like (with few-shot examples) and incorporate this information directly into the
prompt, increasing the alignment with the human evaluators and driving improvements in the overall
MCQ quality.</p>
      <p>Incorporating the thematic analysis and corresponding representative examples for scores of 2 and
4 in future work could help reduce minor discrepancies by increasing granularity, especially in cases
where scores are ‘1 away’ from human evaluations. Absolute alignment is not necessarily the ultimate
goal; the more important measure of success would be to see if the justifications the LLM gives alongside
scores of 1, 3 and 5 are in line with the themes we found, providing consistent scoring according to
these guidelines. Further thematic analysis would be required to establish this. Even after the changes,
the LLM still scores lower than the human the majority of the time. This greater sensitivity is more
beneficial than the alternative, as potential issues are more likely to be flagged and addressed.</p>
      <p>There were also limitations to this work. We had a specific focus on answer differentiation and MCQs,
which could have implications for wider generalisability. Furthermore, due to time constraints, we
were not able to have multiple human evaluators for each question. Ideally, we would have an average
human score per evaluation to deal with possible outliers. In future work, we could also consider
weighting these responses according to the teacher’s experience level, factoring in years of experience,
teaching role and other metrics.</p>
      <sec id="sec-4-1">
        <title>4.1. Recommendations</title>
        <p>Aila has been designed specifically to support teachers in the UK with planning high-quality lessons
and resources to reduce teacher workload and improve the quality of materials produced using AI. We
hope by sharing what we have learned through this work it can also have an impact on other projects:</p>
        <p>Having a base of high-quality OER has been integral to the quality of lessons produced by Aila. Our
curriculum materials are aligned with the national curriculum for England, produced by expert teachers,
available on an open government licence, and targeted at UK schools. For other organisations looking
to develop tools within this space in other contexts, access to high-quality resources appropriate for
their context will be imperative. We seek to enable this by making our OER resources available through
a public API.</p>
        <p>We had already done significant work codifying and exemplifying high-quality curriculum design.
This provided invaluable input as the starting point for writing our prompt and, in turn, our evaluation
tools. Deciding on your organisation’s agreed-upon concept of “high-quality” is an important starting
point before developing your tool, as this will be built into your prompt and evaluation work.</p>
        <p>
          Using a cycle of comparative auto and human evaluations allowed us to iterate on the auto-evaluation
prompt continuously and will ultimately also enable us to refine Aila’s prompt. Once you have identified
full lesson plans that achieve good scores aligned between evaluators through this iterative process
these plans can subsequently be used to fine-tune generation models to output better-quality lesson
plans [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
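          <p>Such a fine-tuning export step could look something like the sketch below (invented field names and thresholds; chat-style JSONL is a common input format for supervised fine-tuning, not necessarily the one used here):</p>

```python
# Hypothetical sketch: export lesson plans that scored well with both
# evaluators as a JSONL file of chat-style prompt/completion pairs.

import json

def export_finetune_set(records, path, min_score=4.5, max_gap=0.5):
    """Keep lessons where human and auto scores are both high and in agreement."""
    kept = 0
    with open(path, "w") as f:
        for r in records:
            if r["human_score"] >= min_score and abs(r["human_score"] - r["auto_score"]) <= max_gap:
                f.write(json.dumps({
                    "messages": [
                        {"role": "user", "content": r["request"]},
                        {"role": "assistant", "content": r["lesson_plan"]},
                    ]
                }) + "\n")
                kept += 1
    return kept
```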
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Conclusion</title>
        <p>We believe that auto-evaluation is a powerful tool for driving improvement in AI-produced content
quickly and efficiently. We have focused specifically on a “quality” benchmark, but we are also in the
process of applying this approach to our “safety” benchmarks. The use of our auto-evaluation tool to
evaluate different versions of Aila as we release them, comparisons of quality in how RAG is used, and
the use of fine-tuning to develop the quality of our AI tools are further areas we plan to investigate.
We also aim to use an improvement agent, which will take feedback from our auto-evaluation agent to
improve the quality of lesson content before it is displayed to users, as well as suggest specific areas for
users to check carefully or improve.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>Generative AI tools have not been used to support manuscript preparation.</p>
    </sec>
    <sec id="sec-appendix-a">
      <title>A. Full set of assessed quality and accuracy benchmarks</title>
      <p>Each benchmark is checked with either a Likert (1-5) or a Boolean output format: Learning Cycle Feasibility;
Practice Tasks Assess Explanation Understanding; CFUs Align with Explanations and Key Learning Points;
Learning Cycles Achieve Learning Outcome; Learning Outcome Effectiveness; Explanations Address Misconceptions;
Test Understanding of Misconceptions; Question Answers Are Factual; Internal Consistency; Appropriate Level for Age;
Answers Are Minimally Different; Americanisms; Cultural Bias; Gender Bias; Exit Quiz Tests Key Learning Points;
Starter Quiz Tests Prior Knowledge; Progressive Complexity in Quiz Questions; Learning Cycles Increase in Challenge;
No Negative Phrasing in Quiz Questions; Repeated Quizzes; Starter Quiz Does Not Test Lesson Content;
Exit Quiz Contains Vocabulary Question; Meaningful Misconceptions.</p>
    </sec>
    <sec id="sec-6">
      <title>B. Summary of thematic analysis</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>Benchmarking large language models in retrieval-augmented generation</article-title>
          .
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          ,
          <volume>38</volume>
          (
          <issue>16</issue>
          ),
          <fpage>17754</fpage>
          -
          <lpage>17762</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Chiang</surname>
            ,
            <given-names>C. H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H. Y.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>A closer look into automatic evaluation using large language models</article-title>
          .
          <source>arXiv preprint</source>
          arXiv:2310.05657.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Chiu</surname>
            ,
            <given-names>T. K. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xia</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chai</surname>
            ,
            <given-names>C. S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>Systematic literature review on opportunities, challenges, and future research recommendations of artificial intelligence in education</article-title>
          .
          <source>Computers and Education: Artificial Intelligence</source>
          ,
          <volume>4</volume>
          ,
          <fpage>100118</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>D'Sa</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Wisbal-Dionaldo</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Analysis of multiple choice questions: item difficulty, discrimination index and distractor efficiency</article-title>
          .
          <source>International Journal of Nursing Education</source>
          ,
          <volume>9</volume>
          (
          <issue>3</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Government Social Research.</surname>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>Use Cases for Generative AI in Education: Building a proof of concept for Generative AI feedback and resource generation in education contexts</article-title>
          [Technical report].
          <source>GOV.UK</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Kommineni</surname>
            ,
            <given-names>V. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>König-Ries</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Samuel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>From human experts to machines: An LLM supported approach to ontology and knowledge graph construction</article-title>
          .
          <source>arXiv preprint</source>
          ,
          <volume>2403</volume>
          .
          <fpage>08345</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>McCrea</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>Our 6 principles guiding our approach to curriculum</article-title>
          . Oak National Academy.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Ouyang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , et al. (
          <year>2022</year>
          ).
          <article-title>Training language models to follow instructions with human feedback</article-title>
          .
          <source>Advances in neural information processing systems</source>
          ,
          <volume>35</volume>
          ,
          <fpage>27730</fpage>
          -
          <lpage>27744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Teacher Tapp</surname>
          </string-name>
          . (
          <year>2024</year>
          ).
          <article-title>AI teachers, school exclusions and cutting workload</article-title>
          .
          <source>Teacher Tapp.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Tsiakas</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Murray-Rust</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>Using human-in-the-loop and explainable AI to envisage new future work practices</article-title>
          .
          <source>Proceedings of the 15th International Conference on PErvasive Technologies</source>
          Related to Assistive Environments,
          <fpage>588</fpage>
          -
          <lpage>594</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>UNESCO.</surname>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Recommendation on Open Educational Resources (OER) - Legal Affairs</article-title>
          . UNESCO.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>He</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>A survey of human-in-the-loop for machine learning</article-title>
          .
          <source>Future Generation Computer Systems</source>
          ,
          <volume>135</volume>
          ,
          <fpage>364</fpage>
          -
          <lpage>381</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>