<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hannah-Beth Clark</string-name>
          <email>hannah-beth.clark@thenational.academy</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Owen Henkel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Benton</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Margaux Dowland</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Reka Budai</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ibrahim Kaan Keskin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emma Searle</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthew Gregory</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Hodierne</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>William Gayne</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John Roberts</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Education, University of Oxford</institution>
          ,
          <addr-line>15 Norham Gardens, Oxford OX2 6PY</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Oak National Academy</institution>
          ,
          <addr-line>1 Scott Place, 2 Hardman Street, Manchester, M3 3AA</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Second International Workshop on Generative AI for Learning Analytics</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Designing AI tools for use in educational settings presents distinct challenges; the need for accuracy is heightened, safety is imperative and pedagogical rigour is crucial. As a publicly funded body in the UK, Oak National Academy is in a unique position to innovate within this field as we have a comprehensive curriculum of approximately 13,000 open education resources (OER) for all National Curriculum subjects, designed and quality-assured by expert, human teachers. This has provided the corpus of content needed for building a high-quality AI-powered lesson planning tool, Aila, that is free to use and therefore accessible to all teachers across the country. Furthermore, using our evidence-informed curriculum principles, we have codified and exemplified each component of lesson design. To assess the quality of lessons produced by Aila at scale, we have developed an AI-powered autoevaluation agent, facilitating informed improvements to enhance output quality. Through comparisons between human and auto-evaluations, we have begun to refine this agent further to increase its accuracy, measured by its alignment with an expert human evaluator. In this paper we present this iterative evaluation process through an illustrative case study focused on one quality benchmark - the level of challenge within multiple-choice quizzes. We also explore the contribution that this may make to similar projects and the wider sector.</p>
      </abstract>
      <kwd-group>
        <kwd>AI-powered lesson planning</kwd>
        <kwd>Open education resources</kwd>
        <kwd>LLM as a judge</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Augmenting generative AI models with a high-quality corpus in a retrieval database for use in RAG can improve
accuracy from 67% to 92% [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this paper, we describe our approach to designing Aila, our AI lesson
assistant, and the auto-evaluation agent built alongside it to assess the accuracy, quality and safety of the
lessons Aila produces. We also present empirical data from a case study assessing the effectiveness of
this auto-evaluation agent.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. System Design</title>
      <p>Aila is designed to emulate the thought process of an experienced teacher as they plan a lesson. It
is intentionally designed not to be a ‘single-shot’ tool that creates a lesson in one click, but instead
supports teacher agency through enabling them to adapt and edit the lesson step-by-step to better suit
their students (see Figure 1).</p>
      <p>
        Our underlying content, alongside the codification of good practice in lesson design, has enabled
us to use several techniques to raise the quality of Aila’s outputs. These include retrieval augmented
generation (RAG), to provide relevant context for the output [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and more specifically content anchoring,
to improve lesson quality by instructing the model to respond within the bounds of specified content (i.e.
an existing Oak lesson) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]; prompt engineering, to focus the response of the underlying Large Language
Model (LLM) according to our codified definition of a high-quality lesson; and decision-making by the
teacher at a granular level to act as the human in the loop [
        <xref ref-type="bibr" rid="ref10">10, 12</xref>
        ].
      </p>
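        <p>As an illustration of content anchoring, the sketch below (hypothetical function and field names, not Aila’s actual prompt) shows how a retrieved Oak lesson can be embedded in the prompt so that the model is instructed to respond within the bounds of that existing content:</p>

```python
# Illustrative sketch of content anchoring (hypothetical names): the
# retrieved, quality-assured lesson is placed in the prompt and the model
# is told to stay within its bounds.

def build_anchored_prompt(user_request: str, anchor_lesson: dict) -> str:
    """Combine a teacher's request with an existing lesson used as the anchor."""
    anchor_text = "\n".join(
        f"{section}: {content}" for section, content in anchor_lesson.items()
    )
    return (
        "You are a lesson-planning assistant. Stay within the bounds of the "
        "quality-assured lesson below; do not introduce content beyond it.\n\n"
        f"--- Anchor lesson ---\n{anchor_text}\n\n"
        f"--- Teacher request ---\n{user_request}"
    )

prompt = build_anchored_prompt(
    "Adapt the starter quiz for lower-attaining pupils",
    {"Title": "Fractions of amounts",
     "Key learning point": "Find a unit fraction of a quantity"},
)
```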
      <p>
        To enable us to understand the effectiveness of these techniques by evaluating Aila’s outputs quickly
and efficiently, we have built an auto-evaluation agent, using LLM as a Judge methodology [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which
is based on Oak’s curriculum principles [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Each lesson is currently evaluated using a series of
auto-evaluation prompts, assessing 24 quality and accuracy benchmarks, such as cultural bias, minimally
different quiz answers or the progression of quiz difficulty (for the full list, see Appendix A). This has
enabled us to evaluate the impact of the changes we make to improve Aila and compare the results,
such as using different models as the underlying LLM, testing new versions of Aila before release, and
identifying particular areas for development, which is the focus of this paper.
      </p>
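      <p>A minimal LLM-as-a-judge loop can be sketched as follows (an illustrative shape only; <monospace>call_llm</monospace> stands in for any chat-completion client, and the JSON reply format is an assumption, not the agent’s actual prompt):</p>

```python
# Minimal LLM-as-a-judge sketch: one benchmark prompt asks the judge for a
# 1-5 Likert score and a justification; repeated samples are averaged to
# smooth out judge variance.

import json

def judge_lesson(lesson: str, benchmark_prompt: str, call_llm, n_samples: int = 10):
    """Score one lesson against one benchmark, averaging repeated judge samples."""
    scores, justifications = [], []
    for _ in range(n_samples):
        reply = call_llm(
            f"{benchmark_prompt}\n\nLesson:\n{lesson}\n\n"
            'Reply as JSON: {"score": <1-5>, "justification": "<reason>"}'
        )
        parsed = json.loads(reply)
        scores.append(int(parsed["score"]))
        justifications.append(parsed["justification"])
    return sum(scores) / len(scores), justifications
```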
    </sec>
    <sec id="sec-3">
      <title>3. Case Study</title>
      <sec id="sec-3-1">
        <title>3.1. Task Description</title>
        <p>Aila produces diverse educational resources, including lesson plans and classroom materials. We
wanted to understand how closely aligned the auto-evaluation agent was with qualified teachers. To
do this we first created a dataset of 2249 user-created Aila lessons, and 2736 lessons produced by Aila
without user input or content anchoring (i.e. single shot), totalling 4985 lessons. The lessons were
across all four key stages (i.e. for ages 5-16 years) and included maths, English, history, geography
and science. The auto-evaluation model (gpt-4o-2024-08-06, temperature: 0.5) scored the lessons on
19 Likert criteria (using a 1-5 scale, see Figure 2) and 5 boolean criteria (true or false), each with their
respective justifications.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Analysis</title>
        <p>Our initial analysis focused on MCQs that teachers scored as 1, 3, and 5 to understand weak, average,
and strong distractor quality, conducting a thematic analysis of the teachers’ justifications for these
scores. We limited our thematic analysis to these three categories to provide clear benchmarks for
quality assessment and to identify distinctive characteristics at each level of performance. We then
identified exemplar MCQs to supplement the amended auto-evaluation prompts.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Results</title>
        <p><bold>3.3.1. What makes a generated distractor high- or low-quality in relation to providing an
appropriate level of challenge?</bold></p>
        <p>Appendix B summarises the key rating justification themes given by the human evaluators. The most
common reason for distractors being low-quality was having the opposite sentiment to the correct
answer (e.g. the correct answer is a positive trait and the distractors are all negative traits). Other reasons
included having a different grammatical structure to the correct answer, as well as the correct answer
repeating words from the question but the distractors not doing so. For distractors to be high-quality, they
should fall into the same category as the correct answer, relate to a common theme, include common
misconceptions and have a similar grammatical structure.</p>
        <p><bold>3.3.2. How well aligned were the auto-evaluation agent and the human evaluators?</bold></p>
        <p>Figure 3 highlights how the auto-evaluation agent applied excessively strict criteria compared
to the human evaluator, rating a large number of quiz questions as having low-quality distractors. It
justified the low scores by claiming that the answer options were conceptually very different, thereby
lacking the necessary challenge for the specified key stage. There was also an overemphasis on what
was expected of students at each key stage in terms of challenging deeper understanding.</p>
        <p>We used the thematic analysis findings to update the prompt with additional guidance defining a
high-quality distractor, and as a result, the auto-evaluation scores and human evaluation scores became
more aligned (see Table 1). We calculated the Mean Squared Error (MSE) using the mean of the 10 scores
given by the auto-evaluation agent per evaluation. The mean-based MSE decreased from 3.81 to 2.94 (p-value
= 0.00679), which is statistically significant (p &lt; 0.05). We also calculated several other evaluation
metrics, including the Quadratic Weighted Kappa (QWK), which increased from 0.17 to 0.32,
indicating a moderate to large and statistically significant improvement in agreement (see Appendix C).</p>
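        <p>As a sketch of how these two agreement metrics can be computed (stdlib-only; the data and exact implementation are illustrative, not the authors’ code):</p>

```python
# Agreement metrics over paired (human, model) ratings, where the model
# rating for QWK is an integer in the 1-5 range (e.g. the rounded mean of
# repeated judge samples).

def mean_squared_error(human, model):
    return sum((h - m) ** 2 for h, m in zip(human, model)) / len(human)

def quadratic_weighted_kappa(a, b, min_r=1, max_r=5):
    """Cohen's kappa with quadratic weights over integer ratings in [min_r, max_r]."""
    n = max_r - min_r + 1
    obs = [[0.0] * n for _ in range(n)]
    for x, y in zip(a, b):
        obs[x - min_r][y - min_r] += 1
    total = len(a)
    hist_a = [sum(row) for row in obs]        # marginal counts, rater a
    hist_b = [sum(col) for col in zip(*obs)]  # marginal counts, rater b
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2               # quadratic disagreement weight
            num += w * obs[i][j] / total                  # observed weighted disagreement
            den += w * hist_a[i] * hist_b[j] / total ** 2  # disagreement expected by chance
    return 1 - num / den
```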
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>Through an illustrative case study, we have demonstrated the potential of using an auto-evaluation
agent to drive improvement in the quality of AI-generated lessons and resources, as well as how
the effectiveness of this agent can be improved by drawing on the specific teaching expertise of human
evaluators. Thematic analysis of rating justifications allowed us to codify what high and low-quality
distractors look like (with few-shot examples) and incorporate this information directly into the
prompt, increasing the alignment with the human evaluators and driving improvements in the overall
MCQ quality.</p>
      <p>Incorporating the thematic analysis and corresponding representative examples for scores of 2 and
4 in future work could help reduce minor discrepancies by increasing granularity, especially in cases
where scores are ‘1 away’ from human evaluations. Absolute alignment is not necessarily the ultimate
goal; the more important measure of success would be to see if the justifications the LLM gives alongside
scores of 1, 3 and 5 are in line with the themes we found, providing consistent scoring according to
these guidelines. Further thematic analysis would be required to establish this. Even after the changes,
the LLM still scores lower than the human the majority of the time. This greater sensitivity is more
beneficial than the alternative, as potential issues are more likely to be flagged and addressed.</p>
      <p>There were also limitations to this work. We had a specific focus on answer differentiation and MCQs,
which could have implications for wider generalisability. Furthermore, due to time constraints, we
were not able to have multiple human evaluators for each question. Ideally, we would have an average
human score per evaluation to deal with possible outliers. In future work, we could also consider
weighting these responses according to the teacher’s experience level, factoring in years of experience,
teaching role and other metrics.</p>
      <sec id="sec-4-1">
        <title>4.1. Recommendations</title>
        <p>Aila has been designed specifically to support teachers in the UK with planning high-quality lessons
and resources to reduce teacher workload and improve the quality of materials produced using AI. We
hope by sharing what we have learned through this work it can also have an impact on other projects:</p>
        <p>Having a base of high-quality OER has been integral to the quality of lessons produced by Aila. Our
curriculum materials are aligned with the national curriculum for England, produced by expert teachers,
available on an open government licence, and targeted at UK schools. For other organisations looking
to develop tools within this space in other contexts, access to high-quality resources appropriate for
their context will be imperative. We seek to enable this by making our OER resources available through
a public API.</p>
        <p>We had already done significant work codifying and exemplifying high-quality curriculum design.
This provided invaluable input as the starting point for writing our prompt and, in turn, our evaluation
tools. Deciding on your organisation’s agreed-upon concept of “high-quality” is an important starting
point before developing your tool, as this will be built into your prompt and evaluation work.</p>
        <p>
          Using a cycle of comparative auto and human evaluations allowed us to iterate on the auto-evaluation
prompt continuously and will ultimately also enable us to refine Aila’s prompt. Once you have identified
full lesson plans that achieve good scores aligned between evaluators through this iterative process
these plans can subsequently be used to fine-tune generation models to output better-quality lesson
plans [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
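          <p>Such a fine-tuning export step could look something like the sketch below (invented field names and thresholds; chat-style JSONL is a common input format for supervised fine-tuning, not necessarily the one used here):</p>

```python
# Hypothetical sketch: export lesson plans that scored well with both
# evaluators as a JSONL file of chat-style prompt/completion pairs.

import json

def export_finetune_set(records, path, min_score=4.5, max_gap=0.5):
    """Keep lessons where human and auto scores are both high and in agreement."""
    kept = 0
    with open(path, "w") as f:
        for r in records:
            if r["human_score"] >= min_score and abs(r["human_score"] - r["auto_score"]) <= max_gap:
                f.write(json.dumps({
                    "messages": [
                        {"role": "user", "content": r["request"]},
                        {"role": "assistant", "content": r["lesson_plan"]},
                    ]
                }) + "\n")
                kept += 1
    return kept
```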
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Conclusion</title>
        <p>We believe that auto-evaluation is a powerful tool for driving improvement in AI-produced content
quickly and efficiently. We have focused specifically on a “quality” benchmark, but we are also in the
process of applying this approach to our “safety” benchmarks. The use of our auto-evaluation tool to
evaluate different versions of Aila as we release them, comparisons of quality in how RAG is used, and
the use of fine-tuning to develop the quality of our AI tools are further areas we plan to investigate.
We also aim to use an improvement agent, which will take feedback from our auto-evaluation agent to
improve the quality of lesson content before it is displayed to users, as well as suggest specific areas for
users to check carefully or improve.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>Generative AI tools have not been used to support manuscript preparation.</p>
    </sec>
    <sec id="sec-appendix-a">
      <title>A. Full set of assessed quality and accuracy benchmarks</title>
      <p>Each benchmark is checked with either a Likert (1-5) or a Boolean output format: Learning Cycle Feasibility;
Practice Tasks Assess Explanation Understanding; CFUs Align with Explanations and Key Learning Points;
Learning Cycles Achieve Learning Outcome; Learning Outcome Effectiveness; Explanations Address Misconceptions;
Test Understanding of Misconceptions; Question Answers Are Factual; Internal Consistency; Appropriate Level for Age;
Answers Are Minimally Different; Americanisms; Cultural Bias; Gender Bias; Exit Quiz Tests Key Learning Points;
Starter Quiz Tests Prior Knowledge; Progressive Complexity in Quiz Questions; Learning Cycles Increase in Challenge;
No Negative Phrasing in Quiz Questions; Repeated Quizzes; Starter Quiz Does Not Test Lesson Content;
Exit Quiz Contains Vocabulary Question; Meaningful Misconceptions.</p>
    </sec>
    <sec id="sec-6">
      <title>B. Summary of thematic analysis</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>Benchmarking large language models in retrieval-augmented generation</article-title>
          .
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          ,
          <volume>38</volume>
          (
          <issue>16</issue>
          ),
          <fpage>17754</fpage>
          -
          <lpage>17762</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Chiang</surname>
            ,
            <given-names>C. H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H. Y.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>A closer look into automatic evaluation using large language models</article-title>
          .
          <source>arXiv preprint</source>
          arXiv:2310.05657.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Chiu</surname>
            ,
            <given-names>T. K. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xia</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chai</surname>
            ,
            <given-names>C. S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>Systematic literature review on opportunities, challenges, and future research recommendations of artificial intelligence in education</article-title>
          .
          <source>Computers and Education: Artificial Intelligence</source>
          ,
          <volume>4</volume>
          ,
          <fpage>100118</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>D'Sa</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Wisbal-Dionaldo</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Analysis of multiple choice questions: item difficulty, discrimination index and distractor efficiency</article-title>
          .
          <source>International Journal of Nursing Education</source>
          ,
          <volume>9</volume>
          (
          <issue>3</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Government Social Research.</surname>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>Use Cases for Generative AI in Education: Building a proof of concept for Generative AI feedback and resource generation in education contexts</article-title>
          [Technical report].
          <source>GOV.UK</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Kommineni</surname>
            ,
            <given-names>V. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>König-Ries</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Samuel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>From human experts to machines: An LLM supported approach to ontology and knowledge graph construction</article-title>
          .
          <source>arXiv preprint</source>
          ,
          <volume>2403</volume>
          .
          <fpage>08345</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>McCrea</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>Our 6 principles guiding our approach to curriculum</article-title>
          . Oak National Academy.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Ouyang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , et al. (
          <year>2022</year>
          ).
          <article-title>Training language models to follow instructions with human feedback</article-title>
          .
          <source>Advances in neural information processing systems</source>
          ,
          <volume>35</volume>
          ,
          <fpage>27730</fpage>
          -
          <lpage>27744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Teacher Tapp</surname>
          </string-name>
          . (
          <year>2024</year>
          ).
          <article-title>AI teachers, school exclusions and cutting workload</article-title>
          .
          <source>Teacher Tapp.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Tsiakas</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Murray-Rust</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>Using human-in-the-loop and explainable AI to envisage new future work practices</article-title>
          .
          <source>Proceedings of the 15th International Conference on PErvasive Technologies</source>
          Related to Assistive Environments,
          <fpage>588</fpage>
          -
          <lpage>594</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>UNESCO.</surname>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Recommendation on Open Educational Resources (OER) - Legal Affairs</article-title>
          . UNESCO.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>He</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>A survey of human-in-the-loop for machine learning</article-title>
          .
          <source>Future Generation Computer Systems</source>
          ,
          <volume>135</volume>
          ,
          <fpage>364</fpage>
          -
          <lpage>381</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>