<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ORCID:</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Coaching AI to be a Team Player</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>A. A. Tomlinson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>N. K. Wilkin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Kainth</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>R. Stanyon</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Birmingham</institution>
          ,
          <addr-line>B15 2TT</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>In this paper, we explore how an 'AI marking engine' (Aime) can be effectively, and ethically, incorporated into an assessment and feedback cycle with an agent-based model. We consider Aime as an additional team member who needs to be on-boarded. In the scenario we explore, Aime sits in the education hierarchy alongside the GTA graders while the overall decision making and moderation remains with the module lead. We implement, and comment, on the expectations of the humans, and the human-AI interactions via Graide, a subscription platform, that incorporates an AI engine which is optimised for mathematical disciplines. We propose that by treating the AI engine as a persona with a well-defined role and reporting structure in the human- AI ecosystem, one can optimise the grading and, most importantly, the individual student experience where the goal is to rapidly provide personalised, in-depth, and consistent feedback.</p>
      </abstract>
      <kwd-group>
        <kwd>1 Human-centered computing</kwd>
        <kwd>computer-aided assessment</kwd>
        <kwd>instructional coaching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In order for AI to be an effective tool in education, there is an acknowledgement that new protocols
are required in order that the AI both enhances the workflow and works with the human team [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. There
is understandably a distrust in AI and concerns around the transparency and accuracy of decision
making [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. There is also a clear need for the proliferation of AI tools to support teaching and learning
for large cohorts of students where individuals grading work struggle to return consistent and timely
feedback [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Within the context of AIED, it is crucial that grading precision and the quality of
feedback generated by AI assessment tools is reliable and similar in depth and nuance to that expected
of an expert human grader. Retaining human oversight can provide assurances to leadership and
frontline educators that the grading and feedback process is at least as fair and accurate to students as
traditional processes. Benefits include meaningful pedagogical insights alongside significant time and
cost savings. In this paper, we discuss how an AI marking engine (Aime) can fit into existing grading
workflows to enhance all aspects of assessment for students and educators. We consider Aime to be a
member of the team, who is ‘coached’ to become fully-integrated with carefully demarked
responsibilities. We discuss the design of the handover points between humans (students, module leads,
and Graduate Teaching Assistants (GTAs)) and Aime. In particular, we highlight the controls
differentially owned by human users and how this is leveraged to speed up the grading process while
maintaining confidence in the AI decision-making. We will exemplify this in the context of an
undergraduate mathematical subject.
      </p>
      <p>
        The platform we have chosen for our ecosystem is Graide. As showcased at L@S2022 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], Graide,
is a now commercialised grading and personalised feedback system which is a spin-out from the
University of Birmingham, UK. This close working between academics and the system developers has
enabled this model to develop nimbly.
      </p>
      <p>However, the underlying methodology of ascertaining what roles Aime is empowered (or prevented)
from undertaking transfers to any grading ecosystem that has access to an AI engine that has been made
suitably adaptable to work with and for the humans, rather than as a central system with which the
humans must learn to engage</p>
    </sec>
    <sec id="sec-2">
      <title>1.1. Including Aime in an idealised grading workflow for large undergraduate</title>
      <p>cohorts</p>
      <p>In the team hierarchy that we have implemented in Fig. 1, the humans always retain overall
moderation control, intentionally implemented as a safety feature. In stage 1, as with traditional grading
of large cohorts, there is written and verbal communication and discussion between the module lead
and the GTAs (for simplicity, we consider 3 GTAs in the figures shown). The GTAs then commence
grading, to the best of their ability, without an implied time constraint per script.</p>
      <p>
        In a non-AI scenario, each GTA grades work individually and, without a carefully agreed-upon mark
scheme or rubric, graders can produce inconsistent feedback [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In addition to this, as graders become
more familiar with a question paper, their grading speed should increase as they proceed, potentially
leading to a reduction in the depth of feedback on individual scripts as the grader ‘gets into the swing’
of grading. This increase in speed or cumulative grading time could cause a grader to become careless
or fatigued leading to variations in their decision making [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. From an individual student perspective,
this traditional grading set up leads to an appearance of inconsistent care and attention when comparing
their feedback.
      </p>
      <p>
        Aime sits alongside the GTAs and initially learns from the human grading they undertake. GTAs
grade work and Aime rapidly predicts feedback they are likely to apply to the next submission with a
similar algebraic structure. Feed- back is then suggested to the grader (with an associated confidence
level) which is then reviewed and, ideally, confirmed. If the grader rejects the suggested feed- back, the
AI model adapts to provide refined feedback on the next approach. Work by Benton has shown that the
accuracy of grades produced by a collective team of graders is better than a single senior grader [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ];
Aime facilitates a collective approach to grading work by adapting to the entire scope of the feedback
applied by the team of graders.
      </p>
      <p>For large student cohorts, the team of GTAs will be of significant size itself, often with a designated
lead GTA, who has prior experience of working on the module and a deeper familiarity with the
material. In this scenario (in a non-AI setting), the lead GTA (GTA 3 in Fig. 2) will be the one who
liaises with the module lead. They would then instruct or supervise the other GTAs, potentially by
physically co-locating to grade.</p>
      <p>The process can be streamlined with Aime in the team. GTA 3 grades scripts first, such that Aime
only learns from GTA 3. Once GTA 3 is satisfied that the feedback coverage is sufficient for the
standard solution paths students offer, the other GTAs can commence grading with high quality
feedback offered to them to select for the students. In effect, via Aime, the GTAs are being coached ‘on
the job’ to become as good as the senior GTA.</p>
      <p>Stage 2 of the process is where the bulk of the scripts are graded. There is a choice to be made in
terms of workload distribution between the GTAs. One can either allocate groups of students to each
GTA or give all the answers to a particular question to a GTA. The second option provides greater
consistency at a question level, but the whole script approach enables overall synoptic feedback to be
additionally offered.</p>
      <p>With Aime as part of the team, the same feedback is offered to each GTA (Fig. 3) as they review
similar answers from different students. Similarly, if a grading error is uncovered by a GTA, this is
comments are then easily added by GTA 3, where these overall comments themselves are ensured to
be consistent via Aime’s support.</p>
      <p>
        Stage 3 is a moderation process. In a traditional non-AI set up with finite time, and a large cohort of
students, this will only be a statistical spot test of scripts, with a follow up of remarking if significant
discrepancies are uncovered. Particularly for in-semester class tests or similar, where rapid feedback to
students is also a driver, there is a need for feedback for student learning, rather than just a statement of
attainment to date. Further discrepancies (perceived or real) will also be picked up by students
comparing their feedback with that of their peers. This latter issue is a particular concern in the UK
where the National Student Survey (NSS) has an entire section set on assessment and feedback [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Students are asked a series of questions where they reflect on the quality of the provision they receive
from their HE provider. Findings are taken seriously by institutions and can impact league table
positions and prospective student choices [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. With the ideal Aime in the team, the module lead can
interrogate all scripts and Aime’s narrative feedback as a summarised visual report including average
and individual performance. The module lead can then apply moderation (for instance: reallocating
marks between sections or changing the sentiment of a narrative comment) to all scripts simultaneously.
      </p>
      <p>We have observed, as both GTAs and instructors, that stage 4 is frequently only given cursory
attention. It is the closing of the feedback loop between the module lead and the GTA, where
improvements to questions for later years, and discussing aspects that students have consistently found
difficult, should happen. This is a human-only discussion, but informed by the visual summary that
Aime provides.</p>
      <p>To note, so far we have coached Aime to become a member of the grading team. We have carefully
defined Aime’s role at the different stages of the grading process and who they are learning from,
recommending information to, and reporting to. It should be emphasised that none of this section is
dependent on either a particular platform, or the discipline.</p>
      <p>In summary, when incorporating Aime into an existing workflow, we should carefully consider the
requirements of each stakeholder to ensure they are all met. Aime needs to integrate into this network
of stakeholders and improve the grading experience for everyone to justify implementing such a system.
In the theoretical framework that we have outlined above, this is visibly true for all except the
administrators. This group is dependent on having well-designed Application Programming Interfaces
(APIs) of both the Learning Management System (LMS) at their institution, and Aime. Although
important, it is a technical issue and out with the discussion of this paper.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Incorporating Aime into the team: a proposed case study</title>
      <p>So far we have described the team interactions in an assessment and feedback ecosystem that include
an Aime. Aime is, ideally, agnostic to the type of assessment and completely capable for all tasks
required.</p>
      <p>
        The case study focuses on mathematical subjects, the predominant area in which we have taught and
graded as subject-matter experts. At scale, this often comprises aspects of pre-calculus and calculus,
where students are required to produce a moderate amount of working to solve a problem. This is
commonly assessed via problem sheets and closed book examinations which are typically hand-graded
by a team of GTAs or the module lead. Previous attempts to in- crease the rapidity of marking have
diminished the personalisation, and depth of feedback. Hahn et al. report that automatic feedback
systems risk reducing human interactions and could reduce the amount of personalised feedback given
to students who need it [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Instead, using a grading team that includes Aime has enabled us to change
the workflow to facilitate grading traditional assessments more efficiently, while providing additional
training opportunities to GTAs.
Students need: Fast, helpful, and friendly feedback; opportunities to submit
formative work in advance; authentic assessments to prepare them for the
future.
      </p>
      <p>Graders need: Clear grading schemes and the ability to grade anywhere; peer
communities of practice to develop professionally; support, engagement, and
praise from module leaders.</p>
      <p>Module leads need: to facilitate timely and consistent grading; areas of student
strengths and weaknesses to be highlighted; happy students, even when the
work is challenging.</p>
      <p>Administrators need: Easy integration into an existing LMS; simple
moderation tools to view and access submissions and grading; accessible grade books
for data processing.</p>
      <p>In the case study proposed, Aime is Graide, which itself has been designed to be a team player who
always follows the rules and guidance and defers to humans for final decisions. To date, the focus has
been on ensuring that Graide is optimised for the key stakeholder interactions, as shown in Fig. 6.</p>
      <p>
        To enable interaction with the submitted scripts, they need to be digitised, but can be initially
handwritten. It could be tempting to require students to learn and become proficient in digital equation
input, although this would add significant time and cognitive load to the assessment, inadvertently
changing the learning outcomes away from solving problems and towards computer input and required
syntax [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. By requiring students (or administrators, depending on the assessment) to upload scans of
work, solutions can be digitised and interpreted by Graide. This means that educators can continue to
develop their assessments using well-understood and accepted methods (only scanning is added to the
workflow) without having to consider additional complications of a new system. Hence Graide does
not require alterations in traditional workflows in order to be incorporated as a team member and we
are able to implement stages 1-4 of the team marking that were described in the previous section for
real assessments.
      </p>
      <p>To contextualise this, we will consider a scenario with a typical question set in a first year
undergraduate mathematics module for physics students. Considering a single problem (where the
details of the mathematics are not particularly important):</p>
      <p>Differentiate
1
√</p>
      <p>This is considered a straightforward problem at this level and many students should be able to
correctly answer it in one or two lines of working. Assuming we have 150 students and 3 GTAs, stage
1A would begin with the module lead informing the GTAs of the problems to be marked and associated
deadlines for returning feedback to students. They would share a set of model solutions so the GTAs
can inspect the standard method and calibrate how they award marks. In this case, this would be the
first time the GTAs have worked with Aime, so the module lead gives a further briefing on what to
expect.
(1)</p>
      <p>Assuming all the GTAs have a similar level of experience, one of them (say, GTA 1) will likely start
stage 2 of the grading before the others and begin training Aime on their allocation of 50 pieces of work
(Fig. 7a). Aime quickly starts suggesting feedback to GTA 1 and the other GTAs when they start grading
(Fig. 7b). GTA 1 notices that a few students use the notation incorrectly and instructs Aime to provide
some constructive (but not punitive) feedback for students (Fig. 7c). When GTA 2 starts grading, Aime
automatically suggests this feedback to them, modelling good feedback practice and ensuring that no
penalty is consistently applied to the overall mark for that particular error (Fig. 7d).</p>
      <p>Once the GTAs have finished marking, the module lead begins stage 3 and inspects the grade
distribution and the scheme of feedback for a general overview. They notice that a GTA has written
some unsupportive feedback so decides to modify the tone of it before releasing it to students (Fig. 8).</p>
      <p>(a) (b)
Figure 8. (a) The module lead notices a GTA has produced an unsupportive piece of feedback. (b) The
module lead updates the tone of the feedback to be more supportive and Aime updates all the work
it was applied to.</p>
      <p>The module lead, satisfied with the quality of the feedback, meets with the GTAs for stage 4. They
praise them for their prompt return of marks and discuss the modified feedback and why a supportive
tone is more helpful to encourage students to learn from their mistakes. The GTAs take this on board
for the next assessment cycle and the module lead is reassured when Aime shows more supportive
feedback the next time.</p>
      <p>All of these experiences at the different stages have been implemented and the planned case study
will observe the full set of stages and hand-over points in detail.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Conclusions</title>
      <p>In this paper, we have considered the AI marking engine, Aime, as a team member, rather than a
platform that one interacts with. Through this view of the whole team, we have been able to ensure that
Aime fits into the team hierarchy and that the human members of the team always retain the decision
making authority. Detailing the key aspects of the ecosystem in this way will be helpful to institutional
providers who need reassurance of the guide rails that have been put in place to ensure that humans
have the ultimate control of the moderation processes.</p>
      <p>By drawing on examples from our use and development of Graide, we have demonstrated that this
viewpoint of the grading team with Aime as a member, is relevant to a commercially available system.
Scenarios beyond mathematics require different technical specifications for Aime, for instance narrative
or figures. Furthermore, recent extensions to Graide, in beta, enable short answer text questions, but
there is still much to do in implementing this paradigm with Aime as a team member for assessments
across the breadth of disciplines that are actively assessed via digital or paper submissions.</p>
    </sec>
    <sec id="sec-5">
      <title>4. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Benton</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Which is better: one experienced marker or many inexperienced markers?</article-title>
          <source>Research Matters: A Cambridge Assessment publication 28</source>
          ,
          <fpage>2</fpage>
          -
          <lpage>10</lpage>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Chan</surname>
            ,
            <given-names>C. K. Y.</given-names>
          </string-name>
          :
          <article-title>A systematic review-handwritten examinations are becoming outdated, is it time to change to typed examinations in our assessment policy?</article-title>
          .
          <source>Assessment &amp; Evaluation in Higher Education</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Popenici</surname>
            ,
            <given-names>S. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kerr</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Exploring the impact of artificial intelligence on teaching and learning in higher education</article-title>
          .
          <source>Research and Practice in Technology Enhanced Learning</source>
          ,
          <volume>12</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Hahn</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Navarro</surname>
            ,
            <given-names>S.M.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valentin</surname>
            ,
            <given-names>L.D.L.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burgos</surname>
            ,
            <given-names>D.:</given-names>
          </string-name>
          <article-title>A sys- tematic review of the effects of automatic scoring and automatic feed- back in educational settings</article-title>
          .
          <source>IEEE Access 9</source>
          ,
          <fpage>108190</fpage>
          -
          <lpage>108198</lpage>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haenlein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <string-name>
            <surname>Siri</surname>
          </string-name>
          , Siri, in my hand:
          <article-title>Who's the fairest in the land? On the interpretations, illustrations, and implications of artificial intelligence</article-title>
          .
          <source>Business horizons 62(1)</source>
          ,
          <fpage>15</fpage>
          -
          <lpage>25</lpage>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Omrani</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rivieccio</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fiore</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schiavone</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agreda</surname>
            ,
            <given-names>S. G.</given-names>
          </string-name>
          :
          <article-title>To trust or not to trust? An assessment of trust in AI-based systems: Concerns, ethics and contexts</article-title>
          .
          <source>Technological Forecasting and Social Change</source>
          ,
          <volume>181</volume>
          ,
          <issue>121763</issue>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Popenici</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kerr</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Exploring the impact of artificial intelligence on teaching and learning in higher education</article-title>
          .
          <source>Research and Practice in Technology Enhanced Learning</source>
          <volume>12</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>J. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Slater</surname>
            ,
            <given-names>J. B.</given-names>
          </string-name>
          , Wilson,
          <string-name>
            <surname>J.:</surname>
          </string-name>
          <article-title>The National Student Survey: development, findings and implications</article-title>
          .
          <source>Studies in Higher Education</source>
          ,
          <volume>32</volume>
          (
          <issue>5</issue>
          ),
          <fpage>557</fpage>
          -
          <lpage>580</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Sadler</surname>
            ,
            <given-names>D. R.:</given-names>
          </string-name>
          <article-title>Indeterminacy in the use of preset criteria for assessment and grading</article-title>
          .
          <source>Assessment &amp; Evaluation in Higher Education</source>
          ,
          <volume>34</volume>
          (
          <issue>2</issue>
          ),
          <fpage>159</fpage>
          -
          <lpage>179</lpage>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Stanyon</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tomlinson</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kainth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilkin</surname>
            ,
            <given-names>N. K.</given-names>
          </string-name>
          :
          <article-title>Providing individual student feedback at scale for mathematical disciplines</article-title>
          .
          <source>In Proceedings of the Ninth ACM Conference on Learning@ Scale</source>
          , pp.
          <fpage>400</fpage>
          -
          <lpage>404</lpage>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kane</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sagu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Exploring the national student survey: Assessment and feedback issues</article-title>
          .
          <source>The Higher Education Academy</source>
          , Centre for Research into Quality (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harfitt</surname>
          </string-name>
          , G.:
          <article-title>Is assessment for learning feasible in large classes? Challenges and coping strategies from three case studies</article-title>
          .
          <source>Asia-Pacific Journal of Teacher Education</source>
          ,
          <volume>47</volume>
          (
          <issue>5</issue>
          ),
          <fpage>472</fpage>
          -
          <lpage>486</lpage>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>