=Paper=
{{Paper
|id=Vol-2327/UIBK2
|storemode=property
|title=An apprenticeship model for human and AI collaborative essay grading
|pdfUrl=https://ceur-ws.org/Vol-2327/IUI19WS-UIBK-2.pdf
|volume=Vol-2327
|authors=Alok Baikadi,Lee Becker,Jill Budden,Peter Foltz,Andrew Gorman,Scott Hellman,William Murray,Mark Rosenstein
|dblpUrl=https://dblp.org/rec/conf/iui/BaikadiBBFGHMR19
}}
==An apprenticeship model for human and AI collaborative essay grading==
Alok Baikadi, Lee Becker, Jill Budden, Peter Foltz, Andrew Gorman, Scott Hellman, William Murray, Mark Rosenstein
Pearson
alok.baikadi@pearson.com, lee.becker@pearson.com, jill.budden@pearson.com, peter.foltz@pearson.com, andrew.gorman@pearson.com, scott.hellman@pearson.com, william.murray@pearson.com, mark.rosenstein@pearson.com
CCS CONCEPTS
• Human-centered computing → Human computer interaction (HCI).

KEYWORDS
Human-AI collaboration; Intelligent user interfaces; Machine learning; Active learning; Automated essay grading

ACM Reference Format:
Alok Baikadi, Lee Becker, Jill Budden, Peter Foltz, Andrew Gorman, Scott Hellman, William Murray, and Mark Rosenstein. 2019. An apprenticeship model for human and AI collaborative essay grading. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 6 pages.

IUI Workshops’19, March 20, 2019, Los Angeles, USA
© 2019 Copyright for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1 INTRODUCTION
Across a range of domains, humans and complex algorithms embedded in systems are collaborating and coordinating action. Tasks are shared because the combined system can be more capable than either agent acting alone. Such systems with shared autonomy raise important research questions in how to design these joint systems to best utilize each actor's capabilities for optimal performance while addressing important safety, legal and ethical issues.

Our work investigates students developing their writing ability throughout their educational careers, since writing proficiency is a critical life and career competency. Writing is a skill that develops through practice and feedback. However, the massive effort required of instructors in providing feedback on drafts and scoring final versions is a limiting factor in assigning essays and short answer problems in their classes.

Over the last 20 years, automated scoring of student writing through the use of natural language processing, machine learning and artificial intelligence techniques, coupled with human-scored training sets, has in many applications achieved performance comparable to that of humans [19]. In high stakes testing, millions of students have had their writing automatically scored, with the prompts, rubrics and human scoring for training and validating model performance implemented in tightly controlled frameworks. Our work addresses the question of how to move this technology to a wider audience and add formative assessment to have a broader impact in helping students learn to write.

Our goal is to lower the barriers that limit an instructor's ability to assign essays in their courses. Our approach is to develop a system in which instructors develop prompts appropriate for their course and, by scoring a subset of their student responses, are able to turn on an automated system to score the rest of the responses. These prompts and the ability to automatically score them become a resource that instructors can reuse and even share. The critical issue is that while the instructor is an expert in their domain, they are likely not an expert in either assessment or the machine learning techniques that make automated scoring possible.

We have piloted a prototype system in a large introductory psychology course at a large university. This pilot explored the issues of 1) transferring scoring expertise from an instructor to an automated system and 2) using an automated system to provide feedback to the instructor about the quality of the current state of its scoring. In most end user applications of AI, the user is only exposed to decisions or behavior from some unseen, unknown model, and training of the machine learning mechanism is either hidden or taken for granted by the user. Exposing machine learning flows to users unversed in the notions of model performance and evaluation raises interesting design questions around user trust, system transparency, and managing expectations.
We discuss the approach used for the pilot and some of the issues that emerged. We then show how adopting a metaphor of apprenticeship clarifies the communication between the instructor and the AI assistant. In this paper we discuss the issues of shared autonomy that arise in such a system and the issues we have seen in defining the task, in making results mutually interpretable to the instructor and the machine, and in designing user interfaces that make more transparent all the various factors that are required for making correct decisions on how the task should proceed.

2 RELIABLE SCORING
Current high stakes scoring implementations attempt to achieve reliability through various means: the use of rubrics to specify the results that must be evidenced at each score point [9, 20]; anchor papers, which are actual example essays selected to indicate typical and boundary score point answers [15]; and supervised scorer training, which often includes practice scoring exercises and required levels of performance on example essays. Yet despite these rigorous preparations, raters still disagree. Psychometricians have developed methods to detect subtle biases in raters, referred to as rater effects [23]. These techniques allow performance monitoring during scoring and detection after scoring.

In complex tasks, such as writing an essay, even with well-trained scorers without rater effects there will be an expected level of disagreement over the score of individual essays. A number of studies have found that with well-designed prompts and rubrics and well-trained scorers, the expected range of adjacent agreement, i.e., scores within 1 point, is 80-99%, with correlations in the 0.7 to 0.8 range (Brown et al. [2] provide a summary of research and standards in this area).

Human scoring is time consuming, expensive and limits the immediacy of feedback that can be provided to the student. Page described the first system to automatically score essays based on analysis of a fairly limited set of features from the essay [16]. Present day automated systems score millions of student essays in both high stakes and formative engagements with performance levels at or above human scorers (Shermis and Burstein have co-edited comprehensive summaries on the subject [18, 19]). These automated scoring systems are typically based on supervised machine learning, where the system is trained on a set of student essays and human scores. The system derives a set of features for each essay and learns to infer the human score from the features. A sample of essays, typically on the order of 300 to 500, is scored by human scorers, and then used to train the automated system. Performance of the automated system is compared to the performance of two human raters and, if found acceptable, the automated system then scores the remaining essays.

In the six years since the most current survey of the automated scoring field [19], developments in machine learning have impacted the modeling choices in both research and commercial systems. These techniques include hierarchical classification [14], correlated linear regression [17], various neural net deep learning approaches [7, 21] and many others. In addition, some commercial systems have described their modeling subsystems, e.g. [4].

As we move away from high stakes scoring with precisely trained models to formative scoring with instructor-trained models (and the use of automated scoring in the classroom), the burden of generating reliable scores to train the automated system now falls on the instructor. For the automated system to reliably score, the instructor must score a sufficiently large number of essays to capture the variability of student responses, and do so in a sufficiently reliable fashion to allow the regularities of scoring to be learnable by machine learning techniques. In the system we have developed, where the system learns from an instructor's scoring behavior, the instructor need only score enough essays to build a performant model. The hurdle is that, as the instructor scores, the system needs to provide feedback on how well the current model is performing. In an intelligible manner, the system must update the instructor on its progress. The AI system must continually provide information to the instructor to allow the instructor to make an informed decision about the quality of the automated scoring and determine when it is justifiable to turn scoring over to the automated system.

3 SYSTEM DESCRIPTION
We have developed a prototype system which allows instructors to assign writing to their students and participate in AI-assisted grading workflows. The system allows an instructor to create an account, invite students to join a course, and assign writing within a course. The prototype reported on here is an intermediate step toward enabling instructors to write and have their own prompts automatically scored. This step allows us to test the user interface and methods for sharing the task between the instructor and the system. This system learns to modify the scoring of existing, already-modeled prompts to more closely represent an instructor's scoring. In this current iteration, instructors select from a list of available writing prompts, each of which contains a short description, a rubric against which to grade student submissions, and a currently existing automated grading model. Once the prompt has been assigned, students are able to draft and submit their responses.

The collection of student responses goes through an active learning preprocessing step to calculate a recommended ordering for the instructor to grade essays. Active learning is typically employed to reduce human annotation effort, and in our system we use it to minimize the number of human-graded submissions needed for reliable modeling.
Figure 1: The instructor’s grading interface used in our pilot study. Upper Left: AI readiness notification message. Lower Left: Essay prompt details link and student’s essay response. Right: Total score, AI training progress bar, and rubric scoring controls with score of 4 for trait Organization.

Within the instructor interface, this ordering is simply the list of submissions to grade sorted by the active learning order. In our current implementation, we use the Kennard-Stone algorithm [10]. Kennard-Stone attempts to select submissions in a manner that uniformly covers the feature space by iteratively selecting the submission with the maximum minimum distance to all previously selected submissions. We use baseline automated scores as our feature space so that the human grader will see approximately the same number of submissions at each score point, despite very low and very high scoring submissions being rare. In other natural language active learning tasks, biasing the active learner in favor of low-frequency classes has been found to work well [8, 22], and Kennard-Stone has been found to perform well for automated grading in particular [6].

As the instructor scores submissions, the system begins the modeling phase. In the modeling phase, the machine learning system is trained to mimic the instructor’s grades, and its performance is evaluated. Once the system determines the evaluation is acceptable, the instructors are signaled that the training is complete, and they are able to view the automated grades and make adjustments as needed. If the instructor corrects a grade, the system refines the model using the newly graded submissions.

As a first step to understand how instructors interact with AI systems, we decided not to allow instructors direct access to a highly complex, large parameter space machine learning model. Instead, instructors assigned prompts for which there already existed a fully trained machine learning model. Our implementation uses logistic regression to learn a model that modifies the pre-trained automated scores to better match the instructor’s scoring. The system learned the two parameters of a logistic regression model to estimate the instructor’s scores based on the instructor’s scores on responses and the scores from the existing model. The logistic regression functions as a transformation over the pre-trained model scores, adjusting the distance between score points to more closely match the instructor’s scoring behavior. By learning a transformation over the pre-trained model, we are able to leverage the accuracy of the existing model while allowing instructors to adjust the scoring to suit their classroom needs.
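One way to picture this kind of two-parameter adjustment: fit a scale and a shift inside a logistic squashing of the pre-trained score, so score points can be pulled closer together or pushed apart. Everything below is an illustrative reconstruction under our own assumptions (the 0-6 score range, the squared-error objective, and the plain gradient-descent fit are ours, not the production implementation).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def adjusted(s, w, b, lo=0.0, hi=6.0):
    """Map a pre-trained model score s onto the instructor's scale."""
    return lo + (hi - lo) * sigmoid(w * s + b)

def fit_adjustment(pre_scores, instructor_scores, lo=0.0, hi=6.0,
                   lr=0.01, steps=5000):
    """Fit the two parameters (w, b) of the adjustment by gradient
    descent on squared error between adjusted pre-trained scores
    and the instructor's scores."""
    w, b = 1.0, 0.0
    n = len(pre_scores)
    for _ in range(steps):
        gw = gb = 0.0
        for s, y in zip(pre_scores, instructor_scores):
            p = sigmoid(w * s + b)
            err = adjusted(s, w, b, lo, hi) - y
            dz = err * (hi - lo) * p * (1.0 - p)  # d(err^2 / 2) / dz
            gw += dz * s
            gb += dz
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b
```

With only two parameters, a handful of instructor-scored responses is enough to fit the transformation; the pre-trained model still does the heavy lifting of reading the essay.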
4 PILOT STUDY
We conducted a pilot with nine instructors and teaching assistants for an Introductory Psychology course at a large university. The participants completed an initial training session where they were presented with an overview of the interface and were encouraged to practice on a small set of student writing before the start of the pilot. Over the course of the semester, the participants were asked to use our system to grade 100 student submissions for each of nine writing prompts. Student submissions were sampled from the participating instructors’ courses and were anonymized. Prompts were assigned in sets of three, and then made available to the participants to score. Participants received prompts to grade in three phases, each consisting of three prompts. Upon logging in, the participants were able to begin grading by selecting any of the three available prompts. The prompt was presented on a summary screen (Figure 1). The summary screen also presents the participants with the suggested order in which to grade the submissions (generated by active learning).

In addition to this ordering, the submissions were further divided into two sets: one that must be graded by the human rater, and one that could have AI feedback. As participants graded the first set, a progress bar indicated how close the model was to being trained. They could refer to the summary screen to evaluate their progress at any time. Once the predetermined threshold was reached, the participants received a “Hooray” message, as in Figure 1, and they were then able to review the output of the automated scoring model, adjusted by the logistic regression. They would then review several autoscored responses, and adjust the scores as needed. The regression model would be recalculated after each further adjustment. Once the participants were satisfied with the performance of the model, they were able to finalize the scoring and fix the grades for the rest of the submissions to that prompt.

5 APPRENTICESHIP MODEL OF TRAINING
The system used in the pilot employed a progress bar with an alert message to communicate the current level of scoring performance to the participants. The interface and interaction implicitly encouraged instructors to regard the system as a tool and to think of system state as bimodal: untrained or trained. The disadvantage of this approach is that it encouraged instructors to infer that after the transition from untrained to trained the tool’s performance matched their own, but technically it meant only that a fuzzy threshold had been passed and that further monitoring and feedback of automated scoring were still required.

This mismatch likely caused participants to be less vigilant, while survey results indicated participants felt disappointed when they had to correct the nascent automatic scoring. The message “Hooray! The scoring assistant is now calibrated . . . ” and the green color of the progress bar implicitly set incorrect expectations and discouraged participants from carrying out further review and revision of scores, other than minimal tests to satisfy themselves that the model was performant. Additionally, during pre-pilot instruction, we suggested that the participants review approximately twenty submissions after the autoscoring model was enabled. Participants rarely strayed from these guidelines, reviewing approximately twenty submissions on average. In post-surveys, participant responses indicated that they did not have a strong sense of when to stop reviewing. Many would grade until the automated scores for a single essay matched their expectations. Analyses of behavioral data, survey results and user feedback motivated us to reevaluate our user experience design to better scaffold the user through the process of training and to better communicate the expected quality of the automated scoring model.

Figure 2: Step 1a - Model training (apprenticeship modeling): UI indicates instructor must explicitly train the AI grading assistant. This is the initial messaging.

Figure 3: Step 1b - Model training (apprenticeship modeling): UI indicates current progress in the training phase.

While apprenticeship has been a model of human skill building for millennia, Lave was among the first to study and describe it as a formal mode of learning [12]. Collins et al. further generalized Lave’s observation into what we refer to as an apprenticeship model of training [5]. This pedagogy-oriented paradigm consists of multiple phases, of which the first three are relevant for our application: modeling, coaching, and fading. In modeling the apprentice (learner) “repeatedly observes the expert performing the target process”. During coaching, the apprentice “attempts to execute the process while the expert provides guidance and scaffolds feedback and instruction”. Lastly, in fading the expert provides less feedback and eventually ascertains the apprentice’s mastery. This paradigm has been employed for computer supported collaborative learning (CSCL) and intelligent tutoring systems (ITS), where the system regards the user as an apprentice to help them develop new skills [3, 11, 13].
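In the pilot, "trained" was a single threshold crossing. One concrete shape such a readiness check could take combines a minimum number of instructor-scored essays with exact- and adjacent-agreement rates, echoing the agreement ranges cited in Section 2. This sketch is entirely hypothetical: the function name and thresholds are ours, not the system's.

```python
def readiness(instructor_scores, model_scores,
              min_scored=20, min_exact=0.5, min_adjacent=0.8):
    """Return (ready, stats): ready is True once enough essays are
    scored and the model agrees with the instructor often enough.

    Exact agreement: identical scores.
    Adjacent agreement: scores within 1 point, the tolerance
    commonly used in human inter-rater studies.
    """
    n = len(instructor_scores)
    if n == 0:
        return False, {"n": 0, "exact": 0.0, "adjacent": 0.0}
    exact = sum(1 for h, m in zip(instructor_scores, model_scores)
                if h == m) / n
    adjacent = sum(1 for h, m in zip(instructor_scores, model_scores)
                   if abs(h - m) <= 1) / n
    ready = (n >= min_scored and exact >= min_exact
             and adjacent >= min_adjacent)
    return ready, {"n": n, "exact": exact, "adjacent": adjacent}
```

Reporting the stats alongside the boolean is what lets the interface show agreement counts and percentages rather than a bare "trained" flag.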
The apprenticeship model provides a useful metaphor for aligning our system’s three stages of data gathering and application with an accessible, real-world process. Our system swaps the human-computer relationship typical of ITS and instead considers the user the expert and the AI-assistant persona the apprentice. By adopting this framework of model training as task modeling, model tuning and validation as coaching, and model acceptance as fading, we help the user to better understand the expected interactions and responsibilities.

Our coupling of apprenticeship with machine learning is distinct from the use of apprenticeship in reinforcement learning [1], which does not have an interactive human element.

6 REDESIGNED USER INTERFACE AND FLOW
Based on the feedback from our initial pilot, in our new three-phase apprenticeship approach we encourage instructors to view the process of training the automated scoring system as an apprenticeship. In this view, they can reasonably expect the AI-assistant to continue to learn even after it starts grading. The instructor now expects mistakes to continue, but in diminishing number and severity over time. With this approach, minor mistakes are less likely to damage trust in the system.

In the redesigned UI, at the top of each screen a large circle indicates the current location in the process, with messages updating to keep the instructor informed of progress within a given phase. At the beginning of the apprentice modeling phase (Figure 2), instructors are encouraged to score essays to help train the AI grading assistant. As they score more essays they see progress updates, as shown in Figure 3. When the AI grading assistant becomes able to begin scoring, instructors move into the coaching phase (Figure 4).

Figure 4: Step 2a - Model tuning/validation (apprenticeship coaching): Messaging informs instructor that the AI grading assistant is ready to begin scoring essays.

In this phase the instructor scores and then compares their score to the AI-assistant’s (Figure 5). This phase ends when the system gains sufficient confidence in the model’s performance (e.g., correlations of 0.7 to 0.8, values similar to human inter-rater reliability). This is reflected in the instructor’s view by showing the instructor’s corrections diminish (Figure 6). Once in the fading phase (Figure 7), the instructor passes scoring to the AI-assistant, but still retains the ability to review the AI-assistant’s scores. Messages reinforce the relationship between additional scoring and performance, making the apprentice relationship of the assistant (e.g., “. . . the more you grade, the more the assistant learns from you”) more transparent. The level of the assistant’s learning is indicated by the number and percentage of agreements compared to disagreements with the instructor’s grades, and by the progress bar. The progress indicated by the bar follows the number of essays scored, but can accelerate as model performance improves.

Figure 5: Step 2b - Model tuning/validation (apprenticeship coaching): Evaluation metrics inform the instructor of the AI grading assistant’s performance and encourage the instructor to continue grading.

Figure 6: Step 2c - Model tuning/validation (apprenticeship coaching): Messaging informs instructor that the AI grading assistant might be able to take over scoring essays.

7 CONCLUSIONS AND FUTURE WORK
As more people interact with systems based on sophisticated, often opaque algorithms, it becomes ever more critical to develop common languages and appropriate metaphors to allow communication and common understanding. Often, as these systems move into increasingly common use, a more refined understanding of how the machine learning component is trained and what its limitations are becomes lost. In our first pilot we adopted a quite reasonable view of training an automated scoring system as a tool. Our first set of instructors internalized this model with unexpected consequences, both for their performance on the task and their satisfaction with completing the task. In moving to the apprentice model, we believe we have found a metaphor that ameliorates some of these issues.

Our next steps include conducting pilots with this new metaphor and a UI/UX that supports it. We have begun to think more broadly about the complex relationships between clever systems and equally clever people, both of which have large blind spots. The instructors know the domain area but may have less experience in the type of reliable scoring required to train an automated scoring model.
The AI system embodies knowledge about scoring that can be used to scaffold the instructor’s scoring, but at the same time is an apprentice to how the instructor wants the prompt evaluated. How to share the task and how the two agents should communicate are interesting, open questions. These questions will become even more relevant as we begin testing the complete system, which will include instructors authoring prompts and will replace logistic regression with a full modeling pipeline.

Figure 7: Step 3 - Model acceptance (apprenticeship fading): Instructor has handed over control to the AI grading assistant and is encouraged to review scores on remaining essays.

REFERENCES
[1] P. Abbeel and A. Ng. 2004. Apprenticeship Learning via Inverse Reinforcement Learning. In Proceedings of the 21st International Conference on Machine Learning.
[2] Gavin T. L. Brown, Kath Glasswell, and Don Harland. [n. d.]. Accuracy in the scoring of writing: Studies of reliability and validity using a New Zealand writing assessment system. Assessing Writing 9, 2 ([n. d.]), 105–121.
[3] John Seely Brown, R. R. Burton, and A. G. Bell. 1975. SOPHIE: A step toward creating a reactive learning environment. International Journal of Man-Machine Studies 7, 5 (Sept. 1975), 675–696.
[4] Jing Chen, James H. Fife, Isaac I. Bejar, and André A. Rupp. 2016. Building e-rater® Scoring Models Using Machine Learning Methods. ETS Research Report Series 2016, 1 (2016), 1–12.
[5] A. Collins, J. S. Brown, and S. E. Newman. 1987. Cognitive apprenticeship: Teaching the craft of reading, writing and mathematics. Technical Report 403. Centre for the Study of Reading, University of Illinois; BBN Laboratories, Cambridge, MA.
[6] Nicholas Dronen, Peter W. Foltz, and Kyle Habermehl. 2015. Effective Sampling for Large-scale Automated Writing Evaluation Systems. In Proceedings of the Second (2015) ACM Conference on Learning @ Scale, 3–10. https://doi.org/10.1145/2724660.2724661
[7] Youmna Farag, Helen Yannakoudakis, and Ted Briscoe. 2018. Neural automated essay scoring and coherence modeling for adversarially crafted input. arXiv preprint arXiv:1804.06898 (2018).
[8] Andrea Horbach and Alexis Palmer. 2016. Investigating Active Learning for Short-Answer Scoring. In BEA@NAACL-HLT, 301–311.
[9] Anders Jonsson and Gunilla Svingby. 2007. The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review 2, 2 (2007), 130–144.
[10] Ronald W. Kennard and Larry A. Stone. 1969. Computer Aided Design of Experiments. Technometrics 11, 1 (1969), 137–148.
[11] S. P. Lajoie and A. M. Lesgold. 1992. Dynamic assessment of proficiency for solving procedural knowledge tasks. Educational Psychologist 27 (1992), 365–384.
[12] J. Lave. [n. d.]. Tailored learning: Education and everyday practice among craftsmen in West Africa. Technical Report.
[13] A. Lesgold, G. Eggan, and G. Rao. 1992. Possibilities for assessment using computer-based apprenticeship environments. In W. Regian and V. Shute (Eds.), Cognitive Approaches to Automated Instruction (1992), 49–80.
[14] Danielle S. McNamara, Scott A. Crossley, Rod D. Roscoe, Laura K. Allen, and Jianmin Dai. 2015. A hierarchical classification approach to automated essay scoring. Assessing Writing 23 (2015), 35–59.
[15] Miles Myers. 1980. A procedure for writing assessment and holistic scoring. National Council of Teachers of English, Urbana, IL.
[16] Ellis B. Page. 1967. Statistical and linguistic strategies in the computer grading of essays. In Coling 1967: Conférence Internationale sur le Traitement Automatique des Langues, Grenoble, France (1967).
[17] Peter Phandi, Kian Ming A. Chai, and Hwee Tou Ng. 2015. Flexible domain adaptation for automated essay scoring using correlated linear regression. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
[18] Mark D. Shermis and Jill C. Burstein (Eds.). 2003. Automated essay scoring: A cross-disciplinary perspective. Lawrence Erlbaum Associates, Inc., Mahwah, NJ.
[19] Mark D. Shermis and Jill C. Burstein (Eds.). 2013. Handbook of automated essay evaluation: Current applications and new directions. Routledge, New York.
[20] Dannelle D. Stevens and Antonia J. Levi. 2013. Introduction to Rubrics: An Assessment Tool to Save Grading Time, Convey Effective Feedback, and Promote Student Learning. Stylus Publishing, LLC.
[21] Kaveh Taghipour and Hwee Tou Ng. 2016. A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
[22] Katrin Tomanek and Udo Hahn. 2009. Reducing Class Imbalance During Active Learning for Named Entity Annotation. In Proceedings of the Fifth International Conference on Knowledge Capture (K-CAP ’09). ACM, New York, NY, USA, 105–112. https://doi.org/10.1145/1597735.1597754
[23] Edward W. Wolfe. 2004. Identifying rater effects using latent trait models. Psychology Science 46 (2004), 35–51.